HadoopPlugin (Terrier Information Retrieval Platform version 2.2.1 API Specification)

uk.ac.gla.terrier.utility.io
Class HadoopPlugin

java.lang.Object
  uk.ac.gla.terrier.utility.io.HadoopPlugin

All Implemented Interfaces: ApplicationSetup.TerrierApplicationPlugin

public class HadoopPlugin
extends java.lang.Object
implements ApplicationSetup.TerrierApplicationPlugin

This class provides the main glue between Terrier and Hadoop. It has two main roles:

Configure Terrier such that the Hadoop file systems can be accessed by Terrier.
Provide a means to access the Hadoop map-reduce cluster, using Hadoop on Demand (HOD) if necessary.

Configuring Terrier to access HDFS

Terrier can access a Hadoop Distributed File System (HDFS), allowing collections and indices to be placed there. To do so, ensure that your Hadoop conf/ is on your CLASSPATH, and that the Hadoop plugin is loaded by Terrier, by setting terrier.plugins=uk.ac.gla.terrier.utility.io.HadoopPlugin in your terrier.properties file.

Configuring Terrier to access an existing Hadoop MapReduce cluster

Terrier can access an existing MapReduce cluster, as long as the conf/ folder for Hadoop is on your CLASSPATH. If you do not already have an existing Hadoop cluster, Terrier can be configured to use HOD, to build a temporary Hadoop cluster from a PBS (Torque) cluster. To configure HOD itself, the reader is referred to the HOD documentation. To use HOD from Terrier, set the following properties:

plugin.hadoop.hod - path to the hod binary, normally $HADOOP_HOME/contrib/hod/bin. If unset, then HOD is presumed to be unconfigured.
plugin.hadoop.hod.nodes - the number of nodes/CPUs that you want to request from the PBS Torque cluster. Defaults to 6.
plugin.hadoop.hod.params - any additional options you want to set on the HOD command line. See the HOD User guide for examples.
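As an illustration of how these settings fit together, the following minimal sketch parses terrier.properties-style text and applies the documented default of 6 nodes. It uses java.util.Properties as a stand-in for Terrier's own property handling; the class HodSettings, its methods, and the example path /opt/hadoop/contrib/hod/bin are hypothetical and not part of Terrier. Treating a blank plugin.hadoop.hod value as unset is also an assumption.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of Terrier: reads the HOD-related
// properties described above from terrier.properties-style text.
class HodSettings {

    // Parse key=value lines using the standard Java properties format.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // HOD is presumed unconfigured when plugin.hadoop.hod is unset.
    // Assumption: a blank value is treated the same as unset.
    static boolean hodConfigured(Properties props) {
        String hod = props.getProperty("plugin.hadoop.hod");
        return hod != null && !hod.trim().isEmpty();
    }

    // plugin.hadoop.hod.nodes defaults to 6 when absent.
    static int nodes(Properties props) {
        return Integer.parseInt(
            props.getProperty("plugin.hadoop.hod.nodes", "6").trim());
    }

    public static void main(String[] args) {
        String text = String.join("\n",
            "terrier.plugins=uk.ac.gla.terrier.utility.io.HadoopPlugin",
            "plugin.hadoop.hod=/opt/hadoop/contrib/hod/bin", // example path only
            "plugin.hadoop.hod.nodes=12");
        Properties props = parse(text);
        System.out.println("HOD configured: " + hodConfigured(props));
        System.out.println("Nodes requested: " + nodes(props));
    }
}
```

The three lines passed to parse in main are what the corresponding entries in terrier.properties might look like for a site that requests 12 nodes.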

Using Hadoop MapReduce from Terrier

You should use the JobFactory provided by this class when creating a MapReduce job from Terrier. The JobFactory creates a HOD session should one be required, and also configures jobs such that the Terrier environment can be recreated on the execution nodes.

 HadoopPlugin.JobFactory jf = HadoopPlugin.getJobFactory("HOD-TerrierIndexing");
 if (jf == null)
     throw new Exception("Could not get JobFactory from HadoopPlugin");
 JobConf conf = jf.newJob();
 ....
 jf.close(); //closing the JobFactory will ensure that the HOD session ends

When using your own code in Terrier MapReduce jobs, ensure that you configure the Terrier application before anything else:

 public void configure(JobConf jc)
 {
     try{
         HadoopUtility.loadTerrierJob(jc);
     } catch (Exception e) {
         throw new Error("Cannot load ApplicationSetup", e);
     }
 }

Since: 2.2
Version: $Revision: 1.4 $
Author: Craig Macdonald

Nested Class Summary
`static class`	`HadoopPlugin.JobFactory` A JobFactory is responsible for creating Terrier MapReduce jobs.

Constructor Summary
`HadoopPlugin()`

Method Summary
`org.apache.hadoop.conf.Configuration`	`getConfiguration()`
`static org.apache.hadoop.conf.Configuration`	`getGlobalConfiguration()`
`static HadoopPlugin.JobFactory`	`getJobFactory(java.lang.String sessionName)` Get a JobFactory with the specified session name.
`void`	`initialise()`
`static void`	`setGlobalConfiguration(org.apache.hadoop.conf.Configuration _config)`

Methods inherited from class java.lang.Object
`equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Constructor Detail

HadoopPlugin

public HadoopPlugin()

Method Detail

getJobFactory

public static HadoopPlugin.JobFactory getJobFactory(java.lang.String sessionName)

Get a JobFactory with the specified session name. This method tries three options, in order:

If the current/default Hadoop configuration has a real Hadoop cluster Job Tracker configured, then that will be used. This requires that the mapred.job.tracker property in hadoop-site.xml be configured.
Next, it will attempt to use HOD to build a Hadoop MapReduce cluster. This requires that the Terrier property plugin.hadoop.hod be configured to point to the location of the HOD binary.
As a last resort, Terrier will use the local job tracker that Hadoop provides on the localhost.
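The selection order above can be sketched as a small decision function. This is a hypothetical illustration, not the code in HadoopPlugin: the class JobTrackerChoice, the Mode enum, and the choose method are invented names, and the rule that an unset, blank, or "local" value of mapred.job.tracker means "no real cluster configured" is an assumption.

```java
// Hypothetical sketch of the three-step fallback; not Terrier's actual code.
class JobTrackerChoice {

    enum Mode { EXISTING_CLUSTER, HOD, LOCAL }

    // mapredJobTracker: value of mapred.job.tracker, or null if unset.
    // hodPath: value of plugin.hadoop.hod, or null if unset.
    static Mode choose(String mapredJobTracker, String hodPath) {
        // 1. Prefer a real Job Tracker from the Hadoop configuration.
        //    Assumption: "local" (or blank) means no real cluster.
        if (mapredJobTracker != null
                && !mapredJobTracker.trim().isEmpty()
                && !mapredJobTracker.trim().equals("local")) {
            return Mode.EXISTING_CLUSTER;
        }
        // 2. Otherwise use HOD, if the path to the HOD binary is set.
        if (hodPath != null && !hodPath.trim().isEmpty()) {
            return Mode.HOD;
        }
        // 3. Last resort: the local job tracker on the localhost.
        return Mode.LOCAL;
    }

    public static void main(String[] args) {
        // Host name and path below are example values only.
        System.out.println(choose("master.example.org:9001", null));
        System.out.println(choose("local", "/opt/hadoop/contrib/hod/bin"));
        System.out.println(choose(null, null));
    }
}
```

Note that an existing cluster takes priority even when plugin.hadoop.hod is also set, which matches the order in which the options are listed.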

setGlobalConfiguration

public static void setGlobalConfiguration(org.apache.hadoop.conf.Configuration _config)

getGlobalConfiguration

public static org.apache.hadoop.conf.Configuration getGlobalConfiguration()

initialise

public void initialise()
                throws java.lang.Exception

Specified by: initialise in interface ApplicationSetup.TerrierApplicationPlugin

Throws: java.lang.Exception

getConfiguration

public org.apache.hadoop.conf.Configuration getConfiguration()