uk.ac.gla.terrier.utility.io
Class HadoopPlugin
java.lang.Object
uk.ac.gla.terrier.utility.io.HadoopPlugin
- All Implemented Interfaces:
- ApplicationSetup.TerrierApplicationPlugin
public class HadoopPlugin
- extends java.lang.Object
- implements ApplicationSetup.TerrierApplicationPlugin
This class provides the main glue between Terrier and Hadoop. It has two main roles:
- Configure Terrier such that the Hadoop file systems can be accessed by Terrier.
- Provide a means to access the Hadoop map-reduce cluster, using Hadoop on Demand (HOD) if necessary.
Configuring Terrier to access HDFS
Terrier can access a Hadoop Distributed File System (HDFS), allowing collections and indices to be placed there.
To do so, ensure that your Hadoop conf/ is on your CLASSPATH, and that the Hadoop plugin is loaded by Terrier,
by setting terrier.plugins=uk.ac.gla.terrier.utility.io.HadoopPlugin in your terrier.properties file.
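As a quick sanity check of the setting described above, the following minimal sketch parses a terrier.properties fragment and tests whether the Hadoop plugin is listed. The PluginCheck class and its method are hypothetical helpers, not part of Terrier, and the sketch assumes that terrier.plugins holds a comma-separated list of class names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Illustrative check only; this class is hypothetical and not part of Terrier. */
final class PluginCheck {
    static final String HADOOP_PLUGIN = "uk.ac.gla.terrier.utility.io.HadoopPlugin";

    /** Returns true if the terrier.plugins value lists the given class name. */
    static boolean isPluginListed(Properties props, String className) {
        // Assumption: terrier.plugins is a comma-separated list of class names.
        String value = props.getProperty("terrier.plugins", "");
        for (String name : value.split(",")) {
            if (name.trim().equals(className)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // The line from the text above, as it would appear in terrier.properties.
        String file = "terrier.plugins=" + HADOOP_PLUGIN + "\n";
        Properties props = new Properties();
        props.load(new StringReader(file));
        System.out.println(isPluginListed(props, HADOOP_PLUGIN));
    }
}
```

The check only inspects the property text; whether the plugin actually loads still depends on the Hadoop conf/ directory being on the CLASSPATH.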
Configuring Terrier to access an existing Hadoop MapReduce cluster
Terrier can access an existing MapReduce cluster, as long as the conf/ folder for Hadoop is on your CLASSPATH.
If you do not have a Hadoop cluster available, Terrier can be configured to use HOD to build a temporary
Hadoop cluster from a PBS (Torque) cluster. To configure HOD itself, see the
HOD documentation. To use HOD from Terrier,
set the following properties:
- plugin.hadoop.hod - path to the hod binary, normally $HADOOP_HOME/contrib/hod/bin. If unset, then HOD is presumed
to be unconfigured.
- plugin.hadoop.hod.nodes - the number of nodes/CPUs that you want to request from the PBS Torque cluster. Defaults to 6.
- plugin.hadoop.hod.params - any additional options you want to set on the HOD command line. See the
HOD User guide for examples.
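The three properties above could be read as in the following minimal sketch, which applies the documented default of 6 nodes. HodSettings is a hypothetical holder class, not Terrier's own code; the path in main is an invented example, and treating an empty plugin.hadoop.hod value as unset is an assumption.

```java
import java.util.Properties;

/** Illustrative holder for the three HOD properties; hypothetical, not Terrier's own code. */
final class HodSettings {
    final String hodBinary;   // null means HOD is treated as unconfigured
    final int nodes;          // number of nodes/CPUs to request
    final String extraParams; // additional HOD command-line options

    HodSettings(String hodBinary, int nodes, String extraParams) {
        this.hodBinary = hodBinary;
        this.nodes = nodes;
        this.extraParams = extraParams;
    }

    static HodSettings fromProperties(Properties props) {
        String hod = props.getProperty("plugin.hadoop.hod");
        // Assumption: an empty value is treated the same as an unset property.
        if (hod != null && hod.trim().isEmpty()) {
            hod = null;
        }
        // The documentation above states that the node count defaults to 6.
        int nodes = Integer.parseInt(
                props.getProperty("plugin.hadoop.hod.nodes", "6").trim());
        String params = props.getProperty("plugin.hadoop.hod.params", "");
        return new HodSettings(hod, nodes, params);
    }

    boolean isConfigured() {
        return hodBinary != null;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical install location, for illustration only.
        props.setProperty("plugin.hadoop.hod", "/opt/hadoop/contrib/hod/bin");
        HodSettings s = fromProperties(props);
        System.out.println(s.isConfigured() + " " + s.nodes);
    }
}
```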
Using Hadoop MapReduce from Terrier
You should use the JobFactory provided by this class when creating a MapReduce job from Terrier. The JobFactory
creates a HOD session should one be required, and also configures jobs such that the Terrier environment can
be recreated on the execution nodes.
HadoopPlugin.JobFactory jf = HadoopPlugin.getJobFactory("HOD-TerrierIndexing");
if (jf == null)
    throw new Exception("Could not get JobFactory from HadoopPlugin");
JobConf conf = jf.newJob();
....
jf.close(); //closing the JobFactory will ensure that the HOD session ends
When using your own code in Terrier MapReduce jobs, ensure that you configure the Terrier application before
anything else:
public void configure(JobConf jc)
{
    try {
        HadoopUtility.loadTerrierJob(jc);
    } catch (Exception e) {
        throw new Error("Cannot load ApplicationSetup", e);
    }
}
- Since:
- 2.2
- Version:
- $Revision: 1.4 $
- Author:
- Craig Macdonald
Nested Class Summary |
static class |
HadoopPlugin.JobFactory
A JobFactory is responsible for creating Terrier MapReduce jobs. |
Methods inherited from class java.lang.Object |
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
HadoopPlugin
public HadoopPlugin()
getJobFactory
public static HadoopPlugin.JobFactory getJobFactory(java.lang.String sessionName)
- Get a JobFactory with the specified session name. This method tries three options, in order:
- If the current/default Hadoop configuration has a real Hadoop cluster Job Tracker configured, then
that will be used. This requires that the mapred.job.tracker property in hadoop-site.xml
be configured.
- Next, it will attempt to use HOD to build a Hadoop MapReduce cluster. This requires that the Terrier property
plugin.hadoop.hod be set to the location of the HOD binary.
- As a last resort, Terrier will use the local job tracker that Hadoop provides on the localhost.
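The fallback order described above can be sketched as a small decision function. This is not the actual implementation of getJobFactory: ClusterChoice and its Mode enum are hypothetical names, and the sketch assumes that an unset, empty, or "local" value of mapred.job.tracker means that no real Job Tracker is configured.

```java
/** Illustrative sketch of the fallback order; hypothetical, not Terrier's implementation. */
final class ClusterChoice {
    enum Mode { EXISTING_CLUSTER, HOD, LOCAL }

    /**
     * @param jobTracker value of mapred.job.tracker, or null if unset
     * @param hodBinary  value of plugin.hadoop.hod, or null if unset
     */
    static Mode choose(String jobTracker, String hodBinary) {
        // Assumption: "local" (or no value) means no real Job Tracker is configured.
        if (jobTracker != null
                && !jobTracker.trim().isEmpty()
                && !"local".equals(jobTracker.trim())) {
            return Mode.EXISTING_CLUSTER;
        }
        // Next option: HOD, which needs the path to the hod binary.
        if (hodBinary != null && !hodBinary.trim().isEmpty()) {
            return Mode.HOD;
        }
        // Last resort: the local job tracker.
        return Mode.LOCAL;
    }

    public static void main(String[] args) {
        // The host name and path below are invented examples.
        System.out.println(choose("jobtracker.example.org:9001", null));
        System.out.println(choose("local", "/opt/hadoop/contrib/hod/bin"));
        System.out.println(choose(null, null));
    }
}
```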
setGlobalConfiguration
public static void setGlobalConfiguration(org.apache.hadoop.conf.Configuration _config)
getGlobalConfiguration
public static org.apache.hadoop.conf.Configuration getGlobalConfiguration()
initialise
public void initialise()
throws java.lang.Exception
- Specified by:
initialise
in interface ApplicationSetup.TerrierApplicationPlugin
- Throws:
java.lang.Exception
getConfiguration
public org.apache.hadoop.conf.Configuration getConfiguration()
Terrier Information Retrieval Platform 2.2.1. Copyright 2004-2008 University of Glasgow