org.terrier.utility.io
Class HadoopPlugin

java.lang.Object
  extended by org.terrier.utility.io.HadoopPlugin
All Implemented Interfaces:
ApplicationSetup.TerrierApplicationPlugin

public class HadoopPlugin
extends Object
implements ApplicationSetup.TerrierApplicationPlugin

This class provides the main glue between Terrier and Hadoop. It has two main roles:

  1. Configure Terrier such that the Hadoop file systems can be accessed by Terrier.
  2. Provide a means to access the Hadoop map-reduce cluster, using Hadoop on Demand (HOD) if necessary.

Configuring Terrier to access HDFS

Terrier can access a Hadoop Distributed File System (HDFS), allowing collections and indices to be placed there. To do so, ensure that your Hadoop conf/ is on your CLASSPATH, and that the Hadoop plugin is loaded by Terrier, by setting terrier.plugins=org.terrier.utility.io.HadoopPlugin in your terrier.properties file.
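As a minimal sketch, the corresponding entry in terrier.properties would look as follows. Only the property shown is taken from the text above; the comment line is illustrative, and the sketch assumes that no other plugins are configured.

```properties
# Load the Hadoop plugin so that Terrier can access HDFS
terrier.plugins=org.terrier.utility.io.HadoopPlugin
```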

Configuring Terrier to access an existing Hadoop MapReduce cluster

Terrier can access an existing MapReduce cluster, as long as the conf/ folder for Hadoop is on your CLASSPATH. If you do not have an existing Hadoop cluster, Terrier can be configured to use HOD to build a temporary Hadoop cluster from a PBS (Torque) cluster. To configure HOD itself, refer to the HOD documentation. To use HOD from Terrier, set the relevant Terrier properties, including plugin.hadoop.hod, which must point to the location of the HOD binary.

Using Hadoop MapReduce from Terrier

You should use the JobFactory provided by this class when creating a MapReduce job from Terrier. The JobFactory creates a HOD session should one be required, and also configures jobs such that the Terrier environment can be recreated on the execution nodes.
 // obtain a JobFactory for the named session
 HadoopPlugin.JobFactory jf = HadoopPlugin.getJobFactory("HOD-TerrierIndexing");
 if (jf == null)
         throw new Exception("Could not get JobFactory from HadoopPlugin");
 try {
         JobConf conf = jf.newJob();
         ....
 } finally {
         jf.close(); //closing the JobFactory will ensure that the HOD session ends
 }
 
When using your own code in Terrier MapReduce jobs, ensure that you configure the Terrier application before anything else:
 public void configure(JobConf jc)
 {
         try{
                 HadoopUtility.loadTerrierJob(jc);
         } catch (Exception e) {
                 throw new Error("Cannot load ApplicationSetup", e);
         }
 }
 

Since:
2.2
Author:
Craig Macdonald

Nested Class Summary
static class HadoopPlugin.JobFactory
          a Job Factory is responsible for creating Terrier MapReduce jobs.
 
Field Summary
protected  org.apache.hadoop.conf.Configuration config
          configuration used by this plugin
protected  org.apache.hadoop.fs.FileSystem hadoopFS
          distributed file system used by this plugin
protected static org.apache.log4j.Logger logger
          The logger used
protected static org.apache.hadoop.conf.Configuration singletonConfiguration
          main configuration object to use for Hadoop access
protected static HadoopPlugin singletonHadoopPlugin
          instance of this class - it is a singleton
 
Constructor Summary
HadoopPlugin()
          Constructs a new plugin
 
Method Summary
 org.apache.hadoop.conf.Configuration getConfiguration()
          Returns the Hadoop configuration underlying this plugin instance
static org.apache.hadoop.fs.FileSystem getDefaultFileSystem()
          Returns the default file system according to Hadoop
static String getDefaultFileSystemPrefix()
          Returns the String prefix of the default file system according to Hadoop
static URI getDefaultFileSystemURI()
          Returns the URI of the default file system according to Hadoop
static org.apache.hadoop.conf.Configuration getGlobalConfiguration()
          Obtain the global Hadoop configuration in use by the plugin
static HadoopPlugin.JobFactory getJobFactory(String sessionName)
          Get a JobFactory with the specified session name.
protected static HadoopPlugin.JobFactory getJobFactory(String sessionName, boolean persistent)
          Internal implementation used to obtain job factories
 void initialise()
          Initialises the Plugin, by connecting to the distributed file system
static void setGlobalConfiguration(org.apache.hadoop.conf.Configuration _config)
          Update the global Hadoop configuration in use by the plugin
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

singletonHadoopPlugin

protected static HadoopPlugin singletonHadoopPlugin
instance of this class - it is a singleton


singletonConfiguration

protected static org.apache.hadoop.conf.Configuration singletonConfiguration
main configuration object to use for Hadoop access


logger

protected static final org.apache.log4j.Logger logger
The logger used


config

protected org.apache.hadoop.conf.Configuration config
configuration used by this plugin


hadoopFS

protected org.apache.hadoop.fs.FileSystem hadoopFS
distributed file system used by this plugin

Constructor Detail

HadoopPlugin

public HadoopPlugin()
Constructs a new plugin

Method Detail

getJobFactory

public static HadoopPlugin.JobFactory getJobFactory(String sessionName)
Get a JobFactory with the specified session name. This method tries three options, in order:
  1. If the current/default Hadoop configuration has a real Hadoop cluster Job Tracker configured, then that will be used. This requires that the mapred.job.tracker property in hadoop-site.xml be configured.
  2. Next, it will attempt to use HOD to build a Hadoop MapReduce cluster. This requires that the Terrier property plugin.hadoop.hod be configured to point to the location of the HOD binary.
  3. As a last resort, Terrier will use the local job tracker that Hadoop provides on the localhost. This is useful for unit testing; however, it does not support multiple reducers.
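
The fallback order above can be sketched in plain Java. This is an illustration only, not Terrier's actual implementation: the class JobFactoryFallbackSketch, the enum ClusterChoice, the method choose and the example host and path are all hypothetical. The sketch also assumes that an unset mapred.job.tracker, or the value "local", means that no real Job Tracker is configured.

```java
// Illustrative sketch only: none of these names are part of Terrier's API.
class JobFactoryFallbackSketch {

    /** The three options described above, in order of preference. */
    enum ClusterChoice { EXISTING_CLUSTER, HOD_CLUSTER, LOCAL_JOB_TRACKER }

    /**
     * Picks a cluster option from two configuration values.
     * jobTracker stands in for the Hadoop property mapred.job.tracker,
     * hodBinary stands in for the Terrier property plugin.hadoop.hod.
     */
    static ClusterChoice choose(String jobTracker, String hodBinary) {
        // 1. A real Job Tracker is configured
        //    (assumption: null, empty or "local" means it is not).
        if (isSet(jobTracker) && !jobTracker.trim().equals("local")) {
            return ClusterChoice.EXISTING_CLUSTER;
        }
        // 2. HOD is configured, so a temporary cluster could be built.
        if (isSet(hodBinary)) {
            return ClusterChoice.HOD_CLUSTER;
        }
        // 3. Last resort: the local job tracker.
        return ClusterChoice.LOCAL_JOB_TRACKER;
    }

    /** True if the value is neither null nor blank. */
    private static boolean isSet(String value) {
        return value != null && !value.trim().isEmpty();
    }

    public static void main(String[] args) {
        // Hypothetical host name and path, for illustration only.
        System.out.println(choose("jobtracker.example.org:9001", "/opt/hod/bin/hod"));
        System.out.println(choose("local", "/opt/hod/bin/hod"));
        System.out.println(choose(null, null));
    }
}
```

In this sketch a configured Job Tracker always takes priority over HOD, mirroring the order of the list above; how the real method detects each case may differ.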


getJobFactory

protected static HadoopPlugin.JobFactory getJobFactory(String sessionName,
                                                       boolean persistent)
Internal implementation used to obtain job factories


setGlobalConfiguration

public static void setGlobalConfiguration(org.apache.hadoop.conf.Configuration _config)
Update the global Hadoop configuration in use by the plugin


getGlobalConfiguration

public static org.apache.hadoop.conf.Configuration getGlobalConfiguration()
Obtain the global Hadoop configuration in use by the plugin


getDefaultFileSystemPrefix

public static String getDefaultFileSystemPrefix()
Returns the String prefix of the default file system according to Hadoop


getDefaultFileSystemURI

public static URI getDefaultFileSystemURI()
Returns the URI of the default file system according to Hadoop


getDefaultFileSystem

public static org.apache.hadoop.fs.FileSystem getDefaultFileSystem()
                                                            throws IOException
Returns the default file system according to Hadoop

Throws:
IOException

initialise

public void initialise()
                throws Exception
Initialises the Plugin, by connecting to the distributed file system

Specified by:
initialise in interface ApplicationSetup.TerrierApplicationPlugin
Throws:
Exception

getConfiguration

public org.apache.hadoop.conf.Configuration getConfiguration()
Returns the Hadoop configuration underlying this plugin instance



Terrier 3.6. Copyright © 2004-2011 University of Glasgow