HadoopPlugin (Terrier 3.6 API)

org.terrier.utility.io
Class HadoopPlugin

java.lang.Object
  org.terrier.utility.io.HadoopPlugin

All Implemented Interfaces: ApplicationSetup.TerrierApplicationPlugin

public class HadoopPlugin
extends Object
implements ApplicationSetup.TerrierApplicationPlugin

This class provides the main glue between Terrier and Hadoop. It has two main roles:

Configure Terrier such that the Hadoop file systems can be accessed by Terrier.
Provide a means to access the Hadoop map-reduce cluster, using Hadoop on Demand (HOD) if necessary.

Configuring Terrier to access HDFS

Terrier can access a Hadoop Distributed File System (HDFS), allowing collections and indices to be placed there. To do so, ensure that your Hadoop conf/ is on your CLASSPATH, and that the Hadoop plugin is loaded by Terrier, by setting terrier.plugins=org.terrier.utility.io.HadoopPlugin in your terrier.properties file.

Configuring Terrier to access an existing Hadoop MapReduce cluster

Terrier can access an existing MapReduce cluster, as long as the conf/ folder for Hadoop is on your CLASSPATH. If you do not already have an existing Hadoop cluster, Terrier can be configured to use HOD, to build a temporary Hadoop cluster from a PBS (Torque) cluster. To configure HOD itself, the reader is referred to the HOD documentation. To use HOD from Terrier, set the following properties:

plugin.hadoop.hod - path to the hod binary, normally $HADOOP_HOME/contrib/hod/bin. If unset, then HOD is presumed to be unconfigured.
plugin.hadoop.hod.nodes - the number of nodes/CPUs that you want to request from the PBS Torque cluster. Defaults to 6.
plugin.hadoop.hod.params - any additional options you want to set on the HOD command line. See the HOD User guide for examples.
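
As an illustration of the properties above, a terrier.properties fragment might look like the following. This is a sketch only: the install location /opt/hadoop is an assumption, the node count simply restates the documented default, and plugin.hadoop.hod.params is left commented out because suitable values depend on your HOD setup.

```properties
# Load the Hadoop plugin (as described for HDFS access)
terrier.plugins=org.terrier.utility.io.HadoopPlugin

# Location of HOD; /opt/hadoop is a hypothetical install location
plugin.hadoop.hod=/opt/hadoop/contrib/hod/bin

# Number of nodes/CPUs to request from the PBS (Torque) cluster; 6 is the default
plugin.hadoop.hod.nodes=6

# Additional HOD command-line options, if any (see the HOD User guide)
#plugin.hadoop.hod.params=
```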

Using Hadoop MapReduce from Terrier

You should use the JobFactory provided by this class when creating a MapReduce job from Terrier. The JobFactory creates a HOD session should one be required, and also configures jobs such that the Terrier environment can be recreated on the execution nodes.

 HadoopPlugin.JobFactory jf = HadoopPlugin.getJobFactory("HOD-TerrierIndexing");
 if (jf == null)
         throw new Exception("Could not get JobFactory from HadoopPlugin");
 try {
         JobConf conf = jf.newJob();
         ....
 } finally {
         jf.close(); //closing the JobFactory will ensure that the HOD session ends, even if the job fails
 }

When using your own code in Terrier MapReduce jobs, ensure that you configure the Terrier application before anything else:

 public void configure(JobConf jc)
 {
         try{
                 HadoopUtility.loadTerrierJob(jc);
         } catch (Exception e) {
                 throw new Error("Cannot load ApplicationSetup", e);
         }
 }

Since: 2.2
Author: Craig Macdonald

Nested Class Summary
`static class`	`HadoopPlugin.JobFactory` a Job Factory is responsible for creating Terrier MapReduce jobs.

Field Summary
`protected org.apache.hadoop.conf.Configuration`	`config` configuration used by this plugin
`protected org.apache.hadoop.fs.FileSystem`	`hadoopFS` distributed file system used by this plugin
`protected static org.apache.log4j.Logger`	`logger` The logger used
`protected static org.apache.hadoop.conf.Configuration`	`singletonConfiguration` main configuration object to use for Hadoop access
`protected static HadoopPlugin`	`singletonHadoopPlugin` instance of this class - it is a singleton

Constructor Summary
`HadoopPlugin()` Constructs a new plugin

Method Summary
`org.apache.hadoop.conf.Configuration`	`getConfiguration()` Returns the Hadoop configuration underlying this plugin instance
`static org.apache.hadoop.fs.FileSystem`	`getDefaultFileSystem()` Returns the default file system according to Hadoop
`static String`	`getDefaultFileSystemPrefix()` Returns the String prefix of the default file system according to Hadoop
`static URI`	`getDefaultFileSystemURI()` Returns the URI of the default file system according to Hadoop
`static org.apache.hadoop.conf.Configuration`	`getGlobalConfiguration()` Obtain the global Hadoop configuration in use by the plugin
`static HadoopPlugin.JobFactory`	`getJobFactory(String sessionName)` Get a JobFactory with the specified session name.
`protected static HadoopPlugin.JobFactory`	`getJobFactory(String sessionName, boolean persistent)` implements the obtaining of job factories
`void`	`initialise()` Initialises the Plugin, by connecting to the distributed file system
`static void`	`setGlobalConfiguration(org.apache.hadoop.conf.Configuration _config)` Update the global Hadoop configuration in use by the plugin

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

singletonHadoopPlugin

protected static HadoopPlugin singletonHadoopPlugin

instance of this class - it is a singleton

singletonConfiguration

protected static org.apache.hadoop.conf.Configuration singletonConfiguration

main configuration object to use for Hadoop access

logger

protected static final org.apache.log4j.Logger logger

The logger used

config

protected org.apache.hadoop.conf.Configuration config

configuration used by this plugin

hadoopFS

protected org.apache.hadoop.fs.FileSystem hadoopFS

distributed file system used by this plugin

Constructor Detail

HadoopPlugin

public HadoopPlugin()

Constructs a new plugin

Method Detail

getJobFactory

public static HadoopPlugin.JobFactory getJobFactory(String sessionName)

Get a JobFactory with the specified session name. This method tries three strategies, in order:

If the current/default Hadoop configuration has a real Hadoop cluster Job Tracker configured, then that will be used. This requires that the mapred.job.tracker property in hadoop-site.xml be configured.
Next, it will attempt to use HOD to build a Hadoop MapReduce cluster. This requires that the Terrier property plugin.hadoop.hod be configured to point to the location of the HOD binary.
As a last resort, Terrier will use the local job tracker that Hadoop provides on the localhost. This is useful for unit testing; however, it does not support multiple reducers.
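
The selection order above can be modelled as a short, self-contained sketch. This is not Terrier's implementation: the class JobFactorySelectionSketch, the enum ClusterSource, and the method choose are hypothetical names, java.util.Properties stands in for the Hadoop and Terrier configuration objects, and treating an unset or "local" mapred.job.tracker as "no real cluster" is an assumption about how the first check behaves.

```java
import java.util.Properties;

/** Hypothetical model of the selection order; not Terrier's actual code. */
public class JobFactorySelectionSketch {

    /** The three possible sources of a MapReduce cluster. */
    public enum ClusterSource { EXISTING_CLUSTER, HOD, LOCAL }

    /**
     * Picks a cluster source from two property sets that stand in for the
     * Hadoop configuration and terrier.properties respectively.
     */
    public static ClusterSource choose(Properties hadoopConf, Properties terrierProps) {
        // Assumption: an unset or "local" job tracker means no real cluster is configured.
        String tracker = hadoopConf.getProperty("mapred.job.tracker", "local").trim();
        if (!tracker.isEmpty() && !tracker.equals("local")) {
            return ClusterSource.EXISTING_CLUSTER;
        }
        // Otherwise fall back to HOD, but only if plugin.hadoop.hod is set.
        String hod = terrierProps.getProperty("plugin.hadoop.hod", "").trim();
        if (!hod.isEmpty()) {
            return ClusterSource.HOD;
        }
        // Last resort: the local job tracker.
        return ClusterSource.LOCAL;
    }

    public static void main(String[] args) {
        Properties hadoopConf = new Properties();
        Properties terrierProps = new Properties();
        System.out.println(choose(hadoopConf, terrierProps)); // LOCAL

        // Placeholder path, for illustration only.
        terrierProps.setProperty("plugin.hadoop.hod", "/path/to/hod");
        System.out.println(choose(hadoopConf, terrierProps)); // HOD

        // Placeholder host and port, for illustration only.
        hadoopConf.setProperty("mapred.job.tracker", "jobtracker.example.org:9001");
        System.out.println(choose(hadoopConf, terrierProps)); // EXISTING_CLUSTER
    }
}
```

Note that a configured job tracker takes precedence even when plugin.hadoop.hod is also set, which matches the order in which the steps are listed.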

getJobFactory

protected static HadoopPlugin.JobFactory getJobFactory(String sessionName,
                                                       boolean persistent)

implements the obtaining of job factories

setGlobalConfiguration

public static void setGlobalConfiguration(org.apache.hadoop.conf.Configuration _config)

Update the global Hadoop configuration in use by the plugin

getGlobalConfiguration

public static org.apache.hadoop.conf.Configuration getGlobalConfiguration()

Obtain the global Hadoop configuration in use by the plugin

getDefaultFileSystemPrefix

public static String getDefaultFileSystemPrefix()

Returns the String prefix of the default file system according to Hadoop

getDefaultFileSystemURI

public static URI getDefaultFileSystemURI()

Returns the URI of the default file system according to Hadoop

getDefaultFileSystem

public static org.apache.hadoop.fs.FileSystem getDefaultFileSystem()
                                                            throws IOException

Returns the default file system according to Hadoop

Throws: IOException

initialise

public void initialise()
                throws Exception

Initialises the Plugin, by connecting to the distributed file system

Specified by: initialise in interface ApplicationSetup.TerrierApplicationPlugin

Throws: Exception

getConfiguration

public org.apache.hadoop.conf.Configuration getConfiguration()

Returns the Hadoop configuration underlying this plugin instance
