
Configuring Terrier for Hadoop

Overview

From version 2.2, Terrier supports the Hadoop Map Reduce framework. In this initial release, Hadoop support exists only for Map Reduce indexing; however, we expect this to expand in the future. In this document, we describe how to integrate your Hadoop and Terrier setups. Hadoop is useful because it enables extremely large-scale operations, using Map Reduce technology built on a distributed file system. More information about deploying Hadoop on a cluster of nodes can be found in the Hadoop Core documentation.

Pre-requisites

Terrier requires a working Hadoop setup of version 0.18.x, built using a cluster of one or more machines. We recommend the quickstart and cluster setup documents in the Hadoop Core documentation. If you do not have a dedicated cluster of machines running Hadoop, you can use Hadoop on Demand (HOD), which allows a Map Reduce cluster to be built on an existing Torque PBS job cluster.

In general, Terrier can be configured to use an existing Hadoop installation by making the following changes:

  1. Add the location of your $HADOOP_HOME/conf folder to the CLASSPATH environment variable before running Terrier (you may want to edit bin/terrier-env.sh to achieve this).
  2. Set the property terrier.plugins=uk.ac.gla.terrier.utility.io.HadoopPlugin in your terrier.properties file.
  3. Ensure that there is a world-writable /tmp directory on Hadoop's default file system.

This will allow Terrier to access the shared file system described in your hadoop-site.xml. If you have also set up the Map Reduce job tracker, then Terrier can directly access it to submit jobs.
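
For example, the steps above might be carried out as follows. This is only a sketch: the exact paths, and whether you set the classpath in your shell or in bin/terrier-env.sh, will depend on your installation.

    # 1. Put Hadoop's conf folder on the classpath before starting Terrier
    #    (alternatively, add this line to bin/terrier-env.sh)
    export CLASSPATH=$HADOOP_HOME/conf:$CLASSPATH

    # 2. Enable the Hadoop plugin by adding this line to your terrier.properties file:
    #    terrier.plugins=uk.ac.gla.terrier.utility.io.HadoopPlugin

    # 3. Create a world-writable /tmp directory on Hadoop's default file system
    $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
    $HADOOP_HOME/bin/hadoop fs -chmod 777 /tmp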

Using Hadoop On Demand (HOD)

If you are using HOD, Terrier can be configured to access it automatically. Firstly, ensure HOD is working correctly, as described in the HOD user and admin guides. When Terrier wants to submit a Map Reduce job, it will use the HadoopPlugin to request a Map Reduce cluster from HOD. The properties used to configure this, and more information on using HOD, can be found in the HadoopPlugin documentation.

Indexing with Hadoop Map Reduce

See Indexing with Hadoop Map Reduce documentation.

Developing Map Reduce jobs with Terrier

It is possible to use Terrier for other Map Reduce tasks. However, Terrier requires some careful configuration to be used in the Map Reduce setting, so HadoopPlugin and HadoopUtility should be used. In particular, HadoopPlugin and HadoopUtility ensure that Terrier's share/ folder and the terrier.properties file are copied to a shared space that all job tasks can access. In the configure() method of the Map and Reduce tasks, you must call HadoopUtility.loadTerrierJob(jobConf). For more information, see HadoopPlugin.
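
As an illustration, a minimal sketch of a map task that loads the Terrier configuration is given below, using the Hadoop 0.18.x org.apache.hadoop.mapred API. The class name MyTerrierMapper and the key/value types are hypothetical; only the call to HadoopUtility.loadTerrierJob(jobConf) in configure() is essential.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import uk.ac.gla.terrier.utility.io.HadoopUtility;

    /** Hypothetical map task that makes Terrier usable inside a Map Reduce job. */
    public class MyTerrierMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        /** Called once per task: load the terrier.properties and share/ folder
          * that were copied to the shared space when the job was submitted. */
        public void configure(JobConf jobConf) {
            try {
                HadoopUtility.loadTerrierJob(jobConf);
            } catch (Exception e) {
                throw new RuntimeException("Could not load the Terrier configuration", e);
            }
        }

        /** Terrier classes can safely be used from here onwards. */
        public void map(LongWritable key, Text value,
                OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            // Illustrative only: echo each input line under a constant key.
            output.collect(new Text("line"), value);
        }
    }

On the job submission side, HadoopPlugin handles copying the share/ folder and terrier.properties to the shared space before the job starts, as described in its documentation.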
