[Previous: TREC Experiment Examples] [Contents] [Next: Hadoop MapReduce Indexing with Terrier]

Configuring Terrier for Hadoop

Overview

From version 2.2 onwards, Terrier has supported the Hadoop MapReduce framework. Currently, Terrier provides single-pass distributed indexing under MapReduce; however, it has been designed to be compatible with other Hadoop-driven functionality. In this document, we describe how to integrate your Hadoop and Terrier setups. Hadoop is useful because it enables extremely large-scale operations using MapReduce technology, built on a distributed file system. More information about deploying Hadoop on a cluster of nodes can be found in the Hadoop Core documentation.

Pre-requisites

Terrier requires a working Hadoop setup of version 0.20.x, built on a cluster of one or more machines. Terrier does not currently support newer versions of Hadoop, due to minor changes in the Hadoop API. However, should you wish to use a newer version to take advantage of its numerous bug fixes and efficiency improvements, we anticipate that the core can be updated to do so. From the Hadoop Core documentation, we recommend the quickstart and cluster setup documents. If you do not have a dedicated cluster of machines running Hadoop and do not wish to create one, an alternative is to use Hadoop on Demand (HOD). In particular, HOD allows a MapReduce cluster to be built upon an existing Torque PBS job cluster.

In general, Terrier can be configured to use an existing Hadoop installation by three changes (a combined sketch follows the list):

  1. Add the location of your $HADOOP_HOME/conf folder to the CLASSPATH environment variable before running Terrier.
  2. Set the property terrier.plugins=org.terrier.utility.io.HadoopPlugin in your terrier.properties file.
  3. Ensure that there is a world-writable /tmp directory on Hadoop's default file system.
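As a minimal sketch of these three steps (the installation path is hypothetical; adjust for your setup):

    # 1. expose Hadoop's configuration to Terrier (hypothetical install path)
    export HADOOP_HOME=/usr/local/hadoop
    export CLASSPATH=$HADOOP_HOME/conf:$CLASSPATH

    # 2. add to your terrier.properties file:
    #    terrier.plugins=org.terrier.utility.io.HadoopPlugin

    # 3. create a world-writable /tmp on Hadoop's default file system
    hadoop fs -mkdir /tmp
    hadoop fs -chmod 777 /tmp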

This will allow Terrier to access the shared file system described in your core-site.xml. If your MapReduce job tracker is also specified in mapred-site.xml, then Terrier can directly access it to submit jobs.
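For illustration, a minimal core-site.xml and mapred-site.xml for Hadoop 0.20.x might contain entries along these lines (the hostnames and ports are hypothetical):

    <!-- core-site.xml: the shared file system Terrier will access -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://namenode.example.org:8020</value>
    </property>

    <!-- mapred-site.xml: the job tracker used to submit MapReduce jobs -->
    <property>
      <name>mapred.job.tracker</name>
      <value>jobtracker.example.org:8021</value>
    </property>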

Using Hadoop On Demand (HOD)

If you do not have a dedicated Hadoop cluster yet, don't worry. Hadoop provides a utility called Hadoop On Demand (HOD), which can use a Torque PBS cluster to create a Hadoop cluster. Terrier fully supports accessing Hadoop clusters created by HOD, and can even call HOD to create the cluster when it is needed for a job. If your cluster is based on Sun Grid Engine instead, note that Sun Grid Engine now also has support for Hadoop.

If you are using HOD, then Terrier can be configured to access it automatically. Firstly, ensure HOD is working correctly, as described in the HOD user and admin guides. When Terrier wants to submit a MapReduce job, it will use the HadoopPlugin to request a MapReduce cluster from HOD. This behaviour is configured using the HOD-related properties of HadoopPlugin.
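As an illustrative sketch only, such a configuration might look as follows; the property names and values here are assumptions, and should be verified against the HadoopPlugin documentation for your Terrier release:

    # hypothetical values; check property names in the HadoopPlugin documentation
    plugin.hadoop.hod=/usr/local/hadoop/contrib/hod/bin/hod
    plugin.hadoop.hod.nodes=6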

For more information on using HOD, see our HadoopPlugin documentation.

Indexing with Hadoop MapReduce

We provide a guide for configuring single-pass indexing with MapReduce under Hadoop.
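For example, once Hadoop has been configured as described above, MapReduce indexing is typically invoked using the -H option of trec_terrier.sh; verify the exact invocation against the indexing guide for your Terrier release:

    bin/trec_terrier.sh -i -H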

Developing MapReduce jobs with Terrier

Importantly, it should be possible to modify Terrier to perform other information retrieval tasks using MapReduce. Terrier requires some careful configuration to use in the MapReduce setting. The included HadoopPlugin and HadoopUtility classes should be used to link Terrier to Hadoop. In particular, HadoopPlugin and HadoopUtility ensure that Terrier's share/ folder and the terrier.properties file are copied to a shared space that all job tasks can access. In the configure() method of your map and reduce tasks, you must call HadoopUtility.loadTerrierJob(jobConf), as the sketch below illustrates. For more information, see HadoopPlugin. Furthermore, we suggest that you browse the MapReduce indexing source code, both the map and reduce functions in Hadoop_BasicSinglePassIndexer, as well as the input format and partitioner.
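The following is a minimal sketch of a map task using the Hadoop 0.20 mapred API; the class name, key/value types and map() body are hypothetical, and only the loadTerrierJob() call is prescribed by the documentation above:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    import org.terrier.utility.io.HadoopUtility;

    // Hypothetical mapper: the key/value types are illustrative only.
    public class MyMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, LongWritable> {

        @Override
        public void configure(JobConf jobConf) {
            try {
                // Make the job's copy of terrier.properties and the share/
                // folder available to this task before any Terrier code runs.
                HadoopUtility.loadTerrierJob(jobConf);
            } catch (Exception e) {
                throw new RuntimeException("Could not load Terrier configuration", e);
            }
        }

        @Override
        public void map(LongWritable key, Text value,
                OutputCollector<Text, LongWritable> output, Reporter reporter)
                throws IOException {
            // ... Terrier classes can now be used safely here ...
        }
    }

The same configure() pattern applies to reduce tasks; without the loadTerrierJob() call, Terrier classes running inside a task would not find their configuration.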

[Previous: TREC Experiment Examples] [Contents] [Next: Hadoop MapReduce Indexing with Terrier]