From version 2.2 onwards, Terrier has supported the Hadoop MapReduce framework. Currently, Terrier provides single-pass distributed indexing under MapReduce, however, Terrier has been designed to be compatible with other Hadoop driven functionality. In this document, we describe how to integrate your Hadoop and Terrier setups. Hadoop is useful because it allows extremely large-scale operations, using MapReduce technology, built on a distributed file system. More information can be found about deploying Hadoop using a cluster of nodes in the Hadoop Core documentation.
In general, Terrier can be configured to use an existing Hadoop installation, by two changes:
This will allow Terrier to access the shared file system described in your core-site.xml. If you also have the MapReduce job tracker setup specified in mapred-site.xml, then Terrier can now directly access the MapReduce job tracker to submit jobs.
If you don't have a dedicated Hadoop cluster yet, don't worry. Hadoop provides a utility called Hadoop On Demand (HOD), which can use a Torque PBS cluster to create a Hadoop cluster. Terrier fully supports accessing Hadoop clusters created by HOD, and can even call HOD to create the cluster when its needed for a job. If your cluster is based on Sun Grid Engine, this now has support for Hadoop.
If you are using HOD, then Terrier can be configured to automatically access it. Firstly, ensure HOD is working correctly, as described in the HOD user and admin guides. When Terrier wants to submit a MapReduce job, it will use the HadoopPlugin to request a MapReduce cluster from HOD. To configure this use the following properties:
For more information on using HOD, see our HadoopPlugin documentation.
Importantly, it should be possible to modify Terrier to perform other information retrieval tasks using MapReduce. Terrier requires some careful configuration to use in the MapReduce setting. The included, HadoopPlugin and HadoopUtility should be used to link Terrier to Hadoop. In particular, HadoopPlugin/HadoopUtility ensure that Terrier's share/ folder and the terrier.properties file are copied to a shared space that all job tasks can access. In the configure() method of the map and reduce tasks, you must call HadoopUtility.loadTerrierJob(jobConf). For more information, see HadoopPlugin. Furthermore, we suggest that you browse the MapReduce indexing source code, both for the map and reduce functions stored in the Hadoop_BasicSinglePassIndexer and as well as the input format and partitioner.
[Previous: TREC Experiment Examples] [Contents] [Next: Hadoop MapReduce Indexing with Terrier]Copyright © 2011 University of Glasgow | All Rights Reserved