Configuring Terrier for Hadoop
From version 2.2, Terrier supports the Hadoop Map Reduce framework. In this initial release, Hadoop support exists only for Map Reduce indexing; however, we expect this to expand in the future. This document describes how to integrate your Hadoop and Terrier setups. Hadoop is useful because it allows extremely large-scale operations, using Map Reduce technology built on a distributed file system. More information about deploying Hadoop on a cluster of nodes can be found in the Hadoop Core documentation.
In general, Terrier can be configured to use an existing Hadoop installation, by two changes:
This will allow Terrier to access the shared file system described in your hadoop-site.xml. If you also have the Map Reduce job tracker set up, Terrier can then submit jobs to it directly.
If you are using HOD, Terrier can be configured to access it automatically. First, ensure HOD is working correctly, as described in the HOD user and admin guides. When Terrier wants to submit a Map Reduce job, it will use the HadoopPlugin to request a Map Reduce cluster from HOD. To configure this, use the following properties:
For more information on using HOD, see HadoopPlugin.
It is possible to use Terrier for other Map Reduce tasks. Terrier requires some careful configuration to work in a Map Reduce setting, so HadoopPlugin and HadoopUtility should be used to handle it. In particular, HadoopPlugin/HadoopUtility ensure that Terrier's share/ folder and the terrier.properties file are copied to a shared space that all job tasks can access. In the configure() method of the Map and Reduce tasks, you must call HadoopUtility.loadTerrierJob(jobConf). For more information, see HadoopPlugin.
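As a minimal sketch of where the loadTerrierJob call goes, the following map task overrides configure() from Hadoop's older org.apache.hadoop.mapred API. The class name TerrierAwareMapper, the key/value types, and the body of map() are hypothetical. The package of HadoopUtility is assumed to be org.terrier.utility.io, which may differ between Terrier releases, and the exception handling is only one possible choice. Compiling it requires the Hadoop and Terrier jars on the classpath.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
// Assumed package name; check the Javadoc of your Terrier release.
import org.terrier.utility.io.HadoopUtility;

/** Hypothetical map task that needs Terrier's configuration at run time. */
public class TerrierAwareMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

    @Override
    public void configure(JobConf jobConf) {
        try {
            // Required by the text above: call this before any Terrier class
            // is used, so the task sees the shared Terrier setup.
            HadoopUtility.loadTerrierJob(jobConf);
        } catch (Exception e) {
            // configure() cannot throw checked exceptions, so fail the task.
            throw new RuntimeException("Could not load Terrier setup for this task", e);
        }
    }

    @Override
    public void map(LongWritable offset, Text line,
            OutputCollector<Text, LongWritable> output, Reporter reporter)
            throws IOException {
        // Placeholder logic: emit each input line with a count of 1.
        // A real task would use Terrier classes here.
        output.collect(line, new LongWritable(1L));
    }
}
```

A Reduce task would follow the same pattern, making the same call in its own configure() method.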
Copyright © 2015 University of Glasgow | All Rights Reserved