[Previous: Terrier/Hadoop Configuration] [Contents] [Next: Properties in Terrier]

Hadoop Map Reduce Indexing with Terrier

Overview

From version 2.2, Terrier supports the indexing of large collections in a Hadoop Map Reduce fashion. This uses the single-pass indexer to index sections of the collection (as batches of files) in map tasks. The output of the map tasks takes three forms: (a) terms and mini posting lists (known as runs in the single-pass indexer); (b) document indices, one per map task; (c) information about the number of documents saved per run.

Configuration

To index using the Map Reduce indexer, you need to have Terrier set up to use your Hadoop cluster; more information can be found in the Configuring Terrier for Hadoop documentation. For indexing using Map Reduce, your indexing Collection must have an InputStream constructor. Currently, only TRECCollection and TRECUTFCollection have one, although this is expected to change in future releases. Choose your collection using the trec.collection.class property, as normal.
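For example, to use the standard TRECCollection, the corresponding line in etc/terrier.properties would be:

trec.collection.class=TRECCollection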

Next, the locations of your collection and your index are both important. You will get the most benefit from Hadoop if your collection is stored on a supported distributed file system (e.g. HDFS). Hadoop requires that the files of the collection you are indexing are stored on the shared file system (the one named by fs.default.name in the Hadoop configuration). For example:

$ cat etc/collection.spec
hdfs://master:9000/Collections/WT2G/WT01/B01.gz
hdfs://master:9000/Collections/WT2G/WT01/B02.gz
hdfs://master:9000/Collections/WT2G/WT01/B03.gz
(etc)
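If your collection currently resides on local disk, you can copy it into HDFS first using the standard Hadoop shell commands; for example (the local and HDFS paths below are illustrative only):

$ hadoop fs -mkdir /Collections/WT2G/WT01
$ hadoop fs -put /local/disk/WT2G/WT01/*.gz /Collections/WT2G/WT01/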

You should also ensure that your index is on the shared filespace (again, the one named by fs.default.name in the Hadoop configuration), by setting the terrier.index.path property accordingly. For example, you can set terrier.index.path=hdfs://master:9000/Indices/WT2G.
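Putting the configuration together, the relevant entries in etc/terrier.properties might look as follows (a sketch; substitute your own namenode host, port and paths):

trec.collection.class=TRECCollection
terrier.index.path=hdfs://master:9000/Indices/WT2G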

Running the Indexing Job

Running the Map Reduce indexer is straightforward, using the -H (or --hadoop) option to TrecTerrier. A run of the Map Reduce indexer (here also using HOD) is shown below:
$ bin/trec_terrier.sh -i -H
Setting TERRIER_HOME to /users/tr.craigm/src/trhadoop/terrier
INFO  HadoopPlugin - Processing HOD for HOD-TerrierIndexing at hod request for 6 nodes
INFO  HadoopPlugin - INFO - Cluster Id 100.master
INFO  HadoopPlugin - INFO - HDFS UI at http://master:50070
INFO  HadoopPlugin - INFO - Mapred UI at http://node01:59794
INFO  HadoopPlugin - INFO - hadoop-site.xml at /tmp/hod679442803
INFO  HadoopUtility - Copying terrier share/ directory to shared storage area (hdfs://master:9000/tmp/1265627345-terrier.share)
WARN  JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
WARN  JobClient - No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
INFO  MultiFileCollectionInputFormat - Allocating 10 files across 2 map tasks
INFO  JobClient - Running job: job_200812161322_0001
INFO  JobClient -  map 0% reduce 0%
INFO  JobClient -  map 20% reduce 0%
INFO  JobClient -  map 40% reduce 0%
INFO  JobClient -  map 60% reduce 0%
INFO  JobClient -  map 70% reduce 0%
INFO  JobClient -  map 80% reduce 0%
INFO  JobClient -  map 90% reduce 0%
INFO  JobClient -  map 100% reduce 0%
INFO  JobClient -  map 100% reduce 16%
INFO  JobClient -  map 100% reduce 70%
INFO  JobClient -  map 100% reduce 98%
INFO  JobClient - Job complete: job_200812161322_0001
INFO  JobClient - Counters: 16
INFO  JobClient -   File Systems
INFO  JobClient -     HDFS bytes read=9540868
INFO  JobClient -     HDFS bytes written=3020756
INFO  JobClient -     Local bytes read=7937792
INFO  JobClient -     Local bytes written=15875666
INFO  JobClient -   Job Counters 
INFO  JobClient -     Launched reduce tasks=1
INFO  JobClient -     Rack-local map tasks=2
INFO  JobClient -     Launched map tasks=2
INFO  JobClient -   Map-Reduce Framework
INFO  JobClient -     Reduce input groups=46130
INFO  JobClient -     Combine output records=0
INFO  JobClient -     Map input records=2124
INFO  JobClient -     Reduce output records=0
INFO  JobClient -     Map output bytes=7724117
INFO  JobClient -     Map input bytes=-4167449
INFO  JobClient -     Combine input records=0
INFO  JobClient -     Map output records=75700
INFO  JobClient -     Reduce input records=75700
INFO  HadoopPlugin - Processing HOD disconnect

NB: You must wait for the Map Reduce job to finish, and not kill the local Terrier process, otherwise temporary files may be left on the shared file system.
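If you wish to check on the progress of the job from another terminal, the standard Hadoop job client can list running jobs; for example (assuming bin/hadoop is configured for the same cluster):

$ bin/hadoop job -list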

[Previous: Terrier/Hadoop Configuration] [Contents] [Next: Properties in Terrier]