Due to the ever increasing size of test collections, particularly for those crawled from the Web, there has been a need to provide a distributed means to index such collections efficiently. As such, since version 2.2, Terrier has supported the indexing of large collections in a distributed manner using Hadoop's MapReduce implementation. This approach uses the single-pass indexer to index sections of each collection (as batches of files) as map tasks. The output from the Map tasks take three forms: (a) terms and mini posting lists (known as runs in the single-pass indexer); (b) document indices from each map task; (c) information about the number of documents saved per run. For more information on our indexing strategy, we recommend the following papers, Comparing Distributed Indexing: To MapReduce or Not?, On Single-Pass Indexing with MapReduce and MapReduce Indexing Strategies: Studying Scalability and Efficiency.
To index using the MapReduce indexer, you need to have Terrier setup to use your Hadoop cluster. More information can be found in the Configuring Terrier for Hadoop documentation. For indexing using MapReduce, your indexing Collection must have a InputStream constructor (TRECCollection, TRECWebCollection, WARC09Collection and WARC018Collection are all supported). Choose your collection using the property trec.collection.class as per normal.
Next, the location of your collection and your index are both important. You will get most benefit from Hadoop if your collection is stored on one of the supported distributed file systems (e.g. HDFS). Hadoop requires that the files of the collection that you are indexing are stored in the shared file system (the one named by fs.default.name in the Hadoop configuration). For example:
$ cat etc/collection.spec hdfs://master:9000/Collections/WT2G/WT01/B01.gz hdfs://master:9000/Collections/WT2G/WT01/B02.gz hdfs://master:9000/Collections/WT2G/WT01/B03.gz (etc)
You should also ensure that your index is in on the shared filespace (again, the one named by fs.default.name in the Hadoop configuration), by setting the terrier.index.path property accordingly. For example you can set, terrier.index.path=hdfs://master:9000/Indices/WT2G. You must remember to give the appropriate file system prefix, in this case hdfs://master:9000/.
Terrier's MapReduce indexing can operate in two modes, which define the style of index or indices that are produced. In particular, in term-partitioned mode, one single index is created, but the inverted index is split across up to 26 files. In document-partitioned model, multiple Terrier indices are created, one by each reducer. This can be useful if the corpus is too large to be useful in one index. To use document partitioning, run HadoopIndexing directly from the command line (instead of via TrecTerrier), and specify the -p argument. The advantage of term-partitioning is that Terrier can access the output as one single index, making retrieval simpler. However, note that the current upper limit on the number of files an inverted index can be split across is 31. To achieve this limit, we partition the inverted index with respect to each letter in the English alphabet -- this naturally limits the parallelism of the reduce phase to 26 reduce tasks. We do note however, that should a higher level of parallelism be needed, then this might be achieved through careful modification of the partitioner.
$ bin/trec_terrier.sh -i -H Setting TERRIER_HOME to /users/tr.craigm/src/trhadoop/terrier INFO HadoopPlugin - Processing HOD for HOD-TerrierIndexing at hod request for 6 nodes INFO HadoopPlugin - INFO - Cluster Id 100.master INFO HadoopPlugin - INFO - HDFS UI at http://master:50070 INFO HadoopPlugin - INFO - Mapred UI at http://node01:59794 INFO HadoopPlugin - INFO - hadoop-site.xml at /tmp/hod679442803 INFO HadoopUtility - Copying terrier share/ directory to shared storage area (hdfs://master:9000/tmp/1265627345-terrier.share) WARN JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. WARN JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String). INFO MultiFileCollectionInputFormat - Allocating 10 files across 2 map tasks INFO JobClient - Running job: job_200812161322_0001 INFO JobClient - map 0% reduce 0% INFO JobClient - map 20% reduce 0% INFO JobClient - map 40% reduce 0% INFO JobClient - map 60% reduce 0% INFO JobClient - map 70% reduce 0% INFO JobClient - map 80% reduce 0% INFO JobClient - map 90% reduce 0% INFO JobClient - map 100% reduce 0% INFO JobClient - map 100% reduce 16% INFO JobClient - map 100% reduce 70% INFO JobClient - map 100% reduce 98% INFO JobClient - Job complete: job_200812161322_0001 INFO JobClient - Counters: 16 INFO JobClient - File Systems INFO JobClient - HDFS bytes read=9540868 INFO JobClient - HDFS bytes written=3020756 INFO JobClient - Local bytes read=7937792 INFO JobClient - Local bytes written=15875666 INFO JobClient - Job Counters INFO JobClient - Launched reduce tasks=1 INFO JobClient - Rack-local map tasks=2 INFO JobClient - Launched map tasks=2 INFO JobClient - Map-Reduce Framework INFO JobClient - Reduce input groups=46130 INFO JobClient - Combine output records=0 INFO JobClient - Map input records=2124 INFO JobClient - Reduce output records=0 INFO JobClient - Map output bytes=7724117 INFO JobClient - Map input bytes=-4167449 INFO JobClient - Combine input records=0 INFO JobClient - Map output records=75700 INFO JobClient - Reduce input records=75700 INFO HadoopPlugin - Processing HOD disconnect
NB: Please note that you must wait for the MapReduce job to end, and not kill the local Terrier process, otherwise temporary files may be left on the shared file system.
Once you have indexed your corpus using MapReduce, you can also create a direct file in MapReduce. In particular, Inv2DirectMultiReduce uses a MapReduce job to re-invert the inverted index and create an inverted index. This class is more scalable for large indices than using Inverted2DirectIndexBuilder.
Copyright © 2011 University of Glasgow | All Rights Reserved