HadoopIndexing (Terrier 4.0 API)

java.lang.Object
- org.terrier.applications.HadoopIndexing

```
public class HadoopIndexing
extends Object
```
Main run class for the MapReduce indexing system. Provides facilities to perform indexing over multiple machines in a MapReduce cluster.
Input
The collection is assumed to be a list of files, as specified in the collection.spec. For more advanced collections, this class will need to be changed. The files listed in collection.spec are assumed to be on the Hadoop shared default filesystem, usually HDFS (otherwise Hadoop will throw an error).

Output
This class creates indices for the indexed collection in the directory specified by terrier.index.path. If this directory is not on the Hadoop shared default filesystem (e.g. HDFS), Hadoop will throw an error.
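
As a rough illustration of the filesystem requirement described above, the following sketch checks whether a path string would resolve to a given default filesystem. `DefaultFsCheck` and `onDefaultFilesystem` are hypothetical names invented for this example; they are not part of Terrier or Hadoop, and the default filesystem URI is simply passed in rather than read from a Hadoop configuration.

```java
import java.net.URI;
import java.util.Objects;

/** Hypothetical helper for illustration only; not part of Terrier or Hadoop. */
final class DefaultFsCheck {
    private DefaultFsCheck() {}

    /**
     * Returns true if the path has no scheme (Hadoop resolves such paths
     * against the default filesystem) or if its scheme and authority match
     * those of the supplied default filesystem URI.
     */
    static boolean onDefaultFilesystem(String path, URI defaultFs) {
        URI uri = URI.create(path);
        if (uri.getScheme() == null) {
            return true; // unqualified path, e.g. /user/terrier/index
        }
        return uri.getScheme().equalsIgnoreCase(defaultFs.getScheme())
                && Objects.equals(uri.getAuthority(), defaultFs.getAuthority());
    }

    public static void main(String[] args) {
        // Assumed example values; substitute the cluster's own settings.
        URI defaultFs = URI.create("hdfs://namenode:9000");
        System.out.println(onDefaultFilesystem("/user/terrier/index", defaultFs));
        System.out.println(onDefaultFilesystem("file:///tmp/index", defaultFs));
    }
}
```

A check of this kind could be run against terrier.index.path and the entries of collection.spec before submitting a job, so that a misplaced path is reported before Hadoop rejects it.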

Reducers
Two reduce modes are supported: term-partitioning creates a single index with multiple files making up the inverted structure; document-partitioning creates multiple indices, partitioned by docid. More reduce tasks result in higher indexing speed due to greater concurrency.
Term-partitioning is the default scenario. In this scenario, the maximum number of reducers allowed is 32. To select document-partitioning, specify the -p flag to main().
Properties:
- terrier.hadoop.indexing.reducers - number of reduce tasks, defaults to 26.
- block.indexing - if set, a block index will be created.
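
The reducer rules above (a default of 26 reduce tasks, and a limit of 32 in term-partitioning mode) can be sketched as a small validation routine. `ReducerSettings` is a hypothetical class written for this example and is not Terrier code. Rejecting an out-of-range value with an exception is an assumption, since the documentation does not say how Terrier itself reacts.

```java
import java.util.Properties;

/** Hypothetical helper for illustration only; not part of the Terrier API. */
final class ReducerSettings {
    static final String REDUCERS_KEY = "terrier.hadoop.indexing.reducers";
    static final int DEFAULT_REDUCERS = 26;     // documented default
    static final int MAX_TERM_PARTITIONED = 32; // documented limit for term-partitioning

    private ReducerSettings() {}

    /**
     * Resolves the number of reduce tasks from the given properties.
     * Assumption: invalid values are rejected with IllegalArgumentException.
     */
    static int resolve(Properties props, boolean documentPartitioned) {
        String raw = props.getProperty(REDUCERS_KEY, String.valueOf(DEFAULT_REDUCERS));
        int reducers = Integer.parseInt(raw.trim());
        if (reducers < 1) {
            throw new IllegalArgumentException("At least one reducer is required");
        }
        if (!documentPartitioned && reducers > MAX_TERM_PARTITIONED) {
            throw new IllegalArgumentException(
                    "Term-partitioning allows at most " + MAX_TERM_PARTITIONED + " reducers");
        }
        return reducers;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println(resolve(props, false)); // property unset, falls back to the default
        props.setProperty(REDUCERS_KEY, "48");
        System.out.println(resolve(props, true));  // document-partitioning (-p) mode
    }
}
```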
Since:

2.2

Author:

Richard McCreadie and Craig Macdonald

Field Summary

Fields
Modifier and Type Field and Description

protected static org.apache.log4j.Logger logger
logger for this class

Constructor Summary

Constructors
Constructor and Description

HadoopIndexing()

Method Summary

Methods
Modifier and Type	Method and Description
`static void`	`deleteTaskFiles(String path, org.apache.hadoop.mapred.JobID job)` Performs cleanup of an index path removing temporary files
`static void`	`main(String[] args)` Starts the MapReduce indexing.
`protected static void`	`mergeLexiconInvertedFiles(String index_path, int numberOfReducers)` For term-partitioned indexing, this method merges the lexicons from each reducer.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

logger

protected static final org.apache.log4j.Logger logger

logger for this class

Constructor Detail
- HadoopIndexing
```
public HadoopIndexing()
```

Method Detail

main

public static void main(String[] args)
                 throws Exception

Starts the MapReduce indexing.

Parameters:
args - command-line arguments; pass -p to select document-partitioning

Throws:

Exception

mergeLexiconInvertedFiles
```
protected static void mergeLexiconInvertedFiles(String index_path,
                             int numberOfReducers)
                                         throws IOException
```
For term-partitioned indexing, this method merges the lexicons from each reducer.

Parameters:
index_path - path of index
numberOfReducers - number of inverted files expected

Throws:

IOException
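
To give a feel for what merging per-reducer lexicons involves, here is a minimal sketch of a k-way merge over sorted term lists. It is not Terrier's implementation: real lexicon entries carry statistics and pointers into the inverted files, whereas this example models each lexicon as a plain sorted list of terms. `LexiconMergeSketch` and its members are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Illustrative k-way merge of sorted term lists; not Terrier's implementation. */
final class LexiconMergeSketch {
    /** Cursor over one reducer's sorted term list. */
    private static final class Cursor {
        final List<String> terms;
        int pos;

        Cursor(List<String> terms) {
            this.terms = terms;
        }

        String current() {
            return terms.get(pos);
        }
    }

    private LexiconMergeSketch() {}

    /** Merges the per-reducer sorted term lists into one sorted list. */
    static List<String> merge(List<List<String>> perReducer) {
        // The heap always exposes the cursor whose current term is smallest.
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>((a, b) -> a.current().compareTo(b.current()));
        for (List<String> terms : perReducer) {
            if (!terms.isEmpty()) {
                heap.add(new Cursor(terms));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            merged.add(c.current());
            c.pos++;
            if (c.pos < c.terms.size()) {
                heap.add(c); // re-insert with its next term as the sort key
            }
        }
        return merged;
    }
}
```

If each reducer handles a disjoint, ordered range of terms, the same result could be obtained by simple concatenation; the heap-based merge is shown because it makes no assumption about how terms were assigned to reducers.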

deleteTaskFiles

public static void deleteTaskFiles(String path,
                   org.apache.hadoop.mapred.JobID job)

Performs cleanup of an index path, removing temporary files.


Terrier 4.0. Copyright © 2004-2014 University of Glasgow