Terrier IR Platform
2.2.1

uk.ac.gla.terrier.indexing.hadoop
Class Hadoop_BasicSinglePassIndexer

java.lang.Object
  extended by uk.ac.gla.terrier.indexing.Indexer
      extended by uk.ac.gla.terrier.indexing.BasicIndexer
          extended by uk.ac.gla.terrier.indexing.BasicSinglePassIndexer
              extended by uk.ac.gla.terrier.indexing.hadoop.Hadoop_BasicSinglePassIndexer
All Implemented Interfaces:
java.io.Closeable, org.apache.hadoop.mapred.JobConfigurable, org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.Text,Wrapper<Document>,MapEmittedTerm,MapEmittedPostingList>, org.apache.hadoop.mapred.Reducer<MapEmittedTerm,MapEmittedPostingList,java.lang.Object,java.lang.Object>
Direct Known Subclasses:
Hadoop_BlockSinglePassIndexer

public class Hadoop_BasicSinglePassIndexer
extends BasicSinglePassIndexer
implements org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.Text,Wrapper<Document>,MapEmittedTerm,MapEmittedPostingList>, org.apache.hadoop.mapred.Reducer<MapEmittedTerm,MapEmittedPostingList,java.lang.Object,java.lang.Object>

Single Pass Map-Reduce indexer.

Map phase processing

Indexes as a Map task, taking in a series of documents and emitting posting lists for terms as memory becomes exhausted. Two side-files are created for each map task: the first (the run files) records how many documents were indexed for each flush and for each map; the second contains the statistics for each document in a miniature document index.
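
The flush-on-memory-pressure behaviour can be illustrated with a small, dependency-free sketch. It is not Terrier's implementation: the class MapFlushSketch, its fields, and the posting-count threshold (standing in for a real memory check) are all hypothetical, and it assumes document ids are numbered locally within each flush.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration of the map phase; not a Terrier class. */
class MapFlushSketch {
    /** Assumed stand-in for "memory exhausted": flush once this many postings are buffered. */
    private final int maxPostingsInMemory;
    /** term -> document ids local to the current flush, kept sorted by term. */
    private final Map<String, List<Integer>> buffer = new TreeMap<>();
    private int postingsInMemory = 0;
    private int docsSinceLastFlush = 0;

    /** Stand-in for the run side-file: number of documents covered by each flush. */
    final List<Integer> docsPerFlush = new ArrayList<>();
    /** Stand-in for the output collector: emitted "term=postings" pairs. */
    final List<String> emitted = new ArrayList<>();

    MapFlushSketch(int maxPostingsInMemory) {
        this.maxPostingsInMemory = maxPostingsInMemory;
    }

    /** Buffers the postings of one document, flushing if the threshold is reached. */
    void indexDocument(String[] terms) {
        int localDocId = docsSinceLastFlush;
        for (String term : terms) {
            List<Integer> postings = buffer.computeIfAbsent(term, k -> new ArrayList<>());
            // Record at most one posting per (term, document) pair.
            if (postings.isEmpty() || postings.get(postings.size() - 1) != localDocId) {
                postings.add(localDocId);
                postingsInMemory++;
            }
        }
        docsSinceLastFlush++;
        if (postingsInMemory >= maxPostingsInMemory) {
            flush();
        }
    }

    /** Emits every buffered posting list and notes how many documents the flush covered. */
    void flush() {
        if (docsSinceLastFlush == 0) {
            return;
        }
        for (Map.Entry<String, List<Integer>> e : buffer.entrySet()) {
            emitted.add(e.getKey() + "=" + e.getValue());
        }
        docsPerFlush.add(docsSinceLastFlush);
        buffer.clear();
        postingsInMemory = 0;
        docsSinceLastFlush = 0;
    }

    /** Called at the end of the map task to emit whatever is still buffered. */
    void close() {
        flush();
    }

    public static void main(String[] args) {
        MapFlushSketch map = new MapFlushSketch(3);
        map.indexDocument(new String[] {"a", "b"});
        map.indexDocument(new String[] {"b", "c"});
        map.indexDocument(new String[] {"a"});
        map.close();
        System.out.println(map.docsPerFlush);
        System.out.println(map.emitted);
    }
}
```

Because ids restart at each flush in this sketch, the per-flush document counts are what later allow them to be mapped back to collection-wide ids.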

Reduce phase processing

All posting lists for each term are read in, one term at a time. Using the run files, the posting lists are output into the final inverted file, with all document ids corrected. Lastly, when all terms have been processed, the document indexes are merged into the final document index, and the lexicon hash and lexid are created.
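
As a rough sketch of the per-term merge, the code below concatenates the temporary posting lists of one term while shifting each list's ids by an offset for the run it came from. ReduceMergeSketch and RunPostings are hypothetical names, and the sketch assumes that the offsets have already been derived from the run files and that the lists arrive ordered by offset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of merging one term's postings; not a Terrier class. */
class ReduceMergeSketch {

    /** One temporary posting list: ids local to a run, plus that run's assumed global offset. */
    static final class RunPostings {
        final int docIdOffset;
        final int[] localDocIds;

        RunPostings(int docIdOffset, int[] localDocIds) {
            this.docIdOffset = docIdOffset;
            this.localDocIds = localDocIds;
        }
    }

    /** Merges the runs of a single term into one list of corrected (global) document ids. */
    static List<Integer> mergeTerm(List<RunPostings> runsOrderedByOffset) {
        List<Integer> merged = new ArrayList<>();
        for (RunPostings run : runsOrderedByOffset) {
            for (int local : run.localDocIds) {
                merged.add(run.docIdOffset + local);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<RunPostings> runs = Arrays.asList(
                new RunPostings(0, new int[] {0, 2}),
                new RunPostings(3, new int[] {1}));
        System.out.println(mergeTerm(runs));
    }
}
```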

Partitioned Reduce processing

Normally, the MapReduce indexer is used with a single reducer. However, if the partitioner is used, multiple reduce tasks can run concurrently, each building its own final index. In this way, a large collection can be indexed into several output indices, which may be useful for distributed retrieval.
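
The text above does not state how the partitioner assigns work to reducers. One plausible scheme, shown here purely as an assumption, is to route postings by the number of the map task that produced them, so that each reducer receives a contiguous block of maps (and therefore of documents). The class and method names are hypothetical.

```java
/** Hypothetical partitioning scheme; the actual Terrier partitioner may differ. */
class MapRangePartitionSketch {

    /**
     * Assigns a map task to a reducer so that each reducer covers a contiguous range of maps.
     *
     * @param mapNumber   zero-based number of the map task that emitted the posting list
     * @param numMaps     total number of map tasks (must be positive)
     * @param numReducers total number of reduce tasks (must be positive)
     */
    static int partitionFor(int mapNumber, int numMaps, int numReducers) {
        if (numMaps <= 0 || numReducers <= 0 || mapNumber < 0 || mapNumber >= numMaps) {
            throw new IllegalArgumentException("invalid map or reducer count");
        }
        // Use long arithmetic so the intermediate product cannot overflow.
        return (int) ((long) mapNumber * numReducers / numMaps);
    }

    public static void main(String[] args) {
        for (int m = 0; m < 10; m++) {
            System.out.println("map " + m + " -> reducer " + partitionFor(m, 10, 3));
        }
    }
}
```

Under this assumed scheme, every reducer sees all terms for its own block of documents, which is what lets it build a complete index of its own.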

Since:
2.2
Version:
$Revision: 1.4 $
Author:
Richard McCreadie and Craig Macdonald

Constructor Summary
Hadoop_BasicSinglePassIndexer()
          Empty constructor.
 
Method Summary
 void close()
          Called when the Map or Reduce task ends, to finish up the indexer.
 void configure(org.apache.hadoop.mapred.JobConf jc)
          Configure this indexer.
 void map(org.apache.hadoop.io.Text key, Wrapper<Document> value, org.apache.hadoop.mapred.OutputCollector<MapEmittedTerm,MapEmittedPostingList> _outputPostingListCollector, org.apache.hadoop.mapred.Reporter reporter)
          Map processes a single document.
 void reduce(MapEmittedTerm Term, java.util.Iterator<MapEmittedPostingList> postingIterator, org.apache.hadoop.mapred.OutputCollector<java.lang.Object,java.lang.Object> output, org.apache.hadoop.mapred.Reporter reporter)
          Main reduce algorithm step.
 void startReduce(java.util.LinkedList<MapData> mapData)
          Merges the postings for the current term, converting the document IDs in the postings to be relative to one another using the run number, the number of documents covered in each run, the flush number for that run and the number of documents flushed.
 
Methods inherited from class uk.ac.gla.terrier.indexing.BasicSinglePassIndexer
createDirectIndex, createInvertedIndex, createInvertedIndex, performMultiWayMerge
 
Methods inherited from class uk.ac.gla.terrier.indexing.Indexer
index, isUTFIndexing, main, merge, merge, useFieldInformation
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Hadoop_BasicSinglePassIndexer

public Hadoop_BasicSinglePassIndexer()
Empty constructor.

Method Detail

configure

public void configure(org.apache.hadoop.mapred.JobConf jc)
Configure this indexer. Firstly, loads ApplicationSetup appropriately. Actual configuration of indexer is then handled by configureMap() or configureReduce() depending on whether a Map or Reduce task is being configured.

Specified by:
configure in interface org.apache.hadoop.mapred.JobConfigurable
Parameters:
jc - The configuration for the job

close

public void close()
           throws java.io.IOException
Called when the Map or Reduce task ends, to finish up the indexer. Actual cleanup is handled by closeMap() or closeReduce() depending on whether this is a Map or Reduce task.

Specified by:
close in interface java.io.Closeable
Throws:
java.io.IOException

map

public void map(org.apache.hadoop.io.Text key,
                Wrapper<Document> value,
                org.apache.hadoop.mapred.OutputCollector<MapEmittedTerm,MapEmittedPostingList> _outputPostingListCollector,
                org.apache.hadoop.mapred.Reporter reporter)
         throws java.io.IOException
Map processes a single document. Stores the terms in the document along with their posting lists until memory is full or all documents in this map have been processed, then writes them to the output collector.

Specified by:
map in interface org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.Text,Wrapper<Document>,MapEmittedTerm,MapEmittedPostingList>
Parameters:
key - Wrapper for Document Number
value - Wrapper for Document Object
_outputPostingListCollector - Collector for emitting terms and postings lists
Throws:
java.io.IOException

startReduce

public void startReduce(java.util.LinkedList<MapData> mapData)
Merges the postings for the current term, converting the document IDs in the postings to be relative to one another using the run number, the number of documents covered in each run, the flush number for that run and the number of documents flushed.

Parameters:
mapData - info about the runs (maps) and the flushes
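
The arithmetic behind the document ID correction can be sketched as follows, assuming that ids are local to each flush and that runs are laid out in map order, then flush order. The real MapData structure is not described here, so a plain int[][] of per-flush document counts is used as a hypothetical stand-in.

```java
import java.util.Arrays;

/** Hypothetical illustration of deriving document id offsets; not a Terrier class. */
class DocIdOffsetSketch {

    /**
     * Computes the first global document id of every flush.
     *
     * @param docsPerFlush docsPerFlush[m][f] is the number of documents in flush f of map m
     * @return offsets[m][f], the value to add to a local id from flush f of map m
     */
    static int[][] flushOffsets(int[][] docsPerFlush) {
        int[][] offsets = new int[docsPerFlush.length][];
        int documentsSoFar = 0;
        for (int m = 0; m < docsPerFlush.length; m++) {
            offsets[m] = new int[docsPerFlush[m].length];
            for (int f = 0; f < docsPerFlush[m].length; f++) {
                // Everything in earlier maps, and in earlier flushes of this map, comes first.
                offsets[m][f] = documentsSoFar;
                documentsSoFar += docsPerFlush[m][f];
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        int[][] offsets = flushOffsets(new int[][] {{2, 1}, {3}});
        System.out.println(Arrays.deepToString(offsets));
    }
}
```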

reduce

public void reduce(MapEmittedTerm Term,
                   java.util.Iterator<MapEmittedPostingList> postingIterator,
                   org.apache.hadoop.mapred.OutputCollector<java.lang.Object,java.lang.Object> output,
                   org.apache.hadoop.mapred.Reporter reporter)
            throws java.io.IOException
Main reduce algorithm step. Called for every term in the merged index, together with accessors to the posting list information that has been written. This reduce has no output.

Specified by:
reduce in interface org.apache.hadoop.mapred.Reducer<MapEmittedTerm,MapEmittedPostingList,java.lang.Object,java.lang.Object>
Parameters:
Term - indexing term which we are reducing the posting lists into
postingIterator - Iterator over the temporary posting lists we have for this term
output - Unused output collector
reporter - Used to report progress
Throws:
java.io.IOException

Terrier Information Retrieval Platform 2.2.1. Copyright 2004-2008 University of Glasgow