org.terrier.structures.indexing.singlepass.hadoop
Class HadoopRunsMerger

java.lang.Object
  extended by org.terrier.structures.indexing.singlepass.RunsMerger
      extended by org.terrier.structures.indexing.singlepass.hadoop.HadoopRunsMerger

public class HadoopRunsMerger
extends RunsMerger

The main merger class for Hadoop runs. It merges the lexicons and inverted index shards produced by the map task indexers.

Since:
2.2
Author:
Richard McCreadie and Craig Macdonald

Nested Class Summary
 
Nested classes/interfaces inherited from class org.terrier.structures.indexing.singlepass.RunsMerger
RunsMerger.PostingComparator
 
Field Summary
protected  java.util.LinkedList<MapData> mapData
          The data loaded from side-effect files about each map task
protected  int numReducers
          Number of Reducers Used
 
Fields inherited from class org.terrier.structures.indexing.singlepass.RunsMerger
bos, currentTerm, lastDocFreq, lastDocument, lastFreq, lastTermWritten, myRun, numberOfPointers, queue, runsSource, startOffset, termStatistics
 
Constructor Summary
HadoopRunsMerger(RunIteratorFactory _runsSource)
          Constructs an instance of HadoopRunsMerger.
 
Method Summary
 void beginMerge(java.util.LinkedList<MapData> _mapData)
          Alternate Merge operation for merging a linked list of runs of the form Hadoop_MapData.
 void endMerge(LexiconOutputStream<java.lang.String> lexStream)
          Ends the merging phase, writes the last entry and closes the streams.
 int getDocumentOffset(int splitNo, int flushNumber)
          Get the offset for the document based on a split and flush.
 int getNumReducers()
          Gets the number of Reducers to Merge for: 1 for single Reducer, >1 for multi-Reducers
 void mergeOne(LexiconOutputStream<java.lang.String> lexStream)
          Merges one term in the runs.
 void setNumReducers(int _numReducers)
          Sets the number of Reducers to Merge for: 1 for single Reducer, >1 for multi-Reducers
 
Methods inherited from class org.terrier.structures.indexing.singlepass.RunsMerger
beginMerge, getBitOffset, getBos, getByteOffset, getLastDocFreq, getLastFreq, getLastTermWritten, getNumberOfPointers, getNumberOfTerms, init, init, isDone, setBos, setLastTermWritten
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

mapData

protected java.util.LinkedList<MapData> mapData
The data loaded from side-effect files about each map task


numReducers

protected int numReducers
Number of Reducers Used

Constructor Detail

HadoopRunsMerger

public HadoopRunsMerger(RunIteratorFactory _runsSource)
Constructs an instance of HadoopRunsMerger.

Parameters:
_runsSource - factory that supplies the run iterators to be merged.
Method Detail

beginMerge

public void beginMerge(java.util.LinkedList<MapData> _mapData)
Alternate merge operation for merging a linked list of runs of the form Hadoop_MapData. This routine merges the multiple runs created during the map phase of Hadoop indexing. In doing so, it corrects for the document id 'shift' caused by the random splitting of runs due to flushing and map splitting.

Parameters:
_mapData - information about the number of documents per map and run. One element for every map.
Throws:
java.io.IOException
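
To illustrate the document id 'shift' described above, the following is a minimal standalone sketch, not Terrier code. It assumes that each run numbers its documents from zero, so the ids of a later run must be raised by the number of documents that precede it. The class DocIdShiftSketch and the method rebase are hypothetical names introduced only for this illustration.

```java
import java.util.Arrays;

class DocIdShiftSketch {
    /** Rebases run-local document ids by adding the count of documents that precede the run. */
    static int[] rebase(int[] localDocIds, int offset) {
        int[] globalDocIds = new int[localDocIds.length];
        for (int i = 0; i < localDocIds.length; i++) {
            globalDocIds[i] = localDocIds[i] + offset;
        }
        return globalDocIds;
    }

    public static void main(String[] args) {
        // Two hypothetical runs holding postings for the same term.
        // Assumption: each run numbers its own documents from 0.
        int[] run0 = {0, 2}; // run 0 indexed 3 documents
        int[] run1 = {0, 1}; // run 1 indexed 2 documents
        int docsBeforeRun1 = 3;

        System.out.println(Arrays.toString(rebase(run0, 0)));              // [0, 2]
        System.out.println(Arrays.toString(rebase(run1, docsBeforeRun1))); // [3, 4]
    }
}
```

Without the correction, both runs would claim document 0 and the merged posting list would contain colliding ids.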

endMerge

public void endMerge(LexiconOutputStream<java.lang.String> lexStream)
Ends the merging phase, writes the last entry and closes the streams.

Overrides:
endMerge in class RunsMerger
Parameters:
lexStream - LexiconOutputStream used to write the lexicon.

mergeOne

public void mergeOne(LexiconOutputStream<java.lang.String> lexStream)
              throws java.lang.Exception
Merges one term in the runs. If a run is exhausted, it is closed and removed from the queue.

Overrides:
mergeOne in class RunsMerger
Parameters:
lexStream - LexiconOutputStream used to write the lexicon.
Throws:
java.lang.Exception - if an I/O error occurs.
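
The behaviour described for mergeOne resembles one step of a queue-based k-way merge. The sketch below is a standalone approximation under stated assumptions, not the Terrier implementation: the Run class, newQueue, and the use of a summed document frequency in place of real posting lists are all hypothetical simplifications.

```java
import java.util.AbstractMap;
import java.util.Comparator;
import java.util.Map;
import java.util.PriorityQueue;

class MergeOneSketch {
    /** Hypothetical stand-in for a run: terms in sorted order, each with a document frequency. */
    static final class Run {
        final String[] terms;
        final int[] docFreqs;
        int pos = 0;

        Run(String[] terms, int[] docFreqs) {
            this.terms = terms;
            this.docFreqs = docFreqs;
        }

        String term() { return terms[pos]; }
        int docFreq() { return docFreqs[pos]; }

        /** Moves to the next term; returns false once the run is exhausted. */
        boolean advance() { return ++pos < terms.length; }
    }

    /** Builds a queue ordered by each run's current term, skipping empty runs. */
    static PriorityQueue<Run> newQueue(Run... runs) {
        Comparator<Run> byTerm = Comparator.comparing(Run::term);
        PriorityQueue<Run> queue = new PriorityQueue<>(byTerm);
        for (Run r : runs) {
            if (r.terms.length > 0) queue.add(r);
        }
        return queue;
    }

    /** Merges the smallest pending term across all runs; the queue must be non-empty. */
    static Map.Entry<String, Integer> mergeOne(PriorityQueue<Run> queue) {
        String term = queue.peek().term();
        int docFreq = 0;
        while (!queue.isEmpty() && queue.peek().term().equals(term)) {
            Run run = queue.poll();
            docFreq += run.docFreq();
            // An exhausted run is simply not re-inserted, which removes it from the queue.
            if (run.advance()) queue.add(run);
        }
        return new AbstractMap.SimpleEntry<>(term, docFreq);
    }

    public static void main(String[] args) {
        PriorityQueue<Run> queue = newQueue(
            new Run(new String[]{"apple", "cat"}, new int[]{2, 1}),
            new Run(new String[]{"apple", "bear"}, new int[]{1, 4}),
            new Run(new String[]{"cat"}, new int[]{3}));
        while (!queue.isEmpty()) {
            System.out.println(mergeOne(queue)); // apple=3, then bear=4, then cat=4
        }
    }
}
```

A run is only mutated while it is outside the queue, so the ordering of the queue stays consistent between calls.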

getNumReducers

public int getNumReducers()
Gets the number of Reducers to Merge for: 1 for single Reducer, >1 for multi-Reducers

Returns:
how many reducers are in use.

setNumReducers

public void setNumReducers(int _numReducers)
Sets the number of Reducers to Merge for: 1 for single Reducer, >1 for multi-Reducers


getDocumentOffset

public int getDocumentOffset(int splitNo,
                             int flushNumber)
                      throws java.io.IOException
Get the offset for the document based on a split and flush.

Throws:
java.io.IOException
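
One plausible way to derive such an offset is sketched below. This is a standalone illustration, not the Terrier implementation. It assumes that splits are ordered by split number and that the offset is the number of documents in all earlier splits plus the documents in earlier flushes of the same split. The class DocumentOffsetSketch and the nested-list layout of per-flush document counts are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class DocumentOffsetSketch {
    /**
     * Assumed layout: flushDocCounts.get(s).get(f) is the number of documents
     * that split s indexed during its flush f.
     */
    static int documentOffset(List<List<Integer>> flushDocCounts, int splitNo, int flushNumber) {
        int offset = 0;
        // Count every document from the splits that come before this one.
        for (int s = 0; s < splitNo; s++) {
            for (int docs : flushDocCounts.get(s)) {
                offset += docs;
            }
        }
        // Add the documents from earlier flushes within the same split.
        List<Integer> flushes = flushDocCounts.get(splitNo);
        for (int f = 0; f < flushNumber; f++) {
            offset += flushes.get(f);
        }
        return offset;
    }

    public static void main(String[] args) {
        List<List<Integer>> counts = Arrays.asList(
            Arrays.asList(10, 5), // split 0: two flushes
            Arrays.asList(7),     // split 1: one flush
            Arrays.asList(3, 4)); // split 2: two flushes

        System.out.println(documentOffset(counts, 0, 0)); // 0
        System.out.println(documentOffset(counts, 0, 1)); // 10
        System.out.println(documentOffset(counts, 1, 0)); // 15
        System.out.println(documentOffset(counts, 2, 1)); // 25
    }
}
```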


Terrier 3.5. Copyright © 2004-2011 University of Glasgow