org.terrier.indexing.hadoop
Class Hadoop_BasicSinglePassIndexer

java.lang.Object
  extended by org.terrier.indexing.Indexer
      extended by org.terrier.indexing.BasicIndexer
          extended by org.terrier.indexing.BasicSinglePassIndexer
              extended by org.terrier.indexing.hadoop.Hadoop_BasicSinglePassIndexer
All Implemented Interfaces:
java.io.Closeable, org.apache.hadoop.mapred.JobConfigurable, org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>,SplitEmittedTerm,MapEmittedPostingList>, org.apache.hadoop.mapred.Reducer<SplitEmittedTerm,MapEmittedPostingList,java.lang.Object,java.lang.Object>
Direct Known Subclasses:
Hadoop_BlockSinglePassIndexer

public class Hadoop_BasicSinglePassIndexer
extends BasicSinglePassIndexer
implements org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>,SplitEmittedTerm,MapEmittedPostingList>, org.apache.hadoop.mapred.Reducer<SplitEmittedTerm,MapEmittedPostingList,java.lang.Object,java.lang.Object>

Single Pass MapReduce indexer.

Map phase processing

Indexes as a Map task, taking in a series of documents and emitting posting lists for terms as memory becomes exhausted. Two side-files are created for each map task: the first (the run files) records how many documents were indexed in each flush of each map; the second contains the statistics for each document in a miniature document index.
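The map-side flushing strategy described above can be sketched in plain Java. All names here (`MAX_POSTINGS`, `FlushingMapSketch`, the bookkeeping fields) are illustrative stand-ins, not Terrier's actual API; the real indexer checks memory consumption rather than a posting count.

```java
import java.util.*;

// Minimal sketch of the map-side flushing strategy: postings accumulate
// in memory and are emitted as a "run" when a threshold is reached, while
// a side list records how many documents went into each flush.
// Illustrative only, not Terrier's actual implementation.
class FlushingMapSketch {
    static final int MAX_POSTINGS = 4;                 // stand-in for a memory check
    final Map<String, List<Integer>> postings = new HashMap<>();
    final List<Integer> flushList = new ArrayList<>(); // docs per flush, as in flushList
    int docsSinceFlush = 0, totalPostings = 0;

    void indexDocument(int docid, String[] terms) {
        for (String t : terms) {
            postings.computeIfAbsent(t, k -> new ArrayList<>()).add(docid);
            totalPostings++;
        }
        docsSinceFlush++;
        if (totalPostings >= MAX_POSTINGS) flush();    // "memory exhausted"
    }

    void flush() {                  // emit the current run, record its size
        if (docsSinceFlush == 0) return;
        flushList.add(docsSinceFlush);
        postings.clear();           // in the real indexer: emitted to the collector
        totalPostings = 0;
        docsSinceFlush = 0;
    }
}
```

The per-flush document counts are exactly what the reduce phase later needs to correct document ids across runs.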

Reduce phase processing

All posting lists for each term are read in, one term at a time. Using the run files, the posting lists are written to the final inverted file, with all document ids corrected. Lastly, when all terms have been processed, the document indexes are merged into the final document index, and the lexicon hash and lexid files are created.
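The document-id correction performed during the reduce can be illustrated with a small, self-contained sketch. The run-data layout below is an assumption for illustration, not Terrier's on-disk format: each map task numbered its documents from 0, so a map-local docid must be shifted by the number of documents indexed by all earlier maps.

```java
// Sketch of reduce-side docid correction: postings from map i are offset
// by the total number of documents indexed by maps 0..i-1, so that docids
// in the final inverted file are globally consistent. Illustrative only.
class DocidCorrectionSketch {
    // docsPerMap[i] = number of documents indexed by map task i
    // (in Terrier this information comes from the run side-files)
    static int[] offsets(int[] docsPerMap) {
        int[] off = new int[docsPerMap.length];
        for (int i = 1; i < docsPerMap.length; i++)
            off[i] = off[i - 1] + docsPerMap[i - 1];
        return off;
    }

    // Convert a map-local docid into a global docid for the final index
    static int globalDocid(int mapNo, int localDocid, int[] offsets) {
        return offsets[mapNo] + localDocid;
    }
}
```

The real indexer additionally accounts for multiple flushes within each map, using the per-flush counts recorded in the run files.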

Partitioned Reduce processing

Normally, the MapReduce indexer is used with a single reducer. However, if the partitioner is used, multiple reduce tasks can run concurrently, each building a separate final index. In this way, a large collection can be indexed into several output indices, which may be useful for distributed retrieval.
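The partitioning idea can be sketched as follows. This mirrors the shape of Hadoop's `Partitioner.getPartition(key, value, numPartitions)` contract; the class and hashing scheme are illustrative, not Terrier's actual `SplitEmittedTerm` partitioner.

```java
// Sketch of term partitioning across multiple reduce tasks: each term is
// routed to exactly one reducer, so each reducer builds a self-contained
// index over its share of the term space. Illustrative only.
class TermPartitionerSketch {
    static int getPartition(String term, int numReduceTasks) {
        // mask the sign bit so the result is always in [0, numReduceTasks)
        return (term.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

Because the assignment is deterministic, all posting lists for a given term arrive at the same reducer, and each of the resulting indices is queryable on its own.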

Since:
2.2
Author:
Richard McCreadie and Craig Macdonald

Nested Class Summary
 
Nested classes/interfaces inherited from class org.terrier.indexing.BasicIndexer
BasicIndexer.BasicTermProcessor, BasicIndexer.FieldTermProcessor
 
Field Summary
protected  org.apache.hadoop.mapred.Reporter currentReporter
           
protected  java.util.LinkedList<java.lang.Integer> flushList
          List of the number of documents in each flush we have made
protected  int flushNo
          How many flushes have we made
protected  org.apache.hadoop.mapred.JobConf jc
          JobConf of the current running job
protected  org.apache.hadoop.mapred.Reporter lastReporter
           
protected  LexiconOutputStream<java.lang.String> lexstream
          OutputStream for the Lexicon
protected  java.lang.String[] MapIndexPrefixes
           
protected  java.lang.String mapTaskID
          Current map number
protected  boolean mutipleIndices
           
protected  org.apache.hadoop.mapred.OutputCollector<SplitEmittedTerm,MapEmittedPostingList> outputPostingListCollector
          output collector for the current map indexing process
protected  int reduceId
           
protected  boolean reduceStarted
          records whether the reduce() has been called for the first time
protected  java.io.DataOutputStream RunData
          OutputStream for the data on the runs (run number, flushes, etc.)
protected  HadoopRunIteratorFactory runIteratorF
          runIterator factory being used to generate RunIterators
protected  int splitnum
          The split that these documents came from
protected  boolean start
           
 
Fields inherited from class org.terrier.indexing.BasicSinglePassIndexer
basicInvertedIndexPostingIteratorClass, currentFile, currentId, docsPerCheck, fieldInvertedIndexPostingIteratorClass, fileNames, invertedIndexClass, invertedIndexInputStreamClass, maxDocsPerFlush, maxMemory, memoryAfterFlush, memoryCheck, merger, mp, numberOfDocsSinceCheck, numberOfDocsSinceFlush, numberOfDocuments, numberOfPointers, numberOfTokens, numberOfUniqueTerms, runtime
 
Fields inherited from class org.terrier.indexing.BasicIndexer
numOfTokensInDocument, termFields, termsInDocument
 
Fields inherited from class org.terrier.indexing.Indexer
basicDirectIndexPostingIteratorClass, BUILDER_BOUNDARY_DOCUMENTS, currentIndex, directIndexBuilder, docIndexBuilder, emptyDocIndexEntry, fieldDirectIndexPostingIteratorClass, fieldNames, fileNameNoExtension, IndexEmptyDocuments, invertedIndexBuilder, lexiconBuilder, logger, MAX_DOCS_PER_BUILDER, MAX_TOKENS_IN_DOCUMENT, metaBuilder, numFields, path, pipeline_first, prefix, useFieldInformation
 
Constructor Summary
Hadoop_BasicSinglePassIndexer()
          Empty constructor.
 
Method Summary
 void close()
          Called when the Map or Reduce task ends, to finish up the indexer.
protected  void closeMap()
          Finish up the map processing.
protected  void closeReduce()
          Finishes the reduce step by closing the lexicon and inverted file outputs, building the lexicon hash and index, and merging the document indices created by the map tasks.
 void configure(org.apache.hadoop.mapred.JobConf _jc)
          Configure this indexer.
protected  void configureMap()
           
protected  void configureReduce()
           
protected  MetaIndexBuilder createMetaIndexBuilder()
           
protected  RunsMerger createtheRunMerger()
          Creates the RunsMerger and the RunIteratorFactory
static void finish(java.lang.String destinationIndexPath, int numberOfReduceTasks, HadoopPlugin.JobFactory jf)
          finish
protected  void forceFlush()
           
protected  void indexEmpty(java.util.Map<java.lang.String,java.lang.String> docProperties)
          Write the empty document to the inverted index
protected  void load_builder_boundary_documents()
          Loads the builder boundary documents from the property indexing.builder.boundary.docnos, comma delimited.
protected  java.util.LinkedList<MapData> loadRunData()
           
static void main(java.lang.String[] args)
          main
 void map(org.apache.hadoop.io.Text key, SplitAwareWrapper<Document> value, org.apache.hadoop.mapred.OutputCollector<SplitEmittedTerm,MapEmittedPostingList> _outputPostingListCollector, org.apache.hadoop.mapred.Reporter reporter)
          Map processes a single document.
protected  void mergeDocumentIndex(Index[] src)
          Merges the simple document indexes made for each map, creating the final document index
 void reduce(SplitEmittedTerm Term, java.util.Iterator<MapEmittedPostingList> postingIterator, org.apache.hadoop.mapred.OutputCollector<java.lang.Object,java.lang.Object> output, org.apache.hadoop.mapred.Reporter reporter)
          Main reduce algorithm step.
 void startReduce(java.util.LinkedList<MapData> mapData)
          Merges the postings for the current term, converting the document IDs in the postings to be relative to one another using the run number, the number of documents covered in each run, the flush number for that run, and the number of documents flushed.
 
Methods inherited from class org.terrier.indexing.BasicSinglePassIndexer
checkFlush, createDirectIndex, createFieldRunMerger, createInvertedIndex, createInvertedIndex, createMemoryPostings, createRunMerger, finishMemoryPosting, getFileNames, indexDocument, load_indexer_properties, performMultiWayMerge
 
Methods inherited from class org.terrier.indexing.BasicIndexer
createDocumentPostings, finishedInvertedIndexBuild, getEndOfPipeline
 
Methods inherited from class org.terrier.indexing.Indexer
finishedDirectIndexBuild, index, init, load_field_ids, load_pipeline, merge, merge, mergeTwoIndices, parseInts, useFieldInformation
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

jc

protected org.apache.hadoop.mapred.JobConf jc
JobConf of the current running job


splitnum

protected int splitnum
The split that these documents came from


start

protected boolean start

outputPostingListCollector

protected org.apache.hadoop.mapred.OutputCollector<SplitEmittedTerm,MapEmittedPostingList> outputPostingListCollector
output collector for the current map indexing process


mapTaskID

protected java.lang.String mapTaskID
Current map number


flushNo

protected int flushNo
How many flushes have we made


RunData

protected java.io.DataOutputStream RunData
OutputStream for the data on the runs (run number, flushes, etc.)


flushList

protected java.util.LinkedList<java.lang.Integer> flushList
List of how many documents are in each flush we have made


currentReporter

protected org.apache.hadoop.mapred.Reporter currentReporter

lexstream

protected LexiconOutputStream<java.lang.String> lexstream
OutputStream for the Lexicon


runIteratorF

protected HadoopRunIteratorFactory runIteratorF
runIterator factory being used to generate RunIterators


reduceStarted

protected boolean reduceStarted
records whether the reduce() has been called for the first time


mutipleIndices

protected boolean mutipleIndices

reduceId

protected int reduceId

MapIndexPrefixes

protected java.lang.String[] MapIndexPrefixes

lastReporter

protected org.apache.hadoop.mapred.Reporter lastReporter
Constructor Detail

Hadoop_BasicSinglePassIndexer

public Hadoop_BasicSinglePassIndexer()
Empty constructor.

Method Detail

main

public static void main(java.lang.String[] args)
                 throws java.lang.Exception
main

Parameters:
args -
Throws:
java.lang.Exception

finish

public static void finish(java.lang.String destinationIndexPath,
                          int numberOfReduceTasks,
                          HadoopPlugin.JobFactory jf)
                   throws java.lang.Exception
finish

Parameters:
destinationIndexPath -
numberOfReduceTasks -
jf -
Throws:
java.lang.Exception

configure

public void configure(org.apache.hadoop.mapred.JobConf _jc)
Configure this indexer. Firstly, loads ApplicationSetup appropriately. Actual configuration of indexer is then handled by configureMap() or configureReduce() depending on whether a Map or Reduce task is being configured.

Specified by:
configure in interface org.apache.hadoop.mapred.JobConfigurable
Parameters:
_jc - The configuration for the job

close

public void close()
           throws java.io.IOException
Called when the Map or Reduce task ends, to finish up the indexer. Actual cleanup is handled by closeMap() or closeReduce() depending on whether this is a Map or Reduce task.

Specified by:
close in interface java.io.Closeable
Throws:
java.io.IOException

load_builder_boundary_documents

protected void load_builder_boundary_documents()
Description copied from class: Indexer
Loads the builder boundary documents from the property indexing.builder.boundary.docnos, comma delimited.

Overrides:
load_builder_boundary_documents in class Indexer

configureMap

protected void configureMap()
                     throws java.lang.Exception
Throws:
java.lang.Exception

createMetaIndexBuilder

protected MetaIndexBuilder createMetaIndexBuilder()
Overrides:
createMetaIndexBuilder in class Indexer

forceFlush

protected void forceFlush()
                   throws java.io.IOException
Overrides:
forceFlush in class BasicSinglePassIndexer
Throws:
java.io.IOException

map

public void map(org.apache.hadoop.io.Text key,
                SplitAwareWrapper<Document> value,
                org.apache.hadoop.mapred.OutputCollector<SplitEmittedTerm,MapEmittedPostingList> _outputPostingListCollector,
                org.apache.hadoop.mapred.Reporter reporter)
         throws java.io.IOException
Map processes a single document. Stores the terms in the document along with the posting list until memory is full or all documents in this map have been processed, then writes them to the output collector.

Specified by:
map in interface org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>,SplitEmittedTerm,MapEmittedPostingList>
Parameters:
key - - Wrapper for Document Number
value - - Wrapper for Document Object
_outputPostingListCollector - Collector for emitting terms and postings lists
Throws:
java.io.IOException

indexEmpty

protected void indexEmpty(java.util.Map<java.lang.String,java.lang.String> docProperties)
                   throws java.io.IOException
Write the empty document to the inverted index

Overrides:
indexEmpty in class Indexer
Throws:
java.io.IOException

closeMap

protected void closeMap()
                 throws java.io.IOException
Finish up the map processing. Forces a flush, then writes out the final run data

Throws:
java.io.IOException

configureReduce

protected void configureReduce()
                        throws java.lang.Exception
Throws:
java.lang.Exception

loadRunData

protected java.util.LinkedList<MapData> loadRunData()
                                             throws java.io.IOException
Throws:
java.io.IOException

startReduce

public void startReduce(java.util.LinkedList<MapData> mapData)
                 throws java.io.IOException
Merges the postings for the current term, converting the document IDs in the postings to be relative to one another using the run number, the number of documents covered in each run, the flush number for that run, and the number of documents flushed.

Parameters:
mapData - - info about the runs(maps) and the flushes
Throws:
java.io.IOException

reduce

public void reduce(SplitEmittedTerm Term,
                   java.util.Iterator<MapEmittedPostingList> postingIterator,
                   org.apache.hadoop.mapred.OutputCollector<java.lang.Object,java.lang.Object> output,
                   org.apache.hadoop.mapred.Reporter reporter)
            throws java.io.IOException
Main reduce algorithm step. Called for every term in the merged index, together with accessors to the posting list information that has been written. This reduce has no output.

Specified by:
reduce in interface org.apache.hadoop.mapred.Reducer<SplitEmittedTerm,MapEmittedPostingList,java.lang.Object,java.lang.Object>
Parameters:
Term - indexing term which we are reducing the posting lists into
postingIterator - Iterator over the temporary posting lists we have for this term
output - Unused output collector
reporter - Used to report progress
Throws:
java.io.IOException

mergeDocumentIndex

protected void mergeDocumentIndex(Index[] src)
                           throws java.io.IOException
Merges the simple document indexes made for each map, creating the final document index

Throws:
java.io.IOException

closeReduce

protected void closeReduce()
                    throws java.io.IOException
Finishes the reduce step by closing the lexicon and inverted file outputs, building the lexicon hash and index, and merging the document indices created by the map tasks. The output index is then finalised.

Throws:
java.io.IOException

createtheRunMerger

protected RunsMerger createtheRunMerger()
Creates the RunsMerger and the RunIteratorFactory



Terrier 3.5. Copyright © 2004-2011 University of Glasgow