Hadoop_BasicSinglePassIndexer (Terrier 4.0 API)

java.lang.Object
- org.terrier.structures.indexing.Indexer
- - org.terrier.structures.indexing.classical.BasicIndexer
  - - org.terrier.structures.indexing.singlepass.BasicSinglePassIndexer
    - - org.terrier.structures.indexing.singlepass.hadoop.Hadoop_BasicSinglePassIndexer

All Implemented Interfaces:

Closeable, AutoCloseable, org.apache.hadoop.mapred.JobConfigurable, org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>,SplitEmittedTerm,MapEmittedPostingList>, org.apache.hadoop.mapred.Reducer<SplitEmittedTerm,MapEmittedPostingList,Object,Object>

Direct Known Subclasses:

Hadoop_BlockSinglePassIndexer
```
public class Hadoop_BasicSinglePassIndexer
extends BasicSinglePassIndexer
implements org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>,SplitEmittedTerm,MapEmittedPostingList>, org.apache.hadoop.mapred.Reducer<SplitEmittedTerm,MapEmittedPostingList,Object,Object>
```
Single Pass MapReduce indexer.
Map phase processing
Indexes as a Map task, taking in a series of documents, emitting posting lists for terms as memory becomes exhausted. Two side-files are created for each map task: the first (run files) takes note of how many documents were indexed for each flush and for each map; the second contains the statistics for each document in a minature document index

Reduce phase processing
All posting lists for each term are read in, one term at a time. Using the run files, the posting lists are output into the final inverted file, with all document ids corrected. Lastly, when all terms have been processed, the document indexes are merged into the final document index, and the lexicon hash and lexid created.

Partitioned Reduce processing
Normally, the MapReduce indexer is used with a single reducer. However, if the partitioner is used, multiple reduces can run concurrently, building several final indices. In doing so, a large collection can be indexed into several output indices, which may be useful for distributed retrieval.

Since:

2.2

Author:

Richard McCreadie and Craig Macdonald

Nested Class Summary
- Nested classes/interfaces inherited from class org.terrier.structures.indexing.classical.BasicIndexer
  BasicIndexer.BasicTermProcessor, BasicIndexer.FieldTermProcessor

Field Summary

Fields
Modifier and Type	Field and Description
`protected org.apache.hadoop.mapred.Reporter`	`currentReporter`
`protected LinkedList<Integer>`	`flushList` List of how many documents are in each flush we have made
`protected int`	`flushNo` How many flushes have we made
`protected org.apache.hadoop.mapred.JobConf`	`jc` JobConf of the current running job
`protected org.apache.hadoop.mapred.Reporter`	`lastReporter`
`protected LexiconOutputStream<String>`	`lexstream` OutputStream for the Lexicon
`protected String[]`	`MapIndexPrefixes`
`protected String`	`mapTaskID` Current map number
`protected boolean`	`mutipleIndices`
`protected org.apache.hadoop.mapred.OutputCollector<SplitEmittedTerm,MapEmittedPostingList>`	`outputPostingListCollector` output collector for the current map indexing process
`protected int`	`reduceId`
`protected boolean`	`reduceStarted` records whether the reduce() has been called for the first time
`protected DataOutputStream`	`RunData` OutputStream for the the data on the runs (runNo, flushes etc)
`protected HadoopRunIteratorFactory`	`runIteratorF` runIterator factory being used to generate RunIterators
`protected int`	`splitnum` The split that these documents came form
`protected boolean`	`start`

Fields inherited from class org.terrier.structures.indexing.singlepass.BasicSinglePassIndexer
basicInvertedIndexPostingIteratorClass, currentFile, currentId, docsPerCheck, fieldInvertedIndexPostingIteratorClass, fileNames, invertedIndexClass, invertedIndexInputStreamClass, maxDocsPerFlush, maxMemory, memoryAfterFlush, memoryCheck, merger, mp, numberOfDocsSinceCheck, numberOfDocsSinceFlush, numberOfDocuments, numberOfPointers, numberOfTokens, numberOfUniqueTerms, runtime

Fields inherited from class org.terrier.structures.indexing.classical.BasicIndexer
compressionDirectConfig, compressionInvertedConfig, numOfTokensInDocument, termFields, termsInDocument

Fields inherited from class org.terrier.structures.indexing.Indexer
BUILDER_BOUNDARY_DOCUMENTS, currentIndex, directIndexBuilder, docIndexBuilder, emptyDocIndexEntry, fieldNames, fileNameNoExtension, IndexEmptyDocuments, invertedIndexBuilder, lexiconBuilder, logger, MAX_DOCS_PER_BUILDER, MAX_TOKENS_IN_DOCUMENT, metaBuilder, numFields, path, pipeline_first, prefix, useFieldInformation

Constructor Summary

Constructors
Constructor and Description

Hadoop_BasicSinglePassIndexer()
Empty constructor

Constructors
Constructor and Description
`Hadoop_BasicSinglePassIndexer()` Empty constructor

Method Summary

Methods
Modifier and Type	Method and Description
`void`	`close()` Called when the Map or Reduce task ends, to finish up the indexer.
`protected void`	`closeMap()` Finish up the map processing.
`protected void`	`closeReduce()` finishes the reduce step, by closing the lexicon and inverted file output, building the lexicon hash and index, and merging the document indices created by the map tasks.
`void`	`configure(org.apache.hadoop.mapred.JobConf _jc)` Configure this indexer.
`protected void`	`configureMap()`
`protected void`	`configureReduce()`
`protected MetaIndexBuilder`	`createMetaIndexBuilder()`
`protected RunsMerger`	`createtheRunMerger()` Creates the RunsMerger and the RunIteratorFactory
`static void`	`finish(String destinationIndexPath, int numberOfReduceTasks, HadoopPlugin.JobFactory jf)` finish
`protected void`	`forceFlush()`
`protected void`	`indexEmpty(Map<String,String> docProperties)` Write the empty document to the inverted index
`protected void`	`load_builder_boundary_documents()` Loads the builder boundary documents from the property `indexing.builder.boundary.docnos`, comma delimited.
`protected LinkedList<MapData>`	`loadRunData()`
`static void`	`main(String[] args)` main
`void`	`map(org.apache.hadoop.io.Text key, SplitAwareWrapper<Document> value, org.apache.hadoop.mapred.OutputCollector<SplitEmittedTerm,MapEmittedPostingList> _outputPostingListCollector, org.apache.hadoop.mapred.Reporter reporter)` Map processes a single document.
`protected void`	`mergeDocumentIndex(Index[] src, int numdocs)` Merges the simple document indexes made for each map, instead creating the final document index
`void`	`reduce(SplitEmittedTerm Term, Iterator<MapEmittedPostingList> postingIterator, org.apache.hadoop.mapred.OutputCollector<Object,Object> output, org.apache.hadoop.mapred.Reporter reporter)` Main reduce algorithm step.
`void`	`startReduce(LinkedList<MapData> mapData)` Merge the postings for the current term, converts the document ID's in the postings to be relative to one another using the run number, number of documents covered in each run, the flush number for that run and the number of documents flushed.

Methods inherited from class org.terrier.structures.indexing.singlepass.BasicSinglePassIndexer
checkFlush, createDirectIndex, createFieldRunMerger, createInvertedIndex, createInvertedIndex, createMemoryPostings, createRunMerger, finishMemoryPosting, getFileNames, indexDocument, load_indexer_properties, performMultiWayMerge

Methods inherited from class org.terrier.structures.indexing.classical.BasicIndexer
createDocumentPostings, finishedInvertedIndexBuild, getEndOfPipeline

Methods inherited from class org.terrier.structures.indexing.Indexer
finishedDirectIndexBuild, index, init, load_field_ids, load_pipeline, merge, merge, mergeTwoIndices, parseInts, useFieldInformation

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - jc
```
protected org.apache.hadoop.mapred.JobConf jc
```
    JobConf of the current running job
  - splitnum
```
protected int splitnum
```
    The split that these documents came form
  - start
```
protected boolean start
```
  - outputPostingListCollector
```
protected org.apache.hadoop.mapred.OutputCollector<SplitEmittedTerm,MapEmittedPostingList> outputPostingListCollector
```
    output collector for the current map indexing process
  - mapTaskID
```
protected String mapTaskID
```
    Current map number
  - flushNo
```
protected int flushNo
```
    How many flushes have we made
  - RunData
```
protected DataOutputStream RunData
```
    OutputStream for the the data on the runs (runNo, flushes etc)
  - flushList
```
protected LinkedList<Integer> flushList
```
    List of how many documents are in each flush we have made
  - currentReporter
```
protected org.apache.hadoop.mapred.Reporter currentReporter
```
  - lexstream
```
protected LexiconOutputStream<String> lexstream
```
    OutputStream for the Lexicon
  - runIteratorF
```
protected HadoopRunIteratorFactory runIteratorF
```
    runIterator factory being used to generate RunIterators
  - reduceStarted
```
protected boolean reduceStarted
```
    records whether the reduce() has been called for the first time
  - mutipleIndices
```
protected boolean mutipleIndices
```
  - reduceId
```
protected int reduceId
```
  - MapIndexPrefixes
```
protected String[] MapIndexPrefixes
```
  - lastReporter
```
protected org.apache.hadoop.mapred.Reporter lastReporter
```
- Constructor Detail
  - Hadoop_BasicSinglePassIndexer
```
public Hadoop_BasicSinglePassIndexer()
```
    Empty constructor
- Method Detail
  - main
```
public static void main(String[] args)
                 throws Exception
```
    main
    
    Parameters:
    args -
    
    Throws:
    
    Exception
  - finish
```
public static void finish(String destinationIndexPath,
          int numberOfReduceTasks,
          HadoopPlugin.JobFactory jf)
                   throws Exception
```
    finish
    
    Parameters:
    destinationIndexPath -
    numberOfReduceTasks -
    jf -
    
    Throws:
    
    Exception
  - configure
```
public void configure(org.apache.hadoop.mapred.JobConf _jc)
```
    Configure this indexer. Firstly, loads ApplicationSetup appropriately. Actual configuration of indexer is then handled by configureMap() or configureReduce() depending on whether a Map or Reduce task is being configured.
    
    Specified by:
    
    configure in interface org.apache.hadoop.mapred.JobConfigurable
    
    Parameters:
    _jc - The configuration for the job
  - close
```
public void close()
           throws IOException
```
    Called when the Map or Reduce task ends, to finish up the indexer. Actual cleanup is handled by closeMap() or closeReduce() depending on whether this is a Map or Reduce task.
    
    Specified by:
    
    close in interface Closeable
    
    Specified by:
    
    close in interface AutoCloseable
    
    Throws:
    
    IOException
  - load_builder_boundary_documents
```
protected void load_builder_boundary_documents()
```
    Description copied from class: Indexer
    
    Loads the builder boundary documents from the property indexing.builder.boundary.docnos, comma delimited.
    
    Overrides:
    
    load_builder_boundary_documents in class Indexer
  - configureMap
```
protected void configureMap()
                     throws Exception
```
    Throws:
    
    Exception
  - createMetaIndexBuilder
```
protected MetaIndexBuilder createMetaIndexBuilder()
```
    Overrides:
    
    createMetaIndexBuilder in class Indexer
  - forceFlush
```
protected void forceFlush()
                   throws IOException
```
    Overrides:
    
    forceFlush in class BasicSinglePassIndexer
    
    Throws:
    
    IOException
  - map
```
public void map(org.apache.hadoop.io.Text key,
       SplitAwareWrapper<Document> value,
       org.apache.hadoop.mapred.OutputCollector<SplitEmittedTerm,MapEmittedPostingList> _outputPostingListCollector,
       org.apache.hadoop.mapred.Reporter reporter)
         throws IOException
```
    Map processes a single document. Stores the terms in the document along with the posting list until memory is full or all documents in this map have been processed then writes then to the output collector.
    
    Specified by:
    
    map in interface org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>,SplitEmittedTerm,MapEmittedPostingList>
    
    Parameters:
    key - - Wrapper for Document Number
    value - - Wrapper for Document Object
    _outputPostingListCollector - Collector for emitting terms and postings lists
    
    Throws:
    
    IOException
  - indexEmpty
```
protected void indexEmpty(Map<String,String> docProperties)
                   throws IOException
```
    Write the empty document to the inverted index
    
    Overrides:
    
    indexEmpty in class Indexer
    
    Throws:
    
    IOException
  - closeMap
```
protected void closeMap()
                 throws IOException
```
    Finish up the map processing. Forces a flush, then writes out the final run data
    
    Throws:
    
    IOException
  - configureReduce
```
protected void configureReduce()
                        throws Exception
```
    Throws:
    
    Exception
  - loadRunData
```
protected LinkedList<MapData> loadRunData()
                                   throws IOException
```
    Throws:
    
    IOException
  - startReduce
```
public void startReduce(LinkedList<MapData> mapData)
                 throws IOException
```
    Merge the postings for the current term, converts the document ID's in the postings to be relative to one another using the run number, number of documents covered in each run, the flush number for that run and the number of documents flushed.
    
    Parameters:
    mapData - - info about the runs(maps) and the flushes
    
    Throws:
    
    IOException
  - reduce
```
public void reduce(SplitEmittedTerm Term,
          Iterator<MapEmittedPostingList> postingIterator,
          org.apache.hadoop.mapred.OutputCollector<Object,Object> output,
          org.apache.hadoop.mapred.Reporter reporter)
            throws IOException
```
    Main reduce algorithm step. Called for every term in the merged index, together with accessors to the posting list information that has been written. This reduce has no output.
    
    Specified by:
    
    reduce in interface org.apache.hadoop.mapred.Reducer<SplitEmittedTerm,MapEmittedPostingList,Object,Object>
    
    Parameters:
    Term - indexing term which we are reducing the posting lists into
    postingIterator - Iterator over the temporary posting lists we have for this term
    output - Unused output collector
    reporter - Used to report progress
    
    Throws:
    
    IOException
  - mergeDocumentIndex
```
protected void mergeDocumentIndex(Index[] src,
                      int numdocs)
                           throws IOException
```
    Merges the simple document indexes made for each map, instead creating the final document index
    
    Throws:
    
    IOException
  - closeReduce
```
protected void closeReduce()
                    throws IOException
```
    finishes the reduce step, by closing the lexicon and inverted file output, building the lexicon hash and index, and merging the document indices created by the map tasks. The output index finalised
    
    Throws:
    
    IOException
  - createtheRunMerger
```
protected RunsMerger createtheRunMerger()
```
    Creates the RunsMerger and the RunIteratorFactory

Class Hadoop_BasicSinglePassIndexer

Map phase processing

Reduce phase processing

Partitioned Reduce processing

Nested Class Summary

Nested classes/interfaces inherited from class org.terrier.structures.indexing.classical.BasicIndexer

Field Summary

Fields inherited from class org.terrier.structures.indexing.singlepass.BasicSinglePassIndexer

Fields inherited from class org.terrier.structures.indexing.classical.BasicIndexer

Fields inherited from class org.terrier.structures.indexing.Indexer

Constructor Summary

Method Summary

Methods inherited from class org.terrier.structures.indexing.singlepass.BasicSinglePassIndexer

Methods inherited from class org.terrier.structures.indexing.classical.BasicIndexer

Methods inherited from class org.terrier.structures.indexing.Indexer

Methods inherited from class java.lang.Object

Field Detail

jc

splitnum

start

outputPostingListCollector

mapTaskID

flushNo

RunData

flushList

currentReporter

lexstream

runIteratorF

reduceStarted

mutipleIndices

reduceId

MapIndexPrefixes

lastReporter

Constructor Detail

Hadoop_BasicSinglePassIndexer

Method Detail

main

finish

configure

close

load_builder_boundary_documents

configureMap

createMetaIndexBuilder

forceFlush

map

indexEmpty

closeMap

configureReduce

loadRunData

startReduce

reduce

mergeDocumentIndex

closeReduce

createtheRunMerger