Hadoop_BasicSinglePassIndexer (Terrier 3.5 API)

Overview

Package

Class

Use

Tree

Deprecated

Index

Help

PREV CLASS NEXT CLASS

FRAMES NO FRAMES

SUMMARY: NESTED | FIELD | CONSTR | METHOD

DETAIL: FIELD | CONSTR | METHOD

org.terrier.indexing.hadoop
Class Hadoop_BasicSinglePassIndexer

java.lang.Object
  org.terrier.indexing.Indexer
      org.terrier.indexing.BasicIndexer
          org.terrier.indexing.BasicSinglePassIndexer
              org.terrier.indexing.hadoop.Hadoop_BasicSinglePassIndexer

All Implemented Interfaces:: java.io.Closeable, org.apache.hadoop.mapred.JobConfigurable, org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>,SplitEmittedTerm,MapEmittedPostingList>, org.apache.hadoop.mapred.Reducer<SplitEmittedTerm,MapEmittedPostingList,java.lang.Object,java.lang.Object>

Direct Known Subclasses:: Hadoop_BlockSinglePassIndexer

public class Hadoop_BasicSinglePassIndexer
extends BasicSinglePassIndexer
implements org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>,SplitEmittedTerm,MapEmittedPostingList>, org.apache.hadoop.mapred.Reducer<SplitEmittedTerm,MapEmittedPostingList,java.lang.Object,java.lang.Object>
extends BasicSinglePassIndexer
implements org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>,SplitEmittedTerm,MapEmittedPostingList>, org.apache.hadoop.mapred.Reducer<SplitEmittedTerm,MapEmittedPostingList,java.lang.Object,java.lang.Object>

Single Pass MapReduce indexer.

Map phase processing

Indexes as a Map task, taking in a series of documents, emitting posting lists for terms as memory becomes exhausted. Two side-files are created for each map task: the first (run files) takes note of how many documents were indexed for each flush and for each map; the second contains the statistics for each document in a minature document index

Reduce phase processing

All posting lists for each term are read in, one term at a time. Using the run files, the posting lists are output into the final inverted file, with all document ids corrected. Lastly, when all terms have been processed, the document indexes are merged into the final document index, and the lexicon hash and lexid created.

Partitioned Reduce processing

Normally, the MapReduce indexer is used with a single reducer. However, if the partitioner is used, multiple reduces can run concurrently, building several final indices. In doing so, a large collection can be indexed into several output indices, which may be useful for distributed retrieval.

Since:: 2.2
Author:: Richard McCreadie and Craig Macdonald

Nested Class Summary

Nested classes/interfaces inherited from class org.terrier.indexing.BasicIndexer
`BasicIndexer.BasicTermProcessor, BasicIndexer.FieldTermProcessor`

Field Summary
`protected org.apache.hadoop.mapred.Reporter`	`currentReporter`
`protected java.util.LinkedList<java.lang.Integer>`	`flushList` List of how many documents are in each flush we have made
`protected int`	`flushNo` How many flushes have we made
`protected org.apache.hadoop.mapred.JobConf`	`jc` JobConf of the current running job
`protected org.apache.hadoop.mapred.Reporter`	`lastReporter`
`protected LexiconOutputStream<java.lang.String>`	`lexstream` OutputStream for the Lexicon
`protected java.lang.String[]`	`MapIndexPrefixes`
`protected java.lang.String`	`mapTaskID` Current map number
`protected boolean`	`mutipleIndices`
`protected org.apache.hadoop.mapred.OutputCollector<SplitEmittedTerm,MapEmittedPostingList>`	`outputPostingListCollector` output collector for the current map indexing process
`protected int`	`reduceId`
`protected boolean`	`reduceStarted` records whether the reduce() has been called for the first time
`protected java.io.DataOutputStream`	`RunData` OutputStream for the the data on the runs (runNo, flushes etc)
`protected HadoopRunIteratorFactory`	`runIteratorF` runIterator factory being used to generate RunIterators
`protected int`	`splitnum` The split that these documents came form
`protected boolean`	`start`

Fields inherited from class org.terrier.indexing.BasicSinglePassIndexer
`basicInvertedIndexPostingIteratorClass, currentFile, currentId, docsPerCheck, fieldInvertedIndexPostingIteratorClass, fileNames, invertedIndexClass, invertedIndexInputStreamClass, maxDocsPerFlush, maxMemory, memoryAfterFlush, memoryCheck, merger, mp, numberOfDocsSinceCheck, numberOfDocsSinceFlush, numberOfDocuments, numberOfPointers, numberOfTokens, numberOfUniqueTerms, runtime`

Fields inherited from class org.terrier.indexing.BasicIndexer
`numOfTokensInDocument, termFields, termsInDocument`

Fields inherited from class org.terrier.indexing.Indexer
`basicDirectIndexPostingIteratorClass, BUILDER_BOUNDARY_DOCUMENTS, currentIndex, directIndexBuilder, docIndexBuilder, emptyDocIndexEntry, fieldDirectIndexPostingIteratorClass, fieldNames, fileNameNoExtension, IndexEmptyDocuments, invertedIndexBuilder, lexiconBuilder, logger, MAX_DOCS_PER_BUILDER, MAX_TOKENS_IN_DOCUMENT, metaBuilder, numFields, path, pipeline_first, prefix, useFieldInformation`

Constructor Summary
`Hadoop_BasicSinglePassIndexer()` Empty constructor.

Method Summary
`void`	`close()` Called when the Map or Reduce task ends, to finish up the indexer.
`protected void`	`closeMap()` Finish up the map processing.
`protected void`	`closeReduce()` finishes the reduce step, by closing the lexicon and inverted file output, building the lexicon hash and index, and merging the document indices created by the map tasks.
`void`	`configure(org.apache.hadoop.mapred.JobConf _jc)` Configure this indexer.
`protected void`	`configureMap()`
`protected void`	`configureReduce()`
`protected MetaIndexBuilder`	`createMetaIndexBuilder()`
`protected RunsMerger`	`createtheRunMerger()` Creates the RunsMerger and the RunIteratorFactory
`static void`	`finish(java.lang.String destinationIndexPath, int numberOfReduceTasks, HadoopPlugin.JobFactory jf)` finish
`protected void`	`forceFlush()`
`protected void`	`indexEmpty(java.util.Map<java.lang.String,java.lang.String> docProperties)` Write the empty document to the inverted index
`protected void`	`load_builder_boundary_documents()` Loads the builder boundary documents from the property `indexing.builder.boundary.docnos`, comma delimited.
`protected java.util.LinkedList<MapData>`	`loadRunData()`
`static void`	`main(java.lang.String[] args)` main
`void`	`map(org.apache.hadoop.io.Text key, SplitAwareWrapper<Document> value, org.apache.hadoop.mapred.OutputCollector<SplitEmittedTerm,MapEmittedPostingList> _outputPostingListCollector, org.apache.hadoop.mapred.Reporter reporter)` Map processes a single document.
`protected void`	`mergeDocumentIndex(Index[] src)` Merges the simple document indexes made for each map, instead creating the final document index
`void`	`reduce(SplitEmittedTerm Term, java.util.Iterator<MapEmittedPostingList> postingIterator, org.apache.hadoop.mapred.OutputCollector<java.lang.Object,java.lang.Object> output, org.apache.hadoop.mapred.Reporter reporter)` Main reduce algorithm step.
`void`	`startReduce(java.util.LinkedList<MapData> mapData)` Merge the postings for the current term, converts the document ID's in the postings to be relative to one another using the run number, number of documents covered in each run, the flush number for that run and the number of documents flushed.

Methods inherited from class org.terrier.indexing.BasicSinglePassIndexer
`checkFlush, createDirectIndex, createFieldRunMerger, createInvertedIndex, createInvertedIndex, createMemoryPostings, createRunMerger, finishMemoryPosting, getFileNames, indexDocument, load_indexer_properties, performMultiWayMerge`

Methods inherited from class org.terrier.indexing.BasicIndexer
`createDocumentPostings, finishedInvertedIndexBuild, getEndOfPipeline`

Methods inherited from class org.terrier.indexing.Indexer
`finishedDirectIndexBuild, index, init, load_field_ids, load_pipeline, merge, merge, mergeTwoIndices, parseInts, useFieldInformation`

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

jc

protected org.apache.hadoop.mapred.JobConf jc

JobConf of the current running job

splitnum

protected int splitnum

The split that these documents came form

start

protected boolean start

outputPostingListCollector

protected org.apache.hadoop.mapred.OutputCollector<SplitEmittedTerm,MapEmittedPostingList> outputPostingListCollector

output collector for the current map indexing process

mapTaskID

protected java.lang.String mapTaskID

Current map number

flushNo

protected int flushNo

How many flushes have we made

RunData

protected java.io.DataOutputStream RunData

OutputStream for the the data on the runs (runNo, flushes etc)

flushList

protected java.util.LinkedList<java.lang.Integer> flushList

List of how many documents are in each flush we have made

currentReporter

protected org.apache.hadoop.mapred.Reporter currentReporter

lexstream

protected LexiconOutputStream<java.lang.String> lexstream

OutputStream for the Lexicon

runIteratorF

protected HadoopRunIteratorFactory runIteratorF

runIterator factory being used to generate RunIterators

reduceStarted

protected boolean reduceStarted

records whether the reduce() has been called for the first time

mutipleIndices

protected boolean mutipleIndices

reduceId

protected int reduceId

MapIndexPrefixes

protected java.lang.String[] MapIndexPrefixes

lastReporter

protected org.apache.hadoop.mapred.Reporter lastReporter

Constructor Detail

Hadoop_BasicSinglePassIndexer

public Hadoop_BasicSinglePassIndexer()

Empty constructor.

Method Detail

main

public static void main(java.lang.String[] args)
                 throws java.lang.Exception

main

Parameters:: args -
Throws:: java.lang.Exception

finish

public static void finish(java.lang.String destinationIndexPath,
                          int numberOfReduceTasks,
                          HadoopPlugin.JobFactory jf)
                   throws java.lang.Exception

finish

Parameters:: destinationIndexPath -; numberOfReduceTasks -; jf -
Throws:: java.lang.Exception

configure

public void configure(org.apache.hadoop.mapred.JobConf _jc)

Configure this indexer. Firstly, loads ApplicationSetup appropriately. Actual configuration of indexer is then handled by configureMap() or configureReduce() depending on whether a Map or Reduce task is being configured.

Specified by:: configure in interface org.apache.hadoop.mapred.JobConfigurable

Parameters:: _jc - The configuration for the job

close

public void close()
           throws java.io.IOException

Called when the Map or Reduce task ends, to finish up the indexer. Actual cleanup is handled by closeMap() or closeReduce() depending on whether this is a Map or Reduce task.

Specified by:: close in interface java.io.Closeable

Throws:: java.io.IOException

load_builder_boundary_documents

protected void load_builder_boundary_documents()

Description copied from class: Indexer

Loads the builder boundary documents from the property indexing.builder.boundary.docnos, comma delimited.

Overrides:: load_builder_boundary_documents in class Indexer

configureMap

protected void configureMap()
                     throws java.lang.Exception

Throws:: java.lang.Exception

createMetaIndexBuilder

protected MetaIndexBuilder createMetaIndexBuilder()

Overrides:: createMetaIndexBuilder in class Indexer

forceFlush

protected void forceFlush()
                   throws java.io.IOException

Overrides:: forceFlush in class BasicSinglePassIndexer

Throws:: java.io.IOException

map

public void map(org.apache.hadoop.io.Text key,
                SplitAwareWrapper<Document> value,
                org.apache.hadoop.mapred.OutputCollector<SplitEmittedTerm,MapEmittedPostingList> _outputPostingListCollector,
                org.apache.hadoop.mapred.Reporter reporter)
         throws java.io.IOException

Map processes a single document. Stores the terms in the document along with the posting list until memory is full or all documents in this map have been processed then writes then to the output collector.

Specified by:: map in interface org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>,SplitEmittedTerm,MapEmittedPostingList>

Parameters:: key - - Wrapper for Document Number; value - - Wrapper for Document Object; _outputPostingListCollector - Collector for emitting terms and postings lists
Throws:: java.io.IOException

indexEmpty

protected void indexEmpty(java.util.Map<java.lang.String,java.lang.String> docProperties)
                   throws java.io.IOException

Write the empty document to the inverted index

Overrides:: indexEmpty in class Indexer

Throws:: java.io.IOException

closeMap

protected void closeMap()
                 throws java.io.IOException

Finish up the map processing. Forces a flush, then writes out the final run data

Throws:: java.io.IOException

configureReduce

protected void configureReduce()
                        throws java.lang.Exception

Throws:: java.lang.Exception

loadRunData

protected java.util.LinkedList<MapData> loadRunData()
                                             throws java.io.IOException

Throws:: java.io.IOException

startReduce

public void startReduce(java.util.LinkedList<MapData> mapData)
                 throws java.io.IOException

Merge the postings for the current term, converts the document ID's in the postings to be relative to one another using the run number, number of documents covered in each run, the flush number for that run and the number of documents flushed.

Parameters:: mapData - - info about the runs(maps) and the flushes
Throws:: java.io.IOException

reduce

public void reduce(SplitEmittedTerm Term,
                   java.util.Iterator<MapEmittedPostingList> postingIterator,
                   org.apache.hadoop.mapred.OutputCollector<java.lang.Object,java.lang.Object> output,
                   org.apache.hadoop.mapred.Reporter reporter)
            throws java.io.IOException

Main reduce algorithm step. Called for every term in the merged index, together with accessors to the posting list information that has been written. This reduce has no output.

Specified by:: reduce in interface org.apache.hadoop.mapred.Reducer<SplitEmittedTerm,MapEmittedPostingList,java.lang.Object,java.lang.Object>

Parameters:: Term - indexing term which we are reducing the posting lists into; postingIterator - Iterator over the temporary posting lists we have for this term; output - Unused output collector; reporter - Used to report progress
Throws:: java.io.IOException

mergeDocumentIndex

protected void mergeDocumentIndex(Index[] src)
                           throws java.io.IOException

Merges the simple document indexes made for each map, instead creating the final document index

Throws:: java.io.IOException

closeReduce

protected void closeReduce()
                    throws java.io.IOException

finishes the reduce step, by closing the lexicon and inverted file output, building the lexicon hash and index, and merging the document indices created by the map tasks. The output index finalised

Throws:: java.io.IOException

createtheRunMerger

protected RunsMerger createtheRunMerger()

Creates the RunsMerger and the RunIteratorFactory

Overview

Package

Class

Use

Tree

Deprecated

Index

Help

PREV CLASS NEXT CLASS

FRAMES NO FRAMES

SUMMARY: NESTED | FIELD | CONSTR | METHOD

DETAIL: FIELD | CONSTR | METHOD

org.terrier.indexing.hadoop Class Hadoop_BasicSinglePassIndexer

Map phase processing

Reduce phase processing

Partitioned Reduce processing

jc

splitnum

start

outputPostingListCollector

mapTaskID

flushNo

RunData

flushList

currentReporter

lexstream

runIteratorF

reduceStarted

mutipleIndices

reduceId

MapIndexPrefixes

lastReporter

Hadoop_BasicSinglePassIndexer

main

finish

configure

close

load_builder_boundary_documents

configureMap

createMetaIndexBuilder

forceFlush

map

indexEmpty

closeMap

configureReduce

loadRunData

startReduce

reduce

mergeDocumentIndex

closeReduce

createtheRunMerger

org.terrier.indexing.hadoop
Class Hadoop_BasicSinglePassIndexer