org.terrier.indexing.hadoop
Class Hadoop_BlockSinglePassIndexer

java.lang.Object
  extended by org.terrier.indexing.Indexer
      extended by org.terrier.indexing.BasicIndexer
          extended by org.terrier.indexing.BasicSinglePassIndexer
              extended by org.terrier.indexing.hadoop.Hadoop_BasicSinglePassIndexer
                  extended by org.terrier.indexing.hadoop.Hadoop_BlockSinglePassIndexer
All Implemented Interfaces:
java.io.Closeable, org.apache.hadoop.mapred.JobConfigurable, org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>,SplitEmittedTerm,MapEmittedPostingList>, org.apache.hadoop.mapred.Reducer<SplitEmittedTerm,MapEmittedPostingList,java.lang.Object,java.lang.Object>

public class Hadoop_BlockSinglePassIndexer
extends Hadoop_BasicSinglePassIndexer

A MapReduce single-pass indexer that records term positions (blocks). All normal block properties are supported. For more information, see BlockIndexer.

Since:
2.2
Author:
Richard McCreadie, Craig Macdonald and Rodrygo Santos

Nested Class Summary
protected  class Hadoop_BlockSinglePassIndexer.BasicTermProcessor
          This class implements an end of a TermPipeline that adds the term to the DocumentTree.
protected  class Hadoop_BlockSinglePassIndexer.DelimFieldTermProcessor
          This class behaves in a similar fashion to FieldTermProcessor, except that it uses blocks bounded by delimiters instead of fixed-size blocks.
protected  class Hadoop_BlockSinglePassIndexer.DelimTermProcessor
          This class behaves in a similar fashion to BasicTermProcessor, except that it uses blocks bounded by delimiters instead of fixed-size blocks.
protected  class Hadoop_BlockSinglePassIndexer.FieldTermProcessor
          This class implements an end of a TermPipeline that adds the term to the DocumentTree.
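
The delimiter-bounded behaviour described for DelimTermProcessor and DelimFieldTermProcessor can be pictured with the minimal sketch below. It is not the Terrier source: the class DelimBlockSketch, its method and its parameters are hypothetical names. It also assumes that delimiter terms are matched exactly, are not indexed themselves, and that block numbers start at 0; none of this is confirmed by this page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of Terrier.
final class DelimBlockSketch {

    // Returns the block number assigned to each non-delimiter term, in order.
    // Assumption: a delimiter closes the current block and is itself skipped.
    static List<Integer> assignBlocks(List<String> terms, Set<String> delimiters, int maxBlocks) {
        List<Integer> blockIds = new ArrayList<>();
        int blockId = 0;
        for (String term : terms) {
            if (delimiters.contains(term)) {
                // Start a new block unless the cap on blocks has been reached.
                if (blockId < maxBlocks - 1) {
                    blockId++;
                }
            } else {
                blockIds.add(blockId);
            }
        }
        return blockIds;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("a", "b", "<s>", "c", "<s>", "d", "e");
        Set<String> delimiters = new HashSet<>(Arrays.asList("<s>"));
        System.out.println(assignBlocks(terms, delimiters, 10)); // prints [0, 0, 1, 2, 2]
    }
}
```

In this sketch block boundaries follow the document's own structure (for example sentence markers) rather than a fixed token count.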
 
Field Summary
protected  int BLOCK_SIZE
          The maximum number of terms allowed in a block.
protected  int blockId
          The block number in the current document.
protected  int MAX_BLOCKS
          The maximum number of blocks allowed in a document.
protected  int numOfTokensInBlock
          The number of tokens in the current block of the current document.
 
Fields inherited from class org.terrier.indexing.hadoop.Hadoop_BasicSinglePassIndexer
currentReporter, flushList, flushNo, jc, lastReporter, lexstream, MapIndexPrefixes, mapTaskID, mutipleIndices, outputPostingListCollector, reduceId, reduceStarted, RunData, runIteratorF, splitnum, start
 
Fields inherited from class org.terrier.indexing.BasicSinglePassIndexer
basicInvertedIndexPostingIteratorClass, currentFile, currentId, docsPerCheck, fieldInvertedIndexPostingIteratorClass, fileNames, invertedIndexClass, invertedIndexInputStreamClass, maxDocsPerFlush, maxMemory, memoryAfterFlush, memoryCheck, merger, mp, numberOfDocsSinceCheck, numberOfDocsSinceFlush, numberOfDocuments, numberOfPointers, numberOfTokens, numberOfUniqueTerms, runtime
 
Fields inherited from class org.terrier.indexing.BasicIndexer
numOfTokensInDocument, termFields, termsInDocument
 
Fields inherited from class org.terrier.indexing.Indexer
basicDirectIndexPostingIteratorClass, BUILDER_BOUNDARY_DOCUMENTS, currentIndex, directIndexBuilder, docIndexBuilder, emptyDocIndexEntry, fieldDirectIndexPostingIteratorClass, fieldNames, fileNameNoExtension, IndexEmptyDocuments, invertedIndexBuilder, lexiconBuilder, logger, MAX_DOCS_PER_BUILDER, MAX_TOKENS_IN_DOCUMENT, metaBuilder, numFields, path, pipeline_first, prefix, useFieldInformation
 
Constructor Summary
Hadoop_BlockSinglePassIndexer()
          Constructs an instance of this class, where the created data structures are stored in the given path.
 
Method Summary
protected  void createDocumentPostings()
          Hook method that creates the right type of DocumentTree class.
 void createMemoryPostings()
          Hook method that creates the right type of MemoryPostings class.
protected  RunsMerger createtheRunMerger()
          Creates the RunsMerger and the RunIteratorFactory.
protected  TermPipeline getEndOfPipeline()
          Returns the object that is to be the end of the TermPipeline.
protected  void load_indexer_properties()
           
 
Methods inherited from class org.terrier.indexing.hadoop.Hadoop_BasicSinglePassIndexer
close, closeMap, closeReduce, configure, configureMap, configureReduce, createMetaIndexBuilder, finish, forceFlush, indexEmpty, load_builder_boundary_documents, loadRunData, main, map, mergeDocumentIndex, reduce, startReduce
 
Methods inherited from class org.terrier.indexing.BasicSinglePassIndexer
checkFlush, createDirectIndex, createFieldRunMerger, createInvertedIndex, createInvertedIndex, createRunMerger, finishMemoryPosting, getFileNames, indexDocument, performMultiWayMerge
 
Methods inherited from class org.terrier.indexing.BasicIndexer
finishedInvertedIndexBuild
 
Methods inherited from class org.terrier.indexing.Indexer
finishedDirectIndexBuild, index, init, load_field_ids, load_pipeline, merge, merge, mergeTwoIndices, parseInts, useFieldInformation
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

numOfTokensInBlock

protected int numOfTokensInBlock
The number of tokens in the current block of the current document.


blockId

protected int blockId
The block number in the current document.


BLOCK_SIZE

protected int BLOCK_SIZE
The maximum number of terms allowed in a block.


MAX_BLOCKS

protected int MAX_BLOCKS
The maximum number of blocks allowed in a document. Once this limit is reached, all remaining terms are placed in the final block.
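
How the four fields above interact for fixed-size blocks can be sketched as follows. This is an illustrative sketch under stated assumptions, not the Terrier implementation: the class FixedBlockSketch and its method are hypothetical, the local variables merely mirror the field names, block numbers are assumed to start at 0, and the cap is assumed to limit a document to exactly maxBlocks distinct blocks.

```java
import java.util.Arrays;

// Hypothetical illustration only; not part of Terrier.
final class FixedBlockSketch {

    // Returns the block number assigned to each of numTerms consecutive terms.
    static int[] assignBlocks(int numTerms, int blockSize, int maxBlocks) {
        int[] blockIds = new int[numTerms];
        int blockId = 0;            // mirrors the blockId field
        int numOfTokensInBlock = 0; // mirrors the numOfTokensInBlock field
        for (int i = 0; i < numTerms; i++) {
            blockIds[i] = blockId;
            numOfTokensInBlock++;
            // Move to the next block once the current one is full,
            // unless the final permitted block has already been reached.
            if (numOfTokensInBlock >= blockSize && blockId < maxBlocks - 1) {
                numOfTokensInBlock = 0;
                blockId++;
            }
        }
        return blockIds;
    }

    public static void main(String[] args) {
        // BLOCK_SIZE = 2, MAX_BLOCKS = 3, eight terms in the document.
        System.out.println(Arrays.toString(assignBlocks(8, 2, 3)));
        // prints [0, 0, 1, 1, 2, 2, 2, 2]
    }
}
```

The last two terms land in block 2 even though that block is already "full", which is the overflow behaviour described for MAX_BLOCKS.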

Constructor Detail

Hadoop_BlockSinglePassIndexer

public Hadoop_BlockSinglePassIndexer()
Constructs an instance of this class, where the created data structures are stored in the given path.

Method Detail

getEndOfPipeline

protected TermPipeline getEndOfPipeline()
Returns the object that is to be the end of the TermPipeline. This method is used at construction time of the parent object.

Overrides:
getEndOfPipeline in class BasicIndexer
Returns:
TermPipeline the last component of the term pipeline.
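
One plausible reading of how the end of the pipeline relates to the four nested term processors is sketched below. The selection rule is an assumption inferred from the class names only (field information on or off, delimiter-bounded blocks on or off); this page does not state it. EndOfPipelineSketch, its enum and its boolean parameters are hypothetical and are not Terrier identifiers.

```java
// Hypothetical illustration only; not part of Terrier.
final class EndOfPipelineSketch {

    // Stand-ins for the four nested term processor classes.
    enum Processor { BASIC, FIELD, DELIM, DELIM_FIELD }

    // Assumed rule: two independent switches pick one of four processors.
    static Processor choose(boolean useFields, boolean useDelimiters) {
        if (useDelimiters) {
            return useFields ? Processor.DELIM_FIELD : Processor.DELIM;
        }
        return useFields ? Processor.FIELD : Processor.BASIC;
    }

    public static void main(String[] args) {
        System.out.println(choose(false, false)); // prints BASIC
        System.out.println(choose(true, true));   // prints DELIM_FIELD
    }
}
```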

createMemoryPostings

public void createMemoryPostings()
Hook method that creates the right type of MemoryPostings class.

Overrides:
createMemoryPostings in class BasicSinglePassIndexer

createDocumentPostings

protected void createDocumentPostings()
Description copied from class: BasicIndexer
Hook method that creates the right type of DocumentTree class.

Overrides:
createDocumentPostings in class BasicIndexer

createtheRunMerger

protected RunsMerger createtheRunMerger()
Description copied from class: Hadoop_BasicSinglePassIndexer
Creates the RunsMerger and the RunIteratorFactory.

Overrides:
createtheRunMerger in class Hadoop_BasicSinglePassIndexer

load_indexer_properties

protected void load_indexer_properties()
Overrides:
load_indexer_properties in class BasicSinglePassIndexer


Terrier 3.5. Copyright © 2004-2011 University of Glasgow