org.terrier.indexing
Class BasicSinglePassIndexer

java.lang.Object
  extended by org.terrier.indexing.Indexer
      extended by org.terrier.indexing.BasicIndexer
          extended by org.terrier.indexing.BasicSinglePassIndexer
Direct Known Subclasses:
BlockSinglePassIndexer, ExtensibleSinglePassIndexer, Hadoop_BasicSinglePassIndexer

public class BasicSinglePassIndexer
extends BasicIndexer

This class indexes a document collection without constructing the direct file. It implements a single-pass algorithm that operates in two phases:
First, it traverses the document collection, passes the terms through the TermPipeline, and builds an in-memory representation of the posting lists. When main memory is exhausted, it flushes the sorted postings to disk along with the lexicon (collectively known as a run), and continues traversing the collection.
Second, it merges the sorted runs (with their partial lexicons) on disk to create the final inverted file. This class follows the template pattern, so the bulk of the code is reused for block (and field) indexing. A few hook methods choose the right classes to instantiate, depending on the indexing options defined.
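
The two phases described above can be illustrated with a minimal, self-contained sketch. It is not Terrier's implementation: the class SinglePassSketch and all of its members are hypothetical, runs are kept as in-memory sorted maps rather than files on disk, and a simple posting count stands in for real memory tracking.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, simplified model of single-pass indexing; not Terrier code.
class SinglePassSketch {
    // In-memory postings: term -> document ids, kept sorted by term.
    private TreeMap<String, List<Integer>> memory = new TreeMap<>();
    // Each flushed run is modelled as a sorted map instead of a file on disk.
    private final List<TreeMap<String, List<Integer>>> runs = new ArrayList<>();
    // Assumed stand-in for a memory limit: number of postings held in memory.
    private final int maxPostingsInMemory;
    private int postingsInMemory = 0;

    SinglePassSketch(int maxPostingsInMemory) {
        this.maxPostingsInMemory = maxPostingsInMemory;
    }

    // Phase one: add a document's terms, flushing a run once "memory" is full.
    // Documents are assumed to arrive in increasing docId order.
    void indexDocument(int docId, String... terms) {
        for (String term : terms) {
            List<Integer> docs = memory.computeIfAbsent(term, t -> new ArrayList<>());
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
                postingsInMemory++;
            }
        }
        if (postingsInMemory >= maxPostingsInMemory) {
            flush();
        }
    }

    // Writes the current in-memory postings out as one sorted run.
    void flush() {
        if (memory.isEmpty()) {
            return;
        }
        runs.add(memory);
        memory = new TreeMap<>();
        postingsInMemory = 0;
    }

    // Phase two: merge all runs, in the order written, into one inverted index.
    TreeMap<String, List<Integer>> merge() {
        flush();
        TreeMap<String, List<Integer>> inverted = new TreeMap<>();
        for (TreeMap<String, List<Integer>> run : runs) {
            for (Map.Entry<String, List<Integer>> e : run.entrySet()) {
                inverted.computeIfAbsent(e.getKey(), t -> new ArrayList<>())
                        .addAll(e.getValue());
            }
        }
        return inverted;
    }

    int runCount() {
        return runs.size();
    }
}
```

Because runs are written in document order, appending each run's postings for a term in run order keeps the final posting list sorted by document id.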

Memory tracking is a key concern in this class. Four properties are provided: they control how the amount of memory consumed is checked, how regularly the check is made, and (optionally) the maximum amount of memory that the postings may use, or the maximum number of documents indexed, before a flush is committed.
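
As a rough illustration of how such limits might combine, the sketch below decides when a run should be flushed. The field names echo fields of this class (docsPerCheck, maxDocsPerFlush, maxMemory), but the class FlushPolicySketch, its methods, the order of the checks, and the convention that 0 disables a limit are all assumptions made for the example, not documented behaviour.

```java
// Hypothetical flush policy; the logic here is an assumption, not Terrier code.
class FlushPolicySketch {
    final int docsPerCheck;     // how many documents between memory checks
    final int maxDocsPerFlush;  // assumed: 0 disables the document limit
    final long maxMemory;       // assumed: 0 disables the postings limit (bytes)
    int numberOfDocsSinceCheck = 0;
    int numberOfDocsSinceFlush = 0;

    FlushPolicySketch(int docsPerCheck, int maxDocsPerFlush, long maxMemory) {
        this.docsPerCheck = docsPerCheck;
        this.maxDocsPerFlush = maxDocsPerFlush;
        this.maxMemory = maxMemory;
    }

    // Call once per indexed document; returns true when a run should be flushed.
    boolean shouldFlush(long postingsBytes, boolean jvmLowOnMemory) {
        numberOfDocsSinceCheck++;
        numberOfDocsSinceFlush++;
        boolean flush = false;
        if (maxDocsPerFlush > 0 && numberOfDocsSinceFlush >= maxDocsPerFlush) {
            flush = true;                 // optional document-count limit reached
        } else if (maxMemory > 0 && postingsBytes > maxMemory) {
            flush = true;                 // optional postings-memory limit exceeded
        } else if (numberOfDocsSinceCheck >= docsPerCheck) {
            numberOfDocsSinceCheck = 0;   // periodic check of overall JVM memory
            flush = jvmLowOnMemory;
        }
        if (flush) {
            numberOfDocsSinceFlush = 0;
            numberOfDocsSinceCheck = 0;
        }
        return flush;
    }

    // One possible low-memory test based on java.lang.Runtime: true when the
    // used heap exceeds the given fraction of the maximum heap.
    static boolean lowOnMemory(Runtime rt, double heapUsageThreshold) {
        long used = rt.totalMemory() - rt.freeMemory();
        return used > heapUsageThreshold * rt.maxMemory();
    }
}
```

Checking memory only every docsPerCheck documents keeps the cost of the check small relative to the indexing work itself.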

Properties:

Author:
Roi Blanco

Nested Class Summary
 
Nested classes/interfaces inherited from class org.terrier.indexing.BasicIndexer
BasicIndexer.BasicTermProcessor, BasicIndexer.FieldTermProcessor
 
Field Summary
protected  java.lang.String basicInvertedIndexPostingIteratorClass
           
protected  int currentFile
          Number of the current run to be written to disk
protected  int currentId
          Current document Id
protected  int docsPerCheck
          Number of documents read per memory check
protected  java.lang.String fieldInvertedIndexPostingIteratorClass
           
protected  java.util.Queue<java.lang.String[]> fileNames
          Queue with the file names of the runs on disk
protected  java.lang.String invertedIndexClass
          Name of the class used to read the generated inverted index
protected  java.lang.String invertedIndexInputStreamClass
          Name of the class used to read the inverted index as a stream
protected  int maxDocsPerFlush
           
protected  long maxMemory
           
protected  long memoryAfterFlush
          Memory status after flush
protected  MemoryChecker memoryCheck
          Memory Checker - provides the method for checking to see if the system is running low on memory
protected  RunsMerger merger
          Structure for merging the runs
protected  MemoryPostings mp
          Structure that keeps the posting lists in memory
protected  int numberOfDocsSinceCheck
          Number of documents read since the memory consumption was last checked
protected  int numberOfDocsSinceFlush
          Number of documents read since the memory runs were last flushed to disk
protected  int numberOfDocuments
          Number of documents indexed
protected  long numberOfPointers
          Number of pointers indexed
protected  long numberOfTokens
          Number of tokens indexed
protected  int numberOfUniqueTerms
          Number of unique terms indexed
protected static java.lang.Runtime runtime
          Runtime system JVM running this instance of Terrier
 
Fields inherited from class org.terrier.indexing.BasicIndexer
numOfTokensInDocument, termFields, termsInDocument
 
Fields inherited from class org.terrier.indexing.Indexer
basicDirectIndexPostingIteratorClass, BUILDER_BOUNDARY_DOCUMENTS, currentIndex, directIndexBuilder, docIndexBuilder, emptyDocIndexEntry, fieldDirectIndexPostingIteratorClass, fieldNames, fileNameNoExtension, IndexEmptyDocuments, invertedIndexBuilder, lexiconBuilder, logger, MAX_DOCS_PER_BUILDER, MAX_TOKENS_IN_DOCUMENT, metaBuilder, numFields, path, pipeline_first, prefix, useFieldInformation
 
Constructor Summary
protected BasicSinglePassIndexer(long a, long b, long c)
          Protected do-nothing constructor for use by child classes
  BasicSinglePassIndexer(java.lang.String pathname, java.lang.String prefix)
          Constructs an instance of a BasicSinglePassIndexer, using the given path name for storing the data structures.
 
Method Summary
protected  void checkFlush()
          Checks whether a flush is required, and performs one if necessary
 void createDirectIndex(Collection[] collections)
          Creates the direct index, the document index and the lexicon.
protected  void createFieldRunMerger(java.lang.String[][] files)
          Hook method that creates a FieldRunMerger instance
 void createInvertedIndex()
          Creates the inverted index after having created the direct index, document index and lexicon.
 void createInvertedIndex(Collection[] collections)
          Builds the inverted file and lexicon file for the given collections. Loops through each document in each of the collections, extracting terms and pushing these through the Term Pipeline (e.g. stemming, stopping, lowercasing).
protected  void createMemoryPostings()
          Hook method that creates the right type of MemoryPostings class.
protected  void createRunMerger(java.lang.String[][] files)
          Hook method that creates a RunsMerger instance
protected  java.lang.String[] finishMemoryPosting()
          Adds the names of the current run and its partial lexicon, which are about to be flushed to disk.
protected  void forceFlush()
           
protected  java.lang.String[][] getFileNames()
           
protected  void indexDocument(java.util.Map<java.lang.String,java.lang.String> docProperties, DocumentPostingList termsInDocument)
          This adds a document to the direct and document indexes, as well as its terms to the lexicon.
protected  void load_indexer_properties()
           
 void performMultiWayMerge()
          Uses the merger class to perform a k-way merge over a set of previously written runs.
 
Methods inherited from class org.terrier.indexing.BasicIndexer
createDocumentPostings, finishedInvertedIndexBuild, getEndOfPipeline
 
Methods inherited from class org.terrier.indexing.Indexer
createMetaIndexBuilder, finishedDirectIndexBuild, index, indexEmpty, init, load_builder_boundary_documents, load_field_ids, load_pipeline, main, merge, merge, mergeTwoIndices, parseInts, useFieldInformation
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

currentId

protected int currentId
Current document Id


maxMemory

protected long maxMemory

memoryCheck

protected MemoryChecker memoryCheck
Memory Checker - provides the method for checking to see if the system is running low on memory


docsPerCheck

protected int docsPerCheck
Number of documents read per memory check


maxDocsPerFlush

protected int maxDocsPerFlush

runtime

protected static final java.lang.Runtime runtime
Runtime system JVM running this instance of Terrier


numberOfDocsSinceCheck

protected int numberOfDocsSinceCheck
Number of documents read since the memory consumption was last checked


numberOfDocsSinceFlush

protected int numberOfDocsSinceFlush
Number of documents read since the memory runs were last flushed to disk


memoryAfterFlush

protected long memoryAfterFlush
Memory status after flush


fileNames

protected java.util.Queue<java.lang.String[]> fileNames
Queue with the file names of the runs on disk


currentFile

protected int currentFile
Number of the current run to be written to disk


mp

protected MemoryPostings mp
Structure that keeps the posting lists in memory


merger

protected RunsMerger merger
Structure for merging the runs


numberOfDocuments

protected int numberOfDocuments
Number of documents indexed


numberOfTokens

protected long numberOfTokens
Number of tokens indexed


numberOfUniqueTerms

protected int numberOfUniqueTerms
Number of unique terms indexed


numberOfPointers

protected long numberOfPointers
Number of pointers indexed


invertedIndexClass

protected java.lang.String invertedIndexClass
Name of the class used to read the generated inverted index


basicInvertedIndexPostingIteratorClass

protected java.lang.String basicInvertedIndexPostingIteratorClass

fieldInvertedIndexPostingIteratorClass

protected java.lang.String fieldInvertedIndexPostingIteratorClass

invertedIndexInputStreamClass

protected java.lang.String invertedIndexInputStreamClass
Name of the class used to read the inverted index as a stream

Constructor Detail

BasicSinglePassIndexer

public BasicSinglePassIndexer(java.lang.String pathname,
                              java.lang.String prefix)
Constructs an instance of a BasicSinglePassIndexer, using the given path name for storing the data structures.

Parameters:
pathname - String the path where the data structures will be created. This is assumed to be absolute.
prefix - String the prefix of the index, usually "data".

BasicSinglePassIndexer

protected BasicSinglePassIndexer(long a,
                                 long b,
                                 long c)
Protected do-nothing constructor for use by child classes

Method Detail

createDirectIndex

public void createDirectIndex(Collection[] collections)
Description copied from class: BasicIndexer
Creates the direct index, the document index and the lexicon. Loops through each document in each of the collections, extracting terms and pushing these through the Term Pipeline (e.g. stemming, stopping, lowercasing).

Overrides:
createDirectIndex in class BasicIndexer
Parameters:
collections - Collection[] the collections to be indexed.

createInvertedIndex

public void createInvertedIndex()
Description copied from class: BasicIndexer
Creates the inverted index after having created the direct index, document index and lexicon.

Overrides:
createInvertedIndex in class BasicIndexer

createInvertedIndex

public void createInvertedIndex(Collection[] collections)
Builds the inverted file and lexicon file for the given collections. Loops through each document in each of the collections, extracting terms and pushing these through the Term Pipeline (e.g. stemming, stopping, lowercasing).

Parameters:
collections - Collection[] the collections to be indexed.

checkFlush

protected void checkFlush()
                   throws java.io.IOException
Checks whether a flush is required, and performs one if necessary.

Throws:
java.io.IOException

forceFlush

protected void forceFlush()
                   throws java.io.IOException
Throws:
java.io.IOException

indexDocument

protected void indexDocument(java.util.Map<java.lang.String,java.lang.String> docProperties,
                             DocumentPostingList termsInDocument)
                      throws java.lang.Exception
This adds a document to the direct and document indexes, as well as its terms to the lexicon. Handled internally by the methods indexFieldDocument and indexNoFieldDocument. This implementation only places content in the runs in memory, which will eventually be flushed to disk.

Overrides:
indexDocument in class BasicIndexer
Parameters:
docProperties - Map properties of the document
termsInDocument - DocumentPostingList the terms in the document.
Throws:
java.lang.Exception

finishMemoryPosting

protected java.lang.String[] finishMemoryPosting()
Adds the names of the current run and its partial lexicon, which are about to be flushed to disk.

Returns:
the String[] array with the names of the run and partial lexicon to write.

performMultiWayMerge

public void performMultiWayMerge()
                          throws java.io.IOException
Uses the merger class to perform a k-way merge over a set of previously written runs. The file names and the number of runs are given by the fileNames queue.

Throws:
java.io.IOException
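
A k-way merge of sorted runs is commonly built around a priority queue that always exposes the smallest head element among the runs. The sketch below shows that general technique on lists of terms only. KWayMergeSketch and its Cursor helper are hypothetical names and do not reflect how RunsMerger is implemented; a real merger would also combine the posting lists of equal terms rather than just collapsing the duplicates.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical k-way merge over sorted runs of terms; not Terrier's RunsMerger.
class KWayMergeSketch {
    // Cursor over one run: its current head term plus the remainder of the run.
    private static final class Cursor implements Comparable<Cursor> {
        String head;
        final Iterator<String> rest;

        Cursor(Iterator<String> it) {
            rest = it;
            head = it.next();
        }

        // Moves to the next term; returns false when the run is exhausted.
        boolean advance() {
            if (!rest.hasNext()) {
                return false;
            }
            head = rest.next();
            return true;
        }

        @Override
        public int compareTo(Cursor other) {
            return head.compareTo(other.head);
        }
    }

    // Merges k sorted runs into one sorted list, collapsing duplicate terms.
    static List<String> merge(List<List<String>> runs) {
        PriorityQueue<Cursor> heap = new PriorityQueue<>();
        for (List<String> run : runs) {
            if (!run.isEmpty()) {
                heap.add(new Cursor(run.iterator()));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor smallest = heap.poll();
            if (merged.isEmpty() || !merged.get(merged.size() - 1).equals(smallest.head)) {
                merged.add(smallest.head);
            }
            // Re-insert the cursor only if its run still has terms left.
            if (smallest.advance()) {
                heap.add(smallest);
            }
        }
        return merged;
    }
}
```

With k runs and n terms in total, each term passes through the heap once, so the merge takes O(n log k) comparisons.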

getFileNames

protected java.lang.String[][] getFileNames()
Returns:
the String[][] structure with the names of the run files and partial lexicons.

createFieldRunMerger

protected void createFieldRunMerger(java.lang.String[][] files)
                             throws java.lang.Exception
Hook method that creates a FieldRunMerger instance

Throws:
java.io.IOException - if an I/O error occurs.
java.lang.Exception

createRunMerger

protected void createRunMerger(java.lang.String[][] files)
                        throws java.lang.Exception
Hook method that creates a RunsMerger instance

Throws:
java.io.IOException - if an I/O error occurs.
java.lang.Exception

createMemoryPostings

protected void createMemoryPostings()
Hook method that creates the right type of MemoryPostings class.
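
The role of hook methods in a template-pattern design can be sketched as follows. All class names here are hypothetical, and the hooks return descriptive strings purely for brevity; the hook methods in this class return void and, presumably, assign the created objects to fields such as mp and merger.

```java
// Hypothetical illustration of the template pattern with hook methods.
abstract class IndexerTemplateSketch {
    // Template method: the skeleton is fixed, the variable parts are hooks.
    final String run() {
        String postings = createMemoryPostings();
        String merger = createRunMerger();
        return "index with " + postings + ", merge with " + merger;
    }

    // Hooks that subclasses implement or override to select concrete types.
    protected abstract String createMemoryPostings();

    protected abstract String createRunMerger();
}

// Default choices, analogous to plain (non-block) indexing.
class BasicSketch extends IndexerTemplateSketch {
    @Override
    protected String createMemoryPostings() {
        return "basic postings";
    }

    @Override
    protected String createRunMerger() {
        return "basic merger";
    }
}

// A variant overrides only the hook it needs; the skeleton is reused as is.
class BlockSketch extends BasicSketch {
    @Override
    protected String createMemoryPostings() {
        return "block postings";
    }
}
```

A subclass such as BlockSketch changes which postings structure is created without duplicating the indexing loop.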


load_indexer_properties

protected void load_indexer_properties()
Overrides:
load_indexer_properties in class Indexer


Terrier 3.5. Copyright © 2004-2011 University of Glasgow