org.terrier.indexing
Class BasicIndexer

java.lang.Object
  extended by org.terrier.indexing.Indexer
      extended by org.terrier.indexing.BasicIndexer
Direct Known Subclasses:
BasicSinglePassIndexer

public class BasicIndexer
extends Indexer

BasicIndexer is the default indexer for Terrier. It takes the terms from each Document object provided by the collection and adds them to temporary lexicons and to the direct file. The document index is updated with pointers into the direct file. The temporary lexicons are then merged into the main lexicon. Inverted index construction takes place as a second step.
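The two-step flow described above can be illustrated with a small, self-contained sketch. This is a toy in-memory model only: the class TwoStepIndexSketch and its methods are hypothetical names invented for illustration, and none of it reflects Terrier's actual data structures or file formats. It assumes documents have already been tokenised into term arrays.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of two-step indexing; NOT Terrier's implementation.
class TwoStepIndexSketch {

    // Step 1: build a direct index, one entry per document,
    // mapping each term to its frequency in that document.
    static List<Map<String, Integer>> buildDirectIndex(List<String[]> docs) {
        List<Map<String, Integer>> direct = new ArrayList<>();
        for (String[] terms : docs) {
            Map<String, Integer> postings = new TreeMap<>();
            for (String t : terms) {
                postings.merge(t, 1, Integer::sum);
            }
            direct.add(postings);
        }
        return direct;
    }

    // Step 2: invert the direct index into term -> (docid -> frequency).
    static Map<String, Map<Integer, Integer>> invert(List<Map<String, Integer>> direct) {
        Map<String, Map<Integer, Integer>> inverted = new TreeMap<>();
        for (int docid = 0; docid < direct.size(); docid++) {
            for (Map.Entry<String, Integer> e : direct.get(docid).entrySet()) {
                inverted.computeIfAbsent(e.getKey(), k -> new TreeMap<>())
                        .put(docid, e.getValue());
            }
        }
        return inverted;
    }

    public static void main(String[] args) {
        List<String[]> docs = new ArrayList<>();
        docs.add(new String[] {"terrier", "index", "terrier"});
        docs.add(new String[] {"index", "lexicon"});
        List<Map<String, Integer>> direct = buildDirectIndex(docs);
        System.out.println(invert(direct));
    }
}
```

In this model the position of a document in the list plays the role of its document id, which is the part the document index would otherwise resolve through pointers into the direct file.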
Properties:

Author:
Craig Macdonald & Vassilis Plachouras
See Also:
Indexer, BlockIndexer

Nested Class Summary
protected  class BasicIndexer.BasicTermProcessor
          This class implements an end of a TermPipeline that adds the term to the DocumentTree.
protected  class BasicIndexer.FieldTermProcessor
          This class implements an end of a TermPipeline that adds the term to the DocumentTree.
 
Field Summary
protected  int numOfTokensInDocument
          The number of tokens found in the current document so far.
protected  java.util.Set<java.lang.String> termFields
          A variable for storing the fields in which a term appears.
protected  DocumentPostingList termsInDocument
          The structure that holds the terms found in a document.
 
Fields inherited from class org.terrier.indexing.Indexer
basicDirectIndexPostingIteratorClass, BUILDER_BOUNDARY_DOCUMENTS, currentIndex, directIndexBuilder, docIndexBuilder, emptyDocIndexEntry, fieldDirectIndexPostingIteratorClass, fieldNames, fileNameNoExtension, IndexEmptyDocuments, invertedIndexBuilder, lexiconBuilder, logger, MAX_DOCS_PER_BUILDER, MAX_TOKENS_IN_DOCUMENT, metaBuilder, numFields, path, pipeline_first, prefix, useFieldInformation
 
Constructor Summary
protected BasicIndexer(long a, long b, long c)
          Protected do-nothing constructor for use by child classes.
  BasicIndexer(java.lang.String path, java.lang.String prefix)
          Constructs an instance of a BasicIndexer, using the given path name for storing the data structures.
 
Method Summary
 void createDirectIndex(Collection[] collections)
          Creates the direct index, the document index and the lexicon.
protected  void createDocumentPostings()
          Hook method that creates the right type of DocumentTree class.
 void createInvertedIndex()
          Creates the inverted index after having created the direct index, document index and lexicon.
protected  void finishedInvertedIndexBuild()
          Hook method, called when the inverted index is finished, i.e. the lexicon is finished.
protected  TermPipeline getEndOfPipeline()
          Returns the end of the term pipeline, which corresponds to an instance of either BasicIndexer.BasicTermProcessor, or BasicIndexer.FieldTermProcessor, depending on whether field information is stored.
protected  void indexDocument(java.util.Map<java.lang.String,java.lang.String> docProperties, DocumentPostingList _termsInDocument)
          This adds a document to the direct and document indexes, as well as its terms to the lexicon.
 
Methods inherited from class org.terrier.indexing.Indexer
createMetaIndexBuilder, finishedDirectIndexBuild, index, indexEmpty, init, load_builder_boundary_documents, load_field_ids, load_indexer_properties, load_pipeline, main, merge, merge, mergeTwoIndices, parseInts, useFieldInformation
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

termFields

protected java.util.Set<java.lang.String> termFields
A variable for storing the fields in which a term appears.


termsInDocument

protected DocumentPostingList termsInDocument
The structure that holds the terms found in a document.


numOfTokensInDocument

protected int numOfTokensInDocument
The number of tokens found in the current document so far.

Constructor Detail

BasicIndexer

protected BasicIndexer(long a,
                       long b,
                       long c)
Protected do-nothing constructor for use by child classes. Classes which use this constructor must call init().


BasicIndexer

public BasicIndexer(java.lang.String path,
                    java.lang.String prefix)
Constructs an instance of a BasicIndexer, using the given path name for storing the data structures.

Parameters:
path - String the path where the data structures will be created. This is assumed to be absolute.
prefix - String the filename component of the data structures
Method Detail

getEndOfPipeline

protected TermPipeline getEndOfPipeline()
Returns the end of the term pipeline, which corresponds to an instance of either BasicIndexer.BasicTermProcessor, or BasicIndexer.FieldTermProcessor, depending on whether field information is stored.

Specified by:
getEndOfPipeline in class Indexer
Returns:
TermPipeline the end of the term pipeline.
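As a rough illustration of choosing the end of the pipeline by whether field information is stored, the sketch below returns one of two term sinks. It is a simplified model under stated assumptions: EndOfPipelineSketch, TermSink and DocPostings are hypothetical names, and the real BasicTermProcessor and FieldTermProcessor classes may behave differently.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model of selecting a field-aware or field-free pipeline end;
// NOT Terrier's implementation.
class EndOfPipelineSketch {

    // Hypothetical stand-in for the last stage of a term pipeline.
    interface TermSink {
        void processTerm(String term);
    }

    // Hypothetical per-document accumulator.
    static final class DocPostings {
        final Map<String, Integer> tf = new TreeMap<>();
        final Map<String, Set<String>> fields = new TreeMap<>();
    }

    // Returns a sink that records term frequencies, and additionally the
    // fields each term occurred in when useFields is true.
    static TermSink endOfPipeline(boolean useFields, DocPostings doc, Set<String> currentFields) {
        if (useFields) {
            return term -> {
                doc.tf.merge(term, 1, Integer::sum);
                doc.fields.computeIfAbsent(term, k -> new TreeSet<>()).addAll(currentFields);
            };
        }
        return term -> doc.tf.merge(term, 1, Integer::sum);
    }

    public static void main(String[] args) {
        DocPostings doc = new DocPostings();
        Set<String> currentFields = new TreeSet<>();
        currentFields.add("TITLE");
        TermSink sink = endOfPipeline(true, doc, currentFields);
        sink.processTerm("terrier");
        sink.processTerm("terrier");
        System.out.println(doc.tf + " " + doc.fields);
    }
}
```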

createDirectIndex

public void createDirectIndex(Collection[] collections)
Creates the direct index, the document index and the lexicon. Loops through each document in each of the collections, extracting terms and pushing these through the term pipeline (e.g. stemming, stopword removal, lowercasing).

Specified by:
createDirectIndex in class Indexer
Parameters:
collections - Collection[] the collections to be indexed.
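The idea of pushing each extracted term through a chain of pipeline stages can be sketched as follows. This is a minimal, self-contained model rather than Terrier's own pipeline: the Stage interface and the Lowercase, Stopwords and Collector classes are hypothetical, and stemming is omitted to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy chain of term-processing stages; NOT Terrier's implementation.
class TermPipelineSketch {

    // Hypothetical stand-in for one stage of a term pipeline.
    interface Stage {
        void processTerm(String term);
    }

    // Lowercases the term and passes it on.
    static final class Lowercase implements Stage {
        private final Stage next;
        Lowercase(Stage next) { this.next = next; }
        public void processTerm(String term) {
            next.processTerm(term.toLowerCase(Locale.ROOT));
        }
    }

    // Drops stopwords; all other terms are passed on unchanged.
    static final class Stopwords implements Stage {
        private final Set<String> stop;
        private final Stage next;
        Stopwords(Set<String> stop, Stage next) { this.stop = stop; this.next = next; }
        public void processTerm(String term) {
            if (!stop.contains(term)) {
                next.processTerm(term);
            }
        }
    }

    // End of the pipeline: collects the terms that survive every stage.
    static final class Collector implements Stage {
        final List<String> terms = new ArrayList<>();
        public void processTerm(String term) { terms.add(term); }
    }

    // Runs the tokens through lowercase -> stopwords -> collector.
    static List<String> run(String[] tokens, Set<String> stopwords) {
        Collector end = new Collector();
        Stage first = new Lowercase(new Stopwords(stopwords, end));
        for (String t : tokens) {
            first.processTerm(t);
        }
        return end.terms;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the", "of"));
        System.out.println(run(new String[] {"The", "Index", "of", "Terrier"}, stop));
    }
}
```

Because each stage only knows its successor, the order of stages matters: here lowercasing runs first so that the stopword set needs lowercase entries only.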

indexDocument

protected void indexDocument(java.util.Map<java.lang.String,java.lang.String> docProperties,
                             DocumentPostingList _termsInDocument)
                      throws java.lang.Exception
This adds a document to the direct and document indexes, as well as its terms to the lexicon. Handled internally by the methods indexFieldDocument and indexNoFieldDocument.

Parameters:
docProperties - Map properties of the document
_termsInDocument - DocumentPostingList the terms in the document.
Throws:
java.lang.Exception

createInvertedIndex

public void createInvertedIndex()
Creates the inverted index after having created the direct index, document index and lexicon.

Specified by:
createInvertedIndex in class Indexer

createDocumentPostings

protected void createDocumentPostings()
Hook method that creates the right type of DocumentTree class.


finishedInvertedIndexBuild

protected void finishedInvertedIndexBuild()
Hook method, called when the inverted index is finished, i.e. the lexicon is finished.

Overrides:
finishedInvertedIndexBuild in class Indexer


Terrier 3.5. Copyright © 2004-2011 University of Glasgow