org.terrier.indexing
Class BlockIndexer

java.lang.Object
  extended by org.terrier.indexing.Indexer
      extended by org.terrier.indexing.BlockIndexer

public class BlockIndexer
extends Indexer

An indexer that saves block information for the indexed terms. Block information is usually recorded in terms of relative term positions (position 1, position 2, etc.). However, since version 2.2, Terrier has supported "marker terms" during indexing, which are used to increment the block counter. The properties that control fixed-size blocks are referenced under BLOCK_SIZE and MAX_BLOCKS in Field Detail.

Markered Blocks
Markers are terms (artificially inserted or otherwise) in the term stream that denote when the block counter should be incremented. This functionality is enabled with the block.delimiters.enabled property, and the marker terms are specified as a comma-delimited list in the block.delimiters property.
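
The effect of the two properties named above, block.delimiters.enabled and block.delimiters, can be pictured with a minimal sketch. DelimitedBlockSketch and its methods are hypothetical names that are not part of Terrier. The sketch assumes that block numbering starts at 0, that whitespace around each delimiter is trimmed, and that the marker terms themselves are not recorded; none of these details is confirmed by this page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of marker-delimited blocks; not Terrier code. */
final class DelimitedBlockSketch {

    /** Parses a comma-delimited value such as the one given to block.delimiters. */
    static Set<String> parseDelimiters(String propertyValue) {
        Set<String> delimiters = new HashSet<>();
        for (String part : propertyValue.split(",")) {
            String trimmed = part.trim(); // assumption: surrounding whitespace is ignored
            if (!trimmed.isEmpty()) {
                delimiters.add(trimmed);
            }
        }
        return delimiters;
    }

    /** Returns "term@blockId" for every term that is not a marker. */
    static List<String> assign(List<String> terms, Set<String> delimiters) {
        List<String> out = new ArrayList<>();
        int blockId = 0; // assumption: the first block is block 0
        for (String term : terms) {
            if (delimiters.contains(term)) {
                blockId++; // a marker term only advances the block counter
                continue;  // assumption: the marker itself is not recorded
            }
            out.add(term + "@" + blockId);
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> delimiters = parseDelimiters("SENT, PARA");
        List<String> terms = Arrays.asList("a", "b", "SENT", "c", "PARA", "d");
        System.out.println(assign(terms, delimiters)); // prints [a@0, b@0, c@1, d@2]
    }
}
```

In this sketch every term between two markers shares a block id, so block boundaries follow the markers rather than a fixed term count.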

Author:
Craig Macdonald, Vassilis Plachouras, Rodrygo Santos

Nested Class Summary
protected  class BlockIndexer.BasicTermProcessor
          This class implements an end of a TermPipeline that adds the term to the DocumentTree.
protected  class BlockIndexer.DelimFieldTermProcessor
          This class behaves in a similar fashion to FieldTermProcessor, except that it uses blocks bounded by delimiters instead of fixed-size blocks.
protected  class BlockIndexer.DelimTermProcessor
          This class behaves in a similar fashion to BasicTermProcessor, except that it uses blocks bounded by delimiters instead of fixed-size blocks.
protected  class BlockIndexer.FieldTermProcessor
          This class implements an end of a TermPipeline that adds the term to the DocumentTree.
 
Field Summary
protected  int BLOCK_SIZE
          The maximum number of terms allowed in a block.
protected  int blockId
          The block number of the current document.
protected  int MAX_BLOCKS
          The maximum number of blocks allowed in a document.
protected  int numOfTokensInBlock
          The number of tokens in the current block of the current document.
protected  int numOfTokensInDocument
          The number of tokens in the current document so far.
protected  java.util.Set<java.lang.String> termFields
          The fields that are set for the current term.
protected  DocumentPostingList termsInDocument
          The list of terms in this document, and for each, the block occurrences.
 
Fields inherited from class org.terrier.indexing.Indexer
basicDirectIndexPostingIteratorClass, BUILDER_BOUNDARY_DOCUMENTS, currentIndex, directIndexBuilder, docIndexBuilder, emptyDocIndexEntry, fieldDirectIndexPostingIteratorClass, fieldNames, fileNameNoExtension, IndexEmptyDocuments, invertedIndexBuilder, lexiconBuilder, logger, MAX_DOCS_PER_BUILDER, MAX_TOKENS_IN_DOCUMENT, metaBuilder, numFields, path, pipeline_first, prefix, useFieldInformation
 
Constructor Summary
BlockIndexer(java.lang.String pathname, java.lang.String prefix)
          Constructs an instance of this class, where the created data structures are stored in the given path, with the given prefix on the filenames.
 
Method Summary
 void createDirectIndex(Collection[] collections)
          Iterates through the documents of the given collections and creates the direct index, document index and lexicon, using information about blocks and possibly fields.
protected  void createDocumentPostings()
           
 void createInvertedIndex()
          Creates the inverted index from the already created direct index, document index and lexicon.
protected  void finishedInvertedIndexBuild()
          Hook method, called when the inverted index is finished, i.e. the lexicon is finished.
protected  TermPipeline getEndOfPipeline()
          Returns the object that is to be the end of the TermPipeline.
protected  void indexDocument(java.util.Map<java.lang.String,java.lang.String> docProperties, DocumentPostingList _termsInDocument)
          This adds a document to the direct and document indexes, as well as its terms to the lexicon.
protected  void load_indexer_properties()
           
 
Methods inherited from class org.terrier.indexing.Indexer
createMetaIndexBuilder, finishedDirectIndexBuild, index, indexEmpty, init, load_builder_boundary_documents, load_field_ids, load_pipeline, main, merge, merge, mergeTwoIndices, parseInts, useFieldInformation
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

numOfTokensInDocument

protected int numOfTokensInDocument
The number of tokens in the current document so far.


numOfTokensInBlock

protected int numOfTokensInBlock
The number of tokens in the current block of the current document.


blockId

protected int blockId
The block number of the current document.


termFields

protected java.util.Set<java.lang.String> termFields
The fields that are set for the current term.


termsInDocument

protected DocumentPostingList termsInDocument
The list of terms in this document, and for each, the block occurrences.
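
As a rough picture of "for each term, the block occurrences", the following sketch keeps a sorted set of block ids per term. BlockPostingSketch is a hypothetical stand-in and does not reproduce the API of DocumentPostingList, which is not shown on this page. Block ids are supplied by the caller.

```java
import java.util.Collections;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical per-document structure; not Terrier's DocumentPostingList. */
final class BlockPostingSketch {
    // term -> sorted ids of the blocks in which the term occurs
    private final Map<String, SortedSet<Integer>> blocksByTerm = new TreeMap<>();

    /** Records one occurrence of the term in the given block. */
    void insert(String term, int blockId) {
        blocksByTerm.computeIfAbsent(term, k -> new TreeSet<>()).add(blockId);
    }

    /** Returns the block ids recorded for the term, or an empty set if unseen. */
    SortedSet<Integer> blocksOf(String term) {
        return blocksByTerm.getOrDefault(term, Collections.<Integer>emptySortedSet());
    }

    /** Returns the number of distinct terms recorded so far. */
    int distinctTerms() {
        return blocksByTerm.size();
    }

    public static void main(String[] args) {
        BlockPostingSketch doc = new BlockPostingSketch();
        String[] tokens = {"to", "be", "or", "not", "to", "be"};
        // One block per token here, so the block id equals the token offset
        for (int i = 0; i < tokens.length; i++) {
            doc.insert(tokens[i], i);
        }
        System.out.println(doc.blocksOf("to")); // prints [0, 4]
    }
}
```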


BLOCK_SIZE

protected int BLOCK_SIZE
The maximum number of terms allowed in a block. See property blocks.size.


MAX_BLOCKS

protected int MAX_BLOCKS
The maximum number of blocks allowed in a document. Once this limit is reached, all the remaining terms are placed in the final block. See Property max.blocks.
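
The interaction between BLOCK_SIZE and MAX_BLOCKS can be sketched as follows. FixedSizeBlockSketch is a hypothetical class, not the code of BasicTermProcessor. It assumes that block ids start at 0 and that the cap is applied so that at most maxBlocks distinct ids are produced; the exact comparisons used by Terrier are not stated on this page.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of fixed-size block assignment; not Terrier code. */
final class FixedSizeBlockSketch {
    private final int blockSize; // plays the role of BLOCK_SIZE
    private final int maxBlocks; // plays the role of MAX_BLOCKS
    private int blockId = 0;     // assumption: numbering starts at 0
    private int numOfTokensInBlock = 0;

    FixedSizeBlockSketch(int blockSize, int maxBlocks) {
        this.blockSize = blockSize;
        this.maxBlocks = maxBlocks;
    }

    /** Returns the block id for the next token, then advances the counters. */
    int nextBlockId() {
        int assigned = blockId;
        numOfTokensInBlock++;
        // Open a new block when the current one is full, unless the cap is reached;
        // after the cap, every remaining token stays in the final block.
        if (numOfTokensInBlock >= blockSize && blockId < maxBlocks - 1) {
            numOfTokensInBlock = 0;
            blockId++;
        }
        return assigned;
    }

    /** Convenience method: block ids for a document of tokenCount tokens. */
    static List<Integer> assign(int tokenCount, int blockSize, int maxBlocks) {
        FixedSizeBlockSketch sketch = new FixedSizeBlockSketch(blockSize, maxBlocks);
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < tokenCount; i++) {
            ids.add(sketch.nextBlockId());
        }
        return ids;
    }

    public static void main(String[] args) {
        // 7 tokens, 2 tokens per block, at most 3 blocks
        System.out.println(assign(7, 2, 3)); // prints [0, 0, 1, 1, 2, 2, 2]
    }
}
```

With a block size of 1 and a large cap, each token receives its own block id, which matches the "relative term positions" case described in the class overview.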

Constructor Detail

BlockIndexer

public BlockIndexer(java.lang.String pathname,
                    java.lang.String prefix)
Constructs an instance of this class, where the created data structures are stored in the given path, with the given prefix on the filenames.

Parameters:
pathname - String the path in which the created data structures will be saved. This is assumed to be absolute.
prefix - String the prefix on the filenames of the created data structures, usually "data"
Method Detail

getEndOfPipeline

protected TermPipeline getEndOfPipeline()
Returns the object that is to be the end of the TermPipeline. This method is used at construction time of the parent object.

Specified by:
getEndOfPipeline in class Indexer
Returns:
TermPipeline the last component of the term pipeline.

createDirectIndex

public void createDirectIndex(Collection[] collections)
Iterates through the documents of the given collections and creates the direct index, document index and lexicon, using information about blocks and possibly fields.

Specified by:
createDirectIndex in class Indexer
Parameters:
collections - Collection[] the collections to index.
See Also:
Indexer.createDirectIndex(org.terrier.indexing.Collection[])

indexDocument

protected void indexDocument(java.util.Map<java.lang.String,java.lang.String> docProperties,
                             DocumentPostingList _termsInDocument)
                      throws java.lang.Exception
This adds a document to the direct and document indexes, as well as its terms to the lexicon. Handled internally by the methods indexFieldDocument and indexNoFieldDocument.

Parameters:
docProperties - Map properties of the document
_termsInDocument - DocumentPostingList the terms in the document.
Throws:
java.lang.Exception

createInvertedIndex

public void createInvertedIndex()
Creates the inverted index from the already created direct index, document index and lexicon. It saves block information and possibly field information as well.

Specified by:
createInvertedIndex in class Indexer
See Also:
Indexer.createInvertedIndex()

finishedInvertedIndexBuild

protected void finishedInvertedIndexBuild()
Hook method, called when the inverted index is finished, i.e. the lexicon is finished.

Overrides:
finishedInvertedIndexBuild in class Indexer

createDocumentPostings

protected void createDocumentPostings()

load_indexer_properties

protected void load_indexer_properties()
Overrides:
load_indexer_properties in class Indexer


Terrier 3.5. Copyright © 2004-2011 University of Glasgow