org.terrier.structures.indexing
Class BlockInvertedIndexBuilder

java.lang.Object
  extended by org.terrier.structures.indexing.InvertedIndexBuilder
      extended by org.terrier.structures.indexing.BlockInvertedIndexBuilder

public class BlockInvertedIndexBuilder
extends InvertedIndexBuilder

Builds an inverted index saving term-block information. It optionally saves term-field information as well.

Algorithm:

  1. While there are terms left:
    1. Read M term ids from lexicon, in lexicographical order
    2. Read the occurrences of these M terms into memory from the direct file
    3. Write the occurrences of these M terms to the inverted file
  2. Rewrite the lexicon, removing block frequencies, and adding inverted file offsets
  3. Write the collection statistics
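
The loop in step 1 can be illustrated with a minimal, dependency-free sketch. It is not Terrier's implementation: the class BatchedInversionSketch, its methods, and the in-memory int[][] standing in for the direct file are all hypothetical, and term ids are assumed to be contiguous (0 to numTerms-1) and already in lexicon order. The sketch only shows why each batch of M terms costs one full pass over the direct file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Terrier.
class BatchedInversionSketch {

    /**
     * Inverts a toy direct index in batches of at most m term ids.
     * directIndex[d] lists the distinct term ids occurring in document d.
     * Returns, for each term id, the ascending list of documents containing it.
     */
    static List<List<Integer>> invert(int[][] directIndex, int numTerms, int m) {
        List<List<Integer>> inverted = new ArrayList<>();
        for (int start = 0; start < numTerms; start += m) {
            int end = Math.min(start + m, numTerms);
            // Step 1.1: select the next batch of term ids [start, end).
            List<List<Integer>> batch = new ArrayList<>();
            for (int t = start; t < end; t++) {
                batch.add(new ArrayList<>());
            }
            // Step 1.2: one full pass over the direct index for this batch.
            for (int doc = 0; doc < directIndex.length; doc++) {
                for (int termId : directIndex[doc]) {
                    if (termId >= start && termId < end) {
                        batch.get(termId - start).add(doc);
                    }
                }
            }
            // Step 1.3: "write" the batch; here it is appended to an in-memory list.
            inverted.addAll(batch);
        }
        return inverted;
    }

    /** Number of full passes over the direct index for batch size m. */
    static int passes(int numTerms, int m) {
        return (numTerms + m - 1) / m;
    }

    public static void main(String[] args) {
        int[][] directIndex = { {0, 2}, {1, 2}, {0, 1, 3} };
        System.out.println(invert(directIndex, 4, 2)); // [[0, 2], [1, 2], [0, 1], [2]]
        System.out.println(passes(4, 2));              // 2
    }
}
```

The result does not depend on the batch size; only the number of passes over the direct index does, which is the trade-off discussed under lexicon term selection.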

Lexicon term selection: There are two strategies for selecting the number of terms to read from the lexicon. The number of terms read must be small enough that the occurrences of all those terms from the direct file fit in memory. On the other hand, reading fewer terms per iteration means more iterations, which is I/O expensive, because the entire direct file has to be read in every iteration.
The two strategies are:

By default, the second strategy is chosen, unless the property invertedfile.processpointers is set to zero.
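
The inherited fields processTerms and numberOfPointersPerIteration suggest that a batch is bounded either by a number of terms or by a number of pointers (occurrences). Assuming that reading, the sketch below contrasts the two ways of cutting the lexicon into batches. The class BatchSelectionSketch, its method names, and the exact cut-off rule are hypothetical and are not taken from Terrier.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Terrier.
class BatchSelectionSketch {

    /** Assumed strategy A: every batch holds at most maxTerms terms. */
    static List<Integer> batchSizesByTerms(int[] pointerCounts, int maxTerms) {
        List<Integer> sizes = new ArrayList<>();
        for (int start = 0; start < pointerCounts.length; start += maxTerms) {
            sizes.add(Math.min(maxTerms, pointerCounts.length - start));
        }
        return sizes;
    }

    /**
     * Assumed strategy B: keep adding terms until adding another one would
     * push the batch's summed pointer count above maxPointers.
     * A batch always takes at least one term, so a single very frequent
     * term can still exceed the budget on its own.
     */
    static List<Integer> batchSizesByPointers(int[] pointerCounts, long maxPointers) {
        List<Integer> sizes = new ArrayList<>();
        int termsInBatch = 0;
        long pointersInBatch = 0;
        for (int count : pointerCounts) {
            if (termsInBatch > 0 && pointersInBatch + count > maxPointers) {
                sizes.add(termsInBatch);
                termsInBatch = 0;
                pointersInBatch = 0;
            }
            termsInBatch++;
            pointersInBatch += count;
        }
        if (termsInBatch > 0) {
            sizes.add(termsInBatch);
        }
        return sizes;
    }

    public static void main(String[] args) {
        // Number of pointers (postings) per term, in lexicon order.
        int[] pointerCounts = {5, 1, 1, 1, 4, 4};
        System.out.println(batchSizesByTerms(pointerCounts, 2));    // [2, 2, 2]
        System.out.println(batchSizesByPointers(pointerCounts, 8)); // [4, 2]
    }
}
```

Bounding by pointers tracks memory use more directly, because the memory needed for a batch grows with the number of occurrences held, not with the number of terms.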

Properties:

Author:
Douglas Johnson & Vassilis Plachouras & Craig Macdonald

Nested Class Summary
 
Nested classes/interfaces inherited from class org.terrier.structures.indexing.InvertedIndexBuilder
InvertedIndexBuilder.IntLongTuple
 
Field Summary
protected  String finalLexiconClass
           
 
Fields inherited from class org.terrier.structures.indexing.InvertedIndexBuilder
fieldCount, file, index, lexiconOutputStream, logger, numberOfPointersPerIteration, processTerms, structureName, useFieldInformation
 
Constructor Summary
BlockInvertedIndexBuilder(Index index, String structureName)
          constructor
 
Method Summary
 void createInvertedIndex()
          This method creates the block inverted index.
protected  gnu.trove.TIntArrayList[] createPointerForTerm(LexiconEntry le)
           
protected  void traverseDirectFile(gnu.trove.TIntIntHashMap codesHashMap, gnu.trove.TIntArrayList[][] tmpStorage)
          Traverses the direct file, recording all occurrences of terms noted in codesHashMap into tmpStorage.
protected  long writeInvertedFilePart(DataOutputStream dos, gnu.trove.TIntArrayList[][] tmpStorage, int processTerms)
          Writes the section of the inverted file.
 
Methods inherited from class org.terrier.structures.indexing.InvertedIndexBuilder
close, displayMemoryUsage, getLexOutputStream, scanLexiconForPointers, scanLexiconForTerms
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

finalLexiconClass

protected String finalLexiconClass
Constructor Detail

BlockInvertedIndexBuilder

public BlockInvertedIndexBuilder(Index index,
                                 String structureName)
constructor

Parameters:
index -
structureName -
Method Detail

createInvertedIndex

public void createInvertedIndex()
This method creates the block inverted index. The approach, in brief: for each group of M terms from the lexicon, the corresponding section of the inverted file is built and saved to disk. The number of times the direct file must be read therefore depends on the parameter M, and consequently on the amount of available memory.

Overrides:
createInvertedIndex in class InvertedIndexBuilder

createPointerForTerm

protected gnu.trove.TIntArrayList[] createPointerForTerm(LexiconEntry le)
Overrides:
createPointerForTerm in class InvertedIndexBuilder

traverseDirectFile

protected void traverseDirectFile(gnu.trove.TIntIntHashMap codesHashMap,
                                  gnu.trove.TIntArrayList[][] tmpStorage)
                           throws IOException
Traverses the direct file, recording all occurrences of terms noted in codesHashMap into tmpStorage.

Overrides:
traverseDirectFile in class InvertedIndexBuilder
Parameters:
codesHashMap - contains, as keys, the term ids that are being processed in this iteration. Each value is the index in tmpStorage where information about that term should be placed.
tmpStorage - Records the occurrence information. The first dimension is per term, at the index given by codesHashMap. The second dimension contains fieldCount+4 TIntArrayLists: (document id, term frequency, field0, ..., field(fieldCount-1), block frequencies, block ids).
Throws:
IOException - if there is a problem while traversing the direct index.
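
The tmpStorage layout described above can be made concrete with a small sketch. It uses java.util lists in place of gnu.trove.TIntArrayList so that it runs without extra libraries. The class TmpStorageLayoutSketch and its methods are hypothetical, and two details are assumptions: that the per-field lists hold one value per posting, and that the block frequency of a posting is the number of block ids recorded for it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Terrier.
class TmpStorageLayoutSketch {

    /** Creates the fieldCount+4 parallel lists kept for a single term. */
    static List<List<Integer>> newTermSlot(int fieldCount) {
        List<List<Integer>> slot = new ArrayList<>();
        for (int i = 0; i < fieldCount + 4; i++) {
            slot.add(new ArrayList<>());
        }
        return slot;
    }

    // Positions within a term slot:
    //   0 = document ids, 1 = term frequencies, 2 .. 2+fieldCount-1 = fields,
    //   then block frequencies, then block ids.
    static int blockFrequencyIndex(int fieldCount) {
        return 2 + fieldCount;
    }

    static int blockIdIndex(int fieldCount) {
        return 3 + fieldCount;
    }

    /** Appends one posting (one document's occurrences of the term) to a slot. */
    static void record(List<List<Integer>> slot, int fieldCount, int docId,
                       int termFrequency, int[] fieldValues, int[] blockIds) {
        slot.get(0).add(docId);
        slot.get(1).add(termFrequency);
        for (int f = 0; f < fieldCount; f++) {
            slot.get(2 + f).add(fieldValues[f]);
        }
        // Assumption: block frequency = number of block ids for this posting.
        slot.get(blockFrequencyIndex(fieldCount)).add(blockIds.length);
        for (int blockId : blockIds) {
            slot.get(blockIdIndex(fieldCount)).add(blockId);
        }
    }

    public static void main(String[] args) {
        int fieldCount = 2;
        List<List<Integer>> slot = newTermSlot(fieldCount);
        record(slot, fieldCount, 7, 3, new int[] {1, 0}, new int[] {4, 9, 12});
        System.out.println(slot); // [[7], [3], [1], [0], [3], [4, 9, 12]]
    }
}
```

Because the block ids list is flat, the block frequencies list is what allows a reader of the structure to tell where one document's block ids end and the next document's begin.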

writeInvertedFilePart

protected long writeInvertedFilePart(DataOutputStream dos,
                                     gnu.trove.TIntArrayList[][] tmpStorage,
                                     int processTerms)
                              throws IOException
Writes the section of the inverted file.

Overrides:
writeInvertedFilePart in class InvertedIndexBuilder
Parameters:
dos - a temporary data structure that contains the offsets in the inverted index for each term.
tmpStorage - Occurrences information, as described in traverseDirectFile(). This data is consumed by this method - once this method has been called, all the data in tmpStorage will be destroyed.
processTerms - The number of terms being processed in this iteration.
Returns:
the number of tokens processed in this iteration
Throws:
IOException


Terrier 3.6. Copyright © 2004-2011 University of Glasgow