org.terrier.structures.indexing
Class InvertedIndexBuilder

java.lang.Object
  extended by org.terrier.structures.indexing.InvertedIndexBuilder
Direct Known Subclasses:
BlockInvertedIndexBuilder

public class InvertedIndexBuilder
extends java.lang.Object

Builds an inverted index. It optionally saves term-field information as well.

Algorithm:

  1. While there are terms left:
    1. Read M term ids from lexicon, in lexicographical order
    2. Read the occurrences of these M terms into memory from the direct file
    3. Write the occurrences of these M terms to the inverted file
  2. Rewrite the lexicon, removing block frequencies, and adding inverted file offsets
  3. Write the collection statistics

Lexicon term selection: There are two strategies of selecting the number of terms to read from the lexicon. The trade-off here is to read a small enough number of terms into memory such that the occurrences of all those terms from the direct file can fit in memory. On the other hand, the less terms that are read implies more iterations, which is I/O expensive, as the entire direct file has to be read for every iteration.
The two strategies are:

By default, the 2nd strategy is chosen, unless the invertedfile.processpointers has a zero value specified.

Properties:

Author:
Craig Macdonald & Vassilis Plachouras

Nested Class Summary
protected static class InvertedIndexBuilder.IntLongTuple
           
 
Field Summary
protected  int fieldCount
           
protected  BitOut file
          The underlying bit file.
protected  Index index
           
protected  java.lang.Class<?> lexiconOutputStream
          class to be used as a lexiconoutpustream.
protected static org.apache.log4j.Logger logger
          The logger used
protected  long numberOfPointersPerIteration
          The number of pointers to be processed in an interation.
protected  int processTerms
          The number of terms for which the inverted file is built each time.
protected  java.lang.String structureName
           
protected  boolean useFieldInformation
          Indicates whether field information is used.
 
Constructor Summary
InvertedIndexBuilder(Index i, java.lang.String _structureName)
          contructor
 
Method Summary
 void close()
          Closes the underlying bit file.
 void createInvertedIndex()
          Creates the inverted index using the already created direct index, document index and lexicon.
protected  gnu.trove.TIntArrayList[] createPointerForTerm(LexiconEntry le)
           
static void displayMemoryUsage(java.lang.Runtime r)
          display memory usage
protected  LexiconOutputStream<java.lang.String> getLexOutputStream(java.lang.String _structureName)
          get LexiconOutputStream
protected  InvertedIndexBuilder.IntLongTuple scanLexiconForPointers(long PointersToProcess, java.util.Iterator<java.util.Map.Entry<java.lang.String,LexiconEntry>> lexiconStream, gnu.trove.TIntIntHashMap codesHashMap, java.util.ArrayList<gnu.trove.TIntArrayList[]> tmpStorageStorage)
          Iterates through the lexicon, until it has reached the given number of pointers
protected  InvertedIndexBuilder.IntLongTuple scanLexiconForTerms(int _processTerms, java.util.Iterator<java.util.Map.Entry<java.lang.String,LexiconEntry>> lexiconStream, gnu.trove.TIntIntHashMap codesHashMap, gnu.trove.TIntArrayList[][] tmpStorage)
          Iterates through the lexicon, until it has reached the given number of terms
protected  void traverseDirectFile(gnu.trove.TIntIntHashMap codesHashMap, gnu.trove.TIntArrayList[][] tmpStorage)
          Traverses the direct index and creates the inverted index entries for the terms specified in the codesHashMap and tmpStorage.
protected  long writeInvertedFilePart(java.io.DataOutputStream dos, gnu.trove.TIntArrayList[][] tmpStorage, int _processTerms)
          Writes the section of the inverted file
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

lexiconOutputStream

protected java.lang.Class<?> lexiconOutputStream
class to be used as a lexiconoutpustream. set by this and child classes


logger

protected static final org.apache.log4j.Logger logger
The logger used


fieldCount

protected int fieldCount

useFieldInformation

protected boolean useFieldInformation
Indicates whether field information is used.


index

protected Index index

structureName

protected java.lang.String structureName

numberOfPointersPerIteration

protected long numberOfPointersPerIteration
The number of pointers to be processed in an interation. This directly corresponds to the property invertedfile.processpointers. If this property is set and > 0, then each iteration of the inverted index creation will be done to a set number of pointers, not a set number of terms, overriding invertedfile.processterms. Default is 20000000.


file

protected BitOut file
The underlying bit file.


processTerms

protected int processTerms
The number of terms for which the inverted file is built each time. The corresponding property is invertedfile.processterms and the default value is 75000. The higher the value, the greater the requirements for memory are, but the less time it takes to invert the direct file.

Constructor Detail

InvertedIndexBuilder

public InvertedIndexBuilder(Index i,
                            java.lang.String _structureName)
contructor

Parameters:
i -
_structureName -
Method Detail

close

public void close()
           throws java.io.IOException
Closes the underlying bit file.

Throws:
java.io.IOException

createInvertedIndex

public void createInvertedIndex()
Creates the inverted index using the already created direct index, document index and lexicon.


createPointerForTerm

protected gnu.trove.TIntArrayList[] createPointerForTerm(LexiconEntry le)

scanLexiconForPointers

protected InvertedIndexBuilder.IntLongTuple scanLexiconForPointers(long PointersToProcess,
                                                                   java.util.Iterator<java.util.Map.Entry<java.lang.String,LexiconEntry>> lexiconStream,
                                                                   gnu.trove.TIntIntHashMap codesHashMap,
                                                                   java.util.ArrayList<gnu.trove.TIntArrayList[]> tmpStorageStorage)
                                                            throws java.io.IOException
Iterates through the lexicon, until it has reached the given number of pointers

Parameters:
PointersToProcess - Number of pointers to stop reading the lexicon after
lexiconStream - the lexicon input stream to read
codesHashMap -
tmpStorageStorage -
Returns:
IntLongTuple number of terms, number of pointers
Throws:
java.io.IOException

scanLexiconForTerms

protected InvertedIndexBuilder.IntLongTuple scanLexiconForTerms(int _processTerms,
                                                                java.util.Iterator<java.util.Map.Entry<java.lang.String,LexiconEntry>> lexiconStream,
                                                                gnu.trove.TIntIntHashMap codesHashMap,
                                                                gnu.trove.TIntArrayList[][] tmpStorage)
                                                         throws java.io.IOException
Iterates through the lexicon, until it has reached the given number of terms

Parameters:
_processTerms - Number of terms to stop reading the lexicon after
lexiconStream - the lexicon input stream to read
codesHashMap - mapping of termids to which offset in the storage array for terms to be processed this iteration
tmpStorage - place to put postings for this iteration
Returns:
IntLongTuple number of terms, number of pointers
Throws:
java.io.IOException

traverseDirectFile

protected void traverseDirectFile(gnu.trove.TIntIntHashMap codesHashMap,
                                  gnu.trove.TIntArrayList[][] tmpStorage)
                           throws java.io.IOException
Traverses the direct index and creates the inverted index entries for the terms specified in the codesHashMap and tmpStorage.

Parameters:
tmpStorage - TIntArrayList[][] an array of the inverted index entries to store
codesHashMap - a mapping from the term identifiers to the index in the tmpStorage matrix.
Throws:
java.io.IOException - if there is a problem while traversing the direct index.

writeInvertedFilePart

protected long writeInvertedFilePart(java.io.DataOutputStream dos,
                                     gnu.trove.TIntArrayList[][] tmpStorage,
                                     int _processTerms)
                              throws java.io.IOException
Writes the section of the inverted file

Parameters:
dos - a temporary data structure that contains the offsets in the inverted index for each term.
tmpStorage - Occurrences information, as described in traverseDirectFile(). This data is consumed by this method - once this method has been called, all the data in tmpStorage will be destroyed.
_processTerms - The number of terms being processed in this iteration.
Returns:
the number of tokens processed in this iteration
Throws:
java.io.IOException

displayMemoryUsage

public static void displayMemoryUsage(java.lang.Runtime r)
display memory usage

Parameters:
r -

getLexOutputStream

protected LexiconOutputStream<java.lang.String> getLexOutputStream(java.lang.String _structureName)
                                                            throws java.io.IOException
get LexiconOutputStream

Parameters:
_structureName -
Returns:
LexiconOutputStream
Throws:
java.io.IOException


Terrier 3.5. Copyright © 2004-2011 University of Glasgow