Terrier IR Platform
2.2.1

uk.ac.gla.terrier.structures.indexing
Class InvertedIndexBuilder

java.lang.Object
  extended by uk.ac.gla.terrier.structures.indexing.InvertedIndexBuilder
Direct Known Subclasses:
BlockInvertedIndexBuilder, UTFInvertedIndexBuilder

public class InvertedIndexBuilder
extends java.lang.Object

Builds an inverted index. It optionally saves term-field information as well.

Algorithm:

  1. While there are terms left:
    1. Read M term ids from lexicon, in lexicographical order
    2. Read the occurrences of these M terms into memory from the direct file
    3. Write the occurrences of these M terms to the inverted file
  2. Rewrite the lexicon, removing block frequencies, and adding inverted file offsets
  3. Write the collection statistics
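
The batching loop in step 1 can be illustrated with a small in-memory sketch. This is not Terrier's implementation: the class InvertedIndexSketch, the method invert, and the array layout used for the direct index are assumptions made for illustration, and term ids are assumed to run from 0 to numTerms - 1 in lexicon order. Steps 2 and 3 (lexicon rewrite and collection statistics) are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; none of these names exist in Terrier.
final class InvertedIndexSketch {

    /**
     * Inverts a direct index in batches of at most m terms.
     *
     * directIndex[docId] holds {termId, frequency} pairs for that document.
     * The result maps each term id to its {docId, frequency} postings.
     */
    static Map<Integer, List<int[]>> invert(int[][][] directIndex, int numTerms, int m) {
        if (m <= 0) {
            throw new IllegalArgumentException("m must be positive");
        }
        Map<Integer, List<int[]>> inverted = new TreeMap<>();
        for (int start = 0; start < numTerms; start += m) {
            int end = Math.min(start + m, numTerms);

            // Step 1.1: take the next (at most) m term ids.
            Map<Integer, List<int[]>> batch = new HashMap<>();
            for (int termId = start; termId < end; termId++) {
                batch.put(termId, new ArrayList<>());
            }

            // Step 1.2: scan the whole direct index once per batch,
            // keeping only occurrences of terms in the current batch.
            for (int docId = 0; docId < directIndex.length; docId++) {
                for (int[] posting : directIndex[docId]) {
                    List<int[]> postings = batch.get(posting[0]);
                    if (postings != null) {
                        postings.add(new int[] {docId, posting[1]});
                    }
                }
            }

            // Step 1.3: "write" the batch; a map stands in for the inverted file.
            for (int termId = start; termId < end; termId++) {
                inverted.put(termId, batch.get(termId));
            }
        }
        return inverted;
    }

    public static void main(String[] args) {
        int[][][] direct = {
            { {0, 2}, {2, 1} },   // document 0
            { {1, 3}, {2, 4} }    // document 1
        };
        Map<Integer, List<int[]>> inverted = invert(direct, 3, 2);
        for (Map.Entry<Integer, List<int[]>> e : inverted.entrySet()) {
            System.out.println("term " + e.getKey() + ": " + e.getValue().size() + " postings");
        }
    }
}
```

With 3 terms and m = 2 the direct index is scanned twice, which is the I/O cost that the choice of M controls.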

Lexicon term selection: There are two strategies for selecting the number of terms to read from the lexicon. The trade-off is to read few enough terms that all of their occurrences, read from the direct file, fit in memory. On the other hand, reading fewer terms means more iterations, which is I/O expensive, because the entire direct file has to be read on every iteration.
The two strategies are:

By default, the second strategy is chosen, unless the property invertedfile.processpointers is set to zero.
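
The trade-off can be made concrete with a small sketch. The list of strategies is not reproduced above, so the sketch assumes, going only by the property name invertedfile.processpointers, that one strategy caps the number of terms per iteration and the other caps the number of pointers per iteration. The class BatchSelectionSketch, its methods, and the parameters maxTerms and maxPointers are hypothetical and are not part of Terrier.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Terrier.
final class BatchSelectionSketch {

    /** Assumed first strategy: every iteration reads at most maxTerms terms. */
    static List<Integer> byTermCount(int numTerms, int maxTerms) {
        if (maxTerms <= 0) {
            throw new IllegalArgumentException("maxTerms must be positive");
        }
        List<Integer> batchSizes = new ArrayList<>();
        for (int start = 0; start < numTerms; start += maxTerms) {
            batchSizes.add(Math.min(maxTerms, numTerms - start));
        }
        return batchSizes;
    }

    /**
     * Assumed second strategy: terms are added to a batch until the next
     * term would push the batch over maxPointers. A batch always holds at
     * least one term, so a single very frequent term may exceed the cap.
     */
    static List<Integer> byPointerCount(int[] pointersPerTerm, long maxPointers) {
        List<Integer> batchSizes = new ArrayList<>();
        int termsInBatch = 0;
        long pointersInBatch = 0;
        for (int pointers : pointersPerTerm) {
            if (termsInBatch > 0 && pointersInBatch + pointers > maxPointers) {
                batchSizes.add(termsInBatch);
                termsInBatch = 0;
                pointersInBatch = 0;
            }
            termsInBatch++;
            pointersInBatch += pointers;
        }
        if (termsInBatch > 0) {
            batchSizes.add(termsInBatch);
        }
        return batchSizes;
    }

    /** Uses the pointer-based strategy unless maxPointers is zero. */
    static List<Integer> select(int[] pointersPerTerm, int maxTerms, long maxPointers) {
        return maxPointers == 0
                ? byTermCount(pointersPerTerm.length, maxTerms)
                : byPointerCount(pointersPerTerm, maxPointers);
    }

    public static void main(String[] args) {
        int[] pointersPerTerm = {5, 1, 1, 1, 4, 4};
        System.out.println(select(pointersPerTerm, 5, 8));
        System.out.println(select(pointersPerTerm, 5, 0));
    }
}
```

Each returned entry is the number of terms read in one iteration, so the length of the list is the number of full passes over the direct file.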

Properties:

Version:
$Revision: 1.41 $
Author:
Craig Macdonald & Vassilis Plachouras

Field Summary
 int numberOfDocuments
          The number of documents in the collection.
 long numberOfPointers
          The number of pointers in the inverted file.
 long numberOfTokens
          The number of tokens in the collection.
 int numberOfUniqueTerms
          The number of unique terms in the vocabulary.
 
Constructor Summary
InvertedIndexBuilder()
          Deprecated.  
InvertedIndexBuilder(Index i)
           
InvertedIndexBuilder(java.lang.String filename)
          Deprecated. Use this() or this(String, String)
InvertedIndexBuilder(java.lang.String Path, java.lang.String Prefix)
          Deprecated.  
 
Method Summary
 void close()
          Closes the underlying bit file.
 void createInvertedIndex()
          Creates the inverted index using the already created direct index, document index and lexicon.
static void displayMemoryUsage(java.lang.Runtime r)
           
 LexiconInputStream getLexInputStream(java.lang.String filename)
           
 LexiconOutputStream getLexOutputStream(java.lang.String filename)
           
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

numberOfUniqueTerms

public int numberOfUniqueTerms
The number of unique terms in the vocabulary.


numberOfDocuments

public int numberOfDocuments
The number of documents in the collection.


numberOfTokens

public long numberOfTokens
The number of tokens in the collection.


numberOfPointers

public long numberOfPointers
The number of pointers in the inverted file.

Constructor Detail

InvertedIndexBuilder

public InvertedIndexBuilder(java.lang.String Path,
                            java.lang.String Prefix)
Deprecated. 

Constructor of the class InvertedIndexBuilder.


InvertedIndexBuilder

public InvertedIndexBuilder(Index i)

InvertedIndexBuilder

public InvertedIndexBuilder()
Deprecated. 

A default constructor of the class InvertedIndexBuilder.


InvertedIndexBuilder

public InvertedIndexBuilder(java.lang.String filename)
Deprecated. Use this() or this(String, String)

Creates an instance of the InvertedIndexBuilder class using the given filename.

Parameters:
filename - The name of the inverted file
Method Detail

close

public void close()
           throws java.io.IOException
Closes the underlying bit file.

Throws:
java.io.IOException

createInvertedIndex

public void createInvertedIndex()
Creates the inverted index using the already created direct index, document index and lexicon.


displayMemoryUsage

public static void displayMemoryUsage(java.lang.Runtime r)
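
This method carries no description here. As a hedged illustration of what a memory report built on java.lang.Runtime could look like, the sketch below uses only the standard freeMemory(), totalMemory() and maxMemory() calls. The class MemoryUsageSketch and the method describeMemory are hypothetical, and the actual output of displayMemoryUsage may differ.

```java
// Hypothetical illustration only; not the body of displayMemoryUsage.
final class MemoryUsageSketch {

    /** Formats the JVM's current memory figures, in kilobytes. */
    static String describeMemory(Runtime r) {
        long freeKb = r.freeMemory() / 1024;    // unused part of the current heap
        long totalKb = r.totalMemory() / 1024;  // heap currently reserved by the JVM
        long maxKb = r.maxMemory() / 1024;      // upper limit the heap may grow to
        return "free: " + freeKb + " KB; total: " + totalKb + " KB; max: " + maxKb + " KB";
    }

    public static void main(String[] args) {
        System.out.println(describeMemory(Runtime.getRuntime()));
    }
}
```

A report like this, printed between iterations, is one way to check whether the chosen batch size leaves enough free heap.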

getLexInputStream

public LexiconInputStream getLexInputStream(java.lang.String filename)

getLexOutputStream

public LexiconOutputStream getLexOutputStream(java.lang.String filename)

Terrier Information Retrieval Platform 2.2.1. Copyright 2004-2008 University of Glasgow