InvertedIndexBuilder (Terrier Information Retrieval Platform 5.2 API)

java.lang.Object
- org.terrier.structures.indexing.classical.InvertedIndexBuilder

Direct Known Subclasses:

BlockInvertedIndexBuilder
```
public class InvertedIndexBuilder
extends Object
```
Builds an inverted index. It optionally saves term-field information as well.
Algorithm:
1. While there are terms left:
  1. Read M term ids from lexicon, in lexicographical order
  2. Read the occurrences of these M terms into memory from the direct file
  3. Write the occurrences of these M terms to the inverted file
2. Rewrite the lexicon, removing block frequencies, and adding inverted file offsets
3. Write the collection statistics
Lexicon term selection: There are two strategies of selecting the number of terms to read from the lexicon. The trade-off here is to read a small enough number of terms into memory such that the occurrences of all those terms from the direct file can fit in memory. On the other hand, the less terms that are read implies more iterations, which is I/O expensive, as the entire direct file has to be read for every iteration.
The three strategies are:
- Read until an assumed amount of memory is consumed. The amount to consume is upto 80% of the free memory at startup.
- Read a fixed number of terms on each iterations - this corresponds to the property invertedfile.processterms
- Read a fixed number of occurrences (pointers) on each iteration. The number of pointers can be determined using the sum of frequencies of each term from the lexicon. This corresponds to the property invertedfile.processpointers.
By default, the 2nd strategy is chosen, unless the invertedfile.processpointers has a zero value specified.
Properties:
- invertedfile.processterms- the number of terms to process in each iteration. Defaults to 75,000
- invertedfile.processpointers - the number of pointers to process in each iteration. Defaults to 20,000,000
Author:

Craig Macdonald & Vassilis Plachouras

Field Summary

Fields
Modifier and Type	Field and Description
`protected CompressionFactory.CompressionConfiguration`	`compressionConfig`
`protected static String`	`DEFAULT_LEX_SCANNER_PROP_VALUE`
`protected int`	`externalParalllism`
`protected int`	`fieldCount`
`protected AbstractPostingOutputStream`	`file` The underlying bit file.
`protected float`	`heapusage`
`protected IndexOnDisk`	`index`
`protected Class<?>`	`lexiconOutputStream` class to be used as a lexiconoutpustream.
`protected String`	`lexScanClassName`
`protected static org.slf4j.Logger`	`logger` The logger used
`protected long`	`numberOfPointersPerIteration` The number of pointers to be processed in an interation.
`protected int`	`processTerms` The number of terms for which the inverted file is built each time.
`protected String`	`structureName`
`protected static int`	`tintint_overhead`
`protected static float`	`tintlist_overhead`
`protected boolean`	`useFieldInformation` Indicates whether field information is used.

Constructor Summary

Constructors
Constructor and Description
`InvertedIndexBuilder(IndexOnDisk i, String _structureName, CompressionFactory.CompressionConfiguration compressionConfig)` contructor

Method Summary

All Methods Static Methods Instance Methods Concrete Methods
Modifier and Type	Method and Description
`void`	`close()` Closes the underlying bit file.
`void`	`createInvertedIndex()` Creates the inverted index using the already created direct index, document index and lexicon.
`protected gnu.trove.TIntArrayList[]`	`createPointerForTerm(LexiconEntry le)`
`static void`	`displayMemoryUsage(Runtime r)` display memory usage
`int`	`getExternalParalllism()`
`protected LexiconOutputStream<String>`	`getLexOutputStream(String _structureName)` get LexiconOutputStream
`protected org.terrier.structures.indexing.classical.InvertedIndexBuilder.LexiconScanner`	`getLexScanner(Iterator<Map.Entry<String,LexiconEntry>> lexStream, CollectionStatistics stats)`
`static void`	`main(String[] args)` utility method that allows creation of an inverted index from a direct index
`void`	`setExternalParalllism(int externalParalllism)`
`protected void`	`traverseDirectFile(gnu.trove.TIntIntHashMap codesHashMap, gnu.trove.TIntArrayList[][] tmpStorage)` Traverses the direct index and creates the inverted index entries for the terms specified in the codesHashMap and tmpStorage.
`protected long[]`	`writeInvertedFilePart(DataOutputStream dos, gnu.trove.TIntArrayList[][] tmpStorage, int _processTerms)` Writes the section of the inverted file

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - tintint_overhead
```
protected static final int tintint_overhead
```
    See Also:
    
    Constant Field Values
  - tintlist_overhead
```
protected static final float tintlist_overhead
```
    See Also:
    
    Constant Field Values
  - DEFAULT_LEX_SCANNER_PROP_VALUE
```
protected static final String DEFAULT_LEX_SCANNER_PROP_VALUE
```
    See Also:
    
    Constant Field Values
  - lexiconOutputStream
```
protected Class<?> lexiconOutputStream
```
    class to be used as a lexiconoutpustream. set by this and child classes
  - logger
```
protected static final org.slf4j.Logger logger
```
    The logger used
  - fieldCount
```
protected int fieldCount
```
  - useFieldInformation
```
protected boolean useFieldInformation
```
    Indicates whether field information is used.
  - index
```
protected IndexOnDisk index
```
  - structureName
```
protected String structureName
```
  - numberOfPointersPerIteration
```
protected long numberOfPointersPerIteration
```
    The number of pointers to be processed in an interation. This directly corresponds to the property invertedfile.processpointers. If this property is set and > 0, then each iteration of the inverted index creation will be done to a set number of pointers, not a set number of terms, overriding invertedfile.processterms. Default is 20000000.
  - heapusage
```
protected float heapusage
```
  - externalParalllism
```
protected int externalParalllism
```
  - file
```
protected AbstractPostingOutputStream file
```
    The underlying bit file.
  - compressionConfig
```
protected CompressionFactory.CompressionConfiguration compressionConfig
```
  - lexScanClassName
```
protected String lexScanClassName
```
  - processTerms
```
protected int processTerms
```
    The number of terms for which the inverted file is built each time. The corresponding property is invertedfile.processterms and the default value is 75000. The higher the value, the greater the requirements for memory are, but the less time it takes to invert the direct file.
- Constructor Detail
  - InvertedIndexBuilder
```
public InvertedIndexBuilder(IndexOnDisk i,
                            String _structureName,
                            CompressionFactory.CompressionConfiguration compressionConfig)
```
    contructor
    
    Parameters:
    
    i -
    
    _structureName -
- Method Detail
  - getExternalParalllism
```
public int getExternalParalllism()
```
  - setExternalParalllism
```
public void setExternalParalllism(int externalParalllism)
```
  - getLexScanner
```
protected org.terrier.structures.indexing.classical.InvertedIndexBuilder.LexiconScanner getLexScanner(Iterator<Map.Entry<String,LexiconEntry>> lexStream,
                                                                                                      CollectionStatistics stats)
                                                                                               throws Exception
```
    Throws:
    
    Exception
  - close
```
public void close()
           throws IOException
```
    Closes the underlying bit file.
    
    Throws:
    
    IOException
  - createInvertedIndex
```
public void createInvertedIndex()
```
    Creates the inverted index using the already created direct index, document index and lexicon.
  - createPointerForTerm
```
protected gnu.trove.TIntArrayList[] createPointerForTerm(LexiconEntry le)
```
  - traverseDirectFile
```
protected void traverseDirectFile(gnu.trove.TIntIntHashMap codesHashMap,
                                  gnu.trove.TIntArrayList[][] tmpStorage)
                           throws IOException
```
    Traverses the direct index and creates the inverted index entries for the terms specified in the codesHashMap and tmpStorage.
    
    Parameters:
    
    tmpStorage - TIntArrayList[][] an array of the inverted index entries to store
    
    codesHashMap - a mapping from the term identifiers to the index in the tmpStorage matrix.
    
    Throws:
    
    IOException - if there is a problem while traversing the direct index.
  - writeInvertedFilePart
```
protected long[] writeInvertedFilePart(DataOutputStream dos,
                                       gnu.trove.TIntArrayList[][] tmpStorage,
                                       int _processTerms)
                                throws IOException
```
    Writes the section of the inverted file
    
    Parameters:
    
    dos - a temporary data structure that contains the offsets in the inverted index for each term.
    
    tmpStorage - Occurrences information, as described in traverseDirectFile(). This data is consumed by this method - once this method has been called, all the data in tmpStorage will be destroyed.
    
    _processTerms - The number of terms being processed in this iteration.
    
    Returns:
    
    the number of tokens processed in this iteration and the number of temporary bytes of RAM required
    
    Throws:
    
    IOException
  - displayMemoryUsage
```
public static void displayMemoryUsage(Runtime r)
```
    display memory usage
    
    Parameters:
    
    r -
  - getLexOutputStream
```
protected LexiconOutputStream<String> getLexOutputStream(String _structureName)
                                                  throws IOException
```
    get LexiconOutputStream
    
    Parameters:
    
    _structureName -
    
    Throws:
    
    IOException
  - main
```
public static void main(String[] args)
                 throws Exception
```
    utility method that allows creation of an inverted index from a direct index
    
    Throws:
    
    Exception

Class InvertedIndexBuilder

Field Summary

Constructor Summary

Method Summary

Methods inherited from class java.lang.Object

Field Detail

tintint_overhead

tintlist_overhead

DEFAULT_LEX_SCANNER_PROP_VALUE

lexiconOutputStream

logger

fieldCount

useFieldInformation

index

structureName

numberOfPointersPerIteration

heapusage

externalParalllism

file

compressionConfig

lexScanClassName

processTerms

Constructor Detail

InvertedIndexBuilder

Method Detail

getExternalParalllism

setExternalParalllism

getLexScanner

close

createInvertedIndex

createPointerForTerm

traverseDirectFile

writeInvertedFilePart

displayMemoryUsage

getLexOutputStream

main