BlockInvertedIndexBuilder (Terrier Information Retrieval Platform 5.2 API)

java.lang.Object
- org.terrier.structures.indexing.classical.InvertedIndexBuilder
- - org.terrier.structures.indexing.classical.BlockInvertedIndexBuilder

```
public class BlockInvertedIndexBuilder
extends InvertedIndexBuilder
```
Builds an inverted index saving term-block information. It optionally saves term-field information as well.
Algorithm:
1. While there are terms left:
  1. Read M term ids from lexicon, in lexicographical order
  2. Read the occurrences of these M terms into memory from the direct file
  3. Write the occurrences of these M terms to the inverted file
2. Rewrite the lexicon, removing block frequencies, and adding inverted file offsets
3. Write the collection statistics
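The looping structure of step 1 can be illustrated with a toy, in-memory sketch. This is not Terrier code: the class name `BatchInversionSketch`, the `int[][]` stand-in for the direct file, and the assumption that term ids run from 0 to `numTerms - 1` with each term listed at most once per document are all hypothetical simplifications. Steps 2 and 3 (lexicon rewrite and collection statistics) are omitted.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy multi-pass inversion; illustrates the loop only, not Terrier's implementation. */
class BatchInversionSketch {

    /**
     * directFile[docId] lists the term ids occurring in that document.
     * batchSize plays the role of M in the algorithm description.
     * Returns, for each term id, the document ids that contain it.
     */
    static List<List<Integer>> invert(int[][] directFile, int numTerms, int batchSize) {
        List<List<Integer>> invertedFile = new ArrayList<>();
        // Step 1: loop while there are terms left.
        for (int start = 0; start < numTerms; start += batchSize) {
            int end = Math.min(start + batchSize, numTerms);
            // Step 1.1: the current batch is the term ids in [start, end).
            List<List<Integer>> batch = new ArrayList<>();
            for (int t = start; t < end; t++) {
                batch.add(new ArrayList<>());
            }
            // Step 1.2: one full pass over the direct file for this batch.
            for (int docId = 0; docId < directFile.length; docId++) {
                for (int termId : directFile[docId]) {
                    if (termId >= start && termId < end) {
                        batch.get(termId - start).add(docId);
                    }
                }
            }
            // Step 1.3: "write" the batch; here it is appended to an in-memory list.
            invertedFile.addAll(batch);
        }
        return invertedFile;
    }
}
```

In this sketch every batch triggers a complete scan of `directFile`, which is the cost that the choice of M trades against memory use.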
Lexicon term selection: There are two strategies for selecting the number of terms to read from the lexicon. The trade-off is to read few enough terms that the occurrences of all of them from the direct file fit in memory. On the other hand, reading fewer terms per iteration means more iterations, which is I/O expensive, because the entire direct file has to be read on every iteration.
The two strategies are:
- Read a fixed number of terms on each iteration. This corresponds to the property invertedfile.processterms.
- Read a fixed number of occurrences (pointers) on each iteration. The number of pointers can be determined using the sum of frequencies of each term from the lexicon. This corresponds to the property invertedfile.processpointers.
By default, the second strategy is chosen, unless invertedfile.processpointers is set to zero.
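The choice between the two strategies might look like the following sketch. The class `TermBatchSelector`, its method, and the `pointerCounts` array (one pointer count per lexicon term, in lexicon order) are hypothetical names introduced for illustration. The rule that a batch always contains at least one term, even when that term alone exceeds the pointer budget, is also an assumption made so that the loop always progresses.

```java
/** Hypothetical helper that decides how many lexicon terms go into the next iteration. */
class TermBatchSelector {

    /**
     * @param pointerCounts   pointer count of each lexicon term, in lexicon order
     * @param from            index of the first term not yet processed
     * @param processTerms    fixed number of terms per iteration (first strategy)
     * @param processPointers pointer budget per iteration (second strategy); 0 disables it
     * @return the number of terms to read in the next iteration
     */
    static int nextBatchSize(int[] pointerCounts, int from, int processTerms, long processPointers) {
        int remaining = pointerCounts.length - from;
        if (processPointers == 0) {
            // First strategy: a fixed number of terms, capped by what is left.
            return Math.min(processTerms, remaining);
        }
        // Second strategy: add terms until the pointer budget is reached.
        long pointers = 0;
        int count = 0;
        while (count < remaining) {
            pointers += pointerCounts[from + count];
            count++;
            if (pointers >= processPointers) {
                break;
            }
        }
        return count;
    }
}
```

With a pointer budget, batches of rare terms contain many terms and batches of frequent terms contain few, so memory use per iteration stays roughly level.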
Properties:
- invertedfile.processterms - the number of terms to process in each iteration. Defaults to 25,000.
- invertedfile.processpointers - the number of pointers to process in each iteration. Defaults to 2,000,000. A value of 0 specifies that invertedfile.processterms terms should be read from the lexicon on each iteration, regardless of the number of pointers.
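As a rough illustration of how the two properties and their documented defaults interact, the sketch below reads them from a plain `java.util.Properties` object. The class `InvertedFileSettings` is hypothetical and stands in for Terrier's own configuration handling, which is not shown on this page.

```java
import java.util.Properties;

/** Hypothetical holder for the two properties described above. */
class InvertedFileSettings {
    final int processTerms;
    final long processPointers;

    InvertedFileSettings(Properties props) {
        // Defaults follow the values documented above: 25,000 terms and 2,000,000 pointers.
        processTerms = Integer.parseInt(
                props.getProperty("invertedfile.processterms", "25000"));
        processPointers = Long.parseLong(
                props.getProperty("invertedfile.processpointers", "2000000"));
    }

    /** The pointer-based strategy applies unless processpointers is zero. */
    boolean usesPointerStrategy() {
        return processPointers != 0;
    }
}
```

Setting `invertedfile.processpointers` to `0` in the properties passed to this sketch switches it to the fixed-term strategy.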
Author:

Douglas Johnson & Vassilis Plachouras & Craig Macdonald

Field Summary
- Fields inherited from class org.terrier.structures.indexing.classical.InvertedIndexBuilder
  compressionConfig, DEFAULT_LEX_SCANNER_PROP_VALUE, externalParalllism, fieldCount, file, heapusage, index, lexiconOutputStream, lexScanClassName, logger, numberOfPointersPerIteration, processTerms, structureName, tintint_overhead, tintlist_overhead, useFieldInformation

Constructor Summary

Constructors
Constructor and Description
`BlockInvertedIndexBuilder(IndexOnDisk index, String structureName, CompressionFactory.CompressionConfiguration compressionConfig)` constructor

Method Summary

Modifier and Type	Method and Description
`void`	`createInvertedIndex()` This method creates the block inverted index.
`protected gnu.trove.TIntArrayList[]`	`createPointerForTerm(LexiconEntry le)`
`protected org.terrier.structures.indexing.classical.InvertedIndexBuilder.LexiconScanner`	`getLexScanner(Iterator<Map.Entry<String,LexiconEntry>> lexStream, CollectionStatistics stats)`
`static void`	`main(String[] args)` Use this main method to recover the creation of an inverted index, should it fail
`protected void`	`traverseDirectFile(gnu.trove.TIntIntHashMap codesHashMap, gnu.trove.TIntArrayList[][] tmpStorage)` Traverses the direct files recording all occurrences of terms noted in codesHashMap into tmpStorage.
`protected long[]`	`writeInvertedFilePart(DataOutputStream dos, gnu.trove.TIntArrayList[][] tmpStorage, int _processTerms)` Writes the section of the inverted file

Methods inherited from class org.terrier.structures.indexing.classical.InvertedIndexBuilder
close, displayMemoryUsage, getExternalParalllism, getLexOutputStream, setExternalParalllism

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - BlockInvertedIndexBuilder
```
public BlockInvertedIndexBuilder(IndexOnDisk index,
                                 String structureName,
                                 CompressionFactory.CompressionConfiguration compressionConfig)
```
    constructor
    
    Parameters:
    
    index -
    
    structureName -
- Method Detail
  - getLexScanner
```
protected org.terrier.structures.indexing.classical.InvertedIndexBuilder.LexiconScanner getLexScanner(Iterator<Map.Entry<String,LexiconEntry>> lexStream,
                                                                                                      CollectionStatistics stats)
                                                                                               throws Exception
```
    Overrides:
    
    getLexScanner in class InvertedIndexBuilder
    
    Throws:
    
    Exception
  - createInvertedIndex
```
public void createInvertedIndex()
```
    This method creates the block inverted index. The approach used is described briefly: for a group of M terms from the lexicon we build the inverted file and save it on disk. In this way, the number of times we need to read the direct file is related to the parameter M, and consequently to the size of the available memory.
    
    Overrides:
    
    createInvertedIndex in class InvertedIndexBuilder
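The relationship between M and the number of direct file reads can be made concrete with a small calculation, assuming a fixed M per iteration (the invertedfile.processterms strategy). The class and method names here are hypothetical.

```java
/** Hypothetical estimate of how many times the direct file is read. */
class DirectFilePassEstimate {

    /** One pass per group of M terms, so the count is ceil(termCount / termsPerIteration). */
    static long passes(long termCount, long termsPerIteration) {
        if (termsPerIteration <= 0) {
            throw new IllegalArgumentException("termsPerIteration must be positive");
        }
        return (termCount + termsPerIteration - 1) / termsPerIteration;
    }
}
```

For example, a lexicon of 100,000 terms with M = 25,000 would need 4 passes over the direct file under this assumption.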
  - createPointerForTerm
```
protected gnu.trove.TIntArrayList[] createPointerForTerm(LexiconEntry le)
```
    Overrides:
    
    createPointerForTerm in class InvertedIndexBuilder
  - traverseDirectFile
```
protected void traverseDirectFile(gnu.trove.TIntIntHashMap codesHashMap,
                                  gnu.trove.TIntArrayList[][] tmpStorage)
                           throws IOException
```
    Traverses the direct files recording all occurrences of terms noted in codesHashMap into tmpStorage.
    
    Overrides:
    
    traverseDirectFile in class InvertedIndexBuilder
    
    Parameters:
    
    codesHashMap - contains the term ids that are being processed in this iteration, as keys. Values are the corresponding indices in tmpStorage where information about each term should be placed.
    
    tmpStorage - Records the occurrence information. The first dimension has one entry per term, at the index given by codesHashMap; the second dimension contains fieldCount+4 TIntArrayLists: (document id, term frequency, field 0, ..., field fieldCount-1, block frequencies, block ids).
    
    Throws:
    
    IOException - if there is a problem while traversing the direct index.
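The slot layout of tmpStorage described above can be sketched as follows. To stay self-contained, this example uses `java.util.List<Integer>` as a stand-in for `gnu.trove.TIntArrayList`. The class `TmpStorageLayoutSketch`, its helper methods, and the way block ids are appended (all block ids of a posting concatenated, with the block frequency recording how many belong to it) are assumptions for illustration, not confirmed details of the implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the fieldCount+4 slots kept for each term. */
class TmpStorageLayoutSketch {
    static final int DOC_IDS = 0;      // slot 0: document ids
    static final int TERM_FREQS = 1;   // slot 1: term frequencies
    static final int FIRST_FIELD = 2;  // slots 2 .. fieldCount+1: per-field frequencies

    static int blockFreqSlot(int fieldCount) { return 2 + fieldCount; }
    static int blockIdSlot(int fieldCount) { return 3 + fieldCount; }

    /** Allocates one list of fieldCount+4 empty slots for each term. */
    static List<List<List<Integer>>> allocate(int termCount, int fieldCount) {
        List<List<List<Integer>>> tmpStorage = new ArrayList<>();
        for (int t = 0; t < termCount; t++) {
            List<List<Integer>> perTerm = new ArrayList<>();
            for (int slot = 0; slot < fieldCount + 4; slot++) {
                perTerm.add(new ArrayList<>());
            }
            tmpStorage.add(perTerm);
        }
        return tmpStorage;
    }

    /** Records one occurrence (posting) of a term in a document. */
    static void record(List<List<Integer>> perTerm, int fieldCount,
                       int docId, int termFreq, int[] fieldFreqs, int[] blockIds) {
        perTerm.get(DOC_IDS).add(docId);
        perTerm.get(TERM_FREQS).add(termFreq);
        for (int f = 0; f < fieldCount; f++) {
            perTerm.get(FIRST_FIELD + f).add(fieldFreqs[f]);
        }
        // Assumption: block frequency = number of block ids stored for this posting.
        perTerm.get(blockFreqSlot(fieldCount)).add(blockIds.length);
        for (int blockId : blockIds) {
            perTerm.get(blockIdSlot(fieldCount)).add(blockId);
        }
    }
}
```

In this sketch, the index into the outer list corresponds to the value that codesHashMap would return for the term id.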
  - writeInvertedFilePart
```
protected long[] writeInvertedFilePart(DataOutputStream dos,
                                       gnu.trove.TIntArrayList[][] tmpStorage,
                                       int _processTerms)
                                throws IOException
```
    Writes the section of the inverted file
    
    Overrides:
    
    writeInvertedFilePart in class InvertedIndexBuilder
    
    Parameters:
    
    dos - a temporary data structure that contains the offsets in the inverted index for each term.
    
    tmpStorage - Occurrences information, as described in traverseDirectFile(). This data is consumed by this method - once this method has been called, all the data in tmpStorage will be destroyed.
    
    _processTerms - The number of terms being processed in this iteration.
    
    Returns:
    
    the number of tokens processed in this iteration and the number of bytes of temporary memory that were used
    
    Throws:
    
    IOException
  - main
```
public static void main(String[] args)
                 throws Exception
```
    Use this main method to recover the creation of an inverted index, should it fail
    
    Throws:
    
    Exception
