InvertedIndexBuilder (Terrier 3.5 API)

Overview

Package

Class

Use

Tree

Deprecated

Index

Help

PREV CLASS NEXT CLASS

FRAMES NO FRAMES

SUMMARY: NESTED | FIELD | CONSTR | METHOD

DETAIL: FIELD | CONSTR | METHOD

org.terrier.structures.indexing
Class InvertedIndexBuilder

java.lang.Object
  org.terrier.structures.indexing.InvertedIndexBuilder

Direct Known Subclasses:: BlockInvertedIndexBuilder

public class InvertedIndexBuilder
extends java.lang.Object
extends java.lang.Object

Builds an inverted index. It optionally saves term-field information as well.

Algorithm:

While there are terms left:
1. Read M term ids from lexicon, in lexicographical order
2. Read the occurrences of these M terms into memory from the direct file
3. Write the occurrences of these M terms to the inverted file
Rewrite the lexicon, removing block frequencies, and adding inverted file offsets
Write the collection statistics

Lexicon term selection: There are two strategies of selecting the number of terms to read from the lexicon. The trade-off here is to read a small enough number of terms into memory such that the occurrences of all those terms from the direct file can fit in memory. On the other hand, the less terms that are read implies more iterations, which is I/O expensive, as the entire direct file has to be read for every iteration.
The two strategies are:

Read a fixed number of terms on each iterations - this corresponds to the property invertedfile.processterms
Read a fixed number of occurrences (pointers) on each iteration. The number of pointers can be determined using the sum of frequencies of each term from the lexicon. This corresponds to the property invertedfile.processpointers.

By default, the 2nd strategy is chosen, unless the invertedfile.processpointers has a zero value specified.

Properties:

invertedfile.processterms- the number of terms to process in each iteration. Defaults to 75,000
invertedfile.processpointers - the number of pointers to process in each iteration. Defaults to 20,000,000

Author:: Craig Macdonald & Vassilis Plachouras

Nested Class Summary
`protected static class`	`InvertedIndexBuilder.IntLongTuple`

Field Summary
`protected int`	`fieldCount`
`protected BitOut`	`file` The underlying bit file.
`protected Index`	`index`
`protected java.lang.Class<?>`	`lexiconOutputStream` class to be used as a lexiconoutpustream.
`protected static org.apache.log4j.Logger`	`logger` The logger used
`protected long`	`numberOfPointersPerIteration` The number of pointers to be processed in an interation.
`protected int`	`processTerms` The number of terms for which the inverted file is built each time.
`protected java.lang.String`	`structureName`
`protected boolean`	`useFieldInformation` Indicates whether field information is used.

Constructor Summary
`InvertedIndexBuilder(Index i, java.lang.String _structureName)` contructor

Method Summary
`void`	`close()` Closes the underlying bit file.
`void`	`createInvertedIndex()` Creates the inverted index using the already created direct index, document index and lexicon.
`protected gnu.trove.TIntArrayList[]`	`createPointerForTerm(LexiconEntry le)`
`static void`	`displayMemoryUsage(java.lang.Runtime r)` display memory usage
`protected LexiconOutputStream<java.lang.String>`	`getLexOutputStream(java.lang.String _structureName)` get LexiconOutputStream
`protected InvertedIndexBuilder.IntLongTuple`	`scanLexiconForPointers(long PointersToProcess, java.util.Iterator<java.util.Map.Entry<java.lang.String,LexiconEntry>> lexiconStream, gnu.trove.TIntIntHashMap codesHashMap, java.util.ArrayList<gnu.trove.TIntArrayList[]> tmpStorageStorage)` Iterates through the lexicon, until it has reached the given number of pointers
`protected InvertedIndexBuilder.IntLongTuple`	`scanLexiconForTerms(int _processTerms, java.util.Iterator<java.util.Map.Entry<java.lang.String,LexiconEntry>> lexiconStream, gnu.trove.TIntIntHashMap codesHashMap, gnu.trove.TIntArrayList[][] tmpStorage)` Iterates through the lexicon, until it has reached the given number of terms
`protected void`	`traverseDirectFile(gnu.trove.TIntIntHashMap codesHashMap, gnu.trove.TIntArrayList[][] tmpStorage)` Traverses the direct index and creates the inverted index entries for the terms specified in the codesHashMap and tmpStorage.
`protected long`	`writeInvertedFilePart(java.io.DataOutputStream dos, gnu.trove.TIntArrayList[][] tmpStorage, int _processTerms)` Writes the section of the inverted file

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

lexiconOutputStream

protected java.lang.Class<?> lexiconOutputStream

class to be used as a lexiconoutpustream. set by this and child classes

logger

protected static final org.apache.log4j.Logger logger

The logger used

fieldCount

protected int fieldCount

useFieldInformation

protected boolean useFieldInformation

Indicates whether field information is used.

index

protected Index index

structureName

protected java.lang.String structureName

numberOfPointersPerIteration

protected long numberOfPointersPerIteration

The number of pointers to be processed in an interation. This directly corresponds to the property invertedfile.processpointers. If this property is set and > 0, then each iteration of the inverted index creation will be done to a set number of pointers, not a set number of terms, overriding invertedfile.processterms. Default is 20000000.

file

protected BitOut file

The underlying bit file.

processTerms

protected int processTerms

The number of terms for which the inverted file is built each time. The corresponding property is invertedfile.processterms and the default value is 75000. The higher the value, the greater the requirements for memory are, but the less time it takes to invert the direct file.

Constructor Detail