Class InvertedIndexBuilder
- java.lang.Object
-
- org.terrier.structures.indexing.classical.InvertedIndexBuilder
-
- Direct Known Subclasses:
BlockInvertedIndexBuilder
public class InvertedIndexBuilder extends java.lang.Object
Builds an inverted index. It optionally saves term-field information as well.Algorithm:
- While there are terms left:
- Read M term ids from lexicon, in lexicographical order
- Read the occurrences of these M terms into memory from the direct file
- Write the occurrences of these M terms to the inverted file
- Rewrite the lexicon, removing block frequencies, and adding inverted file offsets
- Write the collection statistics
Lexicon term selection: There are two strategies of selecting the number of terms to read from the lexicon. The trade-off here is to read a small enough number of terms into memory such that the occurrences of all those terms from the direct file can fit in memory. On the other hand, the less terms that are read implies more iterations, which is I/O expensive, as the entire direct file has to be read for every iteration.
The three strategies are:- Read until an assumed amount of memory is consumed. The amount to consume is upto 80% of the free memory at startup.
- Read a fixed number of terms on each iterations - this corresponds to the property invertedfile.processterms
- Read a fixed number of occurrences (pointers) on each iteration. The number of pointers can be determined using the sum of frequencies of each term from the lexicon. This corresponds to the property invertedfile.processpointers.
Properties:
- invertedfile.processterms- the number of terms to process in each iteration. Defaults to 75,000
- invertedfile.processpointers - the number of pointers to process in each iteration. Defaults to 20,000,000
- Author:
- Craig Macdonald & Vassilis Plachouras
-
-
Field Summary
Fields Modifier and Type Field Description protected CompressionFactory.CompressionConfiguration
compressionConfig
protected static java.lang.String
DEFAULT_LEX_SCANNER_PROP_VALUE
protected int
externalParalllism
protected int
fieldCount
protected AbstractPostingOutputStream
file
The underlying bit file.protected float
heapusage
protected IndexOnDisk
index
protected java.lang.Class<?>
lexiconOutputStream
class to be used as a lexiconoutpustream.protected java.lang.String
lexScanClassName
protected static org.slf4j.Logger
logger
The logger usedprotected static boolean
MEM_SIZE_TRACE
protected long
numberOfPointersPerIteration
The number of pointers to be processed in an interation.protected int
processTerms
The number of terms for which the inverted file is built each time.protected java.lang.String
structureName
protected static int
tintint_overhead
protected static float
tintlist_overhead
protected boolean
useFieldInformation
Indicates whether field information is used.
-
Constructor Summary
Constructors Constructor Description InvertedIndexBuilder(IndexOnDisk i, java.lang.String _structureName, CompressionFactory.CompressionConfiguration compressionConfig)
contructor
-
Method Summary
All Methods Static Methods Instance Methods Concrete Methods Modifier and Type Method Description void
close()
Closes the underlying bit file.void
createInvertedIndex()
Creates the inverted index using the already created direct index, document index and lexicon.protected gnu.trove.TIntArrayList[]
createPointerForTerm(LexiconEntry le)
static void
displayMemoryUsage(java.lang.Runtime r)
display memory usageint
getExternalParalllism()
protected LexiconOutputStream<java.lang.String>
getLexOutputStream(java.lang.String _structureName)
get LexiconOutputStreamprotected org.terrier.structures.indexing.classical.InvertedIndexBuilder.LexiconScanner
getLexScanner(java.util.Iterator<java.util.Map.Entry<java.lang.String,LexiconEntry>> lexStream, CollectionStatistics stats)
static void
main(java.lang.String[] args)
utility method that allows creation of an inverted index from a direct indexvoid
setExternalParalllism(int externalParalllism)
protected void
traverseDirectFile(gnu.trove.TIntIntHashMap codesHashMap, gnu.trove.TIntArrayList[][] tmpStorage)
Traverses the direct index and creates the inverted index entries for the terms specified in the codesHashMap and tmpStorage.protected long[]
writeInvertedFilePart(java.io.DataOutputStream dos, gnu.trove.TIntArrayList[][] tmpStorage, int _processTerms)
Writes the section of the inverted file
-
-
-
Field Detail
-
MEM_SIZE_TRACE
protected static final boolean MEM_SIZE_TRACE
- See Also:
- Constant Field Values
-
tintint_overhead
protected static final int tintint_overhead
- See Also:
- Constant Field Values
-
tintlist_overhead
protected static final float tintlist_overhead
- See Also:
- Constant Field Values
-
DEFAULT_LEX_SCANNER_PROP_VALUE
protected static final java.lang.String DEFAULT_LEX_SCANNER_PROP_VALUE
- See Also:
- Constant Field Values
-
lexiconOutputStream
protected java.lang.Class<?> lexiconOutputStream
class to be used as a lexiconoutpustream. set by this and child classes
-
logger
protected static final org.slf4j.Logger logger
The logger used
-
fieldCount
protected int fieldCount
-
useFieldInformation
protected boolean useFieldInformation
Indicates whether field information is used.
-
index
protected IndexOnDisk index
-
structureName
protected java.lang.String structureName
-
numberOfPointersPerIteration
protected long numberOfPointersPerIteration
The number of pointers to be processed in an interation. This directly corresponds to the property invertedfile.processpointers. If this property is set and > 0, then each iteration of the inverted index creation will be done to a set number of pointers, not a set number of terms, overriding invertedfile.processterms. Default is 20000000.
-
heapusage
protected float heapusage
-
externalParalllism
protected int externalParalllism
-
file
protected AbstractPostingOutputStream file
The underlying bit file.
-
compressionConfig
protected CompressionFactory.CompressionConfiguration compressionConfig
-
lexScanClassName
protected java.lang.String lexScanClassName
-
processTerms
protected int processTerms
The number of terms for which the inverted file is built each time. The corresponding property is invertedfile.processterms and the default value is 75000. The higher the value, the greater the requirements for memory are, but the less time it takes to invert the direct file.
-
-
Constructor Detail
-
InvertedIndexBuilder
public InvertedIndexBuilder(IndexOnDisk i, java.lang.String _structureName, CompressionFactory.CompressionConfiguration compressionConfig)
contructor- Parameters:
i
-_structureName
-
-
-
Method Detail
-
getExternalParalllism
public int getExternalParalllism()
-
setExternalParalllism
public void setExternalParalllism(int externalParalllism)
-
getLexScanner
protected org.terrier.structures.indexing.classical.InvertedIndexBuilder.LexiconScanner getLexScanner(java.util.Iterator<java.util.Map.Entry<java.lang.String,LexiconEntry>> lexStream, CollectionStatistics stats) throws java.lang.Exception
- Throws:
java.lang.Exception
-
close
public void close() throws java.io.IOException
Closes the underlying bit file.- Throws:
java.io.IOException
-
createInvertedIndex
public void createInvertedIndex()
Creates the inverted index using the already created direct index, document index and lexicon.
-
createPointerForTerm
protected gnu.trove.TIntArrayList[] createPointerForTerm(LexiconEntry le)
-
traverseDirectFile
protected void traverseDirectFile(gnu.trove.TIntIntHashMap codesHashMap, gnu.trove.TIntArrayList[][] tmpStorage) throws java.io.IOException
Traverses the direct index and creates the inverted index entries for the terms specified in the codesHashMap and tmpStorage.- Parameters:
tmpStorage
- TIntArrayList[][] an array of the inverted index entries to storecodesHashMap
- a mapping from the term identifiers to the index in the tmpStorage matrix.- Throws:
java.io.IOException
- if there is a problem while traversing the direct index.
-
writeInvertedFilePart
protected long[] writeInvertedFilePart(java.io.DataOutputStream dos, gnu.trove.TIntArrayList[][] tmpStorage, int _processTerms) throws java.io.IOException
Writes the section of the inverted file- Parameters:
dos
- a temporary data structure that contains the offsets in the inverted index for each term.tmpStorage
- Occurrences information, as described in traverseDirectFile(). This data is consumed by this method - once this method has been called, all the data in tmpStorage will be destroyed._processTerms
- The number of terms being processed in this iteration.- Returns:
- the number of tokens processed in this iteration and the number of temporary bytes of RAM required
- Throws:
java.io.IOException
-
displayMemoryUsage
public static void displayMemoryUsage(java.lang.Runtime r)
display memory usage- Parameters:
r
-
-
getLexOutputStream
protected LexiconOutputStream<java.lang.String> getLexOutputStream(java.lang.String _structureName) throws java.io.IOException
get LexiconOutputStream- Parameters:
_structureName
-- Throws:
java.io.IOException
-
main
public static void main(java.lang.String[] args) throws java.lang.Exception
utility method that allows creation of an inverted index from a direct index- Throws:
java.lang.Exception
-
-