public class BlockInvertedIndexBuilder extends InvertedIndexBuilder
Algorithm:
Lexicon term selection: There are two strategies for selecting the
number of terms to read from the lexicon. The trade-off is to read a
small enough number of terms into memory that the occurrences of all
those terms from the direct file can fit in memory. On the other hand,
reading fewer terms implies more iterations, which is I/O expensive, as
the entire direct file has to be read on every iteration.
The two strategies are:
Properties:
compressionConfig, DEFAULT_LEX_SCANNER_PROP_VALUE, externalParalllism, fieldCount, file, heapusage, index, lexiconOutputStream, lexScanClassName, logger, numberOfPointersPerIteration, processTerms, structureName, tintint_overhead, tintlist_overhead, useFieldInformation
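The trade-off described under Algorithm can be illustrated with a small, self-contained sketch. It is not Terrier code: the class and method names are hypothetical, and the two batching rules (a fixed number of terms per iteration, or a fixed budget of pointers per iteration) are an assumption suggested by the inherited field names processTerms and numberOfPointersPerIteration. Each returned batch corresponds to one full pass over the direct file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Terrier.
public class TermBatchingSketch {

    /** Assumed strategy A: read a fixed number of terms per iteration. */
    static List<int[]> batchByTermCount(int[] pointersPerTerm, int termsPerIteration) {
        List<int[]> batches = new ArrayList<>();
        for (int start = 0; start < pointersPerTerm.length; start += termsPerIteration) {
            int end = Math.min(start + termsPerIteration, pointersPerTerm.length);
            batches.add(new int[] {start, end}); // half-open range [start, end)
        }
        return batches;
    }

    /** Assumed strategy B: add terms until a pointer (occurrence) budget is reached. */
    static List<int[]> batchByPointerBudget(int[] pointersPerTerm, long pointerBudget) {
        List<int[]> batches = new ArrayList<>();
        int start = 0;
        long used = 0;
        for (int i = 0; i < pointersPerTerm.length; i++) {
            // Close the batch if this term would exceed the budget, but always
            // keep at least one term per batch so that every pass makes progress.
            if (i > start && used + pointersPerTerm[i] > pointerBudget) {
                batches.add(new int[] {start, i});
                start = i;
                used = 0;
            }
            used += pointersPerTerm[i];
        }
        if (start < pointersPerTerm.length) {
            batches.add(new int[] {start, pointersPerTerm.length});
        }
        return batches;
    }

    public static void main(String[] args) {
        // Number of postings (pointers) per term, in lexicon order.
        int[] pointersPerTerm = {5, 1, 1, 8, 2, 2};
        System.out.println(batchByTermCount(pointersPerTerm, 4).size() + " passes by term count");
        System.out.println(batchByPointerBudget(pointersPerTerm, 8).size() + " passes by pointer budget");
    }
}
```

Batching by term count makes the number of passes predictable but lets memory use vary with term frequency; batching by pointer budget bounds memory per pass at the cost of a variable number of passes.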
Constructor and Description |
---|
BlockInvertedIndexBuilder(IndexOnDisk index, String structureName, CompressionFactory.CompressionConfiguration compressionConfig): constructor |
Modifier and Type | Method and Description |
---|---|
void | createInvertedIndex(): This method creates the block inverted index. |
protected gnu.trove.TIntArrayList[] | createPointerForTerm(LexiconEntry le) |
protected org.terrier.structures.indexing.classical.InvertedIndexBuilder.LexiconScanner | getLexScanner(Iterator<Map.Entry<String,LexiconEntry>> lexStream, CollectionStatistics stats) |
static void | main(String[] args): Use this main method to recover the creation of an inverted index, should it fail. |
protected void | traverseDirectFile(gnu.trove.TIntIntHashMap codesHashMap, gnu.trove.TIntArrayList[][] tmpStorage): Traverses the direct files, recording all occurrences of terms noted in codesHashMap into tmpStorage. |
protected long[] | writeInvertedFilePart(DataOutputStream dos, gnu.trove.TIntArrayList[][] tmpStorage, int _processTerms): Writes the section of the inverted file. |
close, displayMemoryUsage, getExternalParalllism, getLexOutputStream, setExternalParalllism
public BlockInvertedIndexBuilder(IndexOnDisk index, String structureName, CompressionFactory.CompressionConfiguration compressionConfig)
index -
structureName -
protected org.terrier.structures.indexing.classical.InvertedIndexBuilder.LexiconScanner getLexScanner(Iterator<Map.Entry<String,LexiconEntry>> lexStream, CollectionStatistics stats) throws Exception
getLexScanner
in class InvertedIndexBuilder
Exception
public void createInvertedIndex()
createInvertedIndex
in class InvertedIndexBuilder
protected gnu.trove.TIntArrayList[] createPointerForTerm(LexiconEntry le)
createPointerForTerm
in class InvertedIndexBuilder
protected void traverseDirectFile(gnu.trove.TIntIntHashMap codesHashMap, gnu.trove.TIntArrayList[][] tmpStorage) throws IOException
traverseDirectFile
in class InvertedIndexBuilder
codesHashMap - contains the term ids that are being processed in this
iteration, as keys. Values are the corresponding indices in
tmpStorage where information about each term should be placed.
tmpStorage - Records the occurrence information. First dimension is for
each term, at the index given by codesHashMap; second
dimension contains fieldCount+4 TIntArrayLists: (document id, term
frequency, field0, ... fieldCount-1, block frequencies, block ids).
IOException - if there is a problem while traversing the direct index.
protected long[] writeInvertedFilePart(DataOutputStream dos, gnu.trove.TIntArrayList[][] tmpStorage, int _processTerms) throws IOException
writeInvertedFilePart
in class InvertedIndexBuilder
dos - a temporary data structure that contains the offsets in the
inverted index for each term.
tmpStorage - Occurrences information, as described in traverseDirectFile().
This data is consumed by this method: once this method has
been called, all the data in tmpStorage will be destroyed.
_processTerms - The number of terms being processed in this iteration.
IOException
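The tmpStorage layout described for traverseDirectFile can be made concrete with a minimal sketch. It is not Terrier's implementation: java.util collections stand in for gnu.trove.TIntIntHashMap and gnu.trove.TIntArrayList so that the example has no dependencies, the class, method, and constant names are hypothetical, and the slot order is taken from the parameter description (document id, term frequency, one list per field, block frequencies, block ids).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; java.util types stand in for the trove types.
public class TmpStorageSketch {
    // Slot positions assumed from the documented order of the second dimension.
    static final int DOC_IDS = 0;
    static final int TERM_FREQS = 1;
    static final int FIRST_FIELD = 2;

    /** Allocates fieldCount+4 lists for every term handled in this iteration. */
    @SuppressWarnings("unchecked")
    static List<Integer>[][] allocate(int termsInIteration, int fieldCount) {
        List<Integer>[][] storage = new List[termsInIteration][fieldCount + 4];
        for (List<Integer>[] row : storage) {
            for (int i = 0; i < row.length; i++) {
                row[i] = new ArrayList<>();
            }
        }
        return storage;
    }

    /** Records one posting read from the direct file, if its term is in this iteration. */
    static void record(Map<Integer, Integer> codes, List<Integer>[][] storage, int fieldCount,
                       int termId, int docId, int tf, int[] fieldFreqs, int[] blockIds) {
        Integer slot = codes.get(termId);
        if (slot == null) {
            return; // term was not selected for this iteration
        }
        List<Integer>[] row = storage[slot];
        row[DOC_IDS].add(docId);
        row[TERM_FREQS].add(tf);
        for (int f = 0; f < fieldCount; f++) {
            row[FIRST_FIELD + f].add(fieldFreqs[f]);
        }
        row[FIRST_FIELD + fieldCount].add(blockIds.length); // block frequency
        for (int b : blockIds) {
            row[FIRST_FIELD + fieldCount + 1].add(b);       // block ids
        }
    }

    public static void main(String[] args) {
        int fieldCount = 2;
        Map<Integer, Integer> codes = new HashMap<>();
        codes.put(17, 0); // term id 17 is stored in row 0
        codes.put(42, 1); // term id 42 is stored in row 1
        List<Integer>[][] storage = allocate(codes.size(), fieldCount);

        record(codes, storage, fieldCount, 42, 3, 2, new int[] {1, 1}, new int[] {0, 5});
        record(codes, storage, fieldCount, 99, 3, 1, new int[] {1, 0}, new int[] {2}); // ignored
        record(codes, storage, fieldCount, 42, 7, 1, new int[] {0, 1}, new int[] {4});

        System.out.println("doc ids for term 42:   " + storage[1][DOC_IDS]);
        System.out.println("block ids for term 42: " + storage[1][FIRST_FIELD + fieldCount + 1]);
    }
}
```

Because block ids for all documents of a term share one list, a reader of that list needs the block frequencies list to know how many block ids belong to each document.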
Terrier Information Retrieval Platform 5.2. Copyright © 2004-2019, University of Glasgow