public class InvertedIndexBuilder extends Object
Algorithm:
Lexicon term selection:
There are three strategies for selecting the number of terms to read from the lexicon. The trade-off
is to read few enough terms into memory that the occurrences of all those terms from
the direct file fit in memory. On the other hand, reading fewer terms per iteration means more iterations,
which is I/O expensive, as the entire direct file has to be read on every iteration.
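The trade-off described above can be illustrated with a minimal, self-contained sketch. It does not use Terrier's classes: `TermBatchSketch`, `batchByPointers`, and the `maxPointers` budget are hypothetical names chosen for illustration, and the per-term pointer counts are made-up numbers. It assumes terms are taken from the lexicon in order and grouped until a pointer budget would be exceeded, so that each group costs one full pass over the direct file.

```java
import java.util.ArrayList;
import java.util.List;

public class TermBatchSketch {
    /**
     * Groups consecutive lexicon entries into batches whose total pointer
     * count stays within maxPointers. A single term larger than the budget
     * still gets a batch of its own. Each batch implies one full pass over
     * the direct file.
     */
    static List<int[]> batchByPointers(int[] pointersPerTerm, long maxPointers) {
        List<int[]> batches = new ArrayList<>(); // each entry: {firstTerm, lastTermExclusive}
        int start = 0;
        long used = 0;
        for (int t = 0; t < pointersPerTerm.length; t++) {
            // Close the current batch if adding this term would exceed the budget.
            if (t > start && used + pointersPerTerm[t] > maxPointers) {
                batches.add(new int[] {start, t});
                start = t;
                used = 0;
            }
            used += pointersPerTerm[t];
        }
        if (start < pointersPerTerm.length) {
            batches.add(new int[] {start, pointersPerTerm.length});
        }
        return batches;
    }

    public static void main(String[] args) {
        int[] pointersPerTerm = {5, 3, 9, 1, 1, 7}; // illustrative values only
        for (long budget : new long[] {10, 100}) {
            System.out.println("budget=" + budget + " -> "
                + batchByPointers(pointersPerTerm, budget).size()
                + " pass(es) over the direct file");
        }
        // prints:
        // budget=10 -> 3 pass(es) over the direct file
        // budget=100 -> 1 pass(es) over the direct file
    }
}
```

A smaller budget keeps less data in memory per iteration but multiplies the number of direct-file scans, which is the cost the paragraph above warns about.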
The three strategies are:
Properties:
Modifier and Type | Field and Description |
---|---|
protected CompressionFactory.CompressionConfiguration | compressionConfig |
protected static String | DEFAULT_LEX_SCANNER_PROP_VALUE |
protected int | externalParalllism |
protected int | fieldCount |
protected AbstractPostingOutputStream | file: The underlying bit file. |
protected float | heapusage |
protected IndexOnDisk | index |
protected Class<?> | lexiconOutputStream: Class to be used as the LexiconOutputStream. |
protected String | lexScanClassName |
protected static org.slf4j.Logger | logger: The logger used. |
protected long | numberOfPointersPerIteration: The number of pointers to be processed in an iteration. |
protected int | processTerms: The number of terms for which the inverted file is built each time. |
protected String | structureName |
protected static int | tintint_overhead |
protected static float | tintlist_overhead |
protected boolean | useFieldInformation: Indicates whether field information is used. |

Constructor and Description |
---|
InvertedIndexBuilder(IndexOnDisk i, String _structureName, CompressionFactory.CompressionConfiguration compressionConfig): Constructor. |

Modifier and Type | Method and Description |
---|---|
void | close(): Closes the underlying bit file. |
void | createInvertedIndex(): Creates the inverted index using the already created direct index, document index and lexicon. |
protected gnu.trove.TIntArrayList[] | createPointerForTerm(LexiconEntry le) |
static void | displayMemoryUsage(Runtime r): Displays memory usage. |
int | getExternalParalllism() |
protected LexiconOutputStream<String> | getLexOutputStream(String _structureName): Gets the LexiconOutputStream. |
protected org.terrier.structures.indexing.classical.InvertedIndexBuilder.LexiconScanner | getLexScanner(Iterator<Map.Entry<String,LexiconEntry>> lexStream, CollectionStatistics stats) |
static void | main(String[] args): Utility method that allows creation of an inverted index from a direct index. |
void | setExternalParalllism(int externalParalllism) |
protected void | traverseDirectFile(gnu.trove.TIntIntHashMap codesHashMap, gnu.trove.TIntArrayList[][] tmpStorage): Traverses the direct index and creates the inverted index entries for the terms specified in the codesHashMap and tmpStorage. |
protected long[] | writeInvertedFilePart(DataOutputStream dos, gnu.trove.TIntArrayList[][] tmpStorage, int _processTerms): Writes the section of the inverted file. |

protected static final int tintint_overhead
protected static final float tintlist_overhead
protected static final String DEFAULT_LEX_SCANNER_PROP_VALUE
protected Class<?> lexiconOutputStream
protected static final org.slf4j.Logger logger
protected int fieldCount
protected boolean useFieldInformation
protected IndexOnDisk index
protected String structureName
protected long numberOfPointersPerIteration
protected float heapusage
protected int externalParalllism
protected AbstractPostingOutputStream file
protected CompressionFactory.CompressionConfiguration compressionConfig
protected String lexScanClassName
protected int processTerms
public InvertedIndexBuilder(IndexOnDisk i, String _structureName, CompressionFactory.CompressionConfiguration compressionConfig)
Parameters:
i -
_structureName -
public int getExternalParalllism()
public void setExternalParalllism(int externalParalllism)
protected org.terrier.structures.indexing.classical.InvertedIndexBuilder.LexiconScanner getLexScanner(Iterator<Map.Entry<String,LexiconEntry>> lexStream, CollectionStatistics stats) throws Exception
Throws:
Exception
public void close() throws IOException
Throws:
IOException
public void createInvertedIndex()
protected gnu.trove.TIntArrayList[] createPointerForTerm(LexiconEntry le)
protected void traverseDirectFile(gnu.trove.TIntIntHashMap codesHashMap, gnu.trove.TIntArrayList[][] tmpStorage) throws IOException
Parameters:
tmpStorage - TIntArrayList[][] an array of the inverted index entries to store
codesHashMap - a mapping from the term identifiers to the index in the tmpStorage matrix.
Throws:
IOException - if there is a problem while traversing the direct index.
protected long[] writeInvertedFilePart(DataOutputStream dos, gnu.trove.TIntArrayList[][] tmpStorage, int _processTerms) throws IOException
Parameters:
dos - a temporary data structure that contains the offsets in the inverted index for each term.
tmpStorage - Occurrences information, as described in traverseDirectFile(). This data is consumed by this method - once this method has been called, all the data in tmpStorage will be destroyed.
_processTerms - The number of terms being processed in this iteration.
Throws:
IOException
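The roles of codesHashMap and tmpStorage in traverseDirectFile() can be illustrated with a small standalone sketch. It is not Terrier's implementation: it uses java.util collections instead of the Trove types, a toy in-memory direct index instead of the on-disk direct file, and the names `DirectToInvertedSketch`, `Postings`, and `traverse` are hypothetical. It assumes the map sends each selected term id to a row number in the range 0 to size-1, and that terms absent from the map are skipped during the pass.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DirectToInvertedSketch {
    /** Postings gathered for one term: parallel lists of document ids and frequencies. */
    static final class Postings {
        final List<Integer> docIds = new ArrayList<>();
        final List<Integer> freqs = new ArrayList<>();
    }

    /**
     * One pass over a toy direct index. directIndex[docId] holds {termId, frequency}
     * pairs for that document. codes maps a selected termId to its row in the result;
     * rows are assumed to be numbered 0..codes.size()-1.
     */
    static Postings[] traverse(int[][][] directIndex, Map<Integer, Integer> codes) {
        Postings[] tmpStorage = new Postings[codes.size()];
        for (int i = 0; i < tmpStorage.length; i++) {
            tmpStorage[i] = new Postings();
        }
        for (int docId = 0; docId < directIndex.length; docId++) {
            for (int[] posting : directIndex[docId]) {
                Integer row = codes.get(posting[0]);
                if (row == null) {
                    continue; // term not selected for this iteration
                }
                tmpStorage[row].docIds.add(docId);
                tmpStorage[row].freqs.add(posting[1]);
            }
        }
        return tmpStorage;
    }

    public static void main(String[] args) {
        int[][][] directIndex = {
            { {10, 2}, {11, 1} }, // document 0
            { {11, 3} },          // document 1
            { {10, 1}, {12, 4} }  // document 2
        };
        Map<Integer, Integer> codes = new HashMap<>();
        codes.put(10, 0);
        codes.put(11, 1); // term 12 is left for a later iteration
        Postings[] result = traverse(directIndex, codes);
        for (int row = 0; row < result.length; row++) {
            System.out.println("row " + row + ": docs=" + result[row].docIds
                + " freqs=" + result[row].freqs);
        }
        // prints:
        // row 0: docs=[0, 2] freqs=[2, 1]
        // row 1: docs=[0, 1] freqs=[1, 3]
    }
}
```

Because documents are visited in increasing id order, each row's document ids come out already sorted, which is what a posting list writer would typically expect.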
public static void displayMemoryUsage(Runtime r)
Parameters:
r -
protected LexiconOutputStream<String> getLexOutputStream(String _structureName) throws IOException
Parameters:
_structureName -
Throws:
IOException
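This page does not say what displayMemoryUsage(Runtime r) reports. As a rough illustration only, the sketch below shows the standard java.lang.Runtime calls that a memory report of this kind could draw on. `MemoryUsageSketch`, `describe`, and `withinBudget` are hypothetical names, and the heap-fraction check is an assumption for illustration, not documented behaviour of this class.

```java
public class MemoryUsageSketch {
    /** Formats the free, total, maximum and used heap sizes in kilobytes. */
    static String describe(Runtime r) {
        long free = r.freeMemory();
        long total = r.totalMemory();
        long max = r.maxMemory();
        long used = total - free;
        return "free: " + (free / 1024) + "kB; total: " + (total / 1024)
            + "kB; max: " + (max / 1024) + "kB; used: " + (used / 1024) + "kB";
    }

    /**
     * Returns true while the used heap is at or below the given fraction of the
     * maximum heap. Hypothetical helper: shows how a heap-usage threshold
     * could be checked between iterations.
     */
    static boolean withinBudget(Runtime r, double heapFraction) {
        long used = r.totalMemory() - r.freeMemory();
        return used <= heapFraction * r.maxMemory();
    }

    public static void main(String[] args) {
        Runtime r = Runtime.getRuntime();
        System.out.println(describe(r));
        System.out.println("within 70% of max heap: " + withinBudget(r, 0.70));
    }
}
```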
Terrier Information Retrieval Platform 5.2. Copyright © 2004-2019, University of Glasgow