public class InvertedIndexBuilder extends Object
Algorithm:
Lexicon term selection:
There are two strategies for selecting the number of terms to read from the lexicon. The trade-off
is to read few enough terms into memory that the occurrences of all those terms from
the direct file fit in memory. On the other hand, reading fewer terms means more iterations,
which is I/O expensive, as the entire direct file has to be read on every iteration.
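The two strategies are to read a fixed number of terms per iteration (see processTerms and scanLexiconForTerms) or a fixed number of pointers per iteration (see numberOfPointersPerIteration and scanLexiconForPointers).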
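As a rough illustration of this trade-off, the sketch below splits a toy lexicon into iterations in two ways, mirroring the processTerms and numberOfPointersPerIteration fields documented on this page: a fixed number of terms per batch, and a pointer budget per batch. `LexiconBatchSketch` and its methods are hypothetical names, not Terrier API, and the rule of closing a batch once the budget has been reached is an assumption based on the description of scanLexiconForPointers.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Terrier: splits a lexicon into per-iteration batches. */
final class LexiconBatchSketch {

    /** Fixed term count: every batch holds at most termsPerBatch terms. Returns terms per batch. */
    static List<Integer> byTermCount(int[] pointersPerTerm, int termsPerBatch) {
        List<Integer> batches = new ArrayList<>();
        for (int start = 0; start < pointersPerTerm.length; start += termsPerBatch) {
            batches.add(Math.min(termsPerBatch, pointersPerTerm.length - start));
        }
        return batches;
    }

    /** Pointer budget: keep adding terms until the batch has reached pointerBudget pointers (assumed rule). */
    static List<Integer> byPointerBudget(int[] pointersPerTerm, long pointerBudget) {
        List<Integer> batches = new ArrayList<>();
        int termsInBatch = 0;
        long pointersInBatch = 0;
        for (int pointers : pointersPerTerm) {
            termsInBatch++;
            pointersInBatch += pointers;
            if (pointersInBatch >= pointerBudget) {
                batches.add(termsInBatch);
                termsInBatch = 0;
                pointersInBatch = 0;
            }
        }
        if (termsInBatch > 0) {
            batches.add(termsInBatch); // last, partially filled batch
        }
        return batches;
    }
}
```

Each batch corresponds to one full pass over the direct file, so the size of the returned list is the number of passes. With a skewed lexicon, a pointer budget tends to bound memory per pass more directly than a term count does, because a single very frequent term can dominate a batch.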
Properties:
Modifier and Type | Class and Description |
---|---|
protected static class | InvertedIndexBuilder.IntLongTuple: A tuple containing an integer (termid) and a long pointer. |
Modifier and Type | Field and Description |
---|---|
protected CompressionFactory.CompressionConfiguration | compressionConfig |
protected int | fieldCount |
protected AbstractPostingOutputStream | file: The underlying bit file. |
protected IndexOnDisk | index |
protected Class<?> | lexiconOutputStream: Class to be used as a LexiconOutputStream. |
protected static org.slf4j.Logger | logger: The logger used. |
protected long | numberOfPointersPerIteration: The number of pointers to be processed in an iteration. |
protected int | processTerms: The number of terms for which the inverted file is built each time. |
protected String | structureName |
protected boolean | useFieldInformation: Indicates whether field information is used. |
Constructor and Description |
---|
InvertedIndexBuilder(IndexOnDisk i, String _structureName, CompressionFactory.CompressionConfiguration compressionConfig): Constructor. |
Modifier and Type | Method and Description |
---|---|
void | close(): Closes the underlying bit file. |
void | createInvertedIndex(): Creates the inverted index using the already created direct index, document index and lexicon. |
protected gnu.trove.TIntArrayList[] | createPointerForTerm(LexiconEntry le) |
static void | displayMemoryUsage(Runtime r): Displays memory usage. |
protected LexiconOutputStream<String> | getLexOutputStream(String _structureName): Gets the LexiconOutputStream. |
protected InvertedIndexBuilder.IntLongTuple | scanLexiconForPointers(long PointersToProcess, Iterator<Map.Entry<String,LexiconEntry>> lexiconStream, gnu.trove.TIntIntHashMap codesHashMap, ArrayList<gnu.trove.TIntArrayList[]> tmpStorageStorage): Iterates through the lexicon until it has reached the given number of pointers. |
protected InvertedIndexBuilder.IntLongTuple | scanLexiconForTerms(int _processTerms, Iterator<Map.Entry<String,LexiconEntry>> lexiconStream, gnu.trove.TIntIntHashMap codesHashMap, gnu.trove.TIntArrayList[][] tmpStorage): Iterates through the lexicon until it has reached the given number of terms. |
protected void | traverseDirectFile(gnu.trove.TIntIntHashMap codesHashMap, gnu.trove.TIntArrayList[][] tmpStorage): Traverses the direct index and creates the inverted index entries for the terms specified in the codesHashMap and tmpStorage. |
protected long | writeInvertedFilePart(DataOutputStream dos, gnu.trove.TIntArrayList[][] tmpStorage, int _processTerms): Writes the section of the inverted file. |
protected Class<?> lexiconOutputStream
protected static final org.slf4j.Logger logger
protected int fieldCount
protected boolean useFieldInformation
protected IndexOnDisk index
protected String structureName
protected long numberOfPointersPerIteration
protected AbstractPostingOutputStream file
protected CompressionFactory.CompressionConfiguration compressionConfig
protected int processTerms
public InvertedIndexBuilder(IndexOnDisk i, String _structureName, CompressionFactory.CompressionConfiguration compressionConfig)
i -
_structureName -
public void close() throws IOException
IOException
public void createInvertedIndex()
protected gnu.trove.TIntArrayList[] createPointerForTerm(LexiconEntry le)
protected InvertedIndexBuilder.IntLongTuple scanLexiconForPointers(long PointersToProcess, Iterator<Map.Entry<String,LexiconEntry>> lexiconStream, gnu.trove.TIntIntHashMap codesHashMap, ArrayList<gnu.trove.TIntArrayList[]> tmpStorageStorage) throws IOException
PointersToProcess - Number of pointers to stop reading the lexicon after
lexiconStream - the lexicon input stream to read
codesHashMap -
tmpStorageStorage -
IOException
protected InvertedIndexBuilder.IntLongTuple scanLexiconForTerms(int _processTerms, Iterator<Map.Entry<String,LexiconEntry>> lexiconStream, gnu.trove.TIntIntHashMap codesHashMap, gnu.trove.TIntArrayList[][] tmpStorage) throws IOException
_processTerms - Number of terms to stop reading the lexicon after
lexiconStream - the lexicon input stream to read
codesHashMap - mapping from termids to offsets in the storage array, for the terms to be processed in this iteration
tmpStorage - place to put postings for this iteration
IOException
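The codesHashMap and tmpStorage parameters work as a pair: the map tells the traversal which slot of the storage array belongs to each selected term. A minimal sketch of that idea follows, using `java.util` collections in place of the Trove `TIntIntHashMap` and `TIntArrayList` types so that it is self-contained. `InversionPassSketch`, the toy `int[][]` direct index, and the choice to store only document ids are assumptions made for illustration, not Terrier's actual layout.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration, not Terrier code: one inversion pass over a toy direct index. */
final class InversionPassSketch {

    /** Gives each (distinct) term id selected for this iteration a slot 0..n-1 in the storage array. */
    static Map<Integer, Integer> assignSlots(int[] termIds) {
        Map<Integer, Integer> slots = new HashMap<>();
        for (int termId : termIds) {
            slots.put(termId, slots.size());
        }
        return slots;
    }

    /**
     * Reads the toy direct index once. directIndex[docId] lists the term ids of that document.
     * Returns, for each slot, the ids of the documents that contain the slot's term.
     */
    static List<List<Integer>> invert(int[][] directIndex, Map<Integer, Integer> slots) {
        List<List<Integer>> storage = new ArrayList<>();
        for (int i = 0; i < slots.size(); i++) {
            storage.add(new ArrayList<>());
        }
        for (int docId = 0; docId < directIndex.length; docId++) {
            for (int termId : directIndex[docId]) {
                Integer slot = slots.get(termId);
                if (slot != null) {          // terms outside this iteration are skipped
                    storage.get(slot).add(docId);
                }
            }
        }
        return storage;
    }
}
```

Because documents are visited in increasing id order, each term's list in this sketch comes out already sorted by document id, with no separate sort step.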
protected void traverseDirectFile(gnu.trove.TIntIntHashMap codesHashMap, gnu.trove.TIntArrayList[][] tmpStorage) throws IOException
tmpStorage - TIntArrayList[][] an array of the inverted index entries to store
codesHashMap - a mapping from the term identifiers to the index in the tmpStorage matrix.
IOException - if there is a problem while traversing the direct index.
protected long writeInvertedFilePart(DataOutputStream dos, gnu.trove.TIntArrayList[][] tmpStorage, int _processTerms) throws IOException
dos - a temporary data structure that contains the offsets in the inverted index for each term.
tmpStorage - Occurrences information, as described in traverseDirectFile(). This data is consumed by this method: once this method has been called, all the data in tmpStorage will be destroyed.
_processTerms - The number of terms being processed in this iteration.
IOException
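To make the "consumed by this method" contract concrete, here is a small sketch of the pattern: write each term's postings, record the offset at which they start, then drop the in-memory copy. `WritePartSketch` is a hypothetical name. The uncompressed 4-byte document ids, the 8-byte offsets, and the returned pointer count are all assumptions for illustration. The real method writes through the builder's posting output stream, and what its long return value represents is not stated on this page.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical illustration, not Terrier code: write postings, record offsets, release memory. */
final class WritePartSketch {

    /**
     * Writes storage[slot] (document ids of one term) to postingBytes, and the byte offset at
     * which each term's list starts to offsetBytes. Returns the number of document ids written.
     */
    static long writePart(ByteArrayOutputStream offsetBytes,
                          ByteArrayOutputStream postingBytes,
                          int[][] storage) {
        DataOutputStream offsets = new DataOutputStream(offsetBytes);
        DataOutputStream postings = new DataOutputStream(postingBytes);
        long pointers = 0;
        try {
            for (int slot = 0; slot < storage.length; slot++) {
                offsets.writeLong(postings.size()); // bytes written so far = start of this list
                for (int docId : storage[slot]) {
                    postings.writeInt(docId);
                    pointers++;
                }
                storage[slot] = null; // consumed: the caller must not reuse this entry
            }
            offsets.flush();
            postings.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return pointers;
    }
}
```

Clearing each slot as soon as it is written lets the garbage collector reclaim the postings before the next iteration allocates its own storage.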
public static void displayMemoryUsage(Runtime r)
r -
protected LexiconOutputStream<String> getLexOutputStream(String _structureName) throws IOException
_structureName -
IOException
Terrier Information Retrieval Platform 4.1. Copyright © 2004-2015, University of Glasgow