java.lang.Object org.terrier.structures.indexing.InvertedIndexBuilder
public class InvertedIndexBuilder
Builds an inverted index. It optionally saves term-field information as well.
Algorithm:
Lexicon term selection:
There are two strategies for selecting the number of terms to read from the lexicon. The trade-off is
to read few enough terms into memory that the occurrences of all those terms from
the direct file fit in memory. On the other hand, reading fewer terms implies more iterations,
which is I/O expensive, as the entire direct file has to be read in every iteration.
The two strategies are:
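Judging by the scanLexiconForTerms and scanLexiconForPointers methods documented below, an iteration can plausibly be bounded either by a term count or by a pointer count. The following is a minimal sketch of that trade-off under this assumption, not Terrier's implementation: the class BatchPlanSketch and both method names are hypothetical, and the lexicon is reduced to an array holding the number of pointers per term. Each batch in the result corresponds to one full pass over the direct file.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the two stopping rules; not part of Terrier. */
class BatchPlanSketch {

    /** Bounds each iteration by a fixed number of terms. Returns terms per batch. */
    static List<Integer> batchesByTermCount(int[] pointersPerTerm, int termsPerIteration) {
        List<Integer> batchSizes = new ArrayList<>();
        for (int start = 0; start < pointersPerTerm.length; start += termsPerIteration) {
            batchSizes.add(Math.min(termsPerIteration, pointersPerTerm.length - start));
        }
        return batchSizes;
    }

    /** Bounds each iteration by a pointer budget: a batch closes once its total reaches the budget. */
    static List<Integer> batchesByPointerCount(int[] pointersPerTerm, long pointersPerIteration) {
        List<Integer> batchSizes = new ArrayList<>();
        int termsInBatch = 0;
        long pointersInBatch = 0;
        for (int pointers : pointersPerTerm) {
            termsInBatch++;
            pointersInBatch += pointers;
            if (pointersInBatch >= pointersPerIteration) {
                batchSizes.add(termsInBatch);
                termsInBatch = 0;
                pointersInBatch = 0;
            }
        }
        if (termsInBatch > 0) {
            batchSizes.add(termsInBatch); // final, partially filled batch
        }
        return batchSizes;
    }

    public static void main(String[] args) {
        // Made-up pointer counts: two frequent terms among several rare ones.
        int[] pointersPerTerm = {1000, 5, 5, 5, 900, 10, 10, 10};
        System.out.println("by term count:    " + batchesByTermCount(pointersPerTerm, 2));
        System.out.println("by pointer count: " + batchesByPointerCount(pointersPerTerm, 1000));
    }
}
```

With a skewed term distribution, a pointer budget tends to group many rare terms into one pass, whereas a fixed term count spends a whole pass on each small group regardless of how little memory it needs.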
Properties:
Nested Class Summary | |
---|---|
protected static class |
InvertedIndexBuilder.IntLongTuple
|
Field Summary | |
---|---|
protected int |
fieldCount
|
protected BitOut |
file
The underlying bit file. |
protected Index |
index
|
protected java.lang.Class<?> |
lexiconOutputStream
Class to be used as a LexiconOutputStream. |
protected static org.apache.log4j.Logger |
logger
The logger used |
protected long |
numberOfPointersPerIteration
The number of pointers to be processed in an iteration. |
protected int |
processTerms
The number of terms for which the inverted file is built each time. |
protected java.lang.String |
structureName
|
protected boolean |
useFieldInformation
Indicates whether field information is used. |
Constructor Summary | |
---|---|
InvertedIndexBuilder(Index i,
java.lang.String _structureName)
Constructor. |
Method Summary | |
---|---|
void |
close()
Closes the underlying bit file. |
void |
createInvertedIndex()
Creates the inverted index using the already created direct index, document index and lexicon. |
protected gnu.trove.TIntArrayList[] |
createPointerForTerm(LexiconEntry le)
|
static void |
displayMemoryUsage(java.lang.Runtime r)
Displays memory usage. |
protected LexiconOutputStream<java.lang.String> |
getLexOutputStream(java.lang.String _structureName)
Gets the LexiconOutputStream. |
protected InvertedIndexBuilder.IntLongTuple |
scanLexiconForPointers(long PointersToProcess,
java.util.Iterator<java.util.Map.Entry<java.lang.String,LexiconEntry>> lexiconStream,
gnu.trove.TIntIntHashMap codesHashMap,
java.util.ArrayList<gnu.trove.TIntArrayList[]> tmpStorageStorage)
Iterates through the lexicon, until it has reached the given number of pointers |
protected InvertedIndexBuilder.IntLongTuple |
scanLexiconForTerms(int _processTerms,
java.util.Iterator<java.util.Map.Entry<java.lang.String,LexiconEntry>> lexiconStream,
gnu.trove.TIntIntHashMap codesHashMap,
gnu.trove.TIntArrayList[][] tmpStorage)
Iterates through the lexicon, until it has reached the given number of terms |
protected void |
traverseDirectFile(gnu.trove.TIntIntHashMap codesHashMap,
gnu.trove.TIntArrayList[][] tmpStorage)
Traverses the direct index and creates the inverted index entries for the terms specified in the codesHashMap and tmpStorage. |
protected long |
writeInvertedFilePart(java.io.DataOutputStream dos,
gnu.trove.TIntArrayList[][] tmpStorage,
int _processTerms)
Writes the section of the inverted file |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
protected java.lang.Class<?> lexiconOutputStream
protected static final org.apache.log4j.Logger logger
protected int fieldCount
protected boolean useFieldInformation
protected Index index
protected java.lang.String structureName
protected long numberOfPointersPerIteration
protected BitOut file
protected int processTerms
Constructor Detail |
---|
public InvertedIndexBuilder(Index i, java.lang.String _structureName)
i
-
_structureName
-
Method Detail |
---|
public void close() throws java.io.IOException
java.io.IOException
public void createInvertedIndex()
protected gnu.trove.TIntArrayList[] createPointerForTerm(LexiconEntry le)
protected InvertedIndexBuilder.IntLongTuple scanLexiconForPointers(long PointersToProcess, java.util.Iterator<java.util.Map.Entry<java.lang.String,LexiconEntry>> lexiconStream, gnu.trove.TIntIntHashMap codesHashMap, java.util.ArrayList<gnu.trove.TIntArrayList[]> tmpStorageStorage) throws java.io.IOException
PointersToProcess
- Number of pointers to stop reading the lexicon after
lexiconStream
- the lexicon input stream to read
codesHashMap
-
tmpStorageStorage
-
java.io.IOException
protected InvertedIndexBuilder.IntLongTuple scanLexiconForTerms(int _processTerms, java.util.Iterator<java.util.Map.Entry<java.lang.String,LexiconEntry>> lexiconStream, gnu.trove.TIntIntHashMap codesHashMap, gnu.trove.TIntArrayList[][] tmpStorage) throws java.io.IOException
_processTerms
- Number of terms to stop reading the lexicon after
lexiconStream
- the lexicon input stream to read
codesHashMap
- mapping from term ids to offsets in the storage array, for the terms to be processed in this iteration
tmpStorage
- place to put postings for this iteration
java.io.IOException
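The role of codesHashMap described above can be pictured with a small sketch. It is an illustration only, assuming plain java.util collections in place of the Trove types and a lexicon reduced to (term id, document frequency) pairs; LexiconScanSketch and scanForTerms are hypothetical names, not Terrier code.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of term selection for one iteration; not Terrier code. */
class LexiconScanSketch {

    /**
     * Reads at most maxTerms entries from the lexicon iterator and records, for each
     * term id, the slot it will occupy in this iteration's temporary storage.
     * Returns the number of pointers (postings) the selected terms account for.
     */
    static long scanForTerms(int maxTerms,
                             Iterator<Map.Entry<Integer, Integer>> lexicon, // termId -> document frequency
                             Map<Integer, Integer> termIdToSlot) {
        long pointers = 0;
        int slot = 0;
        while (slot < maxTerms && lexicon.hasNext()) {
            Map.Entry<Integer, Integer> entry = lexicon.next();
            termIdToSlot.put(entry.getKey(), slot++);
            pointers += entry.getValue();
        }
        return pointers;
    }

    public static void main(String[] args) {
        // Made-up lexicon: term 7 occurs in 3 documents, term 2 in 1, term 9 in 4.
        Map<Integer, Integer> lexicon = new LinkedHashMap<>();
        lexicon.put(7, 3);
        lexicon.put(2, 1);
        lexicon.put(9, 4);
        Iterator<Map.Entry<Integer, Integer>> stream = lexicon.entrySet().iterator();

        // The same iterator is reused, so each call resumes where the previous one stopped.
        while (stream.hasNext()) {
            Map<Integer, Integer> termIdToSlot = new HashMap<>();
            long pointers = scanForTerms(2, stream, termIdToSlot);
            System.out.println(termIdToSlot + " covers " + pointers + " pointers");
        }
    }
}
```

Because the lexicon is consumed through a shared iterator, successive calls partition it into disjoint groups of terms, one group per iteration.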
protected void traverseDirectFile(gnu.trove.TIntIntHashMap codesHashMap, gnu.trove.TIntArrayList[][] tmpStorage) throws java.io.IOException
tmpStorage
- TIntArrayList[][] an array of the inverted index entries to store
codesHashMap
- a mapping from the term identifiers to the index
in the tmpStorage matrix.
java.io.IOException
- if there is a problem while traversing the direct index.
protected long writeInvertedFilePart(java.io.DataOutputStream dos, gnu.trove.TIntArrayList[][] tmpStorage, int _processTerms) throws java.io.IOException
dos
- a temporary data structure that contains the offsets in the inverted
index for each term.
tmpStorage
- Occurrences information, as described in traverseDirectFile().
This data is consumed by this method: once this method has been called, all
the data in tmpStorage will be destroyed.
_processTerms
- The number of terms being processed in this iteration.
java.io.IOException
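How traverseDirectFile and writeInvertedFilePart might cooperate within one iteration can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual implementation: postings hold document ids only (no frequencies or field information), the direct index is an in-memory array of term ids per document, the inverted file is a list of int arrays rather than a bit file, and InversionPassSketch, traverseDirect and writePart are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of one inversion pass; not Terrier code. */
class InversionPassSketch {

    /** One pass over the direct index, collecting postings for the selected terms only. */
    static void traverseDirect(int[][] directIndex,
                               Map<Integer, Integer> termIdToSlot,
                               List<List<Integer>> tmpStorage) {
        for (int docId = 0; docId < directIndex.length; docId++) {
            for (int termId : directIndex[docId]) {
                Integer slot = termIdToSlot.get(termId);
                if (slot != null) { // terms outside this iteration are skipped
                    tmpStorage.get(slot).add(docId);
                }
            }
        }
    }

    /** Appends each slot's postings to the output and releases the slot. Returns pointers written. */
    static long writePart(List<int[]> invertedFile, List<List<Integer>> tmpStorage) {
        long pointers = 0;
        for (int slot = 0; slot < tmpStorage.size(); slot++) {
            List<Integer> postings = tmpStorage.get(slot);
            int[] docIds = new int[postings.size()];
            for (int i = 0; i < docIds.length; i++) {
                docIds[i] = postings.get(i);
            }
            invertedFile.add(docIds);
            pointers += docIds.length;
            tmpStorage.set(slot, null); // consumed: the temporary postings are no longer available
        }
        return pointers;
    }

    public static void main(String[] args) {
        int[][] directIndex = { {7, 2}, {2, 9}, {7} }; // made-up data: docId -> term ids
        Map<Integer, Integer> termIdToSlot = new HashMap<>();
        termIdToSlot.put(7, 0);
        termIdToSlot.put(2, 1);

        List<List<Integer>> tmpStorage = new ArrayList<>();
        tmpStorage.add(new ArrayList<>());
        tmpStorage.add(new ArrayList<>());

        traverseDirect(directIndex, termIdToSlot, tmpStorage);
        List<int[]> invertedFile = new ArrayList<>();
        long written = writePart(invertedFile, tmpStorage);
        System.out.println(written + " pointers; term 7 -> " + Arrays.toString(invertedFile.get(0)));
    }
}
```

Releasing each slot as soon as it is written mirrors the documented contract that tmpStorage is destroyed by the write step, which frees memory before the next group of terms is loaded.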
public static void displayMemoryUsage(java.lang.Runtime r)
r
-
protected LexiconOutputStream<java.lang.String> getLexOutputStream(java.lang.String _structureName) throws java.io.IOException
_structureName
-
java.io.IOException