org.terrier.structures.indexing
Class BlockInvertedIndexBuilder
java.lang.Object
org.terrier.structures.indexing.InvertedIndexBuilder
org.terrier.structures.indexing.BlockInvertedIndexBuilder
public class BlockInvertedIndexBuilder
- extends InvertedIndexBuilder
Builds an inverted index saving term-block information. It optionally saves
term-field information as well.
Algorithm:
- While there are terms left:
  - Read M term ids from lexicon, in lexicographical order
  - Read the occurrences of these M terms into memory from the direct file
  - Write the occurrences of these M terms to the inverted file
- Rewrite the lexicon, removing block frequencies, and adding inverted
file offsets
- Write the collection statistics
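The loop above can be illustrated with a toy model. This is a minimal sketch, not Terrier code: the class BatchedInversionSketch, the int[][] stand-in for the direct file, and the assumption that term ids are numbered 0..numTerms-1 in lexicon order are all hypothetical simplifications, and the lexicon rewrite and collection statistics steps are omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of the batched inversion loop; hypothetical, not Terrier code. */
final class BatchedInversionSketch {

    /** Number of passes over the direct file made by the last call to invert(). */
    static int passes;

    /**
     * directFile[docId] lists the term ids occurring in that document.
     * Returns term id -> document ids, built m terms at a time.
     * Assumes term ids are 0..numTerms-1 in lexicon order (a simplification).
     */
    static Map<Integer, List<Integer>> invert(int[][] directFile, int numTerms, int m) {
        Map<Integer, List<Integer>> inverted = new TreeMap<>();
        passes = 0;
        for (int first = 0; first < numTerms; first += m) {
            int last = Math.min(first + m, numTerms); // exclusive upper bound
            // In-memory storage for the terms of this batch only.
            List<List<Integer>> batch = new ArrayList<>();
            for (int t = first; t < last; t++) {
                batch.add(new ArrayList<>());
            }
            // One full pass over the direct file per batch.
            passes++;
            for (int doc = 0; doc < directFile.length; doc++) {
                for (int termId : directFile[doc]) {
                    if (termId >= first && termId < last) {
                        batch.get(termId - first).add(doc);
                    }
                }
            }
            // Stand-in for writing this batch to the inverted file.
            for (int t = first; t < last; t++) {
                inverted.put(t, batch.get(t - first));
            }
        }
        return inverted;
    }

    public static void main(String[] args) {
        int[][] directFile = { {0, 2}, {1, 2}, {0, 1, 3} };
        System.out.println(invert(directFile, 4, 2));
        System.out.println("passes over the direct file: " + passes);
    }
}
```

With 4 terms and M = 2 the direct file is read twice; with M = 1 it is read four times, which is the I/O cost that a larger M avoids at the price of more memory.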
Lexicon term selection: There are two strategies for selecting the
number of terms to read from the lexicon. The trade-off is to read few
enough terms into memory that the occurrences of all those terms from the
direct file fit in memory. On the other hand, reading fewer terms implies
more iterations, which is I/O expensive, as the entire direct file has to
be read on every iteration.
The two strategies are:
- Read a fixed number of terms on each iteration. This corresponds to
the property invertedfile.processterms.
- Read a fixed number of occurrences (pointers) on each iteration. The
number of pointers can be determined using the sum of frequencies of each
term from the lexicon. This corresponds to the property
invertedfile.processpointers.
By default, the second strategy is chosen, unless
invertedfile.processpointers is set to zero.
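As a rough illustration of how the two strategies differ, the sketch below picks the size of each batch from an array of per-term frequencies. It is not Terrier code: the class BatchSelectionSketch and its methods are hypothetical, and the stopping rule (a batch is closed as soon as its pointer total reaches the budget, so it may overshoot by the last term) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the two batch-size strategies; hypothetical, not Terrier code. */
final class BatchSelectionSketch {

    /**
     * Returns how many terms, starting at index start, go into the next batch.
     * frequencies[i] stands in for the frequency of term i in the lexicon.
     */
    static int nextBatchSize(int[] frequencies, int start,
                             int processTerms, long processPointers) {
        int remaining = frequencies.length - start;
        // Zero selects the fixed-terms strategy. Negative values are treated
        // the same way here, which is an assumption of this sketch.
        if (processPointers <= 0) {
            return Math.min(Math.max(1, processTerms), remaining);
        }
        // Pointer strategy: add terms until the pointer budget is reached.
        long pointers = 0;
        int count = 0;
        while (count < remaining && pointers < processPointers) {
            pointers += frequencies[start + count];
            count++;
        }
        return count;
    }

    /** Sizes of all batches needed to cover the whole lexicon. */
    static List<Integer> batchSizes(int[] frequencies,
                                    int processTerms, long processPointers) {
        List<Integer> sizes = new ArrayList<>();
        int start = 0;
        while (start < frequencies.length) {
            int n = nextBatchSize(frequencies, start, processTerms, processPointers);
            sizes.add(n);
            start += n;
        }
        return sizes;
    }

    public static void main(String[] args) {
        int[] frequencies = {5, 3, 4, 1, 7};
        // Pointer strategy with a budget of 8 pointers per iteration.
        System.out.println(batchSizes(frequencies, 25000, 8));
        // Fixed-terms strategy: 2 terms per iteration.
        System.out.println(batchSizes(frequencies, 2, 0));
    }
}
```

The pointer strategy adapts the batch size to how frequent the terms are, so a run of rare terms is processed in one large batch while a few very frequent terms fill a batch on their own.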
Properties:
- invertedfile.processterms - the number of terms to process in
each iteration. Defaults to 25,000.
- invertedfile.processpointers - the number of pointers to
process in each iteration. Defaults to 2,000,000. A value of 0 specifies
that invertedfile.processterms terms should be read from the lexicon in
each iteration, regardless of the number of pointers.
- Author:
- Douglas Johnson & Vassilis Plachouras & Craig Macdonald
Method Summary |
void |
createInvertedIndex()
This method creates the block html inverted index. |
protected gnu.trove.TIntArrayList[] |
createPointerForTerm(LexiconEntry le)
|
protected void |
traverseDirectFile(gnu.trove.TIntIntHashMap codesHashMap,
gnu.trove.TIntArrayList[][] tmpStorage)
Traverses the direct file, recording all occurrences of terms noted in
codesHashMap into tmpStorage. |
protected long |
writeInvertedFilePart(java.io.DataOutputStream dos,
gnu.trove.TIntArrayList[][] tmpStorage,
int processTerms)
Writes the section of the inverted file |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
finalLexiconClass
protected java.lang.String finalLexiconClass
BlockInvertedIndexBuilder
public BlockInvertedIndexBuilder(Index index,
java.lang.String structureName)
- constructor
- Parameters:
index
- structureName
-
createInvertedIndex
public void createInvertedIndex()
- This method creates the block html inverted index. The approach used is
described briefly: for a group of M terms from the lexicon we build the
inverted file and save it on disk. In this way, the number of times we
need to read the direct file is related to the parameter M, and
consequently to the size of the available memory.
- Overrides:
createInvertedIndex
in class InvertedIndexBuilder
createPointerForTerm
protected gnu.trove.TIntArrayList[] createPointerForTerm(LexiconEntry le)
- Overrides:
createPointerForTerm
in class InvertedIndexBuilder
traverseDirectFile
protected void traverseDirectFile(gnu.trove.TIntIntHashMap codesHashMap,
gnu.trove.TIntArrayList[][] tmpStorage)
throws java.io.IOException
- Traverses the direct file, recording all occurrences of terms noted in
codesHashMap into tmpStorage.
- Overrides:
traverseDirectFile
in class InvertedIndexBuilder
- Parameters:
codesHashMap
- contains the term ids that are being processed in this
iteration, as keys. Values are the corresponding index in
tmpStorage where information about this term should be placed.
tmpStorage
- Records the occurrence information. The first dimension is for
each term, at the index given by codesHashMap; the second
dimension contains fieldCount+4 TIntArrayLists: (document id, term
frequency, field 0, ..., field fieldCount-1, block frequencies, block ids).
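The layout of tmpStorage can be sketched as follows. This is an illustrative model only: it uses java.util lists in place of gnu.trove.TIntArrayList, the class TmpStorageLayoutSketch and its methods are hypothetical, and the conventions that each field list holds one value per posting and that block ids are stored as one flat list (with the block frequency giving the number of ids per posting) are assumptions rather than documented behaviour.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the tmpStorage layout; hypothetical, not Terrier code. */
final class TmpStorageLayoutSketch {

    /** Allocates fieldCount+4 empty lists for each term of the batch. */
    static List<List<List<Integer>>> allocate(int termsInBatch, int fieldCount) {
        List<List<List<Integer>>> tmp = new ArrayList<>();
        for (int t = 0; t < termsInBatch; t++) {
            List<List<Integer>> perTerm = new ArrayList<>();
            for (int i = 0; i < fieldCount + 4; i++) {
                perTerm.add(new ArrayList<>());
            }
            tmp.add(perTerm);
        }
        return tmp;
    }

    /**
     * Records one posting if termId belongs to the current batch.
     * Second-dimension indices: 0 = document id, 1 = term frequency,
     * 2..fieldCount+1 = fields, fieldCount+2 = block frequency,
     * fieldCount+3 = block ids.
     */
    static boolean record(Map<Integer, Integer> codes, List<List<List<Integer>>> tmp,
                          int fieldCount, int termId, int docId, int tf,
                          int[] fieldValues, int[] blockIds) {
        Integer slot = codes.get(termId);
        if (slot == null) {
            return false; // term is not processed in this iteration
        }
        List<List<Integer>> lists = tmp.get(slot);
        lists.get(0).add(docId);
        lists.get(1).add(tf);
        for (int f = 0; f < fieldCount; f++) {
            lists.get(2 + f).add(fieldValues[f]);
        }
        lists.get(2 + fieldCount).add(blockIds.length);
        for (int b : blockIds) {
            lists.get(3 + fieldCount).add(b);
        }
        return true;
    }

    public static void main(String[] args) {
        int fieldCount = 2;
        Map<Integer, Integer> codes = new HashMap<>();
        codes.put(42, 0); // term id 42 is stored at index 0
        codes.put(7, 1);  // term id 7 is stored at index 1
        List<List<List<Integer>>> tmp = allocate(codes.size(), fieldCount);
        record(codes, tmp, fieldCount, 42, 3, 2, new int[] {1, 0}, new int[] {5, 9});
        record(codes, tmp, fieldCount, 99, 3, 1, new int[] {0, 0}, new int[] {4});
        System.out.println(tmp.get(0));
    }
}
```

Term id 99 is ignored in this example because it has no entry in the map, mirroring the way only terms listed in codesHashMap are collected during a traversal.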
- Throws:
java.io.IOException
- if there is a problem while traversing the direct index.
writeInvertedFilePart
protected long writeInvertedFilePart(java.io.DataOutputStream dos,
gnu.trove.TIntArrayList[][] tmpStorage,
int processTerms)
throws java.io.IOException
- Writes the section of the inverted file for the terms processed in this iteration.
- Overrides:
writeInvertedFilePart
in class InvertedIndexBuilder
- Parameters:
dos
- a temporary data structure that contains the offsets in the
inverted index for each term.
tmpStorage
- Occurrences information, as described in traverseDirectFile().
This data is consumed by this method - once this method has
been called, all the data in tmpStorage will be destroyed.
processTerms
- The number of terms being processed in this iteration.
- Returns:
- the number of tokens processed in this iteration
- Throws:
java.io.IOException
Terrier 3.5. Copyright © 2004-2011 University of Glasgow