java.lang.Object
  org.terrier.indexing.Indexer
    org.terrier.indexing.BasicIndexer
      org.terrier.indexing.BasicSinglePassIndexer
public class BasicSinglePassIndexer
This class indexes a document collection (skipping the direct file construction). It implements a single-pass algorithm
that operates in two phases:
First, it traverses the document collection, passes the terms through the TermPipeline, and builds an in-memory
representation of the posting lists. When it has exhausted the main memory, it flushes the sorted postings to disk, along
with the lexicon (collectively known as a run), and continues traversing the collection.
The second phase merges the sorted runs (with their partial lexicons) on disk to create the final inverted file.
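The two phases can be illustrated with a minimal, self-contained sketch. It is not Terrier's implementation: the class `SinglePassSketch`, its methods, and the `Cursor` helper are hypothetical, in-memory maps stand in for the run files on disk, and a fixed document count per run stands in for the memory check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

/** Toy single-pass indexer (hypothetical names; sorted maps stand in for runs on disk). */
class SinglePassSketch {

    /** Phase 1: build postings in memory and "flush" a sorted run every maxDocsPerRun documents. */
    static List<TreeMap<String, List<Integer>>> buildRuns(List<String> docs, int maxDocsPerRun) {
        List<TreeMap<String, List<Integer>>> runs = new ArrayList<>();
        TreeMap<String, List<Integer>> postings = new TreeMap<>();
        int docsSinceFlush = 0;
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String term : docs.get(docId).toLowerCase().split("\\s+")) {
                if (term.isEmpty()) continue;
                List<Integer> list = postings.computeIfAbsent(term, t -> new ArrayList<>());
                // Record each document at most once per term.
                if (list.isEmpty() || list.get(list.size() - 1) != docId) list.add(docId);
            }
            if (++docsSinceFlush >= maxDocsPerRun) {
                runs.add(postings);            // a real indexer would write the run to disk here
                postings = new TreeMap<>();
                docsSinceFlush = 0;
            }
        }
        if (!postings.isEmpty()) runs.add(postings);
        return runs;
    }

    /** Cursor over one sorted run; head is the entry currently offered to the merge. */
    static final class Cursor {
        final int run;
        final Iterator<Map.Entry<String, List<Integer>>> it;
        Map.Entry<String, List<Integer>> head;

        Cursor(int run, Iterator<Map.Entry<String, List<Integer>>> it) {
            this.run = run;
            this.it = it;
            this.head = it.next();
        }

        boolean advance() {
            if (!it.hasNext()) return false;
            head = it.next();
            return true;
        }
    }

    /** Phase 2: k-way merge of the sorted runs into a single inverted index. */
    static TreeMap<String, List<Integer>> mergeRuns(List<TreeMap<String, List<Integer>>> runs) {
        // Order cursors by current term, then by run number, so that the postings
        // of a term are appended in ascending document id order.
        PriorityQueue<Cursor> queue = new PriorityQueue<>(
                Comparator.comparing((Cursor c) -> c.head.getKey()).thenComparingInt(c -> c.run));
        for (int r = 0; r < runs.size(); r++) {
            Iterator<Map.Entry<String, List<Integer>>> it = runs.get(r).entrySet().iterator();
            if (it.hasNext()) queue.add(new Cursor(r, it));
        }
        TreeMap<String, List<Integer>> index = new TreeMap<>();
        while (!queue.isEmpty()) {
            Cursor c = queue.poll();
            index.computeIfAbsent(c.head.getKey(), t -> new ArrayList<>()).addAll(c.head.getValue());
            if (c.advance()) queue.add(c);
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("the cat sat", "the dog sat", "a cat ran");
        System.out.println(mergeRuns(buildRuns(docs, 2)));
    }
}
```

Because each run is already sorted by term, the merge only ever compares the current head of each run, which is what lets the second phase stream the runs rather than load them whole.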
This class follows the template method pattern, so the bulk of the code is reused for block (and field) indexing. A few hook methods
choose the right classes to instantiate, depending on the indexing options defined.
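As an illustration of the hook-method idea only, the sketch below fixes the indexing loop in one template method and lets a subclass swap the postings structure by overriding a single hook. The hook name `createMemoryPostings` mirrors the method listed on this page; every other name (`TemplateHookSketch`, `Postings`, `BasicSketchIndexer`, `BlockSketchIndexer`) is hypothetical and does not reflect Terrier's actual classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy template-method example (hypothetical classes, not Terrier's). */
class TemplateHookSketch {

    interface Postings {
        void add(String term, int docId, int position);
        List<String> entries();
    }

    /** Records only term and document id. */
    static class BasicPostings implements Postings {
        private final List<String> entries = new ArrayList<>();
        public void add(String term, int docId, int position) { entries.add(term + ":" + docId); }
        public List<String> entries() { return entries; }
    }

    /** Also records the position of the term, as a block indexer might. */
    static class BlockPostings implements Postings {
        private final List<String> entries = new ArrayList<>();
        public void add(String term, int docId, int position) {
            entries.add(term + ":" + docId + "@" + position);
        }
        public List<String> entries() { return entries; }
    }

    static class BasicSketchIndexer {
        protected Postings postings;

        /** Hook: subclasses override this to choose a different postings class. */
        protected void createMemoryPostings() { postings = new BasicPostings(); }

        /** Template method: the loop is fixed, only the hook varies. */
        public final List<String> index(List<String> docs) {
            createMemoryPostings();
            for (int docId = 0; docId < docs.size(); docId++) {
                String[] terms = docs.get(docId).trim().split("\\s+");
                for (int pos = 0; pos < terms.length; pos++) {
                    if (!terms[pos].isEmpty()) postings.add(terms[pos], docId, pos);
                }
            }
            return postings.entries();
        }
    }

    static class BlockSketchIndexer extends BasicSketchIndexer {
        @Override
        protected void createMemoryPostings() { postings = new BlockPostings(); }
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("a b", "b c");
        System.out.println(new BasicSketchIndexer().index(docs));
        System.out.println(new BlockSketchIndexer().index(docs));
    }
}
```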
Memory tracking is a key concern in this class. Four properties are provided: one for checking the amount of memory consumed, one for how regularly the memory is checked, and two (optional) maximums, on the amount of memory that can be used for the postings and on the number of documents indexed before a flush is committed.
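One way such a policy could be wired together is sketched below. The field names `docsPerCheck`, `maxDocsPerFlush`, `maxMemory`, `numberOfDocsSinceCheck` and `numberOfDocsSinceFlush` are borrowed from the field list on this page, but the class `FlushPolicySketch`, the exact flush rule, and the convention that 0 means "no limit" are assumptions made for illustration, not Terrier's documented behaviour.

```java
import java.util.function.LongSupplier;

/** Toy flush policy (hypothetical class; the rule itself is an assumption). */
class FlushPolicySketch {
    private final int docsPerCheck;       // documents between two memory checks
    private final int maxDocsPerFlush;    // assumed: 0 means no limit
    private final long maxMemory;         // bytes; assumed: 0 means no limit
    private final LongSupplier usedMemory;
    private int numberOfDocsSinceCheck = 0;
    private int numberOfDocsSinceFlush = 0;

    FlushPolicySketch(int docsPerCheck, int maxDocsPerFlush, long maxMemory, LongSupplier usedMemory) {
        this.docsPerCheck = docsPerCheck;
        this.maxDocsPerFlush = maxDocsPerFlush;
        this.maxMemory = maxMemory;
        this.usedMemory = usedMemory;
    }

    /** Memory currently used by this JVM, computed from java.lang.Runtime. */
    static long jvmUsedMemory() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Call once per indexed document; returns true when the caller should flush a run. */
    boolean documentIndexed() {
        numberOfDocsSinceCheck++;
        numberOfDocsSinceFlush++;
        // The document cap is checked on every document.
        boolean flush = maxDocsPerFlush > 0 && numberOfDocsSinceFlush >= maxDocsPerFlush;
        // Memory is only sampled every docsPerCheck documents.
        if (!flush && numberOfDocsSinceCheck >= docsPerCheck) {
            numberOfDocsSinceCheck = 0;
            flush = maxMemory > 0 && usedMemory.getAsLong() >= maxMemory;
        }
        if (flush) {
            numberOfDocsSinceCheck = 0;
            numberOfDocsSinceFlush = 0;
        }
        return flush;
    }

    public static void main(String[] args) {
        FlushPolicySketch policy =
                new FlushPolicySketch(20, 1000, 512L * 1024 * 1024, FlushPolicySketch::jvmUsedMemory);
        int flushes = 0;
        for (int doc = 0; doc < 2500; doc++) {
            if (policy.documentIndexed()) flushes++;
        }
        System.out.println("flushes: " + flushes);
    }
}
```

Sampling memory only every few documents keeps the check cheap, while the document cap gives a deterministic upper bound on run size regardless of what the memory reading reports.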
Properties:
Nested Class Summary |
---|
Nested classes/interfaces inherited from class org.terrier.indexing.BasicIndexer |
---|
BasicIndexer.BasicTermProcessor, BasicIndexer.FieldTermProcessor |
Field Summary | |
---|---|
protected java.lang.String |
basicInvertedIndexPostingIteratorClass
|
protected int |
currentFile
Number of the current run to be written to disk |
protected int |
currentId
Current document Id |
protected int |
docsPerCheck
Number of documents read per memory check |
protected java.lang.String |
fieldInvertedIndexPostingIteratorClass
|
protected java.util.Queue<java.lang.String[]> |
fileNames
Queue of the file names of the runs on disk |
protected java.lang.String |
invertedIndexClass
Name of the class used to read the generated inverted index |
protected java.lang.String |
invertedIndexInputStreamClass
Name of the class used to read the inverted index as a stream |
protected int |
maxDocsPerFlush
|
protected long |
maxMemory
|
protected long |
memoryAfterFlush
Memory status after flush |
protected MemoryChecker |
memoryCheck
Memory Checker - provides the method for checking to see if the system is running low on memory |
protected RunsMerger |
merger
Structure for merging the runs |
protected MemoryPostings |
mp
Structure that keeps the posting lists in memory |
protected int |
numberOfDocsSinceCheck
Number of documents read since the memory consumption was last checked |
protected int |
numberOfDocsSinceFlush
Number of documents read since the memory runs were last flushed to disk |
protected int |
numberOfDocuments
Number of documents indexed |
protected long |
numberOfPointers
Number of pointers indexed |
protected long |
numberOfTokens
Number of tokens indexed |
protected int |
numberOfUniqueTerms
Number of unique terms indexed |
protected static java.lang.Runtime |
runtime
Runtime of the JVM running this instance of Terrier |
Fields inherited from class org.terrier.indexing.BasicIndexer |
---|
numOfTokensInDocument, termFields, termsInDocument |
Constructor Summary | |
---|---|
protected |
BasicSinglePassIndexer(long a,
long b,
long c)
Protected do-nothing constructor for use by child classes |
|
BasicSinglePassIndexer(java.lang.String pathname,
java.lang.String prefix)
Constructs an instance of a BasicSinglePassIndexer, using the given path name for storing the data structures. |
Method Summary | |
---|---|
protected void |
checkFlush()
Checks whether a flush is required, and performs it if necessary |
void |
createDirectIndex(Collection[] collections)
Creates the direct index, the document index and the lexicon. |
protected void |
createFieldRunMerger(java.lang.String[][] files)
Hook method that creates a FieldRunMerger instance |
void |
createInvertedIndex()
Creates the inverted index after having created the direct index, document index and lexicon. |
void |
createInvertedIndex(Collection[] collections)
Builds the inverted file and lexicon file for the given collections. Loops through each document in each of the collections, extracting terms and pushing these through the Term Pipeline (e.g. stemming, stopping, lowercasing). |
protected void |
createMemoryPostings()
Hook method that creates the right type of MemoryPostings class. |
protected void |
createRunMerger(java.lang.String[][] files)
Hook method that creates a RunsMerger instance |
protected java.lang.String[] |
finishMemoryPosting()
Adds the name of the current run + partial lexicon to be flushed to disk. |
protected void |
forceFlush()
|
protected java.lang.String[][] |
getFileNames()
|
protected void |
indexDocument(java.util.Map<java.lang.String,java.lang.String> docProperties,
DocumentPostingList termsInDocument)
This adds a document to the direct and document indexes, as well as its terms to the lexicon. |
protected void |
load_indexer_properties()
|
void |
performMultiWayMerge()
Uses the merger class to perform a k-way merge over a set of previously written runs. |
Methods inherited from class org.terrier.indexing.BasicIndexer |
---|
createDocumentPostings, finishedInvertedIndexBuild, getEndOfPipeline |
Methods inherited from class org.terrier.indexing.Indexer |
---|
createMetaIndexBuilder, finishedDirectIndexBuild, index, indexEmpty, init, load_builder_boundary_documents, load_field_ids, load_pipeline, main, merge, merge, mergeTwoIndices, parseInts, useFieldInformation |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
protected int currentId
protected long maxMemory
protected MemoryChecker memoryCheck
protected int docsPerCheck
protected int maxDocsPerFlush
protected static final java.lang.Runtime runtime
protected int numberOfDocsSinceCheck
protected int numberOfDocsSinceFlush
protected long memoryAfterFlush
protected java.util.Queue<java.lang.String[]> fileNames
protected int currentFile
protected MemoryPostings mp
protected RunsMerger merger
protected int numberOfDocuments
protected long numberOfTokens
protected int numberOfUniqueTerms
protected long numberOfPointers
protected java.lang.String invertedIndexClass
protected java.lang.String basicInvertedIndexPostingIteratorClass
protected java.lang.String fieldInvertedIndexPostingIteratorClass
protected java.lang.String invertedIndexInputStreamClass
Constructor Detail |
---|
public BasicSinglePassIndexer(java.lang.String pathname, java.lang.String prefix)
pathname
- String the path where the data structures will be created. This is assumed to be absolute.
prefix
- String the prefix of the index, usually "data".
protected BasicSinglePassIndexer(long a, long b, long c)
Method Detail |
---|
public void createDirectIndex(Collection[] collections)
BasicIndexer
createDirectIndex
in class BasicIndexer
collections
- Collection[] the collections to be indexed.
public void createInvertedIndex()
BasicIndexer
createInvertedIndex
in class BasicIndexer
public void createInvertedIndex(Collection[] collections)
collections
- Collection[] the collections to be indexed.
protected void checkFlush() throws java.io.IOException
java.io.IOException
protected void forceFlush() throws java.io.IOException
java.io.IOException
protected void indexDocument(java.util.Map<java.lang.String,java.lang.String> docProperties, DocumentPostingList termsInDocument) throws java.lang.Exception
indexDocument
in class BasicIndexer
docProperties
- Map<String,String> the properties of the document.
termsInDocument
- DocumentPostingList the terms in the document.
java.lang.Exception
protected java.lang.String[] finishMemoryPosting()
public void performMultiWayMerge() throws java.io.IOException
java.io.IOException
protected java.lang.String[][] getFileNames()
protected void createFieldRunMerger(java.lang.String[][] files) throws java.lang.Exception
java.io.IOException
- if an I/O error occurs.
java.lang.Exception
protected void createRunMerger(java.lang.String[][] files) throws java.lang.Exception
java.io.IOException
- if an I/O error occurs.
java.lang.Exception
protected void createMemoryPostings()
protected void load_indexer_properties()
load_indexer_properties
in class Indexer