Class BasicSinglePassIndexer
- java.lang.Object
-
- org.terrier.structures.indexing.Indexer
-
- org.terrier.structures.indexing.classical.BasicIndexer
-
- org.terrier.structures.indexing.singlepass.BasicSinglePassIndexer
-
- Direct Known Subclasses:
BlockSinglePassIndexer, ExtensibleSinglePassIndexer, NoDuplicatesSinglePassIndexing
public class BasicSinglePassIndexer extends BasicIndexer
This class indexes a document collection, skipping construction of the direct file. It implements a single-pass algorithm that operates in two phases:
In the first phase, it traverses the document collection, passes the terms through the TermPipeline, and builds an in-memory representation of the posting lists. When main memory is exhausted, it flushes the sorted postings to disk along with the lexicon (collectively known as a run) and continues traversing the collection.
In the second phase, it merges the sorted runs (with their partial lexicons) on disk to create the final inverted file. This class follows the template pattern, so the bulk of the code is reused for block (and field) indexing. A few hook methods choose the right classes to instantiate, depending on the indexing options defined. Memory tracking is a key concern in this class. The properties below set the memory thresholds that trigger a flush, how often memory is checked, and optional maximums on the amount of memory the postings may use or on the number of documents indexed before a flush is committed.
Properties:
- memory.reserved - threshold of free memory below which a run is committed. Default is 50 000 000 (50MB) for 32-bit JVMs and 100 000 000 (100MB) for 64-bit JVMs.
- memory.heap.usage - proportion of max heap allocated to JVM before a run is committed. Default 0.70.
- indexing.singlepass.max.postings.memory - maximum amount of memory that the postings can consume before a run is committed. Default is 0, which is no limit.
- indexing.singlepass.max.documents.flush - maximum number of documents before a run is committed. Default is 0, which is no limit.
- docs.check - number of documents indexed between checks of the amount of free memory. Default is 20, i.e. memory consumption is checked every 20 documents.
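The first phase described above can be illustrated with a minimal, self-contained sketch. It does not use Terrier's classes: `RunBuilderSketch` and its members are hypothetical names, runs are kept in an in-memory list rather than written to disk, and a flush is triggered only by a document count (in the spirit of indexing.singlepass.max.documents.flush), not by memory checks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical sketch of phase one: build postings in memory and flush sorted runs. */
class RunBuilderSketch {
    // Term -> ascending document ids. A TreeMap keeps terms sorted,
    // so each flush yields a run that is already sorted by term.
    private final TreeMap<String, List<Integer>> postings = new TreeMap<>();
    // Stand-in for the run files (assumption: a real indexer writes them to disk).
    final List<TreeMap<String, List<Integer>>> runs = new ArrayList<>();
    private final int maxDocsPerFlush; // 0 means no limit
    private int docsSinceFlush = 0;
    private int nextDocId = 0;

    RunBuilderSketch(int maxDocsPerFlush) {
        this.maxDocsPerFlush = maxDocsPerFlush;
    }

    /** Adds one document's terms to the in-memory postings, flushing if the limit is hit. */
    void indexDocument(String... terms) {
        int docId = nextDocId++;
        for (String term : terms) {
            List<Integer> docIds = postings.computeIfAbsent(term, k -> new ArrayList<>());
            // Record each document at most once per term.
            if (docIds.isEmpty() || docIds.get(docIds.size() - 1) != docId) {
                docIds.add(docId);
            }
        }
        docsSinceFlush++;
        if (maxDocsPerFlush > 0 && docsSinceFlush >= maxDocsPerFlush) {
            flush();
        }
    }

    /** Commits the current postings as a run and starts a fresh in-memory structure. */
    void flush() {
        if (!postings.isEmpty()) {
            runs.add(new TreeMap<>(postings));
            postings.clear();
        }
        docsSinceFlush = 0;
    }

    public static void main(String[] args) {
        RunBuilderSketch indexer = new RunBuilderSketch(2);
        indexer.indexDocument("terrier", "index");
        indexer.indexDocument("index", "run");
        indexer.indexDocument("merge");
        indexer.flush(); // commit whatever is left as the final run
        System.out.println(indexer.runs);
    }
}
```

Because document ids only grow, the postings of a later run always follow those of an earlier run for the same term, which is what makes the second-phase merge a simple concatenation per term.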
- Author:
- Roi Blanco
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class org.terrier.structures.indexing.classical.BasicIndexer
BasicIndexer.BasicTermProcessor, BasicIndexer.FieldTermProcessor
-
-
Field Summary
Modifier and Type    Field    Description
protected java.lang.String
basicInvertedIndexPostingIteratorClass
protected int
currentFile
Number of the current run to be written to disk
protected int
currentId
Current document Id
protected int
docsPerCheck
Number of documents read per memory check
protected java.lang.String
fieldInvertedIndexPostingIteratorClass
protected java.util.Queue<java.lang.String[]>
fileNames
Queue with the file names for the runs on disk
protected java.lang.String
invertedIndexClass
what class should be used to read the generated inverted index?
protected java.lang.String
invertedIndexInputStreamClass
what class should be used to read the inverted index as a stream?
protected int
maxDocsPerFlush
protected long
maxMemory
protected long
memoryAfterFlush
Memory status after flush
protected MemoryChecker
memoryCheck
Memory Checker - provides the method for checking to see if the system is running low on memory
protected org.terrier.structures.indexing.singlepass.RunsMerger
merger
Structure for merging the run
protected org.terrier.structures.indexing.singlepass.MemoryPostings
mp
Structure that keeps the posting lists in memory
protected int
numberOfDocsSinceCheck
Number of documents read since the memory consumption was last checked
protected int
numberOfDocsSinceFlush
Number of documents read since the memory runs were last flushed to disk
protected int
numberOfDocuments
Number of documents indexed
protected long
numberOfPointers
Number of pointers indexed
protected long
numberOfTokens
Number of tokens indexed
protected int
numberOfUniqueTerms
Number of unique terms indexed
protected static java.lang.Runtime
runtime
Runtime system JVM running this instance of Terrier
-
Fields inherited from class org.terrier.structures.indexing.classical.BasicIndexer
compressionDirectConfig, compressionInvertedConfig, numOfTokensInDocument, termCodes, termFields, termsInDocument
-
Fields inherited from class org.terrier.structures.indexing.Indexer
blocks, BUILDER_BOUNDARY_DOCUMENTS, currentIndex, directIndexBuilder, docIndexBuilder, emptyDocCount, emptyDocIndexEntry, externalParalllism, fieldNames, fileNameNoExtension, IndexEmptyDocuments, invertedIndexBuilder, lexiconBuilder, logger, MAX_DOCS_PER_BUILDER, MAX_TOKENS_IN_DOCUMENT, metaBuilder, numFields, path, pipeline_first, prefix, useFieldInformation
-
-
Constructor Summary
Modifier    Constructor    Description
protected
BasicSinglePassIndexer(long a, long b, long c)
Protected do-nothing constructor for use by child classes
BasicSinglePassIndexer(java.lang.String pathname, java.lang.String prefix)
Constructs an instance of a BasicSinglePassIndexer, using the given path name for storing the data structures.
-
Method Summary
Modifier and Type    Method    Description
protected void
checkFlush()
Checks whether a flush is required, and performs one if necessary
void
createDirectIndex(Collection[] collections)
Creates the direct index, the document index and the lexicon.
protected void
createFieldRunMerger(java.lang.String[][] files)
Hook method that creates a FieldRunMerger instance
void
createInvertedIndex()
Creates the inverted index after having created the direct index, document index and lexicon.
void
createInvertedIndex(Collection[] collections)
Builds the inverted file and lexicon file for the given collections. Loops through each document in each of the collections, extracting terms and pushing these through the Term Pipeline (e.g. stemming, stopping, lowercasing).
protected void
createMemoryPostings()
Hook method that creates the right type of MemoryPostings class.
protected void
createRunMerger(java.lang.String[][] files)
Hook method that creates a RunsMerger instance
protected java.lang.String[]
finishMemoryPosting()
Adds the name of the current run + partial lexicon to be flushed to disk.
protected void
forceFlush()
protected java.lang.String[][]
getFileNames()
protected void
indexDocument(java.util.Map<java.lang.String,java.lang.String> docProperties, DocumentPostingList termsInDocument)
This adds a document to the direct and document indexes, as well as its terms to the lexicon.
protected void
load_indexer_properties()
void
performMultiWayMerge()
Uses the merger class to perform a k-way merge over a set of previously written runs.
-
Methods inherited from class org.terrier.structures.indexing.classical.BasicIndexer
createDocumentPostings, finishedInvertedIndexBuild, getEndOfPipeline
-
Methods inherited from class org.terrier.structures.indexing.Indexer
createMetaIndexBuilder, finishedDirectIndexBuild, getExternalParalllism, index, indexEmpty, init, load_builder_boundary_documents, load_field_ids, load_pipeline, main, merge, merge, mergeTwoIndices, parseInts, setExternalParalllism, useFieldInformation
-
-
-
-
Field Detail
-
currentId
protected int currentId
Current document Id
-
maxMemory
protected long maxMemory
-
memoryCheck
protected MemoryChecker memoryCheck
Memory Checker - provides the method for checking to see if the system is running low on memory
-
docsPerCheck
protected int docsPerCheck
Number of documents read per memory check
-
maxDocsPerFlush
protected int maxDocsPerFlush
-
runtime
protected static final java.lang.Runtime runtime
Runtime system JVM running this instance of Terrier
-
numberOfDocsSinceCheck
protected int numberOfDocsSinceCheck
Number of documents read since the memory consumption was last checked
-
numberOfDocsSinceFlush
protected int numberOfDocsSinceFlush
Number of documents read since the memory runs were last flushed to disk
-
memoryAfterFlush
protected long memoryAfterFlush
Memory status after flush
-
fileNames
protected java.util.Queue<java.lang.String[]> fileNames
Queue with the file names for the runs on disk
-
currentFile
protected int currentFile
Number of the current run to be written to disk
-
mp
protected org.terrier.structures.indexing.singlepass.MemoryPostings mp
Structure that keeps the posting lists in memory
-
merger
protected org.terrier.structures.indexing.singlepass.RunsMerger merger
Structure for merging the run
-
numberOfDocuments
protected int numberOfDocuments
Number of documents indexed
-
numberOfTokens
protected long numberOfTokens
Number of tokens indexed
-
numberOfUniqueTerms
protected int numberOfUniqueTerms
Number of unique terms indexed
-
numberOfPointers
protected long numberOfPointers
Number of pointers indexed
-
invertedIndexClass
protected java.lang.String invertedIndexClass
what class should be used to read the generated inverted index?
-
basicInvertedIndexPostingIteratorClass
protected java.lang.String basicInvertedIndexPostingIteratorClass
-
fieldInvertedIndexPostingIteratorClass
protected java.lang.String fieldInvertedIndexPostingIteratorClass
-
invertedIndexInputStreamClass
protected java.lang.String invertedIndexInputStreamClass
what class should be used to read the inverted index as a stream?
-
-
Constructor Detail
-
BasicSinglePassIndexer
public BasicSinglePassIndexer(java.lang.String pathname, java.lang.String prefix)
Constructs an instance of a BasicSinglePassIndexer, using the given path name for storing the data structures.
- Parameters:
pathname
- String the path where the data structures will be created. This is assumed to be absolute.
prefix
- String the prefix of the index, usually "data".
-
BasicSinglePassIndexer
protected BasicSinglePassIndexer(long a, long b, long c)
Protected do-nothing constructor for use by child classes
-
-
Method Detail
-
createDirectIndex
public void createDirectIndex(Collection[] collections)
Description copied from class: BasicIndexer
Creates the direct index, the document index and the lexicon. Loops through each document in each of the collections, extracting terms and pushing these through the Term Pipeline (e.g. stemming, stopping, lowercasing).
- Overrides:
createDirectIndex
in class BasicIndexer
- Parameters:
collections
- Collection[] the collections to be indexed.
-
createInvertedIndex
public void createInvertedIndex()
Description copied from class: BasicIndexer
Creates the inverted index after having created the direct index, document index and lexicon.
- Overrides:
createInvertedIndex
in class BasicIndexer
-
createInvertedIndex
public void createInvertedIndex(Collection[] collections)
Builds the inverted file and lexicon file for the given collections. Loops through each document in each of the collections, extracting terms and pushing these through the Term Pipeline (e.g. stemming, stopping, lowercasing).
- Parameters:
collections
- Collection[] the collections to be indexed.
-
checkFlush
protected void checkFlush() throws java.io.IOException
Checks whether a flush is required, and performs one if necessary.
- Throws:
java.io.IOException
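As a rough illustration of how the documented properties could drive such a check, here is a hypothetical sketch. `FlushPolicySketch` is not a Terrier class; the field names merely mirror the properties. Two things are assumptions of this sketch and not statements about the actual implementation: the two JVM memory thresholds are combined with AND, and they are only consulted once docs.check documents have been read since the last check.

```java
/** Hypothetical flush policy; field names mirror the properties, the logic is an assumption. */
class FlushPolicySketch {
    final long memoryReserved;    // memory.reserved: free-memory threshold in bytes
    final double heapUsage;       // memory.heap.usage: fraction of the maximum heap
    final long maxPostingsMemory; // indexing.singlepass.max.postings.memory: 0 = no limit
    final int maxDocsPerFlush;    // indexing.singlepass.max.documents.flush: 0 = no limit
    final int docsPerCheck;       // docs.check: documents between memory checks

    FlushPolicySketch(long memoryReserved, double heapUsage, long maxPostingsMemory,
                      int maxDocsPerFlush, int docsPerCheck) {
        this.memoryReserved = memoryReserved;
        this.heapUsage = heapUsage;
        this.maxPostingsMemory = maxPostingsMemory;
        this.maxDocsPerFlush = maxDocsPerFlush;
        this.docsPerCheck = docsPerCheck;
    }

    /** Returns true if, under this sketch's assumptions, the current run should be committed. */
    boolean shouldFlush(int docsSinceFlush, int docsSinceCheck, long postingsBytes,
                        long freeBytes, long allocatedBytes, long maxHeapBytes) {
        // Optional hard limits: a value of 0 disables them.
        if (maxDocsPerFlush > 0 && docsSinceFlush >= maxDocsPerFlush) {
            return true;
        }
        if (maxPostingsMemory > 0 && postingsBytes >= maxPostingsMemory) {
            return true;
        }
        // Only inspect JVM memory every docsPerCheck documents.
        if (docsSinceCheck < docsPerCheck) {
            return false;
        }
        boolean lowFreeMemory = freeBytes < memoryReserved;
        boolean heapNearlyFull = allocatedBytes > heapUsage * maxHeapBytes;
        return lowFreeMemory && heapNearlyFull; // assumption: both conditions must hold
    }

    public static void main(String[] args) {
        FlushPolicySketch policy = new FlushPolicySketch(100_000_000L, 0.70, 0L, 0, 20);
        Runtime rt = Runtime.getRuntime();
        boolean flush = policy.shouldFlush(20, 20, 0L,
                rt.freeMemory(), rt.totalMemory(), rt.maxMemory());
        System.out.println("flush now? " + flush);
    }
}
```

Passing the memory figures in as arguments, instead of reading `Runtime` inside the method, keeps the decision deterministic and easy to test.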
-
forceFlush
protected void forceFlush() throws java.io.IOException
- Throws:
java.io.IOException
-
indexDocument
protected void indexDocument(java.util.Map<java.lang.String,java.lang.String> docProperties, DocumentPostingList termsInDocument) throws java.lang.Exception
This adds a document to the direct and document indexes, as well as its terms to the lexicon. Handled internally by the methods indexFieldDocument and indexNoFieldDocument. This implementation only places content in the runs in memory, which will eventually be flushed to disk.
- Overrides:
indexDocument
in class BasicIndexer
- Parameters:
docProperties
- Map<String,String> properties of the document
termsInDocument
- DocumentPostingList the terms in the document.
- Throws:
java.lang.Exception
-
finishMemoryPosting
protected java.lang.String[] finishMemoryPosting()
Adds the name of the current run + partial lexicon to be flushed to disk.
- Returns:
- the String[] array with the names of the run and partial lexicon to write.
-
performMultiWayMerge
public void performMultiWayMerge() throws java.io.IOException
Uses the merger class to perform a k-way merge over a set of previously written runs. The file names and the number of runs are given by the fileNames queue.
- Throws:
java.io.IOException
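For readers unfamiliar with k-way merging, the idea can be sketched independently of Terrier's RunsMerger. In this hypothetical example each run is a sorted map from term to document ids, a priority queue always exposes the smallest pending term, and ties between runs are broken by run number so that postings stay in document order. `MultiWayMergeSketch` and `Cursor` are illustrative names, and runs are held in memory rather than read from files.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical k-way merge of term-sorted runs using a priority queue. */
class MultiWayMergeSketch {
    /** Read position inside one run. */
    private static final class Cursor {
        final int run;
        final Iterator<Map.Entry<String, List<Integer>>> it;
        Map.Entry<String, List<Integer>> head;

        Cursor(int run, Iterator<Map.Entry<String, List<Integer>>> it) {
            this.run = run;
            this.it = it;
            this.head = it.next(); // caller guarantees the run is non-empty
        }

        boolean advance() {
            if (!it.hasNext()) return false;
            head = it.next();
            return true;
        }
    }

    /** Merges the runs; the result iterates in term order because entries are added in that order. */
    static Map<String, List<Integer>> merge(List<SortedMap<String, List<Integer>>> runs) {
        // Smallest term first; for equal terms the earlier run goes first.
        PriorityQueue<Cursor> heap = new PriorityQueue<>(
                Comparator.comparing((Cursor c) -> c.head.getKey())
                          .thenComparingInt(c -> c.run));
        for (int i = 0; i < runs.size(); i++) {
            Iterator<Map.Entry<String, List<Integer>>> it = runs.get(i).entrySet().iterator();
            if (it.hasNext()) heap.add(new Cursor(i, it));
        }
        Map<String, List<Integer>> merged = new LinkedHashMap<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            merged.computeIfAbsent(c.head.getKey(), k -> new ArrayList<>())
                  .addAll(c.head.getValue());
            // The cursor is mutated only while it is outside the heap.
            if (c.advance()) heap.add(c);
        }
        return merged;
    }

    public static void main(String[] args) {
        SortedMap<String, List<Integer>> run0 = new TreeMap<>();
        run0.put("index", List.of(0, 1));
        run0.put("terrier", List.of(0));
        SortedMap<String, List<Integer>> run1 = new TreeMap<>();
        run1.put("index", List.of(2));
        run1.put("merge", List.of(3));
        System.out.println(merge(List.of(run0, run1)));
    }
}
```

Each step touches only the head of every run, so a file-based version of the same loop needs to keep just one entry per run in memory.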
-
getFileNames
protected java.lang.String[][] getFileNames()
- Returns:
- the String[][] structure with the name of the runs files and partial lexicons.
-
createFieldRunMerger
protected void createFieldRunMerger(java.lang.String[][] files) throws java.lang.Exception
Hook method that creates a FieldRunMerger instance
- Throws:
java.io.IOException
- if an I/O error occurs.
java.lang.Exception
-
createRunMerger
protected void createRunMerger(java.lang.String[][] files) throws java.lang.Exception
Hook method that creates a RunsMerger instance
- Throws:
java.io.IOException
- if an I/O error occurs.
java.lang.Exception
-
createMemoryPostings
protected void createMemoryPostings()
Hook method that creates the right type of MemoryPostings class.
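The hook-method arrangement can be illustrated with a small, self-contained sketch of the template pattern. None of the types below are Terrier classes: `Postings`, `BasicPostings`, `BlockPostings`, `BasicSketchIndexer` and `BlockSketchIndexer` are hypothetical stand-ins. The base class fixes the overall flow and a subclass only swaps the structure that the hook creates.

```java
/** Hypothetical illustration of a template method with an overridable hook. */
class SinglePassTemplateSketch {
    interface Postings {
        String describe();
    }

    static class BasicPostings implements Postings {
        public String describe() { return "basic postings"; }
    }

    static class BlockPostings implements Postings {
        public String describe() { return "block postings"; }
    }

    static class BasicSketchIndexer {
        protected Postings mp;

        /** Template method: the skeleton is fixed here and shared by all subclasses. */
        public final String run() {
            createMemoryPostings();
            return "indexing with " + mp.describe();
        }

        /** Hook: subclasses override this to choose a different in-memory structure. */
        protected void createMemoryPostings() {
            mp = new BasicPostings();
        }
    }

    static class BlockSketchIndexer extends BasicSketchIndexer {
        @Override
        protected void createMemoryPostings() {
            mp = new BlockPostings();
        }
    }

    public static void main(String[] args) {
        System.out.println(new BasicSketchIndexer().run());
        System.out.println(new BlockSketchIndexer().run());
    }
}
```

Declaring the template method `final` is a choice made for this sketch; it makes explicit that only the hooks are meant to vary.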
-
load_indexer_properties
protected void load_indexer_properties()
- Overrides:
load_indexer_properties
in class Indexer
-
-