public class BasicSinglePassIndexer extends BasicIndexer
Memory tracking is a key concern in this class. Four properties control how much memory may be consumed, how often memory usage is checked, and (optionally) the maximum amount of memory that can be used for the postings and the maximum number of documents indexed before a flush is committed.
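The interplay between the check interval and the two optional limits can be sketched as follows. This is a minimal illustration, not Terrier's implementation: the class `FlushPolicySketch`, the method `documentIndexed()`, and the convention that `0` means "no limit" are all assumptions made for the example; only the counter names are borrowed from the fields documented on this page.

```java
import java.util.function.LongSupplier;

/** Hypothetical sketch of a "check memory every N documents" flush policy. Not Terrier code. */
class FlushPolicySketch {
    final int docsPerCheck;        // documents read between two memory checks
    final int maxDocsPerFlush;     // assumption: 0 means "no document limit"
    final long maxMemory;          // assumption: 0 means "no memory limit"
    final LongSupplier usedMemory; // how memory use is measured (injectable for testing)

    int numberOfDocsSinceCheck = 0;
    int numberOfDocsSinceFlush = 0;

    FlushPolicySketch(int docsPerCheck, int maxDocsPerFlush, long maxMemory, LongSupplier usedMemory) {
        this.docsPerCheck = docsPerCheck;
        this.maxDocsPerFlush = maxDocsPerFlush;
        this.maxMemory = maxMemory;
        this.usedMemory = usedMemory;
    }

    /** Call once per indexed document; returns true when the in-memory postings should be flushed. */
    boolean documentIndexed() {
        numberOfDocsSinceCheck++;
        numberOfDocsSinceFlush++;
        // The document limit is cheap to test, so it is evaluated for every document.
        boolean flush = maxDocsPerFlush > 0 && numberOfDocsSinceFlush >= maxDocsPerFlush;
        // Memory is only measured once every docsPerCheck documents.
        if (!flush && numberOfDocsSinceCheck >= docsPerCheck) {
            numberOfDocsSinceCheck = 0;
            flush = maxMemory > 0 && usedMemory.getAsLong() > maxMemory;
        }
        if (flush) {
            numberOfDocsSinceCheck = 0;
            numberOfDocsSinceFlush = 0;
        }
        return flush;
    }

    /** Heap currently in use by this JVM, measured through java.lang.Runtime. */
    static long jvmUsedMemory() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        // Check memory every 20 documents, no document limit, 1 GiB memory limit.
        FlushPolicySketch policy = new FlushPolicySketch(20, 0, 1L << 30, FlushPolicySketch::jvmUsedMemory);
        for (int doc = 0; doc < 100; doc++) {
            if (policy.documentIndexed()) {
                System.out.println("flush after document " + doc);
            }
        }
    }
}
```

Measuring memory only every `docsPerCheck` documents keeps the per-document overhead low, at the cost of possibly overshooting the memory limit between two checks.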
Properties:
Nested classes inherited from class BasicIndexer: BasicIndexer.BasicTermProcessor, BasicIndexer.FieldTermProcessor
Modifier and Type | Field and Description |
---|---|
protected String | basicInvertedIndexPostingIteratorClass |
protected int | currentFile - Number of the current run to be written to disk |
protected int | currentId - Current document id |
protected int | docsPerCheck - Number of documents read per memory check |
protected String | fieldInvertedIndexPostingIteratorClass |
protected Queue<String[]> | fileNames - Queue with the file names for the runs on disk |
protected String | invertedIndexClass - Class used to read the generated inverted index |
protected String | invertedIndexInputStreamClass - Class used to read the inverted index as a stream |
protected int | maxDocsPerFlush |
protected long | maxMemory |
protected long | memoryAfterFlush - Memory status after flush |
protected MemoryChecker | memoryCheck - Memory checker; provides the method for checking whether the system is running low on memory |
protected RunsMerger | merger - Structure for merging the runs |
protected MemoryPostings | mp - Structure that keeps the posting lists in memory |
protected int | numberOfDocsSinceCheck - Number of documents read since the memory consumption was last checked |
protected int | numberOfDocsSinceFlush - Number of documents read since the memory runs were last flushed to disk |
protected int | numberOfDocuments - Number of documents indexed |
protected long | numberOfPointers - Number of pointers indexed |
protected long | numberOfTokens - Number of tokens indexed |
protected int | numberOfUniqueTerms - Number of unique terms indexed |
protected static Runtime | runtime - Runtime system of the JVM running this instance of Terrier |
Fields inherited from class BasicIndexer: compressionDirectConfig, compressionInvertedConfig, numOfTokensInDocument, termFields, termsInDocument
Fields inherited from class Indexer: BUILDER_BOUNDARY_DOCUMENTS, currentIndex, directIndexBuilder, docIndexBuilder, emptyDocIndexEntry, fieldNames, fileNameNoExtension, IndexEmptyDocuments, invertedIndexBuilder, lexiconBuilder, logger, MAX_DOCS_PER_BUILDER, MAX_TOKENS_IN_DOCUMENT, metaBuilder, numFields, path, pipeline_first, prefix, useFieldInformation
Modifier | Constructor and Description |
---|---|
protected | BasicSinglePassIndexer(long a, long b, long c) - Protected do-nothing constructor for use by child classes |
public | BasicSinglePassIndexer(String pathname, String prefix) - Constructs an instance of a BasicSinglePassIndexer, using the given path name for storing the data structures. |
Modifier and Type | Method and Description |
---|---|
protected void | checkFlush() - Checks whether a flush is required, and performs it if necessary |
void | createDirectIndex(Collection[] collections) - Creates the direct index, the document index and the lexicon. |
protected void | createFieldRunMerger(String[][] files) - Hook method that creates a FieldRunMerger instance |
void | createInvertedIndex() - Creates the inverted index after having created the direct index, document index and lexicon. |
void | createInvertedIndex(Collection[] collections) - Builds the inverted file and lexicon file for the given collections. Loops through each document in each of the collections, extracting terms and pushing these through the Term Pipeline (e.g. stemming, stopping, lowercasing). |
protected void | createMemoryPostings() - Hook method that creates the right type of MemoryPostings class. |
protected void | createRunMerger(String[][] files) - Hook method that creates a RunsMerger instance |
protected String[] | finishMemoryPosting() - Adds the name of the current run + partial lexicon to be flushed to disk. |
protected void | forceFlush() |
protected String[][] | getFileNames() |
protected void | indexDocument(Map<String,String> docProperties, DocumentPostingList termsInDocument) - Adds a document to the direct and document indexes, as well as its terms to the lexicon. |
protected void | load_indexer_properties() |
void | performMultiWayMerge() - Uses the merger class to perform a k-way merge over a set of previously written runs. |
Methods inherited from class BasicIndexer: createDocumentPostings, finishedInvertedIndexBuild, getEndOfPipeline
Methods inherited from class Indexer: createMetaIndexBuilder, finishedDirectIndexBuild, index, indexEmpty, init, load_builder_boundary_documents, load_field_ids, load_pipeline, main, merge, merge, mergeTwoIndices, parseInts, useFieldInformation
protected int currentId
protected long maxMemory
protected MemoryChecker memoryCheck
protected int docsPerCheck
protected int maxDocsPerFlush
protected static final Runtime runtime
protected int numberOfDocsSinceCheck
protected int numberOfDocsSinceFlush
protected long memoryAfterFlush
protected int currentFile
protected MemoryPostings mp
protected RunsMerger merger
protected int numberOfDocuments
protected long numberOfTokens
protected int numberOfUniqueTerms
protected long numberOfPointers
protected String invertedIndexClass
protected String basicInvertedIndexPostingIteratorClass
protected String fieldInvertedIndexPostingIteratorClass
protected String invertedIndexInputStreamClass
public BasicSinglePassIndexer(String pathname, String prefix)
pathname - String the path where the data structures will be created. This is assumed to be absolute.
prefix - String the prefix of the index, usually "data".
protected BasicSinglePassIndexer(long a, long b, long c)
public void createDirectIndex(Collection[] collections)
Overrides: createDirectIndex in class BasicIndexer
collections - Collection[] the collections to be indexed.
public void createInvertedIndex()
Overrides: createInvertedIndex in class BasicIndexer
public void createInvertedIndex(Collection[] collections)
collections - Collection[] the collections to be indexed.
protected void checkFlush() throws IOException
Throws: IOException
protected void forceFlush() throws IOException
Throws: IOException
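How postings accumulated in memory end up as a sequence of sorted runs can be illustrated with a small, self-contained sketch. Everything here is an assumption made for illustration: the class `RunFlushSketch` is hypothetical, runs are kept as in-memory sorted maps rather than written to disk, tokenisation is a plain lowercase split, and the bodies of `checkFlush()` and `forceFlush()` only mimic the roles the methods of the same name are described as having above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical sketch: accumulate postings in memory and cut them into sorted runs. */
class RunFlushSketch {
    private final int maxDocsPerFlush;
    // term -> ids of the documents containing it, for the run currently being built
    private TreeMap<String, List<Integer>> postings = new TreeMap<>();
    // completed runs, in the order they were flushed (stand-in for files on disk)
    private final List<SortedMap<String, List<Integer>>> runs = new ArrayList<>();
    private int currentId = 0;
    private int numberOfDocsSinceFlush = 0;

    RunFlushSketch(int maxDocsPerFlush) {
        this.maxDocsPerFlush = maxDocsPerFlush;
    }

    /** Adds one document's terms to the in-memory postings, then checks whether to flush. */
    void indexDocument(String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) continue;
            List<Integer> docs = postings.computeIfAbsent(token, t -> new ArrayList<>());
            // Record each document at most once per term.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != currentId) {
                docs.add(currentId);
            }
        }
        currentId++;
        numberOfDocsSinceFlush++;
        checkFlush();
    }

    /** Flushes only when the document limit has been reached. */
    void checkFlush() {
        if (numberOfDocsSinceFlush >= maxDocsPerFlush) {
            forceFlush();
        }
    }

    /** Closes the current run unconditionally and starts a new, empty one. */
    void forceFlush() {
        if (!postings.isEmpty()) {
            runs.add(postings);
        }
        postings = new TreeMap<>();
        numberOfDocsSinceFlush = 0;
    }

    /** Flushes whatever is left and returns all runs. */
    List<SortedMap<String, List<Integer>>> finish() {
        forceFlush();
        return runs;
    }

    public static void main(String[] args) {
        RunFlushSketch indexer = new RunFlushSketch(2);
        indexer.indexDocument("a b");
        indexer.indexDocument("b c");
        indexer.indexDocument("c d");
        System.out.println(indexer.finish());
    }
}
```

Because document ids only grow, every id in an earlier run is smaller than every id in a later run, which is what lets a later merge concatenate postings for the same term without re-sorting them.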
protected void indexDocument(Map<String,String> docProperties, DocumentPostingList termsInDocument) throws Exception
Overrides: indexDocument in class BasicIndexer
docProperties - Map<String,String>
termsInDocument - DocumentPostingList the terms in the document.
Throws: Exception
protected String[] finishMemoryPosting()
public void performMultiWayMerge() throws IOException
Throws: IOException
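A k-way merge over sorted runs is commonly implemented with a priority queue holding one cursor per run. The sketch below shows that general technique only; it is not the RunsMerger implementation. The class `KWayMergeSketch`, the representation of a run as a sorted map from term to document ids, and the assumption that runs are supplied in the order they were flushed are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical sketch of a heap-based k-way merge of term-sorted runs. */
class KWayMergeSketch {
    /** Cursor over one run: its current entry plus an iterator over the remaining entries. */
    private static final class Cursor {
        final int run;
        final Iterator<Map.Entry<String, List<Integer>>> rest;
        Map.Entry<String, List<Integer>> head;

        Cursor(int run, Iterator<Map.Entry<String, List<Integer>>> it) {
            this.run = run;
            this.rest = it;
            this.head = it.next();
        }

        /** Moves to the next entry; returns false when the run is exhausted. */
        boolean advance() {
            if (!rest.hasNext()) return false;
            head = rest.next();
            return true;
        }
    }

    /** Merges runs sorted by natural term order; postings of equal terms are concatenated in run order. */
    static LinkedHashMap<String, List<Integer>> merge(List<SortedMap<String, List<Integer>>> runs) {
        // Order cursors by current term, breaking ties by run number so earlier runs come first.
        PriorityQueue<Cursor> heap = new PriorityQueue<>(
                Comparator.comparing((Cursor c) -> c.head.getKey()).thenComparingInt(c -> c.run));
        for (int i = 0; i < runs.size(); i++) {
            if (!runs.get(i).isEmpty()) {
                heap.add(new Cursor(i, runs.get(i).entrySet().iterator()));
            }
        }
        LinkedHashMap<String, List<Integer>> merged = new LinkedHashMap<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll(); // cursor with the smallest current term
            merged.computeIfAbsent(c.head.getKey(), k -> new ArrayList<>()).addAll(c.head.getValue());
            if (c.advance()) {
                heap.add(c); // re-insert with its next term
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        SortedMap<String, List<Integer>> run0 = new TreeMap<>();
        run0.put("apple", Arrays.asList(0));
        run0.put("pear", Arrays.asList(0, 1));
        SortedMap<String, List<Integer>> run1 = new TreeMap<>();
        run1.put("apple", Arrays.asList(2));
        run1.put("fig", Arrays.asList(3));
        System.out.println(merge(Arrays.asList(run0, run1)));
    }
}
```

With k runs and n entries in total, each entry passes through the heap once, so the merge costs O(n log k) comparisons while holding only one entry per run in memory at a time.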
protected String[][] getFileNames()
protected void createFieldRunMerger(String[][] files) throws Exception
Throws: IOException - if an I/O error occurs.
Throws: Exception
protected void createRunMerger(String[][] files) throws Exception
Throws: IOException - if an I/O error occurs.
Throws: Exception
protected void createMemoryPostings()
protected void load_indexer_properties()
Overrides: load_indexer_properties in class Indexer
Terrier Information Retrieval Platform 4.1. Copyright © 2004-2015, University of Glasgow