BasicSinglePassIndexer (Terrier Information Retrieval Platform 4.1 API)

java.lang.Object
- org.terrier.structures.indexing.Indexer
  - org.terrier.structures.indexing.classical.BasicIndexer
    - org.terrier.structures.indexing.singlepass.BasicSinglePassIndexer

Direct Known Subclasses:

BlockSinglePassIndexer, ExtensibleSinglePassIndexer, Hadoop_BasicSinglePassIndexer, NoDuplicatesSinglePassIndexing
```
public class BasicSinglePassIndexer
extends BasicIndexer
```
This class indexes a document collection, skipping construction of the direct file. It implements a single-pass algorithm that operates in two phases:
First, it traverses the document collection, passes the terms through the TermPipeline and builds an in-memory representation of the posting lists. When main memory is exhausted, it flushes the sorted postings to disk along with the lexicon (collectively known as a run), and continues traversing the collection.
The second phase merges the sorted runs (with their partial lexicons) on disk to create the final inverted file. This class follows the template pattern, so the bulk of the code is reused for block (and field) indexing. A few hook methods choose the right classes to instantiate, depending on the indexing options defined.
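The two phases described above can be illustrated with a minimal, self-contained sketch. It is not Terrier's implementation: the class `SinglePassSketch`, its `maxPostings` budget (a stand-in for "main memory exhausted") and the in-memory list of runs are all hypothetical, and real runs are files on disk. Document ids are assumed to arrive in increasing order.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

/** Illustrative only: none of these names are Terrier classes. */
class SinglePassSketch {
    /** Flushed runs, each sorted by term. Held in memory here; a real run is a file on disk. */
    final List<TreeMap<String, List<Integer>>> runs = new ArrayList<>();
    private TreeMap<String, List<Integer>> memoryPostings = new TreeMap<>();
    private int postingsInMemory = 0;
    private final int maxPostings; // hypothetical stand-in for a memory budget

    SinglePassSketch(int maxPostings) { this.maxPostings = maxPostings; }

    /** Phase one: add one document's terms, flushing a run once the budget is reached. */
    void indexDocument(int docId, String[] terms) {
        for (String term : terms) {
            List<Integer> docs = memoryPostings.computeIfAbsent(term, t -> new ArrayList<>());
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
                postingsInMemory++;
            }
        }
        if (postingsInMemory >= maxPostings) flush();
    }

    /** Commits the current in-memory postings as a sorted run and starts a fresh structure. */
    void flush() {
        if (memoryPostings.isEmpty()) return;
        runs.add(memoryPostings);
        memoryPostings = new TreeMap<>();
        postingsInMemory = 0;
    }

    /** Phase two: k-way merge of the sorted runs, driven by a priority queue of run cursors. */
    TreeMap<String, List<Integer>> merge() {
        flush(); // commit whatever is still in memory as the last run
        PriorityQueue<Cursor> heap = new PriorityQueue<>();
        for (int i = 0; i < runs.size(); i++) {
            Cursor c = new Cursor(i, runs.get(i).entrySet().iterator());
            if (c.advance()) heap.add(c);
        }
        TreeMap<String, List<Integer>> inverted = new TreeMap<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll(); // smallest term; ties go to the earliest run
            inverted.computeIfAbsent(c.current.getKey(), t -> new ArrayList<>())
                    .addAll(c.current.getValue());
            if (c.advance()) heap.add(c);
        }
        return inverted;
    }

    /** Read position within one run, ordered by current term and then by run number. */
    private static final class Cursor implements Comparable<Cursor> {
        final int run;
        final Iterator<Map.Entry<String, List<Integer>>> it;
        Map.Entry<String, List<Integer>> current;

        Cursor(int run, Iterator<Map.Entry<String, List<Integer>>> it) {
            this.run = run;
            this.it = it;
        }

        boolean advance() {
            if (!it.hasNext()) return false;
            current = it.next();
            return true;
        }

        @Override
        public int compareTo(Cursor o) {
            int byTerm = current.getKey().compareTo(o.current.getKey());
            return byTerm != 0 ? byTerm : Integer.compare(run, o.run);
        }
    }

    public static void main(String[] args) {
        SinglePassSketch indexer = new SinglePassSketch(3);
        indexer.indexDocument(0, new String[] {"single", "pass"});
        indexer.indexDocument(1, new String[] {"pass", "index"});
        indexer.indexDocument(2, new String[] {"index", "single"});
        // prints {index=[1, 2], pass=[0, 1], single=[0, 2]}
        System.out.println(indexer.merge());
    }
}
```

Breaking ties by run number keeps each merged posting list in ascending document order, because earlier runs only contain earlier documents.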
Memory tracking is a key concern in this class. The properties below control how much memory may be consumed, how regularly memory is checked, and (optional) maximums on the amount of memory that can be used for the postings, or on the number of documents before a flush is committed.
Properties:
- memory.reserved - threshold on the amount of free memory before a run is committed. Default is 50 000 000 (50MB) and 100 000 000 (100MB) for 32bit and 64bit JVMs respectively.
- memory.heap.usage - proportion of max heap allocated to JVM before a run is committed. Default 0.70.
- indexing.singlepass.max.postings.memory - maximum amount of memory that the postings can consume before a run is committed. Default is 0, which is no limit.
- indexing.singlepass.max.documents.flush - maximum number of documents before a run is committed. Default is 0, which is no limit.
- docs.check - number of documents indexed between checks of the amount of free memory. Default is 20, i.e. memory consumption is checked every 20 documents.
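As a rough illustration of how these properties might interact, the sketch below reads them from a `java.util.Properties` object and decides, once per document, whether a run should be committed. The class `FlushPolicySketch` is hypothetical and is not Terrier's `MemoryChecker`. In particular, requiring both memory conditions to hold at once is an assumption, the 32bit default is used for memory.reserved, and indexing.singlepass.max.postings.memory is left out for brevity.

```java
import java.util.Properties;

/** Illustrative only: a hypothetical flush policy driven by the documented property names. */
class FlushPolicySketch {
    final long memoryReserved;   // memory.reserved, in bytes
    final double heapUsage;      // memory.heap.usage, proportion of the max heap
    final int maxDocsPerFlush;   // indexing.singlepass.max.documents.flush, 0 = no limit
    final int docsPerCheck;      // docs.check
    private int docsSinceCheck = 0;
    private int docsSinceFlush = 0;

    FlushPolicySketch(Properties p) {
        memoryReserved = Long.parseLong(p.getProperty("memory.reserved", "50000000"));
        heapUsage = Double.parseDouble(p.getProperty("memory.heap.usage", "0.70"));
        maxDocsPerFlush = Integer.parseInt(
                p.getProperty("indexing.singlepass.max.documents.flush", "0"));
        docsPerCheck = Integer.parseInt(p.getProperty("docs.check", "20"));
    }

    /**
     * Call once per indexed document. Returns true when a run should be committed.
     * Memory figures are passed in so the decision can be exercised deterministically.
     */
    boolean documentIndexed(long freeBytes, long allocatedBytes, long maxHeapBytes) {
        docsSinceFlush++;
        docsSinceCheck++;
        boolean flush = maxDocsPerFlush > 0 && docsSinceFlush >= maxDocsPerFlush;
        if (!flush && docsSinceCheck >= docsPerCheck) {
            docsSinceCheck = 0;
            // Assumption: memory counts as low only when both thresholds are crossed.
            flush = freeBytes < memoryReserved
                    && allocatedBytes > heapUsage * maxHeapBytes;
        }
        if (flush) {
            docsSinceFlush = 0;
            docsSinceCheck = 0;
        }
        return flush;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("docs.check", "5");
        FlushPolicySketch policy = new FlushPolicySketch(props);
        Runtime rt = Runtime.getRuntime();
        for (int doc = 0; doc < 10; doc++) {
            boolean flush = policy.documentIndexed(
                    rt.freeMemory(), rt.totalMemory(), rt.maxMemory());
            System.out.println("doc " + doc + " flush=" + flush);
        }
    }
}
```

Checking memory only every docs.check documents keeps the per-document overhead small, at the cost of reacting a few documents late.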
Author:

Roi Blanco

Nested Class Summary
- Nested classes/interfaces inherited from class org.terrier.structures.indexing.classical.BasicIndexer
  BasicIndexer.BasicTermProcessor, BasicIndexer.FieldTermProcessor

Field Summary

Fields
Modifier and Type	Field and Description
`protected String`	`basicInvertedIndexPostingIteratorClass`
`protected int`	`currentFile` Number of the current run to be written to disk
`protected int`	`currentId` Current document Id
`protected int`	`docsPerCheck` Number of documents read per memory check
`protected String`	`fieldInvertedIndexPostingIteratorClass`
`protected Queue<String[]>`	`fileNames` Queue with the file names for the runs on disk
`protected String`	`invertedIndexClass` what class should be used to read the generated inverted index?
`protected String`	`invertedIndexInputStreamClass` what class should be used to read the inverted index as a stream?
`protected int`	`maxDocsPerFlush`
`protected long`	`maxMemory`
`protected long`	`memoryAfterFlush` Memory status after flush
`protected MemoryChecker`	`memoryCheck` Memory Checker - provides the method for checking to see if the system is running low on memory
`protected RunsMerger`	`merger` Structure for merging the runs
`protected MemoryPostings`	`mp` Structure that keeps the posting lists in memory
`protected int`	`numberOfDocsSinceCheck` Number of documents read since the memory consumption was last checked
`protected int`	`numberOfDocsSinceFlush` Number of documents read since the memory runs were last flushed to disk
`protected int`	`numberOfDocuments` Number of documents indexed
`protected long`	`numberOfPointers` Number of pointers indexed
`protected long`	`numberOfTokens` Number of tokens indexed
`protected int`	`numberOfUniqueTerms` Number of unique terms indexed
`protected static Runtime`	`runtime` Runtime system JVM running this instance of Terrier

Fields inherited from class org.terrier.structures.indexing.classical.BasicIndexer
compressionDirectConfig, compressionInvertedConfig, numOfTokensInDocument, termFields, termsInDocument

Fields inherited from class org.terrier.structures.indexing.Indexer
BUILDER_BOUNDARY_DOCUMENTS, currentIndex, directIndexBuilder, docIndexBuilder, emptyDocIndexEntry, fieldNames, fileNameNoExtension, IndexEmptyDocuments, invertedIndexBuilder, lexiconBuilder, logger, MAX_DOCS_PER_BUILDER, MAX_TOKENS_IN_DOCUMENT, metaBuilder, numFields, path, pipeline_first, prefix, useFieldInformation

Constructor Summary

Constructors
Modifier	Constructor and Description
`protected`	`BasicSinglePassIndexer(long a, long b, long c)` Protected do-nothing constructor for use by child classes
	`BasicSinglePassIndexer(String pathname, String prefix)` Constructs an instance of a BasicSinglePassIndexer, using the given path name for storing the data structures.

Method Summary

Methods
Modifier and Type	Method and Description
`protected void`	`checkFlush()` Checks whether a flush is required, and performs one if necessary
`void`	`createDirectIndex(Collection[] collections)` Creates the direct index, the document index and the lexicon.
`protected void`	`createFieldRunMerger(String[][] files)` Hook method that creates a FieldRunMerger instance
`void`	`createInvertedIndex()` Creates the inverted index after having created the direct index, document index and lexicon.
`void`	`createInvertedIndex(Collection[] collections)` Builds the inverted file and lexicon file for the given collections. Loops through each document in each of the collections, extracting terms and pushing these through the Term Pipeline (eg stemming, stopping, lowercase).
`protected void`	`createMemoryPostings()` Hook method that creates the right type of MemoryPostings class.
`protected void`	`createRunMerger(String[][] files)` Hook method that creates a RunsMerger instance
`protected String[]`	`finishMemoryPosting()` Adds the names of the current run and partial lexicon to be flushed to disk.
`protected void`	`forceFlush()`
`protected String[][]`	`getFileNames()`
`protected void`	`indexDocument(Map<String,String> docProperties, DocumentPostingList termsInDocument)` This adds a document to the direct and document indexes, as well as its terms to the lexicon.
`protected void`	`load_indexer_properties()`
`void`	`performMultiWayMerge()` Uses the merger class to perform a k-way merge over a set of previously written runs.

Methods inherited from class org.terrier.structures.indexing.classical.BasicIndexer
createDocumentPostings, finishedInvertedIndexBuild, getEndOfPipeline

Methods inherited from class org.terrier.structures.indexing.Indexer
createMetaIndexBuilder, finishedDirectIndexBuild, index, indexEmpty, init, load_builder_boundary_documents, load_field_ids, load_pipeline, main, merge, merge, mergeTwoIndices, parseInts, useFieldInformation

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - currentId
```
protected int currentId
```
    Current document Id
  - maxMemory
```
protected long maxMemory
```
  - memoryCheck
```
protected MemoryChecker memoryCheck
```
    Memory Checker - provides the method for checking to see if the system is running low on memory
  - docsPerCheck
```
protected int docsPerCheck
```
    Number of documents read per memory check
  - maxDocsPerFlush
```
protected int maxDocsPerFlush
```
  - runtime
```
protected static final Runtime runtime
```
    Runtime system JVM running this instance of Terrier
  - numberOfDocsSinceCheck
```
protected int numberOfDocsSinceCheck
```
    Number of documents read since the memory consumption was last checked
  - numberOfDocsSinceFlush
```
protected int numberOfDocsSinceFlush
```
    Number of documents read since the memory runs were last flushed to disk
  - memoryAfterFlush
```
protected long memoryAfterFlush
```
    Memory status after flush
  - fileNames
```
protected Queue<String[]> fileNames
```
    Queue with the file names for the runs on disk
  - currentFile
```
protected int currentFile
```
    Number of the current run to be written to disk
  - mp
```
protected MemoryPostings mp
```
    Structure that keeps the posting lists in memory
  - merger
```
protected RunsMerger merger
```
    Structure for merging the runs
  - numberOfDocuments
```
protected int numberOfDocuments
```
    Number of documents indexed
  - numberOfTokens
```
protected long numberOfTokens
```
    Number of tokens indexed
  - numberOfUniqueTerms
```
protected int numberOfUniqueTerms
```
    Number of unique terms indexed
  - numberOfPointers
```
protected long numberOfPointers
```
    Number of pointers indexed
  - invertedIndexClass
```
protected String invertedIndexClass
```
    what class should be used to read the generated inverted index?
  - basicInvertedIndexPostingIteratorClass
```
protected String basicInvertedIndexPostingIteratorClass
```
  - fieldInvertedIndexPostingIteratorClass
```
protected String fieldInvertedIndexPostingIteratorClass
```
  - invertedIndexInputStreamClass
```
protected String invertedIndexInputStreamClass
```
    what class should be used to read the inverted index as a stream?
- Constructor Detail
  - BasicSinglePassIndexer
```
public BasicSinglePassIndexer(String pathname,
                      String prefix)
```
    Constructs an instance of a BasicSinglePassIndexer, using the given path name for storing the data structures.
    
    Parameters:
    pathname - String the path where the data structures will be created. This is assumed to be absolute.
    prefix - String the prefix of the index, usually "data".
  - BasicSinglePassIndexer
```
protected BasicSinglePassIndexer(long a,
                      long b,
                      long c)
```
    Protected do-nothing constructor for use by child classes
- Method Detail
  - createDirectIndex
```
public void createDirectIndex(Collection[] collections)
```
    Description copied from class: BasicIndexer
    
    Creates the direct index, the document index and the lexicon. Loops through each document in each of the collections, extracting terms and pushing these through the Term Pipeline (eg stemming, stopping, lowercase).
    
    Overrides:
    
    createDirectIndex in class BasicIndexer
    
    Parameters:
    collections - Collection[] the collections to be indexed.
  - createInvertedIndex
```
public void createInvertedIndex()
```
    Description copied from class: BasicIndexer
    
    Creates the inverted index after having created the direct index, document index and lexicon.
    
    Overrides:
    
    createInvertedIndex in class BasicIndexer
  - createInvertedIndex
```
public void createInvertedIndex(Collection[] collections)
```
    Builds the inverted file and lexicon file for the given collections. Loops through each document in each of the collections, extracting terms and pushing these through the Term Pipeline (eg stemming, stopping, lowercase).
    
    Parameters:
    collections - Collection[] the collections to be indexed.
  - checkFlush
```
protected void checkFlush()
                   throws IOException
```
    Checks whether a flush is required, and performs one if necessary
    
    Throws:
    
    IOException
  - forceFlush
```
protected void forceFlush()
                   throws IOException
```
    Throws:
    
    IOException
  - indexDocument
```
protected void indexDocument(Map<String,String> docProperties,
                 DocumentPostingList termsInDocument)
                      throws Exception
```
    This adds a document to the direct and document indexes, as well as its terms to the lexicon. Handled internally by the methods indexFieldDocument and indexNoFieldDocument. This implementation only places content in the runs in memory, which will eventually be flushed to disk.
    
    Overrides:
    
    indexDocument in class BasicIndexer
    
    Parameters:
    docProperties - Map properties of the document
    termsInDocument - DocumentPostingList the terms in the document.
    
    Throws:
    
    Exception
  - finishMemoryPosting
```
protected String[] finishMemoryPosting()
```
    Adds the names of the current run and partial lexicon to be flushed to disk.
    
    Returns:
    the String[] array with the names of the run and partial lexicon to write.
  - performMultiWayMerge
```
public void performMultiWayMerge()
                          throws IOException
```
    Uses the merger class to perform a k-way merge over a set of previously written runs. The file names and the number of runs are given by the fileNames queue.
    
    Throws:
    
    IOException
  - getFileNames
```
protected String[][] getFileNames()
```
    Returns:
    the String[][] structure with the name of the runs files and partial lexicons.
  - createFieldRunMerger
```
protected void createFieldRunMerger(String[][] files)
                             throws Exception
```
    Hook method that creates a FieldRunMerger instance
    
    Throws:
    
    IOException - if an I/O error occurs.
    
    Exception
  - createRunMerger
```
protected void createRunMerger(String[][] files)
                        throws Exception
```
    Hook method that creates a RunsMerger instance
    
    Throws:
    
    IOException - if an I/O error occurs.
    
    Exception
  - createMemoryPostings
```
protected void createMemoryPostings()
```
    Hook method that creates the right type of MemoryPostings class.
  - load_indexer_properties
```
protected void load_indexer_properties()
```
    Overrides:
    
    load_indexer_properties in class Indexer
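
The hook methods listed above (createMemoryPostings, createRunMerger, createFieldRunMerger) follow the template pattern mentioned in the class description. The sketch below shows the general shape of that pattern only. Every class in it is a hypothetical stand-in, far simpler than Terrier's MemoryPostings or indexer classes, and it does not claim to mirror how any Terrier subclass is actually written.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: all nested classes are hypothetical stand-ins. */
class TemplateHookSketch {
    /** Minimal in-memory posting structure that records term and document id. */
    static class Postings {
        final List<String> entries = new ArrayList<>();

        void add(String term, int docId, int position) {
            entries.add(term + "@" + docId);
        }
    }

    /** Variant that also records the position, as a block-level structure might. */
    static class BlockPostings extends Postings {
        @Override
        void add(String term, int docId, int position) {
            entries.add(term + "@" + docId + ":" + position);
        }
    }

    static class BasicSketchIndexer {
        protected Postings mp;

        /** Hook: subclasses override this to choose the posting structure. */
        protected void createMemoryPostings() {
            mp = new Postings();
        }

        /** Template method: the skeleton is fixed, only the hook varies. */
        public final List<String> index(String[][] docs) {
            createMemoryPostings();
            for (int d = 0; d < docs.length; d++) {
                for (int pos = 0; pos < docs[d].length; pos++) {
                    mp.add(docs[d][pos], d, pos);
                }
            }
            return mp.entries;
        }
    }

    static class BlockSketchIndexer extends BasicSketchIndexer {
        @Override
        protected void createMemoryPostings() {
            mp = new BlockPostings();
        }
    }

    public static void main(String[] args) {
        String[][] docs = {{"a", "b"}, {"a"}};
        // prints [a@0, b@0, a@1]
        System.out.println(new BasicSketchIndexer().index(docs));
        // prints [a@0:0, b@0:1, a@1:0]
        System.out.println(new BlockSketchIndexer().index(docs));
    }
}
```

Because the traversal loop lives in one place, a subclass only has to override the hook to change which structure is built.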
