BasicSinglePassIndexer (Terrier 3.5 API)

org.terrier.indexing
Class BasicSinglePassIndexer

java.lang.Object
  org.terrier.indexing.Indexer
      org.terrier.indexing.BasicIndexer
          org.terrier.indexing.BasicSinglePassIndexer

Direct Known Subclasses:: BlockSinglePassIndexer, ExtensibleSinglePassIndexer, Hadoop_BasicSinglePassIndexer

public class BasicSinglePassIndexer
extends BasicIndexer

This class indexes a document collection, skipping construction of the direct file. It implements a single-pass algorithm that operates in two phases:
First, it traverses the document collection, passes the terms through the TermPipeline and builds an in-memory representation of the posting lists. When main memory is exhausted, it flushes the sorted postings to disk along with the lexicon (collectively known as a run), and continues traversing the collection.
The second phase merges the sorted runs (with their partial lexicons) on disk to create the final inverted file. This class follows the template method pattern, so the bulk of the code is reused for block (and fields) indexing. A few hook methods choose the right classes to instantiate, depending on the indexing options defined.
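
As an illustration of the first phase, the following is a minimal sketch and not Terrier's code. The class RunBuilderSketch, its posting-count budget, and the in-memory list standing in for run files on disk are all hypothetical; the real class tracks memory rather than counting postings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy phase one: accumulate postings in memory, emit a sorted run when a budget is hit. */
final class RunBuilderSketch {
    /** term -> docids; a TreeMap keeps terms sorted, so each flush emits a sorted run. */
    private final TreeMap<String, List<Integer>> postings = new TreeMap<>();
    /** Stand-in for run files on disk (assumption): each run is a sorted term -> docids map. */
    final List<TreeMap<String, List<Integer>>> runs = new ArrayList<>();
    private final int maxPostingsInMemory; // hypothetical budget, counted in postings
    private int postingsInMemory = 0;

    RunBuilderSketch(int maxPostingsInMemory) {
        this.maxPostingsInMemory = maxPostingsInMemory;
    }

    /** Adds one document's terms, then flushes if the budget is exhausted. */
    void indexDocument(int docid, String... terms) {
        for (String term : terms) {
            List<Integer> docids = postings.computeIfAbsent(term, t -> new ArrayList<>());
            // Record each (term, docid) pair only once.
            if (docids.isEmpty() || docids.get(docids.size() - 1) != docid) {
                docids.add(docid);
                postingsInMemory++;
            }
        }
        if (postingsInMemory >= maxPostingsInMemory) {
            flush();
        }
    }

    /** Emits the current postings as one run and starts afresh. */
    void flush() {
        if (postings.isEmpty()) {
            return;
        }
        runs.add(new TreeMap<>(postings));
        postings.clear();
        postingsInMemory = 0;
    }
}
```

Because documents are added in increasing docid order, every run holds a contiguous range of docids, which is what lets the second phase concatenate posting lists run by run.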

Memory tracking is a key concern in this class. The properties below control how much memory may be consumed, how often memory is checked, and optional maximums on the amount of memory that the postings can use or on the number of documents indexed before a flush is committed.

Properties:

memory.reserved - threshold of free memory below which a run is committed. Default is 50 000 000 (50MB) for 32-bit JVMs and 100 000 000 (100MB) for 64-bit JVMs.
memory.heap.usage - proportion of the maximum heap that may be allocated to the JVM before a run is committed. Default is 0.70.
indexing.singlepass.max.postings.memory - maximum amount of memory that the postings can consume before a run is committed.
indexing.singlepass.max.documents.flush - maximum number of documents before a run is committed.
docs.check - number of documents indexed between checks of the amount of free memory. Default is 20.
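
The defaults above can be made concrete with a small sketch that reads the five properties from a java.util.Properties object. SinglePassSettings is a hypothetical holder, not a Terrier class. Treating 0 as "no limit" for the two optional maximums is an assumption, since their defaults are not stated here, and the caller is assumed to know whether the JVM is 64-bit.

```java
import java.util.Properties;

/** Hypothetical holder for the five settings listed above; not part of Terrier. */
final class SinglePassSettings {
    final long memoryReserved;    // memory.reserved, in bytes
    final double heapUsage;       // memory.heap.usage, fraction of the maximum heap
    final long maxPostingsMemory; // 0 means "no limit" (assumption)
    final int maxDocsPerFlush;    // 0 means "no limit" (assumption)
    final int docsPerCheck;       // docs.check

    SinglePassSettings(Properties p, boolean is64BitJvm) {
        // Documented defaults: 50MB on 32-bit JVMs, 100MB on 64-bit JVMs.
        long reservedDefault = is64BitJvm ? 100_000_000L : 50_000_000L;
        memoryReserved = Long.parseLong(
            p.getProperty("memory.reserved", Long.toString(reservedDefault)));
        heapUsage = Double.parseDouble(p.getProperty("memory.heap.usage", "0.70"));
        maxPostingsMemory = Long.parseLong(
            p.getProperty("indexing.singlepass.max.postings.memory", "0"));
        maxDocsPerFlush = Integer.parseInt(
            p.getProperty("indexing.singlepass.max.documents.flush", "0"));
        docsPerCheck = Integer.parseInt(p.getProperty("docs.check", "20"));
    }
}
```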

Author:: Roi Blanco

Nested Class Summary

Nested classes/interfaces inherited from class org.terrier.indexing.BasicIndexer
`BasicIndexer.BasicTermProcessor, BasicIndexer.FieldTermProcessor`

Field Summary
`protected java.lang.String`	`basicInvertedIndexPostingIteratorClass`
`protected int`	`currentFile` Number of the current run to be written to disk
`protected int`	`currentId` Current document Id
`protected int`	`docsPerCheck` Number of documents read per memory check
`protected java.lang.String`	`fieldInvertedIndexPostingIteratorClass`
`protected java.util.Queue<java.lang.String[]>`	`fileNames` Queue with the file names for the runs on disk
`protected java.lang.String`	`invertedIndexClass` what class should be used to read the generated inverted index?
`protected java.lang.String`	`invertedIndexInputStreamClass` what class should be used to read the inverted index as a stream?
`protected int`	`maxDocsPerFlush`
`protected long`	`maxMemory`
`protected long`	`memoryAfterFlush` Memory status after flush
`protected MemoryChecker`	`memoryCheck` Memory Checker - provides the method for checking to see if the system is running low on memory
`protected RunsMerger`	`merger` Structure for merging the run
`protected MemoryPostings`	`mp` Structure that keeps the posting lists in memory
`protected int`	`numberOfDocsSinceCheck` Number of documents read since the memory consumption was last checked
`protected int`	`numberOfDocsSinceFlush` Number of documents read since the memory runs were last flushed to disk
`protected int`	`numberOfDocuments` Number of documents indexed
`protected long`	`numberOfPointers` Number of pointers indexed
`protected long`	`numberOfTokens` Number of tokens indexed
`protected int`	`numberOfUniqueTerms` Number of unique terms indexed
`protected static java.lang.Runtime`	`runtime` Runtime system JVM running this instance of Terrier

Fields inherited from class org.terrier.indexing.BasicIndexer
`numOfTokensInDocument, termFields, termsInDocument`

Fields inherited from class org.terrier.indexing.Indexer
`basicDirectIndexPostingIteratorClass, BUILDER_BOUNDARY_DOCUMENTS, currentIndex, directIndexBuilder, docIndexBuilder, emptyDocIndexEntry, fieldDirectIndexPostingIteratorClass, fieldNames, fileNameNoExtension, IndexEmptyDocuments, invertedIndexBuilder, lexiconBuilder, logger, MAX_DOCS_PER_BUILDER, MAX_TOKENS_IN_DOCUMENT, metaBuilder, numFields, path, pipeline_first, prefix, useFieldInformation`

Constructor Summary
`protected`	`BasicSinglePassIndexer(long a, long b, long c)` Protected do-nothing constructor for use by child classes
	`BasicSinglePassIndexer(java.lang.String pathname, java.lang.String prefix)` Constructs an instance of a BasicSinglePassIndexer, using the given path name for storing the data structures.

Method Summary
`protected void`	`checkFlush()` Checks whether a flush is required, and performs it if necessary.
`void`	`createDirectIndex(Collection[] collections)` Creates the direct index, the document index and the lexicon.
`protected void`	`createFieldRunMerger(java.lang.String[][] files)` Hook method that creates a FieldRunMerger instance
`void`	`createInvertedIndex()` Creates the inverted index after having created the direct index, document index and lexicon.
`void`	`createInvertedIndex(Collection[] collections)` Builds the inverted file and lexicon file for the given collections. Loops through each document in each of the collections, extracting terms and pushing these through the Term Pipeline (e.g. stemming, stopping, lowercase).
`protected void`	`createMemoryPostings()` Hook method that creates the right type of MemoryPostings class.
`protected void`	`createRunMerger(java.lang.String[][] files)` Hook method that creates a RunsMerger instance
`protected java.lang.String[]`	`finishMemoryPosting()` Adds the name of the current run and partial lexicon to be flushed to disk.
`protected void`	`forceFlush()`
`protected java.lang.String[][]`	`getFileNames()`
`protected void`	`indexDocument(java.util.Map<java.lang.String,java.lang.String> docProperties, DocumentPostingList termsInDocument)` This adds a document to the direct and document indexes, as well as its terms to the lexicon.
`protected void`	`load_indexer_properties()`
`void`	`performMultiWayMerge()` Uses the merger class to perform a k-way merge over a set of previously written runs.

Methods inherited from class org.terrier.indexing.BasicIndexer
`createDocumentPostings, finishedInvertedIndexBuild, getEndOfPipeline`

Methods inherited from class org.terrier.indexing.Indexer
`createMetaIndexBuilder, finishedDirectIndexBuild, index, indexEmpty, init, load_builder_boundary_documents, load_field_ids, load_pipeline, main, merge, merge, mergeTwoIndices, parseInts, useFieldInformation`

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

currentId

protected int currentId

Current document Id

maxMemory

protected long maxMemory

memoryCheck

protected MemoryChecker memoryCheck

Memory Checker - provides the method for checking to see if the system is running low on memory

docsPerCheck

protected int docsPerCheck

Number of documents read per memory check

maxDocsPerFlush

protected int maxDocsPerFlush

runtime

protected static final java.lang.Runtime runtime

Runtime system JVM running this instance of Terrier

numberOfDocsSinceCheck

protected int numberOfDocsSinceCheck

Number of documents read since the memory consumption was last checked

numberOfDocsSinceFlush

protected int numberOfDocsSinceFlush

Number of documents read since the memory runs were last flushed to disk

memoryAfterFlush

protected long memoryAfterFlush

Memory status after flush

fileNames

protected java.util.Queue<java.lang.String[]> fileNames

Queue with the file names for the runs on disk

currentFile

protected int currentFile

Number of the current run to be written to disk

mp

protected MemoryPostings mp

Structure that keeps the posting lists in memory

merger

protected RunsMerger merger

Structure for merging the run

numberOfDocuments

protected int numberOfDocuments

Number of documents indexed

numberOfTokens

protected long numberOfTokens

Number of tokens indexed

numberOfUniqueTerms

protected int numberOfUniqueTerms

Number of unique terms indexed

numberOfPointers

protected long numberOfPointers

Number of pointers indexed

invertedIndexClass

protected java.lang.String invertedIndexClass

what class should be used to read the generated inverted index?

basicInvertedIndexPostingIteratorClass

protected java.lang.String basicInvertedIndexPostingIteratorClass

fieldInvertedIndexPostingIteratorClass

protected java.lang.String fieldInvertedIndexPostingIteratorClass

invertedIndexInputStreamClass

protected java.lang.String invertedIndexInputStreamClass

what class should be used to read the inverted index as a stream?

Constructor Detail

BasicSinglePassIndexer

public BasicSinglePassIndexer(java.lang.String pathname,
                              java.lang.String prefix)

Constructs an instance of a BasicSinglePassIndexer, using the given path name for storing the data structures.

Parameters:: pathname - String the path where the datastructures will be created. This is assumed to be absolute.; prefix - String the prefix of the index, usually "data".

BasicSinglePassIndexer

protected BasicSinglePassIndexer(long a,
                                 long b,
                                 long c)

Protected do-nothing constructor for use by child classes

Method Detail

createDirectIndex

public void createDirectIndex(Collection[] collections)

Description copied from class: BasicIndexer

Creates the direct index, the document index and the lexicon. Loops through each document in each of the collections, extracting terms and pushing these through the Term Pipeline (eg stemming, stopping, lowercase).

Overrides:: createDirectIndex in class BasicIndexer

Parameters:: collections - Collection[] the collections to be indexed.

createInvertedIndex

public void createInvertedIndex()

Description copied from class: BasicIndexer

Creates the inverted index after having created the direct index, document index and lexicon.

Overrides:: createInvertedIndex in class BasicIndexer

createInvertedIndex

public void createInvertedIndex(Collection[] collections)

Builds the inverted file and lexicon file for the given collections. Loops through each document in each of the collections, extracting terms and pushing these through the Term Pipeline (e.g. stemming, stopping, lowercase).

Parameters:: collections - Collection[] the collections to be indexed.

checkFlush

protected void checkFlush()
                   throws java.io.IOException

Checks whether a flush is required, and performs it if necessary.

Throws:: java.io.IOException
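
One way to picture how the counters numberOfDocsSinceCheck and numberOfDocsSinceFlush could drive this decision is the sketch below. It is an assumption about the control flow, not the Terrier implementation: FlushCheckSketch is hypothetical, a BooleanSupplier plays the role of the memory checker, and a maxDocsPerFlush of 0 is taken to mean "no limit".

```java
import java.util.function.BooleanSupplier;

/** Hypothetical stand-in for the flush check; not the Terrier implementation. */
final class FlushCheckSketch {
    private final int docsPerCheck;          // cf. docs.check
    private final int maxDocsPerFlush;       // 0 = no limit (assumption)
    private final BooleanSupplier memoryLow; // plays the role of the memory checker
    private int docsSinceCheck = 0;
    private int docsSinceFlush = 0;
    int flushes = 0;                         // counts flushes instead of writing runs

    FlushCheckSketch(int docsPerCheck, int maxDocsPerFlush, BooleanSupplier memoryLow) {
        this.docsPerCheck = docsPerCheck;
        this.maxDocsPerFlush = maxDocsPerFlush;
        this.memoryLow = memoryLow;
    }

    /** Call once per indexed document. */
    void documentIndexed() {
        docsSinceCheck++;
        docsSinceFlush++;
        // A document-count limit forces a flush regardless of memory.
        boolean flush = maxDocsPerFlush > 0 && docsSinceFlush >= maxDocsPerFlush;
        // Otherwise poll memory only every docsPerCheck documents.
        if (!flush && docsSinceCheck >= docsPerCheck) {
            docsSinceCheck = 0;
            flush = memoryLow.getAsBoolean();
        }
        if (flush) {
            flushes++; // a real indexer would write a run to disk here
            docsSinceFlush = 0;
            docsSinceCheck = 0;
        }
    }
}
```

Polling memory only every few documents keeps the cost of the check small compared with the cost of indexing.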

forceFlush

protected void forceFlush()
                   throws java.io.IOException

Throws:: java.io.IOException

indexDocument

protected void indexDocument(java.util.Map<java.lang.String,java.lang.String> docProperties,
                             DocumentPostingList termsInDocument)
                      throws java.lang.Exception

This adds a document to the direct and document indexes, as well as its terms to the lexicon. Handled internally by the methods indexFieldDocument and indexNoFieldDocument. This implementation only places content in the runs in memory, which will eventually be flushed to disk.

Overrides:: indexDocument in class BasicIndexer

Parameters:: docProperties - Map properties of the document; termsInDocument - DocumentPostingList the terms in the document.
Throws:: java.lang.Exception

finishMemoryPosting

protected java.lang.String[] finishMemoryPosting()

Adds the name of the current run and partial lexicon to be flushed to disk.

Returns:: the String[] array with the names of the run and partial lexicon to write.

performMultiWayMerge

public void performMultiWayMerge()
                          throws java.io.IOException

Uses the merger class to perform a k-way merge over a set of previously written runs. The file names and the number of runs are given by the fileNames queue.

Throws:: java.io.IOException
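
The general idea of a k-way merge over sorted runs can be sketched with a priority queue. This is a generic illustration under stated assumptions, not the RunsMerger implementation: KWayMergeSketch is hypothetical, runs are in-memory sorted maps instead of files, and runs are assumed to be numbered in the order they were written so that earlier runs hold lower docids.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

/** Generic k-way merge of sorted runs; a sketch, not Terrier's RunsMerger. */
final class KWayMergeSketch {
    /** Cursor over one run: its current entry plus the remainder of the run. */
    private static final class Cursor {
        Map.Entry<String, List<Integer>> head;
        final Iterator<Map.Entry<String, List<Integer>>> rest;
        final int runNumber;

        Cursor(TreeMap<String, List<Integer>> run, int runNumber) {
            this.rest = run.entrySet().iterator();
            this.runNumber = runNumber;
            this.head = rest.next();
        }
    }

    static Map<String, List<Integer>> merge(List<TreeMap<String, List<Integer>>> runs) {
        // Order cursors by term, then by run number so docids stay ascending.
        PriorityQueue<Cursor> heap = new PriorityQueue<>((a, b) -> {
            int c = a.head.getKey().compareTo(b.head.getKey());
            return c != 0 ? c : Integer.compare(a.runNumber, b.runNumber);
        });
        for (int i = 0; i < runs.size(); i++) {
            if (!runs.get(i).isEmpty()) {
                heap.add(new Cursor(runs.get(i), i));
            }
        }
        // Insertion order equals term order, because the heap yields terms sorted.
        Map<String, List<Integer>> merged = new LinkedHashMap<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            merged.computeIfAbsent(c.head.getKey(), t -> new ArrayList<>())
                  .addAll(c.head.getValue());
            if (c.rest.hasNext()) {
                c.head = c.rest.next();
                heap.add(c); // re-insert the cursor with its next term
            }
        }
        return merged;
    }
}
```

Only one entry per run is held in the heap at a time, which is why this style of merge suits runs that are too large to load into memory together.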

getFileNames

protected java.lang.String[][] getFileNames()

Returns:: the String[][] structure with the name of the runs files and partial lexicons.

createFieldRunMerger

protected void createFieldRunMerger(java.lang.String[][] files)
                             throws java.lang.Exception

Hook method that creates a FieldRunMerger instance

Throws:: java.io.IOException - if an I/O error occurs.; java.lang.Exception

createRunMerger

protected void createRunMerger(java.lang.String[][] files)
                        throws java.lang.Exception

Hook method that creates a RunsMerger instance

Throws:: java.io.IOException - if an I/O error occurs.; java.lang.Exception

createMemoryPostings

protected void createMemoryPostings()

Hook method that creates the right type of MemoryPostings class.
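
The role of such a hook in the template method pattern can be sketched as follows. All names here (TemplateIndexerSketch, Postings, BasicSketch, BlockSketch) are hypothetical and only mirror the idea that the indexing loop stays fixed while subclasses choose which postings structure to instantiate.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical base class: the indexing loop is fixed, the postings type is a hook. */
abstract class TemplateIndexerSketch {
    /** Minimal postings abstraction, invented for this sketch. */
    interface Postings {
        void add(String term, int position);
        List<String> entries();
    }

    protected Postings postings;

    /** Hook method: subclasses decide which structure to instantiate. */
    protected abstract void createMemoryPostings();

    /** Template method: the same control flow for every subclass. */
    final List<String> index(String... terms) {
        createMemoryPostings();
        for (int i = 0; i < terms.length; i++) {
            postings.add(terms[i], i);
        }
        return postings.entries();
    }
}

/** Records terms only. */
final class BasicSketch extends TemplateIndexerSketch {
    @Override
    protected void createMemoryPostings() {
        postings = new Postings() {
            private final List<String> out = new ArrayList<>();
            public void add(String term, int position) { out.add(term); }
            public List<String> entries() { return out; }
        };
    }
}

/** Records terms together with their positions, loosely analogous to block indexing. */
final class BlockSketch extends TemplateIndexerSketch {
    @Override
    protected void createMemoryPostings() {
        postings = new Postings() {
            private final List<String> out = new ArrayList<>();
            public void add(String term, int position) { out.add(term + "@" + position); }
            public List<String> entries() { return out; }
        };
    }
}
```

Calling index("x", "y") on a BasicSketch yields the entries x and y, while a BlockSketch yields x@0 and y@1, with no change to the shared loop.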

load_indexer_properties

protected void load_indexer_properties()

Overrides:: load_indexer_properties in class Indexer
