ExtensibleSinglePassIndexer (Terrier 3.5 API)

org.terrier.indexing
Class ExtensibleSinglePassIndexer

java.lang.Object
  org.terrier.indexing.Indexer
      org.terrier.indexing.BasicIndexer
          org.terrier.indexing.BasicSinglePassIndexer
              org.terrier.indexing.ExtensibleSinglePassIndexer

public abstract class ExtensibleSinglePassIndexer
extends BasicSinglePassIndexer

Directly based on BasicSinglePassIndexer, with just a few modifications to enable some extra hooks.

Author:: Roi Blanco, Jonathon Hare

Nested Class Summary

Nested classes/interfaces inherited from class org.terrier.indexing.BasicIndexer
`BasicIndexer.BasicTermProcessor, BasicIndexer.FieldTermProcessor`

Field Summary
`protected SinglePassIndexerFlushDelegate`	`flushDelegate` Delegate for HadoopIndexerMapper to intercept flushes

Fields inherited from class org.terrier.indexing.BasicSinglePassIndexer
`basicInvertedIndexPostingIteratorClass, currentFile, currentId, docsPerCheck, fieldInvertedIndexPostingIteratorClass, fileNames, invertedIndexClass, invertedIndexInputStreamClass, maxDocsPerFlush, maxMemory, memoryAfterFlush, memoryCheck, merger, mp, numberOfDocsSinceCheck, numberOfDocsSinceFlush, numberOfDocuments, numberOfPointers, numberOfTokens, numberOfUniqueTerms, runtime`

Fields inherited from class org.terrier.indexing.BasicIndexer
`numOfTokensInDocument, termFields, termsInDocument`

Fields inherited from class org.terrier.indexing.Indexer
`basicDirectIndexPostingIteratorClass, BUILDER_BOUNDARY_DOCUMENTS, currentIndex, directIndexBuilder, docIndexBuilder, emptyDocIndexEntry, fieldDirectIndexPostingIteratorClass, fieldNames, fileNameNoExtension, IndexEmptyDocuments, invertedIndexBuilder, lexiconBuilder, logger, MAX_DOCS_PER_BUILDER, MAX_TOKENS_IN_DOCUMENT, metaBuilder, numFields, path, pipeline_first, prefix, useFieldInformation`

Constructor Summary
`ExtensibleSinglePassIndexer(java.lang.String pathname, java.lang.String prefix)` Default constructor

Method Summary
`protected abstract void`	`createDocumentPostings()` Hook method that creates the right type of DocumentTree class.
`void`	`createInvertedIndex(Collection[] collections)` Builds the inverted file and lexicon file for the given collections.
`protected abstract void`	`createMemoryPostings()` Hook method that creates the right type of MemoryPostings class.
`protected void`	`createRunMerger(java.lang.String[][] files)` Hook method that creates a RunsMerger instance
`protected void`	`forceFlush()` Force the indexer to flush everything and free memory.
`Index`	`getCurrentIndex()` Get the index currently being constructed by this indexer.
`protected abstract TermPipeline`	`getEndOfPipeline()` Returns the end of the term pipeline, which corresponds to an instance of either BasicIndexer.BasicTermProcessor, or BasicIndexer.FieldTermProcessor, depending on whether field information is stored.
`protected SinglePassIndexerFlushDelegate`	`getFlushDelegate()` Get the flushDelegate
`protected abstract java.lang.Class<? extends PostingInRun>`	`getPostingInRunClass()` Get the class for storing postings in runs.
`protected abstract void`	`preProcess(Document doc, java.lang.String term)` Perform an operation before the term pipeline is initiated.
`protected void`	`setFlushDelegate(SinglePassIndexerFlushDelegate _flushDelegate)` Set the flushDelegate

Methods inherited from class org.terrier.indexing.BasicSinglePassIndexer
`checkFlush, createDirectIndex, createFieldRunMerger, createInvertedIndex, finishMemoryPosting, getFileNames, indexDocument, load_indexer_properties, performMultiWayMerge`

Methods inherited from class org.terrier.indexing.BasicIndexer
`finishedInvertedIndexBuild`

Methods inherited from class org.terrier.indexing.Indexer
`createMetaIndexBuilder, finishedDirectIndexBuild, index, indexEmpty, init, load_builder_boundary_documents, load_field_ids, load_pipeline, main, merge, merge, mergeTwoIndices, parseInts, useFieldInformation`

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

flushDelegate

protected SinglePassIndexerFlushDelegate flushDelegate

Delegate for HadoopIndexerMapper to intercept flushes

Constructor Detail

ExtensibleSinglePassIndexer

public ExtensibleSinglePassIndexer(java.lang.String pathname,
                                   java.lang.String prefix)

Default constructor

Parameters:: pathname - String the absolute path where the data structures will be created; prefix - String the prefix of the index, usually "data".

Method Detail

getEndOfPipeline

protected abstract TermPipeline getEndOfPipeline()

Returns the end of the term pipeline, which corresponds to an instance of either BasicIndexer.BasicTermProcessor, or BasicIndexer.FieldTermProcessor, depending on whether field information is stored.

Overrides:: getEndOfPipeline in class BasicIndexer

Returns:: TermPipeline the end of the term pipeline.

getPostingInRunClass

protected abstract java.lang.Class<? extends PostingInRun> getPostingInRunClass()

Get the class for storing postings in runs.

Returns:: PostingInRun Subclass of PostingInRun for this indexer

createRunMerger

protected void createRunMerger(java.lang.String[][] files)
                        throws java.lang.Exception

Hook method that creates a RunsMerger instance

Overrides:: createRunMerger in class BasicSinglePassIndexer

Throws:: java.io.IOException - if an I/O error occurs; java.lang.Exception

createMemoryPostings

protected abstract void createMemoryPostings()

Hook method that creates the right type of MemoryPostings class.

Overrides:: createMemoryPostings in class BasicSinglePassIndexer

createDocumentPostings

protected abstract void createDocumentPostings()

Hook method that creates the right type of DocumentTree class.

Overrides:: createDocumentPostings in class BasicIndexer

createInvertedIndex

public void createInvertedIndex(Collection[] collections)

Builds the inverted file and lexicon file for the given collections. Loops through each document in each of the collections, extracting terms and pushing these through the Term Pipeline (e.g. stemming, stopping, lowercase, etc.). The only change from BasicSinglePassIndexer is an added pre-processing operation, preProcess, which runs before each term is passed to the pipeline.

Overrides:: createInvertedIndex in class BasicSinglePassIndexer

Parameters:: collections - Collection[] the collections to be indexed.

preProcess

protected abstract void preProcess(Document doc,
                                   java.lang.String term)

Perform an operation before the term pipeline is initiated. This could, for example, extract data and store it in a field that the pipeline can access.

Parameters:: doc - Current document; term - Current term
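As a rough illustration of where such a hook sits in the indexing loop, here is a self-contained sketch that uses stand-in types instead of Terrier classes. Everything in it is hypothetical: `HookedIndexerSketch`, `LengthRecordingIndexer`, `indexTerms`, and `processTerm` are invented names, the document is reduced to a plain String id, and the "pipeline" only lowercases terms. It shows the pattern, not Terrier's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Stand-in for an indexer with a pre-processing hook; not a Terrier class. */
abstract class HookedIndexerSketch {
    /** Terms that reached the simulated term pipeline. */
    final List<String> pipelined = new ArrayList<>();

    /** Hook called for every term before it enters the pipeline. */
    protected abstract void preProcess(String docId, String term);

    /** Simulated pipeline stage: lowercases the term and records it. */
    protected void processTerm(String term) {
        pipelined.add(term.toLowerCase(Locale.ROOT));
    }

    /** Simulated indexing loop: hook first, then pipeline, for each term. */
    void indexTerms(String docId, List<String> terms) {
        for (String term : terms) {
            preProcess(docId, term);
            processTerm(term);
        }
    }
}

/** Example subclass: the hook stores data in a field for later use. */
class LengthRecordingIndexer extends HookedIndexerSketch {
    /** Length of the longest raw term seen so far (assumed example data). */
    int longestRawTerm = 0;

    @Override
    protected void preProcess(String docId, String term) {
        longestRawTerm = Math.max(longestRawTerm, term.length());
    }
}
```

Because the hook runs before the pipeline, it sees the raw term (for instance before lowercasing), while later pipeline stages only see what the hook chose to keep in fields.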

getCurrentIndex

public Index getCurrentIndex()

Get the index currently being constructed by this indexer. This may be null if indexing has not yet commenced. It is useful for adding extra properties, etc., to the index after indexing has finished.

Returns:: the current index
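Since the returned index may be null, callers presumably need a guard before adding properties. The sketch below shows one way to do that with stand-in types only: `IndexSketch`, `IndexerSketch`, and `IndexPropertyHelper.tagIndex` are hypothetical names, and `java.util.Properties` merely stands in for whatever property storage the real Index offers.

```java
import java.util.Properties;

/** Stand-in for an index that carries properties; not Terrier's Index class. */
class IndexSketch {
    final Properties properties = new Properties();
}

/** Stand-in for an indexer whose current index is null until indexing starts. */
class IndexerSketch {
    private IndexSketch currentIndex; // null before indexing commences

    /** Simulates indexing by creating the index object. */
    void index() {
        currentIndex = new IndexSketch();
    }

    IndexSketch getCurrentIndex() {
        return currentIndex;
    }
}

/** Adds a property only when an index is available. */
final class IndexPropertyHelper {
    private IndexPropertyHelper() {}

    /** Returns true if the property was stored, false if no index exists yet. */
    static boolean tagIndex(IndexerSketch indexer, String key, String value) {
        IndexSketch index = indexer.getCurrentIndex();
        if (index == null) {
            return false;
        }
        index.properties.setProperty(key, value);
        return true;
    }
}
```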

setFlushDelegate

protected void setFlushDelegate(SinglePassIndexerFlushDelegate _flushDelegate)

Set the flushDelegate

Parameters:: _flushDelegate - the flush delegate to set

getFlushDelegate

protected SinglePassIndexerFlushDelegate getFlushDelegate()

Get the flushDelegate

Returns:: the flushDelegate

forceFlush

protected void forceFlush()
                   throws java.io.IOException

Force the indexer to flush everything and free memory. Either calls the super method, or passes to a delegate if the flushDelegate is set.

Overrides:: forceFlush in class BasicSinglePassIndexer

Throws:: java.io.IOException
See Also:: BasicSinglePassIndexer.forceFlush()
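
The "super method or delegate" behaviour described for forceFlush can be sketched as follows. This is a minimal, self-contained illustration with stand-in types, not the Terrier source: `FlushDelegateSketch`, `BaseFlushingIndexerSketch`, and `DelegatingIndexerSketch` are hypothetical, the real SinglePassIndexerFlushDelegate interface may declare different methods, and the checked IOException is omitted for brevity.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for SinglePassIndexerFlushDelegate; the real interface may differ. */
interface FlushDelegateSketch {
    void forceFlush();
}

/** Stand-in for the superclass that performs the default flush. */
class BaseFlushingIndexerSketch {
    /** Records which flush path was taken, for demonstration only. */
    final List<String> log = new ArrayList<>();

    protected void forceFlush() {
        log.add("default flush");
    }
}

/** Delegates the flush when a delegate is set, otherwise uses the default. */
class DelegatingIndexerSketch extends BaseFlushingIndexerSketch {
    protected FlushDelegateSketch flushDelegate; // null means "no delegate"

    protected void setFlushDelegate(FlushDelegateSketch delegate) {
        this.flushDelegate = delegate;
    }

    protected FlushDelegateSketch getFlushDelegate() {
        return flushDelegate;
    }

    @Override
    protected void forceFlush() {
        if (flushDelegate != null) {
            flushDelegate.forceFlush(); // an external component intercepts the flush
        } else {
            super.forceFlush();         // fall back to the inherited behaviour
        }
    }
}
```

With no delegate set, a call to `forceFlush()` takes the inherited path; after `setFlushDelegate(...)`, the same call is routed to the delegate, which is how an external component such as a mapper could intercept flushes.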