org.terrier.indexing
Class ExtensibleSinglePassIndexer

java.lang.Object
  extended by org.terrier.indexing.Indexer
      extended by org.terrier.indexing.BasicIndexer
          extended by org.terrier.indexing.BasicSinglePassIndexer
              extended by org.terrier.indexing.ExtensibleSinglePassIndexer

public abstract class ExtensibleSinglePassIndexer
extends BasicSinglePassIndexer

Directly based on BasicSinglePassIndexer, with just a few modifications to enable some extra hooks.

Author:
Roi Blanco, Jonathon Hare

Nested Class Summary
 
Nested classes/interfaces inherited from class org.terrier.indexing.BasicIndexer
BasicIndexer.BasicTermProcessor, BasicIndexer.FieldTermProcessor
 
Field Summary
protected  SinglePassIndexerFlushDelegate flushDelegate
          Delegate for HadoopIndexerMapper to intercept flushes
 
Fields inherited from class org.terrier.indexing.BasicSinglePassIndexer
basicInvertedIndexPostingIteratorClass, currentFile, currentId, docsPerCheck, fieldInvertedIndexPostingIteratorClass, fileNames, invertedIndexClass, invertedIndexInputStreamClass, maxDocsPerFlush, maxMemory, memoryAfterFlush, memoryCheck, merger, mp, numberOfDocsSinceCheck, numberOfDocsSinceFlush, numberOfDocuments, numberOfPointers, numberOfTokens, numberOfUniqueTerms, runtime
 
Fields inherited from class org.terrier.indexing.BasicIndexer
numOfTokensInDocument, termFields, termsInDocument
 
Fields inherited from class org.terrier.indexing.Indexer
basicDirectIndexPostingIteratorClass, BUILDER_BOUNDARY_DOCUMENTS, currentIndex, directIndexBuilder, docIndexBuilder, emptyDocIndexEntry, fieldDirectIndexPostingIteratorClass, fieldNames, fileNameNoExtension, IndexEmptyDocuments, invertedIndexBuilder, lexiconBuilder, logger, MAX_DOCS_PER_BUILDER, MAX_TOKENS_IN_DOCUMENT, metaBuilder, numFields, path, pipeline_first, prefix, useFieldInformation
 
Constructor Summary
ExtensibleSinglePassIndexer(java.lang.String pathname, java.lang.String prefix)
          Constructs an indexer that creates its data structures under the given path, using the given index prefix.
 
Method Summary
protected abstract  void createDocumentPostings()
          Hook method that creates the right type of DocumentTree class.
 void createInvertedIndex(Collection[] collections)
          Builds the inverted file and lexicon file for the given collections.
protected abstract  void createMemoryPostings()
          Hook method that creates the right type of MemoryPostings class.
protected  void createRunMerger(java.lang.String[][] files)
          Hook method that creates a RunsMerger instance
protected  void forceFlush()
          Force the indexer to flush everything and free memory.
 Index getCurrentIndex()
          Get the index currently being constructed by this indexer.
protected abstract  TermPipeline getEndOfPipeline()
          Returns the end of the term pipeline, which corresponds to an instance of either BasicIndexer.BasicTermProcessor, or BasicIndexer.FieldTermProcessor, depending on whether field information is stored.
protected  SinglePassIndexerFlushDelegate getFlushDelegate()
          Get the flushDelegate
protected abstract  java.lang.Class<? extends PostingInRun> getPostingInRunClass()
          Get the class for storing postings in runs.
protected abstract  void preProcess(Document doc, java.lang.String term)
          Perform an operation before the term pipeline is initiated.
protected  void setFlushDelegate(SinglePassIndexerFlushDelegate _flushDelegate)
          Set the flushDelegate
 
Methods inherited from class org.terrier.indexing.BasicSinglePassIndexer
checkFlush, createDirectIndex, createFieldRunMerger, createInvertedIndex, finishMemoryPosting, getFileNames, indexDocument, load_indexer_properties, performMultiWayMerge
 
Methods inherited from class org.terrier.indexing.BasicIndexer
finishedInvertedIndexBuild
 
Methods inherited from class org.terrier.indexing.Indexer
createMetaIndexBuilder, finishedDirectIndexBuild, index, indexEmpty, init, load_builder_boundary_documents, load_field_ids, load_pipeline, main, merge, merge, mergeTwoIndices, parseInts, useFieldInformation
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

flushDelegate

protected SinglePassIndexerFlushDelegate flushDelegate
Delegate for HadoopIndexerMapper to intercept flushes

Constructor Detail

ExtensibleSinglePassIndexer

public ExtensibleSinglePassIndexer(java.lang.String pathname,
                                   java.lang.String prefix)
Constructs an indexer that creates its data structures under the given path, using the given index prefix.

Parameters:
pathname - String the path where the data structures will be created. This is assumed to be absolute.
prefix - String the prefix of the index, usually "data".
Method Detail

getEndOfPipeline

protected abstract TermPipeline getEndOfPipeline()
Returns the end of the term pipeline, which corresponds to an instance of either BasicIndexer.BasicTermProcessor, or BasicIndexer.FieldTermProcessor, depending on whether field information is stored.

Overrides:
getEndOfPipeline in class BasicIndexer
Returns:
TermPipeline the end of the term pipeline.

getPostingInRunClass

protected abstract java.lang.Class<? extends PostingInRun> getPostingInRunClass()
Get the class for storing postings in runs.

Returns:
PostingInRun Subclass of PostingInRun for this indexer
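
A minimal sketch of how a class token with this return type can be supplied by a subclass and turned into instances. Nothing here is Terrier code: the PostingInRun type below is a simplified stand-in, SimplePostingInRun, FieldPostingInRun, SketchIndexer and newPosting are hypothetical names, and instantiating the class reflectively through a no-argument constructor is an assumption about how such a token might be used.

```java
public class PostingClassTokenSketch {
    /** Simplified stand-in for PostingInRun (not the Terrier class). */
    public static abstract class PostingInRun {
        public abstract String describe();
    }

    /** Hypothetical posting type without field information. */
    public static class SimplePostingInRun extends PostingInRun {
        @Override public String describe() { return "simple"; }
    }

    /** Hypothetical posting type with field information. */
    public static class FieldPostingInRun extends PostingInRun {
        @Override public String describe() { return "field"; }
    }

    /** Stand-in indexer: subclasses choose the posting class, the base class instantiates it. */
    public static abstract class SketchIndexer {
        protected abstract Class<? extends PostingInRun> getPostingInRunClass();

        // Assumption: the posting class has a public no-argument constructor.
        public PostingInRun newPosting() throws ReflectiveOperationException {
            return getPostingInRunClass().getDeclaredConstructor().newInstance();
        }
    }

    /** Creates one posting from each of two indexers and joins their descriptions. */
    public static String run() {
        SketchIndexer plain = new SketchIndexer() {
            @Override protected Class<? extends PostingInRun> getPostingInRunClass() {
                return SimplePostingInRun.class;
            }
        };
        SketchIndexer withFields = new SketchIndexer() {
            @Override protected Class<? extends PostingInRun> getPostingInRunClass() {
                return FieldPostingInRun.class;
            }
        };
        try {
            return plain.newPosting().describe() + "," + withFields.newPosting().describe();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints "simple,field"
    }
}
```

The bounded wildcard lets each subclass return its own posting class while the base class only depends on the common supertype.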

createRunMerger

protected void createRunMerger(java.lang.String[][] files)
                        throws java.lang.Exception
Hook method that creates a RunsMerger instance

Overrides:
createRunMerger in class BasicSinglePassIndexer
Throws:
java.io.IOException - if an I/O error occurs.
java.lang.Exception

createMemoryPostings

protected abstract void createMemoryPostings()
Hook method that creates the right type of MemoryPostings class.

Overrides:
createMemoryPostings in class BasicSinglePassIndexer

createDocumentPostings

protected abstract void createDocumentPostings()
Hook method that creates the right type of DocumentTree class.

Overrides:
createDocumentPostings in class BasicIndexer

createInvertedIndex

public void createInvertedIndex(Collection[] collections)
Builds the inverted file and lexicon file for the given collections. Loops through each document in each of the collections, extracting terms and pushing them through the Term Pipeline (e.g. stemming, stopping, lowercasing, etc.). The only change from BasicSinglePassIndexer is a pre-processing operation, preProcess(Document, String), which is applied before each term is passed to the pipeline.

Overrides:
createInvertedIndex in class BasicSinglePassIndexer
Parameters:
collections - Collection[] the collections to be indexed.

preProcess

protected abstract void preProcess(Document doc,
                                   java.lang.String term)
Perform an operation before the term pipeline is initiated. For example, this could extract data and store it in a field that the pipeline can access.

Parameters:
doc - Current document
term - Current term
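
The hook described above might be used along these lines. This is a sketch with simplified stand-ins rather than Terrier types: a document is modelled as a list of terms, TermSink stands in for TermPipeline, and HookedIndexer, PositionTrackingIndexer and currentPosition are hypothetical names. It only illustrates the ordering (preProcess first, then the pipeline) and the idea of stashing data in a field that a pipeline stage reads.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class PreProcessHookSketch {
    /** Simplified stand-in for a term pipeline stage (not the Terrier interface). */
    interface TermSink {
        void processTerm(String term);
    }

    /** Stand-in indexer: calls the hook before handing each term to the pipeline. */
    static abstract class HookedIndexer {
        protected abstract void preProcess(List<String> doc, String term);

        void indexDocument(List<String> doc, TermSink pipeline) {
            for (String term : doc) {
                preProcess(doc, term);      // hook runs first
                pipeline.processTerm(term); // then the term enters the pipeline
            }
        }
    }

    /** Hypothetical subclass: the hook stores the term position in a field. */
    static class PositionTrackingIndexer extends HookedIndexer {
        int currentPosition = -1;

        @Override
        protected void preProcess(List<String> doc, String term) {
            currentPosition++;
        }
    }

    /** Indexes one small document and returns what the pipeline stage saw. */
    public static List<String> run() {
        PositionTrackingIndexer indexer = new PositionTrackingIndexer();
        List<String> seen = new ArrayList<>();
        // The pipeline stage lowercases the term and reads the field set by the hook.
        indexer.indexDocument(Arrays.asList("Quick", "brown", "Fox"),
                term -> seen.add(term.toLowerCase(Locale.ROOT) + "@" + indexer.currentPosition));
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [quick@0, brown@1, fox@2]
    }
}
```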

getCurrentIndex

public Index getCurrentIndex()
Get the index currently being constructed by this indexer. This may be null if indexing has not yet commenced. It is useful for adding extra properties, etc. to the index after indexing has finished.

Returns:
the current index

setFlushDelegate

protected void setFlushDelegate(SinglePassIndexerFlushDelegate _flushDelegate)
Set the flushDelegate

Parameters:
_flushDelegate - the delegate to set

getFlushDelegate

protected SinglePassIndexerFlushDelegate getFlushDelegate()
Get the flushDelegate

Returns:
the flushDelegate

forceFlush

protected void forceFlush()
                   throws java.io.IOException
Force the indexer to flush everything and free memory. Either calls the super method, or passes to a delegate if the flushDelegate is set.

Overrides:
forceFlush in class BasicSinglePassIndexer
Throws:
java.io.IOException
See Also:
BasicSinglePassIndexer.forceFlush()
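
The delegation behaviour described for forceFlush can be sketched as follows. The types are stand-ins, not Terrier classes: FlushDelegate is a hypothetical substitute for SinglePassIndexerFlushDelegate (its real method names are not shown on this page, so the single forceFlush method is an assumption), and BaseIndexer plays the role of BasicSinglePassIndexer.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class FlushDelegateSketch {
    /** Hypothetical stand-in for SinglePassIndexerFlushDelegate. */
    interface FlushDelegate {
        void forceFlush() throws IOException;
    }

    /** Stand-in for the superclass: records that the default flush ran. */
    static class BaseIndexer {
        final List<String> log = new ArrayList<>();

        protected void forceFlush() throws IOException {
            log.add("super");
        }
    }

    /** Flushes through the delegate when one is set, otherwise through the superclass. */
    static class DelegatingIndexer extends BaseIndexer {
        protected FlushDelegate flushDelegate;

        protected void setFlushDelegate(FlushDelegate delegate) {
            this.flushDelegate = delegate;
        }

        protected FlushDelegate getFlushDelegate() {
            return flushDelegate;
        }

        @Override
        protected void forceFlush() throws IOException {
            if (flushDelegate != null) {
                flushDelegate.forceFlush();
            } else {
                super.forceFlush();
            }
        }
    }

    /** Flushes once without a delegate and once with one, returning the call log. */
    public static List<String> run() {
        DelegatingIndexer indexer = new DelegatingIndexer();
        try {
            indexer.forceFlush();                                  // no delegate: super method
            indexer.setFlushDelegate(() -> indexer.log.add("delegate"));
            indexer.forceFlush();                                  // delegate intercepts the flush
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return indexer.log;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [super, delegate]
    }
}
```

Routing the flush through an optional delegate lets an external component, such as the HadoopIndexerMapper mentioned in the field description, take over flushing without subclassing the indexer.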


Terrier 3.5. Copyright © 2004-2011 University of Glasgow