Class ExtensibleSinglePassIndexer
- java.lang.Object
-
- org.terrier.structures.indexing.Indexer
-
- org.terrier.structures.indexing.classical.BasicIndexer
-
- org.terrier.structures.indexing.singlepass.BasicSinglePassIndexer
-
- org.terrier.structures.indexing.singlepass.ExtensibleSinglePassIndexer
-
public abstract class ExtensibleSinglePassIndexer extends BasicSinglePassIndexer
Directly based on BasicSinglePassIndexer, with just a few modifications to enable some extra hooks.
- Author:
- Roi Blanco, Jonathon Hare [jsh2{a.}ecs.soton.ac.uk]
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class org.terrier.structures.indexing.classical.BasicIndexer
BasicIndexer.BasicTermProcessor, BasicIndexer.FieldTermProcessor
-
-
Field Summary
Fields
Modifier and Type    Field    Description

protected SinglePassIndexerFlushDelegate
flushDelegate
Delegate for HadoopIndexerMapper to intercept flushes
-
Fields inherited from class org.terrier.structures.indexing.singlepass.BasicSinglePassIndexer
basicInvertedIndexPostingIteratorClass, currentFile, currentId, docsPerCheck, fieldInvertedIndexPostingIteratorClass, fileNames, invertedIndexClass, invertedIndexInputStreamClass, maxDocsPerFlush, maxMemory, memoryAfterFlush, memoryCheck, merger, mp, numberOfDocsSinceCheck, numberOfDocsSinceFlush, numberOfDocuments, numberOfPointers, numberOfTokens, numberOfUniqueTerms, runtime
-
Fields inherited from class org.terrier.structures.indexing.classical.BasicIndexer
compressionDirectConfig, compressionInvertedConfig, numOfTokensInDocument, termCodes, termFields, termsInDocument
-
Fields inherited from class org.terrier.structures.indexing.Indexer
blocks, BUILDER_BOUNDARY_DOCUMENTS, currentIndex, directIndexBuilder, docIndexBuilder, emptyDocCount, emptyDocIndexEntry, externalParalllism, fieldNames, fileNameNoExtension, IndexEmptyDocuments, invertedIndexBuilder, lexiconBuilder, logger, MAX_DOCS_PER_BUILDER, MAX_TOKENS_IN_DOCUMENT, metaBuilder, numFields, path, pipeline_first, prefix, useFieldInformation
-
-
Constructor Summary
Constructors
Constructor    Description

ExtensibleSinglePassIndexer(java.lang.String pathname, java.lang.String prefix)
Default constructor
-
Method Summary
Modifier and Type    Method    Description

protected abstract void
createDocumentPostings()
Hook method that creates the right type of DocumentTree class.

void
createInvertedIndex(Collection[] collections)
Builds the inverted file and lexicon file for the given collections. Loops through each document in each of the collections, extracting terms and pushing these through the Term Pipeline (e.g. stemming, stopping, lowercase, etc.).

protected abstract void
createMemoryPostings()
Hook method that creates the right type of MemoryPostings class.

protected void
createRunMerger(java.lang.String[][] files)
Hook method that creates a RunsMerger instance

protected void
forceFlush()
Force the indexer to flush everything and free memory.

Index
getCurrentIndex()
Get the index currently being constructed by this indexer.

protected abstract TermPipeline
getEndOfPipeline()
Returns the end of the term pipeline, which corresponds to an instance of either BasicIndexer.BasicTermProcessor, or BasicIndexer.FieldTermProcessor, depending on whether field information is stored.

protected SinglePassIndexerFlushDelegate
getFlushDelegate()
Get the flushDelegate

protected abstract java.lang.Class<? extends org.terrier.structures.indexing.singlepass.PostingInRun>
getPostingInRunClass()
Get the class for storing postings in runs.

protected abstract void
preProcess(Document doc, java.lang.String term)
Perform an operation before the term pipeline is initiated.

protected void
setFlushDelegate(SinglePassIndexerFlushDelegate _flushDelegate)
Set the flushDelegate
-
Methods inherited from class org.terrier.structures.indexing.singlepass.BasicSinglePassIndexer
checkFlush, createDirectIndex, createFieldRunMerger, createInvertedIndex, finishMemoryPosting, getFileNames, indexDocument, load_indexer_properties, performMultiWayMerge
-
Methods inherited from class org.terrier.structures.indexing.classical.BasicIndexer
finishedInvertedIndexBuild
-
Methods inherited from class org.terrier.structures.indexing.Indexer
createMetaIndexBuilder, finishedDirectIndexBuild, getExternalParalllism, index, indexEmpty, init, load_builder_boundary_documents, load_field_ids, load_pipeline, main, merge, merge, mergeTwoIndices, parseInts, setExternalParalllism, useFieldInformation
-
-
-
-
Field Detail
-
flushDelegate
protected SinglePassIndexerFlushDelegate flushDelegate
Delegate for HadoopIndexerMapper to intercept flushes
-
-
Constructor Detail
-
ExtensibleSinglePassIndexer
public ExtensibleSinglePassIndexer(java.lang.String pathname, java.lang.String prefix)
Default constructor
- Parameters:
pathname - String the path where the data structures will be created. This is assumed to be absolute.
prefix - String the prefix of the index, usually "data".
-
-
Method Detail
-
getEndOfPipeline
protected abstract TermPipeline getEndOfPipeline()
Returns the end of the term pipeline, which corresponds to an instance of either BasicIndexer.BasicTermProcessor, or BasicIndexer.FieldTermProcessor, depending on whether field information is stored.
- Overrides:
getEndOfPipeline in class BasicIndexer
- Returns:
- TermPipeline the end of the term pipeline.
-
getPostingInRunClass
protected abstract java.lang.Class<? extends org.terrier.structures.indexing.singlepass.PostingInRun> getPostingInRunClass()
Get the class for storing postings in runs.
- Returns:
- PostingInRun Subclass of PostingInRun for this indexer
-
createRunMerger
protected void createRunMerger(java.lang.String[][] files) throws java.lang.Exception
Hook method that creates a RunsMerger instance
- Overrides:
createRunMerger in class BasicSinglePassIndexer
- Throws:
java.io.IOException - if an I/O error occurs.
java.lang.Exception
-
createMemoryPostings
protected abstract void createMemoryPostings()
Hook method that creates the right type of MemoryPostings class.
- Overrides:
createMemoryPostings in class BasicSinglePassIndexer
-
createDocumentPostings
protected abstract void createDocumentPostings()
Hook method that creates the right type of DocumentTree class.
- Overrides:
createDocumentPostings in class BasicIndexer
-
createInvertedIndex
public void createInvertedIndex(Collection[] collections)
Builds the inverted file and lexicon file for the given collections. Loops through each document in each of the collections, extracting terms and pushing these through the Term Pipeline (e.g. stemming, stopping, lowercase, etc.). The only change from BasicSinglePassIndexer is a pre-processing operation that runs before each term is passed to the pipeline.
- Overrides:
createInvertedIndex in class BasicSinglePassIndexer
- Parameters:
collections - Collection[] the collections to be indexed.
-
preProcess
protected abstract void preProcess(Document doc, java.lang.String term)
Perform an operation before the term pipeline is initiated. This could, for example, extract data and store it in a field that the pipeline can access.
- Parameters:
doc - Current document
term - Current term
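The hook described above follows the template-method pattern: the indexing loop calls preProcess for each term and only then hands the term to the pipeline. The following is a minimal, self-contained sketch of that call order only. SketchIndexer, SketchDocument, RawTermRecordingIndexer, and the lowercasing "pipeline" are hypothetical stand-ins invented for illustration; they are not Terrier classes, and the real loop in createInvertedIndex does considerably more.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class PreProcessSketch {
    public static void main(String[] args) {
        RawTermRecordingIndexer indexer = new RawTermRecordingIndexer();
        indexer.indexDocument(new SketchDocument(Arrays.asList("Terrier", "Indexing")));
        System.out.println(indexer.rawTerms);       // [Terrier, Indexing]
        System.out.println(indexer.pipelineOutput); // [terrier, indexing]
    }
}

// Hypothetical stand-in for a document: just an ordered list of terms.
class SketchDocument {
    final List<String> terms;

    SketchDocument(List<String> terms) {
        this.terms = terms;
    }
}

// Hypothetical stand-in for the indexer: calls the hook, then the "pipeline".
abstract class SketchIndexer {
    // Terms after the stand-in pipeline step (lowercasing) has been applied.
    final List<String> pipelineOutput = new ArrayList<>();

    // Hook invoked once per term, before the term enters the pipeline.
    protected abstract void preProcess(SketchDocument doc, String term);

    void indexDocument(SketchDocument doc) {
        for (String term : doc.terms) {
            preProcess(doc, term);                               // hook first
            pipelineOutput.add(term.toLowerCase(Locale.ROOT));   // then the pipeline
        }
    }
}

// Example subclass: records each term's raw form before the pipeline alters it.
class RawTermRecordingIndexer extends SketchIndexer {
    final List<String> rawTerms = new ArrayList<>();

    @Override
    protected void preProcess(SketchDocument doc, String term) {
        rawTerms.add(term);
    }
}
```

Because the hook sees the term before any pipeline stage runs, the subclass in this sketch can keep the original casing that the stand-in pipeline discards.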
-
getCurrentIndex
public Index getCurrentIndex()
Get the index currently being constructed by this indexer. This might be null if indexing hasn't commenced yet. It is useful for adding extra properties, etc., to the index after indexing is finished.
- Returns:
- the current index
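Since the returned index may be null before indexing starts, callers that want to attach extra properties would typically guard the access. The sketch below illustrates only that guard. SketchIndex, SketchPropertyIndexer, and the annotate method are hypothetical stand-ins, and a java.util.Properties object stands in for whatever property storage the real Index offers.

```java
import java.util.Properties;

class CurrentIndexSketch {
    public static void main(String[] args) {
        SketchPropertyIndexer indexer = new SketchPropertyIndexer();
        // Nothing has been indexed yet, so there is no index to annotate.
        System.out.println(indexer.annotate("sketch.note", "hello")); // false
        indexer.index();
        System.out.println(indexer.annotate("sketch.note", "hello")); // true
    }
}

// Hypothetical stand-in for an index: only a bag of properties.
class SketchIndex {
    final Properties properties = new Properties();
}

// Hypothetical stand-in for an indexer that exposes its current index.
class SketchPropertyIndexer {
    private SketchIndex currentIndex; // null until indexing starts

    SketchIndex getCurrentIndex() {
        return currentIndex;
    }

    // Stand-in for an indexing run: creates the index object.
    void index() {
        currentIndex = new SketchIndex();
    }

    // Adds a property only if an index exists; reports whether it did.
    boolean annotate(String key, String value) {
        SketchIndex index = getCurrentIndex();
        if (index == null) {
            return false;
        }
        index.properties.setProperty(key, value);
        return true;
    }
}
```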
-
setFlushDelegate
protected void setFlushDelegate(SinglePassIndexerFlushDelegate _flushDelegate)
Set the flushDelegate
- Parameters:
_flushDelegate
-
-
getFlushDelegate
protected SinglePassIndexerFlushDelegate getFlushDelegate()
Get the flushDelegate
- Returns:
- the flushDelegate
-
forceFlush
protected void forceFlush() throws java.io.IOException
Force the indexer to flush everything and free memory. Either calls the super method, or passes to a delegate if the flushDelegate is set.
- Overrides:
forceFlush in class BasicSinglePassIndexer
- Throws:
java.io.IOException
- See Also:
BasicSinglePassIndexer.forceFlush()
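The "delegate if set, otherwise super" behaviour described for forceFlush can be sketched as follows. This is an illustration of the dispatch logic only, not the Terrier source: SketchFlushDelegate, SketchBaseIndexer, and SketchExtensibleIndexer are hypothetical stand-ins, and the delegate's method name is an assumption, since the interface of SinglePassIndexerFlushDelegate is not shown on this page.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class FlushDelegateSketch {
    public static void main(String[] args) throws IOException {
        SketchExtensibleIndexer indexer = new SketchExtensibleIndexer();
        indexer.forceFlush(); // no delegate installed: the base class flushes
        indexer.setFlushDelegate(() -> indexer.log.add("delegate flush"));
        indexer.forceFlush(); // delegate installed: the delegate intercepts the flush
        System.out.println(indexer.log); // [default flush, delegate flush]
    }
}

// Hypothetical delegate interface; the method name is an assumption for this sketch.
interface SketchFlushDelegate {
    void forceFlush() throws IOException;
}

// Hypothetical stand-in for the base indexer with its default flush behaviour.
class SketchBaseIndexer {
    final List<String> log = new ArrayList<>();

    protected void forceFlush() throws IOException {
        log.add("default flush");
    }
}

// Hypothetical stand-in for the extensible indexer that can hand flushes to a delegate.
class SketchExtensibleIndexer extends SketchBaseIndexer {
    protected SketchFlushDelegate flushDelegate; // null until a delegate is installed

    protected void setFlushDelegate(SketchFlushDelegate delegate) {
        this.flushDelegate = delegate;
    }

    protected SketchFlushDelegate getFlushDelegate() {
        return flushDelegate;
    }

    @Override
    protected void forceFlush() throws IOException {
        if (flushDelegate != null) {
            flushDelegate.forceFlush(); // pass the flush to the delegate
        } else {
            super.forceFlush();         // fall back to the inherited behaviour
        }
    }
}
```

Keeping the delegate optional means a subclass behaves exactly like its parent until something, such as a mapper in a distributed job, installs a delegate to intercept flushes.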
-
-