Class BasicIndexer
java.lang.Object
  org.terrier.structures.indexing.Indexer
    org.terrier.structures.indexing.classical.BasicIndexer
Direct Known Subclasses:
BasicSinglePassIndexer
public class BasicIndexer extends Indexer
BasicIndexer is the default indexer for Terrier. It takes the terms of each Document object provided by the collection, adds them to temporary Lexicons, and writes them into the DirectFile. The documentIndex is updated with the pointers into the direct file. The temporary lexicons are then merged into the main lexicon. Inverted index construction takes place as a second step.
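The two-step construction described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the class TwoStepIndexSketch and its methods are hypothetical, use in-memory maps, and do not reproduce Terrier's data structures or file formats. Step one records, per document, which terms occur and how often (the role of the direct index); step two derives the term-to-documents mapping from it (the role of the inverted index).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Toy model of two-step index construction. NOT Terrier code: all names are hypothetical.
class TwoStepIndexSketch {

    // Step 1: "direct index". One entry per document: term -> frequency in that document.
    static List<Map<String, Integer>> buildDirectIndex(List<String> documents) {
        List<Map<String, Integer>> direct = new ArrayList<>();
        for (String text : documents) {
            Map<String, Integer> termFrequencies = new TreeMap<>();
            for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!token.isEmpty()) {
                    termFrequencies.merge(token, 1, Integer::sum);
                }
            }
            direct.add(termFrequencies);
        }
        return direct;
    }

    // Step 2: "inverted index". term -> (document id -> frequency), derived from the direct index.
    static Map<String, Map<Integer, Integer>> invert(List<Map<String, Integer>> direct) {
        Map<String, Map<Integer, Integer>> inverted = new TreeMap<>();
        for (int docId = 0; docId < direct.size(); docId++) {
            for (Map.Entry<String, Integer> entry : direct.get(docId).entrySet()) {
                inverted.computeIfAbsent(entry.getKey(), k -> new TreeMap<>())
                        .put(docId, entry.getValue());
            }
        }
        return inverted;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("the quick fox", "the lazy dog the end");
        List<Map<String, Integer>> direct = buildDirectIndex(docs);
        System.out.println(invert(direct));
    }
}
```

Building the inverted index from the finished direct index means the second step never needs to re-read or re-tokenise the original documents.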
Properties:
- indexing.max.encoded.documentindex.docs - the number of documents after which the DocumentIndexEncoded is dropped in favour of the DocumentIndex (on-disk implementation).
- See Also: Properties in org.terrier.indexing.Indexer and org.terrier.indexing.BlockIndexer
- Author:
Craig Macdonald & Vassilis Plachouras
- See Also:
Indexer, BlockIndexer
-
-
Nested Class Summary
Nested Classes (modifier and type, class, description)
protected class BasicIndexer.BasicTermProcessor
This class implements an end of a TermPipeline that adds the term to the DocumentTree.
protected class BasicIndexer.FieldTermProcessor
This class implements an end of a TermPipeline that adds the term to the DocumentTree.
-
Field Summary
Fields (modifier and type, field, description)
protected CompressionFactory.CompressionConfiguration compressionDirectConfig
The compression configuration for the direct index.
protected CompressionFactory.CompressionConfiguration compressionInvertedConfig
The compression configuration for the inverted index.
protected int numOfTokensInDocument
The number of tokens found in the current document so far.
protected TermCodes termCodes
Mapping of terms to term ids.
protected java.util.Set<java.lang.String> termFields
A variable for storing the fields a term appears in.
protected DocumentPostingList termsInDocument
The structure that holds the terms found in a document.
-
Fields inherited from class org.terrier.structures.indexing.Indexer
blocks, BUILDER_BOUNDARY_DOCUMENTS, currentIndex, directIndexBuilder, docIndexBuilder, emptyDocCount, emptyDocIndexEntry, externalParalllism, fieldNames, fileNameNoExtension, IndexEmptyDocuments, invertedIndexBuilder, lexiconBuilder, logger, MAX_DOCS_PER_BUILDER, MAX_TOKENS_IN_DOCUMENT, metaBuilder, numFields, path, pipeline_first, prefix, useFieldInformation
-
-
Constructor Summary
Constructors (modifier, constructor, description)
protected BasicIndexer(long a, long b, long c)
Protected do-nothing constructor for use by child classes.
public BasicIndexer(java.lang.String path, java.lang.String prefix)
Constructs an instance of a BasicIndexer, using the given path name for storing the data structures.
-
Method Summary
All Methods, Instance Methods, Concrete Methods (modifier and type, method, description)
public void createDirectIndex(Collection[] collections)
Creates the direct index, the document index and the lexicon.
protected void createDocumentPostings()
Hook method that creates the right type of DocumentTree class.
public void createInvertedIndex()
Creates the inverted index after having created the direct index, document index and lexicon.
protected void finishedInvertedIndexBuild()
Hook method, called when the inverted index is finished, i.e. the lexicon is finished.
protected TermPipeline getEndOfPipeline()
Returns the end of the term pipeline, which corresponds to an instance of either BasicIndexer.BasicTermProcessor, or BasicIndexer.FieldTermProcessor, depending on whether field information is stored.
protected void indexDocument(java.util.Map<java.lang.String,java.lang.String> docProperties, DocumentPostingList _termsInDocument)
This adds a document to the direct and document indexes, as well as its terms to the lexicon.
-
Methods inherited from class org.terrier.structures.indexing.Indexer
createMetaIndexBuilder, finishedDirectIndexBuild, getExternalParalllism, index, indexEmpty, init, load_builder_boundary_documents, load_field_ids, load_indexer_properties, load_pipeline, main, merge, merge, mergeTwoIndices, parseInts, setExternalParalllism, useFieldInformation
-
-
-
-
Field Detail
-
termFields
protected java.util.Set<java.lang.String> termFields
A variable for storing the fields a term appears in.
-
termsInDocument
protected DocumentPostingList termsInDocument
The structure that holds the terms found in a document.
-
termCodes
protected TermCodes termCodes
Mapping of terms to term ids.
-
numOfTokensInDocument
protected int numOfTokensInDocument
The number of tokens found in the current document so far.
-
compressionDirectConfig
protected CompressionFactory.CompressionConfiguration compressionDirectConfig
The compression configuration for the direct index
-
compressionInvertedConfig
protected CompressionFactory.CompressionConfiguration compressionInvertedConfig
The compression configuration for the inverted index
-
-
Constructor Detail
-
BasicIndexer
protected BasicIndexer(long a, long b, long c)
Protected do-nothing constructor for use by child classes. Subclasses that use this constructor must call init().
-
BasicIndexer
public BasicIndexer(java.lang.String path, java.lang.String prefix)
Constructs an instance of a BasicIndexer, using the given path name for storing the data structures.
- Parameters:
path - String the path where the data structures will be created. This is assumed to be absolute.
prefix - String the filename component of the data structures
-
-
Method Detail
-
getEndOfPipeline
protected TermPipeline getEndOfPipeline()
Returns the end of the term pipeline, which corresponds to an instance of either BasicIndexer.BasicTermProcessor, or BasicIndexer.FieldTermProcessor, depending on whether field information is stored.
- Specified by:
getEndOfPipeline in class Indexer
- Returns:
TermPipeline the end of the term pipeline.
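As a rough illustration of what "the end of the term pipeline" means, the following self-contained sketch chains a lowercasing stage and a stopword stage in front of a final stage that records terms for the current document. Everything here is an assumption for illustration: TermStage, Lowercase, Stopwords and EndOfPipeline are hypothetical names, not Terrier's TermPipeline implementations, and the stopword list is made up.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy chain-of-stages model. The interface and class names are hypothetical, not Terrier's.
class PipelineSketch {

    interface TermStage {
        void processTerm(String term);
    }

    // Lowercases each term and passes it to the next stage.
    static final class Lowercase implements TermStage {
        private final TermStage next;
        Lowercase(TermStage next) { this.next = next; }
        @Override
        public void processTerm(String term) {
            next.processTerm(term.toLowerCase(Locale.ROOT));
        }
    }

    // Drops stopwords; every other term is passed on unchanged.
    static final class Stopwords implements TermStage {
        private final Set<String> stopwords;
        private final TermStage next;
        Stopwords(Set<String> stopwords, TermStage next) {
            this.stopwords = stopwords;
            this.next = next;
        }
        @Override
        public void processTerm(String term) {
            if (!stopwords.contains(term)) {
                next.processTerm(term);
            }
        }
    }

    // End of the pipeline: records surviving terms for the current document.
    static final class EndOfPipeline implements TermStage {
        final Map<String, Integer> termsInDocument = new HashMap<>();
        int numOfTokensInDocument = 0;
        @Override
        public void processTerm(String term) {
            termsInDocument.merge(term, 1, Integer::sum);
            numOfTokensInDocument++;
        }
    }

    // Pushes every whitespace-separated token of the text through the chain.
    static EndOfPipeline run(String text) {
        EndOfPipeline end = new EndOfPipeline();
        TermStage first = new Lowercase(new Stopwords(Set.of("the", "a"), end));
        for (String token : text.split("\\s+")) {
            if (!token.isEmpty()) {
                first.processTerm(token);
            }
        }
        return end;
    }

    public static void main(String[] args) {
        EndOfPipeline end = run("The Fox saw a fox");
        System.out.println(end.termsInDocument + " tokens=" + end.numOfTokensInDocument);
    }
}
```

Because only the final stage touches the per-document structure, swapping it (for example, for a field-aware variant) leaves the earlier stages untouched.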
-
createDirectIndex
public void createDirectIndex(Collection[] collections)
Creates the direct index, the document index and the lexicon. Loops through each document in each of the collections, extracting terms and pushing these through the Term Pipeline (e.g. stemming, stopping, lowercasing).
- Specified by:
createDirectIndex in class Indexer
- Parameters:
collections - Collection[] the collections to be indexed.
-
indexDocument
protected void indexDocument(java.util.Map<java.lang.String,java.lang.String> docProperties, DocumentPostingList _termsInDocument) throws java.lang.Exception
This adds a document to the direct and document indexes, as well as its terms to the lexicon. Handled internally by the methods indexFieldDocument and indexNoFieldDocument.
- Parameters:
docProperties - Map<String,String> properties of the document
_termsInDocument - DocumentPostingList the terms in the document.
- Throws:
java.lang.Exception
-
createInvertedIndex
public void createInvertedIndex()
Creates the inverted index after having created the direct index, document index and lexicon.
- Specified by:
createInvertedIndex in class Indexer
-
createDocumentPostings
protected void createDocumentPostings()
Hook method that creates the right type of DocumentTree class.
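The hook-method idea can be sketched as follows: a base class calls an overridable factory method before each document, so a subclass can substitute a richer per-document structure without changing the indexing loop. This is a hypothetical illustration only; Postings, PositionalPostings, BaseIndexer and PositionalIndexer are invented names and do not correspond to Terrier classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical template-method sketch. None of these types are Terrier classes.
class HookSketch {

    // Plain per-document structure: term -> frequency.
    static class Postings {
        final Map<String, Integer> frequencies = new HashMap<>();
        void insert(String term) {
            frequencies.merge(term, 1, Integer::sum);
        }
    }

    // Richer structure that additionally records token positions.
    static class PositionalPostings extends Postings {
        final Map<String, List<Integer>> positions = new HashMap<>();
        private int nextPosition = 0;
        @Override
        void insert(String term) {
            super.insert(term);
            positions.computeIfAbsent(term, t -> new ArrayList<>()).add(nextPosition++);
        }
    }

    static class BaseIndexer {
        protected Postings termsInDocument;

        // Hook: subclasses override this to choose the structure type.
        protected void createDocumentPostings() {
            termsInDocument = new Postings();
        }

        // Fixed algorithm: create a fresh structure, then fill it from the document.
        final Postings indexDocument(String text) {
            createDocumentPostings();
            for (String token : text.split("\\s+")) {
                if (!token.isEmpty()) {
                    termsInDocument.insert(token);
                }
            }
            return termsInDocument;
        }
    }

    static class PositionalIndexer extends BaseIndexer {
        @Override
        protected void createDocumentPostings() {
            termsInDocument = new PositionalPostings();
        }
    }

    public static void main(String[] args) {
        Postings plain = new BaseIndexer().indexDocument("a b a");
        Postings positional = new PositionalIndexer().indexDocument("a b a");
        System.out.println(plain.frequencies);
        System.out.println(((PositionalPostings) positional).positions);
    }
}
```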
-
finishedInvertedIndexBuild
protected void finishedInvertedIndexBuild()
Hook method, called when the inverted index is finished, i.e. the lexicon is finished.
- Overrides:
finishedInvertedIndexBuild in class Indexer
-
-