public class BlockIndexer extends Indexer
Markered Blocks
Markers are terms (artificially inserted or otherwise) in the term stream that are used to denote when the block counter should
be incremented. This functionality is enabled using the block.delimiters.enabled property, while the marker terms are specified as a comma-delimited list in the
block.delimiters property.
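As a rough illustration of delimiter-bounded blocks, the sketch below assigns a block number to each term of a stream and increments the counter whenever a marker term is seen. The class `DelimiterBlockSketch`, its methods, and the marker terms `endsentence` and `endparagraph` are hypothetical names chosen for this example. They are not part of Terrier, and the sketch assumes that markers themselves are not indexed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper, not part of Terrier: numbers blocks using marker terms. */
final class DelimiterBlockSketch {

    /** Parses a comma-delimited value, such as one given to block.delimiters. */
    static Set<String> parseDelimiters(String commaDelimited) {
        Set<String> delimiters = new HashSet<>();
        for (String term : commaDelimited.split(",")) {
            String trimmed = term.trim();
            if (!trimmed.isEmpty()) {
                delimiters.add(trimmed);
            }
        }
        return delimiters;
    }

    /**
     * Returns the block id of every non-marker term, in stream order.
     * Assumption: a marker increments the counter and is not itself indexed.
     */
    static List<Integer> assignBlocks(List<String> termStream, Set<String> delimiters) {
        List<Integer> blockIds = new ArrayList<>();
        int blockId = 0;
        for (String term : termStream) {
            if (delimiters.contains(term)) {
                blockId++; // marker seen: following terms belong to the next block
            } else {
                blockIds.add(blockId);
            }
        }
        return blockIds;
    }

    public static void main(String[] args) {
        Set<String> delimiters = parseDelimiters("endsentence, endparagraph");
        List<String> stream = Arrays.asList(
                "a", "b", "endsentence", "c", "endparagraph", "d", "e");
        System.out.println(assignBlocks(stream, delimiters));
    }
}
```

With markers, block boundaries follow the structure of the text (for instance sentences or paragraphs) rather than a fixed token count.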
Modifier and Type | Class and Description |
---|---|
protected class | BlockIndexer.BasicTermProcessor: This class implements an end of a TermPipeline that adds the term to the DocumentTree. |
protected class | BlockIndexer.DelimFieldTermProcessor: This class behaves in a similar fashion to FieldTermProcessor except that this one treats blocks bounded by delimiters instead of fixed-sized blocks. |
protected class | BlockIndexer.DelimTermProcessor: This class behaves in a similar fashion to BasicTermProcessor except that this one treats blocks bounded by delimiters instead of fixed-sized blocks. |
protected class | BlockIndexer.FieldTermProcessor: This class implements an end of a TermPipeline that adds the term to the DocumentTree. |
Modifier and Type | Field and Description |
---|---|
protected int | BLOCK_SIZE: The maximum number of terms allowed in a block. |
protected int | blockId: The block number of the current document. |
protected int | MAX_BLOCKS: The maximum number of blocks allowed in a document. |
protected int | numOfTokensInBlock: The number of tokens in the current block of the current document. |
protected int | numOfTokensInDocument: The number of tokens in the current document so far. |
protected Set<String> | termFields: The fields that are set for the current term. |
protected DocumentPostingList | termsInDocument: The list of terms in this document, and for each, the block occurrences. |
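The interplay of these counters for fixed-size blocks can be sketched as follows. This is a minimal illustration and not Terrier's implementation: the class `FixedSizeBlockSketch` is hypothetical, its fields only mirror the names above, and it assumes that once the last permitted block is reached, all remaining terms stay in that block.

```java
/** Hypothetical sketch, not Terrier's code: fixed-size block numbering. */
final class FixedSizeBlockSketch {
    private final int blockSize;  // plays the role of BLOCK_SIZE
    private final int maxBlocks;  // plays the role of MAX_BLOCKS
    private int blockId = 0;
    private int numOfTokensInBlock = 0;
    private int numOfTokensInDocument = 0;

    FixedSizeBlockSketch(int blockSize, int maxBlocks) {
        if (blockSize < 1 || maxBlocks < 1) {
            throw new IllegalArgumentException("blockSize and maxBlocks must be positive");
        }
        this.blockSize = blockSize;
        this.maxBlocks = maxBlocks;
    }

    /** Records one term and returns the block id it was assigned. */
    int processTerm() {
        int assigned = blockId;
        numOfTokensInDocument++;
        numOfTokensInBlock++;
        // Move to the next block when the current one is full, unless the
        // document has already reached its last permitted block (assumption).
        if (numOfTokensInBlock >= blockSize && blockId < maxBlocks - 1) {
            numOfTokensInBlock = 0;
            blockId++;
        }
        return assigned;
    }

    /** Resets all per-document counters before the next document. */
    void startNewDocument() {
        blockId = 0;
        numOfTokensInBlock = 0;
        numOfTokensInDocument = 0;
    }

    int tokensInDocument() {
        return numOfTokensInDocument;
    }
}
```

For example, with a block size of 2 and at most 2 blocks, five consecutive calls to `processTerm()` place the first two terms in block 0 and the remaining three in block 1.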
Fields inherited from class Indexer: BUILDER_BOUNDARY_DOCUMENTS, currentIndex, directIndexBuilder, docIndexBuilder, emptyDocIndexEntry, fieldNames, fileNameNoExtension, IndexEmptyDocuments, invertedIndexBuilder, lexiconBuilder, logger, MAX_DOCS_PER_BUILDER, MAX_TOKENS_IN_DOCUMENT, metaBuilder, numFields, path, pipeline_first, prefix, useFieldInformation
Constructor and Description |
---|
BlockIndexer(String pathname, String prefix): Constructs an instance of this class, where the created data structures are stored in the given path, with the given prefix on the filenames. |
Modifier and Type | Method and Description |
---|---|
void | createDirectIndex(Collection[] collections): For the given collection, it iterates through the documents and creates the direct index, document index and lexicon, using information about blocks and possibly fields. |
protected void | createDocumentPostings() |
void | createInvertedIndex(): Creates the inverted index from the already created direct index, document index and lexicon. |
protected void | finishedInvertedIndexBuild(): Hook method, called when the inverted index is finished, i.e. when the lexicon is finished. |
protected TermPipeline | getEndOfPipeline(): Returns the object that is to be the end of the TermPipeline. |
protected void | indexDocument(Map<String,String> docProperties, DocumentPostingList _termsInDocument): This adds a document to the direct and document indexes, as well as its terms to the lexicon. |
protected void | load_indexer_properties() |
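The two-phase build described for createDirectIndex and createInvertedIndex (first a per-document direct index, then an inverted index derived from it) can be illustrated in memory. The sketch below is hypothetical and uses plain Java maps. Terrier's own index structures are built and stored differently, and the class name `InvertSketch` is an assumption made only for this example.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory sketch, not Terrier's code: inverting a direct index. */
final class InvertSketch {

    /**
     * @param direct document id -> (term -> block ids where the term occurs)
     * @return term -> (document id -> block ids where the term occurs)
     */
    static Map<String, Map<Integer, List<Integer>>> invert(
            Map<Integer, Map<String, List<Integer>>> direct) {
        Map<String, Map<Integer, List<Integer>>> inverted = new TreeMap<>();
        for (Map.Entry<Integer, Map<String, List<Integer>>> doc : direct.entrySet()) {
            for (Map.Entry<String, List<Integer>> posting : doc.getValue().entrySet()) {
                // Regroup each (document, term, blocks) triple under its term.
                inverted.computeIfAbsent(posting.getKey(), term -> new TreeMap<>())
                        .put(doc.getKey(), posting.getValue());
            }
        }
        return inverted;
    }
}
```

Keeping the block ids in each posting is what later allows proximity or phrase matching to be evaluated from the inverted index alone.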
Methods inherited from class Indexer: createMetaIndexBuilder, finishedDirectIndexBuild, index, indexEmpty, init, load_builder_boundary_documents, load_field_ids, load_pipeline, main, merge, merge, mergeTwoIndices, parseInts, useFieldInformation
protected int numOfTokensInDocument
protected int numOfTokensInBlock
protected int blockId
protected DocumentPostingList termsInDocument
protected int BLOCK_SIZE
protected int MAX_BLOCKS
public BlockIndexer(String pathname, String prefix)
pathname - String the path in which the created data structures will be saved. This is assumed to be absolute.
prefix - String the prefix on the filenames of the created data structures, usually "data"
protected TermPipeline getEndOfPipeline()
getEndOfPipeline
in class Indexer
public void createDirectIndex(Collection[] collections)
createDirectIndex
in class Indexer
collections - Collection[] the collection to index.
Indexer.createDirectIndex(org.terrier.indexing.Collection[])
protected void indexDocument(Map<String,String> docProperties, DocumentPostingList _termsInDocument) throws Exception
docProperties - Map
_termsInDocument - DocumentPostingList the terms in the document.
Throws: Exception
public void createInvertedIndex()
createInvertedIndex
in class Indexer
Indexer.createInvertedIndex()
protected void finishedInvertedIndexBuild()
finishedInvertedIndexBuild
in class Indexer
protected void createDocumentPostings()
protected void load_indexer_properties()
load_indexer_properties
in class Indexer
Terrier 4.0. Copyright © 2004-2014 University of Glasgow