java.lang.Object
  org.terrier.indexing.Indexer
    org.terrier.indexing.BlockIndexer
public class BlockIndexer
An indexer that saves block information for the indexed terms. Block information is usually recorded in terms of relative term positions (position 1, position 2, etc.). However, since 2.2, Terrier supports the presence of "marker terms" during indexing, which are used to increment the block counter. Properties:
Markered Blocks
Markers are terms (artificially inserted or otherwise) in the term stream that denote when the block counter should
be incremented. This functionality is enabled using the block.delimiters.enabled property, while the marker terms are specified as a comma-delimited list in the
block.delimiters property. The following lists the properties:
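As an illustration of the marker mechanism described above, the following is a minimal, self-contained sketch and not Terrier's implementation. It reads the two property names given in this section from a plain `java.util.Properties` object and advances a block counter whenever a marker term is seen. The class name `DelimiterBlockSketch`, the method `assignBlocks`, the marker terms `sentbreak` and `parabreak`, the starting block id of 0, and the choice not to record the marker terms themselves are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Properties;
import java.util.Set;

/** Hypothetical sketch of delimiter-driven block counting; not part of Terrier. */
public class DelimiterBlockSketch {

    /**
     * Returns one "term@blockId" entry per non-marker term.
     * If block.delimiters.enabled is not "true", no term is treated as a marker,
     * so every term stays in block 0 (fixed-size blocks are out of scope here).
     */
    public static List<String> assignBlocks(List<String> terms, Properties props) {
        boolean enabled = Boolean.parseBoolean(
                props.getProperty("block.delimiters.enabled", "false"));

        // Parse the comma-delimited marker list from block.delimiters.
        Set<String> markers = new HashSet<>();
        if (enabled) {
            for (String raw : props.getProperty("block.delimiters", "").split(",")) {
                String marker = raw.trim();
                if (!marker.isEmpty()) {
                    markers.add(marker);
                }
            }
        }

        List<String> out = new ArrayList<>();
        int blockId = 0; // assumption: block numbering starts at 0
        for (String term : terms) {
            if (markers.contains(term)) {
                blockId++;   // a marker increments the block counter
                continue;    // assumption: the marker itself is not recorded
            }
            out.add(term + "@" + blockId);
        }
        return out;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("block.delimiters.enabled", "true");
        props.setProperty("block.delimiters", "sentbreak,parabreak"); // hypothetical markers

        List<String> terms = Arrays.asList("quick", "fox", "sentbreak", "lazy", "dog");
        System.out.println(assignBlocks(terms, props)); // prints [quick@0, fox@0, lazy@1, dog@1]
    }
}
```

In a real Terrier setup these two properties would be set in the Terrier configuration rather than in a local `Properties` object; the sketch only shows how a marker term relates to the block counter.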
Nested Class Summary | |
---|---|
protected class |
BlockIndexer.BasicTermProcessor
This class implements an end of a TermPipeline that adds the term to the DocumentTree. |
protected class |
BlockIndexer.DelimFieldTermProcessor
This class behaves in a similar fashion to FieldTermProcessor except that this one treats blocks bounded by delimiters instead of fixed-sized blocks. |
protected class |
BlockIndexer.DelimTermProcessor
This class behaves in a similar fashion to BasicTermProcessor except that this one treats blocks bounded by delimiters instead of fixed-sized blocks. |
protected class |
BlockIndexer.FieldTermProcessor
This class implements an end of a TermPipeline that adds the term to the DocumentTree. |
Field Summary | |
---|---|
protected int |
BLOCK_SIZE
The maximum number of terms allowed in a block. |
protected int |
blockId
The block number of the current document. |
protected int |
MAX_BLOCKS
The maximum number of blocks allowed in a document. |
protected int |
numOfTokensInBlock
The number of tokens in the current block of the current document. |
protected int |
numOfTokensInDocument
The number of tokens in the current document so far. |
protected java.util.Set<java.lang.String> |
termFields
The fields that are set for the current term. |
protected DocumentPostingList |
termsInDocument
The list of terms in this document, and for each, the block occurrences. |
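The counters in the field summary above suggest how fixed-size blocks could be tracked. The following is a hedged sketch of one plausible reading, not the actual `BlockIndexer` code: each token is assigned the current block id, and the block id advances once the block holds `BLOCK_SIZE` tokens, unless the `MAX_BLOCKS` limit has been reached. The class `FixedBlockSketch`, its method names, and the exact boundary condition (block ids run from 0 to `maxBlocks - 1`, with overflow tokens staying in the last block) are assumptions.

```java
/** Hypothetical sketch of fixed-size block counting; not part of Terrier. */
public class FixedBlockSketch {
    private final int blockSize;  // plays the role of BLOCK_SIZE
    private final int maxBlocks;  // plays the role of MAX_BLOCKS

    private int numOfTokensInDocument = 0; // tokens seen in the current document
    private int numOfTokensInBlock = 0;    // tokens seen in the current block
    private int blockId = 0;               // assumption: block numbering starts at 0

    public FixedBlockSketch(int blockSize, int maxBlocks) {
        if (blockSize < 1 || maxBlocks < 1) {
            throw new IllegalArgumentException("blockSize and maxBlocks must be >= 1");
        }
        this.blockSize = blockSize;
        this.maxBlocks = maxBlocks;
    }

    /** Records one token and returns the block id it was assigned to. */
    public int nextToken() {
        int assigned = blockId;
        numOfTokensInDocument++;
        numOfTokensInBlock++;
        // Start a new block when the current one is full, unless the
        // block limit is reached (assumed behaviour: keep using the last block).
        if (numOfTokensInBlock >= blockSize && blockId < maxBlocks - 1) {
            numOfTokensInBlock = 0;
            blockId++;
        }
        return assigned;
    }

    public int getNumOfTokensInDocument() {
        return numOfTokensInDocument;
    }

    public static void main(String[] args) {
        FixedBlockSketch counter = new FixedBlockSketch(2, 2);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 5; i++) {
            sb.append(counter.nextToken()).append(' ');
        }
        System.out.println(sb.toString().trim()); // prints 0 0 1 1 1
    }
}
```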
Constructor Summary | |
---|---|
BlockIndexer(java.lang.String pathname,
java.lang.String prefix)
Constructs an instance of this class, where the created data structures are stored in the given path, with the given prefix on the filenames. |
Method Summary | |
---|---|
void |
createDirectIndex(Collection[] collections)
For the given collection, it iterates through the documents and creates the direct index, document index and lexicon, using information about blocks and possibly fields. |
protected void |
createDocumentPostings()
|
void |
createInvertedIndex()
Creates the inverted index from the already created direct index, document index and lexicon. |
protected void |
finishedInvertedIndexBuild()
Hook method, called when the inverted index is finished, i.e. when the lexicon is finished. |
protected TermPipeline |
getEndOfPipeline()
Returns the object that is to be the end of the TermPipeline. |
protected void |
indexDocument(java.util.Map<java.lang.String,java.lang.String> docProperties,
DocumentPostingList _termsInDocument)
This adds a document to the direct and document indexes, as well as its terms to the lexicon. |
protected void |
load_indexer_properties()
|
Methods inherited from class org.terrier.indexing.Indexer |
---|
createMetaIndexBuilder, finishedDirectIndexBuild, index, indexEmpty, init, load_builder_boundary_documents, load_field_ids, load_pipeline, main, merge, merge, mergeTwoIndices, parseInts, useFieldInformation |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
protected int numOfTokensInDocument
protected int numOfTokensInBlock
protected int blockId
protected java.util.Set<java.lang.String> termFields
protected DocumentPostingList termsInDocument
protected int BLOCK_SIZE
protected int MAX_BLOCKS
Constructor Detail |
---|
public BlockIndexer(java.lang.String pathname, java.lang.String prefix)
pathname
- String the path in which the created data structures will be saved. This is assumed to be absolute.
prefix
- String the prefix on the filenames of the created data structures, usually "data"
Method Detail |
---|
protected TermPipeline getEndOfPipeline()
getEndOfPipeline
in class Indexer
public void createDirectIndex(Collection[] collections)
createDirectIndex
in class Indexer
collections
- Collection[] the collection to index.
Indexer.createDirectIndex(org.terrier.indexing.Collection[])
protected void indexDocument(java.util.Map<java.lang.String,java.lang.String> docProperties, DocumentPostingList _termsInDocument) throws java.lang.Exception
docProperties
- Map
_termsInDocument
- DocumentPostingList the terms in the document.
java.lang.Exception
public void createInvertedIndex()
createInvertedIndex
in class Indexer
Indexer.createInvertedIndex()
protected void finishedInvertedIndexBuild()
finishedInvertedIndexBuild
in class Indexer
protected void createDocumentPostings()
protected void load_indexer_properties()
load_indexer_properties
in class Indexer