Indexer (Terrier 3.5 API)

org.terrier.indexing
Class Indexer

java.lang.Object
  org.terrier.indexing.Indexer

Direct Known Subclasses: BasicIndexer, BlockIndexer

public abstract class Indexer
extends java.lang.Object

Properties:

termpipelines - the sequence of TermPipeline stages (e.g. Stopwords removal and PorterStemmer).
termpipelines.skip - a list of tokens which should be skipped from the term pipeline. If not set or empty, then none will be skipped.
indexing.max.tokens - The maximum number of tokens the indexer will attempt to index in a document. If 0, then all tokens will be indexed (default).
ignore.empty.documents - Assign empty documents document Ids. Default true.
indexing.max.docs.per.builder - Maximum number of documents in an index before a new index is created, and merged later.
indexing.builder.boundary.docnos - Docnos of documents that force the index being created to be completed, and a new index to be commenced. An alternative to indexing.max.docs.per.builder
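
As a rough illustration of how the properties above might be supplied and read, the sketch below uses plain java.util.Properties instead of Terrier's own configuration classes. The class name IndexerPropertiesSketch, the helper intProperty and the value 5000 are assumptions made for this example only; the pipeline value and the 18000000 fallback are the defaults documented further down this page.

```java
import java.util.Properties;

// Illustrative only: java.util.Properties stands in for Terrier's configuration.
final class IndexerPropertiesSketch {

    /** Builds an example configuration using the property names documented above. */
    static Properties exampleProperties() {
        Properties p = new Properties();
        p.setProperty("termpipelines", "Stopwords,PorterStemmer");
        p.setProperty("indexing.max.tokens", "0");              // 0 = index all tokens
        p.setProperty("indexing.max.docs.per.builder", "5000"); // example value, not a default
        return p;
    }

    /** Reads an integer property, falling back to the given default when it is unset. */
    static int intProperty(Properties p, String key, int defaultValue) {
        return Integer.parseInt(p.getProperty(key, String.valueOf(defaultValue)).trim());
    }

    public static void main(String[] args) {
        Properties p = exampleProperties();
        System.out.println(intProperty(p, "indexing.max.tokens", 0));
        System.out.println(intProperty(p, "indexing.max.docs.per.builder", 18000000));
    }
}
```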

Author: Craig Macdonald

Field Summary
`protected java.lang.String`	`basicDirectIndexPostingIteratorClass`
`protected java.util.HashSet<java.lang.String>`	`BUILDER_BOUNDARY_DOCUMENTS` The DOCNO of documents to force builder boundaries
`protected Index`	`currentIndex` The index being worked on, denoted by path and prefix
`protected DirectInvertedOutputStream`	`directIndexBuilder` The builder that creates the direct index.
`protected DocumentIndexBuilder`	`docIndexBuilder` The builder that creates the document index.
`protected DocumentIndexEntry`	`emptyDocIndexEntry`
`protected java.lang.String`	`fieldDirectIndexPostingIteratorClass`
`protected gnu.trove.TObjectIntHashMap<java.lang.String>`	`fieldNames` mapping: field name -> field id, returns 0 for no mapping
`protected java.lang.String`	`fileNameNoExtension` The common prefix of the data structures filenames.
`protected boolean`	`IndexEmptyDocuments` Indicates whether an entry for empty documents is stored in the document index, or empty documents should be ignored.
`protected InvertedIndexBuilder`	`invertedIndexBuilder` The builder that creates the inverted index.
`protected LexiconBuilder`	`lexiconBuilder` The builder that creates the lexicon.
`protected static org.apache.log4j.Logger`	`logger` the logger for this class
`protected int`	`MAX_DOCS_PER_BUILDER` The number of documents indexed with a set of builders.
`protected int`	`MAX_TOKENS_IN_DOCUMENT` The maximum number of tokens in a document.
`protected MetaIndexBuilder`	`metaBuilder`
`protected int`	`numFields` the number of fields
`protected java.lang.String`	`path` The path in which the data structures are stored.
`protected TermPipeline`	`pipeline_first` The first component of the term pipeline.
`protected java.lang.String`	`prefix` The prefix of the data structures, ie the first part of the filename
`protected boolean`	`useFieldInformation` Indicates whether field information should be saved in the created data structures.

Constructor Summary
	`Indexer()` Creates an indexer at the location ApplicationSetup.TERRIER_INDEX_PATH and ApplicationSetup.TERRIER_INDEX_PREFIX
`protected`	`Indexer(long a, long b, long c)` Protected do-nothing constructor for use by child classes
	`Indexer(java.lang.String _path, java.lang.String _prefix)` Creates an instance of the class.

Method Summary
`abstract void`	`createDirectIndex(Collection[] collections)` An abstract method for creating the direct index, the document index and the lexicon for the given collections.
`abstract void`	`createInvertedIndex()` An abstract method for creating the inverted index, given that the direct index, the document index and the lexicon have already been created.
`protected MetaIndexBuilder`	`createMetaIndexBuilder()`
`protected void`	`finishedDirectIndexBuild()` event method to be overridden by child classes
`protected void`	`finishedInvertedIndexBuild()` event method to be overridden by child classes
`protected abstract TermPipeline`	`getEndOfPipeline()` An abstract method that returns the last component of the term pipeline.
`void`	`index(Collection[] collections)` Creates the data structures for a set of collections.
`protected void`	`indexEmpty(java.util.Map<java.lang.String,java.lang.String> docProperties)` Adds an entry to the document index for an empty document, only if IndexEmptyDocuments is set to true.
`protected void`	`init()` This method must be called by anything which directly extends Indexer.
`protected void`	`load_builder_boundary_documents()` Loads the builder boundary documents from the property `indexing.builder.boundary.docnos`, comma delimited.
`protected void`	`load_field_ids()` loads a mapping of field name -> field id
`protected void`	`load_indexer_properties()`
`protected void`	`load_pipeline()` Creates the term pipeline, as specified by the property `termpipelines` in the properties file.
`static void`	`main(java.lang.String[] args)` Utility method for merging indices
`static void`	`merge(java.lang.String mpath, java.lang.String mprefix, int lowest, int highest)` Merge a series of numbered indices in the same path/prefix area.
`static void`	`merge(java.lang.String mpath, java.lang.String mprefix, java.util.LinkedList<java.lang.String[]> llist, int counterMerged)` Merge a series of indices, in pair-wise fashion
`protected static void`	`mergeTwoIndices(java.lang.String[] index1, java.lang.String[] index2, java.lang.String[] outputIndex)` Merge two indices.
`protected static int[]`	`parseInts(java.lang.String[] in)`
`boolean`	`useFieldInformation()` Returns whether the index will record fields

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

logger

protected static final org.apache.log4j.Logger logger

the logger for this class

MAX_DOCS_PER_BUILDER

protected int MAX_DOCS_PER_BUILDER

The number of documents indexed with a set of builders. If a collection consists of more documents, then we need to create new builders and later merge the data structures. The corresponding property is indexing.max.docs.per.builder and the default value is 18000000 (18 million documents). If the property is set equal to zero, then there is no limit.
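
To make the arithmetic concrete, here is a minimal sketch of how many sets of builders a collection would need. It assumes builders are rotated exactly every MAX_DOCS_PER_BUILDER documents and that no boundary docnos are configured; the class and method names are hypothetical and not part of Terrier.

```java
// Hypothetical helper; not part of the Terrier API.
final class BuilderCountSketch {

    /**
     * Number of builder sets needed for docCount documents.
     * A limit of zero means "no limit", so a single set is used.
     */
    static long buildersNeeded(long docCount, int maxDocsPerBuilder) {
        if (maxDocsPerBuilder <= 0) {
            return 1;
        }
        // Ceiling division, with at least one set even for an empty collection.
        return Math.max(1, (docCount + maxDocsPerBuilder - 1) / maxDocsPerBuilder);
    }

    public static void main(String[] args) {
        System.out.println(buildersNeeded(50_000_000L, 18_000_000)); // 3
        System.out.println(buildersNeeded(50_000_000L, 0));          // 1
    }
}
```

Under these assumptions, a 50 million document collection with the default limit would produce three partial indices that are merged afterwards.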

MAX_TOKENS_IN_DOCUMENT

protected int MAX_TOKENS_IN_DOCUMENT

The maximum number of tokens in a document. If it is set to zero, then there is no limit in the number of tokens indexed for a document. Set by property indexing.max.tokens.
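
The effect of this limit can be sketched as a simple truncation. This is an illustration of the documented behaviour only, with hypothetical names; it does not show how Terrier applies the limit internally.

```java
import java.util.List;

// Hypothetical helper; not part of the Terrier API.
final class TokenLimitSketch {

    /** Returns at most maxTokens tokens; a limit of zero keeps every token. */
    static List<String> limitTokens(List<String> tokens, int maxTokens) {
        if (maxTokens == 0 || tokens.size() <= maxTokens) {
            return tokens;
        }
        return tokens.subList(0, maxTokens);
    }
}
```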

BUILDER_BOUNDARY_DOCUMENTS

protected final java.util.HashSet<java.lang.String> BUILDER_BOUNDARY_DOCUMENTS

The DOCNO of documents to force builder boundaries

useFieldInformation

protected boolean useFieldInformation

Indicates whether field information should be saved in the created data structures.

pipeline_first

protected TermPipeline pipeline_first

The first component of the term pipeline.

IndexEmptyDocuments

protected boolean IndexEmptyDocuments

Indicates whether an entry for empty documents is stored in the document index, or empty documents should be ignored.

directIndexBuilder

protected DirectInvertedOutputStream directIndexBuilder

The builder that creates the direct index.

docIndexBuilder

protected DocumentIndexBuilder docIndexBuilder

The builder that creates the document index.

invertedIndexBuilder

protected InvertedIndexBuilder invertedIndexBuilder

The builder that creates the inverted index.

lexiconBuilder

protected LexiconBuilder lexiconBuilder

The builder that creates the lexicon.

metaBuilder

protected MetaIndexBuilder metaBuilder

fileNameNoExtension

protected java.lang.String fileNameNoExtension

The common prefix of the data structures filenames.

path

protected java.lang.String path

The path in which the data structures are stored.

prefix

protected java.lang.String prefix

The prefix of the data structures, ie the first part of the filename

currentIndex

protected Index currentIndex

The index being worked on, denoted by path and prefix

basicDirectIndexPostingIteratorClass

protected java.lang.String basicDirectIndexPostingIteratorClass

fieldDirectIndexPostingIteratorClass

protected java.lang.String fieldDirectIndexPostingIteratorClass

fieldNames

protected gnu.trove.TObjectIntHashMap<java.lang.String> fieldNames

mapping: field name -> field id, returns 0 for no mapping

numFields

protected int numFields

the number of fields

emptyDocIndexEntry

protected DocumentIndexEntry emptyDocIndexEntry

Constructor Detail

Indexer

public Indexer()

Creates an indexer at the location ApplicationSetup.TERRIER_INDEX_PATH and ApplicationSetup.TERRIER_INDEX_PREFIX

Indexer

public Indexer(java.lang.String _path,
               java.lang.String _prefix)

Creates an instance of the class. The generated data structures will be saved in the given path. The filename of the data structures is given by the prefix parameter.

Parameters: _path - String the path where the generated data structures will be saved; _prefix - String the filename that the data structures will have.

Indexer

protected Indexer(long a,
                  long b,
                  long c)

Protected do-nothing constructor for use by child classes

Method Detail

init

protected void init()

This method must be called by anything which directly extends Indexer. See: http://benpryor.com/blog/2008/01/02/dont-call-subclass-methods-from-a-superclass-constructor/
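
The reason for a separate init() step can be shown with a small self-contained sketch. The classes below are hypothetical and only illustrate the general Java pitfall described at the link above: a superclass constructor that calls an overridable method runs it before the subclass's fields have been assigned.

```java
// Hypothetical classes illustrating the constructor pitfall; not Terrier code.
final class InitOrderSketch {

    static abstract class Base {
        String seenInConstructor;
        String seenInInit;

        Base() {
            // Calling an overridable method here runs before subclass fields are assigned.
            seenInConstructor = describe();
        }

        /** Safe alternative: the subclass calls this once its own fields are set. */
        protected void init() {
            seenInInit = describe();
        }

        abstract String describe();
    }

    static final class Child extends Base {
        private final String name;

        Child(String name) {
            super();           // Base() runs first; 'name' is still null at that point
            this.name = name;
            init();            // now describe() sees the initialised field
        }

        @Override
        String describe() {
            return "child:" + name;
        }
    }

    public static void main(String[] args) {
        Child c = new Child("basic");
        System.out.println(c.seenInConstructor); // child:null
        System.out.println(c.seenInInit);        // child:basic
    }
}
```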

createDirectIndex

public abstract void createDirectIndex(Collection[] collections)

An abstract method for creating the direct index, the document index and the lexicon for the given collections.

Parameters: collections - Collection[] An array of collections to index

createInvertedIndex

public abstract void createInvertedIndex()

An abstract method for creating the inverted index, given that the direct index, the document index and the lexicon have already been created.

getEndOfPipeline

protected abstract TermPipeline getEndOfPipeline()

An abstract method that returns the last component of the term pipeline.

Returns: TermPipeline the end of the term pipeline.

createMetaIndexBuilder

protected MetaIndexBuilder createMetaIndexBuilder()

parseInts

protected static final int[] parseInts(java.lang.String[] in)

load_indexer_properties

protected void load_indexer_properties()

load_field_ids

protected void load_field_ids()

loads a mapping of field name -> field id

load_pipeline

protected void load_pipeline()

Creates the term pipeline, as specified by the property termpipelines in the properties file. The default value of the property termpipelines is Stopwords,PorterStemmer. This means that we first remove stopwords and then apply Porter's stemming algorithm.
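
The chaining idea (each stage either drops a term or hands it to the next stage) can be sketched as follows. The Stage interface, the factory methods and the toy stemmer are assumptions made for this example: the stemmer simply strips a trailing "s" and is not Porter's algorithm, and none of this is Terrier's actual TermPipeline code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical stand-ins for pipeline stages; not Terrier classes.
final class TermPipelineSketch {

    /** A pipeline stage: receives a term and may pass it on to the next stage. */
    interface Stage {
        void processTerm(String term);
    }

    /** Drops stopwords and forwards everything else. */
    static Stage stopwordFilter(Set<String> stopSet, Stage next) {
        return term -> {
            if (!stopSet.contains(term)) {
                next.processTerm(term);
            }
        };
    }

    /** Toy stemmer that strips a trailing "s"; a placeholder, NOT Porter's algorithm. */
    static Stage toyStemmer(Stage next) {
        return term -> next.processTerm(
                term.length() > 3 && term.endsWith("s")
                        ? term.substring(0, term.length() - 1)
                        : term);
    }

    /** Builds stopword filter -> stemmer -> collector and runs the terms through it. */
    static List<String> run(List<String> terms, Set<String> stopSet) {
        List<String> out = new ArrayList<>();
        Stage end = out::add;                                 // end of the pipeline
        Stage first = stopwordFilter(stopSet, toyStemmer(end));
        for (String t : terms) {
            first.processTerm(t);
        }
        return out;
    }
}
```

Stages are built from the end backwards so that each one holds a reference to its successor, which is why a subclass needs to supply the final stage.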

load_builder_boundary_documents

protected void load_builder_boundary_documents()

Loads the builder boundary documents from the property indexing.builder.boundary.docnos, comma delimited.
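
A minimal sketch of parsing such a comma-delimited value into a set is shown below. Trimming whitespace and ignoring empty entries are assumptions of this example, as are the class and method names; the source only states that the property is comma delimited.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper; not part of the Terrier API.
final class BoundaryDocnosSketch {

    /** Splits a comma-delimited property value into a set of docnos. */
    static Set<String> parse(String propertyValue) {
        Set<String> docnos = new HashSet<>();
        if (propertyValue == null) {
            return docnos;
        }
        for (String part : propertyValue.split(",")) {
            String docno = part.trim();   // assumption: surrounding whitespace is not significant
            if (!docno.isEmpty()) {
                docnos.add(docno);
            }
        }
        return docnos;
    }
}
```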

index

public void index(Collection[] collections)

Creates the data structures for a set of collections. It creates a set of data structures for every indexing.max.docs.per.builder documents, if the value of this property is greater than zero, and then it merges the generated data structures.

Parameters: collections - The document collection objects to index.
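
The ordering constraint between the two abstract build steps can be sketched as a template method. This is an assumption about the overall shape, not a copy of the Terrier implementation: the skeleton class, its String[] parameter (standing in for Collection[]) and the recording subclass are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical skeleton; it does not reproduce the Terrier implementation.
final class TwoPhaseIndexSketch {

    static abstract class SketchIndexer {
        final List<String> steps = new ArrayList<>();

        abstract void createDirectIndex(String[] collections);

        abstract void createInvertedIndex();

        /** Runs the direct-index pass first, since the inverted pass depends on its output. */
        void index(String[] collections) {
            createDirectIndex(collections);
            createInvertedIndex();
        }
    }

    /** A subclass that only records which step ran, for demonstration. */
    static SketchIndexer recording() {
        return new SketchIndexer() {
            @Override
            void createDirectIndex(String[] collections) {
                steps.add("direct:" + collections.length);
            }

            @Override
            void createInvertedIndex() {
                steps.add("inverted");
            }
        };
    }
}
```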

merge

public static void merge(java.lang.String mpath,
                         java.lang.String mprefix,
                         int lowest,
                         int highest)

Merge a series of numbered indices in the same path/prefix area. The new merged index will be stored at mpath/mprefix_highest+1.

Parameters: mpath - Path of all indices; mprefix - Common prefix of all indices; lowest - lowest suffix of prefix; highest - highest suffix of prefix

mergeTwoIndices

protected static void mergeTwoIndices(java.lang.String[] index1,
                                      java.lang.String[] index2,
                                      java.lang.String[] outputIndex)

Merge two indices.

Parameters: index1 - Path/Prefix of source index 1; index2 - Path/Prefix of source index 2; outputIndex - Path/Prefix of destination index

merge

public static void merge(java.lang.String mpath,
                         java.lang.String mprefix,
                         java.util.LinkedList<java.lang.String[]> llist,
                         int counterMerged)

Merge a series of indices, in pair-wise fashion

Parameters: mpath - Common path of all indices; mprefix - Prefix of target index; counterMerged - number of indices to merge
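
One plausible reading of "pair-wise fashion" is a queue-based loop: take two indices from the front, merge them, and append the result to the back until a single index remains. The sketch below follows that reading. The mprefix_N output naming is borrowed from the other merge method, the counter is treated as the next free suffix, and the actual two-index merge is replaced by a log entry; all of these are assumptions, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical sketch of a pairwise merge loop; not the Terrier implementation.
final class PairwiseMergeSketch {

    /**
     * Repeatedly merges the first two pending indices until one remains.
     * Each index is a {path, prefix} pair; pending must contain at least one entry.
     */
    static String[] mergeAll(String mpath, String mprefix, LinkedList<String[]> pending,
                             int counter, List<String> log) {
        while (pending.size() > 1) {
            String[] a = pending.removeFirst();
            String[] b = pending.removeFirst();
            String[] out = {mpath, mprefix + "_" + counter++};
            // A real implementation would merge the two indices on disk here.
            log.add(a[1] + " + " + b[1] + " -> " + out[1]);
            pending.addLast(out);
        }
        return pending.getFirst();
    }

    public static void main(String[] args) {
        LinkedList<String[]> pending = new LinkedList<>();
        for (int i = 1; i <= 3; i++) {
            pending.add(new String[] {"/tmp/index", "idx_" + i});
        }
        List<String> log = new ArrayList<>();
        String[] result = mergeAll("/tmp/index", "idx", pending, 4, log);
        log.forEach(System.out::println);
        System.out.println("final: " + result[1]); // final: idx_5
    }
}
```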

finishedDirectIndexBuild

protected void finishedDirectIndexBuild()

event method to be overridden by child classes

finishedInvertedIndexBuild

protected void finishedInvertedIndexBuild()

event method to be overridden by child classes

useFieldInformation

public boolean useFieldInformation()

Returns whether the index will record fields

indexEmpty

protected void indexEmpty(java.util.Map<java.lang.String,java.lang.String> docProperties)
                   throws java.io.IOException

Adds an entry to the document index for an empty document, only if IndexEmptyDocuments is set to true.

Throws: java.io.IOException

main

public static void main(java.lang.String[] args)
                 throws java.lang.Exception

Utility method for merging indices

Throws: java.lang.Exception
