org.terrier.indexing
Class Indexer

java.lang.Object
  extended by org.terrier.indexing.Indexer
Direct Known Subclasses:
BasicIndexer, BlockIndexer

public abstract class Indexer
extends java.lang.Object

Properties:

Author:
Craig Macdonald

Field Summary
protected  java.lang.String basicDirectIndexPostingIteratorClass
           
protected  java.util.HashSet<java.lang.String> BUILDER_BOUNDARY_DOCUMENTS
          The DOCNO of documents to force builder boundaries
protected  Index currentIndex
          The index being worked on, denoted by path and prefix
protected  DirectInvertedOutputStream directIndexBuilder
          The builder that creates the direct index.
protected  DocumentIndexBuilder docIndexBuilder
          The builder that creates the document index.
protected  DocumentIndexEntry emptyDocIndexEntry
           
protected  java.lang.String fieldDirectIndexPostingIteratorClass
           
protected  gnu.trove.TObjectIntHashMap<java.lang.String> fieldNames
          mapping: field name -> field id, returns 0 for no mapping
protected  java.lang.String fileNameNoExtension
          The common prefix of the data structures filenames.
protected  boolean IndexEmptyDocuments
          Indicates whether an entry for empty documents is stored in the document index, or empty documents should be ignored.
protected  InvertedIndexBuilder invertedIndexBuilder
          The builder that creates the inverted index.
protected  LexiconBuilder lexiconBuilder
          The builder that creates the lexicon.
protected static org.apache.log4j.Logger logger
          the logger for this class
protected  int MAX_DOCS_PER_BUILDER
          The number of documents indexed with a set of builders.
protected  int MAX_TOKENS_IN_DOCUMENT
          The maximum number of tokens in a document.
protected  MetaIndexBuilder metaBuilder
           
protected  int numFields
          the number of fields
protected  java.lang.String path
          The path in which the data structures are stored.
protected  TermPipeline pipeline_first
          The first component of the term pipeline.
protected  java.lang.String prefix
          The prefix of the data structures, i.e. the first part of the filename
protected  boolean useFieldInformation
          Indicates whether field information should be saved in the created data structures.
 
Constructor Summary
  Indexer()
          Creates an indexer at the location ApplicationSetup.TERRIER_INDEX_PATH and ApplicationSetup.TERRIER_INDEX_PREFIX
protected Indexer(long a, long b, long c)
          Protected do-nothing constructor for use by child classes
  Indexer(java.lang.String _path, java.lang.String _prefix)
          Creates an instance of the class.
 
Method Summary
abstract  void createDirectIndex(Collection[] collections)
          An abstract method for creating the direct index, the document index and the lexicon for the given collections.
abstract  void createInvertedIndex()
          An abstract method for creating the inverted index, given that the direct index, the document index and the lexicon have already been created.
protected  MetaIndexBuilder createMetaIndexBuilder()
           
protected  void finishedDirectIndexBuild()
          event method to be overridden by child classes
protected  void finishedInvertedIndexBuild()
          event method to be overridden by child classes
protected abstract  TermPipeline getEndOfPipeline()
          An abstract method that returns the last component of the term pipeline.
 void index(Collection[] collections)
          Creates the data structures for a set of collections.
protected  void indexEmpty(java.util.Map<java.lang.String,java.lang.String> docProperties)
          Adds an entry to the document index for an empty document, only if IndexEmptyDocuments is set to true.
protected  void init()
          This method must be called by anything which directly extends Indexer.
protected  void load_builder_boundary_documents()
          Loads the builder boundary documents from the property indexing.builder.boundary.docnos, comma delimited.
protected  void load_field_ids()
          loads a mapping of field name -> field id
protected  void load_indexer_properties()
           
protected  void load_pipeline()
          Creates the term pipeline, as specified by the property termpipelines in the properties file.
static void main(java.lang.String[] args)
          Utility method for merging indices
static void merge(java.lang.String mpath, java.lang.String mprefix, int lowest, int highest)
          Merge a series of numbered indices in the same path/prefix area.
static void merge(java.lang.String mpath, java.lang.String mprefix, java.util.LinkedList<java.lang.String[]> llist, int counterMerged)
          Merge a series of indices, in pair-wise fashion
protected static void mergeTwoIndices(java.lang.String[] index1, java.lang.String[] index2, java.lang.String[] outputIndex)
          Merge two indices.
protected static int[] parseInts(java.lang.String[] in)
           
 boolean useFieldInformation()
          Returns whether the index will record fields
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

logger

protected static final org.apache.log4j.Logger logger
the logger for this class


MAX_DOCS_PER_BUILDER

protected int MAX_DOCS_PER_BUILDER
The number of documents indexed with a set of builders. If a collection consists of more documents, then we need to create new builders and later merge the data structures. The corresponding property is indexing.max.docs.per.builder and the default value is 18000000 (18 million documents). If the property is set equal to zero, then there is no limit.
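
As a rough illustration of the limit described above, the following sketch computes how many sets of builders a collection would need. The class and method names are hypothetical and not part of Terrier; only the rule that zero means "no limit" and the default of 18000000 come from the description.

```java
public class BuilderBatchSketch {
    /**
     * Hypothetical helper: number of builder sets needed for numDocs documents.
     * A limit of zero is treated as "no limit", so one set of builders suffices.
     */
    static long builderSetsNeeded(long numDocs, long maxDocsPerBuilder) {
        if (numDocs <= 0) {
            return 0; // nothing to index
        }
        if (maxDocsPerBuilder <= 0) {
            return 1; // no limit: a single set of builders
        }
        // ceiling division: a partially filled last batch still needs builders
        return (numDocs + maxDocsPerBuilder - 1) / maxDocsPerBuilder;
    }

    public static void main(String[] args) {
        System.out.println(builderSetsNeeded(40_000_000L, 18_000_000L)); // 3
        System.out.println(builderSetsNeeded(40_000_000L, 0L));          // 1
    }
}
```

Whenever more than one set of builders is needed, the resulting data structures have to be merged afterwards, as the description states.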


MAX_TOKENS_IN_DOCUMENT

protected int MAX_TOKENS_IN_DOCUMENT
The maximum number of tokens in a document. If it is set to zero, then there is no limit in the number of tokens indexed for a document. Set by property indexing.max.tokens.
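
A minimal sketch of how such a token limit could be applied, assuming that tokens beyond the limit are simply not indexed. The helper below is hypothetical; the exact behaviour of Terrier's indexers when the limit is reached is not specified here.

```java
import java.util.Arrays;
import java.util.List;

public class TokenLimitSketch {
    /**
     * Hypothetical helper: returns the tokens that would be indexed for one document.
     * A limit of zero means every token is kept.
     */
    static List<String> tokensToIndex(List<String> tokens, int maxTokens) {
        if (maxTokens <= 0 || tokens.size() <= maxTokens) {
            return tokens; // no limit, or the document is short enough
        }
        return tokens.subList(0, maxTokens); // assumption: later tokens are dropped
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("terrier", "builds", "an", "inverted", "index");
        System.out.println(tokensToIndex(doc, 3)); // [terrier, builds, an]
        System.out.println(tokensToIndex(doc, 0)); // all five tokens
    }
}
```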


BUILDER_BOUNDARY_DOCUMENTS

protected final java.util.HashSet<java.lang.String> BUILDER_BOUNDARY_DOCUMENTS
The DOCNO of documents to force builder boundaries


useFieldInformation

protected boolean useFieldInformation
Indicates whether field information should be saved in the created data structures.


pipeline_first

protected TermPipeline pipeline_first
The first component of the term pipeline.


IndexEmptyDocuments

protected boolean IndexEmptyDocuments
Indicates whether an entry for empty documents is stored in the document index, or empty documents should be ignored.


directIndexBuilder

protected DirectInvertedOutputStream directIndexBuilder
The builder that creates the direct index.


docIndexBuilder

protected DocumentIndexBuilder docIndexBuilder
The builder that creates the document index.


invertedIndexBuilder

protected InvertedIndexBuilder invertedIndexBuilder
The builder that creates the inverted index.


lexiconBuilder

protected LexiconBuilder lexiconBuilder
The builder that creates the lexicon.


metaBuilder

protected MetaIndexBuilder metaBuilder

fileNameNoExtension

protected java.lang.String fileNameNoExtension
The common prefix of the data structures filenames.


path

protected java.lang.String path
The path in which the data structures are stored.


prefix

protected java.lang.String prefix
The prefix of the data structures, i.e. the first part of the filename


currentIndex

protected Index currentIndex
The index being worked on, denoted by path and prefix


basicDirectIndexPostingIteratorClass

protected java.lang.String basicDirectIndexPostingIteratorClass

fieldDirectIndexPostingIteratorClass

protected java.lang.String fieldDirectIndexPostingIteratorClass

fieldNames

protected gnu.trove.TObjectIntHashMap<java.lang.String> fieldNames
mapping: field name -> field id, returns 0 for no mapping
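
The "returns 0 for no mapping" convention can be sketched with a plain java.util.HashMap standing in for the Trove map. The assumption that field ids start at 1 (so that 0 is free to mean "not a field") is an inference from the description, not something the page states, and the names below are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class FieldIdSketch {
    /** Hypothetical helper: assigns ids 1, 2, 3, ... to the field names in order. */
    static Map<String, Integer> buildFieldIds(String[] names) {
        Map<String, Integer> ids = new HashMap<>();
        int next = 1; // assumption: 0 is reserved for "no mapping"
        for (String name : names) {
            if (!ids.containsKey(name)) {
                ids.put(name, next++);
            }
        }
        return ids;
    }

    /** Looks up a field id, returning 0 when the name is not a known field. */
    static int fieldId(Map<String, Integer> ids, String name) {
        return ids.getOrDefault(name, 0);
    }

    public static void main(String[] args) {
        Map<String, Integer> ids = buildFieldIds(new String[] {"TITLE", "BODY"});
        System.out.println(fieldId(ids, "TITLE"));  // 1
        System.out.println(fieldId(ids, "FOOTER")); // 0
    }
}
```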


numFields

protected int numFields
the number of fields


emptyDocIndexEntry

protected DocumentIndexEntry emptyDocIndexEntry


Constructor Detail

Indexer

public Indexer()
Creates an indexer at the location ApplicationSetup.TERRIER_INDEX_PATH and ApplicationSetup.TERRIER_INDEX_PREFIX


Indexer

public Indexer(java.lang.String _path,
               java.lang.String _prefix)
Creates an instance of the class. The generated data structures will be saved in the given path. The filename of the data structures is given by the prefix parameter.

Parameters:
_path - String the path where the generated data structures will be saved.
_prefix - String the filename that the data structures will have.

Indexer

protected Indexer(long a,
                  long b,
                  long c)
Protected do-nothing constructor for use by child classes

Method Detail

init

protected void init()
This method must be called by anything which directly extends Indexer. See: http://benpryor.com/blog/2008/01/02/dont-call-subclass-methods-from-a-superclass-constructor/
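
The reason for a separate init() step can be shown with a small, Terrier-independent sketch. All class names here are hypothetical. When a superclass constructor calls an overridable method, the subclass's field initialisers have not run yet, so the override sees default values. Calling init() from the subclass constructor, after super(...) has returned, avoids this.

```java
public class InitOrderSketch {
    /** Hypothetical base class standing in for an abstract indexer. */
    static abstract class Base {
        String snapshot; // what the overridable method returned when it was called

        Base(boolean callHookInConstructor) {
            if (callHookInConstructor) {
                snapshot = describe(); // anti-pattern: subclass state is not initialised yet
            }
        }

        /** To be called by direct subclasses once their own construction has started. */
        protected void init() {
            snapshot = describe();
        }

        protected abstract String describe();
    }

    static class Child extends Base {
        // Deliberately non-final: this initialiser runs only after the Base constructor returns.
        private String label = "child";

        Child(boolean callHookInConstructor) {
            super(callHookInConstructor);
            if (!callHookInConstructor) {
                init(); // safe: label has been initialised by now
            }
        }

        @Override
        protected String describe() {
            return "label=" + label;
        }
    }

    public static void main(String[] args) {
        System.out.println(new Child(true).snapshot);  // label=null
        System.out.println(new Child(false).snapshot); // label=child
    }
}
```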


createDirectIndex

public abstract void createDirectIndex(Collection[] collections)
An abstract method for creating the direct index, the document index and the lexicon for the given collections.

Parameters:
collections - Collection[] An array of collections to index

createInvertedIndex

public abstract void createInvertedIndex()
An abstract method for creating the inverted index, given that the direct index, the document index and the lexicon have already been created.


getEndOfPipeline

protected abstract TermPipeline getEndOfPipeline()
An abstract method that returns the last component of the term pipeline.

Returns:
TermPipeline the end of the term pipeline.

createMetaIndexBuilder

protected MetaIndexBuilder createMetaIndexBuilder()

parseInts

protected static final int[] parseInts(java.lang.String[] in)

load_indexer_properties

protected void load_indexer_properties()

load_field_ids

protected void load_field_ids()
loads a mapping of field name -> field id


load_pipeline

protected void load_pipeline()
Creates the term pipeline, as specified by the property termpipelines in the properties file. The default value of the property termpipelines is Stopwords,PorterStemmer. This means that we first remove stopwords and then apply Porter's stemming algorithm.
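
The chaining idea behind a term pipeline can be sketched without Terrier on the classpath. The Stage interface and both stage classes below are hypothetical stand-ins, and the suffix stage is a toy that only strips a trailing "s"; it is not Porter's algorithm. The point is only the order: each term first passes the stopword filter, then the stemming step, and finally reaches the end of the pipeline.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PipelineSketch {
    /** Hypothetical stand-in for a pipeline component. */
    interface Stage {
        void processTerm(String term);
    }

    /** Drops stopwords and forwards everything else. */
    static class StopwordStage implements Stage {
        private final Set<String> stopwords;
        private final Stage next;

        StopwordStage(Set<String> stopwords, Stage next) {
            this.stopwords = stopwords;
            this.next = next;
        }

        public void processTerm(String term) {
            if (!stopwords.contains(term)) {
                next.processTerm(term);
            }
        }
    }

    /** Toy stemming step (NOT the Porter stemmer): strips one trailing "s" from longer terms. */
    static class ToySuffixStage implements Stage {
        private final Stage next;

        ToySuffixStage(Stage next) {
            this.next = next;
        }

        public void processTerm(String term) {
            boolean strip = term.length() > 3 && term.endsWith("s");
            next.processTerm(strip ? term.substring(0, term.length() - 1) : term);
        }
    }

    /** Builds the chain back to front and pushes every term through its first stage. */
    static List<String> run(List<String> terms) {
        List<String> collected = new ArrayList<>();
        Stage end = term -> collected.add(term); // end of pipeline: collects surviving terms
        Set<String> stopwords = new HashSet<>(Arrays.asList("the", "of", "and"));
        Stage first = new StopwordStage(stopwords, new ToySuffixStage(end));
        for (String term : terms) {
            first.processTerm(term);
        }
        return collected;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("the", "builders", "of", "lexicon"))); // [builder, lexicon]
    }
}
```

In this sketch, first plays the role of pipeline_first and end plays the role of the component returned by getEndOfPipeline().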


load_builder_boundary_documents

protected void load_builder_boundary_documents()
Loads the builder boundary documents from the property indexing.builder.boundary.docnos, comma delimited.
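
A minimal sketch of reading such a comma-delimited property into a set. It uses java.util.Properties rather than Terrier's own configuration classes, and trimming whitespace around each DOCNO is an assumption. The DOCNO values in the example are made up.

```java
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;

public class BoundaryDocnosSketch {
    /** Hypothetical helper: parses indexing.builder.boundary.docnos into a set of DOCNOs. */
    static Set<String> parseBoundaryDocnos(Properties props) {
        Set<String> docnos = new HashSet<>();
        String raw = props.getProperty("indexing.builder.boundary.docnos", "");
        for (String part : raw.split(",")) {
            String docno = part.trim(); // assumption: surrounding whitespace is ignored
            if (!docno.isEmpty()) {
                docnos.add(docno);
            }
        }
        return docnos;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("indexing.builder.boundary.docnos", "DOC-0100, DOC-0200");
        System.out.println(parseBoundaryDocnos(props).contains("DOC-0200")); // true
    }
}
```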


index

public void index(Collection[] collections)
Creates the data structures for a set of collections. If the value of the property indexing.max.docs.per.builder is greater than zero, it creates one set of data structures for every indexing.max.docs.per.builder documents and then merges the generated data structures.

Parameters:
collections - The document collection objects to index.

merge

public static void merge(java.lang.String mpath,
                         java.lang.String mprefix,
                         int lowest,
                         int highest)
Merge a series of numbered indices in the same path/prefix area. New merged index will be stored at mpath/mprefix_highest+1.

Parameters:
mpath - Path of all indices
mprefix - Common prefix of all indices
lowest - lowest suffix of the prefix
highest - highest suffix of the prefix
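
The naming scheme described above can be sketched as follows. The sketch assumes that each numbered index is identified by a {path, prefix} pair and that the number is appended to the prefix after an underscore, reading "mprefix_highest+1" literally. The helper names and the example path are hypothetical.

```java
import java.util.LinkedList;

public class NumberedIndicesSketch {
    /** Hypothetical helper: the {path, prefix} pairs of the source indices lowest..highest. */
    static LinkedList<String[]> sources(String mpath, String mprefix, int lowest, int highest) {
        LinkedList<String[]> list = new LinkedList<>();
        for (int i = lowest; i <= highest; i++) {
            list.add(new String[] {mpath, mprefix + "_" + i});
        }
        return list;
    }

    /** Hypothetical helper: the {path, prefix} pair of the merged index. */
    static String[] target(String mpath, String mprefix, int highest) {
        return new String[] {mpath, mprefix + "_" + (highest + 1)};
    }

    public static void main(String[] args) {
        for (String[] index : sources("/tmp/index", "data", 1, 3)) {
            System.out.println(index[0] + "/" + index[1]); // data_1, data_2, data_3
        }
        String[] merged = target("/tmp/index", "data", 3);
        System.out.println(merged[0] + "/" + merged[1]);   // /tmp/index/data_4
    }
}
```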

mergeTwoIndices

protected static void mergeTwoIndices(java.lang.String[] index1,
                                      java.lang.String[] index2,
                                      java.lang.String[] outputIndex)
Merge two indices.

Parameters:
index1 - Path/Prefix of source index 1
index2 - Path/Prefix of source index 2
outputIndex - Path/Prefix of destination index

merge

public static void merge(java.lang.String mpath,
                         java.lang.String mprefix,
                         java.util.LinkedList<java.lang.String[]> llist,
                         int counterMerged)
Merge a series of indices, in pair-wise fashion

Parameters:
mpath - Common path of all indices
mprefix - Prefix of target index
counterMerged - number of indices to merge
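
Pair-wise merging can be sketched as a queue that repeatedly takes two indices, merges them and appends the result, until a single index remains. This is an illustration of the general technique, not Terrier's implementation. The signature is simplified (no counterMerged), the actual merge is replaced by a log entry, and the "_tmp" naming of intermediate indices is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class PairwiseMergeSketch {
    /**
     * Hypothetical helper: returns the sequence of two-way merges needed to
     * reduce the given {path, prefix} pairs to one index with prefix mprefix.
     */
    static List<String> planMerges(String mpath, String mprefix, LinkedList<String[]> indices) {
        LinkedList<String[]> queue = new LinkedList<>(indices); // leave the caller's list untouched
        List<String> plan = new ArrayList<>();
        int tmpCounter = 0;
        while (queue.size() > 1) {
            boolean lastMerge = queue.size() == 2;
            String[] first = queue.removeFirst();
            String[] second = queue.removeFirst();
            // assumption: only the final merge writes to the target prefix
            String outPrefix = lastMerge ? mprefix : mprefix + "_tmp" + (++tmpCounter);
            String[] out = {mpath, outPrefix};
            plan.add(first[1] + " + " + second[1] + " -> " + out[1]);
            queue.addLast(out); // the merged index becomes a source for a later merge
        }
        return plan;
    }

    public static void main(String[] args) {
        LinkedList<String[]> indices = new LinkedList<>();
        for (String prefix : new String[] {"a", "b", "c"}) {
            indices.add(new String[] {"/tmp/index", prefix});
        }
        // prints "a + b -> all_tmp1" and then "c + all_tmp1 -> all"
        for (String step : planMerges("/tmp/index", "all", indices)) {
            System.out.println(step);
        }
    }
}
```

Merging n indices this way always takes n - 1 two-way merges, each of which would correspond to one call such as mergeTwoIndices in a real implementation.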

finishedDirectIndexBuild

protected void finishedDirectIndexBuild()
event method to be overridden by child classes


finishedInvertedIndexBuild

protected void finishedInvertedIndexBuild()
event method to be overridden by child classes


useFieldInformation

public boolean useFieldInformation()
Returns whether the index will record fields


indexEmpty

protected void indexEmpty(java.util.Map<java.lang.String,java.lang.String> docProperties)
                   throws java.io.IOException
Adds an entry to the document index for an empty document, only if IndexEmptyDocuments is set to true.

Throws:
java.io.IOException

main

public static void main(java.lang.String[] args)
                 throws java.lang.Exception
Utility method for merging indices

Throws:
java.lang.Exception


Terrier 3.5. Copyright © 2004-2011 University of Glasgow