java.lang.Object
- org.terrier.structures.indexing.Indexer

Direct Known Subclasses:

BasicIndexer, BlockIndexer
```
public abstract class Indexer
extends java.lang.Object
```
Properties:
- termpipelines - the sequence of TermPipeline stages (e.g. Stopwords removal and PorterStemmer).
- termpipelines.skip - a list of tokens which should bypass the term pipeline. If not set or empty, then no tokens will bypass it.
- indexing.max.tokens - The maximum number of tokens the indexer will attempt to index in a document. If 0, then all tokens will be indexed (default).
- ignore.empty.documents - Assign docids to empty documents. Default true.
- indexing.max.docs.per.builder - Maximum number of documents in an index before a new index is created, and merged later.
- indexing.builder.boundary.docnos - Docnos of documents that force the index being created to be completed, and a new index to be commenced. An alternative to indexing.max.docs.per.builder
- indexer.meta.forward.keys - comma delimited list of Document properties to index as document metadata in the MetaIndex. Defaults to "docno", which permits docid->docno lookups. Examples are "docno,url" or "docno,url,content".
- indexer.meta.forward.keylens - comma delimited list of the length of the values to record in the MetaIndex. Defaults to 20.
- indexer.meta.reverse.keys - comma delimited list of Document properties to permit lookups for (i.e. docno->docid). Defaults to empty (none are enabled).
- indexer.meta.builder - name of the class to build the MetaIndex. Defaults to ZstdMetaIndexBuilder, which uses zstandard compression.
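
As a rough illustration of how several of these properties fit together, the sketch below collects them in a plain `java.util.Properties` object. The property names come from the list above; every value (the 5 million document limit, the key lengths 26 and 256) and the class name `IndexerPropertiesSketch` are assumptions chosen for the example. It also assumes that `indexer.meta.forward.keylens` lists one length per forward key in the same order. How the properties reach Terrier is not shown.

```java
import java.util.Properties;

class IndexerPropertiesSketch {
    /** Builds an example configuration; every value here is illustrative. */
    static Properties exampleProperties() {
        Properties p = new Properties();
        // Term pipeline: remove stopwords, then apply Porter stemming.
        p.setProperty("termpipelines", "Stopwords,PorterStemmer");
        // 0 means no limit on the tokens indexed per document.
        p.setProperty("indexing.max.tokens", "0");
        // Start a new partial index every 5 million documents (assumed value).
        p.setProperty("indexing.max.docs.per.builder", "5000000");
        // Record docno and url for each document (docid -> value lookups).
        p.setProperty("indexer.meta.forward.keys", "docno,url");
        // Assumed: one length per forward key, in the same order.
        p.setProperty("indexer.meta.forward.keylens", "26,256");
        // Also permit docno -> docid lookups.
        p.setProperty("indexer.meta.reverse.keys", "docno");
        return p;
    }

    public static void main(String[] args) {
        exampleProperties().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Keeping the forward keys and their lengths next to each other makes it easier to check that both lists have the same number of entries.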
Author:

Craig Macdonald

Field Summary

Fields
Modifier and Type	Field	Description
`protected boolean`	`blocks`	is block indexing
`protected java.util.HashSet<java.lang.String>`	`BUILDER_BOUNDARY_DOCUMENTS`	The DOCNO of documents to force builder boundaries
`protected IndexOnDisk`	`currentIndex`	The index being worked on, denoted by path and prefix
`protected AbstractPostingOutputStream`	`directIndexBuilder`	The builder that creates the direct index.
`protected DocumentIndexBuilder`	`docIndexBuilder`	The builder that creates the document index.
`protected int`	`emptyDocCount`
`protected DocumentIndexEntry`	`emptyDocIndexEntry`
`protected int`	`externalParalllism`	how many instances are being used by the code calling this class in parallel
`protected gnu.trove.TObjectIntHashMap<java.lang.String>`	`fieldNames`	mapping: field name -> field id, returns 0 for no mapping
`protected java.lang.String`	`fileNameNoExtension`	The common prefix of the data structures filenames.
`protected boolean`	`IndexEmptyDocuments`	Indicates whether an entry for empty documents is stored in the document index, or empty documents should be ignored.
`protected InvertedIndexBuilder`	`invertedIndexBuilder`	The builder that creates the inverted index.
`protected LexiconBuilder`	`lexiconBuilder`	The builder that creates the lexicon.
`protected static org.slf4j.Logger`	`logger`	the logger for this class
`protected int`	`MAX_DOCS_PER_BUILDER`	The number of documents indexed with a set of builders.
`protected int`	`MAX_TOKENS_IN_DOCUMENT`	The maximum number of tokens in a document.
`protected MetaIndexBuilder`	`metaBuilder`
`protected int`	`numFields`	the number of fields
`protected java.lang.String`	`path`	The path in which the data structures are stored.
`protected TermPipeline`	`pipeline_first`	The first component of the term pipeline.
`protected java.lang.String`	`prefix`	The prefix of the data structures, ie the first part of the filename
`protected boolean`	`useFieldInformation`	Indicates whether field information should be saved in the created data structures.

Constructor Summary

Constructors
Modifier	Constructor	Description
	`Indexer()`	Creates an indexer at the location ApplicationSetup.TERRIER_INDEX_PATH and ApplicationSetup.TERRIER_INDEX_PREFIX
`protected`	`Indexer(long a, long b, long c)`	Protected do-nothing constructor for use by child classes
	`Indexer(java.lang.String _path, java.lang.String _prefix)`	Creates an instance of the class.

Method Summary

Modifier and Type	Method	Description
`abstract void`	`createDirectIndex(Collection[] collections)`	An abstract method for creating the direct index, the document index and the lexicon for the given collections.
`abstract void`	`createInvertedIndex()`	An abstract method for creating the inverted index, given that the direct index, the document index and the lexicon have already been created.
`protected MetaIndexBuilder`	`createMetaIndexBuilder()`
`protected void`	`finishedDirectIndexBuild()`	event method to be overridden by child classes
`protected void`	`finishedInvertedIndexBuild()`	event method to be overridden by child classes
`protected abstract TermPipeline`	`getEndOfPipeline()`	An abstract method that returns the last component of the term pipeline.
`int`	`getExternalParalllism()`	how many indexers are running in this and other threads?
`void`	`index(Collection[] collections)`	Creates the data structures for a set of collections.
`protected void`	`indexEmpty(java.util.Map<java.lang.String,java.lang.String> docProperties)`	Adds an entry to the document index for an empty document, only if IndexEmptyDocuments is set to true.
`protected void`	`init()`	This method must be called by anything which directly extends Indexer.
`protected void`	`load_builder_boundary_documents()`	Loads the builder boundary documents from the property `indexing.builder.boundary.docnos`, comma delimited.
`protected void`	`load_field_ids()`	loads a mapping of field name -> field id
`protected void`	`load_indexer_properties()`
`protected void`	`load_pipeline()`	Creates the term pipeline, as specified by the property `termpipelines` in the properties file.
`static void`	`main(java.lang.String[] args)`	Utility method for merging indices
`static void`	`merge(java.lang.String mpath, java.lang.String mprefix, int lowest, int highest, boolean blocks)`	Merge a series of numbered indices in the same path/prefix area.
`static void`	`merge(java.lang.String mpath, java.lang.String mprefix, java.util.LinkedList<java.lang.String[]> llist, int counterMerged, boolean blocks)`	Merge a series of indices, in pair-wise fashion
`protected static void`	`mergeTwoIndices(java.lang.String[] index1, java.lang.String[] index2, java.lang.String[] outputIndex, boolean blocks)`	Merge two indices.
`protected static int[]`	`parseInts(java.lang.String[] in)`
`void`	`setExternalParalllism(int externalParalllism)`	set how many indexers are running in this and other threads?
`boolean`	`useFieldInformation()`	Returns whether the index will record fields

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - logger
```
protected static final org.slf4j.Logger logger
```
    the logger for this class
  - MAX_DOCS_PER_BUILDER
```
protected int MAX_DOCS_PER_BUILDER
```
    The number of documents indexed with a set of builders. If a collection consists of more documents, then we need to create new builders and later merge the data structures. The corresponding property is indexing.max.docs.per.builder and the default value is 18000000 (18 million documents). If the property is set equal to zero, then there is no limit.
  - MAX_TOKENS_IN_DOCUMENT
```
protected int MAX_TOKENS_IN_DOCUMENT
```
    The maximum number of tokens in a document. If it is set to zero, then there is no limit in the number of tokens indexed for a document. Set by property indexing.max.tokens.
  - BUILDER_BOUNDARY_DOCUMENTS
```
protected final java.util.HashSet<java.lang.String> BUILDER_BOUNDARY_DOCUMENTS
```
    The DOCNO of documents to force builder boundaries
  - useFieldInformation
```
protected boolean useFieldInformation
```
    Indicates whether field information should be saved in the created data structures.
  - pipeline_first
```
protected TermPipeline pipeline_first
```
    The first component of the term pipeline.
  - IndexEmptyDocuments
```
protected boolean IndexEmptyDocuments
```
    Indicates whether an entry for empty documents is stored in the document index, or empty documents should be ignored.
  - emptyDocCount
```
protected int emptyDocCount
```
  - directIndexBuilder
```
protected AbstractPostingOutputStream directIndexBuilder
```
    The builder that creates the direct index.
  - docIndexBuilder
```
protected DocumentIndexBuilder docIndexBuilder
```
    The builder that creates the document index.
  - invertedIndexBuilder
```
protected InvertedIndexBuilder invertedIndexBuilder
```
    The builder that creates the inverted index.
  - lexiconBuilder
```
protected LexiconBuilder lexiconBuilder
```
    The builder that creates the lexicon.
  - metaBuilder
```
protected MetaIndexBuilder metaBuilder
```
  - fileNameNoExtension
```
protected java.lang.String fileNameNoExtension
```
    The common prefix of the data structures filenames.
  - path
```
protected java.lang.String path
```
    The path in which the data structures are stored.
  - prefix
```
protected java.lang.String prefix
```
    The prefix of the data structures, ie the first part of the filename
  - currentIndex
```
protected IndexOnDisk currentIndex
```
    The index being worked on, denoted by path and prefix
  - fieldNames
```
protected gnu.trove.TObjectIntHashMap<java.lang.String> fieldNames
```
    mapping: field name -> field id, returns 0 for no mapping
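    The "0 for no mapping" convention can be sketched with a plain `java.util.Map` standing in for the Trove map. This is a minimal illustration, not Terrier's code: the class `FieldIdSketch`, the field names `TITLE` and `BODY`, and the assumption that ids are assigned from 1 in configuration order are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FieldIdSketch {
    /** Assigns ids 1..n to field names in order, so that 0 can mean "no mapping". */
    static Map<String, Integer> assignIds(String... names) {
        Map<String, Integer> ids = new LinkedHashMap<>();
        for (String name : names) {
            // A repeated name keeps the id it was given first.
            ids.putIfAbsent(name, ids.size() + 1);
        }
        return ids;
    }

    /** Returns the field id, or 0 when the name is unknown. */
    static int lookup(Map<String, Integer> ids, String name) {
        return ids.getOrDefault(name, 0);
    }

    public static void main(String[] args) {
        Map<String, Integer> ids = assignIds("TITLE", "BODY");
        System.out.println(lookup(ids, "TITLE")); // prints 1
        System.out.println(lookup(ids, "H1"));    // prints 0
    }
}
```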
  - numFields
```
protected int numFields
```
    the number of fields
  - blocks
```
protected boolean blocks
```
    is block indexing
  - externalParalllism
```
protected int externalParalllism
```
    how many instances are being used by the code calling this class in parallel
  - emptyDocIndexEntry
```
protected DocumentIndexEntry emptyDocIndexEntry
```
- Constructor Detail
  - Indexer
```
public Indexer()
```
    Creates an indexer at the location ApplicationSetup.TERRIER_INDEX_PATH and ApplicationSetup.TERRIER_INDEX_PREFIX
  - Indexer
```
public Indexer(java.lang.String _path,
               java.lang.String _prefix)
```
    Creates an instance of the class. The generated data structures will be saved in the given path. The filename of the data structures is given by the prefix parameter.
    
    Parameters:
    
    _path - String the path where the generated data structures will be saved.
    
    _prefix - String the filename that the data structures will have.
  - Indexer
```
protected Indexer(long a,
                  long b,
                  long c)
```
    Protected do-nothing constructor for use by child classes
- Method Detail
  - init
```
protected void init()
```
    This method must be called by anything which directly extends Indexer. See: http://benpryor.com/blog/2008/01/02/dont-call-subclass-methods-from-a-superclass-constructor/
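    The hazard described at the linked page, and the reason for a separate `init()` call, can be reproduced with a small self-contained sketch. The classes below (`UnsafeBase`, `SafeBase` and their children) are hypothetical and are not part of Terrier; they only show the order in which Java runs constructors and field initialisers.

```java
class InitPatternSketch {
    /** Anti-pattern: the constructor calls an overridable method. */
    static abstract class UnsafeBase {
        final String seen;
        UnsafeBase() { seen = describe(); } // runs before subclass fields are assigned
        abstract String describe();
    }

    static class UnsafeChild extends UnsafeBase {
        private String name = "child"; // assigned only after the super constructor returns
        @Override String describe() { return "name=" + name; }
    }

    /** Two-phase pattern: the constructor does nothing overridable. */
    static abstract class SafeBase {
        String seen;
        SafeBase() { }
        protected void init() { seen = describe(); }
        abstract String describe();
    }

    static class SafeChild extends SafeBase {
        private String name = "child";
        SafeChild() { init(); } // field initialisers have already run at this point
        @Override String describe() { return "name=" + name; }
    }

    public static void main(String[] args) {
        System.out.println(new UnsafeChild().seen); // prints name=null
        System.out.println(new SafeChild().seen);   // prints name=child
    }
}
```

    In the safe variant the subclass constructor calls `init()` itself, which mirrors the requirement stated above for classes that directly extend Indexer.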
  - createDirectIndex
```
public abstract void createDirectIndex(Collection[] collections)
```
    An abstract method for creating the direct index, the document index and the lexicon for the given collections.
    
    Parameters:
    
    collections - Collection[] An array of collections to index
  - createInvertedIndex
```
public abstract void createInvertedIndex()
```
    An abstract method for creating the inverted index, given that the direct index, the document index and the lexicon have already been created.
  - getEndOfPipeline
```
protected abstract TermPipeline getEndOfPipeline()
```
    An abstract method that returns the last component of the term pipeline.
    
    Returns:
    
    TermPipeline the end of the term pipeline.
  - getExternalParalllism
```
public int getExternalParalllism()
```
    how many indexers are running in this and other threads?
  - setExternalParalllism
```
public void setExternalParalllism(int externalParalllism)
```
    set how many indexers are running in this and other threads?
  - createMetaIndexBuilder
```
protected MetaIndexBuilder createMetaIndexBuilder()
```
  - parseInts
```
protected static final int[] parseInts(java.lang.String[] in)
```
  - load_indexer_properties
```
protected void load_indexer_properties()
```
  - load_field_ids
```
protected void load_field_ids()
```
    loads a mapping of field name -> field id
  - load_pipeline
```
protected void load_pipeline()
```
    Creates the term pipeline, as specified by the property termpipelines in the properties file. The default value of the property termpipelines is Stopwords,PorterStemmer. This means that we first remove stopwords and then apply Porter's stemming algorithm.
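    The chaining idea (each stage either drops a term or hands it to the next stage, with a final stage that receives the surviving terms) can be sketched without Terrier on the classpath. Everything below is hypothetical: the `Stage` interface, the two stage classes, and the stopword list are invented for the example, and `ToyStemStage` only strips a trailing "s"; it is not Porter's algorithm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class TermPipelineSketch {
    /** Hypothetical stage interface: receives a term and may pass it on. */
    interface Stage { void processTerm(String term); }

    /** Drops stopwords and forwards every other term. */
    static class StopwordStage implements Stage {
        private final Set<String> stopwords;
        private final Stage next;
        StopwordStage(Set<String> stopwords, Stage next) {
            this.stopwords = stopwords;
            this.next = next;
        }
        public void processTerm(String term) {
            if (!stopwords.contains(term)) next.processTerm(term);
        }
    }

    /** Toy stand-in for a stemmer: strips one trailing "s" from longer terms. */
    static class ToyStemStage implements Stage {
        private final Stage next;
        ToyStemStage(Stage next) { this.next = next; }
        public void processTerm(String term) {
            boolean plural = term.length() > 3 && term.endsWith("s");
            next.processTerm(plural ? term.substring(0, term.length() - 1) : term);
        }
    }

    /** Sends tokens through stopword removal, then the toy stemmer, and collects the result. */
    static List<String> run(List<String> tokens) {
        List<String> out = new ArrayList<>();
        Stage end = term -> out.add(term); // end of the pipeline: collects terms
        Stage first = new StopwordStage(
                new HashSet<>(Arrays.asList("the", "of")), new ToyStemStage(end));
        for (String token : tokens) first.processTerm(token);
        return out;
    }

    public static void main(String[] args) {
        // prints [indexing, document]
        System.out.println(run(Arrays.asList("the", "indexing", "of", "documents")));
    }
}
```

    In this sketch `first` plays the role of `pipeline_first`, and `end` plays the role of the stage returned by `getEndOfPipeline()`.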
  - load_builder_boundary_documents
```
protected void load_builder_boundary_documents()
```
    Loads the builder boundary documents from the property indexing.builder.boundary.docnos, comma delimited.
  - index
```
public void index(Collection[] collections)
```
    Creates the data structures for a set of collections. If the value of the property indexing.max.docs.per.builder is greater than zero, it creates a separate set of data structures for every indexing.max.docs.per.builder documents, and then it merges the generated data structures.
    
    Parameters:
    
    collections - The document collection objects to index.
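    The batching rule can be illustrated with a short sketch that cuts a list of docnos into groups of at most `maxDocsPerBuilder`, treating zero as "no limit". The class `BuilderBatchSketch` and its method are hypothetical; the sketch works on plain strings rather than Collection objects, and it does not model the boundary docnos set by indexing.builder.boundary.docnos or the merge step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BuilderBatchSketch {
    /** Splits docnos into batches of at most maxDocsPerBuilder; 0 means a single batch. */
    static List<List<String>> batches(List<String> docnos, int maxDocsPerBuilder) {
        List<List<String>> result = new ArrayList<>();
        if (maxDocsPerBuilder <= 0) {
            result.add(new ArrayList<>(docnos));
            return result;
        }
        for (int start = 0; start < docnos.size(); start += maxDocsPerBuilder) {
            int end = Math.min(start + maxDocsPerBuilder, docnos.size());
            result.add(new ArrayList<>(docnos.subList(start, end)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> docnos = Arrays.asList("d1", "d2", "d3", "d4", "d5");
        System.out.println(batches(docnos, 2)); // prints [[d1, d2], [d3, d4], [d5]]
        System.out.println(batches(docnos, 0)); // prints [[d1, d2, d3, d4, d5]]
    }
}
```

    Each batch in the result would correspond to one partial index that is later merged with the others.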
  - merge
```
public static void merge(java.lang.String mpath,
                         java.lang.String mprefix,
                         int lowest,
                         int highest,
                         boolean blocks)
```
    Merge a series of numbered indices in the same path/prefix area. New merged index will be stored at mpath/mprefix_highest+1.
    
    Parameters:
    
    mpath - Path of all indices
    
    mprefix - Common prefix of all indices
    
    lowest - lowest suffix of prefix
    
    highest - highest suffix of prefix
  - mergeTwoIndices
```
protected static void mergeTwoIndices(java.lang.String[] index1,
                                      java.lang.String[] index2,
                                      java.lang.String[] outputIndex,
                                      boolean blocks)
```
    Merge two indices.
    
    Parameters:
    
    index1 - Path/Prefix of source index 1
    
    index2 - Path/Prefix of source index 2
    
    outputIndex - Path/Prefix of destination index
    
    blocks - TODO
  - merge
```
public static void merge(java.lang.String mpath,
                         java.lang.String mprefix,
                         java.util.LinkedList<java.lang.String[]> llist,
                         int counterMerged,
                         boolean blocks)
```
    Merge a series of indices, in pair-wise fashion
    
    Parameters:
    
    mpath - Common path of all indices
    
    mprefix - Prefix of target index
    
    counterMerged - number of indices to merge
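    One way to picture pair-wise merging is as a queue: take the first two indices, merge them into a new one, append the result, and repeat until a single index remains. The sketch below only plans the merges and prints them; it performs no index I/O. The class `PairwiseMergeSketch`, the parameter `nextSuffix`, and the `prefix_N` naming of intermediate outputs are assumptions made for the illustration, not a statement of how Terrier names its files.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class PairwiseMergeSketch {
    /**
     * Plans pair-wise merges over a queue of {path, prefix} entries.
     * The queue is consumed; on return it holds only the final index.
     */
    static List<String> plan(String mpath, String mprefix,
                             LinkedList<String[]> indices, int nextSuffix) {
        List<String> steps = new ArrayList<>();
        while (indices.size() > 1) {
            String[] a = indices.removeFirst();
            String[] b = indices.removeFirst();
            // Assumed naming for the merged output: prefix_N.
            String[] out = { mpath, mprefix + "_" + nextSuffix++ };
            steps.add(a[1] + " + " + b[1] + " -> " + out[1]);
            indices.addLast(out);
        }
        return steps;
    }

    public static void main(String[] args) {
        LinkedList<String[]> queue = new LinkedList<>();
        queue.add(new String[] { "/idx", "data_1" });
        queue.add(new String[] { "/idx", "data_2" });
        queue.add(new String[] { "/idx", "data_3" });
        // prints "data_1 + data_2 -> data_4" then "data_3 + data_4 -> data_5"
        plan("/idx", "data", queue, 4).forEach(System.out::println);
    }
}
```

    With n input indices the loop performs n - 1 merges, since each pass removes two entries and adds one.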
  - finishedDirectIndexBuild
```
protected void finishedDirectIndexBuild()
```
    event method to be overridden by child classes
  - finishedInvertedIndexBuild
```
protected void finishedInvertedIndexBuild()
```
    event method to be overridden by child classes
  - useFieldInformation
```
public boolean useFieldInformation()
```
    Returns whether the index will record fields
  - indexEmpty
```
protected void indexEmpty(java.util.Map<java.lang.String,java.lang.String> docProperties)
                   throws java.io.IOException
```
    Adds an entry to the document index for an empty document, only if IndexEmptyDocuments is set to true.
    
    Throws:
    
    java.io.IOException
  - main
```
public static void main(java.lang.String[] args)
                 throws java.lang.Exception
```
    Utility method for merging indices
    
    Throws:
    
    java.lang.Exception
