Package org.terrier.structures.indexing
Class Indexer
- java.lang.Object
-
- org.terrier.structures.indexing.Indexer
-
- Direct Known Subclasses: BasicIndexer, BlockIndexer
public abstract class Indexer extends java.lang.Object
Properties:
- termpipelines - the sequence of TermPipeline stages (e.g. Stopwords removal and PorterStemmer).
- termpipelines.skip - a list of tokens which should be skipped from the term pipeline. If not set or empty, then none will be skipped.
- indexing.max.tokens - The maximum number of tokens the indexer will attempt to index in a document. If 0, then all tokens will be indexed (default).
- ignore.empty.documents - Assign empty documents with docids. Default true
- indexing.max.docs.per.builder - Maximum number of documents in an index before a new index is created, and merged later.
- indexing.builder.boundary.docnos - Docnos of documents that force the index being created to be completed, and a new index to be commenced. An alternative to indexing.max.docs.per.builder
- indexer.meta.forward.keys - comma delimited list of Document properties to index as document metadata in the MetaIndex. Defaults to "docno", which permits docid->docno lookups. Examples are "docno,url" or "docno,url,content"
- indexer.meta.forward.keylens - comma delimited list of the length of the values to record in the MetaIndex. Defaults to 20.
- indexer.meta.reverse.keys - comma delimited list of Document properties to permit lookups for (i.e. docno->docid). Defaults to empty (none are enabled).
- indexer.meta.builder - name of the class to build the MetaIndex. Defaults to ZstdMetaIndexBuilder, which uses zstandard compression.
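As an illustration, the properties listed above could be collected in a plain java.util.Properties object. This is a minimal sketch only: the class name IndexerPropertiesSketch and the chosen values (for instance a key length of 140 for url) are assumptions, and how the properties are handed to Terrier is not shown.

```java
import java.util.Properties;
import java.util.TreeSet;

final class IndexerPropertiesSketch {
    /** Builds an example set of indexing properties; the values are illustrative only. */
    static Properties exampleProperties() {
        Properties p = new Properties();
        p.setProperty("termpipelines", "Stopwords,PorterStemmer");
        p.setProperty("indexing.max.tokens", "0");                 // 0: index all tokens
        p.setProperty("indexing.max.docs.per.builder", "18000000");
        p.setProperty("indexer.meta.forward.keys", "docno,url");
        p.setProperty("indexer.meta.forward.keylens", "20,140");   // assumed lengths, one per key
        p.setProperty("indexer.meta.reverse.keys", "docno");       // enables docno->docid lookups
        return p;
    }

    public static void main(String[] args) {
        Properties p = exampleProperties();
        // Print in sorted key order so the output is stable.
        for (String key : new TreeSet<>(p.stringPropertyNames())) {
            System.out.println(key + "=" + p.getProperty(key));
        }
    }
}
```

Note that indexer.meta.forward.keys and indexer.meta.forward.keylens are parallel lists, so this sketch gives them the same number of entries.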
- Author:
- Craig Macdonald
-
-
Field Summary
Fields (modifier and type, field, description):
- protected boolean blocks - is block indexing
- protected java.util.HashSet<java.lang.String> BUILDER_BOUNDARY_DOCUMENTS - The DOCNO of documents to force builder boundaries
- protected IndexOnDisk currentIndex - The index being worked on, denoted by path and prefix
- protected AbstractPostingOutputStream directIndexBuilder - The builder that creates the direct index.
- protected DocumentIndexBuilder docIndexBuilder - The builder that creates the document index.
- protected int emptyDocCount
- protected DocumentIndexEntry emptyDocIndexEntry
- protected int externalParalllism - how many instances are being used by the code calling this class in parallel
- protected gnu.trove.TObjectIntHashMap<java.lang.String> fieldNames - mapping: field name -> field id, returns 0 for no mapping
- protected java.lang.String fileNameNoExtension - The common prefix of the data structures filenames.
- protected boolean IndexEmptyDocuments - Indicates whether an entry for empty documents is stored in the document index, or empty documents should be ignored.
- protected InvertedIndexBuilder invertedIndexBuilder - The builder that creates the inverted index.
- protected LexiconBuilder lexiconBuilder - The builder that creates the lexicon.
- protected static org.slf4j.Logger logger - the logger for this class
- protected int MAX_DOCS_PER_BUILDER - The number of documents indexed with a set of builders.
- protected int MAX_TOKENS_IN_DOCUMENT - The maximum number of tokens in a document.
- protected MetaIndexBuilder metaBuilder
- protected int numFields - the number of fields
- protected java.lang.String path - The path in which the data structures are stored.
- protected TermPipeline pipeline_first - The first component of the term pipeline.
- protected java.lang.String prefix - The prefix of the data structures, i.e. the first part of the filename
- protected boolean useFieldInformation - Indicates whether field information should be saved in the created data structures.
-
Constructor Summary
Constructors (modifier, constructor, description):
- Indexer() - Creates an indexer at the location ApplicationSetup.TERRIER_INDEX_PATH and ApplicationSetup.TERRIER_INDEX_PREFIX
- protected Indexer(long a, long b, long c) - Protected do-nothing constructor for use by child classes
- Indexer(java.lang.String _path, java.lang.String _prefix) - Creates an instance of the class.
-
Method Summary
Methods (modifier and type, method, description):
- abstract void createDirectIndex(Collection[] collections) - An abstract method for creating the direct index, the document index and the lexicon for the given collections.
- abstract void createInvertedIndex() - An abstract method for creating the inverted index, given that the direct index, the document index and the lexicon have already been created.
- protected MetaIndexBuilder createMetaIndexBuilder()
- protected void finishedDirectIndexBuild() - event method to be overridden by child classes
- protected void finishedInvertedIndexBuild() - event method to be overridden by child classes
- protected abstract TermPipeline getEndOfPipeline() - An abstract method that returns the last component of the term pipeline.
- int getExternalParalllism() - Returns how many indexers are running in this and other threads.
- void index(Collection[] collections) - Creates the data structures for a set of collections.
- protected void indexEmpty(java.util.Map<java.lang.String,java.lang.String> docProperties) - Adds an entry to the document index for an empty document, only if IndexEmptyDocuments is set to true.
- protected void init() - This method must be called by anything which directly extends Indexer.
- protected void load_builder_boundary_documents() - Loads the builder boundary documents from the property indexing.builder.boundary.docnos, comma delimited.
- protected void load_field_ids() - loads a mapping of field name -> field id
- protected void load_indexer_properties()
- protected void load_pipeline() - Creates the term pipeline, as specified by the property termpipelines in the properties file.
- static void main(java.lang.String[] args) - Utility method for merging indices
- static void merge(java.lang.String mpath, java.lang.String mprefix, int lowest, int highest, boolean blocks) - Merge a series of numbered indices in the same path/prefix area.
- static void merge(java.lang.String mpath, java.lang.String mprefix, java.util.LinkedList<java.lang.String[]> llist, int counterMerged, boolean blocks) - Merge a series of indices, in pair-wise fashion
- protected static void mergeTwoIndices(java.lang.String[] index1, java.lang.String[] index2, java.lang.String[] outputIndex, boolean blocks) - Merge two indices.
- protected static int[] parseInts(java.lang.String[] in)
- void setExternalParalllism(int externalParalllism) - Sets how many indexers are running in this and other threads.
- boolean useFieldInformation() - Returns whether the index will record fields
-
-
-
Field Detail
-
logger
protected static final org.slf4j.Logger logger
the logger for this class
-
MAX_DOCS_PER_BUILDER
protected int MAX_DOCS_PER_BUILDER
The number of documents indexed with a set of builders. If a collection consists of more documents, then we need to create new builders and later merge the data structures. The corresponding property is indexing.max.docs.per.builder and the default value is 18000000 (18 million documents). If the property is set equal to zero, then there is no limit.
-
MAX_TOKENS_IN_DOCUMENT
protected int MAX_TOKENS_IN_DOCUMENT
The maximum number of tokens in a document. If it is set to zero, then there is no limit in the number of tokens indexed for a document. Set by property indexing.max.tokens.
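The effect of a token limit can be sketched as follows. This is an illustration under stated assumptions, not the indexer's code: the class MaxTokensSketch and the method limitTokens are hypothetical, and treating negative values like zero is an assumption.

```java
import java.util.Arrays;
import java.util.List;

final class MaxTokensSketch {
    /**
     * Returns the tokens that would be indexed under the given limit.
     * A limit of zero means "no limit"; negative values are treated the same way (assumption).
     */
    static List<String> limitTokens(List<String> tokens, int maxTokens) {
        if (maxTokens <= 0 || tokens.size() <= maxTokens) {
            return tokens;
        }
        // Only the first maxTokens tokens of the document are kept.
        return tokens.subList(0, maxTokens);
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("a", "b", "c");
        System.out.println(limitTokens(doc, 2));
        System.out.println(limitTokens(doc, 0));
    }
}
```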
-
BUILDER_BOUNDARY_DOCUMENTS
protected final java.util.HashSet<java.lang.String> BUILDER_BOUNDARY_DOCUMENTS
The DOCNO of documents to force builder boundaries
-
useFieldInformation
protected boolean useFieldInformation
Indicates whether field information should be saved in the created data structures.
-
pipeline_first
protected TermPipeline pipeline_first
The first component of the term pipeline.
-
IndexEmptyDocuments
protected boolean IndexEmptyDocuments
Indicates whether an entry for empty documents is stored in the document index, or empty documents should be ignored.
-
emptyDocCount
protected int emptyDocCount
-
directIndexBuilder
protected AbstractPostingOutputStream directIndexBuilder
The builder that creates the direct index.
-
docIndexBuilder
protected DocumentIndexBuilder docIndexBuilder
The builder that creates the document index.
-
invertedIndexBuilder
protected InvertedIndexBuilder invertedIndexBuilder
The builder that creates the inverted index.
-
lexiconBuilder
protected LexiconBuilder lexiconBuilder
The builder that creates the lexicon.
-
metaBuilder
protected MetaIndexBuilder metaBuilder
-
fileNameNoExtension
protected java.lang.String fileNameNoExtension
The common prefix of the data structures filenames.
-
path
protected java.lang.String path
The path in which the data structures are stored.
-
prefix
protected java.lang.String prefix
The prefix of the data structures, i.e. the first part of the filename
-
currentIndex
protected IndexOnDisk currentIndex
The index being worked on, denoted by path and prefix
-
fieldNames
protected gnu.trove.TObjectIntHashMap<java.lang.String> fieldNames
mapping: field name -> field id, returns 0 for no mapping
-
numFields
protected int numFields
the number of fields
-
blocks
protected boolean blocks
is block indexing
-
externalParalllism
protected int externalParalllism
how many instances are being used by the code calling this class in parallel
-
emptyDocIndexEntry
protected DocumentIndexEntry emptyDocIndexEntry
-
-
Constructor Detail
-
Indexer
public Indexer()
Creates an indexer at the location ApplicationSetup.TERRIER_INDEX_PATH and ApplicationSetup.TERRIER_INDEX_PREFIX
-
Indexer
public Indexer(java.lang.String _path, java.lang.String _prefix)
Creates an instance of the class. The generated data structures will be saved in the given path. The filename of the data is given by the prefix parameter.
- Parameters:
_path - String the path where the generated data structures will be saved.
_prefix - String the filename that the data structures will have.
-
Indexer
protected Indexer(long a, long b, long c)
Protected do-nothing constructor for use by child classes
-
-
Method Detail
-
init
protected void init()
This method must be called by anything which directly extends Indexer. See: http://benpryor.com/blog/2008/01/02/dont-call-subclass-methods-from-a-superclass-constructor/
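The problem that a separate init() call avoids can be shown with a small, self-contained sketch. All class names here (BadBase, BadChild, GoodBase, GoodChild) are hypothetical and do not come from Terrier; the sketch only demonstrates the general Java behaviour described in the linked article, where a superclass constructor that calls an overridable method runs before the subclass fields are initialised.

```java
import java.util.ArrayList;
import java.util.List;

final class InitOrderSketch {
    /** Anti-pattern: the constructor calls a method that subclasses override. */
    static class BadBase {
        final String described;
        BadBase() {
            described = describe(); // runs before subclass field initialisers
        }
        String describe() { return "base"; }
    }

    static class BadChild extends BadBase {
        private final List<String> stages = new ArrayList<>();
        @Override String describe() { return "stages=" + stages; } // stages is still null here
    }

    /** Alternative: a do-nothing constructor plus an explicit init() called by the child. */
    static class GoodBase {
        String described;
        GoodBase() { }
        protected void init() { described = describe(); }
        String describe() { return "base"; }
    }

    static class GoodChild extends GoodBase {
        private final List<String> stages = new ArrayList<>();
        GoodChild() {
            stages.add("Stopwords");
            init(); // subclass state is ready at this point
        }
        @Override String describe() { return "stages=" + stages; }
    }

    public static void main(String[] args) {
        System.out.println(new BadChild().described);  // prints stages=null
        System.out.println(new GoodChild().described); // prints stages=[Stopwords]
    }
}
```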
-
createDirectIndex
public abstract void createDirectIndex(Collection[] collections)
An abstract method for creating the direct index, the document index and the lexicon for the given collections.
- Parameters:
collections - Collection[] An array of collections to index
-
createInvertedIndex
public abstract void createInvertedIndex()
An abstract method for creating the inverted index, given that the direct index, the document index and the lexicon have already been created.
-
getEndOfPipeline
protected abstract TermPipeline getEndOfPipeline()
An abstract method that returns the last component of the term pipeline.
- Returns:
TermPipeline the end of the term pipeline.
-
getExternalParalllism
public int getExternalParalllism()
Returns how many indexers are running in this and other threads.
-
setExternalParalllism
public void setExternalParalllism(int externalParalllism)
Sets how many indexers are running in this and other threads.
-
createMetaIndexBuilder
protected MetaIndexBuilder createMetaIndexBuilder()
-
parseInts
protected static final int[] parseInts(java.lang.String[] in)
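This method has no description. Judging by its signature alone, a helper of this shape plausibly converts an array of numeric strings (such as the split value of indexer.meta.forward.keylens) into an int array. The sketch below is an assumption about that behaviour, not the actual implementation; trimming whitespace is also an assumption.

```java
import java.util.Arrays;

final class ParseIntsSketch {
    /** Converts each string to an int; throws NumberFormatException on malformed input. */
    static int[] parseInts(String[] in) {
        int[] out = new int[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = Integer.parseInt(in[i].trim());
        }
        return out;
    }

    public static void main(String[] args) {
        // A comma delimited property value, as used for key lengths.
        int[] keyLens = parseInts("20,140".split(","));
        System.out.println(Arrays.toString(keyLens)); // prints [20, 140]
    }
}
```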
-
load_indexer_properties
protected void load_indexer_properties()
-
load_field_ids
protected void load_field_ids()
loads a mapping of field name -> field id
-
load_pipeline
protected void load_pipeline()
Creates the term pipeline, as specified by the property termpipelines in the properties file. The default value of the property termpipelines is Stopwords,PorterStemmer. This means that we first remove stopwords and then apply Porter's stemming algorithm.
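The chaining described above can be sketched with a toy pipeline. Everything in this example is hypothetical: the Stage interface stands in for a pipeline component, ToyStemmer strips a trailing "s" and is not Porter's algorithm, and Collector plays the role of the end of the pipeline. The point is only that each stage forwards surviving terms to the next, so stopwords are removed before stemming.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class TermPipelineSketch {
    /** Hypothetical stand-in for a pipeline stage: receives one term at a time. */
    interface Stage {
        void processTerm(String term);
    }

    /** Drops stopwords and forwards every other term to the next stage. */
    static final class StopwordFilter implements Stage {
        private final Set<String> stopwords;
        private final Stage next;
        StopwordFilter(Set<String> stopwords, Stage next) {
            this.stopwords = stopwords;
            this.next = next;
        }
        @Override public void processTerm(String term) {
            if (!stopwords.contains(term)) next.processTerm(term);
        }
    }

    /** Toy stemmer: strips one trailing "s" from longer terms. */
    static final class ToyStemmer implements Stage {
        private final Stage next;
        ToyStemmer(Stage next) { this.next = next; }
        @Override public void processTerm(String term) {
            boolean plural = term.length() > 3 && term.endsWith("s");
            next.processTerm(plural ? term.substring(0, term.length() - 1) : term);
        }
    }

    /** End of the pipeline: collects the terms that survive all stages. */
    static final class Collector implements Stage {
        final List<String> terms = new ArrayList<>();
        @Override public void processTerm(String term) { terms.add(term); }
    }

    static List<String> run(List<String> tokens) {
        Collector end = new Collector();
        // Built back to front, so that the first stage is the stopword filter.
        Stage first = new StopwordFilter(new HashSet<>(Arrays.asList("the", "of")), new ToyStemmer(end));
        for (String token : tokens) first.processTerm(token);
        return end.terms;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("the", "terms", "of", "documents"))); // prints [term, document]
    }
}
```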
-
load_builder_boundary_documents
protected void load_builder_boundary_documents()
Loads the builder boundary documents from the property indexing.builder.boundary.docnos, comma delimited.
-
index
public void index(Collection[] collections)
Creates the data structures for a set of collections. It creates a set of data structures for every indexing.max.docs.per.builder documents, if the value of this property is greater than zero, and then it merges the generated data structures.
- Parameters:
collections - The document collection objects to index.
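How a document stream might be split into separately built indices can be sketched as below. This is an illustration, not the indexer's code: the class BuilderBoundarySketch and the method partition are hypothetical, and the sketch assumes that a boundary document is the last document of the batch it closes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

final class BuilderBoundarySketch {
    /**
     * Splits docnos into batches. A batch is closed when it reaches maxDocsPerBuilder
     * documents (zero means no limit) or when a boundary docno has just been added.
     */
    static List<List<String>> partition(List<String> docnos, int maxDocsPerBuilder, Set<String> boundaryDocnos) {
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String docno : docnos) {
            current.add(docno);
            boolean full = maxDocsPerBuilder > 0 && current.size() >= maxDocsPerBuilder;
            boolean boundary = boundaryDocnos.contains(docno); // assumption: boundary doc ends its batch
            if (full || boundary) {
                batches.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> docnos = Arrays.asList("d1", "d2", "d3", "d4", "d5");
        System.out.println(partition(docnos, 2, Collections.emptySet()));        // prints [[d1, d2], [d3, d4], [d5]]
        System.out.println(partition(docnos, 0, Collections.singleton("d2")));   // prints [[d1, d2], [d3, d4, d5]]
    }
}
```

Each batch corresponds to one index that is built on its own and merged with the others afterwards.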
-
merge
public static void merge(java.lang.String mpath, java.lang.String mprefix, int lowest, int highest, boolean blocks)
Merge a series of numbered indices in the same path/prefix area. New merged index will be stored at mpath/mprefix_highest+1.
- Parameters:
mpath - Path of all indices
mprefix - Common prefix of all indices
lowest - lowest suffix of prefix
highest - highest suffix of prefix
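The numbering scheme can be illustrated with a short sketch. The helper names sourcePrefixes and mergedPrefix are hypothetical, and the exact "prefix_number" naming is inferred from the description above rather than taken from the implementation.

```java
import java.util.ArrayList;
import java.util.List;

final class NumberedMergeSketch {
    /** Lists the prefixes of the source indices, from lowest to highest inclusive. */
    static List<String> sourcePrefixes(String mprefix, int lowest, int highest) {
        List<String> prefixes = new ArrayList<>();
        for (int i = lowest; i <= highest; i++) {
            prefixes.add(mprefix + "_" + i);
        }
        return prefixes;
    }

    /** Prefix of the merged index: one past the highest source suffix. */
    static String mergedPrefix(String mprefix, int highest) {
        return mprefix + "_" + (highest + 1);
    }

    public static void main(String[] args) {
        System.out.println(sourcePrefixes("data", 1, 3)); // prints [data_1, data_2, data_3]
        System.out.println(mergedPrefix("data", 3));      // prints data_4
    }
}
```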
-
mergeTwoIndices
protected static void mergeTwoIndices(java.lang.String[] index1, java.lang.String[] index2, java.lang.String[] outputIndex, boolean blocks)
Merge two indices.
- Parameters:
index1 - Path/Prefix of source index 1
index2 - Path/Prefix of source index 2
outputIndex - Path/Prefix of destination index
blocks - TODO
-
merge
public static void merge(java.lang.String mpath, java.lang.String mprefix, java.util.LinkedList<java.lang.String[]> llist, int counterMerged, boolean blocks)
Merge a series of indices, in pair-wise fashion
- Parameters:
mpath - Common path of all indices
mprefix - Prefix of target index
counterMerged - number of indices to merge
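One way to merge a list of indices pair-wise is to treat the list as a queue: take two indices from the front, merge them, and append the result to the back until one index remains. The sketch below records the merge steps instead of touching any files. The class PairwiseMergeSketch, the "_merged" naming of intermediate results and the choice to give the final result the target prefix are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

final class PairwiseMergeSketch {
    /**
     * Returns a log of merge steps. The input list is consumed:
     * afterwards it holds only the name of the final merged index.
     */
    static List<String> mergePairwise(LinkedList<String> indices, String targetPrefix) {
        List<String> log = new ArrayList<>();
        int counter = 0;
        while (indices.size() > 1) {
            String first = indices.removeFirst();
            String second = indices.removeFirst();
            // The last merge writes to the target; earlier ones write to intermediates.
            String output = indices.isEmpty() ? targetPrefix : targetPrefix + "_merged" + counter++;
            log.add(first + " + " + second + " -> " + output);
            indices.addLast(output);
        }
        return log;
    }

    public static void main(String[] args) {
        LinkedList<String> indices = new LinkedList<>(Arrays.asList("data_1", "data_2", "data_3"));
        for (String step : mergePairwise(indices, "data")) {
            System.out.println(step);
        }
    }
}
```

Merging n indices this way always takes n - 1 two-way merges.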
-
finishedDirectIndexBuild
protected void finishedDirectIndexBuild()
event method to be overridden by child classes
-
finishedInvertedIndexBuild
protected void finishedInvertedIndexBuild()
event method to be overridden by child classes
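The combination of abstract build methods and overridable event methods follows the template method pattern, which can be sketched as below. ToyIndexer and LoggingIndexer are hypothetical classes, and the order in which the build steps and event methods are invoked is an assumption made for illustration; it is not taken from the Indexer source.

```java
import java.util.ArrayList;
import java.util.List;

final class IndexerLifecycleSketch {
    /** Toy base class: fixes the overall flow, leaves the steps to subclasses. */
    abstract static class ToyIndexer {
        final List<String> events = new ArrayList<>();

        abstract void createDirectIndex();
        abstract void createInvertedIndex();

        // Event methods: empty by default, subclasses may override them.
        void finishedDirectIndexBuild() { }
        void finishedInvertedIndexBuild() { }

        final void index() {
            createDirectIndex();
            finishedDirectIndexBuild();
            createInvertedIndex();
            finishedInvertedIndexBuild();
        }
    }

    /** Subclass that records each step, to make the call order visible. */
    static final class LoggingIndexer extends ToyIndexer {
        @Override void createDirectIndex() { events.add("direct"); }
        @Override void createInvertedIndex() { events.add("inverted"); }
        @Override void finishedDirectIndexBuild() { events.add("direct-done"); }
        @Override void finishedInvertedIndexBuild() { events.add("inverted-done"); }
    }

    static List<String> run() {
        LoggingIndexer indexer = new LoggingIndexer();
        indexer.index();
        return indexer.events;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [direct, direct-done, inverted, inverted-done]
    }
}
```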
-
useFieldInformation
public boolean useFieldInformation()
Returns whether the index will record fields
-
indexEmpty
protected void indexEmpty(java.util.Map<java.lang.String,java.lang.String> docProperties) throws java.io.IOException
Adds an entry to the document index for an empty document, only if IndexEmptyDocuments is set to true.
- Throws:
java.io.IOException
-
main
public static void main(java.lang.String[] args) throws java.lang.Exception
Utility method for merging indices
- Throws:
java.lang.Exception
-
-