If your data is in files (one file per document, or in XML or TREC files), you should be able to index it using one of the provided collection decoders, such as SimpleFileCollection or TRECCollection. Otherwise, in scenarios such as extracting documents to be indexed from a database, you will need to write your own collection decoder. This is done by implementing the Collection interface and setting the trec.collection.class property so that your implementation is used. MultiFileCollection is a useful base class for implementing readers for TREC-like corpora with multiple documents stored in each file. Because it can fetch HTTP URLs, SimpleFileCollection can also be used to download webpages.
A Collection implementation returns a Document object for every document in the corpus. Simple textual contents can be handled by FileDocument, while HTML documents can be handled by TaggedDocument. Otherwise, if your documents are of a non-standard format, you'll need to implement your own Document. The purpose of a Document object is to parse a document's format (e.g. Microsoft Word, HTML), and extract the text that should be indexed – optionally, you can designate the fields that contain the text to be extracted and, if configured, the indexer will note the fields where each term occurs in a document.
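To make the database scenario more concrete, the following is a minimal sketch of the iteration pattern a collection decoder follows: advance to the next document, then hand it to the indexer. The `SimpleCollection` interface, the `RowCollectionSketch` class and the `docno`/`text` keys are hypothetical stand-ins invented for this illustration; they are not Terrier's actual Collection or Document interfaces, which should be consulted in the Javadoc before writing a real decoder.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Simplified stand-in for a collection decoder; NOT Terrier's Collection interface. */
interface SimpleCollection {
    /** Advances to the next document; returns false once the corpus is exhausted. */
    boolean nextDocument();

    /** Returns the current document as a map with hypothetical "docno" and "text" keys. */
    Map<String, String> getDocument();
}

/** Serves documents from in-memory rows, standing in for rows fetched from a database. */
class RowCollectionSketch implements SimpleCollection {
    private final Iterator<String[]> rows;
    private Map<String, String> current;

    RowCollectionSketch(List<String[]> rows) {
        this.rows = rows.iterator();
    }

    @Override
    public boolean nextDocument() {
        if (!rows.hasNext()) {
            current = null;
            return false;
        }
        String[] row = rows.next();          // row[0] = identifier, row[1] = body text
        current = Map.of("docno", row[0], "text", row[1]);
        return true;
    }

    @Override
    public Map<String, String> getDocument() {
        return current;
    }
}
```

An indexer-like caller would loop with `while (collection.nextDocument())` and process `collection.getDocument()` on each iteration.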
The Document typically provides the extracted text as input to a tokeniser, which identifies the individual tokens and returns them as a stream, in their order of occurrence. For languages where tokens are naturally delimited by whitespace characters, Terrier provides English and UTF tokenisers. If your particular corpus needs more complicated tokenisation than splitting on whitespace, you can implement the Tokeniser interface to suit your needs.
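As an illustration of what "a stream of tokens in order of occurrence" means, here is a small self-contained sketch of a whitespace tokeniser. The class `WhitespaceTokeniserSketch` is hypothetical and does not implement Terrier's Tokeniser interface; the choice to strip punctuation and lowercase is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Stream;

/** Hypothetical illustration only; not Terrier's Tokeniser interface. */
class WhitespaceTokeniserSketch {

    /** Splits on whitespace, strips non-alphanumeric characters, and lowercases each token. */
    static Stream<String> tokenise(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            // Keep only letters and decimal digits (assumed cleaning rule).
            String cleaned = raw.replaceAll("[^\\p{L}\\p{Nd}]", "").toLowerCase(Locale.ROOT);
            if (!cleaned.isEmpty()) {
                tokens.add(cleaned);     // order of occurrence is preserved
            }
        }
        return tokens.stream();
    }

    public static void main(String[] args) {
        tokenise("Terrier indexes documents, quickly!").forEach(System.out::println);
    }
}
```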
As discussed in Configuring Indexing, Terrier has different indexing implementations. In the following, we describe the generic indexing infrastructure. Details on the implementation of the classical two-pass and single-pass indexing can be found at [indexer details](indexer_details.html). Details on Hadoop MapReduce indexing are described elsewhere. In-memory and incremental indices are described under real-time indexing.
Each indexer creates several data structures, and creates a differing Index implementation (summarised in the table below):
direct (PostingIndex) : a compressed file, where we store the terms contained in each document. The direct index is used for automatic query expansion. Accessed using an IterablePosting. Optionally contains position and field information.
document (DocumentIndex) : a fixed-length entry file, where we store information about documents, such as the number of indexed tokens (document length), the identifier of a document, and the offset of its corresponding entry in the direct index. The document index provides the Pointer necessary for accessing the direct index. Created by the DocumentIndexBuilder.
inverted (PostingIndex) : a compressed file, where we store the docids of the documents containing a given term. Accessed using an IterablePosting. Optionally contains position and field information.
meta (MetaIndex) : stores metadata about each document.
If a direct file is required after using an indexer that does not create one, the Inverted2DirectIndexBuilder can be used to create one.
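The relationship between the two posting structures can be sketched with plain maps: the inverted index maps each term to the documents containing it, and the direct index maps each document to the terms it contains. The code below only illustrates that inversion on in-memory data. It is not how Inverted2DirectIndexBuilder works internally, and the class name `InvertSketch` and the use of integer term ids as keys are assumptions for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical in-memory illustration of deriving a direct index from an inverted one. */
class InvertSketch {

    /**
     * @param inverted term id -> docids of the documents containing that term
     * @return docid -> term ids contained in that document, in ascending term id order
     */
    static SortedMap<Integer, List<Integer>> toDirect(SortedMap<Integer, List<Integer>> inverted) {
        SortedMap<Integer, List<Integer>> direct = new TreeMap<>();
        // Visiting terms in ascending order keeps each document's term list sorted.
        for (Map.Entry<Integer, List<Integer>> entry : inverted.entrySet()) {
            for (int docid : entry.getValue()) {
                direct.computeIfAbsent(docid, d -> new ArrayList<>()).add(entry.getKey());
            }
        }
        return direct;
    }
}
```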
Each indexer iterates through the documents of the collection, using a Tokeniser to identify terms to index. Each term found is sent through the TermPipeline. The TermPipeline transforms the terms, and can remove terms that should not be indexed. The TermPipeline chain in use is termpipelines=Stopwords,PorterStemmer, which removes terms from the document using the Stopwords object, and then applies Porter’s Stemming algorithm for English to the terms (PorterStemmer). If you wanted to use a different stemmer, this is the point at which it should be implemented.
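The chaining idea can be sketched as follows: each stage receives a term and either drops it or forwards a (possibly transformed) term to the next stage. This is a simplified illustration using `java.util.function.Consumer`, not Terrier's TermPipeline interface, and the `toyStemmer` stage is a deliberately naive suffix stripper standing in for a real stemmer; it is not the Porter algorithm.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

/** Hypothetical sketch of a chained term pipeline; names are invented for illustration. */
class TermPipelineSketch {

    /** Drops stopwords and forwards every other term to the next stage. */
    static Consumer<String> stopwords(Set<String> stop, Consumer<String> next) {
        return term -> {
            if (!stop.contains(term)) {
                next.accept(term);
            }
        };
    }

    /** Toy suffix stripper standing in for a stemmer (NOT the Porter algorithm). */
    static Consumer<String> toyStemmer(Consumer<String> next) {
        return term -> {
            String stem = term;
            if (stem.length() > 5 && stem.endsWith("ing")) {
                stem = stem.substring(0, stem.length() - 3);
            } else if (stem.length() > 3 && stem.endsWith("s")) {
                stem = stem.substring(0, stem.length() - 1);
            }
            next.accept(stem);
        };
    }

    /** Runs tokens through stopword removal, then stemming, and collects the survivors. */
    static List<String> run(List<String> tokens, Set<String> stop) {
        List<String> out = new ArrayList<>();
        Consumer<String> pipeline = stopwords(stop, toyStemmer(out::add));
        tokens.forEach(pipeline);
        return out;
    }
}
```

Swapping in a different stemmer amounts to replacing one stage of the chain while leaving the others untouched.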
Once terms have been processed through the TermPipeline, they are aggregated by the DocumentPostingList. Each DocumentPostingList is then processed to update temporary data structures.
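The aggregation step can be pictured as counting how often each processed term occurs within one document. The sketch below is an assumption-laden simplification: `DocumentPostingsSketch` is a hypothetical class, and Terrier's DocumentPostingList keeps more information than a plain frequency map.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of per-document term aggregation. */
class DocumentPostingsSketch {

    /** Counts occurrences of each term, keeping first-seen order. */
    static Map<String, Integer> termFrequencies(List<String> terms) {
        Map<String, Integer> tf = new LinkedHashMap<>();
        for (String term : terms) {
            tf.merge(term, 1, Integer::sum);
        }
        return tf;
    }

    /** Document length is the total number of indexed tokens. */
    static int documentLength(Map<String, Integer> tf) {
        int length = 0;
        for (int frequency : tf.values()) {
            length += frequency;
        }
        return length;
    }
}
```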
There are two variants of each indexer: one providing basic functionality (storing term frequencies only), and one additionally storing position information (i.e. where each word occurs in each document) within the direct and inverted index structures. This allows querying to use term position information - for example phrasal search ("") and proximity search (""~10). For more details about the querying process, you may refer to querying with Terrier and the description of the query language.
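To show why stored positions make these query types possible, the following sketch checks a two-term phrase and a proximity constraint against per-document position lists. The class `PositionMatchSketch` is hypothetical, and the matching rules (adjacent positions for a phrase, an absolute distance of at most `window` for proximity) are assumptions chosen for the example rather than a description of Terrier's matching semantics.

```java
import java.util.List;

/** Hypothetical illustration of position-based matching within one document. */
class PositionMatchSketch {

    /** True if some occurrence of the second term directly follows the first term. */
    static boolean phrase(List<Integer> firstPositions, List<Integer> secondPositions) {
        for (int p : firstPositions) {
            if (secondPositions.contains(p + 1)) {
                return true;
            }
        }
        return false;
    }

    /** True if the two terms occur within `window` positions of each other, in any order. */
    static boolean near(List<Integer> a, List<Integer> b, int window) {
        for (int p : a) {
            for (int q : b) {
                if (Math.abs(p - q) <= window) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

An index that stores only term frequencies cannot answer either question, which is why the position-storing indexer variants are required for such queries.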
Block indexing can be configured to use bounded blocks instead of fixed-size blocks. To do so, a list of pre-defined terms must be specified as special-purpose block delimiters. By using sentence boundaries as block delimiters, for instance, one can have blocks that represent sentences. BlockIndexer, BlockSinglePassIndexer, and Hadoop_BlockSinglePassIndexer all implement this feature.
Bounded block indexing can be used by configuring the following properties:
block.delimiters.enabled - Whether delimited blocks should be used instead of fixed-size blocks. Defaults to false.
block.delimiters - Comma-separated list of terms that cause the block counter to be incremented.
block.delimiters.index.terms - Whether delimiters should be themselves indexed as normal terms. Defaults to false.
block.delimiters.index.doclength - Whether indexed delimiters should contribute to document length statistics. Defaults to false; if set to true, this property only has effect if block.delimiters.index.terms is enabled.
termpipeline.skip - Comma-separated list of tokens to be skipped by the configured term pipelines. In practice, this should be set to the value of block.delimiters in order to prevent the specified delimiters from being stemmed or removed during indexing.
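The effect of these properties can be sketched as a block counter that advances each time a delimiter term is seen. In the code below, `BoundedBlockSketch`, the `term@block` output format, and the decision to record an indexed delimiter in the block it closes are all assumptions for illustration; the `indexDelimiters` flag loosely mirrors block.delimiters.index.terms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of delimiter-bounded block numbering. */
class BoundedBlockSketch {

    /**
     * Assigns a block number to each token and returns entries of the form "term@block".
     * Delimiters increment the block counter and are only emitted if indexDelimiters is true.
     */
    static List<String> assignBlocks(List<String> tokens, Set<String> delimiters, boolean indexDelimiters) {
        List<String> out = new ArrayList<>();
        int block = 0;
        for (String token : tokens) {
            if (delimiters.contains(token)) {
                if (indexDelimiters) {
                    out.add(token + "@" + block);   // assumed: delimiter belongs to the block it ends
                }
                block++;                            // next token starts a new block
            } else {
                out.add(token + "@" + block);
            }
        }
        return out;
    }
}
```

With a sentence-boundary delimiter, all terms of one sentence share a block number, so a proximity query over blocks effectively asks whether terms co-occur within a given number of sentences.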
Terrier uses highly compressed data structures as much as possible. In particular, the inverted and direct index structures are encoded using bit-level compression, namely Elias Gamma and Elias Unary encoding of integers (term ids, docids and frequencies). The underlying compression is provided by the org.terrier.compression.bit package. Since version 4.0, alternative integer-focussed compression schemes have been supported. These are applied by rewriting the inverted (or direct) index data structures with a new format. See the compression documentation for more information.
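For readers unfamiliar with these codes, the sketch below writes gamma and unary codes as strings of `0` and `1` characters. It uses one common textbook convention (for gamma, a run of zeros one shorter than the binary form, followed by the binary form; for unary, x-1 zeros followed by a one). The bit conventions used by org.terrier.compression.bit may differ, and the class `GammaSketch` is hypothetical; a real implementation packs bits into bytes rather than building strings.

```java
/** Hypothetical string-based illustration of Elias gamma and unary codes. */
class GammaSketch {

    /** Gamma code of x >= 1: (bit length - 1) zeros, then x in binary. */
    static String gamma(int x) {
        if (x < 1) {
            throw new IllegalArgumentException("gamma codes are defined for x >= 1");
        }
        String binary = Integer.toBinaryString(x);
        return "0".repeat(binary.length() - 1) + binary;
    }

    /** Decodes a single gamma code produced by gamma(). */
    static int decodeGamma(String bits) {
        int zeros = 0;
        while (bits.charAt(zeros) == '0') {
            zeros++;                       // the zero run gives the length of the binary part
        }
        return Integer.parseInt(bits.substring(zeros, 2 * zeros + 1), 2);
    }

    /** Unary code of x >= 1 under the assumed convention: x - 1 zeros, then a one. */
    static String unary(int x) {
        if (x < 1) {
            throw new IllegalArgumentException("unary codes are defined for x >= 1");
        }
        return "0".repeat(x - 1) + "1";
    }

    public static void main(String[] args) {
        System.out.println(gamma(5));      // prints 00101
        System.out.println(unary(3));      // prints 001
    }
}
```

Small integers receive short codes, which is why such schemes suit frequencies and other values that are usually small.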