Extending Indexing in Terrier

If your data is stored in files (one file per document, or in XML or TREC files), you should be able to index it using one of the provided Collection decoders, such as SimpleFileCollection or TRECCollection. Otherwise, for example when the documents to be indexed must be extracted from a database, you will need to write your own Collection decoder. To do so, implement the Collection interface and name your class in the trec.collection.class property. MultiFileCollection is a useful base class for implementing readers for TREC-like corpora with multiple documents stored in each file. Because it can fetch HTTP URLs, SimpleFileCollection can also be used to download webpages.
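As a rough illustration of what a custom decoder has to do, the sketch below walks a set of rows (such as might be fetched from a database) and exposes them one document at a time. The `SimpleCollection` interface and its method names are hypothetical, simplified stand-ins invented for this example; Terrier's real Collection interface has a different and larger contract.

```java
import java.util.Iterator;
import java.util.Map;

class RowCollectionSketch {
    /** Hypothetical, simplified stand-in; this is NOT Terrier's Collection interface. */
    interface SimpleCollection {
        boolean nextDocument();   // advance; returns false once the source is exhausted
        String currentDocno();    // identifier of the current document
        String currentText();     // raw text of the current document
    }

    /** Exposes rows (id -> text), e.g. as read from a database table, as documents. */
    static SimpleCollection overRows(Map<String, String> rows) {
        final Iterator<Map.Entry<String, String>> it = rows.entrySet().iterator();
        return new SimpleCollection() {
            private Map.Entry<String, String> current;

            public boolean nextDocument() {
                if (!it.hasNext()) return false;
                current = it.next();
                return true;
            }

            public String currentDocno() { return current.getKey(); }

            public String currentText() { return current.getValue(); }
        };
    }
}
```

A caller would loop with `while (c.nextDocument())` and hand each document's text on to the parsing stage.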

A Collection implementation returns a Document object for every document in the corpus. Simple textual contents can be handled by FileDocument, while HTML documents can be handled by TaggedDocument. Otherwise, if your documents are in a non-standard format, you will need to implement your own Document. The purpose of a Document object is to parse a document's format (e.g. Microsoft Word, HTML) and extract the text that should be indexed. Optionally, you can designate the fields that contain the text to be extracted and, if configured, the indexer will record the fields in which each term occurs in a document.

The Document typically provides the extracted text as input to a tokeniser, which identifies the individual tokens and returns them as a stream, in their order of occurrence. For languages where tokens are naturally delimited by whitespace characters, Terrier provides English and UTF tokenisers. If your corpus requires more complicated tokenisation than splitting on whitespace, you can implement the Tokeniser interface to suit your needs.
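The following minimal sketch shows the general idea of turning text into an ordered stream of tokens. The class `SimpleTokeniserSketch` is a hypothetical stand-in written for this example and does not implement Terrier's Tokeniser interface; it assumes that tokens are maximal runs of letters and digits and that they should be lowercased.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical example class; not part of Terrier.
class SimpleTokeniserSketch {
    /** Emits maximal runs of letters/digits, lowercased, in order of occurrence. */
    static Stream<String> tokenise(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // Any other character ends the token being built.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString()); // flush the final token
        }
        return tokens.stream();
    }
}
```

A corpus with different token boundaries (for instance, a language without whitespace delimiters) would replace the character test with its own segmentation logic.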

Index Data Structures

As discussed in Configuring Indexing, Terrier has different indexing implementations. In the following, we describe the generic indexing infrastructure. Details on the implementation of classical two-pass and single-pass indexing can be found at [indexer details](indexer_details.html). Details on Hadoop MapReduce indexing are described elsewhere. In-memory and incremental indices are described under real-time indexing.

Each indexer creates several data structures and produces a particular Index implementation, as summarised in the table below:

|Structure|Classical|Single-pass|MapReduce|Memory|
|------------|---|---|---|---|
|direct|✔|x|x|optional|
|document|✔|✔|✔|✔|
|lexicon|✔|✔|✔|✔|
|inverted|✔|✔|✔|✔|
|meta|✔|✔|✔|✔|
|(Index Type)|IndexOnDisk|IndexOnDisk|IndexOnDisk|MemoryIndex|

Hint: If a direct file is required after using an indexer that does not create one, the Inverted2DirectIndexBuilder can be used to create one.

Each indexer iterates through the documents of the collection, using a Tokeniser to identify the terms to index. Each term found is sent through the TermPipeline, which transforms terms and can remove those that should not be indexed. The TermPipeline chain in use is termpipelines=Stopwords,PorterStemmer, which first removes stopwords using the Stopwords object and then applies Porter’s stemming algorithm for English to the remaining terms (PorterStemmer). If you want to use a different stemmer, this is the point at which it should be added.
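The chaining behaviour described above can be sketched as follows. Everything here is hypothetical and self-contained: the `Stage` interface is a simplified stand-in rather than Terrier's TermPipeline, and `toyStemmer` is a trivial suffix stripper used only to show where a stemmer sits in the chain; it is not Porter's algorithm.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical example class; not part of Terrier.
class TermPipelineSketch {
    /** One stage of the chain: receives a term and may forward it to the next stage. */
    interface Stage {
        void process(String term);
    }

    /** Drops terms found in the stopword set; forwards all others unchanged. */
    static Stage stopwords(final Set<String> stop, final Stage next) {
        return term -> {
            if (!stop.contains(term)) next.process(term);
        };
    }

    /** Toy stand-in for a stemmer: strips a trailing "s" from terms longer than 3 chars. */
    static Stage toyStemmer(final Stage next) {
        return term -> next.process(
            term.length() > 3 && term.endsWith("s")
                ? term.substring(0, term.length() - 1)
                : term);
    }

    /** Builds the chain stopwords -> toyStemmer -> collector and runs the terms through it. */
    static List<String> run(List<String> terms, Set<String> stop) {
        List<String> out = new ArrayList<>();
        Stage chain = stopwords(stop, toyStemmer(out::add));
        for (String t : terms) chain.process(t);
        return out;
    }
}
```

Because each stage only knows about the next one, swapping the stemmer amounts to replacing a single link in the chain.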

Once terms have been processed through the TermPipeline, they are aggregated by the DocumentPostingList. Each DocumentPostingList is then processed to update temporary data structures.
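Conceptually, aggregating the terms of one document means collapsing the term stream into per-term frequencies. The sketch below illustrates this with a plain map; `DocumentPostingSketch` is a hypothetical class for this example and says nothing about how DocumentPostingList is actually implemented.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical example class; not part of Terrier.
class DocumentPostingSketch {
    /** Counts how often each term occurs in a single document's term stream. */
    static Map<String, Integer> aggregate(List<String> terms) {
        Map<String, Integer> frequencies = new TreeMap<>();
        for (String term : terms) {
            // Insert 1 for a new term, otherwise add 1 to the existing count.
            frequencies.merge(term, 1, Integer::sum);
        }
        return frequencies;
    }
}
```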

There are two variants of each indexer: one providing basic functionality (storing term frequencies only), and one additionally storing position information (i.e. where each word occurs in each document) within the direct and inverted index structures. This allows querying to use term position information, for example phrasal search ("") and proximity search (""~10). For more details about the querying process, you may refer to querying with Terrier and the description of the query language.
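To see why stored positions make phrase and proximity matching possible, consider the simplified sketch below. It is an assumption-laden illustration, not Terrier's matching code: `PositionSketch` is hypothetical, positions are plain token offsets, and `near` checks only ordered proximity (the second term after the first), with a window of 1 behaving like a two-term phrase.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical example class; not part of Terrier.
class PositionSketch {
    /** Records, for each term, the token offsets at which it occurs in one document. */
    static Map<String, List<Integer>> positions(List<String> terms) {
        Map<String, List<Integer>> pos = new HashMap<>();
        for (int i = 0; i < terms.size(); i++) {
            pos.computeIfAbsent(terms.get(i), k -> new ArrayList<>()).add(i);
        }
        return pos;
    }

    /** True if 'second' occurs at most 'window' positions after some occurrence of 'first'. */
    static boolean near(Map<String, List<Integer>> pos, String first, String second, int window) {
        List<Integer> firstPos = pos.getOrDefault(first, Collections.<Integer>emptyList());
        List<Integer> secondPos = pos.getOrDefault(second, Collections.<Integer>emptyList());
        for (int p : firstPos) {
            for (int q : secondPos) {
                if (q > p && q - p <= window) return true;
            }
        }
        return false;
    }
}
```

With term frequencies alone, neither check could be answered, which is why the position-storing indexer variants are needed for these query types.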

Block Delimiter Terms

Block indexing can be configured to use bounded blocks instead of fixed-size blocks. To do so, a list of pre-defined terms must be specified as special-purpose block delimiters. By using sentence boundaries as block delimiters, for instance, each block can represent a sentence. BlockIndexer, BlockSinglePassIndexer, and Hadoop_BlockSinglePassIndexer all implement this feature.
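The idea of bounded blocks can be sketched as follows: the block counter advances whenever a delimiter term is seen, rather than after a fixed number of terms. `BoundedBlockSketch` is a hypothetical class for this example, and it assumes that delimiter terms themselves are not assigned to any block; whether Terrier indexes delimiters is a matter of configuration not covered here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical example class; not part of Terrier.
class BoundedBlockSketch {
    /** Returns the block id of every non-delimiter term, in order of occurrence. */
    static List<Integer> blockIds(List<String> terms, Set<String> delimiters) {
        List<Integer> ids = new ArrayList<>();
        int block = 0;
        for (String term : terms) {
            if (delimiters.contains(term)) {
                block++;        // a delimiter closes the current block
            } else {
                ids.add(block); // ordinary terms belong to the current block
            }
        }
        return ids;
    }
}
```

With a sentence-ending marker as the only delimiter, all terms of one sentence share a block id, so a proximity query restricted to one block matches within a sentence.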

Bounded block indexing can be used by configuring the following properties:

Compression

Terrier uses highly compressed data structures wherever possible. In particular, the inverted and direct index structures are encoded using bit-level compression, namely Elias gamma and unary encoding of integers (term ids, docids and frequencies). The underlying compression is provided by the org.terrier.compression.bit package. Since version 4.0, alternative integer-focussed compression schemes have also been supported. These are applied by rewriting the inverted (or direct) index data structures in a new format. See the compression documentation for more information.
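For readers unfamiliar with these codes, here is a small sketch of the textbook definitions. It is only an illustration: `EliasSketch` is a hypothetical class, codes are built as strings of '0' and '1' for readability rather than packed into bits, and the exact bit conventions used in org.terrier.compression.bit may differ from the ones assumed here.

```java
// Hypothetical example class; not part of Terrier.
class EliasSketch {
    /** Unary code of n >= 1 (one common convention): n - 1 zeros followed by a one. */
    static String unary(int n) {
        if (n < 1) throw new IllegalArgumentException("n must be >= 1");
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i < n; i++) sb.append('0');
        return sb.append('1').toString();
    }

    /** Elias gamma code of n >= 1: unary(bit length of n), then n's bits minus the leading 1. */
    static String gamma(int n) {
        if (n < 1) throw new IllegalArgumentException("n must be >= 1");
        String bits = Integer.toBinaryString(n);
        return unary(bits.length()) + bits.substring(1);
    }

    /** Decodes a single gamma code produced by gamma(). */
    static int decodeGamma(String code) {
        int length = code.indexOf('1') + 1;                  // unary prefix gives the bit length
        String rest = code.substring(length, 2 * length - 1); // remaining length - 1 bits
        return Integer.parseInt("1" + rest, 2);               // restore the implicit leading 1
    }
}
```

Under these conventions, small integers get short codes (1 takes a single bit), which suits values such as frequencies that are usually small.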

For document metadata, the default MetaIndex, namely CompressingMetaIndex, uses Zip compression to minimise the number of bytes needed for each document.
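As an illustration of Zip-style compression applied to a per-document record, the sketch below round-trips a string through the JDK's `java.util.zip.Deflater` and `Inflater`. The class `MetaCompressionSketch` and the idea of a record being a single string are assumptions made for this example; it does not reproduce the on-disk format or the internals of CompressingMetaIndex.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical example class; not part of Terrier.
class MetaCompressionSketch {
    /** Compresses one metadata record with DEFLATE (the algorithm behind Zip). */
    static byte[] compress(String record) {
        Deflater deflater = new Deflater();
        deflater.setInput(record.getBytes(StandardCharsets.UTF_8));
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Restores the original record from its compressed bytes. */
    static String decompress(byte[] data) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                // Stop rather than spin if the input is truncated or needs a dictionary.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) break;
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt compressed record", e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Very short records may not shrink at all under DEFLATE; the savings come from longer or repetitive values such as URLs and titles.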


Webpage: http://terrier.org
Contact: School of Computing Science
Copyright (C) 2004-2018 University of Glasgow. All Rights Reserved.