
Extending Indexing in Terrier

Unless your data is stored as files, with one file per document, you will probably need to create your own collection decoder. To do so, implement the Collection interface (uk.ac.gla.terrier.indexing.Collection) and write your own indexing application. See the classes uk.ac.gla.terrier.applications.TRECIndexing and uk.ac.gla.terrier.applications.desktop.DesktopTerrier for examples.
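As a rough illustration of what a collection decoder does, the sketch below walks a hypothetical line-oriented format (one document per line, written as docid, a tab, then the text). The SketchCollection interface is a simplified stand-in invented for this example; it is not the real uk.ac.gla.terrier.indexing.Collection interface, whose exact methods should be taken from the Terrier Javadoc.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a collection interface (assumed shape, not the Terrier API).
interface SketchCollection {
    boolean nextDocument();    // move to the next document; false when none remain
    String getDocid();         // identifier of the current document
    String getDocumentText();  // raw text of the current document
    boolean endOfCollection(); // true once every document has been returned
}

// Decoder for a hypothetical format: one document per line, "docid<TAB>text".
class LinePerDocumentCollection implements SketchCollection {
    private final List<String> lines = new ArrayList<>();
    private int next = 0;
    private String docid;
    private String text;

    LinePerDocumentCollection(String data) {
        // Keep only non-blank lines, so every remaining line is a document.
        for (String line : data.split("\n")) {
            if (!line.trim().isEmpty()) {
                lines.add(line);
            }
        }
    }

    public boolean nextDocument() {
        if (next >= lines.size()) {
            return false;
        }
        String line = lines.get(next++);
        int tab = line.indexOf('\t');
        // A line without a tab is treated as a document id with empty text.
        docid = tab < 0 ? line : line.substring(0, tab);
        text = tab < 0 ? "" : line.substring(tab + 1);
        return true;
    }

    public String getDocid() { return docid; }
    public String getDocumentText() { return text; }
    public boolean endOfCollection() { return next >= lines.size(); }
}
```

An indexing application would typically loop while nextDocument() returns true and hand each document to the indexer.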

If your documents are in a non-standard format, we advise you to create your own Document implementation as well, by implementing the interface uk.ac.gla.terrier.indexing.Document. The purpose of a Document object is to parse a document and identify its terms, which should be returned in order of occurrence. Optionally, you can designate the fields in which the terms occur; if so configured, the indexer will record the fields in which each term occurs within a document.
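The sketch below shows the idea of a document parser that returns terms in order of occurrence and reports the field of each term. The SketchDocument interface and the "TITLE:" / "BODY:" line format are assumptions made for this example only; the real Document interface may declare different or additional methods.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simplified stand-in for a document interface (assumed shape, not the Terrier API).
interface SketchDocument {
    String getNextTerm();    // next term in order of occurrence, or null at the end
    Set<String> getFields(); // fields of the term most recently returned
    boolean endOfDocument(); // true once all terms have been returned
}

// Parses a hypothetical format where lines start with "TITLE:" or "BODY:".
class TitledDocument implements SketchDocument {
    private final List<String> terms = new ArrayList<>();
    private final List<String> fields = new ArrayList<>();
    private int pos = -1; // index of the term most recently returned

    TitledDocument(String raw) {
        for (String line : raw.split("\n")) {
            String field = "BODY"; // unlabelled lines count as body text
            String content = line;
            if (line.startsWith("TITLE:")) {
                field = "TITLE";
                content = line.substring("TITLE:".length());
            } else if (line.startsWith("BODY:")) {
                content = line.substring("BODY:".length());
            }
            // Lowercase and split on anything that is not a letter or digit.
            for (String token : content.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (!token.isEmpty()) {
                    terms.add(token);
                    fields.add(field);
                }
            }
        }
    }

    public String getNextTerm() {
        if (endOfDocument()) {
            return null;
        }
        return terms.get(++pos);
    }

    // Only meaningful after getNextTerm() has returned a term.
    public Set<String> getFields() {
        return Collections.singleton(fields.get(pos));
    }

    public boolean endOfDocument() {
        return pos + 1 >= terms.size();
    }
}
```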

Classical two-pass indexing

You can now use the BasicIndexer or the BlockIndexer to index your collection. The BlockIndexer provides the same functionality as the BasicIndexer, but uses larger DirectIndex and InvertedIndex structures that also store the positions at which each word occurs in each document. This allows querying to use term position information, for example phrasal search ("") and proximity search (""~10). For more details about the querying process, you may refer to querying with Terrier and the description of the query language.
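To see why stored positions make phrasal and proximity search possible, consider the following minimal sketch. It keeps the positions of each term within a single document in memory; the class and method names are hypothetical, and the matching semantics (adjacent and ordered for a phrase, within a window in either order for proximity) are an assumption for illustration rather than a statement of how Terrier evaluates these queries.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PositionalSketch {
    // Record, for each term, the positions at which it occurs in the document.
    static Map<String, List<Integer>> positions(String[] terms) {
        Map<String, List<Integer>> map = new HashMap<>();
        for (int i = 0; i < terms.length; i++) {
            List<Integer> list = map.get(terms[i]);
            if (list == null) {
                list = new ArrayList<>();
                map.put(terms[i], list);
            }
            list.add(i);
        }
        return map;
    }

    // Phrase match: 'second' occurs immediately after 'first'.
    static boolean phrase(Map<String, List<Integer>> pos, String first, String second) {
        List<Integer> a = pos.get(first);
        List<Integer> b = pos.get(second);
        if (a == null || b == null) {
            return false;
        }
        for (int p : a) {
            if (b.contains(p + 1)) {
                return true;
            }
        }
        return false;
    }

    // Proximity match: the two terms occur within 'window' positions of each other.
    static boolean near(Map<String, List<Integer>> pos, String t1, String t2, int window) {
        List<Integer> a = pos.get(t1);
        List<Integer> b = pos.get(t2);
        if (a == null || b == null) {
            return false;
        }
        for (int p : a) {
            for (int q : b) {
                if (p != q && Math.abs(p - q) <= window) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

Without positions, an index can only say that both terms occur in a document, not whether they occur next to each other.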

The indexer iterates through the documents of the collection and sends each term found through the TermPipeline. The TermPipeline transforms the terms, and can remove terms that should not be indexed. The TermPipeline chain in use is termpipelines=Stopwords,PorterStemmer, which first removes stopwords from the document using the Stopwords object, and then applies Porter's stemming algorithm for English to the remaining terms (PorterStemmer). If you want to use a different stemmer, this is the point at which it should be added.
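The chaining idea can be sketched as follows: each stage receives a term, and either drops it or passes a (possibly transformed) term to the next stage. The SketchTermPipeline interface and the stage classes are stand-ins written for this example, and the stemmer here is a deliberately naive suffix stripper, not Porter's algorithm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified stand-in for a term pipeline stage (assumed shape, not the Terrier API).
interface SketchTermPipeline {
    void processTerm(String term);
}

// Drops stopwords; every other term is passed on unchanged.
class StopwordStage implements SketchTermPipeline {
    private final Set<String> stopwords;
    private final SketchTermPipeline next;

    StopwordStage(SketchTermPipeline next, String... stopwords) {
        this.next = next;
        this.stopwords = new HashSet<>(Arrays.asList(stopwords));
    }

    public void processTerm(String term) {
        if (!stopwords.contains(term)) {
            next.processTerm(term);
        }
    }
}

// Toy stemmer: strips one trailing "s" from longer terms. NOT Porter's algorithm.
class ToyStemmerStage implements SketchTermPipeline {
    private final SketchTermPipeline next;

    ToyStemmerStage(SketchTermPipeline next) {
        this.next = next;
    }

    public void processTerm(String term) {
        boolean strip = term.length() > 3 && term.endsWith("s");
        next.processTerm(strip ? term.substring(0, term.length() - 1) : term);
    }
}

// End of the chain: collects whatever survives, standing in for the indexer.
class CollectingStage implements SketchTermPipeline {
    final List<String> terms = new ArrayList<>();

    public void processTerm(String term) {
        terms.add(term);
    }
}
```

Swapping the stemmer then amounts to replacing one stage in the chain, which is the role the termpipelines setting plays in the text above.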

Once terms have been processed through the TermPipeline, they are aggregated by the DocumentPostingList and the LexiconMap, to create the following data structures:

DirectIndex : a compressed file, where we store the terms contained in each document. The direct index is used for automatic query expansion. Built by the DirectIndexBuilders.
DocumentIndex : a fixed-length entry file, where we store information about documents, such as the number of indexed tokens (document length), the identifier of a document, and the offset of its corresponding entry in the direct index. Created by the DocumentIndexBuilder.
Lexicon : a fixed-length entry file, where we store information about the vocabulary of the indexed collection. Built as a series of temporary Lexicons by the LexiconBuilders, which are then merged at the end of the DirectIndex build phase.

As the indexer iterates through the documents of the collection, it appends to the direct and document indexes. To save the vocabulary information, the indexer creates temporary lexicons for parts of the collection, which are merged once all the documents have been processed.

Once the direct index, the document index and the lexicon have been created, the InvertedIndexBuilder creates the inverted index by inverting the direct index.
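The two passes described above can be sketched with in-memory maps: the first pass records the terms of each document (a direct index), and the second pass inverts that structure into term-to-document postings. This is an uncompressed toy under assumed names; it does not reflect the on-disk formats or the builder classes mentioned in the text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TwoPassSketch {
    // Pass 1: direct index. Entry i maps each term of document i to its frequency.
    static List<Map<String, Integer>> buildDirect(String[][] docs) {
        List<Map<String, Integer>> direct = new ArrayList<>();
        for (String[] doc : docs) {
            Map<String, Integer> termFreqs = new TreeMap<>();
            for (String term : doc) {
                Integer tf = termFreqs.get(term);
                termFreqs.put(term, tf == null ? 1 : tf + 1);
            }
            direct.add(termFreqs);
        }
        return direct;
    }

    // Pass 2: invert the direct index into term -> (docid -> frequency).
    static Map<String, Map<Integer, Integer>> invert(List<Map<String, Integer>> direct) {
        Map<String, Map<Integer, Integer>> inverted = new TreeMap<>();
        for (int docid = 0; docid < direct.size(); docid++) {
            for (Map.Entry<String, Integer> entry : direct.get(docid).entrySet()) {
                Map<Integer, Integer> postings = inverted.get(entry.getKey());
                if (postings == null) {
                    postings = new TreeMap<>();
                    inverted.put(entry.getKey(), postings);
                }
                postings.put(docid, entry.getValue());
            }
        }
        return inverted;
    }
}
```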

Single-pass indexing

Terrier 2.0 adds the single-pass indexing architecture. In this architecture, indexing is performed to build up in-memory posting lists (MemoryPostings containing Posting objects), which are written to disk as 'runs' by the RunWriter when most of the available memory is consumed.

Once the collection has been parsed, all runs are merged by the RunsMerger, which uses a SimplePostingInRun to represent each posting list when iterating through the contents of each run.
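A minimal sketch of the run-based approach follows. Postings accumulate in memory, are flushed as a "run" when a threshold is reached, and all runs are merged at the end. Everything here is an assumption for illustration: the class name, the posting-count threshold (a stand-in for a memory check), and the in-memory lists standing in for on-disk runs. The merge simply concatenates per-term lists in run order, which is simpler than the merger the text describes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SinglePassSketch {
    private final int maxPostingsInMemory; // stand-in for "most of the memory is consumed"
    private TreeMap<String, List<Integer>> memory = new TreeMap<>();
    private int postingsInMemory = 0;
    // Each flushed run is kept in a list here; a real system would write it to disk.
    final List<TreeMap<String, List<Integer>>> runs = new ArrayList<>();

    SinglePassSketch(int maxPostingsInMemory) {
        this.maxPostingsInMemory = maxPostingsInMemory;
    }

    // Documents must be added in increasing docid order.
    void addDocument(int docid, String[] terms) {
        for (String term : terms) {
            List<Integer> postings = memory.get(term);
            if (postings == null) {
                postings = new ArrayList<>();
                memory.put(term, postings);
            }
            // Record each document at most once per term.
            if (postings.isEmpty() || postings.get(postings.size() - 1) != docid) {
                postings.add(docid);
                postingsInMemory++;
            }
        }
        // Flush only at document boundaries, so runs stay ordered by docid.
        if (postingsInMemory >= maxPostingsInMemory) {
            flush();
        }
    }

    void flush() {
        if (!memory.isEmpty()) {
            runs.add(memory);
            memory = new TreeMap<>();
            postingsInMemory = 0;
        }
    }

    // Merge all runs: runs were written in docid order, so per-term
    // concatenation in run order keeps each posting list sorted.
    TreeMap<String, List<Integer>> merge() {
        flush();
        TreeMap<String, List<Integer>> inverted = new TreeMap<>();
        for (TreeMap<String, List<Integer>> run : runs) {
            for (Map.Entry<String, List<Integer>> entry : run.entrySet()) {
                List<Integer> postings = inverted.get(entry.getKey());
                if (postings == null) {
                    postings = new ArrayList<>();
                    inverted.put(entry.getKey(), postings);
                }
                postings.addAll(entry.getValue());
            }
        }
        return inverted;
    }
}
```

Note that no direct index is produced along the way, which is why a separate step is needed if one is required.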

If a direct file is required, the Inverted2DirectIndexBuilder can be used to create one.

Changing Indexing

Replacing the default indexing structures in Terrier with others is straightforward, because the data.properties file specifies which classes should be used to load the four main data structures of the Index: DocumentIndex, DirectIndex, Lexicon and InvertedIndex. To implement a replacement index data structure, it may be sufficient to subclass a builder, and then subclass the appropriate Indexer class to ensure that it is used.

Adding other data structures to a Terrier index is also easy. The Index object contains methods such as addIndexStructure(String, String), which allow a class to be associated with a structure name (e.g. uk.ac.gla.terrier.structures.InvertedIndex is associated with the "inverted" structure). You can retrieve your structure by casting the result of getIndexStructure(String). For structures with more complicated constructors, other addIndexStructure methods are provided. Finally, your application can check that the desired structure exists using hasIndexStructure(String).
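The name-to-class association can be pictured with the small registry below. It borrows the three method names from the text, but the class itself is a hypothetical stand-in, not Terrier's Index: it assumes each structure class has a public no-argument constructor and loads it by reflection, whereas the real Index may construct structures quite differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in that mimics the add/get/has pattern described above.
class StructureRegistrySketch {
    private final Map<String, String> classNames = new HashMap<>();

    // Associate a structure name with the class that implements it.
    void addIndexStructure(String structureName, String className) {
        classNames.put(structureName, className);
    }

    boolean hasIndexStructure(String structureName) {
        return classNames.containsKey(structureName);
    }

    // Returns a new instance of the registered class, or null if the name is unknown.
    // The caller casts the result to the expected type.
    Object getIndexStructure(String structureName) {
        String className = classNames.get(structureName);
        if (className == null) {
            return null;
        }
        try {
            // Assumption: the class has a public no-argument constructor.
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load structure " + structureName, e);
        }
    }
}
```

Checking hasIndexStructure before calling getIndexStructure avoids having to handle a missing structure after the fact.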

Terrier indices specify the random-access and in-order structure classes for each of the main structure types: direct, inverted, lexicon and document. When generating new data structures, it is good practice to provide in-order as well as random access through your classes, in case other developers wish to access these index structures at another indexing stage.
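As one possible reading of this advice, the sketch below offers both access styles over the same data: lookup by document id, and iteration in document order. The structure (a table of document lengths) and all names are hypothetical, and Terrier's own structures may separate the two access modes into different classes rather than combining them as shown here.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical structure holding one length per document.
class DocLengthsSketch implements Iterable<Integer> {
    private final int[] lengths;

    DocLengthsSketch(int[] lengths) {
        this.lengths = lengths.clone(); // defensive copy
    }

    // Random access: length of one document by its id.
    int getLength(int docid) {
        return lengths[docid];
    }

    // In-order access: lengths of all documents, in docid order.
    public Iterator<Integer> iterator() {
        return new Iterator<Integer>() {
            private int cursor = 0;

            public boolean hasNext() {
                return cursor < lengths.length;
            }

            public Integer next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return lengths[cursor++];
            }

            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }
}
```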
