Unless your data is stored in files (one file per document, or in XML or TREC collection files), you will probably need to create your own collection decoder, for example to extract the documents to be indexed from a database. This is done by implementing the Collection interface and naming your implementation in the trec.collection.class property.
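For illustration, a minimal sketch of a database-backed Collection is shown below. The class name DatabaseCollection, the SQL query and the DatabaseRecordDocument helper (sketched in the next section) are hypothetical, and the Collection method signatures assumed here (nextDocument(), getDocument(), endOfCollection(), reset() and close()) should be checked against the Javadoc of the Terrier version you are using.

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    import org.terrier.indexing.Collection;
    import org.terrier.indexing.Document;

    /** Hypothetical Collection that decodes documents stored in a database table. */
    public class DatabaseCollection implements Collection
    {
        protected Connection conn;
        protected ResultSet rs;
        protected boolean end = false;

        public DatabaseCollection(Connection conn) {
            this.conn = conn;
            reset();
        }

        /** Move to the next row/document; returns false when the table is exhausted. */
        public boolean nextDocument() {
            if (end || rs == null)
                return false;
            try {
                end = ! rs.next();
            } catch (SQLException e) {
                end = true;
            }
            return ! end;
        }

        /** Wrap the current row in a Document object. */
        public Document getDocument() {
            try {
                return new DatabaseRecordDocument(rs.getString("docno"), rs.getString("content"));
            } catch (SQLException e) {
                return null;
            }
        }

        public boolean endOfCollection() { return end; }

        /** (Re)issue the query so that the collection can be traversed again. */
        public void reset() {
            try {
                Statement s = conn.createStatement();
                rs = s.executeQuery("SELECT docno, content FROM documents");
                end = false;
            } catch (SQLException e) {
                end = true;
            }
        }

        public void close() {
            try { conn.close(); } catch (SQLException e) { /* ignore */ }
        }
    }

The fully-qualified name of such a class would then be given in the trec.collection.class property.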
A Collection implementation returns a Document object for every document in the corpus. Simple textual contents can be handled by FileDocument, while HTML documents can be handled by TaggedDocument; if your documents are in a non-standard format, you will need to implement your own Document. The purpose of a Document object is to parse a single document and extract the text that should be indexed. Optionally, you can designate the fields that contain the text to be extracted and, if configured, the indexer will record the fields in which each term occurs in a document.
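Continuing the example above, the hypothetical DatabaseRecordDocument might look as follows. The whitespace tokenisation is deliberately naive and for illustration only, and the Document methods assumed here (getNextTerm(), getFields(), endOfDocument(), getReader(), getProperty() and getAllProperties()) should be checked against the Javadoc of your Terrier version.

    import java.io.Reader;
    import java.io.StringReader;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    import org.terrier.indexing.Document;

    /** Hypothetical Document exposing one database record as indexable text. */
    public class DatabaseRecordDocument implements Document
    {
        protected final Map<String,String> properties = new HashMap<String,String>();
        protected final String content;
        protected final String[] tokens;
        protected int pos = -1;

        public DatabaseRecordDocument(String docno, String content) {
            properties.put("docno", docno);
            this.content = content;
            // naive tokenisation, for illustration only: lowercase and split on whitespace
            this.tokens = content.trim().toLowerCase().split("\\s+");
        }

        /** Return the next term to be indexed, or null if there are no more. */
        public String getNextTerm() {
            pos++;
            return pos < tokens.length ? tokens[pos] : null;
        }

        /** This simple example records no field information. */
        public Set<String> getFields() { return Collections.emptySet(); }

        public boolean endOfDocument() { return pos >= tokens.length - 1; }

        public Reader getReader() { return new StringReader(content); }

        public String getProperty(String name) { return properties.get(name); }

        public Map<String,String> getAllProperties() { return properties; }
    }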
The Document typically provides the extracted text as input to a tokeniser, which identifies the tokens and returns them as a stream, in their order of occurrence. For languages where tokens are naturally delimited by whitespace characters, Terrier provides English and UTF tokenisers. If your corpus requires more complicated tokenisation than splitting on whitespace, you can implement the Tokeniser interface to suit your needs.
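As a sketch, a custom Tokeniser might look like the following. It assumes that Tokeniser is an abstract class whose tokenise(Reader) method returns a TokenStream (an iterator of String tokens), as in recent Terrier versions, and the letter-or-digit rule is purely illustrative.

    import java.io.IOException;
    import java.io.Reader;

    import org.terrier.indexing.tokenisation.TokenStream;
    import org.terrier.indexing.tokenisation.Tokeniser;

    /** Illustrative tokeniser: emits maximal runs of letters or digits, lowercased. */
    public class SimpleAlphanumericTokeniser extends Tokeniser
    {
        @Override
        public TokenStream tokenise(final Reader reader)
        {
            return new TokenStream()
            {
                String nextToken = readToken();

                public boolean hasNext() { return nextToken != null; }

                public String next() {
                    String token = nextToken;
                    nextToken = readToken();
                    return token;
                }

                public void remove() { throw new UnsupportedOperationException(); }

                /** Read characters until a complete alphanumeric run has been consumed. */
                String readToken() {
                    StringBuilder sb = new StringBuilder();
                    try {
                        int ch;
                        while ((ch = reader.read()) != -1) {
                            if (Character.isLetterOrDigit(ch))
                                sb.append(Character.toLowerCase((char) ch));
                            else if (sb.length() > 0)
                                break;
                        }
                    } catch (IOException e) {
                        // treat an I/O failure as the end of the stream
                    }
                    return sb.length() > 0 ? sb.toString() : null;
                }
            };
        }
    }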
Two-pass indexing comes in two variants: the BasicIndexer, and the BlockIndexer, which provides the same functionality but uses a larger DirectIndex and InvertedIndex to store the positions at which each word occurs in each document. This allows querying to use term position information, for example phrasal search ("") and proximity search (""~10). For more details about the querying process, see querying with Terrier and the description of the query language.
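For instance, positional (block) indexing is typically switched on with the block.indexing property in terrier.properties; the property name shown here is as used in recent Terrier releases and should be confirmed against the indexing configuration documentation.

    # store term positions so that phrasal and proximity queries can be answered
    block.indexing=true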
The indexer iterates through the documents of the collection, using a Tokeniser to identify the terms to index. Each term found is passed through the TermPipeline, which transforms terms and can remove terms that should not be indexed. The default TermPipeline chain is termpipelines=Stopwords,PorterStemmer, which removes stopwords from the document using the Stopwords object, and then applies Porter's stemming algorithm for English (PorterStemmer) to the remaining terms. If you want to use a different stemmer, this is the point at which it should be plugged in.
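For example, a hypothetical pipeline stage that removes very short terms could be written as follows. This assumes that TermPipeline declares a processTerm(String) method and that each stage is constructed with the next stage of the chain, as the bundled Stopwords and PorterStemmer classes are; some Terrier versions add further methods to the interface, so check the Javadoc before relying on this sketch.

    import org.terrier.terms.TermPipeline;

    /** Hypothetical pipeline stage: drops terms shorter than 3 characters. */
    public class ShortTermRemover implements TermPipeline
    {
        protected final TermPipeline next;

        /** Each stage receives the next stage in the chain. */
        public ShortTermRemover(TermPipeline next) {
            this.next = next;
        }

        /** Pass the term on, or swallow it so that it is not indexed. */
        public void processTerm(String term) {
            if (term != null && term.length() >= 3)
                next.processTerm(term);
        }
    }

The new stage would then be named in the termpipelines property, e.g. termpipelines=Stopwords,ShortTermRemover,PorterStemmer.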
Once terms have been processed through the TermPipeline, they are aggregated by the DocumentPostingList and the LexiconMap, which are used to build the main index data structures, as described below.
As the indexer iterates through the documents of the collection, it appends to the direct and document indexes. To save the vocabulary information, the indexer creates temporary lexicons for parts of the collection, which are merged once all the documents have been processed.
Once the direct index, the document index and the lexicon have been created, the inverted index is created by the InvertedIndexBuilder, which inverts the direct index.
Terrier 2.0 added the single-pass indexing architecture. In this architecture, indexing builds up in-memory posting lists (MemoryPostings containing Posting objects), which are written to disk as 'runs' by the RunWriter when most of the available memory has been consumed.
Once the collection has been parsed, all runs are merged by the RunsMerger, which uses a SimplePostingInRun to represent each posting list when iterating through the contents of each run.
If a direct file is required, the Inverted2DirectIndexBuilder can be used to create one.
Block indexing can be configured to use bounded blocks instead of fixed-size blocks. To do so, a list of pre-defined terms must be specified as special-purpose block delimiters; by using sentence boundaries as delimiters, for instance, blocks can be made to represent sentences. BlockIndexer, BlockSinglePassIndexer, and Hadoop_BlockSinglePassIndexer all implement this feature.
Bounded block indexing is enabled by configuring the block-delimiter properties in the terrier.properties file.
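A sketch of such a configuration is given below; the property names and the example delimiter list are indicative only, and should be verified against the indexing configuration documentation of the Terrier version you are using.

    # store term positions, using bounded rather than fixed-size blocks
    block.indexing=true
    block.delimiters.enabled=true
    # comma-separated terms (as produced by the tokeniser) that close the current block
    block.delimiters=.,!,?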
Terrier uses highly compressed data structures as much as possible. In particular, the inverted and direct index structures are encoded using bit-level compression, namely Elias Gamma and Elias Unary encoding of the integers stored (term ids, docids and frequencies). The underlying compression is provided by the org.terrier.compression package. For document metadata, the default MetaIndex, CompressingMetaIndex, uses Zip compression to minimise the number of bytes needed per document.
Replacing the default indexing structures in Terrier with others is straightforward, as the data.properties file records which classes should be used to load the five main data structures of the Index: DocumentIndex, MetaIndex, DirectIndex, Lexicon and InvertedIndex. A more detailed summary of the standard index structures is given at Index Structures (Terrier wiki). To implement a replacement index data structure, it may be sufficient to subclass a builder, and then subclass the appropriate Indexer class to ensure it is used.
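For instance, the structure-to-class mapping in data.properties contains entries broadly of the following form; the key names and values shown here are indicative only, and the exact entries written by your Terrier version can be seen by inspecting the data.properties file of an existing index.

    index.lexicon.class=org.terrier.structures.MapFileLexicon
    index.inverted.class=org.terrier.structures.InvertedIndex
    index.meta.class=org.terrier.structures.CompressingMetaIndex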
Adding other data structures to a Terrier index is also straightforward. The abstract Index class defines methods such as addIndexStructure(String, String), which allow a class to be associated with a structure name (e.g. org.terrier.structures.InvertedIndex is associated with the "inverted" structure). You can retrieve your structure by casting the result of getIndexStructure(String). For structures with more complicated constructors, other addIndexStructure methods are provided. Finally, your application can check that the desired structure exists using hasIndexStructure(String).
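A minimal sketch of this usage is given below; the structure name "mystructure" and the class com.example.MyStructure are hypothetical.

    import org.terrier.structures.Index;

    public class CustomStructureExample
    {
        public static void main(String[] args) throws Exception
        {
            // open the default index, as named by the index.path and index.prefix properties
            Index index = Index.createIndex();

            // associate a (hypothetical) class with the structure name "mystructure"
            index.addIndexStructure("mystructure", "com.example.MyStructure");

            // later, any application can check for and retrieve the structure
            if (index.hasIndexStructure("mystructure")) {
                Object s = index.getIndexStructure("mystructure");
                // cast s to com.example.MyStructure before use
            }
            index.close();
        }
    }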
Terrier indices specify the random-access and in-order structure classes for each of the main structure types: direct, inverted, lexicon and document. When creating new data structures, it is good practice to provide in-order (stream) classes as well as random-access classes for your data structures, in case other developers wish to access them at another indexing stage. For instance, for the inverted structure, Terrier provides the InvertedIndex and InvertedIndexInputStream classes.
Many of Terrier's index structures can be accessed from Hadoop MapReduce. In particular, we provide BitPostingIndexInputFormat which allows an inverted index or direct index to be split across many map tasks, with an orthogonal CompressingMetaIndexInputFormat for the MetaIndex.
[Previous: Developing with Terrier] [Contents] [Next: Extending Retrieval]

Copyright © 2014 University of Glasgow | All Rights Reserved