Firstly, a Collection object extracts the raw content of each individual document (from a collection of documents) and hands it to a Document object. The Document object then removes any unwanted content (e.g., from a particular document tag) and gives the resulting text to a Tokeniser object. Finally, the Tokeniser object converts the text into a stream of tokens that represent the content of the document.
By default, Terrier uses TRECCollection, which parses corpora in TREC format. In particular, in TREC-formatted files, there are many documents delimited by <DOC></DOC> tags, as in the following example:
<DOC>
<DOCNO> doc1 </DOCNO>
Content of the document goes here
</DOC>
<DOC>
...
For corpora in other formats, you will need to change the Collection object being used, by setting the property trec.collection.class. Here are some options:
Except for the special-purpose collections (SimpleFileCollection, SimpleXMLCollection, and SimpleMedlineXMLCollection), all other Collection implementations allow for different Document implementations to be used, by specifying the trec.document.class property. By default, these collections use TaggedDocument. The available Document implementations are:
Finally, all Document implementations can specify their own Tokeniser implementation. By default, Terrier uses the EnglishTokeniser. When indexing non-English corpora, a different Tokeniser implementation can be specified by the tokeniser property.
For now, we'll stick to TRECCollection, which can be used for all TREC corpora from Disks 1&2 until Blogs08, including WT2G, .GOV, .GOV2, etc. TRECCollection can be further configured.
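To make the TREC layout concrete, the following Python sketch splits TREC-formatted text into documents and tokenises each one. It is not Terrier's TRECCollection: the helper names (iter_trec_docs, tokenise) and the lowercase alphanumeric tokenisation rule are assumptions made purely for illustration.

```python
import re
from typing import Iterator, List, Tuple

# Matches one <DOC>...</DOC> unit; DOTALL lets a document span several lines.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL | re.IGNORECASE)
DOCNO_RE = re.compile(r"<DOCNO>(.*?)</DOCNO>", re.DOTALL | re.IGNORECASE)


def iter_trec_docs(raw: str) -> Iterator[Tuple[str, str]]:
    """Yield (docno, body) pairs from TREC-formatted text (hypothetical helper)."""
    for match in DOC_RE.finditer(raw):
        content = match.group(1)
        docno_match = DOCNO_RE.search(content)
        if docno_match is None:
            continue  # skip documents without an identifier
        docno = docno_match.group(1).strip()
        # Remove the DOCNO element so that only the document text remains.
        body = DOCNO_RE.sub(" ", content, count=1).strip()
        yield docno, body


def tokenise(text: str) -> List[str]:
    """Assumed tokenisation rule: lowercase runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


sample = """<DOC>
<DOCNO> doc1 </DOCNO>
Content of the document goes here
</DOC>
<DOC>
<DOCNO> doc2 </DOCNO>
Another short document
</DOC>"""

docs = [(docno, tokenise(body)) for docno, body in iter_trec_docs(sample)]
```

The three roles described earlier map loosely onto this sketch: the regular expression over <DOC> plays the part of the Collection, the DOCNO removal the part of the Document, and tokenise the part of the Tokeniser.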
Terrier has the ability to record the frequency with which terms occur in various fields of documents. The required fields are specified by the FieldTags.process property. For example, to note when a term occurs in the TITLE or H1 HTML tags of a document, set FieldTags.process=TITLE,H1. FieldTags are case-insensitive. There is a special field called ELSE, which contains all terms not in any other specified field.
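As an illustration of per-field frequencies, here is a small Python sketch that counts terms by field, matches tag names case-insensitively, and sends everything else to ELSE. The input shape (pairs of enclosing tag and term) and the function name field_frequencies are assumptions; the sketch also always creates the ELSE field, which may differ from how Terrier treats it.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def field_frequencies(
    tagged_terms: Iterable[Tuple[str, str]], field_tags: List[str]
) -> Dict[str, Counter]:
    """Count term occurrences per field; terms in unlisted tags fall into ELSE."""
    wanted = {tag.upper() for tag in field_tags}  # case-insensitive matching
    counts: Dict[str, Counter] = {tag: Counter() for tag in wanted}
    counts["ELSE"] = Counter()
    for tag, term in tagged_terms:
        field = tag.upper() if tag.upper() in wanted else "ELSE"
        counts[field][term] += 1
    return counts


# (enclosing tag, term) pairs, as a hypothetical parser might emit them.
terms = [("title", "terrier"), ("h1", "indexing"), ("p", "terrier"), ("P", "terrier")]
freqs = field_frequencies(terms, ["TITLE", "H1"])
```

Here the field list mirrors the FieldTags.process=TITLE,H1 setting from the text: "terrier" is counted once in TITLE and twice in ELSE.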
The indexer iterates through the documents of the collection and sends each term found through the TermPipeline. The TermPipeline transforms the terms, and can remove terms that should not be indexed. The TermPipeline chain in use is termpipelines=Stopwords,PorterStemmer, which removes terms from the document using the Stopwords object, and then applies Porter's Stemming algorithm for English to the terms (PorterStemmer). If you want to use a different stemmer, this is the point at which it should be called.
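The chaining idea behind the TermPipeline can be sketched in Python as a list of stages, each of which either transforms a term or drops it. This is an illustrative model only: the stage signature, the tiny stopword list and the toy_stemmer function are assumptions, and toy_stemmer is a placeholder suffix stripper, not Porter's algorithm.

```python
from typing import Callable, Iterable, Iterator, List, Optional

# A stage maps a term to a transformed term, or to None to drop it.
Stage = Callable[[str], Optional[str]]

STOPWORDS = {"the", "of", "a", "and"}  # tiny illustrative list


def stopwords(term: str) -> Optional[str]:
    """Drop terms that appear in the stopword list."""
    return None if term in STOPWORDS else term


def toy_stemmer(term: str) -> Optional[str]:
    """Placeholder suffix stripper; this is NOT Porter's algorithm."""
    for suffix in ("ing", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term


def run_pipeline(terms: Iterable[str], stages: List[Stage]) -> Iterator[str]:
    """Pass each term through every stage in order, stopping if one removes it."""
    for term in terms:
        current = term
        for stage in stages:
            result = stage(current)
            if result is None:
                break  # a stage removed the term
            current = result
        else:
            yield current  # reached only when no stage dropped the term


indexed = list(run_pipeline(["the", "indexing", "of", "documents"], [stopwords, toy_stemmer]))
```

Swapping in a different stemmer corresponds to replacing toy_stemmer in the stage list, in the same position that PorterStemmer occupies in termpipelines.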
The term pipeline can also be configured at indexing time to skip various tokens. Set a comma-delimited list of tokens to skip in the property termpipelines.skip. The same property works at retrieval time also.
The indexers are more complicated. Each class can be configured by several properties. Many of these alter the memory usage of the classes.
For the BlockIndexer:
Once terms have been processed through the TermPipeline, they are aggregated by the DocumentPostingList and the LexiconMap. These have a few properties:
Document metadata is recorded in a MetaIndex structure. For instance, such metadata could include the DOCNO and URL of each document, which the system can use to represent the document during retrieval. The MetaIndex can be configured to take note of various document attributes during indexing. The available attributes depend on those provided by the Document implementation. MetaIndex can be configured using the following properties:
Note that for presenting results to a user, additional indexing configuration is required. See Web-based Terrier for more information.
Terrier supports three types of indexing: "classical two-pass", "single-pass" and "MapReduce". All three methods create an identical inverted index, which gives identical retrieval effectiveness. However, they differ in other characteristics, namely their support for query expansion, and their scalability and efficiency when indexing large corpora. The choice of indexing method is likely to be driven by your need for query expansion and the scale of the data you are working with. In particular, only classical two-pass indexing directly creates a direct index, which is used for query expansion. However, classical two-pass indexing doesn't scale to large corpora (the practical maximum is about 25 million documents). Single-pass indexing is faster, but doesn't create a direct index. MapReduce indexing can be used when you have very large data (e.g. 50+ million documents) and already have a Hadoop cluster. If you create an index that doesn't have a direct index, you can create one later using the --inverted2direct flag when calling trec_terrier.sh.
This subsection describes the classical indexing implemented by BasicIndexer and BlockIndexer. For single-pass indexing, see the next subsection.
The LexiconMap is flushed to disk every bundle.size documents. If memory during indexing is a concern, then reduce this property to less than its default 2500. However, more temporary lexicons will be created. The rate at which the temporary lexicons are merged is controlled by the lexicon.builder.merge.lex.max property, though we have found 16 to be a good compromise.
Once all documents in the collection have been indexed, the InvertedIndex is created by the InvertedIndexBuilder. As the entire DirectIndex cannot be inverted in memory, the InvertedIndexBuilder takes several iterations, each time selecting a few terms, scanning the direct index for them, and then writing out their postings to the inverted index. If it takes too many terms at once, Terrier can run out of memory. In that case, reduce the property invertedfile.processpointers from its default of 20,000,000 and rerun (the default is only 2,000,000 for block indexing, which is more memory intensive). See the InvertedIndexBuilder for more information about the inversion and term selection strategies.
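The iterative inversion can be sketched in Python as follows. This is a simplified model under stated assumptions, not the InvertedIndexBuilder itself: the direct index is a plain dictionary, terms are selected in alphabetical order, and the max_pointers argument plays the role that invertedfile.processpointers plays in limiting how many postings are held at once.

```python
from typing import Dict, Iterator, List, Tuple

DirectIndex = Dict[int, Dict[str, int]]  # docid -> {term: frequency}
Postings = List[Tuple[int, int]]         # [(docid, frequency), ...]


def term_batches(doc_freqs: Dict[str, int], max_pointers: int) -> Iterator[List[str]]:
    """Group terms so that each batch needs at most max_pointers postings.

    A single term with more postings than the budget gets a batch of its own.
    """
    batch: List[str] = []
    used = 0
    for term in sorted(doc_freqs):
        if batch and used + doc_freqs[term] > max_pointers:
            yield batch
            batch, used = [], 0
        batch.append(term)
        used += doc_freqs[term]
    if batch:
        yield batch


def invert(direct: DirectIndex, max_pointers: int) -> Dict[str, Postings]:
    """Invert a direct index, scanning it once per batch of selected terms."""
    # The document frequency of a term is the number of pointers it needs.
    doc_freqs: Dict[str, int] = {}
    for doc_terms in direct.values():
        for term in doc_terms:
            doc_freqs[term] = doc_freqs.get(term, 0) + 1

    inverted: Dict[str, Postings] = {}
    for batch in term_batches(doc_freqs, max_pointers):
        selected = set(batch)
        partial: Dict[str, Postings] = {term: [] for term in batch}
        for docid in sorted(direct):  # one full scan of the direct index
            for term, freq in direct[docid].items():
                if term in selected:
                    partial[term].append((docid, freq))
        inverted.update(partial)  # stands in for writing to the inverted file
    return inverted


direct_index = {0: {"terrier": 2, "index": 1}, 1: {"index": 3}, 2: {"terrier": 1}}
inverted_index = invert(direct_index, max_pointers=2)
```

A smaller budget means more batches and therefore more scans of the direct index, which is the time-for-memory trade-off the property controls.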
Single-pass indexing is implemented by the classes BasicSinglePassIndexer and BlockSinglePassIndexer. Essentially, instead of building a direct file from the collection, term posting lists are held in memory, and written to disk as 'runs' when memory is exhausted. These runs are then merged to form the lexicon and the inverted file. Note that no direct index is created. Indeed, single-pass indexing is much faster than classical two-pass indexing when the direct index is not required. If the direct index is required, then it can be built from the inverted index using the Inverted2DirectIndexBuilder.
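The run-and-merge idea can be sketched in Python as below. This is an in-memory illustration under stated assumptions, not the BasicSinglePassIndexer: runs are kept as Python lists rather than files, and the max_postings argument is a hypothetical stand-in for the "memory is exhausted" condition.

```python
import heapq
from typing import Dict, Iterable, List, Tuple

Postings = List[Tuple[int, int]]   # [(docid, term frequency), ...]
Run = List[Tuple[str, Postings]]   # entries sorted by term


def build_runs(docs: Iterable[Tuple[int, List[str]]], max_postings: int) -> List[Run]:
    """Accumulate postings in memory and emit a sorted run when the budget is hit."""
    runs: List[Run] = []
    memory: Dict[str, Postings] = {}
    held = 0
    for docid, terms in docs:
        counts: Dict[str, int] = {}
        for term in terms:
            counts[term] = counts.get(term, 0) + 1
        for term, tf in counts.items():
            memory.setdefault(term, []).append((docid, tf))
            held += 1
        if held >= max_postings:  # assumed stand-in for exhausted memory
            runs.append(sorted(memory.items()))
            memory, held = {}, 0
    if memory:
        runs.append(sorted(memory.items()))  # flush whatever is left
    return runs


def merge_runs(runs: List[Run]) -> Dict[str, Postings]:
    """Merge the sorted runs term by term into a single inverted index."""
    merged: Dict[str, Postings] = {}
    # heapq.merge streams the runs in term order without concatenating them first.
    for term, postings in heapq.merge(*runs, key=lambda entry: entry[0]):
        merged.setdefault(term, []).extend(postings)
    return merged


collection = [(0, ["a", "b", "a"]), (1, ["b", "c"]), (2, ["a"])]
runs = build_runs(collection, max_postings=3)
index = merge_runs(runs)
```

Because documents are processed in docid order and runs are merged in the order they were written, each merged posting list stays sorted by docid.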
The single-pass indexer can be selected by passing the -i -j command line arguments to TrecTerrier.
The majority of the properties configuring the single-pass indexer are related to memory consumption, and how it decides that memory has been exhausted. Firstly, the indexer will commit a run to disk when free memory falls below the threshold set by memory.reserved (50MB). To ensure that this doesn't happen too soon, 85% of the possible heap must be allocated (controlled by the property memory.heap.usage). This check occurs every 20 documents (docs.checks).
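The flush decision described above can be written out as a small Python predicate. This is a sketch of the stated rule only: the function name should_flush, its parameters, and the precise meaning of "free" and "allocated" memory are assumptions, and the constants simply restate the default values quoted in the text.

```python
RESERVED_BYTES = 50 * 1024 * 1024  # memory.reserved (50MB)
HEAP_USAGE = 0.85                  # memory.heap.usage (85% of the possible heap)
DOCS_CHECK = 20                    # docs.checks (inspect memory every 20 documents)


def should_flush(docs_seen: int, free: int, allocated: int, max_heap: int) -> bool:
    """Decide whether to commit a run to disk (hypothetical helper).

    free, allocated and max_heap are byte counts; how they are measured
    is an assumption of this sketch.
    """
    if docs_seen % DOCS_CHECK != 0:
        return False  # memory is only inspected every DOCS_CHECK documents
    heap_is_grown = allocated >= HEAP_USAGE * max_heap
    return heap_is_grown and free < RESERVED_BYTES


MB = 1024 * 1024
flush_now = should_flush(docs_seen=20, free=10 * MB, allocated=900 * MB, max_heap=1000 * MB)
```

The second condition is what prevents early flushes: low free memory on its own is ignored until the heap has grown close to its maximum size.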
Single-pass indexing is significantly quicker than two-pass indexing. However, there are some configuration points to be aware of. In particular, it makes heavy use of memory to reduce disk IO. For Java 6, we recommend adding the -XX:-UseGCOverheadLimit option to the Java command line. Moreover, for very large indices, many files have to be opened during merging, possibly exhausting the maximum number of allowed open files. Refer to your operating system documentation to increase this limit.
Notably, single-pass indexing does not build a direct index. However, a direct index can be built later using the -id command line argument to TrecTerrier.
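Conceptually, building a direct index from an inverted index is a transposition of the postings. The Python sketch below shows only that idea on in-memory dictionaries; the function name inverted_to_direct and the data layout are assumptions, and the sketch says nothing about how Terrier performs the conversion on disk.

```python
from typing import Dict, List, Tuple

InvertedIndex = Dict[str, List[Tuple[int, int]]]  # term -> [(docid, tf), ...]
DirectIndex = Dict[int, List[Tuple[str, int]]]    # docid -> [(term, tf), ...]


def inverted_to_direct(inverted: InvertedIndex) -> DirectIndex:
    """Regroup postings by document instead of by term."""
    direct: DirectIndex = {}
    for term in sorted(inverted):  # sorted so each document lists terms in order
        for docid, tf in inverted[term]:
            direct.setdefault(docid, []).append((term, tf))
    return direct


inverted = {"index": [(0, 1), (1, 3)], "terrier": [(0, 2), (2, 1)]}
direct = inverted_to_direct(inverted)
```

Each document's entry then lists the terms it contains, which is the per-document view that query expansion relies on.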
For large-scale collections, Terrier provides a MapReduce based indexing system. For more details, please see Hadoop MapReduce Indexing with Terrier.
Terrier also supports the real-time indexing of document collections using MemoryIndex and IncrementalIndex structures, allowing for new documents to be added to the index at later points in time. For more details, please see Real-time Index Structures.
By default, Terrier uses the Elias-Gamma and Elias-Unary algorithms to produce highly compressed direct and inverted indices. Starting with version 4.0, Terrier also supports a variety of state-of-the-art compression schemes, including PForDelta. For more information about configuring the compression used for indexing, see the documentation on compression.
A block is a unit of text in a document. When you index using blocks, you tell Terrier to save positional information with each term. Depending on how Terrier has been configured, a block can be of size 1 or larger. Size 1 means that the exact position of each term can be determined. For size > 1, the block id is incremented after every N terms. You can configure the size of a block using the property blocks.size.
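The relationship between term positions and block ids can be shown with a short Python sketch. It assumes positions and block ids both start at 0 and that each term records a block id at most once per block; the function name block_ids is hypothetical.

```python
from typing import Dict, List


def block_ids(terms: List[str], block_size: int = 1) -> Dict[str, List[int]]:
    """Map each term to the ids of the blocks it occurs in."""
    blocks: Dict[str, List[int]] = {}
    for position, term in enumerate(terms):
        block = position // block_size  # the id advances after every block_size terms
        ids = blocks.setdefault(term, [])
        if not ids or ids[-1] != block:
            ids.append(block)  # record each block only once per term
    return blocks


tokens = ["a", "b", "a", "c"]
exact = block_ids(tokens, block_size=1)
coarse = block_ids(tokens, block_size=2)
```

With a block size of 1 the block ids are exact term positions; with a block size of 2 the first two terms share block 0 and the next two share block 1, so the ordering within a block is lost.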
You can enable block indexing by setting the property block.indexing to true in your terrier.properties file. This ensures that the Indexer used for indexing is the BlockIndexer, not the BasicIndexer (or BlockSinglePassIndexer instead of BasicSinglePassIndexer). When loading an index, Terrier will detect that the index has block information saved and use the appropriate classes for reading the index files.
You can use the positional information when doing retrieval. For instance, you can search for documents matching a phrase, e.g. "Terabyte retriever", or where the words occur near each other, e.g. "indexing blocks"~20.
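To show why positional information makes such queries possible, here is a Python sketch of phrase and proximity checks over the positions of terms in a single document. It is not Terrier's matching code: the function names are hypothetical, and treating proximity as "fewer than N positions apart, in either order" is an assumption rather than the documented meaning of the ~ operator.

```python
from typing import Dict, List

Positions = Dict[str, List[int]]  # term -> sorted positions within one document


def matches_phrase(positions: Positions, phrase: List[str]) -> bool:
    """True if the phrase terms occur at consecutive positions, in order."""
    if not phrase or any(term not in positions for term in phrase):
        return False
    later = [set(positions[term]) for term in phrase[1:]]
    return any(
        all(start + offset + 1 in later[offset] for offset in range(len(later)))
        for start in positions[phrase[0]]
    )


def within_window(positions: Positions, first: str, second: str, window: int) -> bool:
    """True if some occurrences of the two terms are fewer than window positions apart."""
    return any(
        abs(p - q) < window
        for p in positions.get(first, [])
        for q in positions.get(second, [])
    )


tokens = "terabyte retriever supports indexing with positional blocks".split()
doc_positions: Positions = {}
for position, term in enumerate(tokens):
    doc_positions.setdefault(term, []).append(position)

is_phrase = matches_phrase(doc_positions, ["terabyte", "retriever"])
is_near = within_window(doc_positions, "indexing", "blocks", 20)
```

Neither check is possible with term frequencies alone, which is why block indexing has to be enabled at indexing time.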
When you enable the property block.indexing, the indexer used is the BlockIndexer, not the BasicIndexer (if you're using single-pass indexing, it is the BlockSinglePassIndexer, not the BasicSinglePassIndexer that is used). The created DirectIndex and InvertedIndex use a different format, which includes the blockids for each posting, and can be read by BlockDirectIndex and BlockInvertedIndex, respectively. During two-pass indexing, BlockLexicons are created to keep track of how many blocks are in use for a term. However, at the last stage of rewriting the lexicon at the end of inverted indexing, the BlockLexicon is rewritten as a normal Lexicon, as the block information can be guessed during retrieval.
Copyright © 2014 University of Glasgow | All Rights Reserved