Terrier Components

This page gives an overview of Terrier's main components and how they interact.

Component Interaction

The graphic below gives an overview of the interaction of the main components involved in the indexing process.

A corpus is represented by a Collection object. Raw text data is represented by a Document object.
The Indexer is responsible for managing the indexing process. It iterates through the documents of the collection and sends each term it finds through a TermPipeline component.
A TermPipeline can transform terms or remove terms that should not be indexed. An example of a TermPipeline chain is termpipelines=Stopwords,PorterStemmer, which first removes stopwords from the document using the Stopwords object, and then applies Porter's stemming algorithm for English to the remaining terms (PorterStemmer).
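
The chaining behaviour described above can be illustrated with a minimal, standalone Python sketch. It does not use Terrier's Java API: the stage functions, the stopword list, and `run_pipeline` are hypothetical names, and `toy_stemmer` is a crude suffix stripper standing in for Porter's algorithm.

```python
from typing import Callable, Iterable, List, Optional

# A stage maps a term to a (possibly transformed) term, or to None to drop it.
Stage = Callable[[str], Optional[str]]

# Hypothetical, tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "of", "and", "is"}


def stopwords(term: str) -> Optional[str]:
    """Drop the term if it is a stopword, otherwise pass it through."""
    return None if term in STOPWORDS else term


def toy_stemmer(term: str) -> Optional[str]:
    """Strip a few common suffixes. This is NOT Porter's algorithm."""
    for suffix in ("ing", "ed", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def run_pipeline(terms: Iterable[str], stages: List[Stage]) -> List[str]:
    """Send each term through the stages in order; stop early if it is dropped."""
    kept = []
    for term in terms:
        current: Optional[str] = term
        for stage in stages:
            current = stage(current)
            if current is None:
                break
        if current is not None:
            kept.append(current)
    return kept


# Mirrors the stage order in termpipelines=Stopwords,PorterStemmer
pipeline = [stopwords, toy_stemmer]
processed = run_pipeline(["the", "indexing", "of", "documents"], pipeline)
print(processed)
```

The order of the stages matters: removing stopwords before stemming means the stopword list is matched against unstemmed terms.
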
Once terms have been processed through the TermPipeline, they are aggregated and the following data structures are created by their corresponding DocumentBuilders: DirectIndex, DocumentIndex, Lexicon, and InvertedIndex.
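
To make the relationship between the four structures more concrete, here is a small in-memory Python sketch that aggregates already-processed terms. It is only an illustration under simplifying assumptions: the field names are made up, nothing is compressed or written to disk, and a list position stands in for the pointer from the document index into the direct index.

```python
from collections import Counter, defaultdict

# Two tiny documents whose terms have already passed through the term pipeline.
docs = {
    "doc1": ["index", "term", "index"],
    "doc2": ["term", "query"],
}

lexicon = {}                        # term -> {"id", "doc_freq", "term_freq"}
document_index = []                 # docid -> {"docno", "length"}
direct_index = []                   # docid -> [(termid, tf), ...]
inverted_index = defaultdict(list)  # termid -> [(docid, tf), ...]

for docid, (docno, terms) in enumerate(docs.items()):
    counts = Counter(terms)
    document_index.append({"docno": docno, "length": len(terms)})
    postings = []
    for term, tf in counts.items():
        # New terms receive the next free term id.
        entry = lexicon.setdefault(
            term, {"id": len(lexicon), "doc_freq": 0, "term_freq": 0}
        )
        entry["doc_freq"] += 1
        entry["term_freq"] += tf
        postings.append((entry["id"], tf))
        inverted_index[entry["id"]].append((docid, tf))
    direct_index.append(sorted(postings))

print(lexicon["term"])
print(inverted_index[lexicon["term"]["id"]])
```

The direct index is keyed by document and the inverted index by term, so the same (document, term, frequency) information is stored twice, organised for two different access patterns.
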
For single-pass indexing, the structures are written in a different order. Inverted file postings are built in memory, and committed to 'runs' when memory is exhausted. Once the collection has been indexed, all runs are merged to form the inverted index and the lexicon.
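
The run-and-merge idea can be sketched in a few lines of Python. This is a simplified illustration, not Terrier's implementation: runs are kept as in-memory lists rather than files, "memory exhausted" is approximated by a hypothetical `max_postings_in_memory` budget, and flushing only happens at document boundaries so that a document's postings never span two runs.

```python
import heapq
from collections import defaultdict
from itertools import groupby


def index_single_pass(docs, max_postings_in_memory):
    """Build postings in memory, flush sorted runs when the budget is hit, then merge."""
    runs = []                    # each run: sorted list of (term, docid, tf)
    memory = defaultdict(dict)   # term -> {docid: tf}
    held = 0                     # number of postings currently held in memory

    def flush():
        nonlocal held
        if memory:
            run = sorted(
                (term, docid, tf)
                for term, plist in memory.items()
                for docid, tf in plist.items()
            )
            runs.append(run)
            memory.clear()
            held = 0

    for docid, terms in enumerate(docs):
        for term in terms:
            if docid not in memory[term]:
                memory[term][docid] = 0
                held += 1
            memory[term][docid] += 1
        if held >= max_postings_in_memory:
            flush()              # budget reached: commit a run
    flush()                      # commit whatever is left

    # Merge the sorted runs; the inverted index and lexicon come from the merged stream.
    inverted, lexicon = {}, {}
    for term, group in groupby(heapq.merge(*runs), key=lambda p: p[0]):
        postings = [(docid, tf) for _, docid, tf in group]
        inverted[term] = postings
        lexicon[term] = {
            "doc_freq": len(postings),
            "term_freq": sum(tf for _, tf in postings),
        }
    return inverted, lexicon, len(runs)


docs = [["a", "b", "a"], ["b", "c"], ["a", "c"]]
inverted, lexicon, num_runs = index_single_pass(docs, max_postings_in_memory=2)
print(num_runs, inverted)
```

Because every run is sorted by term and then by document, a streaming merge is enough to produce complete posting lists without holding the whole collection in memory.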

The graphic below gives an overview of the interaction of Terrier's components in the retrieval phase.

An application, such as the Desktop Terrier or TrecTerrier applications, issues a query to the Terrier framework.
First, the query is parsed and a Query object is instantiated.
The query is then handed to the Manager component. The Manager first pre-processes the query by passing it through the configured TermPipeline.
After pre-processing, the query is handed to the Matching component. The Matching component is responsible for initialising the appropriate WeightingModel, TermScoreModifiers, and DocumentScoreModifiers. Once all these components have been instantiated, document scores are computed with respect to the query.
PostProcessing and PostFiltering then take place. In PostProcessing, the ResultSet can be altered in any way; for example, QueryExpansion expands the query and then calls Matching again to generate an improved ranking of documents. PostFiltering is simpler: it only allows documents to be included or excluded, which is ideal for interactive applications where users want to restrict the domain of the documents being retrieved.
Finally, the ResultSet is returned to the application.
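
The retrieval steps above can be summarised in a short, standalone Python sketch. All names here are hypothetical and none of this is Terrier's Java API: the toy index is hard-coded, `tf_idf` is a simple stand-in for a weighting model, score modifiers are left out, and the post-processing step (such as query expansion) is omitted for brevity.

```python
import math

# Toy inverted index: term -> [(docid, tf)]; purely illustrative data.
INVERTED = {
    "index": [(0, 2), (2, 1)],
    "query": [(1, 1), (2, 3)],
}
NUM_DOCS = 3
STOPWORDS = {"the", "a", "of"}


def pre_process(query):
    """Manager step: split into terms and apply a term pipeline (stopwords only here)."""
    return [t for t in query.lower().split() if t not in STOPWORDS]


def tf_idf(tf, doc_freq):
    """A simple stand-in weighting model."""
    return tf * math.log(1 + NUM_DOCS / doc_freq)


def match(terms, weighting_model):
    """Matching step: score every document that contains at least one query term."""
    scores = {}
    for term in terms:
        postings = INVERTED.get(term, [])
        for docid, tf in postings:
            scores[docid] = scores.get(docid, 0.0) + weighting_model(tf, len(postings))
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def post_filter(result_set, allowed_docids):
    """Post-filtering step: each document is simply kept or excluded."""
    return [(docid, score) for docid, score in result_set if docid in allowed_docids]


def retrieve(query, allowed_docids):
    terms = pre_process(query)
    result_set = match(terms, tf_idf)
    return post_filter(result_set, allowed_docids)


results = retrieve("the index of query", allowed_docids={0, 2})
print(results)
```

Passing the weighting model in as a parameter reflects the idea that matching is independent of the particular retrieval model used to score documents.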

Here we provide a listing and brief description of Terrier's components.

Name	Description
Collection	This component encapsulates the most fundamental concept in indexing with Terrier: a Collection, i.e. a set of documents. See uk.ac.gla.terrier.indexing.Collection for more details.
Document	This component encapsulates the concept of a document. It is essentially an Iterator over terms in a document. See uk.ac.gla.terrier.indexing.Document for more details.
TermPipeline	Models the concept of a component in a pipeline of term processors. Classes that implement this interface could be stemming algorithms, stopword removers, or acronym expanders, to mention just a few examples. See uk.ac.gla.terrier.terms.TermPipeline for more details.
Indexer	The component responsible for managing the indexing process. It instantiates TermPipelines and Builders. See uk.ac.gla.terrier.indexing.Indexer for more details.
Builders	Builders are responsible for writing an index to disk. See the uk.ac.gla.terrier.structures.indexing package for more details.

Name	Description
BitFile	A highly compressed I/O layer using gamma and unary encodings. See uk.ac.gla.terrier.compression.BitFile for more details.
Direct Index	The direct index stores the identifiers of terms that appear in each document and the corresponding frequencies. It is used for automatic query expansion, but can also be used for user profiling activities. See uk.ac.gla.terrier.structures.DirectIndex for more details.
Document Index	The document index stores information about each document, for example its length and identifier, together with a pointer to the corresponding entry in the direct index. See uk.ac.gla.terrier.structures.DocumentIndex for more details.
Inverted Index	The inverted index stores the posting lists, i.e. the identifiers of the documents and their corresponding term frequencies. It can also store the positions of terms within a document. See uk.ac.gla.terrier.structures.InvertedIndex for more details.
Lexicon	The lexicon stores the collection vocabulary and the corresponding document and term frequencies. See uk.ac.gla.terrier.structures.Lexicon for more details.
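
As background for the gamma and unary encodings mentioned for BitFile, the following Python sketch shows one common textbook formulation of both codes, using strings of '0' and '1' for readability. It is an assumption-laden illustration: the exact bit conventions used by Terrier's BitFile may differ, and the function names are hypothetical.

```python
def unary_encode(n: int) -> str:
    """Unary code for n >= 1: (n - 1) one-bits followed by a terminating zero."""
    if n < 1:
        raise ValueError("unary code is defined here for n >= 1")
    return "1" * (n - 1) + "0"


def gamma_encode(n: int) -> str:
    """Elias gamma code for n >= 1: unary(bit length), then the bits below the leading one."""
    if n < 1:
        raise ValueError("gamma code is defined for n >= 1")
    binary = bin(n)[2:]
    return unary_encode(len(binary)) + binary[1:]


def gamma_decode(bits: str, pos: int = 0):
    """Decode one gamma-coded integer starting at pos; return (value, next position)."""
    length = 1
    while bits[pos] == "1":
        length += 1
        pos += 1
    pos += 1  # skip the terminating zero of the unary part
    value = int("1" + bits[pos:pos + length - 1], 2)
    return value, pos + length - 1


# Encode a short sequence of small integers and decode it again.
encoded = "".join(gamma_encode(n) for n in [3, 4, 1])
decoded, pos = [], 0
while pos < len(encoded):
    value, pos = gamma_decode(encoded, pos)
    decoded.append(value)
print(encoded, decoded)
```

Both codes give short codewords to small numbers, which is why they suit index data such as term frequencies, where small values dominate.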

Name	Description
Manager	This component is responsible for handling/coordinating the main high-level operations of a query. These are: pre-processing (term pipeline, control finding, term aggregation), matching, post-processing, and post-filtering. See uk.ac.gla.terrier.querying.Manager for more details.
Matching	The matching component is responsible for determining which documents match a specific query and for scoring documents with respect to a query. See uk.ac.gla.terrier.matching.Matching for more details.
Query	The query component models a parsed query; a Query object is instantiated when the query issued by the application is parsed. See uk.ac.gla.terrier.querying.parser.Query for more details.
Weighting Model	The Weighting model represents the retrieval model that is used to weight the terms of a document. See uk.ac.gla.terrier.matching.models.WeightingModel for more details.
Document Score Modifiers	Responsible for query-dependent modification of document scores. See the uk.ac.gla.terrier.matching.dsms package for more details.
Term Score Modifiers	Modifies the scores of the documents for a given set of pointers, or postings. See the uk.ac.gla.terrier.matching.tsms package for more details.

Name	Description
Trec Terrier	An application that enables indexing and querying of TREC collections. See TrecTerrier for more details.
Desktop Terrier	An application that allows for indexing and retrieval of local user content. See the uk.ac.gla.terrier.applications.desktop package for more details.