Terrier Features

Below, you can find a succinct list of features offered by Terrier.

Indexing support for common desktop file formats, and for commonly used TREC research collections (e.g. TREC CDs 1-5, WT2G, WT10G, GOV, GOV2, Blogs06, Blogs08, ClueWeb09).
Many document weighting models, including parameter-free Divergence from Randomness models, Okapi BM25 and language modelling.
Support for a conventional query language, including phrases and terms occurring in specific tags.
Full-text indexing of large-scale document collections: at least 50 million documents in a centralised architecture, and even larger collections using the Hadoop MapReduce distributed indexing scheme.
Modular and open indexing and querying APIs, to allow easy extension for your own applications and research.
Active Information Retrieval research fed into the Open Source platform.
Open Source (Mozilla Public Licence).
Written in cross-platform Java - works on Windows, Mac OS X, Linux and Unix.
Large user base, built up over 7 years of public releases.
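
To make the weighting-model feature in the list above more concrete, here is a minimal sketch of Okapi BM25 for a single query term and a single document. It uses a common textbook form of the formula and is not taken from Terrier's source code; the class name `Bm25Sketch`, the method signature, and the values k1 = 1.2 and b = 0.75 are assumptions chosen for illustration.

```java
final class Bm25Sketch {

    /**
     * Scores one query term against one document using a textbook BM25 variant.
     * All names and default parameter values here are illustrative assumptions.
     *
     * @param tf           frequency of the term in the document
     * @param docLength    length of the document in tokens
     * @param avgDocLength average document length in the collection
     * @param numDocs      number of documents in the collection
     * @param docFreq      number of documents containing the term
     * @param k1           term-frequency saturation parameter (often 1.2)
     * @param b            length-normalisation parameter (often 0.75)
     */
    static double score(double tf, double docLength, double avgDocLength,
                        long numDocs, long docFreq, double k1, double b) {
        // Smoothed inverse document frequency; the 1.0 inside the log keeps it non-negative.
        double idf = Math.log(1.0 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
        // Length normalisation: b = 0 ignores document length, b = 1 normalises fully.
        double norm = k1 * (1.0 - b + b * docLength / avgDocLength);
        // Term-frequency component saturates as tf grows.
        return idf * (tf * (k1 + 1.0)) / (tf + norm);
    }

    public static void main(String[] args) {
        // A term occurring 3 times in an average-length document, found in 10 of 1000 documents.
        System.out.println(score(3, 100, 100, 1000, 10, 1.2, 0.75));
    }
}
```

A full ranking function would sum this per-term score over all query terms. The parameter-free DFR models mentioned above differ precisely in that they do not require values such as k1 and b to be tuned.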

Out-of-the-box indexing of tagged document collections, such as the TREC test collections.
Out-of-the-box indexing for documents of various formats, such as HTML, PDF, or Microsoft Word, Excel and PowerPoint files.
Out-of-the-box support for distributed indexing in a Hadoop MapReduce setting.
Indexing of field information, such as the frequency of a term in a TITLE or H1 HTML tag.
Indexing of position information at the word level or the block level (e.g. a window of terms within a given distance).
Support for various encodings of documents (UTF), to facilitate multi-lingual retrieval.
Support for changing the tokenisation being used.
Indexing support for query-biased summarisation.
Support for fetching files to index by HTTP, allowing intranets to be easily searched.
Highly compressed index disk data structures.
Highly compressed direct file for efficient query expansion.
Alternative, faster single-pass and MapReduce-based indexing.
Various stemming techniques supported, including the Snowball stemmer for European languages.
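
The list above mentions highly compressed index structures without describing the on-disk format. As a generic illustration only, the sketch below shows one widely used way to compress a posting list: storing the gaps between sorted document ids and writing each gap with a variable-byte code. This is an assumption made for the example, not a description of Terrier's actual encoding, and the class name `GapVByteSketch` is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

final class GapVByteSketch {

    /** Encodes non-negative, strictly increasing document ids as variable-byte coded gaps. */
    static byte[] encode(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int docId : sortedDocIds) {
            int gap = docId - previous;   // store the difference, not the absolute id
            previous = docId;
            // Emit 7 bits per byte, low-order group first; a set high bit means "more bytes follow".
            while (gap >= 128) {
                out.write((gap & 0x7F) | 0x80);
                gap >>>= 7;
            }
            out.write(gap);
        }
        return out.toByteArray();
    }

    /** Reverses encode(): rebuilds each gap, then sums the gaps back into document ids. */
    static int[] decode(byte[] bytes) {
        List<Integer> ids = new ArrayList<>();
        int previous = 0;
        int gap = 0;
        int shift = 0;
        for (byte b : bytes) {
            gap |= (b & 0x7F) << shift;
            if ((b & 0x80) != 0) {
                shift += 7;               // continuation byte: keep accumulating
            } else {
                previous += gap;          // last byte of this gap: recover the id
                ids.add(previous);
                gap = 0;
                shift = 0;
            }
        }
        int[] result = new int[ids.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = ids.get(i);
        }
        return result;
    }

    public static void main(String[] args) {
        int[] ids = {3, 7, 300, 301};
        byte[] encoded = encode(ids);
        System.out.println(encoded.length + " bytes instead of " + (4 * ids.length));
    }
}
```

Because frequent terms occur in many documents, their gaps are small and mostly fit in a single byte, which is where the saving comes from.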

Provides desktop, command-line and Web-based querying interfaces.
Provides standard querying facilities, as well as Query Expansion (pseudo-relevance feedback).
Can be applied in interactive applications, such as the included Desktop Search, or in a batch setting for research and experimentation.
Provides many standard document weighting models, including up to 126 Divergence From Randomness (DFR) document ranking models, and other models such as Okapi BM25, language modelling and TF-IDF. Two new second-generation DFR weighting models, JsKLs and XSqrA_M, are also included; these provide robust performance on a range of test collections without the need for any parameter tuning or training.
Advanced query language that supports synonyms, +/- operators, phrase and proximity search, and fields.
Provides a number of parameter-free DFR term weighting models for automatic query expansion, in addition to Rocchio's query expansion.
Flexible processing of terms through a pipeline of components, such as stopword removers and stemmers.
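
The idea of processing terms through a pipeline of components can be sketched as follows. This is a self-contained illustration and does not use Terrier's API: the `Stage` interface, the `TermPipelineSketch` class and the toy suffix stripper are all hypothetical stand-ins, with the stripper taking the place of a real stemmer such as Snowball.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;
import java.util.function.Function;

final class TermPipelineSketch {

    /** A stage maps a term to a (possibly rewritten) term, or to empty to drop it. */
    interface Stage extends Function<String, Optional<String>> {}

    static Stage lowercase() {
        return t -> Optional.of(t.toLowerCase(Locale.ROOT));
    }

    static Stage stopwords(Set<String> stop) {
        return t -> stop.contains(t) ? Optional.empty() : Optional.of(t);
    }

    /** Toy suffix stripper standing in for a real stemmer; it only removes a trailing "s". */
    static Stage toyStemmer() {
        return t -> Optional.of(
                t.length() > 3 && t.endsWith("s") ? t.substring(0, t.length() - 1) : t);
    }

    /** Sends every term through the stages in order, keeping the terms that survive. */
    static List<String> run(List<String> terms, List<Stage> stages) {
        List<String> out = new ArrayList<>();
        for (String term : terms) {
            Optional<String> current = Optional.of(term);
            for (Stage stage : stages) {
                current = current.flatMap(stage);   // later stages are skipped once a term is dropped
            }
            current.ifPresent(out::add);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Stage> stages = Arrays.asList(
                lowercase(), stopwords(Collections.singleton("the")), toyStemmer());
        System.out.println(run(Arrays.asList("The", "Terriers", "index", "documents"), stages));
    }
}
```

Keeping each component behind one small interface means stages can be reordered, removed or replaced without touching the code that drives the pipeline, and the same chain can be applied to both documents and queries.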

Handles all currently available TREC test collections - see TREC Experimentation Examples for examples and known settings.
Easily scriptable to evaluate many parameter settings, or many weighting models in batch form.
Built-in evaluation tools for use with TREC ad-hoc and known-item search retrieval results, to produce various Precision and Recall measures.
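
For readers unfamiliar with the measures such evaluation tools report, the sketch below computes a few standard ones for a single query: precision at a rank cutoff, recall, average precision (typical for ad-hoc retrieval) and reciprocal rank (typical for known-item search). It is an independent illustration, not Terrier's evaluation code; the class name `EvalSketch` and the use of string document ids are assumptions, and the ranking is assumed to contain no duplicate ids.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class EvalSketch {

    /** Precision at rank k: fraction of the top k positions holding a relevant document. */
    static double precisionAt(int k, List<String> ranking, Set<String> relevant) {
        if (k <= 0) return 0.0;
        int hits = 0;
        int cutoff = Math.min(k, ranking.size());
        for (int i = 0; i < cutoff; i++) {
            if (relevant.contains(ranking.get(i))) hits++;
        }
        return (double) hits / k;
    }

    /** Recall: fraction of all relevant documents that appear anywhere in the ranking. */
    static double recall(List<String> ranking, Set<String> relevant) {
        if (relevant.isEmpty()) return 0.0;
        int hits = 0;
        for (String doc : ranking) {
            if (relevant.contains(doc)) hits++;
        }
        return (double) hits / relevant.size();
    }

    /** Average precision: mean of the precision values at each rank where a relevant document occurs. */
    static double averagePrecision(List<String> ranking, Set<String> relevant) {
        if (relevant.isEmpty()) return 0.0;
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < ranking.size(); i++) {
            if (relevant.contains(ranking.get(i))) {
                hits++;
                sum += (double) hits / (i + 1);
            }
        }
        return sum / relevant.size();   // unretrieved relevant documents contribute zero
    }

    /** Reciprocal rank: 1 / rank of the first relevant document, or 0 if none is retrieved. */
    static double reciprocalRank(List<String> ranking, Set<String> relevant) {
        for (int i = 0; i < ranking.size(); i++) {
            if (relevant.contains(ranking.get(i))) return 1.0 / (i + 1);
        }
        return 0.0;
    }

    public static void main(String[] args) {
        List<String> ranking = Arrays.asList("d1", "d2", "d3", "d4");
        Set<String> relevant = new HashSet<>(Arrays.asList("d1", "d3", "d9"));
        System.out.println(precisionAt(2, ranking, relevant));
        System.out.println(averagePrecision(ranking, relevant));
    }
}
```

Averaging these per-query values over a whole topic set gives collection-level figures such as mean average precision and mean reciprocal rank.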