Terrier Features

Below, you can find a succinct list of features offered by Terrier.

Indexing support for common desktop file formats, and for commonly used TREC research collections (e.g. TREC CDs 1-5, WT2G, WT10G, GOV, GOV2, Blogs06).
Many document weighting models, including parameter-free Divergence from Randomness weighting models, Okapi BM25 and language modelling.
Supports a conventional query language, including phrases and terms occurring in tags.
Full-text indexing of large-scale document collections: at least 25 million documents in a centralised architecture, and even larger collections using the Hadoop MapReduce distributed indexing scheme.
Modular and open indexing and querying APIs, to allow easy extension for your own applications and research.
Active Information Retrieval research feeds into the Open Source platform.
Open Source (Mozilla Public Licence).
Written in cross-platform Java - works on Windows, Mac OS X, Linux and Unix.
Large user base, built up over 4 years of public release.

Out-of-the-box indexing of tagged document collections, such as the TREC test collections.
Out-of-the-box indexing for documents of various formats, such as HTML, PDF, or Microsoft Word, Excel and PowerPoint files.
Out-of-the-box support for distributed indexing in a Hadoop MapReduce setting.
Indexing of field information, such as the content of TITLE and H1 HTML tags.
Indexing of position information at the word level or the block level (e.g. a window of terms within a given distance).
Support for various encodings of documents (UTF), to facilitate multi-lingual retrieval.
Support for fetching files to index by HTTP, allowing intranets to be easily searched.
Highly compressed index disk data structures.
Highly compressed direct file for efficient query expansion.
An alternative, faster single-pass indexing method.
Various stemming techniques supported, including the Snowball stemmer for European languages.
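
The word-level and block-level position indexing listed above can be illustrated with a small sketch. This is not Terrier's indexer: the class name, method name and the fixed-size block scheme are assumptions made purely for illustration. Each term is mapped to the ids of the blocks it occurs in, where a block id is the token position divided by a chosen block size.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: hypothetical names, not Terrier's indexing classes.
final class BlockPositionSketch {

    // Maps each term to the ids of the blocks in which it occurs.
    // blockSize = 1 records exact word positions; larger values group
    // consecutive tokens into fixed-size windows.
    static Map<String, List<Integer>> blockPostings(List<String> tokens, int blockSize) {
        if (blockSize < 1) {
            throw new IllegalArgumentException("blockSize must be >= 1");
        }
        Map<String, List<Integer>> postings = new LinkedHashMap<>();
        for (int position = 0; position < tokens.size(); position++) {
            int blockId = position / blockSize;
            List<Integer> blocks =
                postings.computeIfAbsent(tokens.get(position), t -> new ArrayList<>());
            // Record each block once per term, even if the term repeats inside it.
            if (blocks.isEmpty() || blocks.get(blocks.size() - 1) != blockId) {
                blocks.add(blockId);
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList(
            "open", "source", "search", "engine", "for", "open", "research");
        System.out.println(blockPostings(doc, 1)); // word-level positions
        System.out.println(blockPostings(doc, 3)); // blocks of three tokens
    }
}
```

With a block size of 1 the term "open" is recorded at positions 0 and 5; with a block size of 3 it is recorded in blocks 0 and 1. Larger blocks trade positional precision for a smaller index.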

Provides standard querying facilities, as well as Query Expansion (pseudo-relevance feedback).
Can be applied in interactive applications, such as the included Desktop Search, or in a batch setting for research & experimentation.
Provides many standard document weighting models, including up to 126 Divergence From Randomness (DFR) document ranking models, and other models such as Okapi BM25, language modelling and TF-IDF. The new DFRee DFR weighting model is also included, which provides robust performance on a range of test collections without the need for any parameter tuning or training.
Advanced query language that supports Boolean operators, +/- operators, phrase and proximity search, and fields.
Provides a number of parameter-free DFR term weighting models for automatic query expansion, in addition to Rocchio's query expansion.
Flexible processing of terms through a pipeline of components, such as stop-words removers and stemmers.
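
The idea of processing terms through a pipeline of components can be sketched as follows. This is a minimal illustration and not Terrier's own term pipeline API: the `Stage` interface, the stage factories and the toy suffix stripper (standing in for a real stemmer such as Snowball) are all hypothetical. Each stage either transforms a term or drops it, and the stages are applied in order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;
import java.util.function.Function;

// Illustrative only: hypothetical names, not Terrier's term pipeline classes.
final class TermPipelineSketch {

    // A stage returns the (possibly rewritten) term, or empty to drop it.
    interface Stage extends Function<String, Optional<String>> {}

    static Stage lowercase() {
        return term -> Optional.of(term.toLowerCase(Locale.ROOT));
    }

    static Stage stopwords(Set<String> stopwords) {
        return term -> stopwords.contains(term) ? Optional.empty() : Optional.of(term);
    }

    // Toy suffix stripper standing in for a real stemmer.
    static Stage toyStemmer() {
        return term -> {
            if (term.length() > 4 && term.endsWith("ing")) {
                return Optional.of(term.substring(0, term.length() - 3));
            }
            if (term.length() > 3 && term.endsWith("s")) {
                return Optional.of(term.substring(0, term.length() - 1));
            }
            return Optional.of(term);
        };
    }

    // Pushes every term through the stages in order; dropped terms are skipped.
    static List<String> run(List<String> terms, List<Stage> stages) {
        List<String> out = new ArrayList<>();
        for (String term : terms) {
            Optional<String> current = Optional.of(term);
            for (Stage stage : stages) {
                current = current.flatMap(stage);
                if (!current.isPresent()) {
                    break;
                }
            }
            current.ifPresent(out::add);
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the", "of"));
        List<Stage> stages = Arrays.asList(lowercase(), stopwords(stop), toyStemmer());
        System.out.println(run(Arrays.asList("The", "indexing", "of", "documents"), stages));
    }
}
```

The order of the stages matters: lowercasing runs before stop-word removal so that "The" matches the stop-word "the". The same pipeline would normally be applied to both documents and queries so that their terms stay comparable.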

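Of the weighting models named in the retrieval features, Okapi BM25 is compact enough to sketch. The code below is a textbook form of BM25 and is not taken from Terrier's implementation, which may differ in its IDF formulation and parameters. The class and method names are hypothetical, `K1 = 1.2` and `B = 0.75` are commonly used defaults, and the IDF uses the non-negative `log(1 + ...)` variant.

```java
// Textbook BM25 for illustration; not Terrier's implementation.
final class Bm25Sketch {

    static final double K1 = 1.2;  // term-frequency saturation (common default)
    static final double B = 0.75;  // document-length normalisation (common default)

    // Non-negative IDF variant: rarer terms receive a higher weight.
    static double idf(long numDocs, long docFreq) {
        return Math.log(1.0 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Score contribution of one query term for one document.
    static double termScore(double tf, double docLength, double avgDocLength,
                            long numDocs, long docFreq) {
        if (tf <= 0) {
            return 0.0;
        }
        // Longer-than-average documents are penalised through the normaliser.
        double norm = K1 * (1.0 - B + B * docLength / avgDocLength);
        return idf(numDocs, docFreq) * (tf * (K1 + 1.0)) / (tf + norm);
    }

    public static void main(String[] args) {
        // Hypothetical statistics: 1000 documents, the term occurs in 10 of them.
        System.out.println(termScore(1, 100, 100, 1000, 10));
        System.out.println(termScore(3, 100, 100, 1000, 10));
        System.out.println(termScore(3, 200, 100, 1000, 10));
    }
}
```

A document's score for a query is the sum of `termScore` over the query terms. Unlike the parameter-free DFR models mentioned above, BM25 depends on the choice of `K1` and `B`, which are often tuned per collection.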
Handles all currently available TREC test collections; see TREC Experimentation Examples for worked examples and known settings.
Easily scriptable to evaluate many parameter settings, or many weighting models in batch form.
In-built evaluation tools for use with TREC ad-hoc and known-item search retrieval results, to produce various Precision and Recall measures.
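
To make the evaluation measures concrete, here is a minimal sketch of precision, recall, average precision and reciprocal rank for a single query. It is not Terrier's evaluation code: the class and method names are hypothetical, and it assumes the ranking contains no duplicate document ids.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: hypothetical names, not Terrier's evaluation classes.
// Assumes each document id appears at most once in a ranking.
final class EvalSketch {

    // Fraction of the top k results that are relevant.
    static double precisionAt(List<String> ranking, Set<String> relevant, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        int hits = 0;
        for (int i = 0; i < Math.min(k, ranking.size()); i++) {
            if (relevant.contains(ranking.get(i))) {
                hits++;
            }
        }
        return (double) hits / k;
    }

    // Fraction of all relevant documents that were retrieved.
    static double recall(List<String> ranking, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        for (String doc : ranking) {
            if (relevant.contains(doc)) {
                hits++;
            }
        }
        return (double) hits / relevant.size();
    }

    // Mean of the precision values at each rank where a relevant document
    // appears, divided by the total number of relevant documents.
    static double averagePrecision(List<String> ranking, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < ranking.size(); i++) {
            if (relevant.contains(ranking.get(i))) {
                hits++;
                sum += (double) hits / (i + 1);
            }
        }
        return sum / relevant.size();
    }

    // 1 / rank of the first relevant document; typical for known-item search.
    static double reciprocalRank(List<String> ranking, Set<String> relevant) {
        for (int i = 0; i < ranking.size(); i++) {
            if (relevant.contains(ranking.get(i))) {
                return 1.0 / (i + 1);
            }
        }
        return 0.0;
    }

    public static void main(String[] args) {
        List<String> ranking = Arrays.asList("d1", "d2", "d3", "d4", "d5");
        Set<String> relevant = new HashSet<>(Arrays.asList("d1", "d3", "d6"));
        System.out.println(precisionAt(ranking, relevant, 5));
        System.out.println(recall(ranking, relevant));
        System.out.println(averagePrecision(ranking, relevant));
        System.out.println(reciprocalRank(ranking, relevant));
    }
}
```

In this example two of the three relevant documents are retrieved, at ranks 1 and 3, so precision at 5 is 2/5, recall is 2/3 and average precision is (1/1 + 2/3) / 3. Averaging these per-query values over a set of topics gives the figures usually reported for a run.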