[
Contents] [
Next: What's new]
Terrier Features
Below, you can find a succinct list of features offered by Terrier.
General
- Indexing support for common desktop file formats, and for commonly used TREC research collections (e.g. TREC CDs 1-5, WT2G, WT10G, GOV, GOV2, Blogs06, Blog08, ClueWeb09).
- Many document weighting models, such as many parameter-free Divergence from Randomness weighting models, Okapi BM25 and language modelling.
- Conventional query language supported, including phrases, and terms occurring in tags.
- Handling full-text indexing of large-scale document collections, in a centralised architecture to at least 50 million documents,
and using the Hadoop MapReduce distributed indexing scheme for even larger collections.
- Modular and open indexing and querying APIs, to allow easy extension for your own applications and research.
- Active Information Retrieval research fed into the Open Source platform.
- Open Source (Mozilla Public Licence).
- Written in cross-platform Java - works on Windows, Mac OS X, Linux and Unix.
- Large user-base over 7 years of public release.
Indexing
- Out-of-the box indexing of tagged document collections, such as the TREC test collections.
- Out-of-the box indexing for documents
of various formats, such as HTML, PDF, or Microsoft Word,
Excel and PowerPoint files.
- Out-of-the box support for distributed indexing in a Hadoop MapReduce setting.
- Indexing of field information, such as the frequency of a term in a TITLE or H1 HTML tag.
- Indexing of position information on a word, or a block (e.g. a window of terms within a distance) level.
- Support for various encodings of documents (UTF), to facilitate multi-lingual retrieval.
- Support for changing the tokenisation being used.
- Indexing support for query-biased summarisation.
- Support for fetching files to index by HTTP, allowing intranets to be easily searched.
- Highly compressed index disk data structures.
- Highly compressed direct file for efficient query expansion.
- Alternative faster single-pass and MapReduce based indexing.
- Various stemming techniques supported, including the Snowball stemmer for European languages.
Retrieval
- Provides desktop, command-line and Web based querying interfaces.
- Provides standard querying facilities, as well as Query Expansion (pseudo-relevance feedback).
- Can be applied in interactive applications, such as the included Desktop Search, or in
a batch setting for research and experimentation.
- Provides many standard document weighting models, including up to 126 Divergence From Randomness (DFR) document ranking models, and other models such as Okapi BM25, language modelling and TF-IDF. Two new 2nd generation DFR weighting model, JsKLs and XSqrA_M, are also included, which provide robust performance on a range of test collections without the need for any parameter tuning or training.
- Advanced query language that supports synonyms, +/- operators, phrase and proximity search, and fields.
- Provides a number of parameter-free DFR term weighting models for automatic query expansion, in addition to Rocchio's query expansion.
- Flexible processing of terms through a pipeline of components, such as stopword removers and stemmers.
Experimentation
- Handles all currently available TREC test collections - see TREC Experimentation Examples for examples and known settings.
- Easily scriptable to evaluate many parameter settings, or many weighting models in batch form.
- Built-in evaluation tools for use with TREC ad-hoc and known-item search
retrieval results, to produce various Precision and Recall measures.
[
Contents] [
Next: What's new]