Developing Applications with Terrier

Terrier provides APIs for indexing and querying your data.

Indexing

Unless your data is in files (1 file per document), you will probably need to create your own collection decoder. This is done by implementing the Collection interface (uk.ac.gla.terrier.indexing.Collection), and writing your own indexing application. (See the classes uk.ac.gla.terrier.applications.TRECIndexing, or uk.ac.gla.terrier.applications.desktop.DesktopTerrier).

If your documents are in a non-standard format, we advise you to create your own Document implementation as well. You'll need to implement the interface uk.ac.gla.terrier.indexing.Document.
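The shape of a collection decoder can be sketched as follows. This is a minimal, self-contained illustration: SketchCollection and SketchDocument are hypothetical stand-ins defined here so the example compiles on its own, and they only loosely mirror the idea of Terrier's Collection and Document interfaces. The real method signatures should be taken from the Terrier Javadoc.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in, NOT the real uk.ac.gla.terrier.indexing.Document. */
interface SketchDocument {
    String getNextTerm();
    boolean endOfDocument();
}

/** Hypothetical stand-in, NOT the real uk.ac.gla.terrier.indexing.Collection. */
interface SketchCollection {
    boolean nextDocument();
    SketchDocument getDocument();
    String getDocid();
    boolean endOfCollection();
}

/** Exposes each string in a list as one whitespace-tokenised document. */
final class InMemoryCollection implements SketchCollection {
    private final List<String> texts;
    private int current = -1; // index of the document currently selected

    InMemoryCollection(List<String> texts) { this.texts = texts; }

    /** Moves to the next document; returns false when none are left. */
    public boolean nextDocument() {
        if (current + 1 >= texts.size()) return false;
        current++;
        return true;
    }

    public boolean endOfCollection() { return current + 1 >= texts.size(); }

    /** Illustrative identifier scheme: "doc" plus the position in the list. */
    public String getDocid() { return "doc" + current; }

    public SketchDocument getDocument() {
        final Iterator<String> terms =
                Arrays.asList(texts.get(current).trim().split("\\s+")).iterator();
        return new SketchDocument() {
            public String getNextTerm() { return terms.next(); }
            public boolean endOfDocument() { return !terms.hasNext(); }
        };
    }

    public static void main(String[] args) {
        SketchCollection c = new InMemoryCollection(Arrays.asList("hello terrier", "indexing"));
        while (c.nextDocument()) {
            SketchDocument d = c.getDocument();
            while (!d.endOfDocument()) {
                System.out.println(c.getDocid() + ": " + d.getNextTerm());
            }
        }
    }
}
```

An indexer would drive such a collection with the same two nested loops as main: advance to the next document, then pull terms until the document is exhausted.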

You can now use the BasicIndexer or the BlockIndexer to index your collection. The BlockIndexer provides the same functionality as the BasicIndexer, but uses a larger DirectIndex and InvertedIndex to store the positions at which each word occurs in each document. This allows querying to use term position information, for example phrasal search ("") and proximity search (""~10).

The indexer iterates through the documents of the collection and creates the following data structures:

Direct Index : a compressed file, where we store the terms contained in each document. The direct index is used for automatic query expansion.
Document Index : a fixed-length entry file, where we store information about documents, such as the number of indexed tokens (document length), the identifier of a document, and the offset of its corresponding entry in the direct index.
Lexicon : a fixed-length entry file, where we store information about the vocabulary of the indexed collection.

As the indexer iterates through the documents of the collection, it appends to the direct index and the document index. To save the vocabulary information, the indexer creates temporary lexicons for parts of the collection, which are merged once all the documents have been processed.

Once the direct index, the document index and the lexicon have been created, the inverted index is created, by inverting the direct index.
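The inversion step can be illustrated with a toy in-memory sketch. This is not Terrier's implementation, which works on compressed files; the class name InversionSketch and the map-based structures are assumptions made purely for illustration. The example assumes a recent JDK (Java 9 or later).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy illustration only: not Terrier code. */
final class InversionSketch {

    /**
     * Inverts a direct index (docid -> term -> frequency) into an
     * inverted index (term -> postings, each posting = {docid, frequency}).
     */
    static Map<String, List<int[]>> invert(Map<Integer, Map<String, Integer>> direct) {
        // Copy into a sorted map so documents are visited in ascending
        // docid order, which keeps every postings list sorted by docid.
        Map<Integer, Map<String, Integer>> sorted = new TreeMap<>(direct);
        Map<String, List<int[]>> inverted = new TreeMap<>();
        for (Map.Entry<Integer, Map<String, Integer>> doc : sorted.entrySet()) {
            for (Map.Entry<String, Integer> tf : doc.getValue().entrySet()) {
                inverted.computeIfAbsent(tf.getKey(), k -> new ArrayList<>())
                        .add(new int[] { doc.getKey(), tf.getValue() });
            }
        }
        return inverted;
    }

    public static void main(String[] args) {
        Map<Integer, Map<String, Integer>> direct = new TreeMap<>();
        direct.put(0, Map.of("terrier", 2, "index", 1));
        direct.put(1, Map.of("index", 3));
        for (Map.Entry<String, List<int[]>> e : invert(direct).entrySet()) {
            StringBuilder sb = new StringBuilder(e.getKey()).append(" ->");
            for (int[] p : e.getValue()) {
                sb.append(" (doc ").append(p[0]).append(", tf ").append(p[1]).append(")");
            }
            System.out.println(sb);
        }
    }
}
```

The direct index answers "which terms are in this document?", while the inverted index answers "which documents contain this term?", which is the question retrieval needs.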

Querying your index

There are several stages to querying your index:

Parsing
PreProcessing
Matching
PostProcessing
PostFiltering

The querying API has been implemented to allow Terrier to be suited for more applications, including interactive applications. To this end, we have encapsulated every query in a SearchRequest object, which is passed through different stages of a query retrieval by the Manager:

Parsing

A query has to be parsed into a syntax tree. This allows Terrier to identify terms, phrases, requirements, fields, proximity requirements, weights etc. from the grammar of the query entered. For this we use a parser generated by the ANTLR parser generator. The parser uses two lexers: one for parsing most of the query, and another for parsing numbers (integers and floats).

PreProcessing

The Query tree is then traversed. This allows three operations: each term to be passed through the TermPipeline (stemming, stopping etc); controls to be identified and removed; terms to be aggregated for the Matching process.
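The idea of passing each term through a chain of stages can be sketched as below. This is a toy model, not Terrier's TermPipeline: the class PipelineSketch, the use of java.util.function.Function for a stage, and the crude suffix stripper standing in for a real stemmer are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Toy illustration only: not Terrier code. */
final class PipelineSketch {

    /** Stopword stage: returns null to drop a term from the pipeline. */
    static Function<String, String> stopwords(Set<String> stop) {
        return t -> stop.contains(t) ? null : t;
    }

    /** Crude plural stripping, standing in for a real stemmer. */
    static String naiveStem(String t) {
        return (t.length() > 3 && t.endsWith("s")) ? t.substring(0, t.length() - 1) : t;
    }

    /** Passes every term through all stages in order, keeping the survivors. */
    static List<String> process(List<String> terms, List<Function<String, String>> stages) {
        List<String> out = new ArrayList<>();
        for (String term : terms) {
            String t = term;
            for (Function<String, String> stage : stages) {
                if (t == null) break; // an earlier stage dropped the term
                t = stage.apply(t);
            }
            if (t != null) out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Function<String, String>> stages = new ArrayList<>();
        stages.add(stopwords(new HashSet<>(Arrays.asList("the", "of"))));
        stages.add(PipelineSketch::naiveStem);
        System.out.println(process(Arrays.asList("the", "dogs", "bark"), stages));
    }
}
```

Applying the same stages at indexing time and at query time matters: a query term must be normalised in the same way as the indexed vocabulary, or it will not match.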

Matching

The aggregated terms (known as MatchingTerms) form the query for the main retrieval (Matching) stage, where relevant documents are determined and scores are assigned using the assigned weighting model. There are two additional (new) sub-stages at this time:

Term Score Modifiers - alter the score given to a term in a given document, e.g. when the term occurs in a desired field (TITLE, H1 etc.)
Document Score Modifiers - alter the score of a given document, e.g. when all the terms occur in the document, but not in a phrase as desired.

Post Processing

Post Processing is for application-specific code to alter the result set in an unspecified way. Terrier provides automatic Query Expansion as a post process: relevant terms from the top N documents are added to the query, and the matching stage is rerun.

Post Filtering

Post Filtering is like Post Processing, but only one document of the result set may be operated on at any one time. This allows results to be filtered out (e.g. documents that are not in a specific DNS domain, for search engine results).
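A per-document filter of this kind can be sketched as a predicate applied to one result at a time. The sketch below is not Terrier's post filter API: the class PostFilterSketch, the representation of results as a ranked list of URL strings, and the use of java.util.function.Predicate are assumptions for illustration.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Toy illustration only: not Terrier code. */
final class PostFilterSketch {

    /** Accepts a URL if its host is the given domain or a subdomain of it. */
    static Predicate<String> inDomain(String domain) {
        return url -> {
            String host = URI.create(url).getHost();
            return host != null && (host.equals(domain) || host.endsWith("." + domain));
        };
    }

    /** Applies the filter to each result independently, preserving rank order. */
    static List<String> applyFilter(List<String> rankedUrls, Predicate<String> filter) {
        List<String> kept = new ArrayList<>();
        for (String url : rankedUrls) {
            if (filter.test(url)) kept.add(url);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> results = Arrays.asList(
                "http://www.gla.ac.uk/a", "http://example.com/b", "http://gla.ac.uk/c");
        System.out.println(applyFilter(results, inDomain("gla.ac.uk")));
    }
}
```

Because the decision for each document depends only on that document, the filter never needs to see the rest of the result set, which is the restriction that distinguishes it from a post process.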

Example Querying Code

Below is an example of using the querying functionality of Terrier.

        String query = "term1 term2";
        SearchRequest srq = queryingManager.newSearchRequest();
        try {
            // The main lexer handles most of the query; the float lexer
            // shares its input state and handles numbers.
            TerrierLexer lexer = new TerrierLexer(new StringReader(query));
            TerrierFloatLexer flexer = new TerrierFloatLexer(lexer.getInputState());

            // The selector lets the parser switch between the two lexers.
            TokenStreamSelector selector = new TokenStreamSelector();
            selector.addInputStream(lexer, "main");
            selector.addInputStream(flexer, "numbers");
            selector.select("main");
            TerrierQueryParser parser = new TerrierQueryParser(selector);
            parser.setSelector(selector);
            srq.setQuery(parser.query());
        } catch (Exception e) {
            System.err.println("Failed to process Query (" + query + ") : " + e);
            return;
        }
        // Select the matching class and the weighting model (here PL2).
        srq.addMatchingModel("Matching", "PL2");
        // Run the remaining stages in order.
        queryingManager.runPreProcessing(srq);
        queryingManager.runMatching(srq);
        queryingManager.runPostProcessing(srq);
        queryingManager.runPostFilters(srq);
        ResultSet rs = srq.getResultSet();

Terrier Query Language

Terrier offers a flexible and powerful query language for searching with phrases, fields, or specifying that terms are required to appear in the retrieved documents. Some examples of queries are the following:

term1 term2	retrieves documents that contain term1 or term2 (they need not contain both).
term1^2.3	the weight of term1 is boosted by 2.3.
+term1 +term2	retrieves documents that contain both term1 and term2.
+term1 -term2	retrieves documents that contain term1 and do not contain term2.
"term1 term2"	retrieves documents where the terms term1 and term2 appear in a phrase.
"term1 term2"~n	retrieves documents where the terms term1 and term2 appear within a distance of n blocks. The order of the terms is not considered.

Combinations of the different constructs are possible as well. For example, the query term1 term2 -"term1 term2" would retrieve all the documents that contain at least one of the terms term1 and term2, but not the documents where the phrase "term1 term2" appears.
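The effect of the + and - operators can be illustrated with a small boolean matcher over the set of terms in a document. This sketch is not Terrier's query parser: it ignores phrases, proximity, fields and weights, and the rule that a query with required terms does not also need an optional term to match is an assumption made for the example. The class name OperatorSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Toy illustration only: not Terrier's query parser. */
final class OperatorSketch {

    /**
     * Returns true if a document containing docTerms satisfies the query.
     * +term must be present, -term must be absent, and a query made only
     * of plain terms matches when at least one of them is present.
     */
    static boolean matches(String query, Set<String> docTerms) {
        boolean anyRequired = false;
        boolean anyOptionalHit = false;
        for (String tok : query.trim().split("\\s+")) {
            if (tok.startsWith("+")) {
                anyRequired = true;
                if (!docTerms.contains(tok.substring(1))) return false;
            } else if (tok.startsWith("-")) {
                if (docTerms.contains(tok.substring(1))) return false;
            } else if (docTerms.contains(tok)) {
                anyOptionalHit = true;
            }
        }
        return anyRequired || anyOptionalHit;
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("term1", "term3"));
        System.out.println(matches("term1 term2", doc));   // one plain term present
        System.out.println(matches("+term1 +term2", doc)); // term2 is missing
        System.out.println(matches("+term1 -term2", doc)); // term1 present, term2 absent
    }
}
```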

Common Applications Notes

Retrieving from small/focused collections

If you index a very small number of documents on a given topic, some words may occur in many or all of the documents. These terms have low discrimination power, and therefore a low idf. By default, such terms are ignored during retrieval. To override this default, set the property ignore.low.idf.terms to false:

ignore.low.idf.terms=false

Mapping document identifiers to filenames

A core issue is that you will need to be able to map docids to filenames. The class uk.ac.gla.terrier.indexing.SimpleFileCollection gives an example of how to do this, by saving a file with the documents' filenames. Alternatively, if you need to access the original contents of a file, then your collection implementation should implement the interface uk.ac.gla.terrier.indexing.DocumentExtractor.

Implementing a Web search engine

You will probably need to save a two-way mapping from URLs to document identifiers, to enable your search engine to determine the URLs from the document identifiers and vice versa. Note that the document identifier stored by the class uk.ac.gla.terrier.structures.DocumentIndex corresponds to a fixed length entry and it is unlikely to be useful for this task, because URLs are often of highly variable length.
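One simple way to hold such a two-way mapping in memory is a list indexed by document identifier plus a hash map keyed by URL. The sketch below is an illustration under stated assumptions, not part of Terrier: the class UrlMapSketch is hypothetical, and it assumes document identifiers are dense integers assigned from 0 in indexing order. A real search engine would also need to persist the mapping to disk.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy illustration only: not Terrier code. */
final class UrlMapSketch {
    private final List<String> urlByDocid = new ArrayList<>();
    private final Map<String, Integer> docidByUrl = new HashMap<>();

    /** Returns the existing docid for a URL, or assigns the next free one. */
    int add(String url) {
        Integer existing = docidByUrl.get(url);
        if (existing != null) return existing;
        int docid = urlByDocid.size();
        urlByDocid.add(url);
        docidByUrl.put(url, docid);
        return docid;
    }

    /** Looks up the URL for a docid; throws if the docid was never assigned. */
    String urlOf(int docid) { return urlByDocid.get(docid); }

    /** Returns -1 if the URL has not been indexed. */
    int docidOf(String url) { return docidByUrl.getOrDefault(url, -1); }

    public static void main(String[] args) {
        UrlMapSketch map = new UrlMapSketch();
        int id = map.add("http://www.gla.ac.uk/");
        System.out.println(id + " <-> " + map.urlOf(id));
    }
}
```

Keeping the URLs outside the fixed-length document index sidesteps the variable-length problem: the index stores only the integer, and the URL is resolved when results are displayed.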

Many search engines also provide titles and abstracts or summaries. You may wish to save titles and abstracts of web documents you have indexed in separate files, so you can retrieve them for each document identifier, or implement a query-biased summary for each document.

Compiling Terrier

The main Terrier distribution comes pre-compiled as Java, and can be run on any Java 1.4 or 1.5 JDK. You should have no need to compile Terrier unless:

You have altered the Terrier source code and wish to check or use your changes.
You have checked out a CVS copy of Terrier and wish to build a distribution.
You have checked out a CVS copy of Terrier and wish to run the test scripts.
You have downloaded a source-only distribution of Terrier.
You want to browse source code of the query parser related classes, which are automatically generated by compiling the grammar specifications with ANTLR.

Terrier is distributed with two scripts for compiling Terrier for Unix-like platforms:

bin/compile.sh : This builds the terrier-$VERSION.jar file and puts it in the lib/ folder. It will compile all files it finds in the src/ folder.
Makefile : This is a classic Makefile for building Terrier, and is better maintained than bin/compile.sh. It has many targets:
- clean - removes all build process files
- compile - builds the Terrier query parser, the Terrier jar file and new Javadoc. It only compiles and includes Java files that are listed in MANIFEST.txt
- doc javadoc - builds the Javadoc
- distribution - builds the currently selected target platform distribution file (eg terrier-1.0.0.tar.gz, terrier-1.0.0.zip)
- unix - builds terrier-1.0.0.tar.gz
- win - builds terrier-1.0.0.zip, and runs all text files through unix2dos

NB: Currently we suggest that you use the Makefile instead of the script bin/compile.sh, and that you always execute make clean compile to compile Terrier. This ensures that the TerrierParser is always built correctly.

If you want to compile application code that uses Terrier, it is preferable to compile with the file lib/terrier-1.X.X.jar on your classpath, rather than the folder src/.

We currently have no provisions for compiling Terrier on Windows. An obvious way would be to port compile.sh to a file called compile.bat. We also mention in Future features our intention to include an ANT task.