Using Terrier for experiments with TREC collections

Terrier can be readily used for experimentation with test collections used in the Text REtrieval Conference. Below we describe how to index and how to perform retrieval from a TREC test collection. You may also refer to the example of indexing, retrieving and evaluating results for the TREC WT2G collection.

Indexing a TREC collection

After having installed Terrier, we proceed with indexing a document collection. In order to do so, we need to go through the following steps.

The first step consists of running the script bin/trec_setup.sh. This script takes one parameter, which corresponds to the directory under which the document collection to be indexed is stored. For example, if Terrier has been installed in /local/terrier, and the document collection to be indexed is stored under directory /local/collection, we should write:

bash-2.05b$ cd /local/terrier
bash-2.05b$ bin/trec_setup.sh /local/collection

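The script trec_setup.sh creates the default configuration files used for indexing a TREC collection. These files are stored in the directory etc. Those referred to in the rest of this document are collection.spec, terrier.properties, trec.topics.list, trec.models and qemodels.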

While the script bin/trec_setup.sh is running, the contents of the automatically generated file collection.spec are displayed, so that the user can verify that it lists only the files of the collection to index. Alternatively, the file collection.spec can be created manually in the following way. If the document collection files are the following:

/local/collection/d1/f1
/local/collection/d1/f2
/local/collection/d2/f1
/local/collection/d2/f2

then the file collection.spec can be created as follows:

bash-2.05b$ find /local/collection -type f -name 'f?' > etc/collection.spec
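
Before indexing, it can be useful to confirm that every entry in collection.spec points to a readable file. The following is a minimal sketch and not part of Terrier: the helper name check_spec is hypothetical, and it assumes that the file lists one path per line.

```shell
#!/bin/sh
# check_spec FILE (hypothetical helper, not shipped with Terrier):
# print every line of FILE that does not name a readable regular file.
# Assumes one path per line; blank lines are ignored.
check_spec() {
    while IFS= read -r path; do
        # Skip empty lines.
        [ -z "$path" ] && continue
        # Report the path unless it is an existing, readable regular file.
        [ -f "$path" ] && [ -r "$path" ] || printf '%s\n' "$path"
    done < "$1"
}
```

Running check_spec etc/collection.spec prints nothing when every listed file is present; any path it prints should be corrected or removed before indexing.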

The name and the location of the created data structures are specified by the properties terrier.index.prefix and terrier.index.path. The default values of these properties are data and index, respectively. So, assuming that Terrier has been installed in the directory /local/terrier, the created inverted index will be /local/terrier/var/index/data.if. All other data structures will be created in the same directory and will have the same name with the appropriate extensions.

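The next step involves updating the configuration file terrier.properties, if necessary. Among the settings that can be configured are those that must later match at retrieval time: whether stemming is applied, whether stopwords are removed, and the maximum length of terms.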

After updating the required files and setting the properties, we can proceed with indexing the collection, and creating the direct file, lexicon, document index, inverted file and collection statistics file:

bash-2.05b$ bin/trec_terrier.sh -i 

For more information about the available options of the script bin/trec_terrier.sh, you may obtain a help message by typing:

bash-2.05b$ bin/trec_terrier.sh --help

Retrieving with Terrier from TREC collections

Once indexing has finished, we can proceed with retrieving from the document collection. At this stage, the settings for stemming, stopword removal and the maximum length of terms must be exactly the same as those used for indexing the collection.

In the file etc/trec.topics.list, we need to specify which file contains the queries to process. Next, we need to specify which of the available weighting models we will use for assigning scores to the retrieved documents. We do this by specifying the name of the corresponding class in the file etc/trec.models. For example, if we are using the weighting scheme InL2, then the models file should contain:

uk.ac.gla.terrier.matching.models.InL2

The last step before processing the queries is to specify which tags from the topics to use. We can do that by setting three properties: TrecQueryTags.process, which denotes which tags to process; TrecQueryTags.idtag, which names the tag containing the query identifier; and TrecQueryTags.skip, which denotes which query tags to ignore.

For example, suppose that the format of topics is the following:

<TOP>
<NUM>123</NUM>
<TITLE>title
<DESC>description
<NARR>narrative
</TOP>

If we want to skip the description and narrative (DESC and NARR tags respectively), and consequently use the title only, then we need to set up the properties as follows:

TrecQueryTags.process=TOP
TrecQueryTags.idtag=NUM
TrecQueryTags.skip=DESC,NARR

Alternatively, if we want to skip the title, and consequently use the description and the narrative tags to create the query, then we need to set up the properties as follows:

TrecQueryTags.process=TOP
TrecQueryTags.idtag=NUM
TrecQueryTags.skip=TITLE
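
To make the effect of the skip setting concrete, the sketch below mimics it on topics in the format shown above. It is an illustration only and is not Terrier's topic parser: the helper name topic_queries is hypothetical, the identifier tag is fixed to NUM, and only text on the same line as its tag is used.

```shell
#!/bin/sh
# topic_queries SKIP (hypothetical helper, not Terrier code):
# read topics on standard input and print "ID<TAB>query text" per topic.
# SKIP is a comma-separated list of tags to ignore, e.g. DESC,NARR.
topic_queries() {
    awk -v skip="$1" '
        BEGIN {
            # Turn the comma-separated list into a lookup table.
            n = split(skip, parts, ",")
            for (i = 1; i <= n; i++) skipped[parts[i]] = 1
        }
        # A new topic starts: reset the identifier and the query text.
        /^<TOP>/   { id = ""; query = ""; next }
        # The topic ends: emit one line for it.
        /^<\/TOP>/ { print id "\t" query; next }
        # The identifier tag: strip the NUM markup and keep the number.
        /^<NUM>/   { id = $0; gsub(/<\/?NUM>/, "", id); next }
        # Any other tag: keep its text unless the tag is skipped.
        /^<[A-Z]+>/ {
            tag = $0; sub(/^</, "", tag); sub(/>.*/, "", tag)
            text = $0; sub(/^<[A-Z]+>/, "", text)
            if (!(tag in skipped)) query = (query == "" ? text : query " " text)
            next
        }
    '
}
```

For the example topic, topic_queries DESC,NARR keeps only the title text, while topic_queries TITLE keeps the description and the narrative text.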

To process the queries, we can type the following:

bash-2.05b$ bin/trec_terrier.sh -r -c 1.0

where the option -r specifies that we want to perform retrieval, and the option -c 1.0 specifies the parameter value for the term frequency normalisation. If the option -c is not specified, then the default value of 1.0 is used. This default value can be altered by setting the property term.freq.norm.parameter in the properties file.
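
To check which default value is currently in effect, the property can be looked up directly in the properties file. The following is a minimal sketch under stated assumptions: the helper name get_prop is hypothetical, entries are assumed to be written as key=value with no spaces around the equals sign, and the last occurrence of a key is assumed to win.

```shell
#!/bin/sh
# get_prop FILE KEY DEFAULT (hypothetical helper, not shipped with Terrier):
# print the value of KEY in a key=value properties file, or DEFAULT when
# the file is unreadable or the key is not set.
get_prop() {
    if [ -r "$1" ]; then
        awk -F= -v key="$2" -v def="$3" '
            # Exact match on the key; keep everything after the first "=".
            $1 == key { sub(/^[^=]*=/, ""); val = $0; found = 1 }
            END { if (found) print val; else print def }
        ' "$1"
    else
        printf '%s\n' "$3"
    fi
}
```

For instance, get_prop etc/terrier.properties term.freq.norm.parameter 1.0 prints the configured value, or 1.0 when the property is absent.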

Terrier also offers query expansion functionality. For a brief description of the query expansion module, you may view the query expansion section of the DFR Framework description. The term weighting model used for expanding the queries with the most informative terms of the top-ranked documents is specified in the file etc/qemodels. This file contains the class names of the term weighting models to be used for query expansion. The default content of the file is:

uk.ac.gla.terrier.matching.models.queryexpansion.Bo1  

In addition, there are two parameters that can be set for applying query expansion. The first one is the number of terms to expand a query with. It is specified by the property expansion.terms, the default value of which is 10. The second one is the number of top-ranked documents from which these terms are extracted. It is specified by the property expansion.documents, the default value of which is 3.

To retrieve from an indexed test collection, using query expansion, with the term frequency normalisation parameter equal to 1.0, we can type:

bash-2.05b$ bin/trec_terrier.sh -r -q -c 1.0 

The results are saved in the directory var/results in a file named as follows:

"weighting scheme" c "value of c"_counter.res

For example, if we have used the weighting scheme PL2 with c=1.28 and the counter was 3, then the filename of the results would be PL2c1.28_3.res.
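
When scripting over several runs, the result filename can be assembled from the same three parts. This is a sketch of the naming pattern described above, not code taken from Terrier; the helper name result_name is hypothetical, and its third argument is the counter as it appears in the filename.

```shell
#!/bin/sh
# result_name SCHEME C COUNTER (hypothetical helper):
# build a result filename of the form <scheme>c<value of c>_<counter>.res
result_name() {
    printf '%sc%s_%s.res\n' "$1" "$2" "$3"
}

result_name PL2 1.28 3   # prints PL2c1.28_3.res
```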