Using Terrier for experiments with TREC collections

Terrier can be readily used for experimentation with test collections used in the Text REtrieval Conference. Below we describe how to index and how to perform retrieval from a TREC test collection. You may also refer to the example of indexing, retrieving and evaluating results for the TREC WT2G collection.

Indexing a TREC collection

After having installed Terrier, we proceed with indexing a document collection. In order to do so, we need to go through the following steps.

The first step consists of running the script bin/trec_setup.sh. This script takes one parameter, which corresponds to the directory under which the document collection to be indexed is stored. For example, if Terrier has been installed in /local/terrier, and the document collection to be indexed is stored under directory /local/collection, we should write:

bash-2.05b$ cd /local/terrier
bash-2.05b$ bin/trec_setup.sh /local/collection

The script trec_setup.sh creates the default configuration files, used for indexing a TREC collection. These files are stored in the directory etc and they are:

collection.spec : contains the list of files to index.
terrier.properties : contains the basic configuration of Terrier.
trec.topics.list : contains the full path to the topics file. The automatically generated file is empty and the user needs to update this file.
trec.models : contains the fully qualified class name of the weighting model used for matching.
qemodels : contains the fully qualified class name of the term weighting model used for automatic query expansion.

While running the script bin/trec_setup.sh, the contents of the automatically generated file collection.spec are displayed in order for the user to verify that the file contains only the files of the collection to index. Alternatively, it can be created manually in the following way. If the document collection files are the following:

/local/collection/d1/f1
/local/collection/d1/f2
/local/collection/d2/f1
/local/collection/d2/f2

then the file collection.spec can be created as follows:

bash-2.05b$ find /local/collection -type f -name 'f?' > etc/collection.spec
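
As a sketch of a slightly more defensive way to build the list, the pattern can be quoted so that find, rather than the shell, interprets it, and -type f restricts the match to regular files. The temporary directory below is only a stand-in for /local/collection and etc, so that the example is self-contained. Sorting the list is our own choice, not something Terrier is known to require.

```shell
#!/bin/sh
# Stand-in directories so the sketch runs anywhere; in practice these would be
# /local/collection and /local/terrier/etc.
work=$(mktemp -d)
collection="$work/collection"
spec="$work/etc/collection.spec"
mkdir -p "$collection/d1" "$collection/d2" "$work/etc"
touch "$collection/d1/f1" "$collection/d1/f2" \
      "$collection/d2/f1" "$collection/d2/f2"

# Quote the pattern so that find, not the shell, expands it.
find "$collection" -type f -name 'f?' | sort > "$spec"

# Display the list so that it can be verified before indexing.
cat "$spec"
```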

The name and the location of the created data structures are specified by the properties terrier.index.prefix and terrier.index.path. The default values of these properties are data and index, respectively. So, assuming that Terrier has been installed in the directory /local/terrier, the created inverted index will be /local/terrier/var/index/data.if. All other data structures will be created in the same directory, with the same name and the appropriate extensions.

The next step involves updating the configuration file terrier.properties, if necessary. Among the properties that can be configured, we can specify:

the components of the term pipeline (property termpipelines). The default value of this property is Stopwords,PorterStemmer. This means that during indexing and querying the tokens go through a pipeline where the first component removes stopwords and the second applies Porter's stemming algorithm. The names of the pipeline components correspond to the fully qualified class names of the components. If a class name is not fully qualified, i.e. it does not contain the package names, then the default package uk.ac.gla.terrier.terms is prepended to the class name.
the path of a file that contains a list of stop-words (property stopwords.filename). The default value is stopword-list.txt and the file is assumed to be in the directory etc. You may specify either a full path or a relative path. If a relative path is given, then the name of the file is prepended with the full path to the directory share/.
the maximum length of terms to be indexed (property string.byte.length). The default value is 20. The value of this property should be at least as large as the length, in bytes, of the document identifier (i.e. the value of tag <DOCNO> for TREC collections).
the documents' tags. We can specify the delimiting tag of documents (property TrecDocTags.doctag), the document identifier tag (property TrecDocTags.idtag), the tags to process (property TrecDocTags.process) and the tags to skip (property TrecDocTags.skip). If we are within the scope of a tag to process, then the text is indexed, unless the tokenizer enters the scope of a tag to skip. By default, if the list of tags to skip is empty, then we process all tags. For example, suppose that a document has the following structure:
```
   <DOC>
   <DOCNO>abc</DOCNO>
   <DOCHDR>...</DOCHDR>
   ...
   </DOC>
```
If we want to process everything that is within tag <DOC> and also to specify that the document identifier is within <DOCNO>, then we can set the above mentioned properties as follows:
```
   TrecDocTags.doctag=DOC
   TrecDocTags.idtag=DOCNO
   TrecDocTags.process=
   TrecDocTags.skip=
```
If we don't want to index the contents of the tag <DOCHDR>, then we can write:
```
   TrecDocTags.doctag=DOC
   TrecDocTags.idtag=DOCNO
   TrecDocTags.skip=DOCHDR
```
For more details on other properties, refer to the classes uk.ac.gla.terrier.utility.TagSet, uk.ac.gla.terrier.utility.ApplicationSetup and the description of the properties you can modify through the configuration file of Terrier.
indexing and retrieving with position information. The following properties can be set up to use position information:
```
   block.size=1
   max.blocks=100000
   block.indexing=true
```
The property block.size specifies the size of each block. A value of 1 means that each term appears in a different block. The property max.blocks specifies the maximum number of blocks a document may contain. If there are more blocks in a document than the maximum number, then all the additional blocks are added to the last one. The property block.indexing enables indexing with position information. The position information is used during querying for processing phrase or proximity queries. For more details about the query language, you may refer to its description in the guide for developing applications with Terrier.
whether we save information about terms that appear in a set of specified fields (or tags). If a collection of HTML documents is indexed, we can flag in the direct and inverted files the terms that appear in any of the specified HTML tags. The properties we need to set up are the following:
```
   field.modifiers=0.10d
   FieldTags.process=TITLE
```
The property field.modifiers specifies how much a document's score is increased when a query term appears in one of the specified HTML tags. The property FieldTags.process is a comma-separated list of the HTML tags to process. In the above example, if a query term appears in the <title> tag of a document, then the document's score will be increased by 10 percent.
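
The indexing properties discussed above can be collected in one place. The following is a minimal sketch that appends them to a properties file from the shell. The values are the ones used in the examples above, and the PROPS variable is our own convention for the sketch: it falls back to a temporary file, whereas in practice it would point to etc/terrier.properties.

```shell
#!/bin/sh
# PROPS is a hypothetical variable for this sketch; set it to
# /local/terrier/etc/terrier.properties to edit a real installation.
PROPS="${PROPS:-$(mktemp)}"

# The quoted 'EOF' prevents the shell from expanding anything in the body.
cat >> "$PROPS" <<'EOF'
termpipelines=Stopwords,PorterStemmer
stopwords.filename=stopword-list.txt
string.byte.length=20
TrecDocTags.doctag=DOC
TrecDocTags.idtag=DOCNO
TrecDocTags.skip=DOCHDR
block.indexing=true
block.size=1
max.blocks=100000
EOF

# Show the TREC tag settings that are now in the file.
grep '^TrecDocTags\.' "$PROPS"
```

Appending is only appropriate when the file does not already define these properties. If it does, it is safer to edit the existing lines by hand.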

After updating the required files and setting the properties, we can proceed with indexing the collection, and creating the direct file, lexicon, document index, inverted file and collection statistics file:

bash-2.05b$ bin/trec_terrier.sh -i

For more information about the available options of the script bin/trec_terrier.sh, you may obtain a help message by typing:

bash-2.05b$ bin/trec_terrier.sh --help

Retrieving with Terrier from TREC collections

After the end of the indexing process, we can proceed with retrieving from the document collection. At this stage, the options for applying stemming or not, removing stopwords or not, and the maximum length of terms, should be exactly the same as the ones used for indexing the collection.

In the file etc/trec.topics.list, we need to specify which file contains the queries to process. Next, we need to specify which of the available weighting models we will use for assigning scores to the retrieved documents. We do this by specifying the name of the corresponding class in the file etc/trec.models. For example, if we are using the weighting scheme InL2, then the models file should contain:

uk.ac.gla.terrier.matching.models.InL2
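
A sketch of preparing these two files from the shell is given below. The topics path /local/topics/topics.txt is hypothetical, and terrier_dir is a variable introduced for the example. It defaults to a temporary directory so that the commands can be tried without touching a real installation.

```shell
#!/bin/sh
# terrier_dir is an assumption of this sketch; use /local/terrier for the
# installation described in this document.
terrier_dir="${terrier_dir:-$(mktemp -d)}"
mkdir -p "$terrier_dir/etc"

# Full path to the topics file (hypothetical location).
echo "/local/topics/topics.txt" > "$terrier_dir/etc/trec.topics.list"

# Fully qualified class name of the weighting model.
echo "uk.ac.gla.terrier.matching.models.InL2" > "$terrier_dir/etc/trec.models"

# Display both files for verification.
cat "$terrier_dir/etc/trec.topics.list" "$terrier_dir/etc/trec.models"
```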

A last step before processing the queries is to specify which tags from the topics to use. We can do that by setting the properties TrecQueryTags.process, which denotes which tags to process, TrecQueryTags.idtag, which stands for the tag containing the query identifier, and TrecQueryTags.skip, which denotes which query tags to ignore.

For example, suppose that the format of topics is the following:

<TOP>
<NUM>123</NUM>
<TITLE>title
<DESC>description
<NARR>narrative
</TOP>

If we want to skip the description and narrative (DESC and NARR tags respectively), and consequently use the title only, then we need to set up the properties as follows:

TrecQueryTags.process=TOP
TrecQueryTags.idtag=NUM
TrecQueryTags.skip=DESC,NARR

Alternatively, if we want to skip the title, and consequently use the description and the narrative tags to create the query, then we need to set up the properties as follows:

TrecQueryTags.process=TOP
TrecQueryTags.idtag=NUM
TrecQueryTags.skip=TITLE

To process the queries, we can type the following:

bash-2.05b$ bin/trec_terrier.sh -r -c 1.0

where the option -r specifies that we want to perform retrieval, and the option -c 1.0 specifies the parameter value for the term frequency normalisation. If the option -c is not specified, then a default value 1.0 is used. This default value can be altered by setting the property term.freq.norm.parameter in the properties file.
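
Since the value of c is passed on the command line, several settings can be tried in a row. The loop below is a sketch: the values of c are arbitrary examples, and the function run_terrier is our own helper, which only prints the command when bin/trec_terrier.sh is not found in the current directory.

```shell
#!/bin/sh
# Hypothetical helper: run the script if it is present and executable,
# otherwise print the command that would have been run.
run_terrier() {
    if [ -x bin/trec_terrier.sh ]; then
        bin/trec_terrier.sh "$@"
    else
        echo "bin/trec_terrier.sh $*"
    fi
}

# Arbitrary example values for the normalisation parameter c.
for c in 0.5 1.0 2.0; do
    run_terrier -r -c "$c"
done
```

The loop is meant to be started from the directory in which Terrier has been installed, for example /local/terrier.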

Terrier also offers query expansion functionality. For a brief description of the query expansion module, you may view the query expansion section of the DFR Framework description. The term weighting model used for expanding the queries with the most informative terms of the top-ranked documents is specified in the file etc/qemodels. This file contains the class names of the term weighting models to be used for query expansion. The default content of the file is:

uk.ac.gla.terrier.matching.models.queryexpansion.Bo1

In addition, there are two parameters that can be set for applying query expansion. The first one is the number of terms to expand a query with. It is specified by the property expansion.terms, the default value of which is 10. The second one is the number of top-ranked documents from which these terms are extracted. It is specified by the property expansion.documents, the default value of which is 3.

To retrieve from an indexed test collection, using query expansion, with the term frequency normalisation parameter equal to 1.0, we can type:

bash-2.05b$ bin/trec_terrier.sh -r -q -c 1.0

The results are saved in the directory var/results in a file named as follows:

"weighting scheme" c "value of c"_counter.res

For example, if we have used the weighting scheme PL2 with c=1.28 and the counter was 2, then the filename of the results would be PL2c1.28_3.res.
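
Because the file name encodes the settings of a run, they can be recovered from it when comparing several runs. The following sketch splits a name of the above form with shell parameter expansion. It assumes that the name of the weighting scheme is followed by the letter c, and that the value of c contains neither an underscore nor the letter c.

```shell
#!/bin/sh
name="PL2c1.28_3.res"

base=${name%.res}        # strip the extension: PL2c1.28_3
counter=${base##*_}      # text after the last underscore
rest=${base%_*}          # text before the last underscore: PL2c1.28
value=${rest##*c}        # text after the last "c"
model=${rest%c*}         # text before the last "c"

# Prints: model=PL2 c=1.28 counter=3
echo "model=$model c=$value counter=$counter"
```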