[Previous: Configuring Indexing] [Contents] [Next: Terrier Query Language]

Configuring Retrieval in Terrier

Topics

After the end of the indexing process, we can proceed with retrieving from the document collection. At this stage, the options for applying stemming or not, removing stopwords or not, and the maximum length of terms, should be exactly the same as the ones used for indexing the collection.

In the file etc/trec.topics.list, we need to specify which file contains the queries to process. Alternatively, we can specify the topic file by setting the property trec.topics to the name of the topic file.

A last step before processing the queries is to specify which tags from the topics to use. We can do that by setting the properties TrecQueryTags.process, which denotes which tags to process, TrecQueryTags.idtag, which stands for the tag containing the query identifier, and TrecQueryTags.skip, which denotes which query tags to ignore.

For example, suppose that the format of topics is the following:

<TOP>
<NUM>123</NUM>
<TITLE>title
<DESC>description
<NARR>narrative
</TOP>

If we want to skip the description and narrative (DESC and NARR tags respectively), and consequently use the title only, then we need to set up the properties as follows:

TrecQueryTags.doctag=TOP
TrecQueryTags.process=TOP,NUM,TITLE
TrecQueryTags.idtag=NUM
TrecQueryTags.skip=DESC,NARR

Alternatively, if we want to use the title, description and narrative tags to create the query, then we need to set up the properties as follows:

TrecQueryTags.doctag=TOP
TrecQueryTags.process=TOP,NUM,DESC,NARR,TITLE
TrecQueryTags.idtag=NUM
TrecQueryTags.skip=

The tags specified by TrecQueryTags are case-insensitive (note the difference from TrecDocTags). If you want them to be case-sensitive, then set TrecQueryTags.casesensitive=true.
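
To make the effect of these properties concrete, the following Python sketch mimics the tag selection on a topic in the format shown above. It is not Terrier's topic parser: the function name parse_topic, the constant TOPIC and the simple tag-splitting logic are assumptions made for illustration only. Tag names are compared case-insensitively.

```python
import re

# A topic in the example format used in this section.
TOPIC = """<TOP>
<NUM>123</NUM>
<TITLE>title
<DESC>description
<NARR>narrative
</TOP>
"""

def parse_topic(text, process, idtag, skip=()):
    """Return (query_id, query_text) for a single topic.

    Hypothetical helper: the text following a tag is assumed to belong
    to that tag until the next tag is seen.
    """
    process = {t.upper() for t in process}
    skip = {t.upper() for t in skip}
    idtag = idtag.upper()
    query_id, parts, current = None, [], None
    # Split the topic into tags and the text between them.
    for token in re.split(r"(</?\w+>)", text):
        token = token.strip()
        if not token:
            continue
        tag = re.fullmatch(r"<(/?)(\w+)>", token)
        if tag:
            # A closing tag ends the current element; an opening tag starts one.
            current = None if tag.group(1) else tag.group(2).upper()
        elif current == idtag:
            query_id = token
        elif current in process and current not in skip:
            parts.append(token)
    return query_id, " ".join(parts)

# Title-only queries, as in the first configuration.
print(parse_topic(TOPIC, ["TOP", "NUM", "TITLE"], "NUM", ["DESC", "NARR"]))
# Title, description and narrative, as in the second configuration.
print(parse_topic(TOPIC, ["TOP", "NUM", "DESC", "NARR", "TITLE"], "NUM"))
```

With the first configuration only the title contributes to the query; with the second, the text of all three fields is concatenated in the order it appears in the topic.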

If you have test topics written in a language other than English, then you should have indexed the collection with the string.use_utf property set. In this case, Terrier will use a more forgiving tokeniser for parsing the topic files at retrieval time, provided that string.use_utf remains set.

Weighting Models and Parameters

Next, we need to specify which of the available weighting models we will use for assigning scores to the retrieved documents. We do this by specifying the name of the corresponding class in the file etc/trec.models, or by setting the property trec.model to the name of the model to use. For example, if we are using the weighting scheme InL2, then the models file should contain:

uk.ac.gla.terrier.matching.models.InL2

Terrier provides implementations of the following weighting models:

BB2: Bose-Einstein model for randomness, the ratio of two Bernoulli processes for first normalisation, and Normalisation 2 for term frequency normalisation.
BM25: The BM25 probabilistic model.
DFR_BM25: The DFR version of BM25.
DLH: The DLH hyper-geometric DFR model.
DLH13: An improved version of DLH.
Hiemstra_LM: Hiemstra's language model.
IFB2: Inverse Term Frequency model for randomness, the ratio of two Bernoulli processes for first normalisation, and Normalisation 2 for term frequency normalisation.
In_expB2: Inverse expected document frequency model for randomness, the ratio of two Bernoulli processes for first normalisation, and Normalisation 2 for term frequency normalisation.
In_expC2: Inverse expected document frequency model for randomness, the ratio of two Bernoulli processes for first normalisation, and Normalisation 2 for term frequency normalisation with natural logarithm.
InL2: Inverse document frequency model for randomness, Laplace succession for first normalisation, and Normalisation 2 for term frequency normalisation.
LemurTF_IDF: Lemur's version of the tf*idf weighting function.
PL2: Poisson estimation for randomness, Laplace succession for first normalisation, and Normalisation 2 for term frequency normalisation.
TF_IDF: The tf*idf weighting function, where tf is given by Robertson's tf and idf is given by the standard Sparck Jones' idf.
PonteCroft: Ponte & Croft's language model. This is the only model for which special index structures are needed. To use it, ensure that the special index structure is created by bin/trec_terrier.sh -i -l. The model was implemented as written in the paper. Possible improvements, such as using sums of logs to avoid product of small probabilities, are not considered in the implementation.
DFRWeightingModel: This class provides an alternative way of specifying the weighting model to be used. For usage, see Extending Retrieval.
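
As an illustration of how the components named in these descriptions combine, the sketch below scores a single query term in a single document with InL2, as the model is usually stated in the DFR literature: the inverse document frequency model for randomness, multiplied by the Laplace first normalisation, and applied to a normalised term frequency tfn. The function and parameter names are assumptions made for this example, and Terrier's own class may differ in details such as query term weighting.

```python
import math

def inl2_score(tfn, doc_freq, num_docs):
    """Score one term in one document with InL2 (illustrative sketch).

    tfn      -- term frequency after Normalisation 2
    doc_freq -- number of documents containing the term
    num_docs -- number of documents in the collection
    """
    # Randomness model: information content from inverse document frequency.
    information = tfn * math.log2((num_docs + 1.0) / (doc_freq + 0.5))
    # First normalisation: Laplace's law of succession.
    laplace = 1.0 / (tfn + 1.0)
    return laplace * information

# A rare term contributes more than a frequent one at the same tfn.
print(inl2_score(2.0, doc_freq=10, num_docs=100000))
print(inl2_score(2.0, doc_freq=50000, num_docs=100000))
```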

To process the queries, we can type the following:

bash-2.05b$ bin/trec_terrier.sh -r -c 1.0

where the option -r specifies that we want to perform retrieval, and the option -c 1.0 specifies the parameter value for the term frequency normalisation.
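
For the models whose description mentions Normalisation 2, the value of c controls how strongly the term frequency is adjusted for document length. The following sketch uses the formula as it is commonly given in the DFR literature, tfn = tf * log2(1 + c * avg_l / l). The function and parameter names are assumptions, and models that do not use Normalisation 2 may interpret the parameter differently.

```python
import math

def normalisation2(tf, doc_len, avg_doc_len, c=1.0):
    """Normalised term frequency under Normalisation 2 (illustrative sketch).

    tf          -- raw frequency of the term in the document
    doc_len     -- length of the document in tokens
    avg_doc_len -- average document length in the collection
    c           -- the normalisation parameter, as passed with -c
    """
    return tf * math.log2(1.0 + c * avg_doc_len / doc_len)

# A document of average length keeps its raw frequency when c = 1.0.
print(normalisation2(3, doc_len=100, avg_doc_len=100, c=1.0))
# The same frequency counts for less in a longer document.
print(normalisation2(3, doc_len=400, avg_doc_len=100, c=1.0))
```

Larger values of c increase tfn for every document, which reduces the relative penalty applied to long documents.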

If Ponte & Croft's language model is used, we need to use the option -l:

bash-2.05b$ bin/trec_terrier.sh -r -l

Query Expansion

Terrier also offers a query expansion functionality. For a brief description of the query expansion module, you may view the query expansion section of the DFR Framework description. The term weighting model used for expanding the queries with the most informative terms of the top-ranked documents is specified in the file etc/qemodels. This file contains the class names of the term weighting models to be used for query expansion. The default content of the file is:

uk.ac.gla.terrier.matching.models.queryexpansion.Bo1

In addition, there are two parameters that can be set for applying query expansion. The first is the number of terms to expand a query with, specified by the property expansion.terms, the default value of which is 10. The second is the number of top-ranked documents from which these terms are extracted, specified by the property expansion.documents, the default value of which is 3.
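
The interplay of the two parameters can be sketched as follows. The example weights candidate terms with Bo1 as it is usually stated in the DFR literature, w(t) = tf_x * log2((1 + P_n) / P_n) + log2(1 + P_n), where tf_x is the frequency of the term in the top-ranked documents and P_n is its collection frequency divided by the number of documents. The function expand_query, its data structures and the toy statistics are assumptions made for illustration; the sketch simply appends the selected terms and does not reproduce how Terrier reweights the query.

```python
import math
from collections import Counter

def expand_query(query_terms, ranked_docs, coll_freq, num_docs,
                 expansion_documents=3, expansion_terms=10):
    """Append the most informative terms of the top-ranked documents.

    ranked_docs -- list of token lists, best-ranked document first
    coll_freq   -- frequency of each term in the whole collection
    num_docs    -- number of documents in the collection
    """
    # Only the top expansion_documents documents contribute candidate terms.
    top_docs = ranked_docs[:expansion_documents]
    tf_x = Counter(term for doc in top_docs for term in doc)

    weights = {}
    for term, tf in tf_x.items():
        p_n = coll_freq[term] / num_docs
        weights[term] = tf * math.log2((1 + p_n) / p_n) + math.log2(1 + p_n)

    # Keep the expansion_terms highest-weighted terms (ties broken by name).
    best = sorted(weights, key=lambda t: (-weights[t], t))[:expansion_terms]
    return list(query_terms) + [t for t in best if t not in query_terms]

# Toy data, invented for this example.
ranked_docs = [["terrier", "retrieval", "index"],
               ["terrier", "dog"],
               ["retrieval", "model"],
               ["cat", "cat", "cat"]]
coll_freq = {"terrier": 5, "retrieval": 50, "index": 40,
             "dog": 200, "model": 100, "cat": 10}

print(expand_query(["terrier"], ranked_docs, coll_freq, num_docs=1000,
                   expansion_terms=2))
```

The fourth document lies outside the top three, so its terms are never considered, however frequent they are in it.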

To retrieve from an indexed test collection, using query expansion, with the term frequency normalisation parameter equal to 1.0, we can type:

bash-2.05b$ bin/trec_terrier.sh -r -q -c 1.0

Other Configurables

The results are saved in the directory var/results in a file named as follows:

"weighting scheme" c "value of c"_counter.res

For example, if we have used the weighting scheme PL2 with c=1.28 and the counter was 2, then the filename of the results would be PL2c1.28_3.res.

For each query, Terrier returns a maximum of 1000 documents by default. We can change the maximum number of returned documents per query by changing matching.retrieved_set_size. For example, if we want to retrieve 10000 documents for each given query, we need to set matching.retrieved_set_size to 10000. In addition, we need to set the rank of the last returned document to 9999 in querying.default.controls.
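
Because ranks are counted from 0, the rank of the last returned document is always one less than the retrieved set size. The small sketch below keeps the two values consistent. It assumes that querying.default.controls is a comma-separated list of name:value pairs and that the control holding the last rank is named end; both the helper set_result_limit and that control name are assumptions, so check them against your own terrier.properties file.

```python
def set_result_limit(controls, max_docs):
    """Return the new retrieved set size and the updated controls string.

    controls -- current value of querying.default.controls (assumed format)
    max_docs -- maximum number of documents to return per query
    """
    if max_docs < 1:
        raise ValueError("max_docs must be at least 1")
    pairs = dict(item.split(":", 1) for item in controls.split(",") if item)
    # Ranks start at 0, so the last rank is max_docs - 1.
    pairs["end"] = str(max_docs - 1)
    return max_docs, ",".join(f"{name}:{value}" for name, value in pairs.items())

size, controls = set_result_limit("c:1.0,start:0,end:999", 10000)
print(f"matching.retrieved_set_size={size}")
print(f"querying.default.controls={controls}")
```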

Some of the weighting models, e.g. BM25, assume low document frequencies of query terms. For these models, it is worth ignoring query terms with high document frequency during retrieval by setting ignore.low.idf.terms to true. However, it is better to set ignore.low.idf.terms to false for high precision search tasks such as named-page finding.
