Configuring Retrieval in Terrier
After the end of the indexing process, we can proceed with retrieving from the document collection. At this stage, the options for applying stemming or not, removing stopwords or not, and the maximum length of terms, should be exactly the same as the ones used for indexing the collection.
In the file etc/trec.topics.list, we need to specify which file contains the queries to process. Alternatively, we can specify the topic file by setting the property trec.topics to the name of the topic file.
A last step before processing the queries is to specify which tags from the topics to use. We can do that by setting the properties TrecQueryTags.process, which denotes which tags to process, TrecQueryTags.idtag, which stands for the tag containing the query identifier, and TrecQueryTags.skip, which denotes which query tags to ignore.
For example, suppose that the format of topics is the following:
<TOP>
<NUM>123<NUM>
<TITLE>title
<DESC>description
<NARR>narrative
</TOP>
If we want to skip the description and narrative (the DESC and NARR tags respectively), and consequently use the title only, then we need to set up the properties as follows:
TrecQueryTags.doctag=TOP
TrecQueryTags.process=TOP,NUM,TITLE
TrecQueryTags.idtag=NUM
TrecQueryTags.skip=DESC,NARR
Alternatively, if we want to use the title, description and narrative tags to create the query, then we need to set up the properties as follows:
TrecQueryTags.doctag=TOP
TrecQueryTags.process=TOP,NUM,DESC,NARR,TITLE
TrecQueryTags.idtag=NUM
TrecQueryTags.skip=
The tags specified by TrecQueryTags are case-insensitive (note the difference from TrecDocTags). If you want them to be case-sensitive, then set TrecQueryTags.casesensitive=true.
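To make the effect of the process, skip and idtag settings concrete, here is a minimal Python sketch that applies them to the example topic above. It is not Terrier's own topic parser: the function name extract_query and the regular expression are hypothetical, and tag matching is assumed to be case-insensitive, as is the default for TrecQueryTags.

```python
import re

# Example topic in the flat tagged format shown above.
TOPIC = "<TOP> <NUM>123<NUM> <TITLE>title <DESC>description <NARR>narrative </TOP>"


def extract_query(topic, process, skip, idtag):
    """Return (query_id, query_text) for one topic.

    Hypothetical helper for illustration only, not part of Terrier.
    """
    process = {tag.upper() for tag in process}
    skip = {tag.upper() for tag in skip}
    idtag = idtag.upper()

    query_id = None
    query_parts = []
    # Each match is an opening tag plus the text up to the next "<".
    # Closing tags such as </TOP> do not match the pattern and are ignored.
    for tag, text in re.findall(r"<(\w+)>([^<]*)", topic):
        tag = tag.upper()
        text = text.strip()
        if not text or tag in skip or tag not in process:
            continue
        if tag == idtag:
            if query_id is None:
                query_id = text
        else:
            query_parts.append(text)
    return query_id, " ".join(query_parts)


# Title only: DESC and NARR are skipped.
print(extract_query(TOPIC, ["TOP", "NUM", "TITLE"], ["DESC", "NARR"], "NUM"))
# Title, description and narrative: nothing is skipped.
print(extract_query(TOPIC, ["TOP", "NUM", "DESC", "NARR", "TITLE"], [], "NUM"))
```

In this sketch the query text follows the order of the tags in the topic, not the order in which they are listed in the process setting.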
If you have test topics written in a language other than English, then you should have indexed the collection with the string.use_utf property set. In this case, for retrieval, Terrier will use a more forgiving tokeniser for parsing the topic files if string.use_utf remains set.
Next, we need to specify which of the available weighting models we will use for assigning scores to the retrieved documents. We do this by specifying the name of the corresponding class in the file etc/trec.models, or by setting the property trec.model to the name of the model used. For example, if we are using the weighting scheme InL2, then the models file should contain the class name of the InL2 model.
Terrier provides implementation of the following weighting models:
To process the queries, we can type the following:
bash-2.05b$ bin/trec_terrier.sh -r -c 1.0
where the option -r specifies that we want to perform retrieval, and the option -c 1.0 specifies the parameter value for the term frequency normalisation.
If Ponte and Croft's language model is used, we need to use the option -l:
bash-2.05b$ bin/trec_terrier.sh -r -l
Terrier also offers a query expansion functionality. For a brief description of the query expansion module, you may view the query expansion section of the DFR Framework description. The term weighting model used for expanding the queries with the most informative terms of the top-ranked documents is specified in the file etc/qemodels. This file contains the class names of the term weighting models to be used for query expansion. The default content of the file is:
In addition, there are two parameters that can be set for applying query expansion. The first one is the number of terms to expand a query with. It is specified by the property expansion.terms, the default value of which is 10. The second one is the number of top-ranked documents from which these terms are extracted. It is specified by the property expansion.documents, the default value of which is 3.
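The role of the two parameters can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not Terrier's implementation: raw term frequency stands in for the term weighting model configured in etc/qemodels, documents are represented as plain lists of terms, the selected terms are simply appended to the original query, and the function name expand_query is hypothetical.

```python
from collections import Counter


def expand_query(query_terms, ranked_docs, expansion_documents=3, expansion_terms=10):
    """Add terms from the top-ranked documents to the query.

    Hypothetical illustration: term frequency replaces the real
    term weighting model used for query expansion.
    """
    # Only the first `expansion_documents` documents are considered.
    top_docs = ranked_docs[:expansion_documents]
    counts = Counter(term for doc in top_docs for term in doc)
    # Keep the `expansion_terms` highest-scoring candidate terms.
    candidates = [term for term, _ in counts.most_common(expansion_terms)]
    # Terms already in the query are not added a second time.
    return list(query_terms) + [t for t in candidates if t not in query_terms]


ranked = [
    ["terrier", "retrieval", "index"],
    ["terrier", "retrieval", "dog"],
    ["terrier", "retrieval"],
    ["cat", "cat", "cat", "cat"],  # rank 4: outside the top 3, so ignored
]
print(expand_query(["terrier"], ranked, expansion_documents=3, expansion_terms=2))
```

Raising expansion_documents widens the pool of candidate terms, while raising expansion_terms lets more of those candidates into the expanded query.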
To retrieve from an indexed test collection, using query expansion, with the term frequency normalisation parameter equal to 1.0, we can type:
bash-2.05b$ bin/trec_terrier.sh -r -q -c 1.0
The results are saved in the directory var/results in a file named as follows:
"weighting scheme" c "value of c"_counter.res
For example, if we have used the weighting scheme PL2 with c=1.28 and the counter was 2, then the filename of the results would be PL2c1.28_2.res.
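When post-processing runs with a script, it can help to build the expected result path from the naming pattern given above. The following Python sketch only mirrors that stated pattern; the function name result_path is hypothetical, and the exact rendering of the c value in the filename is an assumption.

```python
from pathlib import PurePosixPath


def result_path(model, c, counter, results_dir="var/results"):
    """Build a path following "weighting scheme" c "value of c"_counter.res.

    Hypothetical helper; it does not read or update Terrier's counter.
    """
    filename = f"{model}c{c}_{counter}.res"
    return str(PurePosixPath(results_dir) / filename)


print(result_path("InL2", 1.0, 0))
```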
For each query, Terrier returns at most 1000 documents by default. We can change the maximum number of returned documents per query by changing matching.retrieved_set_size. For example, if we want to retrieve 10000 documents for each given query, we need to set matching.retrieved_set_size to 10000. In addition, we need to set the rank of the last returned document to 9999 in querying.default.controls, because ranks are counted from 0.
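The two settings have to stay consistent: the last rank is always one less than the retrieved set size. The Python sketch below derives both values from the desired number of documents. The helper name is hypothetical, and the start:0,end:N form of the querying.default.controls value is an assumption that should be checked against the terrier.properties file shipped with your version.

```python
def retrieved_set_properties(num_documents):
    """Return the two property values for a given result list size.

    Hypothetical helper; the "start:0,end:N" control syntax is assumed.
    """
    if num_documents < 1:
        raise ValueError("num_documents must be at least 1")
    return {
        "matching.retrieved_set_size": str(num_documents),
        # Ranks start at 0, so the last rank is num_documents - 1.
        "querying.default.controls": f"start:0,end:{num_documents - 1}",
    }


for name, value in retrieved_set_properties(10000).items():
    print(f"{name}={value}")
```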
Some of the weighting models, e.g. BM25, assume low document frequencies of query terms. For these models, it is worth ignoring query terms with high document frequency during retrieval by setting ignore.low.idf.terms to true. However, it is better to set ignore.low.idf.terms to false for high precision search tasks such as named-page finding.
Copyright © 2015 University of Glasgow | All Rights Reserved