Using Terrier for experiments with TREC collections
Terrier can be readily used for experimentation with test collections used in the Text REtrieval Conference. Below we describe how to index and how to perform retrieval from a TREC test collection. You may also refer to the example of indexing, retrieving and evaluating results for the TREC WT2G collection.
After having installed Terrier, we proceed with indexing a document collection. In order to do so, we need to go through the following steps.
The first step consists of running the script bin/trec_setup.sh. This script takes one parameter, which corresponds to the directory under which the document collection to be indexed is stored. For example, if Terrier has been installed in /local/terrier, and the document collection to be indexed is stored under directory /local/collection, we should write:
bash-2.05b$ cd /local/terrier
bash-2.05b$ bin/trec_setup.sh /local/collection
The script trec_setup.sh creates the default configuration files, used for indexing a TREC collection. These files are stored in the directory etc and they are:
While running the script bin/trec_setup.sh, the contents of the automatically generated file collection.spec are displayed in order for the user to verify that the file contains only the files of the collection to index. Alternatively, it can be created manually in the following way. If the document collection files are the following:
/local/collection/d1/f1
/local/collection/d1/f2
/local/collection/d2/f1
/local/collection/d2/f2
then the file collection.spec can be created as follows:
bash-2.05b$ find /local/collection -type f -name 'f?' > etc/collection.spec
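As a minimal sketch of the manual approach, the following script recreates the directory layout above under a temporary directory and writes a specification file from it. The temporary paths are hypothetical stand-ins for /local/collection and etc/collection.spec, and sorting the list is an optional choice that makes the file easier to check by eye.

```shell
#!/bin/sh
# Hypothetical stand-in for /local/collection, created only for this demo.
collection=$(mktemp -d)
mkdir -p "$collection/d1" "$collection/d2"
for f in d1/f1 d1/f2 d2/f1 d2/f2; do
    : > "$collection/$f"
done

# Hypothetical stand-in for etc/collection.spec.
spec=$(mktemp)

# Quote the pattern so the shell passes it to find unexpanded;
# -type f keeps directories out of the list.
find "$collection" -type f -name 'f?' | sort > "$spec"

# Print the file so its contents can be verified before indexing.
cat "$spec"
spec_lines=$(wc -l < "$spec" | tr -d ' ')
echo "files listed: $spec_lines"
```

The temporary files are left in place so they can be inspected. With a real collection, only the find line is needed, pointed at the actual collection directory.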
The name and the location of the created data structures are specified by the properties terrier.index.prefix and terrier.index.path. The default values of these properties are data and index, respectively. So, assuming that Terrier has been installed in the directory /local/terrier, the created inverted index will be /local/terrier/var/index/data.if. All other data structures will be created in the same directory and will have the same prefix with appropriate extensions.
The next step involves updating the configuration file terrier.properties, if necessary. Among the properties that can be configured, we can specify:
<DOC>
<DOCNO>abc</DOCNO>
<DOCHDR>...</DOCHDR>
...
</DOC>
If we want to process everything that is within the tag <DOC> and also to specify that the document identifier is within <DOCNO>, then we can set the above-mentioned properties as follows:
TrecDocTags.doctag=DOC
TrecDocTags.idtag=DOCNO
TrecDocTags.process=
TrecDocTags.skip=
If we don't want to index the contents of the tag <DOCHDR>, then we can write:
TrecDocTags.doctag=DOC
TrecDocTags.idtag=DOCNO
TrecDocTags.skip=DOCHDR
For more details on other properties, refer to the classes uk.ac.gla.terrier.utility.TagSet, uk.ac.gla.terrier.utility.ApplicationSetup and the description of the properties you can modify through the configuration file of Terrier.
block.size=1
max.blocks=100000
block.indexing=true
The property block.size specifies the size of each block. A value of 1 means that each term appears in a different block. The property max.blocks specifies the maximum number of blocks a document may contain. If a document contains more blocks than this maximum, then all the additional blocks are added to the last one. The property block.indexing enables indexing with position information. The position information is used during querying for processing phrase or proximity queries. For more details about the query language, you may refer to its description in the guide for developing applications with Terrier.
field.modifiers=0.10d
FieldTags.process=TITLE
The property field.modifiers specifies how much a document's score is increased when a query term appears in one of the specified HTML tags. The property FieldTags.process is a comma-separated list of the HTML tags to process. In the above example, if a query term appears in the <title> tag of a document, then the document's score will be increased by 10 percent.
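When several of these properties need to be changed, editing the file by hand can leave duplicate keys behind. The sketch below shows one way to set a property so that each key appears only once. The helper set_prop is hypothetical and not part of Terrier, and the temporary file stands in for etc/terrier.properties.

```shell
#!/bin/sh
# set_prop FILE KEY VALUE: hypothetical helper, not part of Terrier.
# Replaces an existing KEY=... line, or appends one if KEY is absent.
set_prop() {
    file=$1
    key=$2
    value=$3
    touch "$file"
    tmp=$(mktemp)
    awk -v k="$key" -v v="$value" '
        BEGIN { FS = "=" }
        $1 == k { print k "=" v; seen = 1; next }
        { print }
        END { if (!seen) print k "=" v }
    ' "$file" > "$tmp" && mv "$tmp" "$file"
}

# Hypothetical stand-in for etc/terrier.properties,
# starting with an empty skip list.
props=$(mktemp)
printf 'TrecDocTags.skip=\n' > "$props"

set_prop "$props" TrecDocTags.doctag DOC
set_prop "$props" TrecDocTags.idtag DOCNO
set_prop "$props" TrecDocTags.skip DOCHDR   # overwrites the empty value
set_prop "$props" block.indexing true

cat "$props"
```

The helper assumes simple values without backslashes, which covers the properties discussed in this section.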
After updating the required files and setting the properties, we can proceed with indexing the collection, and creating the direct file, lexicon, document index, inverted file and collection statistics file:
bash-2.05b$ bin/trec_terrier.sh -i
For more information about the available options of the script bin/trec_terrier.sh, you may obtain a help message by typing:
bash-2.05b$ bin/trec_terrier.sh --help
After the end of the indexing process, we can proceed with retrieving from the document collection. At this stage, the options for applying stemming or not, removing stopwords or not, and the maximum length of terms, should be exactly the same as the ones used for indexing the collection.
In the file etc/trec.topics.list, we need to specify which file contains the queries to process. Next, we need to specify which of the available weighting models we will use for assigning scores to the retrieved documents. We do this by specifying the name of the corresponding class in the file etc/trec.models. For example, if we are using the weighting scheme InL2, then the models file should contain:
uk.ac.gla.terrier.matching.models.InL2
A last step before processing the queries is to specify which tags from the topics to use. We can do that by setting the properties TrecQueryTags.process, which denotes which tags to process, TrecQueryTags.idtag, which stands for the tag containing the query identifier, and TrecQueryTags.skip, which denotes which query tags to ignore.
For example, suppose that the format of topics is the following:
<TOP>
<NUM>123<NUM>
<TITLE>title
<DESC>description
<NARR>narrative
</TOP>
If we want to skip the description and narrative (DESC and NARR tags respectively), and consequently use the title only, then we need to set up the properties as follows:
TrecQueryTags.process=TOP
TrecQueryTags.idtag=NUM
TrecQueryTags.skip=DESC,NARR
Alternatively, if we want to skip the title, and consequently use the description and the narrative tags to create the query, then we need to set up the properties as follows:
TrecQueryTags.process=TOP
TrecQueryTags.idtag=NUM
TrecQueryTags.skip=TITLE
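The retrieval setup described in this section touches three files. As a rough sketch, the script below writes all three for title-only queries with InL2. The temporary directory is a hypothetical stand-in for Terrier's etc directory, and the topics path /local/topics/topics.txt is an invented example that should be replaced with the real location of the topics file.

```shell
#!/bin/sh
# Hypothetical stand-in for Terrier's etc directory.
etc=$(mktemp -d)

# Hypothetical topics file path; replace with the real location.
printf '%s\n' /local/topics/topics.txt > "$etc/trec.topics.list"

# The class name of the weighting model to use.
printf '%s\n' uk.ac.gla.terrier.matching.models.InL2 > "$etc/trec.models"

# Title-only queries: process TOP, read the identifier from NUM,
# and skip the DESC and NARR tags.
cat >> "$etc/terrier.properties" <<'EOF'
TrecQueryTags.process=TOP
TrecQueryTags.idtag=NUM
TrecQueryTags.skip=DESC,NARR
EOF

# Show the query tag settings that were written.
grep '^TrecQueryTags\.' "$etc/terrier.properties"
```

To use the description and narrative instead, only the last property changes, to TrecQueryTags.skip=TITLE.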
To process the queries, we can type the following:
bash-2.05b$ bin/trec_terrier.sh -r -c 1.0
where the option -r specifies that we want to perform retrieval, and the option -c 1.0 specifies the parameter value for the term frequency normalisation. If the option -c is not specified, then the default value of 1.0 is used. This default value can be altered by setting the property term.freq.norm.parameter in the properties file.
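To compare several values of the parameter c, retrieval can be repeated in a loop. The following is a sketch only: the list of values is illustrative, and the TERRIER_HOME variable is an assumption of this example rather than something Terrier defines here. When the script is not found, the command is printed instead of executed, so the loop can be tried safely.

```shell
#!/bin/sh
# Assumed installation directory; TERRIER_HOME is a hypothetical
# override used only by this sketch.
terrier_home=${TERRIER_HOME:-/local/terrier}

# run_retrieval C: run retrieval with the given c value, or print the
# command if bin/trec_terrier.sh is not available.
run_retrieval() {
    c=$1
    if [ -x "$terrier_home/bin/trec_terrier.sh" ]; then
        (cd "$terrier_home" && bin/trec_terrier.sh -r -c "$c")
    else
        echo "would run: bin/trec_terrier.sh -r -c $c"
    fi
}

# Illustrative values of c.
for c in 0.5 1.0 2.0; do
    run_retrieval "$c"
done
```

Each run writes its own results file, so the runs can be evaluated separately afterwards.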
Terrier also offers query expansion functionality. For a brief description of the query expansion module, you may view the query expansion section of the DFR Framework description. The term weighting model used for expanding the queries with the most informative terms of the top-ranked documents is specified in the file etc/qemodels. This file contains the class names of the term weighting models to be used for query expansion. The default content of the file is:
uk.ac.gla.terrier.matching.models.queryexpansion.Bo1
In addition, there are two parameters that can be set for applying query expansion. The first one is the number of terms to expand a query with. It is specified by the property expansion.terms, the default value of which is 10. The second one is the number of top-ranked documents from which these terms are extracted. It is specified by the property expansion.documents, the default value of which is 3.
To retrieve from an indexed test collection, using query expansion, with the term frequency normalisation parameter equal to 1.0, we can type:
bash-2.05b$ bin/trec_terrier.sh -r -q -c 1.0
The results are saved in the directory var/results in a file named as follows:
"weighting scheme" c "value of c"_counter.res
For example, if we have used the weighting scheme PL2 with c=1.28 and the counter was 2, then the filename of the results would be PL2c1.28_3.res.
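The naming pattern can be sketched as a small shell function, which may help when locating the output of a particular run in a script. The helper result_name is hypothetical and not part of Terrier. It only formats the three components, and the counter passed in is the number that appears in the filename.

```shell
#!/bin/sh
# result_name MODEL C COUNTER: hypothetical helper, not part of Terrier.
# Formats a results filename following the pattern described above.
result_name() {
    printf '%sc%s_%s.res\n' "$1" "$2" "$3"
}

result_name PL2 1.28 3    # prints PL2c1.28_3.res
```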
Copyright © 2015 University of Glasgow | All Rights Reserved