If you are interested in using Terrier straightaway in order to index and retrieve from standard test collections, then you may follow the steps described below. We provide step-by-step instructions for the installation of Terrier on Linux and Windows operating systems and guide you through your first indexing and retrieval steps on a test collection.
Terrier’s single requirement consists of an installed Java JRE 1.8.0 or higher. You can download the JRE, or the JDK (if you want to develop with Terrier, or run the web-based interface), from the Java website.
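The version requirement can be checked from a script. The following is a minimal sketch, not part of Terrier: the function name `java_version_ok` is hypothetical, and it assumes the version string uses either the legacy `1.8.0_292` scheme or the newer `11.0.2` scheme.

```shell
#!/bin/sh
# Return success (0) when a Java version string is 1.8 or later.
# java_version_ok is a hypothetical helper, not a Terrier script.
java_version_ok() {
    version=$1
    major=${version%%.*}              # text before the first dot
    if [ "$major" = "1" ]; then
        rest=${version#*.}            # drop the leading "1."
        minor=${rest%%[!0-9]*}        # leading digits of the remainder
        [ "${minor:-0}" -ge 8 ]
    else
        major=${major%%[!0-9]*}       # strip suffixes such as "-ea"
        [ "${major:-0}" -ge 9 ]
    fi
}

# Example: check the java found on the PATH, if there is one.
if command -v java >/dev/null 2>&1; then
    # "java -version" prints to stderr; the version is the first quoted string.
    found=$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
    if java_version_ok "$found"; then
        echo "Java $found is recent enough for Terrier"
    else
        echo "Java $found is too old; Terrier needs 1.8 or later"
    fi
fi
```

The comparison uses only POSIX parameter expansion, so it should behave the same under `sh` and `bash`.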
Terrier version 4.2 can be downloaded from the following location: Terrier Home. The site offers pre-compiled releases of the current and previous versions of Terrier for Unix and Windows.
After having downloaded Terrier, copy the file to the directory where you want to install Terrier. Navigate to this directory and execute the following command to decompress the distribution:
tar -zxvf terrier-core-4.2-bin.tar.gz
This will result in the creation of a terrier directory in your current directory. Next we will have to make sure that you have the correct Java version available on the system. Type:
If the environment variable $JAVA_HOME is set, this command will output the path of your Java installation (e.g. /usr/java/jre1.8.0). If this shows that you have a suitable Java version (1.8.0 or later) installed, then you're all done. If your system does not meet these requirements, you can download Java 1.8 from the JRE 1.8 download website and set the environment variable by including the following line in either your /etc/profile or ~/.bashrc file:
To use Terrier, simply extract the contents of the downloaded Zip file into a directory of your choice. Terrier requires Java version 1.8 or higher. If your system does not meet this requirement, you can download an appropriate version from the JRE download website. Finally, Terrier assumes that java.exe is on the path, so you should use the System applet in the Control Panel to ensure that your Java\bin folder is in your PATH environment variable.
The following instructions are equally applicable to Windows, with the exception that the .bat scripts are used instead of .sh.
Terrier comes with three applications:
This allows you to easily index, retrieve, and evaluate results on TREC collections. In the next section, we provide you with a step-by-step tutorial of how to use this application.
This allows you to do interactive retrieval, which is a quick way to test Terrier. On Windows, you can start Interactive Terrier by executing the interactive_terrier.bat file in Terrier’s bin directory. On a Unix system or Mac, you can run Interactive Terrier by executing the interactive_terrier.sh file. You can configure the retrieval functionality of Interactive Terrier using the properties described in the InteractiveQuerying class.
A sample desktop search application, available separately from GitHub.
This guide will provide step-by-step instructions for using Terrier to index a TREC collection. We assume that the operating system is Linux, and that the collection, along with the topics and the relevance assessments (qrels), is stored in the directory
In our example we are using a collection called VASWANI_NPL located at share/vaswani_npl/. It follows the layout of a traditional TREC test collection, with a corpus file, topics, and relevance assessments (qrels), all in the usual TREC formats.
$ head share/vaswani_npl/corpus/doc-text.trec
<DOC>
<DOCNO>1</DOCNO>
compact memories have flexible capacities a digital data storage system with capacity up to bits and random and or sequential access is described
</DOC>
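Before indexing, it can be useful to check what the corpus file contains. The sketch below uses only standard Unix tools; the function names `list_docnos` and `count_docs` are hypothetical, and it assumes that each `<DOCNO>...</DOCNO>` pair sits on a single line and each `<DOC>` tag starts a line, as in the sample above.

```shell
# List the DOCNO of every document in a TREC-formatted corpus file.
# Assumes the opening and closing DOCNO tags are on the same line.
list_docnos() {
    sed -n 's/.*<DOCNO>[[:space:]]*\([^<[:space:]]*\)[[:space:]]*<\/DOCNO>.*/\1/p' "$1"
}

# Count documents by counting lines that begin with the <DOC> tag.
# The pattern requires ">" right after "DOC", so <DOCNO> lines are not counted.
count_docs() {
    grep -c '^<DOC>' "$1"
}
```

For example, `count_docs share/vaswani_npl/corpus/doc-text.trec` prints the number of documents in the corpus file, which can later be compared with the number of documents reported by the indexer.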
To set up Terrier for this corpus, run:
This will result in the creation of a collection.spec file in the etc directory. This file contains a list of the document files contained in the specified corpus directory.
If necessary, check and modify the collection.spec file. This might be required if the collection directory contains files that you do not want to index (READMEs, etc.).
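Removing unwanted entries can be done by hand in a text editor, or with a small filter. The following is a sketch under the assumption that collection.spec is plain text with one file path per line; the function name `prune_spec` is hypothetical.

```shell
# Print a collection.spec with README-like entries removed.
# An entry is dropped when its final path component starts with "readme"
# (case-insensitive); directories with such names are left alone.
prune_spec() {
    # -i: ignore case, -v: keep only lines that do NOT match, -E: extended regex
    grep -i -v -E '(^|/)readme[^/]*$' "$1"
}

# Usage (writes to a new file so the original can be compared first):
#   prune_spec etc/collection.spec > etc/collection.spec.new
```

Writing to a separate file rather than editing in place makes it easy to review the difference before replacing the original.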
Now we are ready to begin indexing the collection. This is achieved using the trec_terrier.sh script with the -i option, as follows:
$ bin/trec_terrier.sh -i
16:00:03.028 [main] INFO o.terrier.indexing.CollectionFactory - Finished reading collection specification
16:00:03.046 [main] INFO o.t.i.MultiDocumentFileCollection - TRECCollection 0% processing share/vaswani_npl/corpus//doc-text.trec
16:00:03.116 [main] INFO o.t.structures.indexing.Indexer - creating the data structures data_1
16:00:04.885 [main] INFO o.t.structures.indexing.Indexer - Collection #0 took 1 seconds to index (11429 documents)
16:00:04.918 [main] INFO o.t.s.indexing.LexiconBuilder - 6 lexicons to merge
16:00:05.045 [main] INFO o.t.s.indexing.LexiconBuilder - Optimising structure lexicon
16:00:05.047 [main] INFO o.t.structures.FSOMapFileLexicon - Optimising lexicon with 7756 entries
16:00:05.761 [main] INFO o.t.structures.indexing.Indexer - Started building the inverted index...
16:00:05.761 [main] INFO o.t.structures.indexing.Indexer - Started building the inverted index...
16:00:05.766 [main] INFO o.t.s.i.c.InvertedIndexBuilder - Iteration 1 of 1 iterations
16:00:06.929 [main] INFO o.t.s.indexing.LexiconBuilder - Optimising structure lexicon
16:00:06.930 [main] INFO o.t.structures.FSOMapFileLexicon - Optimising lexicon with 7756 entries
16:00:06.954 [main] INFO o.t.structures.indexing.Indexer - Finished building the inverted index...
16:00:06.954 [main] INFO o.t.structures.indexing.Indexer - Time elapsed for inverted file: 1
With Terrier's default settings, the resulting index will be created in the var/index folder within the Terrier installation folder.
Note: If you do not need the direct index structure (e.g. for query expansion), then you can use bin/trec_terrier.sh -i -j for the faster single-pass indexing.
Once indexing completes, you can verify your index by obtaining its statistics, using the --printstats option of Terrier.
$ bin/trec_terrier.sh --printstats
16:21:45.086 [main] INFO org.terrier.applications.TrecTerrier - Collection statistics:
16:21:45.088 [main] INFO org.terrier.applications.TrecTerrier - number of indexed documents: 11429
16:21:45.088 [main] INFO org.terrier.applications.TrecTerrier - size of vocabulary: 7756
16:21:45.088 [main] INFO org.terrier.applications.TrecTerrier - number of tokens: 271581
16:21:45.089 [main] INFO org.terrier.applications.TrecTerrier - number of pointers: 224573
This displays the number of documents, number of tokens, number of terms, etc.
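The raw counts become easier to interpret once turned into averages. The sketch below is plain arithmetic on the four figures printed above; the function name `index_summary` is hypothetical, and it assumes that a "pointer" corresponds to one posting, i.e. one (term, document) pair.

```shell
# Derive two summary figures from the collection statistics.
# Arguments: documents, vocabulary size, tokens, pointers.
index_summary() {
    awk -v docs="$1" -v terms="$2" -v tokens="$3" -v pointers="$4" 'BEGIN {
        printf "average document length: %.2f tokens\n", tokens / docs
        printf "average postings per term: %.2f\n", pointers / terms
    }'
}

# Figures taken from the --printstats output shown above.
index_summary 11429 7756 271581 224573
# average document length: 23.76 tokens
# average postings per term: 28.95
```

Documents of roughly 24 tokens are short, which is consistent with the single-sentence abstract shown in the corpus sample.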
First, let's see if we can get search results from our index. We can use the bin/interactive_terrier.sh script to query the index for results.
$ bin/interactive_terrier.sh
16:30:07.139 [main] INFO o.t.structures.CompressingMetaIndex - Structure meta reading lookup file into memory
16:30:07.146 [main] INFO o.t.structures.CompressingMetaIndex - Structure meta loading data file into memory
16:30:07.152 [main] INFO o.t.applications.InteractiveQuerying - time to intialise index : 0.086
Please enter your query: compressed
16:30:26.624 [main] INFO o.t.matching.PostingListManager - Query 1 with 1 terms has 1 posting lists
Displaying 1-22 results
0 11196 6.965311483754079
1 6891 6.861351572397433
2 8706 6.6285666210018395
3 6812 6.419975936835514
4 3286 6.0561185692309065
5 4007 5.744292373685925
6 70 5.603313027948017
...
Please enter your query: exit
In responding to the query compressed, Terrier deemed document 11196 to be the most relevant, with a score of 6.96. The identifier 11196 was recorded from the DOCNO tag of the corresponding document.
Information retrieval has a history of evaluating search effectiveness automatically, using queries with associated relevance assessments. In order to perform retrieval using an existing index, follow the steps described below.
Terrier's properties can be set in the etc/terrier.properties file, or specified individually on the command line. In the following, we are going to use the command line to specify the appropriate properties. To perform retrieval and evaluate the results of a batch of queries, we need to know:
a. The location of the queries (also known as topic files) - specified using trec.topics.
b. The weighting model (e.g. TF_IDF) to use, along with any parameter - specified using trec.model. The default is InL2.
c. The corresponding relevance assessments file (or qrels) for the topics - specified using trec.qrels.
The -r option tells Terrier to do a batch retrieval run, i.e. to retrieve the documents estimated to be the most relevant for each query in the topics file. However, instead of setting the trec.topics property in the terrier.properties file, we specify it on the command line (all other configuration uses Terrier’s default settings):
$ bin/trec_terrier.sh -r -Dtrec.topics=share/vaswani_npl/query-text.trec
...
16:14:43.440 [main] INFO o.t.matching.PostingListManager - Query 93 with 10 terms has 10 posting lists
16:14:43.444 [main] INFO o.t.a.batchquerying.TRECQuerying - Time to process query: 0.006
16:14:43.461 [main] INFO o.t.a.batchquerying.TRECQuerying - Settings of Terrier written to var/results/InL2c1_0.res.settings
16:14:43.461 [main] INFO o.t.a.batchquerying.TRECQuerying - Finished topics, executed 93 queries in 0.866 seconds, results written tovar/results/InL2c1_0.res
Time elapsed: 0.987 seconds.
If all goes well, this will result in a .res file called InL2c1_0.res in the var/results directory. We call each .res file a run.
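A quick sanity check on a run is to count how many documents were retrieved for each query. The sketch below assumes the run file uses the conventional whitespace-separated TREC run layout (query id, the literal Q0, document number, rank, score, run tag); the function name `results_per_query` is hypothetical.

```shell
# Print "<query id> <number of retrieved documents>" for a run file.
# Only the first column (the query id) is used, so the check still works
# if the remaining columns differ from the assumed layout.
results_per_query() {
    awk '{ count[$1]++ } END { for (q in count) print q, count[q] }' "$1" | sort -n
}

# Usage:
#   results_per_query var/results/InL2c1_0.res | head
```

Queries that retrieved no documents at all do not appear in the run file, so they will be absent from this listing as well.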
You can also configure more options on the command line, e.g.:
$ bin/trec_terrier.sh -r -Dtrec.model=BM25 -c 0.4 -Dtrec.topics=share/vaswani_npl/query-text.trec
So what are these? The -r parameter instructs Terrier to perform retrieval, while -Dtrec.model=BM25 tells Terrier to use the BM25 weighting model. The -c option sets the parameter of the weighting model. BM25 is a classical Okapi model first defined by Stephen Robertson, while InL2 is a Divergence From Randomness weighting model (to learn more, see the description of the DFR framework).
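When comparing several weighting models, it helps to generate the command lines in a loop. The sketch below only prints the commands rather than running them, and uses only options that appear in this guide; the function name `print_runs` is hypothetical.

```shell
# Print one batch retrieval command per weighting model.
# First argument: topics file; remaining arguments: model names.
print_runs() {
    topics=$1
    shift
    for model in "$@"; do
        echo "bin/trec_terrier.sh -r -Dtrec.model=$model -Dtrec.topics=$topics"
    done
}

# The three models named in this guide.
print_runs share/vaswani_npl/query-text.trec InL2 BM25 TF_IDF
```

Piping the output to `sh` from the Terrier installation directory would execute the commands one after another, with each run written to its own .res file.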
To evaluate the results, use the -e option of trec_terrier:
$ bin/trec_terrier.sh -e -Dtrec.qrels=share/vaswani_npl/qrels
16:27:28.527 [main] INFO o.t.evaluation.TrecEvalEvaluation - Evaluating result file: var/results/InL2c1.0_0.res
Average Precision: 0.2948
Terrier will look at the var/results directory, evaluate each .res file, and save the output in a .eval file with the same name as the corresponding .res file.
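With several runs in var/results, the headline figure of each can be collected in one pass. The sketch below assumes that every .eval file contains a line of the form `Average Precision: 0.2948`, as in the console output above; the exact layout of the file may differ between Terrier versions, and the function name `summarise_evals` is hypothetical.

```shell
# Print "<file name> <average precision>" for each .eval file in a directory.
summarise_evals() {
    for f in "$1"/*.eval; do
        [ -e "$f" ] || continue     # no .eval files: the glob stays unexpanded
        # Keep only the value that follows the "Average Precision:" label.
        map=$(sed -n 's/^Average Precision:[[:space:]]*//p' "$f" | head -n 1)
        echo "$(basename "$f") $map"
    done
}

# Usage:
#   summarise_evals var/results
```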
To perform retrieval with query expansion (QE) enabled, use the -q parameter in addition to -r:
bin/trec_terrier.sh -r -q
See the guide for configuring retrieval for more information about QE. Note that your index must have a direct index structure to support QE, which is not built by default with single-pass indexing (see Configuring Indexing for more information). Afterwards, we can run the evaluation again by using trec_terrier.sh with the -e option:
bin/trec_terrier.sh -e -Dtrec.qrels=share/vaswani_npl/qrels
The obtained MAP for InL2 should be 0.2948.
The obtained MAP for BM25 should be 0.2992.
The obtained MAP for the run using InL2 with query expansion should be 0.3020.
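To put these figures in context, the relative change over the InL2 baseline can be computed directly. This is plain arithmetic on the MAP values quoted above; the function name `relative_gain` is hypothetical.

```shell
# Print the relative change from a baseline MAP to a new MAP, in percent.
relative_gain() {
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%+.2f%%\n", (new - base) / base * 100 }'
}

relative_gain 0.2948 0.2992   # BM25 against InL2: +1.49%
relative_gain 0.2948 0.3020   # InL2 with query expansion against InL2: +2.44%
```

Differences of this size on a small collection are modest, so they are best read as a check that the runs were configured as intended rather than as evidence that one model is better.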
You can interact with your index using a Web-based querying interface. Firstly, start the included HTTP server:
You can then enter queries and view results at http://localhost:8080 (if you're running Terrier on another machine, replace localhost with the hostname of the remote machine). Terrier can provide more information in the search results; for details on configuring the Web interface, please see Using Web-based results.