[TR-558] Using Terrier 5.1 on CLEF Social Book Search data set Created: 22/May/19 Updated: 22/May/19
|Component/s:||.evaluation, .indexing, .querying|
|Affects Version/s:||5.0, 4.4, 5.1|
|Reporter:||Irfan Ullah||Assignee:||Craig Macdonald|
|Labels:||collection, evaluation, indexing, qrels|
|Attachments:||Sample XML file.xml SampleTREC file generated from multiple XML files using the python script.xml trec-conversion3.py|
The Terrier IR platform has built-in support for various document collections and data sets, along with tutorials on how to use them. However, no such tutorial is available for working with the CLEF Social Book Search data set (XML document corpus, qrels, topic sets) at: http://social-book-search.humanities.uva.nl/#/overview.
To use Terrier in batch IR and evaluation experiments, the collection needs to be converted to TREC format. For this I followed the article "Indexing INEX SBS Corpus in Terrier", available at: https://lab.hypotheses.org/1129?unapproved=9747&moderation-hash=acc58886c603b45e424e66f737be9c50 and used its Python script to convert the SBS collection. All the XML files in a folder are now combined into one XML file, in which each book is represented as:
<isbn>isbn of the book</isbn>
<text> all the text without xml tags from the corresponding XML document</text>
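For readers without the attachments, the flattening step described above can be sketched roughly as follows. This is a minimal illustration and not the attached script: the function name `flatten_book` is hypothetical, and the sketch assumes each book's ISBN sits in a direct child element named `<isbn>`.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def flatten_book(xml_string):
    """Collapse one book XML document into an <isbn>/<text> record."""
    root = ET.fromstring(xml_string)
    # Assumption: the ISBN is a direct child element called <isbn>.
    isbn = (root.findtext("isbn") or "").strip()
    # itertext() walks every descendant, so all tag structure is discarded here.
    text = " ".join(" ".join(root.itertext()).split())
    # Escape &, < and > so the combined output file stays well-formed.
    return "<isbn>{}</isbn>\n<text>{}</text>".format(escape(isbn), escape(text))


if __name__ == "__main__":
    sample = (
        "<book><isbn>0123456789</isbn><title>A Title</title>"
        "<review>Good &amp; short</review></book>"
    )
    print(flatten_book(sample))
```

Because every element's text ends up in the same `<text>` string, the information about which field a term came from is lost at this point.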
With this format we can use TF-IDF, BM25, InL2 and other models. However, how can BM25F be used, given that all the original fields are now replaced by a single <text> field? For my research, I need BM25F alongside the other retrieval models available in the Terrier IR platform.
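One possible direction, sketched here only as an assumption, is to change the conversion so that selected fields are kept as separate tags instead of being merged into `<text>`. The field names `title` and `review` and the function `convert_with_fields` are hypothetical; the actual SBS element names should be taken from the sample XML file. Whether Terrier then indexes these tags as separate fields depends on its indexing configuration, which is not shown here.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical field names; replace with the element names used in the SBS XML.
FIELDS = ("title", "review")


def convert_with_fields(xml_string, fields=FIELDS):
    """Convert one book XML document, keeping one output tag per chosen field."""
    root = ET.fromstring(xml_string)
    # Assumption: the ISBN is a direct child element called <isbn>.
    isbn = (root.findtext("isbn") or "").strip()
    lines = ["<isbn>{}</isbn>".format(escape(isbn))]
    for name in fields:
        # Collect the text of every element with this tag, at any depth.
        parts = [" ".join("".join(el.itertext()).split()) for el in root.iter(name)]
        content = " ".join(p for p in parts if p)
        lines.append("<{0}>{1}</{0}>".format(name, escape(content)))
    return "\n".join(lines)


if __name__ == "__main__":
    sample = (
        "<book><isbn>0123456789</isbn><title>A Title</title>"
        "<reviews><review>Good &amp; short</review><review>Fine</review></reviews></book>"
    )
    print(convert_with_fields(sample))
```

With output of this shape, per-field term statistics remain recoverable, which is what a field-weighted model such as BM25F needs.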
1. trec-conversion3.py to convert XML files into a single TREC file
2. Sample XML file.xml from the set of original XML files
3. Sample TREC file generated by the python script from the multiple XML files