[TR-558] Using Terrier 5.1 on CLEF Social Book Search data set Created: 22/May/19  Updated: 22/May/19

Status: Open
Project: Terrier Core
Component/s: .evaluation, .indexing, .querying
Affects Version/s: 5.0, 4.4, 5.1
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Irfan Ullah Assignee: Craig Macdonald
Resolution: Unresolved  
Labels: collection, evaluation, indexing, qrels

Attachments: Sample XML file.xml; SampleTREC file generated from multiple XML files using the python script.xml; trec-conversion3.py

 Description   
The Terrier IR platform has built-in support for various document collections and data sets, along with tutorials on how to use them. However, no such tutorial is available for the CLEF Social Book Search data set (XML document corpus, qrels, topic sets) at: http://social-book-search.humanities.uva.nl/#/overview.

To use Terrier in batch IR and evaluation experiments, the collection needs to be converted to TREC format. For this I followed the article "Indexing INEX SBS Corpus in Terrier", available at: https://lab.hypotheses.org/1129?unapproved=9747&moderation-hash=acc58886c603b45e424e66f737be9c50, and used its Python script to convert the SBS collection. All the XML files in a folder are now combined into one XML file, in which each book is represented as:
    <book>
    <isbn>isbn of the book</isbn>
    <text> all the text without xml tags from the corresponding XML document</text>
    </book>
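
The flattening step described above can be sketched roughly as follows. This is not the attached trec-conversion3.py, only a minimal illustration of the same idea; the function name `flatten_book` is hypothetical, and it assumes each source document stores the ISBN in an `<isbn>` element.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def flatten_book(xml_source: str) -> str:
    """Collapse one book's XML into a <book>/<isbn>/<text> record."""
    root = ET.fromstring(xml_source)
    # Assumption: the ISBN lives in an <isbn> element below the root.
    isbn = (root.findtext(".//isbn") or "").strip()
    # itertext() yields every text node in document order, with tags dropped;
    # split/join normalises runs of whitespace to single spaces.
    text = " ".join(" ".join(root.itertext()).split())
    return (
        "<book>\n"
        f"<isbn>{escape(isbn)}</isbn>\n"
        f"<text>{escape(text)}</text>\n"
        "</book>\n"
    )
```

Because every element's text ends up in the single `<text>` element, the record no longer shows which part of the book a term came from.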

We can use TF-IDF, BM25, InL2 and other models, but:

What about BM25F, now that all the fields have been replaced by a single <text> field? For my research, I need to run BM25F alongside the other retrieval models available in the Terrier IR platform.
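
One possible direction, sketched here under assumptions, is to keep selected elements as separate tags during conversion. Field-based models such as BM25F need per-field term frequencies, and these can only be recorded if the fields stay distinguishable in the input. The field names (`title`, `review`, `tag`) and the function name `book_to_fielded_record` are hypothetical and should be adapted to the elements actually present in the SBS files. Terrier's documentation describes a `FieldTags.process` property for naming the tags to treat as fields; whether that is sufficient for this collection and Terrier version is not confirmed here.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical field list: adjust to the elements present in the SBS XML.
FIELDS = ("title", "review", "tag")


def book_to_fielded_record(xml_source: str, fields=FIELDS) -> str:
    """Convert one book's XML into a record with one tag per field."""
    root = ET.fromstring(xml_source)
    # Assumption: the ISBN lives in an <isbn> element below the root.
    isbn = (root.findtext(".//isbn") or "").strip()
    lines = ["<book>", f"<isbn>{escape(isbn)}</isbn>"]
    for name in fields:
        # Merge every occurrence of the element (e.g. many reviews)
        # into a single field, with whitespace normalised.
        parts = [" ".join(el.itertext()) for el in root.iter(name)]
        text = " ".join(" ".join(parts).split())
        lines.append(f"<{name}>{escape(text)}</{name}>")
    lines.append("</book>")
    return "\n".join(lines) + "\n"
```

With records in this shape, each field can in principle be weighted separately at retrieval time, which is not possible once everything has been merged into `<text>`.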

Please help!

Files uploaded:
1. trec-conversion3.py to convert XML files into a single TREC file
2. Sample XML file.xml from the set of original XML files
3. Sample TREC file generated by the python script from the multiple XML files

Generated at Wed Feb 26 05:59:39 GMT 2020 using JIRA 7.1.1#71004-sha1:d6b2c0d9b7051e9fb5e4eb8ce177ca56d91d7bd8.