  1. Terrier Core
  2. TR-558

Using Terrier 5.1 on CLEF Social Book Search data set

    Details

      Description

      The Terrier IR platform has built-in support for various document collections and data sets, along with tutorials on how to use them. However, no such tutorial is available for the CLEF Social Book Search data set (XML document corpus, qrels, topic sets) at: http://social-book-search.humanities.uva.nl/#/overview.

      To use Terrier in batch IR and evaluation experiments, the collection needs to be converted to TREC format. For this I followed the article "Indexing INEX SBS Corpus in Terrier", available at: https://lab.hypotheses.org/1129?unapproved=9747&moderation-hash=acc58886c603b45e424e66f737be9c50 and used its Python script to convert the SBS collection. All the XML files in a folder are now combined into a single XML file, in which each book is represented as:
          <book>
          <isbn>isbn of the book</isbn>
          <text> all the text without xml tags from the corresponding XML document</text>
          </book>
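
      For readers without access to the attachments, the conversion described above can be sketched roughly as follows. This is not the attached script: it is a minimal illustration that assumes each source file holds one book and that the ISBN sits in an element named `isbn` (an assumption; the real SBS tag name may differ). The function names `book_to_record` and `convert_folder` are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from xml.sax.saxutils import escape


def book_to_record(xml_path, isbn_tag="isbn"):
    """Flatten one book XML file into a <book> record.

    isbn_tag is an assumed element name, not taken from the SBS schema.
    """
    root = ET.parse(xml_path).getroot()
    # Fall back to the file name when no ISBN element is found.
    isbn = (root.findtext(f".//{isbn_tag}") or Path(xml_path).stem).strip()
    # itertext() yields every text node, so all XML tags are dropped;
    # split/join collapses runs of whitespace into single spaces.
    text = " ".join(" ".join(root.itertext()).split())
    return (
        "<book>\n"
        f"<isbn>{escape(isbn)}</isbn>\n"
        f"<text>{escape(text)}</text>\n"
        "</book>\n"
    )


def convert_folder(src_dir, out_file):
    """Write one record per *.xml file in src_dir into a single output file."""
    paths = sorted(Path(src_dir).glob("*.xml"))
    with open(out_file, "w", encoding="utf-8") as out:
        for path in paths:
            out.write(book_to_record(path))
    return len(paths)
```

      Because `itertext()` merges every element's text into one string, the original field boundaries (title, reviews, tags and so on) are lost at this step, which is what makes a field-based model such as BM25F problematic on the converted file.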

      With this single-field representation we can use TF-IDF, BM25, InL2 and other models. However:

      How can BM25F be used, given that all the original fields are now merged into a single <text> field? For my research, I need to run BM25F alongside the other retrieval models available in the Terrier IR platform.

      Please help!

      Files uploaded:
      1. trec-coversion3.py to convert XML files into a single TREC file
      2. Sample XML file.xml from the set of original XML files
      3. Sample TREC file generated by the Python script from the multiple XML files

        Attachments

          Activity

          There are no comments yet on this issue.

            People

            • Assignee:
              craigm Craig Macdonald
              Reporter:
              Rocky Xanadul Irfan Ullah