  1. Terrier Core
  2. TR-183

Hiemstra_LM matching implementation seems wrong

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.2
    • Component/s: .matching
    • Labels:
      None

      Description

      Hi,

      I ran some experiments comparing various ranking models across different test collections. I found that the Hiemstra_LM implementation performs poorly in general; for instance, on TREC-7 ad hoc I got a MAP of 0.1791, whereas BM25 achieves 0.2103.

      The documentation of the class Hiemstra_LM refers to the origin of the implementation. I checked the implementation against that source and could not work out which of the proposed weighting schemes was implemented. For that reason I re-implemented the formula "score2(d)" (see page 85 of D. Hiemstra's doctoral thesis). IMO, the crucial part is to incorporate the key frequency of the query terms: "Remember that the sum of i = 1 to n covers the query terms on each position i, which recomputes the weight of duplicate terms. In practice, this might of course be implemented by multiplying the weight of the term by the frequency of occurrence of the term in the query". With this change I could reproduce the empirical results of Hiemstra's thesis, i.e. Hiemstra_LM showing slightly better performance than BM25. I also ran experiments on the TREC-8 and TREC2003.robust test collections, as well as on some other document collections from CLEF I had at hand, and got similar results.
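
      To illustrate the key-frequency point, here is a minimal Python sketch (not the attached Terrier patch, and not Terrier's Java API). It assumes a commonly cited linear-interpolation form of the weight, log(1 + (lambda * tf / doc_length) / ((1 - lambda) * coll_freq / coll_tokens)); whether this matches "score2(d)" term by term should be checked against the thesis. All function and parameter names, and the default lambda of 0.15, are assumptions made for illustration.

```python
import math
from collections import Counter


def hiemstra_lm_term_weight(tf, doc_length, coll_freq, coll_tokens,
                            key_freq=1, lam=0.15):
    """Weight of one query term in one document (hypothetical helper).

    tf          -- occurrences of the term in the document
    doc_length  -- number of tokens in the document
    coll_freq   -- occurrences of the term in the whole collection
    coll_tokens -- number of tokens in the whole collection
    key_freq    -- occurrences of the term in the query
    lam         -- interpolation weight of the document model (assumed 0.15)
    """
    if tf == 0:
        return 0.0  # log(1 + 0): a term absent from the document adds nothing
    p_doc = tf / doc_length
    p_coll = coll_freq / coll_tokens
    # Multiplying by key_freq replaces summing over each query position.
    return key_freq * math.log(1.0 + (lam * p_doc) / ((1.0 - lam) * p_coll))


def score_document(query_terms, doc_tf, doc_length, coll_freq, coll_tokens,
                   lam=0.15):
    """Sum the term weights, counting duplicate query terms via key_freq."""
    score = 0.0
    for term, key_freq in Counter(query_terms).items():
        score += hiemstra_lm_term_weight(
            doc_tf.get(term, 0), doc_length,
            coll_freq.get(term, 1), coll_tokens, key_freq, lam)
    return score
```

      With this sketch, a query that repeats a term scores the same as summing the single-occurrence weight once per query position, which is the equivalence the quoted passage describes.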

      MAP of the attached implementation of Hiemstra_LM (using stopping, Porter stemming and no PRF) vs. previous implementation of Hiemstra_LM.
      ----
      testset;Hiemstra_LM;Hiemstra_LM (patch)
      TREC-7.adhoc;0.1791;0.2137
      TREC-8.adhoc;0.2190;0.2539
      TREC2003.robust.new;0.3269;0.3573
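
      For reference, the relative MAP change per test set can be derived from the table above. The snippet below is only a sketch: it embeds the reported numbers verbatim, and the names RESULTS and relative_gains are made up for illustration.

```python
import csv
import io

# MAP values copied from the table in this report.
RESULTS = """testset;Hiemstra_LM;Hiemstra_LM (patch)
TREC-7.adhoc;0.1791;0.2137
TREC-8.adhoc;0.2190;0.2539
TREC2003.robust.new;0.3269;0.3573
"""


def relative_gains(table_text):
    """Return {testset: relative MAP gain in percent} for a ';'-separated table."""
    reader = csv.reader(io.StringIO(table_text), delimiter=";")
    next(reader)  # skip the header row
    gains = {}
    for testset, before, after in reader:
        before, after = float(before), float(after)
        gains[testset] = 100.0 * (after - before) / before
    return gains


for testset, gain in relative_gains(RESULTS).items():
    print(f"{testset}: {gain:+.1f}% MAP")
```

      On the reported numbers this works out to relative gains of roughly 19%, 16% and 9% for the three test sets.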

        Attachments

          Activity

          craigm Craig Macdonald added a comment -

          Fixed for 4.2, at last

          craigm Craig Macdonald added a comment -

          This bug was mentioned on Twitter (see https://twitter.com/tommy4st/status/773508806965858304?). Tagging for 4.2.

          Thanks to Thomas Wilhelm-Stein for the reminder!


            People

            • Assignee:
              craigm Craig Macdonald
            • Reporter:
              kuej Jens Kürsten
            • Watchers:
              2

              Dates

              • Created:
                Updated:
                Resolved: