Description
Hi,
I ran some experiments comparing various ranking models across different test collections. I found that the Hiemstra_LM implementation is generally less effective than expected: for instance, on TREC-7 ad hoc I got a MAP of 0.1791, whereas BM25 achieves 0.2103.
The documentation of the class Hiemstra_LM refers to the origin of the implementation. I checked both the implementation and the source, and could not work out which of the proposed weighting schemes was implemented. For that reason I re-implemented formula "score2(d)" (see page 85 of D. Hiemstra's doctoral thesis). IMO, the crucial part is to incorporate the key frequency of the query terms: "Remember that the sum of i = 1 to n covers the query terms on each position i, which recomputes the weight of duplicate terms. In practice, this might of course be implemented by multiplying the weight of the term by the frequency of occurrence of the term in the query". With this change I could reproduce the empirical results of Hiemstra's thesis, i.e. Hiemstra_LM showing slightly better performance than BM25. I also ran some experiments using the TREC-8 and TREC2003.robust test collections, as well as some other document collections from CLEF I had at hand, and got similar results.
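To illustrate the key-frequency point, here is a minimal Python sketch. It is not the attached patch and may not match "score2(d)" term for term: it assumes the commonly cited log form of the linearly interpolated language model weight, and the function names, parameter names and the default lambda of 0.15 are chosen for this example only. What it shows is that multiplying a term's weight by its frequency in the query gives the same score as summing the weight over every query position.

```python
import math
from collections import Counter


def hiemstra_lm_weight(tf, doc_len, coll_freq, coll_tokens, query_tf=1, lam=0.15):
    """Weight of one query term in one document (assumed log form).

    tf          -- occurrences of the term in the document
    doc_len     -- number of tokens in the document
    coll_freq   -- occurrences of the term in the whole collection
    coll_tokens -- number of tokens in the whole collection
    query_tf    -- occurrences of the term in the query (key frequency)
    lam         -- interpolation parameter (0.15 is an assumed value)
    """
    if tf == 0:
        # A term absent from the document contributes nothing in this form.
        return 0.0
    odds = (lam * tf * coll_tokens) / ((1.0 - lam) * coll_freq * doc_len)
    # Multiply by the key frequency instead of re-adding duplicate terms.
    return query_tf * math.log(1.0 + odds)


def score(query_terms, doc_tf, doc_len, coll_freq, coll_tokens, lam=0.15):
    """Score a document for a query given as a list of (possibly repeated) terms."""
    key_freq = Counter(query_terms)
    return sum(
        hiemstra_lm_weight(doc_tf.get(term, 0), doc_len, coll_freq.get(term, 0),
                           coll_tokens, query_tf=n, lam=lam)
        for term, n in key_freq.items()
    )
```

A ranking function that drops the `query_tf` factor scores the query "a a b" exactly like "a b", which is the behaviour I suspect in the previous implementation.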
MAP of the attached implementation of Hiemstra_LM (using stopping, Porter stemming and no PRF) vs. previous implementation of Hiemstra_LM.
----
testset;Hiemstra_LM;Hiemstra_LM (patch)
TREC-7.adhoc;0.1791;0.2137
TREC-8.adhoc;0.2190;0.2539
TREC2003.robust.new;0.3269;0.3573
This bug was mentioned on Twitter (see https://twitter.com/tommy4st/status/773508806965858304?). Tagging for 4.2.
Thanks to Thomas Wilhelm-Stein for the reminder!