Terrier Users :
Terrier Forum

**Re: Is terrier's BM25 algorithm exactly the same as the original one?**

**Re: Is terrier's BM25 algorithm exactly the same as the original one?**

**Re: Is terrier's BM25 algorithm exactly the same as the original one?**

**Re: Is terrier's BM25 algorithm exactly the same as the original one?**

**Re: Is terrier's BM25 algorithm exactly the same as the original one?**

General discussion about using/developing applications using Terrier

About BM25...

Posted by:
**
jeffye
** ()

Date: June 25, 2009 04:11AM

I figure Terrier's BM25 is not exactly the same as the original BM25 formula.

Because the parameter keyFrequency in snippet 1 you pass to class BM25 is not the term frequency in the query. Do you have a particular reason? Or could you tell where this revise comes form? In fact, this revise hurts the performance (I test on two Chinese corpus).

In addition, I am rather confused by your QueryExpansion framework, although it works well. Why do u add the weight of the expansion term to the original query weight(see snippet 2. in fact, the original query weight is the revised term frequency detailed above)? Could you the me the theoretical background of your QueryExpansion framework? please tell me the related citations.

code snippet 1: BM25.java

public final float score(

float tf,

float docLength,

float n_t,

float F_t,

float keyFrequency) {

code snippet 2: QueryExpansion.java

query.addTermPropertyWeight(expandedTerm.getTerm(), expandedTerm

.getWeight());

Edited 2 time(s). Last edit at 12/07/2012 11:02AM by craigm.

Because the parameter keyFrequency in snippet 1 you pass to class BM25 is not the term frequency in the query. Do you have a particular reason? Or could you tell where this revise comes form? In fact, this revise hurts the performance (I test on two Chinese corpus).

In addition, I am rather confused by your QueryExpansion framework, although it works well. Why do u add the weight of the expansion term to the original query weight(see snippet 2. in fact, the original query weight is the revised term frequency detailed above)? Could you the me the theoretical background of your QueryExpansion framework? please tell me the related citations.

code snippet 1: BM25.java

public final float score(

float tf,

float docLength,

float n_t,

float F_t,

float keyFrequency) {

code snippet 2: QueryExpansion.java

query.addTermPropertyWeight(expandedTerm.getTerm(), expandedTerm

.getWeight());

Edited 2 time(s). Last edit at 12/07/2012 11:02AM by craigm.

Posted by:
**
jeffye
** ()

Date: June 27, 2009 01:25AM

Anyone can help me?

Posted by:
**
Ben
** ()

Date: June 30, 2009 03:53PM

Hi Jeff

Terrier's implementation of BM25 is slightly different from the original one as you have mentioned in your post. The introduction of the normalised keyFrequency is due to the integration of the DFR query expansion framework that measures the relative importance of query terms by modifying query term frequencies. Description of the DFR QE framework can be found in:

G.~Amati. Probabilistic models for information retrieval based on divergence from randomness. PhD thesis, Department of Computing Science, University of Glasgow, 2003.

or the recent TREC papers of University of Glasgow, e.g.

B.~He, C.~Macdonald, I.~Ounis, J.~Peng, and R.L.T.~Santos. University of Glasgow at TREC 2008: Experiments in Blog, Enterprise, and Relevance Feedback Tracks with Terrier. In Proceedings of TREC 2008. Gaithersbug, MD.

Can you please indicate the MAP values obtained in your experiments? From our experience, the difference is usually minor.

You can also turn off the query term frequency normalisation by setting property querying.normalise.weights to false.

Thanks, ben

Terrier's implementation of BM25 is slightly different from the original one as you have mentioned in your post. The introduction of the normalised keyFrequency is due to the integration of the DFR query expansion framework that measures the relative importance of query terms by modifying query term frequencies. Description of the DFR QE framework can be found in:

G.~Amati. Probabilistic models for information retrieval based on divergence from randomness. PhD thesis, Department of Computing Science, University of Glasgow, 2003.

or the recent TREC papers of University of Glasgow, e.g.

B.~He, C.~Macdonald, I.~Ounis, J.~Peng, and R.L.T.~Santos. University of Glasgow at TREC 2008: Experiments in Blog, Enterprise, and Relevance Feedback Tracks with Terrier. In Proceedings of TREC 2008. Gaithersbug, MD.

Can you please indicate the MAP values obtained in your experiments? From our experience, the difference is usually minor.

You can also turn off the query term frequency normalisation by setting property querying.normalise.weights to false.

Thanks, ben

Posted by:
**
frousseau
** ()

Date: December 05, 2012 09:53AM

Hi all,

Just reopening the thread since there are still some incoherences in the BM25 code.

I am talking about org.terrier.matching.models.BM25.java in Terrier 3.5.

First of all, I am confused by the fact that there are always 2 methods "score" in each defined WeightingModel and they are not calling each other. The code is duplicated and that introduces some small differences/bugs.

As of now, we have:

public double score(double tf, double docLength) {

double K = k_1 * ((1 - b) + b * docLength / averageDocumentLength) + tf;

return (tf * (k_3 + 1d) * keyFrequency / ((k_3 + keyFrequency) * K))

* Idf.log((numberOfDocuments - documentFrequency + 0.5d) / (documentFrequency + 0.5d));

}

public double score(

double tf,

double docLength,

double n_t,

double F_t,

double keyFrequency) {

double K = k_1 * ((1 - b) + b * docLength / averageDocumentLength) + tf;

return Idf.log((numberOfDocuments - n_t + 0.5d) / (n_t+ 0.5d)) *

((k_1 + 1d) * tf / (K + tf)) *

((k_3+1)*keyFrequency/(k_3+keyFrequency));

}

Usually, when K is defined, we omit the "+ tf" in it. In the second method, tf is added twice because of that... I know this method is by default not called but the score is still wrong.

In the first method, the numerator of the tf-based component is tf instead of "(k_1 + 1) * tf". I know that because "k_1 + 1" is a constant, it is not going to affect the ranking since it's a relative one but still, if it was in the original BM25 formula... Moreover, the second method is actually using it.

My suggestion would be :

public double score(double tf, double docLength) {

return score(tf, docLength, documentFrequency, termFrequency, keyFrequency);

}

public double score(

double tf,

double docLength,

double n_t,

double F_t,

double keyFrequency) {

final double q = ((k_3 + 1d) * keyFrequency / (k_3 + keyFrequency));

final double idf = Idf.log((numberOfDocuments - n_t + 0.5d) / (n_t + 0.5d));

final double K = k_1 * ((1 - b) + b * docLength / averageDocumentLength);

return q * idf * (k_1 + 1d) * tf / (K + tf);

}

I would be happy to file a bug in JIRA and contribute with a patch if you think it's worth it.

Cheers,

François

Just reopening the thread since there are still some incoherences in the BM25 code.

I am talking about org.terrier.matching.models.BM25.java in Terrier 3.5.

First of all, I am confused by the fact that there are always 2 methods "score" in each defined WeightingModel and they are not calling each other. The code is duplicated and that introduces some small differences/bugs.

As of now, we have:

public double score(double tf, double docLength) {

double K = k_1 * ((1 - b) + b * docLength / averageDocumentLength) + tf;

return (tf * (k_3 + 1d) * keyFrequency / ((k_3 + keyFrequency) * K))

* Idf.log((numberOfDocuments - documentFrequency + 0.5d) / (documentFrequency + 0.5d));

}

public double score(

double tf,

double docLength,

double n_t,

double F_t,

double keyFrequency) {

double K = k_1 * ((1 - b) + b * docLength / averageDocumentLength) + tf;

return Idf.log((numberOfDocuments - n_t + 0.5d) / (n_t+ 0.5d)) *

((k_1 + 1d) * tf / (K + tf)) *

((k_3+1)*keyFrequency/(k_3+keyFrequency));

}

Usually, when K is defined, we omit the "+ tf" in it. In the second method, tf is added twice because of that... I know this method is by default not called but the score is still wrong.

In the first method, the numerator of the tf-based component is tf instead of "(k_1 + 1) * tf". I know that because "k_1 + 1" is a constant, it is not going to affect the ranking since it's a relative one but still, if it was in the original BM25 formula... Moreover, the second method is actually using it.

My suggestion would be :

public double score(double tf, double docLength) {

return score(tf, docLength, documentFrequency, termFrequency, keyFrequency);

}

public double score(

double tf,

double docLength,

double n_t,

double F_t,

double keyFrequency) {

final double q = ((k_3 + 1d) * keyFrequency / (k_3 + keyFrequency));

final double idf = Idf.log((numberOfDocuments - n_t + 0.5d) / (n_t + 0.5d));

final double K = k_1 * ((1 - b) + b * docLength / averageDocumentLength);

return q * idf * (k_1 + 1d) * tf / (K + tf);

}

I would be happy to file a bug in JIRA and contribute with a patch if you think it's worth it.

Cheers,

François

Posted by:
**
craigm
** ()

Date: December 06, 2012 03:03PM

The 5 param method came first historically. *Hence, issues in method will not impact on the normal effectiveness of Terrier since the 1.x series*

The 2 param method came later, as it makes things faster - i.e. with less parameters, there are less pushes onto the JVM stack etc.

The plan is for Terrier 4 to remove the second method. I need to get 3.6 released first though.

Can you create two issues: one for the correction to 5 param method (tagged for 3.6), and another to remove all 5 param methods (tagged for 4.0). In fact 5 param method should be deprecated for 3.6 too.

Many thanks in advance,

Craig

Edited 3 time(s). Last edit at 12/07/2012 11:22AM by craigm.

The 2 param method came later, as it makes things faster - i.e. with less parameters, there are less pushes onto the JVM stack etc.

The plan is for Terrier 4 to remove the second method. I need to get 3.6 released first though.

Can you create two issues: one for the correction to 5 param method (tagged for 3.6), and another to remove all 5 param methods (tagged for 4.0). In fact 5 param method should be deprecated for 3.6 too.

Many thanks in advance,

Craig

Edited 3 time(s). Last edit at 12/07/2012 11:22AM by craigm.

Posted by:
**
frousseau
** ()

Date: December 06, 2012 05:01PM

Ok.

I'll create one issue to fix BM25 (adding the "k_1 + 1" in the numerator in the 2-param method and removing the "+ tf" in the 5-param method).

I'll create one issue with a "complete" refactoring of the weighting models. In particular, remove the 5-param method but also add a library with methods such as "pivoted(double tf, double slope, double dl, double avdl)", "sublinear(double tf, double k)"... to avoid duplicated code in all the TF-based models, move the Idf.log method to this library and so on. I have already did that this afternoon but I was calling the 5-param method from the 2-param method so I need to change that.

Cheers,

François

I'll create one issue to fix BM25 (adding the "k_1 + 1" in the numerator in the 2-param method and removing the "+ tf" in the 5-param method).

I'll create one issue with a "complete" refactoring of the weighting models. In particular, remove the 5-param method but also add a library with methods such as "pivoted(double tf, double slope, double dl, double avdl)", "sublinear(double tf, double k)"... to avoid duplicated code in all the TF-based models, move the Idf.log method to this library and so on. I have already did that this afternoon but I was calling the 5-param method from the 2-param method so I need to change that.

Cheers,

François

Sorry, only registered users may post in this forum.