Terrier Users :
Terrier Forum


General discussion about using/developing applications using Terrier

Implementation of query expansion -- limiting form of binomial (CS.java)

Posted by: **swefire1**

Date: October 24, 2012 08:36AM

I have noticed this before and made a forum post about it, but now I am close to convinced that there are inconsistencies.

If you take a look at org.terrier.matching.models.queryexpansion.CS and the implementation of the score:

```java
public final double score(
        double withinDocumentFrequency,
        double termFrequency) {
    if (withinDocumentFrequency / this.totalDocumentLength
            < termFrequency / this.collectionLength)
        return 0;
    double f = withinDocumentFrequency / this.totalDocumentLength;
    double p = termFrequency / this.collectionLength;
    double D = f * Idf.log(f, p) + f * Idf.log(1 - f, 1 - p);
    return this.totalDocumentLength * D
        + 0.5d * Idf.log(2 * Math.PI
            * this.totalDocumentLength
            * (1d - withinDocumentFrequency / this.totalDocumentLength));
}
```

This expression can be stated as:

Tot_f * D(p_r,p_c) + 1/2 * log2(2*pi*Tot_f*(1-p_r))

where

D = p_r*log2(p_r/p_c) + p_r*log2((1-p_r)/(1-p_c))

(something like an information-theoretic divergence)

where (borrowing some notation from [1]):

Tot_f - the total frequency of terms in the relevant documents (the sample size)

p_r - the MLE estimate (without smoothing) of Prob(t | relevant)

p_c - likewise for Prob(t | collection)

If we really are approximating -log2(binomial) using the divergence, then why do we have this expression?

I will start off by using formula (2.15) in [1]:

B(Tot_f,k, p_c) ~ 2^(-Tot_f * D(p_r,p_c)) / (2*pi*k*(1-p_r))^0.5 , error: O(1/Tot_f)

D(p_r,p_c) = p_r*log2(p_r/p_c) + (1-p_r)*log2((1-p_r)/(1-p_c))

(the binary KL divergence between the "true" probability p_r and the underlying p_c)

p_r = k/Tot_f => k = p_r*Tot_f (the expected number of tokens according to p_r)

If we then take -log2 of formula (8.11), do we not get:

-log2(B(F_r, Tot_f, p_c)) ~ Tot_f*D(p_r,p_c) + 0.5*log2(2*pi*p_r*Tot_f*(1-p_r)) (*)
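Spelling that step out: taking $-\log_2$ of both sides of (2.15) and substituting $k = p_r \cdot Tot_f$ inside the square root gives

```latex
\[
-\log_2 B(Tot_f, k, p_c)
  \approx Tot_f \cdot D(p_r, p_c) + \tfrac{1}{2}\log_2\!\bigl(2\pi\, k\, (1 - p_r)\bigr)
  = Tot_f \cdot D(p_r, p_c) + \tfrac{1}{2}\log_2\!\bigl(2\pi\, p_r\, Tot_f\, (1 - p_r)\bigr)
\]
```

which is exactly (*).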

1. This is not formula 8.12 in [1] (the divergence uses an unknown "(1-f)", and Tot_f instead of k);

2. nor is it formula 2.16 (the second term is multiplied by Tot_f);

3. and finally it is not the computation done in CS.java.

If we use (*) to approximate -log2(binom_pmf(Tot_f, k, p)), we get a good approximation.

Look at: [imgur.com] (p_c = 0.01)

The formula used in CS.java is also monotonically increasing in p_r, all else held fixed, so the ordering of new terms is the same; but it is not a good approximation of the binomial probability mass function.
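For instance, a quick standalone check (plain Java; the class name `BinomialApprox`, the sample size and p_c are made up for illustration, and the exact value is computed by summing log-factorials rather than via any Terrier class) shows (*) tracking the exact -log2 pmf closely:

```java
public class BinomialApprox {
    static double log2(double x) { return Math.log(x) / Math.log(2); }

    // exact -log2 of the binomial pmf, via summed log-factorials
    static double exactNegLog2Pmf(int n, int k, double p) {
        double s = 0;
        for (int i = 1; i <= n; i++) s += log2(i);
        for (int i = 1; i <= k; i++) s -= log2(i);
        for (int i = 1; i <= n - k; i++) s -= log2(i);
        s += k * log2(p) + (n - k) * log2(1 - p);
        return -s;
    }

    // approximation (*): Tot_f*D(p_r,p_c) + 0.5*log2(2*pi*p_r*Tot_f*(1-p_r))
    static double approxNegLog2Pmf(int n, int k, double p) {
        double pr = (double) k / n;
        double D = pr * log2(pr / p) + (1 - pr) * log2((1 - pr) / (1 - p));
        return n * D + 0.5 * log2(2 * Math.PI * n * pr * (1 - pr));
    }

    public static void main(String[] args) {
        int n = 1000;     // Tot_f, hypothetical sample size
        double p = 0.01;  // p_c, as in the plot
        for (int k = 20; k <= 100; k += 20)
            System.out.printf("k=%3d  exact=%8.4f  approx=%8.4f%n",
                    k, exactNegLog2Pmf(n, k, p), approxNegLog2Pmf(n, k, p));
    }
}
```

With n = 1000 the two columns agree to within a few hundredths of a bit, consistent with the O(1/Tot_f) error term.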

I am writing my master's thesis and this is causing me a lot of trouble: I need to be able to show that I know what I am doing, but CS.java does not make sense to me.

If someone could shed some light on how to derive the approximation in formula 2.15 that would also be nice.

======================================================================================

EDIT:

(*) corresponds to formula (8) in [2]

-------------------------------------------------------------------------------------

References:

[1] G. Amati, "Probability Models for Information Retrieval based on Divergence from Randomness", PhD thesis, University of Glasgow, 2003.

[2] G. Amati and C. J. van Rijsbergen, "Probabilistic Models of Information Retrieval Based on Measuring the Divergence from Randomness", ACM Transactions on Information Systems, Vol. 20, No. 4, October 2002, pp. 357–389.

Edited 5 time(s). Last edit at 10/24/2012 09:20AM by swefire1.


Posted by: **Gianni Amati**

Date: October 30, 2012 10:17AM

Hi

As you know, approximations in mathematics hold to some order with respect to a limit. Here it is the first-order approximation as $Tot_f \rightarrow \infty$, that is, -log2(B(F_r, Tot_f, p_c))/Tot_f = D(p_r,p_c) + O(Tot_f). Either we use the entire formula (which is already an approximation of the -log of the binomial), or simply the divergence. Now Tot_f does not depend on the term, so for weighting query terms in an additive query-document matching function, Tot_f * D(p_r,p_c) is equivalent to D(p_r,p_c). We simply use this function in query expansion because the multiplicative factor Tot_f does not change the final ranking of the documents.

Therefore D(p_r,p_c) is *not* a mathematical approximation of the binomial, whilst Tot_f * D(p_r,p_c) is; what we say is that Tot_f * D(p_r,p_c) and D(p_r,p_c) produce the same final document ranking, as you can verify.

Tot_f * D(p_r,p_c) and D(p_r,p_c) are equivalent when we use the approximation in query reformulation, that is, in query expansion.

When we use this approximation to assign scores to terms in the document, that is, for document weighting, the sample size matters, and this is why we need normalizations for the binomial.
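A minimal standalone sketch of that rank-equivalence point (the class name and the (p_r, p_c) values are hypothetical, and it uses the standard binary divergence rather than Terrier's Idf class): since every candidate term shares the same Tot_f, multiplying each D by it cannot change their ordering.

```java
public class RankEquivalence {
    static double log2(double x) { return Math.log(x) / Math.log(2); }

    // standard binary divergence D(p_r, p_c), the (1 - p_r) variant as in formula (2.15)
    static double D(double pr, double pc) {
        return pr * log2(pr / pc) + (1 - pr) * log2((1 - pr) / (1 - pc));
    }

    public static void main(String[] args) {
        double totF = 1000;  // the same sample size for every candidate term
        // hypothetical (p_r, p_c) pairs for three candidate expansion terms
        double[][] terms = { {0.05, 0.01}, {0.02, 0.015}, {0.10, 0.02} };
        for (double[] t : terms) {
            double d = D(t[0], t[1]);
            // Tot_f * D is just a rescaling: term order by the two scores is identical
            System.out.printf("p_r=%.3f p_c=%.3f  D=%.5f  Tot_f*D=%.3f%n",
                    t[0], t[1], d, totF * d);
        }
    }
}
```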

This is what you can see in Renyi's book, formula 4.5.21, where $\binom{samplelength}{tf}$ is the binomial coefficient:

\[\binom{samplelength}{tf}\, p^{tf} q^{\,samplelength - tf} = \frac{2^{-samplelength \cdot D(p_r, p_c)}}{\sqrt{2\pi \cdot samplelength \cdot p_r (1 - p_r)}} \left(1 + O\!\left(\frac{1}{samplelength}\right)\right)\]

Formula 4.5.36 gives a second, Chi-square approximation, which is of a different order.


Posted by: **Gianni Amati**

Date: October 30, 2012 10:20AM

Errata corrige, first paragraph:

-log2(B(F_r, Tot_f, p_c))/Tot_f = D(p_r,p_c) + O(\frac {1}{Tot_f})


Posted by: **Gianni Amati**

Date: October 30, 2012 10:37AM

As for the computation of -log2(Binomial) in your picture: you already use a different approximation, so you cannot know which one is the best!

Posted by: **Gianni Amati**

Date: October 30, 2012 03:13PM

Hi

I checked CS.java, and there is a bug. Indeed, it should be:

```java
public final double score(
        double withinDocumentFrequency,
        double termFrequency,
        double totalDocumentLength,
        double collectionLength,
        double averageDocumentLength) {
    if (withinDocumentFrequency / totalDocumentLength
            < termFrequency / collectionLength)
        return 0;
    double f = withinDocumentFrequency / totalDocumentLength;
    double p = termFrequency / collectionLength;
    double D = f * Idf.log(f, p) + f * Idf.log(1 - f, 1 - p);
    return totalDocumentLength * D
        + 0.5d * Idf.log(2
            * Math.PI
            * withinDocumentFrequency
            * (1d - withinDocumentFrequency / totalDocumentLength));
}
```
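The practical effect of the fix can be seen with a standalone sketch (the class name, the local log2 helper standing in for Idf.log, and the statistics are all made up for illustration): the two versions differ by exactly 0.5 * log2(totalDocumentLength / withinDocumentFrequency) in the Stirling term.

```java
public class CsFixSketch {
    static double log2(double x) { return Math.log(x) / Math.log(2); }

    // second term as in the original CS.java: uses totalDocumentLength
    static double scoreOriginal(double tf, double termF, double docLen, double collLen) {
        double f = tf / docLen, p = termF / collLen;
        if (f < p) return 0;
        double D = f * log2(f / p) + f * log2((1 - f) / (1 - p));
        return docLen * D + 0.5 * log2(2 * Math.PI * docLen * (1 - f));
    }

    // second term as in the corrected version: uses withinDocumentFrequency (tf)
    static double scoreFixed(double tf, double termF, double docLen, double collLen) {
        double f = tf / docLen, p = termF / collLen;
        if (f < p) return 0;
        double D = f * log2(f / p) + f * log2((1 - f) / (1 - p));
        return docLen * D + 0.5 * log2(2 * Math.PI * tf * (1 - f));
    }

    public static void main(String[] args) {
        // hypothetical statistics: tf = 50 in a 1000-token sample,
        // term frequency 10000 in a 1e6-token collection
        double orig = scoreOriginal(50, 10_000, 1_000, 1_000_000);
        double fixd = scoreFixed(50, 10_000, 1_000, 1_000_000);
        System.out.printf("original=%.4f fixed=%.4f diff=%.4f%n", orig, fixd, orig - fixd);
    }
}
```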

Thanks

Gianni


Posted by: **craigm**

Date: October 30, 2012 03:44PM

Thanks Gianni. I'm tracking this issue on the tracker: [terrier.org]

Posted by: **swefire1**

Date: November 02, 2012 06:08AM

Thanks for taking the time to answer my question!
