  1. Terrier Core
  2. TR-328

Missing "tf" in the numerator of the BM25 weighting function

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 4.0
    • Fix Version/s: None
    • Component/s: .matching
    • Labels:
      None

      Description

      In version 4.0, the 3-parameter score method of the BM25 weighting model is implemented as follows:

      double K = k_1 * ((1 - b) + b * docLength / averageDocumentLength) + tf;
      return (tf * (k_3 + 1d) * keyFrequency / ((k_3 + keyFrequency) * K))
            * WeightingModelLibrary.log((numberOfDocuments - documentFrequency + 0.5d) / (documentFrequency + 0.5d));

      Or, in simple math notation:

                    tf                  (k3 + 1) * qf
      ------------------------------- * ------------- * log2 ( (N - nt + 0.5) / (nt + 0.5) )
      tf + k1 * (1-b + b * docl/avel)      k3 + qf


      The (k1 + 1) term is missing in the numerator of the leftmost division.

      This is not the case for the 5-parameter score method, though, which correctly implements the function.
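      For comparison, here is a minimal, self-contained sketch of the 3-parameter computation with the (k1 + 1) factor restored in the numerator. It is not the Terrier source and not necessarily the change made upstream: the class name, the method names, and the constant values (k_1 = 1.2, k_3 = 8, b = 0.75) are assumptions chosen for illustration, and the collection statistics are passed as plain arguments rather than read from the weighting model's fields.

```java
// Illustrative sketch only; names and constants are assumptions, not Terrier's API.
final class Bm25Sketch {
    // Assumed parameter values, used here only to make the example concrete.
    static final double K1 = 1.2d;
    static final double K3 = 8d;
    static final double B = 0.75d;

    /** Base-2 logarithm, matching the log2 used in the formula above. */
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0d);
    }

    /** The computation as quoted in the report: no (k_1 + 1) factor. */
    static double scoreReported(double tf, double docLength, double averageDocumentLength,
                                double keyFrequency, double numberOfDocuments,
                                double documentFrequency) {
        double K = K1 * ((1 - B) + B * docLength / averageDocumentLength) + tf;
        return (tf * (K3 + 1d) * keyFrequency / ((K3 + keyFrequency) * K))
            * log2((numberOfDocuments - documentFrequency + 0.5d) / (documentFrequency + 0.5d));
    }

    /** The same computation with (k_1 + 1) multiplied into the numerator. */
    static double scoreWithK1Factor(double tf, double docLength, double averageDocumentLength,
                                    double keyFrequency, double numberOfDocuments,
                                    double documentFrequency) {
        double K = K1 * ((1 - B) + B * docLength / averageDocumentLength) + tf;
        return ((K1 + 1d) * tf * (K3 + 1d) * keyFrequency / ((K3 + keyFrequency) * K))
            * log2((numberOfDocuments - documentFrequency + 0.5d) / (documentFrequency + 0.5d));
    }

    public static void main(String[] args) {
        // Made-up statistics: tf = 3, docLength = 100, avgDocLength = 120,
        // keyFrequency = 1, N = 1000, documentFrequency = 10.
        double reported = scoreReported(3, 100, 120, 1, 1000, 10);
        double withFactor = scoreWithK1Factor(3, 100, 120, 1, 1000, 10);
        System.out.println("reported:    " + reported);
        System.out.println("with factor: " + withFactor);
        System.out.println("ratio:       " + (withFactor / reported));
    }
}
```

      Because (k1 + 1) is a constant, the two variants differ only by a fixed multiplicative factor for every term and document, which is why the printed ratio is k1 + 1 up to rounding.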

            Activity

            dracca David Nicolas Racca added a comment -

            Again with better formatting:

            tf * (k3 + 1) * qf
            ------------------------------------------------ * log2 ( (N - nt + 0.5) / (nt + 0.5) )
            (tf + k1 * (1-b + b * docl/avel)) * (k3 + qf)

            dracca David Nicolas Racca added a comment -

            I put the wrong title on the issue; I am sorry about that. It should be "Missing (k1+1) term in ...". Can I rename it?

            dracca David Nicolas Racca added a comment -

            In the 5-parameter method, there is also an extra "tf" in the denominator (it is included in the K variable already).
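            The 5-parameter method's code is not quoted in this issue, so the following is only a hypothetical sketch of the effect this comment describes: if K is computed with "+ tf" at the end, as in the snippet in the description, then dividing by K + tf counts tf twice. The class name, method names, and constant values (k_1 = 1.2, b = 0.75) are assumptions for illustration.

```java
// Illustrative sketch only; it does not reproduce Terrier's 5-parameter method.
final class Bm25DenominatorSketch {
    // Assumed parameter values, used here only to make the example concrete.
    static final double K1 = 1.2d;
    static final double B = 0.75d;

    /** K as in the snippet quoted in the description: it already ends with "+ tf". */
    static double kIncludingTf(double tf, double docLength, double averageDocumentLength) {
        return K1 * ((1 - B) + B * docLength / averageDocumentLength) + tf;
    }

    /** Term-frequency factor with tf counted once in the denominator. */
    static double tfFactor(double tf, double docLength, double averageDocumentLength) {
        return (K1 + 1d) * tf / kIncludingTf(tf, docLength, averageDocumentLength);
    }

    /** Hypothetical variant that adds tf to the denominator a second time. */
    static double tfFactorWithExtraTf(double tf, double docLength, double averageDocumentLength) {
        return (K1 + 1d) * tf / (kIncludingTf(tf, docLength, averageDocumentLength) + tf);
    }

    public static void main(String[] args) {
        // Made-up document statistics: docLength = 100, averageDocumentLength = 120.
        for (double tf : new double[] {1, 5, 50}) {
            System.out.println("tf=" + tf
                + " once=" + tfFactor(tf, 100, 120)
                + " twice=" + tfFactorWithExtraTf(tf, 100, 120));
        }
    }
}
```

            With tf counted once, the factor saturates towards k1 + 1 as tf grows; with tf counted twice, it saturates towards (k1 + 1) / 2, so high term frequencies are rewarded less than the formula intends.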

            craigm Craig Macdonald added a comment -

            This was fixed in TR-221 for the v4.1 release. Thanks for the note though!


              People

              • Assignee:
                craigm Craig Macdonald
                Reporter:
                dracca David Nicolas Racca