  1. Terrier Core
  2. TR-341

Hyper-geometric models (DPH, DLH and DLH13) produce Not a Number (NaN)

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0
    • Fix Version/s: 4.1
    • Component/s: .matching
    • Labels:
      None

      Description

      When tf equals docLength, the relative frequency tf / docLength is exactly 1, and the hyper-geometric models (DPH, DLH and DLH13) then return Not a Number (NaN) or Negative Infinity as scores.
      We should guard against this case.
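
      To illustrate why a relative frequency of exactly 1 is a problem, here is a minimal sketch. It assumes a simplified term of the form 0.5 * log2(2 * pi * tf * (1 - f)), optionally scaled by (1 - f)^2 / (tf + 1). This resembles what these models compute but is not copied from Terrier, and the class and method names are hypothetical.

```java
public class RelativeFrequencyDemo {

    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // Hypothetical simplified log term: 0.5 * log2(2 * pi * tf * (1 - f)).
    // When f == 1 the argument of the logarithm is 0, so the result is -Infinity.
    static double logTerm(double tf, double docLength) {
        double f = tf / docLength;
        return 0.5 * log2(2.0 * Math.PI * tf * (1.0 - f));
    }

    // Hypothetical normalised variant: (1 - f)^2 / (tf + 1) times the log term.
    // When f == 1 this multiplies 0 by -Infinity, which is NaN in IEEE 754 arithmetic.
    static double normalisedTerm(double tf, double docLength) {
        double f = tf / docLength;
        double norm = (1.0 - f) * (1.0 - f) / (tf + 1.0);
        return norm * logTerm(tf, docLength);
    }

    public static void main(String[] args) {
        System.out.println(logTerm(3, 3));        // -Infinity
        System.out.println(normalisedTerm(3, 3)); // NaN
        System.out.println(normalisedTerm(2, 3)); // a finite value
    }
}
```

      Both failure modes named in the description show up here: the bare logarithm gives Negative Infinity, and scaling it by a factor that is zero gives NaN.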

        Attachments

          Activity

          iorixxx Ahmet Arslan added a comment -

          Here is a patch, which simply returns 0.99999 when this situation occurs.

          /**
          	 * Computes relative term frequency.
          	 * When tf == docLength we return 0.99999 because relative frequency of 1 produces
          	 * Not a Number (NaN) or Negative Infinity as scores in hyper-geometric models (DPH, DLH and DLH13).
          	 *
          	 * @param tf        raw term frequency
          	 * @param docLength length of the document
          	 * @return relative term frequency
          	 */
          	protected double relativeFrequency(double tf, double docLength) {
          		assert tf <= docLength : "tf cannot be greater than docLength";
          		double f = tf < docLength ? tf / docLength : 0.99999;
          		assert f > 0 : "relative frequency must be greater than zero: " + f;
          		assert f < 1 : "relative frequency must be less than one: " + f;
          		return f;
          	}
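
          As a rough, self-contained sketch of how such a guard behaves, the class below reuses the same clamping idea with an explicit argument check, since Java assert statements only run when assertions are enabled with -ea. The class name, the constant name and the simplified log term are assumptions for illustration, not Terrier code.

```java
public class RelativeFrequencyGuard {

    // Hypothetical constant name; the value is the one used in the patch above.
    static final double MAX_RELATIVE_FREQUENCY = 0.99999;

    // Same clamping idea as the patch: never return a relative frequency of exactly 1.
    static double relativeFrequency(double tf, double docLength) {
        if (tf > docLength) {
            throw new IllegalArgumentException("tf cannot be greater than docLength");
        }
        return tf < docLength ? tf / docLength : MAX_RELATIVE_FREQUENCY;
    }

    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // Hypothetical simplified log term that depends on (1 - f).
    static double logTerm(double tf, double docLength) {
        double f = relativeFrequency(tf, docLength);
        return 0.5 * log2(2.0 * Math.PI * tf * (1.0 - f));
    }

    public static void main(String[] args) {
        System.out.println(relativeFrequency(3, 3)); // 0.99999
        System.out.println(relativeFrequency(2, 4)); // 0.5
        System.out.println(logTerm(3, 3));           // finite, no longer -Infinity
    }
}
```

          With the clamp in place, 1 - f stays strictly positive, so the logarithm is always taken of a positive number.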
          
          iorixxx Ahmet Arslan added a comment -

          Patch that ignores white space changes

          craigm Craig Macdonald added a comment -

          Hi Ahmet,

          This matches an approach I have taken in the past, and the use of a function is elegant. I will accept the patch, and it will be part of the next version of Terrier.

          Craig

          iorixxx Ahmet Arslan added a comment -

          Thanks Craig for the inclusion.

          craigm Craig Macdonald added a comment -

          Committed to git for v4.1 - thanks Ahmet!


            People

            • Assignee:
              craigm Craig Macdonald
              Reporter:
              iorixxx Ahmet Arslan