[TR-341] hyper-geometric models (DPH, DLH and DLH13) produce Not a Number (NaN) Created: 29/Jul/15  Updated: 06/Nov/15  Resolved: 06/Nov/15

Status: Resolved
Project: Terrier Core
Component/s: .matching
Affects Version/s: 4.0
Fix Version/s: 4.1

Type: Bug Priority: Major
Reporter: Ahmet Arslan Assignee: Craig Macdonald
Resolution: Fixed  
Labels: None

Attachments: Text File TR-341.patch     Text File TR-341.patch    

 Description   
When tf equals docLength, the relative frequency (tf / docLength) is exactly 1, which produces Not a Number (NaN) or Negative Infinity as scores in the hyper-geometric models (DPH, DLH and DLH13).
We should guard against this case.
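
To illustrate why a relative frequency of exactly 1 is a problem, here is a minimal sketch. The class name and the two expressions are hypothetical stand-ins chosen to resemble the kind of (1 - f) and log terms these models use. They are not copied from Terrier's weighting model classes. The clamp value 0.99999 is the one used in the patch attached to this issue.

```java
class RelativeFrequencyDemo {

    // Base-2 logarithm helper.
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // Unclamped relative frequency: exactly 1.0 when tf == docLength.
    static double rawFrequency(double tf, double docLength) {
        return tf / docLength;
    }

    // Clamped variant: never returns 1.0 (assumes 0 < tf <= docLength).
    static double relativeFrequency(double tf, double docLength) {
        return tf < docLength ? tf / docLength : 0.99999;
    }

    // Hypothetical log term: log2(0) = -Infinity when f == 1.
    static double logTerm(double tf, double f) {
        return 0.5 * log2(2.0 * Math.PI * tf * (1.0 - f));
    }

    // Hypothetical normalised term: 0 * -Infinity = NaN when f == 1.
    static double normalisedTerm(double tf, double f) {
        double norm = (1.0 - f) * (1.0 - f) / (tf + 1.0);
        return norm * logTerm(tf, f);
    }

    public static void main(String[] args) {
        double tf = 5, docLength = 5;

        double raw = rawFrequency(tf, docLength);
        System.out.println(logTerm(tf, raw));        // -Infinity
        System.out.println(normalisedTerm(tf, raw)); // NaN

        double clamped = relativeFrequency(tf, docLength);
        // Both values are finite once f is kept strictly below 1.
        System.out.println(logTerm(tf, clamped));
        System.out.println(normalisedTerm(tf, clamped));
    }
}
```

The sketch only shows the floating-point behaviour: any expression that multiplies a factor of (1 - f) by a logarithm of (1 - f) degenerates to 0 * -Infinity, which IEEE 754 arithmetic defines as NaN.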

 Comments   
Comment by Ahmet Arslan [ 29/Jul/15 ]

Here is a patch, which simply returns 0.99999 when the situation occurs.

	/**
	 * Computes relative term frequency.
	 * When tf == docLength we return 0.99999 because relative frequency of 1 produces
	 * Not a Number (NaN) or Negative Infinity as scores in hyper-geometric models (DPH, DLH and DLH13).
	 *
	 * @param tf        raw term frequency
	 * @param docLength length of the document
	 * @return relative term frequency
	 */
	protected double relativeFrequency(double tf, double docLength) {
		assert tf <= docLength : "tf cannot be greater than docLength";
		double f = tf < docLength ? tf / docLength : 0.99999;
		assert f > 0 : "relative frequency must be greater than zero: " + f;
		assert f < 1 : "relative frequency must be less than one: " + f;
		return f;
	}
Comment by Ahmet Arslan [ 29/Jul/15 ]

Patch that ignores white space changes

Comment by Craig Macdonald [ 29/Jul/15 ]

Hi Ahmet,

This matches an approach I have taken in the past, and the use of a function is elegant. I will accept the patch, and it will be part of the next version of Terrier.

Craig

Comment by Ahmet Arslan [ 31/Jul/15 ]

Thanks Craig for the inclusion.

Comment by Craig Macdonald [ 06/Nov/15 ]

Committed to git for v4.1 - thanks Ahmet!
