[TR-230] proximity operator Created: 12/Jun/13  Updated: 01/Apr/14  Resolved: 01/Apr/14

Status: Resolved
Project: Terrier Core
Component/s: .matching
Affects Version/s: 3.5
Fix Version/s: 3.6

Type: Bug Priority: Major
Reporter: Matteo Catena Assignee: Craig Macdonald
Resolution: Fixed  
Labels: None

Attachments: Java Source File PhraseScoreModifier.java     Java Source File ProximityIterablePosting.java    

The proximity operator is not implemented. I have attached a proposed implementation (still to be tested).

Comment by Richard McCreadie [ 11/Mar/14 ]

Generated a test case for this issue (TestProximityIterablePosting); the proposed implementation fails it.

We need to define what block distance in proximity means.

Document: "Whenever you win a coin flip, put a luck counter on Chance Encounter"
Query: "coin flip luck"
Window: 3

The current implementation would return this document, because the window radius is calculated as Window * numQueryTerms (3 * 3 = 9).

My expectation is that the window radius should be equal to Window. Is that the intended behaviour?
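
To make the example above concrete, here is a minimal Java sketch that computes the shortest span of the document containing all three query terms and compares it against both window sizes. It is not Terrier code: the class WindowExample, the methods tokenise and smallestSpan, and the tokenisation (lowercasing, punctuation stripped, no stopword removal or stemming) are all assumptions made for illustration, and the search is a deliberately simple brute force. It also assumes that a "match" means the span fits inside the window.

```java
import java.util.Arrays;
import java.util.List;

public class WindowExample {

    /** Assumed, simplified tokeniser: lowercase, strip punctuation, split on whitespace. */
    static List<String> tokenise(String text) {
        return Arrays.asList(text.toLowerCase().replaceAll("[^a-z0-9 ]", "").trim().split("\\s+"));
    }

    /** Length, in terms, of the shortest span containing every query term, or -1 if a term is missing. */
    static int smallestSpan(List<String> doc, List<String> query) {
        int best = -1;
        for (int start = 0; start < doc.size(); start++) {
            // Find the first end position such that doc[start..end] covers the whole query.
            for (int end = start; end < doc.size(); end++) {
                if (doc.subList(start, end + 1).containsAll(query)) {
                    int span = end - start + 1;
                    if (best == -1 || span < best) {
                        best = span;
                    }
                    break;
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> doc = tokenise("Whenever you win a coin flip, put a luck counter on Chance Encounter");
        List<String> query = tokenise("coin flip luck");
        int window = 3;
        int span = smallestSpan(doc, query);
        System.out.println("smallest span = " + span);
        System.out.println("matches with Window = " + (span != -1 && span <= window));
        System.out.println("matches with Window * numQueryTerms = "
                + (span != -1 && span <= window * query.size()));
    }
}
```

Under these assumptions, "coin" and "luck" are five terms apart inclusive, so the document fits a window of 9 but not a window of 3.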

Comment by Matteo Catena [ 11/Mar/14 ]

Probably, the implementation will be even simpler if you consider just 'window' instead of 'window * num_query_terms'.
But what if, for instance, window is less than num_query_terms? For example, given "coin flip luck counter"~2, should it return no results? Or should it return all the documents in which the distance between consecutive query terms is <= 2?

Comment by Richard McCreadie [ 14/Mar/14 ]

I think the best idea is to define proximity as follows: 'All query terms must be contained within a window of n terms'. In this case, if Window is less than num_query_terms, then we have two options:

1) window is set to num_query_terms
2) return nothing

Per-term radius proximity is a different matter, I think.
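
The definition above, combined with option 1, could be sketched roughly as follows. This is an illustration only, not the attached ProximityIterablePosting code: the class ProximityWindowSketch, the method allTermsInWindow and the int[][] layout (one sorted array of positions per query term) are hypothetical names and structures chosen for the example.

```java
public class ProximityWindowSketch {

    /**
     * Returns true if one occurrence of every query term fits inside a window
     * of `window` terms. positions[i] holds the sorted positions of query term i
     * in the document (hypothetical layout, assumed for this sketch).
     */
    static boolean allTermsInWindow(int[][] positions, int window) {
        int numQueryTerms = positions.length;
        if (numQueryTerms == 0) {
            return false;
        }
        // Option 1: never use a window smaller than the number of query terms.
        int effectiveWindow = Math.max(window, numQueryTerms);
        int[] cursor = new int[numQueryTerms]; // current occurrence index for each term
        while (true) {
            int minTerm = -1;
            int min = Integer.MAX_VALUE;
            int max = Integer.MIN_VALUE;
            for (int t = 0; t < numQueryTerms; t++) {
                if (cursor[t] >= positions[t].length) {
                    return false; // a term is absent or has no occurrences left
                }
                int p = positions[t][cursor[t]];
                if (p < min) {
                    min = p;
                    minTerm = t;
                }
                if (p > max) {
                    max = p;
                }
            }
            if (max - min + 1 <= effectiveWindow) {
                return true; // all current occurrences fit in one window
            }
            cursor[minTerm]++; // only advancing the earliest occurrence can shrink the span
        }
    }
}
```

With positions {4}, {5}, {8} (the "coin flip luck" example under simple whitespace tokenisation), this returns false for a window of 3 and true for a window of 5. Option 2 would instead return false immediately whenever window < numQueryTerms.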

Comment by Matteo Catena [ 17/Mar/14 ]

Option 1 sounds better to me. If you want, I can re-implement the class. Can you attach your test case, please?

Comment by Craig Macdonald [ 17/Mar/14 ]

I recall that Richard and I agreed that the Distance class was appropriate to use for this class.

Comment by Richard McCreadie [ 19/Mar/14 ]

Committed patch and test case for interpretation 'All query terms must be contained within a window of n terms'. Window is set to num_query_terms if window is less than num_query_terms.

Commit 3754.


Comment by Richard McCreadie [ 19/Mar/14 ]

Related note:

The patch uses the local isInWindow method rather than the Distance.noTimes method.

Either implementation should be valid, but each will be faster in different use cases.
isInWindow will be faster for long documents when the query terms appear only rarely. (Complexity: |Q| . window . occurrences(Q,d))
Distance.noTimes will be faster when the query terms appear often in a document. (Complexity: |Q| . documentLength-window)

Comment by Craig Macdonald [ 19/Mar/14 ]

Do you have numbers to prove this?
Also, does isInWindow() pass similar tests to Distance.noTimes?
Could the method be moved to the Distance class, to keep everything in the same place?

Comment by Richard McCreadie [ 21/Mar/14 ]

Tested Distance.noTimes for the proximity operator, but it fails all of the tests. Looking at the Distance test case, I think that noTimes looks for n-grams, not term sets in windows.

If so, it is not applicable for this issue.

Comment by Richard McCreadie [ 01/Apr/14 ]

Current implementation passes all of the unit tests. Resolving this issue.

Query language documentation should be updated to describe what this functionality does and how it differs from proximity score modifiers.

Generated at Thu Feb 22 08:36:16 GMT 2018 using JIRA 7.1.1#71004-sha1:d6b2c0d9b7051e9fb5e4eb8ce177ca56d91d7bd8.