[TR-16] Extending query language and Matching to support synonyms Created: 16/Feb/09  Updated: 05/Apr/11  Resolved: 30/Mar/11

Status: Resolved
Project: Terrier Core
Component/s: .matching, .querying, .structures, tests
Affects Version/s: None
Fix Version/s: 3.5

Type: Improvement Priority: Major
Reporter: Iadh Ounis Assignee: Craig Macdonald
Resolution: Fixed  
Labels: None

Issue Links:
Duplicate
is duplicated by TR-156 Deploy a DAAT matching strategy Resolved

 Description   
I think we should extend the query language capabilities of Terrier. A specific operator that should be added is the "synonym" operator, which allows the user to state that a given set of terms should be treated as synonyms. This is important for some languages other than English (e.g. Arabic, where there may be several transliterations of a given term). Supporting wildcards might also be useful for some NLP applications (e.g. identifying relations between entities). Another area for improvement is the proximity operator, which should offer several variants that differ in the required order of the terms and in the distance constraint.


 Comments   
Comment by Craig Macdonald [ 16/Feb/09 ]

Ok, let's use this issue to discuss all of the proposed operators, although implementations are likely to come in separate issues. The discussion should cover the proposed syntax of each operator and the semantics it expresses.

Firstly, it's probably worth reiterating the existing query constructs. The ambiguity here is caused by the use of best-match semantics in combination with constructs which suggest filtering of some form.

syntax | semantics | scoring terms
a | retrieve documents containing a | a
a b | retrieve documents containing a and/or b | a b
+a b c | retrieve documents containing a, and possibly containing b and/or c | a b c
-a b c | retrieve documents containing b and/or c, but not a | b c
f1:a | retrieve documents containing a in field f1 | a
f1:a b | retrieve documents containing a in field f1, and possibly containing b | a b
-f1:a b | retrieve documents containing b, but where a does not occur in field f1 | b
"a b" c | retrieve documents containing a and b as an adjacent phrase, which may or may not contain c | a b c
f1:"a b" c | retrieve documents containing a and b as an adjacent phrase within field f1, which may or may not contain c | a b c
"a b"~10 | retrieve documents which contain a and b within 10 tokens of each other | a b
c -"a b" | retrieve documents which contain c, and which do not contain a and b as an adjacent phrase | c
c -(a b) | retrieve documents which contain c, but do not contain a or b | c
c -f1:(a b) | retrieve documents which contain c, but which do not contain a or b in field f1 | c

There is also the ^ (hat) operator for controlling the weight of an individual term.
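
To make the interplay between best-match scoring and the +/- filters concrete, here is a minimal Python sketch over a toy in-memory index. Everything in it is an assumption for illustration: the `INDEX` data, the `match` function and its parameters are hypothetical, the score is a plain sum of term frequencies rather than a real weighting model, and none of it reflects Terrier's actual classes.

```python
# Toy inverted index: term -> {docid: term frequency}. Hypothetical data.
INDEX = {
    "a": {1: 2, 2: 1},
    "b": {2: 3, 3: 1},
    "c": {1: 1, 3: 2, 4: 1},
}


def match(optional=(), required=(), excluded=()):
    """Best-match retrieval with +term (required) and -term (excluded) filters.

    Required and optional terms both contribute to the score; excluded
    terms only filter documents and are never scored.
    """
    scores = {}
    # Best-match step: every scoring term adds its frequency to the document.
    for term in list(required) + list(optional):
        for docid, tf in INDEX.get(term, {}).items():
            scores[docid] = scores.get(docid, 0) + tf
    # +term: keep only documents that contain the required term.
    for term in required:
        postings = INDEX.get(term, {})
        scores = {d: s for d, s in scores.items() if d in postings}
    # -term: drop documents that contain the excluded term.
    for term in excluded:
        postings = INDEX.get(term, {})
        scores = {d: s for d, s in scores.items() if d not in postings}
    return scores


print(match(optional=["a", "b"]))                  # a b    -> {1: 2, 2: 4, 3: 1}
print(match(required=["a"], optional=["b", "c"]))  # +a b c -> {1: 3, 2: 4}
print(match(optional=["b", "c"], excluded=["a"]))  # -a b c -> {3: 3, 4: 1}
```

Note how `-a b c` scores only b and c, matching the "scoring terms" column, while a acts purely as a filter.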

Comment by Gianni Amati [ 17/Feb/09 ]

It would also be necessary to support efficient Boolean retrieval with nested Boolean formulas, for when one wants to enable it. I am also thinking of Google-style retrieval (+ terms) and of the legal track, where queries are complex Boolean queries.

Comment by Craig Macdonald [ 17/Feb/09 ]

Gianni, if we wish to support nested Boolean expressions as queries, then surely we need to run multiple matchings for every query, and then combine the results?

The present solution (which mostly handles unary expressions) uses multiple TermScoreModifiers and DocumentScoreModifiers, often with second passes over the inverted file to ensure that the correct expressions are matched. For nested operators, that strategy would not be possible.
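
As a rough illustration of the "multiple matchings, then combine" idea, the Python sketch below evaluates a nested Boolean expression by running one lookup per leaf term and combining the resulting document sets. It is only a sketch under stated assumptions: the tuple-based expression format, the `evaluate` function, and the `INDEX` and `ALL_DOCS` data are all hypothetical, and it ignores scoring entirely.

```python
# Toy index: term -> set of docids containing it. Hypothetical data.
INDEX = {"a": {1, 2}, "b": {2, 3}, "c": {1, 3, 4}}
ALL_DOCS = {1, 2, 3, 4}


def evaluate(expr):
    """Evaluate a nested Boolean expression to a set of docids.

    A leaf is a term string (one "matching"); an inner node is a tuple
    ("AND" | "OR" | "NOT", operand, ...), whose operand results are combined.
    """
    if isinstance(expr, str):
        return set(INDEX.get(expr, set()))
    op, *args = expr
    results = [evaluate(arg) for arg in args]
    if op == "AND":
        return set.intersection(*results)
    if op == "OR":
        return set.union(*results)
    if op == "NOT":
        return ALL_DOCS - results[0]
    raise ValueError(f"unknown operator: {op}")


# c AND NOT (a OR b)
print(evaluate(("AND", "c", ("NOT", ("OR", "a", "b")))))  # {4}
```

Each nesting level needs its own intermediate result set, which is why a single pass with per-term score modifiers does not extend naturally to nested operators.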

Comment by Craig Macdonald [ 18/Feb/11 ]

Tagging for 3.1.

Comment by Craig Macdonald [ 30/Mar/11 ]

Changed title to be explicitly for synonyms. Other issues such as wildcards should be in separate issues.

Comment by Craig Macdonald [ 30/Mar/11 ]

Mega patch committed. This adds the PostingListManager to handle the creation of IterablePostings (including for synonym groups of terms), and changes the Matching classes to use the PostingListManager for creating IterablePostings and scoring. ORIterablePosting and friends handle the combination of postings. Old Matching classes are retained as "FullNoPLM".
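
For readers unfamiliar with how a synonym group can be scored as a single term, the Python sketch below shows one plausible way to combine postings: merge the docid-sorted lists of each synonym and sum the frequencies wherever docids coincide. This is an assumption-laden illustration, not the code in the patch: `or_merge` and the sample postings are hypothetical, and the actual behaviour of ORIterablePosting may differ.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def or_merge(*posting_lists):
    """Merge docid-sorted (docid, tf) lists into one posting stream.

    Postings that share a docid are collapsed into a single posting whose
    frequency is the sum, so the synonym group behaves like one term.
    """
    merged = heapq.merge(*posting_lists, key=itemgetter(0))
    for docid, group in groupby(merged, key=itemgetter(0)):
        yield docid, sum(tf for _, tf in group)


# Hypothetical postings for two spellings treated as synonyms.
color = [(1, 2), (4, 1), (7, 3)]
colour = [(2, 1), (4, 2), (9, 1)]

print(list(or_merge(color, colour)))
# [(1, 2), (2, 1), (4, 3), (7, 3), (9, 1)]
```

Because the merge is lazy and consumes each list in docid order, the combined stream can be scored in the same way as a single term's postings.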

Generated at Mon Dec 11 03:59:21 GMT 2017 using JIRA 7.1.1#71004-sha1:d6b2c0d9b7051e9fb5e4eb8ce177ca56d91d7bd8.