  1. Terrier Core
  2. TR-16

Extending query language and Matching to support synonyms

    Details

      Description

      I think we should extend the query language capabilities of Terrier. A specific operator that should be added is the "synonym" operator, which allows the user to state that a given set of terms should be treated as synonyms. This is important for some languages other than English (e.g. Arabic, where there might be several transliterated words for a given term). Supporting wildcards might also be useful for some NLP applications (e.g. identification of relations between entities). Another area for improvement is the proximity operator, which should offer several variants that differ in whether the order of the terms matters and in the distance constraint.

            Activity

            craigm Craig Macdonald added a comment -

            Mega patch committed. This adds the PostingListManager to handle the creation of IterablePostings (including for synonym groups of terms), and changes the Matching classes to use the PostingListManager for creating IterablePostings and scoring. ORIterablePosting and friends handle the combination of postings. Old Matching classes are retained as "FullNoPLM".
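
            To illustrate what combining the postings of a synonym group involves, here is a minimal sketch in Python. It is not Terrier's ORIterablePosting: the `Posting` tuple and the `or_merge` function are hypothetical names, and summing the frequencies of the synonyms is an assumption about how the group would be scored.

```python
import heapq
from collections import namedtuple

# Hypothetical posting: a document id and the term's frequency in that document.
Posting = namedtuple("Posting", ["doc_id", "freq"])


def or_merge(*posting_lists):
    """Merge docid-sorted posting lists into one list, as if the terms were one term.

    Documents that contain several of the synonyms appear once, with the
    frequencies added together (an assumption made for this sketch).
    """
    merged = []
    # heapq.merge walks all sorted inputs in a single pass, ordered by doc_id.
    for doc_id, freq in heapq.merge(*posting_lists):
        if merged and merged[-1].doc_id == doc_id:
            # Same document seen under another synonym: accumulate its frequency.
            merged[-1] = Posting(doc_id, merged[-1].freq + freq)
        else:
            merged.append(Posting(doc_id, freq))
    return merged
```

            Because every input list is already sorted by document id, the merge needs one pass and no second pass over the inverted file.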

            craigm Craig Macdonald added a comment -

            Changed title to be explicitly for synonyms. Other issues such as wildcards should be in separate issues.

            craigm Craig Macdonald added a comment -

            Tagging for 3.1.

            craigm Craig Macdonald added a comment -

            Gianni, if we wish to support nested boolean expressions as queries, then surely we need to run multiple matchings for every query, and then combine the results?

            The present solution (which is mostly unary expressions) uses multiple TermScoreModifiers and DocumentScoreModifiers, often with second passes of the inverted file to ensure that the correct expressions are matched. For nested operators, that strategy would not be possible.
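
            As a rough sketch of the "multiple matchings, then combine" idea for nested expressions, the Python below evaluates each sub-expression to a set of document ids and combines the sets recursively. The tuple-based expression format, the `evaluate` function and the dictionary index are all hypothetical; nothing here reflects Terrier's Matching classes, and scoring is left out entirely.

```python
def evaluate(expr, index, all_docs):
    """Evaluate a nested boolean expression to a set of matching document ids.

    expr is either a term (str) or a tuple such as ("AND", "a", ("OR", "b", "c")).
    index maps each term to an iterable of document ids (hypothetical format).
    all_docs is the set of every document id, needed to evaluate NOT.
    """
    if isinstance(expr, str):
        # Leaf: one "matching" for a single term.
        return set(index.get(expr, ()))
    op, *args = expr
    # One recursive matching per sub-expression; the results are combined below.
    results = [evaluate(arg, index, all_docs) for arg in args]
    if op == "AND":
        return set.intersection(*results)
    if op == "OR":
        return set.union(*results)
    if op == "NOT":
        return all_docs - results[0]
    raise ValueError(f"unknown operator: {op}")
```

            For example, with an index in which a occurs in documents 1, 2 and 3, b in 2, 3 and 4, and c in 3, the expression ("AND", "a", ("OR", "b", "c")) yields documents 2 and 3.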

            gianni_amati Gianni Amati added a comment -

            It would also be necessary to support efficient boolean retrieval with nested boolean formulas, when one wants to activate it. I am also thinking of Google-style retrieval (+ terms) or of the legal track, where queries are complex boolean queries.

            craigm Craig Macdonald added a comment -

            Ok, let's use this issue to discuss all of the proposed operators, although implementations are likely to come in separate issues. The discussion should cover the proposed syntax of the operators and the semantics they encapsulate.

            Firstly, it's probably worth reiterating the existing query constructs. The ambiguity here is caused by the use of best match semantics in combination with constructs which suggest filtering of some form.

            syntax | semantics | scoring terms
            a | retrieve documents containing a | a
            a b | retrieve documents containing a and/or b | a b
            +a b c | retrieve documents containing a and possibly containing b and/or c | a b c
            -a b c | retrieve documents containing b and/or c, but not a | b c
            f1:a | retrieve documents containing a in field f1 | a
            f1:a b | retrieve documents containing a in field f1, and possibly containing b | a b
            -f1:a b | retrieve documents containing b, but where a does not occur in field f1 | b
            "a b" c | retrieve documents containing a and b as an adjacent phrase, which may or may not contain c | a b c
            f1:"a b" c | retrieve documents containing a and b as an adjacent phrase within field f1, in a document which may or may not contain c | a b c
            "a b"~10 | retrieve documents which contain a and b within 10 tokens of each other | a b
            c -"a b" | retrieve documents which contain c, and which do not contain a and b as an adjacent phrase | c
            c -(a b) | retrieve documents which contain c, but do not contain a or b | c
            c -f1:(a b) | retrieve documents which contain c, but which do not contain a or b in field f1 | c

            There is also the ^ (hat) operator for controlling the weights on an individual term.
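
            To make the split between filtering and scoring concrete, here is a small Python sketch covering only the simplest constructs listed above (plain terms, +, - and ^). The `classify` function is hypothetical, the `term^weight` form is an assumption about how the hat operator is written, and fields, phrases, proximity and parentheses are deliberately not handled.

```python
def classify(query):
    """Split a flat query into required terms, excluded terms and scoring weights.

    Only plain terms and the +, - and ^ operators are handled in this sketch.
    """
    required, excluded, scoring = [], [], {}
    for token in query.split():
        # Optional trailing ^weight (assumed syntax); the default weight is 1.0.
        term, _, weight = token.partition("^")
        weight = float(weight) if weight else 1.0
        if term.startswith("-"):
            # Excluded terms only filter documents; they never contribute a score.
            excluded.append(term[1:])
            continue
        if term.startswith("+"):
            # Required terms filter documents and are also scored.
            term = term[1:]
            required.append(term)
        scoring[term] = weight
    return required, excluded, scoring
```

            Under these assumptions, "+a b c" gives a as the only required term and scores a, b and c, while "-a b c" excludes a and scores only b and c, matching the scoring terms listed for those constructs.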


              People

              • Assignee:
                craigm Craig Macdonald
                Reporter:
                ounis Iadh Ounis
