  1. Terrier Core
  2. TR-161

Use Tokenisers in query side tokenisation

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.5
    • Component/s: .indexing, .querying
    • Labels:
      None

        Activity

        rodrygo Rodrygo L. T. Santos added a comment -

        Issues are:

        • TRECQuery uses TRECFullTokenizer and TRECFullUTFTokenizer. Instead, it should use TaggedDocument.
        • TRECQuery should (via TaggedDocument) make use of the new Tokeniser API
        • SingleQueryTerm should not assume anything about language
        • Query parser should not assume anything about language
        • SingleLineTRECQuery: currently hands all content directly to the query parser. This can be handy.
        • Proposal: SingleLineTRECQueryTokenised (or option in SingleLineTRECQuery that does or does not do tokenisation)

        Responsibilities:

        • (Collection) TRECQuerySource - special case of TRECCollection for topics files (e.g. "Number" word, "Description:")
        • (Document) TaggedDocument - tokenises each query document - needs special handling (e.g. "Number" word, "Description:", no closing tags)
        • SingleLineTerrierQuerySource - one query per line of Terrier queries
          • Makes special Document objects which return an entire Terrier query for getNextTerm(); no tokenisation takes place.
        • SingleLineTRECQuerySource - one query per line of queries
          • Makes a FileDocument which tokenises each line as per a normal document

        QuerySource == Collection
        Document - removes tags for one query, and passes to tokeniser
        Tokeniser - removes punctuation, obtains terms to send to QueryParser
        QueryParser - parses complex Terrier queries, but normally just builds a parse tree of multiple terms
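
        The division of responsibilities above can be sketched roughly as follows. This is only an illustration: SketchTokeniser, PunctuationStrippingTokeniser, SketchQueryDocument and QueryPipelineSketch are hypothetical names, not Terrier classes, and a flat list of terms stands in for the parse tree that the real QueryParser would build.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical, simplified stand-ins for the roles named in the comment above.
// None of these are Terrier's real interfaces.
interface SketchTokeniser {
    List<String> tokenise(String text);
}

/** Tokeniser role: drops punctuation and yields lower-cased terms. */
class PunctuationStrippingTokeniser implements SketchTokeniser {
    @Override
    public List<String> tokenise(String text) {
        List<String> terms = new ArrayList<>();
        // Split on any run of characters that is neither a letter nor a digit.
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                terms.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }
}

/** Document role: removes the tags of one query, then hands the text to a tokeniser. */
class SketchQueryDocument {
    private final String rawQuery;

    SketchQueryDocument(String rawQuery) {
        this.rawQuery = rawQuery;
    }

    List<String> terms(SketchTokeniser tokeniser) {
        String withoutTags = rawQuery.replaceAll("<[^>]*>", " ");
        return tokeniser.tokenise(withoutTags);
    }
}

/** QuerySource role: iterates over the queries, here one tagged query per list entry. */
class QueryPipelineSketch {
    static List<List<String>> run(List<String> rawQueries, SketchTokeniser tokeniser) {
        List<List<String>> parsed = new ArrayList<>();
        for (String raw : rawQueries) {
            // A real query parser would build a parse tree; a term list stands in for it.
            parsed.add(new SketchQueryDocument(raw).terms(tokeniser));
        }
        return parsed;
    }

    public static void main(String[] args) {
        List<String> queries = Arrays.asList("<title> Hubble Telescope, achievements! </title>");
        System.out.println(run(queries, new PunctuationStrippingTokeniser()));
    }
}
```

        The point of the split is that only the tokeniser knows anything about language or punctuation, so swapping it does not touch the query source or the parser.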

        rodrygo Rodrygo L. T. Santos added a comment - edited

        Here is a more concrete list of suggestions to create individual issues:

        • Have TRECQuery and TRECSingleLineQuery adhere to the Collection interface
          • TRECCollection: replaces what TRECQuery currently does
          • SingleLineCollection: replaces what TRECSingleLineQuery currently does
        • Create special-purpose Document implementations
          • TRECQueryDocument: extends TaggedDocument to strip "Number:", "Narrative:", etc.
          • FileDocument should be the default document for SingleLineCollection
        • QueryParser should build a query tree out of a Document
          • How should Documents/Tokenisers handle query constructs?
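
        As an illustration of the label stripping proposed for TRECQueryDocument, here is a minimal sketch. TopicLabelStripper is a hypothetical helper, not part of Terrier, and the set of labels and the optional colon are assumptions made for the example.

```java
import java.util.regex.Pattern;

/**
 * Hypothetical helper, not Terrier code: removes the field label that TREC
 * topic files place at the start of a field, so it does not become a query term.
 */
class TopicLabelStripper {
    // Labels assumed for illustration. The colon is optional because the
    // thread mentions both the "Number" word and "Description:".
    private static final Pattern LEADING_LABEL =
            Pattern.compile("^\\s*(Number|Description|Narrative)\\b\\s*:?\\s*",
                    Pattern.CASE_INSENSITIVE);

    /** Strips one leading label, if present, and trims surrounding whitespace. */
    static String strip(String fieldText) {
        return LEADING_LABEL.matcher(fieldText).replaceFirst("").trim();
    }

    public static void main(String[] args) {
        System.out.println(strip(" Number: 301 "));
        System.out.println(strip("Description: \nIdentify organizations"));
        System.out.println(strip("hubble telescope"));
    }
}
```

        Because the colon is optional, a field that legitimately starts with one of these words would lose it, so a real implementation would probably apply each label only inside its own tag.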
        craigm Craig Macdonald added a comment -

        How should Documents handle query constructs?

        They never can. Or we have a special Document implementation that passes the query verbatim to the query parser, without passing via the tokeniser.

        Have TRECQuery and TRECSingleLineQuery to adhere to the Collection interface

        How do you see the properties being set up for querying? At present, we have TrecQueryTags (retrieval usage), which are separate from TrecDocTags (indexing usage). It's a necessity that we have separate properties, so that one property file does for both indexing and retrieval. Perhaps in the interim, TRECQuery could remain as a wrapper around Collection/Document?
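
        To make the point about separate prefixes concrete, here is a small sketch of one property file serving both uses. Only the TrecDocTags and TrecQueryTags prefixes come from this thread; the sub-keys and values shown, and the TagPropertySketch class itself, are assumptions for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

class TagPropertySketch {
    // Illustrative contents of a single properties file shared by indexing and retrieval.
    static final String SHARED_FILE =
            "TrecDocTags.doctag=DOC\n"
            + "TrecDocTags.idtag=DOCNO\n"
            + "TrecQueryTags.doctag=TOP\n"
            + "TrecQueryTags.idtag=NUM\n";

    /** Parses properties from text, as if read from the shared file. */
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Returns the settings under one prefix, with the prefix removed from each key. */
    static Map<String, String> withPrefix(Properties props, String prefix) {
        Map<String, String> selected = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(prefix + ".")) {
                selected.put(key.substring(prefix.length() + 1), props.getProperty(key));
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Properties props = load(SHARED_FILE);
        System.out.println("indexing:  " + withPrefix(props, "TrecDocTags"));
        System.out.println("retrieval: " + withPrefix(props, "TrecQueryTags"));
    }
}
```

        A Collection used for queries would then need some way to be told which prefix to read, which is the open question raised above.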

        rodrygo Rodrygo L. T. Santos added a comment -

        Committed to trunk.

        • TRECQuery now has TRECFullTokenizer use a Tokeniser internally.
        • SingleLineTRECQuery uses a Tokeniser directly; to prevent it from tokenising advanced query constructs, it uses IdentityTokeniser by default, which doesn't tokenise anything.
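
        The reason for defaulting to an identity tokeniser can be shown with a small sketch. The LineTokeniser interface and both implementations below are hypothetical simplifications rather than Terrier's actual Tokeniser API, and the query string is just an example of operator-style syntax.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical simplified interface; Terrier's real Tokeniser API differs.
interface LineTokeniser {
    List<String> tokenise(String line);
}

/** Returns the whole line as a single token, so query operators reach the parser untouched. */
class IdentityLineTokeniser implements LineTokeniser {
    @Override
    public List<String> tokenise(String line) {
        return Collections.singletonList(line);
    }
}

/** Splits on anything that is not a letter or digit, which discards query operators. */
class SplittingLineTokeniser implements LineTokeniser {
    @Override
    public List<String> tokenise(String line) {
        List<String> terms = new ArrayList<>();
        for (String piece : line.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                terms.add(piece);
            }
        }
        return terms;
    }
}

class IdentityTokeniserDemo {
    public static void main(String[] args) {
        // Example query with operator-like characters (+, quotes, ^).
        String query = "+hubble \"space telescope\"^2.0";
        System.out.println(new IdentityLineTokeniser().tokenise(query));
        System.out.println(new SplittingLineTokeniser().tokenise(query));
    }
}
```

        With the splitting tokeniser the operators are lost and the weight is broken into separate terms, which is why a single-line query format needs the identity behaviour unless the user opts in to tokenisation.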
        rodrygo Rodrygo L. T. Santos added a comment -

        As a consequence of resolving this issue, TRECFullUTFTokenizer has been deprecated. TRECFullTokenizer should be used instead, with trec.encoding set to "utf8".
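
        The idea of one tokenizer class plus an encoding setting, rather than a separate UTF class, can be sketched as below. EncodingAwareReaderSketch is a hypothetical class, and falling back to the platform default charset when trec.encoding is unset is an assumption of this sketch, not a statement about Terrier's behaviour.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class EncodingAwareReaderSketch {
    /** Reads the first line of the stream, decoding bytes with the charset named by trec.encoding. */
    static String firstLine(InputStream in, Properties props) {
        // Assumption: use the platform default when the property is absent.
        String name = props.getProperty("trec.encoding", Charset.defaultCharset().name());
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, Charset.forName(name)))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = "caf\u00e9 m\u00fcnchen\n".getBytes(StandardCharsets.UTF_8);
        Properties props = new Properties();
        props.setProperty("trec.encoding", "utf8");
        // Decoded with the configured charset, the accented characters survive.
        System.out.println(firstLine(new ByteArrayInputStream(bytes), props));
    }
}
```

        The same reading code serves every encoding, so only the property value changes between collections.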


          People

          • Assignee:
            craigm Craig Macdonald
          • Reporter:
            craigm Craig Macdonald
          • Watchers:
            1
