[TR-161] Use Tokenisers in query side tokenisation Created: 30/Mar/11  Updated: 09/Jun/11  Resolved: 09/Apr/11

Status: Resolved
Project: Terrier Core
Component/s: .indexing, .querying
Affects Version/s: None
Fix Version/s: 3.5

Type: Improvement Priority: Major
Reporter: Craig Macdonald Assignee: Craig Macdonald
Resolution: Fixed  
Labels: None

Issue Links:
Duplicate

 Comments   
Comment by Rodrygo L. T. Santos [ 01/Apr/11 ]

Issues are:

  • TRECQuery uses TRECFullTokenizer and TRECUTFFullTokenizer. Instead it should use TaggedDocument.
  • TRECQuery should (via TaggedDocument) make use of the new Tokeniser API (see the sketch after this list)
  • SingleQueryTerm should not assume anything about language
  • Query parser should not assume anything about language
  • SingleLineTRECQuery: currently hands all content directly to the query parser. This can be handy.
  • Proposal: SingleLineTRECQueryTokenised (or an option in SingleLineTRECQuery that does or does not do tokenisation)
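
A rough sketch of what query-side tokenisation via the Tokeniser API could look like, assuming the indexing-side contract of Terrier 3.5 (Tokeniser.getTokeniser() returning the configured tokeniser, and tokenise(Reader) yielding a TokenStream with hasNext()/next()); the raw query string and class name are illustrative only:

    import java.io.StringReader;

    import org.terrier.indexing.tokenisation.TokenStream;
    import org.terrier.indexing.tokenisation.Tokeniser;

    public class QueryTokenisationSketch {
        public static void main(String[] args) throws Exception {
            // obtain the configured Tokeniser - the same one used at indexing time
            Tokeniser tokeniser = Tokeniser.getTokeniser();
            // an illustrative raw query, as might be read from a topics file
            String rawQuery = "What is the economic impact of recycling tyres?";
            // tokenise the query text exactly as document text would be tokenised
            TokenStream terms = tokeniser.tokenise(new StringReader(rawQuery));
            while (terms.hasNext()) {
                String term = terms.next();
                if (term != null)
                    System.out.println(term);
            }
        }
    }

Using the same Tokeniser at indexing and querying time keeps both sides consistent, which is the point of this issue.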

Responsibilities:

  • (Collection) TRECQuerySource - special case of TRECCollection for topics files (e.g. "Number" word, "Description:")
  • (Document) TaggedDocument - tokenises each query document - needs special handling (e.g. "Number" word, "Description:", no closing tags)
  • SingleLineTerrierQuerySource - one query per line of Terrier queries
    • Makes special Document objects which return an entire Terrier query from getNextTerm(); no tokenisation takes place.
  • SingleLineTRECQuerySource - one query per line of queries
    • Makes a FileDocument which tokenises each line as per a normal document

QuerySource == Collection
Document - removes tags for one query, and passes to tokeniser
Tokeniser - removes punctuation, obtains terms to send to QueryParser
QueryParser - parses complex Terrier queries, but normally deals with simply building a parse tree of multiple terms
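
Under this mapping, a query source can be driven exactly like any other Collection. A rough sketch, assuming the org.terrier.indexing Collection and Document interfaces of Terrier 3.5; how the query-side Collection is instantiated is left open:

    import org.terrier.indexing.Collection;
    import org.terrier.indexing.Document;

    public class QueryPipelineSketch {

        /** Drives a query-side Collection: each Document is one topic/query,
         *  and getNextTerm() yields the already-tokenised query terms. */
        public static void processQueries(Collection querySource) {
            while (querySource.nextDocument()) {
                Document queryDoc = querySource.getDocument();
                StringBuilder terms = new StringBuilder();
                while (!queryDoc.endOfDocument()) {
                    String term = queryDoc.getNextTerm();
                    if (term != null && term.length() > 0)
                        terms.append(term).append(' ');
                }
                // the term sequence would then be handed to the QueryParser,
                // which builds the parse tree used for matching
                System.out.println("query terms: " + terms.toString().trim());
            }
        }
    }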

Comment by Rodrygo L. T. Santos [ 04/Apr/11 ]

Here is a more concrete list of suggestions to create individual issues:

  • Have TRECQuery and TRECSingleLineQuery adhere to the Collection interface
    • TRECCollection: replaces what TRECQuery currently does
    • SingleLineCollection: replaces what TRECSingleLineQuery currently does
  • Create special-purpose Document implementations
    • TRECQueryDocument: extends TaggedDocument to strip "Number:", "Narrative:", etc. (see the sketch after this list)
    • FileDocument should be the default document for SingleLineCollection
  • QueryParser should build a query tree out of a Document
    • How should Documents/Tokenisers handle query constructs?
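
To illustrate the TRECQueryDocument idea, here is a rough sketch of a Document wrapper that drops topic-field labels such as "Number" and "Narrative" from the term stream while delegating everything else to an underlying tokenised query Document. The class name, the label set and the wrapper approach (rather than extending TaggedDocument) are assumptions for illustration, and the org.terrier.indexing.Document interface is assumed to be that of Terrier 3.5:

    import java.io.Reader;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    import org.terrier.indexing.Document;

    /** Illustrative wrapper: filters topic-field labels out of the term
     *  stream of an underlying (already tokenised) query Document. */
    public class LabelStrippingQueryDocument implements Document {

        private static final Set<String> LABELS = new HashSet<String>(
            Arrays.asList("number", "title", "description", "narrative"));

        private final Document wrapped;

        public LabelStrippingQueryDocument(Document wrapped) {
            this.wrapped = wrapped;
        }

        public String getNextTerm() {
            String term = wrapped.getNextTerm();
            // skip terms that are really field labels from the topic markup
            while (term != null && LABELS.contains(term.toLowerCase())) {
                term = wrapped.getNextTerm();
            }
            return term;
        }

        public boolean endOfDocument() { return wrapped.endOfDocument(); }
        public Set<String> getFields() { return wrapped.getFields(); }
        public Reader getReader() { return wrapped.getReader(); }
        public String getProperty(String name) { return wrapped.getProperty(name); }
        public Map<String, String> getAllProperties() { return wrapped.getAllProperties(); }
    }
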
Comment by Craig Macdonald [ 04/Apr/11 ]

How should Documents handle query constructs?

They never can. Alternatively, we could have a special Document implementation that passes the query verbatim to the query parser, without going via the tokeniser.

Have TRECQuery and TRECSingleLineQuery adhere to the Collection interface

How do you see the property setup for querying working? At present, we have TrecQueryTags (retrieval usage), which are separate from TrecDocTags (indexing usage). It's necessary that we keep these properties separate, so that a single property file can serve both indexing and retrieval. Perhaps in the interim, TRECQuery could remain as a wrapper around Collection/Document?
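
For reference, that separation already exists at the property level, so a single terrier.properties can configure both sides; an illustrative excerpt (the tag values are the usual TREC examples, not a recommendation):

    # indexing-side tags (TrecDocTags)
    TrecDocTags.doctag=DOC
    TrecDocTags.idtag=DOCNO
    TrecDocTags.skip=DOCHDR

    # retrieval-side tags (TrecQueryTags)
    TrecQueryTags.doctag=TOP
    TrecQueryTags.idtag=NUM
    TrecQueryTags.process=TOP,NUM,TITLE
    TrecQueryTags.skip=DESC,NARR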

Comment by Rodrygo L. T. Santos [ 09/Apr/11 ]

Committed to trunk.

  • TRECQuery now has its TRECFullTokenizer use a Tokeniser internally.
  • SingleLineTRECQuery uses a Tokeniser directly; to prevent it from tokenising advanced query constructs, it defaults to IdentityTokeniser, which doesn't tokenise anything (sketched below).
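
For illustration, an identity-style tokeniser simply hands the whole input back as a single token, so advanced query operators survive untouched. A rough sketch of the idea (not the actual IdentityTokeniser source), assuming the Tokeniser/TokenStream contract of Terrier 3.5, where tokenise(Reader) is the only method a Tokeniser must provide:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.Reader;

    import org.terrier.indexing.tokenisation.TokenStream;
    import org.terrier.indexing.tokenisation.Tokeniser;

    /** Illustrative identity-style tokeniser: emits the entire input as one token. */
    public class VerbatimTokeniser extends Tokeniser {

        public TokenStream tokenise(final Reader reader) {
            return new TokenStream() {
                boolean consumed = false;

                public boolean hasNext() {
                    return !consumed;
                }

                public String next() {
                    consumed = true;
                    try {
                        // read everything and emit it as a single "token"
                        StringBuilder sb = new StringBuilder();
                        BufferedReader br = new BufferedReader(reader);
                        String line;
                        while ((line = br.readLine()) != null)
                            sb.append(line).append(' ');
                        return sb.toString().trim();
                    } catch (IOException e) {
                        return null;
                    }
                }

                public void remove() {
                    throw new UnsupportedOperationException();
                }
            };
        }
    }
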
Comment by Rodrygo L. T. Santos [ 09/Apr/11 ]

As a consequence of resolving this issue, TRECFullUTFTokenizer has been deprecated. TRECFullTokenizer should be used instead, with trec.encoding set to "utf8".
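
In practice the migration is a property change in terrier.properties; the first line follows directly from the comment above, while the tokeniser line is only an assumption about a typical UTF set-up:

    # read topics/documents as UTF-8 with the non-deprecated TRECFullTokenizer
    trec.encoding=utf8
    # assumed companion setting: switch the term tokeniser to the UTF one
    tokeniser=UTFTokeniser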
