Generated at Sat Sep 22 14:23:30 BST 2018 using JIRA 7.1.1#71004-sha1:d6b2c0d9b7051e9fb5e4eb8ce177ca56d91d7bd8.
- TRECQuery uses TRECFullTokenizer and TRECFullUTFTokenizer. Instead, it should use TaggedDocument.
- TRECQuery should (via TaggedDocument) make use of the new Tokeniser API
- SingleQueryTerm should not assume anything about language
- Query parser should not assume anything about language
- SingleLineTRECQuery: currently hands all content directly to the query parser. This can be handy.
- Proposal: SingleLineTRECQueryTokenised (or option in SingleLineTRECQuery that does or does not do tokenisation)
- (Collection) TRECQuerySource - special case of TRECCollection for topics files (e.g. "Number" word, "Description:")
- (Document) TaggedDocument - tokenises each query document - needs special handling (e.g. "Number" word, "Description:", no closing tags)
- SingleLineTerrierQuerySource - one query per line of Terrier queries
- Makes special Document objects which return an entire Terrier query for getNextTerm(). No tokenisation takes place.
- SingleLineTRECQuerySource - one query per line of queries
- Makes a FileDocument, which tokenises each line as for a normal document
QuerySource == Collection
Document - removes tags for one query, and passes to tokeniser
Tokeniser - removes punctuation, obtains terms to send to QueryParser
QueryParser - parses complex Terrier queries, but normally just builds a parse tree of multiple terms
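The division of labour above can be sketched in a few lines of Java. This is only an illustration of the proposed stages, not Terrier code: the class QueryPipelineSketch and its methods stripTags, tokenise and parse are hypothetical names, and the regular expressions are assumptions about what "removes tags" and "removes punctuation" would mean in the simple case.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the Document -> Tokeniser -> QueryParser stages.
// None of these names are Terrier classes.
class QueryPipelineSketch {

    /** "Document" stage: drop SGML-style tags and topic-file labels, keep the text. */
    static String stripTags(String topic) {
        // Tags need not be closed in topic files, so each tag is removed on its own.
        String noTags = topic.replaceAll("<[^>]*>", " ");
        // Labels such as "Number:" or "Description:" are not query text.
        return noTags.replaceAll("\\b(Number|Description|Narrative):", " ");
    }

    /** "Tokeniser" stage: split on anything that is not a letter or a digit. */
    static List<String> tokenise(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    /** "QueryParser" stage, reduced to the simple case: a flat list of terms. */
    static List<String> parse(String topic) {
        return tokenise(stripTags(topic));
    }

    public static void main(String[] args) {
        System.out.println(parse("<desc> Description: minorities in Germany?"));
    }
}
```

Note that the tokeniser here makes no language-specific decisions (no lowercasing, no stemming), in line with the requirement that the query parser and SingleQueryTerm assume nothing about language.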
Here is a more concrete list of suggestions to create individual issues:
- Have TRECQuery and SingleLineTRECQuery adhere to the Collection interface
- TRECCollection: replaces what TRECQuery currently does
- SingleLineCollection: replaces what SingleLineTRECQuery currently does
- Create special-purpose Document implementations
- TRECQueryDocument: extends TaggedDocument to strip "Number:", "Narrative:", etc.
- FileDocument should be the default document for SingleLineCollection
- QueryParser should build a query tree out of a Document
- How should Documents/Tokenisers handle query constructs?
How should Documents handle query constructs?
It never can. Or we have a special Document implementation that passes the query verbatim to the query parser, without passing via the tokeniser.
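A minimal sketch of that second option, assuming a cut-down term-stream contract: the interface TermStream below is a hypothetical stand-in with only getNextTerm() and endOfDocument(), not Terrier's full Document interface, and VerbatimQueryDocument is an illustrative name. The query string is just an example of text containing operators.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a document that hands the whole query over as one "term".
class VerbatimQueryDocumentSketch {

    /** Stand-in for the term-producing part of a Document (an assumption). */
    interface TermStream {
        String getNextTerm();
        boolean endOfDocument();
    }

    /** Returns the entire query once, untouched, then reports end of document. */
    static final class VerbatimQueryDocument implements TermStream {
        private final String query;
        private boolean consumed = false;

        VerbatimQueryDocument(String query) {
            this.query = query;
        }

        @Override
        public String getNextTerm() {
            if (consumed) {
                return null;
            }
            consumed = true;
            return query;
        }

        @Override
        public boolean endOfDocument() {
            return consumed;
        }
    }

    /** Reads every term from a stream, as a query parser front end might. */
    static List<String> drain(TermStream doc) {
        List<String> terms = new ArrayList<>();
        while (!doc.endOfDocument()) {
            String term = doc.getNextTerm();
            if (term != null) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(drain(new VerbatimQueryDocument("+germany \"foreign minorities\"")));
    }
}
```

Because no tokeniser sits between the document and the parser, the operators reach the query parser intact.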
Have TRECQuery and SingleLineTRECQuery adhere to the Collection interface
How would you handle the property setup for querying? At present, we have TrecQueryTags (retrieval usage), which are separate from TrecDocTags (indexing usage). It's a necessity that we have separate properties, so that one property file serves both indexing and retrieval. Perhaps in the interim, TRECQuery could remain as a wrapper around Collection/Document?
Committed to trunk.
- TRECQuery now has TRECFullTokenizer use a Tokeniser internally.
- SingleLineTRECQuery uses a Tokeniser directly; to prevent it from tokenising advanced query constructs, it uses IdentityTokeniser by default, which doesn't tokenise anything.
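To illustrate why an identity tokeniser protects advanced query constructs, here is a small sketch contrasting it with a punctuation-stripping tokeniser. The SimpleTokeniser interface is a simplified, hypothetical contract (text in, list of terms out) and does not reproduce the signature of Terrier's Tokeniser API; the query string is only an example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: identity tokenisation versus punctuation stripping.
class IdentityTokeniserSketch {

    /** Simplified tokeniser contract, assumed for this example only. */
    interface SimpleTokeniser {
        List<String> tokenise(String text);
    }

    /** Splits on anything that is not a letter or digit, so operators are lost. */
    static final SimpleTokeniser PUNCTUATION_STRIPPING = text -> {
        List<String> terms = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    };

    /** Returns the whole input as a single token, leaving operators for the query parser. */
    static final SimpleTokeniser IDENTITY = text -> Collections.singletonList(text);

    public static void main(String[] args) {
        String query = "+germany \"foreign minorities\"";
        System.out.println(PUNCTUATION_STRIPPING.tokenise(query));
        System.out.println(IDENTITY.tokenise(query));
    }
}
```

With the stripping tokeniser the plus sign and the quotes disappear before the parser sees them; with the identity tokeniser the line arrives unchanged, which matches the default behaviour described above.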
As a consequence of resolving this issue, TRECFullUTFTokenizer has been deprecated. TRECFullTokenizer should be used instead, with trec.encoding set to "utf8".