[TR-189] TRECFullTokenizer may discard DOCNO tag, causing terrier to crash
Created: 19/Jan/12
Updated: 26/Jul/12
Resolved: 26/Jul/12

Status: Resolved
Project: Terrier Core
Component/s: None
Affects Version/s: 3.5
Fix Version/s: 3.6

Type: Bug
Priority: Minor
Reporter: Steven
Assignee: Craig Macdonald
Resolution: Duplicate  
Labels: None

Attachments: File TRECFullTokenizer.diff    
Issue Links:
is duplicated by TR-185 TRECQuery should not tokenise the top... Resolved

The class org.terrier.indexing.TRECFullTokenizer parses tag content using the same tokenizer that it uses for document text.
As a result, a numerical value inside a tag is discarded if it has more than 5 digits or contains 4 consecutive identical digits.

The main problem is that this also applies to the DOCNO tag when parsing topic files, so Terrier crashes on query number 1111.

The attached patch (TRECFullTokenizer.diff) adds a check that skips tokenization of the tag content when the tag being parsed is DOCNO.
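
For illustration only, here is a minimal Java sketch of the behaviour described in this report. It is not the content of TRECFullTokenizer.diff and it does not use Terrier's API: the class TagTokenSketch, its methods and both thresholds are hypothetical, taken from the description in this issue rather than from Terrier's source.

```java
public class TagTokenSketch {
    // Assumed thresholds, taken from this issue's description only.
    static final int MAX_DIGITS = 5;
    static final int MAX_SAME_CONSECUTIVE_DIGITS = 3;

    /** Returns true if the token would be dropped under the rule as described. */
    static boolean isDiscarded(String token) {
        int digits = 0;   // total digits seen in the token
        int run = 0;      // length of the current run of identical digits
        char prev = '\0';
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (Character.isDigit(c)) {
                digits++;
                run = (c == prev) ? run + 1 : 1;
                if (run > MAX_SAME_CONSECUTIVE_DIGITS) {
                    return true; // e.g. "1111"
                }
            } else {
                run = 0;
            }
            prev = c;
        }
        return digits > MAX_DIGITS; // e.g. "123456"
    }

    /**
     * Hypothetical version of the proposed check: DOCNO content is returned
     * untouched, while any other tag content still goes through the filter.
     */
    static String tagContent(String tagName, String rawContent) {
        String trimmed = rawContent.trim();
        if ("DOCNO".equalsIgnoreCase(tagName)) {
            return trimmed; // bypass the term filter for identifiers
        }
        return isDiscarded(trimmed) ? "" : trimmed;
    }

    public static void main(String[] args) {
        System.out.println(isDiscarded("1111"));
        System.out.println(tagContent("DOCNO", "1111"));
    }
}
```

In this sketch the identifier 1111 is rejected by the filter but survives when it is read as DOCNO content, which is the distinction the patch is meant to introduce.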

Comment by Steven [ 19/Jan/12 ]

This is a possible duplicate of: http://terrier.org/issues/browse/TR-185

Comment by Craig Macdonald [ 20/Jan/12 ]

Thanks Steven. Do you have a trivial example document which doesn't work?

Comment by Craig Macdonald [ 26/Jul/12 ]

Dup of TR-185

Comment by Craig Macdonald [ 26/Jul/12 ]

Resolved in other issue.

Generated at Wed May 22 18:03:24 BST 2019 using JIRA 7.1.1#71004-sha1:d6b2c0d9b7051e9fb5e4eb8ce177ca56d91d7bd8.