Terrier Core / TR-189

TRECFullTokenizer may discard DOCNO tag, causing Terrier to crash

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 3.5
    • Fix Version/s: 3.6
    • Component/s: None
    • Labels: None

      Description

      The class org.terrier.indexing.TRECFullTokenizer parses tag content using the same tokenizer that it uses for documents.
      As a result, numerical values in tags are discarded if they have more than 5 digits or 4 consecutive digits that are all the same.

      The main problem is that this also applies to the DOCNO tag when parsing topic files, so Terrier crashes on query number 1111.

      The following patch adds a check that avoids tokenizing the tag content when the tag is DOCNO.
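
The behaviour described above can be illustrated with a minimal, self-contained Java sketch. This is not Terrier's code and not the patch itself: the class TagFilterSketch, its methods, and the exact thresholds are hypothetical, chosen only to match the wording of this report (more than 5 digits, or 4 identical consecutive digits, causes a token to be dropped). It shows why an identifier such as 1111 disappears when tag content goes through the document-term filter, and how a DOCNO check sidesteps that.

```java
// Illustrative sketch only: NOT Terrier's actual implementation.
// Class name, method names and thresholds are assumptions based on this report.
final class TagFilterSketch {
    // Assumed limits, taken from the description in this issue.
    static final int MAX_DIGITS = 5;    // tokens with more than 5 digits are dropped
    static final int MAX_SAME_RUN = 3;  // a run of 4 identical digits is dropped

    /** Returns the token, or null if the document-style filter would discard it. */
    static String filterTerm(String token) {
        int digits = 0;
        int run = 0;
        char prev = '\0';
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            boolean digit = Character.isDigit(c);
            if (digit) {
                digits++;
            }
            // Extend the run only when the same digit repeats; otherwise restart it.
            run = (digit && i > 0 && c == prev) ? run + 1 : 1;
            if (digits > MAX_DIGITS || run > MAX_SAME_RUN) {
                return null;
            }
            prev = c;
        }
        return token;
    }

    /** Reads tag content, bypassing the term filter for the identifier tag. */
    static String readTagContent(String tagName, String content) {
        String value = content.trim();
        if ("DOCNO".equalsIgnoreCase(tagName)) {
            return value; // identifiers must survive unchanged
        }
        return filterTerm(value);
    }

    public static void main(String[] args) {
        System.out.println(readTagContent("TITLE", "1111")); // filtered out
        System.out.println(readTagContent("DOCNO", "1111")); // kept as-is
    }
}
```

In this sketch the identifier tag is compared by name before any filtering happens, so the filter's rules for ordinary terms stay unchanged for all other tags.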


            Activity

            Steven (steven) created issue

            Craig Macdonald (craigm) made changes
            Field            Original Value    New Value
            Link             (none)            This issue is duplicated by TR-185 [ TR-185 ]

            Craig Macdonald (craigm) made changes
            Field            Original Value    New Value
            Status           Open [ 1 ]        Resolved [ 5 ]
            Fix Version/s    (none)            3.6 [ 10060 ]
            Resolution       (none)            Duplicate [ 3 ]

              People

              • Assignee: Craig Macdonald (craigm)
              • Reporter: Steven (steven)
              • Watchers: 0
