Terrier Core / TR-189

TRECFullTokenizer may discard DOCNO tag content, causing Terrier to crash

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 3.5
    • Fix Version/s: 3.6
    • Component/s: None
    • Labels: None

      Description

      The class org.terrier.indexing.TRECFullTokenizer parses tag content using the same tokenizer that it uses for document text.
      As a result, numerical values in tags are discarded if they have more than 5 digits or contain 4 consecutive identical digits.

      The main problem is that this also applies to the DOCNO tag when parsing topic files, so Terrier crashes on query number 1111.

      The following patch adds a check that skips tokenization of the tag content when the tag being parsed is DOCNO.
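
      To illustrate the behaviour described above, here is a minimal standalone sketch. It is not the patch and not Terrier code: the class, method names, and thresholds are hypothetical and only follow the wording of this description (more than 5 digits, or 4 identical digits in a row, means the value is dropped).

```java
public class TagContentSketch {
    // Thresholds as read from the issue description; assumed, not taken from Terrier.
    static final int MAX_DIGITS = 5;     // more than 5 digits: discarded
    static final int SAME_DIGIT_RUN = 4; // 4 identical consecutive digits: discarded

    /** Returns true if the token would be dropped under the rule described above. */
    static boolean wouldDiscard(String token) {
        int digits = 0;
        int run = 0;
        char prev = '\0';
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (Character.isDigit(c)) {
                digits++;
                // Extend the run only when the same digit repeats.
                run = (c == prev) ? run + 1 : 1;
                if (run >= SAME_DIGIT_RUN) {
                    return true;
                }
            } else {
                run = 0;
            }
            prev = c;
        }
        return digits > MAX_DIGITS;
    }

    /**
     * Hypothetical tag reader: DOCNO content bypasses the filter,
     * every other tag goes through it. Returns null when the value is dropped.
     */
    static String readTagContent(String tagName, String rawContent) {
        String trimmed = rawContent.trim();
        if ("DOCNO".equalsIgnoreCase(tagName)) {
            return trimmed; // identifier is kept verbatim
        }
        return wouldDiscard(trimmed) ? null : trimmed;
    }

    public static void main(String[] args) {
        System.out.println(readTagContent("DOCNO", "1111")); // prints 1111
        System.out.println(readTagContent("TEXT", "1111"));  // prints null
    }
}
```

      With the bypass in place, the identifier 1111 survives, while the same value in any other tag is still filtered.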


        Attachments

            Activity

            steven Steven added a comment -

            This is a possible duplicate of: http://terrier.org/issues/browse/TR-185

            craigm Craig Macdonald added a comment -

            Thanks Steven. Do you have a trivial example document which doesn't work?

            craigm Craig Macdonald added a comment -

            Dup of TR-185

            craigm Craig Macdonald added a comment -

            Resolved in other issue.


              People

              • Assignee:
                craigm Craig Macdonald
              • Reporter:
                steven Steven
