  1. Terrier Core
  2. TR-194

TRECCollection docnos should be trimmed of whitespace

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 3.5
    • Fix Version/s: 3.6
    • Component/s: .indexing
    • Labels:
      None

      Description

      The CompressingMetaIndex stores items as expected.

      Problem: when an item is retrieved from the index with CompressingMetaIndex.getItem(String Key, int docid), trim() is called on the value before it is returned. This is a problem if the stored item contained leading or trailing spaces.
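
To make the reported behaviour concrete, here is a minimal sketch. TrimOnReadDemo is a hypothetical toy class, not Terrier's CompressingMetaIndex; it only assumes, as the description states, that values are stored verbatim and trimmed when read back.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for a metadata index; NOT Terrier's CompressingMetaIndex. */
public class TrimOnReadDemo {
    private final List<String> stored = new ArrayList<>();

    /** Stores the value exactly as given and returns its docid. */
    public int add(String value) {
        stored.add(value);
        return stored.size() - 1;
    }

    /** Mimics the reported behaviour: the value is trimmed on the way out. */
    public String getItem(int docid) {
        return stored.get(docid).trim();
    }

    /** Returns the value exactly as it was stored, for comparison. */
    public String getRaw(int docid) {
        return stored.get(docid);
    }

    public static void main(String[] args) {
        TrimOnReadDemo index = new TrimOnReadDemo();
        int docid = index.add(" AP880212-0001 ");
        // Brackets make any surrounding whitespace visible.
        System.out.format("raw     = [%s]%n", index.getRaw(docid));
        System.out.format("getItem = [%s]%n", index.getItem(docid));
    }
}
```

In this sketch the caller of getItem cannot tell whether the stored value had surrounding whitespace, which is the information loss the description refers to.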

          Activity

          bpiwowar Benjamin Piwowarski added a comment -

          @Craig

          Not sure what you meant; this is my code

          final String docno = index.getMetaIndex().getItem("docno", 0);
          System.err.format("docno[0] = [%s]%n", docno);
          System.err.format("docid(%s) = %d%n", docno, index.getMetaIndex().getDocument("docno", docno));

          The output is

          docno[0] = [AP880212-0001]
          docid(AP880212-0001) = -1

          As a quick fix, in TRECCollection.java, line 394, I replaced

          ThisDocID = DocumentIDContents.toString();

          by

          ThisDocID = DocumentIDContents.toString().trim();
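
Illustration (not part of the original comment): one plausible reading of the output above is that the docno was indexed with surrounding whitespace, so the trimmed string returned by getItem no longer matches the key used for the reverse lookup. The sketch below models that assumption with a hypothetical toy class, DocnoRoundTripDemo; the names add and trimAtParse are invented for the example and are not Terrier API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy forward/reverse docno index; a hypothetical stand-in, not Terrier code. */
public class DocnoRoundTripDemo {
    private final List<String> forward = new ArrayList<>();
    private final Map<String, Integer> reverse = new HashMap<>();

    /** Indexes a docno as parsed; trimAtParse models the proposed quick fix. */
    public void add(String parsedDocno, boolean trimAtParse) {
        String docno = trimAtParse ? parsedDocno.trim() : parsedDocno;
        reverse.put(docno, forward.size());
        forward.add(docno);
    }

    /** Forward lookup, trimmed on read as described in the issue. */
    public String getItem(int docid) {
        return forward.get(docid).trim();
    }

    /** Reverse lookup by exact key; returns -1 when the key is absent. */
    public int getDocument(String docno) {
        return reverse.getOrDefault(docno, -1);
    }

    public static void main(String[] args) {
        for (boolean trimAtParse : new boolean[] {false, true}) {
            DocnoRoundTripDemo index = new DocnoRoundTripDemo();
            index.add(" AP880212-0001 ", trimAtParse);
            String docno = index.getItem(0);
            System.out.format("trimAtParse=%b docid(%s) = %d%n",
                    trimAtParse, docno, index.getDocument(docno));
        }
    }
}
```

Under these assumptions the round trip fails (-1) when the docno is stored untrimmed and succeeds (0) once it is trimmed at parse time, which is what the one-line change in TRECCollection.java aims for.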

          craigm Craig Macdonald added a comment -

          @Benjamin
          yes, that looks like a (separate?) bug. Thanks for that.

          craigm Craig Macdonald added a comment -

          Hi folks.

          I have decided that it is OK for metadata to be trimmed (trim()), and that docnos from TRECCollection should be trimmed by default, as per Benjamin's suggestion. I am updating the issue title to reflect the change of focus, and I have updated TestTRECCollection to check the docno.

          craigm Craig Macdonald added a comment -

          Thanks to Benjamin and Andreas. Andreas, what is the institution I should credit beside your name?

          Committed r3661. This will be fixed in v3.6.

          tangopium Andreas Eiselt added a comment -

          Hi Craig, and thank you as well for the fix! If you want to credit me, please mention Yahoo! Research Labs, Santiago.


            People

            • Assignee:
              craigm Craig Macdonald
              Reporter:
              tangopium Andreas Eiselt
            • Watchers:
              2
