Terrier Core / TR-194

TRECCollection docnos should be trimmed of whitespace

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 3.5
    • Fix Version/s: 3.6
    • Component/s: .indexing
    • Labels:
      None

      Description

      The CompressingMetaIndex stores items as expected.

      Problem: when an item is retrieved from the index with CompressingMetaIndex.getItem(String Key, int docid), the "trim()" method is called on it before it is returned. This is a problem if the stored item contained leading or trailing spaces.
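
The effect can be sketched with a toy stand-in for the meta index. `ToyMetaIndex` and all of its methods are hypothetical and only mimic the behaviour described above (values stored as given, `trim()` applied on read); this is not Terrier's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a meta index; NOT Terrier's CompressingMetaIndex. */
public class ToyMetaIndex {
    private final List<String> values = new ArrayList<>();

    /** Stores the value exactly as given and returns its docid. */
    public int add(String value) {
        values.add(value);
        return values.size() - 1;
    }

    /** Mimics the reported behaviour: the stored value is trimmed on read. */
    public String getItem(int docid) {
        return values.get(docid).trim();
    }

    /** Returns the stored value untouched, for comparison only. */
    public String getRawItem(int docid) {
        return values.get(docid);
    }

    public static void main(String[] args) {
        ToyMetaIndex meta = new ToyMetaIndex();
        int docid = meta.add(" AP880212-0003 ");
        System.out.format("raw     = [%s]%n", meta.getRawItem(docid));
        System.out.format("getItem = [%s]%n", meta.getItem(docid));
    }
}
```

In this sketch the caller of `getItem` has no way to tell that the stored value carried surrounding spaces.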

        Attachments

          Activity

          craigm Craig Macdonald added a comment -

          Is the whitespace a key aspect of your use case? Most representations of documents, e.g. URL, Title, DOCNO, abstract, are whitespace-insensitive.

          Fixing this is non-trivial and would require an index format change.

          tangopium Andreas Eiselt added a comment -

          It's a case that may occur, but not a key aspect. I realise that it's not easy to fix, but I think you should consider it, because at the moment it leads to an inconsistency: if I get an item from the MetaIndex and then query the MetaIndex for all entries that share this item, I will not get any results.

          bpiwowar Benjamin Piwowarski added a comment - edited

          Is metadata really whitespace insensitive?

          I had trouble retrieving a document ID given its docno (TREC AP8889 collection), until I noticed that the key was e.g. " AP880212-0003 " and not "AP880212-0003".

          So, if the metadata is really whitespace-insensitive, it should be accessible whether or not it contains leading/trailing whitespace.

          Is this a bug or is it possible to say which metadata information should be trimmed?

          craigm Craig Macdonald added a comment -

          @Benjamin, I think you were using meta.getDocument(String key, String value) method instead?

          bpiwowar Benjamin Piwowarski added a comment -

          @Craig

          Not sure what you meant; this is my code

          final String docno = index.getMetaIndex().getItem("docno", 0);
          System.err.format("docno[0] = [%s]%n", docno);
          System.err.format("docid(%s) = %d%n", docno, index.getMetaIndex().getDocument("docno", docno));

          The output is

          docno[0] = [AP880212-0001]
          docid(AP880212-0001) = -1

          As a quick fix, in TRECCollection.java, line 394 I replaced
          ThisDocID = DocumentIDContents.toString();
          by

          ThisDocID = DocumentIDContents.toString().trim();
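
The mismatch and the effect of trimming at indexing time can be reproduced with a small self-contained sketch. `DocnoRoundTrip` and its methods are hypothetical stand-ins (a list for the forward lookup, a map for the reverse lookup), not Terrier classes; the sketch assumes that the forward lookup trims on read while the reverse lookup matches the stored string exactly.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical toy index with forward and reverse docno lookups; NOT Terrier code. */
public class DocnoRoundTrip {
    private final List<String> docnos = new ArrayList<>();
    private final Map<String, Integer> reverse = new HashMap<>();
    private final boolean trimAtIndexTime;

    public DocnoRoundTrip(boolean trimAtIndexTime) {
        this.trimAtIndexTime = trimAtIndexTime;
    }

    /** Indexes a docno, optionally trimming it first (the quick fix). */
    public int add(String docno) {
        String stored = trimAtIndexTime ? docno.trim() : docno;
        docnos.add(stored);
        int docid = docnos.size() - 1;
        reverse.put(stored, docid);
        return docid;
    }

    /** Forward lookup: trims on read, as described in this issue. */
    public String getItem(int docid) {
        return docnos.get(docid).trim();
    }

    /** Reverse lookup: exact match on the stored string, -1 if absent. */
    public int getDocument(String docno) {
        return reverse.getOrDefault(docno, -1);
    }

    public static void main(String[] args) {
        for (boolean trim : new boolean[] {false, true}) {
            DocnoRoundTrip index = new DocnoRoundTrip(trim);
            index.add(" AP880212-0001 ");
            String docno = index.getItem(0);
            System.out.format("trimAtIndexTime=%b docid(%s) = %d%n",
                    trim, docno, index.getDocument(docno));
        }
    }
}
```

Without trimming at indexing time the round trip yields -1, matching the output above; with trimming, the stored key and the value returned by the forward lookup agree, so the reverse lookup finds docid 0.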

          craigm Craig Macdonald added a comment -

          @Benjamin
          yes, that looks like a (separate?) bug. Thanks for that.

          craigm Craig Macdonald added a comment -

          Hi folks.

          I decided that it is OK to trim() metadata, but that docnos from TRECCollection should be trimmed by default, as per Benjamin's suggestion. I am updating the issue title to reflect the refocus, and I have updated TestTRECCollection to check the docno.

          craigm Craig Macdonald added a comment -

          Thanks to Benjamin and Andreas. Andreas, what is the institution I should credit beside your name?

          Committed r3661. This will be fixed in v3.6.

          tangopium Andreas Eiselt added a comment -

          Hi Craig, and thank you as well for the fix! If you want to credit me, please mention Yahoo! Research Labs, Santiago.


            People

            • Assignee:
              craigm Craig Macdonald
              Reporter:
              tangopium Andreas Eiselt
            • Watchers:
              2
