[TR-194] TRECCollection docnos should be trimmed of whitespace Created: 19/Mar/12  Updated: 27/Jul/12  Resolved: 27/Jul/12

Status: Resolved
Project: Terrier Core
Component/s: .indexing
Affects Version/s: 3.5
Fix Version/s: 3.6

Type: Bug
Priority: Trivial
Reporter: Andreas Eiselt
Assignee: Craig Macdonald
Resolution: Fixed  
Labels: None

The CompressingMetaIndex stores items as expected.

Problem: If you try to get an item from the index (CompressingMetaIndex.getItem(String Key, int docid)), the "trim()" method is called on it before it is returned. That is a problem if the item contained leading/trailing spaces.

Comment by Craig Macdonald [ 19/Mar/12 ]

Is the whitespace a key aspect of your use case? Most representations of documents, e.g. URL, Title, DOCNO, abstract, are whitespace-insensitive.

Fixing this is non-trivial and would require an index format change.

Comment by Andreas Eiselt [ 19/Mar/12 ]

It's a case that may occur, but not a key aspect. I realise that it's not easy to fix, but I think you should think about it because, at the moment, this leads to an inconsistency. If I get an item from the MetaIndex and then query the MetaIndex for all entries that share this item, I will not get any element as a result.
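
The round trip described above can be sketched with a toy store. This is only an illustration under stated assumptions: ToyMetaStore and its methods are hypothetical and are not Terrier's CompressingMetaIndex. It assumes that the read path trims while the reverse lookup is keyed on the untrimmed stored value.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy store for illustration only; NOT Terrier's CompressingMetaIndex.
class ToyMetaStore {
    private final List<String> forward = new ArrayList<>();        // docid -> raw stored value
    private final Map<String, Integer> reverse = new HashMap<>();  // raw stored value -> docid

    // Stores the value exactly as given and returns the new docid.
    int add(String rawValue) {
        int docid = forward.size();
        forward.add(rawValue);
        reverse.put(rawValue, docid);
        return docid;
    }

    // Assumption: the read path trims, as described in this issue.
    String getItem(int docid) {
        return forward.get(docid).trim();
    }

    // Assumption: the reverse lookup matches against the untrimmed stored value.
    int getDocument(String value) {
        return reverse.getOrDefault(value, -1);
    }

    public static void main(String[] args) {
        ToyMetaStore meta = new ToyMetaStore();
        meta.add(" AP880212-0001 ");
        String docno = meta.getItem(0);
        System.out.println("docno[0] = [" + docno + "]");
        // The trimmed value no longer matches the stored key, so this prints -1.
        System.out.println("docid(" + docno + ") = " + meta.getDocument(docno));
    }
}
```

In this sketch the value handed back by getItem can never be used as a lookup key whenever the stored value had surrounding whitespace, which is the inconsistency in question.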

Comment by Benjamin Piwowarski [ 21/May/12 ]

Is metadata really whitespace insensitive?

I had trouble retrieving a document ID given its docno (TREC AP8889 collection), until I noticed that the key was, e.g., " AP880212-0003 " and not "AP880212-0003".

So, if the metadata is really whitespace insensitive, it should be accessible whether or not it contains leading/trailing whitespace.

Is this a bug or is it possible to say which metadata information should be trimmed?

Comment by Craig Macdonald [ 21/May/12 ]

@Benjamin, I think you were using the meta.getDocument(String key, String value) method instead?

Comment by Benjamin Piwowarski [ 21/May/12 ]

Not sure what you meant; this is my code:

final String docno = index.getMetaIndex().getItem("docno", 0);
System.err.format("docno[0] = [%s]%n", docno);
System.err.format("docid(%s) = %d%n", docno, index.getMetaIndex().getDocument("docno", docno));

The output is

docno[0] = [AP880212-0001]
docid(AP880212-0001) = -1

As a quick fix, in TRECCollection.java, line 394, I replaced

ThisDocID = DocumentIDContents.toString();

with

ThisDocID = DocumentIDContents.toString().trim();

Comment by Craig Macdonald [ 21/May/12 ]

Yes, that looks like a (separate?) bug. Thanks for that.

Comment by Craig Macdonald [ 27/Jul/12 ]

Hi folks.

I made the decision that metadata is OK to trim(), and that docnos from TRECCollection should be trimmed by default, as per Benjamin's suggestion. I am updating the issue title to reflect the refocus. I have updated TestTRECCollection to check the docno.
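
A minimal sketch of trimming the docno at parse time, so that the stored key and the retrieved value agree. The class DocnoTrimSketch and the helper extractDocno are hypothetical names for illustration; this is not the TRECCollection source, and the plain string search stands in for the real tag parsing.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not the TRECCollection implementation.
class DocnoTrimSketch {
    // Returns the text between <DOCNO> and </DOCNO> with surrounding
    // whitespace removed, or null if either tag is missing.
    static String extractDocno(String doc) {
        final String open = "<DOCNO>";
        final String close = "</DOCNO>";
        int start = doc.indexOf(open);
        if (start < 0) {
            return null;
        }
        start += open.length();
        int end = doc.indexOf(close, start);
        if (end < 0) {
            return null;
        }
        return doc.substring(start, end).trim();
    }

    public static void main(String[] args) {
        String doc = "<DOC>\n<DOCNO> AP880212-0001 </DOCNO>\n<TEXT>...</TEXT>\n</DOC>";
        Map<String, Integer> reverse = new HashMap<>();  // docno -> docid
        String docno = extractDocno(doc);
        reverse.put(docno, 0);
        // The key was trimmed before it was stored, so the lookup succeeds.
        System.out.println("docid(" + docno + ") = " + reverse.getOrDefault("AP880212-0001", -1));
    }
}
```

Trimming once, before the value is stored, means every later read or lookup sees the same string, regardless of whether the read path trims again.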

Comment by Craig Macdonald [ 27/Jul/12 ]

Thanks to Benjamin and Andreas. Andreas, what is the institution I should credit beside your name?

Committed r3661. This will be fixed in v3.6.

Comment by Andreas Eiselt [ 27/Jul/12 ]

Hi Craig, and thank you as well for the fix! If you want to credit me, please mention Yahoo! Research Labs, Santiago.

Generated at Mon Dec 11 13:32:16 GMT 2017 using JIRA 7.1.1#71004-sha1:d6b2c0d9b7051e9fb5e4eb8ce177ca56d91d7bd8.