[TR-200] Non unique keys in reverse index Created: 31/May/12  Updated: 27/Jul/12  Resolved: 27/Jul/12

Status: Resolved
Project: Terrier Core
Component/s: None
Affects Version/s: 3.5
Fix Version/s: 3.6

Type: Bug Priority: Minor
Reporter: Benjamin Piwowarski Assignee: Craig Macdonald
Resolution: Fixed  
Labels: None

When indexing ClueWeb (B) with a reverse index on the URLs, (at least) one URL turns out to be a duplicate:

ERROR - Could not finish MetaIndexBuilder:
java.io.IOException: Key http://en.wikipedia.org/wiki/^Lestm???r_Vycp???lek is not unique: 11802014,11691925
        at org.terrier.structures.collections.FSOrderedMapFile$MultiFSOMapWriter.mergeTwo(FSOrderedMapFile.java:1016)
        at org.terrier.structures.collections.FSOrderedMapFile$MultiFSOMapWriter.close(FSOrderedMapFile.java:969)
        at org.terrier.structures.indexing.CompressingMetaIndexBuilder.close(CompressingMetaIndexBuilder.java:299)
        at org.terrier.indexing.BasicIndexer.createDirectIndex(BasicIndexer.java:340)
        at org.terrier.indexing.Indexer.index(Indexer.java:346)
        at org.terrier.applications.TRECIndexing.index(TRECIndexing.java:122)
        at org.terrier.applications.TrecTerrier.run(TrecTerrier.java:388)
        at org.terrier.applications.TrecTerrier.applyOptions(TrecTerrier.java:564)
        at org.terrier.applications.TrecTerrier.main(TrecTerrier.java:235)

I am not sure how best to handle this, but it might be good not to fail on this error, e.g. by keeping only the first entry; a better alternative would be to support duplicate keys by keeping all the entries.
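
To make the two options concrete, here is a minimal sketch in plain Java. It is not Terrier code: the class name ReverseLookupSketch, both method names, and the use of a list position as the docid are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; none of these names exist in Terrier.
public class ReverseLookupSketch {

    /** Keep-first policy: a later docid for an already-seen key is dropped. */
    public static Map<String, Integer> keepFirst(List<String> keys) {
        Map<String, Integer> reverse = new TreeMap<>();
        for (int docid = 0; docid < keys.size(); docid++) {
            // putIfAbsent leaves the existing mapping untouched on a duplicate key
            reverse.putIfAbsent(keys.get(docid), docid);
        }
        return reverse;
    }

    /** Keep-all policy: each key maps to every docid that carries it. */
    public static Map<String, List<Integer>> keepAll(List<String> keys) {
        Map<String, List<Integer>> reverse = new TreeMap<>();
        for (int docid = 0; docid < keys.size(); docid++) {
            reverse.computeIfAbsent(keys.get(docid), k -> new ArrayList<>()).add(docid);
        }
        return reverse;
    }

    public static void main(String[] args) {
        // The position in the list stands in for the docid (an assumption of this sketch).
        List<String> urls = Arrays.asList(
                "http://a.example/", "http://b.example/", "http://a.example/");
        System.out.println(keepFirst(urls));
        System.out.println(keepAll(urls));
    }
}
```

Keeping the first entry leaves the lookup single-valued, so existing callers would not need to change; keeping all entries loses nothing but changes the value type of the reverse lookup.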

I am not sure why this happens in the ClueWeb collection - I will look at this more closely.

Another thing: is it possible to run only the construction of the inverted meta index?

Comment by Craig Macdonald [ 27/Jul/12 ]

For 3.6, I have added the property metaindex.compressed.reverse.allow.duplicates. Set it to true for duplicates to be ignored.
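
As a configuration fragment, the setting might look like the following. Only the property name and the value true come from the comment above; the file location is an assumption about a standard Terrier install.

```properties
# Assumed location: etc/terrier.properties
# Ignore duplicate keys when building the reverse meta index (Terrier 3.6+)
metaindex.compressed.reverse.allow.duplicates=true
```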

The inverting can be done by scripting CompressingMetaIndexBuilder. There is even a MapReduce job that can be called.

Comment by Craig Macdonald [ 27/Jul/12 ]

Thanks for the feedback Benjamin!

Generated at Thu Mar 22 21:36:19 GMT 2018 using JIRA 7.1.1#71004-sha1:d6b2c0d9b7051e9fb5e4eb8ce177ca56d91d7bd8.