  Terrier Core / TR-200

Non-unique keys in reverse index

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.5
    • Fix Version/s: 3.6
    • Component/s: None
    • Labels: None

      Description

      When indexing ClueWeb (B) with a reverse index on the URLs, indexing fails because (at least) one URL is a duplicate:

      ERROR - Could not finish MetaIndexBuilder:
      java.io.IOException: Key http://en.wikipedia.org/wiki/^Lestm???r_Vycp???lek is not unique: 11802014,11691925
              at org.terrier.structures.collections.FSOrderedMapFile$MultiFSOMapWriter.mergeTwo(FSOrderedMapFile.java:1016)
              at org.terrier.structures.collections.FSOrderedMapFile$MultiFSOMapWriter.close(FSOrderedMapFile.java:969)
              at org.terrier.structures.indexing.CompressingMetaIndexBuilder.close(CompressingMetaIndexBuilder.java:299)
              at org.terrier.indexing.BasicIndexer.createDirectIndex(BasicIndexer.java:340)
              at org.terrier.indexing.Indexer.index(Indexer.java:346)
              at org.terrier.applications.TRECIndexing.index(TRECIndexing.java:122)
              at org.terrier.applications.TrecTerrier.run(TrecTerrier.java:388)
              at org.terrier.applications.TrecTerrier.applyOptions(TrecTerrier.java:564)
              at org.terrier.applications.TrecTerrier.main(TrecTerrier.java:235)
      ...

      Not sure how best to handle this, but it would be good not to fail on this error, e.g. by keeping the first entry only; a better alternative would be to support duplicate keys by keeping all the entries.
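
      The two policies mentioned above (keeping the first entry only, or keeping all entries for a duplicate key) could be sketched roughly as follows. This is only an in-memory illustration under stated assumptions: the class and method names (DuplicateKeyMerge, keepFirst, keepAll) are hypothetical, the keys and docids are made up, and none of this is Terrier's actual FSOrderedMapFile merge code, which works on sorted files on disk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: not part of Terrier.
public class DuplicateKeyMerge {

    /** Policy 1: keep only the first docid seen for each key, ignore later duplicates. */
    static TreeMap<String, Integer> keepFirst(List<Map.Entry<String, Integer>> entries) {
        TreeMap<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, Integer> e : entries) {
            // putIfAbsent leaves the existing value untouched when the key is already present
            out.putIfAbsent(e.getKey(), e.getValue());
        }
        return out;
    }

    /** Policy 2: keep every docid for each key, in the order they were seen. */
    static TreeMap<String, List<Integer>> keepAll(List<Map.Entry<String, Integer>> entries) {
        TreeMap<String, List<Integer>> out = new TreeMap<>();
        for (Map.Entry<String, Integer> e : entries) {
            out.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up (key, docid) pairs; the first and third share the same key.
        List<Map.Entry<String, Integer>> entries = List.of(
                Map.entry("http://example.org/a", 1),
                Map.entry("http://example.org/b", 2),
                Map.entry("http://example.org/a", 3));

        System.out.println(keepFirst(entries));
        System.out.println(keepAll(entries));
    }
}
```

      With keepFirst, a reverse lookup always returns a single docid, so existing callers would not need to change; keepAll loses no information, but every lookup then has to deal with a list of docids.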

      I am not sure why this happens in the ClueWeb collection - I will look at this more closely.

      Another thing: is it possible to run only the construction of the inverted meta index?

            People

            • Assignee: craigm Craig Macdonald
            • Reporter: bpiwowar Benjamin Piwowarski
