  1. Terrier Core
  2. TR-122

Two pass indexing results in incorrect inverted index

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Duplicate
    • Affects Version/s: 3.0
    • Fix Version/s: None
    • Component/s: .indexing, .structures
    • Labels:
      None

      Description

      With two-pass indexing, the resulting inverted index contains incorrect entries. Single-pass indexing is not affected.

      The error can be reproduced using this example:

      = doc0.txt =
      cats dogs horses

      = doc1.txt =
      chicken cats chicken chicken

      = Program =
      List<String> files = new ArrayList<String>();
      files.add( "doc0.txt" );
      files.add( "doc1.txt" );

      /* two pass */
      Collection col = new SimpleFileCollection( files, false );
      Collection[] collections = new Collection[] { col };

      Indexer indexer = new BasicIndexer( ApplicationSetup.TERRIER_INDEX_PATH, "test_filecollection" );
      indexer.createDirectIndex( collections );
      indexer.createInvertedIndex();

      Index index = Index.createIndex( "index", "test_filecollection" );
      Lexicon<String> lexicon = index.getLexicon();
      InvertedIndex invertedIndex = index.getInvertedIndex();

      LexiconEntry chickenEntry = lexicon.getLexiconEntry( "chicken" );
      int[][] docs = invertedIndex.getDocuments( chickenEntry );

      System.out.println( "docs[0]: " + Arrays.toString( docs[0] ) );
      System.out.println( "docs[1]: " + Arrays.toString( docs[1] ) );
      System.out.println( "docno of docs[0][0]: " + index.getMetaIndex().getItem( "docno", docs[0][0] ) );

      /* single pass */
      col = new SimpleFileCollection( files, false );
      collections = new Collection[] { col };

      BasicSinglePassIndexer singlePassIndexer = new BasicSinglePassIndexer(
          ApplicationSetup.TERRIER_INDEX_PATH, "test_filecollection_singlepass" );
      singlePassIndexer.createInvertedIndex( collections );

      index = Index.createIndex( "index", "test_filecollection_singlepass" );
      lexicon = index.getLexicon();
      invertedIndex = index.getInvertedIndex();

      chickenEntry = lexicon.getLexiconEntry( "chicken" );
      docs = invertedIndex.getDocuments( chickenEntry );

      System.out.println( "docs[0]: " + Arrays.toString( docs[0] ) );
      System.out.println( "docs[1]: " + Arrays.toString( docs[1] ) );
      System.out.println( "docno of docs[0][0]: " + index.getMetaIndex().getItem( "docno", docs[0][0] ) );

      = Output =
      docs[0]: [0]
      docs[1]: [1]
      docno of docs[0][0]: 1

      docs[0]: [1]
      docs[1]: [3]
      docno of docs[0][0]: 2
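
      For reference, the posting list expected for "chicken" can be worked out by hand from the two sample documents. The sketch below is not Terrier code: it is a small in-memory stand-in, and the class name ExpectedPostings and the method postings are hypothetical. It assumes that docs[0] holds document ids (counted from 0 in file order) and docs[1] holds the matching term frequencies, which is consistent with the single-pass output above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Terrier: computes the postings one would
// expect for a term, given whitespace-tokenised documents held in memory.
public class ExpectedPostings {

    /** Returns { docids[], frequencies[] } for the given term. */
    static int[][] postings(String[] docs, String term) {
        // docid -> term frequency, in ascending docid order
        Map<Integer, Integer> tf = new LinkedHashMap<>();
        for (int docid = 0; docid < docs.length; docid++) {
            for (String token : docs[docid].trim().split("\\s+")) {
                if (token.equals(term)) {
                    tf.merge(docid, 1, Integer::sum);
                }
            }
        }
        // Assumed layout: row 0 = docids, row 1 = frequencies
        int[][] out = new int[2][tf.size()];
        int i = 0;
        for (Map.Entry<Integer, Integer> e : tf.entrySet()) {
            out[0][i] = e.getKey();
            out[1][i] = e.getValue();
            i++;
        }
        return out;
    }

    public static void main(String[] args) {
        String[] docs = {
            "cats dogs horses",             // doc0.txt
            "chicken cats chicken chicken"  // doc1.txt
        };
        int[][] chicken = postings(docs, "chicken");
        System.out.println("docs[0]: " + Arrays.toString(chicken[0]));
        System.out.println("docs[1]: " + Arrays.toString(chicken[1]));
    }
}
```

      Under these assumptions the sketch prints docs[0]: [1] and docs[1]: [3], which matches the second (single-pass) block of output. The first (two-pass) block deviates from it.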


          Activity

          craigm Craig Macdonald added a comment -

          Thanks for the report. I will investigate shortly.

          craigm Craig Macdonald added a comment -

          Sorry, I can't reproduce this. I tried with both trunk and a clean copy of Terrier 3.0. Could you also try with a clean copy of Terrier 3.0?

          INFO - NEXT: doc0.txt
          INFO - NEXT: doc1.txt
          INFO - Collection #0 took 0 seconds to index (2 documents)
          INFO - Key docno values are sorted in meta index, consider binary searching zdata file
          INFO - 1 lexicons to merge
          INFO - Optimising structure lexicon
          Optimsing lexicon with 4 entries
          INFO - Started building the inverted index...
          INFO - Started building the inverted index...
          INFO - Iteration 1 of 1 iterations
          INFO - Optimising structure lexicon
          Optimsing lexicon with 4 entries
          INFO - Finished building the inverted index...
          INFO - Time elapsed for inverted file: 0
          INFO - Structure meta reading lookup file into memory
          INFO - Structure meta reading reverse map for key docno directly from disk
          INFO - Structure meta loading data file into memory
          docs[0]: [1]
          docs[1]: [3]
          docno of docs[0][0]: 2
          INFO - Creating IF (no direct file)..
          INFO - NEXT: doc0.txt
          INFO - NEXT: doc1.txt
          INFO - Collection #0 took 0 seconds to build the runs for 2 documents
          
          INFO - Key docno values are sorted in meta index, consider binary searching zdata file
          INFO - Merging 1 runs...
          INFO - Collection #0 took 0 seconds to merge
           
          INFO - Collection #0 total time 0
          INFO - Optimising structure lexicon
          Optimsing lexicon with 4 entries
          All ids for structure lexicon are aligned, skipping .fsomapid file
          INFO - Structure meta reading lookup file into memory
          INFO - Structure meta reading reverse map for key docno directly from disk
          INFO - Structure meta loading data file into memory
          docs[0]: [1]
          docs[1]: [3]
          docno of docs[0][0]: 2
          
          philipps Philipp Sorg added a comment -

          I tried again using a clean copy of Terrier 3.0 and also ran the test on a Linux server.

          On the server (Debian, x64) the results are correct. However, on my desktop (Windows 7, x64) the error remains. It seems to be a platform-specific problem.

          craigm Craig Macdonald added a comment -

          Ah, now I understand. See TR-116, which concerns a file not being closed. If this turns out to be the problem, then I'll close this issue as a duplicate.
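
          As a general illustration of why an unclosed file can behave differently across platforms: on Windows, a file that is still open typically cannot be renamed or deleted, whereas Linux usually allows both. The sketch below is not taken from Terrier or from the TR-116 patch, whose details are not given in this thread. It only shows the usual pattern of closing a writer before the file is moved into place; the class name CloseBeforeRename and the file names are hypothetical.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical example, not Terrier code: write a temporary file, close it,
// then rename it to its final name and read it back.
public class CloseBeforeRename {

    static String writeThenRename() throws IOException {
        Path dir = Files.createTempDirectory("close-demo");
        Path tmp = dir.resolve("structure.tmp");   // hypothetical file names
        Path fin = dir.resolve("structure.dat");

        // try-with-resources flushes and closes the writer before the move,
        // so no open handle is left on the temporary file.
        try (Writer w = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
            w.write("chicken 1 3");
        }

        Files.move(tmp, fin, StandardCopyOption.REPLACE_EXISTING);
        return new String(Files.readAllBytes(fin), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(writeThenRename());
    }
}
```

          If the writer were left open instead, the move could fail or leave stale data behind on platforms that lock open files, which would be one plausible way for the same program to produce different results on Windows and Linux.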

          philipps Philipp Sorg added a comment -

          The patch for TR-116 fixes the problem, so this bug is a duplicate.

          craigm Craig Macdonald added a comment -

          Duplicate of TR-116. Thanks for raising the issue, Philipp.


            People

            • Assignee:
              craigm Craig Macdonald
              Reporter:
              philipps Philipp Sorg
