Terrier Core / TR-474

getNumberOfTokens in the class UpdatingCollectionStatistics returns the number of pointers instead of the number of tokens.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 4.2
    • Fix Version/s: None
    • Component/s: .structures
    • Labels: None

      Description

      There is a bug in the method getNumberOfTokens of the class UpdatingCollectionStatistics, an inner class of Index. This method returns the value associated with the property "num.Pointers" instead of the one associated with "num.Tokens". This causes a problem when IndexOnDisk objects are merged to create a new index: the resulting Properties object reports a number of tokens smaller than the correct one, so the average document length of the collection is computed incorrectly.
      I have attached the patched Java class.
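
      The faulty read can be illustrated with a minimal sketch (class and method names are simplified stand-ins for UpdatingCollectionStatistics inside Index.java; the property values are invented for illustration):

```java
import java.util.Properties;

// Simplified stand-in for UpdatingCollectionStatistics.getNumberOfTokens().
// The buggy version reads "num.Pointers"; the fix reads "num.Tokens".
public class TokenCountSketch {

    // Example index properties: token count larger than pointer count.
    static Properties sampleProps() {
        Properties props = new Properties();
        props.setProperty("num.Tokens", "1000");
        props.setProperty("num.Pointers", "400");
        return props;
    }

    static long buggyGetNumberOfTokens(Properties props) {
        // BUG: wrong property key
        return Long.parseLong(props.getProperty("num.Pointers", "0"));
    }

    static long fixedGetNumberOfTokens(Properties props) {
        return Long.parseLong(props.getProperty("num.Tokens", "0"));
    }

    public static void main(String[] args) {
        Properties props = sampleProps();
        // The buggy method under-reports the token count (400 instead of 1000),
        // which in turn understates the average document length.
        System.out.println("buggy: " + buggyGetNumberOfTokens(props));
        System.out.println("fixed: " + fixedGetNumberOfTokens(props));
    }
}
```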

        Attachments

        1. Approach_1.java
          2 kB
        2. Approach_2.java
          2 kB
        3. Approach_3.java
          2 kB
        4. Index.java
          15 kB

          Issue Links

            Activity

            Richard McCreadie added a comment -

            Good spot on that bug. I can confirm that the wrong property is being read, at line 90 of Index.java.

            return Long.parseLong(properties.getProperty("num.Pointers", "0"));

            should be

            return Long.parseLong(properties.getProperty("num.Tokens", "0"));

            Craig Macdonald added a comment -

            Thanks both. Andrea, could you tell us how you found the bug (which particular indexing mechanism, as Terrier has a few!). This will enable us to generate a verifiable unit test case.

            Andrea Langeli added a comment -

            Yes, of course. I created a collection of 4 PDF documents, each consisting of a single sentence. I indexed the collection using three different approaches and finally ran the same query against each of the resulting Index objects. The query consisted of four words: the first occurred once in the first document, and the second occurred once in the third document. The indexing strategies I used were:
            1) Classical two-pass indexing.
            2) Real-time indexing with incremental.flushdocs set to 1 and no merging policy.
            3) Real-time indexing with incremental.flushdocs set to 1 and incremental.merge set to single.
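
            For reference, the real-time settings above correspond to these Terrier properties (a sketch based only on the values quoted in this comment; the full configurations are in the attached Approach_2.java and Approach_3.java):

```properties
# Approach 2: flush after every document, no merging policy
incremental.flushdocs=1

# Approach 3: flush after every document, merge flushed indices into a single index
incremental.flushdocs=1
incremental.merge=single
```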

            In all three cases I used the TF-IDF weighting model. The first two approaches returned the same results (the same ranked list and the same scores), whereas the third returned slightly lower scores than the other two. Gianmaria Silvello and I suspected that this mismatch was due to the length normalisation, so we inspected the code in search of the cause.
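
            The direction of the mismatch is consistent with an understated token count: with a Robertson-style length-normalised tf (a simplified sketch with illustrative k1 and b parameters, not Terrier's exact TF_IDF code), a smaller average document length makes every document look relatively longer, shrinking the normalised tf and hence the score:

```java
// Sketch: an understated average document length lowers length-normalised tf.
public class AvgLenSketch {

    // Robertson-style normalised term frequency; k1 and b are illustrative.
    static double normalisedTf(double tf, double docLen, double avgDocLen) {
        double k1 = 1.2, b = 0.75;
        return (k1 * tf) / (tf + k1 * (1 - b + b * docLen / avgDocLen));
    }

    public static void main(String[] args) {
        double tf = 1, docLen = 10;
        double right = normalisedTf(tf, docLen, 10.0); // correct average length
        double wrong = normalisedTf(tf, docLen, 4.0);  // understated average length
        // The understated average length yields the lower score.
        System.out.println(wrong < right); // prints true
    }
}
```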

            The bug is in the method getNumberOfTokens of the inner class UpdatingCollectionStatistics. The steps of the indexing process leading up to the error are:

            (IncrementalIndex) indexDocument(Document doc)
            → (IncrementalIndex) flush()
            → (IncrementalIndex) flush(): ((Runnable) mergePolicy).run()
            → (IncrementalMergeSingle) run(): merger.mergeStructures()
            → (StructureMerger) mergeStructures(): createLexidFile()
            → (StructureMerger) createLexidFile(): LexiconBuilder.optimise(destIndex, "lexicon")
            → (LexiconBuilder) optimise(): counter.close()
            → (BasicLexiconCollectionStatisticsCounter) close()

            The close method of the static class BasicLexiconCollectionStatisticsCounter modifies the Properties object associated with the destination IndexOnDisk object, setting "num.Documents", "num.Tokens", and "num.Terms" to the corresponding values of the BasicLexiconCollectionStatisticsCounter counters. This indirectly modifies the values of the variables numberOfDocuments, numberOfTokens, and numberOfUniqueTerms of the UpdatingCollectionStatistics object associated with the IndexOnDisk object. These values are used to compute the score assigned to each collection document during retrieval.
            This error does not occur in the first two approaches because the loadStatistics method is executed instead of the loadUpdatingStatistics one.
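
            The indirect-update mechanism described above can be sketched as follows (simplified, hypothetical class names; the point is only that the statistics object reads lazily from the same Properties that close() writes to):

```java
import java.util.Properties;

// Sketch: the statistics object keeps a reference to the index's Properties,
// so writing "num.Tokens" (as the counter's close() does) immediately changes
// what the statistics report.
public class IndirectUpdateSketch {

    static class UpdatingStats {
        private final Properties props;
        UpdatingStats(Properties props) { this.props = props; }
        long numberOfTokens() {
            return Long.parseLong(props.getProperty("num.Tokens", "0"));
        }
    }

    // Returns true iff the stats view changes when the shared Properties change.
    static boolean demo() {
        Properties indexProps = new Properties();
        UpdatingStats stats = new UpdatingStats(indexProps);
        long before = stats.numberOfTokens();         // 0: nothing written yet
        indexProps.setProperty("num.Tokens", "1234"); // what close() effectively does
        long after = stats.numberOfTokens();          // 1234: view updated indirectly
        return before == 0 && after == 1234;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints true
    }
}
```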

            I am Andrea Langeli, a computer engineering student of the University of Padua. I’m working on Terrier for my master’s thesis. My supervisor is Gianmaria Silvello.

            Craig Macdonald added a comment -

            This was a dup. Already fixed in TR-444. Thanks for the report.

            Craig


              People

              • Assignee:
                Craig Macdonald
              • Reporter:
                Andrea Langeli
              • Watchers:
                3

                Dates

                • Created:
                • Updated:
                • Resolved: