  1. Terrier Core
  2. TR-279

Termids should be assigned by decreasing frequency for highest direct file compression [single pass indexers]

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6
    • Component/s: None
    • Labels:
      None

      Description

      One reason the direct files are so large for our modern corpora is that termids are poorly assigned. In classical indexing, termids are assigned in order of first observation. This is fine, as more frequent terms are more likely to be met earlier in the corpus and so tend to receive smaller termids. For single-pass inverted indices (which can then be re-inverted), on the other hand, termids are assigned in increasing lexicographic order, so a term's id bears no relation to its frequency. This results in inferior compression of the direct file.
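
To illustrate why the assignment order matters, here is a minimal sketch that counts the bits needed to store one document's termids. It assumes the direct file stores each document's sorted termids as Elias-gamma coded gaps; that encoding, and the helper names `gamma_bits` and `direct_posting_bits`, are assumptions made for illustration and are not taken from the Terrier code base.

```python
def gamma_bits(n: int) -> int:
    """Number of bits Elias-gamma uses to encode a positive integer n."""
    if n < 1:
        raise ValueError("Elias-gamma encodes positive integers only")
    # floor(log2 n) unary prefix bits plus floor(log2 n) + 1 binary bits
    return 2 * n.bit_length() - 1


def direct_posting_bits(termids) -> int:
    """Bits needed to gamma-encode one document's termids as gaps.

    The first termid is stored as termid + 1 so that zero is never encoded
    (an assumption of this sketch).
    """
    ids = sorted(set(termids))
    bits = gamma_bits(ids[0] + 1)
    for prev, cur in zip(ids, ids[1:]):
        bits += gamma_bits(cur - prev)
    return bits


# The same four frequent terms under two hypothetical id assignments.
lexicographic_ids = [17, 400, 812, 950]  # ids scattered across the lexicon
frequency_ids = [0, 1, 2, 3]             # most frequent terms get the smallest ids

print(direct_posting_bits(lexicographic_ids))
print(direct_posting_bits(frequency_ids))
```

Under these assumptions, the document whose frequent terms have small, adjacent ids needs far fewer bits, because every gap is small.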

      In this issue, we reassign termids by decreasing frequency before the inverted2direct processes take place.
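
The reassignment step can be sketched as follows. This is not the Terrier implementation: the function names `reassign_termids` and `remap_inverted`, the dictionary-based index layout, and the tie-breaking rule (old termid, ascending) are all hypothetical choices for illustration.

```python
def reassign_termids(term_freq: dict) -> dict:
    """Map old termids to new ones so that more frequent terms get smaller ids.

    term_freq maps old termid -> frequency. Ties are broken by the old
    termid to keep the mapping deterministic (an assumption of this sketch).
    """
    ordered = sorted(term_freq, key=lambda t: (-term_freq[t], t))
    return {old: new for new, old in enumerate(ordered)}


def remap_inverted(inverted: dict, mapping: dict) -> dict:
    """Rewrite an inverted index (old termid -> postings) with the new termids."""
    return {mapping[old]: postings for old, postings in inverted.items()}


# Toy lexicon whose ids were assigned lexicographically.
term_freq = {0: 3, 1: 50, 2: 7, 3: 50}
mapping = reassign_termids(term_freq)

# Toy inverted index: old termid -> list of docids.
inverted = {0: [4], 1: [1, 2, 4], 2: [2], 3: [1, 3]}
remapped = remap_inverted(inverted, mapping)
```

Because the remapping runs before the index is inverted back into a direct file, the direct file would only ever see the new, frequency-ordered ids.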

        Attachments

          Activity

          craigm Craig Macdonald created issue -
          craigm Craig Macdonald made changes -
          Field Original Value New Value
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          richardm Richard McCreadie made changes -
          Project TREC [ 10010 ] Terrier Core [ 10000 ]
          Key TREC-352 TR-279
          Workflow jira [ 10752 ] Terrier Open Source [ 10797 ]
          Component/s Core [ 10020 ]
          Fix Version/s 3.6 [ 10060 ]
          Fix Version/s 3.6 [ 10061 ]

            People

            • Assignee:
              craigm Craig Macdonald
              Reporter:
              craigm Craig Macdonald
            • Watchers:
              0

              Dates

              • Created:
                Updated:
                Resolved: