  1. Terrier Core
  2. TR-279

Termids should be assigned by decreasing frequency for highest direct file compression [single pass indexers]

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6
    • Component/s: None
    • Labels:
      None

      Description

      One of the reasons that the direct files are so large for our modern corpora is that the termids are poorly assigned. In particular, for classical indexing, termids are assigned in order of first observation. This is fine, as more frequent terms are more likely to be encountered earlier in the corpus and so tend to receive smaller termids. On the other hand, for single-pass inverted indices (which can then be re-inverted), the termids are assigned in increasing lexicographic order, so a term's id bears no relation to its frequency. This results in inferior compression for the direct file.

      In this issue, we reassign termids by decreasing term frequency before the inverted2direct processes take place.
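
      The reassignment described above can be illustrated with a minimal Python sketch. It is not the Terrier implementation: the function names, the toy lexicon and the choice of Elias-gamma coded termid gaps as the cost model are all assumptions made for illustration only.

```python
from typing import Dict, Iterable


def lexicographic_termids(term_freqs: Dict[str, int]) -> Dict[str, int]:
    """Assign ids in increasing lexicographic order of the term strings."""
    return {term: i for i, term in enumerate(sorted(term_freqs))}


def frequency_termids(term_freqs: Dict[str, int]) -> Dict[str, int]:
    """Assign ids by decreasing frequency: the most frequent term gets id 0.

    Ties are broken lexicographically so the mapping is deterministic.
    """
    ordered = sorted(term_freqs, key=lambda t: (-term_freqs[t], t))
    return {term: i for i, term in enumerate(ordered)}


def gamma_bits(n: int) -> int:
    """Length in bits of the Elias-gamma code of an integer n >= 1."""
    return 2 * (n.bit_length() - 1) + 1


def direct_entry_bits(doc_terms: Iterable[str], termids: Dict[str, int]) -> int:
    """Bits needed for one document's termids, stored as gamma-coded gaps.

    Assumed cost model: ids are sorted, and each id is stored as its
    distance from the previous id (the first id is stored as id + 1).
    """
    bits = 0
    previous = -1
    for termid in sorted(termids[t] for t in set(doc_terms)):
        bits += gamma_bits(termid - previous)
        previous = termid
    return bits


# Hypothetical lexicon: term -> collection frequency.
lexicon = {"aardvark": 1, "index": 50, "of": 800, "the": 1000, "zebra": 2}
document = ["the", "of", "index"]

lexicographic_cost = direct_entry_bits(document, lexicographic_termids(lexicon))
frequency_cost = direct_entry_bits(document, frequency_termids(lexicon))
```

      In this toy case the document contains only frequent terms, so the frequency-ordered ids are small and adjacent, and the entry costs fewer bits than with lexicographic ids.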

            People

            • Assignee:
              craigm Craig Macdonald
            • Reporter:
              craigm Craig Macdonald
            • Watchers:
              0