[TR-279] Termids should be assigned by decreasing frequency for highest direct file compression [single pass indexers] Created: 31/Jan/14  Updated: 04/Apr/14  Resolved: 31/Jan/14

Status: Resolved
Project: Terrier Core
Component/s: None
Affects Version/s: None
Fix Version/s: 3.6

Type: Improvement Priority: Major
Reporter: Craig Macdonald Assignee: Craig Macdonald
Resolution: Fixed  
Labels: None

One of the reasons that the direct files are so large for our modern corpora is that the termids are poorly assigned. In particular, for classical indexing, they are assigned in order of first observation. This is fine, as more frequent terms are more likely to be met earlier in the corpus and so receive smaller termids. On the other hand, for single-pass inverted indices (which can then be re-inverted), the termids are assigned in increasing lexicographical order, so a term's id bears no relation to its frequency. This results in inferior compression for the direct file.
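
To illustrate the size difference, here is a rough sketch that counts the bits needed to store one document's termids. It assumes the direct file stores each document's termids sorted, as gaps, using Elias-gamma codes; that encoding, the function names and the example termids are assumptions made for illustration, not taken from this issue.

```python
def gamma_bits(x: int) -> int:
    """Length in bits of the Elias-gamma code for a positive integer x."""
    if x < 1:
        raise ValueError("Elias-gamma codes are defined for x >= 1")
    return 2 * (x.bit_length() - 1) + 1


def direct_posting_bits(termids) -> int:
    """Bits needed to gap-encode one document's termids (assumed encoding).

    Termids start at 0, so the first gap is measured from -1 to keep
    every encoded value >= 1.
    """
    bits = 0
    prev = -1
    for termid in sorted(set(termids)):
        bits += gamma_bits(termid - prev)
        prev = termid
    return bits


# Hypothetical document containing four common terms.
# Frequency-ordered ids: common terms have small, tightly packed ids.
by_frequency = [0, 1, 2, 5]
# Lexicographic ids: the same terms are scattered across the id space.
by_lexicographic = [10, 4000, 52000, 90000]

print(direct_posting_bits(by_frequency))      # 6
print(direct_posting_bits(by_lexicographic))  # 92
```

Under these assumptions the same four terms cost 6 bits with frequency-ordered ids and 92 bits with lexicographic ids, because small gaps get short codes.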

In this issue, we reassign termids in order of decreasing term frequency before the inverted2direct processes take place.
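
The reassignment can be sketched as follows. This is a minimal illustration, not the code committed for this issue: the function name `reassign_termids`, the dictionary-based lexicon and the alphabetical tie-break are all assumptions.

```python
def reassign_termids(term_freqs: dict) -> dict:
    """Map each term to a new termid, assigned by decreasing frequency.

    term_freqs maps term -> frequency in the collection. The most
    frequent term gets termid 0. Ties are broken alphabetically so the
    result is deterministic (an assumption of this sketch).
    """
    ordered = sorted(term_freqs.items(), key=lambda item: (-item[1], item[0]))
    return {term: new_id for new_id, (term, _freq) in enumerate(ordered)}


# Hypothetical lexicon: lexicographic ids would be
# apple=0, of=1, the=2, zebra=3.
lexicon = {"apple": 3, "of": 700, "the": 1000, "zebra": 1}
new_ids = reassign_termids(lexicon)
print(new_ids)  # {'the': 0, 'of': 1, 'apple': 2, 'zebra': 3}
```

The new mapping would have to be applied to the lexicon and to the termids read from the inverted index before the direct file is written, so that both structures agree on the ids.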

Comment by Craig Macdonald [ 31/Jan/14 ]

Committed, r3724
