[TR-279] Termids should be assigned by decreasing frequency for highest direct file compression [single pass indexers] Created: 31/Jan/14  Updated: 04/Apr/14  Resolved: 31/Jan/14

Status: Resolved
Project: Terrier Core
Component/s: None
Affects Version/s: None
Fix Version/s: 3.6

Type: Improvement Priority: Major
Reporter: Craig Macdonald Assignee: Craig Macdonald
Resolution: Fixed  
Labels: None


 Description   
One reason the direct files are so large for our modern corpora is that termids are poorly assigned. In classical indexing, termids are assigned in order of first observation. This is fine, as more frequent terms are more likely to be encountered early in the corpus and so receive small termids. For single-pass inverted indices (which can then be re-inverted), on the other hand, termids are assigned in increasing lexicographical order, which bears no relation to term frequency. This results in inferior compression for the direct file.

In this issue, we reassign termids by decreasing frequency before the inverted2direct processes take place.
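
To illustrate why the assignment order matters, here is a minimal sketch in Python. It is not the code committed for this issue. It assumes that "frequency" means document frequency, that ties are broken lexicographically, and that each document's direct-file entry stores its sorted termids as Elias gamma-coded gaps. The helper names (reassign_termids, gamma_bits, total_bits) and the toy corpus are hypothetical.

```python
from collections import Counter

def reassign_termids(term_freqs):
    """Give the most frequent term id 0, the next id 1, and so on.
    Ties are broken lexicographically (an assumption of this sketch)."""
    ordered = sorted(term_freqs, key=lambda t: (-term_freqs[t], t))
    return {term: new_id for new_id, term in enumerate(ordered)}

def gamma_bits(n):
    """Length in bits of the Elias gamma code for an integer n >= 1."""
    return 2 * (n.bit_length() - 1) + 1

def total_bits(docs, termids):
    """Bits needed to store every document's sorted termids as gamma-coded gaps.
    The first termid is stored as (termid + 1) so that every coded value is >= 1."""
    bits = 0
    for doc in docs:
        prev = -1
        for tid in sorted({termids[t] for t in doc}):
            bits += gamma_bits(tid - prev)
            prev = tid
    return bits

# Toy corpus: the frequent terms happen to sort late in the alphabet.
docs = [
    ["the", "was", "zoo"],
    ["the", "was", "aardvark"],
    ["the", "zoo", "abacus"],
    ["the", "was", "bison"],
]

# Document frequency of each term.
doc_freqs = Counter(term for doc in docs for term in set(doc))

# Lexicographic assignment, as produced by single-pass indexing.
lex_ids = {term: i for i, term in enumerate(sorted(doc_freqs))}
# Assignment by decreasing frequency, as proposed in this issue.
freq_ids = reassign_termids(doc_freqs)

print(total_bits(docs, lex_ids))   # 26
print(total_bits(docs, freq_ids))  # 22
```

With frequency-ordered termids, the terms that occur in most documents share the smallest ids, so the gaps within each document are small and take fewer bits to encode.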


 Comments   
Comment by Craig Macdonald [ 31/Jan/14 ]

Committed, r3724

Generated at Mon Dec 18 09:01:16 GMT 2017 using JIRA 7.1.1#71004-sha1:d6b2c0d9b7051e9fb5e4eb8ce177ca56d91d7bd8.