[TR-14] Refactor Lexicons: LexiconEntry should be interchangeable Created: 12/Feb/09  Updated: 13/Mar/09  Resolved: 13/Mar/09

Status: Resolved
Project: Terrier Core
Component/s: .structures
Affects Version/s: 2.2.1
Fix Version/s: None

Type: Improvement Priority: Major
Reporter: Craig Macdonald Assignee: Craig Macdonald
Resolution: Fixed  
Labels: None

Attachments: File TR-14.v2.patch     File TR-14.v3.svn.patch     File TR14-v1.patch    
Issue Links:
blocks TR-17 DOCNOs must be in lexicographical ord... Resolved
blocks TR-13 Allow fields to contain count informa... Resolved

The current Lexicon implementations suffer from several disadvantages:
 * To store more information in the lexicon, the Lexicon class has to be sub-classed
 * LexiconInputStream and LexiconOutputStream don't make it easy to add more information to the Lexicon
 * Deprecated methods (e.g. getTF()) should be removed

This issue is to track changes to the Lexicon so that the Lexicon code can be reused without extensive sub-classing.

Comment by Gianni Amati [ 18/Feb/09 ]

The issue here is whether we want to store all or most of the information in a single file or not. The Lexicon contains information about how to find information about the elements of our algebra/probability space. For example, exact matching or k-grams may require a dedicated structure and a dedicated lexicon.

So, suppose that all related issues about unary lexicons have been solved, so that we have all the desiderata: all fields, field counters, an intelligent merger, etc. In principle, we then implicitly have all the information about where in the collection a token is, and under the scope of which label/tag it occurs. Now, we want to store more information. What is this new information?

Comment by Craig Macdonald [ 18/Feb/09 ]

My idea for this issue is that the Lexicon will have configurable key types and value types. For unigram, the key type will always be String. The value type will always be a subclass of LexiconEntry. LexiconEntry will implement two interfaces - TermStatistics (possibly extended by FieldStatistics), and BitIndexPointer, which represents the byte/bit offset and the number of pointers for that term (same as Nt usually).
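
The proposed shape can be sketched roughly as follows. This is only an illustration of the idea, not the code in the patch: the type names TermStatistics, BitIndexPointer and LexiconEntry come from this comment, while every method name, the SketchLexiconEntry and SketchLexicon classes, and the in-memory TreeMap backing are assumptions made for the example.

```java
import java.util.TreeMap;

// Collection-wide statistics of one lexicon key. Method names are assumed.
interface TermStatistics {
    int getDocumentFrequency(); // Nt: number of documents containing the key
    int getFrequency();         // total occurrences in the collection
}

// Where the posting list of a key starts, and how many pointers it holds.
interface BitIndexPointer {
    long getOffset();         // start offset, whole bytes
    byte getOffsetBits();     // start offset, remaining bits
    int getNumberOfEntries(); // number of pointers (usually the same as Nt)
}

// The value type of a lexicon is always some subclass of this.
abstract class LexiconEntry implements TermStatistics, BitIndexPointer { }

// Hypothetical concrete entry, for illustration only.
final class SketchLexiconEntry extends LexiconEntry {
    private final int documentFrequency;
    private final int frequency;
    private final long offset;
    private final byte offsetBits;

    SketchLexiconEntry(int documentFrequency, int frequency, long offset, byte offsetBits) {
        this.documentFrequency = documentFrequency;
        this.frequency = frequency;
        this.offset = offset;
        this.offsetBits = offsetBits;
    }

    @Override public int getDocumentFrequency() { return documentFrequency; }
    @Override public int getFrequency() { return frequency; }
    @Override public long getOffset() { return offset; }
    @Override public byte getOffsetBits() { return offsetBits; }
    @Override public int getNumberOfEntries() { return documentFrequency; }
}

// Hypothetical in-memory stand-in for a lexicon with configurable key and value types.
final class SketchLexicon<K extends Comparable<K>, V extends LexiconEntry> {
    private final String structureName; // e.g. "lexicon" for unigrams
    private final TreeMap<K, V> entries = new TreeMap<>();

    SketchLexicon(String structureName) { this.structureName = structureName; }

    String getStructureName() { return structureName; }

    void put(K key, V value) { entries.put(key, value); }

    // Returns null when the key is not in the lexicon.
    V getLexiconEntry(K key) { return entries.get(key); }
}
```

Under these assumptions, a unigram lexicon would be a SketchLexicon<String, SketchLexiconEntry> named "lexicon", and a second instance with a different structure name could sit alongside it in the same index.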

The 'name' of the structure in the index will always be 'lexicon' for unigram. However, it can be changed in the case that the lexicon is being used for another purpose. For example, I would suggest '2lexicon' for a 2-gram lexicon.

Stay tuned, more details soon.

Comment by Craig Macdonald [ 26/Feb/09 ]

First version of this (substantial) patch. This patch makes substantial changes to both the code and the index format. After applying this patch, Terrier indices will no longer be forwards or backwards compatible. You will have to reindex to use existing collections.

  • Lexicon is refactored into a lightish wrapper around a MapFile. A MapFile provides the previous search functionality of Lexicon, but using abstract key and value types. We will use MapFile for reverse metadata lookups in due course (e.g. docno or URL -> docid)
  • All Lexicon access now goes via a LexiconEntry.
  • Structures now have names associated with them, and these names are reflected in their constructors and in the corresponding file names, e.g. the inverted index filename is now data.inverted.bf
  • Upgrading code is unlikely to work and will be removed as part of this issue. Terrier will not be able to use indices from previous versions. (Lemur does this often; this is the first time that Terrier has done so.)
  • Pointers into the inverted index are now instances of BitFilePosition (usually implemented by LexiconEntry), which consists of a long and a byte (start offset in bytes and bits), as well as an int (number of records). The default inverted index BitFile implementation is BitFileBuffered, as BitFile loads the entire compressed posting list into memory, and hence requires the end offset as well. Offsets stored in Lexicon entries are now starting offsets, not ending offsets.
  • As of v1, this patch removes the UTF/non-UTF differentiation within the Lexicon hierarchy, simplifying the class structure.
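
To make the pointer layout in the list above concrete, here is a minimal sketch of a position made of a long, a byte and an int. The class and method names are hypothetical and are not the BitFilePosition API. The lengthInBits helper additionally assumes that posting lists are stored back to back in lexicon order, which this issue does not state; it only illustrates why a reader that wants the whole list in memory needs an end offset when entries store start offsets only.

```java
// Hypothetical value class mirroring the description: byte offset, bit offset, record count.
final class SketchBitFilePosition {
    private final long byteOffset;     // start offset, whole bytes
    private final byte bitOffset;      // start offset, remaining bits (0 to 7)
    private final int numberOfRecords; // number of records in the posting list

    SketchBitFilePosition(long byteOffset, byte bitOffset, int numberOfRecords) {
        if (bitOffset < 0 || bitOffset > 7) {
            throw new IllegalArgumentException("bitOffset must be between 0 and 7");
        }
        this.byteOffset = byteOffset;
        this.bitOffset = bitOffset;
        this.numberOfRecords = numberOfRecords;
    }

    int getNumberOfRecords() { return numberOfRecords; }

    // Start of the posting list expressed as a single bit address.
    long absoluteBitOffset() { return byteOffset * 8 + bitOffset; }
}

final class PostingListSpans {
    // Length in bits of posting list i, ASSUMING lists are contiguous and in the
    // same order as the array: each list ends where the next one starts, and the
    // last one ends at the end of the file.
    static long lengthInBits(SketchBitFilePosition[] starts, int i, long fileLengthInBits) {
        long start = starts[i].absoluteBitOffset();
        long end = (i + 1 < starts.length)
                ? starts[i + 1].absoluteBitOffset()
                : fileLengthInBits;
        return end - start;
    }
}
```

A buffered reader can start decoding at absoluteBitOffset() and stop after getNumberOfRecords() records, so it never needs the end offset.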

Please note that more index format changes are planned as part of issues TR-12 and TR-17.

Comment by Craig Macdonald [ 03/Mar/09 ]

Updated patch (v2), addressing offline comments from Ben:

  • Fixed lexicon merging for classical indexing, including removing temporary lexicon folders
  • Fixes to allow QueryExpansion class to work (vis-a-vis MatchingQueryTerms being given TermStatistics at Matching time, and MapFile implementing OrderedMap)
  • add() and subtract() methods of LexiconEntry are now specified in the TermStatistics interface.
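
As a rough illustration of why add() and subtract() are useful on the statistics type, the sketch below merges two partial lexicons by summing the statistics of keys that occur in both. The classes MergeableStats and LexiconMergeSketch are hypothetical and much simpler than the patch. Summing document frequencies is only correct under the assumption that the two partial lexicons describe disjoint sets of documents.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical statistics holder with add() and subtract(), for illustration only.
final class MergeableStats {
    private int documentFrequency;
    private int frequency;

    MergeableStats(int documentFrequency, int frequency) {
        this.documentFrequency = documentFrequency;
        this.frequency = frequency;
    }

    // Copy constructor, so that merging never mutates its inputs.
    MergeableStats(MergeableStats other) {
        this(other.documentFrequency, other.frequency);
    }

    int getDocumentFrequency() { return documentFrequency; }
    int getFrequency() { return frequency; }

    void add(MergeableStats other) {
        documentFrequency += other.documentFrequency;
        frequency += other.frequency;
    }

    void subtract(MergeableStats other) {
        documentFrequency -= other.documentFrequency;
        frequency -= other.frequency;
    }
}

final class LexiconMergeSketch {
    // Merges two partial lexicons into a new sorted map; keys present in both are summed.
    static TreeMap<String, MergeableStats> merge(Map<String, MergeableStats> a,
                                                 Map<String, MergeableStats> b) {
        TreeMap<String, MergeableStats> out = new TreeMap<>();
        for (Map.Entry<String, MergeableStats> e : a.entrySet()) {
            out.put(e.getKey(), new MergeableStats(e.getValue()));
        }
        for (Map.Entry<String, MergeableStats> e : b.entrySet()) {
            out.merge(e.getKey(), new MergeableStats(e.getValue()),
                    (existing, incoming) -> { existing.add(incoming); return existing; });
        }
        return out;
    }
}
```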

I want to question whether TermStatistics is the correct name for the interface describing the statistics of a term in the whole collection. If the Lexicon is used to store information about bi-grams, then TermStatistics may be slightly misleading. However, I can't think of a better name.

Experiments have shown this improved version to be slightly slower at classical indexing of the WT2G collection: 1795 vs 1740 seconds (direct 1418 vs 1378, inversion 322 vs 287).

Comment by Craig Macdonald [ 12/Mar/09 ]

Final version of this patch, taking into account Ben's comments:

  • TermStatistics is renamed EntryStatistics
  • MapFile is now FSOrderedMapFile (FS stands for FixedSize)
  • MapFileLexicon is now FSOMapFileLexicon
  • Some javadoc is improved.
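
The "FixedSize" in the new name hints at why lookups can stay cheap: if every record occupies the same number of bytes, record i starts at i times the record size, and a sorted file can be binary searched without any auxiliary index. The sketch below shows that idea over an in-memory ByteBuffer. It is not the FSOrderedMapFile implementation; the class name, the 20-byte key width, the zero padding and the int value are all assumptions made for the example, and keys must already be sorted by String.compareTo.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical fixed-size-record ordered map, for illustration only.
final class FixedSizeOrderedMapSketch {
    static final int KEY_BYTES = 20;                            // assumed key width
    static final int RECORD_BYTES = KEY_BYTES + Integer.BYTES;  // key followed by an int value

    // Writes records in the given order; sortedKeys must be sorted by String.compareTo.
    static ByteBuffer write(String[] sortedKeys, int[] values) {
        ByteBuffer buf = ByteBuffer.allocate(sortedKeys.length * RECORD_BYTES);
        for (int i = 0; i < sortedKeys.length; i++) {
            byte[] raw = sortedKeys[i].getBytes(StandardCharsets.UTF_8);
            if (raw.length > KEY_BYTES) {
                throw new IllegalArgumentException("key too long: " + sortedKeys[i]);
            }
            buf.put(Arrays.copyOf(raw, KEY_BYTES)); // zero-padded to the fixed width
            buf.putInt(values[i]);
        }
        buf.flip();
        return buf;
    }

    // Reads the key of the given record by computing its offset directly.
    static String keyAt(ByteBuffer buf, int record) {
        byte[] raw = new byte[KEY_BYTES];
        for (int i = 0; i < KEY_BYTES; i++) {
            raw[i] = buf.get(record * RECORD_BYTES + i);
        }
        int length = 0;
        while (length < KEY_BYTES && raw[length] != 0) {
            length++;
        }
        return new String(raw, 0, length, StandardCharsets.UTF_8);
    }

    // Binary search over the records; returns null when the key is absent.
    static Integer get(ByteBuffer buf, String key) {
        int lo = 0;
        int hi = buf.limit() / RECORD_BYTES - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = keyAt(buf, mid).compareTo(key);
            if (cmp == 0) {
                return buf.getInt(mid * RECORD_BYTES + KEY_BYTES);
            }
            if (cmp < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return null;
    }
}
```

The same arithmetic works against a file on disk by seeking to record times RECORD_BYTES, at the cost of wasting space on short keys.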

Comment by Ben He [ 13/Mar/09 ]

+1 It looks good to me and tests ok.

Comment by Craig Macdonald [ 13/Mar/09 ]

Patch committed! Thanks for your detailed feedback and testing on this patch, Ben.

Generated at Thu Oct 01 08:19:10 BST 2020 using JIRA 7.1.1#71004-sha1:d6b2c0d9b7051e9fb5e4eb8ce177ca56d91d7bd8.