  1. Terrier Core
  2. TR-14

Refactor Lexicons: LexiconEntry should be interchangeable

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.1
    • Fix Version/s: None
    • Component/s: .structures
    • Labels:
      None

      Description

      The current Lexicon implementations suffer from several disadvantages:
       * To store more information in the lexicon, the Lexicon class has to be sub-classed
       * LexiconInputStream and LexiconOutputStream don't make it easy to add more information to the Lexicon
       * Deprecated methods, e.g. getTF(), should be removed

      This issue is to track changes to the Lexicon so that the Lexicon code can be reused without extensive sub-classing.

        Attachments

        1. TR14-v1.patch
          268 kB
        2. TR-14.v3.svn.patch
          467 kB
        3. TR-14.v2.patch
          507 kB

            Activity

            gianni_amati Gianni Amati added a comment -

            The issue here is whether or not we want to store all or most of the information in a single file. The Lexicon contains information about how to find information about the elements of our algebra/probability space. For example, exact matching or k-grams may require a dedicated structure and a dedicated lexicon.

            So, suppose that all related issues about unary lexicons have been solved, so that we have all the desiderata: all fields, field counters, an intelligent merger, etc. In principle we then implicitly have all the information about where in the collection a token is, and under the scope of which label/tag it occurs. Now, we want to store more information. What is this new information?

            craigm Craig Macdonald added a comment -

            My idea for this issue is that the Lexicon will have configurable key types and value types. For unigram, the key type will always be String. The value type will always be a subclass of LexiconEntry. LexiconEntry will implement two interfaces - TermStatistics (possibly extended by FieldStatistics), and BitIndexPointer, which represents the byte/bit offset and the number of pointers for that term (same as Nt usually).

            The 'name' of the structure in the index will always be 'lexicon' for unigram. However, it can be changed when the lexicon is being used for another purpose. For example, I would suggest '2lexicon' for a 2-gram lexicon.

            Stay tuned, more details soon.
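
To make the proposal above concrete, here is a minimal sketch of how the pieces might fit together. It is not taken from any patch on this issue: every method signature, the FieldLexiconEntry class, and the in-memory TreeMap standing in for the on-disk structure are assumptions made for illustration. Only the names Lexicon, LexiconEntry, TermStatistics and BitIndexPointer come from the comment.

```java
import java.util.TreeMap;

/** Illustrative sketch only: signatures are assumptions, not Terrier's actual API. */
final class LexiconDesignSketch {

    /** Collection-wide statistics for one key (hypothetical accessors). */
    interface TermStatistics {
        int getDocumentFrequency(); // Nt: number of documents containing the term
        int getFrequency();         // total occurrences across the collection
    }

    /** Start of the postings for one key, plus how many pointers to read. */
    interface BitIndexPointer {
        long getOffset();         // byte offset into the inverted file
        byte getOffsetBits();     // bit offset within that byte
        int getNumberOfEntries(); // number of pointers, usually equal to Nt
    }

    /** Value type carrying both the statistics and the pointer. */
    static class LexiconEntry implements TermStatistics, BitIndexPointer {
        private final int documentFrequency;
        private final int frequency;
        private final long offset;
        private final byte offsetBits;

        LexiconEntry(int documentFrequency, int frequency, long offset, byte offsetBits) {
            this.documentFrequency = documentFrequency;
            this.frequency = frequency;
            this.offset = offset;
            this.offsetBits = offsetBits;
        }

        public int getDocumentFrequency() { return documentFrequency; }
        public int getFrequency() { return frequency; }
        public long getOffset() { return offset; }
        public byte getOffsetBits() { return offsetBits; }
        public int getNumberOfEntries() { return documentFrequency; }
    }

    /** Hypothetical example: extra information lives in an entry subclass, not a Lexicon subclass. */
    static class FieldLexiconEntry extends LexiconEntry {
        private final int[] fieldFrequencies;

        FieldLexiconEntry(int df, int tf, long offset, byte offsetBits, int[] fieldFrequencies) {
            super(df, tf, offset, offsetBits);
            this.fieldFrequencies = fieldFrequencies.clone();
        }

        int[] getFieldFrequencies() { return fieldFrequencies.clone(); }
    }

    /** Lexicon with configurable key and value types, identified by a structure name. */
    static final class Lexicon<K extends Comparable<K>, V extends LexiconEntry> {
        private final String structureName;
        private final TreeMap<K, V> entries = new TreeMap<>(); // stand-in for the on-disk structure

        Lexicon(String structureName) { this.structureName = structureName; }

        String getStructureName() { return structureName; }
        void put(K key, V value) { entries.put(key, value); }
        V getLexiconEntry(K key) { return entries.get(key); } // null when the key is absent
        int numberOfEntries() { return entries.size(); }
    }

    public static void main(String[] args) {
        Lexicon<String, LexiconEntry> unigrams = new Lexicon<>("lexicon");
        unigrams.put("terrier", new LexiconEntry(3, 7, 128L, (byte) 5));

        Lexicon<String, FieldLexiconEntry> bigrams = new Lexicon<>("2lexicon");
        bigrams.put("information retrieval",
                new FieldLexiconEntry(2, 4, 512L, (byte) 0, new int[] {1, 3}));

        LexiconEntry entry = unigrams.getLexiconEntry("terrier");
        System.out.println(unigrams.getStructureName() + ": Nt=" + entry.getDocumentFrequency());
        System.out.println(bigrams.getStructureName() + ": " + bigrams.numberOfEntries() + " entry");
    }
}
```

The point of the sketch is that the Lexicon class itself never changes: a 2-gram lexicon differs only in its structure name and in the entry type it is parameterised with.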

            craigm Craig Macdonald added a comment -

            First version of this (substantial) patch. This patch makes substantial changes to both the code and the index format. After applying this patch, Terrier indices will no longer be forwards or backwards compatible. You will have to reindex to use existing collections.

            • Lexicon is refactored into a lightish wrapper around a MapFile. A MapFile provides the previous search functionality of Lexicon, but using abstract key and value types. We will use MapFile for reverse metadata lookups in due course (e.g. docno or URL -> docid).
            • All Lexicon access now goes via a LexiconEntry.
            • Structures now have names associated with them, and these names are reflected in their constructors and in the corresponding file names, e.g. the inverted index filename is now data.inverted.bf
            • Upgrading code is unlikely to work and will be removed as part of this issue. Terrier will not be able to use indices from previous versions. (Lemur does this often; this is the first time that Terrier has done so.)
            • Pointers into the inverted index are now instances of BitFilePosition (usually implemented by LexiconEntry), which consists of a long and a byte (start offset in bytes and bits), as well as an int (number of records). The default inverted index BitFile is BitFileBuffered, as BitFile loads the entire compressed posting list into memory, and hence requires the end-offset as well. Offsets stored in Lexicon entries are now starting offsets, not ending offsets.
            • As of v1, this patch removes the UTF/non-UTF differentiation within the Lexicon hierarchy, simplifying the class structure.

            Please note that more index format changes are planned, as part of issues TR-12 & TR-17
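
As a rough illustration of the pointer layout described in the list above (a long and a byte for the start offset, plus an int for the number of records), the following sketch computes starting positions for posting lists. It is not Terrier's BitFilePosition: the Position class, the startPositions helper, and the assumption that posting lists are written back to back with no gaps are all hypothetical.

```java
/** Illustrative sketch only: not Terrier's actual BitFilePosition. */
final class BitFilePositionSketch {

    /** A start position (whole bytes plus remaining bits) and a record count. */
    static final class Position {
        final long bytes;  // start offset, whole bytes
        final byte bits;   // start offset, remaining bits (0..7)
        final int entries; // number of records in the posting list

        Position(long bytes, byte bits, int entries) {
            this.bytes = bytes;
            this.bits = bits;
            this.entries = entries;
        }
    }

    /**
     * Assigns starting positions to posting lists assumed to be written back to back.
     * lengthsInBits[i] is the compressed size of list i; counts[i] is its record count.
     */
    static Position[] startPositions(long[] lengthsInBits, int[] counts) {
        Position[] positions = new Position[lengthsInBits.length];
        long cursor = 0; // running offset in bits from the start of the file
        for (int i = 0; i < lengthsInBits.length; i++) {
            positions[i] = new Position(cursor / 8, (byte) (cursor % 8), counts[i]);
            cursor += lengthsInBits[i];
        }
        return positions;
    }

    public static void main(String[] args) {
        Position[] positions = startPositions(new long[] {13, 20, 7}, new int[] {2, 5, 1});
        for (Position p : positions) {
            System.out.println(p.bytes + " bytes + " + p.bits + " bits, " + p.entries + " records");
        }
    }
}
```

Under the back-to-back assumption, the end of list i coincides with the start of list i+1, so a reader that decodes a known number of records needs only the starting offset.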

            craigm Craig Macdonald added a comment -

            Updated patch (v2), addressing offline comments from Ben:

            • Fixed lexicon merging for classical indexing, including removing temporary lexicon folders
            • Fixes to allow QueryExpansion class to work (vis-a-vis MatchingQueryTerms being given TermStatistics at Matching time, and MapFile implementing OrderedMap)
            • add() and subtract() methods of LexiconEntry are now specified in the TermStatistics interface.

            I want to question whether TermStatistics is the correct name for the interface describing the statistics of a term in the whole collection. If the Lexicon is used to store information about bi-grams, then TermStatistics may be slightly misleading. However, I can't think of a better name.

            Experiments have shown this improved version to be slightly slower at classical indexing of the WT2G collection: 1795 vs 1740 seconds (direct 1418 vs 1378, inversion 322 vs 287).

            craigm Craig Macdonald added a comment -

            Final version of this patch, taking into account Ben's comments:

            • TermStatistics is renamed EntryStatistics
            • MapFile is now FSOrderedMapFile (FS stands for FixedSize)
            • MapFileLexicon is now FSOMapFileLexicon
            • Some javadoc is improved.
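
For readers wondering why a fixed record size matters for an ordered map file, here is a minimal sketch of the general technique. It is not the FSOrderedMapFile implementation: the record layout (a 16-byte zero-padded ASCII key followed by one int), the class name, and the use of an in-memory ByteBuffer instead of a file are all assumptions chosen for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Illustrative sketch only: the record layout here is an assumption. */
final class FixedSizeOrderedMapSketch {
    static final int KEY_BYTES = 16;               // keys are zero-padded to this width
    static final int RECORD_BYTES = KEY_BYTES + 4; // key followed by one int value

    /** Encodes a key as exactly KEY_BYTES bytes, padding with zeros. */
    static byte[] pad(String key) {
        byte[] raw = key.getBytes(StandardCharsets.UTF_8);
        if (raw.length > KEY_BYTES) {
            throw new IllegalArgumentException("key too long: " + key);
        }
        return Arrays.copyOf(raw, KEY_BYTES);
    }

    /** Writes (key, value) pairs as fixed-size records; keys must already be sorted. */
    static ByteBuffer write(String[] sortedKeys, int[] values) {
        ByteBuffer buffer = ByteBuffer.allocate(sortedKeys.length * RECORD_BYTES);
        for (int i = 0; i < sortedKeys.length; i++) {
            buffer.put(pad(sortedKeys[i]));
            buffer.putInt(values[i]);
        }
        return buffer;
    }

    /** Unsigned byte-wise comparison of two equal-length keys. */
    static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < a.length; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return 0;
    }

    /** Binary search: record i always starts at byte i * RECORD_BYTES. */
    static Integer get(ByteBuffer buffer, String key) {
        byte[] wanted = pad(key);
        byte[] found = new byte[KEY_BYTES];
        int low = 0;
        int high = buffer.capacity() / RECORD_BYTES - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            for (int j = 0; j < KEY_BYTES; j++) {
                found[j] = buffer.get(mid * RECORD_BYTES + j);
            }
            int cmp = compare(found, wanted);
            if (cmp == 0) {
                return buffer.getInt(mid * RECORD_BYTES + KEY_BYTES);
            }
            if (cmp < 0) {
                low = mid + 1;
            } else {
                high = mid - 1;
            }
        }
        return null; // key not present
    }

    public static void main(String[] args) {
        ByteBuffer map = write(new String[] {"apple", "lexicon", "terrier"}, new int[] {1, 2, 3});
        System.out.println("lexicon -> " + get(map, "lexicon"));
        System.out.println("missing -> " + get(map, "missing"));
    }
}
```

Because every record has the same length, the position of the i-th record is a simple multiplication, so a lookup can jump straight to the middle record without a separate offset table.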
            ben Ben He added a comment -

            +1 It looks good to me and tests ok.

            craigm Craig Macdonald added a comment -

            Patch committed! Thanks for your detailed feedback and testing on this patch Ben.


              People

              • Assignee:
                craigm Craig Macdonald
                Reporter:
                craigm Craig Macdonald
              • Watchers:
                3
