  1. Terrier Core
  2. TR-311

New integer compression techniques for the direct and inverted index structures

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.6
    • Fix Version/s: 4.0
    • Component/s: None
    • Labels:
      None

      Description

      Attached are the files that enable modern integer compression techniques for the inverted index in Terrier.

      Files:
      matteo_compression.jar: the code
      matteo_compression_test.jar: unit testing
      JavaFastPFOR_Terrier.jar: a MODIFIED JavaFastPFOR library, which also contains Kamikaze. Add this to the build path.

      Required modifications to the rest of the code:

      1) In org.terrier.compression, make BitInBase public
      2) In the test class org.terrier.tests.ShakespeareEndToEndTest, always use PostingIndex and PostingIndexInputStream instead of InvertedIndex and InvertedIndexInputStream
      3) Replace PostingTestUtils with the attached file (it contains some extra methods)

      The main entry point for this library is probably the InvertedIndexRecompresser utility, which recompresses a classical inverted index file using the modern integer compression techniques specified in a configuration file. See the Javadoc for usage details.

            Activity

            craigm Craig Macdonald added a comment -

            Ok. I made the proposed changes, which reduced the number of classes significantly. This means that we can target an index compressed by FOR using the following terrier.properties:

            compression.inverted.integer.ids.codec=LemireFORVBCodec
            compression.inverted.integer.tfs.codec=LemireFORVBCodec
            compression.inverted.integer.fields.codec=LemireFORVBCodec
            compression.inverted.integer.blocks.codec=LemireFORVBCodec
            indexing.compression.configuration=IntegerCodecCompressionConfiguration
            compression.integer.chunk.size=1024
            

            Looking at these, we need a bit of uniformity in the property names, but all else looks OK.
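
            For readers unfamiliar with FOR-style coding, here is a minimal Java sketch of the general idea: document ids are turned into gaps, and each chunk of gaps is bit-packed with a single fixed width. This is not the code behind LemireFORVBCodec or JavaFastPFOR; the class ForSketch and all of its method names are hypothetical, it uses zero as the reference value, and it handles neither exceptions nor a variable-byte tail. Reading compression.integer.chunk.size as the number of integers coded together is also an assumption.

```java
import java.util.Arrays;

// Hypothetical illustration only: a simplified frame-of-reference style coder.
public class ForSketch {

    /** Turn ascending document ids into gaps (the first gap is the id itself). */
    static int[] toGaps(int[] docIds) {
        int[] gaps = new int[docIds.length];
        int previous = 0;
        for (int i = 0; i < docIds.length; i++) {
            gaps[i] = docIds[i] - previous;
            previous = docIds[i];
        }
        return gaps;
    }

    /** Inverse of toGaps: a running sum restores the document ids. */
    static int[] fromGaps(int[] gaps) {
        int[] ids = new int[gaps.length];
        int running = 0;
        for (int i = 0; i < gaps.length; i++) {
            running += gaps[i];
            ids[i] = running;
        }
        return ids;
    }

    /** Bits needed to represent the largest value in the chunk (at least 1). */
    static int bitWidth(int[] values) {
        int seen = 0;
        for (int v : values) {
            seen |= v; // the OR has the same highest set bit as the maximum
        }
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(seen));
    }

    /** Pack non-negative values into longs, 'width' bits each (width <= 32). */
    static long[] pack(int[] values, int width) {
        long[] out = new long[(values.length * width + 63) / 64];
        for (int i = 0; i < values.length; i++) {
            int bit = i * width;
            int word = bit >>> 6;
            int offset = bit & 63;
            long v = values[i] & 0xFFFFFFFFL;
            out[word] |= v << offset;
            if (offset + width > 64) {
                // The value straddles two words: spill the high bits.
                out[word + 1] |= v >>> (64 - offset);
            }
        }
        return out;
    }

    /** Read back 'count' values of 'width' bits each. */
    static int[] unpack(long[] packed, int width, int count) {
        int[] values = new int[count];
        long mask = (1L << width) - 1;
        for (int i = 0; i < count; i++) {
            int bit = i * width;
            int word = bit >>> 6;
            int offset = bit & 63;
            long v = packed[word] >>> offset;
            if (offset + width > 64) {
                v |= packed[word + 1] << (64 - offset);
            }
            values[i] = (int) (v & mask);
        }
        return values;
    }

    public static void main(String[] args) {
        int[] docIds = {3, 7, 8, 15, 40, 41, 100}; // one small chunk of a posting list
        int[] gaps = toGaps(docIds);
        int width = bitWidth(gaps);
        long[] packed = pack(gaps, width);
        int[] restored = fromGaps(unpack(packed, width, gaps.length));
        System.out.println("bits per gap: " + width);
        System.out.println("round trip ok: " + Arrays.equals(docIds, restored));
    }
}
```

            In a real codec the bit width would be stored alongside each chunk so that the decoder knows how to read it back; the sketch simply passes it as a parameter.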

            craigm Craig Macdonald added a comment -

            BitInBase committed in r3792

            craigm Craig Macdonald added a comment -

            Revised title.

            craigm Craig Macdonald added a comment -

            At last, committed r3839. TREC-387 should be fixed, and a top-level documentation file is required. Thanks for your hard efforts Matteo!

            catena.matteo Matteo Catena added a comment -

            Well done, guys!
            Classes and configurations changed a bit, so I don't know if I can be useful. But please let me know if you need any help with the documentation.


              People

              • Assignee:
                craigm Craig Macdonald
                Reporter:
                catena.matteo Matteo Catena
              • Watchers:
                0

                Dates

                • Created:
                  Updated:
                  Resolved: