Terrier Core / TR-320

Blocks for Integer compression fail for large documents (blocks.max)

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 4.0
    • Fix Version/s: 4.1
    • Component/s: .structures
    • Labels:
      None

      Description

      From Ben He, UCAS

      I am having a problem with the PFD compression which you may want to have a look at.

      The collection I was using is WT10G. According to the documentation at http://terrier.org/docs/v4.0/compression.html, I added the following lines in the .properties file:

      indexing.direct.compression.configuration=org.terrier.structures.integer.IntegerCodecCompressionConfiguration

      index.direct.compression.integer.chunk.size=1024
      index.direct.compression.integer.fields.codec=LemireNewPFDVBCodec
      index.direct.compression.integer.blocks.codec=LemireNewPFDVBCodec
      index.direct.compression.integer.ids.codec=LemireNewPFDVBCodec
      index.direct.compression.integer.tfs.codec=LemireNewPFDVBCodec
      indexing.inverted.compression.configuration=org.terrier.structures.integer.IntegerCodecCompressionConfiguration

      index.inverted.compression.integer.chunk.size=1024
      index.inverted.compression.integer.ids.codec=LemireNewPFDVBCodec
      index.inverted.compression.integer.tfs.codec=LemireNewPFDVBCodec
      index.inverted.compression.integer.fields.codec=LemireNewPFDVBCodec
      index.inverted.compression.integer.blocks.codec=LemireNewPFDVBCodec

      I set the package prefix of the compression configuration class to "org.terrier.structures.integer", because the "org.terrier.structures.indexing" prefix given in the documentation throws a ClassNotFoundException.

      Two-pass indexing worked fine without blocks, but with block indexing enabled it fails while traversing the inverted file, ending with an ArrayIndexOutOfBoundsException.


          Activity

          Craig Macdonald added a comment:

          Megapatch for TR-320 now applied to Terrier 4.1 (commit d056d8fc). Save bfs for all positional indices. A new issue has been created to track the alternative property.

          Craig Macdonald added a comment:

          Copied from emails.

          Matteo,

          With some help from Graham, I looked in detail into the maxBlocks problem. The root of the problem is the semantics of blocks.max: it's a limit on the number of blocks in a document, NOT in a posting.

          Consider the following document:
          tokens: a a b b
          pos: 0 1 2 3

          The corresponding postings are:

          term a: (tf=2, pos=[0,1])
          term b: (tf=2, pos=[2,3])

          Now consider blocks.max=2, i.e. 2 is the highest block number permitted in a document. The postings become:

          term a: (tf=2, pos=[0,1])
          term b: (tf=2, pos=[2])

          So the number of positions recorded within a given posting varies not just with the value of blocks.max, but also with how many of the positions occur below vs. above the threshold.

          This leads me to believe that we need to record the number of positions occurring below the blocks.max setting, and we therefore can't eliminate the bfs compression for blocks.max > 0. Let me know if you agree.

          Alternatively, we could introduce a new configuration option with new semantics: max blocks per posting.

          A challenging bug this one.

          Craig
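
          A minimal sketch of the truncation semantics described above (illustrative Java only, not Terrier source; the class and variable names are assumptions):

              import java.util.*;

              // Illustrative only: a document-level blocks.max cap truncates the
              // positions stored per posting, so a posting's position count cannot
              // be recovered from its tf alone.
              public class BlocksMaxSketch {
                  public static void main(String[] args) {
                      final int blocksMax = 2; // highest block number permitted in a document
                      // document "a a b b": term a at positions {0,1}, term b at {2,3}
                      Map<String, int[]> postings = new LinkedHashMap<>();
                      postings.put("a", new int[]{0, 1});
                      postings.put("b", new int[]{2, 3});
                      for (Map.Entry<String, int[]> e : postings.entrySet()) {
                          int tf = e.getValue().length; // tf is unaffected by the cap
                          int[] kept = Arrays.stream(e.getValue())
                                             .filter(p -> p <= blocksMax)
                                             .toArray();
                          System.out.printf("term %s: tf=%d, stored positions=%s%n",
                                  e.getKey(), tf, Arrays.toString(kept));
                      }
                      // prints: term a: tf=2, stored positions=[0, 1]
                      //         term b: tf=2, stored positions=[2]
                  }
              }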

          From Matteo:

          It was one year ago so I'm not 100% sure, but looking at the patch, it should save the bfs (block frequencies/lengths) before the block arrays.

          The layout for a chunk of 1024 would be:
          id_1 id_2 ... id_1024; tf_1 tf_2 ... tf_1024; bf_1 bf_2 ... bf_1024; (pos_1_1 pos_1_2 ...) (pos_2_1 pos_2_2 ...) ...
          where (pos_1_1 pos_1_2 ...) has length bf_1, (pos_2_1 pos_2_2 ...) has length bf_2, and so on.
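
          A hedged sketch of how a decoder would walk that layout (the method and array names are illustrative, not the actual IntegerCodec API):

              import java.util.Arrays;

              public class ChunkWalkSketch {
                  // Walks one decompressed chunk laid out as ids; tfs; bfs; positions.
                  // bfs[i], not tfs[i], gives the number of stored positions for
                  // posting i, since blocks.max truncation can make bfs[i] < tfs[i].
                  static void walkChunk(int[] ids, int[] tfs, int[] bfs, int[] positions) {
                      int off = 0;
                      for (int i = 0; i < ids.length; i++) {
                          int[] pos = Arrays.copyOfRange(positions, off, off + bfs[i]);
                          off += bfs[i];
                          System.out.printf("id=%d tf=%d bf=%d pos=%s%n",
                                  ids[i], tfs[i], bfs[i], Arrays.toString(pos));
                      }
                  }

                  public static void main(String[] args) {
                      // two postings for one term; the second lost a position to the
                      // blocks.max cap, so bf=1 while tf=2
                      walkChunk(new int[]{3, 7}, new int[]{2, 2},
                                new int[]{2, 1}, new int[]{0, 1, 2});
                  }
              }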

          Craig Macdonald added a comment:

          Matteo,

          I've been looking at this patch. I was thinking about the other route (which you thought was problematic):

          • Record max.blocks in the index (or assume the default value from terrier.properties if not known)
          • Use Math.min(max.blocks,tfs[i]) when decoding for number of blocks.

          Do you still have what you tried stashed somewhere?

          Craig
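
          For concreteness, the proposed rule would be something like the following (hypothetical sketch; as the next comment notes, it breaks when truncation drops only some of a posting's positions):

              // Hypothetical: derive the block count for posting i from max.blocks
              // and tfs[i] instead of storing bfs. For the a/a/b/b example above,
              // this returns 2 for term b although only one position was stored.
              static int storedBlocks(int maxBlocks, int[] tfs, int i) {
                  return Math.min(maxBlocks, tfs[i]);
              }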

          Matteo Catena added a comment:

          Using TF when blocks.size=1 seemed promising (it would save space and decompression operations), but it fails when blocks.max is too low. Workarounds using blocks.max or the document lengths are possible, but do they guarantee to work with existing indices? (I implemented one and experienced problems with an existing GOV2 index.)
          Instead, the proposed patch always writes the blocks' lengths in the inverted index.

          Craig Macdonald added a comment:

          I just never passed the information into the (IntegerCodec)CompressionConfiguration. The relevant method is the constructor at http://terrier.org/docs/current/javadoc/org/terrier/structures/indexing/CompressionFactory.CompressionConfiguration.html#CompressionFactory.CompressionConfiguration(java.lang.String,%20java.lang.String[],%20boolean)

          Craig
          Matteo Catena added a comment (edited):

          Is it possible to write max.blocks to the data.properties file? Having this information in the properties would avoid asking the document index for the document lengths.
          Also, the new IterablePosting implementations can manage no blocks (blocks.size=0), positions (blocks.size=1) and blocks (blocks.size>1). However, it seems to me that IntegerCodecCompressionConfiguration always forces blocks.size=0 or 1; the case where we have blocks with size > 1 is unreachable. Should I remove it? (The code would be slightly cleaner, but we would have to warn users that blocks.size>1 is not allowed with pluggable compression.)
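
          A sketch of the idea using plain java.util.Properties rather than Terrier's own index-property API (the file name and property key here are assumptions):

              import java.io.*;
              import java.util.Properties;

              // Sketch only: persist max.blocks alongside the index so decoders
              // need not consult the document index for document lengths.
              // Assumes data.properties already exists in the working directory.
              public class SaveMaxBlocks {
                  public static void main(String[] args) throws IOException {
                      Properties p = new Properties();
                      try (InputStream in = new FileInputStream("data.properties")) {
                          p.load(in);
                      }
                      p.setProperty("max.blocks", String.valueOf(100000));
                      try (OutputStream out = new FileOutputStream("data.properties")) {
                          p.store(out, "record max.blocks for positional decoding");
                      }
                  }
              }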

          Craig Macdonald added a comment:

          This is an interaction between the blocks.max setting (default 100000) and the compression setup.
          Setting blocks.max to 1000000 resolves the problem for WT10G.
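
          That is, in terrier.properties:

              blocks.max=1000000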

          Craig Macdonald added a comment:

          VIntCodec: Failed with IOException
          LemireSimple16Codec: OK
          LemireFORVBCodec: Failed with IllegalArgumentException
          LemireNewPFDVBCodec: Failed with ArrayIndexOutOfBoundsException (as in a previous email)
          LemireOptPFDVBCodec: Failed with ArrayIndexOutOfBoundsException
          LemireFastPFORVBCodec: Failed with ArrayIndexOutOfBoundsException
          KamikazePForDeltaVBCodec: Failed with ArrayIndexOutOfBoundsException


            People

            • Assignee: Craig Macdonald
            • Reporter: Craig Macdonald
            • Watchers: 0
