[TR-320] Blocks for Integer compression fails for large documents (blocks.max) Created: 19/Sep/14  Updated: 27/Nov/15  Resolved: 27/Nov/15

Status: Resolved
Project: Terrier Core
Component/s: .structures
Affects Version/s: 4.0
Fix Version/s: 4.1

Type: Bug Priority: Blocker
Reporter: Craig Macdonald Assignee: Craig Macdonald
Resolution: Fixed  
Labels: None

Attachments: File structures.integer.patch     File structures.integer.postings.patch    
Issue Links:
Related

 Description   
From Ben He, UCAS

 I am having problem with the PFD compression which you may want to have a look at.

The collection I was using is WT10G. According to documentation at http://terrier.org/docs/v4.0/compression.html , I added the following lines in the .properties file:

indexing.direct.compression.configuration=org.terrier.structures.integer.IntegerCodecCompressionConfiguration

index.direct.compression.integer.chunk.size=1024
index.direct.compression.integer.fields.codec=LemireNewPFDVBCodec
index.direct.compression.integer.blocks.codec=LemireNewPFDVBCodec
index.direct.compression.integer.ids.codec=LemireNewPFDVBCodec
index.direct.compression.integer.tfs.codec=LemireNewPFDVBCodec
indexing.inverted.compression.configuration=org.terrier.structures.integer.IntegerCodecCompressionConfiguration

index.inverted.compression.integer.chunk.size=1024
index.inverted.compression.integer.ids.codec=LemireNewPFDVBCodec
index.inverted.compression.integer.tfs.codec=LemireNewPFDVBCodec
index.inverted.compression.integer.fields.codec=LemireNewPFDVBCodec
index.inverted.compression.integer.blocks.codec=LemireNewPFDVBCodec

I set the package prefix of the compression configuration class to "org.terrier.structures.integer" because "org.terrier.structures.indexing" as in the documentation throws a
ClassNotFound exception.

The two-pass indexing worked fine for non-block indexing, but fails with block indexing enabled while traversing the inverted file. It ended with an ArrayIndexOutOfBoundsException.

 Comments   
Comment by Craig Macdonald [ 19/Sep/14 ]

VIntCodec: Failed with IOException
LemireSimple16Codec: OK
LemireFORVBCodec: Failed with IllegalArgumentException
LemireNewPFDVBCodec: Failed with ArrayIndexOutOfBoundsException (as in
a previous email)
LemireOptPFDVBCodec: Failed with ArrayIndexOutOfBoundsException
LemireFastPFORVBCodec: Failed with ArrayIndexOutOfBoundsException
KamikazePForDeltaVBCodec: Failed with ArrayIndexOutOfBoundsException

Comment by Craig Macdonald [ 19/Sep/14 ]

This is an interaction between the blocks.max setting (default 100000) and the compression setup.
Setting blocks.max to 1000000 resolves the problem for WT10G.

Comment by Matteo Catena [ 22/Sep/14 ]

Is it possible to write max.blocks to the data.properties file? To have this information in the properties will avoid to ask the document index for the document lenghts.
Also, the new iterableposting impls. can manage no blocks (blocks.size=0), positions (blocks.size=1) and blocks (blocks.size>1). However, it seems to me that IntegerCodecCompressionConfiguration always force blocks.size=0 or 1. The case where we have blocks with size > 1 is unreachable. Should I remove it? (code will be slightly cleaner, but we have to warn users that block.size>1 is not allowed with pluggable compression)

Comment by Craig Macdonald [ 22/Sep/14 ]

I just never passed the information into the (IntegerCodec)CompressionConfiguration.

The relevant method is the constructor at

http://terrier.org/docs/current/javadoc/org/terrier/structures/indexing/CompressionFactory.CompressionConfiguration.html#CompressionFactory.CompressionConfiguration(java.lang.String,%20java.lang.String[],%20boolean)

Craig

Comment by Matteo Catena [ 24/Sep/14 ]

Using TF when blocks.size=1 seemed promising (it'd save space and decompression operations). Anyway, it fails when blocks.max is too low. Workarounds which use blocks.max or document lengths are possible, but do these guarantee to work with existing indices? (I implemented it but experienced problem with an existing gov2 index).
Instead, the proposed patch always writes the blocks' lengths in the inverted index.

Comment by Craig Macdonald [ 10/Nov/15 ]

Matteo,

I've been looking at this patch. I was thinking about the other route (which you thought was problematic):

  • Record max.blocks in the index (or assume (default) value from terrier.properties if not known)
  • Use Math.min(max.blocks,tfs[i]) when decoding for number of blocks.

Do you have stashed what you tried?

Craig

Comment by Craig Macdonald [ 27/Nov/15 ]

Copied from emails.

Matteo,

With some help from Graham, I looked in detail into the maxBlocks problem. The root of the problem is the semantics of blocks.max: its a limit on the number of blocks in a document, and NOT in a posting.

Consider the following document:
tokens: a a b b
pos: 0 1 2 3

The corresponding postings are:

term a: (tf=2, pos=[0,1])
term b: (tf=2, pos=[2,3])

Now consider blocks.max=2, i.e. 2 is the highest block number permitted in a document. The postings become:

term a: (tf=2, pos=[0,1])
term b: (tf=2, pos=[2])

So the number of positions recorded within a given posting vary not just with the value of blocks.max, but also with the number of positions that occur below vs above the threshold.

This leads me to believe that we need to record the number of positions occurring below the blocks.max setting, and we therefore cant eliminate the bfs compression for max.blocks > 0. Let me know if you agree.

Or we alternatively introduce a new configuration options, with the new semantics: max blocks per posting.

A challenging bug this one.

Craig

From Matteo:

It was one year ago so i'm not 100% sure, but looking at the patch it should save the bfs (block frequencies/lenghts) before the block arrays.

The layout for a chunk of 1024 would be like
id_1 id_2...id_1024; tf_1 tf_2 ... tf_1024; bf_1 bf_2 ... bf_1024; (pos_1_1 pos_1_2 ...) (pos_2_1, pos_2_2 ...) ...
(p_1_1 p_1_2 ...) has length bf_1
(p_2_1 p_2_2 ...) has length bf_2
and so on

Comment by Craig Macdonald [ 27/Nov/15 ]

Megapatch for TR-320 now applied to Terrier 4.1., d056d8fc. Save bfs for all positional indices. New issue created to track alternative property.

Generated at Mon Dec 11 09:28:05 GMT 2017 using JIRA 7.1.1#71004-sha1:d6b2c0d9b7051e9fb5e4eb8ce177ca56d91d7bd8.