[TR-311] New integer compression techniques for the direct and inverted index structures Created: 23/Jun/13  Updated: 16/Jun/14  Resolved: 21/May/14

Status: Resolved
Project: Terrier Core
Component/s: None
Affects Version/s: 3.6
Fix Version/s: 4.0

Type: New Feature Priority: Major
Reporter: Matteo Catena Assignee: Craig Macdonald
Resolution: Fixed  
Labels: None

Attachments: Text File IntegerCodecCompressionConfiguration.java     Java Archive File JavaFastPFOR_Terrier.jar     File matteo_compression.jar     Java Archive File matteo_compression.jar     Java Archive File matteo_compression_test.jar     Java Source File PostingTestUtils.java    
Issue Links:
Block
is blocked by TR-303 Make compression pluggable/selectable... Resolved
Related
is related to TR-303 Make compression pluggable/selectable... Resolved
is related to TR-290 Integer compression properties should... Resolved

 Description   
Attached, the files for enabling modern integer compression techniques for the inverted index in Terrier.

Files:
matteo_compression.jar: the code
matteo_compression_test.jar: unit testing
JavaFastPFOR_Terrier.jar: MODIFIED JavaFastPFOR library, contains also Kamikaze. Add this to the build path.

Required modifications to the rest of the code:

1) In org.terrier.compression, make BitInBase public
2) In (tes) org.terrier.tests.ShakespeareEndToEndTest, use always PostingIndex and PostingIndexInputStream instead of InvertedIndex and InvertedIndexInputStream
3) Replace PostingTestUtils with the attached file (it contains some extra methods)

The main entry point for this library may be the InvertedIndexRecompresser utility, which recompress a classical inverted index file using modern integer techinques specified via a configuration file. Read the javadoc documentation to learn about the usage.

 Comments   
Comment by Craig Macdonald [ 25/Apr/14 ]

Matteo,

This is the corresponding proposed CompressionConfiguration class for your work on compression. I haven't finished the bit about the Properties object that needs to be fed to the compression codec, I think you 'll need to provide me with more details about this bit.

Thanks

Craig

Comment by Matteo Catena [ 28/Apr/14 ]

From IntegerCodingPostingIndex.java:
/**
This implementation of

{@link PostingIndex}

provides access to an index of

{@link IntegerCodingIterablePosting}

s.
This expects in the data.properties file some information, such as

index.structureName.fields.count=the number of fields in the postings, if any
index.structureName.blocks=0 (no blocks) or 1 (positions) or >1 (blocks of any size)
index.structureName.compression.chunk-size=the maximum number of posting in a chunk
index.structureName.compression.ids.factory-class=the

{@link IntegerCodecFactory} implementation to use to get an {@link IntegerCodec} for docIds
index.structureName.compression.tfs.factory-class=the {@link IntegerCodecFactory}

implementation to use to get an

{@link IntegerCodec} for tfs
index.structureName.compression.fields.factory-class=the {@link IntegerCodecFactory} implementation to use to get an {@link IntegerCodec}

for fields (optional)
index.structureName.compression.blocks.factory-class=the

{@link IntegerCodecFactory}

implementation to use to get an

{@link IntegerCodec}

for blocks (optional)
Note: depending on the factory-class property, other properties may be needed.
//

VIntCodecFactory, GammaCodecFactory, etc. should be fine with just that. JavaFastPFOR codecs requires some more properties. The most general configuration is as in LemireCodecFactory.java:
/**
This factory builds IntegerCodecs which wrap IntegerCODEC from the JavaFastPFOR package.
This works reading properties from the data.properties file.
In particular:
prefix+.integercodec-class (the choosen codec);
prefix+integercodec.primary-class (the primary codec, IF integercodec-class is a Composition);
prefix+integercodec.secondary-class (the fallback codec, IF integercodec-class is a Composition);
The available integercodec-class are
me.lemire.integercompression.Composition
me.lemire.integercompression.Simple16 (doesn't need to specify anything primary and secondary)
me.lemire.integercompression.VariableByte (doesn't need to specify anything primary and secondary)

The available integercodec.primary-class are
me.lemire.integercompression.Kamikaze
me.lemire.integercompression.NewPFD
me.lemire.integercompression.OptPFD
me.lemire.integercompression.FastPFOR
me.lemire.integercompression.BinaryPacking

The available integercodec.secondary-class are
me.lemire.integercompression.Simple16
me.lemire.integercompression.VariableByte

For a simpler configuration, look at

{@link SimpleLemireCodecFactoryS16}

or

{@link SimpleLemireCodecFactoryVB}

//

Comment by Craig Macdonald [ 29/Apr/14 ]

So it looks to me that the Index (or its properties object) must be passed to the CompressionConfiguration.

Comment by Matteo Catena [ 30/Apr/14 ]

> So it looks to me that the Index (or its properties object) must be passed to the CompressionConfiguration.
I think the advantage in using a property object was to not have to write methods with dozens of parameters. Also, number of parameters are variable depending on the chosen codec.
There shouldn't be much differences moving the conf. strings from the index prop.obj. to the ApplicationSetup prop.obj.
On the other hand, the chosen compression is a characteristic of an index. When you experiment with with different index you'll have different data.properties but probably just one terrier.properties.

> On IntegerCodecCompressionConfiguration:
@Override
public Class<? extends IterablePosting> getPostingIteratorClass() {
return blocks
? fieldCount > 0
? BlockFieldIntegerCodingIterablePosting.class
: BlockIntegerCodingIterablePosting.class
: fieldCount > 0
? FieldIntegerCodingIterablePosting.class
: BasicIntegerCodingIterablePosting.class;
}

I'm fine with that, but mind I haven't written Block/Field/BlockField/BasicIntegerCodingIterablePosting. There is just IntegerCodingIterablePosting class doing all the things.
(Maybe we can save some if-then-elses splitting it).

Comment by Craig Macdonald [ 30/Apr/14 ]

On the latter point, I did the splits already.

Comment by Craig Macdonald [ 30/Apr/14 ]

On the earlier point, let me clarify. There are two configuration issues:

  • telling the ConfigurationConfig object how to configure indexing: user, typically via ApplicationSetup, and should save the configuration in index properties (c.f data.properties), c.f. next point.
  • telling the IntegerCodingPostingIndex what classes to load for decompression, which should be done via the index properties (c.f data.properties).

I'm referring to the fact that ConfigurationConfig should probably obtain the index object to store the compression configuration.

Comment by Matteo Catena [ 30/Apr/14 ]

Yes, I think index obj is needed to store compression configuration on data.properties.
I'd say that this should read from ApplicationSetup while writing an index (+ accessing index properties to save the configuration); read from data.properties while reading and index.

Comment by Matteo Catena [ 30/Apr/14 ]

On a different point:
We are using JavaFastPFOR_Terrier.jar, which is a modified version of JavaFastPFOR.
I wonder if we should use instead the official version of github.

The modifications were:
– Kamikaze's PFORDelta is part of the package (maybe we should use kamikaze official lib as well).
– Original behaviour:
Codecs compress an array of integers and store the #byte used for compression, array length in the same output.
Modified behaviour:
Codecs don't store the #byte used original array length in their output. This would be redundant, since Terrier knows a posting list length and the chunks size.
Also, my IntegerCodec has to store the #bytes used in the posting list, BEFORE the codec output (this information is used also for skipping).
While this permitted to save bytes avoiding redundant information, I wonder if we should use the official package for the sake of "standardization"

Also, JavaFastPFOR received a lot of updates since my visit in Glasgow, and other codecs have been introduced.
If signatures are unchanged, we get these other codecs "for free".

Moreover, I recently spotted some other interesting codec implementations:
https://github.com/stuhood/gvi (group varint implementation. group varint is what google said to use in DeanWSDM'09)
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-core/4.5.0/org/apache/lucene/util/packed/EliasFanoDocIdSet.java
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-core/4.5.0/org/apache/lucene/util/packed/EliasFanoEncoder.java
These classes implement VignaWSDM'13 quasi-succint-index codec (but honestly, I'm not sure how to use and integrate this).

Comment by Craig Macdonald [ 02/May/14 ]

Ok, attaching matteo_compression.jar, which is the revised version of the Integer compression package.

I have now got this working with my branch Terrier. I am still keen to integrate this in time for Terrier 4.0.

Matteo, I have the following notes/questions:

  • give a look at the TestIntCompressionConfiguration and IntegerCodecCompressionConfiguration to see how things have evolved.
    To instantiate a compressed index, the user would only have to state the following in the terrier.properties file:
    indexing.direct.compression.configuration=org.terrier.matteo.indexing.IntegerCodecCompressionConfiguration
    indexing.inverted.compression.configuration=org.terrier.matteo.indexing.IntegerCodecCompressionConfiguration
    compression.integer.ids.codec.factory=LemireFastPFORVBFactory
    #alternatively:
    #compression.integer.ids.codec.factory=LemireCodecFactory(Composition,FastPFOR,VariableByte)
    compression.integer.tfs.codec.factory=VIntCodecFactory
    
  • I intend to change the global properties to encapsulate the index data structure, so that compression can be configured separately between inverted and direct structures.
  • I now wonder if we need IntegerCodecFactory classes at all, given that we dont inspect the data.properties file anymore. In essence, we could just have subclasses of IntegerCodec instead. However, I can keep the factories if you are still convinced they might be useful in the future, complete with their getCodec(Properties p) and writeProperties() methods.
  • I think InMemoryIterablePosting and IntegerCodingIterablePosting are now unused and will be removed.
  • Which one is FOR (c.f. the recommendation of our paper)
  • Your provided jar files didnt included Kamikaze, but I did find a version in your home directory. I note that it is 3.0.3 is mentioned at http://data.linkedin.com/opensource/kamikaze/releases in having problems. Can we upgrade to 3.0.7? I guess you don't have any existing indices to be concerned with the minor incompatibilities noted on that website.
Comment by Matteo Catena [ 05/May/14 ]

> give a look at the TestIntCompressionConfiguration and IntegerCodecCompressionConfiguration to see how things have evolved.
> To instantiate a compressed index, the user would only have to state the following in the terrier.properties file: [...]

It seems good to me.

> I intend to change the global properties to encapsulate the index data structure, so that compression can be configured separately between inverted and direct structures.

I'm not sure I got the part about global properties, but sure different compressions for inverted and direct indices sounds good.

> I now wonder if we need IntegerCodecFactory classes at all, given that we dont inspect the data.properties file anymore. In essence, we could just have subclasses of IntegerCodec instead. However, I can keep the factories if you are still convinced they might be useful in the future, complete with their getCodec(Properties p) and writeProperties() methods.

I'm perfectly ok in just having subclasses of IntegerCodec. As long as new codecs can be integrated easily.

> Which one is FOR (c.f. the recommendation of our paper)

It is called BinaryPacking in JavaFastPFOR

> Your provided jar files didnt included Kamikaze, but I did find a version in your home directory. I note that it is 3.0.3 is mentioned at http://data.linkedin.com/opensource/kamikaze/releases in having problems. Can we upgrade to 3.0.7? I guess you don't have any existing indices to be concerned with the minor incompatibilities noted on that website.

I think that issue is related to Lucene indices. I guess we can update to 3.0.7 w/o problems

Comment by Craig Macdonald [ 05/May/14 ]

Ok. I made the proposed changes, which reduced the number of classes significantly. This means that we can target an index compressed by FOR using the following terrier.properties:

compression.inverted.integer.ids.codec=LemireFORVBCodec
compression.inverted.integer.tfs.codec=LemireFORVBCodec
compression.inverted.integer.fields.codec=LemireFORVBCodec
compression.inverted.integer.blocks.codec=LemireFORVBCodec
indexing.compression.configuration=IntegerCodecCompressionConfiguration
compression.integer.chunk.size=1024

Looking at these, we need a bit of uniformity in the property names, but all else looks OK.

Comment by Craig Macdonald [ 06/May/14 ]

BitInBase committed in r3792

Comment by Craig Macdonald [ 21/May/14 ]

Revised title.

Comment by Craig Macdonald [ 21/May/14 ]

At last, committed r3839. TREC-387 should be fixed, and a top-level documentation file is required. Thanks for your hard efforts Matteo!

Comment by Matteo Catena [ 23/May/14 ]

Well done, guys!
Classes and configurations changed a bit, so I don't know if I can be useful. But please let me know if you need any help with the documentation.

Generated at Fri Aug 23 12:26:04 BST 2019 using JIRA 7.1.1#71004-sha1:d6b2c0d9b7051e9fb5e4eb8ce177ca56d91d7bd8.