Name | Description | Codec Class name (in org.terrier.compression.integer.codec) |
VInt | Hadoop's Variable byte [1] implementation | VIntCodec |
Simple16 | JavaFastPFOR's Simple16 [2,3] implementation | LemireSimple16Codec |
Frame-of-Reference (FOR) | JavaFastPFOR's Frame-of-Reference [4] implementation | LemireFORVBCodec |
NewPFD | JavaFastPFOR's NewPFD [5] implementation | LemireNewPFDVBCodec |
OptPFD | JavaFastPFOR's OptPFD [5] implementation | LemireOptPFDVBCodec |
FastPFOR | JavaFastPFOR's FastPFOR [6] implementation - NB: A larger chunk-size is recommended for this codec. | LemireFastPFORVBCodec |
PForDelta | LinkedIn's Kamikaze PForDelta [3,5] implementation | KamikazePForDeltaVBCodec |
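To give a feel for what the simplest codec in the table does, here is a minimal sketch of classic variable-byte coding (7 data bits per byte, with the high bit flagging that more bytes follow). It is an illustration only: the class and method names are hypothetical, and the byte layout is not necessarily the one used by Hadoop's VInt or by VIntCodec.

```java
import java.io.ByteArrayOutputStream;

public class VariableByteSketch {

    /** Encodes a non-negative int, least-significant 7-bit group first. */
    public static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            // More than 7 bits remain: emit the low 7 bits with the continuation bit set.
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value); // final byte, continuation bit clear
        return out.toByteArray();
    }

    /** Decodes a single value produced by encode(). */
    public static int decode(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break; // continuation bit clear: this was the last byte
            }
            shift += 7;
        }
        return value;
    }

    public static void main(String[] args) {
        int[] samples = {3, 130, 70000};
        for (int v : samples) {
            byte[] enc = encode(v);
            System.out.println(v + " -> " + enc.length + " byte(s), decoded " + decode(enc));
        }
    }
}
```

Small values take a single byte, which is why posting lists are usually stored as gaps between consecutive identifiers before a codec like this is applied.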
indexing.direct.compression.configuration=org.terrier.structures.indexing.IntegerCodecCompressionConfiguration
index.direct.compression.integer.chunk.size=1024
index.direct.compression.integer.ids.codec=LemireNewPFDVBCodec
index.direct.compression.integer.tfs.codec=LemireNewPFDVBCodec
indexing.inverted.compression.configuration=org.terrier.structures.indexing.IntegerCodecCompressionConfiguration
index.inverted.compression.integer.chunk.size=1024
index.inverted.compression.integer.ids.codec=LemireNewPFDVBCodec
index.inverted.compression.integer.tfs.codec=LemireNewPFDVBCodec
index.inverted.compression.integer.fields.codec=LemireNewPFDVBCodec
index.inverted.compression.integer.blocks.codec=LemireNewPFDVBCodec

You can also plug a new compression scheme into Terrier by implementing your own CompressionConfiguration. If the IntegerCodec abstraction meets your requirements, you can instead implement a new IntegerCodec and use it directly with IntegerCodecCompressionConfiguration.

List of properties for indexing:
Name | Description | Values |
indexing.inverted.compression.configuration indexing.direct.compression.configuration | The class that defines the compression configuration to be used on the inverted (direct) index at indexing time. Only classical indexing supports pluggable compression. | org.terrier.structures.indexing.CompressionFactory$BitCompressionConfiguration (default); org.terrier.structures.indexing.IntegerCodecCompressionConfiguration |
index.inverted.compression.integer.chunk.size index.direct.compression.integer.chunk.size | Number of postings to be compressed at a time (used only w/ IntegerCodecCompressionConfiguration) | integer (default: 1024) |
index.inverted.compression.integer.ids.codec index.direct.compression.integer.ids.codec | The codec to be used to compress document identifiers in the inverted index (used only w/ IntegerCodecCompressionConfiguration). For the direct index, the codec to be used for the term identifiers. | See codecs table |
index.inverted.compression.integer.tfs.codec index.direct.compression.integer.tfs.codec | The codec to be used to compress term frequencies in the inverted (direct) index (used only w/ IntegerCodecCompressionConfiguration) | " |
index.inverted.compression.integer.fields.codec index.direct.compression.integer.fields.codec | The codec to be used to compress field frequencies in the inverted (direct) index (used only w/ IntegerCodecCompressionConfiguration, optional) | " |
index.inverted.compression.integer.blocks.codec index.direct.compression.integer.blocks.codec | The codec to be used to compress term positions in the inverted (direct) index (used only w/ IntegerCodecCompressionConfiguration, optional) | " |
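The property names in the table follow a regular pattern: a structure name (inverted or direct) followed by a posting stream (ids, tfs, fields or blocks). The sketch below reads such settings with plain java.util.Properties, applying the documented default chunk size of 1024. It is only an illustration of the naming scheme: the class and helper names are hypothetical, and it does not use Terrier's own property-loading mechanism.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class CompressionPropertiesSketch {

    /** Chunk size for "inverted" or "direct"; falls back to the documented default of 1024. */
    public static int chunkSize(Properties props, String structure) {
        String key = "index." + structure + ".compression.integer.chunk.size";
        return Integer.parseInt(props.getProperty(key, "1024").trim());
    }

    /** Codec for one posting stream ("ids", "tfs", "fields" or "blocks"); null if unset. */
    public static String codec(Properties props, String structure, String stream) {
        return props.getProperty("index." + structure + ".compression.integer." + stream + ".codec");
    }

    /** Parses properties from a string, as they would appear in a properties file. */
    public static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = parse(
              "index.inverted.compression.integer.ids.codec=LemireNewPFDVBCodec\n"
            + "index.inverted.compression.integer.tfs.codec=LemireNewPFDVBCodec\n");
        System.out.println("chunk size:   " + chunkSize(props, "inverted"));
        System.out.println("ids codec:    " + codec(props, "inverted", "ids"));
        // The fields and blocks codecs are optional, so they may be absent.
        System.out.println("blocks codec: " + codec(props, "inverted", "blocks"));
    }
}
```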
indexing.tmp-inverted.compression.configuration=org.terrier.structures.indexing.IntegerCodecCompressionConfiguration
index.tmp-inverted.compression.integer.chunk.size=1024
index.tmp-inverted.compression.integer.ids.codec=LemireOptPFDVBCodec
index.tmp-inverted.compression.integer.tfs.codec=LemireOptPFDVBCodec
index.tmp-inverted.compression.integer.fields.codec=LemireOptPFDVBCodec
index.tmp-inverted.compression.integer.blocks.codec=LemireOptPFDVBCodec

Note that InvertedIndexRecompresser overwrites the original inverted index with the re-compressed one. Make sure you have a backup copy of the inverted index before running InvertedIndexRecompresser.

Different codecs have different effects on index size and query response time. When storage space is a concern, we suggest using Terrier's default compression configuration (Simple16 and OptPFD are also options). When the inverted index fits in main memory, the best practices derived in [7] recommend using the FOR codec to reduce query response time, as follows:
index.direct.compression.integer.chunk.size=1024
indexing.direct.compression.configuration=org.terrier.structures.indexing.IntegerCodecCompressionConfiguration
compression.direct.integer.ids.codec=LemireFORVBCodec
compression.direct.integer.tfs.codec=LemireFORVBCodec
indexing.inverted.compression.configuration=org.terrier.structures.indexing.IntegerCodecCompressionConfiguration
compression.inverted.integer.ids.codec=LemireFORVBCodec
compression.inverted.integer.tfs.codec=LemireFORVBCodec
compression.inverted.integer.fields.codec=LemireFORVBCodec
compression.inverted.integer.blocks.codec=LemireFORVBCodec
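As background on why FOR decodes quickly, the following is a minimal sketch of the frame-of-reference idea: every value in a chunk is stored as a fixed-width offset from the chunk minimum, so decoding is a uniform shift-and-mask loop with no per-value branching on length. The class and method names are hypothetical, the sketch assumes non-negative integers, and it is not the JavaFastPFOR implementation behind LemireFORVBCodec.

```java
import java.util.Arrays;

public class FrameOfReferenceSketch {

    /** Smallest value in the chunk; used as the reference. */
    public static int min(int[] chunk) {
        int m = Integer.MAX_VALUE;
        for (int v : chunk) m = Math.min(m, v);
        return m;
    }

    /** Bits needed to store (value - min) for every value in the chunk. */
    public static int bitsNeeded(int[] chunk, int min) {
        int maxOffset = 0;
        for (int v : chunk) maxOffset = Math.max(maxOffset, v - min);
        return 32 - Integer.numberOfLeadingZeros(maxOffset); // 0 when all values are equal
    }

    /** Packs each offset from min into exactly `bits` bits, back to back. */
    public static long[] pack(int[] chunk, int min, int bits) {
        long[] words = new long[(chunk.length * bits + 63) / 64];
        if (bits == 0) return words; // nothing to store beyond the reference
        for (int i = 0; i < chunk.length; i++) {
            long offset = chunk[i] - min;
            int bitPos = i * bits;
            int word = bitPos >>> 6;
            int shift = bitPos & 63;
            words[word] |= offset << shift;
            if (shift + bits > 64) {
                // The value straddles two words: spill the high bits into the next one.
                words[word + 1] |= offset >>> (64 - shift);
            }
        }
        return words;
    }

    /** Inverse of pack(): every value is read with the same shift-and-mask step. */
    public static int[] unpack(long[] words, int count, int min, int bits) {
        int[] out = new int[count];
        if (bits == 0) {
            Arrays.fill(out, min);
            return out;
        }
        long mask = (1L << bits) - 1;
        for (int i = 0; i < count; i++) {
            int bitPos = i * bits;
            int word = bitPos >>> 6;
            int shift = bitPos & 63;
            long v = words[word] >>> shift;
            if (shift + bits > 64) {
                v |= words[word + 1] << (64 - shift);
            }
            out[i] = (int) (v & mask) + min;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] chunk = {100, 103, 101, 107};
        int reference = min(chunk);
        int bits = bitsNeeded(chunk, reference);
        long[] packed = pack(chunk, reference, bits);
        System.out.println("bits per value: " + bits + ", words used: " + packed.length);
        System.out.println(Arrays.toString(unpack(packed, chunk.length, reference, bits)));
    }
}
```

The trade-off is that a single large value in a chunk widens every slot, which is one reason FOR tends to favour speed over index size.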
[7] Catena, M., Macdonald, C., Ounis, I.: On Inverted Index Compression for Search Engine Efficiency. In: Proceedings of ECIR 2014.
Copyright © 2014 University of Glasgow | All Rights Reserved