[TR-199] Block compression support Created: 29/May/12  Updated: 05/Mar/14  Resolved: 05/Mar/14

Status: Resolved
Project: Terrier Core
Component/s: None
Affects Version/s: 3.5
Fix Version/s: 3.6

Type: Improvement Priority: Minor
Reporter: Benjamin Piwowarski Assignee: Craig Macdonald
Resolution: Fixed  
Labels: None

Attachments: Text File 0001-Block-compressed-WARC-support.patch     File sample.warc.bgz    

 Description   
Block compressed gzip files can be used to save space on disk but are not supported by Terrier.

Included is a patch [on 3.5] (using some classes from samtools, MIT license) that enables block compressed gzip support in WARC018Collection.
http://samtools.sourceforge.net/

Block-gzip is automatically enabled when the file extension is "bgz".

Benjamin

 Comments   
Comment by Craig Macdonald [ 29/May/12 ]

Thanks Benjamin. From a quick scan of the patch, I would probably integrate this into the utility.Files class, where .gz, .bz2, etc. are handled.

Will also see how the license issue can be addressed.

Cheers

Craig

Comment by Craig Macdonald [ 29/May/12 ]

Tagging for 3.6. Should be able to get this in. Benjamin, can you provide a sample .bgz file for testing purposes?

Comment by Benjamin Piwowarski [ 30/May/12 ]

Attached is a sample file for block gzip (compressed with the samtools razip utility).

Comment by Benjamin Piwowarski [ 05/Jun/12 ]

By the way, when using Hadoop, the constructor of WARC018Collection takes an InputStream directly, so I added a configuration property in order to enable block compression in that case:

WARC018Collection.java
    public WARC018Collection(InputStream input)
    {
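        // Block compression is opt-in: set warc018collection.compression=block (the default is "none")
        boolean isBlockCompressed = "block".equals(ApplicationSetup.getProperty("warc018collection.compression","none"));
        // Wrap the raw stream only when block compression was requested
        is = isBlockCompressed ? new BlockCompressedInputStream(input) : input;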
        loadDocumentClass();
    }

Comment by Richard McCreadie [ 05/Mar/14 ]

Fixed in commit 3744.

Added sam-1.108.jar to lib
Files class modified to support files with the .BGZ and .bgz extensions, using BlockCompressedInputStream and BlockCompressedOutputStream from the above jar. Note that BlockCompressedOutputStream needs the two-argument constructor (outputstream, file), with file set to null, rather than the single-argument (outputstream) form used by normal stream classes.
Test case added to TestFiles: testReadBGZ()
BGZ compressed file helloworld.txt.bgz added to share/tests/files/
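
For illustration only, here is a minimal sketch of the extension-based dispatch described above. It is not the actual Files code from commit 3744. The class ExtensionStreams and the methods openByExtension, gzipMembers and readAll are hypothetical names. To stay self-contained it uses only java.util.zip: GZIPInputStream stands in for BlockCompressedInputStream, on the assumption that a block-compressed file is a sequence of concatenated gzip members that a standard gzip reader can decode sequentially (without the random access the sam classes provide).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ExtensionStreams {

    /** Hypothetical helper: picks a decompressing wrapper from the file extension. */
    public static InputStream openByExtension(String filename, InputStream raw) throws IOException {
        String lower = filename.toLowerCase(Locale.ROOT); // accept both .bgz and .BGZ
        if (lower.endsWith(".gz") || lower.endsWith(".bgz")) {
            // Stand-in (assumption): the real fix uses BlockCompressedInputStream for .bgz.
            return new GZIPInputStream(raw);
        }
        return raw; // unknown extension: treat as uncompressed
    }

    /** Compresses each chunk as its own gzip member and concatenates the members. */
    public static byte[] gzipMembers(String... chunks) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String chunk : chunks) {
            // Closing a ByteArrayOutputStream is a no-op, so it can be reused per member.
            try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
                gz.write(chunk.getBytes(StandardCharsets.UTF_8));
            }
        }
        return out.toByteArray();
    }

    /** Reads a stream to the end and decodes it as UTF-8. */
    public static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] tmp = new byte[4096];
        int n;
        while ((n = in.read(tmp)) != -1) {
            buf.write(tmp, 0, n);
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] blocks = gzipMembers("hello ", "world");
        try (InputStream in = openByExtension("helloworld.txt.bgz", new ByteArrayInputStream(blocks))) {
            System.out.println(readAll(in)); // prints "hello world"
        }
    }
}
```

Keeping the dispatch in one helper means collection classes such as WARC018Collection do not need their own extension checks, which is the motivation given earlier in this thread for moving the support into the Files class.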
