Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.5
    • Fix Version/s: 3.6
    • Component/s: None
    • Labels:
      None

      Description

      Block compressed gzip files can be used to save space on disk but are not supported by Terrier.

      Included is a patch [on 3.5] (using some classes from samtools, MIT license) that enables block compressed gzip support in WARC018Collection.
      http://samtools.sourceforge.net/

      Block-gzip support is enabled automatically when the file extension is "bgz".
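The extension-based switch described above might look roughly like the following sketch. The class `CompressionByExtension`, the `Compression` enum, and the sample file names are hypothetical and are not taken from the patch or from Terrier; the only behaviour drawn from the ticket is that a "bgz" extension selects block-gzip handling. Lower-casing the name first is an assumption made here so that ".BGZ" is treated the same way.

```java
import java.util.Locale;

// Hypothetical helper, not part of Terrier: maps a file name to a compression scheme.
final class CompressionByExtension {

    enum Compression { NONE, GZIP, BLOCK_GZIP }

    // Decide how a file should be opened by looking only at its extension.
    static Compression detect(String filename) {
        // Assumption: matching is case-insensitive.
        String name = filename.toLowerCase(Locale.ROOT);
        if (name.endsWith(".bgz")) {
            return Compression.BLOCK_GZIP;
        }
        if (name.endsWith(".gz")) {
            return Compression.GZIP;
        }
        return Compression.NONE;
    }

    public static void main(String[] args) {
        String[] samples = {"crawl-00.warc.bgz", "crawl-00.warc.gz", "crawl-00.warc"};
        for (String f : samples) {
            System.out.println(f + " -> " + detect(f));
        }
    }
}
```

In a real integration the returned value would select the matching stream wrapper; the sketch stops at detection so that it runs without the samtools jar.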

      Benjamin

        Attachments

          Activity

          craigm Craig Macdonald added a comment -

          Thanks Benjamin. From a quick scan of the patch, I would probably integrate this into the utility.Files class, where .gz, .bz2, etc. are handled.

          Will also see how the license issue can be addressed.

          Cheers

          Craig

          craigm Craig Macdonald added a comment -

          Tagging for 3.6. Should be able to get this in. Benjamin, can you provide a sample .bgz file for testing purposes?

          bpiwowar Benjamin Piwowarski added a comment -

          A sample file for block gzip (compressed with samtools razip utility)

          bpiwowar Benjamin Piwowarski added a comment (edited) -

          By the way, when using Hadoop, the WARC018Collection constructor takes an InputStream directly, so I added a configuration property to enable block compression in that case:

          WARC018Collection.java
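              public WARC018Collection(InputStream input)
              {
                  // No file name is available here, so the compression mode comes from a property (default "none").
                  boolean isBlockCompressed = "block".equals(ApplicationSetup.getProperty("warc018collection.compression","none"));
                  // Wrap the stream in the samtools block-gzip reader only when the property is set to "block".
                  is = isBlockCompressed ? new BlockCompressedInputStream(input) : input;
                  loadDocumentClass();
              }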
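To try the property-driven switch outside Terrier, a self-contained sketch along these lines may help. It keeps the property name and the "block"/"none" values from the snippet above, but everything else is an assumption: `java.util.Properties` stands in for `ApplicationSetup`, `java.util.zip.GZIPInputStream` stands in for the samtools `BlockCompressedInputStream`, and the class `PropertyDrivenStream` with its helper methods is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical stand-alone sketch; not the Terrier implementation.
final class PropertyDrivenStream {

    static final String KEY = "warc018collection.compression";

    // Wrap the raw stream in a decompressor only when the property is "block".
    static InputStream wrap(InputStream raw, Properties props) throws IOException {
        boolean compressed = "block".equals(props.getProperty(KEY, "none"));
        // Stand-in: the patch constructs a BlockCompressedInputStream here instead.
        return compressed ? new GZIPInputStream(raw) : raw;
    }

    // Helper for the demo: gzip a string into a byte array.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    // Helper for the demo: drain a stream into a UTF-8 string.
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        props.setProperty(KEY, "block");
        InputStream in = wrap(new ByteArrayInputStream(gzip("hello world")), props);
        System.out.println(readAll(in));
    }
}
```

With the property unset, `wrap` returns the input stream untouched, which mirrors the "none" default in the constructor above.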
          
          richardm Richard McCreadie added a comment -

          Fixed in commit 3744.

          • Added sam-1.108.jar to lib.
          • Modified the Files class to support files with the .BGZ and .bgz extensions, using BlockCompressed<Input/Output>Stream from the above jar. Note that BlockCompressedOutputStream requires a special-case constructor (outputstream, file), with file set to null, rather than the (outputstream) constructor of normal stream classes.
          • Added a test case to TestFiles: testReadBGZ().
          • Added the BGZ-compressed file helloworld.txt.bgz to share/tests/files/


            People

            • Assignee:
              craigm Craig Macdonald
              Reporter:
              bpiwowar Benjamin Piwowarski
            • Watchers:
              0

              Dates

              • Created:
                Updated:
                Resolved: