[TR-40] Enable Hadoop-mode Map Output Compression Created: 27/Mar/09  Updated: 05/Mar/10  Resolved: 09/Sep/09

Status: Resolved
Project: Terrier Core
Component/s: .indexing
Affects Version/s: 3.0
Fix Version/s: 3.0

Type: Improvement Priority: Minor
Reporter: Richard McCreadie Assignee: Craig Macdonald
Resolution: Fixed  
Labels: None


 Description   
Hadoop supports the compression of map outputs. Some examination has found that the sequence files of map output that Hadoop moves to the reducer can be halved in size for Terrier MapReduce indexing by applying gzip. This suggests that using Hadoop map output compression may be beneficial. See http://hadoop.apache.org/core/docs/r0.18.3/mapred_tutorial.html#Data+Compression for more details.
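
As a rough illustration of the kind of measurement described above, the following sketch gzips a byte buffer with the standard java.util.zip classes and reports the compressed size as a fraction of the original. The class name GzipRatioSketch and the synthetic term/document lines are assumptions made for illustration only; they are not Terrier's actual map output, so the fraction printed says nothing about the halving reported in this issue.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of Terrier or Hadoop.
public class GzipRatioSketch {

    /** Compresses the given bytes with gzip and returns the compressed bytes. */
    static byte[] gzip(byte[] input) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Closing the stream writes the gzip trailer, so read the buffer afterwards.
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Returns compressed size divided by original size (smaller is better). */
    static double compressedFraction(byte[] input) {
        return (double) gzip(input).length / input.length;
    }

    public static void main(String[] args) {
        // Synthetic stand-in data: repeated "term <tab> doc" lines.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            sb.append("term").append(i % 500).append('\t')
              .append("doc").append(i).append('\n');
        }
        byte[] raw = sb.toString().getBytes(StandardCharsets.UTF_8);
        System.out.printf("raw=%d bytes, gzip=%d bytes, fraction=%.2f%n",
                raw.length, gzip(raw).length, compressedFraction(raw));
    }
}
```

To repeat the measurement on real data, the same helper could be applied to the bytes of an actual map output file.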

In this issue I will report the space and efficiency changes that result from applying various compression settings.

 Comments   
Comment by Richard McCreadie [ 12/May/09 ]

Patch to add map output compression using GZip.
Enable it with the -c argument on the command line.

This also improves how command-line arguments are processed and adds a basic help command (displayed by placing the string help, case-insensitive, anywhere in the command line).
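
A minimal sketch of the argument handling described here, not the code from the patch itself. The class and method names (ArgsSketch, wantsHelp, wantsCompression) are hypothetical, and it assumes that "help" appears as a separate argument rather than as part of another one.

```java
import java.util.Arrays;

// Hypothetical illustration; the patch's real parsing code is not shown in this issue.
public class ArgsSketch {

    /** True if any argument equals "help", ignoring case. */
    static boolean wantsHelp(String[] args) {
        return Arrays.stream(args).anyMatch(a -> a.equalsIgnoreCase("help"));
    }

    /** True if the -c flag (map output compression) is present. */
    static boolean wantsCompression(String[] args) {
        return Arrays.asList(args).contains("-c");
    }

    public static void main(String[] args) {
        if (wantsHelp(args)) {
            System.out.println("Options: -c  enable map output compression");
            return;
        }
        System.out.println("Compression requested: " + wantsCompression(args));
    }
}
```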

Comment by Craig Macdonald [ 12/May/09 ]

If experimentation shows that map output compression is beneficial to efficiency, then I would be inclined to leave it on all the time, rather than adding a command-line option or a Terrier property.

Comment by Richard McCreadie [ 26/May/09 ]

Bug found in the patch: conf.setMapOutputCompressorClass(GzipCodec.class); causes a null pointer exception during map output, even if compression mode is not selected.

Comment by Richard McCreadie [ 26/May/09 ]

I have no idea what is causing this, as it worked in a previous version. It may be an issue with the new Hadoop.

Comment by Craig Macdonald [ 27/May/09 ]

Can you paste a stack trace?

Comment by Craig Macdonald [ 12/Aug/09 ]

I'd really like to have this turned on by default. Can you provide a working version of this patch?

Comment by Craig Macdonald [ 08/Sep/09 ]

The issue is that, for some reason, compression does not work when the "local" job tracker is used. I have enabled compression, with a special case for the "local" job tracker.
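
The special case might look roughly like the sketch below. It is an assumption about the shape of the change, not the committed code. To stay runnable without Hadoop on the classpath, it uses java.util.Properties in place of a Hadoop JobConf; with a real JobConf the equivalent calls would be setCompressMapOutput(true) and setMapOutputCompressorClass(GzipCodec.class). The property names are those used by the old mapred API of this Hadoop era; the class name MapCompressionSketch is hypothetical.

```java
import java.util.Properties;

// Hypothetical stand-in for the job setup; Properties replaces Hadoop's JobConf here.
public class MapCompressionSketch {

    static final String JOB_TRACKER = "mapred.job.tracker";
    static final String COMPRESS_MAP_OUTPUT = "mapred.compress.map.output";
    static final String MAP_OUTPUT_CODEC = "mapred.map.output.compression.codec";
    static final String GZIP_CODEC = "org.apache.hadoop.io.compress.GzipCodec";

    /** Turns on gzip map output compression unless the job tracker is "local". */
    static void configureMapCompression(Properties conf) {
        // An unset job tracker is treated as "local" in this sketch.
        String tracker = conf.getProperty(JOB_TRACKER, "local");
        if ("local".equals(tracker)) {
            // Special case: leave compression off for the local job tracker.
            conf.setProperty(COMPRESS_MAP_OUTPUT, "false");
            return;
        }
        conf.setProperty(COMPRESS_MAP_OUTPUT, "true");
        conf.setProperty(MAP_OUTPUT_CODEC, GZIP_CODEC);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // Placeholder host and port, for illustration only.
        conf.setProperty(JOB_TRACKER, "jobtracker.example.org:9001");
        configureMapCompression(conf);
        System.out.println(COMPRESS_MAP_OUTPUT + "=" + conf.getProperty(COMPRESS_MAP_OUTPUT));
    }
}
```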

Comment by Craig Macdonald [ 09/Sep/09 ]

I committed this.
