[TR-83] Hadoop indexing: splits are uneven Created: 09/Dec/09  Updated: 05/Mar/10  Resolved: 09/Dec/09

Status: Resolved
Project: Terrier Core
Component/s: .structures
Affects Version/s: None
Fix Version/s: 3.0

Type: Bug
Priority: Major
Reporter: Craig Macdonald
Assignee: Craig Macdonald
Resolution: Fixed  
Labels: None


 Description   
For 256 map tasks and a corpus of 1492 files:

Split size = 5.8 files each => all but the last split get 5 files each, and the last gets its own 5 plus the 212 leftover files (217 in total).

 Comments   
Comment by Craig Macdonald [ 09/Dec/09 ]

Resolved, in conjunction with Richard.

Comment by Iadh Ounis [ 09/Dec/09 ]

... and the problem was .....

Just curious (perhaps, I'm trying to find any excuse to stop reading)

Comment by Craig Macdonald [ 09/Dec/09 ]

Good point.

We were taking the floor of the division, and adding any leftover files to the last split. For large numbers of files, this can become very uneven.

The solution is to take the ceiling of the same division. The downside is that you may end up with slightly fewer splits than requested.
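
To make the difference concrete, here is a minimal sketch of the two strategies applied to the numbers from the description. This is not the Terrier code; the function names split_sizes_floor and split_sizes_ceil are hypothetical and only model the arithmetic described in this comment.

```python
def split_sizes_floor(num_files, num_splits):
    """Old behaviour as described: floor division, leftovers go to the last split."""
    per_split = num_files // num_splits
    sizes = [per_split] * num_splits
    # Everything that did not divide evenly lands on the final split.
    sizes[-1] += num_files - per_split * num_splits
    return sizes


def split_sizes_ceil(num_files, num_splits):
    """Fix as described: ceiling division, which may yield fewer splits than requested."""
    per_split = (num_files + num_splits - 1) // num_splits  # integer ceiling
    sizes = []
    remaining = num_files
    while remaining > 0:
        take = min(per_split, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


old = split_sizes_floor(1492, 256)
new = split_sizes_ceil(1492, 256)
print(len(old), max(old))           # 256 217
print(len(new), max(new), new[-1])  # 249 6 4
```

With the ceiling, no split holds more than 6 files, at the cost of producing 249 splits instead of the 256 requested.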
