[TR-83] Hadoop indexing: splits are uneven Created: 09/Dec/09  Updated: 05/Mar/10  Resolved: 09/Dec/09

Status: Resolved
Project: Terrier Core
Component/s: .structures
Affects Version/s: None
Fix Version/s: 3.0

Type: Bug Priority: Major
Reporter: Craig Macdonald Assignee: Craig Macdonald
Resolution: Fixed  
Labels: None

For 256 map tasks, and a corpus of 1492 files.

Split size = 5.8 files each => All but the last split get 5 files each, and the last split also takes the 212 leftover files (217 in total).

Comment by Craig Macdonald [ 09/Dec/09 ]

Resolved, in conjunction with Richard.

Comment by Iadh Ounis [ 09/Dec/09 ]

... and the problem was .....

Just curious (perhaps, I'm trying to find any excuse to stop reading)

Comment by Craig Macdonald [ 09/Dec/09 ]

Good point.

We were taking the floor of the division, and adding any leftover files to the last split. For large numbers of files, this can become very uneven.

The solution is to take the ceiling of the same division. The downside is that you may end up with slightly fewer splits than requested.
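
To make the difference concrete, here is a minimal sketch of the two strategies described above. It is not the Terrier code: the function name `split_sizes` and its parameters are hypothetical, and it only counts files per split rather than building real Hadoop input splits.

```python
def split_sizes(num_files, requested_splits, use_ceiling):
    """Return a list with the number of files assigned to each split.

    Hypothetical illustration only; not taken from the Terrier source.
    """
    if num_files == 0:
        return []

    if use_ceiling:
        # Fixed behaviour: ceiling of the division (integer arithmetic).
        per_split = (num_files + requested_splits - 1) // requested_splits
        # Fill splits of per_split files until the files run out.
        sizes = [per_split] * (num_files // per_split)
        remainder = num_files % per_split
        if remainder:
            sizes.append(remainder)
        return sizes

    # Old behaviour: floor of the division, leftovers go to the last split.
    per_split = num_files // requested_splits
    sizes = [per_split] * requested_splits
    sizes[-1] += num_files - per_split * requested_splits
    return sizes


old = split_sizes(1492, 256, use_ceiling=False)
new = split_sizes(1492, 256, use_ceiling=True)
print(len(old), old[0], old[-1])
print(len(new), new[0], new[-1])
```

For the numbers in this issue, the floor variant yields 256 splits where the last one holds 217 files, while the ceiling variant yields 249 splits of 6 files each except for a final split of 4.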

Generated at Fri Jun 18 19:48:35 BST 2021 using JIRA 7.1.1#71004-sha1:d6b2c0d9b7051e9fb5e4eb8ce177ca56d91d7bd8.