[TR-39] DocID's are not assigned correctly during the reduce step of MapReduce Indexing. Created: 06/May/09  Updated: 05/Mar/10  Resolved: 23/Jun/09

Status: Resolved
Project: Terrier Core
Component/s: .indexing, .structures
Affects Version/s: 3.0
Fix Version/s: 3.0

Type: Bug Priority: Minor
Reporter: Richard McCreadie Assignee: Richard McCreadie
Resolution: Fixed  
Labels: None

Attachments: Text File patch.txt     File TR-34.branch2x.v4.patch     File TREC-34.v3.patch     File TREC-34.v4.patch    

 Description   
There is a bad assumption that splits will be allocated to maps in order, i.e that split 1 will be given to Map1 etc.

Fix: Sort the MapData files by start DocID and when assigning these to reduces use this ordering.

 Comments   
Comment by Richard McCreadie [ 06/May/09 ]

Created PositionAwareSplit extends InputSplit
Created PositionAwareDocument extends Document
CollectionRecordReader etc now use these rather than InputSplit and Document
Maps now store the splitnum to the MapDataFile
MapData is now comparable on splitnum
Reducers sort MapData
HadoopRunsMerger now checks against the splitnum rather than MapNo.

Comment by Richard McCreadie [ 08/May/09 ]

Swapped class files for patch

Comment by Craig Macdonald [ 08/May/09 ]

Hi Richard,

This patch looks OK.However, you need to make some consideration for the sorting of DocumentIndexes etc.

I would suggest that closeReduce() stage 3 needs to sort the indices in some way.

Perhaps you wish to write the splt number into the currentIndex object in closeMap(), and then use that to sort the indices array in closeReduce()?

Ta

C

Comment by Richard McCreadie [ 12/May/09 ]

Added sorting on document Index - sorts MapTaskIDs during the load of run files to be inorder of splits using a new IDComparator class

Comment by Richard McCreadie [ 12/May/09 ]

typo fix

Comment by Richard McCreadie [ 12/May/09 ]

Another fix, PositionAwareSplit is now generic on type <T extends InputSplit>
T's class name is written with the file and then used to call the generic constructor for T so that it can do the rest of the reading.

Comment by Craig Macdonald [ 12/May/09 ]

Hi Richard,

Couple of picky things:
1. IDComparator

  • Its fairly unusual that a comparator object has state. Can you avoid this?
  • Can you add Javadoc so we understand how this class sorts things

2. PositionAwareSplit:

  • I'm ok with using the class name to determine the split type
  • However, I dont see why it has to be generic on the split type, as getSplit() returns InputSplit
  • Can you throw a terrier.utility.WrappedIOException if an except occurs loading the object.
  • Use this to load the class without a cast:
    Class<? extends InputSplit> c = Class.forName(className, false, this.getClass().getClassLoader()).asSubclass(InputSplit.class);
    split = c.getConstructor(new Class[0]).newInstance(new Object[0]);
    
  • Why the getConstructor() call? Just use Class.newInstance() instead, as that uses the default constructor.

We need to have some kind of test case to ensure that this works as expected.

Comment by Richard McCreadie [ 02/Jun/09 ]

Bug found in run matching to run data files during the reduce step.

Comment by Craig Macdonald [ 22/Jun/09 ]

Improved patch. Appears to fix docids properly. Prevents double flushes, and keeps all indexer variables at correct values.

Comment by Craig Macdonald [ 22/Jun/09 ]

Richard,

I'm wondering if this patch could be back-ported to Terrier 2, as block indexing is taking an age. (Dont do anything yet)

C

Comment by Craig Macdonald [ 23/Jun/09 ]

Last revision of patch. This patch for trunk only.

Comment by Craig Macdonald [ 23/Jun/09 ]

Patch committed for trunk.

Comment by Craig Macdonald [ 24/Jun/09 ]

patch for 2.x SVN branch (rev 2573). All tests pass.

Generated at Fri Dec 15 19:39:10 GMT 2017 using JIRA 7.1.1#71004-sha1:d6b2c0d9b7051e9fb5e4eb8ce177ca56d91d7bd8.