Terrier Core

DocID's are not assigned correctly during the reduce step of MapReduce Indexing.

Details

  • Type: Bug Bug
  • Status: Resolved Resolved
  • Priority: Minor Minor
  • Resolution: Fixed
  • Affects Version/s: 3.0
  • Fix Version/s: 3.0
  • Component/s: .indexing, .structures
  • Description:
    There is a bad assumption that splits will be allocated to maps in order, i.e that split 1 will be given to Map1 etc.

    Fix: Sort the MapData files by start DocID and when assigning these to reduces use this ordering.
  1. patch.txt
    (31 kB)
    Richard McCreadie
    18/Jun/09 2:53 PM
  2. TR-34.branch2x.v4.patch
    (31 kB)
    Craig Macdonald
    24/Jun/09 6:21 PM
  3. TREC-34.v3.patch
    (27 kB)
    Craig Macdonald
    22/Jun/09 7:12 PM
  4. TREC-34.v4.patch
    (34 kB)
    Craig Macdonald
    23/Jun/09 2:33 PM

Activity

Hide
Richard McCreadie added a comment - 06/May/09 1:34 PM

Created PositionAwareSplit extends InputSplit
Created PositionAwareDocument extends Document
CollectionRecordReader etc now use these rather than InputSplit and Document
Maps now store the splitnum to the MapDataFile
MapData is now comparable on splitnum
Reducers sort MapData
HadoopRunsMerger now checks against the splitnum rather than MapNo.

Show
Richard McCreadie added a comment - 06/May/09 1:34 PM Created PositionAwareSplit extends InputSplit Created PositionAwareDocument extends Document CollectionRecordReader etc now use these rather than InputSplit and Document Maps now store the splitnum to the MapDataFile MapData is now comparable on splitnum Reducers sort MapData HadoopRunsMerger now checks against the splitnum rather than MapNo.
Hide
Richard McCreadie added a comment - 08/May/09 1:06 PM

Swapped class files for patch

Show
Richard McCreadie added a comment - 08/May/09 1:06 PM Swapped class files for patch
Hide
Craig Macdonald added a comment - 08/May/09 1:17 PM

Hi Richard,

This patch looks OK.However, you need to make some consideration for the sorting of DocumentIndexes etc.

I would suggest that closeReduce() stage 3 needs to sort the indices in some way.

Perhaps you wish to write the splt number into the currentIndex object in closeMap(), and then use that to sort the indices array in closeReduce()?

Ta

C

Show
Craig Macdonald added a comment - 08/May/09 1:17 PM Hi Richard, This patch looks OK.However, you need to make some consideration for the sorting of DocumentIndexes etc. I would suggest that closeReduce() stage 3 needs to sort the indices in some way. Perhaps you wish to write the splt number into the currentIndex object in closeMap(), and then use that to sort the indices array in closeReduce()? Ta C
Hide
Richard McCreadie added a comment - 12/May/09 2:42 PM

Added sorting on document Index - sorts MapTaskIDs during the load of run files to be inorder of splits using a new IDComparator class

Show
Richard McCreadie added a comment - 12/May/09 2:42 PM Added sorting on document Index - sorts MapTaskIDs during the load of run files to be inorder of splits using a new IDComparator class
Hide
Richard McCreadie added a comment - 12/May/09 5:54 PM

typo fix

Show
Richard McCreadie added a comment - 12/May/09 5:54 PM typo fix
Hide
Richard McCreadie added a comment - 12/May/09 7:38 PM

Another fix, PositionAwareSplit is now generic on type <T extends InputSplit>
T's class name is written with the file and then used to call the generic constructor for T so that it can do the rest of the reading.

Show
Richard McCreadie added a comment - 12/May/09 7:38 PM Another fix, PositionAwareSplit is now generic on type <T extends InputSplit> T's class name is written with the file and then used to call the generic constructor for T so that it can do the rest of the reading.
Hide
Craig Macdonald added a comment - 12/May/09 8:28 PM

Hi Richard,

Couple of picky things:
1. IDComparator

  • Its fairly unusual that a comparator object has state. Can you avoid this?
  • Can you add Javadoc so we understand how this class sorts things

2. PositionAwareSplit:

  • I'm ok with using the class name to determine the split type
  • However, I dont see why it has to be generic on the split type, as getSplit() returns InputSplit
  • Can you throw a terrier.utility.WrappedIOException if an except occurs loading the object.
  • Use this to load the class without a cast:
    Class<? extends InputSplit> c = Class.forName(className, false, this.getClass().getClassLoader()).asSubclass(InputSplit.class);
    split = c.getConstructor(new Class[0]).newInstance(new Object[0]);
  • Why the getConstructor() call? Just use Class.newInstance() instead, as that uses the default constructor.

We need to have some kind of test case to ensure that this works as expected.

Show
Craig Macdonald added a comment - 12/May/09 8:28 PM Hi Richard, Couple of picky things: 1. IDComparator
  • Its fairly unusual that a comparator object has state. Can you avoid this?
  • Can you add Javadoc so we understand how this class sorts things
2. PositionAwareSplit:
  • I'm ok with using the class name to determine the split type
  • However, I dont see why it has to be generic on the split type, as getSplit() returns InputSplit
  • Can you throw a terrier.utility.WrappedIOException if an except occurs loading the object.
  • Use this to load the class without a cast:
    Class<? extends InputSplit> c = Class.forName(className, false, this.getClass().getClassLoader()).asSubclass(InputSplit.class);
    split = c.getConstructor(new Class[0]).newInstance(new Object[0]);
  • Why the getConstructor() call? Just use Class.newInstance() instead, as that uses the default constructor.
We need to have some kind of test case to ensure that this works as expected.
Hide
Richard McCreadie added a comment - 02/Jun/09 11:57 AM

Bug found in run matching to run data files during the reduce step.

Show
Richard McCreadie added a comment - 02/Jun/09 11:57 AM Bug found in run matching to run data files during the reduce step.
Hide
Craig Macdonald added a comment - 22/Jun/09 7:12 PM

Improved patch. Appears to fix docids properly. Prevents double flushes, and keeps all indexer variables at correct values.

Show
Craig Macdonald added a comment - 22/Jun/09 7:12 PM Improved patch. Appears to fix docids properly. Prevents double flushes, and keeps all indexer variables at correct values.
Hide
Craig Macdonald added a comment - 22/Jun/09 7:17 PM

Richard,

I'm wondering if this patch could be back-ported to Terrier 2, as block indexing is taking an age. (Dont do anything yet)

C

Show
Craig Macdonald added a comment - 22/Jun/09 7:17 PM Richard, I'm wondering if this patch could be back-ported to Terrier 2, as block indexing is taking an age. (Dont do anything yet) C
Hide
Craig Macdonald added a comment - 23/Jun/09 2:33 PM

Last revision of patch. This patch for trunk only.

Show
Craig Macdonald added a comment - 23/Jun/09 2:33 PM Last revision of patch. This patch for trunk only.
Hide
Craig Macdonald added a comment - 23/Jun/09 2:34 PM

Patch committed for trunk.

Show
Craig Macdonald added a comment - 23/Jun/09 2:34 PM Patch committed for trunk.
Hide
Craig Macdonald added a comment - 24/Jun/09 6:21 PM

patch for 2.x SVN branch (rev 2573). All tests pass.

Show
Craig Macdonald added a comment - 24/Jun/09 6:21 PM patch for 2.x SVN branch (rev 2573). All tests pass.

People

Dates

  • Created:
    06/May/09 11:56 AM
    Updated:
    05/Mar/10 4:47 PM
    Resolved:
    23/Jun/09 2:34 PM