Uploaded image for project: 'Terrier Core'
  1. Terrier Core
  2. TR-39

DocID's are not assigned correctly during the reduce step of MapReduce Indexing.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.0
    • Fix Version/s: 3.0
    • Component/s: .indexing, .structures
    • Labels:
      None

      Description

      There is a bad assumption that splits will be allocated to maps in order, i.e that split 1 will be given to Map1 etc.

      Fix: Sort the MapData files by start DocID and when assigning these to reduces use this ordering.

        Attachments

        1. patch.txt
          31 kB
        2. TR-34.branch2x.v4.patch
          31 kB
        3. TREC-34.v3.patch
          27 kB
        4. TREC-34.v4.patch
          34 kB

          Activity

          Hide
          richardm Richard McCreadie added a comment -

          Created PositionAwareSplit extends InputSplit
          Created PositionAwareDocument extends Document
          CollectionRecordReader etc now use these rather than InputSplit and Document
          Maps now store the splitnum to the MapDataFile
          MapData is now comparable on splitnum
          Reducers sort MapData
          HadoopRunsMerger now checks against the splitnum rather than MapNo.

          Show
          richardm Richard McCreadie added a comment - Created PositionAwareSplit extends InputSplit Created PositionAwareDocument extends Document CollectionRecordReader etc now use these rather than InputSplit and Document Maps now store the splitnum to the MapDataFile MapData is now comparable on splitnum Reducers sort MapData HadoopRunsMerger now checks against the splitnum rather than MapNo.
          Hide
          richardm Richard McCreadie added a comment -

          Swapped class files for patch

          Show
          richardm Richard McCreadie added a comment - Swapped class files for patch
          Hide
          craigm Craig Macdonald added a comment -

          Hi Richard,

          This patch looks OK.However, you need to make some consideration for the sorting of DocumentIndexes etc.

          I would suggest that closeReduce() stage 3 needs to sort the indices in some way.

          Perhaps you wish to write the splt number into the currentIndex object in closeMap(), and then use that to sort the indices array in closeReduce()?

          Ta

          C

          Show
          craigm Craig Macdonald added a comment - Hi Richard, This patch looks OK.However, you need to make some consideration for the sorting of DocumentIndexes etc. I would suggest that closeReduce() stage 3 needs to sort the indices in some way. Perhaps you wish to write the splt number into the currentIndex object in closeMap(), and then use that to sort the indices array in closeReduce()? Ta C
          Hide
          richardm Richard McCreadie added a comment -

          Added sorting on document Index - sorts MapTaskIDs during the load of run files to be inorder of splits using a new IDComparator class

          Show
          richardm Richard McCreadie added a comment - Added sorting on document Index - sorts MapTaskIDs during the load of run files to be inorder of splits using a new IDComparator class
          Hide
          richardm Richard McCreadie added a comment -

          typo fix

          Show
          richardm Richard McCreadie added a comment - typo fix
          Hide
          richardm Richard McCreadie added a comment -

          Another fix, PositionAwareSplit is now generic on type <T extends InputSplit>
          T's class name is written with the file and then used to call the generic constructor for T so that it can do the rest of the reading.

          Show
          richardm Richard McCreadie added a comment - Another fix, PositionAwareSplit is now generic on type <T extends InputSplit> T's class name is written with the file and then used to call the generic constructor for T so that it can do the rest of the reading.
          Hide
          craigm Craig Macdonald added a comment -

          Hi Richard,

          Couple of picky things:
          1. IDComparator

          • Its fairly unusual that a comparator object has state. Can you avoid this?
          • Can you add Javadoc so we understand how this class sorts things

          2. PositionAwareSplit:

          • I'm ok with using the class name to determine the split type
          • However, I dont see why it has to be generic on the split type, as getSplit() returns InputSplit
          • Can you throw a terrier.utility.WrappedIOException if an except occurs loading the object.
          • Use this to load the class without a cast:
            Class<? extends InputSplit> c = Class.forName(className, false, this.getClass().getClassLoader()).asSubclass(InputSplit.class);
            split = c.getConstructor(new Class[0]).newInstance(new Object[0]);
            
          • Why the getConstructor() call? Just use Class.newInstance() instead, as that uses the default constructor.

          We need to have some kind of test case to ensure that this works as expected.

          Show
          craigm Craig Macdonald added a comment - Hi Richard, Couple of picky things: 1. IDComparator Its fairly unusual that a comparator object has state. Can you avoid this? Can you add Javadoc so we understand how this class sorts things 2. PositionAwareSplit: I'm ok with using the class name to determine the split type However, I dont see why it has to be generic on the split type, as getSplit() returns InputSplit Can you throw a terrier.utility.WrappedIOException if an except occurs loading the object. Use this to load the class without a cast: Class <? extends InputSplit> c = Class .forName(className, false , this .getClass().getClassLoader()).asSubclass(InputSplit.class); split = c.getConstructor( new Class [0]).newInstance( new Object [0]); Why the getConstructor() call? Just use Class.newInstance() instead, as that uses the default constructor. We need to have some kind of test case to ensure that this works as expected.
          Hide
          richardm Richard McCreadie added a comment -

          Bug found in run matching to run data files during the reduce step.

          Show
          richardm Richard McCreadie added a comment - Bug found in run matching to run data files during the reduce step.
          Hide
          craigm Craig Macdonald added a comment -

          Improved patch. Appears to fix docids properly. Prevents double flushes, and keeps all indexer variables at correct values.

          Show
          craigm Craig Macdonald added a comment - Improved patch. Appears to fix docids properly. Prevents double flushes, and keeps all indexer variables at correct values.
          Hide
          craigm Craig Macdonald added a comment -

          Richard,

          I'm wondering if this patch could be back-ported to Terrier 2, as block indexing is taking an age. (Dont do anything yet)

          C

          Show
          craigm Craig Macdonald added a comment - Richard, I'm wondering if this patch could be back-ported to Terrier 2, as block indexing is taking an age. (Dont do anything yet) C
          Hide
          craigm Craig Macdonald added a comment -

          Last revision of patch. This patch for trunk only.

          Show
          craigm Craig Macdonald added a comment - Last revision of patch. This patch for trunk only.
          Hide
          craigm Craig Macdonald added a comment -

          Patch committed for trunk.

          Show
          craigm Craig Macdonald added a comment - Patch committed for trunk.
          Hide
          craigm Craig Macdonald added a comment -

          patch for 2.x SVN branch (rev 2573). All tests pass.

          Show
          craigm Craig Macdonald added a comment - patch for 2.x SVN branch (rev 2573). All tests pass.

            People

            • Assignee:
              richardm Richard McCreadie
              Reporter:
              richardm Richard McCreadie
            • Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: