Uploaded image for project: 'Terrier Core'
  1. Terrier Core
  2. TR-39

DocID's are not assigned correctly during the reduce step of MapReduce Indexing.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.0
    • Fix Version/s: 3.0
    • Component/s: .indexing, .structures
    • Labels:
      None

      Description

      There is a bad assumption that splits will be allocated to maps in order, i.e that split 1 will be given to Map1 etc.

      Fix: Sort the MapData files by start DocID and when assigning these to reduces use this ordering.

        Attachments

        1. patch.txt
          31 kB
        2. TR-34.branch2x.v4.patch
          31 kB
        3. TREC-34.v3.patch
          27 kB
        4. TREC-34.v4.patch
          34 kB

          Activity

          richardm Richard McCreadie created issue -
          richardm Richard McCreadie made changes -
          Field Original Value New Value
          Attachment BasicSinglePassIndexer.java [ 10101 ]
          Attachment CollectionRecordReader.java [ 10102 ]
          Attachment FileCollectionRecordReader.java [ 10103 ]
          richardm Richard McCreadie made changes -
          Attachment HadoopRunsMerger.java [ 10104 ]
          Attachment MapData.java [ 10105 ]
          Attachment MultiFileCollectionInputFormat.java [ 10106 ]
          richardm Richard McCreadie made changes -
          Attachment PositionAwareDocument.java [ 10107 ]
          Attachment PositionAwareSplit.java [ 10108 ]
          Hide
          richardm Richard McCreadie added a comment -

          Created PositionAwareSplit extends InputSplit
          Created PositionAwareDocument extends Document
          CollectionRecordReader etc now use these rather than InputSplit and Document
          Maps now store the splitnum to the MapDataFile
          MapData is now comparable on splitnum
          Reducers sort MapData
          HadoopRunsMerger now checks against the splitnum rather than MapNo.

          Show
          richardm Richard McCreadie added a comment - Created PositionAwareSplit extends InputSplit Created PositionAwareDocument extends Document CollectionRecordReader etc now use these rather than InputSplit and Document Maps now store the splitnum to the MapDataFile MapData is now comparable on splitnum Reducers sort MapData HadoopRunsMerger now checks against the splitnum rather than MapNo.
          richardm Richard McCreadie made changes -
          Attachment BasicSinglePassIndexer.java [ 10101 ]
          richardm Richard McCreadie made changes -
          Attachment CollectionRecordReader.java [ 10102 ]
          richardm Richard McCreadie made changes -
          Attachment FileCollectionRecordReader.java [ 10103 ]
          richardm Richard McCreadie made changes -
          Attachment HadoopRunsMerger.java [ 10104 ]
          richardm Richard McCreadie made changes -
          Attachment MapData.java [ 10105 ]
          richardm Richard McCreadie made changes -
          Attachment MultiFileCollectionInputFormat.java [ 10106 ]
          richardm Richard McCreadie made changes -
          Attachment PositionAwareDocument.java [ 10107 ]
          richardm Richard McCreadie made changes -
          Attachment PositionAwareSplit.java [ 10108 ]
          Hide
          richardm Richard McCreadie added a comment -

          Swapped class files for patch

          Show
          richardm Richard McCreadie added a comment - Swapped class files for patch
          richardm Richard McCreadie made changes -
          Attachment patch.txt [ 10114 ]
          Hide
          craigm Craig Macdonald added a comment -

          Hi Richard,

          This patch looks OK.However, you need to make some consideration for the sorting of DocumentIndexes etc.

          I would suggest that closeReduce() stage 3 needs to sort the indices in some way.

          Perhaps you wish to write the splt number into the currentIndex object in closeMap(), and then use that to sort the indices array in closeReduce()?

          Ta

          C

          Show
          craigm Craig Macdonald added a comment - Hi Richard, This patch looks OK.However, you need to make some consideration for the sorting of DocumentIndexes etc. I would suggest that closeReduce() stage 3 needs to sort the indices in some way. Perhaps you wish to write the splt number into the currentIndex object in closeMap(), and then use that to sort the indices array in closeReduce()? Ta C
          Hide
          richardm Richard McCreadie added a comment -

          Added sorting on document Index - sorts MapTaskIDs during the load of run files to be inorder of splits using a new IDComparator class

          Show
          richardm Richard McCreadie added a comment - Added sorting on document Index - sorts MapTaskIDs during the load of run files to be inorder of splits using a new IDComparator class
          richardm Richard McCreadie made changes -
          Attachment patchv2.txt [ 10115 ]
          richardm Richard McCreadie made changes -
          Priority Critical [ 2 ] Minor [ 4 ]
          richardm Richard McCreadie made changes -
          Attachment patch.txt [ 10114 ]
          richardm Richard McCreadie made changes -
          Attachment patchv2.txt [ 10115 ]
          richardm Richard McCreadie made changes -
          Status Open [ 1 ] Patch Available [ 10000 ]
          richardm Richard McCreadie made changes -
          Status Patch Available [ 10000 ] Open [ 1 ]
          Hide
          richardm Richard McCreadie added a comment -

          typo fix

          Show
          richardm Richard McCreadie added a comment - typo fix
          richardm Richard McCreadie made changes -
          Attachment patchv2.txt [ 10117 ]
          richardm Richard McCreadie made changes -
          Attachment patchv2.txt [ 10117 ]
          Hide
          richardm Richard McCreadie added a comment -

          Another fix, PositionAwareSplit is now generic on type <T extends InputSplit>
          T's class name is written with the file and then used to call the generic constructor for T so that it can do the rest of the reading.

          Show
          richardm Richard McCreadie added a comment - Another fix, PositionAwareSplit is now generic on type <T extends InputSplit> T's class name is written with the file and then used to call the generic constructor for T so that it can do the rest of the reading.
          richardm Richard McCreadie made changes -
          Attachment patchv3.txt [ 10118 ]
          Hide
          craigm Craig Macdonald added a comment -

          Hi Richard,

          Couple of picky things:
          1. IDComparator

          • Its fairly unusual that a comparator object has state. Can you avoid this?
          • Can you add Javadoc so we understand how this class sorts things

          2. PositionAwareSplit:

          • I'm ok with using the class name to determine the split type
          • However, I dont see why it has to be generic on the split type, as getSplit() returns InputSplit
          • Can you throw a terrier.utility.WrappedIOException if an except occurs loading the object.
          • Use this to load the class without a cast:
            Class<? extends InputSplit> c = Class.forName(className, false, this.getClass().getClassLoader()).asSubclass(InputSplit.class);
            split = c.getConstructor(new Class[0]).newInstance(new Object[0]);
            
          • Why the getConstructor() call? Just use Class.newInstance() instead, as that uses the default constructor.

          We need to have some kind of test case to ensure that this works as expected.

          Show
          craigm Craig Macdonald added a comment - Hi Richard, Couple of picky things: 1. IDComparator Its fairly unusual that a comparator object has state. Can you avoid this? Can you add Javadoc so we understand how this class sorts things 2. PositionAwareSplit: I'm ok with using the class name to determine the split type However, I dont see why it has to be generic on the split type, as getSplit() returns InputSplit Can you throw a terrier.utility.WrappedIOException if an except occurs loading the object. Use this to load the class without a cast: Class <? extends InputSplit> c = Class .forName(className, false , this .getClass().getClassLoader()).asSubclass(InputSplit.class); split = c.getConstructor( new Class [0]).newInstance( new Object [0]); Why the getConstructor() call? Just use Class.newInstance() instead, as that uses the default constructor. We need to have some kind of test case to ensure that this works as expected.
          craigm Craig Macdonald made changes -
          Project Terrier Core [ 10000 ] TREC [ 10010 ]
          Key TR-29 TREC-34
          Workflow Terrier Open Source [ 10100 ] jira [ 10103 ]
          Affects Version/s 2.2.1 [ 10010 ]
          Component/s .indexing [ 10002 ]
          craigm Craig Macdonald made changes -
          Component/s Core [ 10020 ]
          Hide
          richardm Richard McCreadie added a comment -

          Bug found in run matching to run data files during the reduce step.

          Show
          richardm Richard McCreadie added a comment - Bug found in run matching to run data files during the reduce step.
          richardm Richard McCreadie made changes -
          Attachment MergerPatch.txt [ 10125 ]
          richardm Richard McCreadie made changes -
          Attachment MergerPatch.txt [ 10125 ]
          richardm Richard McCreadie made changes -
          Attachment MergerPatch.txt [ 10126 ]
          richardm Richard McCreadie made changes -
          Attachment MergerPatch.txt [ 10126 ]
          richardm Richard McCreadie made changes -
          Attachment patchv3.txt [ 10118 ]
          richardm Richard McCreadie made changes -
          Attachment patch.txt [ 10128 ]
          Hide
          craigm Craig Macdonald added a comment -

          Improved patch. Appears to fix docids properly. Prevents double flushes, and keeps all indexer variables at correct values.

          Show
          craigm Craig Macdonald added a comment - Improved patch. Appears to fix docids properly. Prevents double flushes, and keeps all indexer variables at correct values.
          craigm Craig Macdonald made changes -
          Attachment TREC-34.v3.patch [ 10129 ]
          Hide
          craigm Craig Macdonald added a comment -

          Richard,

          I'm wondering if this patch could be back-ported to Terrier 2, as block indexing is taking an age. (Dont do anything yet)

          C

          Show
          craigm Craig Macdonald added a comment - Richard, I'm wondering if this patch could be back-ported to Terrier 2, as block indexing is taking an age. (Dont do anything yet) C
          Hide
          craigm Craig Macdonald added a comment -

          Last revision of patch. This patch for trunk only.

          Show
          craigm Craig Macdonald added a comment - Last revision of patch. This patch for trunk only.
          craigm Craig Macdonald made changes -
          Attachment TREC-34.v4.patch [ 10130 ]
          Hide
          craigm Craig Macdonald added a comment -

          Patch committed for trunk.

          Show
          craigm Craig Macdonald added a comment - Patch committed for trunk.
          craigm Craig Macdonald made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Hide
          craigm Craig Macdonald added a comment -

          patch for 2.x SVN branch (rev 2573). All tests pass.

          Show
          craigm Craig Macdonald added a comment - patch for 2.x SVN branch (rev 2573). All tests pass.
          craigm Craig Macdonald made changes -
          Attachment TR-34.branch2x.v4.patch [ 10131 ]
          craigm Craig Macdonald made changes -
          Affects Version/s 3.0 [ 10020 ]
          Fix Version/s 3.0 [ 10020 ]
          craigm Craig Macdonald made changes -
          Project TREC [ 10010 ] Terrier Core [ 10000 ]
          Key TREC-34 TR-39
          Workflow jira [ 10103 ] Terrier Open Source [ 10297 ]
          Affects Version/s 3.0 [ 10030 ]
          Affects Version/s 3.0 [ 10020 ]
          Component/s .indexing [ 10002 ]
          Component/s .structures [ 10007 ]
          Component/s Core [ 10020 ]
          Fix Version/s 3.0 [ 10030 ]
          Fix Version/s 3.0 [ 10020 ]

            People

            • Assignee:
              richardm Richard McCreadie
              Reporter:
              richardm Richard McCreadie
            • Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: