  1. Terrier Core
  2. TR-50

In MapReduce (MR) indexing, corpus order is not retained.

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0
    • Fix Version/s: 3.0
    • Component/s: .indexing, .structures
    • Labels:
      None

      Description

      When indexing using MR, corpus ordering still does not seem to be retained. That is, document 0 of shard 0 is not the first document in the corpus.

      This has been a pain for anchor text indexing.

        Attachments

        1. TREC-53.v1.patch
          13 kB
        2. TR-53.v3.patch
          53 kB
        3. patch.txt
          34 kB

          Activity

          craigm Craig Macdonald added a comment -

          Can we get rid of MapEmittedTerm, then, if we use SplitEmittedTerm? Perhaps SplitEmittedTerm isn't the best name? Not sure.

          Thanks for working on this. I will also give it a test.

          richardm Richard McCreadie added a comment -

          Yes, MapEmittedTerm is no longer used; similarly, MapEmittedTermByMapPartitioner and MapEmittedTermBySplitPartitioner are also unused.

          craigm Craig Macdonald added a comment -

          Minor update:

          • Use Hadoop writable methods for SplitEmittedTerm (e.g. VInt, as split id and flushes will mostly be small)
          • Added test cases for SplitEmittedTerm, particularly with various comparator classes, and the partitioner
          • Removed unused class.
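
          To illustrate the points above, here is a minimal sketch of a key that carries a term, a split id and a flush number, serialised with variable-length integers, plus a comparator and a partitioner that keep corpus order. It is not the SplitEmittedTerm code from the patch: the class name EmittedKeySketch, its fields and all method names are hypothetical, and a simple base-128 varint stands in for Hadoop's VInt helpers so that the example runs without Hadoop on the classpath (Hadoop's actual byte layout differs).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Comparator;

/** Hypothetical stand-in for a map-output key: a term plus where it was emitted. */
final class EmittedKeySketch {
    final String term;
    final int splitId; // index of the input split, assumed to follow corpus order
    final int flushNo; // flush counter within that split (assumed)

    EmittedKeySketch(String term, int splitId, int flushNo) {
        this.term = term;
        this.splitId = splitId;
        this.flushNo = flushNo;
    }

    /** Base-128 varint: values below 128 take a single byte. */
    static void writeVarInt(DataOutput out, int value) throws IOException {
        while ((value & ~0x7F) != 0) {
            out.writeByte((value & 0x7F) | 0x80); // low 7 bits, continuation bit set
            value >>>= 7;
        }
        out.writeByte(value);
    }

    static int readVarInt(DataInput in) throws IOException {
        int result = 0;
        int shift = 0;
        int b;
        do {
            b = in.readByte() & 0xFF;
            result |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return result;
    }

    void write(DataOutput out) throws IOException {
        out.writeUTF(term);
        writeVarInt(out, splitId);
        writeVarInt(out, flushNo);
    }

    static EmittedKeySketch read(DataInput in) throws IOException {
        String term = in.readUTF();
        int splitId = readVarInt(in);
        int flushNo = readVarInt(in);
        return new EmittedKeySketch(term, splitId, flushNo);
    }

    /** Sort by term, then split, then flush, so a term's postings arrive in corpus order. */
    static final Comparator<EmittedKeySketch> CORPUS_ORDER =
        Comparator.comparing((EmittedKeySketch k) -> k.term)
                  .thenComparingInt(k -> k.splitId)
                  .thenComparingInt(k -> k.flushNo);

    /** Range-partition by split id, so that reducer 0 receives the earliest splits. */
    static int partitionFor(int splitId, int numSplits, int numReducers) {
        return (int) ((long) splitId * numReducers / numSplits);
    }

    /** Convenience wrappers for in-memory round trips. */
    static byte[] toBytes(EmittedKeySketch key) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            key.write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static EmittedKeySketch fromBytes(byte[] bytes) {
        try {
            return read(new DataInputStream(new ByteArrayInputStream(bytes)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

          With this ordering, a key emitted from split 0 always sorts before a key for the same term from split 1, and range partitioning sends the earliest splits to the first reducer. Whether the patch partitions in exactly this way is an assumption of the sketch.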

          Testing on ClueWeb09 showed successful indexing and correct retrieval performance, with the correct ordering of docids.

          Please take a look at how I designed the test class.

          richardm Richard McCreadie added a comment -

          +1 Test Case (that must have taken some time to write!)

          As a note, we need to add the copyright blurb, since this is part of the core.

          craigm Craig Macdonald added a comment -

          +1 also.

          Committed to trunk, rev 2746. Thanks Richard!


            People

            • Assignee:
              craigm Craig Macdonald
              Reporter:
              craigm Craig Macdonald
            • Watchers:
              1

              Dates

              • Created:
                Updated:
                Resolved: