Terrier Core

In MR indexing, corpus order is not retained.

Details

  • Type: Improvement
  • Status: Resolved
  • Priority: Major
  • Resolution: Fixed
  • Affects Version/s: 3.0
  • Fix Version/s: 3.0
  • Component/s: .indexing, .structures
  • Description:
    When indexing using MapReduce (MR), corpus ordering still does not appear to be retained: document 0 of shard 0 is not the first document in the corpus.

    This has been a pain for anchor text indexing.
  • Attachments:
    1. patch.txt (34 kB), Richard McCreadie, 25/Sep/09 4:31 PM
    2. TR-53.v3.patch (53 kB), Craig Macdonald, 28/Sep/09 11:16 AM
    3. TREC-53.v1.patch (13 kB), Craig Macdonald, 25/Aug/09 12:08 AM

Activity

Craig Macdonald added a comment - 24/Aug/09 11:42 PM

The issue here is that while we sort runs within a single reducer by split number, partitions are calculated using map tasks, not by splits. Instead, we should use split information to decide which map output goes to which reduce task.
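A minimal sketch of the idea in this comment, assuming splits are numbered 0..numSplits-1 in corpus order and that each reducer should receive one contiguous range of splits. The names `SplitRangePartitioner` and `partitionFor` are hypothetical and are not taken from the attached patches; in a real Hadoop job this logic would sit inside a Partitioner implementation.

```java
/** Hypothetical sketch: route map output to reducers by split id, not by map task. */
final class SplitRangePartitioner {
    private final int numSplits;
    private final int numReducers;

    SplitRangePartitioner(int numSplits, int numReducers) {
        if (numSplits <= 0 || numReducers <= 0) {
            throw new IllegalArgumentException("numSplits and numReducers must be positive");
        }
        this.numSplits = numSplits;
        this.numReducers = numReducers;
    }

    /** Returns the reducer that should receive all output produced from the given split. */
    int partitionFor(int splitId) {
        if (splitId < 0 || splitId >= numSplits) {
            throw new IllegalArgumentException("split id out of range: " + splitId);
        }
        // Ceiling division: each reducer gets at most this many consecutive splits.
        int splitsPerReducer = (numSplits + numReducers - 1) / numReducers;
        // Consecutive splits map to the same or the next reducer, so corpus order is kept.
        return splitId / splitsPerReducer;
    }
}
```

With 7 splits and 3 reducers, this sends splits 0-2 to reducer 0, splits 3-5 to reducer 1, and split 6 to reducer 2, so shard 0 begins with the first document of the corpus.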

Craig Macdonald added a comment - 25/Aug/09 12:08 AM

Initial version of a patch for this issue. Requires testing.

Richard McCreadie added a comment - 25/Sep/09 4:31 PM

New, improved patch.

Tested both docids and meta for ordering on a 7-file subset using 7 maps and 3 reducers; looks OK.
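An ordering check of this kind could be sketched as follows, assuming the document identifiers of each shard can be listed in docid order and the corpus order is known. `CorpusOrderCheck` and `firstMismatch` are hypothetical names for illustration, not part of Terrier or of the patch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical check: do the shards, read in order, reproduce the corpus order? */
final class CorpusOrderCheck {
    /**
     * Returns the first position at which the concatenated shards differ from
     * the corpus, or -1 if the two sequences are identical.
     */
    static int firstMismatch(List<String> corpusDocnos, List<List<String>> shardDocnos) {
        // Concatenate shard 0, shard 1, ... in shard order.
        List<String> indexed = new ArrayList<>();
        for (List<String> shard : shardDocnos) {
            indexed.addAll(shard);
        }
        int common = Math.min(corpusDocnos.size(), indexed.size());
        for (int i = 0; i < common; i++) {
            if (!corpusDocnos.get(i).equals(indexed.get(i))) {
                return i;
            }
        }
        // Equal prefixes: a length difference counts as a mismatch at the shorter length.
        return corpusDocnos.size() == indexed.size() ? -1 : common;
    }
}
```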

Richard McCreadie added a comment - 25/Sep/09 6:31 PM

Tested using WT2G; indexing took 333 seconds.
Retrieval was tested using In_expB2c10.99. Average Precision: 0.3140, identical to the expected value from the test index.

However, a few terms seem to be missing:
terms=1002586 (MR) vs 1002691 (classical), i.e. 105 fewer.
This might just be an encoding issue, though.

Craig Macdonald added a comment - 25/Sep/09 10:22 PM

Can we get rid of MapEmittedTerm then, if we use SplitEmittedTerm? Perhaps SplitEmittedTerm isn't the best name? Not sure.

Thanks for working on this. I will also give a test.

Richard McCreadie added a comment - 28/Sep/09 10:42 AM

Yes, MapEmittedTerm is no longer used; MapEmittedTermByMapPartitioner and MapEmittedTermBySplitPartitioner are also unused.

Craig Macdonald added a comment - 28/Sep/09 11:16 AM

Minor update:

  • Use Hadoop writable methods for SplitEmittedTerm (e.g. VInt, as split id and flushes will mostly be small)
  • Added test cases for SplitEmittedTerm, particularly with various comparator classes, and the partitioner
  • Removed unused class.

Testing on ClueWeb09 showed successful indexing and correct retrieval performance, with the correct ordering of docids.

Please take a look at how I designed the test class.
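To illustrate why a variable-length integer suits small split and flush numbers, here is a stdlib-only sketch of a (term, split, flush) key. It assumes keys are ordered by term, then split, then flush; `SplitKeySketch` and its methods are hypothetical stand-ins, and the byte layout is a generic 7-bits-per-byte varint, not Hadoop's VInt wire format. In Hadoop itself, `WritableUtils.writeVInt` and `WritableUtils.readVInt` provide the corresponding helpers.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Comparator;

/** Hypothetical stand-in for a (term, split, flush) map output key. */
final class SplitKeySketch {
    final String term;
    final int splitId;
    final int flushId;

    SplitKeySketch(String term, int splitId, int flushId) {
        this.term = term;
        this.splitId = splitId;
        this.flushId = flushId;
    }

    /** Assumed ordering: by term, then split, then flush, so runs arrive in corpus order. */
    static final Comparator<SplitKeySketch> BY_TERM_SPLIT_FLUSH =
        Comparator.comparing((SplitKeySketch k) -> k.term)
                  .thenComparingInt(k -> k.splitId)
                  .thenComparingInt(k -> k.flushId);

    /** Writes a non-negative int using 7 bits per byte; values below 128 take one byte. */
    static void writeVarInt(DataOutput out, int value) throws IOException {
        if (value < 0) {
            throw new IllegalArgumentException("negative value: " + value);
        }
        while ((value & ~0x7F) != 0) {
            out.writeByte((value & 0x7F) | 0x80); // high bit set: more bytes follow
            value >>>= 7;
        }
        out.writeByte(value);
    }

    /** Reads an int written by writeVarInt. */
    static int readVarInt(DataInput in) throws IOException {
        int result = 0;
        int shift = 0;
        while (true) {
            int b = in.readUnsignedByte();
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
            shift += 7;
        }
    }

    /** Serialises the key: UTF term, then split and flush as varints. */
    byte[] toBytes() {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(term);
            writeVarInt(out, splitId);
            writeVarInt(out, flushId);
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Rebuilds a key from bytes produced by toBytes. */
    static SplitKeySketch fromBytes(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            String term = in.readUTF();
            int split = readVarInt(in);
            int flush = readVarInt(in);
            return new SplitKeySketch(term, split, flush);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch a split or flush number below 128 costs one byte instead of the four bytes of a fixed-width int, which matters when a key is emitted for every term in every flush.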

Richard McCreadie added a comment - 28/Sep/09 2:42 PM

+1 Test Case (that must have taken some time to write!)

As a note, we need to put in the copyright blurb, since this is part of the core.

Craig Macdonald added a comment - 28/Sep/09 2:59 PM

+1 also.

Committed to trunk, rev 2746. Thanks Richard!


Dates

  • Created: 19/Aug/09 3:57 PM
  • Updated: 05/Mar/10 4:59 PM
  • Resolved: 28/Sep/09 2:59 PM