[TR-50] in MR indexing, corpus order is not retained. Created: 19/Aug/09  Updated: 05/Mar/10  Resolved: 28/Sep/09

Status: Resolved
Project: Terrier Core
Component/s: .indexing, .structures
Affects Version/s: 3.0
Fix Version/s: 3.0

Type: Improvement Priority: Major
Reporter: Craig Macdonald Assignee: Craig Macdonald
Resolution: Fixed  
Labels: None

Attachments: Text File patch.txt     File TR-53.v3.patch     File TREC-53.v1.patch    

When indexing using MapReduce (MR), corpus ordering still does not seem to be retained; i.e. document 0 of shard 0 is not the first document in the corpus.

This has been a pain for anchor text indexing.

Comment by Craig Macdonald [ 24/Aug/09 ]

The issue here is that while we sort runs within a single reducer by split number, partitions are calculated using map tasks, not by splits. Instead, we should use split information to decide which map output goes to which reduce task.
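The idea of routing map output by split rather than by map task can be sketched as below. This is only an illustration under stated assumptions, not the code in the attached patches: the class name `SplitRangePartitioner`, the method `partitionFor`, and the contiguous-range scheme are all hypothetical, and the sketch deliberately avoids the Hadoop API so that it runs on its own.

```java
// Hypothetical sketch: choose a reduce task from the split index alone,
// so each reducer receives a contiguous, ascending range of splits.
class SplitRangePartitioner {

    /**
     * Maps a split index to a reduce task (assumed scheme, for illustration only).
     * Reducer 0 gets the earliest splits, reducer 1 the next range, and so on,
     * so concatenating the reducers' outputs follows corpus order.
     */
    static int partitionFor(int splitIndex, int numSplits, int numReducers) {
        if (numSplits <= 0 || numReducers <= 0) {
            throw new IllegalArgumentException("numSplits and numReducers must be positive");
        }
        if (splitIndex < 0 || splitIndex >= numSplits) {
            throw new IllegalArgumentException("splitIndex out of range: " + splitIndex);
        }
        // Ceiling division: the number of consecutive splits handled by one reducer.
        int splitsPerReducer = (numSplits + numReducers - 1) / numReducers;
        return splitIndex / splitsPerReducer;
    }

    public static void main(String[] args) {
        // 7 splits over 3 reducers, matching the small test setup mentioned in this issue.
        for (int split = 0; split < 7; split++) {
            System.out.println("split " + split + " -> reducer " + partitionFor(split, 7, 3));
        }
    }
}
```

With 7 splits and 3 reducers, this scheme sends splits 0 to 2 to reducer 0, splits 3 to 5 to reducer 1, and split 6 to reducer 2. A partitioner keyed on the map task number gives no such guarantee, because task numbering need not follow split order.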

Comment by Craig Macdonald [ 25/Aug/09 ]

Initial version of a patch for this issue. Requires testing.

Comment by Richard McCreadie [ 25/Sep/09 ]

New, improved patch.

Tested both docids and meta for ordering on a 7-file subset using 7 maps and 3 reducers; looks OK.

Comment by Richard McCreadie [ 25/Sep/09 ]

Tested using WT2G; indexing took 333 seconds.
Retrieval was tested using In_expB2c10.99. Average Precision: 0.3140, identical to the expected value from the test index.

However, there seem to be a few missing terms:
terms=1002586 (MR) vs 1002691 (classical)
This might just be an encoding issue, though.

Comment by Craig Macdonald [ 25/Sep/09 ]

Can we get rid of MapEmittedTerm, then, if we use SplitEmittedTerm? Perhaps SplitEmittedTerm isn't the best name? Not sure.

Thanks for working on this. I will also test it.

Comment by Richard McCreadie [ 28/Sep/09 ]

Yes, MapEmittedTerm is no longer used; similarly, MapEmittedTermByMapPartitioner and MapEmittedTermBySplitPartitioner are also unused.

Comment by Craig Macdonald [ 28/Sep/09 ]

Minor update:

  • Use Hadoop writable methods for SplitEmittedTerm (e.g. VInt, as split ids and flush numbers will mostly be small)
  • Added test cases for SplitEmittedTerm, particularly with various comparator classes, and the partitioner
  • Removed unused class.
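The first two points in the list above can be illustrated with a small sketch. It is not the actual SplitEmittedTerm class: `SplitKeySketch`, its field names, and the term/split/flush comparator order are assumptions made for illustration. The variable-length integer code is a simple 7-bits-per-byte scheme standing in for Hadoop's VInt format (in Hadoop itself, `WritableUtils.writeVInt` and `WritableUtils.readVInt` would be used, and the byte layout differs), so that the example runs without Hadoop on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Comparator;

// Hypothetical stand-in for a (term, split, flush) map output key.
class SplitKeySketch {
    final String term;
    final int splitNo;
    final int flushNo;

    SplitKeySketch(String term, int splitNo, int flushNo) {
        this.term = term;
        this.splitNo = splitNo;
        this.flushNo = flushNo;
    }

    // Assumed ordering: by term, then split, then flush, so that the runs for
    // one term reach the reducer in corpus order.
    static final Comparator<SplitKeySketch> TERM_SPLIT_FLUSH =
        Comparator.comparing((SplitKeySketch k) -> k.term)
                  .thenComparingInt(k -> k.splitNo)
                  .thenComparingInt(k -> k.flushNo);

    // Writes a non-negative int using 7 data bits per byte; small values take one byte.
    static void writeVarInt(DataOutput out, int value) throws IOException {
        if (value < 0) {
            throw new IllegalArgumentException("negative values are not supported in this sketch");
        }
        while ((value & ~0x7F) != 0) {
            out.writeByte((value & 0x7F) | 0x80); // high bit set: more bytes follow
            value >>>= 7;
        }
        out.writeByte(value);
    }

    // Reads an int written by writeVarInt.
    static int readVarInt(DataInput in) throws IOException {
        int result = 0;
        int shift = 0;
        while (true) {
            int b = in.readUnsignedByte();
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
            shift += 7;
        }
    }

    // Serializes the key: the term, then the two small integers in variable-length form.
    byte[] toBytes() {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeUTF(term);
            writeVarInt(out, splitNo);
            writeVarInt(out, flushNo);
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Rebuilds a key from the bytes produced by toBytes().
    static SplitKeySketch fromBytes(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            String term = in.readUTF();
            int splitNo = readVarInt(in);
            int flushNo = readVarInt(in);
            return new SplitKeySketch(term, splitNo, flushNo);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        SplitKeySketch key = new SplitKeySketch("terrier", 3, 0);
        SplitKeySketch copy = fromBytes(key.toBytes());
        System.out.println(copy.term + " split=" + copy.splitNo + " flush=" + copy.flushNo);
        System.out.println("encoded length in bytes: " + key.toBytes().length);
    }
}
```

In this sketch, a split id and flush number below 128 each occupy a single byte instead of the four bytes of a fixed-width int, which is the saving the first bullet alludes to. Comparing keys numerically on the split and flush fields (rather than as strings) matters for ordering: split 2 must sort before split 10.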

Testing on ClueWeb09 showed successful indexing and correct retrieval performance, with docids in the correct order.

Please take a look at how I designed the test class.

Comment by Richard McCreadie [ 28/Sep/09 ]

+1 Test Case (that must have taken some time to write!)

As a note, we need to add the copyright blurb, since this is part of the core.

Comment by Craig Macdonald [ 28/Sep/09 ]

+1 also.

Committed to trunk, rev 2746. Thanks Richard!

Generated at Sun Jul 05 11:58:07 BST 2020 using JIRA 7.1.1#71004-sha1:d6b2c0d9b7051e9fb5e4eb8ce177ca56d91d7bd8.