[TR-46] Multiple reducing ends up with a document index and a metaindex for ALL shards Created: 11/Aug/09  Updated: 30/May/19  Resolved: 19/Aug/09

Status: Resolved
Project: Terrier Core
Component/s: .structures
Affects Version/s: 3.0
Fix Version/s: 3.0

Type: Bug Priority: Blocker
Reporter: Craig Macdonald Assignee: Craig Macdonald
Resolution: Fixed  
Labels: None

Attachments: File TREC-45.v1.patch    
Issue Links:

Comment by Craig Macdonald [ 11/Aug/09 ]

This issue is even more complicated. The reducer uses the side-effect files for two purposes:

  • To determine which document index and metaindex structures need to be merged for its final index.
  • To determine what the docid offsets should be in the inverted index.

This means that all the docids in the shards are global, not local to the inverted index being created by that shard.

For instance, no docid in the second shard index will be less than the number of documents in the first shard index.
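
The relationship between global and shard-local docids described above can be sketched roughly as follows. This is only an illustration and not Terrier code: the class and method names are hypothetical, and it assumes that shards are processed in order and that the number of documents in each shard is known.

```java
import java.util.Arrays;

// Hypothetical illustration only; none of these names exist in Terrier.
public class DocidOffsets {

    // offsets[i] is the number of documents in all shards before shard i,
    // i.e. the smallest global docid that can appear in shard i.
    public static int[] shardOffsets(int[] docsPerShard) {
        int[] offsets = new int[docsPerShard.length];
        int running = 0;
        for (int i = 0; i < docsPerShard.length; i++) {
            offsets[i] = running;
            running += docsPerShard[i];
        }
        return offsets;
    }

    // Convert a global docid read from a shard's inverted index into a
    // docid that is local to that shard.
    public static int toLocal(int globalDocid, int shardOffset) {
        if (globalDocid < shardOffset) {
            throw new IllegalArgumentException(
                "docid " + globalDocid + " is below the shard offset " + shardOffset);
        }
        return globalDocid - shardOffset;
    }

    public static void main(String[] args) {
        int[] docsPerShard = {3, 2, 4};            // assumed shard sizes
        int[] offsets = shardOffsets(docsPerShard);
        System.out.println(Arrays.toString(offsets)); // prints [0, 3, 5]
        // Global docid 4 lives in the second shard (offset 3), local docid 1.
        System.out.println(toLocal(4, offsets[1]));   // prints 1
    }
}
```

With shard sizes of 3, 2 and 4 documents, the second shard's inverted index would contain only docids of 3 or above, which matches the observation above.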

Comment by Craig Macdonald [ 11/Aug/09 ]

The NWayMergers need to account for the inverted index docid problem.

Comment by Craig Macdonald [ 11/Aug/09 ]

I have two classes in SVN that try to fix this problem for existing indices:

  • FixBadReducerIndex copies the index into a new index, fixing the docids in the inverted file, the collection statistics, and selecting only the appropriate parts of the document index and metaindex along the way.
  • FixDocumentIndexBadReducer just calculates the correct collection statistics.

Comment by Craig Macdonald [ 12/Aug/09 ]

Initial version of a patch for the multi reducer problem.

Comment by Craig Macdonald [ 12/Aug/09 ]

Richard and I checked this, and it does make sense. We're going to try this with Blogs08 with blocks, as a single reducer doesn't have enough disk space to index this corpus.

Comment by Craig Macdonald [ 19/Aug/09 ]

Fixed version committed to SVN trunk.
