[TR-289] Docid alignment is broken for MapReduce indexing when map tasks are repeated
Created: 26/May/14  Updated: 16/Jun/14  Resolved: 12/Jun/14

Status: Resolved
Project: Terrier Core
Component/s: None
Affects Version/s: 3.6
Fix Version/s: 4.0

Type: Bug
Priority: Critical
Reporter: Craig Macdonald
Assignee: Craig Macdonald
Resolution: Fixed  
Labels: None

Attachments: PrintIndexTerm.java, TREC-388.v3.6.patch, TREC-388.v4.patch

 Description   
Recent indices created by MapReduce have docid alignment problems:

Posting id 80368334 is too big for term 0 term0 Nt=34566754 TF=587985939 @{0 0 0} TFf=5215159,4448387,578322393
Posting id 79259555 is too big for term d term21430718 Nt=22088033 TF=65422749 @{3 0 0} TFf=679062,312433,64431254

This is /thought/ to be unrelated to compression changes.

 Comments   
Comment by Craig Macdonald [ 26/May/14 ]

The attached program (PrintIndexTerm.java) can be used to test an index. Run it as follows:

 bin/anyclass.sh  -Dterrier.index.path=/path/to/index -Dterrier.index.prefix=data  org.terrier.applications.PrintIndexTerm
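
The attached PrintIndexTerm.java is not reproduced in this issue. As a rough illustration of the kind of check behind the "Posting id ... is too big" messages above, the following sketch flags any posting whose docid is not smaller than the number of documents. It is a self-contained model that does not use the Terrier API: the class name DocidRangeCheck, the method findBadPostings, and the in-memory map of postings are all hypothetical, and the real program may check different things.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an index check; not the attached PrintIndexTerm.
class DocidRangeCheck {

    /**
     * Returns one message per posting whose docid is >= numDocs.
     * postings maps a term to the docids in its posting list.
     */
    static List<String> findBadPostings(Map<String, int[]> postings, int numDocs) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            for (int docid : entry.getValue()) {
                if (docid >= numDocs) {
                    problems.add("Posting id " + docid
                            + " is too big for term " + entry.getKey());
                }
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Toy index with 5 documents (valid docids are 0..4).
        Map<String, int[]> postings = new LinkedHashMap<>();
        postings.put("a", new int[] {0, 4});
        postings.put("b", new int[] {2, 5, 9});
        for (String problem : findBadPostings(postings, 5)) {
            System.out.println(problem);
        }
    }
}
```

With the toy data in main, term "b" is reported twice (docids 5 and 9), while term "a" passes.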

Comment by Richard McCreadie [ 28/May/14 ]

I don't think that I am going to make much progress on this without the indexing MR logs. Can you attach them?

Comment by Craig Macdonald [ 12/Jun/14 ]

This issue was found to be related to per-flush compression: the MapData contains information for per-flush compression, but this information may change if a map task is repeated.
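
To make the failure mode concrete, here is a deliberately simplified model, not Terrier's actual code. It assumes (hypothetically) that docids are stored relative to the start of each flush, and that a side record of per-flush document counts is used later to turn them back into global docids. If the record comes from one attempt of a map task and the postings from a repeated attempt with different flush boundaries, the resolved docids are wrong and can exceed the document count. The names FlushOffsetModel, resolvePerFlush and resolvePerMap are invented for this sketch.

```java
import java.util.Arrays;

// Simplified, hypothetical model of per-flush vs per-map docid resolution.
class FlushOffsetModel {

    /**
     * Per-flush scheme: global docid = total recorded size of earlier flushes
     * + flush-local docid. recordedFlushSizes plays the role of the side
     * information kept about each flush.
     */
    static int[] resolvePerFlush(int[] recordedFlushSizes, int[][] localDocidsPerFlush) {
        int total = 0;
        for (int[] flush : localDocidsPerFlush) {
            total += flush.length;
        }
        int[] global = new int[total];
        int offset = 0;
        int k = 0;
        for (int f = 0; f < localDocidsPerFlush.length; f++) {
            for (int local : localDocidsPerFlush[f]) {
                global[k++] = offset + local;
            }
            offset += recordedFlushSizes[f];
        }
        return global;
    }

    /** Per-map scheme: docids are map-local, so flush boundaries do not matter. */
    static int[] resolvePerMap(int mapOffset, int[] mapLocalDocids) {
        int[] global = new int[mapLocalDocids.length];
        for (int i = 0; i < mapLocalDocids.length; i++) {
            global[i] = mapOffset + mapLocalDocids[i];
        }
        return global;
    }

    public static void main(String[] args) {
        // One map task over 6 documents; the term occurs in every document.
        // The repeated attempt flushed after 2 documents, then after 4 more.
        int[][] postingsFromRepeat = {{0, 1}, {0, 1, 2, 3}};

        // Flush sizes recorded by the same attempt: docids resolve to 0..5.
        System.out.println(Arrays.toString(
                resolvePerFlush(new int[] {2, 4}, postingsFromRepeat)));

        // Flush sizes recorded by an earlier attempt that flushed after 4, then 2:
        // the second flush is shifted, giving docids 6 and 7 in a 6-document map.
        System.out.println(Arrays.toString(
                resolvePerFlush(new int[] {4, 2}, postingsFromRepeat)));

        // Per-map resolution depends only on where the map task starts.
        System.out.println(Arrays.toString(
                resolvePerMap(0, new int[] {0, 1, 2, 3, 4, 5})));
    }
}
```

In this model the per-map scheme is immune to the mismatch because it never consults flush boundaries, which is the motivation usually given for preferring it when tasks can be re-run.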

Comment by Craig Macdonald [ 12/Jun/14 ]

Patch which disables per-flush compression in favour of per-map compression. This version is for 3.6-ish.

Comment by Craig Macdonald [ 12/Jun/14 ]

TR v4 patch.

Comment by Craig Macdonald [ 12/Jun/14 ]

Committed r3954.
