[TR-289] Docid alignment is broken for MapReduce indexing when map tasks are repeated Created: 26/May/14 Updated: 16/Jun/14 Resolved: 12/Jun/14 |
|
Status: | Resolved |
Project: | Terrier Core |
Component/s: | None |
Affects Version/s: | 3.6 |
Fix Version/s: | 4.0 |
Type: | Bug | Priority: | Critical |
Reporter: | Craig Macdonald | Assignee: | Craig Macdonald |
Resolution: | Fixed | ||
Labels: | None |
Attachments: |
Description |
Recent indices created by MapReduce have docid alignment problems:
Posting id 80368334 is too big for term 0 term0 Nt=34566754 TF=587985939 @{0 0 0} TFf=5215159,4448387,578322393
Posting id 79259555 is too big for term d term21430718 Nt=22088033 TF=65422749 @{3 0 0} TFf=679062,312433,64431254
This is /thought/ to be unrelated to compression changes. |
Comments |
Comment by Craig Macdonald [ 26/May/14 ] |
This program can be used to test an index, used as follows:
bin/anyclass.sh -Dterrier.index.path=/path/to/index -Dterrier.index.prefix=data org.terrier.applications.PrintIndexTerm |
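For illustration, the kind of check behind the "Posting id ... is too big" messages in the description can be sketched as below. This is a minimal sketch that assumes the message means a posting's docid is not smaller than the number of documents in the index. The class DocidRangeCheck and its check method are hypothetical; they are not the attached program and not part of Terrier's API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an index sanity check; not Terrier's PrintIndexTerm. */
public class DocidRangeCheck {

    /**
     * Returns one message per posting whose docid is at or beyond numberOfDocuments.
     * Assumption: valid docids lie in the range [0, numberOfDocuments).
     */
    public static List<String> check(String term, int[] postingDocids, int numberOfDocuments) {
        List<String> problems = new ArrayList<>();
        for (int docid : postingDocids) {
            if (docid >= numberOfDocuments) {
                problems.add("Posting id " + docid + " is too big for term " + term);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Made-up posting list for an index that holds 100 documents.
        int[] postings = {3, 17, 120};
        for (String problem : check("example", postings, 100)) {
            System.out.println(problem);
        }
    }
}
```

In this sketch only docid 120 is reported, since it cannot refer to any of the 100 documents.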
Comment by Richard McCreadie [ 28/May/14 ] |
I don't think that I am going to make much progress on this without the indexing MR logs. Can you attach them? |
Comment by Craig Macdonald [ 12/Jun/14 ] |
This issue was found to be related to per-flush compression: the MapData contains information for per-flush compression, but this information may change if a map task is repeated. |
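A toy model may help show how this could misalign docids. It assumes, without confirmation from the Terrier source, that under per-flush compression each posting stores a docid relative to the start of its flush, and that the global docid is rebuilt from per-flush document counts kept in the MapData. If a repeated map task flushes at different points, counts recorded by one attempt no longer match postings written by another. All class names, method names and numbers below are hypothetical.

```java
/** Toy model only; names and numbers are hypothetical, not Terrier code. */
public class FlushOffsetModel {

    /** Global docid when postings store docids relative to the start of their flush. */
    public static int perFlushDocid(int mapOffset, int[] flushDocCounts, int flushNo, int localDocid) {
        int offset = mapOffset;
        // Skip over every document written by the earlier flushes of this map task.
        for (int i = 0; i < flushNo; i++) {
            offset += flushDocCounts[i];
        }
        return offset + localDocid;
    }

    /** Global docid when postings store docids relative to the start of the map task. */
    public static int perMapDocid(int mapOffset, int docidWithinMap) {
        return mapOffset + docidWithinMap;
    }

    public static void main(String[] args) {
        int[] firstAttempt = {4, 3};  // flush sizes recorded by the first attempt
        int[] secondAttempt = {5, 2}; // same 7 documents, flushed at a different point

        // A posting written by the second attempt: flush 1, local docid 1.
        int expected = perFlushDocid(0, secondAttempt, 1, 1);
        // The same posting decoded with the first attempt's flush sizes.
        int decoded = perFlushDocid(0, firstAttempt, 1, 1);
        System.out.println("expected=" + expected + " decoded=" + decoded);

        // Relative to the map start, the docid does not depend on flush points.
        System.out.println("perMap=" + perMapDocid(0, 6));
    }
}
```

In this model, docids taken relative to the map start need only the map's starting offset, which does not depend on where a particular attempt happened to flush.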
Comment by Craig Macdonald [ 12/Jun/14 ] |
Patch that disables per-flush compression in favour of per-map compression. This version is for 3.6-ish. |
Comment by Craig Macdonald [ 12/Jun/14 ] |
TR v4 patch. |
Comment by Craig Macdonald [ 12/Jun/14 ] |
Committed r3954. |