[TR-3] Partitioned Mode fails unexpectedly due to missing run status files Created: 21/Jan/09  Updated: 29/Jan/09  Resolved: 28/Jan/09

Status: Closed
Project: Terrier Core
Component/s: .indexing
Affects Version/s: 2.2
Fix Version/s: 2.2.1

Type: Bug Priority: Major
Reporter: Richard McCreadie Assignee: Craig Macdonald
Resolution: Fixed  
Labels: None

Attachments: partitionModePatch.patch, partitionModePatch.v2.patch

 Description   
Partitioned Mode likely does not work, as it loses the run status files that the reducers require.

Possible Cause:
attempt_200901201748_0001_r_000000_0: INFO - Exception in createBlockOutputStream java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/IP:54476 remote=/IP:50010]
attempt_200901201748_0001_r_000000_0: INFO - Abandoning block blk_5895289464510919755_580904
attempt_200901201748_0001_r_000000_0: INFO - Waiting to find target node: IP:50010
attempt_200901201748_0001_r_000000_0: INFO - Exception in createBlockOutputStream java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/IP:54483 remote=/IP:50010]
attempt_200901201748_0001_r_000000_0: INFO - Abandoning block blk_1308099895179166256_580904
attempt_200901201748_0001_r_000000_0: INFO - Waiting to find target node: IP:50010
attempt_200901201748_0001_r_000000_0: INFO - Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink IP:50010
attempt_200901201748_0001_r_000000_0: INFO - Abandoning block blk_4491420624309706092_580904
attempt_200901201748_0001_r_000000_0: INFO - Waiting to find target node: IP:50010
attempt_200901201748_0001_r_000000_0: INFO - Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink IP:50010
attempt_200901201748_0001_r_000000_0: INFO - Abandoning block blk_-827893635345658476_580906
attempt_200901201748_0001_r_000000_0: INFO - Waiting to find target node: IP:50010
attempt_200901201748_0001_r_000000_0: WARN - Error running child
attempt_200901201748_0001_r_000000_0: java.io.IOException: Could not load index from (hdfs://master:9000/user/richardm/mapred-12-08_E1_3,task_200901201748_0001_m_000000) because Index not found: hdfs://master:9000/user/richardm/mapred-12-08_E1_3/task_200901201748_0001_m_000000.properties and hdfs://trmaster:9000/user/richardm/mapred-12-08_E1_3/task_200901201748_0001_m_000000.log both not found.
attempt_200901201748_0001_r_000000_0: at uk.ac.gla.terrier.indexing.hadoop.Hadoop_BasicSinglePassIndexer.closeReduce(Hadoop_BasicSinglePassIndexer.java:529)
attempt_200901201748_0001_r_000000_0: at uk.ac.gla.terrier.indexing.hadoop.Hadoop_BasicSinglePassIndexer.close(Hadoop_BasicSinglePassIndexer.java:160)
attempt_200901201748_0001_r_000000_0: at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:324)
attempt_200901201748_0001_r_000000_0: at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)

Effect:
attempt_200901201748_0001_r_000000_2: java.io.IOException: No run status files found in hdfs://master:9000/user/richardm/mapred-12-08_E1_3
attempt_200901201748_0001_r_000000_2: at uk.ac.gla.terrier.indexing.hadoop.Hadoop_BasicSinglePassIndexer.loadRunData(Hadoop_BasicSinglePassIndexer.java:393)
attempt_200901201748_0001_r_000000_2: at uk.ac.gla.terrier.indexing.hadoop.Hadoop_BasicSinglePassIndexer.reduce(Hadoop_BasicSinglePassIndexer.java:452)
attempt_200901201748_0001_r_000000_2: at uk.ac.gla.terrier.indexing.hadoop.Hadoop_BasicSinglePassIndexer.reduce(Hadoop_BasicSinglePassIndexer.java:97)
attempt_200901201748_0001_r_000000_2: at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
attempt_200901201748_0001_r_000000_2: at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)

 Comments   
Comment by Craig Macdonald [ 22/Jan/09 ]

This looks like a datanode timeout rather than a Terrier problem?

Comment by Richard McCreadie [ 22/Jan/09 ]

The first error was a standard DFS busy timeout.

The actual error was caused by one reducer deleting all of the files that the other reducers needed to run.

We may just want to hold off deleting those files until the whole job has finished.
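For illustration, a minimal sketch of that idea: move the deletion out of each reducer's close() and into the job driver, to run once after JobClient.runJob(conf) has returned. The helper name, directory argument, and ".runs" suffix are hypothetical, not the actual Terrier code.

import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class RunStatusCleanup {
    /**
     * Delete the per-map run status files. Intended to be called once from
     * the job driver after JobClient.runJob(conf) has returned, rather than
     * from each reducer, so no reducer can be starved of files it still needs.
     */
    public static void deleteRunStatusFiles(JobConf conf, Path runStatusDir)
            throws IOException {
        FileSystem fs = runStatusDir.getFileSystem(conf);
        for (FileStatus file : fs.listStatus(runStatusDir)) {
            // hypothetical naming convention for the run status files
            if (file.getPath().getName().endsWith(".runs"))
                fs.delete(file.getPath(), false);
        }
    }
}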

Comment by Craig Macdonald [ 22/Jan/09 ]

In Hadoop 0.19, there is an OutputCommitter API that would let us clean up after the job completes.
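For reference, a rough sketch against the 0.19 API, assuming the run status directory is passed in a job property ("terrier.runstatus.dir" is an illustrative name, not an actual Terrier property):

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileOutputCommitter;
import org.apache.hadoop.mapred.JobContext;

public class RunStatusOutputCommitter extends FileOutputCommitter {
    @Override
    public void cleanupJob(JobContext context) throws IOException {
        super.cleanupJob(context);
        // cleanupJob runs once, after all reduce tasks have completed, so
        // the run status files can be removed without starving any reducer.
        Path runStatusDir = new Path(
            context.getJobConf().get("terrier.runstatus.dir"));
        FileSystem fs = runStatusDir.getFileSystem(context.getJobConf());
        fs.delete(runStatusDir, true);
    }
}

Such a committer would be registered on the JobConf via setOutputCommitter(RunStatusOutputCommitter.class).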

The attached patch fixes the problem for Hadoop 0.18: when the job ends, it deletes all files whose names start with a task id matching the job id.
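In outline, the workaround amounts to something like the following (a sketch only; the patch's actual code may differ):

import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class JobEndCleanup {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf();
        // ... indexing job configuration elided ...
        RunningJob rj = JobClient.runJob(conf); // blocks until the job ends

        // A task id such as task_200901201748_0001_m_000000 embeds the
        // job id job_200901201748_0001, so files left by this job's tasks
        // share the prefix obtained by swapping "job_" for "task_".
        String taskPrefix = rj.getJobID().replaceFirst("^job_", "task_");

        Path outputDir = new Path(args[0]); // the index directory on HDFS
        FileSystem fs = outputDir.getFileSystem(conf);
        for (FileStatus f : fs.listStatus(outputDir)) {
            if (f.getPath().getName().startsWith(taskPrefix))
                fs.delete(f.getPath(), false);
        }
    }
}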

Comment by Craig Macdonald [ 28/Jan/09 ]

Tested patch.

Comment by Craig Macdonald [ 28/Jan/09 ]

Committed.
