Terrier Core / TR-3

Partitioned Mode fails unexpectedly due to missing run status files

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2
    • Fix Version/s: 2.2.1
    • Component/s: .indexing
    • Labels:
      None

      Description

      Partitioned Mode likely does not work, as it loses necessary run status files.

      Possible Cause:
      attempt_200901201748_0001_r_000000_0: INFO - Exception in createBlockOutputStream java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/IP:54476 remote=/IP:50010]
      attempt_200901201748_0001_r_000000_0: INFO - Abandoning block blk_5895289464510919755_580904
      attempt_200901201748_0001_r_000000_0: INFO - Waiting to find target node: IP:50010
      attempt_200901201748_0001_r_000000_0: INFO - Exception in createBlockOutputStream java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/130.209.249.49:54483 remote=/130.209.249.49:50010]
      attempt_200901201748_0001_r_000000_0: INFO - Abandoning block blk_1308099895179166256_580904
      attempt_200901201748_0001_r_000000_0: INFO - Waiting to find target node: IP:50010
      attempt_200901201748_0001_r_000000_0: INFO - Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink IP:50010
      attempt_200901201748_0001_r_000000_0: INFO - Abandoning block blk_4491420624309706092_580904
      attempt_200901201748_0001_r_000000_0: INFO - Waiting to find target node: IP:50010
      attempt_200901201748_0001_r_000000_0: INFO - Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink IP:50010
      attempt_200901201748_0001_r_000000_0: INFO - Abandoning block blk_-827893635345658476_580906
      attempt_200901201748_0001_r_000000_0: INFO - Waiting to find target node: IP:50010
      attempt_200901201748_0001_r_000000_0: WARN - Error running child
      attempt_200901201748_0001_r_000000_0: java.io.IOException: Could not load index from (hdfs://master:9000/user/richardm/mapred-12-08_E1_3,task_200901201748_0001_m_000000) because Index not found: hdfs://master:9000/user/richardm/mapred-12-08_E1_3/task_200901201748_0001_m_000000.properties and hdfs://trmaster:9000/user/richardm/mapred-12-08_E1_3/task_200901201748_0001_m_000000.log both not found.
      attempt_200901201748_0001_r_000000_0: at uk.ac.gla.terrier.indexing.hadoop.Hadoop_BasicSinglePassIndexer.closeReduce(Hadoop_BasicSinglePassIndexer.java:529)
      attempt_200901201748_0001_r_000000_0: at uk.ac.gla.terrier.indexing.hadoop.Hadoop_BasicSinglePassIndexer.close(Hadoop_BasicSinglePassIndexer.java:160)
      attempt_200901201748_0001_r_000000_0: at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:324)
      attempt_200901201748_0001_r_000000_0: at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)

      Effect:
      attempt_200901201748_0001_r_000000_2: java.io.IOException: No run status files found in hdfs://master:9000/user/richardm/mapred-12-08_E1_3
      attempt_200901201748_0001_r_000000_2: at uk.ac.gla.terrier.indexing.hadoop.Hadoop_BasicSinglePassIndexer.loadRunData(Hadoop_BasicSinglePassIndexer.java:393)
      attempt_200901201748_0001_r_000000_2: at uk.ac.gla.terrier.indexing.hadoop.Hadoop_BasicSinglePassIndexer.reduce(Hadoop_BasicSinglePassIndexer.java:452)
      attempt_200901201748_0001_r_000000_2: at uk.ac.gla.terrier.indexing.hadoop.Hadoop_BasicSinglePassIndexer.reduce(Hadoop_BasicSinglePassIndexer.java:97)
      attempt_200901201748_0001_r_000000_2: at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
      attempt_200901201748_0001_r_000000_2: at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)

        Attachments

          Activity

          craigm Craig Macdonald added a comment -

          This looks like a datanode timeout rather than a Terrier problem?

          richardm Richard McCreadie added a comment -

          The first error was a standard DFS busy timeout.

          The actual error was caused by one reduce task deleting all the files the other reducers needed to run.

          We may just want to hold off deleting those files until the whole job has finished.

          craigm Craig Macdonald added a comment -

          In Hadoop 0.19, there is an OutputCommitter API that would let us cleanup after the job completes.

          The attached patch fixes the problem for Hadoop 0.18, by deleting all files starting with a taskid matching the jobid when the job ends.
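
          A minimal sketch of the cleanup logic the patch describes, assuming run status files are named with a `task_<jobid>_...` prefix; the class name, directory layout, and `.runs` extension are illustrative assumptions, not the actual Terrier patch:

          ```java
          import java.io.IOException;
          import java.nio.file.Files;
          import java.nio.file.Path;
          import java.util.stream.Stream;

          public class RunFileCleanup {
              // Delete run status files whose names start with a task id
              // belonging to the given job id, e.g.
              // "task_200901201748_0001_m_000000.runs" for job "200901201748_0001".
              // Returns the number of files deleted. (Hypothetical sketch.)
              static int deleteRunFiles(Path dir, String jobId) throws IOException {
                  String prefix = "task_" + jobId + "_";
                  int deleted = 0;
                  try (Stream<Path> files = Files.list(dir)) {
                      for (Path p : (Iterable<Path>) files::iterator) {
                          if (p.getFileName().toString().startsWith(prefix)) {
                              Files.delete(p);
                              deleted++;
                          }
                      }
                  }
                  return deleted;
              }

              public static void main(String[] args) throws IOException {
                  // Simulate a job directory with two files from this job
                  // and one from an unrelated job.
                  Path dir = Files.createTempDirectory("runstatus");
                  Files.createFile(dir.resolve("task_200901201748_0001_m_000000.runs"));
                  Files.createFile(dir.resolve("task_200901201748_0001_m_000001.runs"));
                  Files.createFile(dir.resolve("task_200901201748_0002_m_000000.runs"));
                  int n = deleteRunFiles(dir, "200901201748_0001");
                  System.out.println(n); // prints 2: only this job's files are removed
              }
          }
          ```

          The key point is that the deletion runs once, when the whole job ends, rather than in each reducer's close, so no reducer can remove files a sibling still needs. In Hadoop 0.19, the same logic could presumably live in an `OutputCommitter` cleanup override, as noted above.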

          craigm Craig Macdonald added a comment -

          Tested patch

          craigm Craig Macdonald added a comment -

          Committed.


            People

            • Assignee:
              craigm Craig Macdonald
            • Reporter:
              richardm Richard McCreadie
            • Watchers:
              1
