Terrier Core / TR-316

Upgrade Hadoop MapReduce indexer

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: .utility
    • Labels: None

    Attachments

    Issue Links

    Activity

          craigm Craig Macdonald added a comment -

          Hi,

          I think there are two sources of confusion here:

          • firstly, TR-316.v1.patch includes a copy of an earlier attempt, named all.diff, to use the MRv2 API. This inner patch includes HadoopIndexing2.java etc. It should be ignored.
          • secondly, I think the problem is that your Hadoop cluster isn't properly configured. I think what we did was
            CLASSPATH=`hadoop classpath` bin/trec_terrier.sh -i -H
            

          Hope this helps.

          Craig
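
          The one-liner above can be sketched as a small wrapper script. This is only an illustration: the function name and the `command -v` guard are my additions, not part of Terrier, and bin/trec_terrier.sh is assumed to be run from the Terrier installation directory.

          ```shell
          # Sketch of the classpath workaround, wrapped as a reusable function.
          run_hadoop_indexing() {
              # Fail early if no Hadoop client is installed on this machine.
              if ! command -v hadoop >/dev/null 2>&1; then
                  echo "error: 'hadoop' not found on PATH" >&2
                  return 1
              fi
              # `hadoop classpath` expands to the cluster's jars and conf
              # directories, so the indexer should pick up the same
              # configuration as other Hadoop jobs on this node.
              CLASSPATH=$(hadoop classpath) bin/trec_terrier.sh -i -H
          }
          ```

          On a correctly configured node, calling `run_hadoop_indexing` should behave the same as the one-liner, while failing with a clearer message when the Hadoop client is missing.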

          craigm Craig Macdonald added a comment -

          PS Suggestions to improve the documentation are welcome

          Craig

          Linares Yiyi added a comment - edited

          Thanks Craig, I managed to run the indexer; however, I am now getting the following FileNotFoundException. To help diagnose this, I copy below the exception stack trace and a full dump of the Hadoop service log. Any clues/patches/suggestions?
          Thanks again

          16/03/09 20:49:18 INFO mapreduce.Job: Task Id : attempt_1456928367853_0126_r_000019_0, Status : FAILED
          Error: java.io.FileNotFoundException: File does not exist: hdfs://nameservice1/user/linareszaila/indices/_temporary/1/_temporary/attempt_1456928367853_0126_r_000013_0
          at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1132)
          at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1124)
          at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
          at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1124)
          at org.apache.hadoop.fs.FileSystem.resolvePath(FileSystem.java:750)
          at org.apache.hadoop.hdfs.DistributedFileSystem$16.<init>(DistributedFileSystem.java:779)
          at org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:770)
          at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:1664)
          at org.apache.hadoop.fs.FileSystem$5.handleFileStat(FileSystem.java:1757)
          at org.apache.hadoop.fs.FileSystem$5.hasNext(FileSystem.java:1734)
          at org.terrier.structures.indexing.singlepass.hadoop.Hadoop_BasicSinglePassIndexer.loadRunData(Hadoop_BasicSinglePassIndexer.java:534)
          at org.terrier.structures.indexing.singlepass.hadoop.Hadoop_BasicSinglePassIndexer.reduce(Hadoop_BasicSinglePassIndexer.java:623)
          at org.terrier.structures.indexing.singlepass.hadoop.Hadoop_BasicSinglePassIndexer.reduce(Hadoop_BasicSinglePassIndexer.java:104)
          at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
          at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
          at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
          at java.security.AccessController.doPrivileged(Native Method)
          at javax.security.auth.Subject.doAs(Subject.java:422)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
          at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

          Dump of the Hadoop service log:

          2016-03-09 20:49:34,820 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Created MRAppMaster for application appattempt_1456928367853_0126_000002
          2016-03-09 20:49:35,045 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Executing with tokens:
          2016-03-09 20:49:35,045 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Kind: YARN_AM_RM_TOKEN, Service: , Ident: (org.apache.hadoop.yarn.security.AMRMTokenIdentifier@7c83dc97)
          2016-03-09 20:49:35,526 WARN [main] org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
          2016-03-09 20:49:35,652 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Attempt num: 2 is last retry: true because a commit was started.
          2016-03-09 20:49:35,654 INFO [main] org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.mapreduce.v2.app.job.event.JobEventType for class org.apache.hadoop.mapreduce.v2.app.MRAppMaster$NoopEventHandler
          2016-03-09 20:49:35,661 INFO [main] org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.mapreduce.jobhistory.EventType for class org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler
          2016-03-09 20:49:35,663 INFO [main] org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.mapreduce.v2.app.rm.ContainerAllocator$EventType for class org.apache.hadoop.mapreduce.v2.app.MRAppMaster$ContainerAllocatorRouter
          2016-03-09 20:49:35,721 INFO [main] org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file system [hdfs://nameservice1:8020]
          2016-03-09 20:49:35,757 INFO [main] org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file system [hdfs://nameservice1:8020]
          2016-03-09 20:49:35,777 INFO [main] org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file system [hdfs://nameservice1:8020]
          2016-03-09 20:49:35,796 INFO [main] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Emitting job history data to the timeline server is not enabled
          2016-03-09 20:49:35,799 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Will not try to recover. recoveryEnabled: true recoverySupportedByCommitter: false numReduceTasks: 26 shuffleKeyValidForRecovery: true ApplicationAttemptID: 2
          2016-03-09 20:49:35,824 INFO [main] org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file system [hdfs://nameservice1:8020]
          2016-03-09 20:49:35,828 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Previous history file is at hdfs://nameservice1:8020/user/linareszaila/.staging/job_1456928367853_0126/job_1456928367853_0126_1.jhist
          2016-03-09 20:49:36,231 INFO [main] org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.mapreduce.v2.app.job.event.JobFinishEvent$Type for class org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler
          2016-03-09 20:49:36,257 INFO [main] org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
          2016-03-09 20:49:36,317 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
          2016-03-09 20:49:36,317 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MRAppMaster metrics system started
          2016-03-09 20:49:36,336 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: nodeBlacklistingEnabled:true
          2016-03-09 20:49:36,336 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: maxTaskFailuresPerNode is 3
          2016-03-09 20:49:36,336 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: blacklistDisablePercent is 33
          2016-03-09 20:49:36,434 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: maxContainerCapability: <memory:10240, vCores:24>
          2016-03-09 20:49:36,434 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: queue: default
          2016-03-09 20:49:36,460 INFO [main] org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file system [hdfs://nameservice1:8020]
          2016-03-09 20:49:36,465 INFO [main] org.apache.hadoop.mapreduce.jobhistory.JobHistoryCopyService: History file is at hdfs://nameservice1:8020/user/linareszaila/.staging/job_1456928367853_0126/job_1456928367853_0126_1.jhist
          2016-03-09 20:49:36,501 INFO [eventHandlingThread] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Event Writer setup for JobId: job_1456928367853_0126, File: hdfs://nameservice1:8020/user/linareszaila/.staging/job_1456928367853_0126/job_1456928367853_0126_2.jhist
          2016-03-09 20:49:36,668 WARN [main] org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:linareszaila (auth:SIMPLE) cause:java.io.IOException: Was asked to shut down.
          2016-03-09 20:49:36,668 FATAL [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster
          java.io.IOException: Was asked to shut down.
          at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$4.run(MRAppMaster.java:1502)
          at java.security.AccessController.doPrivileged(Native Method)
          at javax.security.auth.Subject.doAs(Subject.java:422)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
          at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1496)
          at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1429)
          2016-03-09 20:49:36,671 INFO [main] org.apache.hadoop.util.ExitUtil: Exiting with status 1

          craigm Craig Macdonald added a comment -

          and hence why this issue has not been closed yet...

          The problem is predicting the filename of the map side-effect files during the reduce stage, which changed somewhere along the line between Hadoop versions. It'll be a few days before I can get back to looking at this.

          Is MapReduce important for your use-case, or can you use the classical indexer for now?
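
          To make the failure mode concrete: Hadoop's FileOutputCommitter (in its classic layout) places in-progress task output under a per-application-attempt temporary directory. The sketch below reconstructs the missing path from the stack trace; the interpretation of why it disappeared (cleanup after the ApplicationMaster restarted as attempt 2) is a reading of the logs, not confirmed in this issue.

          ```shell
          # FileOutputCommitter's classic layout for in-progress task output:
          #   ${output}/_temporary/${appAttemptNumber}/_temporary/${taskAttemptID}
          # The reducer in the stack trace (attempt ..._r_000019_0) tried to list
          # the directory of a different reduce attempt under app attempt 1; by
          # then the AM had restarted as attempt 2, so the attempt-1 tree was gone.
          output="hdfs://nameservice1/user/linareszaila/indices"
          app_attempt=1
          task_attempt="attempt_1456928367853_0126_r_000013_0"
          missing_path="$output/_temporary/$app_attempt/_temporary/$task_attempt"
          echo "$missing_path"
          # prints hdfs://nameservice1/user/linareszaila/indices/_temporary/1/_temporary/attempt_1456928367853_0126_r_000013_0
          ```

          This is why hard-coding or predicting another task attempt's side-effect path is fragile across Hadoop versions and AM retries.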

          Linares Yiyi added a comment -

          Hi Craig,

          Yes, I have seen that it is an open issue; I hope a solution is coming soon.

          I cannot use the classical indexer because we want to index a large dataset from the Common Crawl corpus, so the distributed indexer is really useful.

          Thank you very much for the help. I will keep you up to date if I make any progress.

          Best,
          Yiyi


            People

            • Assignee:
              craigm Craig Macdonald
              Reporter:
              craigm Craig Macdonald
            • Watchers:
              3

              Dates

              • Created:
                Updated: