[TR-115] Upgrade Hadoop support for 0.20 Created: 15/Apr/10  Updated: 06/Jun/11  Resolved: 01/Apr/11

Status: Resolved
Project: Terrier Core
Component/s: .structures
Affects Version/s: 3.0
Fix Version/s: 3.5

Type: Improvement Priority: Major
Reporter: Craig Macdonald Assignee: Craig Macdonald
Resolution: Fixed  
Labels: None

Attachments: File 20.1-patch-v2.gz     File 20.1-patch.gz     File 20.1-TR-115-v3.patch     File 20.1-TR-115-v4.patch     File hadoop-0.20.1+169.68-core.jar     Text File HadoopPlugin.java    
Issue Links:
Related
is related to TR-104 Move to Java6 Resolved

 Description   
Hadoop 0.18 is quite old now. We should aim to upgrade to 0.20 for the next release. Unfortunately, this isn't as simple as upgrading the jar file; there is also the choice between the old and new MapReduce APIs to consider.
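
For context, 0.20 ships both MapReduce APIs side by side: the old org.apache.hadoop.mapred interfaces and the new org.apache.hadoop.mapreduce classes. A minimal sketch of the difference, purely illustrative (these class names are hypothetical, not Terrier code):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Old API: interface-based; this is the style Terrier 3.0 codes against.
class OldApiMapper extends org.apache.hadoop.mapred.MapReduceBase
        implements org.apache.hadoop.mapred.Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value,
            org.apache.hadoop.mapred.OutputCollector<Text, IntWritable> output,
            org.apache.hadoop.mapred.Reporter reporter) throws IOException {
        output.collect(value, new IntWritable(1)); // emit each record once
    }
}

// New API: abstract class with a single Context object for output and status.
class NewApiMapper
        extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, IntWritable> {
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(value, new IntWritable(1)); // same record, new plumbing
    }
}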

 Comments   
Comment by Craig Macdonald [ 15/Apr/10 ]

This patch has been contributed by Ian Soboroff (NIST), and is for Hadoop 0.20.1. The particular Hadoop core jar in use is attached to the issue.

Comment by Craig Macdonald [ 15/Apr/10 ]

New version of the same patch, keeping only the changes to the source tree.

Comment by Craig Macdonald [ 15/Apr/10 ]

Hadoop 0.20 depends on Java 6. Discuss.

Comment by Craig Macdonald [ 22/Apr/10 ]

Some more debugging on this patch. Found that the 0.20 JobTracker rejects the empty string "" as a location for a split.
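
For reference, a minimal sketch of the kind of guard this needs when constructing splits (the method name is illustrative, not the actual Terrier code):

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileSplit;

// Drop null/empty hostnames before building the split, since the 0.20
// JobTracker rejects "" as a split location.
static FileSplit makeSplit(Path file, long start, long length, String[] hosts) {
    List<String> valid = new ArrayList<String>();
    for (String h : hosts)
        if (h != null && h.length() > 0)
            valid.add(h);
    return new FileSplit(file, start, length, valid.toArray(new String[valid.size()]));
}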

Comment by Ian Soboroff [ 23/Apr/10 ]

Note that to build this patch you also need to replace lib/hadoop18.2-joined.jar with a hadoop-0.20-core.jar. I am using the one from Cloudera CDH2.

Comment by Craig Macdonald [ 23/Apr/10 ]

hadoop18.2-joined.jar is a merge of many Hadoop-related jar files: I copied/symlinked in many of the other jar files from $HADOOP_HOME and $HADOOP_HOME/lib/.

Comment by Ian Soboroff [ 23/Apr/10 ]

This patch gives the following NPE:

$ bin/trec_terrier.sh -i -H
Setting TERRIER_HOME to /home/soboroff/terrier-3.0
10/04/23 10:12:39 WARN io.HadoopPlugin: Exception occurred while creating JobFactory
java.lang.NullPointerException
at org.terrier.utility.io.HadoopPlugin.getJobFactory(HadoopPlugin.java:284)
at org.terrier.utility.io.HadoopPlugin.getJobFactory(HadoopPlugin.java:274)
at org.terrier.applications.HadoopIndexing.main(HadoopIndexing.java:121)
at org.terrier.applications.TrecTerrier.run(TrecTerrier.java:373)
at org.terrier.applications.TrecTerrier.applyOptions(TrecTerrier.java:573)
at org.terrier.applications.TrecTerrier.main(TrecTerrier.java:237)
java.lang.Exception: Could not get JobFactory from HadoopPlugin
java.lang.Exception: Could not get JobFactory from HadoopPlugin
at org.terrier.applications.HadoopIndexing.main(HadoopIndexing.java:123)
at org.terrier.applications.TrecTerrier.run(TrecTerrier.java:373)
at org.terrier.applications.TrecTerrier.applyOptions(TrecTerrier.java:573)
at org.terrier.applications.TrecTerrier.main(TrecTerrier.java:237)

getJobFactory earlier calls getGlobalConfiguration, but I'm not sure that setGlobalConfiguration has ever been called. The only caller appears to be in utility.io.HadoopUtility, but applications.HadoopIndexing goes straight to HadoopPlugin. I'm not sure I've understood the whole code flow.
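
If that reading is right, a guard along these lines in getJobFactory would avoid the NPE. Purely a sketch: the overload taking a Configuration is hypothetical, not the actual HadoopPlugin code.

import org.apache.hadoop.conf.Configuration;

public static JobFactory getJobFactory(String sessionName) {
    Configuration conf = getGlobalConfiguration();
    if (conf == null) {
        // setGlobalConfiguration() was never called; fall back to defaults
        conf = new Configuration();
        setGlobalConfiguration(conf);
    }
    return getJobFactory(sessionName, conf); // hypothetical overload
}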

Comment by Craig Macdonald [ 23/Apr/10 ]

Forgot to put this file (HadoopPlugin) in last patch. Updated patch to follow.

Comment by Craig Macdonald [ 23/Apr/10 ]

v4 patch. This includes the missing changes to HadoopPlugin.

Comment by Ian Soboroff [ 23/Apr/10 ]

The job now runs, but tasks die with failed spills:

java.io.IOException: Spill failed
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:822)
at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:466)
at org.terrier.structures.indexing.singlepass.hadoop.HadoopRunWriter.writeTerm(HadoopRunWriter.java:84)
at org.terrier.structures.indexing.singlepass.MemoryPostings.writeToWriter(MemoryPostings.java:151)
at org.terrier.structures.indexing.singlepass.MemoryPostings.finish(MemoryPostings.java:112)
at org.terrier.indexing.hadoop.Hadoop_BasicSinglePassIndexer.forceFlush(Hadoop_BasicSinglePassIndexer.java:308)
at org.terrier.indexing.hadoop.Hadoop_BasicSinglePassIndexer.closeMap(Hadoop_BasicSinglePassIndexer.java:419)
at org.terrier.indexing.hadoop.Hadoop_BasicSinglePassIndexer.close(Hadoop_BasicSinglePassIndexer.java:236)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.lang.NullPointerException
at org.apache.hadoop.mapred.IFile$Writer.<init>(IFile.java:102)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1198)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)

Comment by Craig Macdonald [ 23/Apr/10 ]

I saw that yesterday. Do your map tasks have a warning about being unable to find the native compression libraries?

I presumed this was a configuration error with my Hadoop. Perhaps it's a problem with CDH2 in general?

Workaround: disable map output compression; see src/core/org/terrier/applications/HadoopIndexing.java, line 174.
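
The workaround amounts to something like this on the job's JobConf (a sketch; the line referenced above is where Terrier enables compression):

import org.apache.hadoop.mapred.JobConf;

// Disable map output compression so the spill path never touches the
// (missing) native compression codec.
static void disableMapOutputCompression(JobConf conf) {
    conf.setCompressMapOutput(false);
    // equivalently: conf.setBoolean("mapred.compress.map.output", false);
}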

Comment by Ian Soboroff [ 23/Apr/10 ]

Disabling map output compression fixes it. I have the native libraries installed and bin/hadoop finds them, but perhaps the Terrier script doesn't? Should I set JAVA_LIBRARY_PATH in terrier_env.sh?

Comment by Craig Macdonald [ 23/Apr/10 ]

I'm not convinced it's a Terrier issue. I ran the Hadoop grep example program on a gzipped file and saw the same warning message.

Hadoop should set the library path when the map task child is forked; in this case it doesn't appear to be doing so.
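
One way to force it, as a hedged sketch (the library path below is an example; point it at wherever libhadoop.so actually lives), is to pass java.library.path to the forked task JVMs explicitly:

import org.apache.hadoop.mapred.JobConf;

// Point the child task JVMs at the native libraries explicitly. Note this
// replaces the default child opts, so keep the heap setting too.
static void setNativeLibraryPath(JobConf conf) {
    conf.set("mapred.child.java.opts",
        "-Xmx200m -Djava.library.path=/usr/lib/hadoop/lib/native/Linux-amd64-64");
}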

Comment by Ian Soboroff [ 23/Apr/10 ]

Sigh. 2.5 hours to index the Wikipedia portion, by the way.

INFO - map 100% reduce 100%
INFO - Job complete: job_201004231118_0004
INFO - Counters: 23
INFO - Job Counters
INFO - Launched reduce tasks=26
INFO - Rack-local map tasks=20
INFO - Launched map tasks=49
INFO - Data-local map tasks=29
INFO - FileSystemCounters
INFO - FILE_BYTES_READ=15790406810
INFO - HDFS_BYTES_READ=50375670689
INFO - FILE_BYTES_WRITTEN=23558026906
INFO - HDFS_BYTES_WRITTEN=4826037969
INFO - org.terrier.indexing.hadoop.Hadoop_BasicSinglePassIndexer$Counters
INFO - INDEXED_POINTERS=2656803981
INFO - INDEXED_TOKENS=5896971230
INFO - INDEXED_DOCUMENTS=5957529
INFO - INDEXER_FLUSHES=116
INFO - Map-Reduce Framework
INFO - Reduce input groups=6200909
INFO - Combine output records=0
INFO - Map input records=5957529
INFO - Reduce shuffle bytes=7534296876
INFO - Reduce output records=0
INFO - Spilled Records=213486017
INFO - Map output bytes=7646203740
INFO - Map input bytes=-41692599706
INFO - Combine input records=0
INFO - Map output records=70620646
INFO - Reduce input records=70620646
WARN - No reduce 0 output : no output index [/home/soboroff/terrier-3.0/var/index,data-0]
WARN - No reduce 1 output : no output index [/home/soboroff/terrier-3.0/var/index,data-1]
WARN - No reduce 2 output : no output index [/home/soboroff/terrier-3.0/var/index,data-2]
WARN - No reduce 3 output : no output index [/home/soboroff/terrier-3.0/var/index,data-3]
WARN - No reduce 4 output : no output index [/home/soboroff/terrier-3.0/var/index,data-4]
WARN - No reduce 5 output : no output index [/home/soboroff/terrier-3.0/var/index,data-5]
WARN - No reduce 6 output : no output index [/home/soboroff/terrier-3.0/var/index,data-6]
WARN - No reduce 7 output : no output index [/home/soboroff/terrier-3.0/var/index,data-7]
WARN - No reduce 8 output : no output index [/home/soboroff/terrier-3.0/var/index,data-8]
WARN - No reduce 9 output : no output index [/home/soboroff/terrier-3.0/var/index,data-9]
WARN - No reduce 10 output : no output index [/home/soboroff/terrier-3.0/var/index,data-10]
WARN - No reduce 11 output : no output index [/home/soboroff/terrier-3.0/var/index,data-11]
WARN - No reduce 12 output : no output index [/home/soboroff/terrier-3.0/var/index,data-12]
WARN - No reduce 13 output : no output index [/home/soboroff/terrier-3.0/var/index,data-13]
WARN - No reduce 14 output : no output index [/home/soboroff/terrier-3.0/var/index,data-14]
WARN - No reduce 15 output : no output index [/home/soboroff/terrier-3.0/var/index,data-15]
WARN - No reduce 16 output : no output index [/home/soboroff/terrier-3.0/var/index,data-16]
WARN - No reduce 17 output : no output index [/home/soboroff/terrier-3.0/var/index,data-17]
WARN - No reduce 18 output : no output index [/home/soboroff/terrier-3.0/var/index,data-18]
WARN - No reduce 19 output : no output index [/home/soboroff/terrier-3.0/var/index,data-19]
WARN - No reduce 20 output : no output index [/home/soboroff/terrier-3.0/var/index,data-20]
WARN - No reduce 21 output : no output index [/home/soboroff/terrier-3.0/var/index,data-21]
WARN - No reduce 22 output : no output index [/home/soboroff/terrier-3.0/var/index,data-22]
WARN - No reduce 23 output : no output index [/home/soboroff/terrier-3.0/var/index,data-23]
WARN - No reduce 24 output : no output index [/home/soboroff/terrier-3.0/var/index,data-24]
WARN - No reduce 25 output : no output index [/home/soboroff/terrier-3.0/var/index,data-25]
java.lang.NullPointerException
java.lang.NullPointerException
at org.terrier.applications.HadoopIndexing.mergeLexiconInvertedFiles(HadoopIndexing.java:276)
at org.terrier.applications.HadoopIndexing.main(HadoopIndexing.java:231)
at org.terrier.applications.TrecTerrier.run(TrecTerrier.java:373)
at org.terrier.applications.TrecTerrier.applyOptions(TrecTerrier.java:573)
at org.terrier.applications.TrecTerrier.main(TrecTerrier.java:237)

Comment by Ian Soboroff [ 23/Apr/10 ]

Ah hah, found where the output went: into that path in HDFS, while apparently the reducer is looking in the local filesystem. How can I restart the merge phase of the process?

Comment by Craig Macdonald [ 26/Apr/10 ]

You can reindex, after setting the destination index path using an hdfs:// URL:

terrier.index.path=hdfs://node1:9000/path/to/index

Cheers,

Craig

Comment by Craig Macdonald [ 02/Jul/10 ]

Current v4 patch has the following problem (most likely a binary-compatibility issue: JobID.compareTo has a different signature in this Hadoop version than the one the code was compiled against):

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.mapred.JobID.compareTo(Lorg/apache/hadoop/mapred/ID;)I
        at org.terrier.applications.HadoopIndexing.deleteTaskFiles(HadoopIndexing.java:369)
        at org.terrier.applications.HadoopIndexing.main(HadoopIndexing.java:227)
        at org.terrier.applications.TrecTerrier.run(TrecTerrier.java:373)
        at org.terrier.applications.TrecTerrier.applyOptions(TrecTerrier.java:573)
        at org.terrier.applications.TrecTerrier.main(TrecTerrier.java:237)

Comment by Craig Macdonald [ 18/Feb/11 ]

Tagging for 3.1.

Comment by Craig Macdonald [ 01/Apr/11 ]

Current trunk is operating nicely on 0.20.

Comment by Marco Didonna [ 05/Apr/11 ]

Where can I get "current trunk"?

Comment by Craig Macdonald [ 05/Apr/11 ]

This month!

Comment by Marco Didonna [ 06/Jun/11 ]

Ehm... it is taking a little longer.
