[TR-115] Upgrade Hadoop support for 0.20 Created: 15/Apr/10 Updated: 06/Jun/11 Resolved: 01/Apr/11 |
|
Status: | Resolved |
Project: | Terrier Core |
Component/s: | .structures |
Affects Version/s: | 3.0 |
Fix Version/s: | 3.5 |
Type: | Improvement | Priority: | Major |
Reporter: | Craig Macdonald | Assignee: | Craig Macdonald |
Resolution: | Fixed | ||
Labels: | None |
Attachments: |
Issue Links: |
|
Description |
Hadoop 0.18 is quite old now. We should aim to upgrade to 0.20 for the next release. Unfortunately, this isn't as simple as upgrading the jar file. Moreover, there is the choice of MapReduce APIs to consider.
|
Comments |
Comment by Craig Macdonald [ 15/Apr/10 ] |
This patch has been contributed by Ian Soboroff (NIST), and is for Hadoop 0.20.1. The particular Hadoop core jar in use is attached to the issue. |
Comment by Craig Macdonald [ 15/Apr/10 ] |
New version of the same patch, keeping only the changes to the source tree. |
Comment by Craig Macdonald [ 15/Apr/10 ] |
Hadoop 0.20 depends on Java 6. Discuss. |
Comment by Craig Macdonald [ 22/Apr/10 ] |
Some more debugging on this patch. Found that the 0.20 JobTracker rejects "" as a location for a split. |
Comment by Ian Soboroff [ 23/Apr/10 ] |
Note to build this patch you also need to replace lib/hadoop18.2-joined.jar with a hadoop-0.20-core.jar. I am using the one from Cloudera CDH2. |
Comment by Craig Macdonald [ 23/Apr/10 ] |
hadoop18.2-joined.jar is a merge of several Hadoop-related jar files: I copied/symlinked in many of the other jars from $HADOOP_HOME and $HADOOP_HOME/lib/. |
Comment by Ian Soboroff [ 23/Apr/10 ] |
This patch gives an NPE when running $ bin/trec_terrier.sh -i -H. getJobFactory calls getGlobalConfiguration earlier on, but I'm not sure that setGlobalConfiguration has ever been called. The only caller appears to be in utility.io.HadoopUtility, but application.HadoopIndexing goes straight to HadoopPlugin. Not sure I've understood the whole code flow. |
Comment by Craig Macdonald [ 23/Apr/10 ] |
Forgot to put this file (HadoopPlugin) in last patch. Updated patch to follow. |
Comment by Craig Macdonald [ 23/Apr/10 ] |
v4 patch. This includes the missing changes to HadoopPlugin. |
Comment by Ian Soboroff [ 23/Apr/10 ] |
The job now runs, but tasks die with failed spills: java.io.IOException: Spill failed |
Comment by Craig Macdonald [ 23/Apr/10 ] |
I saw that yesterday. Do your map tasks log a warning about being unable to find the native compression libraries? I presumed this was a configuration error with my Hadoop; perhaps it's a problem with CDH2 in general? Workaround: disable map output compression, see src/core/org/terrier/applications/HadoopIndexing.java line 174 |
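Besides editing HadoopIndexing.java, the same workaround can be applied cluster- or job-side through Hadoop's 0.20-era configuration property. A minimal sketch, assuming a standard mapred-site.xml (the property can equally be set per-job on a JobConf):

```xml
<!-- Hypothetical mapred-site.xml fragment: disable map output compression
     to work around the "Spill failed" IOException seen when the native
     compression libraries cannot be loaded by the map task JVM. -->
<property>
  <name>mapred.compress.map.output</name>
  <value>false</value>
</property>
```

This trades some shuffle bandwidth for avoiding the native codec path entirely, so it is only a stopgap until the library-path issue is resolved.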
Comment by Ian Soboroff [ 23/Apr/10 ] |
Disabling map output compression fixes it. I have the native libraries installed and bin/hadoop finds them, but perhaps the Terrier script doesn't? Should I set JAVA_LIBRARY_LIBS in terrier_env.sh? |
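One way to test the library-path theory is to point the Terrier-launched JVM at the native libraries from the environment script. A hedged sketch only: the platform directory name and the assumption that terrier_env.sh contributes JVM options are both guesses to be adjusted for the local install:

```shell
# Hypothetical terrier_env.sh fragment: make Hadoop's native compression
# libraries visible to the JVM that Terrier starts.
# Linux-amd64-64 is the Hadoop 0.20-era platform directory name; change it
# to match your platform (e.g. Linux-i386-32).
HADOOP_NATIVE=$HADOOP_HOME/lib/native/Linux-amd64-64
JAVA_OPTIONS="-Djava.library.path=$HADOOP_NATIVE $JAVA_OPTIONS"
export JAVA_OPTIONS
```

If bin/hadoop finds the libraries but Terrier's scripts do not, comparing the java.library.path each one passes to the JVM should confirm the diagnosis.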
Comment by Craig Macdonald [ 23/Apr/10 ] |
I'm not convinced it's a Terrier issue. I ran the Hadoop grep example program on a gzipped file and saw the same warning message. Hadoop should set the library path when the map task child is forked; in this case it doesn't appear to be doing so. |
Comment by Ian Soboroff [ 23/Apr/10 ] |
Sigh. 2.5 hours to index the Wikipedia portion, btw. INFO - map 100% reduce 100% |
Comment by Ian Soboroff [ 23/Apr/10 ] |
Ah hah, found where the output went: into that path in HDFS, while the reducer is apparently looking in the normal filesystem. How can I restart the merge phase of the process? |
Comment by Craig Macdonald [ 26/Apr/10 ] |
You can reindex after setting the destination index path using an hdfs:// URL: terrier.index.path=hdfs://node1:9000/path/to/index Cheers, Craig |
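In terrier.properties that setting looks like the fragment below; the NameNode host, port, and index path are placeholders taken from the comment above, not values to copy verbatim:

```properties
# Hypothetical terrier.properties fragment: write the index directly
# into HDFS so the reduce/merge phase reads from the same filesystem
# the map phase wrote to.
terrier.index.path=hdfs://node1:9000/path/to/index
```

Keeping the whole pipeline on one filesystem avoids the mismatch where output lands in HDFS but the merge looks on the local disk.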
Comment by Craig Macdonald [ 02/Jul/10 ] |
Current v4 patch has the following problem:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.mapred.JobID.compareTo(Lorg/apache/hadoop/mapred/ID;)I
	at org.terrier.applications.HadoopIndexing.deleteTaskFiles(HadoopIndexing.java:369)
	at org.terrier.applications.HadoopIndexing.main(HadoopIndexing.java:227)
	at org.terrier.applications.TrecTerrier.run(TrecTerrier.java:373)
	at org.terrier.applications.TrecTerrier.applyOptions(TrecTerrier.java:573)
	at org.terrier.applications.TrecTerrier.main(TrecTerrier.java:237) |
Comment by Craig Macdonald [ 18/Feb/11 ] |
Tagging for 3.1. |
Comment by Craig Macdonald [ 01/Apr/11 ] |
Current trunk operating nicely on 0.20 |
Comment by Marco Didonna [ 05/Apr/11 ] |
Where can I get "current trunk" ? |
Comment by Craig Macdonald [ 05/Apr/11 ] |
This month! |
Comment by Marco Didonna [ 06/Jun/11 ] |
Ehm... it is taking a little longer. |