[TR-181] Hadoop indexing should not copy hadoop libraries to a job classpath Created: 12/Oct/11  Updated: 05/Mar/14  Resolved: 05/Mar/14

Status: Resolved
Project: Terrier Core
Component/s: .indexing
Affects Version/s: 3.5
Fix Version/s: 3.6

Type: Improvement Priority: Minor
Reporter: Marco Didonna Assignee: Craig Macdonald
Resolution: Fixed  
Labels: None

Attachments: File anyclass.sh     Java Source File HadoopUtility.java    
Issue Links:
Related
is related to TR-205 hadoop jar folder in distribution sho... Resolved

 Description   
Dear Terrier Team,
I've noticed that every time I run an indexing job using Terrier (-H option on the command line), HadoopUtility copies the entire classpath to the job's classpath in order to make the libraries available to all the nodes of the cluster. Some of them (the ones in the hadoop0.20 directory) are already present on each node, since they are part of any Hadoop installation.
I thus modified anyclass.sh and HadoopUtility to upload only the libraries that the indexing job needs: that is, all libraries in the lib folder except those in the lib/hadoop0.20 subfolder. The jars in the latter folder will still be included in the classpath. I added a new property, terrier.hadoopLibDir, which names the directory containing the libraries that do not need to be uploaded to a Hadoop cluster.
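
A minimal sketch of the filtering described above, not the attached HadoopUtility.java itself. It assumes that terrier.hadoopLibDir holds a directory path and that classpath entries are plain path strings; the class name ClasspathFilter, the method jarsToUpload, the default value and the jar file names are all hypothetical and only serve the illustration.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ClasspathFilter {

    /**
     * Returns the classpath entries that are NOT located under hadoopLibDir,
     * i.e. the jars that would still have to be uploaded with the job.
     */
    public static List<String> jarsToUpload(List<String> classpath, String hadoopLibDir) {
        // Ensure a trailing separator so that "lib/hadoop0.20" does not
        // also match a sibling directory such as "lib/hadoop0.20-extra".
        String prefix = hadoopLibDir.endsWith(File.separator)
                ? hadoopLibDir
                : hadoopLibDir + File.separator;

        List<String> upload = new ArrayList<String>();
        for (String entry : classpath) {
            if (!entry.startsWith(prefix)) {
                upload.add(entry);
            }
        }
        return upload;
    }

    public static void main(String[] args) {
        String sep = File.separator;
        // Property name taken from the report; the default value is an assumption.
        String hadoopLibDir = System.getProperty(
                "terrier.hadoopLibDir", "lib" + sep + "hadoop0.20");

        // Hypothetical classpath entries, for illustration only.
        List<String> classpath = Arrays.asList(
                "lib" + sep + "example-indexing.jar",
                "lib" + sep + "hadoop0.20" + sep + "example-hadoop.jar");

        // Only entries outside hadoopLibDir remain in the printed list.
        System.out.println(jarsToUpload(classpath, hadoopLibDir));
    }
}
```

Matching on a directory prefix rather than on a substring keeps the exclusion limited to the one configured folder; the original list is left untouched, so the full classpath remains available locally.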

Just my 2 cents :)

 Comments   
Comment by Craig Macdonald [ 12/Oct/11 ]

Thanks Marco, good catch!

Comment by Craig Macdonald [ 27/Jul/12 ]

Marco,

Do you have a use case where it's not the jar files in lib/hadoop0.20 that you are using?

Cheers,

Craig

Comment by Craig Macdonald [ 27/Jul/12 ]

This and TR-205 should be resolved concurrently.

Comment by Craig Macdonald [ 27/Jul/12 ]

Tagging for 3.6

Comment by Richard McCreadie [ 05/Mar/14 ]

Committed a fix for this issue. It uses the new lib folder structure (/lib/hadoop/, as per TR-205) to find the hadoop jar files, as below, rather than altering the start scripts.

List<String> hadoopJarList = new ArrayList<String>();

// Find all hadoop jar files. We use the structure of the lib folder to determine these.
String separator = ApplicationSetup.FILE_SEPARATOR;
for (String candidateHadoopJar : jarList) {
    if (candidateHadoopJar.contains("lib" + separator + "hadoop" + separator)) {
        //System.err.println("Removing "+candidateHadoopJar+" from classpath");
        hadoopJarList.add(candidateHadoopJar);
    }
}

// Drop the hadoop jars from the list of jars shipped with the job.
jarList.removeAll(hadoopJarList);
