Terrier Core / TR-181

Hadoop indexing should not copy Hadoop libraries to the job classpath

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.5
    • Fix Version/s: 3.6
    • Component/s: .indexing
    • Labels:
      None

      Description

      Dear Terrier Team,
      I've noticed that every time I run an indexing job with Terrier (the -H option on the command line), HadoopUtility copies the entire classpath to the job's classpath in order to make the libraries available to all nodes of the cluster. Some of them (the ones in the hadoop0.20 directory) are already present on each node, since they are part of any Hadoop installation.
      I therefore modified anyclass.sh and HadoopUtility to upload only the libraries that the indexing job needs: all libraries in the lib folder except those in the lib/hadoop0.20 subfolder. The jars in the latter folder will still be included in the classpath. I introduced a new property, terrier.hadoopLibDir, which points to the directory containing the libraries that do not need to be uploaded to a Hadoop cluster.

      Just my 2 cents :)
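
      The filtering described above could look roughly like the sketch below. It is not the attached patch: the class name JobClasspathFilter and the method jarsToUpload are hypothetical, the property is read with System.getProperty purely for illustration (Terrier may resolve it differently), and the default value lib/hadoop0.20 is an assumption based on the folder named in the description.

```java
import java.io.File;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Terrier: picks the classpath entries
// that would still need to be shipped to the cluster.
public class JobClasspathFilter {

    /**
     * Returns the classpath entries that are NOT located under hadoopLibDir.
     * Entries under hadoopLibDir are assumed to exist on every node already.
     */
    static List<String> jarsToUpload(List<String> classpathEntries, String hadoopLibDir) {
        Path excluded = Paths.get(hadoopLibDir).toAbsolutePath().normalize();
        List<String> upload = new ArrayList<>();
        for (String entry : classpathEntries) {
            Path p = Paths.get(entry).toAbsolutePath().normalize();
            // Path.startsWith compares whole path elements, so a sibling
            // directory such as lib/hadoop0.20-extra is not excluded by accident.
            if (!p.startsWith(excluded)) {
                upload.add(entry);
            }
        }
        return upload;
    }

    public static void main(String[] args) {
        // Assumption: the property is exposed as a JVM system property.
        String hadoopLibDir = System.getProperty("terrier.hadoopLibDir", "lib/hadoop0.20");
        List<String> classpath = Arrays.asList(
                System.getProperty("java.class.path").split(File.pathSeparator));
        for (String jar : jarsToUpload(classpath, hadoopLibDir)) {
            System.out.println(jar);
        }
    }
}
```

      Comparing normalized absolute paths rather than raw strings keeps the check independent of how each entry was written on the classpath (relative, absolute, or containing "..").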

              People

              • Assignee:
                craigm Craig Macdonald
              • Reporter:
                noiano Marco Didonna