  1. Terrier Core
  2. TR-136

Hadoop indexing misbehaves when terrier.index.prefix is not "data"

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 3.0
    • Fix Version/s: 3.5
    • Component/s: .indexing
    • Labels:
      None

      Description

      Hadoop MR indexing in Terrier misbehaves slightly when the index prefix is not "data". In particular, indexing completes normally, but it uses the default prefix of "data" rather than the configured one, and MetaIndex reversal fails. As the priority says, trivial.

        Attachments

          Activity

          craigm Craig Macdonald created issue -
          craigm Craig Macdonald added a comment -

          Patch:

          Index: src/core/org/terrier/applications/HadoopIndexing.java
          ===================================================================
          --- src/core/org/terrier/applications/HadoopIndexing.java	(revision 3010)
          +++ src/core/org/terrier/applications/HadoopIndexing.java	(working copy)
          @@ -166,6 +166,7 @@
           			conf.setReducerClass(Hadoop_BasicSinglePassIndexer.class);
           		}
           		FileOutputFormat.setOutputPath(conf, new Path(ApplicationSetup.TERRIER_INDEX_PATH));
          +		conf.set("indexing.hadoop.prefix", ApplicationSetup.TERRIER_INDEX_PREFIX);
           		conf.setMapOutputKeyClass(SplitEmittedTerm.class);
           		conf.setMapOutputValueClass(MapEmittedPostingList.class);
           		conf.setBoolean("indexing.hadoop.multiple.indices", docPartitioned);
          Index: src/core/org/terrier/indexing/hadoop/Hadoop_BasicSinglePassIndexer.java
          ===================================================================
          --- src/core/org/terrier/indexing/hadoop/Hadoop_BasicSinglePassIndexer.java	(revision 2991)
          +++ src/core/org/terrier/indexing/hadoop/Hadoop_BasicSinglePassIndexer.java	(working copy)
          @@ -114,7 +114,7 @@
           	
           	public static void main(String[] args) throws Exception
               {
          -        if (args.length > 0 && args[0].equals("--finish"))
          +        if (args.length == 2 && args[0].equals("--finish"))
                   {
                       final JobFactory jf = HadoopPlugin.getJobFactory("HOD-TerrierIndexing");
                       if (jf == null)
          @@ -157,7 +157,7 @@
           					@Override
           					public void run() {
           						try{
          -							Index index = Index.createIndex(destinationIndexPath, "data-"+id);
          +							Index index = Index.createIndex(destinationIndexPath, ApplicationSetup.TERRIER_INDEX_PREFIX+"-"+id);
           							CompressingMetaIndexBuilder.reverseAsMapReduceJob(index, "meta", reverseMetaKeys, jf);
           							index.close();
           						} catch (Exception e) {
          @@ -460,17 +460,18 @@
           		start = true;
           		//load in the current index
           		final Path indexDestination = FileOutputFormat.getWorkOutputPath(jc);
          +		final String indexDestinationPrefix = jc.get("indexing.hadoop.prefix", "data");
           		reduceId = TaskAttemptID.forName(jc.get("mapred.task.id")).getTaskID().getId();
           		path = indexDestination.toString();
           		mutipleIndices = jc.getBoolean("indexing.hadoop.multiple.indices", true);
           		if (jc.getNumReduceTasks() > 1)
           		{
          -			//gets the reduce number and suffices this to data
          -			prefix = "data-"+reduceId;
          +			//gets the reduce number and suffices this to the index prefix
          +			prefix = indexDestinationPrefix + "-"+reduceId;
           		}
           		else
           		{
          -			prefix = "data";
          +			prefix = indexDestinationPrefix;
           		}
           		
           		currentIndex = Index.createNewIndex(path, prefix);
          @@ -671,9 +672,10 @@
           		currentIndex.setIndexProperty("num.Terms",""+ lexstream.getNumberOfTermsWritten() );
           		currentIndex.setIndexProperty("num.Tokens",""+lexstream.getNumberOfTokensWritten() );
           		currentIndex.setIndexProperty("num.Pointers",""+lexstream.getNumberOfPointersWritten() );
          -		this.finishedInvertedIndexBuild();
           		if (FieldScore.FIELDS_COUNT > 0)
           			currentIndex.addIndexStructure("lexicon-valuefactory", FieldLexiconEntry.Factory.class.getName(), "java.lang.String", "${index.inverted.fields.count}");
          +		this.finishedInvertedIndexBuild();
          +			
           		
           		//the document indices are only merged if we are creating multiple indices
           		//OR if this is the first reducer for a job creating a single index
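
The prefix handling in the patch above can be sketched in isolation as follows. This is a minimal illustration and not Terrier code: the class PrefixSketch, the method reducerPrefix and the use of a plain Map in place of the Hadoop JobConf are assumptions made for the example. Only the "indexing.hadoop.prefix" key, the "data" fallback and the "-"+reduceId suffix come from the patch.

```java
import java.util.HashMap;
import java.util.Map;

public class PrefixSketch {
    // Configuration key set by the driver in the patch.
    static final String PREFIX_KEY = "indexing.hadoop.prefix";

    /**
     * Prefix a reducer would use for the index it writes.
     * The Map is a hypothetical stand-in for the Hadoop JobConf.
     */
    static String reducerPrefix(Map<String, String> conf, int numReduceTasks, int reduceId) {
        // Fall back to "data" when the driver did not set the key, as in the patch.
        String base = conf.getOrDefault(PREFIX_KEY, "data");
        // With several reducers, each one writes its own index: <prefix>-<reduceId>.
        return numReduceTasks > 1 ? base + "-" + reduceId : base;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(PREFIX_KEY, "myindex");
        System.out.println(reducerPrefix(conf, 1, 0));            // myindex
        System.out.println(reducerPrefix(conf, 4, 2));            // myindex-2
        System.out.println(reducerPrefix(new HashMap<>(), 4, 2)); // data-2
    }
}
```

Any later step that reopens these indices, such as the MetaIndex reversal, has to build the name from the same prefix, which is why the patch also replaces the hard-coded "data-"+id there.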
          
          
          craigm Craig Macdonald made changes -
          Field Original Value New Value
          Assignee Craig Macdonald [ craigm ] Richard McCreadie [ richardm ]
          craigm Craig Macdonald added a comment -

          Tagging for 3.1

          craigm Craig Macdonald made changes -
          Fix Version/s 3.1 [ 10021 ]
          craigm Craig Macdonald added a comment -

          Richard tested this manually. No test case.

          craigm Craig Macdonald made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          craigm Craig Macdonald made changes -
          Project TREC [ 10010 ] Terrier Core [ 10000 ]
          Key TREC-177 TR-136
          Workflow jira [ 10424 ] Terrier Open Source [ 10529 ]
          Affects Version/s 3.0 [ 10030 ]
          Affects Version/s 3.0 [ 10020 ]
          Component/s .indexing [ 10002 ]
          Component/s Core [ 10020 ]
          Fix Version/s 3.1 [ 10040 ]
          Fix Version/s 3.1 [ 10021 ]

            People

            • Assignee:
              richardm Richard McCreadie
              Reporter:
              craigm Craig Macdonald
            • Watchers:
              0

              Dates

              • Created:
                Updated:
                Resolved: