org.terrier.structures.indexing.singlepass.hadoop
Class MultiFileCollectionInputFormat

java.lang.Object
  extended by org.apache.hadoop.mapred.FileInputFormat<K,V>
      extended by org.apache.hadoop.mapred.MultiFileInputFormat<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>>
          extended by org.terrier.structures.indexing.singlepass.hadoop.MultiFileCollectionInputFormat
All Implemented Interfaces:
org.apache.hadoop.mapred.InputFormat<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>>

public class MultiFileCollectionInputFormat
extends org.apache.hadoop.mapred.MultiFileInputFormat<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>>

Input format class for Hadoop indexing. Splits the input collection into sets of files such that each Map task receives approximately the same number of files. Files are assumed to be unsplittable and are never divided across splits. Every split contains only adjacent files: split 0 always contains the first file, and the last split always contains the last file.
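The partitioning described above can be illustrated with a minimal, Hadoop-free sketch. It is not the Terrier implementation: the class name ContiguousSplitSketch, the method partition, and the rule of giving any leftover files to the earliest groups are assumptions made for illustration. File names stand in for the paths that the real class would obtain from the job configuration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of Terrier or Hadoop.
public class ContiguousSplitSketch {

    /**
     * Partitions files, in their given order, into at most numSplits
     * contiguous groups whose sizes differ by at most one.
     */
    public static List<List<String>> partition(List<String> files, int numSplits) {
        if (numSplits < 1) {
            throw new IllegalArgumentException("numSplits must be >= 1");
        }
        // Never create more groups than there are files.
        int groups = Math.min(numSplits, files.size());
        List<List<String>> splits = new ArrayList<>();
        if (groups == 0) {
            return splits; // empty collection: no splits
        }
        int base = files.size() / groups;   // minimum files per group
        int extra = files.size() % groups;  // assumption: first 'extra' groups get one more
        int start = 0;
        for (int i = 0; i < groups; i++) {
            int size = base + (i < extra ? 1 : 0);
            // Each group is a contiguous slice, so adjacency is preserved.
            splits.add(new ArrayList<>(files.subList(start, start + size)));
            start += size;
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("a.gz", "b.gz", "c.gz", "d.gz", "e.gz");
        // prints [[a.gz, b.gz, c.gz], [d.gz, e.gz]]
        System.out.println(partition(files, 2));
    }
}
```

Because each group is a contiguous slice of the ordered file list, the first group always starts with the first file and the last group always ends with the last file, matching the guarantee stated above.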

Since:
2.2
Author:
Richard McCreadie and Craig Macdonald

Field Summary
protected static org.apache.log4j.Logger logger
          logger for this class
 
Fields inherited from class org.apache.hadoop.mapred.FileInputFormat
LOG
 
Constructor Summary
MultiFileCollectionInputFormat()
           
 
Method Summary
 org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>> getRecordReader(org.apache.hadoop.mapred.InputSplit genericSplit, org.apache.hadoop.mapred.JobConf job, org.apache.hadoop.mapred.Reporter reporter)
           
 org.apache.hadoop.mapred.InputSplit[] getSplits(org.apache.hadoop.mapred.JobConf job, int numSplits)
           
 
Methods inherited from class org.apache.hadoop.mapred.FileInputFormat
addInputPath, addInputPaths, computeSplitSize, getBlockIndex, getInputPathFilter, getInputPaths, getSplitHosts, isSplitable, listStatus, setInputPathFilter, setInputPaths, setInputPaths, setMinSplitSize
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

logger

protected static final org.apache.log4j.Logger logger
logger for this class

Constructor Detail

MultiFileCollectionInputFormat

public MultiFileCollectionInputFormat()
Method Detail

getRecordReader

public org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>> getRecordReader(org.apache.hadoop.mapred.InputSplit genericSplit,
                                                                                                                    org.apache.hadoop.mapred.JobConf job,
                                                                                                                    org.apache.hadoop.mapred.Reporter reporter)
                                                                                                             throws java.io.IOException
Specified by:
getRecordReader in interface org.apache.hadoop.mapred.InputFormat<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>>
Specified by:
getRecordReader in class org.apache.hadoop.mapred.MultiFileInputFormat<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>>
Throws:
java.io.IOException

getSplits

public org.apache.hadoop.mapred.InputSplit[] getSplits(org.apache.hadoop.mapred.JobConf job,
                                                       int numSplits)
                                                throws java.io.IOException
Specified by:
getSplits in interface org.apache.hadoop.mapred.InputFormat<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>>
Overrides:
getSplits in class org.apache.hadoop.mapred.MultiFileInputFormat<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>>
Throws:
java.io.IOException


Terrier 3.5. Copyright © 2004-2011 University of Glasgow