org.terrier.structures.indexing.singlepass.hadoop
Class FileCollectionRecordReader

java.lang.Object
  extended by org.terrier.structures.indexing.singlepass.hadoop.CollectionRecordReader<PositionAwareSplit<org.apache.hadoop.mapred.lib.CombineFileSplit>>
      extended by org.terrier.structures.indexing.singlepass.hadoop.FileCollectionRecordReader
All Implemented Interfaces:
org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>>

public class FileCollectionRecordReader
extends CollectionRecordReader<PositionAwareSplit<org.apache.hadoop.mapred.lib.CombineFileSplit>>
implements org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>>

Record Reader for Hadoop Indexing. Reads documents from a file; when one document is empty, the next is loaded. Acts as a wrapper around the Terrier Collection class.
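
The "load the next one when the current one is empty" behaviour can be pictured with a small, self-contained sketch. This is not Terrier code: the class ChainedReaderSketch and its methods are hypothetical, and plain string arrays stand in for the files of a split and the documents they contain.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Terrier or Hadoop.
class ChainedReaderSketch {
    private final String[][] files; // each inner array models the documents of one file
    private int fileIndex = 0;      // file currently being read
    private int docIndex = 0;       // next document within the current file

    ChainedReaderSketch(String[][] files) {
        this.files = files;
    }

    // Returns the next document, or null once every file is exhausted.
    String next() {
        // Skip past files that are empty or fully consumed.
        while (fileIndex < files.length && docIndex >= files[fileIndex].length) {
            fileIndex++;
            docIndex = 0;
        }
        if (fileIndex >= files.length) {
            return null;
        }
        return files[fileIndex][docIndex++];
    }

    // Convenience helper for the example: drain all documents in order.
    static List<String> readAll(String[][] files) {
        ChainedReaderSketch reader = new ChainedReaderSketch(files);
        List<String> docs = new ArrayList<>();
        for (String d = reader.next(); d != null; d = reader.next()) {
            docs.add(d);
        }
        return docs;
    }
}
```

Callers see a single stream of documents and never need to know where one file ends and the next begins.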

Since:
2.2
Author:
Richard McCreadie

Field Summary
protected  org.apache.hadoop.io.compress.CompressionCodecFactory compressionCodecs
          factory for accessing compressed files
protected  CountingInputStream inputStream
          the current input stream accessing the underlying (uncompressed) file, used for counting progress.
protected  long length
          length of the file
protected static org.apache.log4j.Logger logger
          The logger used
protected  long start
          where we started in this file
 
Fields inherited from class org.terrier.structures.indexing.singlepass.hadoop.CollectionRecordReader
collectionIndex, config, currentDocument, documentCollection, split
 
Constructor Summary
FileCollectionRecordReader(org.apache.hadoop.mapred.JobConf jobConf, PositionAwareSplit<org.apache.hadoop.mapred.lib.CombineFileSplit> split)
          Constructor
 
Method Summary
 long getPos()
          Gives the position in the raw, uncompressed stream.
 float getProgress()
          Returns the reading progress as a fraction between 0.0 and 1.0.
protected  Collection openCollectionSplit(int index)
          Opens a collection on the next file.
 
Methods inherited from class org.terrier.structures.indexing.singlepass.hadoop.CollectionRecordReader
close, closeCollectionSplit, createKey, createValue, next
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.hadoop.mapred.RecordReader
close, createKey, createValue, next
 

Field Detail

logger

protected static final org.apache.log4j.Logger logger
The logger used


inputStream

protected CountingInputStream inputStream
the current input stream accessing the underlying (uncompressed) file, used for counting progress.


start

protected long start
where we started in this file


length

protected long length
length of the file


compressionCodecs

protected org.apache.hadoop.io.compress.CompressionCodecFactory compressionCodecs
factory for accessing compressed files

Constructor Detail

FileCollectionRecordReader

public FileCollectionRecordReader(org.apache.hadoop.mapred.JobConf jobConf,
                                  PositionAwareSplit<org.apache.hadoop.mapred.lib.CombineFileSplit> split)
                           throws IOException
Constructor

Parameters:
jobConf - Configuration
split - Input Split (multiple Files)
Throws:
IOException
Method Detail

getPos

public long getPos()
            throws IOException
Gives the position in the raw, uncompressed stream.

Specified by:
getPos in interface org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>>
Specified by:
getPos in class CollectionRecordReader<PositionAwareSplit<org.apache.hadoop.mapred.lib.CombineFileSplit>>
Throws:
IOException
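
To illustrate how a stream position of this kind can be tracked, here is a minimal sketch of a byte-counting wrapper built only on java.io. It is not Terrier's CountingInputStream: the class name ByteCountingStream, its getPos method and the posAfterReading helper are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical stand-in; not Terrier's CountingInputStream.
class ByteCountingStream extends FilterInputStream {
    private long count = 0; // bytes consumed from the wrapped stream so far

    ByteCountingStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            count++;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            count += n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        count += skipped;
        return skipped;
    }

    // Position in the wrapped stream, in bytes.
    long getPos() {
        return count;
    }

    // Example helper: request at most n bytes from an in-memory stream
    // and report the position afterwards.
    static long posAfterReading(byte[] data, int n) {
        try (ByteCountingStream s = new ByteCountingStream(new ByteArrayInputStream(data))) {
            s.read(new byte[n], 0, n);
            return s.getPos();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the count is taken on the wrapped stream, any decompressor layered on top of it does not affect the reported position.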

getProgress

public float getProgress()
                  throws IOException
Returns the reading progress as a fraction between 0.0 and 1.0.

Specified by:
getProgress in interface org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>>
Specified by:
getProgress in class CollectionRecordReader<PositionAwareSplit<org.apache.hadoop.mapred.lib.CombineFileSplit>>
Throws:
IOException
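
One plausible way to derive such a fraction for a split that spans several files is sketched below. The formula is an assumption made for illustration and is not taken from the Terrier source; ProgressSketch and its parameter names are hypothetical.

```java
// Hypothetical illustration only; the formula is an assumption, not Terrier's code.
class ProgressSketch {
    // filesCompleted: files of the split already fully read
    // totalFiles:     number of files in the split
    // bytesRead:      bytes consumed from the current file
    // fileLength:     length of the current file in bytes
    static float progress(int filesCompleted, int totalFiles, long bytesRead, long fileLength) {
        if (totalFiles <= 0) {
            return 1.0f; // nothing to read, so report completion
        }
        float fileFraction = 0.0f;
        if (fileLength > 0) {
            // Fraction of the current file consumed, capped at 1.
            fileFraction = Math.min(1.0f, (float) bytesRead / (float) fileLength);
        }
        // Completed files count fully; the current file counts partially.
        return Math.min(1.0f, (filesCompleted + fileFraction) / totalFiles);
    }
}
```

Capping the result at 1.0 keeps the value within the range the RecordReader interface expects, even if the byte count overshoots the recorded file length.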

openCollectionSplit

protected Collection openCollectionSplit(int index)
                                  throws IOException
Opens a collection on the next file.

Specified by:
openCollectionSplit in class CollectionRecordReader<PositionAwareSplit<org.apache.hadoop.mapred.lib.CombineFileSplit>>
Throws:
IOException
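
The compressionCodecs field suggests that a file may be compressed when it is opened. As a rough, self-contained analogy, the sketch below chooses a decompressor from the file name, using java.util.zip in place of Hadoop's CompressionCodecFactory. The class SplitOpenSketch and its methods are hypothetical, and the ".gz" check is an assumption made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration only; not Terrier's openCollectionSplit.
class SplitOpenSketch {
    // Wrap the raw stream in a decompressor when the name ends in ".gz".
    static InputStream open(String fileName, InputStream raw) throws IOException {
        return fileName.endsWith(".gz") ? new GZIPInputStream(raw) : raw;
    }

    // Read the whole (possibly compressed) content as UTF-8 text.
    static String readAll(String fileName, byte[] rawBytes) {
        try (InputStream in = open(fileName, new ByteArrayInputStream(rawBytes))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Example helper: gzip a string in memory to produce test input.
    static byte[] gzip(String text) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }
}
```

In this sketch the code that parses documents reads plain bytes either way, since the choice of decompressor is made once when the file is opened.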


Terrier 3.6. Copyright © 2004-2011 University of Glasgow