org.terrier.structures.indexing.singlepass.hadoop
Class CollectionRecordReader<SPLITTYPE extends PositionAwareSplit<?>>

java.lang.Object
  extended by org.terrier.structures.indexing.singlepass.hadoop.CollectionRecordReader<SPLITTYPE>
Type Parameters:
SPLITTYPE - The subclass of InputSplit that this class should work with
All Implemented Interfaces:
org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>>
Direct Known Subclasses:
FileCollectionRecordReader

public abstract class CollectionRecordReader<SPLITTYPE extends PositionAwareSplit<?>>
extends java.lang.Object
implements org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>>

An abstract RecordReader class which provides methods to read a collection within the Hadoop framework. Note that the collection is split according to a predetermined InputSplit type, which must carry positional information, i.e. the index of the split within the list of all splits.
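As a rough illustration of how a reader with this contract is typically consumed, the sketch below drives a toy reader through the createKey / createValue / next / close cycle. It does not use Hadoop or Terrier classes: SketchRecordReader, InMemoryReader and ReaderDriverSketch are hypothetical stand-ins, and StringBuilder takes the place of the reusable Text key and SplitAwareWrapper<Document> value.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for the part of the RecordReader contract used here. */
interface SketchRecordReader<K, V> {
    K createKey();
    V createValue();
    boolean next(K key, V value);
    void close();
}

/** Toy reader over in-memory strings; keys and values are reusable holders. */
class InMemoryReader implements SketchRecordReader<StringBuilder, StringBuilder> {
    private final Iterator<String> docs;
    private int count = 0;
    boolean closed = false;

    InMemoryReader(List<String> docs) { this.docs = docs.iterator(); }

    public StringBuilder createKey() { return new StringBuilder(); }
    public StringBuilder createValue() { return new StringBuilder(); }

    public boolean next(StringBuilder key, StringBuilder value) {
        if (!docs.hasNext()) {
            return false;               // nothing left in this split
        }
        key.setLength(0);               // overwrite the caller's key in place
        key.append("doc").append(count++);
        value.setLength(0);             // overwrite the caller's value in place
        value.append(docs.next());
        return true;
    }

    public void close() { closed = true; }
}

class ReaderDriverSketch {
    /** Drains a reader and returns the "key=value" pairs it produced. */
    static <K, V> List<String> drain(SketchRecordReader<K, V> reader) {
        List<String> seen = new ArrayList<>();
        K key = reader.createKey();     // one key object, reused for every record
        V value = reader.createValue(); // one value object, reused for every record
        try {
            while (reader.next(key, value)) {
                seen.add(key + "=" + value);
            }
        } finally {
            reader.close();             // always release the underlying collection
        }
        return seen;
    }
}
```

The key and value objects are created once and refilled on each call to next, which is why the sketch copies them into strings before storing them.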

Author:
Craig Macdonald and Richard McCreadie

Field Summary
protected  int collectionIndex
          number of collections obtained thus far by this record reader
protected  org.apache.hadoop.conf.Configuration config
          the configuration of this job
protected  int currentDocument
          the number of documents extracted thus far
protected  Collection documentCollection
          document collection currently being iterated through.
protected  SPLITTYPE split
          the files in this split
 
Constructor Summary
CollectionRecordReader(org.apache.hadoop.mapred.JobConf _jobConf, SPLITTYPE _split)
          Constructs a record reader over the given split.
 
Method Summary
 void close()
          Closes the document collection if it exists
protected  void closeCollectionSplit()
          closes the current collection
 org.apache.hadoop.io.Text createKey()
          Creates a new key; each key is a document number.
 SplitAwareWrapper<Document> createValue()
          Creates a new value; each value is a Document wrapped in a SplitAwareWrapper.
abstract  long getPos()
          Returns the number of bits the RecordReader has accessed, giving the current position in the input data.
abstract  float getProgress()
          Returns the progress of the reading
 boolean next(org.apache.hadoop.io.Text DocID, SplitAwareWrapper<Document> document)
          Moves to the next Document in the Collections accessing this InputSplit, if one exists, setting DocID to the document's "DOCID" property and document to the text within the document.
protected abstract  Collection openCollectionSplit(int index)
          Opens a collection for the index'th part of the current split.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

documentCollection

protected Collection documentCollection
The document collection currently being iterated through. Starts as null.


split

protected SPLITTYPE split
the files in this split


config

protected org.apache.hadoop.conf.Configuration config
the configuration of this job


currentDocument

protected int currentDocument
the number of documents extracted thus far


collectionIndex

protected int collectionIndex
number of collections obtained thus far by this record reader

Constructor Detail

CollectionRecordReader

public CollectionRecordReader(org.apache.hadoop.mapred.JobConf _jobConf,
                              SPLITTYPE _split)
                       throws java.io.IOException
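Constructs a record reader over the given split.

Parameters:
_jobConf - the configuration of this job
_split - the split that this record reader iterates over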
Throws:
java.io.IOException
Method Detail

close

public void close()
           throws java.io.IOException
Closes the document collection if it exists

Specified by:
close in interface org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>>
Throws:
java.io.IOException

createKey

public org.apache.hadoop.io.Text createKey()
Creates a new key; each key is a document number.

Specified by:
createKey in interface org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>>

createValue

public SplitAwareWrapper<Document> createValue()
Creates a new value; each value is a Document wrapped in a SplitAwareWrapper.

Specified by:
createValue in interface org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>>

getPos

public abstract long getPos()
                     throws java.io.IOException
Returns the number of bits the RecordReader has accessed, giving the current position in the input data.

Specified by:
getPos in interface org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>>
Throws:
java.io.IOException

getProgress

public abstract float getProgress()
                           throws java.io.IOException
Returns the progress of the reading

Specified by:
getProgress in interface org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>>
Throws:
java.io.IOException
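
Since getPos and getProgress are both abstract, a subclass has to decide how position maps to progress. The following is only a sketch of one common choice, assuming the subclass knows the total length of its split in the same unit that getPos reports; ProgressSketch and its parameter names are hypothetical and not part of Terrier.

```java
/** Hypothetical helper: converts a position within a split into a progress fraction. */
class ProgressSketch {
    /**
     * @param position how much of the split has been read so far
     * @param length   total size of the split, in the same unit as position
     * @return a fraction between 0.0 and 1.0 inclusive
     */
    static float progress(long position, long length) {
        if (length <= 0) {
            return 0.0f; // assumption: an empty split reports no progress
        }
        float fraction = (float) position / (float) length;
        // Clamp so that overshooting the nominal end never reports more than 100%.
        return Math.max(0.0f, Math.min(1.0f, fraction));
    }
}
```

Clamping matters because a reader may consume slightly past the nominal end of its split, for example to finish the document it is currently in.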

next

public boolean next(org.apache.hadoop.io.Text DocID,
                    SplitAwareWrapper<Document> document)
             throws java.io.IOException
Moves to the next Document in the Collections accessing this InputSplit, if one exists, setting DocID to the document's "DOCID" property and document to the text within the document. Returns true iff there was a document to read, in which case that document is now held in the DocID and document arguments.

Specified by:
next in interface org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>>
Throws:
java.io.IOException
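
The interplay between next, openCollectionSplit, collectionIndex and currentDocument can be pictured roughly as below. This is a guess at the control flow based on the field and method descriptions on this page, not the actual Terrier source: the class names are hypothetical, an Iterator<String> stands in for a Terrier Collection, and partsInSplit is an assumed count of the collections within the split.

```java
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch of iterating documents across several collections in one split. */
abstract class MultiCollectionReaderSketch {
    protected Iterator<String> documentCollection = null; // starts as null
    protected int collectionIndex = 0;   // collections obtained thus far
    protected int currentDocument = 0;   // documents extracted thus far
    private final int partsInSplit;      // assumed: number of parts in the split

    MultiCollectionReaderSketch(int partsInSplit) {
        this.partsInSplit = partsInSplit;
    }

    /** Opens the collection for the index'th part of the split. */
    protected abstract Iterator<String> openCollectionSplit(int index);

    /** Returns the next document text, or null once the whole split is exhausted. */
    String nextDocument() {
        while (true) {
            if (documentCollection != null && documentCollection.hasNext()) {
                currentDocument++;
                return documentCollection.next();
            }
            if (collectionIndex >= partsInSplit) {
                return null;             // no collections left to open
            }
            // Current collection is absent or empty: move on to the next part.
            documentCollection = openCollectionSplit(collectionIndex++);
        }
    }
}

/** Toy subclass whose "collections" are in-memory lists. */
class ListBackedReaderSketch extends MultiCollectionReaderSketch {
    private final List<List<String>> parts;

    ListBackedReaderSketch(List<List<String>> parts) {
        super(parts.size());
        this.parts = parts;
    }

    @Override
    protected Iterator<String> openCollectionSplit(int index) {
        return parts.get(index).iterator();
    }
}
```

The loop skips over empty parts rather than stopping at them, so a split whose middle collection holds no documents still yields everything that follows it.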

openCollectionSplit

protected abstract Collection openCollectionSplit(int index)
                                           throws java.io.IOException
Opens a collection for the index'th part of the current split.

Throws:
java.io.IOException

closeCollectionSplit

protected void closeCollectionSplit()
                             throws java.io.IOException
closes the current collection

Throws:
java.io.IOException


Terrier 3.5. Copyright © 2004-2011 University of Glasgow