CollectionRecordReader (Terrier 4.0 API)

java.lang.Object
- org.terrier.structures.indexing.singlepass.hadoop.CollectionRecordReader<SPLITTYPE>

Type Parameters:
SPLITTYPE - The subclass of InputSplit that this class should work with

All Implemented Interfaces:

org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>>

Direct Known Subclasses:

FileCollectionRecordReader
```
public abstract class CollectionRecordReader<SPLITTYPE extends PositionAwareSplit<?>>
extends Object
implements org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>>
```
An abstract RecordReader class which provides methods to read a collection within the Hadoop framework. Note that the collection will be split based on a predetermined InputSplit type which must contain positional information, i.e. which split it is in the list of all splits.
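
The idea of positional information can be sketched in a few lines. The example below is only an illustration under stated assumptions: `IndexedSplit` and `IndexedSplitSketch` are hypothetical stand-ins written for this page, not the Terrier `PositionAwareSplit` class or any Hadoop type. It shows a wrapper that remembers where each raw split sits in the list of all splits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a position-aware split (NOT the Terrier class):
// it wraps a payload and records its index in the list of all splits.
final class IndexedSplit<T> {
    private final T payload;
    private final int splitIndex;

    IndexedSplit(T payload, int splitIndex) {
        this.payload = payload;
        this.splitIndex = splitIndex;
    }

    T getPayload() { return payload; }

    int getSplitIndex() { return splitIndex; }
}

final class IndexedSplitSketch {
    // Wraps every raw split, assigning indices in list order starting at 0.
    static <T> List<IndexedSplit<T>> wrapAll(List<T> rawSplits) {
        List<IndexedSplit<T>> wrapped = new ArrayList<>();
        for (int i = 0; i < rawSplits.size(); i++) {
            wrapped.add(new IndexedSplit<>(rawSplits.get(i), i));
        }
        return wrapped;
    }

    public static void main(String[] args) {
        List<IndexedSplit<String>> splits = wrapAll(Arrays.asList("part-a", "part-b", "part-c"));
        for (IndexedSplit<String> s : splits) {
            System.out.println(s.getSplitIndex() + " -> " + s.getPayload());
        }
    }
}
```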

Author:

Craig Macdonald and Richard McCreadie

Field Summary

Fields
Modifier and Type	Field and Description
`protected int`	`collectionIndex` number of collections obtained thus far by this record reader
`protected org.apache.hadoop.conf.Configuration`	`config` the configuration of this job
`protected int`	`currentDocument` the number of documents extracted thus far
`protected Collection`	`documentCollection` document collection currently being iterated through.
`protected SPLITTYPE`	`split` the files in this split

Constructor Summary

Constructors
Constructor and Description

`CollectionRecordReader(org.apache.hadoop.mapred.JobConf _jobConf, SPLITTYPE _split)` constructor

Method Summary

Methods
Modifier and Type	Method and Description
`void`	`close()` Closes the document collection if it exists
`protected void`	`closeCollectionSplit()` closes the current collection
`org.apache.hadoop.io.Text`	`createKey()` Creates a new key; each key is a document number
`SplitAwareWrapper<Document>`	`createValue()` Creates a new value; each value wraps a document
`abstract long`	`getPos()` Returns the number of bytes the record reader has accessed, thereby giving the position in the input data.
`abstract float`	`getProgress()` Returns the progress of the reading
`boolean`	`next(org.apache.hadoop.io.Text DocID, SplitAwareWrapper<Document> document)` Moves to the next Document in the Collections accessing this InputSplit if one exists, setting DocID to the property "DOCID" and Document to the text within the document.
`protected abstract Collection`	`openCollectionSplit(int index)` opens a collection for the index'th part of the current split

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
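
The methods above follow the usual calling pattern of a record reader: the caller creates one key and one value object, passes them to `next` repeatedly until it returns false, then calls `close`. The sketch below illustrates that pattern only. `SketchRecordReader`, `InMemoryReader`, and `ReaderLoopSketch` are hypothetical stand-ins invented for this example, with `StringBuilder` taking the place of the reusable key and value types; none of them are Hadoop or Terrier classes.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical interface that mirrors the shape of a record reader;
// it is NOT org.apache.hadoop.mapred.RecordReader.
interface SketchRecordReader<K, V> {
    K createKey();
    V createValue();
    boolean next(K key, V value);
    void close();
}

// Hypothetical reader over in-memory {docno, text} pairs.
final class InMemoryReader implements SketchRecordReader<StringBuilder, StringBuilder> {
    private final Iterator<String[]> docs;

    InMemoryReader(List<String[]> docs) {
        this.docs = docs.iterator();
    }

    public StringBuilder createKey() { return new StringBuilder(); }

    public StringBuilder createValue() { return new StringBuilder(); }

    // Overwrites the caller's key and value in place; returns false when exhausted.
    public boolean next(StringBuilder key, StringBuilder value) {
        if (!docs.hasNext()) {
            return false;
        }
        String[] doc = docs.next();
        key.setLength(0);
        key.append(doc[0]);
        value.setLength(0);
        value.append(doc[1]);
        return true;
    }

    public void close() { }
}

final class ReaderLoopSketch {
    // Reads every record, reusing a single key object and a single value object.
    static List<String> drain(SketchRecordReader<StringBuilder, StringBuilder> reader) {
        List<String> seen = new ArrayList<>();
        StringBuilder key = reader.createKey();
        StringBuilder value = reader.createValue();
        try {
            while (reader.next(key, value)) {
                seen.add(key + "=" + value);
            }
        } finally {
            reader.close();
        }
        return seen;
    }

    public static void main(String[] args) {
        List<String[]> docs = new ArrayList<>();
        docs.add(new String[] {"doc1", "first text"});
        docs.add(new String[] {"doc2", "second text"});
        System.out.println(drain(new InMemoryReader(docs)));
    }
}
```

Because the key and value objects are reused between calls, a caller that needs to keep a record must copy it before calling `next` again, as `drain` does by building a new string.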

- Field Detail
  - documentCollection
```
protected Collection documentCollection
```
    The document collection currently being iterated through. Starts as null.
  - split
```
protected SPLITTYPE split
```
    the files in this split
  - config
```
protected org.apache.hadoop.conf.Configuration config
```
    the configuration of this job
  - currentDocument
```
protected int currentDocument
```
    the number of documents extracted thus far
  - collectionIndex
```
protected int collectionIndex
```
    number of collections obtained thus far by this record reader
- Constructor Detail
  - CollectionRecordReader
```
public CollectionRecordReader(org.apache.hadoop.mapred.JobConf _jobConf,
                      SPLITTYPE _split)
                       throws IOException
```
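    Constructs a record reader for the given split.
    
    Parameters:
    _jobConf - the configuration of this job
    _split - the split this record reader works on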
    
    Throws:
    
    IOException
- Method Detail
  - close
```
public void close()
           throws IOException
```
    Closes the document collection if it exists
    
    Specified by:
    
    close in interface org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>>
    
    Throws:
    
    IOException
  - createKey
```
public org.apache.hadoop.io.Text createKey()
```
    Creates a new key; each key is a document number
    
    Specified by:
    
    createKey in interface org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>>
  - createValue
```
public SplitAwareWrapper<Document> createValue()
```
    Creates a new value; each value wraps a document
    
    Specified by:
    
    createValue in interface org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>>
  - getPos
```
public abstract long getPos()
                     throws IOException
```
    Returns the number of bytes the record reader has accessed, thereby giving the position in the input data.
    
    Specified by:
    
    getPos in interface org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>>
    
    Throws:
    
    IOException
  - getProgress
```
public abstract float getProgress()
                           throws IOException
```
    Returns the progress of the reading
    
    Specified by:
    
    getProgress in interface org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>>
    
    Throws:
    
    IOException
  - next
```
public boolean next(org.apache.hadoop.io.Text DocID,
           SplitAwareWrapper<Document> document)
             throws IOException
```
    Moves to the next Document in the Collections accessing this InputSplit if one exists, setting DocID to the property "DOCID" and Document to the text within the document. Returns true iff there was indeed a document to read, and hence this document is now in the DocID and document arguments.
    
    Specified by:
    
    next in interface org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>>
    
    Throws:
    
    IOException
  - openCollectionSplit
```
protected abstract Collection openCollectionSplit(int index)
                                           throws IOException
```
    Opens a collection for the index'th part of the current split
    
    Throws:
    
    IOException
  - closeCollectionSplit
```
protected void closeCollectionSplit()
                             throws IOException
```
    closes the current collection
    
    Throws:
    
    IOException
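
One way the abstract and protected methods above could cooperate is sketched below. This is an assumption-laden illustration, not the Terrier source: `MultiPartReaderSketch`, `ListBackedReader`, `MultiPartDemo`, `nextDocument`, and `partCount` are hypothetical names, the collection is reduced to an `Iterator<String>`, and progress is measured as the fraction of parts opened, whereas a real reader may well report progress from `getPos()` instead. The field names follow the Field Detail section.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical skeleton (NOT the Terrier implementation) of a reader
// whose split consists of several parts, each opened as its own collection.
abstract class MultiPartReaderSketch {
    protected Iterator<String> documentCollection = null; // current part, starts as null
    protected int collectionIndex = 0;                    // parts opened so far
    protected int currentDocument = 0;                    // documents returned so far
    private final int partCount;                          // assumed known up front

    MultiPartReaderSketch(int partCount) {
        this.partCount = partCount;
    }

    // Opens the index'th part of the current split.
    protected abstract Iterator<String> openCollectionSplit(int index);

    // Releases the current part.
    protected void closeCollectionSplit() {
        documentCollection = null;
    }

    // Returns the next document, or null once every part is exhausted.
    String nextDocument() {
        while (true) {
            if (documentCollection != null && documentCollection.hasNext()) {
                currentDocument++;
                return documentCollection.next();
            }
            if (documentCollection != null) {
                closeCollectionSplit(); // current part is finished
            }
            if (collectionIndex >= partCount) {
                return null; // nothing left to open
            }
            documentCollection = openCollectionSplit(collectionIndex++);
        }
    }

    // Coarse progress: fraction of parts opened so far, clamped to [0, 1].
    float getProgress() {
        if (partCount == 0) {
            return 1.0f;
        }
        return Math.min(1.0f, (float) collectionIndex / partCount);
    }
}

// Hypothetical subclass backed by in-memory lists, one list per part.
final class ListBackedReader extends MultiPartReaderSketch {
    private final List<List<String>> parts;

    ListBackedReader(List<List<String>> parts) {
        super(parts.size());
        this.parts = parts;
    }

    @Override
    protected Iterator<String> openCollectionSplit(int index) {
        return parts.get(index).iterator();
    }
}

final class MultiPartDemo {
    // Collects every document from the reader, in order.
    static List<String> readAll(MultiPartReaderSketch reader) {
        List<String> out = new ArrayList<>();
        for (String doc = reader.nextDocument(); doc != null; doc = reader.nextDocument()) {
            out.add(doc);
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> parts = Arrays.asList(
                Arrays.asList("a", "b"),
                Collections.<String>emptyList(),
                Arrays.asList("c"));
        System.out.println(readAll(new ListBackedReader(parts)));
    }
}
```

The loop in `nextDocument` skips empty parts without special handling, which is the main reason for writing it as a loop rather than a single check.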
