CollectionRecordReader (Terrier 3.5 API)

org.terrier.structures.indexing.singlepass.hadoop
Class CollectionRecordReader<SPLITTYPE extends PositionAwareSplit<?>>

java.lang.Object
  org.terrier.structures.indexing.singlepass.hadoop.CollectionRecordReader<SPLITTYPE>


Type Parameters: SPLITTYPE - the subclass of InputSplit that this class should work with

All Implemented Interfaces: org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>>

Direct Known Subclasses: FileCollectionRecordReader

public abstract class CollectionRecordReader<SPLITTYPE extends PositionAwareSplit<?>>
extends java.lang.Object
implements org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>>

An abstract RecordReader that provides methods for reading a collection within the Hadoop framework. The collection is split according to a predetermined InputSplit type, which must carry positional information, i.e. the index of the split within the list of all splits.
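The idea of a split that knows its own position can be illustrated with a minimal, self-contained sketch. The `IndexedSplit` class and its `wrapAll` helper below are hypothetical stand-ins written for this illustration; they are not Terrier's PositionAwareSplit or any Hadoop type, and the real classes may be structured differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a position-aware split; not Terrier's PositionAwareSplit.
final class IndexedSplit<T> {
    private final T wrapped;   // the underlying split (any payload in this sketch)
    private final int index;   // position of this split in the list of all splits

    IndexedSplit(T wrapped, int index) {
        this.wrapped = wrapped;
        this.index = index;
    }

    T getWrapped() { return wrapped; }

    int getIndex() { return index; }

    // Wraps each split together with its position in the overall list.
    static <T> List<IndexedSplit<T>> wrapAll(List<T> splits) {
        List<IndexedSplit<T>> out = new ArrayList<>();
        for (int i = 0; i < splits.size(); i++) {
            out.add(new IndexedSplit<>(splits.get(i), i));
        }
        return out;
    }

    public static void main(String[] args) {
        List<IndexedSplit<String>> splits =
                wrapAll(Arrays.asList("part-a", "part-b", "part-c"));
        for (IndexedSplit<String> s : splits) {
            System.out.println(s.getIndex() + " -> " + s.getWrapped());
        }
    }
}
```

A reader handed such a split can tell which portion of the collection it is responsible for without consulting the other splits.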

Author: Craig Macdonald and Richard McCreadie

Field Summary
`protected int`	`collectionIndex` number of collections obtained thus far by this record reader
`protected org.apache.hadoop.conf.Configuration`	`config` the configuration of this job
`protected int`	`currentDocument` the number of documents extracted thus far
`protected Collection`	`documentCollection` document collection currently being iterated through.
`protected SPLITTYPE`	`split` the files in this split

Constructor Summary
`CollectionRecordReader(org.apache.hadoop.mapred.JobConf _jobConf, SPLITTYPE _split)` constructor

Method Summary
`void`	`close()` Closes the document collection if it exists
`protected void`	`closeCollectionSplit()` Closes the current collection
`org.apache.hadoop.io.Text`	`createKey()` Creates a new key; each key is a document number
`SplitAwareWrapper<Document>`	`createValue()` Creates a new value; each value is a Document wrapped in a SplitAwareWrapper
`abstract long`	`getPos()` Returns the number of bits the record reader has accessed, thereby giving the position in the input data.
`abstract float`	`getProgress()` Returns the progress of the reading
`boolean`	`next(org.apache.hadoop.io.Text DocID, SplitAwareWrapper<Document> document)` Moves to the next Document in the Collections accessed through this InputSplit, if one exists, setting DocID to the document's "DOCID" property and document to the document's content.
`protected abstract Collection`	`openCollectionSplit(int index)` Opens a collection for the index-th part of the current split
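How next(), openCollectionSplit(int) and the collectionIndex and currentDocument counters might fit together can be sketched with a simplified model. This is an illustration under stated assumptions, not Terrier's implementation: `MultiCollectionReader` and `ListBackedReader` are hypothetical classes, a plain `Iterator<String>` stands in for Collection, and the sketch assumes that a null return from openCollectionSplit means the split has no further parts.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Simplified, hypothetical model of a reader that walks several collections
// in one split. None of these types are Terrier or Hadoop classes.
abstract class MultiCollectionReader {
    protected Iterator<String> documentCollection; // current collection, starts as null
    protected int collectionIndex = 0;             // collections obtained so far
    protected int currentDocument = 0;             // documents extracted so far

    // Opens the index-th part of the split; null means no more parts (assumption).
    protected abstract Iterator<String> openCollectionSplit(int index);

    // Returns the next document, or null once every collection is exhausted.
    String next() {
        while (true) {
            if (documentCollection == null) {
                documentCollection = openCollectionSplit(collectionIndex);
                if (documentCollection == null) {
                    return null; // no further parts in this split
                }
                collectionIndex++;
            }
            if (documentCollection.hasNext()) {
                currentDocument++;
                return documentCollection.next();
            }
            documentCollection = null; // current collection finished; try the next part
        }
    }
}

// Example subclass backed by in-memory lists instead of files.
class ListBackedReader extends MultiCollectionReader {
    private final List<List<String>> parts;

    ListBackedReader(List<List<String>> parts) {
        this.parts = parts;
    }

    @Override
    protected Iterator<String> openCollectionSplit(int index) {
        return index < parts.size() ? parts.get(index).iterator() : null;
    }

    public static void main(String[] args) {
        ListBackedReader reader = new ListBackedReader(Arrays.asList(
                Arrays.asList("doc1", "doc2"),
                Arrays.asList("doc3")));
        for (String doc = reader.next(); doc != null; doc = reader.next()) {
            System.out.println(doc);
        }
    }
}
```

Keeping the open step abstract lets a subclass decide where each part of the split comes from, while the iteration and counting logic stays in one place.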

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

documentCollection

protected Collection documentCollection

Document collection currently being iterated through. Starts as null.

split

protected SPLITTYPE split

the files in this split

config

protected org.apache.hadoop.conf.Configuration config

the configuration of this job

currentDocument

protected int currentDocument

the number of documents extracted thus far

collectionIndex

protected int collectionIndex

number of collections obtained thus far by this record reader

Constructor Detail