org.terrier.indexing
Class PDFDocument

java.lang.Object
  extended by org.terrier.indexing.FileDocument
      extended by org.terrier.indexing.PDFDocument
All Implemented Interfaces:
Document

public class PDFDocument
extends FileDocument

Implements a Document object for reading PDF documents, using Apache PDFBox.

Author:
Craig Macdonald

Nested Class Summary
 
Nested classes/interfaces inherited from class org.terrier.indexing.FileDocument
FileDocument.ReaderWrapper
 
Field Summary
protected static org.apache.log4j.Logger logger
           
 
Fields inherited from class org.terrier.indexing.FileDocument
abstractlength, abstractname, abstractwritten, br, EOD, filename, fileProperties, tokenStream
 
Constructor Summary
PDFDocument(InputStream docStream, Map<String,String> docProperties, Tokeniser tok)
          Constructs a new PDFDocument
PDFDocument(Reader docReader, Map<String,String> docProperties, Tokeniser tok)
          Constructs a new PDFDocument
PDFDocument(String filename, InputStream docStream, Tokeniser tokeniser)
          Constructs a new PDFDocument that converts docStream, the stream representing the file, into a Document object from which an Indexer can retrieve a stream of terms.
PDFDocument(String filename, Reader docReader, Tokeniser tok)
          Constructs a new PDFDocument
 
Method Summary
protected  Reader getReader(InputStream is)
          Returns a reader of text suitable for parsing terms, created by converting the file represented by the parameter is.
 
Methods inherited from class org.terrier.indexing.FileDocument
endOfDocument, getAllProperties, getFields, getNextTerm, getProperty, getReader, makeFilenameProperties, setProperty
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

logger

protected static final org.apache.log4j.Logger logger
Constructor Detail

PDFDocument

public PDFDocument(String filename,
                   InputStream docStream,
                   Tokeniser tokeniser)
Constructs a new PDFDocument that converts docStream, the stream representing the file, into a Document object from which an Indexer can retrieve a stream of terms.

Parameters:
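filename - the name of the file that the document is read from.
docStream - the input stream that represents the document's file.
tokeniser - the Tokeniser used to split the document's text into terms.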
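
The "stream of terms" that an Indexer retrieves can be illustrated with a minimal, self-contained sketch. It does not use Terrier classes: TermStreamSketch is a hypothetical stand-in that borrows only the method names getNextTerm and endOfDocument from the inherited method list, and its splitting rule (lower-cased runs of letters and digits) is an assumption made for illustration, not the behaviour of any Terrier Tokeniser.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a document that yields terms; not Terrier code. */
public class TermStreamSketch {
    private final Reader reader;
    private boolean EOD = false; // set once the underlying reader is exhausted

    public TermStreamSketch(Reader reader) {
        this.reader = reader;
    }

    /** Returns the next lower-cased term, or null if no term is left. */
    public String getNextTerm() {
        StringBuilder sb = new StringBuilder();
        try {
            int c;
            while ((c = reader.read()) != -1) {
                if (Character.isLetterOrDigit(c)) {
                    sb.append(Character.toLowerCase((char) c));
                } else if (sb.length() > 0) {
                    return sb.toString(); // a separator ends the current term
                }
            }
        } catch (IOException e) {
            // Assumption: a read failure is treated like the end of the document.
        }
        EOD = true;
        return sb.length() > 0 ? sb.toString() : null;
    }

    public boolean endOfDocument() {
        return EOD;
    }

    /** The pull loop an indexer might run; null terms are skipped. */
    public static List<String> collectTerms(TermStreamSketch doc) {
        List<String> terms = new ArrayList<>();
        while (!doc.endOfDocument()) {
            String term = doc.getNextTerm();
            if (term != null) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        TermStreamSketch doc = new TermStreamSketch(new StringReader("Hello, PDF world"));
        System.out.println(collectTerms(doc));
    }
}
```

The loop checks endOfDocument before each call and tolerates a null term, so a document that yields nothing simply produces an empty list.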

PDFDocument

public PDFDocument(InputStream docStream,
                   Map<String,String> docProperties,
                   Tokeniser tok)
Constructs a new PDFDocument

Parameters:
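docStream - the input stream that represents the document's file.
docProperties - the properties of the document.
tok - the Tokeniser used to split the document's text into terms.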

PDFDocument

public PDFDocument(Reader docReader,
                   Map<String,String> docProperties,
                   Tokeniser tok)
Constructs a new PDFDocument

Parameters:
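docReader - the reader that supplies the document's text.
docProperties - the properties of the document.
tok - the Tokeniser used to split the document's text into terms.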

PDFDocument

public PDFDocument(String filename,
                   Reader docReader,
                   Tokeniser tok)
Constructs a new PDFDocument

Parameters:
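filename - the name of the file that the document is read from.
docReader - the reader that supplies the document's text.
tok - the Tokeniser used to split the document's text into terms.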
Method Detail

getReader

protected Reader getReader(InputStream is)
Returns a reader of text suitable for parsing terms, created by converting the file represented by the parameter is. The stream is run through the PDFParser and related classes provided by the Apache PDFBox library. On error, the method returns null and sets EOD to true, so no terms can be read from this document.

Overrides:
getReader in class FileDocument
Parameters:
is - the input stream that represents the document's file.
Returns:
a reader that is fed to an indexer.
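
The error contract described above (return null and set EOD to true) can be sketched without any PDF library. In the example below, ReaderOnErrorSketch and its TextExtractor interface are hypothetical names introduced for illustration; the extractor stands in for whatever PDF-to-text conversion is used, and the class is not Terrier's actual implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;

/** Hypothetical stand-in showing the null-on-error contract; not Terrier code. */
public class ReaderOnErrorSketch {

    /** Hypothetical converter from a binary stream to plain text. */
    public interface TextExtractor {
        String extract(InputStream is) throws IOException;
    }

    /** Mirrors the role of the EOD field inherited from FileDocument. */
    protected boolean EOD = false;

    private final TextExtractor extractor;

    public ReaderOnErrorSketch(TextExtractor extractor) {
        this.extractor = extractor;
    }

    /** Returns a reader over the extracted text, or null if conversion fails. */
    protected Reader getReader(InputStream is) {
        try {
            return new StringReader(extractor.extract(is));
        } catch (Exception e) {
            // On any conversion error, mark the document as finished so that
            // callers checking endOfDocument() never try to read terms.
            EOD = true;
            return null;
        }
    }

    public boolean endOfDocument() {
        return EOD;
    }

    public static void main(String[] args) {
        InputStream empty = new ByteArrayInputStream(new byte[0]);
        ReaderOnErrorSketch bad = new ReaderOnErrorSketch(is -> {
            throw new IOException("corrupt file");
        });
        System.out.println(bad.getReader(empty) == null); // reader is null on failure
        System.out.println(bad.endOfDocument());
    }
}
```

Because the failure is signalled through EOD rather than an exception, a caller should check endOfDocument() before asking for terms instead of dereferencing the returned reader directly.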


Terrier 3.6. Copyright © 2004-2011 University of Glasgow