java.lang.Object
- org.terrier.indexing.FileDocument
- - org.terrier.indexing.PDFDocument

All Implemented Interfaces:

Document
```
public class PDFDocument
extends FileDocument
```
Implements a Document object for reading PDF documents, using Apache PDFBox.
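As a rough illustration of how an indexing loop might consume the terms of such a document, the sketch below uses only the JDK. `TermSource`, `ListTermSource` and `collectTerms` are hypothetical stand-ins, not Terrier classes; they mirror only the `getNextTerm` and `endOfDocument` methods named in the inherited-methods list further down. Skipping `null` terms is a defensive assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class TermLoopSketch {

    /** Hypothetical stand-in exposing the two methods the loop relies on. */
    interface TermSource {
        String getNextTerm();
        boolean endOfDocument();
    }

    /** Toy source that serves terms from a fixed list; entries may be null. */
    static final class ListTermSource implements TermSource {
        private final Iterator<String> terms;

        ListTermSource(List<String> terms) {
            this.terms = terms.iterator();
        }

        @Override
        public String getNextTerm() {
            return terms.hasNext() ? terms.next() : null;
        }

        @Override
        public boolean endOfDocument() {
            return !terms.hasNext();
        }
    }

    /** Drains a source until endOfDocument() is true, ignoring null terms. */
    static List<String> collectTerms(TermSource doc) {
        List<String> out = new ArrayList<>();
        while (!doc.endOfDocument()) {
            String term = doc.getNextTerm();
            if (term != null) {
                out.add(term);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        TermSource doc = new ListTermSource(
                Arrays.asList("portable", null, "document", "format"));
        System.out.println(collectTerms(doc));
    }
}
```

With a real PDFDocument, the same loop shape would presumably apply, with the document constructed from an input stream and a Tokeniser as described in the constructor summary.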

Author:

Craig Macdonald

Nested Class Summary
- Nested classes/interfaces inherited from class org.terrier.indexing.FileDocument
  FileDocument.ReaderWrapper

Field Summary

Fields
Modifier and Type	Field	Description
`protected static org.slf4j.Logger`	`logger`	
- Fields inherited from class org.terrier.indexing.FileDocument
  abstractlength, abstractname, abstractwritten, br, EOD, filename, fileProperties, tokenStream

Constructor Summary

Constructors
Constructor	Description
`PDFDocument(java.io.InputStream docStream, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser tok)`	Constructs a new PDFDocument
`PDFDocument(java.io.Reader docReader, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser tok)`	Constructs a new PDFDocument
`PDFDocument(java.lang.String filename, java.io.InputStream docStream, Tokeniser tokeniser)`	Constructs a new PDFDocument that converts docStream, the stream representing the file, into a Document from which an Indexer can retrieve a stream of terms.
`PDFDocument(java.lang.String filename, java.io.Reader docReader, Tokeniser tok)`	Constructs a new PDFDocument

Method Summary

All Methods Instance Methods Concrete Methods
Modifier and Type	Method	Description
`protected java.io.Reader`	`getReader(java.io.InputStream is)`	Returns a reader over the text extracted from the file represented by the parameter is; the reader is suitable for parsing terms out of.

Methods inherited from class org.terrier.indexing.FileDocument
endOfDocument, getAllProperties, getFields, getNextTerm, getProperty, getReader, makeFilenameProperties, setProperty

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - logger
```
protected static final org.slf4j.Logger logger
```
- Constructor Detail
  - PDFDocument
```
public PDFDocument(java.lang.String filename,
                   java.io.InputStream docStream,
                   Tokeniser tokeniser)
```
    Constructs a new PDFDocument that converts docStream, the stream representing the file, into a Document from which an Indexer can retrieve a stream of terms.
    
    Parameters:
    
    docStream - the input stream that represents the document's file.
  - PDFDocument
```
public PDFDocument(java.io.InputStream docStream,
                   java.util.Map<java.lang.String,java.lang.String> docProperties,
                   Tokeniser tok)
```
    Constructs a new PDFDocument
    
    Parameters:
    
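    docStream - the input stream that represents the document's file.
    
    docProperties - the properties of the document.
    
    tok - the tokeniser used to split the document's text into terms.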
  - PDFDocument
```
public PDFDocument(java.io.Reader docReader,
                   java.util.Map<java.lang.String,java.lang.String> docProperties,
                   Tokeniser tok)
```
    Constructs a new PDFDocument
    
    Parameters:
    
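    docReader - a reader over the document's content.
    
    docProperties - the properties of the document.
    
    tok - the tokeniser used to split the document's text into terms.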
  - PDFDocument
```
public PDFDocument(java.lang.String filename,
                   java.io.Reader docReader,
                   Tokeniser tok)
```
    Constructs a new PDFDocument
    
    Parameters:
    
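    filename - the name of the document's file.
    
    docReader - a reader over the document's content.
    
    tok - the tokeniser used to split the document's text into terms.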
- Method Detail
  - getReader
```
protected java.io.Reader getReader(java.io.InputStream is)
```
    Returns a reader over the text extracted from the file represented by the parameter is; the reader is suitable for parsing terms out of. The stream is run through the PDFParser and related classes provided by the PDFBox library. On error, the method returns null and sets EOD to true, so no terms can be read from this document.
    
    Overrides:
    
    getReader in class FileDocument
    
    Parameters:
    
    is - the input stream that represents the document's file.
    
    Returns:
    
    a Reader over the extracted text, which is fed to an indexer, or null on error.
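
The error contract described for getReader (return null and set EOD to true) can be sketched with the JDK alone. Everything here is hypothetical: `SketchDocument` is a stand-in, not FileDocument, and `extractText` is a placeholder that only checks for the `%PDF-` file header instead of calling PDFBox. The `EOD` and `br` names are borrowed from the inherited-fields list purely for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;

class ReaderContractSketch {

    /** Hypothetical stand-in for a document whose text comes from a converted stream. */
    static class SketchDocument {
        protected boolean EOD = false;  // set to true when conversion fails
        protected final Reader br;      // null when conversion fails

        SketchDocument(InputStream is) {
            this.br = getReader(is);
        }

        /** Converts the stream to text; on failure returns null and sets EOD. */
        protected Reader getReader(InputStream is) {
            try {
                return new StringReader(extractText(is));
            } catch (IOException e) {
                EOD = true;
                return null;
            }
        }

        public boolean endOfDocument() {
            return EOD;
        }
    }

    /** Placeholder for the real PDF-to-text step; requires Java 9+ for readAllBytes. */
    static String extractText(InputStream is) throws IOException {
        String raw = new String(is.readAllBytes(), StandardCharsets.ISO_8859_1);
        if (!raw.startsWith("%PDF-")) {
            throw new IOException("not a PDF stream");
        }
        return raw;
    }

    public static void main(String[] args) {
        InputStream bad = new ByteArrayInputStream(
                "plain text".getBytes(StandardCharsets.ISO_8859_1));
        SketchDocument doc = new SketchDocument(bad);
        System.out.println(doc.br == null);       // true: conversion failed
        System.out.println(doc.endOfDocument());  // true: no terms can be read
    }
}
```

The point of the pattern is that a caller never sees an exception from a malformed file: the document simply reports end-of-document immediately, so an indexing loop moves on to the next file.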
