Package org.terrier.indexing
Class PDFDocument
java.lang.Object
  org.terrier.indexing.FileDocument
    org.terrier.indexing.PDFDocument
All Implemented Interfaces:
Document
public class PDFDocument extends FileDocument
Implements a Document object for reading PDF documents, using Apache PDFBox.
Author:
Craig Macdonald
Nested Class Summary
Nested classes/interfaces inherited from class org.terrier.indexing.FileDocument
FileDocument.ReaderWrapper
Field Summary
Fields
protected static org.slf4j.Logger logger
Fields inherited from class org.terrier.indexing.FileDocument
abstractlength, abstractname, abstractwritten, br, EOD, filename, fileProperties, tokenStream
Constructor Summary
Constructors
PDFDocument(java.io.InputStream docStream, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser tok)
Constructs a new PDFDocument.
PDFDocument(java.io.Reader docReader, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser tok)
Constructs a new PDFDocument.
PDFDocument(java.lang.String filename, java.io.InputStream docStream, Tokeniser tokeniser)
Constructs a new PDFDocument, which will convert the docStream which represents the file to a Document object from which an Indexer can retrieve a stream of terms.
PDFDocument(java.lang.String filename, java.io.Reader docReader, Tokeniser tok)
Constructs a new PDFDocument.
Method Summary
protected java.io.Reader getReader(java.io.InputStream is)
Returns the reader of text, which is suitable for parsing terms out of, and which is created by converting the file represented by the input stream passed as the parameter named is.
Methods inherited from class org.terrier.indexing.FileDocument
endOfDocument, getAllProperties, getFields, getNextTerm, getProperty, getReader, makeFilenameProperties, setProperty
Constructor Detail
PDFDocument
public PDFDocument(java.lang.String filename, java.io.InputStream docStream, Tokeniser tokeniser)
Constructs a new PDFDocument, which converts docStream, the stream that represents the file, into a Document object from which an Indexer can retrieve a stream of terms.
Parameters:
docStream - the InputStream that represents the document's file.
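The idea of a document from which an indexer retrieves a stream of terms can be sketched without Terrier on the classpath. The following is a minimal illustration only: ReaderBackedDocument and collectTerms are hypothetical stand-ins, the split on non-alphanumeric characters is a placeholder for a real Tokeniser, and the assumption that getNextTerm may return null (which the loop skips) is made for this sketch rather than taken from this page. Only the method names getNextTerm and endOfDocument come from the inherited methods listed for FileDocument.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Scanner;

public class TermStreamSketch {

    /** Hypothetical stand-in for a document; this is not Terrier's PDFDocument. */
    static class ReaderBackedDocument {
        private final Scanner scanner;
        private boolean eod = false;

        ReaderBackedDocument(Reader docReader) {
            // Placeholder tokenisation: split on runs of non-alphanumeric characters.
            this.scanner = new Scanner(docReader).useDelimiter("[^\\p{Alnum}]+");
        }

        /** Returns the next lower-cased term, or null once the text is exhausted. */
        String getNextTerm() {
            if (scanner.hasNext()) {
                return scanner.next().toLowerCase(Locale.ROOT);
            }
            eod = true;
            return null;
        }

        /** True once getNextTerm has run out of input. */
        boolean endOfDocument() {
            return eod;
        }
    }

    /** Drains a document the way an indexing loop might, skipping null terms. */
    static List<String> collectTerms(ReaderBackedDocument doc) {
        List<String> terms = new ArrayList<>();
        while (!doc.endOfDocument()) {
            String term = doc.getNextTerm();
            if (term != null) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        ReaderBackedDocument doc =
            new ReaderBackedDocument(new StringReader("Hello, PDF world!"));
        System.out.println(collectTerms(doc));
    }
}
```

In this sketch the end-of-document flag only flips after a read attempt finds nothing, which is why the loop checks each returned term for null before using it.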
PDFDocument
public PDFDocument(java.io.InputStream docStream, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser tok)
Constructs a new PDFDocument.
Parameters:
docStream
docProperties
tok
PDFDocument
public PDFDocument(java.io.Reader docReader, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser tok)
Constructs a new PDFDocument.
Parameters:
docReader
docProperties
tok
PDFDocument
public PDFDocument(java.lang.String filename, java.io.Reader docReader, Tokeniser tok)
Constructs a new PDFDocument.
Parameters:
filename
docReader
tok
Method Detail
getReader
protected java.io.Reader getReader(java.io.InputStream is)
Returns the reader of text, which is suitable for parsing terms out of, and which is created by converting the file represented by the input stream passed as the parameter named is. This method runs the stream through the PDFParser and related classes provided in the org.pdfbox library. On error, it returns null and sets EOD to true, so no terms can be read from this document.
Overrides:
getReader in class FileDocument
Parameters:
is - the input stream that represents the document's file.
Returns:
a Reader that is fed to an indexer.
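The error contract described for getReader (return null and set EOD to true when conversion fails) can be sketched in isolation. This is a hedged illustration, not Terrier's source: SketchDocument and extractText are hypothetical names, and the check for a leading "%PDF-" header is only a placeholder for the real PDFBox conversion step.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;

public class GetReaderContractSketch {

    /** Hypothetical stand-in for a document class; this is not Terrier's FileDocument. */
    static class SketchDocument {
        /** Mirrors the EOD flag: true once no terms can be read. */
        protected boolean EOD = false;
        /** The text reader, or null if conversion failed. */
        protected final Reader reader;

        SketchDocument(InputStream docStream) {
            this.reader = getReader(docStream);
        }

        /** Placeholder conversion: accepts only input that starts with a PDF header. */
        protected String extractText(InputStream is) throws IOException {
            String content = new String(is.readAllBytes(), StandardCharsets.ISO_8859_1);
            if (!content.startsWith("%PDF-")) {
                throw new IOException("input does not start with a PDF header");
            }
            return content;
        }

        /** On failure, returns null and marks the document as finished. */
        protected Reader getReader(InputStream is) {
            try {
                return new StringReader(extractText(is));
            } catch (IOException e) {
                EOD = true;
                return null;
            }
        }

        boolean endOfDocument() {
            return EOD;
        }
    }

    /** Helper that wraps a string as an input stream for the demonstration. */
    static InputStream stream(String s) {
        return new ByteArrayInputStream(s.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        SketchDocument good = new SketchDocument(stream("%PDF-1.4 sample"));
        SketchDocument bad = new SketchDocument(stream("plain text"));
        System.out.println(good.endOfDocument() + " " + (good.reader != null));
        System.out.println(bad.endOfDocument() + " " + (bad.reader != null));
    }
}
```

Under these assumptions a caller never needs to catch a conversion exception: a failed document simply reports end of document immediately and yields no terms.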