org.terrier.indexing
Class PDFDocument
java.lang.Object
org.terrier.indexing.FileDocument
org.terrier.indexing.PDFDocument
- All Implemented Interfaces:
- Document
public class PDFDocument
- extends FileDocument
Implements a Document object for reading PDF documents. This class uses the
PDFBox library (PDFBox.org), so PDFBox-0.6.7a.jar or later must be on your
classpath when compiling against or using this class. The log4j library is
also required.
- Author:
- Craig Macdonald
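The classpath requirement described above can be checked before any indexing starts. The following is a minimal sketch, not part of Terrier: the class ClasspathCheck and its method isOnClasspath are hypothetical names, and org.pdfbox.pdfparser.PDFParser is an assumed class name for the PDFBox releases that use the org.pdfbox package. The log4j class name is the Logger type listed in the field summary of this class.

```java
public class ClasspathCheck {

    /** Returns true if the named class can be located on the current classpath. */
    public static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initialisers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed class name for PDFBox releases in the org.pdfbox package.
        System.out.println("PDFBox: " + isOnClasspath("org.pdfbox.pdfparser.PDFParser"));
        // Logger type taken from the field summary of PDFDocument.
        System.out.println("log4j:  " + isOnClasspath("org.apache.log4j.Logger"));
    }
}
```

Running the check once at start-up gives a clear message about a missing jar, rather than a failure when the first PDF file is opened.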
Field Summary
protected static org.apache.log4j.Logger
logger
Constructor Summary
PDFDocument(java.io.InputStream docStream,
java.util.Map<java.lang.String,java.lang.String> docProperties,
Tokeniser tok)
Constructs a new PDFDocument
PDFDocument(java.io.Reader docReader,
java.util.Map<java.lang.String,java.lang.String> docProperties,
Tokeniser tok)
Constructs a new PDFDocument
PDFDocument(java.lang.String filename,
java.io.InputStream docStream,
Tokeniser tokeniser)
Constructs a new PDFDocument, which will convert the docStream
which represents the file to a Document object from which an Indexer
can retrieve a stream of terms.
PDFDocument(java.lang.String filename,
java.io.Reader docReader,
Tokeniser tok)
Constructs a new PDFDocument
Method Summary
protected java.io.Reader
getReader(java.io.InputStream docStream)
Returns the reader of text, which is suitable for parsing terms out of,
and which is created by converting the file represented by
parameter docStream.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
logger
protected static final org.apache.log4j.Logger logger
PDFDocument
public PDFDocument(java.lang.String filename,
java.io.InputStream docStream,
Tokeniser tokeniser)
- Constructs a new PDFDocument, which converts the docStream
representing the file into a Document object from which an Indexer
can retrieve a stream of terms.
- Parameters:
docStream
- the input stream that represents the document's file.
PDFDocument
public PDFDocument(java.io.InputStream docStream,
java.util.Map<java.lang.String,java.lang.String> docProperties,
Tokeniser tok)
- Constructs a new PDFDocument
- Parameters:
docStream
docProperties
tok
PDFDocument
public PDFDocument(java.io.Reader docReader,
java.util.Map<java.lang.String,java.lang.String> docProperties,
Tokeniser tok)
- Constructs a new PDFDocument
- Parameters:
docReader
docProperties
tok
PDFDocument
public PDFDocument(java.lang.String filename,
java.io.Reader docReader,
Tokeniser tok)
- Constructs a new PDFDocument
- Parameters:
filename
docReader
tok
getReader
protected java.io.Reader getReader(java.io.InputStream docStream)
- Returns a reader of text suitable for parsing terms, created by
converting the file represented by the docStream parameter. The stream
is run through the PDFParser and related classes provided by the
org.pdfbox library. On error, this method returns null and sets EOD
to true, so no terms can be read from this document.
- Overrides:
getReader
in class FileDocument
- Parameters:
docStream
- the input stream that represents the document's file.
- Returns:
- a Reader that is fed to an indexer.
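The error contract of getReader (return null and set EOD to true on failure) can be illustrated with a small stand-in. This is a sketch under stated assumptions, not Terrier's implementation: SketchDocument, convert and endOfDocument here are hypothetical names, and the "conversion" only checks for the %PDF- header that PDF files start with, rather than calling PDFBox.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;

public class GetReaderContractSketch {

    /** Hypothetical stand-in for a document class; it does not use Terrier or PDFBox. */
    public static class SketchDocument {
        public boolean EOD = false;
        public final Reader reader;

        public SketchDocument(InputStream docStream) {
            this.reader = getReader(docStream);
        }

        /** On conversion failure, returns null and marks the document as ended. */
        protected Reader getReader(InputStream docStream) {
            try {
                return new StringReader(convert(docStream));
            } catch (IOException e) {
                EOD = true;
                return null;
            }
        }

        /** Placeholder conversion: accepts only streams that begin with "%PDF-". */
        protected String convert(InputStream docStream) throws IOException {
            String raw = new String(docStream.readAllBytes(), StandardCharsets.ISO_8859_1);
            if (!raw.startsWith("%PDF-")) {
                throw new IOException("not a PDF stream");
            }
            return raw; // a real converter would extract the text content here
        }

        public boolean endOfDocument() {
            return EOD;
        }
    }

    /** Helper that builds a SketchDocument from raw bytes. */
    public static SketchDocument open(byte[] bytes) {
        return new SketchDocument(new ByteArrayInputStream(bytes));
    }

    public static void main(String[] args) {
        SketchDocument bad = open("plain text".getBytes(StandardCharsets.ISO_8859_1));
        SketchDocument ok = open("%PDF-1.4 sample".getBytes(StandardCharsets.ISO_8859_1));
        System.out.println("bad: reader=" + bad.reader + ", EOD=" + bad.endOfDocument());
        System.out.println("ok:  EOD=" + ok.endOfDocument());
    }
}
```

A caller following this pattern checks the end-of-document flag before asking for terms, so a file that cannot be converted simply yields no terms.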
Terrier 3.5. Copyright © 2004-2011 University of Glasgow