Package org.terrier.indexing
Class POIDocument
- java.lang.Object
- org.terrier.indexing.FileDocument
- org.terrier.indexing.POIDocument
- All Implemented Interfaces:
Document
- Direct Known Subclasses:
MSExcelDocument, MSPowerPointDocument, MSWordDocument
public class POIDocument extends FileDocument
Represents Microsoft Office documents, which are parsed by the Apache POI library.
- Since:
- 3.6
- Author:
- Craig Macdonald
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class org.terrier.indexing.FileDocument
FileDocument.ReaderWrapper
-
-
Field Summary
-
Fields inherited from class org.terrier.indexing.FileDocument
abstractlength, abstractname, abstractwritten, br, EOD, filename, fileProperties, logger, tokenStream
-
-
Constructor Summary
Constructor and Description
POIDocument(java.io.InputStream docStream, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser tok)
Constructs a new POIDocument object for the file represented by docStream.
POIDocument(java.lang.String filename, java.io.InputStream docStream, Tokeniser tokeniser)
Constructs a new POIDocument object for the file represented by docStream.
-
Method Summary
Modifier and Type, Method and Description
protected org.apache.poi.POITextExtractor getExtractor(java.lang.String filename, java.io.InputStream docStream)
protected java.io.Reader getReader(java.io.InputStream docStream)
Converts the docStream InputStream parameter into a Reader which contains plain text, and from which terms can be obtained.
Methods inherited from class org.terrier.indexing.FileDocument
endOfDocument, getAllProperties, getFields, getNextTerm, getProperty, getReader, makeFilenameProperties, setProperty
-
-
-
-
Constructor Detail
-
POIDocument
public POIDocument(java.lang.String filename, java.io.InputStream docStream, Tokeniser tokeniser)
Constructs a new POIDocument object for the file represented by docStream.
-
POIDocument
public POIDocument(java.io.InputStream docStream, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser tok)
Constructs a new POIDocument object for the file represented by docStream.
- Parameters:
docStream
docProperties
tok
-
-
Method Detail
-
getExtractor
protected org.apache.poi.POITextExtractor getExtractor(java.lang.String filename, java.io.InputStream docStream) throws java.io.IOException
- Throws:
java.io.IOException
-
getReader
protected java.io.Reader getReader(java.io.InputStream docStream)
Converts the docStream InputStream parameter into a Reader which contains plain text, and from which terms can be obtained. On failure, returns null and sets EOD to true, so no terms can be read from this object.
- Overrides:
getReader in class FileDocument
- Parameters:
docStream - an input stream that we want to access as a buffered reader.
- Returns:
the buffered reader that encapsulates the given input stream.
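The failure contract described for getReader (return null and set EOD to true) can be illustrated with a stand-alone sketch. It uses only java.io and does not depend on Terrier or Apache POI. The class SketchDocument and its extractText hook are hypothetical names introduced for illustration, and the assumption that text extraction happens inside getReader is ours; this is not the actual POIDocument implementation.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a document class; not Terrier's POIDocument. */
class SketchDocument {
    /** Plays the role of the inherited EOD field: true once no terms can be read. */
    protected boolean EOD = false;
    /** Plain-text reader, or null if extraction failed. */
    protected final Reader br;

    SketchDocument(InputStream docStream) {
        this.br = getReader(docStream);
    }

    /**
     * Hypothetical extraction hook (an assumption, standing in for whatever
     * a text extractor would do). Here it simply decodes the bytes as UTF-8.
     */
    protected String extractText(InputStream docStream) throws IOException {
        return new String(docStream.readAllBytes(), StandardCharsets.UTF_8);
    }

    /** Wraps the extracted text in a Reader; on failure returns null and sets EOD. */
    protected Reader getReader(InputStream docStream) {
        try {
            return new BufferedReader(new StringReader(extractText(docStream)));
        } catch (IOException e) {
            EOD = true;   // no terms can be read from this object
            return null;
        }
    }

    public boolean endOfDocument() {
        return EOD;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(
                "plain text terms".getBytes(StandardCharsets.UTF_8));
        SketchDocument doc = new SketchDocument(in);
        // Check endOfDocument() before touching the reader, since it may be null.
        if (!doc.endOfDocument()) {
            System.out.println(new BufferedReader(doc.br).readLine());
        }
    }
}
```

Callers following this pattern would check endOfDocument() before reading, rather than testing the reader for null themselves.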
-
-