org.terrier.indexing
Class MSWordDocument
java.lang.Object
org.terrier.indexing.FileDocument
org.terrier.indexing.MSWordDocument
- All Implemented Interfaces:
- Document
public class MSWordDocument
- extends FileDocument
This class indexes MS Word document files (i.e. files ending in .doc).
It uses the textmining.org MSWord conversion library (tm-extractors), which
in turn uses the Jakarta-POI libraries. To compile or use this class, ensure
that poi-?.?.?-final-*.jar and tm-extractors.jar are on your classpath.
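The classpath requirement above can be checked before indexing starts. The following is a minimal sketch, not part of Terrier: the class ClasspathCheck and its isAvailable helper are hypothetical, and the two class names it probes (org.textmining.text.extraction.WordExtractor and org.apache.poi.poifs.filesystem.POIFSFileSystem) are assumptions about what the tm-extractors and POI jars contain, so adjust them to match the jars you actually ship.

```java
/** Hypothetical helper, not part of Terrier: probes for classes on the classpath. */
class ClasspathCheck {

    /** Returns true if the named class can be located, without initialising it. */
    static boolean isAvailable(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Class is missing, or present but unusable (e.g. a dependency is absent).
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed entry points of tm-extractors and Jakarta-POI respectively.
        String[] required = {
            "org.textmining.text.extraction.WordExtractor",
            "org.apache.poi.poifs.filesystem.POIFSFileSystem"
        };
        for (String name : required) {
            System.out.println(name + " -> " + (isAvailable(name) ? "found" : "MISSING"));
        }
    }
}
```

Running the main method with the same classpath as the indexer reports which of the assumed classes could not be found.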
- Author:
- Craig Macdonald
Field Summary
protected static org.apache.log4j.Logger logger
Constructor Summary
MSWordDocument(java.io.InputStream docStream, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser tok)
Constructs a new MSWordDocument object for the file represented by docStream.
MSWordDocument(java.io.Reader docReader, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser tok)
Constructs a new MSWordDocument object for the file represented by docReader.
MSWordDocument(java.lang.String filename, java.io.InputStream docStream, Tokeniser tokeniser)
Constructs a new MSWordDocument object for the file represented by docStream.
MSWordDocument(java.lang.String filename, java.io.Reader docReader, Tokeniser tok)
Constructs a new MSWordDocument object for the file
Method Summary
protected java.io.Reader getReader(java.io.InputStream docStream)
Converts the docStream InputStream parameter into a Reader which contains plain text, and from which terms can be obtained.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
logger
protected static final org.apache.log4j.Logger logger
MSWordDocument
public MSWordDocument(java.lang.String filename,
java.io.InputStream docStream,
Tokeniser tokeniser)
- Constructs a new MSWordDocument object for the file represented by
docStream.
MSWordDocument
public MSWordDocument(java.io.InputStream docStream,
java.util.Map<java.lang.String,java.lang.String> docProperties,
Tokeniser tok)
- Constructs a new MSWordDocument object for the file represented by
docStream.
- Parameters:
docStream
docProperties
tok
MSWordDocument
public MSWordDocument(java.io.Reader docReader,
java.util.Map<java.lang.String,java.lang.String> docProperties,
Tokeniser tok)
- Constructs a new MSWordDocument object for the file represented by
docReader.
- Parameters:
docReader
docProperties
tok
MSWordDocument
public MSWordDocument(java.lang.String filename,
java.io.Reader docReader,
Tokeniser tok)
- Constructs a new MSWordDocument object for the file
- Parameters:
filename
docReader
tok
getReader
protected java.io.Reader getReader(java.io.InputStream docStream)
- Converts the docStream InputStream parameter into a Reader which contains
plain text, and from which terms can be obtained.
On failure, returns null and sets EOD to true, so no terms can be read from
this object.
- Overrides:
getReader
in class FileDocument
- Parameters:
docStream
- an input stream that we want to
access as a buffered reader.
- Returns:
- a reader over the plain text extracted from the
given input stream, or null if the conversion fails.
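The failure contract described for getReader (return null and set EOD to true) can be illustrated in isolation. This is a minimal sketch under stated assumptions, not Terrier's implementation: ReaderFallbackSketch and its TextExtractor interface are hypothetical stand-ins for the document class and for the Word conversion library, and only the null-plus-EOD behaviour is taken from the description above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;

/** Hypothetical illustration of the getReader failure contract; not Terrier code. */
class ReaderFallbackSketch {

    /** Hypothetical stand-in for a library that converts a document to plain text. */
    interface TextExtractor {
        String extractText(InputStream in) throws Exception;
    }

    /** Mirrors the documented end-of-document flag. */
    boolean EOD = false;

    private final TextExtractor extractor;

    ReaderFallbackSketch(TextExtractor extractor) {
        this.extractor = extractor;
    }

    /** Returns a Reader over the extracted text, or null (with EOD set) on failure. */
    Reader getReader(InputStream docStream) {
        try {
            return new StringReader(extractor.extractText(docStream));
        } catch (Exception e) {
            // Conversion failed: mark the document as exhausted so no terms are read.
            EOD = true;
            return null;
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream empty = new ByteArrayInputStream(new byte[0]);

        ReaderFallbackSketch good = new ReaderFallbackSketch(in -> "plain text");
        Reader reader = good.getReader(empty);
        System.out.println("reader available: " + (reader != null) + ", EOD: " + good.EOD);

        ReaderFallbackSketch bad = new ReaderFallbackSketch(in -> {
            throw new IOException("not a Word file");
        });
        System.out.println("reader available: " + (bad.getReader(empty) != null) + ", EOD: " + bad.EOD);
    }
}
```

Callers that follow this pattern should check the flag (or the null return) before attempting to read terms, since a failed conversion yields no reader at all.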
Terrier 3.5. Copyright © 2004-2011 University of Glasgow