org.terrier.indexing
Class MSWordDocument
java.lang.Object
   org.terrier.indexing.FileDocument
       org.terrier.indexing.MSWordDocument
- All Implemented Interfaces: 
- Document
- public class MSWordDocument 
- extends FileDocument
This class is used for indexing MS Word document files (i.e. files ending in .doc).
  It does this by using the textmining.org MSWord conversion library (tm-extractors),
  which in turn uses the Jakarta-POI libraries. To compile or use this class, ensure that
  poi-?.?.?-final-*.jar and tm-extractors.jar are on your classpath.
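The classpath requirement above can be checked before indexing starts. The following is a minimal sketch, not part of Terrier: the class name ClasspathCheck and the helper isOnClasspath are hypothetical, and the two probed class names are assumptions about what the POI and tm-extractors jars contain, so adjust them to match the jars you actually ship.

```java
/** Hypothetical helper (not part of Terrier): probes for classes by name. */
class ClasspathCheck {

    /** Returns true if the named class can be located without initialising it. */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed class names; verify them against the jars on your system.
        String[] required = {
            "org.apache.poi.poifs.filesystem.POIFSFileSystem", // assumed to be in the POI jar
            "org.textmining.text.extraction.WordExtractor"     // assumed to be in tm-extractors.jar
        };
        for (String name : required) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

Running the main method with the same classpath as the indexer reports which of the assumed classes are missing.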
- Author:
- Craig Macdonald 
 
 
| Field Summary |
| protected static org.apache.log4j.Logger | logger |
 
 
| Constructor Summary |
| MSWordDocument(java.io.InputStream docStream, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser tok) | Constructs a new MSWordDocument object for the file represented by docStream. |
| MSWordDocument(java.io.Reader docReader, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser tok) | Constructs a new MSWordDocument object for the file represented by docReader. |
| MSWordDocument(java.lang.String filename, java.io.InputStream docStream, Tokeniser tokeniser) | Constructs a new MSWordDocument object for the file represented by docStream. |
| MSWordDocument(java.lang.String filename, java.io.Reader docReader, Tokeniser tok) | Constructs a new MSWordDocument object for the file |
 
| Method Summary |
| protected java.io.Reader | getReader(java.io.InputStream docStream) | Converts the docStream InputStream parameter into a Reader which contains plain text, and from which terms can be obtained. |
 
 
| Methods inherited from class java.lang.Object | 
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait | 
 
logger
protected static final org.apache.log4j.Logger logger
MSWordDocument
public MSWordDocument(java.lang.String filename,
                      java.io.InputStream docStream,
                      Tokeniser tokeniser)
- Constructs a new MSWordDocument object for the file represented by docStream.
 
MSWordDocument
public MSWordDocument(java.io.InputStream docStream,
                      java.util.Map<java.lang.String,java.lang.String> docProperties,
                      Tokeniser tok)
- Constructs a new MSWordDocument object for the file represented by docStream.
 
- Parameters:
- docStream-
- docProperties-
- tok-
 
MSWordDocument
public MSWordDocument(java.io.Reader docReader,
                      java.util.Map<java.lang.String,java.lang.String> docProperties,
                      Tokeniser tok)
- Constructs a new MSWordDocument object for the file represented by docReader.
 
- Parameters:
- docReader-
- docProperties-
- tok-
 
MSWordDocument
public MSWordDocument(java.lang.String filename,
                      java.io.Reader docReader,
                      Tokeniser tok)
- Constructs a new MSWordDocument object for the file
 
- Parameters:
- filename-
- docReader-
- tok-
 
getReader
protected java.io.Reader getReader(java.io.InputStream docStream)
- Converts the docStream InputStream parameter into a Reader which contains
  plain text, and from which terms can be obtained. 
  On failure, returns null and sets EOD to true, so no terms can be read from
  this object.
 
- Overrides:
- getReader in class FileDocument
 
- Parameters:
- docStream - an input stream that we want to access as a buffered reader.
- Returns:
- the buffered reader that encapsulates the given input stream, or null if the conversion fails.
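The failure contract described for getReader (return null and set EOD to true) can be illustrated with a small self-contained sketch. It does not use Terrier or tm-extractors: SketchDocument, TextExtractor, reader() and the two-argument getReader are hypothetical stand-ins that show only the pattern, not the actual implementation.

```java
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;

/** Hypothetical stand-in for a Word-to-text converter; not a real library API. */
interface TextExtractor {
    String extractText(InputStream in) throws Exception;
}

/** Hypothetical document class that mirrors only the failure contract. */
class SketchDocument {
    private boolean EOD = false;   // true once no more terms can be read
    private final Reader reader;   // null if conversion failed

    SketchDocument(InputStream docStream, TextExtractor extractor) {
        this.reader = getReader(docStream, extractor);
    }

    /** Converts the stream to a plain-text Reader, or returns null and sets EOD on failure. */
    private Reader getReader(InputStream docStream, TextExtractor extractor) {
        try {
            return new StringReader(extractor.extractText(docStream));
        } catch (Exception e) {
            EOD = true;            // callers see an empty document rather than an exception
            return null;
        }
    }

    boolean endOfDocument() { return EOD; }

    Reader reader() { return reader; }
}
```

A caller following this pattern would check endOfDocument() before asking for terms, so a corrupt or non-Word file is skipped as an empty document instead of aborting the indexing run.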
 
Terrier 3.5. Copyright © 2004-2011 University of Glasgow