org.terrier.indexing
Class MSWordDocument

java.lang.Object
  extended by org.terrier.indexing.FileDocument
      extended by org.terrier.indexing.MSWordDocument
All Implemented Interfaces:
Document

public class MSWordDocument
extends FileDocument

This class is used for indexing MS Word document files (i.e. files ending in .doc). It does this by using the textmining.org MSWord conversion library (tm-extractors), which in turn uses the Jakarta-POI libraries. To compile or use this class, ensure that poi-?.?.?-final-*.jar and tm-extractors.jar are on your classpath.
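
Because a missing jar typically only surfaces when the first .doc file is indexed, it can help to check the classpath up front. The following is a minimal sketch of such a check, not part of Terrier. The helper class WordDependencyCheck is hypothetical, and the two library class names are assumptions about what the tm-extractors and POI jars contain; adjust them to match the jar versions you actually ship.

```java
// Sketch only: WordDependencyCheck is a hypothetical helper, not a Terrier class.
final class WordDependencyCheck {

    /** Returns true if the named class can be located on the current classpath. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initialisers
            Class.forName(className, false, WordDependencyCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Both names are assumptions about the jars' contents, not stated on this page.
        String[] required = {
            "org.textmining.text.extraction.WordExtractor",   // assumed to live in tm-extractors.jar
            "org.apache.poi.poifs.filesystem.POIFSFileSystem" // assumed to live in the POI jar
        };
        for (String name : required) {
            System.out.println(name + (isOnClasspath(name) ? " found" : " MISSING"));
        }
    }
}
```

Running the main method with the same classpath as the indexer reports each assumed class as found or MISSING.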

Author:
Craig Macdonald

Nested Class Summary
 
Nested classes/interfaces inherited from class org.terrier.indexing.FileDocument
FileDocument.ReaderWrapper
 
Field Summary
protected static org.apache.log4j.Logger logger
           
 
Fields inherited from class org.terrier.indexing.FileDocument
abstractlength, abstractname, abstractwritten, br, counter, EOD, filename, fileProperties, tokenStream
 
Constructor Summary
MSWordDocument(java.io.InputStream docStream, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser tok)
          Constructs a new MSWordDocument object for the file represented by docStream.
MSWordDocument(java.io.Reader docReader, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser tok)
          Constructs a new MSWordDocument object for the file represented by docReader.
MSWordDocument(java.lang.String filename, java.io.InputStream docStream, Tokeniser tokeniser)
          Constructs a new MSWordDocument object for the file represented by docStream.
MSWordDocument(java.lang.String filename, java.io.Reader docReader, Tokeniser tok)
          Constructs a new MSWordDocument object for the file represented by docReader.
 
Method Summary
protected  java.io.Reader getReader(java.io.InputStream docStream)
          Converts the docStream InputStream parameter into a Reader which contains plain text, and from which terms can be obtained.
 
Methods inherited from class org.terrier.indexing.FileDocument
endOfDocument, getAllProperties, getFields, getNextTerm, getProperty, getReader, makeFilenameProperties, setProperty
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

logger

protected static final org.apache.log4j.Logger logger
Constructor Detail

MSWordDocument

public MSWordDocument(java.lang.String filename,
                      java.io.InputStream docStream,
                      Tokeniser tokeniser)
Constructs a new MSWordDocument object for the file represented by docStream.


MSWordDocument

public MSWordDocument(java.io.InputStream docStream,
                      java.util.Map<java.lang.String,java.lang.String> docProperties,
                      Tokeniser tok)
Constructs a new MSWordDocument object for the file represented by docStream.

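Parameters:
docStream - the input stream from which the MS Word file is read.
docProperties - the properties of the document, as a map of property names to values.
tok - the Tokeniser used to obtain terms from the document text.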

MSWordDocument

public MSWordDocument(java.io.Reader docReader,
                      java.util.Map<java.lang.String,java.lang.String> docProperties,
                      Tokeniser tok)
Constructs a new MSWordDocument object for the file represented by docReader.

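Parameters:
docReader - the reader from which the document text is read.
docProperties - the properties of the document, as a map of property names to values.
tok - the Tokeniser used to obtain terms from the document text.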

MSWordDocument

public MSWordDocument(java.lang.String filename,
                      java.io.Reader docReader,
                      Tokeniser tok)
Constructs a new MSWordDocument object for the file represented by docReader.

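Parameters:
filename - the name of the file that the document represents.
docReader - the reader from which the document text is read.
tok - the Tokeniser used to obtain terms from the document text.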
Method Detail

getReader

protected java.io.Reader getReader(java.io.InputStream docStream)
Converts the docStream InputStream parameter into a Reader which contains plain text, and from which terms can be obtained. On failure, returns null and sets EOD to true, so no terms can be read from this object.

Overrides:
getReader in class FileDocument
Parameters:
docStream - the input stream of the MS Word file to be converted to plain text.
Returns:
a reader over the plain text extracted from the given input stream, or null if the conversion fails.
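
The failure behaviour described above (a null reader together with EOD set to true) can be illustrated with a small self-contained sketch. ConvertingDocumentSketch and its extractText method are hypothetical stand-ins, not Terrier code: the real class delegates conversion to tm-extractors, whereas this stand-in simply decodes the bytes as UTF-8 and splits on whitespace. Only the shape of the contract is meant to carry over.

```java
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;

// Stand-in for illustration only: NOT the Terrier class.
class ConvertingDocumentSketch {
    protected boolean EOD = false;   // mirrors the inherited EOD flag
    protected final Reader reader;   // null when conversion failed

    ConvertingDocumentSketch(InputStream docStream) {
        this.reader = getReader(docStream);
    }

    /** Hypothetical converter: treats the bytes as UTF-8 text; fails on a null stream. */
    protected String extractText(InputStream in) throws IOException {
        if (in == null) throw new IOException("no input stream");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) out.write(buf, 0, n);
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** On failure, returns null and sets EOD to true, as the documented contract states. */
    protected Reader getReader(InputStream docStream) {
        try {
            return new BufferedReader(new StringReader(extractText(docStream)));
        } catch (IOException e) {
            EOD = true;
            return null;
        }
    }

    boolean endOfDocument() {
        return EOD;
    }

    /** Returns the next whitespace-delimited term, or null once the document is exhausted. */
    String getNextTerm() {
        if (EOD) return null;        // also guards against the null reader
        StringBuilder sb = new StringBuilder();
        try {
            int c;
            while ((c = reader.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    if (sb.length() > 0) return sb.toString();
                } else {
                    sb.append((char) c);
                }
            }
        } catch (IOException e) {
            // fall through: treat a read error as end of document
        }
        EOD = true;
        return sb.length() > 0 ? sb.toString() : null;
    }
}
```

In this sketch, a caller that loops on endOfDocument() and getNextTerm() never dereferences the null reader, because the EOD flag is checked before any read is attempted.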


Terrier 3.5. Copyright © 2004-2011 University of Glasgow