org.terrier.indexing
Class MSExcelDocument

java.lang.Object
  extended by org.terrier.indexing.FileDocument
      extended by org.terrier.indexing.MSExcelDocument
All Implemented Interfaces:
Document

public class MSExcelDocument
extends FileDocument

Implements a Document object for a Microsoft Excel spreadsheet, using the HSSF and POIFS components of the Jakarta-POI project. To compile or use this class, you must have the poi-?.?.?-final-*.jar in your classpath.

A bug in the current stable POI library appears to prevent large Excel files from being parsed. See the MAXFILESIZE field, which sets the maximum file size that this class will attempt to read.

Author:
Craig Macdonald

Nested Class Summary
 
Nested classes/interfaces inherited from class org.terrier.indexing.FileDocument
FileDocument.ReaderWrapper
 
Field Summary
protected static org.apache.log4j.Logger logger
           
protected static long MAXFILESIZE
          Maximum file size that this class will attempt to open.
protected static int MEGABYTE
          Size of 1MB in bytes
 
Fields inherited from class org.terrier.indexing.FileDocument
abstractlength, abstractname, abstractwritten, br, counter, EOD, filename, fileProperties, tokenStream
 
Constructor Summary
MSExcelDocument(java.io.InputStream docStream, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser tok)
          Construct a new MSExcelDocument Document object
MSExcelDocument(java.io.Reader docReader, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser tok)
          Construct a new MSExcelDocument Document object
MSExcelDocument(java.lang.String filename, java.io.InputStream docStream, Tokeniser tokeniser)
          Construct a new MSExcelDocument Document object
MSExcelDocument(java.lang.String filename, java.io.Reader docReader, Tokeniser tok)
          Construct a new MSExcelDocument Document object
 
Method Summary
protected  java.io.Reader getReader(java.io.InputStream docStream)
          Get the reader appropriate for this InputStream.
 
Methods inherited from class org.terrier.indexing.FileDocument
endOfDocument, getAllProperties, getFields, getNextTerm, getProperty, getReader, makeFilenameProperties, setProperty
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

logger

protected static final org.apache.log4j.Logger logger

MEGABYTE

protected static final int MEGABYTE
Size of 1MB in bytes

See Also:
Constant Field Values

MAXFILESIZE

protected static final long MAXFILESIZE
Maximum file size that this class will attempt to open. Set to 0 to disable the limit. Set by the property indexing.excel.maxfilesize.mb (a value in megabytes); the default is 0.5.
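
The following is a minimal sketch of how a limit of this kind could be derived from the property value and applied. It is not the Terrier source. The class name ExcelSizeLimitSketch and the helpers maxFileSizeBytes and tooBig are hypothetical, MEGABYTE is assumed to be 1024 * 1024, and System.getProperty stands in for whatever property lookup Terrier actually uses.

```java
// Hypothetical sketch; names and the value of MEGABYTE are assumptions, not the Terrier source.
class ExcelSizeLimitSketch {
    /** Assumed size of 1MB in bytes. */
    static final int MEGABYTE = 1024 * 1024;

    /** Converts a property value given in megabytes to bytes; null means the documented default, 0.5. */
    static long maxFileSizeBytes(String propertyValue) {
        String mb = (propertyValue == null) ? "0.5" : propertyValue.trim();
        return (long) (MEGABYTE * Double.parseDouble(mb));
    }

    /** A limit of 0 disables the check, so no file is ever considered too big. */
    static boolean tooBig(long fileSizeBytes, long maxFileSizeBytes) {
        return maxFileSizeBytes > 0 && fileSizeBytes > maxFileSizeBytes;
    }

    public static void main(String[] args) {
        // System.getProperty is a stand-in for the real property lookup.
        long limit = maxFileSizeBytes(System.getProperty("indexing.excel.maxfilesize.mb"));
        System.out.println("limit in bytes: " + limit);
        System.out.println("1MB file rejected: " + tooBig(MEGABYTE, limit));
    }
}
```

Under these assumptions, the default of 0.5 corresponds to a limit of 524288 bytes, and a value of 0 lets files of any size through.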

Constructor Detail

MSExcelDocument

public MSExcelDocument(java.lang.String filename,
                       java.io.InputStream docStream,
                       Tokeniser tokeniser)
Construct a new MSExcelDocument Document object

Parameters:
filename - the name of the file that is opened for this document
docStream - the actual stream of the open file
tokeniser - the Tokeniser used to split the document text into terms

MSExcelDocument

public MSExcelDocument(java.io.InputStream docStream,
                       java.util.Map<java.lang.String,java.lang.String> docProperties,
                       Tokeniser tok)
Construct a new MSExcelDocument Document object

Parameters:
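docStream - the actual stream of the open file
docProperties - the properties of the document
tok - the Tokeniser used to split the document text into terms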

MSExcelDocument

public MSExcelDocument(java.io.Reader docReader,
                       java.util.Map<java.lang.String,java.lang.String> docProperties,
                       Tokeniser tok)
Construct a new MSExcelDocument Document object

Parameters:
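docReader - a reader over the content of the document
docProperties - the properties of the document
tok - the Tokeniser used to split the document text into terms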

MSExcelDocument

public MSExcelDocument(java.lang.String filename,
                       java.io.Reader docReader,
                       Tokeniser tok)
Construct a new MSExcelDocument Document object

Parameters:
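filename - the name of the file that is opened for this document
docReader - a reader over the content of the document
tok - the Tokeniser used to split the document text into terms

Method Detail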

getReader

protected java.io.Reader getReader(java.io.InputStream docStream)
Get the reader appropriate for this InputStream. This involves converting the Excel document to a stream of words. On failure, returns null and sets EOD to true, so that no terms can be read from the object. Uses the property indexing.excel.maxfilesize.mb to determine whether the file is too big to open.

Overrides:
getReader in class FileDocument
Parameters:
docStream - the stream of the Excel file to convert
Returns:
the buffered reader that encapsulates the given input stream, or null on failure.
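
The failure contract described above (return null and set EOD to true) can be illustrated with a small, self-contained sketch. It does not reproduce this class's implementation: the class GuardedReaderSketch and its maxBytes field are hypothetical, the size check is done by counting bytes as they are read (an assumption), and decoding the bytes as UTF-8 text is only a placeholder for the POI-based conversion of the spreadsheet to words.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the "null reader plus EOD flag" contract; not the Terrier source.
class GuardedReaderSketch {
    /** Mirrors the EOD flag: true once no terms can be read. */
    boolean EOD = false;
    /** Size limit in bytes; 0 disables the limit. */
    final long maxBytes;

    GuardedReaderSketch(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /** Returns a reader over the document text, or null (with EOD set) on failure. */
    Reader getReader(InputStream docStream) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = docStream.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
                if (maxBytes > 0 && buf.size() > maxBytes) {
                    EOD = true;   // too big: give up before attempting conversion
                    return null;
                }
            }
            // Placeholder for the real conversion of spreadsheet cells to words.
            String words = new String(buf.toByteArray(), StandardCharsets.UTF_8);
            return new StringReader(words);
        } catch (IOException e) {
            EOD = true;           // read failure: no terms will be available
            return null;
        }
    }
}
```

A caller written against this contract checks the end-of-document state before asking for terms, rather than dereferencing the reader directly.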


Terrier 3.5. Copyright © 2004-2011 University of Glasgow