org.terrier.indexing
Class FileDocument

java.lang.Object
  extended by org.terrier.indexing.FileDocument
All Implemented Interfaces:
Document
Direct Known Subclasses:
MSExcelDocument, MSPowerpointDocument, MSWordDocument, PDFDocument

public class FileDocument
extends java.lang.Object
implements Document

Models a document which corresponds to one file. The first FileDocument.abstract.length characters can be saved as an abstract.

Author:
Craig Macdonald, Vassilis Plachouras, Richard McCreadie, Rodrygo Santos

Nested Class Summary
 class FileDocument.ReaderWrapper
          A wrapper around the token stream, used to lift the terms from the stream for storage in the abstract.
 
Field Summary
protected  int abstractlength
          The maximum length, in characters, of each named abstract.
protected  java.lang.String abstractname
          The names of the abstracts to be saved (comma-separated list).
protected  int abstractwritten
          The number of characters currently written to the abstract.
protected  java.io.Reader br
          The input reader.
 long counter
          The number of bytes read from the input.
protected  boolean EOD
          End of Document.
protected  java.lang.String filename
          The name of the file represented by this document.
protected  java.util.Map<java.lang.String,java.lang.String> fileProperties
           
protected static org.apache.log4j.Logger logger
           
protected  TokenStream tokenStream
           
 
Constructor Summary
protected FileDocument()
           
  FileDocument(java.io.InputStream docStream, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser tok)
          Constructs an instance of the FileDocument from the given input stream.
  FileDocument(java.io.Reader docReader, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser tok)
          Creates a document for a file.
  FileDocument(java.lang.String _filename, java.io.InputStream docStream, Tokeniser tok)
          Creates a document for a file.
  FileDocument(java.lang.String _filename, java.io.Reader docReader, Tokeniser tok)
          Creates a document for a file.
 
Method Summary
 boolean endOfDocument()
          Indicates whether the end of a document has been reached.
 java.util.Map<java.lang.String,java.lang.String> getAllProperties()
          Returns the underlying map of all the properties defined by this Document.
 java.util.Set<java.lang.String> getFields()
          Returns null because there is no support for fields with file documents.
 java.lang.String getNextTerm()
          Gets the next term from the document.
 java.lang.String getProperty(java.lang.String name)
          Gets a document property.
 java.io.Reader getReader()
          Returns the underlying buffered reader, so that client code can tokenise the document itself and process it as it sees fit.
protected  java.io.Reader getReader(java.io.InputStream docStream)
          Returns a buffered reader that encapsulates the given input stream.
protected static java.util.Map<java.lang.String,java.lang.String> makeFilenameProperties(java.lang.String filename)
           
 void setProperty(java.lang.String name, java.lang.String value)
          Sets a document property.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

logger

protected static final org.apache.log4j.Logger logger

br

protected java.io.Reader br
The input reader.


EOD

protected boolean EOD
End of Document. Set by the last couple of lines in getNextTerm()


counter

public long counter
The number of bytes read from the input.


fileProperties

protected java.util.Map<java.lang.String,java.lang.String> fileProperties

filename

protected java.lang.String filename
The name of the file represented by this document.


tokenStream

protected TokenStream tokenStream

abstractname

protected final java.lang.String abstractname
The names of the abstracts to be saved (comma-separated list).


abstractlength

protected final int abstractlength
The maximum length, in characters, of each named abstract.


abstractwritten

protected int abstractwritten
The number of characters currently written to the abstract.
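
The interplay of abstractlength and abstractwritten can be illustrated with a small, self-contained sketch. It is not the FileDocument.ReaderWrapper implementation (which wraps the token stream); it only shows the capping idea on a plain java.io.Reader. The class name AbstractCaptureSketch, the helper abstractOf and the accessor getAbstract are hypothetical names introduced for this example.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Illustrative only: copies at most abstractLength characters as they are read. */
public class AbstractCaptureSketch extends FilterReader {
    private final StringBuilder abstractText = new StringBuilder();
    private final int abstractLength;  // maximum number of characters to keep
    private int abstractWritten = 0;   // number of characters kept so far

    public AbstractCaptureSketch(Reader in, int abstractLength) {
        super(in);
        this.abstractLength = abstractLength;
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        if (c != -1 && abstractWritten < abstractLength) {
            abstractText.append((char) c);
            abstractWritten++;
        }
        return c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0 && abstractWritten < abstractLength) {
            // Keep only as many characters as still fit under the cap.
            int keep = Math.min(n, abstractLength - abstractWritten);
            abstractText.append(buf, off, keep);
            abstractWritten += keep;
        }
        return n;
    }

    public String getAbstract() {
        return abstractText.toString();
    }

    /** Hypothetical helper: drains the text and returns the captured abstract. */
    static String abstractOf(String text, int maxLength) {
        try (AbstractCaptureSketch r =
                 new AbstractCaptureSketch(new StringReader(text), maxLength)) {
            char[] buf = new char[4];
            while (r.read(buf, 0, buf.length) != -1) {
                // Reading is enough; the wrapper records characters as a side effect.
            }
            return r.getAbstract();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(abstractOf("the quick brown fox", 9));
    }
}
```

The consumer reads the whole input as usual; only the side copy is capped, so the cap does not affect what is tokenised.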

Constructor Detail

FileDocument

protected FileDocument()

FileDocument

public FileDocument(java.lang.String _filename,
                    java.io.Reader docReader,
                    Tokeniser tok)
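Creates a document for a file.

Parameters:
_filename - the name of the file represented by this document.
docReader - the reader supplying the content of the file.
tok - the tokeniser used to extract terms from the content.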

FileDocument

public FileDocument(java.lang.String _filename,
                    java.io.InputStream docStream,
                    Tokeniser tok)
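Creates a document for a file.

Parameters:
_filename - the name of the file represented by this document.
docStream - the input stream supplying the content of the file.
tok - the tokeniser used to extract terms from the content.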

FileDocument

public FileDocument(java.io.Reader docReader,
                    java.util.Map<java.lang.String,java.lang.String> docProperties,
                    Tokeniser tok)
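Creates a document for a file.

Parameters:
docReader - the reader supplying the content of the file.
docProperties - the properties of the document, as a map of names to values.
tok - the tokeniser used to extract terms from the content.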

FileDocument

public FileDocument(java.io.InputStream docStream,
                    java.util.Map<java.lang.String,java.lang.String> docProperties,
                    Tokeniser tok)
Constructs an instance of the FileDocument from the given input stream.

Parameters:
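docStream - the input stream that reads the file.
docProperties - the properties of the document, as a map of names to values.
tok - the tokeniser used to extract terms from the content.

Method Detail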

makeFilenameProperties

protected static java.util.Map<java.lang.String,java.lang.String> makeFilenameProperties(java.lang.String filename)

getReader

public java.io.Reader getReader()
Returns the underlying buffered reader, so that client code can tokenise the document itself and process it as it sees fit.

Specified by:
getReader in interface Document

getReader

protected java.io.Reader getReader(java.io.InputStream docStream)
Returns a buffered reader that encapsulates the given input stream.

Parameters:
docStream - an input stream that we want to access as a buffered reader.
Returns:
the buffered reader that encapsulates the given input stream.
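
As a rough illustration of what "a buffered reader that encapsulates the given input stream" can look like in plain Java, the sketch below wraps an InputStream with the standard java.io classes. The class and method names (ReaderWrapSketch, toBufferedReader, readAll) are hypothetical, and the UTF-8 charset is an assumption made for the example; the page does not state which encoding FileDocument uses.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ReaderWrapSketch {

    /** Wraps a byte stream in a buffered character reader (UTF-8 is an assumption). */
    static Reader toBufferedReader(InputStream docStream) {
        return new BufferedReader(new InputStreamReader(docStream, StandardCharsets.UTF_8));
    }

    /** Reads all remaining characters from the reader, then closes it. */
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        try (Reader r = reader) {
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        InputStream in =
            new ByteArrayInputStream("hello terrier".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(toBufferedReader(in)));
    }
}
```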

getNextTerm

public java.lang.String getNextTerm()
Gets the next term from the document.

Specified by:
getNextTerm in interface Document
Returns:
the next term of the document. A null return value should be ignored.

getFields

public java.util.Set<java.lang.String> getFields()
Returns null because there is no support for fields with file documents.

Specified by:
getFields in interface Document
Returns:
null.

endOfDocument

public boolean endOfDocument()
Indicates whether the end of a document has been reached.

Specified by:
endOfDocument in interface Document
Returns:
true if the end of the document has been reached; false otherwise.
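
Taken together, getNextTerm() and endOfDocument() suggest a simple consumption loop: keep requesting terms until the end of the document is reported, and skip any null that comes back. The sketch below illustrates this against a hypothetical two-method interface, TermSource, which stands in for the relevant part of Document so that the example compiles without Terrier on the classpath. ArrayTermSource and collectTerms are likewise illustrative names, not part of the library.

```java
import java.util.ArrayList;
import java.util.List;

public class TermLoopSketch {

    /** Hypothetical stand-in for the two Document methods used here. */
    interface TermSource {
        String getNextTerm();
        boolean endOfDocument();
    }

    /** Toy source backed by an array; null entries mimic terms to be ignored. */
    static class ArrayTermSource implements TermSource {
        private final String[] terms;
        private int pos = 0;

        ArrayTermSource(String... terms) {
            this.terms = terms;
        }

        public String getNextTerm() {
            return pos < terms.length ? terms[pos++] : null;
        }

        public boolean endOfDocument() {
            return pos >= terms.length;
        }
    }

    /** Drains the source, dropping null terms as the documentation advises. */
    static List<String> collectTerms(TermSource doc) {
        List<String> out = new ArrayList<>();
        while (!doc.endOfDocument()) {
            String term = doc.getNextTerm();
            if (term != null) {
                out.add(term);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        TermSource doc = new ArrayTermSource("terrier", null, "indexing");
        System.out.println(collectTerms(doc));
    }
}
```

With a real FileDocument the same loop shape would apply to the Document methods directly; the null check matters because a null return does not by itself signal the end of the document.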

getProperty

public java.lang.String getProperty(java.lang.String name)
Gets a document property.

Specified by:
getProperty in interface Document
Parameters:
name - Name of the property. It is suggested, but not required, that this name should not be case insensitive.

setProperty

public void setProperty(java.lang.String name,
                        java.lang.String value)
Sets a document property.


getAllProperties

public java.util.Map<java.lang.String,java.lang.String> getAllProperties()
Returns the underlying map of all the properties defined by this Document.

Specified by:
getAllProperties in interface Document
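
The three property methods describe a map-backed store in which getAllProperties() exposes the underlying map rather than a copy. The sketch below is a minimal illustration of that behaviour under stated assumptions: PropertySketch is a hypothetical class, a HashMap is assumed as the backing map, and the keys "filename" and "x" are example values only.

```java
import java.util.HashMap;
import java.util.Map;

public class PropertySketch {
    // Assumed backing store; the page only states that a Map<String,String> is used.
    private final Map<String, String> props = new HashMap<>();

    public String getProperty(String name) {
        return props.get(name);
    }

    public void setProperty(String name, String value) {
        props.put(name, value);
    }

    /** Returns the underlying map itself, so changes made through it are visible. */
    public Map<String, String> getAllProperties() {
        return props;
    }

    public static void main(String[] args) {
        PropertySketch doc = new PropertySketch();
        doc.setProperty("filename", "a.txt");
        // Writing through the returned map affects later getProperty calls.
        doc.getAllProperties().put("x", "y");
        System.out.println(doc.getProperty("filename") + " " + doc.getProperty("x"));
    }
}
```

Because the map is shared, callers that need a stable snapshot would have to copy it themselves.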


Terrier 3.5. Copyright © 2004-2011 University of Glasgow