org.terrier.indexing
Interface Document

All Known Implementing Classes:
FileDocument, HTMLDocument, MSExcelDocument, MSPowerpointDocument, MSWordDocument, PDFDocument, TaggedDocument, TRECDocument

public interface Document

This interface encapsulates the concept of a document during indexing. Implementors of this interface are responsible for parsing and tokenising a document (e.g. parsing the HTML tags and outputting the text terms found).
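To make the implementor's responsibilities concrete, here is a minimal sketch of an implementation. It is not taken from Terrier: `DocumentLike` is a hypothetical local mirror of the six methods listed on this page (so the example compiles without Terrier on the classpath), and `SplitDocument` is an illustrative class that tokenises on non-word characters. Returning null for empty tokens is an assumption chosen to show why callers must tolerate null terms.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.Collections;
import java.util.Map;
import java.util.Set;

// Hypothetical local mirror of the methods documented on this page.
interface DocumentLike {
    String getNextTerm();
    Set<String> getFields();
    boolean endOfDocument();
    Reader getReader();
    String getProperty(String name);
    Map<String, String> getAllProperties();
}

// Illustrative implementation: splits the text on runs of non-word characters.
class SplitDocument implements DocumentLike {
    private final String text;
    private final String[] tokens;
    private final Map<String, String> properties;
    private int next = 0; // index of the token returned by the next call

    SplitDocument(String text, Map<String, String> properties) {
        this.text = text;
        this.tokens = text.split("\\W+");
        this.properties = properties;
    }

    public String getNextTerm() {
        if (next >= tokens.length) {
            return null;
        }
        String token = tokens[next++];
        // An empty token (e.g. from leading punctuation) is reported as null.
        return token.isEmpty() ? null : token;
    }

    // This toy document has no field structure, so the set is always empty.
    public Set<String> getFields() { return Collections.emptySet(); }

    public boolean endOfDocument() { return next >= tokens.length; }

    public Reader getReader() { return new StringReader(text); }

    public String getProperty(String name) { return properties.get(name); }

    public Map<String, String> getAllProperties() { return properties; }
}
```

The real implementing classes listed above handle markup, encodings and fields; this sketch only shows how the six methods fit together.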

Author:
Craig Macdonald, Vassilis Plachouras

Method Summary
 boolean endOfDocument()
          Returns true when the end of the document has been reached, and there are no other terms to be retrieved from it.
 java.util.Map<java.lang.String,java.lang.String> getAllProperties()
          Returns the underlying map of all the properties defined by this Document.
 java.util.Set<java.lang.String> getFields()
          Returns the set of fields the current term appears in.
 java.lang.String getNextTerm()
          Gets the next term of the document.
 java.lang.String getProperty(java.lang.String name)
          Allows access to a named property of the Document.
 java.io.Reader getReader()
          Returns a Reader object so client code can tokenise the document or deal with the document itself.
 

Method Detail

getNextTerm

java.lang.String getNextTerm()
Gets the next term of the document. NB: A null returned from getNextTerm() should be ignored; it does not signify that there are no more terms. Use endOfDocument() to check for that.

Returns:
String the next term of the document. Null returns should be ignored.
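
The contract above suggests a consumer loop that terminates on endOfDocument() and skips null terms. The following is a minimal sketch of that loop. `TermSource` and `ListTermSource` are hypothetical stand-ins declared locally so that the example is self-contained; they are not part of Terrier.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the two Document methods used by the loop.
interface TermSource {
    String getNextTerm();
    boolean endOfDocument();
}

// Toy source backed by a list; the list may contain nulls to mimic dropped tokens.
class ListTermSource implements TermSource {
    private final Iterator<String> it;

    ListTermSource(List<String> terms) { this.it = terms.iterator(); }

    public String getNextTerm() { return it.hasNext() ? it.next() : null; }

    public boolean endOfDocument() { return !it.hasNext(); }
}

class TermLoop {
    // Collects every non-null term. Only endOfDocument() decides when to stop.
    static List<String> collectTerms(TermSource doc) {
        List<String> terms = new ArrayList<>();
        while (!doc.endOfDocument()) {
            String term = doc.getNextTerm();
            if (term == null) {
                continue; // null is not an end marker, so skip it and keep reading
            }
            terms.add(term);
        }
        return terms;
    }
}
```

A loop written as `while ((term = doc.getNextTerm()) != null)` would stop at the first null and could silently drop the rest of the document.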

getFields

java.util.Set<java.lang.String> getFields()
Returns the set of fields the current term appears in.

Returns:
a set of the fields that the current term appears in.
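
As an illustration of how getFields() might be combined with getNextTerm(), the sketch below counts term occurrences per field. It assumes that "the current term" means the term most recently returned by getNextTerm(); this page does not state that explicitly. `FieldedTermSource` and `ToyFieldedDocument` are hypothetical names declared locally, not Terrier classes.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical stand-in for the three Document methods used here.
interface FieldedTermSource {
    String getNextTerm();
    boolean endOfDocument();
    Set<String> getFields();
}

// Toy document: terms[i] appears in the fields listed in fields.get(i).
class ToyFieldedDocument implements FieldedTermSource {
    private final String[] terms;
    private final List<Set<String>> fields;
    private int pos = -1; // index of the current (most recently returned) term

    ToyFieldedDocument(String[] terms, List<Set<String>> fields) {
        this.terms = terms;
        this.fields = fields;
    }

    public String getNextTerm() { pos++; return terms[pos]; }

    public boolean endOfDocument() { return pos + 1 >= terms.length; }

    public Set<String> getFields() { return fields.get(pos); }
}

class FieldCounter {
    // Returns, for each field name, how many non-null terms appeared in it.
    static Map<String, Integer> countByField(FieldedTermSource doc) {
        Map<String, Integer> counts = new TreeMap<>();
        while (!doc.endOfDocument()) {
            String term = doc.getNextTerm();
            if (term == null) {
                continue;
            }
            // Assumption: getFields() describes the term just returned.
            for (String field : doc.getFields()) {
                counts.merge(field, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```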

endOfDocument

boolean endOfDocument()
Returns true when the end of the document has been reached, and there are no other terms to be retrieved from it.

Returns:
boolean true if there are no more terms in the document, otherwise it returns false.

getReader

java.io.Reader getReader()
Returns a Reader object so client code can tokenise the document or deal with the document itself. Examples include extracting URLs or detecting the language.
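
As a sketch of the URL-extraction use case mentioned above, the following reads the whole Reader into memory and applies a simple regular expression. `ReadableDocument` and `UrlExtractor` are hypothetical names, and the URL pattern is a deliberately crude assumption rather than anything Terrier provides. The sketch does not close the Reader, because this page does not say who owns it.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the single Document method used here.
interface ReadableDocument {
    Reader getReader();
}

class UrlExtractor {
    // Crude pattern: http(s):// followed by anything up to whitespace, a quote or an angle bracket.
    private static final Pattern URL = Pattern.compile("https?://[^\\s\"'<>]+");

    static List<String> extractUrls(ReadableDocument doc) {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4096];
        Reader reader = doc.getReader();
        try {
            int read;
            while ((read = reader.read(buffer)) != -1) {
                text.append(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> urls = new ArrayList<>();
        Matcher matcher = URL.matcher(text);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }
}
```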


getProperty

java.lang.String getProperty(java.lang.String name)
Allows access to a named property of the Document. Examples include the URL or the filename.

Parameters:
name - Name of the property. It is suggested, but not required, that this name should not be case-insensitive.
Since:
1.1.0

getAllProperties

java.util.Map<java.lang.String,java.lang.String> getAllProperties()
Returns the underlying map of all the properties defined by this Document.

Since:
1.1.0
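
The two property accessors could be used together roughly as follows. This is a sketch under stated assumptions: `PropertySource`, `MapPropertySource` and `PropertyUtil` are hypothetical local names, and the null-for-missing behaviour of getProperty() is assumed, since this page does not specify what is returned for an undefined property.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the two property methods of Document.
interface PropertySource {
    String getProperty(String name);
    Map<String, String> getAllProperties();
}

// Toy implementation backed by a HashMap.
class MapPropertySource implements PropertySource {
    private final Map<String, String> props = new HashMap<>();

    MapPropertySource with(String name, String value) {
        props.put(name, value);
        return this;
    }

    public String getProperty(String name) { return props.get(name); }

    public Map<String, String> getAllProperties() { return props; }
}

class PropertyUtil {
    // Assumption: an undefined property is reported as null.
    static String propertyOrDefault(PropertySource doc, String name, String fallback) {
        String value = doc.getProperty(name);
        return value != null ? value : fallback;
    }

    // Copies into a TreeMap for a stable order without modifying the underlying map.
    static String describe(PropertySource doc) {
        return new TreeMap<>(doc.getAllProperties()).toString();
    }
}
```

Since getAllProperties() is documented as returning the underlying map, copying it before sorting or modifying avoids changing the Document's own state.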


Terrier 3.5. Copyright © 2004-2011 University of Glasgow