Package org.terrier.indexing
Interface Document
All Known Implementing Classes:
FileDocument, FlatJSONDocument, MSExcelDocument, MSPowerPointDocument, MSWordDocument, PDFDocument, POIDocument, TaggedDocument, TwitterJSONDocument
public interface Document
This interface encapsulates the concept of a document during indexing. Implementors of this interface are responsible for parsing and tokenising a document (e.g. parsing the HTML tags and outputting the text terms found).
Author:
Craig Macdonald, Vassilis Plachouras
Method Summary
boolean endOfDocument()
    Returns true when the end of the document has been reached, and there are no other terms to be retrieved from it.
java.util.Map<java.lang.String,java.lang.String> getAllProperties()
    Returns the underlying map of all the properties defined by this Document.
java.util.Set<java.lang.String> getFields()
    Returns the set of fields the current term appears in.
java.lang.String getNextTerm()
    Gets the next term of the document.
java.lang.String getProperty(java.lang.String name)
    Allows access to a named property of the Document.
java.io.Reader getReader()
    Returns a Reader object so client code can tokenise the document or deal with the document itself.
Method Detail
getNextTerm
java.lang.String getNextTerm()
Gets the next term of the document. NB: A null string returned from getNextTerm() should be ignored; it does not signify the lack of any more terms. Use endOfDocument() to check for that.
Returns:
the next term of the document, as a String. Null returns should be ignored.
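The contract above (skip nulls, stop only on endOfDocument()) can be sketched as a simple loop. This is a minimal illustration, not Terrier code: the nested `Document` interface is a stand-in that copies only the two signatures used here so the example compiles without Terrier on the classpath, and `ListDocument` and `collectTerms` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class TermLoopSketch {
    /** Stand-in with the two signatures used here; not the Terrier interface itself. */
    interface Document {
        String getNextTerm();
        boolean endOfDocument();
    }

    /** Hypothetical toy document: yields pre-tokenised terms, some of which may be null. */
    static final class ListDocument implements Document {
        private final Iterator<String> terms;

        ListDocument(List<String> terms) {
            this.terms = terms.iterator();
        }

        public String getNextTerm() {
            return terms.hasNext() ? terms.next() : null;
        }

        public boolean endOfDocument() {
            return !terms.hasNext();
        }
    }

    /** Collects every non-null term; endOfDocument() alone decides when to stop. */
    static List<String> collectTerms(Document doc) {
        List<String> out = new ArrayList<>();
        while (!doc.endOfDocument()) {
            String term = doc.getNextTerm();
            if (term == null) {
                continue; // a null term is skipped, not treated as end of input
            }
            out.add(term);
        }
        return out;
    }

    public static void main(String[] args) {
        Document doc = new ListDocument(Arrays.asList("terrier", null, "indexing"));
        System.out.println(collectTerms(doc)); // prints [terrier, indexing]
    }
}
```

Using a null return as the loop condition would have stopped after the first term in this example, which is the mistake the note above warns against.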
getFields
java.util.Set<java.lang.String> getFields()
Returns the set of fields the current term appears in.
Returns:
a Set of the fields that the current term appears in.
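As an illustration of how getFields() might be combined with the term loop, the sketch below counts how many terms occur in each field. It assumes that "the current term" means the term most recently returned by getNextTerm(); the text above does not spell this out. The nested `Document` interface is a stand-in copying three signatures from this page, and `TaggedToyDocument`, `set` and `countTermsPerField` are hypothetical names.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class FieldCountSketch {
    /** Stand-in with the three signatures used here; not the Terrier interface itself. */
    interface Document {
        String getNextTerm();
        Set<String> getFields();
        boolean endOfDocument();
    }

    /** Hypothetical toy document: terms.get(i) occurs in the fields listed in fields.get(i). */
    static final class TaggedToyDocument implements Document {
        private final List<String> terms;
        private final List<Set<String>> fields;
        private int next = 0;

        TaggedToyDocument(List<String> terms, List<Set<String>> fields) {
            this.terms = terms;
            this.fields = fields;
        }

        public String getNextTerm() {
            return terms.get(next++);
        }

        // Assumption: the "current term" is the one most recently returned.
        public Set<String> getFields() {
            return next == 0 ? Collections.<String>emptySet() : fields.get(next - 1);
        }

        public boolean endOfDocument() {
            return next >= terms.size();
        }
    }

    static Set<String> set(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    /** Counts, for each field name, how many non-null terms were seen in that field. */
    static Map<String, Integer> countTermsPerField(Document doc) {
        Map<String, Integer> counts = new TreeMap<>();
        while (!doc.endOfDocument()) {
            String term = doc.getNextTerm();
            if (term == null) {
                continue; // null terms are ignored, as for getNextTerm()
            }
            for (String field : doc.getFields()) {
                counts.merge(field, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Document doc = new TaggedToyDocument(
            Arrays.asList("terrier", "indexes", "documents"),
            Arrays.asList(set("TITLE"), set("TITLE", "BODY"), set("BODY")));
        System.out.println(countTermsPerField(doc)); // prints {BODY=2, TITLE=2}
    }
}
```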
endOfDocument
boolean endOfDocument()
Returns true when the end of the document has been reached, and there are no other terms to be retrieved from it.
Returns:
true if there are no more terms in the document, otherwise false.
getReader
java.io.Reader getReader()
Returns a Reader object so client code can tokenise the document or deal with the document itself. Examples might include extracting URLs or detecting the language.
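The URL-extraction use case mentioned above could look roughly like the following. This is a sketch under stated assumptions: the nested `Document` interface is a stand-in with only the getReader() signature, `extractUrls` is a hypothetical helper, and the regular expression is deliberately simplistic. Whether the caller should close the Reader is not stated on this page, so the sketch leaves it open.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReaderUrlSketch {
    /** Stand-in with the one signature used here; not the Terrier interface itself. */
    interface Document {
        Reader getReader();
    }

    // Simplistic pattern for illustration only: http(s) followed by non-space,
    // non-quote, non-angle-bracket characters.
    private static final Pattern URL = Pattern.compile("https?://[^\\s\"'<>]+");

    /** Reads the raw document text line by line and returns every URL-like match. */
    static List<String> extractUrls(Document doc) {
        List<String> urls = new ArrayList<>();
        BufferedReader in = new BufferedReader(doc.getReader());
        try {
            String line;
            while ((line = in.readLine()) != null) {
                Matcher m = URL.matcher(line);
                while (m.find()) {
                    urls.add(m.group());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Toy document backed by an in-memory string.
        Document doc = () -> new StringReader(
            "See <a href=\"http://example.org/docs\">the docs</a>\nand https://example.org/a too");
        System.out.println(extractUrls(doc)); // prints [http://example.org/docs, https://example.org/a]
    }
}
```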
getProperty
java.lang.String getProperty(java.lang.String name)
Allows access to a named property of the Document. Examples might be URL, filename etc.
Parameters:
name - Name of the property. It is suggested, but not required, that this name should not be case insensitive.
Since:
1.1.0
getAllProperties
java.util.Map<java.lang.String,java.lang.String> getAllProperties()
Returns the underlying map of all the properties defined by this Document.
Since:
1.1.0
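A short sketch of how getProperty() and getAllProperties() might be used together follows. The nested `Document` interface is a stand-in copying the two signatures from this page, and `MapBackedDocument`, `with` and `describe` are hypothetical names. The property names "filename" and "url" are taken from the examples mentioned for getProperty(); what a real implementation returns for an unknown name is not stated here and is not assumed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class PropertySketch {
    /** Stand-in with the two signatures used here; not the Terrier interface itself. */
    interface Document {
        String getProperty(String name);
        Map<String, String> getAllProperties();
    }

    /** Hypothetical toy document whose properties live in a plain HashMap. */
    static final class MapBackedDocument implements Document {
        private final Map<String, String> properties = new HashMap<>();

        MapBackedDocument with(String name, String value) {
            properties.put(name, value);
            return this;
        }

        public String getProperty(String name) {
            return properties.get(name);
        }

        public Map<String, String> getAllProperties() {
            return properties; // the underlying map, not a copy
        }
    }

    /** Renders all properties in sorted key order, copying so the document's map is untouched. */
    static String describe(Document doc) {
        return new TreeMap<>(doc.getAllProperties()).toString();
    }

    public static void main(String[] args) {
        Document doc = new MapBackedDocument()
            .with("url", "http://example.org/report.pdf")
            .with("filename", "report.pdf");
        System.out.println(doc.getProperty("filename")); // prints report.pdf
        System.out.println(describe(doc)); // prints {filename=report.pdf, url=http://example.org/report.pdf}
    }
}
```

Because getAllProperties() is described as returning the underlying map, the helper copies it before sorting instead of modifying it in place.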