Package org.terrier.indexing
Class FileDocument
- java.lang.Object
  - org.terrier.indexing.FileDocument
- All Implemented Interfaces: Document
- Direct Known Subclasses: PDFDocument, POIDocument
public class FileDocument extends java.lang.Object implements Document
Models a document that corresponds to one file. The first FileDocument.abstract.length characters can be saved as an abstract.
- Author: Craig Macdonald, Vassilis Plachouras, Richard McCreadie, Rodrygo Santos
-
-
Nested Class Summary
Nested Classes (Modifier and Type, Class, Description):
- class FileDocument.ReaderWrapper: A wrapper around the token stream, used to lift the terms from the stream for storage in the abstract.
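To make the idea of saving the first FileDocument.abstract.length characters more concrete, here is a minimal sketch of a reader wrapper that copies the first N characters passing through it. It uses only standard Java I/O. The class name CapturingReader, its methods, and the demo helper abstractOf are hypothetical names invented for this illustration; this is not Terrier's ReaderWrapper, whose implementation is not shown on this page.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical illustration only; NOT Terrier's FileDocument.ReaderWrapper.
class CapturingReader extends FilterReader {
    private final int maxLength;                       // cap on the abstract, in characters
    private final StringBuilder captured = new StringBuilder();

    CapturingReader(Reader in, int maxLength) {
        super(in);
        this.maxLength = maxLength;
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        // Copy the character only while the abstract still has room.
        if (c != -1 && captured.length() < maxLength) {
            captured.append((char) c);
        }
        return c;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int n = super.read(cbuf, off, len);
        if (n > 0) {
            int room = maxLength - captured.length();
            if (room > 0) {
                captured.append(cbuf, off, Math.min(n, room));
            }
        }
        return n;
    }

    // The characters captured so far (at most maxLength of them).
    String getAbstract() {
        return captured.toString();
    }

    // Demo helper: reads the whole text through the wrapper and returns the abstract.
    static String abstractOf(String text, int maxLength) {
        CapturingReader reader = new CapturingReader(new StringReader(text), maxLength);
        char[] buf = new char[8];
        try {
            while (reader.read(buf, 0, buf.length) != -1) {
                // Consumers would tokenise here; the sketch just drains the input.
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return reader.getAbstract();
    }

    public static void main(String[] args) {
        System.out.println(abstractOf("The quick brown fox", 9)); // prints "The quick"
    }
}
```

The wrapper leaves the stream untouched for the consumer, so capturing an abstract does not change what a tokeniser would see.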
-
Field Summary
Fields (Modifier and Type, Field, Description):
- protected int abstractlength: The maximum length of each named abstract, in characters.
- protected java.lang.String abstractname: The names of the abstracts to be saved (comma-separated list).
- protected int abstractwritten: The number of characters currently written.
- protected java.io.Reader br: The input reader.
- protected boolean EOD: End of Document.
- protected java.lang.String filename: The name of the file represented by this document.
- protected java.util.Map<java.lang.String,java.lang.String> fileProperties: The properties of this document, as a map from property name to value.
- protected static org.slf4j.Logger logger
- protected TokenStream tokenStream
-
Constructor Summary
Constructors (Modifier, Constructor, Description):
- protected FileDocument()
- FileDocument(java.io.InputStream docStream, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser tok): Constructs an instance of the FileDocument from the given input stream.
- FileDocument(java.io.Reader docReader, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser tok): Creates a document for a file.
- FileDocument(java.lang.String _filename, java.io.InputStream docStream, Tokeniser tok): Creates a document for a file.
- FileDocument(java.lang.String _filename, java.io.Reader docReader, Tokeniser tok): Creates a document for a file.
-
Method Summary
Methods (Modifier and Type, Method, Description):
- boolean endOfDocument(): Indicates whether the end of a document has been reached.
- java.util.Map<java.lang.String,java.lang.String> getAllProperties(): Returns the underlying map of all the properties defined by this Document.
- java.util.Set<java.lang.String> getFields(): Returns null because there is no support for fields with file documents.
- java.lang.String getNextTerm(): Gets the next term from the Document.
- java.lang.String getProperty(java.lang.String name): Gets a document property.
- java.io.Reader getReader(): Returns the underlying buffered reader, so that client code can tokenise the document itself and handle it as it sees fit.
- protected java.io.Reader getReader(java.io.InputStream docStream): Returns a buffered reader that encapsulates the given input stream.
- protected static java.util.Map<java.lang.String,java.lang.String> makeFilenameProperties(java.lang.String filename)
- void setProperty(java.lang.String name, java.lang.String value): Sets a document property.
-
-
-
Field Detail
-
logger
protected static final org.slf4j.Logger logger
-
br
protected java.io.Reader br
The input reader.
-
EOD
protected boolean EOD
End of Document. Set by the last couple of lines in getNextTerm()
-
fileProperties
protected java.util.Map<java.lang.String,java.lang.String> fileProperties
The properties of this document, as a map from property name to value.
-
filename
protected java.lang.String filename
The name of the file represented by this document.
-
tokenStream
protected TokenStream tokenStream
-
abstractname
protected final java.lang.String abstractname
The names of the abstracts to be saved (comma separated list)
-
abstractlength
protected final int abstractlength
The maximum length of each named abstract, in characters.
-
abstractwritten
protected int abstractwritten
The number of characters currently written
-
-
Constructor Detail
-
FileDocument
protected FileDocument()
-
FileDocument
public FileDocument(java.lang.String _filename, java.io.Reader docReader, Tokeniser tok)
Creates a document for a file.
- Parameters:
_filename
docReader
tok
-
FileDocument
public FileDocument(java.lang.String _filename, java.io.InputStream docStream, Tokeniser tok)
Creates a document for a file.
- Parameters:
_filename
docStream
tok
-
FileDocument
public FileDocument(java.io.Reader docReader, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser tok)
Creates a document for a file.
- Parameters:
docReader
docProperties
tok
-
FileDocument
public FileDocument(java.io.InputStream docStream, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser tok)
Constructs an instance of the FileDocument from the given input stream.
- Parameters:
docStream - the input stream that reads the file.
-
-
Method Detail
-
makeFilenameProperties
protected static java.util.Map<java.lang.String,java.lang.String> makeFilenameProperties(java.lang.String filename)
-
getReader
public java.io.Reader getReader()
Returns the underlying buffered reader, so that client code can tokenise the document itself and handle it as it sees fit.
-
getReader
protected java.io.Reader getReader(java.io.InputStream docStream)
Returns a buffered reader that encapsulates the given input stream.
- Parameters:
docStream - an input stream that we want to access as a buffered reader.
- Returns: the buffered reader that encapsulates the given input stream.
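As a rough illustration of what "a buffered reader that encapsulates the given input stream" can look like, the sketch below wraps an InputStream using only standard Java I/O. The class ReaderSketch and its methods are hypothetical names for this example, and the choice of UTF-8 is an assumption: this page does not say which charset or which concrete reader class FileDocument uses.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Terrier.
class ReaderSketch {

    // Wraps a byte stream in a buffered character reader.
    // Assumption: UTF-8 decoding; FileDocument's real charset handling is not documented here.
    static Reader toBufferedReader(InputStream docStream) {
        return new BufferedReader(new InputStreamReader(docStream, StandardCharsets.UTF_8));
    }

    // Reads everything from the reader into a String, for demonstration purposes.
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        try {
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(
                "hello terrier".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(toBufferedReader(in))); // prints "hello terrier"
    }
}
```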
-
getNextTerm
public java.lang.String getNextTerm()
Gets the next term from the Document.
- Specified by: getNextTerm in interface Document
- Returns: String the next term of the document. Null returns should be ignored.
-
getFields
public java.util.Set<java.lang.String> getFields()
Returns null because there is no support for fields with file documents.
-
endOfDocument
public boolean endOfDocument()
Indicates whether the end of a document has been reached.
- Specified by: endOfDocument in interface Document
- Returns: boolean true if the end of a document has been reached; otherwise false.
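The contract described by getNextTerm() and endOfDocument() (keep asking for terms until the end is reached, and ignore null returns) can be sketched as the loop below. To keep the example self-contained, it uses a stand-in interface TermSource and a toy implementation ToyTermSource; both are hypothetical and are not Terrier's Document or FileDocument. The toy tokenisation rule (whitespace splitting, null for tokens without letters or digits) is likewise an assumption chosen only to produce some null returns.

```java
import java.util.ArrayList;
import java.util.List;

class TermLoopSketch {
    // Drains a term source, skipping null terms as the documentation advises.
    static List<String> collectTerms(TermSource doc) {
        List<String> terms = new ArrayList<>();
        while (!doc.endOfDocument()) {
            String term = doc.getNextTerm();
            if (term != null) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(collectTerms(new ToyTermSource("Hello -- World"))); // prints [Hello, World]
    }
}

// Stand-in for the two Document methods discussed above; NOT Terrier's interface.
interface TermSource {
    String getNextTerm();
    boolean endOfDocument();
}

// Toy implementation: whitespace-separated tokens, with null returned for
// tokens that contain no letter or digit (to mimic skippable null terms).
class ToyTermSource implements TermSource {
    private final String[] tokens;
    private int pos = 0;

    ToyTermSource(String text) {
        String trimmed = text.trim();
        this.tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    @Override
    public String getNextTerm() {
        if (pos >= tokens.length) {
            return null;
        }
        String token = tokens[pos++];
        return token.chars().anyMatch(Character::isLetterOrDigit) ? token : null;
    }

    @Override
    public boolean endOfDocument() {
        return pos >= tokens.length;
    }
}
```

Checking endOfDocument() before each call and testing for null inside the loop keeps the caller correct regardless of how often the source returns null.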
-
getProperty
public java.lang.String getProperty(java.lang.String name)
Gets a document property.
- Specified by: getProperty in interface Document
- Parameters:
name - Name of the property. It is suggested, but not required, that this name should not be case insensitive.
-
setProperty
public void setProperty(java.lang.String name, java.lang.String value)
Sets a document property.
-
getAllProperties
public java.util.Map<java.lang.String,java.lang.String> getAllProperties()
Returns the underlying map of all the properties defined by this Document.
- Specified by: getAllProperties in interface Document
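The three property accessors above can be pictured as thin wrappers around a single map. The sketch below is a hypothetical stand-in (PropertyHolder is not a Terrier class) that assumes a plain HashMap with no key normalisation; whether FileDocument normalises case or pre-populates any properties is not stated on this page. It also shows one consequence of returning the underlying map rather than a copy: writes made through the returned map are visible through getProperty.

```java
import java.util.HashMap;
import java.util.Map;

class PropertySketch {
    public static void main(String[] args) {
        PropertyHolder doc = new PropertyHolder();
        doc.setProperty("filename", "/tmp/example.txt");   // illustrative key and value
        System.out.println(doc.getProperty("filename"));

        // Because the underlying map is returned, this write is seen by getProperty.
        doc.getAllProperties().put("language", "en");
        System.out.println(doc.getProperty("language"));
    }
}

// Hypothetical stand-in for the property accessors described above; NOT Terrier code.
class PropertyHolder {
    private final Map<String, String> properties = new HashMap<>();

    String getProperty(String name) {
        return properties.get(name);   // null when the property is not set
    }

    void setProperty(String name, String value) {
        properties.put(name, value);
    }

    // Returns the map itself, not a defensive copy, matching the wording above.
    Map<String, String> getAllProperties() {
        return properties;
    }
}
```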
-
-