Terrier IR Platform
1.1.1

uk.ac.gla.terrier.indexing
Class TRECDocument

java.lang.Object
  extended by uk.ac.gla.terrier.indexing.TRECDocument
All Implemented Interfaces:
Document

public class TRECDocument
extends java.lang.Object
implements Document

Models a document in a TREC collection. This class uses two properties: the integer property string.byte.length, which sets the maximum length of a term in characters (default 20), and the boolean property lowercase, which specifies whether characters are converted to lowercase (default true).
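As a rough illustration of the two properties described above, the following sketch reads them with their documented defaults and applies them to a term. It is not Terrier code: TermSettings and its methods are hypothetical names, java.util.Properties stands in for however Terrier actually loads its configuration, and the sketch deliberately does not say what the real class does with a term that is too long.

```java
import java.util.Locale;
import java.util.Properties;

// Hypothetical holder for the two settings; not part of Terrier.
final class TermSettings {
    final int maxTermLength;  // value of "string.byte.length", default 20
    final boolean lowercase;  // value of "lowercase", default true

    TermSettings(int maxTermLength, boolean lowercase) {
        this.maxTermLength = maxTermLength;
        this.lowercase = lowercase;
    }

    // Reads both properties, falling back to the documented defaults.
    static TermSettings fromProperties(Properties props) {
        int max = Integer.parseInt(
                props.getProperty("string.byte.length", "20").trim());
        boolean lower = Boolean.parseBoolean(
                props.getProperty("lowercase", "true").trim());
        return new TermSettings(max, lower);
    }

    // Lowercases the term only when the lowercase setting is enabled.
    String normalise(String term) {
        return lowercase ? term.toLowerCase(Locale.ROOT) : term;
    }

    // True when the term is longer than the configured maximum.
    boolean exceedsMaxLength(String term) {
        return term.length() > maxTermLength;
    }
}
```

With an empty Properties object, fromProperties yields a maximum length of 20 and lowercasing enabled, matching the defaults stated above.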

Version:
$Revision: 1.29 $
Author:
Craig Macdonald & Vassilis Plachouras

Constructor Summary
TRECDocument(java.io.Reader docReader, java.util.Map docProperties)
          Constructs an instance of the class from the given reader object.
TRECDocument(java.io.Reader docReader, java.util.Map docProperties, TagSet _tags, TagSet _exact, TagSet _fields)
          Constructs an instance of the class from the given reader object.
 
Method Summary
static void dumpDocument(Document d)
          Dumps a document to stdout
 boolean endOfDocument()
          Indicates whether the tokenizer has reached the end of the current document.
static Document generateDocumentFromFile(java.lang.String filename)
          Instantiates a TREC document from a file.
 java.util.Map<java.lang.String,java.lang.String> getAllProperties()
          Returns the underlying map of all the properties defined by this Document.
 java.util.Set<java.lang.String> getFields()
          Returns the fields in which the current term appears.
 java.lang.String getNextTerm()
          Returns the next term from a document.
 java.lang.String getProperty(java.lang.String name)
          Allows access to a named property of the Document.
 java.io.Reader getReader()
          Returns the underlying buffered reader, so that client code can tokenise the document itself and process it as it sees fit.
static void main(java.lang.String[] args)
          Static method which dumps a document to System.out
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

TRECDocument

public TRECDocument(java.io.Reader docReader,
                    java.util.Map docProperties)
Constructs an instance of the class from the given reader object.

Parameters:
docReader - Reader the stream from the collection that ends at the end of the current document.

TRECDocument

public TRECDocument(java.io.Reader docReader,
                    java.util.Map docProperties,
                    TagSet _tags,
                    TagSet _exact,
                    TagSet _fields)
Constructs an instance of the class from the given reader object. The tags to process, the exact tags and the field tags are passed as parameters to the constructor.

Parameters:
docReader - Reader the stream from the collection that ends at the end of the current document.
_tags - TagSet the tags of the document to process or ignore.
_exact - TagSet the tags of the document to process exactly.
_fields - TagSet the tags of the documents to be processed as fields.

Method Detail

getReader

public java.io.Reader getReader()
Returns the underlying buffered reader, so that client code can tokenise the document itself and process it as it sees fit.
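A minimal sketch of what client-side tokenising over such a reader might look like. ReaderTokeniser is a hypothetical helper, and a StringReader stands in for the Reader that getReader() would return. The sketch splits on anything that is not a letter or digit and treats markup as ordinary text, which is far simpler than whatever tag handling TRECDocument itself performs.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of Terrier.
final class ReaderTokeniser {

    // Splits the character stream into runs of letters and digits.
    static List<String> tokenise(Reader reader) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try {
            int ch;
            while ((ch = reader.read()) != -1) {
                if (Character.isLetterOrDigit(ch)) {
                    current.append((char) ch);
                } else if (current.length() > 0) {
                    // A separator ends the term collected so far.
                    terms.add(current.toString());
                    current.setLength(0);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Flush the last term if the stream ended without a separator.
        if (current.length() > 0) {
            terms.add(current.toString());
        }
        return terms;
    }

    public static void main(String[] args) {
        // Stand-in for the Reader obtained from getReader().
        Reader reader = new StringReader("Terrier IR, 2007");
        System.out.println(tokenise(reader));
    }
}
```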

Specified by:
getReader in interface Document

getNextTerm

public java.lang.String getNextTerm()
Returns the next term from a document.

Specified by:
getNextTerm in interface Document
Returns:
String the next term of the document, or null if the term was discarded during tokenising.
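Because a null return means a discarded term rather than the end of the document, a consuming loop would typically test endOfDocument() and skip nulls. The sketch below shows that pattern against a hypothetical stand-in: TermStream mirrors only the two relevant methods of the Document interface, and ListTermStream is an in-memory substitute for a real TRECDocument.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in mirroring two methods of the Document interface.
interface TermStream {
    String getNextTerm();
    boolean endOfDocument();
}

// In-memory stand-in; null entries imitate terms discarded during tokenising.
final class ListTermStream implements TermStream {
    private final Iterator<String> it;

    ListTermStream(String... terms) {
        this.it = Arrays.asList(terms).iterator();
    }

    public String getNextTerm() {
        return it.hasNext() ? it.next() : null;
    }

    public boolean endOfDocument() {
        return !it.hasNext();
    }
}

final class TermLoop {
    // Collects every non-null term until the end of the document.
    static List<String> collectTerms(TermStream doc) {
        List<String> terms = new ArrayList<>();
        while (!doc.endOfDocument()) {
            String term = doc.getNextTerm();
            if (term == null) {
                continue; // discarded term, not the end of the document
            }
            terms.add(term);
        }
        return terms;
    }
}
```

For example, collectTerms(new ListTermStream("terrier", null, "retrieval")) skips the null entry and keeps the two remaining terms.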

getFields

public java.util.Set<java.lang.String> getFields()
Returns the fields in which the current term appears.

Specified by:
getFields in interface Document
Returns:
HashSet a hash set containing the fields in which the current term appears.
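One way getFields() could be used is to tally how many terms occur in each field. The sketch below assumes that getFields() describes the term most recently returned by getNextTerm(); the page does not spell this out. FieldedTerms, InMemoryFieldedTerms and FieldCounter are hypothetical names, and the in-memory class stands in for a real TRECDocument.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical stand-in mirroring three methods of the Document interface.
interface FieldedTerms {
    String getNextTerm();
    boolean endOfDocument();
    Set<String> getFields();
}

// In-memory stand-in: terms[i] appears in the fields listed in fields.get(i).
final class InMemoryFieldedTerms implements FieldedTerms {
    private final String[] terms;
    private final List<Set<String>> fields;
    private int pos = -1;

    InMemoryFieldedTerms(String[] terms, List<Set<String>> fields) {
        this.terms = terms;
        this.fields = fields;
    }

    public String getNextTerm() {
        pos++;
        return terms[pos];
    }

    public boolean endOfDocument() {
        return pos + 1 >= terms.length;
    }

    // Assumption: only valid after at least one call to getNextTerm().
    public Set<String> getFields() {
        return fields.get(pos);
    }
}

final class FieldCounter {
    // Counts, for each field name, how many terms appeared in that field.
    static Map<String, Integer> countTermsPerField(FieldedTerms doc) {
        Map<String, Integer> counts = new TreeMap<>();
        while (!doc.endOfDocument()) {
            String term = doc.getNextTerm();
            if (term == null) {
                continue; // discarded term
            }
            for (String field : doc.getFields()) {
                counts.merge(field, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```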

endOfDocument

public boolean endOfDocument()
Indicates whether the tokenizer has reached the end of the current document.

Specified by:
endOfDocument in interface Document
Returns:
boolean true if the end of the current document has been reached, otherwise returns false.

getProperty

public java.lang.String getProperty(java.lang.String name)
Allows access to a named property of the Document. Examples include the URL or the filename.

Specified by:
getProperty in interface Document
Parameters:
name - Name of the property. It is suggested, but not required, that this name should not be case insensitive.
Since:
1.1.0

getAllProperties

public java.util.Map<java.lang.String,java.lang.String> getAllProperties()
Returns the underlying map of all the properties defined by this Document.

Specified by:
getAllProperties in interface Document
Since:
1.1.0

main

public static void main(java.lang.String[] args)
Static method which dumps a document to System.out

Parameters:
args - A filename to parse

generateDocumentFromFile

public static Document generateDocumentFromFile(java.lang.String filename)
Instantiates a TREC document from a file.


dumpDocument

public static void dumpDocument(Document d)
Dumps a document to stdout

Parameters:
d - a Document object

Terrier Information Retrieval Platform 1.1.1. Copyright 2004-2007 University of Glasgow