org.terrier.indexing
Class TaggedDocument

java.lang.Object
  extended by org.terrier.indexing.TaggedDocument
All Implemented Interfaces:
Document
Direct Known Subclasses:
HTMLDocument, TRECDocument

public class TaggedDocument
extends java.lang.Object
implements Document

Models a tagged document (e.g., an HTML or TREC document). In particular, getNextTerm() returns the next token in the current chunk of text, according to the specified tokeniser. This class replaces HTMLDocument and TRECDocument. This class uses the following properties:

Since:
3.5
Author:
Craig Macdonald, Vassilis Plachouras, Richard McCreadie, Rodrygo Santos

Field Summary
protected  TagSet _exact
          The tags to process exactly.
protected  TagSet _fields
          The tags to consider as fields.
protected  TagSet _tags
          The tags to process or skip.
protected  int abstractCount
          The number of abstract types.
protected  int[] abstractlengths
          The maximum length of each named abstract (comma separated list)
protected  java.lang.String[] abstractnames
          The names of the abstracts to be saved (comma separated list)
protected  java.lang.StringBuilder[] abstracts
          The builders that accumulate the text of each abstract.
protected  java.lang.String[] abstracttags
          The fields that the named abstracts come from (comma separated list)
protected  boolean abstractTagsCaseSensitive
           
protected  java.io.Reader br
          The input reader.
protected  long counter
          The number of bytes read from the input.
protected  TokenStream currentTokenStream
           
protected  int elseAbstractSpecialTag
          The index of the special 'ELSE' abstract tag.
protected  boolean EOD
          End of Document.
protected  boolean error
          Indicates whether an error has occurred.
protected  java.util.Set<java.lang.String> htmlStk
          The hash set where the tags, considered as fields, are inserted.
protected  boolean inHtmlTagToProcess
          Specifies whether the tokeniser is in a field tag to process.
protected  boolean inTagToProcess
          Indicates whether we are in a tag to process.
protected  boolean inTagToSkip
          Indicates whether we are in a tag to skip.
protected  int lastChar
          Saves the last read character between consecutive calls of getNextTerm().
protected static org.apache.log4j.Logger logger
           
protected static boolean lowercase
          Change to lowercase?
protected static int maxNumOfDigitsPerTerm
          The maximum number of digits that are allowed in valid terms.
protected static int maxNumOfSameConseqLettersPerTerm
          The maximum number of consecutive same letters or digits that are allowed in valid terms.
protected  java.util.Map<java.lang.String,java.lang.String> properties
           
protected  java.util.Stack<java.lang.String> stk
          The stack where the tags are pushed and popped accordingly.
protected  java.lang.String[] stringArray
          A temporary String array
protected  java.lang.StringBuilder sw
           
protected  java.lang.StringBuilder tagNameSB
           
protected  Tokeniser tokeniser
           
protected static int tokenMaximumLength
          The maximum length of a token in the check method.
 
Constructor Summary
TaggedDocument(java.io.InputStream docStream, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser _tokeniser)
          Constructs an instance of the class from the given input stream.
TaggedDocument(java.io.Reader docReader, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser _tokeniser)
          Constructs an instance of the class from the given reader object.
 
Method Summary
static java.lang.String check(java.lang.String s)
          Checks that a term is shorter than the maximum allowed length and does not contain too many digits or too many consecutive identical digits or letters.
static void dumpDocument(Document d)
          Dumps a document to stdout
 boolean endOfDocument()
          Indicates whether the tokeniser has reached the end of the current document.
static Document generateDocumentFromFile(java.lang.String filename)
          Instantiates a TREC document from a file.
 java.util.Map<java.lang.String,java.lang.String> getAllProperties()
          Returns the underlying map of all the properties defined by this Document.
 java.util.Set<java.lang.String> getFields()
          Returns the fields in which the current term appears.
 java.lang.String getNextTerm()
          Returns the next token from the current chunk of text, extracted from the document into a TokenStream.
 java.lang.String getProperty(java.lang.String name)
          Allows access to a named property of the Document.
 java.io.Reader getReader()
          Returns the underlying buffered reader, so that client code can tokenise the document itself and process it as it sees fit.
static void main(java.lang.String[] args)
          Static method which dumps a document to System.out
protected  void processEndOfDocument()
           
protected  void processEndOfTag(java.lang.String tag)
          The encountered tag, which must be a closing tag, is matched against the tag on top of the stack.
protected  void saveToAbstract(java.lang.String text, java.lang.String tag)
          This method takes the text parsed from a tag and then saves it to the abstract(s).
 void setProperty(java.lang.String name, java.lang.String value)
          Allows a named property to be added to the Document.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

logger

protected static final org.apache.log4j.Logger logger

tokenMaximumLength

protected static final int tokenMaximumLength
The maximum length of a token in the check method.


lowercase

protected static final boolean lowercase
Change to lowercase?


stringArray

protected final java.lang.String[] stringArray
A temporary String array


br

protected java.io.Reader br
The input reader.


EOD

protected boolean EOD
End of Document. Set by the last couple of lines in getNextTerm()


counter

protected long counter
The number of bytes read from the input.


lastChar

protected int lastChar
Saves the last read character between consecutive calls of getNextTerm().


error

protected boolean error
Indicates whether an error has occurred.


_tags

protected TagSet _tags
The tags to process or skip.


_exact

protected TagSet _exact
The tags to process exactly. For these tags, the check() method is not applied.


_fields

protected TagSet _fields
The tags to consider as fields.


stk

protected java.util.Stack<java.lang.String> stk
The stack where the tags are pushed and popped accordingly.


inTagToProcess

protected boolean inTagToProcess
Indicates whether we are in a tag to process.


inTagToSkip

protected boolean inTagToSkip
Indicates whether we are in a tag to skip.


htmlStk

protected java.util.Set<java.lang.String> htmlStk
The hash set where the tags, considered as fields, are inserted.


inHtmlTagToProcess

protected boolean inHtmlTagToProcess
Specifies whether the tokeniser is in a field tag to process.


properties

protected java.util.Map<java.lang.String,java.lang.String> properties

tokeniser

protected Tokeniser tokeniser

currentTokenStream

protected TokenStream currentTokenStream

abstractnames

protected final java.lang.String[] abstractnames
The names of the abstracts to be saved (comma separated list)


abstracttags

protected final java.lang.String[] abstracttags
The fields that the named abstracts come from (comma separated list)


abstractlengths

protected final int[] abstractlengths
The maximum length of each named abstract (comma separated list)


abstractTagsCaseSensitive

protected final boolean abstractTagsCaseSensitive

abstractCount

protected final int abstractCount
The number of abstract types.


abstracts

protected final java.lang.StringBuilder[] abstracts
The builders that accumulate the text of each abstract.


elseAbstractSpecialTag

protected int elseAbstractSpecialTag
The index of the special 'ELSE' abstract tag.


sw

protected final java.lang.StringBuilder sw

tagNameSB

protected final java.lang.StringBuilder tagNameSB

maxNumOfDigitsPerTerm

protected static final int maxNumOfDigitsPerTerm
The maximum number of digits that are allowed in valid terms.

See Also:
Constant Field Values

maxNumOfSameConseqLettersPerTerm

protected static final int maxNumOfSameConseqLettersPerTerm
The maximum number of consecutive same letters or digits that are allowed in valid terms.

See Also:
Constant Field Values

Constructor Detail

TaggedDocument

public TaggedDocument(java.io.InputStream docStream,
                      java.util.Map<java.lang.String,java.lang.String> docProperties,
                      Tokeniser _tokeniser)
Constructs an instance of the class from the given input stream.

Parameters:
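docStream - the input stream to read the document from
docProperties - the initial properties of the document
_tokeniser - the tokeniser used to extract terms from the text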

TaggedDocument

public TaggedDocument(java.io.Reader docReader,
                      java.util.Map<java.lang.String,java.lang.String> docProperties,
                      Tokeniser _tokeniser)
Constructs an instance of the class from the given reader object.

Parameters:
docReader - Reader the stream from the collection that ends at the end of the current document.

Method Detail

getReader

public java.io.Reader getReader()
Returns the underlying buffered reader, so that client code can tokenise the document itself and process it as it sees fit.

Specified by:
getReader in interface Document

getNextTerm

public java.lang.String getNextTerm()
Returns the next token from the current chunk of text, extracted from the document into a TokenStream.

Specified by:
getNextTerm in interface Document
Returns:
String the next token of the document, or null if the token was discarded during tokenisation.
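
Because getNextTerm() may return null for a discarded token, a caller cannot treat null as the end of the document and should consult endOfDocument() instead. The following is a minimal sketch of such a loop, not Terrier code: TermSource is a hypothetical stand-in for the two Document methods involved, and fromTokens is a toy source invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class TermLoopSketch {
    /** Hypothetical stand-in for the two Document methods used by the loop. */
    interface TermSource {
        String getNextTerm();
        boolean endOfDocument();
    }

    /** Collects every non-null term until the source reports end of document. */
    static List<String> collectTerms(TermSource doc) {
        List<String> terms = new ArrayList<>();
        while (!doc.endOfDocument()) {
            String term = doc.getNextTerm();
            if (term == null) {
                continue; // token was discarded; keep reading
            }
            terms.add(term);
        }
        return terms;
    }

    /** Toy source over a fixed list; a null entry models a discarded token. */
    static TermSource fromTokens(String... tokens) {
        final Iterator<String> it = Arrays.asList(tokens).iterator();
        return new TermSource() {
            @Override
            public String getNextTerm() {
                return it.hasNext() ? it.next() : null;
            }

            @Override
            public boolean endOfDocument() {
                return !it.hasNext();
            }
        };
    }
}
```

With a real TaggedDocument the same loop shape would apply to the Document instance directly.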

processEndOfDocument

protected void processEndOfDocument()

saveToAbstract

protected void saveToAbstract(java.lang.String text,
                              java.lang.String tag)
This method takes the text parsed from a tag and then saves it to the abstract(s). It contains the logic that decides whether the text, or some subset of it, should be saved. The default behaviour checks each abstract named in TaggedDocument.abstracts: if the current tag is the one specified for that abstract in TaggedDocument.abstracts.tags, the text is saved, up to the maximum character length specified in TaggedDocument.abstracts.lengths. The 'ELSE' abstract tag is a special case: that abstract is filled with the text of any tag that is not added to another abstract. To save abstracts in a different manner, e.g. saving the first paragraph, subclass TaggedDocument and override this method.

Parameters:
text - the text to be saved
tag - the tag that this text came from
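
The default behaviour described for saveToAbstract can be approximated as follows. This is a standalone sketch under stated assumptions, not the Terrier implementation: the class AbstractSaverSketch and its methods are hypothetical, tag matching is assumed to be case-insensitive, and no separator is inserted between successive chunks of text.

```java
final class AbstractSaverSketch {
    private final String[] names;       // abstract names
    private final String[] tags;        // source tag per abstract, or "ELSE"
    private final int[] maxLengths;     // maximum characters per abstract
    private final StringBuilder[] builders;

    AbstractSaverSketch(String[] names, String[] tags, int[] maxLengths) {
        this.names = names.clone();
        this.tags = tags.clone();
        this.maxLengths = maxLengths.clone();
        this.builders = new StringBuilder[names.length];
        for (int i = 0; i < builders.length; i++) {
            builders[i] = new StringBuilder();
        }
    }

    /** Saves text to every abstract bound to this tag, else to the ELSE abstract. */
    void save(String text, String tag) {
        boolean saved = false;
        int elseIndex = -1;
        for (int i = 0; i < names.length; i++) {
            if ("ELSE".equals(tags[i])) {
                elseIndex = i;
            } else if (tags[i].equalsIgnoreCase(tag)) { // assumption: case-insensitive
                append(i, text);
                saved = true;
            }
        }
        if (!saved && elseIndex >= 0) {
            append(elseIndex, text);
        }
    }

    /** Appends only as much text as the abstract's length limit still allows. */
    private void append(int i, String text) {
        int remaining = maxLengths[i] - builders[i].length();
        if (remaining > 0) {
            builders[i].append(text, 0, Math.min(remaining, text.length()));
        }
    }

    /** Returns the saved abstract with the given name, or null if unknown. */
    String get(String name) {
        for (int i = 0; i < names.length; i++) {
            if (names[i].equals(name)) {
                return builders[i].toString();
            }
        }
        return null;
    }
}
```

A subclass that wanted a different policy, such as keeping only the first paragraph, would replace the body of save with its own decision logic.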

getFields

public java.util.Set<java.lang.String> getFields()
Returns the fields in which the current term appears.

Specified by:
getFields in interface Document
Returns:
HashSet a hashset containing the fields that the current term appears in.
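
One way to picture the set returned by getFields() is as the collection of field tags that are currently open when a term is read. The sketch below illustrates that idea only; FieldTrackerSketch and its methods are hypothetical names, and the actual bookkeeping in TaggedDocument (the htmlStk and _fields members) may differ in detail.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

final class FieldTrackerSketch {
    private final Set<String> fieldTags;                    // tags treated as fields (assumed configuration)
    private final Set<String> openFields = new HashSet<>(); // field tags currently open

    FieldTrackerSketch(Set<String> fieldTags) {
        this.fieldTags = new HashSet<>(fieldTags);
    }

    /** Records an opening tag if it is configured as a field. */
    void openTag(String tag) {
        if (fieldTags.contains(tag)) {
            openFields.add(tag);
        }
    }

    /** Removes a tag from the open fields when it is closed. */
    void closeTag(String tag) {
        openFields.remove(tag);
    }

    /** Fields in which a term read at this point would appear. */
    Set<String> getFields() {
        return Collections.unmodifiableSet(openFields);
    }
}
```

A term read between an opening TITLE tag and its closing tag would then be reported as occurring in the TITLE field, provided TITLE is configured as a field.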

endOfDocument

public boolean endOfDocument()
Indicates whether the tokeniser has reached the end of the current document.

Specified by:
endOfDocument in interface Document
Returns:
boolean true if the end of the current document has been reached, otherwise returns false.

processEndOfTag

protected void processEndOfTag(java.lang.String tag)
The encountered tag, which must be a closing tag, is matched against the tag on top of the stack. If they are not the same, consistency is restored by popping tags from the stack, up to and including the observed tag. If the stack becomes empty after that, the end-of-document flag EOD is set to true.

Parameters:
tag - The closing tag to be tested against the content of the stack.
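
The stack handling described above can be sketched in isolation. This is an illustrative approximation, not the Terrier source: TagStackSketch and its method names are hypothetical, and a closing tag that does not occur anywhere on the stack is assumed to be ignored.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class TagStackSketch {
    private final Deque<String> stack = new ArrayDeque<>();
    private boolean endOfDocument = false;

    /** Pushes an opening tag. */
    void open(String tag) {
        stack.push(tag);
    }

    /** Matches a closing tag against the stack and restores consistency if needed. */
    void close(String tag) {
        if (stack.isEmpty()) {
            return; // nothing to match against
        }
        if (stack.peek().equals(tag)) {
            stack.pop(); // well-nested case
        } else if (stack.contains(tag)) {
            // Mismatch: pop everything above the tag, then the tag itself.
            while (!stack.isEmpty()) {
                if (stack.pop().equals(tag)) {
                    break;
                }
            }
        }
        // Assumption: a closing tag that is not on the stack is ignored.
        if (stack.isEmpty()) {
            endOfDocument = true;
        }
    }

    int depth() {
        return stack.size();
    }

    boolean isEndOfDocument() {
        return endOfDocument;
    }
}
```

For instance, after opening DOC, TEXT and P in that order, closing TEXT pops both P and TEXT, and a later close of DOC empties the stack and marks the end of the document.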

check

public static java.lang.String check(java.lang.String s)
Checks that a term is shorter than the maximum allowed length and does not contain too many digits or too many consecutive identical digits or letters. The limits are given by tokenMaximumLength, maxNumOfDigitsPerTerm and maxNumOfSameConseqLettersPerTerm.

Parameters:
s - String the term to check if it is valid.
Returns:
String the term if it is valid, otherwise it returns null.
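
The three tests applied by check() can be illustrated with a standalone sketch. It is not the Terrier implementation: the class name is hypothetical, the three limit values are assumed for illustration rather than read from the library, and the length limit is treated as inclusive.

```java
final class TermCheckSketch {
    // Assumed limits for illustration only; the real values live in TaggedDocument.
    static final int TOKEN_MAXIMUM_LENGTH = 20;
    static final int MAX_DIGITS_PER_TERM = 4;
    static final int MAX_SAME_CONSECUTIVE_CHARS = 3;

    /** Returns the term if it passes all three limits, otherwise null. */
    static String check(String s) {
        if (s == null || s.isEmpty() || s.length() > TOKEN_MAXIMUM_LENGTH) {
            return null;
        }
        int digits = 0;    // digits seen so far
        int run = 0;       // length of the current run of identical characters
        char previous = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isDigit(c)) {
                digits++;
            }
            run = (i > 0 && c == previous) ? run + 1 : 1;
            previous = c;
            if (digits > MAX_DIGITS_PER_TERM || run > MAX_SAME_CONSECUTIVE_CHARS) {
                return null;
            }
        }
        return s;
    }
}
```

Under these assumed limits, a term such as "terrier" is returned unchanged, while "12345" (five digits) and "aaaa" (a run of four identical letters) are rejected with null.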

getProperty

public java.lang.String getProperty(java.lang.String name)
Allows access to a named property of the Document. Examples might be URL, filename etc.

Specified by:
getProperty in interface Document
Parameters:
name - Name of the property. It is suggested, but not required, that this name should not be case insensitive.
Since:
1.1.0

setProperty

public void setProperty(java.lang.String name,
                        java.lang.String value)
Allows a named property to be added to the Document. Examples might be URL, filename etc.

Parameters:
name - Name of the property. It is suggested, but not required, that this name should not be case insensitive.
value - The value of the property
Since:
1.1.0

getAllProperties

public java.util.Map<java.lang.String,java.lang.String> getAllProperties()
Returns the underlying map of all the properties defined by this Document.

Specified by:
getAllProperties in interface Document
Since:
1.1.0

main

public static void main(java.lang.String[] args)
Static method which dumps a document to System.out

Parameters:
args - A filename to parse

generateDocumentFromFile

public static Document generateDocumentFromFile(java.lang.String filename)
Instantiates a TREC document from a file.


dumpDocument

public static void dumpDocument(Document d)
Dumps a document to stdout

Parameters:
d - a Document object


Terrier 3.5. Copyright © 2004-2011 University of Glasgow