Package org.terrier.indexing
Class TaggedDocument
- java.lang.Object
-
- org.terrier.indexing.TaggedDocument
-
- All Implemented Interfaces:
Document
public class TaggedDocument extends java.lang.Object implements Document
Models a tagged document (e.g., an HTML or TREC document). In particular, getNextTerm() returns the next token in the current chunk of text, according to the specified tokeniser. This class uses the following properties:
- tokeniser, the tokeniser class to be used (defaults to EnglishTokeniser);
- max.term.length, the maximum length in characters of a term (defaults to 20);
- lowercase, whether characters are transformed to lowercase (defaults to true).
- TaggedDocument.abstracts - names of the abstracts to be saved for query-biased summarisation. Defaults to empty. Example: TaggedDocument.abstracts=title,abstract
- TaggedDocument.abstracts.tags - names of tags to save text from for the purposes of query-biased summarisation. Example: TaggedDocument.abstracts.tags=title,body. ELSE is a special tag name, which means anything not consumed by other tags.
- TaggedDocument.abstracts.lengths - max lengths of the abstracts. Defaults to empty. Example: TaggedDocument.abstracts.lengths=100,2048
- TaggedDocument.abstracts.tags.casesensitive - should the names of tags be case-sensitive? Defaults to false.
- Since:
- 3.5
- Author:
- Craig Macdonald, Vassilis Plachouras, Richard McCreadie, Rodrygo Santos
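The three abstract properties listed above read as parallel comma-separated lists: the i-th name, the i-th tag and the i-th length describe one abstract. The sketch below only illustrates that reading. AbstractConfigSketch is a hypothetical helper, not a Terrier class, and the requirement that the three lists have equal length is an assumption. How the properties actually reach TaggedDocument is not shown on this page.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Terrier.
class AbstractConfigSketch {
    final String[] names;        // from TaggedDocument.abstracts
    final String[] tags;         // from TaggedDocument.abstracts.tags
    final int[] lengths;         // from TaggedDocument.abstracts.lengths
    final boolean caseSensitive; // from TaggedDocument.abstracts.tags.casesensitive

    AbstractConfigSketch(Map<String, String> props) {
        names = split(props.getOrDefault("TaggedDocument.abstracts", ""));
        tags = split(props.getOrDefault("TaggedDocument.abstracts.tags", ""));
        String[] rawLengths = split(props.getOrDefault("TaggedDocument.abstracts.lengths", ""));
        // Assumption: the three lists are parallel, so they must have the same size.
        if (names.length != tags.length || names.length != rawLengths.length) {
            throw new IllegalArgumentException(
                "abstract names, tags and lengths must have the same number of entries");
        }
        lengths = new int[rawLengths.length];
        for (int i = 0; i < rawLengths.length; i++) {
            lengths[i] = Integer.parseInt(rawLengths[i]);
        }
        caseSensitive = Boolean.parseBoolean(
            props.getOrDefault("TaggedDocument.abstracts.tags.casesensitive", "false"));
    }

    // Splits a comma-separated value; an empty value yields an empty array.
    private static String[] split(String value) {
        String trimmed = value.trim();
        if (trimmed.isEmpty()) {
            return new String[0];
        }
        return trimmed.split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("TaggedDocument.abstracts", "title,abstract");
        props.put("TaggedDocument.abstracts.tags", "title,body");
        props.put("TaggedDocument.abstracts.lengths", "100,2048");
        AbstractConfigSketch config = new AbstractConfigSketch(props);
        System.out.println(Arrays.toString(config.names));
        System.out.println(Arrays.toString(config.tags));
        System.out.println(Arrays.toString(config.lengths));
    }
}
```

With the example values from the property list, the abstract named "title" would be filled from the title tag with at most 100 characters, and the one named "abstract" from the body tag with at most 2048.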
-
-
Field Summary
Modifier and Type, Field, Description
- protected TagSet _exact - The tags to process exactly.
- protected TagSet _fields - The tags to consider as fields.
- protected TagSet _tags - The tags to process or skip.
- protected int abstractCount - Number of abstract types.
- protected int[] abstractlengths - The maximum length of each named abstract (comma separated list).
- protected gnu.trove.TObjectIntHashMap<java.lang.String> abstractName2Index - A mapping for quick lookup of abstract tag names.
- protected java.lang.String[] abstractnames - The names of the abstracts to be saved (comma separated list).
- protected java.lang.StringBuilder[] abstracts - Builders for each abstract.
- protected java.lang.String[] abstracttags - The fields that the named abstracts come from (comma separated list).
- protected boolean abstractTagsCaseSensitive
- protected java.io.Reader br - The input reader.
- protected boolean considerAbstracts - Flag that determines whether to short-cut the abstract generation method.
- protected long counter - The number of bytes read from the input.
- protected TokenStream currentTokenStream
- protected int elseAbstractSpecialTag - Else field index.
- protected boolean EOD - End of Document.
- protected boolean error - Indicates whether an error has occurred.
- protected java.util.Set<java.lang.String> htmlStk - The hash set where the tags, considered as fields, are inserted.
- protected boolean inHtmlTagToProcess - Specifies whether the tokeniser is in a field tag to process.
- protected boolean inTagToProcess - Indicates whether we are in a tag to process.
- protected boolean inTagToSkip - Indicates whether we are in a tag to skip.
- protected int lastChar - Saves the last read character between consecutive calls of getNextTerm().
- protected static org.slf4j.Logger logger
- protected static boolean lowercase - Change to lowercase?
- protected static int maxNumOfDigitsPerTerm - The maximum number of digits that are allowed in valid terms.
- protected static int maxNumOfSameConseqLettersPerTerm - The maximum number of consecutive same letters or digits that are allowed in valid terms.
- protected java.util.Map<java.lang.String,java.lang.String> properties
- protected java.util.Stack<java.lang.String> stk - The stack where the tags are pushed and popped accordingly.
- protected java.lang.String[] stringArray - A temporary String array.
- protected java.lang.StringBuilder sw
- protected java.lang.StringBuilder tagNameSB
- protected Tokeniser tokeniser
- protected static int tokenMaximumLength - The maximum length of a token in the check method.
-
Constructor Summary
Constructor, Description
- TaggedDocument(java.io.InputStream docStream, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser _tokeniser) - Constructs an instance of the class from the given input stream.
- TaggedDocument(java.io.InputStream docStream, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser _tokeniser, java.lang.String doctags, java.lang.String exactdoctags, java.lang.String fieldtags) - Constructs an instance of the class from the given input stream.
- TaggedDocument(java.io.Reader docReader, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser _tokeniser) - Constructs an instance of the class from the given reader object.
-
Method Summary
Modifier and Type, Method, Description
- static java.lang.String check(java.lang.String s) - Checks whether a term is shorter than the maximum allowed length, and that it does not contain too many digits or too many consecutive identical digits or letters.
- static void dumpDocument(Document d) - Dumps a document to stdout.
- boolean endOfDocument() - Indicates whether the tokenizer has reached the end of the current document.
- static Document generateDocumentFromFile(java.lang.String filename) - Instantiates a TREC document from a file.
- java.util.Map<java.lang.String,java.lang.String> getAllProperties() - Returns the underlying map of all the properties defined by this Document.
- java.util.Set<java.lang.String> getFields() - Returns the fields in which the current term appears.
- java.lang.String getNextTerm() - Returns the next token from the current chunk of text, extracted from the document into a TokenStream.
- java.lang.String getProperty(java.lang.String name) - Allows access to a named property of the Document.
- java.io.Reader getReader() - Returns the underlying buffered reader, so that client code can tokenise the document itself, and deal with it how it likes.
- static void main(java.lang.String[] args) - Static method which dumps a document to System.out.
- protected void processEndOfDocument()
- protected void processEndOfTag(java.lang.String tag) - The encountered tag, which must be a closing tag, is matched with the tag on the stack.
- protected void saveToAbstract(java.lang.String text, java.lang.String tag) - This method takes the text parsed from a tag and then saves it to the abstract(s).
- void setProperty(java.lang.String name, java.lang.String value) - Allows a named property to be added to the Document.
-
-
-
Field Detail
-
logger
protected static final org.slf4j.Logger logger
-
tokenMaximumLength
protected static final int tokenMaximumLength
The maximum length of a token in the check method.
-
lowercase
protected static final boolean lowercase
Change to lowercase?
-
stringArray
protected final java.lang.String[] stringArray
A temporary String array
-
br
protected java.io.Reader br
The input reader.
-
EOD
protected boolean EOD
End of Document. Set by the last couple of lines in getNextTerm()
-
counter
protected long counter
The number of bytes read from the input.
-
lastChar
protected int lastChar
Saves the last read character between consecutive calls of getNextTerm().
-
error
protected boolean error
Indicates whether an error has occurred.
-
_tags
protected TagSet _tags
The tags to process or skip.
-
_exact
protected TagSet _exact
The tags to process exactly. For these tags, the check() method is not applied.
-
_fields
protected TagSet _fields
The tags to consider as fields.
-
stk
protected java.util.Stack<java.lang.String> stk
The stack where the tags are pushed and popped accordingly.
-
inTagToProcess
protected boolean inTagToProcess
Indicates whether we are in a tag to process.
-
inTagToSkip
protected boolean inTagToSkip
Indicates whether we are in a tag to skip.
-
htmlStk
protected java.util.Set<java.lang.String> htmlStk
The hash set where the tags, considered as fields, are inserted.
-
inHtmlTagToProcess
protected boolean inHtmlTagToProcess
Specifies whether the tokeniser is in a field tag to process.
-
properties
protected java.util.Map<java.lang.String,java.lang.String> properties
-
tokeniser
protected Tokeniser tokeniser
-
currentTokenStream
protected TokenStream currentTokenStream
-
abstractnames
protected final java.lang.String[] abstractnames
The names of the abstracts to be saved (comma separated list)
-
abstracttags
protected final java.lang.String[] abstracttags
The fields that the named abstracts come from (comma separated list)
-
abstractlengths
protected final int[] abstractlengths
The maximum length of each named abstract (comma separated list)
-
abstractTagsCaseSensitive
protected final boolean abstractTagsCaseSensitive
-
abstractCount
protected final int abstractCount
number of abstract types
-
abstracts
protected final java.lang.StringBuilder[] abstracts
builders for each abstract
-
abstractName2Index
protected final gnu.trove.TObjectIntHashMap<java.lang.String> abstractName2Index
A mapping for quick lookup of abstract tag names
-
considerAbstracts
protected final boolean considerAbstracts
Flag that determines whether to short-cut the abstract generation method
-
elseAbstractSpecialTag
protected int elseAbstractSpecialTag
else field index
-
sw
protected final java.lang.StringBuilder sw
-
tagNameSB
protected final java.lang.StringBuilder tagNameSB
-
maxNumOfDigitsPerTerm
protected static final int maxNumOfDigitsPerTerm
The maximum number of digits that are allowed in valid terms.
- See Also:
- Constant Field Values
-
maxNumOfSameConseqLettersPerTerm
protected static final int maxNumOfSameConseqLettersPerTerm
The maximum number of consecutive same letters or digits that are allowed in valid terms.
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
TaggedDocument
public TaggedDocument(java.io.InputStream docStream, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser _tokeniser)
Constructs an instance of the class from the given input stream.
- Parameters:
docStream
docProperties
_tokeniser
-
TaggedDocument
public TaggedDocument(java.io.InputStream docStream, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser _tokeniser, java.lang.String doctags, java.lang.String exactdoctags, java.lang.String fieldtags)
Constructs an instance of the class from the given input stream.
- Parameters:
docStream
docProperties
_tokeniser
doctags
exactdoctags
fieldtags
-
TaggedDocument
public TaggedDocument(java.io.Reader docReader, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser _tokeniser)
Constructs an instance of the class from the given reader object.
- Parameters:
docReader
- Reader the stream from the collection that ends at the end of the current document.
-
-
Method Detail
-
getReader
public java.io.Reader getReader()
Returns the underlying buffered reader, so that client code can tokenise the document itself, and deal with it how it likes.
-
getNextTerm
public java.lang.String getNextTerm()
Returns the next token from the current chunk of text, extracted from the document into a TokenStream.
- Specified by:
getNextTerm in interface Document
- Returns:
- String the next token of the document, or null if the token was discarded during tokenisation.
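Because getNextTerm() may return null for a discarded token, a null result does not by itself signal the end of the document, so a caller would typically loop on endOfDocument() and skip nulls. The sketch below shows that loop against a hypothetical stand-in interface, TermSource, which exposes only the two methods used here. It is not the Terrier Document interface, and fromTokens is a fake source invented for the demonstration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class TermLoopSketch {
    // Hypothetical stand-in with the same two method names as Document.
    interface TermSource {
        String getNextTerm();
        boolean endOfDocument();
    }

    // Reads terms until the end of the document, skipping discarded (null) tokens.
    static List<String> collectTerms(TermSource doc) {
        List<String> terms = new ArrayList<>();
        while (!doc.endOfDocument()) {
            String term = doc.getNextTerm();
            if (term == null) {
                continue; // token was discarded during tokenisation
            }
            terms.add(term);
        }
        return terms;
    }

    // Fake source for the demo: a null entry simulates a discarded token.
    static TermSource fromTokens(String... tokens) {
        final Iterator<String> it = Arrays.asList(tokens).iterator();
        return new TermSource() {
            @Override
            public String getNextTerm() {
                return it.hasNext() ? it.next() : null;
            }

            @Override
            public boolean endOfDocument() {
                return !it.hasNext();
            }
        };
    }

    public static void main(String[] args) {
        TermSource doc = fromTokens("information", null, "retrieval");
        System.out.println(collectTerms(doc));
    }
}
```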
-
processEndOfDocument
protected void processEndOfDocument()
-
saveToAbstract
protected void saveToAbstract(java.lang.String text, java.lang.String tag)
This method takes the text parsed from a tag and saves it to the abstract(s). It contains the logic that decides whether the text, or some subset of it, should be saved. The default behaviour checks each abstract named in TaggedDocument.abstracts: if the text comes from the tag configured for that abstract (specified in TaggedDocument.abstracts.tags), it saves up to the maximum character length specified in TaggedDocument.abstracts.lengths. The 'ELSE' abstract tag is a special case that is filled with text from any tag that is not added to an existing abstract. To save abstracts in a different manner, e.g. saving only the first paragraph, subclass TaggedDocument and override this method.
- Parameters:
text
- the text to be saved
tag
- the tag that this text came from
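The default behaviour described above can be sketched as follows. This is an approximation under stated assumptions, not the Terrier implementation: AbstractSaverSketch is a hypothetical class, tag matching is case-insensitive (mirroring the documented default of TaggedDocument.abstracts.tags.casesensitive), and text is appended without any separator.

```java
class AbstractSaverSketch {
    private final String[] tags;             // tag feeding each abstract
    private final int[] maxLengths;          // character limit per abstract
    private final StringBuilder[] abstracts; // one builder per abstract
    private int elseIndex = -1;              // index of the ELSE abstract, or -1 if none

    AbstractSaverSketch(String[] tags, int[] maxLengths) {
        this.tags = tags;
        this.maxLengths = maxLengths;
        this.abstracts = new StringBuilder[tags.length];
        for (int i = 0; i < tags.length; i++) {
            abstracts[i] = new StringBuilder();
            if (tags[i].equals("ELSE")) {
                elseIndex = i;
            }
        }
    }

    // Saves text to every abstract fed by this tag; otherwise falls back to ELSE.
    void saveToAbstract(String text, String tag) {
        boolean consumed = false;
        for (int i = 0; i < tags.length; i++) {
            if (i != elseIndex && tags[i].equalsIgnoreCase(tag)) {
                appendUpToLimit(i, text);
                consumed = true;
            }
        }
        if (!consumed && elseIndex >= 0) {
            appendUpToLimit(elseIndex, text);
        }
    }

    // Appends only as many characters as the abstract still has room for.
    private void appendUpToLimit(int index, String text) {
        int room = maxLengths[index] - abstracts[index].length();
        if (room <= 0) {
            return;
        }
        abstracts[index].append(text, 0, Math.min(room, text.length()));
    }

    String get(int index) {
        return abstracts[index].toString();
    }

    public static void main(String[] args) {
        AbstractSaverSketch saver =
            new AbstractSaverSketch(new String[] {"title", "ELSE"}, new int[] {5, 100});
        saver.saveToAbstract("Terrier IR", "TITLE");
        saver.saveToAbstract("body text", "p");
        System.out.println(saver.get(0));
        System.out.println(saver.get(1));
    }
}
```

A subclass that wanted different behaviour, such as keeping only the first paragraph, would replace the body of saveToAbstract while leaving the rest of the parsing untouched.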
-
getFields
public java.util.Set<java.lang.String> getFields()
Returns the fields in which the current term appears.
-
endOfDocument
public boolean endOfDocument()
Indicates whether the tokenizer has reached the end of the current document.
- Specified by:
endOfDocument in interface Document
- Returns:
- boolean true if the end of the current document has been reached, otherwise returns false.
-
processEndOfTag
protected void processEndOfTag(java.lang.String tag)
The encountered tag, which must be a closing tag, is matched with the tag on the stack. If they are not the same, consistency is restored by popping tags from the stack, the observed tag included. If the stack becomes empty after that, the end-of-document flag EOD is set to true.
- Parameters:
tag
- The closing tag to be tested against the content of the stack.
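The stack handling described above can be illustrated with a small sketch. It reads "popping the tags in the stack, the observed tag included" as popping down to and including the matching entry. TagStackSketch is hypothetical, it uses a Deque rather than the java.util.Stack held by the class, and ignoring a closing tag that is not on the stack at all is an assumption.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class TagStackSketch {
    private final Deque<String> stack = new ArrayDeque<>();
    private boolean endOfDocument = false;

    void openTag(String tag) {
        stack.push(tag);
    }

    // Handles a closing tag: pops entries down to and including the matching one.
    void closeTag(String tag) {
        if (!stack.contains(tag)) {
            return; // assumption: a stray closing tag is ignored
        }
        // contains() guarantees this loop terminates at the matching tag.
        while (!stack.pop().equals(tag)) {
            // discard tags that were left unclosed inside the one being closed
        }
        if (stack.isEmpty()) {
            endOfDocument = true;
        }
    }

    boolean endOfDocument() {
        return endOfDocument;
    }

    int depth() {
        return stack.size();
    }

    public static void main(String[] args) {
        TagStackSketch tags = new TagStackSketch();
        tags.openTag("DOC");
        tags.openTag("TEXT");
        tags.openTag("B");      // never closed explicitly
        tags.closeTag("TEXT");  // pops B and TEXT
        System.out.println(tags.depth() + " " + tags.endOfDocument());
        tags.closeTag("DOC");   // empties the stack
        System.out.println(tags.depth() + " " + tags.endOfDocument());
    }
}
```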
-
check
public static java.lang.String check(java.lang.String s)
Checks whether a term is shorter than the maximum allowed length, and that it does not contain too many digits or too many consecutive identical digits or letters.
- Parameters:
s
- String the term to check if it is valid.
- Returns:
- String the term if it is valid, otherwise it returns null.
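The three tests that check() applies can be sketched as below. The limits are assumptions: 20 is the documented default of max.term.length, while the digit and repetition limits (4 and 3 here) are illustrative values, since this page only points to Constant Field Values for them. Whether the length bound is inclusive is also assumed. TermCheckSketch is a hypothetical class, not the Terrier method.

```java
class TermCheckSketch {
    static final int MAX_LENGTH = 20;          // documented default of max.term.length
    static final int MAX_DIGITS = 4;           // assumed value
    static final int MAX_SAME_CONSECUTIVE = 3; // assumed value

    // Returns the term if it passes all three tests, otherwise null.
    static String check(String s) {
        if (s == null || s.length() > MAX_LENGTH) {
            return null;
        }
        int digits = 0;    // digits seen so far
        int run = 0;       // length of the current run of identical characters
        char previous = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isDigit(c)) {
                digits++;
            }
            run = (i > 0 && c == previous) ? run + 1 : 1;
            previous = c;
            if (digits > MAX_DIGITS || run > MAX_SAME_CONSECUTIVE) {
                return null;
            }
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(check("retrieval"));
        System.out.println(check("aaaab"));
        System.out.println(check("12345"));
    }
}
```

Under these assumed limits, "retrieval" is returned unchanged, while "aaaab" (four identical letters in a row) and "12345" (five digits) are both rejected with null.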
-
getProperty
public java.lang.String getProperty(java.lang.String name)
Allows access to a named property of the Document. Examples might be URL, filename etc.
- Specified by:
getProperty in interface Document
- Parameters:
name
- Name of the property. It is suggested, but not required that this name should not be case insensitive.
- Since:
- 1.1.0
-
setProperty
public void setProperty(java.lang.String name, java.lang.String value)
Allows a named property to be added to the Document. Examples might be URL, filename etc.
- Parameters:
name
- Name of the property. It is suggested, but not required that this name should not be case insensitive.
value
- The value of the property
- Since:
- 1.1.0
-
getAllProperties
public java.util.Map<java.lang.String,java.lang.String> getAllProperties()
Returns the underlying map of all the properties defined by this Document.
- Specified by:
getAllProperties in interface Document
- Since:
- 1.1.0
-
main
public static void main(java.lang.String[] args)
Static method which dumps a document to System.out
- Parameters:
args
- A filename to parse
-
generateDocumentFromFile
public static Document generateDocumentFromFile(java.lang.String filename)
Instantiates a TREC document from a file.
-
dumpDocument
public static void dumpDocument(Document d)
Dumps a document to stdout
- Parameters:
d
- a Document object
-
-