java.lang.Object
  org.terrier.indexing.TaggedDocument
public class TaggedDocument
Models a tagged document (e.g., an HTML or TREC document). In particular, getNextTerm() returns the next token in the current chunk of text, according to the specified tokeniser. This class replaces HTMLDocument and TRECDocument.
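
As a hedged illustration of the token-at-a-time contract described above, the sketch below mimics the getNextTerm() / endOfDocument() pair with a stand-in class. `TermSource`, `ListTermSource` and `TermLoopSketch` are hypothetical names introduced only for this example and are not part of Terrier. Skipping `null` terms is an assumption made here, not something this page states.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the two Document methods used here; not Terrier code.
interface TermSource {
    String getNextTerm();
    boolean endOfDocument();
}

// Serves terms from a fixed list; null entries model rejected tokens (assumption).
class ListTermSource implements TermSource {
    private final Iterator<String> it;

    ListTermSource(List<String> terms) {
        this.it = terms.iterator();
    }

    @Override
    public String getNextTerm() {
        return it.hasNext() ? it.next() : null;
    }

    @Override
    public boolean endOfDocument() {
        return !it.hasNext();
    }
}

class TermLoopSketch {
    // Drains a source until it reports end of document, skipping null terms.
    static List<String> collectTerms(TermSource doc) {
        List<String> out = new ArrayList<>();
        while (!doc.endOfDocument()) {
            String term = doc.getNextTerm();
            if (term != null) {
                out.add(term);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        TermSource doc = new ListTermSource(Arrays.asList("terrier", null, "indexing"));
        System.out.println(collectTerms(doc));
    }
}
```

The loop tests endOfDocument() rather than the returned term, so a skipped token does not end iteration early.
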
This class uses the following properties:
Field Summary | |
---|---|
protected TagSet | _exact: The tags to process exactly. |
protected TagSet | _fields: The tags to consider as fields. |
protected TagSet | _tags: The tags to process or skip. |
protected int | abstractCount: number of abstract types |
protected int[] | abstractlengths: The maximum length of each named abstract (comma separated list) |
protected java.lang.String[] | abstractnames: The names of the abstracts to be saved (comma separated list) |
protected java.lang.StringBuilder[] | abstracts: builders for each abstract |
protected java.lang.String[] | abstracttags: The fields that the named abstracts come from (comma separated list) |
protected boolean | abstractTagsCaseSensitive |
protected java.io.Reader | br: The input reader. |
protected long | counter: The number of bytes read from the input. |
protected TokenStream | currentTokenStream |
protected int | elseAbstractSpecialTag: else field index |
protected boolean | EOD: End of Document. |
protected boolean | error: Indicates whether an error has occurred. |
protected java.util.Set<java.lang.String> | htmlStk: The hash set where the tags, considered as fields, are inserted. |
protected boolean | inHtmlTagToProcess: Specifies whether the tokeniser is in a field tag to process. |
protected boolean | inTagToProcess: Indicates whether we are in a tag to process. |
protected boolean | inTagToSkip: Indicates whether we are in a tag to skip. |
protected int | lastChar: Saves the last read character between consecutive calls of getNextTerm(). |
protected static org.apache.log4j.Logger | logger |
protected static boolean | lowercase: Change to lowercase? |
protected static int | maxNumOfDigitsPerTerm: The maximum number of digits that are allowed in valid terms. |
protected static int | maxNumOfSameConseqLettersPerTerm: The maximum number of consecutive same letters or digits that are allowed in valid terms. |
protected java.util.Map<java.lang.String,java.lang.String> | properties |
protected java.util.Stack<java.lang.String> | stk: The stack where the tags are pushed and popped accordingly. |
protected java.lang.String[] | stringArray: A temporary String array |
protected java.lang.StringBuilder | sw |
protected java.lang.StringBuilder | tagNameSB |
protected Tokeniser | tokeniser |
protected static int | tokenMaximumLength: The maximum length of a token in the check method. |
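
The abstract-related fields above (abstractnames, abstracttags, abstractlengths, abstracts, abstractTagsCaseSensitive) suggest one length-capped buffer per named abstract, filled from a source tag. The following is a minimal sketch of that idea under stated assumptions: `AbstractBufferSketch` and its `save` and `get` methods are hypothetical, and the truncation and matching rules are guesses for illustration, not the behaviour of saveToAbstract().

```java
// Hypothetical illustration of length-capped per-abstract buffers; not Terrier code.
class AbstractBufferSketch {
    private final String[] names;        // name of each abstract
    private final String[] tags;         // tag each abstract is filled from
    private final int[] maxLengths;      // maximum characters kept per abstract
    private final boolean caseSensitive; // whether tag matching is case sensitive
    private final StringBuilder[] buffers;

    AbstractBufferSketch(String[] names, String[] tags, int[] maxLengths, boolean caseSensitive) {
        this.names = names;
        this.tags = tags;
        this.maxLengths = maxLengths;
        this.caseSensitive = caseSensitive;
        this.buffers = new StringBuilder[names.length];
        for (int i = 0; i < buffers.length; i++) {
            buffers[i] = new StringBuilder();
        }
    }

    // Appends text to every abstract whose source tag matches, never exceeding its cap.
    void save(String text, String tag) {
        for (int i = 0; i < tags.length; i++) {
            boolean match = caseSensitive ? tags[i].equals(tag) : tags[i].equalsIgnoreCase(tag);
            if (!match) {
                continue;
            }
            int room = maxLengths[i] - buffers[i].length();
            if (room > 0) {
                buffers[i].append(text, 0, Math.min(room, text.length()));
            }
        }
    }

    // Returns the text collected for the named abstract, or null if the name is unknown.
    String get(String name) {
        for (int i = 0; i < names.length; i++) {
            if (names[i].equals(name)) {
                return buffers[i].toString();
            }
        }
        return null;
    }
}
```

For example, with a cap of 5 on an abstract fed from TITLE, saving "Hello World" keeps only "Hello", and later text from the same tag is dropped once the cap is reached.
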
Constructor Summary | |
---|---|
TaggedDocument(java.io.InputStream docStream, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser _tokeniser) | Constructs an instance of the class from the given input stream. |
TaggedDocument(java.io.Reader docReader, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser _tokeniser) | Constructs an instance of the class from the given reader object. |
Method Summary | |
---|---|
static java.lang.String | check(java.lang.String s): Checks whether a term is shorter than the maximum allowed length and does not contain too many digits or too many identical consecutive digits or letters. |
static void | dumpDocument(Document d): Dumps a document to stdout. |
boolean | endOfDocument(): Indicates whether the tokeniser has reached the end of the current document. |
static Document | generateDocumentFromFile(java.lang.String filename): Instantiates a TREC document from a file. |
java.util.Map<java.lang.String,java.lang.String> | getAllProperties(): Returns the underlying map of all the properties defined by this Document. |
java.util.Set<java.lang.String> | getFields(): Returns the fields in which the current term appears. |
java.lang.String | getNextTerm(): Returns the next token from the current chunk of text, extracted from the document into a TokenStream. |
java.lang.String | getProperty(java.lang.String name): Allows access to a named property of the Document. |
java.io.Reader | getReader(): Returns the underlying buffered reader, so that client code can tokenise the document itself and handle it as it sees fit. |
static void | main(java.lang.String[] args): Static method which dumps a document to System.out. |
protected void | processEndOfDocument() |
protected void | processEndOfTag(java.lang.String tag): The encountered tag, which must be a closing tag, is matched with the tag on the stack. |
protected void | saveToAbstract(java.lang.String text, java.lang.String tag): This method takes the text parsed from a tag and then saves it to the abstract(s). |
void | setProperty(java.lang.String name, java.lang.String value): Allows a named property to be added to the Document. |
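
The summary of processEndOfTag(), together with the stk field, describes matching a closing tag against a stack of open tags. The sketch below shows that general technique only. `TagStackSketch` is a hypothetical class, and its handling of a mismatched closing tag (ignore it and leave the stack untouched) is an assumption; this page does not say what the real method does in that case.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration of closing-tag matching with a stack; not Terrier code.
class TagStackSketch {
    private final Deque<String> open = new ArrayDeque<>();

    // Records an opening tag.
    void startTag(String tag) {
        open.push(tag);
    }

    // Pops the most recent open tag if it equals the closing tag.
    // Returns false and leaves the stack unchanged on a mismatch (assumed policy).
    boolean endTag(String tag) {
        if (!open.isEmpty() && open.peek().equals(tag)) {
            open.pop();
            return true;
        }
        return false;
    }

    // Number of tags currently open.
    int depth() {
        return open.size();
    }
}
```

With DOC and then TITLE pushed, `endTag("DOC")` is rejected because TITLE is still on top; closing TITLE first and DOC second empties the stack.
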
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
protected static final org.apache.log4j.Logger logger
protected static final int tokenMaximumLength
protected static final boolean lowercase
protected final java.lang.String[] stringArray
protected java.io.Reader br
protected boolean EOD
protected long counter
protected int lastChar
protected boolean error
protected TagSet _tags
protected TagSet _exact
protected TagSet _fields
protected java.util.Stack<java.lang.String> stk
protected boolean inTagToProcess
protected boolean inTagToSkip
protected java.util.Set<java.lang.String> htmlStk
protected boolean inHtmlTagToProcess
protected java.util.Map<java.lang.String,java.lang.String> properties
protected Tokeniser tokeniser
protected TokenStream currentTokenStream
protected final java.lang.String[] abstractnames
protected final java.lang.String[] abstracttags
protected final int[] abstractlengths
protected final boolean abstractTagsCaseSensitive
protected final int abstractCount
protected final java.lang.StringBuilder[] abstracts
protected int elseAbstractSpecialTag
protected final java.lang.StringBuilder sw
protected final java.lang.StringBuilder tagNameSB
protected static final int maxNumOfDigitsPerTerm
protected static final int maxNumOfSameConseqLettersPerTerm
Constructor Detail |
---|
public TaggedDocument(java.io.InputStream docStream, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser _tokeniser)
docStream
docProperties
_tokeniser
public TaggedDocument(java.io.Reader docReader, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser _tokeniser)
docReader
- Reader the stream from the collection that ends at the end of the current document.
Method Detail |
---|
public java.io.Reader getReader()
getReader
in interface Document
public java.lang.String getNextTerm()
getNextTerm
in interface Document
protected void processEndOfDocument()
protected void saveToAbstract(java.lang.String text, java.lang.String tag)
text
- the text to be saved
tag
- the tag that this text came from
public java.util.Set<java.lang.String> getFields()
getFields
in interface Document
public boolean endOfDocument()
endOfDocument
in interface Document
protected void processEndOfTag(java.lang.String tag)
tag
- The closing tag to be tested against the content of the stack.
public static java.lang.String check(java.lang.String s)
s
- String the term to check for validity.
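
To make the three conditions behind check() concrete (term length, number of digits, run of identical characters), here is a minimal sketch. `TermCheckSketch` and its constants are hypothetical: the limit values are illustrative placeholders, treating a term of exactly the maximum length as valid is a choice made here, and returning an empty string for a rejected term is an assumed convention, since this page does not state what check() returns on failure.

```java
// Hypothetical illustration of the three term checks; not Terrier code.
class TermCheckSketch {
    // Illustrative limits only; the real values are configured elsewhere.
    static final int MAX_LENGTH = 20;
    static final int MAX_DIGITS = 4;
    static final int MAX_SAME_CONSECUTIVE = 3;

    // Returns the trimmed term if it passes every check, or "" if it is rejected (assumed).
    static String check(String s) {
        if (s == null) {
            return "";
        }
        s = s.trim();
        if (s.length() > MAX_LENGTH) {
            return "";
        }
        int digits = 0; // digits seen so far
        int run = 0;    // length of the current run of identical characters
        int prev = -1;  // previous character, or -1 before the first one
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= '0' && c <= '9') {
                digits++;
            }
            run = (c == prev) ? run + 1 : 1;
            prev = c;
            if (digits > MAX_DIGITS || run > MAX_SAME_CONSECUTIVE) {
                return "";
            }
        }
        return s;
    }
}
```

Under these placeholder limits, "hello" and "aaa" pass, while "aaaa" (four identical letters in a row) and "12345" (five digits) are rejected.
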
public java.lang.String getProperty(java.lang.String name)
getProperty
in interface Document
name
- Name of the property. It is suggested, but not required, that this name should not be case insensitive.
public void setProperty(java.lang.String name, java.lang.String value)
name
- Name of the property. It is suggested, but not required, that this name should not be case insensitive.
value
- The value of the property
public java.util.Map<java.lang.String,java.lang.String> getAllProperties()
getAllProperties
in interface Document
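
The three property methods above amount to a thin wrapper over a string-to-string map. The sketch below illustrates that shape with a hypothetical `DocPropertiesSketch` class. It uses exact-key lookups and does no case normalisation, which is an assumption. Returning the live map from getAllProperties() follows the wording "underlying map" in the method summary.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of map-backed document properties; not Terrier code.
class DocPropertiesSketch {
    private final Map<String, String> properties = new HashMap<>();

    // Returns the value stored under the exact name, or null if none exists.
    String getProperty(String name) {
        return properties.get(name);
    }

    // Adds or overwrites a named property.
    void setProperty(String name, String value) {
        properties.put(name, value);
    }

    // Returns the live backing map, so later setProperty calls are visible through it.
    Map<String, String> getAllProperties() {
        return properties;
    }
}
```

Because the backing map is returned directly, a caller holding it sees properties added afterwards; a defensive copy would behave differently.
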
public static void main(java.lang.String[] args)
args
- A filename to parse
public static Document generateDocumentFromFile(java.lang.String filename)
public static void dumpDocument(Document d)
d
- a Document object