public class TaggedDocument extends Object implements Document
getNextTerm() returns the next token in the current chunk of text, according to the specified tokeniser.
This class uses the following properties:
Modifier and Type | Field and Description |
---|---|
protected TagSet | _exact: The tags to process exactly. |
protected TagSet | _fields: The tags to consider as fields. |
protected TagSet | _tags: The tags to process or skip. |
protected int | abstractCount: Number of abstract types. |
protected int[] | abstractlengths: The maximum length of each named abstract (comma separated list). |
protected gnu.trove.TObjectIntHashMap<String> | abstractName2Index: A mapping for quick lookup of abstract tag names. |
protected String[] | abstractnames: The names of the abstracts to be saved (comma separated list). |
protected StringBuilder[] | abstracts: Builders for each abstract. |
protected String[] | abstracttags: The fields that the named abstracts come from (comma separated list). |
protected boolean | abstractTagsCaseSensitive |
protected Reader | br: The input reader. |
protected boolean | considerAbstracts: Flag that determines whether to short-cut the abstract generation method. |
protected long | counter: The number of bytes read from the input. |
protected TokenStream | currentTokenStream |
protected int | elseAbstractSpecialTag: Else field index. |
protected boolean | EOD: End of Document. |
protected boolean | error: Indicates whether an error has occurred. |
protected Set<String> | htmlStk: The hash set where the tags, considered as fields, are inserted. |
protected boolean | inHtmlTagToProcess: Specifies whether the tokeniser is in a field tag to process. |
protected boolean | inTagToProcess: Indicates whether we are in a tag to process. |
protected boolean | inTagToSkip: Indicates whether we are in a tag to skip. |
protected int | lastChar: Saves the last read character between consecutive calls of getNextTerm(). |
protected static org.slf4j.Logger | logger |
protected static boolean | lowercase: Change to lowercase? |
protected static int | maxNumOfDigitsPerTerm: The maximum number of digits that are allowed in valid terms. |
protected static int | maxNumOfSameConseqLettersPerTerm: The maximum number of consecutive same letters or digits that are allowed in valid terms. |
protected Map<String,String> | properties |
protected Stack<String> | stk: The stack where the tags are pushed and popped accordingly. |
protected String[] | stringArray: A temporary String array. |
protected StringBuilder | sw |
protected StringBuilder | tagNameSB |
protected Tokeniser | tokeniser |
protected static int | tokenMaximumLength: The maximum length of a token in the check method. |
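The abstract-related fields (abstractnames, abstracttags, abstractlengths, abstracts, abstractName2Index) and saveToAbstract(String text, String tag) suggest a bookkeeping scheme along the following lines. This is a minimal sketch, not the Terrier implementation: the class AbstractStoreSketch and its methods save and get are hypothetical, a plain HashMap stands in for the Trove map, tag matching is assumed to be case sensitive, and the single-space separator between saved chunks is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the abstract bookkeeping; not Terrier code. */
class AbstractStoreSketch {
    private final String[] names;          // plays the role of abstractnames
    private final int[] maxLengths;        // plays the role of abstractlengths
    private final StringBuilder[] builders; // plays the role of abstracts
    // Maps a source tag to the index of its abstract (cf. abstractName2Index).
    private final Map<String, Integer> tagToIndex = new HashMap<>();

    /** All three arguments are comma separated lists of equal length (assumed). */
    AbstractStoreSketch(String nameList, String tagList, String lengthList) {
        names = nameList.split(",");
        String[] tags = tagList.split(",");
        String[] lengths = lengthList.split(",");
        maxLengths = new int[names.length];
        builders = new StringBuilder[names.length];
        for (int i = 0; i < names.length; i++) {
            names[i] = names[i].trim();
            maxLengths[i] = Integer.parseInt(lengths[i].trim());
            builders[i] = new StringBuilder();
            tagToIndex.put(tags[i].trim(), i);
        }
    }

    /** Appends text to the abstract fed by this tag, up to its maximum length. */
    void save(String text, String tag) {
        Integer index = tagToIndex.get(tag);
        if (index == null) {
            return; // this tag does not feed any abstract
        }
        StringBuilder sb = builders[index];
        int room = maxLengths[index] - sb.length();
        if (room <= 0) {
            return; // abstract is already full
        }
        if (sb.length() > 0 && room > 1) {
            sb.append(' '); // separator between chunks (assumption)
            room--;
        }
        sb.append(text, 0, Math.min(room, text.length()));
    }

    /** Returns the abstract saved under the given name, or null if unknown. */
    String get(String name) {
        for (int i = 0; i < names.length; i++) {
            if (names[i].equals(name)) {
                return builders[i].toString();
            }
        }
        return null;
    }
}
```

With names "title,body", tags "TITLE,BODY" and lengths "5,100", saving "Hello world" under TITLE keeps only the first five characters, and text from tags that are not listed is ignored.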
Constructor and Description |
---|
TaggedDocument(InputStream docStream, Map<String,String> docProperties, Tokeniser _tokeniser): Constructs an instance of the class from the given input stream. |
TaggedDocument(InputStream docStream, Map<String,String> docProperties, Tokeniser _tokeniser, String doctags, String exactdoctags, String fieldtags): Constructs an instance of the class from the given input stream. |
TaggedDocument(Reader docReader, Map<String,String> docProperties, Tokeniser _tokeniser): Constructs an instance of the class from the given reader object. |
Modifier and Type | Method and Description |
---|---|
static String | check(String s): Checks that a term is shorter than the maximum allowed length and does not contain too many digits or too many consecutive identical digits or letters. |
static void | dumpDocument(Document d): Dumps a document to stdout. |
boolean | endOfDocument(): Indicates whether the tokeniser has reached the end of the current document. |
static Document | generateDocumentFromFile(String filename): Instantiates a TREC document from a file. |
Map<String,String> | getAllProperties(): Returns the underlying map of all the properties defined by this Document. |
Set<String> | getFields(): Returns the fields in which the current term appears. |
String | getNextTerm(): Returns the next token from the current chunk of text, extracted from the document into a TokenStream. |
String | getProperty(String name): Allows access to a named property of the Document. |
Reader | getReader(): Returns the underlying buffered reader, so that client code can tokenise the document itself, and deal with it how it likes. |
static void | main(String[] args): Static method which dumps a document to System.out. |
protected void | processEndOfDocument() |
protected void | processEndOfTag(String tag): The encountered tag, which must be a closing tag, is matched with the tag on the stack. |
protected void | saveToAbstract(String text, String tag): Takes the text parsed from a tag and saves it to the abstract(s). |
void | setProperty(String name, String value): Allows a named property to be added to the Document. |
protected static final org.slf4j.Logger logger
protected static final int tokenMaximumLength
protected static final boolean lowercase
protected final String[] stringArray
protected Reader br
protected boolean EOD
protected long counter
protected int lastChar
protected boolean error
protected TagSet _tags
protected TagSet _exact
protected TagSet _fields
protected boolean inTagToProcess
protected boolean inTagToSkip
protected Set<String> htmlStk
protected boolean inHtmlTagToProcess
protected Tokeniser tokeniser
protected TokenStream currentTokenStream
protected final String[] abstractnames
protected final String[] abstracttags
protected final int[] abstractlengths
protected final boolean abstractTagsCaseSensitive
protected final int abstractCount
protected final StringBuilder[] abstracts
protected final gnu.trove.TObjectIntHashMap<String> abstractName2Index
protected final boolean considerAbstracts
protected int elseAbstractSpecialTag
protected final StringBuilder sw
protected final StringBuilder tagNameSB
protected static final int maxNumOfDigitsPerTerm
protected static final int maxNumOfSameConseqLettersPerTerm
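The fields stk, inTagToProcess and inTagToSkip, together with processEndOfTag(String tag), point to a stack-based way of tracking which tag the parser is currently inside. The sketch below illustrates that idea only. The class TagStackSketch and the methods openTag, closeTag and depth are hypothetical, plain sets stand in for TagSet, and the rule that the two flags always reflect the tag on top of the stack is an assumption rather than documented behaviour.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;

/** Hypothetical tag tracker; not the Terrier implementation. */
class TagStackSketch {
    private final Deque<String> stack = new ArrayDeque<>(); // plays the role of stk
    private final Set<String> processTags; // tags whose text is tokenised
    private final Set<String> skipTags;    // tags whose text is ignored
    boolean inTagToProcess;
    boolean inTagToSkip;

    TagStackSketch(Set<String> processTags, Set<String> skipTags) {
        this.processTags = processTags;
        this.skipTags = skipTags;
    }

    /** Pushes an opening tag and refreshes the flags. */
    void openTag(String tag) {
        stack.push(tag);
        updateFlags();
    }

    /** Pops the stack only if the closing tag matches the tag on top. */
    void closeTag(String tag) {
        if (!stack.isEmpty() && stack.peek().equals(tag)) {
            stack.pop();
        }
        updateFlags();
    }

    /** Number of tags currently open. */
    int depth() {
        return stack.size();
    }

    // The flags describe the innermost open tag (assumption).
    private void updateFlags() {
        String top = stack.peek();
        inTagToProcess = top != null && processTags.contains(top);
        inTagToSkip = top != null && skipTags.contains(top);
    }
}
```

A closing tag that does not match the top of the stack leaves the stack untouched in this sketch, which is one simple way to tolerate badly nested markup.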
public TaggedDocument(InputStream docStream, Map<String,String> docProperties, Tokeniser _tokeniser)
docStream -
docProperties -
_tokeniser -
public TaggedDocument(InputStream docStream, Map<String,String> docProperties, Tokeniser _tokeniser, String doctags, String exactdoctags, String fieldtags)
docStream -
docProperties -
_tokeniser -
doctags -
exactdoctags -
fieldtags -
public Reader getReader()
public String getNextTerm()
Specified by: getNextTerm in interface Document
protected void processEndOfDocument()
protected void saveToAbstract(String text, String tag)
text - the text to be saved
tag - the tag that this text came from
public boolean endOfDocument()
Specified by: endOfDocument in interface Document
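getNextTerm() and endOfDocument() are typically used together in a read loop. The sketch below shows that pattern against a hypothetical TermSource interface that mirrors only these two methods, so it runs without Terrier on the classpath. It assumes that getNextTerm() may return null for a token that was rejected, and that such a null does not mean the document has ended. The helper names fromList and collectTerms are illustrative.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Read-loop sketch over a hypothetical two-method view of a document. */
class TermLoopSketch {
    /** Hypothetical stand-in for the two Document methods used here. */
    interface TermSource {
        String getNextTerm();
        boolean endOfDocument();
    }

    /** Toy source backed by a list; null entries model rejected tokens. */
    static TermSource fromList(List<String> tokens) {
        final Iterator<String> it = tokens.iterator();
        return new TermSource() {
            public String getNextTerm() {
                return it.hasNext() ? it.next() : null;
            }
            public boolean endOfDocument() {
                return !it.hasNext();
            }
        };
    }

    /** Collects every non-null term until the source reports end of document. */
    static List<String> collectTerms(TermSource doc) {
        List<String> terms = new ArrayList<>();
        while (!doc.endOfDocument()) {
            String term = doc.getNextTerm();
            if (term != null) { // skip rejected tokens, keep reading
                terms.add(term);
            }
        }
        return terms;
    }
}
```

The loop terminates on endOfDocument() rather than on a null term, so a rejected token in the middle of the text does not cut the document short.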
protected void processEndOfTag(String tag)
tag - The closing tag to be tested against the content of the stack.
public static String check(String s)
s - the term to check for validity.
public String getProperty(String name)
Specified by: getProperty in interface Document
name - Name of the property. It is suggested, but not required, that this name should not be case insensitive.
public void setProperty(String name, String value)
name - Name of the property. It is suggested, but not required, that this name should not be case insensitive.
value - The value of the property
public Map<String,String> getAllProperties()
Specified by: getAllProperties in interface Document
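Since getAllProperties() is described as returning the underlying map, changes made through the returned map would be expected to show up in later getProperty calls. The sketch below illustrates that relationship with a hypothetical PropertyStoreSketch class. Keeping a reference to the map passed to the constructor, rather than copying it, is an assumption, and no case normalisation of property names is attempted.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical property holder mirroring the three property methods. */
class PropertyStoreSketch {
    private final Map<String, String> properties;

    /** Keeps a reference to the supplied map (assumption: no defensive copy). */
    PropertyStoreSketch(Map<String, String> docProperties) {
        this.properties = docProperties;
    }

    /** Returns the value stored under name, or null if it is not set. */
    String getProperty(String name) {
        return properties.get(name);
    }

    /** Adds or overwrites a named property. */
    void setProperty(String name, String value) {
        properties.put(name, value);
    }

    /** Returns the live backing map, not a snapshot. */
    Map<String, String> getAllProperties() {
        return properties;
    }

    public static void main(String[] args) {
        PropertyStoreSketch doc = new PropertyStoreSketch(new HashMap<String, String>());
        doc.setProperty("docno", "DOC-1");
        // Writing through the returned map is visible via getProperty.
        doc.getAllProperties().put("url", "http://example.org/");
        System.out.println(doc.getProperty("docno"));
        System.out.println(doc.getProperty("url"));
    }
}
```

Callers that need a stable snapshot would have to copy the returned map themselves in this sketch.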
public static void main(String[] args)
args - A filename to parse
public static Document generateDocumentFromFile(String filename)
public static void dumpDocument(Document d)
d - a Document object
Terrier Information Retrieval Platform 4.1. Copyright © 2004-2015, University of Glasgow