public class TaggedDocument extends Object implements Document
getNextTerm() returns the next token in the current chunk of text, according to the specified tokeniser.
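
The consumption pattern implied by getNextTerm() and endOfDocument() can be sketched as follows. This is a minimal illustration, not Terrier code: `TermSource` is a hypothetical stand-in that copies only those two method names, and `WhitespaceTermSource` is a toy implementation invented for the example. The loop skips null or empty terms defensively, on the assumption that a tokeniser may return them before the end of the document is reached.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the two methods used here; not the Terrier Document interface. */
interface TermSource {
    String getNextTerm();
    boolean endOfDocument();
}

/** Toy source that splits a string on whitespace; purely illustrative. */
final class WhitespaceTermSource implements TermSource {
    private final String[] parts;
    private int pos = 0;

    WhitespaceTermSource(String text) {
        this.parts = text.trim().split("\\s+");
    }

    @Override
    public String getNextTerm() {
        if (pos >= parts.length) {
            return null;
        }
        String term = parts[pos++];
        // An empty piece (for blank input) is reported as null, not as a term.
        return term.isEmpty() ? null : term;
    }

    @Override
    public boolean endOfDocument() {
        return pos >= parts.length;
    }
}

final class TermLoopSketch {
    /** Drains a source, using endOfDocument() as the only stop condition. */
    static List<String> collectTerms(TermSource doc) {
        List<String> terms = new ArrayList<>();
        while (!doc.endOfDocument()) {
            String term = doc.getNextTerm();
            // Assumption: null or empty results are skipped, not treated as end of input.
            if (term != null && !term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }
}
```

With a real TaggedDocument the same loop shape would apply, with the document built through one of the constructors listed below and a Tokeniser supplied by the caller.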
 
 This class uses the following properties:
| Modifier and Type | Field and Description |
|---|---|
| protected TagSet | _exact - The tags to process exactly. |
| protected TagSet | _fields - The tags to consider as fields. |
| protected TagSet | _tags - The tags to process or skip. |
| protected int | abstractCount - Number of abstract types. |
| protected int[] | abstractlengths - The maximum length of each named abstract (comma separated list). |
| protected gnu.trove.TObjectIntHashMap<String> | abstractName2Index - A mapping for quick lookup of abstract tag names. |
| protected String[] | abstractnames - The names of the abstracts to be saved (comma separated list). |
| protected StringBuilder[] | abstracts - Builders for each abstract. |
| protected String[] | abstracttags - The fields that the named abstracts come from (comma separated list). |
| protected boolean | abstractTagsCaseSensitive |
| protected Reader | br - The input reader. |
| protected boolean | considerAbstracts - Flag that determines whether to short-cut the abstract generation method. |
| protected long | counter - The number of bytes read from the input. |
| protected TokenStream | currentTokenStream |
| protected int | elseAbstractSpecialTag - Else field index. |
| protected boolean | EOD - End of Document. |
| protected boolean | error - Indicates whether an error has occurred. |
| protected Set<String> | htmlStk - The hash set where the tags, considered as fields, are inserted. |
| protected boolean | inHtmlTagToProcess - Specifies whether the tokeniser is in a field tag to process. |
| protected boolean | inTagToProcess - Indicates whether we are in a tag to process. |
| protected boolean | inTagToSkip - Indicates whether we are in a tag to skip. |
| protected int | lastChar - Saves the last read character between consecutive calls of getNextTerm(). |
| protected static org.slf4j.Logger | logger |
| protected static boolean | lowercase - Change to lowercase? |
| protected static int | maxNumOfDigitsPerTerm - The maximum number of digits that are allowed in valid terms. |
| protected static int | maxNumOfSameConseqLettersPerTerm - The maximum number of consecutive same letters or digits that are allowed in valid terms. |
| protected Map<String,String> | properties |
| protected Stack<String> | stk - The stack where the tags are pushed and popped accordingly. |
| protected String[] | stringArray - A temporary String array. |
| protected StringBuilder | sw |
| protected StringBuilder | tagNameSB |
| protected Tokeniser | tokeniser |
| protected static int | tokenMaximumLength - The maximum length of a token in the check method. |
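
The stk field is described as a stack on which tags are pushed and popped, and processEndOfTag(String tag) matches a closing tag against it. A minimal sketch of that kind of bookkeeping is shown below. The class `TagStackSketch` and all of its method names are hypothetical, and the recovery rule for mismatched tags (unwind down to the matching tag, ignore unknown tags) is an assumption made for illustration, not a statement of what TaggedDocument does.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of open-tag bookkeeping; not the Terrier implementation. */
final class TagStackSketch {
    private final Deque<String> open = new ArrayDeque<>();

    /** Records an opening tag. */
    void startTag(String tag) {
        open.push(tag);
    }

    /** Matches a closing tag against the stack of open tags. */
    void endTag(String tag) {
        if (open.isEmpty()) {
            return; // nothing is open, so there is nothing to match
        }
        if (tag.equals(open.peek())) {
            open.pop(); // well-nested case: the closing tag matches the top
            return;
        }
        // Assumed recovery: if the tag is open further down, unwind to it;
        // a closing tag that was never opened is ignored.
        if (open.contains(tag)) {
            while (!open.isEmpty()) {
                if (open.pop().equals(tag)) {
                    break;
                }
            }
        }
    }

    int depth() {
        return open.size();
    }

    String top() {
        return open.peek();
    }
}
```

ArrayDeque is used here in place of java.util.Stack only because it is the usual choice for new code; the field listed above is declared as Stack<String>.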
| Constructor and Description |
|---|
| TaggedDocument(InputStream docStream, Map<String,String> docProperties, Tokeniser _tokeniser) - Constructs an instance of the class from the given input stream. |
| TaggedDocument(InputStream docStream, Map<String,String> docProperties, Tokeniser _tokeniser, String doctags, String exactdoctags, String fieldtags) - Constructs an instance of the class from the given input stream. |
| TaggedDocument(Reader docReader, Map<String,String> docProperties, Tokeniser _tokeniser) - Constructs an instance of the class from the given reader object. |
| Modifier and Type | Method and Description |
|---|---|
| static String | check(String s) - Checks that a term is shorter than the maximum allowed length and does not contain too many digits or too many consecutive identical digits or letters. |
| static void | dumpDocument(Document d) - Dumps a document to stdout. |
| boolean | endOfDocument() - Indicates whether the tokeniser has reached the end of the current document. |
| static Document | generateDocumentFromFile(String filename) - Instantiates a TREC document from a file. |
| Map<String,String> | getAllProperties() - Returns the underlying map of all the properties defined by this Document. |
| Set<String> | getFields() - Returns the fields in which the current term appears. |
| String | getNextTerm() - Returns the next token from the current chunk of text, extracted from the document into a TokenStream. |
| String | getProperty(String name) - Allows access to a named property of the Document. |
| Reader | getReader() - Returns the underlying buffered reader, so that client code can tokenise the document itself, and deal with it how it likes. |
| static void | main(String[] args) - Static method which dumps a document to System.out. |
| protected void | processEndOfDocument() |
| protected void | processEndOfTag(String tag) - The encountered tag, which must be a closing tag, is matched with the tag on the stack. |
| protected void | saveToAbstract(String text, String tag) - Takes the text parsed from a tag and saves it to the abstract(s). |
| void | setProperty(String name, String value) - Allows a named property to be added to the Document. |
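
The rule described for check(String s) (a length limit, a digit limit, and a limit on runs of the same character) can be illustrated with a short sketch. Everything in it is an assumption made for the example: the class name `TermCheckSketch`, the three limit values, the choice to return an empty string for a rejected term, and the unconditional lowercasing. The real class reads its limits from the static fields tokenMaximumLength, maxNumOfDigitsPerTerm and maxNumOfSameConseqLettersPerTerm, whose values are not given on this page.

```java
import java.util.Locale;

/** Hypothetical sketch of a term validity check; not the Terrier implementation. */
final class TermCheckSketch {
    // Illustrative limits only; the real values are not stated in this document.
    static final int MAX_LENGTH = 20;
    static final int MAX_DIGITS = 4;
    static final int MAX_SAME_RUN = 3;

    /** Returns the lowercased term if it passes all checks, otherwise "" (assumed convention). */
    static String check(String s) {
        if (s == null) {
            return "";
        }
        s = s.trim();
        if (s.isEmpty() || s.length() > MAX_LENGTH) {
            return "";
        }
        int digits = 0;   // total number of digit characters seen so far
        int run = 0;      // length of the current run of identical characters
        char prev = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= '0' && c <= '9') {
                digits++;
            }
            run = (i > 0 && c == prev) ? run + 1 : 1;
            prev = c;
            if (digits > MAX_DIGITS || run > MAX_SAME_RUN) {
                return "";
            }
        }
        return s.toLowerCase(Locale.ROOT);
    }
}
```

Under these assumed limits, "2015" is kept, while "12345" (five digits) and "aaaa" (a run of four) are rejected.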
protected static final org.slf4j.Logger logger
protected static final int tokenMaximumLength
protected static final boolean lowercase
protected final String[] stringArray
protected Reader br
protected boolean EOD
protected long counter
protected int lastChar
protected boolean error
protected TagSet _tags
protected TagSet _exact
protected TagSet _fields
protected boolean inTagToProcess
protected boolean inTagToSkip
protected Set<String> htmlStk
protected boolean inHtmlTagToProcess
protected Tokeniser tokeniser
protected TokenStream currentTokenStream
protected final String[] abstractnames
protected final String[] abstracttags
protected final int[] abstractlengths
protected final boolean abstractTagsCaseSensitive
protected final int abstractCount
protected final StringBuilder[] abstracts
protected final gnu.trove.TObjectIntHashMap<String> abstractName2Index
protected final boolean considerAbstracts
protected int elseAbstractSpecialTag
protected final StringBuilder sw
protected final StringBuilder tagNameSB
protected static final int maxNumOfDigitsPerTerm
protected static final int maxNumOfSameConseqLettersPerTerm
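
The abstract-related fields above (abstracttags, abstractlengths, abstracts, abstractName2Index) suggest one builder per named abstract, each capped at a maximum length and looked up by tag name. A minimal sketch of that arrangement follows. `AbstractSketch` and its methods are hypothetical, a plain HashMap stands in for the Trove map, and the truncation rule (append only as much text as still fits) is an assumption, not a description of saveToAbstract.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of length-capped abstracts keyed by tag; not the Terrier implementation. */
final class AbstractSketch {
    private final Map<String, Integer> tagToIndex = new HashMap<>();
    private final StringBuilder[] abstracts;
    private final int[] maxLengths;

    /** tags and maxLengths are assumed to be parallel arrays of equal length. */
    AbstractSketch(String[] tags, int[] maxLengths) {
        this.maxLengths = maxLengths.clone();
        this.abstracts = new StringBuilder[tags.length];
        for (int i = 0; i < tags.length; i++) {
            tagToIndex.put(tags[i], i);
            abstracts[i] = new StringBuilder();
        }
    }

    /** Appends text to the abstract for the given tag, up to that abstract's maximum length. */
    void save(String text, String tag) {
        Integer idx = tagToIndex.get(tag);
        if (idx == null) {
            return; // no abstract is configured for this tag
        }
        StringBuilder sb = abstracts[idx];
        int room = maxLengths[idx] - sb.length();
        if (room <= 0) {
            return; // this abstract is already full
        }
        sb.append(text, 0, Math.min(room, text.length()));
    }

    /** Returns the abstract collected so far, or null for an unknown tag. */
    String get(String tag) {
        Integer idx = tagToIndex.get(tag);
        return idx == null ? null : abstracts[idx].toString();
    }
}
```

Checking the remaining room before appending is also the natural place for an early exit of the kind the considerAbstracts flag hints at.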
public TaggedDocument(InputStream docStream, Map<String,String> docProperties, Tokeniser _tokeniser)
docStream -
docProperties -
_tokeniser -

public TaggedDocument(InputStream docStream, Map<String,String> docProperties, Tokeniser _tokeniser, String doctags, String exactdoctags, String fieldtags)
docStream -
docProperties -
_tokeniser -
doctags -
exactdoctags -
fieldtags -

public Reader getReader()

public String getNextTerm()
getNextTerm in interface Document

protected void processEndOfDocument()

protected void saveToAbstract(String text, String tag)
text - the text to be saved
tag - the tag that this text came from

public boolean endOfDocument()
endOfDocument in interface Document

protected void processEndOfTag(String tag)
tag - The closing tag to be tested against the content of the stack.

public static String check(String s)
s - The term to check for validity.

public String getProperty(String name)
getProperty in interface Document
name - Name of the property. It is suggested, but not required, that this name should not be case insensitive.

public void setProperty(String name, String value)
name - Name of the property. It is suggested, but not required, that this name should not be case insensitive.
value - The value of the property

public Map<String,String> getAllProperties()
getAllProperties in interface Document

public static void main(String[] args)
args - A filename to parse

public static Document generateDocumentFromFile(String filename)

public static void dumpDocument(Document d)
d - a Document object

Terrier Information Retrieval Platform 4.1. Copyright © 2004-2015, University of Glasgow