public class TRECFullTokenizer extends Object implements Tokenizer
NB: This class accepts only the characters A-Z, a-z and 0-9 as valid in query terms. If this restriction is too tight, use TRECFullUTFTokenizer instead.
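To make the character restriction concrete, here is a minimal sketch of an ASCII-alphanumeric test. The class `TermCheckSketch` and the method `isAsciiAlphanumeric` are hypothetical names used for illustration only; this is not the code of TRECFullTokenizer, and the real `check(String s)` method may apply further rules (for example, the maximum token length).

```java
// Illustrative sketch only: NOT the Terrier implementation.
class TermCheckSketch {

    // Hypothetical helper: returns true only if the term is non-empty and
    // every character is in A-Z, a-z or 0-9.
    static boolean isAsciiAlphanumeric(String term) {
        if (term == null || term.isEmpty()) {
            return false;
        }
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            boolean ok = (c >= 'A' && c <= 'Z')
                      || (c >= 'a' && c <= 'z')
                      || (c >= '0' && c <= '9');
            if (!ok) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isAsciiAlphanumeric("trec2004"));  // true
        System.out.println(isAsciiAlphanumeric("e-mail"));    // false (hyphen)
        System.out.println(isAsciiAlphanumeric("caf\u00e9")); // false (non-ASCII letter)
    }
}
```

Terms such as the last one are the case where the UTF-aware tokenizer mentioned above would be the better fit.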
See Also: TagSet

| Modifier and Type | Field and Description |
|---|---|
| BufferedReader | br: The input reader. |
| long | counter: The number of bytes read from the input. |
| boolean | EOD: The end of document. |
| boolean | EOF: The end of file from the buffered reader. |
| boolean | error: A flag which is set when errors are encountered. |
| protected TagSet | exactTagSet: The set of exact tags. |
| protected boolean | ignoreMissingClosingTags: An option to ignore missing closing tags. |
| boolean | inDocnoTag: Is in docno tag? |
| boolean | inTagToProcess: Is in tag to process? |
| boolean | inTagToSkip: Is in tag to skip? |
| static int | lastChar: Last character read. |
| protected static org.slf4j.Logger | logger |
| protected static boolean | lowercase: Transform to lowercase or not? |
| int | number_of_terms: A counter for the number of terms. |
| protected static Stack<String> | stk: The stack where the tags are pushed and popped accordingly. |
| protected StringBuilder | sw |
| protected StringBuilder | tagNameSB |
| protected TagSet | tagSet: The tag set to use. |
| protected static int | tokenMaximumLength: The maximum length of a token in the check method. |

| Constructor and Description |
|---|
| TRECFullTokenizer(): Constructs an instance of the TRECFullTokenizer. |
| TRECFullTokenizer(BufferedReader _br): Constructs an instance of the TRECFullTokenizer, given the buffered reader. |
| TRECFullTokenizer(TagSet _tagSet, TagSet _exactSet): Constructs an instance of the TRECFullTokenizer with non-default tags. |
| TRECFullTokenizer(TagSet _ts, TagSet _exactSet, BufferedReader _br): Constructs an instance of the TRECFullTokenizer with non-default tags and a given buffered reader. |

| Modifier and Type | Method and Description |
|---|---|
| protected String | check(String s): A restricted check function for discarding uncommon or 'strange' terms. |
| void | close(): Closes the buffered reader associated with the tokenizer. |
| void | closeBufferedReader(): Closes the buffered reader associated with the tokenizer. |
| String | currentTag(): Returns the name of the tag the tokenizer is currently in. |
| long | getByteOffset(): Returns the number of bytes read from the current file. |
| boolean | inDocnoTag(): Indicates whether the tokenizer is in the special document number tag. |
| boolean | inTagToProcess(): Returns true if the current tag is to be processed. |
| boolean | inTagToSkip(): Returns true if the current tag is to be skipped. |
| boolean | isEndOfDocument(): Returns true if the end of document is encountered. |
| boolean | isEndOfFile(): Returns true if the end of file is encountered. |
| void | nextDocument(): Proceeds to the next document. |
| String | nextToken(): Returns the next token from the current chunk of text, extracted from the document into a TokenStream. |
| protected void | processEndOfTag(String tag): The encountered tag, which must be a closing tag, is matched with the tag on the stack. |
| void | setIgnoreMissingClosingTags(boolean toIgnore): Sets the value of ignoreMissingClosingTags. |
| void | setInput(BufferedReader _br): Sets the input of the tokenizer. |
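
The method names above suggest a two-level read loop: an outer loop over documents until the end of file, and an inner loop over tokens until the end of document. The sketch below shows that pattern without depending on Terrier. `SketchTokenizer` and `ListTokenizer` are hypothetical stand-ins that reuse only the four method names from the table, and the null check assumes that `nextToken()` may return null for a discarded term, which this page does not state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: a toy tokenizer, not the Terrier classes.
class TokenizerLoopSketch {

    /** Hypothetical stand-in exposing four of the method names listed above. */
    interface SketchTokenizer {
        String nextToken();
        void nextDocument();
        boolean isEndOfDocument();
        boolean isEndOfFile();
    }

    /** Toy in-memory implementation: each inner list is one document. */
    static final class ListTokenizer implements SketchTokenizer {
        private final List<List<String>> docs;
        private int doc = 0; // index of the current document
        private int pos = 0; // index of the next token in that document

        ListTokenizer(List<List<String>> docs) {
            this.docs = docs;
        }

        public boolean isEndOfFile() {
            return doc >= docs.size();
        }

        public boolean isEndOfDocument() {
            return isEndOfFile() || pos >= docs.get(doc).size();
        }

        public String nextToken() {
            return isEndOfDocument() ? null : docs.get(doc).get(pos++);
        }

        public void nextDocument() {
            if (!isEndOfFile()) {
                doc++;
                pos = 0;
            }
        }
    }

    /** Collects the non-null terms of every document, one list per document. */
    static List<List<String>> readAll(SketchTokenizer t) {
        List<List<String>> out = new ArrayList<>();
        while (!t.isEndOfFile()) {
            List<String> terms = new ArrayList<>();
            while (!t.isEndOfDocument()) {
                String term = t.nextToken();
                if (term != null) { // assumption: null marks a discarded term
                    terms.add(term);
                }
            }
            out.add(terms);
            t.nextDocument();
        }
        return out;
    }

    public static void main(String[] args) {
        SketchTokenizer t = new ListTokenizer(Arrays.asList(
                Arrays.asList("terrier", null, "retrieval"),
                Arrays.asList("glasgow")));
        System.out.println(readAll(t)); // [[terrier, retrieval], [glasgow]]
    }
}
```

With the real class, the input would instead come from a BufferedReader passed to the constructor or to `setInput`, and `close()` would be called once the loop finishes.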
protected static final org.slf4j.Logger logger
protected boolean ignoreMissingClosingTags
public static int lastChar
public int number_of_terms
public boolean EOF
public boolean EOD
public boolean error
public BufferedReader br
public long counter
protected TagSet tagSet
protected TagSet exactTagSet
protected static final int tokenMaximumLength
protected static final boolean lowercase
public boolean inTagToProcess
public boolean inTagToSkip
public boolean inDocnoTag
protected final StringBuilder sw
protected final StringBuilder tagNameSB
public TRECFullTokenizer()
public TRECFullTokenizer(BufferedReader _br)
_br - java.io.BufferedReader the input stream to tokenize.
public TRECFullTokenizer(TagSet _tagSet, TagSet _exactSet)
_tagSet - TagSet the document tags to process.
_exactSet - TagSet the document tags to process exactly, without applying strict checks.
public TRECFullTokenizer(TagSet _ts, TagSet _exactSet, BufferedReader _br)
_ts - TagSet the document tags to process.
_exactSet - TagSet the document tags to process exactly, without applying strict checks.
_br - java.io.BufferedReader the input to tokenize.
protected String check(String s)
s - The term to check.
public void close()
public void closeBufferedReader()
public String currentTag()
currentTag in interface Tokenizer
public boolean inDocnoTag()
inDocnoTag in interface Tokenizer
public boolean inTagToProcess()
inTagToProcess in interface Tokenizer
public boolean inTagToSkip()
inTagToSkip in interface Tokenizer
public boolean isEndOfDocument()
isEndOfDocument in interface Tokenizer
public boolean isEndOfFile()
isEndOfFile in interface Tokenizer
public void nextDocument()
nextDocument in interface Tokenizer
public String nextToken()
protected void processEndOfTag(String tag)
tag - The closing tag to be tested against the content of the stack.
public void setIgnoreMissingClosingTags(boolean toIgnore)
toIgnore - boolean, whether to ignore missing closing tags.
public long getByteOffset()
getByteOffset in interface Tokenizer
public void setInput(BufferedReader _br)
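
The descriptions of `stk`, `processEndOfTag` and `ignoreMissingClosingTags` point to a stack-based way of tracking nested tags. The sketch below illustrates one possible reading of it. The class `TagStackSketch` and its `open` and `close` methods are hypothetical, and the recovery policy applied when closing tags are missing (pop unclosed inner tags until the matching one is found) is an assumption, not the documented behaviour of TRECFullTokenizer.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch only: one plausible tag-matching policy, not Terrier's code.
class TagStackSketch {
    private final Deque<String> stack = new ArrayDeque<>();
    private final boolean ignoreMissingClosingTags;
    boolean error = false; // set when a closing tag cannot be matched

    TagStackSketch(boolean ignoreMissingClosingTags) {
        this.ignoreMissingClosingTags = ignoreMissingClosingTags;
    }

    /** Records an opening tag. */
    void open(String tag) {
        stack.push(tag);
    }

    /** Matches a closing tag against the stack. */
    void close(String tag) {
        if (tag.equals(stack.peek())) {
            stack.pop(); // well-nested case: the closing tag matches the top
            return;
        }
        if (ignoreMissingClosingTags && stack.contains(tag)) {
            // Assumed policy: treat unclosed inner tags as implicitly closed.
            while (!stack.peek().equals(tag)) {
                stack.pop();
            }
            stack.pop();
            return;
        }
        error = true; // mismatch: leave the stack untouched and flag it
    }

    /** Name of the innermost open tag, or null if no tag is open. */
    String currentTag() {
        return stack.peek();
    }

    public static void main(String[] args) {
        TagStackSketch strict = new TagStackSketch(false);
        strict.open("DOC");
        strict.open("TEXT");
        strict.close("DOC"); // TEXT was never closed
        System.out.println(strict.error + " " + strict.currentTag()); // true TEXT

        TagStackSketch lenient = new TagStackSketch(true);
        lenient.open("DOC");
        lenient.open("TEXT");
        lenient.close("DOC"); // TEXT is dropped implicitly
        System.out.println(lenient.error + " " + lenient.currentTag()); // false null
    }
}
```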
Terrier Information Retrieval Platform 4.1. Copyright © 2004-2015, University of Glasgow