public class TRECFullTokenizer extends Object implements Tokenizer
NB: This class only accepts A-Z, a-z and 0-9 as valid characters for query terms. If this restriction is too tight, please use TRECFullUTFTokenizer instead.
See Also: TagSet
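To make the A-Z, a-z, 0-9 restriction concrete, the following is a minimal, self-contained sketch and not Terrier's implementation. The class `AsciiTermFilter` and its methods are hypothetical names, and the sketch assumes that any character outside the accepted set acts as a term separator, which this page does not state.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Terrier.
class AsciiTermFilter {

    // True only for the characters the NB above lists: A-Z, a-z and 0-9.
    static boolean isAllowed(char c) {
        return (c >= 'A' && c <= 'Z')
            || (c >= 'a' && c <= 'z')
            || (c >= '0' && c <= '9');
    }

    // Assumption: every other character ends the current term.
    static List<String> split(String text) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isAllowed(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                terms.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            terms.add(current.toString());
        }
        return terms;
    }

    public static void main(String[] args) {
        // Accented letters fall outside the accepted set and split the words.
        System.out.println(split("caf\u00e9 na\u00efve TREC-2004"));
        // prints [caf, na, ve, TREC, 2004]
    }
}
```

Under this assumption accented words are broken apart, which is the kind of loss that would motivate switching to TRECFullUTFTokenizer.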
Modifier and Type | Field and Description |
---|---|
BufferedReader | br: The input reader. |
long | counter: The number of bytes read from the input. |
boolean | EOD: The end of document. |
boolean | EOF: The end of file from the buffered reader. |
boolean | error: A flag which is set when errors are encountered. |
protected TagSet | exactTagSet: The set of exact tags. |
protected boolean | ignoreMissingClosingTags: An option to ignore missing closing tags. |
boolean | inDocnoTag: Is in docno tag? |
boolean | inTagToProcess: Is in tag to process? |
boolean | inTagToSkip: Is in tag to skip? |
static int | lastChar: The last character read. |
protected static org.apache.log4j.Logger | logger |
protected static boolean | lowercase: Transform to lowercase or not? |
int | number_of_terms: A counter for the number of terms. |
protected static Stack<String> | stk: The stack where the tags are pushed and popped accordingly. |
protected StringBuilder | sw |
protected StringBuilder | tagNameSB |
protected TagSet | tagSet: The tag set to use. |
protected static int | tokenMaximumLength: The maximum length of a token in the check method. |
Constructor and Description |
---|
TRECFullTokenizer(): Constructs an instance of the TRECFullTokenizer. |
TRECFullTokenizer(BufferedReader _br): Constructs an instance of the TRECFullTokenizer, given the buffered reader. |
TRECFullTokenizer(TagSet _tagSet, TagSet _exactSet): Constructs an instance of the TRECFullTokenizer with non-default tags. |
TRECFullTokenizer(TagSet _ts, TagSet _exactSet, BufferedReader _br): Constructs an instance of the TRECFullTokenizer with non-default tags and a given buffered reader. |
Modifier and Type | Method and Description |
---|---|
protected String | check(String s): A restricted check function for discarding uncommon, or 'strange', terms. |
void | close(): Closes the buffered reader associated with the tokenizer. |
void | closeBufferedReader(): Closes the buffered reader associated with the tokenizer. |
String | currentTag(): Returns the name of the tag the tokenizer is currently in. |
long | getByteOffset(): Returns the number of bytes read from the current file. |
boolean | inDocnoTag(): Indicates whether the tokenizer is in the special document number tag. |
boolean | inTagToProcess(): Returns true if the given tag is to be processed. |
boolean | inTagToSkip(): Returns true if the given tag is to be skipped. |
boolean | isEndOfDocument(): Returns true if the end of document is encountered. |
boolean | isEndOfFile(): Returns true if the end of file is encountered. |
void | nextDocument(): Proceeds to the next document. |
String | nextToken(): Returns the next token from the current chunk of text, extracted from the document into a TokenStream. |
protected void | processEndOfTag(String tag): The encountered tag, which must be a final tag, is matched with the tag on the stack. |
void | setIgnoreMissingClosingTags(boolean toIgnore): Sets the value of ignoreMissingClosingTags. |
void | setInput(BufferedReader _br): Sets the input of the tokenizer. |
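The descriptions of processEndOfTag, stk and ignoreMissingClosingTags suggest stack-based tag matching. The sketch below illustrates that general technique in isolation; it is not Terrier's code. The class `TagStackSketch`, its method names, and the recovery policy (discarding unclosed inner tags when the lenient option is on, raising an error flag otherwise) are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration only; not part of Terrier.
class TagStackSketch {
    private final Deque<String> stack = new ArrayDeque<>();
    private final boolean ignoreMissingClosingTags;
    private boolean error = false;

    TagStackSketch(boolean ignoreMissingClosingTags) {
        this.ignoreMissingClosingTags = ignoreMissingClosingTags;
    }

    // An opening tag is pushed onto the stack.
    void openTag(String tag) {
        stack.push(tag);
    }

    // A closing tag is matched against the tag on top of the stack.
    void closeTag(String tag) {
        if (tag.equals(stack.peek())) {
            stack.pop();
        } else if (ignoreMissingClosingTags && stack.contains(tag)) {
            // Assumed lenient policy: drop inner tags that were never closed,
            // then pop the matching tag itself.
            while (!tag.equals(stack.pop())) {
                // keep discarding
            }
        } else {
            // Assumed strict policy: leave the stack alone and flag an error.
            error = true;
        }
    }

    // Name of the innermost open tag, or null when no tag is open.
    String currentTag() {
        return stack.peek();
    }

    boolean hasError() {
        return error;
    }

    public static void main(String[] args) {
        TagStackSketch lenient = new TagStackSketch(true);
        lenient.openTag("DOC");
        lenient.openTag("TEXT");
        lenient.closeTag("DOC"); // the closing TEXT tag is missing
        System.out.println(lenient.currentTag() + " " + lenient.hasError());
        // prints null false

        TagStackSketch strict = new TagStackSketch(false);
        strict.openTag("DOC");
        strict.openTag("TEXT");
        strict.closeTag("DOC");
        System.out.println(strict.currentTag() + " " + strict.hasError());
        // prints TEXT true
    }
}
```

The sketch keeps the stack per instance for simplicity; the stk field listed above is static, so the real class shares it across instances.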
protected static final org.apache.log4j.Logger logger
protected boolean ignoreMissingClosingTags
public static int lastChar
public int number_of_terms
public boolean EOF
public boolean EOD
public boolean error
public BufferedReader br
public long counter
protected TagSet tagSet
protected TagSet exactTagSet
protected static final int tokenMaximumLength
protected static final boolean lowercase
public boolean inTagToProcess
public boolean inTagToSkip
public boolean inDocnoTag
protected final StringBuilder sw
protected final StringBuilder tagNameSB
public TRECFullTokenizer()
public TRECFullTokenizer(BufferedReader _br)
_br - java.io.BufferedReader the input stream to tokenize.
public TRECFullTokenizer(TagSet _tagSet, TagSet _exactSet)
_tagSet - TagSet the document tags to process.
_exactSet - TagSet the document tags to process exactly, without applying strict checks.
public TRECFullTokenizer(TagSet _ts, TagSet _exactSet, BufferedReader _br)
_ts - TagSet the document tags to process.
_exactSet - TagSet the document tags to process exactly, without applying strict checks.
_br - java.io.BufferedReader the input to tokenize.
protected String check(String s)
s - The term to check.
public void close()
public void closeBufferedReader()
public String currentTag()
Specified by: currentTag in interface Tokenizer
public boolean inDocnoTag()
Specified by: inDocnoTag in interface Tokenizer
public boolean inTagToProcess()
Specified by: inTagToProcess in interface Tokenizer
public boolean inTagToSkip()
Specified by: inTagToSkip in interface Tokenizer
public boolean isEndOfDocument()
Specified by: isEndOfDocument in interface Tokenizer
public boolean isEndOfFile()
Specified by: isEndOfFile in interface Tokenizer
public void nextDocument()
Specified by: nextDocument in interface Tokenizer
public String nextToken()
protected void processEndOfTag(String tag)
tag - The closing tag to be tested against the content of the stack.
public void setIgnoreMissingClosingTags(boolean toIgnore)
toIgnore - boolean, whether to ignore missing closing tags.
public long getByteOffset()
Specified by: getByteOffset in interface Tokenizer
public void setInput(BufferedReader _br)
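The check method is described only as discarding uncommon or 'strange' terms, with tokenMaximumLength bounding token length. As a rough sketch of what such a filter can look like, and not Terrier's implementation, the class below applies a length limit and optional lowercasing. The class name `TermCheckSketch`, the limit of 20, and the choice to return an empty string for a rejected term are assumptions; this page gives neither the real limit nor the rejected-term return value.

```java
import java.util.Locale;

// Hypothetical illustration only; not part of Terrier.
class TermCheckSketch {
    // Assumed value; the real tokenMaximumLength is not given on this page.
    static final int TOKEN_MAXIMUM_LENGTH = 20;
    // Mirrors the idea of the lowercase flag; the value is assumed.
    static final boolean LOWERCASE = true;

    // Returns the accepted term, or "" when the term is rejected (assumption).
    static String check(String s) {
        if (s == null) {
            return "";
        }
        String term = s.trim();
        if (term.isEmpty() || term.length() > TOKEN_MAXIMUM_LENGTH) {
            return "";
        }
        return LOWERCASE ? term.toLowerCase(Locale.ROOT) : term;
    }

    public static void main(String[] args) {
        System.out.println(check("Retrieval"));
        // prints retrieval
        System.out.println(check("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa").isEmpty());
        // prints true
    }
}
```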
Terrier 4.0. Copyright © 2004-2014 University of Glasgow