java.lang.Object org.terrier.indexing.TRECFullTokenizer
public class TRECFullTokenizer
This class is the tokenizer used for indexing TREC topic files. It can be used for tokenizing other topic file formats, provided that the tags to skip and to process are specified accordingly.
NB: This class only accepts A-Z, a-z and 0-9 as valid characters for query terms. If this restriction is too tight, please use TRECFullUTFTokenizer instead.
See Also: TagSet
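The effect of the character restriction noted above can be illustrated with a minimal, self-contained sketch. It is not taken from the Terrier source: the class name AsciiTokenSketch and both of its methods are hypothetical, and the assumption that every rejected character acts as a token boundary is ours.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of org.terrier.indexing.
final class AsciiTokenSketch {

    /** True only for A-Z, a-z and 0-9, the character classes named in the note above. */
    public static boolean isAsciiAlnum(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
    }

    /** Splits text at every character outside the accepted set (assumed behaviour). */
    public static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isAsciiAlnum(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                // A rejected character ends the token collected so far.
                out.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // The accented character is dropped and splits the term.
        System.out.println(tokens("caf\u00e9 2009-web")); // prints [caf, 2009, web]
    }
}
```

Under this reading, accented or non-Latin query terms are truncated or lost, which is the situation in which TRECFullUTFTokenizer is the recommended alternative.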
Field Summary | |
---|---|
java.io.BufferedReader |
br
The input reader. |
long |
counter
The number of bytes read from the input. |
boolean |
EOD
The end of document. |
boolean |
EOF
The end of file from the buffered reader. |
boolean |
error
A flag which is set when errors are encountered. |
protected TagSet |
exactTagSet
The set of exact tags. |
protected boolean |
ignoreMissingClosingTags
An option to ignore missing closing tags. |
boolean |
inDocnoTag
Is in docno tag? |
boolean |
inTagToProcess
Is in tag to process? |
boolean |
inTagToSkip
Is in tag to skip? |
static int |
lastChar
The last character read. |
protected static org.apache.log4j.Logger |
logger
|
protected static boolean |
lowercase
Whether to transform terms to lowercase. |
int |
number_of_terms
A counter for the number of terms. |
protected static java.util.Stack<java.lang.String> |
stk
The stack where the tags are pushed and popped accordingly. |
protected java.lang.StringBuilder |
sw
|
protected java.lang.StringBuilder |
tagNameSB
|
protected TagSet |
tagSet
The tag set to use. |
protected static int |
tokenMaximumLength
The maximum length of a token in the check method. |
Constructor Summary | |
---|---|
TRECFullTokenizer()
Constructs an instance of the TRECFullTokenizer. |
|
TRECFullTokenizer(java.io.BufferedReader _br)
Constructs an instance of the TRECFullTokenizer, given the buffered reader. |
|
TRECFullTokenizer(TagSet _tagSet,
TagSet _exactSet)
Constructs an instance of the TRECFullTokenizer with non-default tags. |
|
TRECFullTokenizer(TagSet _ts,
TagSet _exactSet,
java.io.BufferedReader _br)
Constructs an instance of the TRECFullTokenizer with non-default tags and a given buffered reader. |
Method Summary | |
---|---|
protected java.lang.String |
check(java.lang.String s)
A restricted check function for discarding uncommon, or 'strange' terms. |
void |
close()
Closes the buffered reader associated with the tokenizer. |
void |
closeBufferedReader()
Closes the buffered reader associated with the tokenizer. |
java.lang.String |
currentTag()
Returns the name of the tag the tokenizer is currently in. |
long |
getByteOffset()
Returns the number of bytes read from the current file. |
boolean |
inDocnoTag()
Indicates whether the tokenizer is in the special document number tag. |
boolean |
inTagToProcess()
Returns true if the tokenizer is currently in a tag to be processed. |
boolean |
inTagToSkip()
Returns true if the tokenizer is currently in a tag to be skipped. |
boolean |
isEndOfDocument()
Returns true if the end of document is encountered. |
boolean |
isEndOfFile()
Returns true if the end of file is encountered. |
void |
nextDocument()
Proceed to the next document. |
java.lang.String |
nextToken()
Returns the next token from the current chunk of text, extracted from the document into a TokenStream. |
protected void |
processEndOfTag(java.lang.String tag)
The encountered tag, which must be a closing tag, is matched with the tag on the stack. |
void |
setIgnoreMissingClosingTags(boolean toIgnore)
Sets the value of ignoreMissingClosingTags. |
void |
setInput(java.io.BufferedReader _br)
Sets the input of the tokenizer. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
protected static final org.apache.log4j.Logger logger
protected boolean ignoreMissingClosingTags
public static int lastChar
public int number_of_terms
public boolean EOF
public boolean EOD
public boolean error
public java.io.BufferedReader br
public long counter
protected static java.util.Stack<java.lang.String> stk
protected TagSet tagSet
protected TagSet exactTagSet
protected static final int tokenMaximumLength
protected static final boolean lowercase
public boolean inTagToProcess
public boolean inTagToSkip
public boolean inDocnoTag
protected final java.lang.StringBuilder sw
protected final java.lang.StringBuilder tagNameSB
Constructor Detail |
---|
public TRECFullTokenizer()
public TRECFullTokenizer(java.io.BufferedReader _br)
_br
- java.io.BufferedReader the input stream to tokenize.
public TRECFullTokenizer(TagSet _tagSet, TagSet _exactSet)
_tagSet
- TagSet the document tags to process.
_exactSet
- TagSet the document tags to process exactly, without applying strict checks.
public TRECFullTokenizer(TagSet _ts, TagSet _exactSet, java.io.BufferedReader _br)
_ts
- TagSet the document tags to process.
_exactSet
- TagSet the document tags to process exactly, without applying strict checks.
_br
- java.io.BufferedReader the input to tokenize.
Method Detail |
---|
protected java.lang.String check(java.lang.String s)
s
- The term to check.
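A restricted check of this kind might look like the sketch below. It covers only what this page states: a maximum token length and an optional lowercase transformation. The constant values, the class name TermCheckSketch, and the choice to return an empty string for a rejected term are all assumptions; the actual method may apply further rules.

```java
import java.util.Locale;

// Hypothetical illustration only; not the Terrier implementation of check().
final class TermCheckSketch {

    // Assumed values; this page does not give the real settings.
    static final int TOKEN_MAXIMUM_LENGTH = 20;
    static final boolean LOWERCASE = true;

    /** Returns the normalised term, or an empty string when the term is rejected (assumption). */
    public static String check(String s) {
        if (s == null || s.isEmpty() || s.length() > TOKEN_MAXIMUM_LENGTH) {
            return "";
        }
        return LOWERCASE ? s.toLowerCase(Locale.ROOT) : s;
    }

    public static void main(String[] args) {
        System.out.println(check("Terrier"));                              // prints terrier
        System.out.println(check("averyveryveryverylongtoken").isEmpty()); // prints true
    }
}
```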
public void close()
public void closeBufferedReader()
public java.lang.String currentTag()
currentTag
in interface Tokenizer
public boolean inDocnoTag()
inDocnoTag
in interface Tokenizer
public boolean inTagToProcess()
inTagToProcess
in interface Tokenizer
public boolean inTagToSkip()
inTagToSkip
in interface Tokenizer
public boolean isEndOfDocument()
isEndOfDocument
in interface Tokenizer
public boolean isEndOfFile()
isEndOfFile
in interface Tokenizer
public void nextDocument()
nextDocument
in interface Tokenizer
public java.lang.String nextToken()
nextToken
in interface Tokenizer
protected void processEndOfTag(java.lang.String tag)
tag
- The closing tag to be tested against the content of the stack.
public void setIgnoreMissingClosingTags(boolean toIgnore)
toIgnore
- boolean to ignore or not the missing closing tags
public long getByteOffset()
getByteOffset
in interface Tokenizer
public void setInput(java.io.BufferedReader _br)
setInput
in interface Tokenizer
_br
- BufferedReader the input stream
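The interplay between the tag stack, processEndOfTag and ignoreMissingClosingTags documented above can be sketched as follows. This is one plausible reading, not the Terrier implementation: the class TagStackSketch, the processStartOfTag method, and the lenient rule of unwinding past unclosed tags are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration only; not part of org.terrier.indexing.
final class TagStackSketch {

    private final Deque<String> stack = new ArrayDeque<>();
    private final boolean ignoreMissingClosingTags;
    public boolean error = false;

    public TagStackSketch(boolean ignoreMissingClosingTags) {
        this.ignoreMissingClosingTags = ignoreMissingClosingTags;
    }

    /** Pushes an opening tag (hypothetical helper, not listed on this page). */
    public void processStartOfTag(String tag) {
        stack.push(tag);
    }

    /** Matches a closing tag against the tag on top of the stack. */
    public void processEndOfTag(String tag) {
        if (!stack.isEmpty() && stack.peek().equals(tag)) {
            stack.pop();
            return;
        }
        if (ignoreMissingClosingTags && stack.contains(tag)) {
            // Assumed lenient behaviour: discard tags that were never closed.
            while (!stack.peek().equals(tag)) {
                stack.pop();
            }
            stack.pop();
            return;
        }
        // Strict mode, or a closing tag that was never opened.
        error = true;
    }

    /** Returns the innermost open tag, or null when no tag is open. */
    public String currentTag() {
        return stack.peek();
    }

    public static void main(String[] args) {
        TagStackSketch strict = new TagStackSketch(false);
        strict.processStartOfTag("DOC");
        strict.processStartOfTag("TEXT");
        strict.processEndOfTag("DOC"); // TEXT was never closed
        System.out.println(strict.error + " " + strict.currentTag()); // prints true TEXT
    }
}
```

With the lenient flag set, the same input unwinds both tags and leaves the error flag unset, which is useful for collections whose markup is not well formed.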