Package org.terrier.indexing
Class TRECFullTokenizer
- java.lang.Object
-
- org.terrier.indexing.TRECFullTokenizer
-
- All Implemented Interfaces:
Tokenizer
public class TRECFullTokenizer extends java.lang.Object implements Tokenizer
This class is the tokenizer used for indexing TREC topic files. It can be used for tokenizing other topic file formats, provided that the tags to skip and to process are specified accordingly. NB: This class only accepts A-Z, a-z and 0-9 as valid characters for query terms. If this restriction is too tight, please use TRECFullUTFTokenizer instead.
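A minimal usage sketch, based only on the methods documented on this page: wrap the input in a BufferedReader, iterate over documents with nextDocument()/isEndOfDocument(), and pull terms with nextToken(). The file name "topics.trec" and the handling of null or empty tokens are illustrative assumptions, not prescribed by this class.

  import java.io.BufferedReader;
  import java.io.FileReader;
  import org.terrier.indexing.TRECFullTokenizer;

  public class TokenizeTRECFile {
      public static void main(String[] args) throws Exception {
          // "topics.trec" is a hypothetical input file in TREC SGML-style format.
          BufferedReader br = new BufferedReader(new FileReader("topics.trec"));
          // Uses the default TagSet.TREC_DOC_TAGS / TagSet.TREC_EXACT_DOC_TAGS.
          TRECFullTokenizer tokenizer = new TRECFullTokenizer(br);

          while (!tokenizer.isEndOfFile()) {
              // Consume the terms of the current document.
              while (!tokenizer.isEndOfDocument()) {
                  String term = tokenizer.nextToken();
                  if (term == null || term.length() == 0)
                      continue; // skipped tag content, or a term rejected by check()
                  if (tokenizer.inTagToProcess())
                      System.out.println(tokenizer.currentTag() + ": " + term);
              }
              // Move on to the next document, if any.
              tokenizer.nextDocument();
          }
          tokenizer.close();
      }
  }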
- Author:
- Gianni Amati, Vassilis Plachouras
- See Also:
TagSet
-
-
Field Summary
Fields (Modifier and Type | Field | Description)
- java.io.BufferedReader br: The input reader.
- long counter: The number of bytes read from the input.
- boolean EOD: The end of document.
- boolean EOF: The end of file from the buffered reader.
- boolean error: A flag which is set when errors are encountered.
- protected TagSet exactTagSet: The set of exact tags.
- protected boolean ignoreMissingClosingTags: An option to ignore missing closing tags.
- boolean inDocnoTag: Is in docno tag?
- boolean inTagToProcess: Is in tag to process?
- boolean inTagToSkip: Is in tag to skip?
- static int lastChar: The last character read.
- protected static org.slf4j.Logger logger
- protected static boolean lowercase: Transform to lowercase or not?
- int number_of_terms: A counter for the number of terms.
- protected static java.util.Stack<java.lang.String> stk: The stack where the tags are pushed and popped accordingly.
- protected java.lang.StringBuilder sw
- protected java.lang.StringBuilder tagNameSB
- protected TagSet tagSet: The tag set to use.
- protected static int tokenMaximumLength: The maximum length of a token in the check method.
-
Constructor Summary
Constructors (Constructor | Description)
- TRECFullTokenizer(): Constructs an instance of the TRECFullTokenizer.
- TRECFullTokenizer(java.io.BufferedReader _br): Constructs an instance of the TRECFullTokenizer, given the buffered reader.
- TRECFullTokenizer(TagSet _tagSet, TagSet _exactSet): Constructs an instance of the TRECFullTokenizer with non-default tags.
- TRECFullTokenizer(TagSet _ts, TagSet _exactSet, java.io.BufferedReader _br): Constructs an instance of the TRECFullTokenizer with non-default tags and a given buffered reader.
-
Method Summary
All Methods | Instance Methods | Concrete Methods
Methods (Modifier and Type | Method | Description)
- protected java.lang.String check(java.lang.String s): A restricted check function for discarding uncommon, or 'strange' terms.
- void close(): Closes the buffered reader associated with the tokenizer.
- void closeBufferedReader(): Closes the buffered reader associated with the tokenizer.
- java.lang.String currentTag(): Returns the name of the tag the tokenizer is currently in.
- long getByteOffset(): Returns the number of bytes read from the current file.
- boolean inDocnoTag(): Indicates whether the tokenizer is in the special document number tag.
- boolean inTagToProcess(): Returns true if the given tag is to be processed.
- boolean inTagToSkip(): Returns true if the given tag is to be skipped.
- boolean isEndOfDocument(): Returns true if the end of document is encountered.
- boolean isEndOfFile(): Returns true if the end of file is encountered.
- void nextDocument(): Proceed to the next document.
- java.lang.String nextToken(): Returns the next token from the current chunk of text, extracted from the document into a TokenStream.
- protected void processEndOfTag(java.lang.String tag): The encountered tag, which must be a closing tag, is matched with the tag on the stack.
- void setIgnoreMissingClosingTags(boolean toIgnore): Sets the value of ignoreMissingClosingTags.
- void setInput(java.io.BufferedReader _br): Sets the input of the tokenizer.
-
-
-
Field Detail
-
logger
protected static final org.slf4j.Logger logger
-
ignoreMissingClosingTags
protected boolean ignoreMissingClosingTags
An option to ignore missing closing tags. Used for the query files.
-
lastChar
public static int lastChar
The last character read.
-
number_of_terms
public int number_of_terms
A counter for the number of terms.
-
EOF
public boolean EOF
The end of file from the buffered reader.
-
EOD
public boolean EOD
The end of document.
-
error
public boolean error
A flag which is set when errors are encountered.
-
br
public java.io.BufferedReader br
The input reader.
-
counter
public long counter
The number of bytes read from the input.
-
stk
protected static java.util.Stack<java.lang.String> stk
The stack where the tags are pushed and popped accordingly.
-
tagSet
protected TagSet tagSet
The tag set to use.
-
exactTagSet
protected TagSet exactTagSet
The set of exact tags.
-
tokenMaximumLength
protected static final int tokenMaximumLength
The maximum length of a token in the check method.
-
lowercase
protected static final boolean lowercase
Transform to lowercase or not?
-
inTagToProcess
public boolean inTagToProcess
Is in tag to process?
-
inTagToSkip
public boolean inTagToSkip
Is in tag to skip?
-
inDocnoTag
public boolean inDocnoTag
Is in docno tag?
-
sw
protected final java.lang.StringBuilder sw
-
tagNameSB
protected final java.lang.StringBuilder tagNameSB
-
-
Constructor Detail
-
TRECFullTokenizer
public TRECFullTokenizer()
Constructs an instance of the TRECFullTokenizer. The tags used are TagSet.TREC_DOC_TAGS and TagSet.TREC_EXACT_DOC_TAGS.
-
TRECFullTokenizer
public TRECFullTokenizer(java.io.BufferedReader _br)
Constructs an instance of the TRECFullTokenizer, given the buffered reader. The tags used are TagSet.TREC_DOC_TAGS and TagSet.TREC_EXACT_DOC_TAGS.
- Parameters:
_br - java.io.BufferedReader the input stream to tokenize.
-
TRECFullTokenizer
public TRECFullTokenizer(TagSet _tagSet, TagSet _exactSet)
Constructs an instance of the TRECFullTokenizer with non-default tags.
- Parameters:
_tagSet - TagSet the document tags to process.
_exactSet - TagSet the document tags to process exactly, without applying strict checks.
-
TRECFullTokenizer
public TRECFullTokenizer(TagSet _ts, TagSet _exactSet, java.io.BufferedReader _br)
Constructs an instance of the TRECFullTokenizer with non-default tags and a given buffered reader.
- Parameters:
_ts - TagSet the document tags to process.
_exactSet - TagSet the document tags to process exactly, without applying strict checks.
_br - java.io.BufferedReader the input to tokenize.
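A sketch of constructing the tokenizer with non-default tags for a topic file. It assumes that the TagSet class referenced above offers a String-prefix constructor and the TREC_QUERY_TAGS / EMPTY_TAGS constants, and that the file name is illustrative; check the TagSet Javadoc for the exact API.

  import java.io.BufferedReader;
  import java.io.FileReader;
  import org.terrier.indexing.TRECFullTokenizer;
  import org.terrier.utility.TagSet;

  public class TopicTokenizerSetup {
      public static void main(String[] args) throws Exception {
          // "topics.401-450" is a hypothetical TREC topic file.
          BufferedReader br = new BufferedReader(new FileReader("topics.401-450"));
          TRECFullTokenizer tokenizer = new TRECFullTokenizer(
                  new TagSet(TagSet.TREC_QUERY_TAGS),  // assumed: tags to process for topic files
                  new TagSet(TagSet.EMPTY_TAGS),       // assumed: no tags processed exactly
                  br);
          // Topic files frequently omit closing tags, hence this option.
          tokenizer.setIgnoreMissingClosingTags(true);
          // Tokenization then proceeds as in the loop sketched in the class description above.
      }
  }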
-
-
Method Detail
-
check
protected java.lang.String check(java.lang.String s)
A restricted check function for discarding uncommon, or 'strange' terms.
- Parameters:
s - The term to check.
- Returns:
- the term if it passed the check, otherwise null.
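Since check(java.lang.String) is protected, a subclass could relax or tighten this filter. A minimal hypothetical sketch, reusing the protected tokenMaximumLength field documented above; the class name and policy are illustrative only.

  import org.terrier.indexing.TRECFullTokenizer;

  // Hypothetical subclass: only enforces the maximum token length,
  // deferring any further filtering of 'strange' terms to later stages.
  public class LenientTRECTokenizer extends TRECFullTokenizer {
      @Override
      protected String check(String s) {
          String t = s.trim();
          // Reject empty terms and terms longer than the documented maximum token length.
          return (t.length() == 0 || t.length() > tokenMaximumLength) ? null : t;
      }
  }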
-
close
public void close()
Closes the buffered reader associated with the tokenizer.
-
closeBufferedReader
public void closeBufferedReader()
Closes the buffered reader associated with the tokenizer.
-
currentTag
public java.lang.String currentTag()
Returns the name of the tag the tokenizer is currently in.
- Specified by:
currentTag in interface Tokenizer
- Returns:
- the name of the tag the tokenizer is currently in
-
inDocnoTag
public boolean inDocnoTag()
Indicates whether the tokenizer is in the special document number tag.
- Specified by:
inDocnoTag in interface Tokenizer
- Returns:
- true if the tokenizer is in the document number tag.
-
inTagToProcess
public boolean inTagToProcess()
Returns true if the given tag is to be processed.
- Specified by:
inTagToProcess in interface Tokenizer
- Returns:
- true if the tag is to be processed, otherwise false.
-
inTagToSkip
public boolean inTagToSkip()
Returns true if the given tag is to be skipped.
- Specified by:
inTagToSkip in interface Tokenizer
- Returns:
- true if the tag is to be skipped, otherwise false.
-
isEndOfDocument
public boolean isEndOfDocument()
Returns true if the end of document is encountered.
- Specified by:
isEndOfDocument in interface Tokenizer
- Returns:
- true if the end of document is encountered.
-
isEndOfFile
public boolean isEndOfFile()
Returns true if the end of file is encountered.
- Specified by:
isEndOfFile in interface Tokenizer
- Returns:
- true if the end of file is encountered.
-
nextDocument
public void nextDocument()
Proceed to the next document.
- Specified by:
nextDocument in interface Tokenizer
-
nextToken
public java.lang.String nextToken()
Returns the next token from the current chunk of text, extracted from the document into a TokenStream.
-
processEndOfTag
protected void processEndOfTag(java.lang.String tag)
The encountered tag, which must be a closing tag, is matched with the tag on the stack. If they are not the same, then consistency is restored by popping the tags in the stack, the observed tag included. If the stack becomes empty after that, then the end of document (EOD) is set to true.
- Parameters:
tag - The closing tag to be tested against the content of the stack.
-
setIgnoreMissingClosingTags
public void setIgnoreMissingClosingTags(boolean toIgnore)
Sets the value of ignoreMissingClosingTags.
- Parameters:
toIgnore - boolean, whether to ignore missing closing tags.
-
getByteOffset
public long getByteOffset()
Returns the number of bytes read from the current file.
- Specified by:
getByteOffset in interface Tokenizer
- Returns:
- long the byte offset
-
-