Terrier IR Platform 2.2.1 |
java.lang.Object uk.ac.gla.terrier.indexing.TRECFullTokenizer
public class TRECFullTokenizer
This class is the tokenizer used for indexing TREC topic files. It can be used for tokenizing other topic file formats, provided that the tags to skip and to process are specified accordingly.
NB: This class only accepts A-Z, a-z and 0-9 as valid characters for query terms. If this restriction is too tight, use TRECFullUTFTokenizer instead.
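To make the character restriction concrete, here is a minimal sketch of splitting text into terms made only of A-Z, a-z and 0-9. It is not Terrier's implementation: the class name `AsciiTermSketch`, its methods, and the choice to treat every other character as a term boundary are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Terrier.
final class AsciiTermSketch {

    // True for the characters the note above lists as valid: A-Z, a-z, 0-9.
    static boolean isAccepted(char c) {
        return (c >= 'A' && c <= 'Z')
            || (c >= 'a' && c <= 'z')
            || (c >= '0' && c <= '9');
    }

    // Splits text into runs of accepted characters.
    // Assumption: any other character ends the current term.
    static List<String> split(String text) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isAccepted(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                terms.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            terms.add(current.toString());
        }
        return terms;
    }

    public static void main(String[] args) {
        // \u00e9 is an accented "e", which falls outside the accepted ranges.
        System.out.println(split("caf\u00e9 2008 IR-systems"));
        // prints [caf, 2008, IR, systems]
    }
}
```

Under these assumptions the accented character is lost from the term, which is the kind of input where the UTF variant of the tokenizer is the better fit.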
See Also: TagSet
Field Summary | |
---|---|
java.io.BufferedReader |
br
The input reader. |
long |
counter
The number of bytes read from the input. |
boolean |
EOD
The end of document. |
boolean |
EOF
The end of file from the buffered reader. |
boolean |
error
A flag which is set when errors are encountered. |
boolean |
inDocnoTag
Is in docno tag? |
boolean |
inTagToProcess
Is in tag to process? |
boolean |
inTagToSkip
Is in tag to skip? |
static int |
lastChar
The last character read. |
int |
number_of_terms
A counter for the number of terms. |
Constructor Summary | |
---|---|
TRECFullTokenizer()
Constructs an instance of the TRECFullTokenizer. |
|
TRECFullTokenizer(java.io.BufferedReader br)
Constructs an instance of the TRECFullTokenizer, given the buffered reader. |
|
TRECFullTokenizer(TagSet _tagSet,
TagSet _exactSet)
Constructs an instance of the TRECFullTokenizer with non-default tags. |
|
TRECFullTokenizer(TagSet _ts,
TagSet _exactSet,
java.io.BufferedReader br)
Constructs an instance of the TRECFullTokenizer with non-default tags and a given buffered reader. |
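The constructors above take tag sets that decide which tags are processed, and the class exposes inTagToProcess, inTagToSkip and inDocnoTag flags. The sketch below shows one way such a classification could work. It does not use Terrier's TagSet class, whose API is not described on this page: `TagRulesSketch`, its `Kind` enum, the case folding, and the precedence order are all hypothetical. The tag names are common TREC topic tags, used here only as sample data.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for tag rules; not Terrier's TagSet.
final class TagRulesSketch {

    enum Kind { PROCESS, SKIP, DOCNO, UNKNOWN }

    private final Set<String> process; // upper-case names of tags to process
    private final Set<String> skip;    // upper-case names of tags to skip
    private final String docno;        // upper-case name of the document number tag

    TagRulesSketch(Set<String> process, Set<String> skip, String docno) {
        this.process = process;
        this.skip = skip;
        this.docno = docno;
    }

    // Assumption: tag names are compared case-insensitively, and the
    // document number tag takes precedence over the skip and process sets.
    Kind classify(String tagName) {
        String t = tagName.toUpperCase(Locale.ROOT);
        if (t.equals(docno)) return Kind.DOCNO;
        if (skip.contains(t)) return Kind.SKIP;
        if (process.contains(t)) return Kind.PROCESS;
        return Kind.UNKNOWN;
    }

    public static void main(String[] args) {
        TagRulesSketch rules = new TagRulesSketch(
            new HashSet<>(Arrays.asList("TITLE", "DESC")),
            new HashSet<>(Arrays.asList("NARR")),
            "NUM");
        for (String tag : new String[] {"num", "title", "narr", "top"}) {
            System.out.println(tag + " -> " + rules.classify(tag));
        }
    }
}
```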
Method Summary | |
---|---|
void |
close()
Closes the buffered reader associated with the tokenizer. |
void |
closeBufferedReader()
Closes the buffered reader associated with the tokenizer. |
java.lang.String |
currentTag()
Returns the name of the tag the tokenizer is currently in. |
long |
getByteOffset()
Returns the number of bytes read from the current file. |
boolean |
inDocnoTag()
Indicates whether the tokenizer is in the special document number tag. |
boolean |
inTagToProcess()
Returns true if the tokenizer is currently in a tag to be processed. |
boolean |
inTagToSkip()
Returns true if the tokenizer is currently in a tag to be skipped. |
boolean |
isEndOfDocument()
Returns true if the end of document is encountered. |
boolean |
isEndOfFile()
Returns true if the end of file is encountered. |
void |
nextDocument()
Proceed to the next document. |
java.lang.String |
nextToken()
Returns the next string in the input that is not a tag. |
void |
setIgnoreMissingClosingTags(boolean toIgnore)
Sets the value of the ignoreMissingClosingTags flag. |
void |
setInput(java.io.BufferedReader _br)
Sets the input of the tokenizer. |
Methods inherited from class java.lang.Object |
---|
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static int lastChar
public int number_of_terms
public boolean EOF
public boolean EOD
public boolean error
public java.io.BufferedReader br
public long counter
public boolean inTagToProcess
public boolean inTagToSkip
public boolean inDocnoTag
Constructor Detail |
---|
public TRECFullTokenizer()
public TRECFullTokenizer(java.io.BufferedReader br)
br
- java.io.BufferedReader the input stream to tokenize
public TRECFullTokenizer(TagSet _tagSet, TagSet _exactSet)
_tagSet
- TagSet the document tags to process.
_exactSet
- TagSet the document tags to process exactly, without applying strict checks.
public TRECFullTokenizer(TagSet _ts, TagSet _exactSet, java.io.BufferedReader br)
_ts
- TagSet the document tags to process.
_exactSet
- TagSet the document tags to process exactly, without applying strict checks.
br
- java.io.BufferedReader the input to tokenize.
Method Detail |
---|
public void close()
public void closeBufferedReader()
public java.lang.String currentTag()
currentTag
in interface Tokenizer
public boolean inDocnoTag()
inDocnoTag
in interface Tokenizer
public boolean inTagToProcess()
inTagToProcess
in interface Tokenizer
public boolean inTagToSkip()
inTagToSkip
in interface Tokenizer
public boolean isEndOfDocument()
isEndOfDocument
in interface Tokenizer
public boolean isEndOfFile()
isEndOfFile
in interface Tokenizer
public void nextDocument()
nextDocument
in interface Tokenizer
public java.lang.String nextToken()
nextToken
in interface Tokenizer
public void setIgnoreMissingClosingTags(boolean toIgnore)
toIgnore
- boolean whether to ignore missing closing tags
public long getByteOffset()
getByteOffset
in interface Tokenizer
public void setInput(java.io.BufferedReader _br)
setInput
in interface Tokenizer
_br
- BufferedReader the input stream
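The methods isEndOfFile, isEndOfDocument, nextToken and nextDocument suggest a two-level reading loop: an outer loop over documents and an inner loop over the tokens of each document. The sketch below shows that loop shape under stated assumptions. It does not use Terrier: `TokenSourceSketch` is a hypothetical interface that copies only those four method names, `InMemoryTokenSource` is a toy stand-in backed by arrays, and the null check on `nextToken()` is a defensive assumption, not documented behaviour.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical subset of the methods listed above; not the real Tokenizer interface.
interface TokenSourceSketch {
    boolean isEndOfFile();
    boolean isEndOfDocument();
    String nextToken();
    void nextDocument();
}

// Toy in-memory stand-in so the loop can run without Terrier on the classpath.
final class InMemoryTokenSource implements TokenSourceSketch {
    private final String[][] docs;
    private int doc = 0;
    private int pos = 0;

    InMemoryTokenSource(String[][] docs) { this.docs = docs; }

    public boolean isEndOfFile() { return doc >= docs.length; }
    public boolean isEndOfDocument() { return isEndOfFile() || pos >= docs[doc].length; }
    public String nextToken() { return isEndOfDocument() ? null : docs[doc][pos++]; }
    public void nextDocument() { doc++; pos = 0; }
}

final class TokenizerLoopSketch {

    // Collects the tokens of every document, one list per document.
    static List<List<String>> readAll(TokenSourceSketch t) {
        List<List<String>> out = new ArrayList<>();
        while (!t.isEndOfFile()) {
            List<String> terms = new ArrayList<>();
            while (!t.isEndOfDocument()) {
                String token = t.nextToken();
                // Assumption: a tokenizer may return null or empty strings; skip them.
                if (token != null && !token.isEmpty()) {
                    terms.add(token);
                }
            }
            out.add(terms);
            t.nextDocument();
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] docs = { {"information", "retrieval"}, {"trec", "topics"} };
        System.out.println(readAll(new InMemoryTokenSource(docs)));
        // prints [[information, retrieval], [trec, topics]]
    }
}
```

With the real class, the same loop would presumably be driven by a TRECFullTokenizer constructed over a BufferedReader, with inTagToSkip, inTagToProcess and inDocnoTag consulted after each token. That wiring is not shown here because it cannot be verified from this page alone.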