Package org.terrier.indexing
Interface Tokenizer
All Known Implementing Classes:
TRECFullTokenizer
public interface Tokenizer
The specification of the interface implemented by tokenizer classes.
Author:
Gianni Amati, Vassilis Plachouras
Method Summary
Modifier and Type    Method                                  Description
java.lang.String     currentTag()                            Returns the identifier of the tag the tokenizer is currently in.
long                 getByteOffset()                         Returns the byte offset in the current indexed file.
boolean              inDocnoTag()                            Indicates whether we are in a special document number tag.
boolean              inTagToProcess()                        Indicates whether we are in a tag to process.
boolean              inTagToSkip()                           Indicates whether we are in a tag to skip.
boolean              isEndOfDocument()                       Returns true if the end of document is encountered.
boolean              isEndOfFile()                           Returns true if the end of file is encountered.
void                 nextDocument()                          Proceeds to process the next document.
java.lang.String     nextToken()                             Returns the next token from the input stream.
void                 setInput(java.io.BufferedReader input)  Sets the input of the tokenizer.
Method Detail
-
currentTag
java.lang.String currentTag()
Returns the identifier of the tag the tokenizer is currently in.
Returns:
the name of the tag the tokenizer is processing
-
nextToken
java.lang.String nextToken()
Returns the next token from the input stream.
Returns:
the next token, or null if the end of file is encountered.
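As an illustration of how nextToken() combines with isEndOfDocument(), isEndOfFile() and nextDocument(), here is a minimal sketch of a read loop. It does not use Terrier itself: TokenSource is a hypothetical stand-in that copies only four method signatures from this page, and ListTokenSource is a toy in-memory implementation whose behaviour (null only once a document is exhausted) is an assumption. A real implementation such as TRECFullTokenizer may behave differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for the four Tokenizer methods used below. */
interface TokenSource {
    String nextToken();
    boolean isEndOfDocument();
    boolean isEndOfFile();
    void nextDocument();
}

/** Toy in-memory source: each inner list is one document. Not Terrier code. */
class ListTokenSource implements TokenSource {
    private final List<List<String>> docs;
    private int doc = 0; // index of the current document
    private int pos = 0; // index of the next token within that document

    ListTokenSource(List<List<String>> docs) { this.docs = docs; }

    public boolean isEndOfFile() { return doc >= docs.size(); }

    public boolean isEndOfDocument() {
        return isEndOfFile() || pos >= docs.get(doc).size();
    }

    public String nextToken() {
        return isEndOfDocument() ? null : docs.get(doc).get(pos++);
    }

    public void nextDocument() {
        if (!isEndOfFile()) { doc++; pos = 0; }
    }
}

class TokenLoopSketch {
    /** Collects the tokens of every document, stopping a document on null. */
    static List<List<String>> readAll(TokenSource t) {
        List<List<String>> out = new ArrayList<>();
        while (!t.isEndOfFile()) {
            List<String> current = new ArrayList<>();
            while (!t.isEndOfDocument()) {
                String token = t.nextToken();
                if (token == null) break; // documented end-of-file signal
                current.add(token);
            }
            out.add(current);
            t.nextDocument();
        }
        return out;
    }

    public static void main(String[] args) {
        TokenSource t = new ListTokenSource(Arrays.asList(
                Arrays.asList("hello", "world"),
                Arrays.asList("terrier")));
        System.out.println(readAll(t)); // prints [[hello, world], [terrier]]
    }
}
```

The null check is kept even though the inner loop already tests isEndOfDocument(), because this page documents null as the end-of-file result of nextToken().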
-
inDocnoTag
boolean inDocnoTag()
Indicates whether we are in a special document number tag.
Returns:
true if the tokenizer is in a document number tag.
-
inTagToProcess
boolean inTagToProcess()
Indicates whether we are in a tag to process.
Returns:
true if we are in a tag to process.
-
inTagToSkip
boolean inTagToSkip()
Indicates whether we are in a tag to skip.
Returns:
true if we are in a tag to skip.
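The three tag-state methods (inDocnoTag(), inTagToProcess(), inTagToSkip()) are typically consulted together when deciding what to do with a token. The sketch below shows one possible decision rule. TagState is a hypothetical stand-in for those three methods, and the precedence used (document number first, then skip, then process) is an assumption made for illustration, not behaviour documented by Terrier.

```java
/** Hypothetical stand-in exposing only the tag-state methods of Tokenizer. */
interface TagState {
    boolean inDocnoTag();
    boolean inTagToProcess();
    boolean inTagToSkip();
}

class TagFilterSketch {
    enum Action { DOCNO, INDEX, IGNORE }

    /**
     * Assumed policy, not taken from Terrier: a document number tag wins,
     * a skip tag then suppresses the token, and only tokens inside a tag
     * to process are indexed.
     */
    static Action decide(TagState s) {
        if (s.inDocnoTag()) return Action.DOCNO;
        if (s.inTagToSkip()) return Action.IGNORE;
        return s.inTagToProcess() ? Action.INDEX : Action.IGNORE;
    }

    /** Builds a fixed tag state for demonstration purposes. */
    static TagState state(boolean docno, boolean process, boolean skip) {
        return new TagState() {
            public boolean inDocnoTag() { return docno; }
            public boolean inTagToProcess() { return process; }
            public boolean inTagToSkip() { return skip; }
        };
    }

    public static void main(String[] args) {
        System.out.println(decide(state(true, false, false))); // DOCNO
        System.out.println(decide(state(false, true, false))); // INDEX
        System.out.println(decide(state(false, true, true)));  // IGNORE
    }
}
```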
-
isEndOfDocument
boolean isEndOfDocument()
Returns true if the end of document is encountered.
Returns:
true if the end of document is encountered.
-
isEndOfFile
boolean isEndOfFile()
Returns true if the end of file is encountered.
Returns:
true if the end of file is encountered.
-
nextDocument
void nextDocument()
Proceeds to process the next document.
-
getByteOffset
long getByteOffset()
Returns the byte offset in the current indexed file.
-
setInput
void setInput(java.io.BufferedReader input)
Sets the input of the tokenizer.
Parameters:
input - the BufferedReader providing the input stream to tokenize
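To show the call pattern for setInput, the following sketch wraps a string in a java.io.BufferedReader and passes it to a tokenizer before reading tokens. InputTokenizer is a hypothetical stand-in that copies only setInput and nextToken from this page, and WhitespaceTokenizer is a toy implementation written for this example. Neither is part of Terrier, and the whitespace splitting is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical stand-in for two Tokenizer methods; not the Terrier interface. */
interface InputTokenizer {
    void setInput(BufferedReader input);
    String nextToken();
}

/** Toy whitespace tokenizer, used only to show the setInput call pattern. */
class WhitespaceTokenizer implements InputTokenizer {
    private BufferedReader input;
    private final Deque<String> pending = new ArrayDeque<>();

    public void setInput(BufferedReader input) {
        this.input = input;
        pending.clear(); // drop tokens buffered from any previous input
    }

    public String nextToken() {
        try {
            // Refill the buffer line by line until a token is available.
            while (pending.isEmpty()) {
                String line = input.readLine();
                if (line == null) return null; // end of input
                for (String part : line.trim().split("\\s+")) {
                    if (!part.isEmpty()) pending.add(part);
                }
            }
            return pending.poll();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

class SetInputSketch {
    /** Feeds the text to the tokenizer via setInput and drains all tokens. */
    static List<String> tokenize(String text) {
        InputTokenizer t = new WhitespaceTokenizer();
        t.setInput(new BufferedReader(new StringReader(text)));
        List<String> tokens = new ArrayList<>();
        for (String tok = t.nextToken(); tok != null; tok = t.nextToken()) {
            tokens.add(tok);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("first line\nsecond  line"));
        // prints [first, line, second, line]
    }
}
```

For file input, the StringReader could be replaced by a reader opened with java.nio.file.Files.newBufferedReader, which already returns a BufferedReader.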