org.terrier.indexing
Interface Tokenizer

All Known Implementing Classes:
TRECFullTokenizer, TRECFullUTFTokenizer

public interface Tokenizer

Specifies the interface implemented by tokenizer classes.

Author:
Gianni Amati, Vassilis Plachouras

Method Summary
 java.lang.String currentTag()
          Returns the identifier of the tag the tokenizer is currently in.
 long getByteOffset()
          Returns the byte offset in the current indexed file.
 boolean inDocnoTag()
          Indicates whether we are in a special document number tag.
 boolean inTagToProcess()
          Indicates whether we are in a tag to process.
 boolean inTagToSkip()
          Indicates whether we are in a tag to skip.
 boolean isEndOfDocument()
          Returns true if the end of document is encountered.
 boolean isEndOfFile()
          Returns true if the end of file is encountered.
 void nextDocument()
          Proceed to process the next document.
 java.lang.String nextToken()
          Returns the next token from the input stream used.
 void setInput(java.io.BufferedReader input)
          Sets the input of the tokenizer.
 

Method Detail

currentTag

java.lang.String currentTag()
Returns the identifier of the tag the tokenizer is currently in.

Returns:
the name of the tag the tokenizer is processing

nextToken

java.lang.String nextToken()
Returns the next token from the input stream used.

Returns:
the next token, or null if the end of file is encountered.
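
The documented contract (a token, or null at end of file) suggests a simple read loop. The following is a minimal sketch under stated assumptions: MiniTokenizer and WhitespaceMiniTokenizer are hypothetical stand-ins written for this example, not Terrier classes, and a real implementation may return null in other situations, in which case isEndOfFile() should be consulted as well.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical cut-down stand-in for the interface; only the methods this loop needs. */
interface MiniTokenizer {
    String nextToken();
    boolean isEndOfFile();
}

/** Toy implementation (an assumption, not a Terrier class): splits a string on whitespace. */
class WhitespaceMiniTokenizer implements MiniTokenizer {
    private final String[] tokens;
    private int pos = 0;

    WhitespaceMiniTokenizer(String text) {
        String trimmed = text.trim();
        tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    /** Returns the next token, or null once the input is exhausted. */
    public String nextToken() {
        return pos < tokens.length ? tokens[pos++] : null;
    }

    public boolean isEndOfFile() {
        return pos >= tokens.length;
    }
}

class TokenLoopSketch {
    /** Drains the tokenizer, relying on the documented null-at-end-of-file contract. */
    static List<String> readAll(MiniTokenizer t) {
        List<String> out = new ArrayList<>();
        String tok;
        while ((tok = t.nextToken()) != null) {
            out.add(tok);
        }
        return out;
    }

    public static void main(String[] args) {
        MiniTokenizer t = new WhitespaceMiniTokenizer("terrier indexes documents");
        System.out.println(readAll(t)); // prints [terrier, indexes, documents]
    }
}
```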

inDocnoTag

boolean inDocnoTag()
Indicates whether we are in a special document number tag.

Returns:
true if the tokenizer is in a document number tag.

inTagToProcess

boolean inTagToProcess()
Indicates whether we are in a tag to process.

Returns:
true if we are in a tag to process.

inTagToSkip

boolean inTagToSkip()
Indicates whether we are in a tag to skip.

Returns:
true if we are in a tag to skip.

isEndOfDocument

boolean isEndOfDocument()
Returns true if the end of document is encountered.

Returns:
true if the end of document is encountered.

isEndOfFile

boolean isEndOfFile()
Returns true if the end of file is encountered.

Returns:
true if the end of file is encountered.

nextDocument

void nextDocument()
Proceed to process the next document.
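
One plausible way to combine isEndOfFile(), isEndOfDocument(), nextToken() and nextDocument() is a nested loop: the outer loop walks documents, the inner loop walks the tokens of one document. The sketch below assumes that reading; DocTokenizer and LinePerDocTokenizer are hypothetical names introduced for illustration (one line of input is treated as one document), not part of Terrier.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in exposing only the four methods used by the loop. */
interface DocTokenizer {
    String nextToken();
    boolean isEndOfDocument();
    boolean isEndOfFile();
    void nextDocument();
}

/** Toy implementation (an assumption): each non-blank line of the text is one document. */
class LinePerDocTokenizer implements DocTokenizer {
    private final List<String[]> docs = new ArrayList<>();
    private int docIdx = 0;
    private int tokIdx = 0;

    LinePerDocTokenizer(String text) {
        for (String line : text.split("\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                docs.add(trimmed.split("\\s+"));
            }
        }
    }

    public boolean isEndOfFile() {
        return docIdx >= docs.size();
    }

    public boolean isEndOfDocument() {
        return isEndOfFile() || tokIdx >= docs.get(docIdx).length;
    }

    public String nextToken() {
        return isEndOfDocument() ? null : docs.get(docIdx)[tokIdx++];
    }

    public void nextDocument() {
        if (!isEndOfFile()) {
            docIdx++;
            tokIdx = 0;
        }
    }
}

class DocumentLoopSketch {
    /** Collects the tokens of every document, one inner list per document. */
    static List<List<String>> readDocuments(DocTokenizer t) {
        List<List<String>> result = new ArrayList<>();
        while (!t.isEndOfFile()) {
            List<String> doc = new ArrayList<>();
            while (!t.isEndOfDocument()) {
                String tok = t.nextToken();
                if (tok != null) { // tolerate null tokens inside a document
                    doc.add(tok);
                }
            }
            result.add(doc);
            t.nextDocument(); // move on once the current document is exhausted
        }
        return result;
    }

    public static void main(String[] args) {
        DocTokenizer t = new LinePerDocTokenizer("a b\nc d e\n");
        System.out.println(readDocuments(t)); // prints [[a, b], [c, d, e]]
    }
}
```

The null check inside the inner loop is defensive: it keeps the loop correct even if an implementation returns null for input it chooses not to emit as a token.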


getByteOffset

long getByteOffset()
Returns the byte offset in the current indexed file.


setInput

void setInput(java.io.BufferedReader input)
Sets the input of the tokenizer.

Parameters:
input - the BufferedReader supplying the input stream to tokenize
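
Because setInput takes a java.io.BufferedReader, one tokenizer instance can in principle be pointed at several sources in turn. The sketch below illustrates that idea only; ReusableTokenizer and LineBufferedTokenizer are hypothetical names for this example, and whether a given Terrier implementation resets its internal state on setInput is an assumption that should be checked against that implementation.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in with only the two methods used here. */
interface ReusableTokenizer {
    void setInput(BufferedReader input);
    String nextToken();
}

/** Toy implementation (an assumption): reads lines lazily and splits them on whitespace. */
class LineBufferedTokenizer implements ReusableTokenizer {
    private BufferedReader input; // must be set via setInput before nextToken is called
    private String[] pending = new String[0];
    private int pos = 0;

    public void setInput(BufferedReader input) {
        this.input = input;
        this.pending = new String[0]; // drop tokens buffered from the previous input
        this.pos = 0;
    }

    public String nextToken() {
        try {
            while (pos >= pending.length) {
                String line = input.readLine();
                if (line == null) {
                    return null; // end of the current input
                }
                String trimmed = line.trim();
                pending = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
                pos = 0;
            }
            return pending[pos++];
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

class SetInputSketch {
    /** Feeds each source to the same tokenizer instance and collects all tokens in order. */
    static List<String> tokenizeAll(ReusableTokenizer t, String... sources) {
        List<String> out = new ArrayList<>();
        for (String source : sources) {
            t.setInput(new BufferedReader(new StringReader(source)));
            String tok;
            while ((tok = t.nextToken()) != null) {
                out.add(tok);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenizeAll(new LineBufferedTokenizer(), "a b", "c")); // prints [a, b, c]
    }
}
```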


Terrier 3.5. Copyright © 2004-2011 University of Glasgow