Tokenizer (Terrier 3.5 API)

org.terrier.indexing
Interface Tokenizer

All Known Implementing Classes: TRECFullTokenizer, TRECFullUTFTokenizer

public interface Tokenizer

Specifies the interface that tokenizer classes implement.

Author: Gianni Amati, Vassilis Plachouras

Method Summary
`java.lang.String`	`currentTag()` Returns the identifier of the tag the tokenizer is currently in.
`long`	`getByteOffset()` Returns the byte offset in the current indexed file.
`boolean`	`inDocnoTag()` Indicates whether we are in a special document number tag.
`boolean`	`inTagToProcess()` Indicates whether we are in a tag to process.
`boolean`	`inTagToSkip()` Indicates whether we are in a tag to skip.
`boolean`	`isEndOfDocument()` Returns true if the end of document is encountered.
`boolean`	`isEndOfFile()` Returns true if the end of file is encountered.
`void`	`nextDocument()` Proceeds to the next document.
`java.lang.String`	`nextToken()` Returns the next token from the input stream used.
`void`	`setInput(java.io.BufferedReader input)` Sets the input of the tokenizer.

Method Detail

currentTag

java.lang.String currentTag()

Returns the identifier of the tag the tokenizer is currently in.

Returns: the name of the tag the tokenizer is processing

nextToken

java.lang.String nextToken()

Returns the next token from the input stream used.

Returns: the next token, or null if the end of file is encountered.

inDocnoTag

boolean inDocnoTag()

Indicates whether we are in a special document number tag.

Returns: true if the tokenizer is in a document number tag.

inTagToProcess

boolean inTagToProcess()

Indicates whether we are in a tag to process.

Returns: true if we are in a tag to process.

inTagToSkip

boolean inTagToSkip()

Indicates whether we are in a tag to skip.

Returns: true if we are in a tag to skip.
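The tag predicates can be combined to decide which tokens to keep. The following is a minimal sketch, not Terrier code: `TagAwareTokenizer` is a local stand-in for a subset of the documented methods, `ArrayTokenizer` is a hypothetical toy implementation, and the tag names are invented. It also assumes that the predicates describe the token most recently returned by `nextToken()`, which this page does not state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TagFilterSketch {

    /** Local stand-in for the tag-related subset of the documented interface. */
    interface TagAwareTokenizer {
        String nextToken();
        String currentTag();
        boolean inTagToProcess();
        boolean inTagToSkip();
        boolean isEndOfFile();
    }

    /** Hypothetical toy implementation backed by pre-tagged {tag, token} pairs. */
    static class ArrayTokenizer implements TagAwareTokenizer {
        private final String[][] pairs;
        private final Set<String> process;
        private final Set<String> skip;
        private int next = 0;      // index of the pair nextToken() returns next
        private String tag = null; // tag of the most recently returned token

        ArrayTokenizer(String[][] pairs, Set<String> process, Set<String> skip) {
            this.pairs = pairs;
            this.process = process;
            this.skip = skip;
        }

        public String nextToken() {
            if (next >= pairs.length) {
                return null; // end of input
            }
            tag = pairs[next][0];
            return pairs[next++][1];
        }

        public String currentTag() { return tag; }
        public boolean inTagToProcess() { return tag != null && process.contains(tag); }
        public boolean inTagToSkip() { return tag != null && skip.contains(tag); }
        public boolean isEndOfFile() { return next >= pairs.length; }
    }

    /** Keeps only tokens that sit in a tag to process and not in a tag to skip. */
    static List<String> indexableTokens(TagAwareTokenizer tokenizer) {
        List<String> kept = new ArrayList<>();
        while (!tokenizer.isEndOfFile()) {
            String token = tokenizer.nextToken();
            if (token != null && tokenizer.inTagToProcess() && !tokenizer.inTagToSkip()) {
                kept.add(token);
            }
        }
        return kept;
    }

    /** Runs the filter over three invented tokens; the SCRIPT token is dropped. */
    static List<String> demo() {
        String[][] pairs = {
            {"TITLE", "terrier"}, {"SCRIPT", "var"}, {"TEXT", "retrieval"}
        };
        Set<String> process = new HashSet<>(Arrays.asList("TITLE", "TEXT"));
        Set<String> skip = new HashSet<>(Arrays.asList("SCRIPT"));
        return indexableTokens(new ArrayTokenizer(pairs, process, skip));
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [terrier, retrieval]
    }
}
```

A real implementation may report tag state differently, for example while nested inside several tags at once, so check the implementing class before relying on this ordering.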

isEndOfDocument

boolean isEndOfDocument()

Returns true if the end of document is encountered.

Returns: true if the end of document is encountered.

isEndOfFile

boolean isEndOfFile()

Returns true if the end of file is encountered.

Returns: true if the end of file is encountered.

nextDocument

void nextDocument()

Proceeds to the next document.
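Taken together, `nextToken()`, `isEndOfDocument()`, `isEndOfFile()` and `nextDocument()` suggest a nested loop over documents and tokens. The sketch below shows one way a caller might drive that loop. It is an illustration under stated assumptions: the nested `Tokenizer` interface is a local stand-in that mirrors only some of the documented methods, and `LineTokenizer` is a hypothetical toy implementation that treats each input line as one document. Neither is part of Terrier.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class TokenizerLoopSketch {

    /** Local stand-in for a subset of the documented interface (not the Terrier class). */
    interface Tokenizer {
        String nextToken();
        boolean isEndOfDocument();
        boolean isEndOfFile();
        void nextDocument();
        void setInput(BufferedReader input);
    }

    /** Hypothetical toy implementation: one document per line, whitespace-separated tokens. */
    static class LineTokenizer implements Tokenizer {
        private BufferedReader input;
        private String[] tokens = new String[0];
        private int pos = 0;
        private boolean eof = true; // true until an input is set

        public void setInput(BufferedReader input) {
            this.input = input;
            this.eof = false;
            loadLine();
        }

        /** Reads the next line and splits it into tokens, or marks end of file. */
        private void loadLine() {
            try {
                String line = input.readLine();
                pos = 0;
                if (line == null) {
                    eof = true;
                    tokens = new String[0];
                } else {
                    String trimmed = line.trim();
                    tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        public String nextToken() {
            return (eof || pos >= tokens.length) ? null : tokens[pos++];
        }

        public boolean isEndOfDocument() { return eof || pos >= tokens.length; }
        public boolean isEndOfFile() { return eof; }
        public void nextDocument() { if (!eof) loadLine(); }
    }

    /** Drains the tokenizer, grouping the tokens by document. */
    static List<List<String>> readAll(Tokenizer tokenizer) {
        List<List<String>> documents = new ArrayList<>();
        while (!tokenizer.isEndOfFile()) {
            List<String> document = new ArrayList<>();
            while (!tokenizer.isEndOfDocument()) {
                String token = tokenizer.nextToken();
                if (token != null) { // null is documented as the end-of-file signal
                    document.add(token);
                }
            }
            documents.add(document);
            tokenizer.nextDocument();
        }
        return documents;
    }

    /** Convenience wrapper: tokenizes an in-memory string with the toy tokenizer. */
    static List<List<String>> tokenize(String text) {
        Tokenizer tokenizer = new LineTokenizer();
        tokenizer.setInput(new BufferedReader(new StringReader(text)));
        return readAll(tokenizer);
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a b\nc d e\n")); // prints [[a, b], [c, d, e]]
    }
}
```

The null check on `nextToken()` follows the documented return contract. Whether a given implementation can also return null in the middle of a document is not specified here, so the check is kept defensive.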

getByteOffset

long getByteOffset()

Returns the byte offset in the current indexed file.

setInput

void setInput(java.io.BufferedReader input)

Sets the input of the tokenizer.

Parameters: input - the BufferedReader supplying the input stream to tokenize
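Because `setInput` takes a `java.io.BufferedReader`, the caller chooses how bytes are decoded into characters. The sketch below builds such a reader with an explicit charset using only standard Java I/O classes. The class and helper names are hypothetical, and UTF-8 is an assumption about the input rather than something this page specifies.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ReaderSetupSketch {

    /** Wraps a byte stream in a BufferedReader that decodes UTF-8 explicitly. */
    static BufferedReader openUtf8(InputStream in) {
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }

    /** Reads the first line of the given UTF-8 bytes, only to show that the reader works. */
    static String firstLineUtf8(byte[] data) {
        try (BufferedReader reader = openUtf8(new ByteArrayInputStream(data))) {
            // With Terrier on the classpath, the reader would instead be handed to
            // tokenizer.setInput(reader) on some Tokenizer implementation.
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "<DOC> caf\u00e9 </DOC>\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(firstLineUtf8(data));
    }
}
```

Naming the charset avoids depending on the platform default encoding, which can differ between machines.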