public abstract class Tokeniser extends Object
Document implementations.
Available tokenisers
Two tokenisers ship with Terrier: EnglishTokeniser and UTFTokeniser. EnglishTokeniser is the default; it accepts only A-Z, a-z and 0-9 as valid characters, and every other character causes a token boundary. The tokeniser to use can be specified with the tokeniser property.
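The character rule described for EnglishTokeniser can be illustrated with plain JDK code. The following is a minimal sketch under stated assumptions, not Terrier's implementation: the class AsciiBoundarySketch and its methods are hypothetical names, and any further processing the real class may perform is not modelled.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: mimics the A-Z, a-z, 0-9 rule; not part of Terrier. */
final class AsciiBoundarySketch {

    /** Returns true only for the characters A-Z, a-z and 0-9. */
    static boolean isTokenChar(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
    }

    /** Splits the text wherever a character outside A-Z, a-z, 0-9 occurs. */
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                // Any other character closes the token being built.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString()); // flush the final token
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [This, is, a, block, of, text]
        System.out.println(split("This is a block of text."));
    }
}
```

Under this rule an accented letter also acts as a boundary, so "café" would yield "caf", which is one reason to choose a Unicode-aware tokeniser for non-English text.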
Properties:
tokeniser - name of the Tokeniser class to use. Defaults to EnglishTokeniser.
Example:
Tokeniser tokeniser = Tokeniser.getTokeniser();
TokenStream toks = tokeniser.tokenise(new StringReader("This is a block of text."));
while (toks.hasNext()) {
    System.out.println(toks.next());
}
See Also: TokenStream, EnglishTokeniser, UTFTokeniser
Modifier and Type | Field and Description |
---|---|
static TokenStream | EMPTY_STREAM: empty stream |
Constructor and Description |
---|
Tokeniser() |
Modifier and Type | Method and Description |
---|---|
static Tokeniser | getTokeniser(): Instantiates Tokeniser class named in the tokeniser property. |
String[] | getTokens(Reader reader): Utility method which returns all of the tokens for a given stream. |
abstract TokenStream | tokenise(Reader reader): Tokenises the text obtained from the specified reader. |
public static final TokenStream EMPTY_STREAM
public static Tokeniser getTokeniser()
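Instantiating a class named in a property is commonly done with reflection. The sketch below shows that general pattern using only the JDK; it is an assumption about the technique, not a description of how Terrier reads its configuration. PropertyInstantiationSketch, newInstanceFromProperty and the use of java.util.Properties are hypothetical stand-ins.

```java
import java.util.Properties;

/** Illustrative only: generic "class name in a property" pattern; not part of Terrier. */
final class PropertyInstantiationSketch {

    /**
     * Looks up a class name under the given key, falls back to a default name,
     * and creates an instance through the public no-argument constructor.
     */
    static Object newInstanceFromProperty(Properties props, String key, String defaultClassName) {
        String className = props.getProperty(key, defaultClassName);
        try {
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            // Covers a missing class, a missing constructor and a failed construction.
            throw new IllegalArgumentException("Cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Stand-in class name; a real configuration would name a Tokeniser subclass.
        props.setProperty("tokeniser", "java.util.ArrayList");
        Object instance = newInstanceFromProperty(props, "tokeniser", "java.lang.StringBuilder");
        System.out.println(instance.getClass().getName()); // prints java.util.ArrayList
    }
}
```

With this pattern, a class named in the property needs a public no-argument constructor, which is consistent with the Tokeniser() constructor listed above.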
public abstract TokenStream tokenise(Reader reader)
Parameters: reader - Stream of text to be tokenised
public String[] getTokens(Reader reader) throws IOException
Parameters: reader - Stream of text to be tokenised
Throws: IOException
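A utility such as getTokens amounts to draining a token stream into an array. The following is a minimal sketch of that idea, using java.util.Iterator as a stand-in for TokenStream so that it runs without Terrier. DrainSketch and toArray are hypothetical names, and skipping null entries is a defensive assumption rather than documented behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Illustrative only: collects every token from an iterator; not part of Terrier. */
final class DrainSketch {

    /** Consumes the iterator and returns its non-null elements in order. */
    static String[] toArray(Iterator<String> tokens) {
        List<String> collected = new ArrayList<>();
        while (tokens.hasNext()) {
            String token = tokens.next();
            if (token != null) { // assumption: a stream may yield null for a rejected token
                collected.add(token);
            }
        }
        return collected.toArray(new String[0]);
    }

    public static void main(String[] args) {
        Iterator<String> stream = Arrays.asList("This", "is", null, "text").iterator();
        System.out.println(Arrays.toString(toArray(stream))); // prints [This, is, text]
    }
}
```

Collecting everything into an array is convenient for short inputs; for large documents, iterating over the stream directly avoids holding all tokens in memory at once.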
Terrier Information Retrieval Platform 4.1. Copyright © 2004-2015, University of Glasgow