public abstract class Tokeniser extends Object
Document
implementations.
Available tokenisers
Terrier ships with two tokenisers: EnglishTokeniser (the default), which accepts only A-Z, a-z and 0-9 as valid characters and treats every other character as a token boundary, and UTFTokeniser. The tokeniser to use can be specified with the tokeniser property.
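The boundary rule described for EnglishTokeniser can be sketched in plain Java. This is a minimal illustration of "only A-Z, a-z and 0-9 are valid, everything else is a boundary"; it is not Terrier's source, and the class and method names (AsciiBoundarySketch, isTokenChar, split) are hypothetical. Any further processing the real tokeniser may apply is not modelled.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in only; not Terrier's EnglishTokeniser.
final class AsciiBoundarySketch {

    // True only for the characters listed as valid: A-Z, a-z and 0-9.
    static boolean isTokenChar(char c) {
        return (c >= 'A' && c <= 'Z')
            || (c >= 'a' && c <= 'z')
            || (c >= '0' && c <= '9');
    }

    // Splits the text at every character that is not a valid token character.
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                // Boundary reached: emit the token collected so far.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("This is a block of text."));
        // The accented character is not in A-Z, a-z or 0-9, so it ends the token.
        System.out.println(split("caf\u00e9 2015"));
    }
}
```

Under this rule accented and other non-ASCII letters split words, which is presumably why a separate UTFTokeniser is provided for other scripts.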
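Properties:
tokeniser - name of the Tokeniser class to instantiate, for example EnglishTokeniser (the default) or UTFTokeniser.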
Example:
Tokeniser tokeniser = Tokeniser.getTokeniser();
TokenStream toks = tokeniser.tokenise(new StringReader("This is a block of text."));
while (toks.hasNext())
{
    System.out.println(toks.next());
}
See Also:
TokenStream, EnglishTokeniser, UTFTokeniser

| Modifier and Type | Field and Description |
|---|---|
| static TokenStream | EMPTY_STREAM - empty stream |

| Constructor and Description |
|---|
| Tokeniser() |

| Modifier and Type | Method and Description |
|---|---|
| static Tokeniser | getTokeniser() - Instantiates the Tokeniser class named in the tokeniser property. |
| String[] | getTokens(Reader reader) - Utility method which returns all of the tokens for a given stream. |
| abstract TokenStream | tokenise(Reader reader) - Tokenises the text obtained from the specified reader. |
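
The relationship between the abstract tokenise(Reader) and the utility getTokens(Reader) can be sketched as follows. This is a self-contained stand-in and not Terrier's code: SketchTokeniser, WhitespaceSketchTokeniser and GetTokensDemo are hypothetical names, Iterator<String> stands in for TokenStream, and whitespace splitting is used only to keep the subclass short.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Scanner;

// Hypothetical stand-in for the abstract-class shape; not Terrier's Tokeniser.
abstract class SketchTokeniser {

    // Mirrors tokenise(Reader); Iterator<String> stands in for TokenStream.
    abstract Iterator<String> tokenise(Reader reader);

    // Mirrors getTokens(Reader): drains the stream into an array.
    String[] getTokens(Reader reader) throws IOException {
        List<String> tokens = new ArrayList<>();
        Iterator<String> stream = tokenise(reader);
        while (stream.hasNext()) {
            tokens.add(stream.next());
        }
        return tokens.toArray(new String[0]);
    }
}

// Assumption: a whitespace-splitting subclass, chosen only for brevity.
final class WhitespaceSketchTokeniser extends SketchTokeniser {
    @Override
    Iterator<String> tokenise(Reader reader) {
        // java.util.Scanner implements Iterator<String> and splits on whitespace by default.
        return new Scanner(reader);
    }
}

final class GetTokensDemo {
    public static void main(String[] args) throws IOException {
        SketchTokeniser tokeniser = new WhitespaceSketchTokeniser();
        String[] tokens = tokeniser.getTokens(new StringReader("This is a block of text."));
        System.out.println(String.join("|", tokens));
    }
}
```

With this shape a subclass only has to supply tokenise(Reader); callers that want an array rather than a stream use the inherited utility method.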
public static final TokenStream EMPTY_STREAM
public static Tokeniser getTokeniser()
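A property-driven factory of the kind getTokeniser() describes can be sketched with reflection. This is an assumption-laden illustration and not Terrier's implementation: it reads the property from a java.util.Properties object, resolves the value against nested stand-in classes, and all names apart from the tokeniser property key (PropertyFactorySketch, SketchTokeniser, EnglishSketch, UtfSketch, fromProperty) are hypothetical. How Terrier itself stores properties and resolves class names is not stated here.

```java
import java.util.Properties;

// Illustrative only; not Terrier's getTokeniser().
final class PropertyFactorySketch {

    // Stand-in for the Tokeniser type.
    interface SketchTokeniser {
        String name();
    }

    public static final class EnglishSketch implements SketchTokeniser {
        public EnglishSketch() {}
        public String name() { return "english"; }
    }

    public static final class UtfSketch implements SketchTokeniser {
        public UtfSketch() {}
        public String name() { return "utf"; }
    }

    // Instantiates the class named by the "tokeniser" property,
    // falling back to EnglishSketch when the property is unset.
    static SketchTokeniser fromProperty(Properties props) {
        String simpleName = props.getProperty("tokeniser", "EnglishSketch");
        // Assumption: names refer to classes nested in this sketch.
        String className = PropertyFactorySketch.class.getName() + "$" + simpleName;
        try {
            return (SketchTokeniser) Class.forName(className)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot instantiate tokeniser: " + className, e);
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println(fromProperty(props).name());
        props.setProperty("tokeniser", "UtfSketch");
        System.out.println(fromProperty(props).name());
    }
}
```

Failing fast with an exception when the named class cannot be loaded is a design choice of this sketch; the behaviour of the real method for an unknown class name is not documented on this page.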
public abstract TokenStream tokenise(Reader reader)
Parameters:
reader - Stream of text to be tokenised
public String[] getTokens(Reader reader) throws IOException
Parameters:
reader - Stream of text to be tokenised
Throws:
IOException

Terrier Information Retrieval Platform 4.1. Copyright © 2004-2015, University of Glasgow