Class | Description |
---|---|
EnglishTokeniser | Tokenises text obtained from a text stream assuming English language. |
IdentityTokeniser | A Tokeniser implementation that returns the input as is. |
Tokeniser | A tokeniser class is responsible for tokenising a block of text. |
TokenStream | Represents a stream of tokens found by a tokeniser. |
UTFTokeniser | Tokenises text obtained from a text stream. |
UTFTwitterTokeniser | A tokeniser designed for use on tweets. |

Provides classes related to the tokenisation of documents. Tokenisers are responsible for breaking chunks of text into words to be indexed. Different tokenisers may be used for different languages. In particular, two tokenisers are provided by Terrier:
Example Code
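
    import java.io.StringReader;
    import org.terrier.indexing.tokenisation.Tokeniser;
    import org.terrier.indexing.tokenisation.TokenStream;

    // get the default tokeniser, as set by the property "tokeniser"
    Tokeniser tokeniser = Tokeniser.getTokeniser();
    String sentence = "This is a sentence.";
    TokenStream toks = tokeniser.tokenise(new StringReader(sentence));
    while (toks.hasNext()) {
        String token = toks.next();
    }
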
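
For readers without Terrier on the classpath, the idea of breaking a text stream into words can be illustrated with a small self-contained sketch. The class `SimpleTokeniserSketch` and its `tokenise` method are hypothetical names invented for this illustration and are not part of Terrier. The splitting rule (any non-alphanumeric character ends a token) and the lowercasing are assumptions made for the example, not a description of how EnglishTokeniser or UTFTokeniser behave.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration only; NOT a Terrier class.
class SimpleTokeniserSketch {

    // Reads the whole stream and returns each run of letters or digits
    // as one lowercased token (assumed rule, chosen for simplicity).
    static List<String> tokenise(Reader reader) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try {
            int c;
            while ((c = reader.read()) != -1) {
                if (Character.isLetterOrDigit(c)) {
                    current.append(Character.toLowerCase((char) c));
                } else if (current.length() > 0) {
                    // A non-alphanumeric character closes the current token.
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Flush the last token if the stream did not end on a separator.
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenise(new StringReader("This is a sentence."))) {
            System.out.println(token);
        }
    }
}
```

Run on the sentence from the example above, the sketch prints `this`, `is`, `a` and `sentence`, one per line. Unlike a TokenStream, it collects all tokens into a list up front, which keeps the illustration short but would not suit very large documents.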
Terrier Information Retrieval Platform 5.1. Copyright © 2004-2019, University of Glasgow