Package org.terrier.indexing.tokenisation


Package org.terrier.indexing.tokenisation Description

Provides classes related to the tokenisation of documents. Tokenisers are responsible for breaking chunks of text into words to be indexed. Different tokenisers may be used for different languages. In particular, two tokenisers are provided by Terrier:

In addition, both default Tokenisers apply rules such as:

Example Code

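import java.io.StringReader;
import org.terrier.indexing.tokenisation.TokenStream;
import org.terrier.indexing.tokenisation.Tokeniser;

// get the default tokeniser, as set by the property tokeniser
Tokeniser tokeniser = Tokeniser.getTokeniser();
String sentence = "This is a sentence.";
TokenStream toks = tokeniser.tokenise(new StringReader(sentence));
// iterate over the tokens found in the text
while (toks.hasNext())
{
  String token = toks.next();
  // defensive check: a tokeniser rule may reject a token, so skip nulls
  if (token == null) continue;
}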
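
To make the idea of "breaking chunks of text into words" concrete, the following is a minimal, self-contained sketch that does not use Terrier at all. The class name SimpleTokeniserSketch, the MAX_TOKEN_LENGTH constant and the length rule itself are hypothetical and chosen only for illustration; they are not Terrier's implementation, and the actual tokenisers may split and filter text differently.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: not part of Terrier. */
class SimpleTokeniserSketch {

    /** Hypothetical rule: tokens longer than this are dropped. */
    static final int MAX_TOKEN_LENGTH = 20;

    /** Splits the text on any character that is not a letter or a digit. */
    static List<String> tokenise(Reader reader) throws IOException {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            if (Character.isLetterOrDigit(c)) {
                current.append((char) c);   // still inside a token
            } else {
                flush(current, tokens);     // boundary: emit the token so far
            }
        }
        flush(current, tokens);             // emit the trailing token, if any
        return tokens;
    }

    /** Adds the pending token if it is non-empty and passes the length rule. */
    private static void flush(StringBuilder current, List<String> tokens) {
        if (current.length() > 0 && current.length() <= MAX_TOKEN_LENGTH) {
            tokens.add(current.toString());
        }
        current.setLength(0);
    }

    public static void main(String[] args) throws IOException {
        // prints [This, is, a, sentence]
        System.out.println(tokenise(new StringReader("This is a sentence.")));
    }
}
```

The sketch separates the two concerns described above: finding token boundaries, and then applying a rule to each candidate token before it is kept.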

Terrier Information Retrieval Platform 4.1. Copyright © 2004-2015, University of Glasgow