org.terrier.indexing.tokenisation
Class Tokeniser

java.lang.Object
  extended by org.terrier.indexing.tokenisation.Tokeniser
Direct Known Subclasses:
EnglishTokeniser, IdentityTokeniser, UTFTokeniser

public abstract class Tokeniser
extends java.lang.Object

A tokeniser class is responsible for tokenising a block of text. It is expected that no markup is present in this text. Input is usually a Reader, while output is in the form of a TokenStream. Tokenisers are typically used by Document implementations.

Available tokenisers

Terrier ships with two default tokenisers: EnglishTokeniser (the default, which accepts only A-Z, a-z and 0-9 as valid characters and treats everything else as a token boundary) and UTFTokeniser. The tokeniser used by default can be specified using the tokeniser property.
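
The character rule described for EnglishTokeniser can be illustrated without Terrier on the classpath. The following is a minimal sketch under stated assumptions, not Terrier's implementation: the class AsciiSplitSketch and its methods are hypothetical names, and the sketch covers only the A-Z, a-z, 0-9 rule (any further processing EnglishTokeniser may perform is not modelled).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; not part of Terrier.
final class AsciiSplitSketch {

    // True only for the characters described as valid: A-Z, a-z and 0-9.
    static boolean isValid(int c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
    }

    // Reads the whole reader; every character that is not valid ends the current token.
    static List<String> split(Reader reader) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try {
            int c;
            while ((c = reader.read()) != -1) {
                if (isValid(c)) {
                    current.append((char) c);
                } else if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            }
        } catch (IOException e) {
            // Wrapped only to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        // Flush the last token if the text did not end on a boundary.
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

With this sketch, split(new StringReader("This is a block of text.")) yields the six tokens This, is, a, block, of, text; the trailing full stop acts only as a boundary.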

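Properties:
tokeniser - name of the Tokeniser class that getTokeniser() instantiates. Defaults to EnglishTokeniser.
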
Example:

 Tokeniser tokeniser = Tokeniser.getTokeniser();
 TokenStream toks = tokeniser.tokenise(new StringReader("This is a block of text."));
 while(toks.hasNext())
 {
   System.out.println(toks.next());
 }
 

Since:
3.5
Author:
Craig Macdonald & Rodrygo Santos
See Also:
TokenStream, EnglishTokeniser, UTFTokeniser

Field Summary
static TokenStream EMPTY_STREAM
          An empty TokenStream, containing no tokens.
 
Constructor Summary
Tokeniser()
           
 
Method Summary
static Tokeniser getTokeniser()
          Instantiates the Tokeniser class named by the tokeniser property.
 java.lang.String[] getTokens(java.io.Reader reader)
          Utility method which returns all of the tokens for a given stream.
abstract  TokenStream tokenise(java.io.Reader reader)
          Tokenises the text obtained from the specified reader.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

EMPTY_STREAM

public static final TokenStream EMPTY_STREAM
An empty TokenStream, containing no tokens.

Constructor Detail

Tokeniser

public Tokeniser()
Method Detail

getTokeniser

public static Tokeniser getTokeniser()
Instantiates the Tokeniser class named by the tokeniser property.

Returns:
An instance of the tokeniser class named by the tokeniser property.
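
The general pattern of instantiating a class named by a property can be sketched in plain Java. This is an illustration only: PropertyFactorySketch and instantiate are hypothetical names, java.util.Properties stands in for whatever configuration mechanism Terrier uses, and this page does not show how getTokeniser() actually reads the property or handles failures.

```java
import java.util.Properties;

// Hypothetical stand-alone sketch; not part of Terrier.
final class PropertyFactorySketch {

    // Looks up a class name under the given key (falling back to a default)
    // and creates an instance through its no-argument constructor.
    static Object instantiate(Properties props, String key, String defaultClassName) {
        String className = props.getProperty(key, defaultClassName);
        try {
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            // Covers a missing class, a missing constructor and a failing constructor.
            throw new IllegalStateException("Cannot instantiate " + className, e);
        }
    }
}
```

The sketch relies on the named class having an accessible no-argument constructor, which is consistent with the public Tokeniser() constructor listed above.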

tokenise

public abstract TokenStream tokenise(java.io.Reader reader)
Tokenises the text obtained from the specified reader.

Parameters:
reader - Stream of text to be tokenised
Returns:
a TokenStream of the tokens found in the text.

getTokens

public java.lang.String[] getTokens(java.io.Reader reader)
                             throws java.io.IOException
Utility method which returns all of the tokens for a given stream.

Parameters:
reader - Stream of text to be tokenised
Returns:
All of the tokens found in the stream of text.
Throws:
java.io.IOException
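
A utility of this kind amounts to draining a stream of tokens into an array. The sketch below shows that idea in plain Java and is not Terrier's implementation: CollectTokensSketch and collect are hypothetical names, the stream is modelled as an Iterator<String> (suggested by the hasNext()/next() usage in the example, but not stated on this page), and skipping null tokens is a defensive assumption rather than documented behaviour.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-alone sketch; not part of Terrier.
final class CollectTokensSketch {

    // Consumes the iterator to its end and returns the tokens in order.
    static String[] collect(Iterator<String> stream) {
        List<String> tokens = new ArrayList<>();
        while (stream.hasNext()) {
            String token = stream.next();
            // Assumption: a null entry carries no token, so it is skipped.
            if (token != null) {
                tokens.add(token);
            }
        }
        return tokens.toArray(new String[0]);
    }
}
```

An array is convenient when the caller needs the token count or random access; for large inputs, iterating over the TokenStream directly avoids holding every token in memory.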


Terrier 3.5. Copyright © 2004-2011 University of Glasgow