Class Tokeniser
- java.lang.Object
  - org.terrier.indexing.tokenisation.Tokeniser
- All Implemented Interfaces:
java.io.Serializable
- Direct Known Subclasses:
EnglishTokeniser, IdentityTokeniser, UTFTokeniser, UTFTwitterTokeniser
public abstract class Tokeniser extends java.lang.Object implements java.io.Serializable
A tokeniser is responsible for tokenising a block of text. The text is expected to contain no markup. Input is usually a Reader, and output takes the form of a TokenStream. Tokenisers are typically used by Document implementations.

Available tokenisers
Terrier ships with two default tokenisers: EnglishTokeniser (the default, which accepts only A-Z, a-z and 0-9 as valid characters; everything else causes a token boundary) and UTFTokeniser. The tokeniser used by default can be specified with the tokeniser property.
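
The boundary rule described for EnglishTokeniser can be illustrated with a small standalone sketch. This is not Terrier's source: the class name AsciiBoundarySketch and its methods are hypothetical, and the sketch applies only the rule stated above (A-Z, a-z and 0-9 are valid, anything else ends the current token). Any further processing the real class may perform, such as case folding or length limits, is not modelled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the "only A-Z, a-z, 0-9 are valid" rule.
// It does not use or reproduce any Terrier class.
class AsciiBoundarySketch {

    // True only for the characters the description lists as valid.
    static boolean isTokenChar(char c) {
        return (c >= 'A' && c <= 'Z')
            || (c >= 'a' && c <= 'z')
            || (c >= '0' && c <= '9');
    }

    // Splits the text wherever a non-valid character appears.
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                // A boundary character closes the token in progress.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("This is a block of text."));
    }
}
```

Under this rule an accented word such as "café" yields the single token "caf", because "é" falls outside the accepted ranges. That is the kind of input for which UTFTokeniser would be the more natural choice.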
Properties:
- tokeniser - name of the tokeniser class to use.
Example:
Tokeniser tokeniser = Tokeniser.getTokeniser();
TokenStream toks = tokeniser.tokenise(new StringReader("This is a block of text."));
while (toks.hasNext()) {
    System.out.println(toks.next());
}
- Since:
- 3.5
- Author:
- Craig Macdonald & Rodrygo Santos
- See Also:
TokenStream, EnglishTokeniser, UTFTokeniser, Serialized Form
-
-
Field Summary
Fields
Modifier and Type | Field | Description
static TokenStream | EMPTY_STREAM | empty stream
-
Constructor Summary
Constructors
Constructor | Description
Tokeniser() |
-
Method Summary
Modifier and Type | Method | Description
static Tokeniser | getTokeniser() | Instantiates the Tokeniser class named in the tokeniser property.
java.lang.String[] | getTokens(java.io.Reader reader) | Utility method which returns all of the tokens for a given stream.
java.lang.String[] | getTokens(java.lang.String s) | Utility method which returns all of the tokens in a String.
abstract TokenStream | tokenise(java.io.Reader reader) | Tokenises the text obtained from the specified reader.
-
-
-
Field Detail
-
EMPTY_STREAM
public static final TokenStream EMPTY_STREAM
empty stream
-
-
Method Detail
-
getTokeniser
public static Tokeniser getTokeniser()
Instantiates the Tokeniser class named in the tokeniser property.
- Returns:
- An instance of the tokeniser class named in the tokeniser property.
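
Creating an object from a class name held in a property is usually done with reflection. The sketch below shows that general pattern in plain Java and is not Terrier's implementation: PropertyFactorySketch, Splitter, WhitespaceSplitter and fromProperty are hypothetical names, and a java.util.Properties object stands in for whatever configuration mechanism Terrier actually reads.

```java
import java.util.Properties;

// Hypothetical sketch of a "class named in a property" factory.
// None of these names belong to Terrier.
class PropertyFactorySketch {

    // Stand-in for an abstract base class such as Tokeniser.
    static abstract class Splitter {
        abstract String[] split(String text);
    }

    // One concrete implementation that the property can name.
    static class WhitespaceSplitter extends Splitter {
        @Override
        String[] split(String text) {
            return text.trim().split("\\s+");
        }
    }

    // Looks up the class name under the given key, falls back to
    // defaultClass when the key is unset, and calls the no-arg constructor.
    static Splitter fromProperty(Properties props, String key, String defaultClass) {
        String className = props.getProperty(key, defaultClass);
        try {
            return Class.forName(className)
                        .asSubclass(Splitter.class)
                        .getDeclaredConstructor()
                        .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("splitter", WhitespaceSplitter.class.getName());
        Splitter s = fromProperty(props, "splitter", WhitespaceSplitter.class.getName());
        System.out.println(s.split("This is a block of text.").length);
    }
}
```

A factory of this shape lets the concrete class be swapped through configuration alone, which matches the role the tokeniser property plays here.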
-
tokenise
public abstract TokenStream tokenise(java.io.Reader reader)
Tokenises the text obtained from the specified reader.
- Parameters:
reader - Stream of text to be tokenised
- Returns:
- A TokenStream of the tokens found in the text.
-
getTokens
public java.lang.String[] getTokens(java.io.Reader reader) throws java.io.IOException
Utility method which returns all of the tokens for a given stream.
- Parameters:
reader - Stream of text to be tokenised
- Returns:
- All of the tokens found in the stream of text.
- Throws:
java.io.IOException
-
getTokens
public java.lang.String[] getTokens(java.lang.String s)
Utility method which returns all of the tokens in a String.
- Parameters:
s - String of text to be tokenised
- Returns:
- All of the tokens found in the String.
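
Both getTokens overloads return every token as an array rather than as a stream. A minimal sketch of how such a helper could drain an iterator-style stream follows. It assumes the stream behaves like a java.util.Iterator of String, which is inferred from the hasNext() and next() calls in the class example. DrainTokensSketch and drain are hypothetical names, and skipping null elements is an assumption made for safety, not documented behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper: collects an iterator-style token stream into an array.
class DrainTokensSketch {

    static String[] drain(Iterator<String> stream) {
        List<String> out = new ArrayList<>();
        while (stream.hasNext()) {
            String token = stream.next();
            // Assumption: a stream may hand back null for a dropped token.
            if (token != null) {
                out.add(token);
            }
        }
        return out.toArray(new String[0]);
    }

    public static void main(String[] args) {
        Iterator<String> stream = Arrays.asList("this", null, "text").iterator();
        System.out.println(Arrays.toString(drain(stream)));
    }
}
```

The array form suits short inputs such as queries or test strings. For large documents, iterating over the TokenStream directly avoids holding every token in memory at once.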
-
-