Tokeniser (Terrier 4.0 API)

java.lang.Object
- org.terrier.indexing.tokenisation.Tokeniser

Direct Known Subclasses:

EnglishTokeniser, IdentityTokeniser, UTFTokeniser, UTFTwitterTokeniser
```
public abstract class Tokeniser
extends Object
```
A tokeniser class is responsible for tokenising a block of text. It is expected that no markup is present in this text. Input is usually a Reader, while output is in the form of a TokenStream. Tokenisers are typically used by Document implementations.
Available tokenisers: Terrier ships with two default tokenisers, EnglishTokeniser and UTFTokeniser. EnglishTokeniser is the default; it accepts only A-Z, a-z and 0-9 as valid characters, and any other character causes a token boundary. The tokeniser to use can be changed through the tokeniser property.
Properties:
- tokeniser - name of the tokeniser class to use.
Example:
```
Tokeniser tokeniser = Tokeniser.getTokeniser();
TokenStream toks = tokeniser.tokenise(new StringReader("This is a block of text."));
while (toks.hasNext())
{
  System.out.println(toks.next());
}
```
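The character rule described above for EnglishTokeniser (only A-Z, a-z and 0-9 are valid, everything else is a token boundary) can be illustrated with a small standalone sketch. This is not Terrier's implementation: the class `AsciiAlnumSplitSketch` and its methods are hypothetical names introduced here, the result is a plain `List<String>` rather than a TokenStream, and any case folding or length filtering a real tokeniser might apply is left out.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in only: this is NOT Terrier's EnglishTokeniser.
final class AsciiAlnumSplitSketch {

    // True for the characters the description lists as valid: A-Z, a-z, 0-9.
    static boolean isTokenChar(int c) {
        return (c >= 'A' && c <= 'Z')
            || (c >= 'a' && c <= 'z')
            || (c >= '0' && c <= '9');
    }

    // Reads the whole reader and starts a new token at every invalid character.
    static List<String> split(Reader reader) throws IOException {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            if (isTokenChar(c)) {
                current.append((char) c);
            } else if (current.length() > 0) {
                // Boundary reached: emit the token collected so far.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        // Emit the final token if the text did not end on a boundary.
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // prints [This, is, a, block, of, text]
        System.out.println(split(new StringReader("This is a block of text.")));
    }
}
```

Under this rule the trailing full stop never becomes a token, and a non-ASCII letter such as "é" would split a word in two, which is one reason a UTF-aware tokeniser exists alongside the English one.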
Since:

3.5

Author:

Craig Macdonald & Rodrygo Santos

See Also:
TokenStream, EnglishTokeniser, UTFTokeniser

Field Summary

Fields
Modifier and Type Field and Description

static TokenStream EMPTY_STREAM
An empty token stream.

Constructor Summary

Constructors
Constructor and Description

Tokeniser()

Method Summary

Methods
Modifier and Type	Method and Description
`static Tokeniser`	`getTokeniser()` Instantiates Tokeniser class named in the `tokeniser` property.
`String[]`	`getTokens(Reader reader)` Utility method which returns all of the tokens for a given stream.
`abstract TokenStream`	`tokenise(Reader reader)` Tokenises the text obtained from the specified reader.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - EMPTY_STREAM
```
public static final TokenStream EMPTY_STREAM
```
    An empty token stream.
- Constructor Detail
  - Tokeniser
```
public Tokeniser()
```
- Method Detail
  - getTokeniser
```
public static Tokeniser getTokeniser()
```
    Instantiates Tokeniser class named in the tokeniser property.
    
    Returns:
    An instance of the tokeniser class named in the tokeniser property.
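
    As a rough illustration of what "instantiates the class named in a property" can look like, the sketch below resolves a class name from a `java.util.Properties` object and creates an instance reflectively. It is a generic pattern under stated assumptions, not Terrier's code: `TokeniserFactorySketch`, `SketchTokeniser` and `WhitespaceSketchTokeniser` are hypothetical stand-ins, and how Terrier itself reads its properties or resolves class names is not shown here.

```java
import java.util.Properties;

// Hypothetical factory; none of these class names come from Terrier.
final class TokeniserFactorySketch {

    // Looks up the class name stored under "tokeniser" and instantiates it
    // through its no-argument constructor.
    static SketchTokeniser fromProperties(Properties props)
            throws ReflectiveOperationException {
        String className = props.getProperty(
            "tokeniser", WhitespaceSketchTokeniser.class.getName());
        Class<? extends SketchTokeniser> cls =
            Class.forName(className).asSubclass(SketchTokeniser.class);
        return cls.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        Properties props = new Properties();
        props.setProperty("tokeniser", WhitespaceSketchTokeniser.class.getName());
        SketchTokeniser t = fromProperties(props);
        // prints WhitespaceSketchTokeniser
        System.out.println(t.getClass().getSimpleName());
    }
}

// Stand-in for an abstract tokeniser base class (assumption for illustration).
abstract class SketchTokeniser {
    abstract String[] tokenise(String text);
}

// Trivial concrete subclass so the factory has something to instantiate.
final class WhitespaceSketchTokeniser extends SketchTokeniser {
    @Override
    String[] tokenise(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }
}
```

    The `asSubclass` call makes a misconfigured property fail early with a `ClassCastException` instead of returning an object of the wrong type.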
  - tokenise
```
public abstract TokenStream tokenise(Reader reader)
```
    Tokenises the text obtained from the specified reader.
    
    Parameters:
    reader - Stream of text to be tokenised
    
    Returns:
    a TokenStream of the tokens found in the text.
  - getTokens
```
public String[] getTokens(Reader reader)
                   throws IOException
```
    Utility method which returns all of the tokens for a given stream.
    
    Parameters:
    reader - Stream of text to be tokenised
    
    Returns:
    All of the tokens found in the stream of text.
    
    Throws:
    
    IOException
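
    A utility of this kind amounts to draining a token stream into an array. The sketch below shows that step in isolation, assuming the stream behaves like a `java.util.Iterator<String>` (the class example only shows `hasNext()` and `next()`). `DrainTokensSketch` and `drain` are hypothetical names, and skipping `null` elements is a defensive choice made here, not documented behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Illustrative helper only; not part of the Terrier API.
final class DrainTokensSketch {

    // Collects every non-null element of the iterator into an array,
    // preserving the order in which the tokens were produced.
    static String[] drain(Iterator<String> stream) {
        List<String> tokens = new ArrayList<>();
        while (stream.hasNext()) {
            String token = stream.next();
            if (token != null) { // assumption: a stream might yield null entries
                tokens.add(token);
            }
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        Iterator<String> stream = Arrays.asList("a", null, "b").iterator();
        // prints [a, b]
        System.out.println(Arrays.toString(drain(stream)));
    }
}
```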

Terrier 4.0. Copyright © 2004-2014 University of Glasgow