UTFTokeniser (Terrier 4.0 API)

java.lang.Object
- org.terrier.indexing.tokenisation.Tokeniser
- - org.terrier.indexing.tokenisation.UTFTokeniser

```
public class UTFTokeniser
extends Tokeniser
```
Tokenises text obtained from a text stream. In contrast to EnglishTokeniser, tokenisation is more liberal: a character is accepted as part of a token if it matches any one of three rules:
1. Character.isLetterOrDigit() returns true
2. Character.getType() returns Character.NON_SPACING_MARK
3. Character.getType() returns Character.COMBINING_SPACING_MARK
All other characters cause a new token.
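The three character rules can be illustrated with a minimal sketch. This is not the Terrier source: the class `TokenCharSketch` and its methods `isTokenChar` and `split` are hypothetical names, and iterating over code points (rather than `char` values) is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the character rules; not Terrier's implementation.
class TokenCharSketch {

    /** Returns true if the code point satisfies one of the three documented rules. */
    static boolean isTokenChar(int codePoint) {
        if (Character.isLetterOrDigit(codePoint)) {
            return true; // rule 1
        }
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK        // rule 2
            || type == Character.COMBINING_SPACING_MARK; // rule 3
    }

    /** Splits text into tokens; any character failing the rules ends the current token. */
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isTokenChar(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The escaped word is Devanagari text containing vowel signs and a virama.
        String text = "na\u00EFve caf\u00E9, \u0939\u093F\u0928\u094D\u0926\u0940-text!";
        for (String token : split(text)) {
            System.out.println(token);
        }
    }
}
```

In this sketch the Devanagari word stays a single token because its vowel signs and virama are combining marks covered by rules 2 and 3; under rule 1 alone it would be broken apart at each mark.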
Furthermore, terms are checked as follows to reduce index noise:
1. Any term which is longer than max.term.length (20 characters by default) is discarded.
2. Any term which has more than 4 digits is discarded.
3. Any term which has more than 3 consecutive identical characters is discarded.
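The term checks above might look roughly like the following sketch. The class `TermCheckSketch`, the method `isAcceptable`, and the constant names are hypothetical; the limits are taken from the description above, and the real class may apply them differently (for example via its `DROP_LONG_TOKENS` field).

```java
// Hypothetical illustration of the three term checks; not Terrier's implementation.
class TermCheckSketch {
    static final int MAX_TERM_LENGTH = 20;     // documented default of max.term.length
    static final int MAX_DIGITS = 4;           // more than 4 digits: discard
    static final int MAX_SAME_CONSECUTIVE = 3; // more than 3 identical in a row: discard

    /** Returns true if the term passes all three checks. */
    static boolean isAcceptable(String term) {
        if (term.length() > MAX_TERM_LENGTH) {
            return false;
        }
        int digits = 0;
        int run = 0;
        char previous = 0;
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (Character.isDigit(c)) {
                digits++;
            }
            // Extend the current run of identical characters, or start a new one.
            run = (i > 0 && c == previous) ? run + 1 : 1;
            previous = c;
            if (digits > MAX_DIGITS || run > MAX_SAME_CONSECUTIVE) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] samples = {"terrier", "2014", "12345", "aaab", "aaaab"};
        for (String s : samples) {
            System.out.println(s + " accepted: " + isAcceptable(s));
        }
    }
}
```

Under these limits a year such as 2014 is kept, while a five-digit number or a run of four identical letters is rejected.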
Properties:
- lowercase - whether all terms should be lowercased.
- max.term.length - maximum acceptable term length, default is 20.
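As a rough sketch of how these two properties could be interpreted, the example below reads them from a plain `java.util.Properties` object, which stands in for however Terrier itself loads its configuration. The class `TokeniserPropertiesSketch` and its methods are hypothetical, and the default of `true` for `lowercase` is an assumption; only the default of 20 for `max.term.length` is stated above.

```java
import java.util.Locale;
import java.util.Properties;

// Hypothetical illustration of reading the two documented properties.
class TokeniserPropertiesSketch {

    /** Reads max.term.length, falling back to the documented default of 20. */
    static int maxTermLength(Properties props) {
        return Integer.parseInt(props.getProperty("max.term.length", "20"));
    }

    /** Reads lowercase; the default of true is an assumption of this sketch. */
    static boolean lowercase(Properties props) {
        return Boolean.parseBoolean(props.getProperty("lowercase", "true"));
    }

    /** Lowercases the term only when the lowercase property is enabled. */
    static String normalise(String term, Properties props) {
        return lowercase(props) ? term.toLowerCase(Locale.ROOT) : term;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("max.term.length", "30");
        props.setProperty("lowercase", "false");
        System.out.println(maxTermLength(props));
        System.out.println(normalise("Terrier", props));
    }
}
```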
Author:

Gianni Amati, Ben He, Vassilis Plachouras, Craig Macdonald

See Also:
EnglishTokeniser, Character

Field Summary

Fields
Modifier and Type	Field and Description
`protected static boolean`	`DROP_LONG_TOKENS` Whether tokens longer than MAX_TERM_LENGTH should be dropped.
`protected static int`	`maxNumOfDigitsPerTerm` The maximum number of digits that are allowed in valid terms.
`protected static int`	`maxNumOfSameConseqLettersPerTerm` The maximum number of consecutive same letters or digits that are allowed in valid terms.

Fields inherited from class org.terrier.indexing.tokenisation.Tokeniser
EMPTY_STREAM

Constructor Summary

Constructors
Constructor and Description

UTFTokeniser()

Method Summary

Methods
Modifier and Type	Method and Description
`TokenStream`	`tokenise(Reader reader)` Tokenises the text obtained from the specified reader.
- Methods inherited from class org.terrier.indexing.tokenisation.Tokeniser
  getTokeniser, getTokens
- Methods inherited from class java.lang.Object
  clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - maxNumOfDigitsPerTerm
```
protected static final int maxNumOfDigitsPerTerm
```
    The maximum number of digits that are allowed in valid terms.
    
    See Also:
    Constant Field Values
  - maxNumOfSameConseqLettersPerTerm
```
protected static final int maxNumOfSameConseqLettersPerTerm
```
    The maximum number of consecutive same letters or digits that are allowed in valid terms.
    
    See Also:
    Constant Field Values
  - DROP_LONG_TOKENS
```
protected static final boolean DROP_LONG_TOKENS
```
    Whether tokens longer than MAX_TERM_LENGTH should be dropped.
    
    See Also:
    Constant Field Values
- Constructor Detail
  - UTFTokeniser
```
public UTFTokeniser()
```
- Method Detail
  - tokenise
```
public TokenStream tokenise(Reader reader)
```
    Description copied from class: Tokeniser
    
    Tokenises the text obtained from the specified reader.
    
    Specified by:
    
    tokenise in class Tokeniser
    
    Parameters:
    reader - Stream of text to be tokenised
    
    Returns:
    a TokenStream of the tokens found in the text.


Terrier 4.0. Copyright © 2004-2014 University of Glasgow