public class UTFTokeniser extends Tokeniser
Compared to EnglishTokeniser, a more liberal tokenisation is performed. In particular, an acceptable character for any token must match one of three rules:
Furthermore, terms are additionally checked to reduce index noise, as follows:
See Also: EnglishTokeniser, Character
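The individual character rules are not listed here. As a rough illustration only, the sketch below assumes they are built on java.lang.Character: a character is accepted if it is a letter or digit, or if its Unicode general category is a non-spacing or combining spacing mark. The class TokenCharSketch and the method isTokenChar are hypothetical names and are not part of Terrier.

```java
/** Hypothetical helper, not part of Terrier: an assumed liberal character test. */
final class TokenCharSketch {
    private TokenCharSketch() {}

    /** Returns true if the code point may appear inside a token (assumed rules). */
    static boolean isTokenChar(int codePoint) {
        // Assumed rule 1: any Unicode letter or digit is acceptable.
        if (Character.isLetterOrDigit(codePoint)) {
            return true;
        }
        // Assumed rules 2 and 3: combining marks stay attached to their base letter.
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK;
    }
}
```

Accepting combining marks matters for scripts where vowel signs and accents are separate code points; splitting on them would break words apart.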
Modifier and Type | Field and Description |
---|---|
protected static boolean | DROP_LONG_TOKENS: Whether tokens longer than MAX_TERM_LENGTH should be dropped. |
protected static int | maxNumOfDigitsPerTerm: The maximum number of digits that are allowed in valid terms. |
protected static int | maxNumOfSameConseqLettersPerTerm: The maximum number of consecutive same letters or digits that are allowed in valid terms. |
Fields inherited from class Tokeniser: EMPTY_STREAM
Constructor and Description |
---|
UTFTokeniser() |
Modifier and Type | Method and Description |
---|---|
TokenStream | tokenise(Reader reader): Tokenises the text obtained from the specified reader. |
Methods inherited from class Tokeniser: getTokeniser, getTokens
protected static final int maxNumOfDigitsPerTerm
protected static final int maxNumOfSameConseqLettersPerTerm
protected static final boolean DROP_LONG_TOKENS
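To illustrate how the three fields above could work together when checking a term, here is a minimal sketch. It is not Terrier's implementation: the class TermCheckSketch, the method check, and all limit values are hypothetical and supplied by the caller, and truncating long terms when they are not dropped is an assumption.

```java
/** Hypothetical helper, not part of Terrier: applies noise-reduction limits to one term. */
final class TermCheckSketch {
    private TermCheckSketch() {}

    /**
     * Returns the term to keep, or null if it should be discarded.
     * All limits are illustrative values passed in by the caller.
     */
    static String check(String term, int maxTermLength, boolean dropLongTokens,
                        int maxDigits, int maxSameConsecutive) {
        if (term.length() > maxTermLength) {
            if (dropLongTokens) {
                return null;
            }
            // Assumption: when long tokens are kept, they are truncated.
            term = term.substring(0, maxTermLength);
        }
        int digits = 0;     // digits seen so far in the term
        int run = 0;        // length of the current run of identical characters
        char previous = 0;
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (Character.isDigit(c)) {
                digits++;
            }
            run = (i > 0 && c == previous) ? run + 1 : 1;
            previous = c;
            if (digits > maxDigits || run > maxSameConsecutive) {
                return null;
            }
        }
        return term;
    }
}
```

With illustrative limits of 20 characters, 4 digits and 3 repeated characters, a call such as check("aaaa", 20, true, 4, 3) returns null, while check("2015", 20, true, 4, 3) returns the term unchanged.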
public TokenStream tokenise(Reader reader)
Specified by: tokenise in class Tokeniser
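The general shape of tokenising from a Reader can be sketched without Terrier on the classpath. The example below is a simplified stand-in, not the library's code: ReaderTokeniseSketch and tokens are hypothetical names, the acceptance test is passed in as a predicate, and characters are read one UTF-16 unit at a time, so supplementary code points are not handled.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical helper, not part of Terrier: splits reader text into tokens. */
final class ReaderTokeniseSketch {
    private ReaderTokeniseSketch() {}

    /** Collects maximal runs of accepted characters; everything else is a separator. */
    static List<String> tokens(Reader reader, IntPredicate isTokenChar) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try {
            int c;
            while ((c = reader.read()) != -1) {
                if (isTokenChar.test(c)) {
                    current.append((char) c);
                } else if (current.length() > 0) {
                    // A separator ends the token being built.
                    out.add(current.toString());
                    current.setLength(0);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (current.length() > 0) {
            out.add(current.toString()); // flush the final token at end of input
        }
        return out;
    }
}
```

Passing the predicate as a parameter keeps the splitting loop independent of which characters count as part of a token, so a stricter or more liberal rule can be swapped in without changing the loop.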
Terrier Information Retrieval Platform 4.1. Copyright © 2004-2015, University of Glasgow