public class EnglishTokeniser extends Tokeniser
Furthermore, terms are checked against the limits described in the fields below, to reduce index noise:
| Modifier and Type | Field and Description |
|---|---|
| protected static boolean | DROP_LONG_TOKENS: Whether tokens longer than MAX_TERM_LENGTH should be dropped. |
| protected static int | maxNumOfDigitsPerTerm: The maximum number of digits that are allowed in valid terms. |
| protected static int | maxNumOfSameConseqLettersPerTerm: The maximum number of consecutive same letters or digits that are allowed in valid terms. |
Fields inherited from class Tokeniser: EMPTY_STREAM

| Constructor and Description |
|---|
| EnglishTokeniser() |
| Modifier and Type | Method and Description |
|---|---|
| TokenStream | tokenise(Reader reader): Tokenises the text obtained from the specified reader. |
Methods inherited from class Tokeniser: getTokeniser, getTokens, getTokens

protected static final int maxNumOfDigitsPerTerm
protected static final int maxNumOfSameConseqLettersPerTerm
protected static final boolean DROP_LONG_TOKENS
public TokenStream tokenise(Reader reader)
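The term checks described by the three fields above can be illustrated with a minimal, self-contained sketch. This is not Terrier's implementation: the class name `TermCheckSketch`, the method `isAcceptable`, and all four constant values are hypothetical, chosen only for illustration, since this page does not state the real limits. What happens to an over-long token when DROP_LONG_TOKENS is false is also not described here, so the sketch simply skips the length check in that case.

```java
public class TermCheckSketch {
    // Hypothetical values; the real limits are not given on this page.
    static final int MAX_TERM_LENGTH = 20;
    static final boolean DROP_LONG_TOKENS = true;
    static final int MAX_DIGITS_PER_TERM = 4;
    static final int MAX_SAME_CONSECUTIVE = 3;

    /** Returns true if the term passes all three noise checks. */
    static boolean isAcceptable(String term) {
        // Check 1: over-long tokens are rejected only when dropping is enabled.
        if (DROP_LONG_TOKENS && term.length() > MAX_TERM_LENGTH) {
            return false;
        }
        int digits = 0;  // digits seen so far in the term
        int run = 0;     // length of the current run of identical characters
        char prev = 0;
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (Character.isDigit(c)) {
                digits++;
            }
            run = (i > 0 && c == prev) ? run + 1 : 1;
            prev = c;
            // Checks 2 and 3: too many digits, or too long a run of one character.
            if (digits > MAX_DIGITS_PER_TERM || run > MAX_SAME_CONSECUTIVE) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        for (String t : new String[] {"glasgow", "12345", "aaaa"}) {
            System.out.println(t + " -> " + isAcceptable(t));
        }
    }
}
```

Keeping the digit count and the run length in a single pass means each term is scanned once, and the scan stops as soon as either limit is exceeded.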
Terrier Information Retrieval Platform 5.1. Copyright © 2004-2019, University of Glasgow