public class EnglishTokeniser extends Tokeniser
In addition, terms are checked against the limits listed below, to reduce index noise:
| Modifier and Type | Field and Description | 
|---|---|
| protected static boolean | DROP_LONG_TOKENS: Whether tokens longer than MAX_TERM_LENGTH should be dropped. | 
| protected static int | maxNumOfDigitsPerTerm: The maximum number of digits that are allowed in valid terms. | 
| protected static int | maxNumOfSameConseqLettersPerTerm: The maximum number of consecutive same letters or digits that are allowed in valid terms. | 
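
The three checks described by these fields can be sketched in plain Java as follows. This is an illustrative sketch, not Terrier's source code: the class `TermCheckSketch`, its methods `isValid` and `filter`, the regular-expression splitting, and the limit values (20, 4 and 3) are all assumptions made for the example, and the real tokeniser may differ in how it splits text and in what it does with long tokens when DROP_LONG_TOKENS is false.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: not taken from Terrier.
class TermCheckSketch {
    // Hypothetical limits chosen for this example; the actual values are
    // defined by MAX_TERM_LENGTH, maxNumOfDigitsPerTerm and
    // maxNumOfSameConseqLettersPerTerm in the library.
    static final int MAX_TERM_LENGTH = 20;
    static final int MAX_DIGITS = 4;
    static final int MAX_SAME_CONSECUTIVE = 3;
    static final boolean DROP_LONG_TOKENS = true;

    /** Returns true if the term passes the length, digit and repetition checks. */
    static boolean isValid(String term) {
        if (term.isEmpty()) return false;
        // Assumption: when the flag is false this sketch skips the length check.
        if (DROP_LONG_TOKENS && term.length() > MAX_TERM_LENGTH) return false;
        int digits = 0;     // digits seen so far in the term
        int run = 0;        // length of the current run of identical characters
        char previous = 0;
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (Character.isDigit(c)) digits++;
            run = (i > 0 && c == previous) ? run + 1 : 1;
            previous = c;
            if (digits > MAX_DIGITS || run > MAX_SAME_CONSECUTIVE) return false;
        }
        return true;
    }

    /** Lowercases the text, splits on non-alphanumeric characters, keeps valid terms. */
    static List<String> filter(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (isValid(token)) kept.add(token);
        }
        return kept;
    }

    public static void main(String[] args) {
        // "1234567" has too many digits and "zzzzzz" repeats one letter too often.
        System.out.println(filter("Terrier indexed 2015 pages, id 1234567, zzzzzz"));
    }
}
```

Under these assumed limits, a four-digit year such as 2015 is kept, while long numeric identifiers and strings of one repeated character are discarded as noise.
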
Fields inherited from class Tokeniser: EMPTY_STREAM

| Constructor and Description | 
|---|
| EnglishTokeniser() | 
| Modifier and Type | Method and Description | 
|---|---|
| TokenStream | tokenise(Reader reader): Tokenises the text obtained from the specified reader. | 
Methods inherited from class Tokeniser: getTokeniser, getTokens

protected static final int maxNumOfDigitsPerTerm
protected static final int maxNumOfSameConseqLettersPerTerm
protected static final boolean DROP_LONG_TOKENS
public TokenStream tokenise(Reader reader)
Terrier Information Retrieval Platform 4.1. Copyright © 2004-2015, University of Glasgow