public class EnglishTokeniser extends Tokeniser
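Tokeniser for English text. In addition to tokenising, it checks terms to reduce index noise: tokens longer than MAX_TERM_LENGTH can be dropped (see DROP_LONG_TOKENS), and terms with too many digits (see maxNumOfDigitsPerTerm) or too many consecutive identical letters or digits (see maxNumOfSameConseqLettersPerTerm) are not treated as valid terms.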
Modifier and Type | Field and Description |
---|---|
protected static boolean | DROP_LONG_TOKENS: Whether tokens longer than MAX_TERM_LENGTH should be dropped. |
protected static int | maxNumOfDigitsPerTerm: The maximum number of digits that are allowed in valid terms. |
protected static int | maxNumOfSameConseqLettersPerTerm: The maximum number of consecutive same letters or digits that are allowed in valid terms. |
Fields inherited from class Tokeniser: EMPTY_STREAM
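The term checks described by these fields can be sketched in plain Java. This is a minimal illustration and not Terrier's implementation: the class name TermCheckSketch, the method check, and all threshold values are hypothetical, since this page does not state the actual values. Truncating over-long tokens when DROP_LONG_TOKENS is false is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; names and threshold values are hypothetical. */
class TermCheckSketch {
    // Hypothetical values: the real limits are not given on this page.
    static final int MAX_TERM_LENGTH = 20;
    static final int MAX_DIGITS = 4;
    static final int MAX_SAME_CONSECUTIVE = 3;
    static final boolean DROP_LONG_TOKENS = true;

    /** Returns the term if it passes every check, or null if it is rejected as noise. */
    static String check(String term) {
        if (term.isEmpty()) {
            return null;
        }
        if (term.length() > MAX_TERM_LENGTH) {
            if (DROP_LONG_TOKENS) {
                return null;
            }
            // Assumption: when long tokens are not dropped, they are truncated.
            term = term.substring(0, MAX_TERM_LENGTH);
        }
        int digits = 0; // digits seen so far in the term
        int run = 0;    // length of the current run of identical characters
        char prev = 0;
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (Character.isDigit(c)) {
                digits++;
            }
            run = (i > 0 && c == prev) ? run + 1 : 1;
            prev = c;
            if (digits > MAX_DIGITS || run > MAX_SAME_CONSECUTIVE) {
                return null;
            }
        }
        return term;
    }

    public static void main(String[] args) {
        String[] candidates = {"glasgow", "2015", "12345", "aaaa"};
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            String term = check(candidate);
            if (term != null) {
                kept.add(term);
            }
        }
        System.out.println(kept); // prints [glasgow, 2015]
    }
}
```

With these hypothetical limits, "12345" is rejected for having more than four digits and "aaaa" for having more than three identical characters in a row.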
Constructor and Description |
---|
EnglishTokeniser() |
Modifier and Type | Method and Description |
---|---|
TokenStream | tokenise(Reader reader): Tokenises the text obtained from the specified reader. |
Methods inherited from class Tokeniser: getTokeniser, getTokens
protected static final int maxNumOfDigitsPerTerm
protected static final int maxNumOfSameConseqLettersPerTerm
protected static final boolean DROP_LONG_TOKENS
public TokenStream tokenise(Reader reader)
Specified by: tokenise in class Tokeniser
Terrier Information Retrieval Platform 4.1. Copyright © 2004-2015, University of Glasgow