Class EnglishTokeniser
- java.lang.Object
  - org.terrier.indexing.tokenisation.Tokeniser
    - org.terrier.indexing.tokenisation.EnglishTokeniser
- All Implemented Interfaces:
java.io.Serializable
public class EnglishTokeniser extends Tokeniser
Tokenises text obtained from a text stream, assuming the text is in English. Acceptable characters are A-Z, a-z and 0-9; any other character ends the current token. In addition, terms are checked as follows to reduce index noise:
- Any term which is longer than max.term.length (usually 20 characters) is discarded.
- Any term which has more than 4 digits is discarded.
- Any term which has more than 3 consecutive identical characters is discarded.
Properties:
- lowercase - should all terms be lowercased or not?
- max.term.length - maximum acceptable term length, default is 20.
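The term checks listed above can be illustrated with a minimal, self-contained sketch. This is not Terrier's implementation: the class name `TermCheckSketch`, the method `check`, and its parameters are hypothetical, and the thresholds are taken from the description (more than 4 digits, more than 3 consecutive identical characters). Whether the real tokeniser compares characters before or after lowercasing is an assumption here.

```java
import java.util.Locale;

// Hypothetical sketch of the term checks described above; not Terrier's code.
final class TermCheckSketch {
    static final int MAX_DIGITS = 4;   // terms with more than 4 digits are discarded
    static final int MAX_SAME_RUN = 3; // terms with more than 3 identical characters in a row are discarded

    /** Returns the (optionally lowercased) term, or null if the term should be discarded. */
    static String check(String term, int maxTermLength, boolean lowercase) {
        if (term.length() > maxTermLength) {
            return null; // longer than the configured maximum term length
        }
        int digits = 0;
        int run = 0;
        char prev = 0;
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (c >= '0' && c <= '9') {
                digits++;
            }
            // Assumption: identical characters are compared case-sensitively.
            run = (i > 0 && c == prev) ? run + 1 : 1;
            prev = c;
            if (digits > MAX_DIGITS || run > MAX_SAME_RUN) {
                return null;
            }
        }
        return lowercase ? term.toLowerCase(Locale.ROOT) : term;
    }

    public static void main(String[] args) {
        System.out.println(check("Terrier", 20, true)); // prints "terrier"
        System.out.println(check("12345", 20, true));   // prints "null" (5 digits)
        System.out.println(check("aaaab", 20, true));   // prints "null" (4 identical characters in a row)
    }
}
```

Returning null for a discarded term is a choice made for this sketch only; a caller would simply skip null results.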
- Author:
- Gianni Amati, Ben He, Vassilis Plachouras, Craig Macdonald
- See Also:
- Serialized Form
Field Summary
Fields
Modifier and Type | Field | Description
protected static boolean | DROP_LONG_TOKENS | Whether tokens longer than MAX_TERM_LENGTH should be dropped.
protected static int | maxNumOfDigitsPerTerm | The maximum number of digits that are allowed in valid terms.
protected static int | maxNumOfSameConseqLettersPerTerm | The maximum number of consecutive same letters or digits that are allowed in valid terms.
Fields inherited from class org.terrier.indexing.tokenisation.Tokeniser
EMPTY_STREAM
Constructor Summary
Constructors
Constructor | Description
EnglishTokeniser() |
Method Summary
Modifier and Type | Method | Description
TokenStream | tokenise(java.io.Reader reader) | Tokenises the text obtained from the specified reader.
Methods inherited from class org.terrier.indexing.tokenisation.Tokeniser
getTokeniser, getTokens, getTokens
Field Detail
maxNumOfDigitsPerTerm
protected static final int maxNumOfDigitsPerTerm
The maximum number of digits that are allowed in valid terms.
- See Also:
- Constant Field Values
maxNumOfSameConseqLettersPerTerm
protected static final int maxNumOfSameConseqLettersPerTerm
The maximum number of consecutive same letters or digits that are allowed in valid terms.
- See Also:
- Constant Field Values
DROP_LONG_TOKENS
protected static final boolean DROP_LONG_TOKENS
Whether tokens longer than MAX_TERM_LENGTH should be dropped.
- See Also:
- Constant Field Values
Method Detail
tokenise
public TokenStream tokenise(java.io.Reader reader)
Description copied from class: Tokeniser
Tokenises the text obtained from the specified reader.
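As an illustration of the splitting behaviour described for this class (A-Z, a-z and 0-9 are acceptable; any other character ends the current token), the following standalone sketch reads from a `java.io.Reader` and collects tokens into a list. It does not use or reproduce Terrier's `TokenStream`; the class `SplitSketch` and its methods are hypothetical, and the term checks (length, digits, repeated characters) are deliberately left out.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of splitting on non-alphanumeric characters; not Terrier's code.
final class SplitSketch {
    /** True for the characters the class description lists as acceptable. */
    static boolean isAcceptable(int c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
    }

    /** Reads the whole reader and returns the runs of acceptable characters. */
    static List<String> split(Reader reader) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try {
            int c;
            while ((c = reader.read()) != -1) {
                if (isAcceptable(c)) {
                    current.append((char) c);
                } else if (current.length() > 0) {
                    // Any other character ends the current token.
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (current.length() > 0) {
            tokens.add(current.toString()); // flush the last token at end of input
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints "[Hello, world, It, s, 2024]"
        System.out.println(split(new StringReader("Hello, world! It's 2024.")));
    }
}
```

Note that the apostrophe in "It's" splits the word into two tokens, which follows directly from treating every non-alphanumeric character as a separator.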