org.terrier.indexing.tokenisation
Class EnglishTokeniser
java.lang.Object
org.terrier.indexing.tokenisation.Tokeniser
org.terrier.indexing.tokenisation.EnglishTokeniser
public class EnglishTokeniser
extends Tokeniser
Tokenises text obtained from a text stream assuming English language.
Acceptable characters are A-Z, a-z and 0-9. Any other character
ends the current token and acts as a separator.
Furthermore, there is an additional checking of terms, to reduce
index noise, as follows:
- Any term which is longer than max.term.length (20 characters
by default) is discarded.
- Any term which has more than 4 digits is discarded.
- Any term which has more than 3 consecutive identical
characters is discarded.
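The splitting and checking rules above can be sketched in plain Java as follows. This is a minimal illustration, not Terrier's implementation: the class name TermCheckSketch, its methods, and the constant names are hypothetical, the limits are taken from the prose above, and applying lowercasing before the checks is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the rules described above; not Terrier code. */
public class TermCheckSketch {
    // Assumed limits, copied from the class description.
    static final int MAX_TERM_LENGTH = 20;
    static final int MAX_DIGITS = 4;
    static final int MAX_SAME_CONSECUTIVE = 3;

    /** Only A-Z, a-z and 0-9 may appear inside a token. */
    static boolean isAcceptable(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
    }

    /** Applies the three noise-reduction checks to one candidate term. */
    static boolean isValid(String term) {
        if (term.isEmpty() || term.length() > MAX_TERM_LENGTH) {
            return false;
        }
        int digits = 0;    // digits seen so far
        int run = 0;       // length of the current run of identical characters
        char previous = 0;
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (c >= '0' && c <= '9') {
                digits++;
            }
            run = (i > 0 && c == previous) ? run + 1 : 1;
            if (digits > MAX_DIGITS || run > MAX_SAME_CONSECUTIVE) {
                return false;
            }
            previous = c;
        }
        return true;
    }

    /** Splits text on unacceptable characters and keeps only valid terms. */
    static List<String> tokenise(String text, boolean lowercase) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        // Iterating one position past the end flushes the last token.
        for (int i = 0; i <= text.length(); i++) {
            if (i < text.length() && isAcceptable(text.charAt(i))) {
                current.append(text.charAt(i));
            } else if (current.length() > 0) {
                String term = current.toString();
                if (lowercase) {
                    term = term.toLowerCase(); // assumption: lowercase before checking
                }
                if (isValid(term)) {
                    tokens.add(term);
                }
                current.setLength(0);
            }
        }
        return tokens;
    }
}
```

With these limits, a term such as "2011" (exactly 4 digits) is kept, while "12345" (5 digits) and "aaaa" (4 identical characters in a row) are dropped.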
Properties:
- lowercase - should all terms be lowercased or not?
- max.term.length - maximum acceptable term length, default is 20.
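As a rough illustration of how the two properties might be read, the sketch below uses plain java.util.Properties. Terrier loads its configuration through its own classes, which are not shown here; the class name TokeniserSettingsSketch is hypothetical, and the "true" default for lowercase is an assumption, since this page states a default only for max.term.length.

```java
import java.util.Properties;

/** Hypothetical holder for the two tokeniser properties; not Terrier code. */
public class TokeniserSettingsSketch {
    final boolean lowercase;
    final int maxTermLength;

    TokeniserSettingsSketch(Properties props) {
        // Assumption: lowercasing is on unless configured otherwise.
        this.lowercase = Boolean.parseBoolean(props.getProperty("lowercase", "true"));
        // Documented default: 20 characters.
        this.maxTermLength = Integer.parseInt(props.getProperty("max.term.length", "20"));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("lowercase", "false");
        props.setProperty("max.term.length", "30");
        TokeniserSettingsSketch settings = new TokeniserSettingsSketch(props);
        System.out.println(settings.lowercase + " " + settings.maxTermLength);
    }
}
```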
- Author:
- Gianni Amati, Ben He, Vassilis Plachouras, Craig Macdonald
Field Summary
protected static boolean DROP_LONG_TOKENS
- Whether tokens longer than MAX_TERM_LENGTH should be dropped.
protected static int maxNumOfDigitsPerTerm
- The maximum number of digits that are allowed in valid terms.
protected static int maxNumOfSameConseqLettersPerTerm
- The maximum number of consecutive same letters or digits that are
allowed in valid terms.
Method Summary
TokenStream tokenise(java.io.Reader reader)
- Tokenises the text obtained from the specified reader.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
maxNumOfDigitsPerTerm
protected static final int maxNumOfDigitsPerTerm
- The maximum number of digits that are allowed in valid terms.
- See Also:
- Constant Field Values
maxNumOfSameConseqLettersPerTerm
protected static final int maxNumOfSameConseqLettersPerTerm
- The maximum number of consecutive same letters or digits that are
allowed in valid terms.
- See Also:
- Constant Field Values
DROP_LONG_TOKENS
protected static final boolean DROP_LONG_TOKENS
- Whether tokens longer than MAX_TERM_LENGTH should be dropped.
- See Also:
- Constant Field Values
EnglishTokeniser
public EnglishTokeniser()
tokenise
public TokenStream tokenise(java.io.Reader reader)
- Description copied from class: Tokeniser
- Tokenises the text obtained from the specified reader.
- Specified by: tokenise in class Tokeniser
- Parameters: reader - Stream of text to be tokenised
- Returns: a TokenStream of the tokens found in the text.
Terrier 3.5. Copyright © 2004-2011 University of Glasgow