org.terrier.indexing.tokenisation
Class UTFTokeniser
java.lang.Object
org.terrier.indexing.tokenisation.Tokeniser
org.terrier.indexing.tokenisation.UTFTokeniser
public class UTFTokeniser
- extends Tokeniser
Tokenises text obtained from a text stream. In contrast to EnglishTokeniser,
a more liberal tokenisation is performed. In particular, a character is
accepted as part of a token only if it matches one of three rules:
- Character.isLetterOrDigit() returns true
- Character.getType() returns Character.NON_SPACING_MARK
- Character.getType() returns Character.COMBINING_SPACING_MARK
Any other character ends the current token, so the next acceptable
character starts a new token.
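The three character rules can be sketched as a small predicate plus a splitting loop. This is a minimal illustration, not the Terrier source: the class name UtfCharRulesSketch and both method names are hypothetical, and working on code points rather than on individual char values is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Terrier: illustrates the three character rules.
final class UtfCharRulesSketch {

    /** True if the code point may appear inside a token under the three rules. */
    static boolean isTokenChar(int codePoint) {
        if (Character.isLetterOrDigit(codePoint)) {
            return true;
        }
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK;
    }

    /** Splits text into maximal runs of acceptable characters. */
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isTokenChar(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // Any other character closes the token collected so far.
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

Accepting the two mark categories keeps a base letter together with its combining accents or vowel signs, so a word written with such marks stays one token rather than being split at every mark.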
Furthermore, to reduce index noise, each term is checked as follows:
- Any term longer than max.term.length (20 characters by default)
is discarded.
- Any term with more than 4 digits is discarded.
- Any term with more than 3 consecutive identical characters
is discarded.
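The three term checks can be sketched as a single pass over the term. This is an illustration under stated assumptions, not the Terrier implementation: the class TermCheckSketch, its constants, and the method isAcceptable are hypothetical names, and the limits 20, 4 and 3 are taken from the description on this page.

```java
// Hypothetical helper, not part of Terrier: illustrates the three term checks.
final class TermCheckSketch {
    static final int MAX_TERM_LENGTH = 20;      // assumed value of max.term.length
    static final int MAX_DIGITS = 4;            // limit taken from the description
    static final int MAX_SAME_CONSECUTIVE = 3;  // limit taken from the description

    /** Returns false if the term would be discarded by any of the three checks. */
    static boolean isAcceptable(String term) {
        if (term.length() > MAX_TERM_LENGTH) {
            return false;
        }
        int digits = 0;
        int run = 0;
        char previous = 0;
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (Character.isDigit(c)) {
                digits++;
            }
            // Length of the current run of identical characters.
            run = (i > 0 && c == previous) ? run + 1 : 1;
            if (digits > MAX_DIGITS || run > MAX_SAME_CONSECUTIVE) {
                return false;
            }
            previous = c;
        }
        return true;
    }
}
```

Under these limits a four-digit year such as "2011" is kept, while a five-digit number or a term such as "aaaab" is dropped.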
Properties:
- lowercase - should all terms be lowercased or not?
- max.term.length - maximum acceptable term length, default is 20.
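As a rough illustration of how these two properties could shape a term, the sketch below reads them from a plain java.util.Properties object. It does not use Terrier's own configuration classes: TokeniserSettingsSketch and its methods are hypothetical, and treating lowercasing as enabled when the property is absent is an assumption, since this page states a default only for max.term.length.

```java
import java.util.Locale;
import java.util.Properties;

// Hypothetical helper, not part of Terrier: reads the two documented properties.
final class TokeniserSettingsSketch {
    final boolean lowercase;
    final int maxTermLength;

    TokeniserSettingsSketch(Properties props) {
        // Assumption: lowercasing is on unless the property says otherwise.
        this.lowercase = Boolean.parseBoolean(props.getProperty("lowercase", "true"));
        // Documented default for max.term.length is 20.
        this.maxTermLength = Integer.parseInt(props.getProperty("max.term.length", "20"));
    }

    /** Lowercases the term when the lowercase property is enabled. */
    String normalise(String term) {
        return lowercase ? term.toLowerCase(Locale.ROOT) : term;
    }

    /** True if the term does not exceed the configured maximum length. */
    boolean withinLength(String term) {
        return term.length() <= maxTermLength;
    }
}
```

With an empty Properties object this sketch lowercases terms and accepts terms of up to 20 characters; setting max.term.length to a smaller value tightens the length check.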
- Author:
- Gianni Amati, Ben He, Vassilis Plachouras, Craig Macdonald
- See Also:
- EnglishTokeniser, Character
Field Summary
protected static boolean DROP_LONG_TOKENS
- Whether tokens longer than MAX_TERM_LENGTH should be dropped.
protected static int maxNumOfDigitsPerTerm
- The maximum number of digits that are allowed in valid terms.
protected static int maxNumOfSameConseqLettersPerTerm
- The maximum number of consecutive same letters or digits that are
allowed in valid terms.
Method Summary
TokenStream tokenise(java.io.Reader reader)
- Tokenises the text obtained from the specified reader.
Methods inherited from class java.lang.Object
- clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
maxNumOfDigitsPerTerm
protected static final int maxNumOfDigitsPerTerm
- The maximum number of digits that are allowed in valid terms.
- See Also:
- Constant Field Values
maxNumOfSameConseqLettersPerTerm
protected static final int maxNumOfSameConseqLettersPerTerm
- The maximum number of consecutive same letters or digits that are
allowed in valid terms.
- See Also:
- Constant Field Values
DROP_LONG_TOKENS
protected static final boolean DROP_LONG_TOKENS
- Whether tokens longer than MAX_TERM_LENGTH should be dropped.
- See Also:
- Constant Field Values
UTFTokeniser
public UTFTokeniser()
tokenise
public TokenStream tokenise(java.io.Reader reader)
- Description copied from class: Tokeniser
- Tokenises the text obtained from the specified reader.
- Specified by:
- tokenise in class Tokeniser
- Parameters:
- reader - Stream of text to be tokenised
- Returns:
- a TokenStream of the tokens found in the text.
Terrier 3.5. Copyright © 2004-2011 University of Glasgow