Class UTFTwitterTokeniser
- java.lang.Object
  - org.terrier.indexing.tokenisation.Tokeniser
    - org.terrier.indexing.tokenisation.UTFTwitterTokeniser
- All Implemented Interfaces: java.io.Serializable
public class UTFTwitterTokeniser extends Tokeniser
A tokeniser designed for use on tweets. It maintains UTF-8 encoding and keeps mentions.
- Since: 4.0
- Author: Richard McCreadie
- See Also: Serialized Form
Field Summary
Fields (Modifier and Type, Field, Description):
- protected static boolean DROP_LONG_TOKENS
  Whether tokens longer than MAX_TERM_LENGTH should be dropped.
- protected static int maxNumOfDigitsPerTerm
  The maximum number of digits that are allowed in valid terms.
- protected static int maxNumOfSameConseqLettersPerTerm
  The maximum number of consecutive same letters or digits that are allowed in valid terms.
Fields inherited from class org.terrier.indexing.tokenisation.Tokeniser
EMPTY_STREAM
Constructor Summary
Constructors (Constructor, Description):
- UTFTwitterTokeniser()
Method Summary
Methods (Modifier and Type, Method, Description):
- TokenStream tokenise(java.io.Reader reader)
  Tokenises the text obtained from the specified reader.
Methods inherited from class org.terrier.indexing.tokenisation.Tokeniser
getTokeniser, getTokens, getTokens
Field Detail
maxNumOfDigitsPerTerm
protected static final int maxNumOfDigitsPerTerm
The maximum number of digits that are allowed in valid terms.
- See Also: Constant Field Values
maxNumOfSameConseqLettersPerTerm
protected static final int maxNumOfSameConseqLettersPerTerm
The maximum number of consecutive same letters or digits that are allowed in valid terms.
- See Also: Constant Field Values
DROP_LONG_TOKENS
protected static final boolean DROP_LONG_TOKENS
Whether tokens longer than MAX_TERM_LENGTH should be dropped.
- See Also: Constant Field Values
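The three limits above suggest how a candidate term might be screened before it is emitted. The following is a minimal sketch of that idea, not Terrier's implementation. The class `TermFilterSketch`, the method `check`, the upper-case constant names, and all four values used here (4, 3, 20, `true`) are assumptions chosen for illustration; the actual values are those listed under Constant Field Values. Truncating an over-long term when dropping is disabled is likewise an assumption.

```java
final class TermFilterSketch {

    // Assumed values for illustration only; this page does not state the real ones.
    static final int MAX_DIGITS_PER_TERM = 4;         // stands in for maxNumOfDigitsPerTerm
    static final int MAX_SAME_CONSECUTIVE_CHARS = 3;  // stands in for maxNumOfSameConseqLettersPerTerm
    static final int MAX_TERM_LENGTH = 20;            // stands in for MAX_TERM_LENGTH
    static final boolean DROP_LONG_TOKENS = true;     // stands in for DROP_LONG_TOKENS

    /** Returns the term to keep, or null if the term is considered invalid. */
    static String check(String term) {
        if (term == null || term.isEmpty()) {
            return null;
        }
        if (term.length() > MAX_TERM_LENGTH) {
            if (DROP_LONG_TOKENS) {
                return null;
            }
            // Assumption: when long tokens are not dropped, they are truncated.
            term = term.substring(0, MAX_TERM_LENGTH);
        }
        int digits = 0;     // total digits seen so far
        int run = 0;        // length of the current run of identical characters
        char previous = 0;
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (Character.isDigit(c)) {
                digits++;
            }
            run = (i > 0 && c == previous) ? run + 1 : 1;
            previous = c;
            if (digits > MAX_DIGITS_PER_TERM || run > MAX_SAME_CONSECUTIVE_CHARS) {
                return null;
            }
        }
        return term;
    }

    public static void main(String[] args) {
        for (String t : new String[] {"hello", "2024", "12345", "sooo", "soooo"}) {
            System.out.println(t + " -> " + check(t));
        }
    }
}
```

Returning `null` for a rejected term keeps the check to a single call; a caller would simply skip `null` results.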
Method Detail
tokenise
public TokenStream tokenise(java.io.Reader reader)
Description copied from class: Tokeniser
Tokenises the text obtained from the specified reader.
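To make the idea of tokenising from a `java.io.Reader` while keeping mentions more concrete, here is a self-contained stand-in. It does not use or reproduce Terrier's code: the class `MentionTokeniserSketch`, its `tokenise` method returning a `List<String>` (rather than a `TokenStream`), the lower-casing, and the rule that `@` is kept only at the start of a token are all assumptions made for this sketch.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class MentionTokeniserSketch {

    /** Splits text on non-alphanumeric characters, keeping a leading '@'. */
    static List<String> tokenise(Reader reader) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try {
            int ch;
            while ((ch = reader.read()) != -1) {
                char c = (char) ch;
                // '@' is accepted only as the first character of a token.
                boolean mentionStart = c == '@' && current.length() == 0;
                if (Character.isLetterOrDigit(c) || mentionStart) {
                    current.append(Character.toLowerCase(c));
                } else {
                    flush(current, tokens);
                    if (c == '@') {
                        current.append(c); // an '@' that ends one token starts the next
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        flush(current, tokens);
        return tokens;
    }

    // Emits the pending token unless it is empty or a lone '@', then resets the buffer.
    private static void flush(StringBuilder current, List<String> tokens) {
        if (current.length() > 0 && !current.toString().equals("@")) {
            tokens.add(current.toString());
        }
        current.setLength(0);
    }

    public static void main(String[] args) {
        System.out.println(tokenise(new StringReader("Thanks @Alice, see caf\u00e9 at 9!")));
    }
}
```

Because the sketch reads UTF-16 code units one at a time, accented letters such as `é` are preserved, while characters outside the Basic Multilingual Plane (for example most emoji) act as separators.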