org.terrier.indexing.tokenisation
Class UTFTokeniser

java.lang.Object
  extended by org.terrier.indexing.tokenisation.Tokeniser
      extended by org.terrier.indexing.tokenisation.UTFTokeniser

public class UTFTokeniser
extends Tokeniser

Tokenises text obtained from a text stream. In contrast to EnglishTokeniser, it tokenises more liberally: a character is accepted as part of a token if it matches at least one of three rules:

  1. Character.isLetterOrDigit() returns true
  2. Character.getType() returns Character.NON_SPACING_MARK
  3. Character.getType() returns Character.COMBINING_SPACING_MARK
Any other character acts as a token boundary and starts a new token.
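The three rules can be sketched as a single predicate. This is only an illustration and not the Terrier source: the class name AcceptableCharSketch and the helpers isTokenChar and split are hypothetical, and working on Unicode code points rather than char values is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not the actual UTFTokeniser implementation.
class AcceptableCharSketch {

    // Hypothetical helper: true if the code point satisfies any of the three rules.
    static boolean isTokenChar(int codePoint) {
        int type = Character.getType(codePoint);
        return Character.isLetterOrDigit(codePoint)
            || type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK;
    }

    // Hypothetical helper: splits text wherever a non-acceptable character occurs.
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isTokenChar(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // A boundary character closes the token collected so far.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "e" followed by U+0301 (a non-spacing mark) stays inside one token.
        System.out.println(split("cafe\u0301, 2011!"));
    }
}
```

Accepting the two mark categories keeps combining accents and dependent vowel signs attached to their base letters, which matters for scripts such as Devanagari.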

Furthermore, terms are checked as follows to reduce index noise:

  1. Any term which is longer than max.term.length (usually 20 characters) is discarded.
  2. Any term which has more than 4 digits is discarded.
  3. Any term which has more than 3 consecutive identical characters is discarded.
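The three checks above can be sketched as one validity test. This is a minimal sketch that follows the documented rules literally. The class TermCheckSketch, its constants and the isValidTerm method are hypothetical names, the limit of 20 is the "usual" max.term.length quoted above, and rejecting empty terms is an added assumption. The real class may behave differently, for example depending on DROP_LONG_TOKENS.

```java
// Illustrative sketch only; not the actual UTFTokeniser implementation.
class TermCheckSketch {

    static final int MAX_TERM_LENGTH = 20;      // assumed value of max.term.length
    static final int MAX_DIGITS = 4;            // limit stated in rule 2
    static final int MAX_SAME_CONSECUTIVE = 3;  // limit stated in rule 3

    // Hypothetical helper: true if the term passes all three checks.
    static boolean isValidTerm(String term) {
        // Rule 1: discard over-long terms (empty terms rejected by assumption).
        if (term.isEmpty() || term.length() > MAX_TERM_LENGTH) {
            return false;
        }
        int digits = 0;     // digits seen so far
        int run = 0;        // length of the current run of identical characters
        char previous = 0;
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (Character.isDigit(c)) {
                digits++;
            }
            run = (i > 0 && c == previous) ? run + 1 : 1;
            // Rules 2 and 3: too many digits, or too long a run of one character.
            if (digits > MAX_DIGITS || run > MAX_SAME_CONSECUTIVE) {
                return false;
            }
            previous = c;
        }
        return true;
    }

    public static void main(String[] args) {
        for (String term : new String[] {"terrier", "2011", "12345", "zzzz"}) {
            System.out.println(term + " -> " + isValidTerm(term));
        }
    }
}
```

Under these assumptions a four-digit year such as "2011" is kept, while "12345" and "zzzz" are discarded.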
Properties:

Author:
Gianni Amati, Ben He, Vassilis Plachouras, Craig Macdonald
See Also:
EnglishTokeniser, Character

Field Summary
protected static boolean DROP_LONG_TOKENS
          Whether tokens longer than MAX_TERM_LENGTH should be dropped.
protected static int maxNumOfDigitsPerTerm
          The maximum number of digits that are allowed in valid terms.
protected static int maxNumOfSameConseqLettersPerTerm
          The maximum number of consecutive same letters or digits that are allowed in valid terms.
 
Fields inherited from class org.terrier.indexing.tokenisation.Tokeniser
EMPTY_STREAM
 
Constructor Summary
UTFTokeniser()
           
 
Method Summary
 TokenStream tokenise(java.io.Reader reader)
          Tokenises the text obtained from the specified reader.
 
Methods inherited from class org.terrier.indexing.tokenisation.Tokeniser
getTokeniser, getTokens
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

maxNumOfDigitsPerTerm

protected static final int maxNumOfDigitsPerTerm
The maximum number of digits that are allowed in valid terms.

See Also:
Constant Field Values

maxNumOfSameConseqLettersPerTerm

protected static final int maxNumOfSameConseqLettersPerTerm
The maximum number of consecutive same letters or digits that are allowed in valid terms.

See Also:
Constant Field Values

DROP_LONG_TOKENS

protected static final boolean DROP_LONG_TOKENS
Whether tokens longer than MAX_TERM_LENGTH should be dropped.

See Also:
Constant Field Values
Constructor Detail

UTFTokeniser

public UTFTokeniser()
Method Detail

tokenise

public TokenStream tokenise(java.io.Reader reader)
Description copied from class: Tokeniser
Tokenises the text obtained from the specified reader.

Specified by:
tokenise in class Tokeniser
Parameters:
reader - Stream of text to be tokenised
Returns:
a TokenStream of the tokens found in the text.


Terrier 3.5. Copyright © 2004-2011 University of Glasgow