Class Tokeniser

  • All Implemented Interfaces:
    java.io.Serializable
    Direct Known Subclasses:
    EnglishTokeniser, IdentityTokeniser, UTFTokeniser, UTFTwitterTokeniser

    public abstract class Tokeniser
    extends java.lang.Object
    implements java.io.Serializable
    A tokeniser splits a block of text into tokens. The text is expected to contain no markup. Input is usually a Reader, and output takes the form of a TokenStream. Tokenisers are typically used by Document implementations.

    Available tokenisers:

    Terrier ships with two default tokenisers: EnglishTokeniser (the default, which accepts only A-Z, a-z and 0-9 as valid characters; every other character causes a token boundary) and UTFTokeniser. The tokeniser to use can be specified with the tokeniser property.
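The boundary rule described for EnglishTokeniser can be illustrated with a minimal standalone sketch. It uses only the Java standard library and is not Terrier's implementation: the class name AsciiBoundarySketch and its methods are hypothetical, and any further processing the real class may apply is not modelled.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: this is NOT Terrier's EnglishTokeniser.
class AsciiBoundarySketch {

    // True only for the characters listed as valid: A-Z, a-z and 0-9.
    static boolean isTokenChar(int c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
    }

    // Reads the whole reader and ends the current token at every other character.
    static List<String> tokens(Reader reader) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try {
            int c;
            while ((c = reader.read()) != -1) {
                if (isTokenChar(c)) {
                    current.append((char) c);
                } else if (current.length() > 0) {
                    out.add(current.toString());
                    current.setLength(0);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (current.length() > 0) {
            out.add(current.toString()); // flush the final token
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens(new StringReader("This is a block of text.")));
        System.out.println(tokens(new StringReader("caf\u00e9 e-mail")));
    }
}
```

Under this rule an accented letter or a hyphen splits a word just as whitespace does, which is one reason to choose a different tokeniser for text that is not plain ASCII.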

    Properties:

    • tokeniser - name of the tokeniser class to use.

    Example:

     // Obtain the tokeniser configured by the tokeniser property.
     Tokeniser tokeniser = Tokeniser.getTokeniser();
     TokenStream toks = tokeniser.tokenise(new StringReader("This is a block of text."));
     // Print each token in turn.
     while (toks.hasNext())
     {
       System.out.println(toks.next());
     }
     
    Since:
    3.5
    Author:
    Craig Macdonald & Rodrygo Santos
    See Also:
    TokenStream, EnglishTokeniser, UTFTokeniser, Serialized Form
    • Constructor Summary

      Constructors 
      Constructor Description
      Tokeniser()  
    • Method Summary

      Modifier and Type Method Description
      static Tokeniser getTokeniser()
      Instantiates Tokeniser class named in the tokeniser property.
      java.lang.String[] getTokens(java.io.Reader reader)
      Utility method which returns all of the tokens for a given stream.
      java.lang.String[] getTokens(java.lang.String s)
      Utility method which returns all of the tokens in a String.
      abstract TokenStream tokenise(java.io.Reader reader)
      Tokenises the text obtained from the specified reader.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • EMPTY_STREAM

        public static final TokenStream EMPTY_STREAM
        An empty TokenStream, containing no tokens.
    • Constructor Detail

      • Tokeniser

        public Tokeniser()
    • Method Detail

      • getTokeniser

        public static Tokeniser getTokeniser()
        Instantiates Tokeniser class named in the tokeniser property.
        Returns:
        An instance of the tokeniser class named in the tokeniser property.
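Instantiating a class named in a property is commonly done through reflection. The sketch below shows that general pattern with java.util.Properties and Class.forName. It is an assumption about the technique, not a copy of what getTokeniser does internally, and SketchTokeniser, WhitespaceSketchTokeniser and TokeniserFactorySketch are hypothetical names.

```java
import java.util.Properties;

// Hypothetical stand-ins; none of these names come from Terrier.
abstract class SketchTokeniser {
    abstract String[] split(String text);
}

class WhitespaceSketchTokeniser extends SketchTokeniser {
    @Override
    String[] split(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }
}

class TokeniserFactorySketch {

    // Looks up the class name stored under the "tokeniser" key and
    // instantiates it through its no-argument constructor.
    static SketchTokeniser fromProperties(Properties props, String defaultClassName) {
        String className = props.getProperty("tokeniser", defaultClassName);
        try {
            return Class.forName(className)
                    .asSubclass(SketchTokeniser.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot instantiate tokeniser: " + className, e);
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Use the runtime class name so the lookup works in any package.
        props.setProperty("tokeniser", WhitespaceSketchTokeniser.class.getName());
        SketchTokeniser t = fromProperties(props, WhitespaceSketchTokeniser.class.getName());
        System.out.println(t.getClass().getSimpleName());
    }
}
```

A factory of this shape lets the tokeniser be swapped through configuration alone, provided the named class has an accessible no-argument constructor.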
      • tokenise

        public abstract TokenStream tokenise(java.io.Reader reader)
        Tokenises the text obtained from the specified reader.
        Parameters:
        reader - Stream of text to be tokenised
        Returns:
        a TokenStream of the tokens found in the text.
      • getTokens

        public java.lang.String[] getTokens(java.io.Reader reader)
                                     throws java.io.IOException
        Utility method which returns all of the tokens for a given stream.
        Parameters:
        reader - Stream of text to be tokenised
        Returns:
        All of the tokens found in the stream of text.
        Throws:
        java.io.IOException
      • getTokens

        public java.lang.String[] getTokens(java.lang.String s)
        Utility method which returns all of the tokens in a String.
        Parameters:
        s - String of text to be tokenised
        Returns:
        All of the tokens found in the String.
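The relationship between the abstract tokenise method and the two getTokens utilities can be sketched as one abstract step plus two concrete helpers that drain the resulting stream into an array. This is a standalone approximation under stated assumptions: a plain java.util.Iterator stands in for TokenStream, the class names are hypothetical, and whitespace splitting is used only to keep the example short.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Scanner;

// Hypothetical stand-in for the abstract class; not Terrier's implementation.
abstract class DrainingTokeniserSketch {

    // Subclasses decide how text becomes tokens (the abstract step).
    abstract Iterator<String> tokenise(Reader reader);

    // Collects every token produced for the reader into an array.
    String[] getTokens(Reader reader) {
        List<String> collected = new ArrayList<>();
        Iterator<String> stream = tokenise(reader);
        while (stream.hasNext()) {
            collected.add(stream.next());
        }
        return collected.toArray(new String[0]);
    }

    // String overload: wraps the text in a StringReader and delegates.
    String[] getTokens(String s) {
        return getTokens(new StringReader(s));
    }
}

// Whitespace splitting is an arbitrary choice; any rule could go here.
class WhitespaceDrainSketch extends DrainingTokeniserSketch {
    @Override
    Iterator<String> tokenise(Reader reader) {
        return new Scanner(reader); // Scanner implements Iterator<String>
    }
}

class DrainDemo {
    public static void main(String[] args) {
        String[] tokens = new WhitespaceDrainSketch().getTokens("This is a block of text.");
        System.out.println(Arrays.toString(tokens));
    }
}
```

With this split, a subclass supplies only the tokenising rule, and callers that want an array rather than a stream get the same helpers for free.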