java.lang.Object
  org.terrier.indexing.tokenisation.Tokeniser
public abstract class Tokeniser
A tokeniser class is responsible for tokenising a block of text.
It is expected that no markup is present in this text. Input
is usually a Reader, while output is in the form of a TokenStream.
Tokenisers are typically used by Document
implementations.
Available tokenisers
Terrier ships with two tokenisers: EnglishTokeniser (the default), which accepts only A-Z, a-z and 0-9 as valid characters and treats every other character as a token boundary, and UTFTokeniser. The tokeniser used by default can be specified using the tokeniser property.
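The boundary rule described for EnglishTokeniser can be illustrated with a small stand-alone sketch. This is not Terrier's implementation: the class name AsciiBoundarySketch and its methods are hypothetical, and the sketch performs no case folding or any other processing the real class may apply. It only shows what "everything else causes a token boundary" means in practice.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a stand-alone sketch of the documented boundary rule,
// not the EnglishTokeniser class itself.
class AsciiBoundarySketch {

    // True for the characters the description lists as valid: A-Z, a-z, 0-9.
    static boolean isTokenChar(int c) {
        return (c >= 'A' && c <= 'Z')
            || (c >= 'a' && c <= 'z')
            || (c >= '0' && c <= '9');
    }

    // Reads the whole reader and ends the current token at every other character.
    static List<String> split(Reader reader) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try {
            int c;
            while ((c = reader.read()) != -1) {
                if (isTokenChar(c)) {
                    current.append((char) c);
                } else if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Flush the last token if the input did not end on a boundary.
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split(new StringReader("This is a block of text.")));
        // The accented character, the hyphen and the underscore all act as boundaries.
        System.out.println(split(new StringReader("caf\u00e9-2024 x_y")));
    }
}
```

Under this rule, text containing accented or non-Latin characters is broken apart at each such character, which is the usual reason to select UTFTokeniser instead.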
Properties:
tokeniser - name of the Tokeniser class that getTokeniser() instantiates.
Example:
    Tokeniser tokeniser = Tokeniser.getTokeniser();
    TokenStream toks = tokeniser.tokenise(new StringReader("This is a block of text."));
    while (toks.hasNext()) {
        System.out.println(toks.next());
    }
See Also: TokenStream, EnglishTokeniser, UTFTokeniser
Field Summary | |
---|---|
static TokenStream | EMPTY_STREAM: empty stream |
Constructor Summary | |
---|---|
Tokeniser() | |
Method Summary | |
---|---|
static Tokeniser | getTokeniser(): Instantiates Tokeniser class named in the tokeniser property. |
java.lang.String[] | getTokens(java.io.Reader reader): Utility method which returns all of the tokens for a given stream. |
abstract TokenStream | tokenise(java.io.Reader reader): Tokenises the text obtained from the specified reader. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final TokenStream EMPTY_STREAM
Constructor Detail |
---|
public Tokeniser()
Method Detail |
---|
public static Tokeniser getTokeniser()
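Instantiating a class named in a property is commonly done with reflection. The following is a minimal sketch of that pattern, not Terrier's code: Splitter, WhitespaceSplitter, CommaSplitter and fromProperty are hypothetical stand-ins, and a plain java.util.Properties object is assumed in place of however Terrier actually reads its configuration. Only the property name tokeniser comes from this page.

```java
import java.util.Properties;

// Illustrative only: a property-driven factory, assuming a java.util.Properties source.
class PropertyFactorySketch {

    // Hypothetical stand-in for an abstract base class such as Tokeniser.
    static abstract class Splitter {
        abstract String[] split(String text);
    }

    // Hypothetical implementation, used as the default when the property is unset.
    static class WhitespaceSplitter extends Splitter {
        @Override
        String[] split(String text) {
            return text.trim().split("\\s+");
        }
    }

    // Hypothetical alternative implementation, selectable by class name.
    static class CommaSplitter extends Splitter {
        @Override
        String[] split(String text) {
            return text.split(",");
        }
    }

    // Looks up the class name in the "tokeniser" property and instantiates it
    // through its no-argument constructor.
    static Splitter fromProperty(Properties props) {
        String className = props.getProperty("tokeniser", WhitespaceSplitter.class.getName());
        try {
            return Class.forName(className)
                    .asSubclass(Splitter.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println(fromProperty(props).getClass().getSimpleName());

        props.setProperty("tokeniser", CommaSplitter.class.getName());
        System.out.println(String.join("|", fromProperty(props).split("a,b,c")));
    }
}
```

A factory of this kind requires the named class to have an accessible no-argument constructor, which is consistent with the public Tokeniser() constructor listed above.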
public abstract TokenStream tokenise(java.io.Reader reader)
Parameters: reader - Stream of text to be tokenised
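To show what a stream-shaped result over a Reader can look like, here is a hedged sketch that produces tokens lazily, one per call to next(). It uses java.util.Iterator as a stand-in for TokenStream, since the class example on this page only relies on hasNext() and next(). The class LazyTokenStreamSketch is hypothetical and splits on whitespace for brevity, which is not the rule either shipped tokeniser is documented to use.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Illustrative only: a lazy, whitespace-delimited token stream over a Reader.
class LazyTokenStreamSketch implements Iterator<String> {

    private final Reader reader;
    private String nextToken; // token read ahead, or null once the input is exhausted

    LazyTokenStreamSketch(Reader reader) {
        this.reader = reader;
        this.nextToken = readToken();
    }

    // Skips leading whitespace, then collects characters up to the next
    // whitespace character or the end of the input.
    private String readToken() {
        try {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                if (!Character.isWhitespace(c)) {
                    sb.append((char) c);
                } else if (sb.length() > 0) {
                    break;
                }
            }
            return sb.length() > 0 ? sb.toString() : null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean hasNext() {
        return nextToken != null;
    }

    @Override
    public String next() {
        if (nextToken == null) {
            throw new NoSuchElementException();
        }
        String token = nextToken;
        nextToken = readToken();
        return token;
    }

    public static void main(String[] args) {
        Iterator<String> toks = new LazyTokenStreamSketch(new StringReader("This is a block of text."));
        while (toks.hasNext()) {
            System.out.println(toks.next());
        }
    }
}
```

Reading one token ahead keeps hasNext() free of I/O and lets the stream handle arbitrarily long input without buffering it all.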
public java.lang.String[] getTokens(java.io.Reader reader) throws java.io.IOException
Parameters: reader - Stream of text to be tokenised
Throws: java.io.IOException
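A utility of this shape amounts to draining a stream into an array. The sketch below shows that idea under stated assumptions: it uses java.util.Iterator in place of TokenStream, the names DrainTokensSketch and drain are hypothetical, and it makes no claim about how the real getTokens treats special cases such as empty tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Illustrative only: collecting every element of a token iterator into an array.
class DrainTokensSketch {

    // Returns all elements the iterator yields, in the order they were produced.
    static String[] drain(Iterator<String> stream) {
        List<String> tokens = new ArrayList<>();
        while (stream.hasNext()) {
            tokens.add(stream.next());
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        Iterator<String> toks = Arrays.asList("This", "is", "a", "block").iterator();
        System.out.println(Arrays.toString(drain(toks)));
    }
}
```

Collecting into an array is convenient for short inputs; for large documents, iterating over the stream directly avoids holding every token in memory at once.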