java.lang.Object
  org.terrier.indexing.tokenisation.Tokeniser
public abstract class Tokeniser
A tokeniser class is responsible for tokenising a block of text.
It is expected that no markup is present in this text. Input
is usually a Reader, while output is in the form of a TokenStream.
Tokenisers are typically used by Document
implementations.
Available tokenisers:
Terrier ships with two tokenisers: EnglishTokeniser (the default, which accepts only A-Z, a-z and 0-9 as valid characters; everything else causes a token boundary) and UTFTokeniser. The tokeniser used by default can be specified using the tokeniser property.
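The token boundary rule described for EnglishTokeniser can be illustrated with a small standalone sketch. This is not Terrier's implementation: the class name `AlnumBoundarySketch` and its methods are hypothetical, and the real class may apply further processing (for example case folding) that is omitted here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the "A-Z, a-z, 0-9 only" rule; not Terrier code.
public class AlnumBoundarySketch {

    // True only for the characters the description lists as valid.
    static boolean isValid(char c) {
        return (c >= 'A' && c <= 'Z')
            || (c >= 'a' && c <= 'z')
            || (c >= '0' && c <= '9');
    }

    // Splits text into maximal runs of valid characters;
    // any other character ends the current token.
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isValid(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        // Flush the last token if the text ended on a valid character.
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [This, is, a, block, of, text]
        System.out.println(split("This is a block of text."));
    }
}
```

Under this rule, accented or other non-ASCII letters also act as boundaries, which is presumably why a separate UTFTokeniser is provided.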
Properties:
tokeniser - name of the Tokeniser class that getTokeniser() instantiates
Example:
    // Obtain the tokeniser configured by the tokeniser property
    Tokeniser tokeniser = Tokeniser.getTokeniser();
    TokenStream toks = tokeniser.tokenise(new StringReader("This is a block of text."));
    // Print each token on its own line
    while (toks.hasNext())
    {
        System.out.println(toks.next());
    }
See Also:
TokenStream, EnglishTokeniser, UTFTokeniser

| Field Summary | |
|---|---|
| static TokenStream | EMPTY_STREAM: empty stream |
| Constructor Summary | |
|---|---|
| Tokeniser() | |
| Method Summary | |
|---|---|
| static Tokeniser | getTokeniser(): Instantiates Tokeniser class named in the tokeniser property. |
| java.lang.String[] | getTokens(java.io.Reader reader): Utility method which returns all of the tokens for a given stream. |
| abstract TokenStream | tokenise(java.io.Reader reader): Tokenises the text obtained from the specified reader. |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
public static final TokenStream EMPTY_STREAM
| Constructor Detail |
|---|
public Tokeniser()
| Method Detail |
|---|
public static Tokeniser getTokeniser()
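A factory that instantiates a class named in a property is commonly written with reflection. The following is a minimal sketch of that pattern under stated assumptions, not Terrier's actual code: `TokeniserFactorySketch`, `SketchTokeniser`, `WhitespaceTokeniser` and `create` are hypothetical names, and a plain `java.util.Properties` object stands in for Terrier's configuration.

```java
import java.util.Properties;

// Hypothetical sketch of a property-driven factory; not Terrier code.
public class TokeniserFactorySketch {

    /** Hypothetical stand-in for an abstract tokeniser base class. */
    public abstract static class SketchTokeniser {
        public abstract String[] split(String text);
    }

    /** Hypothetical implementation that splits on runs of whitespace. */
    public static class WhitespaceTokeniser extends SketchTokeniser {
        @Override
        public String[] split(String text) {
            return text.trim().split("\\s+");
        }
    }

    // Reads the class name from the "tokeniser" property (falling back to a
    // default) and instantiates it through its no-argument constructor.
    static SketchTokeniser create(Properties props) {
        String name = props.getProperty("tokeniser", WhitespaceTokeniser.class.getName());
        try {
            Class<? extends SketchTokeniser> cls =
                Class.forName(name).asSubclass(SketchTokeniser.class);
            return cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate tokeniser " + name, e);
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("tokeniser", WhitespaceTokeniser.class.getName());
        SketchTokeniser tokeniser = create(props);
        System.out.println(tokeniser.getClass().getSimpleName());
    }
}
```

The `asSubclass` call makes a misconfigured property fail early with a `ClassCastException` instead of returning an object of the wrong type.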
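public abstract TokenStream tokenise(java.io.Reader reader)
Tokenises the text obtained from the specified reader.
Parameters:
reader - Stream of text to be tokenised
public java.lang.String[] getTokens(java.io.Reader reader) throws java.io.IOException
Utility method which returns all of the tokens for a given stream.
Parameters:
reader - Stream of text to be tokenised
Throws:
java.io.IOException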
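A utility such as getTokens amounts to draining a token stream into an array. The sketch below shows one way this could be done, using a plain `Iterator<String>` as a stand-in for TokenStream (the class-level example iterates it with hasNext() and next()). The names `CollectTokensSketch` and `collect` are hypothetical, and skipping null tokens is an assumption rather than documented behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of collecting all tokens from a stream; not Terrier code.
public class CollectTokensSketch {

    // Consumes the iterator and returns its tokens in order.
    // Assumption: null entries are treated as "no token" and skipped.
    static String[] collect(Iterator<String> stream) {
        List<String> tokens = new ArrayList<>();
        while (stream.hasNext()) {
            String token = stream.next();
            if (token != null) {
                tokens.add(token);
            }
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        Iterator<String> stream = Arrays.asList("This", "is", null, "text").iterator();
        // prints [This, is, text]
        System.out.println(Arrays.toString(collect(stream)));
    }
}
```

Collecting into an array is convenient for short inputs; for large documents, iterating the stream directly avoids holding every token in memory.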