
Non-English language support in Terrier

Indexing

Terrier internally represents all terms as Unicode (UTF) text. All provided Document classes use Tokeniser classes to tokenise text into terms during indexing. Likewise, during retrieval, TRECQuery uses the same tokeniser to parse queries (note that, unlike TRECQuery, SingleLineTRECQuery does not perform any tokenisation by default). To change the tokeniser used for indexing and retrieval, set the tokeniser property to the name of the tokeniser you wish to use (NB: British English spelling). When indexing a batch collection using TRECCollection, tell the Document implementations which character set to expect by setting the property trec.encoding.
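As a minimal sketch of the two settings described above, the fragment below could be placed in the Terrier properties file (assumed here to be etc/terrier.properties). The property names come from the text. The values are illustrative assumptions: UTFTokeniser is one of the tokenisers listed in the next section, and UTF-8 stands in for whatever encoding your collection actually uses.

```properties
# Tokeniser used for both indexing and retrieval (note the British spelling).
tokeniser=UTFTokeniser

# Character set of the collection files read by TRECCollection.
# UTF-8 is an assumed example value; use the encoding of your own data.
trec.encoding=UTF-8
```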

Tokenisers

Tokenisers identify the terms in a stream of text. The text passed to a tokeniser is expected to contain no markup: for indexing, the removal of markup is handled by the Document implementations (e.g. HTML tags are parsed by TaggedDocument). The choice of tokeniser depends on the language being dealt with. Terrier 3.5 ships with three different tokenisers for use when indexing text or parsing queries.

EnglishTokeniser - assumes that all valid characters in terms are A-Z, a-z and 0-9. This assumption does not hold when indexing documents in languages other than English.
UTFTokeniser - uses Java's Character class to determine which characters are valid in indexing terms. In particular, a term can only contain characters for which Character.isLetterOrDigit() returns true, or for which Character.getType() returns Character.NON_SPACING_MARK or Character.COMBINING_SPACING_MARK.
IdentityTokeniser - a simple tokeniser that returns the input text as is, and is used internally by SingleLineTRECQuery.
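The difference between the first two character rules can be illustrated with a short, self-contained sketch. This is not Terrier's implementation: the class TokeniserSketch and its methods are hypothetical names, and the real tokenisers may apply further processing that is ignored here. The sketch only applies the character tests described in the list above and splits the text wherever a character fails the test.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class, not part of Terrier.
public class TokeniserSketch {

    // ASCII-only rule described for EnglishTokeniser: A-Z, a-z and 0-9.
    static boolean isEnglishTermChar(int cp) {
        return (cp >= 'A' && cp <= 'Z')
            || (cp >= 'a' && cp <= 'z')
            || (cp >= '0' && cp <= '9');
    }

    // Unicode rule described for UTFTokeniser: letters, digits and combining marks.
    static boolean isUtfTermChar(int cp) {
        int type = Character.getType(cp);
        return Character.isLetterOrDigit(cp)
            || type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK;
    }

    // Splits text into terms, breaking at every character that fails the chosen rule.
    static List<String> tokenise(String text, boolean unicodeRule) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            boolean valid = unicodeRule ? isUtfTermChar(cp) : isEnglishTermChar(cp);
            if (valid) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                terms.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            terms.add(current.toString());
        }
        return terms;
    }

    public static void main(String[] args) {
        // "café naïve 2024", written with escapes so the source file stays ASCII.
        String text = "caf\u00e9 na\u00efve 2024";
        System.out.println(tokenise(text, false)); // [caf, na, ve, 2024]
        System.out.println(tokenise(text, true));  // [café, naïve, 2024]
    }
}
```

Under the ASCII-only rule the accented letters act as term separators, so the words are broken into fragments. This is why a Unicode-aware tokeniser is the safer choice for non-English text.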

Stemmers

Terrier includes all stemmers from the Snowball stemmer project.

Batch Retrieval

When experimenting with topics in languages other than English, use the same tokeniser setting that was used during indexing. You should also set the property trec.encoding so that the topic files are read with the correct encoding.
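To illustrate why the encoding setting matters, the following sketch uses only standard Java (no Terrier classes; EncodingMismatchSketch is a hypothetical name). It decodes the same bytes with the correct charset and with a wrong one. A query term decoded with the wrong charset no longer equals the term stored in the index, so it cannot match.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class, not part of Terrier.
public class EncodingMismatchSketch {

    // Decodes raw bytes using the given charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // The bytes of "café" as they would appear in a UTF-8 encoded topic file.
        byte[] raw = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        String right = decode(raw, StandardCharsets.UTF_8);      // café
        String wrong = decode(raw, StandardCharsets.ISO_8859_1); // cafÃ©

        System.out.println(right.equals("caf\u00e9")); // true
        System.out.println(wrong.equals("caf\u00e9")); // false
        System.out.println(wrong.length());            // 5: the two-byte é became two characters
    }
}
```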
