[Previous: Pluggable Compression] [Contents] [Next: DFR Description]

Non-English language support in Terrier

Indexing

Terrier internally represents all terms as UTF. All provided Document classes use Tokeniser classes to tokenise text into terms during indexing. Likewise, during retrieval, TRECQuery uses the same tokeniser to parse queries (note that, unlike TRECQuery, SingleLineTRECQuery does not perform any tokenisation by default). To change the tokeniser used for indexing and retrieval, set the tokeniser property to the name of the tokeniser you wish to use (NB: the property name uses the British English spelling). When indexing a batch collection using TRECCollection, use the trec.encoding property to tell the Document implementations which character set to expect.
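As a rough illustration of the two settings named above, the sketch below parses a key=value fragment with the standard java.util.Properties class. The class name PropertiesSketch and the chosen values (UTFTokeniser, UTF-8) are assumptions made for the example; only the property names tokeniser and trec.encoding come from the text, and the code does not call any Terrier API.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of Terrier: parses key=value lines.
final class PropertiesSketch {
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            // Reading from an in-memory string should not fail in practice.
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Example values only: a Unicode-aware tokeniser and a UTF-8 corpus.
        String fragment =
              "tokeniser=UTFTokeniser\n"
            + "trec.encoding=UTF-8\n";
        Properties props = parse(fragment);
        System.out.println("tokeniser = " + props.getProperty("tokeniser"));
        System.out.println("trec.encoding = " + props.getProperty("trec.encoding"));
    }
}
```

The same two lines would be used for both indexing and retrieval, so that queries are tokenised and decoded in the same way as the documents.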

File Encodings

While Terrier uses UTF internally to represent terms, the Collection and Document classes must open files using the correct character encoding. For instance, while valid XML files will specify the encoding at the top of the file, a corpus of Hungarian in TREC format may be encoded in ISO 8859-16 or UTF-8. You should specify the encoding that TRECCollection should use to open the files, using the trec.encoding property. Note that Terrier will fall back to the Java default encoding if trec.encoding is not set. On a Unix-like operating system, Java's choice of default encoding may be influenced by the LANG environment variable: for example, LANG=en_US.UTF-8 will cause Java to open files using UTF-8 by default, while LANG=en_US will cause it to use ISO-8859-1.
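To see why the encoding matters, the following minimal sketch decodes the same bytes with two different character sets using only the Java standard library. The class name EncodingSketch and the sample word are assumptions for illustration; the code does not use Terrier and does not show how TRECCollection opens files.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Terrier.
final class EncodingSketch {
    // Decode raw bytes with an explicitly named character set.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // A Hungarian word containing U+0171 (u with double acute).
        String word = "t\u0171z";
        byte[] utf8Bytes = word.getBytes(StandardCharsets.UTF_8);

        // The platform default is what is used when no encoding is given.
        System.out.println("Java default encoding: " + Charset.defaultCharset());

        // Decoding with the matching encoding recovers the word.
        System.out.println(decode(utf8Bytes, "UTF-8").equals(word));

        // Decoding with the wrong encoding turns the two-byte character
        // into two unrelated characters, so the term no longer matches.
        System.out.println(decode(utf8Bytes, "ISO-8859-1").equals(word));
    }
}
```

A term that is decoded incorrectly at indexing time will not match the same term in a correctly decoded query, which is why the encoding should be stated explicitly rather than left to the platform default.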

Tokenisers

Tokenisers identify the terms in a stream of text. The text passed to a tokeniser is expected to contain no markup (for indexing, the removal of markup is handled by Document implementations, e.g. HTML tags are parsed by TaggedDocument). Which tokeniser to use depends on the language being dealt with. Terrier 3.5 ships with three different tokenisers for use when indexing text or parsing queries. The tokeniser is selected with the tokeniser property, e.g. tokeniser=EnglishTokeniser.

EnglishTokeniser - assumes that all valid characters in terms are A-Z, a-z and 0-9. This assumption does not hold when indexing documents in languages other than English.
UTFTokeniser - uses Java's Character class to determine which characters are valid in indexing terms. In particular, a term can only contain characters for which Character.isLetterOrDigit() returns true, or for which Character.getType() returns Character.NON_SPACING_MARK or Character.COMBINING_SPACING_MARK.
IdentityTokeniser - a simple tokeniser that returns the input text as is, and is used internally by SingleLineTRECQuery.
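The difference between the two character rules described above can be sketched in plain Java. This is an illustration of the rules as stated in the text, not Terrier's actual tokeniser code: the class TermCharSketch and its methods are hypothetical, it works on individual char values (so characters outside the Basic Multilingual Plane are not handled), and it omits anything else the real tokenisers may do.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Terrier's implementation.
final class TermCharSketch {
    // ASCII-only rule described for EnglishTokeniser: A-Z, a-z and 0-9.
    static boolean isEnglishTermChar(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
    }

    // Unicode-aware rule described for UTFTokeniser.
    static boolean isUtfTermChar(char c) {
        int type = Character.getType(c);
        return Character.isLetterOrDigit(c)
            || type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK;
    }

    // Split text into maximal runs of characters accepted by the chosen rule.
    static List<String> split(String text, boolean unicodeAware) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean keep = unicodeAware ? isUtfTermChar(c) : isEnglishTermChar(c);
            if (keep) {
                current.append(c);
            } else if (current.length() > 0) {
                terms.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            terms.add(current.toString());
        }
        return terms;
    }

    public static void main(String[] args) {
        String text = "nagy t\u0171z"; // contains U+0171, a non-ASCII letter
        System.out.println(split(text, true));  // Unicode-aware rule keeps the word whole
        System.out.println(split(text, false)); // ASCII-only rule breaks it at U+0171
    }
}
```

Under the ASCII-only rule the accented letter acts as a separator, so a single Hungarian word is broken into fragments; this is the practical reason to choose a Unicode-aware tokeniser for non-English collections.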

Stemmers

Terrier includes all stemmers from the Snowball stemmer project.

Batch Retrieval

When experimenting with topics in languages other than English, use the same tokeniser setting that was used during indexing. You should also set the trec.encoding property so that the topic files are read using the correct encoding.
