
Non-English language support in Terrier

Index format

When indexing documents in languages other than English, you should use the UTF index format. By default, Terrier assumes that indexed documents contain only terms without accents. Setting the property string.use_utf to true makes Terrier use the UTFLexicon, which overcomes this limitation by storing all terms in UTF-8.
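As a rough illustration of why the encoding of stored terms matters, the sketch below shows that an accented term occupies more bytes in UTF-8 than it has characters, and that reading those bytes back with a single-byte charset garbles the term. This is plain Java rather than Terrier code; the class and method names are hypothetical and say nothing about how UTFLexicon is actually implemented.

```java
import java.nio.charset.StandardCharsets;

public class Utf8TermDemo {

    /** Number of bytes the term occupies when encoded as UTF-8. */
    static int utf8Length(String term) {
        return term.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Encodes the term as UTF-8, then (wrongly) decodes the bytes as ISO-8859-1. */
    static String decodeAsLatin1(String term) {
        byte[] utf8 = term.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "caf\u00E9" is "cafe" with an acute accent on the final letter,
        // written with a Unicode escape so the source file stays ASCII.
        String term = "caf\u00E9";
        System.out.println("characters:  " + term.length());    // 4
        System.out.println("UTF-8 bytes: " + utf8Length(term));  // 5
        // The accented letter becomes two unrelated characters when the
        // UTF-8 bytes are read back with a single-byte charset.
        System.out.println("round trip intact: " + term.equals(decodeAsLatin1(term))); // false
    }
}
```

Unaccented terms such as "cafe" are unaffected, because every ASCII character is a single byte in UTF-8.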

Collection & Document support

TRECCollection assumes that the only valid characters in terms are A-Z, a-z and 0-9. This assumption does not hold when indexing documents in languages other than English, so you should use a Collection implementation that supports other languages. In most cases, this should be TRECUTFCollection, which you select by setting the property trec.collection.class=TRECUTFCollection. (TRECUTFCollection uses Character.isLetterOrDigit() to determine term boundaries.)
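The difference between the two term-boundary rules can be sketched in a few lines of standalone Java. The tokenizer below is a hypothetical simplification written for this page, not the code of TRECCollection or TRECUTFCollection; it only contrasts an A-Z, a-z, 0-9 test with Character.isLetterOrDigit().

```java
import java.util.ArrayList;
import java.util.List;

public class TermBoundaryDemo {

    /** ASCII-only rule: only A-Z, a-z and 0-9 count as term characters. */
    static boolean isAsciiTermChar(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
    }

    /**
     * Splits text into terms. Any character that is not a term character
     * ends the current term and is itself discarded.
     */
    static List<String> tokenize(String text, boolean unicodeAware) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean termChar = unicodeAware ? Character.isLetterOrDigit(c) : isAsciiTermChar(c);
            if (termChar) {
                current.append(c);
            } else if (current.length() > 0) {
                terms.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            terms.add(current.toString());
        }
        return terms;
    }

    public static void main(String[] args) {
        // U-umlaut and e-acute are written as Unicode escapes so the source stays ASCII.
        String text = "\u00DCber caf\u00E9 2008";
        // The ASCII-only rule drops the accented letters: [ber, caf, 2008]
        System.out.println(tokenize(text, false));
        // The Unicode-aware rule keeps all three terms whole.
        System.out.println(tokenize(text, true));
    }
}
```

With the ASCII-only rule, accented letters act as separators, so terms are truncated or split before they reach the index. The Unicode-aware rule keeps them intact.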

Note that the FileDocument, HTMLDocument, etc. classes used by Desktop Terrier do not yet support other languages.

Stemmers

Starting with Terrier 1.1.1, all stemmers from the Snowball project are included. Currently, this means that the following stemmers can be applied from Terrier:

Batch Retrieval

When experimenting with topic files in languages other than English, Terrier will use a suitable topic file tokeniser if the property string.use_utf is set to true.
