Non-English language support in Terrier
When indexing documents in languages other than English, you should use the UTF index format. By default, Terrier assumes that indexed documents contain only terms without accents. Setting the property string.use_utf to true makes Terrier use the UTFLexicon, which overcomes this limitation by storing all terms in UTF-8.
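To see why an accent-free term encoding is a problem, the following minimal sketch round-trips a term through two character encodings using only the Java standard library. It is not Terrier's lexicon code: the class TermEncodingSketch and the method roundTrip are hypothetical names chosen for illustration, and US-ASCII stands in here as an assumed example of an encoding that cannot represent accented characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; this is not part of Terrier.
class TermEncodingSketch {

    // Encode a term to bytes with the given charset, then decode it again.
    // String.getBytes(Charset) substitutes the charset's replacement byte
    // for any character the charset cannot represent.
    static String roundTrip(String term, Charset charset) {
        byte[] stored = term.getBytes(charset);
        return new String(stored, charset);
    }

    public static void main(String[] args) {
        String term = "caf\u00e9"; // "cafe" with an acute accent on the final e

        // The accented character is lost in an ASCII-only encoding.
        System.out.println(roundTrip(term, StandardCharsets.US_ASCII));

        // UTF-8 preserves the term exactly, using two bytes for the accented letter.
        System.out.println(roundTrip(term, StandardCharsets.UTF_8));
        System.out.println(term.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

In this sketch the ASCII round trip turns the accented letter into a question mark, while the UTF-8 round trip returns the original term, which is the behaviour a UTF-8 term store needs.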
TRECCollection assumes that the only valid characters in terms are A-Z, a-z and 0-9. This assumption does not hold for documents in languages other than English, so you should use a Collection class that supports other languages. In most cases, this should be TRECUTFCollection, which you select by setting the property trec.collection.class=TRECUTFCollection. (TRECUTFCollection uses Character.isLetterOrDigit() to determine term boundaries.)
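The difference between the two term-boundary rules can be sketched in a few lines of plain Java. This is not the code of TRECCollection or TRECUTFCollection: the class TermBoundarySketch and its methods are hypothetical, and the sketch assumes that any character outside the accepted set simply ends the current term.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not part of Terrier.
class TermBoundarySketch {

    // ASCII-only rule: A-Z, a-z and 0-9 are the only term characters.
    static boolean isAsciiTermChar(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
    }

    // Split text into terms. Any non-term character is treated as a boundary
    // (an assumption of this sketch). Works on UTF-16 chars, so characters
    // outside the Basic Multilingual Plane are not handled.
    static List<String> tokenise(String text, boolean unicodeAware) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean termChar = unicodeAware
                    ? Character.isLetterOrDigit(c)
                    : isAsciiTermChar(c);
            if (termChar) {
                current.append(c);
            } else if (current.length() > 0) {
                terms.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            terms.add(current.toString());
        }
        return terms;
    }

    public static void main(String[] args) {
        String text = "na\u00efve caf\u00e9 2015"; // two accented words and a number
        System.out.println(tokenise(text, false)); // ASCII-only rule
        System.out.println(tokenise(text, true));  // Character.isLetterOrDigit() rule
    }
}
```

Under the ASCII-only rule the accented letters act as boundaries, so the two accented words break into the fragments "na", "ve" and "caf". Under the Character.isLetterOrDigit() rule each word survives as a single term.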
Note that FileDocument, HTMLDocument and the other Document classes used by Desktop Terrier do not yet support other languages.
Starting with Terrier 1.1.1, we have included all stemmers from the Snowball project. Currently, this means that the following stemmers can be applied from Terrier:
When experimenting with topics in languages other than English, Terrier will use a suitable topic file tokeniser if the property string.use_utf is set to true.
Copyright © 2015 University of Glasgow | All Rights Reserved