Terrier represents all terms internally as UTF. All provided Document classes use Tokeniser classes to tokenise text into terms during indexing. Likewise, during retrieval, TRECQuery uses the same tokeniser to parse queries (unlike TRECQuery, SingleLineTRECQuery does not perform any tokenisation by default). To change the tokeniser used for indexing and retrieval, set the tokeniser property to the name of the tokeniser you wish to use (NB: British English spelling). The default tokeniser is EnglishTokeniser.
While Terrier uses UTF internally to represent terms, the Collection and Document classes must open files using the correct character encoding. For instance, while valid XML files specify the encoding at the top of the file, a corpus of Hungarian in TREC format may be encoded in ISO 8859-16 or UTF-8. You should specify the encoding that TRECCollection should use to open the files, using the trec.encoding property. Note that Terrier falls back to the Java default encoding if trec.encoding is not set. On a Unix-like operating system, Java's choice of default encoding may be influenced by the LANG environment variable, e.g. LANG=en_US.UTF-8 will cause Java to default to opening files using UTF-8 encoding, while LANG=en_US will use ISO-8859-1.
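The effect of opening a file with the wrong encoding can be reproduced in a few lines of plain Java. The sketch below is not Terrier code: the class name EncodingDemo and the method decode are hypothetical and only illustrate why trec.encoding should name the encoding the corpus was actually written in.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingDemo {
    // Decode raw bytes using an explicitly named charset, in the same spirit
    // as naming the corpus encoding instead of relying on the JVM default.
    static String decode(byte[] raw, String encodingName) {
        return new String(raw, Charset.forName(encodingName));
    }

    public static void main(String[] args) {
        // The charset Java falls back to when no encoding is given.
        System.out.println("JVM default charset: " + Charset.defaultCharset());

        // A Hungarian word, written with Unicode escapes so that this source
        // file itself does not depend on any particular encoding.
        String original = "\u00e1rv\u00edzt\u0171r\u0151";
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        // Correct charset: the text round-trips unchanged.
        System.out.println(decode(utf8Bytes, "UTF-8").equals(original));

        // Wrong charset: every byte becomes its own character, so each
        // accented letter turns into two unrelated characters.
        System.out.println(decode(utf8Bytes, "ISO-8859-1").equals(original));
    }
}
```

Terms damaged in this way would be indexed as they were decoded, so queries containing the correctly encoded term would not match them.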
Tokenisers are designed to identify the terms in a stream of text. It is expected that no markup will be present in the text passed to the tokenisers (for indexing, the removal of markup is handled by Document implementations, e.g. HTML tags are parsed by TaggedDocument). Which tokeniser to use depends on the language of the text. Terrier ships with three different tokenisers for use when indexing text or parsing queries. The tokeniser is selected with the tokeniser property, e.g. tokeniser=UTFTokeniser.
EnglishTokeniser - assumes that all valid characters in terms are A-Z, a-z and 0-9. Obviously this assumption is incorrect when indexing documents in languages other than English.
UTFTokeniser - uses Java’s Character class to determine which characters are valid in indexing terms. In particular, a term can only contain characters for which Character.isLetterOrDigit() returns true, or for which Character.getType() returns Character.NON_SPACING_MARK or Character.COMBINING_SPACING_MARK.
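The two character rules described above can be sketched in plain Java as follows. This is an illustration only, not the source of Terrier's tokenisers: the class TokenCharSketch and its methods are hypothetical, and any further processing the real classes may apply is not modelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

final class TokenCharSketch {
    // Rule described for EnglishTokeniser: only A-Z, a-z and 0-9 are valid.
    static boolean isEnglishTermChar(int c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
    }

    // Rule described for UTFTokeniser: letters, digits and combining marks.
    static boolean isUtfTermChar(int c) {
        int type = Character.getType(c);
        return Character.isLetterOrDigit(c)
                || type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK;
    }

    // Split text into maximal runs of characters accepted by the given rule.
    static List<String> split(String text, IntPredicate valid) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (valid.test(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                terms.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            terms.add(current.toString());
        }
        return terms;
    }

    public static void main(String[] args) {
        // Two Hungarian words, written with Unicode escapes.
        String text = "gy\u00f6ny\u00f6r\u0171 v\u00e1ros";
        // The ASCII-only rule breaks each word at every accented letter.
        System.out.println(split(text, TokenCharSketch::isEnglishTermChar));
        // The Unicode-aware rule keeps both words whole.
        System.out.println(split(text, TokenCharSketch::isUtfTermChar));
    }
}
```

Under the ASCII-only rule the two words fall apart into five fragments (gy, ny, r, v, ros), which is why EnglishTokeniser is unsuitable for text like this.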
Terrier includes all stemmers from the Snowball stemmer project, namely:
When experimenting with topic files in languages other than English, use the same tokeniser setting that was used during indexing. You should also set the trec.encoding property to ensure that the correct encoding is used when reading the topic files.
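A mismatch between indexing and retrieval settings is easy to overlook, so a small sanity check may help. The sketch below is not part of Terrier: the class TopicSettingsCheck and its methods are hypothetical. It assumes the settings are available as properties-file text, and it uses only the two property names and the EnglishTokeniser default mentioned above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.Charset;
import java.util.Properties;

final class TopicSettingsCheck {
    // Parse properties-file text such as "tokeniser=UTFTokeniser".
    static Properties parse(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new IllegalArgumentException("unreadable properties text", e);
        }
        return props;
    }

    // True when retrieval names the same tokeniser as indexing and, if
    // trec.encoding is set for retrieval, this JVM supports that encoding.
    static boolean consistent(Properties indexing, Properties retrieval) {
        // EnglishTokeniser is the documented default when the property is unset.
        String indexTokeniser = indexing.getProperty("tokeniser", "EnglishTokeniser");
        String queryTokeniser = retrieval.getProperty("tokeniser", "EnglishTokeniser");
        // An unset trec.encoding is accepted here; the JVM default then applies.
        String encoding = retrieval.getProperty("trec.encoding");
        boolean encodingOk = encoding == null || Charset.isSupported(encoding);
        return indexTokeniser.equals(queryTokeniser) && encodingOk;
    }

    public static void main(String[] args) {
        Properties indexing = parse("tokeniser=UTFTokeniser\ntrec.encoding=UTF-8\n");
        Properties retrieval = parse("trec.encoding=UTF-8\n"); // tokeniser forgotten
        System.out.println(consistent(indexing, retrieval));
    }
}
```

In the usage shown in main, the retrieval settings omit the tokeniser property, so the check reports an inconsistency: the queries would be tokenised with the default EnglishTokeniser while the index was built with UTFTokeniser.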