I experienced two problems using trec_terrier.sh with query files in the TREC format.
The first one occurs with queries containing the character '<'. When it meets this character, Terrier treats the rest of the file as a single query (the one currently being parsed). I assume the parser mistakes it for the start of a tag. At some point, this prevents the doc tag from being removed from the stack. The other tags are parsed and pushed/popped correctly, but the stack is never empty, so the parser considers that the current query is still going on. I did not investigate further; I just removed the '<' from my queries to fix it.
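For anyone hitting the same thing, here is a minimal sketch of the workaround, done as a pre-processing step on the topic file rather than by hand. It is not part of Terrier: the class and method names are hypothetical, and the tag list (top, num, title, desc, narr) is an assumption based on the usual TREC topic layout, so adjust it to the tags your files actually use. Any '<' that does not open or close one of those tags is replaced by a space.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not a Terrier class: cleans a TREC topic file
// before it is handed to trec_terrier.sh.
final class TopicSanitizer {

    // Assumed tag set; extend the alternation if your topics use other tags.
    // Matches a '<' that is NOT followed by "tag>" or "/tag>".
    private static final Pattern STRAY_LT = Pattern.compile(
            "<(?!/?(?:top|num|title|desc|narr)\\s*>)",
            Pattern.CASE_INSENSITIVE);

    // Replaces every stray '<' with a space and leaves real tags untouched.
    static String stripStrayLessThan(String topics) {
        return STRAY_LT.matcher(topics).replaceAll(" ");
    }
}
```

Replacing with a space rather than deleting keeps the words on either side of the '<' from being glued into one token.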
The second one occurs when there are too many queries. It turns out that the content of the query id tag (NUM or whatever) is parsed by the EnglishTokenizer, which discards numbers that are too long and tokens with repeated digits. So if the query number has more than 4 digits, or contains the same digit 4 times in a row (e.g. 1111, 2222...), the tokenizer returns null. This causes a null pointer exception at line 165 of TRECQuery.java (the variable docno is null). I fixed it by increasing these limits in EnglishTokenizer, but having to modify the tokenizer just for the query ids is not really satisfying. Is there another simple way to fix that?
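To make the rule concrete, here is a small sketch of the check as I understand it from the behaviour above. It is my own re-implementation for illustration, not the actual EnglishTokenizer code: the class name, method name and the two constants are assumptions taken from what I observed (more than 4 digits, or 4 identical characters in a row). It can be used to list which query ids in a topic file would be dropped before running retrieval.

```java
// Hypothetical checker, not Terrier code: mimics the filtering rule
// described in this post so that problematic query ids can be spotted.
final class QueryIdCheck {

    static final int MAX_DIGITS = 4; // assumed limit: ids with more digits are dropped
    static final int MAX_RUN = 3;    // assumed limit: 4 identical characters in a row are dropped

    // Returns true if the id would be rejected under the assumed rule.
    static boolean wouldBeDropped(String id) {
        int digits = 0;
        int run = 0;
        char prev = 0;
        for (int i = 0; i < id.length(); i++) {
            char c = id.charAt(i);
            if (Character.isDigit(c)) {
                digits++;
            }
            // Length of the current run of identical characters.
            run = (i > 0 && c == prev) ? run + 1 : 1;
            prev = c;
            if (digits > MAX_DIGITS || run > MAX_RUN) {
                return true;
            }
        }
        return false;
    }
}
```

Under these assumed limits, ids such as 301 or 1234 pass, while 12345 and 1111 are rejected.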
Indexing seems to work fine, though; the two problems seem to occur only when parsing queries.
The last problem made me wonder: is there a simple way (i.e. without writing a new tokenizer) to specify which characters are separators and/or to eliminate specific words (numbers, dates...)? And where should I look to remove hapaxes?
I use Terrier 3.5.
Edited 1 time(s). Last edit at 09/03/2011 01:12AM by ptirilly.