Terrier Users :  Terrier Forum terrier.org
General discussion about using/developing applications using Terrier 
Two problems with TREC format queries and a question
Posted by: ptirilly ()
Date: September 03, 2011 01:05AM

Hi,

I experienced two problems using trec_terrier.sh with query files in the TREC format.

The first one occurred with queries containing the character '<'. When it meets this character, Terrier treats the rest of the file as a single query (the one currently being parsed). I assume the parser mistakes it for the start of a tag. At some point, this prevents the doc tag from being removed from the stack. The other tags are parsed and pushed/popped correctly, but the stack is never empty, so the parser considers the current query to be still open. I did not investigate further; I just removed the '<' from my queries to fix it.
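The workaround of removing '<' from the queries can be scripted rather than done by hand. What follows is only a minimal sketch in Python, run on the topics file before it is passed to Terrier. The tag names in KNOWN_TAGS and the function name strip_stray_lt are assumptions made for illustration; they are not taken from Terrier and must be adjusted to the tags the topics file really uses.

```python
import re

# Assumed tag names, for illustration only: list the tags of your own topics file.
KNOWN_TAGS = ("top", "num", "title", "desc", "narr")

# Matches a '<' only when it does NOT begin an opening or closing known tag.
_STRAY_LT = re.compile(
    r"<(?!/?(?:%s)\s*>)" % "|".join(KNOWN_TAGS), re.IGNORECASE
)


def strip_stray_lt(text: str, replacement: str = " ") -> str:
    """Replace every '<' that is not part of a known tag with `replacement`."""
    return _STRAY_LT.sub(replacement, text)


if __name__ == "__main__":
    sample = "<top>\n<num>1</num>\n<title>price < 100</title>\n</top>\n"
    # The tags are left alone; only the '<' inside the title text is replaced.
    print(strip_stray_lt(sample))
```

Replacing with a space instead of deleting the character keeps the neighbouring words apart, which matters once the query text is tokenized.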

The second one occurs when there are too many queries. It turns out that the content of the query id tags (NUM or whatever) is parsed by the EnglishTokenizer, which discards numbers that are too long and tokens with repeated digits. So if the query number has more than 4 digits, or contains 4 repeated digits (e.g. 1111, 2222...), the tokenizer returns null. This causes a null pointer exception at line 165 of TRECQuery.java (the variable docno is null). I fixed it by increasing these limits in EnglishTokenizer, but having to modify the parser just for the query ids is not really satisfying. Is there another simple way to fix that?
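For anyone hitting the same exception, the affected query ids can be spotted before running trec_terrier.sh. The sketch below, in Python, only mirrors the rule as described in this post (more than 4 digits, or 4 identical characters in a row); it is not Terrier's code, and the constants and the function name id_would_be_dropped are assumptions made for illustration.

```python
# Limits as described in the post above; assumed, not read from Terrier's source.
MAX_DIGITS = 4      # an id with more digits than this is dropped
MAX_SAME_RUN = 3    # a run of identical characters longer than this is dropped


def id_would_be_dropped(query_id: str) -> bool:
    """Return True if the id breaks one of the two limits described above."""
    digit_count = sum(ch.isdigit() for ch in query_id)
    if digit_count > MAX_DIGITS:
        return True

    # Find the longest run of identical consecutive characters.
    longest = run = 0
    previous = None
    for ch in query_id:
        run = run + 1 if ch == previous else 1
        longest = max(longest, run)
        previous = ch
    return longest > MAX_SAME_RUN


if __name__ == "__main__":
    for qid in ("301", "1234", "1111", "12345"):
        print(qid, id_would_be_dropped(qid))
```

Ids flagged by such a check could then be renumbered in the topics file, which avoids touching the tokenizer limits.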

Indexing seems to work fine, though; the two problems seem to occur only when parsing queries.

The last problem made me wonder: is there a simple way (i.e. without writing a new tokenizer) to specify which characters are separators and/or to eliminate specific kinds of tokens (numbers, dates...)? And where should I look to remove hapaxes (terms that occur only once)?

I use Terrier 3.5.

Regards,
Pierre



Edited 1 time(s). Last edit at 09/03/2011 01:12AM by ptirilly.

Re: Two problems with TREC format queries and a question
Posted by: craigm ()
Date: December 05, 2011 04:44PM

Hi ptirilly,

Thanks. This looks like a bug in TRECQuery or TRECFullTokenizer.
We'll track progress on [terrier.org]

Craig

Re: Two problems with TREC format queries and a question
Posted by: craigm ()
Date: July 26, 2012 11:46AM

Hi ptirilly,

Thanks for the report. I have attached a patch to the issue tracker for the second issue. In my view, the first issue comes from a bad topics file.

Craig
