Terrier Users :  Terrier Forum terrier.org
General discussion about using/developing applications using Terrier 
Non-English Retrieval
Posted by: akshatj
Date: March 13, 2018 11:08AM

I want to perform retrieval on French corpora that I have in TREC format. I used the following settings in the terrier.properties file (I am using version 4.2):

trec.collection.class=TRECCollection
string.use_utf=true
trec.encoding=utf-8
tokeniser =UTFTokeniser

However, on indexing, it gives me these warnings:
16:21:33.637 [main] INFO o.terrier.indexing.CollectionFactory - Finished reading collection specification
16:21:33.641 [main] INFO o.t.i.MultiDocumentFileCollection - TRECCollection 0% processing share/French/corpus.xml.tag
16:21:34.000 [main] INFO o.t.structures.indexing.Indexer - creating the data structures data_1
16:21:34.000 [main] INFO o.t.structures.indexing.Indexer - BlockIndexer creating direct index
16:22:42.589 [main] WARN o.t.structures.indexing.Indexer - Adding empty document ATS.940101.0021
16:22:42.590 [main] WARN o.t.structures.indexing.Indexer - Adding empty document ATS.940101.0023
16:22:42.593 [main] WARN o.t.structures.indexing.Indexer - Adding empty document ATS.940101.0030
16:22:42.594 [main] WARN o.t.structures.indexing.Indexer - Adding empty document ATS.940101.0035


I don't understand why it is adding empty documents. What could be the fix?
Thank you in advance!
akshatj

Re: Non-English Retrieval
Posted by: craigm
Date: March 14, 2018 09:00AM

Hi akshatj,

Have you had a look at the documents to see if they really are empty? It's not unheard of.

Also note the space in the property below:

tokeniser =UTFTokeniser

HTH

Craig

Re: Non-English Retrieval
Posted by: akshatj
Date: March 20, 2018 12:10PM

Hi Craig,
The files are not empty. My TREC file, which is to be indexed, is of the form:

<DOC>
<DOCNO>LEMONDE94-000131-19940103</DOCNO>
<TEXT>
-French text-
</TEXT>
</DOC>
<DOC>
<DOCNO>LEMONDE94-000131-19940104</DOCNO>
<TEXT>
-French text-
</TEXT>
</DOC>
Many more such "docs" follow in the same way. Now, when I simply copy and paste a few (say 10) of the initial "docs" into a separate file, indexing works fine, but when I index the entire file (about 400 MB), this is shown:

16:21:33.637 [main] INFO o.terrier.indexing.CollectionFactory - Finished reading collection specification
16:21:33.641 [main] INFO o.t.i.MultiDocumentFileCollection - TRECCollection 0% processing share/French/corpus.xml.tag
16:21:34.000 [main] INFO o.t.structures.indexing.Indexer - creating the data structures data_1

Then the system just stays like this for 1-2 minutes, and then the "empty doc" warnings begin to flow. I am out of ideas to try now.

Thanks
Akshat



Edited 1 time(s). Last edit at 03/20/2018 02:45PM by akshatj.

Re: Non-English Retrieval
Posted by: craigm
Date: March 21, 2018 11:12AM

Have you actually checked the first document it is suggesting is empty?
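
One way to check is to pull that document out of the big file by its DOCNO and look at it directly. Below is a rough Python sketch, not anything that ships with Terrier: the helper names (iter_trec_docs, visible_text, show_doc) are made up for illustration, and it assumes each <DOC> and </DOC> tag sits on its own line, as in the sample above.

```python
import re
from typing import Iterable, Iterator, Optional, Tuple

# Hypothetical helpers for inspecting a TREC-format file; not part of Terrier.
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.S)
TAG_RE = re.compile(r"<[^>]+>")


def iter_trec_docs(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (docno, raw_document) for every <DOC>...</DOC> block."""
    buf = []
    inside = False
    for line in lines:
        if line.strip().startswith("<DOC>"):
            inside = True
            buf = []
        if inside:
            buf.append(line)
        if inside and "</DOC>" in line:
            raw = "".join(buf)
            match = DOCNO_RE.search(raw)
            yield (match.group(1) if match else "", raw)
            inside = False


def visible_text(raw: str) -> str:
    """Return what is left after removing the DOCNO element and all tags."""
    body = DOCNO_RE.sub("", raw)
    return TAG_RE.sub(" ", body).strip()


def show_doc(path: str, docno: str, encoding: str = "utf-8") -> Optional[str]:
    """Return the raw text of the document with this DOCNO, or None."""
    # errors="replace" keeps the scan going if some bytes are not valid UTF-8.
    with open(path, encoding=encoding, errors="replace") as fh:
        for found, raw in iter_trec_docs(fh):
            if found == docno:
                return raw
    return None
```

For example, show_doc("share/French/corpus.xml.tag", "ATS.940101.0021") should return the first document the indexer complains about, and passing the result to visible_text shows whether any text remains once the tags are stripped. Compare its tags with the LEMONDE94 sample you posted.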
