Terrier Users: Terrier Forum (terrier.org)
General discussion about using/developing applications using Terrier 
Indexing Hindi Corpus
Posted by: akshatj
Date: March 31, 2018 10:26PM

I am trying to index a corpus in Hindi. One of the files in the corpus is:

[drive.google.com]

However, while indexing this document, I got this warning message:

"WARN o.t.structures.indexing.Indexer - Adding empty document range04_save044_d00001_f01693"

Below is my properties file:

#default controls for query expansion
querying.postprocesses.order=QueryExpansion
querying.postprocesses.controls=qe:QueryExpansion
#default controls for the web-based interface. SimpleDecorate
#is the simplest metadata decorator. For more control, see Decorate.
querying.postfilters.order=SimpleDecorate,SiteFilter,Scope
querying.postfilters.controls=decorate : SimpleDecorate,site : SiteFilter , scope : Scope

#default and allowed controls
querying.default.controls=
querying.allowed.controls=scope,qe,qemodel,start,end,site

#trec.collection.class=TRECUTFCollection
#trec.encoding=utf-8
#trec.output.format.length=3000
#document tags specification
#for processing the contents of
#the documents, ignoring DOCHDR
TrecDocTags.doctag=DOC
TrecDocTags.idtag=DOCNO
TrecDocTags.skip=DOCHDR
#set to true if the tags can be of various case
TrecDocTags.casesensitive=false

terrier.index.path = /home/siddhant/Documents/IR/terrier-core-4.2/var/Ass2/index
trec.results=/home/siddhant/Documents/IR/terrier-core-4.2/var/Ass2/results
indexer.meta.forward.keylens=100

#Alternatively, topics may be in a single file
#trec.topics.parser=SingleLineTRECQuery
#the first token on each line is the query id
#SingleLineTRECQuery.queryid.exists=true
#should periods be removed from the query stream (they break the query parser)
#SingleLineTRECQuery.periods.allowed=false

block.indexing=true
trec.output.format.length=50000
#query tags specification
TrecQueryTags.doctag=top
TrecQueryTags.idtag=num
TrecQueryTags.process=top,num,title
TrecQueryTags.skip=desc,narr

trec.topics=stemmed_topics.txt
trec.qrels=hindi-qrels.txt
trec.model=BM25

#stop-words file
stopwords.filename=hindi-stopwords.txt

#the processing stages a term goes through
termpipelines=


When I add
"trec.collection.class=TRECUTFCollection"
to my properties file, I get this error instead:
"ERROR o.terrier.indexing.CollectionFactory - ERROR: First Collection class named org.terrier.indexing.TRECUTFCollection - cannot be instantiated
java.lang.reflect.InvocationTargetException: null"

Thanks,




Re: Indexing Hindi Corpus
Posted by: craigm
Date: April 05, 2018 11:00AM

Hi,

I guess you are following some instructions for an older version of Terrier. TRECUTFCollection was removed some time ago.

Use:
trec.collection.class=TRECCollection
tokeniser=UTFTokeniser

See [terrier.org] for more information.

Craig
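
A minimal sketch of how the two suggested settings might sit in the properties file from the first post. The first two properties are taken directly from the reply above. Enabling trec.encoding=utf-8 (which is commented out in the original file) is an assumption that the corpus files are UTF-8 encoded. The comments are explanatory and are not from the Terrier documentation.

```properties
# collection class suggested in the reply above (replaces TRECUTFCollection)
trec.collection.class=TRECCollection
# tokeniser suggested in the reply above
tokeniser=UTFTokeniser
# assumption: the Hindi corpus files are UTF-8 encoded
trec.encoding=utf-8
```

The index would most likely need to be rebuilt after changing these, since both settings affect how documents are read and tokenised at indexing time.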

Re: Indexing Hindi Corpus
Posted by: akshatj
Date: April 12, 2018 05:52PM

Thanks, it worked.

Re: Indexing Hindi Corpus
Posted by: riya77
Date: April 22, 2018 01:27PM

Hi akshat,

Could you please help me out? I am also indexing a Hindi corpus,
and I am running into problems with XML indexing.
Thanks
