Terrier Users :  Terrier Forum terrier.org
General discussion about using/developing applications using Terrier 
TERRIER 4.2 SIMPLE XML ISSUES
Posted by: riya77 ()
Date: April 22, 2018 01:23PM

I was trying to index XML files, but I was getting a "constructor not found" error. After some searching I found a solution on the JIRA platform, which described a bug fix and a Maven build for it.
I ran the command:

mvn -DskipTests package
and built a copy of the SimpleXMLCollection Java file.
But this is what I get while indexing. Why is this happening even though there are documents in the folder?

System: Windows 10
Terrier version: 4.2
Dataset: FIRE Hindi corpus 2008

Please help. Thanks in advance.


17:49:24.253 [main] INFO o.t.indexing.SimpleXMLCollection - Found 0 documents in C:\Users\user\Desktop\terrier-core-4.2\fhin2008\hi.docs.2008\hi.doc.2008\jagran\jagran\range00_save001_d00001_f01023.xml
17:49:24.306 [main] INFO o.t.indexing.SimpleXMLCollection - Found 0 documents in C:\Users\user\Desktop\terrier-core-4.2\fhin2008\hi.docs.2008\hi.doc.2008\jagran\jagran\range00_save001_d00001_f01024.xml
17:49:24.385 [main] INFO o.t.indexing.SimpleXMLCollection - Found 0 documents in C:\Users\user\Desktop\terrier-core-4.2\fhin2008\hi.docs.2008\hi.doc.2008\jagran\jagran\range00_save001_d00001_f01033.xml
17:49:24.409 [main] INFO o.t.indexing.SimpleXMLCollection - Found 0 documents in C:\Users\user\Desktop\terrier-core-4.2\fhin2008\hi.docs.2008\hi.doc.2008\jagran\jagran\range00_save001_d00001_f01044.xml
17:49:24.444 [main] INFO o.t.indexing.SimpleXMLCollection - Found 0 documents in C:\Users\user\Desktop\terrier-core-4.2\fhin2008\hi.docs.2008\hi.doc.2008\jagran\jagran\range00_save001_d00001_f01054.xml
17:49:24.488 [main] INFO o.t.indexing.SimpleXMLCollection - Found 0 documents in C:\Users\user\Desktop\terrier-core-4.2\fhin2008\hi.docs.2008\hi.doc.2008\jagran\jagran\range00_save001_d00001_f01061.xml
17:49:24.527 [main] INFO o.t.indexing.SimpleXMLCollection - Found 0 documents in C:\Users\user\Desktop\terrier-core-4.2\fhin2008\hi.docs.2008\hi.doc.2008\jagran\jagran\range00_save001_d00001_f01065.xml



Edited 1 time(s). Last edit at 04/22/2018 01:25PM by riya77.

Re: TERRIER 4.2 SIMPLE XML ISSUES
Posted by: craigm ()
Date: April 23, 2018 11:31AM

Hi,

You need to specify some properties for SimpleXMLCollection to correctly identify documents within each XML file. See [terrier.org]

Can you show an example of your documents?

Craig

Re: TERRIER 4.2 SIMPLE XML ISSUES
Posted by: riya77 ()
Date: April 25, 2018 09:16AM

Hi Craig,

Here is an example of one of my document files.


<Doc>
<DocNo>range00_save001_d00000_f00018</DocNo>
<Text>
[Hindi body text; the characters were rendered as "?" in the forum post and cannot be recovered]
</Text>
</Doc>



This is a file from the Hindi corpus.

Properties I used:

trec.collection.class=TRECCollection
tokeniser=UTFTokeniser
trec.encoding=utf-8
#query tags specification
TrecQueryTags.doctag=TOP
TrecQueryTags.idtag=NUM
TrecQueryTags.process=TOP,NUM,TITLE
TrecQueryTags.skip=DESC,NARR


trec.topics.parser=SingleLineTRECQuery
SingleLineTRECQuery.tokenise=true
#stop-words file
stopwords.filename=hindistop.txt

#the processing stages a term goes through
termpipelines=Stopwords,PorterStemmer
indexer.meta.forward.keylens=97



Edited 1 time(s). Last edit at 04/25/2018 09:17AM by riya77.

Re: TERRIER 4.2 SIMPLE XML ISSUES
Posted by: craigm ()
Date: April 25, 2018 05:26PM

I don't see any of the following properties in your configuration, as mentioned in my link above:

xml.blacklist.docids - docnos of documents that will not be indexed.
xml.doctag - tag that marks a document.
xml.idtag - tag that contains the docno. Attributes are specified as "element.attribute".
xml.terms - list of tags whose children contain terms that should be indexed.
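
For the sample document posted earlier in this thread, a minimal sketch of these settings might look like the following. The tag names are taken directly from that sample (Doc, DocNo, Text). The collection class line is an assumption, since the posted properties set TRECCollection while the log output comes from SimpleXMLCollection. How Terrier treats the upper/lower case of tag names is not verified here, so check it against the SimpleXMLCollection documentation.

```properties
# Assumption: switch the collection class to the XML collection
trec.collection.class=SimpleXMLCollection

# Tag that wraps one document in the sample file
xml.doctag=Doc
# Tag that holds the document number
xml.idtag=DocNo
# Tag(s) whose contents should be indexed as terms
xml.terms=Text
```

xml.blacklist.docids is left out of this sketch because no documents need to be excluded.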


