[TR-559] SimpleXMLCollection is unable to help in indexing XML documents Created: 25/May/19  Updated: 18/Jun/19

Status: Open
Project: Terrier Core
Component/s: .indexing, .matching, .querying
Affects Version/s: 5.1
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Irfan Ullah Assignee: Craig Macdonald
Resolution: Unresolved  
Labels: collection, indexing, querying

Attachments: XML File 037329400X.xml     XML File 037541200X.xml     Java Source File IndexingXmlDocs.java     PNG File Screenshot.png     File terrier.properties     File terrier.properties    

 Description   
Hi,
While indexing XML files, I used the attached code (IndexingXmlDocs.java), in which I tested both SimpleFileCollection and SimpleXMLCollection. If I use SimpleFileCollection, Terrier indexes the documents, but during searching the getOccurences method returns only 1 even if the term appears multiple times in an XML document (although it works fine with text files).
If I use SimpleXMLCollection, the index files are generated but they contain no data.
My question is:
What can be changed in the attached IndexingXmlDocs.java file so that it correctly indexes the XML files?

Please help!

 Comments   
Comment by Craig Macdonald [ 28/May/19 ]

SimpleXMLCollection is able to help in indexing XML documents. What do your XML documents look like?

You need to configure SimpleXMLCollection properly:
What tag marks the start/end of a document?
What tag defines the unique identifier of each document?
What tag(s) contain text?

See the javadoc of SimpleXMLCollection for which properties it uses.
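
To make the three questions above concrete, here is a minimal sketch in plain JDK Java. It is not Terrier code and does not reproduce how SimpleXMLCollection works internally; it only shows what "document tag", "identifier tag" and "text tags" mean for a hypothetical XML layout. The tag names (book, isbn, title, summary) and the class name XmlTagSketch are assumptions for illustration, and the attached files may use different tags. The matching Terrier properties are believed to be xml.doctag, xml.idtag and xml.terms, but confirm the exact names in the javadoc for your version.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class XmlTagSketch {
    // Hypothetical answers to the three questions; adjust to the real files.
    static final String DOC_TAG = "book";               // marks start/end of a document
    static final String ID_TAG = "isbn";                // holds the unique identifier
    static final String[] TEXT_TAGS = {"title", "summary"}; // tags whose text is indexed

    /** Returns one "id -> text" line per document element found in the XML. */
    static List<String> extract(String xml) {
        Document dom;
        try {
            dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        List<String> out = new ArrayList<>();
        NodeList docs = dom.getElementsByTagName(DOC_TAG);
        for (int i = 0; i < docs.getLength(); i++) {
            Element doc = (Element) docs.item(i);
            String id = firstText(doc, ID_TAG);
            // Concatenate the text of every configured text tag, in order.
            StringBuilder text = new StringBuilder();
            for (String tag : TEXT_TAGS) {
                NodeList nodes = doc.getElementsByTagName(tag);
                for (int j = 0; j < nodes.getLength(); j++) {
                    if (text.length() > 0) text.append(' ');
                    text.append(nodes.item(j).getTextContent().trim());
                }
            }
            out.add(id + " -> " + text);
        }
        return out;
    }

    /** Text of the first descendant with the given tag, or "" if there is none. */
    static String firstText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String xml = "<books><book><isbn>X1</isbn><title>Indexing XML</title>"
            + "<summary>XML and more XML</summary><price>10</price></book></books>";
        for (String line : extract(xml)) {
            System.out.println(line); // prints "X1 -> Indexing XML XML and more XML"
        }
    }
}
```

Text inside tags that are not listed (price in this sample) is ignored. If the configured document or text tags do not occur in the files at all, nothing is extracted, which is one possible explanation for index files that contain no data.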

Craig

Comment by Irfan Ullah [ 28/May/19 ]

Thanks, Sir.
An XML file (037329400X.xml) is attached here for your review.
Thanks

Comment by Craig Macdonald [ 28/May/19 ]

So following my feedback, can you propose a configuration?

Comment by Craig Macdonald [ 28/May/19 ]

I would suggest using the command line until you are familiar with Terrier. I still do most of my experiments from the command line.

Craig

Comment by Irfan Ullah [ 18/Jun/19 ]

Respected Sir,
I have attached the terrier.properties file and an XML file, 037541200X.xml, to be indexed. Please check them to see where I am making mistakes, as I get the error message given in the .
Indexing and searching work fine if I use this terrier.properties, i.e., indexing using TRECCollection.

Please help!

Generated at Sat Aug 08 01:27:34 BST 2020 using JIRA 7.1.1#71004-sha1:d6b2c0d9b7051e9fb5e4eb8ce177ca56d91d7bd8.