Terrier Users :  Terrier Forum terrier.org
General discussion about using/developing applications using Terrier 
XML and Index has no terms
Posted by: lisi ()
Date: September 15, 2010 12:08AM

Hi there!

I am having difficulty finding a solution to my problem. I hope someone knows how to deal with this!

The problem: I want to index an XML file with the SimpleXMLCollection interface. It works with a very simple XML file, but not with the one I actually want to index. My terrier.properties file looks like this:

terrier.home=/some/useful/path
terrier.index.path=/another/useful/path
trec.collection.class=SimpleXMLCollection
xml.doctag=patent-document
xml.idtag=patent-document.ucid
xml.terms=claims

At first I had errors because of some empty documents, so I added "ignore.empty.documents=true". Now the exceptions are gone, but I still get "Found 2 documents in xml_file_i_want_to_index.xml", which is weird because there is only one patent-document tag (opening + closing) in it. I searched the code and pinned the problem down to the SimpleXMLCollection.findDocumentElement method. There, in the for-loop that iterates over the child nodes, I have two child nodes named patent-document: the first one seems to be the document type (because the NodeType number is 10) and the second one seems to be an element node (NodeType = 1). I can suppress the matching of the document type node, but this does not solve my overall problem. The last three lines from indexing are always:

WARN - No temporary lexicons to merge, skipping
INFO - Started building the inverted index...
ERROR - Index has no terms. Inverted index creation aborted.

So nothing gets indexed. And I do not know why. Any ideas?

Regards,
Elisabeth


Btw: in line 427 of SimpleXMLCollection (in the above-mentioned findDocumentElement method) the variable n is checked for null:

if(n == null)
continue;

Shouldn't it be c that gets checked? If n were null, the method would already have returned false in line 415 (though I may have missed something).
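
For reference, the "2 documents" observation can be reproduced with the plain JDK DOM parser, independent of Terrier. The sketch below is not Terrier's code: the class name, the helper methods and the sample document (which uses an inline DTD so that it is self-contained) are assumptions made for illustration. It counts the top-level nodes whose name equals the document tag, once without and once with a check for the element node type.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DoctagNodeTypes {

    // Hypothetical sample: the DOCTYPE name equals the root element name.
    static final String SAMPLE =
        "<?xml version=\"1.0\"?>"
        + "<!DOCTYPE patent-document [<!ELEMENT patent-document (#PCDATA)>]>"
        + "<patent-document>some text</patent-document>";

    // Parses an XML string with the default JDK DOM parser.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Counts direct children of the document node whose name equals the tag.
    // With elementsOnly=false, a DocumentType node (type 10) also matches,
    // because its node name is the name given in the DOCTYPE declaration.
    static int countByName(Document doc, String tag, boolean elementsOnly) {
        int count = 0;
        NodeList children = doc.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node c = children.item(i);
            if (c == null) {
                continue;
            }
            if (elementsOnly && c.getNodeType() != Node.ELEMENT_NODE) {
                continue;
            }
            if (tag.equals(c.getNodeName())) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        System.out.println("by name only:  " + countByName(doc, "patent-document", false));
        System.out.println("elements only: " + countByName(doc, "patent-document", true));
    }
}
```

Matching by name alone picks up the document type node as well as the element, which is consistent with the node types 10 and 1 described above. Filtering on Node.ELEMENT_NODE leaves only the real document element.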

Re: XML and Index has no terms
Posted by: craigm ()
Date: September 16, 2010 07:13PM

Hi lisi,

It could be the case. Do you have a very simple XML file that I can test with, and your configuration?

Thanks

Craig

Re: XML and Index has no terms
Posted by: lisi ()
Date: September 17, 2010 08:40PM

Hi!

Thank you for your quick reply! Because I do not know how much of the XML file I am allowed to publish, I divided the problem into two files, which helped me understand some of the errors.

First, I have a simple xml file in the style I want to index:

********************************
<?xml version="1.0" encoding="UTF-8"?><patent-document ucid="2"><abstract lang="EN" load-source="us" status="new">Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec tempus felis vitae urna porttitor sit amet pulvinar mi posuere. Suspendisse eu lorem urna, quis tempor dolor. Pellentesque vel sapien nec justo suscipit feugiat. Nullam posuere odio at arcu tempus tristique tempor orci tempus. Aliquam elit quam, rhoncus nec semper eget, posuere mollis ipsum. Fusce ac urna quis ipsum cursus dignissim. Nunc odio libero, consectetur ac ornare nec, laoreet a arcu. Duis quis nisl non ipsum fermentum consectetur. Nulla aliquam ultrices turpis nec elementum. Nam iaculis nunc nisi, vel aliquet neque. Ut a nulla vel ante congue consectetur vitae sed elit. Nulla eget mi lacus. Suspendisse potenti. Quisque et tortor sed mi congue porta. Nam ultricies nulla sed turpis rutrum rutrum. Suspendisse potenti. In dictum enim porttitor elit pharetra a euismod sapien pretium. Integer id ultrices nisi. Vestibulum aliquet rhoncus pellentesque. Morbi ultrices bibendum pellentesque.</abstract></patent-document>
********************************

with the properties file:
********************************
#directory names
terrier.home=<insertpath>
terrier.index.path=<insertpath>

trec.collection.class=SimpleXMLCollection

#what tag defines the document
xml.doctag=patent-document
#what tag defines the document number
xml.idtag=patent-document.ucid
#what tags hold text to be indexed
xml.terms=abstract

ignore.empty.documents=true
********************************

Here the problem of finding 2 documents is gone. Nevertheless, a NullPointerException occurs, but interestingly the index still gets built (although interactive_terrier just crashes when I start it with this index).

Second file (with a dtd):

********************************
<?xml version="1.0" encoding="utf-8"?><!DOCTYPE document SYSTEM "path_to/mydtd.dtd">
<document id="3">
<text>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec tempus felis vitae urna porttitor sit amet pulvinar mi posuere. Suspendisse eu lorem urna, quis tempor dolor. Pellentesque vel sapien nec justo suscipit feugiat. Nullam posuere odio at arcu tempus tristique tempor orci tempus. Aliquam elit quam, rhoncus nec semper eget, posuere mollis ipsum. Fusce ac urna quis ipsum cursus dignissim. Nunc odio libero, consectetur ac ornare nec, laoreet a arcu. Duis quis nisl non ipsum fermentum consectetur. Nulla aliquam ultrices turpis nec elementum. Nam iaculis nunc nisi, vel aliquet neque. Ut a nulla vel ante congue consectetur vitae sed elit. Nulla eget mi lacus. Suspendisse potenti. Quisque et tortor sed mi congue porta. Nam ultricies nulla sed turpis rutrum rutrum. Suspendisse potenti. In dictum enim porttitor elit pharetra a euismod sapien pretium. Integer id ultrices nisi. Vestibulum aliquet rhoncus pellentesque. Morbi ultrices bibendum pellentesque. </text>
</document>
********************************

with the mydtd.dtd:
********************************
<!ELEMENT document ( text ) >
<!ATTLIST document id NMTOKEN #REQUIRED >

<!ELEMENT text ( #PCDATA ) >
********************************


and properties file:

********************************
#directory names
terrier.home=<insertpath>
terrier.index.path=<insertpath>

trec.collection.class=SimpleXMLCollection

#what tag defines the document
xml.doctag=document
#what tag defines the document number
xml.idtag=document.id
#what tags hold text to be indexed
xml.terms=text

ignore.empty.documents=true
********************************

This one finds 2 documents, as expected: one because of the DTD reference, the other because of the tag. Again, a NullPointerException occurs.

Thank you for the hint about using very simple XML files (I really don't know why I did not think of it first). I found out why the index was not created: I expected all subtags of the xml.terms tag to be indexed, but only the tag that directly contains the text counts. For example, for <a><b>textToIndex</b></a> you have to use xml.terms=b and NOT xml.terms=a.

This raises two questions: Can I index subitems in another way? And why do I get NullPointerExceptions? I think I am doing something terribly wrong...
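
To make the "subitems" question concrete: the sketch below shows, with the plain JDK DOM API, what collecting the text of a tag together with all of its descendants would look like. This is not how Terrier's SimpleXMLCollection works and it is not a Terrier setting. The class and method names are hypothetical, and the code would have to be used as a pre-processing step or adapted into a custom collection class.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NestedText {

    // Parses an XML string with the default JDK DOM parser.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Appends the trimmed text of every text/CDATA node below 'node',
    // in document order, separated by single spaces.
    static void collectText(Node node, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(text);
            }
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(children.item(i), out);
        }
    }

    // Returns the text found under every element with the given tag name,
    // including text held by nested subtags.
    static String textUnder(String xml, String tag) {
        NodeList matches = parse(xml).getElementsByTagName(tag);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < matches.getLength(); i++) {
            collectText(matches.item(i), out);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String xml = "<a><b>textToIndex</b><c>more text</c></a>";
        System.out.println(textUnder(xml, "a"));
        System.out.println(textUnder(xml, "b"));
    }
}
```

If the same tag is nested inside itself, the text of the inner element would be collected twice, so a real implementation would need to guard against that.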

Re: XML and Index has no terms
Posted by: lisi ()
Date: September 17, 2010 08:48PM

Ok, in the first example the NullPointerException disappears if I set the id in an extra tag:

<patent-document>
<id>2</id>
...

Maybe I misunderstood the documentation on [terrier.org]: "xml.idtag - tag that contains the docno. Attribute are specified as "element.attribute"." I thought this should be patent-document.id ??

Re: XML and Index has no terms
Posted by: craigm ()
Date: September 18, 2010 04:40PM

Hi lisi,

I think the problem is with Terrier 3.0. I tried your example documents on the current SVN head of Terrier and they work fine. This suggests that your problem may be resolved by applying the patch that I attached to the following issue:
[terrier.org]
and recompiling Terrier.

Would you be OK to try that and let me know if that resolved the problem?

Craig

Re: XML and Index has no terms
Posted by: lisi ()
Date: September 20, 2010 03:21PM

Hi!

Thanks for the patch! With it the first problem is gone, but as soon as I have a dtd in my xml file, I get a NullPointerException and that's it. I tried it with the second example I posted and this is the error:

A problem occurred: java.lang.NullPointerException
java.lang.NullPointerException
at org.terrier.indexing.SimpleXMLCollection$XMLDocument.doRecursive(SimpleXMLCollection.java:103)
at org.terrier.indexing.SimpleXMLCollection$XMLDocument.<init>(SimpleXMLCollection.java:88)
at org.terrier.indexing.SimpleXMLCollection.findDocumentElement(SimpleXMLCollection.java:454)
at org.terrier.indexing.SimpleXMLCollection.findDocumentElement(SimpleXMLCollection.java:464)
at org.terrier.indexing.SimpleXMLCollection.openNextFile(SimpleXMLCollection.java:528)
at org.terrier.indexing.SimpleXMLCollection.nextDocument(SimpleXMLCollection.java:433)
at org.terrier.indexing.BasicIndexer.createDirectIndex(BasicIndexer.java:219)
at org.terrier.indexing.Indexer.index(Indexer.java:344)
at org.terrier.applications.TRECIndexing.index(TRECIndexing.java:123)
at org.terrier.applications.TrecTerrier.run(TrecTerrier.java:390)
at org.terrier.applications.TrecTerrier.applyOptions(TrecTerrier.java:573)
at org.terrier.applications.TrecTerrier.main(TrecTerrier.java:237)

So at NamedNodeMap attributes = p.getAttributes() the Node does not contain any attributes.
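
One possible explanation, offered here as an assumption rather than a confirmed diagnosis of the Terrier code: in the W3C DOM, Node.getAttributes() returns null for every node that is not an element, including the document type node created for a DOCTYPE declaration. If such a node reaches code that calls methods on the returned map, a NullPointerException follows. The sketch below uses only the JDK DOM API. The class, the helper names and the sample with an inline DTD are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class AttributeGuard {

    // Hypothetical sample modelled on the second example file, but with the
    // DTD inlined so that no external file is needed.
    static final String SAMPLE =
        "<?xml version=\"1.0\"?>"
        + "<!DOCTYPE document ["
        + "<!ELEMENT document (text)>"
        + "<!ATTLIST document id NMTOKEN #REQUIRED>"
        + "<!ELEMENT text (#PCDATA)>"
        + "]>"
        + "<document id=\"3\"><text>Lorem ipsum</text></document>";

    // Parses an XML string with the default JDK DOM parser.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Null-safe attribute count: non-element nodes report null, not an empty map.
    static int attributeCount(Node p) {
        NamedNodeMap attributes = p.getAttributes();
        return attributes == null ? 0 : attributes.getLength();
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        Node doctype = doc.getDoctype();
        Node element = doc.getDocumentElement();
        System.out.println("doctype attributes map is null: " + (doctype.getAttributes() == null));
        System.out.println("doctype attribute count: " + attributeCount(doctype));
        System.out.println("element attribute count: " + attributeCount(element));
    }
}
```

A guard of this kind, or an element-type check before the node is treated as a document, would avoid the exception for document type nodes.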

If it worked with your SVN copy, maybe this issue depends on some file(s) other than SimpleXMLCollection?

What is interesting is that if I leave out the part "<!DOCTYPE document SYSTEM "collection/mydtd.dtd">", it works fine. I also tried to suppress the reading of the DTD according to this example: [forums.sun.com], but the error still occurs (maybe I did not insert the code snippet at the right position...)
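
A note on why suppressing the DTD read may not be enough: with the JDK DOM parser, skipping the external DTD file does not remove the document type node from the parsed tree, so a node named like the DOCTYPE is still present. The sketch below is an illustration only, not the snippet from the linked forum post and not Terrier code. The class and method names are hypothetical. It answers every external entity request with empty content and then optionally removes the document type node after parsing.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.xml.sax.InputSource;

public class DtdSkippingParser {

    // Hypothetical sample: refers to an external DTD file that need not exist.
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"utf-8\"?>"
        + "<!DOCTYPE document SYSTEM \"path_to/mydtd.dtd\">"
        + "<document id=\"3\"><text>Lorem ipsum</text></document>";

    // Parses without reading the external DTD: the resolver hands the parser
    // an empty stream for every external entity, including the DTD.
    static Document parseIgnoringDtd(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setEntityResolver(
                    (publicId, systemId) -> new InputSource(new StringReader("")));
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Removes the document type node, if any, from an already parsed document.
    static Document stripDoctype(Document doc) {
        DocumentType doctype = doc.getDoctype();
        if (doctype != null) {
            doc.removeChild(doctype);
        }
        return doc;
    }

    public static void main(String[] args) {
        Document doc = parseIgnoringDtd(SAMPLE);
        System.out.println("doctype still present: " + (doc.getDoctype() != null));
        stripDoctype(doc);
        System.out.println("doctype after strip:   " + (doc.getDoctype() != null));
    }
}
```

The DTD file is never opened, yet the document type node survives until it is removed explicitly. Whether this matches what happens inside Terrier 3.0 is not confirmed by this thread.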

Re: XML and Index has no terms
Posted by: craigm ()
Date: September 20, 2010 06:41PM

Ah. I have no idea about DTDs. Do you need to have the DTDs?

Craig

Re: XML and Index has no terms
Posted by: craigm ()
Date: September 20, 2010 07:20PM

I added a simple test case where the DTD exists.

<?xml version="1.0"?>
<!DOCTYPE document SYSTEM "/path/to/dtd.dtd">
<doc>test</doc>

where DTD contained:

<!ELEMENT doc ( text ) >

and everything worked as expected.

I also did a more complex DTD-based example and it worked OK as well.

Pass. If pointing to DTDs doesn't work for you, then remove the references to the DTDs?

Craig

Re: XML and Index has no terms
Posted by: lisi ()
Date: September 20, 2010 08:45PM

Hi!

I need to have dtds because I cannot change the source files. Therefore I tried to remove the references to them according to the example I posted: [forums.sun.com]

With this code the resulting .xml files no longer contain the DOCTYPE tag, yet the error still occurs. Only if I manually remove the tag before processing the collection does it work fine. I will check your test case. It could be that the error only occurs if the name in the DOCTYPE matches the xml.doctag (the tag that defines the document), so in your example it would be:

<?xml version="1.0"?>
<!DOCTYPE doc SYSTEM "/path/to/dtd.dtd">
<doc>test</doc>

But I will check on that tomorrow and will report on my findings! Thanks for your test cases, it helps me to minimize errors which only occur because of my mistakes.

Re: XML and Index has no terms
Posted by: lisi ()
Date: September 27, 2010 02:29PM

Sorry for taking so long to do the tests on my configuration...

My findings with your simple test from your last posting:
With my setup (= Terrier 3.0 + the patch you provided) it works, but if I change the line

<!ELEMENT doc ( text ) >
to
<!ELEMENT document ( text ) >

and change the

<doc>test</doc>
accordingly to
<document>test</document>,

I get the error:

ERROR - Index has no terms. Inverted index creation aborted.

So this seems to support my idea of errors when using the same string in !DOCTYPE as in the tag that defines the document.

Because I did not find a good solution for removing the references to the DTD, I have decided to try my luck with converting my XML into TREC-style documents.
Nevertheless, thank you for your support so far :)
