Terrier Core / TR-568

IllegalArgumentException while indexing XML files with SimpleXMLCollection

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.2, 5.0, 4.4
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      Hi guys. I'm trying to index some XML files extracted from PubMed. The error I'm getting is the following:

      18:14:01.080 [main] INFO o.terrier.indexing.CollectionFactory - Finished reading collection specification
      18:14:01.137 [main] INFO o.t.structures.indexing.Indexer - creating the data structures data_1
      18:14:01.144 [main] INFO o.t.s.indexing.LexiconBuilder - LexiconBuilder active - flushing every 100000 documents, or when memory threshold hit
      A problem occurred: java.lang.IllegalArgumentException: protocol = https host = null
      java.lang.IllegalArgumentException: protocol = https host = null
              at sun.net.spi.DefaultProxySelector.select(DefaultProxySelector.java:177)
              at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1150)
              at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1050)
              at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:177)
              at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1564)
              at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492)
              at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:263)
              at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)
              at org.apache.xerces.impl.XMLEntityManager.startEntity(Unknown Source)
              at org.apache.xerces.impl.XMLEntityManager.startDTDEntity(Unknown Source)
              at org.apache.xerces.impl.XMLDTDScannerImpl.setInputSource(Unknown Source)
              at org.apache.xerces.impl.XMLDocumentScannerImpl$DTDDispatcher.dispatch(Unknown Source)
              at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
              at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
              at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
              at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
              at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
              at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
              at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
              at org.terrier.indexing.SimpleXMLCollection.openNextFile(SimpleXMLCollection.java:606)
              at org.terrier.indexing.SimpleXMLCollection.nextDocument(SimpleXMLCollection.java:518)
              at org.terrier.structures.indexing.classical.BasicIndexer.createDirectIndex(BasicIndexer.java:243)
              at org.terrier.structures.indexing.Indexer.index(Indexer.java:366)
              at org.terrier.applications.TRECIndexing.index(TRECIndexing.java:155)
              at org.terrier.applications.TrecTerrier.run(TrecTerrier.java:410)
              at org.terrier.applications.TrecTerrier.applyOptions(TrecTerrier.java:606)
              at org.terrier.applications.TrecTerrier.main(TrecTerrier.java:230)
              at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
              at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
              at java.lang.reflect.Method.invoke(Method.java:498)
              at org.terrier.applications.AnyclassLauncher.main(AnyclassLauncher.java:45)
       
      I've tried several versions of Terrier and all of them fail in the same way. I'm attaching the properties files, as well as an example file from the collection.

        Attachments

          Activity

          craigm Craig Macdonald added a comment -

          I think the problem is your DTD in line 2.

          "https: // dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd"

          Can you remove the DTD statement from your example file and try again? We already try to disable DTD validation, but in this case it seems not to have worked (see https://github.com/terrier-org/terrier-core/blob/5.x/modules/core/src/main/java/org/terrier/indexing/SimpleXMLCollection.java#L415).

          This link https://rubenlaguna.com/post/2009-10-25-disable-dom-dtd-validation/ suggests another method to disable this "feature".

          Craig
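
As an illustration of the two ways of disabling DTD loading mentioned above, here is a minimal sketch using the standard JAXP API. It is not the code in SimpleXMLCollection; the class name `NoDtdParse`, the method `parseIgnoringDtd`, and the `example.invalid` DTD URL are hypothetical. The `load-external-dtd` feature is Xerces-specific, so the sketch tolerates parsers that reject it and also installs an EntityResolver that answers every external entity with an empty document.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class NoDtdParse {
    // Xerces feature that stops a non-validating parser from fetching the external DTD.
    private static final String LOAD_EXTERNAL_DTD =
        "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    /** Parses the given XML without fetching the DTD named in its DOCTYPE. */
    static Document parseIgnoringDtd(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        try {
            factory.setFeature(LOAD_EXTERNAL_DTD, false);
        } catch (ParserConfigurationException e) {
            // Assumption: a non-Xerces parser may not know this feature;
            // the entity resolver below still prevents the network fetch.
        }
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Resolve every external entity (including the DTD) to an empty stream.
        builder.setEntityResolver(
            (publicId, systemId) -> new InputSource(new StringReader("")));
        return builder.parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical document: the DTD URL does not exist and is never contacted.
        String xml = "<?xml version=\"1.0\"?>\n"
            + "<!DOCTYPE PubmedArticleSet PUBLIC \"-//EXAMPLE//DTD Hypothetical//EN\" "
            + "\"https://example.invalid/hypothetical.dtd\">\n"
            + "<PubmedArticleSet><PubmedArticle/></PubmedArticleSet>";
        Document doc = parseIgnoringDtd(xml);
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

Whether either setting would help inside Terrier depends on how SimpleXMLCollection configures its parser, which this sketch does not reproduce.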

          Chaosphere Teofan Clipa added a comment -

          Removing the line

          <!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2019//EN" "https: // dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd">

          from the XML file seems to solve the problem; the file is then indexed without errors.

          craigm Craig Macdonald added a comment -

          Yes, this was my suspicion. Are you OK to do that for all of your files?

          Craig

          Chaosphere Teofan Clipa added a comment -

          Yes, I'll write a script that removes the line from every file in the collection; it shouldn't take long.

          Thank you for your help.
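
A script of the kind described above could look roughly like the following sketch. It assumes the DOCTYPE declaration sits on a single line (as in the line quoted earlier in this thread), that collection files end in `.xml`, and that each file fits in memory; the class name `StripDoctype` and the method `stripDoctype` are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class StripDoctype {
    /**
     * Removes every line starting with "<!DOCTYPE" from each .xml file under root.
     * Returns the number of files that were rewritten.
     */
    static int stripDoctype(Path root) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile)
                        .filter(p -> p.getFileName().toString().endsWith(".xml"))
                        .collect(Collectors.toList());
        }
        int changed = 0;
        for (Path file : files) {
            // Assumption: files are UTF-8 and small enough to read at once.
            List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
            List<String> kept = lines.stream()
                .filter(line -> !line.trim().startsWith("<!DOCTYPE"))
                .collect(Collectors.toList());
            if (kept.size() != lines.size()) {
                Files.write(file, kept, StandardCharsets.UTF_8);
                changed++;
            }
        }
        return changed;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java StripDoctype <collection-directory>
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println("Rewrote " + stripDoctype(root) + " file(s)");
    }
}
```

The files are rewritten in place, so it is worth running this on a copy of the collection first.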

          Chaosphere Teofan Clipa added a comment -

          Hi. I have a question about my collection, and since the forum seems dead I'm posting it here.

          In my collection, I have to use the PMID as the document ID. The problem is that some documents contain something like this:

          <CommentsCorrectionsList>
          <CommentsCorrections RefType="CommentIn">
          <RefSource>Eur Spine J. 2006 Jan;15(1):8-15</RefSource>
          <PMID Version="1">16411129</PMID>
          </CommentsCorrections>
          </CommentsCorrectionsList>

          With the terrier.properties attached to the original post, Terrier seems to index the document with PMID=16411129 multiple times, so I end up with duplicate documents. Is there a way to specify in the Terrier properties which tag should be used as the ID, perhaps by giving a parent/child tag path? The documents start with:

          <PubmedArticle>
          <MedlineCitation Owner="NLM" Status="MEDLINE">
          <PMID Version="1">7072537</PMID>

          Is there any way to specify that, for the ID, I want only the PMID that is a direct child of MedlineCitation (itself a child of PubmedArticle)?

          If this is not the right place to post the issue, let me know.

          Thanks.
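
To make the distinction in the question concrete, the sketch below uses a standard XPath expression to select only the PMID that is a direct child of MedlineCitation, skipping the one nested under CommentsCorrections. This is a standalone illustration with the JDK's XPath API, not a Terrier property or feature; the class name `TopLevelPmid`, the method `topLevelPmids`, and the placement of CommentsCorrectionsList inside MedlineCitation in the sample are assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class TopLevelPmid {
    // Matches PMID only as a direct child of MedlineCitation, not deeper descendants.
    static final String TOP_LEVEL_PMID =
        "/PubmedArticleSet/PubmedArticle/MedlineCitation/PMID";

    /** Returns the article-level PMIDs found in the given XML, in document order. */
    static List<String> topLevelPmids(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes =
            (NodeList) xpath.evaluate(TOP_LEVEL_PMID, doc, XPathConstants.NODESET);
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            ids.add(nodes.item(i).getTextContent().trim());
        }
        return ids;
    }

    public static void main(String[] args) throws Exception {
        // Sample assembled from the fragments quoted in the comment above.
        String xml = "<PubmedArticleSet><PubmedArticle>"
            + "<MedlineCitation Owner=\"NLM\" Status=\"MEDLINE\">"
            + "<PMID Version=\"1\">7072537</PMID>"
            + "<CommentsCorrectionsList>"
            + "<CommentsCorrections RefType=\"CommentIn\">"
            + "<RefSource>Eur Spine J. 2006 Jan;15(1):8-15</RefSource>"
            + "<PMID Version=\"1\">16411129</PMID>"
            + "</CommentsCorrections></CommentsCorrectionsList>"
            + "</MedlineCitation></PubmedArticle></PubmedArticleSet>";
        System.out.println(topLevelPmids(xml));
    }
}
```

A path like this could be used in a preprocessing step that rewrites each article so the document ID is unambiguous before indexing.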


            People

            • Assignee:
              craigm Craig Macdonald
              Reporter:
              Chaosphere Teofan Clipa