Terrier Core / TR-559

SimpleXMLCollection is unable to help in indexing XML documents

    Details

      Description

      Hi,
      To index XML files I used the attached code (IndexingXmlDocs.java), in which I tested both SimpleFileCollection and SimpleXMLCollection. With SimpleFileCollection, Terrier indexes the documents, but during searching the getOccurences method returns only 1 even when the term appears multiple times in an XML document (it works fine with plain text files).
      With SimpleXMLCollection, the index files are generated but they contain no data.
      My question is:
      What should be changed in the attached IndexingXmlDocs.java file so that it indexes the XML files correctly?

      Please help!

        Attachments

        1. 037329400X.xml
          23 kB
        2. 037541200X.xml
          181 kB
        3. IndexingXmlDocs.java
          1 kB
        4. Screenshot.png
          132 kB
        5. terrier.properties
          2 kB
        6. terrier.properties
          2 kB

          Activity

          craigm Craig Macdonald added a comment -

          SimpleXMLCollection is able to help in indexing XML documents. What do your XML documents look like?

          You need to configure SimpleXMLCollection properly:
          What tag marks the start/end of a document?
          What tag defines the unique identifier of each document?
          What tag(s) contain text?

          See the javadoc of SimpleXMLCollection for which properties it uses.

          Craig
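
The three questions above amount to a mapping from tag names to roles: which tag delimits a document, which tag holds its identifier, and which tags hold the text to index. As a rough illustration only, the sketch below applies such a mapping with the JDK's DOM parser. It does not use Terrier or SimpleXMLCollection, and the tag names `book`, `isbn`, `title` and `review` are hypothetical; they are not taken from the attached XML files. The actual property names that SimpleXMLCollection reads are listed in its javadoc.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XmlTagMappingSketch {
    // Hypothetical answers to the three configuration questions.
    static final String DOC_TAG = "book";                  // marks start/end of a document
    static final String ID_TAG = "isbn";                   // holds the unique identifier
    static final String[] TEXT_TAGS = {"title", "review"}; // hold the text to index

    /** Returns docId -> concatenated text of the chosen text tags. */
    static Map<String, String> extract(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document dom = builder.parse(new InputSource(new StringReader(xml)));
        Map<String, String> docs = new LinkedHashMap<>();
        NodeList docNodes = dom.getElementsByTagName(DOC_TAG);
        for (int i = 0; i < docNodes.getLength(); i++) {
            Element doc = (Element) docNodes.item(i);
            NodeList ids = doc.getElementsByTagName(ID_TAG);
            if (ids.getLength() == 0) {
                continue; // no identifier tag: skip this document
            }
            String id = ids.item(0).getTextContent().trim();
            StringBuilder text = new StringBuilder();
            for (String tag : TEXT_TAGS) {
                NodeList parts = doc.getElementsByTagName(tag);
                for (int j = 0; j < parts.getLength(); j++) {
                    if (text.length() > 0) {
                        text.append(' ');
                    }
                    text.append(parts.item(j).getTextContent().trim());
                }
            }
            docs.put(id, text.toString());
        }
        return docs;
    }

    /** Counts case-insensitive whole-token occurrences of term in text. */
    static int countOccurrences(String text, String term) {
        int n = 0;
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.equals(term.toLowerCase())) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<books>"
            + "<book><isbn>A1</isbn><title>Red fox</title>"
            + "<review>fox fox</review></book>"
            + "<book><isbn>B2</isbn><title>Blue</title></book>"
            + "</books>";
        for (Map.Entry<String, String> e : extract(xml).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue()
                + " (fox x" + countOccurrences(e.getValue(), "fox") + ")");
        }
    }
}
```

If none of the configured tag names occurs in the input, `extract` returns an empty map. That is one plausible way to end up with index files that contain no data, although the sketch does not establish that this is what happens inside Terrier.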

          Rocky Xanadul Irfan Ullah added a comment -

          Thank you, Sir.
          An XML file, 037329400X.xml, is attached for your review.
          Thanks

          craigm Craig Macdonald added a comment -

          So following my feedback, can you propose a configuration?

          craigm Craig Macdonald added a comment -

          I would suggest using the command line until you are familiar with Terrier. I still do most of my experiments from the command line.

          Craig

          Rocky Xanadul Irfan Ullah added a comment -

          Respected Sir,
          I have attached the terrier.properties file and an XML file, 037541200X.xml, to be indexed. Please check where I am making a mistake, as I get the error message given in the .
          Indexing and searching work fine if I use this terrier.properties, i.e., when indexing with TRECCollection.

          Please help!


            People

            • Assignee:
              craigm Craig Macdonald
              Reporter:
              Rocky Xanadul Irfan Ullah
            • Watchers:
              2
