  1. Terrier Core
  2. TR-559

SimpleXMLCollection is unable to help in indexing XML documents

    Details

      Description

      Hi
      While indexing XML files, I used the attached code (IndexingXmlDocs.java), in which I tested both SimpleFileCollection and SimpleXMLCollection. With SimpleFileCollection, Terrier indexes the document, but at search time the getOccurences method returns only 1 even when the term appears multiple times in an XML document (it works fine with text files).
      If I use SimpleXMLCollection, then the index files are generated but they contain no data.
      My question is:
      What can be changed in the attached IndexingXmlDocs.java file so that it correctly indexes the XML files?

      Please help!

        Attachments

        1. 037329400X.xml
          23 kB
        2. 037541200X.xml
          181 kB
        3. IndexingXmlDocs.java
          1 kB
        4. Screenshot.png
          132 kB
        5. terrier.properties
          2 kB
        6. terrier.properties
          2 kB

          Activity

          Rocky Xanadul Irfan Ullah created issue -
          craigm Craig Macdonald added a comment -

          SimpleXMLCollection is able to help in indexing XML documents. What do your XML documents look like?

          You need to configure SimpleXMLCollection properly:
          What tag marks the start/end of a document?
          What tag defines the unique identifier of each document?
          What tag(s) contain text?

          See the javadoc of SimpleXMLCollection for which properties it uses.

          Craig
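
The three questions above can be checked against a sample file before running Terrier at all. The following is a minimal sketch that uses only the JDK's DOM parser, not Terrier's own code: given a candidate document tag, identifier tag, and list of text tags, it prints what would be extracted per document. The class name `XmlConfigCheck`, the method `extract`, and the tag names `book`, `isbn`, `title`, and `review` are hypothetical and chosen for illustration; they are not taken from the attached files.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XmlConfigCheck {

    /**
     * Returns a map from document identifier to the concatenated text of the
     * chosen text tags, with one entry per element named docTag.
     * Illustrative helper only; this is not how Terrier parses XML internally.
     */
    public static Map<String, String> extract(String xml, String docTag, String idTag,
                                              List<String> textTags) throws Exception {
        Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Map<String, String> docs = new LinkedHashMap<>();
        NodeList docNodes = dom.getElementsByTagName(docTag);
        for (int i = 0; i < docNodes.getLength(); i++) {
            Element doc = (Element) docNodes.item(i);
            NodeList ids = doc.getElementsByTagName(idTag);
            if (ids.getLength() == 0) {
                continue; // no identifier inside this document element: skip it
            }
            String id = ids.item(0).getTextContent().trim();
            StringBuilder text = new StringBuilder();
            for (String tag : textTags) {
                NodeList parts = doc.getElementsByTagName(tag);
                for (int j = 0; j < parts.getLength(); j++) {
                    if (text.length() > 0) {
                        text.append(' ');
                    }
                    text.append(parts.item(j).getTextContent().trim());
                }
            }
            docs.put(id, text.toString());
        }
        return docs;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample; replace with the content and tag names of a real file.
        String xml = "<books>"
                + "<book><isbn>111</isbn><title>Dog tales</title><review>A dog and a dog</review></book>"
                + "<book><isbn>222</isbn><title>Cat tales</title><review>One cat</review></book>"
                + "</books>";
        Map<String, String> docs =
                extract(xml, "book", "isbn", Arrays.asList("title", "review"));
        docs.forEach((id, text) -> System.out.println(id + " -> " + text));
    }
}
```

If the map comes back empty for a real file, the chosen tag names do not match the file's structure, which is worth ruling out before looking elsewhere. The tag names that do work here are the values to carry into the SimpleXMLCollection properties listed in its javadoc.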

          Rocky Xanadul Irfan Ullah made changes -
          Field Original Value New Value
          Attachment 037329400X.xml [ 10711 ]
          Rocky Xanadul Irfan Ullah added a comment -

          Thanks Sir
          An XML file, 037329400X.xml, is attached here for your review.
          Thanks

          craigm Craig Macdonald added a comment -

          So following my feedback, can you propose a configuration?

          craigm Craig Macdonald added a comment -

          I would suggest using the command line until you are familiar with Terrier. I still do most of my experiments from the command line.

          Craig

          Rocky Xanadul Irfan Ullah made changes -
          Comment [ Respected Sir

          I am a beginner at running retrieval experiments with Terrier. I first integrated Terrier into an Eclipse project using Maven by following the tutorials. That worked for me when searching simple text files. Now I am unsure how to run batch retrieval experiments with Terrier on the Social Book Search collection, from which I uploaded a sample file: I don't know whether I should use the binary version, where batch retrieval and evaluation are performed from the command line, or continue using Eclipse for the purpose.

          I read your discussions with other users on the Forum (unfortunately, I am unable to log in there even after registering successfully), in which you mentioned "*scripting is essential for batch retrieval*".
          *Kindly guide me on whether I should use the binary version from the command line and set the properties in the etc folder accordingly, or do the needful using the Eclipse project*.

          Please help. ]
          Rocky Xanadul Irfan Ullah made changes -
          Comment [ Thank you very much sir.... ]
          Rocky Xanadul Irfan Ullah made changes -
          Attachment terrier.properties [ 10712 ]
          Rocky Xanadul Irfan Ullah made changes -
          Attachment 037541200X.xml [ 10713 ]
          Rocky Xanadul Irfan Ullah made changes -
          Attachment Screenshot.png [ 10714 ]
          Rocky Xanadul Irfan Ullah made changes -
          Attachment terrier.properties [ 10715 ]
          Rocky Xanadul Irfan Ullah added a comment -

          Respected Sir
          I have attached the terrier.properties file and an XML file, 037541200X.xml, to be indexed. Please check them to see where I am making mistakes, as I get the error message given in the .
          The indexing and searching work fine if I use this terrier.properties, i.e., indexing using TRECCollection.

          Please help!


            People

            • Assignee:
              craigm Craig Macdonald
              Reporter:
              Rocky Xanadul Irfan Ullah