  1. Terrier Core
  2. TR-124

When processing the docid tag in a MEDLINE-format XML file, the XML context path needs to be taken into account

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0
    • Fix Version/s: 3.5
    • Component/s: .indexing
    • Labels:
      None

      Description


       In MEDLINE-format XML files, the doctag is generally "MedlineCitation" and the idtag is 'PMID',
       so I usually set the properties as below.
          xml.doctag=MedlineCitation
          xml.idtag=PMID


       However, in several MEDLINE documents 'PMID' is used twice, in different contexts,
       as in the example below.

      <PubmedArticle>
          <MedlineCitation Owner="NLM" Status="MEDLINE">
              <PMID Version="1">11031400</PMID>
             ....
              ...
              <CommentsCorrectionsList>
                  <CommentsCorrections RefType="CommentIn">
                      <RefSource>Arch Fam Med. 2000 Sep-Oct;9(9):921-2</RefSource>
                      <PMID Version="1">11031401</PMID>
                  </CommentsCorrections>
              ....
          </PubmedData>
      </PubmedArticle>


      When I used Terrier 3.0 to index the above document, the docid was indexed as '11031401', not '11031400'.
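
The reported behaviour is consistent with an id lookup that matches on tag name alone, so that a later 'PMID' overwrites an earlier one. The following is a minimal Python sketch of that failure mode, not Terrier's actual code: `NaiveIdHandler` and `naive_docid` are hypothetical names, and `SAMPLE` is a trimmed, well-formed version of the document above with the elided parts left out.

```python
import xml.sax

# Trimmed, well-formed stand-in for the reporter's document (assumption:
# the elided parts are simply omitted here).
SAMPLE = """<PubmedArticle>
  <MedlineCitation Owner="NLM" Status="MEDLINE">
    <PMID Version="1">11031400</PMID>
    <CommentsCorrectionsList>
      <CommentsCorrections RefType="CommentIn">
        <RefSource>Arch Fam Med. 2000 Sep-Oct;9(9):921-2</RefSource>
        <PMID Version="1">11031401</PMID>
      </CommentsCorrections>
    </CommentsCorrectionsList>
  </MedlineCitation>
</PubmedArticle>"""


class NaiveIdHandler(xml.sax.ContentHandler):
    """Hypothetical handler: every element named id_tag overwrites the docid,
    regardless of where it sits in the tree."""

    def __init__(self, id_tag):
        super().__init__()
        self.id_tag = id_tag
        self.in_id = False
        self.buf = []
        self.docid = None

    def startElement(self, name, attrs):
        if name == self.id_tag:
            self.in_id = True
            self.buf = []

    def characters(self, content):
        if self.in_id:
            self.buf.append(content)

    def endElement(self, name):
        if name == self.id_tag:
            # No context check: the last matching tag wins.
            self.docid = "".join(self.buf).strip()
            self.in_id = False


def naive_docid(xml_text, id_tag="PMID"):
    handler = NaiveIdHandler(id_tag)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.docid


print(naive_docid(SAMPLE))
```

With this name-only matching, the nested 'PMID' inside CommentsCorrections replaces the citation's own id, which matches the symptom described in this report.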

          Activity

          craigm Craig Macdonald added a comment -

          I guess what is needed is to be able to blacklist the <CommentsCorrectionsList> part of the tree?

          wakeup06 SungBin Choi added a comment -

          That could be a handy solution for this case. But for more complicated cases (when the XML structure is too complex, so that users do not have, or do not want to acquire, detailed knowledge of the inside of their input file), providing a way of designating the XML context path declaratively might be a good alternative to consider in a future release, I think.
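
One possible reading of the "declarative context path" idea is that the id is named by its path relative to the document tag rather than by tag name alone. The sketch below illustrates that reading in Python under stated assumptions: `docids_by_path`, `doc_tag`, and `id_path` are hypothetical names, no such option is confirmed to exist in Terrier, and `SAMPLE` is a trimmed, well-formed version of the reporter's document.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed stand-in for the reporter's document (assumption:
# the elided parts are simply omitted here).
SAMPLE = """<PubmedArticle>
  <MedlineCitation Owner="NLM" Status="MEDLINE">
    <PMID Version="1">11031400</PMID>
    <CommentsCorrectionsList>
      <CommentsCorrections RefType="CommentIn">
        <RefSource>Arch Fam Med. 2000 Sep-Oct;9(9):921-2</RefSource>
        <PMID Version="1">11031401</PMID>
      </CommentsCorrections>
    </CommentsCorrectionsList>
  </MedlineCitation>
</PubmedArticle>"""


def docids_by_path(xml_text, doc_tag, id_path):
    """Return one id per doc_tag element, resolved via a path relative to it.

    id_path uses ElementTree's limited XPath syntax, so "PMID" matches only
    a direct child of the document element, never a deeper descendant.
    """
    root = ET.fromstring(xml_text)
    ids = []
    for doc in root.iter(doc_tag):  # iter() also yields root if it matches
        node = doc.find(id_path)
        ids.append(node.text.strip() if node is not None and node.text else None)
    return ids


print(docids_by_path(SAMPLE, "MedlineCitation", "PMID"))
```

Because the path is anchored at the document element, the 'PMID' nested under CommentsCorrectionsList is reachable only by spelling out its full path, so the user does not need to know which other subtrees to exclude.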

          craigm Craig Macdonald added a comment -

          Tagging for 3.1

          craigm Craig Macdonald added a comment -

          Nut, this issue requires adding an xml.skip property, such that tags can be skipped.

          nutli Nut Limsopatham added a comment -

          Added a function that disregards terms in the tags identified in the xml.blacklist property, and indexes terms in sub-tags under the tags identified in xml.tags.

          nutli Nut Limsopatham added a comment -

          Changed the name of the property for disregarded tags from xml.blacklist to xml.skip.
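
The skip behaviour discussed in this thread can be illustrated with a small Python sketch. It is not the committed Terrier implementation: `SkippingIdHandler`, `docid_with_skip`, and the `skip_tags` parameter are hypothetical stand-ins for whatever xml.skip configures, and `SAMPLE` is a trimmed, well-formed version of the reporter's document. The idea shown is that everything inside a skipped tag is ignored, including any nested id tag.

```python
import xml.sax

# Trimmed, well-formed stand-in for the reporter's document (assumption:
# the elided parts are simply omitted here).
SAMPLE = """<PubmedArticle>
  <MedlineCitation Owner="NLM" Status="MEDLINE">
    <PMID Version="1">11031400</PMID>
    <CommentsCorrectionsList>
      <CommentsCorrections RefType="CommentIn">
        <RefSource>Arch Fam Med. 2000 Sep-Oct;9(9):921-2</RefSource>
        <PMID Version="1">11031401</PMID>
      </CommentsCorrections>
    </CommentsCorrectionsList>
  </MedlineCitation>
</PubmedArticle>"""


class SkippingIdHandler(xml.sax.ContentHandler):
    """Hypothetical handler: ignores every element inside a skipped subtree."""

    def __init__(self, id_tag, skip_tags):
        super().__init__()
        self.id_tag = id_tag
        self.skip_tags = set(skip_tags)
        self.skip_depth = 0  # > 0 while inside a skipped subtree
        self.in_id = False
        self.buf = []
        self.docid = None

    def startElement(self, name, attrs):
        if self.skip_depth or name in self.skip_tags:
            self.skip_depth += 1  # track nesting so we know when to resume
            return
        if name == self.id_tag:
            self.in_id = True
            self.buf = []

    def characters(self, content):
        if self.in_id:
            self.buf.append(content)

    def endElement(self, name):
        if self.skip_depth:
            self.skip_depth -= 1
            return
        if name == self.id_tag:
            self.docid = "".join(self.buf).strip()
            self.in_id = False


def docid_with_skip(xml_text, id_tag="PMID", skip_tags=("CommentsCorrectionsList",)):
    handler = SkippingIdHandler(id_tag, skip_tags)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.docid


print(docid_with_skip(SAMPLE))
```

Passing an empty `skip_tags` reproduces the original symptom, since the nested 'PMID' is then seen last and overwrites the citation's own id.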


            People

            • Assignee: nutli Nut Limsopatham
            • Reporter: wakeup06 SungBin Choi