Class SimpleMedlineXMLCollection

  • All Implemented Interfaces:
    java.io.Closeable, java.lang.AutoCloseable, Collection

    public class SimpleMedlineXMLCollection
    extends SimpleXMLCollection
    Initial implementation of a class that generates a Collection of Documents from a series of XML files in the Medline format. It processes only a limited number of documents from an XML file per iteration, to avoid OutOfMemory problems when the XML file is very large.

    Properties:

    • lowercase - lowercases all terms obtained. Highly recommended.
    • indexing.simplexmlcollection.reformxml - will try to repair broken & entities.
    • xml.doc.buffer.size - The maximum number of documents to process per iteration.
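As a rough illustration of how the three properties above might appear in a properties file and be read back, here is a minimal sketch using only java.util.Properties. The class name MedlineCollectionProperties, the values shown, and the fallback of 500 are all hypothetical; the defaults Terrier actually applies are not stated on this page, and Terrier's own property-loading code is not used here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class MedlineCollectionProperties {
    // Illustrative values only (assumption): not the defaults used by Terrier.
    static final String EXAMPLE = String.join("\n",
            "lowercase=true",
            "indexing.simplexmlcollection.reformxml=true",
            "xml.doc.buffer.size=1000");

    // Parses properties-file text into a Properties object.
    static Properties parse(String text) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(text));
        return props;
    }

    // Reads xml.doc.buffer.size, using the given fallback when the key is absent.
    static int bufferSize(Properties props, int fallback) {
        String raw = props.getProperty("xml.doc.buffer.size", String.valueOf(fallback));
        return Integer.parseInt(raw.trim());
    }

    public static void main(String[] args) throws IOException {
        Properties props = parse(EXAMPLE);
        System.out.println("lowercase = " + props.getProperty("lowercase"));
        System.out.println("reformxml = "
                + props.getProperty("indexing.simplexmlcollection.reformxml"));
        System.out.println("buffer size = " + bufferSize(props, 500));
    }
}
```

A smaller xml.doc.buffer.size should lower peak memory use at the cost of more iterations per file; the right value depends on document size and available heap.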
    Author:
    Ben He
    • Field Detail

      • currentFileDocCounter

        protected int currentFileDocCounter
        The number of documents processed in the current XML file.
      • docTag

        public final java.lang.String docTag
        The tag of documents in the XML files.
        See Also:
        Constant Field Values
      • docEndTag

        public final java.lang.String docEndTag
        The end tag of documents in the XML files.
        See Also:
        Constant Field Values
      • fileTag

        public final java.lang.String fileTag
        The tag indicating the start of an XML file.
        See Also:
        Constant Field Values
      • fileEndTag

        public final java.lang.String fileEndTag
        The tag indicating the end of an XML file.
        See Also:
        Constant Field Values
      • EOL

        public final java.lang.String EOL
        The end of line string.
      • NUMBER_OF_DOCS_IN_BUFFER

        protected final int NUMBER_OF_DOCS_IN_BUFFER
        The number of documents to process per iteration.
    • Constructor Detail

      • SimpleMedlineXMLCollection

        public SimpleMedlineXMLCollection()
        The default constructor.
      • SimpleMedlineXMLCollection

        public SimpleMedlineXMLCollection(java.lang.String CollectionSpecFilename,
                                          java.lang.String BlacklistSpecFilename)
        An alternative constructor.
        Parameters:
        CollectionSpecFilename - The name of the file containing the location of XML files in the collection.
        BlacklistSpecFilename - The name of the file containing the location of the blacklisted XML files in the collection.
      • SimpleMedlineXMLCollection

        public SimpleMedlineXMLCollection(java.lang.String CollectionSpecFilename,
                                          java.lang.String ignored1,
                                          java.lang.String BlacklistSpecFilename,
                                          java.lang.String ignored2)
        Constructor required by TRECIndexing.
      • SimpleMedlineXMLCollection

        public SimpleMedlineXMLCollection(java.util.List<java.lang.String> files,
                                          java.lang.String ignored1,
                                          java.lang.String BlacklistSpecFilename,
                                          java.lang.String ignored2)
        Constructor required by TRECIndexing.
    • Method Detail

      • openNextFile

        protected boolean openNextFile()
        Parses up to a limited number of documents from the current XML file. The limit is specified by the property xml.doc.buffer.size.
        Overrides:
        openNextFile in class SimpleXMLCollection
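The bounded, per-iteration processing described for openNextFile can be pictured with the following self-contained sketch. It is not Terrier's implementation: the class BatchedDocReader, its nextBatch method, and the tags <Doc> and </Doc> are hypothetical, and the sketch assumes each document's start tag begins a line and its end tag ends a line. The parameters docTag and docEndTag only mirror the field names documented above; their actual values are not given on this page.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class BatchedDocReader {
    private final BufferedReader reader;
    private final String docTag;       // hypothetical start tag, e.g. "<Doc>"
    private final String docEndTag;    // hypothetical end tag, e.g. "</Doc>"
    private final int maxDocsPerBatch; // plays the role of xml.doc.buffer.size

    BatchedDocReader(BufferedReader reader, String docTag,
                     String docEndTag, int maxDocsPerBatch) {
        this.reader = reader;
        this.docTag = docTag;
        this.docEndTag = docEndTag;
        this.maxDocsPerBatch = maxDocsPerBatch;
    }

    /**
     * Reads at most maxDocsPerBatch documents from the underlying reader.
     * An empty list means the input is exhausted. A document cut off by
     * the end of input is dropped.
     */
    List<String> nextBatch() throws IOException {
        List<String> batch = new ArrayList<>();
        StringBuilder current = null; // non-null while inside a document
        String line;
        // The size check comes first, so no line is consumed once the batch is full.
        while (batch.size() < maxDocsPerBatch && (line = reader.readLine()) != null) {
            String trimmed = line.trim();
            if (current == null) {
                if (!trimmed.startsWith(docTag)) {
                    continue; // skip lines outside any document
                }
                current = new StringBuilder();
            }
            current.append(line).append('\n');
            if (trimmed.endsWith(docEndTag)) {
                batch.add(current.toString());
                current = null;
            }
        }
        return batch;
    }

    public static void main(String[] args) throws IOException {
        String xml = "<Set>\n<Doc>\na\n</Doc>\n<Doc>\nb\n</Doc>\n<Doc>\nc\n</Doc>\n</Set>\n";
        BatchedDocReader r = new BatchedDocReader(
                new BufferedReader(new StringReader(xml)), "<Doc>", "</Doc>", 2);
        List<String> batch;
        while (!(batch = r.nextBatch()).isEmpty()) {
            System.out.println("batch of " + batch.size() + " document(s)");
        }
    }
}
```

Because only one batch is held in memory at a time, peak memory is bounded by the batch size rather than by the size of the whole file, which is the motivation the class description gives for the limit.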