org.terrier.indexing
Class SimpleMedlineXMLCollection

java.lang.Object
  extended by org.terrier.indexing.SimpleXMLCollection
      extended by org.terrier.indexing.SimpleMedlineXMLCollection
All Implemented Interfaces:
java.io.Closeable, Collection

public class SimpleMedlineXMLCollection
extends SimpleXMLCollection

Initial implementation of a class that generates a Collection of Documents from a series of XML files in the Medline format. It processes a limited number of documents from an XML file at a time, to avoid OutOfMemory problems when the XML file is too large.

Properties:

  • lowercase - lower case all terms obtained. Highly recommended.
  • indexing.simplexmlcollection.reformxml - attempts to repair broken &amp; entities.
  • xml.doc.buffer.size - The maximum number of documents to process per iteration.

Author:
    Ben He
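
The properties listed above are plain key/value settings. As a minimal sketch of how they might be written and read, the following uses only java.util.Properties rather than Terrier's own configuration classes. The class name MedlinePropertiesSketch, the fallback value of 1000, and the sample value of 500 are assumptions made for illustration; the actual default used by this class is not stated on this page.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class MedlinePropertiesSketch {
    // Hypothetical fallback for this sketch only; the real default is not documented here.
    static final int ASSUMED_DEFAULT_BUFFER_SIZE = 1000;

    /** Parses key=value text in java.util.Properties syntax. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Returns xml.doc.buffer.size as a positive int, or the assumed fallback. */
    static int bufferSize(Properties props) {
        String raw = props.getProperty("xml.doc.buffer.size");
        if (raw == null) {
            return ASSUMED_DEFAULT_BUFFER_SIZE;
        }
        try {
            int n = Integer.parseInt(raw.trim());
            return n > 0 ? n : ASSUMED_DEFAULT_BUFFER_SIZE;
        } catch (NumberFormatException e) {
            return ASSUMED_DEFAULT_BUFFER_SIZE;
        }
    }

    public static void main(String[] args) {
        // The three properties named above, with illustrative values.
        Properties props = parse(String.join("\n",
                "lowercase=true",
                "indexing.simplexmlcollection.reformxml=true",
                "xml.doc.buffer.size=500"));
        System.out.println("buffer size: " + bufferSize(props));
    }
}
```

A smaller buffer size should lower peak memory use at the cost of more parsing iterations per file.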

    Field Summary
    protected  int currentFileDocCounter
              The number of documents processed in the current XML file.
     java.lang.String docEndTag
              The end tag of documents in the XML files.
     java.lang.String docTag
              The tag of documents in the XML files.
     java.lang.String EOL
              The end of line string.
     java.lang.String fileEndTag
              The tag indicating the end of an XML file.
     java.lang.String fileTag
              The tag indicating the start of an XML file.
    protected  int NUMBER_OF_DOCS_IN_BUFFER
              The number of documents to process per iteration.
     
    Fields inherited from class org.terrier.indexing.SimpleXMLCollection
    bReformXML, dbFactory, dBuilder, DocIDBlacklist, DocIdIsAttribute, DocIdLocation, DocumentElements, Documents, DocumentTags, ELEMENT_ATTR_SEPARATOR, EOC, FilesToProcess, logger, TermElements, TermsInAttributes, thisDoc, xmlDoc
     
    Constructor Summary
    SimpleMedlineXMLCollection()
              The default constructor.
    SimpleMedlineXMLCollection(java.lang.String CollectionSpecFilename, java.lang.String BlacklistSpecFilename)
              An alternative constructor.
     
    Method Summary
    protected  boolean openNextFile()
              Parses at most a limited number of documents from the current XML file.
     
    Methods inherited from class org.terrier.indexing.SimpleXMLCollection
    close, endOfCollection, findDocumentElement, getDocument, hasNext, initialiseParser, initialiseTags, main, next, nextDocument, remove, reset
     
    Methods inherited from class java.lang.Object
    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
     

    Field Detail

    currentFileDocCounter

    protected int currentFileDocCounter
    The number of documents processed in the current XML file.


    docTag

    public final java.lang.String docTag
    The tag of documents in the XML files.

    See Also:
    Constant Field Values

    docEndTag

    public final java.lang.String docEndTag
    The end tag of documents in the XML files.

    See Also:
    Constant Field Values

    fileTag

    public final java.lang.String fileTag
    The tag indicating the start of an XML file.

    See Also:
    Constant Field Values

    fileEndTag

    public final java.lang.String fileEndTag
    The tag indicating the end of an XML file.

    See Also:
    Constant Field Values

    EOL

    public final java.lang.String EOL
    The end of line string.


    NUMBER_OF_DOCS_IN_BUFFER

    protected final int NUMBER_OF_DOCS_IN_BUFFER
    The number of documents to process per iteration.

    Constructor Detail

    SimpleMedlineXMLCollection

    public SimpleMedlineXMLCollection()
    The default constructor.


    SimpleMedlineXMLCollection

    public SimpleMedlineXMLCollection(java.lang.String CollectionSpecFilename,
                                      java.lang.String BlacklistSpecFilename)
    An alternative constructor.

    Parameters:
    CollectionSpecFilename - The name of the file containing the location of XML files in the collection.
    BlacklistSpecFilename - The name of the file containing the location of the blacklisted XML files in the collection.
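
    Both constructor arguments name files that list locations. The sketch below prepares two such files, assuming a one-path-per-line layout; that layout, the class name CollectionSpecSketch, and the sample paths are assumptions for illustration and are not confirmed by this page. The constructor call itself appears only in a comment, since it needs Terrier on the classpath.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class CollectionSpecSketch {
    /** Renders the spec text: one location per line (assumed layout). */
    static String render(List<String> locations) {
        return String.join("\n", locations);
    }

    /** Writes the rendered spec to the given file and returns its path. */
    static Path writeSpec(Path specFile, List<String> locations) {
        try {
            return Files.writeString(specFile, render(locations));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("medline-spec");
        // Hypothetical XML file locations.
        Path spec = writeSpec(dir.resolve("collection.spec"),
                List.of("/data/medline/part-0001.xml", "/data/medline/part-0002.xml"));
        Path blacklist = writeSpec(dir.resolve("blacklist.spec"), List.of());
        // With Terrier available, the two file names would be passed as:
        // new SimpleMedlineXMLCollection(spec.toString(), blacklist.toString());
        System.out.println(spec + " " + blacklist);
    }
}
```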

    Method Detail

    openNextFile

    protected boolean openNextFile()
    Parses at most a limited number of documents from the current XML file. The limit is set by the property xml.doc.buffer.size.

    Overrides:
    openNextFile in class SimpleXMLCollection
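
    The bounded parsing described above can be pictured as copying up to a fixed number of complete documents into a small, well-formed buffer that is then parsed on its own. The following is a minimal sketch of that idea, not the actual implementation: the class name BufferedBatchSketch, the method nextBatch, and all tag values are assumptions (the real values of docTag, docEndTag, fileTag and fileEndTag are listed under Constant Field Values).

```java
import java.util.Iterator;
import java.util.List;

class BufferedBatchSketch {
    // Hypothetical tag values, chosen for illustration only.
    static final String FILE_TAG = "<MedlineCitationSet>";
    static final String FILE_END_TAG = "</MedlineCitationSet>";
    static final String DOC_TAG = "<MedlineCitation ";
    static final String DOC_END_TAG = "</MedlineCitation>";
    static final String EOL = "\n";

    /**
     * Reads lines until maxDocs complete documents have been copied or the
     * input ends. Returns the documents wrapped in the file tags, or null
     * if no complete document was found. An unfinished trailing document
     * is discarded so the result stays well-formed.
     */
    static String nextBatch(Iterator<String> lines, int maxDocs) {
        StringBuilder batch = new StringBuilder(FILE_TAG).append(EOL);
        StringBuilder doc = new StringBuilder();
        int docs = 0;
        boolean inDoc = false;
        while (docs < maxDocs && lines.hasNext()) {
            String line = lines.next();
            if (!inDoc && line.trim().startsWith(DOC_TAG)) {
                inDoc = true;
            }
            if (inDoc) {
                doc.append(line).append(EOL);
                if (line.contains(DOC_END_TAG)) {
                    batch.append(doc);   // document is complete, keep it
                    doc.setLength(0);
                    inDoc = false;
                    docs++;
                }
            }
        }
        return docs == 0 ? null : batch.append(FILE_END_TAG).toString();
    }

    public static void main(String[] args) {
        Iterator<String> lines = List.of(
                "<MedlineCitationSet>",
                "<MedlineCitation Owner=\"NLM\">", "<PMID>1</PMID>", "</MedlineCitation>",
                "<MedlineCitation Owner=\"NLM\">", "<PMID>2</PMID>", "</MedlineCitation>",
                "<MedlineCitation Owner=\"NLM\">", "<PMID>3</PMID>", "</MedlineCitation>",
                "</MedlineCitationSet>").iterator();
        String batch;
        while ((batch = nextBatch(lines, 2)) != null) {
            System.out.println(batch);
            System.out.println("----");
        }
    }
}
```

    Because the iterator keeps its position between calls, each call resumes where the previous batch stopped, which mirrors processing one file over several iterations.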


    Terrier 3.5. Copyright © 2004-2011 University of Glasgow