Class SimpleXMLCollection

  • All Implemented Interfaces:
    java.io.Closeable, java.lang.AutoCloseable, Collection
    Direct Known Subclasses:
    SimpleMedlineXMLCollection

    public class SimpleXMLCollection
    extends java.lang.Object
    implements Collection
    Initial implementation of a class that generates a Collection with Documents from a series of XML files.

    Properties:

    • indexing.simplexmlcollection.reformxml - if set, attempts to repair malformed & entities before parsing.
    • xml.blacklist.docids - docnos of documents that will not be indexed.
    • xml.doctag - tag that marks a document.
    • xml.idtag - tag that contains the docno. Attributes are specified as "element.attribute".
    • xml.terms - list of tags whose children contain terms that should be indexed.
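
The "element.attribute" convention used by xml.idtag can be illustrated with a short sketch. It assumes the separator is a single dot, inferred from the wording above (this page does not state the value of ELEMENT_ATTR_SEPARATOR). The names `TagSpec`, `parse` and `isAttribute` are hypothetical and are not part of Terrier.

```java
import java.util.Objects;

/** Hypothetical helper: splits a tag specification such as "doc.id" into element and attribute. */
final class TagSpec {
    // Assumption: the separator is ".", inferred from the "element.attribute" wording.
    static final String SEPARATOR = ".";

    final String element;
    final String attribute; // null when the specification names an element only

    private TagSpec(String element, String attribute) {
        this.element = element;
        this.attribute = attribute;
    }

    /** Splits at the first separator; a spec without a separator names a plain element. */
    static TagSpec parse(String spec) {
        Objects.requireNonNull(spec, "spec");
        int idx = spec.indexOf(SEPARATOR);
        if (idx < 0) {
            return new TagSpec(spec, null);
        }
        return new TagSpec(spec.substring(0, idx), spec.substring(idx + SEPARATOR.length()));
    }

    /** True when the specification points at an attribute rather than element content. */
    boolean isAttribute() {
        return attribute != null;
    }
}
```

Under these assumptions, `TagSpec.parse("doc.id")` names the `id` attribute of `doc` elements, while `TagSpec.parse("docno")` names the content of a `docno` element. This mirrors the role the DocIdIsAttribute flag appears to play.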
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected static boolean bReformXML
      Reform invalid XML by copying to temporary file.
      protected javax.xml.parsers.DocumentBuilderFactory dbFactory
      The xml parser factory for DOM
      protected javax.xml.parsers.DocumentBuilder dBuilder
      the xml parser
      protected java.util.HashSet<java.lang.String> DocIDBlacklist
      A blacklist of documents to ignore.
      protected boolean DocIdIsAttribute
      set if DocIdLocation contains ELEMENT_ATTR_SEPARATOR
      protected java.lang.String DocIdLocation
      Contains the name of the tag that contains the document name
      protected java.util.HashSet<java.lang.String> DocumentElements
      Contains the names of tags that encapsulate entire documents
      protected java.util.LinkedList<org.terrier.indexing.SimpleXMLCollection.XMLDocument> Documents
      A list of all the document objects in this XML file
      protected boolean DocumentTags
      Set if DocumentElements.size > 0
      static java.lang.String ELEMENT_ATTR_SEPARATOR
      element attribute separator
      protected boolean EOC  
      protected java.util.List<java.lang.String> FilesToProcess
      The list of files to process.
      protected static org.slf4j.Logger logger  
      protected boolean PropertiesInAttibutes
      set if any PropertyElements contains ELEMENT_ATTR_SEPARATOR
      protected java.util.Map<java.lang.String,​java.lang.Integer> PropertyElements
      Contains the names of tags and attributes that encapsulate meta properties with their lengths
      protected java.util.HashSet<java.lang.String> TermElements
      Contains the names of tags and attributes that encapsulate terms
      protected boolean TermsInAttributes
      set if any TermElements contains ELEMENT_ATTR_SEPARATOR
      protected org.terrier.indexing.SimpleXMLCollection.XMLDocument thisDoc
      the current XML document that is being read by the indexer
      protected org.w3c.dom.Document xmlDoc
      the parsed structure of the XML file we currently have open
    • Constructor Summary

      Constructors 
      Constructor Description
      SimpleXMLCollection()
      Construct a SimpleXMLCollection
      SimpleXMLCollection​(java.lang.String CollectionSpecFilename, java.lang.String BlacklistSpecFilename)
      Construct a SimpleXMLCollection
      SimpleXMLCollection​(java.lang.String addressCollectionFilename, java.lang.String ignored1, java.lang.String BlacklistSpecFilename, java.lang.String ignored2)
      Additional constructor required by TRECIndexing.
      SimpleXMLCollection​(java.util.List<java.lang.String> filesToProcess)
      Construct a SimpleXMLCollection
      SimpleXMLCollection​(java.util.List<java.lang.String> collSpecFiles, java.lang.String ignored1, java.lang.String BlacklistSpecFilename, java.lang.String ignored2)
      Additional constructor required by TRECIndexing.
    • Method Summary

      Modifier and Type Method Description
      void close()
      Not supported by this implementation.
      boolean endOfCollection()
      Returns true if the end of the collection has been reached
      protected boolean findDocumentElement​(org.w3c.dom.Node n)  
      Document getDocument()
      Get the document object representing the current document.
      boolean hasNext()
      Check whether there is a next document in the collection.
      protected void initialiseParser()  
      protected void initialiseTags()  
      protected void loadBlacklist​(java.lang.String BlacklistSpecFilename)  
      static void main​(java.lang.String[] args)
      main
      Document next()
      get the next document
      boolean nextDocument()
      Move the collection to the start of the next document.
      protected boolean openNextFile()  
      void remove()
      Unsupported by this Collection implementation; throws UnsupportedOperationException on every invocation.
      void reset()
      Resets the Collection iterator to the start of the collection.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • logger

        protected static final org.slf4j.Logger logger
      • ELEMENT_ATTR_SEPARATOR

        public static final java.lang.String ELEMENT_ATTR_SEPARATOR
        element attribute separator
      • bReformXML

        protected static final boolean bReformXML
        Reform invalid XML by copying to temporary file. NB This may be dangerous
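
A minimal sketch of what repairing broken & entities can look like. It is not taken from the Terrier source, which this page describes only as "copying to temporary file". The class name `AmpersandReformer` and the regular expression are assumptions made for illustration. The sketch escapes any `&` that does not start a syntactically complete entity or character reference.

```java
import java.util.regex.Pattern;

/** Hypothetical illustration: escapes bare ampersands so a DOM parser can accept the text. */
final class AmpersandReformer {
    // Matches "&" unless it is followed by a name, decimal or hex reference ending in ";".
    private static final Pattern BARE_AMP = Pattern.compile(
            "&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)");

    /** Returns the input with every bare "&" replaced by "&amp;". */
    static String reform(String xml) {
        return BARE_AMP.matcher(xml).replaceAll("&amp;");
    }
}
```

With this pattern, `reform("Fish & Chips")` yields `Fish &amp; Chips`, and an existing `&amp;` or `&#38;` is left untouched. Well-formed but undeclared named entities are not repaired by this sketch and would still be rejected by an XML parser.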
      • DocumentElements

        protected java.util.HashSet<java.lang.String> DocumentElements
        Contains the names of tags that encapsulate entire documents
      • DocumentTags

        protected boolean DocumentTags
        Set if DocumentElements.size > 0
      • TermElements

        protected java.util.HashSet<java.lang.String> TermElements
        Contains the names of tags and attributes that encapsulate terms
      • DocIdLocation

        protected java.lang.String DocIdLocation
        Contains the name of the tag that contains the document name
      • DocIdIsAttribute

        protected boolean DocIdIsAttribute
        set if DocIdLocation contains ELEMENT_ATTR_SEPARATOR
      • TermsInAttributes

        protected boolean TermsInAttributes
        set if any TermElements contains ELEMENT_ATTR_SEPARATOR
      • PropertiesInAttibutes

        protected boolean PropertiesInAttibutes
        set if any PropertyElements contains ELEMENT_ATTR_SEPARATOR
      • PropertyElements

        protected java.util.Map<java.lang.String,​java.lang.Integer> PropertyElements
        Contains the names of tags and attributes that encapsulate meta properties with their lengths
      • dbFactory

        protected javax.xml.parsers.DocumentBuilderFactory dbFactory
        The xml parser factory for DOM
      • dBuilder

        protected javax.xml.parsers.DocumentBuilder dBuilder
        the xml parser
      • xmlDoc

        protected org.w3c.dom.Document xmlDoc
        the parsed structure of the XML file we currently have open
      • Documents

        protected java.util.LinkedList<org.terrier.indexing.SimpleXMLCollection.XMLDocument> Documents
        A list of all the document objects in this XML file
      • thisDoc

        protected org.terrier.indexing.SimpleXMLCollection.XMLDocument thisDoc
        the current XML document that is being read by the indexer
      • EOC

        protected boolean EOC
      • DocIDBlacklist

        protected java.util.HashSet<java.lang.String> DocIDBlacklist
        A blacklist of documents to ignore.
      • FilesToProcess

        protected java.util.List<java.lang.String> FilesToProcess
        The list of files to process.
    • Constructor Detail

      • SimpleXMLCollection

        public SimpleXMLCollection​(java.util.List<java.lang.String> filesToProcess)
        Construct a SimpleXMLCollection
        Parameters:
        filesToProcess - the list of files to process
      • SimpleXMLCollection

        public SimpleXMLCollection()
        Construct a SimpleXMLCollection
      • SimpleXMLCollection

        public SimpleXMLCollection​(java.lang.String addressCollectionFilename,
                                   java.lang.String ignored1,
                                   java.lang.String BlacklistSpecFilename,
                                   java.lang.String ignored2)
        Additional constructor required by TRECIndexing.
      • SimpleXMLCollection

        public SimpleXMLCollection​(java.util.List<java.lang.String> collSpecFiles,
                                   java.lang.String ignored1,
                                   java.lang.String BlacklistSpecFilename,
                                   java.lang.String ignored2)
        Additional constructor required by TRECIndexing.
      • SimpleXMLCollection

        public SimpleXMLCollection​(java.lang.String CollectionSpecFilename,
                                   java.lang.String BlacklistSpecFilename)
        Construct a SimpleXMLCollection
        Parameters:
        CollectionSpecFilename -
        BlacklistSpecFilename -
    • Method Detail

      • loadBlacklist

        protected void loadBlacklist​(java.lang.String BlacklistSpecFilename)
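
This page does not document the blacklist file format. The sketch below assumes one docno per line, with blank lines ignored. That layout is an assumption, not a documented Terrier behaviour. The class `BlacklistSketch` and its `read` method are hypothetical names. The result is a set comparable to the DocIDBlacklist field.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration of loading docnos to skip; assumes one docno per line. */
final class BlacklistSketch {
    /** Reads trimmed, non-empty lines from the reader into a set of docnos. */
    static Set<String> read(Reader in) {
        Set<String> docnos = new HashSet<>();
        try (BufferedReader br = new BufferedReader(in)) {
            String line;
            while ((line = br.readLine()) != null) {
                String docno = line.trim();
                if (!docno.isEmpty()) {
                    docnos.add(docno);
                }
            }
        } catch (IOException e) {
            // Rethrow unchecked so callers of the sketch need no throws clause.
            throw new UncheckedIOException(e);
        }
        return docnos;
    }
}
```

A document would then be skipped during indexing when its docno is contained in the returned set.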
      • initialiseParser

        protected void initialiseParser()
      • initialiseTags

        protected void initialiseTags()
      • close

        public void close()
        Not supported by this implementation.
        Specified by:
        close in interface java.lang.AutoCloseable
        Specified by:
        close in interface java.io.Closeable
      • hasNext

        public boolean hasNext()
        Check whether there is a next document in the collection.
        Returns:
        boolean
      • next

        public Document next()
        get the next document
        Returns:
        next document
      • remove

        public void remove()
        Unsupported by this Collection implementation; throws UnsupportedOperationException on every invocation.
      • endOfCollection

        public boolean endOfCollection()
        Returns true if the end of the collection has been reached
        Specified by:
        endOfCollection in interface Collection
        Returns:
        boolean true if the end of collection has been reached, otherwise it returns false.
      • nextDocument

        public boolean nextDocument()
        Move the collection to the start of the next document.
        Specified by:
        nextDocument in interface Collection
        Returns:
        boolean true if there exists another document in the collection, otherwise it returns false.
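
The nextDocument() and getDocument() methods suggest a simple traversal loop. The sketch below shows that loop against a hypothetical stand-in interface, `DocSource`, which mirrors only the three method names listed on this page so that the example runs without Terrier on the classpath. `ListSource` and `drain` are likewise illustrative names.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class CollectionLoopSketch {
    /** Hypothetical stand-in mirroring three methods listed on this page. */
    interface DocSource<D> {
        boolean nextDocument();
        D getDocument();
        boolean endOfCollection();
    }

    /** In-memory stand-in, present only to make the sketch runnable. */
    static final class ListSource<D> implements DocSource<D> {
        private final Iterator<D> it;
        private D current;

        ListSource(List<D> docs) {
            this.it = docs.iterator();
        }

        public boolean nextDocument() {
            if (!it.hasNext()) {
                current = null;
                return false;
            }
            current = it.next();
            return true;
        }

        public D getDocument() {
            return current;
        }

        public boolean endOfCollection() {
            return !it.hasNext();
        }
    }

    /** Advances with nextDocument() and collects each current document until none remain. */
    static <D> List<D> drain(DocSource<D> source) {
        List<D> seen = new ArrayList<>();
        while (source.nextDocument()) {
            seen.add(source.getDocument());
        }
        return seen;
    }
}
```

With a real SimpleXMLCollection the same loop shape would presumably apply: call nextDocument() until it returns false and read the current document with getDocument() on each pass.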
      • findDocumentElement

        protected boolean findDocumentElement​(org.w3c.dom.Node n)
      • getDocument

        public Document getDocument()
        Get the document object representing the current document.
        Specified by:
        getDocument in interface Collection
        Returns:
        Document the current document.
      • reset

        public void reset()
        Resets the Collection iterator to the start of the collection. This Collection implementation does not support reset.
        Specified by:
        reset in interface Collection
      • openNextFile

        protected boolean openNextFile()
      • main

        public static void main​(java.lang.String[] args)
                         throws java.io.IOException
        main
        Parameters:
        args -
        Throws:
        java.io.IOException