org.terrier.indexing
Class SimpleXMLCollection

java.lang.Object
  extended by org.terrier.indexing.SimpleXMLCollection
All Implemented Interfaces:
java.io.Closeable, Collection
Direct Known Subclasses:
SimpleMedlineXMLCollection

public class SimpleXMLCollection
extends java.lang.Object
implements Collection

Initial implementation of a class that generates a Collection with Documents from a series of XML files.

Properties:


Field Summary
protected static boolean bReformXML
          Reform invalid XML by copying it to a temporary file.
protected  javax.xml.parsers.DocumentBuilderFactory dbFactory
          The xml parser factory for DOM
protected  javax.xml.parsers.DocumentBuilder dBuilder
          the xml parser
protected  java.util.HashSet<java.lang.String> DocIDBlacklist
          A blacklist of documents to ignore.
protected  boolean DocIdIsAttribute
          set if DocIdLocation contains ELEMENT_ATTR_SEPARATOR
protected  java.lang.String DocIdLocation
          Contains the name of the tag that contains the document name
protected  java.util.HashSet<java.lang.String> DocumentElements
          Contains the names of tags that encapsulate entire documents
protected  java.util.LinkedList<org.terrier.indexing.SimpleXMLCollection.XMLDocument> Documents
          A list of all the document objects in this XML file
protected  boolean DocumentTags
          Set if DocumentElements.size > 0
static java.lang.String ELEMENT_ATTR_SEPARATOR
          element attribute separator
protected  boolean EOC
           
protected  java.util.LinkedList<java.lang.String> FilesToProcess
          The list of files to process.
protected static org.apache.log4j.Logger logger
           
protected  java.util.HashSet<java.lang.String> TermElements
          Contains the names of tags and attributes that encapsulate terms
protected  boolean TermsInAttributes
          set if any TermElements contains ELEMENT_ATTR_SEPARATOR
protected  org.terrier.indexing.SimpleXMLCollection.XMLDocument thisDoc
          the current XML document that is being read by the indexer
protected  org.w3c.dom.Document xmlDoc
          the parsed structure of the XML file we currently have open
 
Constructor Summary
SimpleXMLCollection()
          Construct a SimpleXMLCollection
SimpleXMLCollection(java.util.List<java.lang.String> filesToProcess)
          Construct a SimpleXMLCollection
SimpleXMLCollection(java.lang.String CollectionSpecFilename, java.lang.String BlacklistSpecFilename)
          Construct a SimpleXMLCollection
 
Method Summary
 void close()
           Not supported by this implementation.
 boolean endOfCollection()
          Returns true if the end of the collection has been reached
protected  boolean findDocumentElement(org.w3c.dom.Node n)
           
 Document getDocument()
          Get the document object representing the current document.
 boolean hasNext()
          Check whether there is a next document in the collection
protected  void initialiseParser()
           
protected  void initialiseTags()
           
static void main(java.lang.String[] args)
          main
 Document next()
          get the next document
 boolean nextDocument()
          Move the collection to the start of the next document.
protected  boolean openNextFile()
           
 void remove()
          Unsupported by this Collection implementation; throws UnsupportedOperationException on all invocations.
 void reset()
          Resets the Collection iterator to the start of the collection.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

logger

protected static final org.apache.log4j.Logger logger

ELEMENT_ATTR_SEPARATOR

public static final java.lang.String ELEMENT_ATTR_SEPARATOR
element attribute separator

See Also:
Constant Field Values

bReformXML

protected static final boolean bReformXML
Reform invalid XML by copying it to a temporary file. NB: this may be dangerous.


DocumentElements

protected java.util.HashSet<java.lang.String> DocumentElements
Contains the names of tags that encapsulate entire documents
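As an illustration of how a set of document tag names can drive extraction from a parsed DOM tree, here is a minimal sketch using only the standard javax.xml.parsers and org.w3c.dom APIs. The class DocumentElementSketch and its methods are hypothetical and are not part of Terrier; the rule that documents do not nest inside one another is also an assumption.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not Terrier code: finds every element whose tag name
// is listed in a configured set of "document" tags.
class DocumentElementSketch {

    // Parse an XML string into a DOM tree; checked exceptions are wrapped.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Return matching elements in document order.
    static List<Element> findDocumentElements(Node start, Set<String> documentTags) {
        List<Element> found = new ArrayList<>();
        collect(start, documentTags, found);
        return found;
    }

    private static void collect(Node node, Set<String> documentTags, List<Element> found) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && documentTags.contains(node.getNodeName())) {
            found.add((Element) node);
            return; // assumption: documents are not nested inside other documents
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), documentTags, found);
        }
    }

    public static void main(String[] args) {
        Set<String> tags = new HashSet<>(Arrays.asList("doc"));
        Document xml = parse("<root><doc id=\"a\">one</doc><doc id=\"b\">two</doc></root>");
        for (Element e : findDocumentElements(xml, tags)) {
            System.out.println(e.getAttribute("id") + ": " + e.getTextContent());
        }
    }
}
```

The recursion stops at the first matching element on each branch, so each document is reported once even if it contains further markup.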


DocumentTags

protected boolean DocumentTags
Set if DocumentElements.size > 0


TermElements

protected java.util.HashSet<java.lang.String> TermElements
Contains the names of tags and attributes that encapsulate terms


DocIdLocation

protected java.lang.String DocIdLocation
Contains the name of the tag that contains the document name


DocIdIsAttribute

protected boolean DocIdIsAttribute
set if DocIdLocation contains ELEMENT_ATTR_SEPARATOR


TermsInAttributes

protected boolean TermsInAttributes
set if any TermElements contains ELEMENT_ATTR_SEPARATOR
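The two flags above depend on whether a configured name contains the element attribute separator. A minimal sketch of that check follows. The class TagSpecSketch is hypothetical, and the separator value ":" is an assumption chosen for illustration; the actual value of ELEMENT_ATTR_SEPARATOR is listed under Constant Field Values.

```java
import java.util.Arrays;
import java.util.Collection;

// Hypothetical helper, not Terrier code: splits a configured name such as
// "element<sep>attribute" into its element and attribute parts.
class TagSpecSketch {

    // Assumed separator, for illustration only.
    static final String SEPARATOR = ":";

    final String element;
    final String attribute; // null when the spec names an element only

    TagSpecSketch(String spec) {
        int pos = spec.indexOf(SEPARATOR);
        if (pos < 0) {
            this.element = spec;
            this.attribute = null;
        } else {
            this.element = spec.substring(0, pos);
            this.attribute = spec.substring(pos + SEPARATOR.length());
        }
    }

    // Mirrors the idea behind DocIdIsAttribute: the spec points at an attribute.
    boolean isAttribute() {
        return attribute != null;
    }

    // Mirrors the idea behind TermsInAttributes: true if any spec names an attribute.
    static boolean anyInAttributes(Collection<String> specs) {
        for (String spec : specs) {
            if (new TagSpecSketch(spec).isAttribute()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        TagSpecSketch docId = new TagSpecSketch("doc" + SEPARATOR + "id");
        System.out.println(docId.element + " / " + docId.attribute);
        System.out.println(anyInAttributes(Arrays.asList("title", "body")));
    }
}
```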


dbFactory

protected javax.xml.parsers.DocumentBuilderFactory dbFactory
The xml parser factory for DOM


dBuilder

protected javax.xml.parsers.DocumentBuilder dBuilder
the xml parser


xmlDoc

protected org.w3c.dom.Document xmlDoc
the parsed structure of the XML file we currently have open


Documents

protected java.util.LinkedList<org.terrier.indexing.SimpleXMLCollection.XMLDocument> Documents
A list of all the document objects in this XML file


thisDoc

protected org.terrier.indexing.SimpleXMLCollection.XMLDocument thisDoc
the current XML document that is being read by the indexer


EOC

protected boolean EOC

DocIDBlacklist

protected java.util.HashSet<java.lang.String> DocIDBlacklist
A blacklist of documents to ignore.
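To illustrate how a HashSet-based blacklist can be used to skip documents, here is a minimal sketch. The class BlacklistSketch is hypothetical, and the one-identifier-per-line input format is an assumption; this page does not describe the format of the blacklist specification file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not Terrier code: loads document identifiers to ignore
// and filters a list of identifiers against them.
class BlacklistSketch {

    // Assumed format: one document identifier per line, blank lines ignored.
    static Set<String> load(Reader source) {
        Set<String> ids = new HashSet<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                String id = line.trim();
                if (!id.isEmpty()) {
                    ids.add(id);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return ids;
    }

    // Keep only identifiers that are not blacklisted, preserving input order.
    static List<String> keepAllowed(List<String> docIds, Set<String> blacklist) {
        List<String> kept = new ArrayList<>();
        for (String id : docIds) {
            if (!blacklist.contains(id)) {
                kept.add(id);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> blacklist = load(new StringReader("d2\nd4\n"));
        System.out.println(keepAllowed(Arrays.asList("d1", "d2", "d3", "d4"), blacklist));
    }
}
```

A hash set keeps each membership test at constant expected time, which matters when every document in a large collection is checked.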


FilesToProcess

protected java.util.LinkedList<java.lang.String> FilesToProcess
The list of files to process.

Constructor Detail

SimpleXMLCollection

public SimpleXMLCollection(java.util.List<java.lang.String> filesToProcess)
Construct a SimpleXMLCollection

Parameters:
filesToProcess - the list of files to process

SimpleXMLCollection

public SimpleXMLCollection()
Construct a SimpleXMLCollection


SimpleXMLCollection

public SimpleXMLCollection(java.lang.String CollectionSpecFilename,
                           java.lang.String BlacklistSpecFilename)
Construct a SimpleXMLCollection

Parameters:
CollectionSpecFilename -
BlacklistSpecFilename -
Method Detail

initialiseParser

protected void initialiseParser()

initialiseTags

protected void initialiseTags()

close

public void close()
Not supported by this implementation.

Specified by:
close in interface java.io.Closeable

hasNext

public boolean hasNext()
Check whether there is a next document in the collection

Returns:
boolean true if there is a next document in the collection, otherwise false.

next

public Document next()
get the next document

Returns:
next document

remove

public void remove()
Unsupported by this Collection implementation; throws UnsupportedOperationException on all invocations.
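The hasNext/next/remove trio follows the familiar read-only iterator pattern. A minimal sketch of that pattern over an in-memory list is given below; ReadOnlyIteratorSketch is a hypothetical stand-in, not the Terrier class, and it says nothing about how SimpleXMLCollection advances through files.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in, not Terrier code: an iterator that supports reading
// but rejects removal on every call.
class ReadOnlyIteratorSketch<T> implements Iterator<T> {

    private final List<T> items;
    private int position = 0;

    ReadOnlyIteratorSketch(List<T> items) {
        this.items = items;
    }

    @Override
    public boolean hasNext() {
        return position < items.size();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return items.get(position++);
    }

    @Override
    public void remove() {
        // Removal is never allowed, regardless of iterator state.
        throw new UnsupportedOperationException("remove() is not supported");
    }

    public static void main(String[] args) {
        Iterator<String> it = new ReadOnlyIteratorSketch<>(Arrays.asList("a", "b"));
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```

Callers that need to drop documents should filter them while reading rather than call remove().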


endOfCollection

public boolean endOfCollection()
Returns true if the end of the collection has been reached

Specified by:
endOfCollection in interface Collection
Returns:
boolean true if the end of collection has been reached, otherwise it returns false.

nextDocument

public boolean nextDocument()
Move the collection to the start of the next document.

Specified by:
nextDocument in interface Collection
Returns:
boolean true if there exists another document in the collection, otherwise it returns false.
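The nextDocument/getDocument/endOfCollection methods suggest a simple read loop: advance, then fetch the current document, until no document remains. The sketch below shows that loop against a stand-in interface. DocSource, ListSource and drain are hypothetical names, not Terrier types, and the point at which endOfCollection() becomes true (here, after nextDocument() first returns false) is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch, not Terrier code: the read loop implied by
// nextDocument(), getDocument() and endOfCollection().
class CollectionLoopSketch {

    // Stand-in for the three Collection methods documented on this page.
    interface DocSource<D> {
        boolean nextDocument();
        D getDocument();
        boolean endOfCollection();
    }

    // In-memory source used only to make the sketch runnable.
    static final class ListSource implements DocSource<String> {
        private final Iterator<String> it;
        private String current;
        private boolean eoc = false;

        ListSource(List<String> docs) {
            this.it = docs.iterator();
        }

        @Override
        public boolean nextDocument() {
            if (it.hasNext()) {
                current = it.next();
                return true;
            }
            current = null;
            eoc = true; // assumption: flag is set once advancing fails
            return false;
        }

        @Override
        public String getDocument() {
            return current;
        }

        @Override
        public boolean endOfCollection() {
            return eoc;
        }
    }

    // Advance first, then read the current document.
    static <D> List<D> drain(DocSource<D> source) {
        List<D> out = new ArrayList<>();
        while (source.nextDocument()) {
            out.add(source.getDocument());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(drain(new ListSource(Arrays.asList("doc1", "doc2"))));
    }
}
```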

findDocumentElement

protected boolean findDocumentElement(org.w3c.dom.Node n)

getDocument

public Document getDocument()
Get the document object representing the current document.

Specified by:
getDocument in interface Collection
Returns:
Document the current document.

reset

public void reset()
Resets the Collection iterator to the start of the collection. This Collection implementation does not support reset.

Specified by:
reset in interface Collection

openNextFile

protected boolean openNextFile()

main

public static void main(java.lang.String[] args)
main

Parameters:
args -


Terrier 3.5. Copyright © 2004-2011 University of Glasgow