java.lang.Object
  org.terrier.indexing.SimpleXMLCollection
public class SimpleXMLCollection
Initial implementation of a class that generates a Collection with Documents from a series of XML files.
Properties:
| Field Summary | |
|---|---|
| protected static boolean | bReformXML: Reform invalid XML by copying it to a temporary file. |
| protected javax.xml.parsers.DocumentBuilderFactory | dbFactory: The XML parser factory for DOM. |
| protected javax.xml.parsers.DocumentBuilder | dBuilder: The XML parser. |
| protected java.util.HashSet<java.lang.String> | DocIDBlacklist: A blacklist of documents to ignore. |
| protected boolean | DocIdIsAttribute: Set if DocIdLocation contains ELEMENT_ATTR_SEPARATOR. |
| protected java.lang.String | DocIdLocation: Contains the name of the tag that contains the document name. |
| protected java.util.HashSet<java.lang.String> | DocumentElements: Contains the names of tags that encapsulate entire documents. |
| protected java.util.LinkedList<org.terrier.indexing.SimpleXMLCollection.XMLDocument> | Documents: A list of all the document objects in this XML file. |
| protected boolean | DocumentTags: Set if DocumentElements.size > 0. |
| static java.lang.String | ELEMENT_ATTR_SEPARATOR: Element attribute separator. |
| protected boolean | EOC |
| protected java.util.LinkedList<java.lang.String> | FilesToProcess: The list of files to process. |
| protected static org.apache.log4j.Logger | logger |
| protected java.util.HashSet<java.lang.String> | TermElements: Contains the names of tags and attributes that encapsulate terms. |
| protected boolean | TermsInAttributes: Set if any of the TermElements contains ELEMENT_ATTR_SEPARATOR. |
| protected org.terrier.indexing.SimpleXMLCollection.XMLDocument | thisDoc: The current XML document that is being read by the indexer. |
| protected org.w3c.dom.Document | xmlDoc: The parsed structure of the XML file that is currently open. |
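
The DocIdIsAttribute and TermsInAttributes flags above are described as being set when a configured location contains ELEMENT_ATTR_SEPARATOR. The following is a minimal sketch of how such a location string might be split into an element name and an attribute name. It is not Terrier code: the class `LocationSketch`, its methods, and the separator value `":"` are all assumptions made for illustration, since this page does not state the actual value of ELEMENT_ATTR_SEPARATOR.

```java
/** Illustrative only; not part of Terrier. */
class LocationSketch {
    // ASSUMPTION: the real value of ELEMENT_ATTR_SEPARATOR is not given on this page.
    static final String SEPARATOR = ":";

    /** True if the location names an attribute, i.e. it contains the separator. */
    static boolean isAttribute(String location) {
        return location.contains(SEPARATOR);
    }

    /** The element part of the location (the whole string if no separator is present). */
    static String elementName(String location) {
        int i = location.indexOf(SEPARATOR);
        return i < 0 ? location : location.substring(0, i);
    }

    /** The attribute part of the location, or null if the location is a plain element. */
    static String attributeName(String location) {
        int i = location.indexOf(SEPARATOR);
        return i < 0 ? null : location.substring(i + SEPARATOR.length());
    }

    public static void main(String[] args) {
        String location = "doc" + SEPARATOR + "id"; // hypothetical configured value
        System.out.println(isAttribute(location));
        System.out.println(elementName(location));
        System.out.println(attributeName(location));
    }
}
```

Computing the flag once, when the location is configured, means the check does not need to be repeated for every node visited during parsing.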
| Constructor Summary | |
|---|---|
| SimpleXMLCollection(): Construct a SimpleXMLCollection. | |
| SimpleXMLCollection(java.util.List<java.lang.String> filesToProcess): Construct a SimpleXMLCollection. | |
| SimpleXMLCollection(java.lang.String CollectionSpecFilename, java.lang.String BlacklistSpecFilename): Construct a SimpleXMLCollection. | |
| Method Summary | |
|---|---|
| void | close(): Not supported by this implementation. |
| boolean | endOfCollection(): Returns true if the end of the collection has been reached. |
| protected boolean | findDocumentElement(org.w3c.dom.Node n) |
| Document | getDocument(): Get the document object representing the current document. |
| boolean | hasNext(): Check whether there is a next document in the collection. |
| protected void | initialiseParser() |
| protected void | initialiseTags() |
| static void | main(java.lang.String[] args): Main method. |
| Document | next(): Get the next document. |
| boolean | nextDocument(): Move the collection to the start of the next document. |
| protected boolean | openNextFile() |
| void | remove(): Unsupported by this Collection implementation; throws UnsupportedOperationException on all invocations. |
| void | reset(): Resets the Collection iterator to the start of the collection. |
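
The traversal methods summarised above (nextDocument(), getDocument(), endOfCollection()) suggest a simple consume loop. The sketch below shows that loop against a hypothetical stand-in class, `StandInCollection`, backed by a list of strings. It is an assumption-laden illustration, not Terrier's implementation: the stand-in does not parse XML, and when exactly the real endOfCollection() becomes true is not stated on this page.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

/** Illustrative only; not part of Terrier. */
class CollectionLoopSketch {

    /** Hypothetical stand-in exposing the three traversal methods named in the summary. */
    static class StandInCollection {
        private final LinkedList<String> documents; // plays the role of the Documents field
        private String thisDoc;                     // plays the role of the thisDoc field
        private boolean eoc;                        // plays the role of the EOC field

        StandInCollection(List<String> docs) {
            this.documents = new LinkedList<>(docs);
        }

        /** Moves to the next document; returns false once nothing is left. */
        boolean nextDocument() {
            thisDoc = documents.poll(); // null when the queue is empty
            // ASSUMPTION: the end is flagged when no further document can be produced.
            eoc = (thisDoc == null);
            return !eoc;
        }

        String getDocument() {
            return thisDoc;
        }

        boolean endOfCollection() {
            return eoc;
        }
    }

    /** Consumes every document using the nextDocument()/getDocument() pattern. */
    static List<String> drain(StandInCollection collection) {
        List<String> seen = new ArrayList<>();
        while (collection.nextDocument()) {
            seen.add(collection.getDocument());
        }
        return seen;
    }

    public static void main(String[] args) {
        StandInCollection c = new StandInCollection(List.of("d1", "d2"));
        System.out.println(drain(c));
        System.out.println(c.endOfCollection());
    }
}
```

The class also lists hasNext() and next(), so an iterator-style loop is presumably possible as well; the sketch uses only the nextDocument() form.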
| Methods inherited from class java.lang.Object | 
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait | 
| Field Detail | 
|---|
protected static final org.apache.log4j.Logger logger
public static final java.lang.String ELEMENT_ATTR_SEPARATOR
protected static final boolean bReformXML
protected java.util.HashSet<java.lang.String> DocumentElements
protected boolean DocumentTags
protected java.util.HashSet<java.lang.String> TermElements
protected java.lang.String DocIdLocation
protected boolean DocIdIsAttribute
protected boolean TermsInAttributes
protected javax.xml.parsers.DocumentBuilderFactory dbFactory
protected javax.xml.parsers.DocumentBuilder dBuilder
protected org.w3c.dom.Document xmlDoc
protected java.util.LinkedList<org.terrier.indexing.SimpleXMLCollection.XMLDocument> Documents
protected org.terrier.indexing.SimpleXMLCollection.XMLDocument thisDoc
protected boolean EOC
protected java.util.HashSet<java.lang.String> DocIDBlacklist
protected java.util.LinkedList<java.lang.String> FilesToProcess
| Constructor Detail | 
|---|
public SimpleXMLCollection(java.util.List<java.lang.String> filesToProcess)
Parameters: filesToProcess
public SimpleXMLCollection()
public SimpleXMLCollection(java.lang.String CollectionSpecFilename, java.lang.String BlacklistSpecFilename)
Parameters: CollectionSpecFilename, BlacklistSpecFilename
| Method Detail | 
|---|
protected void initialiseParser()
protected void initialiseTags()
public void close()
Specified by: close in interface java.io.Closeable
public boolean hasNext()
public Document next()
public void remove()
public boolean endOfCollection()
Specified by: endOfCollection in interface Collection
public boolean nextDocument()
Specified by: nextDocument in interface Collection
protected boolean findDocumentElement(org.w3c.dom.Node n)
public Document getDocument()
Specified by: getDocument in interface Collection
public void reset()
Specified by: reset in interface Collection
protected boolean openNextFile()
public static void main(java.lang.String[] args)
Parameters: args
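
The fields dbFactory, dBuilder, xmlDoc and DocumentElements, together with findDocumentElement(org.w3c.dom.Node n), point to a DOM-based approach: parse a file, then walk the tree looking for tags that encapsulate whole documents. The sketch below shows one way this could be done with the standard JAXP and DOM APIs. The class `DocumentElementSketch` and its methods are hypothetical, and the traversal order and the decision not to descend into a matched element are assumptions rather than documented Terrier behaviour.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/** Illustrative only; not part of Terrier. */
class DocumentElementSketch {

    /** Parses the XML text and returns the trimmed text of every matching document element. */
    static List<String> collectDocuments(String xml, Set<String> documentElements) {
        try {
            DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
            Document xmlDoc = dBuilder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> found = new ArrayList<>();
            walk(xmlDoc.getDocumentElement(), documentElements, found);
            return found;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Depth-first walk in document order. */
    private static void walk(Node n, Set<String> documentElements, List<String> found) {
        if (n.getNodeType() == Node.ELEMENT_NODE
                && documentElements.contains(n.getNodeName())) {
            found.add(n.getTextContent().trim());
            return; // ASSUMPTION: nested document elements are not treated separately
        }
        NodeList children = n.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), documentElements, found);
        }
    }

    public static void main(String[] args) {
        String xml = "<root><doc>alpha</doc><group><doc>beta</doc></group></root>";
        System.out.println(collectDocuments(xml, Set.of("doc")));
    }
}
```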