public class SimpleXMLCollection extends Object implements Collection
Properties:
| Modifier and Type | Field and Description | 
|---|---|
| protected static boolean | bReformXML - Reform invalid XML by copying it to a temporary file. | 
| protected DocumentBuilderFactory | dbFactory - The XML parser factory for DOM. | 
| protected DocumentBuilder | dBuilder - The XML parser. | 
| protected HashSet<String> | DocIDBlacklist - A blacklist of documents to ignore. | 
| protected boolean | DocIdIsAttribute - Set if DocIdLocation contains ELEMENT_ATTR_SEPARATOR. | 
| protected String | DocIdLocation - Contains the name of the tag that contains the document name. | 
| protected HashSet<String> | DocumentElements - Contains the names of tags that encapsulate entire documents. | 
| protected LinkedList<org.terrier.indexing.SimpleXMLCollection.XMLDocument> | Documents - A list of all the document objects in this XML file. | 
| protected boolean | DocumentTags - Set if DocumentElements.size > 0. | 
| static String | ELEMENT_ATTR_SEPARATOR - Element attribute separator. | 
| protected boolean | EOC | 
| protected LinkedList<String> | FilesToProcess - The list of files to process. | 
| protected static org.slf4j.Logger | logger | 
| protected boolean | PropertiesInAttibutes - Set if any entry of PropertyElements contains ELEMENT_ATTR_SEPARATOR. | 
| protected Map<String,Integer> | PropertyElements - Contains the names of tags and attributes that encapsulate meta properties, with their lengths. | 
| protected HashSet<String> | TermElements - Contains the names of tags and attributes that encapsulate terms. | 
| protected boolean | TermsInAttributes - Set if any entry of TermElements contains ELEMENT_ATTR_SEPARATOR. | 
| protected org.terrier.indexing.SimpleXMLCollection.XMLDocument | thisDoc - The current XML document that is being read by the indexer. | 
| protected Document | xmlDoc - The parsed structure of the XML file currently open. | 
| Constructor and Description | 
|---|
| SimpleXMLCollection() - Construct a SimpleXMLCollection. | 
| SimpleXMLCollection(List<String> filesToProcess) - Construct a SimpleXMLCollection. | 
| SimpleXMLCollection(String CollectionSpecFilename, String BlacklistSpecFilename) - Construct a SimpleXMLCollection. | 
| Modifier and Type | Method and Description | 
|---|---|
| void | close() - Not supported by this implementation. | 
| boolean | endOfCollection() - Returns true if the end of the collection has been reached. | 
| protected boolean | findDocumentElement(Node n) | 
| Document | getDocument() - Get the document object representing the current document. | 
| boolean | hasNext() - Check whether there is a next document in the collection. | 
| protected void | initialiseParser() | 
| protected void | initialiseTags() | 
| static void | main(String[] args) - Main method. | 
| Document | next() - Get the next document. | 
| boolean | nextDocument() - Move the collection to the start of the next document. | 
| protected boolean | openNextFile() | 
| void | remove() - Unsupported by this Collection implementation; every call throws UnsupportedOperationException. | 
| void | reset() - Resets the Collection iterator to the start of the collection. | 
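
The method summary above describes two ways of walking the collection: the `nextDocument()` / `getDocument()` pair and the `hasNext()` / `next()` iterator pair, with `remove()` always throwing. The sketch below illustrates that calling pattern only. `StandInCollection` and `TraversalSketch` are hypothetical stand-ins that reuse the method names from the table; they hold plain strings, are not Terrier classes, and make no claim about how SimpleXMLCollection is implemented internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in mirroring only the method names in the summary table.
class StandInCollection implements Iterator<String> {
    private final List<String> docs;
    private int position = -1; // index of the current document; -1 before the first

    StandInCollection(List<String> docs) {
        this.docs = docs;
    }

    // Move to the start of the next document; returns false when none is left.
    public boolean nextDocument() {
        if (position + 1 >= docs.size()) {
            position = docs.size();
            return false;
        }
        position++;
        return true;
    }

    // The current document, or null if there is no current document.
    public String getDocument() {
        return (position >= 0 && position < docs.size()) ? docs.get(position) : null;
    }

    public boolean endOfCollection() {
        return position + 1 >= docs.size();
    }

    public void reset() {
        position = -1;
    }

    @Override
    public boolean hasNext() {
        return !endOfCollection();
    }

    @Override
    public String next() {
        if (!nextDocument()) {
            throw new NoSuchElementException();
        }
        return getDocument();
    }

    // Mirrors the documented behaviour: removal is never supported.
    @Override
    public void remove() {
        throw new UnsupportedOperationException();
    }
}

class TraversalSketch {
    // Collect every document using the nextDocument()/getDocument() pair.
    static List<String> drain(StandInCollection c) {
        List<String> seen = new ArrayList<>();
        while (c.nextDocument()) {
            seen.add(c.getDocument());
        }
        return seen;
    }

    public static void main(String[] args) {
        StandInCollection c = new StandInCollection(Arrays.asList("d1", "d2"));
        System.out.println(drain(c));
        c.reset(); // back to the start, so the iterator style can be used as well
        while (c.hasNext()) {
            System.out.println(c.next());
        }
    }
}
```

Calling `reset()` between the two loops is what allows the same stand-in to be traversed twice.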
protected static final org.slf4j.Logger logger
public static final String ELEMENT_ATTR_SEPARATOR
protected static final boolean bReformXML
protected HashSet<String> DocumentElements
protected boolean DocumentTags
protected HashSet<String> TermElements
protected String DocIdLocation
protected boolean DocIdIsAttribute
protected boolean TermsInAttributes
protected boolean PropertiesInAttibutes
protected Map<String,Integer> PropertyElements
protected DocumentBuilderFactory dbFactory
protected DocumentBuilder dBuilder
protected Document xmlDoc
protected LinkedList<org.terrier.indexing.SimpleXMLCollection.XMLDocument> Documents
protected org.terrier.indexing.SimpleXMLCollection.XMLDocument thisDoc
protected boolean EOC
protected LinkedList<String> FilesToProcess
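
As a rough illustration of how the fields above relate, the following sketch parses XML with the JDK's `DocumentBuilderFactory` and `DocumentBuilder`, selects the elements that wrap whole documents, and reads a document ID either from a tag's text or from an attribute, depending on whether the location string contains a separator. The class `DocIdSketch`, its methods, and the separator value `":"` are assumptions made for this example; the value of ELEMENT_ATTR_SEPARATOR is not given on this page, and the class's actual lookup logic may differ.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class DocIdSketch {
    // Assumed separator value, for illustration only.
    static final String SEPARATOR = ":";

    // Returns one ID per <documentTag> element found in the XML string.
    static List<String> docIds(String xml, String documentTag, String docIdLocation) {
        try {
            DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
            Document xmlDoc = dBuilder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            // "element:attribute" means the ID lives in an attribute.
            boolean docIdIsAttribute = docIdLocation.contains(SEPARATOR);
            String[] parts = docIdLocation.split(Pattern.quote(SEPARATOR), 2);

            List<String> ids = new ArrayList<>();
            NodeList docs = xmlDoc.getElementsByTagName(documentTag);
            for (int i = 0; i < docs.getLength(); i++) {
                Element doc = (Element) docs.item(i);
                if (docIdIsAttribute) {
                    // The attribute may sit on the document element itself or on a descendant.
                    Element holder = parts[0].equals(doc.getTagName())
                            ? doc : firstDescendant(doc, parts[0]);
                    ids.add(holder == null ? "" : holder.getAttribute(parts[1]));
                } else {
                    Element holder = firstDescendant(doc, docIdLocation);
                    ids.add(holder == null ? "" : holder.getTextContent().trim());
                }
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // First descendant element with the given tag name, or null if there is none.
    static Element firstDescendant(Element parent, String tag) {
        NodeList matches = parent.getElementsByTagName(tag);
        return matches.getLength() == 0 ? null : (Element) matches.item(0);
    }

    public static void main(String[] args) {
        String xml = "<root><doc id=\"a1\"><title>x</title></doc></root>";
        System.out.println(docIds(xml, "doc", "doc" + SEPARATOR + "id"));
        System.out.println(docIds(xml, "doc", "title"));
    }
}
```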
public SimpleXMLCollection(List<String> filesToProcess)
filesToProcess - the list of files to process
public SimpleXMLCollection()
protected void initialiseParser()
protected void initialiseTags()
public void close()
close in interface Closeable
close in interface AutoCloseable
public boolean hasNext()
public Document next()
public void remove()
public boolean endOfCollection()
endOfCollection in interface Collection
public boolean nextDocument()
nextDocument in interface Collection
protected boolean findDocumentElement(Node n)
public Document getDocument()
getDocument in interface Collection
public void reset()
reset in interface Collection
protected boolean openNextFile()
public static void main(String[] args) throws IOException
args - command-line arguments
Throws IOException
Terrier Information Retrieval Platform 4.1. Copyright © 2004-2015, University of Glasgow