Package org.terrier.indexing
Class SimpleXMLCollection
- java.lang.Object
-
- org.terrier.indexing.SimpleXMLCollection
-
- All Implemented Interfaces:
java.io.Closeable, java.lang.AutoCloseable, Collection
- Direct Known Subclasses:
SimpleMedlineXMLCollection
public class SimpleXMLCollection extends java.lang.Object implements Collection
Initial implementation of a class that generates a Collection with Documents from a series of XML files.
Properties:
- indexing.simplexmlcollection.reformxml - will try to reform broken & entities.
- xml.blacklist.docids - docnos of documents that will not be indexed.
- xml.doctag - tag that marks a document.
- xml.idtag - tag that contains the docno. Attributes are specified as "element.attribute".
- xml.terms - list of tags whose children contain terms that should be indexed.
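As a rough illustration of the properties above, the following sketch builds a set of values with java.util.Properties and splits an xml.idtag value of the form "element.attribute". The tag names (article, id, title, body), the comma as list separator for xml.terms, and the "." separator are assumptions made for the example; the class name and helper methods are hypothetical and are not part of Terrier.

```java
import java.util.Properties;

public class XmlCollectionPropertiesSketch {
    // Assumption: "." separates element and attribute, as in "element.attribute".
    // The real constant is ELEMENT_ATTR_SEPARATOR; its value is not shown on this page.
    public static final String SEPARATOR = ".";

    /** Builds illustrative property values; every value here is hypothetical. */
    public static Properties exampleProperties() {
        Properties p = new Properties();
        p.setProperty("xml.doctag", "article");     // tag that marks a document
        p.setProperty("xml.idtag", "article.id");   // docno held in the id attribute
        p.setProperty("xml.terms", "title,body");   // assumed comma-separated list
        p.setProperty("indexing.simplexmlcollection.reformxml", "false");
        return p;
    }

    /** Splits "element.attribute" into {element, attribute}; attribute is null if absent. */
    public static String[] splitIdTag(String idTag) {
        int pos = idTag.indexOf(SEPARATOR);
        if (pos < 0) {
            return new String[] { idTag, null };
        }
        return new String[] {
            idTag.substring(0, pos),
            idTag.substring(pos + SEPARATOR.length())
        };
    }

    public static void main(String[] args) {
        Properties p = exampleProperties();
        String[] parts = splitIdTag(p.getProperty("xml.idtag"));
        System.out.println("element=" + parts[0] + ", attribute=" + parts[1]);
    }
}
```

In a real setup these keys would normally be set in Terrier's own configuration rather than in a standalone Properties object.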
-
-
Field Summary
Fields (modifier and type, field, description)
- protected static boolean bReformXML - Reform invalid XML by copying to temporary file.
- protected javax.xml.parsers.DocumentBuilderFactory dbFactory - The xml parser factory for DOM
- protected javax.xml.parsers.DocumentBuilder dBuilder - the xml parser
- protected java.util.HashSet<java.lang.String> DocIDBlacklist - A blacklist of documents to ignore.
- protected boolean DocIdIsAttribute - set if DocIdLocation contains ELEMENT_ATTR_SEPARATOR
- protected java.lang.String DocIdLocation - Contains the name of the tag that contains the document name
- protected java.util.HashSet<java.lang.String> DocumentElements - Contains the names of tags that encapsulate entire documents
- protected java.util.LinkedList<org.terrier.indexing.SimpleXMLCollection.XMLDocument> Documents - A list of all the document objects in this XML file
- protected boolean DocumentTags - Set if DocumentElements.size > 0
- static java.lang.String ELEMENT_ATTR_SEPARATOR - element attribute separator
- protected boolean EOC
- protected java.util.List<java.lang.String> FilesToProcess - The list of files to process.
- protected static org.slf4j.Logger logger
- protected boolean PropertiesInAttibutes - set if any PropertyElements contains ELEMENT_ATTR_SEPARATOR
- protected java.util.Map<java.lang.String,java.lang.Integer> PropertyElements - Contains the names of tags and attributes that encapsulate meta properties with their lengths
- protected java.util.HashSet<java.lang.String> TermElements - Contains the names of tags and attributes that encapsulate terms
- protected boolean TermsInAttributes - set if any TermElements contains ELEMENT_ATTR_SEPARATOR
- protected org.terrier.indexing.SimpleXMLCollection.XMLDocument thisDoc - the current XML document that is being read by the indexer
- protected org.w3c.dom.Document xmlDoc - the parsed structure of the XML file we currently have open
-
Constructor Summary
Constructors (constructor, description)
- SimpleXMLCollection() - Construct a SimpleXMLCollection
- SimpleXMLCollection(java.lang.String CollectionSpecFilename, java.lang.String BlacklistSpecFilename) - Construct a SimpleXMLCollection
- SimpleXMLCollection(java.lang.String addressCollectionFilename, java.lang.String ignored1, java.lang.String BlacklistSpecFilename, java.lang.String ignored2) - additional constructors required by TRECIndexing
- SimpleXMLCollection(java.util.List<java.lang.String> filesToProcess) - Construct a SimpleXMLCollection
- SimpleXMLCollection(java.util.List<java.lang.String> collSpecFiles, java.lang.String ignored1, java.lang.String BlacklistSpecFilename, java.lang.String ignored2) - additional constructors required by TRECIndexing
-
Method Summary
Methods (modifier and type, method, description)
- void close() - This is not supported in this implemented class.
- boolean endOfCollection() - Returns true if the end of the collection has been reached
- protected boolean findDocumentElement(org.w3c.dom.Node n)
- Document getDocument() - Get the document object representing the current document.
- boolean hasNext() - Check whether there is a next document in the collection
- protected void initialiseParser()
- protected void initialiseTags()
- protected void loadBlacklist(java.lang.String BlacklistSpecFilename)
- static void main(java.lang.String[] args) - main
- Document next() - get the next document
- boolean nextDocument() - Move the collection to the start of the next document.
- protected boolean openNextFile()
- void remove() - This is unsupported by this Collection implementation; every call throws UnsupportedOperationException.
- void reset() - Resets the Collection iterator to the start of the collection.
-
-
-
Field Detail
-
logger
protected static final org.slf4j.Logger logger
-
ELEMENT_ATTR_SEPARATOR
public static final java.lang.String ELEMENT_ATTR_SEPARATOR
Element attribute separator.
- See Also:
Constant Field Values
-
bReformXML
protected static final boolean bReformXML
Reform invalid XML by copying to a temporary file. NB: this may be dangerous.
-
DocumentElements
protected java.util.HashSet<java.lang.String> DocumentElements
Contains the names of tags that encapsulate entire documents
-
DocumentTags
protected boolean DocumentTags
Set if DocumentElements.size > 0
-
TermElements
protected java.util.HashSet<java.lang.String> TermElements
Contains the names of tags and attributes that encapsulate terms
-
DocIdLocation
protected java.lang.String DocIdLocation
Contains the name of the tag that contains the document name
-
DocIdIsAttribute
protected boolean DocIdIsAttribute
set if DocIdLocation contains ELEMENT_ATTR_SEPARATOR
-
TermsInAttributes
protected boolean TermsInAttributes
set if any TermElements contains ELEMENT_ATTR_SEPARATOR
-
PropertiesInAttibutes
protected boolean PropertiesInAttibutes
set if any PropertyElements contains ELEMENT_ATTR_SEPARATOR
-
PropertyElements
protected java.util.Map<java.lang.String,java.lang.Integer> PropertyElements
Contains the names of tags and attributes that encapsulate meta properties with their lengths
-
dbFactory
protected javax.xml.parsers.DocumentBuilderFactory dbFactory
The xml parser factory for DOM
-
dBuilder
protected javax.xml.parsers.DocumentBuilder dBuilder
the xml parser
-
xmlDoc
protected org.w3c.dom.Document xmlDoc
the parsed structure of the XML file we currently have open
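The three fields above correspond to the standard JAXP DOM workflow: a factory creates a builder, and the builder parses a file into an org.w3c.dom.Document. The sketch below shows that workflow with JDK classes only, then collects an id attribute from every element that matches a document tag. The class name, the method docIds, and the sample tag and attribute names are hypothetical; this is not the code SimpleXMLCollection itself runs.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class DomParseSketch {
    /** Parses XML text and returns the given attribute of every element named docTag. */
    public static List<String> docIds(String xml, String docTag, String idAttribute) {
        try {
            // Same roles as the dbFactory, dBuilder and xmlDoc fields described above.
            DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
            Document xmlDoc = dBuilder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            // Elements are returned in document order.
            NodeList nodes = xmlDoc.getElementsByTagName(docTag);
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element element = (Element) nodes.item(i);
                ids.add(element.getAttribute(idAttribute));
            }
            return ids;
        } catch (Exception e) {
            // Wrap checked parser exceptions to keep the sketch short.
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<collection>"
            + "<article id=\"d1\"><title>first</title></article>"
            + "<article id=\"d2\"><title>second</title></article>"
            + "</collection>";
        System.out.println(docIds(xml, "article", "id")); // prints [d1, d2]
    }
}
```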
-
Documents
protected java.util.LinkedList<org.terrier.indexing.SimpleXMLCollection.XMLDocument> Documents
A list of all the document objects in this XML file
-
thisDoc
protected org.terrier.indexing.SimpleXMLCollection.XMLDocument thisDoc
the current XML document that is being read by the indexer
-
EOC
protected boolean EOC
-
DocIDBlacklist
protected java.util.HashSet<java.lang.String> DocIDBlacklist
A blacklist of documents to ignore.
-
FilesToProcess
protected java.util.List<java.lang.String> FilesToProcess
The list of files to process.
-
-
Constructor Detail
-
SimpleXMLCollection
public SimpleXMLCollection(java.util.List<java.lang.String> filesToProcess)
Construct a SimpleXMLCollection.
- Parameters:
filesToProcess - the list of files to process
-
SimpleXMLCollection
public SimpleXMLCollection()
Construct a SimpleXMLCollection
-
SimpleXMLCollection
public SimpleXMLCollection(java.lang.String addressCollectionFilename, java.lang.String ignored1, java.lang.String BlacklistSpecFilename, java.lang.String ignored2)
Additional constructor required by TRECIndexing.
-
SimpleXMLCollection
public SimpleXMLCollection(java.util.List<java.lang.String> collSpecFiles, java.lang.String ignored1, java.lang.String BlacklistSpecFilename, java.lang.String ignored2)
Additional constructor required by TRECIndexing.
-
SimpleXMLCollection
public SimpleXMLCollection(java.lang.String CollectionSpecFilename, java.lang.String BlacklistSpecFilename)
Construct a SimpleXMLCollection.
- Parameters:
CollectionSpecFilename
BlacklistSpecFilename
-
-
Method Detail
-
loadBlacklist
protected void loadBlacklist(java.lang.String BlacklistSpecFilename)
-
initialiseParser
protected void initialiseParser()
-
initialiseTags
protected void initialiseTags()
-
close
public void close()
This is not supported in this implemented class.
- Specified by:
close in interface java.lang.AutoCloseable
- Specified by:
close in interface java.io.Closeable
-
hasNext
public boolean hasNext()
Check whether there is a next document in the collection.
- Returns:
boolean
-
next
public Document next()
Get the next document.
- Returns:
next document
-
remove
public void remove()
This is unsupported by this Collection implementation; every call throws UnsupportedOperationException.
-
endOfCollection
public boolean endOfCollection()
Returns true if the end of the collection has been reached.
- Specified by:
endOfCollection in interface Collection
- Returns:
boolean true if the end of collection has been reached, otherwise it returns false.
-
nextDocument
public boolean nextDocument()
Move the collection to the start of the next document.
- Specified by:
nextDocument in interface Collection
- Returns:
boolean true if there exists another document in the collection, otherwise it returns false.
-
findDocumentElement
protected boolean findDocumentElement(org.w3c.dom.Node n)
-
getDocument
public Document getDocument()
Get the document object representing the current document.
- Specified by:
getDocument in interface Collection
- Returns:
Document the current document.
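The nextDocument(), getDocument() and endOfCollection() methods suggest a cursor-style loop: advance, then read the current document, until nextDocument() returns false. The sketch below illustrates that calling pattern with a hypothetical stand-in interface (DocSource) and a toy in-memory implementation, so that it runs without Terrier on the classpath. Only the three method names are taken from this page; everything else is an assumption for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class CollectionLoopSketch {
    /** Hypothetical stand-in that mirrors the three Collection methods documented here. */
    public interface DocSource<D> {
        boolean nextDocument();
        D getDocument();
        boolean endOfCollection();
    }

    /** Toy implementation over an in-memory list; not Terrier code. */
    public static <D> DocSource<D> over(List<D> docs) {
        final Iterator<D> it = docs.iterator();
        return new DocSource<D>() {
            private D current;
            private boolean eoc = false;

            @Override
            public boolean nextDocument() {
                if (it.hasNext()) {
                    current = it.next();
                    return true;
                }
                current = null;
                eoc = true;
                return false;
            }

            @Override
            public D getDocument() {
                return current;
            }

            @Override
            public boolean endOfCollection() {
                return eoc;
            }
        };
    }

    /** Advances with nextDocument() and reads each document with getDocument(). */
    public static <D> List<D> drain(DocSource<D> source) {
        List<D> seen = new ArrayList<>();
        while (source.nextDocument()) {
            seen.add(source.getDocument());
        }
        return seen;
    }

    public static void main(String[] args) {
        DocSource<String> source = over(Arrays.asList("d1", "d2"));
        System.out.println(drain(source));
        System.out.println(source.endOfCollection());
    }
}
```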
-
reset
public void reset()
Resets the Collection iterator to the start of the collection. This Collection implementation does not support reset.
- Specified by:
reset in interface Collection
-
openNextFile
protected boolean openNextFile()
-
main
public static void main(java.lang.String[] args) throws java.io.IOException
main
- Parameters:
args
- Throws:
java.io.IOException
-
-