Package org.terrier.indexing
Class SimpleXMLCollection
- java.lang.Object
  - org.terrier.indexing.SimpleXMLCollection
- All Implemented Interfaces: java.io.Closeable, java.lang.AutoCloseable, Collection
- Direct Known Subclasses: SimpleMedlineXMLCollection
public class SimpleXMLCollection extends java.lang.Object implements Collection
Initial implementation of a class that generates a Collection with Documents from a series of XML files.
Properties:
- indexing.simplexmlcollection.reformxml - will try to repair broken & entities.
- xml.blacklist.docids - docnos of documents that will not be indexed.
- xml.doctag - tag that marks a document.
- xml.idtag - tag that contains the docno. Attributes are specified as "element.attribute".
- xml.terms - list of tags whose children contain terms that should be indexed.
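The "element.attribute" convention used by xml.idtag can be illustrated with a small, self-contained sketch. It is not the class's actual parsing code: the class name TagSpecSketch, the helper parseSpec, the "." separator value, the comma-separated form of xml.terms, and all property values are assumptions made for illustration.

```java
import java.util.Properties;

public class TagSpecSketch {
    // Assumed separator: the text above only shows the form "element.attribute".
    static final String SEPARATOR = ".";

    /** Splits a spec into {element, attribute}; the attribute is null when none is given. */
    static String[] parseSpec(String spec) {
        int idx = spec.indexOf(SEPARATOR);
        if (idx < 0) {
            return new String[] { spec, null };
        }
        return new String[] {
            spec.substring(0, idx),
            spec.substring(idx + SEPARATOR.length())
        };
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical values, chosen only to show the shape of each property.
        props.setProperty("xml.doctag", "article");
        props.setProperty("xml.idtag", "article.id");     // docno held in an attribute
        props.setProperty("xml.terms", "title,abstract"); // list format is an assumption

        String[] id = parseSpec(props.getProperty("xml.idtag"));
        System.out.println("element=" + id[0] + ", attribute=" + id[1]);
    }
}
```

A spec without the separator, such as "docno", yields a null attribute in this sketch, which corresponds to the docno being read from the element itself.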
-
-
Field Summary
Fields (modifier and type, field, description):
- protected static boolean bReformXML - Reform invalid XML by copying to temporary file.
- protected javax.xml.parsers.DocumentBuilderFactory dbFactory - The XML parser factory for DOM.
- protected javax.xml.parsers.DocumentBuilder dBuilder - The XML parser.
- protected java.util.HashSet<java.lang.String> DocIDBlacklist - A blacklist of documents to ignore.
- protected boolean DocIdIsAttribute - Set if DocIdLocation contains ELEMENT_ATTR_SEPARATOR.
- protected java.lang.String DocIdLocation - Contains the name of the tag that contains the document name.
- protected java.util.HashSet<java.lang.String> DocumentElements - Contains the names of tags that encapsulate entire documents.
- protected java.util.LinkedList<org.terrier.indexing.SimpleXMLCollection.XMLDocument> Documents - A list of all the document objects in this XML file.
- protected boolean DocumentTags - Set if DocumentElements.size > 0.
- static java.lang.String ELEMENT_ATTR_SEPARATOR - Element attribute separator.
- protected boolean EOC
- protected java.util.List<java.lang.String> FilesToProcess - The list of files to process.
- protected static org.slf4j.Logger logger
- protected boolean PropertiesInAttibutes - Set if any PropertyElements contains ELEMENT_ATTR_SEPARATOR.
- protected java.util.Map<java.lang.String,java.lang.Integer> PropertyElements - Contains the names of tags and attributes that encapsulate meta properties, with their lengths.
- protected java.util.HashSet<java.lang.String> TermElements - Contains the names of tags and attributes that encapsulate terms.
- protected boolean TermsInAttributes - Set if any TermElements contains ELEMENT_ATTR_SEPARATOR.
- protected org.terrier.indexing.SimpleXMLCollection.XMLDocument thisDoc - The current XML document that is being read by the indexer.
- protected org.w3c.dom.Document xmlDoc - The parsed structure of the XML file we currently have open.
-
Constructor Summary
Constructors (constructor, description):
- SimpleXMLCollection() - Construct a SimpleXMLCollection.
- SimpleXMLCollection(java.lang.String CollectionSpecFilename, java.lang.String BlacklistSpecFilename) - Construct a SimpleXMLCollection.
- SimpleXMLCollection(java.lang.String addressCollectionFilename, java.lang.String ignored1, java.lang.String BlacklistSpecFilename, java.lang.String ignored2) - Additional constructors required by TRECIndexing.
- SimpleXMLCollection(java.util.List<java.lang.String> filesToProcess) - Construct a SimpleXMLCollection.
- SimpleXMLCollection(java.util.List<java.lang.String> collSpecFiles, java.lang.String ignored1, java.lang.String BlacklistSpecFilename, java.lang.String ignored2) - Additional constructors required by TRECIndexing.
-
Method Summary
Methods (modifier and type, method, description):
- void close() - This is not supported in this implementation.
- boolean endOfCollection() - Returns true if the end of the collection has been reached.
- protected boolean findDocumentElement(org.w3c.dom.Node n)
- Document getDocument() - Get the document object representing the current document.
- boolean hasNext() - Check whether there is a next document in the collection.
- protected void initialiseParser()
- protected void initialiseTags()
- protected void loadBlacklist(java.lang.String BlacklistSpecFilename)
- static void main(java.lang.String[] args) - main
- Document next() - Get the next document.
- boolean nextDocument() - Move the collection to the start of the next document.
- protected boolean openNextFile()
- void remove() - Unsupported by this Collection implementation; throws UnsupportedOperationException on all invocations.
- void reset() - Resets the Collection iterator to the start of the collection.
-
-
-
Field Detail
-
logger
protected static final org.slf4j.Logger logger
-
ELEMENT_ATTR_SEPARATOR
public static final java.lang.String ELEMENT_ATTR_SEPARATOR
Element attribute separator.
- See Also: Constant Field Values
-
bReformXML
protected static final boolean bReformXML
Reform invalid XML by copying to temporary file. NB: this may be dangerous.
-
DocumentElements
protected java.util.HashSet<java.lang.String> DocumentElements
Contains the names of tags that encapsulate entire documents
-
DocumentTags
protected boolean DocumentTags
Set if DocumentElements.size > 0
-
TermElements
protected java.util.HashSet<java.lang.String> TermElements
Contains the names of tags and attributes that encapsulate terms
-
DocIdLocation
protected java.lang.String DocIdLocation
Contains the name of the tag that contains the document name
-
DocIdIsAttribute
protected boolean DocIdIsAttribute
set if DocIdLocation contains ELEMENT_ATTR_SEPARATOR
-
TermsInAttributes
protected boolean TermsInAttributes
set if any TermElements contains ELEMENT_ATTR_SEPARATOR
-
PropertiesInAttibutes
protected boolean PropertiesInAttibutes
set if any PropertyElements contains ELEMENT_ATTR_SEPARATOR
-
PropertyElements
protected java.util.Map<java.lang.String,java.lang.Integer> PropertyElements
Contains the names of tags and attributes that encapsulate meta properties with their lengths
-
dbFactory
protected javax.xml.parsers.DocumentBuilderFactory dbFactory
The xml parser factory for DOM
-
dBuilder
protected javax.xml.parsers.DocumentBuilder dBuilder
the xml parser
-
xmlDoc
protected org.w3c.dom.Document xmlDoc
the parsed structure of the XML file we currently have open
-
Documents
protected java.util.LinkedList<org.terrier.indexing.SimpleXMLCollection.XMLDocument> Documents
A list of all the document objects in this XML file
-
thisDoc
protected org.terrier.indexing.SimpleXMLCollection.XMLDocument thisDoc
the current XML document that is being read by the indexer
-
EOC
protected boolean EOC
-
DocIDBlacklist
protected java.util.HashSet<java.lang.String> DocIDBlacklist
A blacklist of documents to ignore.
-
FilesToProcess
protected java.util.List<java.lang.String> FilesToProcess
The list of files to process.
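How the parser-related fields (dbFactory, dBuilder, xmlDoc) and DocIDBlacklist might fit together can be pictured with the standard javax.xml.parsers API. The sketch below is an assumption-laden illustration, not this class's implementation: the class name DomFieldsSketch, the helper collectDocIds, the sample XML, and the choice to read the id from an attribute are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class DomFieldsSketch {
    /**
     * Parses the XML text, finds every element named docTag, and returns the
     * value of idAttribute for each one whose id is not in the blacklist.
     */
    static List<String> collectDocIds(String xml, String docTag,
                                      String idAttribute, Set<String> blacklist) {
        try {
            // Local variables named after the fields they stand in for.
            DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
            Document xmlDoc = dBuilder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            NodeList nodes = xmlDoc.getElementsByTagName(docTag);
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element doc = (Element) nodes.item(i);
                String id = doc.getAttribute(idAttribute);
                if (!blacklist.contains(id)) {
                    ids.add(id); // blacklisted documents are skipped
                }
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical input: three documents, one of them blacklisted.
        String xml = "<docs><doc id=\"a\">x</doc><doc id=\"b\">y</doc><doc id=\"c\">z</doc></docs>";
        System.out.println(collectDocIds(xml, "doc", "id", Collections.singleton("b")));
    }
}
```

Parsing each file into a full DOM tree keeps the lookup logic simple, at the cost of holding the whole file in memory.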
-
-
Constructor Detail
-
SimpleXMLCollection
public SimpleXMLCollection(java.util.List<java.lang.String> filesToProcess)
Construct a SimpleXMLCollection.
- Parameters: filesToProcess
-
SimpleXMLCollection
public SimpleXMLCollection()
Construct a SimpleXMLCollection
-
SimpleXMLCollection
public SimpleXMLCollection(java.lang.String addressCollectionFilename, java.lang.String ignored1, java.lang.String BlacklistSpecFilename, java.lang.String ignored2)
additional constructors required by TRECIndexing
-
SimpleXMLCollection
public SimpleXMLCollection(java.util.List<java.lang.String> collSpecFiles, java.lang.String ignored1, java.lang.String BlacklistSpecFilename, java.lang.String ignored2)
additional constructors required by TRECIndexing
-
SimpleXMLCollection
public SimpleXMLCollection(java.lang.String CollectionSpecFilename, java.lang.String BlacklistSpecFilename)
Construct a SimpleXMLCollection.
- Parameters: CollectionSpecFilename, BlacklistSpecFilename
-
-
Method Detail
-
loadBlacklist
protected void loadBlacklist(java.lang.String BlacklistSpecFilename)
-
initialiseParser
protected void initialiseParser()
-
initialiseTags
protected void initialiseTags()
-
close
public void close()
This is not supported in this implementation.
- Specified by: close in interface java.lang.AutoCloseable
- Specified by: close in interface java.io.Closeable
-
hasNext
public boolean hasNext()
Check whether there is a next document in the collection.
- Returns: boolean
-
next
public Document next()
Get the next document.
- Returns: next document
-
remove
public void remove()
This is unsupported by this Collection implementation; any call throws UnsupportedOperationException.
-
endOfCollection
public boolean endOfCollection()
Returns true if the end of the collection has been reached.
- Specified by: endOfCollection in interface Collection
- Returns: boolean true if the end of collection has been reached, otherwise it returns false.
-
nextDocument
public boolean nextDocument()
Move the collection to the start of the next document.
- Specified by: nextDocument in interface Collection
- Returns: boolean true if there exists another document in the collection, otherwise it returns false.
-
findDocumentElement
protected boolean findDocumentElement(org.w3c.dom.Node n)
-
getDocument
public Document getDocument()
Get the document object representing the current document.
- Specified by: getDocument in interface Collection
- Returns: Document the current document.
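The way nextDocument(), getDocument() and endOfCollection() are meant to be used together can be sketched with a stand-in. The DocSource interface and ListSource class below are hypothetical and only mirror the three method names and return conventions described here; they do not reproduce the Terrier Collection interface or this class's behaviour, and documents are represented as plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class IterationSketch {
    /** Hypothetical stand-in mirroring the three methods documented above. */
    interface DocSource {
        boolean nextDocument();    // advance; true if a document is now current
        String getDocument();      // the current document
        boolean endOfCollection(); // true once the source is exhausted
    }

    /** Toy source backed by an in-memory list of documents. */
    static class ListSource implements DocSource {
        private final Iterator<String> it;
        private String current;
        private boolean eoc = false;

        ListSource(List<String> docs) {
            this.it = docs.iterator();
        }

        public boolean nextDocument() {
            if (it.hasNext()) {
                current = it.next();
                return true;
            }
            current = null;
            eoc = true;
            return false;
        }

        public String getDocument() {
            return current;
        }

        public boolean endOfCollection() {
            return eoc;
        }
    }

    /** Consumes the source with the advance-then-read loop. */
    static List<String> drain(DocSource source) {
        List<String> seen = new ArrayList<>();
        while (source.nextDocument()) {
            seen.add(source.getDocument());
        }
        return seen;
    }

    public static void main(String[] args) {
        DocSource source = new ListSource(Arrays.asList("d1", "d2"));
        System.out.println(drain(source) + " eoc=" + source.endOfCollection());
    }
}
```

The loop calls nextDocument() before each getDocument(), so the caller never reads a document that has not been positioned first.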
-
reset
public void reset()
Resets the Collection iterator to the start of the collection. This Collection implementation does not support reset.
- Specified by: reset in interface Collection
-
openNextFile
protected boolean openNextFile()
-
main
public static void main(java.lang.String[] args) throws java.io.IOException
main
- Parameters: args
- Throws: java.io.IOException
-
-