java.lang.Object org.terrier.indexing.SimpleXMLCollection
public class SimpleXMLCollection
Initial implementation of a class that generates a Collection with Documents from a series of XML files.
Properties:
Field Summary | |
---|---|
protected static boolean |
bReformXML
Reform invalid XML by copying to temporary file. |
protected javax.xml.parsers.DocumentBuilderFactory |
dbFactory
The xml parser factory for DOM |
protected javax.xml.parsers.DocumentBuilder |
dBuilder
the xml parser |
protected java.util.HashSet<java.lang.String> |
DocIDBlacklist
A blacklist of documents to ignore. |
protected boolean |
DocIdIsAttribute
set if DocIdLocation contains ELEMENT_ATTR_SEPARATOR |
protected java.lang.String |
DocIdLocation
Contains the name of the tag that contains the document name |
protected java.util.HashSet<java.lang.String> |
DocumentElements
Contains the names of tags that encapsulate entire documents |
protected java.util.LinkedList<org.terrier.indexing.SimpleXMLCollection.XMLDocument> |
Documents
A list of all the document objects in this XML file |
protected boolean |
DocumentTags
Set if DocumentElements.size > 0 |
static java.lang.String |
ELEMENT_ATTR_SEPARATOR
element attribute separator |
protected boolean |
EOC
|
protected java.util.LinkedList<java.lang.String> |
FilesToProcess
The list of files to process. |
protected static org.apache.log4j.Logger |
logger
|
protected java.util.HashSet<java.lang.String> |
TermElements
Contains the names of tags and attributes that encapsulate terms |
protected boolean |
TermsInAttributes
set if any TermElements contains ELEMENT_ATTR_SEPARATOR |
protected org.terrier.indexing.SimpleXMLCollection.XMLDocument |
thisDoc
the current XML document that is being read by the indexer |
protected org.w3c.dom.Document |
xmlDoc
the parsed structure of the XML file we currently have open |
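The fields above suggest a DOM-based workflow: a DocumentBuilderFactory creates a DocumentBuilder, each file is parsed into an org.w3c.dom.Document, and the tree is searched for elements whose tag names appear in DocumentElements. The following is a minimal sketch of that idea using only the JDK. The class name DocumentElementSketch and the helpers parse and collect are hypothetical, and the traversal is an assumption about how such a search could work, not Terrier's actual findDocumentElement implementation.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical illustration; not part of Terrier.
public class DocumentElementSketch {

    // Parse an XML string into a DOM tree (the role played by dbFactory, dBuilder and xmlDoc).
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Depth-first walk that collects every element whose tag name is in documentElements.
    // Assumption: a matched element is treated as one whole document and not searched further.
    static void collect(Node node, Set<String> documentElements, List<Element> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && documentElements.contains(node.getNodeName())) {
            out.add((Element) node);
            return;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, documentElements, out);
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><doc id=\"d1\">alpha</doc><doc id=\"d2\">beta</doc></root>";
        Set<String> documentElements = new HashSet<>();
        documentElements.add("doc");

        List<Element> found = new ArrayList<>();
        collect(parse(xml), documentElements, found);
        for (Element e : found) {
            System.out.println(e.getAttribute("id") + " -> " + e.getTextContent());
        }
    }
}
```

Collecting the matches into a list mirrors the Documents field, which holds all document objects found in the currently open XML file.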
Constructor Summary | |
---|---|
SimpleXMLCollection()
Construct a SimpleXMLCollection |
|
SimpleXMLCollection(java.util.List<java.lang.String> filesToProcess)
Construct a SimpleXMLCollection |
|
SimpleXMLCollection(java.lang.String CollectionSpecFilename,
java.lang.String BlacklistSpecFilename)
Construct a SimpleXMLCollection |
Method Summary | |
---|---|
void |
close()
Not supported by this implementation. |
boolean |
endOfCollection()
Returns true if the end of the collection has been reached |
protected boolean |
findDocumentElement(org.w3c.dom.Node n)
|
Document |
getDocument()
Get the document object representing the current document. |
boolean |
hasNext()
Check whether there is a next document in the collection |
protected void |
initialiseParser()
|
protected void |
initialiseTags()
|
static void |
main(java.lang.String[] args)
main |
Document |
next()
get the next document |
boolean |
nextDocument()
Move the collection to the start of the next document. |
protected boolean |
openNextFile()
|
void |
remove()
Unsupported by this Collection implementation; throws UnsupportedOperationException on all invocations |
void |
reset()
Resets the Collection iterator to the start of the collection. |
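The method summary exposes two ways to walk the collection: the Collection-style calls (nextDocument(), getDocument(), endOfCollection(), reset()) and the Iterator-style calls (hasNext(), next(), remove()). The sketch below shows both loops against a hypothetical stand-in class, ToyCollection, that reuses the method names but iterates over plain strings. It is not Terrier code, and the exact end-of-collection behaviour of the real class is an assumption here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical illustration; not part of Terrier.
public class CollectionIterationSketch {

    // Stand-in that borrows the method names from the summary above.
    static class ToyCollection implements Iterator<String> {
        private final List<String> docs;
        private int position = -1; // index of the current document; -1 before the first one

        ToyCollection(List<String> docs) {
            this.docs = docs;
        }

        // Move to the start of the next document; false if there is none.
        public boolean nextDocument() {
            if (endOfCollection()) {
                return false;
            }
            position++;
            return true;
        }

        // The document the collection is currently positioned on, or null before the first move.
        public String getDocument() {
            return position >= 0 ? docs.get(position) : null;
        }

        public boolean endOfCollection() {
            return position + 1 >= docs.size();
        }

        public void reset() {
            position = -1;
        }

        @Override
        public boolean hasNext() {
            return !endOfCollection();
        }

        @Override
        public String next() {
            if (!nextDocument()) {
                throw new NoSuchElementException();
            }
            return getDocument();
        }

        @Override
        public void remove() {
            throw new UnsupportedOperationException();
        }
    }

    // Collection-style loop: advance, then fetch.
    static List<String> drainCollectionStyle(ToyCollection c) {
        List<String> seen = new ArrayList<>();
        while (c.nextDocument()) {
            seen.add(c.getDocument());
        }
        return seen;
    }

    // Iterator-style loop: hasNext(), then next().
    static List<String> drainIteratorStyle(ToyCollection c) {
        List<String> seen = new ArrayList<>();
        while (c.hasNext()) {
            seen.add(c.next());
        }
        return seen;
    }

    public static void main(String[] args) {
        ToyCollection c = new ToyCollection(Arrays.asList("d1", "d2", "d3"));
        System.out.println(drainCollectionStyle(c));
        c.reset(); // return to the start before the second pass
        System.out.println(drainIteratorStyle(c));
    }
}
```

Both loops visit the same documents in the same order; reset() is what makes the second pass possible.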
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
protected static final org.apache.log4j.Logger logger
public static final java.lang.String ELEMENT_ATTR_SEPARATOR
protected static final boolean bReformXML
protected java.util.HashSet<java.lang.String> DocumentElements
protected boolean DocumentTags
protected java.util.HashSet<java.lang.String> TermElements
protected java.lang.String DocIdLocation
protected boolean DocIdIsAttribute
protected boolean TermsInAttributes
protected javax.xml.parsers.DocumentBuilderFactory dbFactory
protected javax.xml.parsers.DocumentBuilder dBuilder
protected org.w3c.dom.Document xmlDoc
protected java.util.LinkedList<org.terrier.indexing.SimpleXMLCollection.XMLDocument> Documents
protected org.terrier.indexing.SimpleXMLCollection.XMLDocument thisDoc
protected boolean EOC
protected java.util.HashSet<java.lang.String> DocIDBlacklist
protected java.util.LinkedList<java.lang.String> FilesToProcess
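DocIdLocation names where the document identifier lives, and DocIdIsAttribute is set when that name contains ELEMENT_ATTR_SEPARATOR. One plausible reading is that a location of the form element, separator, attribute points at an attribute, while a bare tag name points at element text. The sketch below illustrates that reading only. The separator value ":" is an assumption (this page does not state it), and DocIdLocationSketch, parseRoot, resolveDocId and firstDescendant are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical illustration; not part of Terrier.
public class DocIdLocationSketch {

    // Assumed separator value; the documentation above does not give it.
    static final String SEPARATOR = ":";

    // Parse an XML string and return its root element.
    static Element parseRoot(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
    }

    // Mirrors the role of DocIdIsAttribute: true when the location names an attribute.
    static boolean isAttributeLocation(String location) {
        return location.contains(SEPARATOR);
    }

    // First descendant element with the given tag name, or null if there is none.
    static Element firstDescendant(Element root, String tag) {
        NodeList matches = root.getElementsByTagName(tag);
        return matches.getLength() == 0 ? null : (Element) matches.item(0);
    }

    // Resolve a document id from either "element:attribute" or a plain tag name.
    static String resolveDocId(Element docElement, String location) {
        if (isAttributeLocation(location)) {
            String[] parts = location.split(Pattern.quote(SEPARATOR), 2);
            // The attribute may sit on the document element itself or on a descendant.
            Element holder = docElement.getNodeName().equals(parts[0])
                    ? docElement
                    : firstDescendant(docElement, parts[0]);
            return holder == null ? null : holder.getAttribute(parts[1]);
        }
        Element holder = firstDescendant(docElement, location);
        return holder == null ? null : holder.getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        Element doc = parseRoot("<doc id=\"d7\"><docno> D-42 </docno><text>hello</text></doc>");
        System.out.println(resolveDocId(doc, "doc" + SEPARATOR + "id"));
        System.out.println(resolveDocId(doc, "docno"));
    }
}
```

Deciding once whether the location contains the separator, as the boolean field does, avoids re-checking the string for every document.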
Constructor Detail |
---|
public SimpleXMLCollection(java.util.List<java.lang.String> filesToProcess)
filesToProcess -
public SimpleXMLCollection()
public SimpleXMLCollection(java.lang.String CollectionSpecFilename, java.lang.String BlacklistSpecFilename)
CollectionSpecFilename -
BlacklistSpecFilename -
Method Detail |
---|
protected void initialiseParser()
protected void initialiseTags()
public void close()
close
in interface java.io.Closeable
public boolean hasNext()
public Document next()
public void remove()
public boolean endOfCollection()
endOfCollection
in interface Collection
public boolean nextDocument()
nextDocument
in interface Collection
protected boolean findDocumentElement(org.w3c.dom.Node n)
public Document getDocument()
getDocument
in interface Collection
public void reset()
reset
in interface Collection
protected boolean openNextFile()
public static void main(java.lang.String[] args)
args -