Package org.terrier.indexing
Class SimpleMedlineXMLCollection
- java.lang.Object
  - org.terrier.indexing.SimpleXMLCollection
    - org.terrier.indexing.SimpleMedlineXMLCollection
- All Implemented Interfaces:
java.io.Closeable, java.lang.AutoCloseable, Collection
public class SimpleMedlineXMLCollection extends SimpleXMLCollection
Initial implementation of a class that generates a Collection of Documents from a series of XML files in the Medline format. It processes a limited number of documents from an XML file at a time, to avoid running out of memory when the XML file is too large.
Properties:
- lowercase - lower case all terms obtained. Highly recommended.
- indexing.simplexmlcollection.reformxml - if set, tries to repair (reform) broken & entities in the XML.
- xml.doc.buffer.size - the maximum number of documents to process per iteration.
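As a rough illustration of how the three properties above might be supplied and read, the sketch below parses them from key=value text with java.util.Properties. The class MedlineCollectionSettings is hypothetical and not part of Terrier, and the fallback values are assumptions made for the example, not Terrier's documented defaults.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical helper (not part of Terrier): reads the three properties listed above. */
final class MedlineCollectionSettings {
    final boolean lowercase;
    final boolean reformXml;
    final int docBufferSize;

    MedlineCollectionSettings(Properties p) {
        // Fallback values are assumptions for this sketch only.
        lowercase = Boolean.parseBoolean(p.getProperty("lowercase", "true"));
        reformXml = Boolean.parseBoolean(
                p.getProperty("indexing.simplexmlcollection.reformxml", "false"));
        docBufferSize = Integer.parseInt(p.getProperty("xml.doc.buffer.size", "1000"));
    }

    /** Parses settings from text in Java properties (key=value) syntax. */
    static MedlineCollectionSettings parse(String propertiesText) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new MedlineCollectionSettings(p);
    }

    public static void main(String[] args) {
        String text = String.join("\n",
                "lowercase=true",
                "indexing.simplexmlcollection.reformxml=true",
                "xml.doc.buffer.size=500");
        MedlineCollectionSettings s = parse(text);
        System.out.println(s.lowercase + " " + s.reformXml + " " + s.docBufferSize);
    }
}
```

A smaller xml.doc.buffer.size should lower peak memory use at the cost of more iterations per file.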
- Author:
- Ben He
Field Summary
Fields (Modifier and Type, Field, Description):
- protected int currentFileDocCounter - The number of documents processed in the current XML file.
- java.lang.String docEndTag - The end tag of documents in the XML files.
- java.lang.String docTag - The tag of documents in the XML files.
- java.lang.String EOL - The end of line string.
- java.lang.String fileEndTag - The tag indicating the end of an XML file.
- java.lang.String fileTag - The tag indicating the start of an XML file.
- protected int NUMBER_OF_DOCS_IN_BUFFER - The number of documents to process per iteration.
Fields inherited from class org.terrier.indexing.SimpleXMLCollection
bReformXML, dbFactory, dBuilder, DocIDBlacklist, DocIdIsAttribute, DocIdLocation, DocumentElements, Documents, DocumentTags, ELEMENT_ATTR_SEPARATOR, EOC, FilesToProcess, logger, PropertiesInAttibutes, PropertyElements, TermElements, TermsInAttributes, thisDoc, xmlDoc
Constructor Summary
Constructors (Constructor, Description):
- SimpleMedlineXMLCollection() - The default constructor.
- SimpleMedlineXMLCollection(java.lang.String CollectionSpecFilename, java.lang.String BlacklistSpecFilename) - An alternative constructor.
- SimpleMedlineXMLCollection(java.lang.String CollectionSpecFilename, java.lang.String ignored1, java.lang.String BlacklistSpecFilename, java.lang.String ignored2) - Constructor required by TRECIndexing.
- SimpleMedlineXMLCollection(java.util.List<java.lang.String> files, java.lang.String ignored1, java.lang.String BlacklistSpecFilename, java.lang.String ignored2) - Constructor required by TRECIndexing.
-
Method Summary
Methods (Modifier and Type, Method, Description):
- protected boolean openNextFile() - Parse through up to a limited number of documents in the XML file.
Methods inherited from class org.terrier.indexing.SimpleXMLCollection
close, endOfCollection, findDocumentElement, getDocument, hasNext, initialiseParser, initialiseTags, loadBlacklist, main, next, nextDocument, remove, reset
Field Detail
-
currentFileDocCounter
protected int currentFileDocCounter
The number of documents processed in the current XML file.
-
docTag
public final java.lang.String docTag
The tag of documents in the XML files.
- See Also:
- Constant Field Values
-
docEndTag
public final java.lang.String docEndTag
The end tag of documents in the XML files.
- See Also:
- Constant Field Values
-
fileTag
public final java.lang.String fileTag
The tag indicating the start of an XML file.
- See Also:
- Constant Field Values
-
fileEndTag
public final java.lang.String fileEndTag
The tag indicating the end of an XML file.
- See Also:
- Constant Field Values
-
EOL
public final java.lang.String EOL
The end of line string.
-
NUMBER_OF_DOCS_IN_BUFFER
protected final int NUMBER_OF_DOCS_IN_BUFFER
The number of documents to process per iteration.
Constructor Detail
-
SimpleMedlineXMLCollection
public SimpleMedlineXMLCollection()
The default constructor.
-
SimpleMedlineXMLCollection
public SimpleMedlineXMLCollection(java.lang.String CollectionSpecFilename, java.lang.String BlacklistSpecFilename)
An alternative constructor.
- Parameters:
CollectionSpecFilename - The name of the file containing the location of XML files in the collection.
BlacklistSpecFilename - The name of the file containing the location of the blacklisted XML files in the collection.
-
SimpleMedlineXMLCollection
public SimpleMedlineXMLCollection(java.lang.String CollectionSpecFilename, java.lang.String ignored1, java.lang.String BlacklistSpecFilename, java.lang.String ignored2)
Constructor required by TRECIndexing
-
SimpleMedlineXMLCollection
public SimpleMedlineXMLCollection(java.util.List<java.lang.String> files, java.lang.String ignored1, java.lang.String BlacklistSpecFilename, java.lang.String ignored2)
Constructor required by TRECIndexing
Method Detail
-
openNextFile
protected boolean openNextFile()
Parse through up to a limited number of documents in the XML file. The limit is specified by the property xml.doc.buffer.size.
- Overrides:
openNextFile in class SimpleXMLCollection
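The batching idea behind this method can be sketched as follows. This is a minimal illustration under stated assumptions, not Terrier's implementation: the class BatchedXmlReader is hypothetical, the tag values (<DocSet>, <Doc>) are placeholders because the real constant values are not listed on this page, and each tag is assumed to sit on its own line. Each call collects at most maxDocs complete documents and wraps them in the file start and end tags, so that every batch is a small, well-formed XML string that could be handed to a DOM parser without loading the whole file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical sketch (not Terrier's source): reads at most maxDocs documents per call. */
final class BatchedXmlReader {
    // Placeholder tag values, assumed for illustration only.
    static final String FILE_TAG = "<DocSet>";
    static final String FILE_END_TAG = "</DocSet>";
    static final String DOC_TAG = "<Doc>";
    static final String DOC_END_TAG = "</Doc>";
    static final String EOL = "\n";

    private final BufferedReader reader;
    private final int maxDocs;   // plays the role of NUMBER_OF_DOCS_IN_BUFFER
    private int docsRead = 0;    // plays the role of currentFileDocCounter

    BatchedXmlReader(BufferedReader reader, int maxDocs) {
        this.reader = reader;
        this.maxDocs = maxDocs;
    }

    int docsRead() {
        return docsRead;
    }

    /** Returns the next batch as well-formed XML, or null when no documents remain. */
    String nextBatch() {
        StringBuilder batch = new StringBuilder(FILE_TAG).append(EOL);
        StringBuilder doc = new StringBuilder();
        int docsInBatch = 0;
        boolean insideDoc = false;
        try {
            String line;
            // The size check comes first, so no line is consumed once the batch is full.
            while (docsInBatch < maxDocs && (line = reader.readLine()) != null) {
                if (!insideDoc && line.contains(DOC_TAG)) {
                    insideDoc = true;
                }
                if (insideDoc) {
                    doc.append(line).append(EOL);
                    if (line.contains(DOC_END_TAG)) {
                        // Only complete documents are added to the batch.
                        batch.append(doc);
                        doc.setLength(0);
                        insideDoc = false;
                        docsInBatch++;
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        docsRead += docsInBatch;
        return docsInBatch == 0 ? null : batch.append(FILE_END_TAG).append(EOL).toString();
    }

    public static void main(String[] args) {
        String xml = "<DocSet>\n<Doc>\n<PMID>1</PMID>\n</Doc>\n<Doc>\n<PMID>2</PMID>\n</Doc>\n"
                + "<Doc>\n<PMID>3</PMID>\n</Doc>\n</DocSet>\n";
        BatchedXmlReader r = new BatchedXmlReader(new BufferedReader(new StringReader(xml)), 2);
        for (String b = r.nextBatch(); b != null; b = r.nextBatch()) {
            System.out.print(b);
        }
    }
}
```

With three documents and a limit of two, the first call yields a batch of two documents and the second a batch of one; a third call returns null.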