Package org.terrier.indexing
Class MultiDocumentFileCollection
- java.lang.Object
  - org.terrier.indexing.MultiDocumentFileCollection
- All Implemented Interfaces:
java.io.Closeable, java.lang.AutoCloseable, Collection
- Direct Known Subclasses:
TRECCollection, WARC018Collection, WARC09Collection
public abstract class MultiDocumentFileCollection extends java.lang.Object implements Collection
-
-
Field Summary
Fields
Modifier and Type | Field | Description
protected java.lang.String | currentFilename | Filename of current file
protected java.lang.String | desiredEncoding | Encoding to be used to open all files.
protected java.util.Map<java.lang.String,java.lang.String> | DocProperties | properties for the current document
protected java.lang.Class<? extends Document> | documentClass | Class to use for all documents parsed by this class
protected int | documentsInThisFile | Counts the number of documents that have been found in this file.
protected boolean | eoc | are we at the end of the collection?
protected boolean | eof | has the end of the current input file been reached?
protected int | FileNumber | The index in the FilesToProcess of the currently processed file.
protected java.util.List<java.lang.String> | FilesToProcess | The list of files to process.
protected boolean | forceUTF8 | should UTF8 encoding be assumed?
protected java.io.InputStream | is | the input stream of the current input file
protected static org.slf4j.Logger | logger | logger for this class
protected boolean | SkipFile | A boolean which is true when a new file is open.
protected Tokeniser | tokeniser | Tokeniser to use for all documents parsed by this class
-
Constructor Summary
Constructors
Modifier | Constructor | Description
protected | MultiDocumentFileCollection() |
protected | MultiDocumentFileCollection(java.io.InputStream input) | A constructor that reads only the specified InputStream.
protected | MultiDocumentFileCollection(java.lang.String CollectionSpecFilename) | construct a collection from the denoted collection.spec file
protected | MultiDocumentFileCollection(java.util.List<java.lang.String> _FilesToProcess) | construct a collection from the given list of files to process
-
Method Summary
Modifier and Type | Method | Description
protected void | checkEncoding() |
void | close() | Closes the collection and any files that may be open.
boolean | endOfCollection() | Returns true if the end of the collection has been reached
protected void | extractCharset() |
abstract Document | getDocument() | Get the document object representing the current document.
boolean | hasNext() | Check whether it is the last document in the collection
protected void | loadDocumentClass() | Loads the class that will supply all documents for this Collection.
Document | next() | Return the next document
abstract boolean | nextDocument() | Move the collection to the start of the next document.
protected void | openNewFile() |
protected boolean | openNextFile() | Opens the next file from the collection specification.
void | reset() | Resets the Collection iterator to the start of the collection.
-
-
-
Field Detail
-
logger
protected static final org.slf4j.Logger logger
logger for this class
-
documentsInThisFile
protected int documentsInThisFile
Counts the number of documents that have been found in this file.
-
eoc
protected boolean eoc
are we at the end of the collection?
-
eof
protected boolean eof
has the end of the current input file been reached?
-
SkipFile
protected boolean SkipFile
A boolean which is true when a new file is open.
-
currentFilename
protected java.lang.String currentFilename
Filename of current file
-
forceUTF8
protected final boolean forceUTF8
should UTF8 encoding be assumed?
-
is
protected java.io.InputStream is
the input stream of the current input file
-
DocProperties
protected java.util.Map<java.lang.String,java.lang.String> DocProperties
properties for the current document
-
FilesToProcess
protected java.util.List<java.lang.String> FilesToProcess
The list of files to process.
-
FileNumber
protected int FileNumber
The index in the FilesToProcess of the currently processed file.
-
desiredEncoding
protected java.lang.String desiredEncoding
Encoding to be used to open all files.
-
documentClass
protected java.lang.Class<? extends Document> documentClass
Class to use for all documents parsed by this class
-
tokeniser
protected Tokeniser tokeniser
Tokeniser to use for all documents parsed by this class
-
-
Constructor Detail
-
MultiDocumentFileCollection
protected MultiDocumentFileCollection()
-
MultiDocumentFileCollection
protected MultiDocumentFileCollection(java.util.List<java.lang.String> _FilesToProcess)
construct a collection from the given list of files to process
-
MultiDocumentFileCollection
protected MultiDocumentFileCollection(java.lang.String CollectionSpecFilename)
construct a collection from the denoted collection.spec file
-
MultiDocumentFileCollection
protected MultiDocumentFileCollection(java.io.InputStream input)
A constructor that reads only the specified InputStream.
-
-
Method Detail
-
getDocument
public abstract Document getDocument()
Description copied from interface: Collection
Get the document object representing the current document.
Specified by:
getDocument in interface Collection
Returns:
Document: the current document
-
checkEncoding
protected void checkEncoding()
-
loadDocumentClass
protected void loadDocumentClass()
Loads the class that will supply all documents for this Collection. Set by property trec.document.class
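Loading a document class named by a property can be sketched with plain reflection. The example below is a minimal illustration under stated assumptions: `Doc`, `PlainDoc`, `TaggedDoc` and `DocumentClassLoaderSketch` are hypothetical stand-ins (not Terrier types), the property is read from a `java.util.Properties` object rather than Terrier's own configuration, and the fallback to `PlainDoc` is an assumed default.

```java
import java.util.Properties;

// Hypothetical base type standing in for Terrier's Document.
interface Doc {
    String kind();
}

// Two hypothetical document implementations, for illustration only.
class PlainDoc implements Doc {
    @Override
    public String kind() {
        return "plain";
    }
}

class TaggedDoc implements Doc {
    @Override
    public String kind() {
        return "tagged";
    }
}

class DocumentClassLoaderSketch {
    // Resolves the class named by trec.document.class.
    // Assumption: PlainDoc is used when the property is not set.
    static Class<? extends Doc> load(Properties props) {
        String name = props.getProperty("trec.document.class", PlainDoc.class.getName());
        try {
            // asSubclass fails fast if the named class is not a Doc
            return Class.forName(name).asSubclass(Doc.class);
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("Unknown document class: " + name, e);
        }
    }

    // Creates an instance through the no-argument constructor.
    static Doc instantiate(Class<? extends Doc> type) {
        try {
            return type.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + type.getName(), e);
        }
    }
}
```

Resolving the class once and reusing the `Class` object for every document avoids repeating the name lookup per document.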
-
hasNext
public boolean hasNext()
Check whether it is the last document in the collection.
Returns:
boolean
-
next
public Document next()
Return the next document.
Returns:
next document
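One way to layer iterator-style hasNext()/next() access over a cursor-style nextDocument()/getDocument() API is a one-step look-ahead buffer. The sketch below shows that general technique only; it is not a description of how this class implements these methods. `Cursor`, `ArrayCursor` and `LookAheadIterator` are hypothetical names, and documents are modelled as non-null strings.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical cursor-style source (stand-in, not a Terrier type).
interface Cursor {
    boolean nextDocument();
    String getDocument();
}

// Toy cursor over a fixed array of documents.
class ArrayCursor implements Cursor {
    private final String[] docs;
    private int pos = -1;   // before the first document

    ArrayCursor(String... docs) {
        this.docs = docs;
    }

    @Override
    public boolean nextDocument() {
        if (pos + 1 >= docs.length) {
            return false;
        }
        pos++;
        return true;
    }

    @Override
    public String getDocument() {
        return docs[pos];
    }
}

// Adapts a Cursor to java.util.Iterator by fetching one document ahead.
class LookAheadIterator implements Iterator<String> {
    private final Cursor cursor;
    private String buffered;     // document fetched ahead of time, or null
    private boolean exhausted;   // true once the cursor reported no more documents

    LookAheadIterator(Cursor cursor) {
        this.cursor = cursor;
    }

    @Override
    public boolean hasNext() {
        // Only advance the cursor when nothing is buffered, so repeated calls are safe.
        if (buffered == null && !exhausted) {
            if (cursor.nextDocument()) {
                buffered = cursor.getDocument();
            } else {
                exhausted = true;
            }
        }
        return buffered != null;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String doc = buffered;
        buffered = null;
        return doc;
    }
}
```

Buffering keeps hasNext() idempotent: calling it several times in a row never skips a document.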
-
close
public void close()
Closes the collection and any files that may be open.
Specified by:
close in interface java.lang.AutoCloseable
Specified by:
close in interface java.io.Closeable
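Because the class implements java.lang.AutoCloseable, instances can be managed with try-with-resources so that close() runs even if reading fails. The sketch below demonstrates the pattern with a hypothetical `ClosableSource` stand-in (not a Terrier type) that records its lifecycle in a log.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a collection that holds an open file.
class ClosableSource implements Closeable {
    private final List<String> log;
    private boolean open = true;

    ClosableSource(List<String> log) {
        this.log = log;
        log.add("open");
    }

    String read() {
        if (!open) {
            throw new IllegalStateException("source is closed");
        }
        log.add("read");
        return "doc";
    }

    // Narrowed signature: this sketch never throws a checked exception on close.
    @Override
    public void close() {
        open = false;
        log.add("close");
    }
}

class CloseSketch {
    // Opens a source, reads once, and relies on try-with-resources to close it.
    static List<String> run() {
        List<String> log = new ArrayList<>();
        try (ClosableSource source = new ClosableSource(log)) {
            source.read();
        }
        return log;
    }
}
```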
-
endOfCollection
public boolean endOfCollection()
Returns true if the end of the collection has been reached.
Specified by:
endOfCollection in interface Collection
Returns:
true if the end of the collection has been reached, false otherwise.
-
openNextFile
protected boolean openNextFile() throws java.io.IOException
Opens the next file from the collection specification.
Returns:
true if the file was opened successfully; false if there are no more files to open.
Throws:
java.io.IOException - if there is an exception while opening the collection files.
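The contract described above (true when a file was opened, false when the list is exhausted) can be sketched as an index walking a list of file names. This is a simplified illustration, not the actual implementation: `FileWalker` is a hypothetical class, an in-memory map stands in for the file system, and skipping entries that cannot be opened is an assumption made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: opens the entries of a file list one at a time.
class FileWalker {
    private final List<String> filesToProcess;
    private final Map<String, String> fakeDisk;  // file name -> contents (stands in for real files)
    private int fileNumber = -1;                 // index of the currently open file
    private InputStream is;                      // stream of the currently open file
    String currentFilename;

    FileWalker(List<String> filesToProcess, Map<String, String> fakeDisk) {
        this.filesToProcess = filesToProcess;
        this.fakeDisk = fakeDisk;
    }

    // Returns true if another file was opened, false if the list is exhausted.
    boolean openNextFile() throws IOException {
        if (is != null) {
            is.close();   // release the previous file before moving on
            is = null;
        }
        while (fileNumber + 1 < filesToProcess.size()) {
            fileNumber++;
            String name = filesToProcess.get(fileNumber);
            String contents = fakeDisk.get(name);
            if (contents == null) {
                continue; // assumption: entries that cannot be opened are skipped
            }
            currentFilename = name;
            is = new ByteArrayInputStream(contents.getBytes(StandardCharsets.UTF_8));
            return true;
        }
        return false;
    }

    // Convenience helper for the example: names of the files opened, in order.
    static List<String> openedOrder(List<String> files, Map<String, String> fakeDisk) {
        FileWalker walker = new FileWalker(files, fakeDisk);
        List<String> opened = new ArrayList<>();
        try {
            while (walker.openNextFile()) {
                opened.add(walker.currentFilename);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return opened;
    }
}
```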
-
extractCharset
protected void extractCharset()
-
openNewFile
protected void openNewFile() throws java.lang.Exception
Throws:
java.lang.Exception
-
nextDocument
public abstract boolean nextDocument()
Move the collection to the start of the next document.
Specified by:
nextDocument in interface Collection
Returns:
true if there exists another document in the collection, false otherwise.
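The nextDocument()/getDocument() pair suggests an advance-then-fetch loop: call nextDocument() until it returns false, reading the current document after each successful advance. A minimal sketch of that loop follows. `DocSource`, `ListDocSource` and `IterationSketch` are hypothetical stand-ins written for this example (they are not Terrier types), and documents are plain strings rather than Document objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the two abstract methods documented above.
interface DocSource {
    boolean nextDocument();   // move to the next document; false when none remain
    String getDocument();     // the document at the current position
}

// Toy implementation backed by an in-memory list (illustration only).
class ListDocSource implements DocSource {
    private final List<String> docs;
    private int position = -1;   // before the first document

    ListDocSource(List<String> docs) {
        this.docs = docs;
    }

    @Override
    public boolean nextDocument() {
        if (position + 1 >= docs.size()) {
            return false;
        }
        position++;
        return true;
    }

    @Override
    public String getDocument() {
        return docs.get(position);
    }
}

class IterationSketch {
    // Drains a source using the advance-then-fetch loop.
    static List<String> readAll(DocSource source) {
        List<String> seen = new ArrayList<>();
        while (source.nextDocument()) {
            seen.add(source.getDocument());
        }
        return seen;
    }
}
```

A concrete subclass would supply its own nextDocument() and getDocument(); the calling loop stays the same.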
-
reset
public void reset()
Resets the Collection iterator to the start of the collection.
Specified by:
reset in interface Collection
-
-