Package org.terrier.indexing
Class MultiDocumentFileCollection
- java.lang.Object
  - org.terrier.indexing.MultiDocumentFileCollection
- All Implemented Interfaces: java.io.Closeable, java.lang.AutoCloseable, Collection
- Direct Known Subclasses: TRECCollection, WARC018Collection, WARC09Collection
public abstract class MultiDocumentFileCollection extends java.lang.Object implements Collection
-
-
Field Summary
Fields (Modifier and Type, Field, Description)
- protected java.lang.String currentFilename: Filename of current file
- protected java.lang.String desiredEncoding: Encoding to be used to open all files.
- protected java.util.Map<java.lang.String,java.lang.String> DocProperties: properties for the current document
- protected java.lang.Class<? extends Document> documentClass: Class to use for all documents parsed by this class
- protected int documentsInThisFile: Counts the number of documents that have been found in this file.
- protected boolean eoc: are we at the end of the collection?
- protected boolean eof: has the end of the current input file been reached?
- protected int FileNumber: The index in the FilesToProcess of the currently processed file.
- protected java.util.List<java.lang.String> FilesToProcess: The list of files to process.
- protected boolean forceUTF8: should UTF8 encoding be assumed?
- protected java.io.InputStream is: the input stream of the current input file
- protected static org.slf4j.Logger logger: logger for this class
- protected boolean SkipFile: A boolean which is true when a new file is open.
- protected Tokeniser tokeniser: Tokeniser to use for all documents parsed by this class
-
Constructor Summary
Constructors (Modifier, Constructor, Description)
- protected MultiDocumentFileCollection()
- protected MultiDocumentFileCollection(java.io.InputStream input): A constructor that reads only the specified InputStream.
- protected MultiDocumentFileCollection(java.lang.String CollectionSpecFilename): construct a collection from the denoted collection.spec file
- protected MultiDocumentFileCollection(java.util.List<java.lang.String> _FilesToProcess): construct a collection from the given list of files to process
-
Method Summary
All Methods (Modifier and Type, Method, Description)
- protected void checkEncoding()
- void close(): Closes the collection and any files that may be open.
- boolean endOfCollection(): Returns true if the end of the collection has been reached.
- protected void extractCharset()
- abstract Document getDocument(): Get the document object representing the current document.
- boolean hasNext(): Check whether it is the last document in the collection.
- protected void loadDocumentClass(): Loads the class that will supply all documents for this Collection.
- Document next(): Return the next document.
- abstract boolean nextDocument(): Move the collection to the start of the next document.
- protected void openNewFile()
- protected boolean openNextFile(): Opens the next file from the collection specification.
- void reset(): Resets the Collection iterator to the start of the collection.
-
-
-
Field Detail
-
logger
protected static final org.slf4j.Logger logger
logger for this class
-
documentsInThisFile
protected int documentsInThisFile
Counts the number of documents that have been found in this file.
-
eoc
protected boolean eoc
are we at the end of the collection?
-
eof
protected boolean eof
has the end of the current input file been reached?
-
SkipFile
protected boolean SkipFile
A boolean which is true when a new file is open.
-
currentFilename
protected java.lang.String currentFilename
Filename of current file
-
forceUTF8
protected final boolean forceUTF8
should UTF8 encoding be assumed?
-
is
protected java.io.InputStream is
the input stream of the current input file
-
DocProperties
protected java.util.Map<java.lang.String,java.lang.String> DocProperties
properties for the current document
-
FilesToProcess
protected java.util.List<java.lang.String> FilesToProcess
The list of files to process.
-
FileNumber
protected int FileNumber
The index in the FilesToProcess of the currently processed file.
-
desiredEncoding
protected java.lang.String desiredEncoding
Encoding to be used to open all files.
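As an illustration of what "encoding to be used to open all files" can mean in practice, the following is a minimal sketch of decoding an InputStream with a named encoding. The class EncodingSketch and its readAll method are hypothetical and not part of Terrier; only standard java.io and java.nio.charset calls are used.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Terrier: decodes a stream using a named encoding.
class EncodingSketch {
    static String readAll(InputStream is, String desiredEncoding) throws IOException {
        // Charset.forName throws an unchecked exception for unknown or illegal names.
        Charset cs = Charset.forName(desiredEncoding);
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(is, cs))) {
            int c;
            while ((c = br.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Bytes written as ISO-8859-1 are only decoded correctly with the same encoding.
        byte[] bytes = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(readAll(new ByteArrayInputStream(bytes), "ISO-8859-1"));
    }
}
```

How the actual class combines desiredEncoding with forceUTF8 is not described on this page, so the sketch deliberately covers only the decoding step.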
-
documentClass
protected java.lang.Class<? extends Document> documentClass
Class to use for all documents parsed by this class
-
tokeniser
protected Tokeniser tokeniser
Tokeniser to use for all documents parsed by this class
-
-
Constructor Detail
-
MultiDocumentFileCollection
protected MultiDocumentFileCollection()
-
MultiDocumentFileCollection
protected MultiDocumentFileCollection(java.util.List<java.lang.String> _FilesToProcess)
construct a collection from the given list of files to process
-
MultiDocumentFileCollection
protected MultiDocumentFileCollection(java.lang.String CollectionSpecFilename)
construct a collection from the denoted collection.spec file
-
MultiDocumentFileCollection
protected MultiDocumentFileCollection(java.io.InputStream input)
A constructor that reads only the specified InputStream.
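The constructors above accept a collection.spec filename, a list of files, or a single InputStream. A rough sketch of turning a spec file into a list of files to process follows. It assumes one path per line, with blank lines and lines starting with '#' skipped; that format is an assumption here and is not stated on this page. SpecFileSketch and readSpec are hypothetical names, not Terrier API.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Terrier's implementation.
class SpecFileSketch {
    // Assumed format: one file path per line; blank lines and '#' comments are ignored.
    static List<String> readSpec(Reader spec) throws IOException {
        List<String> files = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(spec)) {
            String line;
            while ((line = br.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty() || line.startsWith("#")) {
                    continue;
                }
                files.add(line);
            }
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        String spec = "# example spec\n/data/part1.gz\n\n/data/part2.gz\n";
        System.out.println(readSpec(new StringReader(spec)));
    }
}
```

The resulting list has the same shape as the argument of the List-based constructor, which is why the two constructors can plausibly share most of their setup.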
-
-
Method Detail
-
getDocument
public abstract Document getDocument()
Description copied from interface: Collection
Get the document object representing the current document.
- Specified by: getDocument in interface Collection
- Returns: the current document
-
checkEncoding
protected void checkEncoding()
-
loadDocumentClass
protected void loadDocumentClass()
Loads the class that will supply all documents for this Collection. Set by property trec.document.class
-
hasNext
public boolean hasNext()
Check whether it is the last document in the collection.
- Returns: boolean
-
next
public Document next()
Return the next document.
- Returns: next document
-
close
public void close()
Closes the collection and any files that may be open.
- Specified by: close in interface java.lang.AutoCloseable
- Specified by: close in interface java.io.Closeable
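Because close() is specified by java.lang.AutoCloseable, an instance can be managed with try-with-resources. The sketch below shows the pattern with a stand-in class; CloseSketch is hypothetical and only mimics the fact that close() declares no checked exception, as in the signature above.

```java
import java.io.Closeable;

// Hypothetical stand-in for a collection; it only records whether close() was called.
class CloseSketch implements Closeable {
    boolean closed = false;

    @Override
    public void close() {
        closed = true;
    }

    // Returns true if try-with-resources closed the resource on exit.
    static boolean demo() {
        CloseSketch outer = new CloseSketch();
        try (CloseSketch c = outer) {
            // Work with c here; close() runs automatically when the block exits.
        }
        return outer.closed;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```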
-
endOfCollection
public boolean endOfCollection()
Returns true if the end of the collection has been reached.
- Specified by: endOfCollection in interface Collection
- Returns: true if the end of the collection has been reached, otherwise false.
-
openNextFile
protected boolean openNextFile() throws java.io.IOException
Opens the next file from the collection specification.
- Returns: true if the file was opened successfully. If there are no more files to open, it returns false.
- Throws: java.io.IOException - if there is an exception while opening the collection files.
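The contract above (return true when a file was opened, false when none remain) can be sketched with a list and an index. This is not Terrier's implementation: FileAdvanceSketch and its open method are hypothetical, the streams are in-memory stand-ins, and here fileNumber points at the next file to open, a simplification of the FileNumber field described above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of stepping through a list of files one at a time.
class FileAdvanceSketch {
    final List<String> filesToProcess;
    int fileNumber = 0;        // index of the next file to open (simplification)
    String currentFilename;
    InputStream is;

    FileAdvanceSketch(List<String> files) {
        this.filesToProcess = files;
    }

    // Stand-in opener: a real implementation would open the named file.
    InputStream open(String name) throws IOException {
        return new ByteArrayInputStream(name.getBytes(StandardCharsets.UTF_8));
    }

    boolean openNextFile() throws IOException {
        // Release the previous stream before moving on.
        if (is != null) {
            is.close();
            is = null;
        }
        if (fileNumber >= filesToProcess.size()) {
            return false;      // no more files to open
        }
        currentFilename = filesToProcess.get(fileNumber++);
        is = open(currentFilename);
        return true;
    }

    public static void main(String[] args) throws IOException {
        FileAdvanceSketch s = new FileAdvanceSketch(Arrays.asList("a.gz", "b.gz"));
        while (s.openNextFile()) {
            System.out.println("opened " + s.currentFilename);
        }
    }
}
```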
-
extractCharset
protected void extractCharset()
-
openNewFile
protected void openNewFile() throws java.lang.Exception
- Throws: java.lang.Exception
-
nextDocument
public abstract boolean nextDocument()
Move the collection to the start of the next document.
- Specified by: nextDocument in interface Collection
- Returns: true if there exists another document in the collection, otherwise false.
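Read together, nextDocument() and getDocument() suggest a cursor-style loop: advance, then fetch the current document. The sketch below shows that loop against a toy in-memory implementation. CursorSketch, ListCursor and CursorLoopSketch are hypothetical names, and String stands in for Document so that the example compiles without Terrier on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the cursor-style methods of Collection.
interface CursorSketch {
    boolean nextDocument();
    String getDocument();
    boolean endOfCollection();
}

// Toy implementation backed by a list; String stands in for Document.
class ListCursor implements CursorSketch {
    private final List<String> docs;
    private int pos = -1;   // -1 means "before the first document"

    ListCursor(List<String> docs) {
        this.docs = docs;
    }

    public boolean nextDocument() {
        if (pos + 1 >= docs.size()) {
            pos = docs.size();   // park the cursor past the end
            return false;
        }
        pos++;
        return true;
    }

    public String getDocument() {
        return docs.get(pos);
    }

    public boolean endOfCollection() {
        return pos >= docs.size();
    }
}

class CursorLoopSketch {
    // Advance first, then read the current document, until nextDocument() returns false.
    static List<String> drain(CursorSketch c) {
        List<String> seen = new ArrayList<>();
        while (c.nextDocument()) {
            seen.add(c.getDocument());
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(drain(new ListCursor(Arrays.asList("doc1", "doc2"))));
    }
}
```

In a concrete subclass, nextDocument() would be where the file handling happens, since documents span several input files.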
-
reset
public void reset()
Resets the Collection iterator to the start of the collection.
- Specified by: reset in interface Collection
-
-