Package org.terrier.indexing
Class SimpleFileCollection
java.lang.Object
    org.terrier.indexing.SimpleFileCollection

All Implemented Interfaces:
java.io.Closeable, java.lang.AutoCloseable, Collection
public class SimpleFileCollection extends java.lang.Object implements Collection
Implements a collection that can read arbitrary files on disk. It uses the file list given to it in the constructor, or else reads the file named by the property collection.spec.

Properties:
- indexing.simplefilecollection.extensionsparsers - a comma-delimited list of tuples of the form "extension:DocumentClass". For instance, one tuple could be "txt:FileDocument". The default is txt:FileDocument,text:FileDocument,tex:FileDocument,bib:FileDocument,pdf:PDFDocument,html:TaggedDocument,htm:TaggedDocument,xhtml:TaggedDocument,xml:TaggedDocument,doc:MSWordDocument,ppt:MSPowerpointDocument,xls:MSExcelDocument.
- indexing.simplefilecollection.defaultparser - the default parser for files with an unknown extension. If this property is empty, such documents will not be opened.
- indexing.simplefilecollection.recurse - whether directories should be opened to look for files.
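The tuple format of indexing.simplefilecollection.extensionsparsers can be illustrated with a short, self-contained sketch. This is not Terrier's implementation: the class ExtensionMappingSketch, the method parseMapping, the lower-casing of extensions, the skipping of malformed tuples, and the prefix "org.terrier.indexing." are assumptions made for illustration. The documentation only states that class names without a period are resolved against NAMESPACE_DOCUMENTS.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ExtensionMappingSketch {
    // Assumed prefix for unqualified class names; the real value is the
    // constant NAMESPACE_DOCUMENTS, which this page does not spell out.
    static final String ASSUMED_NAMESPACE = "org.terrier.indexing.";

    /** Parses "ext:Class,ext:Class" into a map from extension to class name. */
    static Map<String, String> parseMapping(String spec) {
        Map<String, String> mapping = new LinkedHashMap<>();
        if (spec == null || spec.trim().isEmpty()) {
            return mapping;
        }
        for (String tuple : spec.trim().split("\\s*,\\s*")) {
            String[] parts = tuple.split(":", 2);
            if (parts.length != 2 || parts[0].isEmpty() || parts[1].isEmpty()) {
                continue; // assumption: malformed tuples are skipped, not fatal
            }
            String className = parts[1].trim();
            // Names without a period are resolved against the default namespace.
            if (className.indexOf('.') < 0) {
                className = ASSUMED_NAMESPACE + className;
            }
            mapping.put(parts[0].trim().toLowerCase(), className);
        }
        return mapping;
    }

    public static void main(String[] args) {
        String spec = "txt:FileDocument,pdf:PDFDocument,md:com.example.MarkdownDocument";
        System.out.println(parseMapping(spec));
    }
}
```

A real collection would go one step further and load each class (for example with Class.forName) so that it can be instantiated per file; the sketch stops at the name so that it runs without Terrier on the classpath.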
Author:
Craig Macdonald & Vassilis Plachouras
Field Summary
Fields (modifier and type, field, description):
protected java.lang.String currentDocno
    Overridden docno for the current document.
protected java.io.InputStream currentStream
    The InputStream of the most recently opened document.
static java.lang.String DEFAULT_MAPPING_PROPERTY
    What to parse each file type with.
protected int DocCounter
    The identifier of a document in the collection.
protected java.util.Map<java.lang.String,java.lang.Class<? extends Document>> extension_DocumentClass
    Maps filename extensions to Document classes.
protected java.util.LinkedList<java.lang.String> FileList
    The list of files to index.
protected java.util.List<java.lang.String> firstList
    Contains the list of files first handed to the SimpleFileCollection, allowing the SimpleFileCollection instance to be simply reset.
protected java.util.List<java.lang.String> indexedFiles
    This is filled during traversal, so document IDs can be matched with filenames.
protected static org.slf4j.Logger logger
static java.lang.String NAMESPACE_DOCUMENTS
    The default namespace for all parsers to be loaded from.
protected boolean Recurse
    Whether directories should be recursed into by this class.
protected java.lang.String thisFilename
    The filename of the current file we are processing.
protected Tokeniser tokeniser
-
Constructor Summary
Constructors (constructor, description):
SimpleFileCollection()
    A default constructor that uses the files to be processed by this collection, as specified by the property collection.spec.
SimpleFileCollection(java.lang.String addressCollectionFilename)
    Creates an instance of the class.
SimpleFileCollection(java.lang.String addressCollectionFilename, java.lang.String ignored1, java.lang.String ignored2, java.lang.String ignored3)
    Additional constructor required by TRECIndexing.
SimpleFileCollection(java.util.List<java.lang.String> filelist, boolean recurse)
    Constructs an instance of the class with the given list of files.
SimpleFileCollection(java.util.List<java.lang.String> filelist, java.lang.String ignored1, java.lang.String ignored2, java.lang.String ignored3)
    Additional constructor required by TRECIndexing.
-
Method Summary
Methods (modifier and type, method, description):
protected void addDirectoryListing()
    Called when the current file is identified as a directory; adds the entire contents of the directory onto the list to be processed.
void close()
protected void createExtensionDocumentMapping()
    Parses the properties indexing.simplefilecollection.extensionsparsers and indexing.simplefilecollection.defaultparser and attempts to load all the mentioned classes into a hashtable mapping filename extensions to their respective parsers.
boolean endOfCollection()
    Checks whether there are more documents in the collection.
java.lang.String getDocCounter()
    Returns the current document's identifier string.
Document getDocument()
    Returns the current document in the collection.
java.util.List<java.lang.String> getFileList()
    Returns the list of indexed files in the order they were indexed in.
boolean hasNext()
    Checks whether there is a next document in the collection to be processed.
protected Document makeDocument(java.lang.String Filename, java.io.InputStream in)
    Given the open stream in for the file Filename, works out which parser to try and instantiates it.
Document next()
    Moves on to the next document in the collection to be processed.
boolean nextDocument()
    Moves on to the next document in the collection to be processed.
void remove()
    Unsupported by this Collection implementation; every call throws UnsupportedOperationException.
void reset()
    Starts again from the beginning of the collection.
-
Field Detail
-
logger
protected static final org.slf4j.Logger logger
-
NAMESPACE_DOCUMENTS
public static final java.lang.String NAMESPACE_DOCUMENTS
The default namespace for all parsers to be loaded from. Only used if the class name specified does not contain any periods ('.').
See Also: Constant Field Values
-
DEFAULT_MAPPING_PROPERTY
public static final java.lang.String DEFAULT_MAPPING_PROPERTY
What to parse each file type with.
See Also: Constant Field Values
-
FileList
protected java.util.LinkedList<java.lang.String> FileList
The list of files to index.
-
firstList
protected java.util.List<java.lang.String> firstList
Contains the list of files first handed to the SimpleFileCollection, allowing the SimpleFileCollection instance to be simply reset.
-
indexedFiles
protected java.util.List<java.lang.String> indexedFiles
This is filled during traversal, so document IDs can be matched with filenames
-
DocCounter
protected int DocCounter
The identifier of a document in the collection.
-
currentDocno
protected java.lang.String currentDocno
The overridden docno for the current document.
-
Recurse
protected boolean Recurse
Whether directories should be recursed into by this class
-
extension_DocumentClass
protected java.util.Map<java.lang.String,java.lang.Class<? extends Document>> extension_DocumentClass
Maps filename extensions to Document classes. The entry |DEFAULT| maps to the default document parser, specified by indexing.simplefilecollection.defaultparser
-
thisFilename
protected java.lang.String thisFilename
The filename of the current file we are processing.
-
currentStream
protected java.io.InputStream currentStream
The InputStream of the most recently opened document. This is used to ensure that files are closed once reading has finished.
-
tokeniser
protected Tokeniser tokeniser
-
Constructor Detail
-
SimpleFileCollection
public SimpleFileCollection(java.util.List<java.lang.String> filelist, boolean recurse)
Constructs an instance of the class with the given list of files.
Parameters:
filelist - the list of files to be processed by this collection.
recurse - whether directories in the list should be recursed into.
-
SimpleFileCollection
public SimpleFileCollection()
A default constructor that uses the files to be processed by this collection, as specified by the property collection.spec
-
SimpleFileCollection
public SimpleFileCollection(java.util.List<java.lang.String> filelist, java.lang.String ignored1, java.lang.String ignored2, java.lang.String ignored3)
Additional constructor required by TRECIndexing. The three String arguments are ignored.
-
SimpleFileCollection
public SimpleFileCollection(java.lang.String addressCollectionFilename, java.lang.String ignored1, java.lang.String ignored2, java.lang.String ignored3)
Additional constructor required by TRECIndexing. The three trailing String arguments are ignored.
-
SimpleFileCollection
public SimpleFileCollection(java.lang.String addressCollectionFilename)
Creates an instance of the class. The files to be processed are specified in the file with the given name.
Parameters:
addressCollectionFilename - the name of the file that contains the list of files to be processed by this collection.
-
Method Detail
-
createExtensionDocumentMapping
protected void createExtensionDocumentMapping()
Parses the properties indexing.simplefilecollection.extensionsparsers and indexing.simplefilecollection.defaultparser and attempts to load all the mentioned classes into a hashtable mapping filename extensions to their respective parsers. If indexing.simplefilecollection.defaultparser is set, that class is used to attempt to parse documents for which no explicit parser is set.
-
hasNext
public boolean hasNext()
Checks whether there is a next document in the collection to be processed.
Returns:
true if there is another document to process.
-
next
public Document next()
Moves on to the next document in the collection to be processed.
Returns:
the next document.
-
remove
public void remove()
This operation is unsupported by this Collection implementation; every call throws UnsupportedOperationException.
-
nextDocument
public boolean nextDocument()
Moves on to the next document in the collection to be processed.
Specified by:
nextDocument in interface Collection
Returns:
true if there are more documents in the collection, otherwise false.
-
getDocument
public Document getDocument()
Returns the current document in the collection.
Specified by:
getDocument in interface Collection
Returns:
the Document object for the current document in the collection.
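The nextDocument(), getDocument(), endOfCollection() and reset() methods suggest a simple traversal loop. The sketch below shows that loop against a hypothetical stand-in, so that it runs without Terrier: MiniCollection, ListBackedCollection and drain are invented names, and documents are represented by plain filename strings rather than Document objects. The null check is a cautious assumption based on makeDocument being documented to return null when no suitable parser exists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

class CollectionLoopSketch {
    /** Hypothetical stand-in for the traversal methods of Collection. */
    interface MiniCollection {
        boolean nextDocument();
        String getDocument();      // the real method returns a Document
        boolean endOfCollection();
        void reset();
    }

    /** Toy implementation that walks a list of filenames. */
    static class ListBackedCollection implements MiniCollection {
        private final List<String> firstList;  // kept so reset() can start over
        private LinkedList<String> pending;
        private String current;

        ListBackedCollection(List<String> files) {
            this.firstList = new ArrayList<>(files);
            this.pending = new LinkedList<>(files);
        }

        public boolean nextDocument() {
            if (pending.isEmpty()) {
                current = null;
                return false;
            }
            current = pending.removeFirst();
            return true;
        }

        public String getDocument() { return current; }

        public boolean endOfCollection() { return pending.isEmpty(); }

        public void reset() {
            pending = new LinkedList<>(firstList);
            current = null;
        }
    }

    /** Advances with nextDocument() and reads each document with getDocument(). */
    static List<String> drain(MiniCollection collection) {
        List<String> seen = new ArrayList<>();
        while (collection.nextDocument()) {
            String doc = collection.getDocument();
            if (doc == null) {
                continue; // assumed possible when no parser matches the file
            }
            seen.add(doc);
        }
        return seen;
    }

    public static void main(String[] args) {
        MiniCollection c = new ListBackedCollection(Arrays.asList("a.txt", "b.pdf"));
        System.out.println(drain(c));
    }
}
```

Keeping a copy of the list first handed to the collection, as the firstList field description indicates, is what makes reset() cheap in this sketch.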
-
makeDocument
protected Document makeDocument(java.lang.String Filename, java.io.InputStream in)
Given the open stream in for the file Filename, works out which parser to try and instantiates it. If you wish to use a different constructor for opening documents, override this method in a subclass.
Parameters:
Filename - the filename of the currently open document.
in - the stream of the currently open document.
Returns:
a Document object to parse the document, or null if no suitable parser exists.
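The parser selection described here can be sketched as a lookup by filename extension with a fallback to the |DEFAULT| entry documented for extension_DocumentClass. This is an illustration under assumptions, not the library's code: ParserLookupSketch, extensionOf and chooseParser are invented names, parsers are represented by class-name strings instead of loaded classes, and the extension is assumed to be the lower-cased text after the last period of the final path component.

```java
import java.util.HashMap;
import java.util.Map;

class ParserLookupSketch {
    // Key documented for the default parser entry in extension_DocumentClass.
    static final String DEFAULT_KEY = "|DEFAULT|";

    /** Returns the lower-cased extension of the last path component, or "". */
    static String extensionOf(String filename) {
        int dot = filename.lastIndexOf('.');
        int slash = Math.max(filename.lastIndexOf('/'), filename.lastIndexOf('\\'));
        if (dot < 0 || dot < slash || dot == filename.length() - 1) {
            return ""; // no extension in the final path component
        }
        return filename.substring(dot + 1).toLowerCase();
    }

    /** Picks a parser by extension, then the default entry, else null. */
    static String chooseParser(String filename, Map<String, String> mapping) {
        String parser = mapping.get(extensionOf(filename));
        if (parser == null) {
            parser = mapping.get(DEFAULT_KEY);
        }
        return parser; // null means no suitable parser exists
    }

    public static void main(String[] args) {
        Map<String, String> mapping = new HashMap<>();
        mapping.put("txt", "FileDocument");
        mapping.put(DEFAULT_KEY, "FileDocument");
        System.out.println(chooseParser("/data/report.TXT", mapping));
        System.out.println(chooseParser("/data/README", mapping));
    }
}
```

When the mapping has no |DEFAULT| entry, files with an unknown extension resolve to null, which matches the documented behaviour that such documents are not opened.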
-
endOfCollection
public boolean endOfCollection()
Checks whether there are more documents in the collection.
Specified by:
endOfCollection in interface Collection
Returns:
true if there are no more documents in the collection, otherwise false.
-
reset
public void reset()
Starts again from the beginning of the collection.
Specified by:
reset in interface Collection
-
getDocCounter
public java.lang.String getDocCounter()
Returns the current document's identifier string.
Returns:
the identifier of the current document.
-
close
public void close()
Specified by:
close in interface java.lang.AutoCloseable
Specified by:
close in interface java.io.Closeable
-
getFileList
public java.util.List<java.lang.String> getFileList()
Returns the list of indexed files in the order they were indexed in.
-
addDirectoryListing
protected void addDirectoryListing()
Called when the current file is identified as a directory; adds the entire contents of the directory onto the list to be processed.
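Expanding a directory onto the work list can be sketched with java.io.File. The names DirectoryListingSketch and expandDirectory are hypothetical, and two details are assumptions rather than documented behaviour: entries are sorted by name, and they are pushed onto the front of the list so that a directory's contents are processed before the files that follow it.

```java
import java.io.File;
import java.util.Arrays;
import java.util.LinkedList;

class DirectoryListingSketch {
    /**
     * If path names a directory, pushes the paths of its entries onto the
     * front of workList and returns true; otherwise leaves the list alone.
     */
    static boolean expandDirectory(String path, LinkedList<String> workList) {
        File dir = new File(path);
        String[] names = dir.list(); // null if not a directory or on I/O error
        if (names == null) {
            return false;
        }
        Arrays.sort(names); // File.list() makes no ordering guarantee
        // Insert in reverse so the entries end up in sorted order at the front.
        for (int i = names.length - 1; i >= 0; i--) {
            workList.addFirst(new File(dir, names[i]).getPath());
        }
        return true;
    }

    public static void main(String[] args) {
        LinkedList<String> workList = new LinkedList<>();
        String start = args.length > 0 ? args[0] : ".";
        if (!expandDirectory(start, workList)) {
            workList.add(start); // a plain file is processed as-is
        }
        System.out.println(workList);
    }
}
```

Because subdirectories land back on the same list, a caller that honours the recurse setting can expand them when they are reached, without explicit recursion.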
-