Package org.terrier.indexing
Class SimpleFileCollection
java.lang.Object
    org.terrier.indexing.SimpleFileCollection
All Implemented Interfaces:
java.io.Closeable, java.lang.AutoCloseable, Collection
public class SimpleFileCollection extends java.lang.Object implements Collection
Implements a collection that can read arbitrary files on disk. It uses the file list given to it in the constructor, or it reads the file specified by the property collection.spec.
Properties:
- indexing.simplefilecollection.extensionsparsers - a comma-delimited list of tuples, in the form "extension:DocumentClass". For instance, one tuple could be "txt:FileDocument". The default is txt:FileDocument,text:FileDocument,tex:FileDocument,bib:FileDocument,pdf:PDFDocument,html:TaggedDocument,htm:TaggedDocument,xhtml:TaggedDocument,xml:TaggedDocument,doc:MSWordDocument,ppt:MSPowerpointDocument,xls:MSExcelDocument.
- indexing.simplefilecollection.defaultparser - the default parser for files with unknown extensions. If this property is empty, such documents will not be opened.
- indexing.simplefilecollection.recurse - whether directories should be opened to look for files.
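The tuple format and the default-parser fallback described above can be illustrated with a minimal, self-contained sketch. It is not Terrier's implementation: the class ExtensionMappingSketch and its methods parse and parserFor are hypothetical names, and the sketch matches extensions exactly as written, which may differ from how the library normalises them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of Terrier: handles "extension:DocumentClass" tuples. */
final class ExtensionMappingSketch {

    /** Splits a comma-delimited list of "extension:DocumentClass" tuples into a map. */
    static Map<String, String> parse(String spec) {
        Map<String, String> mapping = new LinkedHashMap<>();
        if (spec == null) {
            return mapping;
        }
        for (String tuple : spec.split(",")) {
            String[] parts = tuple.trim().split(":", 2);
            // Skip malformed tuples that lack either side of the colon.
            if (parts.length != 2 || parts[0].isEmpty() || parts[1].isEmpty()) {
                continue;
            }
            mapping.put(parts[0].trim(), parts[1].trim());
        }
        return mapping;
    }

    /**
     * Returns the parser name for a filename, falling back to defaultParser for
     * unknown extensions. A null result means the file would not be opened.
     */
    static String parserFor(String filename, Map<String, String> mapping, String defaultParser) {
        int dot = filename.lastIndexOf('.');
        String ext = dot >= 0 ? filename.substring(dot + 1) : "";
        String parser = mapping.get(ext);
        if (parser != null) {
            return parser;
        }
        return (defaultParser == null || defaultParser.isEmpty()) ? null : defaultParser;
    }

    public static void main(String[] args) {
        Map<String, String> m = parse("txt:FileDocument,pdf:PDFDocument,html:TaggedDocument");
        System.out.println(parserFor("report.pdf", m, ""));
        System.out.println(parserFor("notes.md", m, ""));
        System.out.println(parserFor("notes.md", m, "FileDocument"));
    }
}
```

With an empty default parser, the unknown extension yields null, which corresponds to the "will not be opened" case; setting a default parser name makes the same file eligible for parsing.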
- Author:
- Craig Macdonald & Vassilis Plachouras
-
Field Summary
Fields (Modifier and Type, Field: Description)
- protected java.lang.String currentDocno: overridden docno for the current document
- protected java.io.InputStream currentStream: The InputStream of the most recently opened document.
- static java.lang.String DEFAULT_MAPPING_PROPERTY: What to parse each file type with
- protected int DocCounter: The identifier of a document in the collection.
- protected java.util.Map<java.lang.String,java.lang.Class<? extends Document>> extension_DocumentClass: Maps filename extensions to Document classes.
- protected java.util.LinkedList<java.lang.String> FileList: The list of files to index.
- protected java.util.List<java.lang.String> firstList: Contains the list of files first handed to the SimpleFileCollection, allowing the SimpleFileCollection instance to be simply reset.
- protected java.util.List<java.lang.String> indexedFiles: This is filled during traversal, so document IDs can be matched with filenames
- protected static org.slf4j.Logger logger
- static java.lang.String NAMESPACE_DOCUMENTS: The default namespace for all parsers to be loaded from.
- protected boolean Recurse: Whether directories should be recursed into by this class
- protected java.lang.String thisFilename: The filename of the current file we are processing.
- protected Tokeniser tokeniser
-
Constructor Summary
Constructors (Constructor: Description)
- SimpleFileCollection(): A default constructor that uses the files to be processed by this collection, as specified by the property collection.spec
- SimpleFileCollection(java.lang.String addressCollectionFilename): Creates an instance of the class.
- SimpleFileCollection(java.lang.String addressCollectionFilename, java.lang.String ignored1, java.lang.String ignored2, java.lang.String ignored3): additional constructors required by TRECIndexing
- SimpleFileCollection(java.util.List<java.lang.String> filelist, boolean recurse): Constructs an instance of the class with the given list of files.
- SimpleFileCollection(java.util.List<java.lang.String> filelist, java.lang.String ignored1, java.lang.String ignored2, java.lang.String ignored3): additional constructors required by TRECIndexing
-
Method Summary
Methods (Modifier and Type, Method: Description)
- protected void addDirectoryListing(): Called when thisFile is identified as a directory, this adds the entire contents of the directory onto the list to be processed.
- void close()
- protected void createExtensionDocumentMapping(): Parses the properties indexing.simplefilecollection.extensionsparsers and indexing.simplefilecollection.defaultparser and attempts to load all the mentioned classes, in a hashtable mapping filename extension to their respective parsers.
- boolean endOfCollection(): Checks whether there are more documents in the collection.
- java.lang.String getDocCounter(): Returns the current document's identifier string.
- Document getDocument(): Return the current document in the collection.
- java.util.List<java.lang.String> getFileList(): Returns the list of indexed files in the order they were indexed in.
- boolean hasNext(): Check whether there is a next document in the collection to be processed
- protected Document makeDocument(java.lang.String Filename, java.io.InputStream in): Given the opened stream in for the file Filename, work out which parser to try, and instantiate it.
- Document next(): Move onto the next document in the collection to be processed.
- boolean nextDocument(): Move onto the next document in the collection to be processed.
- void remove(): This is unsupported by this Collection implementation; every invocation throws UnsupportedOperationException.
- void reset(): Starts again from the beginning of the collection.
-
Field Detail
-
logger
protected static final org.slf4j.Logger logger
-
NAMESPACE_DOCUMENTS
public static final java.lang.String NAMESPACE_DOCUMENTS
The default namespace for all parsers to be loaded from. Only used if the class name specified does not contain any periods ('.').
See Also:
- Constant Field Values
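The rule that the namespace applies only to class names without a period can be sketched as follows. This is an illustration under stated assumptions, not Terrier code: ParserNameSketch, resolve and ASSUMED_NAMESPACE are hypothetical, and the placeholder namespace value is not the value of NAMESPACE_DOCUMENTS, which is not given on this page.

```java
/** Hypothetical helper, not part of Terrier: resolves short parser class names. */
final class ParserNameSketch {

    /** Placeholder namespace for illustration only; the real constant value is not shown here. */
    static final String ASSUMED_NAMESPACE = "com.example.documents.";

    /** Prefixes the namespace only when the class name contains no period. */
    static String resolve(String className, String namespace) {
        return className.indexOf('.') >= 0 ? className : namespace + className;
    }

    public static void main(String[] args) {
        // A short name is qualified with the namespace.
        System.out.println(resolve("FileDocument", ASSUMED_NAMESPACE));
        // A name that already contains a period is left untouched.
        System.out.println(resolve("org.example.MyDocument", ASSUMED_NAMESPACE));
    }
}
```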
-
DEFAULT_MAPPING_PROPERTY
public static final java.lang.String DEFAULT_MAPPING_PROPERTY
What to parse each file type with.
See Also:
- Constant Field Values
-
FileList
protected java.util.LinkedList<java.lang.String> FileList
The list of files to index.
-
firstList
protected java.util.List<java.lang.String> firstList
Contains the list of files first handed to the SimpleFileCollection, allowing the SimpleFileCollection instance to be simply reset.
-
indexedFiles
protected java.util.List<java.lang.String> indexedFiles
This is filled during traversal, so document IDs can be matched with filenames
-
DocCounter
protected int DocCounter
The identifier of a document in the collection.
-
currentDocno
protected java.lang.String currentDocno
overridden docno for the current document
-
Recurse
protected boolean Recurse
Whether directories should be recursed into by this class
-
extension_DocumentClass
protected java.util.Map<java.lang.String,java.lang.Class<? extends Document>> extension_DocumentClass
Maps filename extensions to Document classes. The entry |DEFAULT| maps to the default document parser, specified by indexing.simplefilecollection.defaultparser
-
thisFilename
protected java.lang.String thisFilename
The filename of the current file we are processing.
-
currentStream
protected java.io.InputStream currentStream
The InputStream of the most recently opened document. This is used to ensure that files are closed once they have been finished reading.
-
tokeniser
protected Tokeniser tokeniser
-
Constructor Detail
-
SimpleFileCollection
public SimpleFileCollection(java.util.List<java.lang.String> filelist, boolean recurse)
Constructs an instance of the class with the given list of files.
Parameters:
filelist - List of the files to be processed by this collection.
recurse - whether directories should be recursed into.
-
SimpleFileCollection
public SimpleFileCollection()
A default constructor that uses the files to be processed by this collection, as specified by the property collection.spec
-
SimpleFileCollection
public SimpleFileCollection(java.util.List<java.lang.String> filelist, java.lang.String ignored1, java.lang.String ignored2, java.lang.String ignored3)
Additional constructor required by TRECIndexing.
-
SimpleFileCollection
public SimpleFileCollection(java.lang.String addressCollectionFilename, java.lang.String ignored1, java.lang.String ignored2, java.lang.String ignored3)
Additional constructor required by TRECIndexing.
-
SimpleFileCollection
public SimpleFileCollection(java.lang.String addressCollectionFilename)
Creates an instance of the class. The files to be processed are specified in the file with the given name.
Parameters:
addressCollectionFilename - String the name of the file that contains the list of files to be processed by this collection.
-
Method Detail
-
createExtensionDocumentMapping
protected void createExtensionDocumentMapping()
Parses the properties indexing.simplefilecollection.extensionsparsers and indexing.simplefilecollection.defaultparser and attempts to load all the mentioned classes into a hashtable mapping each filename extension to its parser. If indexing.simplefilecollection.defaultparser is set, that class will be used to attempt to parse documents for which no explicit parser is set.
-
hasNext
public boolean hasNext()
Checks whether there is a next document in the collection to be processed.
Returns:
- true if there is a next document
-
next
public Document next()
Moves on to the next document in the collection to be processed.
Returns:
- the next document
-
remove
public void remove()
This is unsupported by this Collection implementation; every invocation throws UnsupportedOperationException.
-
nextDocument
public boolean nextDocument()
Moves on to the next document in the collection to be processed.
Specified by:
nextDocument in interface Collection
Returns:
- boolean true if there are more documents in the collection, otherwise false.
-
getDocument
public Document getDocument()
Returns the current document in the collection.
Specified by:
getDocument in interface Collection
Returns:
- Document the next document object from the collection.
-
makeDocument
protected Document makeDocument(java.lang.String Filename, java.io.InputStream in)
Given the opened stream in for the file Filename, works out which parser to try and instantiates it. If you wish to use a different constructor for opening documents, override this method in a subclass.
Parameters:
Filename - the filename of the currently open document
in - the stream of the currently open document
Returns:
- Document object to parse the document, or null if no suitable parser exists.
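The selection-and-instantiation step can be illustrated with a small reflective sketch. It rests on several assumptions and is not Terrier's implementation: Doc and PlainDoc are hypothetical stand-ins for Document and a parser class, and the single-argument InputStream constructor is an illustrative choice rather than the constructor the library calls. Only the |DEFAULT| key and the null return for a missing parser are taken from this page.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch, not Terrier code: pick a parser class by extension and instantiate it. */
final class MakeDocumentSketch {

    /** Stand-in for a document parser type. */
    public interface Doc { }

    /** Stand-in parser whose public constructor takes the open stream. */
    public static final class PlainDoc implements Doc {
        final InputStream in;
        public PlainDoc(InputStream in) { this.in = in; }
    }

    /** Key under which the default parser is stored, as described for extension_DocumentClass. */
    static final String DEFAULT_KEY = "|DEFAULT|";

    static Doc makeDocument(String filename, InputStream in,
                            Map<String, Class<? extends Doc>> byExtension) {
        int dot = filename.lastIndexOf('.');
        String ext = dot >= 0 ? filename.substring(dot + 1) : "";
        Class<? extends Doc> cls = byExtension.get(ext);
        if (cls == null) {
            cls = byExtension.get(DEFAULT_KEY); // fall back to the default parser, if any
        }
        if (cls == null) {
            return null; // no suitable parser exists
        }
        try {
            // Assumed constructor shape: one InputStream argument.
            return cls.getConstructor(InputStream.class).newInstance(in);
        } catch (ReflectiveOperationException e) {
            return null; // the class could not be instantiated
        }
    }

    public static void main(String[] args) {
        Map<String, Class<? extends Doc>> mapping = new HashMap<>();
        mapping.put("txt", PlainDoc.class);
        InputStream in = new ByteArrayInputStream(new byte[0]);
        System.out.println(makeDocument("a.txt", in, mapping) != null);
        System.out.println(makeDocument("a.bin", in, mapping) == null);
    }
}
```

A subclass that needs a different constructor would replace only the reflective call, which is the customisation point the description above refers to.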
-
endOfCollection
public boolean endOfCollection()
Checks whether there are more documents in the collection.
Specified by:
endOfCollection in interface Collection
Returns:
- boolean true if there are no more documents in the collection, otherwise it returns false.
-
reset
public void reset()
Starts again from the beginning of the collection.
Specified by:
reset in interface Collection
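How nextDocument, getDocument, endOfCollection and reset fit together can be shown with an in-memory stand-in. TraversalSketch is a hypothetical class, not Terrier code: it treats filenames as the "documents", assumes nextDocument returns true when it has advanced to a document, and assumes reset also clears the record of indexed files. The field names only echo firstList, FileList and indexedFiles from the field summary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

/** Hypothetical in-memory stand-in, not Terrier code, mirroring the traversal methods. */
final class TraversalSketch {
    private final List<String> firstList;        // the list originally handed in, kept for reset()
    private final LinkedList<String> fileList;   // the files still to be processed
    private final List<String> indexedFiles = new ArrayList<>(); // filled during traversal
    private String current;

    TraversalSketch(List<String> files) {
        firstList = new ArrayList<>(files);
        fileList = new LinkedList<>(files);
    }

    /** Advances to the next entry; returns false once nothing is left. */
    boolean nextDocument() {
        if (fileList.isEmpty()) {
            current = null;
            return false;
        }
        current = fileList.removeFirst();
        indexedFiles.add(current);
        return true;
    }

    /** Returns the entry that the last successful nextDocument() call moved to. */
    String getDocument() { return current; }

    /** True when no entries remain to be processed. */
    boolean endOfCollection() { return fileList.isEmpty(); }

    /** Restores the working list from the original list. */
    void reset() {
        fileList.clear();
        fileList.addAll(firstList);
        indexedFiles.clear();
        current = null;
    }

    List<String> getFileList() { return indexedFiles; }

    public static void main(String[] args) {
        TraversalSketch c = new TraversalSketch(Arrays.asList("a.txt", "b.pdf"));
        while (c.nextDocument()) {
            System.out.println(c.getDocument());
        }
        c.reset(); // traversal can now start again from the first entry
        System.out.println(c.endOfCollection());
    }
}
```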
-
getDocCounter
public java.lang.String getDocCounter()
Returns the current document's identifier string.
Returns:
- String the identifier of the current document.
-
close
public void close()
Specified by:
close in interface java.lang.AutoCloseable
Specified by:
close in interface java.io.Closeable
-
getFileList
public java.util.List<java.lang.String> getFileList()
Returns the list of indexed files in the order they were indexed in.
-
addDirectoryListing
protected void addDirectoryListing()
Called when thisFile is identified as a directory, this adds the entire contents of the directory onto the list to be processed.
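Expanding a directory onto the list of files to be processed can be sketched with java.io.File. This is a minimal illustration, not Terrier's implementation: DirectoryListingSketch and collectFiles are hypothetical names, entries are appended in sorted order as an arbitrary choice, and directories are simply skipped when the recurse flag is false.

```java
import java.io.File;
import java.util.Arrays;
import java.util.LinkedList;

/** Hypothetical sketch, not Terrier code: expand directories on a work queue. */
final class DirectoryListingSketch {

    /** Appends every entry of dir to the end of the queue, in sorted order. */
    static void addDirectoryListing(File dir, LinkedList<String> queue) {
        File[] entries = dir.listFiles();
        if (entries == null) {
            return; // not a directory, or an I/O error occurred
        }
        Arrays.sort(entries);
        for (File entry : entries) {
            queue.add(entry.getPath());
        }
    }

    /** Drains the queue and returns the regular files; directories are expanded only if recurse is true. */
    static LinkedList<String> collectFiles(LinkedList<String> queue, boolean recurse) {
        LinkedList<String> files = new LinkedList<>();
        while (!queue.isEmpty()) {
            File f = new File(queue.removeFirst());
            if (f.isDirectory()) {
                if (recurse) {
                    addDirectoryListing(f, queue);
                }
            } else if (f.isFile()) {
                files.add(f.getPath());
            }
        }
        return files;
    }

    public static void main(String[] args) {
        // Treat the command-line arguments as the initial file list.
        LinkedList<String> queue = new LinkedList<>(Arrays.asList(args));
        for (String path : collectFiles(queue, true)) {
            System.out.println(path);
        }
    }
}
```

Because newly listed entries go to the end of the same queue, nested directories are handled by the same loop without explicit recursion.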
-