org.terrier.indexing
Class SimpleFileCollection

java.lang.Object
  extended by org.terrier.indexing.SimpleFileCollection
All Implemented Interfaces:
java.io.Closeable, Collection

public class SimpleFileCollection
extends java.lang.Object
implements Collection

Implements a collection that can read arbitrary files on disk. It uses the file list passed to its constructor or, if none is given, reads the list from the file named by the property collection.spec.

Author:
Craig Macdonald & Vassilis Plachouras

Field Summary
protected  java.io.InputStream currentStream
          The InputStream of the most recently opened document.
protected  int Docid
          The identifier of a document in the collection.
protected  java.util.Map<java.lang.String,java.lang.Class<? extends Document>> extension_DocumentClass
          Maps filename extensions to Document classes.
protected  java.util.LinkedList<java.lang.String> FileList
          The list of files to index.
protected  java.util.List<java.lang.String> firstList
          Contains the list of files first handed to the SimpleFileCollection, allowing the SimpleFileCollection instance to be simply reset.
protected  java.util.List<java.lang.String> indexedFiles
          This is filled during traversal, so document IDs can be matched with filenames
protected static org.apache.log4j.Logger logger
           
static java.lang.String NAMESPACE_DOCUMENTS
          The default namespace for all parsers to be loaded from.
protected  boolean Recurse
          Whether directories should be recursed into by this class
protected  java.lang.String thisFilename
          The filename of the current file we are processing.
protected  Tokeniser tokeniser
           
 
Constructor Summary
SimpleFileCollection()
          Default constructor. Processes the files listed in the file named by the property collection.spec.
SimpleFileCollection(java.util.List<java.lang.String> filelist, boolean recurse)
          Constructs an instance of the class with the given list of files.
SimpleFileCollection(java.lang.String addressCollectionFilename)
          Creates an instance of the class.
 
Method Summary
protected  void addDirectoryListing()
          Called when the current file is identified as a directory. Adds the entire contents of that directory to the list of files to be processed.
 void close()
           
protected  void createExtensionDocumentMapping()
          Parses the properties indexing.simplefilecollection.extensionsparsers and indexing.simplefilecollection.defaultparser and attempts to load all the classes they mention, building a map from filename extension to parser class.
 boolean endOfCollection()
          Checks whether the end of the collection has been reached.
 java.lang.String getDocid()
          Returns the current document's identifier string.
 Document getDocument()
          Return the current document in the collection.
 java.util.List<java.lang.String> getFileList()
          Returns the list of indexed files, in the order in which they were indexed.
 boolean hasNext()
          Check whether there is a next document in the collection to be processed
static void main(java.lang.String[] args)
          Simple test case.
protected  Document makeDocument(java.lang.String Filename, java.io.InputStream in)
          Given the open stream in for the file Filename, works out which parser to try and instantiates it.
 Document next()
          Move onto the next document in the collection to be processed.
 boolean nextDocument()
          Move onto the next document in the collection to be processed.
 void remove()
          Unsupported by this Collection implementation; every call throws UnsupportedOperationException.
 void reset()
          Starts again from the beginning of the collection.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

logger

protected static final org.apache.log4j.Logger logger

NAMESPACE_DOCUMENTS

public static final java.lang.String NAMESPACE_DOCUMENTS
The default namespace for all parsers to be loaded from. Only used if the class name specified does not contain any periods ('.')

See Also:
Constant Field Values
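
The qualification rule described for NAMESPACE_DOCUMENTS can be sketched as follows. This is a minimal illustration, not Terrier's code: the class ParserNameSketch and the method qualify are hypothetical, and the prefix value is an assumption, since this page does not state the constant's actual value.

```java
class ParserNameSketch {
    // Assumed value; the real one is listed under "Constant Field Values".
    static final String ASSUMED_NAMESPACE = "org.terrier.indexing.";

    // Prefix the namespace only when the class name contains no period.
    static String qualify(String className) {
        return className.indexOf('.') == -1
                ? ASSUMED_NAMESPACE + className
                : className;
    }

    public static void main(String[] args) {
        System.out.println(qualify("FileDocument"));
        System.out.println(qualify("com.example.MyDocument"));
    }
}
```

A name that already contains a period is treated as fully qualified and returned unchanged.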

FileList

protected java.util.LinkedList<java.lang.String> FileList
The list of files to index.


firstList

protected java.util.List<java.lang.String> firstList
Contains the list of files first handed to the SimpleFileCollection, allowing the SimpleFileCollection instance to be simply reset.


indexedFiles

protected java.util.List<java.lang.String> indexedFiles
This is filled during traversal, so document IDs can be matched with filenames


Docid

protected int Docid
The identifier of a document in the collection.


Recurse

protected boolean Recurse
Whether directories should be recursed into by this class


extension_DocumentClass

protected java.util.Map<java.lang.String,java.lang.Class<? extends Document>> extension_DocumentClass
Maps filename extensions to Document classes. The entry |DEFAULT| maps to the default document parser, specified by indexing.simplefilecollection.defaultparser


thisFilename

protected java.lang.String thisFilename
The filename of the current file we are processing.


currentStream

protected java.io.InputStream currentStream
The InputStream of the most recently opened document. This is used to ensure that files are closed once they have been finished reading.
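
One way to read the purpose of currentStream is that the previous document's stream is closed before the next one is opened. The sketch below illustrates that pattern only; StreamRotationSketch, switchTo and closeCurrent are hypothetical names, and the real class may handle errors differently.

```java
import java.io.IOException;
import java.io.InputStream;

class StreamRotationSketch {
    private InputStream currentStream;

    // Closes the previously opened stream, then remembers the new one.
    InputStream switchTo(InputStream next) {
        closeCurrent();
        currentStream = next;
        return next;
    }

    // Closes the remembered stream, if any, and forgets it.
    void closeCurrent() {
        if (currentStream == null) {
            return;
        }
        try {
            currentStream.close();
        } catch (IOException e) {
            // Sketch only: a real implementation would probably log this.
        }
        currentStream = null;
    }
}
```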


tokeniser

protected Tokeniser tokeniser

Constructor Detail

SimpleFileCollection

public SimpleFileCollection(java.util.List<java.lang.String> filelist,
                            boolean recurse)
Constructs an instance of the class with the given list of files.

Parameters:
filelist - the list of files to be processed by this collection.
recurse - whether directories should be recursed into.

SimpleFileCollection

public SimpleFileCollection()
Default constructor. Processes the files listed in the file named by the property collection.spec.


SimpleFileCollection

public SimpleFileCollection(java.lang.String addressCollectionFilename)
Creates an instance of the class. The files to be processed are specified in the file with the given name.

Parameters:
addressCollectionFilename - String the name of the file that contains the list of files to be processed by this collection.

Method Detail

createExtensionDocumentMapping

protected void createExtensionDocumentMapping()
Parses the properties indexing.simplefilecollection.extensionsparsers and indexing.simplefilecollection.defaultparser and attempts to load all the classes they mention, building a map from filename extension to parser class. If indexing.simplefilecollection.defaultparser is set, that class is used to attempt to parse documents for which no explicit parser is set.
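
The following sketch shows how such a mapping might be built from the two property values. It rests on assumptions not confirmed by this page: the extensions property is taken to be a comma-separated list of extension:ClassName pairs, and parser names are kept as strings rather than loaded as Class objects so that the example runs without Terrier on the classpath. ExtensionMappingSketch and parse are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

class ExtensionMappingSketch {
    // Key under which the default parser is stored (see extension_DocumentClass).
    static final String DEFAULT_KEY = "|DEFAULT|";

    // spec: assumed format "ext:Class,ext:Class"; defaultParser may be null or empty.
    static Map<String, String> parse(String spec, String defaultParser) {
        Map<String, String> mapping = new HashMap<>();
        if (spec != null) {
            for (String pair : spec.split(",")) {
                String[] parts = pair.trim().split(":", 2);
                // Ignore malformed entries instead of failing.
                if (parts.length == 2 && !parts[0].isEmpty() && !parts[1].isEmpty()) {
                    mapping.put(parts[0], parts[1]);
                }
            }
        }
        if (defaultParser != null && !defaultParser.isEmpty()) {
            mapping.put(DEFAULT_KEY, defaultParser);
        }
        return mapping;
    }

    public static void main(String[] args) {
        System.out.println(parse("txt:FileDocument, html:HTMLDocument", "FileDocument"));
    }
}
```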


hasNext

public boolean hasNext()
Check whether there is a next document in the collection to be processed

Returns:
has next

next

public Document next()
Move onto the next document in the collection to be processed.

Returns:
next document

remove

public void remove()
Unsupported by this Collection implementation; every call throws UnsupportedOperationException.
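
The hasNext, next and remove methods follow the familiar java.util.Iterator contract with removal disabled. A minimal stand-alone sketch of that contract, using strings in place of Document objects and the hypothetical class ReadOnlyIteratorSketch:

```java
import java.util.Iterator;
import java.util.List;

class ReadOnlyIteratorSketch implements Iterator<String> {
    private final Iterator<String> delegate;

    ReadOnlyIteratorSketch(List<String> files) {
        this.delegate = files.iterator();
    }

    @Override
    public boolean hasNext() {
        return delegate.hasNext();
    }

    @Override
    public String next() {
        return delegate.next();
    }

    // Removal is never allowed, matching the documented behaviour of remove().
    @Override
    public void remove() {
        throw new UnsupportedOperationException();
    }
}
```

Callers should therefore treat the collection as read-only while iterating.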


nextDocument

public boolean nextDocument()
Move onto the next document in the collection to be processed.

Specified by:
nextDocument in interface Collection
Returns:
boolean true if there are more documents in the collection, otherwise return false.

getDocument

public Document getDocument()
Return the current document in the collection.

Specified by:
getDocument in interface Collection
Returns:
Document the current document object from the collection.
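
A typical traversal calls nextDocument() to advance and getDocument() to fetch the document just reached. The sketch below models that loop with stand-ins: MiniCollection, ListBacked and drain are hypothetical, documents are plain strings, and nextDocument() is assumed to return true when it has moved onto a document.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class TraversalSketch {
    // Hypothetical stand-in for the traversal methods of Collection.
    interface MiniCollection {
        boolean nextDocument();
        String getDocument();
        boolean endOfCollection();
    }

    static class ListBacked implements MiniCollection {
        private final LinkedList<String> pending;
        private String current;

        ListBacked(List<String> files) {
            pending = new LinkedList<>(files);
        }

        public boolean nextDocument() {
            if (pending.isEmpty()) {
                current = null;
                return false;
            }
            current = pending.removeFirst();
            return true;
        }

        public String getDocument() {
            return current;
        }

        public boolean endOfCollection() {
            return pending.isEmpty();
        }
    }

    // Visits every document once and returns them in traversal order.
    static List<String> drain(MiniCollection c) {
        List<String> seen = new ArrayList<>();
        while (c.nextDocument()) {
            String doc = c.getDocument();
            if (doc == null) {
                continue; // no suitable parser, so skip this file
            }
            seen.add(doc);
        }
        return seen;
    }
}
```

The null check reflects the documented possibility that makeDocument returns null when no suitable parser exists.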

makeDocument

protected Document makeDocument(java.lang.String Filename,
                                java.io.InputStream in)
Given the open stream in for the file Filename, works out which parser to try and instantiates it. To use a different constructor for opening documents, override this method in a subclass.

Parameters:
Filename - the filename of the currently open document
in - The stream of the currently open document
Returns:
Document object to parse the document, or null if no suitable parser exists.
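
The parser choice can be sketched as a lookup by filename extension with a fallback to the |DEFAULT| entry. This is an illustration under assumptions: ParserSelectionSketch and selectParser are hypothetical, parsers are represented by name only, and the extension is taken to be the text after the last period in the file's name.

```java
import java.io.File;
import java.util.Map;

class ParserSelectionSketch {
    static final String DEFAULT_KEY = "|DEFAULT|";

    // Returns the parser name for the file, or null if no suitable parser exists.
    static String selectParser(String filename, Map<String, String> mapping) {
        // Look only at the last path component so periods in directories are ignored.
        String name = new File(filename).getName();
        int dot = name.lastIndexOf('.');
        String extension = dot >= 0 ? name.substring(dot + 1) : "";
        String parser = mapping.get(extension);
        return parser != null ? parser : mapping.get(DEFAULT_KEY);
    }
}
```

When neither the extension nor |DEFAULT| is mapped, the method returns null, mirroring the documented return value of makeDocument.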

endOfCollection

public boolean endOfCollection()
Checks whether the end of the collection has been reached.

Specified by:
endOfCollection in interface Collection
Returns:
boolean true if there are no more documents in the collection, otherwise it returns false.

reset

public void reset()
Starts again from the beginning of the collection.

Specified by:
reset in interface Collection
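
The fields firstList and FileList suggest a simple reset scheme: keep the original list untouched and rebuild the working list from it. The sketch below shows that idea only; ResettableQueueSketch and poll are hypothetical, and the real reset() may also reset other state such as the document counter.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class ResettableQueueSketch {
    private final List<String> firstList;  // files as first handed over
    private LinkedList<String> fileList;   // files still to be processed

    ResettableQueueSketch(List<String> files) {
        firstList = new ArrayList<>(files);
        fileList = new LinkedList<>(firstList);
    }

    // Returns the next file to process, or null when none remain.
    String poll() {
        return fileList.poll();
    }

    // Starts again from the beginning by copying the original list.
    void reset() {
        fileList = new LinkedList<>(firstList);
    }
}
```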

getDocid

public java.lang.String getDocid()
Returns the current document's identifier string.

Returns:
String the identifier of the current document.

close

public void close()
Specified by:
close in interface java.io.Closeable

getFileList

public java.util.List<java.lang.String> getFileList()
Returns the list of indexed files, in the order in which they were indexed.
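
Because the list preserves indexing order, a position in it can serve to match a document identifier with a filename. The sketch below assumes 0-based identifiers equal to list positions, which this page does not confirm; DocidLookupSketch, record and filenameFor are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class DocidLookupSketch {
    private final List<String> indexedFiles = new ArrayList<>();

    // Records a file as indexed and returns its assumed 0-based identifier.
    int record(String filename) {
        indexedFiles.add(filename);
        return indexedFiles.size() - 1;
    }

    // Looks up the filename for an identifier returned by record().
    String filenameFor(int docid) {
        return indexedFiles.get(docid);
    }

    // Read-only view of the files in the order they were recorded.
    List<String> getFileList() {
        return Collections.unmodifiableList(indexedFiles);
    }
}
```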


addDirectoryListing

protected void addDirectoryListing()
Called when the current file is identified as a directory. Adds the entire contents of that directory to the list of files to be processed.
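
Expanding a directory onto the pending list could look roughly like the sketch below. It is not the class's actual code: DirectoryListingSketch is hypothetical, and both the sorting of entries and appending them at the end of the list are assumptions made to keep the example deterministic.

```java
import java.io.File;
import java.util.Arrays;
import java.util.LinkedList;

class DirectoryListingSketch {
    // Appends the paths of all entries directly inside dir to pending.
    static void addDirectoryListing(File dir, LinkedList<String> pending) {
        File[] entries = dir.listFiles();
        if (entries == null) {
            return; // not a directory, or it could not be read
        }
        Arrays.sort(entries); // listFiles() guarantees no particular order
        for (File entry : entries) {
            pending.add(entry.getPath());
        }
    }
}
```

Subdirectories added this way would themselves be expanded when reached, which is how recursion can fall out of the same queue when Recurse is enabled.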


main

public static void main(java.lang.String[] args)
Simple test case. Pass the filename of a file that lists files to be processed to this test case.
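
The list file passed to main, like the one accepted by the single-String constructor, names the files to process. A minimal reader for such a file is sketched below, assuming one path per line with blank lines ignored; the exact format is not specified on this page, and FileListReaderSketch and readFileList are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class FileListReaderSketch {
    // Reads one path per line (assumed format), skipping blank lines.
    static List<String> readFileList(BufferedReader reader) {
        List<String> files = new ArrayList<>();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (!line.isEmpty()) {
                    files.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return files;
    }

    public static void main(String[] args) {
        // An in-memory reader stands in for the list file in this sketch.
        BufferedReader in = new BufferedReader(
                new StringReader("docs/a.txt\n\ndocs/b.html\n"));
        System.out.println(readFileList(in));
    }
}
```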



Terrier 3.5. Copyright © 2004-2011 University of Glasgow