java.lang.Object
  org.terrier.indexing.SimpleFileCollection
public class SimpleFileCollection
Implements a collection that can read arbitrary files on disk. It will use the file list given to it in the constructor, or it will read the file specified by the property collection.spec. Properties:
Field Summary | |
---|---|
protected java.io.InputStream |
currentStream
The InputStream of the most recently opened document. |
protected int |
Docid
The identifier of a document in the collection. |
protected java.util.Map<java.lang.String,java.lang.Class<? extends Document>> |
extension_DocumentClass
Maps filename extensions to Document classes. |
protected java.util.LinkedList<java.lang.String> |
FileList
The list of files to index. |
protected java.util.List<java.lang.String> |
firstList
Contains the list of files first handed to the SimpleFileCollection, allowing the SimpleFileCollection instance to be simply reset. |
protected java.util.List<java.lang.String> |
indexedFiles
This is filled during traversal, so that document IDs can be matched with filenames. |
protected static org.apache.log4j.Logger |
logger
|
static java.lang.String |
NAMESPACE_DOCUMENTS
The default namespace for all parsers to be loaded from. |
protected boolean |
Recurse
Whether directories should be recursed into by this class. |
protected java.lang.String |
thisFilename
The filename of the current file we are processing. |
protected Tokeniser |
tokeniser
|
Constructor Summary | |
---|---|
SimpleFileCollection()
A default constructor that reads the list of files to be processed by this collection from the file specified by the property collection.spec. |
|
SimpleFileCollection(java.util.List<java.lang.String> filelist,
boolean recurse)
Constructs an instance of the class with the given list of files. |
|
SimpleFileCollection(java.lang.String addressCollectionFilename)
Creates an instance of the class. |
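The constructors that take no file list rely on a file that names the files to process. A minimal, self-contained sketch of reading such a list follows. It assumes one filename per line with blank lines ignored, which this page does not confirm, and the class FileListSketch and method readFileList are hypothetical names that are not part of Terrier.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class FileListSketch {
    /**
     * Reads filenames from the given reader.
     * Assumed format: one filename per line, blank lines skipped.
     */
    static List<String> readFileList(BufferedReader reader) {
        List<String> files = new ArrayList<>();
        reader.lines()                      // wraps I/O errors in UncheckedIOException
              .map(String::trim)
              .filter(line -> !line.isEmpty())
              .forEach(files::add);
        return files;
    }

    public static void main(String[] args) {
        // An in-memory stand-in for the file named by collection.spec.
        BufferedReader in = new BufferedReader(
                new StringReader("docs/a.txt\n\ndocs/b.html\n"));
        System.out.println(readFileList(in));
    }
}
```

A list produced this way is the kind of value that could be handed to the SimpleFileCollection(List, boolean) constructor.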
Method Summary | |
---|---|
protected void |
addDirectoryListing()
Called when thisFile is identified as a directory. Adds the entire contents of the directory to the list of files to be processed. |
void |
close()
|
protected void |
createExtensionDocumentMapping()
Parses the properties indexing.simplefilecollection.extensionsparsers and indexing.simplefilecollection.defaultparser and attempts to load all the classes they mention into a hashtable that maps filename extensions to their respective parsers. |
boolean |
endOfCollection()
Checks whether there are more documents in the collection. |
java.lang.String |
getDocid()
Returns the current document's identifier string. |
Document |
getDocument()
Return the current document in the collection. |
java.util.List<java.lang.String> |
getFileList()
Returns the list of indexed files in the order they were indexed. |
boolean |
hasNext()
Checks whether there is a next document in the collection to be processed. |
static void |
main(java.lang.String[] args)
Simple test case. |
protected Document |
makeDocument(java.lang.String Filename,
java.io.InputStream in)
Given the opened stream in for the file Filename, works out which parser to try and instantiates it. |
Document |
next()
Move onto the next document in the collection to be processed. |
boolean |
nextDocument()
Move onto the next document in the collection to be processed. |
void |
remove()
This is unsupported by this Collection implementation; every invocation throws UnsupportedOperationException. |
void |
reset()
Starts again from the beginning of the collection. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
protected static final org.apache.log4j.Logger logger
public static final java.lang.String NAMESPACE_DOCUMENTS
protected java.util.LinkedList<java.lang.String> FileList
protected java.util.List<java.lang.String> firstList
protected java.util.List<java.lang.String> indexedFiles
protected int Docid
protected boolean Recurse
protected java.util.Map<java.lang.String,java.lang.Class<? extends Document>> extension_DocumentClass
protected java.lang.String thisFilename
protected java.io.InputStream currentStream
protected Tokeniser tokeniser
Constructor Detail |
---|
public SimpleFileCollection(java.util.List<java.lang.String> filelist, boolean recurse)
filelist
- ArrayList the files to be processed by this collection.
public SimpleFileCollection()
public SimpleFileCollection(java.lang.String addressCollectionFilename)
addressCollectionFilename
- String the name of the file that contains the list of files to be processed by this collection.
Method Detail |
---|
protected void createExtensionDocumentMapping()
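The mapping from filename extension to parser class can be illustrated with a small standalone sketch. The property syntax is not given on this page, so the "extension:ClassName" comma-separated format below is an assumption, as is resolving unqualified names against a namespace prefix in the spirit of NAMESPACE_DOCUMENTS. ExtensionMappingSketch and parse are hypothetical names, and JDK classes stand in for Document classes so the example runs without Terrier.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class ExtensionMappingSketch {
    /**
     * Parses a spec such as "txt:ArrayList,html:java.util.LinkedList"
     * (an assumed format) into an extension-to-class map. Names without
     * a dot are prefixed with the namespace; entries that are malformed
     * or whose class cannot be loaded are skipped.
     */
    static Map<String, Class<?>> parse(String spec, String namespace) {
        Map<String, Class<?>> mapping = new HashMap<>();
        for (String entry : spec.split("\\s*,\\s*")) {
            String[] parts = entry.trim().split(":", 2);
            if (parts.length != 2 || parts[0].isEmpty() || parts[1].isEmpty()) {
                continue; // malformed entry: no "extension:class" pair
            }
            String name = parts[1].trim();
            String className = name.contains(".") ? name : namespace + name;
            try {
                mapping.put(parts[0].trim().toLowerCase(Locale.ROOT),
                            Class.forName(className));
            } catch (ClassNotFoundException e) {
                // Class could not be loaded: leave this extension unmapped.
            }
        }
        return mapping;
    }
}
```

Skipping unloadable classes rather than failing is a design choice of this sketch; it mirrors the wording "attempts to load" but the real error handling may differ.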
public boolean hasNext()
public Document next()
public void remove()
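The hasNext(), next() and remove() trio follows the java.util.Iterator contract, with remove() deliberately unsupported. The sketch below shows that shape on a queue of filenames. It is an illustration only: ReadOnlyIteratorSketch is a hypothetical class, it yields filenames rather than Document objects, and it does not reproduce how SimpleFileCollection opens or skips files.

```java
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

class ReadOnlyIteratorSketch implements Iterator<String> {
    /** Files still waiting to be visited, in order. */
    private final LinkedList<String> pending;

    ReadOnlyIteratorSketch(List<String> files) {
        this.pending = new LinkedList<>(files); // copy, so the caller's list is untouched
    }

    @Override
    public boolean hasNext() {
        return !pending.isEmpty();
    }

    @Override
    public String next() {
        // removeFirst() throws NoSuchElementException when the queue is empty,
        // which is what the Iterator contract requires.
        return pending.removeFirst();
    }

    @Override
    public void remove() {
        throw new UnsupportedOperationException("remove() is not supported");
    }
}
```

Callers would typically loop with while (it.hasNext()) and never call remove().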
public boolean nextDocument()
nextDocument
in interface Collection
public Document getDocument()
getDocument
in interface Collection
protected Document makeDocument(java.lang.String Filename, java.io.InputStream in)
Filename
- the filename of the currently open document
in
- The stream of the currently open document
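Choosing a parser for a file amounts to a lookup keyed on the filename extension, with a fallback when the extension is unknown. A hedged sketch of that lookup follows. ParserLookupSketch, extensionOf and choose are hypothetical names, and treating the text after the last dot (lowercased) as the extension is an assumption rather than documented behaviour.

```java
import java.util.Locale;
import java.util.Map;

class ParserLookupSketch {
    /** Returns the lowercased text after the last dot of the final path segment, or "" if none. */
    static String extensionOf(String filename) {
        int slash = Math.max(filename.lastIndexOf('/'), filename.lastIndexOf('\\'));
        int dot = filename.lastIndexOf('.');
        if (dot <= slash || dot == filename.length() - 1) {
            return ""; // no dot in the final segment, or a trailing dot
        }
        return filename.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Picks the entry registered for the file's extension, else the fallback. */
    static <T> T choose(Map<String, T> mapping, T fallback, String filename) {
        return mapping.getOrDefault(extensionOf(filename), fallback);
    }
}
```

In this sketch the map values could be the classes held in extension_DocumentClass, and the fallback the parser named by indexing.simplefilecollection.defaultparser.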
public boolean endOfCollection()
endOfCollection
in interface Collection
public void reset()
reset
in interface Collection
public java.lang.String getDocid()
public void close()
close
in interface java.io.Closeable
public java.util.List<java.lang.String> getFileList()
protected void addDirectoryListing()
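Expanding a directory onto the list of files to process, as described for addDirectoryListing(), can be sketched with java.io.File. The code below is an assumption-laden illustration: DirectoryListingSketch and expand are hypothetical names, children are sorted only to make the order deterministic, and skipping directories when recurse is false is a guess at how the Recurse field is used.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

class DirectoryListingSketch {
    /** Appends every entry of dir to the queue of paths still to be processed. */
    static void addDirectoryListing(File dir, LinkedList<String> queue) {
        File[] children = dir.listFiles();
        if (children == null) {
            return; // not a directory, or it could not be read
        }
        Arrays.sort(children); // deterministic order for the sketch
        for (File child : children) {
            queue.addLast(child.getPath());
        }
    }

    /** Walks the initial paths and returns the regular files found. */
    static List<String> expand(List<String> initial, boolean recurse) {
        LinkedList<String> queue = new LinkedList<>(initial);
        List<String> files = new ArrayList<>();
        while (!queue.isEmpty()) {
            File current = new File(queue.removeFirst());
            if (current.isDirectory()) {
                if (recurse) {
                    addDirectoryListing(current, queue);
                }
            } else if (current.isFile()) {
                files.add(current.getPath());
            }
        }
        return files;
    }
}
```

Because directory contents are appended to the same queue, nested directories are expanded in turn when recurse is true.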
public static void main(java.lang.String[] args)