public class WARC09Collection extends MultiDocumentFileCollection
The Document class to be used can be specified with the
trec.document.class property. The following pages were used to
construct the format of this object:
http://www.yr-bcn.es/webspam/datasets/uk2006-pages/excerpt.txt
http://archive-access.sourceforge.net/warc/warc_file_format.html
http://crawler.archive.org/apidocs/index.html?org/archive/io/arc/ARCWriter.html
http://crawler.archive.org/apidocs/org/archive/io/GzippedInputStream.html
Properties
trec.document.class: Document class to parse individual documents (defaults to TaggedDocument).

| Modifier and Type | Field and Description |
|---|---|
| protected String | currentDocno: properties for the current document |
| protected long | currentDocumentBlobLength |

Fields inherited from class MultiDocumentFileCollection: currentFilename, desiredEncoding, DocProperties, documentClass, documentsInThisFile, eoc, eof, FileNumber, FilesToProcess, forceUTF8, is, logger, SkipFile, tokeniser

| Constructor and Description |
|---|
| WARC09Collection() |
| WARC09Collection(InputStream input) |
| WARC09Collection(String CollectionSpecFilename) |

| Modifier and Type | Method and Description |
|---|---|
| Document | getDocument(): Get the document object representing the current document. |
| boolean | nextDocument(): Move the collection to the start of the next document. |
| protected String | readLine(): Read a line from the currently open InputStream (the field named is). |
| void | reset(): Resets the Collection iterator to the start of the collection. |
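
The nextDocument(), getDocument() and reset() methods summarised above suggest a simple iteration pattern: advance, fetch, repeat, and reset to start over. The following is a minimal sketch of that pattern, not Terrier code. The nested `Document` and `Collection` interfaces, the `InMemoryCollection` class and the `"docno"` property key are hypothetical stand-ins introduced so the example runs without Terrier or a WARC file; with the real library, a WARC09Collection instance would presumably take the place of `InMemoryCollection`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CollectionIterationSketch {

    // Hypothetical stand-in for a document; exposes only what this sketch needs.
    interface Document {
        String getProperty(String name);
    }

    // Hypothetical stand-in mirroring the three methods in the summary table.
    interface Collection {
        boolean nextDocument();
        Document getDocument();
        void reset();
    }

    // In-memory stand-in (an assumption, not part of Terrier).
    static class InMemoryCollection implements Collection {
        private final List<String> docnos;
        private int position = -1; // -1 means "before the first document"

        InMemoryCollection(List<String> docnos) {
            this.docnos = docnos;
        }

        // Move to the next document; false once the collection is exhausted.
        public boolean nextDocument() {
            if (position + 1 >= docnos.size()) {
                return false;
            }
            position++;
            return true;
        }

        // Return an object representing the current document.
        public Document getDocument() {
            final String docno = docnos.get(position);
            return name -> "docno".equals(name) ? docno : null;
        }

        // Rewind the iterator to the start of the collection.
        public void reset() {
            position = -1;
        }
    }

    // Walk the collection once and gather the "docno" of every document.
    static List<String> collectDocnos(Collection collection) {
        List<String> seen = new ArrayList<>();
        while (collection.nextDocument()) {
            seen.add(collection.getDocument().getProperty("docno"));
        }
        return seen;
    }

    public static void main(String[] args) {
        Collection c = new InMemoryCollection(Arrays.asList("doc-1", "doc-2"));
        System.out.println(collectDocnos(c));
        c.reset(); // without this, a second pass would find no documents
        System.out.println(collectDocnos(c));
    }
}
```

Calling getDocument() only after nextDocument() has returned true keeps the loop from reading past the end of the collection.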

Methods inherited from class MultiDocumentFileCollection: close, endOfCollection, extractCharset, hasNext, loadDocumentClass, next, openNewFile, openNextFile, readCollectionSpec

protected long currentDocumentBlobLength

protected String currentDocno

public WARC09Collection()

public WARC09Collection(InputStream input)

public WARC09Collection(String CollectionSpecFilename)

public Document getDocument()
getDocument in interface Collection
getDocument in class MultiDocumentFileCollection

public boolean nextDocument()
nextDocument in interface Collection
nextDocument in class MultiDocumentFileCollection

protected String readLine() throws IOException
Throws: IOException

public void reset()
reset in interface Collection
reset in class MultiDocumentFileCollection

Terrier Information Retrieval Platform 4.1. Copyright © 2004-2015, University of Glasgow