public class WARC09Collection extends MultiDocumentFileCollection
The Document class to be used can be specified with the trec.document.class property. The following links denote the pages that were used to construct the format of this object:
http://www.yr-bcn.es/webspam/datasets/uk2006-pages/excerpt.txt
http://archive-access.sourceforge.net/warc/warc_file_format.html
http://crawler.archive.org/apidocs/index.html?org/archive/io/arc/ARCWriter.html
http://crawler.archive.org/apidocs/org/archive/io/GzippedInputStream.html
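As an illustration of how a property such as trec.document.class can fall back to its documented default (TaggedDocument), the following is a minimal sketch using only java.util.Properties. It is not Terrier's own configuration code: the class DocumentClassPropertySketch, the helper resolveDocumentClass and the value MyDocument are hypothetical names introduced for this example.

```java
import java.util.Properties;

public class DocumentClassPropertySketch {
    // Property name and default value as stated in this page.
    static final String KEY = "trec.document.class";
    static final String DEFAULT_CLASS = "TaggedDocument";

    // Hypothetical helper: returns the configured Document class name,
    // or the documented default when the property is not set.
    static String resolveDocumentClass(Properties props) {
        return props.getProperty(KEY, DEFAULT_CLASS).trim();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Property unset: the default class name is used.
        System.out.println(resolveDocumentClass(props));

        // "MyDocument" is a made-up class name, used for illustration only.
        props.setProperty(KEY, "MyDocument");
        System.out.println(resolveDocumentClass(props));
    }
}
```

In a real deployment the value would normally come from the Terrier configuration rather than being set in code.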
Properties
trec.document.class: Document class to parse individual documents (defaults to TaggedDocument).

Modifier and Type | Field and Description |
---|---|
protected String | currentDocno: properties for the current document |
protected long | currentDocumentBlobLength |
Fields inherited from class MultiDocumentFileCollection: currentFilename, desiredEncoding, DocProperties, documentClass, documentsInThisFile, eoc, eof, FileNumber, FilesToProcess, forceUTF8, is, logger, SkipFile, tokeniser
Constructor and Description |
---|
WARC09Collection() |
WARC09Collection(InputStream input) |
WARC09Collection(String CollectionSpecFilename) |
Modifier and Type | Method and Description |
---|---|
Document | getDocument(): Get the document object representing the current document. |
boolean | nextDocument(): Move the collection to the start of the next document. |
protected String | readLine(): Read a line from the currently open InputStream (field is). |
void | reset(): Resets the Collection iterator to the start of the collection. |
Methods inherited from class MultiDocumentFileCollection: close, endOfCollection, extractCharset, hasNext, loadDocumentClass, next, openNewFile, openNextFile, readCollectionSpec
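The nextDocument(), getDocument() and reset() methods are typically used together in a loop. The sketch below shows that calling pattern against a small in-memory stand-in, so that it runs without Terrier on the classpath. SimpleCollection, InMemoryCollection and drain are hypothetical names introduced here, and documents are represented as plain strings rather than Document objects. The sketch also assumes that nextDocument() returns false once no further document is available.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CollectionLoopSketch {
    // Hypothetical stand-in mirroring the three methods summarised above.
    interface SimpleCollection {
        boolean nextDocument(); // advance; assumed false when exhausted
        String getDocument();   // the current document
        void reset();           // return to the start of the collection
    }

    // In-memory implementation used only to make the sketch runnable.
    static final class InMemoryCollection implements SimpleCollection {
        private final List<String> docs;
        private int pos = -1; // -1 means "before the first document"

        InMemoryCollection(List<String> docs) {
            this.docs = new ArrayList<>(docs);
        }

        public boolean nextDocument() {
            if (pos + 1 >= docs.size()) {
                return false;
            }
            pos++;
            return true;
        }

        public String getDocument() {
            return pos >= 0 ? docs.get(pos) : null;
        }

        public void reset() {
            pos = -1;
        }
    }

    // The usual loop: advance first, then fetch the current document.
    static List<String> drain(SimpleCollection collection) {
        List<String> seen = new ArrayList<>();
        while (collection.nextDocument()) {
            seen.add(collection.getDocument());
        }
        return seen;
    }

    public static void main(String[] args) {
        SimpleCollection c = new InMemoryCollection(Arrays.asList("doc-1", "doc-2"));
        System.out.println(drain(c));
        c.reset(); // after reset, the same documents can be read again
        System.out.println(drain(c));
    }
}
```

With the real class, the same loop shape would apply to a WARC09Collection instance, with getDocument() returning a Document.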
protected long currentDocumentBlobLength
protected String currentDocno
public WARC09Collection()
public WARC09Collection(InputStream input)
public WARC09Collection(String CollectionSpecFilename)
public Document getDocument()
getDocument in interface Collection
getDocument in class MultiDocumentFileCollection
public boolean nextDocument()
nextDocument in interface Collection
nextDocument in class MultiDocumentFileCollection
protected String readLine() throws IOException
Throws: IOException
public void reset()
reset in interface Collection
reset in class MultiDocumentFileCollection
Terrier Information Retrieval Platform 4.1. Copyright © 2004-2015, University of Glasgow