public class WARC09Collection extends MultiDocumentFileCollection
The Document class to be used can be specified with the
trec.document.class property. The following pages were used
to work out the format handled by this class:
http://www.yr-bcn.es/webspam/datasets/uk2006-pages/excerpt.txt
http://archive-access.sourceforge.net/warc/warc_file_format.html
http://crawler.archive.org/apidocs/index.html?org/archive/io/arc/ARCWriter.html
http://crawler.archive.org/apidocs/org/archive/io/GzippedInputStream.html
Properties
trec.document.class: Document class to parse individual documents (defaults to TaggedDocument).

| Modifier and Type | Field and Description |
|---|---|
| protected String | currentDocno: properties for the current document |
| protected long | currentDocumentBlobLength |
Fields inherited from class MultiDocumentFileCollection: currentFilename, desiredEncoding, DocProperties, documentClass, documentsInThisFile, eoc, eof, FileNumber, FilesToProcess, forceUTF8, is, logger, SkipFile, tokeniser

| Constructor and Description |
|---|
| WARC09Collection() |
| WARC09Collection(InputStream input) |
| WARC09Collection(List<String> files, String TagSet, String BlacklistSpecFilename, String ignored) |
| WARC09Collection(String CollectionSpecFilename) |
| WARC09Collection(String CollectionSpecFilename, String TagSet, String BlacklistSpecFilename, String ignored) |
| Modifier and Type | Method and Description |
|---|---|
| Document | getDocument(): Get the document object representing the current document. |
| boolean | nextDocument(): Move the collection to the start of the next document. |
| protected String | readLine(): Read a line from the currently open InputStream (field is). |
| void | reset(): Resets the Collection iterator to the start of the collection. |
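
The iteration contract in the method summary above (nextDocument() advances, getDocument() returns the current document, reset() rewinds) can be illustrated with the loop below. This is a minimal sketch, not Terrier code: DocumentSource and InMemorySource are hypothetical stand-ins that only mirror the three documented method signatures, so the example runs without Terrier on the classpath. It assumes that nextDocument() returns false once no documents remain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CollectionLoopSketch {

    /** Hypothetical stand-in mirroring the three documented methods. */
    interface DocumentSource<D> {
        boolean nextDocument(); // move to the next document; false when none is left (assumed)
        D getDocument();        // the document at the current position
        void reset();           // rewind to the start of the collection
    }

    /** Hypothetical in-memory source used only to make the sketch runnable. */
    static final class InMemorySource implements DocumentSource<String> {
        private final List<String> docs;
        private int position = -1; // -1 means "before the first document"

        InMemorySource(List<String> docs) {
            this.docs = docs;
        }

        @Override
        public boolean nextDocument() {
            if (position + 1 >= docs.size()) {
                return false;
            }
            position++;
            return true;
        }

        @Override
        public String getDocument() {
            // Only valid after nextDocument() has returned true.
            return docs.get(position);
        }

        @Override
        public void reset() {
            position = -1;
        }
    }

    /** Advance until nextDocument() reports false, collecting each document. */
    static <D> List<D> drain(DocumentSource<D> source) {
        List<D> seen = new ArrayList<>();
        while (source.nextDocument()) {
            seen.add(source.getDocument());
        }
        return seen;
    }

    public static void main(String[] args) {
        DocumentSource<String> source =
                new InMemorySource(Arrays.asList("doc-1", "doc-2"));
        System.out.println(drain(source));        // prints [doc-1, doc-2]
        source.reset();                           // rewind, then iterate again
        System.out.println(drain(source).size()); // prints 2
    }
}
```

With Terrier available, the same while loop could presumably be written directly against a WARC09Collection instance, with Document as the element type in place of String.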
Methods inherited from class MultiDocumentFileCollection: close, endOfCollection, extractCharset, hasNext, loadDocumentClass, next, openNewFile, openNextFile

protected long currentDocumentBlobLength
protected String currentDocno
public WARC09Collection()
public WARC09Collection(InputStream input)
public WARC09Collection(String CollectionSpecFilename)
public WARC09Collection(List<String> files, String TagSet, String BlacklistSpecFilename, String ignored)
public Document getDocument()
Specified by: getDocument in interface Collection
Overrides: getDocument in class MultiDocumentFileCollection
public boolean nextDocument()
Specified by: nextDocument in interface Collection
Overrides: nextDocument in class MultiDocumentFileCollection
protected String readLine() throws IOException
Throws: IOException
public void reset()
Specified by: reset in interface Collection
Overrides: reset in class MultiDocumentFileCollection

Terrier Information Retrieval Platform 5.1. Copyright © 2004-2019, University of Glasgow