public class WARC10Collection extends WARC018Collection
Fields inherited from class WARC018Collection
currentDocumentBlobLength, warc_crawldate_header, warc_docno_header, warc_url_header
Fields inherited from class MultiDocumentFileCollection
currentFilename, desiredEncoding, DocProperties, documentClass, documentsInThisFile, eoc, eof, FileNumber, FilesToProcess, forceUTF8, is, logger, SkipFile, tokeniser
Constructor and Description |
---|
WARC10Collection() |
WARC10Collection(InputStream input) |
WARC10Collection(List<String> files, String TagSet, String BlacklistSpecFilename, String ignored) |
WARC10Collection(String CollectionSpecFilename) |
WARC10Collection(String CollectionSpecFilename, String TagSet, String BlacklistSpecFilename, String ignored) |
Modifier and Type | Method and Description |
---|---|
boolean | nextDocument() Move the collection to the start of the next document. |
protected void | processRedirect(String source, String target) |
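Methods inherited from class WARC018Collection
getDocid, getDocument, parseHeaders, readLine
Methods inherited from class MultiDocumentFileCollection
close, endOfCollection, extractCharset, hasNext, loadDocumentClass, next, openNewFile, openNextFile, reset
Methods inherited from class Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Collection
endOfCollection, reset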
public WARC10Collection()
public WARC10Collection(InputStream input)
public WARC10Collection(String CollectionSpecFilename)
public WARC10Collection(String CollectionSpecFilename, String TagSet, String BlacklistSpecFilename, String ignored)
public WARC10Collection(List<String> files, String TagSet, String BlacklistSpecFilename, String ignored)
public boolean nextDocument()
Specified by: nextDocument in interface Collection
Overrides: nextDocument in class WARC018Collection
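The `nextDocument()` / `getDocument()` pair listed above suggests a simple read loop. The sketch below shows that pattern under stated assumptions: `DocSource` and `InMemorySource` are hypothetical stand-ins (not Terrier classes) so the example runs without Terrier or a WARC file, documents are modelled as plain strings rather than Terrier `Document` objects, and `nextDocument()` is assumed to return `false` once no documents remain, as its boolean return type implies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class NextDocumentLoopSketch {

    /** Hypothetical stand-in for the two Collection methods the loop relies on. */
    public interface DocSource {
        boolean nextDocument(); // advance to the next document; false when none is left
        String getDocument();   // assumption: a String here, not Terrier's Document type
    }

    /** Hypothetical in-memory source, used only so the sketch is runnable on its own. */
    public static final class InMemorySource implements DocSource {
        private final Iterator<String> remaining;
        private String current;

        public InMemorySource(List<String> docs) {
            this.remaining = docs.iterator();
        }

        @Override
        public boolean nextDocument() {
            if (!remaining.hasNext()) {
                current = null;
                return false;
            }
            current = remaining.next();
            return true;
        }

        @Override
        public String getDocument() {
            return current;
        }
    }

    /** Reads every document: advance first, then fetch the current one. */
    public static List<String> readAll(DocSource source) {
        List<String> out = new ArrayList<>();
        while (source.nextDocument()) {
            out.add(source.getDocument());
        }
        return out;
    }

    public static void main(String[] args) {
        DocSource source = new InMemorySource(Arrays.asList("doc-a", "doc-b", "doc-c"));
        List<String> docs = readAll(source);
        System.out.println(docs.size() + " documents read");
    }
}
```

With Terrier on the classpath, the same loop shape would presumably apply to a `WARC10Collection` built from one of the constructors listed above, followed by a call to the inherited `close()`.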
Terrier Information Retrieval Platform 5.1. Copyright © 2004-2019, University of Glasgow