public class WARC10Collection extends WARC018Collection
Fields inherited from class WARC018Collection
currentDocumentBlobLength, warc_crawldate_header, warc_docno_header, warc_url_header
Fields inherited from class MultiDocumentFileCollection
currentFilename, desiredEncoding, DocProperties, documentClass, documentsInThisFile, eoc, eof, FileNumber, FilesToProcess, forceUTF8, is, logger, SkipFile, tokeniser
Constructor and Description |
---|
WARC10Collection() |
WARC10Collection(InputStream input) |
WARC10Collection(String CollectionSpecFilename) |
Modifier and Type | Method and Description |
---|---|
boolean | nextDocument() Move the collection to the start of the next document. |
protected void | processRedirect(String source, String target) |
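Methods inherited from class WARC018Collection
getDocid, getDocument, parseHeaders, readLine
Methods inherited from class MultiDocumentFileCollection
close, endOfCollection, extractCharset, hasNext, loadDocumentClass, next, openNewFile, openNextFile, readCollectionSpec, reset
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Collection
endOfCollection, reset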
public WARC10Collection()
public WARC10Collection(InputStream input)
public WARC10Collection(String CollectionSpecFilename)
public boolean nextDocument()
Specified by: nextDocument in interface Collection
Overrides: nextDocument in class WARC018Collection
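The summary above lists nextDocument(), getDocument(), endOfCollection() and close(), which suggests the usual iteration pattern: advance, read the current document, repeat, then close. The following is a minimal sketch of that loop, not Terrier's own code. It assumes nextDocument() returns true while another document is available. DocumentSource and ListSource are hypothetical stand-ins introduced here so the example runs without Terrier on the classpath, and getDocument() is simplified to return a String rather than a Terrier Document object.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class NextDocumentLoopSketch {

    /** Hypothetical stand-in mirroring the method names listed on this page. */
    interface DocumentSource {
        boolean nextDocument();     // assumed: true if the source moved to another document
        String getDocument();       // simplification: the real method returns a Document
        boolean endOfCollection();
        void close();
    }

    /** In-memory stand-in; a real run would use a WARC10Collection instead. */
    static class ListSource implements DocumentSource {
        private final Iterator<String> it;
        private String current;
        private boolean closed;

        ListSource(List<String> docs) { this.it = docs.iterator(); }

        public boolean nextDocument() {
            if (!it.hasNext()) { current = null; return false; }
            current = it.next();
            return true;
        }
        public String getDocument() { return current; }
        public boolean endOfCollection() { return !it.hasNext(); }
        public void close() { closed = true; }
        boolean isClosed() { return closed; }
    }

    /** Walks the source with nextDocument() and counts the non-null documents. */
    static int countDocuments(DocumentSource source) {
        int count = 0;
        try {
            while (source.nextDocument()) {
                String doc = source.getDocument();
                if (doc != null) {
                    count++;
                }
            }
        } finally {
            source.close(); // release the underlying stream even if reading fails
        }
        return count;
    }

    public static void main(String[] args) {
        DocumentSource source = new ListSource(Arrays.asList("doc-a", "doc-b", "doc-c"));
        System.out.println(countDocuments(source)); // prints 3
    }
}
```

With Terrier available, the same loop shape would presumably be driven by an instance built from one of the three constructors listed above, for example new WARC10Collection(inputStream).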
Terrier Information Retrieval Platform 4.1. Copyright © 2004-2015, University of Glasgow