public class WARC10Collection extends WARC018Collection
Fields inherited from class WARC018Collection: currentDocumentBlobLength, warc_crawldate_header, warc_docno_header, warc_url_header

Fields inherited from class MultiDocumentFileCollection: currentFilename, desiredEncoding, DocProperties, documentClass, documentsInThisFile, eoc, eof, FileNumber, FilesToProcess, forceUTF8, is, logger, SkipFile, tokeniser

| Constructor and Description |
|---|
| WARC10Collection() |
| WARC10Collection(InputStream input) |
| WARC10Collection(List<String> files, String TagSet, String BlacklistSpecFilename, String ignored) |
| WARC10Collection(String CollectionSpecFilename) |
| WARC10Collection(String CollectionSpecFilename, String TagSet, String BlacklistSpecFilename, String ignored) |

| Modifier and Type | Method and Description |
|---|---|
| boolean | nextDocument() Move the collection to the start of the next document. |
| protected void | processRedirect(String source, String target) |
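
Since nextDocument() returns a boolean and moves the collection to the start of the next document, a caller would typically drive it from a while loop. The sketch below illustrates that loop shape only. It does not use Terrier: DocumentSource and InMemorySource are hypothetical stand-ins invented for this example, and the String return type given to getDocid() here is an assumption, not the documented signature.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class NextDocumentLoopSketch {

    /** Hypothetical stand-in for the two methods the loop relies on (not a Terrier type). */
    interface DocumentSource {
        boolean nextDocument(); // advance to the next document; false when none is left
        String getDocid();      // identifier of the current document (assumed signature)
    }

    /** In-memory stand-in so the sketch runs without Terrier or any WARC file. */
    static class InMemorySource implements DocumentSource {
        private final Iterator<String> ids;
        private String current;

        InMemorySource(List<String> docids) {
            this.ids = docids.iterator();
        }

        @Override
        public boolean nextDocument() {
            if (!ids.hasNext()) {
                current = null;
                return false;
            }
            current = ids.next();
            return true;
        }

        @Override
        public String getDocid() {
            return current;
        }
    }

    /** Drains a source with the while (nextDocument()) pattern and records each docid. */
    static List<String> collectDocids(DocumentSource source) {
        List<String> seen = new ArrayList<>();
        while (source.nextDocument()) {
            seen.add(source.getDocid());
        }
        return seen;
    }

    public static void main(String[] args) {
        DocumentSource source = new InMemorySource(Arrays.asList("doc-1", "doc-2", "doc-3"));
        System.out.println(collectDocids(source));
    }
}
```

With the real class, the same loop would presumably be written against a WARC10Collection instance built from one of the constructors listed above, with Terrier on the classpath.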
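
Methods inherited from class WARC018Collection: getDocid, getDocument, parseHeaders, readLine

Methods inherited from class MultiDocumentFileCollection: close, endOfCollection, extractCharset, hasNext, loadDocumentClass, next, openNewFile, openNextFile, reset

Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface Collection: endOfCollection, reset

public WARC10Collection()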
public WARC10Collection(InputStream input)
public WARC10Collection(String CollectionSpecFilename)
public WARC10Collection(List<String> files, String TagSet, String BlacklistSpecFilename, String ignored)
public WARC10Collection(String CollectionSpecFilename, String TagSet, String BlacklistSpecFilename, String ignored)
public boolean nextDocument()
Specified by: nextDocument in interface Collection
Overrides: nextDocument in class WARC018Collection

Terrier Information Retrieval Platform 5.1. Copyright © 2004-2019, University of Glasgow