public class WARC018Collection extends MultiDocumentFileCollection implements Collection
The Document class to be used can be specified with the trec.document.class property.
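As a rough illustration of how a property such as trec.document.class can select a class name with a fallback, the sketch below uses the standard java.util.Properties class as a stand-in for Terrier's own configuration mechanism. The class name DocumentClassPropertySketch, the helper documentClassName, and the value MyDocument are hypothetical; the default TaggedDocument is taken from the Properties entry on this page.

```java
import java.util.Properties;

public class DocumentClassPropertySketch {

    // Property name as documented on this page.
    static final String KEY = "trec.document.class";

    // Default named on this page; the unqualified form is kept as written there.
    static final String DEFAULT_CLASS = "TaggedDocument";

    /** Returns the configured Document class name, or the default when the property is unset. */
    public static String documentClassName(Properties props) {
        return props.getProperty(KEY, DEFAULT_CLASS).trim();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println(documentClassName(props)); // prints "TaggedDocument"

        // "MyDocument" is a hypothetical class name used only for illustration.
        props.setProperty(KEY, "MyDocument");
        System.out.println(documentClassName(props)); // prints "MyDocument"
    }
}
```

In a real Terrier installation the property would normally be set through Terrier's own configuration rather than a bare Properties object; this sketch only shows the lookup-with-default behaviour.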
Properties
trec.document.class: the Document class used to parse individual documents (defaults to TaggedDocument).
Modifier and Type | Field and Description |
---|---|
protected long | currentDocumentBlobLength: the length of the blob containing the document data |
protected String | warc_crawldate_header: the header that holds the crawldate document metadata |
protected String | warc_docno_header: the header that holds the docno document metadata |
protected String | warc_url_header: the header that holds the url document metadata |
Fields inherited from class MultiDocumentFileCollection: currentFilename, desiredEncoding, DocProperties, documentClass, documentsInThisFile, eoc, eof, FileNumber, FilesToProcess, forceUTF8, is, logger, SkipFile, tokeniser
Constructor and Description |
---|
WARC018Collection() |
WARC018Collection(InputStream input) |
WARC018Collection(List<String> files) |
WARC018Collection(List<String> files, String TagSet, String BlacklistSpecFilename, String ignored) |
WARC018Collection(String CollectionSpecFilename) |
WARC018Collection(String CollectionSpecFilename, String TagSet, String BlacklistSpecFilename, String ignored) |
Modifier and Type | Method and Description |
---|---|
String | getDocid(): Get the String document identifier of the current document. |
Document | getDocument(): Get the document object representing the current document. |
boolean | nextDocument(): Move the collection to the start of the next document. |
protected int | parseHeaders(boolean requireContentLength) |
protected String | readLine(): Read a line from the currently open InputStream (the is field). |
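Methods inherited from class MultiDocumentFileCollection: close, endOfCollection, extractCharset, hasNext, loadDocumentClass, next, openNewFile, openNextFile, reset
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Collection: endOfCollection, reset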
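The three public methods summarised above (nextDocument(), getDocid(), getDocument()) suggest a simple cursor-style loop. The sketch below shows that pattern against a small in-memory stand-in so it can run without Terrier on the classpath. The DocSource interface, InMemorySource, and the sample data are hypothetical and are not Terrier types; in particular, getDocument() returns a String here, whereas this class returns a Document object.

```java
import java.util.ArrayList;
import java.util.List;

public class CollectionLoopSketch {

    /** Hypothetical stand-in for the three public methods above; NOT Terrier's Collection interface. */
    public interface DocSource {
        boolean nextDocument();
        String getDocid();
        String getDocument(); // assumption: a String body instead of a Terrier Document
    }

    /** In-memory stand-in; each entry is {docid, body}. */
    public static class InMemorySource implements DocSource {
        private final List<String[]> docs;
        private int pos = -1; // positioned before the first document

        public InMemorySource(List<String[]> docs) { this.docs = docs; }

        @Override public boolean nextDocument() {
            if (pos + 1 >= docs.size()) return false; // no more documents
            pos++;
            return true;
        }
        @Override public String getDocid() { return docs.get(pos)[0]; }
        @Override public String getDocument() { return docs.get(pos)[1]; }
    }

    /** Advances through the source and collects the identifier of every document. */
    public static List<String> collectDocids(DocSource source) {
        List<String> ids = new ArrayList<>();
        while (source.nextDocument()) {
            ids.add(source.getDocid());
        }
        return ids;
    }

    /** Builds a two-document sample source (illustrative data only). */
    public static DocSource sample() {
        List<String[]> docs = new ArrayList<>();
        docs.add(new String[] {"doc-1", "first body"});
        docs.add(new String[] {"doc-2", "second body"});
        return new InMemorySource(docs);
    }

    public static void main(String[] args) {
        System.out.println(collectDocids(sample())); // prints "[doc-1, doc-2]"
    }
}
```

With the real class, the same loop shape would presumably apply after constructing a WARC018Collection from one of the constructors listed above, with getDocument() yielding a Document rather than a String.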
protected long currentDocumentBlobLength
protected final String warc_docno_header
protected final String warc_url_header
protected final String warc_crawldate_header
public WARC018Collection()
public WARC018Collection(InputStream input)
public WARC018Collection(String CollectionSpecFilename)
public WARC018Collection(List<String> files, String TagSet, String BlacklistSpecFilename, String ignored)
public boolean nextDocument()
nextDocument in interface Collection
nextDocument in class MultiDocumentFileCollection
public String getDocid()
public Document getDocument()
getDocument in interface Collection
getDocument in class MultiDocumentFileCollection
protected int parseHeaders(boolean requireContentLength) throws IOException
Throws: IOException
protected String readLine() throws IOException
Throws: IOException
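The page does not describe what parseHeaders(boolean requireContentLength) does internally or what its int return value means. The sketch below is one plausible reading under stated assumptions, not Terrier's implementation: it reads "Name: value" lines up to the first blank line, stores them under lower-cased names, and returns the Content-Length value, failing when the header is required but missing. The class WarcHeaderSketch and the sample header names (WARC-TREC-ID, WARC-Target-URI) are assumptions modelled on ClueWeb09-style records, and may differ from the values held in warc_docno_header and warc_url_header.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class WarcHeaderSketch {

    /**
     * Reads header lines until a blank line or end of input, filling `headers`
     * with lower-cased names. Returns the Content-Length value, or -1 when the
     * header is absent and not required.
     */
    public static long parseHeaders(BufferedReader in, Map<String, String> headers,
                                    boolean requireContentLength) throws IOException {
        String line;
        while ((line = in.readLine()) != null && !line.trim().isEmpty()) {
            int colon = line.indexOf(':');
            if (colon < 0) continue; // skips lines without a colon, e.g. a "WARC/0.18" version line
            String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            headers.put(name, line.substring(colon + 1).trim());
        }
        String length = headers.get("content-length");
        if (length == null) {
            if (requireContentLength) {
                throw new IOException("missing Content-Length header");
            }
            return -1L;
        }
        return Long.parseLong(length);
    }

    /** Convenience overload for in-memory records; wraps IOException as unchecked. */
    public static long parseHeaders(String record, Map<String, String> headers,
                                    boolean requireContentLength) {
        try {
            return parseHeaders(new BufferedReader(new StringReader(record)),
                                headers, requireContentLength);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Illustrative record; header names and values are assumptions.
        String record = "WARC/0.18\n"
                + "WARC-TREC-ID: doc-1\n"
                + "WARC-Target-URI: http://example.com/\n"
                + "Content-Length: 42\n"
                + "\n"
                + "body bytes would follow here";
        Map<String, String> headers = new LinkedHashMap<>();
        long length = parseHeaders(record, headers, true);
        System.out.println(length + " " + headers.get("warc-trec-id")); // prints "42 doc-1"
    }
}
```

Splitting on the first colon only keeps values such as URLs intact, which matters because header values themselves often contain colons.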
Terrier Information Retrieval Platform 5.1. Copyright © 2004-2019, University of Glasgow