public class WARC018Collection extends MultiDocumentFileCollection implements Collection
Document class to be used can be specified with the
trec.document.class property.
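
As a rough illustration of how a property with a default value might be resolved, the sketch below uses plain `java.util.Properties` as a stand-in for Terrier's own configuration mechanism. The `DocumentClassConfig` class, its `resolve` method, and the `com.example.MyDocument` value are hypothetical names introduced for this example only. The default string simply mirrors the "defaults to TaggedDocument" note on this page.

```java
import java.util.Properties;

/** Hypothetical helper, not part of Terrier: resolves the Document class name. */
final class DocumentClassConfig {
    static final String PROPERTY = "trec.document.class";
    // Default taken from this page's property description.
    static final String DEFAULT_CLASS = "TaggedDocument";

    /** Returns the configured class name, or the default when unset or blank. */
    static String resolve(Properties props) {
        String value = props.getProperty(PROPERTY);
        if (value == null || value.trim().isEmpty()) {
            return DEFAULT_CLASS;
        }
        return value.trim();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println(resolve(props)); // property unset, so the default applies

        // com.example.MyDocument is a made-up class name for illustration.
        props.setProperty(PROPERTY, "com.example.MyDocument");
        System.out.println(resolve(props));
    }
}
```

In a real deployment the property would be set through Terrier's configuration rather than a hand-built `Properties` object.
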
Properties
trec.document.class: Document class to parse individual documents (defaults to TaggedDocument).

| Modifier and Type | Field and Description |
|---|---|
| protected long | currentDocumentBlobLength: the length of the blob containing the document data |
| protected String | warc_crawldate_header: name of the header that supplies the crawldate document metadata |
| protected String | warc_docno_header: name of the header that supplies the docno document metadata |
| protected String | warc_url_header: name of the header that supplies the url document metadata |

Fields inherited from class MultiDocumentFileCollection: currentFilename, desiredEncoding, DocProperties, documentClass, documentsInThisFile, eoc, eof, FileNumber, FilesToProcess, forceUTF8, is, logger, SkipFile, tokeniser

| Constructor and Description |
|---|
| WARC018Collection() |
| WARC018Collection(InputStream input) |
| WARC018Collection(String CollectionSpecFilename) |

| Modifier and Type | Method and Description |
|---|---|
| String | getDocid(): Get the String document identifier of the current document. |
| Document | getDocument(): Get the document object representing the current document. |
| boolean | nextDocument(): Move the collection to the start of the next document. |
| protected int | parseHeaders(boolean requireContentLength) |
| protected String | readLine(): Read a line from the currently open InputStream (the inherited field is). |
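
This page gives no description for parseHeaders, so the following is only a sketch of how line-oriented header parsing for a WARC-style record could work. It is not Terrier's implementation. It assumes headers are "Name: value" lines terminated by a blank line, and that the returned int is the Content-Length value. The `WarcHeaderSketch` class, its `headers` map, and the sample header names in `main` are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in, not Terrier code: reads record headers from a stream. */
final class WarcHeaderSketch {
    private final InputStream is;
    final Map<String, String> headers = new LinkedHashMap<>();

    WarcHeaderSketch(InputStream is) {
        this.is = is;
    }

    /** Reads bytes up to '\n', drops a trailing '\r', returns null at end of stream. */
    String readLine() throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        boolean sawAny = false;
        int c;
        while ((c = is.read()) != -1) {
            sawAny = true;
            if (c == '\n') {
                break;
            }
            buf.write(c);
        }
        if (!sawAny) {
            return null;
        }
        String line = new String(buf.toByteArray(), StandardCharsets.UTF_8);
        return line.endsWith("\r") ? line.substring(0, line.length() - 1) : line;
    }

    /** Parses "Name: value" lines up to the first blank line; returns Content-Length or -1. */
    int parseHeaders(boolean requireContentLength) throws IOException {
        headers.clear();
        int contentLength = -1;
        String line;
        while ((line = readLine()) != null && !line.isEmpty()) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // lines without a colon (such as a version line) are skipped
            }
            String name = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            headers.put(name, value);
            if (name.equalsIgnoreCase("Content-Length")) {
                contentLength = Integer.parseInt(value);
            }
        }
        if (requireContentLength && contentLength < 0) {
            throw new IOException("No Content-Length header found");
        }
        return contentLength;
    }

    public static void main(String[] args) throws IOException {
        // Sample record with illustrative header names and values.
        String record = "WARC/0.18\nWARC-Type: response\nWARC-TREC-ID: doc-1\n"
                + "Content-Length: 5\n\nhello";
        WarcHeaderSketch sketch = new WarcHeaderSketch(
                new ByteArrayInputStream(record.getBytes(StandardCharsets.UTF_8)));
        System.out.println(sketch.parseHeaders(true));
        System.out.println(sketch.headers);
    }
}
```

After the headers are consumed, the stream is positioned at the start of the record body, so a caller could read exactly the reported number of bytes as the document data.
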
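
Methods inherited from class MultiDocumentFileCollection: close, endOfCollection, extractCharset, hasNext, loadDocumentClass, next, openNewFile, openNextFile, readCollectionSpec, reset
Methods inherited from class Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Collection: endOfCollection, reset

protected long currentDocumentBlobLength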
protected final String warc_docno_header
protected final String warc_url_header
protected final String warc_crawldate_header
public WARC018Collection()
public WARC018Collection(InputStream input)
public WARC018Collection(String CollectionSpecFilename)
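
Once a collection has been constructed, its documents are consumed through nextDocument(), getDocid() and getDocument(). The sketch below illustrates that iteration pattern without depending on Terrier. `DocumentSource` is a hypothetical interface that mirrors those three methods, `InMemorySource` is a toy implementation, and a String stands in for the Document object. It assumes nextDocument() returns false once no documents remain.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical driver showing the nextDocument()/getDocid() loop. */
final class IterationSketch {
    /** Advances through the source and collects every document identifier. */
    static List<String> collectDocids(DocumentSource source) {
        List<String> ids = new ArrayList<>();
        while (source.nextDocument()) {
            ids.add(source.getDocid());
        }
        return ids;
    }

    public static void main(String[] args) {
        DocumentSource source = new InMemorySource(new String[][] {
            {"doc-1", "first body"},
            {"doc-2", "second body"}
        });
        System.out.println(collectDocids(source));
    }
}

/** Hypothetical mirror of the three public methods in the method summary. */
interface DocumentSource {
    boolean nextDocument();
    String getDocid();
    String getDocument(); // the real method returns a Document; a String stands in here
}

/** Toy in-memory implementation, present only to make the loop runnable. */
final class InMemorySource implements DocumentSource {
    private final String[][] docs; // each entry is {docid, body}
    private int current = -1;

    InMemorySource(String[][] docs) {
        this.docs = docs;
    }

    public boolean nextDocument() {
        if (current + 1 >= docs.length) {
            return false;
        }
        current++;
        return true;
    }

    public String getDocid() {
        return docs[current][0];
    }

    public String getDocument() {
        return docs[current][1];
    }
}
```

With the real class, the same loop would be driven by a WARC018Collection instance created through one of the constructors above.
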
public boolean nextDocument()
nextDocument in interface Collection
nextDocument in class MultiDocumentFileCollection
public String getDocid()
public Document getDocument()
getDocument in interface Collection
getDocument in class MultiDocumentFileCollection
protected int parseHeaders(boolean requireContentLength) throws IOException
Throws: IOException
protected String readLine() throws IOException
Throws: IOException

Terrier Information Retrieval Platform 4.1. Copyright © 2004-2015, University of Glasgow