Package org.terrier.indexing
Class WARC018Collection
- java.lang.Object
  - org.terrier.indexing.MultiDocumentFileCollection
    - org.terrier.indexing.WARC018Collection
- All Implemented Interfaces: java.io.Closeable, java.lang.AutoCloseable, Collection
- Direct Known Subclasses: WARC10Collection
public class WARC018Collection extends MultiDocumentFileCollection implements Collection
This class parses web crawls stored in WARC format, version 0.18. The precise Document class to be used can be specified with the trec.document.class property.
Properties
- trec.document.class - the Document class used to parse individual documents (defaults to TaggedDocument).
- warc018collection.force.utf8 - whether UTF-8 encoding should be assumed throughout. Defaults to false.
- warc018collection.header.docno - the header whose value is used as the docno. Defaults to warc-trec-id.
- warc018collection.header.url - the header whose value is used as the url. Defaults to warc-target-url.
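The property names and defaults listed above can be illustrated with a small lookup. This is a minimal sketch using plain java.util.Properties; it is not how Terrier itself loads configuration, which is not shown on this page. The class name WarcPropertyDefaults, the method resolve, and the override value warc-record-id are hypothetical and exist only for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Illustrative only: resolves the four documented properties with their documented defaults. */
final class WarcPropertyDefaults {

    /** Returns the configured value for key, or the documented default when the key is unset. */
    static String resolve(Properties props, String key) {
        switch (key) {
            case "trec.document.class":
                return props.getProperty(key, "TaggedDocument");
            case "warc018collection.force.utf8":
                return props.getProperty(key, "false");
            case "warc018collection.header.docno":
                return props.getProperty(key, "warc-trec-id");
            case "warc018collection.header.url":
                return props.getProperty(key, "warc-target-url");
            default:
                throw new IllegalArgumentException("Not a property documented for this class: " + key);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical configuration: only the docno header is overridden.
        Properties props = new Properties();
        props.load(new StringReader("warc018collection.header.docno=warc-record-id\n"));

        System.out.println(resolve(props, "warc018collection.header.docno"));
        System.out.println(resolve(props, "warc018collection.header.url"));
        System.out.println(resolve(props, "warc018collection.force.utf8"));
    }
}
```

Any property left unset falls back to the default stated in the list, so a configuration only needs to name the values it changes.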
- Author: Craig Macdonald
Field Summary
Fields (modifier and type, field name, description):
- protected long currentDocumentBlobLength: the length of the blob containing the document data
- protected int readLineByteCount
- protected java.lang.String warc_crawldate_header: the header that supplies the crawldate document metadata
- protected java.lang.String warc_docno_header: the header that supplies the docno document metadata
- protected java.lang.String warc_url_header: the header that supplies the url document metadata
Fields inherited from class org.terrier.indexing.MultiDocumentFileCollection
currentFilename, desiredEncoding, DocProperties, documentClass, documentsInThisFile, eoc, eof, FileNumber, FilesToProcess, forceUTF8, is, logger, SkipFile, tokeniser
-
Constructor Summary
Constructors
WARC018Collection()
WARC018Collection(java.io.InputStream input)
WARC018Collection(java.lang.String CollectionSpecFilename)
WARC018Collection(java.lang.String CollectionSpecFilename, java.lang.String TagSet, java.lang.String BlacklistSpecFilename, java.lang.String ignored)
WARC018Collection(java.util.List<java.lang.String> files)
WARC018Collection(java.util.List<java.lang.String> files, java.lang.String TagSet, java.lang.String BlacklistSpecFilename, java.lang.String ignored)
-
Method Summary
Methods (modifier and type, method, description):
- java.lang.String getDocid(): Get the String document identifier of the current document.
- Document getDocument(): Get the document object representing the current document.
- boolean nextDocument(): Move the collection to the start of the next document.
- protected static java.lang.String parseDate(java.lang.String date)
- protected int parseHeaders(boolean requireContentLength)
- protected java.lang.String readLine(): Read a line from the currently open InputStream, is.
Methods inherited from class org.terrier.indexing.MultiDocumentFileCollection
checkEncoding, close, endOfCollection, extractCharset, hasNext, loadDocumentClass, next, openNewFile, openNextFile, reset
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.terrier.indexing.Collection
endOfCollection, reset
-
Field Detail
-
currentDocumentBlobLength
protected long currentDocumentBlobLength
the length of the blob containing the document data
-
warc_docno_header
protected final java.lang.String warc_docno_header
The header that supplies the docno document metadata.
-
warc_url_header
protected final java.lang.String warc_url_header
The header that supplies the url document metadata.
-
warc_crawldate_header
protected final java.lang.String warc_crawldate_header
The header that supplies the crawldate document metadata.
-
readLineByteCount
protected int readLineByteCount
-
Constructor Detail
-
WARC018Collection
public WARC018Collection()
-
WARC018Collection
public WARC018Collection(java.io.InputStream input)
-
WARC018Collection
public WARC018Collection(java.lang.String CollectionSpecFilename)
-
WARC018Collection
public WARC018Collection(java.util.List<java.lang.String> files)
-
WARC018Collection
public WARC018Collection(java.util.List<java.lang.String> files, java.lang.String TagSet, java.lang.String BlacklistSpecFilename, java.lang.String ignored)
-
WARC018Collection
public WARC018Collection(java.lang.String CollectionSpecFilename, java.lang.String TagSet, java.lang.String BlacklistSpecFilename, java.lang.String ignored)
-
Method Detail
-
nextDocument
public boolean nextDocument()
Move the collection to the start of the next document.
- Specified by: nextDocument in interface Collection
- Specified by: nextDocument in class MultiDocumentFileCollection
- Returns: true if there is another document in the collection, false otherwise.
-
getDocid
public java.lang.String getDocid()
Get the String document identifier of the current document.
-
getDocument
public Document getDocument()
Get the document object representing the current document.
- Specified by: getDocument in interface Collection
- Specified by: getDocument in class MultiDocumentFileCollection
- Returns: the current document.
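The signatures of nextDocument() and getDocument() suggest a cursor-style loop: advance first, then read the current document. The sketch below shows that pattern with stand-in types so that it runs without Terrier on the classpath. The interface DocumentCursor and the class ListCursor are hypothetical and are not part of org.terrier.indexing; only the two method names and return types are taken from this page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Illustrative only: the nextDocument()/getDocument() cursor pattern with stand-in types. */
final class CursorLoopSketch {

    /** Hypothetical stand-in that mirrors the two method signatures documented above. */
    interface DocumentCursor<D> {
        boolean nextDocument();
        D getDocument();
    }

    /** In-memory cursor over a list; a real collection would read records from files instead. */
    static final class ListCursor<D> implements DocumentCursor<D> {
        private final Iterator<D> it;
        private D current;

        ListCursor(List<D> docs) {
            this.it = docs.iterator();
        }

        @Override
        public boolean nextDocument() {
            if (!it.hasNext()) {
                current = null;
                return false;
            }
            current = it.next();
            return true;
        }

        @Override
        public D getDocument() {
            return current;
        }
    }

    /** Advances before each read, so getDocument() is only called after nextDocument() returned true. */
    static <D> List<D> drain(DocumentCursor<D> cursor) {
        List<D> out = new ArrayList<>();
        while (cursor.nextDocument()) {
            out.add(cursor.getDocument());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(drain(new ListCursor<>(Arrays.asList("doc-a", "doc-b"))));
    }
}
```

With a real WARC018Collection the loop body would typically hand each Document to an indexer, and the collection would be closed afterwards, since the class implements java.io.Closeable.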
-
parseHeaders
protected int parseHeaders(boolean requireContentLength) throws java.io.IOException
- Throws: java.io.IOException
-
readLine
protected java.lang.String readLine() throws java.io.IOException
Read a line from the currently open InputStream, is.
- Throws: java.io.IOException
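This page does not describe how parseHeaders and readLine work internally, so the following is only a sketch of one plausible approach, not the class's actual implementation. It reads lines byte by byte while counting the bytes consumed (in the spirit of the readLineByteCount field), collects "name: value" header lines until a blank line, and returns the Content-Length value. The class WarcHeaderSketch, its fields, the long return type, the -1 result for a missing length, and the sample record are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: byte-counted line reading and header parsing for a WARC-style record. */
final class WarcHeaderSketch {
    private final InputStream in;
    /** Bytes consumed by the most recent readLine() call, line terminator included. */
    int lastLineByteCount;
    /** Header names (lower-cased) mapped to their values, in the order read. */
    final Map<String, String> headers = new LinkedHashMap<>();

    WarcHeaderSketch(InputStream in) {
        this.in = in;
    }

    /** Reads up to and including '\n'; returns the line without its terminator, or null at end of stream. */
    String readLine() throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        lastLineByteCount = 0;
        int b;
        while ((b = in.read()) != -1) {
            lastLineByteCount++;
            if (b == '\n') {
                break;
            }
            buf.write(b);
        }
        if (lastLineByteCount == 0) {
            return null;
        }
        String line = new String(buf.toByteArray(), StandardCharsets.ISO_8859_1);
        return line.endsWith("\r") ? line.substring(0, line.length() - 1) : line;
    }

    /** Reads header lines until a blank line; returns Content-Length, or -1 if absent and not required. */
    long parseHeaders(boolean requireContentLength) throws IOException {
        headers.clear();
        String line;
        while ((line = readLine()) != null && !line.isEmpty()) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // e.g. a version line, which has no "name: value" shape
            }
            headers.put(line.substring(0, colon).trim().toLowerCase(Locale.ROOT),
                    line.substring(colon + 1).trim());
        }
        String length = headers.get("content-length");
        if (length == null) {
            if (requireContentLength) {
                throw new IOException("Missing Content-Length header");
            }
            return -1;
        }
        return Long.parseLong(length);
    }

    public static void main(String[] args) throws IOException {
        // Made-up record for illustration; not taken from a real crawl.
        String record = "WARC/0.18\r\n"
                + "WARC-TREC-ID: example-doc-0001\r\n"
                + "Content-Length: 5\r\n"
                + "\r\n"
                + "hello";
        WarcHeaderSketch parser = new WarcHeaderSketch(
                new ByteArrayInputStream(record.getBytes(StandardCharsets.ISO_8859_1)));
        long length = parser.parseHeaders(true);
        System.out.println(parser.headers.get("warc-trec-id") + " " + length);
    }
}
```

Counting bytes rather than characters matters here because the content length in a record header is expressed in bytes, so the reader needs byte positions to know where the document blob starts and ends.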
-
parseDate
protected static final java.lang.String parseDate(java.lang.String date)
-