Class WARC018Collection

  • All Implemented Interfaces:
    java.io.Closeable, java.lang.AutoCloseable, Collection
    Direct Known Subclasses:
    WARC10Collection

    public class WARC018Collection
    extends MultiDocumentFileCollection
    implements Collection
    Parses web crawls stored in WARC format, version 0.18. The Document class used for individual documents can be specified with the trec.document.class property.

    Properties

    • trec.document.class - the Document class used to parse individual documents. Defaults to TaggedDocument.
    • warc018collection.force.utf8 - whether UTF-8 encoding should be assumed throughout. Defaults to false.
    • warc018collection.header.docno - the name of the header whose value is used as the docno. Defaults to warc-trec-id.
    • warc018collection.header.url - the name of the header whose value is used as the url. Defaults to warc-target-url.
    Author:
    Craig Macdonald
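The property names and defaults listed above can be illustrated with plain java.util.Properties. This is only a sketch: WarcPropertySketch and its methods are hypothetical and not part of Terrier, and this page does not show how the class itself reads its configuration.

```java
import java.util.Properties;

// Hypothetical helper, NOT part of Terrier. It only restates the documented
// property names and their documented defaults.
class WarcPropertySketch {

    // Document class used for individual documents.
    static String documentClass(Properties p) {
        return p.getProperty("trec.document.class", "TaggedDocument");
    }

    // Whether UTF-8 should be assumed throughout.
    static boolean forceUtf8(Properties p) {
        return Boolean.parseBoolean(
                p.getProperty("warc018collection.force.utf8", "false"));
    }

    // Header whose value is used as the docno.
    static String docnoHeader(Properties p) {
        return p.getProperty("warc018collection.header.docno", "warc-trec-id");
    }

    // Header whose value is used as the url.
    static String urlHeader(Properties p) {
        return p.getProperty("warc018collection.header.url", "warc-target-url");
    }

    public static void main(String[] args) {
        Properties p = new Properties();
        // Override one property; the others fall back to their defaults.
        p.setProperty("warc018collection.force.utf8", "true");

        System.out.println("document class: " + documentClass(p));
        System.out.println("force utf8:     " + forceUtf8(p));
        System.out.println("docno header:   " + docnoHeader(p));
        System.out.println("url header:     " + urlHeader(p));
    }
}
```

Only the overridden property changes; the remaining lookups return the defaults given in the list above.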
    • Field Detail

      • currentDocumentBlobLength

        protected long currentDocumentBlobLength
        The length of the blob containing the document data.
      • warc_docno_header

        protected final java.lang.String warc_docno_header
        The name of the header whose value is used as the docno document metadata.
      • warc_url_header

        protected final java.lang.String warc_url_header
        The name of the header whose value is used as the url document metadata.
      • warc_crawldate_header

        protected final java.lang.String warc_crawldate_header
        The name of the header whose value is used as the crawldate document metadata.
      • readLineByteCount

        protected int readLineByteCount
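The header-name fields above each select one record header to supply a piece of document metadata. A minimal sketch of that idea follows. HeaderLookupSketch is hypothetical, the record headers are made-up sample data, and lower-casing header names before lookup is an assumption rather than documented behaviour of this class.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration, NOT the Terrier implementation.
class HeaderLookupSketch {

    // Split "Name: value" lines into a map keyed by lower-cased header name.
    // Lower-casing is an assumption made for this sketch.
    static Map<String, String> parse(String... headerLines) {
        Map<String, String> headers = new HashMap<>();
        for (String line : headerLines) {
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // not a "Name: value" line
            }
            String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            headers.put(name, value);
        }
        return headers;
    }

    public static void main(String[] args) {
        // Documented defaults for the docno and url header names.
        String docnoHeader = "warc-trec-id";
        String urlHeader = "warc-target-url";

        // Made-up sample headers for one record.
        Map<String, String> headers = parse(
                "WARC-TREC-ID: doc-0001",
                "WARC-Target-URL: http://example.org/",
                "Content-Length: 1234");

        System.out.println("docno = " + headers.get(docnoHeader));
        System.out.println("url   = " + headers.get(urlHeader));
    }
}
```

Splitting at the first colon keeps values that themselves contain colons, such as URLs, intact.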
    • Constructor Detail

      • WARC018Collection

        public WARC018Collection()
      • WARC018Collection

        public WARC018Collection​(java.io.InputStream input)
      • WARC018Collection

        public WARC018Collection​(java.lang.String CollectionSpecFilename)
      • WARC018Collection

        public WARC018Collection​(java.util.List<java.lang.String> files)
      • WARC018Collection

        public WARC018Collection​(java.util.List<java.lang.String> files,
                                 java.lang.String TagSet,
                                 java.lang.String BlacklistSpecFilename,
                                 java.lang.String ignored)
      • WARC018Collection

        public WARC018Collection​(java.lang.String CollectionSpecFilename,
                                 java.lang.String TagSet,
                                 java.lang.String BlacklistSpecFilename,
                                 java.lang.String ignored)
    • Method Detail

      • nextDocument

        public boolean nextDocument()
        Move the collection to the start of the next document.
        Specified by:
        nextDocument in interface Collection
        Specified by:
        nextDocument in class MultiDocumentFileCollection
        Returns:
        true if there is another document in the collection, false otherwise.
      • getDocid

        public java.lang.String getDocid()
        Get the String document identifier of the current document.
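Together, nextDocument() and getDocid() suggest the usual iteration pattern: advance, then read the identifier of the current document. The sketch below shows that loop against a hypothetical in-memory stand-in (DocSource, InMemorySource) so that it runs without Terrier; it is not the library's Collection interface, and the docids are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in types, NOT part of Terrier.
class CollectionLoopSketch {

    // Mirrors only the two methods documented above.
    interface DocSource {
        boolean nextDocument();
        String getDocid();
    }

    // Serves a fixed list of docids, one per "document".
    static class InMemorySource implements DocSource {
        private final Iterator<String> ids;
        private String current;

        InMemorySource(List<String> docids) {
            this.ids = docids.iterator();
        }

        public boolean nextDocument() {
            if (!ids.hasNext()) {
                current = null;
                return false; // no more documents
            }
            current = ids.next();
            return true;
        }

        public String getDocid() {
            return current;
        }
    }

    // Advance until nextDocument() returns false, collecting each docid.
    static List<String> collectDocids(DocSource source) {
        List<String> seen = new ArrayList<>();
        while (source.nextDocument()) {
            seen.add(source.getDocid());
        }
        return seen;
    }

    public static void main(String[] args) {
        DocSource source = new InMemorySource(Arrays.asList("doc-0001", "doc-0002"));
        for (String docid : collectDocids(source)) {
            System.out.println(docid);
        }
    }
}
```

With the real class, which implements java.io.Closeable, the same loop would typically sit inside a try-with-resources block so that the underlying files are closed.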
      • parseHeaders

        protected int parseHeaders​(boolean requireContentLength)
                            throws java.io.IOException
        Throws:
        java.io.IOException
      • readLine

        protected java.lang.String readLine()
                                     throws java.io.IOException
        Reads a line from the currently open InputStream, is.
        Throws:
        java.io.IOException
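Reading a line directly from a byte stream while tracking how many bytes were consumed can be sketched as follows. LineReaderSketch is hypothetical and is not the Terrier implementation. In particular, it assumes that readLineByteCount holds the number of bytes consumed by the most recent readLine() call, line terminator included, which this page does not state, and it decodes lines as UTF-8 for simplicity.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, NOT the Terrier implementation.
class LineReaderSketch {
    private final InputStream is;

    // ASSUMPTION: bytes consumed by the last readLine() call, terminator included.
    int readLineByteCount;

    LineReaderSketch(InputStream is) {
        this.is = is;
    }

    // Returns the next line without its terminator, or null at end of stream.
    String readLine() throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        readLineByteCount = 0;
        int b;
        while ((b = is.read()) != -1) {
            readLineByteCount++;
            if (b == '\n') {
                break; // end of line
            }
            if (b != '\r') {
                buf.write(b); // drop carriage returns, keep everything else
            }
        }
        if (b == -1 && readLineByteCount == 0) {
            return null; // nothing left to read
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "WARC/0.18\r\nContent-Length: 5\r\n"
                .getBytes(StandardCharsets.UTF_8);
        LineReaderSketch reader = new LineReaderSketch(new ByteArrayInputStream(data));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line + " (" + reader.readLineByteCount + " bytes)");
        }
    }
}
```

Counting raw bytes rather than decoded characters matters when a record's payload length is given in bytes, because a reader that buffers or decodes ahead would lose track of its position in the stream.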
      • parseDate

        protected static final java.lang.String parseDate​(java.lang.String date)