Class WARC09Collection

  • All Implemented Interfaces:
    java.io.Closeable, java.lang.AutoCloseable, Collection

    public class WARC09Collection
    extends MultiDocumentFileCollection
    Parses web crawls stored in WARC format, version 0.9. The precise Document class to be used can be specified with the trec.document.class property. The format handled by this class was derived from the following pages:
      http://www.yr-bcn.es/webspam/datasets/uk2006-pages/excerpt.txt
      http://archive-access.sourceforge.net/warc/warc_file_format.html
      http://crawler.archive.org/apidocs/index.html?org/archive/io/arc/ARCWriter.html
      http://crawler.archive.org/apidocs/org/archive/io/GzippedInputStream.html

    Properties

    Author:
    Craig Macdonald
    • Field Detail

      • currentDocumentBlobLength

        protected long currentDocumentBlobLength
      • currentDocno

        protected java.lang.String currentDocno
        Properties of the current document.
    • Constructor Detail

      • WARC09Collection

        public WARC09Collection()
      • WARC09Collection

        public WARC09Collection(java.io.InputStream input)
      • WARC09Collection

        public WARC09Collection(java.lang.String CollectionSpecFilename)
      • WARC09Collection

        public WARC09Collection(java.util.List<java.lang.String> files,
                                java.lang.String TagSet,
                                java.lang.String BlacklistSpecFilename,
                                java.lang.String ignored)
      • WARC09Collection

        public WARC09Collection(java.util.List<java.lang.String> files)
      • WARC09Collection

        public WARC09Collection(java.lang.String CollectionSpecFilename,
                                java.lang.String TagSet,
                                java.lang.String BlacklistSpecFilename,
                                java.lang.String ignored)
    • Method Detail

      • nextDocument

        public boolean nextDocument()
        Move the collection to the start of the next document.
        Specified by:
        nextDocument in interface Collection
        Specified by:
        nextDocument in class MultiDocumentFileCollection
        Returns:
        true if there is another document in the collection, false otherwise.
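
        The contract above (advance, then test the boolean result) lends itself to a simple while loop. The following is a minimal, self-contained sketch of that pattern only. DocumentSource and ListSource are hypothetical stand-ins invented for this illustration so that the loop can run without a crawl file; they are not part of this API, and a real caller would invoke nextDocument() on a WARC09Collection instance instead.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class NextDocumentLoopSketch {

    /** Hypothetical stand-in for the nextDocument() part of the Collection contract. */
    interface DocumentSource {
        /** Moves to the start of the next document; returns false once none remain. */
        boolean nextDocument();
    }

    /** Toy in-memory source (an assumption of this sketch, not a real collection class). */
    static final class ListSource implements DocumentSource {
        private final Iterator<String> docnos;
        private String currentDocno; // loosely mirrors the currentDocno field

        ListSource(List<String> docnos) {
            this.docnos = docnos.iterator();
        }

        @Override
        public boolean nextDocument() {
            if (!docnos.hasNext()) {
                return false;
            }
            currentDocno = docnos.next();
            return true;
        }

        String currentDocno() {
            return currentDocno;
        }
    }

    /** Advances until nextDocument() returns false and counts the documents seen. */
    static int countDocuments(DocumentSource source) {
        int count = 0;
        while (source.nextDocument()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        DocumentSource source = new ListSource(Arrays.asList("doc-1", "doc-2", "doc-3"));
        System.out.println(countDocuments(source)); // prints 3
    }
}
```

        Because the class also implements java.io.Closeable, a real caller would normally wrap the loop in a try-with-resources statement so that the underlying stream is released.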
      • readLine

        protected java.lang.String readLine()
                                     throws java.io.IOException
        Reads a line from the currently open InputStream (the field named is).
        Throws:
        java.io.IOException
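
        The source does not show how readLine() is implemented. As an illustration of what reading a line directly from a raw InputStream can look like, here is a minimal sketch under stated assumptions: lines end in '\n' with an optional preceding '\r', bytes are decoded as ISO-8859-1, and null signals end of stream. The class name ReadLineSketch and all of these choices are assumptions of this example, not the behaviour of this class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class ReadLineSketch {

    /**
     * Reads bytes up to and including the next '\n' and returns them without the
     * line terminator. Returns null if the stream is already at its end.
     */
    static String readLine(InputStream is) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        boolean readAny = false;
        int b;
        while ((b = is.read()) != -1) {
            readAny = true;
            if (b == '\n') {
                break; // terminator consumed, but not stored
            }
            buffer.write(b);
        }
        if (!readAny) {
            return null; // nothing left in the stream
        }
        String line = new String(buffer.toByteArray(), StandardCharsets.ISO_8859_1);
        // Drop a trailing carriage return so that CRLF and LF input behave the same.
        if (line.endsWith("\r")) {
            line = line.substring(0, line.length() - 1);
        }
        return line;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "first line\r\nsecond line\n".getBytes(StandardCharsets.ISO_8859_1);
        InputStream is = new ByteArrayInputStream(data);
        String line;
        while ((line = readLine(is)) != null) {
            System.out.println(line);
        }
    }
}
```

        Reading one byte at a time, rather than through a BufferedReader, never consumes bytes beyond the line terminator. That matters whenever the bytes that follow must be read by length, which is presumably the role of currentDocumentBlobLength here.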