Class TRECWebCollection

  • All Implemented Interfaces:
    java.io.Closeable, java.lang.AutoCloseable, Collection

    public class TRECWebCollection
    extends TRECCollection
    A version of TRECCollection that can parse standard-format DOCHDR tags in TREC Web corpora. A standard-format DOCHDR tag from WT2G is shown below.
     <DOCHDR>
     http://www.city.geneva.ny.us:80/index.htm 192.108.245.124 19970121041510 text/html 2407
     HTTP/1.0 200 OK
     Date: Tue, 21 Jan 1997 04:14:08 GMT
     Server: Apache/1.1.1
     Content-type: text/html
     Content-length: 2236
     Last-modified: Fri, 18 Oct 1996 17:33:56 GMT
     </DOCHDR>
     
    TRECWebCollection parses each HTTP header as a Document property. In addition, the URL, IP address, date and length are parsed from the DOCHDR tag. Specifically, the following Document properties are set, depending on the format of the DOCHDR tag:
    • url (all corpora)
    • ip (WT2G, WT10G only)
    • docbytelength (WT2G, WT10G, Blogs06, Blogs08 only)
    • contenttype (WT2G, WT10G only, although the content type is usually also given in the HTTP headers)
    • crawldate (WT2G, WT10G only)
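
As an illustration of the HTTP-header half of this behaviour, the following is a minimal, standalone sketch and not Terrier's actual implementation. The class `DocHdrHeaderSketch` and the method `parseHeaders` are hypothetical names, and lower-casing the header name to form the property key is an assumption made here for illustration. The sketch expects only the lines that follow the first DOCHDR line.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Terrier. */
final class DocHdrHeaderSketch {

    /**
     * Converts "Name: value" header lines into a property map.
     * Lines without a colon (such as the status line "HTTP/1.0 200 OK") are skipped.
     * Assumption: property keys are the lower-cased header names.
     */
    static Map<String, String> parseHeaders(List<String> headerLines) {
        Map<String, String> props = new LinkedHashMap<>();
        for (String line : headerLines) {
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // not a "Name: value" line
            }
            String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            props.put(name, value);
        }
        return props;
    }

    public static void main(String[] args) {
        // Header lines from the WT2G example above, excluding the first DOCHDR line.
        List<String> lines = List.of(
                "HTTP/1.0 200 OK",
                "Date: Tue, 21 Jan 1997 04:14:08 GMT",
                "Server: Apache/1.1.1",
                "Content-type: text/html",
                "Content-length: 2236",
                "Last-modified: Fri, 18 Oct 1996 17:33:56 GMT");
        parseHeaders(lines).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Splitting on the first colon only keeps values that themselves contain colons, such as the time in the Date header, intact.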

    Supported TREC Collections:
    The format of the DOCHDR tag, in particular its first line, varies across the TREC web corpora. The following corpora are supported:

    • WT2G, WT10G: URL IP Crawldate content-type docbytelength
    • Blogs06, Blogs08: URL invalidIP invalidCrawldate docbytelength
    • GOV, GOV2, W3C, CERC: URL
    For indexing the more recent TREC ClueWeb09 corpus, see WARC018Collection.
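
One way the first-line variants listed above could be told apart is by counting whitespace-separated fields. The sketch below does this under that assumption; it is not taken from Terrier's source, and `DocHdrFirstLineSketch` and `parseFirstLine` are hypothetical names. The property keys are those listed in this class description.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Terrier. */
final class DocHdrFirstLineSketch {

    /**
     * Parses the first line of a DOCHDR tag.
     * Assumption: the corpus format is inferred from the number of fields.
     *   5 fields: URL IP crawldate content-type docbytelength (WT2G, WT10G)
     *   4 fields: URL invalidIP invalidCrawldate docbytelength (Blogs06, Blogs08)
     *   otherwise: only the URL is taken (GOV, GOV2, W3C, CERC)
     */
    static Map<String, String> parseFirstLine(String line) {
        Map<String, String> props = new LinkedHashMap<>();
        String[] fields = line.trim().split("\\s+");
        if (fields[0].isEmpty()) {
            return props; // blank line: nothing to record
        }
        props.put("url", fields[0]);
        if (fields.length == 5) {
            props.put("ip", fields[1]);
            props.put("crawldate", fields[2]);
            props.put("contenttype", fields[3]);
            props.put("docbytelength", fields[4]);
        } else if (fields.length == 4) {
            // The IP and crawl date fields are invalid in these corpora, so skip them.
            props.put("docbytelength", fields[3]);
        }
        return props;
    }

    public static void main(String[] args) {
        // First line of the WT2G example shown in this class description.
        String wt2g = "http://www.city.geneva.ny.us:80/index.htm 192.108.245.124 "
                + "19970121041510 text/html 2407";
        System.out.println(parseFirstLine(wt2g));
    }
}
```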
    Since:
    3.5
    Author:
    Craig Macdonald
    • Constructor Detail

      • TRECWebCollection

        public TRECWebCollection()
        Constructs an instance of the TRECWebCollection.
      • TRECWebCollection

        public TRECWebCollection(java.io.InputStream input)
        Constructs an instance of the TRECWebCollection, given an InputStream.
        Parameters:
        input - the InputStream to read the collection from
      • TRECWebCollection

        public TRECWebCollection(java.lang.String CollectionSpecFilename,
                                 java.lang.String TagSet,
                                 java.lang.String BlacklistSpecFilename,
                                 java.lang.String ignored)
        Constructs an instance of the TRECWebCollection.
        Parameters:
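        CollectionSpecFilename - the filename of the collection specification
        TagSet - the tag set to use
        BlacklistSpecFilename - the filename of the blacklist specification
        ignored - not used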
      • TRECWebCollection

        public TRECWebCollection(java.util.List<java.lang.String> files,
                                 java.lang.String TagSet,
                                 java.lang.String BlacklistSpecFilename,
                                 java.lang.String ignored)
      • TRECWebCollection

        public TRECWebCollection(java.util.List<java.lang.String> files,
                                 java.lang.String TagSet,
                                 java.lang.String BlacklistSpecFilename)
      • TRECWebCollection

        public TRECWebCollection(java.util.List<java.lang.String> files)
    • Method Detail

      • afterPropertyTags

        protected void afterPropertyTags()
                                  throws java.io.IOException
        Overrides:
        afterPropertyTags in class TRECCollection
        Throws:
        java.io.IOException