org.terrier.indexing
Class TRECWebCollection
java.lang.Object
   org.terrier.indexing.TRECCollection
       org.terrier.indexing.TRECWebCollection
- All Implemented Interfaces: 
- java.io.Closeable, Collection, DocumentExtractor
- public class TRECWebCollection 
- extends TRECCollection
Version of TRECCollection which can parse
 standard-format DOCHDR tags in TREC Web corpora. 
 A standard format DOCHDR tag from WT2G is shown below.
 
 <DOCHDR>
 http://www.city.geneva.ny.us:80/index.htm 192.108.245.124 19970121041510 text/html 2407
 HTTP/1.0 200 OK
 Date: Tue, 21 Jan 1997 04:14:08 GMT
 Server: Apache/1.1.1
 Content-type: text/html
 Content-length: 2236
 Last-modified: Fri, 18 Oct 1996 17:33:56 GMT
 </DOCHDR>
 
 TRECWebCollection parses each HTTP header as a Document property.
 In addition, the URL, IP address, date and length are parsed
 from the first line of the DOCHDR tag. In particular, the following Document properties
 are set, depending on the format of the DOCHDR tag:
 
 - url (all corpora)
- ip (WT2G, WT10G only)
- docbytelength (WT2G, WT10G, Blogs06, Blogs08 only)
- contenttype (WT2G, WT10G only, but usually identified in the HTTP headers)
- crawldate (WT2G, WT10G only)
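
 The mapping from a WT2G-style DOCHDR tag to the properties listed above can be
 illustrated with a small standalone sketch. This is not Terrier's implementation:
 the class name DocHdrSketch and the method parse are hypothetical, and the choice
 to store each HTTP header under its lowercased name is an assumption made for
 illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Terrier: illustrates how the body of a
// WT2G-style DOCHDR tag could be turned into name/value properties.
class DocHdrSketch {

    // Property names for the five whitespace-separated fields of the first line.
    private static final String[] FIRST_LINE_NAMES =
        {"url", "ip", "crawldate", "contenttype", "docbytelength"};

    static Map<String, String> parse(String dochdrBody) {
        Map<String, String> props = new LinkedHashMap<>();
        String[] lines = dochdrBody.trim().split("\\r?\\n");

        // First line: URL IP crawldate content-type docbytelength
        String[] fields = lines[0].trim().split("\\s+");
        if (fields.length == FIRST_LINE_NAMES.length) {
            for (int i = 0; i < fields.length; i++) {
                props.put(FIRST_LINE_NAMES[i], fields[i]);
            }
        } else {
            // Any other layout: only the URL is taken in this sketch.
            props.put("url", fields[0]);
        }

        // Remaining lines: "Name: value" HTTP headers. Lines without a colon,
        // such as the status line "HTTP/1.0 200 OK", are skipped.
        for (int i = 1; i < lines.length; i++) {
            String line = lines[i].trim();
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue;
            }
            // Assumption: header names are stored lowercased.
            String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            props.put(name, value);
        }
        return props;
    }

    public static void main(String[] args) {
        String hdr =
            "http://www.city.geneva.ny.us:80/index.htm 192.108.245.124 19970121041510 text/html 2407\n"
            + "HTTP/1.0 200 OK\n"
            + "Date: Tue, 21 Jan 1997 04:14:08 GMT\n"
            + "Server: Apache/1.1.1\n"
            + "Content-type: text/html\n"
            + "Content-length: 2236\n"
            + "Last-modified: Fri, 18 Oct 1996 17:33:56 GMT\n";
        for (Map.Entry<String, String> e : parse(hdr).entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

 In this sketch the content type appears twice, once as contenttype from the first
 line and once as content-type from the HTTP headers, which mirrors the note above
 that the content type is usually also identified in the HTTP headers.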
 Supported TREC Collections:
 There are some variations in the format of the DOCHDR tags in the various
 TREC web corpora, in particular the first line of the tag. The following corpora are supported.
 
 - WT2G, WT10G: URL IP Crawldate content-type docbytelength
- Blogs06, Blogs08: URL invalidIP invalidCrawldate docbytelength
- GOV, GOV2, W3C, CERC: URL
For indexing the more recent TREC ClueWeb09 corpus, see WARC018Collection.
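
 The three first-line layouts listed above can be told apart by the number of
 whitespace-separated fields. The following standalone sketch shows one way to do
 that. It is an illustration only, not Terrier's code: the class
 DocHdrFirstLineSketch, the method parseFirstLine and the Blogs-style and GOV-style
 sample lines are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Terrier: picks properties from the first
// line of a DOCHDR tag according to how many fields it contains.
class DocHdrFirstLineSketch {

    static Map<String, String> parseFirstLine(String line) {
        String[] fields = line.trim().split("\\s+");
        Map<String, String> props = new LinkedHashMap<>();

        // Every supported corpus starts the line with the URL.
        props.put("url", fields[0]);

        if (fields.length == 5) {
            // WT2G, WT10G: URL IP Crawldate content-type docbytelength
            props.put("ip", fields[1]);
            props.put("crawldate", fields[2]);
            props.put("contenttype", fields[3]);
            props.put("docbytelength", fields[4]);
        } else if (fields.length == 4) {
            // Blogs06, Blogs08: the IP and crawl date fields are invalid,
            // so only the byte length is kept.
            props.put("docbytelength", fields[3]);
        }
        // GOV, GOV2, W3C, CERC: URL only, nothing more to set.
        return props;
    }

    public static void main(String[] args) {
        // The first line is the WT2G sample; the other two are made-up inputs.
        String[] samples = {
            "http://www.city.geneva.ny.us:80/index.htm 192.108.245.124 19970121041510 text/html 2407",
            "http://example.org/feed 0.0.0.0 00000000000000 1234",
            "http://example.gov/index.html"
        };
        for (String sample : samples) {
            System.out.println(parseFirstLine(sample));
        }
    }
}
```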
- Since:
- 3.5
- Author:
- Craig Macdonald
 
| Fields inherited from class org.terrier.indexing.TRECCollection | 
| br, currentFilename, desiredEncoding, DocIDBlacklist, docnotag, DocProperties, documentClass, documentCounter, documentsInThisFile, end_docnoTag, end_docnoTagLength, end_docTag, end_docTagLength, endOfCollection, endPropertyTags, FileNumber, FilesToProcess, ignoreProperties, logger, propertyTagLengths, SkipFile, start_docnoTag, start_docnoTagLength, start_docTag, start_docTagLength, startPropertyTags, tags_CaseSensitive, ThisDocID, tokeniser | 
 
| Constructor Summary | 
| TRECWebCollection() | Constructs an instance of the TRECWebCollection. |
| TRECWebCollection(java.io.InputStream input) | Constructs an instance of the TRECWebCollection, given an InputStream. |
| TRECWebCollection(java.lang.String CollectionSpecFilename, java.lang.String TagSet, java.lang.String BlacklistSpecFilename, java.lang.String ignored) | Constructs an instance of the TRECWebCollection. |
 
 
| Methods inherited from class org.terrier.indexing.TRECCollection | 
| close, endOfCollection, getDocid, getDocument, getDocument, getDocumentString, getTag, hasNext, loadDocumentClass, next, nextDocument, openNextFile, readCollectionSpec, readDocumentBlacklist, remove, reset, setTags | 
 
| Methods inherited from class java.lang.Object | 
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait | 
 
TRECWebCollection
public TRECWebCollection()
- Constructs an instance of the TRECWebCollection.
 
TRECWebCollection
public TRECWebCollection(java.io.InputStream input)
- Constructs an instance of the TRECWebCollection, given an InputStream.
 
- Parameters:
- input-
 
TRECWebCollection
public TRECWebCollection(java.lang.String CollectionSpecFilename,
                         java.lang.String TagSet,
                         java.lang.String BlacklistSpecFilename,
                         java.lang.String ignored)
- Constructs an instance of the TRECWebCollection.
 
- Parameters:
- CollectionSpecFilename-
- TagSet-
- BlacklistSpecFilename-
- ignored-
 
afterPropertyTags
protected void afterPropertyTags()
                          throws java.io.IOException
- Overrides:
- afterPropertyTags in class TRECCollection
 
- Throws:
- java.io.IOException
 
Terrier 3.5. Copyright © 2004-2011 University of Glasgow