org.terrier.indexing
Class TRECWebCollection

java.lang.Object
  extended by org.terrier.indexing.TRECCollection
      extended by org.terrier.indexing.TRECWebCollection
All Implemented Interfaces:
java.io.Closeable, Collection, DocumentExtractor

public class TRECWebCollection
extends TRECCollection

Version of TRECCollection which can parse standard-form DOCHDR tags in TREC Web corpora. A standard-format DOCHDR tag from WT2G is shown below.

 <DOCHDR>
 http://www.city.geneva.ny.us:80/index.htm 192.108.245.124 19970121041510 text/html 2407
 HTTP/1.0 200 OK
 Date: Tue, 21 Jan 1997 04:14:08 GMT
 Server: Apache/1.1.1
 Content-type: text/html
 Content-length: 2236
 Last-modified: Fri, 18 Oct 1996 17:33:56 GMT
 </DOCHDR>
 
TRECWebCollection parses each HTTP header as a Document property. In addition, the URL, IP address, date and length are parsed from the DOCHDR tag. In particular, the following Document properties are set, depending on the format of the DOCHDR tag:

Supported TREC Collections:
There are some variations in the format of the DOCHDR tags in the various TREC web corpora, in particular the first line of the tag. The following corpora are supported.

For indexing the more recent TREC ClueWeb09 corpus, see WARC018Collection.
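
The DOCHDR handling described above can be illustrated with a small standalone sketch. This is not Terrier's implementation: the class name DocHdrSketch, the property keys ("url", "ip", "crawldate", "contenttype", "length", "status") and the choice to lower-case header names are all assumptions made for illustration. It assumes the WT2G layout shown earlier, where the first line holds whitespace-separated fields and the remaining lines are an HTTP status line followed by "Name: value" headers.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative sketch only; names and property keys are hypothetical. */
class DocHdrSketch {

    // Assumed keys for the fields of the first DOCHDR line (WT2G layout).
    static final String[] FIRST_LINE_KEYS =
        {"url", "ip", "crawldate", "contenttype", "length"};

    /** Parses the text found between the DOCHDR open and close tags. */
    static Map<String, String> parse(String docHdrBody) {
        Map<String, String> props = new LinkedHashMap<>();
        boolean firstLineSeen = false;
        for (String rawLine : docHdrBody.split("\\r?\\n")) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            if (!firstLineSeen) {
                // First line: store as many fields as are present, since
                // corpora differ in how many fields this line carries.
                String[] fields = line.split("\\s+");
                for (int i = 0; i < fields.length && i < FIRST_LINE_KEYS.length; i++) {
                    props.put(FIRST_LINE_KEYS[i], fields[i]);
                }
                firstLineSeen = true;
            } else if (line.startsWith("HTTP/")) {
                // HTTP status line, e.g. "HTTP/1.0 200 OK"
                props.put("status", line);
            } else {
                int colon = line.indexOf(':');
                if (colon > 0) {
                    // "Name: value" header; split at the first colon only,
                    // so values containing colons (dates) stay intact.
                    String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
                    props.put(name, line.substring(colon + 1).trim());
                }
            }
        }
        return props;
    }

    public static void main(String[] args) {
        String body = String.join("\n",
            "http://www.city.geneva.ny.us:80/index.htm 192.108.245.124 19970121041510 text/html 2407",
            "HTTP/1.0 200 OK",
            "Date: Tue, 21 Jan 1997 04:14:08 GMT",
            "Server: Apache/1.1.1",
            "Content-type: text/html",
            "Content-length: 2236",
            "Last-modified: Fri, 18 Oct 1996 17:33:56 GMT");
        // Print each extracted property as "key = value".
        parse(body).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

In an actual indexing run the equivalent values would be read from the Document objects returned by the collection rather than from a map built by hand; the property names Terrier uses may differ from the hypothetical keys above.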

Since:
3.5
Author:
Craig Macdonald

Field Summary
 
Fields inherited from class org.terrier.indexing.TRECCollection
br, currentFilename, desiredEncoding, DocIDBlacklist, docnotag, DocProperties, documentClass, documentCounter, documentsInThisFile, end_docnoTag, end_docnoTagLength, end_docTag, end_docTagLength, endOfCollection, endPropertyTags, FileNumber, FilesToProcess, ignoreProperties, logger, propertyTagLengths, SkipFile, start_docnoTag, start_docnoTagLength, start_docTag, start_docTagLength, startPropertyTags, tags_CaseSensitive, ThisDocID, tokeniser
 
Constructor Summary
TRECWebCollection()
          Constructs an instance of the TRECWebCollection.
TRECWebCollection(java.io.InputStream input)
          Constructs an instance of the TRECWebCollection, given an InputStream.
TRECWebCollection(java.lang.String CollectionSpecFilename, java.lang.String TagSet, java.lang.String BlacklistSpecFilename, java.lang.String ignored)
          Constructs an instance of the TRECWebCollection.
 
Method Summary
protected  void afterPropertyTags()
           
 
Methods inherited from class org.terrier.indexing.TRECCollection
close, endOfCollection, getDocid, getDocument, getDocument, getDocumentString, getTag, hasNext, loadDocumentClass, next, nextDocument, openNextFile, readCollectionSpec, readDocumentBlacklist, remove, reset, setTags
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

TRECWebCollection

public TRECWebCollection()
Constructs an instance of the TRECWebCollection.


TRECWebCollection

public TRECWebCollection(java.io.InputStream input)
Constructs an instance of the TRECWebCollection, given an InputStream.

Parameters:
input -

TRECWebCollection

public TRECWebCollection(java.lang.String CollectionSpecFilename,
                         java.lang.String TagSet,
                         java.lang.String BlacklistSpecFilename,
                         java.lang.String ignored)
Constructs an instance of the TRECWebCollection.

Parameters:
CollectionSpecFilename -
TagSet -
BlacklistSpecFilename -
ignored -

Method Detail

afterPropertyTags

protected void afterPropertyTags()
                          throws java.io.IOException
Overrides:
afterPropertyTags in class TRECCollection
Throws:
java.io.IOException


Terrier 3.5. Copyright © 2004-2011 University of Glasgow