org.terrier.indexing
Class TRECWebCollection
java.lang.Object
org.terrier.indexing.TRECCollection
org.terrier.indexing.TRECWebCollection
- All Implemented Interfaces:
- java.io.Closeable, Collection, DocumentExtractor
public class TRECWebCollection
- extends TRECCollection
A version of TRECCollection that can parse the
standard-format DOCHDR tags found in TREC Web corpora.
A standard format DOCHDR tag from WT2G is shown below.
<DOCHDR>
http://www.city.geneva.ny.us:80/index.htm 192.108.245.124 19970121041510 text/html 2407
HTTP/1.0 200 OK
Date: Tue, 21 Jan 1997 04:14:08 GMT
Server: Apache/1.1.1
Content-type: text/html
Content-length: 2236
Last-modified: Fri, 18 Oct 1996 17:33:56 GMT
</DOCHDR>
TRECWebCollection parses each HTTP header as a Document property.
In addition, the URL, IP address, date and length are parsed
from the DOCHDR tags. In particular, the following Document properties
are set, depending on the format of the DOCHDR tag:
- url (all corpora)
- ip (WT2G, WT10G only)
- docbytelength (WT2G, WT10G, Blogs06, Blogs08 only)
- contenttype (WT2G, WT10G only, but usually identified in the HTTP headers)
- crawldate (WT2G, WT10G only)
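The HTTP header handling described above can be illustrated with a small standalone sketch. It is not Terrier's implementation: the class name DocHdrHeaders, the method parseHeaders, and the choice to lower-case header names are assumptions made for illustration. The input is assumed to be the lines of a DOCHDR tag after its first line.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative sketch only; names and behaviour are assumptions, not Terrier code. */
class DocHdrHeaders {

    /**
     * Turns "Name: value" lines into properties.
     * Expects the DOCHDR lines that follow the first (URL) line.
     */
    static Map<String, String> parseHeaders(String[] lines) {
        Map<String, String> props = new LinkedHashMap<>();
        for (String line : lines) {
            int colon = line.indexOf(':');
            if (colon <= 0) {
                // No header name, e.g. the status line "HTTP/1.0 200 OK": skip it.
                continue;
            }
            // Assumption: header names are normalised to lower case.
            String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            props.put(name, value);
        }
        return props;
    }

    public static void main(String[] args) {
        String[] lines = {
            "HTTP/1.0 200 OK",
            "Date: Tue, 21 Jan 1997 04:14:08 GMT",
            "Server: Apache/1.1.1",
            "Content-type: text/html",
            "Content-length: 2236",
            "Last-modified: Fri, 18 Oct 1996 17:33:56 GMT"
        };
        parseHeaders(lines).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Splitting on the first colon only keeps values that themselves contain colons, such as the time in a Date header, intact.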
Supported TREC Collections:
There are some variations in the format of the DOCHDR tags in the various
TREC web corpora, in particular the first line of the tag. The following corpora are supported.
- WT2G, WT10G: URL IP Crawldate content-type docbytelength
- Blogs06, Blogs08: URL invalidIP invalidCrawldate docbytelength
- GOV, GOV2, W3C, CERC: URL
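As a rough illustration of the first-line formats listed above, the following standalone sketch maps a DOCHDR first line to the property names documented for this class. It is not Terrier's implementation: the class name DocHdrFirstLine, the method parse, and the idea of telling the formats apart by counting whitespace-separated fields are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only; names and logic are assumptions, not Terrier code. */
class DocHdrFirstLine {

    /** Maps the first line of a DOCHDR tag to Document-style properties. */
    static Map<String, String> parse(String firstLine) {
        Map<String, String> props = new LinkedHashMap<>();
        String[] parts = firstLine.trim().split("\\s+");
        if (parts[0].isEmpty()) {
            return props; // blank line: nothing to extract
        }
        // Every supported corpus puts the URL first.
        props.put("url", parts[0]);
        if (parts.length == 5) {
            // WT2G, WT10G: URL IP Crawldate content-type docbytelength
            props.put("ip", parts[1]);
            props.put("crawldate", parts[2]);
            props.put("contenttype", parts[3]);
            props.put("docbytelength", parts[4]);
        } else if (parts.length == 4) {
            // Blogs06, Blogs08: the IP and crawl date fields are invalid,
            // so only the byte length is kept.
            props.put("docbytelength", parts[3]);
        }
        // GOV, GOV2, W3C, CERC: only the URL is present.
        return props;
    }

    public static void main(String[] args) {
        String wt2g = "http://www.city.geneva.ny.us:80/index.htm "
                + "192.108.245.124 19970121041510 text/html 2407";
        parse(wt2g).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Run against the WT2G header shown earlier, the sketch yields all five properties; a line holding only a URL yields just url.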
For indexing the more recent TREC ClueWeb09 corpus, see WARC018Collection.
- Since:
- 3.5
- Author:
- Craig Macdonald
Fields inherited from class org.terrier.indexing.TRECCollection
br, currentFilename, desiredEncoding, DocIDBlacklist, docnotag, DocProperties, documentClass, documentCounter, documentsInThisFile, end_docnoTag, end_docnoTagLength, end_docTag, end_docTagLength, endOfCollection, endPropertyTags, FileNumber, FilesToProcess, ignoreProperties, logger, propertyTagLengths, SkipFile, start_docnoTag, start_docnoTagLength, start_docTag, start_docTagLength, startPropertyTags, tags_CaseSensitive, ThisDocID, tokeniser
Constructor Summary
TRECWebCollection()
Constructs an instance of the TRECWebCollection.
TRECWebCollection(java.io.InputStream input)
Constructs an instance of the TRECWebCollection, given an InputStream.
TRECWebCollection(java.lang.String CollectionSpecFilename,
java.lang.String TagSet,
java.lang.String BlacklistSpecFilename,
java.lang.String ignored)
Constructs an instance of the TRECWebCollection.
Methods inherited from class org.terrier.indexing.TRECCollection
close, endOfCollection, getDocid, getDocument, getDocument, getDocumentString, getTag, hasNext, loadDocumentClass, next, nextDocument, openNextFile, readCollectionSpec, readDocumentBlacklist, remove, reset, setTags
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
TRECWebCollection
public TRECWebCollection()
- Constructs an instance of the TRECWebCollection.
TRECWebCollection
public TRECWebCollection(java.io.InputStream input)
- Constructs an instance of the TRECWebCollection, given an InputStream.
- Parameters:
- input
TRECWebCollection
public TRECWebCollection(java.lang.String CollectionSpecFilename,
java.lang.String TagSet,
java.lang.String BlacklistSpecFilename,
java.lang.String ignored)
- Constructs an instance of the TRECWebCollection.
- Parameters:
- CollectionSpecFilename
- TagSet
- BlacklistSpecFilename
- ignored
afterPropertyTags
protected void afterPropertyTags()
throws java.io.IOException
- Overrides:
afterPropertyTags in class TRECCollection
- Throws:
java.io.IOException
Terrier 3.5. Copyright © 2004-2011 University of Glasgow