Package org.terrier.indexing
Class TRECWebCollection
- java.lang.Object
  - org.terrier.indexing.MultiDocumentFileCollection
    - org.terrier.indexing.TRECCollection
      - org.terrier.indexing.TRECWebCollection
- All Implemented Interfaces:
java.io.Closeable, java.lang.AutoCloseable, Collection
public class TRECWebCollection extends TRECCollection
Version of TRECCollection which can parse standard-form DOCHDR tags in TREC Web corpora. A standard-format DOCHDR tag from WT2G is shown below.

<DOCHDR>
http://www.city.geneva.ny.us:80/index.htm 192.108.245.124 19970121041510 text/html 2407
HTTP/1.0 200 OK
Date: Tue, 21 Jan 1997 04:14:08 GMT
Server: Apache/1.1.1
Content-type: text/html
Content-length: 2236
Last-modified: Fri, 18 Oct 1996 17:33:56 GMT
</DOCHDR>

TRECWebCollection parses each HTTP header as a Document property. In addition, the URL, IP address, date and length are parsed from the DOCHDR tags. In particular, the following Document properties are set, depending on the format of the DOCHDR tag:
- url (all corpora)
- ip (WT2G, WT10G only)
- docbytelength (WT2G, WT10G, Blogs06, Blogs08 only)
- contenttype (WT2G, WT10G only, but usually identified in the HTTP headers)
- crawldate (WT2G, WT10G only)
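The mapping from a WT2G-style DOCHDR tag to the properties listed above can be illustrated with a small standalone sketch. It does not use or reproduce Terrier's own code: the class name DocHdrSketch, the method parse, and the choice to store each HTTP header under its lower-cased name are assumptions made for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative sketch only; this is not Terrier's implementation. */
final class DocHdrSketch {

    /** Parses the text found between the DOCHDR open and close tags (WT2G-style layout). */
    static Map<String, String> parse(String docHdrBody) {
        Map<String, String> props = new LinkedHashMap<>();
        String[] lines = docHdrBody.trim().split("\\R");

        // First line, WT2G/WT10G layout: URL IP crawldate content-type docbytelength
        String[] first = lines[0].trim().split("\\s+");
        if (!first[0].isEmpty()) {
            props.put("url", first[0]);
        }
        if (first.length == 5) {
            props.put("ip", first[1]);
            props.put("crawldate", first[2]);
            props.put("contenttype", first[3]);
            props.put("docbytelength", first[4]);
        }

        // Remaining lines: "Name: value" HTTP headers. Lines without ": ",
        // such as the status line "HTTP/1.0 200 OK", are skipped.
        for (int i = 1; i < lines.length; i++) {
            String line = lines[i].trim();
            int sep = line.indexOf(": ");
            if (sep <= 0) {
                continue;
            }
            // Assumption: header names are stored lower-cased as property keys.
            String key = line.substring(0, sep).toLowerCase(Locale.ROOT);
            props.put(key, line.substring(sep + 2).trim());
        }
        return props;
    }

    public static void main(String[] args) {
        String body = String.join("\n",
            "http://www.city.geneva.ny.us:80/index.htm 192.108.245.124 19970121041510 text/html 2407",
            "HTTP/1.0 200 OK",
            "Date: Tue, 21 Jan 1997 04:14:08 GMT",
            "Server: Apache/1.1.1",
            "Content-type: text/html",
            "Content-length: 2236",
            "Last-modified: Fri, 18 Oct 1996 17:33:56 GMT");
        // Print every extracted property as "key = value".
        parse(body).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

In this sketch the first-line content type is stored as contenttype while the HTTP header of the same meaning is stored as content-type, so both values remain available.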
Supported TREC Collections:
There are some variations in the format of the DOCHDR tags in the various TREC web corpora, in particular the first line of the tag. The following corpora are supported:
- WT2G, WT10G: URL IP Crawldate content-type docbytelength
- Blogs06, Blogs08: URL invalidIP invalidCrawldate docbytelength
- GOV, GOV2, W3C, CERC: URL
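One way to picture the three first-line layouts listed above is to distinguish them by the number of whitespace-separated tokens. The sketch below is a hedged illustration, not Terrier's actual logic: the class DocHdrFirstLineSketch, the method parseFirstLine, the token-count rule and the Blogs-style sample line in main are all assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only; the token-count dispatch is an assumption. */
final class DocHdrFirstLineSketch {

    /** Maps the first line of a DOCHDR tag to Document-style properties. */
    static Map<String, String> parseFirstLine(String line) {
        Map<String, String> props = new LinkedHashMap<>();
        String[] t = line.trim().split("\\s+");
        if (t[0].isEmpty()) {
            return props; // blank line: nothing to extract
        }
        props.put("url", t[0]); // the URL leads the line in every layout
        switch (t.length) {
            case 5: // WT2G, WT10G: URL IP Crawldate content-type docbytelength
                props.put("ip", t[1]);
                props.put("crawldate", t[2]);
                props.put("contenttype", t[3]);
                props.put("docbytelength", t[4]);
                break;
            case 4: // Blogs06, Blogs08: IP and crawl date are invalid, keep only the length
                props.put("docbytelength", t[3]);
                break;
            default: // GOV, GOV2, W3C, CERC: URL only
                break;
        }
        return props;
    }

    public static void main(String[] args) {
        // WT2G line taken from the example DOCHDR tag in this page.
        System.out.println(parseFirstLine(
            "http://www.city.geneva.ny.us:80/index.htm 192.108.245.124 19970121041510 text/html 2407"));
        // Made-up Blogs-style line, for illustration only.
        System.out.println(parseFirstLine("http://example.org/post 0.0.0.0 00000000000000 1234"));
        // Made-up GOV-style line, for illustration only.
        System.out.println(parseFirstLine("http://example.gov/"));
    }
}
```

Discarding the invalid IP and crawl date in the four-token case matches the property list, which reports ip and crawldate for WT2G and WT10G only.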
WARC018Collection.
- Since:
- 3.5
- Author:
- Craig Macdonald
-
Field Summary
-
Fields inherited from class org.terrier.indexing.TRECCollection
br, DocIDBlacklist, docnotag, documentCounter, end_docnoTag, end_docnoTagLength, end_docTag, end_docTagLength, endPropertyTags, ignoreProperties, propertyTagLengths, propertyTags, start_docnoTag, start_docnoTagLength, start_docTag, start_docTagLength, startPropertyTags, tags_CaseSensitive, ThisDocID
-
Fields inherited from class org.terrier.indexing.MultiDocumentFileCollection
currentFilename, desiredEncoding, DocProperties, documentClass, documentsInThisFile, eoc, eof, FileNumber, FilesToProcess, forceUTF8, is, logger, SkipFile, tokeniser
-
Constructor Summary
Constructor and Description
- TRECWebCollection()
  Constructs an instance of the TRECWebCollection.
- TRECWebCollection(java.io.InputStream input)
  Constructs an instance of the TRECWebCollection, given an InputStream.
- TRECWebCollection(java.lang.String CollectionSpecFilename, java.lang.String TagSet, java.lang.String BlacklistSpecFilename, java.lang.String ignored)
  Constructs an instance of the TRECWebCollection.
- TRECWebCollection(java.util.List<java.lang.String> files)
- TRECWebCollection(java.util.List<java.lang.String> files, java.lang.String TagSet, java.lang.String BlacklistSpecFilename)
- TRECWebCollection(java.util.List<java.lang.String> files, java.lang.String TagSet, java.lang.String BlacklistSpecFilename, java.lang.String ignored)
-
Method Summary
Modifier and Type, Method
- protected void afterPropertyTags()
-
Methods inherited from class org.terrier.indexing.TRECCollection
getDocument, getTag, hasNext, next, nextDocument, openNewFile, readDocumentBlacklist, reset, setTags
-
Methods inherited from class org.terrier.indexing.MultiDocumentFileCollection
checkEncoding, close, endOfCollection, extractCharset, loadDocumentClass, openNextFile
-
Constructor Detail
-
TRECWebCollection
public TRECWebCollection()
Constructs an instance of the TRECWebCollection.
-
TRECWebCollection
public TRECWebCollection(java.io.InputStream input)
Constructs an instance of the TRECWebCollection, given an InputStream.
- Parameters:
- input
-
TRECWebCollection
public TRECWebCollection(java.lang.String CollectionSpecFilename, java.lang.String TagSet, java.lang.String BlacklistSpecFilename, java.lang.String ignored)
Constructs an instance of the TRECWebCollection.
- Parameters:
- CollectionSpecFilename
- TagSet
- BlacklistSpecFilename
- ignored
-
TRECWebCollection
public TRECWebCollection(java.util.List<java.lang.String> files, java.lang.String TagSet, java.lang.String BlacklistSpecFilename, java.lang.String ignored)
-
TRECWebCollection
public TRECWebCollection(java.util.List<java.lang.String> files, java.lang.String TagSet, java.lang.String BlacklistSpecFilename)
-
TRECWebCollection
public TRECWebCollection(java.util.List<java.lang.String> files)
-
Method Detail
-
afterPropertyTags
protected void afterPropertyTags() throws java.io.IOException
- Overrides:
afterPropertyTags in class TRECCollection
- Throws:
java.io.IOException