public class TRECWebCollection extends TRECCollection
<DOCHDR>
http://www.city.geneva.ny.us:80/index.htm 192.108.245.124 19970121041510 text/html 2407
HTTP/1.0 200 OK
Date: Tue, 21 Jan 1997 04:14:08 GMT
Server: Apache/1.1.1
Content-type: text/html
Content-length: 2236
Last-modified: Fri, 18 Oct 1996 17:33:56 GMT
</DOCHDR>
TRECWebCollection parses each HTTP header as a Document property. In addition, the URL, IP address, date and length are parsed from the DOCHDR tags. In particular, the following Document properties are set, depending on the format of the DOCHDR tag:
Supported TREC Collections:
There are some variations in the format of the DOCHDR tags in the various
TREC web corpora, in particular the first line of the tag. The following corpora are supported.
See Also: WARC018Collection
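The DOCHDR handling described above can be illustrated with a small standalone sketch. It is not Terrier's implementation and does not use the Terrier API. It assumes the WT2G-style layout shown in the example: a first line of five whitespace-separated fields, followed by an HTTP status line and `Name: value` headers. The class name `DocHdrSketch` and the property keys `url`, `ip`, `crawldate`, `contenttype` and `length` are hypothetical names chosen for illustration; the keys that TRECWebCollection actually sets may differ.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative DOCHDR parser. A sketch only, not Terrier's code. */
final class DocHdrSketch {

    /** Hypothetical property keys for the five fields on the first DOCHDR line. */
    private static final String[] FIRST_LINE_KEYS =
        {"url", "ip", "crawldate", "contenttype", "length"};

    /** Parses one DOCHDR block into an ordered map of property name to value. */
    static Map<String, String> parse(String dochdr) {
        Map<String, String> props = new LinkedHashMap<>();
        String body = dochdr.replace("<DOCHDR>", "").replace("</DOCHDR>", "").trim();
        if (body.isEmpty()) {
            return props;
        }
        String[] lines = body.split("\\r?\\n");

        // First line: URL, IP address, crawl date, content type, length.
        String[] fields = lines[0].trim().split("\\s+");
        for (int i = 0; i < fields.length && i < FIRST_LINE_KEYS.length; i++) {
            props.put(FIRST_LINE_KEYS[i], fields[i]);
        }

        // Remaining lines: each "Name: value" header becomes a property,
        // keyed by the lower-cased header name. Lines without a colon,
        // such as the status line "HTTP/1.0 200 OK", are skipped.
        for (int i = 1; i < lines.length; i++) {
            String line = lines[i].trim();
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue;
            }
            String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            props.put(name, value);
        }
        return props;
    }

    public static void main(String[] args) {
        String hdr = "<DOCHDR>\n"
            + "http://www.city.geneva.ny.us:80/index.htm 192.108.245.124 19970121041510 text/html 2407\n"
            + "HTTP/1.0 200 OK\n"
            + "Date: Tue, 21 Jan 1997 04:14:08 GMT\n"
            + "Server: Apache/1.1.1\n"
            + "</DOCHDR>";
        // Print every parsed property, one per line.
        parse(hdr).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Splitting each header at the first colon only keeps values such as the `Date` header intact, since they contain colons themselves.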
Fields inherited from class TRECCollection: br, DocIDBlacklist, docnotag, documentCounter, end_docnoTag, end_docnoTagLength, end_docTag, end_docTagLength, endPropertyTags, ignoreProperties, propertyTagLengths, start_docnoTag, start_docnoTagLength, start_docTag, start_docTagLength, startPropertyTags, tags_CaseSensitive, ThisDocID
Fields inherited from class MultiDocumentFileCollection: currentFilename, desiredEncoding, DocProperties, documentClass, documentsInThisFile, eoc, eof, FileNumber, FilesToProcess, forceUTF8, is, logger, SkipFile, tokeniser
Constructor and Description |
---|
TRECWebCollection(): Constructs an instance of the TRECWebCollection. |
TRECWebCollection(InputStream input): Constructs an instance of the TRECWebCollection, given an InputStream. |
TRECWebCollection(List<String> files, String TagSet, String BlacklistSpecFilename, String ignored) |
TRECWebCollection(String CollectionSpecFilename, String TagSet, String BlacklistSpecFilename, String ignored): Constructs an instance of the TRECWebCollection. |
Modifier and Type | Method and Description |
---|---|
protected void | afterPropertyTags() |
Methods inherited from class TRECCollection: getDocument, getTag, hasNext, next, nextDocument, openNewFile, readDocumentBlacklist, reset, setTags
Methods inherited from class MultiDocumentFileCollection: close, endOfCollection, extractCharset, loadDocumentClass, openNextFile
public TRECWebCollection()
public TRECWebCollection(InputStream input)
Parameters: input
public TRECWebCollection(String CollectionSpecFilename, String TagSet, String BlacklistSpecFilename, String ignored)
Parameters: CollectionSpecFilename, TagSet, BlacklistSpecFilename, ignored
protected void afterPropertyTags() throws IOException
Overrides: afterPropertyTags in class TRECCollection
Throws: IOException
Terrier Information Retrieval Platform 5.1. Copyright © 2004-2019, University of Glasgow