Initial version of a parser for ClueWeb09. Currently, the class is named as a WARC 0.18 parser. I'm thinking about adding a sub-class of it that does the ClueWeb09-specific bits.
In essence, the changes from a standard WARC v 0.18 parser are:
- Subtract 49 bytes from content-length for a warcinfo message, and 16 bytes for an HTTP message (e.g. a document).
- Put the warc-trec-id header value in the document properties object as docno.
- Does not support file splitting or record-level compression.
- Furthermore, I may force the charset to UTF-8 for English documents: my reading of http://boston.lti.cs.cmu.edu/Data/clueweb09/dataset.html#encodings leaves this ambiguous.
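For illustration only, the first two adjustments in the list above could be sketched roughly as follows. This is not the code in this change: the function names, the dictionary layout, and the assumption that document records carry a WARC-Type of "response" are all hypothetical. Only the 49 and 16 byte corrections and the warc-trec-id to docno mapping come from the notes above.

```python
# Hypothetical sketch of the ClueWeb09-specific adjustments described above.
# Byte corrections are taken from the notes; mapping "http message" to the
# WARC-Type "response" is an assumption.
CONTENT_LENGTH_CORRECTION = {
    "warcinfo": 49,  # bytes to subtract for a warcinfo message
    "response": 16,  # bytes to subtract for an HTTP message (a document)
}


def adjusted_content_length(warc_type, declared_length):
    """Return the declared content-length minus the per-type correction."""
    correction = CONTENT_LENGTH_CORRECTION.get(warc_type.lower(), 0)
    return declared_length - correction


def document_properties(headers):
    """Copy the warc-trec-id header (matched case-insensitively) to docno."""
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    properties = {}
    if "warc-trec-id" in normalized:
        properties["docno"] = normalized["warc-trec-id"]
    return properties


if __name__ == "__main__":
    # Illustrative values only.
    print(adjusted_content_length("response", 1016))
    print(document_properties({"WARC-TREC-ID": "clueweb09-en0000-00-00000"}))
```

Record types without an entry in the table are left unchanged here, which is a design guess rather than something the notes specify.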
This successfully indexes the sample collection.