Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0
    • Fix Version/s: 3.0
    • Component/s: None
    • Labels:
      None

      Description

      We need to check for, or add, support for indexing WARC collections.

        Attachments

          Activity

          craigm Craig Macdonald added a comment -

          Initial version of a parser for ClueWeb09. The class is currently named WARC0.18 parser. I'm thinking about adding a sub-class of it to handle the ClueWeb09-specific bits.

          In essence, the changes from a standard WARC v 0.18 parser are:

          • Subtracts 49 bytes from the content-length of a warcinfo message, and 16 bytes from that of an HTTP message (e.g. a document).
          • Puts warc-trec-id in the document properties object as docno.
          • Does not support file splitting or record-level compression.
          • Furthermore, I may force the charset to UTF for English documents, as I find the description at http://boston.lti.cs.cmu.edu/Data/clueweb09/dataset.html#encodings ambiguous.

          This successfully indexes the sample collection.
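
For illustration only, the record handling described in this comment could be sketched roughly as below. This is not the committed parser: the function name `read_record`, the `adjust` flag, the mapping of "http message" to the WARC type `response`, and the sample docno used in the usage note are all assumptions. The 49 and 16 byte figures are the ones quoted in the list above.

```python
import io

# Byte adjustments quoted in the comment above, keyed by WARC-Type.
# Treating an "http message" as WARC-Type "response" is an assumption.
LENGTH_ADJUSTMENT = {"warcinfo": 49, "response": 16}


def read_record(stream, adjust=True):
    """Read one WARC record from a binary stream.

    Returns (properties, payload), or None at end of input.
    Hypothetical helper; no file splitting or record-level compression.
    """
    version = stream.readline()
    while version in (b"\r\n", b"\n"):  # skip blank lines between records
        version = stream.readline()
    if not version:
        return None
    if not version.startswith(b"WARC/"):
        raise ValueError("not a WARC record: %r" % version)

    # Header block: "Name: value" lines, terminated by an empty line.
    headers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            break
        name, _, value = line.decode("utf-8", errors="replace").partition(":")
        headers[name.strip().lower()] = value.strip()

    length = int(headers["content-length"])
    if adjust:
        # Shorten the declared length by the per-type amount, if any.
        length -= LENGTH_ADJUSTMENT.get(headers.get("warc-type", ""), 0)
    payload = stream.read(length)

    # Expose warc-trec-id to the indexer under the key "docno".
    properties = dict(headers)
    if "warc-trec-id" in headers:
        properties["docno"] = headers["warc-trec-id"]
    return properties, payload
```

A caller would loop on `read_record(stream)` until it returns `None`, taking the document identifier from `properties["docno"]`. Passing `adjust=False` reads exactly the declared content-length.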

          craigm Craig Macdonald added a comment -

          The ClueWeb09 sample is incorrect. Experiments using files from the main collection show them to be OK. I will update the patch to reflect this fact.

          craigm Craig Macdonald added a comment -

          Updated version: removes the offset length adjustments made for the ClueWeb09 sample, and fixes some character set regex parsing.

          craigm Craig Macdonald added a comment -

          Updated version. It can now deal with incorrect WARC files in which the URL header contains an extended character.
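
The comment does not say how the parser tolerates such headers. One possible approach, shown purely as a sketch, is to decode each header line as UTF-8 and fall back to ISO-8859-1 when that fails. The function name `decode_header_line` and the choice of fallback charset are assumptions, not the committed behaviour.

```python
def decode_header_line(raw: bytes) -> str:
    """Decode one WARC header line without failing on extended characters.

    Hypothetical helper: tries UTF-8 first, then falls back to ISO-8859-1,
    which maps every byte value to a character and so cannot raise.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")
```

With this fallback a stray byte in a URL yields a slightly wrong character rather than aborting the whole record, so the rest of the header block can still be parsed.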

          craigm Craig Macdonald added a comment -

          This was later committed to CORE.


            People

            • Assignee:
              craigm Craig Macdonald
              Reporter:
              craigm Craig Macdonald
            • Watchers:
              0

              Dates

              • Created:
                Updated:
                Resolved: