  1. Terrier Core
  2. TR-171

Indexing support for TREC Tweets11 corpus

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.5
    • Fix Version/s: 4.0
    • Component/s: .indexing
    • Labels:
      None

      Description

      We should provide indexing support for the new TREC Tweets11 corpus, which is currently used in the Microblog track. The Tweets11 corpus can be crawled either as JSON or as HTML sequence files. The HTML sequence files can subsequently be scraped for content and written out in JSON format by the crawler. We should therefore provide the means to index a tweet collection stored in this common JSON format. The support should be general enough to also handle tweets in the Twitter Gardenhose/Firehose JSON format.
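
      To illustrate the kind of input such an indexer would consume, here is a minimal sketch in Python rather than Terrier code. It assumes one JSON object per line, with the tweet identifier in "id_str" (or "id") and the body in "text". The function name iter_tweet_docs and the output keys "docno" and "text" are hypothetical, chosen only for this example; the actual plugin may read the data differently.

```python
import json
from typing import Dict, Iterable, Iterator


def iter_tweet_docs(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one {docno, text} record per well-formed tweet line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed records
        if not isinstance(tweet, dict):
            continue  # a tweet must be a JSON object
        # Assumption: the id lives in "id_str" (or "id"), the body in "text".
        docno = tweet.get("id_str") or tweet.get("id")
        text = tweet.get("text")
        if docno is None or text is None:
            continue  # e.g. stream notices that carry no tweet body
        yield {"docno": str(docno), "text": text}


# Hypothetical sample input: one valid tweet, one non-tweet notice, one bad line.
sample = [
    '{"id_str": "1", "text": "hello world"}',
    '{"delete": {"status": {"id_str": "2"}}}',
    "not json",
]
docs = list(iter_tweet_docs(sample))
```

      Skipping records that lack an identifier or a body keeps the sketch tolerant of stream-style input, where not every line is a tweet.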

        Attachments

          Activity

          richardm Richard McCreadie added a comment - edited

          I have provided a plugin for Terrier 3.5 which adds support for indexing a JSON format tweet collection. Details on how to use this plugin can be found on the following wiki page: http://ir.dcs.gla.ac.uk/wiki/Terrier/Tweets11. If you have any comments or suggestions for improving the plugin, please post them in this JIRA issue.

          richardm Richard McCreadie added a comment -

          We have been indexing Tweets for some time. Closing this issue.


            People

            • Assignee:
              richardm Richard McCreadie
              Reporter:
              craigm Craig Macdonald
            • Watchers:
              0

              Dates

              • Created:
                Updated:
                Resolved: