Class TwitterJSONDocument

  • All Implemented Interfaces:
    Document

    public class TwitterJSONDocument
    extends java.lang.Object
    implements Document
    This is a Terrier Document implementation of a Tweet stored in JSON format. It parses the fields of the Tweet out of an input com.google.gson.JsonObject. This document implementation supports both fields and meta-data.

    Fields: Each TwitterJSONDocument is considered to have four fields for searching:
      TWEET - the tokenised tweet text, which has been processed by the tokeniser and subjected to stopword removal/stemming.
      NAME - the raw user name of the tweeter, broken on spaces.
      SNAME - the raw screen name of the tweeter, broken on spaces.
      LOC - the location of the tweet, processed by the Terrier EnglishTokeniser and subjected to stopword removal/stemming.

    Meta-Data: During parsing, the properties of each TwitterJSONDocument are decorated with tweet meta-data. The following are added to the document properties if and only if they exist in the JSON input. Note that unless you are using the TREC Twitter API crawler or the Gardenhose/Firehose stream, the majority of this data will be missing, as much of it is unavailable when scraping the HTML.

      // Tweet data
      docno
      id
      created_at
      source
      lang
      text
      truncated
      retweet_count
      contributors

      // User data
      user.screen_name
      user.created_at
      user.protected
      user.lang
      user.name
      user.profile_image_url
      user.friends_count
      user.favourites_count
      user.listed_count
      user.statuses_count
      user.followers_count
      user.description
      user.location
      user.id
      user.time_zone
      user.utc_offset

      // if tweet is reply
      in_reply_to_screen_name
      in_reply_to_user_id
      in_reply_to_status_id

      // if place is known (like a region, for example a city, defined by a polygon with GPS coordinates for points)
      place.place_type
      place.country_code
      place.id
      place.name
      place.full_name
      place.url
      place.country
      place.bounding_box.type (always polygon?)
      place.bounding_box.coordinates.size
      place.bounding_box.coordinates.[n].lat
      place.bounding_box.coordinates.[n].lng

      // if user coordinates are known
      coordinates.type (always point?)
      coordinates.lat
      coordinates.lng

      // if geo location of user is known
      geo.type (always point?)
      geo.lat
      geo.lng

      // if is retweet
      All of the above, but with "retweet." added to the front
    Since:
    4.0
    Author:
    Richard McCreadie
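The dotted property names in the class description (for example user.screen_name, or the same keys prefixed with retweet.) mirror the nesting of the tweet JSON. The following is a minimal sketch of how such names could be derived from a nested structure. It is not Terrier's implementation: the class DottedKeySketch and its flatten methods are hypothetical, and plain java.util.Map objects stand in for the Gson JsonObject so the example needs no external library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Terrier.
class DottedKeySketch {

    // Copies every non-null leaf value of 'node' into 'out', building the
    // property name by joining the key path with '.'.
    static void flatten(String prefix, Map<String, Object> node, Map<String, String> out) {
        for (Map.Entry<String, Object> e : node.entrySet()) {
            String key = prefix.isEmpty() ? e.getKey() : prefix + "." + e.getKey();
            Object value = e.getValue();
            if (value instanceof Map) {
                @SuppressWarnings("unchecked")
                Map<String, Object> child = (Map<String, Object>) value;
                flatten(key, child, out);
            } else if (value != null) {
                // Only values that exist in the input become properties.
                out.put(key, String.valueOf(value));
            }
        }
    }

    // Convenience overload for a top-level tweet (no prefix).
    static Map<String, String> flatten(Map<String, Object> tweet) {
        Map<String, String> out = new LinkedHashMap<>();
        flatten("", tweet, out);
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> user = new LinkedHashMap<>();
        user.put("screen_name", "example_user");
        user.put("lang", "en");

        Map<String, Object> tweet = new LinkedHashMap<>();
        tweet.put("id", "12345");
        tweet.put("text", "hello world");
        tweet.put("user", user);

        // Keys come out as id, text, user.screen_name, user.lang.
        System.out.println(flatten(tweet));
    }
}
```

Under the same assumptions, passing "retweet" as the prefix to the three-argument flatten would yield keys such as retweet.user.screen_name. Indexed entries such as place.bounding_box.coordinates.[n].lat are not covered by this sketch.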
    • Method Summary

      Modifier and Type Method Description
      void addProperty​(java.lang.String propertyName, java.lang.String propertyValue)
      Add a specific property to the properties for this document.
      protected int byteLength​(java.lang.String t)  
      void doParsing​(com.google.gson.JsonObject json)  
      boolean endOfDocument()
      Returns true when the end of the document has been reached, and there are no other terms to be retrieved from it.
      java.util.Map<java.lang.String,​java.lang.String> getAllProperties()
      Returns the underlying map of all the properties defined by this Document.
      java.util.Set<java.lang.String> getFields()
      Returns the set of fields the current term appears in.
      java.lang.String getJsonText()  
      java.lang.String getNextTerm()
      Gets the next term of the document.
      java.lang.String getProperty​(java.lang.String name)
      Allows access to a named property of the Document.
      java.io.Reader getReader()
      Returns a Reader object so client code can tokenise the document or deal with the document itself.
      void setJsonText​(java.lang.String jsonText)  
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • TwitterJSONDocument

        public TwitterJSONDocument​(java.lang.String JSONTweet)
      • TwitterJSONDocument

        public TwitterJSONDocument​(java.lang.String JSONTweet,
                                   boolean saveAll)
      • TwitterJSONDocument

        public TwitterJSONDocument​(com.google.gson.JsonObject json)
    • Method Detail

      • doParsing

        public void doParsing​(com.google.gson.JsonObject json)
      • getNextTerm

        public java.lang.String getNextTerm()
        Description copied from interface: Document
        Gets the next term of the document. NB: Null strings returned from getNextTerm() should be ignored. They do not signify that there are no more terms; use endOfDocument() to check for that.
        Specified by:
        getNextTerm in interface Document
        Returns:
        String the next term of the document. Null returns should be ignored.
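The contract above (skip null terms, and rely on endOfDocument() alone to detect the end) can be sketched as the loop below. To keep the example self-contained, TermSource is a hypothetical stand-in declaring only the two Document methods involved, and fromList is a toy source invented for illustration; neither is part of Terrier.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class TermLoopSketch {

    // Hypothetical stand-in for the two Document methods used in the loop.
    interface TermSource {
        String getNextTerm();
        boolean endOfDocument();
    }

    // Toy source backed by a list; null entries mimic terms that were
    // dropped during processing (for example by stopword removal).
    static TermSource fromList(List<String> terms) {
        Iterator<String> it = terms.iterator();
        return new TermSource() {
            public String getNextTerm() { return it.next(); }
            public boolean endOfDocument() { return !it.hasNext(); }
        };
    }

    // Collects all non-null terms, using endOfDocument() as the only stop condition.
    static List<String> collectTerms(TermSource doc) {
        List<String> out = new ArrayList<>();
        while (!doc.endOfDocument()) {
            String term = doc.getNextTerm();
            if (term == null) {
                continue; // a null term is skipped, not treated as the end
            }
            out.add(term);
        }
        return out;
    }

    public static void main(String[] args) {
        TermSource doc = fromList(Arrays.asList("terrier", null, "tweet"));
        System.out.println(collectTerms(doc));
    }
}
```

With a real Document the same loop shape applies: the null check guards each term, and the while condition decides when to stop.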
      • getFields

        public java.util.Set<java.lang.String> getFields()
        Description copied from interface: Document
        Returns the set of fields the current term appears in.
        Specified by:
        getFields in interface Document
        Returns:
        Set a set of the fields that the current term appears in.
      • endOfDocument

        public boolean endOfDocument()
        Description copied from interface: Document
        Returns true when the end of the document has been reached, and there are no other terms to be retrieved from it.
        Specified by:
        endOfDocument in interface Document
        Returns:
        boolean true if there are no more terms in the document, otherwise it returns false.
      • getReader

        public java.io.Reader getReader()
        Description copied from interface: Document
        Returns a Reader object so client code can tokenise the document or deal with the document itself. Examples might be extracting URLs, language detection.
        Specified by:
        getReader in interface Document
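As an illustration of the URL-extraction use case mentioned above, the sketch below drains a java.io.Reader and collects URL-like substrings. The class ReaderUrlSketch, the method extractUrls, and the deliberately simple regular expression are assumptions made for this example; a StringReader stands in for the Reader that getReader() would return.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Terrier.
class ReaderUrlSketch {

    // Simplistic pattern: "http://" or "https://" followed by non-whitespace.
    private static final Pattern URL = Pattern.compile("https?://\\S+");

    // Reads the whole Reader into memory, then returns every match in order.
    static List<String> extractUrls(Reader reader) {
        StringBuilder text = new StringBuilder();
        char[] buf = new char[1024];
        try {
            int n;
            while ((n = reader.read(buf)) != -1) {
                text.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        Reader reader = new StringReader("indexing tweets with http://terrier.org today");
        System.out.println(extractUrls(reader));
    }
}
```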
      • getProperty

        public java.lang.String getProperty​(java.lang.String name)
        Description copied from interface: Document
        Allows access to a named property of the Document. Examples might be URL, filename etc.
        Specified by:
        getProperty in interface Document
        Parameters:
        name - Name of the property. It is suggested, but not required, that this name should not be case insensitive.
      • getAllProperties

        public java.util.Map<java.lang.String,​java.lang.String> getAllProperties()
        Description copied from interface: Document
        Returns the underlying map of all the properties defined by this Document.
        Specified by:
        getAllProperties in interface Document
      • addProperty

        public void addProperty​(java.lang.String propertyName,
                                java.lang.String propertyValue)
        Add a specific property to the properties for this document. This method has a second function, in that it will attempt to trim the tweet if it exceeds the meta index length for the key.
        Parameters:
        propertyName - the name (key) of the property to add.
        propertyValue - the value to store for that property.
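The trimming behaviour described for addProperty, together with the byteLength(String) helper, suggests a byte-budget check on the stored value. The sketch below shows one way such trimming could work. It is not the Terrier implementation: TrimSketch and trimToBytes are hypothetical, the byte limit is passed in directly rather than looked up from the meta index configuration, and UTF-8 is an assumed encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Terrier.
class TrimSketch {

    // Byte length of a string, assuming UTF-8 (the real method's charset is not documented here).
    static int byteLength(String t) {
        return t.getBytes(StandardCharsets.UTF_8).length;
    }

    // Drops trailing code points until the encoded value fits within maxBytes.
    static String trimToBytes(String value, int maxBytes) {
        int end = value.length();
        while (end > 0 && byteLength(value.substring(0, end)) > maxBytes) {
            // Step back one whole code point so a surrogate pair is never split.
            end = value.offsetByCodePoints(end, -1);
        }
        return value.substring(0, end);
    }

    public static void main(String[] args) {
        String value = "h\u00e9llo"; // 5 characters, 6 bytes in UTF-8
        System.out.println(byteLength(value));
        System.out.println(trimToBytes(value, 3));
    }
}
```

Counting bytes rather than characters matters for tweets because non-ASCII characters and emoji occupy more than one byte each, so a value can exceed a byte limit while appearing short.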
      • byteLength

        protected int byteLength​(java.lang.String t)
      • getJsonText

        public java.lang.String getJsonText()
      • setJsonText

        public void setJsonText​(java.lang.String jsonText)