Package org.terrier.indexing
Class TwitterJSONDocument
- java.lang.Object
  - org.terrier.indexing.TwitterJSONDocument
- All Implemented Interfaces:
Document
public class TwitterJSONDocument extends java.lang.Object implements Document
This is a Terrier Document implementation of a Tweet stored in JSON format. It parses the fields of the Tweet from an input com.google.gson.JsonObject. This implementation supports both fields and meta-data.

Fields:
Each TwitterJSONDocument is considered to have four fields for searching:
- TWEET: the tokenised tweet text, processed by the tokeniser and subjected to stopword removal/stemming.
- NAME: the raw user name of the tweeter, broken on spaces.
- SNAME: the raw screen name of the tweeter, broken on spaces.
- LOC: the location of the tweet, processed by the Terrier EnglishTokeniser and subjected to stopword removal/stemming.

Meta-Data:
During parsing, the properties of each TwitterJSONDocument are decorated with tweet meta-data. The following keys are added to the document properties if and only if they exist in the JSON input. Note that unless you are using the TREC Twitter API crawler or the Gardenhose/Firehose stream, the majority of this data will be missing, as much of it is unavailable when scraping the HTML.

// Tweet data
docno
id
created_at
source
lang
text
truncated
retweet_count
contributors

// User data
user.screen_name
user.created_at
user.protected
user.lang
user.name
user.profile_image_url
user.friends_count
user.favourites_count
user.listed_count
user.statuses_count
user.followers_count
user.description
user.location
user.id
user.time_zone
user.utc_offset

// if tweet is reply
in_reply_to_screen_name
in_reply_to_user_id
in_reply_to_status_id

// if place is known (a region, for example a city, defined by a polygon with GPS coordinates for points)
place.place_type
place.country_code
place.id
place.name
place.full_name
place.url
place.country
place.bounding_box.type (always polygon?)
place.bounding_box.coordinates.size
place.bounding_box.coordinates.[n].lat
place.bounding_box.coordinates.[n].lng

// if user coordinates are known
coordinates.type (always point?)
coordinates.lat
coordinates.lng

// if geo location of user is known
geo.type (always point?)
geo.lat
geo.lng

// if is retweet
All of the above, but with retweet. added to the front of each key.

- Since:
- 4.0
- Author:
- Richard McCreadie
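The dotted meta-data keys listed in the class description (for example user.screen_name, or the same keys prefixed with retweet.) suggest that nested JSON objects are flattened into a single property map. The following is a minimal sketch of that idea only, not the class's actual parsing code: it uses nested java.util.Map values as a stand-in for a parsed JSON object, the class name DottedKeySketch and its flatten methods are hypothetical, and array handling (the .[n]. keys) is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not part of org.terrier.indexing.
class DottedKeySketch {

    // Recursively copies scalar leaves of a nested map into 'out',
    // joining the key segments with '.' (e.g. "user" + "lang" -> "user.lang").
    static void flatten(String prefix, Map<String, Object> node, Map<String, String> out) {
        for (Map.Entry<String, Object> e : node.entrySet()) {
            String key = prefix.isEmpty() ? e.getKey() : prefix + "." + e.getKey();
            Object value = e.getValue();
            if (value == null) {
                continue; // absent values are simply not recorded
            }
            if (value instanceof Map) {
                @SuppressWarnings("unchecked")
                Map<String, Object> child = (Map<String, Object>) value;
                flatten(key, child, out);
            } else {
                out.put(key, String.valueOf(value));
            }
        }
    }

    // Convenience overload: flatten a whole tweet with no key prefix.
    static Map<String, String> flatten(Map<String, Object> tweet) {
        Map<String, String> out = new LinkedHashMap<>();
        flatten("", tweet, out);
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> user = new LinkedHashMap<>();
        user.put("screen_name", "example_user");
        user.put("followers_count", 42);

        Map<String, Object> tweet = new LinkedHashMap<>();
        tweet.put("id", "123");
        tweet.put("lang", "en");
        tweet.put("user", user);

        // Prints the flattened key/value pairs, including user.screen_name.
        System.out.println(flatten(tweet));
    }
}
```

Under the same assumption, a retweet could be handled by calling the three-argument flatten with the prefix "retweet", which yields keys such as retweet.user.screen_name.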
-
-
Constructor Summary
Constructors
Constructor, Description
TwitterJSONDocument(com.google.gson.JsonObject json)
TwitterJSONDocument(java.lang.String JSONTweet)
TwitterJSONDocument(java.lang.String JSONTweet, boolean saveAll)
-
Method Summary
All Methods, Instance Methods, Concrete Methods
Modifier and Type, Method, Description
void addProperty(java.lang.String propertyName, java.lang.String propertyValue)
    Add a specific property to the properties for this document.
protected int byteLength(java.lang.String t)
void doParsing(com.google.gson.JsonObject json)
boolean endOfDocument()
    Returns true when the end of the document has been reached, and there are no other terms to be retrieved from it.
java.util.Map<java.lang.String,java.lang.String> getAllProperties()
    Returns the underlying map of all the properties defined by this Document.
java.util.Set<java.lang.String> getFields()
    Returns a list of the fields the current term appears in.
java.lang.String getJsonText()
java.lang.String getNextTerm()
    Gets the next term of the document.
java.lang.String getProperty(java.lang.String name)
    Allows access to a named property of the Document.
java.io.Reader getReader()
    Returns a Reader object so client code can tokenise the document or deal with the document itself.
void setJsonText(java.lang.String jsonText)
-
-
-
Method Detail
-
doParsing
public void doParsing(com.google.gson.JsonObject json)
-
getNextTerm
public java.lang.String getNextTerm()
Description copied from interface: Document
Gets the next term of the document. NB: A null string returned from getNextTerm() should be ignored; it does not signify that there are no more terms. Use endOfDocument() to check for that.
- Specified by:
getNextTerm in interface Document
- Returns:
- The next term of the document, as a String. Null returns should be ignored.
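The contract above (skip null terms, stop only when endOfDocument() returns true) can be sketched as a simple loop. This is an illustrative sketch, not Terrier code: TermSource is a hypothetical stand-in that mirrors only the getNextTerm() and endOfDocument() signatures from this page, and fromArray is a hypothetical helper so the example runs without Terrier on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of org.terrier.indexing.
class TermLoopSketch {

    // Stand-in exposing only the two methods described above.
    interface TermSource {
        String getNextTerm();
        boolean endOfDocument();
    }

    // Collects every non-null term. A null term is skipped rather than
    // treated as the end of the document; only endOfDocument() ends the loop.
    static List<String> collectTerms(TermSource doc) {
        List<String> terms = new ArrayList<>();
        while (!doc.endOfDocument()) {
            String term = doc.getNextTerm();
            if (term == null) {
                continue;
            }
            terms.add(term);
        }
        return terms;
    }

    // Toy source backed by an array; null entries model terms that a
    // term pipeline has removed (for example stopwords).
    static TermSource fromArray(String[] tokens) {
        return new TermSource() {
            private int pos = 0;

            public String getNextTerm() {
                return pos < tokens.length ? tokens[pos++] : null;
            }

            public boolean endOfDocument() {
                return pos >= tokens.length;
            }
        };
    }

    public static void main(String[] args) {
        TermSource doc = fromArray(new String[] {"terrier", null, "index", "tweet"});
        System.out.println(collectTerms(doc)); // prints [terrier, index, tweet]
    }
}
```

With a real TwitterJSONDocument the same loop shape would apply to the document object directly, since it declares both methods.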
-
getFields
public java.util.Set<java.lang.String> getFields()
Description copied from interface: Document
Returns a list of the fields the current term appears in.
-
endOfDocument
public boolean endOfDocument()
Description copied from interface: Document
Returns true when the end of the document has been reached, and there are no other terms to be retrieved from it.
- Specified by:
endOfDocument in interface Document
- Returns:
- true if there are no more terms in the document, otherwise false.
-
getReader
public java.io.Reader getReader()
Description copied from interface: Document
Returns a Reader object so client code can tokenise the document or deal with the document itself. Examples might be extracting URLs or performing language detection.
-
getProperty
public java.lang.String getProperty(java.lang.String name)
Description copied from interface: Document
Allows access to a named property of the Document. Examples might be URL, filename, etc.
- Specified by:
getProperty in interface Document
- Parameters:
name - Name of the property. It is suggested, but not required, that this name should not be case insensitive.
-
getAllProperties
public java.util.Map<java.lang.String,java.lang.String> getAllProperties()
Description copied from interface: Document
Returns the underlying map of all the properties defined by this Document.
- Specified by:
getAllProperties in interface Document
-
addProperty
public void addProperty(java.lang.String propertyName, java.lang.String propertyValue)
Add a specific property to the properties for this document. This method also has a second function: it will attempt to trim the value (for example the tweet text) if it exceeds the meta index length configured for the key.
- Parameters:
propertyName
propertyValue
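The trimming behaviour described for addProperty is not specified in detail on this page. The following is a minimal sketch of one way a value could be cut down to a byte budget, assuming UTF-8 encoding and a caller-supplied maximum length; the class name TrimToBytesSketch and both of its methods are hypothetical and are not claimed to match what addProperty or byteLength actually do.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not part of org.terrier.indexing.
class TrimToBytesSketch {

    // Number of bytes the string occupies when encoded as UTF-8 (an assumption;
    // the encoding used by the real meta index is not stated here).
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Drops whole code points from the end until the encoded value fits
    // within maxBytes, so a multi-byte character is never split.
    static String trimToBytes(String value, int maxBytes) {
        int end = value.length();
        while (end > 0 && utf8Length(value.substring(0, end)) > maxBytes) {
            end = value.offsetByCodePoints(end, -1);
        }
        return value.substring(0, end);
    }

    public static void main(String[] args) {
        String text = "h\u00e9llo"; // 'é' takes two bytes in UTF-8
        System.out.println(utf8Length(text));
        System.out.println(trimToBytes(text, 3));
    }
}
```

Measuring in bytes rather than characters matters here because a length limit expressed in bytes can be exceeded by a string whose character count is within the limit.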
-
byteLength
protected int byteLength(java.lang.String t)
-
getJsonText
public java.lang.String getJsonText()
-
setJsonText
public void setJsonText(java.lang.String jsonText)
-
-