Greenville on Twitter
In this blog post, we use R to analyze Twitter data about topics of interest to Greenville, SC. We will describe obtaining, manipulating, and summarizing the data.
Twitter is a “microblogging” service where users can, usually publicly, share links, pictures, or short comments (up to 140 characters) on a timeline. The public timeline consists of all public tweets, but people can build their own private timelines to narrow content to just what they want to see. (They do this by “following” users.) Over the years, many companies, news organizations, and users have come to consider the social media site essential for sharing news and other information. (Or cat memes.) Twitter has some organizational tools such as replies/conversation threads, mentions (naming other users with the @ notation), and hashtags (naming a topic with the # notation). Twitter has encouraged the use of these tools by automatically making mentions and hashtags clickable links.
These organizational tools can make for some interesting analysis. For instance, a game show may encourage viewers to vote on a winner using hashtags. On their end, the producers filter for a particular hashtag (e.g. #votemyplayer) and count votes. This also makes Twitter data ripe for text mining (which Twitter itself uses to identify trending topics).
Obtaining the Twitter data
Twitter makes it possible for software to obtain Twitter comments without having to resort to “web-scraping” techniques (i.e. downloading the data as a web page and then parsing the HTML). Instead, you can go through an Application Programming Interface (API) to obtain the comments directly. If you’re interested, Twitter has a whole subdomain related to accessing their data, including documentation. There are a lot of technical details, but for the casual user probably the only ones of interest are the API key and rate limits. This post won’t fuss with rate limits, but more serious work may require some further understanding of these issues. However, you will need to create an API key. Follow these instructions, which are tailored for R users. The process essentially consists of creating a token at Twitter’s app web site and running an R function with the token. I set the variables consumer_secret, consumer_key, access_token, and access_secret in an R block by copying and pasting from the Twitter apps site; that block is not echoed in this blog post, for obvious reasons.
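For reference, the block looks roughly like the following, with placeholder strings standing in for the real values:

```r
# Placeholders only -- paste in your own values from the Twitter apps site
consumer_key    <- "YOUR_CONSUMER_KEY"
consumer_secret <- "YOUR_CONSUMER_SECRET"
access_token    <- "YOUR_ACCESS_TOKEN"
access_secret   <- "YOUR_ACCESS_SECRET"
```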
Fortunately, the twitteR package makes obtaining data from Twitter easy. It’s on CRAN, so grab it using install.packages (it will also install dependencies such as the bit64 and httr packages if you don’t have them already) before moving on.
We authenticate our R program to Twitter and then start by searching the public timeline for “Greenville”. Note that due to the changing nature of Twitter, your results will probably be different:
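A sketch of that step (the object names greenville_raw and greenville are my own choices; the credentials are the variables set earlier):

```r
library(twitteR)

# Authenticate with the credentials defined above
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

# Search the public timeline; searchTwitter() returns a list of status objects
greenville_raw <- searchTwitter("Greenville")

# Convert the list to a data frame for easier manipulation
greenville <- twListToDF(greenville_raw)
head(greenville$text)
```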
searchTwitter returns data as a list, which may or may not be desirable. By default, it returns the last 25 items matching the query you pass (this can be changed with the n= option to the function). I used twListToDF (part of the twitteR package) to convert the result to a data frame. The data frame contains a lot of useful information, such as the tweet text, whether it’s a reply and the tweet to which it’s a reply, the screen name, and a date stamp. Thus, Twitter provides a rich data source of information on topics, interactions, and reactions.
Analyzing the data
Retweets
The first thing to notice is that many of these tweets may be “retweets”, where a user posts the exact same tweet as a previous user to create a larger audience for the tweet. This data point may be interesting in its own right, but for now, because we are just analyzing the text, we will filter out retweets:
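One way to do that, assuming the greenville data frame from above (twListToDF supplies a logical isRetweet column; twitteR also has a strip_retweets() function that works on the raw list):

```r
library(dplyr)

# Keep only original tweets, dropping retweets
greenville_orig <- greenville %>%
  filter(!isRetweet)
```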
The thing to notice here is that there are several different Greenvilles, which makes analysis of the local area pretty hard. Many of the tweets could be about Greenville, NC or Greenville, SC. In this particular dataset, there was even a Greenville Road in California (where there was a car fire). Rather than play a filtering game, it may be better to apply some knowledge specific to the area. For instance, local tweets are often tagged with #yeahThatgreenville. So we will search again for the #yeahthatgreenville hashtag (and add a few more tweets as well). This time, we’ll keep retweets:
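Something along these lines, where two pulls of 100 tweets each give the 200 tweets referred to below (the exact queries and object names are my reconstruction, not the original code):

```r
library(dplyr)
library(twitteR)

# Two separate pulls of the hashtag, each converted to a data frame
gvl_1 <- twListToDF(searchTwitter("#yeahthatgreenville", n = 100))
gvl_2 <- twListToDF(searchTwitter("#yeahthatgreenville", n = 100))

# Stack the two data frames into one
gvl_tweets <- bind_rows(gvl_1, gvl_2)
```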
Here I do two separate queries and add them together using the bind_rows function from dplyr.
Who is tweeting
The first thing we can do is get a list of users who tweet under this hashtag, as well as their number of tweets:
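A plot roughly like the following (the gvl_tweets name and the theming are my own; the reorder() call is the trick discussed next):

```r
library(ggplot2)

# Tweets per screen name, with bars ordered from most to least prolific
ggplot(gvl_tweets,
       aes(x = reorder(screenName, screenName, function(x) -length(x)))) +
  geom_bar() +
  labs(x = "Screen name", y = "Number of tweets") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```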
So I snuck a trick into the above graph. In bar charts presenting counts, I usually prefer to order the bars by descending length. That way I can identify the most and least common screen names quickly. I accomplish this by using x = reorder(screenName, screenName, function(x) -length(x)) in the aes() function above. Now we can see that @GiovanniDodd was the most prolific tweeter in the last 200 tweets I accessed. Some of the prolific tweeters appear to be businesses, such as @CourtyardGreenville, or perhaps tourism accounts such as @Greenville_SC.
What users are saying
To analyze what users are saying about “#yeahthatgreenville”, we use the tidytext package. There are a number of packages that can be used to analyze text, and tm used to be a favorite, but tidytext fits within the tidy data framework. We prefer this framework because it works with data in a specific format and offers a number of powerful tools that each have a specific focus but interoperate well, much like the UNIX ideal. Here, tidytext lets us use dplyr and similar tools with the pipe operator, making the code easier to read and follow.
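A sketch of the tokenizing step (the tweet_words name reappears below; I am assuming the combined gvl_tweets data frame from earlier):

```r
library(dplyr)
library(tidytext)

# One row per word, keeping the tweet id so words can be traced back to tweets
tweet_words <- gvl_tweets %>%
  select(id, text) %>%
  unnest_tokens(word, text)
```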
I used the select function from dplyr to keep only the id and text fields. The unnest_tokens() function creates a long dataset with one row per word, replacing the text field; all the other fields remain unchanged. We can now easily create a bar chart of the words used the most:
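For example, something like this (a sketch, not the original plotting code):

```r
library(dplyr)
library(ggplot2)

# Frequency of every word across all tweets
tweet_words %>%
  count(word, sort = TRUE) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Word", y = "Count")
```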
This plot is very busy, so we plot, say, the top 20 words:
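For instance (again a sketch; top_n() keeps the 20 most frequent words):

```r
library(dplyr)
library(ggplot2)

# Restrict the plot to the 20 most frequent words
tweet_words %>%
  count(word, sort = TRUE) %>%
  top_n(20, n) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Word", y = "Count")
```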
Unfortunately, this is terribly unexciting. Of course “a”, “to”, “for”, and similar words are going to be at the top. In text mining, we keep a list of “stop words”, words so common that they are usually not worth including in an analysis. The tidytext package includes a stop_words data frame to assist us:
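A quick look at what it contains:

```r
library(tidytext)

data(stop_words)
head(stop_words)   # columns: word, lexicon
```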
We’ll change stop_words slightly to make it useful to us. This involves adding a column to help us filter in the next step, and adding some common, uninteresting words: “https”, “t.co”, “yeahthatgreenville”, and “amp”. We filter these out for various reasons, e.g. “https” and “t.co” come from URLs, “amp” is left over from tokenizing some HTML code, and we searched on “yeahthatgreenville” in the first place. Augmenting stop words is a bit of an iterative process, which I’m not showing here; I went back and forth a few times to arrive at this list.
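One way to build such a list (the “custom” lexicon label is my own choice, following the words named above):

```r
library(dplyr)
library(tidytext)

# Standard stop words plus a few Twitter-specific terms, tagged as "custom"
my_stop_words <- stop_words %>%
  bind_rows(tibble(word    = c("https", "t.co", "yeahthatgreenville", "amp"),
                   lexicon = "custom"))
```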
Now, we can determine which of the words above are stop words and thus not worth analyzing:
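The step looks something like this (interesting_words is my own name for the result):

```r
library(dplyr)

# Drop any word that appears in my_stop_words
interesting_words <- tweet_words %>%
  anti_join(my_stop_words, by = "word")
```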
The anti_join function is probably not familiar to most data scientists or statisticians. It is, in a sense, the opposite of a merge. Basically, the command above compares the tweet_words and my_stop_words data frames and removes every row of tweet_words whose word matches something in my_stop_words, leaving only the rows (the id and word) that do not match. This is desirable because our my_stop_words dataset contains words we do not want to analyze.
Now we can analyze the more interesting words:
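For example, the most frequent remaining words, which can be plotted the same way as before:

```r
library(dplyr)

# The 20 most frequent words once stop words are removed
interesting_words %>%
  count(word, sort = TRUE) %>%
  top_n(20, n)
```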
Sentiment analysis
Sentiment analysis is, in short, the quantitative study of the emotional content of text. The most sophisticated analysis, of course, is very difficult, but we can make a start using a simple procedure. Many of the ideas here can be found in a vignette for the tidytext package written by Julia Silge and David Robinson.
As a start, we use the Bing lexicon, which labels each word it contains as either positive or negative.
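The lexicon is available through tidytext’s get_sentiments() function:

```r
library(tidytext)

# Word-to-sentiment mapping ("positive"/"negative")
get_sentiments("bing")
```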
Sentiment analysis then is an exercise in an inner-join:
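Roughly like so (word_sentiments is my own name; only words that appear in the lexicon are kept):

```r
library(dplyr)
library(tidytext)

# Attach a positive/negative label to each word found in the Bing lexicon
word_sentiments <- interesting_words %>%
  inner_join(get_sentiments("bing"), by = "word")
```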
Once you get to this point, sentiment analysis can start fairly easily:
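For example, overall counts of positive and negative words:

```r
library(dplyr)

# How many positive vs. negative words appear across all tweets
word_sentiments %>%
  count(sentiment)
```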
There are many more positive words than negative words, so the mood tilts positive in our crude analysis. We can also group by tweet and see whether there are more positive or negative tweets:
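A sketch of that per-tweet summary (this is one reasonable way to get the averages quoted below; the original code may differ):

```r
library(dplyr)

# Count positive and negative words within each tweet,
# then average those counts across tweets
word_sentiments %>%
  count(id, sentiment) %>%
  group_by(sentiment) %>%
  summarize(mean_words_per_tweet = mean(n))
```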
On average, there are about 1.33 positive words per tweet and 1.04 negative words per tweet, if you accept the assumptions of the above analysis.
There is, of course, a lot more that can be done, but this will get you started. For some more sophisticated ideas you can check Julia Silge’s analysis of Reddit data, for instance. Another kind of analysis looking at sentiment and emotional content can be found here (with the caveat that it uses the predecessor to dplyr and thus runs somewhat less efficiently). Finally, it would probably be useful to supplement the above sentiment data frames with situation-specific sentiment analysis, such as making goallllllll (seen above) a positive word.
Conclusions
The R packages twitteR and tidytext make analyzing content from Twitter easy. This is helpful if you want to analyze, for instance, real-time reactions to events. Above, we pulled content from Twitter, split it into words, and analyzed word frequencies while eliminating “uninteresting” words. Then we analyzed whether tweets were on the whole positive or negative using pre-made lexicons that map words to positive or negative sentiment.