How we voted in South Carolina

Purpose

This post seeks to explore how Greenville, SC and surrounding areas voted in the 2016 election. It also demonstrates how to retrieve data from the Data.World site. To retrieve data from this site using the tools in this post, you have to create an account (easy to do if you have a Facebook, Twitter, or Github account). You can then get your own API key from your profile page. Furthermore, from R, you will need to get the data.world package (install.packages("data.world")). You can then load the API key into R using saved_cfg <- data.world::save_config("YOUR_API_KEY"). This is the same saved_cfg used below.

Furthermore, for purposes of map display we need to get the shapes of the districts in SC. I found one such collection on Github at User nvelesko’s Github page. I just downloaded as a zip file and extracted into a local directory I named precinct_shp. There are versions of these shape files for most states. I use the readOGR function from package rgdal to read them in.

Setup and acquiring data

First, we load the shape files that were downloaded from the Github site above. The readOGR function seems to be a little strange, because I tried calling it with precinct_shp (the directory of the shape files) directly, but it gave errors. Eventually, I gave up and changed the working directory to read in the shape files, and then changed it back. Note that if you’re doing this in an R Notebook or R markdown file, you’ll get some strange messages about how changing the working directory in an R notebook works.
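One way to contain the working-directory change described above is a small helper that restores the original directory even if the read fails. This is a sketch under assumptions, not the code used in the post: with_dir is a hypothetical helper, and the layer name is guessed from the first .shp file found in precinct_shp.

```r
# Hypothetical helper: evaluate `code` with `dir` as the working directory,
# then restore the previous directory even if `code` throws an error.
with_dir <- function(dir, code) {
  old <- setwd(dir)
  on.exit(setwd(old), add = TRUE)
  force(code)  # `code` is lazy, so it is evaluated only after setwd()
}

# Assumed usage: read the first shapefile found in precinct_shp.
# Runs only if rgdal is installed and the directory exists.
if (requireNamespace("rgdal", quietly = TRUE) && dir.exists("precinct_shp")) {
  sc_precincts <- with_dir("precinct_shp", {
    shp <- list.files(pattern = "\\.shp$")
    rgdal::readOGR(dsn = ".", layer = tools::file_path_sans_ext(shp[1]))
  })
}
```

Because the directory is restored in on.exit, a failed read does not leave the session stranded in precinct_shp.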

How to make interactive maps with Census and local data in R

So the goal here is to focus back on Greenville County and have even more granularity. I look at median house prices near Greenville and then overlay the park data downloaded earlier. This time, for the Census data, I use the tidycensus package that came out recently. Instead of using ggplot2 to create a static map, I use the leaflet package to create an interactive map, which also makes it convenient to integrate data from disparate sources.
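A rough sketch of that workflow might look like the following. It is not the post's code: map_median_home_value is a hypothetical function name, the ACS variable B25077_001 (median value of owner-occupied housing units) is my choice of measure, and the park overlay is left out. It assumes the tidycensus, sf, and leaflet packages are installed and that a Census API key has already been registered with tidycensus::census_api_key().

```r
# Sketch only: nothing is downloaded until the function is called.
map_median_home_value <- function(county = "Greenville", state = "SC") {
  # Tract-level ACS estimates with tract geometry attached
  tracts <- tidycensus::get_acs(
    geography = "tract",
    variables = "B25077_001",
    state = state,
    county = county,
    geometry = TRUE
  )
  # leaflet expects longitude/latitude in WGS84
  tracts <- sf::st_transform(tracts, crs = 4326)
  pal <- leaflet::colorNumeric("YlOrRd", domain = tracts$estimate)

  map <- leaflet::leaflet(tracts)
  map <- leaflet::addTiles(map)
  map <- leaflet::addPolygons(
    map,
    fillColor = ~pal(estimate),
    fillOpacity = 0.6,
    weight = 1,
    color = "#444444",
    popup = ~paste0(NAME, ": $", estimate)
  )
  leaflet::addLegend(map, pal = pal, values = ~estimate,
                     title = "Median home value")
}
```

Calling map_median_home_value() in an interactive session should open the map in the viewer; further layers such as park locations could be added to the returned map object.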

US Census Data

The US Census collects a number of demographic measures and publishes aggregate data through its website. There are several ways to use Census data in R, from the Census API to the USCensus2010 package. If you are interested in geopolitical data in the US, I recommend exploring both these options - the Census API requires a key for each person who uses it, and the package requires downloading a very large dataset. The setups for both require some effort, but once that effort is done you don’t have to do it again.

The acs package in R allows you to access the Census API easily. I highly recommend checking it out, and that’s the method we will use here. Note that I’ve already defined the variable api_key - if you are trying to run this code you will need to first run something like api_key <- <enter your Census API key> before running the rest of this code.

Motivation behind this example

I was diagnosed with sleep apnea last year, and have to use a continuous positive airway pressure (CPAP) machine to sleep well enough to feel alert during the day. The machine uploads data (via cellular connection) to a website that will give me results for the last two weeks. This data includes both usage (time of usage, air leakage, number of times mask was put on/taken off), and results (apnea-hypopnea index, which is an average of the number of times per hour that slow or no breathing occurred for at least 10 seconds). The website only displays results from the last two weeks, and I’d like to eventually do a long-term analysis. I’d also like to have things displayed my own way, because, well, I’m like that.

I could enter this information in a spreadsheet, and for import into R or other statistical software that might be the sensible thing to do. However, by keeping this data alongside other diary entries and the text surrounding it, I get to see it in the context of other things going on in my life. This information does not exist in a vacuum, and is important context for other things. For instance, if I’m dealing with a particularly stressful situation, it would be nice to go back and see how I dealt with that in the context of how my sleeping is going (and vice versa - does the apnea get better or worse during that time?). Another issue is that I’m dealing with migraines, and I’d like to know something about the frequency and severity in the context of sleep.

Methodology for data collection

This personal data collection exercise uses an excellent piece of software specifically for journaling called The Journal. I’ve been using The Journal since 2007 to record events and just simply jog my memory of goings on in my life. The software has a few nifty features that dovetail nicely with data collection.

Daily entries

The Journal splits writing up into categories. Categories can be either loose-leaf (where entries can be organized hierarchically any way you want) or daily (where entries are organized by the date of entry). If you set it up a certain way, you can have The Journal lock entries on every day except for the day you are working on. It can also automatically create an entry for the day you are working on. Very handy for daily journaling in general.

Topics

Topics are tags for specific pieces of text or entries. If you select a piece of text and tag it with a topic (say, CPAP), you can extract that piece of text later. Couple this with the Search by Topic command, and you can extract all text tagged with a certain topic into one document and save a single document with all text from that topic. So, for example, I will tag all my CPAP writings with the CPAP topic, and later on save a text file with what I have written about CPAP therapy (in this case, the data I collected).

Templates

The Journal has a sophisticated template system that can insert not only the same text over and over, but tag it automatically with a certain topic and even fill in certain data such as the current date and time. I use the template feature to create some structured text (a data entry form of sorts) and tag the whole piece of inserted text with the CPAP topic. That way, I don’t have to bother with selecting and tagging manually. I can simply insert the text and fill in the numbers when I read the website.

The template looks like this:

Sleep numbers for <ENTRYDATE format="mm/dd/yyyy"/>
* Usage: 
* Leakage: L/min
* AHI: events/h
* Mask on/off: 
* MyAir score: 
* Comments:

Because the text follows the same structure for all such entries, it is easy to write R code to pull out the data and make a data.frame.

What you don’t see (and is hard to show here) is that in the template itself I selected all of the text and tagged it CPAP. That way, my CPAP entries will always be tagged, and I can easily extract them later.
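Because every entry follows the template, a few string operations are enough to turn one entry into a row of a data.frame. The following is a minimal sketch rather than the post's own code: the sample entry and its numbers are made up, the helper names (field, first_number, parse_entry) are hypothetical, and it assumes usage is recorded as h:mm.

```r
# Made-up entry in the shape of the template; the numbers are illustrative only.
entry <- c(
  "Sleep numbers for 01/15/2017",
  "* Usage: 7:30",
  "* Leakage: 4 L/min",
  "* AHI: 1.2 events/h",
  "* Mask on/off: 2",
  "* MyAir score: 95",
  "* Comments: slept well"
)

# Return the text after "* <label>:" on the matching line, or NA if absent.
field <- function(lines, label) {
  prefix <- paste0("* ", label, ":")
  hit <- lines[startsWith(lines, prefix)]
  if (length(hit) == 0) return(NA_character_)
  trimws(substring(hit[1], nchar(prefix) + 1))
}

# Pull the leading number out of a value such as "4 L/min".
first_number <- function(x) {
  as.numeric(sub("^([0-9.]+).*$", "\\1", x))
}

parse_entry <- function(lines) {
  usage <- as.numeric(strsplit(field(lines, "Usage"), ":")[[1]])  # assumes h:mm
  data.frame(
    date = as.Date(sub("^Sleep numbers for ", "", lines[1]), format = "%m/%d/%Y"),
    usage_hours = usage[1] + usage[2] / 60,
    leakage_l_min = first_number(field(lines, "Leakage")),
    ahi = first_number(field(lines, "AHI")),
    mask_on_off = first_number(field(lines, "Mask on/off")),
    myair_score = first_number(field(lines, "MyAir score")),
    comments = field(lines, "Comments"),
    stringsAsFactors = FALSE
  )
}

cpap <- parse_entry(entry)
```

Keeping the units in the template (L/min, events/h) costs nothing here, since first_number simply ignores whatever follows the number.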

Methodology for analysis

Data extraction

The first part of data extraction is in The Journal. I use a saved search from the Search Entries by Topic function, then click View All Result Entries to see the text I had entered. The result is a screen showing the last 100 pieces of text I tagged CPAP (which may include other pieces of text if I felt the need to write on the topic). I can change this limit with an option. Clicking Save to File allows me to save to a Journal file, an RTF file, or a TXT file. I save the result to a TXT file so that I can easily read it in R. The text file contains only the data I entered for the CPAP machine, as well as any other text I tagged (which is fairly uncommon).

Data import

This is where I pay the price for putting the data in a diary rather than a tabular format. I use readLines.
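As a rough illustration of the readLines approach, the sketch below reads an exported text file and groups its lines into one chunk per diary entry. The file contents are made up and written to a temporary file so the example is self-contained; the exact layout of a real export from The Journal is an assumption.

```r
# Write a small made-up export to a temporary file.
export <- tempfile(fileext = ".txt")
writeLines(c(
  "Sleep numbers for 01/15/2017",
  "* Usage: 7:30",
  "* AHI: 1.2 events/h",
  "",
  "Sleep numbers for 01/16/2017",
  "* Usage: 6:45",
  "* AHI: 0.8 events/h"
), export)

lines <- readLines(export)
lines <- lines[nzchar(trimws(lines))]  # drop blank lines

# Each header line starts a new entry; cumsum() turns the header flags
# into a running entry number that split() can group on.
is_header <- grepl("^Sleep numbers for ", lines)
entries <- split(lines, cumsum(is_header))

# One AHI value per entry, pulled from the "* AHI:" line.
ahi <- vapply(entries, function(e) {
  ahi_line <- grep("^\\* AHI:", e, value = TRUE)[1]
  as.numeric(sub("^\\* AHI: *([0-9.]+).*$", "\\1", ahi_line))
}, numeric(1))
```

The cumsum trick avoids an explicit loop: every line inherits the number of the most recent header above it.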

Acquiring inauguration speeches

Though not about Greenville especially, it might be interesting to quantitatively analyze inauguration speeches. This analysis will be done using two paradigms: the tm package and the tidytext package. We will read the speeches in such a way that we use the tidytext package; later on we will use some tools from that package to make analyses traditionally done by tm.

I looked around for inauguration speeches, and finally found them at www.bartleby.com. They are in a format meant more for human consumption, but with the use of the rvest (harvest?) package, we can read them in relatively easily. However, we need to do a mapping from speech IDs to speakers (newly inaugurated presidents), which is a little ugly and tedious.
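To give a flavor of how rvest pulls text out of a page, here is a small sketch that parses an inline HTML string instead of a live page. The markup, the text, and the ID-to-speaker lookup are all invented placeholders; the real pages and IDs will differ, and the example assumes the rvest package is installed.

```r
library(rvest)

# Stand-in for a downloaded speech page; the markup and text are invented.
html <- "<html><body><h1>Inaugural Address</h1>
<p>Fellow citizens, first paragraph.</p>
<p>Second paragraph.</p></body></html>"

page <- read_html(html)

# Keep only the paragraph text and join it into one string per speech.
paragraphs <- html_text(html_nodes(page, "p"))
speech <- paste(paragraphs, collapse = " ")

# Hypothetical lookup from speech ID to speaker; real IDs come from the site.
speakers <- c(speech_a = "President A", speech_b = "President B")
speaker <- speakers[["speech_a"]]
```

For a real page, read_html would be given the page URL instead of a string, and the CSS selector would need to match that site's markup.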

Greenville on Twitter

In this blog post, we use R to analyze Twitter data on topics of interest to Greenville, SC. We will describe obtaining, manipulating, and summarizing the data.

Twitter is a “microblogging” service where users can, usually publicly, share links, pictures, or short comments (up to 140 characters) onto a timeline. The public timeline consists of all public tweets, but people can build their own private timelines to narrow content to just what they want to see. (They do this by “following” users.) Over the years, many companies, news organizations, and users have considered the social media site essential for sharing news and other information. (Or cat memes.) Twitter has some organizational tools such as replies/conversation threads, mentions (i.e. naming other users using the @ notation), and hashtags (naming a topic using # notation). Twitter has encouraged the use of these organizational tools by automatically making mentions and hashtags clickable links.

These organizational tools can make for some interesting analysis. For instance, a game show may encourage viewers to vote on a winner using hashtags. On their end, they create a filter for a particular hashtag (e.g. #votemyplayer) and count votes. This also makes Twitter data ripe for text mining (which they use to identify trending topics).
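The hashtag-counting idea can be sketched in a few lines of base R. The tweets below are invented for illustration, and the pattern is a simplification of what counts as a hashtag on Twitter.

```r
# Invented tweets for illustration.
tweets <- c(
  "Great night downtown #greenville",
  "I pick contestant two! #votemyplayer",
  "#VoteMyPlayer all the way #gameshow",
  "No hashtags here"
)

# Extract every hashtag (a # followed by letters, digits, or underscores).
matches <- regmatches(tweets, gregexpr("#[[:alnum:]_]+", tweets))
hashtags <- tolower(unlist(matches))  # treat #VoteMyPlayer and #votemyplayer alike

# Tally all hashtags, then count the "votes" for one of them.
counts <- sort(table(hashtags), decreasing = TRUE)
votes <- sum(hashtags == "#votemyplayer")
```

The same tally, applied to a large sample of tweets, is the starting point for spotting which topics are trending.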

Obtaining the Twitter data

Twitter makes it possible for software to obtain Twitter comments without having to resort to “web-scraping” techniques (i.e. downloading the data as a web page and then parsing the HTML). Instead, you can go through an Application Programming Interface (API) to obtain the comments directly. If you’re interested, Twitter has a whole subdomain related to accessing their data, including documentation. There are a lot of technical details, but for the casual user probably the only ones of interest are API key and rate limits. This post won’t fuss with rate limits, but more serious work may require some further understanding of these issues. However, you will need to create an API key. Follow these instructions, which are tailored for R users. It essentially consists of creating a token at Twitter’s app web site and running an R function with the token. I set variables consumer_secret, consumer_key, access_token, and access_secret in an R block just copying and pasting from the Twitter apps site, not echoed in this blog post for obvious reasons.
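An alternative to pasting the four values into a script is to read them from environment variables, so they never appear in the post or in version control. This is a sketch with assumptions: the environment variable names and read_twitter_credentials are hypothetical, and the commented usage assumes the twitteR package, which the text above does not name.

```r
# Hypothetical environment variable names; set them in ~/.Renviron, for example.
read_twitter_credentials <- function() {
  vars <- c(consumer_key = "TWITTER_CONSUMER_KEY",
            consumer_secret = "TWITTER_CONSUMER_SECRET",
            access_token = "TWITTER_ACCESS_TOKEN",
            access_secret = "TWITTER_ACCESS_SECRET")
  creds <- Sys.getenv(vars, unset = NA)
  names(creds) <- names(vars)
  # Fail early with a clear message if anything is unset or empty
  missing <- names(creds)[is.na(creds) | creds == ""]
  if (length(missing) > 0) {
    stop("Missing Twitter credentials: ", paste(missing, collapse = ", "))
  }
  as.list(creds)
}

# Assumed usage with the twitteR package:
# creds <- read_twitter_credentials()
# twitteR::setup_twitter_oauth(creds$consumer_key, creds$consumer_secret,
#                              creds$access_token, creds$access_secret)
```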

Plotting GeoJSON polygons on a map with R

In a previous post we plotted some points, retrieved from a public dataset in GeoJSON format, on top of a Google Map of the area surrounding Greenville, SC. In this post we plot some public data in GeoJSON format as well, but instead of particular points, we plot polygons. Polygons describe an area rather than a single point. As before, to set up we do the following:

Plotting GeoJSON data on a map with R

GeoJSON is a standard text-based data format for encoding geographical information, which relies on the JSON (JavaScript Object Notation) standard. There are a number of public datasets for Greenville, SC that use this format, and the R programming language makes working with these data easy. Install the rgeojson library, which is part of the rOpenSci family of packages.
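To show what the format looks like underneath, the sketch below parses a tiny GeoJSON string into a data.frame of points. It deliberately swaps in the general-purpose jsonlite package rather than a dedicated GeoJSON package, and the feature name and coordinates are invented placeholders, not taken from any Greenville dataset.

```r
library(jsonlite)

# Invented single-point FeatureCollection; name and coordinates are placeholders.
geojson <- '{
  "type": "FeatureCollection",
  "features": [{
    "type": "Feature",
    "properties": {"name": "Example Park"},
    "geometry": {"type": "Point", "coordinates": [-82.4, 34.85]}
  }]
}'

parsed <- fromJSON(geojson, simplifyVector = FALSE)

# GeoJSON stores coordinates as longitude first, then latitude.
points <- do.call(rbind, lapply(parsed$features, function(f) {
  data.frame(
    name = f$properties$name,
    lon = f$geometry$coordinates[[1]],
    lat = f$geometry$coordinates[[2]],
    stringsAsFactors = FALSE
  )
}))
```

The longitude-first ordering is worth remembering, since many mapping functions expect latitude and longitude as separately named arguments.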

In this post we plot some public data in GeoJSON format on top of a retrieved Google Map. To set up we do the following:

Upstate Data Analysis