<h1>How we voted in South Carolina</h1>
<p>2017-08-04, by John Johnson</p>
<h1 id="purpose">Purpose</h1>
<p>This post seeks to explore how Greenville, SC and surrounding areas voted in the 2016 election. It also demonstrates how to retrieve data from the <a href="http://data.world">Data.World</a> site. To retrieve data from this site using the tools in this post, you have to create an account (easy to do if you have a Facebook, Twitter, or Github account). You can then get your own API key from your profile page. Furthermore, from R, you will need to get the <code class="highlighter-rouge">data.world</code> package (<code class="highlighter-rouge">install.packages("data.world")</code>). You can then load the API key into R using <code class="highlighter-rouge">saved_cfg <- data.world::save_config("YOUR_API_KEY")</code>. This is the same <code class="highlighter-rouge">saved_cfg</code> used below.</p>
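<p>The one-time setup just described can be sketched as follows; <code class="highlighter-rouge">"YOUR_API_KEY"</code> is a placeholder for the key from your own data.world profile page:</p>

```r
# One-time data.world setup, as described above.
# "YOUR_API_KEY" is a placeholder -- substitute your own key.
install.packages("data.world")                        # once per machine
saved_cfg <- data.world::save_config("YOUR_API_KEY")  # stores the key
data.world::set_config(saved_cfg)                     # activates it for this session
```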
<p>Furthermore, for purposes of map display, we need the shapes of the voting precincts in SC. I found one such collection in <a href="https://github.com/nvkelso/election-geodata">user nvkelso</a>’s election-geodata repository on Github. I downloaded it as a zip file and extracted it into a local directory I named <code class="highlighter-rouge">precinct_shp</code>. There are versions of these shape files for most states. I use the <code class="highlighter-rouge">readOGR</code> function from the <code class="highlighter-rouge">rgdal</code> package to read them in.</p>
<h1 id="setup-and-acquiring-data">Setup and acquiring data</h1>
<p>First, we load the shape files downloaded from the Github repository above. The <code class="highlighter-rouge">readOGR</code> function was a little finicky: calling it with <code class="highlighter-rouge">precinct_shp</code> (the directory of the shape files) directly gave errors. Eventually, I gave up, changed the working directory to read in the shape files, and then changed it back. Note that if you’re doing this in an R Notebook or R Markdown file, you’ll get some messages about how changing the working directory works inside a notebook chunk.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">data.world</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">rgdal</span><span class="p">)</span><span class="w">
</span><span class="n">set_config</span><span class="p">(</span><span class="n">saved_cfg</span><span class="p">)</span><span class="w"> </span><span class="c1"># saved_cfg was set in an invisible block which has my API key
</span><span class="n">owd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">getwd</span><span class="p">()</span><span class="w">
</span><span class="n">setwd</span><span class="p">(</span><span class="n">precinct_shp</span><span class="p">)</span><span class="w">
</span><span class="n">precinct_shapes</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">readOGR</span><span class="p">(</span><span class="s2">"."</span><span class="p">,</span><span class="s2">"Statewide"</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## OGR data source with driver: ESRI Shapefile
## Source: ".", layer: "Statewide"
## with 2155 features
## It has 4 fields</code></pre></figure>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">setwd</span><span class="p">(</span><span class="n">owd</span><span class="p">)</span></code></pre></figure>
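<p>As an aside, the working-directory dance can sometimes be avoided by passing the directory as the <code class="highlighter-rouge">dsn</code> argument; in my experience whether this works depends on the rgdal version and how the path is written, so treat this as an untested sketch:</p>

```r
library(rgdal)
# Equivalent call without changing the working directory; "precinct_shp"
# is the same local folder of extracted shapefiles used above.
precinct_shapes <- readOGR(dsn = "precinct_shp", layer = "Statewide")
```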
<p>To download the election data, we use the simplified commands from the <code class="highlighter-rouge">dwapi</code> package (automatically loaded by <code class="highlighter-rouge">data.world</code>). The <code class="highlighter-rouge">list_tables</code> command lists the available data tables for a dataset. Here there are two: one for the election itself and one for registration. For exploration, we download both. The data live under user @tamilyn in the dataset named south-carolina-election-data. If you use Python or another common data analysis tool, data.world has released tools that connect your tool of choice to their API. As of the date of this blog post, the two files totaled about 850 kB.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">ds_url</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"tamilyn/south-carolina-election-data"</span><span class="w">
</span><span class="n">election_tables</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">dwapi</span><span class="o">::</span><span class="n">list_tables</span><span class="p">(</span><span class="n">ds_url</span><span class="p">)</span><span class="w">
</span><span class="n">election_df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">dwapi</span><span class="o">::</span><span class="n">download_table_as_data_frame</span><span class="p">(</span><span class="n">ds_url</span><span class="p">,</span><span class="n">election_tables</span><span class="p">[</span><span class="m">1</span><span class="p">])</span><span class="w">
</span><span class="n">regis_df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">dwapi</span><span class="o">::</span><span class="n">download_table_as_data_frame</span><span class="p">(</span><span class="n">ds_url</span><span class="p">,</span><span class="n">election_tables</span><span class="p">[</span><span class="m">2</span><span class="p">])</span></code></pre></figure>
<h1 id="showing-the-data">Showing the data</h1>
<p>Plotting the precinct data using the standard R tools is easy:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">precinct_shapes</span><span class="p">)</span></code></pre></figure>
<p><img src="/figures//2017-08-04-how-we-voted-in-greenville-sc.Rmdunnamed-chunk-3-1.png" alt="plot of chunk unnamed-chunk-3" /></p>
<p>This is because <code class="highlighter-rouge">plot</code> “knows” what to do with shape data. In fact, we can explore this a little bit further:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="nf">class</span><span class="p">(</span><span class="n">precinct_shapes</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"</code></pre></figure>
<p>This shows that we have a <code class="highlighter-rouge">SpatialPolygonsDataFrame</code>, a spatial data class defined by the <code class="highlighter-rouge">sp</code> package. Behind the scenes, <code class="highlighter-rouge">plot</code> is calling a method defined just for these objects, which tells it how to render this kind of shape data. We could also have done this (and have, on this blog) with the <code class="highlighter-rouge">ggplot2</code> package, but for this kind of exploration the quick plot is good. So don’t give up completely on base R graphics.</p>
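<p>The dispatch mechanism is easy to see in miniature with S3 classes (this toy class is made up for illustration; sp’s own plot method is S4, but the idea is the same):</p>

```r
# summary() picks a method based on the object's class, just as plot()
# picks the sp-provided method when handed a SpatialPolygonsDataFrame.
x <- structure(list(), class = "myshape")
summary.myshape <- function(object, ...) "custom summary for myshape"
summary(x)
## [1] "custom summary for myshape"
```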
<p>The election data has a bit of an odd structure.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">DT</span><span class="o">::</span><span class="n">datatable</span><span class="p">(</span><span class="n">election_df</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">row</span><span class="o">=</span><span class="m">1</span><span class="o">:</span><span class="n">nrow</span><span class="p">(</span><span class="n">.</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="n">row</span><span class="p">,</span><span class="n">everything</span><span class="p">())</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">slice</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">5</span><span class="p">,</span><span class="m">103</span><span class="o">:</span><span class="m">107</span><span class="p">)))</span></code></pre></figure>
<p><img src="/figures//2017-08-04-how-we-voted-in-greenville-sc.Rmdunnamed-chunk-5-1.png" alt="plot of chunk unnamed-chunk-5" /></p>
<p>So the fancy stuff I did with <code class="highlighter-rouge">datatable</code> above was basically to show the original row numbers and print them to the left of all the other variables. If I had not used <code class="highlighter-rouge">select(row, everything())</code>, the row numbers would have printed last. This is a nice example of quick custom column ordering. But that’s not why we’re here. The election file records votes in a rather raw fashion. Specifically, I want to tally the votes for the different presidential candidates. That’s not entirely straightforward, because I have to count the straight-ticket voters as well as the split-ticket voters who specifically marked a presidential choice on the ballot.</p>
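<p>For readers without dplyr, the same "row number first" trick can be done in base R (toy data frame made up for illustration):</p>

```r
df <- data.frame(a = 1:3, b = letters[1:3])
df$row <- seq_len(nrow(df))                       # add original row numbers
df <- df[, c("row", setdiff(names(df), "row"))]   # move 'row' to the front
names(df)
## [1] "row" "a"   "b"
```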
<p>The goal here is to get to the percentage of votes going to the big two parties. While it may be an interesting exercise to some to look at the third party votes, we’re not going to do that here.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">total_pres_votes</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">election_df</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="n">office</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"STRAIGHT PARTY"</span><span class="p">,</span><span class="s2">"President and Vice President"</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">precinct</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">summarize</span><span class="p">(</span><span class="n">total_votes</span><span class="o">=</span><span class="nf">sum</span><span class="p">(</span><span class="n">votes</span><span class="p">,</span><span class="n">na.rm</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">))</span><span class="w">
</span><span class="n">DT</span><span class="o">::</span><span class="n">datatable</span><span class="p">(</span><span class="n">total_pres_votes</span><span class="p">)</span></code></pre></figure>
<p><img src="/figures//2017-08-04-how-we-voted-in-greenville-sc.Rmdunnamed-chunk-6-1.png" alt="plot of chunk unnamed-chunk-6" /></p>
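<p>The tallying logic above — keep only the straight-party and presidential rows, then sum votes by precinct — can be checked on a toy data frame in base R (the numbers here are made up):</p>

```r
votes <- data.frame(
  precinct = c("A", "A", "A", "B", "B"),
  office   = c("STRAIGHT PARTY", "President and Vice President",
               "Sheriff", "STRAIGHT PARTY", "President and Vice President"),
  party    = c("DEM", "REP", "DEM", "REP", "DEM"),
  votes    = c(10, 20, 99, 5, 15)
)
# Keep only rows that count toward the presidential total
pres  <- votes[votes$office %in% c("STRAIGHT PARTY",
                                   "President and Vice President"), ]
# Sum votes within each precinct
total <- aggregate(votes ~ precinct, data = pres, FUN = sum)
total   # A: 30, B: 20 -- the Sheriff row is correctly excluded
```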
<p>The first few results look ok, so we can get votes for the different parties and merge this back on to get the percentage.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">total_party_votes</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">election_df</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">dplyr</span><span class="o">::</span><span class="n">filter</span><span class="p">((</span><span class="n">office</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"STRAIGHT PARTY"</span><span class="p">,</span><span class="s2">"President and Vice President"</span><span class="p">))</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="p">(</span><span class="n">party</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"DEM"</span><span class="p">,</span><span class="s2">"REP"</span><span class="p">)))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">precinct</span><span class="p">,</span><span class="n">party</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">summarize</span><span class="p">(</span><span class="n">total_party_votes</span><span class="o">=</span><span class="nf">sum</span><span class="p">(</span><span class="n">votes</span><span class="p">,</span><span class="n">na.rm</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
</span><span class="n">left_join</span><span class="p">(</span><span class="n">total_pres_votes</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">party_perc</span><span class="o">=</span><span class="n">total_party_votes</span><span class="o">/</span><span class="n">total_votes</span><span class="o">*</span><span class="m">100</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Joining, by = "precinct"</code></pre></figure>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">DT</span><span class="o">::</span><span class="n">datatable</span><span class="p">(</span><span class="n">total_party_votes</span><span class="p">)</span></code></pre></figure>
<p><img src="/figures//2017-08-04-how-we-voted-in-greenville-sc.Rmdunnamed-chunk-7-1.png" alt="plot of chunk unnamed-chunk-7" /></p>
<p>So one picky issue is worth mentioning here. This dataset counts absentee ballots as their own precinct (or precincts). Part of the joy of blogging is sweeping issues like this under the rug, but they can be the source of interesting analyses in their own right.</p>
<p>The final bit of data wrangling I do here is to widen the dataset, so I can merge it easily later on with the shapefile.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">total_party_votes_wide</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">total_party_votes</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">total_party_votes</span><span class="p">,</span><span class="o">-</span><span class="n">total_votes</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">precinct</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">spread</span><span class="p">(</span><span class="n">key</span><span class="o">=</span><span class="n">party</span><span class="p">,</span><span class="n">value</span><span class="o">=</span><span class="n">party_perc</span><span class="p">)</span><span class="w">
</span><span class="n">DT</span><span class="o">::</span><span class="n">datatable</span><span class="p">(</span><span class="n">total_party_votes_wide</span><span class="w"> </span><span class="p">)</span></code></pre></figure>
<p><img src="/figures//2017-08-04-how-we-voted-in-greenville-sc.Rmdunnamed-chunk-8-1.png" alt="plot of chunk unnamed-chunk-8" /></p>
<p>There were some casualties in this operation, namely the total party votes and total votes, but I don’t really need them for the simple thing I’m doing here.</p>
<p>The <code class="highlighter-rouge">spread</code> function I used above comes from the <code class="highlighter-rouge">tidyr</code> package, loaded by <code class="highlighter-rouge">tidyverse</code>. It, along with <code class="highlighter-rouge">gather</code>, enables navigating between “long” and “wide” datasets. You have to be careful using these functions, though, or you may get something strange. For instance, when I left the party vote totals in before calling <code class="highlighter-rouge">spread</code> (in a previous iteration of this post), I ended up with a dataset that was both wide and long, with <code class="highlighter-rouge">NA</code> in every other row of each percentage column.</p>
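<p>That <code class="highlighter-rouge">NA</code>-striping failure mode is easy to reproduce on a toy table (made-up numbers; requires the tidyr package):</p>

```r
library(tidyr)
df <- data.frame(precinct = c("A", "A"), party = c("DEM", "REP"),
                 n = c(30, 70), perc = c(30, 70))
# With the extra 'n' column, each (precinct, n) pair stays a distinct row,
# so spreading leaves NA holes:
spread(df, key = party, value = perc)
# Dropping 'n' first collapses to one row per precinct, as intended:
spread(df[, c("precinct", "party", "perc")], key = party, value = perc)
```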
<h1 id="merging-data">Merging data</h1>
<p>In the last blog post, we were simply able to plot two maps on top of each other using the <code class="highlighter-rouge">leaflet</code> package. We are faced with a slightly different issue here. One data set is a geographic dataset, but the elections data only lists a precinct name. So it is up to us to do the merge.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">precinct_elect</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">merge</span><span class="p">(</span><span class="n">precinct_shapes</span><span class="p">,</span><span class="n">total_party_votes_wide</span><span class="p">,</span><span class="n">by.x</span><span class="o">=</span><span class="s2">"PNAME"</span><span class="p">,</span><span class="n">by.y</span><span class="o">=</span><span class="s2">"precinct"</span><span class="p">)</span></code></pre></figure>
<p>Honestly, I thought the above step would be the hardest part of the post. But like <code class="highlighter-rouge">plot</code>, the <code class="highlighter-rouge">merge</code> function in R understands what it’s operating on (i.e. uses a method that’s specific to spatial objects). Someone else did the hard work, and I just use the magic.</p>
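<p>One sanity check worth doing around a merge like this is to see how many precinct names actually match between the two sources, since unmatched names silently become holes in the map. A sketch, using the objects created above:</p>

```r
# How many of the election precinct names appear in the shapefile's
# PNAME field?
sum(total_party_votes_wide$precinct %in% precinct_shapes$PNAME)
# And which election precincts have no matching shape?
setdiff(total_party_votes_wide$precinct, precinct_shapes$PNAME)
```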
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">leaflet</span><span class="p">)</span><span class="w">
</span><span class="n">pal</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">colorNumeric</span><span class="p">(</span><span class="n">palette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"viridis"</span><span class="p">,</span><span class="w">
</span><span class="n">domain</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="m">100</span><span class="p">))</span><span class="w">
</span><span class="n">precinct_elect</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">leaflet</span><span class="p">(</span><span class="n">width</span><span class="o">=</span><span class="s2">"100%"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">addPolygons</span><span class="p">(</span><span class="n">popup</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="n">PNAME</span><span class="p">,</span><span class="w">
</span><span class="n">stroke</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w">
</span><span class="n">smoothFactor</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w">
</span><span class="n">fillOpacity</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">,</span><span class="w">
</span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">pal</span><span class="p">(</span><span class="n">DEM</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">addLegend</span><span class="p">(</span><span class="s2">"bottomright"</span><span class="p">,</span><span class="w">
</span><span class="n">pal</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pal</span><span class="p">,</span><span class="w">
</span><span class="n">values</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">DEM</span><span class="p">,</span><span class="w">
</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Dem P/VP Vote %"</span><span class="p">,</span><span class="w">
</span><span class="n">labFormat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">labelFormat</span><span class="p">(</span><span class="n">suffix</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"%"</span><span class="p">),</span><span class="w">
</span><span class="n">opacity</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span></code></pre></figure>
<p><img src="/figures//2017-08-04-how-we-voted-in-greenville-sc.Rmdunnamed-chunk-10-1.png" alt="plot of chunk unnamed-chunk-10" /></p>
<p>Now there are a few things to note:</p>
<ul>
<li>In this blog post, the map is static. If you actually run this code in RStudio, it will be an interactive map that lets you zoom and pan, and where clicking areas will give you the precinct name.</li>
<li>I basically copied and pasted the <code class="highlighter-rouge">leaflet</code> code from the previous blog post, and changed variable names.</li>
<li>It looks like the dataset has a few holes. This may be where sweeping the absentee ballot under the rug leaves out a lot.</li>
</ul>
<h1 id="discussion">Discussion</h1>
<p>I used the <code class="highlighter-rouge">data.world</code> package and website to download and use Greenville election data. I then merged it with precinct shapefiles found from a different source (Github). The actual merge process wasn’t hard, and in fact the most difficult part of this process was deciding how I wanted to present the data. This election data is rather rich, but has a few necessary quirks in its structure.</p>
<p>Once you have characteristic data in the right format and shape files, it’s magically easy to merge them.</p>
<p>I copied and pasted most of the <code class="highlighter-rouge">leaflet</code> code from my last post to present the data, with tweaks for variable names.</p>
<h1>How to make interactive maps with Census and local data in R</h1>
<p>2017-07-21, by John Johnson</p>
<p>So the goal here is to focus back on Greenville County and have even more granularity. 
I look at median house prices near Greenville and then overlay the park data downloaded earlier. This time, for the Census data, I use the <code class="highlighter-rouge">tidycensus</code> package that came out recently. Furthermore, instead of using <code class="highlighter-rouge">ggplot2</code> to create a static map, I use the <code class="highlighter-rouge">leaflet</code> package to create an interactive map, and, furthermore integrate data from disparate sources in a convenient way.</p>
<h1 id="download-the-local-park-data">Download the local park data</h1>
<p>The local parks file can be found <a href="https://data.openupstate.org/map-layers">here</a>, courtesy of a small group of dedicated volunteers at Open Upstate and an API that makes publishing geojson files easy. We download a polygon file for the park boundaries as well as a point geojson file for the address of each park.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">geojsonio</span><span class="p">)</span><span class="w"> </span><span class="c1"># provides geojson_read</span><span class="w">
</span><span class="n">data_url</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="s2">"https://data.openupstate.org/maps/city-parks/parks.php"</span><span class="w">
</span><span class="n">data_file</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"parks.geojson"</span><span class="w">
</span><span class="c1"># for some reason, I can't read from the url directly, though the tutorial
# says I can
</span><span class="n">download.file</span><span class="p">(</span><span class="n">data_url</span><span class="p">,</span><span class="w"> </span><span class="n">data_file</span><span class="p">)</span><span class="w">
</span><span class="n">data_park</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">geojson_read</span><span class="p">(</span><span class="n">data_file</span><span class="p">,</span><span class="w"> </span><span class="n">what</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"sp"</span><span class="p">)</span><span class="w">
</span><span class="n">data_url</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"https://data.openupstate.org/maps/city-parks/geojson.php"</span><span class="w">
</span><span class="n">data_file</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"parks_point.geojson"</span><span class="w">
</span><span class="c1"># for some reason, I can't read from the url directly, though the tutorial
# says I can
</span><span class="n">download.file</span><span class="p">(</span><span class="n">data_url</span><span class="p">,</span><span class="w"> </span><span class="n">data_file</span><span class="p">)</span><span class="w">
</span><span class="n">data_park_addr</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">geojson_read</span><span class="p">(</span><span class="n">data_file</span><span class="p">,</span><span class="w"> </span><span class="n">what</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"sp"</span><span class="p">)</span></code></pre></figure>
<h1 id="download-the-median-home-value-data">Download the median home value data</h1>
<p>This code from <code class="highlighter-rouge">tidycensus</code> downloads demographic data <em>and</em> geometry together: the result is a data frame with a list column, where one variable holds the polygon geometry for each census tract rather than a single atomic value. Having the demographic and geometric data in one object eases bookkeeping, and, thankfully, leaflet understands this format.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">gvl_value</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">get_acs</span><span class="p">(</span><span class="n">geography</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"tract"</span><span class="p">,</span><span class="w">
</span><span class="n">variables</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"B25077_001"</span><span class="p">,</span><span class="w">
</span><span class="n">state</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"SC"</span><span class="p">,</span><span class="w">
</span><span class="n">county</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Greenville County"</span><span class="p">,</span><span class="w">
</span><span class="n">geometry</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre></figure>
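<p>Note that <code class="highlighter-rouge">get_acs</code> requires a (free) Census API key; tidycensus provides a helper to register it once. <code class="highlighter-rouge">"YOUR_CENSUS_KEY"</code> below is a placeholder:</p>

```r
library(tidycensus)
# Register your key once; install = TRUE caches it in .Renviron so
# future sessions pick it up automatically.
census_api_key("YOUR_CENSUS_KEY", install = TRUE)
```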
<h1 id="plot-the-census-and-local-data-together">Plot the census and local data together</h1>
<p>Now we bring everything together. The <code class="highlighter-rouge">leaflet</code> package was written to make extensive use of the pipe operator that <code class="highlighter-rouge">dplyr</code> popularized a few years ago. We can set a default data frame for a leaflet map, but when we add markers and polygons, we can pull from other data sources. The following code is one way to do this: we use the <code class="highlighter-rouge">tidycensus</code>-generated dataset as the foundation of the leaflet map, and add the park polygons and markers via the <code class="highlighter-rouge">data=</code> option of <code class="highlighter-rouge">addPolygons</code> and <code class="highlighter-rouge">addMarkers</code>. Note the use of the <code class="highlighter-rouge">group=</code> option to create layers, which can be toggled on and off interactively. The <code class="highlighter-rouge">label=</code> option (or the <code class="highlighter-rouge">popup=</code> option for <code class="highlighter-rouge">addPolygons</code>) is used to generate popup windows that give additional information.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">pal</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">colorNumeric</span><span class="p">(</span><span class="n">palette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"viridis"</span><span class="p">,</span><span class="w">
</span><span class="n">domain</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gvl_value</span><span class="o">$</span><span class="n">estimate</span><span class="p">)</span><span class="w">
</span><span class="n">gvl_value</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">st_transform</span><span class="p">(</span><span class="n">crs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"+init=epsg:4326"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">leaflet</span><span class="p">(</span><span class="n">width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"100%"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">addProviderTiles</span><span class="p">(</span><span class="n">provider</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"CartoDB.Positron"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">addPolygons</span><span class="p">(</span><span class="n">popup</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">str_extract</span><span class="p">(</span><span class="n">NAME</span><span class="p">,</span><span class="w"> </span><span class="s2">"^([^,]*)"</span><span class="p">),</span><span class="w">
</span><span class="n">stroke</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w">
</span><span class="n">smoothFactor</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w">
</span><span class="n">fillOpacity</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">,</span><span class="w">
</span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">pal</span><span class="p">(</span><span class="n">estimate</span><span class="p">),</span><span class="w">
</span><span class="n">group</span><span class="o">=</span><span class="s2">"Median home value"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">addLegend</span><span class="p">(</span><span class="s2">"bottomright"</span><span class="p">,</span><span class="w">
</span><span class="n">pal</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pal</span><span class="p">,</span><span class="w">
</span><span class="n">values</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">estimate</span><span class="p">,</span><span class="w">
</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Median home value"</span><span class="p">,</span><span class="w">
</span><span class="n">labFormat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">labelFormat</span><span class="p">(</span><span class="n">prefix</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"$"</span><span class="p">),</span><span class="w">
</span><span class="n">opacity</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">addPolygons</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">data_park</span><span class="p">,</span><span class="n">fillOpacity</span><span class="o">=</span><span class="m">0.8</span><span class="p">,</span><span class="n">group</span><span class="o">=</span><span class="s2">"Parks"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">addMarkers</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">data_park_addr</span><span class="p">,</span><span class="n">group</span><span class="o">=</span><span class="s2">"Parks"</span><span class="p">,</span><span class="n">label</span><span class="o">=~</span><span class="n">title</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">addLayersControl</span><span class="p">(</span><span class="n">overlayGroups</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"Parks"</span><span class="p">,</span><span class="s2">"Median home value"</span><span class="p">))</span></code></pre></figure>
<p><img src="/figures//2017-07-21-r-maps-with-leaflet.Rmdunnamed-chunk-4-1.png" alt="plot of chunk unnamed-chunk-4" /></p>
<p>Unfortunately, due to the limitations of GitHub Pages, this had to be turned into a static image to be rendered. Perhaps it’s time to make the jump to blogdown and Hugo like all the other cool kids?</p>
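<p>If the interactivity is worth keeping, one workaround is to save the widget as a standalone HTML file with the <code class="highlighter-rouge">htmlwidgets</code> package and embed that file in an iframe. This is only a sketch - it assumes the leaflet pipeline above was assigned to a variable named <code class="highlighter-rouge">m</code>, which the code as written does not do:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">library(htmlwidgets)
# assumes m holds the leaflet map built above
saveWidget(m, "gvl_map.html", selfcontained = TRUE)
# the saved gvl_map.html can then be embedded in the page with an iframe</code></pre></figure>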
<h1 id="discussion">Discussion</h1>
<p>I’m just starting to learn about the <code class="highlighter-rouge">leaflet</code> package, but in just a couple of hours (and standing on the shoulders of giants) I was able to put together an interactive map combining Census data (median home value by census tract) and locally-generated data (park locations). Such combinations can be effectively used to examine local situations in the context of rich data already collected at a federal level (assuming the instability at the U.S. Census Bureau is temporary).</p>John JohnsonHow to make maps with Census data in R2017-07-21T00:00:00+00:002017-07-21T00:00:00+00:00https://randomjohn.github.io/r-maps-with-census-data<h2 id="us-census-data">US Census Data</h2>
<p>The US Census collects a number of demographic measures and publishes aggregate data through its website. There are several ways to use Census data in R, from the <a href="https://www.census.gov/developers/">Census API</a> to the <a href="https://www.jstatsoft.org/article/view/v037i06">USCensus2010</a> package. If you are interested in geopolitical data in the US, I recommend exploring both these options - the Census API requires a key for each person who uses it, and the package requires downloading a very large dataset. The setups for both require some effort, but once that effort is done you don’t have to do it again.</p>
<p>The <code class="highlighter-rouge">acs</code> package in R allows you to access the Census API easily. I highly recommend checking it out, and that’s the method we will use here. Note that I’ve already defined the variable <code class="highlighter-rouge">api_key</code> - if you are trying to run this code you will need to first run something like <code class="highlighter-rouge">api_key <- <enter your Census API key></code> before running the rest of this code.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">acs</span><span class="p">)</span><span class="w">
</span><span class="n">api.key.install</span><span class="p">(</span><span class="n">key</span><span class="o">=</span><span class="n">api_key</span><span class="p">)</span><span class="w"> </span><span class="err">#</span><span class="w"> </span><span class="n">now</span><span class="w"> </span><span class="n">you</span><span class="w"> </span><span class="n">are</span><span class="w"> </span><span class="n">ready</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">run</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">rest</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">acs</span><span class="w"> </span><span class="n">code</span></code></pre></figure>
<p>For purposes here, we will use the toy example of plotting median family income by county for every county in South Carolina. First, we obtain the Census data. The first command, <code class="highlighter-rouge">acs.lookup</code>, gives us the table and variable names of what we want. I then use that table number in the <code class="highlighter-rouge">acs.fetch</code> command to get the variable I want.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">acs.lookup</span><span class="p">(</span><span class="n">endyear</span><span class="o">=</span><span class="m">2015</span><span class="p">,</span><span class="w"> </span><span class="n">span</span><span class="o">=</span><span class="m">5</span><span class="p">,</span><span class="n">dataset</span><span class="o">=</span><span class="s2">"acs"</span><span class="p">,</span><span class="w"> </span><span class="n">keyword</span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"median"</span><span class="p">,</span><span class="s2">"income"</span><span class="p">,</span><span class="s2">"family"</span><span class="p">,</span><span class="s2">"total"</span><span class="p">),</span><span class="w"> </span><span class="n">case.sensitive</span><span class="o">=</span><span class="nb">F</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Warning in acs.lookup(endyear = 2015, span = 5, dataset = "acs", keyword = c("median", : XML variable lookup tables for this request
## seem to be missing from ' https://api.census.gov/data/2015/acs5/variables.xml ';
## temporarily downloading and using archived copies instead;
## since this is *much* slower, recommend running
## acs.tables.install()</code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## An object of class "acs.lookup"
## endyear= 2015 ; span= 5
##
## results:
## variable.code table.number
## 1 B10010_001 B10010
## 2 B19126_001 B19126
## 3 B19126_002 B19126
## 4 B19126_005 B19126
## 5 B19126_006 B19126
## 6 B19126_009 B19126
## 7 B19215_001 B19215
## 8 B19215_002 B19215
## 9 B19215_003 B19215
## 10 B19215_006 B19215
## 11 B19215_009 B19215
## 12 B19215_010 B19215
## 13 B19215_013 B19215
## table.name
## 1 Median Family Income for Families with GrndPrnt Householders Living With Own GrndChldrn < 18 Yrs
## 2 B19126. Median Family Income in the Past 12 Months (in 2015 Inflation-Adjusted Dollars) by Family Type by Presence of Own Children Under 18 Years
## 3 B19126. Median Family Income in the Past 12 Months (in 2015 Inflation-Adjusted Dollars) by Family Type by Presence of Own Children Under 18 Years
## 4 B19126. Median Family Income in the Past 12 Months (in 2015 Inflation-Adjusted Dollars) by Family Type by Presence of Own Children Under 18 Years
## 5 B19126. Median Family Income in the Past 12 Months (in 2015 Inflation-Adjusted Dollars) by Family Type by Presence of Own Children Under 18 Years
## 6 B19126. Median Family Income in the Past 12 Months (in 2015 Inflation-Adjusted Dollars) by Family Type by Presence of Own Children Under 18 Years
## 7 B19215. Median Nonfamily Household Income in the Past 12 Months (in 2015 Inflation-Adjusted Dollars) by Sex of Householder by Living Alone by Age of Householder
## 8 B19215. Median Nonfamily Household Income in the Past 12 Months (in 2015 Inflation-Adjusted Dollars) by Sex of Householder by Living Alone by Age of Householder
## 9 B19215. Median Nonfamily Household Income in the Past 12 Months (in 2015 Inflation-Adjusted Dollars) by Sex of Householder by Living Alone by Age of Householder
## 10 B19215. Median Nonfamily Household Income in the Past 12 Months (in 2015 Inflation-Adjusted Dollars) by Sex of Householder by Living Alone by Age of Householder
## 11 B19215. Median Nonfamily Household Income in the Past 12 Months (in 2015 Inflation-Adjusted Dollars) by Sex of Householder by Living Alone by Age of Householder
## 12 B19215. Median Nonfamily Household Income in the Past 12 Months (in 2015 Inflation-Adjusted Dollars) by Sex of Householder by Living Alone by Age of Householder
## 13 B19215. Median Nonfamily Household Income in the Past 12 Months (in 2015 Inflation-Adjusted Dollars) by Sex of Householder by Living Alone by Age of Householder
## variable.name
## 1 Median family income in the past 12 months-- Total:
## 2 Median family income in the past 12 months (in 2015 Inflation-adjusted dollars) -- Total:
## 3 Median family income in the past 12 months (in 2015 Inflation-adjusted dollars) -- Married-couple family -- Total
## 4 Median family income in the past 12 months (in 2015 Inflation-adjusted dollars) -- Other family -- Total
## 5 Median family income in the past 12 months (in 2015 Inflation-adjusted dollars) -- Other family -- Male householder, no wife present -- Total
## 6 Median family income in the past 12 months (in 2015 Inflation-adjusted dollars) -- Other family -- Female householder, no husband present -- Total
## 7 Median nonfamily household income in the past 12 months (in 2015 Inflation-adjusted dollars) -- Total (dollars):
## 8 Median nonfamily household income in the past 12 months (in 2015 Inflation-adjusted dollars) -- Male householder -- Total (dollars)
## 9 Median nonfamily household income in the past 12 months (in 2015 Inflation-adjusted dollars) -- Male householder -- Living alone -- Total (dollars)
## 10 Median nonfamily household income in the past 12 months (in 2015 Inflation-adjusted dollars) -- Male householder -- Not living alone -- Total (dollars)
## 11 Median nonfamily household income in the past 12 months (in 2015 Inflation-adjusted dollars) -- Female householder -- Total (dollars)
## 12 Median nonfamily household income in the past 12 months (in 2015 Inflation-adjusted dollars) -- Female householder -- Living alone -- Total (dollars)
## 13 Median nonfamily household income in the past 12 months (in 2015 Inflation-adjusted dollars) -- Female householder -- Not living alone -- Total (dollars)</code></pre></figure>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">my_cnty</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">geo.make</span><span class="p">(</span><span class="n">state</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">45</span><span class="p">,</span><span class="n">county</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"*"</span><span class="p">)</span><span class="w">
</span><span class="n">home_median_price</span><span class="o"><-</span><span class="n">acs.fetch</span><span class="p">(</span><span class="n">geography</span><span class="o">=</span><span class="n">my_cnty</span><span class="p">,</span><span class="w"> </span><span class="n">table.number</span><span class="o">=</span><span class="s2">"B19126"</span><span class="p">,</span><span class="n">endyear</span><span class="o">=</span><span class="m">2015</span><span class="p">)</span><span class="w"> </span><span class="err">#</span><span class="w"> </span><span class="n">median</span><span class="w"> </span><span class="n">family</span><span class="w"> </span><span class="n">income</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Warning in (function (endyear, span = 5, dataset = "acs", keyword, table.name, : XML variable lookup tables for this request
## seem to be missing from ' https://api.census.gov/data/2015/acs5/variables.xml ';
## temporarily downloading and using archived copies instead;
## since this is *much* slower, recommend running
## acs.tables.install()</code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Error in if (url.test["statusMessage"] != "OK") {: missing value where TRUE/FALSE needed</code></pre></figure>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">knitr</span><span class="o">::</span><span class="n">kable</span><span class="p">(</span><span class="n">head</span><span class="p">(</span><span class="n">home_median_price</span><span class="o">@</span><span class="n">estimate</span><span class="p">))</span></code></pre></figure>
<table>
<thead>
<tr>
<th style="text-align: left"> </th>
<th style="text-align: right">B19126_001</th>
<th style="text-align: right">B19126_002</th>
<th style="text-align: right">B19126_003</th>
<th style="text-align: right">B19126_004</th>
<th style="text-align: right">B19126_005</th>
<th style="text-align: right">B19126_006</th>
<th style="text-align: right">B19126_007</th>
<th style="text-align: right">B19126_008</th>
<th style="text-align: right">B19126_009</th>
<th style="text-align: right">B19126_010</th>
<th style="text-align: right">B19126_011</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">Abbeville County, South Carolina</td>
<td style="text-align: right">44918</td>
<td style="text-align: right">55141</td>
<td style="text-align: right">65664</td>
<td style="text-align: right">50698</td>
<td style="text-align: right">24835</td>
<td style="text-align: right">43187</td>
<td style="text-align: right">50347</td>
<td style="text-align: right">24886</td>
<td style="text-align: right">22945</td>
<td style="text-align: right">18101</td>
<td style="text-align: right">29958</td>
</tr>
<tr>
<td style="text-align: left">Aiken County, South Carolina</td>
<td style="text-align: right">57396</td>
<td style="text-align: right">70829</td>
<td style="text-align: right">72930</td>
<td style="text-align: right">70446</td>
<td style="text-align: right">29302</td>
<td style="text-align: right">36571</td>
<td style="text-align: right">35469</td>
<td style="text-align: right">37906</td>
<td style="text-align: right">27355</td>
<td style="text-align: right">22760</td>
<td style="text-align: right">34427</td>
</tr>
<tr>
<td style="text-align: left">Allendale County, South Carolina</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
</tr>
<tr>
<td style="text-align: left">Anderson County, South Carolina</td>
<td style="text-align: right">53169</td>
<td style="text-align: right">65881</td>
<td style="text-align: right">75444</td>
<td style="text-align: right">60166</td>
<td style="text-align: right">26608</td>
<td style="text-align: right">36694</td>
<td style="text-align: right">37254</td>
<td style="text-align: right">36297</td>
<td style="text-align: right">24384</td>
<td style="text-align: right">17835</td>
<td style="text-align: right">29280</td>
</tr>
<tr>
<td style="text-align: left">Bamberg County, South Carolina</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
</tr>
<tr>
<td style="text-align: left">Barnwell County, South Carolina</td>
<td style="text-align: right">44224</td>
<td style="text-align: right">59467</td>
<td style="text-align: right">70542</td>
<td style="text-align: right">54030</td>
<td style="text-align: right">19864</td>
<td style="text-align: right">25143</td>
<td style="text-align: right">18633</td>
<td style="text-align: right">45714</td>
<td style="text-align: right">18317</td>
<td style="text-align: right">13827</td>
<td style="text-align: right">21315</td>
</tr>
</tbody>
</table>
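<p>As an aside, the <code class="highlighter-rouge">geo.make</code> call above requests every county in the state, but the same interface can target finer geographies. The following sketch (not run here) fetches the same table for all census tracts in a single county; I believe county code 45 within South Carolina corresponds to Greenville County, but verify the FIPS code before relying on this:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># sketch only: tract-level version of the earlier fetch
# state 45 is South Carolina; county 45 should be Greenville County (verify the FIPS code)
gvl_tracts <- geo.make(state = 45, county = 45, tract = "*")
tract_income <- acs.fetch(geography = gvl_tracts, table.number = "B19126", endyear = 2015)
head(tract_income@estimate)</code></pre></figure>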
<h2 id="plotting-the-map-data">Plotting the map data</h2>
<p>If you have the <code class="highlighter-rouge">maps</code> and <code class="highlighter-rouge">ggplot2</code> packages, you already have the data you need to plot. We use the <code class="highlighter-rouge">map_data</code> function from <code class="highlighter-rouge">ggplot2</code> to pull in county shape data for South Carolina. (A previous attempt at this blog post had used the <code class="highlighter-rouge">ggmap</code> package, but there is an incompatibility between it and the latest <code class="highlighter-rouge">ggplot2</code> package at the time of this writing.)</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Want to understand how all the pieces fit together? Buy the
## ggplot2 book: http://ggplot2.org/book/</code></pre></figure>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">sc_map</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">map_data</span><span class="p">(</span><span class="s2">"county"</span><span class="p">,</span><span class="n">region</span><span class="o">=</span><span class="s2">"south.carolina"</span><span class="p">)</span><span class="w">
</span><span class="n">ggplot</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_polygon</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">long</span><span class="p">,</span><span class="n">y</span><span class="o">=</span><span class="n">lat</span><span class="p">,</span><span class="n">group</span><span class="o">=</span><span class="n">group</span><span class="p">),</span><span class="n">data</span><span class="o">=</span><span class="n">sc_map</span><span class="p">,</span><span class="n">colour</span><span class="o">=</span><span class="s2">"white"</span><span class="p">,</span><span class="n">fill</span><span class="o">=</span><span class="s2">"black"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">theme_minimal</span><span class="p">()</span></code></pre></figure>
<p><img src="/figures//2017-07-21-r-maps-with-census-data.Rmdunnamed-chunk-1-1.png" alt="plot of chunk unnamed-chunk-1" /></p>
<h2 id="merging-the-demographic-and-map-data">Merging the demographic and map data</h2>
<p>Now we have the demographic data and the map, but merging the two will take a little effort. The reason is that the map data gives a lower case representation of the county and calls it a “subregion”, while the Census data returns the county as “xxxx County, South Carolina”. I use the <code class="highlighter-rouge">dplyr</code> and <code class="highlighter-rouge">stringr</code> packages (for <code class="highlighter-rouge">str_replace</code>) to make short work of this merge.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">stringr</span><span class="p">)</span><span class="w">
</span><span class="n">merged</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">home_median_price</span><span class="o">@</span><span class="n">estimate</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">county_full</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rownames</span><span class="p">(</span><span class="n">.</span><span class="p">),</span><span class="w">
</span><span class="n">county</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">str_replace</span><span class="p">(</span><span class="n">county_full</span><span class="p">,</span><span class="s2">"(.+) County.*"</span><span class="p">,</span><span class="s2">"\\1"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">tolower</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">select</span><span class="p">(</span><span class="n">county</span><span class="p">,</span><span class="n">B</span><span class="m">19126</span><span class="err">_</span><span class="m">001</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename</span><span class="p">(</span><span class="n">med_income</span><span class="o">=</span><span class="n">B</span><span class="m">19126</span><span class="err">_</span><span class="m">001</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">right_join</span><span class="p">(</span><span class="n">sc_map</span><span class="p">,</span><span class="n">by</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"county"</span><span class="o">=</span><span class="s2">"subregion"</span><span class="p">))</span><span class="w">
</span><span class="n">knitr</span><span class="o">::</span><span class="n">kable</span><span class="p">(</span><span class="n">head</span><span class="p">(</span><span class="n">merged</span><span class="p">,</span><span class="m">10</span><span class="p">))</span></code></pre></figure>
<table>
<thead>
<tr>
<th style="text-align: left">county</th>
<th style="text-align: right">med_income</th>
<th style="text-align: right">long</th>
<th style="text-align: right">lat</th>
<th style="text-align: right">group</th>
<th style="text-align: right">order</th>
<th style="text-align: left">region</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">abbeville</td>
<td style="text-align: right">44918</td>
<td style="text-align: right">-82.24809</td>
<td style="text-align: right">34.41758</td>
<td style="text-align: right">1</td>
<td style="text-align: right">1</td>
<td style="text-align: left">south carolina</td>
</tr>
<tr>
<td style="text-align: left">abbeville</td>
<td style="text-align: right">44918</td>
<td style="text-align: right">-82.31685</td>
<td style="text-align: right">34.35455</td>
<td style="text-align: right">1</td>
<td style="text-align: right">2</td>
<td style="text-align: left">south carolina</td>
</tr>
<tr>
<td style="text-align: left">abbeville</td>
<td style="text-align: right">44918</td>
<td style="text-align: right">-82.31111</td>
<td style="text-align: right">34.33163</td>
<td style="text-align: right">1</td>
<td style="text-align: right">3</td>
<td style="text-align: left">south carolina</td>
</tr>
<tr>
<td style="text-align: left">abbeville</td>
<td style="text-align: right">44918</td>
<td style="text-align: right">-82.31111</td>
<td style="text-align: right">34.29152</td>
<td style="text-align: right">1</td>
<td style="text-align: right">4</td>
<td style="text-align: left">south carolina</td>
</tr>
<tr>
<td style="text-align: left">abbeville</td>
<td style="text-align: right">44918</td>
<td style="text-align: right">-82.28247</td>
<td style="text-align: right">34.26860</td>
<td style="text-align: right">1</td>
<td style="text-align: right">5</td>
<td style="text-align: left">south carolina</td>
</tr>
<tr>
<td style="text-align: left">abbeville</td>
<td style="text-align: right">44918</td>
<td style="text-align: right">-82.25955</td>
<td style="text-align: right">34.25142</td>
<td style="text-align: right">1</td>
<td style="text-align: right">6</td>
<td style="text-align: left">south carolina</td>
</tr>
<tr>
<td style="text-align: left">abbeville</td>
<td style="text-align: right">44918</td>
<td style="text-align: right">-82.24809</td>
<td style="text-align: right">34.21131</td>
<td style="text-align: right">1</td>
<td style="text-align: right">7</td>
<td style="text-align: left">south carolina</td>
</tr>
<tr>
<td style="text-align: left">abbeville</td>
<td style="text-align: right">44918</td>
<td style="text-align: right">-82.23663</td>
<td style="text-align: right">34.18266</td>
<td style="text-align: right">1</td>
<td style="text-align: right">8</td>
<td style="text-align: left">south carolina</td>
</tr>
<tr>
<td style="text-align: left">abbeville</td>
<td style="text-align: right">44918</td>
<td style="text-align: right">-82.24236</td>
<td style="text-align: right">34.15401</td>
<td style="text-align: right">1</td>
<td style="text-align: right">9</td>
<td style="text-align: left">south carolina</td>
</tr>
<tr>
<td style="text-align: left">abbeville</td>
<td style="text-align: right">44918</td>
<td style="text-align: right">-82.27674</td>
<td style="text-align: right">34.10818</td>
<td style="text-align: right">1</td>
<td style="text-align: right">10</td>
<td style="text-align: left">south carolina</td>
</tr>
</tbody>
</table>
<p>It’s now a simple matter to plot this merged dataset. In fact, we only have to tweak a few things from the first time we plotted the map data.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_polygon</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">long</span><span class="p">,</span><span class="n">y</span><span class="o">=</span><span class="n">lat</span><span class="p">,</span><span class="n">group</span><span class="o">=</span><span class="n">group</span><span class="p">,</span><span class="n">fill</span><span class="o">=</span><span class="n">med_income</span><span class="p">),</span><span class="n">data</span><span class="o">=</span><span class="n">merged</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">theme_minimal</span><span class="p">()</span></code></pre></figure>
<p><img src="/figures//2017-07-21-r-maps-with-census-data.Rmdunnamed-chunk-3-1.png" alt="plot of chunk unnamed-chunk-3" /></p>
<h2 id="discussion">Discussion</h2>
<p>It’s pretty easy to plot U.S. Census data on a map. The real power of Census data comes not just from plotting it, but from combining it with other geographically-based data (such as crime). The <code class="highlighter-rouge">acs</code> package in R makes it easy to obtain Census data, which can then be merged with other data using packages such as <code class="highlighter-rouge">dplyr</code> and <code class="highlighter-rouge">stringr</code> and then plotted with <code class="highlighter-rouge">ggplot2</code>. Hopefully the authors of the <code class="highlighter-rouge">ggmap</code> and <code class="highlighter-rouge">ggplot2</code> packages can work out their incompatibilities so that the above maps can be created on top of Google Maps or OpenStreetMap tiles.</p>
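<p>To make the combination idea concrete, here is a minimal sketch. It assumes a hypothetical data frame <code class="highlighter-rouge">crime_by_county</code> with a lowercase <code class="highlighter-rouge">county</code> column and a <code class="highlighter-rouge">crime_rate</code> column; both the name and the structure are invented for illustration:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># hypothetical: crime_by_county has columns county (lowercase names) and crime_rate
crime_map <- merged %>%
  left_join(crime_by_county, by = "county")
ggplot() +
  geom_polygon(aes(x = long, y = lat, group = group, fill = crime_rate),
               data = crime_map) +
  theme_minimal()</code></pre></figure>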
<p>It should be noted that while I obtained county-level information, aggregate data can be obtained at Census block and tract levels as well, if you are looking to do some sort of localized analysis.</p>John JohnsonPersonal data collection and analysis2017-04-17T00:00:00+00:002017-04-17T00:00:00+00:00https://randomjohn.github.io/personal-data-collection<h2 id="motivation-behind-this-example">Motivation behind this example</h2>
<p>I was diagnosed with sleep apnea last year, and have to use a continuous positive airway pressure (CPAP) machine to sleep well enough to feel alert during the day. The machine uploads data (via a cellular connection) to a website that gives me results for the last two weeks. This data includes both usage (time of usage, air leakage, number of times the mask was put on or taken off) and results (the apnea-hypopnea index, the average number of times per hour that breathing slowed or stopped for at least 10 seconds). Because the website only displays the last two weeks, I’d like to eventually do a long-term analysis. I’d also like to have things displayed my own way, because, well, I’m like that.</p>
<p>I could enter this information in a spreadsheet, and for import into R or other statistical software that might be the sensible thing to do. However, by keeping this data alongside other diary entries and the text surrounding it, I get to see it in the context of everything else going on in my life. This information does not exist in a vacuum, and it provides important context for other things. For instance, if I’m dealing with a particularly stressful situation, it would be nice to go back and see how I handled it in the context of how my sleeping was going (and vice versa: does the apnea get better or worse during that time?). Another issue is that I’m dealing with migraines, and I’d like to know something about their frequency and severity in the context of sleep.</p>
<h2 id="methodology-for-data-collection">Methodology for data collection</h2>
<p>This personal data collection exercise uses an excellent piece of software specifically for journaling called <a href="http://www.davidrm.com">The Journal</a>. I’ve been using The Journal since 2007 to record events and just simply jog my memory of goings on in my life. The software has a few nifty features that dovetail nicely with data collection.</p>
<h3 id="daily-entries">Daily entries</h3>
<p>The Journal splits writing up into categories. Categories can be either loose-leaf (where entries can be organized hierarchically any way you want) or daily (where entries are organized by the date of entry). If you set it up a certain way, you can have The Journal lock entries on every day except the day you are working on. It can also automatically create an entry for the day you are working on. Very handy for daily journaling in general.</p>
<h3 id="topics">Topics</h3>
<p>Topics are tags for specific pieces of text or entries. If you select a piece of text and tag it with a topic (say, CPAP), you can extract that piece of text later. Couple this with the Search by Topic command, and you can extract all text tagged with a certain topic into one document and save a single document with all text from that topic. So, for example, I will tag all my CPAP writings with the CPAP topic, and later on save a text file with what I have written about CPAP therapy (in this case, the data I collected).</p>
<h3 id="templates">Templates</h3>
<p>The Journal has a sophisticated template system that can not only insert the same text over and over, but also tag it automatically with a certain topic and even fill in certain data such as the current date and time. I use the template feature to create some structured text (a data entry form of sorts) and tag the whole piece of inserted text with the CPAP topic. That way, I don’t have to bother with selecting and tagging manually. I can simply insert the text and fill in the numbers when I read the website.</p>
<p>The template looks like this:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>Sleep numbers for <ENTRYDATE format=“mm/dd/yyyy”/>
* Usage:
* Leakage: L/min
* AHI: events/h
* Mask on/off:
* MyAir score:
* Comments:
</code></pre>
</div>
<p>Because the text follows the same structure for all such entries, it is easy to write R code to pull out the data and make a <code class="highlighter-rouge">data.frame</code>.</p>
<p>What you don’t see (and is hard to show here) is that in the template itself I selected all of the text and tagged it CPAP. That way, my CPAP entries will always be tagged, and I can easily extract them later.</p>
<h2 id="methodology-for-analysis">Methodology for analysis</h2>
<h3 id="data-extraction">Data extraction</h3>
<p>The first part of data extraction happens in The Journal. I use a saved search from the Search Entries by Topic function, then click View All Result Entries to see the text I had entered. The result is a screen showing the last 100 pieces of text I tagged CPAP (which may include other pieces of text if I felt the need to write on the topic); I can change this limit with an option. Clicking Save to File lets me save to a Journal file, an RTF file, or a TXT file. I save the result to a TXT file so that I can easily read it into R. The text file contains only the data I entered for the CPAP machine, plus any other text I tagged (which is fairly uncommon).</p>
<h3 id="data-import">Data import</h3>
<p>This is where I pay the price for putting the data in a diary rather than a tabular format. I use <code class="highlighter-rouge">read_lines</code> from the <code class="highlighter-rouge">readr</code> package.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">readr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">stringr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="n">raw_file</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"_Rmd/cpap.txt"</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">file.exists</span><span class="p">(</span><span class="n">raw_file</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="c1"># file.copy(raw_file,backup_file,copy.date = TRUE)
</span><span class="w"> </span><span class="n">raw_lines</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read_lines</span><span class="p">(</span><span class="n">raw_file</span><span class="p">)</span><span class="w">
</span><span class="n">data_row</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="n">cpap_df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">date</span><span class="o">=</span><span class="nf">c</span><span class="p">(),</span><span class="n">usage</span><span class="o">=</span><span class="nf">c</span><span class="p">(),</span><span class="n">leakage</span><span class="o">=</span><span class="nf">c</span><span class="p">(),</span><span class="n">events</span><span class="o">=</span><span class="nf">c</span><span class="p">(),</span><span class="n">mask</span><span class="o">=</span><span class="nf">c</span><span class="p">(),</span><span class="n">score</span><span class="o">=</span><span class="nf">c</span><span class="p">())</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">this_line</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">raw_lines</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">str_sub</span><span class="p">(</span><span class="n">this_line</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">18</span><span class="p">)</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"Sleep numbers for "</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">data_row</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data_row</span><span class="m">+1</span><span class="w">
</span><span class="n">cpap_df</span><span class="p">[</span><span class="n">data_row</span><span class="p">,</span><span class="s2">"date"</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.Date</span><span class="p">(</span><span class="n">str_sub</span><span class="p">(</span><span class="n">this_line</span><span class="p">,</span><span class="m">19</span><span class="p">),</span><span class="s2">"%m/%d/%Y"</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">str_sub</span><span class="p">(</span><span class="n">this_line</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">9</span><span class="p">)</span><span class="o">==</span><span class="s2">"* Usage: "</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">tm</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">str_extract</span><span class="p">(</span><span class="n">this_line</span><span class="p">,</span><span class="s2">"1?[0-9]{1}:[0-9]{2}"</span><span class="p">)</span><span class="w">
</span><span class="n">tm</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">str_split</span><span class="p">(</span><span class="n">tm</span><span class="p">,</span><span class="s2">":"</span><span class="p">)[[</span><span class="m">1</span><span class="p">]])</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="n">x</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="o">*</span><span class="m">60</span><span class="o">+</span><span class="n">x</span><span class="p">[</span><span class="m">2</span><span class="p">])</span><span class="w">
</span><span class="n">cpap_df</span><span class="p">[</span><span class="n">data_row</span><span class="p">,</span><span class="s2">"usage"</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tm</span><span class="w">
</span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">str_sub</span><span class="p">(</span><span class="n">this_line</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">11</span><span class="p">)</span><span class="o">==</span><span class="s2">"* Leakage: "</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">cpap_df</span><span class="p">[</span><span class="n">data_row</span><span class="p">,</span><span class="s2">"leakage"</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">str_extract</span><span class="p">(</span><span class="n">this_line</span><span class="p">,</span><span class="s2">"[0-9]+"</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">str_sub</span><span class="p">(</span><span class="n">this_line</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">15</span><span class="p">)</span><span class="o">==</span><span class="s2">"* Mask on/off: "</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">cpap_df</span><span class="p">[</span><span class="n">data_row</span><span class="p">,</span><span class="s2">"mask"</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">str_extract</span><span class="p">(</span><span class="n">this_line</span><span class="p">,</span><span class="s2">"[0-9]+"</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">str_sub</span><span class="p">(</span><span class="n">this_line</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">15</span><span class="p">)</span><span class="o">==</span><span class="s2">"* MyAir score: "</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">cpap_df</span><span class="p">[</span><span class="n">data_row</span><span class="p">,</span><span class="s2">"score"</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">str_extract</span><span class="p">(</span><span class="n">this_line</span><span class="p">,</span><span class="s2">"[0-9]+"</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">str_sub</span><span class="p">(</span><span class="n">this_line</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">7</span><span class="p">)</span><span class="o">==</span><span class="s2">"* AHI: "</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">cpap_df</span><span class="p">[</span><span class="n">data_row</span><span class="p">,</span><span class="s2">"events"</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">str_extract</span><span class="p">(</span><span class="n">this_line</span><span class="p">,</span><span class="s2">"[0-9\\.]+"</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">data_row</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">nrow</span><span class="p">(</span><span class="n">cpap_df</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="nf">all</span><span class="p">(</span><span class="nf">is.na</span><span class="p">(</span><span class="n">cpap_df</span><span class="p">[</span><span class="n">data_row</span><span class="p">,])))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">cpap_df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cpap_df</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="p">(</span><span class="n">data_row</span><span class="m">-1</span><span class="p">),]</span><span class="w">
</span><span class="n">data_row</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data_row</span><span class="m">-1</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># write_csv(cpap_df,csv_file)
</span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">cat</span><span class="p">(</span><span class="s2">"Oops! "</span><span class="p">,</span><span class="n">raw_file</span><span class="p">,</span><span class="s2">" does not exist!\n"</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>The commented-out code above writes the data frame out to a CSV (for easier processing in the future if I need it) and backs up the raw text file each time I run the analysis. (I run this analysis in an <a href="http://www.rstudio.com">R Studio</a> R notebook.)</p>
<p>This code is a brute-force conversion of the structured template text into a data frame: the text indicates the variable (column), and the row is given by a counter that is incremented each time a new entry, marked by the text “Sleep numbers for”, is read. The variables are identified in a rather old-fashioned way, with a long <code class="highlighter-rouge">if</code> … <code class="highlighter-rouge">else if</code> chain. The <code class="highlighter-rouge">str_sub</code> calls from the <code class="highlighter-rouge">stringr</code> package (base R functions would also work) look for the substrings that I know will be present thanks to the template feature in The Journal (and the hope that I don’t overwrite them when I record data), and <code class="highlighter-rouge">str_extract</code> pulls out the numerical digits for most lines, two numbers separated by a colon (i.e. a time) for the usage line, and digits or a decimal point for the AHI line. These are converted to appropriate dates and numeric values, with the exception of the usage time, which is converted to minutes.</p>
<p>The code above is slightly flawed in that it can produce records that are entirely missing, so the last <code class="highlighter-rouge">for</code> block steps through and eliminates those records.</p>
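<p>A more direct alternative to that <code class="highlighter-rouge">for</code> loop is a single vectorized filter; this is a sketch that assumes <code class="highlighter-rouge">cpap_df</code> as built above:</p>
<div class="highlighter-rouge"><pre class="highlight"><code># Sketch: keep only rows with at least one non-missing value
cpap_df <- cpap_df[rowSums(!is.na(cpap_df)) > 0, ]
</code></pre></div>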
<p>This is the most complicated part of the analysis! Once this data is in a <code class="highlighter-rouge">data.frame</code>, you can proceed as with any other data analysis.</p>
<h3 id="data-analysis">Data analysis</h3>
<p>I won’t focus too heavily on the data analysis here, but just to demonstrate here is a usage graph:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">cpap_df</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">as.Date</span><span class="p">(</span><span class="n">date</span><span class="p">,</span><span class="n">origin</span><span class="o">=</span><span class="s2">"1970-1-1"</span><span class="p">),</span><span class="n">usage</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_hline</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">yintercept</span><span class="o">=</span><span class="m">480</span><span class="p">),</span><span class="n">color</span><span class="o">=</span><span class="s2">"red"</span><span class="p">,</span><span class="n">lty</span><span class="o">=</span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">xlab</span><span class="p">(</span><span class="s2">"Date"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">ylab</span><span class="p">(</span><span class="s2">"Usage (minutes)"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_y_continuous</span><span class="p">(</span><span class="n">limits</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="kc">NA</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_x_date</span><span class="p">(</span><span class="n">date_labels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"%b %d"</span><span class="p">)</span></code></pre></figure>
<p><img src="/figures//2017-04-17-personal-data-collection.Rmdunnamed-chunk-1-1.png" alt="plot of chunk unnamed-chunk-1" /></p>
<p>No, you don’t get to see the other measures; how much I sleep is enough!</p>
<h2 id="discussion">Discussion</h2>
<p>While it might be easier in some ways to enter this data into a spreadsheet daily, I chose this method of personal data collection for several reasons:</p>
<ul>
<li>It allows me to put each day’s data into context, for example recording in prose whether that day was stressful, was a holiday, or had work pressures</li>
<li>It enables me to enter several kinds of data into a diary (e.g. migraine data, dietary data, exercise data), and use multiple extractions to correlate data</li>
</ul>
<p>Because I use The Journal almost daily, and because it has these sophisticated features, it serves as a central location for all sorts of personal data, so it became the natural place to record my sleep habits; it is also a good piece of software for recording other habits. Perhaps it can also supplement Fitbits, Garmins, and other kinds of personal habit data collection workflows.</p>John JohnsonInauguration speeches2017-01-28T00:00:00+00:002017-01-28T00:00:00+00:00https://randomjohn.github.io/tidy-text-inauguration-speeches<h2 id="acquiring-inauguration-speeches">Acquiring inauguration speeches</h2>
<p>Though not about Greenville especially, it might be interesting to quantitatively analyze inauguration speeches. This analysis will be done using two paradigms: the <code class="highlighter-rouge">tm</code> package and the <code class="highlighter-rouge">tidytext</code> package. We will read the speeches in a way that suits the <code class="highlighter-rouge">tidytext</code> package; later on we will use some tools from that package to reproduce analyses traditionally done with <code class="highlighter-rouge">tm</code>.</p>
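<p>As a tiny preview of the <code class="highlighter-rouge">tidytext</code> paradigm, the sketch below tokenizes a made-up phrase (not one of the actual speeches) into one word per row:</p>
<div class="highlighter-rouge"><pre class="highlight"><code># Sketch: tidytext turns free text into one-token-per-row data
library(dplyr)
library(tidytext)

df <- data.frame(line = 1, text = "four score and seven years ago",
                 stringsAsFactors = FALSE)
df %>% unnest_tokens(word, text)  # one row per word
</code></pre></div>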
<p>I looked around for inauguration speeches, and finally found them at <code class="highlighter-rouge">www.bartleby.com</code>. They are in a format meant more for human consumption, but with the <code class="highlighter-rouge">rvest</code> (harvest?) package we can read them in relatively easily. However, we need to map speech IDs to speakers (the newly inaugurated presidents), which is a little ugly and tedious.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">rvest</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">magrittr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">readr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tidytext</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tm</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="c1"># download and format data ------------------------------------------------
</span><span class="w">
</span><span class="n">fmt_string</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"http://www.bartleby.com/124/pres%d.html"</span><span class="w">
</span><span class="n">speakers</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read.csv</span><span class="p">(</span><span class="n">textConnection</span><span class="p">(</span><span class="s2">"Number,Speaker
13,George Washington
14,George Washington
15,John Adams
16,Thomas Jefferson
17,Thomas Jefferson
18,James Madison
19,James Madison
20,James Monroe
21,James Monroe
22,John Quincy Adams
23,Andrew Jackson
24,Andrew Jackson
25,Martin Van Buren
26,William Henry Harrison
27,James Knox Polk
28,Zachary Taylor
29,Franklin Pierce
30,James Buchanan
31,Abraham Lincoln
32,Abraham Lincoln
33,Ulysses S. Grant
34,Ulysses S. Grant
35,Rutherford B. Hayes
36,James A. Garfield
37,Grover Cleveland
38,Benjamin Harrison
39,Grover Cleveland
40,William McKinley
41,William McKinley
42,Theodore Roosevelt
43,William Howard Taft
44,Woodrow Wilson
45,Woodrow Wilson
46,Warren G. Harding
47,Calvin Coolidge
48,Herbert Hoover
49,Franklin D. Roosevelt
50,Franklin D. Roosevelt
51,Franklin D. Roosevelt
52,Franklin D. Roosevelt
53,Harry S. Truman
54,Dwight D. Eisenhower
55,Dwight D. Eisenhower
56,John F. Kennedy
57,Lyndon Baines Johnson
58,Richard Milhous Nixon
59,Richard Milhous Nixon
60,Jimmy Carter
61,Ronald Reagan
62,Ronald Reagan
63,George H. W. Bush
64,Bill Clinton
65,Bill Clinton
66,George W. Bush
67,George W. Bush
68,Barack Obama
69,Barack Obama
70,Donald Trump"</span><span class="p">))</span><span class="w">
</span><span class="c1"># read the speeches into a list of data.frames, append ID number in a new column
</span><span class="n">speeches</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">()</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">id</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">13</span><span class="o">:</span><span class="m">70</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">speech_html</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read_html</span><span class="p">(</span><span class="n">sprintf</span><span class="p">(</span><span class="n">fmt_string</span><span class="p">,</span><span class="n">id</span><span class="p">))</span><span class="w">
</span><span class="n">speech_lines</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">speech_html</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">html_nodes</span><span class="p">(</span><span class="s2">"table"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">extract</span><span class="p">(</span><span class="m">9</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">html_table</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">as.data.frame</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="n">X</span><span class="m">1</span><span class="p">,</span><span class="n">line</span><span class="o">=</span><span class="n">X</span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">id</span><span class="o">=</span><span class="nf">rep</span><span class="p">(</span><span class="n">id</span><span class="p">,</span><span class="n">nrow</span><span class="p">(</span><span class="n">.</span><span class="p">)))</span><span class="w">
</span><span class="n">speeches</span><span class="p">[[</span><span class="n">id</span><span class="m">-12</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">speech_lines</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># concatenate all the speeches and add speaker names
</span><span class="n">speech_df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">do.call</span><span class="p">(</span><span class="n">rbind</span><span class="p">,</span><span class="n">speeches</span><span class="p">)</span><span class="w">
</span><span class="n">speech_df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">speech_df</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">left_join</span><span class="p">(</span><span class="n">speakers</span><span class="p">,</span><span class="n">by</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"id"</span><span class="o">=</span><span class="s2">"Number"</span><span class="p">))</span></code></pre></figure>
<h2 id="first-analysis">First analysis</h2>
<p>Now that we have the speeches collected into a single data frame, we can start to analyze them. This post consists of a basic analysis based on the “bag of words” paradigm. More sophisticated analyses are possible, but even the basics can be interesting. First, we do a bit of data munging to create a one-record-per-word-per-speech dataset. The strategy is based on the <a href="http://juliasilge.com/blog/RStudio-Conf/">tidy text paradigm described here</a>. Once we have the dataset in the format we want, we can easily eliminate “uninteresting” words using a filtering <code class="highlighter-rouge">anti_join</code> from the <code class="highlighter-rouge">dplyr</code> package. (Note: there may be analyses where you would want to keep these so-called “stop words”, e.g. “a” and “the”, but for our purposes here we just get rid of them.)</p>
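<p>As a toy illustration (the little data frames here are invented for the example, not part of the analysis), <code class="highlighter-rouge">anti_join</code> keeps only the rows of the left table whose key has no match in the right table:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">library(dplyr)
# made-up token counts and a made-up stop list
tokens <- data.frame(word = c("the", "people", "of", "nation"), n = c(10, 3, 7, 2))
stops  <- data.frame(word = c("the", "of"))
anti_join(tokens, stops, by = "word")  # only the rows for "people" and "nation" survive</code></pre></figure>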
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">speech_words</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">speech_df</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">id</span><span class="o">=</span><span class="n">factor</span><span class="p">(</span><span class="n">id</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">unnest_tokens</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="n">text</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">count</span><span class="p">(</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">sort</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">ungroup</span><span class="p">()</span><span class="w">
</span><span class="n">total_words</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">speech_words</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">id</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">summarize</span><span class="p">(</span><span class="n">total</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">n</span><span class="p">))</span><span class="w">
</span><span class="n">speech_words</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">left_join</span><span class="p">(</span><span class="n">speech_words</span><span class="p">,</span><span class="w"> </span><span class="n">total_words</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">anti_join</span><span class="p">(</span><span class="n">stop_words</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">lexicon</span><span class="o">==</span><span class="s2">"onix"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">lexicon</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">union</span><span class="p">(</span><span class="n">data.frame</span><span class="p">(</span><span class="n">word</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"s"</span><span class="p">,</span><span class="s2">"so"</span><span class="p">))),</span><span class="n">by</span><span class="o">=</span><span class="s2">"word"</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Joining, by = "id"</code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Warning in union_data_frame(x, y): joining character vector and factor,
## coercing into character vector</code></pre></figure>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">speech_words</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">head</span><span class="p">()</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## # A tibble: 6 × 4
## id word n total
## <fctr> <chr> <int> <int>
## 1 26 power 47 8463
## 2 21 power 11 4476
## 3 29 power 11 3341
## 4 27 power 9 4813
## 5 36 power 9 2990
## 6 25 power 8 3902</code></pre></figure>
<p>We can now plot the most common words across inauguration speeches, just to dig into what the dataset looks like. Note that I polished this graph up a bit (changing axis labels to something prettier, rotating x-axis labels, etc.), but the first pass through this graph was a bit ugly. To me, the two most important elements of this graph are selecting the 20 most common words and re-ordering them from most frequent to least.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># find frequencies of words used in speeches
# we do this so we can reorder in ggplot2 (there may be a way to do directly in ggplot2 without this step)
</span><span class="n">speech_freq</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">speech_words</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">word</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">summarize</span><span class="p">(</span><span class="n">frequency</span><span class="o">=</span><span class="n">n</span><span class="p">())</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">frequency</span><span class="p">))</span><span class="w">
</span><span class="c1"># plot frequencies of words over all speeches, top 20 only, in order of frequency most to fewest
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">speech_freq</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">ungroup</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">slice</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">20</span><span class="p">),</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">reorder</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="n">desc</span><span class="p">(</span><span class="n">frequency</span><span class="p">))))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_bar</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="o">=</span><span class="n">frequency</span><span class="p">),</span><span class="n">stat</span><span class="o">=</span><span class="s2">"identity"</span><span class="p">,</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.8</span><span class="p">,</span><span class="w"> </span><span class="n">show.legend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Term Frequency Distribution in Presidential Inaugural Addresses"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">xlab</span><span class="p">(</span><span class="s2">"Word"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ylab</span><span class="p">(</span><span class="s2">"Frequency"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.text.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">angle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">45</span><span class="p">,</span><span class="w"> </span><span class="n">hjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">))</span></code></pre></figure>
<p><img src="/figures//2017-01-27-tidy-text-inauguration-speeches.Rmdunnamed-chunk-3-1.png" alt="plot of chunk unnamed-chunk-3" /></p>
<h2 id="what-makes-speeches-unique">What makes speeches unique</h2>
<p>At least within the bag-of-words paradigm, term-frequency * inverse-document-frequency (TF-IDF) analysis is used to determine what words set speeches (or other documents) apart from each other. A word in a given document has a high TF-IDF score if it appears very often in that speech but rarely in others. If a word appears less frequently in a speech, or appears more often in other speeches, that lowers its TF-IDF score. Thus, a word with a high TF-IDF score can be considered a signature word for that speech. Using this strategy for all interesting words, we can compare styles of speeches, and even cluster them into groups.</p>
<p>First, we use the <code class="highlighter-rouge">bind_tf_idf</code> function from <code class="highlighter-rouge">tidytext</code> to calculate the TF-IDF score. Then we can find the words with the highest TF-IDF score - the words that do the most to distinguish one inauguration speech from another.</p>
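<p>As a sanity check on what <code class="highlighter-rouge">bind_tf_idf</code> computes, here is the arithmetic by hand for the top row of the sorted output further down (a hand calculation added for illustration): the word “arrive” occurs once in speech 14, which has 54 words after stop-word removal, and it appears in only 1 of the 58 speeches.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># tf = count of the word in the document / total words in that document
tf  <- 1 / 54
# idf = natural log of (number of documents / documents containing the word)
idf <- log(58 / 1)
tf * idf  # about 0.0752, matching the tf_idf reported for "arrive" in speech 14</code></pre></figure>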
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">speech_words2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">speech_words</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">bind_tf_idf</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="n">speech_words2</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## # A tibble: 34,734 × 7
## id word n total tf idf tf_idf
## <fctr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 26 power 47 8463 0.015254787 0.2102954 0.0032080118
## 2 21 power 11 4476 0.006654567 0.2102954 0.0013994250
## 3 29 power 11 3341 0.008403361 0.2102954 0.0017671883
## 4 27 power 9 4813 0.004729375 0.2102954 0.0009945658
## 5 36 power 9 2990 0.007419621 0.2102954 0.0015603122
## 6 25 power 8 3902 0.004839685 0.2102954 0.0010177636
## 7 30 power 7 2834 0.006178288 0.2102954 0.0012992655
## 8 50 power 7 1823 0.009681881 0.2102954 0.0020360551
## 9 38 power 6 4397 0.003472222 0.2102954 0.0007301924
## [ reached getOption("max.print") -- omitted 1 row ]
## # ... with 34,724 more rows</code></pre></figure>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">speech_words2</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">total</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">tf_idf</span><span class="p">))</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## # A tibble: 34,734 × 6
## id word n tf idf tf_idf
## <fctr> <chr> <int> <dbl> <dbl> <dbl>
## 1 14 arrive 1 0.01851852 4.060443 0.07519339
## 2 14 upbraidings 1 0.01851852 4.060443 0.07519339
## 3 14 incurring 1 0.01851852 3.367296 0.06235733
## 4 14 violated 1 0.01851852 3.367296 0.06235733
## 5 14 willingly 1 0.01851852 3.367296 0.06235733
## 6 14 injunctions 1 0.01851852 2.961831 0.05484872
## 7 14 knowingly 1 0.01851852 2.961831 0.05484872
## 8 14 previous 1 0.01851852 2.961831 0.05484872
## 9 14 witnesses 1 0.01851852 2.961831 0.05484872
## 10 14 besides 1 0.01851852 2.674149 0.04952127
## # ... with 34,724 more rows</code></pre></figure>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">plot_inaug</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">speech_words2</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">tf_idf</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">word</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">levels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rev</span><span class="p">(</span><span class="n">unique</span><span class="p">(</span><span class="n">word</span><span class="p">))))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">left_join</span><span class="p">(</span><span class="n">speakers</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">id</span><span class="o">=</span><span class="n">factor</span><span class="p">(</span><span class="n">Number</span><span class="p">)),</span><span class="n">by</span><span class="o">=</span><span class="s2">"id"</span><span class="p">)</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">plot_inaug</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">tf_idf</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="m">0.025</span><span class="p">),</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">tf_idf</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Speaker</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_bar</span><span class="p">(</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.8</span><span class="p">,</span><span class="w"> </span><span class="n">stat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"identity"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Highest tf-idf words in Presidential Inauguration Speeches"</span><span class="p">,</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"tf-idf"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_flip</span><span class="p">()</span></code></pre></figure>
<p><img src="/figures//2017-01-27-tidy-text-inauguration-speeches.Rmdunnamed-chunk-4-1.png" alt="plot of chunk unnamed-chunk-4" /></p>
<p>Then we can do this analysis within each speech to find out what distinguishes it from the other speeches. The <code class="highlighter-rouge">for</code> loop below prints multiple pages of faceted graphs, which is handy when you are exploring in RStudio or the R GUI.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">plot_words</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">speech_words2</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">left_join</span><span class="p">(</span><span class="n">speakers</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">id</span><span class="o">=</span><span class="n">factor</span><span class="p">(</span><span class="n">Number</span><span class="p">)),</span><span class="n">by</span><span class="o">=</span><span class="s2">"id"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">Speaker</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">top_n</span><span class="p">(</span><span class="m">15</span><span class="p">,</span><span class="n">tf_idf</span><span class="p">)</span><span class="w">
</span><span class="n">speakers_vec</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">unique</span><span class="p">(</span><span class="n">plot_words</span><span class="o">$</span><span class="n">Speaker</span><span class="p">)</span><span class="w">
</span><span class="n">n_panel</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">4</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="nf">floor</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">speakers_vec</span><span class="p">)</span><span class="o">/</span><span class="n">n_panel</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">these_speakers</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">speakers_vec</span><span class="p">[((</span><span class="n">i</span><span class="m">-1</span><span class="p">)</span><span class="o">*</span><span class="n">n_panel</span><span class="m">+1</span><span class="p">)</span><span class="o">:</span><span class="nf">min</span><span class="p">(</span><span class="n">i</span><span class="o">*</span><span class="n">n_panel</span><span class="p">,</span><span class="nf">length</span><span class="p">(</span><span class="n">speakers_vec</span><span class="p">))]</span><span class="w">
</span><span class="n">this_plot</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">plot_words</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">Speaker</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="n">these_speakers</span><span class="p">),</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">tf_idf</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Speaker</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_bar</span><span class="p">(</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.8</span><span class="p">,</span><span class="w"> </span><span class="n">stat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"identity"</span><span class="p">,</span><span class="w"> </span><span class="n">show.legend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Highest tf-idf words in Inaugural Speeches"</span><span class="p">,</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"tf-idf"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">facet_wrap</span><span class="p">(</span><span class="o">~</span><span class="n">Speaker</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">scales</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"free"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_flip</span><span class="p">()</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">this_plot</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p><img src="/figures//2017-01-27-tidy-text-inauguration-speeches.Rmdunnamed-chunk-5-1.png" alt="plot of chunk unnamed-chunk-5" /><img src="/figures//2017-01-27-tidy-text-inauguration-speeches.Rmdunnamed-chunk-5-2.png" alt="plot of chunk unnamed-chunk-5" /><img src="/figures//2017-01-27-tidy-text-inauguration-speeches.Rmdunnamed-chunk-5-3.png" alt="plot of chunk unnamed-chunk-5" /><img src="/figures//2017-01-27-tidy-text-inauguration-speeches.Rmdunnamed-chunk-5-4.png" alt="plot of chunk unnamed-chunk-5" /><img src="/figures//2017-01-27-tidy-text-inauguration-speeches.Rmdunnamed-chunk-5-5.png" alt="plot of chunk unnamed-chunk-5" /><img src="/figures//2017-01-27-tidy-text-inauguration-speeches.Rmdunnamed-chunk-5-6.png" alt="plot of chunk unnamed-chunk-5" /><img src="/figures//2017-01-27-tidy-text-inauguration-speeches.Rmdunnamed-chunk-5-7.png" alt="plot of chunk unnamed-chunk-5" /><img src="/figures//2017-01-27-tidy-text-inauguration-speeches.Rmdunnamed-chunk-5-8.png" alt="plot of chunk unnamed-chunk-5" /><img src="/figures//2017-01-27-tidy-text-inauguration-speeches.Rmdunnamed-chunk-5-9.png" alt="plot of chunk unnamed-chunk-5" /></p>
<h2 id="which-speeches-are-most-like-each-other">Which speeches are most like each other?</h2>
<p>There’s a lot more that can be done here, but we’ll move on to clustering these inauguration speeches. This requires the document-term matrix: a matrix with documents in the rows, words in the columns, and entries giving the frequency of the column’s term within the row’s document. The <code class="highlighter-rouge">tidytext</code> package uses the <code class="highlighter-rouge">cast_dtm</code> function to create the document-term matrix, and the output can then be used by the <code class="highlighter-rouge">tm</code> package and other R commands for analysis.</p>
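<p>To see the shape of a document-term matrix on something small, here is a toy example (the data frame below is invented for illustration, not drawn from the speeches):</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">library(tidytext)
toy <- data.frame(doc  = c(1, 1, 2, 2),
                  word = c("power", "union", "power", "peace"),
                  n    = c(3, 1, 2, 4))
toy_dtm <- cast_dtm(toy, doc, word, n)
# a 2 x 3 matrix: documents in rows, words in columns, counts as entries;
# "union" is 0 in document 2 and "peace" is 0 in document 1
as.matrix(toy_dtm)</code></pre></figure>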
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">plot_words_dtm</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">speech_words</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">left_join</span><span class="p">(</span><span class="n">speakers</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">id</span><span class="o">=</span><span class="n">factor</span><span class="p">(</span><span class="n">Number</span><span class="p">)),</span><span class="n">by</span><span class="o">=</span><span class="s2">"id"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">cast_dtm</span><span class="p">(</span><span class="n">id</span><span class="p">,</span><span class="n">word</span><span class="p">,</span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="n">plot_words_dtm</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">removeSparseTerms</span><span class="p">(</span><span class="n">plot_words_dtm</span><span class="p">,</span><span class="m">0.1</span><span class="p">)</span><span class="w">
</span><span class="n">plot_words_matrix</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">(</span><span class="n">plot_words_dtm</span><span class="p">)</span></code></pre></figure>
<p>To show the hierarchical clustering analysis, we can simply compute a distance matrix, which can be fed into <code class="highlighter-rouge">hclust</code>:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">dist_matrix</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">dist</span><span class="p">(</span><span class="n">scale</span><span class="p">(</span><span class="n">plot_words_matrix</span><span class="p">),</span><span class="n">method</span><span class="o">=</span><span class="s2">"euclidean"</span><span class="p">)</span><span class="w">
</span><span class="n">inaug_clust</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">hclust</span><span class="p">(</span><span class="n">dist_matrix</span><span class="p">,</span><span class="n">method</span><span class="o">=</span><span class="s2">"ward.D"</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">inaug_clust</span><span class="p">)</span></code></pre></figure>
<p><img src="/figures//2017-01-27-tidy-text-inauguration-speeches.Rmdunnamed-chunk-7-1.png" alt="plot of chunk unnamed-chunk-7" /></p>
<p>It’s pretty interesting that Speech 26 is unlike nearly all the others. This was William Henry Harrison’s speech, which discussed the Roman aristocracy at length, something other presidents have not felt the need to do.</p>
<p>Let’s say we want to break these speeches into a given number of clusters. We can use the k-means approach.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">inaug_km</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">kmeans</span><span class="p">(</span><span class="n">plot_words_matrix</span><span class="p">,</span><span class="n">centers</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="n">nstart</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">25</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="nf">length</span><span class="p">(</span><span class="n">inaug_km</span><span class="o">$</span><span class="n">withinss</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="c1">#For each cluster, this defines the documents in that cluster
</span><span class="w"> </span><span class="n">inGroup</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">which</span><span class="p">(</span><span class="n">inaug_km</span><span class="o">$</span><span class="n">cluster</span><span class="o">==</span><span class="n">i</span><span class="p">)</span><span class="w">
</span><span class="n">within</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">plot_words_dtm</span><span class="p">[</span><span class="n">inGroup</span><span class="p">,]</span><span class="w">
</span><span class="k">if</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">inGroup</span><span class="p">)</span><span class="o">==</span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="n">within</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">as.matrix</span><span class="p">(</span><span class="n">within</span><span class="p">))</span><span class="w">
</span><span class="n">out</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">plot_words_dtm</span><span class="p">[</span><span class="o">-</span><span class="n">inGroup</span><span class="p">,]</span><span class="w">
</span><span class="n">words</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">within</span><span class="p">,</span><span class="m">2</span><span class="p">,</span><span class="n">mean</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">out</span><span class="p">,</span><span class="m">2</span><span class="p">,</span><span class="n">mean</span><span class="p">)</span><span class="w"> </span><span class="c1">#Take the difference in means for each term
</span><span class="w"> </span><span class="n">print</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s2">"Cluster"</span><span class="p">,</span><span class="w"> </span><span class="n">i</span><span class="p">),</span><span class="w"> </span><span class="n">quote</span><span class="o">=</span><span class="nb">F</span><span class="p">)</span><span class="w">
</span><span class="n">labels</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">order</span><span class="p">(</span><span class="n">words</span><span class="p">,</span><span class="w"> </span><span class="n">decreasing</span><span class="o">=</span><span class="nb">T</span><span class="p">)[</span><span class="m">1</span><span class="o">:</span><span class="m">20</span><span class="p">]</span><span class="w"> </span><span class="c1">#Take the top 20 Labels
</span><span class="w"> </span><span class="n">print</span><span class="p">(</span><span class="nf">names</span><span class="p">(</span><span class="n">words</span><span class="p">)[</span><span class="n">labels</span><span class="p">],</span><span class="w"> </span><span class="n">quote</span><span class="o">=</span><span class="nb">F</span><span class="p">)</span><span class="w"> </span><span class="c1">#From here down just labels
</span><span class="w"> </span><span class="k">if</span><span class="p">(</span><span class="n">i</span><span class="o">==</span><span class="nf">length</span><span class="p">(</span><span class="n">inaug_km</span><span class="o">$</span><span class="n">withinss</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="s2">"Cluster Membership"</span><span class="p">)</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">table</span><span class="p">(</span><span class="n">inaug_km</span><span class="o">$</span><span class="n">cluster</span><span class="p">))</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="s2">"Within cluster sum of squares by cluster"</span><span class="p">)</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">inaug_km</span><span class="o">$</span><span class="n">withinss</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] Cluster 1
## [1] people government country own citizens time
## [7] nation <NA> <NA> <NA> <NA> <NA>
## [13] <NA> <NA> <NA> <NA> <NA> <NA>
## [19] <NA> <NA>
## [1] Cluster 2
## [1] government people citizens time country nation
## [7] own <NA> <NA> <NA> <NA> <NA>
## [13] <NA> <NA> <NA> <NA> <NA> <NA>
## [19] <NA> <NA>
## [1] Cluster 3
## [1] nation time own people citizens country
## [7] government <NA> <NA> <NA> <NA> <NA>
## [13] <NA> <NA> <NA> <NA> <NA> <NA>
## [19] <NA> <NA>
## [1] Cluster 4
## [1] citizens country own nation time government
## [7] people <NA> <NA> <NA> <NA> <NA>
## [13] <NA> <NA> <NA> <NA> <NA> <NA>
## [19] <NA> <NA>
## [1] Cluster 5
## [1] government people citizens country own nation
## [7] time <NA> <NA> <NA> <NA> <NA>
## [13] <NA> <NA> <NA> <NA> <NA> <NA>
## [19] <NA> <NA>
## [1] "Cluster Membership"
##
## 1 2 3 4 5
## 8 12 19 16 3
## [1] "Within cluster sum of squares by cluster"
## [1] 760.3750 954.5833 1147.1579 733.8125 797.3333</code></pre></figure>
<p>Membership of speeches in clusters is here:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">inaug_km</span><span class="o">$</span><span class="n">cluster</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
## 4 4 1 4 4 4 4 2 2 2 4 2 1 5 5 4 3 2 1 4 4 4 2 1 2
## 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62
## 1 1 1 2 4 2 4 3 3 1 5 3 2 3 4 3 3 3 4 3 3 3 3 2 2
## 63 64 65 66 67 68 69 70
## 3 3 3 4 3 3 3 3</code></pre></figure>
<p>It’s interesting to note that all of the speeches since Hoover (i.e. 49 through 70) have been in either Cluster 1 or Cluster 5, with the latest ones in Cluster 1 (this includes Reagan, Bush, Clinton, Bush, Obama, and Trump). Nearly all speeches discuss the relationship between government and its people (as you would expect from an inauguration speech), but Cluster 5 seems to put more emphasis on people, and Cluster 1 on government. Hmmm…</p>
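<p>To see which presidents fall into each cluster, the cluster labels can be joined back to the <code class="highlighter-rouge">speakers</code> table. The following is a sketch, not code from the original analysis: it assumes the <code class="highlighter-rouge">speakers</code> data frame and its <code class="highlighter-rouge">Number</code> column used earlier, and relies on <code class="highlighter-rouge">kmeans</code> keeping the document ids as names on the cluster vector.</p>

```r
# Sketch: map k-means cluster labels back to speakers
# (assumes `speakers` with a `Number` column, as used earlier in the post)
cluster_df <- data.frame(id = names(inaug_km$cluster),
                         cluster = unname(inaug_km$cluster),
                         stringsAsFactors = FALSE)
speakers %>%
  mutate(id = as.character(Number)) %>%
  inner_join(cluster_df, by = "id") %>%
  arrange(cluster, id)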
<p>Of course, you can probably get something different with fewer clusters, and you can use the hierarchical clustering analysis above to justify a different number of clusters.</p>
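<p>For instance, the dendrogram above can be cut into a chosen number of groups with <code class="highlighter-rouge">cutree</code>. A minimal sketch, reusing the <code class="highlighter-rouge">inaug_clust</code> object from above (the choice of four clusters is arbitrary):</p>

```r
# Cut the Ward dendrogram into 4 groups (4 is an arbitrary illustration)
hc_groups <- cutree(inaug_clust, k = 4)
table(hc_groups)  # number of speeches in each group

# Outline the same 4 groups on the dendrogram
plot(inaug_clust)
rect.hclust(inaug_clust, k = 4, border = "red")
```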
<h2 id="sentiment-analysis">Sentiment analysis</h2>
<p>We return to the bag-of-words <code class="highlighter-rouge">tidytext</code> paradigm to do a sentiment analysis. The sentiment analysis we do here is very simple (perhaps oversimplified), and <code class="highlighter-rouge">tidytext</code> supports more sophisticated analyses, but this is a start. We go back to the one-record-per-speech data frame and score words based on sentiment. We don’t worry about stop words at this point, because they carry no sentiment and would be scored as 0 anyway. We use the Bing sentiment list, which classifies words as positive or negative (or neither). We assign a score of +1 to each positive word and -1 to each negative word, add up the score column, and divide by the number of words in the speech (which is why we did not eliminate stop words here). This gives a sort of average positivity/negativity score per word: a negative score means the speech has more negative words than positive, a positive score the reverse, and the larger the absolute value, the greater the imbalance. Similarly, we count the number of sentiment words (whether positive or negative) to get an idea of the emotional content of the speech. (Note: this is a preliminary analysis. It does not distinguish between, say, “good” and “not good”, so take any individual result with a grain of salt and dig deeper.)</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">sw_sent</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">speech_df</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">id</span><span class="o">=</span><span class="n">factor</span><span class="p">(</span><span class="n">id</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">unnest_tokens</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="n">text</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">inner_join</span><span class="p">(</span><span class="n">get_sentiments</span><span class="p">(</span><span class="s2">"bing"</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">score</span><span class="o">=</span><span class="p">(</span><span class="n">sentiment</span><span class="o">==</span><span class="s2">"positive"</span><span class="p">)</span><span class="o">-</span><span class="p">(</span><span class="n">sentiment</span><span class="o">==</span><span class="s2">"negative"</span><span class="p">),</span><span class="n">is_scored</span><span class="o">=</span><span class="n">ifelse</span><span class="p">(</span><span class="n">sentiment</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"positive"</span><span class="p">,</span><span class="s2">"negative"</span><span class="p">),</span><span class="m">1</span><span class="p">,</span><span class="m">0</span><span class="p">))</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Joining, by = "word"</code></pre></figure>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">sw_sent</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">Speaker</span><span class="p">,</span><span class="n">id</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">summarize</span><span class="p">(</span><span class="n">speech_score</span><span class="o">=</span><span class="nf">sum</span><span class="p">(</span><span class="n">score</span><span class="p">),</span><span class="n">speech_sent_words</span><span class="o">=</span><span class="nf">sum</span><span class="p">(</span><span class="n">is_scored</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">left_join</span><span class="p">(</span><span class="n">total_words</span><span class="p">,</span><span class="n">by</span><span class="o">=</span><span class="s2">"id"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">speech_score</span><span class="o">=</span><span class="n">speech_score</span><span class="o">/</span><span class="n">total</span><span class="p">,</span><span class="n">speech_sent_words</span><span class="o">=</span><span class="n">speech_sent_words</span><span class="o">/</span><span class="n">total</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">arrange</span><span class="p">(</span><span class="n">speech_score</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="n">nrow</span><span class="p">(</span><span class="n">.</span><span class="p">))</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Source: local data frame [58 x 5]
## Groups: Speaker [39]
##
## Speaker id speech_score speech_sent_words total
## <fctr> <fctr> <dbl> <dbl> <int>
## 1 Abraham Lincoln 32 0.001426534 0.07275321 701
## 2 Abraham Lincoln 31 0.002199615 0.06983778 3637
## 3 James Madison 19 0.010734930 0.09331131 1211
## 4 John F. Kennedy 56 0.010989011 0.10036630 1365
## 5 Franklin D. Roosevelt 50 0.011519473 0.08831596 1823
## 6 Woodrow Wilson 44 0.011716462 0.08787346 1707
## 7 Franklin D. Roosevelt 49 0.012227539 0.09409888 1881
## 8 William Henry Harrison 26 0.013115916 0.06865178 8463
## 9 Franklin D. Roosevelt 51 0.015613383 0.06022305 1345
## 10 Andrew Jackson 24 0.016992353 0.07306712 1177
## 11 Barack Obama 68 0.017827529 0.08499171 2412
## 12 Martin Van Buren 25 0.018452076 0.08867248 3902
## 13 Ronald Reagan 61 0.018457752 0.07752256 2438
## 14 Thomas Jefferson 17 0.019852262 0.07710065 2166
## [ reached getOption("max.print") -- omitted 44 rows ]</code></pre></figure>
<p>Grover Cleveland and James Madison had the speeches with the highest emotional content, followed by Jimmy Carter and George W. Bush. Wilson, Franklin D. Roosevelt, and George Washington had the lowest emotional content. Abraham Lincoln (in 1861) gave the speech with the least positive content (all speeches were positive on balance). William Henry Harrison’s odd speech about the Romans had nearly the least emotional content and was one of the least positive speeches.</p>
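<p>One way to eyeball these results is to plot the per-word sentiment score by speech. The sketch below assumes the summary pipeline above is saved to a variable; <code class="highlighter-rouge">sent_scores</code> is a name introduced here, not one from the original analysis:</p>

```r
library(ggplot2)

# Recompute the per-speech summary, keeping it in a variable this time
sent_scores <- sw_sent %>%
  group_by(Speaker, id) %>%
  summarize(speech_score = sum(score), speech_sent_words = sum(is_scored)) %>%
  left_join(total_words, by = "id") %>%
  mutate(speech_score = speech_score / total)

# Speeches ordered from least to most positive
ggplot(sent_scores, aes(x = reorder(id, speech_score), y = speech_score)) +
  geom_col() +
  coord_flip() +
  labs(x = "Speech id", y = "Net positive words per word")
```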
<h2 id="conclusion">Conclusion</h2>
<p>This analysis of inauguration speeches comes at a time when the change of US presidential power has a different feel, even in the inauguration speech itself. The preliminary analysis above shows that Trump’s speech was similar in topic to speeches of the last 40 or so years, with nothing notable in its emotional content.</p>
<p>This first pass revealed a few interesting patterns, but a more sophisticated analysis might reveal something further.</p>John JohnsonAcquiring inauguration speeches Though not about Greenville especially, it might be interesting to quantitatively analyze inauguration speeches. This analysis will be done using two paradigms: the tm package and the tidytext package. We will read the speeches in such a way that we use the tidytext package; later on we will use some tools from that package to make analyses traditionally done by tm. I looked around for inauguration speeches, and finally found them at www.bartelby.com. They are in a format more for human consumption, but with the use of the rvest (harvest?) package, we can read them in relatively easily. However, we need to do a mapping from speech IDs to speakers (newly inaugurated presidents), which is a little ugly and tedious.Greenville on Twitter2016-12-21T00:00:00+00:002016-12-21T00:00:00+00:00https://randomjohn.github.io/r-twitter<p>In this blog post, we use <a href="http://www.r-project.org">R</a> to analyze <a href="http://www.twitter.com">Twitter</a> data on topics of interest to Greenville, SC. We will describe obtaining, manipulating, and summarizing the data.</p>
<p><a href="http://www.twitter.com">Twitter</a> is a “microblogging” service where users can, usually publicly, share links, pictures, or short comments (up to 140 characters) onto a timeline. The public timeline consists of all public tweets, but people can build their own private timelines to narrow content to just what they want to see. (They do this by “following” users.) Over the years, many companies, news organizations, and users have considered the social media site essential for sharing news and other information. (Or cat memes.) Twitter has some organizational tools such as replies/conversation threads, mentions (i.e. naming other users using the @ notation), and hashtags (naming a topic using # notation). Twitter has encouraged the use of these organizational tools by automatically making mentions and hashtags clickable links.</p>
<p>These organizational tools can make for some interesting analysis. For instance, a game show may encourage viewers to vote on a winner using hashtags. On their end, the producers create a filter for a particular hashtag (e.g. #votemyplayer) and count votes. These tools also make Twitter data ripe for text mining (which Twitter itself uses to identify trending topics).</p>
<h2 id="obtaining-the-twitter-data">Obtaining the Twitter data</h2>
<p>Twitter makes it possible for software to obtain Twitter comments without having to resort to “web-scraping” techniques (i.e. downloading the data as a web page and then parsing the HTML). Instead, you can go through an Application Programming Interface (API) to obtain the comments directly. If you’re interested, Twitter has a whole <a href="https://dev.twitter.com/">subdomain</a> related to accessing their data, including documentation. There are a lot of technical details, but for the casual user probably the only ones of interest are the API key and rate limits. This post won’t fuss with rate limits, but more serious work may require some further understanding of these issues. However, you will need to create an API key. Follow <a href="http://bigcomputing.blogspot.com/2016/02/the-twitter-r-package-by-jeff-gentry-is.html">these instructions</a>, which are tailored for R users. The process essentially consists of creating a token at <a href="http://apps.twitter.com">Twitter’s app web site</a> and running an R function with the token. I set the variables <code class="highlighter-rouge">consumer_secret</code>, <code class="highlighter-rouge">consumer_key</code>, <code class="highlighter-rouge">access_token</code>, and <code class="highlighter-rouge">access_secret</code> in an R block by copying and pasting from the Twitter apps site; that block is not echoed in this blog post for obvious reasons.</p>
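<p>The credential block itself is nothing more than four string assignments; a placeholder version looks like this (the values shown are dummies, not working keys):</p>

```r
# Placeholders only -- paste the real values from your app at apps.twitter.com
consumer_key    <- "YOUR_CONSUMER_KEY"
consumer_secret <- "YOUR_CONSUMER_SECRET"
access_token    <- "YOUR_ACCESS_TOKEN"
access_secret   <- "YOUR_ACCESS_SECRET"
```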
<p>Fortunately, the <a href="https://cran.r-project.org/web/packages/twitteR/index.html">twitteR</a> package makes obtaining data from Twitter easy. It’s on CRAN, so grab it using <code class="highlighter-rouge">install.packages</code> (it will also install dependencies such as the <code class="highlighter-rouge">bit64</code> and <code class="highlighter-rouge">httr</code> packages if you don’t have them already) before moving on.</p>
<p>We authenticate our R program to Twitter and then start with searching the public timeline for “Greenville”. Note due to the changing nature of Twitter, your results will probably be different:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">origop</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">options</span><span class="p">(</span><span class="s2">"httr_oauth_cache"</span><span class="p">)</span><span class="w">
</span><span class="n">options</span><span class="p">(</span><span class="n">httr_oauth_cache</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">twitteR</span><span class="p">)</span><span class="w">
</span><span class="n">setup_twitter_oauth</span><span class="p">(</span><span class="n">consumer_key</span><span class="p">,</span><span class="w"> </span><span class="n">consumer_secret</span><span class="p">,</span><span class="w"> </span><span class="n">access_token</span><span class="p">,</span><span class="w"> </span><span class="n">access_secret</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">[1] "Using direct authentication"</code></pre></figure>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">options</span><span class="p">(</span><span class="n">httr_oauth_cache</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">origop</span><span class="p">)</span><span class="w">
</span><span class="n">gvl_twitter</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">searchTwitter</span><span class="p">(</span><span class="s2">"Greenville"</span><span class="p">)</span><span class="w">
</span><span class="n">gvl_twitter_df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">twListToDF</span><span class="p">(</span><span class="n">gvl_twitter</span><span class="p">)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">gvl_twitter_df</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text"> text
1 RT @JessLivMo: PROTEST: Greenville TODAY at 4pm! https://t.co/PtDqfV1iQi
2 Y'all, #texaspawprints is at the Pet Supplies on lower Greenville today. Their kitties areÂ… https://t.co/AVwAEvm9w5
3 RT @nssottile: What is Governor @henrymcmaster doing to get Greenville resident and Clemson Ph.D. #NazaninZinouri back home? #sctweets #MusÂ…
4 Can you recommend anyone for this #job in #Greenville, SC? https://t.co/bGoRU5wFqQ #Labor #Hiring #CareerArc
favorited favoriteCount replyToSN created truncated
1 FALSE 0 <NA> 2017-01-29 20:07:33 FALSE
2 FALSE 0 <NA> 2017-01-29 20:06:10 FALSE
3 FALSE 0 <NA> 2017-01-29 20:05:53 FALSE
4 FALSE 0 <NA> 2017-01-29 20:04:47 FALSE
replyToSID id replyToUID
1 <NA> 825797550087217152 <NA>
2 <NA> 825797204724088836 <NA>
3 <NA> 825797131420053505 <NA>
4 <NA> 825796855728451584 <NA>
statusSource
1 <a href="http://www.samruston.co.uk" rel="nofollow">Flamingo for Android</a>
2 <a href="http://linkis.com" rel="nofollow">Put your button on any page! </a>
3 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
4 <a href="http://www.tweetmyjobs.com" rel="nofollow">TweetMyJOBS</a>
screenName retweetCount isRetweet retweeted longitude
1 RatherBeGulfing 15 TRUE FALSE <NA>
2 KickButtVegan 0 FALSE FALSE -96.77009998
3 ClaireOfTarth 47 TRUE FALSE <NA>
4 tmj_grn_labor 0 FALSE FALSE -82.4536115
latitude
1 <NA>
2 32.81234958
3 <NA>
4 34.8268335
[ reached getOption("max.print") -- omitted 2 rows ]</code></pre></figure>
<p><code class="highlighter-rouge">searchTwitter</code> returns data as a list, which may or may not be desirable. By default, it returns the last 25 items matching the query you pass (this can be changed using the <code class="highlighter-rouge">n=</code> option to the function). I used <code class="highlighter-rouge">twListToDF</code> (part of the <code class="highlighter-rouge">twitteR</code> package) to convert the list to a data frame. The data frame contains a lot of useful information, such as the tweet text, whether it’s a reply and the tweet to which it replies, the screen name, and a date stamp. Thus, Twitter provides a rich data source on topics, interactions, and reactions.</p>
<h2 id="analyzing-the-data">Analyzing the data</h2>
<h3 id="retweets">Retweets</h3>
<p>The first thing to notice is that many of these tweets may be “retweets”, where a user posts the exact same tweet as a previous user to create a larger audience for the tweet. This data point may be interesting in its own right, but for now, because we are just analyzing the text, we will filter out retweets:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">
</span><span class="n">gvl_twitter_unique</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gvl_twitter_df</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="o">!</span><span class="n">isRetweet</span><span class="p">)</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">gvl_twitter_unique</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="n">text</span><span class="p">))</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text"> text
1 Y'all, #texaspawprints is at the Pet Supplies on lower Greenville today. Their kitties areÂ… https://t.co/AVwAEvm9w5
2 Can you recommend anyone for this #job in #Greenville, SC? https://t.co/bGoRU5wFqQ #Labor #Hiring #CareerArc
3 @Meghan_Trainor i miss u <ed><U+00A0><U+00BD><ed><U+00B8><U+00AD><ed><U+00A0><U+00BD><ed><U+00B8><U+00AD>\nCome back to Greenville soon <U+2764>
4 1/29/73 Greenville, SC. Bobby &amp; Terry Kay vs Freddy Sweetan/Mike DuBois, Johnny Weaver/Penny Banner vs The Alaskan/Â… https://t.co/hFH3iCeWLl
5 Interested in a #job in #Greenville, SC? This could be a great fit: https://t.co/yuufbTYC57 #IT #Hiring #CareerArc
6 Join the Robert Half Technology team! See our latest #job opening here: https://t.co/xCNLSvTkDQ #RHTechJobs #IT #Greenville, SC #Hiring
7 Want to work at Hubbell Incorporated? We're #hiring in #Greenville, SC! Click for details: https://t.co/dfTOjYDWG9 #Job #ProductMgmt #Jobs
8 I need a church in Greenville ASAP! If you have suggestions let me know
9 Wow! @Lyft pledges $1 million to @ACLU https://t.co/YTkkGeE5l6 #Lyft is finally in Greenville. App downloaded. #DeleteUber
10 Greenville, NC: 3:00 PM Temp: 53.5ºF Dew: 25.8ºF Pressure: 1008.2mb Rain: 0.00" #encwx #ncwx https://t.co/sZnc3rvVsm
11 Interested in a #job in #Dearborn, MI? This could be a great fit: https://t.co/GxLvH7wJ9J #Retail #Hiring #CareerArc
12 Want to work in #Greenville, NC? View our latest opening: https://t.co/aU11faXsxp #Job #Healthcare #Jobs #Hiring #CareerArc
13 Driving to Greenville, sharing real-time road info with wazers in my area. ETA 3:23 PM using @waze - Drive Social.
14 #Pursue #Bright #Career With The #Universities In #South #Australia\n\nhttps://t.co/ZD6nPXglKr\n#CheapFlights #Greenville
15 https://t.co/DVdFrLQwaF\nMy latest Greenville News column.
16 @igorvolsky Greenville-Spartanburg, South Carolina (GSP), today at 4:00.
17 Join the WHBM team! See our latest #job opening here: https://t.co/jd5D8zjYXW #Retail #Greenville, SC #Hiring #CareerArc
18 (Greenville, SC) I need to figure out what happened to my driver's license, or I'm going to los... https://t.co/CS4qMYpUkB
19 See our latest #Greenville, SC #job and click to apply: Mortgage Consultant (SAFE) - https://t.co/f9H5PO30Fe #Veterans #Hiring #CareerArc</code></pre></figure>
<p>The thing to notice here is that there are several different Greenvilles, which makes analysis of the local area pretty hard. Many of the tweets could be about Greenville, NC or Greenville, SC. In this particular dataset, there was even a Greenville Road in California (where there was a car fire). Rather than play a filtering game, it may be better to apply some knowledge specific to the area. For instance, local tweets are often tagged with <code class="highlighter-rouge">#yeahThatgreenville</code>. So we will search again for the <code class="highlighter-rouge">#yeahthatgreenville</code> hashtag (and add a few more tweets as well). This time, we’ll keep retweets:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">origop</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">options</span><span class="p">(</span><span class="s2">"httr_oauth_cache"</span><span class="p">)</span><span class="w">
</span><span class="n">options</span><span class="p">(</span><span class="n">httr_oauth_cache</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">setup_twitter_oauth</span><span class="p">(</span><span class="n">consumer_key</span><span class="p">,</span><span class="w"> </span><span class="n">consumer_secret</span><span class="p">,</span><span class="w"> </span><span class="n">access_token</span><span class="p">,</span><span class="w"> </span><span class="n">access_secret</span><span class="p">)</span><span class="w"> </span><span class="err">#</span><span class="w"> </span><span class="n">needed</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">knit</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">Rmd</span><span class="w"> </span><span class="n">file</span><span class="p">,</span><span class="w"> </span><span class="n">may</span><span class="w"> </span><span class="n">not</span><span class="w"> </span><span class="n">be</span><span class="w"> </span><span class="n">necessary</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">you</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">reauthenticate</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="n">session</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">[1] "Using direct authentication"</code></pre></figure>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">options</span><span class="p">(</span><span class="n">httr_oauth_cache</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">origop</span><span class="p">)</span><span class="w">
</span><span class="n">gvl_twitter_unique</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">searchTwitter</span><span class="p">(</span><span class="s2">"#yeahthatgreenville"</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">200</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">twListToDF</span><span class="p">()</span><span class="w">
</span><span class="n">gvl_twitter_nolink</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gvl_twitter_unique</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">text</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gsub</span><span class="p">(</span><span class="s2">"https?://[\\w\\./]+"</span><span class="p">,</span><span class="w">
</span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">text</span><span class="p">,</span><span class="w"> </span><span class="n">perl</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span></code></pre></figure>
<p>Here I do two separate queries and add them together using the <code class="highlighter-rouge">bind_rows</code> function from <code class="highlighter-rouge">dplyr</code>.</p>
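<p>That combination step is not shown in full above; a minimal sketch of it might look like the following, where the second search term is only a placeholder for whatever the second query actually was:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">library(dplyr)
library(twitteR)

# two separate searches, each converted to a data frame
q1 &lt;- searchTwitter("#yeahthatgreenville", n = 200) %&gt;% twListToDF()
q2 &lt;- searchTwitter("greenvillesc", n = 200) %&gt;% twListToDF()  # placeholder term

# stack the results and drop tweets that both queries returned
gvl_twitter_unique &lt;- bind_rows(q1, q2) %&gt;% distinct(id, .keep_all = TRUE)</code></pre></figure>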
<h3 id="who-is-tweeting">Who is tweeting</h3>
<p>The first thing we can do is get a list of users who tweet under this hashtag as well as their number of tweets:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="n">gvl_twitter_nolink</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">reorder</span><span class="p">(</span><span class="n">screenName</span><span class="p">,</span><span class="w"> </span><span class="n">screenName</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="nf">length</span><span class="p">(</span><span class="n">x</span><span class="p">))))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_bar</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.text.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">angle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">60</span><span class="p">,</span><span class="w"> </span><span class="n">hjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">xlab</span><span class="p">(</span><span class="s2">""</span><span class="p">)</span></code></pre></figure>
<p><img src="/figures//2016-12-21-r-twitter.Rmdunnamed-chunk-5-1.png" alt="plot of chunk unnamed-chunk-5" /></p>
<p>So I snuck a trick into the above graph. In bar charts presenting counts, I usually prefer to order the bars by descending length. That way I can identify the most and least common screen names quickly. I accomplish this by using <code class="highlighter-rouge">x = reorder(screenName, screenName, function(x) -length(x))</code> in the <code class="highlighter-rouge">aes()</code> function above. Now we can see that <code class="highlighter-rouge">@GiovanniDodd</code> was the most prolific tweeter in the last 200 tweets I accessed. Some of the prolific tweeters appear to be businesses, such as <code class="highlighter-rouge">@CourtyardGreenville</code>, or perhaps tourism accounts such as <code class="highlighter-rouge">@Greenville_SC</code>.</p>
<h3 id="what-users-are-saying">What users are saying</h3>
<p>To analyze what users are saying about “#yeahthatgreenville”, we use the <code class="highlighter-rouge">tidytext</code> package. There are a number of packages that can be used to analyze text, and <code class="highlighter-rouge">tm</code> used to be a favorite, but <code class="highlighter-rouge">tidytext</code> fits within the context of <a href="http://vita.had.co.nz/papers/tidy-data.pdf">tidy data</a>. We prefer the tidy data framework because it works with data in a specific format and has a number of powerful tools that have a specific focus but interoperate well, much like the UNIX ideal. Here, <code class="highlighter-rouge">tidytext</code> will allow us to use <code class="highlighter-rouge">dplyr</code> and similar tools using the pipe operator. The code will be easier to read and follow.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">tidytext</span><span class="p">)</span><span class="w">
</span><span class="n">tweet_words</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gvl_twitter_nolink</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">text</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">unnest_tokens</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w">
</span><span class="n">text</span><span class="p">)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">tweet_words</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text"> id word
1 825791100107505664 tonight
1.1 825791100107505664 at
1.2 825791100107505664 coffee
1.3 825791100107505664 underground
1.4 825791100107505664 1
1.5 825791100107505664 e</code></pre></figure>
<p>I used the <code class="highlighter-rouge">select</code> function from <code class="highlighter-rouge">dplyr</code> to keep only the <code class="highlighter-rouge">id</code> and <code class="highlighter-rouge">text</code> fields. The <code class="highlighter-rouge">unnest_tokens()</code> function creates a long dataset with one row per word in place of the full text. All the other fields remain unchanged. We can now easily create a bar chart of the words used the most:</p>
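<p>To see what <code class="highlighter-rouge">unnest_tokens()</code> does in isolation, here is a toy example on a made-up two-row data frame (not part of the Twitter data):</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">library(tidytext)
library(dplyr)

toy &lt;- data.frame(id = 1:2,
                  text = c("Coffee downtown tonight", "Yeah that Greenville"),
                  stringsAsFactors = FALSE)
toy %&gt;% unnest_tokens(word, text)
# one row per word ("coffee", "downtown", "tonight", ...), lowercased,
# with the id column carried along unchanged</code></pre></figure>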
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">tweet_words</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">reorder</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="nf">length</span><span class="p">(</span><span class="n">x</span><span class="p">))))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_bar</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.text.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">angle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">60</span><span class="p">,</span><span class="w"> </span><span class="n">hjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">xlab</span><span class="p">(</span><span class="s2">""</span><span class="p">)</span></code></pre></figure>
<p><img src="/figures//2016-12-21-r-twitter.Rmdunnamed-chunk-7-1.png" alt="plot of chunk unnamed-chunk-7" /></p>
<p>This plot is very busy, so we plot, say, the top 20 words:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">tweet_words</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">count</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">sort</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">slice</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">20</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">reorder</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w">
</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="n">n</span><span class="p">),</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_bar</span><span class="p">(</span><span class="n">stat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"identity"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.text.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">angle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">60</span><span class="p">,</span><span class="w">
</span><span class="n">hjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">xlab</span><span class="p">(</span><span class="s2">""</span><span class="p">)</span></code></pre></figure>
<p><img src="/figures//2016-12-21-r-twitter.Rmdunnamed-chunk-8-1.png" alt="plot of chunk unnamed-chunk-8" /></p>
<p>Unfortunately, this is terribly unexciting. <em>Of course</em> “a”, “to”, “for”, and similar words are going to be at the top. In text mining, we create a list of “stop words”, including these, which are so common they are usually not worth including in an analysis. The <code class="highlighter-rouge">tidytext</code> package includes a <code class="highlighter-rouge">stop_words</code> data frame to assist us:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">head</span><span class="p">(</span><span class="n">stop_words</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text"># A tibble: 6 × 2
word lexicon
<chr> <chr>
1 a SMART
2 a's SMART
3 able SMART
4 about SMART
5 above SMART
6 according SMART</code></pre></figure>
<p>We’ll change <code class="highlighter-rouge">stop_words</code> slightly to make it useful to us. This involves dropping the <code class="highlighter-rouge">lexicon</code> column and adding some common, uninteresting words: “https”, “t.co”, “yeahthatgreenville”, “amp”, and “gvl”. We filter these out for various reasons, e.g. “https” and “t.co” appear in URLs, “amp” is left over from tokenizing some HTML code, and we searched on “yeahthatgreenville”. Augmenting stop words is a bit of an iterative process, which I’m not showing here; I went back and forth a few times to get this list.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">my_stop_words</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">stop_words</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">lexicon</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">bind_rows</span><span class="p">(</span><span class="n">data.frame</span><span class="p">(</span><span class="n">word</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"https"</span><span class="p">,</span><span class="w">
</span><span class="s2">"t.co"</span><span class="p">,</span><span class="w"> </span><span class="s2">"yeahthatgreenville"</span><span class="p">,</span><span class="w"> </span><span class="s2">"amp"</span><span class="p">,</span><span class="w"> </span><span class="s2">"gvl"</span><span class="p">)))</span></code></pre></figure>
<p>Now, we can determine which of the words above are stop words and thus not worth analyzing:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">tweet_words_interesting</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tweet_words</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">anti_join</span><span class="p">(</span><span class="n">my_stop_words</span><span class="p">)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">tweet_words_interesting</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text"> id word
1 825791100107505664 tonight
2 825474845710376962 tonight
3 825443843445293057 tonight
4 825791100107505664 coffee
5 825791100107505664 coffee
6 825693272119054336 coffee</code></pre></figure>
<p>The <code class="highlighter-rouge">anti_join</code> function is probably not familiar to most data scientists or statisticians. It is, in a sense, the opposite of a merge. The command above matches the <code class="highlighter-rouge">tweet_words</code> and <code class="highlighter-rouge">my_stop_words</code> data frames and then <em>removes</em> the matching rows, leaving only the rows of <code class="highlighter-rouge">tweet_words</code> (the <code class="highlighter-rouge">id</code> and <code class="highlighter-rouge">word</code> columns) that do not match anything in <code class="highlighter-rouge">my_stop_words</code>. This is desirable because <code class="highlighter-rouge">my_stop_words</code> contains words we <em>do not</em> want to analyze.</p>
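<p>A toy example makes the <code class="highlighter-rouge">anti_join</code> semantics concrete:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">library(dplyr)

words &lt;- data.frame(id   = c(1, 1, 2),
                    word = c("the", "coffee", "the"),
                    stringsAsFactors = FALSE)
stops &lt;- data.frame(word = "the", stringsAsFactors = FALSE)

anti_join(words, stops, by = "word")
# only the row with "coffee" remains; both "the" rows are removed</code></pre></figure>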
<p>Now we can analyze the more interesting words:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">tweet_words_interesting</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">count</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">sort</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">slice</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">20</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">reorder</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w">
</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="n">n</span><span class="p">),</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_bar</span><span class="p">(</span><span class="n">stat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"identity"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.text.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">angle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">60</span><span class="p">,</span><span class="w">
</span><span class="n">hjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">xlab</span><span class="p">(</span><span class="s2">""</span><span class="p">)</span></code></pre></figure>
<p><img src="/figures//2016-12-21-r-twitter.Rmdunnamed-chunk-12-1.png" alt="plot of chunk unnamed-chunk-12" /></p>
<h2 id="sentiment-analysis">Sentiment analysis</h2>
<p>Sentiment analysis is, in short, the quantitative study of the emotional content of text. The most sophisticated analysis, of course, is very difficult, but we can make a start using a simple procedure. Many of the ideas here can be found in a <a href="https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html">vignette</a> for the package written by Julia Silge and David Robinson.</p>
<p>As a start, we use the Bing lexicon, which maps a word to positive/negative according to whether its sentiment content is positive or negative.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">bing_lex</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">get_sentiments</span><span class="p">(</span><span class="s2">"bing"</span><span class="p">)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">bing_lex</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text"># A tibble: 6 × 2
word sentiment
<chr> <chr>
1 2-faced negative
2 2-faces negative
3 a+ positive
4 abnormal negative
5 abolish negative
6 abominable negative</code></pre></figure>
<p>Sentiment analysis is then an exercise in joining: here a left join, which keeps every word and leaves the sentiment as <code class="highlighter-rouge">&lt;NA&gt;</code> when a word has no match in the lexicon:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">gvl_sentiment</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tweet_words_interesting</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">left_join</span><span class="p">(</span><span class="n">bing_lex</span><span class="p">)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">gvl_sentiment</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text"> id word sentiment
1 825791100107505664 tonight <NA>
2 825474845710376962 tonight <NA>
3 825443843445293057 tonight <NA>
4 825791100107505664 coffee <NA>
5 825791100107505664 coffee <NA>
6 825693272119054336 coffee <NA></code></pre></figure>
<p>Once you get to this point, sentiment analysis can start fairly easily:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">gvl_sentiment</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="o">!</span><span class="nf">is.na</span><span class="p">(</span><span class="n">sentiment</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">group_by</span><span class="p">(</span><span class="n">sentiment</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">summarise</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">())</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text"># A tibble: 2 × 2
sentiment n
<chr> <int>
1 negative 25
2 positive 96</code></pre></figure>
<p>There are many more positive words than negative words, so the mood tilts positive in our crude analysis. We can also group by tweet and see, on average, how many positive and negative words each tweet contains:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">gvl_sent_anly2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gvl_sentiment</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">group_by</span><span class="p">(</span><span class="n">sentiment</span><span class="p">,</span><span class="w"> </span><span class="n">id</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">summarise</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">())</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">ungroup</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">group_by</span><span class="p">(</span><span class="n">sentiment</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">summarise</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w">
</span><span class="n">gvl_sent_anly2</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text"># A tibble: 3 × 2
sentiment n
<chr> <dbl>
1 negative 1.041667
2 positive 1.333333
3 <NA> 6.361809</code></pre></figure>
<p>On average, there are about 1.33 positive words and 1.04 negative words per tweet (among tweets containing at least one such word), if you accept the assumptions of the above analysis.</p>
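<p>One natural extension, sketched here under the same assumptions, is a net sentiment score per tweet: the count of positive words minus the count of negative words. Tweets never matched by the lexicon drop out of this view.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">library(dplyr)
library(tidyr)

tweet_scores &lt;- gvl_sentiment %&gt;%
  filter(!is.na(sentiment)) %&gt;%
  count(id, sentiment) %&gt;%
  spread(sentiment, n, fill = 0) %&gt;%   # one column per sentiment
  mutate(score = positive - negative)  # net sentiment per tweet

# how many tweets lean negative, neutral, or positive
table(sign(tweet_scores$score))</code></pre></figure>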
<p>There is, of course, a lot more that can be done, but this will get you started. For some more sophisticated ideas you can check <a href="http://juliasilge.com/blog/Reddit-Responds/">Julia Silge’s analysis of Reddit data</a>, for instance. Another kind of analysis looking at sentiment and emotional content can be found <a href="https://mran.microsoft.com/posts/twitter.html">here</a> (with the caveat that it uses the predecessor to <code class="highlighter-rouge">dplyr</code> and thus runs somewhat less efficiently). Finally, it would probably be useful to supplement the above sentiment data frames with situation-specific sentiment analysis, such as making <code class="highlighter-rouge">goallllllll</code> in the above a positive word.</p>
<h2 id="conclusions">Conclusions</h2>
<p>The R packages <code class="highlighter-rouge">twitteR</code> and <code class="highlighter-rouge">tidytext</code> make analyzing content from Twitter easy. This is helpful if you want to analyze, for instance, real-time reactions to events. Above we pulled content from Twitter, split it into words, and analyzed words by frequency while eliminating “uninteresting” words. Then we analyzed whether tweets were on the whole positive or negative using pre-made lexicons mapping words to positive or negative.</p>
<p>John Johnson</p>
<p>In this blog post, we use R with Twitter data to analyze topics of interest to Greenville, SC. We will describe obtaining, manipulating, and summarizing the data. Twitter is a “microblogging” service where users can, usually publicly, share links, pictures, or short comments (up to 140 characters) onto a timeline. The public timeline consists of all public tweets, but people can build their own private timelines to narrow content to just what they want to see. (They do this by “following” users.) Over the years, many companies, news organizations, and users have come to consider the social media site essential for sharing news and other information. (Or cat memes.) Twitter has some organizational tools such as replies/conversation threads, mentions (i.e. naming other users using the @ notation), and hashtags (naming a topic using the # notation). Twitter has encouraged the use of these organizational tools by automatically making mentions and hashtags clickable links. These tools can make for some interesting analysis. For instance, a game show may encourage viewers to vote on a winner using hashtags. On their end, they create a filter for a particular hashtag (e.g. #votemyplayer) and count votes. This also makes Twitter data ripe for text mining (which Twitter uses to identify trending topics).</p>
<h2 id="obtaining-the-twitter-data">Obtaining the Twitter data</h2>
<p>Twitter makes it possible for software to obtain Twitter comments without having to resort to “web-scraping” techniques (i.e. downloading the data as a web page and then parsing the HTML). Instead, you can go through an Application Programming Interface (API) to obtain the comments directly. If you’re interested, Twitter has a whole subdomain related to accessing their data, including documentation. There are a lot of technical details, but for the casual user probably the only ones of interest are the API key and rate limits. This post won’t fuss with rate limits, but more serious work may require further understanding of these issues. However, you will need to create an API key. Follow these instructions, which are tailored for R users. It essentially consists of creating a token at Twitter’s app web site and running an R function with the token. I set the variables <code class="highlighter-rouge">consumer_secret</code>, <code class="highlighter-rouge">consumer_key</code>, <code class="highlighter-rouge">access_token</code>, and <code class="highlighter-rouge">access_secret</code> in an R block by copying and pasting from the Twitter apps site, not echoed in this blog post for obvious reasons.</p>
<h1 id="plotting-geojson-polygons-on-a-map-with-r">Plotting GeoJSON polygons on a map with R</h1>
<p>2016-12-16, https://randomjohn.github.io/r-geojson-gardens</p>
<p>In a <a href="2016-12-11-r-geojson-srt.html">previous post</a> we plotted some points, retrieved from a public dataset in GeoJSON format, on top of a Google Map of the area surrounding Greenville, SC. In this post we plot some public data in GeoJSON format as well, but instead of particular points, we plot polygons. Polygons describe an area rather than a single point. As before, to set up we do the following:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">rgdal</span><span class="p">)</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="n">require</span><span class="p">(</span><span class="n">geojsonio</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"geojsonio"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">geojsonio</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">sp</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">maps</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggmap</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">maptools</span><span class="p">)</span></code></pre></figure>
<h2 id="getting-the-data">Getting the data</h2>
<p>The data we are going to analyze consists of the city parks in Greenville, SC. Though this data is located in an ArcGIS system, there is a <a href="https://data.openupstate.org/maps/city-parks/parks.php">GeoJSON version</a> at <a href="http://data.openupstate.org">OpenUpstate</a>.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">data_url</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"https://data.openupstate.org/maps/city-parks/parks.php"</span><span class="w">
</span><span class="n">data_file</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"parks.geojson"</span><span class="w">
</span><span class="c1"># for some reason, I can't read from the url directly, though the tutorial
# says I can
</span><span class="n">download.file</span><span class="p">(</span><span class="n">data_url</span><span class="p">,</span><span class="w"> </span><span class="n">data_file</span><span class="p">)</span><span class="w">
</span><span class="n">data_park</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">geojson_read</span><span class="p">(</span><span class="n">data_file</span><span class="p">,</span><span class="w"> </span><span class="n">what</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"sp"</span><span class="p">)</span></code></pre></figure>
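<p>It can be worth sanity-checking what came back before plotting. With <code class="highlighter-rouge">what = "sp"</code>, <code class="highlighter-rouge">geojson_read</code> should return a spatial object from the <code class="highlighter-rouge">sp</code> package:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">class(data_park)   # expect "SpatialPolygonsDataFrame" for polygon data
length(data_park)  # number of parks (polygons) in the file
names(data_park)   # attribute columns attached to each park</code></pre></figure>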
<h2 id="analyzing-the-data">Analyzing the data</h2>
<p>First, we plot the data as before:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">data_park</span><span class="p">)</span></code></pre></figure>
<p><img src="/figures//2016-12-16-r-geojson-gardens.Rmdunnamed-chunk-2-1.png" alt="plot of chunk unnamed-chunk-2" /></p>
<p>While this was easy to do, it doesn’t give very much context. However, it does give the boundaries of the different parks. As before, we use the <code class="highlighter-rouge">ggmap</code> and <code class="highlighter-rouge">ggplot2</code> packages to give us some context. First, we download the appropriate map from Google.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">mapImage</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggmap</span><span class="p">(</span><span class="n">get_googlemap</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="n">lon</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">-82.394012</span><span class="p">,</span><span class="w"> </span><span class="n">lat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">34.852619</span><span class="p">),</span><span class="w"> </span><span class="n">scale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w">
</span><span class="n">zoom</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">11</span><span class="p">),</span><span class="w"> </span><span class="n">extent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"normal"</span><span class="p">)</span></code></pre></figure>
<p>I got the latitude and longitude by looking them up on Google, and then hand-tuned the scale and zoom.</p>
<p>A note of warning: if you do this with a recent version of <code class="highlighter-rouge">ggmap</code> and <code class="highlighter-rouge">ggplot2</code>, you may need to download the GitHub versions. See this <a href="http://stackoverflow.com/questions/40642850/ggmap-error-geomrasterann-was-built-with-an-incompatible-version-of-ggproto/40644348">Stackoverflow thread</a> for details.</p>
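<p>A sketch of how that install might look (the repository names are assumptions based on where the packages lived at the time; check the thread above for current advice):</p>
<figure class="highlight"><pre><code class="language-r">if (!require(devtools)) install.packages("devtools")
devtools::install_github("dkahle/ggmap")
devtools::install_github("hadley/ggplot2")</code></pre></figure>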
<p>Now, we prepare our spatial object for plotting. This is a more involved process than before, and requires the <code class="highlighter-rouge">fortify</code> command from the <code class="highlighter-rouge">ggplot2</code> package to make sure everything ends up in the right format:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">data_park_df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">fortify</span><span class="p">(</span><span class="n">data_park</span><span class="p">)</span></code></pre></figure>
<p>Now we can make the plot:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">print</span><span class="p">(</span><span class="n">mapImage</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_polygon</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">long</span><span class="p">,</span><span class="w"> </span><span class="n">lat</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">group</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data_park_df</span><span class="p">,</span><span class="w">
</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"green"</span><span class="p">))</span></code></pre></figure>
<p><img src="/figures//2016-12-16-r-geojson-gardens.Rmdunnamed-chunk-5-1.png" alt="plot of chunk unnamed-chunk-5" /></p>
<p>Note the use of the <code class="highlighter-rouge">group=</code> option in the <code class="highlighter-rouge">geom_polygon</code> function above. This tells <code class="highlighter-rouge">geom_polygon</code> that there are many polygons rather than just one. Without that option, you get a big mess:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">print</span><span class="p">(</span><span class="n">mapImage</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_polygon</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">long</span><span class="p">,</span><span class="w"> </span><span class="n">lat</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data_park_df</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"green"</span><span class="p">))</span></code></pre></figure>
<p><img src="/figures//2016-12-16-r-geojson-gardens.Rmdunnamed-chunk-6-1.png" alt="plot of chunk unnamed-chunk-6" /></p>
<h2 id="mashup-of-parking-convenient-to-swamp-rabbit-trail-and-city-parks">Mashup of parking convenient to Swamp Rabbit Trail and city parks</h2>
<p>Now, say you want to combine the city parks data with the parking places convenient to Swamp Rabbit Trail that was the subject of the last post. That is very easy using the <code class="highlighter-rouge">ggplot2</code> package. We get the data and manipulate it as last time:</p>
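<p>That code chunk isn’t echoed in this post; for reference, here is a recap of those steps as they appeared last time (a sketch, assuming the same URL, file, and object names as before):</p>
<figure class="highlight"><pre><code class="language-r">data_url &lt;- "https://data.openupstate.org/maps/swamp-rabbit-trail/parking/geojson.php"
data_file &lt;- "srt_parking.geojson"
download.file(data_url, data_file)
data_json &lt;- geojson_read(data_file, what = "sp")
data_df &lt;- as.data.frame(data_json)
names(data_df)[4:5] &lt;- c("lon", "lat")  # rename the coordinate columns</code></pre></figure>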
<p>Next, we use the layering feature of <code class="highlighter-rouge">ggplot2</code> to draw the map:</p>
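<p>The layering itself is just a sum of the two <code class="highlighter-rouge">geom</code> calls shown earlier, one per data source (a sketch assuming the object names from the previous chunks):</p>
<figure class="highlight"><pre><code class="language-r">print(mapImage +
  geom_polygon(aes(long, lat, group = group), data = data_park_df, colour = "green") +
  geom_point(aes(lon, lat), data = data_df))</code></pre></figure>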
<p><img src="/figures//2016-12-16-r-geojson-gardens.Rmdunnamed-chunk-8-1.png" alt="plot of chunk unnamed-chunk-8" /></p>
<h2 id="conclusions">Conclusions</h2>
<p>We continue to explore public geographical data by examining data that represents areas in addition to points, and we layer data from two sources onto a single map.</p>
John Johnson
<p>In a previous post we plotted some points, retrieved from a public dataset in GeoJSON format, on top of a Google Map of the area surrounding Greenville, SC. In this post we again plot public data in GeoJSON format, but instead of individual points we plot polygons, which describe areas rather than single points. As before, to set up we do the following:</p>
Plotting GeoJSON data on a map with R
2016-12-11T00:00:00+00:00
https://randomjohn.github.io/r-geojson-srt
<p>GeoJSON is a standard text-based data format for encoding geographical information, which relies on the JSON (JavaScript Object Notation) standard. There are a number of public datasets for Greenville, SC that use this format, and the <a href="http://www.r-project.org">R</a> programming language makes working with them easy. Install the <a href="https://ropensci.org/tutorials/geojsonio_tutorial.html">geojsonio</a> library, which is part of the <a href="https://ropensci.org">ROpenSci</a> family of packages.</p>
<p>In this post we plot some public data in GeoJSON format on top of a retrieved Google Map. To set up we do the following:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">rgdal</span><span class="p">)</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="n">require</span><span class="p">(</span><span class="n">geojsonio</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"geojsonio"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">geojsonio</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">sp</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">maps</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggmap</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">maptools</span><span class="p">)</span></code></pre></figure>
<p>I wrapped the <code class="highlighter-rouge">geojsonio</code> load in a <code class="highlighter-rouge">require</code> call because the package may not be installed on your system. The <code class="highlighter-rouge">geojsonio</code> package takes most of the work out of dealing with GeoJSON data, allowing you to concentrate on your analysis rather than on data manipulation. There is still some data manipulation to be done, as seen below, but it’s fairly lightweight.</p>
<h2 id="getting-the-data">Getting the data</h2>
<p>The data we are going to analyze consists of the convenient parking locations for access to the Swamp Rabbit Trail running between Greenville, SC and Traveler’s Rest, SC. Though this data is located in an ArcGIS system, there is a <a href="https://data.openupstate.org/maps/swamp-rabbit-trail/parking/geojson.php">GeoJSON version</a> at <a href="http://data.openupstate.org">OpenUpstate</a>.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">data_url</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"https://data.openupstate.org/maps/swamp-rabbit-trail/parking/geojson.php"</span><span class="w">
</span><span class="n">data_file</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"srt_parking.geojson"</span><span class="w">
</span><span class="c1"># for some reason, I can't read from the url directly, though the tutorial
# says I can
</span><span class="n">download.file</span><span class="p">(</span><span class="n">data_url</span><span class="p">,</span><span class="w"> </span><span class="n">data_file</span><span class="p">)</span><span class="w">
</span><span class="n">data_json</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">geojson_read</span><span class="p">(</span><span class="n">data_file</span><span class="p">,</span><span class="w"> </span><span class="n">what</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"sp"</span><span class="p">)</span></code></pre></figure>
<p>Theoretically, you can use <code class="highlighter-rouge">geojson_read</code> to get the data from the URL directly; however, this failed for me. I’m not sure why the two-step process with <code class="highlighter-rouge">download.file</code> followed by <code class="highlighter-rouge">geojson_read</code> works when the direct call doesn’t, but it is probably a good idea to download your data first in most cases anyway. The <code class="highlighter-rouge">what="sp"</code> option tells <code class="highlighter-rouge">geojson_read</code> to return the data as a spatial object. Once the data is in a spatial object, we can analyze it however we wish and forget about the original data format.</p>
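<p>If you’d still like to try the URL first, a hedged pattern is to fall back to the two-step download only when the direct read fails:</p>
<figure class="highlight"><pre><code class="language-r">data_json &lt;- tryCatch(
  geojson_read(data_url, what = "sp"),  # direct read; this failed for me
  error = function(e) {
    download.file(data_url, data_file)  # fall back to a local copy
    geojson_read(data_file, what = "sp")
  }
)</code></pre></figure>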
<h2 id="analyzing-the-data">Analyzing the data</h2>
<p>The first thing you can do is plot the data, and the <code class="highlighter-rouge">plot</code> command makes that easy. Behind the scenes, <code class="highlighter-rouge">plot</code> detects that it is dealing with a spatial object and dispatches to the plot method from the <code class="highlighter-rouge">sp</code> package, but we just issue a simple command:</p>
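<p>You can see this dispatch for yourself by inspecting the object’s class (the exact class name depends on the geometry in the file, so treat this as a sketch):</p>
<figure class="highlight"><pre><code class="language-r">class(data_json)                 # e.g. "SpatialPointsDataFrame" for point data
inherits(data_json, "Spatial")   # TRUE, so plot() uses sp's plot method</code></pre></figure>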
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">data_json</span><span class="p">)</span></code></pre></figure>
<p><img src="/figures//2016-12-11-r-geojson-srt.Rmdunnamed-chunk-2-1.png" alt="plot of chunk unnamed-chunk-2" /></p>
<p>Unfortunately, this plot is not very helpful because it simply plots the points without any context. So we use the <code class="highlighter-rouge">ggmap</code> and <code class="highlighter-rouge">ggplot2</code> packages to give us some context. First, we download the right map from Google.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">mapImage</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggmap</span><span class="p">(</span><span class="n">get_googlemap</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="n">lon</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">-82.394012</span><span class="p">,</span><span class="w"> </span><span class="n">lat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">34.852619</span><span class="p">),</span><span class="w"> </span><span class="n">scale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w">
</span><span class="n">zoom</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">11</span><span class="p">),</span><span class="w"> </span><span class="n">extent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"normal"</span><span class="p">)</span></code></pre></figure>
<p>I got the latitude and longitude by looking them up on Google, and then hand-tuned the scale and zoom.</p>
<p>A note of warning: if you do this with a recent version of <code class="highlighter-rouge">ggmap</code> and <code class="highlighter-rouge">ggplot2</code>, you may need to download the GitHub versions. See this <a href="http://stackoverflow.com/questions/40642850/ggmap-error-geomrasterann-was-built-with-an-incompatible-version-of-ggproto/40644348">Stackoverflow thread</a> for details.</p>
<p>Now, we prepare our spatial object for plotting:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">data_df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">data_json</span><span class="p">)</span><span class="w">
</span><span class="nf">names</span><span class="p">(</span><span class="n">data_df</span><span class="p">)[</span><span class="m">4</span><span class="o">:</span><span class="m">5</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"lon"</span><span class="p">,</span><span class="w"> </span><span class="s2">"lat"</span><span class="p">)</span></code></pre></figure>
<p>There’s really no output from this. I suppose the renaming step isn’t necessary, but I believe in descriptive labels.</p>
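<p>If you’d rather not depend on column positions, you can rename by name instead; this assumes the coordinate columns come back as <code class="highlighter-rouge">coords.x1</code> and <code class="highlighter-rouge">coords.x2</code>, which is what <code class="highlighter-rouge">as.data.frame</code> typically produces for a spatial points object:</p>
<figure class="highlight"><pre><code class="language-r">names(data_df)[names(data_df) == "coords.x1"] &lt;- "lon"
names(data_df)[names(data_df) == "coords.x2"] &lt;- "lat"</code></pre></figure>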
<p>Now we can make the plot:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">print</span><span class="p">(</span><span class="n">mapImage</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_point</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">lon</span><span class="p">,</span><span class="w"> </span><span class="n">lat</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data_df</span><span class="p">))</span></code></pre></figure>
<p><img src="/figures//2016-12-11-r-geojson-srt.Rmdunnamed-chunk-5-1.png" alt="plot of chunk unnamed-chunk-5" /></p>
<p>It may be helpful to add labels based on the name of the location, given in the ‘title’ field:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">mapImage</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_point</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">lon</span><span class="p">,</span><span class="w"> </span><span class="n">lat</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data_df</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_text</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">lon</span><span class="p">,</span><span class="w"> </span><span class="n">lat</span><span class="p">,</span><span class="w">
</span><span class="n">label</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="n">hjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">vjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data_df</span><span class="p">,</span><span class="w"> </span><span class="n">check_overlap</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre></figure>
<p><img src="/figures//2016-12-11-r-geojson-srt.Rmdunnamed-chunk-6-1.png" alt="plot of chunk unnamed-chunk-6" /></p>
<p>Here, I use <code class="highlighter-rouge">geom_text</code> to make the labels; I tweaked the options by hand using the help page.</p>
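<p>If <code class="highlighter-rouge">check_overlap = TRUE</code> drops too many labels, the <code class="highlighter-rouge">ggrepel</code> package (a separate install, so this is only a sketch) nudges overlapping labels apart instead of discarding them:</p>
<figure class="highlight"><pre><code class="language-r">if (!require(ggrepel)) {
  install.packages("ggrepel")
  library(ggrepel)
}
mapImage + geom_point(aes(lon, lat), data = data_df) +
  geom_text_repel(aes(lon, lat, label = title), data = data_df)</code></pre></figure>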
<h2 id="conclusions">Conclusions</h2>
<p>GeoJSON data is becoming more popular, especially in public data. The <code class="highlighter-rouge">geojsonio</code> package makes working with such data trivial. Once the data is in a spatial data format, R’s wide variety of spatial data tools are available.</p>John Johnson