<h1>How we voted in South Carolina</h1>
<p>2017-08-04, by John Johnson</p>
<h1 id="purpose">Purpose</h1>
<p>This post seeks to explore how Greenville, SC and surrounding areas voted in the 2016 election. It also demonstrates how to retrieve data from the <a href="http://data.world">Data.World</a> site. To retrieve data from this site using the tools in this post, you have to create an account (easy to do if you have a Facebook, Twitter, or Github account). You can then get your own API key from your profile page. Furthermore, from R, you will need to get the <code class="highlighter-rouge">data.world</code> package (<code class="highlighter-rouge">install.packages("data.world")</code>). You can then load the API key into R using <code class="highlighter-rouge">saved_cfg <- data.world::save_config("YOUR_API_KEY")</code>. This is the same <code class="highlighter-rouge">saved_cfg</code> used below.</p>
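<p>The one-time setup just described can be sketched as follows; <code class="highlighter-rouge">"YOUR_API_KEY"</code> is a placeholder for the key from your own data.world profile page:</p>

```r
# One-time data.world setup, as described above.
# "YOUR_API_KEY" is a placeholder -- substitute your own key.
install.packages("data.world")                        # once per machine
saved_cfg <- data.world::save_config("YOUR_API_KEY")  # stores the key
data.world::set_config(saved_cfg)                     # activates it for this session
```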
<p>Furthermore, for purposes of map display, we need the shapes of the voting precincts in SC. I found one such collection in <a href="https://github.com/nvkelso/election-geodata">user nvkelso</a>’s election-geodata repository on Github. I downloaded it as a zip file and extracted it into a local directory I named <code class="highlighter-rouge">precinct_shp</code>. There are versions of these shape files for most states. I use the <code class="highlighter-rouge">readOGR</code> function from the <code class="highlighter-rouge">rgdal</code> package to read them in.</p>
<h1 id="setup-and-acquiring-data">Setup and acquiring data</h1>
<p>First, we load the shape files downloaded from the Github repository above. The <code class="highlighter-rouge">readOGR</code> function was a little finicky: calling it with <code class="highlighter-rouge">precinct_shp</code> (the directory of the shape files) directly gave errors. Eventually, I gave up, changed the working directory to read in the shape files, and then changed it back. Note that if you’re doing this in an R Notebook or R Markdown file, you’ll get some messages about how changing the working directory works inside a notebook chunk.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">data.world</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">rgdal</span><span class="p">)</span><span class="w">
</span><span class="n">set_config</span><span class="p">(</span><span class="n">saved_cfg</span><span class="p">)</span><span class="w"> </span><span class="c1"># saved_cfg was set in an invisible block which has my API key
</span><span class="n">owd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">getwd</span><span class="p">()</span><span class="w">
</span><span class="n">setwd</span><span class="p">(</span><span class="n">precinct_shp</span><span class="p">)</span><span class="w">
</span><span class="n">precinct_shapes</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">readOGR</span><span class="p">(</span><span class="s2">"."</span><span class="p">,</span><span class="s2">"Statewide"</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## OGR data source with driver: ESRI Shapefile
## Source: ".", layer: "Statewide"
## with 2155 features
## It has 4 fields</code></pre></figure>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">setwd</span><span class="p">(</span><span class="n">owd</span><span class="p">)</span></code></pre></figure>
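<p>As an aside, the working-directory dance can sometimes be avoided by passing the directory as the <code class="highlighter-rouge">dsn</code> argument; in my experience whether this works depends on the rgdal version and how the path is written, so treat this as an untested sketch:</p>

```r
library(rgdal)
# Equivalent call without changing the working directory; "precinct_shp"
# is the same local folder of extracted shapefiles used above.
precinct_shapes <- readOGR(dsn = "precinct_shp", layer = "Statewide")
```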
<p>To download the election data, we use the simplified commands from the <code class="highlighter-rouge">dwapi</code> package (automatically loaded by <code class="highlighter-rouge">data.world</code>). The <code class="highlighter-rouge">list_tables</code> command lists the available data tables for a dataset. Here there are two: one for the election itself and one for registration. For exploration, we download both. The data live under user @tamilyn in the dataset named south-carolina-election-data. If you use Python or another common data analysis tool, data.world has released tools that connect your tool of choice to their API. As of the date of this blog post, the two files totaled about 850 kB.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">ds_url</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"tamilyn/south-carolina-election-data"</span><span class="w">
</span><span class="n">election_tables</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">dwapi</span><span class="o">::</span><span class="n">list_tables</span><span class="p">(</span><span class="n">ds_url</span><span class="p">)</span><span class="w">
</span><span class="n">election_df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">dwapi</span><span class="o">::</span><span class="n">download_table_as_data_frame</span><span class="p">(</span><span class="n">ds_url</span><span class="p">,</span><span class="n">election_tables</span><span class="p">[</span><span class="m">1</span><span class="p">])</span><span class="w">
</span><span class="n">regis_df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">dwapi</span><span class="o">::</span><span class="n">download_table_as_data_frame</span><span class="p">(</span><span class="n">ds_url</span><span class="p">,</span><span class="n">election_tables</span><span class="p">[</span><span class="m">2</span><span class="p">])</span></code></pre></figure>
<h1 id="showing-the-data">Showing the data</h1>
<p>Plotting the precinct data using the standard R tools is easy:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">precinct_shapes</span><span class="p">)</span></code></pre></figure>
<p><img src="/figures//2017-08-04-how-we-voted-in-greenville-sc.Rmdunnamed-chunk-3-1.png" alt="plot of chunk unnamed-chunk-3" /></p>
<p>This is because <code class="highlighter-rouge">plot</code> “knows” what to do with shape data. In fact, we can explore this a little bit further:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="nf">class</span><span class="p">(</span><span class="n">precinct_shapes</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"</code></pre></figure>
<p>This shows that we have a <code class="highlighter-rouge">SpatialPolygonsDataFrame</code>, a spatial data class defined by the <code class="highlighter-rouge">sp</code> package. Behind the scenes, <code class="highlighter-rouge">plot</code> is calling a method defined just for these objects, which tells it how to render this kind of shape data. We could also have done this (and have, on this blog) with the <code class="highlighter-rouge">ggplot2</code> package, but for this kind of exploration the quick plot is good. So don’t give up completely on base R graphics.</p>
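<p>The dispatch mechanism is easy to see in miniature with S3 classes (this toy class is made up for illustration; sp’s own plot method is S4, but the idea is the same):</p>

```r
# summary() picks a method based on the object's class, just as plot()
# picks the sp-provided method when handed a SpatialPolygonsDataFrame.
x <- structure(list(), class = "myshape")
summary.myshape <- function(object, ...) "custom summary for myshape"
summary(x)
## [1] "custom summary for myshape"
```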
<p>The election data has a bit of an odd structure.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">DT</span><span class="o">::</span><span class="n">datatable</span><span class="p">(</span><span class="n">election_df</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">row</span><span class="o">=</span><span class="m">1</span><span class="o">:</span><span class="n">nrow</span><span class="p">(</span><span class="n">.</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="n">row</span><span class="p">,</span><span class="n">everything</span><span class="p">())</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">slice</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">5</span><span class="p">,</span><span class="m">103</span><span class="o">:</span><span class="m">107</span><span class="p">)))</span></code></pre></figure>
<p><img src="/figures//2017-08-04-how-we-voted-in-greenville-sc.Rmdunnamed-chunk-5-1.png" alt="plot of chunk unnamed-chunk-5" /></p>
<p>So the fancy stuff I did with <code class="highlighter-rouge">datatable</code> above was basically to show the original row numbers and print them to the left of all the other variables. If I had not used <code class="highlighter-rouge">select(row, everything())</code>, the row numbers would have printed last. This is a nice example of quick custom column ordering. But that’s not why we’re here. The election file records votes in a rather raw fashion. Specifically, I want to tally the votes for the different presidential candidates. That’s not entirely straightforward, because I have to count the straight-ticket voters as well as the split-ticket voters who specifically marked a presidential choice on the ballot.</p>
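<p>For readers without dplyr, the same "row number first" trick can be done in base R (toy data frame made up for illustration):</p>

```r
df <- data.frame(a = 1:3, b = letters[1:3])
df$row <- seq_len(nrow(df))                       # add original row numbers
df <- df[, c("row", setdiff(names(df), "row"))]   # move 'row' to the front
names(df)
## [1] "row" "a"   "b"
```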
<p>The goal here is to get to the percentage of votes going to the big two parties. While it may be an interesting exercise to some to look at the third party votes, we’re not going to do that here.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">total_pres_votes</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">election_df</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="n">office</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"STRAIGHT PARTY"</span><span class="p">,</span><span class="s2">"President and Vice President"</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">precinct</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">summarize</span><span class="p">(</span><span class="n">total_votes</span><span class="o">=</span><span class="nf">sum</span><span class="p">(</span><span class="n">votes</span><span class="p">,</span><span class="n">na.rm</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">))</span><span class="w">
</span><span class="n">DT</span><span class="o">::</span><span class="n">datatable</span><span class="p">(</span><span class="n">total_pres_votes</span><span class="p">)</span></code></pre></figure>
<p><img src="/figures//2017-08-04-how-we-voted-in-greenville-sc.Rmdunnamed-chunk-6-1.png" alt="plot of chunk unnamed-chunk-6" /></p>
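<p>The tallying logic above — keep only the straight-party and presidential rows, then sum votes by precinct — can be checked on a toy data frame in base R (the numbers here are made up):</p>

```r
votes <- data.frame(
  precinct = c("A", "A", "A", "B", "B"),
  office   = c("STRAIGHT PARTY", "President and Vice President",
               "Sheriff", "STRAIGHT PARTY", "President and Vice President"),
  party    = c("DEM", "REP", "DEM", "REP", "DEM"),
  votes    = c(10, 20, 99, 5, 15)
)
# Keep only rows that count toward the presidential total
pres  <- votes[votes$office %in% c("STRAIGHT PARTY",
                                   "President and Vice President"), ]
# Sum votes within each precinct
total <- aggregate(votes ~ precinct, data = pres, FUN = sum)
total   # A: 30, B: 20 -- the Sheriff row is correctly excluded
```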
<p>The first few results look ok, so we can get votes for the different parties and merge this back on to get the percentage.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">total_party_votes</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">election_df</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">dplyr</span><span class="o">::</span><span class="n">filter</span><span class="p">((</span><span class="n">office</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"STRAIGHT PARTY"</span><span class="p">,</span><span class="s2">"President and Vice President"</span><span class="p">))</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="p">(</span><span class="n">party</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"DEM"</span><span class="p">,</span><span class="s2">"REP"</span><span class="p">)))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">precinct</span><span class="p">,</span><span class="n">party</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">summarize</span><span class="p">(</span><span class="n">total_party_votes</span><span class="o">=</span><span class="nf">sum</span><span class="p">(</span><span class="n">votes</span><span class="p">,</span><span class="n">na.rm</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
</span><span class="n">left_join</span><span class="p">(</span><span class="n">total_pres_votes</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">party_perc</span><span class="o">=</span><span class="n">total_party_votes</span><span class="o">/</span><span class="n">total_votes</span><span class="o">*</span><span class="m">100</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Joining, by = "precinct"</code></pre></figure>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">DT</span><span class="o">::</span><span class="n">datatable</span><span class="p">(</span><span class="n">total_party_votes</span><span class="p">)</span></code></pre></figure>
<p><img src="/figures//2017-08-04-how-we-voted-in-greenville-sc.Rmdunnamed-chunk-7-1.png" alt="plot of chunk unnamed-chunk-7" /></p>
<p>So one picky issue is worth mentioning here. This dataset counts absentee ballots as their own precinct (or precincts). Part of the joy of blogging is sweeping issues like this under the rug, but they can be the source of interesting analyses in their own right.</p>
<p>The final bit of data wrangling I do here is to widen the dataset, so I can merge it easily later on with the shapefile.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">total_party_votes_wide</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">total_party_votes</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">total_party_votes</span><span class="p">,</span><span class="o">-</span><span class="n">total_votes</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">precinct</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">spread</span><span class="p">(</span><span class="n">key</span><span class="o">=</span><span class="n">party</span><span class="p">,</span><span class="n">value</span><span class="o">=</span><span class="n">party_perc</span><span class="p">)</span><span class="w">
</span><span class="n">DT</span><span class="o">::</span><span class="n">datatable</span><span class="p">(</span><span class="n">total_party_votes_wide</span><span class="w"> </span><span class="p">)</span></code></pre></figure>
<p><img src="/figures//2017-08-04-how-we-voted-in-greenville-sc.Rmdunnamed-chunk-8-1.png" alt="plot of chunk unnamed-chunk-8" /></p>
<p>There were some casualties in this operation, namely the total party votes and total votes, but I don’t really need them for the simple thing I’m doing here.</p>
<p>The <code class="highlighter-rouge">spread</code> function I used above comes from the <code class="highlighter-rouge">tidyr</code> package, loaded by <code class="highlighter-rouge">tidyverse</code>. It, along with <code class="highlighter-rouge">gather</code>, enables navigating between “long” and “wide” datasets. You have to be careful using these functions, though, or you may get something strange. For instance, when I left the party vote totals in before calling <code class="highlighter-rouge">spread</code> (in a previous iteration of this post), I ended up with a dataset that was both wide and long, with <code class="highlighter-rouge">NA</code> in every other row of each percentage column.</p>
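<p>That <code class="highlighter-rouge">NA</code>-striping failure mode is easy to reproduce on a toy table (made-up numbers; requires the tidyr package):</p>

```r
library(tidyr)
df <- data.frame(precinct = c("A", "A"), party = c("DEM", "REP"),
                 n = c(30, 70), perc = c(30, 70))
# With the extra 'n' column, each (precinct, n) pair stays a distinct row,
# so spreading leaves NA holes:
spread(df, key = party, value = perc)
# Dropping 'n' first collapses to one row per precinct, as intended:
spread(df[, c("precinct", "party", "perc")], key = party, value = perc)
```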
<h1 id="merging-data">Merging data</h1>
<p>In the last blog post, we were simply able to plot two maps on top of each other using the <code class="highlighter-rouge">leaflet</code> package. We are faced with a slightly different issue here. One data set is a geographic dataset, but the elections data only lists a precinct name. So it is up to us to do the merge.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">precinct_elect</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">merge</span><span class="p">(</span><span class="n">precinct_shapes</span><span class="p">,</span><span class="n">total_party_votes_wide</span><span class="p">,</span><span class="n">by.x</span><span class="o">=</span><span class="s2">"PNAME"</span><span class="p">,</span><span class="n">by.y</span><span class="o">=</span><span class="s2">"precinct"</span><span class="p">)</span></code></pre></figure>
<p>Honestly, I thought the above step would be the hardest part of the post. But like <code class="highlighter-rouge">plot</code>, the <code class="highlighter-rouge">merge</code> function in R understands what it’s operating on (i.e. uses a method that’s specific to spatial objects). Someone else did the hard work, and I just use the magic.</p>
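<p>One sanity check worth doing around a merge like this is to see how many precinct names actually match between the two sources, since unmatched names silently become holes in the map. A sketch, using the objects created above:</p>

```r
# How many of the election precinct names appear in the shapefile's
# PNAME field?
sum(total_party_votes_wide$precinct %in% precinct_shapes$PNAME)
# And which election precincts have no matching shape?
setdiff(total_party_votes_wide$precinct, precinct_shapes$PNAME)
```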
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">leaflet</span><span class="p">)</span><span class="w">
</span><span class="n">pal</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">colorNumeric</span><span class="p">(</span><span class="n">palette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"viridis"</span><span class="p">,</span><span class="w">
</span><span class="n">domain</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="m">100</span><span class="p">))</span><span class="w">
</span><span class="n">precinct_elect</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">leaflet</span><span class="p">(</span><span class="n">width</span><span class="o">=</span><span class="s2">"100%"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">addPolygons</span><span class="p">(</span><span class="n">popup</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="n">PNAME</span><span class="p">,</span><span class="w">
</span><span class="n">stroke</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w">
</span><span class="n">smoothFactor</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w">
</span><span class="n">fillOpacity</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">,</span><span class="w">
</span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">pal</span><span class="p">(</span><span class="n">DEM</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">addLegend</span><span class="p">(</span><span class="s2">"bottomright"</span><span class="p">,</span><span class="w">
</span><span class="n">pal</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pal</span><span class="p">,</span><span class="w">
</span><span class="n">values</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">DEM</span><span class="p">,</span><span class="w">
</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Dem P/VP Vote %"</span><span class="p">,</span><span class="w">
</span><span class="n">labFormat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">labelFormat</span><span class="p">(</span><span class="n">suffix</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"%"</span><span class="p">),</span><span class="w">
</span><span class="n">opacity</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span></code></pre></figure>
<p><img src="/figures//2017-08-04-how-we-voted-in-greenville-sc.Rmdunnamed-chunk-10-1.png" alt="plot of chunk unnamed-chunk-10" /></p>
<p>Now there are a few things to note:</p>
<ul>
<li>In this blog post, the map is static. If you actually run this code in RStudio, it will be an interactive map that lets you zoom and pan, and where clicking areas will give you the precinct name.</li>
<li>I basically copied and pasted the <code class="highlighter-rouge">leaflet</code> code from the previous blog post, and changed variable names.</li>
<li>It looks like the dataset has a few holes. This may be where sweeping the absentee ballot under the rug leaves out a lot.</li>
</ul>
<h1 id="discussion">Discussion</h1>
<p>I used the <code class="highlighter-rouge">data.world</code> package and website to download and use Greenville election data. I then merged it with precinct shapefiles found from a different source (Github). The actual merge process wasn’t hard, and in fact the most difficult part of this process was deciding how I wanted to present the data. This election data is rather rich, but has a few necessary quirks in its structure.</p>
<p>Once you have characteristic data in the right format and shape files, it’s magically easy to merge them.</p>
<p>I copied and pasted most of the <code class="highlighter-rouge">leaflet</code> code from my last post to present the data, with tweaks for variable names.</p>
<h1>How to make interactive maps with Census and local data in R</h1>
<p>2017-07-21, by John Johnson</p>
<p>So the goal here is to focus back on Greenville County and have even more granularity. 
I look at median house prices near Greenville and then overlay the park data downloaded earlier. This time, for the Census data, I use the <code class="highlighter-rouge">tidycensus</code> package that came out recently. Furthermore, instead of using <code class="highlighter-rouge">ggplot2</code> to create a static map, I use the <code class="highlighter-rouge">leaflet</code> package to create an interactive map, and, furthermore integrate data from disparate sources in a convenient way.</p>
<h1 id="download-the-local-park-data">Download the local park data</h1>
<p>The local parks file can be found <a href="https://data.openupstate.org/map-layers">here</a>, courtesy of a small group of dedicated volunteers at Open Upstate and an API that makes publishing geojson files easy. We download a polygon file for the park boundaries as well as a point geojson file for the address of each park.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">geojsonio</span><span class="p">)</span><span class="w"> </span><span class="c1"># provides geojson_read</span><span class="w">
</span><span class="n">data_url</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="s2">"https://data.openupstate.org/maps/city-parks/parks.php"</span><span class="w">
</span><span class="n">data_file</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"parks.geojson"</span><span class="w">
</span><span class="c1"># for some reason, I can't read from the url directly, though the tutorial
# says I can
</span><span class="n">download.file</span><span class="p">(</span><span class="n">data_url</span><span class="p">,</span><span class="w"> </span><span class="n">data_file</span><span class="p">)</span><span class="w">
</span><span class="n">data_park</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">geojson_read</span><span class="p">(</span><span class="n">data_file</span><span class="p">,</span><span class="w"> </span><span class="n">what</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"sp"</span><span class="p">)</span><span class="w">
</span><span class="n">data_url</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"https://data.openupstate.org/maps/city-parks/geojson.php"</span><span class="w">
</span><span class="n">data_file</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"parks_point.geojson"</span><span class="w">
</span><span class="c1"># for some reason, I can't read from the url directly, though the tutorial
# says I can
</span><span class="n">download.file</span><span class="p">(</span><span class="n">data_url</span><span class="p">,</span><span class="w"> </span><span class="n">data_file</span><span class="p">)</span><span class="w">
</span><span class="n">data_park_addr</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">geojson_read</span><span class="p">(</span><span class="n">data_file</span><span class="p">,</span><span class="w"> </span><span class="n">what</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"sp"</span><span class="p">)</span></code></pre></figure>
<h1 id="download-the-median-home-value-data">Download the median home value data</h1>
<p>This code from <code class="highlighter-rouge">tidycensus</code> downloads demographic data <em>and</em> geometry together: the result is a data frame with a list column, where one variable holds the polygon geometry for each census tract rather than a single atomic value. Having the demographic and geometric data in one object eases bookkeeping, and, thankfully, leaflet understands this format.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">gvl_value</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">get_acs</span><span class="p">(</span><span class="n">geography</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"tract"</span><span class="p">,</span><span class="w">
</span><span class="n">variables</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"B25077_001"</span><span class="p">,</span><span class="w">
</span><span class="n">state</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"SC"</span><span class="p">,</span><span class="w">
</span><span class="n">county</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Greenville County"</span><span class="p">,</span><span class="w">
</span><span class="n">geometry</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre></figure>
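<p>Note that <code class="highlighter-rouge">get_acs</code> requires a (free) Census API key; tidycensus provides a helper to register it once. <code class="highlighter-rouge">"YOUR_CENSUS_KEY"</code> below is a placeholder:</p>

```r
library(tidycensus)
# Register your key once; install = TRUE caches it in .Renviron so
# future sessions pick it up automatically.
census_api_key("YOUR_CENSUS_KEY", install = TRUE)
```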
<h1 id="plot-the-census-and-local-data-together">Plot the census and local data together</h1>
<p>Now we bring everything together. The <code class="highlighter-rouge">leaflet</code> package was written to make extensive use of the pipe operator that <code class="highlighter-rouge">dplyr</code> popularized a few years ago. We can set a default data frame for a leaflet map, but when we add markers and polygons, we can pull from other data sources. The following code is one way to do this: we use the <code class="highlighter-rouge">tidycensus</code>-generated dataset as the foundation of the leaflet map, and add the park polygons and markers via the <code class="highlighter-rouge">data=</code> option of <code class="highlighter-rouge">addPolygons</code> and <code class="highlighter-rouge">addMarkers</code>. Note the use of the <code class="highlighter-rouge">group=</code> option to create layers, which can be toggled on and off interactively. The <code class="highlighter-rouge">label=</code> option (or the <code class="highlighter-rouge">popup=</code> option for <code class="highlighter-rouge">addPolygons</code>) is used to generate popup windows that give additional information.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">pal</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">colorNumeric</span><span class="p">(</span><span class="n">palette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"viridis"</span><span class="p">,</span><span class="w">
</span><span class="n">domain</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gvl_value</span><span class="o">$</span><span class="n">estimate</span><span class="p">)</span><span class="w">
</span><span class="n">gvl_value</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">st_transform</span><span class="p">(</span><span class="n">crs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"+init=epsg:4326"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">leaflet</span><span class="p">(</span><span class="n">width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"100%"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">addProviderTiles</span><span class="p">(</span><span class="n">provider</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"CartoDB.Positron"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">addPolygons</span><span class="p">(</span><span class="n">popup</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">str_extract</span><span class="p">(</span><span class="n">NAME</span><span class="p">,</span><span class="w"> </span><span class="s2">"^([^,]*)"</span><span class="p">),</span><span class="w">
</span><span class="n">stroke</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w">
</span><span class="n">smoothFactor</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w">
</span><span class="n">fillOpacity</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">,</span><span class="w">
</span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">pal</span><span class="p">(</span><span class="n">estimate</span><span class="p">),</span><span class="w">
</span><span class="n">group</span><span class="o">=</span><span class="s2">"Median home value"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">addLegend</span><span class="p">(</span><span class="s2">"bottomright"</span><span class="p">,</span><span class="w">
</span><span class="n">pal</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pal</span><span class="p">,</span><span class="w">
</span><span class="n">values</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">estimate</span><span class="p">,</span><span class="w">
</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Median home value"</span><span class="p">,</span><span class="w">
</span><span class="n">labFormat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">labelFormat</span><span class="p">(</span><span class="n">prefix</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"$"</span><span class="p">),</span><span class="w">
</span><span class="n">opacity</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">addPolygons</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">data_park</span><span class="p">,</span><span class="n">fillOpacity</span><span class="o">=</span><span class="m">0.8</span><span class="p">,</span><span class="n">group</span><span class="o">=</span><span class="s2">"Parks"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">addMarkers</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">data_park_addr</span><span class="p">,</span><span class="n">group</span><span class="o">=</span><span class="s2">"Parks"</span><span class="p">,</span><span class="n">label</span><span class="o">=~</span><span class="n">title</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">addLayersControl</span><span class="p">(</span><span class="n">overlayGroups</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"Parks"</span><span class="p">,</span><span class="s2">"Median home value"</span><span class="p">))</span></code></pre></figure>
<p><img src="/figures//2017-07-21-r-maps-with-leaflet.Rmdunnamed-chunk-4-1.png" alt="plot of chunk unnamed-chunk-4" /></p>
<p>Unfortunately, due to the limitations of GitHub Pages, this had to be turned into a static image to be rendered. Perhaps it’s time to make the jump to blogdown and Hugo like all the other cool kids?</p>
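<p>If the interactivity is worth keeping, one workaround is to save the widget as a standalone HTML file with the <code class="highlighter-rouge">htmlwidgets</code> package and embed that file in an iframe. This is only a sketch - it assumes the leaflet pipeline above was assigned to a variable named <code class="highlighter-rouge">m</code>, which the code as written does not do:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">library(htmlwidgets)
# assumes m holds the leaflet map built above
saveWidget(m, "gvl_map.html", selfcontained = TRUE)
# the saved gvl_map.html can then be embedded in the page with an iframe</code></pre></figure>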
<h1 id="discussion">Discussion</h1>
<p>I’m just starting to learn about the <code class="highlighter-rouge">leaflet</code> package, but in just a couple of hours (and standing on the shoulders of giants) I was able to put together an interactive map combining Census data (median home value by census tract) and locally-generated data (park locations). Such combinations can be effectively used to examine local situations in the context of rich data already collected at a federal level (assuming the instability at the U.S. Census Bureau is temporary).</p>John JohnsonHow to make maps with Census data in R2017-07-21T00:00:00+00:002017-07-21T00:00:00+00:00https://randomjohn.github.io/r-maps-with-census-data<h2 id="us-census-data">US Census Data</h2>
<p>The US Census collects a number of demographic measures and publishes aggregate data through its website. There are several ways to use Census data in R, from the <a href="https://www.census.gov/developers/">Census API</a> to the <a href="https://www.jstatsoft.org/article/view/v037i06">USCensus2010</a> package. If you are interested in geopolitical data in the US, I recommend exploring both these options - the Census API requires a key for each person who uses it, and the package requires downloading a very large dataset. The setups for both require some effort, but once that effort is done you don’t have to do it again.</p>
<p>The <code class="highlighter-rouge">acs</code> package in R allows you to access the Census API easily. I highly recommend checking it out, and that’s the method we will use here. Note that I’ve already defined the variable <code class="highlighter-rouge">api_key</code> - if you are trying to run this code you will need to first run something like <code class="highlighter-rouge">api_key <- <enter your Census API key></code> before running the rest of this code.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">acs</span><span class="p">)</span><span class="w">
</span><span class="n">api.key.install</span><span class="p">(</span><span class="n">key</span><span class="o">=</span><span class="n">api_key</span><span class="p">)</span><span class="w"> </span><span class="err">#</span><span class="w"> </span><span class="n">now</span><span class="w"> </span><span class="n">you</span><span class="w"> </span><span class="n">are</span><span class="w"> </span><span class="n">ready</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">run</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">rest</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">acs</span><span class="w"> </span><span class="n">code</span></code></pre></figure>
<p>For purposes here, we will use the toy example of plotting median family income by county for every county in South Carolina. First, we obtain the Census data. The first command, <code class="highlighter-rouge">acs.lookup</code>, gives us the table and variable names of what we want. I then use that table number in the <code class="highlighter-rouge">acs.fetch</code> command to get the variable I want.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">acs.lookup</span><span class="p">(</span><span class="n">endyear</span><span class="o">=</span><span class="m">2015</span><span class="p">,</span><span class="w"> </span><span class="n">span</span><span class="o">=</span><span class="m">5</span><span class="p">,</span><span class="n">dataset</span><span class="o">=</span><span class="s2">"acs"</span><span class="p">,</span><span class="w"> </span><span class="n">keyword</span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"median"</span><span class="p">,</span><span class="s2">"income"</span><span class="p">,</span><span class="s2">"family"</span><span class="p">,</span><span class="s2">"total"</span><span class="p">),</span><span class="w"> </span><span class="n">case.sensitive</span><span class="o">=</span><span class="nb">F</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Warning in acs.lookup(endyear = 2015, span = 5, dataset = "acs", keyword = c("median", : XML variable lookup tables for this request
## seem to be missing from ' https://api.census.gov/data/2015/acs5/variables.xml ';
## temporarily downloading and using archived copies instead;
## since this is *much* slower, recommend running
## acs.tables.install()</code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## An object of class "acs.lookup"
## endyear= 2015 ; span= 5
##
## results:
## variable.code table.number
## 1 B10010_001 B10010
## 2 B19126_001 B19126
## 3 B19126_002 B19126
## 4 B19126_005 B19126
## 5 B19126_006 B19126
## 6 B19126_009 B19126
## 7 B19215_001 B19215
## 8 B19215_002 B19215
## 9 B19215_003 B19215
## 10 B19215_006 B19215
## 11 B19215_009 B19215
## 12 B19215_010 B19215
## 13 B19215_013 B19215
## table.name
## 1 Median Family Income for Families with GrndPrnt Householders Living With Own GrndChldrn < 18 Yrs
## 2 B19126. Median Family Income in the Past 12 Months (in 2015 Inflation-Adjusted Dollars) by Family Type by Presence of Own Children Under 18 Years
## 3 B19126. Median Family Income in the Past 12 Months (in 2015 Inflation-Adjusted Dollars) by Family Type by Presence of Own Children Under 18 Years
## 4 B19126. Median Family Income in the Past 12 Months (in 2015 Inflation-Adjusted Dollars) by Family Type by Presence of Own Children Under 18 Years
## 5 B19126. Median Family Income in the Past 12 Months (in 2015 Inflation-Adjusted Dollars) by Family Type by Presence of Own Children Under 18 Years
## 6 B19126. Median Family Income in the Past 12 Months (in 2015 Inflation-Adjusted Dollars) by Family Type by Presence of Own Children Under 18 Years
## 7 B19215. Median Nonfamily Household Income in the Past 12 Months (in 2015 Inflation-Adjusted Dollars) by Sex of Householder by Living Alone by Age of Householder
## 8 B19215. Median Nonfamily Household Income in the Past 12 Months (in 2015 Inflation-Adjusted Dollars) by Sex of Householder by Living Alone by Age of Householder
## 9 B19215. Median Nonfamily Household Income in the Past 12 Months (in 2015 Inflation-Adjusted Dollars) by Sex of Householder by Living Alone by Age of Householder
## 10 B19215. Median Nonfamily Household Income in the Past 12 Months (in 2015 Inflation-Adjusted Dollars) by Sex of Householder by Living Alone by Age of Householder
## 11 B19215. Median Nonfamily Household Income in the Past 12 Months (in 2015 Inflation-Adjusted Dollars) by Sex of Householder by Living Alone by Age of Householder
## 12 B19215. Median Nonfamily Household Income in the Past 12 Months (in 2015 Inflation-Adjusted Dollars) by Sex of Householder by Living Alone by Age of Householder
## 13 B19215. Median Nonfamily Household Income in the Past 12 Months (in 2015 Inflation-Adjusted Dollars) by Sex of Householder by Living Alone by Age of Householder
## variable.name
## 1 Median family income in the past 12 months-- Total:
## 2 Median family income in the past 12 months (in 2015 Inflation-adjusted dollars) -- Total:
## 3 Median family income in the past 12 months (in 2015 Inflation-adjusted dollars) -- Married-couple family -- Total
## 4 Median family income in the past 12 months (in 2015 Inflation-adjusted dollars) -- Other family -- Total
## 5 Median family income in the past 12 months (in 2015 Inflation-adjusted dollars) -- Other family -- Male householder, no wife present -- Total
## 6 Median family income in the past 12 months (in 2015 Inflation-adjusted dollars) -- Other family -- Female householder, no husband present -- Total
## 7 Median nonfamily household income in the past 12 months (in 2015 Inflation-adjusted dollars) -- Total (dollars):
## 8 Median nonfamily household income in the past 12 months (in 2015 Inflation-adjusted dollars) -- Male householder -- Total (dollars)
## 9 Median nonfamily household income in the past 12 months (in 2015 Inflation-adjusted dollars) -- Male householder -- Living alone -- Total (dollars)
## 10 Median nonfamily household income in the past 12 months (in 2015 Inflation-adjusted dollars) -- Male householder -- Not living alone -- Total (dollars)
## 11 Median nonfamily household income in the past 12 months (in 2015 Inflation-adjusted dollars) -- Female householder -- Total (dollars)
## 12 Median nonfamily household income in the past 12 months (in 2015 Inflation-adjusted dollars) -- Female householder -- Living alone -- Total (dollars)
## 13 Median nonfamily household income in the past 12 months (in 2015 Inflation-adjusted dollars) -- Female householder -- Not living alone -- Total (dollars)</code></pre></figure>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">my_cnty</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">geo.make</span><span class="p">(</span><span class="n">state</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">45</span><span class="p">,</span><span class="n">county</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"*"</span><span class="p">)</span><span class="w">
</span><span class="n">home_median_price</span><span class="o"><-</span><span class="n">acs.fetch</span><span class="p">(</span><span class="n">geography</span><span class="o">=</span><span class="n">my_cnty</span><span class="p">,</span><span class="w"> </span><span class="n">table.number</span><span class="o">=</span><span class="s2">"B19126"</span><span class="p">,</span><span class="n">endyear</span><span class="o">=</span><span class="m">2015</span><span class="p">)</span><span class="w"> </span><span class="err">#</span><span class="w"> </span><span class="n">median</span><span class="w"> </span><span class="n">family</span><span class="w"> </span><span class="n">income</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Warning in (function (endyear, span = 5, dataset = "acs", keyword, table.name, : XML variable lookup tables for this request
## seem to be missing from ' https://api.census.gov/data/2015/acs5/variables.xml ';
## temporarily downloading and using archived copies instead;
## since this is *much* slower, recommend running
## acs.tables.install()</code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Error in if (url.test["statusMessage"] != "OK") {: missing value where TRUE/FALSE needed</code></pre></figure>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">knitr</span><span class="o">::</span><span class="n">kable</span><span class="p">(</span><span class="n">head</span><span class="p">(</span><span class="n">home_median_price</span><span class="o">@</span><span class="n">estimate</span><span class="p">))</span></code></pre></figure>
<table>
<thead>
<tr>
<th style="text-align: left"> </th>
<th style="text-align: right">B19126_001</th>
<th style="text-align: right">B19126_002</th>
<th style="text-align: right">B19126_003</th>
<th style="text-align: right">B19126_004</th>
<th style="text-align: right">B19126_005</th>
<th style="text-align: right">B19126_006</th>
<th style="text-align: right">B19126_007</th>
<th style="text-align: right">B19126_008</th>
<th style="text-align: right">B19126_009</th>
<th style="text-align: right">B19126_010</th>
<th style="text-align: right">B19126_011</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">Abbeville County, South Carolina</td>
<td style="text-align: right">44918</td>
<td style="text-align: right">55141</td>
<td style="text-align: right">65664</td>
<td style="text-align: right">50698</td>
<td style="text-align: right">24835</td>
<td style="text-align: right">43187</td>
<td style="text-align: right">50347</td>
<td style="text-align: right">24886</td>
<td style="text-align: right">22945</td>
<td style="text-align: right">18101</td>
<td style="text-align: right">29958</td>
</tr>
<tr>
<td style="text-align: left">Aiken County, South Carolina</td>
<td style="text-align: right">57396</td>
<td style="text-align: right">70829</td>
<td style="text-align: right">72930</td>
<td style="text-align: right">70446</td>
<td style="text-align: right">29302</td>
<td style="text-align: right">36571</td>
<td style="text-align: right">35469</td>
<td style="text-align: right">37906</td>
<td style="text-align: right">27355</td>
<td style="text-align: right">22760</td>
<td style="text-align: right">34427</td>
</tr>
<tr>
<td style="text-align: left">Allendale County, South Carolina</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
</tr>
<tr>
<td style="text-align: left">Anderson County, South Carolina</td>
<td style="text-align: right">53169</td>
<td style="text-align: right">65881</td>
<td style="text-align: right">75444</td>
<td style="text-align: right">60166</td>
<td style="text-align: right">26608</td>
<td style="text-align: right">36694</td>
<td style="text-align: right">37254</td>
<td style="text-align: right">36297</td>
<td style="text-align: right">24384</td>
<td style="text-align: right">17835</td>
<td style="text-align: right">29280</td>
</tr>
<tr>
<td style="text-align: left">Bamberg County, South Carolina</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
</tr>
<tr>
<td style="text-align: left">Barnwell County, South Carolina</td>
<td style="text-align: right">44224</td>
<td style="text-align: right">59467</td>
<td style="text-align: right">70542</td>
<td style="text-align: right">54030</td>
<td style="text-align: right">19864</td>
<td style="text-align: right">25143</td>
<td style="text-align: right">18633</td>
<td style="text-align: right">45714</td>
<td style="text-align: right">18317</td>
<td style="text-align: right">13827</td>
<td style="text-align: right">21315</td>
</tr>
</tbody>
</table>
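<p>As an aside, the <code class="highlighter-rouge">geo.make</code> call above requests every county in the state, but the same interface can target finer geographies. The following sketch (not run here) fetches the same table for all census tracts in a single county; I believe county code 45 within South Carolina corresponds to Greenville County, but verify the FIPS code before relying on this:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># sketch only: tract-level version of the earlier fetch
# state 45 is South Carolina; county 45 should be Greenville County (verify the FIPS code)
gvl_tracts <- geo.make(state = 45, county = 45, tract = "*")
tract_income <- acs.fetch(geography = gvl_tracts, table.number = "B19126", endyear = 2015)
head(tract_income@estimate)</code></pre></figure>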
<h2 id="plotting-the-map-data">Plotting the map data</h2>
<p>If you have the <code class="highlighter-rouge">maps</code> and <code class="highlighter-rouge">ggplot2</code> packages, you already have the data you need to plot. We use the <code class="highlighter-rouge">map_data</code> function from <code class="highlighter-rouge">ggplot2</code> to pull in county shape data for South Carolina. (A previous attempt at this blog post had used the <code class="highlighter-rouge">ggmap</code> package, but there is an incompatibility between it and the latest <code class="highlighter-rouge">ggplot2</code> package at the time of this writing.)</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Want to understand how all the pieces fit together? Buy the
## ggplot2 book: http://ggplot2.org/book/</code></pre></figure>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">sc_map</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">map_data</span><span class="p">(</span><span class="s2">"county"</span><span class="p">,</span><span class="n">region</span><span class="o">=</span><span class="s2">"south.carolina"</span><span class="p">)</span><span class="w">
</span><span class="n">ggplot</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_polygon</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">long</span><span class="p">,</span><span class="n">y</span><span class="o">=</span><span class="n">lat</span><span class="p">,</span><span class="n">group</span><span class="o">=</span><span class="n">group</span><span class="p">),</span><span class="n">data</span><span class="o">=</span><span class="n">sc_map</span><span class="p">,</span><span class="n">colour</span><span class="o">=</span><span class="s2">"white"</span><span class="p">,</span><span class="n">fill</span><span class="o">=</span><span class="s2">"black"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">theme_minimal</span><span class="p">()</span></code></pre></figure>
<p><img src="/figures//2017-07-21-r-maps-with-census-data.Rmdunnamed-chunk-1-1.png" alt="plot of chunk unnamed-chunk-1" /></p>
<h2 id="merging-the-demographic-and-map-data">Merging the demographic and map data</h2>
<p>Now we have the demographic data and the map, but merging the two will take a little effort. The reason is that the map data gives a lower case representation of the county and calls it a “subregion”, while the Census data returns the county as “xxxx County, South Carolina”. I use the <code class="highlighter-rouge">dplyr</code> and <code class="highlighter-rouge">stringr</code> packages (for <code class="highlighter-rouge">str_replace</code>) to make short work of this merge.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">stringr</span><span class="p">)</span><span class="w">
</span><span class="n">merged</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">home_median_price</span><span class="o">@</span><span class="n">estimate</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">county_full</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rownames</span><span class="p">(</span><span class="n">.</span><span class="p">),</span><span class="w">
</span><span class="n">county</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">str_replace</span><span class="p">(</span><span class="n">county_full</span><span class="p">,</span><span class="s2">"(.+) County.*"</span><span class="p">,</span><span class="s2">"\\1"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">tolower</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">select</span><span class="p">(</span><span class="n">county</span><span class="p">,</span><span class="n">B</span><span class="m">19126</span><span class="err">_</span><span class="m">001</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename</span><span class="p">(</span><span class="n">med_income</span><span class="o">=</span><span class="n">B</span><span class="m">19126</span><span class="err">_</span><span class="m">001</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">right_join</span><span class="p">(</span><span class="n">sc_map</span><span class="p">,</span><span class="n">by</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"county"</span><span class="o">=</span><span class="s2">"subregion"</span><span class="p">))</span><span class="w">
</span><span class="n">knitr</span><span class="o">::</span><span class="n">kable</span><span class="p">(</span><span class="n">head</span><span class="p">(</span><span class="n">merged</span><span class="p">,</span><span class="m">10</span><span class="p">))</span></code></pre></figure>
<table>
<thead>
<tr>
<th style="text-align: left">county</th>
<th style="text-align: right">med_income</th>
<th style="text-align: right">long</th>
<th style="text-align: right">lat</th>
<th style="text-align: right">group</th>
<th style="text-align: right">order</th>
<th style="text-align: left">region</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">abbeville</td>
<td style="text-align: right">44918</td>
<td style="text-align: right">-82.24809</td>
<td style="text-align: right">34.41758</td>
<td style="text-align: right">1</td>
<td style="text-align: right">1</td>
<td style="text-align: left">south carolina</td>
</tr>
<tr>
<td style="text-align: left">abbeville</td>
<td style="text-align: right">44918</td>
<td style="text-align: right">-82.31685</td>
<td style="text-align: right">34.35455</td>
<td style="text-align: right">1</td>
<td style="text-align: right">2</td>
<td style="text-align: left">south carolina</td>
</tr>
<tr>
<td style="text-align: left">abbeville</td>
<td style="text-align: right">44918</td>
<td style="text-align: right">-82.31111</td>
<td style="text-align: right">34.33163</td>
<td style="text-align: right">1</td>
<td style="text-align: right">3</td>
<td style="text-align: left">south carolina</td>
</tr>
<tr>
<td style="text-align: left">abbeville</td>
<td style="text-align: right">44918</td>
<td style="text-align: right">-82.31111</td>
<td style="text-align: right">34.29152</td>
<td style="text-align: right">1</td>
<td style="text-align: right">4</td>
<td style="text-align: left">south carolina</td>
</tr>
<tr>
<td style="text-align: left">abbeville</td>
<td style="text-align: right">44918</td>
<td style="text-align: right">-82.28247</td>
<td style="text-align: right">34.26860</td>
<td style="text-align: right">1</td>
<td style="text-align: right">5</td>
<td style="text-align: left">south carolina</td>
</tr>
<tr>
<td style="text-align: left">abbeville</td>
<td style="text-align: right">44918</td>
<td style="text-align: right">-82.25955</td>
<td style="text-align: right">34.25142</td>
<td style="text-align: right">1</td>
<td style="text-align: right">6</td>
<td style="text-align: left">south carolina</td>
</tr>
<tr>
<td style="text-align: left">abbeville</td>
<td style="text-align: right">44918</td>
<td style="text-align: right">-82.24809</td>
<td style="text-align: right">34.21131</td>
<td style="text-align: right">1</td>
<td style="text-align: right">7</td>
<td style="text-align: left">south carolina</td>
</tr>
<tr>
<td style="text-align: left">abbeville</td>
<td style="text-align: right">44918</td>
<td style="text-align: right">-82.23663</td>
<td style="text-align: right">34.18266</td>
<td style="text-align: right">1</td>
<td style="text-align: right">8</td>
<td style="text-align: left">south carolina</td>
</tr>
<tr>
<td style="text-align: left">abbeville</td>
<td style="text-align: right">44918</td>
<td style="text-align: right">-82.24236</td>
<td style="text-align: right">34.15401</td>
<td style="text-align: right">1</td>
<td style="text-align: right">9</td>
<td style="text-align: left">south carolina</td>
</tr>
<tr>
<td style="text-align: left">abbeville</td>
<td style="text-align: right">44918</td>
<td style="text-align: right">-82.27674</td>
<td style="text-align: right">34.10818</td>
<td style="text-align: right">1</td>
<td style="text-align: right">10</td>
<td style="text-align: left">south carolina</td>
</tr>
</tbody>
</table>
<p>It’s now a simple matter to plot this merged dataset. In fact, we only have to tweak a few things from the first time we plotted the map data.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_polygon</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">long</span><span class="p">,</span><span class="n">y</span><span class="o">=</span><span class="n">lat</span><span class="p">,</span><span class="n">group</span><span class="o">=</span><span class="n">group</span><span class="p">,</span><span class="n">fill</span><span class="o">=</span><span class="n">med_income</span><span class="p">),</span><span class="n">data</span><span class="o">=</span><span class="n">merged</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">theme_minimal</span><span class="p">()</span></code></pre></figure>
<p><img src="/figures//2017-07-21-r-maps-with-census-data.Rmdunnamed-chunk-3-1.png" alt="plot of chunk unnamed-chunk-3" /></p>
<h2 id="discussion">Discussion</h2>
<p>It’s pretty easy to plot U.S. Census data on a map. The real power of Census data comes not just from plotting it, but from combining it with other geographically-based data (such as crime). The <code class="highlighter-rouge">acs</code> package in R makes it easy to obtain Census data, which can then be merged with other data using packages such as <code class="highlighter-rouge">dplyr</code> and <code class="highlighter-rouge">stringr</code> and then plotted with <code class="highlighter-rouge">ggplot2</code>. Hopefully the authors of the <code class="highlighter-rouge">ggmap</code> and <code class="highlighter-rouge">ggplot2</code> packages can work out their incompatibilities so that the above maps can be created on top of Google Maps or OpenStreetMap tiles.</p>
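<p>To make the combination idea concrete, here is a minimal sketch. It assumes a hypothetical data frame <code class="highlighter-rouge">crime_by_county</code> with a lowercase <code class="highlighter-rouge">county</code> column and a <code class="highlighter-rouge">crime_rate</code> column; both the name and the structure are invented for illustration:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># hypothetical: crime_by_county has columns county (lowercase names) and crime_rate
crime_map <- merged %>%
  left_join(crime_by_county, by = "county")
ggplot() +
  geom_polygon(aes(x = long, y = lat, group = group, fill = crime_rate),
               data = crime_map) +
  theme_minimal()</code></pre></figure>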
<p>It should be noted that while I obtained county-level information, aggregate data can be obtained at Census block and tract levels as well, if you are looking to do some sort of localized analysis.</p>John JohnsonPersonal data collection and analysis2017-04-17T00:00:00+00:002017-04-17T00:00:00+00:00https://randomjohn.github.io/personal-data-collection<h2 id="motivation-behind-this-example">Motivation behind this example</h2>
<p>I was diagnosed with sleep apnea last year, and have to use a continuous positive airway pressure (CPAP) machine to sleep well enough to feel alert during the day. The machine uploads data (via a cellular connection) to a website that gives me results for the last two weeks. This data includes both usage (time of usage, air leakage, number of times the mask was put on or taken off) and results (the apnea-hypopnea index, the average number of times per hour that breathing slowed or stopped for at least 10 seconds). Because the website only displays the last two weeks, I’d like to eventually do a long-term analysis. I’d also like to have things displayed my own way, because, well, I’m like that.</p>
<p>I could enter this information in a spreadsheet, and for import into R or other statistical software that might be the sensible thing to do. However, by keeping this data alongside other diary entries and the text surrounding it, I get to see it in the context of everything else going on in my life. This information does not exist in a vacuum, and it provides important context for other things. For instance, if I’m dealing with a particularly stressful situation, it would be nice to go back and see how I handled it in the context of how my sleeping was going (and vice versa: does the apnea get better or worse during that time?). Another issue is that I’m dealing with migraines, and I’d like to know something about their frequency and severity in the context of sleep.</p>
<h2 id="methodology-for-data-collection">Methodology for data collection</h2>
<p>This personal data collection exercise uses an excellent piece of software specifically for journaling called <a href="http://www.davidrm.com">The Journal</a>. I’ve been using The Journal since 2007 to record events and just simply jog my memory of goings on in my life. The software has a few nifty features that dovetail nicely with data collection.</p>
<h3 id="daily-entries">Daily entries</h3>
<p>The Journal splits writing up into categories. Categories can be either loose-leaf (where entries can be organized hierarchically any way you want) or daily (where entries are organized by the date of entry). If you set it up a certain way, you can have The Journal lock entries on every day except the day you are working on. It can also automatically create an entry for the day you are working on. Very handy for daily journaling in general.</p>
<h3 id="topics">Topics</h3>
<p>Topics are tags for specific pieces of text or entries. If you select a piece of text and tag it with a topic (say, CPAP), you can extract that piece of text later. Couple this with the Search by Topic command, and you can extract all text tagged with a certain topic into one document and save a single document with all text from that topic. So, for example, I will tag all my CPAP writings with the CPAP topic, and later on save a text file with what I have written about CPAP therapy (in this case, the data I collected).</p>
<h3 id="templates">Templates</h3>
<p>The Journal has a sophisticated template system that can not only insert the same text over and over, but also tag it automatically with a certain topic and even fill in certain data such as the current date and time. I use the template feature to create some structured text (a data entry form of sorts) and tag the whole piece of inserted text with the CPAP topic. That way, I don’t have to bother with selecting and tagging manually. I can simply insert the text and fill in the numbers when I read the website.</p>
<p>The template looks like this:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>Sleep numbers for <ENTRYDATE format=“mm/dd/yyyy”/>
* Usage:
* Leakage: L/min
* AHI: events/h
* Mask on/off:
* MyAir score:
* Comments:
</code></pre>
</div>
<p>Because the text follows the same structure for all such entries, it is easy to write R code to pull out the data and make a <code class="highlighter-rouge">data.frame</code>.</p>
<p>What you don’t see (and is hard to show here) is that in the template itself I selected all of the text and tagged it CPAP. That way, my CPAP entries will always be tagged, and I can easily extract them later.</p>
<h2 id="methodology-for-analysis">Methodology for analysis</h2>
<h3 id="data-extraction">Data extraction</h3>
<p>The first part of data extraction happens in The Journal. I use a saved search from the Search Entries by Topic function, then click View All Result Entries to see the text I had entered. The result is a screen showing the last 100 pieces of text I tagged CPAP (which may include other pieces of text if I felt the need to write on the topic); I can change this limit with an option. Clicking Save to File lets me save to a Journal file, an RTF file, or a TXT file. I save the result to a TXT file so that I can easily read it into R. The text file contains only the data I entered for the CPAP machine, plus any other text I tagged (which is fairly uncommon).</p>
<h3 id="data-import">Data import</h3>
<p>This is where I pay the price for putting the data in a diary rather than a tabular format. I use <code class="highlighter-rouge">read_lines</code> from the <code class="highlighter-rouge">readr</code> package.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">readr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">stringr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="n">raw_file</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"_Rmd/cpap.txt"</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">file.exists</span><span class="p">(</span><span class="n">raw_file</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="c1"># file.copy(raw_file,backup_file,copy.date = TRUE)
</span><span class="w"> </span><span class="n">raw_lines</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read_lines</span><span class="p">(</span><span class="n">raw_file</span><span class="p">)</span><span class="w">
</span><span class="n">data_row</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="n">cpap_df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">date</span><span class="o">=</span><span class="nf">c</span><span class="p">(),</span><span class="n">usage</span><span class="o">=</span><span class="nf">c</span><span class="p">(),</span><span class="n">leakage</span><span class="o">=</span><span class="nf">c</span><span class="p">(),</span><span class="n">events</span><span class="o">=</span><span class="nf">c</span><span class="p">(),</span><span class="n">mask</span><span class="o">=</span><span class="nf">c</span><span class="p">(),</span><span class="n">score</span><span class="o">=</span><span class="nf">c</span><span class="p">())</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">this_line</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">raw_lines</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">str_sub</span><span class="p">(</span><span class="n">this_line</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">18</span><span class="p">)</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"Sleep numbers for "</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">data_row</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data_row</span><span class="m">+1</span><span class="w">
</span><span class="n">cpap_df</span><span class="p">[</span><span class="n">data_row</span><span class="p">,</span><span class="s2">"date"</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.Date</span><span class="p">(</span><span class="n">str_sub</span><span class="p">(</span><span class="n">this_line</span><span class="p">,</span><span class="m">19</span><span class="p">),</span><span class="s2">"%m/%d/%Y"</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">str_sub</span><span class="p">(</span><span class="n">this_line</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">9</span><span class="p">)</span><span class="o">==</span><span class="s2">"* Usage: "</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">tm</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">str_extract</span><span class="p">(</span><span class="n">this_line</span><span class="p">,</span><span class="s2">"1?[0-9]{1}:[0-9]{2}"</span><span class="p">)</span><span class="w">
</span><span class="n">tm</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">str_split</span><span class="p">(</span><span class="n">tm</span><span class="p">,</span><span class="s2">":"</span><span class="p">)[[</span><span class="m">1</span><span class="p">]])</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="n">x</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="o">*</span><span class="m">60</span><span class="o">+</span><span class="n">x</span><span class="p">[</span><span class="m">2</span><span class="p">])</span><span class="w">
</span><span class="n">cpap_df</span><span class="p">[</span><span class="n">data_row</span><span class="p">,</span><span class="s2">"usage"</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tm</span><span class="w">
</span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">str_sub</span><span class="p">(</span><span class="n">this_line</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">11</span><span class="p">)</span><span class="o">==</span><span class="s2">"* Leakage: "</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">cpap_df</span><span class="p">[</span><span class="n">data_row</span><span class="p">,</span><span class="s2">"leakage"</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">str_extract</span><span class="p">(</span><span class="n">this_line</span><span class="p">,</span><span class="s2">"[0-9]+"</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">str_sub</span><span class="p">(</span><span class="n">this_line</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">15</span><span class="p">)</span><span class="o">==</span><span class="s2">"* Mask on/off: "</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">cpap_df</span><span class="p">[</span><span class="n">data_row</span><span class="p">,</span><span class="s2">"mask"</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">str_extract</span><span class="p">(</span><span class="n">this_line</span><span class="p">,</span><span class="s2">"[0-9]+"</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">str_sub</span><span class="p">(</span><span class="n">this_line</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">15</span><span class="p">)</span><span class="o">==</span><span class="s2">"* MyAir score: "</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">cpap_df</span><span class="p">[</span><span class="n">data_row</span><span class="p">,</span><span class="s2">"score"</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">str_extract</span><span class="p">(</span><span class="n">this_line</span><span class="p">,</span><span class="s2">"[0-9]+"</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">str_sub</span><span class="p">(</span><span class="n">this_line</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">7</span><span class="p">)</span><span class="o">==</span><span class="s2">"* AHI: "</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">cpap_df</span><span class="p">[</span><span class="n">data_row</span><span class="p">,</span><span class="s2">"events"</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">str_extract</span><span class="p">(</span><span class="n">this_line</span><span class="p">,</span><span class="s2">"[0-9\\.]+"</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">data_row</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">nrow</span><span class="p">(</span><span class="n">cpap_df</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="nf">all</span><span class="p">(</span><span class="nf">is.na</span><span class="p">(</span><span class="n">cpap_df</span><span class="p">[</span><span class="n">data_row</span><span class="p">,])))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">cpap_df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cpap_df</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="p">(</span><span class="n">data_row</span><span class="m">-1</span><span class="p">),]</span><span class="w">
</span><span class="n">data_row</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data_row</span><span class="m">-1</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># write_csv(cpap_df,csv_file)
</span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">cat</span><span class="p">(</span><span class="s2">"Oops! "</span><span class="p">,</span><span class="n">raw_file</span><span class="p">,</span><span class="s2">" does not exist!\n"</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>The commented-out code above writes the data frame out to a CSV (for easier processing in the future if I need it) and backs up the raw text file each time I run the analysis. (I run this analysis in an <a href="http://www.rstudio.com">R Studio</a> R notebook.)</p>
<p>This code is a brute-force conversion of the structured template text into a data frame: the text indicates the variable (column), and the row is given by a counter that is incremented each time a new entry, marked by the text “Sleep numbers for”, is read. The variables are identified in a rather old-fashioned way, with a long <code class="highlighter-rouge">if</code> … <code class="highlighter-rouge">else if</code> chain. The <code class="highlighter-rouge">str_sub</code> calls from the <code class="highlighter-rouge">stringr</code> package (base R functions would also work) look for the substrings that I know will be present thanks to the template feature in The Journal (and the hope that I don’t overwrite them when I record data), and <code class="highlighter-rouge">str_extract</code> pulls out the numerical digits for most lines, two numbers separated by a colon (i.e. a time) for the usage line, and digits or a decimal point for the AHI line. These are converted to appropriate dates and numeric values, with the exception of the usage time, which is converted to minutes.</p>
<p>The code above is slightly flawed in that it can produce records that are entirely missing, so the last <code class="highlighter-rouge">for</code> block steps through and eliminates those records.</p>
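<p>A more direct alternative to that <code class="highlighter-rouge">for</code> loop is a single vectorized filter; this is a sketch that assumes <code class="highlighter-rouge">cpap_df</code> as built above:</p>
<div class="highlighter-rouge"><pre class="highlight"><code># Sketch: keep only rows with at least one non-missing value
cpap_df <- cpap_df[rowSums(!is.na(cpap_df)) > 0, ]
</code></pre></div>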
<p>This is the most complicated part of the analysis! Once this data is in a <code class="highlighter-rouge">data.frame</code>, you can proceed as with any other data analysis.</p>
<h3 id="data-analysis">Data analysis</h3>
<p>I won’t focus too heavily on the data analysis here, but just to demonstrate here is a usage graph:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">cpap_df</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">as.Date</span><span class="p">(</span><span class="n">date</span><span class="p">,</span><span class="n">origin</span><span class="o">=</span><span class="s2">"1970-1-1"</span><span class="p">),</span><span class="n">usage</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_hline</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">yintercept</span><span class="o">=</span><span class="m">480</span><span class="p">),</span><span class="n">color</span><span class="o">=</span><span class="s2">"red"</span><span class="p">,</span><span class="n">lty</span><span class="o">=</span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">xlab</span><span class="p">(</span><span class="s2">"Date"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">ylab</span><span class="p">(</span><span class="s2">"Usage (minutes)"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_y_continuous</span><span class="p">(</span><span class="n">limits</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="kc">NA</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_x_date</span><span class="p">(</span><span class="n">date_labels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"%b %d"</span><span class="p">)</span></code></pre></figure>
<p><img src="/figures//2017-04-17-personal-data-collection.Rmdunnamed-chunk-1-1.png" alt="plot of chunk unnamed-chunk-1" /></p>
<p>No, you don’t get to see the other measures; how much I sleep is enough!</p>
<h2 id="discussion">Discussion</h2>
<p>While it might be easier in some ways to enter this data into a spreadsheet daily, I chose this method of personal data collection for several reasons:</p>
<ul>
<li>It allows me to put each day’s data into context, for example recording in prose whether that day was stressful, was a holiday, or had work pressures</li>
<li>It enables me to enter several kinds of data into a diary (e.g. migraine data, dietary data, exercise data), and use multiple extractions to correlate data</li>
</ul>
<p>Because I use The Journal almost daily, and because it has these sophisticated features, it serves as a central location for all sorts of personal data, so it became the natural place to record my sleep habits; it is also a good piece of software for recording other habits. Perhaps it can also supplement Fitbits, Garmins, and other kinds of personal habit data collection workflows.</p>John JohnsonInauguration speeches2017-01-28T00:00:00+00:002017-01-28T00:00:00+00:00https://randomjohn.github.io/tidy-text-inauguration-speeches<h2 id="acquiring-inauguration-speeches">Acquiring inauguration speeches</h2>
<p>Though not about Greenville especially, it might be interesting to quantitatively analyze inauguration speeches. This analysis will be done using two paradigms: the <code class="highlighter-rouge">tm</code> package and the <code class="highlighter-rouge">tidytext</code> package. We will read the speeches in a way that suits the <code class="highlighter-rouge">tidytext</code> package; later on we will use some tools from that package to reproduce analyses traditionally done with <code class="highlighter-rouge">tm</code>.</p>
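<p>As a tiny preview of the <code class="highlighter-rouge">tidytext</code> paradigm, the sketch below tokenizes a made-up phrase (not one of the actual speeches) into one word per row:</p>
<div class="highlighter-rouge"><pre class="highlight"><code># Sketch: tidytext turns free text into one-token-per-row data
library(dplyr)
library(tidytext)

df <- data.frame(line = 1, text = "four score and seven years ago",
                 stringsAsFactors = FALSE)
df %>% unnest_tokens(word, text)  # one row per word
</code></pre></div>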
<p>I looked around for inauguration speeches, and finally found them at <code class="highlighter-rouge">www.bartleby.com</code>. They are in a format meant more for human consumption, but with the <code class="highlighter-rouge">rvest</code> (harvest?) package we can read them in relatively easily. However, we need to map speech IDs to speakers (the newly inaugurated presidents), which is a little ugly and tedious.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">rvest</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">magrittr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">readr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tidytext</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tm</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="c1"># download and format data ------------------------------------------------
</span><span class="w">
</span><span class="n">fmt_string</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"http://www.bartleby.com/124/pres%d.html"</span><span class="w">
</span><span class="n">speakers</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read.csv</span><span class="p">(</span><span class="n">textConnection</span><span class="p">(</span><span class="s2">"Number,Speaker
13,George Washington
14,George Washington
15,John Adams
16,Thomas Jefferson
17,Thomas Jefferson
18,James Madison
19,James Madison
20,James Monroe
21,James Monroe
22,John Quincy Adams
23,Andrew Jackson
24,Andrew Jackson
25,Martin Van Buren
26,William Henry Harrison
27,James Knox Polk
28,Zachary Taylor
29,Franklin Pierce
30,James Buchanan
31,Abraham Lincoln
32,Abraham Lincoln
33,Ulysses S. Grant
34,Ulysses S. Grant
35,Rutherford B. Hayes
36,James A. Garfield
37,Grover Cleveland
38,Benjamin Harrison
39,Grover Cleveland
40,William McKinley
41,William McKinley
42,Theodore Roosevelt
43,William Howard Taft
44,Woodrow Wilson
45,Woodrow Wilson
46,Warren G. Harding
47,Calvin Coolidge
48,Herbert Hoover
49,Franklin D. Roosevelt
50,Franklin D. Roosevelt
51,Franklin D. Roosevelt
52,Franklin D. Roosevelt
53,Harry S. Truman
54,Dwight D. Eisenhower
55,Dwight D. Eisenhower
56,John F. Kennedy
57,Lyndon Baines Johnson
58,Richard Milhous Nixon
59,Richard Milhous Nixon
60,Jimmy Carter
61,Ronald Reagan
62,Ronald Reagan
63,George H. W. Bush
64,Bill Clinton
65,Bill Clinton
66,George W. Bush
67,George W. Bush
68,Barack Obama
69,Barack Obama
70,Donald Trump"</span><span class="p">))</span><span class="w">
</span><span class="c1"># read the speeches into a list of data.frames, append ID number in a new column
</span><span class="n">speeches</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">()</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">id</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">13</span><span class="o">:</span><span class="m">70</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">speech_html</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read_html</span><span class="p">(</span><span class="n">sprintf</span><span class="p">(</span><span class="n">fmt_string</span><span class="p">,</span><span class="n">id</span><span class="p">))</span><span class="w">
</span><span class="n">speech_lines</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">speech_html</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">html_nodes</span><span class="p">(</span><span class="s2">"table"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">extract</span><span class="p">(</span><span class="m">9</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">html_table</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">as.data.frame</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="n">X</span><span class="m">1</span><span class="p">,</span><span class="n">line</span><span class="o">=</span><span class="n">X</span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">id</span><span class="o">=</span><span class="nf">rep</span><span class="p">(</span><span class="n">id</span><span class="p">,</span><span class="n">nrow</span><span class="p">(</span><span class="n">.</span><span class="p">)))</span><span class="w">
</span><span class="n">speeches</span><span class="p">[[</span><span class="n">id</span><span class="m">-12</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">speech_lines</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># concatenate all the speeches and add speaker names
</span><span class="n">speech_df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">do.call</span><span class="p">(</span><span class="n">rbind</span><span class="p">,</span><span class="n">speeches</span><span class="p">)</span><span class="w">
</span><span class="n">speech_df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">speech_df</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">left_join</span><span class="p">(</span><span class="n">speakers</span><span class="p">,</span><span class="n">by</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"id"</span><span class="o">=</span><span class="s2">"Number"</span><span class="p">))</span></code></pre></figure>
<h2 id="first-analysis">First analysis</h2>
<p>Now that we have the speeches collected into a single data frame, we can start to analyze them. This post consists of a basic analysis based on the “bag of words” paradigm. More sophisticated analyses are possible, but even the basics can be interesting. First, we do a bit of data munging to create a one-record-per-word-per-speech dataset. The strategy is based on the <a href="http://juliasilge.com/blog/RStudio-Conf/">tidy text paradigm described here</a>. Once we have the dataset in the format we want, we can easily eliminate “uninteresting” words using a filtering <code class="highlighter-rouge">anti_join</code> from the <code class="highlighter-rouge">dplyr</code> package. (Note: there may be analyses where you would want to keep these so-called “stop words”, e.g. “a” and “the”, but for our purposes here we just get rid of them.)</p>
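<p>As a toy illustration (the little data frames here are invented for the example, not part of the analysis), <code class="highlighter-rouge">anti_join</code> keeps only the rows of the left table whose key has no match in the right table:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">library(dplyr)
# made-up token counts and a made-up stop list
tokens <- data.frame(word = c("the", "people", "of", "nation"), n = c(10, 3, 7, 2))
stops  <- data.frame(word = c("the", "of"))
anti_join(tokens, stops, by = "word")  # only the rows for "people" and "nation" survive</code></pre></figure>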
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">speech_words</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">speech_df</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">id</span><span class="o">=</span><span class="n">factor</span><span class="p">(</span><span class="n">id</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">unnest_tokens</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="n">text</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">count</span><span class="p">(</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">sort</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">ungroup</span><span class="p">()</span><span class="w">
</span><span class="n">total_words</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">speech_words</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">id</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">summarize</span><span class="p">(</span><span class="n">total</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">n</span><span class="p">))</span><span class="w">
</span><span class="n">speech_words</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">left_join</span><span class="p">(</span><span class="n">speech_words</span><span class="p">,</span><span class="w"> </span><span class="n">total_words</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">anti_join</span><span class="p">(</span><span class="n">stop_words</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">lexicon</span><span class="o">==</span><span class="s2">"onix"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">lexicon</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">union</span><span class="p">(</span><span class="n">data.frame</span><span class="p">(</span><span class="n">word</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"s"</span><span class="p">,</span><span class="s2">"so"</span><span class="p">))),</span><span class="n">by</span><span class="o">=</span><span class="s2">"word"</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Joining, by = "id"</code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Warning in union_data_frame(x, y): joining character vector and factor,
## coercing into character vector</code></pre></figure>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">speech_words</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">head</span><span class="p">()</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## # A tibble: 6 × 4
## id word n total
## <fctr> <chr> <int> <int>
## 1 26 power 47 8463
## 2 21 power 11 4476
## 3 29 power 11 3341
## 4 27 power 9 4813
## 5 36 power 9 2990
## 6 25 power 8 3902</code></pre></figure>
<p>We can now plot the most common words across inauguration speeches, just to dig into what the dataset looks like. Note that I polished this graph up a bit (changing axis labels to something prettier, rotating x-axis labels, etc.), but the first pass through this graph was a bit ugly. To me, the two most important elements of this graph are selecting the 20 most common words and re-ordering them from most frequent to least.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># find frequencies of words used in speeches
# we do this so we can reorder in ggplot2 (there may be a way to do directly in ggplot2 without this step)
</span><span class="n">speech_freq</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">speech_words</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">word</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">summarize</span><span class="p">(</span><span class="n">frequency</span><span class="o">=</span><span class="n">n</span><span class="p">())</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">frequency</span><span class="p">))</span><span class="w">
</span><span class="c1"># plot frequencies of words over all speeches, top 20 only, in order of frequency most to fewest
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">speech_freq</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">ungroup</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">slice</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">20</span><span class="p">),</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">reorder</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="n">desc</span><span class="p">(</span><span class="n">frequency</span><span class="p">))))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_bar</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="o">=</span><span class="n">frequency</span><span class="p">),</span><span class="n">stat</span><span class="o">=</span><span class="s2">"identity"</span><span class="p">,</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.8</span><span class="p">,</span><span class="w"> </span><span class="n">show.legend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Term Frequency Distribution in Presidential Inaugural Addresses"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">xlab</span><span class="p">(</span><span class="s2">"Word"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ylab</span><span class="p">(</span><span class="s2">"Frequency"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.text.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">angle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">45</span><span class="p">,</span><span class="w"> </span><span class="n">hjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">))</span></code></pre></figure>
<p><img src="/figures//2017-01-27-tidy-text-inauguration-speeches.Rmdunnamed-chunk-3-1.png" alt="plot of chunk unnamed-chunk-3" /></p>
<h2 id="what-makes-speeches-unique">What makes speeches unique</h2>
<p>At least within the bag-of-words paradigm, term-frequency * inverse-document-frequency (TF-IDF) analysis is used to determine what words set speeches (or other documents) apart from each other. A word in a given document has a high TF-IDF score if it appears very often in that speech but rarely in others. If a word appears less frequently in a speech, or appears more often in other speeches, that lowers its TF-IDF score. Thus, a word with a high TF-IDF score can be considered a signature word for that speech. Using this strategy for all interesting words, we can compare styles of speeches, and even cluster them into groups.</p>
<p>First, we use the <code class="highlighter-rouge">bind_tf_idf</code> function from <code class="highlighter-rouge">tidytext</code> to calculate the TF-IDF score. Then we can find the words with the highest TF-IDF score - the words that do the most to distinguish one inauguration speech from another.</p>
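<p>As a sanity check on what <code class="highlighter-rouge">bind_tf_idf</code> computes, here is the arithmetic by hand for the top row of the sorted output further down (a hand calculation added for illustration): the word “arrive” occurs once in speech 14, which has 54 words after stop-word removal, and it appears in only 1 of the 58 speeches.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># tf = count of the word in the document / total words in that document
tf  <- 1 / 54
# idf = natural log of (number of documents / documents containing the word)
idf <- log(58 / 1)
tf * idf  # about 0.0752, matching the tf_idf reported for "arrive" in speech 14</code></pre></figure>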
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">speech_words2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">speech_words</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">bind_tf_idf</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="n">speech_words2</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## # A tibble: 34,734 × 7
## id word n total tf idf tf_idf
## <fctr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 26 power 47 8463 0.015254787 0.2102954 0.0032080118
## 2 21 power 11 4476 0.006654567 0.2102954 0.0013994250
## 3 29 power 11 3341 0.008403361 0.2102954 0.0017671883
## 4 27 power 9 4813 0.004729375 0.2102954 0.0009945658
## 5 36 power 9 2990 0.007419621 0.2102954 0.0015603122
## 6 25 power 8 3902 0.004839685 0.2102954 0.0010177636
## 7 30 power 7 2834 0.006178288 0.2102954 0.0012992655
## 8 50 power 7 1823 0.009681881 0.2102954 0.0020360551
## 9 38 power 6 4397 0.003472222 0.2102954 0.0007301924
## [ reached getOption("max.print") -- omitted 1 row ]
## # ... with 34,724 more rows</code></pre></figure>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">speech_words2</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">total</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">tf_idf</span><span class="p">))</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## # A tibble: 34,734 × 6
## id word n tf idf tf_idf
## <fctr> <chr> <int> <dbl> <dbl> <dbl>
## 1 14 arrive 1 0.01851852 4.060443 0.07519339
## 2 14 upbraidings 1 0.01851852 4.060443 0.07519339
## 3 14 incurring 1 0.01851852 3.367296 0.06235733
## 4 14 violated 1 0.01851852 3.367296 0.06235733
## 5 14 willingly 1 0.01851852 3.367296 0.06235733
## 6 14 injunctions 1 0.01851852 2.961831 0.05484872
## 7 14 knowingly 1 0.01851852 2.961831 0.05484872
## 8 14 previous 1 0.01851852 2.961831 0.05484872
## 9 14 witnesses 1 0.01851852 2.961831 0.05484872
## 10 14 besides 1 0.01851852 2.674149 0.04952127
## # ... with 34,724 more rows</code></pre></figure>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">plot_inaug</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">speech_words2</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">tf_idf</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">word</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">levels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rev</span><span class="p">(</span><span class="n">unique</span><span class="p">(</span><span class="n">word</span><span class="p">))))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">left_join</span><span class="p">(</span><span class="n">speakers</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">id</span><span class="o">=</span><span class="n">factor</span><span class="p">(</span><span class="n">Number</span><span class="p">)),</span><span class="n">by</span><span class="o">=</span><span class="s2">"id"</span><span class="p">)</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">plot_inaug</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">tf_idf</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="m">0.025</span><span class="p">),</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">tf_idf</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Speaker</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_bar</span><span class="p">(</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.8</span><span class="p">,</span><span class="w"> </span><span class="n">stat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"identity"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Highest tf-idf words in Presidential Inauguration Speeches"</span><span class="p">,</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"tf-idf"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_flip</span><span class="p">()</span></code></pre></figure>
<p><img src="/figures//2017-01-27-tidy-text-inauguration-speeches.Rmdunnamed-chunk-4-1.png" alt="plot of chunk unnamed-chunk-4" /></p>
<p>Then we can do this analysis within each speech to find out what distinguishes it from the other speeches. The <code class="highlighter-rouge">for</code> loop below prints multiple pages of faceted graphs, which is handy when you are exploring in RStudio or the R GUI.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">plot_words</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">speech_words2</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">left_join</span><span class="p">(</span><span class="n">speakers</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">id</span><span class="o">=</span><span class="n">factor</span><span class="p">(</span><span class="n">Number</span><span class="p">)),</span><span class="n">by</span><span class="o">=</span><span class="s2">"id"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">Speaker</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">top_n</span><span class="p">(</span><span class="m">15</span><span class="p">,</span><span class="n">tf_idf</span><span class="p">)</span><span class="w">
</span><span class="n">speakers_vec</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">unique</span><span class="p">(</span><span class="n">plot_words</span><span class="o">$</span><span class="n">Speaker</span><span class="p">)</span><span class="w">
</span><span class="n">n_panel</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">4</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="nf">floor</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">speakers_vec</span><span class="p">)</span><span class="o">/</span><span class="n">n_panel</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">these_speakers</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">speakers_vec</span><span class="p">[((</span><span class="n">i</span><span class="m">-1</span><span class="p">)</span><span class="o">*</span><span class="n">n_panel</span><span class="m">+1</span><span class="p">)</span><span class="o">:</span><span class="nf">min</span><span class="p">(</span><span class="n">i</span><span class="o">*</span><span class="n">n_panel</span><span class="p">,</span><span class="nf">length</span><span class="p">(</span><span class="n">speakers_vec</span><span class="p">))]</span><span class="w">
</span><span class="n">this_plot</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">plot_words</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">Speaker</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="n">these_speakers</span><span class="p">),</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">tf_idf</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Speaker</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_bar</span><span class="p">(</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.8</span><span class="p">,</span><span class="w"> </span><span class="n">stat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"identity"</span><span class="p">,</span><span class="w"> </span><span class="n">show.legend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Highest tf-idf words in Inaugural Speeches"</span><span class="p">,</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"tf-idf"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">facet_wrap</span><span class="p">(</span><span class="o">~</span><span class="n">Speaker</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">scales</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"free"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_flip</span><span class="p">()</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">this_plot</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p><img src="/figures//2017-01-27-tidy-text-inauguration-speeches.Rmdunnamed-chunk-5-1.png" alt="plot of chunk unnamed-chunk-5" /><img src="/figures//2017-01-27-tidy-text-inauguration-speeches.Rmdunnamed-chunk-5-2.png" alt="plot of chunk unnamed-chunk-5" /><img src="/figures//2017-01-27-tidy-text-inauguration-speeches.Rmdunnamed-chunk-5-3.png" alt="plot of chunk unnamed-chunk-5" /><img src="/figures//2017-01-27-tidy-text-inauguration-speeches.Rmdunnamed-chunk-5-4.png" alt="plot of chunk unnamed-chunk-5" /><img src="/figures//2017-01-27-tidy-text-inauguration-speeches.Rmdunnamed-chunk-5-5.png" alt="plot of chunk unnamed-chunk-5" /><img src="/figures//2017-01-27-tidy-text-inauguration-speeches.Rmdunnamed-chunk-5-6.png" alt="plot of chunk unnamed-chunk-5" /><img src="/figures//2017-01-27-tidy-text-inauguration-speeches.Rmdunnamed-chunk-5-7.png" alt="plot of chunk unnamed-chunk-5" /><img src="/figures//2017-01-27-tidy-text-inauguration-speeches.Rmdunnamed-chunk-5-8.png" alt="plot of chunk unnamed-chunk-5" /><img src="/figures//2017-01-27-tidy-text-inauguration-speeches.Rmdunnamed-chunk-5-9.png" alt="plot of chunk unnamed-chunk-5" /></p>
<h2 id="which-speeches-are-most-like-each-other">Which speeches are most like each other?</h2>
<p>There’s a lot more that can be done here, but we’ll move on to clustering these inauguration speeches. This requires the document-term matrix: a matrix with documents in the rows, words in the columns, and entries giving the frequency of the column’s term within the row’s document. The <code class="highlighter-rouge">tidytext</code> package uses the <code class="highlighter-rouge">cast_dtm</code> function to create the document-term matrix, and the output can then be used by the <code class="highlighter-rouge">tm</code> package and other R commands for analysis.</p>
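<p>To see the shape of a document-term matrix on something small, here is a toy example (the data frame below is invented for illustration, not drawn from the speeches):</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">library(tidytext)
toy <- data.frame(doc  = c(1, 1, 2, 2),
                  word = c("power", "union", "power", "peace"),
                  n    = c(3, 1, 2, 4))
toy_dtm <- cast_dtm(toy, doc, word, n)
# a 2 x 3 matrix: documents in rows, words in columns, counts as entries;
# "union" is 0 in document 2 and "peace" is 0 in document 1
as.matrix(toy_dtm)</code></pre></figure>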
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">plot_words_dtm</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">speech_words</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">left_join</span><span class="p">(</span><span class="n">speakers</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">id</span><span class="o">=</span><span class="n">factor</span><span class="p">(</span><span class="n">Number</span><span class="p">)),</span><span class="n">by</span><span class="o">=</span><span class="s2">"id"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">cast_dtm</span><span class="p">(</span><span class="n">id</span><span class="p">,</span><span class="n">word</span><span class="p">,</span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="n">plot_words_dtm</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">removeSparseTerms</span><span class="p">(</span><span class="n">plot_words_dtm</span><span class="p">,</span><span class="m">0.1</span><span class="p">)</span><span class="w">
</span><span class="n">plot_words_matrix</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">(</span><span class="n">plot_words_dtm</span><span class="p">)</span></code></pre></figure>
<p>To show the hierarchical clustering analysis, we can simply compute a distance matrix, which can be fed into <code class="highlighter-rouge">hclust</code>:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">dist_matrix</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">dist</span><span class="p">(</span><span class="n">scale</span><span class="p">(</span><span class="n">plot_words_matrix</span><span class="p">),</span><span class="n">method</span><span class="o">=</span><span class="s2">"euclidean"</span><span class="p">)</span><span class="w">
</span><span class="n">inaug_clust</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">hclust</span><span class="p">(</span><span class="n">dist_matrix</span><span class="p">,</span><span class="n">method</span><span class="o">=</span><span class="s2">"ward.D"</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">inaug_clust</span><span class="p">)</span></code></pre></figure>
<p><img src="/figures//2017-01-27-tidy-text-inauguration-speeches.Rmdunnamed-chunk-7-1.png" alt="plot of chunk unnamed-chunk-7" /></p>
<p>It’s pretty interesting that Speech 26 is unlike nearly all the others. This was William Henry Harrison’s speech, which discussed the Roman aristocracy at length, something other presidents have not felt the need to do.</p>
<p>Let’s say we want to break these speeches into a given number of clusters. We can use the k-means approach.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">inaug_km</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">kmeans</span><span class="p">(</span><span class="n">plot_words_matrix</span><span class="p">,</span><span class="n">centers</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="n">nstart</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">25</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="nf">length</span><span class="p">(</span><span class="n">inaug_km</span><span class="o">$</span><span class="n">withinss</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="c1">#For each cluster, this defines the documents in that cluster
</span><span class="w"> </span><span class="n">inGroup</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">which</span><span class="p">(</span><span class="n">inaug_km</span><span class="o">$</span><span class="n">cluster</span><span class="o">==</span><span class="n">i</span><span class="p">)</span><span class="w">
</span><span class="n">within</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">plot_words_dtm</span><span class="p">[</span><span class="n">inGroup</span><span class="p">,]</span><span class="w">
</span><span class="k">if</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">inGroup</span><span class="p">)</span><span class="o">==</span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="n">within</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">as.matrix</span><span class="p">(</span><span class="n">within</span><span class="p">))</span><span class="w">
</span><span class="n">out</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">plot_words_dtm</span><span class="p">[</span><span class="o">-</span><span class="n">inGroup</span><span class="p">,]</span><span class="w">
</span><span class="n">words</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">within</span><span class="p">,</span><span class="m">2</span><span class="p">,</span><span class="n">mean</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">out</span><span class="p">,</span><span class="m">2</span><span class="p">,</span><span class="n">mean</span><span class="p">)</span><span class="w"> </span><span class="c1">#Take the difference in means for each term
</span><span class="w"> </span><span class="n">print</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s2">"Cluster"</span><span class="p">,</span><span class="w"> </span><span class="n">i</span><span class="p">),</span><span class="w"> </span><span class="n">quote</span><span class="o">=</span><span class="nb">F</span><span class="p">)</span><span class="w">
</span><span class="n">labels</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">order</span><span class="p">(</span><span class="n">words</span><span class="p">,</span><span class="w"> </span><span class="n">decreasing</span><span class="o">=</span><span class="nb">T</span><span class="p">)[</span><span class="m">1</span><span class="o">:</span><span class="m">20</span><span class="p">]</span><span class="w"> </span><span class="c1">#Take the top 20 Labels
</span><span class="w"> </span><span class="n">print</span><span class="p">(</span><span class="nf">names</span><span class="p">(</span><span class="n">words</span><span class="p">)[</span><span class="n">labels</span><span class="p">],</span><span class="w"> </span><span class="n">quote</span><span class="o">=</span><span class="nb">F</span><span class="p">)</span><span class="w"> </span><span class="c1">#From here down just labels
</span><span class="w"> </span><span class="k">if</span><span class="p">(</span><span class="n">i</span><span class="o">==</span><span class="nf">length</span><span class="p">(</span><span class="n">inaug_km</span><span class="o">$</span><span class="n">withinss</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="s2">"Cluster Membership"</span><span class="p">)</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">table</span><span class="p">(</span><span class="n">inaug_km</span><span class="o">$</span><span class="n">cluster</span><span class="p">))</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="s2">"Within cluster sum of squares by cluster"</span><span class="p">)</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">inaug_km</span><span class="o">$</span><span class="n">withinss</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] Cluster 1
## [1] people government country own citizens time
## [7] nation <NA> <NA> <NA> <NA> <NA>
## [13] <NA> <NA> <NA> <NA> <NA> <NA>
## [19] <NA> <NA>
## [1] Cluster 2
## [1] government people citizens time country nation
## [7] own <NA> <NA> <NA> <NA> <NA>
## [13] <NA> <NA> <NA> <NA> <NA> <NA>
## [19] <NA> <NA>
## [1] Cluster 3
## [1] nation time own people citizens country
## [7] government <NA> <NA> <NA> <NA> <NA>
## [13] <NA> <NA> <NA> <NA> <NA> <NA>
## [19] <NA> <NA>
## [1] Cluster 4
## [1] citizens country own nation time government
## [7] people <NA> <NA> <NA> <NA> <NA>
## [13] <NA> <NA> <NA> <NA> <NA> <NA>
## [19] <NA> <NA>
## [1] Cluster 5
## [1] government people citizens country own nation
## [7] time <NA> <NA> <NA> <NA> <NA>
## [13] <NA> <NA> <NA> <NA> <NA> <NA>
## [19] <NA> <NA>
## [1] "Cluster Membership"
##
## 1 2 3 4 5
## 8 12 19 16 3
## [1] "Within cluster sum of squares by cluster"
## [1] 760.3750 954.5833 1147.1579 733.8125 797.3333</code></pre></figure>
<p>Membership of speeches in clusters is here:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">inaug_km</span><span class="o">$</span><span class="n">cluster</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
## 4 4 1 4 4 4 4 2 2 2 4 2 1 5 5 4 3 2 1 4 4 4 2 1 2
## 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62
## 1 1 1 2 4 2 4 3 3 1 5 3 2 3 4 3 3 3 4 3 3 3 3 2 2
## 63 64 65 66 67 68 69 70
## 3 3 3 4 3 3 3 3</code></pre></figure>
<p>It’s interesting to note that all of the speeches since Hoover (i.e. 49 through 70) have been in either Cluster 1 or Cluster 5, with the latest ones in Cluster 1 (this includes Reagan, Bush, Clinton, Bush, Obama, and Trump). Nearly all speeches discuss the relationship between government and its people (as you would expect from an inauguration speech), but Cluster 5 seems to put more emphasis on people, and Cluster 1 on government. Hmmm…</p>
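<p>To see which presidents fall into each cluster, the cluster labels can be joined back to the <code class="highlighter-rouge">speakers</code> table. The following is a sketch, not code from the original analysis: it assumes the <code class="highlighter-rouge">speakers</code> data frame and its <code class="highlighter-rouge">Number</code> column used earlier, and relies on <code class="highlighter-rouge">kmeans</code> keeping the document ids as names on the cluster vector.</p>

```r
# Sketch: map k-means cluster labels back to speakers
# (assumes `speakers` with a `Number` column, as used earlier in the post)
cluster_df <- data.frame(id = names(inaug_km$cluster),
                         cluster = unname(inaug_km$cluster),
                         stringsAsFactors = FALSE)
speakers %>%
  mutate(id = as.character(Number)) %>%
  inner_join(cluster_df, by = "id") %>%
  arrange(cluster, id)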
<p>Of course, you can probably get something different with fewer clusters, and you can use the hierarchical clustering analysis above to justify a different number of clusters.</p>
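<p>For instance, the dendrogram above can be cut into a chosen number of groups with <code class="highlighter-rouge">cutree</code>. A minimal sketch, reusing the <code class="highlighter-rouge">inaug_clust</code> object from above (the choice of four clusters is arbitrary):</p>

```r
# Cut the Ward dendrogram into 4 groups (4 is an arbitrary illustration)
hc_groups <- cutree(inaug_clust, k = 4)
table(hc_groups)  # number of speeches in each group

# Outline the same 4 groups on the dendrogram
plot(inaug_clust)
rect.hclust(inaug_clust, k = 4, border = "red")
```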
<h2 id="sentiment-analysis">Sentiment analysis</h2>
<p>We return to the bag-of-words <code class="highlighter-rouge">tidytext</code> paradigm to do a sentiment analysis. The sentiment analysis we do here is very simple (perhaps oversimplified), and <code class="highlighter-rouge">tidytext</code> supports more sophisticated analyses, but this is a start. We go back to the one-record-per-speech data frame and score words based on sentiment. We don’t worry about stop words at this point, because they carry no sentiment and would be scored as 0 anyway. We use the Bing sentiment list, which classifies words as positive or negative (or neither). We assign a score of +1 to each positive word and -1 to each negative word, add up the score column, and divide by the number of words in the speech (which is why we did not eliminate stop words here). This gives a sort of average positivity/negativity score per word: a negative score means the speech has more negative words than positive, a positive score the reverse, and the larger the absolute value, the greater the imbalance. Similarly, we count the number of sentiment words (whether positive or negative) to get an idea of the emotional content of the speech. (Note: this is a preliminary analysis. It does not distinguish between, say, “good” and “not good”, so take any individual result with a grain of salt and dig deeper.)</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">sw_sent</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">speech_df</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">id</span><span class="o">=</span><span class="n">factor</span><span class="p">(</span><span class="n">id</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">unnest_tokens</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="n">text</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">inner_join</span><span class="p">(</span><span class="n">get_sentiments</span><span class="p">(</span><span class="s2">"bing"</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">score</span><span class="o">=</span><span class="p">(</span><span class="n">sentiment</span><span class="o">==</span><span class="s2">"positive"</span><span class="p">)</span><span class="o">-</span><span class="p">(</span><span class="n">sentiment</span><span class="o">==</span><span class="s2">"negative"</span><span class="p">),</span><span class="n">is_scored</span><span class="o">=</span><span class="n">ifelse</span><span class="p">(</span><span class="n">sentiment</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"positive"</span><span class="p">,</span><span class="s2">"negative"</span><span class="p">),</span><span class="m">1</span><span class="p">,</span><span class="m">0</span><span class="p">))</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Joining, by = "word"</code></pre></figure>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">sw_sent</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">Speaker</span><span class="p">,</span><span class="n">id</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">summarize</span><span class="p">(</span><span class="n">speech_score</span><span class="o">=</span><span class="nf">sum</span><span class="p">(</span><span class="n">score</span><span class="p">),</span><span class="n">speech_sent_words</span><span class="o">=</span><span class="nf">sum</span><span class="p">(</span><span class="n">is_scored</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">left_join</span><span class="p">(</span><span class="n">total_words</span><span class="p">,</span><span class="n">by</span><span class="o">=</span><span class="s2">"id"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">speech_score</span><span class="o">=</span><span class="n">speech_score</span><span class="o">/</span><span class="n">total</span><span class="p">,</span><span class="n">speech_sent_words</span><span class="o">=</span><span class="n">speech_sent_words</span><span class="o">/</span><span class="n">total</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">arrange</span><span class="p">(</span><span class="n">speech_score</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="n">nrow</span><span class="p">(</span><span class="n">.</span><span class="p">))</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Source: local data frame [58 x 5]
## Groups: Speaker [39]
##
## Speaker id speech_score speech_sent_words total
## <fctr> <fctr> <dbl> <dbl> <int>
## 1 Abraham Lincoln 32 0.001426534 0.07275321 701
## 2 Abraham Lincoln 31 0.002199615 0.06983778 3637
## 3 James Madison 19 0.010734930 0.09331131 1211
## 4 John F. Kennedy 56 0.010989011 0.10036630 1365
## 5 Franklin D. Roosevelt 50 0.011519473 0.08831596 1823
## 6 Woodrow Wilson 44 0.011716462 0.08787346 1707
## 7 Franklin D. Roosevelt 49 0.012227539 0.09409888 1881
## 8 William Henry Harrison 26 0.013115916 0.06865178 8463
## 9 Franklin D. Roosevelt 51 0.015613383 0.06022305 1345
## 10 Andrew Jackson 24 0.016992353 0.07306712 1177
## 11 Barack Obama 68 0.017827529 0.08499171 2412
## 12 Martin Van Buren 25 0.018452076 0.08867248 3902
## 13 Ronald Reagan 61 0.018457752 0.07752256 2438
## 14 Thomas Jefferson 17 0.019852262 0.07710065 2166
## [ reached getOption("max.print") -- omitted 44 rows ]</code></pre></figure>
<p>Grover Cleveland and James Madison had the speeches with the highest emotional content, followed by Jimmy Carter and George W. Bush. Wilson, Franklin D. Roosevelt, and George Washington had the lowest emotional content. Abraham Lincoln (in 1861) gave the speech with the least positive content (all speeches were positive on balance). William Henry Harrison’s odd speech about the Romans had nearly the least emotional content and was one of the least positive speeches.</p>
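<p>One way to eyeball these results is to plot the per-word sentiment score by speech. The sketch below assumes the summary pipeline above is saved to a variable; <code class="highlighter-rouge">sent_scores</code> is a name introduced here, not one from the original analysis:</p>

```r
library(ggplot2)

# Recompute the per-speech summary, keeping it in a variable this time
sent_scores <- sw_sent %>%
  group_by(Speaker, id) %>%
  summarize(speech_score = sum(score), speech_sent_words = sum(is_scored)) %>%
  left_join(total_words, by = "id") %>%
  mutate(speech_score = speech_score / total)

# Speeches ordered from least to most positive
ggplot(sent_scores, aes(x = reorder(id, speech_score), y = speech_score)) +
  geom_col() +
  coord_flip() +
  labs(x = "Speech id", y = "Net positive words per word")
```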
<h2 id="conclusion">Conclusion</h2>
<p>This analysis of inauguration speeches comes at a time when the change of US presidential power has a different feel, even in the inauguration speech itself. The preliminary analysis above shows that Trump’s speech was similar in topic to speeches of the last 40 or so years, with nothing notable in its emotional content.</p>
<p>This first pass revealed a few interesting patterns, but a more sophisticated analysis might reveal something further.</p>John JohnsonAcquiring inauguration speeches Though not about Greenville especially, it might be interesting to quantitatively analyze inauguration speeches. This analysis will be done using two paradigms: the tm package and the tidytext package. We will read the speeches in such a way that we use the tidytext package; later on we will use some tools from that package to make analyses traditionally done by tm. I looked around for inauguration speeches, and finally found them at www.bartelby.com. They are in a format more for human consumption, but with the use of the rvest (harvest?) package, we can read them in relatively easily. However, we need to do a mapping from speech IDs to speakers (newly inaugurated presidents), which is a little ugly and tedious.Greenville on Twitter2016-12-21T00:00:00+00:002016-12-21T00:00:00+00:00https://randomjohn.github.io/r-twitter<p>In this blog post, we use <a href="http://www.r-project.org">R</a> to analyze <a href="http://www.twitter.com">Twitter</a> data on topics of interest to Greenville, SC. We will describe obtaining, manipulating, and summarizing the data.</p>
<p><a href="http://www.twitter.com">Twitter</a> is a “microblogging” service where users can, usually publicly, share links, pictures, or short comments (up to 140 characters) onto a timeline. The public timeline consists of all public tweets, but people can build their own private timelines to narrow content to just what they want to see. (They do this by “following” users.) Over the years, many companies, news organizations, and users have considered the social media site essential for sharing news and other information. (Or cat memes.) Twitter has some organizational tools such as replies/conversation threads, mentions (i.e. naming other users using the @ notation), and hashtags (naming a topic using # notation). Twitter has encouraged the use of these organizational tools by automatically making mentions and hashtags clickable links.</p>
<p>These organizational tools can make for some interesting analysis. For instance, a game show may encourage viewers to vote on a winner using hashtags. On their end, the producers create a filter for a particular hashtag (e.g. #votemyplayer) and count votes. These tools also make Twitter data ripe for text mining (which Twitter itself uses to identify trending topics).</p>
<h2 id="obtaining-the-twitter-data">Obtaining the Twitter data</h2>
<p>Twitter makes it possible for software to obtain Twitter comments without having to resort to “web-scraping” techniques (i.e. downloading the data as a web page and then parsing the HTML). Instead, you can go through an Application Programming Interface (API) to obtain the comments directly. If you’re interested, Twitter has a whole <a href="https://dev.twitter.com/">subdomain</a> related to accessing their data, including documentation. There are a lot of technical details, but for the casual user probably the only ones of interest are the API key and rate limits. This post won’t fuss with rate limits, but more serious work may require some further understanding of these issues. However, you will need to create an API key. Follow <a href="http://bigcomputing.blogspot.com/2016/02/the-twitter-r-package-by-jeff-gentry-is.html">these instructions</a>, which are tailored for R users. The process essentially consists of creating a token at <a href="http://apps.twitter.com">Twitter’s app web site</a> and running an R function with the token. I set the variables <code class="highlighter-rouge">consumer_secret</code>, <code class="highlighter-rouge">consumer_key</code>, <code class="highlighter-rouge">access_token</code>, and <code class="highlighter-rouge">access_secret</code> in an R block by copying and pasting from the Twitter apps site; that block is not echoed in this blog post for obvious reasons.</p>
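<p>The credential block itself is nothing more than four string assignments; a placeholder version looks like this (the values shown are dummies, not working keys):</p>

```r
# Placeholders only -- paste the real values from your app at apps.twitter.com
consumer_key    <- "YOUR_CONSUMER_KEY"
consumer_secret <- "YOUR_CONSUMER_SECRET"
access_token    <- "YOUR_ACCESS_TOKEN"
access_secret   <- "YOUR_ACCESS_SECRET"
```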
<p>Fortunately, the <a href="https://cran.r-project.org/web/packages/twitteR/index.html">twitteR</a> package makes obtaining data from Twitter easy. It’s on CRAN, so grab it using <code class="highlighter-rouge">install.packages</code> (it will also install dependencies such as the <code class="highlighter-rouge">bit64</code> and <code class="highlighter-rouge">httr</code> packages if you don’t have them already) before moving on.</p>
<p>We authenticate our R program to Twitter and then start with searching the public timeline for “Greenville”. Note due to the changing nature of Twitter, your results will probably be different:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">origop</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">options</span><span class="p">(</span><span class="s2">"httr_oauth_cache"</span><span class="p">)</span><span class="w">
</span><span class="n">options</span><span class="p">(</span><span class="n">httr_oauth_cache</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">twitteR</span><span class="p">)</span><span class="w">
</span><span class="n">setup_twitter_oauth</span><span class="p">(</span><span class="n">consumer_key</span><span class="p">,</span><span class="w"> </span><span class="n">consumer_secret</span><span class="p">,</span><span class="w"> </span><span class="n">access_token</span><span class="p">,</span><span class="w"> </span><span class="n">access_secret</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">[1] "Using direct authentication"</code></pre></figure>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">options</span><span class="p">(</span><span class="n">httr_oauth_cache</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">origop</span><span class="p">)</span><span class="w">
</span><span class="n">gvl_twitter</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">searchTwitter</span><span class="p">(</span><span class="s2">"Greenville"</span><span class="p">)</span><span class="w">
</span><span class="n">gvl_twitter_df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">twListToDF</span><span class="p">(</span><span class="n">gvl_twitter</span><span class="p">)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">gvl_twitter_df</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text"> text
1 RT @JessLivMo: PROTEST: Greenville TODAY at 4pm! https://t.co/PtDqfV1iQi
2 Y'all, #texaspawprints is at the Pet Supplies on lower Greenville today. Their kitties areÂ… https://t.co/AVwAEvm9w5
3 RT @nssottile: What is Governor @henrymcmaster doing to get Greenville resident and Clemson Ph.D. #NazaninZinouri back home? #sctweets #MusÂ…
4 Can you recommend anyone for this #job in #Greenville, SC? https://t.co/bGoRU5wFqQ #Labor #Hiring #CareerArc
favorited favoriteCount replyToSN created truncated
1 FALSE 0 <NA> 2017-01-29 20:07:33 FALSE
2 FALSE 0 <NA> 2017-01-29 20:06:10 FALSE
3 FALSE 0 <NA> 2017-01-29 20:05:53 FALSE
4 FALSE 0 <NA> 2017-01-29 20:04:47 FALSE
replyToSID id replyToUID
1 <NA> 825797550087217152 <NA>
2 <NA> 825797204724088836 <NA>
3 <NA> 825797131420053505 <NA>
4 <NA> 825796855728451584 <NA>
statusSource
1 <a href="http://www.samruston.co.uk" rel="nofollow">Flamingo for Android</a>
2 <a href="http://linkis.com" rel="nofollow">Put your button on any page! </a>
3 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
4 <a href="http://www.tweetmyjobs.com" rel="nofollow">TweetMyJOBS</a>
screenName retweetCount isRetweet retweeted longitude
1 RatherBeGulfing 15 TRUE FALSE <NA>
2 KickButtVegan 0 FALSE FALSE -96.77009998
3 ClaireOfTarth 47 TRUE FALSE <NA>
4 tmj_grn_labor 0 FALSE FALSE -82.4536115
latitude
1 <NA>
2 32.81234958
3 <NA>
4 34.8268335
[ reached getOption("max.print") -- omitted 2 rows ]</code></pre></figure>
<p><code class="highlighter-rouge">searchTwitter</code> returns data as a list, which may or may not be desirable. By default, it returns the last 25 items matching the query you pass (this can be changed using the <code class="highlighter-rouge">n=</code> option to the function). I used <code class="highlighter-rouge">twListToDF</code> (part of the <code class="highlighter-rouge">twitteR</code> package) to convert the list to a data frame. The data frame contains a lot of useful information, such as the tweet text, whether it’s a reply and the tweet to which it replies, the screen name, and a date stamp. Thus, Twitter provides a rich data source on topics, interactions, and reactions.</p>
<h2 id="analyzing-the-data">Analyzing the data</h2>
<h3 id="retweets">Retweets</h3>
<p>The first thing to notice is that many of these tweets may be “retweets”, where a user posts the exact same tweet as a previous user to create a larger audience for the tweet. This data point may be interesting in its own right, but for now, because we are just analyzing the text, we will filter out retweets:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">
</span><span class="n">gvl_twitter_unique</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gvl_twitter_df</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="o">!</span><span class="n">isRetweet</span><span class="p">)</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">gvl_twitter_unique</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="n">text</span><span class="p">))</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text"> text
1 Y'all, #texaspawprints is at the Pet Supplies on lower Greenville today. Their kitties areÂ… https://t.co/AVwAEvm9w5
2 Can you recommend anyone for this #job in #Greenville, SC? https://t.co/bGoRU5wFqQ #Labor #Hiring #CareerArc
3 @Meghan_Trainor i miss u <ed><U+00A0><U+00BD><ed><U+00B8><U+00AD><ed><U+00A0><U+00BD><ed><U+00B8><U+00AD>\nCome back to Greenville soon <U+2764>
4 1/29/73 Greenville, SC. Bobby &amp; Terry Kay vs Freddy Sweetan/Mike DuBois, Johnny Weaver/Penny Banner vs The Alaskan/Â… https://t.co/hFH3iCeWLl
5 Interested in a #job in #Greenville, SC? This could be a great fit: https://t.co/yuufbTYC57 #IT #Hiring #CareerArc
6 Join the Robert Half Technology team! See our latest #job opening here: https://t.co/xCNLSvTkDQ #RHTechJobs #IT #Greenville, SC #Hiring
7 Want to work at Hubbell Incorporated? We're #hiring in #Greenville, SC! Click for details: https://t.co/dfTOjYDWG9 #Job #ProductMgmt #Jobs
8 I need a church in Greenville ASAP! If you have suggestions let me know
9 Wow! @Lyft pledges $1 million to @ACLU https://t.co/YTkkGeE5l6 #Lyft is finally in Greenville. App downloaded. #DeleteUber
10 Greenville, NC: 3:00 PM Temp: 53.5ºF Dew: 25.8ºF Pressure: 1008.2mb Rain: 0.00" #encwx #ncwx https://t.co/sZnc3rvVsm
11 Interested in a #job in #Dearborn, MI? This could be a great fit: https://t.co/GxLvH7wJ9J #Retail #Hiring #CareerArc
12 Want to work in #Greenville, NC? View our latest opening: https://t.co/aU11faXsxp #Job #Healthcare #Jobs #Hiring #CareerArc
13 Driving to Greenville, sharing real-time road info with wazers in my area. ETA 3:23 PM using @waze - Drive Social.
14 #Pursue #Bright #Career With The #Universities In #South #Australia\n\nhttps://t.co/ZD6nPXglKr\n#CheapFlights #Greenville
15 https://t.co/DVdFrLQwaF\nMy latest Greenville News column.
16 @igorvolsky Greenville-Spartanburg, South Carolina (GSP), today at 4:00.
17 Join the WHBM team! See our latest #job opening here: https://t.co/jd5D8zjYXW #Retail #Greenville, SC #Hiring #CareerArc
18 (Greenville, SC) I need to figure out what happened to my driver's license, or I'm going to los... https://t.co/CS4qMYpUkB
19 See our latest #Greenville, SC #job and click to apply: Mortgage Consultant (SAFE) - https://t.co/f9H5PO30Fe #Veterans #Hiring #CareerArc</code></pre></figure>
<p>The thing to notice here is that there are several different Greenvilles, which makes analysis of the local area pretty hard. Many of the tweets could be about Greenville, NC or Greenville, SC. In this particular dataset, there was even a Greenville Road in California (where there was a car fire). Rather than play a filtering game, it may be better to apply some knowledge specific to the area. For instance, local tweets are often tagged with <code class="highlighter-rouge">#yeahThatgreenville</code>. So we will search again for the <code class="highlighter-rouge">#yeahthatgreenville</code> hashtag (and add a few more tweets as well). This time, we’ll keep retweets:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">origop</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">options</span><span class="p">(</span><span class="s2">"httr_oauth_cache"</span><span class="p">)</span><span class="w">
</span><span class="n">options</span><span class="p">(</span><span class="n">httr_oauth_cache</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">setup_twitter_oauth</span><span class="p">(</span><span class="n">consumer_key</span><span class="p">,</span><span class="w"> </span><span class="n">consumer_secret</span><span class="p">,</span><span class="w"> </span><span class="n">access_token</span><span class="p">,</span><span class="w"> </span><span class="n">access_secret</span><span class="p">)</span><span class="w"> </span><span class="err">#</span><span class="w"> </span><span class="n">needed</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">knit</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">Rmd</span><span class="w"> </span><span class="n">file</span><span class="p">,</span><span class="w"> </span><span class="n">may</span><span class="w"> </span><span class="n">not</span><span class="w"> </span><span class="n">be</span><span class="w"> </span><span class="n">necessary</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">you</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">reauthenticate</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="n">session</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">[1] "Using direct authentication"</code></pre></figure>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">options</span><span class="p">(</span><span class="n">httr_oauth_cache</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">origop</span><span class="p">)</span><span class="w">
</span><span class="n">gvl_twitter_unique</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">searchTwitter</span><span class="p">(</span><span class="s2">"#yeahthatgreenville"</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">200</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">twListToDF</span><span class="p">()</span><span class="w">
</span><span class="n">gvl_twitter_nolink</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gvl_twitter_unique</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">text</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gsub</span><span class="p">(</span><span class="s2">"https?://[\\w\\./]+"</span><span class="p">,</span><span class="w">
</span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">text</span><span class="p">,</span><span class="w"> </span><span class="n">perl</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span></code></pre></figure>
<p>Here I do two separate queries and add them together using the <code class="highlighter-rouge">bind_rows</code> function from <code class="highlighter-rouge">dplyr</code>.</p>
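<p>That combination step is not shown in full above; a minimal sketch of it might look like the following, where the second search term is only a placeholder for whatever the second query actually was:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">library(dplyr)
library(twitteR)

# two separate searches, each converted to a data frame
q1 &lt;- searchTwitter("#yeahthatgreenville", n = 200) %&gt;% twListToDF()
q2 &lt;- searchTwitter("greenvillesc", n = 200) %&gt;% twListToDF()  # placeholder term

# stack the results and drop tweets that both queries returned
gvl_twitter_unique &lt;- bind_rows(q1, q2) %&gt;% distinct(id, .keep_all = TRUE)</code></pre></figure>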
<h3 id="who-is-tweeting">Who is tweeting</h3>
<p>The first thing we can do is get a list of users who tweet under this hashtag as well as their number of tweets:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="n">gvl_twitter_nolink</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">reorder</span><span class="p">(</span><span class="n">screenName</span><span class="p">,</span><span class="w"> </span><span class="n">screenName</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="nf">length</span><span class="p">(</span><span class="n">x</span><span class="p">))))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_bar</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.text.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">angle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">60</span><span class="p">,</span><span class="w"> </span><span class="n">hjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">xlab</span><span class="p">(</span><span class="s2">""</span><span class="p">)</span></code></pre></figure>
<p><img src="/figures//2016-12-21-r-twitter.Rmdunnamed-chunk-5-1.png" alt="plot of chunk unnamed-chunk-5" /></p>
<p>So I snuck a trick into the above graph. In bar charts presenting counts, I usually prefer to order the bars by descending length. That way I can identify the most and least common screen names quickly. I accomplish this by using <code class="highlighter-rouge">x = reorder(screenName, screenName, function(x) -length(x))</code> in the <code class="highlighter-rouge">aes()</code> function above. Now we can see that <code class="highlighter-rouge">@GiovanniDodd</code> was the most prolific tweeter in the last 200 tweets I accessed. Some of the prolific tweeters appear to be businesses, such as <code class="highlighter-rouge">@CourtyardGreenville</code>, or perhaps tourism accounts such as <code class="highlighter-rouge">@Greenville_SC</code>.</p>
<h3 id="what-users-are-saying">What users are saying</h3>
<p>To analyze what users are saying about “#yeahthatgreenville”, we use the <code class="highlighter-rouge">tidytext</code> package. There are a number of packages that can be used to analyze text, and <code class="highlighter-rouge">tm</code> used to be a favorite, but <code class="highlighter-rouge">tidytext</code> fits within the context of <a href="http://vita.had.co.nz/papers/tidy-data.pdf">tidy data</a>. We prefer the tidy data framework because it works with data in a specific format and has a number of powerful tools that have a specific focus but interoperate well, much like the UNIX ideal. Here, <code class="highlighter-rouge">tidytext</code> will allow us to use <code class="highlighter-rouge">dplyr</code> and similar tools using the pipe operator. The code will be easier to read and follow.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">tidytext</span><span class="p">)</span><span class="w">
</span><span class="n">tweet_words</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gvl_twitter_nolink</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">text</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">unnest_tokens</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w">
</span><span class="n">text</span><span class="p">)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">tweet_words</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text"> id word
1 825791100107505664 tonight
1.1 825791100107505664 at
1.2 825791100107505664 coffee
1.3 825791100107505664 underground
1.4 825791100107505664 1
1.5 825791100107505664 e</code></pre></figure>
<p>I used the <code class="highlighter-rouge">select</code> function from <code class="highlighter-rouge">dplyr</code> to keep only the <code class="highlighter-rouge">id</code> and <code class="highlighter-rouge">text</code> fields. The <code class="highlighter-rouge">unnest_tokens()</code> function creates a long dataset with one row per word in place of the full text. All the other fields remain unchanged. We can now easily create a bar chart of the words used the most:</p>
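<p>To see what <code class="highlighter-rouge">unnest_tokens()</code> does in isolation, here is a toy example on a made-up two-row data frame (not part of the Twitter data):</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">library(tidytext)
library(dplyr)

toy &lt;- data.frame(id = 1:2,
                  text = c("Coffee downtown tonight", "Yeah that Greenville"),
                  stringsAsFactors = FALSE)
toy %&gt;% unnest_tokens(word, text)
# one row per word ("coffee", "downtown", "tonight", ...), lowercased,
# with the id column carried along unchanged</code></pre></figure>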
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">tweet_words</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">reorder</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="nf">length</span><span class="p">(</span><span class="n">x</span><span class="p">))))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_bar</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.text.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">angle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">60</span><span class="p">,</span><span class="w"> </span><span class="n">hjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">xlab</span><span class="p">(</span><span class="s2">""</span><span class="p">)</span></code></pre></figure>
<p><img src="/figures//2016-12-21-r-twitter.Rmdunnamed-chunk-7-1.png" alt="plot of chunk unnamed-chunk-7" /></p>
<p>This plot is very busy, so we plot, say, the top 20 words:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">tweet_words</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">count</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">sort</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">slice</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">20</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">reorder</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w">
</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="n">n</span><span class="p">),</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_bar</span><span class="p">(</span><span class="n">stat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"identity"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.text.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">angle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">60</span><span class="p">,</span><span class="w">
</span><span class="n">hjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">xlab</span><span class="p">(</span><span class="s2">""</span><span class="p">)</span></code></pre></figure>
<p><img src="/figures//2016-12-21-r-twitter.Rmdunnamed-chunk-8-1.png" alt="plot of chunk unnamed-chunk-8" /></p>
<p>Unfortunately, this is terribly unexciting. <em>Of course</em> “a”, “to”, “for”, and similar words are going to be at the top. In text mining, we create a list of “stop words”, including these, which are so common they are usually not worth including in an analysis. The <code class="highlighter-rouge">tidytext</code> package includes a <code class="highlighter-rouge">stop_words</code> data frame to assist us:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">head</span><span class="p">(</span><span class="n">stop_words</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text"># A tibble: 6 × 2
word lexicon
<chr> <chr>
1 a SMART
2 a's SMART
3 able SMART
4 about SMART
5 above SMART
6 according SMART</code></pre></figure>
<p>We’ll change <code class="highlighter-rouge">stop_words</code> slightly to make it useful to us. This involves dropping the <code class="highlighter-rouge">lexicon</code> column and adding some common, uninteresting words: “https”, “t.co”, “yeahthatgreenville”, “amp”, and “gvl”. We filter these out for various reasons, e.g. “https” and “t.co” appear in URLs, “amp” is left over from tokenizing some HTML code, and we searched on “yeahthatgreenville”. Augmenting stop words is a bit of an iterative process, which I’m not showing here; I went back and forth a few times to get this list.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">my_stop_words</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">stop_words</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">lexicon</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">bind_rows</span><span class="p">(</span><span class="n">data.frame</span><span class="p">(</span><span class="n">word</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"https"</span><span class="p">,</span><span class="w">
</span><span class="s2">"t.co"</span><span class="p">,</span><span class="w"> </span><span class="s2">"yeahthatgreenville"</span><span class="p">,</span><span class="w"> </span><span class="s2">"amp"</span><span class="p">,</span><span class="w"> </span><span class="s2">"gvl"</span><span class="p">)))</span></code></pre></figure>
<p>Now, we can determine which of the words above are stop words and thus not worth analyzing:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">tweet_words_interesting</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tweet_words</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">anti_join</span><span class="p">(</span><span class="n">my_stop_words</span><span class="p">)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">tweet_words_interesting</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text"> id word
1 825791100107505664 tonight
2 825474845710376962 tonight
3 825443843445293057 tonight
4 825791100107505664 coffee
5 825791100107505664 coffee
6 825693272119054336 coffee</code></pre></figure>
<p>The <code class="highlighter-rouge">anti_join</code> function is probably not familiar to most data scientists or statisticians. It is, in a sense, the opposite of a merge. The command above matches the <code class="highlighter-rouge">tweet_words</code> and <code class="highlighter-rouge">my_stop_words</code> data frames and then <em>removes</em> the matching rows, leaving only the rows of <code class="highlighter-rouge">tweet_words</code> (the <code class="highlighter-rouge">id</code> and <code class="highlighter-rouge">word</code> columns) that do not match anything in <code class="highlighter-rouge">my_stop_words</code>. This is desirable because <code class="highlighter-rouge">my_stop_words</code> contains words we <em>do not</em> want to analyze.</p>
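<p>A toy example makes the <code class="highlighter-rouge">anti_join</code> semantics concrete:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">library(dplyr)

words &lt;- data.frame(id   = c(1, 1, 2),
                    word = c("the", "coffee", "the"),
                    stringsAsFactors = FALSE)
stops &lt;- data.frame(word = "the", stringsAsFactors = FALSE)

anti_join(words, stops, by = "word")
# only the row with "coffee" remains; both "the" rows are removed</code></pre></figure>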
<p>Now we can analyze the more interesting words:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">tweet_words_interesting</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">count</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">sort</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">slice</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">20</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">reorder</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w">
</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="n">n</span><span class="p">),</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_bar</span><span class="p">(</span><span class="n">stat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"identity"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.text.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">angle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">60</span><span class="p">,</span><span class="w">
</span><span class="n">hjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">xlab</span><span class="p">(</span><span class="s2">""</span><span class="p">)</span></code></pre></figure>
<p><img src="/figures//2016-12-21-r-twitter.Rmdunnamed-chunk-12-1.png" alt="plot of chunk unnamed-chunk-12" /></p>
<h2 id="sentiment-analysis">Sentiment analysis</h2>
<p>Sentiment analysis is, in short, the quantitative study of the emotional content of text. The most sophisticated analysis, of course, is very difficult, but we can make a start using a simple procedure. Many of the ideas here can be found in a <a href="https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html">vignette</a> for the package written by Julia Silge and David Robinson.</p>
<p>As a start, we use the Bing lexicon, which maps a word to positive/negative according to whether its sentiment content is positive or negative.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">bing_lex</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">get_sentiments</span><span class="p">(</span><span class="s2">"bing"</span><span class="p">)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">bing_lex</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text"># A tibble: 6 × 2
word sentiment
<chr> <chr>
1 2-faced negative
2 2-faces negative
3 a+ positive
4 abnormal negative
5 abolish negative
6 abominable negative</code></pre></figure>
<p>Sentiment analysis is then an exercise in joining: here a left join, which keeps every word and leaves the sentiment as <code class="highlighter-rouge">&lt;NA&gt;</code> when a word has no match in the lexicon:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">gvl_sentiment</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tweet_words_interesting</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">left_join</span><span class="p">(</span><span class="n">bing_lex</span><span class="p">)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">gvl_sentiment</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text"> id word sentiment
1 825791100107505664 tonight <NA>
2 825474845710376962 tonight <NA>
3 825443843445293057 tonight <NA>
4 825791100107505664 coffee <NA>
5 825791100107505664 coffee <NA>
6 825693272119054336 coffee <NA></code></pre></figure>
<p>Once you get to this point, sentiment analysis can start fairly easily:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">gvl_sentiment</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="o">!</span><span class="nf">is.na</span><span class="p">(</span><span class="n">sentiment</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">group_by</span><span class="p">(</span><span class="n">sentiment</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">summarise</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">())</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text"># A tibble: 2 × 2
sentiment n
<chr> <int>
1 negative 25
2 positive 96</code></pre></figure>
<p>There are many more positive words than negative words, so the mood tilts positive in our crude analysis. We can also group by tweet and see, on average, how many positive and negative words each tweet contains:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">gvl_sent_anly2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gvl_sentiment</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">group_by</span><span class="p">(</span><span class="n">sentiment</span><span class="p">,</span><span class="w"> </span><span class="n">id</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">summarise</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">())</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">ungroup</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">group_by</span><span class="p">(</span><span class="n">sentiment</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">summarise</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w">
</span><span class="n">gvl_sent_anly2</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text"># A tibble: 3 × 2
sentiment n
<chr> <dbl>
1 negative 1.041667
2 positive 1.333333
3 <NA> 6.361809</code></pre></figure>
<p>On average, there are about 1.33 positive words and 1.04 negative words per tweet (among tweets containing at least one such word), if you accept the assumptions of the above analysis.</p>
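<p>One natural extension, sketched here under the same assumptions, is a net sentiment score per tweet: the count of positive words minus the count of negative words. Tweets never matched by the lexicon drop out of this view.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">library(dplyr)
library(tidyr)

tweet_scores &lt;- gvl_sentiment %&gt;%
  filter(!is.na(sentiment)) %&gt;%
  count(id, sentiment) %&gt;%
  spread(sentiment, n, fill = 0) %&gt;%   # one column per sentiment
  mutate(score = positive - negative)  # net sentiment per tweet

# how many tweets lean negative, neutral, or positive
table(sign(tweet_scores$score))</code></pre></figure>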
<p>There is, of course, a lot more that can be done, but this will get you started. For some more sophisticated ideas you can check <a href="http://juliasilge.com/blog/Reddit-Responds/">Julia Silge’s analysis of Reddit data</a>, for instance. Another kind of analysis looking at sentiment and emotional content can be found <a href="https://mran.microsoft.com/posts/twitter.html">here</a> (with the caveat that it uses the predecessor to <code class="highlighter-rouge">dplyr</code> and thus runs somewhat less efficiently). Finally, it would probably be useful to supplement the above sentiment data frames with situation-specific sentiment analysis, such as making <code class="highlighter-rouge">goallllllll</code> in the above a positive word.</p>
<h2 id="conclusions">Conclusions</h2>
<p>The R packages <code class="highlighter-rouge">twitteR</code> and <code class="highlighter-rouge">tidytext</code> make analyzing content from Twitter easy. This is helpful if you want to analyze, for instance, real-time reactions to events. Above we pulled content from Twitter, split it into words, and analyzed words by frequency while eliminating “uninteresting” words. Then we analyzed whether tweets were on the whole positive or negative using pre-made lexicons mapping words to positive or negative.</p>
<p>John Johnson</p>
<p>In this blog post, we use R with Twitter data to analyze topics of interest to Greenville, SC. We will describe obtaining, manipulating, and summarizing the data. Twitter is a “microblogging” service where users can, usually publicly, share links, pictures, or short comments (up to 140 characters) onto a timeline. The public timeline consists of all public tweets, but people can build their own private timelines to narrow content to just what they want to see. (They do this by “following” users.) Over the years, many companies, news organizations, and users have come to consider the social media site essential for sharing news and other information. (Or cat memes.) Twitter has some organizational tools such as replies/conversation threads, mentions (i.e. naming other users using the @ notation), and hashtags (naming a topic using the # notation). Twitter has encouraged the use of these organizational tools by automatically making mentions and hashtags clickable links. These tools can make for some interesting analysis. For instance, a game show may encourage viewers to vote on a winner using hashtags. On their end, they create a filter for a particular hashtag (e.g. #votemyplayer) and count votes. This also makes Twitter data ripe for text mining (which Twitter uses to identify trending topics).</p>
<h2 id="obtaining-the-twitter-data">Obtaining the Twitter data</h2>
<p>Twitter makes it possible for software to obtain Twitter comments without having to resort to “web-scraping” techniques (i.e. downloading the data as a web page and then parsing the HTML). Instead, you can go through an Application Programming Interface (API) to obtain the comments directly. If you’re interested, Twitter has a whole subdomain related to accessing their data, including documentation. There are a lot of technical details, but for the casual user probably the only ones of interest are the API key and rate limits. This post won’t fuss with rate limits, but more serious work may require further understanding of these issues. However, you will need to create an API key. Follow these instructions, which are tailored for R users. It essentially consists of creating a token at Twitter’s app web site and running an R function with the token. I set the variables <code class="highlighter-rouge">consumer_secret</code>, <code class="highlighter-rouge">consumer_key</code>, <code class="highlighter-rouge">access_token</code>, and <code class="highlighter-rouge">access_secret</code> in an R block by copying and pasting from the Twitter apps site, not echoed in this blog post for obvious reasons.</p>
<h1 id="plotting-geojson-polygons-on-a-map-with-r">Plotting GeoJSON polygons on a map with R</h1>
<p>2016-12-16, https://randomjohn.github.io/r-geojson-gardens</p>
<p>In a <a href="2016-12-11-r-geojson-srt.html">previous post</a> we plotted some points, retrieved from a public dataset in GeoJSON format, on top of a Google Map of the area surrounding Greenville, SC. In this post we plot some public data in GeoJSON format as well, but instead of particular points, we plot polygons. Polygons describe an area rather than a single point. As before, to set up we do the following:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">rgdal</span><span class="p">)</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="n">require</span><span class="p">(</span><span class="n">geojsonio</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"geojsonio"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">geojsonio</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">sp</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">maps</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggmap</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">maptools</span><span class="p">)</span></code></pre></figure>
<h2 id="getting-the-data">Getting the data</h2>
<p>The data we are going to analyze consists of the city parks in Greenville, SC. Though this data is located in an ArcGIS system, there is a <a href="https://data.openupstate.org/maps/city-parks/parks.php">GeoJSON version</a> at <a href="http://data.openupstate.org">OpenUpstate</a>.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">data_url</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"https://data.openupstate.org/maps/city-parks/parks.php"</span><span class="w">
</span><span class="n">data_file</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"parks.geojson"</span><span class="w">
</span><span class="c1"># for some reason, I can't read from the url directly, though the tutorial
# says I can
</span><span class="n">download.file</span><span class="p">(</span><span class="n">data_url</span><span class="p">,</span><span class="w"> </span><span class="n">data_file</span><span class="p">)</span><span class="w">
</span><span class="n">data_park</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">geojson_read</span><span class="p">(</span><span class="n">data_file</span><span class="p">,</span><span class="w"> </span><span class="n">what</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"sp"</span><span class="p">)</span></code></pre></figure>
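<p>It can be worth sanity-checking what came back before plotting. With <code class="highlighter-rouge">what = "sp"</code>, <code class="highlighter-rouge">geojson_read</code> should return a spatial object from the <code class="highlighter-rouge">sp</code> package:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">class(data_park)   # expect "SpatialPolygonsDataFrame" for polygon data
length(data_park)  # number of parks (polygons) in the file
names(data_park)   # attribute columns attached to each park</code></pre></figure>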
<h2 id="analyzing-the-data">Analyzing the data</h2>
<p>First, we plot the data as before:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">data_park</span><span class="p">)</span></code></pre></figure>
<p><img src="/figures//2016-12-16-r-geojson-gardens.Rmdunnamed-chunk-2-1.png" alt="plot of chunk unnamed-chunk-2" /></p>
<p>While this was easy to do, it doesn’t give very much context. However, it does give the boundaries of the different parks. As before, we use the <code class="highlighter-rouge">ggmap</code> and <code class="highlighter-rouge">ggplot2</code> packages to give us some context. First, we download the appropriate map from Google.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">mapImage</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggmap</span><span class="p">(</span><span class="n">get_googlemap</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="n">lon</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">-82.394012</span><span class="p">,</span><span class="w"> </span><span class="n">lat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">34.852619</span><span class="p">),</span><span class="w"> </span><span class="n">scale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w">
</span><span class="n">zoom</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">11</span><span class="p">),</span><span class="w"> </span><span class="n">extent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"normal"</span><span class="p">)</span></code></pre></figure>
<p>I got the latitude and longitude by looking them up on Google, and then hand-tuned the scale and zoom.</p>
<p>A note of warning: if you do this with a recent version of <code class="highlighter-rouge">ggmap</code> and <code class="highlighter-rouge">ggplot2</code>, you may need to download the GitHub versions. See this <a href="http://stackoverflow.com/questions/40642850/ggmap-error-geomrasterann-was-built-with-an-incompatible-version-of-ggproto/40644348">Stackoverflow thread</a> for details.</p>
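<p>A sketch of how that install might look (the repository names are assumptions based on where the packages lived at the time; check the thread above for current advice):</p>
<figure class="highlight"><pre><code class="language-r">if (!require(devtools)) install.packages("devtools")
devtools::install_github("dkahle/ggmap")
devtools::install_github("hadley/ggplot2")</code></pre></figure>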
<p>Now, we prepare our spatial object for plotting. This is a more involved process than before, and requires the <code class="highlighter-rouge">fortify</code> command from the <code class="highlighter-rouge">ggplot2</code> package to make sure everything ends up in the right format:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">data_park_df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">fortify</span><span class="p">(</span><span class="n">data_park</span><span class="p">)</span></code></pre></figure>
<p>Now we can make the plot:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">print</span><span class="p">(</span><span class="n">mapImage</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_polygon</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">long</span><span class="p">,</span><span class="w"> </span><span class="n">lat</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">group</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data_park_df</span><span class="p">,</span><span class="w">
</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"green"</span><span class="p">))</span></code></pre></figure>
<p><img src="/figures//2016-12-16-r-geojson-gardens.Rmdunnamed-chunk-5-1.png" alt="plot of chunk unnamed-chunk-5" /></p>
<p>Note the use of the <code class="highlighter-rouge">group=</code> option in the <code class="highlighter-rouge">geom_polygon</code> function above. This tells <code class="highlighter-rouge">geom_polygon</code> that there are many polygons rather than just one. Without that option, you get a big mess:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">print</span><span class="p">(</span><span class="n">mapImage</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_polygon</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">long</span><span class="p">,</span><span class="w"> </span><span class="n">lat</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data_park_df</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"green"</span><span class="p">))</span></code></pre></figure>
<p><img src="/figures//2016-12-16-r-geojson-gardens.Rmdunnamed-chunk-6-1.png" alt="plot of chunk unnamed-chunk-6" /></p>
<h2 id="mashup-of-parking-convenient-to-swamp-rabbit-trail-and-city-parks">Mashup of parking convenient to Swamp Rabbit Trail and city parks</h2>
<p>Now, say you want to combine the city parks data with the parking places convenient to Swamp Rabbit Trail that was the subject of the last post. That is very easy using the <code class="highlighter-rouge">ggplot2</code> package. We get the data and manipulate it as last time:</p>
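<p>That code chunk isn’t echoed in this post; for reference, here is a recap of those steps as they appeared last time (a sketch, assuming the same URL, file, and object names as before):</p>
<figure class="highlight"><pre><code class="language-r">data_url &lt;- "https://data.openupstate.org/maps/swamp-rabbit-trail/parking/geojson.php"
data_file &lt;- "srt_parking.geojson"
download.file(data_url, data_file)
data_json &lt;- geojson_read(data_file, what = "sp")
data_df &lt;- as.data.frame(data_json)
names(data_df)[4:5] &lt;- c("lon", "lat")  # rename the coordinate columns</code></pre></figure>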
<p>Next, we use the layering feature of <code class="highlighter-rouge">ggplot2</code> to draw the map:</p>
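<p>The layering itself is just a sum of the two <code class="highlighter-rouge">geom</code> calls shown earlier, one per data source (a sketch assuming the object names from the previous chunks):</p>
<figure class="highlight"><pre><code class="language-r">print(mapImage +
  geom_polygon(aes(long, lat, group = group), data = data_park_df, colour = "green") +
  geom_point(aes(lon, lat), data = data_df))</code></pre></figure>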
<p><img src="/figures//2016-12-16-r-geojson-gardens.Rmdunnamed-chunk-8-1.png" alt="plot of chunk unnamed-chunk-8" /></p>
<h2 id="conclusions">Conclusions</h2>
<p>We continue to explore public geographical data by examining data that represents areas in addition to points, and we layer data from two sources onto a single map.</p>
John Johnson
<p>In a previous post we plotted some points, retrieved from a public dataset in GeoJSON format, on top of a Google Map of the area surrounding Greenville, SC. In this post we again plot public data in GeoJSON format, but instead of individual points we plot polygons, which describe areas rather than single points. As before, to set up we do the following:</p>
Plotting GeoJSON data on a map with R
2016-12-11T00:00:00+00:00
https://randomjohn.github.io/r-geojson-srt
<p>GeoJSON is a standard text-based data format for encoding geographical information, which relies on the JSON (JavaScript Object Notation) standard. There are a number of public datasets for Greenville, SC that use this format, and the <a href="http://www.r-project.org">R</a> programming language makes working with them easy. Install the <a href="https://ropensci.org/tutorials/geojsonio_tutorial.html">geojsonio</a> library, which is part of the <a href="https://ropensci.org">ROpenSci</a> family of packages.</p>
<p>In this post we plot some public data in GeoJSON format on top of a retrieved Google Map. To set up we do the following:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">rgdal</span><span class="p">)</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="n">require</span><span class="p">(</span><span class="n">geojsonio</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"geojsonio"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">geojsonio</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">sp</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">maps</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggmap</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">maptools</span><span class="p">)</span></code></pre></figure>
<p>I wrapped the <code class="highlighter-rouge">geojsonio</code> load in a <code class="highlighter-rouge">require</code> call because the package may not be installed on your system. The <code class="highlighter-rouge">geojsonio</code> package takes most of the work out of dealing with GeoJSON data, allowing you to concentrate on your analysis rather than on data manipulation. There is still some data manipulation to be done, as seen below, but it’s fairly lightweight.</p>
<h2 id="getting-the-data">Getting the data</h2>
<p>The data we are going to analyze consists of the convenient parking locations for access to the Swamp Rabbit Trail running between Greenville, SC and Traveler’s Rest, SC. Though this data is located in an ArcGIS system, there is a <a href="https://data.openupstate.org/maps/swamp-rabbit-trail/parking/geojson.php">GeoJSON version</a> at <a href="http://data.openupstate.org">OpenUpstate</a>.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">data_url</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"https://data.openupstate.org/maps/swamp-rabbit-trail/parking/geojson.php"</span><span class="w">
</span><span class="n">data_file</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"srt_parking.geojson"</span><span class="w">
</span><span class="c1"># for some reason, I can't read from the url directly, though the tutorial
# says I can
</span><span class="n">download.file</span><span class="p">(</span><span class="n">data_url</span><span class="p">,</span><span class="w"> </span><span class="n">data_file</span><span class="p">)</span><span class="w">
</span><span class="n">data_json</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">geojson_read</span><span class="p">(</span><span class="n">data_file</span><span class="p">,</span><span class="w"> </span><span class="n">what</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"sp"</span><span class="p">)</span></code></pre></figure>
<p>Theoretically, you can use <code class="highlighter-rouge">geojson_read</code> to get the data from the URL directly; however, this failed for me. I’m not sure why the two-step process with <code class="highlighter-rouge">download.file</code> followed by <code class="highlighter-rouge">geojson_read</code> works when the direct call doesn’t, but it is probably a good idea to download your data first in most cases anyway. The <code class="highlighter-rouge">what="sp"</code> option tells <code class="highlighter-rouge">geojson_read</code> to return the data as a spatial object. Once the data is in a spatial object, we can analyze it however we wish and forget about the original data format.</p>
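<p>If you’d still like to try the URL first, a hedged pattern is to fall back to the two-step download only when the direct read fails:</p>
<figure class="highlight"><pre><code class="language-r">data_json &lt;- tryCatch(
  geojson_read(data_url, what = "sp"),  # direct read; this failed for me
  error = function(e) {
    download.file(data_url, data_file)  # fall back to a local copy
    geojson_read(data_file, what = "sp")
  }
)</code></pre></figure>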
<h2 id="analyzing-the-data">Analyzing the data</h2>
<p>The first thing you can do is plot the data, and the <code class="highlighter-rouge">plot</code> command makes that easy. Behind the scenes, <code class="highlighter-rouge">plot</code> detects that it is dealing with a spatial object and dispatches to the plot method from the <code class="highlighter-rouge">sp</code> package, but we just issue a simple command:</p>
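<p>You can see this dispatch for yourself by inspecting the object’s class (the exact class name depends on the geometry in the file, so treat this as a sketch):</p>
<figure class="highlight"><pre><code class="language-r">class(data_json)                 # e.g. "SpatialPointsDataFrame" for point data
inherits(data_json, "Spatial")   # TRUE, so plot() uses sp's plot method</code></pre></figure>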
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">data_json</span><span class="p">)</span></code></pre></figure>
<p><img src="/figures//2016-12-11-r-geojson-srt.Rmdunnamed-chunk-2-1.png" alt="plot of chunk unnamed-chunk-2" /></p>
<p>Unfortunately, this plot is not very helpful because it simply plots the points without any context. So we use the <code class="highlighter-rouge">ggmap</code> and <code class="highlighter-rouge">ggplot2</code> packages to give us some context. First, we download the right map from Google.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">mapImage</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggmap</span><span class="p">(</span><span class="n">get_googlemap</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="n">lon</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">-82.394012</span><span class="p">,</span><span class="w"> </span><span class="n">lat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">34.852619</span><span class="p">),</span><span class="w"> </span><span class="n">scale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w">
</span><span class="n">zoom</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">11</span><span class="p">),</span><span class="w"> </span><span class="n">extent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"normal"</span><span class="p">)</span></code></pre></figure>
<p>I got the latitude and longitude by looking them up on Google, and then hand-tuned the scale and zoom.</p>
<p>A note of warning: if you do this with a recent version of <code class="highlighter-rouge">ggmap</code> and <code class="highlighter-rouge">ggplot2</code>, you may need to download the GitHub versions. See this <a href="http://stackoverflow.com/questions/40642850/ggmap-error-geomrasterann-was-built-with-an-incompatible-version-of-ggproto/40644348">Stackoverflow thread</a> for details.</p>
<p>Now, we prepare our spatial object for plotting:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">data_df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">data_json</span><span class="p">)</span><span class="w">
</span><span class="nf">names</span><span class="p">(</span><span class="n">data_df</span><span class="p">)[</span><span class="m">4</span><span class="o">:</span><span class="m">5</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"lon"</span><span class="p">,</span><span class="w"> </span><span class="s2">"lat"</span><span class="p">)</span></code></pre></figure>
<p>There’s really no output from this. I suppose the renaming step isn’t necessary, but I believe in descriptive labels.</p>
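<p>If you’d rather not depend on column positions, you can rename by name instead; this assumes the coordinate columns come back as <code class="highlighter-rouge">coords.x1</code> and <code class="highlighter-rouge">coords.x2</code>, which is what <code class="highlighter-rouge">as.data.frame</code> typically produces for a spatial points object:</p>
<figure class="highlight"><pre><code class="language-r">names(data_df)[names(data_df) == "coords.x1"] &lt;- "lon"
names(data_df)[names(data_df) == "coords.x2"] &lt;- "lat"</code></pre></figure>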
<p>Now we can make the plot:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">print</span><span class="p">(</span><span class="n">mapImage</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_point</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">lon</span><span class="p">,</span><span class="w"> </span><span class="n">lat</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data_df</span><span class="p">))</span></code></pre></figure>
<p><img src="/figures//2016-12-11-r-geojson-srt.Rmdunnamed-chunk-5-1.png" alt="plot of chunk unnamed-chunk-5" /></p>
<p>It may be helpful to add labels based on the name of the location, given in the ‘title’ field:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">mapImage</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_point</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">lon</span><span class="p">,</span><span class="w"> </span><span class="n">lat</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data_df</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_text</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">lon</span><span class="p">,</span><span class="w"> </span><span class="n">lat</span><span class="p">,</span><span class="w">
</span><span class="n">label</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="n">hjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">vjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data_df</span><span class="p">,</span><span class="w"> </span><span class="n">check_overlap</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre></figure>
<p><img src="/figures//2016-12-11-r-geojson-srt.Rmdunnamed-chunk-6-1.png" alt="plot of chunk unnamed-chunk-6" /></p>
<p>Here, I use <code class="highlighter-rouge">geom_text</code> to make the labels; I tweaked the options by hand using the help page.</p>
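<p>If <code class="highlighter-rouge">check_overlap = TRUE</code> drops too many labels, the <code class="highlighter-rouge">ggrepel</code> package (a separate install, so this is only a sketch) nudges overlapping labels apart instead of discarding them:</p>
<figure class="highlight"><pre><code class="language-r">if (!require(ggrepel)) {
  install.packages("ggrepel")
  library(ggrepel)
}
mapImage + geom_point(aes(lon, lat), data = data_df) +
  geom_text_repel(aes(lon, lat, label = title), data = data_df)</code></pre></figure>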
<h2 id="conclusions">Conclusions</h2>
<p>GeoJSON data is becoming more popular, especially in public data. The <code class="highlighter-rouge">geojsonio</code> package makes working with such data trivial. Once the data is in a spatial data format, R’s wide variety of spatial data tools are available.</p>John Johnson