Personal data collection and analysis

Motivation behind this example

I was diagnosed with sleep apnea last year, and have to use a continuous positive airway pressure (CPAP) machine to sleep well enough to feel alert during the day. The machine uploads data (via cellular connection) to a website that will give me results for the last two weeks. This data includes both usage (time of usage, air leakage, number of times mask was put on/taken off), and results (apnea-hypopnea index, which is an average of the number of times per hour that slow or no breathing occurred for at least 10 seconds). The website only displays results from the last two weeks, and I’d like to eventually do a long-term analysis. I’d also like to have things displayed my own way, because, well, I’m like that.

I could enter this information in a spreadsheet, and for import into R or other statistical software that might be the sensible thing to do. However, by having this data in context of other diary entries and text surround it I get to see this data in context of other things going on in my life. This information does not exist in a vacuum, and is important context for other things. For instance, if I’m dealing with a particularly stressful situation, it would be nice to go back and see how I dealt with that in the context of how my sleeping is going (and vice versa - does the apnea get better or worse during that time?). Another issue is that I’m dealing with migraines, and I’d like to know something about the frequency and severity in the context of sleep.

Methodology for data collection

This personal data collection exercise uses an excellent piece of software specifically for journaling called The Journal. I’ve been using The Journal since 2007 to record events and just simply jog my memory of goings on in my life. The software has a few nifty features that dovetail nicely with data collection.

Daily entries

The Journal splits writing up into categories. Categories can be either loose-leaf (where entries can be organized hierarchically any way you want) and daily (where entries are organized by the date of entry). If you set it up a certain way, you can have the Journal lock entries on every day except for the day you are working on. It can also automatically create an entry for the day you are working on. Very handy for just daily jouraling in general.


Topics are tags for specific pieces of text or entries. If you select a piece of text and tag it with a topic (say, CPAP), you can extract that piece of text later. Couple this with the Search by Topic command, and you can extract all text tagged with a certain topic into one document and save a single document with all text from that topic. So, for example, I will tag all my CPAP writings with the CPAP topic, and later on save a text file with what I have written about CPAP therapy (in this case, the data I collected).


The Journal has a sophisticated template system that can insert not only the same text over and over, but tag it automatically with a certain topic and even fill in certain data such as the current date and time. I use the template feature to create some structured text (a data entry form of sorts) and tag the whole piece of inserted text with the CPAP topic. That way, I don’t have to bother with selecting and tagging manually. I can simply insert the text and fill in the numbers when I read the website.

The template looks like this:

Sleep numbers for <ENTRYDATE format=“mm/dd/yyyy”/>
* Usage: 
* Leakage: L/min
* AHI: events/h
* Mask on/off: 
* MyAir score: 
* Comments: 

Because the text follows the same structure for all such entries, it is easy to write R code to pull out the data and make a data.frame.

What you don’t see (and is hard to show here) is that in the template itself I selected all of the text and tagged it CPAP. That way, my CPAP entries will always be tagged, and I can easily extract them later.

Methodology for analysis

Data extraction

The first part of data extraction is in The Journal. I use a saved search from the Search Entries by Topic function, then click View All Result Entries to see the text I had entered. The result is a screen showing the last 100 pieces of text I tagged CPAP (which may include other pieces of text if I felt the need to write on the topic). I can change this with an option. Clicking Save to File will allow me to save to a Journal file, and RTF, or a TXT file. I save the result to a TXT file so that I can easily read it in R. The text file contains only the data I entered for the CPAP machine, as well as any other text I tagged (which is fairly uncommon).

Data import

This is where I pay the price for putting the data in a diary rather than a tabular format. I use readlines.

Read More

Inauguration speeches

Acquiring inauguration speeches

Though not about Greenville especially, it might be interesting to quantitatively analyze inauguration speeches. This analysis will be done using two paradigms: the tm package and the tidytext package. We will read the speeches in such a way that we use the tidytext package; later on we will use some tools from that package to make analyses traditionally done by tm.

I looked around for inauguration speeches, and finally found them at They are in a format more for human consumption, but with the use of the rvest (harvest?) package, we can read them in relatively easily. However, we need to do a mapping from speech IDs to speakers (newly inaugurated presidents), which is a little ugly and tedious.

Read More

Greenville on Twitter

In this blogpost, we use R to use Twitter data to analyze topics of interest to Greenville, SC. We will describe obtaining, manipulating, and summarizing the data.

Twitter is a “microblogging” service where users can, usually publicly, share links, pictures, or short comments (up to 140 characters) onto a timeline. The public timeline consists of all public tweets, but people can build their own private timelines to narrow content to just what they want to see. (They do this by “following” users.) Over the years, many companies, news organizations, and users have considered the social media site essential for sharing news and other information. (Or cat memes.) Twitter has some organizational tools such as replies/conversation threads, mentions (i.e. naming other users using the @ notation), and hashtags (naming a topic using # notation). Twitter has encouraged the use of these organizational tools by automatically making mentions and hashtags clickable links.

These organizational tools can make for some interesting analysis. For instance, a game show may encourage viewers to vote on a winner using hashtags. On their end, they create a filter for a particular hashtag (e.g. #votemyplayer) and count votes. This also makes Twitter data ripe for text mining (which they use to identify trending topics).

Obtaining the Twitter data

Twitter makes it possible for software to obtain Twitter comments without having to resort to “web-scraping” techniques (i.e. downloading the data as a web page and then parsing the HTML). Instead, you can go through an Application Programming Interface (API) to obtain the comments directly. If you’re interested, Twitter has a whole subdomain related to accessing their data, including documentation. There are a lot of technical details, but for the casual user probably the only ones of interest are API key and rate limits. This post won’t fuss with rate limits, but more serious work may require some further understanding of these issues. However, you will need to create an API key. Follow these instructions, which are tailored for R users. It essentially consists of creating a token at Twitter’s app web site and running an R function with the token. I set variables consumer_secret, consumer_key, access_token, and access_secret in an R block just copying and pasting from the Twitter apps site, not echoed in this blog post for obvious reasons.

Read More

Plotting GeoJSON polygons on a map with R

In a previous post we plotted some points, retrieved from a public dataset in GeoJSON format, on top of a Google Map of the area surrounding Greenville, SC. In this post we plot some public data in GeoJSON format as well, but instead of particular points, we plot polygons. Polygons describe an area rather than a single point. As before, to set up we do the following:

Read More

Plotting GeoJSON data on a map with R

GeoJSON is a standard text-based data format for encoding geographical information, which relies on the JSON (Javascript object notation) standard. There are a number of public datasets for Greenville, SC that use this format, and, the R programming language makes working with these data easy. Install the rgeojson library, which is part of the ROpenSci family of packages.

In this post we plot some public data in GeoJSON format on top of a retrieved Google Map. To set up we do the following:

Read More