In this session, we will introduce basic techniques in data wrangling and visualization in R. Specifically, we will cover some basic tools using out-of-the-box R commands, then introduce the powerful framework of the tidyverse (both in wrangling and visualizing data), and finally gain some understanding of the philosophy of this framework to set up deeper exploration of our data. Throughout, we will be using a publicly available dataset of AirBnB listings, and in the classroom version, we will follow an R script and there will be additional exercises in between sections.

Base R Basics

(back to top)

This tutorial assumes you have RStudio installed, can load a CSV file into your current session, and are comfortable entering commands both from an R script or directly into the console.

Loading the data

Let’s load up the AirBnB data. Remember to set the working directory to the folder with the data in it (one easy way to do this is in the Files tab using the “More” drop-down menu). Then, in a fresh script (or following along in the class script), type and execute:

listings = read.csv('listings.csv')

Note we are using = as our assignment operator. Some R-ers use <- which does the same thing, and is just a different convention. This command may take a second or two (it’s a big file!) but notice now we have a variable listings in our Environment tab.
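To convince yourself the two operators behave the same, try this quick console check:

```r
x = 5     # assignment with =
y <- 5    # assignment with <- ; both create the same kind of binding
x == y    # TRUE (note: == is comparison, not assignment)
```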

Taking a look

Let’s check it out. The head command prints out the first parts of a vector, matrix, table, etc.

head(listings)

It looks like each column is a variable (like “reviews_per_month” or “price”) and each row corresponds to a different AirBnB listing. We can look in Environment and see there are actually 3,585 rows of 95 variables. Some other useful “recon”-ish commands are:

str(listings)       # display the structure of an object
summary(listings)   # give summary statistics
colnames(listings)  # display just column names

A few things to note:

  • There are different variable types: int (integer), logi (true/false), num (numeric), chr (character), Factor.
  • Factor tends to be anything R can’t categorize as one of the other types, and so it gives each unique value (string, number, whatever) its own “factor”. We can prevent R from converting string-like or non-number-y values into factors by modifying our csv command with read.csv(..., stringsAsFactors=FALSE). This usually keeps strings as strings.
  • Sometimes the variable type that R picks isn’t what we expect: check out any of the columns dealing with price (We’ll deal with this later).
  • We have a missing data problem: many columns have an “NA” count (we’ll deal with this later too).
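To see the difference stringsAsFactors makes, we could re-read the file and compare a column’s type (a quick sketch; assumes listings.csv is still in the working directory):

```r
# Hypothetical re-read that keeps strings as plain character vectors
listings.chr = read.csv('listings.csv', stringsAsFactors = FALSE)
class(listings$name)      # "factor" under the default settings
class(listings.chr$name)  # "character" with stringsAsFactors = FALSE
```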

But there is a lot to process here (95 variables!). Maybe we want to look at a specific row, or a specific group of columns. Try out a few of these:

listings[4,]       # row four
listings[,5]       # column five
listings[4,5]      # row four, column five
listings[5]        # also column five
listings["name"]   # also column five
listings$name      # also column five
listings[4,]$name  # column five for row four
listings[c(3,4),]$name  # column five for rows three and four
listings[c(3,4),c(5,6)] # columns five and six for rows three and four
listings[4,5:7]    # row 4, columns five through seven

Let’s try that summary command again, but on just a few columns…

summary(listings[c('square_feet', 'reviews_per_month')])
##   square_feet     reviews_per_month
##  Min.   :   0.0   Min.   : 0.010   
##  1st Qu.: 415.0   1st Qu.: 0.480   
##  Median : 825.0   Median : 1.170   
##  Mean   : 858.5   Mean   : 1.971   
##  3rd Qu.:1200.0   3rd Qu.: 2.720   
##  Max.   :2400.0   Max.   :19.150   
##  NA's   :3529     NA's   :756

You might have noticed we snuck in the c(...) notation to handle multiple indices, which creates a vector of values. Similar to the numeric/factor/character data types from before, which took a single value, there are several data types that are “array-like” and can hold multiple values. Some of them are:

  • Data frame. Our listings object is actually a data.frame, since this is the default object returned from the read.csv function. It is basically a table of values, where each column has a particular data type and can be indexed by name.
  • Vector. Ordered list of any data type. For example: my.vec = c(1, 3, 10) or my.vec2 = c('Ann', 'Bob', 'Sue').
  • List. Ordered collection whose elements can each be any data type (even other lists or vectors), for example my.list = list(1, 'Ann', c(1,3,10)).
  • Matrix. This is just a table of values, where everything is the same data type, and you cannot index by column.

We can usually convert from one data type to another by doing something like as.numeric() or as.matrix(), but we should always check that our conversion did what we expected. We’ll only use data frames and vectors in this session, but later we’ll also see an enhanced version of the data frame type.
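One classic pitfall worth checking for: calling as.numeric() directly on a factor returns the internal level codes, not the values. A tiny console demo:

```r
f = factor(c('10', '200', '3000'))
as.numeric(f)                 # 1 2 3 -- the level codes, not what we want
as.numeric(as.character(f))   # 10 200 3000 -- convert via character first
```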

Another common base R function that gets a lot of mileage is table (although we’ll introduce a more flexible alternative later). table provides a quick way to cross-tabulate counts of different variables. So in our dataset, if we want to see the count of how many listings are listed under each room type, we can just do

table(listings$room_type)
## Entire home/apt    Private room     Shared room 
##            2127            1378              80

And if we wanted to cross-tabulate this with the number the room accommodates, we can just add that in to the table command, like this:

table(listings$room_type, listings$accommodates)
##                     1   2   3   4   5   6   7   8   9  10  11  12  14  16
##   Entire home/apt  25 597 347 592 232 201  38  52  10  19   4   5   3   2
##   Private room    369 855  79  56  13   2   1   3   0   0   0   0   0   0
##   Shared room      45  31   2   2   0   0   0   0   0   0   0   0   0   0

We can even make one of the arguments a “conditional,” meaning a statement that can be answered by “true” or “false”, like the count of rooms by type that accommodate at least 4 people:

table(listings$room_type, listings$accommodates >= 4)
##                   FALSE TRUE
##   Entire home/apt   969 1158
##   Private room     1303   75
##   Shared room        78    2

We’ll learn some cleaner (and hopefully more intuitive) ways to select and filter and summarize the data like this later. But for now, let’s try visualizing some of it.

How about the distribution of daily rates/prices?

We want to run something like hist(listings$price), but this gives an error: “price is not numeric”. (Try it!) Why?

str(listings$price)        # notice it says "Factor w/ 324 Levels"

Like we mentioned earlier, when R loads a file into a data table, it automatically converts each column into what it thinks is the right type of data. For numbers, it converts it into “numeric”, and usually for strings (i.e. letters) it converts it into “factors” — each different string gets its own “factor.” The price column got converted into factors because the dollar signs and commas made R think it was strings. So each different price is its own different factor.

(We would still have a similar problem even if we used stringsAsFactors=FALSE when we loaded the CSV, just instead of factors, the prices would all be strings, i.e. of type chr, but still not a number.)

Let’s make a new variable that will have the numeric version of price in it:

listings$nprice = as.numeric(gsub('\\$|,', '', listings$price))

This command says: in the price column, substitute (gsub) anything matching '\\$|,' with nothing '', then convert everything to type numeric, then assign it to this new column called nprice. The pattern '\\$|,' is a small piece of regular-expression magic that means “the $ character or the , character,” and there’s no need to worry about it too much. (The \\ is an “escape” because $ otherwise has special meaning in regular expressions, and the | symbol means “or”.)
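To see the substitution by itself, try it on a single made-up price string in the console:

```r
gsub('\\$|,', '', '$1,234.56')              # "1234.56" -- dollar sign and comma stripped
as.numeric(gsub('\\$|,', '', '$1,234.56'))  # 1234.56 as an actual number
```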

Now let’s try again:

hist(listings$nprice)

Well that is a horrible figure, but at least it worked. Maybe a scatter plot of price vs. reviews?

plot(listings$review_scores_rating, listings$nprice)

That is the ugliest thing I have ever seen. But there does seem to be some upward trend happening between these variables, so that might be interesting? Before we start poking around much more, let’s rescue ourselves from the Base R trenches by introducing some better tools.

Base R Exercises

(back to top)

Exercise 1. Conditional statements. Earlier we did a table by looking at rooms that accommodated “at least 4” (>= 4). We can also look at “at most 4” (<= 4), “exactly 4” (== 4), or “anything but 4” (!= 4) people, and of course “strictly less than 4” (<) and “strictly more than 4” (>). We can also join conditional statements together by saying “at most 4 OR exactly 7” (accommodates <= 4 | accommodates == 7) where we used the OR operator |, or a similar statement using the AND operator &.

How could we do a table of listing counts by room type comparing how many are/are not in the Back Bay neighborhood?


table(listings$room_type, listings$neighbourhood == 'Back Bay')
##                   FALSE TRUE
##   Entire home/apt  1876  251
##   Private room     1339   39
##   Shared room        79    1

Exercise 2. The %in% operator. What if we wanted to check if the listing was in one of several neighborhoods, like the North End/West End/Beacon Hill strip? We can put the neighborhoods in a vector (or list) and check if the listing is %in% the vector, for example listings$neighbourhood %in% c('North End', 'West End', 'Beacon Hill').

How could we check the number of listings by room type that accommodate either 2, 4, or 7 AND have at least 2 bedrooms?


table(listings$room_type, listings$accommodates %in% c(2,4,7) & listings$bedrooms >= 2)
##                   FALSE TRUE
##   Entire home/apt  1738  379
##   Private room     1378    0
##   Shared room        80    0

(What happens if we keep passing table() more and more arguments, like table(..., listings$accommodates==2, listings$accommodates==4, ...) ?)

Exercise 3. Converting dates and times. We often have date/time information we need to use in our data, but which are notoriously tricky to handle: different formats, different time zones, … blech. R provides a data type (Date) to handle dates in a cleaner way. We can usually take our raw dates (like “2016-01-12”) and convert by doing as.Date(my.table$date_column, '%Y-%m-%d'). The second argument is a formatting string that tells as.Date how the raw input data is formatted. This example uses %Y (meaning 4-digit year), %m and %d (meaning 2-digit month and day). There are similar strings for other formats (see for example here).
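The format string just has to match the raw text; a couple of quick console examples:

```r
as.Date('2016-01-12', '%Y-%m-%d')   # ISO-style: 4-digit year, month, day
as.Date('01/12/2016', '%m/%d/%Y')   # US-style with slashes -- same date
```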

Try creating a new column in listings named “host_since_date” that has the “host_since” column in Date format.


listings$host_since_date = as.Date(listings$host_since, '%Y-%m-%d')

This allows us to treat dates like numbers, and R will do all the conversion and math “behind the scenes” for us. Use min(), max(), and mean() to find the earliest, last, and average date a host became a host. Or how about: how many days between the 3rd and 4th listings’ hosts dates?


min(listings$host_since_date, na.rm=TRUE)
## [1] "2008-11-11"
max(listings$host_since_date, na.rm=TRUE)
## [1] "2016-09-06"
mean(listings$host_since_date, na.rm=TRUE)
## [1] "2014-03-27"
listings[4,'host_since_date'] - listings[3,'host_since_date']
## Time difference of 1441 days

There is a ton more to learn here, if you are interested. Date can handle any format, including numeric formats (like Excel generates or UNIX time stamps), but sometimes the difficulty is something like handling dates that are formatted in different ways in the same column, or contain errors (“Marhc 27th”) …

Exercise 4. Text handling. We have seen the chr data type, which can be single characters or strings of characters. We can get substrings of a string using substr(); for example substr("I love R", start=1, stop=4) gives “I lo”. We can paste two strings together using paste(); for example paste("Hello", "there") gives “Hello there” (paste() inserts a space by default; use paste0() or the sep='' argument to join without one). We can substitute one string into another using sub(); for example sub("little", "big", "Mary had a little lamb") gives “Mary had a big lamb”. (We used gsub() earlier, which substitutes every occurrence, not just the first.)

Try creating a new column with the first 5 letters of the host name followed by the full name of the listing without spaces.


listings$host_list_name = paste(substr(listings$host_name,start=1,stop=5),
                                gsub(' ','',listings$name))

We are not going to cover escape characters, string formatting, or the more general topic of regular expressions (“regex”), but we have seen some of these topics already. When converting price to numeric, we used the string \\$|, to represent “any dollar sign OR comma”, which is an example of escape characters and regular expressions. When converting dates, we used strings like %Y to represent 4-digit year; this is an example of string formatting.

Introducing the Tidyverse

(back to top)

Hadley Wickham, a statistician and computer scientist, introduced a suite of packages to give an elegant, unified approach to handling data in R (check out the paper!). These data analysis tools, and the philosophy of data handling that goes with them, have become standard practice when using R.

The motivating observation is that data tidying and preparation consumes a majority of the data scientist’s time; exacerbating the problem is the fact that data cleaning is seen as lowly janitor work, and often skipped or done shoddily. If the data scientist is a chef, data preparation is keeping a clean kitchen, and we all tend to have dirty plates stacked to the ceiling.

The underlying concept is then to envision data wrangling in an idiomatic way (as a “grammar”), with a simple set of rules that can unify data structures and data handling everywhere. In this preliminary section, we will focus on this natural approach to data handling: data are the nouns, and actions are the verbs. We will then see how this directly nests with an elegant way to visualize that data, and later we will delve into tidy structures: of standardizing the way we represent the data itself.

Loading the libraries

First we need to load the packages. If you did the homework, you already have them installed, but if not (shame!) install them with: install.packages('tidyr') and install.packages('dplyr').

Okay, now we’ll load them into our current R session by calling:

library(tidyr)
library(dplyr)

Some basics

Let’s try doing some of the basic data recon that we were messing with before, but with tidyr and dplyr.

How about selecting a specific column, and looking at the first few rows:

head(select(listings, reviews_per_month))
##   reviews_per_month
## 1                NA
## 2              1.30
## 3              0.47
## 4              1.00
## 5              2.25
## 6              1.70

This is fine, but it’s a little awkward having to nest our code like that. Luckily, there is a nifty chaining (or “pipe”) operator, %>%, which comes from the magrittr package and is loaded along with dplyr; it serves like a pipeline from one function to another. Now we can instead do this:

listings %>% select(reviews_per_month) %>% head()
##   reviews_per_month
## 1                NA
## 2              1.30
## 3              0.47
## 4              1.00
## 5              2.25
## 6              1.70

which is much, much nicer. Notice that the chaining operator feeds in the object on its left as the first argument into the function on its right.

Now, let’s learn some more verbs. How about also selecting the name, and filtering out missing entries and low values?

listings %>% select(name, reviews_per_month) %>% 
  filter(!is.na(reviews_per_month), reviews_per_month > 12)
##                                         name reviews_per_month
## 1           One Private room @ Jamaica Plain             12.13
## 2            #3 Real close to the airport...             14.34
## 3         Only 7 minutes to downtown Boston.             15.54
## 4                  E1 Five mins from airport             15.00
## 5            Luxury Room Near Airport + City             12.73
## 6 Luxury Private Room with Organic Breakfast             12.95
## 7         Spacious 1 bedroom in East Boston.             19.15
## 8             E3 Convenient to Logan Airport             16.30
## 9                  E2 Steps from Maverick Sq             12.16

Amazing. It’s as if we are speaking to the console in plain English. The is.na() function returns TRUE if something is NA, so !is.na() (read: “not is NA”) returns the opposite.
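As a tiny illustration of how is.na() behaves on a vector, run this in the console:

```r
is.na(c(1, NA, 3))   # FALSE  TRUE FALSE
!is.na(c(1, NA, 3))  #  TRUE FALSE  TRUE
```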

How many of those NAs are there? Let’s count them:

listings %>% count(is.na(reviews_per_month))
## # A tibble: 2 × 2
##   `is.na(reviews_per_month)`     n
##                        <lgl> <int>
## 1                      FALSE  2829
## 2                       TRUE   756

Hmm. Does it have anything to do with just recent listings? Let’s do a table to summarize the number of reviews for an NA entry by showing the average number of reviews:

listings %>%
  filter(is.na(reviews_per_month)) %>%
  summarize(avg.reviews = mean(number_of_reviews))
##   avg.reviews
## 1           0

Ah, so these are just listings without any reviews yet. That’s not alarming. (Note to international students: summarise also works!)

Now, how about a summary statistic, like the average price for a listing?

Well, the first thing we need to do is make sure the price is in a numeric form. We already dealt with this before by creating a new column using the dollar-sign base R syntax. Let’s instead take a tidy R approach and mutate the listings data table by adding this new column right in our chain:

listings %>% 
  mutate(nprice = as.numeric(gsub('\\$|,', '', price))) %>%
  summarize(avg.price = mean(nprice))
##   avg.price
## 1  173.9258

This approach has several advantages over the base R way. One advantage is we can use the column temporarily, as part of our chain, without affecting the data table that we have loaded into memory. We can even overwrite the original column if we want to keep the same name. Another advantage is that we can easily convert/add multiple columns at once, like this:

listings %>%
  mutate(price = as.numeric(gsub('\\$|,', '', price)),
         weekly = as.numeric(gsub('\\$|,', '', weekly_price)),
         monthly = as.numeric(gsub('\\$|,', '', monthly_price))) %>%
  summarize(avg.price = mean(price),
            avg.weekly = mean(weekly, na.rm=TRUE),
            avg.monthly = mean(monthly, na.rm=TRUE))
##   avg.price avg.weekly avg.monthly
## 1  173.9258   922.3924    3692.098

Here we used the argument na.rm=TRUE in mean, which just removes any NA values from the mean computation — we could have also chained another filter command with the same result.
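For instance, the na.rm=TRUE version and the filter version give the same average (a sketch reusing the weekly price conversion from above):

```r
# Same result as mean(weekly, na.rm=TRUE): drop the NA rows first, then average
listings %>%
  mutate(weekly = as.numeric(gsub('\\$|,', '', weekly_price))) %>%
  filter(!is.na(weekly)) %>%
  summarize(avg.weekly = mean(weekly))
```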

Another advantage is we can create a new column, and then use those new values immediately in another column! Let’s create a new column that is the “weekly price per day” called weekly_price_per by dividing the weekly price by 7. Then let’s use that number and the daily price rate to compute the difference between the two (i.e. the discount by taking the weekly rate). Then we’ll look at the average of this discount across all listings.

listings %>%
  mutate(price = as.numeric(gsub('\\$|,', '', price)),
         weekly = as.numeric(gsub('\\$|,', '', weekly_price)),
         weekly_price_per = weekly / 7,
         weekly_discount = price - weekly_price_per) %>%
  summarize(avg_discount = mean(weekly_discount, na.rm=T))
##   avg_discount
## 1     19.03908

Average discount per day for booking by the week: about 20 bucks!

Let’s take a deeper look at prices, and we can make our lives easier by just overwriting that price column with the numeric version and saving it back into our listings data frame:

listings = listings %>% mutate(price = as.numeric(gsub('\\$|,', '', price)))

Now — what if we want to look at mean price, and group_by neighborhood?

listings %>% 
  group_by(neighbourhood_cleansed) %>%
  summarize(avg.price = mean(price))
## # A tibble: 25 × 2
##    neighbourhood_cleansed avg.price
##                    <fctr>     <dbl>
## 1                 Allston 112.30769
## 2                Back Bay 240.95033
## 3             Bay Village 266.83333
## 4             Beacon Hill 224.44330
## 5                Brighton 118.76757
## 6             Charlestown 198.04505
## 7               Chinatown 232.35211
## 8              Dorchester  91.63941
## 9                Downtown 236.45930
## 10            East Boston 119.15333
## # ... with 15 more rows

Maybe we’re a little worried these averages are skewed by a few outlier listings. Let’s try

listings %>%
  group_by(neighbourhood_cleansed) %>%
  summarize(avg.price = mean(price),
            med.price = median(price),
            num = n())
## # A tibble: 25 × 4
##    neighbourhood_cleansed avg.price med.price   num
##                    <fctr>     <dbl>     <dbl> <int>
## 1                 Allston 112.30769      85.0   260
## 2                Back Bay 240.95033     209.0   302
## 3             Bay Village 266.83333     206.5    24
## 4             Beacon Hill 224.44330     195.0   194
## 5                Brighton 118.76757      90.0   185
## 6             Charlestown 198.04505     180.0   111
## 7               Chinatown 232.35211     219.0    71
## 8              Dorchester  91.63941      72.0   269
## 9                Downtown 236.45930     225.0   172
## 10            East Boston 119.15333      99.0   150
## # ... with 15 more rows

The n() function here just gives a count of how many rows we have in each group. Nothing too crazy, but we do notice some red flags to our “mean” approach.

  • First, if there are a very small number of listings in a neighborhood compared to the rest of the dataset, we may worry we don’t have a representative sample, or that this data point should be discredited somehow (on the other hand, maybe it’s just a small neighborhood, like Bay Village, and it’s actually outperforming expectation).

  • Second, if the median is very different than the mean for a particular neighborhood, it indicates that we have outliers skewing the average. Because of those outliers, as a rule of thumb, means tend to be a misleading statistic to use with things like rent prices or incomes.
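That rule of thumb is easy to see with a toy vector:

```r
x = c(80, 90, 100, 110, 5000)  # four typical prices plus one outlier
mean(x)    # 1076 -- pulled way up by the single outlier
median(x)  # 100  -- barely notices it
```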

One thing we can do is just filter out any neighborhood below a threshold count:

listings %>%
  group_by(neighbourhood_cleansed) %>%
  summarize(avg.price = mean(price),
            med.price = median(price),
            num = n()) %>%
  filter(num > 200)
## # A tibble: 6 × 4
##   neighbourhood_cleansed avg.price med.price   num
##                   <fctr>     <dbl>     <dbl> <int>
## 1                Allston 112.30769        85   260
## 2               Back Bay 240.95033       209   302
## 3             Dorchester  91.63941        72   269
## 4                 Fenway 220.39310       191   290
## 5          Jamaica Plain 138.47813       100   343
## 6              South End 204.34969       180   326

We can also arrange this info (sort it) by the hopefully more meaningful median price:

listings %>%
  group_by(neighbourhood_cleansed) %>%
  summarize(avg.price = mean(price),
            med.price = median(price),
            num = n()) %>%
  filter(num > 200) %>%
  arrange(med.price)
## # A tibble: 6 × 4
##   neighbourhood_cleansed avg.price med.price   num
##                   <fctr>     <dbl>     <dbl> <int>
## 1             Dorchester  91.63941        72   269
## 2                Allston 112.30769        85   260
## 3          Jamaica Plain 138.47813       100   343
## 4              South End 204.34969       180   326
## 5                 Fenway 220.39310       191   290
## 6               Back Bay 240.95033       209   302

(Descending order would just be arrange(desc(med.price)).) We can also pick a few neighborhoods to look at by using the %in% keyword in a filter command with a list of the neighborhoods we want:

listings %>%
  filter(neighbourhood_cleansed %in% c('Downtown', 'Back Bay', 'Chinatown')) %>%
  group_by(neighbourhood_cleansed) %>%
  summarize(avg.price = mean(price),
            med.price = median(price),
            num = n())
## # A tibble: 3 × 4
##   neighbourhood_cleansed avg.price med.price   num
##                   <fctr>     <dbl>     <dbl> <int>
## 1               Back Bay  240.9503       209   302
## 2              Chinatown  232.3521       219    71
## 3               Downtown  236.4593       225   172

We have now seen: select, filter, count, summarize, mutate, group_by, and arrange. This is the majority of the dplyr “verbs” for operating on a single data table (although there are many more), but as you can see, learning new verbs is pretty intuitive. What we have already gives us enough tools to accomplish a large swath of data analysis tasks.

But … we’d really like to visualize some of this data, not just scan summary tables. Next up, ggplot.

Tidyverse Exercises

(back to top)

We’ll now introduce a few new tricks for some of the dplyr verbs we covered earlier, but this is by no means a comprehensive treatment.

Exercise 1. More with select. In addition to selecting columns, select is useful for temporarily renaming columns. We simply do an assignment, for example select('New colname'=old_col_name). This is helpful for display purposes when our column names are hideous. Try generating the summary table of median price by room type but assigning some nicer column labels.


listings %>%
  mutate(price = as.numeric(gsub('\\$|,','',price))) %>%
  group_by(room_type) %>%
  summarize(med = median(price)) %>%
  select('Room type'=room_type, 'Median price'=med)
## # A tibble: 3 × 2
##       `Room type` `Median price`
##            <fctr>          <dbl>
## 1 Entire home/apt            199
## 2    Private room             80
## 3     Shared room             60

Another useful trick with select (and other functions in R) is to include all but a column by using the minus - sign before the excluded column. For example listings %>% select(-id) selects every column except the listing ID.
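For example, to drop a couple of bookkeeping columns at once (column names assumed from this dataset):

```r
# Everything except the id and scrape_id columns
listings %>% select(-id, -scrape_id) %>% head()
```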

Exercise 2. More with group_by. We can group by multiple columns, and dplyr will start cross-tabulating the information within each group. For example, let’s say we want the count of listings by room type and accommodation, we could do

listings %>% group_by(room_type, accommodates) %>% count()
## Source: local data frame [26 x 3]
## Groups: room_type [?]
##          room_type accommodates     n
##             <fctr>        <int> <int>
## 1  Entire home/apt            1    25
## 2  Entire home/apt            2   597
## 3  Entire home/apt            3   347
## 4  Entire home/apt            4   592
## 5  Entire home/apt            5   232
## 6  Entire home/apt            6   201
## 7  Entire home/apt            7    38
## 8  Entire home/apt            8    52
## 9  Entire home/apt            9    10
## 10 Entire home/apt           10    19
## # ... with 16 more rows

This is the same information we got earlier using a table command (although in an interestingly longer format, which we will talk about later). Try finding the median daily price of a listing, grouped by number of bedrooms and number of bathrooms:


listings %>%
  mutate(price = as.numeric(gsub('\\$|,','',price))) %>%
  group_by(bedrooms, bathrooms) %>%
  summarize(med = median(price))
## Source: local data frame [42 x 3]
## Groups: bedrooms [?]
##    bedrooms bathrooms   med
##       <int>     <dbl> <dbl>
## 1         0       0.0    60
## 2         0       1.0   150
## 3         0       1.5   200
## 4         0       3.5   450
## 5         1       0.0    95
## 6         1       0.5    52
## 7         1       1.0   119
## 8         1       1.5    75
## 9         1       2.0    75
## 10        1       2.5    68
## # ... with 32 more rows

Exercise 3. More with mutate. The code block earlier with multiple mutation commands got a little repetitive, and we are lazy. We would rather have a verb so we can select some columns, and apply some function to mutate_all of them:

listings %>%
  select(price, weekly_price, monthly_price) %>%
  mutate_all(funs(numversion = as.numeric(gsub('\\$|,', '', .)))) %>%
  head()
##   price weekly_price monthly_price price_numversion
## 1   250                                         250
## 2    65      $400.00                             65
## 3    65      $395.00     $1,350.00               65
## 4    75                                          75
## 5    79                                          79
## 6    75                                          75
##   weekly_price_numversion monthly_price_numversion
## 1                      NA                       NA
## 2                     400                       NA
## 3                     395                     1350
## 4                      NA                       NA
## 5                      NA                       NA
## 6                      NA                       NA

This is fairly straightforward, with two “tricks”: funs() is a convenience function we have to use to tell dplyr to apply the transformation to multiple columns, and the period . serves as a stand-in for the column we’re on. Note also we have created new columns which tack on “_numversion” to the older columns, but if we leave out that assignment in funs() we just overwrite the previous columns. If we want to be able to specify which columns we want to mutate_at, we can do:

listings %>%
  select(name, price, weekly_price, monthly_price) %>%
  mutate_at(c('price', 'weekly_price', 'monthly_price'),  # specify a list of cols
            funs(as.numeric(gsub('\\$|,', '', .)))) %>%   # specify the transformation
  head()
##                                            name price weekly_price
## 1                    Sunny Bungalow in the City   250           NA
## 2             Charming room in pet friendly apt    65          400
## 3              Mexican Folk Art Haven in Boston    65          395
## 4 Spacious Sunny Bedroom Suite in Historic Home    75           NA
## 5                           Come Home to Boston    79           NA
## 6                Private Bedroom + Great Coffee    75           NA
##   monthly_price
## 1            NA
## 2            NA
## 3          1350
## 4            NA
## 5            NA
## 6            NA

This time also notice that we actually didn’t make new columns, we mutated the existing ones.

(There is also a variation for conditional operations (mutate_if) and analogous versions of all of this for summarize (summarize_all, …). We don’t have time to cover them all, but if you ever need it, you know it’s out there!)

Try using one of these methods to convert all the date columns to Date (fortunately they all use the same formatting).


listings %>%
  select(last_scraped, host_since, first_review, last_review) %>%
  mutate_all(funs(as.Date(., "%Y-%m-%d"))) %>%
  head()
##   last_scraped host_since first_review last_review
## 1   2016-09-07 2015-04-15         <NA>        <NA>
## 2   2016-09-07 2012-06-07   2014-06-01  2016-08-13
## 3   2016-09-07 2009-05-11   2009-07-19  2016-08-05
## 4   2016-09-07 2013-04-21   2016-08-28  2016-08-28
## 5   2016-09-07 2014-05-11   2015-08-18  2016-09-01
## 6   2016-09-07 2016-03-23   2016-04-20  2016-08-28

Introducing the Grammar of Graphics

(back to top)

We already saw how awful the Base R plotting functions like plot() and hist() are, straight out of the box anyway. We’d like to argue that they aren’t just clunky in their aesthetics: each function is stand-alone, takes different arguments, and so on. We’d like some unifying approach to graphics, similar to what we’ve begun to see with dplyr and tidyr.

ggplot gives us just that. The grammar of graphics was introduced by Leland Wilkinson in his book The Grammar of Graphics (which is the gg in ggplot), and put into code for R by Hadley Wickham as the ggplot2 package. We’ll see it not only provides a clean way of approaching data visualization, but also nests with the tidy universe like a hand in a glove.


What does grammar of graphics mean? A grammar is a set of guidelines for how to combine components (ingredients) to create new things. One example is the grammar of language: in English, you can combine a noun (like “the dog”) and a verb (like “runs”) to create a sentence (“the dog runs”). Another example is baking: you can combine a body (like flour), a binder (like eggs), a rising agent (like yeast), and flavoring (like sugar) to create a delicious dish (like a pastry). Notice that these are loose guidelines (see: experimental chefs, or the poetry of e.e. cummings) but there are certainly invalid combinations (like “dog the runned” or substituting salt for sugar).

Let’s translate this idea to visualization. Our ingredients are:

  • Data. This is the base of our dish, and is probably a data.frame object like we have been using.
  • Aesthetic. This is the mapping of the parts of the data to chart components. (Like “price on the x-axis”.)
  • Geometry. The specific visualization shape: a line plot, a point (scatter) plot, bar plot, etc.
  • Statistical transformation. How should the data be transformed or aggregated before visualizing?
  • Theme. This is like flavoring: how do we want the chart to look and feel?

In this scheme, our “required ingredients” are the Data, the Aesthetic, and the Geometry.
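Once ggplot2 is loaded, the three required ingredients map onto code like this (a sketch using columns from our listings table; price is assumed to already be converted to numeric):

```r
# Data %>% Aesthetic + Geometry
listings %>%                                  # Data
  ggplot(aes(x = accommodates, y = price)) +  # Aesthetic: map columns to x and y
  geom_point()                                # Geometry: draw points
```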


First, make sure you’ve got ggplot2 installed (with install.packages('ggplot2')) and then load it into your session:

library(ggplot2)

That scatterplot of the price against the review score seemed interesting, we’d like to revisit it. First let’s save the numeric price column into our listings data table, just for convenience (you should have already done this in the previous section, but just in case):

listings = listings %>% mutate(price = as.numeric(gsub('\\$|,', '', price)))

Now, we chain this into the ggplot function…

listings %>%
  ggplot(aes(x=review_scores_rating, y=price)) +
  geom_point()
## Warning: Removed 813 rows containing missing values (geom_point).