Statistical Modelling and Machine Learning in R: Part 1 Lecture Notes

Overview

In the last session we learned about data wrangling and visualization. In this one, we’ll learn about statistical modelling and machine learning in R. Specifically, we’ll look at some supervised learning tasks (both regression and classification) and touch on some unsupervised learning at the end. We’ll discuss out-of-sample testing, regularization, and model selection via cross-validation. We’ll continue to add to your data wrangling toolkit by tackling some textual fields and looking at a natural language processing model.

Some Modelling Concepts

Before diving into the code, let’s take a couple minutes to get oriented.

Machine Learning and Modelling

We will be working with statistical and machine learning models. At its heart, machine learning is unknown function approximation - there is a relationship between the different columns of the data points which we do not know but would like to find out about. For most of the class, we’ll be looking at supervised learning models, in which the unknown function of interest is a mapping of predictor variables to response variables. We’ll conceptualize such models as having two parts:

  1. The first part of modelling will be to specify a structure within which we’ll explore the relationship of predictors to response (e.g. we specify that the response variable depends linearly on the predictor).

  2. Then we’ll use optimization over a data set to find the “best” model within the structural class we’ve specified, where “best” means that the model minimizes some measure of the error between predicted and actual values.

We’ll also look briefly at unsupervised learning, where we don’t separate the data into predictors and response variables, but instead study the relationship of data points to each other (for example via clustering).

The focus of this course will be to get you familiar with the way R represents models and to provide examples of training and testing models. We hope that with this understanding in place, you’ll be able to generalize your skills from the models we’ll cover here to the vast world of models that’s available.

Model Evaluation

It is important to ask, “How good is my model?” The answer is far from clear a lot of the time, since there are often multiple good explanations for the data and since structural assumptions on relationships between data points can be difficult to verify. However, simply speaking there are at least two uses for a model that result in two different ideas about what qualifies a model as “good.” First, you may care about explaining the relationship between variables (inference), in which case getting the right model structure is very important. However, you may care instead about predicting outcomes of one variable based on another, in which case verifying the exact mathematical relationship is less important. What becomes important in this second setting is validating your model on data other than the data you used to build the model. We’ll focus mainly on predictive quality, though we’ll look at a couple of ways to check explanatory power in the context of linear regression.

Regression

Within the world of supervised learning, we can divide tasks into two types. In settings where the response variable is continuous we call the modelling task regression, and when the response is categorical we call it classification. We will begin with regression to understand what factors influence price in the AirBnB data set.

Let’s start by loading the data (after first setting the correct working directory). We’ll use the ‘listings.csv’ file for now. Since we’ll be wrangling and visualizing, we’ll also load the tidyverse package. (Instead of tidyverse, it also works to load tidyr, dplyr, and ggplot2 as we saw last session.)

listings = read.csv("listings.csv",stringsAsFactors = FALSE)
library(tidyverse)

Linear Regression

As a review, we need to change the price column to a numeric form.

listings = listings %>% mutate(price = as.numeric(gsub("\\$|,","",price)))
summary(listings$price) # Check to make sure things worked
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    10.0    85.0   150.0   173.9   220.0  4000.0

Now, which variables may be predictive of price? We can use names(listings) to get a look at all the variable names.

Data Preparation

Let’s begin by looking at the relationship between listings$accommodates and listings$price. As a first look:

plot(listings$accommodates,listings$price) # Note: I use base R instead of ggplot2 for the really simple first-blush plots

Looks like there are some outliers on both axes. There are fancier ways to deal with this statistically, but for today let’s just get rid of the outliers and fit a model on the cleaner data:

listings_for_lm = listings %>%
  filter(accommodates <= 10, price <=1000)

Let’s take another look:

plot(listings_for_lm$accommodates,listings_for_lm$price)

You may argue that it’s still a bit messy, but let’s see what we can do with a linear model. Since we care about prediction accuracy, we’ll reserve a portion of our data to be a test set. There are lots of ways to do this. We’ll use the modelr package, which is part of the tidyverse.

library(modelr) # Comes with tidyverse installation, but doesn't automatically load with the library(tidyverse) call
set.seed(445)
listings_for_lm = listings_for_lm %>% 
  resample_partition(c(train=0.7,test=0.3))

The object listings_for_lm is now a list with two elements: train and test.

Model Fitting

In R, we specify a model structure and then use the corresponding function to tell R to optimize for the best-fitting model. For linear regression, the function is lm():

lm_price_by_acc = lm(price ~ accommodates,data=listings_for_lm$train) # We'll talk more about the '~' notation soon

Let’s check out the lm_price_by_acc object:

names(lm_price_by_acc)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"

The object lm_price_by_acc is a list containing a bunch of relevant information generated by the lm() function call. We can use $ to view different elements, for example

lm_price_by_acc$call
## lm(formula = price ~ accommodates, data = listings_for_lm$train)

So, it stores the function call in one of the list elements. But this isn’t the most useful way to check out the model fit. The function summary() is overloaded for many different objects and often gives a useful snapshot of the model:

summary(lm_price_by_acc)
## 
## Call:
## lm(formula = price ~ accommodates, data = listings_for_lm$train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -301.85  -53.76  -15.03   39.47  684.70 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    47.983      3.720   12.90   <2e-16 ***
## accommodates   40.386      1.089   37.07   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 89.57 on 2490 degrees of freedom
## Multiple R-squared:  0.3556, Adjusted R-squared:  0.3554 
## F-statistic:  1374 on 1 and 2490 DF,  p-value: < 2.2e-16

There we go, this is more useful! First, let’s look at the section under “Coefficients”. Notice that R automatically adds an intercept term unless you tell it not to (we’ll see how to do this later). In the “Estimate” column, we see that the point estimates for the linear model here say that the price is $47.98 plus $40.39 for every person accommodated. Notice the ’***’ symbols at the end of the “(Intercept)” and “accommodates” rows. These indicate that according to a statistical t-test, both coefficients are highly significantly different from zero, so things are okay from an inference perspective.

Model Evaluation

As another check on inference quality, let’s plot the fitted line. There are some nifty functions in the modelr package that make interacting with models easy in the tidyr and dplyr setting. We’ll use modelr::add_predictions() here.

as.data.frame(listings_for_lm$train) %>%
  add_predictions(lm_price_by_acc) %>%
  ggplot(aes(x=accommodates)) +
  geom_point(aes(y=price)) +
  geom_line(aes(y=pred), color='red')

Nice. We can also remove the linear trend and check the residual uncertainty, which we’ll do here using modelr::add_residuals(). This is helpful to make sure that the residual uncertainty looks like random noise rather than systematic bias.

as.data.frame(listings_for_lm$train) %>%
  add_residuals(lm_price_by_acc,var="resid") %>%
  ggplot(aes(x=accommodates,y=resid)) + geom_point()

Since accommodates takes only a handful of distinct values, maybe box plots tell a better story:

as.data.frame(listings_for_lm$train) %>%
  add_residuals(lm_price_by_acc,var="resid") %>%
  group_by(as.factor(accommodates)) %>%
  ggplot(aes(x=as.factor(accommodates),y=resid)) + geom_boxplot()

Things are pretty centered around zero, with the exception of 9- & 10-person accommodations. Maybe the model doesn’t apply so well here and we could refine it in a later modelling iteration.

Let’s now take a look at out-of-sample performance. We’ll plot the predictions versus the actuals as we did before, but this time for the test data.

as.data.frame(listings_for_lm$test) %>%
  add_predictions(lm_price_by_acc) %>%
  ggplot(aes(x=accommodates)) +
  geom_point(aes(y=price)) +
  geom_line(aes(y=pred), color='red')

Now, what if we wanted to quantify how well the model predicts these out-of-sample values? There are several metrics to aggregate the prediction error. We’ll look at two:

  • Root Mean-squared Error (RMSE): \(\sqrt{\sum_{t=1}^n (\hat{y}_t - y_t)^2/n}\)

  • Mean Absolute Error (MAE): \(\sum_{t=1}^n |\hat{y}_t - y_t|/n\)

In both, \(\hat{y}_t\) is the predicted value for test observation \(t\), \(y_t\) is the actual value, and \(n\) is the size of the test set.

You could calculate these pretty easily with a line or two of code, and we’ll do so later on, but modelr has built-in functions to do this for you automatically:

rmse(lm_price_by_acc,listings_for_lm$test)
## [1] 100.5159
mae(lm_price_by_acc,listings_for_lm$test)
## [1] 68.62211

The units on both measurements are $, and since they’re measures of error, lower is better. These values don’t help much on their own, but they come in handy when comparing different models, which we’ll do next after a quick review.
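
To see what rmse() and mae() are doing under the hood, here’s a quick sketch that computes the same quantities by hand, reusing the model and test set from above:

manual_check = as.data.frame(listings_for_lm$test) %>%
  add_predictions(lm_price_by_acc)
sqrt(mean((manual_check$price - manual_check$pred)^2)) # RMSE by hand
mean(abs(manual_check$price - manual_check$pred))      # MAE by hand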

Summary and More about Formulas

First, let’s review the pattern, because it can be generalized to a whole bunch of more complex models. We asked the questions: How does listing price depend on the number of people it accommodates? How well does accommodation size predict price? Since we were interested in prediction, we reserved part of our data as a test set. We then chose to use a linear model to answer these questions, and found the corresponding function lm(). This function, and modelling functions in general, takes as arguments

  • Data on the response and predictor variables, usually through a formula object
  • Model parameters (in the case of lm(), we used all the default values)

R then automatically found the “best” linear model by computing the least squares estimate, and returned a lm object, which was a list including information about

  • Fitted coefficients
  • Residuals
  • Statistical significance
  • And more…

We interacted with the model to evaluate goodness-of-fit and out-of-sample performance. In our case, we used the modelr and dplyr framework to do this cleanly.

We didn’t say too much about the price ~ accommodates syntax. Many modelling functions in R take formulas as arguments, together with a data argument. The data argument specifies the data frame, and the formula argument tells the model which are the responses and which are the predictors. We’ll play around with formulas in the exercises, but here are a few helpful pointers:

  • The ~ separates the response (on the left) from the predictors (on the right)
  • Predictors are separated with a +
  • Use . on the right-hand side to include all predictors in a given data frame
  • You can also use .-x to include all predictors except x
  • To include interactions between variables, use the * symbol. For example, y ~ x*z expresses the form “regress y on x, z, and x interacted with z” (a * between two variables automatically expands to the main effects plus their interaction, i.e. x + z + x:z)
  • To exclude the intercept term, include -1 or +0 on the right-hand side

For more detailed info, see https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html.
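
To make these pointers concrete, here are a few formula sketches using column names from the listings data (these lines just build formula objects; nothing is fit here):

price ~ accommodates + review_scores_rating # two predictors
price ~ .                                   # all other columns as predictors
price ~ . - accommodates                    # all predictors except accommodates
price ~ accommodates*room_type              # main effects plus their interaction
price ~ accommodates - 1                    # same predictor, but no intercept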

Other Models - Splines

Now that we’ve identified the pattern, let’s look at another type of model: splines. Without going into all the mathematical details, splines are (roughly) one way of fitting a higher-degree polynomial to the data. Let’s look at what happens when we allow a quadratic and cubic fit in the price by accommodation size setting.

library(splines)
lm2 = lm(price ~ ns(accommodates,2),data=listings_for_lm$train)
lm3 = lm(price ~ ns(accommodates,3),data=listings_for_lm$train)

We can again look inside these model objects using summary():

summary(lm2)
## 
## Call:
## lm(formula = price ~ ns(accommodates, 2), data = listings_for_lm$train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -249.53  -53.55  -16.84   33.26  687.19 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            76.840      3.868   19.86   <2e-16 ***
## ns(accommodates, 2)1  357.779     10.366   34.52   <2e-16 ***
## ns(accommodates, 2)2  267.484     15.112   17.70   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 89.25 on 2489 degrees of freedom
## Multiple R-squared:  0.3604, Adjusted R-squared:  0.3599 
## F-statistic: 701.3 on 2 and 2489 DF,  p-value: < 2.2e-16

We can also plot the out-of-sample performance of each separately:

as.data.frame(listings_for_lm$test) %>%
  add_predictions(lm2) %>%
  ggplot(aes(x=accommodates)) +
  geom_point(aes(y=price)) +
  geom_line(aes(y=pred), color='red')

To plot the linear, quadratic, and cubic models all at once, we’ll use gather_predictions(). This function makes a big long column of predictions, with another column to tell which model generated which prediction (so that the data is in long or tidy format):

as.data.frame(listings_for_lm$test) %>%
  gather_predictions(lm_price_by_acc,lm2,lm3) %>%
  ggplot(aes(x=accommodates)) +
  geom_point(aes(y=price)) +
  geom_line(aes(y=pred,group=model,color=model))

In this case, all the models look reasonable. The quadratic and cubic models are hard to distinguish from each other. Let’s compare the RMSE and MAE:

rmse(lm_price_by_acc,listings_for_lm$test)
## [1] 100.5159
rmse(lm2,listings_for_lm$test)
## [1] 99.88911
rmse(lm3,listings_for_lm$test)
## [1] 99.8826
mae(lm_price_by_acc,listings_for_lm$test)
## [1] 68.62211
mae(lm2,listings_for_lm$test)
## [1] 68.10766
mae(lm3,listings_for_lm$test)
## [1] 68.14386

All values are roughly the same. The higher-order models give a bit of advantage, but not much. The fact that we only gain slightly from a more complex model isn’t surprising, given that we only have data for 10 distinct values of the accommodates variable. In fact, in some cases more complex models perform worse since they overfit to the data in the training set.

Model Selection and Tuning: Penalized Regression

Let’s work a bit harder on predicting price, this time using more than one predictor. In fact, we’ll add a bunch of predictors to the model and see what happens.

As one set of predictors, the column listings$amenities looks interesting:

listings %>% select(amenities) %>% head()
##                                                                                                                                                                                                                                                                                                                      amenities
## 1                                                                                               {TV,"Wireless Internet",Kitchen,"Free Parking on Premises","Pets live on this property",Dog(s),Heating,"Family/Kid Friendly",Washer,Dryer,"Smoke Detector","Fire Extinguisher",Essentials,Shampoo,"Laptop Friendly Workspace"}
## 2                               {TV,Internet,"Wireless Internet","Air Conditioning",Kitchen,"Pets Allowed","Pets live on this property",Dog(s),Heating,"Family/Kid Friendly",Washer,Dryer,"Smoke Detector","Carbon Monoxide Detector","Fire Extinguisher",Essentials,Shampoo,"Lock on Bedroom Door",Hangers,"Hair Dryer",Iron}
## 3 {TV,"Cable TV","Wireless Internet","Air Conditioning",Kitchen,"Free Parking on Premises",Heating,Washer,Dryer,"Smoke Detector","Carbon Monoxide Detector","First Aid Kit","Safety Card",Essentials,Shampoo,"Lock on Bedroom Door","translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"}
## 4         {TV,Internet,"Wireless Internet","Air Conditioning",Kitchen,"Free Parking on Premises",Gym,Breakfast,"Indoor Fireplace",Heating,Washer,Dryer,"Smoke Detector","Carbon Monoxide Detector","First Aid Kit","Safety Card","Fire Extinguisher",Essentials,Shampoo,Hangers,"Hair Dryer",Iron,"Laptop Friendly Workspace"}
## 5                                                                                                                                         {Internet,"Wireless Internet","Air Conditioning",Kitchen,Breakfast,Heating,"Smoke Detector","Carbon Monoxide Detector","First Aid Kit",Essentials,Shampoo,Hangers,"Hair Dryer",Iron}
## 6                                                                                                            {"Cable TV","Wireless Internet","Air Conditioning",Kitchen,"Free Parking on Premises","Pets live on this property",Cat(s),Heating,"Family/Kid Friendly","Smoke Detector","Carbon Monoxide Detector",Hangers,Iron}

This could be good predictive information if we can separate out which listing has which amenity. Our goal here is to turn the amenities column into many columns, one for each amenity, and with logical values indicating whether each listing has each amenity. This is just a bit tricky, so I’ve written a function called clean_amenities that will do this for us. We need to source() the file that has this function in it, and then we’ll call it on the listings data frame.

source("../3_modeling_and_ml/clean_amenities.R")
listings = clean_amenities(listings)
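
If you’re curious how a function like this might be written, here’s a rough sketch using tidyr and dplyr. (The actual clean_amenities.R used in class may differ in its details, e.g. in how it cleans up the column names.)

clean_amenities_sketch = function(df) {
  df = df %>% mutate(.listing_id = row_number())
  amenity_cols = df %>%
    transmute(.listing_id,
              amenity = gsub("[{}\"]", "", amenities)) %>%         # strip braces and quotes
    separate_rows(amenity, sep = ",") %>%                          # one row per listing-amenity pair
    filter(amenity != "") %>%
    mutate(amenity = paste0("amenity_", gsub("[^A-Za-z0-9]+", "_", amenity)),
           has = TRUE) %>%
    distinct() %>%
    spread(amenity, has, fill = FALSE)                             # one logical column per amenity
  df %>%
    left_join(amenity_cols, by = ".listing_id") %>%
    mutate_at(vars(starts_with("amenity_")), coalesce, FALSE) %>%  # listings with no amenities at all
    select(-.listing_id)
}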

In total, we’ll use all of these predictors:

  • accommodates
  • property_type
  • review_scores_rating
  • neighbourhood_cleansed
  • accommodates*room_type
  • property_type*neighbourhood_cleansed
  • review_scores_rating*neighbourhood_cleansed
  • accommodates*review_scores_rating
  • All columns created from the amenities column

Note that whenever we include a non-numeric (or categorical) variable, R is going to create one indicator variable for each unique value of the variable except one, which serves as the reference level. We’ll see this in the output of lm().
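
For a quick look at what this expansion means, here’s a small sketch applying model.matrix() (a function we’ll meet again shortly) to the room_type column:

head(model.matrix(~ room_type, data = listings)) # one indicator column per room type, minus the reference level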

First, let’s clean up the data by getting rid of missing values and outliers. For categorical variables, we will remove all categories with only a few observations. Finally, we’ll separate again into training and test sets.

listings_big = listings %>%
  filter(!is.na(review_scores_rating),
         accommodates <= 10,
         property_type %in% c("Apartment","House","Bed & Breakfast","Condominium","Loft","Townhouse"),
         !(neighbourhood_cleansed %in% c("Leather District","Longwood Medical Area")),
         price <= 1000) %>%
  select(price,accommodates,room_type,property_type,review_scores_rating,neighbourhood_cleansed,starts_with("amenity"))

set.seed(144)
listings_big_lm = listings_big %>%
  resample_partition(c(train=0.7,test=0.3))

To get R to learn the model, we need to pass it a formula. We don’t want to write down all those amenity variables by hand. Luckily, we can use the paste() function to string all the variable names together, and then the as.formula() function to translate a string into a formula.

all_amenities = as.data.frame(listings_big_lm$train) %>% select(starts_with("amenity")) %>% names()
amenities_string = paste(all_amenities,collapse="+")
amenities_string # Taking a look to make sure things worked
## [1] "amenity_TV+amenity_Wireless_Internet+amenity_Kitchen+amenity_Free_Parking_on_Premises+amenity_Pets_live_on_this_property+amenity_Dogs+amenity_Heating+amenity_Family_Kid_Friendly+amenity_Washer+amenity_Dryer+amenity_Smoke_Detector+amenity_Fire_Extinguisher+amenity_Essentials+amenity_Shampoo+amenity_Laptop_Friendly_Workspace+amenity_Internet+amenity_Air_Conditioning+amenity_Pets_Allowed+amenity_Carbon_Monoxide_Detector+amenity_Lock_on_Bedroom_Door+amenity_Hangers+amenity_Hair_Dryer+amenity_Iron+amenity_Gym+amenity_Breakfast+amenity_Indoor_Fireplace+amenity_First_Aid_Kit+amenity_Safety_Card+amenity_Cable_TV+amenity_Cats+amenity_x24Hour_Checkin+amenity_Hot_Tub+amenity_Other_pets+amenity_Buzzer_Wireless_Intercom+amenity_Washer___Dryer+amenity_Smoking_Allowed+amenity_Suitable_for_Events+amenity_Wheelchair_Accessible+amenity_Elevator_in_Building+amenity_Doorman+amenity_Pool+amenity_Paid_Parking_Off_Premises+amenity_Free_Parking_on_Street"
big_formula = as.formula(paste("price ~ accommodates + accommodates*room_type + property_type + neighbourhood_cleansed + property_type*neighbourhood_cleansed + review_scores_rating*neighbourhood_cleansed + accommodates*review_scores_rating",amenities_string,sep="+"))

Now we can use the lm() function:

big_price_lm = lm(big_formula,data=listings_big_lm$train)

We won’t look at the summary because there are so many predictors. What happens when we compare in-sample and out-of-sample prediction performance?

rmse(big_price_lm,listings_big_lm$train) # In-sample
## Warning in predict.lm(model, data): prediction from a rank-deficient fit
## may be misleading
## [1] 63.92064
rmse(big_price_lm,listings_big_lm$test) # Out-of-sample
## Warning in predict.lm(model, data): prediction from a rank-deficient fit
## may be misleading
## [1] 76.71424

We’ve got an overfitting problem here, meaning that the training error is smaller than the test error. The model is too powerful for the amount of data we have. R even hints at the problem with its warnings about a “rank-deficient fit,” which mean that some coefficients can’t be estimated at all because the data don’t cover every combination of the predictors.

Regularized/Penalized Regression

But is there still a way to use the info from all these variables without overfitting? Yes! One way to do this is by regularized, or penalized, regression.

Mathematically, we add a term to the optimization problem that we’re solving when fitting a model, a term which penalizes models that get too fancy without enough data. If we call \(\beta\) the coefficient vector that we’d like to learn about for linear regression, then the regular regression we’ve worked with looks like \[ \min_\beta \sum_{t=1}^n (y_t-x_t^T\beta)^2, \] but penalized regression looks like \[ \min_\beta \sum_{t=1}^n (y_t-x_t^T\beta)^2 + \lambda ||\beta||. \]

There are two types of flexibility within this framework that I’ll mention:

  • Choice of norm, a structural decision, and
  • Choice of \(\lambda\), a parametric decision.

Two natural choices of norm are the 1-norm and the 2-norm (the Euclidean norm). When we use the 2-norm, it’s often called “ridge regression.” We’ll focus today on the 1-norm, or “LASSO regression”. On a very simple level, both types of regression shrink all the elements of the unconstrained \(\beta\) vector towards zero, some more than others in a special way. LASSO shrinks the coefficients so that some are exactly zero. This feature is nice because it helps us interpret the model by getting rid of the effects of many of the variables.

To do LASSO, we’ll use the glmnet package. Of note, this package doesn’t work very elegantly with the tidyverse since it uses matrix representations of the data rather than data frame representations. However, it does what it does quite well, and will give us a chance to see some base R code. Let’s load the package and check out the function glmnet(). We can see the documentation from the command line using ?glmnet.

library(glmnet)

Notice that glmnet() doesn’t communicate with the data via formulas. Instead, it wants a matrix of predictor values (with all the categorical variables already expanded into indicator variables) and a vector of values for the variable we’re trying to predict. Fortunately, R has a model.matrix() function which takes a data frame and gets it into the right form for glmnet() and other functions with this type of input.

Notice also that there’s a way to specify lambda manually. Since we haven’t discussed choosing lambda yet, let’s just accept the default for now and see what we get.

x = model.matrix(~ .-price + accommodates*room_type + property_type*neighbourhood_cleansed + review_scores_rating*neighbourhood_cleansed + accommodates*review_scores_rating,data=as.data.frame(listings_big_lm$train))
y = as.vector(as.data.frame(listings_big_lm$train)$price)
lasso_price = glmnet(x,y)
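
As a side note, glmnet() covers both of the penalties discussed above through its alpha argument; a quick sketch:

lasso_fit = glmnet(x, y, alpha = 1) # LASSO penalty (the default, same as above)
ridge_fit = glmnet(x, y, alpha = 0) # ridge penalty instead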

This time the summary() function isn’t quite as useful:

summary(lasso_price)
##           Length Class     Mode   
## a0          100  -none-    numeric
## beta      21000  dgCMatrix S4     
## df          100  -none-    numeric
## dim           2  -none-    numeric
## lambda      100  -none-    numeric
## dev.ratio   100  -none-    numeric
## nulldev       1  -none-    numeric
## npasses       1  -none-    numeric
## jerr          1  -none-    numeric
## offset        1  -none-    logical
## call          3  -none-    call   
## nobs          1  -none-    numeric

It does give us some info, though. Notice that “lambda” is a vector of length 100. The glmnet() function has defined 100 different values of lambda and found the corresponding optimal beta vector for each one! We have 100 different models here. Let’s look at some of the coefficients for the different models. We’ll start with one where lambda is really high:

lasso_price$lambda[1]
## [1] 68.1282
nnzero(lasso_price$beta[,1]) # How many coefficients are nonzero?
## [1] 0

Here the penalty on the size of the coefficients is so high that R sets them all to zero. Moving to some smaller lambdas:

lasso_price$lambda[10]
## [1] 29.49107
lasso_price$beta[which(lasso_price$beta[,10] != 0),10]
##             room_typePrivate room accommodates:review_scores_rating 
##                       -31.0357907                         0.1966795
lasso_price$lambda[20]
## [1] 11.63189
lasso_price$beta[which(lasso_price$beta[,20] != 0),20]
##                                 room_typePrivate room 
##                                           -34.7310901 
##                        neighbourhood_cleansedBack Bay 
##                                            10.2454032 
##                                        amenity_TVTRUE 
##                                             2.4733494 
##                  amenity_Free_Parking_on_PremisesTRUE 
##                                            -3.7078042 
##                                    amenity_WasherTRUE 
##                                             0.2888948 
##                          amenity_Air_ConditioningTRUE 
##                                             3.4620595 
##                          amenity_Indoor_FireplaceTRUE 
##                                             5.0209555 
##                                  amenity_Cable_TVTRUE 
##                                             9.2480922 
##                      amenity_Elevator_in_BuildingTRUE 
##                                            11.1411059 
##                    accommodates:room_typePrivate room 
##                                            -6.2955671 
## property_typeCondominium:neighbourhood_cleansedFenway 
##                                             5.0995782 
##                     accommodates:review_scores_rating 
##                                             0.2773798

And, to see the whole path of lambdas:

plot(lasso_price,xvar="lambda")

Here, each line is one variable. The plot is quite messy with so many variables, but it gives us the idea. As lambda shrinks, the model adds more and more nonzero coefficients.

Cross Validation

How do we choose which of these models to use? Or in other words, how do we “tune” the \(\lambda\) parameter? We’ll use an idea similar to the training-test set split, called cross-validation.

The idea behind cross-validation is this: what if we trained our family of models (in this case, one model per value of lambda) on only some of the training data and left out some other data points? Then we could use those other data points to figure out which of the lambdas works best out-of-sample. So we’d have a training set for training all the models, a validation set for choosing the best one, and a test set to evaluate performance once we’ve settled on a model.

There’s just one other trick: since taking more samples reduces noise, could we somehow take more validation-set samples? Here’s where cross-validation comes in. We divide the training data into groups called folds, and then, for each fold in turn, we train all the models on the rest of the training data and use that fold as the validation set. We then average each model’s performance across the folds and pick the best one.

This is a very common resampling method that applies in lots and lots of settings. Lucky for us the glmnet package has a very handy function called cv.glmnet() which does the entire process automatically. Let’s look at the function arguments using ?cv.glmnet.

The relevant arguments for us right now are

  • x, the matrix of predictors

  • y, the response variable

  • nfolds, the number of ways to split the training set (defaults to 10)

  • type.measure, the metric of prediction quality. It defaults to mean squared error, the square of RMSE, for linear regression

Let’s do the cross-validation:

lasso_price_cv = cv.glmnet(x,y)
summary(lasso_price_cv) # What does the model object look like?
##            Length Class  Mode     
## lambda     86     -none- numeric  
## cvm        86     -none- numeric  
## cvsd       86     -none- numeric  
## cvup       86     -none- numeric  
## cvlo       86     -none- numeric  
## nzero      86     -none- numeric  
## name        1     -none- character
## glmnet.fit 12     elnet  list     
## lambda.min  1     -none- numeric  
## lambda.1se  1     -none- numeric

Notice the “lambda.min”. This is the best lambda as determined by the cross validation. “lambda.1se” is the largest lambda such that the “error is within 1 standard error of the minimum.”
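
To inspect the model that cross-validation selects, here’s a quick sketch pulling out the nonzero coefficients at lambda.min:

coefs_min = as.matrix(coef(lasso_price_cv, s = "lambda.min")) # coefficients at lambda.min
coefs_min[coefs_min[,1] != 0, , drop = FALSE]                 # keep only the nonzero ones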

There’s another automatic plotting function for cv.glmnet() which shows the error for each model:

plot(lasso_price_cv)

The first vertical dotted line shows lambda.min, and the second is lambda.1se. The figure illustrates that we cross-validate to find the “sweet spot” between too much bias (lambda too high) and too much noise (lambda too low). The left-hand side of this graph is flatter than we’d sometimes see, meaning that the unpenalized model may not be too bad. However, increasing lambda increases interpretability at close to no loss in prediction accuracy!

Let’s again compare training and test error. Since the predict() function for glmnet objects uses matrices, we can’t use the rmse function like we did before.

x_all = model.matrix(~ .-price + accommodates*room_type + property_type*neighbourhood_cleansed + review_scores_rating*neighbourhood_cleansed + accommodates*review_scores_rating,data=listings_big) # Matrix form for combined test and training data

listings_big %>%
  mutate(is_test = 1:nrow(listings_big) %in% listings_big_lm$test$idx,
         pred = as.vector(predict(lasso_price_cv,newx=x_all))) %>%
  group_by(is_test) %>%
  summarize(rmse = sqrt(1/length(price)*sum((price-pred)^2)))
## # A tibble: 2 × 2
##   is_test     rmse
##     <lgl>    <dbl>
## 1   FALSE 70.45516
## 2    TRUE 78.62953

The overfitting problem has gotten better, but hasn’t yet gone away completely. I added a bunch of variables for dramatic effect that we could probably screen out before running the LASSO if we really wanted a good model.

One more note on cross-validation: the glmnet package has built-in functionality for cross-validation. In situations where that’s not the case, modelr::crossv_kfold() will prepare data for cross validation in a nice way.
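
For example, here’s a hedged sketch of how we could have tuned the natural-spline degrees of freedom from earlier by hand, using crossv_kfold() together with purrr’s map2_dbl() (purrr loads with the tidyverse; this assumes the splines package is still loaded):

set.seed(821)
folds = crossv_kfold(as.data.frame(listings_for_lm$train), k = 5)
spline_cv_rmse = sapply(1:5, function(df_spline) {
  mean(map2_dbl(folds$train, folds$test,
                function(tr, te) rmse(lm(price ~ ns(accommodates, df_spline),
                                         data = as.data.frame(tr)), te)))
})
spline_cv_rmse            # Average validation RMSE for each degrees-of-freedom choice
which.min(spline_cv_rmse) # The choice cross-validation prefers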

Classification

So far we’ve looked at models which predict a continuous response variable. There are many related models which predict categorical outcomes, such as whether an email is spam or not, or which digit a handwritten number is. We’ll take a brief look at two of these: logistic regression and classification trees.

Logistic Regression

Logistic regression is part of the class of generalized linear models (GLMs), which build directly on top of linear regression. These models take the linear fit and map it through a non-linear function. For logistic regression this function is the logistic function, \(f(x) = \exp(x)/(1+\exp(x))\), an S-shaped curve that maps any real number to a value strictly between zero and one.
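
Here’s a quick base-R sketch to draw the curve:

curve(exp(x)/(1 + exp(x)), from = -6, to = 6, xlab = "x", ylab = "f(x)") # the S-shaped logistic curve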

Since the function stays between zero and one, it can be interpreted as a mapping from predictor values to a probability of being in one of two classes.

Let’s apply this model to the listings data by trying to predict which listings have elevators in the building, using price as the predictor. To make sure we’re asking a sensible question, we’ll only consider apartments. We’ll also filter out price outliers.

set.seed(863)
listings_glm = listings %>% 
  filter(property_type == "Apartment",
         price <= 500) %>%
  resample_partition(c(train=0.7,test=0.3))

Instead of the lm() function, we’ll now use glm(), but the syntax is almost exactly the same:

l.glm = glm(amenity_Elevator_in_Building ~ price,family="binomial",data=listings_glm$train)
summary(l.glm)
## 
## Call:
## glm(formula = amenity_Elevator_in_Building ~ price, family = "binomial", 
##     data = listings_glm$train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7908  -0.8582  -0.6797   1.1951   1.9612  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.987236   0.135344  -14.68   <2e-16 ***
## price        0.006732   0.000641   10.50   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1875.1  on 1490  degrees of freedom
## Residual deviance: 1753.9  on 1489  degrees of freedom
## AIC: 1757.9
## 
## Number of Fisher Scoring iterations: 4

Again, we can add predictions to the data frame and plot these along with the actuals, although the result doesn’t look nearly as clean:

as.data.frame(listings_glm$train) %>%
  mutate(pred = predict(l.glm,newdata=as.data.frame(listings_glm$train),type="response")) %>%
  ggplot(aes(x=price)) + geom_line(aes(y=pred)) + geom_point(aes(y=amenity_Elevator_in_Building + 0))

One way to get a more informative plot is by using the logi.hist.plot() function in the popbio package.

In the meantime, we can explore out-of-sample performance. Ultimately, we want to predict whether or not a listing has an elevator. However, logistic regression gives us something a bit different: a probability that each listing has an elevator. This gives us flexibility in the way we predict. The most natural thing would be to predict that any listing with predicted probability above 0.5 has an elevator, and any listing with predicted probability below 0.5 does not have an elevator. But what if I use a wheelchair and I want to be really confident that there’s going to be an elevator? I may want to use a cutoff value of 0.9 rather than 0.5. In fact, we could choose any cutoff value and have a corresponding prediction model.
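
As a sketch, here’s how we could turn the predicted probabilities into yes/no predictions at a 0.5 cutoff and compare them with the actual values in the test set:

cutoff_check = as.data.frame(listings_glm$test)
probs = predict(l.glm, newdata = cutoff_check, type = "response")
table(predicted = probs > 0.5, actual = cutoff_check$amenity_Elevator_in_Building)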

There’s a really nice metric that measures the quality of all cutoffs simultaneously: AUC, for “Area Under the receiver operating characteristic Curve.” That’s a mouthful, but the idea is simple enough: for every cutoff, we plot the true positive rate against the false positive rate and then take the area under this curve. (A positive in our case is a listing that has an elevator. So a true positive is a listing that we predict has an elevator and really does have an elevator, while a false positive is a listing that we predict has an elevator but does not actually have one.)

As the cutoff shrinks from 1 to 0, we predict more and more listings to be positive. If the true positive rate grows faster than the false positive rate as this happens, it’s a sign that the model is good. This is exactly what AUC measures.

The ROCR package is one implementation that allows us to plot ROC curves and calculate AUC. Here’s an example (make sure to install the package first using install.packages("ROCR")):

library(ROCR)
preds = predict(l.glm,as.data.frame(listings_glm$test),type="response")
test_elevators = as.data.frame(listings_glm$test)$amenity_Elevator_in_Building
pred_obj = prediction(preds,test_elevators) # Creating a prediction object for ROCR
perf = performance(pred_obj,'tpr','fpr')
plot(perf) # ROC curve

performance(pred_obj,'auc') # AUC - a scalar measure of performance
## An object of class "performance"
## Slot "x.name":
## [1] "None"
## 
## Slot "y.name":
## [1] "Area under the ROC curve"
## 
## Slot "alpha.name":
## [1] "none"
## 
## Slot "x.values":
## list()
## 
## Slot "y.values":
## [[1]]
## [1] 0.7122285
## 
## 
## Slot "alpha.values":
## list()

As you can see, the performance() function in the ROCR package is versatile and allows you to calculate and plot a bunch of different performance metrics.

In our case, this model gives an AUC of about 0.71. The worst possible value is 0.5, which corresponds to random guessing. We’re definitely better than random here, and could likely improve by adding more predictors.

We’ve covered basic logistic regression, but just as with linear regression there are many, many extensions. For example, we could add higher-order predictor terms via splines. We could also do LASSO logistic regression if we wanted to use many predictors, using the glmnet package. In order to change from linear regression to logistic in the glmnet() or cv.glmnet() functions, just specify the argument family = "binomial" within the function call. When predicting, there are several options in the type argument that are useful depending on whether you want the probabilities or the class prediction.
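
As a sketch (not run here), a LASSO logistic regression would look something like the following, where x_class and y_class are a hypothetical predictor matrix and 0/1 response built with model.matrix() as before:

# x_class and y_class are hypothetical objects, built with model.matrix() as in the LASSO section
# lasso_logit_cv = cv.glmnet(x_class, y_class, family = "binomial")
# predict(lasso_logit_cv, newx = x_class, type = "response") # predicted probabilities
# predict(lasso_logit_cv, newx = x_class, type = "class")    # predicted classes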

Classification Trees

We will briefly explore classification trees (often referred to as CART, for Classification And Regression Trees), and then in the second half of the session we’ll take a deeper dive.

A (binary) classification tree makes predictions by grouping similar observations and then assigning a probability to each group using the proportion of observations within that group that belong to the positive class. Groups correspond to the leaves of a tree, and the branches leading to each leaf correspond to logical criteria on the predictor variables. There’s a lot of neat math that goes into building the trees, but we won’t get into that today. For now let’s get familiarized by looking at a simple example. We need the rpart library.

library(rpart)

The model construction step follows the same established pattern. We use the modelling function rpart(), which takes a formula and a data frame (and optional parameters) as arguments.

l.rpart = rpart(amenity_Elevator_in_Building ~ price + neighbourhood_cleansed,data=listings_glm$train)
summary(l.rpart)
## Call:
## rpart(formula = amenity_Elevator_in_Building ~ price + neighbourhood_cleansed, 
##     data = listings_glm$train)
##   n= 1491 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.27282459      0 1.0000000 1.0005842 0.01966834
## 2 0.04939872      1 0.7271754 0.7366297 0.02854560
## 3 0.03009881      2 0.6777767 0.6900802 0.02700776
## 4 0.01130377      3 0.6476779 0.6649756 0.02693675
## 5 0.01000000      4 0.6363741 0.6694955 0.02746162
## 
## Variable importance
## neighbourhood_cleansed                  price 
##                     88                     12 
## 
## Node number 1: 1491 observations,    complexity param=0.2728246
##   mean=0.3226023, MSE=0.21853 
##   left son=2 (1133 obs) right son=3 (358 obs)
##   Primary splits:
##       neighbourhood_cleansed splits as  LLLLLLRLRLRLLRLLLLLLLRLRL, improve=0.27282460, (0 missing)
##       price                  < 206.5 to the left,  improve=0.07962297, (0 missing)
## 
## Node number 2: 1133 observations,    complexity param=0.04939872
##   mean=0.1853486, MSE=0.1509945 
##   left son=4 (801 obs) right son=5 (332 obs)
##   Primary splits:
##       neighbourhood_cleansed splits as  RRRLLL-L-L-LL-RLRLLLL-L-L, improve=0.09408350, (0 missing)
##       price                  < 231   to the left,  improve=0.02694371, (0 missing)
##   Surrogate splits:
##       price < 327   to the left,  agree=0.712, adj=0.018, (0 split)
## 
## Node number 3: 358 observations,    complexity param=0.03009881
##   mean=0.7569832, MSE=0.1839596 
##   left son=6 (113 obs) right son=7 (245 obs)
##   Primary splits:
##       price                  < 163.5 to the left,  improve=0.14891300, (0 missing)
##       neighbourhood_cleansed splits as  ------L-R-L--R-------R-R-, improve=0.03255809, (0 missing)
## 
## Node number 4: 801 observations
##   mean=0.1086142, MSE=0.09681718 
## 
## Node number 5: 332 observations,    complexity param=0.01130377
##   mean=0.3704819, MSE=0.2332251 
##   left son=10 (237 obs) right son=11 (95 obs)
##   Primary splits:
##       price                  < 245.5 to the left,  improve=0.04756624, (0 missing)
##       neighbourhood_cleansed splits as  RLR-----------L-R--------, improve=0.00152398, (0 missing)
## 
## Node number 6: 113 observations
##   mean=0.5132743, MSE=0.2498238 
## 
## Node number 7: 245 observations
##   mean=0.8693878, MSE=0.1135527 
## 
## Node number 10: 237 observations
##   mean=0.3037975, MSE=0.2115046 
## 
## Node number 11: 95 observations
##   mean=0.5368421, MSE=0.2486427

This is another case when the summary() function is less helpful. We can, however, plot the resulting tree:

library(rpart.plot)
prp(l.rpart)

The neighbourhood_cleansed variable names are a bit hard to read, but other than that the model is clear and simple. While logistic regression is a continuous, global method, a classification tree is a piecewise constant, local method. In the next section we’ll look at a much fancier classification tree and discuss performance evaluation.
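
As a quick preview of that comparison, here’s a hedged sketch computing this tree’s out-of-sample AUC with the same ROCR workflow we used for logistic regression, reusing test_elevators from above:

tree_preds = predict(l.rpart, newdata = as.data.frame(listings_glm$test))
tree_auc = performance(prediction(tree_preds, test_elevators), 'auc')
tree_auc@y.values[[1]] # Area under the ROC curve for the tree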

Additional Resources

I hope you enjoyed this portion of the class. If so, here are some places to look for more info:

  • Future sessions in this class, including Natural Language Processing (up next!), advanced modeling, and deep learning.
  • R for Data Science: http://r4ds.had.co.nz/, an online book on modelling with R’s tidyverse packages.
  • Online courses, including Johns Hopkins’ data science course on coursera: https://www.coursera.org/specializations/jhu-data-science
  • MIT courses, including
    • 15.071 (“The Analytics Edge”, an application and coding-based analytics course),
    • 6.867 (EECS’s introductory Machine Learning course)
    • 9.520 (A theory course about regularized machine learning methods)
  • R packages modelr and caret