
Section notes: Linear regression

Kate Shulgina and June Shin (10/26/2018)

You can also download this page in jupyter notebook format (sans figures, but the code will work).

In this section, we will cover first some explanation of linear regression and fitting, and then go through some code examples in Python.

Linear regression as a predictive model

In the lecture notes, Sean covers linear regression from the perspective of a generative probabilistic model. While this is a good framework for figuring out how to write out the assumptions of the model explicitly, there are also other perspectives.

One perspective that helped me personally understand linear regression is that of linear regression as a predictive model. Think of linear regression as the process of building a model that you will later use to make predictions.

Let's pretend that we want to make a model that can predict what grade a student will get on an exam. We have various information on the past students in the class, like how many hours they studied total, how many hours they slept the night before, their age, and so on, alongside known exam scores.

As a first step, it might make sense to see if there is a predictive relationship between the number of hours studied and the exam score. This is what the relationship looks like:

Our goal is to build a predictive model such that if we get a new student, and know the number of hours they studied, we could predict what grade they would get. One way might be to fit a line. From high school algebra, the equation of a line is $y=mx+b$. In this case, the equation is $y=\beta_1 x + \beta_0$, where $y=\text{exam grade}$, $x=\text{hours studied}$, $\beta_1$ is the slope and $\beta_0$ is the y-intercept.

We could imagine various different fits to the data-- some good and some not so good. We get these different fits by varying the parameter $\beta_1$ ($\beta_0$ is held at 0 to keep this example simple).

How can we compare the different fits and pick the best one? One common method is to compare the sum of squared residuals (RSS). A residual ($e_i$) can be viewed as the prediction error-- for a given student, how far off was the observed grade from the one predicted by the fitted line? Good fits will have low prediction error and a low RSS.

$$\begin{aligned} e_i &= y_i - \hat y_i \\ &= y_i - (\beta_1 x_i + \beta_0) \end{aligned}$$

$$ RSS = \sum_{i=1}^n (y_i - \beta_1 x_i - \beta_0)^2 $$
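As a minimal sketch (the hours_studied and exam_grades arrays below are made-up example data, not from the lecture), the RSS is easy to compute directly with numpy:

import numpy as np

def rss(beta, x, y):
    # beta = [beta_0, beta_1]; predictions from the line y = beta_1*x + beta_0
    y_hat = beta[1] * np.asarray(x) + beta[0]
    residuals = np.asarray(y) - y_hat      # prediction errors e_i
    return np.sum(residuals ** 2)

# hypothetical example data: hours studied vs. exam grade
hours_studied = np.array([1.0, 2.5, 4.0, 6.0, 8.0])
exam_grades   = np.array([52., 61., 70., 83., 91.])
print(rss([50.0, 5.0], hours_studied, exam_grades))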

Let's try to visualize how the RSS changes for the different values of $\beta_1$ plotted above. You could imagine scanning across all possible values of $\beta_1$ and getting a sense of how the RSS changes with the parameter.

The value of $\beta_1$ that will give us the best fit is the one that has the lowest RSS, so we are trying to minimize the RSS, also known as the Least Squares Error. The function being minimized has a lot of names in statistics and machine learning-- you might hear it called the objective function, the error function, the loss function, etc. They are all the same thing with different jargon-y names. We can use an optimizer like the scipy.optimize.minimize function to find the value of $\beta_1$ that gives the lowest value for the RSS, and therefore the best predictive line.
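Continuing the sketch above (still using the hypothetical rss function and hours_studied/exam_grades arrays), you could scan values of $\beta_1$ by hand to see how the RSS changes, and then let scipy.optimize.minimize find the minimum properly:

import numpy as np
import scipy.optimize as optimize

# crude scan over candidate slopes, holding beta_0 fixed at 50 for illustration
for slope in np.linspace(0., 10., 6):
    print(slope, rss([50.0, slope], hours_studied, exam_grades))

# let the optimizer search for the beta that minimizes the RSS
result = optimize.minimize(rss, np.array([1., 1.]), args=(hours_studied, exam_grades))
beta0_hat, beta1_hat = result.x
print(beta0_hat, beta1_hat)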

Multiple regression

Suppose we find the best fit line, and realize that using the number of hours studied just isn't that good at predicting the exam score. We have other data at our disposal, like the number of hours slept the night before the exam, that we could use in our model to improve our predictions.

Let's expand to two predictor variables: $x_1 = \text{hours studied}$ and $x_2 = \text{hours slept}$

And we are still predicting one variable: $y=\text{exam grade}$

Now we can imagine fitting a plane that given some values for $x_1, x_2$ can predict the student's exam score.

(Took the plot from the internet)

A plane fit to this data would have the equation $\hat y = \beta_2 x_2 + \beta_1 x_1 + \beta_0$. In the machine learning literature, they like to write things in compact matrix notation, so don't get scared if you see something like this: $\hat y = \vec \beta \cdot \vec x$. Here, $\vec \beta = [ \beta_0, \beta_1, \beta_2 ]$ and $\vec x = [ 1, x_1, x_2 ]$. There is an extra 1 in $\vec x$ so that if you take the dot product of the two vectors, you get the expanded-out equation that we started with.
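To make the notation concrete, here is a tiny sketch (with made-up numbers) showing that the dot product recovers the plane equation:

import numpy as np

beta = np.array([10.0, 2.0, 1.5])    # [beta_0, beta_1, beta_2]
x    = np.array([1.0, 8.0, 7.0])     # [1, hours studied, hours slept]

# dot product: beta_0*1 + beta_1*x_1 + beta_2*x_2
y_hat = np.dot(beta, x)
print(y_hat)                         # 10 + 16 + 10.5 = 36.5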

Just like in the case with one predictor $x$, we can compute the prediction error (the residuals) by subtracting the observed $y$ from the one predicted by the plane.

$$ RSS = \sum_{i=1}^n (y_i - \beta_2 x_{i2} - \beta_1 x_{i1} - \beta_0)^2 $$

If we plugged the RSS into an optimizer, it would search over the 3D space of possible $\beta_0, \beta_1, \beta_2$ values and find the ones that minimize the error.

You can imagine expanding this out to include any number of predictor variables. You can have predictor variables that are columns of data that you have, or even functions of data columns like $x_3 = \sqrt{x_1}$ (in machine learning speak, these are called basis functions).
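As a sketch of what that looks like in practice (the X_raw numbers are hypothetical), you just add the transformed column to your design matrix alongside the raw ones:

import numpy as np

# hypothetical raw predictors: hours studied and hours slept for four students
X_raw = np.array([[1.0, 6.0],
                  [4.0, 7.5],
                  [6.0, 5.0],
                  [9.0, 8.0]])

# design matrix: a column of ones, the raw columns, plus a sqrt(x_1) basis function
X = np.column_stack([np.ones(len(X_raw)), X_raw, np.sqrt(X_raw[:, 0])])
print(X)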

However, one thing to watch out for when adding more and more predictor variables is overfitting. This is when your model fits the training data so well, minimizes the errors so nicely, that it actually can't generalize and performs poorly on data it hasn't previously seen. If it can't generalize, then it's a bad predictor.

(Took the plot from this website)

There's a large amount of thought on how to address this issue, and many approaches rely on evaluating how good the model is on some held-out test data.
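Here is a minimal sketch of that idea, with simulated data and np.polyfit standing in for the least squares fit (all of the names and numbers are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)

# simulate 50 noisy points around a true line
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 5.0 + rng.normal(0, 1, size=50)

# hold out 20% of the points as a test set
n_test = len(x) // 5
idx = rng.permutation(len(x))
test, train = idx[:n_test], idx[n_test:]

# fit on the training points only (np.polyfit with deg=1 does least squares for a line)
slope, intercept = np.polyfit(x[train], y[train], deg=1)

# evaluate prediction error on the held-out points
test_rss = np.sum((y[test] - (slope * x[test] + intercept)) ** 2)
print(slope, intercept, test_rss)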

Probabilistic framework for linear regression

The greatest strength of the probabilistic framework for linear regression that Sean brought up in class is that in defining the probability model, you are explicitly stating many of the assumptions that are unspoken in least squares fitting. Let's review it briefly.

In this way of thinking about linear regression, we are proposing that there is some absolute true linear relationship between $x$ and $y$, but we make observations of the real world, which is noisy, so the observed points don't fall exactly on the line but scatter around it with some noise.

$$y_i = \beta_1 x_i + \beta_0 + e_i$$

The "noise" in the observations is the error between the true line and the observations, which are the residuals. We model the noise as a random variable, usually Normally distributed. Our goal is to infer the parameters $\vec \beta$ of the original line that make the points scatted by noise as least unlikely as possible.

A common modelling decision is to model the residuals as being distributed $e_i \sim \mathcal{N}(0, \sigma^2)$, a Normal distribution with the same variance across all points. Most of the time, this assumption isn't even considered or questioned! You should always ask yourself whether these are reasonable assumptions for your data. You'll see in the homework that this is a critical mistake that Moriarty makes. In the lecture notes, Sean shows that the maximum likelihood estimate for the parameters under this assumption is equivalent to least squares fitting-- so Normally-distributed noise is implicitly assumed in least squares!

Since $y_i$ is a function of a random variable ($e_i$), it is also a random variable, with a Normal distribution centered around the true line.

$$ y_i | x_i, \vec \beta, \sigma^2 \sim \mathcal{N}(\beta_1 x_i + \beta_0, \sigma^2) $$

The likelihood of some particular values of the parameters $\vec \beta$ and $\sigma$ is the probability of all of the observed data under that model. We use the Normal PDF to get the probability of the $y_i$s.

$$\begin{aligned} \text{likelihood}(\vec \beta, \sigma^2) &= P(y_1, ..., y_n | \vec \beta, \sigma^2) \\ &= \prod_{i=1}^n P(y_i | x_i, \vec \beta, \sigma^2) \\ &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y_i - \beta_1x_i -\beta_0)^2}{2\sigma^2}} \end{aligned}$$

Now we can compare how well different lines fit the data by evaluating the likelihood of those parameters. To find the best values, we can feed this equation into an optimizer function.

Optimizers are conventionally written as minimizers rather than maximizers, so we need to feed them the negative log-likelihood instead. In this case, it would be

$$ \text{NLL} = - \sum_{i=1}^n \bigg( \text{log} \big( \frac{1}{\sqrt{2\pi\sigma^2}} \big) - \frac{(y_i - \beta_1x_i -\beta_0)^2}{2\sigma^2} \bigg) $$
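As a rough sketch of what gets fed to the optimizer (using made-up x and y arrays, and fitting $\log \sigma$ so that $\sigma$ stays positive-- one convenient choice, not the only one), the NLL can be written with scipy.stats.norm.logpdf:

import numpy as np
import scipy.stats as stats
import scipy.optimize as optimize

def nll(params, x, y):
    # params = [beta_0, beta_1, log_sigma]; exponentiate to keep sigma positive
    b0, b1, log_sigma = params
    sigma = np.exp(log_sigma)
    y_hat = b1 * np.asarray(x) + b0
    return -np.sum(stats.norm.logpdf(y, loc=y_hat, scale=sigma))

# hypothetical example data
x = np.array([1.0, 2.5, 4.0, 6.0, 8.0])
y = np.array([52., 61., 70., 83., 91.])

result = optimize.minimize(nll, np.array([0., 1., 0.]), args=(x, y))
print(result.x[:2], np.exp(result.x[2]))   # fitted intercept, slope, and sigma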

I think one of the nice parts of this model is that you can naturally get a "confidence interval" that's interpretable as a probability, directly from the Normal distribution of each $y_i$. You can imagine overlaying onto the linear regression line a band representing where you would expect 95% of your observed points to scatter around the line.
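Continuing the sketch above, such a band is just the fitted line plus or minus about $1.96\sigma$ (this reuses the hypothetical result, x, and y from the previous snippet):

import numpy as np
import matplotlib.pyplot as plt

b0_hat, b1_hat = result.x[:2]
sigma_hat = np.exp(result.x[2])

x_grid = np.linspace(0, 10, 100)
y_line = b1_hat * x_grid + b0_hat

plt.plot(x, y, 'o')
plt.plot(x_grid, y_line, '-')
# band where ~95% of observations are expected to fall: mean +/- 1.96 sigma
plt.fill_between(x_grid, y_line - 1.96 * sigma_hat, y_line + 1.96 * sigma_hat, alpha=0.2)
plt.show()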

Example demonstrating the flexibility of the probabilistic approach

Let's walk through one last example together that hopefully will help you think through the homework assignment.

Here is our dataset.

(Took the plot from this website)

What we have plotted here is the income of various people as a function of their age. What we want to do is fit a predictive line to this data-- maybe we are an advertising company and we want to make better targeted ads, for instance.

What is unusual about this dataset? The spread of the income increases with age (this might make sense-- not a lot of 25 year olds have had the time to become CEOs). This effect goes by the jargon-y term heteroskedasticity.

What does this trend mean with regards to least squares fitting? What are the assumptions implicit in least squares fitting? Does this data fit those assumptions?

What could we do about this issue? There are several options, including transforming the $y_i$s into a different space, or estimating the individual $\sigma^2_i$s separately.

In our case, let's suppose the variance is directly proportional to age, so we can define a relationship $\sigma_i^2 = a x_i \sigma^2$.

How would we write the distribution of the $y_i$s now?

$$ y_i | x_i, \vec \beta, \sigma \sim \mathcal{N}(\beta_1 x_i + \beta_0, a x_i \sigma^2) $$

Then the likelihood would be the same, except for the $\sigma$ term, which now depends on $x_i$.

$$ \text{likelihood}(\vec \beta, \sigma) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi a x_i \sigma^2}} e^{-\frac{(y_i - \beta_1x_i -\beta_0)^2}{2 a x_i \sigma^2}} $$

See how easy it was to incorporate a new assumption into the probabilistic framework for linear regression? Now we can easily find the best fit line by minimizing the negative log-likelihood with respect to $\vec \beta$ and $\sigma$ using an optimizer. If we were trying to modify least squares regression, it would not be as clear how to correctly modify the RSS function to account for variation in how the residuals are distributed.
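As a rough sketch of that heteroskedastic fit (the age/income data below is simulated, and since $a$ and $\sigma^2$ only ever appear as the product $a\sigma^2$, the sketch fits them as a single parameter), the only change from the ordinary NLL is that the scale passed to the Normal logpdf now grows with $x_i$:

import numpy as np
import scipy.stats as stats
import scipy.optimize as optimize

def nll_hetero(params, x, y):
    # params = [beta_0, beta_1, log_as2], where log_as2 is the log of (a * sigma^2)
    b0, b1, log_as2 = params
    x = np.asarray(x)
    scale_i = np.sqrt(np.exp(log_as2) * x)   # per-point standard deviation, growing with x
    return -np.sum(stats.norm.logpdf(y, loc=b1 * x + b0, scale=scale_i))

# simulate age/income data whose spread grows with age
rng = np.random.default_rng(1)
age = rng.uniform(20, 65, size=200)
income = 1.5 * age + 20 + rng.normal(0, np.sqrt(0.5 * age))

result = optimize.minimize(nll_hetero, np.array([0., 1., 0.]), args=(age, income))
print(result.x)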

Now let's code it up!

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import scipy.optimize as optimize
import matplotlib.pyplot as plt
%matplotlib inline
We are going to start by doing a basic linear regression fit:
$$ y = mx + b $$
In [2]:
x = [1.52, 3.91, 2.23, 4.99, 5.81, 6.79, 6.94, 8.81, 9.02, 9.78]
y = [11.04, 8.86, 7.52, 8.43, 6.01, 5.71, 3.95, 3.66, 1.21, 0.47]
In [3]:
plt.plot(x, y, 'o')
Out[3]:
[<matplotlib.lines.Line2D at 0x11aba5400>]
In [4]:
def predict_yhat(x, m, b):
    yhat = m*x + b
    return yhat
We're going to minimize the squared L2 norm of the residuals to find the line of best fit: this is also known as the sum of squared residuals
$$ \sum_i (\hat{y}_i - y_i)^2 $$
We can calculate these residuals directly and minimize this sum.
In [5]:
def calculate_L2norm(params, x, y):
    # params[0] is the slope m and params[1] is the intercept b
    L2 = 0
    for i in range(len(x)):
        yhat = predict_yhat(x[i], params[0], params[1])
        residual = yhat - y[i]
        L2 += residual ** 2
    return L2
In [6]:
guess=np.array([1.,1.])
In [7]:
minimization = optimize.minimize(calculate_L2norm,guess,(x,y))
m = minimization.x[0]
b = minimization.x[1]
x_pred = np.arange(0, 10., 0.1)
y_pred = predict_yhat(x_pred, m, b)
In [8]:
plt.plot(x, y, 'o')
plt.plot(x_pred, y_pred, '-')
Out[8]:
[<matplotlib.lines.Line2D at 0x11a8e7940>]
We can also calculate the likelihood of the data given the predicted parameters by evaluating the probability of each residual under a Normal distribution.
In [9]:
def calculate_nll(params, x, y):
    nll=0.
    # Loop over each data point to accumulate the total negative log-likelihood.
    for i in range(len(x)):
        yhat = predict_yhat(x[i], params[0], params[1])
        resid = yhat - y[i]
        nll -= stats.norm.logpdf(resid, loc=0, scale=1)
    return nll
In [10]:
minimization = optimize.minimize(calculate_nll,guess,(x,y))
m = minimization.x[0]
b = minimization.x[1]
x_pred = np.arange(0, 10., 0.1)
y_pred = predict_yhat(x_pred, m, b)
In [11]:
plt.plot(x, y, 'o')
plt.plot(x_pred, y_pred, '-')
Out[11]:
[<matplotlib.lines.Line2D at 0x11ab65518>]
We can do the same thing, but now model the residuals with a skew-normal distribution (which weights residuals above and below the line asymmetrically) or a lognormal distribution (where the noise is multiplicative, so its spread scales with the mean)
In [12]:
def calculate_nll_skewnormal(params, x, y):
    nll=0.
    # Loop over each data point to accumulate the total negative log-likelihood.
    for i in range(len(x)):
        yhat = predict_yhat(x[i], params[0], params[1])
        resid = yhat - y[i]
        nll -= stats.skewnorm.logpdf(resid, a=2, loc=0, scale=1)
    return nll
In [13]:
minimization = optimize.minimize(calculate_nll_skewnormal,guess,(x,y))
m = minimization.x[0]
b = minimization.x[1]
x_pred = np.arange(0, 10., 0.1)
y_pred = predict_yhat(x_pred, m, b)
In [14]:
plt.plot(x, y, 'o')
plt.plot(x_pred, y_pred, '-')
Out[14]:
[<matplotlib.lines.Line2D at 0x11adfc550>]
In [15]:
def calculate_nll_lognormal(params, x, y):
    nll=0.
    # Loop over each data point to accumulate the total negative log-likelihood.
    # Note: taking logs assumes that both yhat and y are positive.
    for i in range(len(x)):
        yhat = predict_yhat(x[i], params[0], params[1])
        resid = np.log(yhat) - np.log(y[i])
        nll -= stats.norm.logpdf(resid, loc=0, scale=1)
    return nll
In [16]:
minimization = optimize.minimize(calculate_nll_lognormal,guess,(x,y))
m = minimization.x[0]
b = minimization.x[1]
x_pred = np.arange(0, 10., 0.1)
y_pred = predict_yhat(x_pred, m, b)
In [17]:
plt.plot(x, y, 'o')
plt.plot(x_pred, y_pred, '-')
Out[17]:
[<matplotlib.lines.Line2D at 0x11af380f0>]
Make it Bayesian (add a prior)!
In [18]:
def calculate_nll(params, x, y):
    nll=0.
    # Loop over each data point to accumulate the total negative log-likelihood.
    for i in range(len(x)):
        yhat = predict_yhat(x[i], params[0], params[1])
        resid = yhat - y[i]
        nll -= stats.norm.logpdf(resid, loc=0, scale=1)
    # Add the negative log prior on the slope and intercept once, not once per data point.
    nll -= stats.norm.logpdf(params[0], loc=-1, scale=1) + stats.norm.logpdf(params[1], loc=10, scale=5)
    return nll
In [19]:
minimization = optimize.minimize(calculate_nll,guess,(x,y))
m = minimization.x[0]
b = minimization.x[1]
x_pred = np.arange(0, 10., 0.1)
y_pred = predict_yhat(x_pred, m, b)
In [20]:
plt.plot(x, y, 'o')
plt.plot(x_pred, y_pred, '-')
Out[20]:
[<matplotlib.lines.Line2D at 0x11ae99d68>]
We can make the functions more complicated and easily get polynomial regression
In [21]:
def predict_yhat(x, m1, m2, m3, m4, b):
    yhat = m1*x + m2*(x**2) + m3*(x**3) + m4*(x**4) + b
    return yhat
In [22]:
def calculate_nll(params, x, y):
    nll=0.
    # Loop over each data point to accumulate the total negative log-likelihood.
    for i in range(len(x)):
        yhat = predict_yhat(x[i], params[0], params[1], params[2], params[3], params[4])
        resid = yhat - y[i]
        nll -= stats.norm.logpdf(resid, loc=0, scale=1)
    return nll
In [23]:
guess=np.array([1.,1.,1.,1.,1.])
In [24]:
minimization = optimize.minimize(calculate_nll,guess,(x,y))
m1 = minimization.x[0]
m2 = minimization.x[1]
m3 = minimization.x[2]
m4 = minimization.x[3]
b = minimization.x[4]
x_pred = np.arange(0, 10., 0.1)
y_pred = predict_yhat(x_pred, m1, m2, m3, m4, b)
In [25]:
plt.plot(x, y, 'o')
plt.plot(x_pred, y_pred, '-')
Out[25]:
[<matplotlib.lines.Line2D at 0x11b1f3518>]
But be careful of overfitting!
In [26]:
def predict_yhat(x, m):
    yhat = 0
    for i in range(len(m)):
        yhat += m[i]*(x**i)
    return yhat
In [27]:
def calculate_nll(params, x, y):
    nll=0.
    # Loop over each data point to accumulate the total negative log-likelihood.
    for i in range(len(x)):
        yhat = predict_yhat(x[i], params)
        resid = yhat - y[i]
        nll -= stats.norm.logpdf(resid, loc=0, scale=1)
    return nll
In [28]:
guess=np.array([1.]*6)
In [29]:
minimization = optimize.minimize(calculate_nll,guess,(x,y))
m = minimization.x
x_pred = np.arange(0, 10., 0.1)
y_pred = predict_yhat(x_pred, m)
In [30]:
plt.plot(x, y, 'o')
plt.plot(x_pred, y_pred, '-')
Out[30]:
[<matplotlib.lines.Line2D at 0x11b3e53c8>]
We can also do multiple linear regression, where the predictor variable has more than one dimension:
In [31]:
x = [[1.52, 7.64, 10.87], [3.91, 6.64, 5.04], [2.23, 5.22, 3.23], 
     [4.99, 6.75, 5.91], [5.81, 4.07, 2.27], [6.79, 4.17, 1.47], 
     [6.94, 4.05, 0.44], [8.81, 3.97, -0.21], [9.02, 2.21, -1.04], [9.78, 2.07, -2.33]]
y = [11.04, 8.86, 7.52, 8.43, 6.01, 5.71, 3.95, 3.66, 1.21, 0.47]
In [32]:
def predict_yhat(x, m0, m1, m2, b):
    yhat = x[0]*m0 + x[1]*m1 + x[2]*m2 + b
    return yhat
In [33]:
def calculate_nll(params, x, y):
    nll=0.
    # Loop over each data point to accumulate the total negative log-likelihood.
    for i in range(len(x)):
        yhat = predict_yhat(x[i], params[0], params[1], params[2], params[3])
        resid = yhat - y[i]
        nll -= stats.norm.logpdf(resid, loc=0, scale=1)
    return nll
In [34]:
guess=np.array([1.,1.,1.,1.])
In [35]:
minimization = optimize.minimize(calculate_nll,guess,(x,y))
m0 = minimization.x[0]
m1 = minimization.x[1]
m2 = minimization.x[2]
b = minimization.x[3]
y_pred = [predict_yhat(xi, m0, m1, m2, b) for xi in x]
In [36]:
plt.plot(y, y_pred, 'o')
Out[36]:
[<matplotlib.lines.Line2D at 0x11b4c9908>]
We can also make the predictions categorical! This is called logistic regression, and it is one way linear regression can be extended to generalized linear models
In [37]:
x = [1.52, 3.91, 2.23, 4.99, 5.81, 6.79, 6.94, 8.81, 9.02, 9.78]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
In [38]:
plt.plot(x, y, 'o')
Out[38]:
[<matplotlib.lines.Line2D at 0x11b6d2128>]
In [39]:
def predict_yhat(x, m, b):
    yhat = 1/(1 + np.exp(-(m*x + b)))
    return yhat
In [40]:
def calculate_crossent(params, x, y):
    cross_entropy_logit = 0
    for i in range(len(x)):
        yhat = predict_yhat(x[i], params[0], params[1])
        # Clip predictions away from exactly 0 or 1 to avoid log(0) in the cross-entropy.
        yhat = np.clip(yhat, 1e-10, 1 - 1e-10)
        cross_entropy_logit += (-1 * y[i]) * np.log(yhat) - (1 - y[i]) * np.log(1 - yhat)
    return cross_entropy_logit
In [41]:
guess = np.array([1., 1.])
minimization = optimize.minimize(calculate_crossent,guess,(x,y))
m = minimization.x[0]
b = minimization.x[1]
x_pred = np.arange(0, 10., 0.1)
y_pred = predict_yhat(x_pred, m, b)
In [42]:
plt.plot(x, y, 'o')
plt.plot(x_pred, y_pred, '-')
Out[42]:
[<matplotlib.lines.Line2D at 0x11b63c240>]