Section notes: Linear regression

Notes by Allison Kao (10/22/2019), adapted from Kate Shulgina and June Shin (10/26/2018)

You can download this page in Jupyter notebook .ipynb format.

In this section, we will first cover some explanation of linear regression and fitting, and then go through some code examples in Python.

Linear regression as a predictive model

In the lecture notes, Sean covers linear regression from the perspective of a generative probabilistic model. Another framework for linear regression is as a predictive model. We'll use this predictive model to understand the advantage of using a generative probabilistic model, which may come in handy for your problem set.

Here, think of linear regression as the process of fitting a model for the purpose of predicting the outcomes of new data points. Say we want to make a model that predicts what grade a student will get on an exam. We have various information on past students in the class, like how many total hours they studied, how many hours they slept the night before, their age, etc., alongside known exam scores.

As a first pass, we can try to see if there is a predictive relationship between the number of hours studied and the exam score. This is what the relationship looks like:

Our goal is to build a predictive model such that if we get a new student and know the number of hours they studied, we could predict what grade they would get. One way might be to fit a line. From high school algebra, the equation of a line is $y=mx+b$. In this case, the equation is $y=\beta_1 x + \beta_0$, where $y=\text{exam grade}$, $x=\text{hours studied}$, $\beta_1$ is the slope, and $\beta_0$ is the y-intercept.

We could imagine various different fits to the data-- some good and some not so good. We get these different fits by varying the parameter $\beta_1$ ($\beta_0$ is held at 0 to keep this example simple).

How can we compare the different fits and pick the best one? One common method is to compare the residual sum of squares (RSS). A residual ($e_i$) can be viewed as the prediction error-- for a given student, how far off was the fitted line's prediction from the actual grade? Good fits will have low prediction error and a low RSS.

$$\begin{aligned} e_i &= y_i - \hat y_i \\ &= y_i - (\beta_1 x_i + \beta_0) \end{aligned}$$

$$ RSS = \sum_{i=1}^n (y_i - \beta_1 x_i - \beta_0)^2 $$

Let's try to visualize how the RSS changes for the different values of $\beta_1$ plotted above. You could imagine scanning across all possible values of $\beta_1$ and getting a sense of how the RSS changes with the parameter.

The value of $\beta_1$ that gives us the best fit is the one with the lowest RSS, so we are trying to minimize the RSS, also known as the least squares error. The function being minimized has a lot of names in statistics and machine learning-- you might hear it called the objective function, the error function, the loss function, etc. They are all the same thing with different jargon-y names. We can use an optimizer like the scipy.optimize.minimize function to find the value of $\beta_1$ that gives the lowest value for the RSS, and therefore the best predictive line.
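As a quick sketch of this idea, here is what minimizing the RSS over $\beta_1$ with scipy.optimize.minimize might look like. The data below are made-up toy numbers for illustration, not the course dataset:

```python
import numpy as np
import scipy.optimize as optimize

# Hypothetical toy data: hours studied vs. exam grade
hours = np.array([1., 2., 4., 5., 7., 9.])
grades = np.array([52., 57., 63., 70., 80., 91.])

def rss(params, x, y):
    # Residual sum of squares; beta_0 is held at 0, as in the example above
    beta1 = params[0]
    residuals = y - beta1 * x
    return np.sum(residuals ** 2)

result = optimize.minimize(rss, x0=np.array([1.0]), args=(hours, grades))
best_beta1 = result.x[0]
```

For a line through the origin, the optimizer's answer should agree with the closed-form least squares slope $\sum_i x_i y_i / \sum_i x_i^2$.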

Multiple regression

Suppose we find the best fit line, and realize that using the number of hours studied just isn't that good at predicting the exam score. We have other data at our disposal, like the number of hours slept the night before the exam, that we could use in our model to try to improve our predictions.

Let's expand to two predictor variables: $x_1 = \text{hours studied}$ and $x_2 = \text{hours slept}$.

And we are still predicting one variable: $y=\text{exam grade}$.

Now we can imagine fitting a plane that given some values for $x_1, x_2$ can predict the student's exam score.

(Took the plot from the internet)

A plane fit to this data would have the equation $\hat y = \beta_2 x_2 + \beta_1 x_1 + \beta_0$. In the machine learning literature, they like to write things in compact matrix notation, so don't get scared if you see something like $\hat y = \vec \beta \cdot \vec x$. Here, $\vec \beta = [ \beta_0, \beta_1, \beta_2 ]$ and $\vec x = [ 1, x_1, x_2 ]$. There is an extra 1 in $\vec x$ so that if you take the dot product of the two vectors, you get the expanded-out equation that we started with.
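We can check in numpy that the dot product recovers the expanded-out equation. All the parameter and predictor values here are made up for illustration:

```python
import numpy as np

# Hypothetical parameter and predictor values
beta = np.array([10.0, 2.0, 3.0])  # [beta_0, beta_1, beta_2]
x = np.array([1.0, 4.0, 5.0])      # [1, x_1, x_2]; the leading 1 multiplies beta_0

yhat_compact = beta @ x                   # compact matrix notation
yhat_expanded = 3.0*5.0 + 2.0*4.0 + 10.0  # beta_2*x_2 + beta_1*x_1 + beta_0
```

Both forms give the same prediction, which is the whole point of padding $\vec x$ with the extra 1.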

Just like in the case with one predictor $x$, we can compute the prediction error (the residuals) by subtracting the plane's predicted $\hat y$ from the observed $y$.

$$ RSS = \sum_{i=1}^n (y_i - \beta_2 x_{i2} - \beta_1 x_{i1} - \beta_0)^2 $$

If we plugged the RSS into an optimizer, it would scan over the 3D range of possible $\vec \beta$ or $[\beta_0, \beta_1, \beta_2]$ values and find the ones that minimize the error.

In this predictive model, you can imagine expanding this out to include any number of predictor variables. You can have predictor variables that are columns of data that you have, or even functions of data columns like $x_i = \sqrt{x_1}$ (in machine learning speak, these are called basis functions).
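As a sketch, adding a basis function is just adding another column to the design matrix before fitting. The data, the true coefficients, and the choice of $\sqrt{x_1}$ below are all made up for illustration:

```python
import numpy as np

# Simulated data: y depends on x1 both linearly and through sqrt(x1)
rng = np.random.default_rng(0)
x1 = rng.uniform(1, 10, size=50)
x2 = rng.uniform(1, 10, size=50)
y = 2.0 * x1 + 5.0 * np.sqrt(x1) + 1.0 + rng.normal(0, 0.1, size=50)

# Design matrix: intercept column, raw data columns, and a basis function sqrt(x1)
X = np.column_stack([np.ones_like(x1), x1, x2, np.sqrt(x1)])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

The fit is still linear in the parameters even though $\sqrt{x_1}$ is a nonlinear function of the data, so least squares works unchanged.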

However, one thing to watch out for when adding more and more predictor variables is overfitting. This is when your model fits the training data so well and minimizes the errors so nicely that it actually can't generalize, performing poorly on new data that it hasn't previously seen. If the model can't generalize, then it's a bad predictor.

(Took the plot from this website)

There's a large body of thought on how to address this issue, and many approaches rely on evaluating how well the model performs on some held-out test data.
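As a sketch of the held-out-data idea (with simulated data and arbitrarily chosen polynomial degrees), we can fit increasingly flexible models on a training set and compute the RSS on both the training points and the held-out points:

```python
import numpy as np

# Simulated noisy linear data (all numbers hypothetical)
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=60)
y = 3.0 * x + 5.0 + rng.normal(0, 2.0, size=60)

# Hold out the last 20 points as a test set
x_train, y_train = x[:40], y[:40]
x_test, y_test = x[40:], y[40:]

def poly_rss(degree, x_eval, y_eval):
    # Fit a polynomial of the given degree on the training data,
    # then return the RSS on whichever points we evaluate on
    coeffs = np.polyfit(x_train, y_train, degree)
    pred = np.polyval(coeffs, x_eval)
    return np.sum((y_eval - pred) ** 2)

train_rss = {d: poly_rss(d, x_train, y_train) for d in (1, 3, 7)}
test_rss = {d: poly_rss(d, x_test, y_test) for d in (1, 3, 7)}
```

The training RSS can only shrink as the degree grows (a degree-7 fit can always reproduce a degree-1 fit), so it can't tell us about overfitting; the held-out RSS is what reveals whether the extra flexibility captured real signal or just noise.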

Probabilistic framework for linear regression

The greatest strength of the probabilistic framework for linear regression is that in defining the probability model used to generate the data, you are explicitly stating many of the assumptions that are unspoken in least squares fitting. Let's review it briefly.

In this framework of thinking about linear regression, we propose that there is some absolute true linear relationship between $x$ and $y$, but due to real-world noise, the observations don't fall exactly on the line but rather scatter around it.

$$y_i = \beta_1 x_i + \beta_0 + e_i$$

The "noise" in the observations is the error between the true line and the observations, which are the residuals. We model the noise as a random variable. Our goal is to infer the parameters $\vec \beta$ of the linear regression model that make the noisiness as unlikely as possible.

A common modelling decision is to model the residuals as being distributed $e_i \sim \mathcal{N}(0, \sigma^2)$, a normal distribution with the same variance across all points. Most of the time, this assumption isn't even considered or questioned! You should always ask yourself if these are reasonable assumptions for your data. For example, in the lecture notes, Sean shows that the maximum likelihood estimate for the parameters with this assumption is equivalent to least squares fitting-- so normally-distributed noise is implicitly assumed in least squares!

Since $y_i$ is a function of a random variable ($e_i$), it is also a random variable with a normal distribution centered around the true line.

$$ y_i | x_i, \vec \beta, \sigma^2 \sim \mathcal{N}(\beta_1 x_i + \beta_0, \sigma^2) $$

The likelihood of some particular values of the parameters $\vec \beta$ and constant $\sigma$ is the probability of all of the observed data under that model. We use the Normal PDF to get the probability of the $y_i$'s:

$$\begin{aligned} \text{likelihood}(\vec \beta, \sigma^2) &= P(y_1, ..., y_n | \vec \beta, \sigma^2) \\ &= \prod_{i=1}^n P(y_i | x_i, \vec \beta, \sigma^2) \\ &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y_i - \beta_1x_i -\beta_0)^2}{2\sigma^2}} \end{aligned}$$

Now we can compare how well different lines fit the data by evaluating the likelihood of those parameters. To find the best values, we can feed this equation into an optimizer function.

Optimizers are conventionally written as minimizers instead of maximizers, so we need to feed them the negative log-likelihood instead. In this case, it would be

$$ \text{NLL} = - \sum_{i=1}^n \bigg( \log \big( \frac{1}{\sqrt{2\pi\sigma^2}} \big) - \frac{(y_i - \beta_1x_i -\beta_0)^2}{2\sigma^2} \bigg) $$
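We can sanity-check numerically the equivalence Sean shows in lecture: on simulated data (all numbers below are made up), minimizing this NLL with $\sigma$ held fixed lands on the same $\vec \beta$ as least squares fitting:

```python
import numpy as np
import scipy.optimize as optimize
import scipy.stats as stats

# Simulated data around a true line, with constant noise
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=30)
y = 2.5 * x + 4.0 + rng.normal(0, 1.5, size=30)

def nll(params, x, y, sigma=1.5):
    # Negative log-likelihood with the same sigma for every point
    beta1, beta0 = params
    return -np.sum(stats.norm.logpdf(y, loc=beta1 * x + beta0, scale=sigma))

mle_fit = optimize.minimize(nll, x0=np.array([0.0, 0.0]), args=(x, y))

# Least squares fit of the same line, for comparison
A = np.vstack([x, np.ones_like(x)]).T
ls_fit, *_ = np.linalg.lstsq(A, y, rcond=None)
```

Since $\sigma$ is constant, it only rescales and shifts the NLL without moving its minimum, so the minimizing $\vec \beta$ is exactly the least squares solution.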

One of the nice parts of this model is that you can naturally get a "confidence interval" that's interpretable as a probability, directly from the normal distribution of each $y_i$. You can imagine overlaying onto the linear regression line a band representing where you would expect 95% of your observed points to scatter around the line.
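As a sketch of that band (with hypothetical fitted parameters): about 95% of a normal distribution falls within $1.96\sigma$ of its mean, so the band is just the fitted line $\pm 1.96\sigma$:

```python
import numpy as np

# Hypothetical fitted parameters and noise level
beta1, beta0, sigma = 2.0, 10.0, 3.0
xs = np.linspace(0, 10, 50)
yhat = beta1 * xs + beta0

# We expect ~95% of observations to fall inside this band
lower = yhat - 1.96 * sigma
upper = yhat + 1.96 * sigma
# e.g. plt.fill_between(xs, lower, upper, alpha=0.2) to draw it
```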

Example demonstrating the flexibility of the probabilistic approach

Let's walk through one last example together that hopefully will help you think through the homework assignment.

Here is our dataset.

(Took the plot from this website)

What we have plotted here is the income of various people as a function of their age. What we want to do is fit a predictive line to this data-- maybe we are an advertising company and we want to make better targeted ads, for instance.

What is unusual about this dataset? The spread of the income increases with age (this might make sense-- not a lot of 25-year-olds have had the time to become CEOs). This effect goes by the jargon-y term heteroskedasticity.

(Took the plot from this website)

What does this trend mean with regards to least squares fitting? What are the assumptions implicit in least squares fitting? Does this data fit those assumptions?

What could we do about this issue? There are several options, including transforming the $y_i$s into a different space, or estimating the individual $\sigma^2_i$s separately.

In our case, let's suppose the variance is directly proportional to age, so we can define a relationship $\sigma_i^2 = a x_i \sigma^2$.

How would we write the distribution of the $y_i$s now?

$$ y_i | x_i, \vec \beta, \sigma \sim \mathcal{N}(\beta_1 x_i + \beta_0, a x_i \sigma^2) $$

Then the likelihood would be the same, except for the sigma term:

$$ \text{likelihood}(\vec \beta, \sigma) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi a x_i \sigma^2}} e^{-\frac{(y_i - \beta_1x_i -\beta_0)^2}{2 a x_i \sigma^2}} $$

See how easy it was to incorporate a new assumption of the model into the probabilistic framework for linear regression? Now we can easily find the best fit linear regression line by minimizing the negative log-likelihood with respect to the $\vec \beta$ and $\sigma$ using an optimizer. If we were trying to modify least squares regression, it would not be as clear how to correctly modify the RSS function to incorporate some variation in how the residuals are distributed.

Now let's code it up!

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import scipy.optimize as optimize
import matplotlib.pyplot as plt
# % matplotlib inline

We are going to do a basic linear regression fit:

$$ y = mx + b $$
In [2]:
# Shhh ... don't look! 
# Behind the scenes look at data simulation (like the Arcs problem set) - don't quote me on these numbers!
# I set a lower bound to income because you can't have a negative income (I hope)

# ages = np.linspace(15,50,200)

# b = 30000
# m = 1500
# perfect_incomes = ages * m + b

# a = 8
# sigma2 = 80
# true_sigmas = a * ages * sigma2

# not_neg = False 
# while not_neg == False:
#     e = np.random.normal(loc=0, scale=true_sigmas)
#     noisy_incomes = perfect_incomes + e
#     if np.all(noisy_incomes >= 0):
#         not_neg = True

# data = np.column_stack((ages, noisy_incomes))
# np.savetxt('w07-section-data.dat', data, header="First column is 'ages', second column is 'noisy_incomes'")

The parameters used to generate this data are:

  • $b = 30000$
  • $m = 1500$
  • $\sigma_i = a x_i \sigma^2$ where $a = 8$ and $\sigma^2 = 80$

(Note: I simulated a dataset like the Arcs problem set - don't quote me on these numbers!)

In [3]:
# Load dummy data
data = np.loadtxt('w07-section-data.dat')
ages, noisy_incomes = data[:,0], data[:, 1]
a = 8
sigma2 = 80
true_sigmas = a * ages * sigma2
In [4]:
# Plot our data
plt.plot(ages, noisy_incomes, 'o')
[<matplotlib.lines.Line2D at 0x1136e37f0>]
In [5]:
def predict_yhat(x, m, b):
    yhat = m*x + b
    return yhat

We're first going to minimize the L2 norm to find the line of best fit: this quantity is also known as the sum of squared residuals.

$$ \sum_i (\hat{y}_i - y_i)^2 $$

We can calculate these residuals directly and minimize this sum to find the optimal linear parameters m, b. This is what np.linalg.lstsq does!

In [6]:
A = np.vstack([ages, np.ones(len(ages))]).T
m_predict1, b_predict1 = np.linalg.lstsq(A, noisy_incomes, rcond=0)[0]
y_predict1 = predict_yhat(ages, m_predict1, b_predict1)
print("m = {}".format(m_predict1))
print("b = {}".format(b_predict1))
m = 1320.0390898350058
b = 36651.65259696837

We can calculate the likelihood of the data given the predicted parameters by evaluating the probability of each residual under a normal distribution. Let's write a function that returns the negative log likelihood:

In [7]:
def calc_nll(params, sigmas, xs, ys):
    ll = 0.
    m = params[0]
    b = params[1]
    for i in range(len(xs)):
        y_pred = m * xs[i] + b
        residual = ys[i] - y_pred
        ll += stats.norm.logpdf(residual, loc=0, scale=sigmas[i])
    return -ll

Now we can find optimal parameters via a maximum likelihood estimate, by minimizing the negative log likelihood.

In [8]:
guess = np.array([0., 0.]) # Initialize a guess

minimization = optimize.minimize(calc_nll, guess, (true_sigmas, ages, noisy_incomes))
m_predict2 = minimization.x[0]
b_predict2 = minimization.x[1]
y_predict2 = predict_yhat(ages, m_predict2, b_predict2)
print("m = {}".format(m_predict2))
print("b = {}".format(b_predict2))
m = 1468.9032839826427
b = 32050.486737611638
In [9]:
plt.plot(ages, noisy_incomes, 'o', label='data')
plt.plot(ages, y_predict1, '-', label='least squares fit')
plt.plot(ages, y_predict2, '-', label='MLE fit')
plt.legend()
<matplotlib.legend.Legend at 0x1138198d0>

We can compare these two methods by calculating the total log likelihood of the data under each fit.

In [10]:
tll1 = -1 * calc_nll([m_predict1, b_predict1], true_sigmas, ages, noisy_incomes)
tll2 = -1 * calc_nll([m_predict2, b_predict2], true_sigmas, ages, noisy_incomes)
In [11]:
print("The total log likelihood of the OLS method is: {}".format(tll1))
print("The total log likelihood using MLE is: {}".format(tll2))
The total log likelihood of the OLS method is: -2270.1914854431316
The total log likelihood using MLE is: -2269.235652024821

How much better is MLE than the OLS method?

We can calculate the odds:

$$\frac{P( \text{data} \mid \text{MLE Method} )}{P(\text{data} \mid \text{OLS Method})}$$
In [12]:
print("The MLE method is {} times more likely!".format(np.exp(tll2 - tll1)))
The MLE method is 2.6008372653177934 times more likely!
A question to ponder ... how could you improve the least squares method?