Kate Shulgina and June Shin (10/26/2018)

You can also download this page in jupyter notebook format (sans figures, but the code will work).

In the lecture notes, Sean covers linear regression from the perspective of a generative probabilistic model. While this is a good framework for figuring out how to write out the assumptions of the model explicitly, there are also other perspectives.

One perspective that helped me personally understand linear regression is to view it as a predictive model. Think of linear regression as the process of building a model that you will later use to make predictions.

Let's pretend that we want to make a model that can predict what grade a student will get on an exam. We have various information on past students in the class, like how many hours they studied in total, how many hours they slept the night before, their age, etc., alongside their known exam scores.

As a first step, it might make sense to see if there is a predictive relationship between the number of hours studied and the exam score. This is what the relationship looks like:

Our goal is to build a predictive model such that if we get a new student, and know the number of hours they studied, we could predict what grade they would get. One way might be to fit a line. From high school algebra, the equation of a line is $y=mx+b$. In this case, the equation is $y=\beta_1 x + \beta_0$, where $y=\text{exam grade}$, $x=\text{hours studied}$, $\beta_1$ is the slope and $\beta_0$ is the y-intercept.

We could imagine various different fits to the data-- some good and some not so good. We get these different fits by varying the parameter $\beta_1$ ($\beta_0$ is held at 0 to keep this example simple).

How can we compare the different fits and pick the best one? One common method is to compare the residual sum of squares (RSS). A **residual** ($e_i$) can be viewed as the prediction error-- for a given student, how far off was the prediction from the fitted line? Good fits will have low prediction error and a low RSS.
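
To make this concrete, here's a minimal sketch (with made-up toy numbers, not the data in this page's figures) of computing the RSS for two candidate slopes, with $\beta_0$ held at 0 as above:

```python
import numpy as np

# Made-up toy data: hours studied and exam grades.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
grades = np.array([52.0, 61.0, 70.0, 78.0, 91.0])

def rss(beta1, x, y):
    """RSS for the line y = beta1 * x (beta0 held at 0 for simplicity)."""
    residuals = y - beta1 * x
    return np.sum(residuals ** 2)

# A slope near the data's trend gives a far lower RSS than a poor one.
good = rss(18.0, hours, grades)
bad = rss(5.0, hours, grades)
```

Scanning `rss` over a grid of candidate slopes would trace out exactly the kind of error curve described below.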

Let's try to visualize how the RSS changes for the different values of $\beta_1$ plotted above. You could imagine scanning across all possible values of $\beta_1$ and getting a sense of how the RSS changes with the parameter.

The value of $\beta_1$ that gives us the best fit is the one with the lowest RSS, so we are trying to minimize the RSS, also known as the least squares error. The function being minimized has a lot of names in statistics and machine learning-- you might hear it called the objective function, the error function, the loss function, etc. They are all the same thing with different jargon-y names. We can use an optimizer like the `scipy.optimize.minimize` function to find the value of $\beta_1$ that gives the lowest value for the RSS, and therefore the best predictive line.

Suppose we find the best fit line, and realize that using the number of hours studied just isn't that good at predicting the exam score. We have other data at our disposal, like the number of hours slept the night before the exam, that we could use in our model to improve our predictions.

Let's expand to two predictor variables: $x_1 = \text{hours studied}$ and $x_2 = \text{hours slept}$

And we are still predicting one variable: $y=\text{exam grade}$

Now we can imagine fitting a plane that, given values for $x_1, x_2$, can predict the student's exam score.

(Took the plot from the internet)

A plane fit to this data would have the equation $\hat y = \beta_2 x_2 + \beta_1 x_1 + \beta_0$. In the machine learning literature, they like to write things in compact matrix notation, so don't get scared if you see something like $\hat y = \vec \beta \cdot \vec x$. Here, $\vec \beta = [ \beta_0, \beta_1, \beta_2 ]$ and $\vec x = [ 1, x_1, x_2 ]$. There is an extra 1 in $\vec x$ so that when you take the dot product of the two vectors, you get the expanded-out equation that we started with.
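
A quick sketch of why the extra 1 works (with made-up numbers): the dot product of $\vec \beta$ and the padded $\vec x$ reproduces the expanded equation, with $\beta_0$ picking up the 1.

```python
import numpy as np

# Made-up values: beta = [beta0, beta1, beta2]; x_raw = [x1, x2].
beta = np.array([2.0, 0.5, 1.5])
x_raw = np.array([10.0, 4.0])  # hours studied, hours slept

# Prepend the extra 1 so beta0 survives the dot product.
x = np.concatenate(([1.0], x_raw))

y_hat = np.dot(beta, x)  # beta0*1 + beta1*x1 + beta2*x2 = 2 + 5 + 6 = 13
```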

Just like in the case with one predictor $x$, we can compute the prediction error (the residuals) by subtracting the observed $y$ from the one predicted by the plane.

$$ RSS = \sum_{i=1}^n (y_i - \beta_2 x_{i2} - \beta_1 x_{i1} - \beta_0)^2 $$

If we plugged the RSS into an optimizer, it would scan over the 3D range of possible $\beta_0, \beta_1, \beta_2$ values and find the ones that minimize the error.

You can imagine expanding this out to include any number of predictor variables. Predictor variables can be columns of data that you have, or even functions of data columns, like $x_3 = \sqrt{x_1}$ (in machine learning speak, these are called basis functions).

However, one thing to watch out for when adding more and more predictor variables is **overfitting**. This is when your model fits the training data so well, minimizes the errors so nicely, that it actually can't generalize and performs poorly on data it hasn't previously seen. If it can't generalize, then it's a bad predictor.

(Took the plot from this website)

There's a large body of thought on how to address this issue; many approaches rely on evaluating how good the model is on some held-out test data.
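
As a sketch of that idea (with simulated data, not anything from the notes): fit models of different complexity on a training subset, then compare their prediction error on held-out points. An overly flexible model tends to chase the training noise and do worse on the held-out data.

```python
import numpy as np

# Simulated data: a noisy line, with the last 10 points held out.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2.0 * x + 1.0 + rng.normal(0, 2.0, size=x.size)

train = slice(0, 20)  # points used for fitting
test = slice(20, 30)  # held-out points

def held_out_rss(degree):
    # Fit a polynomial of the given degree on the training points,
    # then measure the prediction error on the held-out points.
    coeffs = np.polyfit(x[train], y[train], degree)
    pred = np.polyval(coeffs, x[test])
    return np.sum((y[test] - pred) ** 2)

# A straight line generalizes; a degree-15 polynomial fits the training
# noise and typically blows up on the held-out region.
```

Comparing `held_out_rss(1)` with `held_out_rss(15)` shows the simpler model winning decisively on the held-out data.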

The greatest strength of the probabilistic framework for linear regression that Sean brought up in class is that in defining the probability model, you are explicitly stating many of the assumptions that are left unspoken in least squares fitting. Let's review it briefly.

In this way of thinking about linear regression, we are proposing that there is some absolute true linear relationship between $x$ and $y$. The problem is that our observations come from the real world, which is noisy, so the observed points don't fall exactly on the line but scatter around it with some noise.

$$y_i = \beta_1 x_i + \beta_0 + e_i$$

The "noise" in the observations is the error between the true line and the observations-- the residuals. We model the noise as a random variable, usually Normally distributed. Our goal is to infer the parameters $\vec \beta$ of the original line that make the observed points, scattered by noise, as likely as possible.

A common modelling decision is to model the residuals as being distributed $e_i \sim \mathcal{N}(0, \sigma^2)$, a Normal distribution with the same variance across all points. Most of the time, this assumption isn't even considered or questioned! You should always ask yourself whether these are reasonable assumptions for your data. You'll see in the homework that this is a critical mistake that Moriarty makes. In the lecture notes, Sean shows that the maximum likelihood estimate for the parameters under this assumption is equivalent to least squares fitting-- so Normally-distributed noise is implicitly assumed in least squares!
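
You can check that equivalence numerically. Here's a sketch (with made-up data) showing that minimizing the RSS and minimizing the Normal negative log-likelihood recover essentially the same $\beta_1, \beta_0$:

```python
import numpy as np
import scipy.stats as stats
import scipy.optimize as optimize

# Made-up toy data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.9, 5.1, 7.2, 8.8, 11.1])

def rss(params, x, y):
    b1, b0 = params
    return np.sum((y - b1 * x - b0) ** 2)

def nll(params, x, y):
    # sigma fixed at 1: its value shifts the NLL but not the minimizing betas.
    b1, b0 = params
    return -np.sum(stats.norm.logpdf(y - b1 * x - b0, loc=0, scale=1))

fit_rss = optimize.minimize(rss, np.array([1.0, 1.0]), args=(x, y))
fit_nll = optimize.minimize(nll, np.array([1.0, 1.0]), args=(x, y))
# fit_rss.x and fit_nll.x agree: the same best-fit beta_1 and beta_0.
```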

Since $y_i$ is a function of a random variable ($e_i$), it is also a random variable, with a Normal distribution centered on the true line.

$$ y_i | x_i, \vec \beta, \sigma^2 \sim \mathcal{N}(\beta_1 x_i + \beta_0, \sigma^2) $$

The likelihood of some particular values of the parameters $\vec \beta$ and $\sigma$ is the probability of all of the observed data under that model. We use the Normal PDF to get the probability of the $y_i$s.

$$\begin{aligned} \text{likelihood}(\vec \beta, \sigma^2) &= P(y_1, ..., y_n | \vec \beta, \sigma^2) \\ &= \prod_{i=1}^n P(y_i | x_i, \vec \beta, \sigma^2) \\ &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y_i - \beta_1x_i -\beta_0)^2}{2\sigma^2}} \end{aligned}$$

Now we can compare how well different lines fit the data by evaluating the likelihood of their parameters. To find the best values, we can feed this equation into an optimizer function.

Optimizers have been standardized to be minimizers instead of maximizers, so we need to feed it the negative log-likelihood instead. In this case, it would be

$$ \text{NLL} = - \sum_{i=1}^n \bigg( \text{log} \big( \frac{1}{\sqrt{2\pi\sigma^2}} \big) - \frac{(y_i - \beta_1x_i -\beta_0)^2}{2\sigma^2} \bigg) $$

I think one of the nice parts of this model is that you can naturally get a "confidence interval" that's interpretable as a probability, directly from the Normal distribution of each $y_i$. You can imagine overlaying onto the linear regression line a band representing where you would expect 95% of your observed points to scatter around the line.
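
A sketch of such a band (with made-up fitted values): since about 95% of a Normal's mass lies within 1.96 standard deviations of its mean, the band is just the fitted line plus or minus $1.96\sigma$.

```python
import numpy as np

# Made-up fitted values for the slope, intercept, and noise standard deviation.
beta1, beta0, sigma = 8.0, 45.0, 5.0
x = np.linspace(0, 6, 50)
y_hat = beta1 * x + beta0

# The 95% band around the fitted line.
lower = y_hat - 1.96 * sigma
upper = y_hat + 1.96 * sigma
# e.g. plt.fill_between(x, lower, upper, alpha=0.3) would draw the band
```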

Let's walk through one last example together that hopefully will help you think through the homework assignment.

Here is our dataset.

(Took the plot from this website)

What we have plotted here is the income of various people as a function of their age. What we want to do is fit a predictive line to this data-- maybe we are an advertising company and we want to make better targeted ads, for instance.

What is unusual about this dataset? The spread of the income increases with age (this might make sense-- not a lot of 25 year olds have had the time to become CEOs). This effect goes by the jargon-y term *heteroskedasticity*.

What does this trend mean with regards to least squares fitting? What are the assumptions implicit in least squares fitting? Does this data fit those assumptions?

What could we do about this issue? There are several options, including transforming the $y_i$s into a different space, or estimating the individual $\sigma^2_i$s separately.

In our case, let's suppose the variance is directly proportional to age, so we can define a relationship $\sigma_i^2 = a x_i \sigma^2$.

How would we write the distribution of the $y_i$s now?

$$ y_i | x_i, \vec \beta, \sigma^2 \sim \mathcal{N}(\beta_1 x_i + \beta_0, a x_i \sigma^2) $$

Then the likelihood would be the same, except for the sigma term.

$$ \text{likelihood}(\vec \beta, \sigma) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi a x_i \sigma^2}} e^{-\frac{(y_i - \beta_1x_i -\beta_0)^2}{2 a x_i \sigma^2}} $$

See how easy it was to incorporate a new assumption of the model into the probabilistic framework for linear regression? Now we can easily find the best fit linear regression line by minimizing the negative log-likelihood with respect to $\vec \beta$ and $\sigma$ using an optimizer. If we were trying to modify least squares regression, it would not be as clear how to correctly modify the RSS function to incorporate some variation in how the residuals are distributed.
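
Here's a sketch of what that heteroskedastic fit could look like in code (with simulated data; note that $a$ and $\sigma^2$ only enter the likelihood as the product $a\sigma^2$, so the sketch fits them as a single scale parameter):

```python
import numpy as np
import scipy.stats as stats
import scipy.optimize as optimize

# Simulated heteroskedastic data: variance grows linearly with x.
rng = np.random.default_rng(1)
x = np.linspace(1, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(0, np.sqrt(0.5 * x))

def nll_hetero(params, x, y):
    # var_i = s2 * x_i, where s2 plays the role of a*sigma^2 (only the
    # product is identifiable, so we fit it as one parameter).
    b1, b0, log_s2 = params
    s2 = np.exp(log_s2)  # parameterized on the log scale to keep s2 positive
    resid = y - (b1 * x + b0)
    return -np.sum(stats.norm.logpdf(resid, loc=0, scale=np.sqrt(s2 * x)))

fit = optimize.minimize(nll_hetero, np.array([1.0, 1.0, 0.0]), args=(x, y))
b1, b0, log_s2 = fit.x  # b1, b0 should land near the true 3.0 and 2.0
```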

In [1]:

```
import numpy as np
import pandas as pd
import scipy.stats as stats
import scipy.optimize as optimize
import matplotlib.pyplot as plt
%matplotlib inline
```

In [2]:

```
x = [1.52, 3.91, 2.23, 4.99, 5.81, 6.79, 6.94, 8.81, 9.02, 9.78]
y = [11.04, 8.86, 7.52, 8.43, 6.01, 5.71, 3.95, 3.66, 1.21, 0.47]
```

In [3]:

```
plt.plot(x, y, 'o')
```

Out[3]:

In [4]:

```
def predict_yhat(x, m, b):
    yhat = m*x + b
    return yhat
```

In [5]:

```
def calculate_L2norm(params, x, y):
    L2 = 0
    for i in range(len(x)):
        yhat = predict_yhat(x[i], params[0], params[1])
        residual = yhat - y[i]
        L2 += residual ** 2
    return L2
```

In [6]:

```
guess=np.array([1.,1.])
```

In [7]:

```
minimization = optimize.minimize(calculate_L2norm,guess,(x,y))
m = minimization.x[0]
b = minimization.x[1]
x_pred = np.arange(0, 10., 0.1)
y_pred = predict_yhat(x_pred, m, b)
```

In [8]:

```
plt.plot(x, y, 'o')
plt.plot(x_pred, y_pred, '-')
```

Out[8]:

In [9]:

```
def calculate_nll(params, x, y):
    nll = 0.
    # Loop through each data point to get a total negative log likelihood.
    for i in range(len(x)):
        yhat = predict_yhat(x[i], params[0], params[1])
        resid = yhat - y[i]
        nll -= stats.norm.logpdf(resid, loc=0, scale=1)
    return nll
```

In [10]:

```
minimization = optimize.minimize(calculate_nll,guess,(x,y))
m = minimization.x[0]
b = minimization.x[1]
x_pred = np.arange(0, 10., 0.1)
y_pred = predict_yhat(x_pred, m, b)
```

In [11]:

```
plt.plot(x, y, 'o')
plt.plot(x_pred, y_pred, '-')
```

Out[11]:

In [12]:

```
def calculate_nll_skewnormal(params, x, y):
    nll = 0.
    # Loop through each data point to get a total negative log likelihood.
    for i in range(len(x)):
        yhat = predict_yhat(x[i], params[0], params[1])
        resid = yhat - y[i]
        nll -= stats.skewnorm.logpdf(resid, a=2, loc=0, scale=1)
    return nll
```

In [13]:

```
minimization = optimize.minimize(calculate_nll_skewnormal,guess,(x,y))
m = minimization.x[0]
b = minimization.x[1]
x_pred = np.arange(0, 10., 0.1)
y_pred = predict_yhat(x_pred, m, b)
```

In [14]:

```
plt.plot(x, y, 'o')
plt.plot(x_pred, y_pred, '-')
```

Out[14]:

In [15]:

```
def calculate_nll_lognormal(params, x, y):
    nll = 0.
    # Loop through each data point to get a total negative log likelihood.
    for i in range(len(x)):
        yhat = predict_yhat(x[i], params[0], params[1])
        resid = np.log(yhat) - np.log(y[i])
        nll -= stats.norm.logpdf(resid, loc=0, scale=1)
    return nll
```

In [16]:

```
minimization = optimize.minimize(calculate_nll_lognormal,guess,(x,y))
m = minimization.x[0]
b = minimization.x[1]
x_pred = np.arange(0, 10., 0.1)
y_pred = predict_yhat(x_pred, m, b)
```

In [17]:

```
plt.plot(x, y, 'o')
plt.plot(x_pred, y_pred, '-')
```

Out[17]:

In [18]:

```
def calculate_nll(params, x, y):
    nll = 0.
    # Loop through each data point to get a total negative log likelihood.
    for i in range(len(x)):
        yhat = predict_yhat(x[i], params[0], params[1])
        resid = yhat - y[i]
        nll -= stats.norm.logpdf(resid, loc=0, scale=1)
    # Add the priors on the slope and intercept once, not once per data point.
    nll -= stats.norm.logpdf(params[0], loc=-1, scale=1)
    nll -= stats.norm.logpdf(params[1], loc=10, scale=5)
    return nll
```

In [19]:

```
minimization = optimize.minimize(calculate_nll,guess,(x,y))
m = minimization.x[0]
b = minimization.x[1]
x_pred = np.arange(0, 10., 0.1)
y_pred = predict_yhat(x_pred, m, b)
```

In [20]:

```
plt.plot(x, y, 'o')
plt.plot(x_pred, y_pred, '-')
```

Out[20]:

In [21]:

```
def predict_yhat(x, m1, m2, m3, m4, b):
    yhat = m1*x + m2*(x**2) + m3*(x**3) + m4*(x**4) + b
    return yhat
```

In [22]:

```
def calculate_nll(params, x, y):
    nll = 0.
    # Loop through each data point to get a total negative log likelihood.
    for i in range(len(x)):
        yhat = predict_yhat(x[i], params[0], params[1], params[2], params[3], params[4])
        resid = yhat - y[i]
        nll -= stats.norm.logpdf(resid, loc=0, scale=1)
    return nll
```

In [23]:

```
guess=np.array([1.,1.,1.,1.,1.])
```

In [24]:

```
minimization = optimize.minimize(calculate_nll,guess,(x,y))
m1 = minimization.x[0]
m2 = minimization.x[1]
m3 = minimization.x[2]
m4 = minimization.x[3]
b = minimization.x[4]
x_pred = np.arange(0, 10., 0.1)
y_pred = predict_yhat(x_pred, m1, m2, m3, m4, b)
```

In [25]:

```
plt.plot(x, y, 'o')
plt.plot(x_pred, y_pred, '-')
```

Out[25]:

In [26]:

```
def predict_yhat(x, m):
    yhat = 0
    for i in range(len(m)):
        yhat += m[i]*(x**i)
    return yhat
```

In [27]:

```
def calculate_nll(params, x, y):
    nll = 0.
    # Loop through each data point to get a total negative log likelihood.
    for i in range(len(x)):
        yhat = predict_yhat(x[i], params)
        resid = yhat - y[i]
        nll -= stats.norm.logpdf(resid, loc=0, scale=1)
    return nll
```

In [28]:

```
guess=np.array([1.]*6)
```

In [29]:

```
minimization = optimize.minimize(calculate_nll,guess,(x,y))
m = minimization.x
x_pred = np.arange(0, 10., 0.1)
y_pred = predict_yhat(x_pred, m)
```

In [30]:

```
plt.plot(x, y, 'o')
plt.plot(x_pred, y_pred, '-')
```

Out[30]:

In [31]:

```
x = [[1.52, 7.64, 10.87], [3.91, 6.64, 5.04], [2.23, 5.22, 3.23],
[4.99, 6.75, 5.91], [5.81, 4.07, 2.27], [6.79, 4.17, 1.47],
[6.94, 4.05, 0.44], [8.81, 3.97, -0.21], [9.02, 2.21, -1.04], [9.78, 2.07, -2.33]]
y = [11.04, 8.86, 7.52, 8.43, 6.01, 5.71, 3.95, 3.66, 1.21, 0.47]
```

In [32]:

```
def predict_yhat(x, m0, m1, m2, b):
    yhat = x[0]*m0 + x[1]*m1 + x[2]*m2 + b
    return yhat
```

In [33]:

```
def calculate_nll(params, x, y):
    nll = 0.
    # Loop through each data point to get a total negative log likelihood.
    for i in range(len(x)):
        yhat = predict_yhat(x[i], params[0], params[1], params[2], params[3])
        resid = yhat - y[i]
        nll -= stats.norm.logpdf(resid, loc=0, scale=1)
    return nll
```

In [34]:

```
guess=np.array([1.,1.,1.,1.])
```

In [35]:

```
minimization = optimize.minimize(calculate_nll,guess,(x,y))
m0 = minimization.x[0]
m1 = minimization.x[1]
m2 = minimization.x[2]
b = minimization.x[3]
y_pred = [predict_yhat(x_pred, m0, m1, m2, b) for x_pred in x]
```

In [36]:

```
plt.plot(y, y_pred, 'o')
```

Out[36]:

In [37]:

```
x = [1.52, 3.91, 2.23, 4.99, 5.81, 6.79, 6.94, 8.81, 9.02, 9.78]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

In [38]:

```
plt.plot(x, y, 'o')
```

Out[38]:

In [39]:

```
def predict_yhat(x, m, b):
    yhat = 1/(1 + np.exp(-(m*x + b)))
    return yhat
```

In [40]:

```
def calculate_crossent(params, x, y):
    cross_entropy_logit = 0
    for i in range(len(x)):
        yhat = predict_yhat(x[i], params[0], params[1])
        cross_entropy_logit += (-1 * y[i]) * np.log(yhat) - (1 - y[i]) * np.log(1 - yhat)
    return cross_entropy_logit
```

In [41]:

```
guess = np.array([1., 1.])  # two parameters here: slope and intercept
minimization = optimize.minimize(calculate_crossent, guess, (x, y))
m = minimization.x[0]
b = minimization.x[1]
x_pred = np.arange(0, 10., 0.1)
y_pred = predict_yhat(x_pred, m, b)
```

In [42]:

```
plt.plot(x, y, 'o')
plt.plot(x_pred, y_pred, '-')
```

Out[42]: