MCB 112: Section 7 notes

From Nathan Nakatsuka on 10/20/17.

Available both as html and as a Jupyter notebook page.

Format of section:

1) Review of Lecture

       a) Statistics

       b) How to code the statistics

2) Problem set walkthrough

      a) Useful python functions

How do probability and statistical inference relate to each other?

Probability theory lets you go from a population to a sample; statistical inference lets you go from a sample to conclusions about the population. Another way of looking at this is that statistical inference goes from data to a model, while probability theory goes from the underlying model to the data.

Using models involves 3 major parts:

1) Building models

2) Fitting models (what is the explanatory power of the model?)

       a) Regression: What are the parameters of this model?

       b) How well does the model explain the data?

3) Testing models (what is the predictive power of the model?)

       a) How well can the model explain new observations?

The basics of regression: Least Squares Fitting

Here we will talk about the most basic form of linear regression: fitting a simple line. See the lecture notes for a picture of this. The key feature is data points that fall above and below a line but appear to generally follow its shape. The model of the line is $Y=\beta x + \epsilon$, where $\epsilon$ is normally distributed error (e.g. measurement error). The goal of linear regression here is to find the $\hat{\beta}$ that minimizes the residual sum of squares, where the residuals are $Y[i]-\hat{Y}$ ($Y[i]$ is the observed data point, $\hat{Y}$ is the $Y$ your model predicts for that point).

Note that this assumes:

1) $\epsilon \sim N(0,\sigma^2)$

2) The variance $\sigma^2$ is the same for all data points.

The second assumption in particular does not hold in this problem set, which is why we need to weight the points properly so that points with high variance have less influence on the fit.

If you look at the lecture notes you will see the derivation of $\hat{\beta}$ for both the ordinary case and the weighted least squares fit. In the weighted case you can see that each point is weighted by the inverse of its variance, since the variance appears in the denominator.
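As a concrete illustration, here is a minimal sketch (not the problem set solution; the data and variable names are invented) comparing an ordinary least squares fit and an inverse-variance-weighted fit for the simple model $Y=\beta x + \epsilon$:

In [ ]:
import numpy as np

# toy data: y = 2x plus noise whose standard deviation differs per point
x     = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
sigma = np.array([0.5, 0.5, 5.0, 0.5, 0.5])         # the third point is much noisier
y     = 2.0 * x + np.random.normal(0, sigma)

# ordinary least squares for y = beta*x: beta_hat = sum(x*y) / sum(x^2)
beta_ols = np.sum(x * y) / np.sum(x * x)

# weighted least squares: weight each point by 1/sigma^2, so noisy points count less
w        = 1.0 / sigma**2
beta_wls = np.sum(w * x * y) / np.sum(w * x * x)

print(beta_ols, beta_wls)                           # both should be near 2; the weighted fit is more stable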

Maximum likelihood method

A likelihood function is kind of like an inverse probability in that $L(\theta|data) = P(data|\theta)$, where $\theta$ are the parameters. The maximum likelihood method is to find the value of $\theta$ that maximizes the likelihood of the data, so $\hat{\theta}^{ML}=\mathrm{argmax}_\theta\, \log P(data|\theta)$, where $\mathrm{argmax}$ means find the $\theta$ at which the function is at its maximum. If you add a prior you get the maximum a posteriori (MAP) estimate, so $\hat{\theta}^{MAP}=\mathrm{argmax}_\theta\, \log\left(P(data|\theta)\,P(\theta)\right)$.
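As a quick example of the argmax idea: if you flip a coin $N$ times and observe $n$ heads, with $\theta$ the probability of heads, then $\log P(data|\theta) = n\log\theta + (N-n)\log(1-\theta)$ (up to a constant that doesn't depend on $\theta$). Setting the derivative with respect to $\theta$ to zero gives $\hat{\theta}^{ML} = n/N$, the observed frequency of heads.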


There is a big field of research in designing optimizers that minimize objective functions. These objective functions can be loss functions (which you want to minimize) or reward functions (which you want to maximize). Optimizers work by finding the gradient of your function and going downhill until they reach a minimum. There are a variety of ways to make this efficient (e.g. tuning the learning rate $\eta$, the rate at which you descend the function) and to avoid getting stuck in shallow local minima that are not the global minimum (e.g. trying different starting conditions or attempting to "jump" out of holes). See the lecture notes for more on this.
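A minimal sketch of the idea, plain gradient descent on a one-dimensional toy loss function (the function, learning rate, and starting point here are just illustrative):

In [ ]:
def loss(theta):
    return (theta - 3.0)**2            # toy objective whose minimum is at theta = 3

def gradient(theta):
    return 2.0 * (theta - 3.0)         # derivative of the loss

eta   = 0.1                            # learning rate: how big a downhill step to take
theta = 10.0                           # starting guess
for step in range(100):
    theta -= eta * gradient(theta)     # step downhill along the gradient
print(theta)                           # should end up close to 3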

Multiple regression

Multiple regression is the same as above except with multiple predictor variables and multiple coefficients, so there are several parameters to optimize. A typical example might be: $Y=\beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon$.
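A sketch of fitting such a model with np.linalg.lstsq (the data here are invented for illustration; each column of the design matrix is one predictor variable):

In [ ]:
import numpy as np

# simulated data: y = 1*x1 + 2*x2 + 3*x3 plus a little noise
N  = 50
x1 = np.random.rand(N)
x2 = np.random.rand(N)
x3 = np.random.rand(N)
y  = 1*x1 + 2*x2 + 3*x3 + np.random.normal(0, 0.1, N)

X_design = np.column_stack([x1, x2, x3])            # one column per predictor
betas    = np.linalg.lstsq(X_design, y, rcond=None)[0]
print(betas)                                        # should be close to [1, 2, 3]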

Harmonic Regression

In the problem set you have circadian expression data for certain genes, with the following model for gene expression: $y_t = b + a\sin(2\pi\omega(t+\phi))+\epsilon$, where $b$ shifts the expression up and down, $a$ sets the amplitude of the oscillation, $\omega=1/24$ converts time in hours to fractions of a 24-hour cycle (so the argument of the sine is in radians), and $\phi$ shifts the phase. You are given the expression of the oscillating genes and your goal is to find the phase of each gene.

Moriarty's method is to convert the model of gene expression into the form of a linear regression and then do a least squares fit to solve for the parameters. If you expand the equation above you get: $y_t = b + a\sin(2\pi\omega t+2\pi\omega\phi)+\epsilon$

Using the angle-addition identity $\sin(A+B)=\sin A\cos B+\cos A\sin B$, this can further be converted to: $y_t = b + a\cos(2\pi\omega\phi)\sin(2\pi\omega t)+a\sin(2\pi\omega\phi)\cos(2\pi\omega t)+\epsilon$.

Then you can use a linear regression fit (as in Moriarty's script) to solve for $\hat{b}$, $\hat{p_1}=a\cos(2\pi\omega\phi)$, and $\hat{p_2}=a\sin(2\pi\omega\phi)$. Here $\sin(2\pi\omega t)$ and $\cos(2\pi\omega t)$ can be considered predictor variables just like $x_1$ and $x_2$ in the multiple regression above.
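Here is a sketch of that linearization on simulated data (this is not Moriarty's actual script, just the same idea). Since $p_1 = a\cos(2\pi\omega\phi)$ and $p_2 = a\sin(2\pi\omega\phi)$, you can recover $a = \sqrt{p_1^2+p_2^2}$ and $2\pi\omega\phi = \mathrm{atan2}(p_2, p_1)$ after the fit:

In [ ]:
import numpy as np

omega = 1.0 / 24.0                              # one cycle per 24 hours
t     = np.arange(0, 48, 4.0)                   # example time points (hours)
# simulate one gene with b=5, a=3, phi=6, plus a little noise
y     = 5 + 3*np.sin(2*np.pi*omega*(t + 6)) + np.random.normal(0, 0.2, len(t))

# design matrix: intercept, sin term, cos term
A = np.column_stack([np.ones_like(t),
                     np.sin(2*np.pi*omega*t),
                     np.cos(2*np.pi*omega*t)])
bhat, p1, p2 = np.linalg.lstsq(A, y, rcond=None)[0]

ahat   = np.sqrt(p1**2 + p2**2)                 # amplitude
phihat = np.arctan2(p2, p1) / (2*np.pi*omega)   # phase in hours
phihat %= 24                                    # wrap into [0, 24)
print(bhat, ahat, phihat)                       # should be near 5, 3, 6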

Our method is to maximize the likelihood directly, utilizing the $\sigma$ of each measurement to weight the measurements. The steps for this problem thus are:

1) Make the negative log likelihood function

2) Put the negative log likelihood function into scipy.optimize to get maximum likelihood estimates of the parameters.

Using Moriarty's script you have the following:

Y: Expression levels for all genes

S_true: standard deviations for all data points [20, 2, 20, etc.]

X: Time points [4,4,8, etc.]

Step 1 of Part 1 of problem set

The likelihood function is $P(data|parameters)$. In this case, because you assume normally distributed noise, maximizing the likelihood of the data amounts to minimizing the sum of squared residuals as above, except that you also account for the standard deviation of each data point.

Follow the code given to you at the end of the problem set and create a likelihood function as:

$\hat{y} = b + a\sin\left((2\pi/24)(t+\phi)\right)$ (you shouldn't put the $\epsilon$ into your equation, since $\hat{y}$ is the noiseless prediction, but you should realize it's there in the model.)

Loop through each data point, calculate $\hat{y}$ for that point, and then get the residual as $Y[i]-\hat{y}$.

Then sum up all the log likelihoods and negate the total (equivalently, accumulate the negative log likelihood) using something similar to the following code:

In [ ]:
nll -= stats.norm.logpdf(residual, loc=0, scale=stdev)  # subtract this point's log likelihood, so nll accumulates the negative log likelihood
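
Putting it together, one possible shape for the whole function (just a sketch; the name, argument order, and data layout follow the description above, but the details are up to you):

In [ ]:
import numpy as np
from scipy import stats

def nll_function(params, t, y, sigma):
    """Negative log likelihood of one gene's data given params = (b, a, phi)."""
    b, a, phi = params
    nll = 0.0
    for i in range(len(t)):
        yhat     = b + a * np.sin((2*np.pi/24) * (t[i] + phi))   # predicted expression at time t[i]
        residual = y[i] - yhat
        nll     -= stats.norm.logpdf(residual, loc=0, scale=sigma[i])
    return nll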

Step 2 of Part 1 of problem set

Use the scipy minimizer to minimize the negative log likelihood, using something like the following semi-pseudocode:

In [ ]:
from scipy import optimize
answer = optimize.minimize(nll_function, guess, args=(other_arguments))

Here other_arguments are X, Y[i], and S_true, which get passed through to your negative log likelihood function. This gives you a result object called, in this case, answer. answer.x is an array holding the maximum likelihood parameters, so answer.x[0]=$\hat{b}$, answer.x[1]=$\hat{a}$, and answer.x[2]=$\hat{\phi}$. answer.success is a Boolean (True or False) telling you whether or not the minimization worked.
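
Putting step 2 together for one gene (a sketch, assuming the nll_function above and the X, Y, and S_true arrays from Moriarty's script are already defined; the starting guess here is invented and you may want to tune it):

In [ ]:
import numpy as np
from scipy import optimize

guess  = [np.mean(Y[i]), 1.0, 0.0]              # rough starting values for b, a, phi
answer = optimize.minimize(nll_function, guess, args=(X, Y[i], S_true))
if answer.success:
    bhat, ahat, phihat = answer.x               # maximum likelihood estimates for this gene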

To report your answer in a consistent form, $\hat{a}$ must be greater than 0 and $\hat{\phi}$ must be between 0 and 24, so you should flip the sign of any negative $\hat{a}$ (and if you do so, add 12 to $\hat{\phi}$, half a period, so the curve stays the same). You can then add or subtract 24 to $\hat{\phi}$ until it's in the right range. Draw out a sine graph and see why this is true.

In [ ]:
if ahats[i] < 0:
    ahats[i] = -ahats[i]    # flip a negative amplitude ...
    phats[i] += 12          # ... and shift the phase by half a period so the curve is unchanged
while phats[i] < 0:
    phats[i] += 24          # wrap the phase up into [0, 24)
while phats[i] > 24:
    phats[i] -= 24          # wrap the phase down into [0, 24)

For Part 2 of the problem set, just compute the total log likelihood by looping over all genes and all data points and summing the log likelihoods (you can re-use the function you made for Part 1).
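
A sketch of that loop (assuming the nll_function above, the X, Y, and S_true arrays described earlier, G genes, and arrays bhats, ahats, and phats holding your fitted parameters; bhats is a hypothetical name chosen by analogy with ahats and phats):

In [ ]:
total_ll = 0.0
for g in range(G):
    params    = [bhats[g], ahats[g], phats[g]]          # fitted parameters for gene g
    total_ll -= nll_function(params, X, Y[g], S_true)   # subtracting the negative log likelihood adds the log likelihood
print(total_ll)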

Useful python functions for plotting in Part 3

To plot the model, generate many points along the x axis (time points) and compute the y value at each of them using your fitted parameters. Use something like np.linspace(0,48,200) to get a dense grid of points.
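
For example, something like this sketch (the names yfits2, bhats, ahats, and phats are the same hypothetical ones used above; yfits for Moriarty's fit would be built the same way from his parameter estimates):

In [ ]:
import numpy as np

times  = np.linspace(0, 48, 200)    # dense grid of time points for a smooth curve
yfits2 = []
for g in range(G):
    yfits2.append(bhats[g] + ahats[g] * np.sin((2*np.pi/24) * (times + phats[g])))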

To do subplots do something like the following:

In [ ]:
fig, axs = plt.subplots(4, 3, figsize=(12,12))
for i in range(G):
    row = i // 3   # i = 0,1,2 go on row 0; i = 3,4,5 on row 1; and so on
    col = i % 3    # the column cycles 0, 1, 2, 0, 1, 2, ...
    axs[row,col].plot(times,yfits[i],label='Moriarty Fit')
    axs[row,col].plot(times,yfits2[i],label='My Fit')
    axs[row,col].plot(X,Y[i],'ro',label='Observed data')
    axs[row,col].legend(loc='lower right')
    axs[row,col].set_xlabel('time (hours)')
    axs[row,col].set_ylabel('expression (TPM)')
plt.tight_layout() # this makes the spacing right