From Nathan Nakatsuka on 10/20/17.

Available both as HTML and as a Jupyter notebook page.

1) Review of Lecture

a) Statistics

b) How to code the statistics

2) Problem set walkthrough

a) Useful python functions

1) Building models

2) Fitting models (what is the explanatory power of the model?)

a) Regression: What are the parameters of this model?

b) How well does the model explain the data?

3) Testing models (what is the predictive power of the model?)

a) How well can the model explain new observations?

Note that an ordinary (unweighted) least squares fit assumes:

1) $\epsilon \sim N(0,\sigma^2)$

2) The variance $\sigma^2$ is the same for all data points.

The second point in particular is not true in this problem set, which is why we need to weight the points properly such that points with high variance have less of an effect on the model.
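To see why weighting falls out naturally, write out the negative log likelihood of independent Gaussian residuals with per-point standard deviations $\sigma_t$:

```latex
-\log L \;=\; \sum_t \left[ \frac{(y_t - \hat{y}_t)^2}{2\sigma_t^2}
                            + \log \sigma_t + \frac{1}{2}\log 2\pi \right]
```

Each squared residual is divided by its own $2\sigma_t^2$, so high-variance points contribute less to the objective; when all $\sigma_t$ are equal, minimizing this is equivalent to ordinary least squares.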

Moriarty's method is to convert the model of gene expression into the form of a linear regression and then do a least-squares fit to solve for the parameters. If you expand the sine term in the model $y_t = b + a\sin(2\pi\omega(t+\phi))+\epsilon$, you get: $y_t = b + a\sin(2\pi\omega t+2\pi\omega\phi)+\epsilon$

Using the identity $\sin(u+v)=\sin u\cos v+\cos u\sin v$, this can further be converted to: $y_t = b + a\cos(2\pi\omega\phi)\sin(2\pi\omega t)+a\sin(2\pi\omega\phi)\cos(2\pi\omega t)+\epsilon$.

Then you can use a linear regression fit (as in the moriarty.py script) to solve for $\hat{b}$, $\hat{p}_1=a\cos(2\pi\omega\phi)$, and $\hat{p}_2=a\sin(2\pi\omega\phi)$; $\sin(2\pi\omega t)$ and $\cos(2\pi\omega t)$ can be treated as predictor variables just like $X_1$ and $X_2$.
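Moriarty's linearized fit can be sketched in numpy like this. The data are simulated and the variable names are illustrative (not taken from moriarty.py); the period is fixed at 24 hours, so $\omega = 1/24$:

In [ ]:

```python
import numpy as np

omega = 1.0 / 24.0   # cycles per hour, for a 24-hour rhythm

# simulate data from a known sinusoid, just for demonstration
rng = np.random.default_rng(0)
t = np.arange(0, 48, 4.0)
b_true, a_true, phi_true = 10.0, 5.0, 6.0
y = b_true + a_true * np.sin(2 * np.pi * omega * (t + phi_true)) \
    + rng.normal(0, 0.1, t.size)

# design matrix: columns are 1, sin(2*pi*omega*t), cos(2*pi*omega*t)
A = np.column_stack([np.ones_like(t),
                     np.sin(2 * np.pi * omega * t),
                     np.cos(2 * np.pi * omega * t)])
bhat, p1, p2 = np.linalg.lstsq(A, y, rcond=None)[0]

# invert p1 = a*cos(2*pi*omega*phi), p2 = a*sin(2*pi*omega*phi)
ahat = np.sqrt(p1**2 + p2**2)
phihat = np.arctan2(p2, p1) / (2 * np.pi * omega)
```

The last two lines show how to get back from the regression coefficients $\hat{p}_1, \hat{p}_2$ to the amplitude and phase of the original model.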

Our method is to maximize the likelihood directly, utilizing the $\sigma$ of each measurement to weight the measurements. The steps for this problem thus are:

1) Make the negative log likelihood function

2) Pass the negative log likelihood function to scipy.optimize.minimize to get maximum likelihood estimates of the parameters.
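These two steps can be sketched end to end on simulated data. The function and variable names below (nll, guess, etc.) are illustrative, not required by the problem set:

In [ ]:

```python
import numpy as np
from scipy import stats, optimize

OMEGA = 1.0 / 24.0   # cycles per hour, for a 24-hour rhythm

def nll(params, t, y, sigma):
    """Negative log likelihood of one gene's data under the sinusoid model."""
    b, a, phi = params
    yhat = b + a * np.sin(2 * np.pi * OMEGA * (t + phi))
    # each point is weighted by its own sigma, so noisier points count less
    return -np.sum(stats.norm.logpdf(y - yhat, loc=0, scale=sigma))

# simulate data with per-point standard deviations (heteroscedastic noise)
rng = np.random.default_rng(1)
t = np.arange(0, 48, 4.0)
sigma = np.where(t % 8 == 0, 1.0, 0.25)
y = 10 + 5 * np.sin(2 * np.pi * OMEGA * (t + 6)) + rng.normal(0, sigma)

guess = np.array([y.mean(), y.std(), 0.0])
result = optimize.minimize(nll, guess, args=(t, y, sigma))
bhat, ahat, phihat = result.x   # the maximum likelihood estimates
```

Note that the extra data (t, y, sigma) are passed to the objective through the `args` tuple of `optimize.minimize`, and the fitted parameters come back in `result.x`.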

Using Moriarty's script, you have the following:

Y: Expression levels for all genes

S_true: standard deviations for all data points [20, 2, 20, etc.]

X: Time points [4,4,8, etc.]

Follow the optimize_example.py code given to you at the end of the problem set and write a likelihood function for the model:

$y_t = b + a\sin\left((2\pi/24)(t+\phi)\right)+\epsilon$ (you shouldn't put the $\epsilon$ into your equation, but you should realize it's there.)

In [ ]:

```
# residual is observed minus predicted y; stdev is that point's known sigma
nll -= stats.norm.logpdf(residual, loc=0, scale=stdev)
```

In [ ]:

```
answer = optimize.minimize(nll_function, guess, args=(other_arguments))  # ML estimates end up in answer.x
```

In [ ]:

```
if ahats[i] < 0:          # a negative amplitude is the same curve with the
    ahats[i] = -ahats[i]  # phase shifted by half a period (12 hours)
    phats[i] += 12
while phats[i] < 0:       # then wrap the phase into the range [0, 24)
    phats[i] += 24
while phats[i] > 24:
    phats[i] -= 24
```
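As a sanity check on the sign-flip step above (a minimal demonstration, not part of the problem-set code), a negative amplitude traces the same curve as a positive amplitude with the phase shifted by half a period:

In [ ]:

```python
import numpy as np

omega = 1.0 / 24.0
t = np.linspace(0, 48, 100)

curve_neg = -5 * np.sin(2 * np.pi * omega * (t + 3))       # a = -5, phi = 3
curve_pos = 5 * np.sin(2 * np.pi * omega * (t + 3 + 12))   # a = +5, phi = 15

print(np.allclose(curve_neg, curve_pos))   # True: identical curves
```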

To make subplots, do something like the following:

In [ ]:

```
import matplotlib.pyplot as plt

fig, axs = plt.subplots(4, 3, figsize=(12,12))
for i in range(G):   # G is the number of genes
    row = i // 3     # stays on row 0 for i = 0..2, row 1 for i = 3..5, etc.
    col = i % 3      # cycles through columns 0, 1, 2
    axs[row,col].plot(times, yfits[i], label='Moriarty Fit')
    axs[row,col].plot(times, yfits2[i], label='My Fit')
    axs[row,col].plot(X, Y[i], 'ro', label='Observed data')
    axs[row,col].legend(loc='lower right')
    axs[row,col].set_xlabel('time (hours)')
    axs[row,col].set_ylabel('expression (TPM)')
    axs[row,col].set_title(genenames[i])
plt.tight_layout()   # fixes the spacing between subplots
```