Section 4: Intro to Probability

Notes by Gloria Ha [10/4/19], Kate Shulgina [10/5/18], and Tim Dunn [9/23/16]

(You can download these notes as a jupyter notebook page)

The first two thirds of these notes are a review of probability with some examples, all written by previous years' TFs. The last one third (heading: Markov chains) is what we'll be going over in section.

Laplace made significant contributions to probability theory by hammering home the idea of inverse probability. This concept is important because it provides a foundation for solving the inverse problem, the typical inductive task we, as scientists, are faced with when analyzing data.

Deductive vs. inductive logic

Deduction is used when a cause (or model) is known, and we would like to know the resulting effects (or the output from the model). This type of reasoning is common in mathematics, where relationships and fromulae can be deduced from given axioms. But deductive reasoning also allows us to predict the odds of, say, a blackjack hand, knowing the number of cards in a deck, the number of cards dealt to each player, number of face cards revealed, etc. In this blackjack example, the model is everything you know about the game, and the model is composed of individual model parameters (i.e. number of cards in a deck, number of players, etc.) that together define the output of the model. When a model is nice and parameterized, the parameters themselves can be updated, and a change in model output can be observed (e.g. how do the odds change as fewer and fewer cards remain in the deck?). Deduction is also known as the forward problem.


$$ \begin{align} cause &\to effect \\ model &\to ouput \end{align} $$

Induction is used to go the other way, determining the cause from many observations of the effects. This is the problem that biologists are typically trying to solve. Given some experimental data, what was the underlying cause of the data (i.e. the true nature of the underlying system)? This inverse problem is usually harder to tackle because often data are consistent with many different possible causes (or combinations of causes). In the blackjack example, let's suppose that you see 10 hands dealt in a row. With that information, can you now determine whether the dealer is playing with a standard deck of cards?


$$ \begin{align} effects &\to cause(s) \\ output &\to model \ parameter(s) \end{align} $$

Laplace's inverse problem

Laplace faced a similar inverse problem when confronted with the London and Paris census data. He observed that the boy/girl ratio in the census data was not quite 50:50—how could he go about determining whether the actual birth rate was biased or whether the ratio he observed was simply due to noise?

Laplace knew that he could solve the forward problem using the binomial distribution, which describes the probability of a certain number of "successes" in a string of independent binary (i.e. boy/girl) events (i.e. births).

Random variables and probability models

Before we dive into binomial distributions, let's review some basic probability concepts.

A random variable is a variable that takes on one of a defined set of values with some probability. Random variables are usually written as uppercase letters, and the values they assume are written lowercase. The set of values that a random variable can assume is called the sample space.

For example, let $W$ be the random variable of the weather tomorrow. Let's say the weather can be either sunny or rainy, so sample space is $W \in \{sunny, rainy\}$.

The PMF (or probability mass function) describes how to compute the probability of any value in the sample space. So for example, the PMF could look like $$P(W=w) = \left\{ \begin{array}{ll} 0.3 & \text{if } w = sunny \\ 0.7 & \text{if } w = rainy \\ \end{array} \right.$$

We're saying that if there were an infinite number of identical parallel universes, then we think that 70% of them would have a rainy day tomorrow and 30% of them would have a sunny day. The probabilities over the entire sample space always sum to 1.

As scientists, we often use probability models to explicitly write down in our hypothesis of how some system works. We can use probability models to quantify how well the observed data fits the the model we defined, or whether one model is better than another. When we define a probability model, we typically pick a distribution and values for its parameters.

So for the weather example, the distribution that we picked was the Bernoulli distribution, which describes the probability of a single yes-no event. We write that as $$ W \sim \text{Bernoulli}(p=0.3)$$

which is read verbally as "W is distributed Bernoulli with parameter $p$ = 0.3". The Bernoulli distribution has the PMF that we saw above. Notice that we also have to pick a value for the parameter $p$ if we're going to model something with this probability distribution. When we compare how well different probability models fit our data, we often are comparing different values for the parameters.

Binomial distribution

A binomial distribution can be thought of as the result of $N$ independent Bernoulli experiments. Independent means that the outcome of each experiment is not influenced by the outcomes of the others. The binomial distribution describes the number of successes out of $N$ Bernoulli trials, each with probability $p$ of a success. So in Laplace's census data, the birth of each child is an independent Bernoulli trial, where each child as some probability $p$ of being a boy and probability $1-p$ of being a girl. We write that we are modelling the the total number of boys $B$ out of a total number of births $N$ as a binomial distribution as:

$$B \sim \text{Binomial}(p, N)$$

The sample space of $B$ is $\{0,1,2,...,N\}$

The binomial distribution has the following PMF:

$$P(B=b) = \binom{N}{b}p^{b}{(1-p)}^{N-b}$$

To make it more concrete, suppose there are only 5 births (so $N = 5$) and 2 of them are boys ($b=2$). If we model births with a binomial distribution with even probability of either gender ($p=0.5$) then we can calculate the probability of getting 2 boys as:

$$P(B=2 \mid N=5, p=0.5) = \binom{5}{2}0.5^{2}{(1-0.5)}^{5-2} = 0.3125$$

The binomial coefficient is shorthand for $\frac{5!}{2!(5-2)!}$ and is read as "five choose two". It gives you the number of ways you can choose 2 out of 5 to be boys. Note that this is also the same as $\binom{5}{3}$, as we can equivalently choose 3 out of 5 to be girls.

Why do we need the binomial coefficient in the PMF formula? In our example, the probability of boys and girls is the same. So seeing the combination {G,G,G,G,G} has the same probability as the combination {B,G,G,G,G}. However, we are counting the total number of boys and there are different numbers of combinations that can give you the same total number of boys. There is only one way you could get no boys ({G,G,G,G,G}), five ways you could get 1 boy ({B,G,G,G,G},{G,B,G,G,G}, etc), ten ways you could get 2 boys, etc... The binomial coefficient reflects that some values are more likely because there are more ways to achieve combinations that give you those values, and increases the probability of those values by that amount.

Instead of calculating the binomial PMF by hand, we can have Python do it for us with the handy scipy.stats module:

In [1]:
import scipy.stats
import numpy as np

b = 2                                 # number of boys
N = 5                                 # total number of children born
p = 0.5                               # probability of a boy birth, in our model
scipy.stats.binom.pmf(b, N, p)        # evaluates PMF at P(B=b)

Let's use Python to compute the probability of Laplace's observations, if the probability of boys and girls is equal. He observed in the Paris census data that there were 493472 births, 251527 of which were boys.

In [2]:
b = 251527                            # number of boys
N = 493472                            # total number of children born
p = 0.5                               # probability of a boy birth, in our model
scipy.stats.binom.pmf(b, N, p)        # evaluates PMF at P(B=b)

The probability is tiny! Does this mean that $p=0.5$ hypothesis is wrong? Not necessarily. As Sean pointed out in lecture, even $P(b = 246736 \mid N = 493472, p = 0.5)$, where you plug in exactly $N/2$ for $b$, has a small value.

In [3]:
b = 246736                            # number of boys
N = 493472                            # total number of children born
p = 0.5                               # probability of a boy birth, in our model
scipy.stats.binom.pmf(b, N, p)        # evaluates PMF at P(B=b)

This is because any particular value for $b$ is unlikely because there are so many other similarly probable values. Indeed, for a continuous distibution, any particular value for the random variable has probability 0 because there are infinitely many possible values.

We can use the power of modern-day computing to visualize the entire PMF in Python over all possible values for $b$ to get a better sense of what's going on:

In [4]:
import matplotlib.pyplot as plt

# Set the parameters
p = 0.5                        # equally probable to get boys and girls
N = 493472                     # total number of births
B = list(range(0, N+1))        # a list of all possible values of b

# Use scipy.stats binomial equation to generate a probability for each b
# Note that because we are passing in a list of b, the pmf function returns a list of calculated P for each b
mypmf = scipy.stats.binom.pmf(B, N, p)

# Plot the result
plt.plot(B, mypmf)
plt.ylabel('$P(B \mid N = 493472, p = 0.5)$', fontsize=15)
plt.xlabel('$B$', fontsize=15)
In [5]:
# Plot the result
plt.plot(B, mypmf)
plt.ylabel('$P(B \mid N = 493472, p = 0.5)$', fontsize=15)
plt.xlabel('$B$', fontsize=15)
In [6]:
# Sanity check: does PMF sum to 1?
print('Binomial PMF sums to: %f' % sum(mypmf))
Binomial PMF sums to: 1.000000
In [7]:
## other useful functions:
# compute the cumulative probability, P(B<=b)
scipy.stats.binom.cdf(b, N, p)
In [8]:
# sample 10 from binomial distribution with these parameters
np.random.binomial(N, p, 10)
array([246725, 246711, 246858, 246583, 246861, 246550, 246981, 246990,
       247010, 246243])

Likelihoods and maximum likelihood parameter estimation

What Laplace really wanted to do was figure out is whether the observed frequency of boy births in Paris was inconsistent with a model where boys and girls are born equiprobably. Namely, he wanted to compare two models, one where $p=0.5$ and one where $p=\hat{p}^{ML}$ (the maximum likelihood estimate for $p$).

Maximum likelihood estimation is a way of making a best guess for the value of a model parameter from some data. As the name suggests, it is the value of the parameter that maximizes the likelihood.

Likelihood is a confusing bit of terminology. The likelihood of a model is defined as the probability of the data, given a model. It is the solution to the forward problem that we are familiar with:

$$P(data \mid model) = P(b \mid N, p)$$

However, now we think about the probability of the observation as a function of the model parameter. There are some values of $p$ that the model gives a higher probability for your data (like $p=0.51$) than others (like $p=0.1$, which is pretty bad).

We can use Python to visualize this.

In [9]:
# Set the parameters
p_vec = np.linspace(0,1,10000)
N = 493472
b = 251527 

# Use scipy.stats binomial equation to calculate each likelihood
# We can pass in a list of p's!
likelihoods = scipy.stats.binom.pmf(b, N, p_vec)

# Plot the result
plt.plot(p_vec, likelihoods)
plt.ylabel('$P(B=251527 \mid N = 493472, p)$', fontsize=15)
plt.xlabel('$p$', fontsize=15)

Notice that now $p$ is the x-axis!

We can find the maximum likelihood estimate of $p$ by just picking off the best value. The numpy argmax function can help us here.

In [10]:
# find index that maximizes likelihood]
pml_index = np.argmax(likelihoods)
p_mle = p_vec[pml_index]
print('The maximum likelihood estimate for p is %f' % p_mle)
The maximum likelihood estimate for p is 0.509751

Don't confuse this with the PMF, which gave a very similar plot earlier. Unlike the PMF, the integral of this function does not have to equal to 1.

A cleaner way to calculate the $p=\hat{p}^{ML}$ would be to do it mathematically— take the derivative of the likelihood with respect to $p$, set it to 0 and solve for $p$. If you were to do that, you would find that the ML estimate for $p$ is simply the observed frequency of boys: $$\hat{p}^{ML} = \frac{b}{N}$$ Most of the time, you'll find that the maximum likelihood estimate is also the obvious and intuitive one.

Model comparison

Now we want to compare the two models that Laplace was considering: one where $p=0.5$ and one where $p=\hat{p}^{ML}$. We know how to calculate likelihoods, $P(data\mid model)$, but that's not a value that directly answers our questions.

We wish we could calculate and compare $P(model \mid data) = P(p \mid b, N)$ instead, because then we could directly compare those values and see which $p$ is more likely.

But how do you calculate this inductive inverse probability? As Sean mentioned in lecture, Laplace's big logical leap was realizing that

$$ \begin{align} \frac{P(p_{1} \mid b,N)}{P(p_{2} \mid b,N)} \propto \frac{P(b \mid p_{1},N)}{P(b \mid p_{2},N)} \end{align} $$

Why is this true? Remember Bayes rule: $$P(p \mid b, N) = \frac{P(b \mid p, N) P(p)}{P(b \mid N)}$$

Plugging that into the ratio of likelihoods $$ \begin{align} \frac{P(p_{1} \mid b,N)}{P(p_{2} \mid b,N)} &= \frac{P(b \mid p_1, N) P(p_1)}{P(b \mid N)} \times \frac{P(b \mid N)}{P(b \mid p_2, N) P(p_2)} \\ &= \frac{P(b \mid p_1, N) P(p_1)}{P(b \mid p_2, N) P(p_2)} \\ &\propto \frac{P(b \mid p_{1},N)}{P(b \mid p_{2},N)} \qquad\text{if } P(p_1) = P(p_2) \end{align} $$

So now that we understand Bayes rule, we can see that the ratio of the posteriors is proportional to the ratio of the likelihoods if we assume a uniform prior, i.e. if we give the two hypotheses the same a priori probability of being the correct hypothesis.

So let's make that assumption and calculate the ratio of the likelihoods in Python:

In [11]:
likelihood_0 = scipy.stats.binom.pmf(b, N, 0.5)    # likelihood of model where p=0.5
likelihood_1 = scipy.stats.binom.pmf(b, N, p_mle)  # likelihood of model where p=MLE

print("The likelihood ratio is %.2E" % (likelihood_1/likelihood_0))
The likelihood ratio is 2.53E+40

Oftentimes, likelihoods and likelihood ratios can become very very large or small values, to the point where there is a risk of overflow or underflow. Computers can only hold numbers up to a finite number of digits, so if a number gets to small, it will generate an error. For this reason, likelihoods are often computed as log-likelihoods, which keeps the value within a range that a computer can manipulate.

The log-likelihood ratio is just the difference of the log-likelihoods. Another nice property of log-likelihood ratios is that they are positive when supporting model 1 and negative when supporting model 0.

In [12]:
log_lik_0 = np.log(likelihood_0)
log_lik_1 = np.log(likelihood_1)
log_lik_ratio = log_lik_1 - log_lik_0

print("The log-likelihood ratio is %.2f" % log_lik_ratio)
The log-likelihood ratio is 93.03

ROC plots

ROC plots are a visualization tool used to compare the performance of binary classifiers, i.e. something that tries to sort inputs into positives and negatives, enemy planes and friendly planes, pathogen sequences and sandmouse sequences.

This classifier score each input and then decides whether or not the input is a positive or negative based on whether it is greater than some threshold.

For example, suppose we had a system of looking at the radar signature of planes and assigning them a score based on whether they were enemy planes or friendly planes. These are the score we get for a set of known enemy and friendly planes

In [13]:
import matplotlib.pyplot as plt

# scores generated by our radar plane classification tool
enemy_scores = np.array([44,50,77,65,87,56,32,62,63,69])
friendly_scores = np.array([61,51,34,42,45,32,22,29,37,32])

# let's plot the scores to see how the distributions overlap
plt.hist(enemy_scores, alpha=0.3, color='red', bins=np.arange(0,100,5))
plt.hist(friendly_scores, alpha=0.3, color='blue', bins=np.arange(0,100,5))
plt.xlabel('radar scores')

An ROC curve characterizes how good a classifier is by scanning the threshold across all possible values and plotting how the false positive and true positive rates change.

  • True positive: how many of the positives were correct? How many of the planes above the threshold are really enemy planes?
  • False positive: how many of the positives were wrong? How many of the planes above the threshold are really friendly planes? We want to keep this number low... don't want to shoot down friendly planes.
  • True negative: how many of the negatives were correct? How many of the planes below the threshold are really friendly planes?
  • False negative: how many of the negatives were wrong? How many of the planes below the threshold are really enemy planes? We want this value to be low, don't want to let through enemy planes!
  • True positive rate = TP/(TP+FN). Of all the enemy planes, what proportion were above the threshold?
  • False positive rate = FP/(FP+TN). Of all the friendly planes, what proportion were above the threshold?

Suppose we set the threshold to be 0. What is the true positive rate? What is the false positive rate? Threshold at 90? Threshold at 60?

In [14]:
# Suppose we set the threshold to be 60. 
threshold = 60

# What is the true positive rate?
tp = np.sum(enemy_scores > threshold)
tp_plus_fn = len(enemy_scores)
tp_rate = tp / tp_plus_fn
print('If the threshold is set at %i, then TP=%i, TP+FN=%i, and the true positive rate is %.1f' % 
      (threshold, tp, tp_plus_fn, tp_rate))

# What is the false positive rate?
fp = np.sum(friendly_scores > threshold)
fp_plus_tn = len(friendly_scores)
fp_rate = fp / fp_plus_tn
print('If the threshold is set at %i, then FP=%i, FP+TN=%i, and the false positive rate is %.1f' % 
      (threshold, fp, fp_plus_tn, fp_rate))
If the threshold is set at 60, then TP=6, TP+FN=10, and the true positive rate is 0.6
If the threshold is set at 60, then FP=1, FP+TN=10, and the false positive rate is 0.1

As you can see, all ROC curves are anchored at (0,0) and (1,1), and good classifiers maintain a low false positive rate while having a high true positive rate for some range of thresholds.

To plot the ROC curve, we need to sweep across all values of the threshold and calculate the true postive and false positive rates.

In [15]:
max_value = np.max([enemy_scores, friendly_scores])

fp_rates = np.zeros(max_value)        # initialize array to keep track of false positive rates
tp_rates = np.zeros(max_value)        # initialize array to keep track of true positive rates

for threshold in range(max_value):
    # calculating true positive rate
    tp = np.sum(enemy_scores > threshold)
    tp_plus_fn = len(enemy_scores)
    tp_rates[threshold] = tp / tp_plus_fn
    # calculating false positive rate
    fp = np.sum(friendly_scores > threshold)
    fp_plus_tn = len(friendly_scores)
    fp_rates[threshold] = fp / fp_plus_tn

plt.plot(fp_rates, tp_rates)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')

Playing around with joint probabilities

In the end of Sean's lecture notes, there is a nice example of how to manipulate joint probability tables. The example is the following:

Suppose we've done a single cell RNA-seq experiment, where we've treated cells with a drug (or left them untreated), and we've measured how often (in how many single cells) genes A and B are on or off. Our data consist of counts of 2000 individual cells:

T=no T=yes B=ON B=off B=ON B=off A=ON 180 80 10 360 A=off 720 20 90 540
In [16]:
# let's represent these tables as 3D-numpy arrays
treat_no_counts = np.array([[180, 80],[720, 20]])
treat_yes_counts = np.array([[10, 360],[90, 540]])
counts = np.array([treat_no_counts, treat_yes_counts])


[[180  80]
 [720  20]]

[[ 10 360]
 [ 90 540]]
Joint probability

The joint probabilities P(A,B,T) sums to one over everything. So normalize by dividing everything 2000 (the total # of cells):

In [17]:
# double check the total number of cells in 2000
n_cells = np.sum(counts)
print('Total number of cells in %i' % n_cells)
Total number of cells in 2000
In [18]:
# normalize probabilities by dividing by 2000
joint = counts / n_cells



[[0.09 0.04]
 [0.36 0.01]]

[[0.005 0.18 ]
 [0.045 0.27 ]]

Any marginal distribution is a sum of the joint probabilities over all the variables you don't care about, leaving the ones you do. For example, to get $P(T)$, we sum over $A$,$B$: $P(T)=\sum_{A,B} P(A,B,T)$, which leaves:

In [19]:
# let's get P(T) by marginalizing (summing) over A and B

print('P(T=no) = %.1f' % np.sum(joint[0]))
print('P(T=yes) = %.1f' % np.sum(joint[1]))
P(T=no) = 0.5
P(T=yes) = 0.5
Conditional distributions
In [20]:
# what if we wanted P(A,B | T=no)? Could get table for T=no and renormalize probabilities
joint_treat_no = joint[0]
joint_treat_no = joint_treat_no / np.sum(joint_treat_no)

print('P(A,B | T=no):')

P(A,B | T=no):
[[0.18 0.08]
 [0.72 0.02]]

Marginalizing a conditional

If we marginalize a conditional distribution, the conditioning stays as it was. For example, $P(B∣T=\text{no})=\sum_{A}P(A,B∣T=\text{no})$

In [21]:
# what if we wanted P(B | T=no)? Need to marginalize (sum) over A
print('P(B | T=no): ')
np.sum(joint_treat_no, axis=0)
P(B | T=no): 
array([0.9, 0.1])

Markov chains


In lecture Sean introduced Markov models, which are very relevant for this week's homework. As discussed in class, Markov models are a way to calculate the probability of a sequence of events for which each event's probability depends on what event(s) came before it -- if you have a first order Markov model, the probability of an event depends only on the event that happened directly before, while if you have a second order Markov model, the probability of an event depends on the previous two events. The purpose of this part of the notes is to walk you through how you could set up a Markov model and execute it, while explaining relevant terminology from lectures and the homework along the way. Below this is an exercise where you can try your hand at analyzing some data using a Markov model.

Let's consider birdsongs. Say that we have two species of birds, special and fancy, that we want to differentiate between based on their songs. Say that these birdsongs are comprised of sequences of two notes: do and re, which we will denote d and r. We have data in the format of strings of different lengths: 'ddrddr', 'drdrrdddrr', for example. We will try modeling this as a third order Markov model for this example. We can denote the random variable for note $i$ as $X_i$, where $X_i$ can take on value $d$ or $r$. If the first note in our sequence was d, then we could denote event $X_1=d$ as $x_1$. If I were to write out this probability for a sequence of notes of length $n$, I would get: \begin{equation*} P(x_1, x_2,...x_{n}) = p(x_1,x_2,x_3)\Pi_{i=4}^{n} p(x_i|x_{i-3},x_{i-2},x_{i-1}) \end{equation*}

The initial probability $p(x_1,x_2,x_3)$ is the probability of this particular sequence of three notes occurring. If the sequence we're looking at starts with 'ddr', we're calculating the probability of d, then d, then r happening at any point in a full sequence of notes. Now that we've taken care of the initial three notes, we can look at the following notes using conditional probabilities. For the sequence 'ddrddr', for $i=4$, we want to know the probability of 'd' after the sequence 'ddr'. For $i=5$, we want to know the probability of 'd' after the sequence 'drd'. And so on. How do we get these probabilities? We have to train on datasets of birdsongs of known species origin.

Say I want to find the conditional probability $p(d|ddr)$ for a special bird. It might feel more natural to think of the probability of the sequence 'ddrd', or $p(ddrd)$. We can relate these two probabilities using the equation for conditional probability: \begin{equation*} p(ddrd) = p(d|ddr)p(ddr) \end{equation*} In other words, the probability of getting the sequence 'ddrd' can be written as the probability of getting 'd' once you already have 'ddr' multiplied by the probability of having 'ddr' in the first place. We can rearrange this to solve for the conditional probability that we are interested in. \begin{equation*} p(d|ddr) = \frac{p(ddrd)}{p(ddr)} \end{equation*} Now we have two simple probabilities that we can calculate with our dataset! We can find the probability of 'ddrd' by seeing how many times the sequence 'ddrd' appears in the special birdsong database. We could then divide this by the total number of 4 note sequences (4-mers, if you will) found in the special database to obtain the probability. Similarly, you could find the number of times 'ddr' appears in the special birdsong database, and divide this by the total number of 3-mers in the database.

We can repeat this process for every conditional probability possible, for both the special and fancy training datasets. In the problem set, we are told to use half of the sequences as the training set, and half as the test set. We want to use data from known origins to train our model (get the probability parameters discussed above), and we also want to use data from known origins to test our model (see if we can figure out the origin of a piece of data using our Markov model). We don't want to use the same pieces of data to train and test the model. Depending on your data and model, you may use a larger or smaller fraction of the total dataset to train the model.

So now, theoretically, we have our conditional probabilities for our third order Markov model for both special and fancy bird species. Let's say we used half of our available data of known species origin to train the model and get these parameters. Now we want to see how well our model identifies the species origin of our remaining birdsong sequences. For each birdsong sequence in our test dataset, we want to calculate a log-odds score, a measure of how probable that sequence would be if it came from a special bird vs. a fancy bird (or the other way around, depends on what you're considering a 'positive' species identification -- for this exercise we will consider special to be a positive result). This is given by the logarithm of the ratio of the probability of that sequence $x = x_1, x_2,...,x_n$ occurring under the third order Markov model with special vs. fancy parameters: \begin{equation*} \mathrm{log-odds\, score} = \log\frac{P(x|H_0)}{P(x|H_1)}, \end{equation*} where in this case $H_0$ is the Markov model with special parameters and $H_1$ has fancy parameters. Note that since this is a logarithm of a ratio, we can write it as the difference of two logs: \begin{equation*} \mathrm{log-odds\, score} = \log P(x|H_0)-\log P(x|H_1) \end{equation*} If you remember from above, the Markov probability $P(x_1,x_2,...,x_n)$ is the product of numerous probabilities, so this can also be written as a sum of logarithms of probabilities. Why is this nice? Sometimes our probabilities are very small, and multiplying many small numbers together can give you underflow errors in Python. We are able to take the log because the probabilities we're dealing with are positive -- sometimes you have trouble when probabilities are 0, so you add a small epsilon.

Okay, so now we have a log-odds score for every sequence in our test dataset. How do we determine what's special and what's fancy? We want to define a score threshold or cutoff -- any sequence with a log odds score above the threshold is special and anything below it is fancy. Luckily we know the species origin of every sequence in our test dataset. For any threshold that we set, there will be a corresponding number of true positives (special sequences that are identified as special), true negatives (fancy sequences that are identified as fancy), false positives (fancy sequences ID'd as special) and false negatives (special sequences ID'd as fancy). We can calculate these values for a range of thresholds (say, the range of log-odds scores we get across the test dataset), and see how well we do at distinguishing the species. There are many metrics that combine TP, TN, FP, FN that you can consider when setting a threshold, and the homework mentions a few.

Sensitivity, or true positive rate, defined as $\frac{\mathrm{TP}}{\mathrm{TP+FN}}$, is the fraction of special sequences that were identified as special in the test dataset. In other words, it's the fraction of positive sequences that were correctly identified as positive, or the sensitivity of the identification. Specificity, defined as $\frac{\mathrm{TN}}{\mathrm{TN+FP}}$, is the fraction of fancy sequences that were identified as fancy in the test dataset. 1-specificity would give you the rate of false positives, as mentioned in the lecture notes. The ROC plots mentioned in the lecture notes, homework, and above, are plotting sensitivity vs. false positive rate, or 1-specificity, for a range of score thresholds. By looking at this plot, you can see how you trade off true positives for false positives as you change the threshold. If you make the threshold very high, you need a very high log-odds score to be classified as special -- you will probably have a very low false positive rate, but you will probably also have a very low true positive rate, losing a lot of the actual special birds. Alternatively, if you make the threshold very low, you will probably classify all of your special sequences as special, but you will also classify a lot of fancy sequences as special. You could use an ROC plot to determine what your threshold will be -- for example, you could want a low false positive rate of 5% or less, and you could find the corresponding score threshold.

As mentioned in class, ROC plots can be deceiving when you have very high background -- very few special birds and very many fancy birds. At this point, even if you have high sensitivity, and identify a large fraction of the special birds, you will also identify many fancy birds as special. In this case, a useful metric is the false discovery rate, given by $\frac{\mathrm{FP}}{\mathrm{FP+TP}}$, which tells you how much the noise is drowning out the signal.


Now that we've gone over how to do this theoretically, let's try it on some data! You can download the data at these links: special_data.txt and fancy_data.txt. Let's take a look at the special data file.

In [22]:
! head special_data.txt

Looks like some sequences of the notes 'd' and 'r' of varying length! The special dataset is similar. Note that each file has 1000 sequences. We can now load in the data into lists (so we'll have lists of strings).

In [23]:
# initialize special and fancy data lists
special_data = []
fancy_data = []
# load data from files
with open('special_data.txt') as special_file:
    for line in special_file:
with open('fancy_data.txt') as fancy_file:
    for line in fancy_file:

Exercise 1: See if you can use a simple zero order method to differentiate the birdsongs

See if you can use a simple scoring algorithm (+1 for 'd' and -1 for 'r', for example) on the special and fancy data to differentiate the sequences. I would recommend keeping two different arrays of length 1000: one for the scores you give to the fancy sequences, and one for the scores you gives to the special sequences.

In [ ]:
# store length of data
num_seq = len(fancy_data)
# initialize score arrays
fancy_scores = np.zeros(num_seq)
special_scores = np.zeros(num_seq)
# write script to score sequences in both datasets

Now that you have a score for every sequence in the dataset, try plotting a histogram of the distribution of scores for fancy and special. Check out plt.hist() and don't forget to import matplotlib!

In [ ]:
# plot histogram of scores

Try plotting an ROC plot. An easy way of going about this is to calculate the number of true positives, true negatives, false positives, and false negatives for each threshold score that you're considering. I would recommend first defining a range of threshold score values (for example, the minimum and the maximum of all of your scores), defining a linear space between those values (check out np.linspace), and then calculating the TP/FN/TN/FP for each threshold value. In this case, let's say that "positives" are scores above the threshold, and "negatives" are scores below the threshold (this is the typical definition, but I don't have a preconceived notion of how the proportion of 'd's in special birds compares to fancy birds. Remember that the ROC plot is sensitivity (TP/(TP+FN)) vs. false positive rate (1-TN/(TN+FP))

In [ ]:
# find minimum and maximum scores across both fancy and special scores
min_score = 
max_score = 

# define number and range of threshold scores
num_thresh = 1000
threshold_scores = 

# set up 
TP = np.zeros(num_thresh)
FP = np.zeros(num_thresh)
TN = np.zeros(num_thresh)
FN = np.zeros(num_thresh)

# calculate TP/FN/TN/FP for each sequence in both datasets

# plot it!

How well does this method perform in distinguishing the birdsongs?

Exercise 2: Train third order Markov model on fancy and special data

Exercise 2.1: Split up your data into training and testing sets

Now for the fun part! I would recommend splitting up the data into training and testing sets. You can use half of the data for training and half for testing (like the homework), or choose some other fraction (as long as the training and testing sets don't overlap)

In [ ]:
# split up training and testing sets
fancy_train = 
fancy_test = 
special_train = 
special_test = 

Now we want to use our training dataset to calculate the parameters of our Markov model -- the probabilities discussed above. There are many ways you could go about this, so don't feel like you have to follow these steps!

As discussed above, in order to calculate the conditional probabilities in the Markov model (ex: $P(d|ddr)$), we can calculate the unconditional probabilities of the related 4-mers and 3-mers ($P(ddrd)$ and $P(ddr)$ in this case).

How would we go about calculating these probabilities? We want to see how many times each 4-mer appears in the data, and how many times each 3-mer appears in the data, for both fancy and special birds.

Exercise 2.2: How many 4-mers and 3-mers are possible in this dataset?

One useful piece of information would be how many distinct 4-mers and 3-mers there are in the data. We know that there are 2 states in this model: 'd' and 'r'. How many distinct 3-character and 4-character sequences are possible? You could write them out by hand, or you could use probability (hint: does order matter? are you sampling with replacement?)!

Exercise 2.3: Store distinct 4-mers and 3-mers in a list

Now we want to store the distinct 4-mers and 3-mers in a list. Something like ['dddd','dddr',...] and ['ddd','ddr',...]. Again, you could type out all possible combinations by hand, or you could harness the tools of Python to get all possible combinations (this will be even more useful with an alphabet of four letter like 'ACTG')! itertools is one method. Again there are many ways to approach this, so feel free to diverge!

In [ ]:
# find all distinct 3-mer and 4-mer sequences and store them in lists
threemers = 
fourmers = 

Now we want to go through all of the training sequences and count how many times each 3-mer and 4-mer appears in the dataset.

Exercise 2.4: Find frequencies of distinct 3/4-mers in training data

Go through each sequence and identify each 3-mer and 4-mer, and update the counts of the corresponding 3/4-mers. The list.index() method will come in handy here.

In [ ]:
# initialize counts array of 3/4-mers
threemer_counts_special = np.zeros(len(threemers))
threemer_counts_fancy = np.zeros(len(threemers))
fourmer_counts_special = np.zeros(len(fourmers))
fourmer_counts_fancy = np.zeros(len(fourmers))

# go through each sequence and identify 3/4-mers and update counts

Now you can normalize the counts to get frequencies, which we'll use as probabilities in our Markov models.

In [ ]:
# normalize counts to get frequencies/probabilities
threemer_probs_special = 
threemer_probs_fancy = 
fourmer_probs_special = 
fourmer_probs_fancy = 

Now we've trained our models! We can simply divide the relevant 3-mer and 4-mer probabilities when calculating the conditional probabilities. All that's left to do is to test the model on distinguishing the testing set sequences as fancy or special.

Exercise 3: Test the model's performance on the testing set

Use the equation for a third order Markov chain probability to calculate the probability of each sequence given the two Markov models (fancy and special). Use these probabilities to calculate the log odds score, where in this case we are dividing the probability of being special by the probability of being fancy. Remember that we want to use logs of probabilities whenever we are multiplying/dividing them together to avoid underflow errors.

In [ ]:
# initialize score array
fancy_scores = np.zeros(len(fancy_test))
special_scores = np.zeros(len(special_test))
# iterate through each sequence

    # for each sequence, calculate the initial probability 
    # of the first three bases under both the fancy and special models

    # for each sequence, calculate the probabilities of each note given
    # the previous three notes (starting with the fourth note) under both models

    # for each sequence, calculate a log odds score and store it

Now you should have two arrays of scores, one for the special test dataset, and one for the fancy test dataset. You can now repeat the histogram and ROC plot part of exercise 1 on these new scores. Did the model do any better? What's the false positive rate if you set a sensitivity threshold of 0.9 or higher? What if the special birds only made up 1% of the fancy+special bird population? What would the false discovery rate be?


For the purposes of this exercise, I split everything up and wrote a lot of repetitive code. In your actual homework, you can consider putting scripts into functions. If you're interested in doing more with this dataset, you could see if generating random note sequences would be distinguishable from special sequences under a zero order model. You could also see if these sequences could be better distinguished with any other order of Markov model (though, they were generated under a third order model...).