- the probability of boys and girls
- a minicourse in probability calculus
- Markov models as an example
- (binary) classification problems
- occam’s razor
- bonus extra example, with RNA-seq data
- further reading
and now for some probability
the probability of boys and girls
In the 1770’s, the French mathematician Pierre Laplace started working on big data. He became interested in the curious biological observation that the ratio of boys to girls at birth isn’t exactly 50:50. Village records showed an apparent bias towards more boy births, but the numbers were small and vulnerable to statistical fluctuation. Laplace set out to use some new ideas about probability theory to put the question of sex ratio at birth to a mathematically rigorous test, and he needed big data sets to do it. “It is necessary to make use in this delicate research much greater numbers”, he wrote in his classic 1781 paper.
Laplace turned to the extensive census records in London and Paris. The Paris census for the years 1745-1770 recorded the birth of 251527 boys and 241945 girls, a 51:49 (1.04) ratio. The London census for 1664-1757 recorded the birth of 737629 boys and 698958 girls, again a 51:49 ratio.
Laplace’s work is one of the origins of probability theory. Today, his laborious manual calculations are easy to reproduce in a few lines of Python, and his problem makes a compact, biologically-motivated example for us to illustrate some key ideas of probabilistic inference.
(The bias in the human male/female ratio at birth is real and it remains poorly understood. It is typically measured at 1.02 to 1.08 in different human populations.)
the binomial distribution
Let’s call the probability of having a boy . The probability of having a girl is . The probability of having boys in total births is given by a binomial distribution:
A couple of things to explain, in case you haven’t seen them before:
is “the probability of b, given p and N”: a conditional probability. The vertical line means “given”. That is: if I told you p and N, what’s the probability of observing data b?
is conventional shorthand for the binomial coefficient: .
Suppose , and in the Paris data (251527 boys + 241945 girls). The probability of getting 251527 boys is:
Your calculator isn’t likely to be able to deal with that, but Python
can. For example, you can use the
pmf (probability mass function) of
import scipy.stats as stats p = 0.5 b = 251527 N = 493472 Prob = stats.binom.pmf(b, N, p) print(Prob)
which gives .
Probabilities sum to one, so . You can verify this in Python easily, because there are only possible values for , from to .
maximum likelihood estimate of p
Laplace’s goal isn’t to calculate the probability of the observed data, it’s to infer what is, given the observed census data. One way to approach this is ask, what is the best that explains the data – what is the value of that maximizes ?
It’s easily shown that this optimal is just the frequency of boys, . The hat on denotes an estimated parameter that’s been fitted to data.
(In this case, “it’s easily shown” means we take the derivative of with respect to , set to zero, and solve for . Also, if we’re being careful, take the second derivative at that point and show that it’s negative, so we know our extremum is a maximum, not a minimum. Typically – though not in this particularly simple case – you would also impose constraint(s) that probability parameters have to sum to one, using the constrained optimization technique of Lagrange multipliers.)
With , we get . So even with the best , it’s improbable that we would have observed exactly boys, simply because there’s many other that could have happened. The probability of the data is not the probability of ; is not . When I deal you a five card poker hand, it’s laughably unlikely that I would have dealt you exactly those five cards if I were dealing fairly, but that doesn’t mean you should reach for your revolver.
It seems like there ought to be some relationship, though. Our optimal does seem like a much better explanation of the observed data than is.
is called the likelihood of , signifying our intuition that seems like it should be a relative measure of how well a given explains our observed data . We call the maximum likelihood estimate of .
The London and Paris data have different maximum likelihood values of : 0.5135 versus 0.5097. Besides asking whether the birth sex ratio is 50:50, we might even ask, is the ratio the same in London as it is in Paris?
the probability of p
Laplace made an intuitive leap. He reasoned that one value is more probable than another by the ratio of these probabilities:
if we had no other reason to favor one value for over the other. (We’ll see later that he implicitly assumed a uniform prior for .) So is -fold more probable than , given the Paris data. This proportionality implies that we can obtain a probability distribution for by normalizing over the sum over all possible – which requires an integral, not just a simple sum, since is continuous:
This is a probability density, because is continuous. Strictly speaking, the probability of any specific value of is zero, because there are an infinite number of values of , and has to be 1.
It’s possible (and indeed, it frequently happens) that a probability density function like can be greater than 1.0 over a small range of , so don’t be confused if you see that; it’s the integral that counts. For example, you can see that there are large values of in the figure to the right, where I’ve plotted Laplace’s probability densities for the Paris and London data.
In the figure, it seems clear that is not supported by either the Paris or London data. We also see that the uncertainty around does not overlap with the uncertainty around . It appears that the birth sex ratio in Paris and London is different.
We’re just eyeballing though, when we say that it seems that the two distributions for $p$ don’t overlap 0.5, nor do they overlap each other. Can we be more quantitative?
the cumulative probability of p
Because the probability at any given continuous value of is actually zero, it’s hard to frame a question like “is ?”. Instead, Laplace now framed a question with a probability he could calculate: what is the probability that p <= 0.5?, If that probability is tiny, then the odds are that .
A cumulative probability distribution is the probability that a variable takes on a value less than or equal to . For a continuous real-valued variable with a probability density function , . For a continuous probability constrained to the range , and .
So Laplace framed his question as:
Then Laplace spent a bazillion pages working out those integrals by hand, obtaining an estimated log probability of -42.0615089 (i.e. a probability of ): decisive evidence that the probability of having a boy must be greater than 0.5.
(Yes, Laplace gave his answer to 9 significant digits. Go big or go home.)
These days we can replace Laplace’s virtuosic calculations and approximations with one call to Python:
import scipy.special as special p = 0.5 b = 251527 N = 493472 answer = special.betainc(b+1, N-b+1, p) print (answer)
which gives . Laplace got it pretty close – though he wasted a bunch of sig figs.
What sorcery is this
betainc() function? That
call is something called a Beta integral.
The complete Beta integral is:
A gamma function is a generalization of the factorial from integers to real numbers. For integer ,
The incomplete Beta integral is:
and has no clean analytical expression, but statistics packages
typically give you a computational method for calculating it – hence
If you wrote out Laplace’s problem in terms of binomial probability distributions for , you’d see the binomial coefficient cancel out (it’s a constant with respect to ), leaving:
Alas, reading documentation in Python is usually essential, and it
turns out that the
function doesn’t just calculate the incomplete beta function; it
sneakily calculates a regularized incomplete beta integral
by default, which is why all we needed to
do was call:
answer = special.betainc(b+1, N-b+1, p)
Beta integrals arise frequently in probabilistic inference. When we extend them beyond just , we can deal with data from dice and DNA and proteins (), generalizing the binomial distribution to the multinomial distribution, and the Beta integral to the multivariate Beta integral.
Laplace treated the unknown parameter like it was something he could infer, and express an uncertain probability distribution over it. He obtained that distribution by inverse probability: by using , the probability of the data if the parameter were known and given, to calculate , the probability of the unknown parameter given the data.
Laplace’s reasoning was clear, and proved to be influential. Soon he realized that the Reverend Thomas Bayes, in England, had derived a very similar approach to inverse probability just a few years earlier. We’ll learn about Bayes’ 1763 paper in a bit. But for now, let’s leave Laplace and Bayes, and lay out some basic terminology of probabilities and probabilistic inference.
a minicourse in probability calculus
Eight points suffice to grasp the main practicalities of probabilistic inference.
1. random variables
A random variable is something that can take on a value. The value might be discrete (like “boy” or “girl”, or a roll of a die 1..6) or it might be real-valued (like a real number drawn from a Gaussian distribution). We’ll denote random variables or events with capital letters, like . We’ll denote values or outcomes with small letters, like .
When we say (the probability of random variable ), we are envisioning a set of values , the probability that we could get each possible outcome .
Probabilities sum to one. If has discrete outcomes , . If has continuous outcomes , .
For example, suppose we have a fair die, and a loaded die. With the fair die, the probability of each outcome 1..6 is . With the loaded die, let’s suppose that the probability of rolling a six is , and the probability of rolling anything else (1..5) is 0.1. We have a bag with fair dice and loaded dice in it. We pick a die out of the bag randomly and roll it. What’s the probability of rolling 1, 2, …, or 6? We have two random variables in this example: let’s call the outcome of whether we chose a fair or a loaded die, and the outcome of our roll. takes on values or (fair or loaded); takes on values 1..6.
2. conditional probability
is a conditional probability distribution: the probability that X takes on some value, given a value of .
To put numbers to a discrete conditional probability distribution , envision a table with a row for each variable , and a column for each variable . Each row sums to one: .
In our example, I told you : the probability of rolling the possible outcomes 1..6, when you know whether the die is fair or loaded.
|roll R =||1||2||3||4||5||6|
|D = fair:|
|D = loaded:||0.1||0.1||0.1||0.1||0.1||0.5|
3. joint probability
is a joint probability: the probability that X takes on some value and Y takes on some value.
Again, envision a table with a row (or column) for each variable , and a column (or row) for each variable – but now the whole table sums to one, .
For instance, we might want to know the probability that we chose a loaded die and we rolled a six. You don’t know the joint distribution yet in our example, because I haven’t given you enough information.
4. relationship between conditional and joint probability
The joint probability that and both happen is the probability that happens, then happens given :
Also, conversely, because we’re not talking about causality (with a direction), only about statistical dependency:
So for our example, let’s suppose that the probability of choosing a fair die from the bag is 0.9, and the probability of choosing a loaded one is 0.1. That’s . Now we can calculate the joint probability distribution as .
|roll R =||1||2||3||4||5||6|
|D = fair:|
|D = loaded:||0.01||0.01||0.01||0.01||0.01||0.05|
If you have additional random variables in play, they stay where they are on the left or right side of the . For example, if the joint probability of were conditional on :
Or, if I start from the joint probability :
If we have a joint distribution like , we can “get rid” of one of the variables or by summing it out:
It’s called marginalization because imagine a 2-D table with rows for Y’s values and columns for X’s values. Each entry in the table is for two specific values . If you sum across the columns to the right margin, your row sums give you . If you sum down the rows to the bottom margin of the table, your column sums give you . When we obtain a distribution by marginalizing , we say is the marginal distribution of .
In our example, we can marginalize our joint probability matrix:
|roll R =||1||2||3||4||5||6||P(D)|
|D = fair:||0.9|
|D = loaded:||0.01||0.01||0.01||0.01||0.01||0.05||0.1|
Now we have the marginal distribution . This is the probability that you’re going to observe a specific roll of 1..6, if you don’t know what kind of die you pulled out of the bag. You’ve marginalized over your uncertainty of an unknown variable . Because sometimes you pull a loaded die out of the bag, the probability that you’re going to roll a six is slightly higher than .
If I have additional random variables in play, again they stay where they are. Thus:
Two random variables and are independent if:
which necessarily also means:
In our example, the outcome of a die roll is not independent of the die type , of course. However, if I chose a die and rolled it times, we can assume the individual rolls are independent, and the joint probability of those rolls could be factored into a product of their individual probabilities:
In probability modeling, we will often use independence assumptions to break a big joint probability distribution down into smaller tables, to reduce the number of parameters in our models. The most careful way to invoke an independence assumption is to write the joint probability out as a product of conditional probabilities, then specify which conditioning variables are going to be dropped. For example, we can write as:
Then state, “and I assume Y is independent of Z, so:”
It’s possible to have a situation where is dependent on in , but when a variable is introduced, . In this case we say that is conditionally independent of given . For example, could cause , and could cause ; ’s effect on is entirely through . This starts to get at ideas from Bayesian networks, a class of methods that give us tools for manipulating conditional dependencies and doing inference in complicated networks.
7. Bayes’ theorem
We’re allowed to apply the above rules repeatedly, algebraically, to manipulate probabilities. Suppose we know but we want to know . From the definition of conditional probability we can obtain:
and from the definition of marginalization we know:
Congratulations, you’ve just derived and proven Bayes’ theorem:
If you assume that is a constant (a uniform prior) it cancels out, and you recognize Laplace’s inverse probability calculation.
8. why Bayes’ theorem is a big deal
If we were just talking about and as random variables, Bayes’ theorem would be trivial. I mean, we just derived it in a couple of lines.
Things get interesting when we talk about observed data and a hypothesis :
The probability of our hypothesis, given the observed data, is proportional to the probability of the data given the hypothesis, times the probability of the hypothesis before you saw any data. The denominator, the normalization factor, is : the probability of the data summed over all possible hypotheses.
, the probability of the data, is usually the easiest bit. This is often called the likelihood. (It’s the probability of the data ; it’s the likelihood of the model .)
is the prior.
is sometimes called the evidence: the marginal probability of the data, summed over all the possible hypotheses that could’ve generated it.
is called the posterior probability of .
Bayes’ theorem looks like it’s telling us how to do science. Bayes’ theorem gives us a principled way to calculate the posterior probability of a hypothesis , given data that we’ve observed.
Well, hold on – maybe, but we do have some problems. Where do we get from, if it’s supposed to be a probability of something before any data have arrived? We may have to make a subjective assumption about it, like saying we assume a uniform prior: assume that all hypotheses are equiprobable before the data arrive.
Second, how do we enumerate all possible hypotheses ? Sometimes we’ll be in a hypothesis test situation of explicitly comparing one hypothesis against another, but in general, there’s always more we could come up with.
And third, perhaps most worrying of all, what does it mean to talk about the probability of a hypothesis? You might well argue that a hypothesis is either true or false; it’s not a repeatable experiment that you’re sampling observations from and getting a frequency.
A key feature of Bayesian inference is that we treat probability as a degree of belief, not just as a sampling frequency. This difference is the principal difference between two statistical philosophies, “Bayesians” and “frequentists”.
Markov models as an example
For many common sequence analysis problems, we want to be able to express the probability of a sequence of length . Let’s make it a DNA sequence, with an alphabet of 4 nucleotides ACGT, to be specific; but it could also be a protein sequence with an alphabet of 20 amino acids, or indeed some other kind of symbol string.
Why would we want to express the probability of a sequence, you ask? One good starting example is, suppose there are two different kinds of sequences with different statistical properties, and we want to be able to classify a new sequence (a binary classification problem). We want to express and for two probability models of the two kinds of sequences. Using those likelihoods, if we also know the prior probabilities and , we could calculate a posterior probability for assigning a given sequence to each model. Even if we don’t know those priors (which would often be the case), we can calculate a log-odds score as a well-justified measure of the evidence for the sequence matching model 1 compared to 2.
For interesting sequence lengths , we aren’t going to be able to express directly, because we’d need too many parameters: of them, for DNA. We need to make assumptions: we need to make independence assumptions that let us factorize the joint probability in terms of a smaller number of parameters.
the i.i.d. sequence model
For example, we could simply assume that each nucleotide is drawn from a probability distribution over the four nucleotides, and each nucleotide in the sequence is independently drawn from the identical distribution. Then:
This is about the simplest model you can imagine. It is called an independent, identically distributed sequence model, or just i.i.d. for short. If someone says an “i.i.d. sequence”, this is what they mean.
You could make two i.i.d. models and with different base composition – one with high probabilities for AT-rich sequence, the other with high probabilities for GC-rich sequence – and you’d have the basis for doing a probabilistic binary classification of sequences based on their nucleotide composition.
Let’s look in detail at how an i.i.d. model is derived, in terms of probability algebra for manipulating joint and conditional probabilities. Suppose (for sake of example) we have a sequence of six residues, , where each symbol represents a random variable that takes a value a/c/g/t. We want to express a joint probability . We can exactly factor that joint probability into a series of conditional probabilities like so:
All exact, so far; no approximation. All we did was converted joint probabilities to conditional probabilities, repeatedly.
Now we can look at terms like, e.g., and make independence assumptions. depends on ? No it doesn’t, we can assume: we can remove one, some, or all of the dependencies, and that corresponds to stating an assumption that the nucleotide at doesn’t depend on the nucleotide at the variable (position) we remove – or at least, doesn’t depend enough that we care, for the purposes of making a model, given the cost/benefit tradeoff of balancing the number of free parameters we need to estimate, versus what we gain from a more exact model.
The most extreme and simple thing we could assume is that doesn’t depend on any other position: . Make that assumption all the way down the chain and we have . That’s the independent part of an i.i.d. model.
position-specific scoring models for motifs
We would still have different distributions at the six positions. Maybe position is often an A, position is usually C or G, position is equiprobably any of the four, and so on. You can imagine if we had a multiple sequence alignment of example sequences, we could count up the frequencies we observe in each position, and use those as model parameters. We’d need parameters for this model, but that’s a much more compact representation than the we needed before making an independence assumption. This is a common model of short, fixed-length DNA sequence motifs: people call it a position-specific scoring model (PSSM) or a position weight matrix (PWM).
But to simplify our model still more, we could state a further assumption that each of the six positions is drawn from the same nucleotide frequency distribution – and that identically distributed assumption gets us to the i.i.d. model.
first order Markov models of dinucleotide composition
Suppose we’re ok with the identically distributed assumption – we’re not trying to capture position-specific statistical information (like in one particular conserved DNA sequence motif), we’re trying to capture overall statistical properties and biases in a DNA sequence. But suppose that complete independence seems too strong.
For example, in coding sequences, individual bases aren’t independent; they come in triplets because of the genetic code. We could do this, to factor a sequence of length in terms of conditional probabilities of triplets (codons), followed by an independence assumption that codons are statistically independent, i.e.:
But this is a pretty specialized model that depends on the fact that there’s a reading frame: we know exactly where to break the sequence into triplets. What if there’s no frame?
For example, many eukaryotic genome sequences show a strong depletion of CG dinucleotides. If you count up mononucleotide and dinucleotide frequencies, you see that . If we want to capture the fact that G is disfavored if the previous nucleotide was a C, we need something more than just an i.i.d. model. (And C is disfavored after a G, because of the other strand; the reverse complement of CG is GC.) Biologists refer to this as CpG bias, where the p in CpG indicates the phosphodiester linkage along a DNA strand – we talk about CG composition so much in other contexts, we say CpG to keep it straight that we’re talking about adjacent nucleotides.
If you tried to make a model that takes CpG bias into account, you might think you could do something with multiplying dinucleotide frequencies together, but probability algebra doesn’t work that way.
So let’s back up. Let’s go back to:
Dinucleotide composition bias means that we’re not going to assume that is independent of its neighbor , but we’re still going to drop the rest:
If we make the identically distributed assumption, that the same dinucleotide bias holds everywhere in the sequence, then we can get to a general form:
We have two kinds of parameters. The main parameters are the parameters for the probability that nucleotide follows . We also need to get the chain started somehow, so we need for the nucleotide probabilities at the first position.
K-th order Markov models
What we’ve just derived is called a first order Markov chain. In general, when someone says Markov model, they’re talking about a model where they assume that the identity of position $i$ depends on one or more previous positions , etc.
This generalizes to higher order Markov models. A 2nd order Markov chain would have terms and initial probabilities . 3rd order would have and . In general, for a -th order Markov model:
(binary) classification problems
As mentioned above, a common problem is classification: we have probability models for classes, and we want to figure out how to classify a data point . Bayes tells us that we want to calculate :
If we knew the likelihoods and the priors , then the are literally the probabilities that came from class . If you were betting, this is how you’d want to calculate your odds. For example, in our fair vs. loaded dice example, we could really imagine knowing (by definition ), (by rolling a loaded die a bazillion times and counting frequencies), and versus (if the game is to pull a die at random out of a bag, and we happen to know how many dice of each type are in the bag). If the game is, I draw a die at random from the bag, and roll it, you could precisely calculate the probability that the die was fair vs. loaded.
However, in many cases I don’t know the priors . I just have a data sample , and I can calculate the likelihoods for two or more models. Now I can only calculate the posterior probabilities up to an unknown constant. How should I assign a “score” to data sample , in order to decide which class it belongs to?
Suppose we’re talking binary classification, and I have two models and . Then it turns out that a good score to calculate is the log-odds score, also known as a log likelihood ratio (LLR):
The LLR is positive if fits the data better, negative if is better, and zero if the two models explain the data equally well.
Why take the logarithm? There’s a good practical reason. Probabilities can easily become so small that they underflow the computer’s floating point representation. Except for small problems, we typically work with probability calculations as sums of log probabilities, not products of probabilities.
derivation of log-odds scores
LLR scores are extremely common in data analysis, including biological data analysis. They have a justification you can derive easily.
Starting from Bayes’ theorem, dividing through first by in both numerator and denominator, then by :
This is showing us that we can rearrange the posterior probability in terms of an odds ratio that we can calculate, and a prior ratio that we can’t.
Suppose we go ahead and define an LLR score , then:
You might recognize this as a sigmoid function for a constant . For high scores, it asymptotically approaches 1; for low scores, it approaches 0; and it rises through 0.5 at . If the two models are a priori equiprobable, the sigmoid curve is centered with its midpoint at . That is, an LLR score of 0 means that the two models are a posteriori equiprobable.
The prior log odds ratio acts as a constant offset on the score. If is a priori more probable, then is positive – the sigmoid curve shifts to the right, signifying that it takes more LLR score to convince you that the data favor , even though was less probable a priori.
Sometimes is called the Bayes factor.
evaluating classification performance: ROC plots
Suppose we’ve developed a score for a binary classifier – whether it’s a log-odds score as a log likelihood ratio, or something arbitrary. We could choose to set a score threshold such that if we assign it to class , and otherwise we assign it to . Suppose we’re looking for a particular class of interesting things (“positives”, class ) against a background of uninteresting things (“negatives”, class ).
Indeed might be so boring and uninteresting that it’s our model of “nothing is going on with except random chance”: it is a null hypothesis.
Where should we set the threshold , to achieve the best results in discriminating positives from negatives?
There is no one answer to this question, because it depends on whether we care more about not missing any positives, or about not making a mistake of misclassifying a negative as a positive.
Imagine a little 2x2 matrix of the truth versus our classification:
- true positive (TP) is when we call positive and it is one;
- true negative (TN) is when we call negative and it is;
- false positive (FP) is when we call positive and it’s really noise
- false negative (FN) is when we call negative and it’s really signal
From these counts, we can calculate:
“Sensitivity” goes by other names, including the true positive rate, or recall.
1 - specificity is also called the false positive rate.
If we set the score threshold lower (less stringent), we classify more stuff positive: our sensitivity increases, but our specificity drops. We can ultimately achieve 100% sensitivity, but 0% specificity, by calling everything a positive. If we set the threshold higher, the sensitivity drops, but our specificity increases; we achieve 100% specificity, but 0% sensitivity, by calling everything a negative.
A plot of sensitivity versus false positive rate (both from 0 to 1.0) for all possible choices of threshold is called a receiver operating characteristic plot (ROC plot). A perfect ROC plot leaps immediately to 100% sensitivity at 0% FPR. Random guessing gives you a diagonal line.
(The name “receiver operating characteristic” is a historical artifact, as you might guess. It’s military jargon that arose in WWII for radar receivers distinguishing blips as enemy vs. friendly aircraft. Correct friend/foe classification is an important “operating characteristic” for a military radar receiver.)
a caveat to ROC plots
When you’re in a situation of detecting a small number of positives against a background of a large number of negatives – a needle in the haystack problem, common in genome-scale biological data analysis – a ROC plot can give you a misleading sense of the accuracy of a classification procedure. The number of false positives you detect will scale with the number of negatives you screen. You can have a procedure with a high specificity that still detects an unacceptable number of false positives, if the positives you’re looking for are rare.
For this reason, there’s two other common statistics that people calculate, which take into account the relative frequency of positives versus negatives:
Of the set of things that you classify as positive, PPV is the fraction that you’re right about, and FDR is the fraction that you’re wrong about. Notice that FDR = 1 - PPV.
For the reason above, it is quite possible to have a low false positive rate but an unacceptably high FDR.
The wikipedia page on sensitivity and specificity gives chapter and verse on these concepts and more.
Occam’s razor tells us to favor simpler hypotheses. But a more complex model, with more free parameters, will generally produce a better fit to the observed data. Imagine fitting a curve to some data points. As we allow more parameters in our curve, we eventually fit the data points exactly, even if we’re just fitting noise. How do we decide when adding more parameters to a model is justified by the data? A remarkable and beautiful feature of Bayesian inference is that an automatic Occam’s razor appears in the equations.
Imagine a trick coin factory that produces biased coins, such that any given coin can have any probability of flipping heads from 0 to 1, uniformly distributed. I give you a coin, and I tell you there’s a 50:50 chance that it’s a fair coin versus a trick coin. You flip the coin 100 times, and you observe heads. Now I’m going to offer you a bet on whether I gave you a fair or a trick coin. Can you calculate fair betting odds, given your observation of heads out of 100 flips?
Making a maximum likelihood estimate of the coin’s parametric (i.e.\ true, unknown) is unhelpful to you. The “trick coin” hypothesis with a maximum likelihood estimate , by definition, cannot give you a worse likelihood than the “fair coin” hypothesis that , because the trick coin hypothesis includes itself as a possibility. Assuming a trick coin is guaranteed to give the best fit to the observed number of flips.
The trick coin factory is a simple example of having two hypotheses and , where is a more complex hypothesis with more free parameters (in this case, one versus zero). In particular, the trick coin factory is an example of a common situation called nested hypotheses, where is a specific case of – the free parameter(s) of can be set so that . In this situation, can never be a worse explanation for the data than , because includes . If I fit to the data by making a maximum likelihood estimate , we know the simple hypothesis cannot have a better likelihood; . No Occam’s razor here. Maximum likelihood fitting favors more complex hypotheses.
Bayesian inference tells me that I’m supposed to compare to , not to . The unknown value of for is a so-called nuisance parameter. I can’t calculate a likelihood for without , but I don’t know what is. Probabilistic inference tells me I have to integrate it out by marginalization:
(This calculation turns out to be another Beta integral, but you don’t quite have all the machinery to solve it yet. The first term is just a binomial distribution again, but we would need introduce something called the Dirichlet distribution as a convenient way to parameterize .)
Intuitively, here’s what will happen. By varying , we can make consistent with lots of different observed data – any possible count from 0 to 100. , with its fixed value , is only capable of explaining a narrow range of observed data. The likelihood of a hypothesis is a probability, so it must sum to one over all possible observed data. There’s only so much probability to go around. A more complex hypothesis that’s compatible with lots of different observed data must necessarily tend to assign lower probability to any given set of observations. A simpler hypothesis can commit more of its probability mass onto the narrower range of observations that it’s compatible with.
Thus, an Occam’s razor arises naturally. If a simpler hypothesis can explain the data well, it will tend to have a higher posterior probability, compared to a more flexible hypothesis. This doesn’t happen if you make optimized point estimates for model parameters; it only happens if you integrate out the unknown parameters.
For example, if I calculate versus for the trick coin factory using a uniform Dirichlet prior, I get the result shown in the figure above. The trick coin hypothesis can equally well explain any observation (with , unsurprisingly). The fair coin hypothesis concentrates its probability mass around the expected . Between about and heads, the fair coin hypothesis has a higher likelihood than the trick coin factory hypothesis.
(My example is contrived to match a famous figure from David J.C. MacKay’s 1992 thesis.)
The plot also shows the likelihood for comparison, showing that the maximum likelihood fitted model always dominates the simpler nested hypothesis .
Even if we observed exactly heads, we can’t be sure that the coin was fair, because trick coins can flip heads too. The probability of observing is 0.0796 for a fair coin, 0.0099 for a trick coin (averaged over uniform ). The odds are about 8:1 in favor of it being a fair coin, if we flip heads with it. You can verify this yourself with a little Python script.
bonus extra example, with RNA-seq data
Suppose we’ve done a single cell RNA-seq experiment, where we’ve treated cells with a drug (or left them untreated), and we’ve measured how often (in how many single cells) genes A and B are on or off. Our data consist of counts of 2000 individual cells:
# Counts: T=no T=yes B=ON B=off B=ON B=off A=ON 180 80 10 360 A=off 720 20 90 540
The joint probabilities sums to one over everything. So normalize by dividing everything 2000 (the total # of cells):
# P(A,B,T) T=no T=yes B=ON B=off B=ON B=off A=ON 0.09 0.04 0.005 0.18 A=off 0.36 0.01 0.045 0.27
Any marginal distribution is a sum of the joint probabilities over all the variables you don’t care about, leaving the ones you do. For example, to get , we sum over A,B: , which leaves:
# P(T) T=no T=yes 0.5 0.5
A conditional distribution can be arrived at in a couple of different ways. One is to imagine building a separate joint probability table for each value of the condition. So we could for example focus just on the condition T=no; the subtable with T=no is:
# Counts, T=no B=ON B=off A=ON 180 80 A=off 720 20
and if we normalize that we get :
# P(A,B | T=no) B=ON B=off A=ON 0.18 0.08 A=off 0.72 0.02
and we could do the same for the T=yes condition.
The other way is to manipulate things with probability calculus. Since , we can get a conditional probability distribution from the joint:
so the full table for is:
# P(A,B | T): T=no T=yes B=ON B=off B=ON B=off A=ON 0.18 0.08 0.01 0.36 A=off 0.72 0.02 0.09 0.54
marginalizing a conditional probability
If we marginalize a conditional distribution, the conditioning stays as it was. For example, , so:
# P(B | T): T=no T=yes B=ON B=off B=ON B=off 0.9 0.1 0.1 0.9
and similarly we can get :
# P(A | T): T=no T=yes A=ON 0.26 0.37 A=off 0.74 0.63
conditioning a conditional probability
Similarly when we make a new conditional distribution from a conditional, the thing we divide through by has the same condition. For example:
so the table for looks like:
# P(A | B,t): T=no T=yes B=ON B=off B=ON B=off A=ON 0.2 0.8 0.1 0.4 A=off 0.8 0.2 0.9 0.6
If A is independent of B, then = . Here that clearly isn’t true. For example, the frequency of A+ cells among untreated cells is 0.26, but the frequency of A+ cells among B+ untreated cells is 0.2.
If you think for a bit about what the numbers in this example are saying, you’ll realize that there’s something very counterintuitive going on.
Suppose I only look at the response of gene A to the treatment T: i.e. the distribution . The fraction of cells positive for gene A clearly goes up after the cells are treated: from 0.26 to 0.37.
Now look at the response of gene A to treatment T when a cell is positive for gene B: i.e. . The fraction B+ cells positive for gene A goes down by two-fold after treatment: from 0.2 to 0.1.
Now look at the B- cells: i.e. . The fraction B- cells positive for gene A also goes down by two-fold after treatment: from 0.8 to 0.4.
So that means that if you hadn’t measured B, you would’ve thought that the fraction of A+ cells was going up after treatment. But if you do measure B, in both B+ or B- cells, the fraction of A+ cells is going down after treatment.
This is Simpson’s paradox.
One way to think about what’s happening is that your intuition is (mistakenly) comparing to in your head as if they’re directly comparable, and you can think of any difference you see as an effect on A. But is related to by marginalization:
so there’s another way to affect “indirectly”, which is through . If A has a strong dependency on a hidden variable B, you’ll get variation in A if you vary B. If your experiment varies something else (T), that’s correlated with B, and you observe a change in A, you can’t necessarily conclude that your T directly changed A; you might have only changed B.
Figure 1A from Trapnell (2015). Expression levels of genes A and B look anticorrelated (left), but you can get this result from having two different cell types in the experiment (blue and green, right), where gene A and B are positively correlated in each individual type.
For example, suppose B is really a cell type marker that has nothing directly to do with regulation of A, and that treatment causes B- cells to grow fast for some reason. Then after treatment, you’ve increased the fraction of B- cells in the population. If B- cells tend to express A at high level, and B+ cells express at low level, then the population-averaged level of A+ cells can look like it went up, even if the treatment actually downregulated A in both B+ and B- cells.
You can get Simpson’s paradox effects with measured RNA-seq TPM values too. The nice figure to the right comes from 2015 review by Cole Trapnell, Defining cell types and states with single-cell genomics. Although Trapnell’s article is about how single-cell RNA-seq experiments help avoid artifacts of bulk population averaging, the figure shows a plot of gene A and B levels in single cells, so the effect it illustrates can arise even in single cell data.
One famous example of Simpson’s paradox occurred in a study of gender bias in PhD admissions at UC Berkeley. Statistics appeared to show that the chances of getting admitted to Berkeley are better if you’re a man: 44% of men were admitted, but only 35% of women. But when the statistics were broken down by department, in each individual Berkeley department, the probability that a woman would be admitted was actually higher than the probability a man would be admitted. What was happening was that some departments have much lower admissions probabilities than others, and women were disproportionally applying to highly selective departments: the observed effect on is through a confounding correlation .