MCB112: Biological Data Analysis (Fall 2017)
 goals this week
 clustering single cell RNAseq data
 kmeans clustering
 mixture models
 the negative binomial distribution
week 06:
mixed feelings
goals this week

Introduction to singlecell RNAseq analysis, and how we use it to identify new “cell types” by clustering.

Understanding the Kmeans clustering algorithm, one of a few basic clustering algorithms.

Viewing Kmeans clustering as a mixture model, an important class of generative probability model, and viewing the Kmeans clustering algorithm as an example of expectation maximization.
clustering single cell RNAseq data
Starting around 2009 people started doing RNAseq experiments on single cells, and the technologies for doing these experiments are advancing rapidly. Single cell RNAseq is a major breakthrough because it allows us to see celltocell heterogeneity in transcriptional profiles, including different cell types. So long as we’re willing to use transcriptional profile as a proxy for “cell type”, by using singlecell RNAseq we can comprehensively survey all the cell types present in some tissue. This promises to be especially powerful in understanding complex biological structures composed of many cell types, like brains.
Figure 1 from R. Sandberg,
A single mammalian cell only contains about 400K mRNAs, and many are inevitably lost in the experimental procedure. Depending on the protocol, perhaps a few million different fragments are generated per cell. There’s little point in sequencing singlecell libraries to the depth of a standard RNAseq library (30M100M reads). Typically, experiments obtain 200K5M reads per cell, and sometimes as few as 50K [Pollen et al 2014].
These reads are distributed over 20K mammalian gene transcripts, so we only get small numbers of counts per typical gene: on the order of 2100 mapped read counts, depending on sequencing depth. Only highly expressed genes are detected reliably, and lowlevel transcripts may not be sampled at all. There is a “dropout” effect, correlated with low expression: puzzlingly frequently, no counts are detected at all for some expressed genes. Statistical modeling of singlecell RNAseq data typically includes a model of the dropout effect [Kharchenko et al, 2014].
For these reasons singlecell RNAseq experiments typically detect expression of fewer genes than population RNAseq experiments do. Population RNAseq experiments are better at deeply sampling the “complete” transcriptome of a population of cells. Single cell experiments are most useful for identifying enough higherexpressed genes to distinguish useful cell types. These used to be called marker genes, when we had few of them – cell types are still often operationally identified by a single marker gene they express highly, such as the parvalbumin+ (Pv+) interneurons in mammalian hippocampus.
One can pool single cell data after clustering to approximate population RNAseq on purified cell type populations, although the higher technical difficulty of single cell RNAseq experiments probably makes this an imperfect approximation, with higher technical noise.
Single cell RNAseq experiments are very noisy. The amount of material is small, and has to be amplified with noisy amplification protocols. Important experimental trickery is used, including the use of UMI barcodes (unique molecular identifiers), to detect and correct amplification artifacts. One good review of the experimental design, analysis, and interpretation of singlecell RNAseq experiment is [Stegle, Teichmann, and Marioni, 2015].
The heart of the analysis, and what concerns us here, is that we want to cluster the transcriptional profiles of the single cells, to identify groups of cells with similar profiles.
Our focus this week is on understanding a classic clustering algorithm called Kmeans clustering.
kmeans clustering
We have data points , and we want to assign them to clusters.
Each is a point in a multidimensional space. It has coordinates for each dimension . In two dimensions () we can visualize the as dots on a scatter plot. It’s easiest to think about Kmeans clustering in 2D, but important to remember that many applications of it are in high dimensions. Gene expression vectors for different genes in each cell are points in a dimensional space.
A cluster is a group of points that are close together, for some definition of “close”. The easiest to imagine is “close in space”. The Euclidean distance between any two points and is
definition of a kmeans clustering
Suppose we know there are clusters. (This is a drawback of Kmeans: we have to specify .) Suppose we assign each point to a cluster (i.e. a set) , so we can refer to the indices . We can calculate the centroid of each cluster, : its mean location, its center of mass, simply by averaging independently in each dimension $z$:
where the denominator is the number of points in cluster – the cardinality of set , if you haven’t seen that notation before.
The centroid is just another point in the same multidimensional space. We can calculate a distance of any of our items to a centroid , the same way we calculated the distance between two points .
For any given assignment consisting of sets , we can calculate the centroids , and then calculate the distance from each point to its assigned centroid . We can calculate a total distance for a clustering :
A Kmeans clustering of the data is defined by finding a clustering that minimizes the total distance , given . We literally try to place means (centroids) in the space, such that each point is close to one of them. That’s the definition of a Kmeans clustering; how do we find one?
kmeans clustering algorithm
The following algorithm almost does the job – well enough for practical purposes. (Almost, because it is a local optimizer rather than a global optimizer, as we’ll discuss in a sec.) Start with some randomly chosen centroids . Then iterate:

Assignment step: Assign each data point to its closest centroid .

Update step: Calculate new centroids .
Iterate until the assignments don’t change.
It is possible for a centroid to have no points assigned to it, so you have to do something to avoid dividing by zero in the update step. One common thing to do is to leave such a centroid unchanged.
The algorithm is guaranteed to converge, albeit to a local rather than a global optimum. It’s sometimes called Lloyd’s algorithm.
local minima problem; initialization strategies
Kmeans is extremely prone to spurious local optima. For instance, you can have perfectly obvious clusters yet end up with multiple centroids on one cluster and none on another.
People take two main approaches to this problem. They try to choose the initial guessed centroid locations “wisely”, and they run Kmeans many times and take the best solution (best being the clustering that gives the lowest total distance ).
Various ways of initializing the centroids include:
 choose random points in space;
 choose random points from and put the initial centroids there;
 assign all points in randomly to clusters , then calculate centroids of these.
soft kmeans
Standard Kmeans is distinguished by a hard assignment: data point belongs to cluster , and no other cluster. We can also imagine doing a soft assignment, where we assign a probability that point belongs to each cluster .
Let’s introduce , the assignment of point , so we can talk about : the probability that we assign point to cluster given the Kmeans centroids .
What should this probability be, since we haven’t described Kmeans in probabilistic terms? (Not yet, anyway.) The standard definition of soft kmeans calculates a “responsibility” :
where for each point , since the denominator forces normalization. We call it a “responsibility” instead of a probability because we haven’t given it any wellmotivated justification (again, not yet).
People call the parameter the stiffness. The higher is, the closer a centroid has to be to a point to get any responsibility for it. In the limit of , the closest centroid gets , and we get standard Kmeans hard clustering as a special case.
The centroid of each cluster is now a weighted mean, with each point weighted by its responsibility (remember that each point is a vector; we’re going to drop the indexing on its dimensions, for more compact notation):
The soft Kmeans algorithm starts from an initial guess of the centroids and iterates:

Assignment step: Calculate responsibilities for each data point , given the current centroids .

Update step: Calculate new centroids , given responsibilities .
This “responsibility” business looks ad hoc. Is there a justification for what we’re doing?
mixture models
A mixture model is a generative probability model composed of other probability distributions. A mixture model is specified by:
 components.
 each component has a mixture coefficient , the probability of sampling each component.
 each component has its own parameters , and generates a data sample with some probability .
The probability of sampling (generating) an observed data point goes through two steps:
 sample a component according to its mixture coefficient
 sample data point from the component ’s probability distribution .
This is a simple example of a hierarchical model. More complicated hierarchical models have more steps. It’s also a simple example of a Bayesian network composed of two random variables, with an random variable for the observed data dependent on a hidden random variable for the component choice.
The probability that an observed data point is generated by a mixture model is the marginal sum over the unknown component :
If we had to infer which (hidden) component generated observed data point , we can calculate the posterior probability:
The mean (average) over many data points assigned to a mixture is, for each component :
You can see this is the update step of soft Kmeans, with a probability in place of the ad hoc responsibility . But there’s a missing piece: we haven’t said anything yet about what the component probability distributions can be.
That’s because the components can be anything you like. It’s most common to make them all the same elemental type of distribution. So we can have mixture exponential distributions with parameters ; we can have Poisson mixture distributions also with parameters ; we can have mixture multinomial distributions with parameters ; and so on.
For example, one common mixture model is a mixture Gaussian distribution, with parameters and for each component. This might arise in trying to fit data that had multiple modes (peaks), with different peak widths, and different proportions of data in each peak. A mixture Gaussian would give:
soft Kmeans as a mixture probability model
Let’s stare at that, comparing it to soft Kmeans. Specifically, consider what the posterior probability looks like for a mixture Gaussian, compared to the responsibility calculation.
First notice that the vectorvalued term in the spherical Gaussian is the square of the Euclidean distance between the point and the mean : . Already we see a parallel between the of the Gaussian and the of soft Kmeans.
But what about ? A mixture Gaussian requires that we specify a variance for each component, and no such term appeared in soft Kmeans. Maybe soft Kmeans is making assumptions that make it go away?
Remember that our are vectors with dimensions , and so are our . We could specify some sort of an asymmetric Gaussian, possibly with correlations along arbitrary axes using a covariance matrix, but let’s keep it simple. Let’s specify symmetric spherical Gaussian distributions for our components, with a single per component.
Let’s go a step further, and assume that there is a constant for all spherical Gaussian components .
Let’s go another step further, and assume that all components are equiprobable: .
Now you should be able to prove to yourself that after making these assumptions, the mixture Gaussian posterior probability equation reduces to:
As if you haven’t already recognized the equation for the responsibilities in soft Kmeans, let’s go ahead and define a stiffness , and:
implicit assumptions of Kmeans
Having derived the responsibility update equation for soft Kmeans from first principles, we’ve laid bare its implicit assumptions:
 the data are drawn from a mixture Gaussian distribution;
 composed of spherical Gaussians;
 each component has an identical spherical Gaussian variance – each component is the same “width”; and
 each component has an equal expected number of data points.
It should not be surprising that Kmeans may tend to fail when any of these assumptions is violated. Standard descriptions of Kmeans will tell you it can fail when the individual data clusters are asymmetrically distributed, have different variances, or have different sizes in terms of number of assigned data points.
fitting mixture models with expectation maximization
Soft Kmeans, as a probability model, is a simplified example of an important class of optimization algorithm called expectation maximization (EM).
Expectation maximization is a common way of fitting mixture models, and indeed other models with hidden variables. You recall I already mentioned that kallisto and other RNAseq quantitation methods use EM to infer which transcript a read maps to, when the read is compatible with multiple reads. That’s essentially a clustering problem too.
For example, here’s how you’d use expectation maximization to fit a mixture Gaussian model to data. To simplify things, let’s assume that there is a single known, fixed variance parameter , as in a soft Kmeans stiffness. But let’s relax one of the other Kmeans assumptions, and allow different mixture coefficients . Then, starting from an initial guess at the means :

Expectation step: Calculate posterior probabilities for each data point , given the current means (centroids) , fixed , and mixture coefficients .

Maximization step: Calculate new centroids and mixture coefficients given the posterior probabilities .
We can follow the total log likelihood as we iterate, and watch it increase and asymptote. When we start from different starting conditions, a better solution has a better total log likelihood.
When we calculate the updated mixture coefficients , that’s just the expected fraction of data points assigned to component :
If we had to fit more than just a mean parameter for each component, we would have some way of doing maximum likelihood parameter estimation given posteriorweighted observations. (For example, if we were also trying to estimate terms, we could calculate the expected variance for each component , not just the expected mean.)
the negative binomial distribution
Singlecell RNAseq data consist of a small number of counts for each gene (or transcript) . Many analysis approaches model these observed counts directly, rather than trying to infer quantitation (in TPM).
If a singlecell RNAseq experiment generated observed counts from a mean expression level like pulling balls out of urns, then the differences we see across replicates of the same condition would just be technical variation due to finite count sampling error. We would expect the observed counts to be Poissondistributed around a mean .
The Poisson distribution only has one parameter. The variance of the Poisson is equal to its mean . This is a strong assumption about .
In real biological data from biological replicates have more variance than the Poisson predicts: they are overdispersed. This shouldn’t surprise you. If the samples are outbred organisms with different genotypes, for example, we get expression variation due to genotype variation. We also get variation from a host of other environmental and experimental factors that are changing when we do biological replicates.
Just as a practical matter, we want a distribution form that allows us to control the variance separately from the mean, so we can model this overdispersion. Empirically, two different distributions have been seen to fit RNAseq count data reasonably well: the negative binomial distribution and the lognormal distribution. The pset this week is based on the negative binomial distribution.
definition of the negative binomial
The negative binomial (NB) distribution appears in various notations and guises. The most relevant one to us [Robinson & Smyth, 2008] is:
Though this may look like gibberish, it’s strongly related to the familiar binomial distribution. Remember, gamma functions are just the continuous extension of the familiar factorial, where for integers , so that mess of gammas is just a binomial coefficient in continuous form. The terms inside parentheses can be written as and , if we say , thus , so again familiar terms of a binomial distribution appear. Suppose is the number of “successes” that occur with probability , and is the number of “failures” that are allowed before we stop. The binomial distribution is the probability of seeing successes (with probability ) before one failure (with probability ). The negative binomial distribution is the probability of seeing successes before failures.
The mean of the NB is . Its variance is .
The parameter is the dispersion. As , the NB asymptotes to the Poisson.
Examples of the negative binomial distribution, compared to the Poisson. 'Open image in new tab' to embiggen.
Once we assume the NB, then for each gene , we need two parameters to calculate a likelihood of its counts in a sample: its mean number of counts , and its dispersion .
SciPy’s negative binomial distribution
I mentioned that the NB appears in various notations. This is a fiddly detail that can bite you in this week’s pset. The NB is defined in different ways, so when you plug numbers into it, you have to be careful you’re using the right numbers. You may have to do some changeofvariables manipulation first.
You’re going to want to calculate NB log likelihoods in the pset, so you need a routine to calculate the NB log pmf (probability mass function) The NB is a discrete distribution, with outcomes defined over integers ).
So you’d naturally head over to
scipy.stats.nbinom
to find scipy.stats.nbinom.logpmf()
. It takes three arguments: the
observed data count , and two parameters and . These
are not and , alas, so we have to do some thinking.
To map between different notations for the NB, it helps to think about where the negative binomial arises most naturally, by comparison to the binomial. Suppose we have some sort of random discrete event (like flipping heads with a coin or rolling a one on a die) where we can talk about a probability of success and of failure. Now we look at a series of events (Bernoulli trials, in the jargon), and we ask: if I’m allowed failures before I have to stop, how many successes am I likely to see?
In a chain of events, I’m allowed any combination of failures and successes, but I know the final event is a failure, because the ‘th failure is what terminates the chain. So the probability of the chain is:
which is just
Now the next thing to notice about this is that one person’s parameter is another person’s random variable – I could say “what’s the probability of seeing successes before the ‘th failure”, or I could just as easily say “what’s the probability that there are failures in a chain that has exactly successes (and ends with a failure)?” So now we have to pay attention to whether someone’s NB PMF is calculating or whether they’re calculating .
Which means we have to read the documentation (and we have to check that it’s right – trust but verify, as Reagan urged us). We see that although Wikipedia shows the PMF in exactly the terms I defined it in, that’s not what SciPy does. SciPy says it calculates:
which (sigh) is the other way: is associated with , is associated with , so if we consider to be the failure probability, SciPy is calculating the PDF of the number of failures, not successes.
Fine, if that’s what SciPy does, let’s just do whatever change of variables we need to to make our equation for map onto SciPy’s PMF. Provided that you remember that for integer , you should be able to prove to yourself that you need:
and now when you call scipy.stats.nbinom.logpmf(x,n,p)
, it will give
you .
fitting the negative binomial distribution to data
Suppose I give you a bunch of samples that have been sampled from a negative binomial distribution with unknown parameters . How do I estimate optimal parameter values?
Maximum likelihood fitting of an NB turns out to be a bit fiddly. For almost all practical uses, an alternative parameter estimation strategy suffices, and it’s a strategy worth knowing about in general: method of moments (MOM) estimation.
The idea is that since I know the relationship between the NB parameters and the mean and variance (in the limit of having lots of observed data):
so why not calculate the mean and variance of the data, then just solve for the parameters:
We can often get away with MOM estimation. In the pset, you can. And in fact, you don’t even need to estimate , because it’s given to you. You only need to estimate the parameter of the NB – and that’s just the sample mean of the count data.