### MCB112: Biological Data Analysis (Fall 2017)

- again the problem of visualizing “high dimensional” data
- the key idea behind t-SNE (and SNE in general)
- the road to t-SNE lies through SNE
- on to t-SNE
- t-SNE in a nutshell
- three practicalities for the pset

# week 13: *t-SNE*

Old friends make their farewell appearance in this final week of the course… Student’s t distribution, entropy of probability distributions, optimization of objective functions using gradient information, and principal components analysis among them. It’s almost like there was a plan all along.

### again the problem of visualizing “high dimensional” data

RNA-seq data is “high dimensional” in the following sense. RNA-seq data is a matrix of M samples x N genes, with mapped read counts or some other sort of quantitation (TPM, FPKM) for each gene in each sample. Each sample (a cell, tissue, or cell type) is associated with a list of expression levels for N genes; we can think of this as a vector in an N-dimensional space. Samples with similar gene expression patterns are close in “gene space”, as measured by Euclidean distance (or some other distance measure we might choose).

We can just as readily think of each *gene* being associated with a
list of expression levels in M samples, hence as a vector in an
M-dimensional space, and we might want to see how genes cluster in
“sample space”.

We want the distances in these “spaces” to reflect biologically meaningful similarity between two samples, or two genes. We have to be careful about what the numbers in these vectors are, and what we think they mean. If we’re using mapped read counts, we’d want to normalize them somehow to account for the fact that one sample might have more total reads than another; we don’t want to consider two samples to be different and distant just because we mapped more reads to one of them. If we’re going to make some sort of assumption about how distances are distributed – for example, if we’re going to use some clustering technique that assumes Gaussian distributions – we might want to put our matrix in units of log counts, not raw counts, as we’ve seen in using k-means clustering with RNA-seq data.

It’s difficult to visualize clustering in high-dimensional spaces. There are a variety of techniques for trying to project high-dimensional data down to two or three dimensions for visualization purposes. We’ve already seen one of the main techniques, principal components analysis (PCA). A limitation of PCA is that it corresponds to a rigid orthogonal rotation of the original space followed by a projection to 2D - you’re projecting the data onto a plane through the original space. If there are only a few well-separated clusters, dominated by variance along only a few directions, PCA can do a fine job. But in many cases, a linear method like PCA isn’t good enough, and we need to do some sort of nonlinear projection into 2D.

A new nonlinear visualization method for high-dimensional data called
*t-SNE* was introduced by
Laurens van der Maaten and Geoff Hinton in 2008.
t-SNE stands for **t-distributed stochastic neighbor embedding**.
t-SNE produces evocative, strangely beautiful visualizations. Probably
for that reason as much as any, it has become extraordinarily popular
in biological data analysis and machine learning in general. A t-SNE
plot is almost *de rigueur* in a single cell RNA-seq paper these days.
The figure to the right shows an example from a recent paper that
sequenced 42,035 single cells from whole *C. elegans* animals, showing
how cells cluster into tissues like neurons and muscle.

Figure 3A from [Cao et al. 2017], showing a t-SNE plot of 42,035 *Caenorhabditis elegans* single cells. ('open image in new tab' to embiggen.)

What does t-SNE do? How does it work? What does it assume?

### the key idea behind t-SNE (and SNE in general)

Each data point is given an ordered preference for a relatively small number of neighbors in the original space. In the so-called “embedded space” (in the two-dimensional t-SNE plot), the points are arranged to preserve that preference and order.

A bit more specifically, each point has a specific preference in terms of a probability distribution for visiting neighboring points. As it visits its neighborhood, some points are visited more than others; distant points are rarely if ever visited.

The size of the “neighborhood” is defined as an effective number of
neighbor points that each point cares about. This is a controllable
parameter to t-SNE. Perplexingly, it’s called the
**perplexity**. We’ll see why; it’s information theory jargon, related
to Shannon entropy and the precise definition of what we mean by an
“effective number” of neighbors, when they have different visitation
probabilities.

The particular form of the probability distribution that t-SNE uses in the original high-dimensional space is a spherical Gaussian, where each point $i$ has its own parametric standard deviation $\sigma_i$; whereas in the embedded low-dimensional space, it uses the same distribution for every point (a Cauchy distribution, a.k.a. a Student t distribution with one degree of freedom). This has the effect that points in sparse versus dense regions in the original space tend to get more evenly distributed in the embedded space, because t-SNE cares about preserving neighbor relations, not absolute distances.

This evenness probably contributes to the visually pleasing geometry of t-SNE plots.

Go visit this
excellent article on t-SNE,
written by Martin Wattenberg, Fernanda Viegas, and Ian Johnson at
Google. The article is written in distill.pub,
a fabulous hypermodern forum for articles that use high-quality,
interactive data visualization to explain machine learning
techniques. It will give you a good high-level intuition for t-SNE –
and it uses MCB112-style control experiments, using t-SNE to project
*known 2D data clusters* into 2D t-SNE space, and you can watch many
ways in which t-SNE fails spectacularly, though always beautifully. (I
think the distill.pub article has one problem: it doesn’t sufficiently
consider the fact that t-SNE’s objective function isn’t convex, so it
can get stuck in local optima. Some of the article’s bizarre plots may
simply be spurious local optima that would look more sensible if you
re-ran t-SNE a few times and chose a better optimum. The local optimum
problem only comes up pretty far down in the article, and even then,
it presents results of 5 runs as if they’re equivalent, without
considering the value they achieved for the objective function.)

### the road to t-SNE lies through SNE

My explanation follows the concise and clear explanation in the 2008 van der Maaten and Hinton paper. They start by explaining SNE, stochastic neighbor embedding without the t, which was introduced in 2002 by Geoff Hinton and Sam Roweis.

We are given $M$ data points $x_1 \ldots x_M$, each of which is a vector in an $N$-dimensional space.

We aren’t going to need to index the dimensions; we’re going to write
everything in compact vector notation. But just so you remember: when
we write $\| x_i - x_j \|$ for the Euclidean distance
between points $x_i$ and $x_j$ (i.e. the Euclidean length, or
*norm*, of the resultant vector $x_i - x_j$), that’s
$\sqrt{\sum_d (x_{id} - x_{jd})^2}$. A fancy name for an ordinary
Euclidean distance is the $L^2$ (or just L2) norm.

We aim to project the original points $x_i$ to new data points $y_i$ in two dimensions.

First we define a conditional probability $p_{j \mid i}$ for the probability of visiting neighbor $j$ starting from point $i$. If we assume that we visit neighbor $j$ as a function of its distance in proportion to a spherical Gaussian distribution, then the probability that we visit $j$ relative to all other points $k$ (other than $i$ itself) is:

$$
p_{j \mid i} = \frac{\exp \left( - \| x_i - x_j \|^2 / 2 \sigma_i^2 \right)}
                    {\sum_{k \neq i} \exp \left( - \| x_i - x_k \|^2 / 2 \sigma_i^2 \right)}
$$
The $p_{i \mid i}$ are set to zero. We need a $\sigma_i$ for each
point $i$. We are going to set $\sigma_i$ so that point $i$ has
a certain *effective number of neighbors*. We’ll make that notion
precise using Shannon entropy. The Shannon entropy of the $p_{j \mid i}$ distribution is:

$$
H(P_i) = - \sum_j p_{j \mid i} \log_2 p_{j \mid i}
$$

Recall that a Shannon entropy $H$ is the number of yes/no questions it
takes to specify one of $2^H$ equiprobable outcomes; for example, it
takes 2 bits to specify one of the 4 DNA nucleotides ($\log_2 4 = 2$), and 10
bits to specify one of the 1024 possible DNA sequences of length 5
($\log_2 4^5 = 10$). Turning that around, if I let you ask $H$ yes/no
questions, you could distinguish one of $2^H$ equiprobable
possibilities. For non-uniform probabilities, we can still calculate
$2^{H(P_i)}$ and consider it some sort of “effective” number of
possibilities for our probability distribution. This is called the
**perplexity** of a probability distribution: $\mathrm{Perp}(P_i) = 2^{H(P_i)}$.
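For instance, a uniform distribution over four outcomes has perplexity exactly 4, while a skewed distribution over the same four outcomes has a smaller “effective number”. A quick sketch (the helper name `perplexity` is mine):

```python
import numpy as np

def perplexity(p):
    # perplexity = 2^H, where H is the Shannon entropy in bits;
    # zero-probability outcomes contribute nothing to H
    p = p[p > 0]
    H = -np.sum(p * np.log2(p))
    return 2.0 ** H

uniform4 = np.ones(4) / 4            # e.g. the four DNA nucleotides
skewed   = np.array([0.7, 0.1, 0.1, 0.1])
print(perplexity(uniform4))          # 4.0: four equiprobable outcomes
print(perplexity(skewed))            # < 4: fewer "effective" outcomes
```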

The larger $\sigma_i$ is, the more spread out the Gaussian, and
point $i$ visits more neighbors. As $\sigma_i$ decreases toward zero,
the Gaussian contracts and only the one closest neighbor is ever
visited, relative to the others. For each point $i$, we dial
$\sigma_i$ in to get to our chosen perplexity value.
(We’ll talk about the numerical details of that below; it becomes an
example of a general class of problems called *one dimensional
root-finding* problems.)
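Concretely, the dependence of perplexity on $\sigma_i$ looks like this (a sketch; `cond_p` and `perplexity` are my own helper names, and the distances are made up for illustration):

```python
import numpy as np

def cond_p(d2_row, i, sigma):
    # p_{j|i} for one point i, given its row of squared distances ||x_i - x_j||^2
    w = np.exp(-d2_row / (2.0 * sigma**2))
    w[i] = 0.0                       # a point is never its own neighbor
    return w / w.sum()

def perplexity(p):
    p = p[p > 0]
    return 2.0 ** (-np.sum(p * np.log2(p)))

d2 = np.array([0.0, 1.0, 4.0, 9.0, 25.0])    # squared distances from point 0
for sigma in (0.3, 1.0, 3.0, 10.0):
    print(sigma, perplexity(cond_p(d2, 0, sigma)))
# as sigma grows, perplexity climbs from ~1 toward 4 (the number of neighbors)
```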

It’s worth noting that the perplexity is a *global* parameter to SNE:
all the points want to visit the same number of neighbors, even
if some are in dense clusters and some are outliers in the
original space. This is responsible for some of the counterintuitive
behaviors described in the
distill.pub t-SNE visualizations.

In the embedded 2D space, we define the neighbor relations with conditional probabilities $q_{j \mid i}$ that are also calculated as Gaussians, but with a fixed variance $\sigma^2 = \frac{1}{2}$, so that:

$$
q_{j \mid i} = \frac{\exp \left( - \| y_i - y_j \|^2 \right)}
                    {\sum_{k \neq i} \exp \left( - \| y_i - y_k \|^2 \right)}
$$

with $q_{i \mid i}$ defined to be zero.

We try to find positions $y_i$ that make all $q_{j \mid i} \simeq p_{j \mid i}$. A typical measure for the difference between two probability distributions $P$ and $Q$ is the so-called Kullback-Leibler divergence, $\mathrm{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$. The sum of the KL divergences for all points $i$ and their neighbor distributions is:

$$
C(Y) = \sum_i \mathrm{KL}(P_i \| Q_i) = \sum_i \sum_j p_{j \mid i} \log \frac{p_{j \mid i}}{q_{j \mid i}}
$$

I write that this is a function $C(Y)$, because what we’re going to do in SNE is to find points $Y$ (the $y_i$) that minimize this objective function, given the original points $X$. We can use standard numerical function minimization, starting from a random guess at the $y_i$. The objective function is differentiable, so we also have gradient information to help this minimization. (I don’t show the gradient for SNE, since we’re moving straight along to t-SNE.)

The SNE objective function isn’t convex; it has local optima. We at least need to do multiple optimization runs from multiple starting points. The local optima problem turns out to be very serious; the optimization surface is very rugged. Points get stuck, unable to move past each other in the iterations of the optimization. SNE implementations use special tricks to help, as discussed in the van der Maaten paper.

The van der Maaten paper describes some of the problems of SNE, one of
which is the *crowding problem*. To alleviate these problems (which I
won’t go into), the 2008 paper introduced t-SNE.

### on to t-SNE

t-SNE makes a small change to the calculation of $p_{ij}$ (the probabilities in the original high-dimensional space), and a big change in the calculation of $q_{ij}$ in the embedded space.

First t-SNE converts the asymmetric SNE $p_{j \mid i}$’s to symmetric “joint probabilities” $p_{ij}$ that sum to one over all $i,j$ by:

$$
p_{ij} = \frac{p_{j \mid i} + p_{i \mid j}}{2M}
$$

It seems that this is essentially arbitrary – the symmetric version has a simpler gradient. There are other ways to symmetricize the $p_{j \mid i}$, but this one has an advantage that each point $i$ is guaranteed to contribute at least $\frac{1}{2M}$ of the total probability. We’re considering two ways of obtaining the joint $p_{ij}$, either by $p_{j \mid i}$ or by $p_{i \mid j}$, and since these don’t give the same number (because what we’re doing is arbitrary), we’re just averaging them. This is not the heart of what makes t-SNE tick, but it’s part of the method.
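In numpy, with the conditionals stored as rows of an $M \times M$ matrix (here `Pc` is my stand-in name, filled with random rows just for illustration), the symmetrization is one line:

```python
import numpy as np

M = 4
rng = np.random.default_rng(0)
Pc = rng.random((M, M))                 # stand-in for the conditional p_{j|i}
np.fill_diagonal(Pc, 0.0)
Pc /= Pc.sum(axis=1, keepdims=True)     # each row (conditional) sums to 1

P = (Pc + Pc.T) / (2 * M)               # p_ij = (p_{j|i} + p_{i|j}) / 2M
print(P.sum())                          # 1.0: P sums to one over all i,j
```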

t-SNE also assumes a symmetric $q_{ij}$ distribution, but here is
where the big change is. t-SNE assumes a *heavy-tailed* distribution.
It will assign higher probabilities to points that are further away in
the 2D embedded space than they were in the original N-dimensional
space. This alleviates the crowding problem: points that were
clustered in the original roomy hypervolume need to be laid out in a
constrained area available in a mere two dimensions, and something has
to give, so we want to let them spread out. Specifically, t-SNE
assumes a Student t distribution with one degree of freedom, also
known as the Cauchy distribution, and calculates:

$$
q_{ij} = \frac{\left( 1 + \| y_i - y_j \|^2 \right)^{-1}}
              {\sum_{k \neq l} \left( 1 + \| y_k - y_l \|^2 \right)^{-1}}
$$

Again the $q_{ii}$ are defined as zero. Notice that the denominator
is a sum over *all* pairs $k \neq l$, not just a sum over the
possible neighbors $k$ for one point $i$: this is what
makes the $q_{ij}$ a symmetric “joint probability” that sums to one
over all $i,j$.
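As a sketch in numpy (the function name `tsne_Q` is mine), the whole $Q$ matrix can be computed at once:

```python
import numpy as np

def tsne_Q(Y):
    # q_ij under the Cauchy (Student t, one d.o.f.) kernel, with q_ii = 0,
    # normalized over all pairs so that Q sums to one
    d2 = np.sum((Y[:, None, :] - Y[None, :, :])**2, axis=2)   # M x M squared distances
    w = 1.0 / (1.0 + d2)
    np.fill_diagonal(w, 0.0)
    return w / w.sum()

Y = np.random.default_rng(1).normal(0.0, 1e-4, (5, 2))
Q = tsne_Q(Y)
print(Q.sum())        # 1.0: a joint distribution over all pairs
```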

With these definitions for t-SNE’s $p_{ij}$ and $q_{ij}$, the KL objective function becomes:

$$
C(Y) = \mathrm{KL}(P \| Q) = \sum_i \sum_{j \neq i} p_{ij} \log \frac{p_{ij}}{q_{ij}}
$$

We’ll seek points $Y$ to minimize this KL divergence, using numerical optimization starting from a randomly chosen point. The gradient is:

$$
\frac{\partial C}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij}) (y_i - y_j) \left( 1 + \| y_i - y_j \|^2 \right)^{-1}
$$

(When you go to implement this, notice that $p_{ij}$ and $q_{ij}$
are scalars, and $\left( 1 + \| y_i - y_j \|^2 \right)^{-1}$ is a scalar
that you calculate from the distance between the $y_i$ and $y_j$
vectors… but $(y_i - y_j)$ is a *vector*, with two dimensions.
The gradient $\frac{\partial C}{\partial Y}$ is an $M \times 2$ matrix.)
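The gradient can be computed for all points at once with broadcasting; here’s a sketch (`tsne_grad` is my name for it, and it assumes $P$ and $Q$ are the $M \times M$ joint probability matrices):

```python
import numpy as np

def tsne_grad(Y, P, Q):
    # dC/dy_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^{-1}
    diff = Y[:, None, :] - Y[None, :, :]              # M x M x 2: all y_i - y_j
    w = 1.0 / (1.0 + np.sum(diff**2, axis=2))         # M x M: the t kernel terms
    coeff = 4.0 * (P - Q) * w                         # M x M scalar coefficients
    return np.sum(coeff[:, :, None] * diff, axis=1)   # M x 2 gradient matrix
```

A useful sanity check (and a good habit for any hand-derived gradient) is to compare it against finite differences of the objective.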

### t-SNE in a nutshell

Given:

- $M$ data points $x_i$: the rows of an $M \times N$ matrix $X$.
- the chosen perplexity.

Calculation:

- Calculate the $p_{j \mid i}$, fitting each $\sigma_i$ to achieve the chosen perplexity.
- Calculate the $p_{ij}$.
- Sample $M$ random starting points $y_i$ (2D vectors) from $\mathcal{N}(0, 10^{-4} I)$.
- Use numerical optimization to minimize the KL divergence $C(Y)$, using gradient information $\frac{\partial C}{\partial y_i}$.

Result:

- $M$ data points $y_i$ in the embedded 2D t-SNE space.

The only part of that that we didn’t discuss already is the sampling of the initial points, which shouldn’t be super important; the t-SNE paper recommends choosing them from a small 2D Gaussian centered at zero. That’s what $\mathcal{N}(0, 10^{-4} I)$ means: $\mathcal{N}$ is customary notation for a Gaussian random variable, $I$ is the 2D identity matrix, and $10^{-4}$ is a small variance.

### three practicalities for the pset

#### 1. one dimensional root-finding using `scipy.optimize.bisect()`

To calculate the t-SNE stochastic neighbor distribution, you need the SNE conditional distributions $p_{j \mid i}$ ($M$ of them, for $M$ data points). To calculate the $p_{j \mid i}$, you need a $\sigma_i$ for each data point $i$, which sets the width of the Gaussian distribution that point $i$ uses to define its stochastic neighbors.

For each data point $i$, for a given $\sigma_i$, we calculate the Shannon entropy $H(P_i)$ of the $p_{j \mid i}$ distribution; $2^{H(P_i)}$ is the perplexity of the distribution, the effective number of neighbors. We’ll dial $\sigma_i$ in to get the perplexity we want.

When $\sigma_i$ is smaller, the stochastic neighbor distribution is tighter, reaching out to fewer neighbors; the perplexity is low. For larger $\sigma_i$, more neighbors, larger perplexity. Perplexity is a monotonically increasing function of $\sigma_i$, ranging from 1 (as $\sigma_i \to 0$) up to $M-1$ (as $\sigma_i \to \infty$). It’s a simple function and there’s got to be a solution.

If we cast this problem in terms of an objective function
$f(\sigma_i)$ that’s the difference between our current perplexity and
the target perplexity, then we’re looking for the value of $\sigma_i$
that achieves $f(\sigma_i) = 0$. Finding a scalar parameter $x$ such
that $f(x) = 0$ is a general problem called a *root-finding problem*,
which arises in many sorts of situations. It tends to get a chapter of
its own in books on scientific numerical methods (such as *Numerical
Recipes in C*, my go-to source).

The first step in any rootfinding problem is for you to identify a
*bracketing interval*: points $a$ and $b$ where $f(a)$ and $f(b)$
have different signs, so there must be a root $f(x) = 0$ at some point
in between them. A root finder is then going to narrow down that
interval until it finds the root.

In our problem, with a monotonically increasing function, we need to identify points $a < b$ such that $f(a) < 0$ and $f(b) > 0$. In a function as well behaved as what we’ve got here, and since we know $\sigma_i > 0$, we can lay down a test point anywhere and keep moving it toward (or away from) zero until we get the sign we need. For example, for the $a$ side, we can do something like:

```
a = 1.0
while f(a) >= 0: a /= 2
```

By taking multiplicative steps, we guarantee that $a > 0$.

There are
a variety of root finders in SciPy,
including fancy methods that try to accumulate information about your
$f(x)$ as you try points $x$, to make sophisticated predictions
about where $f(x) = 0$. A simple and robust method is *bisection*,
implemented in
`scipy.optimize.bisect()`.

To call `scipy.optimize.bisect()`, we provide it with the name of the
function that implements $f(x)$; with the bracketing endpoints $a$ and $b$;
and, if necessary (and it will be necessary for t-SNE), with
a tuple of data that `bisect()` will pass as arguments to your
objective function. For example, the following pseudocode:

```
def my_perplexity_diff(sigma, Di):
    # calculate the perplexity for distance matrix row D[i] given sigma
    # return that minus the desired perplexity: we're looking for a zero difference

# find your a,b bracketing interval, then:
sigma[i] = scipy.optimize.bisect(my_perplexity_diff, a, b, args=(D[i].flatten(),))
```

and you’d do that for each point $i$.

I slipped `D[i]` in there without defining it. Your original data are
a matrix $X$ of $M$ row vectors. The $p_{j \mid i}$ are calculated in
terms of Euclidean distances between row vectors $x_i$ and
$x_j$. As you dial in the $\sigma_i$, you’re going to calculate
those pairwise distances over and over again. You can save compute
time by precalculating an $M \times M$ matrix $D$ of all
pairwise distances between points. Moreover, then for a given
$i$, a row $D_i$ is sufficient to calculate the $p_{j \mid i}$.
You’ll see that my code for the pset is written to take
advantage of a precalculated $D$ distance matrix, and the
above pseudocode reflects that too.
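With numpy broadcasting, the whole distance matrix can be precalculated in a couple of lines (a sketch, with made-up data):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 20))              # M=100 points in N=20 dimensions
diff = X[:, None, :] - X[None, :, :]        # M x M x N array of all x_i - x_j
D = np.sqrt(np.sum(diff**2, axis=2))        # M x M Euclidean distances
```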

And what’s up with the `D[i].flatten()`? If the distance matrix is an
$M \times M$ numpy array `D[i][j]`, and I extract row `D[i]`, I get
another *array*, now $1 \times M$. But I wrote the objective
function pseudocode as if it’s just getting a row vector `Di`, so I
flatten the array to a vector when I pass it. I could do this
different ways – but `.flatten()` is good to know about (and its
opposite, `.reshape()`), as we’ll see now in dealing with the SciPy
optimizer.
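Putting the pieces of this section together, here’s a runnable sketch of fitting one $\sigma_i$ to a target perplexity (the helper name `my_perplexity_diff`, its argument list, and the test data are my own choices, not the pset’s):

```python
import numpy as np
from scipy.optimize import bisect

def my_perplexity_diff(sigma, Di, i, target):
    # perplexity of p_{j|i}, computed from row Di of the distance matrix,
    # minus the target perplexity: we're looking for a zero difference
    w = np.exp(-Di**2 / (2.0 * sigma**2))
    w[i] = 0.0
    p = w / w.sum()
    p = p[p > 0]
    return 2.0 ** (-np.sum(p * np.log2(p))) - target

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5))                                     # made-up data
D = np.sqrt(np.sum((X[:, None, :] - X[None, :, :])**2, axis=2))  # distance matrix
i, target = 0, 5.0

# bracket the root by multiplicative steps, then bisect
a = b = 1.0
while my_perplexity_diff(a, D[i], i, target) >= 0.0: a /= 2
while my_perplexity_diff(b, D[i], i, target) <= 0.0: b *= 2
sigma_i = bisect(my_perplexity_diff, a, b, args=(D[i], i, target))
```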

#### 2. providing gradient information to `scipy.optimize.minimize()`

We’ve already used the SciPy numerical optimizer in the week on regression, and you might want to look back at that. But we didn’t provide the minimizer with gradient information. We didn’t have to then; we were optimizing a pretty simple, low-dimensional log likelihood objective function. With t-SNE, we’re in a more complex situation, and we want to use gradient information.

For t-SNE, you’re going to create an objective function that takes a current set of points $Y$ (an $M \times 2$ array), and returns the KL divergence between the stochastic neighbor distribution $P$ in the original space (which you’ll have calculated already; you only need to do that once) and $Q$ in the embedded space (which depends on your current points $Y$). So your objective function in pseudocode is going to look something like:

```
def KL_dist(Y, P):
    Q = calculate Q from Y
    KL = calculate KL(P||Q)
    return KL
```

The gradient is an $M \times 2$ matrix: how much $C(Y)$ changes, for a small change in one of the two-dimensional coordinates of one point $y_i$. van der Maaten and Hinton (2008) show the equation for the gradient on p. 2584 of their paper. It depends on $P$ and $Q$ too, and on calculating $Q$ given the current $Y$. You could write a separate function to calculate the gradient, but then you’d have to calculate $Q$ twice, wastefully.

It doesn’t seem to be super well documented, but you can optionally have your objective function return the gradient too. This can save some calculation, compared to having a separate gradient-calculating function, because the calculation of the objective and its gradient often share computationally intensive steps – like here, where we’d really rather just calculate $Q$ once, for a given $Y$. So in pseudocode our objective function will look something like:

```
def KL_dist(Y, P):
    Q = calculate Q from Y
    KL = calculate KL(P||Q)
    gradient = np.zeros(Y.shape)
    calculate each gradient term
    return KL, gradient
```

Now a detail. The SciPy optimizer is very general - it doesn’t want
to know about the dimensionality of your problem, it just wants to
deal in terms of one big vector of current $Y$ values, and SciPy also
wants the gradient to be one big vector. We use `.flatten()` to convert an
$M \times 2$ matrix to a single vector for SciPy, and we use `.reshape()` to
convert a vector back to an $M \times 2$ matrix. $P$ is fine as an
$M \times M$ array – SciPy passes the extra constant arguments in
exactly the form you provided them.

In pseudocode, then:

```
def KL_dist(Y, P):
    M = P.shape[0]            # or len(Y) // 2
    Y = np.reshape(Y, (M,2))
    Q = calculate Q from Y
    KL = calculate KL(P||Q)
    gradient = np.zeros(Y.shape)
    calculate each gradient term
    return KL, gradient.flatten()
```

The hard part is writing the objective function and getting its input and output in the form that SciPy needs, with the input $Y$ and the output gradient as flattened vectors of $2M$ elements. To call the minimizer then looks like:

```
Y = np.random.normal(0., 1e-4, (M,2))   # choose random initial Y points
result = scipy.optimize.minimize(KL_dist, Y.flatten(), args=(P,), jac=True)
```

Recall that `args=(P,)` is the interface for handing SciPy a bundle of constant additional data (as a tuple) that it needs to pass to the objective function (as additional arguments); here, the stochastic neighbor distribution $P$ in the original space that we’re trying to match.

`jac=True` is the poorly-documented way of telling SciPy that our
objective function is going to return both the value and the
gradient. “jac” means Jacobian matrix. The other way to do this is to
pass the name of a function that calculates the gradient,
i.e. `jac=my_gradient_func`.

When the minimizer returns, it passes back a result object. The parts we care about are:

- `result.x`: the optimized solution – our embedded $Y$ coordinates, as a flattened $2M$-long vector.
- `result.success`: True if the minimizer succeeded. We ought to check this.
- `result.fun`: the objective function value at the solution.

You’ll take `result.x` and reshape it to an $M \times 2$ matrix, and
that’s your t-SNE plot in two dimensions.
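Putting the whole scheme together, here’s a runnable sketch (with a random symmetric $P$ standing in for a real stochastic neighbor distribution, just to exercise the machinery; the real pset uses the $P$ you fit above):

```python
import numpy as np
from scipy.optimize import minimize

def KL_dist(y, P):
    # y arrives flattened from scipy; reshape, compute Q and KL(P||Q),
    # and return both the objective and the flattened gradient (jac=True)
    M = P.shape[0]
    Y = y.reshape((M, 2))
    diff = Y[:, None, :] - Y[None, :, :]
    w = 1.0 / (1.0 + np.sum(diff**2, axis=2))   # Cauchy kernel terms
    np.fill_diagonal(w, 0.0)
    Q = w / w.sum()
    mask = ~np.eye(M, dtype=bool)
    KL = np.sum(P[mask] * np.log(P[mask] / Q[mask]))
    grad = 4.0 * np.sum(((P - Q) * w)[:, :, None] * diff, axis=1)
    return KL, grad.flatten()

rng = np.random.default_rng(5)
M = 20
A = rng.random((M, M)); A = A + A.T
np.fill_diagonal(A, 0.0)
P = A / A.sum()                              # stand-in for the real P matrix
Y0 = rng.normal(0.0, 1e-4, (M, 2))
result = minimize(KL_dist, Y0.flatten(), args=(P,), jac=True)
Y = result.x.reshape((M, 2))                 # the embedded 2D coordinates
```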

#### 3. using `sklearn.manifold.TSNE()`

After all this, you’ll be pleased to know that using the canned t-SNE implementation in scikit-learn is super easy:

```
from sklearn.manifold import TSNE
X = your MxN data array
Y = TSNE(perplexity=10).fit_transform(X)
# Y is now Mx2; plot it.
```

Why didn’t I tell you that in the first place? Wielding a powerful canned data analysis routine without respectful understanding is like handing a child a gun.

Which seems like a good epitaph for MCB112, so we stop here. Good luck on the pset, and thanks for taking the course!