# Section 13: t-distributed stochastic neighbor embedding (t-SNE)

Notes by Joseph Sedlak - 12-1-17

## t-SNE

A method for non-linear dimensionality reduction, commonly applied to visualize different cell types based on gene expression in single-cell RNA sequencing experiments. The non-linearity of t-SNE often makes it advantageous over PCA. PCA does well when there are a few well-separated clusters with most of the variance captured in the first few principal components. However, PCA corresponds to a rigid orthogonal rotation of the original space, making it impossible to separate clusters in some cases, including the 4D-hypercube example in our HW. In such cases, a non-linear projection into 2D space (e.g. t-SNE) is required.

t-SNE's flexibility in identifying clusters where other dimensionality-reduction algorithms fail is also its weakness. We will go through an overview of why tuning the perplexity matters. Perplexity is a tunable parameter, ultimately a guess at each point's number of close neighbors. Here is an overview of how to understand a t-SNE plot and some of its pitfalls by Martin Wattenberg, Fernanda Viegas, and Ian Johnson at Google. Note: as Sean mentioned, the article does not account for the fact that bizarre plots may come from the optimization getting stuck in a local optimum; the authors should consider the value achieved by the objective function.

Each point on the t-SNE graph has a probability distribution for visiting neighboring points, ultimately used to form clusters. In high-dimensional space, a spherical Gaussian is used, where each point has its own standard deviation $\sigma_i$. In low-dimensional space, the same distribution is used for every point (the Cauchy distribution: a Student t distribution with one degree of freedom). Thus, t-SNE cares about preserving neighbor relations, not absolute distances, as illustrated in the above article.
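To see why the choice of a heavy-tailed Cauchy in low-dimensional space matters, here is a small sketch (my own toy comparison; `gaussian_kernel` and `cauchy_kernel` are hypothetical helper names, both unnormalized). Near zero the Gaussian assigns more weight, but at larger distances the Cauchy retains far more probability mass:

```python
import numpy as np

def gaussian_kernel(d, sigma=1.0):
    # unnormalized spherical Gaussian kernel, as used in the high-dimensional space
    return np.exp(-d**2 / (2.0 * sigma**2))

def cauchy_kernel(d):
    # unnormalized Cauchy kernel (Student t, 1 d.o.f.), as used in the 2D space
    return 1.0 / (1.0 + d**2)

for d in [1.0, 3.0, 5.0]:
    print(d, gaussian_kernel(d), cauchy_kernel(d))
```

At distance 1 the Gaussian is larger, but by distance 5 the Cauchy kernel is several orders of magnitude larger; this is the heavy tail that lets t-SNE push dissimilar points far apart in 2D.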

Overview of the t-SNE algorithm from van der Maaten and Hinton (2008): $\textbf{M}$ data points (e.g. cells), $x_1 \ldots x_M$, each in $\textbf{N}$-dimensional space (e.g. number of genes).

$||x_i - x_j||$ denotes Euclidean ($\textbf{L}^2$) distance. We want to find a projection of the data points into two dimensions, $y_1 \ldots y_M$.

First, define a conditional probability $\textbf p_{i|j}$ of visiting neighbor $j$ starting from point $i$ in the high-dimensional space, shown in equation (1). The probability of visiting neighbor $j$ falls off with its distance according to a spherical Gaussian distribution. Second, we define a conditional probability $\textbf q_{i|j}$ for the 2D space (equation 4). Our goal is to find positions $y_1 \ldots y_M$ such that all $\textbf q_{i|j} \approx \textbf p_{i|j}$. In t-SNE, we convert the asymmetric SNE $\textbf p_{i|j}$ to a symmetric joint probability $\textbf p_{ij}$, defined as $\textbf p_{ij} = \frac{p_{i|j} + p_{j|i}}{2M}$. Similarly, t-SNE assumes a symmetric distribution for $q_{ij}$, shown in equation 4. Note that $q_{ij}$ assumes a heavy-tailed distribution (the Cauchy distribution), so points that are far apart in 2D space retain a higher probability than they would under a Gaussian.
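As a sanity check on these definitions, here is a minimal sketch (my own toy example, with all $\sigma_i = 1$ for simplicity rather than tuned by perplexity) that builds the symmetric $p_{ij}$ and the Cauchy-kernel $q_{ij}$ and verifies both are proper joint distributions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))        # M=5 toy points in N=10 dimensions

# conditional probabilities p_{j|i} in high-D space, with sigma_i = 1 for all i
D = np.sum((X[:, None, :] - X[None, :, :])**2, axis=2)   # squared distances
W = np.exp(-D / 2.0)
np.fill_diagonal(W, 0.0)                                 # a point never visits itself
Pcond = W / W.sum(axis=1, keepdims=True)                 # each row sums to 1

# symmetrize: p_ij = (p_{i|j} + p_{j|i}) / (2M)
M = X.shape[0]
P = (Pcond + Pcond.T) / (2 * M)

# low-D similarities q_ij with the Cauchy kernel
Y = rng.normal(scale=1e-4, size=(M, 2))
DY = np.sum((Y[:, None, :] - Y[None, :, :])**2, axis=2)
WQ = 1.0 / (1.0 + DY)
np.fill_diagonal(WQ, 0.0)
Q = WQ / WQ.sum()                                        # single global normalization

print(P.sum(), Q.sum())    # both should be 1
```

Note the difference in normalization: the conditionals are normalized per row, while the symmetric $p_{ij}$ and $q_{ij}$ each sum to one over all pairs.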

We can assess whether $\textbf q_{ij} \approx \textbf p_{ij}$ using a standard metric for the difference between two probability distributions $\textbf{P}$ and $\textbf{Q}$, the Kullback-Leibler divergence, $KL(P\|Q) = \sum\limits_{i}\sum\limits_{j} p_{ij}\log(\frac{p_{ij}}{q_{ij}})$. The KL divergence becomes our objective function $f(\textbf{y})$. We want to find points $\textbf{y}$ that minimize the objective function, starting from a randomly chosen initial configuration. The gradient is given by: $\frac{\partial{f}}{\partial{y_i}} = 4\sum\limits_{j}(p_{ij}-q_{ij})(y_i-y_j)(1+||y_i-y_j||^2)^{-1}$
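The objective and gradient above can be sketched and sanity-checked against a finite difference (my own toy example; `kl_and_grad` is a hypothetical helper name, and $P$ here is an arbitrary symmetric distribution, not computed from real data):

```python
import numpy as np

def kl_and_grad(y, P):
    # f(y) = KL(P||Q) and its analytic gradient, for a flattened 2D layout y
    M = P.shape[0]
    Y = y.reshape(M, 2)
    DY = np.sum((Y[:, None, :] - Y[None, :, :])**2, axis=2)
    W = 1.0 / (1.0 + DY)              # Cauchy kernel (1 + ||y_i - y_j||^2)^-1
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()
    mask = P > 0
    kl = np.sum(P[mask] * np.log(P[mask] / Q[mask]))
    G = np.zeros((M, 2))
    for i in range(M):
        coeff = (P[i] - Q[i]) * W[i]                        # (p_ij - q_ij)(1+||.||^2)^-1
        G[i] = 4.0 * np.sum(coeff[:, None] * (Y[i] - Y), axis=0)
    return kl, G.flatten()

# toy symmetric P over M=4 points
rng = np.random.default_rng(1)
M = 4
A = rng.random((M, M))
A = A + A.T
np.fill_diagonal(A, 0.0)
P = A / A.sum()

y0 = rng.normal(size=2 * M)
f0, g0 = kl_and_grad(y0, P)

# central finite-difference check of the gradient's first coordinate
h = 1e-5
yp, ym = y0.copy(), y0.copy()
yp[0] += h
ym[0] -= h
fd = (kl_and_grad(yp, P)[0] - kl_and_grad(ym, P)[0]) / (2 * h)
print(fd, g0[0])    # these should agree closely
```

Checking your analytic gradient against a finite difference like this is a cheap way to catch sign and indexing errors before handing the function to an optimizer.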

## Homework hints

Remember to log-transform your data after adding 1, i.e. use $\log(1 + x)$.
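A minimal sketch of that transform (toy counts of my own; `np.log1p` computes $\log(1+x)$ directly and handles zero counts gracefully):

```python
import numpy as np

counts = np.array([[0, 5, 100],
                   [2, 0,  50]])     # hypothetical gene-expression counts

X = np.log1p(counts)                 # log(1 + x); zero counts map to 0
print(X[0, 0], X[0, 1])              # 0.0, log(6)
```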

#### problem 1:

Straightforward; you've already implemented PCA in pset 10.

You can use Sean's example code to perform SVD on the data and plot the first two PCs. Alternatively, you can use the sklearn package, sklearn.decomposition.PCA.

In [4]:
#Xs = X - np.mean(X, axis=0)       # subtract column mean from each column
#U, S, Wt = np.linalg.svd(Xs)      # SVD
#X2 = U[:,:2] * S[:2]              # data projected onto the first two PCs (scores = U * S)
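Equivalently, the sklearn route mentioned above looks like this (a minimal sketch; random data stands in for your actual MxN expression matrix). PCA centers the columns for you before projecting:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))                 # stand-in for your MxN data matrix

X2 = PCA(n_components=2).fit_transform(X)    # centers X, then projects onto first two PCs
print(X2.shape)                              # (20, 2); plot column 0 vs column 1
```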


#### problem 2:

There is a lot of help for this problem in Prof. Eddy's lecture notes.

I highly recommend breaking down the problem into small steps, with a function defined for each major calculation.

For example, you may want the following functions:

1. Calculate $p_{j|i}$
2. Calculate the Shannon entropy
3. Calculate the perplexity
4. Calculate the perplexity difference
5. Calculate KL_dist
6. Calculate $\sigma_i$
7. Calculate the t-SNE result with an optimization function
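The entropy/perplexity pair in steps 2-3 can be sketched as follows (a toy illustration, not homework code; `perplexity` is a hypothetical helper name). Perplexity is $2^H$, where $H$ is the Shannon entropy in bits, so a uniform distribution over $k$ neighbors has perplexity exactly $k$:

```python
import numpy as np

def perplexity(p_row):
    # Perp(P_i) = 2^H(P_i), with H the Shannon entropy in bits;
    # zero entries contribute nothing to the entropy
    p = p_row[p_row > 0]
    H = -np.sum(p * np.log2(p))
    return 2.0 ** H

p_uniform = np.full(5, 0.2)                           # uniform over 5 neighbors
p_peaked = np.array([0.9, 0.025, 0.025, 0.025, 0.025])
print(perplexity(p_uniform), perplexity(p_peaked))    # 5.0, then something smaller
```

This is why perplexity behaves like an "effective number of close neighbors": the more concentrated the row distribution, the smaller it gets.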
In [5]:
import numpy as np
import scipy.optimize

def my_perplexity_diff(sigma, Di, target_perplexity):
    # calculate the perplexity for distance matrix row D[i] given sigma,
    # and return (perplexity - target_perplexity)
    pass

# find an a,b bracketing interval where my_perplexity_diff changes sign, then:
# sigma[i] = scipy.optimize.bisect(my_perplexity_diff, a, b,
#                                  args=(D[i].flatten(), target_perplexity))

def KL_dist(Y, P):
    M = P.shape[0]                  # or len(Y) // 2
    Y = np.reshape(Y, (M, 2))
    # Q    = ... calculate Q from Y (Cauchy kernel)
    # KL   = ... calculate KL(P||Q)
    # grad = ... calculate each gradient term
    # with jac=True below, return both the objective and its flattened gradient:
    # return KL, grad.flatten()

# Y = np.random.normal(0., 1e-4, (M, 2))    # random initial Y points
# result = scipy.optimize.minimize(KL_dist, Y.flatten(), args=(P,), jac=True)



#### problem 3:

Straightforward with Sean's hints from lecture below.

In [6]:
from sklearn.manifold import TSNE

#X = your MxN data array
#Y = TSNE(perplexity=10).fit_transform(X)
# Y is now Mx2; plot it.