Notes by Joseph Sedlak - 12-1-17

Available as a Jupyter notebook page download or in your web browser.

t-SNE (t-distributed stochastic neighbor embedding) is a method for non-linear dimensionality reduction, commonly applied to visualize different cell types based on gene expression in single-cell RNA sequencing experiments. Its non-linearity often makes it advantageous over PCA. PCA does well when there are a few well-separated clusters with most of the variance captured in the first few principal components. However, PCA corresponds to a rigid orthogonal rotation of the original space, making it impossible to separate clusters in some cases, including the 4D-hypercube example in our HW. In such cases, a non-linear projection into 2D space (e.g. t-SNE) is required.

t-SNE's flexibility in identifying clusters where other dimensionality-reduction algorithms fail is also its weakness. We will go through an overview of why tuning the perplexity matters. Perplexity is a tunable parameter, ultimately a guess at each point's number of close neighbors. Below is an overview of how to read a t-SNE plot and some of its pitfalls, by Martin Wattenberg, Fernanda Viegas, and Ian Johnson at Google. Note: as Sean mentioned, the article does not account for the fact that bizarre plots may come from the optimization getting stuck in a local optimum; the authors should have checked the value achieved by the objective function.

Link: https://distill.pub/2016/misread-tsne/
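
Following up on that note, here is a minimal sketch (mine, not from the article) of how you might sweep perplexity with scikit-learn and compare the final objective values: sklearn's `TSNE` exposes the value of the KL objective after optimization as `kl_divergence_`, so a run that converged to a poor local optimum can be flagged. The array `X` below is a random stand-in for your MxN data matrix.

```
import numpy as np
from sklearn.manifold import TSNE

X = np.random.normal(size=(100, 50))          # stand-in for your MxN data array

for perplexity in (2, 5, 10, 30, 50):
    tsne = TSNE(n_components=2, perplexity=perplexity)
    Y = tsne.fit_transform(X)                 # Mx2 embedding at this perplexity
    # kl_divergence_ is the final value of the KL objective; compare runs
    # (or repeated runs at the same perplexity) to spot bad local optima
    print(perplexity, tsne.kl_divergence_)
```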

Each point in t-SNE has a probability distribution over visiting neighboring points, which is ultimately what forms the clusters. In high-dimensional space, a spherical Gaussian is used, where each point has its own standard deviation $\sigma_i$. In low-dimensional space, the same distribution is used for every point (the Cauchy distribution: a Student t distribution with one degree of freedom). Thus, t-SNE cares about preserving neighbor relations, not absolute distances, as illustrated in the article above.
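
As an illustration (my own sketch, not from the notes), the two kernels can be compared directly: the Gaussian affinity falls off much faster with distance than the Cauchy kernel $1/(1+d^2)$, which is why moderately distant points keep appreciable probability in the 2D map.

```
import numpy as np

d = np.linspace(0., 5., 11)                 # pairwise distances
gauss  = np.exp(-d**2 / 2.)                 # spherical Gaussian affinity (sigma = 1)
cauchy = 1. / (1. + d**2)                   # Cauchy / Student-t (1 dof) affinity

for di, g, c in zip(d, gauss, cauchy):
    print(f"d={di:.1f}  gaussian={g:.4f}  cauchy={c:.4f}")
```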

Overview of the t-SNE algorithm, following van der Maaten and Hinton (2008): $\textbf{M}$ data points (e.g. cells), $x_1 \ldots x_M$, each in $\textbf{N}$-dimensional space (e.g. $N$ = number of genes).

$||x_i - x_j||$ denotes the Euclidean ($\textbf{L}^2$) distance. We want to project the data down to two-dimensional points $y_1 \ldots y_M$.
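
For the homework, a natural first step (my suggestion, not part of the original hints) is to precompute the $M \times M$ matrix of pairwise Euclidean distances, e.g. with scipy:

```
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.normal(size=(20, 4))             # stand-in for your MxN data array
D = squareform(pdist(X, metric='euclidean'))   # D[i,j] = ||x_i - x_j||, shape MxM
```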

First we define a conditional probability $\textbf p_{j|i}$ of visiting neighbor $j$ starting from point $i$ in the high-dimensional space, shown in equation (1) of the paper. The probability of visiting neighbor $j$ falls off with its distance, in proportion to a spherical Gaussian density. Second, we define a conditional probability $\textbf q_{j|i}$ for the 2D space. Our goal is to find positions $y_1 \ldots y_M$ such that all $\textbf q_{j|i} \approx \textbf p_{j|i}$. In t-SNE, we convert the asymmetric SNE conditional $\textbf p_{j|i}$ to a symmetric joint probability $\textbf p_{ij}$, defined as $\textbf p_{ij} = \frac{p_{i|j} + p_{j|i}}{2M}$. Similarly, t-SNE uses a symmetric joint distribution for $q_{ij}$, shown in equation (4). Note that $q_{ij}$ assumes a heavy-tailed distribution (the Cauchy distribution), meaning points far apart in the 2D space retain a higher probability than they would under a Gaussian.
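
For reference, the two definitions referred to above (equations 1 and 4 in van der Maaten and Hinton, 2008) are:

$$p_{j|i} = \frac{\exp(-||x_i - x_j||^2 / 2\sigma_i^2)}{\sum\limits_{k \neq i} \exp(-||x_i - x_k||^2 / 2\sigma_i^2)} \qquad\qquad q_{ij} = \frac{(1 + ||y_i - y_j||^2)^{-1}}{\sum\limits_{k \neq l} (1 + ||y_k - y_l||^2)^{-1}}$$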

We can assess how well $\textbf q_{ij} \approx \textbf p_{ij}$ using the standard measure of the difference between two probability distributions $\textbf{P}$ and $\textbf{Q}$, the Kullback-Leibler divergence, $KL(P\|Q) = \sum\limits_{i}\sum\limits_{j} p_{ij}\log(\frac{p_{ij}}{q_{ij}})$. The KL divergence becomes our objective function $f(\textbf{y})$. We want to find points $\textbf{y}$ that minimize the objective function, starting from a randomly chosen initial configuration. The gradient is given by: $\frac{\partial{f}}{\partial{y_i}} = 4\sum\limits_{j}(p_{ij}-q_{ij})(y_i-y_j)(1+||y_i-y_j||^2)^{-1}$
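
Here is a minimal numpy sketch of that objective and gradient (my own illustration, assuming `P` is an already-computed, symmetrized $M \times M$ joint probability matrix with zeros on the diagonal); the homework skeleton further below fills in the surrounding pieces.

```
import numpy as np
import scipy.optimize

def KL_dist(Yflat, P):
    # objective f(Y) = KL(P || Q) and its gradient, for scipy.optimize.minimize
    M = P.shape[0]
    Y = Yflat.reshape(M, 2)
    diff = Y[:, None, :] - Y[None, :, :]                 # y_i - y_j, shape (M, M, 2)
    num = 1. / (1. + np.sum(diff**2, axis=2))            # Cauchy kernel (1 + ||y_i - y_j||^2)^-1
    np.fill_diagonal(num, 0.)
    Q = num / num.sum()                                  # joint q_ij
    mask = P > 0.
    KL = np.sum(P[mask] * np.log(P[mask] / Q[mask]))     # KL(P || Q)
    grad = 4. * np.sum((P - Q)[:, :, None] * num[:, :, None] * diff, axis=1)
    return KL, grad.flatten()

# toy symmetric P, just to show the optimizer running (not real data)
M = 30
A = np.random.rand(M, M); A = A + A.T; np.fill_diagonal(A, 0.)
P = A / A.sum()

Y0 = np.random.normal(0., 1e-4, (M, 2))                  # random initial layout
result = scipy.optimize.minimize(KL_dist, Y0.flatten(), args=(P,), jac=True)
Y = result.x.reshape(M, 2)                               # optimized 2D coordinates
```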

In [4]:

```
# PCA by SVD (assumes numpy imported as np, and X is your MxN data array):
#Xs = X - np.mean(X, axis=0)   # subtract column mean from each column
#U, S, Wt = np.linalg.svd(Xs)  # SVD: Xs = U @ diag(S) @ Wt
#X2 = U[:,:2] * S[:2]          # data projected onto the first two PCs
```

There is a lot of help for this problem in Prof. Eddy's lecture notes.

I highly recommend breaking the problem down into small steps, with a function defined for each major calculation.

For example, you may want the following functions:

- Calculate $P_{ji}$
- Calculate Shannon entropy
- Calculate perplexity
- Calculate perplexity difference
- Calculate KL_dist
- Calculate $\sigma_i$
- Calculate the t-SNE result with the optimizer (`scipy.optimize.minimize`)

In [5]:

```
import numpy as np
import scipy.optimize

def my_perplexity_diff(sigma, Di):
    # calculate the perplexity for distance matrix row D[i] given this sigma,
    # and return its difference from the target perplexity
    ...

# for each row i: find an (a, b) interval that brackets the root, then
# solve for the sigma_i that gives the target perplexity
sigma[i] = scipy.optimize.bisect(my_perplexity_diff, a, b, args=(D[i].flatten(),))

def KL_dist(Y, P):
    M = P.shape[0]                 # or len(Y) // 2
    Y = np.reshape(Y, (M, 2))
    # Q = calculate Q from Y
    # KL = calculate KL(P || Q)
    gradient = np.zeros(Y.shape)
    # calculate each gradient term
    return KL, gradient.flatten()

Y = np.random.normal(0., 1e-4, (M, 2))   # random initial Y points
result = scipy.optimize.minimize(KL_dist, Y.flatten(), args=(P,), jac=True)
```

This part is straightforward with Sean's hints from lecture, below.

In [6]:

```
from sklearn.manifold import TSNE
#X = your MxN data array
#Y = TSNE(perplexity=10).fit_transform(X)
# Y is now Mx2; plot it.
```
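
To visualize the result, a basic matplotlib scatter plot is enough (a sketch assuming `Y` is the Mx2 array returned by `fit_transform`):

```
import matplotlib.pyplot as plt

plt.scatter(Y[:, 0], Y[:, 1], s=10)   # one point per cell in the 2D embedding
plt.xlabel('t-SNE 1')
plt.ylabel('t-SNE 2')
plt.show()
```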