Section 13: t-SNE

Notes by Kate Shulgina and Kevin Mizes [12/5/18]

(You can download these notes as a Jupyter notebook page)

t-SNE vs PCA

t-SNE is a way of visualizing multidimensional data in two dimensions that preserves relationships between neighboring points. This non-linear projection focuses on preserving local structure (i.e. which points are close to one another) as opposed to global structure (how far apart distant clusters are). PCA, in contrast, just tries to find the linear projection that maximizes the variance of the points, focusing more on keeping dissimilar points far apart.

Unlike PCA, which is a linear projection onto a 2D plane, t-SNE is a non-linear projection. This means that in some cases t-SNE can visualize clusters that PCA can't. Let's look at some examples in 3D to get an intuition.

One example is if the clusters are on the corners of a cube (hypercube in the pset):

In [2]:
### A LOT OF CODE TO MAKE POINTS ON A CUBE, NOT VERY IMPORTANT

# importing stuff
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats 
np.random.seed(4)

# coordinates of the corners of a cube
means = [[-1, 1, 1], [-1, -1, 1], [-1, 1, -1], [-1, -1, -1],
         [ 1, 1, 1], [ 1, -1, 1], [ 1, 1, -1], [ 1, -1, -1]]

# generate points around the corners using Gaussian noise
npts = 15
ncorners = len(means)
xs = np.zeros([npts * ncorners, 3])
sigma = .4
for i in range(ncorners):
    xs[i*npts:(i+1)*npts, 0] = scipy.stats.norm.rvs(means[i][0], sigma, npts)
    xs[i*npts:(i+1)*npts, 1] = scipy.stats.norm.rvs(means[i][1], sigma, npts)
    xs[i*npts:(i+1)*npts, 2] = scipy.stats.norm.rvs(means[i][2], sigma, npts)    
In [21]:
# Sean's favorite colors
colors = ['xkcd:red',    'xkcd:green',  'xkcd:blue',   'xkcd:orange', 
          'xkcd:purple', 'xkcd:pink',   'xkcd:teal',   'xkcd:lavender']

# make the plot!
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
for i in range(ncorners):
    ax.scatter(xs[i*npts:(i+1)*npts, 0], xs[i*npts:(i+1)*npts, 1], xs[i*npts:(i+1)*npts, 2], c=colors[i])
plt.show()
In [29]:
# CALCULATING PCA PROJECTION TO 2D
xs_cent = xs - np.mean(xs, axis=0)                       # centering data
U, S, Wt = np.linalg.svd(xs_cent, full_matrices=False)   # SVD
y_pca = U[:,:2]                                          # first two columns of U give the 2D projection (unscaled by the singular values)

# CALCULATING t-SNE PROJECTION TO 2D
from sklearn.manifold import TSNE 
y_tsne = TSNE(perplexity=30).fit_transform(xs)           # using canned t-SNE with perplexity 30
In [30]:
# make the plots!
fig, axarr = plt.subplots(1, 2, figsize=(15,4))

for i in range(ncorners):
    axarr[0].scatter(y_pca[i*npts:(i+1)*npts, 0], y_pca[i*npts:(i+1)*npts, 1], c=colors[i])
    axarr[1].scatter(y_tsne[i*npts:(i+1)*npts, 0], y_tsne[i*npts:(i+1)*npts, 1], c=colors[i])
    
axarr[0].set_title('PCA')
axarr[0].set_xlabel('PC 1')
axarr[0].set_ylabel('PC 2')

axarr[1].set_title('t-SNE')
axarr[1].set_xlabel('t-SNE 1')
axarr[1].set_ylabel('t-SNE 2')

plt.show()

As you can see from the plots, t-SNE is much better than PCA at separating clusters sitting on the corners of a cube, since it does not force a linear projection.

High-level overview of t-SNE

  1. Calculate distances between all points in M-dimensional space
  2. Compute the $p_{ij}$ terms between all pairs of points in M-dimensional space (this involves fitting $\sigma_i$s to the perplexity)
  3. Initialize random points in 2-dimensional space ($Y$s)
  4. Compute the $q_{ij}$ terms between all pairs of points in 2-dimensional space
  5. Move the $Y$ points around in 2-dimensional space via gradient descent until distribution of $q_{ij}$s is most similar to the distribution of $p_{ij}$s

A little more detail

You compute the $p_{ij}$s once, at the beginning of the algorithm. You do this by symmetrizing the two conditional probabilities $p_{i|j}$ and $p_{j|i}$ and normalizing by the number of points $N$: $$p_{ij} = \frac{p_{i|j} + p_{j|i}}{2N}$$

You get $p_{j|i}$ by evaluating a Gaussian centered at the point $x_i$ at each other point $x_j$, and then dividing by the sum over all other points: $$ p_{j|i} = \frac { \exp \left( - \lVert x_i - x_j \rVert^2 / 2\sigma_i^2 \right) } { \sum_{k \neq i} \exp \left( - \lVert x_i - x_k \rVert^2 / 2\sigma_i^2 \right) } $$

Notice that each point $i$ uses a different $\sigma_i$ for the Gaussian around it. We pick the values of $\sigma_i$ such that the $p_{j|i}$s match a desired perplexity.
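
To make this concrete, here is a minimal numpy sketch (not pset code; the function names conditional_P and symmetric_P are ours) of computing the conditional probabilities and the symmetrized $p_{ij}$ matrix, assuming you already have a $\sigma_i$ for every point. Fitting the $\sigma_i$s is the subject of the next sections.

import numpy as np

def conditional_P(X, sigmas):
    '''p_{j|i} for all pairs: row i is a Gaussian centered on x_i with width sigma_i.'''
    D = np.sum((X[:, None, :] - X[None, :, :])**2, axis=2)   # squared distances ||x_i - x_j||^2
    W = np.exp(-D / (2.0 * sigmas[:, None]**2))              # row i uses its own sigma_i
    np.fill_diagonal(W, 0.0)                                  # a point is not its own neighbor
    return W / W.sum(axis=1, keepdims=True)                   # normalize each row to sum to 1

def symmetric_P(P_cond):
    '''p_ij = (p_{i|j} + p_{j|i}) / 2N, where N is the number of points.'''
    N = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * N)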

What is the perplexity changing in the t-SNE implementation?

In t-SNE, we are embedding points by their local structure (not global!). In this sort of machine learning regression/classification/unsupervised setting, you'll often find a class of algorithms called nearest-neighbor algorithms, which rely on using nearby datapoints to draw some sort of conclusion about another data point. t-SNE captures local structure using a smoothed version of nearest neighbors, specified by the perplexity (the value of the perplexity is the effective number of neighbors). Instead of picking some fixed number of neighbors K, we weight neighbors by a spherical Gaussian ball around our high-dimensional data point. The size (standard deviation, $\sigma_i$) of the ball changes for each datapoint, since the density of the dataset can be different throughout the high-dimensional space. In a sparse region, we'd want a large $\sigma_i$, since we need to look for neighbors over a larger area.

While the perplexity smooths over the region to give us an 'effective' number of neighbors, it is a hard constraint fixing the amount of information (measured by the Shannon entropy) we gain from the neighbors. Entropy is a measure of the unpredictability of a distribution: entropy is at its maximum when the distribution is uniform (which can be derived fairly easily). This means that $\sigma_i$ is affected not only by the density or sparsity of the data, but also by how evenly the data are distributed around a datapoint.

If all of a data point's neighbors were evenly distributed, we would take exactly the perplexity's worth of neighbors, because the information about the structure (through $p_{j|i}$) is the same for each neighbor. If the data point's neighbors were unevenly spread out, you'd have to increase $\sigma_i$ to make the $p_{j|i}$ look more uniform and reach the same perplexity!
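
Here is a quick numerical illustration of that intuition (not part of the pset). A perfectly uniform distribution over $k$ neighbors has perplexity exactly $k$; an uneven distribution over the same neighbors has a smaller 'effective' number of neighbors:

import numpy as np

k = 20
p_uniform = np.full(k, 1.0 / k)                  # equal weight on 20 neighbors
H = -np.sum(p_uniform * np.log2(p_uniform))      # Shannon entropy, in bits
print(2**H)                                      # perplexity = 20.0 exactly

p_uneven = np.linspace(1, 10, k)                 # uneven weights on the same 20 neighbors
p_uneven /= p_uneven.sum()
H = -np.sum(p_uneven * np.log2(p_uneven))
print(2**H)                                      # less than 20: fewer effective neighbors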

I highly recommend the article Sean linked in class.

A practical matter: Dealing with this parameter in your implementation

In the t-SNE implementation, we pick some fixed value of the perplexity for the 'effective' number of neighbors we want to look at. But the actual smoothing over neighbors is controlled by the variable $\sigma_i$. And while we can calculate the perplexity as a function of $\sigma_i$, the problem is that we want to find $\sigma_i$ given a fixed perplexity!

$$ \text{Perplexity} = 2^{H_i}$$

$$ H_i = -\sum_j p_{j|i} \log_2 p_{j|i} $$

$$ p_{j|i} = \frac { \exp \left( - \lVert x_i - x_j \rVert^2 / 2\sigma_i^2 \right) } { \sum_{k \neq i} \exp \left( - \lVert x_i - x_k \rVert^2 / 2\sigma_i^2 \right) } $$

So the perplexity is a function of two things: $\text{Perplexity} = f(\sigma_i, \mathbf{x})$.

If we didn't have computers, you might have to do some pretty gross algebra here to invert this and get $\sigma_i$ as a function of the perplexity, but luckily we do have computers, so you don't have to do that.
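
As a concrete (hypothetical) sketch of what such a function could look like, here is one way to compute the perplexity for a single point $i$ given a candidate $\sigma_i$, where x is taken to be the vector of squared distances $\lVert x_i - x_j \rVert^2$ from point $i$ to every other point:

import numpy as np

def calc_perplexity(sigma_i, x):
    '''Perplexity of the conditional distribution p_{j|i} for one point i.
       x: squared distances from point i to all other points j != i.'''
    w = np.exp(-x / (2.0 * sigma_i**2))      # unnormalized Gaussian weights
    p = w / np.sum(w)                        # conditional probabilities p_{j|i}
    H = -np.sum(p * np.log2(p + 1e-12))      # Shannon entropy H_i, in bits
    return 2.0 ** H                          # perplexity = 2^H_i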

One way to find the correct $\sigma_i$ is to just guess (please don't actually do this - though if you feel up for writing your own binary search algorithm, that would be neat).

sigma_i = 0.000001
target_perplexity = 50
while True:
    perplexity = calc_perplexity(sigma_i, x)        # some implementation of the equations above
    if abs(perplexity - target_perplexity) < 0.01:  # testing exact float equality would never succeed
        print('found sigma_i! its {}'.format(sigma_i))
        break
    sigma_i += 0.000001                             # creep upward in tiny steps... painfully slow

However, we can take advantage of a slightly smarter way (which uses binary search) using scipy.optimize.bisect(). This function finds a root of a function. Thus, it can find for us the $\sigma_i$ where $f(\sigma_i, x) - \text{perplexity} = 0$.

At its core, this function takes 3 inputs: some function (that you write!), a lower bound, and an upper bound. The function must cross through 0 somewhere between the two bounds (i.e. change sign). If there are multiple roots, it will only return one of them.

In [22]:
# Lets explore scipy's bisect

import scipy.optimize as optimize
import matplotlib.pyplot as plt

# some function with multiple roots
def func(x):
    return np.sin(x) + .1*x

# let's plot it
x = np.linspace(-3,7,100)
plt.plot(x,func(x))
plt.xlabel('x')
plt.ylabel('y')

# lets find a root!
lower_bound = -2
upper_bound = 2
root = optimize.bisect(func, lower_bound, upper_bound)
print('between {} and {}, there is a root at {}'.format(lower_bound, upper_bound, root))

# what about another root?
lower_bound = 5
upper_bound = 7
root = optimize.bisect(func, lower_bound, upper_bound)
print('between {} and {}, there is a root at {}'.format(lower_bound, upper_bound, root))

# let's check to see if scipy is right!
root = optimize.bisect(func, lower_bound, upper_bound)
value = func(root)
print('f({})={}'.format(root, value))
between -2 and 2, there is a root at 0.0
between 5 and 7, there is a root at 5.679207796312767
f(5.679207796312767)=-1.5109025142123755e-12

How do we choose this bracketing interval? This can be a challenge for many functions that we know nothing about, but luckily, we know something about the function on this pset (two things actually).

First, we have a monotonically increasing function: as $\sigma_i$ grows, the Gaussian gets wider, the $p_{j|i}$ become more uniform, and so the entropy $H_i$ (and therefore the perplexity, $2^{H_i}$) increases. (Can you convince yourself of this? As a side note, entropy is always non-negative, so the perplexity is always at least 1.) Monotonically increasing means that for $a > b$, $f(a) > f(b)$, so the function crosses any target value at most once.

Second, we are looking for $\sigma_i$, which is the width (standard deviation) of a Gaussian. This value should always be greater than 0, so we only need to search over positive values!

In [8]:
# Lets say we have some function where I have no idea where the roots are

# some function I found online.
def func(x):
    return np.exp(1e-3*x)-50

x = np.linspace(-5,5,100)
plt.plot(x,func(x))
plt.xlabel('x-value (sigma)')
plt.ylabel('y-value (perplexity)')

# one option, just set the bounds super large!
lower_bound = -100000000000
upper_bound = 100000000000
optimize.bisect(func, lower_bound, upper_bound)
Out[8]:
3912.0230054281506
In [9]:
# a better option, taking advantage of monotonically increasing function.
lower_bound_guess = 10 # some random number
while func(lower_bound_guess) > 0: # while we're not below 0
    lower_bound_guess /= 2 # decrease the guess; the multiplicative decrease keeps the lower bound from going below 0!
upper_bound_guess = lower_bound_guess
while func(upper_bound_guess) < 0:
    upper_bound_guess *= 2 # we'll find the upper bound pretty quickly with a multiplicative update

print('upper bound: {}'.format(upper_bound_guess))
print('lower bound: {}'.format(lower_bound_guess))
print(optimize.bisect(func, lower_bound_guess, upper_bound_guess))
upper bound: 5120
lower bound: 10
3912.0230054281506

One final thing: our input function for the root-finding code may need some additional fixed data. To pass that in, we include it via the args argument.

In [10]:
def func(x,t):
    return(x**2 - 1/t)

t0 = .5
lower_bound = 0
upper_bound = 100000 

root = optimize.bisect(func, lower_bound, upper_bound, args=(t0,))
print(root)

x = np.linspace(-5,5,100)
plt.plot(x,func(x,t0))
plt.xlabel('x')
plt.ylabel('y')
1.4142135623743113
Out[10]:
Text(0,0.5,'y')

Details on scipy.optimize.minimize with gradient

When you have optimized your $\sigma_i$s and computed the $p_{ij}$s, the next step is to randomly initialize points $Y$ in 2D and move them around until the distribution of $q_{ij}$s calculated from the 2D points is similar to the distribution of $p_{ij}$s calculated from the M-dimensional points.

What are the practical details of how we move the 2D points around to improve the match between the $P$s and $Q$s?

In lecture, Sean gave us several equations. One of these is the actual objective function that we are trying to minimize: the KL divergence between the two distributions.

$$ f(Y) = \mathrm{KL} = \sum_i \sum_j p_{ij} \log \frac{p_{ij}} {q_{ij}} $$

This outputs a single value representing how different the $Q$ distribution is from the $P$ distribution. We're trying to wiggle the 2D $Y$ points around so that their $Q$ is similar to $P$.

He also gave us the gradient of this function, which tells us how the KL divergence changes if we move one of the $y_i$s:

$$ \frac {\partial f} {\partial y_i} = 4 \sum_j (p_{ij} - q_{ij}) (y_i - y_j) (1 + \lVert y_i - y_j \rVert^2)^{-1} $$

Remember that the $y_i$s are in 2D, so you can move them along two axes. So this gradient returns a vector with 2 entries, representing the slope in each direction.

When we were using scipy.optimize.minimize back in week 7, we gave it only the function to optimize and not the gradient. That was fine because the optimization was simpler, and we could afford to have the optimizer estimate the slope numerically at each point by moving around a little. Now we will want to give the optimizer a formula that calculates the gradient directly.
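
(As an aside: before trusting your analytic gradient, it can be worth comparing it against a finite-difference estimate. scipy.optimize.check_grad does exactly this; here is a toy example on a simple function, not the t-SNE objective.)

import numpy as np
import scipy.optimize

f    = lambda x: np.sum(x**2)      # toy objective f(x) = sum(x^2)
grad = lambda x: 2 * x             # its analytic gradient

x0 = np.random.normal(size=4)
# check_grad returns the norm of the difference between the analytic gradient
# and a finite-difference estimate; it should be close to zero
print(scipy.optimize.check_grad(f, grad, x0))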

How do we give this function to the optimizer so it knows that this function is also returning the gradient? We need to pass one new argument: jac=True

Y = np.random.normal(0., 1e-4, (n,2))       # random initial Y points (n = number of data points)
result = scipy.optimize.minimize(KL_dist, Y.flatten(), args=(P,), jac=True)

For a refresher on the details of scipy.optimize.minimize see the section 7 notes. Just as a brief overview, you need to give it

  1. the function being optimized (KL_dist),
  2. initial guesses (Y),
  3. constant arguments for your function (args=(P,))

and we have one extra argument of jac=True which tells the optimizer that the function will also be returning the gradient for each $Y$.

We need to give the optimizer a function that returns both the KL divergence and the gradient. Here is some pseudocode for what you should be aiming for (borrowed from Sean's notes):

def KL_dist(Y, P):
   Y  = reshape the flattened Y back into an N x 2 matrix
   Q  = calculate Q from Y
   KL = calculate KL(P||Q)

   gradient = np.zeros(Y.shape)
   calculate each gradient term

   return KL, gradient.flatten()

You should have one single function that takes in a given set of $Y$ points and a precomputed $P$ matrix, and returns the KL divergence and its gradient at that set of $Y$s.

One detail is that the optimizer is built to optimize a vector of values, not an $N \times 2$ matrix like $Y$. To convert a matrix to a vector of individual values, we can use .flatten(), and we can use np.reshape to turn a vector back into a matrix. The optimizer also expects to get the gradient information for each point back as a flattened vector, so we must flatten that as well.

Let's look at an example

In [31]:
# make a vector
vector = np.array(range(12))
print(vector)
[ 0  1  2  3  4  5  6  7  8  9 10 11]
In [32]:
# reshape into a matrix
matrix = np.reshape(vector, (6, 2))
print(matrix)
[[ 0  1]
 [ 2  3]
 [ 4  5]
 [ 6  7]
 [ 8  9]
 [10 11]]
In [33]:
# flatten back into a vector
print(matrix.flatten())
[ 0  1  2  3  4  5  6  7  8  9 10 11]
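
Putting the pieces together, here is a minimal vectorized sketch of what a KL_dist function could look like. This is just one way to write it (assuming P is the precomputed symmetric $N \times N$ matrix of $p_{ij}$ with zeros on the diagonal), not the required solution:

import numpy as np

def KL_dist(y_flat, P):
    Y = y_flat.reshape(-1, 2)                                # unflatten into N x 2 coordinates
    D = np.sum((Y[:, None, :] - Y[None, :, :])**2, axis=2)   # pairwise squared distances in 2D
    W = 1.0 / (1.0 + D)                                      # Student-t kernel, unnormalized q_ij
    np.fill_diagonal(W, 0.0)
    Q = np.maximum(W / np.sum(W), 1e-12)                     # normalize over all pairs; avoid log(0)

    KL = np.sum(P * np.log(np.maximum(P, 1e-12) / Q))        # f(Y) = KL(P||Q)

    M = (P - Q) * W                                          # (p_ij - q_ij) / (1 + ||y_i - y_j||^2)
    gradient = 4.0 * (M.sum(axis=1)[:, None] * Y - M @ Y)    # the gradient formula, vectorized
    return KL, gradient.flatten()

Passed to scipy.optimize.minimize with jac=True as shown above, the optimizer will use both return values.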

Implementing t-SNE using sklearn

sklearn is fantastic. It's good to know how things work, but a majority of the time sklearn is your friend. The downside of a canned implementation is that it limits how much you can tailor the method to your data (for example, if you wanted to modify the cost function, or plug in a similarity measure the package doesn't support, you'd have to write your own).

In [58]:
# Lets look with week 6 data!

datafile = 'w06-data.tbl'

def read_data(infile):
    '''
    read_data(infile)
    Read Lestrade's input file, w06-data.tbl, or a file in that format.
    Return:
       ctype[0..N-1] : cell types 0..Q-1 for each cell i
       data[i,g]     : array of count data; rows = cells i; cols = genes g
       N             : number of cells (rows of the data file)
       G             : number of genes (cols of the data file, after the first)
       Q             : number of cell types
    '''
    ctype = []
    data  = []
    with open(infile) as f:
        for line in f:
            if line[0] == '#': continue   # skip comment lines
            fields = line.split()
            ctype.append(int(fields[1]))
            data.append( [int(fields[2]), int(fields[3])])  # assumes exactly 2 genes!!
    ctype = np.array(ctype)
    data  = np.array(data)
    N, G  = np.shape(data)
    Q     = np.max(ctype) + 1
    return ctype, data, N, G, Q

ctype, data, N, G, Q = read_data(datafile)
logdata = np.log(data)
In [60]:
colormap = 'rbgmy'
for c in range(Q):
    idx = np.where(ctype==c)
    plt.plot(logdata[idx,0], logdata[idx,1], linestyle="", marker='.', color=colormap[c])
plt.xlabel('log(caraway counts)')
plt.ylabel('log(kiwi counts)')
plt.title('Cell types using Wiggins classes (K=5)')
Out[60]:
Text(0.5,1,'Cell types using Wiggins classes (K=5)')
In [69]:
from sklearn.manifold import TSNE

perplexity_range = [2,5,10,20,50,100]
Y = []
X = logdata

for perp in perplexity_range:
    print('fitting model perplexity = {}'.format(perp))
    Y.append( TSNE(perplexity=perp).fit_transform(X) )
fitting model perplexity = 2
fitting model perplexity = 5
fitting model perplexity = 10
fitting model perplexity = 20
fitting model perplexity = 50
fitting model perplexity = 100
In [71]:
fig, axs = plt.subplots(2,3, figsize=(12, 8), facecolor='w', edgecolor='k')
fig.subplots_adjust(hspace = .5, wspace=.5)

axs = axs.ravel()

for i in range(6):
    
    colormap = 'rbgmy'
    for c in range(Q):
        idx = np.where(ctype==c)
        axs[i].plot(Y[i][idx,0], Y[i][idx,1], linestyle="", marker='.', color=colormap[c])
    axs[i].set_xlabel('tsne-1')
    axs[i].set_ylabel('tsne-2')
    axs[i].set_title('Cell types embedding, perplexity = {}'.format(perplexity_range[i]))
    
In [77]:
# on sklearn's handwritten digits dataset (8x8 images), now in 64 dimensions!

from sklearn import datasets, manifold

digits = datasets.load_digits(n_class=6)
from matplotlib import offsetbox
X = digits.data      # shape (1083, 64): 1083 images of digits 0-5, 64 pixels (8x8) each
y = digits.target


images_and_labels = list(zip(digits.images, digits.target))
for index, (image, label) in enumerate(images_and_labels[:4]):
    plt.subplot(2, 4, index + 1)
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
In [79]:
# code taken here: https://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html

# Scale and visualize the embedding vectors
def plot_embedding(X, title=None):
    x_min, x_max = np.min(X, 0), np.max(X, 0)
    X = (X - x_min) / (x_max - x_min)

    plt.figure()
    ax = plt.subplot(111)
    for i in range(X.shape[0]):
        plt.text(X[i, 0], X[i, 1], str(y[i]),
                 color=plt.cm.Set1(y[i] / 10.),
                 fontdict={'weight': 'bold', 'size': 9})

    if hasattr(offsetbox, 'AnnotationBbox'):
        # only print thumbnails with matplotlib > 1.0
        shown_images = np.array([[1., 1.]])  # just something big
        for i in range(X.shape[0]):
            dist = np.sum((X[i] - shown_images) ** 2, 1)
            if np.min(dist) < 4e-3:
                # don't show points that are too close
                continue
            shown_images = np.r_[shown_images, [X[i]]]
            imagebox = offsetbox.AnnotationBbox(
                offsetbox.OffsetImage(digits.images[i], cmap=plt.cm.gray_r),
                X[i])
            ax.add_artist(imagebox)
    plt.xticks([]), plt.yticks([])
    if title is not None:
        plt.title(title)
        
print("Computing t-SNE embedding")
tsne = manifold.TSNE(n_components=2, init='pca', random_state=0)
X_tsne = tsne.fit_transform(X)

plot_embedding(X_tsne, "t-SNE embedding of the digits")

plt.show()
Computing t-SNE embedding

A tiny tiny bit on the KL divergence

We haven't had too much time this course to talk about information theory, but it is good to have a bit of an idea of what this value measures.

Above we said that entropy, $H(p)$, is a measure of the unpredictability in a system. Another way to think of it is as the average amount of 'surprise'. If outcomes are completely uniformly random, you are always surprised by the value you get when making an observation. If there is some unevenness in the distribution, you are less surprised when the higher-probability outcomes turn up, and so the average amount of surprise (the entropy) is lower.

The KL divergence measures the extra surprise you incur when the data really follow distribution $p$ but you describe them with distribution $q$. We can decompose it into entropies:

$$ KL(p\vert\vert q) = H(p,q) - H(p) $$

where $H(p,q)$ is the cross-entropy between distributions $p$ and $q$: the average surprise when outcomes drawn from $p$ are described using $q$. If the distribution $q$ describes distribution $p$ exactly, then there is no extra surprise and the KL divergence is 0. This is a good objective for our problem, since we want the 2-dimensional distribution $q$ to describe the high-dimensional distribution $p$ as closely as possible!
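
As a tiny numerical illustration (using base-2 logs, so the answer is in bits; p and q here are made-up distributions):

import numpy as np

def kl(p, q):
    return np.sum(p * np.log2(p / q))    # assumes p and q are strictly positive

p       = np.array([0.5, 0.3, 0.2])
q_same  = p.copy()
q_other = np.array([0.2, 0.3, 0.5])

print(kl(p, q_same))     # 0.0: q describes p exactly, no extra surprise
print(kl(p, q_other))    # > 0: extra surprise from assuming q when the truth is p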