MCB112: Biological Data Analysis (Fall 2019)


Week 11 section: t-SNE

Irina Shlosman, with special thanks to Kevin Mizes for inspiration for optimize.bisect.

You can also download these notes in jupyter notebook format (albeit without some of the figures).

1. Playing with dimensions - which tool to pick…

By this point, we have seen many instances of multi-dimensional biological datasets. With RNA-seq data specifically, we expect to be working with numerous samples (e.g. single cells, tissue samples, etc.), in each of which we would like to probe the expression of several thousand genes. These kinds of experiments give us \(N \times M\) matrices, where \(N\) is the number of samples and \(M\) is the number of genes, or, equivalently, the dimensionality of the space in which each sample resides. The picture is further complicated if we decide to assess other parameters - e.g. the biological sex of the specimen, WT vs. mutant variation, or different experimental conditions. One way to make sense of such data - and to visualize the results - is to recast it in a smaller number of dimensions, most frequently in 2D.


As Sean outlined in the lecture, there are many approaches to dimensionality reduction, one of which we covered last week: principal component analysis (PCA). PCA proved to be very powerful on Watson’s dataset, yielding nice clustering of cells that reproduced the true biological subtypes. In that case, only 2 PCs contributed significantly to the variance of the data, so we were able to denoise the data and plot it in 2 dimensions.

Here is an illustration (that you might have seen before) that summarizes this process for a 3D dataset:

[Figure 1: PCA of a 3D dataset]

We can imagine scenarios, however, where PCA would fail to identify clusters correctly - for instance, if there were more than 2 PCs with significant variances.

Thinking of this problem in geometric terms, we would expect to encounter issues with PCA if we had an extra cluster of cells above the “plane” of maximum variance in which the remaining clusters can be best separated (as in Figure 1). If we were to project this modified dataset into 2D, we’d end up with considerable overlap between clusters that, in reality, are distinct.

In such cases - and we can definitely think of more complex geometries that might be a challenge for PCA (Figure 2) -

[Figure 2: a 3D dataset with a more complex geometry]

t-SNE (t-Distributed Stochastic Neighbor Embedding) becomes particularly useful, since it does not rely on global structure to embed points into 2D, but rather on their local structure.

Here is the result of t-SNE analysis (Figure 3) of the original structure shown above with perplexity = 30.

[Figure 3: t-SNE 2D reconstruction of the structure above, perplexity = 30]

Very beautiful! These illustrations are taken from the article that Sean references in his lecture notes. I highly recommend going online and playing with their examples - they helped me develop some intuition for how to interpret t-SNE plots.

2. Reviewing the conceptual basis for t-SNE

t-SNE belongs to a family of nearest neighbor algorithms (see my favorite source Wiki for more) and captures local structure by defining neighbor relationships between the high-dimensional points in the dataset.

For each point \(i\), which resides in \(M\)-dimensional space with coordinates \(\{x_{i1}...x_{iM}\}\), we can draw a spherical gaussian, specified by a variance \(\sigma_i\), which will define the probability that any point \(j\) in the vicinity is its neighbor.

The euclidean distance between the two points is given by

\[d_{ij} = ||x_i-x_j||\]

and the conditional probability that \(j\) is \(i\)’s neighbor is

\[p_{j|i} = \frac{\exp(\frac{-d_{ij}^2}{2\sigma_i^2})}{\sum_k\exp(\frac{-d_{ik}^2}{2\sigma_i^2})}\]

As expected, the larger the chosen value of sigma, the greater the number of points that will fall within the sphere and be counted as “neighbors”. Conversely, the farther a point is from the center of the Gaussian, the less likely it is to be a neighbor.

In t-SNE, we choose sigma values for each point such that the perplexity, which we can think of as the effective number of neighbors, is the same for all points. As a result, for points within a densely populated cluster, sigma will be small, shrinking their spherical gaussians, whereas for points in sparse regions, sigma will be larger, so as to capture the desired number of neighbors.


Question: What does that tell us about interpreting cluster sizes in t-SNE?

Question: What about inter-cluster distances?


Thus, we can describe the original \(M\)-dimensional space of our \(N\) samples by an \(N \times N\) matrix of conditional probabilities that defines all the inter-point relationships.

The point of t-SNE, in layman’s terms, is then to find a 2D representation that captures these relationships as closely as possible.

What do we mean by this mathematically?

We would like to find coordinates \(\{y_1...y_N\}\) in 2D space such that the probability distribution \(q_{ij}\) that describes the relationships between the points in this 2-dimensional space matches as closely as possible the original \(M\)-dimensional \(p_{ij}\):

\[ q_{ij} = \frac { \left( 1 + \lVert y_i - y_j \rVert^2 \right)^{-1} } { \sum_{k \neq l} \left( 1 + \lVert y_k - y_l \rVert^2 \right)^{-1} } \]

We can quantify the degree of similarity (or rather, difference) between these two probability distributions with the so-called Kullback-Leibler divergence:

\[ D_{KL}(P||Q) = \sum_i\sum_j p_{ij}\log(\frac{p_{ij}}{q_{ij}})\]

When \(p_{ij} = q_{ij}\), the logarithm term goes to 0, and the divergence reaches its minimum value of 0.

Our goal, therefore, is to find the optimal parameters \(\{y_1...y_N\}\) that minimize the Kullback-Leibler divergence.
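As a tiny numerical illustration (a toy example of mine, not part of the homework): the divergence is zero when the two distributions match, and positive otherwise.

import numpy as np

def KL(p, q):
    # sum only over entries with p > 0, as in the formula above
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

p      = np.array([0.5, 0.3, 0.2])
q_same = np.array([0.5, 0.3, 0.2])
q_off  = np.array([0.2, 0.3, 0.5])

print(KL(p, q_same))   # 0.0
print(KL(p, q_off))    # ~0.27, i.e. > 0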

We have tackled a number of optimization problems before, so, conceptually, we should be all set to take on this one. The hurdle in this particular case is setting up all the inputs to pass on to the optimizer.

So let’s talk about how to do that.

3. *The gruesome details* - setting up input for optimize.minimize

1. Calculate the true probability matrix:

The target matrix is \(P_{ij}\), an \(N\times N\) matrix that we will generate from the observed counts.

For that we need to:

  - precompute the distance matrix \(D_{ij}\), of dimensions \(N \times N\)
  - calculate an array of sigma values \(\{\sigma_1...\sigma_N\}\), of size \(1\times N\)


Calculate the distance matrix

The distance matrix should be relatively straightforward. I would create a separate function and precalculate the matrix so that I can use it as input for later functions.

def calculate_dist(X):
    # X is an N x M array of data points
    N = len(X)

    # declare empty D matrix
    D = np.zeros((N, N))

    # loop through the points i,j and calculate the distance
    # between them (hint: np.linalg.norm does this for you)
    for i in range(N):
        for j in range(N):
            D[i, j] = np.linalg.norm(X[i] - X[j])

    return D
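As a side note (an optional shortcut, not something you need for the homework): scipy ships pairwise-distance helpers that do the same thing without the explicit double loop.

from scipy.spatial.distance import pdist, squareform

# condensed vector of pairwise Euclidean distances, expanded
# into the square N x N matrix D
D = squareform(pdist(X, metric='euclidean'))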

Calculate sigma

Calculating sigma for each point \(i\) is a little bit more involved. As Sean outlined in the lecture notes, we can easily determine the value of a perplexity, given a sigma. The Shannon entropy is given by:

\[ H_i = - \sum_j p_{j \mid i} \log_2 p_{j \mid i} \]

where sigma lurks in the equation through the \(p_{j|i}\) term.

And perplexity is just \[ \mathrm{Perplexity} = 2^H\]
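As a quick sanity check on the “effective number of neighbors” interpretation (again a toy example of mine): if a point had exactly \(k\) equally likely neighbors, then \(H = \log_2 k\), and the perplexity comes out to exactly \(k\).

import numpy as np

k = 20
p = np.full(k, 1.0 / k)      # uniform distribution over k neighbors
H = -np.sum(p * np.log2(p))  # Shannon entropy = log2(k)
print(2**H)                  # perplexity ~ 20.0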

However, we would like to carry out the reverse process - given a perplexity value, find a sigma that satisfies it, and for that purpose, we will turn to scipy.optimize.bisect().

The basic idea is that we can recast the problem of finding a sigma as a root-finding problem: define a function \(f\) such that the sigma we want satisfies \(f(\sigma) = 0\), and let scipy.optimize.bisect() find that root for us.

In our particular case, the function for which we want to find the root is \(\mathrm{Perplexity_{true}}-\mathrm{Perplexity_{estimated}}\), where perplexity itself is a function of \(\sigma\).

Our function, therefore, should look something like this:

def perplexity_diff(sigma_i, D_i, perplexity_true):
    # D_i is a flattened array of distances between point i and
    # all the *other* points (self-distance excluded), and
    # sigma_i is our current guess for sigma

    # numerator of the p_j|i conditional probabilities
    p_ji = np.exp(-D_i**2 / (2 * sigma_i**2))

    # normalize p_ji by the denominator (the sum over all j)
    p_ji /= np.sum(p_ji)

    # calculate the entropy (skipping any zero entries)
    nz = p_ji > 0
    H = -np.sum(p_ji[nz] * np.log2(p_ji[nz]))

    # calculate the estimated perplexity
    perplexity_est = 2**H

    # difference between perplexity_true and the estimate
    return perplexity_true - perplexity_est
    

We can then pass on that difference to scipy.optimize.bisect() specifying boundary values (a,b), and making sure that the \(D_i\) array is flattened.


A brief interlude on how to use scipy.optimize.bisect()

Just to give you a hint, here is how you would do that for an analogous problem - finding the root of the equation \[\exp(10^{-3}x)-50 = 0\]

Let’s imagine that I did not google this equation and did not look at the plot of the function, so that I don’t have a good guess for the upper and lower bounds. What I do know, however, is that

  1. This is a monotonically increasing function
  2. The root satisfies \( e^{0.001x} = 50\), so it must be well above 1

If we take \(p = 0.001x\), we can rewrite this as follows:

\[e^{p} = 50 > e\] which demands that \(p > 1\), which means that \(x > 1000\)

So I can make a reasonable guess for the upper and lower bounds, and apply the principles I outlined above to find the root of this equation!

import numpy as np
import scipy.optimize as optimize
import matplotlib.pyplot as plt

def func(x):
        return np.exp(1e-3*x)-50

lower_bound = 2 # some random number above one

while func(lower_bound) > 0: # while we're not below 0
    lower_bound /= 2 # decrease the lower bound;
    # since the update is multiplicative, it won't go below 0
    
upper_bound = lower_bound # set the upper bound initially
# to the same value

while func(upper_bound) < 0:
    upper_bound *= 2 # update multiplicatively

print('upper bound: {}'.format(upper_bound))
print('lower bound: {}'.format(lower_bound))
print(optimize.bisect(func, lower_bound, upper_bound))
upper bound: 4096
lower bound: 2
3912.0230054281465

Let’s check that we got the right answer:

X = np.arange(-100,5000)
plt.plot(X, np.exp(0.001*X)-50)
plt.plot(X, np.zeros(len(X)),'k--')
[output figure: plot of the function with a dashed zero line]

Looks like we got the intercept!
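Back to t-SNE: putting these pieces together, the sigma search over all points might look something like the sketch below. The function name, the multiplicative bracketing, and the assumption that the data are on a reasonable scale (e.g. log counts) are my own choices here, not requirements.

def calculate_sigmas(D, perplexity_true):
    # D is the precomputed N x N distance matrix
    N = len(D)
    sigmas = np.zeros(N)
    for i in range(N):
        # distances from point i to all the others, self excluded
        D_i = np.delete(D[i, :], i)

        # bracket the root as in the interlude: perplexity_diff is
        # decreasing in sigma, so shrink/grow the bounds until they
        # straddle zero
        lower, upper = 1.0, 1.0
        while perplexity_diff(lower, D_i, perplexity_true) < 0:
            lower /= 2
        while perplexity_diff(upper, D_i, perplexity_true) > 0:
            upper *= 2

        sigmas[i] = optimize.bisect(perplexity_diff, lower, upper,
                                    args=(D_i, perplexity_true))
    return sigmas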


Calculate the probability matrix

OK, now that we know how to obtain sigma values for all of our points, we are ready to set up the \(P_{ij}\) matrix.

We know that this matrix is the average of conditional probability matrices: \[P_{ij} = \frac{P_{i|j}p_j+P_{j|i}p_i}{2}\]

Since we assume that all points are equiprobable, i.e. \[p_j = p_i = \frac{1}{N}\]

The equation simplifies to \[P_{ij} = \frac{P_{i|j}+P_{j|i}}{2N}\]

So I’d first set up a function to calculate conditional probabilities, and then combine them to give the final symmetric probability distribution.

Just as a reminder,

\[p_{j|i} = \frac{\exp(\frac{-d_{ij}^2}{2\sigma_i^2})}{\sum_k\exp(\frac{-d_{ik}^2}{2\sigma_i^2})}\]

def calculate_conditional(X, sigma, D):
    # sigma is the array of per-point sigmas, D the precomputed
    # N x N distance matrix (X itself isn't needed once D exists)
    N = len(D)

    # declare an empty matrix to populate with P_j|i
    P = np.zeros((N, N))

    for i in range(N):
        for j in range(N):
            # calculate the numerator of P_j|i from the precomputed
            # matrix entries; P(i|i) stays 0
            if i != j:
                P[i, j] = np.exp(-D[i, j]**2 / (2 * sigma[i]**2))

        # normalize row i by the denominator. Be careful here: the
        # sum is over all points *except* i itself -- since P[i, i]
        # was left at 0, summing the whole row already excludes it
        P[i, :] /= np.sum(P[i, :])

    return P

Finally, compute the symmetrized \(P_{ji}\) given the two conditional probabilities. Phew, we’re done with the initial set up!
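For instance, if your calculate_conditional function returns the full matrix with entries \(P[i,j] = p_{j|i}\), the symmetrization is a one-liner (a sketch with my own function name):

def calculate_pij(P_cond):
    # P_cond[i, j] = p_(j|i); symmetrize and divide by 2N
    N = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2 * N)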


2. Initialization and \(Q_{ij}\) calculation

Now what’s left is generating random guesses for \(y_1...y_N\) in the 2D space - Sean recommended using a very small gaussian.

mu = 0
sigma = 1e-4
Y = np.random.normal(mu,sigma, (len(X),2)) 
# Y is of dimensions N x 2

We will do that for each independent run of the scipy.optimize.minimize routine.

We also need to compute the \(Q_{ij}\) matrix, which should have the same dimensions as the \(P_{ij}\) matrix, i.e. \(N\times N\). I will leave that one to you to figure out - there shouldn’t be any unpleasant surprises there… :)


3. KL divergence and the gradient

The final piece is calculating the KL divergence (that we will use to monitor the convergence of the algorithm) and the gradient, which we have to pass on to scipy.optimize.minimize.

Sean recommended combining calculations of the KL divergence and the gradient, since they both involve calculating the \(Q_{ij}\) matrix, and it’s nice not to have to do it twice.

Note: These two terms, along with \(Y\) and the true probability distribution \(P_{ij}\), will get passed on to scipy.optimize.minimize. The algorithm doesn’t care about \(P_{ij}\) (it is just carried along as an extra argument); the KL distance is just a scalar; but both the gradient and \(Y\) are \(N \times 2\) matrices, and so we need to make sure to flatten them.
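For reference, the gradient of the KL divergence with respect to each 2D coordinate \(y_i\) is the standard t-SNE gradient from van der Maaten and Hinton’s paper (this is where the factor of 4 mentioned below comes from):

\[ \frac{\partial D_{KL}}{\partial y_i} = 4 \sum_j \left(p_{ij} - q_{ij}\right)\left(y_i - y_j\right)\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1} \]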

Here’s a suggested framework for KL calculations:

def KL_dist(Y, Pij):
    # Y comes in as a flattened array of dim 1 x N*2, so we
    # reshape it back before calculating Q_ij
    N = Pij.shape[0]
    Y = Y.reshape((N, 2))

    # Calculate Q_ij using your calculate_qij function
    Qij = calculate_qij(Y)

    # Set up an array for the gradient of dim N x 2;
    # it is easier to flatten it at the end
    gradient = np.zeros((N, 2))
    KL = 0.0

    for i in range(N):
        for j in range(N):
            if i == j: continue

            # gradient term for the pair (i,j), summed over all j's
            gradient[i] += (Pij[i, j] - Qij[i, j]) * (Y[i] - Y[j]) \
                           / (1 + np.sum((Y[i] - Y[j])**2))

            # KL contribution for the pair (i,j), taking care to
            # avoid the zero entries of the P_ij matrix
            if Pij[i, j] > 0:
                KL += Pij[i, j] * np.log(Pij[i, j] / Qij[i, j])

    # Scale the gradient by a factor of 4 and return the resulting
    # KL and the *flattened* gradient
    gradient *= 4
    return KL, gradient.flatten()
    

At this point, you should be all set to carry out t-SNE analysis on the dataset. Make sure to run the analysis multiple times from random starting positions, because t-SNE, as Sean talked about in the lecture, is prone to getting stuck in local minima!
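To give a concrete sense of how the pieces fit together, here is one possible way to drive the optimization with random restarts (a sketch only; the choice of method, the number of restarts, and the variable names are mine, and Pij and N come from the setup above):

n_restarts = 5
best_KL, best_Y = np.inf, None

for run in range(n_restarts):
    # small random initialization in 2D, flattened for the optimizer
    Y0 = np.random.normal(0, 1e-4, (N, 2)).flatten()

    # jac=True tells minimize that KL_dist returns (value, gradient)
    result = optimize.minimize(KL_dist, Y0, args=(Pij,),
                               jac=True, method='L-BFGS-B')

    if result.fun < best_KL:
        best_KL = result.fun
        best_Y  = result.x.reshape((N, 2))   # the 2D embedding to plot

print('best KL divergence: {:.3f}'.format(best_KL))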

Something to consider:

Which parameters (aka perplexity values) to test on each of the datasets?

What do you expect your plots to look like for different perplexity values?

4. Canned t-SNE - let’s perplex a bit

# Borrowing data from HW10
import pandas as pd

df = pd.read_table('w10-data.tbl', delim_whitespace=True)
data = df.values
logdata = np.log(data)

Let’s run t-SNE on the previous week’s dataset and see how the algorithm does. Painful to see this one-liner and to think of the hours that will have to be spent to implement what already exists, isn’t it?

Let’s console ourselves with the thought that maybe we’ll learn something useful in the process.

from sklearn.manifold import TSNE

perplexities = [5,10,30,50,100,200]
Y = []
X = logdata

for perp in perplexities:
    print('fitting model perplexity = {}'.format(perp))
    Y.append( TSNE(perplexity=perp).fit_transform(X) )
fitting model perplexity = 5
fitting model perplexity = 10
fitting model perplexity = 30
fitting model perplexity = 50
fitting model perplexity = 100
fitting model perplexity = 200
# load cluster labels determined by k-means
labels = np.loadtxt('labels_best.txt')

Now let’s plot the results of the t-SNE analysis for different perplexity values, coloring the points by their “true” clusters.

fig, axs = plt.subplots(2,int(len(perplexities)/2), figsize=(12, 8))
axs = axs.ravel()
k=8

colormap = ['darkorange',    'lightcoral',  'maroon', 'olive',
         'indigo', 'teal', 'cyan',    'lime']

for i in range(len(perplexities)):
    for j in range(len(logdata)):
        axs[i].plot(Y[i][j,0], Y[i][j,1], linestyle="", marker='.', color=colormap[int(labels[j])])
    axs[i].set_xlabel('tsne-1')
    axs[i].set_ylabel('tsne-2')
    axs[i].set_title('Cell types embedding, perplexity = {}'.format(perplexities[i]))

plt.tight_layout()
[output figure: t-SNE embeddings at each perplexity value, points colored by k-means cluster]

Hmm, this is clearly a difficult dataset for t-SNE. There is a perplexity value of ~100, however, at which it does reasonably well. Interestingly, above and below this value the prediction looks quite terrible. The outcome of the analysis is clearly very sensitive to the chosen perplexity value… We might have thought, perhaps naively, that the optimal perplexity would be in the same range as the approximate size of a cluster: in our case, ~200/8 = 25 or so. However, perplexity values of 10-50 result in sub-optimal separation, and we see better separation at 100.


OK, let’s investigate one other property of t-SNE analysis - and that is the relationship between true cluster sizes and inter-cluster distances, and the t-SNE output.

For that purpose, we can choose a simpler dataset (our 2D test dataset from HW5), where we know that cells cluster nicely in log-log space.

# Read in the data using code from HW5

def read_data(infile):
    '''
    read_data(infile)
    Read Lestrade's input file, w05-data.tbl, or a file in that format.
    Return:
       ctype[0..N-1] : cell types 0..Q-1 for each cell i
       data[i,g]     : array of count data; rows = cells i; cols = genes g
       N             : number of cells (rows of the data file)
       G             : number of genes (cols of the data file, after the first)
       Q             : number of cell types
    '''
    ctype = []
    data  = []
    with open(infile) as f:
        for line in f:
            if line[0] == '#': continue  # skip comment lines
            fields = line.split() # convert to list
            ctype.append(int(fields[1]))  # append the type column
            data.append( [int(fields[2]), int(fields[3])])  # append columns for caraway and kiwi counts
    
    # convert lists to numpy arrays for ease of manipulation
    ctype = np.array(ctype)
    data  = np.array(data)
    
    N, G  = np.shape(data)
    Q     = np.max(ctype) + 1 # remember that indexing...
    return ctype, data, N, G, Q
datafile = 'w05-data.tbl'
ctype, data2D, N, G, Q = read_data(datafile)
log2Ddata = np.log(data2D)
# Run t-SNE
Y2D = []
X2D = log2Ddata

for perp in perplexities:
    print('fitting model perplexity = {}'.format(perp))
    Y2D.append( TSNE(perplexity=perp).fit_transform(X2D) )
fitting model perplexity = 5
fitting model perplexity = 10
fitting model perplexity = 30
fitting model perplexity = 50
fitting model perplexity = 100
fitting model perplexity = 200
fig, axs = plt.subplots(2,int(len(perplexities)/2), figsize=(12, 8))
axs = axs.ravel()

colormap = ['darkorange',    'lightcoral',  'maroon', 'olive',
         'indigo', 'teal', 'cyan',    'lime']

for i in range(len(perplexities)):
    for j in range(len(log2Ddata)):
        axs[i].plot(Y2D[i][j,0], Y2D[i][j,1], linestyle="", marker='.', color=colormap[int(ctype[j])])
    axs[i].set_xlabel('tsne-1')
    axs[i].set_ylabel('tsne-2')
    axs[i].set_title('Cell types embedding, perplexity = {}'.format(perplexities[i]))

plt.tight_layout()
[output figure: t-SNE embeddings of the 2D dataset at each perplexity value, points colored by true cell type]

Remembering what the true clusters look like in log-log space (this figure is taken directly from your HW5), what can we say about t-SNE’s ability to reproduce true cluster sizes and inter-cluster distances?

[Figure: Wiggins’ true clusters in log-log space, from HW5]