MCB112: Biological Data Analysis (Fall 2019)


Section 09: Non-negative Matrix Factorization

Section notes for Nov 8, 2019 by Angel Tang, adapted from Jung-Eun Shin (TF 2018) and William Mallard (TF 2017)

Non-negative matrix factorization

Non-negative matrix factorization (NMF) is a matrix decomposition technique that splits a matrix into two smaller non-negative matrices whose product is approximately equal to the original matrix.

Why do we want to do this? The idea is that there is some rule-driven process generating our data; we have taken a number of noisy measurements of the system’s output, and we would like to discover the underlying structure of our data – and ultimately infer the specific rules of the generative model.

In a deep sense we are actually compressing our data.

For NMF, our matrix of observations V is of size N×M; we are decomposing it into two matrices W and H of size N×R and R×M. If (N, M, R) = (100, 60, 5), then V, W, and H contain 6000, 500, and 300 elements. If NMF succeeds at factoring V into W × H, then we’ve effectively compressed 6000 numbers into just 800 numbers (ie, 500 + 300).

Since W × H only approximates V, this is actually lossy compression.

For this problem set, we’re observing N genes across M conditions. We think there are several sets of gene regulatory modules, but we don’t know how many there are, and we don’t know which genes are in them. To figure this out, we will factor V.

Just as you would factor 15 into 3×5, our goal is to factor V into W × H – except that instead of natural numbers (ie, non-negative integers), the factors W and H are non-negative matrices.

For matrix multiplication to be possible, W and H must have compatible dimensions. So if we are given V with dimensions N×M, then W and H must have dimensions N×R and R×M, for some R.

Since we don’t know R a priori, we need to factor V using a range of R’s, and use log likelihood to assess how close each resulting W × H is to V.

If we succeed, then:

- R is the number of gene regulatory modules
- W tells us the identities of the genes in each of those modules
- H tells us how much each of those modules contributes to each of the samples

Homework hints

Problem 1

We want to generate known W and H matrices, and use them to generate a noisy V matrix.

For W, each row represents a gene, and each column represents a module. Each row should have a non-zero entry in only one column – unless it is a moonlighting gene, in which case it should have non-zero values in two columns. Each column should sum to one.

For H, each row represents a module, and each column represents a sample. This can be random, but each column should sum to one. You could generate a ones matrix, and apply np.random.dirichlet() along each column.

For V, find the matrix product of W and H, and then add Poisson noise.
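
Here’s one possible way to put Problem 1 together. This is just a sketch: the matrix sizes, the number of moonlighting genes, and the count scaling factor are all made-up values, so check the pset for the real ones.

import numpy as np

N, M, R = 100, 50, 5     # genes, samples, modules (example values)
n_moon = 5               # assumed number of moonlighting genes
counts = 1e6             # assumed total counts per sample

# W: assign each gene to one module; moonlighting genes get a second module
module = np.random.randint(R, size=N)
W = np.zeros((N, R))
W[np.arange(N), module] = 1.0
for g in np.random.choice(N, n_moon, replace=False):
    second = (module[g] + np.random.randint(1, R)) % R   # a different module
    W[g, second] = 1.0
W *= np.random.random((N, R))   # random weights on the non-zero entries
W /= W.sum(axis=0)              # now each column sums to one

# H: each column is a Dirichlet draw over the R modules
H = np.apply_along_axis(np.random.dirichlet, 0, np.ones((R, M)))

# V: the product's columns sum to one, so scale up to count levels
# before adding Poisson noise
V = np.random.poisson(counts * (W @ H))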

We’ll feed V into our NMF function, and try to recover the true W and H.

Problem 2

The NMF function should take two arguments: V (the matrix to decompose) and R (the number of modules we think V contains).

To decompose V, start with randomly generated W and H matrices, and run the NMF update steps until the log likelihood converges.
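
In outline, the whole function might look something like this. The helpers update_W, update_H, and log_likelihood are hypothetical names sketched in the hints below, and the convergence threshold is an arbitrary choice:

def nmf(V, R, tol=1e-6, max_iter=2000):
    N, M = V.shape
    W = np.random.random((N, R))   # random non-negative starting points
    H = np.random.random((R, M))
    prev_ll = -np.inf
    for _ in range(max_iter):
        W = update_W(V, W, H)
        H = update_H(V, W, H)
        ll = log_likelihood(V, W, H)
        if ll - prev_ll < tol:     # converged: the improvement is tiny
            break
        prev_ll = ll
    return W, H, ll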

The NMF update equations contain three different subscripts, so the easiest solution is to use three nested for loops for the first W update step, and another three for the H update step.
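
For example, the W update might look like the sketch below. This assumes the standard multiplicative update for the Poisson model, so double-check the exact form against Sean’s lecture notes; the H update is the mirror image, with the roles of W and H swapped.

def update_W(V, W, H):
    N, R = W.shape
    M = H.shape[1]
    WH = W @ H                 # current estimate of V
    W_new = np.zeros_like(W)
    for i in range(N):         # genes
        for a in range(R):     # modules
            total = 0.0
            for u in range(M): # samples
                total += H[a, u] * V[i, u] / WH[i, u]
            W_new[i, a] = W[i, a] * total / H[a, :].sum()
    return W_new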

If that’s too easy for you and you’re up for a challenge, it is also possible to implement both update steps with nothing but matrix algebra. This runs much faster than the for loop implementation, but it also requires some mathematical intuition and a solid understanding of NumPy broadcasting.

Hint: Both update steps are simply projecting two 2D matrices into 3D via broadcasting, performing element-wise multiplication, and collapsing them back to 2D via summation along the axis shared by the original two matrices.
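
To make that concrete without giving away the update steps themselves, here is the same 3D trick applied to ordinary matrix multiplication:

A = np.random.random((4, 2))    # N x R
B = np.random.random((2, 3))    # R x M
# project to 3D: (4, 2, 1) * (1, 2, 3) -> (4, 2, 3), then sum out the
# shared axis (R, which is axis 1)
C = (A[:, :, np.newaxis] * B[np.newaxis, :, :]).sum(axis=1)
print(np.allclose(C, A @ B))    # True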

After each update step, we want to calculate the log likelihood that V was generated by our current guess at W and H. See Sean’s lecture notes for the log likelihood formula.

Note: You should not need a for loop to calculate log likelihood. All three matrices in the log likelihood equation are of the same size, so you can use normal arithmetic operators and the log function on your NumPy matrices, and then use the .sum() method for the double summation.

Note: From step to step, we only care about the change in the log likelihood – ie, how much it has improved – which means that any constant terms will cancel out. Specifically, there is a log(V!) term that we can ignore, which is nice because V! is huge and dealing with it would be a pain.
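
Putting those two notes together, the calculation can be as short as this (a sketch; confirm the formula against the lecture notes):

def log_likelihood(V, W, H):
    # Poisson log likelihood of V given W and H, dropping the
    # constant log(V!) term
    WH = W @ H
    return (V * np.log(WH) - WH).sum()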

Recover W and H for R=3..6, and compare the resulting log likelihoods. Did you recover the correct R? Can you recover the gene-to-module assignments from W? Do you also recover the moonlighting genes?
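
For example, with the nmf() sketch from Problem 2:

for R in range(3, 7):
    W, H, ll = nmf(V, R)
    print(R, ll)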

Hint: You’ll probably need to row-normalize W, and then look at a histogram. Don’t forget to flatten() your NumPy array before you histogram.
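
A minimal version of that, assuming W is the matrix your NMF function returned and matplotlib is imported as plt:

W_norm = W / W.sum(axis=1, keepdims=True)   # each gene's weights now sum to 1
plt.hist(W_norm.flatten(), bins=50)
plt.show()

If the fit worked, genes that belong to a single module should pile up near 0 and 1, and moonlighting genes should show up as intermediate weights.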

Problem 3

If you’ve made it this far, Problem 3 should be easy. Take the provided data set, and repeat the same steps as Problem 2. Find R, find W, identify any moonlighting genes.

Python tips

Random functions

import numpy as np
import matplotlib.pyplot as plt

Randomly generate a W matrix with N rows and R columns, such that each column sums to one. The H matrix can be generated similarly. This may be useful when simulating your data and when initializing W and H for NMF.

N, M, R = 100, 50, 5 # N genes, M experiments, R modules
W = np.random.dirichlet(np.ones(N), size=R).T # you can also use np.apply_along_axis as will be explained later
print(W.shape)
print(sum(W))
(100, 5)
[1. 1. 1. 1. 1.]

np.random.dirichlet generates an array of n random numbers that sum to 1. Alpha is the usual Dirichlet parameter vector; lower values of alpha generate a sparser vector. This might come in handy when generating your W matrix for your homework, because a single gene module is likely to have a sparse relative expression vector. An illustration is shown below.

alphas = [10, 1, 0.5, 0.1]
n = 100
figs,axs = plt.subplots(1, 4, figsize=(18,3))
for i, alpha in enumerate(alphas):
    axs[i].hist(np.random.dirichlet(alpha*np.ones(n)), bins=20, density=True)

plt.show()
[figure: histograms of the Dirichlet draws for alpha = 10, 1, 0.5, and 0.1]

To add Poisson noise to a matrix:

N, M = 10, 5
L = 1e6 * np.random.random((N, M))

print("The original matrix is:\n", np.round(L, 3))
The original matrix is:
 [[ 26164.478 255339.516 961222.355 533003.247 849035.773]
 [679657.839 937173.353 184110.205 652722.568 560370.512]
 [165253.335 153232.012 386973.747 126864.92  983281.184]
 [837285.701 319938.388  95274.152 353786.773 543073.632]
 [ 78379.417 456845.869 779657.774 257211.508 637184.122]
 [ 13568.458 653935.297 643826.31  749554.737 844360.055]
 [867347.429 607954.847 357052.158 659543.702 911075.499]
 [686982.619 438091.14  695428.677 635141.198 774093.148]
 [285834.492 567038.837 763680.055 555299.572 837752.182]
 [482262.746 575971.523  82402.039 508228.636 744306.056]]
print("The matrix after adding Poisson noise:\n", np.random.poisson(L))
The matrix after adding Poisson noise:
 [[ 26162 255085 961252 532443 850878]
 [679549 937425 183725 653459 560748]
 [166165 153120 387472 126879 983700]
 [835947 319557  95272 354417 541489]
 [ 78250 456166 779735 257442 636926]
 [ 13471 654797 642564 748827 845547]
 [865074 608288 357175 658783 910624]
 [688338 437253 695083 635829 775501]
 [285996 566094 765267 554128 838264]
 [481509 576665  82477 509687 743968]]

Matrix multiplication

Python has a dedicated single-character infix operator specifically for matrix multiplication: the @ symbol. It was added in Python 3.5 (PEP 465) to eliminate ambiguity about what * means in the context of two matrices.

The @ operator performs matrix multiplication on two matrices of compatible dimensions. eg, if A is an N×R matrix and B is an R×M matrix, then A @ B will give an N×M matrix.

[diagram of matrix multiplication]

The * operator performs element-wise multiplication on two matrices of the same dimensions. eg, if A is an N×M matrix and B is an N×M matrix, then A * B will give an N×M matrix.
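
A quick shape check of the two operators:

A = np.random.random((4, 2))
B = np.random.random((2, 3))
print((A @ B).shape)   # (4, 3): matrix product of (4, 2) and (2, 3)

C = np.random.random((4, 3))
D = np.random.random((4, 3))
print((C * D).shape)   # (4, 3): element-wise product of two (4, 3) matrices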

Matrix functions

To create an N×M matrix of zeros, use np.zeros(). There is also np.ones(), and we are already familiar with np.random.random().

N, M = 10, 5

np.zeros((N, M))
np.ones((N, M))
np.random.random((N, M))

Let’s say you have a function that only operates on 1D arrays – eg, np.random.dirichlet(). And let’s say you’re trying to apply this function to each column of a matrix of ones. You can achieve this with np.apply_along_axis(). It takes three arguments: the function to apply, the axis to apply it along, and the matrix to apply it to. As usual, axis=0 is columns, axis=1 is rows.

N, R = 10, 5
W = np.ones((N, R))

np.apply_along_axis(np.random.dirichlet, 0, W)

You can sum all of a matrix’s values with the .sum() method. If you want per-column or per-row sums, you can use the .sum() method’s axis argument.

N, M = 5, 3
X = np.ones((N, M))

X.sum()        # --> 15.0
X.sum(axis=0)  # --> array([ 5.,  5.,  5.])
X.sum(axis=1)  # --> array([ 3.,  3.,  3.,  3.,  3.])

To normalize the columns of a matrix, we can just divide the matrix by the sum of the columns.

N, M = 5, 3
X = np.random.random((N, M))

X /= X.sum(axis=0)

It’s actually a little surprising that this works, given that we’re asking to divide a 5×3 matrix by a 3-element vector. NumPy figures out what you want because it aligns shapes from the right: a (3,) vector lines up with the last axis of a (5, 3) matrix, so NumPy performs element-wise division between the vector and each row of the matrix. A (5,) vector does not line up with the last axis, so dividing a 5×3 matrix by a 5-element vector fails … at least without a little help. To give NumPy that help, we need NumPy broadcasting.
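
With that help (this is what the broadcasting section below explains), row normalization looks like this:

X = np.random.random((5, 3))
X /= X.sum(axis=1)[:, np.newaxis]   # row sums reshaped to (5, 1) broadcast
                                    # across the columns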

Other useful functions to be familiar with:

np.multiply()
np.divide()
np.power()

Check your dimensions as you go!

NumPy broadcasting for element-wise operations

If you have a matrix A with dimensions N×M, and you want to add an M-element array b to all of A’s rows, you just need to add another axis to b, keeping the two same-sized axes aligned. In this example, A.shape is (N, M) and b.shape is (M,).

To add an axis to b, NumPy uses the notation: B = b[np.newaxis,:] Since we put the new axis first, B.shape will be (1, M). If we had instead done b[:,np.newaxis], then B.shape would be (M, 1). But we are trying to make B compatible with A, so we made B’s second dimension agree with A’s second dimension (ie, both are of size M).

np.newaxis is synonymous with None, so you could also do b[None,:].

Now that A and B are both 2D matrices, you can just say: A + B, and NumPy will replicate the first (and only) row of B N times before performing standard element-wise addition on two N×M matrices. This is best described visually:

[diagram of NumPy broadcasting]

In Python:

A = np.array([0, 10, 20, 30])
B = np.array([0, 1, 2])

print('A:', A, 'shape:', A.shape)
print('B:', B, 'shape:', B.shape)
A: [ 0 10 20 30] shape: (4,)
B: [0 1 2] shape: (3,)
# we want to reshape A to (4, 1)
new_A = A[:, np.newaxis]
print("new A:\n", new_A, 'shape:', new_A.shape)
new A:
 [[ 0]
 [10]
 [20]
 [30]] shape: (4, 1)
# now we can add A and B
print("A+B:\n", new_A+B)
A+B:
 [[ 0  1  2]
 [10 11 12]
 [20 21 22]
 [30 31 32]]

Note: Broadcasting is not strictly necessary for this homework. Anything you can do with NumPy broadcasting, you can also do by brute force with a for loop; it will just run much slower (~10-100x for this homework).