### MCB112: Biological Data Analysis (Fall 2017)

# Section 10: Non-negative Matrix Factorization

*Notes by William Mallard [11/10/2017]*

## Non-negative matrix factorization

**Non-negative matrix factorization (NMF)** is a matrix decomposition technique that splits a matrix into two smaller non-negative matrices whose product is approximately equal to the original matrix.

*Why do we want to do this?* The idea is that there is some rule-driven process generating our data; we have taken a number of noisy measurements of the system’s output, and we would like to discover the underlying structure of our data – and ultimately infer the specific rules of the generative model.

In a deep sense we are actually **compressing** our data.

For NMF, our matix of observations **V** is of size N×M; we are decomposing it into two matrices **W** and **H** of size N×R and R×M. If (N, M, R) = (100, 60, 5), then **V**, **W**, and **H** contain 6000, 500, and 300 elements. If NMF succeeds at factoring V into **W** × **H**, then we’ve effectively compressed 6000 numbers into just 800 numbers (ie, 500 + 300).

Since **W** × **H** only approximates **V**, this is actually **lossy compression**.

For this problem set, we’re observing N genes across M conditions. We think there are several sets of gene regulatory modules, but we don’t know how many there are, and we don’t know which genes are in them. To figure this out, we will factor **V**.

Just as you would factor 15 into 3×5, our goal is to factor **V** into **W** × **H**. Instead of natural numbers (ie, non-negative integers), **W** and **H** are non-negative matrices.

For matrix multiplication to be possible, **W** and **H** must have compatible dimensions. So if we are given **V** with dimensions N×M, then **W** and **H** must have dimensions N×R and R×M, for some R.

Since we don’t know R *a priori*, we need to factor **V** using a range of R’s, and use log likelihood to assess how close each resulting **W** × **H** is to **V**.

If we succeed, then:

- R is the number of gene regulatory modules
**W**tells us the identities of the genes in each of those modules**H**tells us how much each of those modules contributes to each of the samples

## Python tips

### Random functions

To select N numbers from 0..M-1 according to a list of weights W:

```
N, M = 10, 5
W = np.random.dirichlet(np.ones(M))
np.random.choice(M, size=N, p=W)
```

This might come in handy when assigning genes to modules.

To generate a list of N random numbers that sums to 1:

```
np.random.dirichlet(np.ones(N))
```

You can feed `np.random.dirichlet`

values other than 1 and it will treat them as weights. If you feed it a list of ones and zeros, it will leave the 0’s as 0’s, and replace the 1’s with random fractional values that sum to one. This might come in handy when generating your **W** matrix.

To add Poisson noise to a matrix:

```
N, M = 10, 5
L = 1e6 * np.random.random((N, M))
np.random.poisson(L)
```

### Matrix multiplication

Python has a dedicated single-character infix math operator specifically for the purpose of matrix multiplication: **the @ symbol**. This was added to Python3 to eliminate ambiguity about what

`*`

means in the context of two matrices.The `@`

operator performs matrix multiplication on two matrices of compatible dimensions. eg, if `A`

is an N×R matrix and `B`

is an R×M matrix, then `A @ B`

will give an N×M matrix.

The `*`

operator performs element-wise multiplication on two matrices of the same dimensions. eg, if `A`

is an N×M matrix and `B`

is an N×M matrix, then `A * B`

will give an N×M matrix.

### Matrix functions

To create an N×M matrix of zeros, use `np.zeros()`

. There is also `np.ones()`

, and we are already familiar with `np.random.random()`

.

```
N, M = 10, 5
np.zeros((N, M))
np.ones((N, M))
np.random.random((N, M))
```

Let’s say you have a function that only operates on 1D arrays – eg, `np.random.dirichlet()`

. And let’s say you’re trying to apply this function to each column of a matrix of ones. You can achieve this with `np.apply_along_axis()`

. It takes three arguments: the function to apply, the axis to apply it along, and the matrix to apply it to. As usual, axis=0 is columns, axis=1 is rows.

```
N, R = 10, 5
W = np.ones((N, R))
np.apply_along_axis(np.random.dirichlet, 0, W)
```

You can sum all of a matrix’s values with the `.sum()`

method. If you want to sum the values in all columns or rows, you can use the `.sum()`

method’s `axis`

argument.

```
N, M = 5, 3
X = np.ones((N, M))
X.sum() # --> 30.0
X.sum(axis=0) # --> array([ 5., 5., 5.])
X.sum(axis=1) # --> array([ 3., 3., 3., 3., 3.])
```

To normalize the columns of a matrix, we can just divide the matrix by the sum of the columns.

```
N, M = 5, 3
X = np.random.random((N, M))
X /= X.sum(axis=0)
```

It’s actually a little surprising that this works, given that we’re asking to divide a 5×3 matrix by a 3-element vector. NumPy somehow figures out that you actually want it to perform element-wise division between the 1×3 matrix and each row of the 5×3 matrix. But for reasons that are unclear to me, it cannot figure out what you want it to do for a 5×3 matrix and a 5-element vector … at least not without a little help. To help NumPy figure out what you want, we need to use NumPy broadcasting.

### NumPy broadcasting

If you have a matrix **A** with dimensions N×M, and you want to add an M-element array **b** to all of **A**’s rows, you just need to add another axis to **b**, keeping the two same-sized axes aligned. In this example, `A.shape`

is (N, M) and `b.shape`

is (M,).

To add an axis to **b**, NumPy uses the notation: `B = b[np.newaxis,:]`

Since we put the new axis first, `B.shape`

will be (1, M). If we had instead done `b[:,np.newaxis]`

, then `B.shape`

would be (M, 1). But we are trying to make **B** compatible with **A**, so we made **B**’s second dimension agree with **A**’s second dimension (ie, both are of size M).

A few notes:

- Your goal is to make
*each position*of`A.shape`

and`B.shape`

*agree*– either they should be equal, or one of them should equal 1. - To add an axis to a 2D matrix
**C**, you could use C[:,:,np.newaxis], C[:,np.newaxis,:], or C[np.newaxis,:,:] depending on where you wanted to put the new axis. - To add an axis to a 3D matrix
**D**, you could use D[:,:,:,np.newaxis], etc. `np.newaxis`

is synonymous with`None`

, so you could also do`b[None,:]`

.

Now that **A** and **B** are both 2D matrices, you can just say: `A + B`

, and Numpy will replicate the first (and only) row of **B** M times before performing standard element-wise addition on two N×M matrices. This is best described visually:

**Note:** Broadcasting is not strictly necessary for this homework. Anything you can do with NumPy broadcasting, you can also do by brute force with a `for`

loop; it will just run much slower (~10-100x for this homework).

## Homework hints

### Problem 1

We want to generate known **W** and **H** matrices, and use them to generate a noisy **V** matrix.

For **W**, each row represents a gene, and each column represents a module. Each row should have a non-zero entry in only one column – unless it is a moonlighting gene in which case it should have a non-zero value in two columns. Each column should sum to one.

**Hint:** You could use `np.random.choice()`

to pick which module to assign each gene to, generate a zeros matrix, use a for loop to fill those choices into that matrix, throw in a few moonlighting genes, and apply `np.random.dirichlet()`

along each column.

For **H**, each row represents a module, and each column represents a sample. This can be random, but each column should sum to one. You could generate a ones matrix, and apply `np.random.dirichlet()`

along each column.

For **V**, find the matrix product of **W** and **H**, and then add Poisson noise.

We’ll feed **V** into our NMF function, and try to recover the true **W** and **H**.

### Problem 2

The NMF function should take two arguments: **V** (the matrix to decompose) and R (the number of modules we think **V** contains).

To decompose **V**, start with randomly generated **W** and **H** matrices, and run the NMF update steps until the log likelihood converges.

The NMF update equations contain three different subscripts, so the easiest solution is to use three nested for loops for the first **W** update step, and another three for the **H** update step.

*If that’s too easy for you and you’re up for a challenge, it is also possible to implement both update steps with nothing but matrix algebra. This runs much faster than the for loop implementation, but it also requires some mathematical intuition and a solid understanding of NumPy broadcasting.*

_ **Hint:** Both update steps are simply projecting two 2D matrices into 3D via broadcasting, performing element-wise multiplication, and collapsing them back to 2D via summation along the axis shared by the original two matrices._

After each update step, we want to calculate the log likelihood that **V** was generated by our current guess at **W** and **H**. See Sean’s lecture notes for the log likelihood formula.

**Note:** You should not need a for loop to calculate log likelihood. All three matrices in the log likelihood equation are of the same size, so you can use normal arithmetic operators and the log function on your NumPy matrices, and then use the `.sum()`

method for the double summation.

**Note:** From step to step, we only care about the gradient of the log likelihood – ie, how much it has changed – which means that any constant terms will cancel out. Specifically, there is a log(V!) term that we can ignore, which is nice because **V**! is huge and dealing with it would be a pain.

Recover **W** and **H** for R=3..6, and compare the resulting log likelihoods. Did you recover the correct R? Can you recover the gene-to-module assignments from **W**? Do you also recover the moonlighting genes?

**Hint:** You’ll probably need to row-normalize **W**, and then look at a histogram. Don’t forget to `flatten()`

your NumPy array before you histogram.

### Problem 3

If you’ve made it this far, Problem 3 should be easy. Take the provided data set, and repeat the same steps as Problem 2. Find R, find **W**, identify any moonlighting genes.