# read mapping with kallisto

## goals this week

• an introduction to kallisto, an innovative, important, and still relatively new method for RNA-seq read mapping and transcript quantitation;

• our first time installing and running a command line application for biological data analysis, and controlling it with Python and Jupyter notebook;

• our first time dealing with a couple of important biological sequence data formats: FASTA for sequences, FASTQ for reads and their base-calling “quality values”;

• a peek ahead at some stuff we’ll get deeper into in coming weeks, as we see how kallisto combines its read mapping algorithm with a likelihood model and expectation maximization for quantitation.

## read mapping: historical overview

In 2006, Illumina Solexa sequencers were introduced, capable of cheaply obtaining a large number of very short reads (20-35nt). Up until then, DNA sequencing had largely been used to assemble sequences of genes and genomes. Using short reads for assembly is problematic, because assembly requires overlap. But many people realized that with an abundance of cheap, short reads, you could use sequencing as a counting assay; you could quantitate the abundance of individual sequences in any DNA/RNA population by sequencing unique tags from them.

There was an explosion in something-Seq technologies, marrying some kind of DNA/RNA enrichment method (often an existing one) to a high-throughput sequencing readout. For example, ChIP-Seq (chromatin immunoprecipitation + sequencing) uses antibodies to purify protein/DNA complexes, to identify what DNA sequences are bound by a specific protein.

For studying RNA populations, various sequencing strategies had already been applied to discovering new RNAs and quantitating their abundance, either by completely sequencing cloned cDNA libraries, or by various strategies for generating fragments or tags, such as expressed sequence tag (EST) sequencing. When Illumina short-read technology was quickly and inevitably applied to high-throughput mRNA quantitation, it was called RNA-seq.

To convert an RNA-seq experiment of $>$40M short 25nt reads to a useful quantitation of each mRNA transcript in a cell, the first computational step is to “map” reads to the possible transcripts they came from. From mapped read counts, we can infer transcript abundances.

To map a 25nt read to a transcriptome or a genome, at first people used similarity search programs like BLAST to find significant (high-scoring) alignments. But BLAST and programs like it are designed to find distantly related sequences, differing by many millions of years of evolution. Their algorithms are tuned to the homology search problem, and they require a fair amount of compute power; for mapping 40M reads at a time, the compute requirements were prohibitive.

But the 25nt read isn’t just similar to its source; it is expected to match exactly or near-exactly, differing essentially only by basecalling errors made by the sequencing machine. This energized a new front in computational sequence analysis: the development of read mappers, using fast exact-match text searching algorithms from computer science, as opposed to the more statistical, inference-based algorithms used for homology search. Exact-match algorithms based on hashing, suffix trees, and suffix arrays became important, but these methods (though powerful) often came with large memory requirements.

### the Burrows-Wheeler transform: BWA and Bowtie

An important milestone in 2009 was the adoption of read mappers based on a surprising, beautiful, and counterintuitive algorithm called the Burrows-Wheeler Transform (BWT). The BWT was originally conceived as a data compression algorithm, but it turned out to also be useful as the basis for memory-efficient, suffix-array-like text searches. Two influential BWT-based read mappers were introduced in 2009: BWA, from Heng Li and Richard Durbin, and Bowtie, from Ben Langmead, Cole Trapnell, Mihai Pop, and Steven Salzberg.

### spliced read mappers, transcriptome assembly

If you know what mRNAs (or mRNA isoforms) are being made from a genome, you can map reads to a transcriptome (the set of all RNAs). If you don’t – if you want to use RNA-seq to identify new RNAs, or alternative isoforms – then you can map reads to a genome. But then you have to deal with the fact that reads that span a spliced intron will map discontinuously to the genome. To use read mapping to identify unexpected intron splicing events, your read mapper has to map both sides of the read separately, allowing an intron between them. An important spliced read mapper was introduced in 2009: TopHat, by Trapnell, Pachter, and Salzberg.

As sequencing read lengths increased, it became possible to contemplate de novo transcriptome assembly, either by mapping reads to a genome first, or by trying to assemble RNA-seq reads directly. Mapping-first approaches include Scripture and Cufflinks; assembly-first approaches include ABySS and Trinity.

I’m just letting you know that these other important angles exist. kallisto, which is where we’re headed, assumes a transcriptome is known.

### quantitation with mapping uncertainty

Reads don’t uniquely map to their source, for a variety of reasons. There are duplications and repetitive sequences (Alus and such), and alternative RNA processing creates overlapping mRNA isoforms that share exonic sequences. Short reads and base calling errors exacerbate read mapping uncertainty.

As we’ve seen, an important milestone in dealing with read mapping uncertainty came from Bo Li, Colin Dewey, and coworkers in 2010. Li and Dewey’s approach introduced a generative statistical model, inference with expectation maximization, and TPM units for measuring transcript abundance. We’ve seen TPM units already, and we’ll be seeing generative statistical models and expectation maximization in weeks to come.

### k-mer hashing

As short reads got longer (150nt reads are now common on Illumina platforms), the cost of aligning an entire read to a potential source remained a problem (an hour or so to map a typical RNA-seq sample). In many areas, including read mapping, sequence-based population diversity studies, and sequence assembly, interest grew in algorithms based on fixed-length k-mers, short subsequences of length $k$; simple and efficient hashing algorithms for identifying matching k-mers; and so-called de Bruijn graphs for representing how overlapping k-mers stitch together into larger “contigs” of transcripts and genomes. (I’ll define these terms below; kallisto uses all of these concepts.)

An important milestone was the 2014 introduction of Sailfish, by Rob Patro, Steve Mount, and Carl Kingsford. The Sailfish paper, building on previous work on fast k-mer hashing in a program called Jellyfish, emphasized that you don’t need to map an entire 50-150nt read to know where it came from. Mapping a k-mer, if the k-mer is long enough to be unique, is usually sufficient. You can think in terms of assignment of a read to potential sources, rather than alignment. Sailfish showed that this “alignment-free” approach could accelerate RNA-seq mapping and quantitation by more than an order of magnitude, and it inspired the development of kallisto.

## the kallisto pseudoalignment algorithm

kallisto was published last year by Nicolas Bray, Harold Pimentel, Pall Melsted, and Lior Pachter, after first appearing as a preprint on the bioRxiv. The key idea is pseudoalignment. Like Sailfish, kallisto uses fast k-mer hashing to identify matching k-mers between a read and a transcriptome. kallisto uses the set of k-mer matches for each read to deduce the set of possible transcripts that the read could have come from: it calls this assignment a pseudoalignment. (kallisto assumes that the transcriptome is known: you provide an input file with the sequences of all the possible RNA isoforms that you think are in play. We’ll call this set of transcript sequences $T$.)

### intuition

Figure 1 from the kallisto paper. (a) A read (black) and three mRNA isoforms (colors). (b) A toy colored de Bruijn graph for the transcriptome of the three mRNAs. (c) The k-mers of the read are hashed to vertices in the graph (black circles), identifying a subpath. (d) Many k-mers can be skipped, because they are covered by the same set of transcripts as a previous k-mer. (e) The set of transcripts compatible with the read is the intersection of the sets of transcripts compatible with its k-mers.

Here’s the first main intuition behind kallisto. Each individual k-mer in an (error-free) read matches a set of possible transcripts $S_j \subseteq T$. We can find each set $S_j$ quickly by hashing. There’s at least one transcript that matches all the k-mers in the read. The set $S$ of transcripts compatible with the entire read is the intersection of the compatibilities of its k-mers, $S = \bigcap_j S_j$.
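In Python terms, with each $S_j$ as a set, that intersection is a one-liner. A tiny sketch, with made-up transcript names:

```python
# hypothetical per-k-mer compatibility sets S_j for one read:
S_list = [{'tx1', 'tx2', 'tx3'},
          {'tx1', 'tx2'},
          {'tx1', 'tx2', 'tx4'}]

S = set.intersection(*S_list)   # transcripts compatible with the whole read
print(S)                        # {'tx1', 'tx2'}
```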

Here’s the second main intuition. We can do even better than this, because we have information about the expected colinear order of the matching k-mers. Suppose we make a graph of the transcriptome with k-mers as its nodes (vertices) and edges connecting directly overlapping k-mers. Transcripts that share overlapping exonic sequence create branches in this graph. Each transcript corresponds to a specific subpath through the graph. Each vertex (each k-mer) has a set of transcript paths that go through it. Adjacent vertices tend to share the same set of paths, until a transcript stops or starts, or the graph branches. We can identify contigs in the graph that are covered by the same set of transcripts. We only need the intersection of the sets $S_j$, so once we’ve seen one k-mer map to a contig, we don’t have to look at any more of the vertices in that contig, and we can skip to a k-mer in the next possible contig(s). This skipping saves many k-mer lookups.

kallisto expects to be operating in a regime of high sequencing accuracy, because it does exact-match hashing of its k-mers. The key parameter is the k-mer length $k$, which must be short enough that most k-mers are error-free, but long enough that error-free k-mers map only where they’re supposed to, and k-mers containing any base calling errors don’t map to the transcriptome at all.
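A quick back-of-the-envelope calculation (my illustrative numbers, not from the kallisto paper): with a per-base error rate of, say, $\epsilon = 0.001$ and kallisto’s default $k = 31$,

$$ P(\text{k-mer is error-free}) = (1-\epsilon)^k = 0.999^{31} \approx 0.97, $$

so about 97% of the k-mers in a read are error-free, while a 31-mer that does contain an error is very unlikely to match a transcriptome of total length $L_T$ by chance (roughly $L_T / 4^{31}$ expected accidental matches, with $4^{31} \approx 4.6 \times 10^{18}$).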

### k-mers

A k-mer is just a contiguous subsequence of length $k$. For example, if we have a read GGAACTGAGGT and $k=4$, the k-mers on the top strand are GGAA, GAAC, AACT, ACTG, CTGA, TGAG, GAGG, and AGGT.
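A sketch of k-mer extraction in Python (not kallisto’s code, but handy for checking small examples like this one):

```python
def kmers(seq, k):
    """Return the overlapping k-mers of seq, in order, 5' to 3'."""
    return [seq[i:i+k] for i in range(len(seq) - k + 1)]

print(kmers('GGAACTGAGGT', 4))
# ['GGAA', 'GAAC', 'AACT', 'ACTG', 'CTGA', 'TGAG', 'GAGG', 'AGGT']
```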

### hashing

A hash table is a method for rapidly indexing and looking up information associated with a key. Here, our keys are the k-mers. The hash table consists of a fixed number of slots, with each slot containing a list of keys and whatever information we need about each key. A hash function $h(x)$ maps a key $x$ to an integer index that tells us that key $x$ goes in slot $h(x)$. The basic idea is that if you know what keys you’re going to need to look up fast – the k-mers in the known transcriptome – you can precompute a hash table for them. Then, for each k-mer $r_j$ in a read, we compute the hash function $h(r_j)$, go immediately to that slot, and find the list of transcripts that have that matching k-mer.

We generally want the hash table to have a number of slots about on the order of the number of keys we need to store. Then instead of having to search over all $N$ key strings of length $K$ to find a match – naively an $O(KN)$ operation – we compute a hash function in $O(K)$ and go straight to the matching slot.
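In Python you rarely build a hash table by hand, because the built-in dict is one: Python hashes each key for you. A minimal sketch of a k-mer index, with hypothetical transcript names:

```python
index = {}                                # k-mer -> set of transcript names
for tname, seq in [('tx1', 'GGAACT'), ('tx2', 'AACTGA')]:
    for i in range(len(seq) - 4 + 1):     # index all 4-mers of each transcript
        index.setdefault(seq[i:i+4], set()).add(tname)

print(index['AACT'])                      # {'tx1', 'tx2'}, in O(1) expected time
```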

An example of a simple hash function: we could just add up the ASCII values of the letters in a key string, and take the sum modulo the number of slots:

```python
def naive_hash(s, hashsize):
    """Map string s to a slot index in 0..hashsize-1."""
    h = 0
    for c in s:
        h += ord(c)      # sum the character code values
    return h % hashsize
```

This is a bad hash function that will tend to assign slots nonuniformly, because biases in the composition of the keys produce biases in the hash values. An ideal hash function assigns its keys uniformly to slots. Rarely, we might have a special problem where we could achieve that exactly (by knowing something about the keys), but more commonly, the best we can do is a hash function that maps its inputs randomly (though deterministically and reproducibly) to slots. The way people think about hash functions is therefore related to the way people think about deterministic pseudorandom number generators.

Sometimes in DNA/RNA sequence analysis, we use direct hashing, where we represent a fixed-length k-mer string by an integer and use that value directly. For example, we can represent a 6-mer sequence GAATTC in base-4 as 200331, convert the base-4 to 2048 + 48 + 12 + 1 = 2109, and hash it into one of $4^6 = 4096$ slots 0..4095. Because the number of slots in a direct DNA hash is $4^k$, we can run into memory problems as $k$ grows.
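A sketch of that direct encoding in Python (assuming the k-mer contains only uppercase ACGT):

```python
def direct_hash(kmer):
    """Encode a DNA k-mer as a base-4 integer (A=0, C=1, G=2, T=3)."""
    val = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    h = 0
    for base in kmer:
        h = h * 4 + val[base]   # shift left one base-4 digit, add this base
    return h

print(direct_hash('GAATTC'))   # 2109, a slot in 0..4095 for k=6
```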

### the transcriptome de Bruijn graph (T-DBG)

A key difference between Sailfish and kallisto is that Sailfish processes k-mers independently, whereas kallisto keeps information about the order of the k-mers on the read and the transcriptome. kallisto uses the graph structure we described above in the intuition.

A de Bruijn graph is a directed graph in which vertices are k-mers, and edges represent overlaps between k-mers. A sequence is a path through the graph. If we had a perfect error-free de Bruijn graph (and a single chromosome, and no issues with repeats), we could identify an Eulerian path that visits every edge exactly once, and we’d recover the complete genome sequence assembly.

kallisto uses a variant called a colored de Bruijn graph. Each transcript traverses a subpath of vertices, so at each vertex we keep track of the set of transcripts that cover it. kallisto calls this set the “k-compatibility class” of the k-mer. (In the kallisto paper, the key line is: “the path cover of the transcriptome induces a k-compatibility class for each k-mer.”)
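Here’s a toy sketch of the k-compatibility idea in Python (not kallisto’s actual data structure; kallisto stores contigs of the T-DBG and skips ahead, as in the figure; the transcript names are made up):

```python
def build_kcomp(transcriptome, k):
    """Map each k-mer in the transcriptome to its k-compatibility class:
    the set of transcripts whose paths cover that k-mer."""
    kcomp = {}
    for tname, seq in transcriptome.items():
        for i in range(len(seq) - k + 1):
            kcomp.setdefault(seq[i:i+k], set()).add(tname)
    return kcomp

def pseudoalign(read, kcomp, k):
    """Intersect the k-compatibility classes of a read's k-mers."""
    S = None
    for i in range(len(read) - k + 1):
        Sj = kcomp.get(read[i:i+k])
        if Sj is None:                       # k-mer absent from transcriptome
            continue                         # (e.g. a basecalling error)
        S = set(Sj) if S is None else S & Sj
    return S if S is not None else set()

T = {'red': 'GGAACTGAGG', 'blue': 'AACTGAGGTC'}
kcomp = build_kcomp(T, k=4)
print(pseudoalign('AACTGAG', kcomp, k=4))    # {'red', 'blue'}
```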

### the likelihood

Using k-mer hashing on the transcriptome de Bruijn graph, kallisto collects counts $c_e$ for equivalence classes $e \in E$, where the equivalence classes $E$ are all the different sets $S$ of transcripts that a read can be assigned to.

The equivalence classes $E$ are essentially the same as the segments $S$ in this week’s homework on the Arc locus. What we want are counts assigned to transcripts $i$, but what we have are counts assigned to contigs/equivalence classes/segments; for each equivalence class there are several possible transcripts, so the source transcript identity is a hidden variable in our inference.

The kallisto paper states the likelihood as:

$$ L(\tau) \propto \prod_{f \in F} \sum_{t \in T} y_{f,t} \, \frac{\tau_t}{L_t} $$

where $\tau_i$ are the unknown transcript abundances (in TPM), $L_i$ are the (effective) transcript lengths, $F$ is the set of observed fragments $f$ (from each of which we’ve obtained reads from both ends, or a read from one end at random), and $y_{f,t}$ is an indicator variable that’s 1 if fragment $f$ is compatible with transcript $t$. This can be rewritten in terms of the counts $c_e$ for each equivalence class:

$$ L(\tau) \propto \prod_{e \in E} \left( \sum_{t \in e} \frac{\tau_t}{L_t} \right)^{c_e} $$

(I had to stare at this last bit for a long time to convince myself that it’s probably right.)

kallisto then finds a vector $\tau$ that maximizes this likelihood using an iterative algorithm called expectation-maximization (EM). We’ll see this algorithm – and implement it – later in the course. Conceptually it’s straightforward (there’s a small code sketch after this list):

• if I told you $\tau$, then I could apportion each read among its compatible transcripts $i$ in proportion to $\tau_i$, and thus collect expected counts of reads for each transcript. (Expectation step.)

• If I told you the read count for each transcript $i$, then you could renormalize to estimate its relative abundance $\tau_i$. (Maximization step.)

• You don’t know either one… but if you start from a random guess at $\tau$, and iterate the expectation and maximization steps, you’ll converge to a local optimum for $\tau$.
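To make the circular logic concrete, here is a minimal EM sketch on equivalence-class counts, with made-up counts. For simplicity it assumes all transcripts have equal effective length, so apportioning in proportion to $\tau_i$ alone is correct; to match the likelihood above, the full version would apportion in proportion to $\tau_t/L_t$.

```python
def em(counts, transcripts, iterations=100):
    """EM on equivalence-class counts.
    counts: dict mapping frozenset of transcript names -> read count."""
    tau = {t: 1.0 / len(transcripts) for t in transcripts}   # start uniform
    for _ in range(iterations):
        # E step: apportion each class's reads by the current tau
        c = {t: 0.0 for t in transcripts}
        for eclass, n in counts.items():
            Z = sum(tau[t] for t in eclass)
            for t in eclass:
                c[t] += n * tau[t] / Z
        # M step: renormalize expected counts into abundances
        N = sum(c.values())
        tau = {t: c[t] / N for t in transcripts}
    return tau

counts = {frozenset(['tx1']): 50,
          frozenset(['tx1', 'tx2']): 100,
          frozenset(['tx2', 'tx3']): 60}
print(em(counts, ['tx1', 'tx2', 'tx3']))
```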

## practicalities with kallisto

### installing kallisto

Instructions are here.

Here’s what I did on my Mac OS X laptop:

```
# install Homebrew package manager
```

### manipulating and transforming strings

Data types in Python are objects, and objects have methods that you can call to manipulate their contents. You get used to the idea that even seemingly elemental data types are objects with methods. (I’m a C programmer; for me, it takes a lot of getting used to.) A good and useful place to practice is with strings. We’ve already seen the <str>.format() method, in learning how to print stuff, and the <str>.split() method for splitting a line into data fields, but you may not have been thinking of these as methods that operate on a data object. Here are some other examples:

```python
S  = '{:.2f}'.format(1.2345)           # S = '1.23'
S  = 'acgt'.upper()                    # S = 'ACGT'
S  = ','.join(['abc', 'def', 'ghi'])   # S = 'abc,def,ghi' (concat a list, with <str> as the separator!)
```

Here’s a more complex example that will be useful in this week’s homework (though there are a jillion other ways to do it too). <str>.translate(tbl) translates each character in <str> to another character, according to a so-called “translation table” tbl. You create a translation table using str.maketrans(old, new), where (a detail!) str is literal, not a variable (str.maketrans is a so-called static method). For example, this useful little incantation:

```python
S = 'AAAAAAAAAAAAAA'.translate(str.maketrans('ACGT', 'TGCA'))
```

returns ‘TTTTTTTTTTTTTT’, having converted A’s to T’s (and C to G, G to C, T to A).
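Combining <str>.translate() with the [::-1] reverse slice gives a reverse complement in one line. A minimal sketch (assuming an uppercase ACGT string), which may come in handy this week:

```python
def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans('ACGT', 'TGCA'))[::-1]

print(revcomp('GGAACT'))   # 'AGTTCC'
```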