section 04: Alignment

Jakob Heinz (2026) | Joshua Park (2024)


Full disclosure, I have never crocheted, regardless, I like to think about dynamic programming like how I imagine crocheting to be. The hard part is figuring out the pattern, but once you have that, you can just stitch across the rows and columns. So, hopefully, you find implementing a DP alignment function to be a more relaxing than frustrating activity in this week's pset.

And with that, let's get into alignment.

Building an intuition about global alignment with edit distance

Edit distance (also called Levenshtein distance) is older (1965) than the global (Needleman-Wunsch) algorithm (1970), however, conveniently, it is a special form of global alignment that calculates the minimum number of edits needed to transform one string into another. Under this definition, an insertion, a deletion, and a substitution are all penalized equally, meaning each counts as one edit, so our gap penalty is 1 and our mismatch penalty is 1. I think edit distance is a nice example to begin with, since the addition is nice (all whole numbers) and every change is penalized equally, so there are less numbers to keep track of!

For example, consider the two sequences:

X = "CAT"
Y = "GAT"

To change X into Y, we can see that we just need to substitute the C at X[0] with a G, so these two strings have an edit distance of 1. Let's verify this by filling in a DP matrix.

A brief exposition on the empty string, \(\epsilon\)

When I first learned about DP alignment matrices, I struggled to understand what the empty string represented. The matrix has size (len(X) + 1) × (len(Y) + 1) because the extra row and column represent aligning each sequence against the empty string, commonly denoted \(\epsilon\), or in python, "". The cost of aligning a string of length \(l\) against \(\epsilon\) is \(l\), meaning you need \(l\) insertions (or deletions) to get from an empty string to that string. For example, the edit distance between the empty string \(\epsilon\) and the sequence CAT is 3, since we would need to insert C, then insert A, then insert T into \(\epsilon\) to transform it into CAT, which is what these initializations of the first row and the first column in the dp score matrix represent.

And, as you can imagine, the \(\epsilon\) addition to the dp score matrix can create a headache of off by one errors, which is probably where I have lost most of my years and tears. A simple "hack" is to actually append a character, that is not in your sequence vocabulary, to the beginning of the sequences you are trying to align, for example * or $, then your sequence and your scoring matrix are in the same coordinate space!

Back to the actual edit distance algorithm (implemented in python)

# Initialization
dp = np.zeros([len(X)+1, len(Y)+1]) 
for j in range(len(Y) + 1):
    dp[0][j] = j   # cost of aligning Y[1..j] against ε (j insertions)
for i in range(len(X) +1):
    dp[i][0] = i   # cost of aligning X[1..i] against ε (i deletions)

# Fill
for i in range(1, len(X) + 1):
    for j in range(1, len(Y) + 1):
        penalty = 0 if X[i-1] == Y[j-1] else 1  # 0 if match, 1 if mismatch
        deletion     = dp[i-1][j]   + 1        # delete X[i]
        insertion    = dp[i][j-1]   + 1        # insert Y[j]
        substitution = dp[i-1][j-1] + penalty  # substitute X[i] → Y[j] pen = 0 if match 

        dp[i][j] = min(deletion, insertion, substitution)

Note the i-1 and j-1 when indexing into X and Y is the handling of the off-by-one from the \(\epsilon\) row/column. I did not implement the "hack" I proposed earlier.

An example of the edit distance matrix for a substitution

Filling in the edit distance matrix for X = "CAT" Y = "GAT"

Try to fill out this matrix by hand:

\(\epsilon\) G A T
\(\epsilon\)
C
A
T

Initialization: If you are looking at this in class PLEASE STOP SCROLLING, or forget everything you have seen.

\(\epsilon\) G A T
\(\epsilon\) 0 1 2 3
C 1
A 2
T 3

Solution:

\(\epsilon\) G A T
\(\epsilon\) 0 1 2 3
C 1 1 2 3
A 2 2 1 2
T 3 3 2 1

The bottom-right cell gives us an edit distance of 1, which is what we were expecting from our hand calculation.

The traceback is highlighted in bold

Two common approaches to find the traceback

To go from the optimal alignment score in the bottom right corner, to the actual path that got us there, there are two common approaches: 1) Start in the bottom right cell and again calculate min(deletion, insertion, substitution) from the three predecessor cells (diagonal, upper, and left) to determine which one was the winner that resulted in that score in the cell, then move into the cell that resulted in the winner and repeat this step until the indices i and j are both 0. (I personally prefer this approach, but both have their strengths). 2) The other approach is to store a sort of "shadow matrix" while we are filling the dp matrix that contains information of the path. So for example when we assign dp[i,j] = min(deletion, insertion, substitution) we go to the shadow matrix and say what cell we originated from i.e tb[i,j] = (i-1,j-1) if it was a match, or tb[i,j] = (i-1,j) if it was a deletion.

These two approaches will work for edit distance, global, or local alignment (and more!) just be careful to always use the same scoring structure in the traceback as in the fill if you use approach 1.

An example of the edit distance matrix for a substitution AND an insertion

Now let's try a slightly more difficult case with an insertion

Let's consider the two strings

X = "ACG"
Y = "ACCT"

A quick hand calculation shows that we would expect an edit distance of 2. We'll need to insert a C into X, and change the G into a T.

Now let's fill in the scoring matrix to confirm this is true.

Try to fill out this matrix by hand:

\(\epsilon\) A C C T
\(\epsilon\)
A
C
G

Initialization:

\(\epsilon\) A C C T
\(\epsilon\) 0 1 2 3 4
A 1
C 2
G 3

Solution:

\(\epsilon\) A C C T
\(\epsilon\) 0 1 2 3 4
A 1 0 1 2 3
C 2 1 0 1 2
G 3 2 1 1 2

Thankfully, the bottom-right cell gives us an edit distance of 2, as expected. The traceback is highlighted in bold.

A note on the traceback of the optimal alignment here

There are actually two optimal routes for us to achieve an edit distance of 2. It doesn't matter if we add an insertion in X before the C or after the C, or if we have a substitution to C and then an insertion in X.

AC-G
   *
ACCT

and

A-CG
   *
ACCT
ACG-
  *
ACCT

are all optimal alignments for the two strings. Try to identify the cells in the scoring matrix where this tie occurs. It is important to remember that a traceback will give you AN optimal alignment, but not necessarily all of them. It can be a fun practice problem to count all possible optimal alignments of two strings!

Global alignment (Needleman-Wunsch)

We are ready to move on to the more general global alignment. Let's first consider what we want to change from the edit distance case: 1) We are no longer counting changes, but rather assigning penalties, so we now want to find the alignment with the highest score as opposed to before we wanted the lowest edit distance. 2) We now want to penalize insertions and deletions more strongly than substitutions, as they are biologically much less likely, so let's set the gap penalty,\(\gamma\), to -2. 3) We still want to use a linear gap penalty as we have been. 4) If there is a match, we want to reward that, so matches are given a score of +1 5) Some substitutions are more likely than others, which can depend on actual biology or the sequencing technology you are using. For this example let's consider A↔G and C↔T mutations and their reciprocals as more likely. So for these mismatches we will penalize them by -0.5 and all other mismatches we penalize -1. The rationale behind this is that the A↔G and C↔T entries are penalized lower because these are transitions (purine↔purine and pyrimidine↔pyrimidine), which occur more frequently. All other mismatches are transversions (purine↔pyrimidine) and are penalized the full -1.

In summary, when we compare two bases, their score is determined by the substitution matrix \(\sigma\):

$ \sigma =

\begin{array}{c|cccc} & A & C & G & T \\ \hline A & +1 & -1 & -0.5 & -1 \\ C & -1 & +1 & -1 & -0.5 \\ G & -0.5 & -1 & +1 & -1 \\ T & -1 & -0.5 & -1 & +1 \\ \end{array}

$

and we have \(\gamma = -2\)

Now let's try this global alignment scoring on the following sequences:

X = "AATC"
Y = "GATCT" 

Try to fill out this matrix by hand:

\(\epsilon\) G A T C T
\(\epsilon\)
A
A
T
C

Initialization:

\(\epsilon\) G A T C T
\(\epsilon\) 0 -2 -4 -6 -8 -10
A -2
A -4
T -6
C -8

Solution:

\(\epsilon\) G A T C T
\(\epsilon\) 0 -2 -4 -6 -8 -10
A -2 -0.5 -1 -3 -5 -7
A -4 -2.5 0.5 -1.5 -3.5 -5.5
T -6 -4.5 -1.5 1.5 -0.5 -2.5
C -8 -6.5 -3.5 -0.5 2.5 0.5

The optimal alignment score is 0.5 (bottom-right cell). The traceback is highlighted in bold.

Alignment:

AATC-
*
GATCT

Local alignment (Smith-Waterman)

Let's build off of our previous problem (again) and think about what we want to change from global alignment:

1) We want to allow the alignment to start anywhere in either sequence, if the alignment restarts, it will start from a score of 0, so if no continuation of the alignment of the prefixes gives a score over 0 we want to restart the alignment search (insert a 0 in the matrix) 2) When we find the maximal scoring alignment, we need to check the whole matrix, not just the bottom right corner. 3) That's really it! It's surprising (at least to me) that such a powerful change was really only a small change in the algorithm

We will reuse the substitution matrix \(\sigma\) from above and the same gap penalty \(\gamma = -2\)

Let's try to align these two sequences:

X = "ATTG"
Y = "GATTCA"

Try to fill out this matrix by hand:

\(\epsilon\) G A T T C A
\(\epsilon\)
A
T
T
G

Initialization (now we initialize to 0, not \(i \cdot \gamma\)):

\(\epsilon\) G A T T C A
\(\epsilon\) 0 0 0 0 0 0 0
A 0
T 0
T 0
G 0

Solution:

\(\epsilon\) G A T T C A
\(\epsilon\) 0 0 0 0 0 0 0
A 0 0 1 0 0 0 1
T 0 0 0 2 1 0 0
T 0 0 0 1 3 1 0
G 0 1 0 0 1 2 0.5

We search the whole matrix to find that the maximum score is 3.

Alignment:

ATT
ATT

Python examples

We likely won't have time to cover this in class, but we provide practice problems (and solutions!) for implementing the algorithms we discussed today!

Practice implementing global (Needleman-Wunsch) and local (Smith-Waterman) versions of the DP algorithm: w04-section-exercises.ipynb.

Solutions: w04-section-solutions.ipynb

Additional thoughts and readings

If, throughout the course of this week, you have come to realize that you also love alignment, do I have good news for you. There are (probably) thousands (at the very least hundreds) of papers on these topics and I encourage you to look into some of them (even if you hate alignment). Below is an insufficient and incomplete overview of what I think are interesting topics to explore further:

As you may have noticed, most of the excitement in alignment occurs around the diagonal in the dp scoring matrix, so many optimizations will not fill the entire matrix. At some point we don't need to track that an alignment score of -1000 somewhere in the matrix becomes -1002 with one more insertion. Ukkonen's algorithm and X-drop are popular approaches to restrict a dp scoring matrix to only fill a band around the diagonal, rather than fill the whole matrix.

When aligning reads to a genome it can also be quite slow to score the read against the whole genome (as you will see in your pset this week), so a common work around is to find small "seeds" of exact kmer matches, then extend these seeds to do a full dp alignment in a much narrower search window of the genome. This seed-and-extend approach is used by popular aligners like BLAST and Minimap2 use.

There are also computer engineering speedups such as only storing the difference in score between cells in the dp matrix rather than the sum of the alignment. This allows us to work with u8 integers rather than u64, which allows for SIMD (Single instruction, Multiple Data) speedups. This is the Suzuki and Kasahara approach.