### MCB112: Biological Data Analysis (Fall 2018)

• Verena: A nice clear solution. Uses compact itertools.product() calls to create lists of all possible 2-mers and 3-mers, collects their counts, and converts them to frequencies (that is, estimated joint probabilities). When she calculates a log-odds score for a given read sequence, she constructs the appropriate second-order Markov conditional probability $$P(x_i \mid x_{i-2},x_{i-1})$$ from the 3-mer and 2-mer joints, $$\frac{P(x_{i-2},x_{i-1},x_i)}{P(x_{i-2},x_{i-1})}$$.
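A minimal sketch of this approach (not Verena's actual code; the training reads and the pseudocount of 1 are illustrative assumptions): build count dicts over all possible 2-mers and 3-mers with itertools.product(), normalize to joint probabilities, and form the conditional as a ratio.

```python
# Sketch: estimate 2-mer and 3-mer joint probabilities, then form the
# conditional P(x_i | x_{i-2}, x_{i-1}) as a ratio of the two joints.
# Training sequences and the +1 pseudocount are made up for illustration.
import itertools

alphabet = 'ACGT'
seqs = ['ACGTACGGT', 'GGGACGTTT']   # hypothetical training reads

# initialize counts for all possible 2-mers and 3-mers (pseudocount 1 avoids zeros)
c2 = {''.join(t): 1 for t in itertools.product(alphabet, repeat=2)}
c3 = {''.join(t): 1 for t in itertools.product(alphabet, repeat=3)}

for s in seqs:
    for i in range(len(s) - 1): c2[s[i:i+2]] += 1
    for i in range(len(s) - 2): c3[s[i:i+3]] += 1

# convert counts to frequencies = estimated joint probabilities
n2, n3 = sum(c2.values()), sum(c3.values())
p2 = {k: v / n2 for k, v in c2.items()}
p3 = {k: v / n3 for k, v in c3.items()}

def cond(trimer):
    """P(x_i | x_{i-2}, x_{i-1}) from the 3-mer and 2-mer joints."""
    return p3[trimer] / p2[trimer[:2]]
```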
• Kevin: Nice visualization of Moriarty's test (pathogen vs. uniform-composition random sequences) compared to a more relevant test of trying to distinguish pathogen from mouse reads. Like Verena's solution, uses itertools.product() to compactly construct dicts of all 2-mers and 3-mers, counts and estimates joint probabilities; uses those joints to obtain the appropriate 2nd-order Markov conditional probability when scoring a read. A nice touch: shows that this model is just as good as Moriarty's in discriminating pathogen vs. random sequences.
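The scoring step common to both solutions can be sketched as a log-odds sum against a uniform i.i.d. null (Moriarty's random-sequence model). The function below is an illustration, not Kevin's code; `cond` is any callable returning the Markov conditional for a trimer.

```python
import math

def logodds(read, cond, null=0.25):
    """Log2 odds of a read under a Markov model vs. a uniform null.

    cond(trimer) should return P(x_i | x_{i-2}, x_{i-1}); the first two
    residues are treated as given context and are not scored.
    """
    return sum(math.log2(cond(read[i-2:i+1]) / null)
               for i in range(2, len(read)))

# sanity check: a uniform conditional gives a log-odds score of 0 for any read
uniform = lambda trimer: 0.25
print(logodds('ACGTACGT', uniform))   # 0.0
```

A positive score means the read is more likely under the pathogen model than under the null; plotting score histograms for pathogen vs. mouse (or vs. random) reads gives the kind of comparison Kevin visualized.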
• Sean: One of my tricks is to 'digitize' sequences, so I can treat them as arrays of (0,1,2,3) instead of A,C,G,T. That lets me use numpy arrays indexed by residue number. I also compute an 'encoding' of k-mers as a base-4 number: 0..15 for dinucleotides AA..TT, 0..63 for 3-mers AAA..TTT, and so on. That lets me do efficient, arbitrary Nth-order Markov models in terms of $$P(x_i \mid c)$$, for a previous context $$c$$ of any length.
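A sketch of the digitize/encode trick (function names here are illustrative, not from the actual solution code): map A,C,G,T to 0..3, encode each length-N context as a base-4 integer, and accumulate counts for an arbitrary Nth-order model in a plain numpy array.

```python
import numpy as np

DIGIT = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def digitize(seq):
    """Convert a DNA string to a numpy array of residue indices 0..3."""
    return np.array([DIGIT[c] for c in seq], dtype=np.int64)

def encode_kmers(x, k):
    """Base-4 codes (0..4**k - 1) of every k-mer in digitized sequence x."""
    codes = np.zeros(len(x) - k + 1, dtype=np.int64)
    for j in range(k):
        codes = codes * 4 + x[j : j + len(codes)]
    return codes

def count_order_N(x, N):
    """counts[c, r]: times residue r follows the length-N context coded c."""
    counts = np.zeros((4**N, 4), dtype=np.int64)
    ctx = encode_kmers(x, N)[:-1]        # context preceding each position N..
    np.add.at(counts, (ctx, x[N:]), 1)   # unbuffered scatter-add per (c, r)
    return counts

x = digitize('ACGTACGGT')
codes = encode_kmers(x, 2)   # e.g. AC -> 0*4+1 = 1, CG -> 1*4+2 = 6
```

Normalizing each row of `counts` (with pseudocounts) then gives the conditional $$P(x_i \mid c)$$ for any order N, without ever building nested dicts of k-mer strings.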