### MCB112: Biological Data Analysis (Fall 2018)

# answers 04:

*a plague of sand mice*

Three different answers to peruse and compare for this week's problem set, as Jupyter Notebook pages for download. There are various ways to manipulate the first-order Markov model probabilities in this problem:

Verena: A nice clear solution. Uses compact

`itertools.product()`

calls to create lists of all possible 2-mers and 3-mers, collects their counts, and converts them to frequencies (and therefore estimated joint probabilities). When she calculates a log-odds score for a given read sequence, she constructs the appropriate first-order Markov conditional probability \(P(x_i \mid x_{i-2},x_{i-1})\) from the 3-mer and 2-mer joints, as \(\frac{P(x_{i-2},x_{i-1},x_i)}{P(x_{i-2},x_{i-1})}\).

Kevin: Nice visualization of Moriarty's test (pathogen vs. uniform-composition random sequences) compared to a more relevant test of trying to distinguish pathogen reads from mouse reads. Like Verena's solution, uses

`itertools.product()`

to compactly construct dicts of all 2-mers and 3-mers, counts them, and estimates joint probabilities; uses those joints to obtain the appropriate first-order Markov conditional probability when scoring a read. A nice touch: he shows that this model is just as good as Moriarty's at discriminating pathogen vs. random sequences.

Sean: One of my tricks is to 'digitize' sequences, so I can treat them as arrays of (0,1,2,3) instead of A,C,G,T. That lets me use numpy arrays indexed by residue number. I also compute an 'encoding' of k-mers as a base-4 number: 0..15 for dinucleotides AA..TT, 0..63 for 3-mers AAA..TTT, and so on. That lets me do efficient, arbitrary Nth-order Markov models in terms of \(P(x_i \mid c)\), for a previous context \(c\) of any length.
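The shared idea in all three notebooks can be sketched roughly as follows: digitize each sequence to (0,1,2,3), encode k-mers as base-4 integers (Sean's trick), count 3-mers and 2-mers, and form the conditional probability as a ratio of the two estimated joints (as Verena and Kevin do). This is a minimal illustration, not any of the actual solutions; the function names are mine, and a real solution would also handle zero counts (e.g. with pseudocounts).

```python
import numpy as np

# Map A,C,G,T -> 0,1,2,3 so a sequence becomes an integer array.
DIGITS = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def digitize(seq):
    return np.array([DIGITS[c] for c in seq], dtype=int)

def count_kmers(seqs, k):
    """Count all k-mers in a list of sequences, encoding each k-mer as a
    base-4 integer 0..4**k-1 (AA..TT for k=2, AAA..TTT for k=3)."""
    counts = np.zeros(4**k)
    for seq in seqs:
        x = digitize(seq)
        for i in range(len(x) - k + 1):
            code = 0
            for j in range(k):
                code = code * 4 + x[i + j]   # base-4 encoding of the k-mer
            counts[code] += 1
    return counts

def markov_logprob(seq, counts3, counts2):
    """Log probability of a read under a Markov model whose conditionals
    P(x_i | x_{i-2}, x_{i-1}) are ratios of 3-mer and 2-mer joint
    probabilities estimated from counts. (No pseudocounts: assumes every
    needed k-mer was observed in training.)"""
    p3 = counts3 / counts3.sum()   # estimated joint P(x_{i-2}, x_{i-1}, x_i)
    p2 = counts2 / counts2.sum()   # estimated joint P(x_{i-2}, x_{i-1})
    x = digitize(seq)
    logp = 0.0
    for i in range(2, len(x)):
        ctx  = x[i-2] * 4 + x[i-1]   # base-4 code of the 2-mer context
        trip = ctx * 4 + x[i]        # base-4 code of the 3-mer
        logp += np.log(p3[trip] / p2[ctx])
    return logp
```

Scoring a read against two models (pathogen vs. mouse) is then just the difference of two such log probabilities, i.e. the log-odds score.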