the adventure of the missing phenotype
Three answers to peruse and compare for this week's problem set, as Jupyter Notebook pages for download:
Daniel: Uses pandas throughout, showing how to quickly massage the non-pandas-ish input data lines into a form that pandas can grok. Explains that the input values come in as strings, not numbers, and shows how to use a pandas dataframe to convert them to floats -- while at the same time checking that they really are numbers, and flagging where they're not. Shows how to construct boolean masks to extract a clean subset of rows.
Steffan: Uses straight Python for reservoir sampling, then switches to pandas. Like Daniel's answer, uses the pandas info method to notice that the data come in as objects (strings), not numbers, and that they need to be converted to floats. Shows a different way of tidying the data. Shows how you can use the Seaborn hue argument to highlight M vs. F data not just in a catplot, but even in a boxplot - Lestrade could've seen the bimodality in his data even in his boxplots.
Sean: My bare-bones version. Straight Python until the last possible moment, when the pset actually specified using pandas to read in the tidy data. Plus an apology for a subtle error I made in generating the synthetic data (the TPMs don't sum to \(10^6\) over all genes); and a link to the sandmouse script I used to generate the simulated data in the first place.
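For a quick feel of the string-to-float conversion and boolean masking mentioned above, here is a minimal sketch. It is not taken from any of the three notebooks: the toy dataframe, the column names (gene, tpm, tpm_num), and the "n/a" entry are all made up for illustration. It assumes the values arrive as strings and uses pd.to_numeric with errors="coerce", which turns anything it can't parse into NaN.

```python
import pandas as pd

# Hypothetical toy data: values arrive as strings, and one is not a number.
df = pd.DataFrame({
    "gene": ["geneA", "geneB", "geneC", "geneD"],
    "tpm": ["12.5", "0.0", "n/a", "7"],
})

# errors="coerce" converts unparseable entries to NaN instead of raising.
df["tpm_num"] = pd.to_numeric(df["tpm"], errors="coerce")

# Boolean mask: True wherever the conversion failed.
bad = df["tpm_num"].isna()

flagged = df[bad]    # rows to inspect by hand
clean = df[~bad]     # rows whose values really are numbers
```

One caveat with this approach: if the input already contains genuine missing values, the mask can't tell them apart from conversion failures, so it's worth looking at the flagged rows rather than silently dropping them.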
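Reservoir sampling in straight Python is short enough to sketch here. This is one standard formulation (often called Algorithm R), not necessarily what any of the notebooks does; the function name reservoir_sample and its parameters are my own for illustration. It draws k items uniformly at random from a stream in a single pass, without knowing the stream's length in advance.

```python
import random

def reservoir_sample(items, k, rng=random):
    """Return k items chosen uniformly at random from an iterable, in one pass.

    If the iterable has fewer than k items, all of them are returned.
    """
    sample = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Item i (0-based) replaces a reservoir slot with probability k/(i+1).
            j = rng.randint(0, i)   # randint is inclusive on both ends
            if j < k:
                sample[j] = item
    return sample

# Usage: sample 10 values from a stream of 1000 without storing the stream.
picked = reservoir_sample(range(1000), 10)
```

The same call works on an open file handle, since iterating over a file yields its lines one at a time.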