# the adventure of the missing phenotype

• Nathan: concise…

• Joe: …also concise…

• Sean: …and my own bare bones version; plus an apology for a (subtle?) error I made in generating the synthetic data (the TPMs don’t sum to $10^6$ over all genes); and a link to the sandmouse script I used to generate the simulated data in the first place.

• Will: An all-Pandas version!

• Shows how to use Pandas to read data in chunks, rather than slurping it all into memory at once (useful for huge data files).

• Uses Pandas to read the original data file directly, which means getting Pandas to parse the file the way we need (for example, setting the column names we want for tidy data).

• Shows how to force Pandas to read data columns as numbers (floats) even in the presence of some non-numerical contaminating values.

• Shows how to drop bad rows.
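
For a feel of what those four points look like in practice, here is a minimal sketch. It is not Will's actual code: the toy data, the column names (`gene`, `sample1`, `sample2`), the whitespace-delimited layout with `#` comment lines, and the helper `read_clean` are all assumptions made up for illustration, and the real data file may be laid out differently.

```python
import io
import pandas as pd

# Hypothetical stand-in for the data file (NOT the real format): a '#' comment
# line, no header row, whitespace-delimited, with junk in the numeric columns.
raw = """\
# made-up example data
geneA  10.5  20.1
geneB  ???   15.0
geneC  7.2   oops
geneD  3.3   4.4
geneE  1.0   2.0
"""

colnames = ["gene", "sample1", "sample2"]   # column names we set ourselves
value_cols = colnames[1:]

def read_clean(handle, chunksize=2):
    """Read in chunks, force value columns to floats, drop rows with bad values."""
    cleaned = []
    # chunksize makes read_csv return an iterator of small DataFrames
    # instead of loading the whole file into memory at once.
    reader = pd.read_csv(handle, sep=r"\s+", comment="#", header=None,
                         names=colnames, chunksize=chunksize)
    for chunk in reader:
        for col in value_cols:
            # errors="coerce" turns anything that isn't a number into NaN
            chunk[col] = pd.to_numeric(chunk[col], errors="coerce")
        # drop any row where a value column failed to parse
        cleaned.append(chunk.dropna(subset=value_cols))
    return pd.concat(cleaned, ignore_index=True)

df = read_clean(io.StringIO(raw))
print(df)
```

In this toy input, the `geneB` and `geneC` rows each contain a non-numerical value, so they are the ones dropped. For a real file you would pass a filename instead of the `io.StringIO` object and pick a much larger `chunksize`.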