Section 3: Data Exploration & Visualization

Section notes for 09/28/2019 by Mary Richardson

(You can download these notes in [Jupyter notebook .ipynb format].)

First, let's import all the modules we will need in this section. I usually import everything at the beginning of my script.

In [ ]:
import numpy as np                # So we can do useful things with arrays
import pandas as pd               # So we can store our *tidy* data
import matplotlib.pyplot as plt   # So we can plot all the things
import seaborn as sns             # So we can make even prettier plots of all the things

Pandas

Pandas (the Python Data Analysis Library) makes it easy to store data in a nice, tidy format. It plays particularly well with Jupyter notebooks, since it makes your data easy to read too! We'll focus on a Pandas data structure called a DataFrame, which is a 2D table with labeled rows and columns. We'll start by running some of the examples from the lecture notes, just to get a handle on Pandas.

Create a DataFrame

We can create a DataFrame from data stored in a list of lists, a 2D numpy array, or a dictionary.

In [ ]:
# Create a pandas dataframe from a list of lists
D   =  [[ 12.0, 16.0,  4.0, 8.0  ],
        [  7.0, 21.0, 14.0, 28.0 ],
        [  5.0, 25.0, 20.0, 10.0 ]]
df  = pd.DataFrame(D)
df
In [ ]:
# Create a pandas dataframe from a 2D numpy array
D   =  np.array(
       [[ 12.0, 16.0,  4.0, 8.0  ],
        [  7.0, 21.0, 14.0, 28.0 ],
        [  5.0, 25.0, 20.0, 10.0 ]])
df  = pd.DataFrame(D, index=['tamarind','caraway','kohlrabi'], # Set row names
                   columns=['sample1', 'sample2', 'sample3', 'sample4']) # Set column names
df
In [ ]:
# Create a pandas dataframe from a dictionary
D = { 'tamarind': [ 12.0, 16.0,  4.0, 8.0 ],
      'caraway':  [  7.0, 21.0, 14.0, 28.0],
      'kohlrabi': [  5.0, 25.0, 20.0, 10.0] } # Automatically sets column names
df = pd.DataFrame(D, index=['sample1', 'sample2', 'sample3', 'sample4']) # Set row names
df

Exercise: Can you make a DataFrame using the dictionary below?

In [ ]:
D = { 'tamarind': [ 12.0,   16.0,  4.0,  8.0 ],
      'caraway':  [  7.0,   21.0, 14.0, 28.0 ],
      'kohlrabi': [  5.0,   25.0, 20.0, 10.0 ],
      'ginger':   [ np.nan,  9.0, 16.0, 17.0 ],   # Use np.nan (not the string 'NaN') to mark missing values
      'epazote':  [ 10.0,   12.0,  3.0,  6.0 ],
      'valerian': [ 27.0,   25.0, np.nan, 19.0 ] }

Question: Which of the above tables would you consider a tidy dataset and why? (Remember that in a tidy dataset variables form columns and observations form rows.)

Read data from a file

We can also create a DataFrame directly from data in a file. When we do this, we have a few options for naming the rows and columns. When we call the pd.read_table() function, we should consider specifying the following options:

  • header=None if there is no header
  • header=n if the column names are in line n (remember we're counting the lines starting at 0)
  • names=['col1','col2'] if we need to input our own column names (for example if we set header=None)
  • index_col=n if the row names are in column n
  • comment='#' if we need to ignore all lines starting with a certain character
  • skiprows=n if we need to ignore the first n rows
In [ ]:
# Read data from a tsv file into a new pandas dataframe
df = pd.read_table('section-data.tsv', # Default assumes delimiter is tabs
                    header=3) # Set the column names using line 3 (remember we count lines from 0)
df.head() # We can use df.head() to just peek at the first few rows
In [ ]:
# Read data from a csv file into a new pandas dataframe
df = pd.read_table('section-data.csv', 
                   sep=',', # Set the delimiter to commas; this is the same as using pd.read_csv('section-data.csv')
                   comment='#', # Set the comment character to ignore comment lines
                   names=['gene_name', 'sample1', 'sample2', 'sample3', 'sample4']) # Set the column names yourself
df.head()
In [ ]:
# Read data from a column-aligned file into a new pandas dataframe
df = pd.read_table('section-data.tbl', 
                   delim_whitespace=True, # Set the delimiter to whitespace
                   index_col=0, # Set the row names using the 1st column
                   skiprows=4) # Skip all the comment lines up to the header
df.head()

Exercise: It looks like the column names in the above DataFrame aren't quite right. Can you read in the section-data.tbl file and fix the column names?

If you get this to work, also try looking through the pandas.read_table() documentation. Test out some of the other parameters we haven't tried here (I've barely scratched the surface with these examples!). There are lots of cool things you can do with this function to customize how you read in a file, and you might find some of them useful on this week's pset and beyond.

In [ ]:
# Read data from a column-aligned file into a new pandas dataframe
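
For example, here are a couple of other pandas.read_table() parameters you might experiment with. This is just a sketch: the particular column choices and the extra missing-value strings below are made up for illustration, not something the section data requires.

In [ ]:
# Try out a few more read_table options (illustrative values)
df_peek = pd.read_table('section-data.csv',
                        sep=',',             # Comma-delimited, as before
                        comment='#',         # Ignore comment lines, as before
                        names=['gene_name', 'sample1', 'sample2', 'sample3', 'sample4'],
                        usecols=['gene_name', 'sample1'],  # Only keep these two columns
                        nrows=3,             # Only read the first three data rows
                        na_values=['NA', '-'])  # Also treat these strings as missing values
df_peek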

Question: Why do we typically use column-justified data formats? What's the trick to reading a column-justified file using Pandas?

Explore the data

Now we'll explore the data. We want to consider:

  • does our data make sense?
  • what's the range of values in our data?
  • are there values that we might need to remove from our analysis? (e.g. invalid values or values that don't make sense intuitively)
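
One quick way to start answering these questions (a minimal sketch, assuming the df we just read in from section-data.tbl):

In [ ]:
# Get summary statistics (count, mean, std, min, quartiles, max) for each numeric column
df.describe()

df.info() is also handy here: it reports each column's dtype and its count of non-null values, which is a quick way to spot missing data.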

To access values in the DataFrame, we can specify rows and columns in several ways:

  • df['col_name'] gets the column as a Series
  • df.loc['row_name'] gets the row as a Series
  • df[2:5] gets the rows at positions 2 through 4 as a DataFrame
  • df.iloc[2:5] also gets the rows at positions 2 through 4 as a DataFrame, making the integer-position indexing explicit
  • df['col_name']['row_name'] gets the single value at that column and row
  • df.loc['row_name','col_name'] also gets the single value at that row and column
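
As a quick sketch of a few accessors from this list that we don't use below (assuming the df we just read in, with gene names as the row index):

In [ ]:
# A few more ways to pull out rows (only the last expression in a cell is displayed)
df.loc['tamarind']   # Get the tamarind row as a Series
df[2:5]              # Get the rows at positions 2 through 4 as a DataFrame
df.iloc[2:5]         # Same rows, with the integer-position indexing made explicit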
In [ ]:
# Find the sample1 values for all genes
df['sample1'] # Get df[col]
In [ ]:
# Find the sample1 value for the tamarind gene
df['sample1']['tamarind'] # Get df[col][row]
In [ ]:
# Equivalently, find the sample1 value for the tamarind gene
df.loc['tamarind','sample1'] # Get df.loc[row, col]
In [ ]:
# Find the row with the max value for sample1
dfmax = df['sample1'].max() # Find the max value of the sample1 column (i.e. over all rows)
df[df['sample1']==dfmax] # Find the row(s) where the sample1 value equals this max value
In [ ]:
# Find the row with the min value for sample1
dfmin = df['sample1'].min() # Find the min value of the sample1 column (i.e. over all rows)
minrow = df['sample1'].idxmin() # Find the index (row name) of the row with this min value
df.loc[minrow] # Get the row at this index

Exercise: Find the max and min value over the whole DataFrame instead of just in one column.

In [ ]:
# Find the max and min value over the whole dataframe

Question: Are there any values that might be problematic in this dataset?

Tidy the data

To clean up our DataFrame, first we'll remove rows with missing values.

In [ ]:
# Find positions with NaN
isna = df.isna()
isna

Exercise: Find all the rows that contain NaN.

In [ ]:
# Find rows that contain NaN

Exercise: Remove the rows that you found contain NaN from the DataFrame.

In [ ]:
# Remove rows that don't have data for our analysis

Now that we've cleaned up the data, it's time to make it tidy. This part is hard and I always find myself looking back at the pandas.melt() documentation to check the syntax. Let's start by looking at some of the examples in the documentation and then we'll come back to our DataFrame.
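
For example, here's a tiny made-up DataFrame (not our section data) melted from wide format to long format, along the lines of the examples in the documentation:

In [ ]:
# A toy wide-format dataframe (made-up values, just to illustrate melt)
toy = pd.DataFrame({ 'name':  ['a', 'b', 'c'],
                     'meas1': [ 1.0, 3.0, 5.0 ],
                     'meas2': [ 2.0, 4.0, 6.0 ] })
# Melt to long form: one row per (name, measurement) pair
pd.melt(toy,
        id_vars=['name'],                # Keep 'name' as the identifier column
        value_vars=['meas1', 'meas2'],   # Unpivot these columns into rows
        var_name='measurement',          # Name of the new variable column
        value_name='value')              # Name of the new value column

Once you see how id_vars and value_vars behave on a toy example, melting our expression DataFrame is the same move with different column names.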

In [ ]:
# First make gene_name a column again instead of the index, so that we can use it to melt the dataframe
df = df.reset_index() 
df

Exercise: Melt the dataframe so that you have just three columns (gene_name, sample, and expression). Hints: Keep gene_name as the ID variable, and make 'sample1', 'sample2', 'sample3', and 'sample4' the values in the new sample column. You should end up with one row per gene/sample pair.

In [ ]:
# Melt the dataframe, so that each row only has one value

Exercise (Challenge): Add the wt and mut labels that were originally in the header of the section-data.tbl file to this new tidy DataFrame. In the end, you should have an extra column that holds the wt or mut designation for each row.

In [ ]:
# Add the wt/mut labels from the original header to the tidy dataframe

Matplotlib and Seaborn

Now that our data is tidy, we can plot it pretty easily using Matplotlib and/or Seaborn, which are two super useful modules for visualizing data. We're not going to go into the details of plotting here (trust me, you can spend hours trying to make a plot pretty with these modules). But the basic idea is we want to set our x values using one column of our DataFrame and our y values using another column of our DataFrame. This is why we care so much about having tidy data before we try to plot – it makes life much easier!

Exercise: Time to practice what we learned earlier. Read the tidy-data.tsv file into a new DataFrame.

In [ ]:
# Read in the tidy data
In [ ]:
# Plot with matplotlib
fig,ax = plt.subplots()
ax.scatter(x=df['gene_name'], y=df['expression'])
ax.set_title('Gene Expression Levels')
ax.set_xlabel('Gene Name')
ax.set_ylabel('Expression (TPM)')
In [ ]:
# Plot with seaborn
fig2,ax2 = plt.subplots()
sns.swarmplot(ax=ax2, data=df, x='gene_name', y='expression')
ax2.set_title('Gene Expression Levels')
ax2.set_xlabel('Gene Name')
ax2.set_ylabel('Expression (TPM)')

Exercise: Try to color the data points based on the sample number using seaborn. (Hint: your answer should look almost the same as the cell above, but you'll need to add an argument. Check out the seaborn swarmplot documentation.)

In [ ]:
# Plot with seaborn, but color the data points by sample number