MCB112: Biological Data Analysis (Fall 2017)

Section 00: Setting up Python and Jupyter

Notes by Tim Dunn & William Mallard [9/1/2017]

The purpose of this tutorial is to give a brief introduction to Python and guide you through Python installation.

Installing Python

Python is an open-source programming language that has become exceedingly popular in science. Its advantages are numerous. Its most powerful aspects are:


The freedom of an open-source, decentralized programming platform can be overwhelming. Because there is no central proprietor managing the programming environment and essential libraries, the number of ways to set up Python (and the number of individual installation steps) is staggering. Luckily, there are organizations that provide (or, for commercial applications, sell) bundled kits of useful Python tools and libraries, making everything much easier. For this course, we will be using one of these bundled kits, the Anaconda Python distribution.

Download Anaconda here, and choose the Python 3.6 version. If you have already installed Python 2.7, you can switch over to Python 3.6 by following these instructions.
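Once the installation has finished and you can reach the Python interpreter (described in the sections that follow), a quick check of the running version can confirm that you ended up with Python 3 rather than Python 2. The snippet below is a minimal sketch that uses only the standard library; the variable names and the wording of the message are illustrative.

```python
import sys

# sys.version_info holds the version of the interpreter that is running this code.
major, minor = sys.version_info.major, sys.version_info.minor
version_label = "Python {}.{}".format(major, minor)
print(version_label)

if major < 3:
    # The course materials assume Python 3.
    print("This looks like Python 2; consider switching to the Python 3 version.")
```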

The Anaconda distribution includes:

The Python interpreter

With a traditional programming language like C, you write all of your code in a text file and feed it to a compiler, which translates your human-readable code into machine code (0s and 1s) that you can run on your computer. Every time you add code to your analysis, you have to recompile your program, which slows down your data analysis workflow substantially.

Programming languages like Python, Matlab, and R are processed by an interpreter, which also translates your code into machine language, but does so line by line rather than all at once. This interpreter structure makes working with Python highly interactive and allows you to write scripts (files that contain instructions that should be interpreted in sequence) to perform desired analyses.

How you access the Python interpreter varies from one computer to the next, depending on the type of Python shell the user has installed. A “shell” is a text-based interface that allows you to interact with an underlying system (in this case, the Python interpreter). At its most basic, the Python interpreter can be accessed using a simple command prompt in the terminal.

Once you’ve installed the Anaconda Python distribution, open Terminal (Mac: ⌘+Space → terminal) or Powershell (Windows: ⊞+R → powershell) and enter python on the command line. You’re now using the Python interpreter, and any command you enter will be processed as Python code to generate an output. Try typing a simple expression like 1 + 2 or print('hello, world') and then pressing enter:

>>> 1 + 2
3
>>> print('hello, world')
hello, world

Note that these commands follow Python syntax; entering hello, world alone, for instance, results in an error. The interpreter can only interpret commands the user gives it in the Python language.
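To make the difference concrete, the sketch below feeds two entries to Python the way the prompt would and records what happens. It uses the built-in eval only as a stand-in for typing at the >>> prompt; the names entries and outcomes are illustrative. The second entry fails because Python reads hello and world as variable names that have never been defined.

```python
# Each string stands in for a line typed at the >>> prompt.
entries = ["1 + 2", "hello, world"]

outcomes = {}
for entry in entries:
    try:
        # eval() evaluates the text as a Python expression.
        outcomes[entry] = repr(eval(entry))
    except NameError as err:
        # Raised when the expression refers to a name that does not exist.
        outcomes[entry] = "NameError: {}".format(err)

for entry, outcome in outcomes.items():
    print("{!r} -> {}".format(entry, outcome))
```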


Besides being typed directly into the interpreter, Python code can be written and saved in .py files that can be loaded in subsequent Python sessions. Writing code in .py files makes it easier to manage multiple lines of code, and allows you to save and share your work. You can edit .py files in an editor like Spyder, which comes with the Anaconda distribution.

Launch Spyder by entering spyder at the Mac or Windows command line. Alternatively, type spyder in the search bar (Mac) or start menu (Windows). The Anaconda Navigator can also be used to launch Spyder on both Windows and Mac.

Once Spyder is open, create a new .py file and type print('hello, world'). This code can now be saved in the .py file and executed on command. Save the file as helloworld.py (the name that the import helloworld step below expects), and then launch Terminal or Powershell and navigate to the folder where you just saved your new .py file.

Once in this directory, launch the Python shell by typing python (and then pressing enter) and then enter import helloworld.

>>> import helloworld
hello, world

Alternatively, you can have Python interpret the file without entering the Python shell. To do this, type python followed by the name of the .py file on the command line. This type of command can also be run from any directory, as long as you tell Python where to look for the .py file:

    python $HOME/Python/
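Importing a file and running it from the command line both execute every top-level statement in it, which is why the greeting is printed in either case. As files grow, it is common to separate "code that defines things" from "code that should run only when the file is executed directly". The following is a minimal sketch of that idiom; the function name greeting is hypothetical, and this variant behaves differently from the one-line file above in that import no longer prints anything.

```python
# A hypothetical variant of the hello-world file.

def greeting():
    """Return the text that this script prints."""
    return "hello, world"

if __name__ == "__main__":
    # This branch runs when the file is executed with python on the
    # command line, but not when the file is imported from another session.
    print(greeting())
```

After importing such a file, you would call the function yourself (module name, a dot, then greeting()) to obtain the text.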

The same code can also be executed from within Spyder. Use Run → Run and, if prompted, select Execute in current Python or IPython console in the next window. The Python shell in the bottom right-hand corner should display the output.

Note that this Python shell (IPython) looks slightly different from the basic Python shell. The IPython shell will become a core component of the way we code in this course, but we will use it in a slightly different form.

Your .py file can of course become more complex. Paste the following code into your .py file, and use Run to execute the code in Spyder/IPython.

import matplotlib.pyplot as plt

fig = plt.figure()
fig.suptitle('Hello, Python', fontsize=20, fontweight='bold')
ax = fig.add_subplot(111)
ax.text(3, 5, 'Hello, world!', style='italic',
        bbox={'facecolor':'blue', 'alpha':0.5, 'pad':10}, fontsize=30)
ax.axis([0, 10, 0, 10])

You should see the following output in Spyder’s IPython console:
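Outside Spyder, for example when the same file is run with python from a terminal, the figure is built in memory but nothing is displayed unless you ask for it. One option is to write the figure to an image file. The sketch below assumes matplotlib's non-interactive Agg backend and writes the PNG data into an in-memory buffer so that it leaves no file behind; the names buffer and png_bytes are illustrative.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: draw off-screen, so no window is needed
import matplotlib.pyplot as plt

fig = plt.figure()
fig.suptitle('Hello, Python', fontsize=20, fontweight='bold')
ax = fig.add_subplot(111)
ax.text(3, 5, 'Hello, world!', style='italic',
        bbox={'facecolor': 'blue', 'alpha': 0.5, 'pad': 10}, fontsize=30)
ax.axis([0, 10, 0, 10])

# Write the figure as PNG data. Passing a filename (for example the
# hypothetical 'hello.png') instead of the buffer would save it to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format='png')
png_bytes = buffer.getvalue()
plt.close(fig)

print("PNG size in bytes:", len(png_bytes))
```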


Jupyter Notebook and QTConsole

IPython is an enhanced Python shell that adds many features to the basic Python interpreter. These enhancements make it easier to play with data and plots, so it is used extensively in scientific computing. In 2011, IPython released a new feature called IPython Notebook, an innovative way to document code and analysis results.

Recently, IPython was rolled into a larger umbrella project called Jupyter, which expanded the notebook to work with other programming languages besides Python. Jupyter provides access to the interactive IPython shell in the form of Jupyter QTConsole, which is more or less the same as the console within Spyder, except it can be loaded independently by entering jupyter qtconsole into Terminal (Mac) or Powershell (Windows).

An effective way to program in Python is to keep Jupyter’s QTConsole window open to test snippets of code, while consolidating useful code into a .py file that can be run in its entirety later. In tandem, it is good practice to document and save your code into a Jupyter notebook, the official Python environment for this class.

You can launch a new Jupyter notebook by entering jupyter notebook into Terminal (Mac) or Powershell (Windows). This will launch the Jupyter Notebook interface in a new browser window. Use the menu to navigate and create a new notebook file.

The Jupyter Notebook acts as a Python interpreter, but saves both the code and the output (including plots & graphics) into a .ipynb file. The .ipynb file can also be annotated with text and equations to provide a cogent digest of all analyses.
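Under the hood, a notebook file is plain JSON text: a list of cells, each holding its source and, for code cells, the outputs that were produced. The sketch below builds a tiny hand-written notebook in the version 4 layout and reads it back with the standard json module. The cell contents are made up for illustration, and a notebook saved by Jupyter carries more metadata than shown here.

```python
import json

# A hand-written, minimal notebook in the nbformat 4 layout (illustrative only).
notebook_text = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["A short note about the analysis."]},
        {"cell_type": "code", "execution_count": 1, "metadata": {},
         "source": ["print('hello, world')"],
         "outputs": [{"output_type": "stream", "name": "stdout",
                      "text": ["hello, world\n"]}]},
    ],
})

# Read the notebook back and summarize each cell.
notebook = json.loads(notebook_text)
summary = []
for cell in notebook["cells"]:
    source = "".join(cell["source"])
    n_outputs = len(cell.get("outputs", []))   # markdown cells have no outputs
    summary.append((cell["cell_type"], source, n_outputs))
    print(cell["cell_type"], "|", source, "| outputs:", n_outputs)
```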

Here’s how the above example looks in a Jupyter Notebook:

import matplotlib.pyplot as plt # Allows us to use all main plotting tools

# Makes sure plots and graphs are printed directly into the notebook file
%matplotlib inline

# Creates a new figure
fig = plt.figure()

# Adds a figure title
fig.suptitle('Hello, Python', fontsize=20, fontweight='bold')

# Adds axes
ax = fig.add_subplot(111)

# Adds styled text to the plot
ax.text(2.2, 5, 'Hello, world!', style='italic',
        bbox={'facecolor':'blue', 'alpha':0.5, 'pad':10}, fontsize=30)

# Scales the axes
ax.axis([0, 10, 0, 10])

# Shows the final figure
plt.show()


Note how in the Jupyter Notebook format, Python input and output are organized into discrete, labeled cells. These cells provide an intuitive way to group and separate code into steps that can be easily digested. We will discuss Jupyter notebooks in more detail in the next tutorial.

Python Packages

Python becomes a powerful tool for data analysis when tapping into the vast reservoir of open-source Python packages. A Python package is a set of Python tools united under a common functional banner. For instance, the matplotlib package contains many of the functions we will need to plot data and manipulate graphics objects. The NumPy package defines many of the functions and data structures we will need to efficiently store and manipulate numerical data. The Pandas package provides data structures and tools for working with tabular data. These three packages, and most other packages you will need for this course, come pre-installed with the Anaconda Python distribution.

We will also use another package, Seaborn, which is not installed automatically. To install Seaborn, enter the following on the command line:

 % conda install seaborn

conda is the official Anaconda distribution package installer, and other common Python packages can be installed similarly. Sometimes a package is not available in the conda database, however. In such cases, another installer, pip, can be called at the command line and used as a backup:

 % pip install seaborn
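To see which packages your Python installation can already find, you can ask the import system directly instead of waiting for an import to fail. The following is a minimal sketch using the standard library's importlib; the helper name is_installed is hypothetical, and the result depends on which Python environment is running the code.

```python
import importlib.util

def is_installed(package_name):
    """Return True if Python can locate a top-level package with this name."""
    return importlib.util.find_spec(package_name) is not None

# Report on the packages mentioned in this section.
for name in ["numpy", "pandas", "matplotlib", "seaborn"]:
    status = "installed" if is_installed(name) else "missing"
    print("{:<12} {}".format(name, status))
```

Any package reported as missing can then be installed from the command line with conda or pip as described above.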

Look through this index of conda-approved Python packages, and spend a few minutes sifting through the official Python package index. See if anything looks appealing, and try to install it. Lightning, for instance, is a pretty fun tool for interactive data visualization.

Some other text editors

Programmers are often very particular about their text editors and programming environments. If you feel like exploring a different Python editor, here are a few with very loyal followings:

At the end of Section, we tested our Python installation by running Sean’s Python animation script from the w00 class lecture. Download the file here, and open it in Spyder.

In order for the animation to run properly, we first need to install a movie writing program called FFmpeg.

ruby -e "$(curl -fsSL"

Homebrew is a package manager that will help us install ffmpeg. Once Homebrew has been downloaded and installed, execute:

brew install ffmpeg

Once FFmpeg is installed, copy the code from the script and paste it into a new Jupyter notebook. Then, make the following modifications to the code:

# mp4file = sys.argv[1]
from IPython.display import HTML
rc('text', usetex=False)

This will give you:

#! /usr/bin/python

# One of these things is not like the others: opening for MCB112.
# Usage: ./ w00-intro.mp4
# Produces an animation with four histogram panels, showing sampling
# one at a time from normal distributions, three the same and one
# different. How many samples before you can visually decide which one
# is different?

import numpy as np
import matplotlib.pyplot as plt
import math
import sys

from matplotlib import animation
from matplotlib import rc

# mp4file   = sys.argv[1]            # output file, from cmdline.

# Some general configuration
t20_red   = (0.839, 0.153, 0.157)  # From Tableau20 palette.
np.random.seed(41)                 # reproducible output
rc('text', usetex=False)            # Don't use LaTeX for matplotlib text

# Configuration of the distributions: three are mu1/sd1, one is mu2/sd2
mu1 = 40.
sd1 = 15.

mu2 = 60.
sd2 = 15.

which = np.random.randint(4)         # 0..3 : which one is different.

# Configuration of the displayed histogram plots
t20_red     = (0.839, 0.153, 0.157)  # From Tableau20 palette.
yaxis_max   = 40.                    # y-axis max
xaxis_range = (0,120)                # x-axis range (min,max tuple)
nsamples    = 100                    # number of samples = # of frames in animation

def tufteize(a):
    """Gratuitous prettification of axes object a"""

X   = [ [] for _ in range(4)]        # X[0..3] will be the four data sets
fig = plt.figure()                   # Initialize an empty figure
ax  = []                             # ax[0..3] will be the four panels

# init()
# animation.FuncAnimation() can take an optional initialization
# function. We don't really need one, but if we don't include
# one, it calls animate() twice with i==0.
def init():
    print("({} is different)".format(which))

# animate()
# Render frame <i> of our animation, after adding one sample to each
# dataset. This function is passed to animation.FuncAnimation().
def animate(i):
    print("animating frame {}".format(i))

    # Add one new sample to each dataset.
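    for g in range(4):
        if (g == which):
            X[g].append(np.random.normal(mu2, sd2))  # assumed: the odd one out uses mu2/sd2
        else:
            X[g].append(np.random.normal(mu1, sd1))  # assumed: the other three use mu1/sd1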

    # Render new frame of the animation.
    plt.suptitle(r'\emph gene expression.   Replicates: {}'.format(i+1), fontsize=16)
    for g in range(4):
        ax[g].hist(X[g], bins=20, range=xaxis_range, color=t20_red)
        ax[g].text(120,40, r'\textbf{' + '{}'.format('ABCD'[g]) + '}', ha='right')
        ax[g].text(120,35,'mean: {0:.1f}'.format(np.mean(X[g])),       ha='right')
        ax[g].text(120,30,'  sd: {0:.1f}'.format(np.std(X[g])),        ha='right')

# We're ready to animate.
#   (appears that interval=500 triggers some sort of bug in!?)
anim = animation.FuncAnimation(fig, animate, init_func=init, frames=nsamples, interval=600,
                               repeat=False, blit=False)

# If you're in a jupyter notebook you can render inline in the browser.
# You'll need:
from IPython.display import HTML


# Save the animation as an MP4 file.
(0 is different)
animating frame 0
animating frame 1
animating frame 2
...
animating frame 99
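The idea behind the animation can also be explored without any plotting. The sketch below draws samples for four groups using only the standard library and prints the mean and standard deviation that the animation displays in each panel. It assumes, based on the configuration comments in the script, that three groups draw from mu1/sd1 and one from mu2/sd2. The seed is arbitrary, and because this uses random rather than NumPy, the numbers will not match the script's output.

```python
import random
import statistics

random.seed(41)                      # arbitrary seed, for reproducible output

mu1, sd1 = 40.0, 15.0                # shared by three of the groups
mu2, sd2 = 60.0, 15.0                # the odd one out
nsamples = 100                       # samples per group
which = random.randrange(4)          # 0..3 : which group is different

# Add one new sample to each dataset per round, as the animation does per frame.
X = [[] for _ in range(4)]
for i in range(nsamples):
    for g in range(4):
        mu, sd = (mu2, sd2) if g == which else (mu1, sd1)
        X[g].append(random.gauss(mu, sd))

means = [statistics.mean(x) for x in X]
sds = [statistics.pstdev(x) for x in X]   # population sd, like np.std's default
for g in range(4):
    print("{}  mean: {:.1f}  sd: {:.1f}".format('ABCD'[g], means[g], sds[g]))
print("({} is different)".format(which))
```

Lowering nsamples to a handful of rounds gives a feel for how few replicates leave the odd group hard to pick out from the summary numbers alone.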

Next week we’ll learn more about Python syntax and usage. In the meantime, spend some time with the tutorial.