Section 00: Setting up Python and Jupyter
Notes by Daniel Eaton and June Shin, adapted from Tim Dunn and William Mallard
The purpose of this tutorial is to give a brief introduction to Python and guide you through Python installation.
Python is an open-source programming language that has become exceedingly popular in science. Its advantages are numerous. Its most powerful aspects are:
- readability and ease-of-use. The language syntax is relatively simple and clean.
- interpreter / scripting. Because Python code does not need to be compiled into a program, it is faster to experiment with new code and datasets.
- open-source. Anyone can use Python and contribute to its development.
- strong community. The large and motivated developer community ensures that libraries and features are constantly growing.
- myriad of analysis packages. In illustration of this strong community, there are many powerful suites of code for collecting, analyzing, and visualizing data.
- graphics. Compared to other popular scientific programming environments, Python creates pretty, well-styled plots without much effort. Python also supports interactive graphics for data exploration.
- reproducibility. Because Python isn't proprietary, and because of tools like Jupyter Notebook and MyBinder, it is easy to document and re-run code to validate and learn from others' work.
The freedom of an open-source, decentralized programming platform can be overwhelming. Because there is no central proprietor managing the programming environment and essential libraries, the number of ways to set up Python (and the number of individual installation steps) is staggering. Luckily, there are organizations that provide (or, for commercial applications, sell) bundled kits of useful Python tools and libraries, making everything much easier. For this course, we will be using one of these bundled kits, the Anaconda Python distribution.
NOTE: Certain versions of the installer will prompt you to decide if you want Anaconda's Python distribution to be your system default. Unless you have a previous installation of Python, we recommend that you make this Python distribution your default. This option is very easy to miss so try not to click through the installer too quickly.
The Anaconda distribution includes:
- The Python interpreter. This runs Python code.
- Spyder. A text editor for saving Python files and scripts.
- Jupyter Notebook and QTConsole. Very useful environments for managing code and sharing code/results.
- Python packages you'll need for data analysis.
The Command Line
Command lines, such as Terminal on macOS and PowerShell on Windows, are text-based interfaces for interacting with an operating system. Command lines allow you to manipulate the computer's filesystem and run applications, in much the same way you already do with Finder or Windows File Explorer.
In order to get our feet wet with the command line, we are going to work through a quick example using some common commands. Open Terminal (Mac: ⌘+Space → terminal) or PowerShell (Windows: ⊞+R → powershell). By default, your command line will open in the directory \Users\[Username]\. To check this, enter pwd on the command line, which will print the filepath of your current directory.
Next, we will navigate to your Documents directory, list its contents, and make a new folder:
% C:\Users\ACGT\> cd Documents
% C:\Users\ACGT\Documents\> ls
d----- 1/9/2018 3:04 PM Caenorhabditis_elegans
d----- 1/14/2017 5:48 PM Escherichia_coli
d----- 2/4/2018 1:14 PM Homo_sapiens
% C:\Users\ACGT\Documents\> mkdir Mus_musculus
% C:\Users\ACGT\Documents\> ls
d----- 1/9/2018 3:04 PM Caenorhabditis_elegans
d----- 1/14/2017 5:48 PM Escherichia_coli
d----- 2/4/2018 1:14 PM Homo_sapiens
d----- 9/7/2018 2:00 PM Mus_musculus
Now we will download an image (https://tinyurl.com/yd26dede) into our new directory using the wget command:
% C:\Users\ACGT\Documents\> cd Mus_musculus
% C:\Users\ACGT\Documents\Mus_musculus> wget https://tinyurl.com/yd26dede
% C:\Users\ACGT\Documents\Mus_musculus> ls
-a---- 2/27/2016 1:53 PM 19430 yd26dede
Unfortunately, the image does not have a file extension. It is supposed to be a .jpg file. Luckily, we can use the command line to rename the file:
% C:\Users\ACGT\Documents\Mus_musculus> mv yd26dede yd26dede.jpg
% C:\Users\ACGT\Documents\Mus_musculus> ls
-a---- 2/27/2016 1:53 PM 19430 yd26dede.jpg
You can now open the image from Finder or File Explorer and see what it is. Now, we will leave the folder and delete it:
% C:\Users\ACGT\Documents\Mus_musculus> cd ..
% C:\Users\ACGT\Documents\> rmdir Mus_musculus
rmdir: failed to remove 'Mus_musculus/': Directory not empty
Oops, it looks like the rmdir command, which is used to delete directories, can only remove empty directories by default. Instead, we need to use a different command, rm, with the -r option, which will delete the directory and all files in it:
% C:\Users\ACGT\Documents\> rm -r Mus_musculus
% C:\Users\ACGT\Documents\> ls
d----- 1/9/2018 3:04 PM Caenorhabditis_elegans
d----- 1/14/2017 5:48 PM Escherichia_coli
d----- 2/4/2018 1:14 PM Homo_sapiens
Now we will return to our default or home directory:
% C:\Users\ACGT\Documents\> cd ..
% C:\Users\ACGT\> pwd
\Users\ACGT\
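Once Python is installed, the same filesystem steps can also be scripted. The following is a minimal sketch using the standard-library pathlib, shutil, and tempfile modules. It assumes a temporary directory in place of Documents (so nothing in your real folders is touched), and an empty placeholder file stands in for the downloaded image.

```python
import shutil
import tempfile
from pathlib import Path

# pwd: show the current working directory
print(Path.cwd())

# Assumption: a throwaway directory stands in for Documents
documents = Path(tempfile.mkdtemp())

# mkdir Mus_musculus
new_dir = documents / "Mus_musculus"
new_dir.mkdir()

# Stand-in for the wget download: an empty placeholder file
image = new_dir / "yd26dede"
image.touch()

# mv yd26dede yd26dede.jpg
renamed = new_dir / "yd26dede.jpg"
image.rename(renamed)

# ls
print([p.name for p in new_dir.iterdir()])

# rmdir only removes empty directories, so this attempt fails
try:
    new_dir.rmdir()
    rmdir_worked = True
except OSError:
    rmdir_worked = False

# rm -r: delete the directory and everything in it
shutil.rmtree(new_dir)

# Clean up the throwaway directory itself
shutil.rmtree(documents)
```

As on the command line, the plain rmdir call refuses to remove a directory that still contains files, while shutil.rmtree removes the directory along with its contents.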
That's it for our command line exercise. We will not be using the command line much this semester, but if you are interested in learning more there are many good introductions online (including this one). While we have the command line open, we will discuss using it to expand Python's capabilities with packages.
Python becomes a powerful tool for data analysis when tapping into the vast reservoir of open-source Python packages. A Python package is a set of Python tools united under a common functional banner. For instance, the matplotlib package contains many of the functions we will need to plot data and manipulate graphics objects. The NumPy package defines many of the functions and data structures we will need to efficiently store and manipulate numerical data. The Pandas package provides data structures and tools for working with tabular data. These three packages, and most other packages you will need for this course, come pre-installed with the Anaconda Python distribution.
We will also use another package, Seaborn, which is not installed automatically. To install Seaborn, enter the following on the command line:
% conda install seaborn
conda is the official Anaconda distribution package installer, and other common Python packages can be installed similarly. Sometimes, however, a package is not available in the conda database. In such cases, another installer, pip, can be called at the command line and used as a backup:
% pip install seaborn
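After installing, it can be useful to confirm that Python can actually find a package. Here is a small sketch using only the standard library. The helper name installed_version is made up for this example, and the list of names is simply the packages mentioned above.

```python
import importlib.util
from importlib import metadata


def installed_version(name):
    """Return a version string if `name` can be imported, else None."""
    if importlib.util.find_spec(name) is None:
        return None  # Python cannot find this package
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        # Importable, but no installed-package metadata under this name
        # (true of standard-library modules, for example)
        return "unknown"


for name in ["numpy", "pandas", "matplotlib", "seaborn"]:
    version = installed_version(name)
    print(name, "->", version if version else "not installed")
```

If seaborn is reported as not installed, the install command most likely did not run in the same Python environment that you are using.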
Look through this index of conda-approved Python packages, and spend a few minutes sifting through the official Python package index. See if anything looks appealing, and try to install it. Lightning, for instance, is a pretty fun tool for interactive data visualization.
Now we will begin to use Python and some of the other tools included with Anaconda.
The Python interpreter
With a traditional programming language like C, you write all of your code in a text file and feed it to a compiler, which translates your human-readable code into machine code (0s and 1s) that you can run on your computer. Every time you add code to your analysis, you have to recompile your program, which slows down your data analysis workflow substantially.
Programming languages like Python, Matlab, and R are processed by an interpreter (https://en.wikipedia.org/wiki/Interpreter_(computing)), which also translates your code into machine language, but does so line by line rather than all at once. This interpreter structure makes working with Python highly interactive and allows you to write scripts (files that contain instructions that should be interpreted in sequence) to perform desired analyses.
How you access the Python interpreter varies from one computer to the next, depending on the type of Python shell the user has installed. A "shell" is a text-based interface that allows you to interact with an underlying system (in this case, the Python interpreter). At its most basic, the Python interpreter can be accessed using a simple command prompt in the terminal.
Once you've installed the Anaconda Python distribution, open Terminal (Mac: ⌘+Space → terminal) or PowerShell (Windows: ⊞+R → powershell) and enter python on the command line. You're now using the Python interpreter, and any command you enter will be processed as Python code to generate an output. Try typing a simple expression like 1 + 2 or print('hello, world') and then pressing enter:
>>> 1 + 2
3
>>> print('hello, world')
hello, world
Note that these commands follow Python syntax; entering hello, world alone, for instance, results in an error. The interpreter can only interpret commands the user gives it in the Python language.
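To see what that error looks like without the interpreter stopping, the sketch below catches it. Python reads the bare text hello, world as two variable names, neither of which has been defined, so it raises a NameError. Wrapping the text in quotes turns it into a string, which is valid Python.

```python
# Without quotes, Python looks for variables named hello and world
try:
    hello, world
except NameError as err:
    message = str(err)
    print("Error:", message)

# With quotes, the same text is a string, which is valid Python
greeting = 'hello, world'
print(greeting)
```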
Jupyter notebooks can be launched by typing jupyter notebook into the command line or by using the Anaconda Navigator, and are typically edited in a web browser.
Jupyter notebooks are organized into individual cells that can be used to group and organize code, plots, text, and equations. Ultimately, a Jupyter notebook should deliver a clear, step-wise narrative that effectively communicates your code and data to yourself and others.
As programming environments go, Jupyter Notebook is fairly simple. There are only a handful of keyboard shortcuts you will need to learn to use Jupyter effectively, and there are only two different types of cells you will use: markdown and code.
If you open a new Jupyter notebook, you'll be greeted with a single input cell set to code mode. Anything typed into a code cell needs to be Python code, and Jupyter will use the Python interpreter to run the code in the cell and print its output just below the code itself. Code in a code cell can be run using the Cell menu at the top of the page, or using one of the following three keyboard shortcuts:
- CTRL + ENTER : Run code in selected cell
- SHIFT + ENTER : Run code in selected cell, move to next cell
- ALT + ENTER : Run code in selected cell, insert new cell below
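As a quick check that a code cell is working, something like the following can be typed into the first cell and run with SHIFT + ENTER. The variable names are arbitrary. In a notebook, the value of the last line of a cell is displayed automatically, in addition to anything printed.

```python
counts = [3, 5, 10]
total = sum(counts)
print("total:", total)

# The last expression in a cell is displayed as the cell's output
total / len(counts)
```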
Often, you'll want to surround code with some text as part of an explanation or narrative. Text (and other stylistic flourishes) can be added using a cell in markdown mode. A cell can be switched into a markdown cell by using the dropdown menu at the top of the page or by pressing ESC and then M. Text in markdown cells follows the markdown specification for text and page formatting. This cheatsheet will be helpful when using markdown cells. To switch back over to code mode, press ESC and then Y. Markdown cells also need to be rendered using CTRL + ENTER, but their output is displayed over the input markdown syntax. To edit a markdown cell after it has been rendered, simply double click the section.
The text below is the underlying markdown for the above paragraph:
#### Markdown Cells
Often, you'll want to surround code with some text as part of an explanation or narrative. Text (and other stylistic flourishes) can be added using a cell in **markdown** mode. A cell can be switched into a markdown cell by using the dropdown menu at the top of the page or by pressing <kbd>ESC</kbd> and then <kbd>M</kbd>. Text in markdown cells follows the [markdown](https://daringfireball.net/projects/markdown/) specification for text and page formatting. This [cheatsheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet) will be helpful when using markdown cells. To switch back over to code mode, press <kbd>ESC</kbd> and then <kbd>Y</kbd>. Markdown cells also need to be processed using <kbd>CTRL</kbd> + <kbd>ENTER</kbd>, but their output is displayed over the input markdown syntax. To edit a markdown cell after it has been processed, simply double click the section.
A few other useful keyboard shortcuts are:
- ESC and then A : Insert cell before selected cell
- ESC and then B : Insert cell after selected cell
- ESC and then D and then D : Delete selected cell
- CTRL + SHIFT + - : Split cell
See the Help menu at the top of your Jupyter notebook for more keyboard shortcuts.
Jupyter Notebook Example
Find the secret code in this file.
The secret code is three letters followed by three numbers. The letter codes are hidden in the text file in lines that look like CODE:X, where X is the letter code. The three-number code (###) is given by the number of lines in the text file. What is the code? (First, just getting the code is enough!)
Challenge: Can you get a nice printed statement that looks like the following?
The secret code is XXX###
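One possible approach is sketched below; try the exercise yourself before reading it. The sketch rests on some assumptions: each CODE:X marker carries a single letter, the markers appear in the order the letters should be read, and the text here is a made-up stand-in for the real file. The function name find_secret_code and the filename secret.txt are hypothetical.

```python
import re


def find_secret_code(text):
    """Join the letters after each CODE: marker, then append the line count."""
    lines = text.splitlines()
    letters = "".join(re.findall(r"CODE:([A-Za-z])", text))
    return f"{letters}{len(lines)}"


# Made-up stand-in for the real file: 120 filler lines plus 3 code lines
sample_lines = [f"filler line {i}" for i in range(120)]
sample_lines.insert(10, "CODE:A")
sample_lines.insert(50, "CODE:B")
sample_lines.insert(99, "CODE:C")
sample_text = "\n".join(sample_lines)

# To use the real file, read it in first (the filename here is hypothetical):
# with open("secret.txt") as f:
#     sample_text = f.read()

print(f"The secret code is {find_secret_code(sample_text)}")
```

With the stand-in text, this prints The secret code is ABC123, since the sample has 123 lines in total.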