Setting up Python and Jupyter

First we want to make sure you're set up to run Python, Jupyter Lab, and command line tools. We'll install the Anaconda data science package to get almost all of what we'll need.

Getting Set Up

Windows-specific first step

If you are on a Mac, you already have a wondrous and powerful command line environment, though you may not know it's there: the Terminal.

If you are on Windows, you are one step away from a wondrous and powerful command line environment. We recommend that you install an Ubuntu subsystem before continuing. This will allow you to work from an Ubuntu terminal, which will make your life easier since we'll use several unix tools/commands in this course.

If you are on Linux, you're chuckling a superior chuckle because you already know about the command line.

The Command Line

A command line is a text-based interface for interacting with an operating system; Terminal on OS/X and PowerShell on Windows are two examples. The command line lets you manipulate the computer's filesystem and run applications, much as you already do with Finder or Windows File Explorer. We can use it to manipulate files, run programs, and test code (among many other things).

Let's do a quick example using some common commands to get familiar with the command line. Open Terminal (Mac: ⌘+SPACE → terminal) or Ubuntu terminal (Windows: ⊞+R → ubuntu).

To print the location (path) of your current directory, enter pwd ("print working directory") on the command line. By default, you will probably open your command line into the directory /Users/[Username]:
```
pwd
```
Next, we will navigate to your Documents directory:
```
cd Documents
```
If you are on a Windows computer, you can reach a Windows directory from the Ubuntu terminal by prefixing its path with /mnt/ followed by the lowercase drive letter. For example, to navigate to D:\Documents:
```
cd /mnt/d/Documents
```
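The /mnt rule above can be written down as a small Python sketch. The function name to_ubuntu_path is made up for this illustration, and the sketch assumes the standard layout in which each Windows drive appears under /mnt/ followed by its lowercase letter:

```python
from pathlib import PurePosixPath, PureWindowsPath


def to_ubuntu_path(windows_path: str) -> str:
    """Map a Windows path such as D:\\Documents to its /mnt/... equivalent."""
    path = PureWindowsPath(windows_path)
    drive_letter = path.drive.rstrip(":").lower()  # "D:" becomes "d"
    # parts[0] is the drive root ("D:\\"); the remaining parts are folder names
    return str(PurePosixPath("/mnt", drive_letter, *path.parts[1:]))


ubuntu_path = to_ubuntu_path(r"D:\Documents")
print(ubuntu_path)
```

Here the printed path is the one used in the cd command above, so either way you end up in your Documents directory.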
Let's list its contents:
```
ls
```
You should see a list of all the files and folders in your Documents directory. Let's add a new folder and list the contents again:
```
mkdir mcb112
ls
```
Now you should see the new folder added to the list of files and folders in the Documents directory.
Let's move into this new directory and download a file there. For OS/X, you can use the curl command:
```
cd mcb112
curl -O http://mcb112.org/w00/w00-answers-SRE.ipynb
ls
```
The -O option tells curl to save the downloaded data to a file with the same original name, w00-answers-SRE.ipynb.
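For comparison, here is a rough Python sketch of the same idea: take the last component of the URL as the local filename, then fetch the file. The helper names remote_filename and download are hypothetical, and the actual download is left commented out because it needs network access:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse
from urllib.request import urlretrieve


def remote_filename(url: str) -> str:
    """Return the last path component of the URL, the name curl -O would use."""
    return PurePosixPath(urlparse(url).path).name


def download(url: str) -> str:
    """Fetch url into the current directory (needs network access)."""
    filename = remote_filename(url)
    urlretrieve(url, filename)
    return filename


url = "http://mcb112.org/w00/w00-answers-SRE.ipynb"
filename = remote_filename(url)
print(filename)
# download(url)  # uncomment to actually fetch the file
```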

For Windows with the Ubuntu subsystem, use the wget command:
```
cd mcb112
wget http://mcb112.org/w00/w00-answers-SRE.ipynb
ls
```
We can use the command line to rename ("move") the file with mv:
```
mv w00-answers-SRE.ipynb SRE-w00.ipynb
ls
```
Now we should see SRE-w00.ipynb in our list of files instead of w00-answers-SRE.ipynb.

Now, we will leave the folder and delete it:
```
cd ..
rmdir mcb112
```
```
rmdir: mcb112: Directory not empty
```

Oops, it looks like the rmdir command, which is used to delete directories, can only remove empty directories. Instead we need to use a different command, rm, with the -r option, which will delete the directory and all files in it:
```
rm -r mcb112
ls
```

(Recursive deletion is dangerous, so make sure you've typed the command correctly!)

Now we will return to our default or Home directory:
```
cd
pwd
```
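The same sequence of steps can also be replayed from Python, which may help connect the shell commands to what they do. This is only a sketch: it works inside a throwaway temporary directory, uses a placeholder file instead of the real download, and the function name command_line_tour is invented for the example.

```python
import shutil
import tempfile
from pathlib import Path


def command_line_tour(base: Path) -> list[str]:
    """Replay the mkdir / mv / rmdir / rm -r steps inside base."""
    log = []
    course = base / "mcb112"
    course.mkdir()                                   # mkdir mcb112
    notebook = course / "w00-answers-SRE.ipynb"
    notebook.write_text("{}")                        # placeholder for the download
    renamed = notebook.rename(course / "SRE-w00.ipynb")  # mv old new
    log.append(renamed.name)
    try:
        course.rmdir()                               # rmdir refuses a non-empty directory
    except OSError:
        log.append("rmdir refused")
    shutil.rmtree(course)                            # rm -r mcb112
    log.append("exists" if course.exists() else "removed")
    return log


with tempfile.TemporaryDirectory() as scratch:
    steps = command_line_tour(Path(scratch))
    print(steps)
```

As on the command line, the plain directory removal fails while the folder still holds a file, and only the recursive removal succeeds.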

That's it for our command line exercise. We will be using the command line sparingly this semester, but it is totally worth learning how to use it well. There are many good introductions online (including this one). Now let's move on to installing Python.

Installing Python

Python is an open-source programming language that has become exceedingly popular in science. Its most powerful aspects are:

- Readability: The language syntax is relatively simple and clean.
- Scripting: Because Python code does not need to be compiled into a program, it is faster to experiment with new code and datasets.
- Open-source: Anyone can use Python and contribute to its development.
- Strong community: A large and motivated developer community ensures that libraries and features are constantly growing.
- Numerous analysis packages: Because of the strong community, there are many powerful suites of code for collecting, analyzing, and visualizing data.
- Graphics: Python creates pretty, well-styled plots without much effort. Python also supports interactive graphics for data exploration.
- Reproducibility: Because Python isn't proprietary, and because of tools like Jupyter Notebook and MyBinder, it is easy to document and re-run code to validate and learn from others' work.

Anaconda

The freedom of an open-source, decentralized programming platform is powerful but can also be overwhelming. Because there is no central proprietor managing the programming environment and essential libraries, the number of ways to set up Python (and the number of individual installation steps) is staggering.

Luckily, there are organizations that provide (or, for commercial applications, sell) bundled kits of useful Python tools and libraries, making everything much easier. For this course, we will be using one of these bundled kits, the Anaconda Python distribution.

The Anaconda distribution includes several important tools, including:

- Python interpreter: This runs Python code.
- Jupyter Notebook/Lab: Jupyter provides a very useful environment for managing code and sharing code/results. This is what you will use throughout the course to write, run, and submit code.
- Python packages: Several useful data analysis packages are included. We'll talk more about these later.
- Conda: The conda package and environment manager makes it easy to install additional packages as needed and create virtual environments with the package versions you need.

Let's install Anaconda!

If you already have Anaconda installed, you should update it.
- If you already have Anaconda installed on your Mac or Ubuntu subsystem, run conda update --all to update to the latest version. If this command fails, you should reinstall Anaconda following the instructions below.
- If you already have Anaconda installed under Windows, you should install the Linux version under the Ubuntu system following the instructions below. These two Anaconda installations can peacefully coexist and shouldn't affect one another. You will use the Ubuntu installation, not the Windows installation, for this course.
If you do not have Anaconda, download it here, and choose the appropriate Python 3.12 version. Certain versions of the installer will prompt you to decide if you want Anaconda's Python distribution to be your system default. Unless you have a previous installation of Python, we recommend that you make this Python distribution your default. This option is very easy to miss so try not to click through the installer too quickly. If you have already installed Python 2.7, you can switch over to Python 3.12 by following these instructions.
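Once the installer has finished, one way to check which Python you are actually getting is to ask the interpreter itself. The sketch below is a minimal illustration (describe_interpreter is a made-up name); you could save it as a script or paste it into a Python session. With a default Anaconda install the executable path usually contains anaconda3, but that depends on where you chose to install it.

```python
import sys


def describe_interpreter() -> dict:
    """Summarize the Python interpreter that is running this code."""
    version = sys.version_info
    return {
        "version": f"{version.major}.{version.minor}.{version.micro}",
        "is_python3": version.major == 3,
        "executable": sys.executable,  # full path to the python program in use
    }


info = describe_interpreter()
print(info)
```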

Using Python

I have Python... now what? There are many ways to write and run Python code on your computer, including:

- The command line: We can run basic Python commands directly using the Python interpreter through the command line.
- Python script: If we want to save our code, we can write Python scripts in a text editor and run these scripts on the command line.
- Jupyter Notebook/Lab: We can save both code and analysis results easily in a Jupyter Notebook, which we run through a web browser.

The command line

With a traditional programming language like C, you write all of your code in a text file and feed it to a compiler, which translates your human-readable code into machine code (0s and 1s) that you can run on your computer. Every time you add code to your analysis, you have to recompile your program, which slows down your data analysis workflow substantially.

Programming languages like Python, MATLAB, and R are instead processed by an interpreter, which also translates your code into instructions the computer can run, but does so as the code executes, statement by statement, rather than in a separate compile step. This makes working with Python highly interactive and allows you to write scripts (files that contain instructions that should be interpreted in sequence) to perform desired analyses.

How you access the Python interpreter varies from one computer to the next, depending on the type of Python shell the user has installed. A shell is a text-based interface that allows you to interact with an underlying system (in this case, the Python interpreter). At its most basic, the Python interpreter can be accessed using a simple command prompt in the terminal.

So let's get coding!

Enter python on the command line:
```
python
```
You're now using the Python interpreter, and any command you enter will be processed as Python code to generate an output.
Try typing a simple expression like 1 + 2 or print('hello, world') and then press ENTER:
```
>>> 1 + 2
3
>>> print('hello, world')
hello, world
```
Try typing something that Python cannot evaluate, like hello, world without quotes:
```
>>> hello, world
```
This results in an error. (Strictly speaking it is a NameError rather than a syntax error: Python reads hello and world as variable names, neither of which has been defined.) The interpreter can only interpret commands the user gives it in the Python language. We'll learn more about Python commands and syntax in the next section, so fear not!
Exit the python interface on the command line:
```
>>> exit()
```
Alternatively you can type CTRL + D to end the Python session.
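Python reports different kinds of errors depending on whether the problem is found while reading your input or while running it. The sketch below is one way to see the difference; classify is a hypothetical helper, and it only distinguishes the two error types discussed here.

```python
def classify(source: str) -> str:
    """Report whether an expression fails to parse, fails to run, or works."""
    try:
        code = compile(source, "<input>", "eval")  # parsing step only
    except SyntaxError:
        return "SyntaxError"
    try:
        eval(code, {})  # running step, with no user-defined variables
    except NameError:
        return "NameError"
    return "ok"


examples = ["1 + 2", "hello, world", "hello world"]
results = {source: classify(source) for source in examples}
for source, outcome in results.items():
    print(f"{source!r}: {outcome}")
```

With the comma, hello, world is a valid pair of names that simply have no values yet; without the comma, the input cannot be parsed at all.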

Python script

Python scripts are text files that contain Python code and end in .py. You can write them on your computer using a text editor and then run them on the command line using the Python interpreter. Let's give it a shot.

Open a text editor (like TextEdit on Mac or Notepad on Windows) and type the following code:
```
# This is a Python script
print(1 + 2)
print('hello, world')
```
Save the file as hello.py in your Documents directory.
Open the command line and navigate to the directory where you saved the file. You can use the cd command to change directories. For example, if you saved the file in your Documents directory:
```
cd Documents
```
Run the script using the Python interpreter:
```
python hello.py
```
You should see the output of the script in the command line:
```
3
hello, world
```
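If you are curious, the "write a script, then run it with the interpreter" workflow can itself be driven from Python. This is just a sketch: it writes the same three lines to a temporary hello.py and runs it with whichever Python is executing the sketch (sys.executable), capturing what the script prints.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# The same three lines as the hello.py script above
SCRIPT = "# This is a Python script\nprint(1 + 2)\nprint('hello, world')\n"

with tempfile.TemporaryDirectory() as scratch:
    script_path = Path(scratch) / "hello.py"
    script_path.write_text(SCRIPT)
    # Equivalent to typing "python hello.py" on the command line
    completed = subprocess.run(
        [sys.executable, str(script_path)],
        capture_output=True,
        text=True,
        check=True,
    )

output_lines = completed.stdout.splitlines()
print(output_lines)
```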

Jupyter Notebook

Jupyter notebooks make it easy to document code and analysis results. A Jupyter notebook acts as a Python interpreter, but it saves both the code and the output (including plots) as a single .ipynb file.

Jupyter notebooks are organized into individual cells that can be used to group and organize code, plots, text, and equations. As programming environments go, Jupyter notebook is fairly simple. There are only a handful of options you will need to learn to use Jupyter effectively.
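Under the hood, an .ipynb file is a JSON text file containing a list of cells. The sketch below builds a deliberately tiny, hand-written notebook structure to show the idea; it is an illustration of the layout rather than a complete, validated notebook, and in practice Jupyter writes these files for you.

```python
import json

# A minimal hand-written notebook: one markdown cell and one code cell
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["## Jupyter Notebook Intro"],
        },
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,  # not run yet
            "outputs": [],
            "source": ["print(1 + 2)"],
        },
    ],
}

text = json.dumps(notebook, indent=1)   # what would be stored in the file
reloaded = json.loads(text)             # what Jupyter would read back
cell_types = [cell["cell_type"] for cell in reloaded["cells"]]
print(cell_types)
```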

There are two ways to view and edit Jupyter notebooks: Jupyter notebook and Jupyter lab. Jupyter notebook is the original interface, and Jupyter lab is a newer, more feature-rich interface. We will demonstrate Jupyter lab here, but you are welcome to use Jupyter notebook if you prefer. Both interfaces are accessed through a web browser and can open ".ipynb" files. We first need to install these two interfaces, bringing us to the next section:

Python Packages

Python becomes a powerful tool for data analysis when tapping into the vast reservoir of open-source Python packages. A Python package is a set of Python tools united under a common functional banner.

For instance, the matplotlib package contains many of the functions we will need to plot data and manipulate graphics objects. The NumPy package defines many of the functions and data structures we will need to efficiently store and manipulate numerical data. The Pandas package provides data structures and tools for working with tabular data. These three packages, and most other packages you will need for this course, come pre-installed with the Anaconda Python distribution and are located in your base environment. It is usually best practice to keep your base environment clean and install the packages you need for specific projects in separate environments. Luckily, there is a good way to handle this.

One of the benefits of Conda package management is the ability to create separate environments. Think of these as separate rooms in a house. Some basics are installed throughout the house, like lights (python, libssl, and a few other staples that come with every new conda environment). Other things, like furniture, you need to put into each room individually (matplotlib in the office, machine learning packages in the kitchen, and so on). You can create a new environment for each project you work on, and install only the packages you need for that project. This keeps each conda environment clean and prevents conflicts between packages. For the same reason, it is best practice to pick one package manager and install everything you need with it where possible. That is, if you install some things with conda, use pip only as a last resort, and vice versa. This reduces the number of dependency conflicts you may encounter, because packages installed by pip are not tracked by conda, and vice versa. It can become confusing when you end up with multiple versions of the same package installed in different ways.

Let's make a mcb112 conda environment. We will install a couple of things we will need for the course.

To make a new conda environment, called mcb112, enter the following on the command line. We are telling conda that we need matplotlib, jupyter, jupyterlab, watermark, and seaborn as packages in our mcb112 environment. Each of these packages also has its own list of dependencies, which conda will install for us (pandas and numpy included):
```
conda create --name mcb112 matplotlib jupyter jupyterlab watermark seaborn
```
Activate the new environment. Imagine I am now entering the mcb112 room in my house. I will only be able to access the "furniture" I brought into this room.
```
conda activate mcb112
```
In the future, if you need to add additional "furniture" along the way, first activate the environment, then install the package. For example, to install the tqdm package, which can add some fun progress bars to your code, run this after activating the mcb112 environment:
```
conda install tqdm
```
When you are done working in the mcb112 environment, you can deactivate it, or "close the door" to that room:
```
conda deactivate
```
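To check which "furniture" is actually in the room you are standing in, you can ask Python whether each package can be found. This is a small sketch to run while the mcb112 environment is active; missing_packages is a hypothetical helper, and the list of names assumes each package is imported under the same name it is installed with.

```python
from importlib.util import find_spec

# Packages requested for the mcb112 environment, plus two of their dependencies
COURSE_PACKAGES = ["matplotlib", "jupyterlab", "watermark", "seaborn", "numpy", "pandas"]


def missing_packages(names):
    """Return the names that cannot be imported in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(COURSE_PACKAGES)
print("missing:", missing if missing else "none")
```

If anything is listed as missing, activate the environment and install it with conda install as shown above.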

For now, let's reactivate our mcb112 environment and practice launching Jupyter lab.

First, open the command line and navigate to your chosen directory (wherever you want to store your homework files for this course). Glance back at The Command Line section above for help changing directories. Now, launch a new Jupyter lab session:
- To launch a Jupyter lab session on Mac, type jupyter lab into the command line. This will open a web browser where you can access your notebooks and all the files in your current directory.
- To launch a Jupyter lab session on Windows through the Ubuntu terminal, type jupyter lab --no-browser and copy one of the URLs that show up in the terminal into a browser.
Create a new notebook by clicking File > New > Notebook in the menu bar at the top of the window. This will open a new tab with your notebook. Select the Python 3 kernel.
Now select the input cell at the beginning of the notebook. This is a code block, denoted by In [ ]: to the left of the cell. The cell will be highlighted in blue. This is where you will type your Python code. Try typing a simple expression like 1 + 2 or print('hello, world') in this cell.
Run the cell by clicking the "play" button in the cell menu. Jupyter will use the Python interpreter to run the code and print its output just below the cell. There are also several keyboard shortcuts to run a cell:
- CTRL + ENTER : Run code in selected cell, stay in cell
- SHIFT + ENTER : Run code in selected cell, move to next cell
- ALT + ENTER : Run code in selected cell, insert new cell below
Check out the help menu at the top of your notebook for more keyboard shortcuts.
Create a new cell above the first by selecting Insert and then Insert Cell Above. There are also several keyboard shortcuts to edit and create cells:
- ESC and then A : Insert cell before selected cell
- ESC and then B : Insert cell after selected cell
- ESC and then D and then D : Delete selected cell
- ESC and then Z : Undo cell deletion
Change the cell type to a markdown cell by selecting Markdown in the code menu dropdown. Descriptive text and equations can be added to a markdown cell instead of code.

You can toggle between code and markdown cells using the following shortcuts:
- To switch to code mode, press ESC and then Y
- To switch to markdown mode, press ESC and then M
Now try typing the text below into your markdown cell, and run the cell by clicking Run or using one of the keyboard shortcuts. Notice how the formatting automatically changes when you run the cell. Text in markdown cells follows the markdown specification for text and page formatting. Markdown can look overwhelming the first time you use it, but you really only need to know how to type equations and plain text for this class. Check out this cheatsheet for basic markdown formatting.
```
## Jupyter Notebook Intro

This is an example of a markdown cell.

You can add equations and formatting to markdown cells to make them easier to read:
* you can add equations like $A = \pi r^2$
* you can put important info in **bold** or *italics*
* you can even add [hyperlinks](https://jupyter-notebook.readthedocs.io)
```
To edit a markdown cell after it has been rendered, simply double click the section. For a challenge, see if you can change the equation in this markdown cell to be the following:

$$\nu_i = \frac{\tau_i \ell_i}{\sum_j \tau_j \ell_j}$$

Jupyter markdown cells support $\LaTeX$ equations, which allow you to type all sorts of mathematical expressions and symbols like the equation above. To type an inline equation, simply surround the LaTeX expression with single $ signs. Or to type an equation on its own line, surround the LaTeX expression with double $$ signs. You can use this cheatsheet when working with your own equations, but it contains way more than you will ever need for this course. A more reasonable starting point is this table, which covers the basics for writing simple equations.

| LaTeX Expression | Equation |
|---|---|
| `{x}^{y}` | ${x}^{y}$ |
| `{x}_{y}` | ${x}_{y}$ |
| `\frac{x}{y}` | $\frac{x}{y}$ |
| `\sqrt{x}` | $\sqrt{x}$ |
| `\bar{x}` | $\bar{x}$ |
| `\hat{x}` | $\hat{x}$ |
| `\langle x \rangle` | $\langle x \rangle$ |
| `\sum_{i}{x}` | $\sum_{i}{x}$ |
| `\prod_{i}{x}` | $\prod_{i}{x}$ |

You can also type greek letters in LaTeX equations using \. For example, \nu is the greek letter $\nu$, and \tau is the greek letter $\tau$. To get a sense of what you can do with LaTeX in markdown cells, check out these more complicated examples.
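As a small worked illustration of combining these pieces, the two lines below use only expressions from the table plus one greek letter. They are just practice examples (the sample mean and the sample standard deviation), not formulas taken from the course material. Try pasting them into a markdown cell:

```latex
$$\bar{x} = \frac{1}{n} \sum_{i}{x_i}$$

$$\hat{\sigma} = \sqrt{\frac{1}{n-1} \sum_{i}{(x_i - \bar{x})^2}}$$
```

Each line is wrapped in double $$ signs, so it renders as an equation on its own line.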

Critically, we will be grading your notebooks by rerunning your notebooks on our end. There are two things you can do to help make that process smoother.
Versioning: Your last cell should contain a watermark of the package versions you used. This will help us ensure that, if we run across any errors running your code, we can test with the same versions you used. Occasionally, bugs pop up in certain package versions. Having a record (for us and yourself) of how to reproduce them is extremely helpful. You can add a watermark to your notebook by running the following code in a code cell:
```
import watermark
%load_ext watermark
%watermark -v -m -p numpy,matplotlib,seaborn,pandas,jupyterlab
```
Restart the kernel and run all cells: This will clear all the outputs from your notebook and ensure that your code runs from top to bottom without any hidden variables or errors. You can do this by selecting Kernel > Restart Kernel and Run all cells. This is a good way to check that your notebook is working as expected and that you are turning in exactly what we will be seeing on our end. Double check to make sure the numbering on your cells is 1 to n.
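For the versioning step, the watermark extension is the approach to use in your notebooks. If you ever want the same kind of record in a plain Python script, where % magics are not available, a rough standard-library sketch looks like this (package_versions is a made-up helper name, not part of watermark):

```python
import platform
from importlib.metadata import PackageNotFoundError, version


def package_versions(names):
    """Collect the Python version and the installed version of each package."""
    report = {"python": platform.python_version()}
    for name in names:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = "not installed"
    return report


report = package_versions(["numpy", "matplotlib", "seaborn", "pandas", "jupyterlab"])
for name, installed in report.items():
    print(f"{name}: {installed}")
```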
Make sure you've saved your notebook, then go back to the command line window where you started the Jupyter session and press CTRL + C. This is the standard way to exit a program from the command line and will close your Jupyter session.

Now, let's test that we can run a full-fledged notebook. Use the Jupyter lab file navigator to open the w00-answers-SRE.ipynb notebook we downloaded earlier. Try to run all cells. Notice where the code fails, and try to resolve the error (perhaps using the wget or curl methods we learned earlier).
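After a "restart and run all", the code cells in a notebook should be numbered 1 to n in order. The sketch below shows one way that check could be automated, assuming the notebook has been loaded into a Python dictionary (for a real file, json.load would do this). The function name runs_top_to_bottom and the two tiny example notebooks are invented for illustration.

```python
def runs_top_to_bottom(notebook: dict) -> bool:
    """Check that non-empty code cells are numbered 1, 2, ..., n in order."""
    counts = [
        cell.get("execution_count")
        for cell in notebook["cells"]
        if cell["cell_type"] == "code" and "".join(cell["source"]).strip()
    ]
    return counts == list(range(1, len(counts) + 1))


def code_cell(count, source):
    """Build a minimal code cell with the given execution number."""
    return {"cell_type": "code", "execution_count": count, "source": [source]}


clean = {"cells": [code_cell(1, "x = 1"), code_cell(2, "print(x)")]}
out_of_order = {"cells": [code_cell(3, "x = 1"), code_cell(1, "print(x)")]}

print(runs_top_to_bottom(clean), runs_top_to_bottom(out_of_order))
```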

That's it for our Jupyter notebook tutorial. In a single Jupyter notebook, you should have multiple code cells interspersed with markdown cells explaining your approach or results. It's good practice to split your code into multiple cells because it makes debugging easier. This is part of the power of a Jupyter notebook – you can test your code in pieces and try out small changes without rerunning an entire script. Spend some time exploring the Jupyter lab documentation if you are interested in what else you can do in your notebooks.

Coding in Python

If you're new to Python or coding as a whole, have no fear! Be sure to attend next week's Python bootcamp section. If you get stuck, come to our office hours or post a question on Ed. Our goal is to have you up and running with Python and Jupyter right away in this first week or two, and we're here to help!
