MCB112: Biological Data Analysis (Fall 2018)


Section 01: Jupyter review from last week and Python

(For Friday 9/14/18 section by Kevin, adapted from 9/9/16 lecture by Tim Dunn)

Python examples you can run if you open the notebook version of this section here.

While you can do most everything within Jupyter notebooks (which I recommend if the command line gets confusing), it is good to know how to navigate around from the command line.

Open the command line and you should see something like:

% C:\Users\Yourname>

The > means you can type command line commands (which are different from python code).

Some very basic operations are in the table below.

Command   Example             Description
ls        ls                  lists files in current directory
cd        cd \folder\         changes directory into folder
          cd ..               goes back a directory
mkdir     mkdir myFolder      makes a directory
rmdir     rmdir myFolder      removes directory (must be empty)
          rmdir -r myFolder   deletes directory and files inside
pwd       pwd                 lists path to current directory

Let's navigate to where we want to do our MCB112 homework and see what's in the folder (this may be different for everyone). Here is what I do:

C:\Users\Yourname> cd Documents
C:\Users\Yourname\Documents> cd Classes\MCB_112
C:\Users\Yourname\Documents\Classes\MCB_112\> ls

This last command lists what's in my current directory. So I need to make a new directory for this week's homework:

C:\Users\Yourname\Documents\Classes\MCB_112\> mkdir hw01
C:\Users\Yourname\Documents\Classes\MCB_112\> cd hw01
C:\Users\Yourname\Documents\Classes\MCB_112\hw01> jupyter notebook

This will launch me into a Jupyter notebook already in my folder! Of course you could navigate here through the Jupyter home screen as well.

Installing python packages!!

The command line is also where you will install Python packages. We will often use various packages (numpy, scipy, pandas, matplotlib) for analysis and plotting. While most come by default with Anaconda, if you ever get an error saying you don't have a package, you can install it at the command line using:

C:\Users\Yourname> conda install packageName
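Before installing anything, you can check from inside Python whether a package is already available. The sketch below uses importlib.util.find_spec from the standard library; the helper name is_installed is only an illustrative choice, and what it prints depends on your own installation.

```python
import importlib.util

def is_installed(package_name):
    """Return True if Python can find a top-level package with this name."""
    return importlib.util.find_spec(package_name) is not None

# Check the packages mentioned above; results depend on your installation.
for name in ['numpy', 'scipy', 'pandas', 'matplotlib']:
    if is_installed(name):
        print(name + ' is installed')
    else:
        print(name + ' is missing; try: conda install ' + name)
```

If a package shows up as missing, run the conda install command at the command line, not inside Python.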

Jupyter Notebook

Please note that this is similar to last week. Notes are here if enough people would like a review of navigating Jupyter notebooks.

Jupyter notebooks can be launched by typing jupyter notebook into the command line and are typically edited in a web browser.

Jupyter notebooks are organized into individual cells that can be used to group and organize code, plots, text, and equations. Ultimately, a jupyter notebook should deliver a clear, step-wise narrative that effectively communicates your code and data to yourself and others.

As programming environments go, jupyter notebook is fairly simple. There are only a handful of keyboard shortcuts you will need to learn to use Jupyter effectively, and there are only two different types of cells you will use: markdown and code.

Code Cells

If you open a new Jupyter notebook, you'll be greeted with a single input cell set to code mode. Anything typed into a code cell needs to be Python code, and Jupyter will use the Python interpreter to run the code in the cell and print its output just below the code itself. Code in a code cell can be run using the cell menu at the top of the page, or using one of the following three keyboard shortcuts:

* CTRL + ENTER : Run code in selected cell
* SHIFT + ENTER : Run code in selected cell, move to next cell
* ALT + ENTER : Run code in selected cell, insert new cell below

# this is a code cell

# At the In []:, we can type python code
# But this won't run because it is commented with '#'

print('hello world!')

Markdown Cells

Often, you'll want to surround code with some text as part of an explanation or narrative. Text (and other stylistic flourishes) can be added using a cell in markdown mode. A cell can be switched into a markdown cell by using the dropdown menu at the top of the page or by pressing ESC and then M. Text in markdown cells follows the markdown specification for text and page formatting. This cheatsheet will be helpful when using markdown cells. To switch back over to code mode, press ESC and then Y. Markdown cells also need to be processed using CTRL + ENTER, but their output is displayed over the input markdown syntax. To edit a markdown cell after it has been processed, simply double click the section.

This is a markdown cell

This is a subheader in a markdown cell

Markdown is a simple way to format text and make it pretty as you work.

A few other useful keyboard shortcuts are:

See the help menu at the top of your jupyter notebook for more keyboard shortcuts.

Python basics

Data types

Numbers

In python, we can represent any number and use any standard operation on them (addition, subtraction, multiplication, division)

1
1+3
7*3
1/2
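A small sketch of how the arithmetic operators behave in Python 3. Floor division (//), remainder (%), and power (**) are not listed above, but they come up often enough to be worth seeing next to ordinary division. The variable names are arbitrary.

```python
# Ordinary division always gives a float in Python 3
half = 1 / 2
print(half)          # 0.5

# Floor division drops the fractional part
whole = 7 // 2
print(whole)         # 3

# The remainder left over after division
remainder = 7 % 2
print(remainder)     # 1

# Exponentiation uses **, not ^
squared = 7 ** 2
print(squared)       # 49
```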
Boolean values and comparisons

Another variable type is the Boolean, which takes one of two values (True or False)

True
Strings

Text can also be represented in python

'Hello world!'

Variables

We can store any of these data types in a variable. To do this we use a single =, with the variable name always on the left and the value being assigned always on the right.

x = 3
y = 5
x * y
a = 'Hello '
b = 'world!'
a+b

Notice that the addition operator can be used to combine two strings.

x = 5
y = x
y

Logical operations

Python allows us to do some basic logic comparisons through a set of relational operators. The result will be a boolean.

Operator Operation
== equal
!= not equal
< less than
<= less than or equal to
> greater than
>= greater than or equal to

Note that = is used for assignment, while == is used to test for equality.

3 > 2
3 < 2
x = 5
y = 2+3
x==y
a = 'apples'
b = 'oranges'
a == b
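Comparisons can also be combined. The sketch below uses Python's and, or, and not keywords, which are not in the table above but work on the Boolean results of the relational operators. The variable names are made up for illustration.

```python
x = 5
y = 2 + 3

in_range = (x > 3) and (x < 10)   # True only if both sides are True
either = (x < 3) or (x == y)      # True if at least one side is True
different = not (x == y)          # flips True to False and vice versa

print(in_range)    # True
print(either)      # True
print(different)   # False
```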

Conversion between data types

Sometimes we want to go from a number to a string, or vice versa. In Python, this is easy with the built-in functions str(), int(), and float().

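x = 1
x + ' apple'          # TypeError: Python won't add an int and a str
x = 1
str(x) + ' apple'     # '1 apple'
x = '2'
x * 3                 # '222': multiplying a string repeats it
x = '2'
int(x)*3              # 6
x = '2.5'
float(x)*3            # 7.5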

Type conversion can be useful if we want to format data nicely in a print statement.

num_apples = 5
print('There are ' + str(num_apples) + ' apples')

Python also has a nice way to format strings and variables

num_apples = 5
print('There are {} apples'.format(num_apples))
TF = 'Kevin'
office_hours = 'Wednesday'
print('{}\'s office hours are on {}'.format(TF,office_hours))
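Two related formatting tricks, sketched under the assumption that you are running Python 3.6 or newer: f-strings, which let you write the variable name inside the braces, and format specifications, which control how a number is displayed. The value of mean_expression is made up.

```python
TF = 'Kevin'
office_hours = 'Wednesday'
mean_expression = 3.14159   # made-up number, for illustration only

# An f-string puts the variable names directly inside the braces
sentence = f"{TF}'s office hours are on {office_hours}"
print(sentence)      # Kevin's office hours are on Wednesday

# A format spec after the colon controls how a number is shown
rounded = 'mean = {:.2f}'.format(mean_expression)
print(rounded)       # mean = 3.14
```

Using double quotes around the string avoids having to escape the apostrophe with a backslash.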

floats vs. ints

In Python, we actually have two different types of numbers: ints (integers) and floats (floating point numbers, or decimals). Python will often convert between them, but in the case above (converting from a string to a number), we have to be explicit.
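A short sketch of the difference, using the built-in type() function to ask Python what kind of number it is holding:

```python
a = 3        # an int
b = 3.0      # a float

print(type(a))       # <class 'int'>
print(type(b))       # <class 'float'>

# Mixing an int and a float gives a float
total = a + b
print(total)         # 6.0

# int() truncates toward zero; it does not round
truncated = int(2.9)
print(truncated)     # 2

# A string holding a decimal must go through float() first
from_string = int(float('2.5'))
print(from_string)   # 2
```

Calling int('2.5') directly raises a ValueError, which is why the string is converted with float() first.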

Fancier data types

We can organize collections of data into structures with various useful properties. Two such examples are lists and dicts.

Lists

A list is a convenient data structure for organizing variables when there is a set order. Lists group data of any type into a structure that can be sampled via indexing, where an item or items in a list are referred to by their position in the list. For instance,

[0,1,2,3,4,5]
['apple','banana','orange']
a = [0,1,2,3,4,5]
a[2]
a = ['apple','banana','orange']
a[0]
a = ['apple','banana','orange']
a[-1]
# we can mix data types in lists, and have lists within lists!
[['apple',5], ['banana',3],['orange',7]]
Slicing

Indexing can only access a single element, but often we want a bunch of elements from a list. To do that, we use : to define a range of positions. The range includes the start position but stops just before the end position.

a = [0,1,2,3,4,5]
print(a[2:])
print(a[:3])
print(a[1:3])
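A few more slicing patterns, as a sketch. Negative positions and a third "step" number follow the same start:stop rule as above; the variable names are arbitrary.

```python
a = [0, 1, 2, 3, 4, 5]

# a[start:stop] includes start but stops just before stop
middle = a[1:3]
print(middle)        # [1, 2]

# Negative positions count from the end of the list
last_two = a[-2:]
print(last_two)      # [4, 5]

# A third number sets the step size
every_other = a[::2]
print(every_other)   # [0, 2, 4]

# A step of -1 walks backwards, giving a reversed copy
backwards = a[::-1]
print(backwards)     # [5, 4, 3, 2, 1, 0]
```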
Dicts

A dict, or dictionary, is a special data structure that allows you to associate values with specific keys (usually some name). The data structure takes its name from a dictionary, because a word dictionary is organized similarly: take a word, look it up in the dictionary, see some associated description. In a Python dict, you take a word (the key), look it up in the dict, and see the number, string, or list associated with that word (the value).

{'apple': 'red','banana': 'yellow','orange': 'orange'}
{'apple': 5,'banana': 3,'orange': 7}
a = {}
a['apple'] = 5
a['banana'] = 3
a['orange'] = 7
print(a)
{'apple': 5, 'banana': 3, 'orange': 7}
{'apple': 6, 'banana': 3, 'orange': 7}
a = {'apple': 5,'banana': 3,'orange': 7}
a['banana']
'there are ' + str( a['apple']  ) + ' apples'
# this is just a nicer way to print things
'there are {} apples'.format(a['apple'])
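Looking up a key that is not in the dict raises a KeyError. The sketch below shows two common ways to avoid that, plus how to overwrite a value that is already there.

```python
a = {'apple': 5, 'banana': 3, 'orange': 7}

# Test for a key before using it
has_carrot = 'carrot' in a
print(has_carrot)            # False

# .get() returns a fallback value instead of raising KeyError
num_carrots = a.get('carrot', 0)
print(num_carrots)           # 0

# Assigning to an existing key overwrites its value
a['apple'] = a['apple'] + 1
print(a['apple'])            # 6
```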
Sets

Another data type for storing multiple pieces of information is the set. Sets are unordered, so items cannot be looked up by position, and a set cannot contain duplicate values. The output may look sorted, as in the examples below, but the order is not guaranteed.

{1,4,3,4}
{1, 3, 4}
{'hello',1,5,'apple'}
{1, 5, 'apple', 'hello'}
a = [1,4,3,4]
set(a)
{1, 3, 4}
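A typical use of a set is removing duplicates from a list. In this sketch the tissue labels are made up; sorted() is used at the end because a set on its own has no reliable order.

```python
samples = ['liver', 'brain', 'liver', 'heart', 'brain']   # made-up labels

unique = set(samples)          # duplicates collapse to one copy each
print(len(unique))             # 3

# Sets have no positions, so unique[0] would raise a TypeError.
# sorted() turns the set into a list with a predictable order.
ordered = sorted(unique)
print(ordered)                 # ['brain', 'heart', 'liver']
```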
Built in list functions

Python has a ton of useful built in functions for operating on lists!

Here is a useful table of some list functions

operation result
x in s True if an item of s is equal to x, else False
x not in s False if an item of s is equal to x, else True
s[i:j:k] slice of s from i to j with step k
len(s) length of s
min(s) smallest item of s
max(s) largest item of s
s.index(x) index of the first occurrence of x in s
s.count(x) total number of occurrences of x in s
sorted(s) for a list of numbers, returns a sorted list in ascending order
s.append(x) adds item x to list s

For dicts we have a couple extra.

operation result
s.clear() removes all items from the dict
s.values() returns a view of all the values in the dict
s.keys() returns a view of all the keys in the dict
s.items() returns a view of (key, value) tuples, one for each item in the dict
s.update(t) adds the key/value pairs of dict t to dict s, overwriting the values of keys already in s

There are special functions for sets as well

operation result
| Takes the union of two sets (returns all items)
& Takes the intersection of two sets (returns only items in both sets)
- Takes the difference between two sets (returns items in the first set that are not in the second)
s = [3,5,7,1,2]
5 in s
max(s)
sorted(s)
s.append(9)
print(s)
s = ['apple','banana','orange']
'carrot' in s
s.index('banana')
a = {'apple': 5,'banana': 3,'orange': 7}
a.keys()
dict_keys(['apple', 'banana', 'orange'])
a.values()
dict_values([5, 3, 7])
a.items()
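The remaining list operations from the table, sketched on a list that contains a repeated value:

```python
s = [3, 5, 7, 1, 2, 5]

print(len(s))          # 6
print(min(s))          # 1
print(s.count(5))      # 2
print(s.index(5))      # 1, the position of the first 5
print(s[0:6:2])        # [3, 7, 2], every second item

# sorted() returns a new list and leaves s untouched
in_order = sorted(s)
print(in_order)        # [1, 2, 3, 5, 5, 7]
print(s)               # [3, 5, 7, 1, 2, 5]
```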

For lists of lists with multiple items, we can specify which of the items within the lists to sort by. This is more advanced, but feel free to copy the syntax and use it as you like.

a = [['banana',3],['orange',7],['apple',5]]
print(a)
print(sorted(a))                       # default sort by item in index 0
print(sorted(a, key=lambda kv: kv[0])) # sort by item in index 0
print(sorted(a, key=lambda kv: kv[1])) # sort by item in index 1
[['banana', 3], ['orange', 7], ['apple', 5]]
[['apple', 5], ['banana', 3], ['orange', 7]]
[['apple', 5], ['banana', 3], ['orange', 7]]
[['banana', 3], ['apple', 5], ['orange', 7]]
# sorting a dictionary
a = {'orange': 7, 'banana': 3,'apple': 5}
print(a)
print(sorted(a.items()))                       # default sort by key
print(sorted(a.items(), key=lambda kv: kv[0])) # sort by keys
print(sorted(a.items(), key=lambda kv: kv[1])) # sort by values

{'orange': 7, 'banana': 3, 'apple': 5}
[('apple', 5), ('banana', 3), ('orange', 7)]
[('apple', 5), ('banana', 3), ('orange', 7)]
[('banana', 3), ('apple', 5), ('orange', 7)]
a = {5,3,6,2}
b = {2,1,7,5}
print(a & b)
print(a | b)
print(a - b)
{2, 5}
{1, 2, 3, 5, 6, 7}
{3, 6}
a = {'apple','banana','orange'}
b = {'orange','carrot','pumpkin'}
print(a & b)
print(a - b)
print('carrot' in b)
print('carrot' in a)
{'orange'}
{'banana', 'apple'}
True
False

Control flow

So far we've treated scripts as a list of Python commands that run sequentially -- every line runs, one after another, from top to bottom. But sometimes we need to skip chunks of code under certain conditions, or run different chunks of code depending on some condition. And sometimes we want to run the same chunk of code multiple times.

Conditionals

We use if statements to control the path our data follows through our code.

x = 5
if x > 3:
    print('x is greater than 3')

Control flow has two other commands: else and elif (else if), which both follow the if statement. An if statement can have as many elif statements as you want.

Each if/elif statement is evaluated in sequence, and only the code for the first condition that evaluates to True is run -- and then Python jumps to the end of the if statement, and continues through your script line by line as usual.

We can also use else to specify what to do if none of the if or elif statements evaluate to True. An if statement can only have one else statement, and it must come last.

x = 4
y = 5
if x > y:
    print('x is greater than y')
elif x < y:
    print('y is greater than x')
else:
    print('x and y are equal')
    
a = {'apple': 5,'banana': 3,'orange': 7}
fruit = 'carrot'
if fruit in a.keys():
    print('there are ' + str(a[fruit]) + ' ' + fruit + 's')
    print('there are {} {}s'.format(a[fruit],fruit))
else:
    print('there aren\'t any ' + fruit + 's')
    print('there aren\'t any {}s'.format(fruit))
there aren't any carrots
there aren't any carrots
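Because only the first condition that evaluates to True is run, the order of the if and elif tests matters. A minimal sketch, with a made-up count and arbitrary cutoffs:

```python
count = 500   # made-up value, for illustration only

# Wrong order: 500 > 10 is True, so the first branch wins
if count > 10:
    loose_label = 'medium'
elif count > 100:
    loose_label = 'high'    # never reached for any count
else:
    loose_label = 'low'
print(loose_label)          # medium

# Better order: test the strictest condition first
if count > 100:
    label = 'high'
elif count > 10:
    label = 'medium'
else:
    label = 'low'
print(label)                # high
```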
Iteration

Loops allow you to perform a task repeatedly with iteration, meaning after each task repetition, something changes. In a Python for loop, the loop addresses each item in a group, in sequence, until the last item in the group. For instance:

s = [0,1,2,3,4]
for item in s:
    print(item)
allnames = ['Kate','Daniel','Kevin','June','Michael','Steffan']
for name in allnames:
    print(name + ' is cool')
a = {'apple': 5,'banana': 3,'orange': 7}
for key in a.keys():
    #print('There are ' + str(a[key]) + ' ' + key + 's')
    print('There are {} {}s'.format(a[key],key))
There are 5 apples
There are 3 bananas
There are 7 oranges
a = {'apple': 5,'banana': 3,'orange': 7}
b = {'orange': 7,'carrot': 4,'pumpkin': 1}

a.update(b)
print(a)
{'apple': 5, 'banana': 3, 'orange': 7, 'carrot': 4, 'pumpkin': 1}
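Loops and dicts combine naturally when you need to count things. The sketch below counts the letters in a made-up DNA string; the sequence and variable names are illustrative only.

```python
seq = 'ATGCGATACG'   # made-up sequence, for illustration only

counts = {}
for base in seq:               # a string can be looped over letter by letter
    if base in counts:
        counts[base] = counts[base] + 1
    else:
        counts[base] = 1

for base in sorted(counts):    # sorted() gives the keys in alphabetical order
    print('{} appears {} times'.format(base, counts[base]))
```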
Note for control flow

Different programming languages use different conventions to define control flow. Some use end statements, some keep blocks of code in { } brackets. In Python, we use : to indicate the start of a controlled block of code, and indentation (by convention, four spaces) to mark which lines belong to it. For example, compare where x = x + i sits in the two loops below:

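s = [1,2,3,4,5]
x = 0
for i in s:
    print(i)
    x = x + i   # indented: runs on every pass through the loop
print(x)        # prints 15

s = [1,2,3,4,5]
x = 0
for i in s:
    print(i)
x = x + i       # not indented: runs once, after the loop, when i is 5
print(x)        # prints 5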

Reading and writing text files (putting it all together!)

For your homework this week, you'll need to read data from a text file and do some basic processing steps. Let's practice reading data from a text file here.

L = ['my first line',
     'my second line',
     'my third line']

file = 'my_file_name'

with open(file, 'w') as fd:
    for line in L:
        print(line, file=fd)
L = []

with open(file) as fd:     # the with block closes the file for us
    for line in fd:
        L.append(line)     # each line keeps its trailing '\n'

print(L)
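Lines read from a file still end with a newline character, which usually needs to be removed before further processing. A minimal sketch using the string method .strip(); it writes to the system's temporary folder (an assumption made here only to avoid cluttering your homework folder).

```python
import os
import tempfile

# Write the example file into the system's temporary folder
file = os.path.join(tempfile.gettempdir(), 'my_file_name')

with open(file, 'w') as fd:
    for line in ['my first line', 'my second line', 'my third line']:
        print(line, file=fd)

raw = []
clean = []
with open(file) as fd:
    for line in fd:
        raw.append(line)             # still ends in '\n'
        clean.append(line.strip())   # leading/trailing whitespace removed

print(raw)
print(clean)
```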

Let's try reading the TFs' office hours and formatting them into a dict.

For simplicity, I'm creating the file from Python to avoid downloading anything. For your homework, you will need to download a text file from the webpage.

L = ['Sean Thu 2-3, Biolabs 1008',
'Kate Thu 10:30-11:30, Biolabs 1033',
'Daniel Mon 12:15-1:15, Biolabs 2062/2064',
'Kevin Weds 3-4, Northwest 209',
'June Fri 10:30-11:30, Biolabs 2062/2064',
'Michael Tue 1:30-3:30, Brower Room, Adams House',
'Steffan Sun 2-3pm, Brower Room, Adams House']

file = 'TF_office_hours.txt'

with open(file, 'w') as fd:
    for line in L:
        print(line, file=fd)

Now we will read the data and format it into a dict. Let's use the TF's name as the key, and the string of the day as the value. That way we can easily look up the office hour day from a TF's name.

We need to split each text line by whitespace using a string function: .split(). From there, we know that in the text file the name is always first, then the day.

office_hours = {}
file = 'TF_office_hours.txt'

with open(file, 'rt') as fd:
    for line in fd:
        fields = line.split() # this splits the text up by spaces.
        office_hours[fields[0]] = fields[1]
        
# When are Kevin's office hours?
office_hours['Kevin']
'Weds'
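To see what .split() actually returns, here is a sketch on a single line from the list above. The maxsplit argument is an optional extra that limits the number of cuts, which is handy when the rest of the line should stay in one piece.

```python
line = 'Michael Tue 1:30-3:30, Brower Room, Adams House\n'

# With no arguments, split() cuts on any run of whitespace
fields = line.split()
print(fields)
# ['Michael', 'Tue', '1:30-3:30,', 'Brower', 'Room,', 'Adams', 'House']

# maxsplit=2 stops after two cuts, so the time and place stay together
parts = line.split(maxsplit=2)
print(parts[0])           # Michael
print(parts[1])           # Tue
print(parts[2].strip())   # 1:30-3:30, Brower Room, Adams House
```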

This time, let's read the text file again, but now we want to build our dictionary using the day as the key. That way, we can look up who to go to on any given day.

office_hours = {}
file = 'TF_office_hours.txt'

with open(file, 'rt') as fd:
    for line in fd:
        fields = line.split()
        if fields[1] in office_hours.keys():
            office_hours[fields[1]].append(fields[0])
        else:
            office_hours[fields[1]] = [fields[0]]

# Whose office hours can I go to on Saturday? (Try day = 'Thu' as well.)
day = 'Sat'
if day in office_hours.keys():
    print(office_hours[day])
else:
    print('There are no office hours today!')
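The if/else used above to start a new list for each new day can be shortened with the dict method .setdefault(). This is an alternative sketch, not a replacement for the version above; it works on a short in-memory list of lines so that it runs without the text file.

```python
lines = ['Sean Thu 2-3, Biolabs 1008',
         'Kate Thu 10:30-11:30, Biolabs 1033',
         'Kevin Weds 3-4, Northwest 209']

by_day = {}
for line in lines:
    fields = line.split()
    name = fields[0]
    day = fields[1]
    # setdefault returns the list stored under day,
    # creating an empty list first if the key is new
    by_day.setdefault(day, []).append(name)

print(by_day)   # {'Thu': ['Sean', 'Kate'], 'Weds': ['Kevin']}
```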

An example to work on from last week

With all this information, you can now solve the puzzle from last week!

Find the secret code in this file.

The secret code is three letters followed by three numbers. Three letter codes are hidden in the text file in lines that look like CODE:X where the X is the letter code. The three number code (###) is given by the number of lines in the text file. What is the code? (First, just getting the code is enough!)

Challenge: Can you get a nice printed statement that looks like the following?

The secret code is XXX###