Section 3: The fun world of Pandas, Matplotlib, Seaborn

Section notes by June Shin for Friday, September 28, 2018

(You can download these notes as a Jupyter notebook.)

First, we need to start by importing all the modules we'll be using for the rest of the section. I like to set up the entire notebook with the modules I'll be using at the top, so everything I'll end up needing is imported at the very beginning.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Let's get started with pandas: a handy data analysis library that's especially good at storing data in structures that are tidy, easy to read, and easy to manipulate.

Pandas can build a series in many different ways!

  • Convert a list into a series (Notice the "dtype: object" at the bottom: this is because the elements in the list were strings.)
In [2]:
data = ['apple','banana','cherry','date','elderberry']
s = pd.Series(data)
s
Out[2]:
0         apple
1        banana
2        cherry
3          date
4    elderberry
dtype: object
  • Convert a numpy array into a series
In [3]:
data = np.array(['apple','banana','cherry','date','elderberry'])
s = pd.Series(data)
s
Out[3]:
0         apple
1        banana
2        cherry
3          date
4    elderberry
dtype: object
  • Set the index of the series
In [4]:
data = np.array(['apple','banana','cherry','date','elderberry'])
s = pd.Series(data, index=['a','b','c','d','e'])
s
Out[4]:
a         apple
b        banana
c        cherry
d          date
e    elderberry
dtype: object
  • Convert a dictionary to a series (Notice here that the index has already been set by the dictionary)
In [5]:
data = {'a': 'apple', 'b': 'banana', 'c': 'cherry', 'd': 'date', 'e':'elderberry'}
s = pd.Series(data)
s
Out[5]:
a         apple
b        banana
c        cherry
d          date
e    elderberry
dtype: object
  • Different values in the series will change the dtype. Now it's an int!
In [6]:
data = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
s = pd.Series(data)
s
Out[6]:
a    1
b    2
c    3
d    4
e    5
dtype: int64
  • Set the data type while defining the series
In [7]:
data = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
s = pd.Series(data, dtype=float)
s
Out[7]:
a    1.0
b    2.0
c    3.0
d    4.0
e    5.0
dtype: float64
  • Swapping the keys and values of the dictionary will swap the index and values of the series
In [8]:
data = {1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e'}
s = pd.Series(data)
s
Out[8]:
1    a
2    b
3    c
4    d
5    e
dtype: object
  • If an index label is given to the series but has no corresponding data, pandas will fill that entry with a NaN value
In [9]:
data = {'a': 'apple', 'b': 'banana', 'c': 'cherry', 'd': 'date', 'e':'elderberry'}
s = pd.Series(data, index=['a','e','b', 'c', 'd', 'f'])
s
Out[9]:
a         apple
e    elderberry
b        banana
c        cherry
d          date
f           NaN
dtype: object

Let's say I want the word apple from this series: how can I index into the pandas series?

  • I know it's the first element of the series so I can just use an integer index
In [10]:
s[0]
Out[10]:
'apple'
  • Or I can use the associated index of the series
In [11]:
s['a']
Out[11]:
'apple'
  • I can use multiple integer positions or multiple index labels at once
In [12]:
s[[0,1]]
Out[12]:
a         apple
e    elderberry
dtype: object
In [13]:
s[['a', 'b']]
Out[13]:
a     apple
b    banana
dtype: object
  • What happens if an integer index is out of range of the series? An IndexError
In [14]:
s[6]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/Users/jung-eunshin/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
   3117             return self._engine.get_value(s, k,
-> 3118                                           tz=getattr(series.dtype, 'tz', None))
   3119         except KeyError as e1:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 6

During handling of the above exception, another exception occurred:

IndexError                                Traceback (most recent call last)
<ipython-input-14-6e68df1be101> in <module>()
----> 1 s[6]

/Users/jung-eunshin/anaconda/lib/python3.6/site-packages/pandas/core/series.py in __getitem__(self, key)
    765         key = com._apply_if_callable(key, self)
    766         try:
--> 767             result = self.index.get_value(self, key)
    768 
    769             if not is_scalar(result):

/Users/jung-eunshin/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
   3122 
   3123             try:
-> 3124                 return libindex.get_value_box(s, key)
   3125             except IndexError:
   3126                 raise

pandas/_libs/index.pyx in pandas._libs.index.get_value_box()

pandas/_libs/index.pyx in pandas._libs.index.get_value_box()

IndexError: index out of bounds
  • What happens if an index label doesn't exist in the series? A KeyError
In [15]:
s['g']
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/Users/jung-eunshin/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
   3123             try:
-> 3124                 return libindex.get_value_box(s, key)
   3125             except IndexError:

pandas/_libs/index.pyx in pandas._libs.index.get_value_box()

pandas/_libs/index.pyx in pandas._libs.index.get_value_box()

TypeError: 'str' object cannot be interpreted as an integer

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-15-a0e0eca42368> in <module>()
----> 1 s['g']

/Users/jung-eunshin/anaconda/lib/python3.6/site-packages/pandas/core/series.py in __getitem__(self, key)
    765         key = com._apply_if_callable(key, self)
    766         try:
--> 767             result = self.index.get_value(self, key)
    768 
    769             if not is_scalar(result):

/Users/jung-eunshin/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
   3130                     raise InvalidIndexError(key)
   3131                 else:
-> 3132                     raise e1
   3133             except Exception:  # pragma: no cover
   3134                 raise e1

/Users/jung-eunshin/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
   3116         try:
   3117             return self._engine.get_value(s, k,
-> 3118                                           tz=getattr(series.dtype, 'tz', None))
   3119         except KeyError as e1:
   3120             if len(self) > 0 and self.inferred_type in ['integer', 'boolean']:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'g'

Now onto Pandas DataFrames!

  • A simple list can be put into a dataframe
In [16]:
data = ['apple','banana','cherry','date','elderberry']
df = pd.DataFrame(data)
df
Out[16]:
0
0 apple
1 banana
2 cherry
3 date
4 elderberry
  • Name the column
In [17]:
data = ['apple','banana','cherry','date','elderberry']
df = pd.DataFrame(data, columns=['fruit'])
df
Out[17]:
fruit
0 apple
1 banana
2 cherry
3 date
4 elderberry
  • Set the index
In [18]:
data = ['apple','banana','cherry','date','elderberry']
df = pd.DataFrame(data, columns=['fruit'], index=['a','b','c','d','e'])
df
Out[18]:
fruit
a apple
b banana
c cherry
d date
e elderberry
  • A list of lists will give you a multi-column dataframe
In [19]:
data = [['apple',1],['banana',2],['cherry',3],['date',4],['elderberry',5]]
df = pd.DataFrame(data,columns=['fruit','letters'])
df
Out[19]:
fruit letters
0 apple 1
1 banana 2
2 cherry 3
3 date 4
4 elderberry 5
  • Setting the dtype will set the data type when possible
In [20]:
data = [['apple',1],['banana',2],['cherry',3],['date',4],['elderberry',5]]
df = pd.DataFrame(data,columns=['fruit','number'], dtype=float)
df
Out[20]:
fruit number
0 apple 1.0
1 banana 2.0
2 cherry 3.0
3 date 4.0
4 elderberry 5.0
  • A dictionary will generate a dataframe where the keys will become the column names. (Note here that the order of the values for each key in the dictionary matters!)
In [21]:
data = {'fruit': ['apple','banana','cherry','date','elderberry'], 'number':[1,2,3,4,5]}
df = pd.DataFrame(data)
df
Out[21]:
fruit number
0 apple 1
1 banana 2
2 cherry 3
3 date 4
4 elderberry 5
  • Just like before, we can set the index. And if we assign a column name that is not in the dictionary, then pandas will fill that column with NaN's!
In [22]:
data = {'fruit': ['apple','banana','cherry','date','elderberry'], 'number':[1,2,3,4,5]}
df = pd.DataFrame(data, index=['a','b','c','d','e'], columns = ['fruit', 'number', 'letters'])
df
Out[22]:
fruit number letters
a apple 1 NaN
b banana 2 NaN
c cherry 3 NaN
d date 4 NaN
e elderberry 5 NaN

Ways to get columns, rows, and individual elements of a pandas dataframe:

  • Simply specify a column in brackets
In [23]:
df['fruit']
Out[23]:
a         apple
b        banana
c        cherry
d          date
e    elderberry
Name: fruit, dtype: object
In [24]:
df['letters']
Out[24]:
a    NaN
b    NaN
c    NaN
d    NaN
e    NaN
Name: letters, dtype: object
  • Doesn't work for rows!
In [25]:
df['a']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/Users/jung-eunshin/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3077             try:
-> 3078                 return self._engine.get_loc(key)
   3079             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'a'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-25-906acb1d319e> in <module>()
----> 1 df['a']

/Users/jung-eunshin/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2686             return self._getitem_multilevel(key)
   2687         else:
-> 2688             return self._getitem_column(key)
   2689 
   2690     def _getitem_column(self, key):

/Users/jung-eunshin/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in _getitem_column(self, key)
   2693         # get column
   2694         if self.columns.is_unique:
-> 2695             return self._get_item_cache(key)
   2696 
   2697         # duplicate columns & possible reduce dimensionality

/Users/jung-eunshin/anaconda/lib/python3.6/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
   2487         res = cache.get(item)
   2488         if res is None:
-> 2489             values = self._data.get(item)
   2490             res = self._box_item_values(item, values)
   2491             cache[item] = res

/Users/jung-eunshin/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in get(self, item, fastpath)
   4113 
   4114             if not isna(item):
-> 4115                 loc = self.items.get_loc(item)
   4116             else:
   4117                 indexer = np.arange(len(self.items))[isna(self.items)]

/Users/jung-eunshin/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3078                 return self._engine.get_loc(key)
   3079             except KeyError:
-> 3080                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   3081 
   3082         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'a'
  • Use .loc for row lookups and more complicated lookups
In [26]:
df.loc['a']
Out[26]:
fruit      apple
number         1
letters      NaN
Name: a, dtype: object
  • For .loc and .iloc (below), the row comes first and then the column inside the brackets
In [27]:
df.loc['a', 'fruit']
Out[27]:
'apple'
  • Specify multiple rows for a single column
In [28]:
df.loc[['a','b'], 'fruit']
Out[28]:
a     apple
b    banana
Name: fruit, dtype: object
  • Specify all rows for multiple columns
In [29]:
df.loc[:, ['fruit', 'number']]
Out[29]:
fruit number
a apple 1
b banana 2
c cherry 3
d date 4
e elderberry 5
  • The key difference between .loc and .iloc is that .loc uses label indexing and .iloc uses integer indexing
In [30]:
df.iloc[0]
Out[30]:
fruit      apple
number         1
letters      NaN
Name: a, dtype: object
  • Row first, then column
In [31]:
df.iloc[0,0]
Out[31]:
'apple'
  • Multiple rows, first column
In [32]:
df.iloc[0:2,0]
Out[32]:
a     apple
b    banana
Name: fruit, dtype: object
  • All rows, first two columns
In [33]:
df.iloc[:,0:2]
Out[33]:
fruit number
a apple 1
b banana 2
c cherry 3
d date 4
e elderberry 5
  • Combine .iloc and .loc if you need both label and integer indexing
In [34]:
df.iloc[0].loc['fruit']
Out[34]:
'apple'
In [35]:
df.iloc[0:2].loc[:,'fruit']
Out[35]:
a     apple
b    banana
Name: fruit, dtype: object
In [36]:
df.iloc[:,0].loc[['a','b']]
Out[36]:
a     apple
b    banana
Name: fruit, dtype: object

Make sure you understand how the indexing works because it'll come in handy in playing with pandas dataframes!
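
A quick sketch of why this matters (not one of the cells above, just an illustration): when the index labels are themselves integers, label-based and position-based lookups can point to different things.

# A series whose labels are integers (hypothetical example)
s_int = pd.Series(['apple', 'banana', 'cherry'], index=[10, 20, 30])
s_int.loc[10]    # label lookup      -> 'apple'
s_int.iloc[0]    # positional lookup -> 'apple'
# s_int[0] would raise a KeyError here: with an integer index there is no positional fallback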

Manipulating DataFrames and using pandas stats operations

  • Pandas can treat each element of a series (a single column of the DataFrame) as a string via the .str accessor, and then apply basic Python string operations such as len() and count()
  • We can add values to a column of the DataFrame by assigning the column to a series (or even list) of elements
In [37]:
df['letters'] = df['fruit'].str.len()
df
Out[37]:
fruit number letters
a apple 1 5
b banana 2 6
c cherry 3 6
d date 4 4
e elderberry 5 10
  • We can add a new column to the dataframe in the same manner
In [38]:
df['count_a'] = df['fruit'].str.count('a')
df
Out[38]:
fruit number letters count_a
a apple 1 5 1
b banana 2 6 3
c cherry 3 6 0
d date 4 4 1
e elderberry 5 10 0
In [39]:
df['count_e'] = df['fruit'].str.count('e')
df
Out[39]:
fruit number letters count_a count_e
a apple 1 5 1 1
b banana 2 6 3 0
c cherry 3 6 0 1
d date 4 4 1 1
e elderberry 5 10 0 3
  • Pandas can efficiently do column operations
In [40]:
df['count_a+e'] = df['count_a'] + df['count_e']
df
Out[40]:
fruit number letters count_a count_e count_a+e
a apple 1 5 1 1 2
b banana 2 6 3 0 3
c cherry 3 6 0 1 1
d date 4 4 1 1 2
e elderberry 5 10 0 3 3
  • DataFrames in pandas come with quick, simple stats operations we can use (note that they look a lot like their numpy counterparts--pandas is good at taking in numpy arrays)
In [41]:
df.mean()
Out[41]:
number       3.0
letters      6.2
count_a      1.0
count_e      1.2
count_a+e    2.2
dtype: float64
In [42]:
df.std()
Out[42]:
number       1.581139
letters      2.280351
count_a      1.224745
count_e      1.095445
count_a+e    0.836660
dtype: float64
In [43]:
df.max()
Out[43]:
fruit        elderberry
number                5
letters              10
count_a               3
count_e               3
count_a+e             3
dtype: object
In [44]:
df.min()
Out[44]:
fruit        apple
number           1
letters          4
count_a          0
count_e          0
count_a+e        1
dtype: object
In [45]:
df.median()
Out[45]:
number       3.0
letters      6.0
count_a      1.0
count_e      1.0
count_a+e    2.0
dtype: float64
In [46]:
df.sum()
Out[46]:
fruit        applebananacherrydateelderberry
number                                    15
letters                                   31
count_a                                    5
count_e                                    6
count_a+e                                 11
dtype: object
In [47]:
df.count()
Out[47]:
fruit        5
number       5
letters      5
count_a      5
count_e      5
count_a+e    5
dtype: int64
In [48]:
df.describe()
Out[48]:
number letters count_a count_e count_a+e
count 5.000000 5.000000 5.000000 5.000000 5.00000
mean 3.000000 6.200000 1.000000 1.200000 2.20000
std 1.581139 2.280351 1.224745 1.095445 0.83666
min 1.000000 4.000000 0.000000 0.000000 1.00000
25% 2.000000 5.000000 0.000000 1.000000 2.00000
50% 3.000000 6.000000 1.000000 1.000000 2.00000
75% 4.000000 6.000000 1.000000 1.000000 3.00000
max 5.000000 10.000000 3.000000 3.000000 3.00000

Each operation accepts many different options: let's take a quick detour into learning how to read Python package documentation!
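
One handy trick for that (a usage note, not one of the original cells): in Jupyter you can append a question mark to any function to pull up its docstring, or use Python's built-in help(). The docstring is where you'll find the axis option used below.

df.sum?                 # Jupyter/IPython: opens the docstring for DataFrame.sum
help(pd.DataFrame.sum)  # plain Python equivalent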

In [49]:
df.sum(axis=1)
Out[49]:
a    10
b    14
c    11
d    12
e    21
dtype: int64
  • The apply and applymap methods can be used to run more complicated operations on the dataframe
In [50]:
df.apply(np.sum, axis=0)
Out[50]:
fruit        applebananacherrydateelderberry
number                                    15
letters                                   31
count_a                                    5
count_e                                    6
count_a+e                                 11
dtype: object
In [52]:
df.iloc[:,1:6].apply(np.sum, axis=1)
Out[52]:
a    10
b    14
c    11
d    12
e    21
dtype: int64
In [53]:
df.applymap(lambda x:x*10)
Out[53]:
fruit number letters count_a count_e count_a+e
a appleappleappleappleappleappleappleappleapplea... 10 50 10 10 20
b bananabananabananabananabananabananabananabana... 20 60 30 0 30
c cherrycherrycherrycherrycherrycherrycherrycher... 30 60 0 10 10
d datedatedatedatedatedatedatedatedatedate 40 40 10 10 20
e elderberryelderberryelderberryelderberryelderb... 50 100 0 30 30
  • We can even add more to the DataFrames with append, concat, and join
In [54]:
df_f = pd.DataFrame([['fig', 6]], columns=['fruit', 'number'])
df.append(df_f, sort=False) # sort=True ==> alphabetically sorts columns
Out[54]:
fruit number letters count_a count_e count_a+e
a apple 1 5.0 1.0 1.0 2.0
b banana 2 6.0 3.0 0.0 3.0
c cherry 3 6.0 0.0 1.0 1.0
d date 4 4.0 1.0 1.0 2.0
e elderberry 5 10.0 0.0 3.0 3.0
0 fig 6 NaN NaN NaN NaN
In [55]:
df_f = pd.DataFrame([['fig', 6]], columns=['fruit', 'number'], index=['f'])
pd.concat([df, df_f], axis=0, sort=True)  # sort=False ==> keeps columns in the order of the original df
Out[55]:
count_a count_a+e count_e fruit letters number
a 1.0 2.0 1.0 apple 5.0 1
b 3.0 3.0 0.0 banana 6.0 2
c 0.0 1.0 1.0 cherry 6.0 3
d 1.0 2.0 1.0 date 4.0 4
e 0.0 3.0 3.0 elderberry 10.0 5
f NaN NaN NaN fig NaN 6
In [56]:
# Note: my df is unchanged! To keep the result, I need to assign it to a new variable:
df_new = pd.concat([df,df_f], axis=0, sort=False)
df_new
Out[56]:
fruit number letters count_a count_e count_a+e
a apple 1 5.0 1.0 1.0 2.0
b banana 2 6.0 3.0 0.0 3.0
c cherry 3 6.0 0.0 1.0 1.0
d date 4 4.0 1.0 1.0 2.0
e elderberry 5 10.0 0.0 3.0 3.0
f fig 6 NaN NaN NaN NaN
In [57]:
df_new
Out[57]:
fruit number letters count_a count_e count_a+e
a apple 1 5.0 1.0 1.0 2.0
b banana 2 6.0 3.0 0.0 3.0
c cherry 3 6.0 0.0 1.0 1.0
d date 4 4.0 1.0 1.0 2.0
e elderberry 5 10.0 0.0 3.0 3.0
f fig 6 NaN NaN NaN NaN
In [58]:
more_data = {'animal': ['alligator','bear','cat','dog','elephant'], 'odd':[True,False,True,False,True], 'number': [1,2,3,4,5]}
df_more = pd.DataFrame(more_data, index=['a','b','c','d','e'])
In [59]:
df_more.dtypes
Out[59]:
animal    object
odd         bool
number     int64
dtype: object
In [60]:
df.join(df_more, lsuffix='_fruit', rsuffix='_animal')
Out[60]:
fruit number_fruit letters count_a count_e count_a+e animal odd number_animal
a apple 1 5 1 1 2 alligator True 1
b banana 2 6 3 0 3 bear False 2
c cherry 3 6 0 1 1 cat True 3
d date 4 4 1 1 2 dog False 4
e elderberry 5 10 0 3 3 elephant True 5
In [61]:
df_both = df_new.join(df_more.set_index('number'), on='number', how='outer')
df_both
Out[61]:
fruit number letters count_a count_e count_a+e animal odd
a apple 1 5.0 1.0 1.0 2.0 alligator True
b banana 2 6.0 3.0 0.0 3.0 bear False
c cherry 3 6.0 0.0 1.0 1.0 cat True
d date 4 4.0 1.0 1.0 2.0 dog False
e elderberry 5 10.0 0.0 3.0 3.0 elephant True
f fig 6 NaN NaN NaN NaN NaN NaN

How does pandas deal with missing data?

In [62]:
df_both.isnull()
Out[62]:
fruit number letters count_a count_e count_a+e animal odd
a False False False False False False False False
b False False False False False False False False
c False False False False False False False False
d False False False False False False False False
e False False False False False False False False
f False False True True True True True True
In [63]:
df_both.notnull()
Out[63]:
fruit number letters count_a count_e count_a+e animal odd
a True True True True True True True True
b True True True True True True True True
c True True True True True True True True
d True True True True True True True True
e True True True True True True True True
f True True False False False False False False

Most operations, such as sum, skip NaN values by default (effectively treating them as zero).
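
As a quick illustration (a sketch, not one of the cells above), summing the letters column of df_both still works even though row 'f' is missing:

df_both['letters'].sum()              # NaN in row 'f' is skipped -> 31.0
df_both['letters'].sum(skipna=False)  # keep the NaN              -> nan

Pandas also gives us several ways to fill or drop the missing values: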

In [64]:
df_both.fillna(0)
Out[64]:
fruit number letters count_a count_e count_a+e animal odd
a apple 1 5.0 1.0 1.0 2.0 alligator True
b banana 2 6.0 3.0 0.0 3.0 bear False
c cherry 3 6.0 0.0 1.0 1.0 cat True
d date 4 4.0 1.0 1.0 2.0 dog False
e elderberry 5 10.0 0.0 3.0 3.0 elephant True
f fig 6 0.0 0.0 0.0 0.0 0 0
In [65]:
df_both.fillna(method='pad')
Out[65]:
fruit number letters count_a count_e count_a+e animal odd
a apple 1 5.0 1.0 1.0 2.0 alligator True
b banana 2 6.0 3.0 0.0 3.0 bear False
c cherry 3 6.0 0.0 1.0 1.0 cat True
d date 4 4.0 1.0 1.0 2.0 dog False
e elderberry 5 10.0 0.0 3.0 3.0 elephant True
f fig 6 10.0 0.0 3.0 3.0 elephant True
In [66]:
df_both.dropna()
Out[66]:
fruit number letters count_a count_e count_a+e animal odd
a apple 1 5.0 1.0 1.0 2.0 alligator True
b banana 2 6.0 3.0 0.0 3.0 bear False
c cherry 3 6.0 0.0 1.0 1.0 cat True
d date 4 4.0 1.0 1.0 2.0 dog False
e elderberry 5 10.0 0.0 3.0 3.0 elephant True
In [67]:
df_both.dropna(axis=1)
Out[67]:
fruit number
a apple 1
b banana 2
c cherry 3
d date 4
e elderberry 5
f fig 6
In [68]:
df_both.replace({'bear':'bat'})
Out[68]:
fruit number letters count_a count_e count_a+e animal odd
a apple 1 5.0 1.0 1.0 2.0 alligator True
b banana 2 6.0 3.0 0.0 3.0 bat False
c cherry 3 6.0 0.0 1.0 1.0 cat True
d date 4 4.0 1.0 1.0 2.0 dog False
e elderberry 5 10.0 0.0 3.0 3.0 elephant True
f fig 6 NaN NaN NaN NaN NaN NaN

We can also sort and group pandas dataframes

In [69]:
df_both.sort_index(ascending=False)
Out[69]:
fruit number letters count_a count_e count_a+e animal odd
f fig 6 NaN NaN NaN NaN NaN NaN
e elderberry 5 10.0 0.0 3.0 3.0 elephant True
d date 4 4.0 1.0 1.0 2.0 dog False
c cherry 3 6.0 0.0 1.0 1.0 cat True
b banana 2 6.0 3.0 0.0 3.0 bear False
a apple 1 5.0 1.0 1.0 2.0 alligator True
In [70]:
df_both.sort_index(axis=1)
Out[70]:
animal count_a count_a+e count_e fruit letters number odd
a alligator 1.0 2.0 1.0 apple 5.0 1 True
b bear 3.0 3.0 0.0 banana 6.0 2 False
c cat 0.0 1.0 1.0 cherry 6.0 3 True
d dog 1.0 2.0 1.0 date 4.0 4 False
e elephant 0.0 3.0 3.0 elderberry 10.0 5 True
f NaN NaN NaN NaN fig NaN 6 NaN
In [71]:
df_both.sort_values(by='count_a+e')
Out[71]:
fruit number letters count_a count_e count_a+e animal odd
c cherry 3 6.0 0.0 1.0 1.0 cat True
a apple 1 5.0 1.0 1.0 2.0 alligator True
d date 4 4.0 1.0 1.0 2.0 dog False
b banana 2 6.0 3.0 0.0 3.0 bear False
e elderberry 5 10.0 0.0 3.0 3.0 elephant True
f fig 6 NaN NaN NaN NaN NaN NaN
In [72]:
df_both.sort_values(by=['count_a+e','letters'])
Out[72]:
fruit number letters count_a count_e count_a+e animal odd
c cherry 3 6.0 0.0 1.0 1.0 cat True
d date 4 4.0 1.0 1.0 2.0 dog False
a apple 1 5.0 1.0 1.0 2.0 alligator True
b banana 2 6.0 3.0 0.0 3.0 bear False
e elderberry 5 10.0 0.0 3.0 3.0 elephant True
f fig 6 NaN NaN NaN NaN NaN NaN
In [73]:
df_both.groupby('letters')
Out[73]:
<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x117eb0be0>
In [74]:
df_both.groupby('letters').groups
Out[74]:
{4.0: Index(['d'], dtype='object'),
 5.0: Index(['a'], dtype='object'),
 6.0: Index(['b', 'c'], dtype='object'),
 10.0: Index(['e'], dtype='object')}
In [75]:
df_both.groupby('odd').groups
Out[75]:
{False: Index(['b', 'd'], dtype='object'),
 True: Index(['a', 'c', 'e'], dtype='object')}
In [76]:
for name, group in df_both.groupby('odd'):
    print(name)
    print(group)
False
    fruit  number  letters  count_a  count_e  count_a+e animal    odd
b  banana       2      6.0      3.0      0.0        3.0   bear  False
d    date       4      4.0      1.0      1.0        2.0    dog  False
True
        fruit  number  letters  count_a  count_e  count_a+e     animal   odd
a       apple       1      5.0      1.0      1.0        2.0  alligator  True
c      cherry       3      6.0      0.0      1.0        1.0        cat  True
e  elderberry       5     10.0      0.0      3.0        3.0   elephant  True
In [77]:
df_both.groupby('odd').get_group(False)
Out[77]:
fruit number letters count_a count_e count_a+e animal odd
b banana 2 6.0 3.0 0.0 3.0 bear False
d date 4 4.0 1.0 1.0 2.0 dog False
In [78]:
df_both.groupby('odd').agg([np.mean, np.std, np.sum])
Out[78]:
number letters count_a count_e count_a+e
mean std sum mean std sum mean std sum mean std sum mean std sum
odd
False 3 1.414214 6 5.0 1.414214 10.0 2.000000 1.414214 4.0 0.500000 0.707107 1.0 2.5 0.707107 5.0
True 3 2.000000 9 7.0 2.645751 21.0 0.333333 0.577350 1.0 1.666667 1.154701 5.0 2.0 1.000000 6.0

We can also select rows based on the values in a column

In [79]:
df['count_a+e'] >= 2
Out[79]:
a     True
b     True
c    False
d     True
e     True
Name: count_a+e, dtype: bool
In [80]:
df.loc[df['count_a+e'] >= 2]
Out[80]:
fruit number letters count_a count_e count_a+e
a apple 1 5 1 1 2
b banana 2 6 3 0 3
d date 4 4 1 1 2
e elderberry 5 10 0 3 3
In [81]:
df.query('letters >= 5')
Out[81]:
fruit number letters count_a count_e count_a+e
a apple 1 5 1 1 2
b banana 2 6 3 0 3
c cherry 3 6 0 1 1
e elderberry 5 10 0 3 3
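
If we need more than one condition, we can combine boolean masks with & and | (a small sketch using the same df; note that plain Python and/or won't work on masks, and each mask needs its own parentheses):

df.loc[(df['count_a+e'] >= 2) & (df['letters'] >= 5)]   # keeps rows a, b, and e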

Quick and simple example of how to use matplotlib for a scatterplot

In [82]:
plt.plot(df['count_a'], df['count_e'], 'bo')
plt.xlabel('number of a\'s in the word')
plt.ylabel('number of e\'s in the word')
plt.title('Vowels in Fruits')
for i in range(len(df['fruit'])):
    plt.annotate(df['fruit'][i], (df['count_a'][i], df['count_e'][i]), 
                 xytext=(5, -5), textcoords='offset points', ha='left', va='top',
                bbox=dict(boxstyle='round,pad=0.5', fc='yellow', alpha=0.5),
        arrowprops=dict(arrowstyle = '->', connectionstyle='arc3,rad=0'))

Seaborn makes pretty plots!

In [83]:
sns.barplot(df['fruit'], df['letters'], palette = 'husl')
Out[83]:
<matplotlib.axes._subplots.AxesSubplot at 0x11b2ce1d0>
In [84]:
ax = sns.violinplot(df['count_a+e'], palette="muted")
ax.set_title('A and E in Fruit')
Out[84]:
<matplotlib.text.Text at 0x11b4ad6d8>

There are bound to be much more complicated applications of pandas, matplotlib, and seaborn, but the dataset we've been playing with is far too simple

Let's try a more complex data set and see what we've learned and what else we can do

In [85]:
df_votes = pd.read_csv('presidentialElections.csv', index_col=0)
In [86]:
df_votes.head()
Out[86]:
state demVote year south
1 Alabama 84.76 1932 True
2 Arizona 67.03 1932 False
3 Arkansas 86.27 1932 True
4 California 58.41 1932 False
5 Colorado 54.81 1932 False

1. Look at the states' statistics from only the most recent election

In [87]:
# How do I look up the dataframe and get entries where the year is 2016?
df_votes.query('year == 2016')
Out[87]:
state demVote year south
1048 Alabama 34.36 2016 True
1049 Alaska 36.55 2016 False
1050 Arizona 44.58 2016 False
1051 Arkansas 33.65 2016 True
1052 California 61.48 2016 False
1053 Colorado 48.16 2016 False
1054 Connecticut 54.57 2016 False
1055 Delaware 53.09 2016 False
1056 DC 90.86 2016 False
1057 Florida 47.41 2016 True
1058 Georgia 45.35 2016 True
1059 Hawaii 62.22 2016 False
1060 Idaho 27.48 2016 False
1061 Illinois 55.24 2016 False
1062 Indiana 37.46 2016 False
1063 Iowa 41.74 2016 False
1064 Kansas 35.74 2016 False
1065 Kentucky 32.68 2016 False
1066 Louisiana 38.45 2016 True
1067 Maine 47.83 2016 False
1068 Maryland 60.33 2016 False
1069 Massachusetts 60.01 2016 False
1070 Michigan 47.03 2016 False
1071 Minnesota 46.44 2016 False
1072 Mississippi 40.06 2016 True
1073 Missouri 37.87 2016 False
1074 Montana 35.41 2016 False
1075 Nebraska 33.70 2016 False
1076 Nevada 47.92 2016 False
1077 New Hampshire 46.83 2016 False
1078 New Jersey 54.99 2016 False
1079 New Mexico 48.26 2016 False
1080 New York 59.00 2016 False
1081 North Carolina 46.17 2016 True
1082 North Dakota 27.23 2016 False
1083 Ohio 43.24 2016 False
1084 Oklahoma 28.93 2016 False
1085 Oregon 50.07 2016 False
1086 Pennsylvania 47.46 2016 False
1087 Rhode Island 54.41 2016 False
1088 South Carolina 40.67 2016 True
1089 South Dakota 31.74 2016 False
1090 Tennessee 34.72 2016 True
1091 Texas 43.12 2016 True
1092 Utah 27.17 2016 False
1093 Vermont 56.68 2016 False
1094 Virginia 49.75 2016 True
1095 Washington 52.54 2016 False
1096 West Virginia 26.18 2016 False
1097 Wisconsin 46.45 2016 False

2. How many times has a southern state voted majority Democratic (demVote > 50%)?

In [88]:
# 1) Find the southern states; 2) Of those, find the instances where the demVote is greater than 50%; 3) Count these instances
df_votes.query('demVote > 50 & south == True').count()
Out[88]:
state      89
demVote    89
year       89
south      89
dtype: int64
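
Since count() reports the same number for every column, a single number is easier to read; one option (just a sketch) is to take the number of matching rows directly:

len(df_votes.query('demVote > 50 & south == True'))   # 89 instances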

3. Sort values by state and by democratic vote

In [89]:
# Sort the dataframe by two different columns
df_votes.sort_values(by=['state','demVote'])
Out[89]:
state demVote year south
436 Alabama 18.72 1968 True
487 Alabama 25.54 1972 True
1048 Alabama 34.36 2016 True
895 Alabama 36.84 2004 True
640 Alabama 38.28 1984 True
997 Alabama 38.36 2012 True
946 Alabama 38.74 2008 True
691 Alabama 39.86 1988 True
742 Alabama 40.88 1992 True
844 Alabama 41.59 2000 True
793 Alabama 43.16 1996 True
589 Alabama 47.45 1980 True
538 Alabama 55.73 1976 True
288 Alabama 56.52 1956 True
336 Alabama 56.83 1960 True
240 Alabama 64.55 1952 True
145 Alabama 81.28 1944 True
1 Alabama 84.76 1932 True
97 Alabama 85.22 1940 True
49 Alabama 86.38 1936 True
590 Alaska 26.41 1980 False
845 Alaska 27.67 2000 False
641 Alaska 29.87 1984 False
743 Alaska 30.29 1992 False
794 Alaska 33.27 1996 False
488 Alaska 34.62 1972 False
896 Alaska 35.52 2004 False
539 Alaska 35.65 1976 False
692 Alaska 36.27 1988 False
1049 Alaska 36.55 2016 False
... ... ... ... ...
944 Wisconsin 49.70 2004 False
143 Wisconsin 50.15 1940 False
238 Wisconsin 50.70 1948 False
740 Wisconsin 51.41 1988 False
1046 Wisconsin 52.83 2012 False
995 Wisconsin 56.22 2008 False
434 Wisconsin 62.09 1964 False
47 Wisconsin 63.46 1932 False
95 Wisconsin 63.80 1936 False
894 Wyoming 27.70 2000 False
1047 Wyoming 27.82 2012 False
639 Wyoming 27.97 1980 False
690 Wyoming 28.24 1984 False
945 Wyoming 29.07 2004 False
537 Wyoming 30.47 1972 False
996 Wyoming 32.54 2008 False
792 Wyoming 33.98 1992 False
486 Wyoming 35.51 1968 False
843 Wyoming 36.84 1996 False
287 Wyoming 37.09 1952 False
741 Wyoming 38.01 1988 False
588 Wyoming 39.81 1976 False
335 Wyoming 39.92 1956 False
385 Wyoming 44.99 1960 False
192 Wyoming 48.77 1944 False
239 Wyoming 51.62 1948 False
144 Wyoming 52.82 1940 False
48 Wyoming 56.07 1932 False
435 Wyoming 56.56 1964 False
96 Wyoming 60.58 1936 False

1097 rows × 4 columns

4. How many states' data were recorded per year?

In [90]:
# Group the dataframe by year and find the size of each group
df_votes.groupby('year').size()
Out[90]:
year
1932    48
1936    48
1940    48
1944    48
1948    47
1952    48
1956    48
1960    50
1964    50
1968    51
1972    51
1976    51
1980    51
1984    51
1988    51
1992    51
1996    51
2000    51
2004    51
2008    51
2012    51
2016    50
dtype: int64
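
An equivalent way to get the same table (just a sketch of an alternative) is to count the occurrences of each year directly:

df_votes['year'].value_counts().sort_index()   # same counts as groupby('year').size()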

5. Get the highest demVote percentage per state and the year associated with that percentage

In [91]:
# 1) Group the dataframe by state;  2) Get the maximum values for demVote per group; 
# 3) Get the associated index; 4) Get the year (from the row in the dataframe) associated with that maximum
idx = df_votes.groupby(['state'])['demVote'].transform(max) == df_votes['demVote']
In [92]:
df_votes[idx].sort_values('demVote', ascending=False)
Out[92]:
state demVote year south
86 South Carolina 98.57 1936 True
70 Mississippi 97.03 1936 True
16 Louisiana 92.79 1932 True
954 DC 92.46 2008 False
9 Georgia 91.60 1932 True
41 Texas 88.19 1932 True
49 Alabama 86.38 1936 True
3 Arkansas 86.27 1932 True
424 Rhode Island 80.87 1964 False
396 Hawaii 78.76 1964 False
406 Massachusetts 76.19 1964 False
56 Florida 76.08 1936 True
127 North Carolina 74.03 1940 True
34 Oklahoma 73.30 1932 False
74 Nevada 72.81 1936 False
92 Virginia 70.23 1936 True
50 Arizona 69.85 1936 False
32 North Dakota 69.59 1932 False
90 Utah 69.34 1936 False
72 Montana 69.28 1936 False
404 Maine 68.84 1964 False
88 Tennessee 68.77 1936 True
417 New York 68.56 1964 False
433 West Virginia 67.94 1964 False
391 Connecticut 67.81 1964 False
991 Vermont 67.46 2008 False
52 California 66.95 1936 False
407 Michigan 66.70 1964 False
93 Washington 66.38 1936 False
386 Alaska 65.91 1964 False
415 New Jersey 65.61 1964 False
405 Maryland 65.47 1964 False
423 Pennsylvania 64.92 1964 False
83 Oregon 64.41 1936 False
410 Missouri 64.05 1964 False
402 Kentucky 64.01 1964 False
414 New Hampshire 63.89 1964 False
95 Wisconsin 63.80 1936 False
408 Minnesota 63.76 1964 False
39 South Dakota 63.62 1932 False
25 Nebraska 62.98 1932 False
58 Idaho 62.96 1936 False
420 Ohio 62.94 1964 False
29 New Mexico 62.72 1932 False
953 Delaware 61.91 2008 False
400 Iowa 61.88 1964 False
959 Illinois 61.85 2008 False
390 Colorado 61.27 1964 False
96 Wyoming 60.58 1936 False
60 Indiana 56.63 1936 False
401 Kansas 54.09 1964 False
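
An alternative to the transform-based mask above (a sketch that assumes each state's maximum demVote is unique) is to use idxmax to grab the row label of each state's maximum directly:

df_votes.loc[df_votes.groupby('state')['demVote'].idxmax()]   # one row per state, at its max demVote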

6. Get the median, mean, and standard deviation of the demVote percentage per year across the 50 states

In [93]:
# 1) Group by year; 2) Calculate medians, means, and standard deviations
df_votes[['year','demVote']].groupby('year').agg([np.median, np.mean, np.std])
Out[93]:
demVote
median mean std
year
1932 58.750 62.776042 14.085063
1936 61.300 64.726875 12.694312
1940 54.530 59.224583 13.588370
1944 52.605 56.590417 12.088866
1948 49.480 49.309149 9.436584
1952 43.870 43.901875 8.569068
1956 41.675 42.255208 6.954904
1960 48.830 48.503600 5.468181
1964 61.110 59.949000 10.980133
1968 41.590 41.307647 11.119246
1972 36.390 36.095098 9.001630
1976 48.750 49.545686 7.905734
1980 40.910 39.894510 9.348128
1984 38.830 39.417451 8.756930
1988 43.880 45.120784 7.748392
1992 42.480 42.001765 8.520815
1996 48.000 47.956078 8.552812
2000 46.460 46.013725 10.202994
2004 47.020 46.520588 10.380094
2008 51.380 51.290000 11.093420
2012 50.580 48.993922 11.813777
2016 46.305 45.065600 11.864589
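
Passing the aggregation names as strings works too (just a sketch of an equivalent call; selecting the column first drops the extra 'demVote' level from the column header):

df_votes.groupby('year')['demVote'].agg(['median', 'mean', 'std'])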

7. Violin plots per year: what are the distributions in demVote across the United States?

In [94]:
# Hint: sns.violinplot() -- What kind of data is most appropriate?
f, ax = plt.subplots(figsize=(15, 5))
sns.violinplot(x="year", y="demVote",  hue='south',
               data=df_votes.loc[df_votes['year'].isin([1932, 1944, 1956, 1968, 1980, 1992, 2004, 2016])])
Out[94]:
<matplotlib.axes._subplots.AxesSubplot at 0x11b58bcf8>

8. Bar plot of southern vs. non-southern states' average percentages over the years, with standard deviations

In [95]:
# Hint: the bar plot function of seaborn can do quite a lot!
f, ax = plt.subplots(figsize=(20, 4))
sns.barplot(x='year', y='demVote', hue='south', ci = 'sd',  estimator=np.mean, data=df_votes)
Out[95]:
<matplotlib.axes._subplots.AxesSubplot at 0x11b885400>

9. Line graphs of a few states' voting patterns over the years

In [96]:
# Hint: Use line plot, choose different colors for different states, only plot a few states, not all of them
# Make it somewhat easy to understand!
ax = sns.lineplot(x='year', y='demVote', hue='state', style='south', markers=True,
                  data=df_votes.loc[df_votes['state'].isin(['California', 'Texas', 'Massachusetts', 'New York', 'Ohio', 'Alabama', 'Florida', 'Louisiana'])])
ax.legend(loc='best', bbox_to_anchor=(0.5, 0., 1, 1))
ax.set_title('Sampling of state voting patterns over the years')
Out[96]:
<matplotlib.text.Text at 0x11b3a1358>

10. Heat map of each state per year, annotated. Can we see any trends?

In [97]:
# Hint: You might have to work a little bit of magic to get the dataframe in a format that seaborn will accept
# Look at the documentation for seaborn heatmap() and look at the sample data they use--get your data to match that format
# You'll have to deal with NaN's in the dataset
df_votes_sorted = df_votes.sort_values('south')
df_votes_hm = df_votes_sorted.pivot('state', 'year', ['demVote', 'south']).fillna(0)
In [98]:
# Sort the states so that the southern states are grouped together--maybe we'll see a common trend
df_votes_hm = df_votes_hm.sort_values(by=('south', 2016))
In [99]:
# Annotate the heatmap with actual percentages! Format it to make it readable
f, ax = plt.subplots(figsize=(10, 12))
sns.heatmap(df_votes_hm['demVote'], annot=True, fmt=".0f")
Out[99]:
<matplotlib.axes._subplots.AxesSubplot at 0x11bf1f8d0>