Adapted by Joseph Sedlak for MCB 112 W06 section (10-13-17) from anonymous's MCB 112 section notes written for w08 2016.

You can also download these notes as a Jupyter notebook page.

In [1]:

```
# Imports packages we will use
import numpy as np
import math
import pandas as pd
import scipy.spatial.distance as dis
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy.random as rand
import scipy.stats as stats
import scipy.misc as misc
from collections import Counter
```

In [2]:

```
# Generate 40 random numbers from a normal distribution with four different
# means and standard deviations
coords = []
means = [3,7,10,15]
sigma = [1,1,1,1]
points=40
for x in range(len(means)):
coords.extend(list(np.random.normal(means[x],sigma[x],points)))
```

In [3]:

```
# Quick dot plot to visualize the points
# Give the data points a random y value just for aesthetics
f = plt.figure(figsize = (20,1))
y = rand.rand(len(coords))
plt.scatter(coords,y,marker = 'o')
plt.tick_params(axis = 'y',left = 'off',right = 'off', labelleft = 'off')
plt.xlim(0,20)
plt.ylim(-0.5,1.5)
plt.show()
```

In English, the K-Means clustering algorithm is trying to find clusters that minimize the total distance between the cluster centers and all of the points assigned to each cluster. It is an iterative function that starts from a random position (or perhaps an intelligently guessed position) and converges to a local optimum. In each iteration, two things happen:

1) Measure the distance from each point ($x_{i}$), to each center ($\mu_{k}$). We define distance here as $\sqrt{(x_{i}-\mu_{k})^{2}}$. Then, ASSIGN each point to the cluster with the closest center. This is the ASSIGNMENT step.

2) Find the mean position of all the points assigned to each cluster ($\frac{\Sigma_{i\epsilon k}x_{i}}{|C_{k}|}$) and UPDATE the center of that cluster to be this position. This is the UPDATE step.

That's it. In practice it's a very simple process. The only real fiddly part is choosing initial positions that allow the algorithm to reach a global minimum, and not an unwanted local minimum. Picking many random starts and repeating the iteration for each of them is one way to do this. But there are others. Also, it is possible that the argorithm will converge on a local minimum that includes an empty cluster, with no points assigned to it. Clearly this is a non-optimal solution and we have no need to continue iterating along its pathway. This leaves us with two possibiliteis when we encounter an empty cluster: We can stop the current iteration, and restart with new initial positions, or we can try to save the current iteration by assigning points to the empty cluster. Below, I adopt the latter strategy and assign the point furthest from its current assigned center to instead be the sole member of the empty cluster, and then continue iterating as before.

In [4]:

```
def assignment(coords,centers):
'''
Return current cluster assigned to each point, as well as the
sum of this distances between each point and its assigned cluster
Parameters
----------
coords : list or 1D ndarray
list X-coordinates of all points
centers : list or 1D array
list of current estimates for cluster centers
Returns
-------
clusters : list or 1D array
list of current cluster assignment for each point
distance : float
sum of distances between each point and its assigned cluster center
'''
# Find minimum distance between a point and each center
distances = [np.min([math.sqrt((point-i)**2) for i in centers]) for point in coords]
# Assign each point to the closest center
clusters = [np.argmin([math.sqrt((point-i)**2) for i in centers]) for point in coords]
# Count how many points in each cluster
c = Counter(clusters)
# Look for empty clusters
for x in set(range(len(centers))).difference(set(c.keys())):
clusters,distances = assign_empty(x,clusters,distances)
distance = sum(distances)
return clusters, distance
def assign_empty(c,clusters,distances):
'''
Assigns the point furthest from its current cluster center to an empty cluster
Returns an updated cluster list and updated distances list
Parameters
----------
c : int
number of empty cluster
clusters : list or 1D array
list of current cluster assignment for each point
distances: list or 1D array
list of current distances between each point and its assigned center
Returns
-------
clusters : list or 1D array
list of current cluster assignment for each point
distances: list or 1D array
list of current distances between each point and its assigned center
'''
# find index of point furthest from its assigned center
max_idx = np.argmax(distances)
# set distances at that index to 0
distances[max_idx] = 0
# set clusters at that index to empty cluster
clusters[max_idx] = c
return clusters,distances
def update(coords, clusters, num_clust):
'''
Calculates the new cluster centers as an average of the positions of all
points included in the cluster.
Parameters
----------
coords : list or 1D ndarray
list X-coordinates of all points
clusters : list or 1D array
list of current cluster assignment for each point
num_clust: int
number of clusters
Returns
-------
centers : list or 1D array
list of current estimated center for each cluster
'''
coords = np.asarray(coords)
# Find indices of each point assigned to each cluster
clusters = ([np.where(np.asarray(clusters) == x) for x in range(num_clust)])
# Average position of each point assigned to each cluster
centers = [np.mean(coords[clusters[x]],0) for x in range(len(clusters))]
return centers
def plot_centers(centers,c,lw):
'''
Plots the current estimates for cluster centers as vertical bars
Parameters
----------
centers : list or 1D ndarray
list X-coordinates of all centers
c : str
color to plot centers in
lw : int
weight of line with which to plot centers
Returns
-------
None
'''
for center in centers:
plt.axvline(x = center,color = c,linewidth = lw)
```

In [5]:

```
# Initialize figure, plot raw data
plt.close('all')
f = plt.figure(figsize = (20,1))
y = rand.rand(len(coords))
plt.scatter(coords,y,marker = 'o')
plt.tick_params(axis = 'y',left = 'off',right = 'off', labelleft = 'off')
plt.xlim(0,20)
plt.ylim(-0.5,1.5)
# Define number of iterations (iters) and number of clusters (num_means)
iters = 2
num_clust = 4
# Initalize best clusters
min_dist = float('inf')
clusters_best = []
# Loop on iters
for i in range(iters):
# Each iteration starts with a random set of centers
centers = np.random.choice(np.linspace(min(coords),max(coords),1000),num_clust)
plot_centers(centers,'g',2)
# Start with empty clusters
clusters_old = []*len(coords)
clusters = [0]* len(coords)
# Iterate until clusters don't change
while clusters != clusters_old:
# Update clusters
clusters_old = clusters
# New Assignment
clusters,distance = assignment(coords,centers)
# New centers
centers = update(coords, clusters, num_clust)
# Plot best centers
plot_centers(centers,'r',2)
# Save clusters and centers if it has minimal summ distance to date
if distance < min_dist:
min_dist = distance
centers_best = centers
clusters_best = clusters
plt.show()
```

In [6]:

```
# Initialize new figure
plt.close('all')
f = plt.figure(figsize = (20,1))
y = rand.rand(len(coords))
# Plot raw data, colored by cluster
plt.scatter(coords,y,marker = 'o',c = clusters_best)
# Pretty up the plot a bit
plt.tick_params(axis = 'y',left = 'off',right = 'off', labelleft = 'off')
plt.xlim(0,20)
plt.ylim(-0.5,1.5)
# Plot best centroids (means)
plot_centers(centers_best,'k',2)
plt.show()
```

In [7]:

```
print(means)
print(np.round(sorted(centers_best),1))
```

Soft K-Means isn't too hard either. In Hard K-Means, we assigned each point only to its closest center and then we updated each center with the average position of only the points assigned to it. Soft K-Means will use essentially the same strategy, but now our assignment and update steps can include every point for each center, weighted by distance and an extra parameter, $\beta$. It is also an iterative function that also will find a local minima only, but now we can control the fuzziness of our clusters, i.e. how much we weight points distant from a center, relative to closer points, with $\beta$. (Soft K-Means is also sometimes called Fuzzy K-Means) Again, in each iteration two things happen:

1) Measure the distance from each point ($x_{i}$), to each center ($\mu_{k}$) and ASSIGN the RESPONSIBILITY ($r_{ik}$) of that center to the point. We define distance here as $\sqrt{(x_{i}-\mu_{k})^{2}}$ and RESPONSIBILITY as $\frac{e^{-\beta(x_{i}-\mu_{k})^{2}}}{\Sigma_{k}e^{-\beta(x_{i}-\mu_{k})^{2}}}$.**

2) UPDATE the centers of the clusters based on the RESPONSIBILITY weighted mean of ALL the coordinates defined as: $\frac{\Sigma_{i} r_{ik} x_{i} }{\Sigma_{i} r_{ik}}$.

It's worth making a quick note of what $\beta$ acually means. You will hear people refer to it as a "stiffness." Higher $\beta$s are stiffer since they cause the responsibilities of centers close to a point to be much higher than centers further away, to the extreme case where $\beta = \infty$ and all the responsibility of a point comes from it closest center. In the opposite extreme, when $\beta = 0$ then each point gets equal responsibility from each center. The plots above were made with $\beta = 0.1$. Now I'll change it to $\beta = 10$. Notice how the demarcation between the blue cluster and the red cluster become much clearer.

In [8]:

```
def soft_assignment(coords,centers,beta):
'''
Return the resposibilities of each center to each point
Parameters
----------
coords : dD ndarray
list X-coordinates of all points
centers : list or 1D array
list of current estimates for cluster centers
beta : float
parameter that adjusts the "stiffness" of the responsibility.
Returns
-------
resps : 2D array
Responsibilities of each point from each center
'''
# Calcualte the reponsibilites for each point
resps = [responsibilities(point,centers,beta) for point in coords]
return resps
def responsibilities(point, means, beta):
'''
Return the resposibilities of each center to one point
Parameters
----------
point : float
X-coordinate of current point
centers : list or 1D array
list of current estimates for cluster centers
beta : float
parameter that adjusts the "stiffness" of the responsibility.
Returns
-------
resps : list or 1D array
Responsibilities of current point from each center
'''
# Calculate the responsibilities from each center to current point
num = np.exp(np.multiply(-beta,[(point-i)**2 for i in means]))
# Normalize by sum of responsibilites of point from all centers
resps = np.divide(num,sum(num))
return resps
def soft_update(coords, resps,num_clust,beta):
'''
Calculates the new cluster centers as an responsibility weighted average of the positions of all
points.
Parameters
----------
coords : list or 1D array
list X-coordinates of all points
resps : list or 1D array
Responsibilities of current point from each center
num_clust: int
number of clusters
beta : float
parameter that adjusts the "stiffness" of the responsibility.
Returns
-------
centers : list or 1D array
list of current estimated center for each cluster
clusters : list or 1D array
list of current cluster assignment for each point
distance : float
sum current distances between each point and its assigned center
'''
# Find responsibility weighted mean of point positions for each center
num = ([sum(np.multiply(coords,[r[i] for r in resps])) for i in range(num_clust)])
denom = [sum([r[i] for r in resps]) for i in range(num_clust)]
centers = np.divide(num,denom)
# Assign each point to it's closest cluster
clusters = [np.argmax(np.multiply(responsibilities(point,centers,beta),point)) for point in coords]
# calcuate the distance between each point and its assigned center
distance = sum([np.min([math.sqrt((point-i)**2) for i in centers]) for point in coords])
return centers,clusters,distance
```

In [9]:

```
# Initialize figure, plot raw data
plt.close('all')
f = plt.figure(figsize = (20,1))
y = rand.rand(len(coords))
plt.scatter(coords,y,marker = 'o')
plt.tick_params(axis = 'y',left = 'off',right = 'off', labelleft = 'off')
plt.xlim(0,20)
plt.ylim(-0.5,1.5)
# Define number of iterations (iters), number of clusters (num_clust), and beta (beta)
iters = 5
num_clust = 4
beta = 0.5
# define a best distance to center (min_dist)
min_dist = float('inf')
clusters_best = []*len(coords)
# Loop on iters
for _ in range(iters):
# Each iteration starts with a random set of centers
centers = np.random.choice(np.linspace(min(coords),max(coords),1000),num_clust)
plot_centers(centers,'g',2)
# Start with empty clusters
center_diff = float('inf')
centers_old = [0]*num_clust
# Loop until centers stop moving significantly
while center_diff > 0.1:
# Assign responisbilities
resps = soft_assignment(coords,centers,beta)
# Update centers
centers,clusters,distance = soft_update(coords,resps,num_clust,beta)
# Find distance new centers are from previous centers
center_diff = dis.euclidean(centers,centers_old)
centers_old = centers
# plot converged centers
plot_centers(centers,'r',2)
# Save "best" centers and clusters
if distance < min_dist:
min_dist = distance
centers_best = centers
clusters_best = clusters
plt.show()
```

In [10]:

```
plt.close('all')
f = plt.figure(figsize = (20,1))
y = rand.rand(len(coords))
plt.scatter(coords,y, marker = 'o', c = clusters_best,edgecolor = 'None')
plt.tick_params(axis = 'y',left = 'off',right = 'off', labelleft = 'off')
plt.xlim(0,20)
plt.ylim(-0.5,1.5)
plot_centers(centers_best,'k',2)
plt.show()
```

In [11]:

```
print(means)
print(np.round(sorted(centers_best),1))
```

OK. Deep breath. This is not hard. I promise. Let's take it step by tiny little step. ** First we want to define our Mixture Model. This is the underlying combinations of probability distributions that our data were genreated from. They don't all have to be the same distribution; they don't even have to be the same type of distribution (you can mix distributions any which way you want). But in this case, for simplicity's sake (and to mirror the problem set), we are going to assume all our points were generated from 1 of Q diffenent Gaussian distribtions. Gaussian distributions, as you know, have 2 parameters: $\mu$ and $\sigma$. Because we are working with multiple Gaussian distributions, we need to add one more parameter, $\pi$, the mixing coefficient that is our best guess as to what fraction of the data was generated by each Gaussian.**

Now, with a mixture model defined, we begin our iteration process to fit it by declaring initial parameters.

We can choose arbitrary $\mu$ and $\pi$, since these are the parameters we will be trying to optimize. Our choice of sigma, though, will be the same through every iteration. Sigma will behave much like $\beta$ in the Soft K-means (in fact sigma is directly related to $\beta$ since $\beta = \frac{1}{2\sigma^{2}}$. I'll leave the proof of this in Sean's notes, but it's worth convincing yourself of its truth.) Accordingly, low sigmas will make your models "stiffer" with lower probability assigned to points further from a given $\mu$.

Then, in the first step of the iteration, we find the EXPECTATION : The Posterior Probability that each data point, $x_{i}$ came from each mixture component $q$ given the current estimates of $\mu$, $\pi$, and our set $\sigma$. Formally, this is $P(q|x_{i}\pi,\mu,\sigma)$, the normalized likleihood of each point coming from each componenet, $P(q|x_{i}\pi,\mu,\sigma) = \frac{\pi_{q}P(x_{i}|\mu_{q},\pi_{q},\sigma)}{\Sigma_{q}\pi_{q}P(x_{i}|\mu_{q},\pi_{q},\sigma)}$.

The second step is the MAXIMIZATION. In this case we using the posterior probability to find the optimal $\mu$ and $\pi$. To do this we calculate the new best estimate of $\pi$ as the mean of the Posterior Probability, $\frac{\Sigma_{i} P(q|x_{i}\pi,\mu,\sigma)}{N}$, and the new best estmate of $\mu$ as the weighted mean of the point positions $x_{i}$ as $\frac{\Sigma_{i} P(q|x_{i}\pi,\mu,\sigma)x_{i}}{\Sigma_{q} P(q|x_{i}\pi,\mu,\sigma)}$

Note: the exact equations that are taken for expectation and being maximized (i.e. a formal derivation of the E step and the M step) will not be covered in this course as that relies on a more advanced statistics background (see Wikipedia page: https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm).

In [12]:

```
def expectation(coords, mus, sigma, mix_coeffs):
'''
Return the Posterior Probability of each data point coming from each componenent
Parameters
----------
coords : list or1D array
list X-coordinates of all points
mus : list or 1D array
list of current estimates for component means
sigma : float
set parameter for component variance
mix_coeffs: list or 1D array
list of current extiamtes for mixture coefficients of components
Returns
-------
resps : 2D array
Posterior probabilities of each component being the
source of each point.
clusters : list or 1D array
list of current cluster assignment for each point
'''
# Initialze P(q|x)
Pqx = []
# Loop through points
for point in coords:
# Append pi * P(x|q)
Pqx.append([mix_coeffs[i]*stats.norm.pdf(point,mu,sigma) for i,mu in enumerate(mus)])
# Normalize
posts = [np.divide(x,sum(x)) for x in Pqx]
# Assign clusters based on max posterior probability
clusters = [np.argmax(post) for post in posts]
return posts,clusters
def maximization(coords, posts, mix_coeffs):
'''
Return the update mu and mixture coefficient for each component
Parameters
----------
coords : list or1D array
list X-coordinates of all points
mus : list or 1D array
list of current estimates for component means
sigma : float
set parameter for component variance
mix_coeffs: list or 1D array
list of current extiamtes for mixture coefficients of components
Returns
-------
mus : list or 1D array
current estimate of component means
mix_coeffs: list or 1D array
list of current extiamtes for mixture coefficients of components
'''
# calculate the new mean as the posterior weighted average of point positions
num_mu = np.sum([np.multiply(posts[i],x) for i,x in enumerate(coords)],0)
denom_mu = np.sum(posts,0)
mus = np.divide(num_mu,denom_mu)
# calculate the new mixture coefficients as the mean of the posteriors
mix_coeffs = np.divide(np.sum(posts,0),len(coords))
return mus,mix_coeffs
def negll(coords, mus, sigma, mix_coeffs):
'''
Return the negative log likelihood of data given the current mixture model
Parameters
----------
coords : list or1D array
list X-coordinates of all points
posts : list or 1D array
list of posterior probabilites of each point coming from each component
mix_coeffs: list or 1D array
list of current extiamtes for mixture coefficients of components
Returns
-------
nll : float
negative log likelihood of data given model
'''
# Compute the nll as the normal logpdf of the data, given mu and sigma, plus the log of the mixture coefficient
# Summed for each point
ll = np.sum([[np.log(mix_coeffs[i]) + stats.norm.logpdf(point,mu,sigma) for i,mu in enumerate(mus)] for point in coords])
return -ll
```

In [13]:

```
# Initialize figure, plot raw data
plt.close('all')
f = plt.figure(figsize = (20,1))
y = rand.rand(len(coords))
plt.scatter(coords,y,marker = 'o')
plt.tick_params(axis = 'y',left = 'off',right = 'off', labelleft = 'off')
plt.xlim(0,20)
plt.ylim(-0.5,1.5)
# Define number of iterations (iters), number of clusters (num_clust), and beta (beta)
iters = 20
num_clust = 4
beta = 10
# define a threshold for nll change that will stop iteration (nll_thresh)
nll_thresh = .001
# Sigma calculated from beta
sigma = math.sqrt (1/(2*beta))
# define best mus (mus_best), best distance to mu (min_dist), best mixture coefficients (mix_coeffs_best
# and intialize as starting best nll (nll_best)
min_dist = float('inf')
mus_best = []*len(coords)
nll_best = float('inf')
mix_coeffs_best = []
for _ in range(iters):
# Each iteration starts with a random set of mus (mus) with random mixture coeffcients (mix_coeffs)
mix_coeffs = rand.rand(num_clust)
mix_coeffs = np.divide(mix_coeffs,sum(mix_coeffs))
mus = np.random.choice(np.linspace(min(coords),max(coords),1000),num_clust)
plot_centers(mus,'g',2)
# We will iterate until the nll stops changing muc
# Here we initialize holder variables for the last nll and the difference between the current and last nll
nll_diff = float('inf')
nll_old = float(0)
# Iterate while the differnece between consecutive nlls is above a threshold
while nll_diff > (nll_thresh* nll_old):
# Calculate posterior probabilities and assign points to clusters
posts,clusters = expectation(coords, mus, sigma, mix_coeffs)
# Calculate new mus and mixture coefficients given current posterior probabilities
mus, mix_coeffs = maximization(coords, posts, mix_coeffs)
# Calcualte the nll of hte current mixture model
nll = negll(coords, mus, sigma, mix_coeffs)
# find difference in consecutive nlls and update the nll_old
nll_diff = abs(nll-nll_old)
nll_old = nll
# plot converged centers
plot_centers(mus,'r',2)
# Update best estimates for mus, mixture coefficients and cluster assignments
if nll < nll_best:
mix_coeffs_best = mix_coeffs
nll_best = nll
mus_best = mus
clusters_best = clusters
plt.show()
```

In [14]:

```
def Gaussian(X, mu, sigma):
'''
Return the negative log likelihood of data given the current mixture model
Parameters
----------
X : list or1D array
list X-coordinates to plot
mu : int or float
mean of Gaussian distribtion to plot
sigma : int or float
variance of Gaussian distribution to plot
Returns
-------
gauss : list or1D array
list of Y corrdinates calculated from X with Gaussian equation
'''
gauss = [stats.norm.pdf(x, mu, sigma) for x in X]
return gauss
```

In [15]:

```
# Define X coords to plot over
X = np.linspace(0,20,2000)
# calculate Y coords for each mixture with Gaussian function
Y= [np.multiply(mix_coeffs_best[i],Gaussian(X, mu, sigma)) for i,mu in enumerate(mus_best)]
# Sum mixtures
Ysum = np.sum(Y,0)
# y axis scaling variable for plotting aesthetics
ysmall = np.add(np.multiply(y,np.max(Ysum)),np.max(Ysum)*0.25)
# close any open plots
plt.close('all')
f = plt.figure(figsize = (20,1))
# plot each gaussian mixture component
for fit in Y:
plt.plot(X, fit, linewidth = 1, c = 'k')
# Plot sum of components
plt.plot(X,Ysum, linewidth = 2, c = 'k')
# Plot data poiits
plt.scatter(coords,ysmall, marker = 'o', c = clusters_best, edgecolor = 'None')
# Plot best mus
plot_centers(mus_best,'k',2)
# Pretty up the graph a bit
plt.tick_params(axis = 'y',left = 'off',right = 'off', labelleft = 'off')
plt.xlim(0,20)
plt.ylim(0,max(Ysum)+0.5*(max(Ysum)))
plt.show()
```

In [ ]:

```
```