MCB112: Biological Data Analysis (Fall 2018)


homework 05:

Student's game night

On Tuesday nights, the lab's favorite pub is tended by an odd bartender. Nobody knows his real name. He calls himself Student, and he serves only Guinness. Student challenges customers to play a strange betting game of his own invention.

the game

In a back room, hidden from the customers, there is a 20-row x 21-column grid on the wall. The rows are called \(\sigma\), and they are marked 5, 10, 15, and so on up to 100, starting from the bottom. The 21 columns are called \(\mu\), and they are labeled -100, -90, -80, and so on up to 100, starting from the left.

Student, a practiced dart player with the uncanny ability to throw a dart into a rectangular grid with a perfectly uniform distribution, says he starts each game by stepping into the back room and throwing a dart at the board. He thereby selects one of the squares uniformly at random, and with it the values of \(\mu\) and \(\sigma\).


(Figure: an example of the grid in Student's back room, where he's thrown a dart that hit \(\mu = 50\), \(\sigma = 75\).)

Student, unsurprisingly, is also an eccentric inventor. He explains that once the dart selects \(\mu\) and \(\sigma\), an elaborate hidden machine is activated in the ceiling. (You think you hear him say the word quincunx, but maybe it was only a sneeze.) The machine drops objects one at a time, and each lands along a line on the bar at a position that is Gaussian-distributed with mean \(\mu\) and standard deviation \(\sigma\). The line on the bar is carefully marked off as a real number line, in very fine increments, so customers can measure object positions precisely. Strangely, though you hadn't noticed before, now you realize that the bar seems to be infinitely long. Student says the positions of the \(n\) objects dropped in a game will be called \(x_i\), for \(i = 1..n\).

Student explains that his machine used to drop ping pong balls, but they rolled away. He found it works better with objects that are soft and flat, so they stay where they land, and can land on top of each other, so now the machine uses tea bags instead. Everyone calls it Student's tea distribution machine.

In summary (in case you didn't follow all the unnecessary nonsense around the problem): Student uniformly samples a mean \(\mu\) from 21 possible choices, and a standard deviation \(\sigma\) from 20 possible choices. From a normal distribution parameterized by \(\mu\) and \(\sigma\), he samples \(n\) numbers \(x_i\), \(i=1..n\). You will observe \(x_1..x_n\). Depending on the game, Student may or may not give you additional information about \(\mu\) and \(\sigma\).
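
To make the mechanics concrete, here is a minimal sketch of how one round could be simulated (the function play_round and its structure are illustrative assumptions of mine, not part of the provided student-game.py script):

import numpy as np

mu_values    = np.arange(-100, 101, 10)   # the 21 column labels: -100, -90, ..., 100
sigma_values = np.arange(5, 101, 5)       # the 20 row labels: 5, 10, ..., 100

def play_round(n):
    """Simulate one round: the dart picks (mu, sigma) uniformly; the machine drops n tea bags."""
    mu    = np.random.choice(mu_values)                      # the dart's column
    sigma = np.random.choice(sigma_values)                   # the dart's row
    X     = np.random.normal(loc=mu, scale=sigma, size=n)    # n tea bag positions along the bar
    return X, mu, sigma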

Student reminds you it's a total rookie mistake to confuse the observed data \(x_1..x_n\) with the unknown parameters of the machine \(\mu\) and \(\sigma\). He says to remember that the hidden grid has only 20x21 discretized choices of \(\sigma\) and \(\mu\), but the tea bags can land anywhere along the line (i.e. the observations are continuously distributed).

The game is to guess which column the hidden dart is in: i.e., what the unknown mean \(\mu\) is.

the betting

Customers are allowed to bet after each tea bag is dropped (each observation). There's a complicated table in the pub that keeps track of the observations and the betting. Betting odds are updated instantly at each round, based on the observations that have been seen so far. There are two different versions of the game:

- the beginner's game, in which Student tells you the true \(\sigma\) (the row the dart landed in), so you only have to infer the column \(\mu\);
- the advanced game, in which Student tells you nothing, and both \(\mu\) and \(\sigma\) are hidden.

One thing in particular catches your attention. The posted rules are explicit about how the pub estimates the fair odds of the game mathematically from the current observed samples \(x_1..x_n\), as follows: the pub calculates the sample mean \(\bar{x} = \frac{1}{n} \sum_i x_i\) (and, in the advanced game, the sample standard deviation \(s\)), then posts odds for each column proportional to a Gaussian density centered on \(\bar{x}\) with scale equal to the standard error of the mean (\(\sigma / \sqrt{n}\) in the beginner's game, or \(s / \sqrt{n}\) in the advanced game), normalized over the 21 columns. The hints section below shows this calculation in code.

You realize that something's not right about the way Student is calculating the odds, especially when the sample size \(n\) is small. If you can do a better job of inferring the hidden column position of the dart -- the unknown \(\mu\) -- you can use that information to your betting advantage. Now that you know a few things about Bayesian inference, you set about calculating the posterior distributions \(P(\mu \mid x_1..x_n, \sigma)\) and \(P(\mu \mid x_1..x_n)\) directly, without resorting to traditional summary statistics like the sample mean and sample standard deviation. Using these inferred distributions, you'll know the true odds, and be able to take advantage when the pub has miscalculated them.

You can download a script that implements Student's game. To use it, do ./student-game.py <n>, where <n> is the number of observed tea bag positions. For example:

   % ./student-game.py 3

generates 3 samples, along with all the other information about the game, including the hidden correct answers (so you can check how well your inference works as \(n\) varies). The script may also be useful for other things, like showing you how to plot semilog probability plots in matplotlib.
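
For instance, here's a minimal matplotlib sketch of a semilog probability plot over the 21 columns (the flat distribution is only a placeholder for whatever posterior you compute):

import numpy as np
import matplotlib.pyplot as plt

mu_values = np.arange(-100, 101, 10)
posterior = np.full(len(mu_values), 1.0 / len(mu_values))   # placeholder: your inferred P(mu | data)
plt.semilogy(mu_values, posterior, 'o-')                    # log-scaled y axis makes small tail probabilities visible
plt.xlabel('mu')
plt.ylabel('P(mu | data)')
plt.show()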

An example to use in your analysis below:

   X          = np.array([ 11.50, -2.32, 9.18]) # n=3 observations
   true_sigma = 60.                             # Student also tells you this in beginner's game
   true_mu    = -20.                            # the unknown column position, mu

This is an interesting example because all three samples just happened to come out to the right of the true \(\mu\), by chance, and fairly tightly grouped. (It's a real example that came up in testing this exercise.)

1. the beginner's game

Write a script that:

- takes the \(n\) observed tea bag positions \(x_1..x_n\), plus the true \(\sigma\), which Student tells you in the beginner's game;
- calculates the posterior probability \(P(\mu \mid x_1..x_n, \sigma)\) for each of the 21 possible columns;
- plots your posterior alongside the pub's estimate (probdist_beginner, in the hints below); a sketch of the core calculation appears at the end of this section.

You can use this script, which implements Student's game, to generate data (and \(\sigma\)) to try your analysis out, for varying numbers of samples, especially small \(n\) (3-6).

Have your script show the plots for the X = [ 11.50, -2.32, 9.18], true_sigma = 60. example.
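
Here's a minimal sketch of one way to do the core Bayesian calculation with known \(\sigma\) (the name posterior_beginner is mine, not the official solution; it works in log space to avoid numerical underflow):

import numpy as np
from scipy import stats

def posterior_beginner(X, sigma, mu_values):
    """
    Bayesian posterior P(mu | X, sigma) over the 21 discrete mu choices,
    using the uniform prior on mu that Student's dart guarantees.
    """
    # log likelihood of all n observations for each candidate mu: sum_i log P(x_i | mu, sigma)
    logL = np.array([ np.sum(stats.norm.logpdf(X, loc=mu, scale=sigma)) for mu in mu_values ])
    logL -= np.max(logL)          # subtract the max before exponentiating, to avoid underflow
    Pr    = np.exp(logL)          # proportional to the posterior; the uniform prior cancels
    return Pr / np.sum(Pr)        # normalize to a discrete probability distribution

Subtracting the maximum log likelihood before exponentiating is the standard trick for keeping the unnormalized probabilities in floating-point range.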

2. the advanced game

Now write a second script that:

- takes the \(n\) observed positions \(x_1..x_n\), with \(\sigma\) now unknown;
- calculates the joint posterior \(P(\mu, \sigma \mid x_1..x_n)\) over all 20x21 squares of the hidden grid;
- marginalizes over the unknown \(\sigma\) to get the posterior over the column position alone (a sketch appears at the end of this section):

\[ P(\mu \mid x_1..x_n) = \sum_{\sigma} P(\mu, \sigma \mid x_1..x_n) \]

Again you can use the Student's game script to generate data (and \(\sigma\)) to try your analysis out, for varying numbers of samples, especially small \(n\) (3-6).

Have your script show the plots for the X = [ 11.50, -2.32, 9.18] example.
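
Here's the corresponding sketch for the marginalization (again, posterior_advanced is my own name; because the priors are uniform over the discrete grid, a sum over \(\sigma\) replaces the integral):

import numpy as np
from scipy import stats

def posterior_advanced(X, mu_values, sigma_values):
    """
    Bayesian posterior P(mu | X), marginalized over the 20 discrete sigma choices,
    using uniform priors on both mu and sigma.
    """
    # log joint likelihood for each (mu, sigma) square of the hidden grid
    logL = np.array([ [ np.sum(stats.norm.logpdf(X, loc=mu, scale=sigma)) for sigma in sigma_values ]
                      for mu in mu_values ])     # shape (21, 20)
    logL -= np.max(logL)                         # numerical stability before exponentiating
    joint = np.exp(logL)                         # proportional to P(mu, sigma | X); uniform priors cancel
    Pr    = joint.sum(axis=1)                    # marginalize: sum over the sigma choices
    return Pr / Pr.sum()                         # normalize over the 21 mu columns

Compared to the pub's probdist_advanced estimate, this marginal posterior should show the fatter tails discussed below.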

3. where's the advantage?

Is the pub calculating its odds correctly? Where do you see an advantage?

turning in your work

Email a Jupyter notebook page (a .ipynb file) to mcb112.teachingfellow@gmail.com. Please name your file <LastName><FirstName>_MCB112_<psetnumber>.ipynb; for example, mine would be EddySean_MCB112_05.ipynb.

the point of this baroque exercise (there is one)

Student's t distribution arises when we're trying to estimate the mean of a normal distribution from observed data and the variance is unknown. Bayesian inference tells us that the posterior distribution \(P(\mu \mid x_1..x_n)\) is:

\[ \begin{eqnarray*} P(\mu \mid x_1..x_n) & = & \frac{ P(x_1..x_n \mid \mu) P(\mu)} { \int P(x_1..x_n \mid \mu) P(\mu) d\mu} \\ & = & \frac{ \int P(x_1..x_n \mid \mu, \sigma) P(\sigma \mid \mu) P(\mu) d\sigma } { \int \int P(x_1..x_n \mid \mu, \sigma) P(\sigma \mid \mu) P(\mu) d\sigma d\mu }\\ \end{eqnarray*} \]

where we have to integrate (marginalize) over the unknown and uncertain \(\sigma\) parameter.

The Student's tea distribution machine problem discretizes the choices for \(\mu\) and \(\sigma\), so you can simply sum instead of integrate, and it explicitly makes their prior distributions uniform so the posterior distribution over \(\mu\) is well defined. This suffices to demonstrate the general properties and shape of the real Student's t distribution, with its fat tails relative to the normal distribution.

The pub should be using the t-distribution to calculate its odds:

import numpy as np
from scipy import stats

def probdist_t(X, mu_values):
    """ 
    Given an ndarray X_1..X_n,
    and a list of the mu values in each column;
    return a list of the inferred P(mu | X) for each column,
    according to Student's t distribution with N-1 degrees of freedom.
    """
    N    = len(X)
    xbar = np.mean(X)
    s    = np.std(X, ddof=1)
    t    = [ (xbar - mu) / (s / np.sqrt(N)) for mu in mu_values ]    # t statistic, given sample mean, sample stddev, and N
#   t    = [ stats.ttest_1samp(X, mu)[0] for mu in mu_values ]       # ... (equivalently, python can calculate t statistic for you)
    Pr   = [ stats.t.pdf(val, N-1) for val in t ]
    Z    = sum(Pr)
    Pr   = [ p / Z for p in Pr ]    
    return Pr

You can add this curve to your plot for the advanced game, to see how well it matches your Bayesian calculation.
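
For instance, something like this overlay (a sketch assuming the X example above and the posterior_advanced function from the advanced game; only probdist_t comes from this section):

import numpy as np
import matplotlib.pyplot as plt

X            = np.array([ 11.50, -2.32, 9.18])
mu_values    = np.arange(-100, 101, 10)
sigma_values = np.arange(5, 101, 5)

plt.semilogy(mu_values, posterior_advanced(X, mu_values, sigma_values), 'o-', label='Bayesian posterior')
plt.semilogy(mu_values, probdist_t(X, mu_values), 's--', label='t-distribution odds')
plt.xlabel('mu')
plt.ylabel('P(mu | X)')
plt.legend()
plt.show()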

With a small number of samples, the uncertainty in your estimate of the unknown \(\sigma\) becomes important, and making a point estimate for it isn't a good idea. The correct thing to do is to marginalize over the uncertain, unknown hidden parameter \(\sigma\).

Although Student derived it analytically (which was quite a feat) and not as a Bayesian calculation, the exercise illustrates how Student's t distribution can be seen to arise as the marginal posterior probability for \(\mu\), given observed data, marginalized over unknown \(\sigma\) with a uniform prior.

hints

import math
import numpy as np
from scipy import stats

def probdist_beginner(X, sigma, mu_values):
    """ 
    Given an ndarray X_1..X_n, and a known sigma;
    and a list of the mu values in each column;
    return a list of the pub's inferred P(mu | X,sigma) for each column.
    """
    xbar = np.mean(X)
    N    = len(X)
    Pr   = [ stats.norm.pdf(mu, loc=xbar, scale= sigma / math.sqrt(N)) for mu in mu_values ]  # Gaussian centered on the sample mean; scale = std error of the mean
    Z    = sum(Pr)                   # normalization constant
    Pr   = [ p / Z for p in Pr ]     # normalization to a discrete probability distribution
    return Pr

def probdist_advanced(X, mu_values):
    """ 
    Given an ndarray X_1..X_n,
    and a list of the mu values in each column;
    return a list of the pub's inferred P(mu | X) for each column.
    """
    xbar = np.mean(X)
    s    = np.std(X, ddof=1)     # note that numpy.std() by default calculates a population std dev; to get sample std. dev., set ddof=1
    N    = len(X)
    Pr   = [ stats.norm.pdf(mu, loc=xbar, scale= s / math.sqrt(N)) for mu in mu_values ]  # Gaussian centered on the sample mean; scale = std error of the mean
    Z    = sum(Pr)                   # normalization constant
    Pr   = [ p / Z for p in Pr ]     # normalization to a discrete probability distribution
    return Pr