Why do we need Bayesian statistics? Part III – Learning multivariate distributions (tutorial)


Introduction

In the previous entry of this tutorial series, we studied the lighthouse problem. The goal was to infer (or learn) the position of the lighthouse from observations of where the lighthouse’s light beam hits the floor. A schematic of the problem is presented below.

[Schematic of the lighthouse problem]

Following what was established in the previous post, let us generate synthetic data for the lighthouse problem. We sample $\theta$ uniformly in $\left( -\frac{\pi}{2}, \frac{\pi}{2} \right)$ and obtain $x$ as

$$ \tan \theta = \frac{x-l_x}{l_h} \Rightarrow x = l_x + l_h \tan \theta \ . \quad \quad \quad (1) $$

import numpy as np
from matplotlib import pyplot as plt
np.random.seed(42)

l_x, l_h = 0, 1 #setting ground truth

th = np.pi*(np.random.rand(1024)-1/2) #sample theta uniformly in (-pi/2, pi/2)
x = l_x + l_h*np.tan(th) #observed positions, eq. (1)

In the previous post we outlined how to infer $l_x$, the lighthouse's horizontal position, assuming the height $l_h$ is already known. However, the probability framework presented in that post also allows one to infer both $l_x$ and $l_h$, despite the fact that the observation $x$ is univariate.

The likelihood

This is possible by using the conditional probability of a single point $x$ given $l_x$ and $l_h$, as obtained in the previous blog post, which is given by
$$ p(x|l_x,l_h) = \frac{1}{\pi} \frac{l_h}{(x-l_x)^2 + l_h^2} \ . \quad \quad \quad (2) $$

For a set of multiple data points $\{x_n\} = \{x_1, x_2, x_3, \ldots, x_N\}$ obtained independently, the conditional probability, or likelihood, of the whole dataset is
$$ p(\{x_n\}|l_x,l_h) = \prod_{n=1}^N \left[ \frac{1}{\pi} \frac{l_h}{(x_n-l_x)^2 + l_h^2} \right] \ , \quad \quad \quad (3) $$
which, as in the previous blog post, is expressed in terms of its logarithm for numerical stability, yielding
$$ \ln p(\{x_n\}|l_x,l_h) = \sum_{n=1}^N \ln \left[ \frac{1}{\pi} \frac{l_h}{(x_n-l_x)^2 + l_h^2} \right] \ , \quad \quad \quad (4) $$
which can be calculated using the following piece of code:

def log_like_2d(x,lx,lh):
    LX,LH = np.meshgrid(lx,lh) #grid of all (l_x,l_h) combinations
    log_like = np.zeros_like(LX)
    for xi in x: #accumulate the log-likelihood (4) over the data points
        log_like += np.log(LH)-np.log(np.pi)-np.log((xi-LX)**2+LH**2)
    return log_like

Here we assume that $l_x$ and $l_h$ are given in the form of (NumPy) arrays representing all possible values. The reason for this choice of representation will become clearer when we look into the prior.

Prior

In order to infer $l_x$ and $l_h$ we ought to invert the conditional probability using Bayes' theorem:
$$ p(l_x,l_h|\{x_n\}) = \frac{p(\{x_n\}|l_x,l_h)}{p(\{x_n\})} p(l_x,l_h) \ , \quad \quad \quad (5) $$

with the prior, $p(l_x,l_h)$, chosen in a manner similar to the previous blog post:
$$ p(l_x,l_h) = \begin{cases} \frac{1}{LH} & \text{for} \ -\frac{L}{2}<l_x<\frac{L}{2} \ \text{and} \ 0< l_h< H \\ 0 & \text{otherwise} \end{cases} \ . \quad \quad \quad (6) $$
In other words, the prior is uniform in $l_x \in \left(-\frac{L}{2},\frac{L}{2}\right)$ and $l_h \in (0,H)$. For the purposes of later visualization we choose $L = H = 2$, but the results hold for much larger values.

L,H = 2.,2.

First, let us generate the arrays for $l_x$ and $l_h$:

lh_arr = np.linspace(0,H,101) + 1e-12 #small offset avoids evaluating the log at lh = 0
lx_arr = np.linspace(-L/2,L/2,101)

This guarantees a good representation of the intervals $l_x \in \left(-\frac{L}{2},\frac{L}{2}\right)$ and $l_h \in (0,H)$.
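For concreteness, here is a minimal sketch of the uniform prior (6) evaluated on this grid (the variable names LX_g, LH_g and prior are mine, not part of the original code). Note that the posterior code below never needs the prior explicitly: a flat prior only rescales the likelihood inside the box and is absorbed into the normalization.

LX_g, LH_g = np.meshgrid(lx_arr, lh_arr) #grid of all (l_x,l_h) combinations
prior = np.where((np.abs(LX_g) < L/2) & (LH_g > 0) & (LH_g < H), 1/(L*H), 0.0) #eq. (6)

dlx, dlh = lx_arr[1]-lx_arr[0], lh_arr[1]-lh_arr[0]
print(prior.sum()*dlx*dlh) #approximately 1 (up to grid discretization at the box edges)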

Posterior

Going back to Bayes' theorem (5) and substituting the prior (6) and the likelihood (4), one obtains
$$ p(l_x,l_h|\{x_n\}) = \frac{1}{Z'} \ \exp\left[ \sum_{n=1}^N \ln \left[ \frac{1}{\pi} \frac{l_h}{(x_n-l_x)^2 + l_h^2} \right] \right] \ , \quad \quad \quad (7) $$
where $Z' = LH \, p(\{x_n\})$. Note that $p(\{x_n\})$ in (5) does not depend on $l_x$ or $l_h$; as such, $Z'$ is calculated by ensuring that $\int \mathrm{d} l_h \ \mathrm{d} l_x \ p(l_x,l_h|\{x_n\}) = 1$. Equation (7) holds inside the support of the prior (6); outside it the posterior is zero.
The following block of code defines a function to calculate the posterior (7):

def posterior(x,lx,lh):
    L = log_like_2d(x,lx,lh)
    L -= L.max() #subtract the maximum to avoid numerical underflow in the exponential

    P = np.exp(L) #unnormalized posterior

    dlh,dlx = (lh[1]-lh[0]),(lx[1]-lx[0])
    Z = P.sum()*(dlh*dlx) #normalization factor

    P = P/Z
    return P
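As a quick sanity check, here is a minimal usage sketch (the names i_h, i_x, lx_map and lh_map are mine): evaluate the posterior on the grid for the full synthetic dataset and read off the maximum a posteriori (MAP) estimate.

P = posterior(x, lx_arr, lh_arr) #shape (len(lh_arr), len(lx_arr)): rows index lh, columns index lx
i_h, i_x = np.unravel_index(P.argmax(), P.shape) #grid indices of the posterior maximum
lx_map, lh_map = lx_arr[i_x], lh_arr[i_h]
print(lx_map, lh_map) #should land close to the ground truth (0, 1)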

Inference

Now that we have the computational tools to calculate the posterior, let us visualize how it evolves with the number of data points.

Since we are dealing with a two-dimensional posterior, we present its evolution with the number of data points through an animation (see the complete code in my GitHub repository).

#Removed data block, see notebook on GitHub for full code
[Animation: joint posterior $p(l_x,l_h|\{x_n\})$ as the number of data points increases]

Marginal posterior distribution

While in the previous blog post we found the distribution for the lighthouse's horizontal position, $l_x$, assuming a known height, $l_h$, here we obtain the distribution for both $l_x$ and $l_h$, denoted $p(l_x,l_h|\{x_n\})$. In order to obtain the posterior for only $l_x$ or only $l_h$, we ought to sum (or integrate) over all possible values of the other parameter, that is:
$$ p(l_x|\{x_n\}) = \int \mathrm{d}l_h \ p(l_x,l_h|\{x_n\}) \ , \quad \quad \quad (8a) $$
$$ p(l_h|\{x_n\}) = \int \mathrm{d}l_x \ p(l_x,l_h|\{x_n\}) \ . \quad \quad \quad (8b) $$

In probability theory, obtaining the distribution of one parameter by integrating over the other parameters, as above, is often referred to as marginalization. Moreover, in Bayesian language, $p(l_x|\{x_n\})$ and $p(l_h|\{x_n\})$ are the marginalized posteriors, while $p(l_x,l_h|\{x_n\})$ is the joint posterior.
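Here is a minimal sketch of how (8a) and (8b) can be computed numerically on the grid, reusing x, lx_arr and lh_arr from above (the names p_lx and p_lh are mine):

P = posterior(x, lx_arr, lh_arr) #joint posterior: rows index lh, columns index lx
dlx, dlh = lx_arr[1]-lx_arr[0], lh_arr[1]-lh_arr[0]

p_lx = P.sum(axis=0)*dlh #marginal posterior p(lx|{xn}), eq. (8a)
p_lh = P.sum(axis=1)*dlx #marginal posterior p(lh|{xn}), eq. (8b)

print(p_lx.sum()*dlx, p_lh.sum()*dlh) #both approximately 1: the marginals are normalized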

It is important to notice that the $p(l_x|\{x_n\})$ obtained above is conceptually different from what was done in the previous blog post. While there we considered $l_h$ to be already known, here we account for all possible values of $l_h$, weighted by the posterior, which was itself learned from the data.
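To make this distinction concrete, here is a small sketch (reusing P, p_lx, dlx and lh_arr from the snippet above; the names i_h1 and p_lx_known_h are mine) comparing the marginal posterior with the posterior obtained by conditioning on the known height $l_h = 1$, as in the previous post:

i_h1 = np.argmin(np.abs(lh_arr - 1.0)) #grid row closest to the known height lh = 1
p_lx_known_h = P[i_h1,:]/(P[i_h1,:].sum()*dlx) #renormalized slice: p(lx | lh = 1, {xn})

plt.plot(lx_arr, p_lx, label='marginal posterior')
plt.plot(lx_arr, p_lx_known_h, '--', label='posterior with $l_h=1$ known')
plt.xlabel(r'$l_x$')
plt.legend()
plt.show()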

In order to give a good visualization of how the marginal posteriors for both $l_x$ and $l_h$ evolve with the total number of data points, we show an animation analogous to the previous one, but also presenting the marginal posteriors parallel to their respective axes.

#Removed data block, see notebook on GitHub for full code
[Animation: joint posterior with the marginal posteriors $p(l_x|\{x_n\})$ and $p(l_h|\{x_n\})$ shown along their respective axes]

Here we can see clearly that, as the number of data points increases, the posterior (both the joint and each of the marginals) tightens around the ground-truth values.
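As a rough numerical illustration of this tightening (a sketch; the variable names are mine), one can track the standard deviation of the marginal posterior of $l_x$ as the number of data points grows:

dlx, dlh = lx_arr[1]-lx_arr[0], lh_arr[1]-lh_arr[0]
for N in (16, 64, 256, 1024):
    P_N = posterior(x[:N], lx_arr, lh_arr)
    p_lx_N = P_N.sum(axis=0)*dlh #marginal posterior of lx using the first N points
    mean_N = (lx_arr*p_lx_N).sum()*dlx
    std_N = np.sqrt(((lx_arr - mean_N)**2*p_lx_N).sum()*dlx)
    print(N, round(std_N, 3)) #the posterior spread decreases as N grows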

Conclusion

Here we finish the overarching tutorial on Bayesian statistics. While future posts on the topic will come, we presented a tutorial where:

  1. Bayesian statistics concurs with more naive methods of inference,
  2. the system is considerably more complex and these naive methods fail, and
  3. in the present post, we went further into the previous problems to show how to learn multiple parameters even when the available data is single-valued.

Although this is the last entry of this tutorial series, more posts related to the intricacies of probability and statistics will appear in this blog. I hope this was instructive. Stay tuned.

This post is available as an iPython notebook in my GitHub repository.
