Why do we need Bayesian statistics? Part II — The lighthouse problem (tutorial)


Introduction

The lighthouse problem serves as a deceptively simple but captivating challenge in the field of statistics. As we shall explore, it subverts our conventional understandings of data and probability. Owing to its intriguing nature, this problem, introduced by Gull, is often featured in textbooks on Bayesian statistics, such as the one by Sivia. The problem is usually presented as

Imagine a faulty lighthouse situated at an unknown coastal location, denoted by $l_x$, and at a height $l_h$ above sea level. While the lighthouse emits flashes at random angles, you can only observe the points where the light strikes the coast, without directly seeing the lighthouse itself. Given a series of such observations, how can you determine the lighthouse's location?

First, let's clarify the geometry of the situation. We introduce the angle, denoted as $\theta$, between the lighthouse and the lightbeam, and the position where the lightbeam makes contact with the shore, referred to as $x$. All the relevant variables are depicted in the figure below.

[Figure: geometry of the lighthouse problem, showing the position $l_x$, the height $l_h$, the angle $\theta$, and the hit position $x$]

The goal is, then, to use observations of $x$ to infer (or learn) the values of $l_x$ and $l_h$. In simpler terms, can we determine the location of the lighthouse by observing where the lightbeam is seen on the shore? For the present blog post we will assume that the height $l_h$ is already known, because for now we want to avoid the difficulties involved in learning multiple parameters; I will show how to learn $l_x$ and $l_h$ together in a following blog post (stay tuned).

While this blog post can be read independently, it builds upon the Bayesian language introduced in my previous blog post.
To explain how we can infer $l_x$ from $x$, let's first delve into how synthetic data following this model would be generated. We will then use data visualization to gain insights regarding $l_x$. During this exploration, we will discover that the naive method, which involves calculating the mean and standard deviation, falls short. With this understanding, we will construct a Bayesian model and observe that it indeed leads to accurate inference on the lighthouse's position. This example highlights the necessity of data science approaches grounded in a rigorous understanding of probabilities.

Generating (synthetic) data

First, let us decode what the question encodes with "the lighthouse emits flashes at random angles": it means that $\theta$ is a uniformly distributed acute angle, that is, $p(\theta) = \frac{1}{\pi}$ for $-\frac{\pi}{2}\leq\theta\leq \frac{\pi}{2}$. An obtuse angle would mean that the lightbeam does not hit the coastline. Looking back at the figure above, trigonometry leads us to the lightbeam's hit position $x$ in terms of the angle as
$$ \tan \theta = \frac{x-l_x}{l_h} \Rightarrow x = l_x + l_h \tan \theta \ . \quad \quad \quad (1) $$

First, let us simulate the results of such an experiment or, in other words, generate synthetic data. Here, for the sake of the argument, let us assume that the lighthouse is located at the origin, $l_x = 0$, and has unit height, $l_h = 1$. Generating synthetic data consists of sampling the random $\theta$ and calculating $x$ through the formula above. The complete code, as well as this post in ipython notebook format, is available in my GitHub.

import numpy as np
from matplotlib import pyplot as plt
np.random.seed(42)

l_x, l_h = 0, 1  # setting ground truth position and height

th = np.pi*(np.random.rand(100000) - 1/2)  # sample theta uniformly in (-pi/2, pi/2)
x = l_x + l_h*np.tan(th)                   # hit positions on the shore, Eq. (1)

Before moving on with inference, let us try to visualize the data directly. One straightforward idea would be simply to create a histogram of the data. However, a first attempt using the usual tools yields an awkward result.

#Removed data block, see notebook on GitHub for full code
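For orientation, a minimal sketch of what such a first attempt might look like (my own sketch; the notebook contains the full plotting code):

plt.hist(x, bins=100)  # naive histogram of all data points; rare extreme values stretch the axis
plt.xlabel('x')
plt.ylabel('counts')
plt.show()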

[Figure: histogram of the full dataset]

The reason for this rather uninformative result is that the dataset includes some rare but extremely large values. Let us observe this by visualizing the data in two different formats: first, let us recreate the histogram but only for values within some region (chosen arbitrarily as [-50, 50]), which will produce a more reasonable histogram; second, let us create a scatter plot of the data so that we can directly observe the rare deviant datapoints.

#Removed data block, see notebook on GitHub for full code
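A minimal sketch of both plots (again my own sketch, not the notebook's exact code):

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# histogram restricted to the arbitrarily chosen region [-50, 50]
ax1.hist(x[np.abs(x) < 50], bins=100, density=True)
ax1.set_xlabel('x')
ax1.set_ylabel('density')

# scatter plot of the raw data, where the rare extreme values stand out
ax2.scatter(np.arange(len(x)), x, s=1)
ax2.set_xlabel('data point index')
ax2.set_ylabel('x')

plt.show()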

[Figure: histogram of the data restricted to the region [-50, 50]]

[Figure: scatter plot of the data points, showing the rare extreme values]

As these plots were generated with synthetic data, it's important to emphasize that the rare values that deviate significantly from the mean should not be considered 'outliers' in the traditional sense, nor should they be attributed to modeling inaccuracy or experimental error. Instead, these extreme values are a direct outcome of the inherent characteristics of the problem being studied. If we go back to Eq. (1), we see that $x$ gets arbitrarily large as $\theta$ approaches $\pi/2$.
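To get a feel for how heavy these tails are, one can look at a few quantiles of $|x|$ (a quick check of my own, not part of the original post); the top quantiles are typically orders of magnitude larger than the median:

for q in [50, 90, 99, 99.9, 100]:
    print(f"{q}th percentile of |x|: {np.percentile(np.abs(x), q):.1f}")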

Now, considering the difficulties with data visualization alone, one may suspect that using the data visualized above to infer the lighthouse's position will be a challenging task. To show this, we will use the opposite strategy from my previous blog post: here we begin with the naive "mean and standard deviation" route, which will produce unsound results. Later, I will show how an approach based on correctly calculating the likelihood and, by extension, the Bayesian posterior leads to appropriate results.

Mean and standard deviation

Looking at the lighthouse figure, one may think that the mean value of $x$ should correspond to the position $l_x$ of the lighthouse, provided enough data points are drawn. Someone with more mathematical familiarity might note from Eq. (1) that the probability of $\theta$ is symmetric around $0$ (the probability of $\theta$ equals the probability of $-\theta$), and therefore that $x$ should be symmetric around $l_x$: the probability of $l_x + x'$ equals the probability of $l_x - x'$ for any $x'$.

Now confident that the mean converges to $l_x$, one could further argue that the actual value of $l_x$ should fall within the interval $[\mu - \sigma, \mu + \sigma]$, where $\mu$ represents the mean (or average) of $x$, and $\sigma$ denotes the standard deviation of $x$ divided by the square root of the number of data points (see the discussion in the previous post).
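As a concrete sketch of this naive estimate (my illustration, not the notebook's code), the interval for the first N data points would be computed as:

def naive_interval(x, N):
    # mean of the first N points with a one-standard-error bar
    sample = x[:N]
    mu = sample.mean()
    sem = sample.std()/np.sqrt(N)  # standard deviation divided by sqrt(N)
    return mu - sem, mu + sem

print(naive_interval(x, 100))
print(naive_interval(x, 10000))  # as we will see, this interval does not behave as expected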

One can observe, however, what happens when we try to visualize this interval (the complete code, as well as this post in ipython notebook format, is available in my GitHub):

#Removed data block, see notebook on GitHub for full code

[Figure: mean and standard deviation interval as a function of the number of data points]

It becomes evident that, as the number of data points increases, the mean estimate of $l_x$ does not converge towards the ground truth value. Furthermore, the "confidence interval" determined by the mean and standard deviation does not narrow as we increase the number of datapoints. This observation suggests that the mean and standard deviation alone are insufficient to infer the lighthouse's position.

Bayesian

Having established, in practical terms, that the somewhat naive approach of mean and standard deviation fails to find the position of the lighthouse, let us now study the problem within the Bayesian paradigm.

Cauchy distribution

Following the Bayesian language defined in the previous post, we ought to obtain the likelihood of a data point $x$. In the paragraph where Eq. (1) is introduced, we discussed the uniform probability distribution for the angle $\theta$ and obtained $x$ in terms of $\theta$, $l_x$, and $l_h$. We can transform the probabilities for $\theta$ into probabilities for $x$ as
$$ p(x) = p(\theta) \left|\frac{\mathrm{d} \theta}{\mathrm{d} x}\right| \ . \quad \quad \quad (2) $$
Read this for more detail on how the equation above is derived.

Now, after differentiating Eq. (1) and performing some algebra, you obtain the conditional probability density for $x$ given $l_x$ and $l_h$ as
$$ p(x|l_x,l_h) = \frac{1}{\pi} \frac{l_h}{ (x-l_x)^2 + l_h^2 } \ . \quad \quad \quad (3) $$
This expression represents the probability density of observing a particular value $x$ when you have knowledge of the parameters $l_x$ and $l_h$. Later we will use Bayes' theorem to obtain the probability of $l_x$ given the observations (and in a future blog post we will obtain the probabilities of both $l_x$ and $l_h$).
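For readers who want to fill in the algebra, one way to get from Eq. (2) to Eq. (3) is to differentiate Eq. (1),
$$ \frac{\mathrm{d} x}{\mathrm{d} \theta} = l_h \sec^2 \theta = l_h \left( 1 + \tan^2 \theta \right) = l_h \left( 1 + \frac{(x-l_x)^2}{l_h^2} \right) = \frac{(x-l_x)^2 + l_h^2}{l_h} \ , $$
so that $\left|\frac{\mathrm{d} \theta}{\mathrm{d} x}\right| = \frac{l_h}{(x-l_x)^2 + l_h^2}$, and multiplying by $p(\theta) = \frac{1}{\pi}$ gives Eq. (3).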

The probability distribution in Eq. (3) is also known as the Cauchy distribution.
We observe that the probability of $x$ has a maximum at $x=l_x$, and it is symmetric around $l_x$. This observation adds an intriguing layer of complexity, as one might expect the mean of $x$ to converge to $l_x$, which, surprisingly, it does not.
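As a quick consistency check (my addition; it uses SciPy, which the original post does not rely on), one can overlay the density of Eq. (3), available as scipy.stats.cauchy, on the truncated histogram:

from scipy import stats

grid = np.linspace(-50, 50, 1000)
plt.hist(x[np.abs(x) < 50], bins=100, density=True)         # truncation makes the match only approximate
plt.plot(grid, stats.cauchy.pdf(grid, loc=l_x, scale=l_h))  # Cauchy density with loc=l_x and scale=l_h
plt.xlabel('x')
plt.show()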

In what follows, we proceed to use Bayesian inference to address the lighthouse problem. However, it is worth noting that the perplexing behavior of the mean and standard deviation intervals can be attributed to certain unique properties of the Cauchy distribution. I will elaborate on these properties in a future blog post.

Likelihood

We obtain data in the form of a set of $N$ observations of the light, $\{x_n\} = \{x_1, x_2, x_3, \ldots, x_N\}$, that are independent and identically distributed, meaning each data point is generated without correlation to the previous ones and all are drawn from the same lighthouse position. As such, the likelihood for this data is given by multiplying Eq. (3) over every data point $x_n$, thus yielding
$$ p(\{x_n\}|l_x) = \prod_{n=1}^N \left[ \frac{1}{\pi} \frac{l_h}{ (x_n-l_x)^2 + l_h^2 } \right] \ . \quad \quad \quad (4) $$
Above, and for the remainder of this post, the explicit dependence on $l_h$ is omitted, as we assume its value is already known. In a future blog post we will study how the problem changes when one tries to infer both $l_x$ and $l_h$.

Since each factor within the product in Eq. (4) is small, and multiplying a large batch of small numbers can underflow to zero on a computer, it is a useful technique in inference to calculate the logarithm of the likelihood instead:
$$ \ln p(\{x_n\}|l_x) = \sum_{n=1}^N \ln \left[ \frac{1}{\pi} \frac{l_h}{ (x_n-l_x)^2 + l_h^2 } \right] = \sum_{n=1}^N \left[ \ln (l_h) - \ln(\pi) - \ln \left( (x_n-l_x)^2 + l_h^2 \right) \right] \ . \quad \quad \quad (5) $$
This is calculated through the following block of code:

def log_like(x, lx, lh):
    X, LX = np.meshgrid(x, lx)  # X[i,j] = x[j], LX[i,j] = lx[i]
    den = (X - LX)**2 + lh**2   # denominator of Eq. (3) for every (lx, x) pair
    return np.sum(np.log(lh) - np.log(np.pi) - np.log(den), axis=1)  # Eq. (5), summed over the data points

The use of np.meshgrid and np.sum above avoids the overuse of for loops or, in other words, vectorizes the code.
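As a quick usage sketch (not from the original notebook), one can evaluate the log-likelihood of the first 100 observations at a few candidate positions:

candidates = np.array([-2.0, 0.0, 2.0])
print(log_like(x[:100], candidates, l_h))  # one value of Eq. (5) per candidate l_x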

Prior

Following the Bayesian framework outlined in the previous post, we aim to incorporate prior information into our inference process. In this context, we want to choose a prior that is as uniform as possible to avoid introducing bias into our Bayesian inference. A uniform prior essentially means that we have as little prior information or preference as possible, allowing the data to have the most significant influence on our final inference. This is often referred to as a non-informative or uninformative prior, as it doesn't favor any particular parameter value and lets the data speak for itself. Here we will assume a prior of the form
$$ p(l_x) = \lim_{L\rightarrow \infty} \begin{cases} \frac{1}{L} & \text{for} \ -\frac{L}{2}<l_x<\frac{L}{2} \\ 0 & \text{otherwise} \end{cases} \ . \quad \quad \quad (6) $$
Or, in other words, the prior is uniform in the interval $(-\frac{L}{2},\frac{L}{2})$, and we take the limit $L \rightarrow \infty$.

Posterior

From the previous discussion, we obtain the posterior using Bayes' theorem:
$$ p(l_x|\{x_n\}) = \frac{ p(\{x_n\}|l_x)}{p(\{x_n\})} \, p(l_x) \ , \quad \quad \quad (7) $$
with the evidence, $p(\{x_n\})$, being a proportionality constant that does not depend on $l_x$ and will be referred to as $Z$. By substituting Eqs. (5) and (6) into Eq. (7) and taking the logarithm, we obtain
$$ \ln p(l_x|\{x_n\}) = \ln p(\{x_n\}|l_x) - \ln(LZ) \ , \quad \quad \quad (8) $$
thus leading to a posterior of the form
$$ p(l_x|\{x_n\}) = \frac{1}{Z'} \ \exp\left[ \ln p(\{x_n\}|l_x) \right] \ , \quad \quad \quad (9) $$
where $Z' = ZL$, which does not depend on $l_x$ and is numerically calculated by ensuring that the posterior is a probability distribution, i.e., $\int \mathrm{d} l_x \ p(l_x|\{x_n\}) = 1$.
The following block of code gives a function to calculate the posterior with the appropriate normalization.

def posterior_x(data, precision=.001):
    lx = np.linspace(-10, 10, int(1/precision)+1)  # grid of candidate positions
    dlx = lx[1]-lx[0]

    #unnormalized posterior, Eq. (9)
    l = log_like(data, lx, l_h)
    l -= l.max()           # subtract the maximum for numerical stability before exponentiating
    posterior = np.exp(l)

    #Calculating Z' numerically
    Zprime = np.sum(posterior)*dlx

    #normalized posterior
    posterior = posterior/Zprime
    return lx, posterior
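A short sanity check of posterior_x (my sketch, not part of the original post): the returned posterior should integrate to one over its grid, and its maximum should sit near the ground truth $l_x = 0$:

lx_grid, post = posterior_x(x[:100])
dlx = lx_grid[1] - lx_grid[0]
print(np.sum(post)*dlx)          # numerical normalization check, should be 1
print(lx_grid[np.argmax(post)])  # location of the posterior maximum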

The posterior in Eq. (9) provides us with all the necessary tools for conducting Bayesian inference for the lighthouse problem. As an initial step to visualize the posterior, let's juxtapose a graph of the posterior with a density histogram, both obtained using only the first 100 data points $x_n$ we generated.

#Removed data block, see notebook on GitHub for full code
The data mean is -0.6674364059676746, while the posterior maximum is -0.09999999999999964

[Figure: posterior for $l_x$ overlaid on a density histogram of the first 100 data points]

The figure above reveals two intriguing observations. Firstly, the posterior for $l_x$ exhibits a qualitatively distinct behavior compared to the histogram of $x$, displaying a significantly narrower distribution. Secondly, and perhaps more importantly, the posterior's maximum ($\arg\max_{l_x} p(l_x|\{x_n\}) \approx -0.1$) is remarkably closer to the ground truth value ($0$) than the data mean ($\mu \approx -0.67$). This serves as a compelling example of how a rigorous study of probabilities translates into more precise inferences regarding the lighthouse's position, $l_x$.

Credible intervals

Following the procedure outlined in the previous blog post, we move on to calculating the Bayesian credible interval from the posterior. As in the previous post, we ought to find the smallest symmetric interval around the posterior maximum (or mode) such that the probability that $l_x$ is within that interval is larger than $95\%$. The code for finding it is outlined below.

def credible_interval(data, conf=.95, precision=.001):
    q, post = posterior_x(data, precision)
    dq = q[1]-q[0]
    max_ind = post.argmax()  # index of the posterior maximum (mode)
    dif = 1
    while np.sum(post[max_ind-dif:max_ind+dif+1])*dq < conf:  # widen the interval until it contains the desired probability
        dif += 1
    if dif == 1:  # interval resolved by a single grid step: refine the grid and try again
        return credible_interval(data, conf=conf, precision=precision/4)
    return q[max_ind-dif], q[max_ind+dif]
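For instance, the 95% credible interval from the first 100 observations would be obtained as (usage sketch):

low, high = credible_interval(x[:100])
print(f"95% credible interval for l_x: [{low:.2f}, {high:.2f}]")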

With this, we can graph the credible interval as a function of the number of datapoints, similarly to our previous plots of the mean and standard deviation intervals. For comparison, we plot the mean and standard deviation intervals once again. Our goal here is to see how the credible interval changes with the size of the dataset. We will obtain the intervals generated by both methods from the same dataset of $N^\ast$ datapoints (the first $N^\ast$ values of $x_n$ generated), with $N^\ast$ varying from $10$ to $10^4$.

While the block of code below is long, it is no more than the code needed to make a graph in Python; the reader can go straight to the final result.

#Removed data block, see notebook on GitHub for full code

[Figure: mean and standard deviation interval (top) and 95% credible interval (bottom) as functions of the number of data points]

Here, we observe that the Bayesian credible interval is significantly more accurate than the one obtained naively through the mean and standard deviation. This holds true even for a small number of data points (note that the range in the bottom figure is significantly narrower than in the top one; the grey dashed lines correspond to the same values in both graphs). The accuracy of the Bayesian approach becomes even more impressive when we notice that, while the mean and standard deviation interval should correspond to a 68.3% credible interval, the second graph presents the 95% credible intervals. This reinforces the importance of embracing a more comprehensive understanding of probability, as straightforward (yet somewhat naive) statistical methods can easily lead to misleading results.

Conclusion

Through the simple example of the lighthouse problem, this blog post has provided insights into the complexities of statistical analysis. It has underscored the limitations of relying solely on the mean and standard deviation, particularly when confronted with distributions, such as the Cauchy distribution, for which the usual central limit theorem reasoning breaks down. While we have briefly discussed some of the underlying reasons for this unconventional behavior, a more comprehensive examination of the Cauchy distribution will be the topic of a forthcoming blog post.

Furthermore, we have introduced the prospect of inferring both the lighthouse's position, $l_x$, and its height, $l_h$, from the lightbeam observations $x$. The dataset remains one-dimensional (each $x$ is a single number), but the posterior becomes a two-dimensional distribution, expressed as $p(l_x, l_h | x)$. A comprehensive tutorial for this more complex problem will be the topic of another future blog post.

I invite readers to stay tuned for the forthcoming posts, where we will delve further into these nuanced aspects of probability and statistical analysis.

This post is available as an iPython notebook in my GitHub repository.
