# init repo notebook
!git clone https://github.com/rramosp/ppdl.git > /dev/null 2> /dev/null
!mv -n ppdl/content/init.py ppdl/content/local . 2> /dev/null
!pip install -r ppdl/content/requirements.txt > /dev/null

Model-based reasoning and Bayesian inference#

In model-based reasoning we assume a model of the process that generated our data. In Bayesian inference this model has a probabilistic nature, i.e., it encodes our knowledge (or assumptions) about the world in terms of random variables, the probability distributions of these random variables, and the dependence/independence relationships among them.

The following is a summary of the Bayesian approach to machine learning (https://www.cs.toronto.edu/~radford/ftp/bayes-tut.pdf):

  1. Model formulation: We formulate our knowledge about the world (or the mechanism that generated our data) probabilistically:

  • We define a model that represents qualitative aspects of our knowledge. This model is expressed in terms of random variables and their relationships. The model has unknown parameters.

  • We specify our beliefs (before seeing any data) about the values of the unknown parameters using a prior probability distribution.

  2. Data collection: We obtain data.

  3. Posterior calculation: We use the data to compute the posterior probability distribution of the unknown parameters, given the observed data and the priors.

  4. Model application: The posterior probability distribution can be used to:

  • Derive scientific conclusions, taking into account uncertainty.

  • Make predictions.

  • Make decisions.

To illustrate this process, we will use a coin flipping experiment as an example.

First, let us import the required libraries and modules.

from scipy.special import comb
import numpy as np
import matplotlib.pyplot as plt
plt.style.use("ggplot")

Coin flipping example#

In this example, we want to model the toss of a coin using a Bayesian approach.

1. Model formulation#

Our model will have two random variables:

\(X\): random variable representing the outcome of a coin toss

\(\Theta\): probability of getting a head (\(X=1\))

The distribution of \(X\) is Bernoulli:

\[ P(X = x | \theta) = \theta^{x}(1 - \theta)^{1 - x} \]

An experiment consists of performing several coin tosses. The resulting data, \(\mathcal{D} = \{x_1, x_2, \dots, x_N\}\), from the experiment can be summarized by two values: \(N_{heads}\) (number of heads) and \(N_{tails}\) (number of tails). The likelihood of a particular experiment outcome is given by the binomial distribution:

\[ P(\mathcal{D} | \theta) = \binom{N_{heads} + N_{tails}}{N_{heads}} \theta^{N_{heads}} (1 - \theta)^{N_{tails}} \]

This term is called the likelihood, the conditional probability of the data \(\mathcal{D}\) given the model’s parameters \(\theta\). The following function implements the likelihood:

def likelihood(num_heads, num_tails, theta):
    return (
        comb(num_heads + num_tails, num_heads) *
        theta ** num_heads *
        (1 - theta) ** num_tails
        )

Note: the comb function is an implementation of the binomial coefficient:

\[ \binom{n}{k} = \frac{n!}{k! (n - k)!} \]

import math

n = 10
n_heads = 6
# binomial coefficient computed directly from factorials
print(math.factorial(n) / (math.factorial(n_heads) * math.factorial(n - n_heads)))
# the same value using scipy's comb function
comb(n, n_heads)

For instance, if \(\theta = 0.3\), the probability of getting 6 heads and 4 tails in 10 coin tosses would be:

likelihood(6, 4, 0.3)

But if \(\theta=0.8\) the likelihood would be:

likelihood(6, 4, 0.8)
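
Since `likelihood` operates elementwise on numpy arrays, we can also sweep a grid of \(\theta\) values to see where the likelihood of 6 heads and 4 tails peaks; a minimal sketch (the grid of 101 points is an arbitrary choice):

theta_grid = np.linspace(0, 1, 101)
fig, ax = plt.subplots(figsize=(10, 5))
# likelihood of 6 heads and 4 tails as a function of theta
ax.plot(theta_grid, likelihood(6, 4, theta_grid))
ax.set_xlabel(r"$\theta$")
ax.set_ylabel(r"$P(\mathcal{D}|\theta)$")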

As part of our modeling we also have to represent our prior knowledge about the possible values of \(\theta\). This is encoded by a distribution over \(\theta\) values, called the prior distribution:

\[ P(\theta) \]

If we don’t have any prior knowledge, this can be encoded as a uniform prior distribution. Since we are dealing with a coin, however, we can assume that coins closer to being fair are more likely, meaning we assign higher probabilities to values of \(\theta\) closer to \(0.5\). To this end, we’ll use a discrete triangular distribution:

def triangular_prior(n_points):
    # grid of candidate theta values in [0, 1]
    theta_vals = np.linspace(0, 1, n_points)
    # triangular weights peaking at the middle of the grid
    # (assumes an odd n_points so the weights match the grid size)
    theta_prior_p = np.concatenate([
        np.arange(0, n_points // 2),
        np.arange(n_points // 2, -1, -1)]
        )
    # normalize the weights so they sum to one
    theta_prior_p = theta_prior_p / np.sum(theta_prior_p)
    # map each theta value to its prior probability
    theta_prior = {
            theta_vals[i]: theta_prior_p[i]
            for i in range(n_points)
            }
    return theta_prior
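
The weights are normalized inside the function, so the probabilities should sum to one; a quick check:

print(sum(triangular_prior(11).values()))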

Let’s visualize this distribution:

n_points = 11
prior = triangular_prior(n_points)
fig, ax = plt.subplots(figsize=(10, 7))
ax.bar(
        list(map(lambda x: f"{x:.2f}", prior.keys())),
        prior.values(), width=0.05
        )
ax.set_xlabel(r"$\theta$")
ax.set_ylabel(r"$P(\theta)$")

2. Data collection#

To collect data, we can toss a real coin a number of times and record how many heads (\(N_{heads}\)) and tails (\(N_{tails}\)) we get. This is our data \(\mathcal{D}\). For this exercise let’s assume we got:

n_heads = 15
n_tails = 5
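
Alternatively, if we do not have a physical coin at hand, we could simulate the tosses. The following is a minimal sketch, assuming a hypothetical true value of \(\theta\) (0.75 here), which would of course be unknown in a real experiment:

# hypothetical true parameter; unknown in a real experiment
true_theta = 0.75
tosses = np.random.binomial(n=1, p=true_theta, size=20)
print("simulated heads:", tosses.sum())
print("simulated tails:", len(tosses) - tosses.sum())

For the rest of the notebook we stick to the values of \(N_{heads}\) and \(N_{tails}\) assumed above.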

3. Posterior calculation#

To calculate the posterior we will use Bayes' theorem:

\[ P(\theta|\mathcal{D}) = \frac{P(\mathcal{D}|\theta)P(\theta)}{P(\mathcal{D})} \]

We already have the prior distribution and the likelihood function. We still need to calculate the probability of the evidence (our data \(\mathcal{D}\)). For this we will use the law of total probability, taking advantage of the fact that the prior distribution is discrete and only takes values greater than zero for a finite set of \(\theta\) values \(\{\theta_0\dots\theta_{n-1}\}\):

\[\begin{split} \begin{align} P(\mathcal{D}) &= \sum_{\theta_i}{P(\mathcal{D},\theta=\theta_i)} \\ &= \sum_{\theta_i}{P(\mathcal{D}|\theta=\theta_i)P(\theta = \theta_i)} \\ \end{align} \end{split}\]

def evidence(n_heads, n_tails, n_points):
    priors = triangular_prior(n_points)
    joints = [
            likelihood(n_heads, n_tails, theta_i) *
            prior_value
            for theta_i, prior_value in priors.items()
            ]
    return sum(joints)

We can compute the evidence for different values:

print(evidence(n_heads=5, n_tails=5, n_points=3))
print(evidence(n_heads=10, n_tails=1, n_points=11))
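
In particular, for the data we collected above:

print(evidence(n_heads=n_heads, n_tails=n_tails, n_points=n_points))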

Now, we are prepared to compute the posterior distribution function (\(P(\theta|\mathcal{D})\)):

def posterior(n_heads, n_tails, n_points):
    p_theta = triangular_prior(n_points)
    p_d = evidence(n_heads, n_tails, n_points)
    p_theta_d = {
            theta_i: 
            likelihood(n_heads, n_tails, theta_i) *
            prior_i / p_d
            for theta_i, prior_i in p_theta.items()
            }

    return p_theta_d

Now we call the function with our data and print the resulting posterior distribution:

theta_posterior = posterior(
        n_heads=n_heads,
        n_tails=n_tails,
        n_points=n_points
        )
print(theta_posterior)
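
Since the posterior is a probability distribution over the grid of \(\theta\) values, its probabilities should sum to one; a quick sanity check:

print(sum(theta_posterior.values()))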

We can now compare the prior distribution with the posterior:

fig, ax = plt.subplots(figsize=(15, 7))

theta_prior = triangular_prior(n_points)
theta_posterior = posterior(
        n_heads=n_heads,
        n_tails=n_tails,
        n_points=n_points
        )

ax.bar(
        list(map(lambda x: f"{x:.2f}", theta_prior.keys())),
        theta_prior.values(), width=0.1, label=r"$P(\theta)$",
        alpha=0.5
        )

ax.bar(
        list(map(lambda x: f"{x:.2f}", theta_posterior.keys())),
        theta_posterior.values(), width=0.1,
        label=r"$P(\theta|\mathcal{D})$", alpha=0.5
        )

ax.set_xlabel(r"$\theta$")
ax.set_ylabel(r"Probability")
ax.legend()

4. Model application#

We can see that the resulting posterior distribution gives higher probability to values of the parameter around \(0.7\), which is consistent with the fact that the data from the experiment contains more heads than tails.

If we want to make a prediction of the outcome of a new coin toss, we can do it in different ways. Let’s see some common methods:

Maximum a Posteriori Probability#

We can use the mode of the posterior distribution, i.e.:

\[ \theta_{\text{MAP}} = \underset{\theta}{\text{argmax}} P(\theta|\mathcal{D}) \]

This is called maximum a posteriori estimation (MAP). The following Python function calculates the MAP estimator for a given posterior distribution:

def map_estimator(theta_posterior):
    # return the (theta, probability) pair with the highest posterior probability
    return max(theta_posterior.items(), key=lambda x: x[1])

In our example the MAP estimator will be:

theta_map, map_p = map_estimator(theta_posterior)
print(theta_map)
print(map_p)

This means that the probability of getting heads in a new coin toss will be:

\[ P(X=1|\theta = \theta_{\text{MAP}}) = \theta_{\text{MAP}} = 0.7 \]
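
For illustration, a minimal sketch that simulates a single new toss with the MAP estimate as the probability of heads:

# simulate one new toss using the MAP estimate of theta
new_toss = np.random.binomial(n=1, p=theta_map)
print("heads" if new_toss == 1 else "tails")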

Posterior Expectation#

An alternative is to use the expected value of \(\theta\) under the posterior:

\[ \theta_{\text{Bayes}} = E_{P(\theta|\mathcal{D})}[\theta] \]

This is called the Bayes estimator. The following function calculates the Bayes estimator:

def bayes_estimator(theta_posterior):
    return sum(theta_i * p_i for theta_i, p_i in theta_posterior.items())

In our example the Bayes estimator will be:

print(bayes_estimator(theta_posterior))

The probability of getting heads in a new coin toss according to this estimator will be:

\[ P(X=1|\theta = \theta_{\text{Bayes}}) = \theta_{\text{Bayes}} = 0.6961 \]

Bayesian Model Averaging#

Bayesian Model Averaging (BMA) relies on drawing samples from the posterior distribution to generate predictions. We can draw any number of parameter values from the posterior and then average any quantity of interest. If we want an estimate \(\theta_{BMA}\):

\[ \theta_{BMA} = \frac{\sum_{i=1}^{N_{samples}} \theta_i}{N_{samples}} \]

This strategy is especially important for inference techniques that do not directly provide the posterior distribution but only samples from it (such as Markov Chain Monte Carlo).

Let’s see an example for our coin:

n_samples = 100
theta_values, theta_probs = theta_posterior.keys(), theta_posterior.values()
theta_samples = np.random.choice(
    list(theta_values),
    p=list(theta_probs),
    size=n_samples
    )
print(theta_samples)
print(theta_samples.mean())
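
Finally, we can put the three estimates side by side; for this posterior they land close to each other:

print("MAP estimate:  ", theta_map)
print("Bayes estimate:", bayes_estimator(theta_posterior))
print("BMA estimate:  ", theta_samples.mean())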