# init repo notebook
!git clone https://github.com/rramosp/ppdl.git > /dev/null 2> /dev/null
!mv -n ppdl/content/init.py ppdl/content/local . 2> /dev/null
!pip install -r ppdl/content/requirements.txt > /dev/null

Probabilistic Neural Networks#

In this notebook, we will review some concepts related to Probabilistic Neural Networks using tensorflow-probability. To begin, we will import the necessary libraries:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
import tensorflow_probability as tfp
from keras.models import Model
from keras.layers import Dense
from keras.optimizers import Adam
from keras.losses import MeanSquaredError
from tqdm import tqdm
from IPython.display import display, Math
tfd = tfp.distributions
tfpl = tfp.layers

Dataset#


We will generate synthetic data for a regression problem in which the noise variance changes from point to point. We will define the following constants:

N_SAMPLES = 1000
RANGE = (-1, 1)
FREQ = 50
DECAY = -5
NOISE = 0.5

Let us generate \(\mathbf{X}\):

x = tf.cast(tf.linspace(start=RANGE[0], stop=RANGE[1], num=N_SAMPLES), tf.float32)
display(Math(r"\mathbf{X}"))
display(x[:5])

Now, we generate \(\mathbf{y}\):

# heteroscedastic noise: its scale decays exponentially as x goes from -1 to 1
noise = tf.random.normal(shape=(N_SAMPLES,), mean=0.0, stddev=NOISE) * tf.exp(
    DECAY * tf.cast(tf.linspace(start=0, stop=1, num=N_SAMPLES), tf.float32)
    )
y = x ** 2 + noise

Let’s visualize the dataset:

fig, ax = plt.subplots()
ax.scatter(x, y, alpha=0.3)
fig.show()

As you can see, \(\mathbf{y}\) has a higher variance for lower values of \(\mathbf{x}\).
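
For a rough check of this, note that x was generated with tf.linspace and is therefore already sorted, so we can split y into consecutive chunks and look at the empirical standard deviation in each (the x² trend also contributes a little to each chunk's spread):

# Rough heteroscedasticity check: x is sorted, so consecutive chunks of y
# correspond to consecutive ranges of x; the empirical std shrinks as x grows.
for i, chunk in enumerate(np.array_split(y.numpy(), 10)):
    print(f"chunk {i}: std(y) = {chunk.std():.3f}")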

Motivation#


We will now observe how a feedforward neural network performs on this dataset. First, we will define the model using keras:

class FeedForward(Model):
    def __init__(self, units, *args, **kwargs):
        super(FeedForward, self).__init__(*args, **kwargs)
        self.dense_stack = [
                Dense(units=unit, activation="relu")
                for unit in units
                ]
        self.output_layer = Dense(units=1, activation="linear")

    def call(self, x):
        for layer in self.dense_stack:
            x = layer(x)
        y_pred = self.output_layer(x)
        return y_pred

Now, we will instantiate the model:

model = FeedForward([32, 16])
model.build((None, 1))
model.summary()

We will use mean squared error as the loss function for the model, since we are dealing with a regression problem.

model.compile(
    optimizer=Adam(learning_rate=1e-4),
    loss=MeanSquaredError()
    )

We will now train the model:

model.fit(
        tf.reshape(x, (-1, 1)), y,
        epochs=1000, batch_size=200
        )

Let’s observe the loss:

fig, ax = plt.subplots()
ax.plot(model.history.history["loss"])
ax.set_xlabel("Epoch")
ax.set_ylabel("Loss")
fig.show()

We will now obtain the model’s predictions:

y_pred = model.predict(tf.reshape(x, (-1, 1)))
display(y_pred)

Now, let’s visualize the predictions:

fig, ax = plt.subplots()
ax.scatter(x, y, alpha=0.3, label="data")
ax.plot(x, tf.squeeze(y_pred), c="k", label=r"$\tilde{y}$")
ax.set_xlabel(r"$\mathbf{x}$")
ax.set_ylabel(r"$\mathbf{y}$")
ax.legend()
fig.show()

As you can see, the neural network is able to capture the non-linear trend of the data, but it doesn’t estimate the standard error or the variance of its predictions. An alternative is to use Probabilistic Neural Networks (PNNs), a more general class of neural networks that incorporate probabilistic components in two ways:

  1. A layer’s output can be a probability distribution.

  2. We can compute or approximate a distribution for the layers’ parameters.

Let’s delve into the details:

Distributions on Layers#


If we want to estimate the variance of the predictions, a straightforward approach is to use an output distribution. For example, we can assume the following distribution for the model:

\[ P(\mathbf{y} | \mathbf{X}, \mathbf{W}) = \mathcal{N}(\mu = f(\mathbf{X}, \mathbf{W}), \sigma = g(\mathbf{X}, \mathbf{W})) \]

Where \(\mathbf{W}\) are the neural network parameters, and \(f(\cdot)\) and \(g(\cdot)\) are outputs of the neural network (the predicted mean and standard deviation, respectively).
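
Maximizing this likelihood is equivalent to minimizing the Gaussian negative log-likelihood. Writing \(s_i = \log g(x_i, \mathbf{W})^2\) for the predicted log-variance, the loss over \(n\) points is:

\[ -\log P(\mathbf{y} | \mathbf{X}, \mathbf{W}) = \frac{1}{2} \sum_{i=1}^{n} \left[ (y_i - f(x_i, \mathbf{W}))^2 e^{-s_i} + s_i + \log 2\pi \right] \]

This is the quantity minimized by the custom training loop below.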

Similar to probabilistic linear regression, the model can be trained through maximum likelihood estimation (MLE). We will implement this using vanilla tensorflow. First, we will define the model:

class ProbNN(Model):
    def __init__(self, units, *args, **kwargs):
        super(ProbNN, self).__init__(*args, **kwargs)
        self.dense_stack = [
                Dense(units=unit, activation="relu")
                for unit in units
                ]
        self.output_mean = Dense(units=1, activation="linear")
        self.output_logvar = Dense(units=1, activation="linear")

    def call(self, x):
        for layer in self.dense_stack:
            x = layer(x)
        mean_pred = self.output_mean(x)
        logvar_pred = self.output_logvar(x)

        return mean_pred, logvar_pred

Now, we will instantiate this model:

model = ProbNN([32, 16])
model.build((None, 1))
model.summary()

We will now define the log probability density function of the normal distribution, which we will use for MLE:

@tf.function
def log_normal_pdf(y, mean, logvar):
    log2pi = tf.math.log(2. * np.pi)
    return tf.reduce_sum(
        -.5 * ((y - mean) ** 2. * tf.exp(-logvar) + logvar + log2pi),
        axis=1
        )
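
As a quick sanity check, this function should agree with the log density of tfd.Normal once the log-variance is converted to a standard deviation (the values below are arbitrary test inputs):

# Compare the hand-written log density against tfd.Normal; with a single point,
# the sum over axis=1 reduces to one term, so the two results should match.
y_chk = tf.constant([[0.3]])
mean_chk, logvar_chk = tf.constant([[0.1]]), tf.constant([[-1.0]])
print(log_normal_pdf(y_chk, mean_chk, logvar_chk))
print(tf.reduce_sum(
    tfd.Normal(loc=mean_chk, scale=tf.sqrt(tf.exp(logvar_chk))).log_prob(y_chk),
    axis=1
    ))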

We can now define the training loop for the custom loss function:

optimizer = Adam(learning_rate=1e-3)
pbar = tqdm(range(1000))
for i in pbar:
    with tf.GradientTape() as t:
        mean_pred, logvar_pred = model(tf.reshape(x, (-1, 1)))
        # negative log-likelihood of the observed y under the predicted Gaussian
        loss = - tf.reduce_sum(log_normal_pdf(
            tf.reshape(y, (-1, 1)),
            mean_pred,
            logvar_pred,
        ))
    pbar.set_description(f"Loss: {float(loss):.2f}")
    grads = t.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

We will now generate the predicted distribution:

mean_pred, logvar_pred = model.predict(tf.reshape(x, (-1, 1)))
std_pred = tf.sqrt(tf.exp(logvar_pred))

Finally, let’s visualize the predictions:

fig, ax = plt.subplots()
ax.scatter(x, y, alpha=0.3, label="data")
ax.plot(x, tf.squeeze(mean_pred), c="k", label=r"$\tilde{y}$")
ax.fill_between(
        x,
        tf.squeeze(mean_pred - 3 * std_pred),
        tf.squeeze(mean_pred + 3 * std_pred),
        color="k", alpha=0.3
        )
ax.set_xlabel(r"$\mathbf{x}$")
ax.set_ylabel(r"$\mathbf{y}$")
ax.legend()
fig.show()

As you can see, the neural network now predicts the variance of its predictions as well. This approach can be simplified with tensorflow-probability, which lets us redefine the model using probabilistic layers:

class ProbNN2(Model):
    def __init__(self, units, *args, **kwargs):
        super(ProbNN2, self).__init__(*args, **kwargs)
        self.dense_stack = [
                Dense(units=unit, activation="relu")
                for unit in units
                ]
        self.output_params = Dense(units=2, activation="linear")
        self.output_distro = tfpl.DistributionLambda(
                make_distribution_fn=lambda t: tfd.Normal(
                    loc=t[..., 0], scale=tf.exp(t[..., 1])
                    ),
                convert_to_tensor_fn=lambda s: s.sample(1)
                )

    def call(self, x):
        for layer in self.dense_stack:
            x = layer(x)
        params_pred = self.output_params(x)
        dist_pred = self.output_distro(params_pred)
        return dist_pred

Now, we will instantiate this model:

model = ProbNN2([32, 16])
model.build((None, 1))
model.summary()
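
Before training, note that calling this model returns a distribution object rather than a tensor, so we can query its moments directly (the values are meaningless until the model is trained):

# The forward pass returns a Normal distribution with batch shape (3,) here,
# so we can inspect its mean and standard deviation directly.
dist_chk = model(tf.reshape(x[:3], (-1, 1)))
print(dist_chk)
print(dist_chk.mean().numpy(), dist_chk.stddev().numpy())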

We can now define the training loop for the custom loss function:

optimizer = Adam(learning_rate=1e-3)
pbar = tqdm(range(1000))
for i in pbar:
    with tf.GradientTape() as t:
        dist_pred = model(tf.reshape(x, (-1, 1)))
        loss = - tf.reduce_sum(dist_pred.log_prob(y))
    pbar.set_description(f"Loss: {float(loss):.2f}")
    grads = t.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

We will now obtain the predictions:

dist_pred = model(tf.reshape(x, (-1, 1)))
mean_pred, std_pred = dist_pred.mean(), dist_pred.stddev()

Finally, let’s visualize the predictions:

fig, ax = plt.subplots()
ax.scatter(x, y, alpha=0.3, label="data")
ax.plot(x, tf.squeeze(mean_pred), c="k", label=r"$\tilde{y}$")
ax.fill_between(
        x,
        tf.squeeze(mean_pred - 3 * std_pred),
        tf.squeeze(mean_pred + 3 * std_pred),
        color="k", alpha=0.3
        )
ax.set_xlabel(r"$\mathbf{x}$")
ax.set_ylabel(r"$\mathbf{y}$")
ax.legend()
fig.show()

As you can see, tensorflow-probability simplifies the use of distributions inside neural networks.

Distributions on Parameters#


Probabilistic neural networks also allow distributions on parameters, which is particularly important for addressing epistemic uncertainty.

Epistemic uncertainty refers to missing or incorrect knowledge about the world, which may be due to incorrect labels or a lack of data.

Epistemic uncertainty can be reduced by collecting more data, but Bayesian neural networks, which place distributions over the parameters, also address it. Specifically, when we train a conventional neural network we end up with a point estimate of the model’s weights, and those estimates can vary considerably when there is not enough data.

Let’s see how the model behaves when we train it several times with a varying number of samples.

samples = [10, 100, 1000]
predictions = {}
idx = np.arange(1000)
np.random.shuffle(idx)
for sample in tqdm(samples):
    x_sample = x.numpy()[idx[:sample]].reshape(-1, 1)
    y_sample = y.numpy()[idx[:sample]]

    model_pred = []
    for i in range(10):
        model = FeedForward([32, 16])
        model.build((None, 1))
        model.compile(
            optimizer=Adam(learning_rate=1e-3),
            loss=MeanSquaredError(),
            )
        model.fit(
            x_sample, y_sample,
            epochs=300,
            verbose=False,
            batch_size=1000,
            )
        model_pred.append(model.predict(tf.reshape(x, (-1, 1)), verbose=False))
    predictions[sample] = (x_sample, y_sample, model_pred)

We can now visualize the model’s predictions from several trials with different sample sizes:

fig, axes = plt.subplots(1, 3, figsize=(15, 7))
for i, sample in enumerate(samples):
    ax = axes[i]
    x_sample, y_sample, model_prediction = predictions[sample]
    ax.scatter(x, y, label="Full data", alpha=0.1)
    ax.scatter(x_sample.flatten(), y_sample, label="Sample", alpha=0.5)
    ax.set_xlabel(r"$\mathbf{x}$")
    ax.set_ylabel(r"$\mathbf{y}$")
    ax.set_title(fr"$N={sample}$")
    for prediction in model_prediction:
        ax.plot(x, prediction, c="k", alpha=0.6)
    ax.legend()
fig.show()

As you can see, the model learns different parameters (which leads to different predictions) even when trained with the same number of samples. However, this variation decreases as the number of samples increases.
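
We can quantify this with a short check: for each sample size, stack the ten models’ predictions and average their point-wise standard deviation across models.

# For each sample size, measure how much the 10 independently trained models
# disagree: the point-wise std across models, averaged over all x.
for sample in samples:
    _, _, model_pred = predictions[sample]
    stacked = np.stack([p.squeeze() for p in model_pred], axis=0)  # (10, N_SAMPLES)
    print(f"N={sample}: mean std across models = {stacked.std(axis=0).mean():.4f}")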

It is possible to incorporate epistemic uncertainty in neural networks through Bayesian modeling. Specifically, we aim to compute:

\[ P(\mathbf{W}| \mathbf{X}, \mathbf{y}) \]

where \(\mathbf{W}\) are the weights of the neural network. This distribution can be computed from a prior distribution \(P(\mathbf{W})\) and the likelihood \(P(\mathbf{y} | \mathbf{X}, \mathbf{W})\), just as in Bayesian linear regression. However, for neural networks the exact posterior is intractable, so we need Markov chain Monte Carlo or variational inference to approximate it.
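
This is simply Bayes’ rule applied to the network weights:

\[ P(\mathbf{W} | \mathbf{X}, \mathbf{y}) = \frac{P(\mathbf{y} | \mathbf{X}, \mathbf{W}) \, P(\mathbf{W})}{P(\mathbf{y} | \mathbf{X})} \]

The evidence \(P(\mathbf{y} | \mathbf{X})\) involves an integral over all possible weights, which is what makes the exact posterior intractable.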

Markov Chain Monte Carlo#


We can train a Bayesian neural network through Markov chain Monte Carlo; however, we must define it as a JointDistribution instead of the usual keras model.

Let’s see an example of a Bayesian neural network with two hidden layers:

\[\begin{split}
\mathbf{W_1} \sim \mathcal{N}(0, 1)\\
\mathbf{b_1} \sim \mathcal{N}(0, 1)\\
\mathbf{W_2} \sim \mathcal{N}(0, 1)\\
\mathbf{b_2} \sim \mathcal{N}(0, 1)\\
\mathbf{W_3} \sim \mathcal{N}(0, 1)\\
\mathbf{b_3} \sim \mathcal{N}(0, 1)\\
E \sim \text{HalfNormal}(1)\\
\mathbf{y} \sim \mathcal{N}(\mu = f(f(\mathbf{X} \cdot \mathbf{W_1} + \mathbf{b_1}) \cdot \mathbf{W_2} + \mathbf{b_2}) \cdot \mathbf{W_3} + \mathbf{b_3}, \sigma = E)
\end{split}\]

Where \(f(\cdot)\) is the ReLU activation function. Let’s train the model with a reduced number of samples:

idx = np.arange(1000)
np.random.shuffle(idx)
x_sample = tf.constant(x.numpy()[idx[:10]].reshape(-1, 1))
y_sample = tf.constant(y.numpy()[idx[:10]].reshape(-1, 1))

Now, we can define the model. Note that the implementation below simplifies the model above: it keeps a single hidden layer of 16 units (so \(\mathbf{W_2}\) and \(\mathbf{b_2}\) are dropped, which is why the remaining weights are named w1 and w3) and fixes the noise scale to 1 instead of sampling \(E\):

model = tfd.JointDistributionNamedAutoBatched({
    "w1": tfd.Normal(loc=tf.zeros(shape=(1, 16)), scale=1.),
    "b1": tfd.Normal(loc=tf.zeros(shape=(16, )), scale=1.),
    "w3": tfd.Normal(loc=tf.zeros(shape=(16, 1)), scale=1.),
    "b3": tfd.Normal(loc=tf.zeros(shape=(1, )), scale=1.),
    "y": lambda b3, w3, b1, w1: tfd.Normal(
        loc = tf.nn.relu(
            x_sample @ w1 + b1
            ) @ w3 + b3,
        scale = 1.
        )
    }
)
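
Sampling this joint distribution returns a dictionary with one draw per variable, including a prior-predictive draw of y for x_sample; printing the shapes is a quick way to check the definition:

# One draw from the prior: weight matrix, biases, and the implied y for x_sample.
prior_draw = model.sample()
for name, value in prior_draw.items():
    print(name, value.shape)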

We can define the log probability for this joint distribution:

def log_prob(w1, b1, w3, b3):
    return model.log_prob(
            w1 = w1, b1 = b1, w3 = w3, b3 = b3,
            y = tf.reshape(y_sample, (-1, 1))
            )

Now, we can define the MCMC procedure:

@tf.function
def mcmc():
    kernel = tfp.mcmc.NoUTurnSampler(
            target_log_prob_fn = log_prob,
            step_size=1e-3
            )
    return tfp.mcmc.sample_chain(
            num_results = 1000,
            num_burnin_steps = 500,
            current_state = [
                tf.zeros(shape=(1, 16)),
                tf.zeros(shape=(16, )),
                tf.zeros(shape=(16, 1)),
                tf.zeros(shape=(1, )),
                ],
            kernel = kernel,
            trace_fn = lambda _, results: results.target_log_prob
            )

Now, we can train the model:

samples, log_probs = mcmc()
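
The resulting samples follow the order of current_state (w1, b1, w3, b3), with 1000 posterior draws each. As a rough sketch, we can push every draw through the network to obtain a posterior-predictive mean and spread over the full range of x:

# Posterior predictive (a sketch): evaluate the one-hidden-layer network for each
# of the 1000 posterior weight draws, then summarize across draws.
w1_s, b1_s, w3_s, b3_s = samples          # shapes: (1000,1,16), (1000,16), (1000,16,1), (1000,1)
x_full = tf.reshape(x, (-1, 1))           # (N_SAMPLES, 1)
preds = tf.nn.relu(x_full[None, ...] @ w1_s + b1_s[:, None, :]) @ w3_s + b3_s[:, None, :]
mean_pred = tf.reduce_mean(preds, axis=0)     # average over posterior draws
std_pred = tf.math.reduce_std(preds, axis=0)  # epistemic spread across draws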

Variational Inference#