# init repo notebook
!git clone https://github.com/rramosp/ppdl.git > /dev/null 2> /dev/null
!mv -n ppdl/content/init.py ppdl/content/local . 2> /dev/null
!pip install -r ppdl/content/requirements.txt > /dev/null
import sympy as sy
import numpy as np
from scipy import stats
from scipy.integrate import quad
import matplotlib.pyplot as plt
import tensorflow_probability as tfp
import tensorflow as tf
tfd = tfp.distributions
## An expectation
we have the following random variable \(z\), whose distribution depends on a parameter \(\theta\):

$$z \sim \mathcal{N}(\mu=3,\; \sigma=\theta)$$

and we have the following function of \(z\), which also depends on the parameter \(\theta\):

$$f_\theta(z) = (\theta + z)^2$$

we want to compute the following expectation:

$$\mathbb{E}_{z\sim p_\theta}\left[ f_\theta(z) \right] = \int p_\theta(z)\, f_\theta(z)\, dz$$

for a given value of \(\theta\), we can compute this in three ways:
- using symbolic integration with `sympy`
- using numerical integration with `scipy.integrate`
- using the montecarlo approximation

$$\mathbb{E}_{z\sim p_\theta}\left[ f_\theta(z) \right] \approx \frac{1}{N}\sum_{i=1}^{N} f_\theta(z_i), \qquad z_i \sim p_\theta$$
Observe that this is very powerful, as we are estimating an integral without actually having to do the integral. Recall that integration is hard in general, while differentiation is mechanical.
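For this particular \(p_\theta\) and \(f_\theta\) we can also do the integral by hand (using \(\mathbb{E}[z]=3\) and \(\mathbb{E}[z^2]=\theta^2+9\)), which gives a ground truth to check all three methods against:

$$\mathbb{E}_{z\sim p_\theta}\left[ (\theta+z)^2 \right] = \theta^2 + 2\theta\,\mathbb{E}[z] + \mathbb{E}[z^2] = 2\theta^2 + 6\theta + 9$$

which at \(\theta=1.5\) evaluates to \(22.5\).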
[ref] https://gregorygundersen.com/blog/2018/04/29/reparameterization/
tval = 1.5
# using sympy
t,z = sy.symbols(r'\theta z')
muz = sy.N(3)
sigmaz = t
ptz = 1/(sigmaz*sy.sqrt(2*sy.pi))*sy.exp(-((z-muz)/sigmaz)**2/2) # the distribution of z
ftz = (t+z)**2
sy.integrate((ptz*ftz).subs({t:tval}), (z,-sy.oo,+sy.oo)).n()
# using numerical integration
dz = stats.norm(loc=3, scale=tval)
ftzn = lambda z: (z+tval)**2
quad(lambda z: dz.pdf(z)*ftzn(z), -20,20)[0]
22.5
# using montecarlo
np.mean(ftzn(dz.rvs(100000)))
22.52906253501012
# using montecarlo with TF
tf_tval = tf.Variable(tval)
tf_dz = tfd.Normal(loc=3, scale=tf_tval)
tf.reduce_mean((tf_dz.sample(10000)+tf_tval)**2)
<tf.Tensor: shape=(), dtype=float32, numpy=22.334044>
## Expectations through batches
In ML we regularly optimize by computing a loss function \(\mathcal{L}_\theta\) over a set of batches and then averaging, where \(\theta\) refers to the parameters of our model that we are optimizing.

When we do this, we are computing a Montecarlo approximation of the expectation of the loss with respect to whatever distribution \(D_x\) produces our data:

$$\frac{1}{N}\sum_{i=1}^{N} \mathcal{L}_\theta(x_i) \approx \mathbb{E}_{x\sim D_x}\left[ \mathcal{L}_\theta(x) \right] = \int p(x)\, \mathcal{L}_\theta(x)\, dx$$

where \(p(x)\) is the probability (pdf) of \(x\) as coming from distribution \(D_x\).

Or, in supervised learning:

$$\frac{1}{N}\sum_{i=1}^{N} \mathcal{L}_\theta(x_i, y_i) \approx \mathbb{E}_{(x,y)\sim D_{xy}}\left[ \mathcal{L}_\theta(x, y) \right]$$
Regularly, in ML, we are not very aware of \(D_x\) or \(D_{xy}\), because (1) we simply have a set of data that we are given, and this data is actually a sample from \(D_x\) or \(D_{xy}\); and (2) when we make a Montecarlo approximation of an expectation we do not need to use the pdf \(p(x)\), and that is the beauty of it.
When we do variational inference we will want to maximize or minimize some expectation, which is why ML optimization fits very nicely into this scheme.
# using montecarlo with batches of 100 samples each
np.mean([np.mean(ftzn(dz.rvs(100))) for _ in range(1000)])
22.505884963908255
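Note that, just as claimed above, this estimate never evaluates `dz.pdf`: unlike the `quad` computation earlier, the montecarlo approximation only needs samples.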
## Optimizing through gradients of expectations
In fact, when we implement an optimization loop in ML with TensorFlow, what we do is:

1. Get a batch of data
2. Compute the loss, or more exactly, a Montecarlo approximation of the expectation of the loss
3. Compute the gradient of the above
4. Apply the gradients (use the optimizer)

So usually the gradients we compute in step 3 are

$$\nabla_\theta \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}_\theta(x_i) \approx \nabla_\theta \mathbb{E}_{x\sim D_x}\left[ \mathcal{L}_\theta(x) \right] = \int p(x)\, \nabla_\theta \mathcal{L}_\theta(x)\, dx = \mathbb{E}_{x\sim D_x}\left[ \nabla_\theta \mathcal{L}_\theta(x) \right] \qquad (*)$$
Recall that in TF we usually do a `tf.reduce_mean` on the loss computed over a batch of data, and then we take the gradients. This means that in an optimization loop in ML we are always computing the gradient of an expectation.
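For instance, here is a minimal sketch of one such optimization step; the linear model, learning rate and synthetic batch are made up for illustration:

# one optimization step: reduce_mean is the montecarlo expectation, tape.gradient is (*)
w, b = tf.Variable(1.), tf.Variable(0.)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
xbatch = tf.random.normal((100,))                     # a batch of data, a sample from D_x
ybatch = 2*xbatch + 1                                 # made-up targets
with tf.GradientTape() as tape:
    loss = tf.reduce_mean((w*xbatch + b - ybatch)**2) # montecarlo approximation of E[loss]
grads = tape.gradient(loss, [w, b])                   # gradient of the approximated expectation
optimizer.apply_gradients(zip(grads, [w, b]))         # apply the gradients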
Note that we can do this because in \((*)\) the pdf \(p(x)\) does not depend on \(\theta\), that is, on the parameters with respect to which we are taking the gradient \(\nabla_\theta\). This is a key observation for what follows.
## The gradient of an expectation wrt distribution parameters
now we are in an unsupervised learning scenario and our data is denoted by \(z\), instead of \(x\). Again, we want to compute the gradient of the expectation above, this time with respect to \(\theta\), and evaluate it at the same specific \(\theta\) value. But now our pdf also depends on \(\theta\), and we denote it by \(p_\theta\):

$$\nabla_\theta \mathbb{E}_{z\sim p_\theta}\left[ f_\theta(z) \right] = \nabla_\theta \int p_\theta(z)\, f_\theta(z)\, dz$$
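Since we have the closed form \(2\theta^2 + 6\theta + 9\) from before, we also know the true value of this gradient:

$$\nabla_\theta \left( 2\theta^2 + 6\theta + 9 \right) = 4\theta + 6 = 12 \quad \text{at } \theta=1.5$$

whereas naively pushing the gradient inside the expectation gives \(\mathbb{E}\left[ \nabla_\theta f_\theta(z) \right] = \mathbb{E}\left[ 2(\theta+z) \right] = 2\theta + 6 = 9\), which is exactly what the two (wrong) montecarlo estimates below return.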
# with sympy
sy.integrate((ptz*ftz).diff(t).subs({t: tval}), (z,-sy.oo, +sy.oo)).n()
# montecarlo estimate with lambdified sympy function (wrong!!!)
diff_ftz = sy.lambdify(z, ftz.diff(t).subs({t: tval}), "numpy") # we are lazy and use sympy to get diff fz
np.mean(diff_ftz(dz.rvs(100000)))
8.99532267198478
# montecarlo estimate with TF (wrong!!!)
# ----
# NOTE: if you sample **within** tf.GradientTape, you get a different gradient,
# as sampling then becomes part of the computational graph that the
# gradient tape is monitoring
func = lambda z,t: (z+t)**2
zsample = tfd.Normal(loc=3, scale=tval).sample(10000)
with tf.GradientTape() as tape:
    loss = tf.reduce_mean(func(zsample, tf_tval))
grad = tape.gradient(loss, tf_tval)
loss, grad
(<tf.Tensor: shape=(), dtype=float32, numpy=22.460865>,
<tf.Tensor: shape=(), dtype=float32, numpy=8.999266>)
what happens is that

$$\nabla_\theta \mathbb{E}_{z\sim p_\theta}\left[ f_\theta(z) \right] = \nabla_\theta \int p_\theta(z)\, f_\theta(z)\, dz = \underbrace{\int \nabla_\theta p_\theta(z)\, f_\theta(z)\, dz}_{\text{first part}} + \underbrace{\int p_\theta(z)\, \nabla_\theta f_\theta(z)\, dz}_{\text{second part}}$$
in fact, the montecarlo estimate we used is actually computing only the second part of the expression above
and this is unfortunate, because it means that we cannot use TensorFlow's differentiation engine and batch averaging to compute the full gradient: the first part is an integral that montecarlo sampling of \(z\) does not estimate for us, and TensorFlow does not do integration. Integration is hard!!!
# the second part of the expression above is what the montecarlo estimate we just used approximates
diff_ft = sy.lambdify(z, (ptz*ftz.diff(t)).subs({t: tval}), "numpy")
quad(diff_ft, -20,20)[0]
9.000000000000009
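We can cross-check the first part the same way (this extra check is an addition here, not strictly needed): it should account for the remaining \(12 - 9 = 3\).

# the first part: numerically integrate (d p_theta / d theta) * f; should be ~3, and 9 + 3 = 12
diff_pt_f = sy.lambdify(z, (ptz.diff(t)*ftz).subs({t: tval}), "numpy")
quad(diff_pt_f, -20, 20)[0]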
## Fixing Montecarlo for expectation gradients with the reparametrization trick
observe that if \(p_\theta(z)\) did not depend on \(\theta\) (denote it simply by \(p(z)\)), this would not happen, as we would have \(\nabla_\theta p(z)=0\) and the first part would vanish.
so we would like to have an expectation whose probability does not depend on \(\theta\).
we can do that by (1) sampling from a different variable \(x\) whose probability does not depend on \(\theta\); and (2) defining a transformation \(g_\theta\) so that we recover \(z\):

$$z = g_\theta(x)$$

this way

$$\nabla_\theta \mathbb{E}_{z\sim p_\theta}\left[ f_\theta(z) \right] = \nabla_\theta \mathbb{E}_{x\sim p}\left[ f_\theta(g_\theta(x)) \right] = \mathbb{E}_{x\sim p}\left[ \nabla_\theta f_\theta(g_\theta(x)) \right] \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta f_\theta(g_\theta(x_i))$$
since \(p(x)\) does not depend on \(\theta\), and \(x_i\) is sampled from the distribution of \(x\).
observe that it is ok if \(g_\theta\) also depends on \(\theta\) (hence the subscript in \(g_\theta\)).
so, for this example, we can sample \(x \sim \mathcal{N}(0,1)\) and use the transformation

$$z = g_\theta(x) = 3 + \theta x$$

which gives back \(z \sim \mathcal{N}(3, \theta)\):
# sanity check: expectation using sympy and the reparametrization trick
g = lambda x,t: 3+t*x # the transformation of the reparametrization trick
x = sy.symbols(r'x')
mux = sy.N(0)
sigmax = sy.N(1)
ptx = 1/sy.sqrt(2*sy.pi)*sy.exp(-x**2/2) # standard gaussian distribution for x
ftx = (t+g(x,t))**2
sy.integrate((ptx*ftx).subs({t:tval}), (x,-sy.oo,+sy.oo)).n()
# sanity check: expectation using montecarlo and the reparametrization trick
dx = stats.norm(loc=0, scale=1)
ftxn = lambda x: (g(x,tval)+tval)**2
np.mean(ftxn(dx.rvs(100000)))
22.423727915764786
# reparametrization trick: the gradient of the expectation using sympy
sy.integrate((ptx*ftx).diff(t).subs({t: tval}), (x,-sy.oo, +sy.oo)).n()
# reparametrization trick: the gradient of the expectation using montecarlo. It works!!!!!
diff_ftx = sy.lambdify(x, ftx.diff(t).subs({t: tval}), "numpy")
np.mean(diff_ftx(dx.rvs(100000)))
11.992036881620106
# reparametrization trick with TF
tf_tval = tf.Variable(tval)
func = lambda z,t: (z+t)**2
xsample = tfd.Normal(loc=0, scale=1).sample(100000)
with tf.GradientTape() as tape:
    loss = tf.reduce_mean(func(g(xsample, tf_tval), tf_tval))
grad = tape.gradient(loss, tf_tval)
loss, grad
(<tf.Tensor: shape=(), dtype=float32, numpy=22.431013>,
<tf.Tensor: shape=(), dtype=float32, numpy=11.936471>)
This way of doing the reparametrization trick requires us to find both a distribution for \(x\) and a transformation \(g_\theta\) such that \(z=g_\theta(x)\) has the distribution we want.
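As a final note, tfp already implements reparametrized sampling for many distributions (tfd.Normal has reparameterization_type FULLY_REPARAMETERIZED), so a sketch like the following, where we sample inside the tape from a distribution parametrized by the variable, should give the correct gradient directly; this is also why the NOTE further above observed a different gradient when sampling within tf.GradientTape:

# tfp's Normal.sample is internally reparametrized, so gradients flow through sampling
tf_tval = tf.Variable(tval)
with tf.GradientTape() as tape:
    tf_dz = tfd.Normal(loc=3., scale=tf_tval)
    loss = tf.reduce_mean((tf_dz.sample(100000) + tf_tval)**2)
tape.gradient(loss, tf_tval)   # should be close to 12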