# init repo notebook
!git clone https://github.com/rramosp/ppdl.git > /dev/null 2> /dev/null
!mv -n ppdl/content/init.py ppdl/content/local . 2> /dev/null
!pip install -r ppdl/content/requirements.txt > /dev/null
import sympy as sy
import numpy as np
from scipy import stats
from scipy.integrate import quad
import matplotlib.pyplot as plt
import tensorflow_probability as tfp
import tensorflow as tf
tfd = tfp.distributions

An expectation#

we have the following random variable, whose distribution depends on a parameter \(\theta\):

\[z \sim \mathcal{N}_z(\mu=3, \sigma=\theta)\;\;\;\text{with }p_\theta(z)\text{ the pdf}\]

and we have the following function of \(z\), which also depends on the parameter \(\theta\)

\[f_\theta(z) = (z+\theta)^2\]

we want to compute the following expectation

\[\begin{split}\begin{align} \mathbb{E}_{z\sim N_\theta} [f_\theta(z)] &= \int_{-\infty}^{+\infty} p_\theta(z)f_\theta(z) dz & \text{definite integral} \\ &\approx \frac{1}{N}\sum_{z_i\sim N_\theta}f_\theta(z_i) &\text{Monte Carlo approximation} \end{align} \end{split}\]
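since \(f_\theta\) is a simple quadratic and \(z\) is Gaussian, this particular expectation also has a closed form, which is handy for checking the numbers below. Using \(\mathbb{E}[z]=3\) and \(\text{Var}(z)=\theta^2\):

\[\begin{split}\begin{align} \mathbb{E}[(z+\theta)^2] &= \mathbb{E}[z^2] + 2\theta\,\mathbb{E}[z] + \theta^2 \\ &= (\theta^2 + 9) + 6\theta + \theta^2 \\ &= 2\theta^2 + 6\theta + 9 \end{align} \end{split}\]

at \(\theta=1.5\) this gives \(22.5\), and its derivative \(4\theta+6\) gives \(12\); these are the two reference values that appear throughout this notebook.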

for a given value of \(\theta\), we can compute this in three ways:

  • using symbolic integration with sympy

  • using numeric integration with scipy.integrate

  • using the Monte Carlo approximation

Observe that this is very powerful: we are estimating an integral without actually having to compute it. Recall that integration is hard in general, while differentiation is mechanical.

[ref] https://gregorygundersen.com/blog/2018/04/29/reparameterization/

tval = 1.5
# using sympy

t,z = sy.symbols(r'\theta z')
muz = sy.N(3)
sigmaz = t

ptz = 1/(sigmaz*sy.sqrt(2*sy.pi))*sy.exp(-((z-muz)/sigmaz)**2/2) # the distribution of z
ftz = (t+z)**2

sy.integrate((ptz*ftz).subs({t:tval}), (z,-sy.oo,+sy.oo)).n()
\[\displaystyle 22.5\]
# using numerical integration
dz = stats.norm(loc=3, scale=tval)
ftzn = lambda z: (z+tval)**2
quad(lambda z: dz.pdf(z)*ftzn(z), -20,20)[0]
22.5
# using Monte Carlo
np.mean(ftzn(dz.rvs(100000)))
22.52906253501012
# using Monte Carlo with TF

tf_tval = tf.Variable(tval)

tf_dz = tfd.Normal(loc=3, scale=tf_tval)
tf.reduce_mean((tf_dz.sample(10000)+tf_tval)**2)
<tf.Tensor: shape=(), dtype=float32, numpy=22.334044>

Expectations through batches#

In ML we regularly optimize by computing a loss function \(\mathcal{L}_\theta\) over a series of batches and then averaging; \(\theta\) denotes the parameters of the model that we are optimizing.

When we do this, we are computing a Monte Carlo approximation of the expectation of the loss with respect to whatever distribution \(D\) produces our data.

\[\underbrace{\mathbb{E}_{x\sim D_x} \mathcal{L}_\theta(x) = \int p(x)\mathcal{L}_\theta(x)dx}_\text{an expectation} \approx \underbrace{\frac{1}{N}\sum_{x_i\sim D_x} \mathcal{L}_\theta(x_i)}_\text{what we do in ML}\]

where \(p(x)\) is the pdf of \(x\) under the distribution \(D_x\).

Or, in supervised learning:

\[\underbrace{\mathbb{E}_{x,y\sim D_{xy}} \mathcal{L}_\theta(x,y) = \int p(x,y)\mathcal{L}_\theta(x,y)dx dy}_\text{an expectation} \approx \underbrace{\frac{1}{N}\sum_{x_i,y_i\sim D_{xy}} \mathcal{L}_\theta(x_i, y_i)}_\text{what we do in ML}\]

In ML we are often not very aware of \(D_x\) or \(D_{xy}\), because (1) we are simply given a set of data, which is in fact a sample from \(D_x\) or \(D_{xy}\); and (2) when we make a Monte Carlo approximation of an expectation we do not need the pdf \(p(x)\) at all, and that is the beauty of it.

When we do variational inference we will want to maximize or minimize some expectation, which is why ML optimization fits very nicely into this scheme.

# using Monte Carlo with 1000 batches of 100 samples each
np.mean([np.mean(ftzn(dz.rvs(100))) for _ in range(1000)])
22.505884963908255

Optimizing through gradients of expectations#

In fact, when we implement an optimization loop in ML with TensorFlow, what we do is:

  1. Get a batch of data

  2. Compute the loss, or more exactly, a Monte Carlo approximation of the expectation of the loss.

  3. Compute the gradient of the above

  4. Apply the gradients (use the optimizer)

So the gradients we compute in step 3 are usually

\[\begin{split}\begin{align} \nabla_\theta\mathbb{E}_{x\sim D_x}\mathcal{L}_\theta(x) &= \nabla_\theta\int p(x)\mathcal{L}_\theta(x)dx \\ &=\int \nabla_\theta \big[p(x)\mathcal{L}_\theta(x)\big]dx\\ &=\int p(x) \nabla_\theta \big[\mathcal{L}_\theta(x)\big]dx&\;\;\;(*)\\ &\approx \frac{1}{N}\sum_{x_i}\nabla_\theta\mathcal{L}_\theta(x_i)\\ &= \nabla_\theta\frac{1}{N}\sum_{x_i}\mathcal{L}_\theta(x_i)&\;\;\;\text{(what we do in TF)}\\ \end{align} \end{split}\]

Recall that in TF we usually do a tf.reduce_mean on the loss computed over a batch of data, and then take the gradients. This means that in an ML optimization loop we are always computing the gradient of an expectation, as in the sketch below.
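for instance, a typical training step looks like the following minimal sketch, where model, optimizer and the squared-error loss are hypothetical placeholders not defined in this notebook:

# a minimal sketch of a TF training step: the batch mean of the per-example
# losses is a Monte Carlo estimate of the expected loss, and we take the
# gradient of that mean. `model`, `optimizer` and the squared-error loss
# are hypothetical placeholders.
def train_step(model, optimizer, x_batch, y_batch):
    with tf.GradientTape() as tape:
        per_example_loss = (model(x_batch) - y_batch)**2    # loss for each sample in the batch
        loss = tf.reduce_mean(per_example_loss)             # Monte Carlo estimate of the expectation
    grads = tape.gradient(loss, model.trainable_variables)  # gradient of the estimated expectation
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss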

Note that we can do this because in \((*)\) the \(p(x)\) does not depend on \(\theta\), that is, on the parameters with respect to which we are taking the gradient \(\nabla_\theta\). This is a key observation for what follows.

The gradient of an expectation wrt distribution parameters#

now we are in an unsupervised learning scenario and our data is denoted by \(z\) instead of \(x\). Again, we want to compute the gradient of the expectation above with respect to \(\theta\), and evaluate it at the same specific \(\theta\) value. But now our pdf also depends on \(\theta\), and we denote it by \(p_\theta\)

\[\begin{split}\begin{align} \nabla_\theta \mathbb{E}_{z\sim N_\theta} [f_\theta(z)] &= \nabla_\theta \int p_\theta(z)f_\theta(z) dz \\ &= \int \nabla_\theta \Big[ p_\theta(z)f_\theta(z)\Big] dz\\ &\text{but since }p_\theta(z)\text{ depends on }\theta\\ &\not \approx \frac{1}{N}\sum_{z_i\sim N_\theta}\nabla_\theta \Big[f_\theta(z_i) \Big] \end{align} \end{split}\]
# with sympy
sy.integrate((ptz*ftz).diff(t).subs({t: tval}), (z,-sy.oo, +sy.oo)).n()
\[\displaystyle 12.0\]
# Monte Carlo estimate with lambdified sympy function (wrong!!!)
diff_ftz = sy.lambdify(z, ftz.diff(t).subs({t: tval}), "numpy") # we are lazy and use sympy to get the derivative of ftz
np.mean(diff_ftz(dz.rvs(100000)))
8.99532267198478
# Monte Carlo estimate with TF (wrong!!!)
# ----
# NOTE: if you sample **within** tf.GradientTape from a distribution parameterized
#       by tf_tval, you get a different gradient, because the tape then also tracks
#       the sampling operation itself (more on this at the end of the notebook)

func    = lambda z,t: (z+t)**2
zsample = tfd.Normal(loc=3, scale=tval).sample(10000)

with tf.GradientTape() as tape:
    loss = tf.reduce_mean(func(zsample, tf_tval))
        
grad = tape.gradient(loss, tf_tval)
loss, grad
(<tf.Tensor: shape=(), dtype=float32, numpy=22.460865>,
 <tf.Tensor: shape=(), dtype=float32, numpy=8.999266>)

what happens is that

\[\begin{split}\begin{align} \int \nabla_\theta \Big[ p_\theta(z)f_\theta(z)\Big] dz &= \int \nabla_\theta \Big[ p_\theta(z)\Big]f_\theta(z) + p_\theta(z)\nabla_\theta\Big[ f_\theta(z) \Big] dz\\ &= \underbrace{\int \nabla_\theta \Big[ p_\theta(z)\Big]f_\theta(z) dz}_\text{not an expectation} + \underbrace{\int p_\theta(z)\nabla_\theta\Big[ f_\theta(z) \Big] dz}_\text{an expectation} \end{align} \end{split}\]

in fact, the Monte Carlo estimate we used is actually computing only the second term of the expression above

\[ \frac{1}{N}\sum_{z_i\sim N_\theta}\nabla_\theta f_\theta(z_i) \approx \mathbb{E}_{z\sim \mathcal{N}_\theta}\Big[ \nabla_\theta f_\theta(z) \Big] \]

and this is unfortunate, because it means that we cannot use the TensorFlow differentiation engine and batch averaging to compute this gradient.

TensorFlow does not do integration. Integration is hard!!!

# the second term of the expression above is what the Monte Carlo estimate we just used approximates
diff_ft = sy.lambdify(z, (ptz*ftz.diff(t)).subs({t: tval}), "numpy")
quad(diff_ft, -20,20)[0]
9.000000000000009
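for completeness, the first term (the one that is not an expectation) can be checked numerically too; it should account for the missing \(12 - 9 = 3\):

# numerically integrate the first term: the grad of p_theta times f_theta
diff_pt = sy.lambdify(z, (ptz.diff(t)*ftz).subs({t: tval}), "numpy")
quad(diff_pt, -20, 20)[0]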

Fixing Monte Carlo for expectation gradients with the reparametrization trick#

observe that if \(p_\theta(z)\) did not depend on \(\theta\), so that it could be denoted simply by \(p(z)\), this would not happen, as we would have \(\nabla_\theta p(z)=0\)

so we would like to have an expectation whose underlying distribution does not depend on \(\theta\).

we can do that by (1) sampling from a different variable \(x\) whose distribution does not depend on \(\theta\); and (2) defining a transformation \(g_\theta\) so that we recover \(z\):

\[\begin{split}\begin{align} g_\theta(x)&=z\\ x&\sim\text{ some distribution not dependent on }\theta \end{align} \end{split}\]

this way

\[\begin{split} \begin{align} \nabla_\theta \mathbb{E}_{z\sim N_\theta} [f_\theta(z)] &= \nabla_\theta \mathbb{E}_{x} [f_\theta(g_\theta(x))]\\ &= \nabla_\theta \int_{-\infty}^{+\infty} p(x)f_\theta(g_\theta(x)) dx\\ &= \int_{-\infty}^{+\infty} p(x) \nabla_\theta \Big[ f_\theta(g_\theta(x)) \Big] dx\\ &\approx \frac{1}{N}\sum_{x_i} \nabla_\theta f_\theta(g_\theta(x_i)) \end{align} \end{split}\]

since \(p(x)\) does not depend on \(\theta\), and \(x_i\) is sampled from the distribution of \(x\).

observe that it is ok if \(g_\theta\) itself depends on \(\theta\) (hence the subscript \(\theta\)).

so, for this example, we can do as follows

\[\begin{split} \begin{align} x &\sim \mathcal{N}(0,1)\\ z &= g_\theta(x) = 3 + \theta x \\ &\text{preserving the original distribution of }z\\ z &\sim \mathcal{N}(3, \theta) \end{align} \end{split}\]
# sanity check: expectation using sympy and the reparametrization trick

g = lambda x,t: 3+t*x  # the transformation of the reparametrization trick

x = sy.symbols(r'x')
mux = sy.N(0)
sigmax = sy.N(1)

ptx = 1/sy.sqrt(2*sy.pi)*sy.exp(-x**2/2) # standard gaussian distribution for x
ftx = (t+g(x,t))**2

sy.integrate((ptx*ftx).subs({t:tval}), (x,-sy.oo,+sy.oo)).n()
\[\displaystyle 22.5\]
# sanity check: expectation using Monte Carlo and the reparametrization trick
dx = stats.norm(loc=0, scale=1)
ftxn = lambda x: (g(x,tval)+tval)**2
np.mean(ftxn(dx.rvs(100000)))
22.423727915764786
# reparametrization trick: the gradient of the expectation using sympy 
sy.integrate((ptx*ftx).diff(t).subs({t: tval}), (x,-sy.oo, +sy.oo)).n()
\[\displaystyle 12.0\]
# reparametrization trick: the gradient of the expectation using Monte Carlo. It works!!!!!
diff_ftx = sy.lambdify(x, ftx.diff(t).subs({t: tval}), "numpy")
np.mean(diff_ftx(dx.rvs(100000)))
11.992036881620106
# reparametrization trick with TF
tf_tval = tf.Variable(tval)

func    = lambda z,t: (z+t)**2
xsample = tfd.Normal(loc=0, scale=1).sample(100000)

with tf.GradientTape() as tape:
    loss = tf.reduce_mean(func(g(xsample, tf_tval), tf_tval))
        
grad = tape.gradient(loss, tf_tval)
loss, grad
(<tf.Tensor: shape=(), dtype=float32, numpy=22.431013>,
 <tf.Tensor: shape=(), dtype=float32, numpy=11.936471>)

This way of doing the reparametrization trick requires us to find both a distribution for \(x\) and a transformation \(g_\theta\) such that \(z=g_\theta(x)\) and \(z\) has the distribution we want.
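as a side note, this is what TFP already does under the hood for many distributions: tfd.Normal is FULLY_REPARAMETERIZED (its samples are built as loc + scale*N(0,1)), so sampling inside tf.GradientTape from a distribution parameterized by tf_tval, as warned earlier, lets the tape track the sampling operation and recovers the correct gradient. A minimal sketch:

# sampling INSIDE the tape from a distribution parameterized by tf_tval:
# tfd.Normal samples are reparameterized (loc + scale * N(0,1)), so
# gradients flow through the sampling op itself
tf_tval = tf.Variable(tval)

with tf.GradientTape() as tape:
    zsample = tfd.Normal(loc=3, scale=tf_tval).sample(100000)
    loss = tf.reduce_mean((zsample + tf_tval)**2)

tape.gradient(loss, tf_tval)  # should be close to 12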