This page contains source code relating to section 4.4 of Bishop's Pattern Recognition and Machine Learning (2006). This section is about the Laplace approximation.
This technique approximates a distribution \(p\) for which we know the shape \(f\) but not the normalization constant \(Z\):
\[p(z) = \frac{1}{Z} f(z)\]
The approximation is done with a normal distribution \(q\). To determine \(q\) we need to find its mean and variance.
The mean of \(q\) is taken to be the mode of \(p\). We can find this mode, \(z_0\), knowing only the shape of \(p\): take the first derivative of \(f\), set it to zero, and solve:
\[\frac{df}{dz} \rvert_{z=z_0} = 0\]
Let's use a small example to clarify what we are doing here.
Let’s say that our distribution \(p\) is a beta, then:
\[p(\theta|y,n) = \frac{1}{Z} \theta^y (1-\theta)^{n-y}\]
pretending we are unable to find \(Z\) (in this case \(Z\) is actually available).
Example: find the posterior of \(\theta\) given \(n=18, y=10\).
The shape \(f\) is
\[p(\theta) \propto f(\theta) = \theta^{10} (1-\theta)^8\]
So we compute \(df/d\theta\) (here \(\theta\) plays the role of \(z\)) and find its root \(z_0\):
n <- 18
y <- 10 # 10 heads and 8 tails
df_dz <- D(expression(theta^y * (1-theta)^(n-y)), "theta")
f_df_dz <- function(theta) eval(df_dz, envir=list(theta=theta, y=y, n=n))
z0 <- uniroot(f_df_dz, c(.001,.999))$root
z0
## [1] 0.5555318
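As a sanity check, the mode of this shape is available in closed form: setting \(d \log f/d\theta\) to zero gives
\[\frac{d}{d\theta}\log f(\theta) = \frac{y}{\theta} - \frac{n-y}{1-\theta} = 0 \quad\Longrightarrow\quad z_0 = \frac{y}{n} = \frac{10}{18} \approx 0.5556\]
which agrees with the numerical root above.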
With \(z_0\), the normal approximation \(q(z)\) is given by
\[q(z) = \left( \frac{A}{2\pi} \right)^{1/2} \exp\{ -\frac{A}{2} (z-z_0)^2 \}\]
\[A = -\frac{d^2}{dz^2} \log f(z) \rvert_{z=z_0}\]
In R:
d2f_dz2 <- D(D(expression(log(theta^y * (1-theta)^(n-y))),"theta"),"theta")
A <- -eval(d2f_dz2, envir=list(theta=z0, y=y, n=n))
q_z <- function(z) {
sqrt(A/(2*pi)) * exp(-(A/2) * (z-z0)^2)
}
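The curvature can also be checked analytically: differentiating \(\log f\) twice and negating gives
\[A = \frac{y}{z_0^2} + \frac{n-y}{(1-z_0)^2} = \frac{10}{(5/9)^2} + \frac{8}{(4/9)^2} = 32.4 + 40.5 = 72.9\]
so the approximating normal has standard deviation \(1/\sqrt{A} \approx 0.117\).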
To check, let's compare it with the true posterior \(\text{Beta}(11,9)\), which follows from the prior \(\theta \sim \text{Beta}(1,1)\):
curve(dbeta(x, y+1, n-y+1), 0, 1, col="blue", lwd=2,
      xlab=expression(theta), ylab="posterior") # true posterior is a beta
curve(q_z, 0, 1, lwd=2, col="red", add=T) # laplace approximation is a normal
legend("topleft",c("true posterior","laplace approx"), col=c("blue","red"), lty=1, lwd=2)
Let's wrap this example into a function:
# returns the approximating normal density of a beta-shaped posterior
lap_approx_beta <- function(y, n) {
  df_dz <- D(expression(theta^y * (1-theta)^(n-y)), "theta")
  f_df_dz <- function(theta) eval(df_dz, envir=list(theta=theta, y=y, n=n))
  z0 <- uniroot(f_df_dz, c(.001,.999))$root
  d2f_dz2 <- D(D(expression(log(theta^y * (1-theta)^(n-y))),"theta"),"theta")
  A <- -eval(d2f_dz2, envir=list(theta=z0, y=y, n=n))
  function(z) {
    sqrt(A/(2*pi)) * exp(-(A/2) * (z-z0)^2)
  }
}
Notice that the approximation does not always work well (more about this below):
y <- 4
n <- 6
q_z <- lap_approx_beta(y,n)
curve(dbeta(x, y+1, n-y+1), 0, 1, col="blue", lwd=2,
      xlab=expression(theta), ylab="posterior")
curve(q_z, 0, 1, lwd=2, col="red", add=T)
legend("topleft",c("true posterior","laplace approx"), col=c("blue","red"), lty=1, lwd=2)
This section is based on Rasmus Bååth's Easy Laplace Approximation of Bayesian Models in R.
Let’s assume the following model:
\[y_i \sim \mathcal{N}(\mu, \sigma^2)\]
with priors (the second parameter is a standard deviation, matching R's dnorm and dlnorm below)
\[\mu \sim \mathcal{N}(0, 100)\]
\[\sigma \sim \text{LogNormal}(0,4)\]
The posterior is:
\[ \begin{array}{lclr} p(\mu, \sigma | y) & \propto & p(y | \mu, \sigma) p(\mu) p(\sigma) & \color{blue}{\text{Bayes Theorem}~\&~\mu \perp \sigma}\\ \log p(\mu, \sigma | y) & \propto & \log \left[ p(y | \mu, \sigma) p(\mu) p(\sigma) \right] & \\ & = & \log \prod_i p(y_i | \mu, \sigma) + \log p(\mu) + \log p(\sigma) & \color{blue}{y_i~\text{iid}} \\ & = & \sum_i \log p(y_i | \mu, \sigma) + \log p(\mu) + \log p(\sigma) & \\ \end{array} \]
The next R function computes this log posterior:
# Model:
#   y_i   ~ Normal(mu, sigma)
#   mu    ~ Normal(0, 100)
#   sigma ~ LogNormal(0, 4)
# argument p receives the parameter values as a vector c(mu=..., sigma=...)
# argument y is the available data
model <- function(p, y) {
  log_lik         <- sum(dnorm(y, p["mu"], p["sigma"], log = T))  # log-likelihood
  log_prior_mu    <- dnorm(p["mu"], 0, 100, log = T)
  log_prior_sigma <- dlnorm(p["sigma"], 0, 4, log = T)
  log_lik + log_prior_mu + log_prior_sigma
}
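Before optimizing, a quick sanity check helps. This is only a sketch: y_test and the parameter values being compared are illustrative choices, not part of the original code:
y_test <- rnorm(50, mean=10, sd=5)
# the log posterior should be higher near the true parameters than far from them
model(c(mu=10, sigma=5), y_test) > model(c(mu=0, sigma=1), y_test)  # almost surely TRUE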
Now we wish to optimize the parameter values to maximize this log posterior:
# Laplace approximation
laplace_approx <- function(model, inits, ...) {
fit <- optim(inits, model,
control = list(fnscale = -1), # maximize
hessian = TRUE, ...) # Hessian matrix defines curvature at max
param_mean <- fit$par # the posterior modes are the mode
param_cov_mat <- solve(-fit$hessian) # this gives the co-variance matrix
list(param_mean=param_mean, param_cov_mat=param_cov_mat)
}
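This is the multivariate version of the formula used above (Bishop, section 4.4): a second-order Taylor expansion of the log posterior around the mode \(\mathbf{z}_0\) yields
\[q(\mathbf{z}) = \mathcal{N}(\mathbf{z} \mid \mathbf{z}_0, \mathbf{A}^{-1}), \qquad \mathbf{A} = -\nabla\nabla \log f(\mathbf{z}) \big\rvert_{\mathbf{z}=\mathbf{z}_0}\]
which is why the mean is the mode found by optim and the covariance is the inverse of the negative Hessian.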
Let’s try with some data:
y <- rnorm(n=50, mean=10, sd=5)
report <- laplace_approx(model, inits=c(mu=0, sigma=1), y=y)
library(coda)
library(mvtnorm)
# using sampling to approx density
samples <- mcmc(rmvnorm(1e4, report$param_mean, report$param_cov_mat))
plot(samples[,1], trace=FALSE, main="density of mean")
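coda can also summarize these samples directly (the printed numbers will vary with the simulated y):
summary(samples)  # means, sds and quantiles for mu and sigma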
# compare the true data-generating density N(10,5) with the normal given by the fitted modes
curve(dnorm(x,10,5), from=-10, to=30, col="blue", lwd=2, xlab="", ylab="density")
curve(dnorm(x,report$param_mean["mu"],report$param_mean["sigma"]), from=-10, to=30, col="red", lwd=2, add=T)
legend("topleft",c("true density","laplace fit"), col=c("blue","red"), lty=1, lwd=2)
Another example (\(n\) is known):
\[y_i \sim \text{Binomial}(n,\theta)\]
with prior
\[\theta \sim \text{Beta}(1, 1)\]
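Since the \(\text{Beta}(1,1)\) prior is flat, the exact posterior is available for comparison:
\[p(\theta|y) \propto \theta^y (1-\theta)^{n-y} \quad\Longrightarrow\quad \theta|y \sim \text{Beta}(y+1,\, n-y+1)\]
This is the dbeta(x, y+1, n-y+1) curve used as ground truth in the plots.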
model2 <- function(p, y, n) {
  log_lik         <- sum(dbinom(y, n, p["theta"], log = T))  # log-likelihood
  log_prior_theta <- dbeta(p["theta"], 1, 1, log = T)        # log prior
  log_lik + log_prior_theta
}
Let’s also compare Bååth’s method with Bishop’s:
n <- 18
y <- 10 # 10 heads and 8 tails
report <- laplace_approx(model2, inits=c(theta=0.5), y=y, n=n) # using Bååth's method
z0 <- report$param_mean["theta"]
sd0 <- sqrt(report$param_cov_mat)
q_z <- lap_approx_beta(y,n) # using Bishop's method
curve(dbeta(x, y+1, n-y+1), 0, 1, col="blue", lwd=2, xlab=expression(theta), ylab="posterior")
curve(q_z, 0, 1, lwd=4, col="red", add=T)
curve(dnorm(x,z0,sd0),0, 1, lwd=2, col="yellow", add=T)
legend("topleft",c("true posterior","Bishop", "Bååth"), col=c("blue","red", "yellow"), lty=1, lwd=2)
As expected, they gave the same result :-)
We’ve seen that some approximations are not very good:
y <- 4
n <- 6
report <- laplace_approx(model2, inits=c(theta=0.5), y=y, n=n)
z0 <- report$param_mean["theta"]
sd0 <- sqrt(report$param_cov_mat)
thetas <- seq(0,1,len=100)
post_like <- dbeta(thetas,y+1,n-y+1)
plot(thetas, post_like, col="blue", lwd=2, xlab=expression(theta), ylab="posterior", type="l")
laplace_like <- dnorm(thetas, z0, sd0)
dist_diff <- max(laplace_like) / max(post_like)  # scale factor to match peak heights for a visual shape comparison
points(thetas, laplace_like/dist_diff, col="red", lwd=2, type="l")
legend("topleft",c("true posterior","laplace approx"), col=c("blue","red"), lty=1, lwd=2)
When the posterior is skewed, a normal cannot fit it properly. This often happens due to parameter boundaries (here, \(\theta \in [0,1]\)). We can reparameterize \(\theta\) to remove the boundary. Two typical transformations are \(\log\), which maps positive numbers to the reals, and \(\text{logit}\), which maps the interval \((0,1)\) to the reals.
In this example let's redo the computation applying \(\text{logit}\) to the parameter \(\theta\). We no longer optimize over \(\theta\) but over \(\phi = \text{logit}(\theta)\), so the value fed to the model's distributions is \(\theta = \text{sigmoid}(\phi)\):
logit <- function(x) log(x/(1-x))
sigmoid <- function(x) 1/(1+exp(-x)) # logit & sigmoid are inverses
model3 <- function(p, y, n) {
  log_lik         <- sum(dbinom(y, n, sigmoid(p["logit_theta"]), log = T))  # log-likelihood
  log_prior_theta <- dbeta(sigmoid(p["logit_theta"]), 1, 1, log = T)        # log prior
  log_lik + log_prior_theta
}
Let’s test the model with the same data:
n <- 6
y <- 4
report <- laplace_approx(model3, inits=c(logit_theta=0.5), y=y, n=n)
z0 <- report$param_mean["logit_theta"]
sd0 <- sqrt(report$param_cov_mat)
thetas <- seq(0,1,len=100)
post_like <- dbeta(thetas,y+1,n-y+1)
plot(thetas, post_like, col="blue", lwd=2, xlab=expression(theta), ylab="posterior", type="l")
# the approximation lives on the logit scale, so evaluate the normal at logit(theta)
thetas_logit <- logit(thetas)
laplace_like <- dnorm(thetas_logit, z0, sd0)
dist_diff <- max(laplace_like) / max(post_like)  # match peak heights (see the note below)
points(thetas, laplace_like/dist_diff, col="red", lwd=2, type="l")
legend("topleft",c("true posterior","laplace approx"), col=c("blue","red"), lty=1, lwd=2)
The results are now much better!
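A final note on the rescaling trick: the approximating normal is a density over \(\phi = \text{logit}(\theta)\), not over \(\theta\). Transforming it back exactly would require the change-of-variables Jacobian
\[p_\theta(\theta) = p_\phi(\text{logit}(\theta)) \cdot \frac{1}{\theta(1-\theta)}\]
Matching peak heights, as done above, is a simpler shortcut that suffices for a visual comparison.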