This page contains source code relating to section 4.4 of Bishop’s Pattern Recognition and Machine Learning (2006)

This section is about Laplace Approximation.

This technique approximates a distribution \(p\) for which we know only the shape \(f\), but not the normalization constant \(Z\):

\[p(z) = \frac{1}{Z} f(z)\]

The approximation is done with a normal distribution \(q\). To determine \(q\) we need to find its mean (which will sit at the mode) and its variance.

The approximation places the mode of \(q\) at the mode of \(p\). We can find this mode, \(z_0\), knowing only the shape of \(p\): take the first derivative, set it to zero and solve:

\[\frac{df}{dz} \rvert_{z=z_0} = 0\]

Let’s use a small example to clarify what we are doing here.

Let’s say that our distribution \(p\) is a beta, then:

\[p(\theta|y,n) = \frac{1}{Z} \theta^y (1-\theta)^{n-y}\]

pretending we are unable to find \(Z\) (in this case, \(Z\) is in fact available).

Example: find the posterior of \(\theta\) given \(n=18\), \(y=10\).

Shape \(f\) is

\[p(\theta) \propto f(\theta) = \theta^{10} (1-\theta)^8\]
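
As an aside, in this particular example \(Z\) is in fact known in closed form: it is the beta function \(B(y+1, n-y+1)\), available in base R (we will nonetheless pretend we cannot compute it):

beta(10+1, 18-10+1)   # closed-form normalization constant for this example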

We now compute \(df/dz\) (here the variable \(z\) is \(\theta\)) and find its root \(z_0\):

n <- 18
y <- 10 # 10 heads and 8 tails

df_dz   <- D(expression(theta^y * (1-theta)^(n-y)), "theta")   # symbolic df/dz
f_df_dz <- function(theta) eval(df_dz, envir=list(theta=theta, y=y, n=n))
z0      <- uniroot(f_df_dz, c(.001,.999))$root                 # root of df/dz, i.e. the mode
z0
## [1] 0.5555318
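
As a check, the mode of this shape is also available in closed form: setting \(d \log f / d\theta = y/\theta - (n-y)/(1-\theta)\) to zero gives \(\theta = y/n\):

y/n   # closed-form mode 10/18, matching uniroot's root up to its tolerance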

With \(z_0\), the normal approximation \(q(z)\) is given by

\[q(z) = \left( \frac{A}{2\pi} \right)^{1/2} \exp\{ -\frac{A}{2} (z-z_0)^2 \}\]

\[A = -\frac{d^2}{dz^2} \log f(z) \rvert_{z=z_0}\]
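
For our beta-shaped example this second derivative can also be computed by hand, a useful cross-check for the symbolic code below:

\[\log f(\theta) = y \log \theta + (n-y) \log(1-\theta) \quad\Rightarrow\quad A = \frac{y}{z_0^2} + \frac{n-y}{(1-z_0)^2}\]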

In R:

d2f_dz2 <- D(D(expression(log(theta^y * (1-theta)^(n-y))),"theta"),"theta")  # second derivative of log f
A       <- -eval(d2f_dz2, envir=list(theta=z0, y=y, n=n))                    # curvature A at the mode z0

q_z <- function(z) {
  sqrt(A/(2*pi)) * exp(-(A/2) * (z-z0)^2)
}
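
As a sanity check, \(q\) is a normal density and should integrate to 1; base R’s integrate confirms this numerically:

integrate(q_z, -Inf, Inf)   # should return ~1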

To check, let’s compare it with the true posterior \(\text{Beta}(11,9)\), obtained assuming the prior \(\theta \sim \text{Beta}(1,1)\):

curve(dbeta(x,y+1,n-y+1), 0 ,1, col="blue", lwd=2,
      xlab=expression(theta), ylab="posterior") # true posterior is a beta
curve(q_z, 0, 1, lwd=2, col="red", add=T)         # laplace approximation is a normal
legend("topleft",c("true posterior","laplace approx"), col=c("blue","red"), lty=1, lwd=2)

Let’s wrap this example into a function:

# returns approximating normal distribution of a beta
lap_approx_beta <- function(y, n) {
  
  df_dz   <- D(expression(theta^y * (1-theta)^(n-y)), "theta")
  f_df_dz <- function(theta) eval(df_dz, envir=list(theta=theta, y=y, n=n))
  z0      <- uniroot(f_df_dz, c(.001,.999))$root
  
  d2f_dz2 <- D(D(expression(log(theta^y * (1-theta)^(n-y))),"theta"),"theta")
  A       <- -eval(d2f_dz2, envir=list(theta=z0, y=y, n=n))

  function(z) {
    sqrt(A/(2*pi)) * exp(-(A/2) * (z-z0)^2)
  }
}

Notice that the approximation does not always work well (more on this below):

y   <- 4
n   <- 6
q_z <- lap_approx_beta(y,n)

curve(dbeta(x,y+1,n-y+1), 0 ,1, col="blue", lwd=2,
      xlab=expression(theta), ylab="posterior") 
curve(q_z, 0, 1, lwd=2, col="red", add=T)         
legend("topleft",c("true posterior","laplace approx"), col=c("blue","red"), lty=1, lwd=2)

Approximating Bayesian Models

This section is based on Rasmus Bååth’s post Easy Laplace Approximation of Bayesian Models in R.

Let’s assume the following model:

\[y_i \sim \mathcal{N}(\mu, \sigma^2)\]

with priors

\[\mu \sim \mathcal{N}(0, 100^2)\]

\[\sigma \sim \text{LogNormal}(0, 4^2)\]

The posterior is:

\[ \begin{array}{lclr} p(\mu, \sigma | y) & \propto & p(y | \mu, \sigma)\, p(\mu)\, p(\sigma) & \color{blue}{\text{Bayes' theorem}~\&~\mu \perp \sigma}\\ \log p(\mu, \sigma | y) & = & \log \left[ p(y | \mu, \sigma)\, p(\mu)\, p(\sigma) \right] + \text{const} & \\ & = & \log \prod_i p(y_i | \mu, \sigma) + \log p(\mu) + \log p(\sigma) + \text{const} & \color{blue}{y_i~\text{iid}} \\ & = & \sum_i \log p(y_i | \mu, \sigma) + \log p(\mu) + \log p(\sigma) + \text{const} & \\ \end{array} \]

The next R function computes this log-posterior (up to the additive constant):

# Model
#   y_i   ~ Normal(mu, sigma)
#   mu    ~ Normal(0,100)
#   sigma ~ LogNormal(0,4)

# argument p receives the parameter values, it's a vector c(mu=..., sigma=...)
# argument y is the available data

model <- function(p, y) {
  log_lik         <- sum(dnorm(y, p["mu"], p["sigma"], log = T))    # log likelihood
  log_prior_mu    <- dnorm(p["mu"],     0, 100, log = T)
  log_prior_sigma <- dlnorm(p["sigma"], 0,   4, log = T)
  
  log_lik + log_prior_mu + log_prior_sigma
}
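
As a quick sanity check (with toy data made up here just for illustration), the function returns a single finite log-posterior value at a sensible point:

model(c(mu=10, sigma=5), y=c(9.8, 10.2, 10.5))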

Now we wish to find the parameter values that maximize this log-posterior:

# Laplace approximation
laplace_approx <- function(model, inits, ...) {
  
  fit <- optim(inits, model, 
               control = list(fnscale = -1), # maximize instead of minimize
               hessian = TRUE, ...)          # Hessian matrix gives the curvature at the maximum
  param_mean    <- fit$par                   # the posterior mode becomes the mean of the approximation
  param_cov_mat <- solve(-fit$hessian)       # inverse of the negative Hessian is the covariance matrix
  
  list(param_mean=param_mean, param_cov_mat=param_cov_mat)
}

Let’s try with some data:

y <- rnorm(n=50, mean=10, sd=5)

report <- laplace_approx(model, inits=c(mu=0, sigma=1), y=y)

library(coda)
library(mvtnorm)
# using sampling to approx density
samples <- mcmc(rmvnorm(1e4, report$param_mean, report$param_cov_mat))
plot(samples[,1], trace=FALSE, main="density of mean")
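
coda also gives numerical summaries of the sampled approximation, and the posterior standard deviations can be read directly off the Laplace covariance matrix (a quick check):

summary(samples)                  # numerical summary of the mu and sigma samples
sqrt(diag(report$param_cov_mat))  # posterior sds straight from the covariance matrix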

# compare the true data-generating distribution with the normal implied
# by the estimated (posterior mean) parameters
curve(dnorm(x,10,5), from=-10, to=30, col="blue", lwd=2, xlab="", ylab="density")
curve(dnorm(x,report$param_mean["mu"],report$param_mean["sigma"]), from=-10, to=30, col="red", lwd=2, add=T)
legend("topleft",c("true data distribution","laplace estimate"), col=c("blue","red"), lty=1, lwd=2)

Another example (n is known):

\[y_i \sim \text{Binomial}(n,\theta)\]

with prior

\[\theta \sim \text{Beta}(1, 1)\]

model2 <- function(p, y, n) {

  log_lik         <- sum(dbinom(y, n, p["theta"], log = T))   # log likelihood
  log_prior_theta <- dbeta(p["theta"], 1, 1, log = T)         # log prior (flat, so this is 0)
  
  log_lik + log_prior_theta
}

Let’s also compare Bååth’s method with Bishop’s:

n <- 18
y <- 10 # 10 heads and 8 tails

report <- laplace_approx(model2, inits=c(theta=0.5), y=y, n=n) # using Bååth's method
z0  <- report$param_mean["theta"]
sd0 <- sqrt(report$param_cov_mat)

q_z <- lap_approx_beta(y,n)                                    # using Bishop's method

curve(dbeta(x,y+1,n-y+1), 0 ,1, col="blue", lwd=2, xlab=expression(theta), ylab="posterior") 
curve(q_z,            0, 1, lwd=4, col="red", add=T)    
curve(dnorm(x,z0,sd0),0, 1, lwd=2, col="yellow", add=T)
legend("topleft",c("true posterior","Bishop", "Bååth"), col=c("blue","red", "yellow"), lty=1, lwd=2)

As expected, they gave the same result :-)

Reparametrization

We’ve seen that some approximations are not very good:

y <- 4
n <- 6

report <- laplace_approx(model2, inits=c(theta=0.5), y=y, n=n)
z0     <- report$param_mean["theta"]
sd0    <- sqrt(report$param_cov_mat)

thetas    <- seq(0,1,len=100)
post_like <- dbeta(thetas,y+1,n-y+1)
plot(thetas, post_like, col="blue", lwd=2, xlab=expression(theta), ylab="posterior", type="l")

laplace_like <- dnorm(thetas, z0, sd0)
dist_diff    <- max(laplace_like) / max(post_like)  # height ratio, used only to rescale for plotting
points(thetas, laplace_like/dist_diff, col="red", lwd=2, type="l")
legend("topleft",c("true posterior","laplace approx"), col=c("blue","red"), lty=1, lwd=2)

When the posterior starts to skew, the normal cannot fit it properly. This sometimes happens due to parameter boundaries (here, \(\theta \in [0,1]\)). We can try to reparametrize \(\theta\) so as to remove the boundary. Two typical transformations are \(\log\), which maps the positive reals onto the entire real line, and \(\text{logit}\), which maps the interval \((0,1)\) onto the real line.

In this example let’s redo the computation applying \(\text{logit}\) to the parameter \(\theta\). We will optimize not over \(\theta\) but over \(z = \text{logit}(\theta)\); inside the model we recover \(\theta = \text{sigmoid}(z)\) before feeding it to the distributions:

logit   <- function(x) log(x/(1-x))
sigmoid <- function(x) 1/(1+exp(-x))  # logit & sigmoid are inverses
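
# quick sanity check that sigmoid undoes logit:
sigmoid(logit(0.3))   # returns 0.3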

model3 <- function(p, y, n) {

  log_lik         <- sum(dbinom(y, n, sigmoid(p["logit_theta"]), log = T))
  log_prior_theta <- dbeta(sigmoid(p["logit_theta"]), 1, 1, log = T)
  
  log_lik + log_prior_theta
}
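
One caveat worth flagging: a full change of variables would also add the log-Jacobian of the transformation, \(\log \theta (1-\theta)\), to the log posterior. The model above omits it, which is fine for this illustration, but a sketch of the adjusted version (a hypothetical model3_jac, not used below) would be:

model3_jac <- function(p, y, n) {

  theta     <- sigmoid(p["logit_theta"])
  log_lik   <- sum(dbinom(y, n, theta, log = T))
  log_prior <- dbeta(theta, 1, 1, log = T)
  log_jac   <- log(theta) + log(1-theta)   # log |d theta / d logit(theta)|
  
  log_lik + log_prior + log_jac
}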

Let’s test the model with the same data:

n <- 6
y <- 4

report <- laplace_approx(model3, inits=c(logit_theta=0.5), y=y, n=n)
z0     <- report$param_mean["logit_theta"]
sd0    <- sqrt(report$param_cov_mat)

thetas    <- seq(0,1,len=100)
post_like <- dbeta(thetas,y+1,n-y+1)
plot(thetas, post_like, col="blue", lwd=2, xlab=expression(theta), ylab="posterior", type="l")

# the approximation lives on the logit scale, so map thetas there before evaluating
thetas_logit <- logit(thetas)
laplace_like <- dnorm(thetas_logit, z0, sd0)
dist_diff    <- max(laplace_like) / max(post_like)
points(thetas, laplace_like/dist_diff, col="red", lwd=2, type="l")
legend("topleft",c("true posterior","laplace approx"), col=c("blue","red"), lty=1, lwd=2)

The results are now much better!