This page contains source code relating to section 4.4 of Bishop's Pattern Recognition and Machine Learning (2006). This section is about the Laplace approximation.
This technique approximates a distribution \(p\) for which we know the shape \(f\) but not the normalization constant \(Z\):
\[p(z) = \frac{1}{Z} f(z)\]
The approximation is done with a normal distribution \(q\). To determine \(q\) we need to find its mean and variance.
The mean of \(q\) is taken to be the mode of \(p\). We can find this mode, \(z_0\), knowing only the shape of \(p\): take the first derivative of \(f\), set it to zero, and solve:
\[\frac{df}{dz} \rvert_{z=z_0} = 0\]
Let's use a small example to clarify what we are doing here.
Let’s say that our distribution \(p\) is a beta, then:
\[p(\theta|y,n) = \frac{1}{Z} \theta^y (1-\theta)^{n-y}\]
pretending we are unable to find \(Z\) (in this case \(Z\) is actually available).
Example: find the posterior of \(\theta\) given \(n=18, y=10\).
The shape \(f\) is
\[p(\theta) \propto f(\theta) = \theta^{10} (1-\theta)^8\]
So we compute \(df/d\theta\) (here \(\theta\) plays the role of \(z\)) and find its root \(z_0\):
n <- 18
y <- 10 # 10 heads and 8 tails
df_dz <- D(expression(theta^y * (1-theta)^(n-y)), "theta")
f_df_dz <- function(theta) eval(df_dz, envir=list(theta=theta, y=y, n=n))
z0 <- uniroot(f_df_dz, c(.001,.999))$root
z0
## [1] 0.5555318
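As a sanity check, the mode of this shape is available in closed form: setting \(d \log f/d\theta\) to zero gives
\[\frac{d}{d\theta}\log f(\theta) = \frac{y}{\theta} - \frac{n-y}{1-\theta} = 0 \quad\Longrightarrow\quad z_0 = \frac{y}{n} = \frac{10}{18} \approx 0.5556\]
which agrees with the numerical root above.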
With \(z_0\), the normal approximation \(q(z)\) is given by
\[q(z) = \left( \frac{A}{2\pi} \right)^{1/2} \exp\{ -\frac{A}{2} (z-z_0)^2 \}\]
\[A = -\frac{d^2}{dz^2} \log f(z) \rvert_{z=z_0}\]
In R:
d2f_dz2 <- D(D(expression(log(theta^y * (1-theta)^(n-y))),"theta"),"theta")
A <- -eval(d2f_dz2, envir=list(theta=z0, y=y, n=n))
q_z <- function(z) {
sqrt(A/(2*pi)) * exp(-(A/2) * (z-z0)^2)
}
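The curvature can also be checked analytically: differentiating \(\log f\) twice and negating gives
\[A = \frac{y}{z_0^2} + \frac{n-y}{(1-z_0)^2} = \frac{10}{(5/9)^2} + \frac{8}{(4/9)^2} = 32.4 + 40.5 = 72.9\]
so the approximating normal has standard deviation \(1/\sqrt{A} \approx 0.117\).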
To check, let's compare it with the true posterior \(\text{Beta}(11,9)\), which follows from the prior \(\theta \sim \text{Beta}(1,1)\):
curve(dbeta(x, y+1, n-y+1), 0, 1, col="blue", lwd=2,
      xlab=expression(theta), ylab="posterior") # true posterior is a beta
curve(q_z, 0, 1, lwd=2, col="red", add=T) # laplace approximation is a normal
legend("topleft",c("true posterior","laplace approx"), col=c("blue","red"), lty=1, lwd=2)
Let's wrap this example into a function:
# returns the approximating normal density of a beta-shaped posterior
lap_approx_beta <- function(y, n) {
  df_dz <- D(expression(theta^y * (1-theta)^(n-y)), "theta")
  f_df_dz <- function(theta) eval(df_dz, envir=list(theta=theta, y=y, n=n))
  z0 <- uniroot(f_df_dz, c(.001,.999))$root
  d2f_dz2 <- D(D(expression(log(theta^y * (1-theta)^(n-y))),"theta"),"theta")
  A <- -eval(d2f_dz2, envir=list(theta=z0, y=y, n=n))
  function(z) {
    sqrt(A/(2*pi)) * exp(-(A/2) * (z-z0)^2)
  }
}
Notice that the approximation does not always work well (more about this below):
y <- 4
n <- 6
q_z <- lap_approx_beta(y,n)
curve(dbeta(x, y+1, n-y+1), 0, 1, col="blue", lwd=2,
      xlab=expression(theta), ylab="posterior")
curve(q_z, 0, 1, lwd=2, col="red", add=T)
legend("topleft",c("true posterior","laplace approx"), col=c("blue","red"), lty=1, lwd=2)
This section is based on Rasmus Bååth's Easy Laplace Approximation of Bayesian Models in R.
Let’s assume the following model:
\[y_i \sim \mathcal{N}(\mu, \sigma^2)\]
with priors (the second parameter is a standard deviation, matching R's dnorm and dlnorm below)
\[\mu \sim \mathcal{N}(0, 100)\]
\[\sigma \sim \text{LogNormal}(0,4)\]
The posterior is:
\[ \begin{array}{lclr} p(\mu, \sigma | y) & \propto & p(y | \mu, \sigma) p(\mu) p(\sigma) & \color{blue}{\text{Bayes Theorem}~\&~\mu \perp \sigma}\\ \log p(\mu, \sigma | y) & \propto & \log \left[ p(y | \mu, \sigma) p(\mu) p(\sigma) \right] & \\ & = & \log \prod_i p(y_i | \mu, \sigma) + \log p(\mu) + \log p(\sigma) & \color{blue}{y_i~\text{iid}} \\ & = & \sum_i \log p(y_i | \mu, \sigma) + \log p(\mu) + \log p(\sigma) & \\ \end{array} \]
The next R function computes this log posterior:
# Model:
#   y_i   ~ Normal(mu, sigma)
#   mu    ~ Normal(0, 100)
#   sigma ~ LogNormal(0, 4)
# argument p receives the parameter values as a vector c(mu=..., sigma=...)
# argument y is the available data
model <- function(p, y) {
  log_lik         <- sum(dnorm(y, p["mu"], p["sigma"], log = T))  # log-likelihood
  log_prior_mu    <- dnorm(p["mu"], 0, 100, log = T)
  log_prior_sigma <- dlnorm(p["sigma"], 0, 4, log = T)
  log_lik + log_prior_mu + log_prior_sigma
}
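Before optimizing, a quick sanity check helps. This is only a sketch: y_test and the parameter values being compared are illustrative choices, not part of the original code:
y_test <- rnorm(50, mean=10, sd=5)
# the log posterior should be higher near the true parameters than far from them
model(c(mu=10, sigma=5), y_test) > model(c(mu=0, sigma=1), y_test)  # almost surely TRUE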
Now we wish to optimize the parameter values to maximize this log posterior:
# Laplace approximation
laplace_approx <- function(model, inits, ...) {
fit <- optim(inits, model,
control = list(fnscale = -1), # maximize
hessian = TRUE, ...) # Hessian matrix defines curvature at max
param_mean <- fit$par # the posterior modes are the mode
param_cov_mat <- solve(-fit$hessian) # this gives the co-variance matrix
list(param_mean=param_mean, param_cov_mat=param_cov_mat)
}
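This is the multivariate version of the formula used above (Bishop, section 4.4): a second-order Taylor expansion of the log posterior around the mode \(\mathbf{z}_0\) yields
\[q(\mathbf{z}) = \mathcal{N}(\mathbf{z} \mid \mathbf{z}_0, \mathbf{A}^{-1}), \qquad \mathbf{A} = -\nabla\nabla \log f(\mathbf{z}) \big\rvert_{\mathbf{z}=\mathbf{z}_0}\]
which is why the mean is the mode found by optim and the covariance is the inverse of the negative Hessian.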
Let’s try with some data:
y <- rnorm(n=50, mean=10, sd=5)
report <- laplace_approx(model, inits=c(mu=0, sigma=1), y=y)
library(coda)
library(mvtnorm)
# using sampling to approx density
samples <- mcmc(rmvnorm(1e4, report$param_mean, report$param_cov_mat))
plot(samples[,1], trace=FALSE, main="density of mean")
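coda can also summarize these samples directly (the printed numbers will vary with the simulated y):
summary(samples)  # means, sds and quantiles for mu and sigma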
# compare the true data-generating density N(10,5) with the normal given by the fitted modes
curve(dnorm(x,10,5), from=-10, to=30, col="blue", lwd=2, xlab="", ylab="density")
curve(dnorm(x,report$param_mean["mu"],report$param_mean["sigma"]), from=-10, to=30, col="red", lwd=2, add=T)
legend("topleft",c("true density","laplace fit"), col=c("blue","red"), lty=1, lwd=2)
Another example (\(n\) is known):
\[y_i \sim \text{Binomial}(n,\theta)\]
with prior
\[\theta \sim \text{Beta}(1, 1)\]
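Since the \(\text{Beta}(1,1)\) prior is flat, the exact posterior is available for comparison:
\[p(\theta|y) \propto \theta^y (1-\theta)^{n-y} \quad\Longrightarrow\quad \theta|y \sim \text{Beta}(y+1,\, n-y+1)\]
This is the dbeta(x, y+1, n-y+1) curve used as ground truth in the plots.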
model2 <- function(p, y, n) {
  log_lik         <- sum(dbinom(y, n, p["theta"], log = T))  # log-likelihood
  log_prior_theta <- dbeta(p["theta"], 1, 1, log = T)        # log prior
  log_lik + log_prior_theta
}
Let’s also compare Bååth’s method with Bishop’s:
n <- 18
y <- 10 # 10 heads and 8 tails
report <- laplace_approx(model2, inits=c(theta=0.5), y=y, n=n) # using Bååth's method
z0 <- report$param_mean["theta"]
sd0 <- sqrt(report$param_cov_mat)
q_z <- lap_approx_beta(y,n) # using Bishop's method
curve(dbeta(x, y+1, n-y+1), 0, 1, col="blue", lwd=2, xlab=expression(theta), ylab="posterior")
curve(q_z, 0, 1, lwd=4, col="red", add=T)
curve(dnorm(x,z0,sd0),0, 1, lwd=2, col="yellow", add=T)
legend("topleft",c("true posterior","Bishop", "Bååth"), col=c("blue","red", "yellow"), lty=1, lwd=2)
As expected, they gave the same result :-)
We’ve seen that some approximations are not very good:
y <- 4
n <- 6
report <- laplace_approx(model2, inits=c(theta=0.5), y=y, n=n)
z0 <- report$param_mean["theta"]
sd0 <- sqrt(report$param_cov_mat)
thetas <- seq(0,1,len=100)
post_like <- dbeta(thetas,y+1,n-y+1)
plot(thetas, post_like, col="blue", lwd=2, xlab=expression(theta), ylab="posterior", type="l")
laplace_like <- dnorm(thetas, z0, sd0)
dist_diff <- max(laplace_like) / max(post_like)  # scale factor to match peak heights for a visual shape comparison
points(thetas, laplace_like/dist_diff, col="red", lwd=2, type="l")
legend("topleft",c("true posterior","laplace approx"), col=c("blue","red"), lty=1, lwd=2)
When the posterior is skewed, a normal cannot fit it properly. This often happens due to parameter boundaries (here, \(\theta \in [0,1]\)). We can reparameterize \(\theta\) to remove the boundary. Two typical transformations are \(\log\), which maps positive numbers to the reals, and \(\text{logit}\), which maps the interval \((0,1)\) to the reals.
In this example let's redo the computation applying \(\text{logit}\) to the parameter \(\theta\). We no longer optimize over \(\theta\) but over \(\phi = \text{logit}(\theta)\), so the value fed to the model's distributions is \(\theta = \text{sigmoid}(\phi)\):
logit <- function(x) log(x/(1-x))
sigmoid <- function(x) 1/(1+exp(-x)) # logit & sigmoid are inverses
model3 <- function(p, y, n) {
  log_lik         <- sum(dbinom(y, n, sigmoid(p["logit_theta"]), log = T))  # log-likelihood
  log_prior_theta <- dbeta(sigmoid(p["logit_theta"]), 1, 1, log = T)        # log prior
  log_lik + log_prior_theta
}
Let’s test the model with the same data:
n <- 6
y <- 4
report <- laplace_approx(model3, inits=c(logit_theta=0.5), y=y, n=n)
z0 <- report$param_mean["logit_theta"]
sd0 <- sqrt(report$param_cov_mat)
thetas <- seq(0,1,len=100)
post_like <- dbeta(thetas,y+1,n-y+1)
plot(thetas, post_like, col="blue", lwd=2, xlab=expression(theta), ylab="posterior", type="l")
# the approximation lives on the logit scale, so evaluate the normal at logit(theta)
thetas_logit <- logit(thetas)
laplace_like <- dnorm(thetas_logit, z0, sd0)
dist_diff <- max(laplace_like) / max(post_like)  # match peak heights (see the note below)
points(thetas, laplace_like/dist_diff, col="red", lwd=2, type="l")
legend("topleft",c("true posterior","laplace approx"), col=c("blue","red"), lty=1, lwd=2)
The results are now much better!
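A final note on the rescaling trick: the approximating normal is a density over \(\phi = \text{logit}(\theta)\), not over \(\theta\). Transforming it back exactly would require the change-of-variables Jacobian
\[p_\theta(\theta) = p_\phi(\text{logit}(\theta)) \cdot \frac{1}{\theta(1-\theta)}\]
Matching peak heights, as done above, is a simpler shortcut that suffices for a visual comparison.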