This page contains source code relating to chapter 6 of Bishop's Pattern Recognition and Machine Learning (2009).
This chapter is about Kernel Methods.
Kernels are functions of the form
\[k(x,z) = \phi(x)^T \phi(z)\]
where \(\phi\) is a basis function (cf. chapter 3).
A radial basis function is a kernel that depends only on the distance between datapoints, i.e., \(k(x,z) = f(\|x-z\|)\).
An example:
\[k(x,z) = \exp(-\gamma \|x-z\|^2)\]
make_gaussian_kernel <- function(gamma=1.0) {
  function(x,z) {
    exp(-gamma * norm(as.matrix(x-z),"F")^2)
  }
}
gaussian_kernel <- make_gaussian_kernel(0.2)
gaussian_kernel(x=c(1,1), z=c(2,2))
## [1] 0.67032
The hard way to check that a given expression is a valid kernel is to find a basis \(\phi\) such that the expression can be written as the inner product \(\phi(x)^T \phi(z)\).
E.g., for two-dimensional inputs, \(k(x,z) = (x^Tz)^2\) can be defined using the basis \(\phi(x) = (x_1^2, \sqrt{2}x_1x_2,x_2^2)^T\).
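A quick numerical check of this decomposition (a small sketch; the names k_poly2, phi2, x.ex and z.ex are just for illustration):
# kernel defined directly vs. via the explicit basis phi
k_poly2 <- function(x,z) (sum(x*z))^2
phi2    <- function(x)   c(x[1]^2, sqrt(2)*x[1]*x[2], x[2]^2)

x.ex <- c(1,2); z.ex <- c(3,-1)
k_poly2(x.ex, z.ex)           # (1*3 + 2*(-1))^2 = 1
sum(phi2(x.ex)*phi2(z.ex))    # same value, computed as phi(x)^T phi(z)
Both calls return the same value, confirming that \((x^Tz)^2\) is indeed an inner product in the feature space defined by \(\phi\).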
A simpler way is to build kernels from previous ones using rules like:
\(k(x,z) = f(x) k_1(x,z) f(z)\)
\(k(x,z) = q(k_1(x,z))\), \(q\) is a polynomial with nonnegative coefficients
\(k(x,z) = k_1(x,z) + k_2(x,z)\)
\(k(x,z) = k_1(x,z) k_2(x,z)\)
\(k(x,z) = \exp( k_1(x,z) )\)
\(k(x,z) = x^T A z\), \(A\) is a symmetric positive semidefinite matrix
where \(k_1,k_2\) are already known kernels.
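For instance, a Gram matrix built with a kernel combined from these rules, such as \(k(x,z) = (x^Tz+1)^2\,k_{\text{gauss}}(x,z)\), should be positive semidefinite. A small numerical sanity check (the names new_kernel and pts are illustrative; it reuses the gaussian_kernel defined above):
# a polynomial with nonnegative coefficients of the linear kernel,
# multiplied by the Gaussian kernel defined above
new_kernel <- function(x,z) (sum(x*z)+1)^2 * gaussian_kernel(x,z)

set.seed(101)
pts  <- matrix(runif(20,-1,1), ncol=2)      # 10 random 2D points
gram <- matrix(NA, nrow(pts), nrow(pts))
for (i in 1:nrow(pts))
  for (j in 1:nrow(pts))
    gram[i,j] <- new_kernel(pts[i,], pts[j,])
min(eigen(gram, symmetric=TRUE)$values)     # should be (numerically) nonnegative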
The RBF network was originally used for exact interpolation of a training dataset \(X=(x_1,\ldots,x_N)\), \(Y=(y_1,\ldots,y_N)\).
The task was to find \(W=(w_1,w_2,\ldots,w_N)\) such that
\[y(x,W) = \sum_{i=1}^N w_i \times \exp(-\gamma \|x-x_i\|^2)\]
gave a perfect match for all the initial datapoints \((x_i,y_i)\), i.e., \(y_i = y(x_i,W)\).
The design matrix \(\Phi\) is (with a bias parameter \(w_0\)),
\[ \Phi = \left\lbrack \begin{matrix} 1 & \exp(-\gamma \|x_1-x_1\|^2) & \cdots & \exp(-\gamma \|x_1-x_N\|^2) \cr 1 & \exp(-\gamma \|x_2-x_1\|^2) & \cdots & \exp(-\gamma \|x_2-x_N\|^2) \cr \vdots & \vdots & \ddots & \vdots \cr 1 & \exp(-\gamma \|x_N-x_1\|^2) & \cdots & \exp(-\gamma \|x_N-x_N\|^2) \cr \end{matrix} \right\rbrack \]
And so \[W = (\Phi^T \Phi)^{-1} \Phi^T Y\] which gives an exact interpolation of the dataset (it's the pseudo-inverse \((\Phi^T \Phi)^{-1} \Phi^T\) instead of the inverse \(\Phi^{-1}\) because, by including the bias, \(\Phi\) is no longer a square matrix).
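Here is a small sketch of this exact interpolation on a 1D toy dataset, using all \(N\) points as centers (the toy data and the .toy variable names are just for illustration):
library(corpcor)   # pseudoinverse()
x.toy     <- seq(0, 2*pi, len=10)
y.toy     <- sin(x.toy)
gamma.toy <- 1.0
# design matrix: bias column plus one Gaussian basis per training point, N x (N+1)
Phi.toy <- cbind(1, outer(x.toy, x.toy, function(a,b) exp(-gamma.toy*(a-b)^2)))
W.toy   <- pseudoinverse(t(Phi.toy) %*% Phi.toy) %*% t(Phi.toy) %*% y.toy
max(abs(Phi.toy %*% W.toy - y.toy))   # residuals should be numerically (close to) zero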
However, if \(N\) is large, using all in-sample datapoints is excessive and the model might overfit. A variant is to choose \(K\) representatives (\(K \ll N\)) that do the interpolation. Choosing the best \(K\) centers among the possible \(x_i \in X\) is NP-hard, so a practical option is to select the \(K\) centers with a clustering algorithm, namely K-means. Let's name those centers \(\mu_1, \ldots, \mu_K\). These \(\mu_k\) define the set of basis functions.
Thus, the matrix form reduces the design matrix \(\Phi\) from a \(N \times (N+1)\) matrix to a smaller \(N \times (K+1)\) matrix:
\[ \Phi = \left\lbrack \begin{matrix} 1 & \exp(-\gamma \|x_1-\mu_1\|^2) & \cdots & \exp(-\gamma \|x_1-\mu_K\|^2) \cr 1 & \exp(-\gamma \|x_2-\mu_1\|^2) & \cdots & \exp(-\gamma \|x_2-\mu_K\|^2) \cr \vdots & \vdots & \ddots & \vdots \cr 1 & \exp(-\gamma \|x_N-\mu_1\|^2) & \cdots & \exp(-\gamma \|x_N-\mu_K\|^2) \cr \end{matrix} \right\rbrack \]
In R:
# returns an RBF model given the:
#  * observations X = (x1, ..., xN)
#  * output value for each observation Y = (y1, ..., yN)
#  * number of centers K
#  * gamma value
rbf <- function(X, Y, K=12, gamma=1.0) {
  library(corpcor)                 # includes pseudoinverse()
  library(stats)                   # includes kmeans()
  N     <- dim(X)[1]               # number of observations
  ncols <- dim(X)[2]               # number of variables
  repeat {
    km <- kmeans(X, K)             # let's cluster K centers out of the dataset
    if (min(km$size)>0)            # only accept if there are no empty clusters
      break
  }
  mus <- km$centers                # the cluster centers
  Phi <- matrix(rep(NA,(K+1)*N), ncol=K+1)
  for (lin in 1:N) {
    Phi[lin,1] <- 1                # bias column
    for (col in 1:K) {
      Phi[lin,col+1] <- exp(-gamma * norm(as.matrix(X[lin,]-mus[col,]),"F")^2)
    }
  }
  w <- pseudoinverse(t(Phi) %*% Phi) %*% t(Phi) %*% Y  # find the RBF weights
  list(weights=w, centers=mus, gamma=gamma)            # return the RBF model
}
And an implementation of the prediction function:
rbf.predict <- function(model, X, classification=FALSE) {
  gamma   <- model$gamma
  centers <- model$centers
  w       <- model$weights
  N       <- dim(X)[1]     # number of observations
  pred    <- rep(w[1],N)   # we need to init to a value, so let's start with the bias
  for (j in 1:N) {
    # find the prediction for point x_j
    for (k in 1:length(centers[,1])) {
      # the weight for center[k] is given by w[k+1] (because w[1] is the bias)
      pred[j] <- pred[j] + w[k+1] * exp( -gamma * norm(as.matrix(X[j,]-centers[k,]),"F")^2 )
    }
  }
  if (classification) {
    pred <- unlist(lapply(pred, sign))
  }
  pred
}
Let’s see an example:
target <- function(x1, x2) {
  2*(x2 - x1 + .25*sin(pi*x1) >= 0) - 1
}

N <- 100
X <- data.frame(x1=runif(N, min=-1, max=1),
                x2=runif(N, min=-1, max=1))
Y <- target(X$x1, X$x2)
plot(X$x1, X$x2, col=Y+3)
Now let’s learn the dataset using the RBFs:
rbf.model <- rbf(X, Y) # using default values for K and gamma
rbf.model
## $weights
## [,1]
## [1,] -0.8916381
## [2,] -0.9345065
## [3,] -3.4113668
## [4,] 8.8122640
## [5,] 13.9391005
## [6,] 2.6130550
## [7,] -4.9743924
## [8,] 1.3381857
## [9,] -4.2822439
## [10,] -2.6032711
## [11,] -2.1925096
## [12,] 8.3972227
## [13,] -14.5458204
##
## $centers
## x1 x2
## 1 -0.0727733 0.75865134
## 2 0.3643707 -0.43110565
## 3 -0.7193847 -0.20170566
## 4 0.2370692 0.22831829
## 5 0.8781584 -0.71665121
## 6 -0.5698811 -0.77808139
## 7 -0.5733316 0.52057839
## 8 0.7985798 -0.03698476
## 9 0.2978826 -0.82927679
## 10 0.6304115 0.62240032
## 11 -0.0336843 -0.69824954
## 12 -0.2237328 -0.06348511
##
## $gamma
## [1] 1
And make a prediction over a new test set:
N.test <- 200
X.out  <- data.frame(x1=runif(N.test, min=-1, max=1),
                     x2=runif(N.test, min=-1, max=1))
Y.out <- target(X.out$x1, X.out$x2)
rbf.pred <- rbf.predict(rbf.model, X.out, classification=TRUE)
binary.error <- sum(rbf.pred != Y.out)/N.test
binary.error
## [1] 0.03
plot(X.out$x1, X.out$x2, col=Y.out+3, pch=0)
points(X.out$x1, X.out$x2, col=rbf.pred+3, pch=3)
points(rbf.model$centers, col="black", pch=19) # draw the model centers
legend("topleft",c("true value","predicted"),pch=c(0,3),bg="white")
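Both \(K\) and \(\gamma\) affect the fit. A small sketch (purely illustrative, since results depend on the random clustering inside rbf()) that re-fits the model for a few \(\gamma\) values on the same test set:
# compare out-of-sample errors for a few gamma values
for (g in c(0.1, 1, 10)) {
  m <- rbf(X, Y, K=12, gamma=g)
  p <- rbf.predict(m, X.out, classification=TRUE)
  cat("gamma =", g, "-> binary error =", sum(p != Y.out)/N.test, "\n")
}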
Consider a model as a linear combination of some fixed basis \(\phi\),
\[y(x,W) = W^T \phi(x)\]
Now let’s assume this prior distribution for W
\[p(W) = \mathcal{N}(W|0,\alpha^{-1} I)\]
where the hyperparameter \(\alpha\) defines the precision of the distribution.
This distribution over \(W\) induces a distribution over \(y(x,W)\). Notice, however, that \(y(x,W)\) is now a distribution over functions, each function corresponding to one particular set of values for \(W\).
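Before moving to the kernel view, here's a small sketch of this weight-space picture: draw a few \(W\) from the prior and plot the corresponding functions \(y(x,W)=W^T\phi(x)\), using a handful of Gaussian basis functions (the basis, its centers and \(\alpha\) are arbitrary choices for illustration):
set.seed(101)
alpha         <- 1.0                                  # prior precision (arbitrary)
basis_centers <- seq(-5, 5, len=10)                   # centers of the Gaussian basis functions (arbitrary)
phi <- function(x) exp(-0.5*(x-basis_centers)^2)      # phi(x): one value per basis function
xs  <- seq(-5, 5, len=100)
ys  <- sapply(1:5, function(i) {
  W <- rnorm(length(basis_centers), 0, sqrt(1/alpha)) # one draw of W ~ N(0, alpha^-1 I)
  sapply(xs, function(x) sum(W*phi(x)))               # y(x,W) = W^T phi(x)
})
plot(xs, ys[,1], type="n", ylim=range(ys), xlab="x", ylab="y(x,W)")
for (i in 1:5)
  lines(xs, ys[,i], col=i, lwd=2)
Each curve is one function, i.e., one draw of \(W\). The code that follows samples such functions directly in function space, using a kernel instead of an explicit basis.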
The next code is based on James Keirstead’s post.
require(MASS)
## Loading required package: MASS
# make the gaussian kernel
# gaussian_kernel <- make_gaussian_kernel(0.5)

# compute the covariance matrix Sigma for the Gaussian process, ie,
# the distribution of functions (eqs. 6.53 & 6.54)
get_sigma <- function(X1, X2, kernel=make_gaussian_kernel(0.5)) {
  Sigma <- matrix(rep(0, length(X1)*length(X2)), nrow=length(X1))
  for (i in 1:nrow(Sigma))
    for (j in 1:ncol(Sigma))
      Sigma[i,j] <- kernel(X1[i], X2[j])
  Sigma
}
# The points at which we want to define the functions
X <- seq(-5,5,len=50)
# Calculate the covariance matrix
sigma <- get_sigma(X,X)
means <- rep(0, length(X))

# Generate sample functions from the Gaussian process
set.seed(101)
n_samples <- 5
y_sample  <- matrix(rep(0,length(X)*n_samples), ncol=n_samples)
for (i in 1:n_samples) {
  # each column represents a sample from a multivariate normal distribution
  # with zero mean and covariance sigma
  y_sample[,i] <- mvrnorm(1, means, sigma)
}

plot(X, y_sample[,1], type="n", ylim=c(min(y_sample), max(y_sample)), ylab="sample functions")
for(i in 1:n_samples)
  points(X, y_sample[,i], type="l", col=i, lwd=2)
Usually we want these sampled functions to pass through known datapoints, i.e., to satisfy \(y_i = y(x_i,W)\) on a training set.
The next function receives \(X\), a set of inputs at which we wish to estimate the output, together with a training dataset, and returns samples of those estimates.
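It implements the standard Gaussian-process predictive equations (the comment inside the code points to Rasmussen and Williams' Gaussian Processes for Machine Learning). With observation-noise standard deviation \(\sigma_n\) (the sd_noise argument below), training inputs \(x\) with outputs \(y\), and test inputs \(X\), the predictive mean and covariance are
\[\bar{y}_* = K(X,x)\left[K(x,x)+\sigma_n^2 I\right]^{-1} y\]
\[\operatorname{cov}(y_*) = K(X,X) - K(X,x)\left[K(x,x)+\sigma_n^2 I\right]^{-1} K(x,X)\]
where each \(K(\cdot,\cdot)\) is a Gram matrix computed with get_sigma.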
# X: the points at which we want the estimates
# dataset: two-column dataframe with column x for the inputs, and column y for the outputs
# kernel: the kernel in use
# sd_noise: the standard deviation of the noise of the dataset samples
gp_samples <- function(X, dataset, kernel, n_samples, sd_noise=0, seed=121) {
  k.dd <- get_sigma(dataset$x, dataset$x, kernel)
  k.dx <- get_sigma(dataset$x, X, kernel)
  k.xd <- get_sigma(X, dataset$x, kernel)
  k.xx <- get_sigma(X, X, kernel)
  # These matrix calculations correspond to equation (2.19) in
  # Rasmussen and Williams's book 'Gaussian Processes for ML'
  noise_part <- sd_noise^2 * diag(1, ncol(k.dd))
  d.star.bar <- k.xd %*% solve(k.dd + noise_part) %*% dataset$y   # predictive mean
  cov.d.star <- k.xx - k.xd %*% solve(k.dd + noise_part) %*% k.dx # predictive covariance
  # generate the samples
  set.seed(seed)
  y_sample <- matrix(rep(0,length(X)*n_samples), ncol=n_samples)
  for (i in 1:n_samples)
    y_sample[,i] <- mvrnorm(1, d.star.bar, cov.d.star)
  list(y_sample=y_sample, mean_est=d.star.bar)
}
Let’s try to fit some data:
d <- data.frame(x=c(-4,-3,-1, 0, 2),
                y=c(-2, 0, 1, 2,-1))
n_samples <- 50
result <- gp_samples(X, d, make_gaussian_kernel(0.5), n_samples)

plot(X, result$y_sample[,1], type="n", ylim=c(min(y_sample)-.5, max(y_sample)+.5), ylab="sample functions")
for(i in 1:n_samples)
  points(X, result$y_sample[,i], type="l", col="lightgrey")
points(X, result$mean_est, type="l", col="red", lwd=2)
We can estimate the output for a given value, say \(2.3\).
value <- 2.3
result <- gp_samples(value, d, make_gaussian_kernel(0.5), n_samples=5e3)
ys <- result$y_sample[1,]   # the samples for the (single) value of X we asked for
library(coda)
hpd <- HPDinterval(as.mcmc(ys), prob=0.95) # compute highest density interval
hist(ys, breaks=50, freq=FALSE, xlab="", main=paste("Estimate for x=",round(value,2)), yaxt='n', ylab="")
lines(density(ys),col="red",lwd=2)
text(-0.2,1.1,paste("mean: ",round(result$mean_est,2)))
text(-0.4,1.3,paste("95% HPD: [",round(hpd[1],2),",",round(hpd[2],2),"]"))
Also, we can assume that our data is noisy:
result <- gp_samples(X, d, make_gaussian_kernel(0.5), n_samples, sd_noise=0.2)

plot(X, result$y_sample[,1], type="n", ylim=c(min(y_sample)-.5, max(y_sample)+.5), ylab="sample functions")
for(i in 1:n_samples)
  points(X, result$y_sample[,i], type="l", col="lightgrey")
points(X, result$mean_est, type="l", col="red", lwd=2)
# estimate for value
value <- 2.3
result <- gp_samples(value, d, make_gaussian_kernel(0.5), n_samples=5e3, sd_noise=0.2)
ys <- result$y_sample[1,]
hpd <- HPDinterval(as.mcmc(ys), prob=0.95) # compute highest density interval
hist(ys, breaks=50, freq=FALSE, xlab="", main=paste("Estimate for x=",round(value,2)), yaxt='n', ylab="")
lines(density(ys),col="red",lwd=2)
text(-0.2,0.8,paste("mean: ",round(result$mean_est,2)))
text(-0.3,1.0,paste("95% HPD: [",round(hpd[1],2),",",round(hpd[2],2),"]"))
Notice how the HPD interval is now wider, since the observation noise adds variance to the predictive distribution.
This procedure can also be used with other kernels. The next one is the exponential kernel, which corresponds to the Ornstein-Uhlenbeck process, originally introduced to describe Brownian motion.
exponential_kernel <- function(x,z) exp(-0.1*abs(x-z))   # parameter theta=0.1

result2 <- gp_samples(X, d, exponential_kernel, n_samples)

plot(X, result2$y_sample[,1], type="n", ylim=c(min(y_sample), max(y_sample)),
     ylab="sample functions")
for(i in 1:n_samples)
  points(X, result2$y_sample[,i], type="l", col="lightgrey")
points(X, result2$mean_est, type="l", col="red", lwd=2)
Check the function gausspr from package kernlab for regression and classification with Gaussian Processes. For more information, check http://www.gaussianprocess.org/.
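As a pointer only, here is a minimal sketch of fitting the toy regression dataset above with gausspr (this assumes kernlab's matrix interface with an RBF kernel; the parameter values are illustrative):
library(kernlab)                       # provides gausspr()
gp.fit  <- gausspr(x=as.matrix(d$x), y=d$y,
                   kernel="rbfdot", kpar=list(sigma=0.5))  # rbfdot's sigma plays the role of our gamma
gp.pred <- predict(gp.fit, as.matrix(X))                   # X is the grid defined above
plot(X, gp.pred, type="l", col="red", lwd=2, ylab="predicted y")
points(d$x, d$y, pch=19)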
There are several functions that convert a real value into the \([0,1]\) interval so that we can interpret it as a probability. Here we use the sigmoid:
sigmoid <- function(x) 1/(1+exp(-x))
Now, given the training set and a new value to estimate, we compute a set of samples for that value and then convert them with the sigmoid. For results below \(0.5\) we interpret the sample as class \(y=-1\), otherwise as class \(y=1\).
d <- data.frame(x=c(-4,-3,-1, 0, 2),
                y=c(-1,-1, 1, 1,-1))
value <- -2.6
result <- gp_samples(value, d, make_gaussian_kernel(0.5), n_samples=5e3)
probs <- sapply(result$y_sample[1,], sigmoid)
hpd <- HPDinterval(as.mcmc(probs), prob=0.95)
hist(probs,breaks=50,prob=T, main=paste("p(y|x=",value,")"), yaxt='n', ylab="")
text(0.5,4.0,paste("95% HPD: [",round(hpd[1],2),",",round(hpd[2],2),"]"))
In this example, we would classify the sample as class \(-1\) within a \(95\%\) credible interval.