This page contains source code relating to chapter 14 of Bishop’s Pattern Recognition and Machine Learning (2009). This chapter is about combining models.
A committee is a combination of models. The simplest way to construct a committee is to average the predictions of a set of different models. Ideally, we would have \(M\) datasets with which to train \(M\) models, but usually we have just one dataset.
One approach to this problem is to bootstrap the dataset, i.e., to resample the original dataset into the required \(M\) datasets.
d <- iris
head(d)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
M <- 12  # number of bootstrap datasets (and models)
bootstrap_samples <- replicate(M, sample(1:nrow(d), nrow(d), replace=TRUE))
head(bootstrap_samples)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## [1,]   51   36    8   75   95   31  137   30   62    27   133    57
## [2,]   47   29  113   76   53   37   83  101   77   106    70   131
## [3,]    7   19   16  108   32   92   86  131  108    52    18    36
## [4,]  123   35   34  127   94   76   52   57  107    63    21     8
## [5,]   84  119   50   18   45  128  139  109  111   142    77   103
## [6,]   12   98   80  113   81  142   32  139  121   105   122   131
The columns of bootstrap_samples are resamples of the original dataset (in this case, each column keeps the indices that refer to the original dataset). They will be used to train \(M\) models, \(y_1, \ldots, y_M\) (herein, we use linear regression as an example):
# models is a list with M linear regression models
models <- apply(bootstrap_samples, 2,
                function(indices) lm(Petal.Length ~ Sepal.Length, data=d[indices,]))
The committee prediction is given by
\[y_{com}(x) = \frac{1}{M} \sum_{m=1}^M y_m(x)\]
This technique is called bagging (short for bootstrap aggregating):
bagging <- function(models, vals) {
  M <- length(models)
  # preds: one row per model, one column per value in vals
  preds <- t(sapply(1:M, function(m) predict(models[[m]], vals)))
  apply(preds, 2, mean)  # average the M predictions for each value
}
Let’s plot the bagged regression prediction:
xs <- seq(4,8,len=50)
ys <- bagging(models, data.frame(Sepal.Length=xs))  # committee prediction at each x
plot(d$Sepal.Length, d$Petal.Length, pch=19)
points(xs, ys, type="l", col="red", lwd=2)          # bagged regression line in red
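As a quick sanity check (this overlay is an addition to the original listing), we can compare the bagged curve with a single model fitted on the full dataset. For a stable, low-variance model such as linear regression the two curves are nearly indistinguishable, which is why bagging pays off mostly with unstable models like decision trees:
single_fit <- lm(Petal.Length ~ Sepal.Length, data=d)  # one model, full dataset
points(xs, predict(single_fit, data.frame(Sepal.Length=xs)),
       type="l", col="blue", lwd=2, lty=2)             # dashed blue, near the red line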
**Boosting** is a more advanced technique for forming a committee. Here the \(M\) models are trained in sequence, each depending on the results of the previous one: data points that were wrongly classified receive more weight in the next round. After all models are trained, their predictions are combined through a weighted vote rather than the plain average used before.
The variant presented next is AdaBoost (short for adaptive boosting).
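Concretely (following section 14.3 of the book, which the code comments below also cite), each round \(m\) computes the weighted error rate \(\epsilon_m\) of the current model \(y_m\), turns it into a voting coefficient \(\alpha_m\), and re-weights the data points:
\[\epsilon_m = \frac{\sum_{n=1}^N w_n^{(m)} \, I(y_m(x_n) \neq t_n)}{\sum_{n=1}^N w_n^{(m)}}, \qquad \alpha_m = \ln \frac{1-\epsilon_m}{\epsilon_m}\]
\[w_n^{(m+1)} = w_n^{(m)} \exp\left\{ \alpha_m \, I(y_m(x_n) \neq t_n) \right\}\]
The following function implements these updates, using logistic regression as the base classifier: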
# AdaBoost using logistic regression (classification with 2 classes, named 0 and 1)
# This just serves as an example; for an efficient AdaBoost check package 'adabag'
lg_adaboost <- function(dataset, ab_formula, M) {
  N <- nrow(dataset)
  w <- rep(1/N, N)  # data-point weights, initially uniform
  models  <- list()
  alpha_m <- rep(NA, M)
  for(m in 1:M) {
    models[[m]] <-
      eval(substitute(
        glm(ab_formula, family = binomial("logit"), data=dataset, weights=w)
      ), list(w=w))  # weights are evaluated like variables in the formula [see glm's help]
    # get the current model's classification
    pred <- predict(models[[m]], dataset[,-ncol(dataset)], type="response")
    pred <- (pred > 0.5)+0  # round probabilities to classes 0 and 1
    # update the weights (cf. Bishop, pg. 658)
    indicators <- 0+(pred != dataset[,ncol(dataset)])  # I(y_m(x_n) != t_n)
    epsilon_m  <- sum(w * indicators) / sum(w)         # weighted error rate
    alpha_m[m] <- log((1-epsilon_m)/epsilon_m)         # the model's voting coefficient
    w <- w * exp(alpha_m[m] * indicators)              # misclassified points gain weight
  }
  list(models=models, alpha_m=alpha_m)
}
The function lg_adaboost returns the models \(y_m\) and a vector of weighting coefficients \(\alpha_m\), which are used to define the committee prediction:
\[y_{com}(x) = \text{sign}\left( \sum_{m=1}^M \alpha_m y_m(x) \right)\]
# prediction of a dataset using a committee produced by function 'lg_adaboost'
pred_adaboost <- function(committee, dataset) {
  N <- nrow(dataset)
  M <- length(committee$models)
  committee_vote <- function(row) {
    # predict() defaults to the link scale (log-odds), where the boundary
    # between the two classes lies at zero
    preds <- sapply(1:M, function(m) committee$alpha_m[m] *
                                     predict(committee$models[[m]], row))
    (sum(preds) > 0)+0  # sign( sum(...) ), mapped to classes 0 and 1
  }
  sapply(1:N, function(n) committee_vote(dataset[n,]))
}
Let’s check the performance on a two-class dataset:
dataset <- iris[51:150,] # keep only two of the three iris species
dataset[,5] <- (dataset[,5]=="virginica")+0 # convert classes to 0 and 1
committee <- lg_adaboost(dataset, ab_formula=as.formula(Species ~ .), M)
# compare true classification with prediction by the committee:
table(dataset[,5], pred_adaboost(committee, dataset))
##    
##      0  1
##   0 48  2
##   1  3 47
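With the run above, the committee classifies 95 of the 100 training points correctly. A quick way to turn the confusion matrix into a training accuracy (optimistic, since it is measured on the data the committee was trained on):
conf_matrix <- table(dataset[,5], pred_adaboost(committee, dataset))
sum(diag(conf_matrix)) / sum(conf_matrix)  # 0.95 for the run shown above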