This page contains source code relating to Chapter 8 of Bishop's Pattern Recognition and Machine Learning (2009).
This chapter is about Probabilistic Graphical Models.
The model can be stated as:
\[p(y,w|x,\alpha,\sigma^2) = p(w|\alpha) \prod_{i=1}^N p(y_i|w,x_i,\sigma^2)\]
where \(\alpha\) is the precision of the normal prior \(w \sim \mathcal{N}(0,\alpha^{-1}I)\), and \(\sigma^2\) is the variance of the observation noise,
\[y_i \sim \mathcal{N}(\mu_i,\sigma^2)\]
where \(\mu_i\) is a polynomial expression over \(x_i\) (or a basis of \(x_i\)) using the weights \(w\), i.e., \(\mu_i = y(x_i,w)\).
Here's the corresponding probabilistic graphical model:
Herein diamonds denote deterministic parameters; shaded circles denote observed values; and boxes (plates) denote repetition over multiple variables (in this example, \(x_1 \ldots x_N\) and \(t_1 \ldots t_N\)).
I will use OpenBUGS to specify an example model (Bishop does not discuss MCMC in this chapter) and execute it to produce a polynomial fit for a dataset.
In this example we use a cubic polynomial, so
\[\mu = w_0 + w_1 x + w_2 x^2 + w_3 x^3\]
and we need to estimate the values of the weights \(w\), given \(x\) and \(y\):
modelString = "
model {
for(i in 1:4) {
w[i] ~ dnorm(0, alpha_1) # prior for each w[i], w ~ N(0,1/alpha)
}
for(i in 1:N) {
mu[i] <- w[1] + w[2]*x[i] + w[3]*pow(x[i],2) + w[4]*pow(x[i],3)
y[i] ~ dnorm(mu[i], sigma2) # likelihood, y ~ N(mu, sigma^2)
}
}
"
This is our data, a noisy sine wave:
N  <- 50
xs <- seq(-pi, pi, len=N)
d  <- data.frame(x=xs,
                 y=sin(xs) + rnorm(N, 0, 0.1))
plot(d, pch=19)
Let's try to fit a cubic polynomial to it:
data.list <- list(
  N       = N,
  alpha_1 = 10,   # precision of the prior on w
  sigma2  = 10,   # precision of the observation noise
  x       = d$x,
  y       = d$y
)
run.model(modelString, samples=c("w"), data=data.list, chainLength=10000)
## model is syntactically correct
## data loaded
## model compiled
## initial values generated, model initialized
## 1000 updates took 0 s
## monitor set for variable 'w'
## 10000 updates took 0 s
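The run.model helper used above is defined outside this page; judging from its output, it presumably wraps the usual BRugs calls. Here is a minimal sketch of what it could look like (the argument names and the burn-in fraction are assumptions of this sketch):

library(BRugs)

run.model <- function(model, samples, data=list(), chainLength=10000, burnin=0.10) {
  modelFile <- file.path(tempdir(), "model.txt")
  writeLines(model, con=modelFile)
  modelCheck(modelFile)                  # check model syntax
  modelData(bugsData(data))              # write the data to a file and load it
  modelCompile()                         # compile the model (one chain by default)
  modelGenInits()                        # generate initial values
  modelUpdate(burnin * chainLength)      # burn-in updates
  samplesSet(samples)                    # set monitors on the requested nodes
  modelUpdate(chainLength)               # sampling updates
}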
# get posterior mean of p(w|data)
w_hat <- samplesStats( "w" )$mean
# how to compute estimates for new values of y based on a given w
compute_mu <- function(w,x) {
  w[1] + w[2]*x + w[3]*x^2 + w[4]*x^3
}
# for each x, estimate y given the posterior mean of p(w|data)
ys_hat <- sapply(xs, function(x) compute_mu(w_hat,x))
plot(d,pch=19)
points(xs,ys_hat,type="l", col="red",lwd=2)
We can also produce a credible interval. For that we need the samples of \(w\) computed by the MCMC:
w_samples <- data.frame(w1=samplesSample("w[1]"))
for(i in 2:4)
  w_samples <- cbind(w_samples, samplesSample( paste0('w[',i,']') ))
names(w_samples) <- paste0("w",1:4)
head(w_samples,4)
## w1 w2 w3 w4
## 1 -0.042160 0.812928 0.020062 -0.086070
## 2 -0.048486 0.790838 0.013759 -0.079524
## 3 -0.017038 0.765114 0.004933 -0.075219
## 4 0.012780 0.782291 -0.002309 -0.091765
With these samples, we can draw several instances of fitted polynomials:
plot(d,pch=19,type="n")
for(i in 1:20) {
w <- w_samples[i,]
y <- sapply(xs, function(x) compute_mu(w,x))
points(xs,y,type="l", col="lightgrey", lwd=1)
}
points(d,pch=19)
And use those to get highest posterior density (HPD) intervals for each value of \(x\):
library(coda)
prob <- 0.9
hpd <- matrix(rep(NA,N*2), ncol=2)
for(i in 1:N) {
  ys <- apply(w_samples, 1, function(w) compute_mu(w, xs[i]))
  hpd[i,] <- HPDinterval(as.mcmc(ys), prob=prob)
}
plot(d,pch=19,type="n", xlab=paste0(100*prob,"% credible interval"), ylab="")
polygon(c(rev(xs), xs), c(rev(hpd[,1]), hpd[,2]), col = 'grey80', border = NA)
points(xs,ys_hat,type="l", col="red",lwd=2)
points(d,pch=19)
Naive Bayes is a model that makes a very strong assumption: all features \(x_i\) are conditionally independent of each other given the class label:
\[p(x,y) = p(y) \prod_{i=1}^D p(x_i|y)\]
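To make the factorization concrete, here is a minimal sketch in R of how this joint probability could be computed from estimated probability tables (naive_bayes_joint, p_y and p_x_given_y are names made up for this illustration):

# p_y[y] is the class prior p(y); p_x_given_y[[i]][v, y] is p(x_i = v | y) for feature i
naive_bayes_joint <- function(x, y, p_y, p_x_given_y) {
  p_y[y] * prod(sapply(seq_along(x), function(i) p_x_given_y[[i]][x[i], y]))
}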
In graphical terms:
The next example is taken from Stan's GitHub repository and is about classifying words \(w\) of a certain vocabulary into a given topic \(z\) (more details in section 13.3 of the Stan Modeling Manual).
There are \(M\) documents, each consisting of a bag of words (word order is ignored), where the m-th document has \(N_m\) words, \(w_{m1} \ldots w_{mN_m}\), with a total of \(N\) words. There are \(K\) different categories/topics of documents (e.g., spam, personal, work, news…).
The word data is organized in two vectors: one, \(w\), with all the words (each identified by a numeric id), and another, \(doc\), with the id of the document each word belongs to.
source("chp8/data.R") # get data
K # number of topics
## [1] 4
V # number of words in the vocabulary
## [1] 10
N # total number of words in the data
## [1] 1955
M # number of documents
## [1] 200
head(z) # the classification of each document, i.e., the topic of doc m is in z[m]
## [1] 1 1 2 1 1 3
head(doc,20)
## [1] 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3
head(w,20)
## [1] 7 9 5 2 9 1 1 9 6 9 1 10 6 7 2 3 3 2 9 3
There’s a category \(z_m\) for each document \(m \in 1:M\) with categorical distribution
\[z_m \sim \text{Categorical}(\theta)\]
where \(\theta\) is a K-simplex (i.e., a vector of K elements summing to one) that represents the prevalence of each category in the corpus.
Each word \(w_{m,n}\), the n-th word of the m-th document, is generated by
\[w_{m,n} \sim \text{Categorical}(\phi_{z_m})\]
where \(\phi_{z_m}\) is a V-simplex representing the probability of each word in the vocabulary inside documents of category \(z_m\).
The priors for \(\theta\) and \(\phi\) are symmetric Dirichlet distributions, whose concentration parameters are given by the vectors \(\alpha\) and \(\beta\),
\[\theta \sim \text{Dirichlet}(\alpha), \qquad \phi_k \sim \text{Dirichlet}(\beta), \quad k = 1 \ldots K\]
alpha # there are K topics
## [1] 1 1 1 1
beta # there are V words in the vocabulary
## [1] 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
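For intuition, here is a minimal sketch of how a dataset with this structure could be simulated from the generative process just described (purely illustrative; the actual data come from chp8/data.R, and rdirichlet1 and the *_sim names are introduced only for this sketch):

rdirichlet1 <- function(a) { g <- rgamma(length(a), shape=a); g/sum(g) }  # one Dirichlet draw

theta_sim <- rdirichlet1(alpha)                              # topic prevalence (K-simplex)
phi_sim   <- t(sapply(1:K, function(k) rdirichlet1(beta)))   # K x V word distributions
z_sim     <- sample(1:K, M, replace=TRUE, prob=theta_sim)    # topic of each document
n_m       <- sample(5:15, M, replace=TRUE)                   # words per document (arbitrary choice)
doc_sim   <- rep(1:M, times=n_m)                             # document id of each word
w_sim     <- unlist(lapply(1:M, function(m)
                 sample(1:V, n_m[m], replace=TRUE, prob=phi_sim[z_sim[m],])))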
Graphically:
The model in Stan:
model <- '
data {
  // training data
  int<lower=1> K;               // num topics
  int<lower=1> V;               // num words, vocabulary
  int<lower=0> M;               // num docs
  int<lower=0> N;               // total word instances
  int<lower=1,upper=K> z[M];    // topic for doc m
  int<lower=1,upper=V> w[N];    // word n
  int<lower=1,upper=M> doc[N];  // doc ID for word n
  // hyperparameters
  vector<lower=0>[K] alpha;     // topic prior
  vector<lower=0>[V] beta;      // word prior
}
parameters {
  simplex[K] theta;             // topic prevalence
  simplex[V] phi[K];            // word dist for topic k
}
model {
  // priors
  theta ~ dirichlet(alpha);
  for (k in 1:K)
    phi[k] ~ dirichlet(beta);
  // likelihood, including latent category
  for (m in 1:M)
    z[m] ~ categorical(theta);
  for (n in 1:N)
    w[n] ~ categorical(phi[z[doc[n]]]);
}
'
So, let’s give the model and the data to Stan and wait for the results:
library(rstan)
fit <- stan(model_code = model,
            data = list(K=K, V=V, M=M, N=N, z=z, w=w, doc=doc, alpha=alpha, beta=beta),
            iter = 10000, chains=1, verbose=FALSE, seed=101, warmup=1000)
##
## TRANSLATING MODEL 'model' FROM Stan CODE TO C++ CODE NOW.
## COMPILING THE C++ CODE FOR MODEL 'model' NOW.
##
## SAMPLING FOR MODEL 'model' NOW (CHAIN 1).
##
## Iteration: 1 / 10000 [ 0%] (Warmup)
## Iteration: 1000 / 10000 [ 10%] (Warmup)
## Iteration: 1001 / 10000 [ 10%] (Sampling)
## Iteration: 2000 / 10000 [ 20%] (Sampling)
## Iteration: 3000 / 10000 [ 30%] (Sampling)
## Iteration: 4000 / 10000 [ 40%] (Sampling)
## Iteration: 5000 / 10000 [ 50%] (Sampling)
## Iteration: 6000 / 10000 [ 60%] (Sampling)
## Iteration: 7000 / 10000 [ 70%] (Sampling)
## Iteration: 8000 / 10000 [ 80%] (Sampling)
## Iteration: 9000 / 10000 [ 90%] (Sampling)
## Iteration: 10000 / 10000 [100%] (Sampling)
## # Elapsed Time: 7.203 seconds (Warm-up)
## # 52.01 seconds (Sampling)
## # 59.213 seconds (Total)
Here are the posterior means of the parameters \(\phi_k\):
phi_means <- matrix(as.numeric(get_posterior_mean(fit, pars="phi")), ncol=4)
colnames(phi_means) <- paste0("K",1:4) # i-th topic
rownames(phi_means) <- paste0("V",1:10) # i-th word in the vocabulary
phi_means
## K1 K2 K3 K4
## V1 0.19190737 0.02238319 0.04506077 0.202885494
## V2 0.02143002 0.17969415 0.02249301 0.473273975
## V3 0.04270028 0.45247781 0.14410443 0.025105712
## V4 0.05241685 0.11571790 0.02569999 0.043576554
## V5 0.01462873 0.01354873 0.46058673 0.049507174
## V6 0.06397323 0.02246406 0.15712215 0.031268342
## V7 0.44514952 0.04903711 0.02585869 0.049707493
## V8 0.02823626 0.02896186 0.05447617 0.006701946
## V9 0.12107472 0.08452556 0.03550147 0.092808349
## V10 0.01848302 0.03118963 0.02909660 0.025164961
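As a quick check, the most probable vocabulary word under each topic can be read off the columns of this table:

apply(phi_means, 2, which.max)  # word 7 for K1, word 3 for K2, word 5 for K3, word 2 for K4 (cf. the table above)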
Let’s select the words of a document:
doc_id <- 4
words <- w[doc==doc_id]
words
## [1] 6 2 7 1 7 7 1 6 7 7
And compute the likelihood of this document (i.e., of its words) under each topic (we use log-likelihoods to prevent underflow):
predict_topic <- function(words, phi_means) {
  log_lik <- rep(NA, K)
  for(topic in 1:K) {
    log_lik[topic] <- sum(log(phi_means[words, topic]))  # sum of log p(word | topic)
  }
  which.max(log_lik)   # return the most likely topic
}
predict_topic(words, phi_means)
## [1] 1
And we see that the 4th document is classified as belonging to topic 1.
This matches its known classification:
z[doc_id] # the known topic of this document
## [1] 1
If we had a new document, we could classify it:
words <- c(1,5,4,1,2) # the word ids of the new document
predict_topic(words, phi_means) # the model's prediction
## [1] 4
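Note that predict_topic uses only the word likelihoods; one could also weigh in the estimated topic prevalence \(\theta\) as a prior over topics. A possible variant (theta_hat and predict_topic_map are names introduced here, not part of the original code):

theta_hat <- as.numeric(get_posterior_mean(fit, pars="theta"))  # posterior mean of theta (single chain)

# maximize log p(z) + sum_n log p(w_n | z) instead of the word likelihood alone
predict_topic_map <- function(words, phi_means, theta_hat) {
  log_post <- log(theta_hat) +
              sapply(1:K, function(topic) sum(log(phi_means[words, topic])))
  which.max(log_post)
}
predict_topic_map(words, phi_means, theta_hat)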