This page contains source code relating to Chapter 8 of Bishop's Pattern Recognition and Machine Learning (2009).
This chapter is about Probabilistic Graphical Models.
The model can be stated as:
\[p(y,w|x,\alpha,\sigma^2) = p(w|\alpha) \prod_{i=1}^N p(y_i|w,x_i,\sigma^2)\]
where \(\alpha\) is the precision of the normal prior \(w \sim \mathcal{N}(0,\alpha^{-1}I)\), and \(\sigma^2\) is the variance of the observation noise,
\[y_i \sim \mathcal{N}(\mu_i,\sigma^2)\]
where \(\mu_i\) is a polynomial expression over \(x_i\) (or a basis of \(x_i\)) using the weights \(w\), i.e., \(\mu_i = y(x_i,w)\).
Here's the corresponding probabilistic graphical model:
Herein diamonds denote deterministic parameters; shaded circles denote observed values; and boxes (plates) denote repetition over multiple variables (in this example, \(x_1 \ldots x_N\) and \(t_1 \ldots t_N\)).
I will use OpenBUGS to specify an example model (Bishop does not discuss MCMC in this chapter) and execute it to produce a polynomial fit for a dataset.
In this example we use a cubic polynomial, so
\[\mu = w_0 + w_1 x + w_2 x^2 + w_3 x^3\]
and we need to estimate the values of the weights \(w\), given \(x\) and \(y\):
modelString = "
model {
for(i in 1:4) {
w[i] ~ dnorm(0, alpha_1) # prior for each w[i], w ~ N(0,1/alpha)
}
for(i in 1:N) {
mu[i] <- w[1] + w[2]*x[i] + w[3]*pow(x[i],2) + w[4]*pow(x[i],3)
y[i] ~ dnorm(mu[i], sigma2) # likelihood, y ~ N(mu, sigma^2)
}
}
"
This is our data, a noisy sine wave:
N  <- 50
xs <- seq(-pi, pi, len=N)
d  <- data.frame(x=xs,
                 y=sin(xs) + rnorm(N, 0, 0.1))
plot(d, pch=19)
Let's try to fit a cubic polynomial to it:
data.list <- list(
  N       = N,
  alpha_1 = 10,   # precision of the prior on w
  sigma2  = 10,   # precision of the observation noise
  x       = d$x,
  y       = d$y
)
run.model(modelString, samples=c("w"), data=data.list, chainLength=10000)
## model is syntactically correct
## data loaded
## model compiled
## initial values generated, model initialized
## 1000 updates took 0 s
## monitor set for variable 'w'
## 10000 updates took 0 s
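The run.model helper used above is defined outside this page; judging from its output, it presumably wraps the usual BRugs calls. Here is a minimal sketch of what it could look like (the argument names and the burn-in fraction are assumptions of this sketch):

library(BRugs)

run.model <- function(model, samples, data=list(), chainLength=10000, burnin=0.10) {
  modelFile <- file.path(tempdir(), "model.txt")
  writeLines(model, con=modelFile)
  modelCheck(modelFile)                  # check model syntax
  modelData(bugsData(data))              # write the data to a file and load it
  modelCompile()                         # compile the model (one chain by default)
  modelGenInits()                        # generate initial values
  modelUpdate(burnin * chainLength)      # burn-in updates
  samplesSet(samples)                    # set monitors on the requested nodes
  modelUpdate(chainLength)               # sampling updates
}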
# get posterior mean of p(w|data)
w_hat <- samplesStats( "w" )$mean
# how to compute estimates for new values of y based on a given w
compute_mu <- function(w,x) {
  w[1] + w[2]*x + w[3]*x^2 + w[4]*x^3
}
# for each x, estimate y given the posterior mean of p(w|data)
ys_hat <- sapply(xs, function(x) compute_mu(w_hat,x))
plot(d,pch=19)
points(xs,ys_hat,type="l", col="red",lwd=2)
We can also produce a credible interval. For that we need the samples of \(w\) computed by the MCMC:
w_samples <- data.frame(w1=samplesSample("w[1]"))
for(i in 2:4)
  w_samples <- cbind(w_samples, samplesSample( paste0('w[',i,']') ))
names(w_samples) <- paste0("w",1:4)
head(w_samples,4)
## w1 w2 w3 w4
## 1 -0.042160 0.812928 0.020062 -0.086070
## 2 -0.048486 0.790838 0.013759 -0.079524
## 3 -0.017038 0.765114 0.004933 -0.075219
## 4 0.012780 0.782291 -0.002309 -0.091765
With these samples, we can draw several instances of fitted polynomials:
plot(d,pch=19,type="n")
for(i in 1:20) {
w <- w_samples[i,]
y <- sapply(xs, function(x) compute_mu(w,x))
points(xs,y,type="l", col="lightgrey", lwd=1)
}
points(d,pch=19)
And use those to get highest posterior density (HPD) intervals for each value of \(x\):
library(coda)
prob <- 0.9
hpd <- matrix(rep(NA,N*2), ncol=2)
for(i in 1:N) {
  ys <- apply(w_samples, 1, function(w) compute_mu(w, xs[i]))
  hpd[i,] <- HPDinterval(as.mcmc(ys), prob=prob)
}
plot(d,pch=19,type="n", xlab=paste0(100*prob,"% credible interval"), ylab="")
polygon(c(rev(xs), xs), c(rev(hpd[,1]), hpd[,2]), col = 'grey80', border = NA)
points(xs,ys_hat,type="l", col="red",lwd=2)
points(d,pch=19)
Naive Bayes is a model that makes a very strong assumption: all features \(x_i\) are conditionally independent of each other given the class label:
\[p(x,y) = p(y) \prod_{i=1}^D p(x_i|y)\]
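To make the factorization concrete, here is a minimal sketch in R of how this joint probability could be computed from estimated probability tables (naive_bayes_joint, p_y and p_x_given_y are names made up for this illustration):

# p_y[y] is the class prior p(y); p_x_given_y[[i]][v, y] is p(x_i = v | y) for feature i
naive_bayes_joint <- function(x, y, p_y, p_x_given_y) {
  p_y[y] * prod(sapply(seq_along(x), function(i) p_x_given_y[[i]][x[i], y]))
}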
In graphical terms:
The next example is taken from Stan's GitHub repository and is about classifying words \(w\) of a certain vocabulary into a given topic \(z\) (more details in section 13.3 of the Stan Modeling Manual).
There are \(M\) documents, each consisting of a bag of words (word order is ignored), where the m-th document has \(N_m\) words, \(w_{m1} \ldots w_{mN_m}\), with a total of \(N\) words. There are \(K\) different categories/topics of documents (e.g., spam, personal, work, news…).
The word data is organized in two vectors: one, \(w\), with all the words (each identified by a numeric id), and another, \(doc\), with the id of the document each word belongs to.
source("chp8/data.R") # get data
K # number of topics
## [1] 4
V # number of words in the vocabulary
## [1] 10
N # total number of words in the data
## [1] 1955
M # number of documents
## [1] 200
head(z) # the classification of each document, i.e., the topic of doc m is in z[m]
## [1] 1 1 2 1 1 3
head(doc,20)
## [1] 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3
head(w,20)
## [1] 7 9 5 2 9 1 1 9 6 9 1 10 6 7 2 3 3 2 9 3
There’s a category \(z_m\) for each document \(m \in 1:M\) with categorical distribution
\[z_m \sim \text{Categorical}(\theta)\]
where \(\theta\) is a K-simplex (i.e., a vector of K elements summing to one) that represents the prevalence of each category in the corpus.
Each word \(w_{m,n}\), the n-th word of the m-th document, is generated by
\[w_{m,n} \sim \text{Categorical}(\phi_{z_m})\]
where \(\phi_{z_m}\) is a V-simplex representing the probability of each word in the vocabulary inside documents of category \(z_m\).
The priors for \(\theta\) and \(\phi\) are symmetric Dirichlet distributions, whose concentration parameters are given by the vectors \(\alpha\) and \(\beta\),
\[\theta \sim \text{Dirichlet}(\alpha), \qquad \phi_k \sim \text{Dirichlet}(\beta), \quad k = 1 \ldots K\]
alpha # there are K topics
## [1] 1 1 1 1
beta # there are V words in the vocabulary
## [1] 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
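For intuition, here is a minimal sketch of how a dataset with this structure could be simulated from the generative process just described (purely illustrative; the actual data come from chp8/data.R, and rdirichlet1 and the *_sim names are introduced only for this sketch):

rdirichlet1 <- function(a) { g <- rgamma(length(a), shape=a); g/sum(g) }  # one Dirichlet draw

theta_sim <- rdirichlet1(alpha)                              # topic prevalence (K-simplex)
phi_sim   <- t(sapply(1:K, function(k) rdirichlet1(beta)))   # K x V word distributions
z_sim     <- sample(1:K, M, replace=TRUE, prob=theta_sim)    # topic of each document
n_m       <- sample(5:15, M, replace=TRUE)                   # words per document (arbitrary choice)
doc_sim   <- rep(1:M, times=n_m)                             # document id of each word
w_sim     <- unlist(lapply(1:M, function(m)
                 sample(1:V, n_m[m], replace=TRUE, prob=phi_sim[z_sim[m],])))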
Graphically:
The model in Stan:
model <- '
data {
  // training data
  int<lower=1> K;               // num topics
  int<lower=1> V;               // num words, vocabulary
  int<lower=0> M;               // num docs
  int<lower=0> N;               // total word instances
  int<lower=1,upper=K> z[M];    // topic for doc m
  int<lower=1,upper=V> w[N];    // word n
  int<lower=1,upper=M> doc[N];  // doc ID for word n
  // hyperparameters
  vector<lower=0>[K] alpha;     // topic prior
  vector<lower=0>[V] beta;      // word prior
}
parameters {
  simplex[K] theta;             // topic prevalence
  simplex[V] phi[K];            // word dist for topic k
}
model {
  // priors
  theta ~ dirichlet(alpha);
  for (k in 1:K)
    phi[k] ~ dirichlet(beta);
  // likelihood, including latent category
  for (m in 1:M)
    z[m] ~ categorical(theta);
  for (n in 1:N)
    w[n] ~ categorical(phi[z[doc[n]]]);
}
'
So, let’s give the model and the data to Stan and wait for the results:
library(rstan)
fit <- stan(model_code = model,
            data = list(K=K, V=V, M=M, N=N, z=z, w=w, doc=doc, alpha=alpha, beta=beta),
            iter = 10000, chains=1, verbose=FALSE, seed=101, warmup=1000)
##
## TRANSLATING MODEL 'model' FROM Stan CODE TO C++ CODE NOW.
## COMPILING THE C++ CODE FOR MODEL 'model' NOW.
##
## SAMPLING FOR MODEL 'model' NOW (CHAIN 1).
##
## Iteration: 1 / 10000 [ 0%] (Warmup)
## Iteration: 1000 / 10000 [ 10%] (Warmup)
## Iteration: 1001 / 10000 [ 10%] (Sampling)
## Iteration: 2000 / 10000 [ 20%] (Sampling)
## Iteration: 3000 / 10000 [ 30%] (Sampling)
## Iteration: 4000 / 10000 [ 40%] (Sampling)
## Iteration: 5000 / 10000 [ 50%] (Sampling)
## Iteration: 6000 / 10000 [ 60%] (Sampling)
## Iteration: 7000 / 10000 [ 70%] (Sampling)
## Iteration: 8000 / 10000 [ 80%] (Sampling)
## Iteration: 9000 / 10000 [ 90%] (Sampling)
## Iteration: 10000 / 10000 [100%] (Sampling)
## # Elapsed Time: 7.203 seconds (Warm-up)
## # 52.01 seconds (Sampling)
## # 59.213 seconds (Total)
Here are the posterior means of the parameters \(\phi_k\):
phi_means <- matrix(as.numeric(get_posterior_mean(fit, pars="phi")), ncol=4)
colnames(phi_means) <- paste0("K",1:4) # i-th topic
rownames(phi_means) <- paste0("V",1:10) # i-th word in the vocabulary
phi_means
## K1 K2 K3 K4
## V1 0.19190737 0.02238319 0.04506077 0.202885494
## V2 0.02143002 0.17969415 0.02249301 0.473273975
## V3 0.04270028 0.45247781 0.14410443 0.025105712
## V4 0.05241685 0.11571790 0.02569999 0.043576554
## V5 0.01462873 0.01354873 0.46058673 0.049507174
## V6 0.06397323 0.02246406 0.15712215 0.031268342
## V7 0.44514952 0.04903711 0.02585869 0.049707493
## V8 0.02823626 0.02896186 0.05447617 0.006701946
## V9 0.12107472 0.08452556 0.03550147 0.092808349
## V10 0.01848302 0.03118963 0.02909660 0.025164961
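As a quick check, the most probable vocabulary word under each topic can be read off the columns of this table:

apply(phi_means, 2, which.max)  # word 7 for K1, word 3 for K2, word 5 for K3, word 2 for K4 (cf. the table above)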
Let’s select the words of a document:
doc_id <- 4
words <- w[doc==doc_id]
words
## [1] 6 2 7 1 7 7 1 6 7 7
And compute the likelihood of this document (i.e., of its words) under each topic (we use log-likelihoods to prevent underflow):
predict_topic <- function(words, phi_means) {
  log_lik <- rep(NA, K)
  for(topic in 1:K) {
    log_lik[topic] <- sum(log(phi_means[words, topic]))  # sum of log p(word | topic)
  }
  which.max(log_lik)   # return the most likely topic
}
predict_topic(words, phi_means)
## [1] 1
And we see that the 4th document is classified as belonging to topic 1.
This matches its known classification:
z[doc_id] # the known topic of this document
## [1] 1
If we had a new document, we could classify it:
words <- c(1,5,4,1,2) # the word ids of the new document
predict_topic(words, phi_means) # the model's prediction
## [1] 4
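Note that predict_topic uses only the word likelihoods; one could also weigh in the estimated topic prevalence \(\theta\) as a prior over topics. A possible variant (theta_hat and predict_topic_map are names introduced here, not part of the original code):

theta_hat <- as.numeric(get_posterior_mean(fit, pars="theta"))  # posterior mean of theta (single chain)

# maximize log p(z) + sum_n log p(w_n | z) instead of the word likelihood alone
predict_topic_map <- function(words, phi_means, theta_hat) {
  log_post <- log(theta_hat) +
              sapply(1:K, function(topic) sum(log(phi_means[words, topic])))
  which.max(log_post)
}
predict_topic_map(words, phi_means, theta_hat)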