A current version of OpenBUGS is needed.
Refs:
Lunn et al., The BUGS Book (book website)
Kruschke, Doing Bayesian Data Analysis
BUGS allows us to do statistical inference using a graphical-model approach. The model is described in a declarative way and has two kinds of connections:
logical dependence, represented by <- : there is no randomness here, just a deterministic assignment; on the right-hand side there is an expression
stochastic dependence, represented by ~ : on the right-hand side there is a statistical distribution
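For instance, a minimal model fragment using both kinds of connection (the names theta and odds are purely illustrative):
theta ~ dbeta(1, 1)         # stochastic: theta is drawn from a Beta(1,1) distribution
odds <- theta / (1 - theta) # logical: odds is computed deterministically from theta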
BUGS supports many distributions. Some of them:
Normal, dnorm(mean, precision), where precision = 1/variance (see the note after this list)
Bernoulli, dbern(p), where p is the success probability
Binomial, dbin(p, n), where p is the success probability and n is the number of trials
Categorical, dcat(p[]), where p is a vector with a discrete distribution
Uniform, dunif(a, b), uniform on [a,b]
Beta, dbeta(a, b)
Gamma, dgamma(a, b)
Logistic, dlogis(mu, tau)
Exponential, dexp(theta)
Student’s t, dt(mean, tau, k), where \(k\) is the degrees of freedom
Pareto, dpar(a, b), with \(p(x|a,b)=ab^ax^{-(a+1)}\)
Poisson, dpois(theta)
Geometric, dgeom(theta)
Beta-binomial, dbetabin(a, b, n)
Multinomial, x[1:R] ~ dmulti(theta[], n), for \(n\) events, each with R mutually exclusive outcomes with probabilities \(\theta_1\ldots \theta_R\)
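A note on dnorm: BUGS parameterizes the normal by precision, not variance or standard deviation. A quick sketch of the correspondence in R (the target sd of 2 is just an example):
sigma <- 2          # desired standard deviation
tau   <- 1/sigma^2  # BUGS precision = 1/variance = 0.25
# so dnorm(0, 0.25) in BUGS corresponds to rnorm(n, mean=0, sd=2) in R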
Vector use:
v[i], the ith position
v[a:b], the vector (v[a], v[a+1], …, v[b])
v[i,], the ith row
v[,j], the jth column
Check the BUGS manual for more information.
Let’s try a simple example: if we flip a fair coin 8 times, what is the probability of seeing 2 or fewer heads?
First, we can solve this analytically without BUGS. Given \(Y \sim Binomial(0.5,8)\), we wish to know \(Pr(Y \le 2)\). That is,
\[Pr(Y \le 2) = \sum_{y=0}^2 p(y|\pi=0.5, n=8) = \sum_{y=0}^2 {8 \choose y} (0.5)^y (1-0.5)^{8-y} = 0.1445\]
Or in R:
pbinom(2, size=8, prob=0.5)
## [1] 0.1445313
Now, let’s show how to simulate it using BUGS.
Our example is modelled with two statements (step is the BUGS step function, ie, it returns 1 if its input is non-negative and 0 otherwise):
Y ~ dbin(0.5, 8)
P2 <- step(2.5-Y) # i.e., is Y <= 2?
In diagram form, Y is a stochastic node feeding the logical node P2.
Let’s show how to define and run this model using R to communicate with BUGS:
library(BRugs)
## Welcome to BRugs connected to OpenBUGS version 3.2.3
# Function to draw histograms of sample results
draw.hist <- function(data, text="") {
tmp <- hist(data, breaks=0:(max(data)+1), xaxt="n", right=FALSE, freq=TRUE, main=text)
axis(1, at=tmp$mids, labels=0:max(data))
}
#------------------------------------------------------------------------------
# THE MODEL
# Specify the model in BUGS language, but save it as a string in R:
modelString = "
# BUGS model specification begins ...
model {
Y ~ dbin(0.5, 8)
P2 <- step(2.5-Y) # i.e., is Y <= 2?
}
# ... BUGS model specification ends.
" # close quote to end modelString
# Write the modelString to a file, using R commands:
writeLines(modelString,con="model.txt")
# Use BRugs to send the model.txt file to BUGS, which checks the model syntax:
modelCheck( "model.txt" )
## model is syntactically correct
#------------------------------------------------------------------------------
# INITIALIZE THE CHAIN.
modelCompile() # BRugs command tells BUGS to compile the model.
## model compiled
modelGenInits() # BRugs command tells BUGS to randomly initialize a chain.
## initial values generated, model initialized
#------------------------------------------------------------------------------
# RUN THE CHAINS.
# BRugs tells BUGS to keep a record of the sampled values:
samplesSet( c("Y","P2") )
## monitor set for variable 'Y'
## monitor set for variable 'P2'
# R command defines a new variable that specifies an arbitrary chain length:
chainLength = 20000
# BRugs tells BUGS to generate a MCMC chain:
modelUpdate( chainLength )
## 20000 updates took 0 s
#------------------------------------------------------------------------------
# EXAMINE THE RESULTS.
YSample <- samplesSample( "Y" ) # BRugs asks BUGS for the sample values.
# The first 100 simulated numbers of heads
head(YSample, n=100)
## [1] 4 8 4 5 4 5 4 2 5 3 6 4 4 3 5 3 5 3 4 2 3 7 2 5 3 3 4 6 5 3 1 6 2 4 4
## [36] 1 6 6 3 6 2 5 5 4 5 5 6 3 4 6 3 3 4 3 2 4 5 3 4 5 5 4 2 4 2 6 5 3 4 4
## [71] 3 4 7 2 3 5 4 4 6 7 5 6 3 6 6 6 5 1 7 3 2 5 3 4 6 4 5 4 4 5
draw.hist(YSample, "num. of heads")
P2Sample <- samplesSample( "P2" ) # But it's P2 that matters to us
# we can see them (there are 20k samples: 1 if there were two or fewer heads, 0 otherwise)
head(P2Sample, n=50)
## [1] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0
## [36] 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0
draw.hist(P2Sample, "two or fewer heads?")
samplesStats( "Y" ) # BRugs asks BUGS for summary statistics.
## mean sd MC_error val2.5pc median val97.5pc start sample
## Y 4.006 1.424 0.008681 1 4 7 1 20000
P2Summary <- samplesStats( "P2" )
P2Summary
## mean sd MC_error val2.5pc median val97.5pc start sample
## P2 0.1464 0.3536 0.00228 0 0 1 1 20000
We see that the mean is 0.1464, quite close to the true answer.
To avoid repeating the initial steps, let’s make a function for them:
run.model <- function(model, samples, data=list(), chainLength=10000, burnin=0.10,
init.func, n.chains=1, thin=1) {
writeLines(model, con="model.txt") # Write the modelString to a file
modelCheck( "model.txt" ) # Send the model to BUGS, which checks the model syntax
if (length(data)>0) # If there's any data available...
modelData(bugsData(data)) # ... BRugs puts it into a file and ships it to BUGS
modelCompile(n.chains) # BRugs command tells BUGS to compile the model
if (missing(init.func)) {
modelGenInits() # BRugs command tells BUGS to randomly initialize a chain
} else {
for (chain in 1:n.chains) { # otherwise use user's init data
modelInits(bugsInits(init.func))
}
}
modelUpdate(chainLength*burnin) # Burn-in period to be discarded
samplesSet(samples) # BRugs tells BUGS to keep a record of the sampled values
samplesSetThin(thin) # Set thinning
modelUpdate(chainLength) # BRugs command tells BUGS to generate the MCMC chain
}
A defective coin minting machine produces coins whose probability of Heads is a random variable Q with pdf \[f_Q(q) = 3q^2, q \in [0,1] \]
A coin produced by this machine is tossed repeatedly, with successive tosses assumed to be independent. Let A be the event that the first toss of this coin results in Heads, and let B be the event that the second toss of this coin results in Heads.
The probability \(P(A)\) is the expected value of tossing Heads:
\[E[q] = \int_0^1 3q^2\times q~dq = 0.75\]
It’s possible to simulate this in BUGS. However, we need to build a random-number generator for this pdf. For that we use the inverse-transform technique:
Compute the cdf: \[F_Q(q) = q^3\] and its inverse \[F_Q^{-1}(u)=u^{1/3}\]
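As a quick sanity check of the inverse transform, a sketch in plain R (the seed is arbitrary):
set.seed(101)
u  <- runif(1e5)   # uniform samples
qs <- u^(1/3)      # apply the inverse cdf
mean(qs)           # should be close to E[Q] = 0.75
hist(qs, breaks=50, prob=TRUE)
curve(3*x^2, col="red", lwd=2, add=TRUE)  # target pdf f_Q(q) = 3q^2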
modelString = "
model {
u ~ dunif(0,1)
fq <- pow(u,1/3)
Toss ~ dbern(fq)
}
"
run.model(modelString, samples=c("fq"), chainLength=50000)
## model is syntactically correct
## model compiled
## initial values generated, model initialized
## 5000 updates took 0 s
## monitor set for variable 'fq'
## 50000 updates took 0 s
samplesStats("fq")$mean
## [1] 0.7489
Say we wish to compute the posterior pdf \(f_{Q|A}\) after event \(A\).
\[f_{Q|A}(q|A) = \frac{f_Q(q) p(A|q)}{\int_0^1 f_Q(q) p(A|q)~dq}\]
The prior is \(f_Q(q)=3q^2\) and the likelihood \(p(A|q)\) is simply \(q\), the probability of heads given the coin’s bias, so:
\[f_{Q|A}(q|A) = \frac{3q^2 \times q}{\int_0^1 3q^2 \times q~dq} = \frac{3q^3}{3/4} = 4q^3\]
What is the expected value of tossing Heads after event \(A\)?
\[E[q|A] = \int_0^1 4q^3\times q~dq = 0.8\]
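Both expectations can be checked numerically in R:
integrate(function(q) 3*q^2 * q, 0, 1)$value  # E[q]   = 0.75
integrate(function(q) 4*q^3 * q, 0, 1)$value  # E[q|A] = 0.8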
Let Bugs it:
data.list = list(
Toss = 1 # event A
)
run.model(modelString, samples=c("fq"), data=data.list, chainLength=50000)
## model is syntactically correct
## data loaded
## model compiled
## initial values generated, model initialized
## 5000 updates took 0 s
## monitor set for variable 'fq'
## 50000 updates took 0 s
samplesStats("fq")$mean
## [1] 0.8003
# comparing the empirical data with the analytical solution:
hist(samplesSample( "fq" ), breaks=50, prob=T)
curve(4*x^3, col="red", lwd=2, add=T)
The next example asks questions about a transformation of a random variable:
We have \(Z \sim N(0,1)\). From \(Z\) we obtain \(Y=(2Z+1)^3\). What is its distribution, expected value and \(Pr(Y>10)\)? This problem has a non-trivial analytical solution but can be easily simulated:
modelString = "
model {
Z ~ dnorm(0, 1) # note: 1 is precision = 1/variance
Y <- pow(2*Z+1, 3)
P10 <- step(Y-10)
}
"
run.model(modelString, samples=c("Y","P10"), chainLength=50000)
## model is syntactically correct
## model compiled
## initial values generated, model initialized
## 5000 updates took 0 s
## monitor set for variable 'Y'
## monitor set for variable 'P10'
## 50000 updates took 0 s
Now we can answer the questions:
# Its distribution
YSample <- samplesSample( "Y" )
hist(YSample, prob=TRUE)
lines(density(YSample, adjust=5), col="red", lwd=2)
# Its expected value
samplesStats("Y")$mean
## [1] 12.95
# Pr(Y>10)
samplesStats("P10")$mean
## [1] 0.2832
Analytically the expected value is \(13\) and \(Pr(Y>10) \approx 0.2820\). The simulated results are very close.
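We can cross-check with a direct Monte Carlo simulation in plain R (the seed is arbitrary):
set.seed(101)
Y <- (2*rnorm(5e4) + 1)^3
mean(Y)       # should be close to 13
mean(Y > 10)  # should be close to 0.282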
We have \(20\) identical items to repair. The unit repair cost is given by a gamma with mean \(100\) euros and sd \(50\) euros. We have \(1000\) euros for repairs. How many items, on average, will we be able to repair?
A gamma(a,b) distribution has mean \(\frac{a}{b}\) and variance \(\frac{a}{b^2}\). So the gamma for this exercise must be gamma(4,0.04).
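Solving the two moment equations in R (mean \(m = a/b\), variance \(v = a/b^2\)):
m <- 100; v <- 50^2
b <- m/v   # rate  = 0.04
a <- m*b   # shape = 4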
To use BUGS we will define 20 random variables, where \(Y_i\) is the cost of item i. We’ll compute a cumulative vector of costs, cum.cost, where cum.cost[i] is the sum of the first i items. Then we’ll make a cum.step vector that consists of \(1,2,\ldots,M,0,0,\ldots\) where \(M\) is the last item we fix before running out of money. Then we just need to find that \(M\), which is the maximum value of cum.step. The maximum will be found using the ranked function (ranked(v[], k) returns the kth-smallest element, so k = 20 gives the maximum).
modelString = "
model {
for(i in 1:20) {
Y[i] ~ dgamma(4, 0.04)
}
cum.cost[1] <- Y[1]
for(i in 2:20) {
cum.cost[i] <- cum.cost[i-1] + Y[i]
}
for(i in 1:20) {
cum.step[i] <- i * step(1000 - cum.cost[i]) # 1,2,...,M,0,0,...
}
answer <- ranked(cum.step[],20)
}
"
run.model(modelString, samples=c("answer", "Y[1]"), chainLength=50000)
## model is syntactically correct
## model compiled
## initial values generated, model initialized
## 5000 updates took 0 s
## monitor set for variable 'answer'
## monitor set for variable 'Y[1]'
## 50000 updates took 0 s
samplesStats("Y[1]") # just to check if the mean and sd are ok (must be close to 100 and 50)
## mean sd MC_error val2.5pc median val97.5pc start sample
## Y[1] 100.2 49.95 0.2212 27.29 92.21 219.2 5001 50000
samplesStats("answer")$mean # what we want to know!
## [1] 9.62
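As a sanity check, the same quantity can be estimated by direct simulation in plain R (the seed is arbitrary):
set.seed(101)
sims <- replicate(5e4, {
  cost <- rgamma(20, shape=4, rate=0.04)  # the 20 unit repair costs
  sum(cumsum(cost) <= 1000)               # items fixed before the money runs out
})
mean(sims)  # should be close to BUGS's 'answer'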
A hospital will begin a new high-risk treatment. Other hospitals that perform it have a risk of death \(\theta\) around \(10\%\), and it would be quite surprising if it were less than \(3\%\) or more than \(20\%\) (say this interval holds \(90\%\) of the probability mass). This is our prior information, which will be modelled by a convenient beta distribution:
library(rriskDistributions) # for finding distribution parameters
prior.dist <- get.beta.par(p=c(.05,.5,.95),q=c(.03,.1,.2), plot=FALSE, show.output=FALSE)
prior.shape1 <- prior.dist[["shape1"]]
prior.shape1
## [1] 3.420687
prior.shape2 <- prior.dist[["shape2"]]
prior.shape2
## [1] 28.46136
xs <- seq(0,1,len=100)
plot(xs, dbeta(xs,prior.shape1,prior.shape2), type="l", col="red", main="prior distribution")
Now suppose the hospital is expected to perform \(20\) operations. How many deaths \(Y\) should we expect? What is the probability of having at least \(6\) deaths, ie, \(Pr(Y \ge 6)\)?
Our death rate is \(\theta \sim \text{Beta}(3.42,28.46)\). The number of deaths is modeled by \(Y|\theta \sim \text{Binomial}(\theta,20)\). Analytically, the prediction of \(Y\) is given by a \(\text{Beta-Binomial}(3.42,28.46,20)\), whose mean is \(\frac{20\times 3.42}{3.42+28.46} \approx 2.14\). This is the expected number of deaths.
To compute \(Pr(Y \ge 6)\):
library(VGAM)
## Loading required package: stats4
## Loading required package: splines
1-pbetabinom.ab(5,size=20,shape1=3.42,shape2=28.46) # Pr(Y >= 6) = 1 - Pr(Y <= 5)
## [1] 0.04663492
Now the respective simulation:
modelString = "
model {
theta ~ dbeta(3.42, 28.46)
Y ~ dbin(theta, 20)
P6 <- step(Y - 5.5)
}
"
run.model(modelString, samples=c("Y", "P6"), chainLength=20000)
## model is syntactically correct
## model compiled
## initial values generated, model initialized
## 2000 updates took 0 s
## monitor set for variable 'Y'
## monitor set for variable 'P6'
## 20000 updates took 0 s
samplesStats("Y") # mean should be close to 2.14
## mean sd MC_error val2.5pc median val97.5pc start sample
## Y 2.148 1.735 0.01206 0 2 6 2001 20000
samplesStats("P6")$mean # Pr(Y >= 6)
## [1] 0.0459
The previous hospital performed \(n=10\) high-risk treatments and observed \(y=0\) deaths. Using the prior for the risk of death \(\theta \sim \text{Beta}(3.42,28.46)\) from the previous exercise, what will be the posterior distribution given this data? Also, what is the probability of the next patient dying? What is the probability of having 2 or more deaths in the next 20 operations?
Using the conjugate functions, we know that for a prior \(\theta \sim \text{Beta}(a,b)\) and a sampling distribution \(y|\theta,n \sim\text{Binomial}(\theta,n)\) the posterior distribution is given by \(\theta|y,n \sim \text{Beta}(a+y,b+n-y)\). In this case, the posterior distribution is \(\text{Beta}(3.42+0,28.46+10-0)=\text{Beta}(3.42,38.46)\).
Plotting the prior and posterior distributions together, we see that the amount of uncertainty has diminished (we now have more data):
xs <- seq(0,1,len=50)
plot(xs,dbeta(xs,3.42,28.46), type="l", ylim=c(0,11))
points(xs, dbeta(xs,3.42,38.46), type="l", col="red")
legend("topright", c("prior", "posterior"), col = c("black","red"), lwd=1)
With the posterior we can answer the next question:
# Q: What is the probability of the next patient dying?
# A: The mean of the posterior, in this case, X ~ Beta(a,b) => E[X] = a/(a+b)
3.42/(3.42+38.46)
## [1] 0.08166189
And we use the predictive function \(\text{Beta-Binomial}(3.42,38.46,20)\) to answer the final question:
# Q: What is the probability of having 2 or more deaths in the next 20 operations, Pr(Y>=2)?
1-pbetabinom.ab(1,size=20,shape1=3.42,shape2=38.46) # Pr(Y >= 2) = 1 - Pr(Y <= 1)
## [1] 0.4580711
Before this data, this same question would evaluate to \(58\%\). But because of the excellent first results at the hospital, the estimate is revised down to just \(46\%\).
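The \(58\%\) figure is the prior predictive \(\text{Beta-Binomial}(3.42,28.46,20)\), which we can verify (VGAM is still loaded from above):
1-pbetabinom.ab(1,size=20,shape1=3.42,shape2=28.46) # Pr(Y >= 2) under the prior, approx. 0.58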
Now (surprise!) let’s do this using BUGS:
modelString = "
model {
theta ~ dbeta(3.42, 28.46) # prior
y ~ dbin(theta, n) # likelihood
Y.pred ~ dbin(theta, 20) # predictive distribution
Q3 <- step(Y.pred - 1.5) # = 1 if Y.pred >= 2, = 0 otherwise
}
"
# We list the data we know, in this case, there were 0 deaths in 10 operations
data.list = list(
y = 0,
n = 10
)
run.model(modelString, samples=c("theta", "Q3"), data=data.list, chainLength=50000)
## model is syntactically correct
## data loaded
## model compiled
## initial values generated, model initialized
## 5000 updates took 0 s
## monitor set for variable 'theta'
## monitor set for variable 'Q3'
## 50000 updates took 0 s
samplesStats("theta")$mean # the posterior risk of death in 20 trials
## [1] 0.08134
samplesStats("Q3")$mean # Pr(Y>=2)
## [1] 0.4554
BUGS adds the data from the data list into the model and propagates that information to the other nodes, including the nodes we care about (in this eg, theta and Q3).
Note: if we need to input multidimensional data, use something like list( X=structure(.Data=c(1,2,3,4,5,6), .Dim=c(2,3)) ), which defines a matrix with 2 rows and 3 columns (values are placed by row).
Just to compare, let’s say we still do not have the previous data. Then the same answers would be computed by the following model (notice that there is no data list):
modelString = "
model {
theta ~ dbeta(3.42, 28.46) # prior
Y.pred ~ dbin(theta, 20) # predictive distribution
Q3 <- step(Y.pred - 1.5) # = 1 if Y.pred >= 2, = 0 otherwise
}
"
run.model(modelString, samples=c("theta", "Q3"), chainLength=50000)
## model is syntactically correct
## model compiled
## initial values generated, model initialized
## 5000 updates took 0 s
## monitor set for variable 'theta'
## monitor set for variable 'Q3'
## 50000 updates took 0 s
samplesStats("theta")$mean # the posterior risk of death in 20 trials
## [1] 0.107
samplesStats("Q3")$mean # Pr(Y>=2)
## [1] 0.5808
We got the expected \(10\%\) risk of death (the prior information we had) and the \(58\%\) that we already knew.
I have three coins in my pocket. One, \(c_1\), is biased 3:1 towards tails; one, \(c_2\), is fair; and another, \(c_3\), is biased 3:1 towards heads. I pick one coin at random, toss it, and observe heads. What is the posterior distribution over the coins, and what is the probability that the next toss is heads again?
Each coin has probability \(1/3\) of being picked, i.e., \(p(c_i)=1/3, i=1,2,3\). Let’s call \(\theta_i\) the probability of tossing heads for coin \(c_i\). The likelihood of tossing heads (\(y=1\)) or tails (\(y=0\)) is \(p(y|\theta_i) = \theta_i^y (1-\theta_i)^{1-y}\).
We can get the posterior \(p(\theta|y=1)\) using Bayes’ theorem. So,
\[ \begin{array}{cccc} \theta_i & p(\theta_i) & p(y=1|\theta_i) & p(\theta_i|y=1) \\ 0.25 & 0.33 & 0.25 & 0.167 \\ 0.50 & 0.33 & 0.50 & 0.333 \\ 0.75 & 0.33 & 0.75 & 0.5 \end{array} \]
After observing heads, the distribution changes accordingly. Eg, now it is most likely that the coin in our hand is the heads-biased one (\(50\%\) chance).
To compute the probability that the next toss \(y_1\) is heads:
\[p(y_1=1|y=1) = \sum_i p(y_1=1|\theta_i) p(\theta_i|y=1) = 7/12 \approx 0.583\]
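A quick check of this table and of the predictive probability in R:
thetas <- c(0.25, 0.50, 0.75)  # heads probability of each coin
post <- (1/3) * thetas         # prior times likelihood p(y=1|theta_i)
post <- post / sum(post)       # posterior: 0.167, 0.333, 0.500
sum(thetas * post)             # p(y1=1|y=1) = 7/12, approx. 0.583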
With BUGS:
modelString = "
model {
coin ~ dcat(p[]) # coin is modeled by a categorical distribution
# ie, it outputs 1,2..n where the given vector p[] has the respective probabilities
# in this eg it will output k=1,2 or 3 which corresponds to coin c_k
y ~ dbern(true.theta) # the toss depends on the true coin we picked
true.theta <- theta[coin] # this can be coin 1,2,3 each with its theta
for(i in 1:3) {
p[i] <- 1/3 # creating vector p = {1/3,1/3,1/3}
theta[i] <- 0.25*i # creating vector theta = {0.25,0.50,0.75}
coin.prob[i] <- equals(coin,i) # data for the posterior distribution
}
y1 ~ dbern(true.theta) # the outcome of the next toss
}
"
data.list = list(
y = 1
)
run.model(modelString, samples=c("coin.prob","y1"), data=data.list, chainLength=50000)
## model is syntactically correct
## data loaded
## model compiled
## initial values generated, model initialized
## 5000 updates took 0 s
## monitor set for variable 'coin.prob'
## monitor set for variable 'y1'
## 50000 updates took 0 s
samplesStats("y1")$mean # the probability that next coin is heads
## [1] 0.583
samplesStats("coin.prob") # the posterior probability
## mean sd MC_error val2.5pc median val97.5pc start sample
## coin.prob[1] 0.1679 0.3738 0.002207 0 0 1 5001 50000
## coin.prob[2] 0.3269 0.4691 0.002375 0 0 1 5001 50000
## coin.prob[3] 0.5051 0.5000 0.002896 0 1 1 5001 50000
Let’s say we had four tosses, with 3 heads and 1 tail; \(y\) is now a vector with four positions:
modelString = "
model {
coin ~ dcat(p[]) # coin is modeled by a categorical distribution
# ie, it outputs 1,2..n where the given vector p[] has the respective probabilities
# in this eg it will output k=1,2 or 3 which corresponds to coin c_k
for (i in 1:nTosses) {
y[i] ~ dbern(true.theta)
}
true.theta <- theta[coin] # this can be coin 1,2,3 each with its theta
for(i in 1:3) {
p[i] <- 1/3 # creating vector p = {1/3,1/3,1/3}
theta[i] <- 0.25*i # creating vector theta = {0.25,0.50,0.75}
coin.prob[i] <- equals(coin,i) # data for the posterior distribution
}
y1 ~ dbern(true.theta) # the outcome of the next toss
}
"
data.list = list(
y = c(1,1,1,0),
nTosses = 4
)
run.model(modelString, samples=c("coin.prob","y1"), data=data.list, chainLength=50000)
## model is syntactically correct
## data loaded
## model compiled
## initial values generated, model initialized
## 5000 updates took 0 s
## monitor set for variable 'coin.prob'
## monitor set for variable 'y1'
## 50000 updates took 0 s
samplesStats("y1")$mean # the probability that next coin is heads
## [1] 0.6301
samplesStats("coin.prob") # the posterior probability
## mean sd MC_error val2.5pc median val97.5pc start sample
## coin.prob[1] 0.06468 0.2460 0.001521 0 0 1 5001 50000
## coin.prob[2] 0.34610 0.4757 0.002559 0 0 1 5001 50000
## coin.prob[3] 0.58920 0.4920 0.002967 0 1 1 5001 50000
Notice that the order of heads and tails given in the data list is irrelevant because of the exchangeability assumption: a sequence of random variables that are iid conditional on some underlying distributional form is exchangeable.
In the hospital example, we wish to infer the risk of death \(\theta\) after observing \(y\) deaths in \(n\) operations. The likelihood, up to a constant, is \(p(y|\theta) \propto \theta^y (1-\theta)^{n-y}\), and we choose a non-conjugate normal prior on the logistic transform of \(\theta\):
\[\text{logit}(\theta) = \log \frac{\theta}{1-\theta} \sim \text{Normal}(0,2.71)\]
With this prior there is no conjugacy that yields a well-behaved closed form for the posterior \(p(\theta|y,n)\). BUGS to the rescue:
modelString = "
model {
y ~ dbin(theta, n) # likelihood
logit(theta) <- logit.theta
logit.theta ~ dnorm(0, 0.368) # precision: 1/2.71 = 0.368
}
"
data.list = list(
y = 10, # our data: 10 deaths after 100 operations
n = 100
)
run.model(modelString, samples=c("theta"), data=data.list, chainLength=50000)
## model is syntactically correct
## data loaded
## model compiled
## initial values generated, model initialized
## 5000 updates took 0 s
## monitor set for variable 'theta'
## 50000 updates took 0 s
samplesStats("theta")
## mean sd MC_error val2.5pc median val97.5pc start sample
## theta 0.108 0.03017 0.0001324 0.05636 0.1056 0.1739 5001 50000
thetaSample <- samplesSample("theta")
samplesDensity("theta", mfrow = c(1, 1))
Let’s say we wish to fit a beta to this data:
library(MASS)
beta.fit <- fitdistr(thetaSample,"beta",list(shape1=1,shape2=1))
beta.fit
## shape1 shape2
## 11.32464529 93.53647889
## ( 0.07060012) ( 0.59461703)
# let's draw & compare both densities
samplesDensity("theta", mfrow = c(1, 1))
xs <- seq(0,1,len=100)
lines(xs, dbeta(xs,beta.fit$estimate[1],beta.fit$estimate[2])) # overlay the fitted beta density
Sidenote: logit is the inverse of the sigmoid function; for a probability \(p\), logit(p) is the log-odds of \(p\), ie, log(p/(1-p)). Logit is the quantile function of the logistic distribution:
logit <- function (x) log(x/(1-x))
logit(0.34)
## [1] -0.6632942
qlogis(0.34)
## [1] -0.6632942
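Incidentally, a Normal(0, 2.71) prior on logit \(\theta\) is roughly flat on the probability scale, because a normal with variance 2.71 approximates the standard logistic distribution. A sketch of the check:
theta <- plogis(rnorm(1e5, mean=0, sd=sqrt(2.71))) # push logit-scale samples through the inverse logit
hist(theta, breaks=50, prob=TRUE)                  # roughly uniform on (0,1)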
Assume the following model (described below in BUGS), where we wish to make a marginal inference about the unknown number of degrees of freedom:
\[p(d|y) = \int_\mu \int_r p(\mu,r,d|y) dr d\mu\]
modelString = "
model {
### Likelihood
for (i in 1:n) {
y[i] ~ dt(mu,r,d)
}
}
"
# first generate 100 dummy observations given some initial parameter values
data.list = list(
n=100, mu=0, r=1, d=4
)
run.model(modelString, samples=c("y"), data=data.list, chainLength=10000)
## model is syntactically correct
## data loaded
## model compiled
## initial values generated, model initialized
## 1000 updates took 0 s
## monitor set for variable 'y'
## 10000 updates took 0 s
n <- 100
ys <- rep(NA,n)
for(i in 1:n) {
ys[i] <- samplesStats(paste0("y[",i,"]"))$mean
}
head(ys, n=12)
## [1] 0.0136300 -0.0185600 0.0263600 -0.0029770 -0.0007505 -0.0066320
## [7] -0.0282100 0.0113100 -0.0335200 -0.0200000 -0.0143100 -0.0166100
With this dummy dataset, let’s feed the model with some ‘vague’ priors and execute it again:
modelString = "
model {
### Likelihood
for (i in 1:n) {
y[i] ~ dt(mu,r,d)
}
### Priors
mu ~ dnorm(gamma, precision)
r ~ dgamma(alpha, beta)
d ~ dcat(p[])
p[1] <- 0
for (i in 2:30) {p[i] <- 1/29}
### Values for 'vague' priors
alpha <- 0.001
beta <- 0.001
gamma <- 0
precision <- 0.0001
}
"
data.list = list(
n=100, y=ys
)
# Sometimes it is necessary to tell BUGS where to start, especially for
# distributions with large variances, where BUGS might create inappropriate
# starting values or is simply unable to produce one.
# (JAGS chooses a 'central' value, which avoids this problem.)
n.chains <- 3
mus <- c(0,1,2) # more chains => these vectors should have more values
rs <- c(1,1.2,2)
ds <- c(10,15,20)
# Herein, it's used a closure with a local var 'i' that determines which chain
# is being initialized
genInitFactory <- function() {
i <- 0
function() {
i <<- i + 1
list(
mu = mus[i],
r = rs[i],
d = ds[i]
)
}
}
run.model(modelString, samples=c("d"), data=data.list, chainLength=25000, init.func=genInitFactory(), n.chains=n.chains)
## model is syntactically correct
## data loaded
## model compiled
## Initializing chain 1:
## initial values loaded and chain initialized but another chain contain uninitialized variables
## Initializing chain 2:
## initial values loaded and chain initialized but another chain contain uninitialized variables
## Initializing chain 3:
## model is initialized
## 2500 updates took 1 s
## monitor set for variable 'd'
## 25000 updates took 10 s
samplesStats("d")
## mean sd MC_error val2.5pc median val97.5pc start sample
## d 19.49 6.858 0.1139 7 20 30 2501 75000
samplesDensity("d", mfrow = c(1, 1))
This example was adapted from Probabilistic Programming and Bayesian Methods for Hackers, an introduction to MCMC in Python using the PyMC library.
You are given a series of daily text-message counts from a user of your system. The data, plotted over time, appears in the chart below. You are curious to know if the user’s text-messaging habits have changed over time, either gradually or suddenly. How can you model this?
This is the data:
msg.data <- read.csv("txtdata.csv")[,1]
length(msg.data)
## [1] 73
head(msg.data)
## [1] 24 8 24 7 35 14
space <- 0.2
width <- 1
barplot(msg.data, names.arg=1:73, space=space, width=width, xlab="days", ylab="#messages")
We start by modelling \(C_i\), the number of messages on day \(i\), with a Poisson distribution. We will assume there is one moment when its parameter changes: at a day \(\tau\) the Poisson parameter changes from, say, \(\lambda_1\) to \(\lambda_2\). So,
\[C_i \sim Poisson(\lambda_1), i \leq \tau\] \[C_i \sim Poisson(\lambda_2), i > \tau\]
If the data does not provide evidence for this assumption, we expect to find \(\lambda_1 \approx \lambda_2\).
Since we don’t know the \(\lambda_i\) parameters, we need to include them in the model as random variables. In this case, we use an exponential to model our uncertainty (we know they must be positive), ie,
\[\lambda_i \sim Exp(\alpha)\]
where \(\alpha\) is a hyperparameter (a parameter of a parameter). Of course, we do not know \(\alpha\) either, so we could also model it; but since its effect is small, we set it equal to the inverse of the average of the data, because
\[\frac{1}{N} \sum_{i=1}^N C_i \approx E[\lambda | \alpha] = \frac{1}{\alpha}\]
We still need to model \(\tau\). Since we don’t have any extra information, let’s give equal probability for every day:
\[\tau \sim DiscreteUniform(1,73)\]
So, let’s write all this the BUGS way:
modelString = "
model {
tau ~ dcat(day[]) # the day that the Poisson parameter changes
lambda1 ~ dexp(alpha) # the Poisson parameter before the change
lambda2 ~ dexp(alpha) # the Poisson parameter after the change
for(i in 1:N) {
day[i] <- 1/N # each day has an a priori equal probability
lambdas[i] <- step(tau-i) * lambda1 + step(i-tau-1) * lambda2
msgs[i] ~ dpois(lambdas[i])
}
}
"
data.list = list(
alpha = 1 / mean(msg.data),
N = length(msg.data),
msgs = msg.data
)
run.model(modelString, samples=c("tau", "lambda1", "lambda2"), data=data.list, chainLength=1e5)
## model is syntactically correct
## data loaded
## model compiled
## initial values generated, model initialized
## 10000 updates took 5 s
## monitor set for variable 'tau'
## monitor set for variable 'lambda1'
## monitor set for variable 'lambda2'
## 100000 updates took 49 s
samplesStats("tau") # stats of the posterior for the phase change
## mean sd MC_error val2.5pc median val97.5pc start sample
## tau 43.26 2.356 0.02795 41 43 44 10001 100000
samplesStats("lambda1") # stats of the posterior for the 1st interval
## mean sd MC_error val2.5pc median val97.5pc start sample
## lambda1 17.87 0.6681 0.003383 16.63 17.86 19.19 10001 100000
samplesStats("lambda2") # stats of the posterior for the 2nd interval
## mean sd MC_error val2.5pc median val97.5pc start sample
## lambda2 22.69 1.016 0.009435 20.93 22.69 24.48 10001 100000
hist(samplesSample("tau"), prob=TRUE, breaks=c(0,40.5,41.5,42.5,43.5,44.5,75), xlim=c(39,46), main="tau")
hist(samplesSample("lambda1"), prob=TRUE, xlim=c(15,21), breaks=100, main="lambda1")
hist(samplesSample("lambda2"), prob=TRUE, xlim=c(19,26), breaks=100, main="lambda2")
So the model, after processing the data, proposes the following scenario:
barplot(msg.data, names.arg=1:73, space=space, width=width, xlab="days", ylab="#messages")
lines(((1:73)-.5)*(width+space),c(rep(17.9,44),rep(22.7,29)), lwd=2, col="red")
We have a coin that might be biased or fair. We assume a prior probability of 0.9 of it being fair. We observe 15 heads out of \(n=20\) tosses. What is the posterior probability of the coin being biased?
For two prior distributions, \(p_1, p_2\), the overall distribution is a mixture:
\[p(\theta) = q \times p_1(\theta) + (1-q) \times p_2(\theta) \]
where \(q\) is the assessed probability.
After observing data \(y\):
\[p(\theta|y) = q' \times p_1(\theta|y) + (1-q') \times p_2(\theta|y) \]
where:
\[p_i(\theta|y) \propto p(y|\theta) p_i(\theta)\\ q' = \frac{q~p_1(y)}{q~p_1(y) + (1-q)~p_2(y)}\\ p_i(y) = \int p(y|\theta) p_i(\theta) d\theta\]
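Before simulating, the posterior weight \(q'\) can be computed exactly in R. Under the fair-coin hypothesis, \(p_1(y)\) = dbinom(15, 20, 0.5); under the uniform prior on \(\theta\), \(p_2(y)\) is Beta-Binomial(1,1,20), ie, 1/21:
p1 <- dbinom(15, 20, 0.5)       # marginal likelihood, fair coin
p2 <- 1/21                      # marginal likelihood, uniform prior on theta
q  <- 0.9                       # prior probability of a fair coin
1 - q*p1 / (q*p1 + (1-q)*p2)    # Pr(biased | data), approx. 0.26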
modelString = "
model {
heads ~ dbin(p, n) # Likelihood
p <- theta[pick] # p might be biased (pick=2) or unbiased (pick=1)
pick ~ dcat(q[]) # pick=i => prior assumption i was selected
q[1] <- 0.9 # pick is modeled by a categorical distribution
q[2] <- 0.1 # ... with these values
theta[1] <- 0.5 # theta for the fair coin
theta[2] ~ dunif(0,1) # vague prior for the biased coin
biased <- pick-1 # 1 if biased, 0 if unbiased
}
"
data.list = list(
heads=15, n=20
)
run.model(modelString, samples=c("biased","theta[2]"), data=data.list, chainLength=50000)
## model is syntactically correct
## data loaded
## model compiled
## initial values generated, model initialized
## 5000 updates took 0 s
## monitor set for variable 'biased'
## monitor set for variable 'theta[2]'
## 50000 updates took 0 s
biasedSample <- samplesSample( "biased" )
draw.hist(biasedSample, "biased coin?")
samplesDensity("theta[2]", mfrow = c(1, 1))
samplesStats(c("biased","theta[2]"))
## mean sd MC_error val2.5pc median val97.5pc start sample
## biased 0.2603 0.4388 0.002896 0.00000 0.0000 1.0000 5001 50000
## theta[2] 0.5589 0.2725 0.001267 0.03275 0.6261 0.9657 5001 50000
The initial 10% chance of being biased became, after the data, 26%.
Sensitivity analysis, in the Bayesian framework, checks whether a prior distribution has unintended sensitivity to apparently innocuous non-informative assumptions. Since there is no such thing as a true prior, we are forced to test several options and check their implications for estimation.
If we are unable to prevent these implications, it might mean we don’t have enough information, and we must work to provide and justify, from previous knowledge, a proper informative prior.
An example: suppose we enter a town of unknown size and see that a bus has number \(y=100\). Assume the buses are numbered from \(1\ldots N\), where \(N\) is the unknown number of city buses. How large might \(N\) be?
We assume \(p(y|N) = 1/N\). The maximum likelihood estimate would say \(N=100\) buses, which is not reasonable!
Let’s assume a uniform prior for N with a maximum number of \(M\) possible buses, ie, \(p(N) = 1/M\).
modelString = "
model {
y ~ dcat(p[]) # Likelihood
for(i in 1:M) {
p[i] <- step(N - i+0.01)/N # only values <= N have positive probability
}
N ~ dcat(p.unif[]) # N is itself a random variable (what we want to know)
for(i in 1:M) {
p.unif[i] <- 1/M
}
}
"
data.list = list(
M=5000, y=100 # let's try an upper bound of 5k buses
)
run.model(modelString, samples=c("N"), data=data.list, chainLength=5000)
## model is syntactically correct
## data loaded
## model compiled
## initial values generated, model initialized
## 500 updates took 0 s
## monitor set for variable 'N'
## 5000 updates took 0 s
samplesStats(c("N"))
## mean sd MC_error val2.5pc median val97.5pc start sample
## N 1200 1238 33.05 111 690 4476 501 5000
This simulation gives a mean of 1200 buses and a median of 690, reflecting a highly skewed distribution:
NSample <- samplesSample( "N" )
hist(NSample, breaks=50)
Is this sensible? Our bounded posterior is proper (ie, it sums to 1) but the unbounded one (\(M \rightarrow \infty\)) is not. It turns out that this makes the model extremely sensitive to initial assumptions, even to the value of \(M\), which should not be relevant. To check this, let’s run the model with \(M=10000\):
data.list = list(
M=10000, y=100
)
run.model(modelString, samples=c("N"), data=data.list, chainLength=5000)
## model is syntactically correct
## data loaded
## model compiled
## initial values generated, model initialized
## 500 updates took 0 s
## monitor set for variable 'N'
## 5000 updates took 0 s
samplesStats(c("N"))
## mean sd MC_error val2.5pc median val97.5pc start sample
## N 2061 2386 63.69 113 1021 8886 501 5000
Now the mean is 2061 (!) and the median 1021. This reveals a serious problem with the model!
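Since \(N\) is discrete and bounded, both posteriors can also be computed exactly by enumeration, which confirms that this behaviour is a property of the model and not simulation noise. A sketch:
post.stats <- function(M, y=100) {
  Ns <- y:M
  w  <- 1/Ns      # likelihood 1/N (the uniform prior 1/M cancels when normalizing)
  w  <- w/sum(w)  # normalized posterior p(N|y)
  c(mean=sum(Ns*w), median=Ns[which(cumsum(w) >= 0.5)[1]])
}
post.stats(5000)   # mean approx. 1250, median approx. 710
post.stats(10000)  # both increase markedly -- the sensitivity problem
The same enumeration with w <- 1/Ns^2 (likelihood times the Jeffreys prior below) reproduces the results of the next model.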
The next model revision uses Jeffreys’ suggestion of \(p(N) \propto 1/N\), which can be made proper with a known \(M\):
modelString = "
model {
y ~ dcat(p[]) # Likelihood
for(i in 1:M) {
p[i] <- step(N - i+0.01)/N # only values <= N have positive probability
}
N ~ dcat(p.jeffrey[]) # N is itself a random variable (what we want to know)
for(i in 1:M) {
reciprocal[i] <- 1/i
p.jeffrey[i] <- reciprocal[i]/sum.reciprocals
}
sum.reciprocals <- sum(reciprocal[]) # used to normalize p.jeffrey[]
}
"
data.list = list(
M=5000, y=100 # let's try an upper bound of 5k buses
)
run.model(modelString, samples=c("N"), data=data.list, chainLength=5000)
## model is syntactically correct
## data loaded
## model compiled
## initial values generated, model initialized
## 500 updates took 0 s
## monitor set for variable 'N'
## 5000 updates took 0 s
samplesStats(c("N"))
## mean sd MC_error val2.5pc median val97.5pc start sample
## N 401.7 599.3 17.72 102 195 2272 501 5000
Let’s compare with \(M=10000\):
data.list = list(
M=10000, y=100 # let's try an upper bound of 10k buses
)
run.model(modelString, samples=c("N"), data=data.list, chainLength=5000)
## model is syntactically correct
## data loaded
## model compiled
## initial values generated, model initialized
## 500 updates took 0 s
## monitor set for variable 'N'
## 5000 updates took 0 s
samplesStats(c("N"))
## mean sd MC_error val2.5pc median val97.5pc start sample
## N 408.2 758.1 27.44 102 192 2238 501 5000
Now increasing \(M\) has little effect on the estimates, both for the mean and the median!
Please continue to part 2