Let’s repeat the essential code here:
library(BRugs)
## Welcome to BRugs connected to OpenBUGS version 3.2.3
run.model <- function(model, samples, data=list(), chainLength=10000, burnin=0.10,
init.func, n.chains=1, thin=1) {
writeLines(model, con="model.txt") # Write the modelString to a file
modelCheck( "model.txt" ) # Send the model to BUGS, which checks the model syntax
if (length(data)>0) # If there's any data available...
modelData(bugsData(data)) # ... BRugs puts it into a file and ships it to BUGS
modelCompile(n.chains) # BRugs command tells BUGS to compile the model
if (missing(init.func)) {
modelGenInits() # BRugs command tells BUGS to randomly initialize a chain
} else {
for (chain in 1:n.chains) { # otherwise use user's init data
modelInits(bugsInits(init.func))
}
}
modelUpdate(chainLength*burnin) # Burn-in period to be discarded
samplesSet(samples) # BRugs tells BUGS to keep a record of the sampled values
samplesSetThin(thin) # Set thinning
modelUpdate(chainLength) # BRugs command tells BUGS to randomly initialize a chain
}
For ploting with error bars:
plot.errors <- function(x, y, e) {
plot(x, y, pch=19,ylim=c(min(y)*0.75,max(y)*1.2))
segments(x, y-e,x, y+e)
width.bar <- mean(e)/10
segments(x-width.bar,y-e,x+width.bar,y-e)
segments(x-width.bar,y+e,x+width.bar,y+e)
}
Given a data set \(D = (x_1,y_1), \ldots, (x_N,y_N)\) where \(x \in \mathbb{R}^d, y \in \mathbb{R}\), a Bayesian Linear Regression models the problem in the following way:
Prior: \[w \sim \mathcal{N}(0, \sigma_w^2 I_d)\]
\(w\) is vector \((w_1, \ldots, w_d)^T\), so the previous distribution is a multivariate Gaussian; and \(I_d\) is the \(d\times d\) identity matrix.
Likelihood: \[Y_i \sim \mathcal{N}(w^T x_i, \sigma^2)\]
We assume that \(Y_i \perp Y_j | w, i \neq j\)
For now we’ll use the precision instead of the variance, \(a = 1/\sigma^2\), and \(b = 1/\sigma_w^2\). We’ll also assume that \(a,b\) are known.
The prior can be stated as \[p(w) \propto \exp \Big\{ -\frac{b}{2} w^t w \Big\}\]
And the likelihood \[p(D|w) \propto \exp \Big\{ -\frac{a}{2} (y-Aw)^T (y-Aw) \Big\}\]
where \(y = (y_1,\ldots,y_N)^T\) and \(A\) is a \(n\times d\) matrix where the i-th row is \(x_i^T\).
The the posterior is \[p(w|D) \propto p(D|w) p(w)\]
After many calculations we discover that
\[p(w|D) \sim \mathcal{N}(w | \mu, \Lambda^{-1})\]
where (\(\Lambda\) is the precision matrix)
\[\Lambda = a A^T A + b I_d \] \[\mu = a \Lambda^{-1} A^T y\]
Notice that \(\mu\) is equal to the \(w_{MAP}\) of the regular linear regression, this is because for the Gaussian, the mean is equal to the mode.
Also, we can make some algebra over \(\mu\) and get the following equality (\(\Lambda = aA^TA+bI_d\)):
\[\mu = (A^T A + \frac{b}{a} I_d)^{-1} A^T y\]
and compare with \(w_{MLE}\):
\[w_{MLE} = (A^T A)^{-1} A^T y\]
The extra expression in \(\mu\) corresponds to the prior. This is similar to the expression for the Ridge regression, for the special case when \(\lambda = \frac{b}{a}\). Ridge regression is more general because the technique can choose improper priors (in the Bayesian perspective).
For the predictive posterior distribution:
\[p(y|x,D) = \int p(y|x,D,w) p(w|x,D) dw = \int p(y|x,w) p(w|D) dw\]
it is possible to calculate that
\[y|x,D \sim \mathcal{N}(\mu^Tx, \frac{1}{a} + x^T \Lambda^{-1}x)\]
modelString = "
model {
for (i in 1:5) {
y[i] ~ dnorm(mu[i], tau)
mu[i] <- beta0 + beta1 * (x[i] - mean(x[]))
}
# Jeffreys priors
beta0 ~ dflat()
beta1 ~ dflat()
tau <- 1/sigma2
log(sigma2) <- 2*log.sigma
log.sigma ~ dflat()
}
"
# data
x <- c( 8, 15, 22, 29, 36) # day of measure
y <- c(177, 236, 285, 350, 376) # weight in grams
data.list = list(
x = x,
y = y
)
# initializations
n.chains <- 1
log.sigmas <- c(0)
betas0 <- c(0)
betas1 <- c(0)
genInitFactory <- function() {
i <- 0
function() {
i <<- i + 1
list(
log.sigma = log.sigmas[i],
beta0 = betas0[i],
beta1 = betas1[i]
)
}
}
run.model(modelString, samples=c("beta0", "beta1", "sigma2"), data=data.list, chainLength=15000,
init.func=genInitFactory(), n.chains=n.chains)
## model is syntactically correct
## data loaded
## model compiled
## Initializing chain 1:
## model is initialized
## 1500 updates took 0 s
## monitor set for variable 'beta0'
## monitor set for variable 'beta1'
## monitor set for variable 'sigma2'
## 15000 updates took 0 s
samplesStats(c("beta0", "beta1", "sigma2"))
## mean sd MC_error val2.5pc median val97.5pc start sample
## beta0 284.900 8.2160 0.056790 270.100 284.900 300.300 1501 15000
## beta1 7.311 0.8012 0.006087 5.845 7.308 8.783 1501 15000
## sigma2 326.100 1015.0000 37.170000 37.950 144.300 1570.000 1501 15000
# Extract chain values:
beta0 <- samplesSample( "beta0" )
beta1 <- samplesSample( "beta1" )
sigma2 <- samplesSample( "sigma2" )
# Posterior prediction [from Kruschke - Doing Bayesian Data Analysis (2010)]
# Specify x values for which predicted y's are needed:
xPostPred <- seq( min(x)-5 , max(x)+5 , length=100 ) # just make a bunch of them
# Define matrix for recording posterior predicted y values at each x value.
# One row per x value, with each row holding random predicted y values.
postSampSize <- length(beta0)
yPostPred <- matrix( 0 , nrow=length(xPostPred) , ncol=postSampSize )
# Define matrix for recording HDI limits of posterior predicted y values:
yHDIlim <- matrix( 0 , nrow=length(xPostPred) , ncol=2 )
# Generate posterior predicted y values.
# This gets only one y value, at each x, for each step in the chain.
xM <- mean(xPostPred)
# generate values according to the model specified in BUGS:
# y[i] ~ dnorm(mu[i], tau)
# mu[i] <- beta0 + beta1 * (x[i] - mean(x[]))
for ( chainIdx in 1:postSampSize ) {
yPostPred[,chainIdx] <- rnorm( length(xPostPred) , # rnorm(n, mean, sd)
beta0[chainIdx] + beta1[chainIdx] * (xPostPred - xM),
sqrt(sigma2) )
}
source("HDIofMCMC.R") # call Kruschke's Highest Density Interval (HDI) script
for ( xIdx in 1:length(xPostPred) ) { # get 95% HDI for each predicted x
yHDIlim[xIdx,] <- HDIofMCMC( yPostPred[xIdx,] )
}
head(yHDIlim)
## [,1] [,2]
## [1,] 106.8501 186.2428
## [2,] 108.1011 188.3614
## [3,] 111.7592 190.7913
## [4,] 116.8344 193.5995
## [5,] 117.2369 195.2489
## [6,] 120.2614 197.5548
# Display data with HDIs of posterior predictions.
plot( x , y, xlim=c(8,36) , ylim=c(100,500), type="n", ylab=expression(hat(y)),
main="Data with 95% HDI & Mean of Posterior Predictions")
polygon(c(xPostPred,rev(xPostPred)), c(yHDIlim[,1],rev(yHDIlim[,2])), col="lightgray") # HDI's
points(x,y, pch=19)
lines( xPostPred , apply(yPostPred,1,mean) , col="red" ) # the linear regression
Checking the analytical solution vs. the simulated one:
a <- 1/var(x)
b <- 1/20
A <- matrix(x, ncol=1) # d=1, ie, 1D samples
Lambda <- a * t(A) %*% A + b * diag(ncol(A))
mu <- a * solve(Lambda) %*% t(A) %*% y
i <- 31 # select one of the estimated values
hist(yPostPred[i,], prob=T, breaks=100) # plot the estimated histogram
new.x <- xPostPred[i]
new.x
## [1] 14.51515
ys <- seq(100,300,1)
lines(ys, dnorm(ys, t(mu)%*%new.x, sqrt(1/a+t(new.x)%*%solve(Lambda)%*%new.x)) , lwd=2, col="red")
TODO: this does not fit :-(
We got this data somehow (in fact, from here).
x <- c(0, 3, 9, 14, 15, 19, 20, 21, 30, 35, 40, 41, 42, 43, 54, 56, 67, 69, 72, 88)
y <- c(33, 68, 34, 34, 37, 71, 37, 44, 48, 49, 53, 49, 50, 48, 56, 60, 61, 63, 44, 71)
e <- c(3.6, 3.9, 2.6, 3.4, 3.8, 3.8, 2.2, 2.1, 2.3, 3.8, 2.2, 2.8, 3.9, 3.1, 3.4, 2.6, 3.4, 3.7, 2.0, 3.5) # error of y
plot.errors(x,y,e)
We wish to model this using a linear model:
\[\hat{y}(x|\theta) = \theta_0 + \theta_1 x\]
and assume that the likelihood for each point is modelled by a Gaussian:
\[p(x_i,y_i,e_i | \theta) \propto \exp \Big\{ -\frac{1}{2e_i^2} (y - \hat{y}(x|\theta))^2 \Big\}\]
The traditional linear regression gives:
fit <- lm(y~x, data=data.frame(x=x,y=y))
plot.errors(x,y,e)
abline(fit,col="red",lwd=2) # showing the linear fit
points(x[c(2,6,19)],y[c(2,6,19)],col="red",pch=19) # showing the outliers
which does not seem right at all! This is because of the three obvious outliers (above in red) that influence the result.
We can hack a bit and use the Huber loss function, which is useful to deal with outliers in a classic statistics setting, providing a robust linear regression:
# a: residuals, ie, y - hat.y
huber <- function(a, delta) {
ifelse(abs(a)<delta, a^2/2, delta*(abs(a)-delta/2)) # ifelse is a vectorized conditional
}
huber.loss <- function(theta, x=x, y=y, e=e, delta=3) {
sum( huber((y - theta[1] - theta[2]*x)/e, delta) )
}
fit.huber <- optim(par=c(0,0), fn=huber.loss, x=x, y=y, e=e) # find best values using optimization
plot.errors(x,y,e)
abline(fit,col="lightgrey",) # showing the linear fit (in grey)
abline(fit.huber$par[1],fit.huber$par[2], col="red",lwd=2) # showing the robust fit
which is way better. However the Huber loss function, and the choice of its parameter value (set here to \(3\)), are somewhat hacks, ad hoc tools to attack the outlier problem.
In Bayesian terms acknowledging the outliers exist, we should modify the model in order to account them.
Now the likelihood will be the following:
\[p(x_i,y_i,e_i | \theta, g_i,\sigma_B) = \frac{g_i}{\sqrt{2\pi e_i^2}} \exp \Big\{ -\frac{1}{2e_i^2} (y - \hat{y}(x|\theta))^2 \Big\} + \frac{1-g_i}{\sqrt{2\pi \sigma_B^2}} \exp \Big\{ -\frac{1}{2 \sigma_B^2} (y - \hat{y}(x|\theta))^2 \Big\}\]
herein, parameter \(g_i=0\) means that data point \(x_i\) is an outlier, while \(g=1\) means that \(x_i\) is not. If the i-th point is an outlier the likelihhod will use a Gaussian of variance \(\sigma_B\) that might be considered an extra nuissance parameter, or set to a high value (we’ll choose value \(50\)). Our model has now 22 parameters instead of the initial two (\(\theta_0\) and \(\theta_1\)).
Since BUGS cannot sample from an arbitrary distribution, we can use the zeros trick to plug the likelihood directly:
modelString = "
model {
for (i in 1:n) {
phi[i] <- -log( (g[i]/sqrt(2*pi*pow(e[i],2))) * exp(-0.5*pow(y[i]-mu[i],2)/pow(e[i],2)) + ((1-g[i])/sqrt(2*pi*pow(sigmaB,2))) * exp(-0.5*pow(y[i]-mu[i],2)/pow(sigmaB,2)) ) + C
dummy[i] <- 0
dummy[i] ~ dpois( phi[i] )
mu[i] <- theta0 + theta1 * x[i]
g[i] ~ dunif(0,1)
}
theta0 ~ dflat()
theta1 ~ dflat()
C <- 10000 # for the zero's trick
pi <- 3.14159
}
"
# data
data.list = list(
x = x,
y = y,
e = e,
n = length(x),
sigmaB = 50
)
# initializations
n.chains <- 1
theta0 <- c(0)
theta1 <- c(0)
g <- rep(0.01,length(x))
genInitFactory <- function() {
i <- 0
function() {
i <<- i + 1
list(
theta0 = theta0[i],
theta1 = theta1[i],
g = g
)
}
}
run.model(modelString, samples=c("theta0", "theta1", "g"), data=data.list,
chainLength=25000, init.func=genInitFactory(), n.chains=n.chains)
## model is syntactically correct
## data loaded
## model compiled
## Initializing chain 1:
## model is initialized
## 2500 updates took 0 s
## monitor set for variable 'theta0'
## monitor set for variable 'theta1'
## monitor set for variable 'g'
## 25000 updates took 2 s
samplesStats(c("theta0", "theta1", "g"))
## mean sd MC_error val2.5pc median val97.5pc start sample
## theta0 31.3000 1.63900 0.0295200 28.11000 31.3000 34.5300 2501 25000
## theta1 0.4691 0.03808 0.0006846 0.39430 0.4690 0.5443 2501 25000
## g[1] 0.6410 0.25290 0.0019210 0.10000 0.6821 0.9862 2501 25000
## g[2] 0.3323 0.23560 0.0019900 0.01169 0.2920 0.8410 2501 25000
## g[3] 0.6454 0.25100 0.0021120 0.10510 0.6887 0.9873 2501 25000
## g[4] 0.6259 0.26080 0.0019830 0.07859 0.6691 0.9861 2501 25000
## g[5] 0.6428 0.25060 0.0019230 0.10660 0.6839 0.9868 2501 25000
## g[6] 0.3309 0.23590 0.0018550 0.01264 0.2898 0.8412 2501 25000
## g[7] 0.6030 0.26970 0.0021080 0.06083 0.6436 0.9857 2501 25000
## g[8] 0.6251 0.25990 0.0020680 0.08289 0.6666 0.9860 2501 25000
## g[9] 0.6355 0.25490 0.0018990 0.09566 0.6808 0.9864 2501 25000
## g[10] 0.6408 0.25150 0.0020230 0.10290 0.6826 0.9851 2501 25000
## g[11] 0.6269 0.25850 0.0020520 0.08543 0.6678 0.9859 2501 25000
## g[12] 0.6447 0.25020 0.0020680 0.11000 0.6864 0.9879 2501 25000
## g[13] 0.6407 0.25150 0.0020320 0.10280 0.6832 0.9852 2501 25000
## g[14] 0.6280 0.25930 0.0020200 0.08421 0.6697 0.9853 2501 25000
## g[15] 0.6410 0.24980 0.0018970 0.10880 0.6814 0.9864 2501 25000
## g[16] 0.6396 0.25300 0.0021610 0.10210 0.6811 0.9871 2501 25000
## g[17] 0.6395 0.25200 0.0021170 0.10650 0.6781 0.9869 2501 25000
## g[18] 0.6387 0.25220 0.0022050 0.10650 0.6804 0.9860 2501 25000
## g[19] 0.3346 0.23480 0.0019490 0.01284 0.2962 0.8390 2501 25000
## g[20] 0.6368 0.25400 0.0021190 0.09704 0.6780 0.9856 2501 25000
theta0.hat <- mean(samplesSample("theta0"))
theta1.hat <- mean(samplesSample("theta1"))
plot.errors(x,y,e)
abline(fit,col="lightgrey",) # showing the linear fit (in grey)
abline(fit.huber$par[1],fit.huber$par[2], col="grey",lwd=2) # showing the robust fit (in solid grey)
abline(theta0.hat, theta1.hat, col="red",lwd=2)
# define outliers as those which g[i] is less than 0.5
posterior.g <- rep(NA,length(g))
for(i in 1:length(g)) {
posterior.g[i] <- mean(samplesSample(paste0("g[",i,"]")))
}
outliers <- which(posterior.g<0.5)
# plot outliers
points(x[outliers], y[outliers], col="red", pch=12)
If we do not care of outlier accounting, we can just use a more heavy-tailed distribution to prevent the outlier influence. The next model uses a t distribution for that effect:
modelString = "
model {
for (i in 1:n) {
tau[i] <- 1/pow(e[i],2)
y[i] ~ dt(mu[i], tau[i], 4)
mu[i] <- theta0 + theta1 * x[i]
}
theta0 ~ dflat()
theta1 ~ dflat()
}
"
data.list = list(
x = x,
y = y,
e = e,
n = length(x)
)
# initializations
n.chains <- 1
theta0 <- c(0)
theta1 <- c(0)
genInitFactory <- function() {
i <- 0
function() {
i <<- i + 1
list(
theta0 = theta0[i],
theta1 = theta1[i]
)
}
}
run.model(modelString, samples=c("theta0", "theta1"), data=data.list, chainLength=15000,
init.func=genInitFactory(), n.chains=n.chains)
## model is syntactically correct
## data loaded
## model compiled
## Initializing chain 1:
## model is initialized
## 1500 updates took 0 s
## monitor set for variable 'theta0'
## monitor set for variable 'theta1'
## 15000 updates took 0 s
samplesStats(c("theta0", "theta1"))
## mean sd MC_error val2.5pc median val97.5pc start sample
## theta0 32.1600 1.70100 0.045930 28.9400 32.1400 35.580 1501 15000
## theta1 0.4451 0.03785 0.001012 0.3689 0.4454 0.518 1501 15000
theta0.hat <- mean(samplesSample("theta0"))
theta1.hat <- mean(samplesSample("theta1"))
plot.errors(x,y,e)
abline(fit,col="lightgrey",) # showing the linear fit (in grey)
abline(fit.huber$par[1],fit.huber$par[2], col="grey",lwd=2) # showing the robust fit (in solid grey)
abline(theta0.hat, theta1.hat, col="red",lwd=2)
For this eg we’ll use the Gelfand’s dataset of dugongs (sea cows).
The goal is to estimated the size \(y_i\) given the age \(x_i\).
The model is the following:
\[Y_i = \alpha - \beta \gamma^{x_i} + \epsilon_i, i = 1\ldots,n\]
where \(\alpha,\beta>0, 0 \leq \gamma \leq 1\), and the noise \(\epsilon_i\) are iid with \(\mathcal{N}(0,\sigma^2)\)
The interpretation is the following:
\(\alpha\) is the size of a fully grown adult (\(x \rightarrow \infty\))
\(\alpha-\beta\) is the dugong’s size at birth
\(\gamma\) is the growth rate (with an initial steep growth, and ending with almost linear growth)
The priors will be:
flat for \(\alpha\) and \(\beta\)
uniform from \(.01\) to \(100\) for \(sigma\)
uniform from \(.5\) to \(1\) for \(\gamma\) (not easy to estimate)
modelString = "
model {
for( i in 1:N ) {
y[i] ~ dnorm(mu[i], tau)
mu[i] <- alpha - beta * pow(gamma, x[i])
}
alpha ~ dflat()
beta ~ dflat()
gamma ~ dunif(0.5, 1.0)
sigma ~ dunif(0.01, 100)
tau <- 1/(sigma*sigma)
}
"
x = c( 1.0, 1.5, 1.5, 1.5, 2.5, 4.0, 5.0, 5.0, 7.0,
8.0, 8.5, 9.0, 9.5, 9.5, 10.0, 12.0, 12.0, 13.0,
13.0, 14.5, 15.5, 15.5, 16.5, 17.0, 22.5, 29.0, 31.5)
y = c(1.80, 1.85, 1.87, 1.77, 2.02, 2.27, 2.15, 2.26, 2.47,
2.19, 2.26, 2.40, 2.39, 2.41, 2.50, 2.32, 2.32, 2.43,
2.47, 2.56, 2.65, 2.47, 2.64, 2.56, 2.70, 2.72, 2.57)
data.list = list(
x = x,
y = y,
N = length(x))
# initializations
n.chains <- 3
alpha <- c(1,10,100)
beta <- c(1,10,100)
gamma <- c(.9,.7,.5)
sigma <- c(1,10,100)
genInitFactory <- function() {
i <- 0
function() {
i <<- i + 1
list(
alpha = alpha[i],
beta = beta[i],
gamma = gamma[i],
sigma = sigma[i]
)
}
}
run.model(modelString, samples=c("alpha", "beta", "gamma", "sigma"), data=data.list,
chainLength=50000, init.func=genInitFactory(), n.chains=n.chains)
## model is syntactically correct
## data loaded
## model compiled
## Initializing chain 1:
## initial values loaded and chain initialized but another chain contain uninitialized variables
## Initializing chain 2:
## initial values loaded and chain initialized but another chain contain uninitialized variables
## Initializing chain 3:
## model is initialized
## 5000 updates took 3 s
## monitor set for variable 'alpha'
## monitor set for variable 'beta'
## monitor set for variable 'gamma'
## monitor set for variable 'sigma'
## 50000 updates took 31 s
samplesStats(c("alpha", "beta", "gamma", "sigma"))
## mean sd MC_error val2.5pc median val97.5pc start sample
## alpha 2.6520 0.07366 9.429e-04 2.5280 2.64600 2.8160 5001 150000
## beta 0.9739 0.07879 5.028e-04 0.8230 0.97240 1.1330 5001 150000
## gamma 0.8619 0.03379 3.910e-04 0.7839 0.86570 0.9163 5001 150000
## sigma 0.1009 0.01572 6.209e-05 0.0758 0.09904 0.1370 5001 150000
So, let’s plot the results:
alpha.hat <- mean(samplesSample("alpha"))
beta.hat <- mean(samplesSample("beta"))
gamma.hat <- mean(samplesSample("gamma"))
plot(x,y,pch=19,xlab="Age (in years)", ylab="Size (in meters)",main="Dugong dataset")
curve(alpha.hat - beta.hat * gamma.hat^x, from=0, to=35, col="red", lwd=2, add=T) # plot estimation
# draw 2.5% and 97.5% growth curves
alpha2.5 <- samplesStats(c("alpha"))$val2.5pc
beta2.5 <- samplesStats(c("beta"))$val2.5pc
gamma2.5 <- samplesStats(c("gamma"))$val2.5pc
alpha97.5 <- samplesStats(c("alpha"))$val97.5pc
beta97.5 <- samplesStats(c("beta"))$val97.5pc
gamma97.5 <- samplesStats(c("gamma"))$val97.5pc
curve(alpha97.5 - beta97.5 * gamma97.5^x, from=0, to=35, col="grey", lwd=2, lty=2, add=T)
curve(alpha2.5 - beta2.5 * gamma2.5^x, from=0, to=35, col="grey", lwd=2, lty=2, add=T)
Linear Regression predicts that the expected value of the dependent/response variable \(y\) is a linear combination of the independent/observed variables \(x_i\). This is appropriate when the response variable is modelled by a normal distribution. If this does not happen, linear regression is not suitable.
If, for eg, the response \(y\) is an exponential combination of \(x\), then it is its logarithm that has a linear combination. This is called a log-linear model.
Another eg is if \(y\) is a boolean response. Then, say, if an increase of \(x\) increases the probability of \(y=1\), then the response might be a linear combination of the odds or the logarithm of the odds. This last eg is called a log-odds (aks logit) model.
Generalized Linear Modelling (GLM) is a generalization of linear regression in the sense that it allows \(y\) to have an arbitrary distribution, and that a certain function \(g\) of \(y\) to vary linearly with \(x\). The function \(g\) is called the link function.
For the previous exponential eg \(y\) can be modelled by a Poisson and \(g(x)=log(x)\). For the 2nd eg, \(y\) might be modelled by a Bernoulli while the link function is \(g(x)=logit(x)=log(x/(1-x))\)
The GLM assumes \(Y\) has distribution from the exponential family, like a Bernoulli, Binomial, Poisson, Exponential, Laplace, Normal, Gamma, …
\[E[Y_i] = g^{-1}(\beta_0 + \sum_k \beta_k x_{ki})\]
For GLM to work there’s the need of a distribution for \(Y\), a link function \(g\), and the linear predictor which can be found using several techniques, herein we’ll use BUGS (surprise!).
Assume we have a binary version of the previous dugong dataset, considering only a criteria to decide if a dugong is full grown (we no longer have access to \(y_i\)). In this case, our new response \(Z\) is defined as \(1\) if \(y>2.4\), or \(0\) otherwise:
z <- as.numeric(y>2.4)
z
## [1] 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1 1 1 1 1 1 1 1 1 1
To apply GLM (cf. section above) we choose:
\(Y \sim\) Bernoulli
The logit link function
BUGS to find the linear predictor, ie, the values of \(\beta\)
A logistic model for \(p_i = P(Z_i=1|X_i)\) is
\[logit(p_i) = \log(\frac{p_i}{1-p_i}) = \beta_0 + \beta_1 \log(x_i)\]
modelString = "
model {
for( i in 1:N) {
z[i] ~ dbern( p[i] )
# necessary to subtract the mean (ie, center x) to prevent MCMC convergence issues
logit(p[i]) <- beta0 + beta1 * (logAge[i] - mean(logAge[])) # logistic model
logAge[i] <- log(x[i])
}
beta0 ~ dflat()
beta1 ~ dflat()
}
"
data.list = list(
x = x,
z = z,
N = length(x))
# initializations
beta0 <- c(0)
beta1 <- c(0)
genInitFactory <- function() {
i <- 0
function() {
i <<- i + 1
list(
beta0 = beta0[i],
beta1 = beta1[i]
)
}
}
run.model(modelString, samples=c("beta0", "beta1"), data=data.list,
chainLength=50000, init.func=genInitFactory(), n.chains=1)
## model is syntactically correct
## data loaded
## model compiled
## Initializing chain 1:
## model is initialized
## 5000 updates took 0 s
## monitor set for variable 'beta0'
## monitor set for variable 'beta1'
## 50000 updates took 1 s
samplesStats(c("beta0", "beta1"))
## mean sd MC_error val2.5pc median val97.5pc start sample
## beta0 -1.505 0.961 0.01582 -3.704 -1.420 0.1096 5001 50000
## beta1 6.154 2.482 0.04349 2.358 5.811 12.1300 5001 50000
Plotting the results:
beta0.hat <- mean(samplesSample("beta0"))
beta1.hat <- mean(samplesSample("beta1"))
plot(x,z,pch=19,xlab="Age (in years)", ylab="Adult", main="Dugong dataset")
# draw the estimated values for E[Y_i] (it is a logistic curve)
xs <- seq(0.1,32,.1)
exp.logit <- exp( beta0.hat + beta1.hat*(log(xs) - mean(log(x))) )
lines(xs,exp.logit/(1+exp.logit), col="red", lwd=2) # plot estimation
For instance, what is the probability that a dugong with \(8\) years is full grown?
x.new <- 8
exp.logit.new <- exp( beta0.hat + beta1.hat*(log(x.new) - mean(log(x))) )
exp.logit.new/(1+exp.logit.new)
## [1] 0.1962679
In Poisson Regression, usually used to express number of successes within a fixed time interval, we choose:
\(Y_i \sim\) Poisson(\(\lambda_i\))
The log link function
BUGS to find the linear predictor, ie, the values of \(\beta\)
A log-linear model for \(\lambda\) is
\[\log(\lambda_i) = \beta_0 + \beta_1 x_i\]
set.seed(101)
n <- 20
x <- runif(n,0,30)
beta.0 <- -4
beta.1 <- 0.3
y <- exp( beta.0 + beta.1*x + rpois(n,1.0) )
plot(x,y,pch=19)
xs <- seq(.1,30,len=101)
lines(xs,exp(beta.0 + beta.1*xs),lwd=2,col="blue")
modelString = "
model {
for( i in 1:N) {
y[i] ~ dpois( lambda[i] )
log(lambda[i]) <- beta0 + beta1 * x[i]
}
beta0 ~ dnorm(0,0.0001)
beta1 ~ dnorm(0,0.0001)
}
"
data.list = list(
x = x,
y = y,
N = n)
# initializations
beta0 <- c(0)
beta1 <- c(1)
genInitFactory <- function() {
i <- 0
function() {
i <<- i + 1
list(
beta0 = beta0[i],
beta1 = beta1[i]
)
}
}
run.model(modelString, samples=c("beta0", "beta1"), data=data.list,
chainLength=250000, init.func=genInitFactory(), n.chains=1)
## model is syntactically correct
## data loaded
## model compiled
## Initializing chain 1:
## model is initialized
## 25000 updates took 0 s
## monitor set for variable 'beta0'
## monitor set for variable 'beta1'
## 250000 updates took 5 s
samplesStats(c("beta0", "beta1"))
## mean sd MC_error val2.5pc median val97.5pc start sample
## beta0 -2.2020 0.30540 0.0025220 -2.8150 -2.1960 -1.6150 25001 250000
## beta1 0.2465 0.01261 0.0001021 0.2222 0.2463 0.2718 25001 250000
beta0.hat <- mean(samplesSample("beta0"))
beta1.hat <- mean(samplesSample("beta1"))
plot(x,y,pch=19)
xs <- seq(.1,30,len=101)
lines(xs,exp(beta.0 + beta.1*xs),lwd=2,col="blue") # target function
lines(xs,exp(beta0.hat + beta1.hat*xs),lwd=2,col="red") # proposed function
Let’s say we have this dataset (taken from Bayesian Methods for Hackers):
samples <- read.csv("mixture_data.csv", header=F)[,1]
hist(samples, breaks=40)
It appears the data has a bimodal form, one peak around 120 and another at 200.
Our model proposes that the dataset has two clusters of data, each produced by a normal distribution. The construction was:
For each data point \(y_i\):
choose distribution 1 with probability \(p\), or distribution 2 with probability \(1-p\)
Draw one random sample from the chosen distribution \(\mathcal{N}(\mu_k, \sigma_k)\), \(k=1,2\)
So, our model is
\[G_i \sim \text{Binomial}(p)\]
\[y_i \sim \mathcal{N}(\mu_{G_i}, \sigma_{G_i})\]
\[\mu_k \sim \mathcal{N}(0,1000)\]
\[\sigma_k \sim \text{Gamma}(0.01, 0.01)\]
\(G_i\) means the cluster of the i-th datapoint; in this problem \(G_i = \{1,2\}\). In the following model, it is used dcat
instead of a binomial, just because this way it can also work with more than two clusters.
modelString = "
# BUGS model specification begins ...
model {
for( i in 1 : N ) {
y[i] ~ dnorm(mu[i], tau[i]) # likelihood
mu[i] <- lambda[G[i]] # prior for mean
tau[i] <- lambdaTau[G[i]] # prior for precision
G[i] ~ dcat(P[]) # the cluster attributions for each y_i
}
P[1:2] ~ ddirch(alpha[]) # dirichlet distribution (in this case just for 2 clusters)
alpha[1] <- 0.5 # It generalizes the beta (with K=2 we could have used the beta), and
alpha[2] <- 0.5 # is the conjugate for the categorical distribution
lambda[1] ~ dnorm(0.0, 1.0E-6) # hyperparameters for mean
lambda[2] <- lambda[1] + theta
theta ~ dnorm(0.0, 1.0E-6)I(0.0, )
lambdaTau[1] ~ dgamma(0.01,0.01) # hyperparameters for precision/standard deviation
lambdaTau[2] ~ dgamma(0.01,0.01)
sigma[1] <- 1 / sqrt(lambdaTau[1])
sigma[2] <- 1 / sqrt(lambdaTau[2])
}
# ... BUGS model specification ends.
" # close quote to end modelString
There is a trick here. To prevent divergence issues (all datapoints selected to a single cluster), the second mean hyperparameter, \(\lambda_2\), is determined this way: \(\lambda_2 = \lambda_1 + \theta, \theta \gt 0\) (check Bugs Book, pages 280-3).
data.list = list(
y = samples,
N = length(samples),
G = c(1, rep(NA,length(samples)-2), 2) # TODO: do not understand these 1 and 2
)
# let's apply some thinning to the mcmc:
run.model(modelString, samples=c("sigma", "lambda", "P", "G"), data=data.list, chainLength=3e4, thin=4)
## model is syntactically correct
## data loaded
## model compiled
## initial values generated, model initialized
## 3000 updates took 1 s
## monitor set for variable 'sigma'
## monitor set for variable 'lambda'
## monitor set for variable 'P'
## monitor set for variable 'G'
## 30000 updates took 19 s
samplesStats("sigma")
## mean sd MC_error val2.5pc median val97.5pc start sample
## sigma[1] 31.53 5.310 0.25370 23.59 30.61 44.25 3001 7500
## sigma[2] 22.18 2.204 0.08703 17.61 22.23 26.41 3001 7500
samplesStats("lambda")
## mean sd MC_error val2.5pc median val97.5pc start sample
## lambda[1] 122.8 8.485 0.4445 110.4 121.1 144.5 3001 7500
## lambda[2] 200.7 3.047 0.1237 195.0 200.6 206.8 3001 7500
samplesStats("P")
## mean sd MC_error val2.5pc median val97.5pc start sample
## P[1] 0.3975 0.07069 0.003455 0.2898 0.386 0.5789 3001 7500
## P[2] 0.6025 0.07069 0.003455 0.4211 0.614 0.7102 3001 7500
samplesCorrel("sigma[1]", "sigma[2]") # the correlation is negative: one sd larger means the other is smaller
## sigma.2.
## sigma[1] -0.6214
Let’s vizualize the results:
p <- samplesStats("P")$mean[1]
mu1.hat <- samplesStats("lambda")$mean[1]
mu2.hat <- samplesStats("lambda")$mean[2]
sigma1.hat <- samplesStats("sigma")$mean[1]
sigma2.hat <- samplesStats("sigma")$mean[2]
hist(samples, breaks=40, prob=T)
xs <- 0:300
lines(xs, p *dnorm(xs,mu1.hat,sigma1.hat), col="red", lwd=2)
lines(xs,(1-p)*dnorm(xs,mu2.hat,sigma2.hat), col="blue", lwd=2)
prob.cluster.1 <- 2 - samplesStats("G")[,1]
# need to remove the first and last observations until the previous TODO is solved
plot(samples[c(-1,-300)], prob.cluster.1, col = c("blue","red")[1+round(prob.cluster.1)], xlim=c(0,300),
xlab="data point", ylab="probability", main="probability of belonging to first cluster", pch=19)
Let’s also use BRugs graphical tools:
# cannot use samplesHistory("*") because of nodes G[]
samplesHistory("lambda[1]", mfrow = c(1, 1))
samplesDensity("lambda[1]", mfrow = c(1, 1))
samplesAutoC("lambda[1]", mfrow = c(1, 1), 1)