Refs:
http://www.sumsar.net/blog/2014/10/tiny-data-and-the-socks-of-karl-broman/
http://darrenjw.wordpress.com/2013/03/31/introduction-to-approximate-bayesian-computation-abc/
Approximate Bayesian Computation (ABC) generates samples not from the true posterior distribution of interest, but from a distribution that is hoped to be close to it.
Suppose we know the prior \(p(\theta)\) and the likelihood \(p(x|\theta)\) and want to know the posterior \(p(\theta|x)\). We can simulate from the joint distribution \(p(\theta,x)\) by first simulating \(\theta^* \sim p(\theta)\) and then simulating \(x^* \sim p(x|\theta^*)\). We can thus get as many samples \((\theta^*,x^*)\) as we like. If we apply a rejection step that only keeps the values \(\theta^*\) whose simulated \(x^*\) matches the true data \(x\), we end up with a sample from the posterior \(p(\theta|x)\).
The rejection rate can be high, especially if the domain is continuous, so we must define what we mean by \(x^*\) matching \(x\) in order to keep the rejection rate at acceptable levels, i.e., choose some \(\epsilon\) and accept when \(||x-x^*|| \leq \epsilon\), for some metric \(||\cdot||\). Smaller values of \(\epsilon\) approximate the posterior better but increase the rejection rate.
ABC does not work well when \(x\) is high-dimensional: there will be too many rejections. In those cases the best option is to require good agreement between some statistics of \(x\), e.g., forming a lower-dimensional vector of summary statistics (ideally, sufficient statistics) \(s(x)\) and accepting when \(||s(x)-s(x^*)|| \leq \epsilon\).
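To make the generic recipe concrete, here is a minimal ABC rejection sketch in R for a toy problem where the exact posterior is known (the problem, variable names and number of simulations are illustrative choices, not part of the sock analysis): with a Uniform(0,1) prior on a Bernoulli success probability and 7 successes observed in 10 trials, the exact posterior is Beta(8,4).
# Toy ABC rejection sketch (illustrative values only)
n_trials <- 10
x_obs <- 7     # observed number of successes
n_sim <- 1e5
theta_star <- runif(n_sim)                        # simulate from the prior
x_star <- rbinom(n_sim, n_trials, theta_star)     # simulate data given each theta*
abc_post <- theta_star[x_star == x_obs]           # keep theta* whose x* matches x exactly
# For continuous data we would instead accept when ||x - x*|| <= epsilon.
hist(abc_post, freq=FALSE, main="ABC sample vs exact Beta(8,4) posterior", xlab="theta")
curve(dbeta(x, 8, 4), add=TRUE, col="red", lwd=2)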
The algorithm
In its simplest (rejection) form, ABC works as follows:
1. simulate \(\theta^*\) from the prior \(p(\theta)\);
2. simulate a dataset \(x^*\) from the likelihood \(p(x|\theta^*)\);
3. accept \(\theta^*\) if \(x^*\) matches the observed data \(x\) (or if \(||s(x)-s(x^*)|| \leq \epsilon\)).
The problem is this: Given the tiny dataset of eleven unique socks, how many socks does Karl Broman have in his laundry in total?
Steps 1 & 2: simulating from the prior and the generative model
If we knew the parameters (and how many socks were picked), this is how we would generate a sample:
# The parameters (which we do not know, but assume the following values as an example)
n_socks <- 18
n_pairs <- 7
n_odd <- 4 # total socks = 2*n_pairs + n_odd
# The Data
n_picked <- 11
# The Generative Model
socks <- rep(seq_len(n_pairs + n_odd), rep(c(2,1), c(n_pairs,n_odd))) # label socks (same pair, same label)
socks
## [1] 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 9 10 11
pick_socks <- sample(socks, min(n_picked, n_socks)) # Simulate a sock picking
pick_socks
## [1] 2 9 7 7 4 6 1 8 1 11 5
sock_counts <- table(pick_socks)
sock_counts
## pick_socks
## 1 2 4 5 6 7 8 9 11
## 2 1 1 1 1 2 1 1 1
c(unique=sum(sock_counts==1), pairs=sum(sock_counts==2)) # count unique and paired socks
## unique pairs
## 7 2
The generative model above needs adequate prior distributions for its parameters, namely n_socks, n_pairs and n_odd (n_picked is the data and will be used to estimate the parameters' values).
For n_socks, since it is discrete and always positive, a Poisson prior would be a natural choice. But the Poisson has its mean equal to its variance, which seems too restrictive in this context, so we pick the Negative Binomial for extra flexibility.
\[\text{n_socks} \sim \text{NBinom}(\mu,\sigma)\]
In R, negative binomial random generation is implemented by rnbinom(), which is parametrized by mu and size, where \(size = -\mu^2 / (\mu - \sigma^2)\).
Assuming that an average person washes around 30 socks each time, and giving this a comfortable standard deviation of 15, we have:
prior_mu <- 30
prior_sd <- 15
prior_size <- -prior_mu^2 / (prior_mu - prior_sd^2)
n_socks <- rnbinom(10, mu=prior_mu, size=prior_size)
n_socks
## [1] 77 16 26 27 39 25 35 51 36 24
xs <- seq(0,75)
barplot(dnbinom(xs, mu=prior_mu, size=prior_size), names.arg=xs, lwd=2, col="lightgreen",
        main="prior for total number of socks", xlab="total socks", ylab="probability")
For n_pairs and n_odd we will place a prior on the ratio of paired socks, i.e., 2*n_pairs/(2*n_pairs+n_odd), which varies between 0 and 1. We select a Beta distribution for the job:
\[\text{ratio} \sim \text{Beta}(\alpha,\beta)\]
We assume that most socks come in pairs, say a ratio above \(75\%\), so let's concentrate most of the probability mass for the ratio in \([0.75,1]\). We'll use a Beta with a mean of \(87.5\%\) and a standard deviation of \(8\%\):
estBetaParams <- function(mu, var) { # estimate Beta(alpha, beta) from a given mean and variance
  alpha <- ((1-mu)/var - 1/mu)*mu^2
  beta  <- alpha*(1/mu - 1)
  c(alpha, beta)
}
beta_params <- estBetaParams(0.875, 0.08^2)
prior_alpha <- round(beta_params[1],0)
prior_alpha
## [1] 14
prior_beta <- round(beta_params[2],0)
prior_beta
## [1] 2
xs <- seq(0,1,len=101)
plot(xs, dbeta(xs, prior_alpha, prior_beta), lwd=2, col="red", type="l",
     main="prior on the ratio of paired socks", xlab="ratio", ylab="density")
With the model and priors selected, we can write a sampling function:
socks_sample <- function(n_picked) {
  # Generate a sample of prior values
  prior_mu   <- 30
  prior_sd   <- 15
  prior_size <- -prior_mu^2 / (prior_mu - prior_sd^2)
  n_socks    <- rnbinom(1, mu=prior_mu, size=prior_size)
  prior_alpha <- 14
  prior_beta  <- 2
  prop_pairs  <- rbeta(1, prior_alpha, prior_beta)
  n_pairs     <- round(prop_pairs * floor(n_socks/2), 0)
  n_odd       <- n_socks - 2*n_pairs
  # Simulate picking socks
  socks       <- rep( seq_len(n_pairs + n_odd), rep(c(2,1), c(n_pairs,n_odd)) )
  pick_socks  <- sample(socks, min(n_picked, n_socks))
  sock_counts <- table(pick_socks)
  # Return the parameter values and the counts of unique and paired socks
  c(unique = sum(sock_counts==1),
    pairs  = sum(sock_counts==2),
    n_socks = n_socks,
    n_pairs = n_pairs,
    n_odd = n_odd,
    prop_pairs = prop_pairs
  )
}
# let's get a couple of samples to see it working:
socks_sample(11)
## unique pairs n_socks n_pairs n_odd prop_pairs
## 9.0000000 1.0000000 38.0000000 16.0000000 6.0000000 0.8284114
socks_sample(11)
## unique pairs n_socks n_pairs n_odd prop_pairs
## 1.0000000 5.0000000 11.0000000 5.0000000 1.0000000 0.9108672
So, let's plug in the data, iterate, and get a bunch of samples:
n_picked <- 11 # our (only!) data
sock_sim <- t( replicate(5e4, socks_sample(n_picked)) )
head(sock_sim,10)
## unique pairs n_socks n_pairs n_odd prop_pairs
## [1,] 7 2 20 9 2 0.8566029
## [2,] 9 1 37 16 5 0.9050205
## [3,] 11 0 48 21 6 0.8792152
## [4,] 3 4 12 5 2 0.8168655
## [5,] 1 5 11 5 1 0.9104679
## [6,] 1 2 5 2 1 0.9280682
## [7,] 9 1 47 22 3 0.9419529
## [8,] 5 3 16 8 0 0.9438001
## [9,] 11 0 53 22 9 0.8379752
## [10,] 9 1 27 10 7 0.7801699
We just want the subset of samples where there are 11 unique socks and \(0\) pairs:
post_samples <- sock_sim[sock_sim[,"unique"]==11 & sock_sim[,"pairs"]==0, ] # select samples where x*==x
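Since ABC discards every simulation that does not match the observed summary exactly, it is worth checking the acceptance rate (this check is an addition to the original analysis); if it were very low, we would need more simulations or a looser matching criterion:
nrow(post_samples) / nrow(sock_sim) # fraction of the 50000 simulations that were accepted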
median_hist <- function(x, ...) { # histogram with a dashed vertical line at the median
  hist(x, ...)
  abline(v = median(x), col = "darkred", lty = 2, lwd = 2)
}
median_hist(post_samples[,"n_socks"], breaks=50, xlim = c(0, 90), yaxt = "n", ylab = "",
            xlab = "Number of socks", main = "Posterior on n_socks", col = "lightblue")
median(post_samples[,"n_socks"]) # median guess
## [1] 43
occ <- table(post_samples[,"n_socks"])
as.numeric(row.names(occ)[which.max(occ)]) # mode guess
## [1] 39
The mode is the better estimate if we want to maximize the chance of being exactly right, no matter how far off we may be otherwise; if we want to minimize the expected absolute error, we pick the median. In this case, the median prediction is 43 socks, with median(post_samples[,"n_pairs"]) = 18 pairs and median(post_samples[,"n_odd"]) = 6 singletons.
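We can also look at the full posteriors of the other two parameters with the same helper (an addition; the text above only reports their medians):
median_hist(post_samples[,"n_pairs"], breaks=30, yaxt="n", ylab="",
            xlab="Number of pairs", main="Posterior on n_pairs", col="lightblue")
median_hist(post_samples[,"n_odd"], breaks=30, yaxt="n", ylab="",
            xlab="Number of singletons", main="Posterior on n_odd", col="lightblue")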
The true answer to the sock problem, as Karl Broman later revealed, was 21 pairs plus 3 singletons, i.e., 45 socks in total.
Given these true values, we could ask: what is the probability of getting \(k\) unique socks when drawing from a known collection of \(n_1\) singletons and \(n_2\) pairs?
We can simulate the result:
n1 <- 3
n2 <- 21
k <- 11
sim_socks <- function(n1, n2, n_sim=1e5) {
  socks <- c(rep(1:n2, 2), n2+(1:n1)) # n2 pairs plus n1 singletons
  # position at which the first pair is completed in a random permutation of the socks
  replicate(n_sim, min(which(duplicated(sample(socks)))))
}
sim <- sim_socks(n1, n2)
mean(sim>k) # probability that there is no match among the first k socks
## [1] 0.22712
Or we can compute the exact probability via combinatorics:
The number of ways of drawing \(k\) non-matching socks, \(i\) of which are singletons, is
\[{n_1 \choose i} {n_2 \choose k-i} 2^{k-i}\]
where the factor \(2^{k-i}\) comes from the fact that, for each of the \(k-i\) socks that come from pairs, we can choose either sock of the pair.
So the total number of ways of getting \(k\) unique socks is the sum over all possible values of \(i\), from \(0\) (none of the drawn socks is a singleton) to \(k\) (all drawn socks are singletons), i.e.,
\[\sum_{i=0}^k {n_1 \choose i} {n_2 \choose k-i} 2^{k-i}\]
The total number of ways of drawing \(k\) socks from \(n_1\) singletons and \(n_2\) pairs is
\[{n_1+2n_2 \choose k}\]
So the probability we are looking for is the ratio of these two quantities,
\[\frac{\sum_{i=0}^k {n_1 \choose i} {n_2 \choose k-i} 2^{k-i}}{{n_1+2n_2 \choose k}}\]
exact_socks <- function(n1, n2, k) {
  cases <- 0
  for(i in 0:k)
    cases <- cases + choose(n1,i) * choose(n2,k-i) * 2^(k-i)
  total_cases <- choose(n1+2*n2, k)
  cases / total_cases
}
exact_socks(n1, n2, k)
## [1] 0.2275212