Refs:
Initial examples from Udacity’s “Introduction to AI” course
Some useful rules derived from the Kolmogorov axioms for random variables, in particular the complement, product, and total probability rules used throughout:
\[P(\neg A) = 1 - P(A) \qquad P(A,B) = P(A|B) P(B) \qquad P(A) = \sum_B P(A,B)\]
Notation for Independence between two random variables:
\[A \perp B \iff P(A,B) = P(A) P(B)\]
We define conditional independence as:
\[A \perp B | C \iff P(A,B|C) = P(A|C) P(B|C)\]
A Bayesian network is a probabilistic graphical model (a type of statistical model) that represents a set of random variables and their conditional dependencies via a directed acyclic graph (DAG) (Wikipedia).
E.g.:
Instead of having to find the overall joint probability like:
\[P(A,B,C,D,E) = P(A|B,C,D,E) P(B|C,D,E) P(C|D,E) P(D|E) P(E)\]
The model represented by this graph assumes that the joint probability can be factored by:
\[P(A,B,C,D,E) = P(A) P(B) P(C|A,B) P(D|C) P(E|C)\]
which is much simpler: \(2^5-1=31\) parameters (probability values) for the full joint vs. \(1+1+4+2+2=10\) parameters for the factorization proposed by this Bayes network.
These models will help us infer new probabilities when evidence is given.
But first let’s do some probability calculus around simple Bayes networks.
The classic cancer example, where there are two different tests \(T_1\) and \(T_2\). We draw into the graph the causal relations that our model implies. In this case we assume that having cancer (random variable \(C\)) causes these tests to return a positive result, not the other way around.
This code produces the next diagram. The majority of the following graph scripts will be hidden for presentation purposes.
library(igraph)
cancer.bnet <- graph.formula(C--+T1, C--+T2)
V(cancer.bnet)$size <- 50
V(cancer.bnet)$color <- "white"
E(cancer.bnet)$color <- "black"
# each pair defines the (x,y) of each node
layout <- matrix(c( 5, 5,
                    0, 0,
                   10, 0), ncol=2, byrow=TRUE)
# this layout can be found by an interactive graphing screen that allows you to
# manually move around the vertices and arrange them as you want
# use:
# tkplot(graph.obj)
# (& after the manipulation:) layout <- tkplot.getcoords(1)
plot(cancer.bnet, layout=layout, edge.arrow.size=1, vertex.label.color=1)
Assume that the following probabilities are given (both tests share the same CPT):

+ \(P(C) = 0.01\), the prior of having cancer
+ \(P(+|C) = 0.9\), i.e., \(P(-|C) = 0.1\)
+ \(P(-|\neg C) = 0.8\), i.e., \(P(+|\neg C) = 0.2\)
Let’s compute some useful probabilities (to simplify \(P(+)\) means \(P(T=+)\)):
\(P(+) = P(+|C) P(C) + P(+|\neg C) P(\neg C) = 0.9 \times 0.01 + 0.2 \times 0.99 = 0.207\)
\(P(-) = 1 - P(+) = 0.793\)
\(P(C,+) = P(C) P(+|C) = 0.01 \times 0.9 = 0.009\)
\(P(C,-) = P(C) P(-|C) = 0.01 \times 0.1 = 0.001\)
\(P(\neg C,+) = P(\neg C) P(+|\neg C) = 0.99 \times 0.2 = 0.198\)
\(P(\neg C,-) = P(\neg C) P(-|\neg C) = 0.99 \times 0.8 = 0.792\)
Notice that these last four joint probabilities cover all hypotheses, so they must sum to 1 (which they do: \(0.009 + 0.001 + 0.198 + 0.792 = 1\)).
Also, \(P(+) = P(+,C) + P(+,\neg C) = 0.009 + 0.198 = 0.207\), consistent with the result above.
Let’s apply Bayes’ rule to find \(P(C|+)\) and \(P(C|-)\):
\[P(C|+) = \frac{P(+|C) P(C)}{P(+)} = \frac{0.9 \times 0.01}{0.207} \approx 0.0435\] \[P(C|-) = \frac{P(-|C) P(C)}{P(-)} = \frac{0.1 \times 0.01}{0.793} \approx 0.00126\]
Bayes’ rule can be interpreted as a learning/inference mechanism, i.e., what is the new probability of having cancer after we have seen the evidence, i.e., that the test has been positive. In this interpretation, the probabilities are named as follows:

+ \(P(C)\) is the prior
+ \(P(+|C)\) is the likelihood
+ \(P(+)\) is the evidence
+ \(P(C|+)\) is the posterior
We can also make inferences when we have information about both tests (herein \(+_1+_2\) means \(T_1=+ \land T_2=+\)), namely \(P(C|+_1,+_2)\):
\[P(C|+_1+_2) = \frac{P(+_1+_2|C)P(C)}{P(+_1+_2)} \propto P(+_1+_2|C)P(C) = P(+_1|C)P(+_2|C)P(C) = 0.9 \times 0.9 \times 0.01 = 0.0081\]
Here we defer the computation of \(P(+_1+_2)\), which is a constant; that is why we can work with quantities proportional to the posterior. Note also that the step \(P(+_1+_2|C) = P(+_1|C) P(+_2|C)\) uses the conditional independence of the two tests given \(C\) (more on this below).
If we do the same for the complementary hypothesis: \[P(\neg C|+_1+_2) \propto P(+_1|\neg C) P(+_2|\neg C) P(\neg C) = 0.2 \times 0.2 \times 0.99 = 0.0396\]
Since \(P(C|+_1+_2) + P(\neg C|+_1+_2) = 1\), it’s enough to normalize both results to obtain the true probabilities: \[P(C|+_1+_2) = \frac{0.0081}{0.0081 + 0.0396} \approx 0.17 \qquad P(\neg C|+_1+_2) = \frac{0.0396}{0.0477} \approx 0.83\]
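We can check these numbers with a few lines of base R (a quick sketch; the variable names are ours):
p.C <- 0.01; p.pos.C <- 0.9; p.pos.nC <- 0.2   # the CPT values given above
unnorm <- c(C  = p.pos.C^2  * p.C,             # P(+1,+2|C) P(C)   = 0.0081
            nC = p.pos.nC^2 * (1 - p.C))       # P(+1,+2|~C) P(~C) = 0.0396
unnorm / sum(unnorm)                           # ~ (0.1698, 0.8302)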
The same ideas apply to other queries, e.g.:
Notice that given the cancer result, the two tests do not depend on each other, they are conditionally independent given \(C\), i.e., \(T_1 \perp T_2|C\) which means \(P(+_1|C,+_2) = P(+_1|C)\).
So, \(P(+_1|C,+_2) = P(+_1|C) = 0.9\)
When a variable is known (observed), the respective node in the Bayes net is shaded:
V(cancer.bnet)[1]$color <- "grey"
plot(cancer.bnet, layout=layout, edge.arrow.size=1, vertex.label.color=1)
But if \(C\) is not known, then \(P(+_1|+_2)\) is a different quantity with a different value (the first step below is an application of the total probability rule):
\[P(+_1|+_2) = P(+_1|+_2,C) P(C|+_2) + P(+_1|+_2,\neg C) P(\neg C|+_2) = \ldots = 0.2304\]
This new situation models the event of someone being happy, \(H\), with two possible causes: it is sunny, \(S\), or she got a raise, \(R\). All random variables are boolean (as before).
Our model is:
Our initial data:

+ \(P(S) = 0.7\)
+ \(P(R) = 0.01\)
+ \(P(H|S,R) = 1\)
+ \(P(H|\neg S,R) = 0.9\)
+ \(P(H|S,\neg R) = 0.7\)
+ \(P(H|\neg S,\neg R) = 0.1\)
What value is \(P(R|S)\)? Without extra information, \(R \perp S\), so \(P(R|S) = P(R) = 0.01\).
From these we can compute (marginalizing over the unknown parent in each case):

\(P(H|R) = P(H|S,R) P(S) + P(H|\neg S,R) P(\neg S) = 1 \times 0.7 + 0.9 \times 0.3 = 0.97\)
\(P(H|S) = P(H|S,R) P(R) + P(H|S,\neg R) P(\neg R) = 1 \times 0.01 + 0.7 \times 0.99 = 0.703\)
\(P(H|\neg S) = P(H|\neg S,R) P(R) + P(H|\neg S,\neg R) P(\neg R) = 0.9 \times 0.01 + 0.1 \times 0.99 = 0.108\)
\(P(H) = P(H|S) P(S) + P(H|\neg S) P(\neg S) = 0.703 \times 0.7 + 0.108 \times 0.3 = 0.5245\)
Let’s next compute \(P(R|H)\):
\[P(R|H) = \frac{P(H|R) P(R)}{P(H)} = \frac{0.97 \times 0.01}{0.5245} \approx 0.0185\]
And what about \(P(R|H,S)\)? Herein there’s extra information: besides knowing she is happy, we also know it is sunny.
\[P(R|H,S) = \frac{P(H|R,S) P(R|S)}{P(H|S)} = \frac{1 \times 0.01}{0.703} \approx 0.0142\]
Notice how the hypothesis \(R\), i.e., that she got a raise, fell after we learned \(S\), i.e., that it was sunny (from \(\approx 0.0185\) down to \(\approx 0.0142\)). This is what is called one cause explaining away another cause.
The effect also happens in reverse. If we know \(\neg S\) then
\[P(R|H,\neg S) = \frac{P(H|R,\neg S) P(R|\neg S)}{P(H|\neg S)} = \frac{0.9 \times 0.01}{0.108} \approx 0.0833\]
so the raise hypothesis got stronger (an 8-fold increase over the prior \(P(R) = 0.01\)) to explain the fact that she is happy.
In terms of dependence we have shown that, even though \(R \perp S\), once \(H\) is known a dependence appears:
\[R \cancel{\perp} S | H\]
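These numbers can again be checked in base R (a quick sketch; the variable names are ours):
p.S <- 0.7; p.R <- 0.01
h.sr <- 1; h.nsr <- 0.9; h.snr <- 0.7; h.nsnr <- 0.1  # P(H|S,R), P(H|~S,R), ...
p.H.R  <- h.sr  * p.S + h.nsr  * (1 - p.S)  # P(H|R)  = 0.97
p.H.S  <- h.sr  * p.R + h.snr  * (1 - p.R)  # P(H|S)  = 0.703
p.H.nS <- h.nsr * p.R + h.nsnr * (1 - p.R)  # P(H|~S) = 0.108
p.H    <- p.H.S * p.S + p.H.nS * (1 - p.S)  # P(H)    = 0.5245
c(p.H.R * p.R / p.H,    # P(R|H)    ~ 0.0185
  h.sr  * p.R / p.H.S,  # P(R|H,S)  ~ 0.0142
  h.nsr * p.R / p.H.nS) # P(R|H,~S) ~ 0.0833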
Given a Bayes net \(BN\) and a set of observed nodes, we can infer whether two nodes of \(BN\) are conditionally dependent.
Patterns of dependence (called active triplets):
Here the following dependences occur:
The next graph shows another dependence pattern, namely \(A \cancel{\perp} B | D\):
Patterns of independence (called inactive triplets):
Here the following independences occur:
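In general, the standard d-separation rules are: a chain \(A \rightarrow C \rightarrow B\) and a common cause \(A \leftarrow C \rightarrow B\) are active when \(C\) is unobserved and become inactive once \(C\) is known; a common effect \(A \rightarrow C \leftarrow B\) is inactive when \(C\) is unobserved and becomes active when \(C\) (or one of its descendants) is known, which is the explaining-away situation from the happiness example.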
gRain
refs:

+ http://www.jstatsoft.org/v46/i10/paper
+ http://cran.r-project.org/web/packages/gRain/index.html
gRain is an R package for probability propagation in Bayes networks. Let’s see how to use it by going through several examples.
Let’s first use the cancer Bayes net above:
assuming the previous conditional probability tables (CPTs):
library(gRain)
## Loading required package: gRbase
# if there's a problem with RBGL package do
# source("http://bioconductor.org/biocLite.R")
# biocLite("RBGL")
# the possible values (all nodes are boolean vars)
yn <- c("yes","no")
# specify the CPTs; the values need not sum to 1, gRain normalizes them
# (e.g. c(1,99) becomes (0.01, 0.99))
node.C <- cptable(~ C, values=c(1, 99), levels=yn)
# for a child node the first variable in the formula varies fastest, so c(9,1,2,8)
# reads P(T1=y|C=y)=.9, P(T1=n|C=y)=.1, P(T1=y|C=n)=.2, P(T1=n|C=n)=.8
node.T1 <- cptable(~ T1 + C, values=c(9,1,2,8), levels=yn)
node.T2 <- cptable(~ T2 + C, values=c(9,1,2,8), levels=yn)
# create an intermediate representation of the CPTs
plist <- compileCPT(list(node.C, node.T1, node.T2))
plist
## CPTspec with probabilities:
## P( C )
## P( T1 | C )
## P( T2 | C )
plist$C
## C
## yes no
## 0.01 0.99
plist$T1
## C
## T1 yes no
## yes 0.9 0.2
## no 0.1 0.8
# create network
bn.cancer <- grain(plist)
summary(bn.cancer)
## Independence network: Compiled: FALSE Propagated: FALSE
## Nodes : chr [1:3] "C" "T1" "T2"
This object can be queried:
# The marginal probability for each variable:
querygrain(bn.cancer, nodes=c("C", "T1", "T2"), type="marginal")
## $C
## C
## yes no
## 0.01 0.99
##
## $T1
## T1
## yes no
## 0.207 0.793
##
## $T2
## T2
## yes no
## 0.207 0.793
# The joint probability P(C,T1):
querygrain(bn.cancer, nodes=c("C","T1"), type="joint")
## T1
## C yes no
## yes 0.009 0.001
## no 0.198 0.792
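Dividing the first column by its sum, \(0.009 / 0.207 \approx 0.0435\), recovers the value of \(P(C|+)\) computed by hand above.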
# P(T1=+ | T2=+):
# 1. add evidence to the net
bn.cancer.1 <- setFinding(bn.cancer, nodes=c("T2"), states=c("yes"))
# 2. query the new net
querygrain(bn.cancer.1, nodes=c("T1"))
## $T1
## T1
## yes no
## 0.2304348 0.7695652
# The probability of this evidence:
# print(getFinding(bn.cancer.1))
# The conditional P(not C | not T1)
bn.cancer.2 <- setFinding(bn.cancer, nodes=c("T1"), states=c("no"))
querygrain(bn.cancer.2, nodes=c("C"))
## $C
## C
## yes no
## 0.001261034 0.998738966
Another way to use this package is to build the CPTs from a given dataset:
data("cad1")
head(cad1)
## Sex AngPec AMI QWave QWavecode STcode STchange SuffHeartF
## 1 Male None NotCertain No Usable Usable No No
## 2 Male Atypical NotCertain No Usable Usable No No
## 3 Female None Definite No Usable Usable No No
## 4 Male None NotCertain No Usable Nonusable No No
## 5 Male None NotCertain No Usable Nonusable No No
## 6 Male None NotCertain No Usable Nonusable No No
## Hypertrophi Hyperchol Smoker Inherit Heartfail CAD
## 1 No No No No No No
## 2 No No No No No No
## 3 No No No No No No
## 4 No No No No No No
## 5 No No No No No No
## 6 No No No No No No
# create the DAG; in dag()'s formula syntax, ~ child:parent1:parent2 declares
# a node together with its parents
dag.cad <- dag(~ CAD:Smoker:Inherit:Hyperchol +
               AngPec:CAD +
               Heartfail:CAD +
               QWave:CAD)
library(Rgraphviz) # if not installed, execute the commented lines below at R's command line
## Loading required package: graph
## Loading required package: BiocGenerics
## Loading required package: parallel
## Loading required package: grid
# source("http://bioconductor.org/biocLite.R")
# biocLite("Rgraphviz")
plot(dag.cad)
# smooth is a small positive number to avoid zero entries in the CPTs
# (cf. Additive smoothing, http://en.wikipedia.org/wiki/Additive_smoothing)
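# with smoothing value a, each CPT entry is estimated from the counts as
#   P(x | parents) = (N(x, parents) + a) / (N(parents) + a*k)
# where k is the number of states of x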
bn.cad <- grain(dag.cad, data = cad1, smooth = 0.1)
# Let's ask some questions...
querygrain(bn.cad, nodes=c("CAD", "Smoker"), type="conditional")
## CAD
## Smoker No Yes
## No 0.7172084 0.2827916
## Yes 0.4933771 0.5066229
querygrain(bn.cad, nodes=c("CAD"), type="marginal")
## $CAD
## CAD
## No Yes
## 0.5418012 0.4581988
# ...and add some evidence
bn.cad.1 <- setFinding(bn.cad, nodes=c("Smoker"), states=c("Yes"))
querygrain(bn.cad.1, nodes=c("CAD"), type="marginal")
## $CAD
## CAD
## No Yes
## 0.4933771 0.5066229
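As expected, this matches the Smoker=Yes row of the conditional table \(P(CAD|Smoker)\) computed above.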
For the next section we will use the example presented in the gRain paper referenced above (the classic ‘chest clinic’ network):
yn <- c("yes", "no")
a <- cptable(~ asia, values = c(1, 99), levels = yn)
t.a <- cptable(~ tub + asia, values = c(5, 95, 1, 99), levels = yn)
s <- cptable(~ smoke, values = c(5,5), levels = yn)
l.s <- cptable(~ lung + smoke, values = c(1, 9, 1, 99), levels = yn)
b.s <- cptable(~ bronc + smoke, values = c(6, 4, 3, 7), levels = yn)
x.e <- cptable(~ xray + either, values = c(98, 2, 5, 95), levels = yn)
d.be <- cptable(~ dysp + bronc + either, values = c(9, 1, 7, 3, 8, 2, 1, 9), levels = yn)
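# note: as before, the first variable in the formula varies fastest, so d.be reads
# P(dysp=y|bronc=y,either=y)=.9, P(dysp=n|bronc=y,either=y)=.1,
# P(dysp=y|bronc=n,either=y)=.7, and so on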
e.lt <- ortable(~ either + lung + tub, levels = yn)
bn.gin <- grain(compileCPT(list(a, t.a, s, l.s, b.s, e.lt, x.e, d.be)))
plot(bn.gin)
If the network is too big (not in this case, of course), finding the joint probabilities can take too long. One alternative is using simulation.
In this network the process would produce random values for ‘asia’ and ‘smoke’, propagate those values to their children, and so on, until a new observation is created (i.e., a set of values for all the attributes in the joint distribution). The process is then repeated \(N\) times. If \(N\) is large, the dataset will contain a good approximation of the true joint distribution.
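To make the mechanism concrete, here is a minimal hand-rolled forward sampler for the small cancer net from before (a sketch of the idea, not gRain’s actual implementation):
set.seed(101)
N  <- 5000
C  <- rbinom(N, 1, 0.01)                      # sample the root first
T1 <- rbinom(N, 1, ifelse(C == 1, 0.9, 0.2))  # then each child given its parent
T2 <- rbinom(N, 1, ifelse(C == 1, 0.9, 0.2))
mean(T1)  # should be close to P(T1=+) = 0.207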
Let’s see an example with the previous net:
sim.gin <- simulate(bn.gin, nsim=5000)
head(sim.gin, n=10)
## asia tub smoke lung bronc either xray dysp
## 1 no no no no no no no no
## 2 no no no no no no yes no
## 3 no no no no no no no no
## 4 no no no no no no no no
## 5 no no no no no no no no
## 6 no no no no no no no no
## 7 yes no yes no no no no no
## 8 no no yes no no no no no
## 9 no no no no no no no no
## 10 no no no no no no no no
# say we want to know P(Lung,Bronc):
xtabs(~ lung+bronc, data=sim.gin) / nrow(sim.gin)
## bronc
## lung yes no
## yes 0.0268 0.0256
## no 0.4192 0.5284
# we can also compute the true P(Lung,Bronc) from the bayes net, since we know their true CPTs:
querygrain(bn.gin, nodes = c("lung", "bronc"), type = "joint")
## bronc
## lung yes no
## yes 0.0315 0.0235
## no 0.4185 0.5265
We can also use the package to predict a set of responses from a set of explanatory variables:
new.observation <- data.frame(dysp = "yes",
either = "yes",
tub="no",
asia="no",
xray="yes",
smoke="yes")
predict(bn.gin,
newdata = new.observation,
predictors=c("smoke", "asia", "tub" , "dysp", "xray"),
response = c("lung", "bronc"),
type = "dist")
## $pred
## $pred$lung
## yes no
## [1,] 0.7744796 0.2255204
##
## $pred$bronc
## yes no
## [1,] 0.7181958 0.2818042
##
##
## $pEvidence
## [1] 0.05084759
# note: this gives the most probable value for each response variable separately;
# it does not imply that the joint configuration of these values is the most probable one
# $pEvidence is the probability of the observed evidence (the explanatory variables)
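The same call with type="class" returns just the most likely state of each response variable:
predict(bn.gin,
        newdata = new.observation,
        predictors=c("smoke", "asia", "tub" , "dysp", "xray"),
        response = c("lung", "bronc"),
        type = "class")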
R package bnlearn
provides the capability of inferring the structure of a Bayes net given only a dataset.
Refs:
Nagarajan et al., Bayesian Networks in R (Springer, 2013)
In the next code sample, the dataset alarm has 20,000 observations generated by this Bayes network:
But let’s assume we only know the dataset:
library(bnlearn)
data(alarm)
head(alarm, n=1)
## CVP PCWP HIST TPR BP CO HRBP HREK HRSA PAP SAO2 FIO2
## 1 NORMAL NORMAL FALSE LOW NORMAL HIGH HIGH HIGH HIGH NORMAL NORMAL LOW
## PRSS ECO2 MINV MVS HYP LVF APL ANES PMB INT KINK DISC
## 1 HIGH ZERO HIGH NORMAL FALSE FALSE FALSE FALSE FALSE NORMAL FALSE TRUE
## LVV STKV CCHL ERLO HR ERCA SHNT PVS ACO2 VALV VLNG VTUB
## 1 NORMAL NORMAL HIGH FALSE HIGH FALSE NORMAL NORMAL NORMAL HIGH LOW ZERO
## VMCH
## 1 NORMAL
# learn the structure with IAMB, a constraint-based algorithm, using
# Pearson's X^2 conditional independence test
alarm.struct <- iamb(alarm, test = "x2")
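bnlearn also implements score-based structure learning; a minimal sketch using hill climbing (the choice of the BIC score here is ours):
alarm.struct.hc <- hc(alarm, score = "bic")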
The next diagram presents in red the edges of the true model that were recovered by iamb:
# dag.alarm holds the true network structure, built beforehand from the known
# ALARM specification (e.g. with model2network(); not shown here)
graphviz.plot(dag.alarm, highlight = list(arcs = arcs(alarm.struct)))