Given features \( F_1, F_2, \ldots F_n \) and a class \( C \), we wish to know
\[ p(C|F_1, F_2, \ldots F_n) \]
By Bayes' Theorem:
\[ p(C|F_1, F_2, \ldots F_n) = \frac{p(C)p(F_1, F_2, \ldots F_n|C)}{p(F_1, F_2, \ldots F_n)} \propto p(C)p(F_1, F_2, \ldots F_n|C) \]
The numerator \( p(C)p(F_1, F_2, \ldots F_n|C) \) equals the joint probability \( p(C,F_1, F_2, \ldots F_n) \), which by the chain rule can be expanded as:
\[ \begin{array}{lcl} p(C,F_1, F_2, \ldots F_n) & = & p(C) p(F_1, F_2, \ldots F_n|C) \\ & = & p(C) p(F_1|C) p(F_2, \ldots F_n|C,F_1) \\ & = & p(C) p(F_1|C) p(F_2|C,F_1) p(F_3, \ldots F_n|C,F_1,F_2)\\ & = & \ldots \\ & = & p(C) p(F_1|C) p(F_2|C,F_1) \ldots p(F_n|C,F_1\ldots F_{n-1}) \end{array} \]
Naive Bayes 'naively' assumes that every feature \( F_i \) is conditionally independent of every other feature \( F_j \) (\( i \neq j \)) given \( C \), i.e.:
\[ p(F_i|C,F_j) = p(F_i|C), \quad i \neq j \]
This greatly simplifies the previous chain rule:
\[ p(C) p(F_1|C) p(F_2|C,F_1) \ldots p(F_n|C,F_1\ldots F_{n-1}) = p(C)p(F_1|C)p(F_2|C)\ldots p(F_n|C) = p(C) \prod_i p(F_i|C) \]
And so:
\[ p(C|F_1, F_2, \ldots F_n) \propto p(C) \prod_i p(F_i|C) \]
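As a quick illustration of this formula, here is a minimal sketch (the helper name naive.bayes.score is ours, not from any package) that turns a class prior and a vector of per-feature conditional probabilities into the unnormalized score \( p(C) \prod_i p(F_i|C) \):
# minimal sketch: unnormalized naive Bayes score for one class
# 'prior' is p(C); 'cond.probs' holds p(F_1|C), ..., p(F_n|C)
naive.bayes.score <- function(prior, cond.probs) prior * prod(cond.probs)
naive.bayes.score(0.5, c(0.9, 0.2, 0.7)) # 0.5 * 0.9 * 0.2 * 0.7 = 0.063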
An example: here is the UCBAdmissions dataset, aggregate data on applicants to graduate school at Berkeley for the six largest departments in 1973, classified by admission and sex.
data(UCBAdmissions)
margin.table(UCBAdmissions,1)
Admit
Admitted Rejected
1755 2771
df <- as.data.frame(UCBAdmissions)
df[df$Dept=="A",] # just look at Dept. A
Admit Gender Dept Freq
1 Admitted Male A 512
2 Rejected Male A 313
3 Admitted Female A 89
4 Rejected Female A 19
The estimated probabilities of being admitted to dept. A being a male or female are: \[ \hat{p}(admitted | male, dept.A) = 512 / (512+313) \approx 0.62 \] \[ \hat{p}(admitted | female, dept.A) = 89 / (89+19) \approx 0.82 \]
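These two estimates can be checked directly against the data frame built above:
a <- df[df$Dept=="A",] # restrict to Dept. A
sum(a[a$Admit=="Admitted" & a$Gender=="Male",]$Freq)   / sum(a[a$Gender=="Male",]$Freq)   # 512/825 ~ 0.62
sum(a[a$Admit=="Admitted" & a$Gender=="Female",]$Freq) / sum(a[a$Gender=="Female",]$Freq) # 89/108 ~ 0.82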
Let A = admission, G = gender, and D = department.
Naive Bayes assumes that \[ p(A|G,D) \propto p(A)p(G|A)p(D|A) \]
Knowing the gender, G=g, and the department, D=d, we wish to predict the value of A (i.e., was the applicant admitted?) that maximizes p(A|g,d):
\[ \operatorname*{arg\,max}_A ~ p(A|g,d) = \operatorname*{arg\,max}_A ~ p(A) p(g|A) p(d|A) \]
To achieve this, we estimate \( p(A) \), \( p(G|A) \) and \( p(D|A) \) from the data for each possible value of \( A \) and perform the calculations.
Let's say that G='female' and D='dept. A', and we wish to know whether A is 'admitted' or 'rejected'.
We need to query the table for some numbers:
sum(df$Freq) # total number of applications (admitted+rejected)
[1] 4526
sum(df[df$Gender=="Female",]$Freq) # number of female applicants
[1] 1835
sum(df[df$Admit=="Admitted",]$Freq) # number admitted across all depts
[1] 1755
sum(df[df$Admit=="Rejected",]$Freq) # number rejected across all depts
[1] 2771
sum(df[df$Admit=="Admitted" & df$Gender=="Female",]$Freq) # number of admitted females
[1] 557
sum(df[df$Admit=="Rejected" & df$Gender=="Female",]$Freq) # number of rejected females
[1] 1278
sum(df[df$Admit=="Admitted" & df$Dept=="A",]$Freq) # number admitted in dept 'A'
[1] 601
sum(df[df$Admit=="Rejected" & df$Dept=="A",]$Freq) # number rejected in dept 'A'
[1] 332
Then:
\[ \hat{p}(admitted) = 1755/4526 \approx 0.39, \quad \hat{p}(female|admitted) = 557/1755 \approx 0.32, \quad \hat{p}(dept.A|admitted) = 601/1755 \approx 0.34 \]
And:
\[ \hat{p}(rejected) = 2771/4526 \approx 0.61, \quad \hat{p}(female|rejected) = 1278/2771 \approx 0.46, \quad \hat{p}(dept.A|rejected) = 332/2771 \approx 0.12 \]
This means that \( p(admitted|female,'A') \propto 0.39 \times 0.32 \times 0.34 = 0.042432 \)
Which is larger than \( p(rejected|female,'A') \propto 0.61 \times 0.46 \times 0.12 = 0.033672 \)
Our prediction is that a female applicant to Dept. A will be admitted. The odds are about \( 1.26:1 \) in favor of admission (i.e., roughly a 56% chance of admission).
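The same arithmetic can be done in R with the exact counts queried above (so the normalized result differs slightly from the rounded 56%):
p.admitted <- (1755/4526) * (557/1755) * (601/1755)   # p(admitted) p(female|admitted) p(dept.A|admitted)
p.rejected <- (2771/4526) * (1278/2771) * (332/2771)  # p(rejected) p(female|rejected) p(dept.A|rejected)
scores <- c(admitted=p.admitted, rejected=p.rejected)
scores / sum(scores)     # normalized posteriors, about 0.55 vs 0.45
names(which.max(scores)) # "admitted"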
In R, the e1071 package implements this classifier: naiveBayes() creates a classifier given observation data and the class of each observation, and predict() receives the classifier plus some observations and returns a vector with their predicted classes.
library(e1071)
# create a classifier using Gender and Dept as predictors, and Admit as the observation class
classifier <- naiveBayes(Admit ~ Gender + Dept, data = UCBAdmissions)
test <- unique(df[,c(-1,-4)]) # get the pairs gender+department
test$Prediction <- predict(classifier, test) # compute the prediction for those pairs (and add it as a new column of data frame 'test')
test # check results
Gender Dept Prediction
1 Male A Admitted
3 Female A Admitted
5 Male B Admitted
7 Female B Admitted
9 Male C Rejected
11 Female C Rejected
13 Male D Rejected
15 Female D Rejected
17 Male E Rejected
19 Female E Rejected
21 Male F Rejected
23 Female F Rejected
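We can also ask for the class probabilities of a single case instead of the predicted class by passing type="raw" (the same option is used with the iris data below). The exact call on a one-row data frame is a sketch, but it should roughly reproduce the ~56% admission probability computed by hand:
# posterior probabilities for a female applicant to Dept. A
predict(classifier, data.frame(Gender="Female", Dept="A"), type="raw")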
It is possible to access the conditional probabilities from the $tables attribute:
classifier$tables$Gender["Admitted","Female"] # p(female|admitted)
[1] 0.3174
classifier$tables$Dept["Admitted","A"] # p(dept.A|admitted)
[1] 0.3425
classifier$tables$Gender["Rejected","Female"] # p(female|rejected)
[1] 0.4612
classifier$tables$Dept["Rejected","A"] # p(dept.A|rejected)
[1] 0.1198
And also the apriori (class prior) distribution:
classifier$apriori / sum(classifier$apriori)
Admit
Admitted Rejected
0.3878 0.6122
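With these stored values we can rebuild the hand calculation for a female applicant to Dept. A directly from the fitted model:
p.adm <- (classifier$apriori["Admitted"]/sum(classifier$apriori)) *
         classifier$tables$Gender["Admitted","Female"] *
         classifier$tables$Dept["Admitted","A"]
p.rej <- (classifier$apriori["Rejected"]/sum(classifier$apriori)) *
         classifier$tables$Gender["Rejected","Female"] *
         classifier$tables$Dept["Rejected","A"]
p.adm / (p.adm + p.rej) # about 0.55, in line with the ~56% obtained by hand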
If attribute \( F_i \) is continuous, a typical approach is to assume that \( p(F_i|C=c) \sim N(mean(F_i|C=c), sd(F_i|C=c)) \), i.e., a Gaussian whose parameters are the sample mean and standard deviation of \( F_i \) over the training observations of class \( c \).
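A minimal sketch of that estimate (the helper name gaussian.cond.density is ours):
# sketch: p(F_i = x | C = c) under the Gaussian assumption,
# where 'feature.values' are the training values of F_i within class c
gaussian.cond.density <- function(x, feature.values) dnorm(x, mean(feature.values), sd(feature.values))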
More examples can be found at http://en.wikipedia.org/wiki/Naive_Bayes_classifier.
The standard iris dataset includes continuous attributes:
data(iris) # load iris dataset
pairs(iris[1:4], main="Iris Data (red=setosa,green=versicolor,blue=virginica)",
pch=21, bg=c("red","green3","blue")[unclass(iris$Species)])
head(iris,n=12)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
11 5.4 3.7 1.5 0.2 setosa
12 4.8 3.4 1.6 0.2 setosa
summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.30 Min. :2.00 Min. :1.00 Min. :0.1
1st Qu.:5.10 1st Qu.:2.80 1st Qu.:1.60 1st Qu.:0.3
Median :5.80 Median :3.00 Median :4.35 Median :1.3
Mean :5.84 Mean :3.06 Mean :3.76 Mean :1.2
3rd Qu.:6.40 3rd Qu.:3.30 3rd Qu.:5.10 3rd Qu.:1.8
Max. :7.90 Max. :4.40 Max. :6.90 Max. :2.5
Species
setosa :50
versicolor:50
virginica :50
library(e1071)
# create a classifier using the first 4 columns as the data, and the last column as the class of each observation (naiveBayes is a supervised learning algorithm)
classifier<-naiveBayes(iris[,1:4], iris[,5])
Use predict(), which receives the classifier object plus some observations (iris[,-5] is the original data without its class column) and returns a suggested class for each observation.
predicted.classes <- predict(classifier, iris[,-5])
head(predicted.classes,n=12)
[1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
[11] setosa setosa
Levels: setosa versicolor virginica
Then table() builds a confusion matrix between the predicted class vector and the real class vector. In this case the classification was very good, but this is a biased result since we are using the same data for both the training and the test sets:
table(predicted.classes, iris[,5], dnn=list('predicted','actual'))
actual
predicted setosa versicolor virginica
setosa 50 0 0
versicolor 0 47 3
virginica 0 3 47
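From this confusion matrix we can also compute the (training-set) accuracy directly:
mean(predicted.classes == iris[,5]) # (50+47+47)/150 = 0.96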
classifier$apriori / sum(classifier$apriori) # the prior distribution for the classes
iris[, 5]
setosa versicolor virginica
0.3333 0.3333 0.3333
Since the predictor variables here are all continuous, the naive Bayes classifier estimates three Gaussian (Normal) distributions for each predictor variable, one for each value of the class variable Species. The first column is the mean, the second column is the standard deviation.
classifier$tables$Petal.Length
Petal.Length
iris[, 5] [,1] [,2]
setosa 1.462 0.1737
versicolor 4.260 0.4699
virginica 5.552 0.5519
Let's plot these distributions:
plot(0:3, xlim=c(0.5,7), col="red", ylab="density",type="n", xlab="Petal Length",main="Petal length distribution for each species")
curve(dnorm(x, classifier$tables$Petal.Length[1,1], classifier$tables$Petal.Length[1,2]), add=TRUE, col="red")
curve(dnorm(x, classifier$tables$Petal.Length[2,1], classifier$tables$Petal.Length[2,2]), add=TRUE, col="blue")
curve(dnorm(x, classifier$tables$Petal.Length[3,1], classifier$tables$Petal.Length[3,2]), add=TRUE, col ="green")
legend("topright", c("setosa", "versicolor", "virginica"), col = c("red","blue","green"), lwd=1)
These values can also be computed directly from the dataset, say for Petal.Length:
mean(iris[iris$Species=="setosa",]$Petal.Length)
[1] 1.462
sd(iris[iris$Species=="setosa",]$Petal.Length)
[1] 0.1737
So, let's say we want to compute, without using predict(), all three possible values of \( p(C|observation) \):
observation <- data.frame(Sepal.Length = 5.0,
Sepal.Width = 3.2,
Petal.Length = 1.5,
Petal.Width = 0.3) # this observation lies within Setosa cluster
For setosa class:\[ p(C=setosa|observation) \propto p(C=setosa) p(Sepal.Length=5.0 | C=setosa) p(Sepal.Width=3.2 | C=setosa) p(Petal.Length=1.5 | C=setosa) p(Petal.Width=0.3 | C=setosa) \]
A similar computation is needed for the other two classes. Let's implement them:
iris.classes <- c("setosa","versicolor","virginica")
iris.attributes <- names(iris)[-5]
means <- rep(0,length(iris.attributes))
sds <- rep(0,length(iris.attributes))
densities <- rep(0,length(iris.attributes))
p.Cs <- c(0,0,0) # p(C=class)
p.Cs_observation <- c(0,0,0) # p(C=class | observation)
for (c in 1:length(iris.classes)) {
  p.Cs[c] <- nrow(iris[iris$Species==iris.classes[c],]) / nrow(iris)
  for(i in 1:length(iris.attributes)) {
    means[i]     <- sapply(iris[iris$Species==iris.classes[c],][i], mean)
    sds[i]       <- sapply(iris[iris$Species==iris.classes[c],][i], sd)
    densities[i] <- dnorm(as.numeric(observation[i]), means[i], sds[i])
  }
  p.Cs_observation[c] <- p.Cs[c] * prod(densities) # the final value for each class
}
names(p.Cs_observation) <- c("setosa","versicolor","virginica")
p.Cs_observation
setosa versicolor virginica
2.467e+00 1.954e-15 4.920e-23
p.Cs_observation / sum(p.Cs_observation) # normalize
setosa versicolor virginica
1.000e+00 7.920e-16 1.995e-23
This means that our naive Bayes classifier would predict 'setosa' with very high probability.
Using predict() on this classifier we get the same prediction:
# type="raw" shows the probabilities
predict(classifier, observation, type="raw")
setosa versicolor virginica
[1,] 1 7.92e-16 1.995e-23