Naive Bayes


Given features \( F_1, F_2, \ldots F_n \) and a class \( C \), we wish to know

\[ p(C|F_1, F_2, \ldots F_n) \]

By Bayes' Theorem:

\[ p(C|F_1, F_2, \ldots F_n) = \frac{p(C)p(F_1, F_2, \ldots F_n|C)}{p(F_1, F_2, \ldots F_n)} \propto p(C)p(F_1, F_2, \ldots F_n|C) \]

The numerator \( p(C)p(F_1, F_2, \ldots F_n|C) \) is equal to the joint probability \( p(C,F_1, F_2, \ldots F_n) \) which by the chain rule:

\[ \begin{array}{lcl} p(C,F_1, F_2, \ldots F_n) & = & p(C) p(F_1, F_2, \ldots F_n|C) \\ & = & p(C) p(F_1|C) p(F_2, \ldots F_n|C,F_1) \\ & = & p(C) p(F_1|C) p(F_2|C,F_1) p(F_3, \ldots F_n|C,F_1,F_2)\\ & = & \ldots \\ & = & p(C) p(F_1|C) p(F_2|C,F_1) \ldots p(F_n|C,F_1\ldots F_{n-1}) \end{array} \]

Naive Bayes 'naively' assumes that every feature \( F_i \) is conditionally independent of every other \( F_j \) (\( i \neq j \)) given \( C \), ie:

\[ p(F_i|C,F_j) = p(F_i|C), i \neq j \]

This greatly simplifies the previous chain rule:

\[ p(C) p(F_1|C) p(F_2|C,F_1) \ldots p(F_n|C,F_1\ldots F_{n-1}) = p(C)p(F_1|C)p(F_2|C)\ldots p(F_n|C) = p(C) \prod_i p(F_i|C) \]

And so:

\[ p(C|F_1, F_2, \ldots F_n) \propto p(C) \prod_i p(F_i|C) \]
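
To classify, we pick the class value that maximizes this product (the MAP decision rule, which is also the form used in the example below):

\[ \hat{C} = \operatorname*{arg\,max}_C ~ p(C) \prod_i p(F_i|C) \]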

An example: here is aggregate data on applicants to graduate school at Berkeley in 1973 for the six largest departments, classified by admission status and sex.

data(UCBAdmissions)
margin.table(UCBAdmissions,1)
Admit
Admitted Rejected 
    1755     2771 
df <- as.data.frame(UCBAdmissions)
df[df$Dept=="A",] # just look at Dept. A
     Admit Gender Dept Freq
1 Admitted   Male    A  512
2 Rejected   Male    A  313
3 Admitted Female    A   89
4 Rejected Female    A   19

The estimated probabilities of being admitted to dept. A, given that the applicant is male or female, are: \[ \hat{p}(admitted | male, dept.A) = 512 / (512+313) \approx 0.62 \] \[ \hat{p}(admitted | female, dept.A) = 89 / (89+19) \approx 0.82 \]
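
These proportions can also be computed in one go (a small check, reusing the df data frame defined above):

# p(Admit|Gender) within Dept. A: the 'Admitted' row should show roughly 0.62 for males and 0.82 for females
prop.table(xtabs(Freq ~ Admit + Gender, data=df[df$Dept=="A",]), margin=2)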

Let A = admission; G = gender and D = department.

Naive Bayes assumes that \[ p(A|G,D) \propto p(A)p(G|A)p(D|A) \]

Knowing the gender, G=g, and the department, D=d, we wish to predict the value of A (ie, admitted or rejected) that maximizes p(A|g,d):

\[ \operatorname*{arg\,max}_A ~ p(A|g,d) = \operatorname*{arg\,max}_A ~ p(A)\, p(g|A)\, p(d|A) \]

To achieve this, we estimate \( p(A) \), \( p(G|A) \) and \( p(D|A) \) from the data, for each possible value of \( A \), and perform the calculations.

Let's say that G='female' and D='department A', and we wish to know whether A is 'admitted' or 'rejected'.

We need to query the table for some numbers:

sum(df$Freq) # total number of applications (admitted+rejected)
[1] 4526
sum (df[df$Gender=="Female",]$Freq) # number of females
[1] 1835
sum (df[df$Admit=="Admitted",]$Freq) # number of admitted on all depts
[1] 1755
sum (df[df$Admit=="Rejected",]$Freq) # number of rejected on all depts
[1] 2771
sum (df[df$Admit=="Admitted" & df$Gender=="Female",]$Freq) # number of admitted females
[1] 557
sum (df[df$Admit=="Rejected" & df$Gender=="Female",]$Freq) # number of rejected females
[1] 1278
sum (df[df$Admit=="Admitted" & df$Dept=="A",]$Freq) # number of admitted in dept 'A'
[1] 601
sum (df[df$Admit=="Rejected" & df$Dept=="A",]$Freq) # number of rejected in dept 'A'
[1] 332

Then:

\[ \hat{p}(admitted) = 1755/4526 \approx 0.39, \quad \hat{p}(female|admitted) = 557/1755 \approx 0.32, \quad \hat{p}(dept.A|admitted) = 601/1755 \approx 0.34 \]

And:

\[ \hat{p}(rejected) = 2771/4526 \approx 0.61, \quad \hat{p}(female|rejected) = 1278/2771 \approx 0.46, \quad \hat{p}(dept.A|rejected) = 332/2771 \approx 0.12 \]

This means that \( p(admitted|female,dept.A) \propto 0.39 \times 0.32 \times 0.34 = 0.042432 \)

Which is larger than \( p(rejected|female,dept.A) \propto 0.61 \times 0.46 \times 0.12 = 0.033672 \)

Our prediction is that a female applicant to Dept. A will be admitted. The odds are about \( 1.26:1 \) in favor of admission (ie, a 56% chance of admission).
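
We can double-check these products directly from the counts queried above (a small sketch; the variable names are ad-hoc):

p.adm   <- 1755/4526;  p.rej   <- 2771/4526    # p(admitted), p(rejected)
p.f.adm <-  557/1755;  p.f.rej <- 1278/2771    # p(female|admitted), p(female|rejected)
p.A.adm <-  601/1755;  p.A.rej <-  332/2771    # p(dept.A|admitted), p(dept.A|rejected)
c(admitted = p.adm * p.f.adm * p.A.adm,        # ~0.042
  rejected = p.rej * p.f.rej * p.A.rej)        # ~0.034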

Using the R functions

library(e1071) 
# creates a classifier using Gender and Dept as data, and Admittance as the observation class
classifier <- naiveBayes(Admit ~ Gender + Dept, data = UCBAdmissions)
test <- unique(df[,c(-1,-4)])                # get the pairs gender+department
test$Prediction <- predict(classifier, test) # apply the prediction for those pairs (and add to a new column to data frame 'test')
test                                         # check results
   Gender Dept Prediction
1    Male    A   Admitted
3  Female    A   Admitted
5    Male    B   Admitted
7  Female    B   Admitted
9    Male    C   Rejected
11 Female    C   Rejected
13   Male    D   Rejected
15 Female    D   Rejected
17   Male    E   Rejected
19 Female    E   Rejected
21   Male    F   Rejected
23 Female    F   Rejected

It is possible to access the conditional probabilities from the $tables attribute:

classifier$tables$Gender["Admitted","Female"]  # p(female|admitted)
[1] 0.3174
classifier$tables$Dept["Admitted","A"]         # p(dept.A|admitted)
[1] 0.3425
classifier$tables$Gender["Rejected","Female"]  # p(female|rejected)
[1] 0.4612
classifier$tables$Dept["Rejected","A"]         # p(dept.A|rejected)
[1] 0.1198

And also the a priori distribution of the classes (the $apriori component):

classifier$apriori / sum(classifier$apriori)
Admit
Admitted Rejected 
  0.3878   0.6122 
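
Putting these numbers together for the female/Dept. A case, asking predict() for type="raw" returns the normalized posterior probabilities, which should agree with the roughly 56%/44% split computed by hand above (up to rounding):

# raw posteriors for the pair Gender="Female", Dept="A" (row taken from the 'test' data frame above)
predict(classifier, test[test$Gender=="Female" & test$Dept=="A",], type="raw")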

Continuous attributes

If attribute \( F_i \) is continuous, a typical approach is to assume it is normally distributed within each class, \( p(F_i|C=c) \sim N(mean(F_i|C=c), sd(F_i|C=c)) \), with the mean and standard deviation estimated from the training examples of class \( c \).
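
Each class-conditional likelihood is then evaluated with the normal density (in R, dnorm()), using the class-specific mean \( \mu_{i,c} \) and standard deviation \( \sigma_{i,c} \):

\[ p(F_i = x|C=c) = \frac{1}{\sqrt{2\pi}\,\sigma_{i,c}} \exp\left(-\frac{(x-\mu_{i,c})^2}{2\sigma_{i,c}^2}\right) \]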

TODO: egs at http://en.wikipedia.org/wiki/Naive_Bayes_classifier

The standard iris dataset includes continuous attributes:

data(iris) # load iris dataset
pairs(iris[1:4], main="Iris Data (red=setosa,green=versicolor,blue=virginica)", 
      pch=21, bg=c("red","green3","blue")[unclass(iris$Species)])

[Figure: pairs plot of the four iris measurements — "Iris Data (red=setosa, green=versicolor, blue=virginica)"]

head(iris,n=12)
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
11          5.4         3.7          1.5         0.2  setosa
12          4.8         3.4          1.6         0.2  setosa
summary(iris)
  Sepal.Length   Sepal.Width    Petal.Length   Petal.Width 
 Min.   :4.30   Min.   :2.00   Min.   :1.00   Min.   :0.1  
 1st Qu.:5.10   1st Qu.:2.80   1st Qu.:1.60   1st Qu.:0.3  
 Median :5.80   Median :3.00   Median :4.35   Median :1.3  
 Mean   :5.84   Mean   :3.06   Mean   :3.76   Mean   :1.2  
 3rd Qu.:6.40   3rd Qu.:3.30   3rd Qu.:5.10   3rd Qu.:1.8  
 Max.   :7.90   Max.   :4.40   Max.   :6.90   Max.   :2.5  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  



library(e1071) 
# create a naive Bayes classifier using the first 4 columns as the data
# and the last column as the class of each observation (naiveBayes is a supervised learning algorithm)
classifier <- naiveBayes(iris[,1:4], iris[,5])

Use predict(), which receives the classifier object plus some observations (iris[,-5] is the original data without its class column) and returns a suggested class for each observation.

predicted.classes <- predict(classifier, iris[,-5])
head(predicted.classes,n=12)
 [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
[11] setosa setosa
Levels: setosa versicolor virginica

Then table() presents a confusion matrix between the suggested class vector and the real class vector. In this case the classification was very good, but this is an optimistic estimate since we are using the same data for training and testing (a hold-out split is sketched a bit further below).

table(predicted.classes, iris[,5], dnn=list('predicted','actual'))
            actual
predicted    setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         47         3
  virginica       0          3        47
classifier$apriori / sum(classifier$apriori) # the prior distribution for the classes
iris[, 5]
    setosa versicolor  virginica 
    0.3333     0.3333     0.3333 
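
To get an honest error estimate one would hold out part of the data for testing; a minimal sketch (with an arbitrary seed and an arbitrary 70/30 split):

set.seed(101)                                         # arbitrary seed, just for reproducibility
train <- sample(nrow(iris), round(0.7 * nrow(iris)))  # keep 70% of the rows for training
nb.holdout <- naiveBayes(iris[train, 1:4], iris[train, 5])
predicted.holdout <- predict(nb.holdout, iris[-train, 1:4])
table(predicted.holdout, iris[-train, 5], dnn=list('predicted','actual'))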

Since the predictor variables here are all continuous, the naive Bayes classifier estimates a Gaussian (normal) distribution for each predictor variable and each value of the class variable Species (three per predictor). In these tables, the first column is the mean and the second column is the standard deviation.

classifier$tables$Petal.Length
            Petal.Length
iris[, 5]     [,1]   [,2]
  setosa     1.462 0.1737
  versicolor 4.260 0.4699
  virginica  5.552 0.5519

Let's plot these densities:

plot(0:3, xlim=c(0.5,7), col="red", type="n", xlab="Petal Length", ylab="density",
     main="Petal length distribution for each species")
curve(dnorm(x, classifier$tables$Petal.Length[1,1], classifier$tables$Petal.Length[1,2]), add=TRUE, col="red")
curve(dnorm(x, classifier$tables$Petal.Length[2,1], classifier$tables$Petal.Length[2,2]), add=TRUE, col="blue")
curve(dnorm(x, classifier$tables$Petal.Length[3,1], classifier$tables$Petal.Length[3,2]), add=TRUE, col ="green")
legend("topright", c("setosa", "versicolor", "virginica"), col = c("red","blue","green"), lwd=1)

[Figure: estimated normal densities of Petal.Length for setosa (red), versicolor (blue) and virginica (green)]

These values could also be computed directly from the dataset. For setosa's Petal.Length, say:

mean(iris[iris$Species=="setosa",]$Petal.Length)
[1] 1.462
sd(iris[iris$Species=="setosa",]$Petal.Length)
[1] 0.1737
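
For all species at once one could use aggregate(), for instance:

# class-conditional mean and sd of Petal.Length, per species
aggregate(Petal.Length ~ Species, data=iris, FUN=function(x) c(mean=mean(x), sd=sd(x)))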

So, let's say we want to find, without using naiveBayes(), all three values of \( p(C|observation) \) for a new observation:

observation <- data.frame(Sepal.Length = 5.0, 
                          Sepal.Width  = 3.2, 
                          Petal.Length = 1.5, 
                          Petal.Width  = 0.3)  # this observation lies within Setosa cluster

For setosa class:\[ p(C=setosa|observation) \propto p(C=setosa) p(Sepal.Length=5.0 | C=setosa) p(Sepal.Width=3.2 | C=setosa) p(Petal.Length=1.5 | C=setosa) p(Petal.Width=0.3 | C=setosa) \]

A similar computation is needed for the other two classes. Let's implement them:

iris.classes <- c("setosa","versicolor","virginica")
iris.attributes <- names(iris)[-5]

means     <- rep(0,length(iris.attributes))
sds       <- rep(0,length(iris.attributes))
densities <- rep(0,length(iris.attributes))

p.Cs <- c(0,0,0)              # p(C=class)
p.Cs_observation <- c(0,0,0)  # p(C=class | observation)

for (c in 1:length(iris.classes)) {
  p.Cs[c] <- nrow(iris[iris$Species==iris.classes[c],]) / nrow(iris) 

  for(i in 1:length(iris.attributes)) {
    means[i] <- sapply(iris[iris$Species==iris.classes[c],][i], mean)
    sds[i]   <- sapply(iris[iris$Species==iris.classes[c],][i], sd)
    densities[i] <- dnorm(as.numeric(observation[i]), means[i], sds[i])
  }

  p.Cs_observation[c] <- p.Cs[c] * prod(densities) # the final value for each class
}

names(p.Cs_observation) <- c("setosa","versicolor","virginica")
p.Cs_observation
    setosa versicolor  virginica 
 2.467e+00  1.954e-15  4.920e-23 
p.Cs_observation / sum(p.Cs_observation) # normalize
    setosa versicolor  virginica 
 1.000e+00  7.920e-16  1.995e-23 

This means that our naive Bayes classifier would predict 'setosa' with very high posterior probability.
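
With many attributes the product of small densities can underflow to zero; a common safeguard is to accumulate log-densities instead (dnorm() accepts log=TRUE) and normalize after shifting by the maximum. A minimal sketch of that last step, reusing p.Cs_observation from above:

log.post <- log(p.Cs_observation)      # in practice these would be sums of log-densities
log.post <- log.post - max(log.post)   # shift by the max before exponentiating
exp(log.post) / sum(exp(log.post))     # same normalized posteriors as before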

With the use of naiveBayes() we get the same prediction:

# type="raw" shows the probabilities
predict(classifier, observation, type="raw")
     setosa versicolor virginica
[1,]      1   7.92e-16 1.995e-23