Sunday, July 23, 2017

Naive Bayes classifier for Machine Learning

The Bayesian classifier represents both a supervised learning method and a statistical method for classification. It is a simple but surprisingly powerful algorithm for prediction, and it is a probabilistic classifier.

Bayesian probability theory is built on the idea that the estimated
likelihood of an event should be based on the evidence available.

The naive Bayes algorithm is so named because it makes a couple of "naive" assumptions about the data. In particular, naive Bayes assumes that all of the features/predictors in the dataset are equally important and mutually independent. These assumptions rarely hold in real-world applications.

This post is written for developers with little or no background in statistics or probability, so it is a bit verbose. If you already have some grounding in probability theory, feel free to skip the probability section and the calculations.

Conditional Probability:


To understand the Bayes classifier it is necessary to know conditional probability, so we will discuss it first.

Conditional probability answers the question: what is the probability that event A will happen, given that event B has already happened? That is, the probability of event A happening may depend on whether or not event B has happened.

Let's say there is some outcome O and some evidence E.
Then we write the conditional probability as P(O|E) = P(O ^ E) / P(E),
where ^ stands for "and".

Let's take an example.

Let's say we have a standard deck of 52 cards. A card is either a Diamond, Spade, Club, or Heart. It is also either Black or Red.

If we draw a card randomly from the deck, what is the probability that the card is Red, given that it is a Diamond? We can answer this question with the help of conditional probability.

Here take R (Red) as the outcome and D (Diamond) as the evidence.

So P(R|D) = P(R ^ D) / P(D) = (1/4) / (1/4) = 1.
That is, if we are given only Diamonds and we pull a card from them, the probability of it being Red is 1; the pulled card will definitely be Red.

Similarly, if we take D as the outcome and R as the evidence, we can write the reverse:

P(D|R) = P(R ^ D) / P(R) = (1/4) / (1/2) = 0.5

That is, if we are given only Reds and we pull a card from them, the probability that it is a Diamond is 0.5.

Notice that we can calculate the probability of Red and Diamond together in two ways:
P(R ^ D) = P(R|D) * P(D)
P(R ^ D) = P(D|R) * P(R)
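
Here is a minimal Python sketch of both card calculations from raw counts (the variable names are just for illustration):

```python
# Conditional probability from raw counts in a standard 52-card deck.
TOTAL = 52
diamonds = 13          # every diamond is red
reds = 26              # diamonds + hearts
red_and_diamond = 13   # cards that are both Red and Diamond

p_red_and_diamond = red_and_diamond / TOTAL  # P(R ^ D) = 13/52 = 0.25
p_diamond = diamonds / TOTAL                 # P(D) = 0.25
p_red = reds / TOTAL                         # P(R) = 0.5

print(p_red_and_diamond / p_diamond)  # P(R|D) = 1.0
print(p_red_and_diamond / p_red)      # P(D|R) = 0.5
```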

Bayes' Rule:

Bayes' rule calculates the probability of an outcome when the evidence is known. Mathematically, it computes P(Outcome|Evidence) from the known P(Evidence|Outcome).

Often, we know how frequently some evidence is observed given a known outcome. We use this known fact to compute the reverse: the probability of that outcome, given that the evidence is observed.


P(Outcome|Evidence) = P(Evidence|Outcome) * P(Outcome) / P(Evidence)


An example for understanding Bayes' rule:

P(Disease|Test positive) = P(Test positive|Disease) * P(Disease) / P(Test positive, with or without the disease)

Here the Disease is the outcome, and the positive Test is the evidence.


The terminology in the Bayesian method of probability is as follows:

P(Evidence|Outcome) is called the likelihood.

P(Outcome|Evidence) is called the posterior.

P(Outcome) is the prior probability of the outcome.

P(Evidence) is the prior probability of the evidence.

The intuition behind multiplying by the prior probability of the outcome is that we give high probability to more common outcomes and low probability to unlikely outcomes. These priors are also called base rates, and they act as scaling factors for our predicted probabilities.
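
As a quick sanity check, here is a small sketch that applies Bayes' rule to the card example above, recovering P(Diamond|Red) from P(Red|Diamond):

```python
def bayes_rule(likelihood, prior, evidence):
    """P(Outcome|Evidence) = P(Evidence|Outcome) * P(Outcome) / P(Evidence)."""
    return likelihood * prior / evidence

# Reversing the card example: likelihood = P(Red|Diamond) = 1,
# prior = P(Diamond) = 0.25, evidence = P(Red) = 0.5.
print(bayes_rule(likelihood=1.0, prior=0.25, evidence=0.5))  # 0.5
```

This matches P(Diamond|Red) = 0.5 computed directly from the counts.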

Applying the naive Bayes rule to a classification problem:

So far, we have discussed only a single piece of evidence. In reality, we usually have to predict an outcome given multiple pieces of evidence.

Let's take a case where we have a single outcome and multiple pieces of evidence; call them E1, E2, E3, ..., En. Since naive Bayes assumes the pieces of evidence are independent given the outcome, the rule becomes:


P(Outcome|Multiple Evidence) = P(E1|Outcome) * P(E2|Outcome) * ... * P(En|Outcome) * P(Outcome) / P(Multiple Evidence)
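
In code, the numerator of this rule is just a product of the per-evidence likelihoods and the prior. A tiny sketch (the function name is illustrative; math.prod needs Python 3.8+):

```python
from math import prod

def naive_score(likelihoods, prior):
    """Numerator of the naive Bayes rule:
    P(E1|Outcome) * P(E2|Outcome) * ... * P(En|Outcome) * P(Outcome)."""
    return prod(likelihoods) * prior
```

Since P(Multiple Evidence) is the same for every outcome, comparing these numerators is enough to pick the winner, as the note at the end of this post explains.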
Here we will apply it to a supervised classification problem.

Let's say we have data on 1000 people from three countries, C1, C2, and C3, along with properties such as height, color, and speaking style. We will divide height into two values, short and tall; color into two values, black and white; and speaking style into two values, English and non-English. This is our training sample.

We will use this data to classify any new person we meet. He or she may belong to any of the three countries C1, C2, or C3.

Here, take C1, C2, and C3 as our outcomes, and height, language, and color as the evidence.
Country | Short | Tall | English | Not English | Black | White | Total
--------+-------+------+---------+-------------+-------+-------+------
C1      |   400 |  100 |     350 |         150 |   450 |    50 |   500
C2      |     0 |  300 |     150 |         150 |   300 |     0 |   300
C3      |   100 |  100 |     150 |          50 |    50 |   150 |   200
--------+-------+------+---------+-------------+-------+-------+------
Total   |   500 |  500 |     650 |         350 |   800 |   200 |  1000
Let's compute some values which will be used later.
  • Compute the prior probability of the outcomes:

P(C1) = (samples from C1) / (total samples) = 500/1000 = 0.5

P(C2) = (samples from C2) / (total samples) = 300/1000 = 0.3

P(C3) = (samples from C3) / (total samples) = 200/1000 = 0.2
  • Compute the prior probability of the evidence:

P(Short) = (samples that are Short) / (total samples) = 500/1000 = 0.5

P(Tall) = (samples that are Tall) / (total samples) = 500/1000 = 0.5

P(English) = (samples that speak English) / (total samples) = 650/1000 = 0.65

P(Non English) = (samples that don't speak English) / (total samples) = 350/1000 = 0.35

P(Black) = (samples that are Black) / (total samples) = 800/1000 = 0.8

P(White) = (samples that are White) / (total samples) = 200/1000 = 0.2


  • Compute the likelihoods.

Here we use the formula for conditional probability:

P(Short|C1) = P(Short ^ C1) / P(C1) = (400/1000) / (500/1000) = 0.4/0.5 = 0.8

(400 of the 1000 samples are both Short and from C1, and 500 of the samples are from C1; hence the result.)

P(Short|C2) = P(Short ^ C2) / P(C2) = (0/1000) / (300/1000) = 0/0.3 = 0

P(Short|C3) = P(Short ^ C3) / P(C3) = (100/1000) / (200/1000) = 0.1/0.2 = 0.5

Similarly, we can calculate all the other likelihoods.
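
Here is a sketch that derives the priors and likelihoods straight from the table counts; the dictionary layout is just one convenient choice:

```python
# Raw counts from the training table: counts[country][value].
counts = {
    "C1": {"Short": 400, "Tall": 100, "English": 350, "NonEnglish": 150,
           "Black": 450, "White": 50},
    "C2": {"Short": 0,   "Tall": 300, "English": 150, "NonEnglish": 150,
           "Black": 300, "White": 0},
    "C3": {"Short": 100, "Tall": 100, "English": 150, "NonEnglish": 50,
           "Black": 50,  "White": 150},
}
totals = {"C1": 500, "C2": 300, "C3": 200}
TOTAL = 1000

# Prior of each outcome: P(Ci) = samples from Ci / total samples.
priors = {c: totals[c] / TOTAL for c in counts}
print(priors)  # {'C1': 0.5, 'C2': 0.3, 'C3': 0.2}

# Likelihood of a feature value given a country:
# P(value|Ci) = count(value ^ Ci) / count(Ci).
def likelihood(value, country):
    return counts[country][value] / totals[country]

print(likelihood("Short", "C1"))  # 0.8
print(likelihood("Short", "C2"))  # 0.0
print(likelihood("Short", "C3"))  # 0.5
```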

Let's do the real work (classification):

Let's say we are given the characteristics of an unknown person and asked to classify them. We are told that the person is Tall, speaks English, and is Black in color.

Does the person belong to C1? To C2? Or to C3?

We can simply calculate the numbers for each of the three outcomes, one by one. Then we choose the highest probability and 'classify' our unknown person as belonging to the class with the highest probability, based on our priors and likelihoods.


P(C1|Tall,English,Black) = P(Tall|C1) * P(English|C1) * P(Black|C1) * P(C1) / (P(Tall) * P(English) * P(Black))
                         = (0.2 * 0.7 * 0.9 * 0.5) / (0.5 * 0.65 * 0.8) = 0.063/0.26 ≈ 0.242

P(C2|Tall,English,Black) = P(Tall|C2) * P(English|C2) * P(Black|C2) * P(C2) / (P(Tall) * P(English) * P(Black))
                         = (1 * 0.5 * 1 * 0.3) / (0.5 * 0.65 * 0.8) = 0.15/0.26 ≈ 0.577


P(C3|Tall,English,Black) = P(Tall|C3) * P(English|C3) * P(Black|C3) * P(C3) / (P(Tall) * P(English) * P(Black))
                         = (0.5 * 0.75 * 0.25 * 0.2) / (0.5 * 0.65 * 0.8) = 0.01875/0.26 ≈ 0.072

Since 0.577 > 0.242 > 0.072, P(C2|Tall,English,Black) is the greatest of the three.

So we say that the given person, being Tall, English-speaking, and Black in color, belongs to country C2.



Note: P(Evidence) = P(Tall) * P(English) * P(Black). Since we divide by this same quantity each time we calculate an outcome probability using Bayes' rule, we can safely ignore it and compare the unnormalized results directly.
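
To tie everything together, here is a self-contained sketch of the whole classification, using the priors and likelihoods read off the table and skipping the shared denominator as the note suggests:

```python
# Likelihoods for the observed evidence, read off the training table.
likelihoods = {
    "C1": {"Tall": 100/500, "English": 350/500, "Black": 450/500},
    "C2": {"Tall": 300/300, "English": 150/300, "Black": 300/300},
    "C3": {"Tall": 100/200, "English": 150/200, "Black": 50/200},
}
priors = {"C1": 0.5, "C2": 0.3, "C3": 0.2}

# Unnormalized score: P(Tall|Ci) * P(English|Ci) * P(Black|Ci) * P(Ci).
scores = {}
for country in priors:
    score = priors[country]
    for value in ("Tall", "English", "Black"):
        score *= likelihoods[country][value]
    scores[country] = score

# Scores: C1 ≈ 0.063, C2 ≈ 0.15, C3 ≈ 0.019. Dividing each by the shared
# P(Tall)*P(English)*P(Black) = 0.26 would give 0.242, 0.577, and 0.072,
# but the ranking (and hence the prediction) is the same either way.
print(max(scores, key=scores.get))  # C2
```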

If you have any questions or suggestions about this post, please leave a comment; I will do my best to answer.
