Sunday, July 23, 2017

Naive Bayes classifier for Machine Learning

Bayesian classification is a supervised learning method as well as a statistical method for classification. It is a simple but surprisingly powerful algorithm for prediction, and it is a probabilistic classifier.

Bayesian probability theory is built on the idea that the estimated likelihood of an event should be based on the evidence available.

The naive Bayes algorithm is so named because it makes a couple of "naive" assumptions about the data. In particular, naive Bayes assumes that all of the features/predictors in the dataset are equally important and independent of one another. These assumptions are rarely true in real-world applications.

This post is written for developers with little or no background in statistics or probability, so it is a bit verbose. If you already know some probability theory, feel free to skip the probability section and the calculations.

Conditional Probability:


To understand the Bayes classifier it is necessary to understand conditional probability, so we will discuss it here.

Conditional probability answers the question: what is the probability that event A will happen, given that event B has already happened? Here the chance of event A happening depends on whether event B has happened.

Let's say there is some outcome O and some evidence E.
Then we write the conditional probability as P(O|E) = P(O ^ E)/P(E), where ^ stands for the "and" symbol.

Let's take an example

Let's say we have a deck of 52 cards. A card can be a Diamond, Spade, Club or Heart, and it is also either Black or Red.

If we draw a card randomly from the deck, what is the probability that the card is a Red Diamond? We can answer this question with the help of conditional probability.

Here take D (Diamond) as the given evidence and R (Red) as the outcome.

So P(R|D) = P(R ^ D)/P(D) = (1/4)/(1/4) = 1
That is, if we are given only diamonds and we pull a card from them, the probability of it being Red is 1, i.e. the pulled card will definitely be Red.

Similarly, if we take R as the evidence and D as the outcome, we can write it the other way round:

P(D|R) = P(R ^ D)/P(R) = (1/4)/(1/2) = 0.5

That is, if we are given only Red cards and we pull a card from them, the probability that it is a Diamond is 0.5.

Notice that we can calculate the probability of Red and Diamond in two ways:
P(R ^ D) = P(R|D) * P(D)
P(R ^ D) = P(D|R) * P(R)
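As a quick sanity check, here is a minimal Python sketch of these card-deck numbers, using exact fractions and counts taken from a standard 52-card deck:

```python
from fractions import Fraction

# Counts in a standard 52-card deck
total = 52
diamonds = 13          # 1/4 of the deck
reds = 26              # hearts + diamonds, 1/2 of the deck
red_and_diamond = 13   # every diamond is red

P_D = Fraction(diamonds, total)               # P(D) = 1/4
P_R = Fraction(reds, total)                   # P(R) = 1/2
P_R_and_D = Fraction(red_and_diamond, total)  # P(R ^ D) = 1/4

print("P(R|D) =", P_R_and_D / P_D)   # 1
print("P(D|R) =", P_R_and_D / P_R)   # 1/2
```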

Bayes Rule:

Bayes' rule lets us calculate the probability of an outcome when the evidence is known. Mathematically, it means calculating P(Outcome|Evidence) from the known P(Evidence|Outcome).

Often, we know how frequently some evidence is observed, given a known outcome. We use this known fact to compute the reverse: the probability of that outcome, given that the evidence is observed.


P(Outcome|Evidence) = P(Evidence|Outcome) * P(Outcome) / P(Evidence)


An example for understanding Bayes' rule:

Probability of Disease D given a positive test =
P(Test is positive|Disease) * P(Disease) / P(Test is positive, with or without the disease)

Here Disease D is the outcome and "Test is positive" is the evidence.


The commonly used terminology in the Bayesian method of probability is as follows:

P(Evidence|Outcome) is called the likelihood.

P(Outcome|Evidence) is called the posterior.

P(Outcome) is the prior probability of the outcome.

P(Evidence) is the prior probability of the evidence.

The intuition behind multiplying by the prior probability of the outcome is that we give high probability to more common outcomes, and low probability to more unlikely outcomes. These priors are also called base rates, and they act as scaling factors for our predicted probabilities.
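As a small illustration, here is a hedged Python sketch of Bayes' rule applied to the disease-test example. The numbers (a 99% true-positive rate, a 1% base rate and a 5% false-positive rate) are purely hypothetical, since the post does not give concrete values:

```python
def posterior(likelihood, prior, evidence):
    """Bayes' rule: P(Outcome|Evidence) = P(Evidence|Outcome) * P(Outcome) / P(Evidence)."""
    return likelihood * prior / evidence

# Hypothetical disease-test numbers, for illustration only
p_pos_given_disease = 0.99                 # P(Test positive | Disease)
p_disease = 0.01                           # P(Disease), the base rate
p_pos = 0.99 * 0.01 + 0.05 * 0.99          # P(Test positive), with or without the disease

print(posterior(p_pos_given_disease, p_disease, p_pos))   # ~0.167
```

Note how the small base rate P(Disease) scales the result down: even with a very accurate test, the posterior probability of disease here is only about 17%.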

Applying the naive Bayes rule to a classification problem:

So far, we have discussed only a single piece of evidence. In reality, we have to predict an outcome given multiple pieces of evidence.

Let's assume a case where we have a single outcome and multiple pieces of evidence; call them E1, E2, E3, ..., En.


P(Outcome|Multiple Evidence) =
P(E1|Outcome) * P(E2|Outcome) * ... * P(En|Outcome) * P(Outcome) / P(Multiple Evidence)
Here we will apply it to a supervised classification problem.
Let's say we have 1000 people from three countries C1, C2 and C3, and for each person we record properties such as height, color and speaking style. We divide height into two values, short or tall; color into two values, black or white; and speaking style into two values, English or non-English. This is our training sample.

We will use this to classify any new person we meet. He/she may belong to any of the three countries C1, C2 or C3.

Here C1, C2 and C3 are our outcomes, and height, language and color are the evidence.
Country |  Short |  Tall | English | Not English | Black | White | Total
--------+--------+-------+---------+-------------+-------+-------+------
C1      |    400 |   100 |     350 |         150 |   450 |    50 |   500
C2      |      0 |   300 |     150 |         150 |   300 |     0 |   300
C3      |    100 |   100 |     150 |          50 |    50 |   150 |   200
--------+--------+-------+---------+-------------+-------+-------+------
Total   |    500 |   500 |     650 |         350 |   800 |   200 |  1000
Let's compute some values which will be in use later.
  • Compute the prior probability of the outcomes
P(C1) = samples from C1 / total samples = 500/1000 = 0.5
P(C2) = samples from C2 / total samples = 300/1000 = 0.3
P(C3) = samples from C3 / total samples = 200/1000 = 0.2
  • Compute the prior probability of the evidence
P(Short) = samples that are Short / total samples = 500/1000 = 0.5
P(Tall) = samples that are Tall / total samples = 500/1000 = 0.5
P(English) = samples that speak English / total samples = 650/1000 = 0.65
P(Non-English) = samples that do not speak English / total samples = 350/1000 = 0.35
P(Black) = samples that are Black / total samples = 800/1000 = 0.8
P(White) = samples that are White / total samples = 200/1000 = 0.2


  • Compute the likelihoods
Here we use the formula for conditional probability.

P(Short|C1) = P(Short ^ C1)/P(C1) = (400/1000)/(500/1000) = 0.4/0.5 = 0.8
(400 of the 1000 samples are Short and from C1, and 500 samples are from C1, hence the result.)

P(Short|C2) = P(Short ^ C2)/P(C2) = (0/1000)/(300/1000) = 0/0.3 = 0

P(Short|C3) = P(Short ^ C3)/P(C3) = (100/1000)/(200/1000) = 0.1/0.2 = 0.5

Similarly we can calculate all the other likelihoods.
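The priors and likelihoods above can be computed directly from the table counts. Here is a minimal Python sketch, assuming the same 1000-person table:

```python
# Counts taken from the training table above
counts = {
    "C1": {"Short": 400, "Tall": 100, "English": 350, "NonEnglish": 150, "Black": 450, "White": 50,  "total": 500},
    "C2": {"Short": 0,   "Tall": 300, "English": 150, "NonEnglish": 150, "Black": 300, "White": 0,   "total": 300},
    "C3": {"Short": 100, "Tall": 100, "English": 150, "NonEnglish": 50,  "Black": 50,  "White": 150, "total": 200},
}
N = 1000

priors = {c: counts[c]["total"] / N for c in counts}            # P(Country)
likelihoods = {c: {f: counts[c][f] / counts[c]["total"]         # P(feature | Country)
                   for f in counts[c] if f != "total"}
               for c in counts}

print(priors)                              # {'C1': 0.5, 'C2': 0.3, 'C3': 0.2}
print(likelihoods["C1"]["Short"])          # 0.8
print(likelihoods["C3"]["Short"])          # 0.5
```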

Let's do the real work (classification):

Let's say we are given the characteristics of an unknown person and asked to classify him or her. We are told that the person is Tall, speaks English and is Black in color.

Does the person belong to C1, C2 or C3?

We simply calculate the number for each of the three outcomes, one by one. Then we choose the highest probability and 'classify' our unknown person as belonging to the class with the highest probability, based on our priors and likelihoods.


P(C1|Tall,English,Black) = P(Tall|C1) * P(English|C1) * P(Black|C1) * P(C1) / (P(Tall) * P(English) * P(Black))
                         = (0.2 * 0.7 * 0.9 * 0.5) / (0.5 * 0.65 * 0.8) ≈ 0.242

P(C2|Tall,English,Black) = P(Tall|C2) * P(English|C2) * P(Black|C2) * P(C2) / (P(Tall) * P(English) * P(Black))
                         = (1 * 0.5 * 1 * 0.3) / (0.5 * 0.65 * 0.8) ≈ 0.577

P(C3|Tall,English,Black) = P(Tall|C3) * P(English|C3) * P(Black|C3) * P(C3) / (P(Tall) * P(English) * P(Black))
                         = (0.5 * 0.75 * 0.25 * 0.2) / (0.5 * 0.65 * 0.8) ≈ 0.072

Since 0.577 > 0.242 > 0.072, P(C2|Tall,English,Black) is the largest of the three.

So we say that the given person, who is Tall, speaks English and is Black in color, belongs to country C2.



Note: P(Evidence) = P(Tall) * P(English) * P(Black). Since we divide by this same quantity every time we calculate an outcome probability using Bayes' rule, we can safely ignore it, i.e. we can compare the results without dividing by it.
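Putting the whole classification together, here is a minimal, self-contained Python sketch for the (Tall, English, Black) person, using the same table counts as above. It compares only the numerators, as the note explains:

```python
# Only the counts needed for the (Tall, English, Black) query
counts = {
    "C1": {"Tall": 100, "English": 350, "Black": 450, "total": 500},
    "C2": {"Tall": 300, "English": 150, "Black": 300, "total": 300},
    "C3": {"Tall": 100, "English": 150, "Black": 50,  "total": 200},
}
N = 1000
evidence = ["Tall", "English", "Black"]

scores = {}
for country, row in counts.items():
    score = row["total"] / N                 # prior P(country)
    for e in evidence:
        score *= row[e] / row["total"]       # likelihood P(e | country)
    scores[country] = score                  # numerator only; the denominator is common

print(scores)                                # {'C1': 0.063, 'C2': 0.15, 'C3': 0.01875}
print(max(scores, key=scores.get))           # C2
```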

If you have any questions or suggestions about this post, please leave a comment and I will do my best to answer it.

Bias, Variance, Training error, Test error, and irreducible error curves

1) Draw bias, variance, training error, test error, and  irreducible error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve.


Explanation: 


a) Training MSE:

The training MSE declines monotonically as flexibility increases, because as flexibility increases the fitted curve follows the observed data more closely.

b) Test MSE:
 
The test MSE initially declines as flexibility increases, but at some point it levels off and then starts to increase again (a U-shape). This is because when a fitted curve yields a small training MSE but a large test MSE we are overfitting the data: our procedure tries too hard to find patterns in the training data that may be caused by chance rather than by true properties of the unknown f.

c) Variance:
 
The squared bias decreases monotonically and the variance increases monotonically; as a general rule, as we use more flexible methods, the variance will increase and the bias will decrease. Variance refers to the amount by which the estimate f^ would change if we estimated it using a different training data set; if the curve fits the observations very closely, changing any single point may cause f^ to change considerably, resulting in high variance.

d) Bias:

Bias refers to the error that is introduced by approximating a real-life problem with a much simpler model. If we use a very simple model (e.g. linear regression), it is unlikely that any real-life problem has such a simple linear relationship, so performing linear regression will introduce some bias into the estimate of f.

e)  Irreducible error

Because the irreducible error is noise, no method can fit it, so it is always present. The irreducible error is a constant, so its curve is a horizontal line; it lies below the test MSE curve because the expected test MSE is always greater than Var(ε).
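To visualize these five curves on a single plot, here is a schematic Python/matplotlib sketch. The functional forms below are invented purely to reproduce the shapes described above; they are not derived from any real data set:

```python
import numpy as np
import matplotlib.pyplot as plt

flex = np.linspace(0.1, 10, 200)               # x-axis: flexibility of the method
bias_sq = 4 / flex                              # squared bias: decreases monotonically
variance = 0.05 * flex ** 2                     # variance: increases monotonically
irreducible = np.full_like(flex, 1.0)           # Var(eps): constant horizontal line
train_mse = 4 / flex - 0.02 * flex              # training MSE: decreases monotonically
test_mse = bias_sq + variance + irreducible     # test MSE: U-shaped, always above Var(eps)

for curve, label in [(train_mse, "training MSE"), (test_mse, "test MSE"),
                     (bias_sq, "squared bias"), (variance, "variance"),
                     (irreducible, "irreducible error")]:
    plt.plot(flex, curve, label=label)
plt.xlabel("Flexibility")
plt.ylabel("Error")
plt.legend()
plt.show()
```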

Classification Problem Vs. Regression Problem

Explain whether each scenario below is a classification or a regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p. Here n is the number of training examples and p is the number of predictors (features).

(a) We collect a set of data on the top 500 firms in the US. For each
firm we record profit, number of employees, industry and the
CEO salary. We are interested in understanding which factors
affect CEO salary.

Ans: This is a supervised regression problem. We are interested in inference. n=500, p=3 (profit, number of employees, industry).

(b) We are considering launching a new product and wish to know
whether it will be a success or a failure. We collect data on 20
similar products that were previously launched. For each product
we have recorded whether it was a success or failure, price
charged for the product, marketing budget, competition price,
and ten other variables.


Ans: This is a supervised classification problem. We are interested in prediction. n=20, p=13 (price, marketing budget, competition price, and the ten other variables).

(c) We are interested in predicting the % change in the US dollar in
relation to the weekly changes in the world stock markets. Hence
we collect weekly data for all of 2012. For each week we record
the % change in the dollar, the % change in the US market,
the % change in the British market, and the % change in the
German market.

Ans: This is a regression problem, and we are interested in prediction. n=52, p=3. The input is weekly data and there are 52 weeks in a year, so n=52. The three features are the % change in the US market, the % change in the British market, and the % change in the German market.
Our target variable is the % change in the dollar.

d) Our website has collected the ratings of 1000 different restaurants by 10,000 customers. Each customer has rated about 100 restaurants, and we would like to recommend restaurants to customers who have not yet been there.

Ans: This is a supervised classification learning problem, and we are interested in prediction. If we assume each customer has rated exactly 100 restaurants, then on average each restaurant receives about 1,000 ratings. We are interested in recommending a restaurant to a customer, so the question becomes: should we recommend it or not? That makes it a two-class classification problem. Since each restaurant has about 1,000 ratings, our training data has n=1000 samples, and the only feature we know about a restaurant is its rating, so p=1.

e) Is this TV series/movie/ad campaign going to be successful or not? Data is given for 1000 samples, and for each sample we collect Money spent, Talent, Running Time, Producer, TV Channel, Air time slot, etc.


Ans: It is a binary (two-class) classification problem. The response or target variable is whether the series or movie will be successful or not, and we want to predict the response. The features or predictors are Money spent, Talent, Running Time, Producer, TV Channel and Air time slot, so p=6. Since we have 1000 samples, n=1000.

f) Should this applicant be admitted into Utkal University, Vanivihar, for the MCA programme or not? We have 1000 samples, and each sample contains the applicant's 10th percentage, 12th percentage, graduation percentage and OJEE score.

Ans: It is a binary (two-class) classification problem. The response or target variable is admit or do not admit, and it is a prediction problem.
The features or predictors are 10th percentage, 12th percentage, graduation percentage and OJEE score, so p=4. Since we have 1000 samples, n=1000.



Flexible learning method vs Inflexible learning method

A method is said to be flexible if it can fit many different possible functional forms of the unknown function f. Highly flexible models tend to follow the error or noise too closely.

1) What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?

Ans: 

The advantages of a very flexible approach are that it may give a better fit for non-linear models and it decreases the bias.

The disadvantages of a very flexible approach are that it requires estimating a greater number of parameters, it follows the noise too closely (overfit) and it increases the variance.

A more flexible approach would be preferred to a less flexible approach when we are interested in prediction and not the interpretability of the results.

A less flexible approach would be preferred to a more flexible approach when we are interested in inference and the interpretability of the results.




2) Decide whether a flexible or an inflexible method is better for each of the examples below.

a) The sample size n is extremely large, and the number of predictors p is small?

Ans: A flexible method will fit the data more closely, and with the large sample size it would perform better than an inflexible (restrictive) approach.

b) The number of predictors p is extremely large, and the number of observations n is small?

Ans: A flexible method would overfit the small number of observations, so here a less flexible method will do better.

c) The variance of the error terms, i.e. σ² = Var(ε), is extremely high?
 
Ans: The inflexible model is preferred. A flexible model might fit everything, including the errors, and unfortunately this data set has large errors. Thus the inflexible model is the better option for learning the overall trend and pattern of the data.

d) The relationship between the predictors and response is highly non-linear, and σ² is small.

Ans: The flexible model is preferred. We are told in advance that the relationship is highly non-linear, so a linear model would not provide an accurate estimate for the data set. Moreover, the variance of the error term is small, so we do not need to worry much about the flexible model fitting the errors.

e) The relationship between the predictors and response is highly non-linear, and σ² is large.

Ans: The inflexible model is preferred. Although the underlying relationship is highly non-linear, the large errors might be mistakenly fit by a flexible model. An inflexible model works better here because it gives us a clearer and more accurate sense of the overall structure of the data set.

Saturday, July 15, 2017

AdaBoost - A boosting strategy for weak classifiers

Boosting, as the name suggests, is a general technique to boost the results of several inaccurate classifiers and produce a combined classifier that is more accurate than any of them alone. What we want to emphasize here is that the result is more accurate than the individual classifiers, but it does not reach zero error. The inaccurate classifiers are called weak classifiers or weak learners; you can think of them as judges in a competition, each with somewhat limited judgment in a particular area.

A weak hypothesis, weak learner or weak classifier is defined as one whose performance is at least slightly better than random chance. For reference, random chance on a two-class problem means 50% accuracy.
 

Let's take an example and stick to it throughout the post. Suppose there is a singing competition with 10 participants and 4 judges. Assume the participants are strong in various areas: some are good at classical, some at semi-classical, and some at both. The organizing committee told the judges to classify the participants into two classes: either they go ahead to the next round, or they are not good enough to be selected. So we have two classes, select or reject. Let's say select means class 1 and reject means class -1.

Let's assume the participants are A, B, C, D, E, F, G, H, I, J and the four judges are J1, J2, J3, J4. Assume we already know the true result for the 10 participants: A, B, C, H, I, J belong to class 1 and D, E, F, G belong to class -1. Let's call this the training set; we want to build our model from it. J1, J2, J3 and J4 are the weak learners.
In this post we will discuss one of the important boosting techniques, called AdaBoost.

Motivation Behind AdaBoost:


A panel of judges often reaches the correct decision even though each individual judge may give weak judgements. AdaBoost collects the knowledge of each weak judge and combines it to solve the learning problem. Please remember that AdaBoost (Adaptive Boosting) is not itself a classifier; it is built on top of a set of weak classifiers or weak learners. It takes the results from the weak learners and applies some logic to solve the learning problem.

Now a question arises: what exactly does AdaBoost do? We could simply take the results of some weak learners and combine them to get a final result, so why AdaBoost?

It does two things:

1. It helps us choose the training set for each new classifier that we want to train, based on the results of the previous weak classifier. We will describe this in detail later.
2. It determines how much weight should be given to each classifier's proposed answer when combining the results. Since the true labels of the training set are available to us, we can compare each classifier's output with the actual result and see how well that classifier is classifying. If it classifies the training examples very well we give it a high weight, otherwise a low weight.

How AdaBoost works:

Let's formulate the problem now. Take 10 training samples (x₁,y₁), (x₂,y₂), ..., (x₁₀,y₁₀). Mapped onto our example they are (A,1), (B,1), (C,1), (H,1), (I,1), (J,1), (D,-1), (E,-1), (F,-1), (G,-1).

AdaBoost assigns each training example a weight D(i); the initial weight for each training example is 1/n, where n is the total number of training examples. In our case the initial weights are 1/10. The weight indicates the importance of the training example. All of these training samples are passed to a weak classifier to determine their class. In reality each weak classifier should be trained on a random subset of the full training set, and the subsets can overlap.
Here we take a somewhat loose version: for the sake of simplicity we pass every training example to each classifier.

Assume that we have y classifiers; call them T₁, T₂, T₃, ..., Tᵧ. In our case the number of classifiers equals the number of judges, so y = 4.

When measuring a weak learner's performance, AdaBoost takes each training example's weight into consideration. A misclassified high-weight training example contributes more to the overall weighted error than a misclassified low-weight training example. To get a low weighted error, a weak learner must focus on high-weight data points and try to predict them correctly. Therefore, by manipulating the weight distribution, we can guide the weak learner to pay attention to high-weight training examples. AdaBoost proceeds in rounds. After training a classifier in each round, AdaBoost increases the weights on the misclassified training examples so that these examples make up a larger part of the next classifier's training set, and hopefully the next classifier trained will perform better on them. So essentially, in each round, we ask the weak learner to focus on the high-weight training examples that previous weak learners could not handle well.


Now the question is how AdaBoost combines the results of all the weak learners, since they are not on the same scale, i.e. they were trained with different weights.
We can think of it as having assigned different tasks to different weak learners, each of which tried its best to do its given task. Intuitively, when we take one weak learner's judgment into the final prediction, we are willing to trust that weak learner more if it did well on its task, and less otherwise.

In other words, we assign a weight αᵣ to the r'th weak learner, and it depends on how well that learner performed on its assigned task.
The learner gets a large reward if it has done a good job on its task.

Calculate the reward αᵣ :

Now let's have a look at αᵣ and see how it is calculated.

Let's define the error rate ∊ᵣ for the r'th classifier. It is the fraction of training examples misclassified by the r'th classifier. If we have n training examples in total and m of them are misclassified by the classifier, then the error rate is m/n.

So ∊ᵣ = m/n

Now we define  αᵣ=(1/2)ln((1-∊ᵣ)/∊ᵣ) 

From the formula we can conclude that the reward αᵣ assigned to the r'th classifier depends on its error rate: if the error rate is low, αᵣ is high, i.e. the reward is high. Less error means more reward.
In our case, assume our first weak classifier classifies 7 out of the 10 examples correctly, so ∊₁ = 3/10 = 0.3 and α₁ = 0.42364893.
Now assume the second weak classifier classifies only 4 out of the 10 examples correctly, so ∊₂ = 6/10 = 0.6 and α₂ = -0.202732554.
Notice that the first classifier has a lower error rate, so its reward is higher, while the second classifier has a higher error rate, so its reward is lower (here even negative). As the error rate decreases, the reward increases; you can plot the formula to visualize this.
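As a quick check of these numbers, here is a minimal Python sketch of the reward formula, using the two error rates assumed in the example above:

```python
import math

def alpha(error_rate):
    """Reward for a weak classifier: alpha_r = 0.5 * ln((1 - eps_r) / eps_r)."""
    return 0.5 * math.log((1 - error_rate) / error_rate)

print(alpha(0.3))   # ~0.4236, first judge (7 out of 10 correct)
print(alpha(0.6))   # ~-0.2027, second judge (4 out of 10 correct)
```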

Calculate the weight Dᵣ₊₁ :

Now we define the weight of each training example for each classifier. That is, before feeding the data to a classifier we calculate the weights first.
The update rule is

Dᵣ₊₁(i) = Dᵣ(i) * exp(-αᵣ yᵢ hᵣ(xᵢ)) / Zᵣ

Here Dᵣ₊₁(i) is the weight of the i'th training example for the (r+1)'th classifier.

Let's analyze how it is calculated. It is the product of Dᵣ(i), the weight of the i'th training example from the previous round, and exp(-αᵣ yᵢ hᵣ(xᵢ)). What does this exp term say? If the previous (r'th) classifier classified the i'th training example correctly, then yᵢ = hᵣ(xᵢ) and their product is 1 irrespective of the class: it may be a positive or a negative example, and 1*1 = -1*-1 = 1.

To make the weights a distribution, they should all add up to 1. To ensure this, we normalize them by dividing each weight by the sum of all the weights, Zᵣ. For example, if all of the calculated weights added up to 10, we would divide each weight by 10 so that they sum to 1 and form a distribution.

So the factor yᵢhᵣ(xᵢ) contributes no magnitude, only a sign. If you remember the plot of exp(x) from high school, it looks like this:


At x = 0 it is 1, and it increases as x increases.

So if the i'th training example is not classified correctly by the r'th classifier, then yᵢhᵣ(xᵢ) = -1, and for misclassified cases the equation becomes

Dᵣ₊₁(i) = Dᵣ(i) * exp(αᵣ) / Zᵣ

So Dᵣ₊₁(i) will be high, since αᵣ is positive (assuming the classifier does better than random chance) and exp(αᵣ) > 1.

But if the i'th training example is classified correctly by the r'th classifier, then yᵢhᵣ(xᵢ) = 1, and for correctly classified cases the equation becomes

Dᵣ₊₁(i) = Dᵣ(i) * exp(-αᵣ) / Zᵣ

So Dᵣ₊₁(i) will be low, since -αᵣ is negative and exp(-αᵣ) < 1.
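Here is a minimal NumPy sketch of one round of this weight update. The true labels are the ones from our training set; the weak learner's prediction vector is made up for illustration (it makes 3 mistakes, matching the ∊ = 0.3 case above):

```python
import numpy as np

# True labels for A, B, C, H, I, J (class 1) and D, E, F, G (class -1)
y = np.array([1, 1, 1, 1, 1, 1, -1, -1, -1, -1])
# Hypothetical predictions h_r(x_i) from one weak learner (3 mistakes)
h = np.array([1, 1, 1, -1, -1, 1, -1, -1, -1, 1])

D = np.full(10, 1 / 10)                    # initial weights D_1(i) = 1/n
eps = np.sum(D[h != y])                    # weighted error rate = 0.3
alpha = 0.5 * np.log((1 - eps) / eps)      # reward ~ 0.4236

D_next = D * np.exp(-alpha * y * h)        # up-weight mistakes, down-weight correct ones
D_next /= D_next.sum()                     # divide by Z_r so the weights sum to 1

print(np.round(D_next, 4))                 # misclassified examples now carry more weight
```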

And the final hypothesis is

H(x) = sign( Σ (r = 1 to y) αᵣ hᵣ(x) )

After all the classifiers 1 to y have been processed, we take the signum of the weighted sum, where αᵣ is the reward of the r'th classifier and hᵣ(x) is the output of the r'th weak classifier.


Just to refresh your memory, the signum function works like this: at x = 0 it is 0, at x > 0 it is 1, and at x < 0 it is -1.





So in our case the result will be 1 or -1, the output of the signum function.
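Tying everything together, here is a small, self-contained sketch of the whole AdaBoost loop on our singing-competition setup. The four judges' vote vectors are made up for illustration; in each round the loop picks the judge with the lowest weighted error, computes its reward, and reweights the training examples (a judge may be picked more than once):

```python
import numpy as np

# True labels for participants A, B, C, H, I, J (1) and D, E, F, G (-1)
y = np.array([1, 1, 1, 1, 1, 1, -1, -1, -1, -1])

# Hypothetical fixed votes of the four judges (one row per judge)
judges = np.array([
    [1, 1, 1, 1, -1, 1, -1, -1, -1, -1],    # J1
    [1, -1, 1, 1, 1, -1, -1, 1, -1, -1],    # J2
    [1, 1, -1, 1, 1, 1, 1, -1, -1, -1],     # J3
    [-1, 1, 1, -1, 1, 1, -1, -1, 1, -1],    # J4
])

n = len(y)
D = np.full(n, 1 / n)                 # initial example weights
alphas, chosen = [], []

for _ in range(len(judges)):          # one boosting round per judge
    errors = np.array([np.sum(D[h != y]) for h in judges])   # weighted error of each judge
    r = int(np.argmin(errors))                                # best judge under current weights
    eps = max(errors[r], 1e-10)                               # guard against log(0)
    a = 0.5 * np.log((1 - eps) / eps)                         # reward alpha_r
    alphas.append(a)
    chosen.append(r)
    D = D * np.exp(-a * y * judges[r])                        # reweight the examples
    D = D / D.sum()                                           # normalize by Z_r

# Final hypothesis: sign of the alpha-weighted vote of the chosen judges
H = np.sign(sum(a * judges[r] for a, r in zip(alphas, chosen)))
print(chosen, np.round(alphas, 3))
print(H, "training accuracy:", np.mean(H == y))
```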

Applying it to our example:

Let's take our singing competition example.
Suppose candidate A is classified by the 4 judges as 1, -1, -1, -1. At first sight it seems that candidate A should belong to class -1, since that is the majority vote. But this may not be the case.

Let the reward given to judge 1 be α₁ = 0.8, the reward given to judge 2 be α₂ = 0.2, and similarly α₃ = 0.1 and α₄ = 0.2.
Putting these into our final hypothesis we get sign((0.8*1) + (0.2*-1) + (0.1*-1) + (0.2*-1)) = sign(0.3) = 1, since 0.3 > 0 by the definition of the signum function.

So from this we conclude that candidate A belongs to class 1. Similarly we can calculate the result for all our training examples.
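A tiny sketch of this weighted vote in Python, using the judge rewards assumed above:

```python
alphas = [0.8, 0.2, 0.1, 0.2]      # rewards of judges J1..J4 (assumed above)
votes = [1, -1, -1, -1]            # their votes for candidate A

weighted_sum = sum(a * v for a, v in zip(alphas, votes))
prediction = 1 if weighted_sum > 0 else (-1 if weighted_sum < 0 else 0)

print(weighted_sum, prediction)    # 0.3 1 -> candidate A is assigned to class 1
```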
Please go through the links below if you want more insight.

http://www.site.uottawa.ca/~stan/csi5387/boost-tut-ppr.pdf
http://rob.schapire.net/papers/explaining-adaboost.pdf