Naive Bayes Classification
Naive Bayes is a modeling assumption used in classification: the observed features are assumed to be conditionally independent given the class assignment. Despite its name, the standard naive Bayes model does not use Bayesian inference; its parameters are instead fit by maximum likelihood estimation.
Abstractly, naive Bayes is a conditional probability model: given a problem instance to be classified, represented by a vector $$\mathbf{x} = (x_1, \ldots, x_n)$$ of n features (independent variables), it assigns to this instance the probabilities
$$p(C_k \mid x_1, \ldots, x_n)$$ for each of the K possible outcomes or classes $$C_k$$.
The problem with the above formulation is that if the number of features n is large or if a feature can take on a large number of values, then basing such a model on probability tables is infeasible. We therefore reformulate the model to make it more tractable. Using Bayes’ theorem, the conditional probability can be decomposed as
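$$p(C_k \mid \mathbf{x}) = \frac{p(C_k)\, p(\mathbf{x} \mid C_k)}{p(\mathbf{x})}$$ In plain English, using Bayesian probability terminology, the above equation can be written as
$$\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}}$$ In practice, only the numerator of that fraction matters: the denominator does not depend on the class, and the feature values are given, so the denominator is effectively constant. The numerator is equivalent to the joint probability model
$$p(C_k, x_1, \ldots, x_n)$$, which can be rewritten as follows, using the chain rule for repeated applications of the definition of conditional probability: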
$$\begin{aligned} p(C_k, x_1, \ldots, x_n) &= p(x_1, \ldots, x_n, C_k) \\ &= p(x_1 \mid x_2, \ldots, x_n, C_k)\, p(x_2, \ldots, x_n, C_k) \\ &= p(x_1 \mid x_2, \ldots, x_n, C_k)\, p(x_2 \mid x_3, \ldots, x_n, C_k)\, p(x_3, \ldots, x_n, C_k) \\ &= \cdots \\ &= p(x_1 \mid x_2, \ldots, x_n, C_k)\, p(x_2 \mid x_3, \ldots, x_n, C_k) \cdots p(x_{n-1} \mid x_n, C_k)\, p(x_n \mid C_k)\, p(C_k) \end{aligned}$$ Now the “naive” conditional independence assumptions come into play: assume that each feature $$x_i$$ is conditionally independent of every other feature $$x_j$$ for $$j \neq i$$, given the category $$C_k$$. This means that
$$p(x_i \mid x_{i+1}, \ldots, x_n, C_k) = p(x_i \mid C_k)$$. Thus, the joint model can be expressed as
$$\begin{aligned} p(C_k \mid x_1, \ldots, x_n) &\varpropto p(C_k, x_1, \ldots, x_n) \\ &\varpropto p(C_k)\, p(x_1 \mid C_k)\, p(x_2 \mid C_k)\, p(x_3 \mid C_k) \cdots \\ &\varpropto p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k). \end{aligned}$$
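To make the tractability gain concrete, the sketch below counts the free parameters needed for the class-conditional distributions with and without the independence assumption. It assumes every feature takes the same number of discrete values (binary by default). The function names `full_table_params` and `naive_bayes_params` are illustrative and not taken from any library. The class priors are left out of both counts because they cost the same in either model.

```python
def full_table_params(n_features: int, n_classes: int, n_values: int = 2) -> int:
    """Free parameters for an unrestricted table p(x_1, ..., x_n | C_k).

    Each class needs one probability per joint feature configuration,
    minus one because the probabilities must sum to 1.
    """
    return n_classes * (n_values ** n_features - 1)


def naive_bayes_params(n_features: int, n_classes: int, n_values: int = 2) -> int:
    """Free parameters for the factorized model prod_i p(x_i | C_k).

    Each class needs one small table per feature, each with
    (n_values - 1) free entries.
    """
    return n_classes * n_features * (n_values - 1)


if __name__ == "__main__":
    # Compare the two counts for a two-class problem with binary features.
    for n in (5, 10, 30):
        print(n, full_table_params(n, 2), naive_bayes_params(n, 2))
```

Under these assumptions the unrestricted table grows exponentially in the number of features, while the factorized model grows linearly.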
This means that under the above independence assumptions, the conditional distribution over the class variable $$C$$ is:
$$p(C_k \mid x_1, \ldots, x_n) = \frac{1}{Z}\, p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k)$$ where the evidence $$Z = p(\mathbf{x}) = \sum_{k} p(C_k)\, p(\mathbf{x} \mid C_k)$$ is a scaling factor dependent only on $$x_1, \ldots, x_n$$, that is, a constant if the values of the feature variables are known.
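As a minimal sketch of this formula, the function below computes the class posterior from a prior table and per-feature conditional tables, then divides by the evidence. The function name `naive_bayes_posterior`, the dictionary layout, and the spam/ham numbers are all made up for illustration. The sketch assumes discrete features and strictly positive probabilities, since `math.log(0)` raises an error. It works in log space to avoid underflow when many small factors are multiplied.

```python
import math


def naive_bayes_posterior(priors, likelihoods, x):
    """Return p(C_k | x) for every class under the naive Bayes factorization.

    priors:      dict mapping class -> p(C_k)
    likelihoods: dict mapping class -> list of dicts, where
                 likelihoods[c][i][v] = p(x_i = v | C_k = c)
    x:           sequence of observed feature values
    """
    # log of the unnormalized score: log p(C_k) + sum_i log p(x_i | C_k)
    log_scores = {}
    for c, prior in priors.items():
        log_score = math.log(prior)
        for i, value in enumerate(x):
            log_score += math.log(likelihoods[c][i][value])
        log_scores[c] = log_score

    # Subtract the maximum before exponentiating for numerical stability;
    # the shift cancels when dividing by the sum of shifted scores.
    shift = max(log_scores.values())
    unnormalized = {c: math.exp(s - shift) for c, s in log_scores.items()}
    total = sum(unnormalized.values())  # equals Z / exp(shift)
    return {c: u / total for c, u in unnormalized.items()}


# Hypothetical two-class example with two binary features.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": [{1: 0.8, 0: 0.2}, {1: 0.1, 0: 0.9}],
    "ham": [{1: 0.1, 0: 0.9}, {1: 0.5, 0: 0.5}],
}
posterior = naive_bayes_posterior(priors, likelihoods, x=(1, 0))
# spam: 0.4 * 0.8 * 0.9 = 0.288, ham: 0.6 * 0.1 * 0.5 = 0.03, Z = 0.318
```

Because Z is the same for every class, picking the most probable class only requires comparing the unnormalized scores. The division is needed only when calibrated probabilities are wanted.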