Naive Bayes Classifier

The Naive Bayes classifier (NB) is one of the most effective and performant classification methods in machine learning. It imposes a rather strong assumption on the data, yet it often performs well even in circumstances where this assumption is not met. The method is used to predict categorical variables in supervised learning.
This article attempts to explain the basic theory on which this classifier is built.

The only prerequisite for understanding this article is basic knowledge of probability theory.

Conditional probability

The main framework of this method is conditional probability. Consider a probability measure $P$ defined on a measurable space $(\Omega, \mathcal{A})$. That is, $\Omega$ is some set and $\mathcal{A}$ is a $\sigma$-algebra on it. For any element $A\in\mathcal{A}$ one defines the trace of $\mathcal{A}$ on $A$ by $$ \{B\cap A \ : \ B\in\mathcal{A}\} $$ and one can easily show that this is again a $\sigma$-algebra, but now on the set $A$ instead of $\Omega$.

This consideration gives rise to the definition of a 'restricted' probability measure. For $A$ with $P(A)>0$ it is called the conditional probability given $A$ and defined by $$ P(B \ | \ A):=\frac{P(B\cap A)}{P(A)} $$ It is straightforward to verify that all axioms of a probability measure hold for this definition.

If we reformulate the expression above, it reads $$ P(B\cap A)=P(B \ | \ A)\cdot P(A) $$ We can interpret this as follows: the probability of $B$ and $A$ happening together is the product of the probability of $B$ happening under the condition that $A$ has happened and the probability that $A$ happens. This interpretation makes conditional probability quite plausible.
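
As a quick numerical illustration (a minimal sketch with simulated dice rolls, not part of the classifier itself), we can check the identity $P(B\cap A)=P(B \ | \ A)\cdot P(A)$ empirically, taking $A$ = 'the roll is greater than 3' and $B$ = 'the roll is even':

import random

random.seed(0)
rolls = [random.randint(1, 6) for _ in range(100_000)]

# A: the roll is greater than 3, B: the roll is even
n_A = sum(1 for r in rolls if r > 3)
n_B_and_A = sum(1 for r in rolls if r > 3 and r % 2 == 0)

p_A = n_A / len(rolls)
p_B_and_A = n_B_and_A / len(rolls)
p_B_given_A = n_B_and_A / n_A          # empirical P(B | A)

# both sides should be close to 2/6
print(p_B_and_A, p_B_given_A * p_A)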

Data structure

After this little theoretical setup we come to the classification problem.
The data are supposed to have the following structure:
Feature samples: $$ X= \begin{pmatrix} x_{1,1} & \cdots & x_{1, n}\\ \vdots\\ x_{m,1} & \cdots & x_{m, n} \end{pmatrix} $$ Outcome samples: $$ Y=\begin{pmatrix} y_1 \\ \vdots \\ y_m \end{pmatrix} $$ Where outcomes take values in some finite discrete set $M$, that is we assume $Y$ is a categorical variable.

Constructing a classifier

Let us consider the following conditional probability $$ P(Y=\eta \ | \ X_1=\xi_1 \cap \cdots \cap X_n=\xi_n) $$ This expresses the probability of the outcome $Y$ taking the value $\eta$ under the condition that the $i$-th feature has the value $\xi_i$.
If we knew this probability and were given a new feature vector $(\xi_1, \cdots, \xi_n)$, we would simply predict the value $\eta$ for which the term $P(Y=\eta \ | \ X_1=\xi_1 \cap \cdots \cap X_n=\xi_n)$ is largest.

We are going to develop an effective method to estimate the above conditional probabilities from given data. In theory one could use a direct approach, but this becomes infeasible for several reasons. One of them is that feature variables may be continuous.

In order to ease notation we drop the actual values for a moment and write the above conditional probability as $$ P(Y \ | \ X_1 \cap \cdots \cap X_n) $$ By definition $$ P(Y \ | \ X_1 \cap \cdots \cap X_n)= \frac{P(Y\cap X_1 \cap \cdots \cap X_n)}{P(X_1 \cap \cdots \cap X_n)} $$ With regard to the classification rule described above, we are actually only interested in the value of the numerator, since the denominator does not depend on the value of $Y$. So we are going to concentrate on its computation:
We can apply the definition of conditional probability iteratively to the numerator: $$ P(Y\cap X_1 \cap \cdots \cap X_n)=P(X_1 \ | \ Y\cap X_2 \cap \cdots \cap X_n)\cdot P(Y\cap X_2 \cap \cdots \cap X_n)\\ =P(X_1 \ | \ Y\cap X_2 \cap \cdots \cap X_n)\cdot P(X_2 \ | \ Y\cap X_3 \cap \cdots \cap X_n) \cdot P(Y\cap X_3 \cap \cdots \cap X_n) $$ Proceeding this way we arrive at $$ P(Y\cap X_1 \cap \cdots \cap X_n)=P(X_1 \ | \ Y\cap X_2 \cap \cdots \cap X_n)\cdots P(X_{n-1} \ | \ Y\cap X_n)\cdot P(X_n \ | \ Y)\cdot P(Y) $$ This is still infeasible to estimate since we still have to deal with feature values on the conditioning side. This is where an actually very strong assumption on the data comes into play:

Conditional independence of features

$$ P(X_1\cap \cdots \cap X_n \ | \ Y)=P(X_1 \ | \ Y) \cdots P(X_n \ | \ Y) $$ This is very similar to the statement that random variables are independent of each other, but it additionally takes into account the current value of $Y$. One can translate this assumption into words:

For any $\eta\in M$, under the condition that $Y$ takes the value $\eta$, the value of any feature $X_i$ has no influence on the value of a different feature $X_j$.

This assumption is not too far-fetched, since it is actually something we always want. If features do depend on each other, they essentially duplicate each other's information. Techniques like feature selection specifically try to filter out features with high mutual dependence.

In order to use the above assumption in our computation, let us extract an implication from it.
To ease notation, we restrict to the case of two features, $X_1, \ X_2$; the steps are easily adapted to the general case. By the definition of conditional probability and the above assumption we have on the one hand $$ P(X_1\cap X_2\cap Y)=P(X_1\cap X_2 \ | \ Y)\cdot P(Y)=P(X_1 \ | \ Y)\cdot P(X_2 \ | \ Y)\cdot P(Y) $$ and on the other hand $$ P(X_1\cap X_2\cap Y)=P(X_1 \ | \ Y\cap X_2)\cdot P(Y\cap X_2)=P(X_1 \ | \ Y\cap X_2)\cdot P(X_2 \ | \ Y)\cdot P(Y) $$ Therefore $$ P(X_1 \ | \ Y)\cdot P(X_2 \ | \ Y)\cdot P(Y)=P(X_1 \ | \ Y\cap X_2)\cdot P(X_2 \ | \ Y)\cdot P(Y) $$ from which we can conclude $$ P(X_1 \ | \ Y\cap X_2) = P(X_1 \ | \ Y) $$ This still holds if we replace $X_2$ by the intersection of any number of features, for example $$ P(X_1 \ | \ Y\cap X_2\cap X_3) = P(X_1 \ | \ Y) $$ This formula holds for all features and we can apply it to the right-hand side of the computation of the numerator. That is, $$ P(Y\cap X_1 \cap \cdots \cap X_n)=P(X_1 \ | \ Y\cap X_2 \cap \cdots \cap X_n)\cdots P(X_{n-1} \ | \ Y\cap X_n)\cdot P(X_n \ | \ Y)\cdot P(Y)\\ =P(X_1 \ | \ Y)\cdots P(X_{n-1} \ | \ Y)\cdot P(X_n \ | \ Y)\cdot P(Y) $$ This form is very desirable since probabilities of the form $P(X_i \ | \ Y)$ can be estimated relatively easily. Before we look into this, let us wrap up what we have learned so far.
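
To make this implication concrete, here is a small numerical check; the joint distribution below is entirely made up and constructed so that $X_1$ and $X_2$ are conditionally independent given $Y$, hence $P(X_1 \ | \ Y\cap X_2)$ and $P(X_1 \ | \ Y)$ must coincide:

from itertools import product

# made-up conditional distributions; the joint distribution is built to be
# conditionally independent by design: P(x1, x2, y) = P(x1|y) * P(x2|y) * P(y)
p_y = {0: 0.4, 1: 0.6}
p_x1_given_y = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_x2_given_y = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}

joint = {
    (x1, x2, y): p_x1_given_y[y][x1] * p_x2_given_y[y][x2] * p_y[y]
    for x1, x2, y in product([0, 1], repeat=3)
}

# P(X1=1 | Y=1, X2=0) computed directly from the joint distribution ...
num = joint[(1, 0, 1)]
den = joint[(0, 0, 1)] + joint[(1, 0, 1)]
print(num / den)

# ... equals P(X1=1 | Y=1)
print(p_x1_given_y[1][1])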

Naive Bayes Classifier

If all feature variables $X_1, \cdots, X_n$ are conditionally independent given the outcome $Y$, that is $$ P(X_1 \cap \cdots \cap X_n \ | \ Y)=\prod_{i=1}^n P(X_i \ | \ Y) $$ then we have $$ P(Y\cap X_1 \cap \cdots \cap X_n)=P(Y)\prod_{i=1}^n P(X_i \ | \ Y) $$ The classifier is taken to be $$ (\xi_1, \cdots, \xi_n)\mapsto \arg\max_{\eta\in M} \ P(Y=\eta \ | \ X_1=\xi_1 \cap \cdots \cap X_n=\xi_n) $$ which is the same as $$ (\xi_1, \cdots, \xi_n)\mapsto \arg\max_{\eta\in M} \ P(Y=\eta)\prod_{i=1}^n P(X_i=\xi_i \ | \ Y=\eta) $$
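
Here is a minimal sketch of this decision rule in plain Python, assuming the probabilities $P(Y=\eta)$ and $P(X_i=\xi_i \ | \ Y=\eta)$ have already been estimated somehow; the names nb_predict, prior and likelihoods as well as the toy numbers are made up for this illustration:

def nb_predict(features, prior, likelihoods):
    """Return the class eta maximising P(Y=eta) * prod_i P(X_i=xi_i | Y=eta).

    prior:       dict mapping eta -> P(Y=eta)
    likelihoods: dict mapping (i, xi, eta) -> P(X_i=xi | Y=eta)
    """
    best_eta, best_score = None, -1.0
    for eta, p_eta in prior.items():
        score = p_eta
        for i, xi in enumerate(features):
            score *= likelihoods[(i, xi, eta)]
        if score > best_score:
            best_eta, best_score = eta, score
    return best_eta

# toy example with two binary features and two classes
prior = {0: 0.5, 1: 0.5}
likelihoods = {
    (0, 0, 0): 0.9, (0, 1, 0): 0.1, (1, 0, 0): 0.8, (1, 1, 0): 0.2,
    (0, 0, 1): 0.2, (0, 1, 1): 0.8, (1, 0, 1): 0.3, (1, 1, 1): 0.7,
}
print(nb_predict((1, 1), prior, likelihoods))   # prints 1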


Estimating conditional probabilities

We have seen that the main task of the NB classifier is to estimate the probabilities $$ P(X_i=\xi_i \ | \ Y=\eta) $$ There exist several methods for this task. In this article we restrict ourselves to two of them.

The first deals with the case where the feature $X_i$ is a categorical variable. For given $\eta$ and $\xi_i$ we can estimate for the $i$-th feature $$ P(X_i=\xi_i \ | \ Y=\eta) = \frac{N_i}{N_{\eta}} $$ Here $N_{\eta}$ denotes the number of samples with $Y=\eta$, and $N_i$ denotes the number of samples with $Y=\eta$ and $X_i$ having the value $\xi_i$.
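
In pandas this estimate is a simple counting exercise. A sketch with a toy DataFrame; the column names x1 and y are just placeholders:

import pandas as pd

df = pd.DataFrame({
    "x1": ["a", "a", "b", "b", "a", "b"],
    "y":  [0,   0,   0,   1,   1,   1],
})

eta, xi = 0, "a"
n_eta = (df["y"] == eta).sum()                      # N_eta
n_i = ((df["y"] == eta) & (df["x1"] == xi)).sum()   # N_i
print(n_i / n_eta)   # estimate of P(X_1 = 'a' | Y = 0), here 2/3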

A different situation appears when the feature $X_i$ is a continuous variable; then the above approach does not work. One immediate solution is 'binning': we use any binning method, for instance quantile-based binning, and in effect replace the continuous variable by a categorical one.
There is another technique that is often used. For this, the assumption is made that $X_i$ is normally distributed conditional on $Y$, that is $$ X_i \ | \ Y=\eta \ \sim \ N(\mu, \sigma) $$ and $P(X_i=\xi_i \ | \ Y=\eta)$ is replaced by the density of $N(\mu, \sigma)$ evaluated at $\xi_i$. We obtain the parameters $\mu$ and $\sigma$ as usual by computing the mean and standard deviation of the values of $X_i$ that belong to outcomes with $Y=\eta$.
Of course this approach requires verifying on the data that $X_i$ indeed follows a normal distribution conditional on $Y$.
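
A sketch of this Gaussian variant with toy data and placeholder names: estimate $\mu$ and $\sigma$ from the samples with $Y=\eta$ and evaluate the normal density at a new value $\xi_i$:

import math
import pandas as pd

df = pd.DataFrame({
    "x1": [1.2, 0.9, 1.1, 3.0, 2.8, 3.3],   # continuous toy feature
    "y":  [0,   0,   0,   1,   1,   1],
})

def gaussian_likelihood(df, feature, eta, xi):
    values = df.loc[df["y"] == eta, feature]
    mu, sigma = values.mean(), values.std()   # sample mean and standard deviation
    # density of N(mu, sigma) evaluated at xi
    return math.exp(-0.5 * ((xi - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

print(gaussian_likelihood(df, "x1", 0, 1.0))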

Example

Finally, let us look at an implementation using scikit-learn.
The data set we use is taken from https://archive.ics.uci.edu/ml/datasets/Census-Income+%28KDD%29 .


import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.naive_bayes import CategoricalNB
from sklearn.model_selection import train_test_split

data = pd.read_csv(
  "example-code/naive-bayes/census-income.csv",
  delimiter=",",
  index_col=False,
  header=None
)
data = data.dropna()
outcome_col = len(data.columns)-1

# encoding categorical features
object_data = data.select_dtypes(include=object)
enc = OrdinalEncoder()
enc.fit(object_data)
data[object_data.columns] = enc.transform(object_data).astype(int)

# binning continuous features
continuous_data = data.select_dtypes(include='float64')
for col in continuous_data.columns:
  data[col] = pd.qcut(data[col], q=10, labels=False)

# split into train and test data
X_train, X_test, y_train, y_test = train_test_split(
  data.iloc[:, :-1],
  data.loc[:, [outcome_col]],
  random_state=0
)
y_train = y_train.values.reshape(-1,)
y_test = y_test.values.reshape(-1,)

# fit model
clf = CategoricalNB()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
print(clf.score(X_train, y_train))

The code is pretty easy to understand:
After loading the data we encode the categorical features, since CategoricalNB expects the categories of each feature to be encoded as integers.

object_data = data.select_dtypes(include=object)
enc = OrdinalEncoder()
enc.fit(object_data)
data[object_data.columns] = enc.transform(object_data).astype(int)

Then we use the binning approach to transform the continuous variables into categorical ones:

continuous_data = data.select_dtypes(include='float64')
for col in continuous_data.columns:
  data[col] = pd.qcut(data[col], q=10, labels=False)

And finally we train the model:

clf = CategoricalNB()
clf.fit(X_train, y_train)

Running the code prints the scores on the test and training data:
0.866442132274814
0.8691543817912083

Given that we have not done anything special, this result is not too bad. In particular, the small gap between the training and test scores shows that the model produces almost no over-fitting.

Final notes

Especially when dealing with large data sets and high dimensionality, NB is a very attractive classifier. The theoretical foundation on which it is based is rather simple, and its implementation is straightforward and can be made very efficient.
