Linear Support Vector Machine

The traditional support vector machine (SVM) provides a linear model for binary classification and applies to supervised learning problems. In this article we want to look at its properties and the most fundamental theory behind it. At the end we look at an example of how one can create an SVM model in practice.

The only prerequisite for understanding this article is some basic knowledge of linear algebra, as offered by many standard introductions.

We start by summarizing some definitions and facts in order to understand the building blocks of an SVM.

Definition: Separated by hyperplane

Two sets of points $A$ and $B$ in $R^n$ are said to be separated by a hyperplane if there are vectors $\mathbf{q}, \ \mathbf{w} \in R^n$ and some positive $\alpha\in R$ such that $$ A \subset \{\mathbf{x}\in R^n \ | \ (\mathbf{x}-\mathbf{q} \ | \ \mathbf{w}) > (\alpha\mathbf{w} \ | \ \mathbf{w})\} $$ and $$ B \subset \{\mathbf{x}\in R^n \ | \ (\mathbf{x}-\mathbf{q} \ | \ \mathbf{w}) < -(\alpha\mathbf{w} \ | \ \mathbf{w})\} $$
Here, $(\cdot \ | \ \cdot)$ denotes the usual inner product in $R^n$.
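
To make the definition concrete, here is a minimal numpy sketch (the point sets, $\mathbf{q}$, $\mathbf{w}$ and $\alpha$ are made up for illustration) which checks the two inclusions:

import numpy as np

# hypothetical parameters of a separating hyperplane in R^2
q = np.array([0.0, 0.0])
w = np.array([1.0, 1.0])
alpha = 0.5

# two small point sets which happen to satisfy the definition
A = np.array([[2.0, 2.0], [3.0, 1.5]])
B = np.array([[-2.0, -2.0], [-1.5, -3.0]])

ww = w @ w  # (w | w)

# (x - q | w) > alpha * (w | w) for every point of A
print(np.all((A - q) @ w > alpha * ww))   # True
# (x - q | w) < -alpha * (w | w) for every point of B
print(np.all((B - q) @ w < -alpha * ww))  # True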

Let us shed some light on this definition.
First of all, remember from analytic geometry: for a non-zero vector $\mathbf{w}\in R^n$, the set of points $\mathbf{x}\in R^n$ which fulfill $$ (\mathbf{x} \ | \ \mathbf{w})=0 $$ describes an $(n-1)$-dimensional linear subspace.
In terms of affine geometry, it is a plane (hyperplane) which is orthogonal to the vector $\mathbf{w}$.
We denote the above space by $V$, that is $$ V=\{\mathbf{x}\in R^{n} \ | \ (\mathbf{x} \ | \ \mathbf{w})=0\} $$ Let us consider another vector $\mathbf{q}\in R^n$ and the set of points obtained from $$ H:=\mathbf{q}+V $$ This set consists of all points of the form $\mathbf{q}+\mathbf{v}$ for some $\mathbf{v}\in V$ and is the hyperplane obtained by 'shifting' the linear space $V$ by $\mathbf{q}$.

In $R^2$ we can illustrate this:
As is easily seen, we can describe $H$ as the set of points $\mathbf{x}\in R^n$ which fulfill $$ (\mathbf{x}-\mathbf{q} \ | \ \mathbf{w})=0 $$
Next, let us assume that for some $\mathbf{h}\in R^n$ we have $$ (\mathbf{h}-\mathbf{q} \ | \ \mathbf{w}) > 0 $$ Then there is a positive $\alpha\in R$ such that $$ (\mathbf{h}-\mathbf{q} \ | \ \mathbf{w})=\alpha (\mathbf{w} \ | \ \mathbf{w}) $$ By rearranging this equation we get $$ (\mathbf{h}-(\mathbf{q}+\alpha\mathbf{w}) \ | \ \mathbf{w})=0 $$ We define $\mathbf{a}:=\alpha\mathbf{w}$.
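
A tiny numpy check (with made-up $\mathbf{q}$, $\mathbf{w}$ and $\mathbf{h}$) illustrates this step: we compute $\alpha$ from the first equation and verify the rearranged one.

import numpy as np

q = np.array([1.0, 0.0])
w = np.array([0.0, 2.0])
h = np.array([3.0, 5.0])   # chosen so that (h - q | w) > 0

# alpha such that (h - q | w) = alpha * (w | w)
alpha = (h - q) @ w / (w @ w)

# rearranged equation: (h - (q + alpha * w) | w) = 0
print(np.isclose((h - (q + alpha * w)) @ w, 0.0))  # True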

In our picture this translates to:
As we see, the point $\mathbf{h}$ lies on a hyperplane $H'$ which is shifted from $H$ along $\mathbf{w}$ in the positive direction. We can make the same consideration for points which fulfill $$ (\mathbf{h}-\mathbf{q} \ | \ \mathbf{w}) < 0 $$ and observe that $\mathbf{h}$ then lies on a hyperplane shifted in the negative direction along $\mathbf{w}$.

Altogether the hyperplane $H$ splits $R^n$ into two disjoint convex sets: $$ C_{+}:=\{\mathbf{x} \ | \ (\mathbf{x}-\mathbf{q} \ | \ \mathbf{w}) > 0\} $$ and $$ C_{-}:=\{\mathbf{x} \ | \ (\mathbf{x}-\mathbf{q} \ | \ \mathbf{w}) < 0\} $$ The first is called the positive half-space, the latter the negative half-space.

Looking back at the above definition we see that we actually deal with two hyperplanes: $$ H_1:=\{\mathbf{x}\in R^n \ | \ (\mathbf{x}-(\mathbf{q}+\alpha\mathbf{w}) \ | \ \mathbf{w}) = 0\} $$ and $$ H_2:=\{\mathbf{x}\in R^n \ | \ (\mathbf{x}-(\mathbf{q}-\alpha\mathbf{w}) \ | \ \mathbf{w}) = 0\} $$ The first is shifted from $H$ along $\mathbf{w}$ in the positive direction, the latter in the negative direction.

The definition states that $A$ must lie in the positive half-space of $H_1$ and $B$ in the negative half-space of $H_2$: $$ A \subset C_A $$ $$ B \subset C_B $$ where $$ C_A:=\{\mathbf{x}\in R^n \ | \ (\mathbf{x}-\mathbf{q} \ | \ \mathbf{w}) > (\alpha\mathbf{w} \ | \ \mathbf{w})\} $$ and $$ C_B:=\{\mathbf{x}\in R^n \ | \ (\mathbf{x}-\mathbf{q} \ | \ \mathbf{w}) < -(\alpha\mathbf{w} \ | \ \mathbf{w})\} $$

The classifier

We come to machine learning and assume we are given data where each sample comprises $n$ features and a binary outcome which takes either the value $1$ or $-1$.
Each set of features is considered as a point in $R^n$.
This gives us two sets, $A$ and $B$, the first belonging to those features which result in outcome $1$ and the second to those resulting in $-1$.
In case $A$ and $B$ can be separated by a hyperplane $H$ as above, we can use this hyperplane for classification:
Everything lying in $C_{+}$ is classified as $1$ and everything lying in $C_{-}$ as $-1$.
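
As a small sketch (with hypothetical $\mathbf{q}$ and $\mathbf{w}$; the function name is our own), such a classifier is just a sign check:

import numpy as np

def classify(X, q, w):
    # returns 1 for points in the positive half-space C_+ and -1 otherwise
    return np.where((X - q) @ w > 0, 1, -1)

# illustration with made-up data
q = np.array([0.0, 0.0])
w = np.array([1.0, -1.0])
X = np.array([[2.0, 0.0], [0.0, 2.0]])
print(classify(X, q, w))   # [ 1 -1]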

In 2-d we can illustrate this by:
The red and blue points represent features from samples with outcome $1$ and $-1$, respectively.

Finding an optimal hyperplane

Intuitively, if we choose $H$ to have as much distance as possible from both sets $A$ and $B$, we expect our classification to improve. Therefore we translate the problem of finding $H$ into an optimization problem:
Find $\alpha\in R$ and vectors $\mathbf{q},\mathbf{w}\in R^n$ such that the distance between the hyperplanes $$ H_1:=\{\mathbf{x} \ | \ (\mathbf{x}-\mathbf{q} \ | \ \mathbf{w})=\alpha(\mathbf{w} \ | \ \mathbf{w})\} $$ and $$ H_2:=\{\mathbf{x} \ | \ (\mathbf{x}-\mathbf{q} \ | \ \mathbf{w})=-\alpha(\mathbf{w} \ | \ \mathbf{w})\} $$ becomes maximal under the constraint that they separate $A$ and $B$.

Lemma

The distance between $H_1$ and $H_2$ is $2\alpha|\mathbf{w}|$. We write, $$ dist(H_1, H_2) = 2\alpha|\mathbf{w}| $$

Proof:
By definition $$ H_1 = \mathbf{q}+\alpha\mathbf{w} + V $$ and $$ H_2 = \mathbf{q}-\alpha\mathbf{w} + V $$ where $V$ is the orthogonal space to $\mathbf{w}$.
Thus for any two points $\mathbf{p}_1\in H_1, \ \mathbf{p}_2\in H_2$ we have $$ |\mathbf{p}_1-\mathbf{p}_2|^2=|(\mathbf{q}+\alpha\mathbf{w} + \mathbf{x}_1) - (\mathbf{q}-\alpha\mathbf{w} +\mathbf{x}_2)|^2 $$ for appropriate $\mathbf{x}_1, \mathbf{x}_2\in V$.
The r.h.s. can be simplified to $$ |2\alpha\mathbf{w} +\mathbf{x}_1 -\mathbf{x}_2|^2 $$ Since $\mathbf{w}\in V^{\bot}$ we can apply the Pythagorean theorem: $$ |2\alpha\mathbf{w} +\mathbf{x}_1 -\mathbf{x}_2|^2=4\alpha^{2}|\mathbf{w}|^2 + |\mathbf{x}_1 -\mathbf{x}_2|^2 $$ This shows $$ |\mathbf{p}_1-\mathbf{p}_2|^2\geq 4\alpha^{2}|\mathbf{w}|^2 $$ Equality is obtained by choosing $\mathbf{p}_1, \ \mathbf{p}_2$ such that $\mathbf{x}_1 =\mathbf{x}_2$.
Hence the distance, i.e. the minimal length $|\mathbf{p}_1-\mathbf{p}_2|$ over all such pairs, is given by $2\alpha|\mathbf{w}|$.
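
As a quick numerical sanity check of the lemma (dimension, $\mathbf{q}$, $\mathbf{w}$ and $\alpha$ chosen arbitrarily), we draw points from $H_1$ and $H_2$ by adding vectors orthogonal to $\mathbf{w}$ and confirm the decomposition used in the proof:

import numpy as np

rng = np.random.default_rng(0)
n, alpha = 4, 0.7
q = rng.normal(size=n)
w = rng.normal(size=n)

def random_orthogonal(w):
    # a random vector in V, i.e. orthogonal to w
    v = rng.normal(size=w.shape)
    return v - (v @ w) / (w @ w) * w

x1, x2 = random_orthogonal(w), random_orthogonal(w)
p1 = q + alpha * w + x1   # a point of H_1
p2 = q - alpha * w + x2   # a point of H_2

lhs = np.sum((p1 - p2) ** 2)
rhs = 4 * alpha**2 * (w @ w) + np.sum((x1 - x2) ** 2)
print(np.isclose(lhs, rhs))            # Pythagorean decomposition
print(lhs >= 4 * alpha**2 * (w @ w))   # lower bound from the proof

# the minimum 2*alpha*|w| is attained when x1 = x2
p2_min = q - alpha * w + x1
print(np.isclose(np.linalg.norm(p1 - p2_min), 2 * alpha * np.linalg.norm(w)))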


Let us assume the triple $\alpha, \ \mathbf{q}, \ \mathbf{w}$ fulfills the constraints, that is $H_1, \ H_2$ separate $A$ and $B$.
The hyperplanes $H_1, \ H_2$ can also be described using a rescaled normal vector, provided we adjust the right-hand sides accordingly. We therefore define $$ \mathbf{w}':=\frac{\mathbf{w}}{\alpha(\mathbf{w} \ | \ \mathbf{w})} $$ and replace $\mathbf{w}$ by $\mathbf{w}'$ in the definition of $H_1, \ H_2$.
This yields $$ H_1=\{\mathbf{x} \ | \ (\mathbf{x}-\mathbf{q} \ | \ \mathbf{w}') = 1\} $$ $$ H_2=\{\mathbf{x} \ | \ (\mathbf{x}-\mathbf{q} \ | \ \mathbf{w}') = -1\} $$ The same replacement can be done in the definition of the half-spaces, which become $$ C_A:=\{\mathbf{x} \ | \ (\mathbf{x}-\mathbf{q} \ | \ \mathbf{w}') > 1\} $$ and $$ C_B:=\{\mathbf{x}\in R^n \ | \ (\mathbf{x}-\mathbf{q} \ | \ \mathbf{w}') < -1\} $$ The distance between $H_1$ and $H_2$ can be expressed in terms of $\mathbf{w}'$ as well. Just note that by definition of $\mathbf{w}'$ we have $|\mathbf{w}'|=\frac{1}{\alpha|\mathbf{w}|}$ and so we obtain $$ dist(H_1, H_2)=2\alpha|\mathbf{w}|=\frac{2}{|\mathbf{w}'|} $$
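
The rescaling can be confirmed numerically as well (again with arbitrary $\alpha$, $\mathbf{q}$, $\mathbf{w}$): after replacing $\mathbf{w}$ by $\mathbf{w}'$ the right-hand sides become $\pm 1$ and the distance equals $2/|\mathbf{w}'|$.

import numpy as np

rng = np.random.default_rng(1)
n, alpha = 3, 0.4
q = rng.normal(size=n)
w = rng.normal(size=n)

# rescaled normal vector w' = w / (alpha * (w | w))
w_prime = w / (alpha * (w @ w))

# take a point x on H_1, i.e. (x - q | w) = alpha * (w | w)
v = rng.normal(size=n)
v -= (v @ w) / (w @ w) * w            # component orthogonal to w
x = q + alpha * w + v

print(np.isclose((x - q) @ w_prime, 1.0))         # (x - q | w') = 1
print(np.isclose(2 / np.linalg.norm(w_prime),
                 2 * alpha * np.linalg.norm(w)))  # dist(H_1, H_2) = 2 / |w'|
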
So we have simplified the optimization problem by removing $\alpha$: we search for $\mathbf{w}'$ and $\mathbf{q}$ such that $C_A, \ C_B$ separate $A$ and $B$ while $dist(H_1, H_2)=\frac{2}{|\mathbf{w}'|}$ becomes as large as possible.

The constraint can also be written in a more compact form by noting that for outcomes with value $1$ the term $(\mathbf{x}-\mathbf{q} \ | \ \mathbf{w}')$ is supposed to be at least $1$ and for outcomes with value $-1$ the same term is supposed to be at most $-1$.
Therefore we can write $$ (\mathbf{x}_{i}-\mathbf{q} \ | \ \mathbf{w}')\cdot y_{i}\geq 1 $$ where $\mathbf{x}_i$ denotes the features of the $i$'th sample and $y_i$ the corresponding outcome.

Since our classifier decides based on which half-space of $H$ a given feature vector $\mathbf{x}$ lies in, that is $$ \mathbf{x}\mapsto sgn \ (\mathbf{x}-\mathbf{q} \ | \ \mathbf{w}') $$ we actually don't care about the vector $\mathbf{q}$ itself but only about the product $(\mathbf{q} \ | \ \mathbf{w}')$. So we can just replace it by a real number $\beta$ and the constraints become $((\mathbf{x}_{i} \ | \ \mathbf{w}') - \beta)\cdot y_{i}\geq 1$ for all $i$.
This means our search is over $\mathbf{w}'$ and $\beta$.
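
The following small check (random vectors for illustration) confirms that replacing $\mathbf{q}$ by the scalar $\beta=(\mathbf{q} \ | \ \mathbf{w}')$ leaves the decision unchanged:

import numpy as np

rng = np.random.default_rng(2)
q, w_prime = rng.normal(size=3), rng.normal(size=3)
beta = q @ w_prime                    # replace the vector q by the scalar (q | w')

x = rng.normal(size=3)
# sgn((x - q | w')) equals sgn((x | w') - beta)
print(np.sign((x - q) @ w_prime) == np.sign(x @ w_prime - beta))   # True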

We summarize:

Support vector machine

In case the feature sets $A$ and $B$ which correspond to outcomes $1$ resp. $-1$ can be separated by a hyperplane, then the optimal separation can be achieved by solving:

Find $\mathbf{w}\in R^n, \ \beta\in R$ such that $|\mathbf{w}|$ is minimized under the constraints
$((\mathbf{x}_{i}\ | \ \mathbf{w})-\beta)\cdot y_{i}\geq 1$ for all $i$.

With such optimal $\mathbf{w}\in R^n, \ \beta\in R$, the classifier is taken to be $$ \mathbf{x}\mapsto sgn \ \left((\mathbf{x} \ | \ \mathbf{w}) - \beta\right) $$

In practice the sets $A$ and $B$ can rarely be separated by a hyperplane. Therefore one resorts to a weaker optimization criterion. Consider, for the $i$'th sample, the function $$ (\mathbf{w}, \beta)\mapsto \max\left(0, 1-\left((\mathbf{x}_i \ | \ \mathbf{w}) - \beta\right)\cdot y_i\right) $$ which is sometimes referred to as the hinge loss function. If for the $i$'th sample the point $\mathbf{x}_i$ lies in the correct half-space, that is in case $y_i=1$ it lies in $C_A$ and otherwise in $C_B$, then by definition of $C_A, \ C_B$ we have $(\mathbf{x}_i \ | \ \mathbf{w}) - \beta > 1$ resp. $(\mathbf{x}_i \ | \ \mathbf{w}) - \beta < -1$. This implies $$ \left( (\mathbf{x}_i \ | \ \mathbf{w}) - \beta \right) \cdot y_i > 1 $$ and so the above function takes the value $0$.
If, on the contrary, the $i$'th sample $\mathbf{x}_i$ violates the margin, that is it lies in the wrong half-space or between $H_1$ and $H_2$, then $\left((\mathbf{x}_i \ | \ \mathbf{w}) - \beta\right)\cdot y_i < 1$. Therefore $1-\left((\mathbf{x}_i \ | \ \mathbf{w}) - \beta\right)\cdot y_i > 0$ and thus the function takes a positive value.
Moreover, since $(\mathbf{x}_i \ | \ \mathbf{w}) - \beta$ is proportional to the signed distance between $\mathbf{x}_i$ and $H$, the function becomes larger the 'further' $\mathbf{x}_i$ lies on the wrong side. The latter can be seen by noting that the point in $H$ with shortest distance to $\mathbf{x}_i$ is its orthogonal projection onto $H$.
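
As an illustration, a vectorized version of the hinge loss could look like this (the function name and arguments are our own, not taken from any library):

import numpy as np

def hinge_loss(X, y, w, beta):
    # per-sample hinge loss max(0, 1 - ((x_i | w) - beta) * y_i)
    margins = (X @ w - beta) * y
    return np.maximum(0.0, 1.0 - margins)

# made-up example: one sample well inside the correct half-space,
# one sample violating the margin
X = np.array([[2.0, 0.0], [0.2, 0.0]])
y = np.array([1, 1])
w = np.array([1.0, 0.0])
print(hinge_loss(X, y, w, beta=0.0))   # [0.  0.8]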

This consideration leads to weakening the constraints: we just require that the average value of the hinge loss function, together with $|\mathbf{w}|$, becomes as small as possible.

We summarize:

Support vector machine (general)

Find $\mathbf{w}\in R^n$ and $\beta\in R$ which minimize $$ \frac{1}{m}\sum_{i=1}^m \max\left(0, 1-\left((\mathbf{x}_i \ | \ \mathbf{w}) - \beta\right)\cdot y_i\right) + \lambda |\mathbf{w}|^2 $$ for some chosen positive $\lambda$.
With such optimal $\mathbf{w}\in R^n, \ \beta\in R$, the classifier is taken to be $$ \mathbf{x}\mapsto sgn \ \left((\mathbf{x} \ | \ \mathbf{w}) - \beta\right) $$
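
Putting the pieces together, the objective of this general formulation can be written down directly; a minimal sketch with our own function name:

import numpy as np

def svm_objective(w, beta, X, y, lam):
    # average hinge loss plus lam * |w|^2, as in the formulation above
    margins = (X @ w - beta) * y
    hinge = np.maximum(0.0, 1.0 - margins)
    return hinge.mean() + lam * np.dot(w, w)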

Note, the chosen value of $\lambda$ determines how much importance is put on minimizing $|\mathbf{w}|^2$ during optimization. As we saw, the distance between the hyperplanes $H_1, \ H_2$ is $2/|\mathbf{w}|$. This shows that the larger the value of $\lambda$, the larger this distance is pushed to become and thus the better the model tends to generalize. Of course, too large values for $\lambda$ must be avoided as well, since the optimization would then neglect the hinge loss term and hence the constraints of separating the sample features.

Solution techniques

The above minimization target, which incorporates the constraints in a weakened form, is a convex function in $\mathbf{w}$ and $\beta$. For such functions very efficient optimization techniques like subgradient descent and coordinate descent have been developed. We won't look into further details here; they will be covered in their own article which focuses on optimization techniques for machine learning in general.
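
To give a rough idea of what such a solver can look like, below is a minimal subgradient descent sketch for the above objective. It is written for illustration only (step size, iteration count and the toy data are made up); production libraries use considerably more refined methods.

import numpy as np

def svm_subgradient_descent(X, y, lam=0.01, lr=0.1, n_iter=1000):
    # minimize (1/m) * sum of hinge losses + lam * |w|^2 by plain subgradient descent
    m, n = X.shape
    w, beta = np.zeros(n), 0.0
    for _ in range(n_iter):
        margins = (X @ w - beta) * y
        active = margins < 1                 # samples with non-zero hinge loss
        # subgradient with respect to w and beta
        grad_w = -(y[active, None] * X[active]).sum(axis=0) / m + 2 * lam * w
        grad_beta = y[active].sum() / m
        w -= lr * grad_w
        beta -= lr * grad_beta
    return w, beta

# toy usage with two well separated clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 0.5, size=(20, 2)),
               rng.normal(-2.0, 0.5, size=(20, 2))])
y = np.array([1] * 20 + [-1] * 20)
w, beta = svm_subgradient_descent(X, y)
print(np.mean(np.sign(X @ w - beta) == y))   # should be close to 1.0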

Example implementation

We look at an application of SVM as one would implement it in scikit-learn. The data set we use is taken from https://archive.ics.uci.edu/ml/datasets/Divorce+Predictors+data+set .

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

## load and clean the data set
data = pd.read_csv(
    "example-code/divorce/divorce.csv",
    delimiter=";",
    index_col=False, header=0
)
data = data.dropna()

## split into train and test data
X_train, X_test, y_train, y_test = train_test_split(
    data.iloc[:, :-1],
    data["Class"],
    random_state=0
)

## fit model
clf = LinearSVC(C=0.0005, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_train, y_train))
print(clf.score(X_test, y_test))

## plot the fitted coefficients, one per feature
plt.scatter(np.arange(0, len(clf.coef_[0])), clf.coef_[0])
print(clf.intercept_)
plt.show()

After loading and splitting data into train and test samples we fit the model:

clf = LinearSVC(C=0.0005, random_state=0)
clf.fit(X_train, y_train)

'C' is the regularization parameter, which roughly translates to $1/\lambda$ in our notation. We had to choose strong regularization (a small value of C) in order to prevent the model from over-fitting the data.
If one runs this code the output should be:
0.984251968503937
0.9534883720930233
which shows the score on the train data and on the test data.

The data set is a typical candidate for an SVM because we have many features (54) but only few samples (170). The resulting score on the test data is therefore quite okay. Of course, in practice, with so few data, one would do some cross-validation in order to gain more confidence in the result.
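
For instance, with scikit-learn this could be done via cross_val_score; a sketch, assuming data has been loaded as in the code above:

from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

## 5-fold cross-validation on the full data set loaded above
scores = cross_val_score(
    LinearSVC(C=0.0005, random_state=0),
    data.iloc[:, :-1],
    data["Class"],
    cv=5,
)
print(scores.mean(), scores.std())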

Final notes

The SVM classifier is very well suited for cases when the sample data are large or when there are many features and fewer samples. Solvers are very performant and the fitted coefficients allow one to interpret how certain features affect the prediction. In cases where the feature dimension is low, a linear SVM sometimes does not fit well since it requires the classes to be separable by a hyperplane. In such cases there exists a very nice remedy - using non-linear kernels - but more on that in another article.
