In some scenarios the traditional support vector machine (SVM), which is based on a linear separation of features, does not fit the specific structure of the data very well.
For these cases one can try to separate by non-linear shapes rather than hyperplanes.
This article intends to describe this approach and to explain its underlying theory.
In this sense, the model described here belongs to the supervised classifiers.
A prerequisite for understanding this article is knowledge of the linear SVM, as obtained for instance from my article Linear Support Vector Machine.
The initial situation is the same as for linear SVMs. The data are supposed to have the following structure:
Feature samples:
$$
X=
\begin{pmatrix}
x_{1,1} & \cdots & x_{1, n}\\
\vdots & \ddots & \vdots\\
x_{m,1} & \cdots & x_{m, n}
\end{pmatrix}
$$
Outcome samples:
$$
y=
\begin{pmatrix}
y_1 \\
\vdots \\
y_m
\end{pmatrix}
$$
Where outcomes take values in $\{-1, 1\}$.
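To fix ideas, such a data set could be represented in code as follows (a small sketch with made-up values; numpy is used for convenience):
import numpy as np

## m = 4 feature samples with n = 2 features each
X = np.array([
    [0.2, 0.1],
    [-0.3, 0.4],
    [1.5, 0.0],
    [-1.0, 1.2]
])

## outcome samples, taking values in {-1, 1}
y = np.array([1, 1, -1, -1])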
Remember, in linear SVM we split the feature samples into two sets, $A$ and $B$. The set $A$ contains exactly those features $\mathbf{x}_i$
with $y_i=1$ and $B$ those with $y_i=-1$.
The use of a linear SVM required that $A$ and $B$ can be separated by a hyperplane as well as possible.
In some cases this is not possible at all. The following scenario gives an example: the points of one class lie inside a circle of radius $r$ around the origin while the points of the other class lie outside of it.
In such a case the separation boundary would be some non-linear surface rather than a hyperplane.
Also remember, each new feature is predicted (classified) by measuring the signed distance to this hyperplane and then deciding whether this distance is
positive or negative.
We would like to have a similar situation for the non-linear separation, that is, we want a real number measuring the distance from the separation
boundary. The larger this distance becomes, the more confident our classification becomes.
As a first approach to the above case we can use the function $\phi:R^2\rightarrow R^3$ defined by
$$
\mathbf{x} \mapsto (x_1, x_2, x_1^2+x_2^2)
$$
In other words, $\phi$ appends to a given point its squared distance to the origin as a third coordinate. Of course this function is one-to-one and
moreover both sets, $\phi(A)$ and $\phi(B)$, can be separated by the hyperplane
$$
H=\{\mathbf{z}\in R^3 \ | \ (\mathbf{z}-r^{2}\mathbf{e}_3 \ | \ \mathbf{e}_3) =0 \}
$$
Here $r$ denotes the radius of above circle and $(\cdot \ | \ \cdot)$ the inner product.
This hyperplane is the $x_1 x_2$-plane shifted by $r^2$ along $\mathbf{e}_3$.
So far we haven't done anything special. We just mapped all records of the feature samples by a suitable function $\phi$ which
allows us to separate the features by a hyperplane. Such a function is often referred to as a
feature map.
Notable though is that we map the feature space into a space of a different dimension.
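To make this concrete, the following small sketch (the points and the radius $r=1$ are made up; numpy is assumed) applies $\phi$ to a few points inside and outside such a circle and checks that their images lie below respectively above the hyperplane $H$:
import numpy as np

def phi(x):
    ## append the squared distance to the origin as a third coordinate
    return np.array([x[0], x[1], x[0] ** 2 + x[1] ** 2])

r = 1.0

## points of class A inside the circle, points of class B outside of it
A = np.array([[0.2, 0.1], [-0.3, 0.4]])
B = np.array([[1.5, 0.0], [-1.0, 1.2]])

## the third coordinate of phi(A) lies below r^2, that of phi(B) above r^2,
## hence the hyperplane H separates phi(A) from phi(B)
print([phi(x)[2] < r ** 2 for x in A])
print([phi(x)[2] > r ** 2 for x in B])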
Now we can search for an optimal separating hyperplane as we do for linear SVMs and use the solution as a classifier as usual.
In detail, the task is to find $\mathbf{w}'\in R^3, \ b\in R$ such that
$$
\frac{1}{N}\sum_i^N \max\left(0, 1-\left((\phi(\mathbf{x}_i) \ | \ \mathbf{w}') - b\right)\cdot y_i\right) + \lambda (\mathbf{w}' \ | \ \mathbf{w}')
$$
becomes minimal.
Having such a pair $\mathbf{w}', \ b$, we can use the classifier:
$$
\mathbf{x}\mapsto sgn \ \left((\phi(\mathbf{x}) \ | \ \mathbf{w}') - b\right)
$$
Instead of searching for $\mathbf{w}'$ in $R^3$, that is the transformed space, we can alternatively minimize the function
$$
\frac{1}{N}\sum_i^N \max\left(0, 1-\left((\phi(\mathbf{x}_i) \ | \ \phi(\mathbf{w})) - b\right)\cdot y_i\right) + \lambda (\phi(\mathbf{w}) \ | \ \phi(\mathbf{w}))
$$
where now we search $\mathbf{w}$ in $R^2$ and the classifier would look like
$$
\mathbf{x}\mapsto sgn \ \left((\phi(\mathbf{x}) \ | \ \phi(\mathbf{w})) - b\right)
$$
Apparently, and depending on the function $\phi$, the latter optimization problem becomes a more difficult one. But as we will see soon,
it is intended to avoid an even more difficult problem we otherwise would face.
In practice we usually don't have the fortune of a 2-dimensional feature space and thus actually have no idea how to construct a suitable
function $\phi$. In order to deal with this problem one uses the so-called
kernel trick:
Instead of searching for a feature map, one uses parameterized families of symmetric functions, $k:R^n \times R^n\rightarrow R$, which are called
kernels and have the following property:
There exists some function $\phi:R^n\rightarrow F$ together with a space $F$, the
transformed feature space, equipped with an inner product such that
$$
k(\mathbf{x}, \mathbf{y})=(\phi(\mathbf{x}) \ | \ \phi(\mathbf{y}))_F
$$
Here $(\cdot \ | \ \cdot)_F$ denotes the inner product in $F$.
In our above example we came from the other direction: we knew $\phi$ and defined $k(\mathbf{x}, \mathbf{y}):=(\phi(\mathbf{x}) \ | \ \phi(\mathbf{y}))_{R^3}$.
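For this concrete example the identity is easy to verify by hand, since $(\phi(\mathbf{x}) \ | \ \phi(\mathbf{y}))_{R^3}=(\mathbf{x} \ | \ \mathbf{y})+|\mathbf{x}|^2 |\mathbf{y}|^2$; the following tiny sketch (test points chosen arbitrarily) confirms it numerically:
import numpy as np

def phi(x):
    return np.array([x[0], x[1], x[0] ** 2 + x[1] ** 2])

def k(x, y):
    ## kernel induced by phi: (x | y) + |x|^2 * |y|^2
    return x @ y + (x @ x) * (y @ y)

x = np.array([0.3, -0.7])
y = np.array([1.1, 0.5])

## both sides agree up to floating point error
print(np.isclose(k(x, y), phi(x) @ phi(y)))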
The optimization problem together with the classifier can then be written as
$$
\frac{1}{N}\sum_i^N \max\left(0, 1-(k(\mathbf{x}_i, \mathbf{w}) - b)\cdot y_i\right) + \lambda k(\mathbf{w}, \mathbf{w})
$$
$$
\mathbf{x}\mapsto sgn \ \left(k(\mathbf{x},\mathbf{w}) - b\right)
$$
Observe that in these two expressions the function $\phi$ does not appear anymore, nor does the inner product of the transformed feature space $F$.
The SVM written in this form is called
kernelized support vector machine.
So, the principal idea of the kernelized SVM is to use a suitable kernel $k$, for which we know there exists a feature map $\phi$ into a linear space equipped with an inner product, and then to solve the above minimization. At no step does the need arise to have explicit knowledge of $\phi$.
Kernelized support vector machine
For a suitable kernel function $k:R^n \times R^n\rightarrow R$ which can be represented as
$$
k(\mathbf{x}, \mathbf{y})=(\phi(\mathbf{x}) \ | \ \phi(\mathbf{y}))_F
$$
for some map $\phi:R^n\rightarrow F$ and some linear space $F$ with inner product, find
$\mathbf{w}\in R^n, \ b\in R$ which minimizes
$$
\frac{1}{N}\sum_i^N \max\left(0, 1-(k(\mathbf{x}_i, \mathbf{w}) - b)\cdot y_i\right) + \lambda k(\mathbf{w}, \mathbf{w})
$$
For solutions $\mathbf{w}$ and $b$ the classifier is taken to be
$$
\mathbf{x}\mapsto sgn \ \left(k(\mathbf{x},\mathbf{w}) - b\right)
$$
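As a rough illustration of how this minimization could be carried out, the following sketch runs a generic optimizer (scipy.optimize.minimize) on a tiny made-up data set with a polynomial kernel; the data, the kernel, its parameters and the choice of optimizer are all assumptions made for illustration, not part of the general method:
import numpy as np
from scipy.optimize import minimize

def k(x, y, r=1.0, d=2):
    ## polynomial kernel ((x | y) + r)^d
    return (x @ y + r) ** d

## tiny made-up data set: class 1 near the origin, class -1 further away
X = np.array([[0.1, 0.2], [-0.2, 0.1], [1.5, 1.4], [-1.6, 1.3]])
y = np.array([1, 1, -1, -1])
lam = 0.01

def objective(params):
    ## params holds w (first two entries) and b (last entry)
    w, b = params[:2], params[2]
    margins = np.array([(k(xi, w) - b) * yi for xi, yi in zip(X, y)])
    return np.maximum(0.0, 1.0 - margins).mean() + lam * k(w, w)

res = minimize(objective, x0=np.zeros(3), method="Nelder-Mead")
w, b = res.x[:2], res.x[2]

## predictions of the resulting classifier on the training points
print([int(np.sign(k(xi, w) - b)) for xi in X])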
In the initial example we know that the above procedure makes sense and leads towards an optimal separation in the transformed feature space.
But for arbitrary kernels, for which it is formulated, we still have to verify that some sort of separation takes place.
To see this, we write the minimization again in terms of feature maps:
$$
\frac{1}{N}\sum_i^N \max\left(0, 1-\left((\phi(\mathbf{x}_i) \ | \ \phi(\mathbf{w}))_F - b\right)\cdot y_i\right) + \lambda (\phi(\mathbf{w}) \ | \ \phi(\mathbf{w}))_F
$$
and instead of searching $\mathbf{w}$ we can search for $\mathbf{w}':=\phi(\mathbf{w})$ within $F$:
$$
\frac{1}{N}\sum_i^N \max\left(0, 1-\left((\phi(\mathbf{x}_i) \ | \ \mathbf{w}')_F - b\right)\cdot y_i\right) + \lambda (\mathbf{w}' \ | \ \mathbf{w}')_F
$$
The optimization problem as derived for linear SVMs takes exactly the same form. The only difference is that there we have some $R^n$ instead of $F$. If we look through the steps done for the linear SVM, we see that we can actually replace $R^n$ by an arbitrary space $F$ as long as the latter has an inner product.
Note that for such spaces the Pythagorean theorem still holds.
The separation by a hyperplane and its optimization is carried out within a very general space. We do not even care whether it is finite or infinite dimensional.
It can be anything, as long as it has an inner product.
Of course, the immediate question arises of how to find a suitable kernel $k$ for a given classification problem. There is no general recipe for this other than testing model performance with various kernels.
Examples of such kernels are:
1.) the linear kernel, which is the one used for the linear SVM:
$$
(\mathbf{x}, \mathbf{y}) \mapsto (\mathbf{x} | \mathbf{y})
$$
2.) polynomial kernels for some $r\geq 0$ and $d\in N$:
$$
(\mathbf{x}, \mathbf{y}) \mapsto ((\mathbf{x} | \mathbf{y})+r)^d
$$
3.) the radial basis function (RBF) kernel for some $\gamma>0$:
$$
(\mathbf{x}, \mathbf{y}) \mapsto \exp(-\gamma |\mathbf{x}- \mathbf{y}|^2)
$$
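Written out as plain functions (a small sketch; the parameter names follow the formulas above and their default values are arbitrary), these kernels look as follows:
import numpy as np

def linear_kernel(x, y):
    return x @ y

def polynomial_kernel(x, y, r=1.0, d=2):
    return (x @ y + r) ** d

def rbf_kernel(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y) ** 2))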
Except for the first, it is not obvious how to verify that these kernels indeed allow for a representation of the form $k(\mathbf{x}, \mathbf{y})=(\phi(\mathbf{x}) \ | \ \phi(\mathbf{y}))_F$. There is though a theorem, based on the theory of Hilbert-Schmidt operators, which states:
Mercer's theorem
Let $K$ be a compact subset of $R^n$.
If $k:K\times K\rightarrow R$ is a continuous function which is symmetric, that is
$$
k(\mathbf{x}, \mathbf{y})=k(\mathbf{y}, \mathbf{x})
$$
and positive definite, that is
$$
\sum_{i,j}^m k(\mathbf{x}_i, \mathbf{x}_j)c_i c_j\geq 0
$$
for any finite choice of real numbers $\{c_j\}_{j=1}^m$ and points $\{\mathbf{x}_j\}_{j=1}^m$ in $K$,
then there is a space $F$ with an inner product and a function $\phi: K\rightarrow F$ such that
$$
k(\mathbf{x}, \mathbf{y})=(\phi(\mathbf{x}) \ | \ \phi(\mathbf{y}))_F
$$
Proof:
We only sketch the proof for readers with knowledge of functional analysis.
The kernel can be used to define an operator $C$ on $L^2(K)$ by
$$
x \mapsto \left(s\mapsto\int_K k(s,t)x(t)dt\right)
$$
This operator maps into $L^2(K)$ and is symmetric. One can show that it is a compact operator and therefore can be
represented in terms of its eigenvectors $(u_n)_n$ by
$$
(Cx)(s) = \sum_n \lambda_n \left(\int_K x(t)u_n(t)dt\right) u_n(s)
$$
In particular, the positive definiteness implies all eigenvalues $\lambda_n$ to be non-negative.
By interchanging summation and integration in the above expression we can write
$$
(Cx)(s) = \int_K x(t) \sum_n \lambda_n u_n(t)u_n(s)dt
$$
By comparing with the definition of $C$ one can conclude that
$$
\sum_n \lambda_n u_n(t)u_n(s)=k(s,t)
$$
almost everywhere.
This gives rise to defining $\phi$ by $\phi(\mathbf{x}):=(u_n(\mathbf{x}))_n$ and $F$ to be the vector space of all real-valued sequences $(\xi_n)_n$
such that $\sum_n\lambda_n \xi_n^2$ converges. On this space we can define an inner product by
$$
(x,y)_F:=\sum_n\lambda_n x_n y_n
$$
Now we can apply this theorem in order to prove that for the above kernels there exist a feature map $\phi$ and a feature space $F$ with the desired representation. Symmetry and continuity are obviously fulfilled for all of them.
For #1:
$$
\sum_{i,j}^m(\mathbf{x}_i \ | \ \mathbf{x}_j)c_ic_j=\left(\sum_i^m \mathbf{x}_i c_i \ | \ \sum_i^m \mathbf{x}_i c_i \right)\geq 0
$$
For #2:
In case $d=1$ we have
$$
\sum_{i,j}^m ((\mathbf{x}_i \ | \ \mathbf{x}_j)+r)c_i c_j=\sum_{i,j}^m (\mathbf{x}_i \ | \ \mathbf{x}_j)c_i c_j+ \sum_{i,j}^m r c_i c_j = \left(\sum_i^m \mathbf{x}_i c_i \ | \ \sum_i^m \mathbf{x}_i c_i \right) + r\left(\sum_i^m c_i\right)^2\geq 0
$$
For all other kernels a little more computation is required. The reader who is interested in more details on this subject is referred to Bochner's theorem, which relates the Fourier transforms of measures to positive definite functions.
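While a full algebraic verification is beyond the scope of this article, one can at least check positive definiteness numerically: the Gram matrix $(k(\mathbf{x}_i, \mathbf{x}_j))_{i,j}$ built from arbitrary points should have no negative eigenvalues. A small sketch for the radial basis function kernel (sample size and $\gamma$ chosen arbitrarily):
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
gamma = 0.5

## Gram matrix K[i, j] = exp(-gamma * |x_i - x_j|^2)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-gamma * sq_dists)

## all eigenvalues should be non-negative up to floating point round-off
print(np.linalg.eigvalsh(K).min() >= -1e-10)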
Example implementation
We look at an application of the SVM as one would implement it in
scikit-learn.
The data set we use is taken from https://archive.ics.uci.edu/ml/datasets/Divorce+Predictors+data+set.
import pandas as pd
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
data = pd.read_csv(
"example-code/divorce/divorce.csv",
delimiter=";",
index_col=False, header=0
)
data = data.dropna()
## split into train and test data
X_train, X_test, y_train, y_test = train_test_split(
data.iloc[:, :-1].to_numpy(),
data.loc[:, ["Class"]].to_numpy(),
random_state=0
)
## fit model
clf = SVC(C=0.5)
clf.fit(X_train, y_train.reshape(len(y_train)))
print(clf.score(X_train, y_train))
print(clf.score(X_test, y_test))
After loading the data and splitting it into train and test samples we fit the model:
clf = SVC(C=0.5)
clf.fit(X_train, y_train.reshape(len(y_train)))
'C' is the regularization parameter, which translates to $1/\lambda$ in our notation. Note that SVC uses the RBF kernel by default.
We had to choose a specific regularization in order to keep the model from over-fitting the data.
If one runs this code the output should be:
0.984251968503937
0.9534883720930233
The first number is the score on the training data, the second the score on the test data.
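As mentioned earlier, there is no general recipe for choosing the kernel, so in practice one often compares several kernels and regularization strengths by cross-validation. A possible sketch, reusing X_train and y_train from above (the parameter grid is an arbitrary choice):
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "kernel": ["linear", "poly", "rbf"],
    "C": [0.1, 0.5, 1.0, 10.0],
}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train.reshape(len(y_train)))

print(search.best_params_)
print(search.best_score_)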
Final notes
Kernelized support vector machines provide an interesting alternative to linear support vector machines.
In particular they give a nice example of how relatively deep theoretical results, like Mercer's theorem, directly
contribute to applications. But keep in mind that they are prone to over-fitting and can cause performance problems
when the data set is large. In such cases it is better to choose the linear version or other methods.