
Linear Regression

Linear regression is a very important machine learning algorithm. It applies to supervised learning problems. The aim of this article is to explain the most fundamental theory and ideas behind it. At the end we look at an example of how one can create a linear regression model in practice.

The prerequisites for understanding this article are some basic knowledge of linear algebra, as offered by many standard introductions, and some knowledge of multi-dimensional analysis.

The problem

We are given a set of real-valued feature variables $\{X_1, \cdots, X_{n-1}\}$ and a real-valued outcome variable $Y$. Our aim is to predict $Y$ based on given values of the features.

The linear model

The approach is based on the following function $$ f(\mathbf{x}, \mathbf{p}, e) = \mathbf{x}^T\mathbf{p} + e $$ Here $\mathbf{x}$ is an $(n-1)$-dimensional vector which consists of values of the feature variables. The vector $\mathbf{p}=(p_1, \cdots, p_{n-1})^T$ and the real number $e$ are model parameters. In detail, $\mathbf{p}$ represents the model coefficients and $e$ the so-called intercept. These parameters need to be determined in order to predict $Y$ as well as possible with respect to some criterion.
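To make the model concrete, here is a minimal sketch of a single prediction in NumPy; the values of $\mathbf{x}$, $\mathbf{p}$ and $e$ are made up for illustration.

import numpy as np

# made-up values for one sample, the coefficients and the intercept
x = np.array([0.5, 1.2, 2.0])   # feature values
p = np.array([0.8, 0.3, -0.1])  # model coefficients
e = 0.4                         # intercept

prediction = x @ p + e          # x^T p + e
print(prediction)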

Determining model parameter

To ease notation, we put the intercept $e$ into the first component of the coefficient vector $\mathbf{p}$. After re-indexing, we can still assume that $\mathbf{p}$ looks like $$ \mathbf{p}= \begin{pmatrix} p_1\\ \vdots \\ p_n \end{pmatrix} $$ but we keep in mind the special meaning of $p_1$.

As usual, we can present the feature values in matrix form by $$ B:= \begin{pmatrix} b_{1,1} & \cdots & b_{1, n-1}\\ \vdots \\ b_{m,1} & \cdots & b_{m, n-1} \end{pmatrix} $$ Each row represents one sample of the features. The samples of the outcome variable are given by an $m$-dimensional vector $\mathbf{y}$.
This allows us to write the above model for each row in compact matrix form: $$ \begin{pmatrix} 1 & b_{1,1} & \cdots & b_{1, n-1}\\ \vdots \\ 1 & b_{m,1} & \cdots & b_{m, n-1} \end{pmatrix} \begin{pmatrix} p_1\\ \vdots\\ p_n \end{pmatrix} $$ Note, in order to cope with the special meaning of $p_1$, we added a column with values $1$ to $B$. In the following we will refer to this augmented matrix as $A$, so that $a_{i,1}=1$ and $a_{i,j+1}=b_{i,j}$. The $i$'th row of the product is then given by $$ p_1+\sum_{j=1}^{n-1}a_{i, j+1}p_{j+1} $$ which is exactly our above model written for the $i$'th sample.

Given a model parameter $\mathbf{p}$ we can define for each sample $$ e_i:= p_1 + \sum_{j=1}^{n-1}a_{i, j+1}p_{j+1} - y_i $$ which is the error the model makes when predicting $y_i$. We can write this in matrix form as $$ \mathbf{e} = A\mathbf{p}-\mathbf{y} $$ Our aim is now to find a $\mathbf{p}$ such that $\mathbf{e}$ is kept as 'small' as possible. In order to measure the magnitude of $\mathbf{e}$ we use the $l^2$-norm of $R^m$. So our problem can be formulated as an optimization: $$ \min_{\mathbf{p}} |A\mathbf{p}-\mathbf{y}|^2 $$ Here $|\cdot|$ denotes the $l^2$-norm in $R^{m}$.
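As a small illustration of this construction, the following sketch builds the augmented matrix $A$ from a toy feature matrix $B$, computes the error vector and its squared $l^2$-norm; all numerical values are made up.

import numpy as np

# made-up toy data: m = 3 samples, n - 1 = 2 features
B = np.array([[0.5, 1.2],
              [1.5, 0.7],
              [2.0, 2.2]])
y = np.array([1.0, 2.0, 3.0])
p = np.array([0.1, 0.8, 0.3])                    # p[0] plays the role of the intercept p_1

A = np.column_stack([np.ones(B.shape[0]), B])    # prepend the column of ones
e = A @ p - y                                    # error vector e = A p - y
print(np.sum(e ** 2))                            # |A p - y|^2, the quantity we minimize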

Closed form solution

One very nice property of linear regression is that the above optimization problem admits a closed-form solution. We obtain it as follows:

Taylor's theorem states that for a function $f$ in $C^{2}(R^n, R)$, that is, a real-valued function which is twice continuously differentiable, we can write $$ f(\mathbf{x}+\mathbf{h})=f(\mathbf{x})+Df(\mathbf{x})\mathbf{h}+\frac{1}{2}\left(D^2f(\mathbf{x}+\theta\mathbf{h})\mathbf{h} \ | \ \mathbf{h}\right) $$ for some appropriate real number $\theta\in [0,1]$. By $(\cdot \ | \ \cdot)$ we denote the usual inner product in $R^n$.
From this expansion we can easily infer that in case $D^2f$ is positive definite at every point, the condition $$ Df(\mathbf{x})=0 $$ is necessary and sufficient for $f$ to take a minimum at $\mathbf{x}$.

For those not so accustomed to these concepts, let us roughly summarize:
1.) $D^{2}f$ is an $(n,n)$-matrix whose components are the second partial derivatives of $f$: $(\partial_{k}\partial_{j}f)_{k,j}$.
2.) the above inner product can be written as matrix multiplication: $(\mathbf{u} \ | \ \mathbf{v})=\mathbf{u}^T\mathbf{v}$
3.) a symmetric matrix $A$ is called positive definite if we have: $(A\mathbf{x} \ | \ \mathbf{x})>0$ for all $\mathbf{x}\neq\mathbf{0}$

We aim to apply these facts to our situation. In this case $f(\mathbf{p})=|A\mathbf{p}-\mathbf{y}|^2$.
Therefore we have to compute the first and second derivative of $|A\mathbf{p}-\mathbf{y}|^2$ w.r.t $\mathbf{p}$.

The best way to do this is to write $$ |A\mathbf{p}-\mathbf{y}|^2=\sum_{i=1}^m\left(y_i-\sum_{l=1}^n A_{i,l}p_l\right)\left(y_i-\sum_{k=1}^n A_{i,k}p_k\right) $$ After multiplying through the sums this becomes $$ \sum_i y_iy_i-\sum_{i,k}y_iA_{i,k}p_k-\sum_{i,l}y_iA_{i,l}p_{l}+\sum_{i,l,k}A_{i,l}A_{i,k}p_lp_k $$ By applying well-known rules of differentiation we compute $$ \partial_{j}f(\mathbf{p})=-2\sum_{i}y_iA_{i,j}+2\sum_{i,k}A_{i,j}A_{i,k}p_k $$ and for the second partial derivatives we obtain $$ \partial_{k}\partial_{j}f(\mathbf{p})=2\sum_{i}A_{i,j}A_{i,k} $$ Putting all components together we can write in matrix form $$ Df(\mathbf{p})=-2A^{T}\mathbf{y}+2A^TA\mathbf{p} $$ $$ D^2f(\mathbf{p})=2A^TA $$ Next we show that $D^2f$ is positive definite in 'most' cases. To see this, we consider $$ (D^2f\,\mathbf{x} \ | \ \mathbf{x})=(2A^TA\mathbf{x} \ | \ \mathbf{x})=2(A\mathbf{x} \ | \ A\mathbf{x})=2|A\mathbf{x}|^2 $$ The r.h.s. takes the value $0$ if and only if $A\mathbf{x}=\mathbf{0}$, or in other words, $\mathbf{x}\in N(A)$, the null-space of $A$. In most cases the sample size is much larger than the number of features: $m > n$. This makes it very likely that the columns of $A$ are linearly independent. On the other hand, linearly independent columns imply $N(A)=\{\mathbf{0}\}$ and thus that $D^2f$ is positive definite.
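The formulas for $Df$ and $D^2f$ can be verified numerically. The following sketch compares the analytic gradient with a central finite-difference approximation and checks positive definiteness via the eigenvalues, using made-up random data.

import numpy as np

# made-up random data for the check
rng = np.random.default_rng(0)
m, n = 20, 4
A = rng.normal(size=(m, n))
y = rng.normal(size=m)
p = rng.normal(size=n)

f = lambda p: np.sum((A @ p - y) ** 2)       # f(p) = |A p - y|^2
grad = -2 * A.T @ y + 2 * A.T @ A @ p        # Df(p) from the derivation above
hess = 2 * A.T @ A                           # D^2 f(p) from the derivation above

# central finite differences as an independent approximation of the gradient
eps = 1e-6
fd_grad = np.array([
    (f(p + eps * np.eye(n)[j]) - f(p - eps * np.eye(n)[j])) / (2 * eps)
    for j in range(n)
])
print(np.allclose(grad, fd_grad, atol=1e-4))          # expected: True

# positive definiteness: all eigenvalues of D^2 f are strictly positive
print(np.all(np.linalg.eigvalsh(hess) > 0))           # expected: True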

In order to find the minimum, according to the above considerations, we have to find $\mathbf{p}$ such that $$ Df(\mathbf{p})=-2A^{T}\mathbf{y}+2A^TA\mathbf{p}=\mathbf{0} $$ If $A^TA$ is invertible, this can be solved for $\mathbf{p}$ by $$ \mathbf{p}=\left(A^TA\right)^{-1}A^T\mathbf{y} $$ At this point one could ask whether the system is always solvable. Under the assumptions made so far, we can actually only ensure that the solution is unique if it exists. In order to see that it always exists, we consider the equivalent statement $$ A^T(A\mathbf{p}-\mathbf{y})=\mathbf{0} $$ So we ask for the existence of a $\mathbf{p}$ such that $A\mathbf{p}-\mathbf{y}\in N(A^T)$. Elements of $N(A^T)$ are easily seen to be the vectors which are orthogonal to the linear hull of the columns of $A$. This hull coincides with $I(A)$, the image space of $A$. By the Gram-Schmidt process we can find an orthogonal basis of $I(A)$ and extend it to a basis of $R^m$. Then $\mathbf{y}$ can be presented as $\mathbf{y}=\mathbf{w}+\mathbf{v}$ with some $\mathbf{w}\in I(A)$ and $\mathbf{v}\in I(A)^{\bot}$. The latter means that $\mathbf{v}$ is orthogonal to the subspace $I(A)$. Since $A\mathbf{p}=\mathbf{w}$ must have a solution and for this solution we have $A\mathbf{p}-\mathbf{y}=-\mathbf{v}\in I(A)^{\bot}=N(A^T)$, we are done.
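The following sketch illustrates this solvability claim on a made-up matrix whose third column is the sum of the first two, so that $A^TA$ is singular; a least-squares solution still satisfies the normal equations.

import numpy as np

# made-up matrix with linearly dependent columns: column 3 = column 1 + column 2
A = np.array([[1.0, 2.0, 3.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 2.0],
              [1.0, 3.0, 4.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

print(np.linalg.matrix_rank(A.T @ A))      # 2 < 3, so A^T A is singular

# np.linalg.lstsq returns a minimizer of |A p - y|^2 even in this case ...
p, *_ = np.linalg.lstsq(A, y, rcond=None)
# ... and every such minimizer satisfies the normal equations A^T A p = A^T y
print(np.allclose(A.T @ A @ p, A.T @ y))   # expected: True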

Let us summarize:

Theorem

The solution to a linear regression can be obtained by solving the linear system $$ A^TA\mathbf{p}=A^T\mathbf{y} $$ This system always has a solution, and in case the columns of $A$ are linearly independent, $A^TA$ is even invertible.
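As a minimal sketch of this theorem in code, the following solves the normal equations directly; the matrix $A$ and vector $\mathbf{y}$ are made-up toy values, with the leading column of ones playing the role of the intercept.

import numpy as np

def fit_linear_regression(A: np.ndarray, y: np.ndarray) -> np.ndarray:
    # solve the normal equations A^T A p = A^T y
    return np.linalg.solve(A.T @ A, A.T @ y)

# made-up toy data: 5 samples, intercept column plus 2 features
A = np.array([[1.0, 0.5, 1.2],
              [1.0, 1.5, 0.7],
              [1.0, 2.0, 2.2],
              [1.0, 3.5, 1.9],
              [1.0, 4.0, 3.1]])
y = np.array([1.0, 2.1, 2.9, 4.2, 5.1])

p = fit_linear_regression(A, y)
print(p)                                     # p[0] is the intercept
print(np.allclose(A.T @ A @ p, A.T @ y))     # the normal equations hold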


Regularization

One problem with linear regression is that in applications the model coefficients tend to become very large, and especially when there are many features, the model tends to over-fit the data.
A standard approach is to penalize large values in the optimal solution $\mathbf{p}$. This can be done by altering the above optimization to: $$ \min_{\mathbf{p}} \left( |A\mathbf{p}-\mathbf{y}|^2 + \alpha|\mathbf{p}|^{2} \right) $$ Here $\alpha$ denotes some positive real number, called the regularization factor. The larger it is chosen, the more penalty is attributed to large values of $\mathbf{p}$.

Solution to regularized version

The regularized linear regression also admits a closed-form solution. As before, we calculate the first and second derivative:

The steps are the same, but we have to account for the additional summand $$ \alpha\sum_jp_jp_j $$ Doing this, the first derivative computes to $$ Df(\mathbf{p})=-2A^{T}\mathbf{y}+2A^TA\mathbf{p}+2\alpha\mathbf{p} $$ and from this the second derivative becomes $$ D^2f(\mathbf{p})=2A^TA+2\alpha E $$ Here we denote by $E$ the $(n,n)$-identity matrix.

In this case positive definiteness of $D^2f$ is always ensured: $$ (D^2f\,\mathbf{x} \ | \ \mathbf{x}) = 2(A^TA\mathbf{x}+\alpha E\mathbf{x} \ | \ \mathbf{x})=2(A\mathbf{x} \ | \ A\mathbf{x}) + 2\alpha(\mathbf{x} \ | \ \mathbf{x})=2|A\mathbf{x}|^2 + 2\alpha|\mathbf{x}|^2 $$ So from $(D^2f\,\mathbf{x} \ | \ \mathbf{x})=0$ it follows that $\mathbf{x}=\mathbf{0}$.
A sufficient and necessary criterion for $\mathbf{p}$ to provide a solution therefore is $$ Df(\mathbf{p})=\mathbf{0} $$ or $$ -2A^{T}\mathbf{y}+2A^TA\mathbf{p}+2\alpha\mathbf{p}=\mathbf{0} $$ which we can formulate as $$ (A^TA+\alpha E)\mathbf{p}=A^{T}\mathbf{y} $$ Since $A^TA+\alpha E$ is positive definite for every $\alpha>0$, as shown above, it is always invertible, regardless of whether $A^TA$ itself is: a positive definite matrix cannot map a non-zero vector to $\mathbf{0}$, so its null-space is trivial.

We summarize:

Theorem

The solution to a regularized linear regression can be obtained by solving the linear system $$ (A^TA +\alpha E)\mathbf{p}=A^T\mathbf{y} $$ For every $\alpha > 0$ the matrix $A^TA +\alpha E$ is invertible, so this system always has a unique solution.
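As before, here is a minimal sketch of the regularized solution with made-up toy data. Note that it follows the formulation above, which also penalizes the intercept component $p_1$; scikit-learn's Ridge, used later in the example, does not penalize the intercept.

import numpy as np

def fit_ridge(A: np.ndarray, y: np.ndarray, alpha: float) -> np.ndarray:
    # solve (A^T A + alpha E) p = A^T y
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + alpha * np.eye(n), A.T @ y)

# made-up toy data; the first column of ones plays the role of the intercept
A = np.array([[1.0, 0.5],
              [1.0, 1.5],
              [1.0, 2.5],
              [1.0, 3.5]])
y = np.array([1.0, 2.2, 2.8, 4.1])

for alpha in (0.0, 1.0, 10.0):
    print(alpha, fit_ridge(A, y, alpha))   # coefficients shrink as alpha grows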


Polynomial Model

A third variant of linear regression is obtained by altering the original model to allow for polynomial dependencies between feature and outcome variables. This polynomial dependency can be of any degree; in case of degree 2 the model becomes: $$ \sum_{i,j}p_{i,j}x_ix_j + e $$ As before, $e$ denotes the intercept and $\{p_{i,j}\}_{i,j}$ are the model coefficients. The $x_k$ denote the sample value of the $k$'th feature. For degree 3 the model looks like $$ \sum_{i,j,k}p_{i,j,k}x_ix_jx_k + e $$ and so on.

Although our model is not linear any more, the problem of finding an optimal value for $\mathbf{p}$ remains linear. In fact, it requires the same computations as before.
The principal idea is to observe that we can again formulate an optimization problem of the form $$ \min_{\mathbf{p}} |A\mathbf{p}-\mathbf{y}|^2 $$ with some appropriate matrix $A$.
In order to see this, note that by suitable book-keeping of the indexes of $p_{i,j}$ we can turn it into a vector $(p_1', p_2', \cdots)$. For instance, we could use the mapping $(i,j)\mapsto (i-1)\cdot n + j$ to enumerate the indexes. Alternatively, we can view all this as adding artificial features by forming polynomial combinations of the original feature values. That is, from the feature set $\{X_1, \cdots, X_n\}$ we build an extension by adding $\{X_1 X_1, X_1 X_2, \cdots, X_n X_n\}$. Then from a sample $\{x_1, x_2, \cdots, x_n\}$ we obtain the values for the artificial features by $\{x_1 x_1, x_1 x_2, \cdots, x_n x_n\}$. This way we again create a matrix $A$ which contains all sample values and altogether arrive at the same optimization problem as before.
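The following sketch (not taken from the article's example code) carries out this bookkeeping for degree 2: every index pair $(i,j)$ is mapped to position $(i-1)\cdot n + j$, and the corresponding column holds the products $x_i x_j$; the sample values are made up.

import numpy as np

def degree2_design_matrix(X: np.ndarray) -> np.ndarray:
    m, n = X.shape
    cols = [np.ones(m)]                            # column of ones for the intercept e
    for i in range(1, n + 1):
        for j in range(1, n + 1):                  # enumeration (i, j) -> (i-1)*n + j
            cols.append(X[:, i - 1] * X[:, j - 1]) # artificial feature x_i * x_j
    return np.column_stack(cols)

# made-up sample values: two samples, two original features
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(degree2_design_matrix(X))
# each row reads [1, x1*x1, x1*x2, x2*x1, x2*x2]; the columns for (i, j) and
# (j, i) coincide, so in practice one keeps only the pairs with i <= j.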

As already mentioned, these additional features can be used to capture non-linear dependencies between the feature variables and the outcome. One word of caution, though, regarding the chosen degree: one might be tempted to choose a very high degree in order to account for as many non-linear dependencies as possible. Although this way the score of your model on the training data can be driven arbitrarily high, on your test data you will most probably see a very bad result, since by choosing a high degree your model tends to over-fit.

Finally, let us consider some example code for all three models as one would implement them with scikit-learn.

Example

We consider a real-estate dataset which is available at https://archive.ics.uci.edu/ml/datasets/Real+estate+valuation+data+set .
The code for a plain linear regression is as follows:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split

# loading data into a pandas-dataframe
data = pd.read_csv(
  "example-code/housing/housing-data.csv",
   delimiter=",",
   index_col=False,
   header=0
)
data = data.dropna()

# extracting feature-columns
feature_cols = ["age", "dist_to_station", "number_stores"]
feature_data = data.loc[:, feature_cols]

# scaling values to [0,1]
min_max_scaler = MinMaxScaler()
feature_data.loc[:, feature_cols] = min_max_scaler.fit_transform(feature_data)

# splitting into train and test data
X_train, X_test, y_train, y_test = train_test_split(
  feature_data, data["price"],
  random_state=0
)

# fitting model on training data
regressor = LinearRegression()
reg = regressor.fit(X_train, y_train)

# validating the model by R-square and printing the coefficients
print(reg.score(X_test, y_test))
print(reg.coef_)


We load the data into a pandas dataframe and extract the feature columns:

feature_cols = ["age", "dist_to_station", "number_stores"]
feature_data = data.loc[:, feature_cols]

All features are continuous variables, and so we scale them to the interval $[0,1]$:

min_max_scaler = MinMaxScaler()
feature_data.loc[:, feature_cols] = min_max_scaler.fit_transform(feature_data)

This is a common step for continuous variables and it ensures that regression coefficients will be comparable with each other. Then we split the data into train and test data:

X_train, X_test, y_train, y_test = train_test_split(
  feature_data, data["price"],
  random_state=0
)

Note, the column 'price' is the one we try to predict from the feature variables.
We perform a linear regression on the train data and print the score and the coefficients of the resulting model:

regressor = LinearRegression()
reg = regressor.fit(X_train, y_train)
print(reg.score(X_test, y_test))
print(reg.coef_)

If we run this code the output is:
0.55252965676222
[-10.85178934 -34.70475081 12.86670478]

The score is the so called $R^2$-value. It is defined by $$ R^2:=1-\frac{|A\mathbf{p}-\mathbf{y}|^2}{|\mathbf{y}-\mathbf{\bar{y}}|^2} $$ where $\mathbf{\bar{y}}$ is the vector with all components equal to the arithmetic mean of $\mathbf{y}$.
The fraction measures the squared error of the prediction, obtained with $\mathbf{p}$ as the model coefficients, relative to the variance of the outcome variable $\mathbf{y}$. This normalization makes sense since an outcome variable with high variance is naturally harder to predict.
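To make the definition concrete, here is a small sketch computing $R^2$ by hand for made-up true and predicted values; the result agrees with what reg.score or sklearn.metrics.r2_score would return for the same inputs.

import numpy as np

# made-up outcomes and model predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.4])

ss_res = np.sum((y_pred - y_true) ** 2)          # |A p - y|^2
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # |y - y_bar|^2
r2 = 1.0 - ss_res / ss_tot
print(r2)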

Next we replace the plain linear regression by a regularized linear regression. In order to do this, we replace the regressor

regressor = LinearRegression()

by

regressor = Ridge(alpha=10)

Now the output becomes:
0.4937615933885673
[ -7.6634312 -18.35736214 14.27928964]

As we would expect, the model coefficients tend to become smaller with respect to the $l^2$-norm, and the overall $R^2$ suffers slightly from this.

Finally, we apply a polynomial regression. We do this by adding polynomial features to the feature data:


...

X_train, X_test, y_train, y_test = train_test_split(
  feature_data, data["price"],
  random_state=0
)

poly_transf = PolynomialFeatures(
  degree=2,
  interaction_only=True,
  include_bias=False
)
X_train = poly_transf.fit_transform(X_train)
X_test = poly_transf.transform(X_test)

# fitting model on training data
regressor = LinearRegression()

...


The output is:
0.5813495141354821
[-20.29032267 -51.94193049 12.20547223 51.3688902 8.28712238 -81.78823624]

The $R^2$ becomes better since we add more flexibility to the regression by allowing it to account for non-linear feature interactions.

The above implementations are intended to provide a rough overview of how linear regression can be performed. In practice one would do a more sophisticated analysis of the data, especially in terms of model validation.

Final Note

The solutions we have constructed for linear regression and its regularized version, although they are in closed form from a mathematical point of view, still require solving a linear system. As long as the number of features is small, this is not a big deal. But when there are many features, the solution requires appropriate numerical techniques. One that matches the given problem particularly well is the Cholesky decomposition, since in our setup the matrix to invert is symmetric and positive definite, which are exactly the prerequisites for applying this method.
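As a closing illustration, the following sketch solves the regularized normal equations via a Cholesky factorization; the data and the value of alpha are made up, and scipy.linalg.solve_triangular could be used instead of np.linalg.solve to actually exploit the triangular structure.

import numpy as np

# made-up data and regularization factor
rng = np.random.default_rng(1)
A = rng.normal(size=(100, 5))
y = rng.normal(size=100)
alpha = 0.1

M = A.T @ A + alpha * np.eye(A.shape[1])   # symmetric and positive definite
b = A.T @ y

L = np.linalg.cholesky(M)                  # M = L L^T with L lower triangular
z = np.linalg.solve(L, b)                  # forward substitution: L z = b
p = np.linalg.solve(L.T, z)                # backward substitution: L^T p = z

print(np.allclose(M @ p, b))               # expected: True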
