Logistic regression is a method which is very often used for building supervised machine learning models.
Its purpose is to classify binary events, but as we will see it provides more than pure 0-1 classification.
Our aim here is to present the basic theory on which it is built and to sketch its actual implementation.
The prerequisites for understanding this article are a basic knowledge of probability theory and linear algebra.
Logistic regression aims to estimate conditional probabilities for a binary random variable based on samples.
That is, there is a random variable $Y$ (the outcome event) which takes values in $\{0,1\}$, and there are further random variables
$\{X_1, \cdots, X_n\}$ which influence the value of $Y$. These latter variables are referred to as 'features'.
The Data
As is very common in machine learning, we assume that we have data of the following form:
$$
\begin{pmatrix}
x_{1,1} & x_{1,2} & \cdots & x_{1,n} & y_1\\
x_{2,1} & x_{2,2} & \cdots & x_{2,n} & y_2\\
\vdots\\
x_{m,1} & x_{m,2} & \cdots & x_{m,n} & y_m\\
\end{pmatrix}
$$
The $j$'th column, $x_{\cdot, j}$, contains the data of the $j$'th feature, whereas the last column, $y_{\cdot}$,
contains the outcome events.
So for instance the first row,
$$
\begin{pmatrix}
x_{1,1} & x_{1,2} & \cdots & x_{1,n} & y_1
\end{pmatrix}
$$
tells us: when the values of $\{X_1, \cdots, X_n\}$ were $(x_{1,1}, x_{1,2} , \cdots , x_{1,n})$,
the variable $Y$ took the value $y_1$.
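In code such a sample matrix is simply a two-dimensional array whose last column holds the outcomes. The following minimal sketch, with made-up numbers and assuming NumPy is available, illustrates how the features and the outcomes are separated:

import numpy as np

# hypothetical toy sample matrix: each row holds the feature values
# x_{i,1}, ..., x_{i,n} followed by the observed outcome y_i
data = np.array([
    [0.5, 1.2, 0.0, 1],
    [1.1, 0.3, 2.4, 0],
    [0.9, 2.2, 1.7, 1],
])

X = data[:, :-1]  # feature columns x_{., 1}, ..., x_{., n}
y = data[:, -1]   # outcome column y_.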
The Model
We are concerned with estimating conditional probabilities, that is, with finding a function $F$ which estimates
$$
P(Y=\eta \ | \ \mathbf{X}=\mathbf{\xi})
$$
Here $\mathbf{X}$ denotes the vector $(X_1, \cdots, X_n)$ and $\mathbf{\xi}=(\xi_1, \cdots, \xi_n)$ is some value in the range
of $\mathbf{X}$.
This can be read as: The probability of $Y$ taking the value $\eta$ under the condition that the value of $\mathbf{X}$ is $\mathbf{\xi}$.
If $F$ is an estimate of this conditional probability, then
$$
F(\eta, \mathbf{\xi})\approx P(Y=\eta \ | \ \mathbf{X}=\mathbf{\xi})
$$
Having such an $F$ allows us to predict the probability of $Y$ taking the value $1$ in case $\mathbf{X}$ has values $\mathbf{\xi}$.
This probability is then estimated by $F(1, \mathbf{\xi})$.
In order to find such a function one usually uses a family of functions which all have appropriate properties for the given problem and which are
parameterized by some $\theta$. The parameter $\theta$ can be a single value or an arbitrary vector.
This family of functions is denoted by $F(\eta, \mathbf{\xi}, \theta)$.
Given such a family, one tries to find the parameter which best approximates the conditional probability w.r.t. some criterion.
Since in our case $F$ estimates probabilities, one requirement is that $F$ take values in the interval $[0,1]$.
Moreover, since $Y$ only has values in $\{0,1\}$, it is enough to have a family for $F(1, \mathbf{\xi}, \theta)$.
To see this, just recall that $P(Y=0 \ | \ \mathbf{X}=\mathbf{\xi})=1-P(Y=1 \ | \ \mathbf{X}=\mathbf{\xi})$ and thus
$F(0, \mathbf{\xi}, \theta)=1-F(1, \mathbf{\xi}, \theta)$.
The Logistic Function
In logistic regression one chooses a family for $F(1, \mathbf{\xi}, \theta)$ based on a function of the form
$$
\frac{1}{1+e^{-\beta - \theta^T\mathbf{\xi}}}
$$
Here $\theta^T$ denotes the transpose of the vector $\theta$, which makes $\theta^T\mathbf{\xi}=\theta_1\xi_1 +\cdots + \theta_n\xi_n$.
To simplify notation we introduce an artificial feature $X_0$ which is identically equal to $1$; its coefficient takes over the role of the intercept $\beta$, which is thereby absorbed into $\theta$.
With this we define the function
$$
H(\mathbf{\xi}, \theta):=\frac{1}{1+e^{-\theta^T\mathbf{\xi}}}
$$
and take this for $F(1, \mathbf{\xi}, \theta)$:
$$
F(1, \mathbf{\xi}, \theta)= H(\mathbf{\xi}, \theta) \approx P(Y=1 \ | \ \mathbf{X}=\mathbf{\xi})
$$
Clearly the values of $H$ lie in the open interval $(0,1)$; $H$ never attains the values $0$ or $1$ but comes arbitrarily close to both.
This makes $H$ an appropriate candidate for our needs.
Finally we can write $F$ in terms of $H$ by
$$
F(\eta, \mathbf{\xi}, \theta)=H(\mathbf{\xi}, \theta)^{\eta} \cdot (1-H(\mathbf{\xi}, \theta))^{1-\eta}
$$
Note, $\eta\in \{0,1\}$.
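A direct translation of $H$ and $F$ into code could look as follows. This is only a minimal sketch, assuming NumPy and that the artificial constant feature is already contained in xi:

import numpy as np

def H(xi, theta):
    # logistic function: 1 / (1 + exp(-theta^T xi))
    return 1.0 / (1.0 + np.exp(-np.dot(theta, xi)))

def F(eta, xi, theta):
    # estimated conditional probability P(Y = eta | X = xi) for eta in {0, 1}
    h = H(xi, theta)
    return h**eta * (1.0 - h)**(1 - eta)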
Maximum Likelihood
One way of finding the parameter $\theta$ is the maximum likelihood method.
Remember, the idea behind the maximum likelihood method is to regard each entry of the sample matrix as a random variable
and then to consider the joint conditional probability
$$
P(Y_1=y_1 \wedge \cdots \wedge Y_m=y_m \ | \ \bigwedge_{i,j} X_{i,j}=x_{i,j})
$$
Under certain independence assumptions one can transform this expression into a product
$$
P(Y_1=y_1 \wedge \cdots \wedge Y_m=y_m \ | \ \bigwedge_{i,j} X_{i,j}=x_{i,j})=\prod_{k}P(Y=y_k \ | \ \mathbf{X}=(x_{k,1}, \cdots, x_{k,n}))
$$
By inserting our estimate for $P(Y=\eta \ | \ \mathbf{X}=\mathbf{\xi})$ one obtains
$$
P(Y_1=y_1 \wedge \cdots \wedge Y_m=y_m \ | \ \bigwedge_{i,j} X_{i,j}=x_{i,j})=\prod_{k}F(y_k, (x_{k,1}, \cdots, x_{k,n}), \theta)
$$
If we now find a value for $\theta$ which maximizes the r.h.s, it will provide a conditional distribution function $F$
which produces the given sample-matrix with highest probability.
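As a sketch of what this means computationally, the right-hand side can be evaluated by multiplying the estimates row by row. The function below assumes the array conventions from the earlier sketch; note that such a product of many small probabilities quickly underflows numerically, which is one motivation for the next section:

import numpy as np

def likelihood(X, y, theta):
    # product over all rows k of F(y_k, x_{k,.}, theta)
    prod = 1.0
    for x_k, y_k in zip(X, y):
        h = 1.0 / (1.0 + np.exp(-np.dot(theta, x_k)))  # H(x_{k,.}, theta)
        prod *= h**y_k * (1.0 - h)**(1 - y_k)          # F(y_k, x_{k,.}, theta)
    return prod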
Numerical Solution
In order to solve the above optimization problem one has to apply numerical techniques. There exists a large number
of such techniques, and one has to decide case by case which of them best fits the given problem.
First of all we transform our optimization problem.
Since sums behave better than products in terms of numerical stability, we take the logarithm on both sides to obtain
$$
\log\left(\prod_{k}F(y_k, (x_{k,1}, \cdots, x_{k,n}), \theta) \right)=\sum_k \log F(y_k, (x_{k,1}, \cdots, x_{k,n}), \theta)
$$
Since the logarithm is monotone, both maximization problems are equivalent.
We proceed to calculate the r.h.s:
$$
\sum_k \log F(y_k, (x_{k,1}, \cdots, x_{k,n}), \theta)=\sum_k \log\left(H(\mathbf{x}_{k,\cdot} \ , \ \theta)^{y_k} \cdot (1-H(\mathbf{x}_{k,\cdot} \ , \ \theta))^{1-y_k} \right)=\\
\sum_k y_k\cdot \log H(\mathbf{x}_{k,\cdot} \ , \ \theta) + (1-y_k)\cdot \log (1-H(\mathbf{x}_{k,\cdot} \ , \ \theta))
$$
To ease notation we introduced $\mathbf{x}_{k,\cdot}:=(x_{k,1}, \cdots, x_{k,n})$, the $k$'th row of feature data.
Further by inserting the definition of $H$ we get
$$
\sum_k \log F(y_k, (x_{k,1}, \cdots, x_{k,n}), \theta)=\\
\sum_k y_k\cdot \log \left(\frac{1}{1+e^{-\theta^T\mathbf{x}_{k,\cdot}}}\right) + (1-y_k)\cdot \log\left(1- \frac{1}{1+e^{-\theta^T\mathbf{x}_{k,\cdot}}}\right)=\\
\sum_k -y_k\cdot \log \left(1+e^{-\theta^T\mathbf{x}_{k,\cdot}}\right) + (1-y_k)\cdot \log\left(\frac{e^{-\theta^T\mathbf{x}_{k,\cdot}}}{1+e^{-\theta^T\mathbf{x}_{k,\cdot}}}\right)=\\
\sum_k -y_k\cdot \log \left(1+e^{-\theta^T\mathbf{x}_{k,\cdot}}\right) + (1-y_k)\cdot \left(-\theta^T\mathbf{x}_{k,\cdot} -\log(1+e^{-\theta^T\mathbf{x}_{k,\cdot}})\right)=\\
\sum_k (y_k-1)\cdot \theta^T\mathbf{x}_{k,\cdot} - \log(1+e^{-\theta^T\mathbf{x}_{k,\cdot}})
$$
Therefore the optimization problem becomes maximizing the following cost function w.r.t. $\theta$:
$$
c(\theta):=\sum_k (y_k-1)\cdot \theta^T\mathbf{x}_{k,\cdot} - \log(1+e^{-\theta^T\mathbf{x}_{k,\cdot}})
$$
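In code this objective has a compact vectorized form. The following is a minimal sketch, assuming a NumPy feature matrix X (with the constant feature included) and an outcome vector y; a production implementation would additionally guard against overflow of exp(-z):

import numpy as np

def cost(theta, X, y):
    # c(theta) = sum_k (y_k - 1) * theta^T x_{k,.} - log(1 + exp(-theta^T x_{k,.}))
    z = X @ theta  # vector of the values theta^T x_{k,.}
    return np.sum((y - 1) * z - np.log1p(np.exp(-z)))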
Many optimization techniques, such as Newton's method or gradient descent, use the gradient of the cost function, but there are others,
such as swarm-based techniques, which get by without the gradient. We won't go into details here but just want to show how to compute the gradient
of our function.
In order to compute the gradient we have to compute all partial derivatives w.r.t. $\theta$:
$$
\frac{\partial c}{\partial \theta_l}=\sum_k (y_k-1) x_{k,l}+\frac{x_{k,l}e^{-\theta^{T}\mathbf{x}_{k,\cdot}}}{1+e^{-\theta^T\mathbf{x}_{k,\cdot}}}=
\sum_k x_{k,l}\left(y_k -\frac{1}{1+e^{-\theta^T\mathbf{x}_{k,\cdot}}}\right)
$$
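The gradient is equally compact in vectorized form. As a sketch under the same assumptions as above, a few plain gradient-ascent steps could look like this (step size and iteration count are arbitrary illustrative choices):

import numpy as np

def gradient(theta, X, y):
    # partial derivatives: sum_k x_{k,l} * (y_k - 1 / (1 + exp(-theta^T x_{k,.})))
    z = X @ theta
    return X.T @ (y - 1.0 / (1.0 + np.exp(-z)))

def fit(X, y, learning_rate=0.01, steps=1000):
    # plain gradient ascent on the log-likelihood c(theta)
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        theta = theta + learning_rate * gradient(theta, X, y)
    return theta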
The Assumptions
Remember that for maximum likelihood to work, some assumptions on the data are required:
1. the entries of the sample matrix are independent of each other
2. the value of $y_{i}$ is only influenced by the values of $(x_{i,1}, \cdots, x_{i,n})$
3. each column of the sample matrix consists of identically distributed variables
In real-world scenarios these requirements are never met perfectly. One should keep an eye on this
and consider, for instance, correlation tests across the feature variables.
The Classifier
So far our logistic regression model gave us estimates for the conditional probability $P(Y=1 \ | \ \mathbf{X}=\mathbf{\xi})=H(\mathbf{\xi}, \theta)$.
This on its own is already very useful in many applications. In cases where we are only interested in classifying a given set of
feature samples into the $\{Y=1\}$ and $\{Y=0\}$ classes, one may use the rule
$\mathbf{\xi}$ is in the $\{Y=1\}$-class if: $H(\mathbf{\xi}, \theta) > \frac{1}{2}$
$\mathbf{\xi}$ is in the $\{Y=0\}$-class if: $H(\mathbf{\xi}, \theta) \leq \frac{1}{2}$
For such a classifier one can apply standard measures like accuracy, recall and precision on test data in order to validate the model.
To find out how much influence each feature variable has on such a measure, one can create a model without a specific feature
and compare its measure with that of the full model. If one scales the feature variables to the same interval (for instance $[0,1]$), then the corresponding component of $\theta$ discloses the impact of a specific variable on the predicted probability. This is another very neat property of logistic regression.
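A sketch of this decision rule together with the mentioned measures could look as follows, assuming a fitted parameter vector theta and hypothetical held-out arrays X_test and y_test:

import numpy as np

def predict(X, theta, threshold=0.5):
    # assign the {Y=1}-class if H(x, theta) > 1/2, otherwise the {Y=0}-class
    probs = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return (probs > threshold).astype(int)

y_pred = predict(X_test, theta)
accuracy = np.mean(y_pred == y_test)
precision = np.sum((y_pred == 1) & (y_test == 1)) / np.sum(y_pred == 1)
recall = np.sum((y_pred == 1) & (y_test == 1)) / np.sum(y_test == 1)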
(For readers with knowledge in linear algebra)
In conjunction with classification, the logistic regression imposes a decision boundary. This is a hyperplane in the feature space which is given
by the equation
$$
D:=\{x\in R^n : \theta^T\mathbf{x}=0\}
$$
It comes from the fact that $H(\mathbf{x}, \theta)$ takes the value $\frac{1}{2}$ exactly when $\theta^T\mathbf{x}=0$.
This is easily seen from the definition of $H$. A value of $\frac{1}{2}$ means $P(Y=1 \ | \ \mathbf{X}=\mathbf{x})=\frac{1}{2}$, which
tells us that from $\mathbf{x}$ we cannot conclude whether $Y$ takes the value $0$ or $1$ with any preference.
In contrast, the farther a feature vector is from $D$, the more decisive it is
for the outcome of $Y$.
To see this, let us derive a formula which gives, for a given vector $\mathbf{x}$, its distance to $D$:
From linear algebra we know that $D$ is a linear subspace of $R^{n}$ and has an orthogonal basis $B$. Moreover, since the span of $\theta$ is the orthogonal
complement of this hyperplane, the union $\{\theta\}\cup B$ is an orthogonal basis of $R^n$.
The distance of $\mathbf{x}$ to $D$ is the infimum of $|\mathbf{x}-\mathbf{b}|$ over all $\mathbf{b}\in D$. In linear algebra one shows that
there always exists a unique $\mathbf{b}\in D$ which minimizes this expression. It turns out that $\mathbf{b}$ is the orthogonal projection of
$\mathbf{x}$ onto $D$ and that $\mathbf{x}$ can be uniquely represented in the form
(*) $\mathbf{x}=a\cdot \theta+\mathbf{b}$
with some $a\in R$.
Therefore the distance is given by
$$
|\mathbf{x}-\mathbf{b}|=|a\cdot\theta|
$$
We obtain $a$ by taking the inner product with $\theta$ on both sides of (*):
$(\mathbf{x}, \theta)=a|\theta|^2$. Solving for $a$ and inserting it into $|a\cdot\theta|=|a|\,|\theta|$ finally gives
$$
dist(\mathbf{x}, D)=\frac{|(\mathbf{x}, \theta)|}{|\theta|}
$$
We observe that the distance is proportional to the absolute value of the inner product $(\mathbf{x}, \theta)$, which directly appears in the definition of $H$.
Note that $(\mathbf{x}, \theta)=\theta^{T}\mathbf{x}$. Therefore we have proved that the distance of a feature vector to the decision boundary directly
impacts the probability of the event $Y=1$.
The theory needed to understand this argument in detail is covered in almost every standard book on linear algebra.
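As a small illustration, the (signed) distance to the decision boundary is essentially one line of code; a minimal sketch assuming NumPy:

import numpy as np

def signed_distance(x, theta):
    # (x, theta) / |theta|: its absolute value is dist(x, D),
    # and its sign tells on which side of the boundary D the point x lies
    return np.dot(x, theta) / np.linalg.norm(theta)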
Logistic regression is implemented in many tools nowadays. We will look at how it works in the following example.
Example
This is a very basic example using the Python machine learning toolkit
scikit-learn. It uses an included dataset which maps 30 features to 0-1 outcomes.
The code looks like:
from sklearn.datasets import load_breast_cancer
from sklearn import linear_model
from sklearn.model_selection import train_test_split
## loading the data
X, y = load_breast_cancer(return_X_y=True)
## extracting features
X = X[:, [22, 27]]
clf = linear_model.LogisticRegression()
## splitting data in training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
## training the model on training data
clf.fit(X_train, y_train)
## validating the model by its accuracy on test data
print(clf.score(X_test, y_test))
print(clf.coef_)
print(clf.intercept_)
The variable X holds the feature samples, and we restrict it to the features with indices 22 and 27.
Then we create an instance of the logistic regression model: 'clf = linear_model.LogisticRegression()'.
As it is always wise to split the data into training and test sets, we do this with: 'train_test_split(X, y, random_state=0)'.
The actual model is trained ($\theta$ is determined) by: 'clf.fit(X_train, y_train)'.
After this, we can validate the model by computing its score (accuracy). The accuracy is the number of correctly classified samples
divided by the total number of tested samples.
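To connect the example back to the theory, one can check that the probabilities reported by scikit-learn are exactly the logistic function applied to the learned linear combination of the features. This is a small sketch reusing the variables from the example above:

import numpy as np

# probability of class 1 as reported by the trained model
p_sklearn = clf.predict_proba(X_test)[:, 1]

# the same probability computed manually from coef_ and intercept_
z = X_test @ clf.coef_[0] + clf.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))

print(np.allclose(p_sklearn, p_manual))  # expected to print True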
Of course, this only provides a rough idea of how to build such a model. In practice you would first look into topics like
data cleaning, feature selection and dimensionality reduction, and later into a more sophisticated model validation.
For us it shall be sufficient, since our focus is on presenting the basic theory of the subject.
Conclusion
Logistic regression provides a very powerful modelling approach for predicting binary events from a set of features.
The feature variables themselves are subject to no restrictions; they can be categorical, continuous or a combination of both.
Moreover, logistic regression provides not just a classification but also probabilistic estimates for the events to happen.