The maximum likelihood method is one of the most widely used methods for fitting distribution functions to data.
In machine learning, a typical place where this method finds application is logistic regression.
In this article we look into the detailed development and theory of maximum likelihood estimation.
The only prerequisite for understanding this article is basic knowledge of classical probability theory.
Our ultimate goal is to fit a conditional distribution function to given data. Let us first describe the structure of this data.
The sample data
The data are a list of records, each containing values for a set of feature variables $(X_i)_{1\leq i\leq n}$ and one dependent variable $Y$. This can be presented in matrix form: $$ \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,n} & y_1\\ x_{2,1} & x_{2,2} & \cdots & x_{2,n} & y_2\\ \vdots\\ x_{m,1} & x_{m,2} & \cdots & x_{m,n} & y_m\\ \end{pmatrix} $$ We assume that each record has been obtained independently of the others, that is, picking/getting $$ \begin{pmatrix} x_{j,1} & x_{j,2} & \cdots & x_{j,n} & y_j \end{pmatrix} $$ has no influence on picking/getting $$ \begin{pmatrix} x_{k,1} & x_{k,2} & \cdots & x_{k,n} & y_k \end{pmatrix} $$ for $k\neq j$.
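As a side note, such a data matrix could for instance be held as a plain NumPy array whose last column carries the $y$-values. The following sketch is purely illustrative; the numbers and variable names are made up and not part of the derivation.

```python
import numpy as np

# Hypothetical sample data: m = 4 records, n = 2 features plus one dependent value y.
# Each row corresponds to one record (x_{k,1}, x_{k,2}, y_k).
data = np.array([
    [0.5, 1.2, 1],
    [1.5, 0.3, 0],
    [0.9, 2.1, 1],
    [2.2, 0.7, 0],
])

X = data[:, :-1]  # feature values, shape (m, n)
y = data[:, -1]   # dependent values, shape (m,)
```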
The probabilistic framework
The crucial step is to consider these records as outcomes of certain random variables. For this we introduce two sets of random variables, $\{X_{i,j} : 1\leq i \leq m \wedge 1\leq j \leq n\}$ and $\{Y_i : 1\leq i \leq m \}$.
We interpret the above records in the following sense:
$x_{i,j}$ is an outcome of the variable $X_{i,j}$ and $y_i$ is an outcome of $Y_i$. In other words, at the time when the $k$'th record was 'produced', the variables $\{X_{k,j} : 1\leq j \leq n\} \cup \{Y_k\}$ were actually involved.
Assumptions
For our purposes we make the assumption that the variables $\{X_{i,j}\}_{i,j}$ are independent of each other. The variables $\{Y_i\}_i$ are all assumed to be independent as well, but in addition we assume them to be conditionally independent given all $\{X_{i,j}\}_{i,j}$. This means that, whatever the outcome of the $\{X_{i,j}\}_{i,j}$ is, the outcome of $Y_k$ has no influence on the outcome of $Y_i$ for $i\neq k$.
Formally one can express this by
(1) $P(Y_1=y_1 \wedge \cdots \wedge Y_m=y_m \ | \ \bigwedge_{i,j} X_{i,j}=x_{i,j})=$
$P(Y_1=y_1 \ | \ \bigwedge_{i,j} X_{i,j}=x_{i,j}) \cdots P(Y_m=y_m \ | \ \bigwedge_{i,j} X_{i,j}=x_{i,j})$
There is one dependence, though, which we aim to exploit:
the dependence of $Y_{k}$ on $\{X_{k,j}\}_j$.
The stronger this dependence is, the better we are able to predict $y$ based on the values of $(x_1, \cdots, x_n)$.
However, this dependence is not assumed to hold across rows. On the contrary, we assume that for any $i\neq k$, $Y_{i}$ is independent of $\{X_{k,j}\}_j$.
Conditional Probabilities
Before we proceed, let us take a small excursion to a fundamental fact about conditional probabilities. Consider probability events $Y, X_1, X_2$ for which we assume the same independence assumptions as above. In other words, we restrict to the case of having only one $Y_i$ and two $X_{i,j}$'s. By the definition of conditional probability we have

(2) $P(Y \ | \ X_1\wedge X_2)\cdot P(X_1\wedge X_2)=P(Y\wedge X_1 \wedge X_2)$
The independence between $\{Y, X_1\}$ and $X_2$ allows us to write $$ P(Y\wedge X_1 \wedge X_2)=P(Y\wedge X_1)\cdot P(X_2) $$ The r.h.s. can be reformulated by use of conditional probability to get $$ P(Y\wedge X_1 \wedge X_2)=P(Y \ | \ X_1)\cdot P(X_1) \cdot P(X_2) $$ Thus (2) becomes $$ P(Y \ | \ X_1\wedge X_2)\cdot P(X_1\wedge X_2)=P(Y \ | \ X_1)\cdot P(X_1) \cdot P(X_2) $$ Further, since $X_1$ and $X_2$ are independent, we have $P(X_1\wedge X_2)=P(X_1)\cdot P(X_2)$. Using this, the above equation becomes $$ P(Y \ | \ X_1\wedge X_2)\cdot P(X_1)\cdot P(X_2)=P(Y \ | \ X_1)\cdot P(X_1) \cdot P(X_2) $$ Under the assumption that neither $P(X_1)$ nor $P(X_2)$ vanishes, this implies
(3) $P(Y \ | \ X_1\wedge X_2)=P(Y \ | \ X_1)$
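To make (3) tangible, here is a small Monte Carlo sketch of my own (not part of the derivation; all probabilities are made up): $Y$ depends only on $X_1$, while $X_2$ is independent of both, so the two empirical conditional frequencies should approximately agree.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

# X1, X2: independent binary events; Y depends only on X1.
x1 = rng.random(N) < 0.5
x2 = rng.random(N) < 0.5
y = rng.random(N) < np.where(x1, 0.8, 0.2)  # P(Y|X1) is 0.8 or 0.2, no dependence on X2

# Empirical P(Y | X1 and X2) vs. empirical P(Y | X1): both should be close to 0.8.
print(y[x1 & x2].mean(), y[x1].mean())
```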
The Likelihood
As is easily seen, formula (3) also holds for more than two $X_i$'s. Hence we can use it to rewrite (1):

(4) $P(Y_1=y_1 \wedge \cdots \wedge Y_m=y_m \ | \ \bigwedge_{i,j} X_{i,j}=x_{i,j})=$
$P(Y_1=y_1 \ | \ \bigwedge_{j} X_{1,j}=x_{1,j}) \cdots P(Y_m=y_m \ | \ \bigwedge_{j} X_{m,j}=x_{m,j})$
Further, we assume the variables $Y_i$ all to have the same distribution. We make the same assumption, for any fixed $k\in\{1, \cdots, n\}$, for the variables $\{X_{i,k} : 1\leq i \leq m \}$. Therefore in (4) we can introduce a single variable $Y$ which represents all $Y_{i}$'s and, for any $k\in\{1, \cdots, n\}$, a variable $X_{k}$ which represents all $\{X_{i,k} : 1\leq i \leq m \}$. We obtain
(5) $P(Y_1=y_1 \wedge \cdots \wedge Y_m=y_m \ | \ \bigwedge_{i,j} X_{i,j}=x_{i,j})=$
$P(Y=y_1 \ | \ \bigwedge_{j} X_j=x_{1,j}) \cdots P(Y=y_m \ | \ \bigwedge_{j} X_j=x_{m,j})$
In the literature this is often called the 'likelihood' and written in the form
(5') $P(Y_1=y_1 \wedge \cdots \wedge Y_m=y_m \ | \ \bigwedge_{i,j} X_{i,j}=x_{i,j})=\prod_{k}P(Y=y_k \ | \ \bigwedge_{j} X_j=x_{k,j})$
The Maximum Likelihood
Most statistical models try to find a conditional distribution function $F$ which satisfies $$ F(x_1, \cdots, x_n, y)=P(Y=y \ | \ \bigwedge_{j} X_j=x_j) $$ This is the probability that $Y$ takes the value $y$ under the condition that, for all $j$, $X_j$ takes the value $x_j$. One approach is to use a function $F(x_1, \cdots, x_n, y, \theta)$ which is parameterized by $\theta$. This parameter can be a single value or a tuple of values. The idea of the so-called 'maximum likelihood method' is to take for $\theta$ a value which maximizes the r.h.s. of (5'). So we try to solve

(6) $\max_{\theta}\prod_{k}F(x_{k,1}, \cdots, x_{k,n}, y_k, \theta)$
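As one concrete example of such a parameterized family, one could take the logistic regression model, $F(x_1,\cdots,x_n,1,\theta)=\sigma(\sum_j \theta_j x_j)$ and $F(x_1,\cdots,x_n,0,\theta)=1-\sigma(\sum_j \theta_j x_j)$ with $\sigma$ the sigmoid function. The sketch below simply evaluates the product in (6) for one candidate $\theta$; the toy data and parameter values are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def F(x, y, theta):
    # One possible parameterized conditional model: logistic regression,
    # i.e. P(Y=1 | X=x) = sigmoid(theta . x) and P(Y=0 | X=x) = 1 - sigmoid(theta . x).
    p1 = sigmoid(x @ theta)
    return p1 if y == 1 else 1.0 - p1

def likelihood(X, Y, theta):
    # The product in (6): prod_k F(x_{k,1}, ..., x_{k,n}, y_k, theta).
    return float(np.prod([F(X[k], Y[k], theta) for k in range(len(Y))]))

# Toy data (m = 4 records, n = 2 features) and one candidate parameter value.
X = np.array([[0.5, 1.2], [1.5, 0.3], [0.9, 2.1], [2.2, 0.7]])
Y = np.array([1, 0, 1, 0])
print(likelihood(X, Y, np.array([0.1, -0.2])))
```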
Why does this maximization make sense?
Well, since $F(\cdots, \theta)$ is a conditional distribution for some variables $X^{\theta}_1, \cdots, X^{\theta}_n$ and $Y^{\theta}$, maximizing (6) is the same as maximizing $$ \prod_{k}P(Y^{\theta}=y_k \ | \ \bigwedge_{j} X^{\theta}_j=x_{k,j}) $$ over $\theta$. If we were to pick samples of these variables the same way we did for $X_1, \cdots, X_n$ and $Y$, then according to (5') we would have $$ P(Y^{\theta}_1=y_1 \wedge \cdots \wedge Y^{\theta}_m=y_m \ | \ \bigwedge_{i,j} X^{\theta}_{i,j}=x_{i,j})= \prod_{k}F(x_{k,1}, \cdots, x_{k,n}, y_k, \theta) $$ Therefore, choosing $\theta$ so that the r.h.s. is as large as possible means, in view of the l.h.s., that samples of the variables $X^{\theta}_1, \cdots, X^{\theta}_n$ and $Y^{\theta}$ coincide with our given samples with as high a probability as possible. In other words, the probability that the variables $X^{\theta}_1, \cdots, X^{\theta}_n$ and $Y^{\theta}$ produce our samples is as high as possible.
Altogether we have described a very general method to obtain a conditional distribution from given samples. Note that in practice the optimization problem (6) is usually not solvable analytically, and one must apply optimization techniques like gradient descent. Also, since products behave very badly in terms of numerical errors, one usually takes the logarithm (a strictly monotone function) and, instead of (6), solves $$ \max_{\theta}\sum_{k}\log(F(x_{k,1}, \cdots, x_{k,n}, y_k, \theta)) $$
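To illustrate, here is a minimal sketch of such a log-likelihood maximization by plain gradient ascent (equivalently, gradient descent on the negative log-likelihood), again assuming the logistic model from above as $F$. The data, learning rate and step count are arbitrary choices for demonstration, not a fixed recipe.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    # sum_k log F(x_k, y_k, theta) for the logistic model, with a small eps for stability.
    p = sigmoid(X @ theta)
    eps = 1e-12
    return np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def fit_mle(X, y, lr=0.1, steps=2000):
    # Maximize the log-likelihood by plain gradient ascent.
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ theta)
        grad = X.T @ (y - p)          # gradient of the log-likelihood w.r.t. theta
        theta += lr * grad / len(y)
    return theta

# Toy usage: generate synthetic data from a known theta and recover it approximately.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
true_theta = np.array([2.0, -1.0])
y = (rng.random(200) < sigmoid(X @ true_theta)).astype(float)
theta_hat = fit_mle(X, y)
print(theta_hat, log_likelihood(theta_hat, X, y))
```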