Decision Trees can be used for supervised classification as well as regression problems.
There exist many variations of these methods, for instance Random Forests, Boosted Trees, etc.
In this article we will restrict explanations to the most fundamental version. The techniques used for this kind of
model are quite basic and do not require much more than elementary probability theory.
Data structure
The data are supposed to have the following structure:
Feature samples:
$$
X=
\begin{pmatrix}
x_{1,1} & \cdots & x_{1, n}\\
\vdots & \ddots & \vdots\\
x_{m,1} & \cdots & x_{m, n}
\end{pmatrix}
$$
Outcome samples:
$$
Y=\begin{pmatrix}
y_1 \\
\vdots \\
y_m
\end{pmatrix}
$$
Decision trees impose only very mild requirements on the data. Features can be a mixture of continuous and categorical variables.
The same holds for the outcome. If the latter is continuous, we call it Decision Tree Regression, otherwise a Decision Tree Classifier.
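Purely for illustration, such data could be held in Python for example like this (the values below are made up):
import pandas as pd

# feature samples X: m rows and n columns, a mixture of continuous and categorical variables
X = pd.DataFrame({
    "length": [0.45, 0.35, 0.53],   # continuous feature
    "gender": ["M", "F", "I"],      # categorical feature
})

# outcome samples Y: one value per row
Y = pd.Series([15, 7, 9], name="rings")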
Tree decision
The principal idea of decision trees is to dissect features into disjoint values (categorical) or into disjoint intervals (continuous).
For a classification problem, these dissections are chosen such that the classes within each part are as homogeneous as possible.
For a regression, one tries to minimize the variance of the outcome value within each part. An example is this:
Here it is the feature 'shell_w' which is split into disjoint intervals. The first node dictates a cut at 0.167, so we get two intervals.
Moving below 0.167 one arrives at the next split at 0.059, moving above 0.167 one arrives at 0.375. In either case the previous interval
is split again, giving 4 disjoint intervals at tree level 2. The third level is made up of leaves. These nodes represent the classification or regression, as in this example: in case of a classification, a leaf's most common class is taken to be the prediction; for a regression it is the leaf's average value.
In general this dissection is made over all feature variables: if $D_1, \cdots, D_n$ denote the domains of the corresponding feature variables, then
we split $D_1\times \cdots \times D_n$ into disjoint $n$-dimensional intervals. Again, these intervals are chosen such that the classes resp. values within each interval are as homogeneous as possible.
So, a leaf is an interval in $D_1\times \cdots \times D_n$ which does not contain any further interval from our dissection.
Measurement of a feature-split
The question arises how to determine these optimal splits.
In order to answer this, let us consider a single feature $X$.
Case: "Learning task is a regression"
If the feature is continuous and its domain is the interval $[a,b]$
then we choose $\xi\in [a,b]$ such that the split maximally reduces variance:
$$
var(Y|X\in[a,b]) - var(Y|X\in[a,\xi]) - var(Y|X\in]\xi, b])
$$
Here the term $var(Y|X\in[x,y])$ denotes the variance of the conditional distribution of $Y$ on $\{X\in [x,y]\}$; in words, it is the variance of $Y$ when restricting to samples with $X\in [x,y]$.
This criterion is quite plausible: the more we reduce the variance in each of the sub-intervals, the more representative
the mean value of $Y$ becomes as a prediction.
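Purely as an illustration of this criterion (not the procedure of any particular library), the cut point $\xi$ for a single continuous feature could be searched like this, assuming x and y are one-dimensional numpy arrays:
import numpy as np

def best_continuous_split(x, y):
    # variance of Y on the whole interval [a, b]
    base_var = y.var()
    best_xi, best_reduction = None, 0.0
    # candidate cut points: midpoints between consecutive distinct feature values
    values = np.unique(x)
    for low, high in zip(values[:-1], values[1:]):
        xi = (low + high) / 2
        mask = x <= xi
        # variance reduction achieved by cutting at xi
        reduction = base_var - y[mask].var() - y[~mask].var()
        if reduction > best_reduction:
            best_xi, best_reduction = xi, reduction
    return best_xi, best_reduction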
In case $X$ is categorical, it is even more straightforward. If $\{a_1, \cdots, a_k\}$ are the categories of $X$, then we choose
a category $b$ which maximally reduces variance:
$$
var(Y|X\in \{a_1, \cdots, a_k\}) - var(Y|X=b) - var(Y|X\in \{a_1, \cdots, a_k\}-\{b\})
$$
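In the same spirit, a rough sketch for the categorical case (with x now holding the category labels) could look like this:
import numpy as np

def best_category_split(x, y):
    # choose the single category b whose separation reduces variance the most
    base_var = y.var()
    best_b, best_reduction = None, 0.0
    for b in np.unique(x):
        mask = x == b
        if mask.all():
            continue  # only one category present, nothing to split off
        reduction = base_var - y[mask].var() - y[~mask].var()
        if reduction > best_reduction:
            best_b, best_reduction = b, reduction
    return best_b, best_reduction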
Case: "Learning task is a classification"
The above variance reduction no longer makes direct sense. The idea of either splitting an interval (continuous feature) or selecting specific
categories (categorical feature) remains the same, but the criterion for evaluating a split is now taken to be the so-called
Gini impurity.
Whether we split an interval or, in case of a categorical feature, a set of categories, the result is a family of disjoint sets.
For such a set $A$ which either is an interval or a set of categories, we denote by $p_i$ the probability of class $i$ to fall into that set, that is
$$
p_i:=P(Y=i|X\in A)
$$
Then $1-p_i$ is the probability of the opposite event, that is $1-p_i=P(Y\neq i|X\in A)$. If we predicted $Y$ to be in class $i$ whenever
$X\in A$, then $1-p_i$ would be the probability of a misclassification. The classification rule will be based on which class occurs most often within a
given group. This means we can weight the misclassification rates by the occurrences of their classes:
$$
\sum_{i=1}^k p_i(1-p_i)
$$
So, in a group where most samples have class $i$, there is a high chance that our classifier finally predicts this class $i$, and hence we are mainly
concerned with its misclassification rate, that is $1-p_i$.
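As a small sketch, this quantity can be computed from the class labels y of the samples falling into the set $A$:
import numpy as np

def gini_impurity(y):
    # p_i = relative frequency of class i within the group
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    # sum over i of p_i * (1 - p_i)
    return float(np.sum(p * (1 - p)))
For a perfectly pure group the value is 0, for two equally frequent classes it is 0.5, so lower values mean more homogeneous groups.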
Implementation
Since we now have a technique to evaluate a split, we can describe how to construct a classifier/regressor using all features.
We recursively create a binary tree where each node represents an $n$-dimensional interval of $D_1\times \cdots \times D_n$.
1.) as the first node we take the entire interval $D_1\times \cdots \times D_n$
2.) we use this first node as the current node
3.) at the current node we restrict the samples to those lying within the node's interval $I$
4.) we iterate through all features to obtain the split with the largest impurity reduction (resp. variance reduction for a regression)
5.) we add two new child nodes to the current node
6.) to the first child node we attribute the interval obtained by restricting $I$ to the first group of the new split
7.) to the second child node we analogously attribute the other group
8.) if the stopping criterion is not met, we set the current node to the first child node and go back to step 3
9.) if the stopping criterion is not met, we set the current node to the second child node and go back to step 3
The mentioned stopping criterion could for instance require each interval to contain a minimum number of samples or the tree not
to surpass a maximal depth.
After having constructed this tree, and if it is a classification, we can attribute to each leaf node the class with the highest occurrence in the given interval.
In case of a regression we just take the mean value within the leaf's interval.
This way, we have constructed a full regression/classification tree. A new sample can be predicted by moving to the leaf node
whose path to the root matches its feature values $x_1, \cdots, x_n$.
In practice there exist many variations and improvements of the algorithm sketched above, though the principal idea is as just explained.
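To make the recursion above concrete, here is a rough sketch restricted to continuous features and classification. The names Node, build_tree and predict are made up for this illustration, and the split is scored by a weighted impurity reduction of the same form that reappears in the feature importance section below:
import numpy as np

def gini(y):
    # relative class frequencies p_i within the group
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1 - p)))

def best_split(X, y):
    # step 4: iterate over all features and candidate cut points,
    # keep the split with the largest weighted impurity reduction
    best = None
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        for low, high in zip(values[:-1], values[1:]):
            xi = (low + high) / 2
            mask = X[:, j] <= xi
            p_left = mask.mean()
            reduction = gini(y) - p_left * gini(y[mask]) - (1 - p_left) * gini(y[~mask])
            if best is None or reduction > best[0]:
                best = (reduction, j, xi)
    return best  # (impurity reduction, feature index, cut point) or None

class Node:
    def __init__(self, feature=None, cut=None, left=None, right=None, prediction=None):
        self.feature, self.cut = feature, cut
        self.left, self.right = left, right
        self.prediction = prediction  # set only for leaf nodes

def build_tree(X, y, depth=0, max_depth=6, min_samples=5):
    split = best_split(X, y)
    # stopping criterion: maximal depth, minimal sample count or no useful split left
    if depth >= max_depth or len(y) < min_samples or split is None or split[0] <= 0:
        # leaf: predict the most common class within the node's interval
        classes, counts = np.unique(y, return_counts=True)
        return Node(prediction=classes[np.argmax(counts)])
    _, j, xi = split
    mask = X[:, j] <= xi
    return Node(feature=j, cut=xi,
                left=build_tree(X[mask], y[mask], depth + 1, max_depth, min_samples),
                right=build_tree(X[~mask], y[~mask], depth + 1, max_depth, min_samples))

def predict(node, x):
    # walk from the root to the leaf whose path matches the feature values
    while node.prediction is None:
        node = node.left if x[node.feature] <= node.cut else node.right
    return node.prediction
Calling build_tree on a feature matrix and a label vector and then predict on a single sample row mirrors steps 1 to 9 above.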
Feature Importance
One very nice property of decision trees is the ability to obtain, for each feature, its contribution to the classifier/regressor.
Again, implementations and the literature define this in different ways. We will present a very basic one to convey the main idea.
Let us consider feature $X$. At every node where it provides a split we can compute the Gini impurity of the parent node, $G(P)$, and of both
child nodes resulting from the split, $G(C_1), \ G(C_2)$. Moreover, from the samples we can compute the probability $p_0$ of reaching the parent node and
the probabilities $p_1, \ p_2$ of reaching its child nodes.
With this
$$
p_0\cdot G(P)-p_1\cdot G(C_1)-p_2\cdot G(C_2)
$$
is the weighted reduction of Gini impurity contributed by the specific split.
We can do this for all nodes which involve feature $X$ and sum up the weighted impurity reductions.
This gives an overall measure of how much feature $X$ contributes to impurity reduction.
Usually this value is divided by the sum of the impurity reductions of all features.
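For orientation, such importances can be recomputed from a fitted scikit-learn tree via its tree_ attributes (children_left, children_right, feature, impurity, weighted_n_node_samples); this is meant as an illustration of the formula above, not as the exact implementation behind feature_importances_:
import numpy as np

def impurity_importances(classifier, n_features):
    tree = classifier.tree_
    importances = np.zeros(n_features)
    total = tree.weighted_n_node_samples[0]
    for node in range(tree.node_count):
        left, right = tree.children_left[node], tree.children_right[node]
        if left == -1:  # leaf node, no split
            continue
        # p_0 * G(P) - p_1 * G(C_1) - p_2 * G(C_2), with p_* the fraction of samples reaching each node
        reduction = (
            tree.weighted_n_node_samples[node] * tree.impurity[node]
            - tree.weighted_n_node_samples[left] * tree.impurity[left]
            - tree.weighted_n_node_samples[right] * tree.impurity[right]
        ) / total
        importances[tree.feature[node]] += reduction
    # normalize so that the importances of all features sum to one
    return importances / importances.sum()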
Example
We look at an application of a Decision Tree as one would implement it in scikit-learn.
Data are taken from https://archive.ics.uci.edu/ml/datasets/Abalone. The task is to classify w.r.t. the column "rings".
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

feature_names = [
    "gender",
    "length",
    "diameter",
    "height",
    "whole_w",
    "shucked_w",
    "viscera_w",
    "shell_w",
]

# loading data
data = pd.read_csv(
    "example-code/decision-tree/abalone.csv",
    delimiter=",",
    names=feature_names + ["rings"],
)

# encode categorical variables
enc = OrdinalEncoder()
data[["gender"]] = enc.fit_transform(data[["gender"]]).astype(int)

# binning the output
data["rings"] = pd.qcut(data["rings"], 4, labels=False)

# splitting into train and test data
X_train, X_test, y_train, y_test = train_test_split(
    data.iloc[:, :-1],
    data["rings"],
    random_state=0,
)

# classification tree:
classifier = DecisionTreeClassifier(random_state=0, max_depth=6)
classifier.fit(X_train, y_train)
print(classifier.score(X_train, y_train))
print(classifier.score(X_test, y_test))
print(classifier.feature_importances_)
After loading the data into a pandas dataframe and encoding the string-valued variable, we apply some binning in order to slightly improve predictability:
data["rings"] = pd.qcut(data["rings"], 4, labels=False)
Then, as usual, we split the data into train and test sets and finally fit the classifier, restricting paths to a length of at most
6 (max_depth=6):
classifier = DecisionTreeClassifier(random_state=0, max_depth=6)
classifier.fit(X_train, y_train)
print(classifier.score(X_train, y_train))
print(classifier.score(X_test, y_test))
print(classifier.feature_importances_)
Running this code yields:
0.6133461047254151
0.570334928229665
[0.04072161 0.00688689 0.01510883 0.00407257 0.04468784 0.16976092 0.01039694 0.7083644 ]
This shows that by pruning the tree to depth 6 the model is not over-fitting considerably. Still, as we see, a tree model
does not seem to be a very good match for these data. Among other reasons, this is caused by the very low correlation between features and outcome.
Feature importance can be visualized like this:
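Since matplotlib is already imported in the code above, one way to produce such a bar chart is:
plt.bar(feature_names, classifier.feature_importances_)
plt.ylabel("feature importance")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()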
Extensions
Since decision trees in their pure form usually tend to over-fit the data, extended versions have been developed.
Basically they rely on the principles described here but use different techniques to mitigate over-fitting. There is for instance
the so-called Random Forest, which produces a bundle of trees in a randomized fashion. Classification resp. regression is then
based on picking the class with the highest probability resp. the average value. Another very promising technique is the Gradient-Boosted Regression Tree. It uses the so-called gradient boosting technique: it starts with an over-simplified tree and then iteratively constructs further simple trees, each aiming to mitigate the misclassification resp. error produced by the previous predictor.
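To give an impression, both variants are available in scikit-learn and can be applied to the abalone data from above in essentially the same way as the single tree; the hyper-parameters below are only placeholders, not tuned values:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# a randomized bundle of trees; the prediction is the class with the highest averaged probability
forest = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))

# gradient boosting: simple trees are added one after another,
# each one aiming to correct the error of the previous predictor
boosting = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)
boosting.fit(X_train, y_train)
print(boosting.score(X_test, y_test))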
Final notes
This method is applicable in almost any setting. It covers mixed features, continuous and categorical, and can be used to
classify a categorical variable or to perform a regression.
Decision trees are notorious for over-fitting dramatically if the tree is not pruned, so it is important
to always use restrictions like a maximal tree depth.
Moreover the technique does not require any assumptions about the data; it does not matter whether features have dependencies or not.
The model picks out the relevant features automatically. In terms of data pre-processing, no
binning, one-hot encoding or scaling is necessary.