Decision Trees

You can find the iPython Notebook for this post at https://github.com/bryantravissmith/FromScratch/tree/master/SupervisedLearning/DecisionTrees

Decision Trees are a non-parametric supervised machine learning algorithm that takes a set of training data and constructs a set of regions in feature space, which are then used to make predictions. These predictions can be continuous values (Decision Tree Regression) or class labels (Decision Tree Classification).

Decision Tree Process
[Figure: high-level overview of the decision tree process]

Let's imagine we have a dataset with many features describing individuals, along with how much student debt each of them has. We could just take the average of the entire dataset, say $25,000.00, and use that as a prediction. A better prediction could be developed by dividing the dataset up by the level of college completed: None, Some, Undergraduate, and Graduate. We can then take the average of each subset/region:

  1. None: \$0
  2. Some: \$2000
  3. Undergraduate: \$16000
  4. Graduate: \$42000

Now when we score a person, we use their level of college completed to make the prediction. The idea is that the results are improved over the $25,000 average estimate of the entire dataset.

[Figure: student loan example, predicting debt from level of college completed]
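This grouped-average rule is easy to express in code. The numbers below are made up to match the example above; the point is just the mechanics of predicting with per-region means.

import numpy as np

# Made-up student debt figures, grouped by level of college completed
debts = {
    "None": [0, 0, 0],
    "Some": [1000, 2000, 3000],
    "Undergraduate": [10000, 16000, 22000],
    "Graduate": [30000, 42000, 54000],
}

# A one-split "tree": predict the mean of the region a person falls into
region_means = {level: np.mean(values) for level, values in debts.items()}

def predict(level):
    return region_means[level]

print(predict("Undergraduate"))   # 16000.0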

Why Implement It Ourselves?

The reason I want to implement machine learning algorithms from scratch is to build better intuition for the assumptions and trade-offs baked into the implementations of standard libraries. For instance, before I did this for the first time I did not appreciate that both sklearn (Python) and rpart (R) implement the CART algorithm for decision trees, but that there are other equally valid decision tree algorithms that would produce different results. This implementation is inherently greedy. The issue with greedy algorithms is that making the locally best splits or regions may not produce the globally best set of regions. This is part of what makes finding the best Decision Tree model NP-hard. If you really get into it, finding the best tree becomes an optimal stopping problem and brushes into Reinforcement Learning and Artificial Intelligence style search.

On a more practical side, there have been a number of times in the iterative approach of model-building where understanding the model assumptions has allowed me to exclude features that violate those assumptions, or to create new features that are more easily expressed and represented by the model. Both of these have consistently led to better out-of-sample performance.

Finally, the best reason is that I get pleasure and satisfaction from doing this. I think algorithms are a powerful framework and a fun way to think about problem solving, and implementing them from scratch improves both my execution and my understanding of them.

Metrics – Measures for Improvement

I made up the student debt example. It makes sense in my mind, but we need to define an algorithm that can create the regions from any dataset. Most machine learning algorithms are optimization algorithms that improve some measure/metric on each iteration. Decision trees are no different in that respect. To define the algorithm, we need to define the measures.

Classification Metrics for Decision Trees

Gini Impurity

Gini Impurity is a very intuitive metric to improve because it is a measure of the misclassification rate. If the fraction of the dataset that has outcome class c is f_c, and we predict class c for a data point f_c of the time, then the error rate for a class is the probability of drawing the class times the probability of predicting it incorrectly.

Error_c = f_c * \left( 1 - f_c \right)

The Gini Impurity is the sum of these error rates for all classes.

Gini \ Impurity = \sum_c Error_c = \sum_c f_c * \left( 1 - f_c \right)

Gini \ Impurity = \sum_c f_c - \sum _c f^2_c

Gini \ Impurity = 1 - \sum _c f^2_c

If we divide the data set in regions that improve the weighted sum of the Gini Impurity, the misclassification rate will be reduced.
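As a quick numerical check with made-up numbers: start with 100 points split 50/50 between two classes, then partition them into a 60-point region and a 40-point region. The weighted Gini Impurity after the split is lower than before, so the split helps.

import numpy as np

def gini(fractions):
    # Gini Impurity = 1 - sum of squared class fractions
    return 1 - np.sum(np.asarray(fractions, dtype=float)**2)

print(gini([0.5, 0.5]))              # 0.5 before the split

# Made-up split: 60 points at 45/15, 40 points at 5/35
left = gini([45/60., 15/60.])        # 0.375
right = gini([5/40., 35/40.])        # 0.21875
print(0.6*left + 0.4*right)          # 0.3125, an improvement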

Gini Impurity for a Coin Flip

A coin flip is about as simple a process as there is for calculating the Gini Impurity. We have a coin with probability of heads p_H. The Gini Impurity is then

Gini \ Impurity = 1 - p_H^2 - \left(1-p_H\right)^2 = 2*p_H*\left(1-p_H\right)

A plot of this for different values of p_H is below.

import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

p = np.linspace(0.01,0.99,100)
plt.figure(figsize=(16,8))
plt.plot(p,1-p**2-(1-p)**2)
plt.ylim(0,1)
plt.show()
 

[Plot: Gini Impurity of a coin flip as a function of p_H]

We see that if pH = 0  or pH = 1 there is no misclassification. If pH = 0.5, we see that we get the coin flip wrong 50% of the time. This should match our intuition.

Entropy

Another, and less intuitive, measure of prediction accuracy is entropy, defined as the expected information content of the dataset. If f_c is the fraction of the dataset in class c, and I(c) is the information content of that class, then entropy is defined as follows:

Entropy = \sum_c f_c * I(c)

For reasons I will not get into in this post, I(c) = -\log_2(f_c), so the entropy of a dataset is:

Entropy = - \sum_c f_c * log_2\left(f_c\right)

The base of the log can be anything, really. But since we will be using binary trees throughout this series, I will keep the base 2.

Entropy for a Coin Flip

Continuing with the coin flip example, I have plotted the entropy of a coin flip for different values of p_H.

plt.figure(figsize=(16,8))
plt.plot(p,-p*np.log2(p)-(1-p)*np.log2(1-p))
plt.ylim(0,1)
plt.show()
 

[Plot: entropy of a coin flip as a function of p_H]

The expected information content for p_H = 0 and p_H = 1 is zero, which means that we always know what the next flip will be. The entropy is maximal at p_H = 0.5, which is where we have the highest misclassification rate.
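For concreteness, plugging two values of p_H into the entropy formula:

Entropy(p_H = 0.5) = -0.5 \log_2(0.5) - 0.5 \log_2(0.5) = 1

Entropy(p_H = 0.9) = -0.9 \log_2(0.9) - 0.1 \log_2(0.1) \approx 0.47

A fair coin carries one full bit of information per flip, while a heavily biased coin carries much less.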

Regressions Metrics for Decision Trees

The metric used for decision tree regression is a reduction of  variance around the prediction.

Variance = \frac{1}{N} \sum_{i}^{N} ( y_i - \bar{y} )^2

Simple Linear Example

We can take a simple linear example and split it into two regions. The solid black line is the average of the entire dataset, while the red lines are the predictions in each region.

[Plot: linear data split into two regions; overall mean in black, per-region means in red]

The total variance before the split was 40.00, while the weighted variance after the split is:

\frac{6}{11} 11.67 + \frac{5}{11} 8.00  = 10.0
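The same weighted-variance arithmetic is easy to script. The data below are made up, chosen as a clean linear series so the numbers land on the same 40 and 10 quoted above.

import numpy as np

# Made-up linear data: 11 points, split into a left region of 6 and a right region of 5
x = np.arange(11)
y = 2.0 * x
mask = x < 6

var_all = np.var(y)                  # 40.0
var_left = np.var(y[mask])           # ~11.67
var_right = np.var(y[~mask])         # 8.0
weighted = (mask.sum()*var_left + (~mask).sum()*var_right) / y.size

print(var_all, weighted)             # 40.0 10.0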

Improving Metrics – Splitting on Information Gain

Now that we have metrics to evaluate the predictions from a dataset, we can define the improvement measure to optimize. I will be using the term information gain, which is the correct term when the metric is entropy, for all the optimizations. I cannot say conclusively that it is the right name for the measure in decision tree regression. Being wrong hasn't stopped me before!

Decision Tree Classifier

Information \ Gain = Gini \ Impurity(d) - \sum_{split \ s} \frac{|s|}{|d|} Gini \ Impurity(s)

Information \ Gain = Entropy(d) - \sum_{split \ s} \frac{|s|}{|d|} Entropy(s)

Decision Tree Regressor

Information \ Gain = Variance(d) - \sum_{split \ s} \frac{|s|}{|d|} Variance(s)

Parameters of Decision Trees

Decision Trees are prone to overfitting because the metrics can be driven to their minimum by making the tree deep enough to capture each data point individually. The variance of a single point is zero. The misclassification rate of a single point is also zero. Traditionally this overfitting is managed by pruning the tree after it is constructed, or by setting parameters that limit how finely the data can be partitioned into regions.

The post-pruning of a decision tree uses some sort of statistical evaluation on a validation set. If the error on the sub-nodes is larger than the error on the node itself, the sub-nodes are removed.
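As a sketch of that idea (often called reduced-error pruning), the function below assumes a fitted classification node with the attributes of the DecisionTreeNode class built later in this post (y, left, right, is_leaf, predict) plus a held-out validation set; it is illustrative, not part of the implementation that follows.

import numpy as np

def prune(node, X_val, y_val):
    # Hypothetical reduced-error pruning sketch for a classification tree:
    # collapse a split if turning the node into a leaf does at least as well
    # on the validation data as keeping the subtree.
    if node.is_leaf or y_val.size == 0:
        return

    # Prune the children first (bottom-up). For brevity the full validation
    # set is passed down instead of routing rows to each child.
    prune(node.left, X_val, y_val)
    prune(node.right, X_val, y_val)

    # Validation error if we keep the subtree
    error_with_split = np.mean(node.predict(X_val) != y_val)

    # Validation error if this node becomes a leaf predicting its training majority
    values, counts = np.unique(node.y, return_counts=True)
    leaf_prediction = values[np.argmax(counts)]
    error_as_leaf = np.mean(leaf_prediction != y_val)

    if error_as_leaf <= error_with_split:
        node.left, node.right, node.is_leaf = None, None, True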

The parameters that are commonly used in CART to limit the partitioning of the data are listed below; a rough mapping to sklearn's parameter names is sketched after the list.

  1. Maximum Depth of the Tree
  2. Minimum Number of Data Points for a New Node
  3. Minimum Improvement needed for a New Node
  4. Minimum Metric under which New Nodes are not created.
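For reference, these map roughly onto sklearn's CART parameters. The names below are sklearn's own; min_impurity_decrease is the closest analogue I know of for item 3, and item 4 corresponds to the older, since-deprecated min_impurity_split parameter.

from sklearn.tree import DecisionTreeClassifier

# Rough mapping: 1 -> max_depth, 2 -> min_samples_leaf (and min_samples_split),
# 3 -> min_impurity_decrease. Item 4 was the old min_impurity_split parameter.
clf = DecisionTreeClassifier(max_depth=3,
                             min_samples_leaf=20,
                             min_impurity_decrease=0.01)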

Pseudo Code

Now that we have a metric and a measure of improvement, we can outline the pseudo code we will be implementing in the next posts, which is based on the Classification and Regression Trees (CART) algorithm. CART recursively and greedily splits nodes whenever there is an improvement in the metric of choice, until some stopping conditions are met.

  1. Set Training Parameters
  2. Create a Root Node
  3. Determine the optimal split of the node, i.e. the split that most improves the metric of choice.
  4. If the stopping parameters are not violated, create two child nodes and apply steps 3 and 4 to each child node.

Code

In this code section, I am going to implement a Decision Tree algorithm in the following parts:

  1. Introduce the Data
  2. Script Finding the Best First Partition
  3. Create a Data Structure and Implement Decision Trees
  4. Compare with the sklearn package.

Assumptions

The implementation will avoid some issues with decision trees. I am going to assume that there are never missing values, that categorical variables are represented as strings or booleans, and that anything numeric really is numeric. I am also going to assume that all the data is in a tabular (numpy array) format.

Data

I will be using the Iris dataset for classification and the Boston dataset for regression. Both are available in the sklearn package.

from sklearn.datasets import load_iris,load_boston
iris = load_iris()
boston = load_boston()
 

Boston Housing Price

import pandas as pd
bostonX = boston['data']
bostonY = boston['target']
bostonDF = pd.DataFrame(data = np.hstack((bostonX,bostonY.reshape(bostonY.shape[0],1))),\
                        columns=np.append(boston['feature_names'],'PRICE'))
bostonDF.head()
 
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT  PRICE
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98   24.0
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14   21.6
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03   34.7
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94   33.4
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33   36.2

Iris

irisX = iris['data']
irisY = iris['target']
irisDF = pd.DataFrame(data = np.hstack((irisX,irisY.reshape(irisY.shape[0],1))),\
                        columns=np.append(iris['feature_names'],"Species"))
irisDF.head()
 
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  Species
0                5.1               3.5                1.4               0.2      0.0
1                4.9               3.0                1.4               0.2      0.0
2                4.7               3.2                1.3               0.2      0.0
3                4.6               3.1                1.5               0.2      0.0
4                5.0               3.6                1.4               0.2      0.0

Scripting the First Partition

I am going to implement a brute force search for the best partition of the dataset. This will be done by going through each value in each feature and finding the feature-value pair that best partitions the data. To do this we need to implement the metrics discussed above:

Entropy: Classification
Gini Impurity: Classification
Variance: Regression

def entropy(y):
    if y.size == 0: return 0
    p = np.unique(y, return_counts = True)[1].astype(float)/len(y)
    return -1 * np.sum(p * np.log2(p+1e-9))

def gini_impurity(y):
    if y.size == 0: return 0
    p = np.unique(y, return_counts = True)[1].astype(float)/len(y)
    return 1 - np.sum(p**2)

def variance(y):
    if y.size == 0: return 0
    return np.var(y)
 

I can quickly calculate these metrics for our dataset:

print(entropy(irisY))
print(gini_impurity(irisY))
print(variance(bostonY))
 

1.58496249639
0.666666666667
84.4195561562

Information Gain

Now that I have metric functions, I want to calculate the improvement for different partitions. Each partition is going to be defined by a mask, a boolean vector that states whether a row is in the first partition (True) or in the second partition (False). I will call the function information gain, even though it really is only information gain when entropy is the metric.

def information_gain(y,mask,func=entropy):
    s1 = np.sum(mask)
    s2 = mask.size - s1
    if s1 == 0 or s2 == 0: return 0
    return func(y) - s1/float(s1+s2) * func(y[mask]) - s2/float(s1+s2) * func(y[np.logical_not(mask)])

print(information_gain(irisY,irisX[:,2] < 3.5))
print(information_gain(irisY,irisX[:,2] < 3.5,gini_impurity))
print(information_gain(bostonY,bostonX[:,5] < 7))
 

0.827861517442
0.305906762627
0.456884627461

Max Information Gain For A Single Feature

Now that I have a function to calculate the information gain for a given criterion, mask, or split, I will search for the best split among all the values in a given feature. This will give me the best split for that feature. (The quick np.apply_along_axis call below just shows how a function can be applied to each column; I will use the same trick later for prediction.)

np.apply_along_axis(lambda x: np.sum(x),0,irisX)
 

array([ 876.5, 458.1, 563.8, 179.8])

def max_information_gain_split(y,x,func=gini_impurity):
    best_change = None
    split_value = None
    is_numeric = x.dtype.kind not in ['S','b']

    for val in np.unique(np.sort(x)):
        mask = x == val
        if is_numeric:
            mask = x < val
        change = information_gain(y,mask,func)
        if best_change is None:
            best_change = change
            split_value = val
        elif change > best_change:
            best_change = change
            split_value = val

    return {"best_change":best_change,\
            "split_value":split_value,\
            "is_numeric":is_numeric}

print(max_information_gain_split(irisY,irisX[:,2]))
print(max_information_gain_split(irisY,irisX[:,2],entropy))
print(max_information_gain_split(bostonY,bostonX[:,3],variance))
 

{'is_numeric': True, 'split_value': 3.0, 'best_change': 0.33333333333333343}
{'is_numeric': True, 'split_value': 3.0, 'best_change': 0.91829583213089627}
{'is_numeric': True, 'split_value': 1.0, 'best_change': 2.5930420368499885}

The best split for the iris petal length (index 2) comes out at 3.0 under both metrics, while the best split for this Boston feature (CHAS, index 3) is at 1.0.

Best Feature Split

Now I can apply the max_information_gain_split function across all columns and return the information for the best split.

def best_feature_split(X,y,func=gini_impurity):
    best_result = None
    best_index = None
    for index in range(X.shape[1]):
        result = max_information_gain_split(y,X[:,index],func)
        if best_result is None:
            best_result = result
            best_index = index
        elif best_result['best_change'] < result['best_change']:
            best_result = result
            best_index = index

    best_result['index'] = best_index
    return best_result

print(best_feature_split(irisX,irisY))
print(best_feature_split(irisX,irisY,entropy))
print(best_feature_split(bostonX,bostonY,variance))
 

{'index': 2, 'is_numeric': True, 'split_value': 3.0, 'best_change': 0.33333333333333343}
{'index': 2, 'is_numeric': True, 'split_value': 3.0, 'best_change': 0.91829583213089627}
{'index': 5, 'is_numeric': True, 'split_value': 6.9429999999999996, 'best_change': 38.220464479057071}

Now that I have the best feature-value combination to partition the data, we can split it into two datasets. One I will call 'left', and the other 'right'. This mirrors the data structure we will use, in which each node can have two branches: left and right.

def get_best_mask(X,best_feature_dict):
    best_mask = None
    if best_feature_dict['is_numeric']:
        best_mask = X[:,best_feature_dict['index']] < best_feature_dict['split_value']
    else:
        best_mask = X[:,best_feature_dict['index']] == best_feature_dict['split_value']
    return best_mask

bfs = best_feature_split(irisX,irisY)
best_mask = get_best_mask(irisX,bfs)
left = irisX[best_mask,:]
right = irisX[np.logical_not(best_mask),:]
 

Now that I have scripted a single split, I will now work on creating a tree-based data structure for the decision tree.

Data Structure

I am going to make two classes: a DecisionTree class that loosely mimics the sklearn API, and a DecisionTreeNode class that recursively fits the data based on the hyper-parameters we provide. There are many parameters one could use to keep the tree from overfitting. I have chosen max_depth, min_information_gain, and min_leaf_size. Class diagrams are below.

Class Descriptions of Decision Tree Implementation
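Before filling in the classes, this is the intended usage, mirroring sklearn's fit/predict pattern (it assumes the classes defined below and the iris arrays loaded earlier):

# Intended usage of the classes defined below
dt = DecisionTree(gini_impurity, max_depth=3, min_leaf_size=20)
dt.fit(irisX, irisY)
predictions = dt.predict(irisX)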

DecisionTreeNode

This is going to be the template I will use for the DecisionTreeNode class.

class DecisionTreeNode(object):

    def __init__(self,\
            X,\
            y,\
            minimize_func,\
            min_information_gain=0.01,\
            max_depth=3,\
            min_leaf_size=20,\
            depth=0):
        self.X = X
        self.y = y

        self.minimize_func=minimize_func
        self.min_information_gain=min_information_gain
        self.min_leaf_size = min_leaf_size
        self.max_depth = max_depth
        self.depth = depth

        self.best_split = None
        self.left = None
        self.right = None
        self.is_leaf = False
        self.split_description = ""

    def _information_gain(self,mask):
        pass

    def _max_information_gain_split(self,X):
        pass

    def _best_feature_split(self):
        pass

    def _split_node(self):
        pass

    def _predict_row(self,row):
        pass

    def predict(self,X):
        pass

    def __repr__(self):
        pass
 

Information Gain

This is the same function I implemented during the scripting section, except that it references the object's variables instead of accepting them as parameters.

def _information_gain(self,mask):
    s1 = np.sum(mask)
    s2 = mask.size - s1
    if s1 == 0 or s2 == 0: return 0
    return self.minimize_func(self.y) - \
            s1/float(s1+s2) * self.minimize_func(self.y[mask]) - \
            s2/float(s1+s2) * self.minimize_func(self.y[np.logical_not(mask)])
 

Max Information Gain Split

This function finds the split value with the most information gain for a given feature. It is similar to the scripted version above, except that it references the object's variables and adds an additional constraint: the best split is only updated if the new leaves would be at least as large as the minimum leaf size. This finds the optimal split value given the constraint, while the scripted version above finds the optimal value and then checks whether it matches the constraint. These two methods can give different results and produce different trees.

def _max_information_gain_split(self,x):
    best_change = None
    split_value = None
    previous_val = None
    is_numeric = x.dtype.kind not in ['S','b']

    for val in np.unique(np.sort(x)):
        mask = x == val
        if is_numeric:
            mask = x < val
        change = self._information_gain(mask)
        s1 = np.sum(mask)
        s2 = mask.size - s1
        if best_change is None and s1 >= self.min_leaf_size and s2 >= self.min_leaf_size:
            best_change = change
            split_value = val
        elif change > best_change and s1 >= self.min_leaf_size and s2 >= self.min_leaf_size:
            best_change = change
            split_value = np.mean([val,previous_val])

        previous_val = val

    return {"best_change":best_change,\
            "split_value":split_value,\
            "is_numeric":is_numeric}
 

Best Feature Split

This function finds the feature-value split that gives the most information gain out of all the features.

def _best_feature_split(self):
    best_result = None
    best_index = None
    for index in range(self.X.shape[1]):
        result = self._max_information_gain_split(self.X[:,index])
        if result['best_change'] is not None:
            if best_result is None:
                best_result = result
                best_index = index
            elif best_result['best_change'] < result['best_change']:
                best_result = result
                best_index = index

    if best_result is not None:
        best_result['index'] = best_index
        self.best_split = best_result
 

Split Node

This function splits the DecisionTreeNode into left and right branches if doing so is consistent with the hyper-parameters. It also sets each node's description so a summary of the tree can be printed out.

def _split_node(self):

    if self.depth < self.max_depth:
        self._best_feature_split()
        if self.best_split is not None and self.best_split['best_change'] >= self.min_information_gain:

            mask = None
            if self.best_split['is_numeric']:
                mask = self.X[:,self.best_split['index']] < self.best_split['split_value']
            else:
                mask = self.X[:,self.best_split['index']] == self.best_split['split_value']

            if np.sum(mask) >= self.min_leaf_size and (mask.size - np.sum(mask)) >= self.min_leaf_size:
                self.is_leaf = False

                self.left = DecisionTreeNode(self.X[mask,:],\
                                            self.y[mask],\
                                            self.minimize_func,\
                                            self.min_information_gain,\
                                            self.max_depth,\
                                            self.min_leaf_size,\
                                            self.depth+1)

                if self.best_split['is_numeric']:
                    split_description = 'index ' + str(self.best_split['index']) + " < " + str(self.best_split['split_value']) + " ( " + str(self.X[mask,:].shape[0]) + " )"
                    self.left.split_description = str(split_description)
                else:
                    split_description = 'index ' + str(self.best_split['index']) + " == " + str(self.best_split['split_value']) + " ( " + str(self.X[mask,:].shape[0]) + " )"
                    self.left.split_description = str(split_description)

                self.left._split_node()

                self.right = DecisionTreeNode(self.X[np.logical_not(mask),:],\
                                            self.y[np.logical_not(mask)],\
                                            self.minimize_func,\
                                            self.min_information_gain,\
                                            self.max_depth,\
                                            self.min_leaf_size,\
                                            self.depth+1)

                if self.best_split['is_numeric']:
                    split_description = 'index ' + str(self.best_split['index']) + " >= " + str(self.best_split['split_value']) + " ( " + str(self.X[np.logical_not(mask),:].shape[0]) + " )"
                    self.right.split_description = str(split_description)
                else:
                    split_description = 'index ' + str(self.best_split['index']) + " != " + str(self.best_split['split_value']) + " ( " + str(self.X[np.logical_not(mask),:].shape[0]) + " )"
                    self.right.split_description = str(split_description)

                self.right._split_node()

    if self.is_leaf:
        if self.minimize_func == variance:
            self.split_description = self.split_description + " : predict - " + str(np.mean(self.y))
        else:
            values, counts = np.unique(self.y,return_counts=True)
            predict = values[np.argmax(counts)]
            self.split_description = self.split_description + " : predict - " + str(predict)

 

Predict Row

This function is used to make a prediction for a single row. It will be used by the predict function.

def _predict_row(self,row):
    predict_value = None
    if self.is_leaf:
        if self.minimize_func==variance:
            predict_value = np.mean(self.y)
        else:
            values, counts = np.unique(self.y,return_counts=True)
            predict_value = values[np.argmax(counts)]
    else:
        left = None
        if self.best_split['is_numeric']:
            left = row[self.best_split['index']] < self.best_split['split_value']
        else:
            left = row[self.best_split['index']] == self.best_split['split_value']

        if left:
            predict_value = self.left._predict_row(row)
        else:
            predict_value = self.right._predict_row(row)

    return predict_value
 

Predict

This function predicts values from the fitted tree. If the minimizing function is variance, I assume the problem is a regression problem and return the mean of the leaf. Otherwise, I assume it is a classification problem and return the majority class of the leaf.

def predict(self,X):
        return np.apply_along_axis(lambda x: self._predict_row(x),1,X)
 

Final DecisionTreeNode Code

class DecisionTreeNode(object):

    def __init__(self,\
            X,\
            y,\
            minimize_func,\
            min_information_gain=0.001,\
            max_depth=3,\
            min_leaf_size=20,\
            depth=0):
        self.X = X
        self.y = y

        self.minimize_func=minimize_func
        self.min_information_gain=min_information_gain
        self.min_leaf_size = min_leaf_size
        self.max_depth = max_depth
        self.depth = depth

        self.best_split = None
        self.left = None
        self.right = None
        self.is_leaf = True
        self.split_description = "root"

    def _information_gain(self,mask):
        s1 = np.sum(mask)
        s2 = mask.size - s1
        if s1 == 0 or s2 == 0: return 0
        return self.minimize_func(self.y) - \
                s1/float(s1+s2) * self.minimize_func(self.y[mask]) - \
                s2/float(s1+s2) * self.minimize_func(self.y[np.logical_not(mask)])

    def _max_information_gain_split(self,x):
        best_change = None
        split_value = None
        previous_val = None
        is_numeric = x.dtype.kind not in ['S','b']

        for val in np.unique(np.sort(x)):
            mask = x == val
            if is_numeric:
                mask = x < val
            change = self._information_gain(mask)
            s1 = np.sum(mask)
            s2 = mask.size - s1
            if best_change is None and s1 >= self.min_leaf_size and s2 >= self.min_leaf_size:
                best_change = change
                split_value = val
            elif change > best_change and s1 >= self.min_leaf_size and s2 >= self.min_leaf_size:
                best_change = change
                split_value = np.mean([val,previous_val])

            previous_val = val

        return {"best_change":best_change,\
                "split_value":split_value,\
                "is_numeric":is_numeric}

    def _best_feature_split(self):
        best_result = None
        best_index = None
        for index in range(self.X.shape[1]):
            result = self._max_information_gain_split(self.X[:,index])
            if result['best_change'] is not None:
                if best_result is None:
                    best_result = result
                    best_index = index
                elif best_result['best_change'] < result['best_change']:
                    best_result = result
                    best_index = index

        if best_result is not None:
            best_result['index'] = best_index
            self.best_split = best_result

    def _split_node(self):

        if self.depth < self.max_depth:
            self._best_feature_split()
            if self.best_split is not None and self.best_split['best_change'] >= self.min_information_gain:

                mask = None
                if self.best_split['is_numeric']:
                    mask = self.X[:,self.best_split['index']] < self.best_split['split_value']
                else:
                    mask = self.X[:,self.best_split['index']] == self.best_split['split_value']

                if np.sum(mask) >= self.min_leaf_size and (mask.size - np.sum(mask)) >= self.min_leaf_size:
                    self.is_leaf = False

                    self.left = DecisionTreeNode(self.X[mask,:],\
                                                self.y[mask],\
                                                self.minimize_func,\
                                                self.min_information_gain,\
                                                self.max_depth,\
                                                self.min_leaf_size,\
                                                self.depth+1)

                    if self.best_split['is_numeric']:
                        split_description = 'index ' + str(self.best_split['index']) + " < " + str(self.best_split['split_value']) + " ( " + str(self.X[mask,:].shape[0]) + " )"
                        self.left.split_description = str(split_description)
                    else:
                        split_description = 'index ' + str(self.best_split['index']) + " == " + str(self.best_split['split_value']) + " ( " + str(self.X[mask,:].shape[0]) + " )"
                        self.left.split_description = str(split_description)

                    self.left._split_node()

                    self.right = DecisionTreeNode(self.X[np.logical_not(mask),:],\
                                                self.y[np.logical_not(mask)],\
                                                self.minimize_func,\
                                                self.min_information_gain,\
                                                self.max_depth,\
                                                self.min_leaf_size,\
                                                self.depth+1)

                    if self.best_split['is_numeric']:
                        split_description = 'index ' + str(self.best_split['index']) + " >= " + str(self.best_split['split_value']) + " ( " + str(self.X[np.logical_not(mask),:].shape[0]) + " )"
                        self.right.split_description = str(split_description)
                    else:
                        split_description = 'index ' + str(self.best_split['index']) + " != " + str(self.best_split['split_value']) + " ( " + str(self.X[np.logical_not(mask),:].shape[0]) + " )"
                        self.right.split_description = str(split_description)

                    self.right._split_node()

        if self.is_leaf:
            if self.minimize_func == variance:
                self.split_description = self.split_description + " : predict - " + str(np.mean(self.y))
            else:
                values, counts = np.unique(self.y,return_counts=True)
                predict = values[np.argmax(counts)]
                self.split_description = self.split_description + " : predict - " + str(predict)

    def _predict_row(self,row):
        predict_value = None
        if self.is_leaf:
            if self.minimize_func==variance:
                predict_value = np.mean(self.y)
            else:
                values, counts = np.unique(self.y,return_counts=True)
                predict_value = values[np.argmax(counts)]
        else:
            left = None
            if self.best_split['is_numeric']:
                left = row[self.best_split['index']] < self.best_split['split_value']
            else:
                left = row[self.best_split['index']] == self.best_split['split_value']

            if left:
                predict_value = self.left._predict_row(row)
            else:
                predict_value = self.right._predict_row(row)

        return predict_value

    def predict(self,X):
        return np.apply_along_axis(lambda x: self._predict_row(x),1,X)

    def _rep(self,level):
        response = "|->" + self.split_description

        if self.left is not None:
            response += "\n"+(2*level+1)*" "+ self.left._rep(level+1)
        if self.right is not None:
            response += "\n"+(2*level+1)*" "+ self.right._rep(level+1)

        return response

    def __repr__(self):
        return self._rep(0)

 

Proof of Concept on Iris

dtn = DecisionTreeNode(irisX,irisY,gini_impurity)
dtn._split_node()
dtn
 

[Output: printed tree fit on the Iris data (Gini Impurity, default parameters)]

Proof of Concept on Boston

dtn = DecisionTreeNode(bostonX,bostonY,variance,max_depth=5)
dtn._split_node()
dtn
 

[Output: printed tree fit on the Boston data (variance, max_depth=5)]

DecisionTree

Now that I have implemented the DecisionTreeNode, I can implement the DecisionTree class.  Here is the template code.

class DecisionTree(object):

    def __init__(self,\
            minimize_func,\
            min_information_gain=0.01,\
            max_depth=3,\
            min_leaf_size=20):

        self.root = None
        self.minimize_func = minimize_func
        self.min_information_gain = min_information_gain
        self.max_depth = max_depth
        self.min_leaf_size = min_leaf_size

    def fit(self,X,y):
        pass

    def predict(self,X):
        pass
 

Fit

The fit function takes the features and targets. I am assuming numpy arrays like the ones loaded above; you can generalize this to other data types as you see fit. It simply creates a new root DecisionTreeNode from the data and then splits that node, which recursively and greedily builds the tree.

def fit(self,X,y):
    self.root =  DecisionTreeNode(X,\
                                y,\
                                self.minimize_func,\
                                self.min_information_gain,\
                                self.max_depth,\
                                self.min_leaf_size,\
                                0)
    self.root._split_node()
 

Predict

The predict function is just a wrapper that calls the DecisionTreeNode predict function. Right now there are no probabilities. You can add that feature yourself; a possible sketch follows the code below.

def predict(self,X):
    return self.root.predict(X)
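If you do want class probabilities, one possible sketch (not part of the implementation above) is to return the class frequencies of the training rows that landed in each leaf. These two hypothetical methods would be added to DecisionTreeNode, with a thin wrapper on DecisionTree:

def _predict_row_proba(self, row, classes):
    # Hypothetical addition: fraction of this leaf's training rows in each class
    if self.is_leaf:
        return np.array([np.mean(self.y == c) for c in classes])
    if self.best_split['is_numeric']:
        go_left = row[self.best_split['index']] < self.best_split['split_value']
    else:
        go_left = row[self.best_split['index']] == self.best_split['split_value']
    child = self.left if go_left else self.right
    return child._predict_row_proba(row, classes)

def predict_proba(self, X, classes):
    # e.g. classes = np.unique(irisY); returns one row of class probabilities per sample
    return np.apply_along_axis(lambda r: self._predict_row_proba(r, classes), 1, X)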
 

Fully Implemented Decision Tree

class DecisionTree(object):

    def __init__(self,\
            minimize_func,\
            min_information_gain=0.001,\
            max_depth=3,\
            min_leaf_size=20):

        self.root = None
        self.minimize_func = minimize_func
        self.min_information_gain = min_information_gain
        self.max_depth = max_depth
        self.min_leaf_size = min_leaf_size

    def fit(self,X,y):
        self.root =  DecisionTreeNode(X,\
                                    y,\
                                    self.minimize_func,\
                                    self.min_information_gain,\
                                    self.max_depth,\
                                    self.min_leaf_size,\
                                    0)
        self.root._split_node()

    def predict(self,X):
        return self.root.predict(X)

    def __repr__(self):
        return self.root._rep(0)
 

Comparison to sklearn

Now that we have a decision tree, we can compare the results of the two CART implementations. To do this we need to write a print function for the sklearn decision tree that is similar to our own print function. A simple script is below.

from sklearn.tree import DecisionTreeClassifier as SklearnDTC
from sklearn.tree import DecisionTreeRegressor as SklearnDTR
from sklearn.tree import _tree

def print_sklearn_tree(tree,index=0,level=0):
    response = ""
    if level == 0:
        response += "root\n"

    if tree.feature[index] == _tree.TREE_UNDEFINED:
        response += ": predict " + str(np.argmax(tree.value[index,0,:])) + " ( " + str(np.sum(tree.value[index,0,:])) + " )"
    else:
        response += "\n"+(2*level+1)*" " + "|-> index " +  str(tree.feature[index]) + " < " + str(tree.threshold[index]) response += (2*(level+1)+1)*" "+ print_sklearn_tree(tree,tree.children_left[index],level+1) response += "\n"+(2*level+1)*" " + "|-> index " +  str(tree.feature[index]) + " >= " + str(tree.threshold[index])
        response += (2*(level+1)+1)*" "+ print_sklearn_tree(tree,tree.children_right[index],level+1)

    return response

dt_sklearn = SklearnDTC(max_depth=3,min_samples_leaf=20,criterion="gini")
dt_sklearn.fit(irisX,irisY)

dt_bts = DecisionTree(gini_impurity,min_leaf_size=20,max_depth=3)
dt_bts.fit(irisX,irisY)

print(dt_bts)
print("\n" + 50*"-" + "\n")
print(print_sklearn_tree(dt_sklearn.tree_))
 

[Output: side-by-side tree summaries from this implementation and sklearn on Iris (gini, max_depth=3, min_samples_leaf=20)]

The values here are close to identical, but it is worth noting that sklearn will jump between this result and splitting on a petal width of 0.8, because the two splits have the same information gain.
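You can sanity-check the tie with the scripted information_gain function from earlier; both candidate splits isolate the same iris class, so both gains should come out to about 0.333.

# Both splits should produce (roughly) the same gini-based information gain
print(information_gain(irisY, irisX[:, 2] < 3.0, gini_impurity))   # petal length split
print(information_gain(irisY, irisX[:, 3] < 0.8, gini_impurity))   # petal width split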

dt_sklearn = SklearnDTC(max_depth=5,min_samples_leaf=5,criterion="gini")
dt_sklearn.fit(irisX,irisY)

dt_bts = DecisionTree(gini_impurity,min_leaf_size=5,max_depth=5)
dt_bts.fit(irisX,irisY)

print(dt_bts)
print("\n" + 50*"-" + "\n")
print(print_sklearn_tree(dt_sklearn.tree_))
 

[Output: side-by-side tree summaries from this implementation and sklearn on Iris (gini, max_depth=5, min_samples_leaf=5)]

Again, my implementation and sklearn are close to identical in results. I will try regression on the Boston dataset.

dt_sklearn = SklearnDTR(max_depth=3,min_samples_leaf=20)
dt_sklearn.fit(bostonX,bostonY)

dt_bts = DecisionTree(variance,min_leaf_size=20,max_depth=3)
dt_bts.fit(bostonX,bostonY)

print(dt_bts)
print("\n" + 50*"-" + "\n")
print(print_sklearn_tree(dt_sklearn.tree_))
 

[Output: side-by-side regression tree summaries from this implementation and sklearn on Boston (variance, max_depth=3, min_samples_leaf=20)]


Conclusion

In this post, I covered the theory and code of implementing the CART algorithm for producing decision trees, and compared the results to sklearn on a few toy datasets. This code is far from production ready, but it does show a high-level implementation of the algorithm. You can leave any questions, comments, corrections, or requests in the comments below.

If you are interested in learning more, I can recommend a few books below. The image links are affiliate links. I provide links to free versions when available.

Data Science For Business

This book has a good overview of Decision Trees in chapter 3 and generally good advice about using data science to get practical results. I gave it a good once-over, and continue to reference it from time to time.

Data Science From Scratch

This book has a good intro to data science activities with python and implements a number of algorithms from scratch. Chapter 18 has an implementation of ID3 Decision Tree Algorithm. It also shows how to implement Logistic Regression, Naive Bayes, & Neural Networks.

Python Machine Learning

This is a really good book that covers doing an enormous amount of data science tasks in python. If you are new to python or data science, I highly recommend it.
