Implementing Decision Trees From Scratch Using Python 2.7

Decision Trees

You can find the iPython Notebook for this post at https://github.com/bryantravissmith/FromScratch/tree/master/SupervisedLearning/DecisionTrees

Decision Trees are a non-parametric supervised machine learning algorithm that takes a set of training data and constructs a set of regions in the feature space that are then used to make predictions. These predictions can be continuous values (Decision Tree Regression) or class labels (Decision Tree Classification).

High Level Decision Tree Process

Let's imagine we have a data set with a lot of features describing individuals, along with how much student debt each individual has. We could just take the average of the entire dataset, say $25,000.00, and use that as a prediction. A better prediction could be developed by dividing the dataset up using the level of college completed: None, Some, Undergraduate, and Graduate. We can then take the average of each subset/region:

  1. None: $0
  2. Some: $2000
  3. Undergraduate: $16000
  4. Graduate: $42000

Now when we score a person, we use their level of college completed to make the prediction. The idea is that the results are improved over the $25,000 average estimate of the entire dataset.
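To make the idea concrete, here is a minimal sketch of that lookup-style prediction, using the made-up averages above (the function name and fallback value are just for illustration):

# A minimal sketch of the student-debt example above; the numbers are the made-up averages.
debt_by_education = {"None": 0.0, "Some": 2000.0,
                     "Undergraduate": 16000.0, "Graduate": 42000.0}

def predict_debt(education_level, overall_average=25000.0):
    # Fall back to the overall average when the education level is unknown.
    return debt_by_education.get(education_level, overall_average)

print predict_debt("Graduate")   # 42000.0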

Student Loan Example

Why Implement It Ourselves?

The reason I want to implement machine learning algorithms from scratch is to build a better intuition for the assumptions and trade-offs in the implementations of standard libraries. For instance, before I did this for the first time I did not appreciate that both sklearn (Python) and rpart (R) implement the CART algorithm for decision trees, but that there are equally valid decision tree algorithms that would produce different results. This implementation is inherently greedy. The issue with greedy algorithms is that making locally best splits or regions may not produce the globally best set of regions. This is what makes finding the best Decision Tree model NP-hard. If you really get into it, finding the best tree becomes an optimal stopping problem and brushes into Reinforcement Learning and Artificial Intelligence style searching.

On the more practical side, there have been a number of times in the iterative process of model-building where understanding the model assumptions has allowed me to exclude features that violate those assumptions, or to create new features that are more easily expressed and represented by the model. Both of these have consistently led to better out-of-sample performance.

Finally, the best reason is that I get pleasure and satisfaction from doing this. I think algorithms are a powerful framework and a fun way to think about problem solving, and building them from scratch improves both my execution and my understanding of them.

Metrics – Measures for Improvement

I made up the student debt example. It makes sense in my mind, but we need to define an algorithm that can create the regions from any dataset. Most machine learning algorithms are optimization algorithms that improve some measure/metric on each iteration. Decision trees are no different in that respect. To define the algorithm, we need to define the measures.

Classification Metrics for Decision Trees

Gini Impurity

Gini Impurity is a very intuitive metric to try to improve because it is a measure of the misclassification rate. If the fraction of the data set that has outcome class c is f_c, we can make predictions for each data point as being class c, f_c of the time. The error rate for a class is the probability of getting the class times the probability of getting it wrong.

Error_c = f_c * \left( 1 - f_c \right)

The Gini Impurity is the sum of these error rates for all classes.

Gini \ Impurity = \sum_c Error_c = \sum_c f_c * \left( 1 - f_c \right)

Gini \ Impurity = \sum_c f_c - \sum _c f^2_c

Gini \ Impurity = 1 - \sum _c f^2_c

If we divide the data set into regions that improve the weighted sum of the Gini Impurity, the misclassification rate will be reduced.
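As a quick illustration (a toy example I am adding here, not part of the original write-up), the weighted Gini Impurity drops when a split separates the classes better than the full dataset does:

import numpy as np

def gini(labels):
    # 1 minus the sum of squared class fractions, as derived above
    p = np.unique(labels, return_counts=True)[1].astype(float) / len(labels)
    return 1 - np.sum(p ** 2)

y = np.array([0, 0, 0, 1, 1, 1])        # hypothetical labels
left, right = y[:4], y[4:]              # a hypothetical split
weighted = (len(left) * gini(left) + len(right) * gini(right)) / float(len(y))
print gini(y), weighted                 # 0.5 before the split vs 0.25 after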

Gini Impurity for a Coin Flip

A coin flip is about as simple a process as there is to calculate the Gini Impurity for. We have a coin with probability of heads p_H. The Gini Impurity is then

Gini \ Impurity = 1 - p_H^2 - \left(1-p_H\right)^2 = 2*p_H*\left(1-p_H\right)

A plot of this is below for different values of p_H.

import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

p = np.linspace(0.01,0.99,100)
plt.figure(figsize=(16,8))
plt.plot(p,1-p**2-(1-p)**2)
plt.ylim(0,1)
plt.show()
 

[Plot: Gini Impurity as a function of the probability of heads]

We see that if p_H = 0 or p_H = 1 there is no misclassification. If p_H = 0.5, we see that we get the coin flip wrong 50% of the time. This should match our intuition.

Entropy

Another, and less intuitive, measure of prediction accuracy is entropy. It is defined as the expected information content of a class. If f_c is the fraction of the class in a dataset, and I(c) is the information content of the class, then entropy is defined as follows

Entropy = \sum_c f_c * I(c)

For reasons I will not get into in this post, I(c) = - log_2\left(f_c\right) , so the entropy of a dataset follows:

Entropy = - \sum_c f_c * log_2\left(f_c\right)

The base of the logarithm can be anything, really. But since we will be using binary trees throughout this series, I will keep the base 2.

Entropy for a Coin Flip

Continuing with the coin flip example, I have plotted the entropy of a coin flip for different values of p_H.

plt.figure(figsize=(16,8))
plt.plot(p,-p*np.log2(p)-(1-p)*np.log2(1-p))
plt.ylim(0,1)
plt.show()
 

[Plot: entropy as a function of the probability of heads]

The expected information content at p_H = 0 and p_H = 1 is zero, which means that we always know what the next flip is. The entropy is maximal at p_H = 0.5, which is where we have the highest misclassification rate.

Regressions Metrics for Decision Trees

The metric used for decision tree regression is the reduction of variance around the prediction.

Variance = \frac{1}{N} \sum_{i}^{N} ( y_i - \bar{y} )^2

Simple Linear Example

We can take a simple linear example and split it into two regions. The solid black line is the average of the entire dataset, while the red lines are the predictions in each region.

[Plot: linear dataset with the overall mean (black) and per-region means (red)]

The total variance before the split was 40.00, while the weighted variance after the split is:

\frac{6}{11} 11.67 + \frac{5}{11} 8.00  = 10.0
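The data behind that figure is not reproduced here, but the same effect can be checked with a stand-in linear dataset (the numbers below differ from the figure's):

import numpy as np

y = np.arange(11, dtype=float)          # stand-in 11-point linear target
left, right = y[:6], y[6:]              # split the line into two regions
weighted = (len(left) * np.var(left) + len(right) * np.var(right)) / float(len(y))
print np.var(y), weighted               # 10.0 before the split vs 2.5 after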

Improving Metrics – Splitting on Information Gain

Now that we have metrics to evaluate the predictions from a data set, we can define the improvement measure to optimize. I will use the term information gain, which is the correct term when talking about entropy, for all of the optimizations. I cannot say conclusively that it is the name for the measure in decision tree regression. Being wrong hasn't stopped me before!

Decision Tree Classifier

Information \ Gain = Gini \ Impurity(d) - \sum_{split \ s} \frac{|s|}{|d|} Gini \ Impurity(s)

Information \ Gain = Entropy(d) - \sum_{split \ s} \frac{|s|}{|d|} Entropy(s)

Decision Tree Regressor

Information \ Gain = Variance(d) - \sum_{split \ s} \frac{|s|}{|d|} Variance(s)

Parameters of Decision Trees

Decision Trees are prone to overfitting because the metrics can be driven to their minimum by making the tree deep enough to capture each data point individually. The variance of a single point is zero. The misclassification rate of a single point is also zero. Traditionally this overfitting is managed by pruning the tree after it is constructed, or by setting parameters that limit how finely the data can be partitioned into regions.

The post-pruning of a decision tree uses some sort of statistical evaluation on a validation set. If the error on the sub-nodes is larger than the error on the node itself, the sub-nodes are removed.

The parameters that are commonly used in CART to limit the partitioning of the data are:

  1. Maximum Depth of the Tree
  2. Minimum Number of Data Points for a New Node
  3. Minimum Improvement needed for a New Node
  4. Minimum Metric under which New Nodes are not created.

Pseudo Code

Now that we have a metric and a measure of improvement, we can outline the pseudo code we will be implementing below, which is based on the CART algorithm for decision trees. The Classification and Regression Trees (CART) algorithm recursively and greedily splits nodes when there is an improvement in the metric of choice, until some stopping conditions are met.

  1. Set Training Parameters
  2. Create a Root Node
  3. Determine the optimal split of the node, such that the split minimizes the weighted metric of concern (i.e., maximizes the information gain).
  4. If the stopping parameters are not violated, create two child nodes and apply steps 3 and 4 to each child node.

Code

In this code section I am going to implement a Decision Tree algorithm in the following parts:

  1. Introduce the Data
  2. Script Finding the Best First Partition
  3. Create a Data Structure and Implement Decision Trees
  4. Comparing with sklearn package.

Assumptions

The implementation will avoid some issues with decision trees. I am going to assume that there are never missing values. I will treat a feature as categorical if its values are strings or booleans, and as numeric otherwise. I am also going to assume that the features and targets are provided as numpy arrays.

Data

I will be using the Iris dataset for classification and the Boston dataset for regression. Both are available in the sklearn package.

from sklearn.datasets import load_iris,load_boston
iris = load_iris()
boston = load_boston()
 

Boston Housing Price

import pandas as pd
bostonX = boston['data']
bostonY = boston['target']
bostonDF = pd.DataFrame(data = np.hstack((bostonX,bostonY.reshape(bostonY.shape[0],1))),\
                        columns=np.append(boston['feature_names'],'PRICE'))
bostonDF.head()
 
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT PRICE
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2

Iris

irisX = iris['data']
irisY = iris['target']
irisDF = pd.DataFrame(data = np.hstack((irisX,irisY.reshape(irisY.shape[0],1))),\
                        columns=np.append(iris['feature_names'],"Species"))
irisDF.head()
 
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) Species
0 5.1 3.5 1.4 0.2 0.0
1 4.9 3.0 1.4 0.2 0.0
2 4.7 3.2 1.3 0.2 0.0
3 4.6 3.1 1.5 0.2 0.0
4 5.0 3.6 1.4 0.2 0.0

Scripting the First Partition

I am going to implement a brute-force search for the best partition of the data set. This will be done by going through each value in each feature and finding the feature-value pair that best partitions the data. To do this we need to implement the metrics I discussed in the theory section above:

Entropy: Classification
Gini Impurity: Classification
Variance: Regression

def entropy(y):
    if y.size == 0: return 0
    p = np.unique(y, return_counts = True)[1].astype(float)/len(y)
    return -1 * np.sum(p * np.log2(p+1e-9))

def gini_impurity(y):
    if y.size == 0: return 0
    p = np.unique(y, return_counts = True)[1].astype(float)/len(y)
    return 1 - np.sum(p**2)

def variance(y):
    if y.size == 0: return 0
    return np.var(y)
 

I can quickly calculate these metrics for our dataset:

print entropy(irisY)
print gini_impurity(irisY)
print variance(bostonY)
 

1.58496249639
0.666666666667
84.4195561562
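These values can be checked by hand, since the iris target has three equally sized classes of 50 rows and the Boston check reduces to numpy's own variance (a quick verification I am adding here):

# Hand check: with three equal classes the metrics reduce to simple formulas.
print np.log2(3)            # 1.5849... matches entropy(irisY)
print 1 - 3 * (1/3.0)**2    # 0.6667   matches gini_impurity(irisY)
print np.var(bostonY)       # matches variance(bostonY) by definition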

Information Gain

Now that I have metric functions, I want to calculate the improvement for different partitions. Each partition is going to be calculated using a mask which is a logical vector that states if the row is in the first partition (TRUE) or in the second partition (FALSE). I will call the function information gain, even though it really is only information gain if entropy is the metric.

def information_gain(y,mask,func=entropy):
    s1 = np.sum(mask)
    s2 = mask.size - s1
    if s1 == 0 or s2 == 0: return 0
    return func(y) - s1/float(s1+s2) * func(y[mask]) - s2/float(s1+s2) * func(y[np.logical_not(mask)])

print information_gain(irisY,irisX[:,2] < 3.5)
print information_gain(irisY,irisX[:,2] < 3.5,gini_impurity)
print information_gain(bostonY,bostonX[:,5] < 7)
 

0.827861517442
0.305906762627
0.456884627461
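One edge case worth checking (not shown in the original post): a split that sends every row to one side provides no improvement, and the guard clause above returns zero.

print information_gain(irisY, irisX[:,2] < 0.0)   # every petal length is >= 1.0, so one side is empty -> 0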

Max Information Gain For A Single Feature

Now that I have a function to calculate the information gain for a given criterion, mask, or split, I will search for the best split among all the values in a given feature. This will give me the best split for that feature:

np.apply_along_axis(lambda x: np.sum(x),0,irisX)
 

array([ 876.5, 458.1, 563.8, 179.8])

def max_information_gain_split(y,x,func=gini_impurity):
    best_change = None
    split_value = None
    is_numeric = x.dtype.kind not in ['S','b']

    for val in np.unique(np.sort(x)):
        mask = x == val
        if is_numeric: mask = x < val
        change = information_gain(y,mask,func)
        if best_change is None:
            best_change = change
            split_value = val
        elif change > best_change:
            best_change = change
            split_value = val

    return {"best_change":best_change,\
            "split_value":split_value,\
            "is_numeric":is_numeric}

print(max_information_gain_split(irisY,irisX[:,2]))
print(max_information_gain_split(irisY,irisX[:,2],entropy))
print(max_information_gain_split(bostonY,bostonX[:,3],variance))
 

{'is_numeric': True, 'split_value': 3.0, 'best_change': 0.33333333333333343}
{'is_numeric': True, 'split_value': 3.0, 'best_change': 0.91829583213089627}
{'is_numeric': True, 'split_value': 1.0, 'best_change': 2.5930420368499885}

For the iris petal length, the best split value is 3.0 under both classification metrics, while the best split of the Boston CHAS feature is at 1.0.

Best Feature Split

Now I can apply max_information_gain_split across all columns and return the information for the best split:

def best_feature_split(X,y,func=gini_impurity):
    best_result = None
    best_index = None
    for index in range(X.shape[1]):
        result = max_information_gain_split(y,X[:,index],func)
        if best_result is None:
            best_result = result
            best_index = index
        elif best_result['best_change'] < result['best_change']:
            best_result = result
            best_index = index

    best_result['index'] = best_index
    return best_result

print best_feature_split(irisX,irisY)
print best_feature_split(irisX,irisY,entropy)
print best_feature_split(bostonX,bostonY,variance)
 

{'index': 2, 'is_numeric': True, 'split_value': 3.0, 'best_change': 0.33333333333333343}
{'index': 2, 'is_numeric': True, 'split_value': 3.0, 'best_change': 0.91829583213089627}
{'index': 5, 'is_numeric': True, 'split_value': 6.9429999999999996, 'best_change': 38.220464479057071}

Now that I have the best feature-value combination to partition the data, we can split it into two data sets. One I will call 'left', and the other I will call 'right'. This matches the data structure we will use, in which each node can have two branches: left and right.

def get_best_mask(X,best_feature_dict):
    best_mask = None
    if best_feature_dict['is_numeric']:
        best_mask = X[:,best_feature_dict['index']] < best_feature_dict['split_value']
    else:
        best_mask = X[:,best_feature_dict['index']] == best_feature_dict['split_value']
    return best_mask

bfs = best_feature_split(irisX,irisY)
best_mask = get_best_mask(irisX,bfs)
left = irisX[best_mask,:]
right = irisX[np.logical_not(best_mask),:]
 

Now that I have scripted a single split, I will work on creating a tree-based data structure for the decision tree.

Data Structure

I am going to make two classes: a DecisionTree class that loosely mimics the sklearn API, and a DecisionTreeNode class that will recursively fit the data based on the hyper-parameters we provide. There are many parameters one could set to keep the tree from overfitting. I have chosen max_depth, min_information_gain, and min_leaf_size. Class diagrams are below.

Class Descriptions of Decision Tree Implementation
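For orientation, this is the end-to-end usage the two classes are aiming for, roughly mirroring the sklearn API. The classes themselves are defined below, so this snippet only runs once they are.

# Intended usage of the classes defined below (a preview, not output at this point in the post).
tree = DecisionTree(gini_impurity, max_depth=3, min_leaf_size=20, min_information_gain=0.01)
tree.fit(irisX, irisY)
predictions = tree.predict(irisX)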

DecisionTreeNode

This is going to be the template I will use for the DecisionTreeNode class.

class DecisionTreeNode(object):

    def __init__(self,\
            X,\
            y,\
            minimize_func,\
            min_information_gain=0.01,\
            max_depth=3,\
            min_leaf_size=20,\
            depth=0):
        self.X = X
        self.y = y

        self.minimize_func=minimize_func
        self.min_information_gain=min_information_gain
        self.min_leaf_size = min_leaf_size
        self.max_depth = max_depth
        self.depth = depth

        self.best_split = None
        self.left = None
        self.right = None
        self.is_leaf = True
        self.split_description = ""

    def _information_gain(self,mask):
        pass

    def _max_information_gain_split(self,X):
        pass

    def _best_feature_split(self):
        pass

    def _split_node(self):
        pass

    def _predict_row(self,row):
        pass

    def predict(self,X):
        pass

    def __repr__(self):
        pass
 

Information Gain

This is the same function as I implemented during the scripting section except that it makes reference to the object variables instead of accepting them as parameters.

def _information_gain(self,mask):
    s1 = np.sum(mask)
    s2 = mask.size - s1
    if s1 == 0 or s2 == 0: return 0
    return self.minimize_func(self.y) - \
            s1/float(s1+s2) * self.minimize_func(self.y[mask]) - \
            s2/float(s1+s2) * self.minimize_func(self.y[np.logical_not(mask)])
 

Max Information Gain Split

This function finds the split value with the most information gain for a given feature. It is similar to the scripted version above, except that it references the object's variables and adds an additional constraint: the best split is only updated if the new leaves are at least as large as the minimum leaf size. In other words, it finds the optimal split value subject to the constraint. The scripted solution above finds the optimal value and then checks whether it satisfies the constraint. These two approaches can give different results and produce different trees.

def _max_information_gain_split(self,x):
    best_change = None
    split_value = None
    previous_val = None
    is_numeric = x.dtype.kind not in ['S','b']

    for val in np.unique(np.sort(x)):
        mask = x == val
        if is_numeric: mask = x < val
        change = self._information_gain(mask)
        s1 = np.sum(mask)
        s2 = mask.size - s1
        if best_change is None and s1 >= self.min_leaf_size and s2 >= self.min_leaf_size:
            best_change = change
            split_value = val
        elif change > best_change and s1 >= self.min_leaf_size and s2 >= self.min_leaf_size:
            best_change = change
            split_value = np.mean([val,previous_val])

        previous_val = val

    return {"best_change":best_change,\
            "split_value":split_value,\
            "is_numeric":is_numeric}
 

Best Feature Split

This function finds the feature-value split that gives the most information gain out of all the features.

def _best_feature_split(self):
    best_result = None
    best_index = None
    for index in range(self.X.shape[1]):
        result = self._max_information_gain_split(self.X[:,index])
        if result['best_change'] is not None:
            if best_result is None:
                best_result = result
                best_index = index
            elif best_result['best_change'] < result['best_change']:
                best_result = result
                best_index = index

    if best_result is not None:
        best_result['index'] = best_index
        self.best_split = best_result
 

Split Node

This function will split the DecisionTreeNode into left and right branches if it is consistent with the hyper-parameters. It is also setting each Node’s description so a summary tree can be printed out.

def _split_node(self):

    if self.depth < self.max_depth:
        self._best_feature_split()
        if self.best_split is not None and self.best_split['best_change'] >= self.min_information_gain:

            mask = None
            if self.best_split['is_numeric']:
                mask = self.X[:,self.best_split['index']] < self.best_split['split_value']
            else:
                mask = self.X[:,self.best_split['index']] == self.best_split['split_value']

            if np.sum(mask) >= self.min_leaf_size and (mask.size-np.sum(mask)) >= self.min_leaf_size:
                self.is_leaf = False

                self.left = DecisionTreeNode(self.X[mask,:],\
                                            self.y[mask],\
                                            self.minimize_func,\
                                            self.min_information_gain,\
                                            self.max_depth,\
                                            self.min_leaf_size,\
                                            self.depth+1)

                if self.best_split['is_numeric']:
                    split_description = 'index ' + str(self.best_split['index']) + " < " + str(self.best_split['split_value']) + " ( " + str(self.X[mask,:].shape[0]) + " )" self.left.split_description = str(split_description) else: split_description = 'index ' + str(self.best_split['index']) + " == " + str(self.best_split['split_value']) + " ( " + str(self.X[mask,:].shape[0]) + " )" self.left.split_description = str(split_description) self.left._split_node() self.right = DecisionTreeNode(self.X[np.logical_not(mask),:],\ self.y[np.logical_not(mask)],\ self.minimize_func,\ self.min_information_gain,\ self.max_depth,\ self.min_leaf_size,\ self.depth+1) if self.best_split['is_numeric']: split_description = 'index ' + str(self.best_split['index']) + " >= " + str(self.best_split['split_value']) + " ( " + str(self.X[np.logical_not(mask),:].shape[0]) + " )"
                    self.right.split_description = str(split_description)
                else:
                    split_description = 'index ' + str(self.best_split['index']) + " != " + str(self.best_split['split_value']) + " ( " + str(self.X[np.logical_not(mask),:].shape[0]) + " )"
                    self.right.split_description = str(split_description)

                self.right._split_node()

    if self.is_leaf:
        if self.minimize_func == variance:
            self.split_description = self.split_description + " : predict - " + str(np.mean(self.y))
        else:
            values, counts = np.unique(self.y,return_counts=True)
            predict = values[np.argmax(counts)]
            self.split_description = self.split_description + " : predict - " + str(predict)

 

Predict Row

This function is used to make a prediction for a single row. It will be used by the predict function.

def _predict_row(self,row):
    predict_value = None
    if self.is_leaf:
        if self.minimize_func==variance:
            predict_value = np.mean(self.y)
        else:
            values, counts = np.unique(self.y,return_counts=True)
            predict_value = values[np.argmax(counts)]
    else:
        left = None
        if self.best_split['is_numeric']:
            left = row[self.best_split['index']] < self.best_split['split_value']
        else:
            left = row[self.best_split['index']] == self.best_split['split_value']

        if left:
            predict_value = self.left._predict_row(row)
        else:
            predict_value = self.right._predict_row(row)

    return predict_value
 

Predict

This function will predict values using the fitted tree. If the minimizing function is variance, I assume the problem is a regression problem and return the mean of the leaf. Otherwise, I assume it is a classification problem and return the leaf's majority class.

def predict(self,X):
        return np.apply_along_axis(lambda x: self._predict_row(x),1,X)
 

Final DecisionTreeNode Code

class DecisionTreeNode(object):

    def __init__(self,\
            X,\
            y,\
            minimize_func,\
            min_information_gain=0.001,\
            max_depth=3,\
            min_leaf_size=20,\
            depth=0):
        self.X = X
        self.y = y

        self.minimize_func=minimize_func
        self.min_information_gain=min_information_gain
        self.min_leaf_size = min_leaf_size
        self.max_depth = max_depth
        self.depth = depth

        self.best_split = None
        self.left = None
        self.right = None
        self.is_leaf = True
        self.split_description = "root"

    def _information_gain(self,mask):
        s1 = np.sum(mask)
        s2 = mask.size - s1
        if s1 == 0 or s2 == 0: return 0
        return self.minimize_func(self.y) - \
                s1/float(s1+s2) * self.minimize_func(self.y[mask]) - \
                s2/float(s1+s2) * self.minimize_func(self.y[np.logical_not(mask)])

    def _max_information_gain_split(self,x):
        best_change = None
        split_value = None
        previous_val = None
        is_numeric = x.dtype.kind not in ['S','b']

        for val in np.unique(np.sort(x)):
            mask = x == val
            if is_numeric: mask = x < val
            change = self._information_gain(mask)
            s1 = np.sum(mask)
            s2 = mask.size - s1
            if best_change is None and s1 >= self.min_leaf_size and s2 >= self.min_leaf_size:
                best_change = change
                split_value = val
            elif change > best_change and s1 >= self.min_leaf_size and s2 >= self.min_leaf_size:
                best_change = change
                split_value = np.mean([val,previous_val])

            previous_val = val

        return {"best_change":best_change,\
                "split_value":split_value,\
                "is_numeric":is_numeric}

    def _best_feature_split(self):
        best_result = None
        best_index = None
        for index in range(self.X.shape[1]):
            result = self._max_information_gain_split(self.X[:,index])
            if result['best_change'] is not None:
                if best_result is None:
                    best_result = result
                    best_index = index
                elif best_result['best_change'] < result['best_change']:
                    best_result = result
                    best_index = index

        if best_result is not None:
            best_result['index'] = best_index
            self.best_split = best_result

    def _split_node(self):

        if self.depth < self.max_depth:
            self._best_feature_split()
            if self.best_split is not None and self.best_split['best_change'] >= self.min_information_gain:

                mask = None
                if self.best_split['is_numeric']:
                    mask = self.X[:,self.best_split['index']] < self.best_split['split_value']
                else:
                    mask = self.X[:,self.best_split['index']] == self.best_split['split_value']

                if np.sum(mask) >= self.min_leaf_size and (mask.size-np.sum(mask)) >= self.min_leaf_size:
                    self.is_leaf = False

                    self.left = DecisionTreeNode(self.X[mask,:],\
                                                self.y[mask],\
                                                self.minimize_func,\
                                                self.min_information_gain,\
                                                self.max_depth,\
                                                self.min_leaf_size,\
                                                self.depth+1)

                    if self.best_split['is_numeric']:
                        split_description = 'index ' + str(self.best_split['index']) + " < " + str(self.best_split['split_value']) + " ( " + str(self.X[mask,:].shape[0]) + " )" self.left.split_description = str(split_description) else: split_description = 'index ' + str(self.best_split['index']) + " == " + str(self.best_split['split_value']) + " ( " + str(self.X[mask,:].shape[0]) + " )" self.left.split_description = str(split_description) self.left._split_node() self.right = DecisionTreeNode(self.X[np.logical_not(mask),:],\ self.y[np.logical_not(mask)],\ self.minimize_func,\ self.min_information_gain,\ self.max_depth,\ self.min_leaf_size,\ self.depth+1) if self.best_split['is_numeric']: split_description = 'index ' + str(self.best_split['index']) + " >= " + str(self.best_split['split_value']) + " ( " + str(self.X[np.logical_not(mask),:].shape[0]) + " )"
                        self.right.split_description = str(split_description)
                    else:
                        split_description = 'index ' + str(self.best_split['index']) + " != " + str(self.best_split['split_value']) + " ( " + str(self.X[np.logical_not(mask),:].shape[0]) + " )"
                        self.right.split_description = str(split_description)

                    self.right._split_node()

        if self.is_leaf:
            if self.minimize_func == variance:
                self.split_description = self.split_description + " : predict - " + str(np.mean(self.y))
            else:
                values, counts = np.unique(self.y,return_counts=True)
                predict = values[np.argmax(counts)]
                self.split_description = self.split_description + " : predict - " + str(predict)

    def _predict_row(self,row):
        predict_value = None
        if self.is_leaf:
            if self.minimize_func==variance:
                predict_value = np.mean(self.y)
            else:
                values, counts = np.unique(self.y,return_counts=True)
                predict_value = values[np.argmax(counts)]
        else:
            left = None
            if self.best_split['is_numeric']:
                left = row[self.best_split['index']] < self.best_split['split_value']
            else:
                left = row[self.best_split['index']] == self.best_split['split_value']

            if left:
                predict_value = self.left._predict_row(row)
            else:
                predict_value = self.right._predict_row(row)

        return predict_value

    def predict(self,X):
        return np.apply_along_axis(lambda x: self._predict_row(x),1,X)

    def _rep(self,level):
        response = "|->" + self.split_description

        if self.left is not None:
            response += "\n"+(2*level+1)*" "+ self.left._rep(level+1)
        if self.right is not None:
            response += "\n"+(2*level+1)*" "+ self.right._rep(level+1)

        return response

    def __repr__(self):
        return self._rep(0)

 

Proof of Concept on Iris

dtn = DecisionTreeNode(irisX,irisY,gini_impurity)
dtn._split_node()
dtn
 

[Printed tree for the iris classification node]
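As a rough sanity check that I am adding here (it is not in the original output), the fitted node can score its own training data:

# In-sample accuracy of the depth-3 iris tree above; a sanity check, not a proper evaluation.
predictions = dtn.predict(irisX)
print np.mean(predictions == irisY)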

Proof of Concept on Boston Housing

dtn = DecisionTreeNode(bostonX,bostonY,variance,max_depth=5)
dtn._split_node()
dtn
 

[Printed tree for the Boston regression node]
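A similar added check for the regression node: the in-sample mean squared error of its predictions.

# In-sample mean squared error of the depth-5 Boston regression tree above.
predictions = dtn.predict(bostonX)
print np.mean((predictions - bostonY) ** 2)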

DecisionTree

Now that I have implemented the DecisionTreeNode, I can implement the DecisionTree class.  Here is the template code.

class DecisionTree(object):

    def __init__(self,\
            minimize_func,\
            min_information_gain=0.01,\
            max_depth=3,\
            min_leaf_size=20):

        self.root = None
        self.minimize_func = minimize_func
        self.min_information_gain = min_information_gain
        self.max_depth = max_depth
        self.min_leaf_size = min_leaf_size

    def fit(self,X,y):
        pass

    def predict(self,X):
        pass
 

Fit

The fit function is going to take the features and targets. I am assuming numpy arrays will be provided; you can generalize this for other data types as you see fit. It takes the data, creates a new DecisionTreeNode, and then splits the node, which recursively and greedily builds the tree.

def fit(self,X,y):
    self.root =  DecisionTreeNode(X,\
                                y,\
                                self.minimize_func,\
                                self.min_information_gain,\
                                self.max_depth,\
                                self.min_leaf_size,\
                                0)
    self.root._split_node()
 

Predict

The predict function is just a wrapper to call the DecisionTreeNode predict function. Right now there are no probabilities. You can add that feature yourself!

def predict(self,X):
    return self.root.predict(X)
 

Fully Implemented Decision Tree

class DecisionTree(object):

    def __init__(self,\
            minimize_func,\
            min_information_gain=0.001,\
            max_depth=3,\
            min_leaf_size=20):

        self.root = None
        self.minimize_func = minimize_func
        self.min_information_gain = min_information_gain
        self.max_depth = max_depth
        self.min_leaf_size = min_leaf_size

    def fit(self,X,y):
        self.root =  DecisionTreeNode(X,\
                                    y,\
                                    self.minimize_func,\
                                    self.min_information_gain,\
                                    self.max_depth,\
                                    self.min_leaf_size,\
                                    0)
        self.root._split_node()

    def predict(self,X):
        return self.root.predict(X)

    def __repr__(self):
        return self.root._rep(0)
 

Comparison to sklearn

Now that we have a decision tree, we can compare the results of the two CART implementations. To do this we need to write a print function for the sklearn tree that is similar to our own print function. A simple script is below.

from sklearn.tree import DecisionTreeClassifier as SklearnDTC
from sklearn.tree import DecisionTreeRegressor as SklearnDTR
from sklearn.tree import _tree

def print_sklearn_tree(tree,index=0,level=0):
    response = ""
    if level == 0:
        response += "root\n"

    if tree.feature[index] == -2:
        response +=  ": predict " + str(np.argmax(dt_sklearn.tree_.value[index,0,:])) + " ( " +str(np.sum(dt_sklearn.tree_.value[index,0,:])) + " )"
    else:
        response += "\n"+(2*level+1)*" " + "|-> index " +  str(tree.feature[index]) + " < " + str(tree.threshold[index]) response += (2*(level+1)+1)*" "+ print_sklearn_tree(tree,tree.children_left[index],level+1) response += "\n"+(2*level+1)*" " + "|-> index " +  str(tree.feature[index]) + " >= " + str(tree.threshold[index])
        response += (2*(level+1)+1)*" "+ print_sklearn_tree(tree,tree.children_right[index],level+1)

    return response

dt_sklearn = SklearnDTC(max_depth=3,min_samples_leaf=20,criterion="gini")
dt_sklearn.fit(irisX,irisY)

dt_bts = DecisionTree(gini_impurity,min_leaf_size=20,max_depth=3)
dt_bts.fit(irisX,irisY)

print dt_bts
print "\n" + 50*"-" + "\n"
print print_sklearn_tree(dt_sklearn.tree_)
 

[Printed trees: this implementation vs. sklearn on iris (max depth 3, min leaf size 20)]

The trees here are close to identical, but it is worth noting that sklearn will jump between this result and splitting on a petal width of 0.8, because the two splits have the same information gain.
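The tie can be verified with the scripted information_gain helper from earlier; both candidate splits isolate the 50 setosa rows and therefore give the same Gini gain:

print information_gain(irisY, irisX[:, 2] < 3.0, gini_impurity)   # petal length split
print information_gain(irisY, irisX[:, 3] < 0.8, gini_impurity)   # petal width split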

dt_sklearn = SklearnDTC(max_depth=5,min_samples_leaf=5,criterion="gini")
dt_sklearn.fit(irisX,irisY)

dt_bts = DecisionTree(gini_impurity,min_leaf_size=5,max_depth=5)
dt_bts.fit(irisX,irisY)

print dt_bts
print "\n" + 50*"-" + "\n"
print print_sklearn_tree(dt_sklearn.tree_)
 

[Printed trees: this implementation vs. sklearn on iris (max depth 5, min leaf size 5)]

Again, my implementation and sklearn give close to identical results. I will try regression on the Boston dataset.

dt_sklearn = SklearnDTR(max_depth=3,min_samples_leaf=20)
dt_sklearn.fit(bostonX,bostonY)

dt_bts = DecisionTree(variance,min_leaf_size=20,max_depth=3)
dt_bts.fit(bostonX,bostonY)

print dt_bts
print "\n" + 50*"-" + "\n"
print print_sklearn_tree(dt_sklearn.tree_)
 

[Printed trees: this implementation vs. sklearn regression on Boston (max depth 3, min leaf size 20)]
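One extra check, not in the original post: rather than eyeballing the printed trees, the two regressors' predictions can be compared directly.

# Mean absolute difference between this implementation's predictions and sklearn's.
print np.mean(np.abs(dt_bts.predict(bostonX) - dt_sklearn.predict(bostonX)))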


Conclusion

In this post, I covered the theory and code of implementing the CART algorithm for producing decision trees and compared the results to sklearn on a few toy datasets. This code is far from production ready, but it does show a high-level implementation of the algorithm. You can leave any questions, comments, corrections, or requests in the comments below.

If you are interested in learning more, I can recommend a few books below. The image links are affiliate links. I provide links to free versions when available.

Data Science For Business

This book has a good overview of Decision Trees in chapter 3 and good, general advice about using data science to get practical results. I gave it a good once-over, and continue to reference it from time to time.

Data Science From Scratch

This book has a good intro to data science activities with python and implements a number of algorithms from scratch. Chapter 18 has an implementation of ID3 Decision Tree Algorithm. It also shows how to implement Logistic Regression, Naive Bayes, & Neural Networks.

Python Machine Learning

This is a really good book that covers doing an enormous amount of data science tasks in python. If you are new to python or data science, I highly recommend it.

Implementing Decision Trees From Scratch Using R

Decision Trees

You can find the iPython Notebook for this post at https://github.com/bryantravissmith/FromScratch/tree/master/SupervisedLearning/DecisionTrees

Decision Trees are a non-parametric supervised machine learning algorithm that takes a set of training data and constructs a set of regions in the feature space that are then used to make predictions. These predictions can be continuous values (Decision Tree Regression) or class labels (Decision Tree Classification).

High Level Decision Tree Process

 

Let's imagine we have a data set with a lot of features describing individuals, along with how much student debt each individual has. We could just take the average of the entire dataset, say $25,000.00, and use that as a prediction. A better prediction could be developed by dividing the dataset up using the level of college completed: None, Some, Undergraduate, and Graduate. We can then take the average of each subset/region:

  1. None: $0
  2. Some: $2000
  3. Undergraduate: $16000
  4. Graduate: $42000

Now when we score a person, we use their level of college completed to make the prediction. The idea is that the results are improved over the $25,000 average estimate of the entire dataset.

Student Loan Example

 

 

Why Implement It Ourselves?

The reason I want to implement machine learning algorithms from scratch is to build a better intuition for the assumptions and trade-offs in the implementations of standard libraries. For instance, before I did this for the first time I did not appreciate that both sklearn (Python) and rpart (R) implement the CART algorithm for decision trees, but that there are equally valid decision tree algorithms that would produce different results. This implementation is inherently greedy. The issue with greedy algorithms is that making locally best splits or regions may not produce the globally best set of regions. This is what makes finding the best Decision Tree model NP-hard. If you really get into it, finding the best tree becomes an optimal stopping problem and brushes into Reinforcement Learning and Artificial Intelligence style searching.

On the more practical side, there have been a number of times in the iterative process of model-building where understanding the model assumptions has allowed me to exclude features that violate those assumptions, or to create new features that are more easily expressed and represented by the model. Both of these have consistently led to better out-of-sample performance.

Finally, the best reason is that I get pleasure and satisfaction from doing this. I think algorithms are a powerful framework and a fun way to think about problem solving, and building them from scratch improves both my execution and my understanding of them.

Metrics – Measures for Improvement

I made up the student debt example. It makes sense in my mind, but we need to define an algorithm that can create the regions from any dataset. Most machine learning algorithms are optimization algorithms that improve some measure/metric on each iteration. Decision trees are no different in that respect. To define the algorithm, we need to define the measures.

Classification Metrics for Decision Trees

Gini Impurity

Gini Impurity is a very intuitive metric to try to improve because it is a measure of the misclassification rate. If the fraction of the data set that has outcome class c is f_c, we can make predictions for each data point as being class c, f_c of the time. The error rate for a class is the probability of getting the class times the probability of getting it wrong.

Error_c = f_c * \left( 1 - f_c \right)

The Gini Impurity is the sum of these error rates for all classes.

Gini \ Impurity = \sum_c Error_c = \sum_c f_c * \left( 1 - f_c \right)

Gini \ Impurity = \sum_c f_c - \sum _c f^2_c

Gini \ Impurity = 1 - \sum _c f^2_c

If we divide the data set in regions that improve the weighted sum of the Gini Impurity, the misclassification rate will be reduced.

Gini Impurity for a Coin Flip

A coin flip is about as simple a process as there is to calculate the Gini Impurity for. We have a coin with probability of heads p_H. The Gini Impurity is then

Gini \ Impurity = 1 - p_H^2 - \left(1-p_H\right)^2 = 2*p_H*\left(1-p_H\right)

A plot of this is below for different values of p_H.

p <- seq(0.01,0.99,0.01)
plot(p,
     1-p^2-(1-p)^2,
     type='l',
     ylim=c(0,1),
     ylab = "Gini Impurity",
     xlab="Probability of Heads",
     col="blue")

[Plot: Gini Impurity as a function of the probability of heads]

We see that if p_H = 0 or p_H = 1 there is no misclassification. If p_H = 0.5 , we see that we get the coin flip wrong 50% of the time. This should match our intuition.

Entropy

Another, and less intuitive, measure of prediction accuracy is entropy. It is defined as the expected information content of a class. If f_c is the fraction of the class in a dataset, and I(c) is the information content of the class, then entropy is defined as follows

Entropy = \sum_c f_c * I(c)

For reasons I will not get into in this post, I(c) = - log_2\left(f_c\right) , so the entropy of a dataset follows:

Entropy = - \sum_c f_c * log_2\left(f_c\right)

The base of the logarithm can be anything, really. But since we will be using binary trees throughout this series, I will keep the base 2.

Entropy for a Coin Flip

Continuing with the coin flip example, I have plotted the entropy of a coin flip for different values of p_H.

p <- seq(0.01,0.99,0.01)
plot(p,
     -p*log2(p)-(1-p)*log2(1-p),
     type='l',
     ylim=c(0,1),
     ylab = "Entropy",
     xlab="Probability of Heads",
     col="blue")

[Plot: entropy as a function of the probability of heads]

The expected information content at p_H = 0 and p_H = 1 is zero, which means that we always know what the next flip is. The entropy is maximal at p_H = 0.5, which is where we have the highest misclassification rate.

Regressions Metrics for Decision Trees

The metric used for decision tree regression is the reduction of variance around the prediction.

Variance = \frac{1}{N} \sum_{i}^{N} ( y_i - \bar{y} )^2

Simple Linear Example

We can take a simple linear example and split it into two regions. The solid black line is the average of the entire dataset, while the red lines are the predictions in each region.

[Plot: linear dataset with the overall mean (black) and per-region means (red)]

The total variance before the split was 40.00, while the weighted variance after the split is \frac{6}{11} 11.67 + \frac{5}{11} 8.00 = 10.0

Improving Metrics – Splitting on Information Gain

Now that we have metrics to evaluate the predictions from a data set, we can define the improvement measure to optimize. I will use the term information gain, which is the correct term when talking about entropy, for all of the optimizations. I cannot say conclusively that it is the name for the measure in decision tree regression. Being wrong hasn't stopped me before!

Decision Tree Classifier

Information \ Gain = Gini \ Impurity(d) - \sum_{split \ s} \frac{|s|}{|d|} Gini \ Impurity(s)

Information \ Gain = Entropy(d) - \sum_{split \ s} \frac{|s|}{|d|} Entropy(s)

Decision Tree Regressor

 

Information \ Gain = Variance(d) - \sum_{split \ s} \frac{|s|}{|d|} Variance(s)

Parameters of Decision Trees

Decision Trees are prone to overfitting because the metrics can be driven to their minimum by making the tree deep enough to capture each data point individually. The variance of a single point is zero. The misclassification rate of a single point is also zero. Traditionally this overfitting is managed by pruning the tree after it is constructed, or by setting parameters that limit how finely the data can be partitioned into regions.

The post-pruning of a decision tree uses some sort of statistical evaluation on a validation set. If the error on the sub-nodes is larger than the error on the node itself, the sub-nodes are removed.

The parameters that are commonly used in CART to limit the partitioning of the data are:

  1. Maximum Depth of the Tree
  2. Minimum Number of Data Points for a New Node
  3. Minimum Improvement needed for a New Node
  4. Minimum Metric under which New Nodes are not created.

Pseudo Code

Now that we have a metric and a measure of improvement, we can outline the pseudo code we will be implementing below, which is based on the CART algorithm for decision trees. The Classification and Regression Trees (CART) algorithm recursively and greedily splits nodes when there is an improvement in the metric of choice, until some stopping conditions are met.

  1. Set Training Parameters
  2. Create a Root Node
  3. Determine the optimal split of the node, such that the split minimizes the weighted metric of concern (i.e., maximizes the information gain).
  4. If the stopping parameters are not violated, create two child nodes and apply steps 3 and 4 to each child node.

Code

In this code section I am going to implement a Decision Tree algorithm in the following parts:

  1. Introduce the Data
  2. Script Finding the Best First Partition
  3. Create a Data Structure and Implement Decision Trees
  4. Comparing with rpart package.

Assumptions

The implementation will avoid some issues with decision trees. I am going to assume that there are never missing values. I will assume that all categorical variables are logical, character, or factor variables, and that everything else is numeric. I am also going to assume that all the data is in a data.frame format.

Data

I will be using the Iris dataset for classification and the ToothGrowth dataset for regression. Both are available in the R datasets package.

library(datasets)
data(iris)
data(ToothGrowth)

Iris

I will be using the Species as the classification target in this post. Here are the top 6 rows of the dataset:

Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa

ToothGrowth

I will be using the ‘len’ variable as the regression target in this post. Here are the top 6 rows of the ToothGrowth dataset:

len supp dose
4.2 VC 0.5
11.5 VC 0.5
7.3 VC 0.5
5.8 VC 0.5
6.4 VC 0.5
10.0 VC 0.5

Scripting the First Partition

I am going to implement a brute-force search for the best partition of the data set. This will be done by going through each value in each feature and finding the feature-value pair that best partitions the data. To do this we need to implement the metrics I discussed in the theory section above:

entropy: classification
Gini impurity: classification
variance: regression

entropy <- function(y){
  # assumes y is a factor with all levels
  if(length(y)==0) return(0)
  p <- table(y)/length(y)
  sum(-p*log2(p+1e-9))
}

gini_impurity <- function(y){
  # assumes y is a factor with all levels
  if(length(y) == 0) return(0)
  p <- table(y)/length(y)
  1-sum(p^2)
}

variance <- function(y){
  if(length(y) <= 1) return(0)
  var(y)
}

I can quickly calculate these metrics for our dataset:

entropy(iris[,'Species'])
gini_impurity(iris[,'Species'])
variance(ToothGrowth[,'len'])

1.58496249639307
0.666666666666667
58.5120225988701

Information Gain

Now that I have metric functions, I want to calculate the improvement for different partitions. Each partition is going to be calculated using a mask which is a logical vector that states if the row is in the first partition (TRUE) or in the second partition (FALSE). I will call the function information gain, even though it really is only information gain if entropy is the metric.

information_gain <- function(y,mask,func=entropy){
    s1 = sum(mask)
    s2 = length(mask)-s1
    if ( s1 == 0 | s2 == 0) return(0)
    func(y)-s1/(s1+s2)*func(y[mask])-s2/(s1+s2)*func(y[!mask])
}

information_gain(iris[,'Species'],iris[,'Petal.Width'] < 1.5)
information_gain(iris[,'Species'],iris[,'Petal.Width'] < 1.5,gini_impurity)
information_gain(ToothGrowth[,'len'],ToothGrowth[,'dose']<= 0.5,variance)

0.643516416593624
0.229045542635659
33.8790109029636

Max Information Gain For A Single Feature

Now that I have a function to calculate the information gain for a given criterion, mask, or split, I will search for the best split among all the values in a given feature. This will give me the best split for that feature:

max_information_gain_split <- function(y,x,func=gini_impurity){
  best_change = NA
  split_value = NA
  is_numeric = !(is.factor(x)|is.logical(x)|is.character(x))
  for( val in sort(unique(x))){
    mask <- x == val
    if (is_numeric) mask <- x < val
    change <- information_gain(y,mask,func)
    if(is.na(best_change) | change > best_change){
      best_change = change
      split_value = val
    }
  }
  return(list("best_change"=best_change,
      "split_value"=split_value,
      "is_numeric"=is_numeric))
}

print(unlist(max_information_gain_split(iris$Species,iris$Petal.Width)))
print(unlist(max_information_gain_split(iris$Species,iris$Petal.Width,entropy)))
print(unlist(max_information_gain_split(ToothGrowth$len,ToothGrowth$dose,variance)))

best_change split_value is_numeric
0.3333333 1.0000000 1.0000000
best_change split_value is_numeric
0.9182958 1.0000000 1.0000000
best_change split_value is_numeric
33.87901 1.00000 1.00000

All the best split values just happen to be 1 in this case.

Max Information Gain Among All Features

Now I can apply 'max_information_gain_split' to all of the feature columns and find the best split. We will do this by applying the 'sapply' function to the features like so:

sapply(iris[,1:4],function(x) max_information_gain_split(iris[,5],x))
Sepal.Length Sepal.Width Petal.Length Petal.Width
best_change 0.2277603 0.1269234 0.3333333 0.3333333
split_value 5.5 3.4 3 1
is_numeric TRUE TRUE TRUE TRUE
sapply(ToothGrowth[,2:3],function(x) max_information_gain_split(ToothGrowth[,1],x))
supp dose
best_change 0.01444444 0.02027778
split_value OJ 2
is_numeric FALSE TRUE

After ‘sapplying’ the function, I will return a list with the information of the feature and value with the most information gain.

best_feature_split <- function(X,y){
  results <- sapply(X,function(x) max_information_gain_split(y,x))
  best_name <- names(which.max(results['best_change',]))
  best_result <- results[,best_name]
  best_result[["name"]] <- best_name
  best_result
}

best_feature_split(iris[,1:4],iris[,5])
best_feature_split(ToothGrowth[,2:3],ToothGrowth[,1])
best_change split_value is_numeric name
0.3333333 3 TRUE Petal.Length
best_change split_value is_numeric name
0.0202778 2 TRUE dose

Now that I have the best feature-value combination to partition the data, we can split it into two data sets. One I will call 'left', and the other I will call 'right'. This matches the data structure we will use, in which each node can have two branches: left and right.

get_best_mask <- function(X,best_feature_list){
  best_mask <- X[,best_feature_list$name] == best_feature_list$split_value
  if(best_feature_list$is_numeric){
    best_mask <- X[,best_feature_list$name] < best_feature_list$split_value
  }
  return(best_mask)
}

bfs <- best_feature_split(ToothGrowth[,2:3],ToothGrowth[,1])
best_tooth_mask <- get_best_mask(ToothGrowth[,2:3],bfs)
leftDf = ToothGrowth[best_tooth_mask,]
rightDf = ToothGrowth[!best_tooth_mask,]

Now that I have scripted a single split, I will now work on creating a tree-based data structure for the decision tree.

Data Structure

R is a functional language, but I decided to take a more object-oriented approach to implementing my decision tree. If you want to examine a functional approach, you can examine the code in the rpart package.

I decided on this route partially because it is how I think of the problem, but that does not make it the best implementation. A truer reason is that I also wanted to take this post as an opportunity to play around with R's Reference Classes object implementation. Hadley Wickham has a good Reference Classes overview. I had never used these before this post. Be warned.

I am going to make two classes: a DecisionTree class that loosely mimics the sklearn API, and a DecisionTreeNode class that will recursively fit the data based on the hyper-parameters we provide. There are many parameters one could set to keep the tree from overfitting. I have chosen max_depth, min_information_gain, and min_leaf_size. Class diagrams are below.

Class Descriptions of Decision Tree Implementation

 

DecisionTreeNode

This is going to be the template I will use for the DecisionTreeNode class.  It is using R’s Reference Class objects.

DecisionTreeNode <- setRefClass("DecisionTreeNode",
    fields = list(
        x = "data.frame",
        y = "ANY",
        is_leaf="logical",
        split_description="character",
        best_split="list",
        branches="list",
        depth="numeric",
        minimize_func="function",
        min_information_gain="numeric",
        min_leaf_size="numeric",
        max_depth="numeric"),
    methods = list(
        initialize = function(...){},
        information_gain = function(mask){},
        max_information_gain_split = function(feature){},
        best_feature_split = function(){},
        split_node = function() {},
        predict_row = function(row){},
        predict = function(features){}
     )
)

Print

I want the print method of a DecisionTreeNode to give a summary of the entire structure and its decisions. To do this I recursively descend down the tree to get each node's description, then append it to the response. After all the responses have been returned to the root level, I cat the response so that I can take advantage of the tab/'\t' and newline/'\n' syntax.

print.DecisionTreeNode <- function(node,level=0){
    response <- paste("|->",node$split_description)
    if(level < node$max_depth){
         if(!is.null(node$branches$left)){
           response <- paste0(response,"\n",paste(rep(" ",2*(level+1)),collapse=" "),print(node$branches$left,level+1))
        }
        if(!is.null(node$branches$right)){
            response <- paste0(response,"\n",paste(rep(" ",2*(level+1)),collapse=" "),print(node$branches$right,level+1))
        }
    }

    if(level==0) {
        cat(response)
    } else {
        return(response)
    }
}

Initialize Function

This function gets called when DecisionTreeNode$new() is invoked. In it, I set a list of default values. I iterate through the parameters provided to the $new(...) function, and if a parameter is not provided, I fill in its default and then assign it with the do.call command.

initialize = function(...){
   defaults <- list(x = data.frame(),
                    y=c(),
                    depth=0,
                    minimize_func=gini_impurity,
                    min_information_gain=1e-2,
                    min_leaf_size=20,
                    max_depth=3,
                    is_leaf=T,
                    split_description="root",
                    best_split=NULL,
                    branches=list("left"=NULL,"right"=NULL))

   params <- list(...)
   fields <- names(getRefClass()$fields())
   for( field in fields){
       if (!(field %in% names(params))) {
           params[[field]] <- defaults[[field]]
       }
   }
   for( param in names(params)){
       do.call("<<-",list(param, params[[param]]))
   }
}

Information Gain

This is the same function as I implemented during the scripting section except that it makes reference to the object variables instead of accepting them as parameters.

information_gain = function(mask){
   s1 = sum(mask)
   s2 = length(mask)-s1
   if ( s1 == 0 | s2 == 0) return(0)
   minimize_func(y)-s1/(s1+s2)*minimize_func(y[mask])-s2/(s1+s2)*minimize_func(y[!mask])
}

Max Information Gain Split

This function finds the value with the most information gain for a given feature. It is similar to the scripted version above, except that it also references the object's variables.

max_information_gain_split = function(feature){

   best_change = NA
   split_value = NA
   is_numeric = !(is.factor(feature)|is.logical(feature)|is.character(feature))

   previous_val <- 0
   for( val in sort(unique(feature))){

       mask <- feature == val
       if (is_numeric) mask <- feature < val
       change <- information_gain(mask)
       if(is.na(best_change) | change > best_change){
           best_change = change
           split_value = ifelse(is_numeric,
                                mean(c(val,previous_val)),
                                val)

       }
       previous_val <- val

   }
   return(list("best_change"=best_change,
               "split_value"=split_value,
               "is_numeric"=is_numeric))
}
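Using the same hypothetical node as in the information-gain check above, the proposed split for a single feature can be inspected like this (again, only once the full class is assembled):

# Hedged usage sketch, reusing the node built for the information-gain check.
# Expect a list with best_change (the gain), split_value (the midpoint between
# the chosen value and the previous one), and is_numeric = TRUE.
node$max_information_gain_split(iris$Petal.Length)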

Best Feature Split

This function finds the feature-value split that gives the most information gain out of all the features.

best_feature_split = function(){
   results <- sapply(x,function(feature) max_information_gain_split(feature))
   best_name <- names(which.max(results['best_change',]))
   best_result <- results[,best_name]
   best_result[["name"]] <- best_name
   best_split <<- best_result
}
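And the feature-level winner can be inspected through the best_split field it populates:

# Hedged sketch: after this call, the best_split field holds the winning
# feature's name along with its best_change, split_value, and is_numeric flag.
node$best_feature_split()
node$best_split$name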

Best Mask

This function builds the logical mask for the best feature split. It is used to split the node when doing so is consistent with the hyper-parameters.

best_mask = function(){
   best_mask <- x[,best_split$name] == best_split$split_value
   if(best_split$is_numeric){
       best_mask <- x[,best_split$name] < best_split$split_value
   }
   return(best_mask)
}

Split Node

This function splits the DecisionTreeNode into left and right branches when doing so is consistent with the hyper-parameters. It also sets each node's description so that a summary tree can be printed out.

split_node = function() {
   if(depth < max_depth){
       best_feature_split()
       if(best_split$best_change > min_information_gain ){
           mask = best_mask()
           if(sum(mask) > min_leaf_size && length(mask)-sum(mask) > min_leaf_size){
               is_leaf <<- F

               branches$left <<- .self$copy()
               branches$left$is_leaf <<- T
               branches$left$x <<-  branches$left$x[mask,]
               branches$left$y <<-  branches$left$y[mask]

               branches$left$split_description <<- ifelse(best_split$is_numeric,
                                                         paste(c(best_split$name,
                                                                 "<",
                                                                 best_split$split_value),
                                                               collapse = " "),
                                                         paste(c(best_split$name,
                                                                 "=",
                                                                 best_split$split_value),
                                                               collapse = " "))

               branches$left$depth <<-  branches$left$depth+1
               branches$left$branches <<- list("left"=NULL,"right"=NULL)
               branches$left$split_node()

               branches$right <<- .self$copy()
               branches$right$is_leaf <<- T
               branches$right$x <<-  branches$right$x[!mask,]
               branches$right$y <<-  branches$right$y[!mask]

               branches$right$split_description <<- ifelse(best_split$is_numeric, paste(c(best_split$name, ">=",
                                                                 best_split$split_value),
                                                               collapse = " "),
                                                         paste(c(best_split$name,
                                                                 "!=",
                                                                 best_split$split_value),
                                                               collapse = " "))

               branches$right$depth <<-  branches$right$depth+1
               branches$right$branches <<- list("left"=NULL,"right"=NULL)
               branches$right$split_node()
           }
       }
   }
   if(is_leaf){
       split_description <<- ifelse(identical(minimize_func,variance),
                                          paste(c(split_description,
                                                  ":",
                                                  "predict - ",
                                                  mean(y)),
                                                collapse=" "),
                                          paste(c(split_description,
                                                  ":",
                                                  "predict - ",
                                                  names(which.max(table(y)))),
                                                collapse=" "))
   }

}

Predict Row

This function is used to make a prediction for a single row. It will be used by the predict function.

predict_row = function(row){
   if(is_leaf){
       predict_value <- ifelse(identical(minimize_func,variance),
                               mean(y),
                               names(which.max(table(y))))
   } else {
       if(best_split$is_numeric){
           left = row[best_split$name] < best_split$split_value
       } else{
           left = row[best_split$name] == best_split$split_value
       }
       if(left){
           predict_value = branches$left$predict_row(row)
       } else {
           predict_value = branches$right$predict_row(row)
       }
   }
   return(predict_value)
}

Predict

This function will predict the values from the fitted tree. If the minimizing function is variance, I assume the problem is a regression problem and return numeric values. If it is not, I assume it is a classification problem and return character values.

predict = function(features){
   pred <- character(length=dim(features)[1])
   if(identical(minimize_func,variance)) pred <- numeric(length=dim(features)[1])
   for(i in 1:dim(features)[1]){
       pred[i] = predict_row(features[i,])
   }
   pred
}

Final DecisionTreeNode Code


DecisionTreeNode <- setRefClass("DecisionTreeNode",
   fields = list(x = "data.frame",
              y = "ANY",
              is_leaf="logical",
              split_description="character",
              best_split="list",
              branches="list",
              depth="numeric",
              minimize_func="function",
              min_information_gain="numeric",
              min_leaf_size="numeric",
              max_depth="numeric"),
   methods = list(
       initialize = function(...){
           defaults <- list(x = data.frame(),
                            y=c(),
                            depth=0,
                            minimize_func=gini_impurity,
                            min_information_gain=1e-2,
                            min_leaf_size=20,
                            max_depth=3,
                            is_leaf=T,
                            split_description="root",
                            best_split=NULL,
                            branches=list("left"=NULL,"right"=NULL))

           params <- list(...)
           fields <- names(getRefClass()$fields())
           for( field in fields){
               if (!(field %in% names(params))) {
                   params[[field]] <- defaults[[field]]
               }
           }
           for( param in names(params)){
               do.call("<<-",list(param, params[[param]]))
           }

       },
       information_gain = function(mask){

           s1 = sum(mask)
           s2 = length(mask)-s1
           if ( s1 == 0 | s2 == 0) return(0)
           minimize_func(y)-s1/(s1+s2)*minimize_func(y[mask])-s2/(s1+s2)*minimize_func(y[!mask])
       },
       max_information_gain_split = function(feature){

           best_change = NA
           split_value = NA
           is_numeric = !(is.factor(feature)|is.logical(feature)|is.character(feature))

           previous_val <- 0
           for( val in sort(unique(feature))){

               mask <- feature == val
               if (is_numeric) mask <- feature < val
                change <- information_gain(mask)
                if(is.na(best_change) | change > best_change){
                   best_change = change
                   split_value = ifelse(is_numeric,
                                        mean(c(val,previous_val)),
                                        val)

               }
               previous_val <- val

           }
           return(list("best_change"=best_change,
                       "split_value"=split_value,
                       "is_numeric"=is_numeric))
       },
       best_feature_split = function(){
           results <- sapply(x,function(feature) max_information_gain_split(feature))
           best_name <- names(which.max(results['best_change',]))
           best_result <- results[,best_name]
           best_result[["name"]] <- best_name
           best_split <<- best_result
       },
       best_mask = function(){
           best_mask <- x[,best_split$name] == best_split$split_value
           if(best_split$is_numeric){
               best_mask <- x[,best_split$name] < best_split$split_value
           }
           return(best_mask)
       },
       split_node = function() {
            if(depth < max_depth){
                best_feature_split()
                if(best_split$best_change > min_information_gain ){
                   mask = best_mask()
                   if(sum(mask) > min_leaf_size && length(mask)-sum(mask) > min_leaf_size){
                       is_leaf <<- F

                       branches$left <<- .self$copy()
                       branches$left$is_leaf <<- T
                       branches$left$x <<-  branches$left$x[mask,]
                       branches$left$y <<-  branches$left$y[mask]

                       branches$left$split_description <<- ifelse(best_split$is_numeric,
                                                                 paste(c(best_split$name,
                                                                         "<",
                                                                         best_split$split_value),
                                                                       collapse = " "),
                                                                 paste(c(best_split$name,
                                                                         "=",
                                                                         best_split$split_value),
                                                                       collapse = " "))

                       branches$left$depth <<-  branches$left$depth+1
                       branches$left$branches <<- list("left"=NULL,"right"=NULL)
                       branches$left$split_node()

                       branches$right <<- .self$copy()
                       branches$right$is_leaf <<- T
                       branches$right$x <<-  branches$right$x[!mask,]
                       branches$right$y <<-  branches$right$y[!mask]

                       branches$right$split_description <<- ifelse(best_split$is_numeric, paste(c(best_split$name, ">=",
                                                                         best_split$split_value),
                                                                       collapse = " "),
                                                                 paste(c(best_split$name,
                                                                         "!=",
                                                                         best_split$split_value),
                                                                       collapse = " "))

                       branches$right$depth <<-  branches$right$depth+1
                       branches$right$branches <<- list("left"=NULL,"right"=NULL)
                       branches$right$split_node()
                   }
               }
           }
           if(is_leaf){
               split_description <<- ifelse(identical(minimize_func,variance),
                                                  paste(c(split_description,
                                                          ":",
                                                          "predict - ",
                                                          mean(y)),
                                                        collapse=" "),
                                                  paste(c(split_description,
                                                          ":",
                                                          "predict - ",
                                                          names(which.max(table(y)))),
                                                        collapse=" "))
           }

       },
       predict_row = function(row){
           if(is_leaf){
               predict_value <- ifelse(identical(minimize_func,variance),
                                       mean(y),
                                       names(which.max(table(y))))
           } else {
               if(best_split$is_numeric){
                   left = row[best_split$name] < best_split$split_value
               } else{
                   left = row[best_split$name] == best_split$split_value
               }
               if(left){
                   predict_value = branches$left$predict_row(row)
               } else {
                   predict_value = branches$right$predict_row(row)
               }
           }
           return(predict_value)
       },
       predict = function(features){
           pred <- character(length=dim(features)[1])
           if(identical(minimize_func,variance)) pred <- numeric(length=dim(features)[1])
           for(i in 1:dim(features)[1]){
               pred[i] = predict_row(features[i,])
           }
           pred
       }
   )
)

Proof of Concept on Iris

dtn <- DecisionTreeNode$new(x=iris[,1:4],y=iris[,5],max_depth=3)
dtn$split_node()
print(dtn)
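Beyond eyeballing the printed tree, a rough check of the fit is to score the training data back through the tree. With three well-separated classes and a depth-3 tree, the in-sample accuracy should be high; treat this as a sanity check under the assumptions above, not a performance claim.

# In-sample sanity check (not an out-of-sample accuracy estimate).
preds <- dtn$predict(iris[, 1:4])
mean(preds == as.character(iris$Species))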

Proof of Concept on ToothGrowth

dtn <- DecisionTreeNode$new(x=ToothGrowth[,2:3],
                            y=ToothGrowth[,1],
                            minimize_func=variance,
                            min_leaf_size=5,
                            max_depth=5,
                            min_information_gain=1e-3)
dtn$split_node()
print(dtn)
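For the regression tree, a similar hedged check is to compare the in-sample RMSE against simply predicting the mean of len; the tree should come in noticeably lower if the splits are doing their job.

# In-sample comparison: tree RMSE vs. the predict-the-mean baseline.
preds <- dtn$predict(ToothGrowth[, 2:3])
sqrt(mean((ToothGrowth$len - preds)^2))                   # tree
sqrt(mean((ToothGrowth$len - mean(ToothGrowth$len))^2))   # baseline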

DecisionTree

Now that I have implemented the DecisionTreeNode, I can implement the DecisionTree class.  Here is the template code.

DecisionTree <- setRefClass("DecisionTree",
    fields = list(minimize_func="function",
                  min_information_gain="numeric",
                  min_leaf_size="numeric",
                  max_depth="numeric",
                  root = "DecisionTreeNode"),
    methods = list(
        initialize = function(...){},
        fit = function(features,target){},
        predict = function(features){}
    ))

Initialize

This is almost identical to the initialization function for DecisionTreeNode, except that the defaults here are only the hyper-parameters.

initialize = function(...){
    defaults <- list(minimize_func=gini_impurity,
                     min_information_gain=1e-2,
                     min_leaf_size=20,
                     max_depth=3,
                     root=NULL)

    params <- list(...)
    fields <- names(getRefClass()$fields())
    for( field in fields){
        if (!(field %in% names(params))) {
            params[[field]] <- defaults[[field]]
        }
    }
    for( param in names(params)){
        do.call("<<-",list(param, params[[param]]))
    }

}

Fit

The fit function takes the features and the target. I am assuming that a data frame will be provided; you can generalize this to other data types as you see fit. It simply creates a new DecisionTreeNode from the features and target, then splits that node, which recursively and greedily builds the tree.

fit = function(features,target){
    root <<- DecisionTreeNode$new(x=features,
                                y=target,
                                minimize_func=minimize_func,
                                min_information_gain=min_information_gain,
                                min_leaf_size=min_leaf_size,
                                max_depth=max_depth
                                )
    root$split_node()

}

Predict

The predict function is just a wrapper to call the DecisionTreeNode predict function. Right now there are no probabilities. You can add that feature yourself!

predict = function(features){
        root$predict(features)
    }
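Since class probabilities are left as an exercise, here is a minimal sketch of one way to add them for the classification case. The predict_proba_row method below is hypothetical (it is not part of the class above) and would need to be added to the DecisionTreeNode methods list; it walks the tree exactly like predict_row but returns the class proportions at the leaf instead of the majority class.

# Hypothetical method for DecisionTreeNode (not in the implementation above):
# returns the proportion of each class among the training rows at the leaf.
predict_proba_row = function(row){
    if(is_leaf){
        return(table(y)/length(y))
    }
    if(best_split$is_numeric){
        left <- row[best_split$name] < best_split$split_value
    } else {
        left <- row[best_split$name] == best_split$split_value
    }
    if(left){
        return(branches$left$predict_proba_row(row))
    }
    return(branches$right$predict_proba_row(row))
}

A matching predict_proba on DecisionTree could then loop over the rows of the feature data frame the same way predict does.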

Fully Implemented Decision Tree

DecisionTree <- setRefClass("DecisionTree",
    fields = list(minimize_func="function",
                  min_information_gain="numeric",
                  min_leaf_size="numeric",
                  max_depth="numeric",
                  root = "DecisionTreeNode"),
    methods = list(
        initialize = function(...){
            defaults <- list(minimize_func=gini_impurity,
                             min_information_gain=1e-2,
                             min_leaf_size=20,
                             max_depth=3,
                             root=NULL)

            params <- list(...)
            fields <- names(getRefClass()$fields())
            for( field in fields){
                if (!(field %in% names(params))) {
                    params[[field]] <- defaults[[field]]
                }
            }
            for( param in names(params)){
                do.call("<<-",list(param, params[[param]]))
            }

        },
        fit = function(features,target){
            root <<- DecisionTreeNode$new(x=features,
                                        y=target,
                                        minimize_func=minimize_func,
                                        min_information_gain=min_information_gain,
                                        min_leaf_size=min_leaf_size,
                                        max_depth=max_depth
                                        )
            root$split_node()

        },
        predict = function(features){
            root$predict(features)
        }
    ))
print.DecisionTree <- function(tree){
    print(tree$root)
}

Comparison to rpart

Now that we have a decision tree, we can compare the results between the two CART implementations.

library(rpart)
dt_rpart <- rpart(Species ~ .,data=iris,maxdepth=3,minbucket=20)
dt_bts <- DecisionTree(max_depth=3,min_leaf_size=20)
dt_bts$fit(iris[,1:4],iris[,5])

print(dt_bts)
cat("\n\n")
print(dt_rpart)
Comparison between rpart and my implementation on the Iris dataset

dt_rpart <- rpart(Species ~ .,data=iris,maxdepth=5,minbucket=5)
dt_bts <- DecisionTree(max_depth=5,min_leaf_size=5)
dt_bts$fit(iris[,1:4],iris[,5])

print(dt_bts)
cat("\n\n")
print(dt_rpart)
Comparison between rpart and my implementation on the Iris dataset

dt_rpart <- rpart(len ~ .,data=ToothGrowth,maxdepth=5,minbucket=5,method="anova")
dt_bts <- DecisionTree(max_depth=5,min_leaf_size=5,minimize_func=variance)
dt_bts$fit(ToothGrowth[,2:3],ToothGrowth[,1])

print(dt_bts)
cat("\n\n")
print(dt_rpart)
Comparison between rpart and my implementation on the ToothGrowth dataset

I now have a DecisionTree class that, provided the data meets my assumptions (no missing values, features in a data frame, and categorical variables stored as factors), produces results that are similar, if not identical, to the rpart package.
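One hedged way to quantify "similar, if not identical" is to refit both models on iris and measure how often their class predictions agree on the training data:

# Agreement check between the two implementations on the training data.
library(rpart)
dt_rpart <- rpart(Species ~ ., data = iris, maxdepth = 3, minbucket = 20)
dt_bts <- DecisionTree(max_depth = 3, min_leaf_size = 20)
dt_bts$fit(iris[, 1:4], iris[, 5])
mean(as.character(predict(dt_rpart, iris, type = "class")) == dt_bts$predict(iris[, 1:4]))

A value at or near 1 would support the claim; any disagreements would most likely trace back to ties in the split search or to differences in how the two implementations apply their stopping rules.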

Conclusion

In this post, I covered the theory and code of implementing the CART algorithm for producing decision trees and compared the results to rpart on a few toy datasets. This code is far from production ready, but it does show a high-level implementation of the algorithm. You can leave any questions, comments, corrections, or requests in the comments below.

If you are interested in learning more, I can recommend a few books below. The image links are affiliate links. I provide links to free versions when available.

Data Science For Business

This book has a good overview of Decision Trees in chapter 3, as well as generally good advice about using data science to get practical results. I gave it a thorough once-over, and I continue to reference it from time to time.

Introduction to Statistical Learning

This is the book I first picked up when I wanted to learn R. You can google for the accompanying code. It gives very comprehensive coverage of Statistical Learning and R, and I highly recommend it. You can find an older version of this book as a free pdf: http://www-bcf.usc.edu/~gareth/ISL/

Elements of Statistical Learning

This is another book I picked up when I wanted to learn R, and there is code available for it as well. Its approach is less practical, but I think it gives a good, global view of the field. I am amazed that I learn something new every time I open this book. You can find a free older version of this book as well: http://statweb.stanford.edu/~tibs/ElemStatLearn/

Bayesian Methods for Hackers

Main Author: Cam Davidson-Pilon

Link: https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

Description:

The most hands-on and practical coverage of Bayesian Inference I have been exposed to.   It is both a GitHub collection of iPython notebooks and a real living, breathing book.

Stanford’s Machine Learning Project Page

Links:

http://cs229.stanford.edu/projects2015.html

http://cs229.stanford.edu/projects2014.html

http://cs229.stanford.edu/projects2013.html

Description:  A list of all the student projects and reports from Stanford’s CS 229 Machine Learning course.   The most recent year has the top projects highlighted.  It is probably not worth exhaustively reading each report unless you want to practice being a C.S. T.A.

Data Science iPython Notebooks by Donne Martin

Author:  Donne Martin

Link:  https://github.com/donnemartin/data-science-ipython-notebooks

Description:  This is one of the best collections of iPython notebooks. I obsessively went through all of them when I decided I was going to leave teaching and pursue a Data Science career. I found a lot of value in these notebooks. Thank you, Donne.