
Decision Trees

You can find the iPython Notebook for this post at https://github.com/bryantravissmith/FromScratch/tree/master/SupervisedLearning/DecisionTrees

Decision Trees are a non-parametric supervised machine learning algorithm that takes a set of training data and constructs a set of regions in the space of features that is then used to make predictions. These predictions can be continuous values (Decision Tree Regression), or classification (Decision Tree Classification).

Let's imagine we have a dataset with many features along with how much student debt each individual has. We could just take the average of the entire dataset, say $25,000, and use that as a prediction. A better prediction could be developed by dividing the dataset up using the level of college completed: None, Some, Undergraduate, and Graduate. We can then take the average of each subset/region:

1. None: $0
2. Some: $2,000
3. Undergraduate: $16,000
4. Graduate: $42,000

Now when we score a person, we use their level of college completed to make the prediction. The idea is that these results improve on the $25,000 average estimate from the entire dataset.

Why Implement It Ourselves?

The reason I want to implement machine learning algorithms from scratch is that I gain a better intuition for the assumptions and trade-offs in the implementations of standard libraries. For instance, before I did this for the first time I did not appreciate that both sklearn (Python) and rpart (R) implement the CART algorithm for decision trees, but there are equally valid decision tree algorithms that would produce different results. This implementation is inherently greedy. The issue with greedy algorithms is that making locally best splits or regions may not produce the globally best set of regions. This is what makes finding the best Decision Tree model NP-hard. If you really get into it, finding the best tree becomes an optimal stopping problem and brushes into Reinforcement Learning and Artificial Intelligence-style searching.

On the more practical side, there have been a number of times in the iterative process of model-building where understanding the model assumptions has allowed me to exclude features that violate those assumptions, or to create new features that are more easily expressed and represented by the model. Both have consistently led to better out-of-sample performance.

Finally, the best reason is that I get pleasure and satisfaction from doing this. I think algorithms are a powerful framework and a fun way to think about problem solving, and building them from scratch improves both my execution and my understanding of them.

Metrics – Measures for Improvement

I made up the student debt example. It makes sense in my mind, but we need to define an algorithm that can create the regions from any dataset. Most machine learning algorithms are optimization algorithms that improve some measure/metric on each iteration. Decision trees are no different in that respect. To define the algorithm, we need to define the measures.

Classification Metrics for Decision Trees

Gini Impurity

Gini Impurity is a very intuitive metric to improve because it is a measure of the misclassification rate. If the fraction of the dataset that has outcome class c is f_c, we can predict class c for each data point f_c of the time. The error rate for a class is the probability of predicting the class times the probability of being wrong.

$Error_c = f_c * \left( 1 - f_c \right)$

The Gini Impurity is the sum of these error rates for all classes.

$Gini \ Impurity = \sum_c Error_c = \sum_c f_c * \left( 1 - f_c \right)$

$Gini \ Impurity = \sum_c f_c - \sum _c f^2_c$

$Gini \ Impurity = 1 - \sum _c f^2_c$

If we divide the dataset into regions that reduce the weighted sum of the Gini Impurity, the misclassification rate will be reduced.
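As a quick illustration of that weighted sum (my own toy example, not from any dataset in this post): a perfectly mixed two-class sample has a Gini Impurity of 0.5, and a split that cleanly separates the classes drives the weighted impurity to zero.

```python
import numpy as np

def gini(y):
    # Gini Impurity = 1 - sum of squared class fractions.
    _, counts = np.unique(y, return_counts=True)
    p = counts / float(len(y))
    return 1.0 - np.sum(p**2)

y = np.array([0, 0, 0, 0, 1, 1, 1, 1])     # perfectly mixed: impurity 0.5
mask = np.array([True] * 4 + [False] * 4)  # a split that separates the classes

before = gini(y)
after = (mask.sum() / float(len(y))) * gini(y[mask]) \
      + ((~mask).sum() / float(len(y))) * gini(y[~mask])

print(before, after)  # 0.5 0.0 -- the split removes all impurity
```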

Gini Impurity for a Coin Flip

A coin flip is about as simple a process as there is for calculating the Gini Impurity. We have a coin with probability of heads pH. The Gini Impurity is then

$Gini \ Impurity = 1 - p_H^2 - \left(1-p_H\right)^2 = 2*p_H*\left(1-p_H\right)$

A plot of this for different values of pH is below.

import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

p = np.linspace(0.01,0.99,100)
plt.figure(figsize=(16,8))
plt.plot(p,1-p**2-(1-p)**2)
plt.ylim(0,1)
plt.show()


We see that if pH = 0 or pH = 1 there is no misclassification. If pH = 0.5, we get the coin flip wrong 50% of the time. This should match our intuition.

Entropy

Another, and less intuitive, measure of prediction accuracy is entropy, defined as the expected information content of a class. If f_c is the fraction of class c in a dataset, and I(c) is the information content of the class, then entropy is defined as follows:

$Entropy = \sum_c f_c * I(c)$

For reasons I will not get into in this post, $I(c) = -\log_2\left(f_c\right)$, so the entropy of a dataset is:

$Entropy = - \sum_c f_c * log_2\left(f_c\right)$

The base of the log can be anything, really. But since we will be using binary trees throughout this series, I will keep base 2.
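A quick check of that claim: changing the log base only rescales entropy by a constant factor (since log_b(x) = log2(x) / log2(b)), so the choice of base never changes which split has the most information gain.

```python
import numpy as np

# Entropy of the same distribution in bits (base 2) and nats (base e).
p = np.array([0.25, 0.75])
h2 = -np.sum(p * np.log2(p))  # bits
he = -np.sum(p * np.log(p))   # nats

# The two differ only by the constant factor ln(2).
print(np.isclose(h2, he / np.log(2)))  # True
```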

Entropy for a Coin Flip

Continuing with the coin flip example, I have produced the entropy of a coin flip for different values of pH.

plt.figure(figsize=(16,8))
plt.plot(p,-p*np.log2(p)-(1-p)*np.log2(1-p))
plt.ylim(0,1)
plt.show()


The expected information content at pH = 0 and pH = 1 is zero, which means we always know what the next flip will be. The entropy is maximal at pH = 0.5, which is where we have the highest misclassification rate.

Regressions Metrics for Decision Trees

The metric used for decision tree regression is the variance around the prediction; splits are chosen to reduce it.

$Variance = \frac{1}{N} \sum_{i}^{N} ( y_i - \bar{y} )^2$

Simple Linear Example

We can take a simple linear example and split it into two regions. The solid black line is the average of the entire dataset, while the red lines are the predictions in each region.

The total variance before the split was 40.00, while the weighted variance after the split is:

$\frac{6}{11} 11.67 + \frac{5}{11} 8.00 = 10.0$
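The plot itself is not shown here, but assuming the 11 points follow y = 2x for x = 0, 1, ..., 10 (my reconstruction; it reproduces the quoted variances of 40.00, 11.67, and 8.00), the arithmetic checks out:

```python
import numpy as np

# Reconstructed linear example: 11 points on y = 2x.
y = 2 * np.arange(11.0)

total_var = np.var(y)  # variance of the whole dataset: 40.0

# Split into the first 6 and last 5 points, weight by region size.
left, right = y[:6], y[6:]
weighted = (len(left) / 11.0) * np.var(left) + (len(right) / 11.0) * np.var(right)

print(total_var)           # 40.0
print(round(weighted, 2))  # 10.0 -- the split cuts the variance to a quarter
```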

Improving Metrics – Splitting on Information Gain

Now that we have metrics to evaluate predictions on a dataset, we can define the improvement measure to optimize. I will use the term information gain, which is strictly correct only when talking about entropy, for all of the optimizations. I cannot say conclusively that it is the standard name for the measure in decision tree regression. Being wrong hasn't stopped me before!

Decision Tree Classifier

$Information \ Gain = Gini \ Impurity(d) - \sum_{split \ s} \frac{|s|}{|d|} Gini \ Impurity(s)$

$Information \ Gain = Entropy(d) - \sum_{split \ s} \frac{|s|}{|d|} Entropy(s)$

Decision Tree Regressor

$Information \ Gain = Variance(d) - \sum_{split \ s} \frac{|s|}{|d|} Variance(s)$

Parameters of Decision Trees

Decision Trees are prone to overfitting because the metrics can be driven to their minimum by making the tree deep enough to capture each data point individually. The variance of a single point is zero; the misclassification rate of a single point is also zero. Traditionally this overfitting is managed by pruning the tree after it is constructed, or by setting parameters that limit how finely the data can be partitioned into regions.

Post-pruning of a decision tree uses some sort of statistical evaluation on a validation set: if the error of a node's sub-nodes is larger than the error of the node itself, the sub-nodes are removed.
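As a hedged sketch of that idea (this is not the code built later in this post; the `Node` class and its attributes are hypothetical stand-ins), reduced-error pruning can be written as a bottom-up pass over a fitted tree:

```python
import numpy as np

class Node:
    # Minimal stand-in for a fitted classification tree node.
    def __init__(self, leaf_value, feature=None, threshold=None, left=None, right=None):
        self.leaf_value = leaf_value  # majority class at this node
        self.feature, self.threshold = feature, threshold
        self.left, self.right = left, right
        self.is_leaf = left is None

    def predict(self, X):
        if self.is_leaf:
            return np.full(X.shape[0], self.leaf_value)
        mask = X[:, self.feature] < self.threshold
        out = np.empty(X.shape[0])
        out[mask] = self.left.predict(X[mask])
        out[~mask] = self.right.predict(X[~mask])
        return out

def prune(node, X_val, y_val):
    # Reduced-error pruning: collapse a sub-tree to a leaf whenever the
    # leaf does at least as well on the validation data.
    if node.is_leaf or y_val.size == 0:
        return
    mask = X_val[:, node.feature] < node.threshold
    prune(node.left, X_val[mask], y_val[mask])        # prune children first
    prune(node.right, X_val[~mask], y_val[~mask])
    subtree_err = np.mean(node.predict(X_val) != y_val)
    leaf_err = np.mean(node.leaf_value != y_val)
    if leaf_err <= subtree_err:
        node.left = node.right = None
        node.is_leaf = True

# A split that only fit noise: the validation set is all class 0,
# so the sub-tree is collapsed into a single leaf.
root = Node(0, feature=0, threshold=0.5, left=Node(0), right=Node(1))
X_val = np.array([[0.2], [0.8], [0.9]])
y_val = np.array([0, 0, 0])
prune(root, X_val, y_val)
print(root.is_leaf)  # True
```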

The parameters commonly used in CART to limit the partitioning of the data are:

1. Maximum Depth of the Tree
2. Minimum Number of Data Points for a New Node
3. Minimum Improvement needed for a New Node
4. Minimum Metric under which New Nodes are not created.

Pseudo Code

Now that we have a metric and a measure of improvement, we can outline the pseudo code we will implement in the sections below, based on the CART algorithm for decision trees. The Classification and Regression Trees (CART) algorithm recursively and greedily splits nodes when there is an improvement in the metric of choice, until some stopping conditions are met.

1. Set Training Parameters
2. Create a Root Node
3. Determine the optimal split of the node, such that the split gives the best improvement in the metric of concern.
4. If the stopping parameters are not violated, create two child nodes and apply steps 3 and 4 to each child node.

Code

In this code section I am going to implement a Decision Tree algorithm in the following parts:

1. Introduce the Data
2. Script Finding the Best First Partition
3. Create a Data Structure and Implement Decision Trees
4. Compare with the sklearn Package

Assumptions

The implementation will sidestep some common issues with decision trees. I am going to assume there are never missing values, that categorical variables arrive as string or boolean values, and that numeric variables are truly numeric. I am also going to assume the data is provided as numpy arrays.

Data

I will be using the Iris dataset for classification and the Boston dataset for regression. Both are available in the sklearn package.

from sklearn.datasets import load_iris, load_boston

iris = load_iris()
boston = load_boston()


Boston Housing Price

import pandas as pd

bostonX = boston['data']
bostonY = boston['target']
bostonDF = pd.DataFrame(data=np.hstack((bostonX, bostonY.reshape(bostonY.shape[0], 1))),
                        columns=np.append(boston['feature_names'], 'PRICE'))
bostonDF.head()

CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT PRICE
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2

Iris

irisX = iris['data']
irisY = iris['target']
irisDF = pd.DataFrame(data=np.hstack((irisX, irisY.reshape(irisY.shape[0], 1))),
                      columns=np.append(iris['feature_names'], "Species"))
irisDF.head()

sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) Species
0 5.1 3.5 1.4 0.2 0.0
1 4.9 3.0 1.4 0.2 0.0
2 4.7 3.2 1.3 0.2 0.0
3 4.6 3.1 1.5 0.2 0.0
4 5.0 3.6 1.4 0.2 0.0

Scripting the First Partition

I am going to implement a brute-force search for the best partition of the dataset. This will be done by going through each value of each feature and finding the feature-value pair that best partitions the data. To do this we need to implement the metrics discussed above:

Entropy: Classification
Gini Impurity: Classification
Variance: Regression

def entropy(y):
    if y.size == 0:
        return 0
    p = np.unique(y, return_counts=True)[1].astype(float) / len(y)
    return -1 * np.sum(p * np.log2(p + 1e-9))

def gini_impurity(y):
    if y.size == 0:
        return 0
    p = np.unique(y, return_counts=True)[1].astype(float) / len(y)
    return 1 - np.sum(p**2)

def variance(y):
    if y.size == 0:
        return 0
    return np.var(y)


I can quickly calculate these metrics for our dataset:

print(entropy(irisY))
print(gini_impurity(irisY))
print(variance(bostonY))


1.58496249639
0.666666666667
84.4195561562

Information Gain

Now that I have metric functions, I want to calculate the improvement for different partitions. Each partition is going to be defined by a mask: a logical vector stating whether each row is in the first partition (True) or the second partition (False). I will call the function information gain, even though it really is only information gain when entropy is the metric.

def information_gain(y, mask, func=entropy):
    s1 = np.sum(mask)
    s2 = mask.size - s1
    if s1 == 0 or s2 == 0:
        return 0
    return func(y) - s1 / float(mask.size) * func(y[mask]) \
                   - s2 / float(mask.size) * func(y[np.logical_not(mask)])

print(information_gain(irisY, irisX[:,2] < 3.5))
print(information_gain(irisY, irisX[:,2] < 3.5, gini_impurity))
print(information_gain(bostonY, bostonX[:,5] < 7))


0.827861517442
0.305906762627
0.456884627461

Max Information Gain For A Single Feature

Now that I have a function to calculate the information gain for a given criterion, mask, or split, I will search for the best split among all the values of a given feature. This will give me the best split for that feature. First, a quick demonstration that np.apply_along_axis applies a function to each column:

np.apply_along_axis(lambda x: np.sum(x),0,irisX)


array([ 876.5, 458.1, 563.8, 179.8])

def max_information_gain_split(y, x, func=gini_impurity):
    best_change = None
    split_value = None
    is_numeric = x.dtype.kind not in ['S', 'b']

    for val in np.unique(np.sort(x)):
        if is_numeric:
            mask = x < val
        else:
            mask = x == val
        change = information_gain(y, mask, func)
        if best_change is None:
            best_change = change
            split_value = val
        elif change > best_change:
            best_change = change
            split_value = val

    return {"best_change": best_change,
            "split_value": split_value,
            "is_numeric": is_numeric}

print(max_information_gain_split(irisY,irisX[:,2]))
print(max_information_gain_split(irisY,irisX[:,2],entropy))
print(max_information_gain_split(bostonY,bostonX[:,3],variance))


{'is_numeric': True, 'split_value': 3.0, 'best_change': 0.33333333333333343}
{'is_numeric': True, 'split_value': 3.0, 'best_change': 0.91829583213089627}
{'is_numeric': True, 'split_value': 1.0, 'best_change': 2.5930420368499885}

For the iris feature, Gini Impurity and entropy happen to pick the same split value of 3.0 in this case.

Best Feature Split

Now I can apply the max_information gain across all columns and return the information for the best split

def best_feature_split(X, y, func=gini_impurity):
    best_result = None
    best_index = None
    for index in range(X.shape[1]):
        result = max_information_gain_split(y, X[:, index], func)
        if best_result is None:
            best_result = result
            best_index = index
        elif best_result['best_change'] < result['best_change']:
            best_result = result
            best_index = index

    best_result['index'] = best_index
    return best_result

print(best_feature_split(irisX, irisY))
print(best_feature_split(irisX, irisY, entropy))
print(best_feature_split(bostonX, bostonY, variance))


{'index': 2, 'is_numeric': True, 'split_value': 3.0, 'best_change': 0.33333333333333343}
{'index': 2, 'is_numeric': True, 'split_value': 3.0, 'best_change': 0.91829583213089627}
{'index': 5, 'is_numeric': True, 'split_value': 6.9429999999999996, 'best_change': 38.220464479057071}

Now that I have the best feature-value combination to partition the data, I can split it into two data sets: one I will call 'left' and the other 'right'. This matches the data structure we will use, in which each node has two branches: left and right.

def get_best_mask(X, best_feature_dict):
    if best_feature_dict['is_numeric']:
        return X[:, best_feature_dict['index']] < best_feature_dict['split_value']
    else:
        return X[:, best_feature_dict['index']] == best_feature_dict['split_value']

bfs = best_feature_split(irisX, irisY)
best_mask = get_best_mask(irisX, bfs)
left, right = irisX[best_mask, :], irisX[np.logical_not(best_mask), :]


Now that I have scripted a single split, I will now work on creating a tree-based data structure for the decision tree.

Data Structure

I am going to make two classes: a DecisionTree class that loosely mimics the sklearn API, and a DecisionTreeNode class that recursively fits the data based on the hyper-parameters we provide. There are many parameters one could set to keep the tree from overfitting; I have chosen max_depth, min_information_gain, and min_leaf_size.

DecisionTreeNode

This is going to be the template I will use for the DecisionTreeNode class.

class DecisionTreeNode(object):

    def __init__(self,
                 X,
                 y,
                 minimize_func,
                 min_information_gain=0.01,
                 max_depth=3,
                 min_leaf_size=20,
                 depth=0):
        self.X = X
        self.y = y

        self.minimize_func = minimize_func
        self.min_information_gain = min_information_gain
        self.min_leaf_size = min_leaf_size
        self.max_depth = max_depth
        self.depth = depth

        self.best_split = None
        self.left = None
        self.right = None
        self.is_leaf = True
        self.split_description = "root"

    def _information_gain(self, mask):
        pass

    def _max_information_gain_split(self, x):
        pass

    def _best_feature_split(self):
        pass

    def _split_node(self):
        pass

    def _predict_row(self, row):
        pass

    def predict(self, X):
        pass

    def __repr__(self):
        pass

def _max_information_gain_split(self,X):
pass

def _best_feature_split(self):
pass

def _split_node(self):
pass

def _predict_row(self,row):
pass

def predict(self,X):
pass

def __repr__(self):
pass


Information Gain

This is the same function I implemented in the scripting section, except that it references the object's variables instead of accepting them as parameters.

def _information_gain(self, mask):
    s1 = np.sum(mask)
    s2 = mask.size - s1
    if s1 == 0 or s2 == 0:
        return 0
    return self.minimize_func(self.y) \
        - s1 / float(mask.size) * self.minimize_func(self.y[mask]) \
        - s2 / float(mask.size) * self.minimize_func(self.y[np.logical_not(mask)])


Max Information Gain Split

This function finds the split value with the most information gain for a given feature. It is similar to the scripted version above, except it references the object's variables and adds a constraint: the best split is only updated if the new leaves are at least as large as the minimum leaf size. This finds the optimal split value given the constraint, whereas the scripted solution finds the optimal value and then checks whether it satisfies the constraint. The two methods can give different results and produce different trees.

def _max_information_gain_split(self, x):
    best_change = None
    split_value = None
    previous_val = None
    is_numeric = x.dtype.kind not in ['S', 'b']

    for val in np.unique(np.sort(x)):
        if is_numeric:
            mask = x < val
        else:
            mask = x == val
        change = self._information_gain(mask)
        s1 = np.sum(mask)
        s2 = mask.size - s1
        if best_change is None and s1 >= self.min_leaf_size and s2 >= self.min_leaf_size:
            best_change = change
            split_value = val
        elif best_change is not None and change > best_change \
                and s1 >= self.min_leaf_size and s2 >= self.min_leaf_size:
            best_change = change
            split_value = np.mean([val, previous_val])

        previous_val = val

    return {"best_change": best_change,
            "split_value": split_value,
            "is_numeric": is_numeric}


Best Feature Split

This function finds the feature-value split that gives the most information gain out of all the features.

def _best_feature_split(self):
    best_result = None
    best_index = None
    for index in range(self.X.shape[1]):
        result = self._max_information_gain_split(self.X[:, index])
        if result['best_change'] is not None:
            if best_result is None:
                best_result = result
                best_index = index
            elif best_result['best_change'] < result['best_change']:
                best_result = result
                best_index = index

    if best_result is not None:
        best_result['index'] = best_index
        self.best_split = best_result


Split Node

This function splits the DecisionTreeNode into left and right branches if doing so is consistent with the hyper-parameters. It also sets each node's description so a summary tree can be printed out.

def _split_node(self):
    if self.depth < self.max_depth:
        self._best_feature_split()

        if self.best_split is not None and \
           self.best_split['best_change'] >= self.min_information_gain:

            if self.best_split['is_numeric']:
                mask = self.X[:, self.best_split['index']] < self.best_split['split_value']
            else:
                mask = self.X[:, self.best_split['index']] == self.best_split['split_value']

            self.is_leaf = False

            self.left = DecisionTreeNode(self.X[mask, :],
                                         self.y[mask],
                                         self.minimize_func,
                                         self.min_information_gain,
                                         self.max_depth,
                                         self.min_leaf_size,
                                         self.depth + 1)
            if self.best_split['is_numeric']:
                split_description = 'index ' + str(self.best_split['index']) + " < " + \
                    str(self.best_split['split_value']) + " ( " + str(self.X[mask, :].shape[0]) + " )"
            else:
                split_description = 'index ' + str(self.best_split['index']) + " == " + \
                    str(self.best_split['split_value']) + " ( " + str(self.X[mask, :].shape[0]) + " )"
            self.left.split_description = str(split_description)
            self.left._split_node()

            self.right = DecisionTreeNode(self.X[np.logical_not(mask), :],
                                          self.y[np.logical_not(mask)],
                                          self.minimize_func,
                                          self.min_information_gain,
                                          self.max_depth,
                                          self.min_leaf_size,
                                          self.depth + 1)
            if self.best_split['is_numeric']:
                split_description = 'index ' + str(self.best_split['index']) + " >= " + \
                    str(self.best_split['split_value']) + " ( " + str(self.X[np.logical_not(mask), :].shape[0]) + " )"
            else:
                split_description = 'index ' + str(self.best_split['index']) + " != " + \
                    str(self.best_split['split_value']) + " ( " + str(self.X[np.logical_not(mask), :].shape[0]) + " )"
            self.right.split_description = str(split_description)
            self.right._split_node()

    if self.is_leaf:
        if self.minimize_func == variance:
            self.split_description = self.split_description + " : predict - " + str(np.mean(self.y))
        else:
            values, counts = np.unique(self.y, return_counts=True)
            predict = values[np.argmax(counts)]
            self.split_description = self.split_description + " : predict - " + str(predict)

Predict Row

This function makes a prediction for a single row. It will be used by the predict function.

def _predict_row(self, row):
    predict_value = None
    if self.is_leaf:
        if self.minimize_func == variance:
            predict_value = np.mean(self.y)
        else:
            values, counts = np.unique(self.y, return_counts=True)
            predict_value = values[np.argmax(counts)]
    else:
        if self.best_split['is_numeric']:
            left = row[self.best_split['index']] < self.best_split['split_value']
        else:
            left = row[self.best_split['index']] == self.best_split['split_value']

        if left:
            predict_value = self.left._predict_row(row)
        else:
            predict_value = self.right._predict_row(row)

    return predict_value


Predict

This function will predict the values from the fitted tree. If the minimizing function is variance, I assume the problem is a regression problem and return numeric values. If it is not, I assume it is a classification problem and return character values.

def predict(self, X):
    return np.apply_along_axis(lambda x: self._predict_row(x), 1, X)


Final DecisionTreeNode Code

class DecisionTreeNode(object):

    def __init__(self,
                 X,
                 y,
                 minimize_func,
                 min_information_gain=0.001,
                 max_depth=3,
                 min_leaf_size=20,
                 depth=0):
        self.X = X
        self.y = y

        self.minimize_func = minimize_func
        self.min_information_gain = min_information_gain
        self.min_leaf_size = min_leaf_size
        self.max_depth = max_depth
        self.depth = depth

        self.best_split = None
        self.left = None
        self.right = None
        self.is_leaf = True
        self.split_description = "root"

    def _information_gain(self, mask):
        s1 = np.sum(mask)
        s2 = mask.size - s1
        if s1 == 0 or s2 == 0:
            return 0
        return self.minimize_func(self.y) \
            - s1 / float(mask.size) * self.minimize_func(self.y[mask]) \
            - s2 / float(mask.size) * self.minimize_func(self.y[np.logical_not(mask)])

    def _max_information_gain_split(self, x):
        best_change = None
        split_value = None
        previous_val = None
        is_numeric = x.dtype.kind not in ['S', 'b']

        for val in np.unique(np.sort(x)):
            if is_numeric:
                mask = x < val
            else:
                mask = x == val
            change = self._information_gain(mask)
            s1 = np.sum(mask)
            s2 = mask.size - s1
            if best_change is None and s1 >= self.min_leaf_size and s2 >= self.min_leaf_size:
                best_change = change
                split_value = val
            elif best_change is not None and change > best_change \
                    and s1 >= self.min_leaf_size and s2 >= self.min_leaf_size:
                best_change = change
                split_value = np.mean([val, previous_val])

            previous_val = val

        return {"best_change": best_change,
                "split_value": split_value,
                "is_numeric": is_numeric}

    def _best_feature_split(self):
        best_result = None
        best_index = None
        for index in range(self.X.shape[1]):
            result = self._max_information_gain_split(self.X[:, index])
            if result['best_change'] is not None:
                if best_result is None:
                    best_result = result
                    best_index = index
                elif best_result['best_change'] < result['best_change']:
                    best_result = result
                    best_index = index

        if best_result is not None:
            best_result['index'] = best_index
            self.best_split = best_result

    def _split_node(self):
        if self.depth < self.max_depth:
            self._best_feature_split()

            if self.best_split is not None and \
               self.best_split['best_change'] >= self.min_information_gain:

                if self.best_split['is_numeric']:
                    mask = self.X[:, self.best_split['index']] < self.best_split['split_value']
                else:
                    mask = self.X[:, self.best_split['index']] == self.best_split['split_value']

                self.is_leaf = False

                self.left = DecisionTreeNode(self.X[mask, :],
                                             self.y[mask],
                                             self.minimize_func,
                                             self.min_information_gain,
                                             self.max_depth,
                                             self.min_leaf_size,
                                             self.depth + 1)
                if self.best_split['is_numeric']:
                    split_description = 'index ' + str(self.best_split['index']) + " < " + \
                        str(self.best_split['split_value']) + " ( " + str(self.X[mask, :].shape[0]) + " )"
                else:
                    split_description = 'index ' + str(self.best_split['index']) + " == " + \
                        str(self.best_split['split_value']) + " ( " + str(self.X[mask, :].shape[0]) + " )"
                self.left.split_description = str(split_description)
                self.left._split_node()

                self.right = DecisionTreeNode(self.X[np.logical_not(mask), :],
                                              self.y[np.logical_not(mask)],
                                              self.minimize_func,
                                              self.min_information_gain,
                                              self.max_depth,
                                              self.min_leaf_size,
                                              self.depth + 1)
                if self.best_split['is_numeric']:
                    split_description = 'index ' + str(self.best_split['index']) + " >= " + \
                        str(self.best_split['split_value']) + " ( " + str(self.X[np.logical_not(mask), :].shape[0]) + " )"
                else:
                    split_description = 'index ' + str(self.best_split['index']) + " != " + \
                        str(self.best_split['split_value']) + " ( " + str(self.X[np.logical_not(mask), :].shape[0]) + " )"
                self.right.split_description = str(split_description)
                self.right._split_node()

        if self.is_leaf:
            if self.minimize_func == variance:
                self.split_description = self.split_description + " : predict - " + str(np.mean(self.y))
            else:
                values, counts = np.unique(self.y, return_counts=True)
                predict = values[np.argmax(counts)]
                self.split_description = self.split_description + " : predict - " + str(predict)

    def _predict_row(self, row):
        predict_value = None
        if self.is_leaf:
            if self.minimize_func == variance:
                predict_value = np.mean(self.y)
            else:
                values, counts = np.unique(self.y, return_counts=True)
                predict_value = values[np.argmax(counts)]
        else:
            if self.best_split['is_numeric']:
                left = row[self.best_split['index']] < self.best_split['split_value']
            else:
                left = row[self.best_split['index']] == self.best_split['split_value']

            if left:
                predict_value = self.left._predict_row(row)
            else:
                predict_value = self.right._predict_row(row)

        return predict_value

    def predict(self, X):
        return np.apply_along_axis(lambda x: self._predict_row(x), 1, X)

    def _rep(self, level):
        response = "|->" + self.split_description

        if self.left is not None:
            response += "\n" + (2 * level + 1) * " " + self.left._rep(level + 1)
        if self.right is not None:
            response += "\n" + (2 * level + 1) * " " + self.right._rep(level + 1)

        return response

    def __repr__(self):
        return self._rep(0)



Proof of Concept on Iris

dtn = DecisionTreeNode(irisX,irisY,gini_impurity)
dtn._split_node()
dtn


Proof of Concept on Boston Housing

dtn = DecisionTreeNode(bostonX,bostonY,variance,max_depth=5)
dtn._split_node()
dtn


DecisionTree

Now that I have implemented the DecisionTreeNode, I can implement the DecisionTree class.  Here is the template code.

class DecisionTree(object):

    def __init__(self,
                 minimize_func,
                 min_information_gain=0.01,
                 max_depth=3,
                 min_leaf_size=20):

        self.root = None
        self.minimize_func = minimize_func
        self.min_information_gain = min_information_gain
        self.max_depth = max_depth
        self.min_leaf_size = min_leaf_size

    def fit(self, X, y):
        pass

    def predict(self, X):
        pass


Fit

The fit function takes the features and targets. I am assuming numpy arrays will be provided; you can generalize this to other data types as you see fit. It simply creates a new DecisionTreeNode from the data, then splits the node, which recursively and greedily builds the tree.

def fit(self, X, y):
    self.root = DecisionTreeNode(X,
                                 y,
                                 self.minimize_func,
                                 self.min_information_gain,
                                 self.max_depth,
                                 self.min_leaf_size,
                                 0)
    self.root._split_node()


Predict

The predict function is just a wrapper to call the DecisionTreeNode predict function. Right now there are no probabilities. You can add that feature yourself!

def predict(self, X):
    return self.root.predict(X)


Fully Implemented Decision Tree

class DecisionTree(object):

    def __init__(self,
                 minimize_func,
                 min_information_gain=0.001,
                 max_depth=3,
                 min_leaf_size=20):

        self.root = None
        self.minimize_func = minimize_func
        self.min_information_gain = min_information_gain
        self.max_depth = max_depth
        self.min_leaf_size = min_leaf_size

    def fit(self, X, y):
        self.root = DecisionTreeNode(X,
                                     y,
                                     self.minimize_func,
                                     self.min_information_gain,
                                     self.max_depth,
                                     self.min_leaf_size,
                                     0)
        self.root._split_node()

    def predict(self, X):
        return self.root.predict(X)

    def __repr__(self):
        return self.root._rep(0)


Comparison to sklearn

Now that we have a decision tree, we can compare the results of the two CART implementations. To do this we need a print function for the sklearn tree that is similar to our own. A simple script is below.

from sklearn.tree import DecisionTreeClassifier as SklearnDTC
from sklearn.tree import DecisionTreeRegressor as SklearnDTR
from sklearn.tree import _tree

def print_sklearn_tree(tree, index=0, level=0):
    response = ""
    if level == 0:
        response += "root\n"

    if tree.feature[index] == -2:  # -2 marks a leaf node
        response += ": predict " + str(np.argmax(tree.value[index, 0, :])) + \
                    " ( " + str(np.sum(tree.value[index, 0, :])) + " )"
    else:
        response += "\n" + (2 * level + 1) * " " + "|-> index " + \
                    str(tree.feature[index]) + " < " + str(tree.threshold[index])
        response += (2 * (level + 1) + 1) * " " + \
                    print_sklearn_tree(tree, tree.children_left[index], level + 1)
        response += "\n" + (2 * level + 1) * " " + "|-> index " + \
                    str(tree.feature[index]) + " >= " + str(tree.threshold[index])
        response += (2 * (level + 1) + 1) * " " + \
                    print_sklearn_tree(tree, tree.children_right[index], level + 1)

    return response

dt_sklearn = SklearnDTC(max_depth=3, min_samples_leaf=20, criterion="gini")
dt_sklearn.fit(irisX, irisY)

dt_bts = DecisionTree(gini_impurity, min_leaf_size=20, max_depth=3)
dt_bts.fit(irisX, irisY)

print(dt_bts)
print("\n" + 50 * "-" + "\n")
print(print_sklearn_tree(dt_sklearn.tree_))


The results here are close to identical, but it is worth noting that sklearn will jump between this result and splitting on a petal width of 0.8, because the two splits have the same information gain.

dt_sklearn = SklearnDTC(max_depth=5, min_samples_leaf=5, criterion="gini")
dt_sklearn.fit(irisX, irisY)

dt_bts = DecisionTree(gini_impurity, min_leaf_size=5, max_depth=5)
dt_bts.fit(irisX, irisY)

print(dt_bts)
print("\n" + 50 * "-" + "\n")
print(print_sklearn_tree(dt_sklearn.tree_))


Again, my implementation and sklearn give close to identical results. Next, I will try regression on the Boston dataset.

dt_sklearn = SklearnDTR(max_depth=3, min_samples_leaf=20)
dt_sklearn.fit(bostonX, bostonY)

dt_bts = DecisionTree(variance, min_leaf_size=20, max_depth=3)
dt_bts.fit(bostonX, bostonY)

print(dt_bts)
print("\n" + 50 * "-" + "\n")
print(print_sklearn_tree(dt_sklearn.tree_))


Conclusion

In this post, I covered the theory and code of implementing the CART algorithm for producing decision trees and compared the results to sklearn on a few toy datasets. This code is far from production-ready, but it shows a high-level implementation of the algorithm. You can leave any questions, comments, corrections, or requests in the comments below.

If you are interested in learning more, I can recommend a few books below. The links are affiliate links. I provide links to free versions when available.

This book has a good overview of Decision Trees in chapter 3 and just good/general advice about using data science to get practical results.   I gave it a good once over, and continue to reference it from time to time.

Data Science From Scratch

This book has a good intro to data science activities with python and implements a number of algorithms from scratch. Chapter 18 has an implementation of ID3 Decision Tree Algorithm. It also shows how to implement Logistic Regression, Naive Bayes, & Neural Networks.

Python Machine Learning

This is a really good book that covers doing an enormous amount of data science tasks in python. If you are new to python or data science, I highly recommend it.

Decision Trees

You can find the iPython Notebook for this post at https://github.com/bryantravissmith/FromScratch/tree/master/SupervisedLearning/DecisionTrees

Decision Trees are non-parametric supervised machine learning algorithms that take a set of training data and construct a set of regions in the feature space, which is then used to make predictions. These predictions can be continuous values (Decision Tree Regression) or classifications (Decision Tree Classification).

Let’s imagine we have a data set with a lot of features and how much student debt individuals have. We could just take the average of the entire dataset, say $25,000, and use that as a prediction. A better prediction could be developed by dividing the dataset up using the level of college completed: None, Some, Undergraduate, and Graduate. We can then take the average of each subset/region:

1. None: $0
2. Some: $2,000
3. Undergraduate: $16,000
4. Graduate: $42,000

Now when we score a person, we use their level of college completed to make the prediction. The idea is that these results improve on the $25,000 average estimate for the entire dataset.
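To make the region-based prediction concrete, here is a small Python sketch; the per-person debt values are invented so that each region's average matches the numbers above:

```python
# Illustrative region-based prediction: training data grouped into regions
# by education level; the prediction for a region is its average.
debt_by_region = {
    "None": [0, 0, 0],
    "Some": [1000, 2000, 3000],
    "Undergraduate": [12000, 16000, 20000],
    "Graduate": [30000, 42000, 54000],
}

def predict_debt(level):
    region = debt_by_region[level]
    return sum(region) / len(region)

print(predict_debt("Some"))      # 2000.0
print(predict_debt("Graduate"))  # 42000.0
```

A decision tree does exactly this kind of lookup, except it learns the regions from the data instead of having them handed to it.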

Why Implement It Ourselves?

The reason I want to implement machine learning algorithms from scratch is to build better intuition for the assumptions and trade-offs in the implementations of standard libraries. For instance, before I did this for the first time I did not appreciate that both sklearn (Python) and rpart (R) implement the CART algorithm for decision trees, but that there are equally valid decision tree algorithms that would produce different results. This implementation is inherently greedy. The issue with greedy algorithms is that making locally best splits or regions may not produce the globally best set of regions. This is what makes finding the best Decision Tree model NP-hard. If you really get into it, finding the best tree becomes an optimal stopping problem and brushes into Reinforcement Learning and Artificial Intelligence style searching.

On a more practical side, there have been a number of times in the iterative approach of model-building where understanding the model assumptions has allowed me to exclude features that violate those assumptions, or to create new features that are more easily expressed and represented by the model. Both of these have consistently led to better out-of-sample performance.

Finally, the best reason is that I get pleasure and satisfaction from doing this. I think algorithms are a powerful framework and a fun way to think about problem solving, and doing things from scratch helps improve my execution and understanding of them.

Metrics – Measures for Improvement

I made up the student debt example. It makes sense in my mind, but we need to define an algorithm that can create the regions from any dataset. Most machine learning algorithms are optimization algorithms that improve some measure/metric on each iteration. Decision trees are no different in that respect. To define the algorithm, we need to define the measures.

Classification Metrics for Decision Trees

Gini Impurity

Gini Impurity is a very intuitive metric to try to improve because it is a measure of the misclassification rate. If the fraction of the data set that has outcome class $c$ is $f_c$, we can predict class $c$ for each data point $f_c$ of the time. The error rate for a class is then the probability of predicting the class times the probability of that prediction being wrong.

$Error_c = f_c * \left( 1 - f_c \right)$

The Gini Impurity is the sum of these error rates for all classes.

$Gini \ Impurity = \sum_c Error_c = \sum_c f_c * \left( 1 - f_c \right)$

$Gini \ Impurity = \sum_c f_c - \sum _c f^2_c$

$Gini \ Impurity = 1 - \sum _c f^2_c$

If we divide the data set in regions that improve the weighted sum of the Gini Impurity, the misclassification rate will be reduced.
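The formula reduces to a few lines of Python (a sketch; the function name is mine):

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class fractions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity(["H", "T", "H", "T"]))  # 0.5 -- a fair coin
print(gini_impurity(["H", "H", "H", "H"]))  # 0.0 -- a pure node
```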

Gini Impurity for a Coin Flip

A coin flip is about as simple a process as there is for calculating the Gini Impurity. We have a coin with probability of heads $p_H$. The Gini Impurity is then

$Gini \ Impurity = 1 - p_H^2 - \left(1-p_H\right)^2 = 2*p_H*\left(1-p_H\right)$

A plot of this is below for different values of $p_H$.

p <- seq(0.01,0.99,0.01)
plot(p,
     1-p^2-(1-p)^2,
     type='l',
     ylim=c(0,1),
     ylab = "Gini Impurity",
     col="blue")


We see that if $p_H = 0$ or $p_H = 1$ there is no misclassification. If $p_H = 0.5$, we see that we get the coin flip wrong 50% of the time. This should match our intuition.

Entropy

Another, and less intuitive, measure of prediction accuracy is entropy. It is defined as the expected information content of a class. If $f_c$ is the fraction of the class in a dataset, and $I(c)$ is the information content of the class, then entropy is defined as follows:

$Entropy = \sum_c f_c * I(c)$

For reasons I will not get into in this post, $I(c) = - log_2\left(f_c\right)$, so the entropy of a dataset follows:

$Entropy = - \sum_c f_c * log_2\left(f_c\right)$

The base of the log can be anything, really, but since we will be using binary trees throughout this series, I will keep base 2.
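The entropy formula is also only a few lines in Python (a sketch; the function name is mine):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits: -sum of f_c * log2(f_c)."""
    n = len(labels)
    fractions = (count / n for count in Counter(labels).values())
    return -sum(f * math.log2(f) for f in fractions)

print(entropy(["H", "T"]))  # 1.0 -- a fair coin carries one full bit
```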

Entropy for a Coin Flip

Continuing with the coin flip example, I have produced the entropy of a coin flip using different values of $p_H$.

p <- seq(0.01,0.99,0.01)
plot(p,
     -p*log2(p)-(1-p)*log2(1-p),
     type='l',
     ylim=c(0,1),
     ylab = "Entropy",
     col="blue")


The expected information content at $p_H = 0$ and $p_H = 1$ is zero, which means that we always know what the next flip will be. The entropy is maximal at $p_H = 0.5$, which is where we have the highest misclassification rate.

Regression Metrics for Decision Trees

The metric used for decision tree regression is the reduction of variance around the prediction.

$Variance = \frac{1}{N} \sum_{i}^{N} ( y_i - \bar{y} )^2$

Simple Linear Example

We can take a simple linear example and split it into two regions. The solid black line is the average of the entire dataset, while the red lines are the predictions in each region.

The total variance before the split was 40.00, while the weighted variance after the split is $\frac{6}{11} 11.67 + \frac{5}{11} 8.00 = 10.0$.
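The bookkeeping for this kind of split can be sketched in Python; the toy data below is mine, not the dataset from the plot:

```python
def variance(values):
    """Population variance around the mean prediction."""
    n = len(values)
    if n <= 1:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def weighted_variance(left, right):
    """Size-weighted variance of the two regions after a split."""
    n = len(left) + len(right)
    return len(left) / n * variance(left) + len(right) / n * variance(right)

y = [0, 0, 2, 2]
print(variance(y))                      # 1.0 before the split
print(weighted_variance(y[:2], y[2:]))  # 0.0 after splitting into [0,0] and [2,2]
```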

Improving Metrics – Splitting on Information Gain

Now that we have metrics to evaluate the predictions from a data set, we can define the improvement metric to optimize. I will be using the term information gain, which is the correct term when talking about entropy, for all of the optimizations. I cannot say conclusively that this is the name for the measure in decision tree regression. Being wrong hasn’t stopped me before!

Decision Tree Classifier

$Information \ Gain = Gini \ Impurity(d) - \sum_{split \ s} \frac{|s|}{|d|} Gini \ Impurity(s)$

$Information \ Gain = Entropy(d) - \sum_{split \ s} \frac{|s|}{|d|} Entropy(s)$

Decision Tree Regressor

$Information \ Gain = Mean \ Variance(d) - \sum_{split \ s} \frac{|s|}{|d|} Mean \ Variance(s)$
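In each case the gain is the parent's metric minus the size-weighted metric of the splits. A Python sketch, using Gini impurity as the metric (function names are mine):

```python
from collections import Counter

def gini_impurity(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(labels, mask, metric=gini_impurity):
    """Parent metric minus the size-weighted metric of the two partitions."""
    n = len(labels)
    left = [y for y, m in zip(labels, mask) if m]
    right = [y for y, m in zip(labels, mask) if not m]
    if not left or not right:  # a split that separates nothing gains nothing
        return 0.0
    return (metric(labels)
            - len(left) / n * metric(left)
            - len(right) / n * metric(right))

labels = ["a", "a", "b", "b"]
print(information_gain(labels, [True, True, False, False]))  # 0.5 -- a perfect split
print(information_gain(labels, [True, False, True, False]))  # 0.0 -- no improvement
```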

Parameters of Decision Trees

Decision Trees are prone to overfitting because the metrics can be driven to the minimum by making the tree deep enough to capture each data point individually. The variance of a single point is zero. The misclassification rate of a single point is also zero. Traditionally this overfitting is managed by pruning the tree after it is constructed, or by setting parameters that limit how finely the data can be partitioned into regions.

The post-pruning of a decision tree uses some sort of statistical evaluation on a validation set. If the error on the sub-nodes is larger than on the node itself, the sub-nodes are removed.

The parameters commonly used in CART to limit the partitioning of the data are:

1. Maximum Depth of the Tree
2. Minimum Number of Data Points for a New Node
3. Minimum Improvement needed for a New Node
4. Minimum Metric Value under which New Nodes are not created.

Pseudo Code

Now that we have a metric and a measure of improvement, we can outline the pseudo code we will be implementing, based on the CART algorithm for decision trees. The Classification and Regression Trees (CART) algorithm recursively and greedily splits nodes if there is an improvement in the metric of choice, until some stopping conditions are met.

1. Set the Training Parameters
2. Create a Root Node
3. Determine the optimal split of the node: the split that minimizes the metric of concern.
4. If the stopping parameters are not violated, create two child nodes and apply steps 3 and 4 to each child node.
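Under the stated assumptions (numeric features only, Gini impurity, majority-vote leaves), the four steps above can be sketched in Python; this is a toy illustration with invented names (`best_split`, `build_tree`), not the implementation developed below:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Step 3: brute-force the feature/threshold pair with the largest gain."""
    parent, n = gini(labels), len(labels)
    best = None  # (gain, feature_index, threshold)
    for f in range(len(rows[0])):
        for threshold in set(row[f] for row in rows):
            left = [y for row, y in zip(rows, labels) if row[f] < threshold]
            right = [y for row, y in zip(rows, labels) if row[f] >= threshold]
            if not left or not right:
                continue
            gain = parent - len(left) / n * gini(left) - len(right) / n * gini(right)
            if best is None or gain > best[0]:
                best = (gain, f, threshold)
    return best

def build_tree(rows, labels, depth=0, max_depth=3, min_gain=1e-2):
    """Steps 2-4: split recursively and greedily while the stopping rules allow."""
    split = best_split(rows, labels)
    if depth >= max_depth or split is None or split[0] < min_gain:
        return {"predict": Counter(labels).most_common(1)[0][0]}  # leaf: majority class
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] < t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] >= t]
    return {"feature": f, "threshold": t,
            "left": build_tree([r for r, _ in left], [y for _, y in left],
                               depth + 1, max_depth, min_gain),
            "right": build_tree([r for r, _ in right], [y for _, y in right],
                                depth + 1, max_depth, min_gain)}

tree = build_tree([[0], [1], [2], [3]], ["a", "a", "b", "b"])
print(tree["feature"], tree["threshold"])  # 0 2
```

The greediness is visible here: each call to `best_split` commits to the locally best split without ever reconsidering it.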

Code

In this code section, I am going to implement a Decision Tree algorithm in the following parts:

1. Introduce the Data
2. Script Finding the Best First Partition
3. Create a Data Structure and Implement Decision Trees
4. Comparing with the rpart package.

Assumptions

The implementation will avoid some known issues with decision trees. I am going to assume that there are never missing values. I will assume that all categorical variables are logical, character, or factor variables, and that everything else is numeric. I am also going to assume that all the data is in a data.frame format.

Data

I will be using the Iris dataset for classification and the ToothGrowth dataset for regression. Both are available in the R datasets package.

library(datasets)
data(iris)
data(ToothGrowth)


Iris

I will be using the Species as the classification target in this post. Here are the top 6 rows of the dataset:

Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa

ToothGrowth

I will be using the ‘len’ variable as the regression target in this post. Here are the top 6 rows of the ToothGrowth dataset:

len supp dose
4.2 VC 0.5
11.5 VC 0.5
7.3 VC 0.5
5.8 VC 0.5
6.4 VC 0.5
10.0 VC 0.5

Scripting the First Partition

I am going to implement a brute force search for the best partition of the data set. This will be done by going through each value of each feature and finding the feature-value pair that best partitions the data. To do this we need to implement the metrics discussed in the theory section above:

entropy: classification
Gini impurity: classification
variance: regression

entropy <- function(y){
  # assumes y is a factor with all levels
  if(length(y)==0) return(0)
  p <- table(y)/length(y)
  sum(-p*log2(p+1e-9))
}

gini_impurity <- function(y){
  # assumes y is a factor with all levels
  if(length(y) == 0) return(0)
  p <- table(y)/length(y)
  1-sum(p^2)
}

variance <- function(y){
  if(length(y) <= 1) return(0)
  var(y)
}


I can quickly calculate these metrics for our dataset:

entropy(iris[,'Species'])
gini_impurity(iris[,'Species'])
variance(ToothGrowth[,'len'])


1.58496249639307
0.666666666666667
58.5120225988701

Information Gain

Now that I have metric functions, I want to calculate the improvement for different partitions. Each partition is going to be calculated using a mask, which is a logical vector that states whether a row is in the first partition (TRUE) or in the second partition (FALSE). I will call the function information gain, even though it is really only information gain when entropy is the metric.

information_gain <- function(y,mask,func=entropy){
  s1 <- sum(mask)
  s2 <- length(mask) - s1
  if ( s1 == 0 | s2 == 0) return(0)
  # parent metric minus the size-weighted metric of the two partitions
  func(y) - s1/(s1+s2)*func(y[mask]) - s2/(s1+s2)*func(y[!mask])
}

information_gain(iris[,'Species'],iris[,'Petal.Width'] < 1.5)
information_gain(iris[,'Species'],iris[,'Petal.Width'] < 1.5,gini_impurity)
information_gain(ToothGrowth[,'len'],ToothGrowth[,'dose'] <= 0.5,variance)


0.643516416593624
0.229045542635659
33.8790109029636

Max Information Gain For A Single Feature

Now that I have a function to calculate the information gain for a given criterion (mask or split), I will search for the best split among all the values in a given feature. This will give me the best split for that feature:

max_information_gain_split <- function(y,x,func=gini_impurity){
  best_change = NA
  split_value = NA
  is_numeric = !(is.factor(x)|is.logical(x)|is.character(x))
  for( val in sort(unique(x))){
    # numeric features split on a threshold, categorical ones on equality
    if (is_numeric) mask <- x < val
    else mask <- x == val
    change <- information_gain(y,mask,func)
    if(is.na(best_change) | change > best_change){
      best_change = change
      split_value = val
    }
  }
  return(list("best_change"=best_change,
              "split_value"=split_value,
              "is_numeric"=is_numeric))
}

print(unlist(max_information_gain_split(iris$Species,iris$Petal.Width)))
print(unlist(max_information_gain_split(iris$Species,iris$Petal.Width,entropy)))
print(unlist(max_information_gain_split(ToothGrowth$len,ToothGrowth$dose,variance)))


best_change split_value is_numeric
0.3333333 1.0000000 1.0000000
best_change split_value is_numeric
0.9182958 1.0000000 1.0000000
best_change split_value is_numeric
33.87901 1.00000 1.00000

All the best split values just happen to be 1 in this case.

Max Information Gain Among All Features

Now I can apply ‘max_information_gain_split’ to all of the feature columns and find the best split. I will do this by applying the ‘sapply’ function on the features like so:

sapply(iris[,1:4],function(x) max_information_gain_split(iris[,5],x))

Sepal.Length Sepal.Width Petal.Length Petal.Width
best_change 0.2277603 0.1269234 0.3333333 0.3333333
split_value 5.5 3.4 3 1
is_numeric TRUE TRUE TRUE TRUE
sapply(ToothGrowth[,2:3],function(x) max_information_gain_split(ToothGrowth[,1],x))

supp dose
best_change 0.01444444 0.02027778
split_value OJ 2
is_numeric FALSE TRUE

After ‘sapplying’ the function, I will return a list with the information of the feature and value with the most information gain.

best_feature_split <- function(X,y){
  results <- sapply(X,function(x) max_information_gain_split(y,x))
  best_name <- names(which.max(results['best_change',]))
  best_result <- results[,best_name]
  best_result[["name"]] <- best_name
  best_result
}

best_feature_split(iris[,1:4],iris[,5])
best_feature_split(ToothGrowth[,2:3],ToothGrowth[,1])

best_change split_value is_numeric name
0.3333333 3 TRUE Petal.Length
best_change split_value is_numeric name
0.0202778 2 TRUE dose

Now that I have the best feature-value combination to partition our data, we can partition the data into two data sets. One I will call ‘left’, and the other I will call ‘right’. This will end up being the data structure used, in that each node can have two branches: left and right.

get_best_mask <- function(X,best_feature_list){
  best_mask <- X[,best_feature_list$name] == best_feature_list$split_value
  if(best_feature_list$is_numeric){
    best_mask <- X[,best_feature_list$name] < best_feature_list$split_value
  }
  return(best_mask)
}


Print Function

This function recursively builds an indented summary of a node and its branches, so a fitted tree can be printed.

print.DecisionTreeNode <- function(node,level=0){
  response <- node$split_description
  if(level < node$max_depth){
    if(!is.null(node$branches$left)){
      response <- paste0(response,"\n",paste(rep(" ",2*(level+1)),collapse=" "),print(node$branches$left,level+1))
    }
    if(!is.null(node$branches$right)){
      response <- paste0(response,"\n",paste(rep(" ",2*(level+1)),collapse=" "),print(node$branches$right,level+1))
    }
  }
  if(level==0) { cat(response) } else { return(response) }
}


Initialize Function

This function gets called when DecisionTreeNode$new() is called. In it, I set a list of default values. I iterate through the parameters provided to the $new(...) function and, if a parameter is not provided, I fill in its default with the do.call command.

initialize = function(...){
  defaults <- list(x = data.frame(),
                   y = c(),
                   depth = 0,
                   minimize_func = gini_impurity,
                   min_information_gain = 1e-2,
                   min_leaf_size = 20,
                   max_depth = 3,
                   is_leaf = T,
                   split_description = "root",
                   best_split = NULL,
                   branches = list("left"=NULL,"right"=NULL))
  params <- list(...)
  fields <- names(getRefClass()$fields())
  for( field in fields){
    if (!(field %in% names(params))) {
      params[[field]] <- defaults[[field]]
    }
  }
  for( param in names(params)){
    do.call("<<-",list(param, params[[param]]))
  }
}


Information Gain

This is the same function as I implemented during the scripting section except that it makes reference to the object variables instead of accepting them as parameters.

information_gain = function(mask){
  s1 <- sum(mask)
  s2 <- length(mask) - s1
  if ( s1 == 0 | s2 == 0) return(0)
  minimize_func(y) - s1/(s1+s2)*minimize_func(y[mask]) - s2/(s1+s2)*minimize_func(y[!mask])
}


Max Information Gain Split

This function finds the value with the most information gain for a given feature. This is similar to the scripted version above, except it also makes reference to the object’s variables.

max_information_gain_split = function(feature){
  best_change = NA
  split_value = NA
  is_numeric = !(is.factor(feature)|is.logical(feature)|is.character(feature))

  previous_val <- 0
  for( val in sort(unique(feature))){
    # numeric features split on a threshold, categorical ones on equality
    if (is_numeric) mask <- feature < val
    else mask <- feature == val
    change <- information_gain(mask)
    if(is.na(best_change) | change > best_change){
      best_change = change
      # for numeric features, split halfway between this value and the previous one
      split_value = ifelse(is_numeric,
                           mean(c(val,previous_val)),
                           val)
    }
    previous_val <- val
  }
  return(list("best_change"=best_change,
              "split_value"=split_value,
              "is_numeric"=is_numeric))
}


Best Feature Split

This function finds the feature-value split that gives the most information gain out of all the features.

best_feature_split = function(){
  results <- sapply(x,function(feature) max_information_gain_split(feature))
  best_name <- names(which.max(results['best_change',]))
  best_result <- results[,best_name]
  best_result[["name"]] <- best_name
  best_split <<- best_result
}


Best Mask

This function finds the mask for the best feature split. It is used to split the node if the split is consistent with the hyper-parameters.

best_mask = function(){
  best_mask <- x[,best_split$name] == best_split$split_value
  if(best_split$is_numeric){
    best_mask <- x[,best_split$name] < best_split$split_value
  }
  return(best_mask)
}


Split Node

This function will split the DecisionTreeNode into left and right branches if doing so is consistent with the hyper-parameters. It also sets each node's description so a summary tree can be printed out.

split_node = function() {
  if(depth < max_depth){
    best_feature_split()
    if(best_split$best_change > min_information_gain){
      mask <- best_mask()
      # only split when both children meet the min_leaf_size requirement
      if(sum(mask) >= min_leaf_size & sum(!mask) >= min_leaf_size){
        is_leaf <<- F

        branches$left <<- .self$copy()
        branches$left$is_leaf <<- T
        branches$left$x <<- branches$left$x[mask,]
        branches$left$y <<- branches$left$y[mask]
        branches$left$split_description <<- ifelse(best_split$is_numeric,
                                                   paste(c(best_split$name,"<",best_split$split_value),collapse=" "),
                                                   paste(c(best_split$name,"=",best_split$split_value),collapse=" "))
        branches$left$depth <<- branches$left$depth+1
        branches$left$split_node()

        branches$right <<- .self$copy()
        branches$right$is_leaf <<- T
        branches$right$x <<- branches$right$x[!mask,]
        branches$right$y <<- branches$right$y[!mask]
        branches$right$split_description <<- ifelse(best_split$is_numeric,
                                                    paste(c(best_split$name,">=",best_split$split_value),collapse=" "),
                                                    paste(c(best_split$name,"!=",best_split$split_value),collapse=" "))
        branches$right$depth <<- branches$right$depth+1
        branches$right$split_node()
      }
    }
  }
  if(is_leaf){
    split_description <<- ifelse(identical(minimize_func,variance),
                                 paste(c(split_description,":","predict - ",mean(y)),collapse=" "),
                                 paste(c(split_description,":","predict - ",names(which.max(table(y)))),collapse=" "))
  }
}


Predict Row

This function is used to make a prediction for a single row. It will be used by the predict function.

predict_row = function(row){
  if(is_leaf){
    predict_value <- ifelse(identical(minimize_func,variance),
                            mean(y),
                            names(which.max(table(y))))
  } else {
    if(best_split$is_numeric){
      left <- row[best_split$name] < best_split$split_value
    } else {
      left <- row[best_split$name] == best_split$split_value
    }
    if(left){
      predict_value <- branches$left$predict_row(row)
    } else {
      predict_value <- branches$right$predict_row(row)
    }
  }
  return(predict_value)
}


Predict

This function will predict the values from the fitted tree. If the minimizing function is variance, I assume the problem is a regression problem and return numeric values. If it is not, I assume it is a classification problem and return character values.

predict = function(features){
  pred <- character(length=dim(features)[1])
  if(identical(minimize_func,variance)) pred <- numeric(length=dim(features)[1])
  for(i in 1:dim(features)[1]){
    pred[i] <- predict_row(features[i,])
  }
  pred
}


Final DecisionTreeNode Code

DecisionTreeNode <- setRefClass("DecisionTreeNode",
  fields = list(x = "data.frame",
                y = "ANY",
                is_leaf = "logical",
                split_description = "character",
                best_split = "list",
                branches = "list",
                depth = "numeric",
                minimize_func = "function",
                min_information_gain = "numeric",
                min_leaf_size = "numeric",
                max_depth = "numeric"),
  methods = list(
    initialize = function(...){
      defaults <- list(x = data.frame(),
                       y = c(),
                       depth = 0,
                       minimize_func = gini_impurity,
                       min_information_gain = 1e-2,
                       min_leaf_size = 20,
                       max_depth = 3,
                       is_leaf = T,
                       split_description = "root",
                       best_split = NULL,
                       branches = list("left"=NULL,"right"=NULL))
      params <- list(...)
      fields <- names(getRefClass()$fields())
      for( field in fields){
        if (!(field %in% names(params))) {
          params[[field]] <- defaults[[field]]
        }
      }
      for( param in names(params)){
        do.call("<<-",list(param, params[[param]]))
      }
    },

    information_gain = function(mask){
      s1 <- sum(mask)
      s2 <- length(mask) - s1
      if ( s1 == 0 | s2 == 0) return(0)
      minimize_func(y) - s1/(s1+s2)*minimize_func(y[mask]) - s2/(s1+s2)*minimize_func(y[!mask])
    },
    max_information_gain_split = function(feature){
      best_change = NA
      split_value = NA
      is_numeric = !(is.factor(feature)|is.logical(feature)|is.character(feature))

      previous_val <- 0
      for( val in sort(unique(feature))){
        # numeric features split on a threshold, categorical ones on equality
        if (is_numeric) mask <- feature < val
        else mask <- feature == val
        change <- information_gain(mask)
        if(is.na(best_change) | change > best_change){
          best_change = change
          split_value = ifelse(is_numeric,
                               mean(c(val,previous_val)),
                               val)
        }
        previous_val <- val
      }
      return(list("best_change"=best_change,
                  "split_value"=split_value,
                  "is_numeric"=is_numeric))
    },
    best_feature_split = function(){
      results <- sapply(x,function(feature) max_information_gain_split(feature))
      best_name <- names(which.max(results['best_change',]))
      best_result <- results[,best_name]
      best_result[["name"]] <- best_name
      best_split <<- best_result
    },
    best_mask = function(){
      best_mask <- x[,best_split$name] == best_split$split_value
      if(best_split$is_numeric){
        best_mask <- x[,best_split$name] < best_split$split_value
      }
      return(best_mask)
    },
    split_node = function() {
      if(depth < max_depth){
        best_feature_split()
        if(best_split$best_change > min_information_gain){
          mask <- best_mask()
          # only split when both children meet the min_leaf_size requirement
          if(sum(mask) >= min_leaf_size & sum(!mask) >= min_leaf_size){
            is_leaf <<- F

            branches$left <<- .self$copy()
            branches$left$is_leaf <<- T
            branches$left$x <<- branches$left$x[mask,]
            branches$left$y <<- branches$left$y[mask]
            branches$left$split_description <<- ifelse(best_split$is_numeric,
                                                       paste(c(best_split$name,"<",best_split$split_value),collapse=" "),
                                                       paste(c(best_split$name,"=",best_split$split_value),collapse=" "))
            branches$left$depth <<- branches$left$depth+1
            branches$left$branches <<- list("left"=NULL,"right"=NULL)
            branches$left$split_node()

            branches$right <<- .self$copy()
            branches$right$is_leaf <<- T
            branches$right$x <<- branches$right$x[!mask,]
            branches$right$y <<- branches$right$y[!mask]
            branches$right$split_description <<- ifelse(best_split$is_numeric,
                                                        paste(c(best_split$name,">=",best_split$split_value),collapse=" "),
                                                        paste(c(best_split$name,"!=",best_split$split_value),collapse=" "))
            branches$right$depth <<- branches$right$depth+1
            branches$right$branches <<- list("left"=NULL,"right"=NULL)
            branches$right$split_node()
          }
        }
      }
      if(is_leaf){
        split_description <<- ifelse(identical(minimize_func,variance),
                                     paste(c(split_description,":","predict - ",mean(y)),collapse=" "),
                                     paste(c(split_description,":","predict - ",names(which.max(table(y)))),collapse=" "))
      }
    },
    predict_row = function(row){
      if(is_leaf){
        predict_value <- ifelse(identical(minimize_func,variance),
                                mean(y),
                                names(which.max(table(y))))
      } else {
        if(best_split$is_numeric){
          left <- row[best_split$name] < best_split$split_value
        } else {
          left <- row[best_split$name] == best_split$split_value
        }
        if(left){
          predict_value <- branches$left$predict_row(row)
        } else {
          predict_value <- branches$right$predict_row(row)
        }
      }
      return(predict_value)
    },
    predict = function(features){
      pred <- character(length=dim(features)[1])
      if(identical(minimize_func,variance)) pred <- numeric(length=dim(features)[1])
      for(i in 1:dim(features)[1]){
        pred[i] <- predict_row(features[i,])
      }
      pred
    }
  ))


Proof of Concept on Iris

dtn <- DecisionTreeNode$new(x=iris[,1:4],
                            y=iris[,5])
dtn$split_node()
print(dtn)


Proof of Concept on ToothGrowth

dtn <- DecisionTreeNode$new(x=ToothGrowth[,2:3],
                            y=ToothGrowth[,1],
                            minimize_func=variance,
                            min_leaf_size=5,
                            max_depth=5,
                            min_information_gain=1e-3)
dtn$split_node()
print(dtn)


DecisionTree

Now that I have implemented the DecisionTreeNode, I can implement the DecisionTree class. Here is the template code.

DecisionTree <- setRefClass("DecisionTree",
  fields = list(minimize_func = "function",
                min_information_gain = "numeric",
                min_leaf_size = "numeric",
                max_depth = "numeric",
                root = "DecisionTreeNode"),
  methods = list(
    initialize = function(...){},
    fit = function(features,target){},
    predict = function(features){}
  ))


Initialize

This is almost identical to the initialization function for DecisionTreeNode. The defaults are only the hyper-parameters, though.

initialize = function(...){
  defaults <- list(minimize_func=gini_impurity,
                   min_information_gain=1e-2,
                   min_leaf_size=20,
                   max_depth=3,
                   root=NULL)
  params <- list(...)
  fields <- names(getRefClass()$fields())
  for( field in fields){
    if (!(field %in% names(params))) {
      params[[field]] <- defaults[[field]]
    }
  }
  for( param in names(params)){
    do.call("<<-",list(param, params[[param]]))
  }
}


Fit

The fit function is going to take the features and the target. I am assuming that a data frame will be provided; you can generalize this for other data types as you see fit. It takes the data, creates a new DecisionTreeNode, and then splits that node, which recursively and greedily builds the tree.

fit = function(features,target){
  root <<- DecisionTreeNode$new(x=features,
                                y=target,
                                minimize_func=minimize_func,
                                min_information_gain=min_information_gain,
                                min_leaf_size=min_leaf_size,
                                max_depth=max_depth)
  root$split_node()
}


Predict

The predict function is just a wrapper that calls the DecisionTreeNode predict function. Right now there are no probabilities. You can add that feature yourself!

predict = function(features){
  root$predict(features)
}


Fully Implemented Decision Tree

DecisionTree <- setRefClass("DecisionTree",
  fields = list(minimize_func = "function",
                min_information_gain = "numeric",
                min_leaf_size = "numeric",
                max_depth = "numeric",
                root = "DecisionTreeNode"),
  methods = list(
    initialize = function(...){
      defaults <- list(minimize_func=gini_impurity,
                       min_information_gain=1e-2,
                       min_leaf_size=20,
                       max_depth=3,
                       root=NULL)
      params <- list(...)
      fields <- names(getRefClass()$fields())
      for( field in fields){
        if (!(field %in% names(params))) {
          params[[field]] <- defaults[[field]]
        }
      }
      for( param in names(params)){
        do.call("<<-",list(param, params[[param]]))
      }
    },
    fit = function(features,target){
      root <<- DecisionTreeNode$new(x=features,
                                    y=target,
                                    minimize_func=minimize_func,
                                    min_information_gain=min_information_gain,
                                    min_leaf_size=min_leaf_size,
                                    max_depth=max_depth)
      root$split_node()
    },
    predict = function(features){
      root$predict(features)
    }
  ))


print.DecisionTree <- function(tree){
  print(tree$root)
}
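As a sketch of the missing probability feature (in Python, with names of my own choosing): a classification leaf can report the class fractions of the training labels that landed in it, instead of only the majority class.

```python
from collections import Counter

def leaf_probabilities(leaf_labels):
    """Class probabilities: fraction of each class among a leaf's training labels."""
    n = len(leaf_labels)
    return {cls: count / n for cls, count in Counter(leaf_labels).items()}

probs = leaf_probabilities(["setosa", "setosa", "setosa", "versicolor"])
print(probs)  # {'setosa': 0.75, 'versicolor': 0.25}
```

The same idea carries over to the R class: store `table(y)/length(y)` at each leaf and return it from a `predict_proba`-style method.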



Comparison to rpart

Now that we have a decision tree, we can compare the results between the two CART implementations:

library(rpart)
dt_rpart <- rpart(Species ~ .,data=iris,maxdepth=3,minbucket=20)
dt_bts <- DecisionTree(max_depth=3,min_leaf_size=20)
dt_bts$fit(iris[,1:4],iris[,5])

print(dt_bts)
cat("\n\n")
print(dt_rpart)


dt_rpart <- rpart(Species ~ .,data=iris,maxdepth=5,minbucket=5)
dt_bts <- DecisionTree(max_depth=5,min_leaf_size=5)
dt_bts$fit(iris[,1:4],iris[,5])

print(dt_bts)
cat(&quot;\n\n&quot;)
print(dt_rpart)


dt_rpart <- rpart(len ~ .,data=ToothGrowth,maxdepth=5,minbucket=5,method="anova")
dt_bts <- DecisionTree(max_depth=5,min_leaf_size=5,minimize_func=variance)
dt_bts$fit(ToothGrowth[,2:3],ToothGrowth[,1])

print(dt_bts)
cat(&quot;\n\n&quot;)
print(dt_rpart)


I now have a Decision Tree class that, provided the data meets my assumptions (no missing values, data in a data frame, and categorical variables as factors), produces results that are similar, if not identical, to those of the rpart package.

Conclusion

In this post, I covered the theory and code of implementing the CART algorithm for producing decision trees and compared the results to rpart on a few toy datasets. This code is far from production ready, but it does show a high-level implementation of the algorithm. You can leave any questions, comments, corrections, or requests in the comments below.

If you are interested in learning more, I can recommend a few books below. The image links are affiliate links. I provide links to free versions when available.

This book has a good overview of Decision Trees in chapter 3 and just good, general advice about using data science to get practical results. I gave it a good once over, and continue to reference it from time to time.

Introduction to Statistical Learning

This is the book I first picked up when I wanted to learn R.  You can google for the accompanying code. It gives a very comprehensive coverage of Statistical Learning and R. I highly recommend it. You can find an older version of this book in a pdf for free: http://www-bcf.usc.edu/~gareth/ISL/

Elements of Statistical Learning

This is another book I picked up when I wanted to learn R. There is also code available. It does not take a practical approach, but I think it is a good, global experience. I am amazed that I learn something new every time I open this book. You can find a free older version of this book as well: http://statweb.stanford.edu/~tibs/ElemStatLearn/

Stanford’s Machine Learning Project Page

http://cs229.stanford.edu/projects2015.html

http://cs229.stanford.edu/projects2014.html

http://cs229.stanford.edu/projects2013.html

Description:  A list of all the student projects and reports from Stanford’s CS 229 Machine Learning course.   The most recent year has the top projects highlighted.  It is probably not worth exhaustively reading each report unless you want to practice being a C.S. T.A.

Data Science iPython Notebooks by Donne Martin

Author:  Donne Martin

Description:  This is one of the best collections of iPython notebooks. I obsessively went through all of them when I decided I was going to leave teaching and pursue a Data Science career. I found a lot of value in these notebooks. Thank You, Donne.

Group Lens Datasets

Collector: GroupLens

Description:  A collection of a number of data sets useful for recommender systems.   The most famous data set is probably the MovieLens dataset, but there are others as well.

Pew Research Data Sets

Collector: Pew Research

Description:  A collection of data sets. You must register to download any of the datasets. Their hope is that they will be used by researchers.

Reality Commons Data

Collector:  http://hd.media.mit.edu/

Description:  Data sets open sourced in the hope that they will inspire insights that help facilitate the use of mobile phones in social science experiments.

U.S. Open Data

Description:  A growing collection of the US Government’s collected data as part of its open data initiatives.   Every time I go back I find new data.