
## Implementing Decision Trees From Scratch Using Python 2.7

# Decision Trees

You can find the IPython Notebook for this post at https://github.com/bryantravissmith/FromScratch/tree/master/SupervisedLearning/DecisionTrees

Decision Trees are a non-parametric supervised machine learning algorithm that takes a set of training data and constructs a set of regions in the space of features that is then used to make predictions. These predictions can be continuous values (Decision Tree Regression), or classification (Decision Tree Classification).

Let’s imagine we have a data set with a lot of features describing individuals, along with how much student debt each individual has. We could just take the average of the entire dataset, say $25,000.00, and use that as a prediction. A better prediction could be developed by dividing the dataset up using the level of college completed: None, Some, Undergraduate, and Graduate. We can then take the average of each subset/region:

- None: $0
- Some: $2000
- Undergraduate: $16000
- Graduate: $42000

Now when we score a person, we use their level of college completed to make the prediction. The idea is that the results are improved over the $25,000 average estimate of the entire dataset.
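This region-based prediction can be sketched in a few lines; the education levels and dollar amounts are the made-up numbers from the example above:

```python
# Region-based prediction for the (made-up) student debt example:
# each education level defines a region, and the prediction is that
# region's average debt.
region_means = {
    "None": 0.0,
    "Some": 2000.0,
    "Undergraduate": 16000.0,
    "Graduate": 42000.0,
}

def predict_debt(education_level, default=25000.0):
    """Predict debt from the region average, falling back to the
    overall dataset average when the level is unknown."""
    return region_means.get(education_level, default)

print(predict_debt("Graduate"))  # 42000.0
```

A decision tree is essentially a learned, nested version of this lookup, where the regions are discovered from the data instead of being chosen by hand.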

## Why Implement It Ourselves?

The reason I want to implement machine learning algorithms from scratch is to build better intuition for the assumptions and trade-offs in the implementations of standard libraries. For instance, before I did this for the first time I did not appreciate that both sklearn (Python) and rpart (R) implement the CART algorithm for decision trees, but there are equally valid decision tree algorithms that would produce different results. This implementation is inherently greedy. The issue with greedy algorithms is that making locally best splits or regions may not produce the globally best set of regions. This is what makes finding the best Decision Tree model NP-hard. If you really get into it, finding the best tree becomes an optimal stopping problem and brushes into Reinforcement Learning and Artificial Intelligence style searching.

On a more practical side, there have been a number of times in the iterative approach of model-building where understanding the model assumptions has allowed me to exclude features that violate those assumptions, or to create new features that are more easily expressed and represented by the model. Both of these have consistently led to better out-of-sample performance.

Finally, the best reason is that I get pleasure and satisfaction from doing this. I think algorithms are a powerful framework and a fun way to think about problem solving, and doing things from scratch helps improve both my execution and my understanding of them.

## Metrics – Measures for Improvement

I made up the student debt example. It makes sense in my mind, but we need to define an algorithm that can create the regions from any dataset. Most machine learning algorithms are optimization algorithms that improve some measure/metric on each iteration. Decision trees are no different in that respect. To define the algorithm, we need to define the measures.

### Classification Metrics for Decision Trees

#### Gini Impurity

Gini Impurity is a very intuitive metric to try to improve because it is a measure of the misclassification rate. If the fraction of the data set that has outcome class c is f_c, we can predict each data point to be class c a fraction f_c of the time. The error rate for a class is the probability of getting the class times the probability of getting it wrong: f_c (1 − f_c).

The Gini Impurity is the sum of these error rates over all classes: G = Σ_c f_c (1 − f_c) = 1 − Σ_c f_c².

If we divide the data set in regions that improve the weighted sum of the Gini Impurity, the misclassification rate will be reduced.
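As a quick worked example of the formula above (this anticipates the `gini_impurity` helper implemented in the Code section below), the impurity can be computed directly from the class fractions:

```python
import numpy as np

# Gini impurity of a small label set with class fractions f = [1/3, 2/3]
y = np.array([0, 0, 1, 1, 1, 1])
_, counts = np.unique(y, return_counts=True)
f = counts / float(len(y))

gini = np.sum(f * (1 - f))   # sum of per-class error rates
same = 1 - np.sum(f ** 2)    # equivalent form

print(gini, same)  # 0.444... for a 1/3-2/3 split
```

Both forms give 4/9 here; a perfectly pure set gives 0, and a 50/50 binary split gives the maximum of 0.5.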

#### Gini Impurity for a Coin Flip

A coin flip is about as simple a process as there is to calculate the Gini Impurity for. We have a coin with probability of heads pH. The Gini Impurity is then G = 1 − pH² − (1 − pH)².

A plot of this is below for different values of pH.

```python
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

p = np.linspace(0.01, 0.99, 100)
plt.figure(figsize=(16, 8))
plt.plot(p, 1 - p**2 - (1 - p)**2)
plt.ylim(0, 1)
plt.show()
```

We see that if pH = 0 or pH = 1 there is no misclassification. If pH = 0.5, we see that we get the coin flip wrong 50% of the time. This should match our intuition.

#### Entropy

Another, less intuitive, measure of prediction accuracy is entropy, which is defined as the expected information content of a class. If f_c is the fraction of class c in a dataset, and I(c) is the information content of the class, then entropy is defined as H = Σ_c f_c I(c).

For reasons I will not get into in this post, I(c) = −log2(f_c), so the entropy of a dataset follows: H = −Σ_c f_c log2(f_c).

The base of the log can be anything, really. But since we will be using binary trees throughout this series, I will keep base 2.
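Two quick checks of the formula, computed straight from the class fractions:

```python
import numpy as np

# Entropy H = -sum_c f_c * log2(f_c) for a fair coin and for
# three equally likely classes.
fair_coin = np.array([0.5, 0.5])
three_way = np.array([1/3.0, 1/3.0, 1/3.0])

h_coin = -np.sum(fair_coin * np.log2(fair_coin))
h_three = -np.sum(three_way * np.log2(three_way))

print(h_coin)   # 1.0 bit
print(h_three)  # log2(3) ~ 1.585 bits
```

A fair coin carries exactly one bit of information per flip, and k equally likely classes carry log2(k) bits, which is the maximum possible entropy.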

#### Entropy for a Coin Flip

Continuing with the coin flip example, I have plotted the entropy of a coin flip for different values of pH.

```python
plt.figure(figsize=(16, 8))
plt.plot(p, -p*np.log2(p) - (1-p)*np.log2(1-p))
plt.ylim(0, 1)
plt.show()
```

The expected information content for pH = 0 and pH = 1 is zero, which means that we always know what the next flip will be. The entropy is maximal for pH = 0.5, which is where we have the highest misclassification rate.

### Regressions Metrics for Decision Trees

The metric used for decision tree regression is a reduction of variance around the prediction.

#### Simple Linear Example

We can take a simple linear example and split it into two regions. The solid black line is the average of the entire dataset, while the red lines are the predictions in each region.

The total variance before the split was 40.00, while the weighted variance after the split is substantially smaller.
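The same computation can be sketched on a small dataset. The numbers below are illustrative, not the exact ones behind the figure, but they show the weighted variance dropping after a split:

```python
import numpy as np

# A hypothetical linear dataset
x = np.arange(20, dtype=float)
y = 2.0 * x

total_var = np.var(y)

# Split at the midpoint of x, then weight each region's variance by its size
mask = x < 10
left, right = y[mask], y[~mask]
weighted_var = (len(left) * np.var(left) + len(right) * np.var(right)) / len(y)

print(total_var, weighted_var)  # the weighted variance is far smaller
```

Each split trades the single global mean for two local means, so any split that separates high values from low values reduces the weighted variance.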

## Improving Metrics – Splitting on Information Gain

Now that we have metrics to evaluate predictions on a data set, we can define the improvement measure to optimize. I will use the term information gain, which is the correct term when talking about entropy, for all the optimizations: information gain = metric(parent) − (n_left/n) · metric(left) − (n_right/n) · metric(right). I cannot say conclusively that this is the name for the measure in decision tree regression. Being wrong hasn’t stopped me before!

### Decision Tree Classifier

### Decision Tree Regressor

## Parameters of Decision Trees

Decision Trees are prone to overfit because the metrics can be driven to the minimum by making the tree deep enough to capture each data point individually. The variance of a single point is zero. The misclassification rate of a single point is also zero. Traditionally this overfitting is managed by pruning the tree after it is constructed, or by setting parameters that limit how finely the data can be partitioned into regions.

The post-pruning of a decision tree uses some sort of statistical evaluation on a validation set. If the error on the sub-nodes is larger than the node, the sub-nodes are removed.
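Post-pruning is not implemented in this series, but the idea can be sketched as follows. The `prune` function and its dict-based tree format are hypothetical illustrations (not the `DecisionTreeNode` class built below), and for simplicity the candidate leaf predicts the validation mean, whereas in practice the leaf value would come from the training data:

```python
import numpy as np

def prune(node, X_val, y_val):
    """Reduced-error pruning sketch for a regression tree.

    `node` is a hypothetical dict: a leaf is {'predict': value}; an
    internal node is {'index': i, 'value': v, 'left': ..., 'right': ...}.
    A split is collapsed into a leaf when the leaf's validation error
    is no worse than the subtree's."""
    if 'predict' in node:
        return node
    mask = X_val[:, node['index']] < node['value']
    node['left'] = prune(node['left'], X_val[mask], y_val[mask])
    node['right'] = prune(node['right'], X_val[~mask], y_val[~mask])
    if y_val.size == 0:
        return node

    def predict_row(n, row):
        while 'predict' not in n:
            n = n['left'] if row[n['index']] < n['value'] else n['right']
        return n['predict']

    # Squared error of the subtree vs. a single leaf on the validation set
    subtree_err = np.mean([(predict_row(node, row) - target) ** 2
                           for row, target in zip(X_val, y_val)])
    leaf_value = np.mean(y_val)
    leaf_err = np.mean((y_val - leaf_value) ** 2)
    if leaf_err <= subtree_err:
        return {'predict': leaf_value}
    return node
```

A split whose children make the same prediction, for example, collapses into a single leaf, while a split that genuinely separates the validation targets is kept.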

The parameters commonly used in CART to limit the partitioning of the data are:

- Maximum Depth of the Tree
- Minimum Number of Data Points for a New Node
- Minimum Improvement needed for a New Node
- Minimum Metric under which New Nodes are not created.

## Pseudo Code

Now that we have a metric and a measure of improvement, we can outline the pseudo code we will be implementing in the next posts, which is based on the CART algorithm for decision trees. The Classification and Regression Trees (CART) algorithm recursively and greedily splits nodes if there is an improvement in the metric of choice, until some stopping conditions are met.

- Set Training Parameters
- Create a Root Node
- Determine the optimal split of the node, i.e. the split that minimizes the metric of concern.
- If the stopping parameters are not violated, create two child nodes and apply steps 3 and 4 to each child node.

# Code

In this code section, I am going to implement a Decision Tree algorithm in the following parts:

- Introduce the Data
- Script Finding the Best First Partition
- Create a Data Structure and Implement Decision Trees
- Comparing with sklearn package.

## Assumptions

The implementation will avoid some common issues with decision trees. I am going to assume that there are never missing values. I will treat string and boolean features as categorical, and assume that any numeric feature is genuinely numeric rather than an encoded category. I am also going to assume that all the data is provided as NumPy arrays.

## Data

I will be using the Iris dataset for classification and the Boston dataset for regression. Both are available in the sklearn package.

```python
from sklearn.datasets import load_iris, load_boston

iris = load_iris()
boston = load_boston()
```

### Boston Housing Price

```python
import pandas as pd

bostonX = boston['data']
bostonY = boston['target']
bostonDF = pd.DataFrame(data=np.hstack((bostonX, bostonY.reshape(bostonY.shape[0], 1))),
                        columns=np.append(boston['feature_names'], 'PRICE'))
bostonDF.head()
```

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |

### Iris

```python
irisX = iris['data']
irisY = iris['target']
irisDF = pd.DataFrame(data=np.hstack((irisX, irisY.reshape(irisY.shape[0], 1))),
                      columns=np.append(iris['feature_names'], "Species"))
irisDF.head()
```

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | Species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0.0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0.0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0.0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0.0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0.0 |

## Scripting the First Partition

I am going to implement a brute-force search for the best partition of the data set: go through each value of each feature and find the feature-value pair that best partitions the data. To do this we need to implement the metrics discussed above:

- Entropy: classification
- Gini Impurity: classification
- Variance: regression

```python
def entropy(y):
    if y.size == 0:
        return 0
    p = np.unique(y, return_counts=True)[1].astype(float) / len(y)
    return -1 * np.sum(p * np.log2(p + 1e-9))

def gini_impurity(y):
    if y.size == 0:
        return 0
    p = np.unique(y, return_counts=True)[1].astype(float) / len(y)
    return 1 - np.sum(p**2)

def variance(y):
    if y.size == 0:
        return 0
    return np.var(y)
```

I can quickly calculate these metrics for our dataset:

```python
print entropy(irisY)
print gini_impurity(irisY)
print variance(bostonY)
```

```
1.58496249639
0.666666666667
84.4195561562
```

## Information Gain

Now that I have metric functions, I want to calculate the improvement for different partitions. Each partition is going to be calculated using a mask, which is a logical vector that states whether a row is in the first partition (True) or the second partition (False). I will call the function information gain, even though it is really only information gain when entropy is the metric.

```python
def information_gain(y, mask, func=entropy):
    s1 = np.sum(mask)
    s2 = mask.size - s1
    if s1 == 0 or s2 == 0:
        return 0
    return func(y) - s1/float(s1+s2) * func(y[mask]) - \
        s2/float(s1+s2) * func(y[np.logical_not(mask)])

print information_gain(irisY, irisX[:,2] < 3.5)
print information_gain(irisY, irisX[:,2] < 3.5, gini_impurity)
print information_gain(bostonY, bostonX[:,5] < 7)
```

```
0.827861517442
0.305906762627
0.456884627461
```

## Max Information Gain For A Single Feature

Now that I have a function to calculate the information gain for a given criterion (mask or split), I will search for the best split among all the values of a given feature. This will give me the best split for that feature:

```python
np.apply_along_axis(lambda x: np.sum(x), 0, irisX)
```

```
array([ 876.5, 458.1, 563.8, 179.8])
```

```python
def max_information_gain_split(y, x, func=gini_impurity):
    best_change = None
    split_value = None
    is_numeric = x.dtype.kind not in ['S', 'b']
    for val in np.unique(np.sort(x)):
        mask = x == val
        if is_numeric:
            mask = x < val
        change = information_gain(y, mask, func)
        if best_change is None:
            best_change = change
            split_value = val
        elif change > best_change:
            best_change = change
            split_value = val
    return {"best_change": best_change,
            "split_value": split_value,
            "is_numeric": is_numeric}

print(max_information_gain_split(irisY, irisX[:,2]))
print(max_information_gain_split(irisY, irisX[:,2], entropy))
print(max_information_gain_split(bostonY, bostonX[:,3], variance))
```

```
{'is_numeric': True, 'split_value': 3.0, 'best_change': 0.33333333333333343}
{'is_numeric': True, 'split_value': 3.0, 'best_change': 0.91829583213089627}
{'is_numeric': True, 'split_value': 1.0, 'best_change': 2.5930420368499885}
```

Note that the best split value for the Boston CHAS feature (index 3) happens to be 1.0, the only split available for a 0/1 indicator variable.

## Best Feature Split

Now I can apply max_information_gain_split across all columns and return the information for the best overall split:

```python
def best_feature_split(X, y, func=gini_impurity):
    best_result = None
    best_index = None
    for index in range(X.shape[1]):
        result = max_information_gain_split(y, X[:,index], func)
        if best_result is None:
            best_result = result
            best_index = index
        elif best_result['best_change'] < result['best_change']:
            best_result = result
            best_index = index
    best_result['index'] = best_index
    return best_result

print best_feature_split(irisX, irisY)
print best_feature_split(irisX, irisY, entropy)
print best_feature_split(bostonX, bostonY, variance)
```

```
{'index': 2, 'is_numeric': True, 'split_value': 3.0, 'best_change': 0.33333333333333343}
{'index': 2, 'is_numeric': True, 'split_value': 3.0, 'best_change': 0.91829583213089627}
{'index': 5, 'is_numeric': True, 'split_value': 6.9429999999999996, 'best_change': 38.220464479057071}
```

Now that I have the best feature-value combination for partitioning the data, we can split it into two data sets. One I will call ‘left’, and the other I will call ‘right’. This matches the data structure used later: each node can have two branches, left and right.

```python
def get_best_mask(X, best_feature_dict):
    best_mask = None
    if best_feature_dict['is_numeric']:
        best_mask = X[:, best_feature_dict['index']] < best_feature_dict['split_value']
    else:
        best_mask = X[:, best_feature_dict['index']] == best_feature_dict['split_value']
    return best_mask

bfs = best_feature_split(irisX, irisY)
best_mask = get_best_mask(irisX, bfs)
left = irisX[best_mask, :]
right = irisX[np.logical_not(best_mask), :]
```

Now that I have scripted a single split, I will now work on creating a tree-based data structure for the decision tree.

## Data Structure

I am going to make two classes: a DecisionTree class that loosely mimics the sklearn API, and a DecisionTreeNode class that recursively fits the data based on the hyper-parameters we provide. There are many parameters one could use to keep the tree from overfitting; I have chosen max_depth, min_information_gain, and min_leaf_size. Class diagrams are below.

# DecisionTreeNode

This is going to be the template I will use for the DecisionTreeNode class.

```python
class DecisionTreeNode(object):

    def __init__(self, X, y, minimize_func,
                 min_information_gain=0.01,
                 max_depth=3,
                 min_leaf_size=20,
                 depth=0):
        self.X = X
        self.y = y
        self.minimize_func = minimize_func
        self.min_information_gain = min_information_gain
        self.min_leaf_size = min_leaf_size
        self.max_depth = max_depth
        self.depth = depth
        self.best_split = None
        self.left = None
        self.right = None
        self.is_leaf = True
        self.split_description = ""

    def _information_gain(self, mask):
        pass

    def _max_information_gain_split(self, x):
        pass

    def _best_feature_split(self):
        pass

    def _split_node(self):
        pass

    def _predict_row(self, row):
        pass

    def predict(self, X):
        pass

    def __repr__(self):
        pass
```

### Information Gain

This is the same function as I implemented during the scripting section except that it makes reference to the object variables instead of accepting them as parameters.

```python
def _information_gain(self, mask):
    s1 = np.sum(mask)
    s2 = mask.size - s1
    if s1 == 0 or s2 == 0:
        return 0
    return self.minimize_func(self.y) - \
        s1/float(s1+s2) * self.minimize_func(self.y[mask]) - \
        s2/float(s1+s2) * self.minimize_func(self.y[np.logical_not(mask)])
```

### Max Information Gain Split

This function finds the value with the most information gain for a given feature. It is similar to the scripted version above, except that it references the object’s variables and adds an additional constraint: the best split is only updated if the new leaves are at least as big as the minimum leaf size. This finds the optimal split value given the constraint, whereas the scripted solution above finds the optimal value and then checks whether it satisfies the constraint. These two methods can give different results and produce different trees.

```python
def _max_information_gain_split(self, x):
    best_change = None
    split_value = None
    previous_val = None
    is_numeric = x.dtype.kind not in ['S', 'b']
    for val in np.unique(np.sort(x)):
        mask = x == val
        if is_numeric:
            mask = x < val
        change = self._information_gain(mask)
        s1 = np.sum(mask)
        s2 = mask.size - s1
        if best_change is None and s1 >= self.min_leaf_size and s2 >= self.min_leaf_size:
            best_change = change
            split_value = val
        elif change > best_change and s1 >= self.min_leaf_size and s2 >= self.min_leaf_size:
            best_change = change
            split_value = np.mean([val, previous_val])
        previous_val = val
    return {"best_change": best_change,
            "split_value": split_value,
            "is_numeric": is_numeric}
```

### Best Feature Split

This function finds the feature-value split that gives the most information gain out of all the features.

```python
def _best_feature_split(self):
    best_result = None
    best_index = None
    for index in range(self.X.shape[1]):
        result = self._max_information_gain_split(self.X[:, index])
        if result['best_change'] is not None:
            if best_result is None:
                best_result = result
                best_index = index
            elif best_result['best_change'] < result['best_change']:
                best_result = result
                best_index = index
    if best_result is not None:
        best_result['index'] = best_index
        self.best_split = best_result
```

### Split Node

This function will split the DecisionTreeNode into left and right branches if doing so is consistent with the hyper-parameters. It also sets each node’s description so a summary tree can be printed out.

```python
def _split_node(self):
    if self.depth < self.max_depth:
        self._best_feature_split()
        if self.best_split is not None and self.best_split['best_change'] >= self.min_information_gain:
            if self.best_split['is_numeric']:
                mask = self.X[:, self.best_split['index']] < self.best_split['split_value']
            else:
                mask = self.X[:, self.best_split['index']] == self.best_split['split_value']
            if np.sum(mask) >= self.min_leaf_size and (mask.size - np.sum(mask)) >= self.min_leaf_size:
                self.is_leaf = False
                self.left = DecisionTreeNode(self.X[mask, :],
                                             self.y[mask],
                                             self.minimize_func,
                                             self.min_information_gain,
                                             self.max_depth,
                                             self.min_leaf_size,
                                             self.depth + 1)
                if self.best_split['is_numeric']:
                    split_description = 'index ' + str(self.best_split['index']) + " < " + \
                        str(self.best_split['split_value']) + " ( " + str(self.X[mask, :].shape[0]) + " )"
                else:
                    split_description = 'index ' + str(self.best_split['index']) + " == " + \
                        str(self.best_split['split_value']) + " ( " + str(self.X[mask, :].shape[0]) + " )"
                self.left.split_description = str(split_description)
                self.left._split_node()
                self.right = DecisionTreeNode(self.X[np.logical_not(mask), :],
                                              self.y[np.logical_not(mask)],
                                              self.minimize_func,
                                              self.min_information_gain,
                                              self.max_depth,
                                              self.min_leaf_size,
                                              self.depth + 1)
                if self.best_split['is_numeric']:
                    split_description = 'index ' + str(self.best_split['index']) + " >= " + \
                        str(self.best_split['split_value']) + " ( " + str(self.X[np.logical_not(mask), :].shape[0]) + " )"
                else:
                    split_description = 'index ' + str(self.best_split['index']) + " != " + \
                        str(self.best_split['split_value']) + " ( " + str(self.X[np.logical_not(mask), :].shape[0]) + " )"
                self.right.split_description = str(split_description)
                self.right._split_node()
    if self.is_leaf:
        if self.minimize_func == variance:
            self.split_description = self.split_description + " : predict - " + str(np.mean(self.y))
        else:
            values, counts = np.unique(self.y, return_counts=True)
            predict = values[np.argmax(counts)]
            self.split_description = self.split_description + " : predict - " + str(predict)
```

### Predict Row

This function is used to make a prediction for a single row. It will be used by the predict function.

```python
def _predict_row(self, row):
    predict_value = None
    if self.is_leaf:
        if self.minimize_func == variance:
            predict_value = np.mean(self.y)
        else:
            values, counts = np.unique(self.y, return_counts=True)
            predict_value = values[np.argmax(counts)]
    else:
        left = None
        if self.best_split['is_numeric']:
            left = row[self.best_split['index']] < self.best_split['split_value']
        else:
            left = row[self.best_split['index']] == self.best_split['split_value']
        if left:
            predict_value = self.left._predict_row(row)
        else:
            predict_value = self.right._predict_row(row)
    return predict_value
```

### Predict

This function will predict values from the fitted tree. If the minimizing function is variance, I assume the problem is a regression problem and return numeric values. If it is not, I assume it is a classification problem and return class labels.

```python
def predict(self, X):
    return np.apply_along_axis(lambda x: self._predict_row(x), 1, X)
```

## Final DecisionTreeNode Code

```python
class DecisionTreeNode(object):

    def __init__(self, X, y, minimize_func,
                 min_information_gain=0.001,
                 max_depth=3,
                 min_leaf_size=20,
                 depth=0):
        self.X = X
        self.y = y
        self.minimize_func = minimize_func
        self.min_information_gain = min_information_gain
        self.min_leaf_size = min_leaf_size
        self.max_depth = max_depth
        self.depth = depth
        self.best_split = None
        self.left = None
        self.right = None
        self.is_leaf = True
        self.split_description = "root"

    def _information_gain(self, mask):
        s1 = np.sum(mask)
        s2 = mask.size - s1
        if s1 == 0 or s2 == 0:
            return 0
        return self.minimize_func(self.y) - \
            s1/float(s1+s2) * self.minimize_func(self.y[mask]) - \
            s2/float(s1+s2) * self.minimize_func(self.y[np.logical_not(mask)])

    def _max_information_gain_split(self, x):
        best_change = None
        split_value = None
        previous_val = None
        is_numeric = x.dtype.kind not in ['S', 'b']
        for val in np.unique(np.sort(x)):
            mask = x == val
            if is_numeric:
                mask = x < val
            change = self._information_gain(mask)
            s1 = np.sum(mask)
            s2 = mask.size - s1
            if best_change is None and s1 >= self.min_leaf_size and s2 >= self.min_leaf_size:
                best_change = change
                split_value = val
            elif change > best_change and s1 >= self.min_leaf_size and s2 >= self.min_leaf_size:
                best_change = change
                split_value = np.mean([val, previous_val])
            previous_val = val
        return {"best_change": best_change,
                "split_value": split_value,
                "is_numeric": is_numeric}

    def _best_feature_split(self):
        best_result = None
        best_index = None
        for index in range(self.X.shape[1]):
            result = self._max_information_gain_split(self.X[:, index])
            if result['best_change'] is not None:
                if best_result is None:
                    best_result = result
                    best_index = index
                elif best_result['best_change'] < result['best_change']:
                    best_result = result
                    best_index = index
        if best_result is not None:
            best_result['index'] = best_index
            self.best_split = best_result

    def _split_node(self):
        if self.depth < self.max_depth:
            self._best_feature_split()
            if self.best_split is not None and self.best_split['best_change'] >= self.min_information_gain:
                if self.best_split['is_numeric']:
                    mask = self.X[:, self.best_split['index']] < self.best_split['split_value']
                else:
                    mask = self.X[:, self.best_split['index']] == self.best_split['split_value']
                if np.sum(mask) >= self.min_leaf_size and (mask.size - np.sum(mask)) >= self.min_leaf_size:
                    self.is_leaf = False
                    self.left = DecisionTreeNode(self.X[mask, :],
                                                 self.y[mask],
                                                 self.minimize_func,
                                                 self.min_information_gain,
                                                 self.max_depth,
                                                 self.min_leaf_size,
                                                 self.depth + 1)
                    if self.best_split['is_numeric']:
                        split_description = 'index ' + str(self.best_split['index']) + " < " + \
                            str(self.best_split['split_value']) + " ( " + str(self.X[mask, :].shape[0]) + " )"
                    else:
                        split_description = 'index ' + str(self.best_split['index']) + " == " + \
                            str(self.best_split['split_value']) + " ( " + str(self.X[mask, :].shape[0]) + " )"
                    self.left.split_description = str(split_description)
                    self.left._split_node()
                    self.right = DecisionTreeNode(self.X[np.logical_not(mask), :],
                                                  self.y[np.logical_not(mask)],
                                                  self.minimize_func,
                                                  self.min_information_gain,
                                                  self.max_depth,
                                                  self.min_leaf_size,
                                                  self.depth + 1)
                    if self.best_split['is_numeric']:
                        split_description = 'index ' + str(self.best_split['index']) + " >= " + \
                            str(self.best_split['split_value']) + " ( " + str(self.X[np.logical_not(mask), :].shape[0]) + " )"
                    else:
                        split_description = 'index ' + str(self.best_split['index']) + " != " + \
                            str(self.best_split['split_value']) + " ( " + str(self.X[np.logical_not(mask), :].shape[0]) + " )"
                    self.right.split_description = str(split_description)
                    self.right._split_node()
        if self.is_leaf:
            if self.minimize_func == variance:
                self.split_description = self.split_description + " : predict - " + str(np.mean(self.y))
            else:
                values, counts = np.unique(self.y, return_counts=True)
                predict = values[np.argmax(counts)]
                self.split_description = self.split_description + " : predict - " + str(predict)

    def _predict_row(self, row):
        predict_value = None
        if self.is_leaf:
            if self.minimize_func == variance:
                predict_value = np.mean(self.y)
            else:
                values, counts = np.unique(self.y, return_counts=True)
                predict_value = values[np.argmax(counts)]
        else:
            left = None
            if self.best_split['is_numeric']:
                left = row[self.best_split['index']] < self.best_split['split_value']
            else:
                left = row[self.best_split['index']] == self.best_split['split_value']
            if left:
                predict_value = self.left._predict_row(row)
            else:
                predict_value = self.right._predict_row(row)
        return predict_value

    def predict(self, X):
        return np.apply_along_axis(lambda x: self._predict_row(x), 1, X)

    def _rep(self, level):
        response = "|->" + self.split_description
        if self.left is not None:
            response += "\n" + (2*level+1)*" " + self.left._rep(level+1)
        if self.right is not None:
            response += "\n" + (2*level+1)*" " + self.right._rep(level+1)
        return response

    def __repr__(self):
        return self._rep(0)
```

### Proof of Concept on Iris

```python
dtn = DecisionTreeNode(irisX, irisY, gini_impurity)
dtn._split_node()
dtn
```

### Proof of Concept on Boston

```python
dtn = DecisionTreeNode(bostonX, bostonY, variance, max_depth=5)
dtn._split_node()
dtn
```

# DecisionTree

Now that I have implemented the DecisionTreeNode, I can implement the DecisionTree class. Here is the template code.

```python
class DecisionTree(object):

    def __init__(self, minimize_func,
                 min_information_gain=0.01,
                 max_depth=3,
                 min_leaf_size=20):
        self.root = None
        self.minimize_func = minimize_func
        self.min_information_gain = min_information_gain
        self.max_depth = max_depth
        self.min_leaf_size = min_leaf_size

    def fit(self, X, y):
        pass

    def predict(self, X):
        pass
```

### Fit

The fit function takes the features and targets as NumPy arrays; you can generalize this for other data types as you see fit. It simply creates a new DecisionTreeNode from the data and then splits the node, which recursively and greedily builds the tree.

```python
def fit(self, X, y):
    self.root = DecisionTreeNode(X, y,
                                 self.minimize_func,
                                 self.min_information_gain,
                                 self.max_depth,
                                 self.min_leaf_size,
                                 0)
    self.root._split_node()
```

### Predict

The predict function is just a wrapper to call the DecisionTreeNode predict function. Right now there are no probabilities. You can add that feature yourself!

```python
def predict(self, X):
    return self.root.predict(X)
```

## Fully Implemented Decision Tree

```python
class DecisionTree(object):

    def __init__(self, minimize_func,
                 min_information_gain=0.001,
                 max_depth=3,
                 min_leaf_size=20):
        self.root = None
        self.minimize_func = minimize_func
        self.min_information_gain = min_information_gain
        self.max_depth = max_depth
        self.min_leaf_size = min_leaf_size

    def fit(self, X, y):
        self.root = DecisionTreeNode(X, y,
                                     self.minimize_func,
                                     self.min_information_gain,
                                     self.max_depth,
                                     self.min_leaf_size,
                                     0)
        self.root._split_node()

    def predict(self, X):
        return self.root.predict(X)

    def __repr__(self):
        return self.root._rep(0)
```

# Comparison to sklearn

Now that we have a decision tree, we can compare the results between the two CART implementations. To do this, we write a print function for the sklearn decision tree classes that is similar to our own print function. A simple script is below.

```python
from sklearn.tree import DecisionTreeClassifier as SklearnDTC
from sklearn.tree import DecisionTreeRegressor as SklearnDTR
from sklearn.tree import _tree

def print_sklearn_tree(tree, index=0, level=0):
    response = ""
    if level == 0:
        response += "root\n"
    if tree.feature[index] == -2:
        response += ": predict " + str(np.argmax(tree.value[index, 0, :])) + \
            " ( " + str(np.sum(tree.value[index, 0, :])) + " )"
    else:
        response += "\n" + (2*level+1)*" " + "|-> index " + str(tree.feature[index]) + \
            " < " + str(tree.threshold[index])
        response += (2*(level+1)+1)*" " + print_sklearn_tree(tree, tree.children_left[index], level+1)
        response += "\n" + (2*level+1)*" " + "|-> index " + str(tree.feature[index]) + \
            " >= " + str(tree.threshold[index])
        response += (2*(level+1)+1)*" " + print_sklearn_tree(tree, tree.children_right[index], level+1)
    return response

dt_sklearn = SklearnDTC(max_depth=3, min_samples_leaf=20, criterion="gini")
dt_sklearn.fit(irisX, irisY)

dt_bts = DecisionTree(gini_impurity, min_leaf_size=20, max_depth=3)
dt_bts.fit(irisX, irisY)

print dt_bts
print "\n" + 50*"-" + "\n"
print print_sklearn_tree(dt_sklearn.tree_)
```

The values here are close to identical, but it is worth noting that sklearn will jump between this result and splitting on a petal width of 0.8, because the two splits have the same information gain.

```python
dt_sklearn = SklearnDTC(max_depth=5, min_samples_leaf=5, criterion="gini")
dt_sklearn.fit(irisX, irisY)

dt_bts = DecisionTree(gini_impurity, min_leaf_size=5, max_depth=5)
dt_bts.fit(irisX, irisY)

print dt_bts
print "\n" + 50*"-" + "\n"
print print_sklearn_tree(dt_sklearn.tree_)
```

Again, my implementation and sklearn are close to identical in results. I will try regression on the Boston dataset.

```python
dt_sklearn = SklearnDTR(max_depth=3, min_samples_leaf=20)
dt_sklearn.fit(bostonX, bostonY)

dt_bts = DecisionTree(variance, min_leaf_size=20, max_depth=3)
dt_bts.fit(bostonX, bostonY)

print dt_bts
print "\n" + 50*"-" + "\n"
print print_sklearn_tree(dt_sklearn.tree_)
```


# Conclusion

In this post, I covered the theory and code of implementing the CART algorithm for producing decision trees, and compared the results to sklearn on a few toy datasets. This code is far from production ready, but it does show a high-level implementation of the algorithm. You can leave any questions, comments, corrections, or requests in the comments below.

If you are interested in learning more, I can recommend a few books below. The image links are affiliate links. I provide links to free versions when available.

### Data Science For Business

This book has a good overview of Decision Trees in chapter 3 and just good/general advice about using data science to get practical results. I gave it a good once over, and continue to reference it from time to time.

### Data Science From Scratch

This book has a good intro to data science activities with Python and implements a number of algorithms from scratch. Chapter 18 has an implementation of the ID3 Decision Tree algorithm. It also shows how to implement Logistic Regression, Naive Bayes, & Neural Networks.

### Python Machine Learning

This is a really good book that covers doing an enormous amount of data science tasks in python. If you are new to python or data science, I highly recommend it.

## Implementing Decision Trees From Scratch Using R

# Decision Trees

You can find the iPython Notebook for this post at https://github.com/bryantravissmith/FromScratch/tree/master/SupervisedLearning/DecisionTrees

Decision Trees are a non-parametric supervised machine learning algorithm that takes a set of training data and constructs a set of regions in the space of features that is then used to make predictions. These predictions can be continuous values (Decision Tree Regression), or classification (Decision Tree Classification).

Let’s imagine we have a data set with a lot of features and how much student debt individuals have. We could just take the average of the entire dataset, say $25,000.00, and use that as a prediction. A better prediction could be developed by dividing the dataset up using the level of college completed: None, Some, Undergraduate, and Graduate. We can then take the average of each subset/region:

- None: $0
- Some: $2000
- Undergraduate: $16000
- Graduate: $42000

Now when we score a person, we use their level of college completed to make the prediction. The idea is that the results are improved over the $25,000 average estimate of the entire dataset.

## Why Implement It Ourselves?

The reason why I want to implement machine learning algorithms from scratch is that I get a better intuition for the assumptions and the trade-offs in the implementations of standard libraries. For instance, before I did this for the first time I did not appreciate that both sklearn (Python) and rpart (R) have an implementation of the CART algorithm for decision trees, but there are equally valid decision tree algorithms that would produce different results. This implementation is inherently greedy. The issue with greedy algorithms is that making the locally best splits or regions may not make the globally best set of regions. This is what makes finding the best Decision Tree model NP-hard. If you really get into it, finding the best tree becomes an optimal stopping problem and brushes into Reinforcement Learning and Artificial Intelligence style searching.

On a more practical side, there have been a number of times in the iterative approach of model-building where understanding the model assumptions has allowed me to exclude features that violate assumptions, or to create new features that are more easily expressed and represented by the model. Both of these have consistently led to better out-of-sample performance.

Finally, the best reason is that I get pleasure and satisfaction from doing this. I think algorithms are a powerful framework and a fun way to think about problem solving, and doing things from scratch helps improve my execution and understanding of them.

## Metrics – Measures for Improvement

I made up the student debt example. It makes sense in my mind, but we need to define an algorithm that can create the regions from any dataset. Most machine learning algorithms are optimization algorithms that improve some measure/metric on each iteration. Decision trees are no different in that respect. To define the algorithm, we need to define the measures.

### Classification Metrics for Decision Trees

#### Gini Impurity

Gini Impurity is a very intuitive metric to try to improve because it is a measure of the misclassification rate. If the fraction of the data set that has outcome class $c$ is $f_c$, we can predict each data point to be class $c$ a fraction $f_c$ of the time. The error rate for a class is the probability of getting the class times the probability of getting it wrong:

$Error_c = f_c (1 - f_c)$

The Gini Impurity is the sum of these error rates over all classes:

$G = \sum_c f_c (1 - f_c) = 1 - \sum_c f_c^2$

If we divide the data set into regions that improve the weighted sum of the Gini Impurity, the misclassification rate will be reduced.
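To make the definition concrete, here is a minimal sketch of the computation in Python (the language of the first post in this series). The function name matches the R `gini_impurity` defined later, but this particular snippet is my own illustration:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class fractions."""
    n = len(labels)
    if n == 0:
        return 0.0
    # fraction f_c of each class c in the dataset
    fractions = [count / float(n) for count in Counter(labels).values()]
    return 1.0 - sum(f * f for f in fractions)

print(gini_impurity(["a"] * 50 + ["b"] * 50))  # 0.5, a fair two-class mix
print(gini_impurity(["a"] * 100))              # 0.0, a pure node
```

A pure node scores 0, and the score grows as the classes become more evenly mixed.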

#### Gini Impurity for a Coin Flip

A coin flip is about as simple a process as there is to calculate the Gini Impurity for. We have a coin with probability of heads $p$. The Gini Impurity is then

$G = 1 - p^2 - (1-p)^2$

A plot of this is below for different values of $p$.

```r
p <- seq(0.01, 0.99, 0.01)
plot(p, 1 - p^2 - (1-p)^2, type='l', ylim=c(0,1),
     ylab="Gini Impurity", xlab="Probability of Heads", col="blue")
```

We see that if $p = 0$ or $p = 1$ there is no misclassification. If $p = 0.5$, we see that we get the coin flip wrong 50% of the time. This should match our intuition.

#### Entropy

Another, and less intuitive, measure of prediction accuracy is entropy. It is defined as the expected information content of a class. If $f_c$ is the fraction of the class in a dataset, and $I_c$ is the information content of the class, then entropy is defined as follows:

$H = \sum_c f_c I_c$

For reasons I will not get into in this post, $I_c = -\log_2 f_c$, so the entropy of a dataset follows:

$H = -\sum_c f_c \log_2 f_c$

The base 2 of the log can be anything, really. But since we will be using binary trees throughout this series, I will keep the base 2.
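The formula translates directly into code. This Python sketch is my own illustration of the definition above:

```python
import math
from collections import Counter

def entropy(labels):
    """Expected information content: -sum of f_c * log2(f_c)."""
    n = len(labels)
    if n == 0:
        return 0.0
    fractions = [count / float(n) for count in Counter(labels).values()]
    return -sum(f * math.log(f, 2) for f in fractions)

print(entropy(["H"] * 50 + ["T"] * 50))  # 1.0 bit for a fair coin
print(entropy(["H"] * 100))              # 0.0 for a certain outcome
```

A fair coin carries exactly one bit of information per flip; a certain outcome carries none.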

#### Entropy for a Coin Flip

Continuing with the coin flip example, I have produced the entropy of a coin flip for different values of $p$.

```r
p <- seq(0.01, 0.99, 0.01)
plot(p, -p*log2(p) - (1-p)*log2(1-p), type='l', ylim=c(0,1),
     ylab="Entropy", xlab="Probability of Heads", col="blue")
```

The expected information content at $p = 0$ and $p = 1$ is zero, which means that we always know what the next flip will be. The entropy is maximum at $p = 0.5$, which is where we have the highest misclassification rate.

### Regressions Metrics for Decision Trees

The metric used for decision tree regression is a reduction of variance around the prediction.

#### Simple Linear Example

We can take a simple linear example and split it into two regions. The solid black line is the average of the entire dataset, while the red lines are the predictions in each region.

The total variance before the split was 40.00, while the weighted variance after the split is 10.0.
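The arithmetic behind such a split can be sketched as follows. The toy targets here are my own and are not the data behind the plot above:

```python
from statistics import pvariance

# toy 1-D regression targets: two clusters that a single split separates well
y = [1, 2, 3, 4, 11, 12, 13, 14]
left, right = y[:4], y[4:]

total = pvariance(y)  # variance of the whole region
# weighted variance: each child's variance weighted by its share of the points
weighted = (len(left) * pvariance(left) + len(right) * pvariance(right)) / len(y)

print(total)     # 26.25
print(weighted)  # 1.25
```

The drop from the total variance to the weighted variance is the improvement the regression tree is optimizing.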

## Improving Metrics – Splitting on Information Gain

Now that we have metrics to evaluate predictions from a data set, we can define the improvement measure to optimize. I will be using the term information gain, which is the correct term when talking about entropy, for all the optimizations. I cannot say conclusively that it is the name for the measure in decision tree regression. Being wrong hasn’t stopped me before!

### Decision Tree Classifier

For classification, a split is scored by the information gain: the metric evaluated on the parent node minus the weighted metrics of the two child regions,

$IG = H(D) - \frac{N_1}{N} H(D_1) - \frac{N_2}{N} H(D_2)$

where $H$ is either the entropy or the Gini Impurity, $D_1$ and $D_2$ are the two regions, and $N_1 + N_2 = N$.

### Decision Tree Regressor

For regression, the same expression is used with the variance as the metric:

$IG = \mathrm{Var}(D) - \frac{N_1}{N} \mathrm{Var}(D_1) - \frac{N_2}{N} \mathrm{Var}(D_2)$

## Parameters of Decision Trees

Decision Trees are prone to overfitting because the metrics can be driven to their minimum by making the tree deep enough to capture each data point individually. The variance of a single point is zero. The misclassification rate of a single point is also zero. Traditionally this overfitting is managed by pruning the tree after it is constructed, or by setting parameters that limit how finely the data can be partitioned into regions.

The post-pruning of a decision tree uses some sort of statistical evaluation on a validation set. If the error on the sub-nodes is larger than the node, the sub-nodes are removed.

The parameters commonly used in CART to limit the partitioning of the data are:

- Maximum Depth of the Tree
- Minimum Number of Data Points for a New Node
- Minimum Improvement needed for a New Node
- Minimum Metric under which New Nodes are not created.

## Pseudo Code

Now that we have a metric and a measure of improvement, we can outline the pseudo code, based on the CART algorithm, that we will be implementing below. The Classification and Regression Trees (CART) algorithm recursively and greedily splits nodes whenever there is an improvement in the metric of choice, until some stopping conditions are met.

1. Set the training parameters.
2. Create a root node.
3. Determine the optimal split for the node: the split that most improves the metric of concern.
4. If the stopping parameters are not violated, create two child nodes and apply steps 3 and 4 to each of them.
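The steps above can be sketched as one short recursive function. This Python toy (all names are mine, and it assumes 1-D inputs and Gini impurity) is only meant to show the greedy recursion, not to replace the full implementation that follows:

```python
def build_tree(X, y, depth=0, max_depth=3, min_leaf=1):
    """Greedy CART sketch for 1-D classification with Gini impurity."""
    def gini(labels):
        n = float(len(labels))
        return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

    best = None  # (gain, split_value, left_labels, right_labels)
    for val in sorted(set(X)):
        left = [y[i] for i in range(len(X)) if X[i] < val]
        right = [y[i] for i in range(len(X)) if X[i] >= val]
        if len(left) < min_leaf or len(right) < min_leaf:
            continue
        # information gain: parent impurity minus weighted child impurities
        gain = gini(y) - (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best is None or gain > best[0]:
            best = (gain, val, left, right)

    # stop: depth limit, no valid split, or no improvement -> leaf with majority class
    if depth >= max_depth or best is None or best[0] <= 0:
        return max(set(y), key=y.count)
    gain, val, left, right = best
    lx = [x for x in X if x < val]
    rx = [x for x in X if x >= val]
    return {"split": val,
            "left": build_tree(lx, left, depth + 1, max_depth, min_leaf),
            "right": build_tree(rx, right, depth + 1, max_depth, min_leaf)}

tree = build_tree([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"])
print(tree)  # {'split': 10, 'left': 'a', 'right': 'b'}
```

Note how step 4 is just the function calling itself on each child partition; the stopping parameters are the guards at the top of the recursion.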

# Code

In this code section, I am going to implement a Decision Tree algorithm in the following parts:

- Introduce the Data
- Script Finding the Best First Partition
- Create a Data Structure and Implement Decision Trees
- Comparison with the rpart package.

## Assumptions

The implementation will avoid some issues with decision trees. I am going to assume that there are never missing values. I will assume that categorical variables are stored as logical, character, or factor variables, and that numeric variables are truly numeric. I am also going to assume that all the data is in a data.frame format.

## Data

I will be using the Iris dataset for classification and the ToothGrowth dataset for regression. Both are available in the R datasets package.

```r
library(datasets)
data(iris)
data(ToothGrowth)
```

#### Iris

I will be using the Species as the classification target in this post. Here are the top 6 rows of the dataset:

| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 5.4 | 3.9 | 1.7 | 0.4 | setosa |

#### ToothGrowth

I will be using the ‘len’ variable as the regression target in this post. Here are the top 6 rows of the ToothGrowth dataset:

| len | supp | dose |
|---|---|---|
| 4.2 | VC | 0.5 |
| 11.5 | VC | 0.5 |
| 7.3 | VC | 0.5 |
| 5.8 | VC | 0.5 |
| 6.4 | VC | 0.5 |
| 10.0 | VC | 0.5 |

## Scripting the First Partition

I am going to implement a brute-force search for the best partition in the data set. This will be done by going through each value of each feature and finding the feature-value pair that best partitions the data. To do this we need to implement the metrics discussed in the theory section above:

- entropy: classification
- Gini impurity: classification
- variance: regression

```r
entropy <- function(y){
  # assumes y is a factor with all levels
  if(length(y) == 0) return(0)
  p <- table(y)/length(y)
  sum(-p*log2(p+1e-9))
}

gini_impurity <- function(y){
  # assumes y is a factor with all levels
  if(length(y) == 0) return(0)
  p <- table(y)/length(y)
  1 - sum(p^2)
}

variance <- function(y){
  if(length(y) <= 1) return(0)
  var(y)
}
```

I can quickly calculate these metrics for our dataset:

```r
entropy(iris[,'Species'])        # 1.58496249639307
gini_impurity(iris[,'Species'])  # 0.666666666666667
variance(ToothGrowth[,'len'])    # 58.5120225988701
```

## Information Gain

Now that I have metric functions, I want to calculate the improvement for different partitions. Each partition is calculated using a mask: a logical vector that indicates whether a row belongs to the first partition (TRUE) or the second partition (FALSE). I will call the function information gain, even though it is really only information gain when entropy is the metric.

```r
information_gain <- function(y, mask, func=entropy){
  s1 = sum(mask)
  s2 = length(mask) - s1
  if(s1 == 0 | s2 == 0) return(0)
  func(y) - s1/(s1+s2)*func(y[mask]) - s2/(s1+s2)*func(y[!mask])
}

information_gain(iris[,'Species'], iris[,'Petal.Width'] < 1.5)                 # 0.643516416593624
information_gain(iris[,'Species'], iris[,'Petal.Width'] < 1.5, gini_impurity)  # 0.229045542635659
information_gain(ToothGrowth[,'len'], ToothGrowth[,'dose'] <= 0.5, variance)   # 33.8790109029636
```

## Max Information Gain For A Single Feature

Now that I have a function to calculate the information gain for a given criterion, mask, or split, I will search for the best split among all the values in a given feature. This will give me the best split for that feature:

```r
max_information_gain_split <- function(y, x, func=gini_impurity){
  best_change = NA
  split_value = NA
  is_numeric = !(is.factor(x) | is.logical(x) | is.character(x))
  for(val in sort(unique(x))){
    mask <- x == val
    if(is_numeric) mask <- x < val
    change <- information_gain(y, mask, func)
    if(is.na(best_change) | change > best_change){
      best_change = change
      split_value = val
    }
  }
  return(list("best_change"=best_change,
              "split_value"=split_value,
              "is_numeric"=is_numeric))
}

print(unlist(max_information_gain_split(iris$Species, iris$Petal.Width)))
print(unlist(max_information_gain_split(iris$Species, iris$Petal.Width, entropy)))
print(unlist(max_information_gain_split(ToothGrowth$len, ToothGrowth$dose, variance)))
```

```
best_change split_value  is_numeric
  0.3333333   1.0000000   1.0000000
best_change split_value  is_numeric
  0.9182958   1.0000000   1.0000000
best_change split_value  is_numeric
   33.87901     1.00000     1.00000
```

All the best split values just happen to be 1 in this case.

## Max Information Gain Among All Features

Now I can apply ‘max_information_gain_split’ to all of the feature columns and find the best overall split. We will do this by applying the ‘sapply’ function to the features like so:

```r
sapply(iris[,1:4], function(x) max_information_gain_split(iris[,5], x))
```

| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
|---|---|---|---|---|
| best_change | 0.2277603 | 0.1269234 | 0.3333333 | 0.3333333 |
| split_value | 5.5 | 3.4 | 3 | 1 |
| is_numeric | TRUE | TRUE | TRUE | TRUE |

```r
sapply(ToothGrowth[,2:3], function(x) max_information_gain_split(ToothGrowth[,1], x))
```

| | supp | dose |
|---|---|---|
| best_change | 0.01444444 | 0.02027778 |
| split_value | OJ | 2 |
| is_numeric | FALSE | TRUE |

After ‘sapplying’ the function, I return a list with the information for the feature and value with the most information gain.

```r
best_feature_split <- function(X, y){
  results <- sapply(X, function(x) max_information_gain_split(y, x))
  best_name <- names(which.max(results['best_change',]))
  best_result <- results[,best_name]
  best_result[["name"]] <- best_name
  best_result
}

best_feature_split(iris[,1:4], iris[,5])
best_feature_split(ToothGrowth[,2:3], ToothGrowth[,1])
```

| best_change | split_value | is_numeric | name |
|---|---|---|---|
| 0.3333333 | 3 | TRUE | Petal.Length |

| best_change | split_value | is_numeric | name |
|---|---|---|---|
| 0.0202778 | 2 | TRUE | dose |

Now that I have the best feature-value combination to partition our data, we can split the data into two data sets. One I will call ‘left’, and the other I will call ‘right’. This matches the data structure we will end up using, in that each node can have two branches: left and right.

```r
get_best_mask <- function(X, best_feature_list){
  best_mask <- X[,best_feature_list$name] == best_feature_list$split_value
  if(best_feature_list$is_numeric){
    best_mask <- X[,best_feature_list$name] < best_feature_list$split_value
  }
  return(best_mask)
}

# compute the best split for ToothGrowth before building the mask
bfs <- best_feature_split(ToothGrowth[,2:3], ToothGrowth[,1])
best_tooth_mask <- get_best_mask(ToothGrowth[,2:3], bfs)
leftDf = ToothGrowth[best_tooth_mask,]
rightDf = ToothGrowth[!best_tooth_mask,]
```

Now that I have scripted a single split, I will now work on creating a tree-based data structure for the decision tree.

## Data Structure

R is a functional language, but I decided to take a more object-oriented approach to implementing my decision tree. If you want to examine a functional approach, you can examine the code in the rpart package.

I decided this route partially because it is how I think of the problem, but that does not make it the best implementation. A truer reason is that I also wanted to take this post as an opportunity to play around with R’s Reference Classes object implementation. Hadley Wickham has a good Reference Classes overview. I have never used these before this post. Be warned.

I am going to make two classes: a DecisionTree class that loosely mimics the sklearn API, and a DecisionTreeNode class that will recursively fit the data based on the hyper-parameters we provide. There are many parameters one could set to keep the tree from overfitting. I have chosen: max_depth, min_information_gain, & min_leaf_size. Class diagrams are below.

# DecisionTreeNode

This is going to be the template I will use for the DecisionTreeNode class. It is using R’s Reference Class objects.

```r
DecisionTreeNode <- setRefClass("DecisionTreeNode",
  fields = list(
    x = "data.frame",
    y = "ANY",
    is_leaf = "logical",
    split_description = "character",
    best_split = "list",
    branches = "list",
    depth = "numeric",
    minimize_func = "function",
    min_information_gain = "numeric",
    min_leaf_size = "numeric",
    max_depth = "numeric"),
  methods = list(
    initialize = function(...){},
    information_gain = function(mask){},
    max_information_gain_split = function(feature){},
    best_feature_split = function(){},
    split_node = function(){},
    predict_row = function(row){},
    predict = function(features){}
  )
)
```

I want the print of a DecisionTreeNode to give a summary of the entire structure and its decisions. To do this I recursively descend down the tree to get each node’s description, then append it to the response. After all the responses have been returned to the root level, I cat the response so that I can take advantage of the tab/’\t’ and newline/’\n’ syntax.

```r
print.DecisionTreeNode <- function(node, level=0){
  response <- paste("|->", node$split_description)
  if(level < node$max_depth){
    if(!is.null(node$branches$left)){
      response <- paste0(response, "\n",
                         paste(rep(" ", 2*(level+1)), collapse=" "),
                         print(node$branches$left, level+1))
    }
    if(!is.null(node$branches$right)){
      response <- paste0(response, "\n",
                         paste(rep(" ", 2*(level+1)), collapse=" "),
                         print(node$branches$right, level+1))
    }
  }
  if(level == 0){
    cat(response)
  } else {
    return(response)
  }
}
```

### Initialize Function

This function gets called when *DecisionTreeNode$new()* is called. In it, I set a list of default values, iterate through the parameters provided to the *$new(...)* call, and, for any parameter not provided, set its default with the *do.call* command.

```r
initialize = function(...){
  defaults <- list(x = data.frame(),
                   y = c(),
                   depth = 0,
                   minimize_func = gini_impurity,
                   min_information_gain = 1e-2,
                   min_leaf_size = 20,
                   max_depth = 3,
                   is_leaf = T,
                   split_description = "root",
                   best_split = NULL,
                   branches = list("left"=NULL, "right"=NULL))
  params <- list(...)
  fields <- names(getRefClass()$fields())
  for(field in fields){
    if(!(field %in% names(params))){
      params[[field]] <- defaults[[field]]
    }
  }
  for(param in names(params)){
    do.call("<<-", list(param, params[[param]]))
  }
}
```

### Information Gain

This is the same function as I implemented during the scripting section except that it makes reference to the object variables instead of accepting them as parameters.

```r
information_gain = function(mask){
  s1 = sum(mask)
  s2 = length(mask) - s1
  if(s1 == 0 | s2 == 0) return(0)
  minimize_func(y) - s1/(s1+s2)*minimize_func(y[mask]) - s2/(s1+s2)*minimize_func(y[!mask])
}
```

### Max Information Gain Split

This function finds the value with the most information gain for a given feature. It is similar to the scripted version above except that it makes reference to the object’s variables, and numeric splits are now placed halfway between neighboring values.

```r
max_information_gain_split = function(feature){
  best_change = NA
  split_value = NA
  is_numeric = !(is.factor(feature) | is.logical(feature) | is.character(feature))
  previous_val <- 0
  for(val in sort(unique(feature))){
    mask <- feature == val
    if(is_numeric) mask <- feature < val
    change <- information_gain(mask)
    if(is.na(best_change) | change > best_change){
      best_change = change
      # numeric splits go halfway between neighboring values
      split_value = ifelse(is_numeric, mean(c(val, previous_val)), val)
    }
    previous_val <- val
  }
  return(list("best_change"=best_change,
              "split_value"=split_value,
              "is_numeric"=is_numeric))
}
```

### Best Feature Split

This function finds the feature-value split that gives the most information gain out of all the features.

```r
best_feature_split = function(){
  results <- sapply(x, function(feature) max_information_gain_split(feature))
  best_name <- names(which.max(results['best_change',]))
  best_result <- results[,best_name]
  best_result[["name"]] <- best_name
  best_split <<- best_result
}
```

### Best Mask

This function finds the mask for the best feature split. It is used to split the node if doing so is consistent with the hyper-parameters.

```r
best_mask = function(){
  best_mask <- x[,best_split$name] == best_split$split_value
  if(best_split$is_numeric){
    best_mask <- x[,best_split$name] < best_split$split_value
  }
  return(best_mask)
}
```

### Split Node

This function will split the DecisionTreeNode into left and right branches if it is consistent with the hyper-parameters. It is also setting each Node’s description so a summary tree can be printed out.

```r
split_node = function(){
  if(depth < max_depth){
    best_feature_split()
    if(best_split$best_change > min_information_gain){
      mask = best_mask()
      if(sum(mask) > min_leaf_size && length(mask) - sum(mask) > min_leaf_size){
        is_leaf <<- F
        branches$left <<- .self$copy()
        branches$left$is_leaf <<- T
        branches$left$x <<- branches$left$x[mask,]
        branches$left$y <<- branches$left$y[mask]
        branches$left$split_description <<- ifelse(best_split$is_numeric,
            paste(c(best_split$name, "<", best_split$split_value), collapse = " "),
            paste(c(best_split$name, "=", best_split$split_value), collapse = " "))
        branches$left$depth <<- branches$left$depth + 1
        branches$left$split_node()
        branches$right <<- .self$copy()
        branches$right$is_leaf <<- T
        branches$right$x <<- branches$right$x[!mask,]
        branches$right$y <<- branches$right$y[!mask]
        branches$right$split_description <<- ifelse(best_split$is_numeric,
            paste(c(best_split$name, ">=", best_split$split_value), collapse = " "),
            paste(c(best_split$name, "!=", best_split$split_value), collapse = " "))
        branches$right$depth <<- branches$right$depth + 1
        branches$right$split_node()
      }
    }
  }
  if(is_leaf){
    split_description <<- ifelse(identical(minimize_func, variance),
        paste(c(split_description, ":", "predict - ", mean(y)), collapse=" "),
        paste(c(split_description, ":", "predict - ", names(which.max(table(y)))), collapse=" "))
  }
}
```

### Predict Row

This function is used to make a prediction for a single row. It will be used by the predict function.

```r
predict_row = function(row){
  if(is_leaf){
    predict_value <- ifelse(identical(minimize_func, variance),
                            mean(y),
                            names(which.max(table(y))))
  } else {
    if(best_split$is_numeric){
      left = row[best_split$name] < best_split$split_value
    } else {
      left = row[best_split$name] == best_split$split_value
    }
    if(left){
      predict_value = branches$left$predict_row(row)
    } else {
      predict_value = branches$right$predict_row(row)
    }
  }
  return(predict_value)
}
```

### Predict

This function will predict the values from the fitted tree. If the minimizing function is variance, I assume the problem is a regression problem and return numeric values. If it is not, I assume it is a classification problem and return character values.

```r
predict = function(features){
  pred <- character(length=dim(features)[1])
  if(identical(minimize_func, variance)) pred <- numeric(length=dim(features)[1])
  for(i in 1:dim(features)[1]){
    pred[i] = predict_row(features[i,])
  }
  pred
}
```

## Final DecisionTreeNode Code

```r
DecisionTreeNode <- setRefClass("DecisionTreeNode",
  fields = list(
    x = "data.frame",
    y = "ANY",
    is_leaf = "logical",
    split_description = "character",
    best_split = "list",
    branches = "list",
    depth = "numeric",
    minimize_func = "function",
    min_information_gain = "numeric",
    min_leaf_size = "numeric",
    max_depth = "numeric"),
  methods = list(
    initialize = function(...){
      defaults <- list(x = data.frame(),
                       y = c(),
                       depth = 0,
                       minimize_func = gini_impurity,
                       min_information_gain = 1e-2,
                       min_leaf_size = 20,
                       max_depth = 3,
                       is_leaf = T,
                       split_description = "root",
                       best_split = NULL,
                       branches = list("left"=NULL, "right"=NULL))
      params <- list(...)
      fields <- names(getRefClass()$fields())
      for(field in fields){
        if(!(field %in% names(params))){
          params[[field]] <- defaults[[field]]
        }
      }
      for(param in names(params)){
        do.call("<<-", list(param, params[[param]]))
      }
    },
    information_gain = function(mask){
      s1 = sum(mask)
      s2 = length(mask) - s1
      if(s1 == 0 | s2 == 0) return(0)
      minimize_func(y) - s1/(s1+s2)*minimize_func(y[mask]) - s2/(s1+s2)*minimize_func(y[!mask])
    },
    max_information_gain_split = function(feature){
      best_change = NA
      split_value = NA
      is_numeric = !(is.factor(feature) | is.logical(feature) | is.character(feature))
      previous_val <- 0
      for(val in sort(unique(feature))){
        mask <- feature == val
        if(is_numeric) mask <- feature < val
        change <- information_gain(mask)
        if(is.na(best_change) | change > best_change){
          best_change = change
          split_value = ifelse(is_numeric, mean(c(val, previous_val)), val)
        }
        previous_val <- val
      }
      return(list("best_change"=best_change,
                  "split_value"=split_value,
                  "is_numeric"=is_numeric))
    },
    best_feature_split = function(){
      results <- sapply(x, function(feature) max_information_gain_split(feature))
      best_name <- names(which.max(results['best_change',]))
      best_result <- results[,best_name]
      best_result[["name"]] <- best_name
      best_split <<- best_result
    },
    best_mask = function(){
      best_mask <- x[,best_split$name] == best_split$split_value
      if(best_split$is_numeric){
        best_mask <- x[,best_split$name] < best_split$split_value
      }
      return(best_mask)
    },
    split_node = function(){
      if(depth < max_depth){
        best_feature_split()
        if(best_split$best_change > min_information_gain){
          mask = best_mask()
          if(sum(mask) > min_leaf_size && length(mask) - sum(mask) > min_leaf_size){
            is_leaf <<- F
            branches$left <<- .self$copy()
            branches$left$is_leaf <<- T
            branches$left$x <<- branches$left$x[mask,]
            branches$left$y <<- branches$left$y[mask]
            branches$left$split_description <<- ifelse(best_split$is_numeric,
                paste(c(best_split$name, "<", best_split$split_value), collapse = " "),
                paste(c(best_split$name, "=", best_split$split_value), collapse = " "))
            branches$left$depth <<- branches$left$depth + 1
            branches$left$branches <<- list("left"=NULL, "right"=NULL)
            branches$left$split_node()
            branches$right <<- .self$copy()
            branches$right$is_leaf <<- T
            branches$right$x <<- branches$right$x[!mask,]
            branches$right$y <<- branches$right$y[!mask]
            branches$right$split_description <<- ifelse(best_split$is_numeric,
                paste(c(best_split$name, ">=", best_split$split_value), collapse = " "),
                paste(c(best_split$name, "!=", best_split$split_value), collapse = " "))
            branches$right$depth <<- branches$right$depth + 1
            branches$right$branches <<- list("left"=NULL, "right"=NULL)
            branches$right$split_node()
          }
        }
      }
      if(is_leaf){
        split_description <<- ifelse(identical(minimize_func, variance),
            paste(c(split_description, ":", "predict - ", mean(y)), collapse=" "),
            paste(c(split_description, ":", "predict - ", names(which.max(table(y)))), collapse=" "))
      }
    },
    predict_row = function(row){
      if(is_leaf){
        predict_value <- ifelse(identical(minimize_func, variance),
                                mean(y),
                                names(which.max(table(y))))
      } else {
        if(best_split$is_numeric){
          left = row[best_split$name] < best_split$split_value
        } else {
          left = row[best_split$name] == best_split$split_value
        }
        if(left){
          predict_value = branches$left$predict_row(row)
        } else {
          predict_value = branches$right$predict_row(row)
        }
      }
      return(predict_value)
    },
    predict = function(features){
      pred <- character(length=dim(features)[1])
      if(identical(minimize_func, variance)) pred <- numeric(length=dim(features)[1])
      for(i in 1:dim(features)[1]){
        pred[i] = predict_row(features[i,])
      }
      pred
    }
  )
)
```

### Proof of Concept on Iris

```r
dtn <- DecisionTreeNode$new(x=iris[,1:4], y=iris[,5], max_depth=3)
dtn$split_node()
print(dtn)
```

### Proof of Concept on ToothGrowth

```r
dtn <- DecisionTreeNode$new(x=ToothGrowth[,2:3],
                            y=ToothGrowth[,1],
                            minimize_func=variance,
                            min_leaf_size=5,
                            max_depth=5,
                            min_information_gain=1e-3)
dtn$split_node()
print(dtn)
```

# DecisionTree

Now that I have implemented the DecisionTreeNode, I can implement the DecisionTree class. Here is the template code.

```r
DecisionTree <- setRefClass("DecisionTree",
  fields = list(minimize_func = "function",
                min_information_gain = "numeric",
                min_leaf_size = "numeric",
                max_depth = "numeric",
                root = "DecisionTreeNode"),
  methods = list(
    initialize = function(...){},
    fit = function(features, target){},
    predict = function(features){}
  ))
```

### Initialize

This is almost identical to the initialization function for DecisionTreeNode. The defaults are only the hyper-parameters, though.

```r
initialize = function(...){
  defaults <- list(minimize_func = gini_impurity,
                   min_information_gain = 1e-2,
                   min_leaf_size = 20,
                   max_depth = 3,
                   root = NULL)
  params <- list(...)
  fields <- names(getRefClass()$fields())
  for(field in fields){
    if(!(field %in% names(params))){
      params[[field]] <- defaults[[field]]
    }
  }
  for(param in names(params)){
    do.call("<<-", list(param, params[[param]]))
  }
}
```

### Fit

The fit function takes the features and target. I am assuming that a data frame will be provided; you can generalize this for other datatypes as you see fit. It creates a new DecisionTreeNode from the data, then splits the node, which recursively and greedily builds the tree.

```r
fit = function(features, target){
  root <<- DecisionTreeNode$new(x = features,
                                y = target,
                                minimize_func = minimize_func,
                                min_information_gain = min_information_gain,
                                min_leaf_size = min_leaf_size,
                                max_depth = max_depth)
  root$split_node()
}
```

### Predict

The predict function is just a wrapper to call the DecisionTreeNode predict function. Right now there are no probabilities. You can add that feature yourself!

```r
predict = function(features){
  root$predict(features)
}
```

## Fully Implemented Decision Tree

```r
DecisionTree <- setRefClass("DecisionTree",
  fields = list(minimize_func = "function",
                min_information_gain = "numeric",
                min_leaf_size = "numeric",
                max_depth = "numeric",
                root = "DecisionTreeNode"),
  methods = list(
    initialize = function(...){
      defaults <- list(minimize_func = gini_impurity,
                       min_information_gain = 1e-2,
                       min_leaf_size = 20,
                       max_depth = 3,
                       root = NULL)
      params <- list(...)
      fields <- names(getRefClass()$fields())
      for(field in fields){
        if(!(field %in% names(params))){
          params[[field]] <- defaults[[field]]
        }
      }
      for(param in names(params)){
        do.call("<<-", list(param, params[[param]]))
      }
    },
    fit = function(features, target){
      root <<- DecisionTreeNode$new(x = features,
                                    y = target,
                                    minimize_func = minimize_func,
                                    min_information_gain = min_information_gain,
                                    min_leaf_size = min_leaf_size,
                                    max_depth = max_depth)
      root$split_node()
    },
    predict = function(features){
      root$predict(features)
    }
  ))
```

```r
print.DecisionTree <- function(tree){
  print(tree$root)
}
```

# Comparison to rpart

Now that we have a decision tree, we can compare the results between the two CART implementations.

```r
library(rpart)

dt_rpart <- rpart(Species ~ ., data=iris, maxdepth=3, minbucket=20)
dt_bts <- DecisionTree(max_depth=3, min_leaf_size=20)
dt_bts$fit(iris[,1:4], iris[,5])
print(dt_bts)
cat("\n\n")
print(dt_rpart)
```

```r
dt_rpart <- rpart(Species ~ ., data=iris, maxdepth=5, minbucket=5)
dt_bts <- DecisionTree(max_depth=5, min_leaf_size=5)
dt_bts$fit(iris[,1:4], iris[,5])
print(dt_bts)
cat("\n\n")
print(dt_rpart)
```

```r
dt_rpart <- rpart(len ~ ., data=ToothGrowth, maxdepth=5, minbucket=5, method="anova")
dt_bts <- DecisionTree(max_depth=5, min_leaf_size=5, minimize_func=variance)
dt_bts$fit(ToothGrowth[,2:3], ToothGrowth[,1])
print(dt_bts)
cat("\n\n")
print(dt_rpart)
```

I now have a Decision Tree class that, provided the data meets my assumptions (no missing values, a data frame format, and categorical variables stored as factors), produces results that are similar, if not identical, to the rpart package.

# Conclusion

In this post, I covered the theory and code of implementing the CART algorithm for producing decision trees and compared the results to rpart on a few toy datasets. This code is far from production ready, but it does show a high-level implementation of the algorithm. You can leave any questions, comments, corrections, or requests in the comments below.

If you are interested in learning more, I can recommend a few books below. The image links are affiliate links. I provide links to free versions when available.

### Data Science For Business

This book has a good overview of Decision Trees in chapter 3 and just good/general advice about using data science to get practical results. I gave it a good once over, and continue to reference it from time to time.

### Introduction to Statistical Learning

This is the book I first picked up when I wanted to learn R. You can google for the accompanying code. It gives a very comprehensive coverage of Statistical Learning and R. I highly recommend it. You can find an older version of this book in a pdf for free: http://www-bcf.usc.edu/~gareth/ISL/

### Elements of Statistical Learning

This is another book I picked up when I wanted to learn R, and code is also available for it. It takes a less practical, more theoretical approach, but I think it gives a good, global view of the field. I am amazed that I learn something new every time I open this book. You can find a free older version of this book as well: http://statweb.stanford.edu/~tibs/ElemStatLearn/

## Bayesian Methods for Hackers

Main Author: Cam Davidson-Pilon

Link: https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

Description:

The most hands-on and practical coverage of Bayesian Inference I have been exposed to. It is both a GitHub collection of iPython notebooks and a real living, breathing book.

## Stanford’s Machine Learning Project Page

Links:

http://cs229.stanford.edu/projects2015.html

http://cs229.stanford.edu/projects2014.html

http://cs229.stanford.edu/projects2013.html

Description: A list of all the student projects and reports from Stanford’s CS 229 Machine Learning course. The most recent year has the top projects highlighted. It is probably not worth exhaustively reading each report unless you want to practice being a C.S. T.A.

## Data Science iPython Notebooks by Donne Martin

Author: Donne Martin

Link: https://github.com/donnemartin/data-science-ipython-notebooks

Description: This is one of the best collections of iPython notebooks. I obsessively went through all of them when I decided I was going to leave teaching and pursue a Data Science career. I found a lot of value in these notebooks. Thank you, Donne.

## Group Lens Datasets

Collector: GroupLens

Link: http://grouplens.org/datasets/

Description: A collection of datasets useful for recommender systems. The most famous is probably the MovieLens dataset, but there are others as well.

## Pew Research Data Sets

Collector: Pew Research

Link: http://www.pewinternet.org/datasets/

Description: A collection of free datasets. You must register to download any of them. Their hope is that the data will be used by researchers.

## Reality Commons Data

Collector: http://hd.media.mit.edu/

Link: http://realitycommons.media.mit.edu/index.html

Description: Datasets open-sourced in the hope that they will inspire insights and help facilitate the use of mobile phones in social science experiments.

## U.S. Open Data

Link: https://www.data.gov/

Description: A growing collection of the US Government’s collected data as part of its open data initiatives. Every time I go back I find new data.