# Decision Trees

You can find the iPython Notebook for this post at https://github.com/bryantravissmith/FromScratch/tree/master/SupervisedLearning/DecisionTrees

Decision Trees are a non-parametric supervised machine learning algorithm that takes a set of training data and constructs a set of regions in the space of features that is then used to make predictions. These predictions can be continuous values (Decision Tree Regression), or classification (Decision Tree Classification).

Let’s imagine we have a dataset with a lot of features, including how much student debt each individual has. We could just take the average of the entire dataset, say $25,000, and use that as a prediction. A better prediction could be developed by dividing the dataset up using the level of college completed: None, Some, Undergraduate, and Graduate. We can then take the average of each subset/region:

1. None: $0
2. Some: $2,000
3. Undergraduate: $16,000
4. Graduate: $42,000

Now when we score a person, we use their level of college completed to make the prediction. The idea is that these results improve on the $25,000 average estimate of the entire dataset.
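The toy example above can be sketched in a few lines of R. The debt figures here are the made-up numbers from the example, not real data:

```r
# Hypothetical student-debt data using the made-up figures above
debt <- data.frame(
  college = c("None", "None", "Some", "Undergraduate", "Graduate", "Graduate"),
  amount  = c(0, 0, 2000, 16000, 42000, 42000)
)

# Baseline prediction: one average for the whole dataset
overall <- mean(debt$amount)

# Region predictions: one average per level of college completed
by_level <- tapply(debt$amount, debt$college, mean)

overall
by_level
```

Scoring a new person then amounts to looking up the average for their group instead of using the single overall average.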

## Why Implement It Ourselves?

The reason I want to implement machine learning algorithms from scratch is to build a better intuition for the assumptions and trade-offs in the implementations of standard libraries. For instance, before I did this for the first time I did not appreciate that both sklearn (Python) and rpart (R) implement the CART algorithm for decision trees, yet there are equally valid decision tree algorithms that would produce different results. This implementation is inherently greedy. The issue with greedy algorithms is that making the locally best splits or regions may not produce the globally best set of regions. This is what makes finding the best Decision Tree model NP-hard. If you really get into it, finding the best tree becomes an optimal stopping problem and brushes up against Reinforcement Learning and Artificial Intelligence style search.

On a more practical side, there have been a number of times in the iterative process of model-building where understanding the model's assumptions has allowed me to exclude features that violate those assumptions, or to create new features that are more easily expressed and represented by the model. Both of these have consistently led to better out-of-sample performance.

Finally, the best reason is that I get pleasure and satisfaction from doing this. I think algorithms are a powerful framework and a fun way to think about problem solving, and building them from scratch improves both my execution and my understanding of them.

## Metrics – Measures for Improvement

I made up the student debt example. It makes sense in my mind, but we need to define an algorithm that can create the regions from any dataset. Most machine learning algorithms are optimization algorithms that improve some measure/metric on each iteration. Decision trees are no different in that respect. To define the algorithm, we need to define the measures.

### Classification Metrics for Decision Trees

#### Gini Impurity

Gini Impurity is a very intuitive metric to improve because it is a measure of the misclassification rate. If the fraction of the dataset that has outcome class $c$ is $f_c$, we can predict class $c$ for a data point $f_c$ of the time. The error rate for a class is the probability of predicting the class times the probability of getting it wrong.

$Error_c = f_c * \left( 1 - f_c \right)$

The Gini Impurity is the sum of these error rates for all classes.

$Gini \ Impurity = \sum_c Error_c = \sum_c f_c * \left( 1 - f_c \right)$

$Gini \ Impurity = \sum_c f_c - \sum _c f^2_c$

$Gini \ Impurity = 1 - \sum _c f^2_c$

If we divide the dataset into regions that improve the weighted sum of the Gini Impurity, the misclassification rate will be reduced.
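As a quick sketch of that idea, here is the Gini Impurity before and after a hypothetical two-class split. The helper and the ten-point dataset are illustrative assumptions, not the implementation used later in this post:

```r
# Gini impurity of a class vector: 1 - sum over classes of f_c^2
gini <- function(y) {
  f <- table(y) / length(y)
  1 - sum(f^2)
}

# A hypothetical ten-point dataset with two balanced classes
y <- rep(c("a", "b"), each = 5)
gini(y)   # 0.5, the worst case for two classes

# A split that isolates part of class "a" lowers the weighted impurity
left  <- y[1:4]    # four "a"
right <- y[5:10]   # one "a", five "b"
(4/10) * gini(left) + (6/10) * gini(right)   # 1/6, roughly 0.17
```

The weighted impurity drops from 0.5 to about 0.17, which is exactly the kind of improvement the splitting algorithm will chase.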

#### Gini Impurity for a Coin Flip

A coin flip is about as simple a process as there is to calculate the Gini Impurity for. We have a coin with probability of heads $p_H$. The Gini Impurity is then

$Gini \ Impurity = 1 - p_H^2 - \left(1-p_H\right)^2 = 2*p_H*\left(1-p_H\right)$

A plot of this for different values of $p_H$ is below.

p <- seq(0.01, 0.99, 0.01)
plot(p,
     1 - p^2 - (1-p)^2,
     type = 'l',
     ylim = c(0,1),
     ylab = "Gini Impurity",
     col = "blue")


We see that if $p_H = 0$ or $p_H = 1$ there is no misclassification. If $p_H = 0.5$, we see that we get the coin flip wrong 50% of the time. This should match our intuition.

#### Entropy

Another, and less intuitive, measure of prediction accuracy is entropy. It is defined as the expected information content of a class. If $f_c$ is the fraction of class $c$ in a dataset, and $I(c)$ is the information content of the class, then entropy is defined as follows:

$Entropy = \sum_c f_c * I(c)$

For reasons I will not get into in this post, $I(c) = - log_2\left(f_c\right)$, so the entropy of a dataset follows:

$Entropy = - \sum_c f_c * log_2\left(f_c\right)$

The base of the log can be anything, really. But since we will be using binary trees throughout this series, I will keep it base 2.
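As a sanity check on the formula, a fair two-class split carries exactly one bit of entropy in base 2, and the same split measured in base $e$ gives $\ln 2 \approx 0.693$ nats:

```r
p <- c(0.5, 0.5)      # class fractions for a fair coin
-sum(p * log2(p))     # 1 bit in base 2
-sum(p * log(p))      # log(2), about 0.693 nats in base e
```

Changing the base only rescales the entropy by a constant, which is why it does not affect which split wins.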

#### Entropy for a Coin Flip

Continuing with the coin flip example, I have plotted the entropy of a coin flip for different values of $p_H$.

p <- seq(0.01, 0.99, 0.01)
plot(p,
     -p*log2(p) - (1-p)*log2(1-p),
     type = 'l',
     ylim = c(0,1),
     ylab = "Entropy",
     col = "blue")


The expected information content of $p_H = 0$ and $p_H = 1$ is zero, which means that we always know what the next flip will be. The entropy is maximal for $p_H = 0.5$, which is where we have the highest misclassification rate.

### Regressions Metrics for Decision Trees

The metric used for decision tree regression is the reduction of variance around the prediction.

$Variance = \frac{1}{N} \sum_{i}^{N} ( y_i - \bar{y} )^2$

#### Simple Linear Example

We can take a simple linear example and split it into two regions. The solid black line is the average of the entire dataset, while the red lines are the predictions in each region.

The total variance before the split was 40.00, while the weighted variance after the split is $\frac{6}{11} \times 11.67 + \frac{5}{11} \times 8.00 = 10.0$.
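The figures above can be reproduced with a small sketch. The eleven points on a line here are my own reconstruction chosen to match the quoted variances, not the exact data behind the plot:

```r
# Population variance (divide by N, as in the formula above)
pop_var <- function(v) mean((v - mean(v))^2)

# Eleven hypothetical, evenly spaced points on a line
y <- seq(2, 22, by = 2)
pop_var(y)                                          # 40

# Split into the first six and last five points
left  <- y[1:6]
right <- y[7:11]
pop_var(left)                                       # ~11.67
pop_var(right)                                      # 8
(6/11) * pop_var(left) + (5/11) * pop_var(right)    # 10
```

The weighted variance after the split is a quarter of the original, which is the improvement the regression tree is optimizing.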

## Improving Metrics – Splitting on Information Gain

Now that we have metrics to evaluate the predictions from a dataset, we can define the improvement measure to optimize. I will use the term information gain, which is strictly correct only when talking about entropy, for all the optimizations. I cannot say conclusively that it is the name for the measure in decision tree regression. Being wrong hasn’t stopped me before!

### Decision Tree Classifier

$Information \ Gain = Gini \ Impurity(d) - \sum_{split \ s} \frac{|s|}{|d|} Gini \ Impurity(s)$

$Information \ Gain = Entropy(d) - \sum_{split \ s} \frac{|s|}{|d|} Entropy(s)$

### Decision Tree Regressor

$Information \ Gain = Variance(d) - \sum_{split \ s} \frac{|s|}{|d|} Variance(s)$
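To make the formulas concrete, here is a small worked example of the entropy-based information gain for one hypothetical split. The class vector and mask are illustrative, not data from this post:

```r
# Entropy of a class vector in bits
entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(p * log2(p))
}

y    <- rep(c("a", "b"), each = 5)        # 5 of each class
mask <- c(rep(TRUE, 4), rep(FALSE, 6))    # first four rows go to the left split

# Parent entropy minus the size-weighted entropies of the two splits
gain <- entropy(y) -
  (sum(mask)  / length(y)) * entropy(y[mask]) -
  (sum(!mask) / length(y)) * entropy(y[!mask])
gain   # about 0.61 bits
```

The left split is pure (entropy 0), so most of the parent's one bit of entropy is removed and the gain is large.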

## Parameters of Decision Trees

Decision Trees are prone to overfitting because the metrics can be driven to their minimum by making the tree deep enough to capture each data point individually. The variance of a single point is zero. The misclassification rate of a single point is also zero. Traditionally this overfitting is managed by pruning the tree after it is constructed, or by setting parameters that limit how finely the data can be partitioned into regions.

Post-pruning of a decision tree uses some sort of statistical evaluation on a validation set. If the error on the sub-nodes is larger than the error on the node itself, the sub-nodes are removed.
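The pruning rule just described can be sketched as a tiny helper. The function name and the error values are illustrative; pruning is not part of the implementation in this post:

```r
# Hypothetical reduced-error pruning check for one node: keep the split only
# if the children's size-weighted validation error beats the node's own error.
prune_check <- function(node_error, left_error, right_error, n_left, n_right) {
  children_error <- (n_left * left_error + n_right * right_error) / (n_left + n_right)
  children_error < node_error   # TRUE: keep the split; FALSE: prune it
}

prune_check(0.30, 0.10, 0.20, 50, 50)   # TRUE: the split helps, keep it
prune_check(0.15, 0.10, 0.30, 50, 50)   # FALSE: the split hurts, prune it
```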

The parameters that are commonly used in CART to limit the partitioning of the data are:

1. Maximum Depth of the Tree
2. Minimum Number of Data Points for a New Node
3. Minimum Improvement needed for a New Node
4. Minimum Metric value below which New Nodes are not created

## Pseudo Code

Now that we have a metric and a measure of improvement, we can outline the pseudo code we will implement in the next sections, based on the CART algorithm. The Classification and Regression Trees (CART) algorithm recursively and greedily splits nodes whenever there is an improvement in the metric of choice, until some stopping conditions are met.

1. Set Training Parameters
2. Create a Root Node
3. Determine the optimal split of the node, i.e. the split that minimizes the metric of concern.
4. If the stopping parameters are not violated, create two child nodes and apply steps 3 and 4 to each child node.

# Code

In this code section I am going to implement a Decision Tree algorithm, in the following parts:

1. Introduce the Data
2. Script Finding the Best First Partition
3. Create a Data Structure and Implement Decision Trees
4. Compare with the rpart package

## Assumptions

The implementation will avoid some common issues with decision trees. I am going to assume that there are never missing values. I will assume that all categorical variables are logical, character, or factor variables, and that all numeric variables are continuous. I am also going to assume that all the data is in a data.frame format.

## Data

I will be using the Iris dataset for classification and the ToothGrowth dataset for regression. Both are available in the R datasets package.

library(datasets)
data(iris)
data(ToothGrowth)


#### Iris

I will be using the Species as the classification target in this post. Here are the top 6 rows of the dataset:

Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa

#### ToothGrowth

I will be using the ‘len’ variable as the regression target in this post. Here are the top 6 rows of the ToothGrowth dataset:

len supp dose
4.2 VC 0.5
11.5 VC 0.5
7.3 VC 0.5
5.8 VC 0.5
6.4 VC 0.5
10.0 VC 0.5

## Scripting the First Partition

I am going to implement a brute-force search for the best partition of the dataset. This will be done by going through each value of each feature and finding the feature-value pair that best partitions the data. To do this we need to implement the metrics discussed above:

- entropy: classification
- Gini impurity: classification
- variance: regression

entropy <- function(y){
  # assumes y is a factor with all levels
  if(length(y) == 0) return(0)
  p <- table(y)/length(y)
  sum(-p*log2(p + 1e-9))
}

gini_impurity <- function(y){
  # assumes y is a factor with all levels
  if(length(y) == 0) return(0)
  p <- table(y)/length(y)
  1 - sum(p^2)
}

variance <- function(y){
  if(length(y) <= 1) return(0)
  var(y)
}


I can quickly calculate these metrics for our dataset:

entropy(iris[,'Species'])
gini_impurity(iris[,'Species'])
variance(ToothGrowth[,'len'])


1.58496249639307
0.666666666666667
58.5120225988701

## Information Gain

Now that I have metric functions, I want to calculate the improvement for different partitions. Each partition is going to be defined by a mask: a logical vector that states whether each row is in the first partition (TRUE) or the second partition (FALSE). I will call the function information gain, even though it is really only information gain when entropy is the metric.

information_gain <- function(y, mask, func = entropy){
  s1 <- sum(mask)
  s2 <- length(mask) - s1
  if(s1 == 0 | s2 == 0) return(0)
  func(y) - s1/(s1+s2)*func(y[mask]) - s2/(s1+s2)*func(y[!mask])
}

information_gain(iris[,'Species'], iris[,'Petal.Width'] < 1.5)
information_gain(iris[,'Species'], iris[,'Petal.Width'] < 1.5, gini_impurity)
information_gain(ToothGrowth[,'len'], ToothGrowth[,'dose'] <= 0.5, variance)


0.643516416593624
0.229045542635659
33.8790109029636

## Max Information Gain For A Single Feature

Now that I have a function to calculate the information gain for a given criterion, mask, or split, I will search for the best split among all the values of a given feature. This gives the best split for that feature:

max_information_gain_split <- function(y, x, func = gini_impurity){
  best_change = NA
  split_value = NA
  is_numeric = !(is.factor(x)|is.logical(x)|is.character(x))
  for( val in sort(unique(x))){
    mask <- x == val
    if (is_numeric) mask <- x < val
    change <- information_gain(y, mask, func)
    if(is.na(best_change) | change > best_change){
      best_change = change
      split_value = val
    }
  }
  return(list("best_change" = best_change,
              "split_value" = split_value,
              "is_numeric" = is_numeric))
}

print(unlist(max_information_gain_split(iris$Species,iris$Petal.Width)))
print(unlist(max_information_gain_split(iris$Species,iris$Petal.Width,entropy)))
print(unlist(max_information_gain_split(ToothGrowth$len,ToothGrowth$dose,variance)))


best_change split_value is_numeric
0.3333333 1.0000000 1.0000000
best_change split_value is_numeric
0.9182958 1.0000000 1.0000000
best_change split_value is_numeric
33.87901 1.00000 1.00000

All the best split values just happen to be 1 in this case.

## Max Information Gain Among All Features

Now I can apply ‘max_information_gain_split’ to all of the feature columns and find the best overall split. We will do this by applying ‘sapply’ over the features like so:

sapply(iris[,1:4],function(x) max_information_gain_split(iris[,5],x))

Sepal.Length Sepal.Width Petal.Length Petal.Width
best_change 0.2277603 0.1269234 0.3333333 0.3333333
split_value 5.5 3.4 3 1
is_numeric TRUE TRUE TRUE TRUE
sapply(ToothGrowth[,2:3],function(x) max_information_gain_split(ToothGrowth[,1],x))

supp dose
best_change 0.01444444 0.02027778
split_value OJ 2
is_numeric FALSE TRUE

After ‘sapplying’ the function, I return a list with the feature and value that give the most information gain.

best_feature_split <- function(X, y){
  results <- sapply(X, function(x) max_information_gain_split(y, x))
  best_name <- names(which.max(results['best_change',]))
  best_result <- results[,best_name]
  best_result[["name"]] <- best_name
  best_result
}

best_feature_split(iris[,1:4],iris[,5])
best_feature_split(ToothGrowth[,2:3],ToothGrowth[,1])

best_change split_value is_numeric name
0.3333333 3 TRUE Petal.Length
best_change split_value is_numeric name
0.0202778 2 TRUE dose

Now that I have the best feature-value combination to partition the data, we can split it into two datasets. One I will call ‘left’, and the other I will call ‘right’. This matches the data structure we will use, in which each node can have two branches: left and right.

get_best_mask <- function(X, best_feature_list){
  best_mask <- X[,best_feature_list$name] == best_feature_list$split_value
  if(best_feature_list$is_numeric){
    best_mask <- X[,best_feature_list$name] < best_feature_list$split_value
  }
  return(best_mask)
}

With the scripting done, I can move on to building a DecisionTreeNode reference class. The next several functions are its methods.

### Print Function

This function prints a summary tree of a node and its branches by recursively building an indented string of split descriptions.

print.DecisionTreeNode <- function(node, level = 0){
  response <- node$split_description
  if(level < node$max_depth){
    if(!is.null(node$branches$left)){
      response <- paste0(response, "\n",
                         paste(rep(" ", 2*(level+1)), collapse=" "),
                         print(node$branches$left, level+1))
    }
    if(!is.null(node$branches$right)){
      response <- paste0(response, "\n",
                         paste(rep(" ", 2*(level+1)), collapse=" "),
                         print(node$branches$right, level+1))
    }
  }
  if(level == 0){
    cat(response)
  } else {
    return(response)
  }
}

### Initialize Function

This function gets called when DecisionTreeNode$new() is called. In it, I set a list of default values. I iterate through the parameters provided to the $new(...) call, and if a parameter is not set, I set it to its default with the do.call command.

initialize = function(...){
  defaults <- list(x = data.frame(),
                   y = c(),
                   depth = 0,
                   minimize_func = gini_impurity,
                   min_information_gain = 1e-2,
                   min_leaf_size = 20,
                   max_depth = 3,
                   is_leaf = T,
                   split_description = "root",
                   best_split = NULL,
                   branches = list("left" = NULL, "right" = NULL))
  params <- list(...)
  fields <- names(getRefClass()$fields())
  for( field in fields){
    if (!(field %in% names(params))) {
      params[[field]] <- defaults[[field]]
    }
  }
  for( param in names(params)){
    do.call("<<-", list(param, params[[param]]))
  }
}


### Information Gain

This is the same function as I implemented during the scripting section except that it makes reference to the object variables instead of accepting them as parameters.

information_gain = function(mask){
  s1 <- sum(mask)
  s2 <- length(mask) - s1
  if(s1 == 0 | s2 == 0) return(0)
  minimize_func(y) - s1/(s1+s2)*minimize_func(y[mask]) - s2/(s1+s2)*minimize_func(y[!mask])
}


### Max Information Gain Split

This function finds the split value with the most information gain for a given feature. It is similar to the scripted version above, except that it references the object’s variables and, for numeric features, uses the midpoint between consecutive values as the split value.

max_information_gain_split = function(feature){
  best_change = NA
  split_value = NA
  is_numeric = !(is.factor(feature)|is.logical(feature)|is.character(feature))

  previous_val <- 0
  for( val in sort(unique(feature))){
    mask <- feature == val
    if (is_numeric) mask <- feature < val
    change <- information_gain(mask)
    if(is.na(best_change) | change > best_change){
      best_change = change
      split_value = ifelse(is_numeric,
                           mean(c(val, previous_val)),
                           val)
    }
    previous_val <- val
  }
  return(list("best_change" = best_change,
              "split_value" = split_value,
              "is_numeric" = is_numeric))
}


### Best Feature Split

This function finds the feature-value split that gives the most information gain out of the all the features.

best_feature_split = function(){
  results <- sapply(x, function(feature) max_information_gain_split(feature))
  best_name <- names(which.max(results['best_change',]))
  best_result <- results[,best_name]
  best_result[["name"]] <- best_name
  best_split <<- best_result
}


### Best Mask

This function finds the mask for the best feature split. It is used to split the node when doing so is consistent with the hyper-parameters.

best_mask = function(){
  best_mask <- x[,best_split$name] == best_split$split_value
  if(best_split$is_numeric){
    best_mask <- x[,best_split$name] < best_split$split_value
  }
  return(best_mask)
}

### Split Node

This function splits the DecisionTreeNode into left and right branches when doing so is consistent with the hyper-parameters. It also sets each node’s description so a summary tree can be printed out.

split_node = function() {
  if(depth < max_depth){
    best_feature_split()
    if(best_split$best_change > min_information_gain){
      mask <- best_mask()
      if(sum(mask) >= min_leaf_size & sum(!mask) >= min_leaf_size){
        is_leaf <<- F

        branches$left <<- .self$copy()
        branches$left$is_leaf <<- T
        branches$left$x <<- branches$left$x[mask,]
        branches$left$y <<- branches$left$y[mask]
        branches$left$split_description <<- ifelse(best_split$is_numeric,
                                                   paste(c(best_split$name, "<", best_split$split_value),
                                                         collapse = " "),
                                                   paste(c(best_split$name, "=", best_split$split_value),
                                                         collapse = " "))
        branches$left$depth <<- branches$left$depth + 1
        branches$left$split_node()

        branches$right <<- .self$copy()
        branches$right$is_leaf <<- T
        branches$right$x <<- branches$right$x[!mask,]
        branches$right$y <<- branches$right$y[!mask]
        branches$right$split_description <<- ifelse(best_split$is_numeric,
                                                    paste(c(best_split$name, ">=", best_split$split_value),
                                                          collapse = " "),
                                                    paste(c(best_split$name, "!=", best_split$split_value),
                                                          collapse = " "))
        branches$right$depth <<- branches$right$depth + 1
        branches$right$split_node()
      }
    }
  }
  if(is_leaf){
    split_description <<- ifelse(identical(minimize_func, variance),
                                 paste(c(split_description, ":", "predict - ", mean(y)),
                                       collapse = " "),
                                 paste(c(split_description, ":", "predict - ", names(which.max(table(y)))),
                                       collapse = " "))
  }
}


### Predict Row

This function is used to make a prediction for a single row. It will be used by the predict function.

predict_row = function(row){
  if(is_leaf){
    predict_value <- ifelse(identical(minimize_func, variance),
                            mean(y),
                            names(which.max(table(y))))
  } else {
    if(best_split$is_numeric){
      left = row[best_split$name] < best_split$split_value
    } else {
      left = row[best_split$name] == best_split$split_value
    }
    if(left){
      predict_value = branches$left$predict_row(row)
    } else {
      predict_value = branches$right$predict_row(row)
    }
  }
  return(predict_value)
}

### Predict

This function predicts the values from the fitted tree. If the minimizing function is variance, I assume the problem is a regression problem and return numeric values. If it is not, I assume it is a classification problem and return character values.

predict = function(features){
  pred <- character(length = dim(features)[1])
  if(identical(minimize_func, variance)) pred <- numeric(length = dim(features)[1])
  for(i in 1:dim(features)[1]){
    pred[i] = predict_row(features[i,])
  }
  pred
}

## Final DecisionTreeNode Code

DecisionTreeNode <- setRefClass("DecisionTreeNode",
    fields = list(x = "data.frame",
                  y = "ANY",
                  is_leaf = "logical",
                  split_description = "character",
                  best_split = "list",
                  branches = "list",
                  depth = "numeric",
                  minimize_func = "function",
                  min_information_gain = "numeric",
                  min_leaf_size = "numeric",
                  max_depth = "numeric"),
    methods = list(
initialize = function(...){
  defaults <- list(x = data.frame(),
                   y = c(),
                   depth = 0,
                   minimize_func = gini_impurity,
                   min_information_gain = 1e-2,
                   min_leaf_size = 20,
                   max_depth = 3,
                   is_leaf = T,
                   split_description = "root",
                   best_split = NULL,
                   branches = list("left" = NULL, "right" = NULL))
  params <- list(...)
  fields <- names(getRefClass()$fields())
for( field in fields){
if (!(field %in% names(params))) {
params[[field]] &lt;- defaults[[field]]
}
}
  for( param in names(params)){
    do.call("<<-", list(param, params[[param]]))
  }
},
information_gain = function(mask){
  s1 <- sum(mask)
  s2 <- length(mask) - s1
  if(s1 == 0 | s2 == 0) return(0)
  minimize_func(y) - s1/(s1+s2)*minimize_func(y[mask]) - s2/(s1+s2)*minimize_func(y[!mask])
},
max_information_gain_split = function(feature){
  best_change = NA
  split_value = NA
  is_numeric = !(is.factor(feature)|is.logical(feature)|is.character(feature))

  previous_val <- 0
  for( val in sort(unique(feature))){
    mask <- feature == val
    if (is_numeric) mask <- feature < val
    change <- information_gain(mask)
    if(is.na(best_change) | change > best_change){
      best_change = change
      split_value = ifelse(is_numeric,
                           mean(c(val, previous_val)),
                           val)
    }
    previous_val <- val
  }
  return(list("best_change" = best_change,
              "split_value" = split_value,
              "is_numeric" = is_numeric))
},
best_feature_split = function(){
  results <- sapply(x, function(feature) max_information_gain_split(feature))
  best_name <- names(which.max(results['best_change',]))
  best_result <- results[,best_name]
  best_result[["name"]] <- best_name
  best_split <<- best_result
},
best_mask = function(){
  best_mask <- x[,best_split$name] == best_split$split_value
  if(best_split$is_numeric){
    best_mask <- x[,best_split$name] < best_split$split_value
  }
  return(best_mask)
},
split_node = function() {
  if(depth < max_depth){
    best_feature_split()
    if(best_split$best_change > min_information_gain){
      mask <- best_mask()
      if(sum(mask) >= min_leaf_size & sum(!mask) >= min_leaf_size){
        is_leaf <<- F

        branches$left <<- .self$copy()
        branches$left$is_leaf <<- T
        branches$left$x <<- branches$left$x[mask,]
        branches$left$y <<- branches$left$y[mask]
        branches$left$split_description <<- ifelse(best_split$is_numeric,
                                                   paste(c(best_split$name, "<", best_split$split_value),
                                                         collapse = " "),
                                                   paste(c(best_split$name, "=", best_split$split_value),
                                                         collapse = " "))
        branches$left$depth <<- branches$left$depth + 1
        branches$left$branches <<- list("left" = NULL, "right" = NULL)
        branches$left$split_node()

        branches$right <<- .self$copy()
        branches$right$is_leaf <<- T
        branches$right$x <<- branches$right$x[!mask,]
        branches$right$y <<- branches$right$y[!mask]
        branches$right$split_description <<- ifelse(best_split$is_numeric,
                                                    paste(c(best_split$name, ">=", best_split$split_value),
                                                          collapse = " "),
                                                    paste(c(best_split$name, "!=", best_split$split_value),
                                                          collapse = " "))
        branches$right$depth <<- branches$right$depth + 1
        branches$right$branches <<- list("left" = NULL, "right" = NULL)
        branches$right$split_node()
}
}
}
  if(is_leaf){
    split_description <<- ifelse(identical(minimize_func, variance),
                                 paste(c(split_description, ":", "predict - ", mean(y)),
                                       collapse = " "),
                                 paste(c(split_description, ":", "predict - ", names(which.max(table(y)))),
                                       collapse = " "))
  }
},
predict_row = function(row){
  if(is_leaf){
    predict_value <- ifelse(identical(minimize_func, variance),
                            mean(y),
                            names(which.max(table(y))))
} else {
    if(best_split$is_numeric){
      left = row[best_split$name] < best_split$split_value
    } else {
      left = row[best_split$name] == best_split$split_value
    }
    if(left){
      predict_value = branches$left$predict_row(row)
    } else {
      predict_value = branches$right$predict_row(row)
    }
  }
  return(predict_value)
},
predict = function(features){
  pred <- character(length = dim(features)[1])
  if(identical(minimize_func, variance)) pred <- numeric(length = dim(features)[1])
  for(i in 1:dim(features)[1]){
    pred[i] = predict_row(features[i,])
  }
  pred
}
))

### Proof of Concept on iris

dtn <- DecisionTreeNode$new(x = iris[,1:4],
                            y = iris[,5])
dtn$split_node()
print(dtn)

### Proof of Concept on ToothGrowth

dtn <- DecisionTreeNode$new(x=ToothGrowth[,2:3],
y=ToothGrowth[,1],
minimize_func=variance,
min_leaf_size=5,
max_depth=5,
min_information_gain=1e-3)
dtn$split_node()
print(dtn)

# DecisionTree

Now that I have implemented the DecisionTreeNode, I can implement the DecisionTree class. Here is the template code.

DecisionTree <- setRefClass("DecisionTree",
    fields = list(minimize_func = "function",
                  min_information_gain = "numeric",
                  min_leaf_size = "numeric",
                  max_depth = "numeric",
                  root = "DecisionTreeNode"),
    methods = list(
        initialize = function(...){},
        fit = function(features, target){},
        predict = function(features){}
    ))

### Initialize

This is almost identical to the initialization function for DecisionTreeNode; the defaults are only the hyper-parameters, though.

initialize = function(...){
  defaults <- list(minimize_func = gini_impurity,
                   min_information_gain = 1e-2,
                   min_leaf_size = 20,
                   max_depth = 3,
                   root = NULL)
  params <- list(...)
  fields <- names(getRefClass()$fields())
for( field in fields){
if (!(field %in% names(params))) {
params[[field]] &lt;- defaults[[field]]
}
}
for( param in names(params)){
do.call("<<-", list(param, params[[param]]))
}

}


### Fit

The fit function takes the features and targets. I am assuming that a data frame will be provided; you can generalize this to other datatypes as you see fit. It creates a new DecisionTreeNode from the data, then splits the node, which recursively and greedily builds the tree.

fit = function(features, target){
  root <<- DecisionTreeNode$new(x = features,
                                y = target,
                                minimize_func = minimize_func,
                                min_information_gain = min_information_gain,
                                min_leaf_size = min_leaf_size,
                                max_depth = max_depth)
  root$split_node()
}


### Predict

The predict function is just a wrapper to call the DecisionTreeNode predict function. Right now there are no probabilities. You can add that feature yourself!

predict = function(features){
  root$predict(features)
}

## Fully Implemented Decision Tree

DecisionTree <- setRefClass("DecisionTree",
    fields = list(minimize_func = "function",
                  min_information_gain = "numeric",
                  min_leaf_size = "numeric",
                  max_depth = "numeric",
                  root = "DecisionTreeNode"),
    methods = list(
initialize = function(...){
  defaults <- list(minimize_func = gini_impurity,
                   min_information_gain = 1e-2,
                   min_leaf_size = 20,
                   max_depth = 3,
                   root = NULL)
  params <- list(...)
  fields <- names(getRefClass()$fields())
for( field in fields){
if (!(field %in% names(params))) {
params[[field]] &lt;- defaults[[field]]
}
}
for( param in names(params)){
do.call("<<-", list(param, params[[param]]))
}

},
fit = function(features, target){
  root <<- DecisionTreeNode$new(x = features,
                                y = target,
                                minimize_func = minimize_func,
                                min_information_gain = min_information_gain,
                                min_leaf_size = min_leaf_size,
                                max_depth = max_depth)
  root$split_node()
},
predict = function(features){
  root$predict(features)
}
))

print.DecisionTree <- function(tree){
  print(tree$root)
}



# Comparison to rpart

Now that we have a decision tree, we can compare the results between the two CART implementations.

library(rpart)
dt_rpart <- rpart(Species ~ ., data = iris, maxdepth = 3, minbucket = 20)
dt_bts <- DecisionTree(max_depth = 3, min_leaf_size = 20)
dt_bts$fit(iris[,1:4], iris[,5])

print(dt_bts)
cat("\n\n")
print(dt_rpart)

dt_rpart <- rpart(Species ~ ., data = iris, maxdepth = 5, minbucket = 5)
dt_bts <- DecisionTree(max_depth = 5, min_leaf_size = 5)
dt_bts$fit(iris[,1:4], iris[,5])

print(dt_bts)
cat(&quot;\n\n&quot;)
print(dt_rpart)


dt_rpart <- rpart(len ~ ., data = ToothGrowth, maxdepth = 5, minbucket = 5, method = "anova")
dt_bts <- DecisionTree(max_depth = 5, min_leaf_size = 5, minimize_func = variance)
dt_bts$fit(ToothGrowth[,2:3], ToothGrowth[,1])

print(dt_bts)
cat(&quot;\n\n&quot;)
print(dt_rpart)


I now have a Decision Tree class that, provided the data meets my assumptions (no missing values, a data frame format, and categorical variables as factors), produces results that are similar, if not identical, to those of the rpart package.

# Conclusion

In this post, I covered the theory and code of implementing the CART algorithm for producing decision trees, and compared the results to rpart on a few toy datasets. This code is far from production-ready, but it does show a high-level implementation of the algorithm. You can leave any questions, comments, corrections, or requests in the comments below.

If you are interested in learning more, I can recommend a few books below. The image links are affiliate links. I provide links to free versions when available.

### Data Science For Business

This book has a good overview of Decision Trees in chapter 3, and good general advice about using data science to get practical results. I gave it a good once-over, and I continue to reference it from time to time.

### Introduction to Statistical Learning

This is the book I first picked up when I wanted to learn R.  You can google for the accompanying code. It gives a very comprehensive coverage of Statistical Learning and R. I highly recommend it. You can find an older version of this book in a pdf for free: http://www-bcf.usc.edu/~gareth/ISL/

### Elements of Statistical Learning

This is another book I picked up when I wanted to learn R. There is also code available. It is not a practical approach, but I think it gives a good, global perspective. I am amazed that I learn something new every time I open this book. You can find a free older version of this book as well: http://statweb.stanford.edu/~tibs/ElemStatLearn/