A quick look at Bayesian decision trees for classification

Oliver Burkhard
4 min read · Feb 28, 2021

This article takes a brief look at the performance of Bayesian decision trees, as presented here, with code here.

TL;DR: The selling point of Bayesian trees is essentially “ensemble-quality classification with single-tree explainability”. While I find that the method outperforms single trees, it does not reach the predictive quality of ensembles in most cases. Where explainability is important, however, Bayesian trees are to be preferred over other single-tree methods.

Bayesian decision trees

I will not go into the details of the paper, but I will give a brief summary of my understanding and invite discussion should it not reflect reality. I also restrict myself to the perpendicular version of the trees, as the hyperplane algorithms did not yield results in an acceptable amount of time.

The key idea behind Bayesian decision trees is to place a distribution over the possible trees, update the prior distribution with the observed data in the classical Bayesian way, and choose the tree with the highest posterior probability. The (unnormalised) prior chosen for the tree distribution assigns low values to deeper trees and to trees that split multiple times along the same dimension. The likelihood is the product of the likelihoods of the individual partitions (courtesy of the independence assumption), and within each partition the data follows a Dirichlet-categorical pair of distributions. As per usual, the trees are assumed to be binary.
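To make the partition likelihood concrete, here is a minimal sketch of the Dirichlet-categorical log marginal likelihood for a single partition, assuming a symmetric Dirichlet prior with concentration `alpha` (the function name and the default `alpha=1.0` are my choices for illustration, not taken from the paper's code):

```python
from math import lgamma

def log_marginal_likelihood(counts, alpha=1.0):
    """Dirichlet-categorical log marginal likelihood of one partition.

    counts: per-class observation counts in the partition.
    alpha:  symmetric Dirichlet concentration parameter.
    """
    k = len(counts)      # number of classes
    n = sum(counts)      # observations in this partition
    # log [ Gamma(k*a) / Gamma(k*a + n) * prod_i Gamma(c_i + a) / Gamma(a) ]
    return (lgamma(k * alpha) - lgamma(k * alpha + n)
            + sum(lgamma(c + alpha) for c in counts)
            - k * lgamma(alpha))
```

A pure partition scores higher than a mixed one of the same size, which is what makes this usable as a splitting criterion.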

The implementation on GitHub reflects the so-called Greedy Modal Tree (GMT). This means that at every step, the midpoints between all observed values along every dimension are evaluated as candidate splits, and the one that improves the posterior of the tree the most is chosen. If there is no split that would improve it, the tree is returned.
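The greedy step described above can be sketched as follows. This is my own simplified reconstruction, not the repository's code: it scores each candidate split by the summed leaf likelihoods (the tree-structure prior term is omitted), and the function names are hypothetical.

```python
import numpy as np
from math import lgamma

def leaf_score(counts, alpha=1.0):
    # Dirichlet-categorical log marginal likelihood of one leaf
    # (tree-structure prior omitted in this sketch).
    k, n = len(counts), sum(counts)
    return (lgamma(k * alpha) - lgamma(k * alpha + n)
            + sum(lgamma(c + alpha) for c in counts) - k * lgamma(alpha))

def best_split(X, y):
    """Evaluate midpoints between consecutive observed values along every
    dimension; return the (dimension, threshold) pair that most improves
    the score, or None if nothing beats the unsplit leaf."""
    n_classes = int(y.max()) + 1
    def score(mask):
        return leaf_score(np.bincount(y[mask], minlength=n_classes))
    base = score(np.ones(len(y), dtype=bool))
    best, best_gain = None, 0.0
    for d in range(X.shape[1]):
        vals = np.unique(X[:, d])
        for t in (vals[:-1] + vals[1:]) / 2.0:   # midpoints
            left = X[:, d] <= t
            gain = score(left) + score(~left) - base
            if gain > best_gain:
                best, best_gain = (d, float(t)), gain
    return best
```

Recursing on the two resulting partitions until `best_split` returns None yields the greedy modal tree.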

This procedure is very reminiscent of how ordinary decision trees are grown, except that the increase in the tree's posterior probability replaces a measure like entropy or Gini impurity as the splitting criterion. Beyond that I could not find any major algorithmic differences, which may be due to my only superficial look at them.
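For contrast, the classical criterion the paragraph refers to is just a per-node impurity rather than a posterior, e.g. Gini impurity:

```python
import numpy as np

def gini_impurity(counts):
    """CART-style Gini impurity: the probability that two samples drawn
    from the node (with replacement) belong to different classes."""
    p = np.asarray(counts, dtype=float) / sum(counts)
    return 1.0 - np.sum(p ** 2)
```

A standard tree greedily minimises the weighted impurity of the children, whereas the GMT greedily maximises the posterior of the whole tree.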

In terms of usability, the resulting classification is a decision tree and as such is easy to understand. Whether this constitutes the holy grail of interpretability is a whole other story not covered here.

Test setup

Test setups for ML algorithms are always fraught with too many options, which somehow have to be reduced. There are parameters for the problem that can be varied as well as parameters for the algorithms themselves, resulting in a combinatorial explosion of options. I chose to focus on the variability of the data first and then did a hyperparameter-tuned comparison on the MNIST dataset.

For the data variability part, I used sklearn’s make_classification function to generate problems for the algorithms to solve. Again, the options are many, so I limited myself to an overall simple case, an overall complex case, and variations of the simple case where the parameters were made more challenging in turn. Since the imbalanced case is so important, I chose two examples for it, one based on the simple case and one based on the complex case.
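As a sketch of what such generated problems look like, here is a hypothetical simple case and an imbalanced variant of it; the article's actual parameter choices are linked in the Materials section, and these particular values are illustrative only:

```python
from sklearn.datasets import make_classification

# A hypothetical "simple" problem: mostly informative features.
X_simple, y_simple = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    n_redundant=2, n_classes=2, random_state=0)

# An imbalanced variant: ~5% minority class plus a little label noise.
X_imbalanced, y_imbalanced = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    n_redundant=2, n_classes=2, weights=[0.95, 0.05],
    flip_y=0.01, random_state=0)
```

Varying `n_informative`, `class_sep`, `flip_y`, and `weights` is what spans the simple-to-complex and balanced-to-imbalanced axes.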

For the MNIST part I chose reasonable (TM) values for the obvious parameter candidates, keeping in mind that being too generous with the options would explode my calculation time. Should I have forgotten an important case, please feel free to comment.

Results

As expected, normal classification trees do not fare as well as bagged or boosted trees, and boosted trees seem to outperform bagged ones in most cases. LightGBM performs about as well as XGB on nearly all tasks, generally scoring a tad lower.

The Bayesian decision trees that were the focus here (drawn in red) fall somewhere between the normal trees and the aggregated trees.

In terms of runtime, both XGB and the Bayesian trees are definitely not the fastest and incur a multiple of the wall time of the other algorithms. This is most likely a minor concern in practice for most use cases, but still something to be aware of.

Discussion

As long as predictive performance is paramount, I will probably not use Bayesian trees in the future and will opt for boosted trees instead (XGB, unless runtime is a concern). However, since Bayesian trees achieve their performance with a single, easily understandable tree, they are certainly to be preferred over standard single-tree methods in cases where this understandability is important.

Materials

The parameters used to generate the toy problems on which the algorithms were compared can be found here. The hyperparameter choices for the MNIST part can be found here.
