On the Qualitative Behavior of Impurity-Based Splitting Rules 11: The Shape of Splitting Curves

Craig W. Codrington, Purdue University School of ECE

Document Type Article

Abstract

The process of decision tree induction involves choosing an attribute to split on and deciding on a cut point along the asis of that attribute that split,s the attribut,e into two distinct ranges (we cctnsider only ordinal attributes here). Typically the cut. point is chosen to minimize the average impurity, delinecl in terms of an impurity memure such as entropy or gini. Consider the two-class case. If the average impurity is plotted as a function of cut point for, say, the entropy and gini impurit.y measures! one notices an interesting behavior; the two curves have the same shape, in the sense that they rise and fall together. It will be shown here that this holds, not just for entropy and gini, but for ally two strictly concave impurity measures (note that ent,ropy and gini are strictly concave). A further consequence of t,his theory is that it is possible to predict the shape of .the split.ting curve, given only the sequence of instance class labels, where the instances have been sorted on the attriblite under consideratioll. I11 a previous paper, we showed that, for any strict,ly concave impurity measure, the minimum of t,he average impurity must occur at a boundary point. However, not every boundary point will correspond t'o a minirnum. To exploit this observation, a necessary condition is developed for the split,t,ing curve to have a local minirnum. By applying this test to each boundary point, we can efficiently rule out those boundary points that (lo not correspond to local minima, thereby reducing the computation required to find the best split.