Neyman-Pearson and Minimax cost complexity pruning of classification trees

Balaji Raghavan, Purdue University

Abstract

Cost complexity pruning of classification trees, as introduced in the Classification and Regression Trees (CART) framework by Breiman et al. (1984), is a powerful technique for avoiding overfitting to the training data while maintaining the interpretability of the tree. This approach involves a Bayesian cost that employs a fixed value of the prior probabilities, either obtained by maximum likelihood estimation or specified by the user. To use this approach with the Neyman-Pearson (NP) criterion and generate a Receiver Operating Characteristic (ROC) curve, a few values of the priors are selected heuristically in an attempt to obtain some desired values of the false alarm and detection probabilities. However, there is no systematic approach to integrating pruning with the NP criterion and ROC generation, even when restricted to those points that can be obtained by varying the priors, nor is there a statistically robust method for selecting the complexity and prior parameters that includes randomization among pruned subtrees. In this thesis, we consider the cost complexity measure parameterized by both the prior and the complexity of the tree, and examine the resulting two-dimensional pruning problem. A computationally efficient algorithm is developed to determine the optimal pruning of the tree for each point in the parameter space. This results in a family of pruned subtrees, each of which is shown to be optimal over a convex polygonal region of the parameter space. The method solves the pruning problem for the NP criterion and for ROC curve generation in the sense that it generates all possible solutions for different prior values. In addition, a robust procedure for NP parameter selection that allows randomization among pruned subtrees is formulated and solved as a linear programming problem. The pruned subtrees can also be used to solve the pruning problem under the minimax criterion and the problem of jointly estimating the priors and performing Bayesian pruning of the tree.
Integration of the pruning approach with tree growing is also examined. The effectiveness of the proposed algorithms is demonstrated on datasets obtained from the UCI machine learning repository.
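
As an illustration of the two-parameter cost described in the abstract, the following is a minimal sketch, not the algorithm developed in the thesis. It assumes a two-class problem and a small, entirely hypothetical list of candidate pruned subtrees, each summarized by its number of leaves and its estimated false alarm and miss probabilities. The cost is taken to be the prior-weighted error plus a complexity penalty, and the best subtree at each parameter point is found by brute-force enumeration. The names `Subtree`, `cost`, `best_subtree`, and `region_map`, and all numeric values, are assumptions made for this example.

```python
from typing import NamedTuple


class Subtree(NamedTuple):
    """Summary of one candidate pruned subtree (hypothetical values)."""
    name: str
    leaves: int     # number of terminal nodes, used as the complexity
    p_fa: float     # estimated P(decide class 1 | true class 0)
    p_miss: float   # estimated P(decide class 0 | true class 1)


def cost(tree: Subtree, prior1: float, alpha: float) -> float:
    """Prior-weighted error plus complexity penalty (assumed form)."""
    return (1.0 - prior1) * tree.p_fa + prior1 * tree.p_miss + alpha * tree.leaves


def best_subtree(candidates, prior1: float, alpha: float) -> Subtree:
    """Brute-force minimizer of the cost over the candidate list."""
    return min(candidates, key=lambda t: cost(t, prior1, alpha))


def region_map(candidates, priors, alphas):
    """Name of the minimizing subtree at each (prior, alpha) grid point."""
    return {(p, a): best_subtree(candidates, p, a).name
            for p in priors for a in alphas}


# Hypothetical candidates: two single-leaf trees and two larger trees.
CANDIDATES = [
    Subtree("root_says_0", leaves=1, p_fa=0.0, p_miss=1.0),
    Subtree("root_says_1", leaves=1, p_fa=1.0, p_miss=0.0),
    Subtree("small", leaves=2, p_fa=0.2, p_miss=0.3),
    Subtree("full", leaves=5, p_fa=0.1, p_miss=0.1),
]

if __name__ == "__main__":
    grid = region_map(CANDIDATES, priors=[0.05, 0.5, 0.95], alphas=[0.0, 0.1])
    for point, name in sorted(grid.items()):
        print(point, name)
```

Under this assumed form, the cost of each fixed subtree is an affine function of the prior and the complexity parameter, so the set of parameter points at which a given subtree attains the minimum is an intersection of half-planes. This agrees with the convex polygonal regions mentioned in the abstract, although the enumeration here does not scale to the full set of pruned subtrees of a large tree.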

Degree

Ph.D.

Advisors

Gelfand, Purdue University.

Subject Area

Applied Mathematics|Statistics|Artificial intelligence

