Cost-sensitive decision trees with completion time requirements

Hung-Pin Kao, Purdue University

Abstract

Decision trees are an attractive method for classification tasks because their classification rules can be generated efficiently and are easy to interpret. Many decision-tree induction algorithms have been developed for a wide range of applications. The traditional approach is essentially a top-down greedy heuristic that takes classification accuracy as the objective of tree induction. Under a decision tree structure, a subject is classified according to the results of sequentially measuring a set of predictive attributes. In many applications, measuring an attribute takes both money and time, and managing the overall cost and completion time of applying a tree is the user's main concern. In this study, I focus on the economic aspects of implementing a tree in a classification task. In addition, I introduce a special completion time requirement: the deadline for classifying a subject is determined by its label (target) value. For example, a timely diagnosis is important for an illness that requires immediate medical attention, so it is common in medical diagnosis to set deadlines based on the severity of the illness. Based on this requirement, I consider two problems that incorporate the completion time requirement into the model formulation in different ways.
In the first problem, I assume that a late penalty cost is incurred if the deadline is not met. The goal of tree induction is to produce a tree that minimizes the total cost, which is the sum of the cost of measuring the attributes used in classification, the cost of misclassification, and the late penalties. I propose an innovative approach to tree induction that produces multiple candidate trees by allowing more than one splitting attribute at each node. The user can specify the maximum number of candidate trees to control the computational effort required to produce the final solution. In the tree-induction process, an allocation scheme dynamically distributes the given number of candidate trees among splitting attributes according to their estimated contributions to cost reduction. The algorithm finds the final tree by backtracking. An extensive experiment shows that the algorithm outperforms the top-down heuristic and can effectively obtain optimal or near-optimal decision trees without excessive computation time.
In the second problem, I use constraints to control the rate of tardy classifications for each label value. The goal is to find a cost-effective tree that also meets the completion time constraints. The constraints enrich the decision tree problem but also make it hard to develop an efficient solution algorithm, because conventional tree algorithms based on the “divide-and-conquer” strategy are not workable. I develop a novel algorithm that relaxes the completion time constraints and iteratively solves a series of cost-sensitive decision tree problems under systematically generated late penalties. An extensive numerical experiment conducted in this study shows that the proposed algorithm is effective in finding an optimal or near-optimal solution.
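The total cost used as the objective in the first problem can be illustrated with a small sketch. This is not the dissertation's algorithm; it only evaluates a fixed tree, and it rests on assumptions the abstract does not state: attributes are measured one after another so completion time is the sum of measuring times along the path, the late penalty is a flat amount per true label, and all names (`Node`, `subject_cost`, `total_cost`) and all numbers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """A tree node: a leaf carries `label`, an internal node carries `attribute`."""
    label: Optional[str] = None
    attribute: Optional[str] = None
    children: Dict[str, "Node"] = field(default_factory=dict)


def subject_cost(tree, values, true_label, measure_cost, measure_time,
                 misclass_cost, deadline, late_penalty):
    """Cost of classifying one subject: measurement + misclassification + late penalty."""
    node, cost, elapsed = tree, 0.0, 0.0
    while node.label is None:
        # Assumption: measurements are sequential, so costs and times add up.
        cost += measure_cost[node.attribute]
        elapsed += measure_time[node.attribute]
        node = node.children[values[node.attribute]]
    if node.label != true_label:
        cost += misclass_cost[(true_label, node.label)]
    # The deadline depends on the subject's true label (assumed flat penalty).
    if elapsed > deadline[true_label]:
        cost += late_penalty[true_label]
    return cost


def total_cost(tree, subjects, **params):
    """Sum of per-subject costs over (attribute values, true label) pairs."""
    return sum(subject_cost(tree, v, y, **params) for v, y in subjects)


# Hypothetical two-level tree: measure "temp" first, then "blood" if temp is high.
tree = Node(attribute="temp", children={
    "normal": Node(label="mild"),
    "high": Node(attribute="blood", children={
        "pos": Node(label="severe"),
        "neg": Node(label="mild"),
    }),
})
params = dict(
    measure_cost={"temp": 1, "blood": 5},
    measure_time={"temp": 1, "blood": 3},
    misclass_cost={("severe", "mild"): 50, ("mild", "severe"): 10},
    deadline={"severe": 2, "mild": 10},
    late_penalty={"severe": 20, "mild": 1},
)
subjects = [
    ({"temp": "high", "blood": "pos"}, "severe"),    # correct but late
    ({"temp": "normal", "blood": "pos"}, "severe"),  # on time but misclassified
    ({"temp": "high", "blood": "neg"}, "mild"),      # correct and on time
]
print(total_cost(tree, subjects, **params))
```

In this toy data the severe case that is measured fully is classified correctly but misses its deadline, while the one classified early is on time but wrong, which is the trade-off the total cost is meant to balance.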

Degree

Ph.D.

Advisors

Tang, Purdue University.

Subject Area

Management|Computer science
