Classification and variable selection for high dimensional multivariate binary data: AdaBoost-based new methods and a theory for the plug-in rule

Junyong Park, Purdue University

Abstract

We consider, from a theoretical standpoint, a classification problem in which all covariates are independent Bernoulli random variables X_{ji}, 1 ≤ i ≤ n and j = 0, 1; that is, each variable takes the value 0 or 1, recording the absence or presence of an event. The Bernoulli parameters are estimated by maximum likelihood and plugged into the optimal Bayes rule, yielding what is called the plug-in rule. This rule was applied to real DNA fingerprint data as well as to simulations in Wilbur et al. [2002] and was shown to classify well even when the independence assumption does not hold. The asymptotic performance of the plug-in rule is the primary object of this study. Since the number of variables, and hence the number of Bernoulli parameters, depends on the sample size n, indicating the need for increasingly complex models as n grows, the usual notion of consistency, i.e., convergence of estimates to fixed parameter values, is not applicable. We therefore work with triangular arrays and a suitably modified definition of consistency, called persistence, based on how close the performance of the plug-in rule comes to that of the classifier with known parameters p_{ji}, 1 ≤ i ≤ n and j = 0, 1. We present various cases in which the plug-in rule is or is not persistent. Under a sparsity condition, we show that the plug-in rule with well-chosen variables may overcome non-persistence, demonstrating that variable selection can be effective for high-dimensional data under sparsity. We also discuss the convergence rate of the plug-in rule over Sobolev-ball-type parameter spaces and show that the plug-in rule with selected variables can improve the convergence rate, so a simpler model may achieve better performance than the full model. Just as Bickel and Levina [2004] showed that a naive Bayes model can outperform the full model, our results underpin the well-known practical finding that a model with well-chosen variables may achieve a better prediction rate than the full model, especially for high-dimensional data. In addition to the theoretical study of the plug-in rule, we propose and study a new methodology for classification and variable selection based on AdaBoost. Our applications to real and simulated data suggest that the new methods perform considerably better than the plug-in rule. A theoretical study of the new methods remains to be done.
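To make the plug-in rule concrete, the following is a minimal Python sketch, not the dissertation's code: per-class Bernoulli parameters are estimated from the data and substituted into the Bayes rule that is optimal under independence. Equal class priors and a small Laplace-style adjustment of the estimates are assumptions added here so the log-odds stay finite; the abstract itself refers to the plain maximum likelihood estimates.

```python
import numpy as np

def fit_plug_in(X, y):
    """Estimate p_hat[j, i] ~ P(X_i = 1 | class j) from binary data.

    X: (n_samples, n_features) 0/1 array; y: labels in {0, 1}.
    The +0.5 / +1.0 adjustment (an assumption added for this sketch)
    keeps estimates strictly inside (0, 1); the pure MLE is the raw
    per-class proportion of ones.
    """
    p_hat = np.empty((2, X.shape[1]))
    for j in (0, 1):
        Xj = X[y == j]
        p_hat[j] = (Xj.sum(axis=0) + 0.5) / (len(Xj) + 1.0)
    return p_hat

def predict_plug_in(X, p_hat):
    """Classify by the independence Bayes rule with p_hat plugged in.

    Assign class 1 (equal priors assumed) when
      sum_i [ x_i log(p1i/p0i) + (1 - x_i) log((1-p1i)/(1-p0i)) ] > 0,
    which is linear in x: a weight per variable plus an intercept.
    """
    w = np.log(p_hat[1] / p_hat[0]) - np.log((1 - p_hat[1]) / (1 - p_hat[0]))
    b = np.log((1 - p_hat[1]) / (1 - p_hat[0])).sum()
    return (X @ w + b > 0).astype(int)

# Quick check on synthetic independent Bernoulli data (hypothetical example):
rng = np.random.default_rng(0)
p0, p1 = rng.uniform(0.2, 0.8, 50), rng.uniform(0.2, 0.8, 50)
X = np.vstack([(rng.random((100, 50)) < p0).astype(int),
               (rng.random((100, 50)) < p1).astype(int)])
y = np.repeat([0, 1], 100)
print("training accuracy:", (predict_plug_in(X, fit_plug_in(X, y)) == y).mean())
```

Because the plug-in discriminant is linear in the x_i, variable selection as discussed in the abstract amounts to zeroing out the weights of poorly estimated or uninformative coordinates.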

Degree

Ph.D.

Advisors

Ghosh, Purdue University.

Subject Area

Statistics
