Model-Free Variable Screening, Sparse Regression Analysis and Other Applications with Optimal Transformations

Qiming Huang, Purdue University


Variable screening and variable selection methods play important roles in modeling high-dimensional data. Variable screening is the process of filtering out irrelevant variables, with the aim of reducing the dimensionality from ultrahigh to high while retaining all important variables. Variable selection is the process of selecting a subset of relevant variables for use in model construction. The main theme of this thesis is the development of variable screening and variable selection methods for high-dimensional data analysis. In particular, we present two related methods for variable screening and selection under a unified framework based on optimal transformations. In the first part of the thesis, we develop a maximum correlation-based sure independence screening (MC-SIS) procedure to screen features in an ultrahigh-dimensional setting. We show that MC-SIS possesses the sure screening property without imposing model or distributional assumptions on the response and predictor variables. MC-SIS is thus a model-free method, in contrast to existing model-based sure independence screening methods in the literature. In the second part of the thesis, we develop a novel method called SParse Optimal Transformations (SPOT) to simultaneously select important variables and explore relationships between the response and predictor variables in high-dimensional nonparametric regression analysis. The optimal transformations identified by SPOT are interpretable and can also be used for response prediction. We further show that SPOT achieves consistency in both variable selection and parameter estimation. Besides variable screening and selection, we also consider other applications of optimal transformations. In the third part of the thesis, we propose several dependence measures, for both univariate and multivariate random variables, based on maximum correlation and B-spline approximation.
B-spline based Maximum Correlation (BMC) and Trace BMC (T-BMC) are introduced to measure dependence between two univariate random variables. As extensions of BMC and T-BMC, Multivariate BMC (MBMC) and Trace Multivariate BMC (T-MBMC) are proposed to measure dependence between multivariate random variables. We give convergence rates for both BMC and T-BMC. Numerical simulations and real data applications are used to demonstrate the performance of the proposed methods. The results show that the proposed methods outperform existing ones and can serve as effective tools in practice.
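The idea of screening by a spline-approximated maximum correlation can be illustrated with a minimal sketch. It is not the estimator developed in the thesis: it assumes continuous variables, swaps in a truncated-power cubic spline basis (which spans the same function space as B-splines with the same knots) to avoid extra dependencies, and takes the first canonical correlation between the two basis expansions as the dependence score. The function names (`spline_basis`, `max_correlation`, `screen`), the knot count, and the simulated data are all hypothetical choices made for illustration.

```python
import numpy as np

def spline_basis(x, n_knots=4, degree=3):
    """Centered truncated-power spline basis with knots at empirical quantiles."""
    x = np.asarray(x, dtype=float)
    x = (x - x.mean()) / x.std()                  # standardize for numerical stability
    probs = np.linspace(0, 1, n_knots + 2)[1:-1]  # interior quantile levels
    knots = np.quantile(x, probs)
    cols = [x ** d for d in range(1, degree + 1)]
    cols += [np.clip(x - k, 0, None) ** degree for k in knots]
    B = np.column_stack(cols)
    return B - B.mean(axis=0)                     # centering removes the constant function

def _orthonormal(B, tol=1e-8):
    """Orthonormal basis of the column space of B, dropping near-null directions."""
    U, s, _ = np.linalg.svd(B, full_matrices=False)
    return U[:, s > tol * s[0]]

def max_correlation(x, y, **kw):
    """First canonical correlation between spline expansions of x and y."""
    Qx = _orthonormal(spline_basis(x, **kw))
    Qy = _orthonormal(spline_basis(y, **kw))
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(min(s[0], 1.0))

def screen(X, y, d):
    """Rank predictors by marginal score and keep the indices of the top d."""
    scores = np.array([max_correlation(X[:, j], y) for j in range(X.shape[1])])
    keep = np.argsort(scores)[::-1][:d]
    return keep, scores

# Simulated example: only the first two predictors matter, both nonlinearly.
rng = np.random.default_rng(0)
n, p = 400, 50
X = rng.standard_normal((n, p))
y = X[:, 0] ** 2 + 2 * np.sin(2 * X[:, 1]) + 0.5 * rng.standard_normal(n)
keep, scores = screen(X, y, d=5)
print("retained predictors:", sorted(keep.tolist()))
```

In this simulated setting the quadratic effect of the first predictor has essentially zero Pearson correlation with the response, which is the kind of dependence a transformation-based score is meant to pick up. The number retained, `d`, is a tuning choice in this sketch rather than a value taken from the thesis.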




Zhu, Purdue University.


