Some new approaches in high-dimensional variable selection and regression

Zhongyin John Daye, Purdue University

Abstract

Variable selection and estimation for high-dimensional data have become topics of foremost importance in modern statistics, driven largely by the need to analyze massive data sets arising from recent technological advances (Fan and Li, 2006; Yu, 2007). For instance, bioengineering innovations such as functional MRI and gene microarray data have presented new statistical challenges. In many of these applications, we wish to improve prediction accuracy and ease interpretation by reducing the number of variables to obtain a parsimonious, or sparse, model. In this thesis, we study and propose several new methodologies for regression and variable selection under high dimensionality.

In the first part of the thesis, we propose the weighted fusion, a new penalized regression and variable selection method for data with correlated variables. The weighted fusion can incorporate information redundancy among correlated variables for estimation and variable selection. It is also useful when the number of predictors p is larger than the number of observations n, as it allows more than n variables to be selected in a principled way. Real-data and simulation examples show that the weighted fusion can improve variable selection and prediction accuracy.

In the second part of the thesis, we propose the covariance-thresholded lasso, which combines covariance regularization with penalized regression to improve variable selection and prediction accuracy for high-dimensional data. The covariance-thresholded lasso mitigates the excessive variability and rank deficiency of the sample covariance matrix used by the lasso by exploiting covariance sparsity. In high dimensions, where many predictors are independent or only weakly correlated, covariance sparsity is a natural assumption. Real-data and simulation examples indicate that the method can substantially improve performance.
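To make the idea concrete, the following is a minimal sketch, not the dissertation's actual estimator, of how covariance thresholding can be combined with the lasso: the lasso's coordinate-descent updates depend on the data only through the sample covariance matrix and the predictor-response correlations, so one can zero out small off-diagonal covariance entries before running the updates. The function name `cov_thresholded_lasso` and the choice of a fixed number of sweeps are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, lam):
    """Lasso soft-thresholding operator: shrinks z toward zero by lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def cov_thresholded_lasso(X, y, lam, tau, n_iter=200):
    """Illustrative sketch: lasso coordinate descent in which off-diagonal
    entries of the sample covariance with absolute value below tau are
    set to zero before the updates are run.  Hypothetical interface."""
    n, p = X.shape
    X = (X - X.mean(0)) / X.std(0)      # standardize predictors
    y = y - y.mean()                    # center response
    S = X.T @ X / n                     # sample covariance (correlation) matrix
    c = X.T @ y / n                     # predictor-response correlations
    S_t = np.where(np.abs(S) >= tau, S, 0.0)   # threshold small entries
    np.fill_diagonal(S_t, np.diag(S))          # keep the diagonal intact
    beta = np.zeros(p)
    for _ in range(n_iter):             # fixed number of coordinate sweeps
        for j in range(p):
            # partial residual correlation for coordinate j
            r_j = c[j] - S_t[j] @ beta + S_t[j, j] * beta[j]
            beta[j] = soft_threshold(r_j, lam) / S_t[j, j]
    return beta
```

With tau = 0 this reduces to an ordinary lasso coordinate descent; increasing tau sparsifies the covariance matrix that drives the updates, which is the regularization the abstract describes.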
In the third part of the thesis, we propose the ridge-lasso hybrid estimator (ridle), a new penalized regression method that estimates the coefficients of mandatory predictors while simultaneously performing selection among the others. The ridle is useful when some predictors are known to be significant from prior knowledge or must be retained for further analysis. We further propose the adaptive ridle, for use when good initial estimates are available. Through theoretical studies, we show that the ridle and the adaptive ridle can improve variable selection in regression with mandatory variables.
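The ridge-lasso hybrid idea can be sketched as a coordinate-descent procedure in which mandatory predictors receive a ridge update (shrunk but never set to zero) while the remaining predictors receive a lasso soft-thresholding update (and so can be dropped). This is a hedged illustration of the general mechanism only; the function name `ridle` and its parameters are assumptions, not the dissertation's actual estimator.

```python
import numpy as np

def soft_threshold(z, lam):
    """Lasso soft-thresholding operator: shrinks z toward zero by lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ridle(X, y, mandatory, lam_l1, lam_l2, n_iter=200):
    """Illustrative ridge-lasso hybrid sketch: ridge penalty (lam_l2) on
    mandatory predictors, lasso penalty (lam_l1) on all other predictors.
    Hypothetical interface."""
    n, p = X.shape
    X = (X - X.mean(0)) / X.std(0)      # standardize predictors
    y = y - y.mean()                    # center response
    S = X.T @ X / n                     # sample covariance (correlation) matrix
    c = X.T @ y / n                     # predictor-response correlations
    is_mandatory = np.zeros(p, dtype=bool)
    is_mandatory[list(mandatory)] = True
    beta = np.zeros(p)
    for _ in range(n_iter):             # fixed number of coordinate sweeps
        for j in range(p):
            # partial residual correlation for coordinate j
            r_j = c[j] - S[j] @ beta + S[j, j] * beta[j]
            if is_mandatory[j]:
                # ridge update: always kept in the model, never exactly zero
                beta[j] = r_j / (S[j, j] + lam_l2)
            else:
                # lasso update: may be thresholded exactly to zero
                beta[j] = soft_threshold(r_j, lam_l1) / S[j, j]
    return beta
```

The design choice illustrated here is the one the abstract describes: mandatory predictors are guaranteed to stay in the fitted model, while selection operates only on the remaining coefficients.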

Degree

Ph.D.

Advisors

Zhu, Purdue University.

Subject Area

Statistics
