Group variable selection methods and their applications in analyses of genomic data

Lingmin Zeng, Purdue University

Abstract

Variable selection methods are powerful tools for analyzing massive high-dimensional data. In bioinformatics, they have often been applied to gene expression microarray data to reduce dimensionality and select important features. It is well known that genes participating in a common biological pathway or sharing a similar function can be very highly correlated. However, most of the available variable selection methods cannot handle such complicated interdependence among variables. We propose three new methods, via two different approaches, that select groups of variables in regression models. First, we propose two new selection algorithms, gLars and gRidge, which follow the forward selection procedure of LARS. These algorithms group and select variables at the same time and require no prior information on the group structure of the variables. The third method, SCAD_ℓ2, is a penalized regression method. Lasso, a popular regularization approach, uses an L1 penalty. Elastic net combines L1 and L2 penalties to incorporate group effects among the variables. However, both provide biased coefficient estimators, and this bias interferes with variable selection. Fan and Li (2001) proposed a non-concave penalty function called SCAD with many good properties, including unbiasedness. Our new method SCAD_ℓ2 combines the SCAD and L2 penalties, so it favors group effects while retaining the good properties of SCAD. Simulations show that the proposed methods often outperform existing variable selection methods, including Lasso, LARS, SCAD and elastic net, in terms of both reducing prediction error and preserving model sparsity, while yielding additional group information. We apply the proposed methods to gene expression microarray data and genetic variant (SNP) data. The group variable selection models are more appropriate than other existing methods for genomic data with complicated interdependent structures.
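
To make the penalties named in the abstract concrete, the following is a minimal Python sketch, not the dissertation's implementation. The SCAD penalty follows the standard piecewise form from Fan and Li (2001) with the conventional default a = 3.7. The exact form of the combined SCAD_ℓ2 penalty is an assumption here (a SCAD term plus a ridge term weighted by a hypothetical parameter `lam2`), since the abstract only states that the two penalties are combined. All function and parameter names are illustrative.

```python
def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty (Fan and Li, 2001) for a single coefficient.

    Linear near zero (like Lasso), quadratic in the middle,
    and constant for |beta| > a * lam.
    """
    b = abs(beta)
    if b <= lam:
        return lam * b
    if b <= a * lam:
        return (2 * a * lam * b - b * b - lam * lam) / (2 * (a - 1))
    return (a + 1) * lam * lam / 2


def lasso_penalty(coefs, lam1):
    """L1 penalty: grows without bound as coefficients grow."""
    return lam1 * sum(abs(b) for b in coefs)


def elastic_net_penalty(coefs, lam1, lam2):
    """L1 plus L2 penalty; the L2 term encourages group effects."""
    return lasso_penalty(coefs, lam1) + lam2 * sum(b * b for b in coefs)


def scad_l2_penalty(coefs, lam1, lam2, a=3.7):
    """ASSUMED form of a SCAD + L2 penalty: SCAD term plus ridge term."""
    scad_term = sum(scad_penalty(b, lam1, a) for b in coefs)
    return scad_term + lam2 * sum(b * b for b in coefs)
```

The SCAD term stops growing once a coefficient exceeds a·λ, so large coefficients are not shrunk further, whereas the L1 penalty keeps increasing linearly. This is the source of the bias in Lasso and elastic net that the abstract refers to.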

Degree

Ph.D.

Advisors

Jun, Purdue University.

Subject Area

Statistics
