Nonparametric variable selection and dimension reduction methods and their applications in pharmacogenomics
Abstract
In many fields it is now common to collect large volumes of data with a very large number of variables but only a small or moderate number of samples. For example, in the analysis of genomic data, the number of genes can range from tens of thousands to several million, whereas the number of samples ranges from several hundred to thousands. Pharmacogenomics, the example of genomic data analysis considered here, uses whole-genome genetic information to predict individuals' drug response. Because whole-genome data are high dimensional and their relationships to drug response are complicated, we develop a variety of nonparametric methods, including variable selection using local regression and extended dimension reduction techniques, to detect nonlinear patterns in the relationship between genetic variants and clinical response. High dimensional data analysis has become a popular research topic in the statistics community in recent years. However, the nature of high dimensional data makes many traditional statistical methods fail, because most methods rely on the assumption that the sample size n is larger than the variable dimension p. Consequently, variable selection or dimension reduction is often the first step in high dimensional data analysis. A second important issue is the choice of an appropriate statistical modeling strategy for conducting variable selection or dimension reduction. Our studies have found that the traditional parametric linear model might not work well for detecting nonlinear relationships between predictors and response. The limitations of the linear model and other parametric statistical approaches motivate us to consider nonparametric and nonlinear models for variable selection and dimension reduction. The thesis is composed of two major parts.
In the first part, we develop a nonparametric predictive model of the response based on a small number of predictors, which are selected by a nonparametric forward variable selection procedure. We also propose strategies to identify subpopulations with enhanced treatment effects. In the second part, we develop an alternating least squares method to extend the classical Sliced Inverse Regression (SIR) [Li, 1991] to the context of high dimensional data. Both methods are demonstrated through simulation studies and a pharmacogenomics study of bortezomib in multiple myeloma [Mulligan et al., 2007]. The proposed methods perform favorably compared with existing methods in the literature.
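For readers unfamiliar with the classical SIR that the second part extends, the following is a minimal sketch of the standard procedure in the low dimensional setting (n larger than p). It is not the alternating least squares extension developed in the thesis, which the abstract does not specify. The function name, the default of 10 slices, and the use of NumPy are assumptions made for illustration only.

```python
import numpy as np

def sliced_inverse_regression(X, y, n_slices=10, n_directions=1):
    """Classical SIR sketch (hypothetical helper), assuming n > p and p >= 2.

    Returns a (p, n_directions) array whose unit-norm columns estimate
    the effective dimension reduction directions.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape

    # Standardize the predictors: Z = (X - mean) @ cov^{-1/2}.
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = (X - mean) @ inv_sqrt

    # Partition the observations into slices of roughly equal size by sorted response.
    order = np.argsort(y)
    slices = np.array_split(order, n_slices)

    # Weighted covariance of the slice means of Z.
    M = np.zeros((p, p))
    for idx in slices:
        slice_mean = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(slice_mean, slice_mean)

    # Leading eigenvectors of M, mapped back to the original predictor scale.
    _, vecs = np.linalg.eigh(M)  # eigenvalues come back in ascending order
    top = vecs[:, ::-1][:, :n_directions]
    B = inv_sqrt @ top
    return B / np.linalg.norm(B, axis=0)
```

The eigendecomposition of the sample covariance requires it to be nonsingular, which fails when p exceeds n; this is the limitation that motivates extending SIR to high dimensional data.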
Degree
Ph.D.
Advisors
Xie, Purdue University.
Subject Area
Statistics