On new approaches for variable selection under single index model and DNA methylation status calling

Longjie Cheng, Purdue University


This thesis consists of two main components: a regularization-based variable selection method for the single index model and a novel classification-based method for DNA methylation status calling for bisulphite-sequencing data. The single index model is an intuitive extension of the linear regression model, and it has become increasingly popular due to its modeling flexibility. As with the linear regression model, the set of predictors for the single index model can contain a large number of irrelevant variables, so it is important to select the relevant variables when fitting the model. However, the problem of variable selection for the high-dimensional single index model is not well settled in the literature. In the first part of this thesis, we combine the idea of applying cubic B-splines for estimating the single index model with the idea of using the family of smooth integration of counting and absolute deviation (SICA) penalty functions for variable selection. Based on this combination, a new method is proposed to simultaneously perform parameter estimation and model selection for the single index model. This method is referred to as the B-spline and SICA method for the single index model, or BS-SIM for short. Since LASSO is a limiting case of SICA, the proposed BS-SIM framework can also be applied if one prefers LASSO. A coordinate descent algorithm is developed to implement BS-SIM efficiently. Moreover, we establish the regularity conditions under which BS-SIM consistently estimates the parameter and selects the true model. Simulations with various settings and a real data analysis are conducted to demonstrate the estimation accuracy, selection consistency and computational efficiency of BS-SIM. In addition, we briefly discuss the problem of estimating the single index model with our framework when linear equality and inequality constraints are imposed.
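To illustrate why LASSO appears as a limiting case of SICA, the following minimal sketch evaluates a SICA-type penalty. The abstract does not state the formula, so the parameterization used here, lam * (a + 1)|beta| / (a + |beta|) with shape parameter a > 0, is an assumption taken from the form commonly given for SICA, and the function name and arguments are hypothetical. Under this form the penalty approaches the LASSO penalty lam * |beta| as a grows large, and approaches a counting penalty on nonzero coefficients as a shrinks toward zero.

```python
def sica_penalty(beta, a, lam=1.0):
    """Assumed SICA form: lam * (a + 1) * |beta| / (a + |beta|), with a > 0.

    Large a  -> close to lam * |beta| (LASSO-like, absolute deviation).
    Small a  -> close to lam * 1{beta != 0} (counting nonzero coefficients).
    """
    if a <= 0:
        raise ValueError("shape parameter a must be positive")
    t = abs(beta)
    return lam * (a + 1.0) * t / (a + t)


if __name__ == "__main__":
    # Compare the penalty on the same coefficient across shape parameters.
    for a in (1e-4, 1.0, 1e4):
        print(a, sica_penalty(2.0, a))
```

Intermediate values of a interpolate between the two extremes, which is what lets a single framework cover both LASSO-style and near-L0 selection.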
With the advent of high-throughput sequencing technology, bisulphite-sequencing based DNA methylation profiling methods have emerged as the most promising approaches due to their single-base resolution and genome-wide coverage. Nevertheless, statistical methods for analyzing this type of methylation data are not well developed. Although the most widely used proportion-based estimation method is simple and intuitive, it is not statistically adequate for dealing with the various sources of noise in bisulphite-sequencing data. Furthermore, it is not biologically satisfactory in applications that require binary methylation status calls. In the second part of this thesis, we consider the problem of DNA methylation status calling. A Binomial mixture model is used to characterize bisulphite-sequencing data, and based on this model, we propose a classification-based procedure, called the Methylation Status Calling (MSC) procedure, to make binary methylation status calls. The MSC procedure is optimal in the sense of maximizing the overall correct allocation rate, and its false discovery rate (FDR) and false non-discovery rate (FNDR) can be estimated. In order to control the FDR at any given level, we further develop an FDR-controlled MSC (FMSC) procedure, which combines a local false discovery rate (Lfdr) based adaptive procedure with the MSC procedure. Both a simulation study and a real data application are carried out to examine the performance of the proposed procedures. The simulation study shows that the estimates of the FDR and FNDR of the MSC procedure are appropriate. It also demonstrates that the FMSC procedure is valid in controlling the FDR at a prespecified level and is more powerful than the individual Binomial testing procedure. In the real data application, the MSC procedure exhibits an estimated FDR of 0.1426 and an estimated FNDR of 0.0067, and the overall correct allocation rate is more than 0.97. These results suggest the effectiveness of the proposed procedures.
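The classification idea above can be sketched as follows. This is not the thesis's MSC or FMSC implementation: it assumes a two-component Binomial mixture with known parameters, and the names pi1 (proportion of methylated sites), p0 and p1 (probabilities of a methylated read at unmethylated and methylated sites) are hypothetical. The first rule calls a site methylated when its posterior null probability is below 0.5, which is the standard Bayes rule for maximizing the correct allocation rate. The second is a generic Lfdr step-up rule that calls the largest set of sites whose average Lfdr stays at or below a target level alpha.

```python
from math import comb


def binom_pmf(x, n, p):
    """Binomial probability of x methylated reads out of n total reads."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def lfdr(x, n, pi1, p0, p1):
    """Posterior probability that a site is unmethylated (the null).

    Assumes 0 < p0, p1 < 1 and 0 < pi1 < 1 so the denominator is positive.
    """
    f0 = (1 - pi1) * binom_pmf(x, n, p0)
    f1 = pi1 * binom_pmf(x, n, p1)
    return f0 / (f0 + f1)


def bayes_calls(counts, pi1, p0, p1):
    """Call a site methylated when the null posterior is below 0.5."""
    return [lfdr(x, n, pi1, p0, p1) < 0.5 for x, n in counts]


def lfdr_adaptive_calls(counts, pi1, p0, p1, alpha):
    """Generic Lfdr step-up rule (an assumption, not the thesis's exact FMSC).

    Sort sites by Lfdr and call the largest k sites whose running
    mean Lfdr is at most alpha.
    """
    scores = [lfdr(x, n, pi1, p0, p1) for x, n in counts]
    order = sorted(range(len(scores)), key=scores.__getitem__)
    running, k = 0.0, 0
    for rank, i in enumerate(order, start=1):
        running += scores[i]
        if running / rank <= alpha:
            k = rank
    calls = [False] * len(scores)
    for i in order[:k]:
        calls[i] = True
    return calls


if __name__ == "__main__":
    # Each pair is (methylated reads, total reads) at one site; made-up data.
    sites = [(9, 10), (1, 10), (5, 10)]
    print(bayes_calls(sites, pi1=0.5, p0=0.05, p1=0.9))
    print(lfdr_adaptive_calls(sites, pi1=0.5, p0=0.05, p1=0.9, alpha=0.01))
```

In practice the mixture parameters would be estimated from the data rather than fixed, and tightening alpha makes the second rule more conservative than the plain 0.5 threshold.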




Zhu, Purdue University.
