Nonparametric clustering and model selection with application in bioinformatics
Recent developments in biotechnology have generated growing volumes of high-dimensional, highly correlated data. Extracting useful information from these data is a challenge for modern statisticians and biologists. My dissertation focuses on developing methods and theory for analyzing two types of such data: temporal gene expression data and DNA sequence data. The dissertation consists of two parts. The first part focuses on the development of a model-based functional clustering (MFclust) method and its application to temporal microarray data. The second part discusses the regularized sliced inverse regression (RSIR) approach and its use in motif discovery from DNA sequence data. In Chapter 2, we discuss the MFclust method, which is based on the assumption that gene expression profiles are realizations of a mixture of Gaussian processes. For each Gaussian component, a functional mixed-effects model is employed to model the mean curve and the variance-covariance structure. The model parameters are updated using a variant of the Monte Carlo EM algorithm. A fully Bayesian version of the functional clustering approach is also given at the end of this chapter. The performance of the method is demonstrated using simulated and real data. In Chapter 3, we propose the two-step model selection procedure RSIR. The key idea of this approach is to select models after dimension reduction. RSIR generalizes SIR, proposed in Li (1991), to data of high dimensionality and high multicollinearity. The RSIR approach is demonstrated using simulated and real data. In general, RSIR has lower false selection and false rejection rates than SIR and other model selection procedures such as stepwise regression.
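To make the dimension-reduction step concrete, the following is a minimal sketch of classical sliced inverse regression in the sense of Li (1991), with an optional ridge term added to the covariance matrix. The ridge term is only one plausible way to stabilize the estimate under high dimensionality and multicollinearity; it is an assumption for illustration and is not claimed to be the regularization used in RSIR. The function name `sir_directions` and its parameters are likewise hypothetical.

```python
import numpy as np

def sir_directions(X, y, n_slices=10, n_directions=1, ridge=0.0):
    """Estimate SIR directions; `ridge` > 0 adds ridge * I to the covariance.

    Illustrative sketch only: the ridge penalty is an assumed form of
    regularization, not necessarily the one used in the dissertation.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape

    # Center the predictors and form the sample covariance matrix.
    Xc = X - X.mean(axis=0)
    sigma = Xc.T @ Xc / n

    # Slice the observations into roughly equal groups by sorted response.
    order = np.argsort(y)
    slices = np.array_split(order, n_slices)

    # Weighted covariance of the slice means (the "inverse regression" curve).
    M = np.zeros((p, p))
    for idx in slices:
        m = Xc[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)

    # Solve the generalized eigenproblem M b = lambda (sigma + ridge I) b
    # by whitening with the Cholesky factor of the regularized covariance.
    sigma_reg = sigma + ridge * np.eye(p)
    L = np.linalg.cholesky(sigma_reg)
    L_inv = np.linalg.inv(L)
    A = L_inv @ M @ L_inv.T
    evals, evecs = np.linalg.eigh(A)

    # Keep the leading eigenvectors and map them back to the original scale.
    top = np.argsort(evals)[::-1][:n_directions]
    B = L_inv.T @ evecs[:, top]
    B /= np.linalg.norm(B, axis=0)
    return B, evals[top]
```

With `ridge=0.0` this reduces to ordinary SIR and requires a nonsingular sample covariance. A positive `ridge` keeps the Cholesky factorization well defined when predictors are nearly collinear or when there are more predictors than observations. A second step, such as selecting variables based on the estimated directions, would follow in a two-step procedure, but it is not sketched here.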
Zhu, Purdue University.