Some theoretical and methodological aspects of multiple testing, model selection and related areas

Jyotishka Datta, Purdue University

Abstract

Driven by applications in genomics, finance, astronomy and other fields, high-dimensional statistical inference has recently become a major focus of research and practice. Many large-scale applications involve either multiple testing, where one simultaneously tests many hypotheses in which a small proportion of true signals is hidden among a large number of noise observations, or sparse regression, where the number of covariates is much larger than the sample size. Theoretical, methodological and computational work in this area has developed tremendously and is still ongoing. With the accelerating growth in the size of datasets, the need for methodological advances and the issues of computation and scalability have come to the fore. In this thesis, we study some theoretical and methodological aspects of multiple testing and model selection for high-dimensional data, with a focus on both theory and application to real data problems.

Chapter 2 studies the asymptotic properties of the Bayes risk for the Horseshoe prior in the context of multiple testing. We provide a theoretical footing for the use of a continuous shrinkage prior in multiple testing problems and prove that the Bayes risk for the Horseshoe prior attains the Bayes risk of the oracle up to O(1), with the constant in the Horseshoe risk close to the constant in the oracle. We use the same asymptotic framework as Bogdan et al. [2011], who introduced the Bayes oracle in the context of multiple testing and provided conditions under which the Benjamini-Hochberg procedure attains the risk of the Bayes oracle. Chapter 3 focuses on two important but relatively less studied aspects of multiple testing, namely the estimation of the false negative rate and cross-validation. We propose a new Empirical Bayes estimator of the false negative rate based on the estimated parameters of the two-groups model.
In Section 3.2, we compare the accuracy of this new estimator with a few standard, popular estimators based on the estimated proportion of null effects. In Section 3.3, we discuss the problem of internal cross-validation due to Majumder et al. [2009] in the context of multiple testing. In particular, we shed some light on the false discovery rate achieved when the outcomes of the Benjamini-Hochberg procedure are internally cross-validated using the half-sample approach of Majumder et al. [2009]. Chapter 4 describes two recent approaches to optimality in the Bayesian study of linear models, namely the axiomatic approach of Bayarri et al. [2012] and the direct approach based on continuous shrinkage priors, beginning with the Horseshoe prior due to Carvalho et al. [2010]. Our primary interest is in the performance of the optimal prior for model selection due to Bayarri et al. [2012] relative to the frequentist method for variable selection based on the LASSO and to the shrinkage priors. We also compare these two approaches with respect to three criteria: accuracy of estimation, out-of-bag prediction, and variable selection, which is closely related to model selection in terms of the inclusion probabilities for the unknown parameters. In Chapter 5, we explore a few unusual aspects of the bootstrap, including its application to high-dimensional problems such as estimation of the inverse covariance matrix and hypothesis testing in high dimensions.

Degree

Ph.D.

Advisors

Jayanta K. Ghosh, Purdue University; Michael Y. Zhu, Purdue University

Subject Area

Statistics
