Advanced Statistical Tests for Large-Scale Genomic Data Analysis

Yaowu Liu, Purdue University


Hypothesis testing is widely used in genetic studies to summarize statistical evidence from data. Genome-wide association studies (GWAS) examine a large number of genetic variants, e.g., single-nucleotide polymorphisms (SNPs), that may contribute to disease risk or other disease-related phenotypes. Characteristics of GWAS data such as strong correlation, high dimensionality, and large scale pose a variety of challenges for effective statistical analysis. In this thesis, we present two statistical developments concerning the power and computational efficiency of tests that are particularly useful in the recently proposed SNP-set analysis, which aims to detect associations between a phenotype and SNP-sets such as genes or pathways. In the first part of the thesis, we propose a new test that is based on the conditional effects of multiple SNPs and takes advantage of correlations among SNPs to improve the power of SNP-set analysis. We derive the limiting null distribution of the test statistic and the power of the test. Under appropriate conditions, the test is shown to be more powerful than the minimum p-value method, which is commonly used in GWAS. Through simulations and the analysis of a real GWAS data set, we demonstrate that the proposed test has an advantage over existing methods, including the higher criticism and sequence kernel association tests, when marginal effects are weak. The second part of the thesis focuses on p-value calculation for SNP-set tests, since GWAS involves a tremendous number of hypothesis tests and accurate p-value calculation is very challenging. Specifically, we consider three popular tests that are particularly powerful against sparse alternatives: the minimum p-value, higher criticism, and Berk-Jones tests. We propose a Gaussian approximation method that calculates p-values with accuracy similar to that of the permutation method while substantially reducing the computation time for screening a whole genome, e.g., from days to hours.
We also derive non-asymptotic bounds for the approximation errors under arbitrary dependence structures, which indicate that the approximation errors can vanish even if the number of covariates is exponentially larger than the sample size.
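As an illustration of the three sparse-alternative tests named above, the following minimal sketch computes the minimum p-value, higher criticism, and Berk-Jones statistics from a set of per-SNP p-values. It uses common textbook forms of these statistics, which may differ in detail from the versions studied in the thesis, and it does not implement the proposed Gaussian approximation for their p-values. The function names, the clipping constant, and the example p-values are assumptions made for this sketch only.

```python
import numpy as np

EPS = 1e-12  # assumption: clip p-values away from 0 and 1 to avoid division by zero


def min_p(pvals):
    """Minimum p-value statistic: the smallest p-value in the SNP-set."""
    return float(np.min(np.asarray(pvals, dtype=float)))


def higher_criticism(pvals):
    """Higher criticism in its basic form: the maximum standardized gap
    between the uniform quantile i/n and the i-th ordered p-value."""
    p = np.clip(np.sort(np.asarray(pvals, dtype=float)), EPS, 1 - EPS)
    n = p.size
    x = np.arange(1, n + 1) / n
    return float(np.max(np.sqrt(n) * (x - p) / np.sqrt(p * (1 - p))))


def berk_jones(pvals):
    """One-sided Berk-Jones statistic: n times the largest Bernoulli
    Kullback-Leibler divergence K(i/n, p_(i)) over indices with p_(i) < i/n."""
    p = np.clip(np.sort(np.asarray(pvals, dtype=float)), EPS, 1 - EPS)
    n = p.size
    x = np.arange(1, n + 1) / n
    first = x * np.log(x / p)
    second = np.zeros(n)
    inner = x < 1  # the (1 - x) term is zero at x = 1, so skip it there
    second[inner] = (1 - x[inner]) * np.log((1 - x[inner]) / (1 - p[inner]))
    kl = np.where(p < x, first + second, 0.0)
    return float(n * np.max(kl))


# Hypothetical p-values for a four-SNP set with one strong signal.
example_pvals = [0.001, 0.2, 0.5, 0.8]
stats = {
    "minP": min_p(example_pvals),
    "HC": higher_criticism(example_pvals),
    "BJ": berk_jones(example_pvals),
}
```

Turning any of these statistics into a p-value requires the null distribution of the statistic under the correlation among SNPs, which is the step the thesis addresses with its approximation method in place of permutation.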




Xie, Purdue University.


