A framework for practical Bayesian analysis of high-dimensional genomic data

Sanvesh Srivastava, Purdue University


Genomic data from high-throughput technologies, such as microarrays and next-generation sequencing (NGS), are high-dimensional and contain sparse genomic effects that interact. As the cost of these technologies decreases and large-scale epidemiological projects are launched, genomic data are becoming increasingly large and complex, which poses nontrivial computational and statistical challenges for classical inferential and modeling approaches. To address these issues, it is demonstrated that Latent Process Decomposition (LPD) is a general and extensible hierarchical Bayesian modeling framework for analyzing complex high-dimensional genomic data. LPD is extended to high-dimensional count data and to high-dimensional Gaussian time series data. The trade-off for the flexibility gained from Bayesian methods is that exact inference for the high-dimensional genomic effects is computationally intractable. This intractability is resolved via novel applications of approximate variational Bayesian methods. The performance of the variational methods is compared empirically with that of equivalent collapsed Gibbs samplers. Using Sampling Importance Resampling, it is demonstrated that the variational posterior densities can be used to sample from the true posterior densities of the parameters in the generative models of LPD and its extensions. The novel extensions of LPD are applied to simulated and publicly available (microarray and NGS) data, and the results are compared with those of existing approaches for analyzing genomic data. LPD is shown to be a useful and extensible framework for identifying genes suitable for further exploration. Although LPD and its extensions are applied in a genomic context, the ideas are easily generalized. All the proposed methods are implemented as an open-source R package.
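
The Sampling Importance Resampling step mentioned above can be illustrated with a minimal one-dimensional sketch. This is not the dissertation's implementation or the API of its R package: the target density, the Gaussian proposal standing in for a variational posterior, and the names log_target, log_proposal, and sir are all assumptions chosen for illustration.

```python
import math
import random


def log_target(x):
    # Hypothetical unnormalized log density playing the role of the true
    # posterior: Normal(mean=1.0, sd=0.5), normalizing constant omitted.
    return -0.5 * ((x - 1.0) / 0.5) ** 2


def log_proposal(x, mu, sigma):
    # Log density of the Gaussian proposal, standing in for a variational
    # approximation to the posterior (an assumption of this sketch).
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))


def sir(n_draws, n_keep, rng, mu=0.8, sigma=1.0):
    # Step 1: sample from the proposal.
    draws = [rng.gauss(mu, sigma) for _ in range(n_draws)]
    # Step 2: importance weights target/proposal, computed on the log scale
    # and shifted by the maximum for numerical stability.
    log_w = [log_target(x) - log_proposal(x, mu, sigma) for x in draws]
    max_log_w = max(log_w)
    weights = [math.exp(lw - max_log_w) for lw in log_w]
    # Step 3: resample with replacement, with probability proportional
    # to the weights.
    return rng.choices(draws, weights=weights, k=n_keep)


rng = random.Random(0)
samples = sir(n_draws=20000, n_keep=2000, rng=rng)
sample_mean = sum(samples) / len(samples)
sample_sd = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / (len(samples) - 1))
print(f"resampled mean = {sample_mean:.3f}, sd = {sample_sd:.3f}")
```

In this toy setting the resampled draws should be centered near the target's mean and spread rather than the proposal's. The proposal is deliberately wider than the target, since importance weights behave poorly when the proposal has lighter tails than the density being approximated.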




Rebecca W. Doerge, Purdue University.

Subject Area

Biology, Genetics|Statistics
