On the Interplay Between Statistical Concepts and Computational Models in Omics Applications

Emery T Goossens, Purdue University

Abstract

Technological advancements have led to the generation of enormous amounts of data. To capitalize on this trend, however, both computational and statistical challenges must be tackled. While computational efficiency is important, the interpretability of models and algorithms is essential to ensuring the validity of any conclusions drawn. Nowhere is this clearer than in the case of biomedical data, where inferences drawn from large datasets are used to inform future directions of research, diagnose diseases, and generate leads for the development of new pharmaceuticals. This work examines the interplay between statistical concepts and computational models in three applications: quantifying protein expression in fluorescent images, classifying somatic mutations in cancer, and combining p-values computed from genomic summary statistics. Across these applications, there are three recurring themes: accounting for technical and biological variation in data processing, evaluating the performance of a model in its end use case, and integrating results with outside data. Within these applications and themes, many statistical concepts are employed, including Bayes' theorem and type I error rate control, alongside computational models such as convolutional neural networks and Monte Carlo sampling algorithms. The results of these investigations inform much broader application areas such as biomedical imaging, modeling genomic sequences, and hypothesis testing in high dimensions. Specific contributions in the application of convolutional neural networks include demonstrating their ability to replicate the quantification of protein expression images from various manually generated or deterministic label sets, as well as the creation of a modeling framework for sequencing-based cancer diagnostics and the prioritization of unvalidated somatic mutations.
In the area of hypothesis testing, novel algorithms are proposed that enable the use of a powerful and interpretable technique of combining p-values in the large-scale setting of genome-wide association studies.

Degree

Ph.D.

Advisors

Rao, Purdue University.

Subject Area

Statistics|Artificial intelligence|Genetics
