A Data-Driven Approach To Genetics

Myson C Burch, Purdue University

Abstract

With the completion of the Human Genome Project and many additional efforts since, there is an abundance of genetic data that can be leveraged to revolutionize healthcare. Significant efforts are now underway to develop state-of-the-art techniques that reveal connections between genetics and complex diseases such as diabetes, heart disease, or common psychiatric conditions, which depend on multiple genes interacting with environmental factors. These methods help pave the way towards diagnosis, cure, and ultimately prediction and prevention of complex disorders. As part of this effort, we address high-dimensional genomics-related questions through mathematical modeling, statistical methodologies, combinatorics, and scalable algorithms. More specifically, we develop innovative techniques at the intersection of technology and life sciences, using biobank-scale data from genome-wide association studies (GWAS) and machine learning in an effort to better understand human health and disease.

The underlying principle behind GWAS is a test for association between the genotyped variants of each individual and the trait of interest. GWAS have been used extensively to estimate the signed effects of trait-associated alleles and to map genes to disorders; over the past decade, about 10,000 strong associations between genetic variants and one or more complex traits have been reported. One of the key challenges in GWAS is population stratification, which can lead to spurious genotype-trait associations. Our work proposes a simple clustering-based approach that corrects for stratification better than existing methods. This method accounts for linkage disequilibrium (LD) when computing the distances between individuals in a sample. Our approach, called CluStrat, performs Agglomerative Hierarchical Clustering (AHC) using a regularized Mahalanobis distance-based genetic relationship matrix (GRM), which captures the population-level covariance (LD) matrix for the available genotype data.

Linear mixed models (LMMs) have been a popular and powerful method for conducting GWAS in the presence of population structure. However, LMMs are computationally expensive relative to simpler techniques. We implement matrix sketching in LMMs (MaSk-LMM) to mitigate the more expensive computations. Matrix sketching is an approximation technique in which random projections compress the original dataset into a significantly smaller one that still preserves some properties of the original, up to a guaranteed approximation ratio. This technique applies naturally to problems in genetics, where a large biobank can be treated as a matrix whose rows represent samples and whose columns represent single-nucleotide polymorphisms (SNPs). These matrices are very large because of the number of individuals and markers in biobanks, and they can benefit from matrix sketching. Our approach tackles the bottleneck of LMMs directly by sketching the samples of the genotype matrix as well as sketching the markers during the computation of the relatedness, or kinship, matrix (the GRM).

Predictive analytics have been used to improve healthcare by reinforcing decision-making, enhancing patient outcomes, and providing relief for the healthcare system. These methods help pave the way towards diagnosis, cure, and ultimately prediction and prevention of complex disorders. The prevalence of these complex diseases varies greatly around the world. Understanding the basis of this difference in prevalence can help disentangle the interactions among the different factors that cause complex disorders and identify groups of people who may be at greater risk of developing certain disorders. This could become the basis for implementing early intervention strategies for populations at higher risk, with significant benefits for public health.

This dissertation broadens our understanding of empirical population genetics. It proposes a data-driven perspective on a variety of problems in genetics, such as confounding factors in genetic structure. It highlights current computational barriers in open problems in genetics and provides robust, scalable, and efficient methods to ease the analysis of genotype data.

Degree

Ph.D.

Advisors

Drineas, Purdue University.

Subject Area

Public health|Genetics|Artificial intelligence|Health sciences|Medicine|Statistics
