Machine Learning Approaches to Reveal Discrete Signals in Gene Expression

Changlin Wan, Purdue University

Abstract

Gene expression is an intricate process that determines cell types and functions in metazoans. Most of its regulation is communicated through discrete signals, such as whether the DNA helix is open or whether an enzyme binds to its target. Understanding the regulatory signals behind selective expression is essential to a full comprehension of biological mechanisms and complex biological systems. In this research, we seek to reveal the discrete signals in gene expression using novel machine learning approaches. Specifically, we focus on two types of data: chromatin conformation capture (3C) and single-cell RNA sequencing (scRNA-seq). To identify potential regulators, we use a new hypergraph neural network to predict genome interactions, and we find that gene co-regulation may result from shared enhancer elements. To reveal discrete expression states from scRNA-seq data, we propose a novel model called LTMG that accounts for biological noise and shows better goodness of fit than existing models. Next, we apply Boolean matrix factorization to find co-regulation modules from the identified expression states, revealing general properties of cancer cells across different patients. Lastly, to find more reliable modules, we analyze the bias in the data and propose BIND, the first algorithm to quantify the column- and row-wise bias in a binary matrix.

The first chapter introduces the background of the thesis. The second chapter formulates the genome interaction prediction task as a hyperedge prediction problem and proposes HIGNN, a theoretically driven neural network that achieved a 30% performance increase compared with other methods. We then seek to identify discrete gene expression states: the third chapter proposes a left truncated mixture Gaussian model (LTMG) that retrieves state information from single-cell RNA sequencing data. The fourth and fifth chapters introduce fast and efficient Boolean matrix and tensor factorization methods to identify functional patterns from the expression states. The sixth and seventh chapters further discuss the bias issue in binary data and propose the first bias-aware Boolean matrix factorization method, which mitigates the impact of row- and column-wise bias in a binary matrix.
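
As a small illustration of the Boolean matrix factorization setting mentioned in the abstract, the sketch below computes a Boolean matrix product and counts how many entries of a binary matrix it fails to reproduce. This is not the dissertation's algorithm: the function names, the toy cell-by-gene matrix, and the two "modules" are hypothetical and chosen only to show what a low-rank Boolean reconstruction looks like.

```python
def boolean_product(A, B):
    """Boolean matrix product: C[i][j] = OR over k of (A[i][k] AND B[k][j])."""
    inner = len(B)
    cols = len(B[0])
    return [
        [int(any(row[k] and B[k][j] for k in range(inner))) for j in range(cols)]
        for row in A
    ]


def reconstruction_error(X, A, B):
    """Count the entries where X differs from the Boolean product of A and B."""
    C = boolean_product(A, B)
    return sum(x != c for row_x, row_c in zip(X, C) for x, c in zip(row_x, row_c))


# Hypothetical toy data: 4 cells x 5 genes, 1 = gene is in an "expressed" state.
X = [
    [1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

# Assumed rank-2 factors: A assigns cells to modules, B assigns genes to modules.
A = [
    [1, 0],
    [1, 1],
    [0, 1],
    [0, 0],
]
B = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
]

print(boolean_product(A, B))
print(reconstruction_error(X, A, B))
```

Because the product uses OR rather than addition, a cell that belongs to both modules (the second row of A) still yields entries of 1, not 2, which is what distinguishes Boolean factorization from ordinary matrix factorization on binary data.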

Degree

Ph.D.

Advisors

Boutin, Purdue University.

Subject Area

Artificial intelligence|Bioinformatics|Genetics
