Searching the Edges of the Protein Universe Using Data Science

Mengmeng Zhu, Purdue University

Abstract

Data science uses the latest techniques in statistics and machine learning to extract insights from data. With the increasing amount of protein data, a number of novel research approaches have become feasible. Micropeptides are an emerging field in the protein universe. They are small proteins with <= 100 amino acid residues (aa) and are translated from small open reading frames (sORFs) of <= 303 base pairs (bp). Traditionally, their existence was ignored because of the technical difficulties in isolating them. With technological advances, a growing number of micropeptides have been characterized and shown to play vital roles in many biological processes. Yet, we lack bioinformatics methods for predicting them directly from DNA sequences, which could substantially facilitate research in this field with minimal cost. With the increasing amount of data, developing new methods to address this need becomes possible. We therefore developed MiPepid, a machine-learning-based method specifically designed for predicting micropeptides from DNA sequences by curating a high-quality dataset and by training MiPepid using logistic regression with 4-mer features. MiPepid performed exceptionally well on holdout test sets and performed much better than existing methods. MiPepid is available for downloading, easy to use, and runs sufficiently fast. Long noncoding RNAs (LncRNAs) are transcripts of > 200 bp and does not encode a protein. Contrary to their “noncoding” definition, an increasing number of lncRNAs have been found to be translated into functional micropeptides. Therefore, whether most lncRNAs are translated is an open question of great significance. To address this question, by harnessing the availability of large-scale human variation data, we have explored the relationships between lncRNAs, micropeptides, and canonical regular proteins (> 100 aa) from the perspective of genetic variation, which has long been used to study natural selection to infer functional relevance. Through rigorous statistical analyses, we find that lncRNAs share a similar genetic variation profile with proteins regarding single nucleotide polymorphism (SNP) density, SNP spectrum, enrichment of rare SNPs, etc., suggesting lncRNAs are under similar negative selection strength with proteins. Our study revealed similarities between micropeptides, lncRNAs, and canonical proteins and is the first attempt to explore the relationships between the three groups from a genetic variation perspective. Deep learning has been tremendously successful in 2D image recognition. Protein binding ligand prediction is fundamental topic in protein research as most proteins bind ligands to function. Proteins are 3D structures and can be considered as 3D images. Prediction of binding ligands of proteins can then be converted to a 3D image classification problem. In addition, a large number of protein structure data are available now. We therefore utilized deep learning to predict protein binding ligands by designing a 3D convolutional neural network from scratch and by building a large 3D image dataset of protein structures. The trained model achieved an average F1 score of over 0.8 across 151 classes on the holdout test set. Compared to existing methods, our model performed better. In summary, we showed the feasibility of deploying deep learning in protein structure research.

Degree

Ph.D.

Advisors

Gribskov, Purdue University.

Subject Area

Bioinformatics|Agronomy|Analytical chemistry|Artificial intelligence|Biochemistry|Cellular biology|Chemistry|Genetics

Off-Campus Purdue Users:
To access this dissertation, please log in to our
proxy server
.

Share

COinS