Adaptation of Frequent Subgraph Mining Algorithms to Noncoding RNA Topology Alignment and Function Prediction
In recent years, advances in high-throughput genome sequencing and transcriptome assembly techniques have enabled large-scale transcriptome studies. The results of these studies have dramatically changed our understanding of the Central Dogma. RNA was previously viewed mainly as an information carrier for protein synthesis, but mammalian genomes are pervasively transcribed outside of protein-coding regions. In addition to protein-coding transcripts, numerous novel noncoding RNA (ncRNA) transcripts have been reported. However, the functional annotation of ncRNA is at a rudimentary stage, and only a limited number of functional categories have been intensively studied. Available studies support the existence of distinguishable structural topology conservation among ncRNAs despite low sequence similarity. This raises two crucial biological questions: how to identify topological conservation among novel ncRNAs without traditional sequence alignment techniques, and how to utilize structural similarity to annotate ncRNA functions. Despite a growing pool of unannotated ncRNA data, three obstacles remain: exhaustive topological comparison is NP-hard, available topology alignment algorithms cannot identify overall conserved topologies efficiently, and currently available ncRNA topology classification models can only distinguish RNA-like structures from non-RNA-like structures. This study addresses these questions by adapting graph mining algorithms to ncRNA topological alignment and by evaluating the accuracy and precision of ncRNA topology classification based upon our graph mining approach. We define an ncRNA graph representation model called XIOS (representing eXclusive, Included, Overlapping, and Serial stem arrangements) and develop a multiple ncRNA topology alignment algorithm that can align and identify conserved ncRNA structural topologies. Using this topological alignment tool, we build an ncRNA classification model that classifies ncRNAs into functional categories.
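To make the XIOS idea concrete, the following minimal Python sketch labels the relationship between two stems with one of the four arrangement types. It is an illustration under stated assumptions, not the dissertation's implementation: the abstract names the four arrangements but does not define them, so the geometric rules used here are assumed (eXclusive means the two stems share sequence positions, Included means one stem is nested inside the other, Overlapping means the stems interleave as in a pseudoknot, Serial means one stem ends before the other begins). The `Stem` class, its coordinate fields, and the functions `relation` and `xios_edges` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Stem:
    """A stem as two half-stems with inclusive coordinates (hypothetical layout)."""
    left_start: int
    left_end: int
    right_start: int
    right_end: int


def _ranges_overlap(s1, e1, s2, e2):
    """True if the inclusive ranges [s1, e1] and [s2, e2] share a position."""
    return s1 <= e2 and s2 <= e1


def relation(a, b):
    """Return 'x', 'i', 'o', or 's' for a stem pair (assumed definitions)."""
    # Order the pair so that `a` starts first; direction is ignored in this sketch.
    if b.left_start < a.left_start:
        a, b = b, a
    halves_a = [(a.left_start, a.left_end), (a.right_start, a.right_end)]
    halves_b = [(b.left_start, b.left_end), (b.right_start, b.right_end)]
    # eXclusive: any half-stem of one stem shares bases with a half-stem of the other.
    if any(_ranges_overlap(*ha, *hb) for ha in halves_a for hb in halves_b):
        return "x"
    # Serial: `a` is completely finished before `b` begins.
    if a.right_end < b.left_start:
        return "s"
    # Included: `b` lies entirely between the two halves of `a`.
    if b.right_end < a.right_start:
        return "i"
    # Overlapping: `b` starts inside `a` and ends after it (pseudoknot-like).
    return "o"


def xios_edges(stems):
    """Label every pair of stems, keyed by their indices in the input list."""
    return {(i, j): relation(stems[i], stems[j])
            for i, j in combinations(range(len(stems)), 2)}


# Toy coordinates, made up for illustration only.
outer = Stem(1, 5, 20, 25)
nested = Stem(7, 9, 14, 16)
downstream = Stem(30, 33, 40, 43)
knot = Stem(10, 12, 35, 37)
clash = Stem(4, 6, 50, 52)
edges = xios_edges([outer, nested, downstream, knot, clash])
```

In this sketch each stem becomes a vertex and each labeled pair becomes an edge, which is the kind of labeled graph that frequent subgraph mining operates on.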
In particular, 1) we define the XIOS ncRNA graph representation. 2) We implement two of the most cited frequent subgraph mining (FSM) algorithms, Margin and gSpan, for use on XIOS RNA graphs. 3) Because an exhaustive search is computationally prohibitive, we develop the MMC-Margin algorithm, which uses Metropolis Monte Carlo (MMC) sampling to sample the Margin space, known to be the smallest FSM space. We evaluate the MMC-Margin algorithm by comparing its performance with that of the Margin algorithm, by identifying conserved topological substructures in real and synthetic ncRNAs, and by MMC convergence diagnostics. 4) We develop an ncRNA topology classification algorithm that utilizes sampled maximal frequent subgraphs to classify ncRNAs. 5) We test the ncRNA classification algorithm on both synthetic and real functional categories, and demonstrate that conservation among ncRNA structures can be identified and that accurate ncRNA function prediction can be achieved. We conclude that our ncRNA multiple topological alignment algorithm can identify conservation that is tightly related to ncRNA functionality, and that it provides a high-throughput means to predict ncRNA function. Our algorithms provide a foundation for exploring the functionality of pervasively transcribed ncRNAs and for elucidating the unique roles of ncRNA beyond the Central Dogma.
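The Metropolis Monte Carlo step mentioned above can be illustrated with a generic sampler. The sketch below is not MMC-Margin: the abstract does not specify the state space, proposal move, or scoring function, so all of these are placeholders. The function `metropolis_sample`, its parameters, and the toy integer state space are hypothetical and chosen only to show the standard Metropolis acceptance rule, assuming a symmetric proposal distribution.

```python
import math
import random


def metropolis_sample(score, propose, start, steps, temperature=1.0, seed=0):
    """Generic Metropolis sampler (illustrative; not the MMC-Margin algorithm).

    score(state) returns a number where higher is better.
    propose(state, rng) returns a neighboring state; assumed symmetric.
    Returns the list of visited states, including the start state.
    """
    rng = random.Random(seed)
    current = start
    current_score = score(current)
    chain = [current]
    for _ in range(steps):
        candidate = propose(current, rng)
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Always accept improvements; accept a worse state with
        # probability exp(delta / temperature), where delta < 0.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
        chain.append(current)
    return chain


# Toy state space: integers 0..10 with a single score peak at 7 (made up).
def toy_score(state):
    return -abs(state - 7)


def toy_propose(state, rng):
    # Step left or right, clamped to the range 0..10.
    return min(10, max(0, state + rng.choice((-1, 1))))


chain = metropolis_sample(toy_score, toy_propose, start=0, steps=500,
                          temperature=0.5)
```

In a subgraph-sampling setting, a state would presumably be a candidate subgraph and a proposal would add or remove a vertex or edge, but those design choices are assumptions here rather than details taken from the abstract.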
Gribskov, Purdue University.
Bioinformatics|Artificial intelligence|Computer science