Exploring protein functional relationships utilizing genomic information, data mining and computational intelligence

Jack Y Yang, Purdue University

Abstract

Since the discovery of the Overhauser effect and the DNA double-helix structure, many protein structures have been determined experimentally, notably by methods that exploit the Overhauser effect. Biologists can now not only describe the phenomena of life but also seek to understand its mechanisms at the molecular level. With the advent of high-throughput genome sequencing technology, more and more genomes are available; consequently, our ability to sequence genomes has outstripped our ability to analyze the resulting data and determine the functions and structures of the proteins those genomes encode. Determining protein structures and functions with traditional laboratory methods is slow and expensive. Our goal is therefore to develop an automated, machine-learning-based approach that uses computational intelligence to provide information about multiple functional relations among a large group of proteins simultaneously. To date, the functions of most proteins are either entirely unknown or only partially known. This is due to the complexity of protein-protein and protein-DNA interactions and to the limitations of experimental approaches and data mining techniques. Our new approach nevertheless extracts information about protein functional relationships by performing a hierarchical decomposition of the feature space. This decomposition transforms the difficult problem into simpler sub-problems, so that complex biomedical data can be used efficiently in solving them. We refer to this new approach as the unsupervised and supervised tree (UST) because it combines the advantages of both supervised and unsupervised learning. The core of UST is the construction of a Maximum contract tree (MCT), which allows us to establish many links among proteins of related functions.
Furthermore, we introduce a new machine learning classifier, the Multiple-Labeled Instance Classifier (MLIC), which handles instances that belong to many classes, a case that previous computational intelligence approaches have not studied. We built a comprehensive protein phylogenetic profile library based on 60 genomes, an improvement over earlier protein phylogenetic profiles based on 24 genomes. Experimental results show that USTs outperform other computational intelligence methods such as Support Vector Machines and Decision Trees, and provide a viable alternative to using supervised or unsupervised methods alone.
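
As an illustration of the phylogenetic profiles mentioned above, the following minimal Python sketch assumes the common binary representation, in which each protein is described by a presence/absence vector with one entry per reference genome, and proteins with similar vectors are candidates for related function. It is not the UST or MLIC method described in the abstract; the protein names, the toy vectors, and the helper functions `hamming` and `closest_pair` are hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical toy data: one presence/absence bit per reference genome.
# The library in the abstract is based on 60 genomes; 8 are used here.
profiles = {
    "protA": [1, 1, 0, 1, 0, 0, 1, 1],
    "protB": [1, 1, 0, 1, 0, 0, 1, 0],
    "protC": [0, 0, 1, 0, 1, 1, 0, 0],
}


def hamming(p, q):
    """Count the genomes in which the two profiles disagree."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same genomes")
    return sum(a != b for a, b in zip(p, q))


def closest_pair(profiles):
    """Return the pair of protein names whose profiles differ least."""
    return min(
        combinations(sorted(profiles), 2),
        key=lambda pair: hamming(profiles[pair[0]], profiles[pair[1]]),
    )


if __name__ == "__main__":
    a, b = closest_pair(profiles)
    print(a, b, hamming(profiles[a], profiles[b]))
```

Hamming distance is used here only because it is the simplest measure of agreement between two binary profiles; the abstract does not state which similarity measure the dissertation uses.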

Degree

Ph.D.

Advisors

Kuo, Purdue University.

Subject Area

Biomedical engineering|Computer science
