Computational Protein Function Prediction and its Application to the Missing Enzymes Problem
The long-term goal of this research is to improve, with high accuracy, the overall annotation level of genomes and the completeness of biological pathways. Large numbers of proteins are sequenced every year, creating a pressing need for computational techniques that rapidly analyze genomes and extract relevant knowledge. The purpose of this study is 1) to develop an advanced method to computationally elucidate the functions of unannotated proteins, 2) to characterize the relationships between the functional terms used to describe proteins, and 3) to use these relationships to predict missing enzymes in metabolic pathways. First, we have developed the Extended Similarity Group (ESG) method for protein annotation prediction, which iteratively searches the sequence homology space around the query protein and draws a consensus from the annotations of proteins in that neighborhood. In terms of prediction accuracy, ESG has been shown to outperform both a simple PSI-BLAST search and the PFP method previously developed in our lab. Second, we have designed two scores, the Co-occurrence Association Score (CAS) and the PubMed Association Score (PAS), that capture the relationship between pairs of Gene Ontology terms used to annotate proteins. CAS is based on how often two annotation terms co-occur in the database as annotations of the same proteins, and PAS is based on co-mentions of annotation terms in PubMed abstracts. These two scores have been successfully applied to identify functionally coherent groups of proteins that work in a coordinated fashion to accomplish a biological task. For newly sequenced genomes, metabolic reconstruction often leaves several missing enzymes, that is, known reactions that are not associated with any gene product.
As the next step, we combine the aforementioned function association scores with phylogenetic profiles and microarray expression data to find the most likely matches for such missing enzymes, thereby increasing the completeness of biological knowledge. Thus, the principal goal achieved here is to understand and improve the computational characterization of protein annotations, starting from individual proteins and moving toward the systems level.
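To make the idea of a co-occurrence-based association between Gene Ontology terms concrete, the following is a minimal sketch in Python. It is not the CAS definition from this dissertation, which the abstract does not state. It assumes a simple observed-over-expected ratio as the score, and the function name `cooccurrence_scores` and the toy annotation table are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_scores(annotations):
    """Score GO term pairs by observed vs. expected co-annotation.

    annotations: dict mapping a protein ID to a set of GO term IDs.
    Returns a dict mapping a sorted (term_a, term_b) tuple to
    P(a, b) / (P(a) * P(b)). Values above 1 mean the two terms annotate
    the same proteins more often than independence would predict.
    ASSUMPTION: this ratio is illustrative only, not the actual CAS formula.
    """
    n = len(annotations)
    term_counts = Counter()  # number of proteins carrying each term
    pair_counts = Counter()  # number of proteins carrying both terms
    for terms in annotations.values():
        term_counts.update(terms)
        pair_counts.update(combinations(sorted(terms), 2))

    scores = {}
    for (a, b), both in pair_counts.items():
        expected = (term_counts[a] / n) * (term_counts[b] / n)
        scores[(a, b)] = (both / n) / expected
    return scores


# Toy annotation table: protein IDs and term assignments are made up.
toy = {
    "P1": {"GO:0006096", "GO:0004396"},
    "P2": {"GO:0006096", "GO:0004396"},
    "P3": {"GO:0006096"},
    "P4": {"GO:0006412"},
}
scores = cooccurrence_scores(toy)
```

Only term pairs that co-occur at least once receive a score in this sketch. A real annotation database would also require handling of the ontology hierarchy, which is left out here.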
Kihara, Purdue University.