Design, evaluation, and application of PFP: An automated system for protein function prediction

Troy B Hawkins, Purdue University

Abstract

The last decade of biological research has seen a tremendous push towards the production of high-volume data describing DNA and protein sequence, structure, expression, interaction, and localization. This glut of new data has driven the development of a slew of computational tools that interpret it to provide new functional characterization of proteins. We have developed PFP, an automated function prediction system that provides high-probability annotations for a query sequence in each of the three branches of the Gene Ontology: biological process, cellular component, and molecular function. Rather than using precise pattern matching to identify functional motifs in protein sequences and structures, we designed PFP to increase the coverage of function annotation by lowering the resolution of predictions when detailed functional information cannot be predicted. To do this, we extend a traditional PSI-BLAST homology search by extracting and scoring annotations (GO terms) individually, including annotations from distantly related sequences, and applying a novel data mining tool, the Function Association Matrix, to score strongly associated pairs of annotations. The scoring scheme also provides GO term-based statistical significance scores and confidence scores empirically derived from an extensive benchmark evaluation of annotated proteins from fifteen organisms. We have shown this system to be effective in providing accurate predictions for both specific and broad functional terms. This is consistent with the performance of PFP as the best overall predictor in two independent international assessments, AFP-SIG ’05 and CASP7 function (FN), where it outperformed even consensus predictions made by the organizers. Additionally, we have extensively applied blind predictions to protein interaction networks and to clusters of contiguous genes in E. coli, S. cerevisiae, and P. falciparum (a malaria parasite). Through applications of this kind, PFP is able to provide significant annotation gain for previously uncharacterized groups of proteins. The automated PFP system is publicly available as a web server at http://dragon.bio.purdue.edu/pfp/.
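
The scoring idea described in the abstract (score each GO term individually across all homology hits, keep distant hits, and propagate score to strongly associated terms) can be illustrated with a minimal Python sketch. This is not the PFP implementation. The weighting `-log10(E) + shift`, the default `shift` value, the nested-dictionary form of the association table, and the placeholder identifiers `GO:A`, `GO:B`, and `GO:C` are all assumptions made for illustration; the abstract does not specify them.

```python
import math
from collections import defaultdict


def score_go_terms(hits, associations, shift=2.0):
    """Rank GO terms gathered from homology search hits.

    hits: list of (e_value, [go_term, ...]) pairs, one per retrieved sequence.
    associations: hypothetical stand-in for an association table, mapping an
        observed term to {associated_term: conditional probability}.
    shift: assumed constant that lets weak hits still contribute.
    """
    scores = defaultdict(float)
    for e_value, go_terms in hits:
        # Assumed weighting: smaller E-values give larger weights, and the
        # shift keeps hits with E-values below 10**shift in play.
        weight = -math.log10(e_value) + shift
        if weight <= 0:
            continue  # hit is too weak to contribute under this weighting
        for observed in go_terms:
            # Direct evidence: the term annotates the retrieved sequence.
            scores[observed] += weight
            # Indirect evidence: terms associated with the observed term.
            for associated, probability in associations.get(observed, {}).items():
                if associated != observed:
                    scores[associated] += weight * probability
    # Highest-scoring terms first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical input: one strong hit and one very distant hit.
example_hits = [(1e-10, ["GO:A"]), (10.0, ["GO:B"])]
example_associations = {"GO:A": {"GO:C": 0.5}}
ranked = score_go_terms(example_hits, example_associations)
```

In this sketch the distant hit still adds a small score for `GO:B`, and `GO:C` receives a score even though no retrieved sequence carries it, which mirrors the coverage-oriented design the abstract describes.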

Degree

Ph.D.

Advisors

Kihara, Purdue University.

Subject Area

Bioinformatics
