Scoring functions in predicting protein structure and protein-protein interaction
Abstract
Structural bioinformatics is essential to the study of the mechanisms of molecular machinery in biological processes. It applies statistical and mathematical modeling to problems in protein folding, protein structure prediction and protein-protein interactions. Among the various issues in structural bioinformatics, the scoring function is particularly important because it lies at the core of many algorithms. In this thesis, scoring function optimization and weight training problems are investigated in three related works: (1) Quality Assessment of Protein Structure Models: Knowing the resolution and accuracy of a structure model is crucial for biologists deciding how it can be used. Various quality assessment scores are combined using linear, logistic and LOESS regressions to predict the quality of a structure model in terms of RMSD and correct/incorrect categories. Local quality of the structure, in terms of Cα distance, is also modeled using simple regression and hierarchical approaches. Finally, the resulting regression equations are applied to assess the quality of structure models for the whole E. coli proteome. (2) Optimizing the Scoring Function for Ranking Protein Docking Conformations: Our scoring function is a linear combination of nine energetic terms whose weights are optimized by logistic regression and a genetic algorithm, using numerous metrics that measure the quality of a docking scoring function as optimization targets. Cross comparison shows that different metrics have different generalization ability. The resulting scoring functions are then compared to ZRANK and ZDOCK on a benchmark data set and show substantial improvement. Finally, ensemble approaches are employed, and improvement is observed on several metrics. (3) Threading without Optimizing the Weighting Factor of the Scoring Function: A simple gapless threading system with two energy terms is used to test several novel methods that do not require training weights on a training set. The basic idea of these methods is to sample different values of the weight and select an optimal template structure for a target sequence by examining the characteristics of the distribution of scores computed by varying the weight. An artificial neural network model is also built to predict a target-specific weight from features of the protein sequence. Finally, it is shown that the novel approaches, combined with the traditional methods, can increase the predictive power of the scoring function.
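The weight-sampling idea in (3) can be illustrated with a minimal Python sketch. The abstract does not name the two energy terms, the range of sampled weights, or the exact criterion applied to the score distribution, so everything concrete here is an assumption made for illustration: the score form `e1 + w * e2`, the lower-is-better convention, the hypothetical template names and values, and the selection rule (pick the template that ranks first for the most sampled weights). It is not the thesis's actual method.

```python
from collections import Counter


def combined_score(e1, e2, w):
    """Two-term score with a single weight w (assumed form; lower is better)."""
    return e1 + w * e2


def select_template(terms, weights):
    """Pick the template that ranks first for the most sampled weights.

    terms   -- dict mapping a template id to its (e1, e2) energy terms
    weights -- iterable of weight values to sample
    Returns (best_template, wins), where wins counts first-place finishes.
    """
    wins = Counter()
    for w in weights:
        # Best template at this weight is the one with the lowest combined score.
        best = min(terms, key=lambda t: combined_score(terms[t][0], terms[t][1], w))
        wins[best] += 1
    return wins.most_common(1)[0][0], wins


# Hypothetical toy data: three templates scored against one target sequence.
terms = {
    "tmplA": (-10.0, -1.0),
    "tmplB": (-6.0, -4.0),
    "tmplC": (-8.0, -2.0),
}
weights = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]

best, wins = select_template(terms, weights)
print(best, dict(wins))
```

In this toy case no weight is trained: the choice of template comes only from how the ranking behaves across the sampled weights. Other characteristics of the score distribution could be substituted for the first-place count without changing the overall structure.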
Degree
Ph.D.
Advisors
Kihara, Purdue University.
Subject Area
Bioinformatics