Statistical inference of protein structure using small-angle X-ray scattering data

Lanqing Hua, Purdue University

Abstract

The interatomic distance distribution, P(r), is a valuable tool for evaluating the structure of a molecule in solution and represents the maximum structural information that can be deterministically derived from solution scattering intensities, I(q). While P(r) is related to the Fourier transform of I(q), it unfortunately cannot be accurately determined by direct Fourier transform due to the limited resolution range of the experimental intensities. Instead, indirect transform approaches have been used. These methods involve fitting the observed I(q) to a linear combination of basis functions and then reconstructing P(r) with the same combination of Fourier-transformed basis functions. While these approaches improve accuracy, there is still room for improvement. We describe two novel approaches for calculating an improved P(r). First, we introduce a family of flexibly defined basis functions to reconstruct P(r) when the number of functions employed in each set is limited by the resolution of the data (always the case in practical applications). Second, we propose model averaging, in which multiple sets of basis functions are combined to reduce bias. By specifically employing basis sets with the ability to describe a wide variety of structural features represented in P(r), greater information is extracted from the I(q) data. We explore the properties of these approaches and find they can offer substantial benefits in analyzing solution scattering data. For example, unrealistic behaviors, such as artificial oscillations in P(r), are reduced while avoiding explicit smoothness constraints and subjective physical assumptions about the molecule. Simulations using a wide variety of structures show that any combination of our flexible basis functions (single best, weighted average, or unweighted average) provides P(r) reconstructions that are comparable to, or more accurate than, the equally unrestrained method of Moore [1] at both high and low resolution.
Our approaches also reconstruct P(r) more accurately than Svergun et al.'s approach [2] at high resolution and more accurately than Glatter's approach [3] at low resolution. Criteria for choosing among our three combination methods are discussed.

In addition to P(r) reconstruction, observed scattering data are also used to discriminate between potential molecular structures. In this thesis, we describe the problem in a statistical framework and discuss the limitations of this type of discrimination. The most challenging issue is that each potential structure is represented as a single static shape, whereas the small-angle X-ray scattering (SAXS) data represent the molecule in solution. In other words, SAXS data do not represent the scattering of a single protein structure but rather an average over a mixture of numerous shapes. We propose using a dynamic simulation program to generate alternative shapes for a potential structure. These shapes give us a "dynamic average" for the potential structure as well as a measure of variability about this average. We borrow this information from numerous proposed structures to create an envelope, which represents the range of structures/curves that likely contributed to the SAXS data. The envelope we construct is shown to discriminate between structures in different topology categories.
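
The indirect transform summarized in the abstract (fit I(q) with transformed basis functions, then rebuild P(r) from the same coefficients) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the plain sine-series basis stands in for the flexible basis family of the thesis, which the abstract does not define; the convention I(q) = 4π ∫ P(r) sin(qr)/(qr) dr over 0 ≤ r ≤ d_max is one common choice; and the names `sine_basis`, `transformed_basis`, `fit_pr`, `reconstruct_pr`, and `d_max` are hypothetical.

```python
import numpy as np


def sine_basis(r, n_terms, d_max):
    """Real-space basis: columns are sin(n*pi*r/d_max) for n = 1..n_terms.

    Assumed basis for illustration only; each column vanishes at r = 0 and r = d_max.
    """
    n = np.arange(1, n_terms + 1)
    return np.sin(np.outer(r, n) * np.pi / d_max)  # shape (len(r), n_terms)


def transformed_basis(q, n_terms, d_max, n_grid=2001):
    """Basis functions carried into q-space by trapezoidal quadrature.

    Column n approximates 4*pi * integral_0^d_max sin(n*pi*r/d_max) * sin(q r)/(q r) dr.
    """
    r = np.linspace(0.0, d_max, n_grid)
    dr = r[1] - r[0]
    weights = np.full(n_grid, dr)
    weights[0] = weights[-1] = dr / 2.0  # trapezoid end-point weights
    # np.sinc(x) = sin(pi x)/(pi x), so sinc(q r / pi) = sin(q r)/(q r), safe at 0.
    kernel = np.sinc(np.outer(q, r) / np.pi)  # shape (len(q), n_grid)
    return 4.0 * np.pi * (kernel * weights) @ sine_basis(r, n_terms, d_max)


def fit_pr(q, intensity, sigma, n_terms, d_max):
    """Weighted least-squares fit of I(q) to the transformed basis functions."""
    design = transformed_basis(q, n_terms, d_max)
    coeffs, *_ = np.linalg.lstsq(design / sigma[:, None], intensity / sigma, rcond=None)
    return coeffs


def reconstruct_pr(r, coeffs, d_max):
    """Rebuild P(r) from the fitted coefficients and the real-space basis."""
    return sine_basis(r, len(coeffs), d_max) @ coeffs


# Synthetic demonstration (made-up numbers, not data from the thesis).
q = np.linspace(0.01, 0.30, 60)             # scattering vector magnitudes
true_coeffs = np.array([1.0, 0.4, -0.2, 0.05])
d_max = 50.0                                # assumed maximum particle dimension
intensity = transformed_basis(q, 4, d_max) @ true_coeffs
sigma = np.ones_like(q)                     # uniform uncertainties

coeffs = fit_pr(q, intensity, sigma, n_terms=4, d_max=d_max)
p_r = reconstruct_pr(np.linspace(0.0, d_max, 101), coeffs, d_max)
```

In this sketch the number of terms (four) is kept below q_max · d_max / π ≈ 4.8, reflecting the abstract's point that the resolution of the data limits how many basis functions can be used. Model averaging, as described above, would repeat the fit for several basis sets and combine the resulting P(r) curves; the weighting scheme is not given in the abstract, so it is not sketched here.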

Degree

Ph.D.

Advisors

Craig, Purdue University.

Subject Area

Statistics
