===Bayesian probability===
Baker and co-workers justified statistical PMFs from a Bayesian point of view and used these insights in the construction of the coarse-grained ROSETTA energy function. According to Bayesian probability calculus, the conditional probability P(X\mid A) of a structure X, given the amino acid sequence A, can be written as:

: P\left(X\mid A\right)=\frac{P\left(A\mid X\right)P\left(X\right)}{P\left(A\right)}\propto P\left(A\mid X\right)P\left(X\right)

P(X\mid A) is proportional to the product of the likelihood P\left(A\mid X\right) times the prior P\left(X\right). By assuming that the likelihood can be approximated as a product of pairwise probabilities, and applying
Bayes' theorem, the likelihood can be written as:

: P\left(A\mid X\right)\approx\prod_{i<j}P\left(a_{i},a_{j}\mid r_{ij}\right)\propto\prod_{i<j}\frac{P\left(r_{ij}\mid a_{i},a_{j}\right)}{P\left(r_{ij}\right)}

where the product runs over all amino acid pairs a_{i},a_{j} (with i<j), and r_{ij} is the distance between amino acids i and j. The negative of the logarithm of this expression has the same functional form as the classic pairwise distance statistical PMFs, with the denominator playing the role of the reference state. This explanation has two shortcomings: it relies on the unfounded assumption that the likelihood can be expressed as a product of pairwise probabilities, and it is purely qualitative.
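The negative log-ratio described above can be sketched numerically. The following is a minimal illustration, not the ROSETTA implementation: the function name, the toy observations, the discretization of distances into bins, and the pseudocount smoothing are all assumptions introduced for this example.

```python
import math
from collections import Counter


def pairwise_log_ratio(observations, pseudocount=1.0):
    """Estimate -log( P(r | a_i, a_j) / P(r) ) from toy counts.

    observations: iterable of (residue_a, residue_b, distance_bin) tuples.
    pseudocount: smoothing constant (an assumption of this sketch) that
    keeps unseen (pair, bin) combinations from producing log(0).
    """
    pair_bin = Counter()  # counts of (amino acid pair, distance bin)
    pair_tot = Counter()  # counts of each amino acid pair
    bin_tot = Counter()   # counts of each distance bin over all pairs

    for a, b, r in observations:
        pair = tuple(sorted((a, b)))  # treat (a, b) and (b, a) alike
        pair_bin[(pair, r)] += 1
        pair_tot[pair] += 1
        bin_tot[r] += 1

    n = sum(bin_tot.values())
    bins = sorted(bin_tot)
    scores = {}
    for pair in pair_tot:
        for r in bins:
            # Numerator: distance distribution for this specific pair.
            p_r_given_pair = (pair_bin[(pair, r)] + pseudocount) / (
                pair_tot[pair] + pseudocount * len(bins)
            )
            # Denominator: pair-independent distribution (reference state).
            p_r = bin_tot[r] / n
            scores[(pair, r)] = -math.log(p_r_given_pair / p_r)
    return scores


# Hypothetical data: A-L pairs favour bin 0, G-G pairs favour bin 1.
toy = (
    [("A", "L", 0)] * 8 + [("A", "L", 1)] * 2
    + [("G", "G", 0)] * 2 + [("G", "G", 1)] * 8
)
scores = pairwise_log_ratio(toy)
```

In this toy data set, a pair that is seen at a given distance more often than the pair-independent distribution predicts receives a negative (favourable) score, and a pair seen less often receives a positive one.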
===Probability kinematics===
Hamelryck and co-workers later gave a quantitative explanation for the statistical potentials, according to which they approximate a form of probabilistic reasoning due to Richard Jeffrey and named probability kinematics. This variant of Bayesian thinking (sometimes called "Jeffrey conditioning") allows updating a prior distribution based on new information on the probabilities of the elements of a partition on the support of the prior. From this point of view, (i) it is not necessary to assume that the database of protein structures (used to build the potentials) follows a Boltzmann distribution, (ii) statistical potentials generalize readily beyond pairwise distances, and (iii) the reference ratio is determined by the prior distribution.
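Jeffrey conditioning on a finite support can be sketched in a few lines. The example below is illustrative only: the function name, the four-state prior, the partition into "compact" and "extended" cells, and the new cell probabilities are hypothetical values chosen for this sketch. Each prior probability is rescaled by the ratio of the new to the prior probability of the cell it belongs to, so the denominator of that ratio is fixed entirely by the prior.

```python
from collections import defaultdict


def jeffrey_update(prior, cell_of, new_cell_probs):
    """Jeffrey conditioning on a finite support.

    prior:          dict mapping each state x to its prior probability Q(x)
    cell_of:        function mapping a state x to the label y of its cell
    new_cell_probs: dict mapping each cell label y to its new probability P(y)

    Returns a dict mapping x to Q(x) * P(y) / Q(y), with y = cell_of(x).
    """
    # Prior probability of each cell, obtained by summing over its states.
    q_cell = defaultdict(float)
    for x, q in prior.items():
        q_cell[cell_of(x)] += q

    updated = {}
    for x, q in prior.items():
        y = cell_of(x)
        if q == 0.0:
            updated[x] = 0.0  # states excluded by the prior stay excluded
        else:
            updated[x] = q * new_cell_probs[y] / q_cell[y]
    return updated


# Hypothetical prior over four states and a two-cell partition.
prior = {"a": 0.1, "b": 0.3, "c": 0.2, "d": 0.4}


def cell_of(x):
    return "compact" if x in ("a", "b") else "extended"


new_cell_probs = {"compact": 0.7, "extended": 0.3}
updated = jeffrey_update(prior, cell_of, new_cell_probs)
```

After the update, the cells carry the new probabilities, while the relative probabilities of the states inside each cell are unchanged from the prior.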
===Reference ratio===
Figure: In order to obtain a complete description of protein structure, one also needs a probability distribution P(Y) that describes nonlocal aspects, such as hydrogen bonding. P(Y) is typically obtained from a set of solved protein structures from the PDB (left). In order to combine Q(X) with P(Y) in a meaningful way, one needs the reference ratio expression (bottom), which takes the signal in Q(X) with respect to Y into account.

Expressions that resemble statistical PMFs naturally result from the application of probability theory to solve a fundamental problem that arises in protein structure prediction: how to improve an imperfect probability distribution Q(X) over a first variable X using a probability distribution P(Y) over a second variable Y, with Y=f(X). Typically, X and Y are fine-grained and coarse-grained variables, respectively. For example, Q(X) could concern the local structure of the protein, while P(Y) could concern the pairwise distances between the amino acids. In that case, X could for example be a vector of dihedral angles that specifies all atom positions (assuming ideal bond lengths and angles). In order to combine the two distributions, such that the local structure will be distributed according to Q(X), while the pairwise distances will be distributed according to P(Y), the following expression is needed:

: P(X,Y)=\frac{P(Y)}{Q(Y)}Q(X)

where Q(Y) is the distribution over Y implied by Q(X). The ratio in the expression corresponds to the PMF. Q(X) is usually brought in by sampling (typically from a fragment library) and is not explicitly evaluated; the ratio, which in contrast is explicitly evaluated, corresponds to Sippl's PMF. This explanation is quantitative, and allows the generalization of statistical PMFs from pairwise distances to arbitrary coarse-grained variables. It also provides a rigorous definition of the reference state, which is implied by Q(X). Conventional applications of pairwise distance statistical PMFs usually lack two features that are necessary to make them fully rigorous: the use of a proper probability distribution over pairwise distances in proteins, and the recognition that the reference state is rigorously defined by Q(X).

==Applications==