Because protein structures are composed of amino acids whose side chains are linked by a common protein backbone, several different subsets of a protein's atoms can be used to produce a structural alignment and to calculate the corresponding RMSD values. When aligning structures with very different sequences, the side-chain atoms are generally not taken into account because their identities differ between many aligned residues. For this reason, structural alignment methods commonly use by default only the backbone atoms included in the peptide bond. For simplicity and efficiency, often only the alpha carbon positions are considered, since the peptide bond has a minimally variant planar conformation. Only when the structures to be aligned are highly similar or even identical is it meaningful to align side-chain atom positions, in which case the RMSD reflects not only the conformation of the protein backbone but also the rotameric states of the side chains. Other comparison criteria that reduce noise and bolster positive matches include secondary structure assignment, native contact maps or residue interaction patterns, measures of side-chain packing, and measures of hydrogen bond retention.
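The effect of choosing different atom subsets can be illustrated with a minimal sketch. It assumes a hypothetical in-memory representation in which each atom is a `(residue_index, atom_name, (x, y, z))` tuple; the names `select_atoms` and `rmsd` and the toy coordinates are illustrative and do not come from any particular library. No superposition is applied here, so the structures are assumed to share a coordinate frame already.

```python
import math

def select_atoms(atoms, names):
    """Map (residue index, atom name) -> (x, y, z) for the requested atom names."""
    return {(res, name): xyz for res, name, xyz in atoms if name in names}

def rmsd(atoms_a, atoms_b, names=("CA",)):
    """RMSD over the atoms of the chosen subset that occur in both structures.

    Assumption: both structures are already in the same coordinate frame;
    this function does not superpose them.
    """
    a = select_atoms(atoms_a, set(names))
    b = select_atoms(atoms_b, set(names))
    shared = sorted(a.keys() & b.keys())
    if not shared:
        raise ValueError("no atoms in common for the requested subset")
    total = sum(math.dist(a[key], b[key]) ** 2 for key in shared)
    return math.sqrt(total / len(shared))

# Toy two-residue structure (coordinates in angstroms, invented for illustration).
structure_a = [
    (1, "N", (0.0, 0.0, 0.0)), (1, "CA", (1.5, 0.0, 0.0)), (1, "C", (2.0, 1.4, 0.0)),
    (2, "N", (3.3, 1.5, 0.0)), (2, "CA", (4.0, 2.8, 0.0)), (2, "C", (5.5, 2.6, 0.0)),
]
# Second structure: identical except that every alpha carbon is shifted 1 angstrom along x.
structure_b = [
    (res, name, (x + 1.0, y, z) if name == "CA" else (x, y, z))
    for res, name, (x, y, z) in structure_a
]

ca_rmsd = rmsd(structure_a, structure_b, names=("CA",))
backbone_rmsd = rmsd(structure_a, structure_b, names=("N", "CA", "C"))
print(f"CA-only RMSD:  {ca_rmsd:.3f}")
print(f"Backbone RMSD: {backbone_rmsd:.3f}")
```

In this toy case the same pair of structures yields a different RMSD depending on which atoms are included, which is why the atom subset should be reported alongside any RMSD value.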
===Structural superposition===
The most basic comparison between protein structures makes no attempt to align the input structures; instead, it requires a precalculated alignment as input to determine which residues are to be included in the RMSD calculation. Structural superposition is commonly used to compare multiple conformations of the same protein (in which case no alignment is necessary, since the sequences are the same) and to evaluate the quality of alignments produced using only sequence information between two or more sequences whose structures are known. This method traditionally uses a simple least-squares fitting algorithm, in which the optimal rotations and translations are found by minimizing the sum of the squared distances among all structures in the superposition. More recently, maximum likelihood and Bayesian methods have greatly increased the accuracy of the estimated rotations, translations, and covariance matrices for the superposition. Algorithms based on multidimensional rotations and modified quaternions have been developed to identify topological relationships between protein structures without the need for a predetermined alignment. Such algorithms have successfully identified canonical folds such as the four-helix bundle. The SuperPose method is sufficiently extensible to correct for relative domain rotations and other structural pitfalls.
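The least-squares fitting described above can be sketched for the pairwise case as follows. This is a minimal illustration, assuming NumPy and a precalculated one-to-one pairing of atoms (row `i` of one array corresponds to row `i` of the other). It uses the SVD-based Kabsch solution, which is one standard way to solve this least-squares problem; the text does not specify which solver any particular tool uses, and the function name `kabsch_superpose` and the toy coordinates are hypothetical.

```python
import numpy as np

def kabsch_superpose(mobile, target):
    """Least-squares superposition of `mobile` onto `target` (both N x 3, rows paired).

    Returns the fitted coordinates, the 3 x 3 rotation matrix, and the RMSD
    after superposition.
    """
    mobile = np.asarray(mobile, dtype=float)
    target = np.asarray(target, dtype=float)
    if mobile.shape != target.shape or mobile.ndim != 2 or mobile.shape[1] != 3:
        raise ValueError("inputs must be paired N x 3 coordinate arrays")

    # Remove the translation by centering both point sets on their centroids.
    target_center = target.mean(axis=0)
    p = mobile - mobile.mean(axis=0)
    q = target - target_center

    # Covariance matrix and its singular value decomposition.
    u, _, vt = np.linalg.svd(p.T @ q)

    # Flip the last axis if needed so the result is a proper rotation
    # (determinant +1) rather than a reflection.
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T

    fitted = p @ rotation.T + target_center
    rmsd = float(np.sqrt(np.mean(np.sum((fitted - target) ** 2, axis=1))))
    return fitted, rotation, rmsd

# Toy check: rotate and translate a small point set, then recover the transform.
mobile = np.array([
    [0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [2.0, 1.4, 0.0],
    [3.3, 1.5, 0.5], [4.0, 2.8, 1.0],
])
angle = np.radians(30.0)
true_rotation = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle),  np.cos(angle), 0.0],
    [0.0,            0.0,           1.0],
])
target = mobile @ true_rotation.T + np.array([5.0, -2.0, 3.0])

fitted, rotation, rmsd_value = kabsch_superpose(mobile, target)
print(f"RMSD after superposition: {rmsd_value:.2e}")
```

Because the target here is an exact rigid-body copy of the input, the recovered rotation should match the applied one and the residual RMSD should be zero to within floating-point error. With real structures the residual RMSD is the quantity of interest.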
===Evaluating similarity===
Often the purpose of seeking a structural superposition is not so much the superposition itself as an evaluation of the similarity of two structures or of the confidence in a remote alignment. A subtle but important distinction from maximal structural superposition is the conversion of an alignment to a meaningful similarity score. Most methods output some sort of "score" indicating the quality of the superposition. However, what one actually wants is not merely an estimated Z-score or an estimated E-value for seeing the observed superposition by chance, but an estimated E-value that is tightly correlated with the true E-value. Critically, even if a method's estimated E-value is precisely correct on average, a high standard deviation in its estimation process means that the rank ordering of the relative similarities of a query protein to a comparison set will rarely agree with the "true" ordering. Different methods superimpose different numbers of residues because they use different quality assurances and different definitions of "overlap"; some include only residues meeting multiple local and global superposition criteria, while others are more greedy, flexible, and promiscuous. A greater number of superposed atoms can indicate greater similarity, but it does not always produce the best E-value quantifying the unlikeliness of the superposition, and so it is not always as useful for assessing similarity, especially in remote homologs.

==Algorithmic complexity==