Superfamilies of proteins are identified using a number of methods. Closely related members can be identified by different methods to those needed to group the most evolutionarily divergent members.
Sequence similarity of mammalian
histone proteins. The similarity of the sequences implies that they evolved by
gene duplication. Residues that are conserved across all sequences are highlighted in grey. Below the protein sequences is a key denoting: Historically, the similarity of different amino acid sequences has been the most common method of inferring
homology. Sequence similarity is considered a good predictor of relatedness, since similar sequences are more likely the result of
gene duplication and
divergent evolution, rather than the result of
convergent evolution. Amino acid sequence is typically more conserved than DNA sequence (due to the
degenerate genetic code), so it is a more sensitive detection method. Since some of the amino acids have similar properties (e.g., charge, hydrophobicity, size),
conservative mutations that interchange them are often
neutral to function. The most conserved sequence regions of a protein often correspond to functionally important regions like
catalytic sites and binding sites, since these regions are less tolerant to sequence changes. Using sequence similarity to infer homology has several limitations. There is no minimum level of sequence similarity guaranteed to produce identical structures. Over long periods of evolution, related proteins may show no detectable sequence similarity to one another. Sequences with many
insertions and deletions can also sometimes be difficult to
align and so identify the homologous sequence regions. In the
PA clan of
proteases, for example, not a single residue is conserved through the superfamily, not even those in the
catalytic triad. Conversely, the individual families that make up a superfamily are defined on the basis of their sequence alignment, for example the C04 protease family within the PA clan. Nevertheless, sequence similarity is the most commonly used form of evidence to infer relatedness, since the number of known sequences vastly outnumbers the number of known
tertiary structures. In the absence of structural information, sequence similarity constrains the limits of which proteins can be assigned to a superfamily. Over very long evolutionary timescales, very few residues show detectable amino acid sequence conservation, however
secondary structural elements and
tertiary structural motifs are highly conserved. Some
protein dynamics and
conformational changes of the protein structure may also be conserved, as is seen in the
serpin superfamily. Consequently, protein tertiary structure can be used to detect homology between proteins even when no evidence of relatedness remains in their sequences.
Structural alignment programs, such as
DALI, use the 3D structure of a protein of interest to find proteins with similar folds. However, on rare occasions, related proteins may evolve to be structurally dissimilar and relatedness can only be inferred by other methods.
Mechanistic similarity The
catalytic mechanism of enzymes within a superfamily is commonly conserved, although
substrate specificity may be significantly different. Catalytic residues also tend to occur in the same order in the protein sequence. For the families within the PA clan of proteases, although there has been divergent evolution of the
catalytic triad residues used to perform catalysis, all members use a similar mechanism to perform
covalent, nucleophilic catalysis on proteins, peptides or amino acids. However, mechanism alone is not sufficient to infer relatedness. Some catalytic mechanisms have been
convergently evolved multiple times independently, and so form separate superfamilies, and in some superfamilies display a range of different (though often chemically similar) mechanisms. == Evolutionary significance ==