=== PAM ===
One of the first amino acid substitution matrices, the PAM (Point Accepted Mutation) matrix, was developed by Margaret Dayhoff in the 1970s. This matrix is calculated by observing the differences in closely related proteins. Because very closely related homologs are used, the observed mutations are not expected to significantly change the common functions of the proteins. Thus the observed substitutions (by point mutations) are considered to be accepted by natural selection.

One PAM unit is defined as the amount of evolution in which 1% of the amino acid positions have changed. To create a PAM1 substitution matrix, a group of very closely related sequences with mutation frequencies corresponding to one PAM unit is chosen. Based on the mutational data collected from this group of sequences, a substitution matrix can be derived. This PAM1 matrix estimates what rate of substitution would be expected if 1% of the amino acids had changed.

The PAM1 matrix is used as the basis for calculating other matrices by assuming that repeated mutations follow the same pattern as those in the PAM1 matrix, and that multiple substitutions can occur at the same site. Under this assumption, the PAM2 matrix can be estimated by squaring the PAM1 matrix, that is, by multiplying the matrix of substitution probabilities by itself. Using this logic, Dayhoff derived matrices as high as PAM250. Usually the PAM30 and the PAM70 matrices are used.
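The extrapolation from PAM1 to higher PAM matrices can be illustrated with a small sketch. The three-letter alphabet and the values in `TOY_PAM1` are hypothetical and chosen only for illustration (a real PAM1 matrix covers all 20 amino acids and is derived from observed data); the helper names `mat_mul` and `mat_pow` are likewise assumptions of this example. The point it shows is that PAM*n* is obtained as the *n*-th matrix power of PAM1, not by raising individual entries to a power.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(m)
    # Start from the identity matrix.
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = m
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Hypothetical "PAM1-like" matrix for a toy alphabet (A, B, C).
# Row i holds the probabilities that residue i is kept or replaced
# after one unit of evolution; roughly 1% of positions change.
ALPHABET = "ABC"
TOY_PAM1 = [
    [0.990, 0.007, 0.003],
    [0.007, 0.990, 0.003],
    [0.003, 0.003, 0.994],
]

toy_pam2 = mat_pow(TOY_PAM1, 2)      # two units of evolution
toy_pam250 = mat_pow(TOY_PAM1, 250)  # 250 units of evolution

for name, matrix in (("PAM1", TOY_PAM1), ("PAM2", toy_pam2), ("PAM250", toy_pam250)):
    print(name, [round(x, 4) for x in matrix[0]])
```

Each row of every resulting matrix still sums to 1, and the diagonal entries shrink as the PAM number grows, reflecting the increasing chance that a position has changed, possibly more than once.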
=== BLOSUM ===
Dayhoff's methodology of comparing closely related species turned out not to work very well for aligning evolutionarily divergent sequences. Sequence changes over long evolutionary time scales are not well approximated by compounding small changes that occur over short time scales. The BLOSUM (BLOck SUbstitution Matrix) series of matrices rectifies this problem. Henikoff & Henikoff constructed these matrices using multiple alignments of evolutionarily divergent proteins. The probabilities used in the matrix calculation are computed by looking at "blocks" of conserved sequences found in multiple protein alignments. These conserved sequences are assumed to be of functional importance within related proteins and will therefore have lower substitution rates than less conserved regions.

To reduce bias from closely related sequences on substitution rates, segments in a block with a sequence identity above a certain threshold were clustered, reducing the weight of each such cluster (Henikoff and Henikoff). For the BLOSUM62 matrix, this threshold was set at 62%. Pair frequencies were then counted between clusters, so pairs were only counted between segments less than 62% identical. One would use a higher-numbered BLOSUM matrix for aligning two closely related sequences and a lower-numbered one for more divergent sequences.

The BLOSUM62 matrix does an excellent job of detecting similarities in distant sequences, and it is the matrix used by default in most recent alignment applications, such as BLAST. The BLOSUM computer code written by Henikoff and Henikoff does not exactly match the description in their paper. Surprisingly, this commonly used "wrong" version has better search performance.
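The clustering and counting steps described above can be sketched on a single toy block. This is a simplified illustration under stated assumptions, not the Henikoff and Henikoff code: the block `BLOCK` is invented, single-linkage clustering with a "greater than or equal to" threshold is assumed, each cluster is weighted as one sequence, and scores are taken as log-odds ratios in half-bit units. The function names are hypothetical. Residue pairs that are never observed are simply omitted, since their log-odds score is undefined.

```python
from collections import defaultdict
from itertools import combinations
from math import log2


def identity(a, b):
    """Fraction of identical positions between two equal-length segments."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_segments(segments, threshold):
    """Single-linkage clustering: segments with identity >= threshold are merged."""
    parent = list(range(len(segments)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(segments)), 2):
        if identity(segments[i], segments[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = defaultdict(list)
    for i, seg in enumerate(segments):
        clusters[find(i)].append(seg)
    return list(clusters.values())


def pair_frequencies(clusters):
    """Weighted residue-pair frequencies, counted only between different clusters."""
    counts = defaultdict(float)
    for c1, c2 in combinations(clusters, 2):
        weight = 1.0 / (len(c1) * len(c2))  # each cluster counts as one sequence
        for s1 in c1:
            for s2 in c2:
                for x, y in zip(s1, s2):
                    counts[tuple(sorted((x, y)))] += weight
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def log_odds_scores(q):
    """Integer scores 2*log2(observed / expected) for each observed pair."""
    p = defaultdict(float)  # marginal residue frequencies
    for (x, y), f in q.items():
        if x == y:
            p[x] += f
        else:
            p[x] += f / 2
            p[y] += f / 2
    scores = {}
    for (x, y), f in q.items():
        expected = p[x] * p[x] if x == y else 2 * p[x] * p[y]
        scores[(x, y)] = round(2 * log2(f / expected))
    return scores


# Hypothetical block of four aligned, gap-free segments.
BLOCK = ["ACDA", "ACDA", "ACDC", "CADA"]
clusters = cluster_segments(BLOCK, threshold=0.62)
q = pair_frequencies(clusters)
scores = log_odds_scores(q)
print(len(clusters), "clusters;", dict(scores))
```

In this toy block the first three segments are at least 75% identical to one another and collapse into one cluster, so pairs are only counted between that cluster and the remaining segment. With so little data the resulting scores are not biologically meaningful; the real matrices pool counts over many blocks.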
=== Differences between PAM and BLOSUM ===
• PAM matrices are based on an explicit evolutionary model (i.e. replacements are counted on the branches of a phylogenetic tree: maximum parsimony), whereas the BLOSUM matrices are based on an implicit model of evolution.
• The PAM matrices are based on mutations observed throughout a global alignment, which includes both highly conserved and highly mutable regions. The BLOSUM matrices are based only on highly conserved regions in series of alignments that are forbidden to contain gaps.
• The method used to count the replacements is different: unlike the PAM matrix, the BLOSUM procedure uses groups of sequences within which not all mutations are counted the same.
• Higher numbers in the PAM matrix naming scheme denote larger evolutionary distance, while higher numbers in the BLOSUM matrix naming scheme denote higher sequence similarity and therefore smaller evolutionary distance. Example: PAM150 is used for more distant sequences than PAM100; BLOSUM62 is used for closer sequences than BLOSUM50.
=== Newer matrices ===
A number of newer substitution matrices have been proposed to deal with inadequacies in earlier designs.
• JTT (1992). Published in the same year as BLOSUM, it also performs clustering and uses an implicit model. This may help reduce the systematic error from maximum parsimony (MP), but also wastes sequence information.
• WAG (Whelan and Goldman, 2001) uses a maximum likelihood estimation procedure, instead of any form of MP, over a "BRKALN" dataset. The substitution scores are calculated based on the likelihood of a change considering multiple tree topologies derived using neighbor-joining. The scores correspond to a substitution model that also includes amino acid stationary frequencies and a scaling factor in the similarity scoring. There are two versions of the matrix: the WAG matrix, based on the assumption of the same amino acid stationary frequencies across all the compared proteins, and the WAG* matrix, with different frequencies for each of the included protein families.
• PMB (Probability Matrix from Blocks, 2003), a set of "true" substitution frequencies estimated from the observed frequencies of BLOSUM, taking into account the possibility of a later substitution masking a previous one. It thus creates an evolutionary model in which the distances have theoretical meaning (BLOSUM does not have this feature, unlike PAM, WAG, and most other later matrices, and hence is not recommended for phylogeny by IQ-TREE).
• LG (2008), which uses a larger (Pfam-based) dataset than WAG. An extension of the WAG algorithm is used, with a new PhyML (WAG+Γ4) model that takes into account sites with different evolutionary rates.
• Qmaker and nQmaker (2021, 2022), programs with the ability to estimate time-reversible and nonreversible matrices from very large datasets quickly. Each provides a general matrix and 5 specialized matrices, for a total of 12 precalculated substitution matrices.
• Matrices using a selection of proteins based on structural relatedness, as proposed by Benner et al. (1994), Fan (2004), and Steven et al. (2004).
PhyML, and RAxML.
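Model-based matrices of the kind listed above are commonly expressed as a time-reversible rate matrix built from symmetric exchangeabilities and stationary frequencies. The sketch below shows that general construction for a toy three-letter alphabet. It is an assumption of this example that the construction applies in this form, and the exchangeability values, frequencies, and the function name `build_rate_matrix` are all hypothetical; none of the numbers come from WAG, LG, or any other published matrix.

```python
def build_rate_matrix(exchange, freqs):
    """Build a time-reversible rate matrix Q with Q[i][j] = s[i][j] * pi[j] for i != j.

    `exchange` is a symmetric matrix of exchangeabilities s[i][j];
    `freqs` holds the stationary frequencies pi, which sum to 1.
    """
    n = len(freqs)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = exchange[i][j] * freqs[j]
        # The diagonal is still 0.0 here, so this makes the row sum to zero.
        q[i][i] = -sum(q[i])
    # Rescale so that one time unit corresponds to one expected
    # substitution per site at stationarity.
    rate = -sum(freqs[i] * q[i][i] for i in range(n))
    return [[x / rate for x in row] for row in q]


# Hypothetical values for a toy alphabet of three residues.
TOY_EXCHANGE = [
    [0.0, 2.0, 1.0],
    [2.0, 0.0, 4.0],
    [1.0, 4.0, 0.0],
]
TOY_FREQS = [0.5, 0.3, 0.2]

Q = build_rate_matrix(TOY_EXCHANGE, TOY_FREQS)
for row in Q:
    print([round(x, 4) for x in row])
```

Because the exchangeabilities are symmetric, the resulting matrix satisfies detailed balance (`pi[i] * Q[i][j] == pi[j] * Q[j][i]`), which is what makes the model time-reversible. The normalization step is what gives distances measured under such a model a direct interpretation as expected substitutions per site.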
=== Specialized substitution matrices and their extensions ===
The real substitution rates in a protein depend not only on the identity of the amino acid, but also on the specific structural or sequence context it is in. Many specialized matrices have been developed for these contexts, such as in transmembrane alpha helices, for combinations of secondary structure states and solvent accessibility states, or for local sequence-structure contexts. These context-specific substitution matrices lead to generally improved alignment quality at some cost of speed but are not yet widely used.

Since the 2000s, an increasing number of matrices have been defined for subsets of proteins that are not optimally aligned by traditional "general-purpose" matrices. These include:
• PfSSM (2008), CBM and CCF (2008) for Plasmodium proteins, which have a different amino acid evolutionary bias due to the low GC content of the genome.
• Matrices for transmembrane proteins. JTT transmembrane (1994) is the first of the class. Later work includes:
  • For alpha-helical transmembrane proteins, PHAT (2000) and SLIM (2001).
  • For beta-barrel transmembrane proteins, bbTM (2008).
• Matrices for a specific protein family, including GPCRtm (2015) for the transmembrane (mostly helical) regions of GPCRs.
• Matrices for proteins with a specific role, including Hubsm (2017) for "hub proteins" in protein-protein interaction networks.
• Matrices for intrinsically disordered proteins, including DUNMat (2002), MidicMat (2009), Disorder (2010), and EDSSMat (2019).

Recently, sequence context-specific amino acid similarities have been derived that do not need substitution matrices but rely on a library of sequence contexts instead. Using this idea, a context-specific extension of the popular BLAST program (CS-BLAST) has been demonstrated to achieve a twofold sensitivity improvement for remotely related sequences over BLAST at similar speeds.

== Nucleotide matrices ==