There are various alignment methods used within multiple sequence to maximize scores and correctness of alignments. Each is usually based on a certain heuristic with an insight into the evolutionary process. Most try to replicate evolution to get the most realistic alignment possible to best predict relations between sequences.
Dynamic programming A direct method for producing an MSA uses the
dynamic programming technique to identify the globally optimal alignment solution. For proteins, this method usually involves two sets of parameters: a
gap penalty and a
substitution matrix assigning scores or probabilities to the alignment of each possible pair of amino acids based on the similarity of the amino acids' chemical properties and the evolutionary probability of the mutation. For nucleotide sequences, a similar gap penalty is used, but a much simpler substitution matrix, wherein only identical matches and mismatches are considered, is typical. The scores in the substitution matrix may be either all positive or a mix of positive and negative in the case of a global alignment, but must be both positive and negative, in the case of a local alignment. For
n individual sequences, the naive method requires constructing the
n-dimensional equivalent of the matrix formed in standard pairwise
sequence alignment. The search space thus increases exponentially with increasing
n and is also strongly dependent on sequence length. Expressed with the
big O notation commonly used to measure
computational complexity, a
naïve MSA takes
O(LengthNseqs) time to produce. To find the global optimum for
n sequences this way has been shown to be an
NP-complete problem. In 1989, based on Carrillo-Lipman Algorithm, Altschul introduced a practical method that uses pairwise alignments to constrain the n-dimensional search space. In this approach pairwise dynamic programming alignments are performed on each pair of sequences in the query set, and only the space near the n-dimensional intersection of these alignments is searched for the n-way alignment. The MSA program optimizes the sum of all of the pairs of characters at each position in the alignment (the so-called
sum of pair score) and has been implemented in a software program for constructing multiple sequence alignments. In 2019, Hosseininasab and van Hoeve showed that by using decision diagrams, MSA may be modeled in polynomial space complexity. Progressive alignment builds up a final MSA by combining pairwise alignments beginning with the most similar pair and progressing to the most distantly related. All progressive alignment methods require two stages: a first stage in which the relationships between the sequences are represented as a
phylogenetic tree, called a
guide tree, and a second step in which the MSA is built by adding the sequences sequentially to the growing MSA according to the guide tree. The initial
guide tree is determined by an efficient
clustering method such as
neighbor-joining or
unweighted pair group method with arithmetic mean (
UPGMA), and may use distances based on the number of identical two-letter sub-sequences (as in
FASTA rather than a dynamic programming alignment).
ClustalW is used extensively for phylogenetic tree construction, in spite of the author's explicit warnings that unedited alignments should not be used in such studies and as input for
protein structure prediction by homology modeling.
European Bioinformatics Institute (EMBL-EBI) announced that CLustalW2 will expire in August 2015. They recommend
Clustal Omega which performs based on seeded guide trees and HMM profile-profile techniques for protein alignments. An alternative tool for progressive DNA alignments is
multiple alignment using fast Fourier transform (
MAFFT). Another common progressive alignment method named
T-Coffee is slower than Clustal and its derivatives but generally produces more accurate alignments for distantly related sequence sets. T-Coffee calculates pairwise alignments by combining the direct alignment of the pair with indirect alignments that aligns each sequence of the pair to a third sequence. It uses the output from Clustal as well as another local alignment program LALIGN, which finds multiple regions of local alignment between two sequences. The resulting alignment and phylogenetic tree are used as a guide to produce new and more accurate weighting factors. Because progressive methods are heuristics that are not guaranteed to converge to a global optimum, alignment quality can be difficult to evaluate and their true biological significance can be obscure. A semi-progressive method that improves alignment quality and does not use a lossy heuristic while running in
polynomial time has been implemented in the program PSAlign.
Iterative methods A set of methods to produce MSAs while reducing the errors inherent in progressive methods are classified as "iterative" because they work similarly to progressive methods but repeatedly realign the initial sequences as well as adding new sequences to the growing MSA. One reason progressive methods are so strongly dependent on a high-quality initial alignment is the fact that these alignments are always incorporated into the final result – that is, once a sequence has been aligned into the MSA, its alignment is not considered further. This approximation improves efficiency at the cost of accuracy. By contrast, iterative methods can return to previously calculated pairwise alignments or sub-MSAs incorporating subsets of the query sequence as a means of optimizing a general
objective function such as finding a high-quality alignment score. The software package PRRN/PRRP uses a
hill-climbing algorithm to optimize its MSA alignment score and iteratively corrects both alignment weights and locally divergent or "gappy" regions of the growing MSA. PRRP performs best when refining an alignment previously constructed by a faster method. The alignment of individual motifs is then achieved with a matrix representation similar to a dot-matrix plot in a pairwise alignment. An alternative method that uses fast local alignments as anchor points or
seeds for a slower global-alignment procedure is implemented in the CHAOS/DIALIGN suite. The distance measure is updated between iteration stages (although, in its original form, MUSCLE contained only 2-3 iterations depending on whether refinement was enabled).
Consensus methods Consensus methods attempt to find the optimal multiple sequence alignment given multiple different alignments of the same set of sequences. There are two commonly used consensus methods, M-COFFEE and MergeAlign. M-COFFEE uses multiple sequence alignments generated by seven different methods to generate consensus alignments. MergeAlign is capable of generating consensus alignments from any number of input alignments generated using different models of sequence evolution or different methods of multiple sequence alignment. The default option for MergeAlign is to infer a consensus alignment using alignments generated using 91 different models of
protein sequence evolution.
Hidden Markov models (HMM) modelling a multiple sequence alignment A
hidden Markov model (HMM) is a probabilistic model that can assign likelihoods to all possible combinations of gaps, matches, and mismatches, to determine the most likely MSA or set of possible MSAs. HMMs can produce a single highest-scoring output but can also generate a family of possible alignments that can then be evaluated for biological significance. HMMs can produce both global and local alignments. Although HMM-based methods have been developed relatively recently, they offer significant improvements in computational speed, especially for sequences that contain overlapping regions. This is distinct from progressive alignment methods because the alignment of prior sequences is updated at each new sequence addition. However, like progressive methods, this technique can be influenced by the order in which the sequences in the query set are integrated into the alignment, especially when the sequences are distantly related. and a similar more general method is implemented in the Sequence Alignment and Modeling System (SAM) software package. and
HMMER. SAM has been used as a source of alignments for
protein structure prediction to participate in the Critical Assessment of Structure Prediction (
CASP) structure prediction experiment and to develop a database of predicted proteins in the
yeast species
S. cerevisiae.
HHsearch is a software package for the detection of remotely related protein sequences based on the pairwise comparison of HMMs. A server running HHsearch (
HHpred) was the fastest of 10 automatic structure prediction servers in the CASP7 and CASP8 structure prediction competitions.
Phylogeny-aware methods Most multiple sequence alignment methods try to minimize the number of
insertions/deletions (gaps) and, as a consequence, produce compact alignments. This causes several problems if the sequences to be aligned contain non-
homologous regions, if gaps are informative in a
phylogeny analysis. These problems are common in newly produced sequences that are poorly annotated and may contain
frame-shifts, wrong
domains or non-homologous
spliced exons. The first such method was developed in 2005 by Löytynoja and Goldman. The same authors released a software package called
PRANK in 2008. PRANK improves alignments when insertions are present. Nevertheless, it runs slowly compared to progressive and/or iterative methods which have been developed for several years. In 2012, two new phylogeny-aware tools appeared. One is called
PAGAN that was developed by the same team as PRANK. The other is
ProGraphMSA developed by Szalkowski. Both software packages were developed independently but share common features, notably the use of
graph algorithms to improve the recognition of non-homologous regions, and an improvement in code making these software faster than PRANK.
Motif finding caspases colored by motifs as identified by MEME. When motif positions and sequence alignments are generated independently, they often correlate well but not perfectly, as in this example. Motif finding, also known as profile analysis, is a method of locating
sequence motifs in global MSAs that is both a means of producing a better MSA and a means of producing a scoring matrix for use in searching other sequences for similar motifs. A variety of methods for isolating the motifs have been developed, but all are based on identifying short highly conserved patterns within the larger alignment and constructing a matrix similar to a substitution matrix that reflects the amino acid or nucleotide composition of each position in the putative motif. The alignment can then be refined using these matrices. In standard profile analysis, the matrix includes entries for each possible character as well as entries for gaps. Block scoring generally relies on the spacing of high-frequency characters rather than on the calculation of an explicit substitution matrix. Statistical pattern-matching has been implemented using both the
expectation-maximization algorithm and the
Gibbs sampler. One of the most common motif-finding tools, named
Multiple EM for Motif Elicitation (MEME), uses expectation maximization and hidden Markov methods to generate motifs that are then used as search tools by its companion MAST in the combined suite MEME/MAST.
Non-coding multiple sequence alignment Non-coding DNA regions, especially
transcription factor binding sites (TFBSs), are conserved, but not necessarily evolutionarily related, and may have converged from non-common ancestors. Thus, the assumptions used to align protein sequences and DNA coding regions are inherently different from those that hold for TFBS sequences. Although it is meaningful to align DNA coding regions for homologous sequences using mutation operators, alignment of
binding site sequences for the same transcription factor cannot rely on evolutionary related mutation operations. Similarly, the evolutionary operator of point mutations can be used to define an
edit distance for coding sequences, but this has little meaning for TFBS sequences because any sequence variation has to maintain a certain level of specificity for the binding site to function. This becomes specifically important when trying to align known TFBS sequences to build supervised models to predict unknown locations of the same TFBS. Hence, Multiple Sequence Alignment methods need to adjust the underlying evolutionary hypothesis and the operators used as in the work published incorporating neighbouring base thermodynamic information to align the binding sites searching for the lowest thermodynamic alignment conserving specificity of the binding site. ==Optimization==