== Artificial neural networks ==
Artificial neural networks in bioinformatics have been used for:
• Comparing and aligning RNA, protein, and DNA sequences.
• Identifying promoters and finding genes in DNA sequences.
• Interpreting gene expression and microarray data.
• Identifying gene regulatory networks.
• Learning evolutionary relationships by constructing phylogenetic trees.
• Classifying and predicting protein structure.
• Molecular design and docking.

== Feature engineering ==
The way that features, often vectors in a many-dimensional space, are extracted from the domain data is an important component of learning systems. In genomics, a typical representation of a sequence is a vector of k-mer frequencies: a vector of dimension 4^k whose entries count the occurrences of each subsequence of length k in a given sequence. Since even for a value as small as k = 12 the dimensionality of these vectors is huge (in this case 4^{12} \approx 16\times 10^6), techniques such as principal component analysis are used to project the data onto a lower-dimensional space, thus selecting a smaller set of features from the sequences.
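As a minimal sketch of this representation (assuming NumPy and scikit-learn, and a toy k = 3 rather than k = 12), the following computes k-mer frequency vectors and projects them with principal component analysis; the sequences and dimensions are illustrative only:

```python
from itertools import product

import numpy as np
from sklearn.decomposition import PCA

def kmer_frequency_vector(seq: str, k: int) -> np.ndarray:
    """Count every length-k subsequence of `seq` into a 4**k vector."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = np.zeros(len(index))
    for i in range(len(seq) - k + 1):
        kmer = seq[i : i + k]
        if kmer in index:          # skip k-mers with ambiguous bases (e.g. N)
            counts[index[kmer]] += 1
    return counts

# Toy data: k=3 gives 4**3 = 64 features; PCA projects them to 2 dimensions.
sequences = ["ACGTACGTGG", "TTGACCAGTA", "GGGCGCGCTA", "ACGTTTTACG"]
X = np.array([kmer_frequency_vector(s, k=3) for s in sequences])
X_low = PCA(n_components=2).fit_transform(X)
print(X_low.shape)  # (4, 2)
```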
== Classification ==
In this type of machine learning task, the output is a discrete variable. One example of this type of task in bioinformatics is labeling new genomic data (such as genomes of unculturable bacteria) based on a model of already labeled data. Hidden Markov models (HMMs), which can also be formulated in continuous time, are commonly applied here: an HMM can profile a multiple sequence alignment and convert it into a position-specific scoring system suitable for searching databases for remote sequence homologs. Additionally, ecological phenomena can be described by HMMs.
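The database-search use case boils down to scoring how plausibly a model emits an observed sequence. A minimal sketch of that computation, the forward algorithm for a discrete-time HMM, is shown below; the two states and all probabilities are invented for illustration and do not come from any real profile (assumes NumPy):

```python
import numpy as np

# Minimal discrete-time HMM: P(observations | model) via the forward algorithm.
# States and probabilities are illustrative, not taken from a real profile.
states = ["match", "insert"]
start = np.array([0.8, 0.2])              # initial state distribution
trans = np.array([[0.9, 0.1],             # state transition probabilities
                  [0.5, 0.5]])
emit = np.array([[0.40, 0.10, 0.10, 0.40],   # emission over A, C, G, T
                 [0.25, 0.25, 0.25, 0.25]])
symbol = {"A": 0, "C": 1, "G": 2, "T": 3}

def forward_likelihood(seq: str) -> float:
    """Sum over all hidden-state paths that could emit `seq`."""
    alpha = start * emit[:, symbol[seq[0]]]
    for ch in seq[1:]:
        alpha = (alpha @ trans) * emit[:, symbol[ch]]
    return alpha.sum()

print(forward_likelihood("ACGT"))
```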
== Convolutional neural networks ==
Convolutional neural networks (CNNs) are a class of deep neural networks whose architecture is based on shared weights of convolution kernels or filters that slide along input features, providing translation-equivariant responses known as feature maps. CNNs take advantage of the hierarchical pattern in data, assembling patterns of increasing complexity from the smaller and simpler patterns discovered by their filters.

Convolutional networks were inspired by biological processes, in that the connectivity pattern between neurons resembles the organization of the animal visual cortex. Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of different neurons partially overlap such that they cover the entire visual field.

CNNs use relatively little pre-processing compared to other image classification algorithms: the network learns to optimize the filters (or kernels) through automated learning, whereas in traditional algorithms these filters are hand-engineered. This reduced reliance on the analyst's prior knowledge and on manual feature extraction makes CNNs a desirable model. In one application of CNNs to phylogenetic data, the data are endowed with a patristic distance (the sum of the lengths of all branches connecting two operational taxonomic units [OTUs]) used to select k-neighborhoods for each OTU, and each OTU and its neighbors are processed with convolutional filters.
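A minimal sketch of the core operation on genomic data, assuming PyTorch: a sequence is one-hot encoded, and a bank of shared-weight Conv1d filters slides along it, producing translation-equivariant feature maps. The sequence, filter width, and channel counts are illustrative:

```python
import torch
import torch.nn as nn

def one_hot(seq: str) -> torch.Tensor:
    """One-hot encode a DNA sequence into a (channels=4, length) tensor."""
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    x = torch.zeros(4, len(seq))
    for pos, base in enumerate(seq):
        x[idx[base], pos] = 1.0
    return x

# A Conv1d filter of width 8 slides along the sequence; the same weights are
# applied at every position (weight sharing), so a motif is detected wherever
# it occurs (translation equivariance).
conv = nn.Conv1d(in_channels=4, out_channels=16, kernel_size=8)
x = one_hot("ACGTACGTGGGCCCATTA").unsqueeze(0)   # add a batch dimension
feature_maps = torch.relu(conv(x))
print(feature_maps.shape)  # torch.Size([1, 16, 11]): 16 maps, 18-8+1 positions
```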
== Self-supervised learning (Attention and Transformer Models) ==
Unlike supervised methods, self-supervised learning methods learn representations without relying on annotated data. This is well suited to genomics, where high-throughput sequencing techniques can create potentially large amounts of unlabeled data. Examples of self-supervised learning methods applied to genomics include DNABERT and Self-GenomeNet.

Because of their parallelism and their ability to extract correlations across whole sequences, transformer-based models achieve state-of-the-art performance in a variety of important tasks such as machine translation and question answering. The vanilla transformer model can be divided into two parts, an encoder and a decoder, which share a similar basic architecture composed of a stack of identical blocks. Each block consists of two kinds of sub-layers: the multi-head attention sub-layer and the position-wise feed-forward sub-layer. Both kinds of sub-layers are followed by layer normalization, and a residual connection is applied around every sub-layer in each block to speed up training.
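A minimal sketch of one such block, assuming PyTorch (the dimensions and the post-norm ordering are illustrative choices, not prescribed by the text):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One transformer encoder block: a multi-head self-attention sub-layer
    and a position-wise feed-forward sub-layer, each wrapped in a residual
    connection followed by layer normalization."""

    def __init__(self, d_model: int = 64, n_heads: int = 4, d_ff: int = 256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x)   # self-attention: Q = K = V = x
        x = self.norm1(x + attn_out)       # residual connection + layer norm
        x = self.norm2(x + self.ff(x))     # residual connection + layer norm
        return x

tokens = torch.randn(1, 20, 64)            # a batch of one 20-token sequence
print(EncoderBlock()(tokens).shape)        # torch.Size([1, 20, 64])
```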
== Attention modules ==
The key innovation in the transformer architecture is the multi-head self-attention layer, which can relate all relevant tokens to better encode every word or residue in the input sequence. The self-attention layer takes a sequence of tokens as input (tokens being the equivalent of words in a language, or of amino acids/nucleotides in a biological sequence) and learns sequence-wide context information. Multi-head attention runs multiple attention heads simultaneously. Before the attention function is calculated, each token embedding is transformed into three corresponding vectors: the Query (Q), the Key (K), and the Value (V) vectors. This transformation is achieved by multiplying the token embedding with three randomly initialized, learnable parameter matrices, W_Q, W_K, and W_V. The core attention function is computed in three steps:
• Scoring: the attention head computes the dot products of the Query vector with all Key vectors.
• Scaling and weighting: each dot product is divided by \sqrt{d_k} (where d_k is the dimension of the key vectors) and a softmax function is applied to obtain the weights on the Value vectors.
• Output: the output of the attention function is the weighted sum of the Value vectors, which carries information from the entire sequence. The weight assigned to each Value is computed by a compatibility function of the Query with the corresponding Key.

In the parallel computation of the attention function, the sets of query, key, and value vectors are packed into matrices Q, K, and V, and the attention function is computed as

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V

Generalized to multi-head attention with h heads, the results of the heads (each with its own parameters W^Q_i, W^K_i, W^V_i) are concatenated and projected once more with a parameter matrix W^O, giving the final output:

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O, \quad \text{where } \text{head}_i = \text{Attention}(Q W^Q_i, K W^K_i, V W^V_i)
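As a minimal sketch of the formula above (a single head, NumPy, random data; the dimensions and initialization are illustrative), multi-head attention would concatenate h such outputs and project them with W^O:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # scoring + scaling + softmax
    return weights @ V                          # weighted sum of the values

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 5, 16, 8
X = rng.normal(size=(n_tokens, d_model))        # token embeddings

# Randomly initialized, learnable projections W_Q, W_K, W_V (one head).
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape)  # (5, 8): each token's output summarizes the whole sequence
```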
== Random forest ==
A random forest classifies by constructing an ensemble of decision trees and outputting the consensus prediction of the individual trees. This is a modification of bootstrap aggregating (which aggregates a large collection of decision trees) and can be used for classification or regression. Because random forests give an internal estimate of generalization error, cross-validation is unnecessary. In addition, they produce proximities, which can be used to impute missing values and which enable novel data visualizations.

Computationally, random forests are appealing because they naturally handle both regression and (multiclass) classification, are relatively fast to train and to predict, depend on only one or two tuning parameters, have a built-in estimate of the generalization error, can be used directly on high-dimensional problems, and can easily be implemented in parallel. Statistically, they are appealing for their additional features, such as measures of variable importance, differential class weighting, missing-value imputation, visualization, outlier detection, and unsupervised learning.
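A minimal sketch using scikit-learn's RandomForestClassifier on synthetic data; oob_score=True requests the out-of-bag estimate, i.e. the internal estimate of generalization error mentioned above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for, e.g., k-mer features with class labels.
X, y = make_classification(n_samples=300, n_features=50, random_state=0)

# oob_score=True computes the out-of-bag accuracy, the built-in estimate of
# generalization error that makes a separate cross-validation unnecessary.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)
print(f"out-of-bag accuracy: {forest.oob_score_:.3f}")
print("most important features:", forest.feature_importances_.argsort()[-3:])
```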
== Clustering algorithms used in bioinformatics ==
Data clustering algorithms can be hierarchical or partitional. Hierarchical algorithms find successive clusters using previously established clusters, whereas partitional algorithms determine all clusters at once. Hierarchical algorithms can be agglomerative (bottom-up) or divisive (top-down): agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters, while divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.
Hierarchical clustering is calculated using metrics on Euclidean spaces; the most commonly used is the Euclidean distance, computed by taking the difference between two points in each variable, squaring each difference, summing the squares, and taking the square root of the sum. An example of a hierarchical clustering algorithm is BIRCH, which is particularly well suited to bioinformatics because its time complexity is nearly linear in the generally large datasets involved.
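A minimal sketch of agglomerative clustering with the Euclidean distance described above, assuming SciPy (BIRCH itself is available as, e.g., sklearn.cluster.Birch); the data are synthetic:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Two loose groups of points standing in for, e.g., expression profiles.
X = np.vstack([rng.normal(0, 1, (5, 3)), rng.normal(6, 1, (5, 3))])

# Agglomerative (bottom-up) clustering: each point starts as its own cluster
# and the closest pairs (by Euclidean distance) are merged successively.
Z = linkage(X, method="average", metric="euclidean")
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
print(labels)  # e.g. [1 1 1 1 1 2 2 2 2 2]
```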
Partitioning algorithms are based on specifying an initial number of groups and iteratively reallocating objects among the groups until convergence; such algorithms typically determine all clusters at once. Most applications adopt one of two popular heuristic methods:
the k-means algorithm or k-medoids. Other algorithms, such as affinity propagation, do not require an initial number of groups. In a genomic setting, affinity propagation has been used both to cluster biosynthetic gene clusters into gene cluster families (GCFs) and to cluster those GCFs in turn.
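A minimal sketch contrasting the two styles on synthetic data, assuming scikit-learn: k-means needs the number of groups up front, while affinity propagation infers it:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation, KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 4)), rng.normal(5, 1, (20, 4))])

# k-means requires the number of groups to be specified in advance ...
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# ... whereas affinity propagation determines the number of clusters itself.
ap = AffinityPropagation(random_state=0).fit(X)
print(sorted(set(km_labels)), "clusters found:", len(ap.cluster_centers_indices_))
```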
== Workflow ==
Typically, a workflow for applying machine learning to biological data goes through four steps:
• Recording, including capture and storage. In this step, different information sources may be merged into a single data set.
• Preprocessing, including cleaning and restructuring the data into a ready-to-analyze form. In this step, erroneous data are eliminated or corrected, missing data may be imputed, and relevant variables are chosen.
• Analysis, evaluating the data using either supervised or unsupervised algorithms. The algorithm is typically trained on a subset of the data, optimizing parameters, and evaluated on a separate test subset (see the sketch after this list).
• Visualization and interpretation, where knowledge is represented effectively using different methods to assess the significance and importance of the findings.
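A minimal sketch of steps 2 and 3 on synthetic data, assuming scikit-learn (the imputer, scaler, and classifier are illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Recording: assemble a feature matrix (a synthetic stand-in here),
# then knock out 5% of entries to mimic missing data.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

# Preprocessing: impute missing values and rescale the variables.
model = make_pipeline(SimpleImputer(), StandardScaler(),
                      LogisticRegression(max_iter=1000))

# Analysis: train on one subset, evaluate on a held-out test subset.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model.fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```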
== Data errors ==
• Duplicate data is a significant issue in bioinformatics, and publicly available data may be of uncertain quality.
• Errors during experimentation.
• Erroneous interpretation.
• Typing mistakes.
• Non-standardized methods are used in experiments (e.g., 3D structures in the PDB come from multiple sources: X-ray diffraction, theoretical modeling, nuclear magnetic resonance, etc.).

== Applications ==