Machine learning in bioinformatics

Machine learning in bioinformatics is the application of machine learning algorithms to bioinformatics, including genomics, proteomics, microarrays, systems biology, evolution, and text mining.

Tasks

Machine learning algorithms in bioinformatics can be used for prediction, classification, and feature selection. Methods to achieve this task are varied and span many disciplines; most well known among them are machine learning and statistics. Classification and prediction tasks aim at building models that describe and distinguish classes or concepts for future prediction. The differences between them are the following: • Classification/recognition outputs a categorical class, while prediction outputs a numerical valued feature. • The type of algorithm, or process used to build the predictive models from data using analogies, rules, neural networks, probabilities, and/or statistics. Due to the exponential growth of information technologies and applicable models, including artificial intelligence and data mining, in addition to the access ever-more comprehensive data sets, new and better information analysis techniques have been created, based on their ability to learn. Such models allow reach beyond description and provide insights in the form of testable models. == Approaches ==

Approaches

Artificial neural networks Artificial neural networks in bioinformatics have been used for: • Comparing and aligning RNA, protein, and DNA sequences. • Identification of promoters and finding genes from sequences related to DNA. • Interpreting the expression-gene and micro-array data. • Identifying the network (regulatory) of genes. • Learning evolutionary relationships by constructing phylogenetic trees. • Classifying and predicting protein structure. • Molecular design and docking Feature engineering The way that features, often vectors in a many-dimensional space, are extracted from the domain data is an important component of learning systems. In genomics, a typical representation of a sequence is a vector of k-mers frequencies, which is a vector of dimension 4^k whose entries count the appearance of each subsequence of length k in a given sequence. Since for a value as small as k=12 the dimensionality of these vectors is huge (e.g. in this case the dimension is 4^{12} \approx 16\times 10^6), techniques such as principal component analysis are used to project the data to a lower dimensional space, thus selecting a smaller set of features from the sequences. Classification In this type of machine learning task, the output is a discrete variable. One example of this type of task in bioinformatics is labeling new genomic data (such as genomes of unculturable bacteria) based on a model of already labeled data. HMMs can be formulated in continuous time. HMMs can be used to profile and convert a multiple sequence alignment into a position-specific scoring system suitable for searching databases for homologous sequences remotely. Additionally, ecological phenomena can be described by HMMs. Convolutional neural networks Convolutional neural networks (CNN) are a class of deep neural network whose architecture is based on shared weights of convolution kernels or filters that slide along input features, providing translation-equivariant responses known as feature maps. CNNs take advantage of the hierarchical pattern in data and assemble patterns of increasing complexity using smaller and simpler patterns discovered via their filters. Convolutional networks were inspired by biological processes in that the connectivity pattern between neurons resembles the organization of the animal visual cortex. Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of different neurons partially overlap such that they cover the entire visual field. CNN uses relatively little pre-processing compared to other image classification algorithms. This means that the network learns to optimize the filters (or kernels) through automated learning, whereas in traditional algorithms these filters are hand-engineered. This reduced reliance on prior knowledge of the analyst and on human intervention in manual feature extraction makes CNNs a desirable model. In this approach, phylogenetic data is endowed with patristic distance (the sum of the lengths of all branches connecting two operational taxonomic units [OTU]) to select k-neighborhoods for each OTU, and each OTU and its neighbors are processed with convolutional filters. Self-supervised learning (Attention and Transformer Models) Unlike supervised methods, self-supervised learning methods learn representations without relying on annotated data. That is well-suited for genomics, where high throughput sequencing techniques can create potentially large amounts of unlabeled data. Some examples of self-supervised learning methods applied on genomics include DNABERT and Self-GenomeNet. Because of their parallelism and the ability to extract correlation across the whole sequences, transformer-based models achieve state-of-the-art performance in a variety of important tasks such as machine translation and question answering. The vanilla transformer model can be divided into two parts: encoder and decoder, which have similar basic architectures composed of a stack of identical blocks. Each block consists of two kinds of sub-layers: the multi-head attention sub-layer and the position-wise feed-forward sub-layer. Both kinds of sublayers are followed by layer normalization. A residual connection around every sub-layer will be applied in each block to speed up the training process. Attention modules The key innovation in the Transformer architecture is the multi-head self-attention layer, which can relate all relevant tokens to better encode every word or residue in the input sequence. The self-attention layer takes a sequence of tokens as input (tokens equivalent to words in a language or amino acids/nucleotides in a sequence) and learns sequence-wide context information. Multi-head attention represents multiple simultaneous attention heads. Before calculating the attention function, each token embedding is transformed into three corresponding vectors: the Query (Q), the Key (K), and the Value (V) vectors. This transformation is achieved by multiplying the token embedding with three randomly initialized, learnable parameter matrices, W_Q, W_K, and W_V. The core attention function is computed by three steps: • Scoring: The attention head computes the dot products of the Query vector with all Key vectors. • Scaling and Weighting: Each dot product is divided by \sqrt{d_k} (where d_k is the dimension of the key vector) and a Softmax function is applied to obtain the weights on the Value vectors. • Output: The output of the attention function is the weighted sum of these Value vectors, which contains information for the entire sequence. The weight assigned to each value is computed by a compatibility function of the Query with the corresponding Key. In the parallel computation of the attention function, a set of query, key, and value vectors are packed into matrices Q, K, and V. The attention function is computed as follows: {Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V When generalized to multi-head attention with h heads, the results of the multiple heads (each assigned different parameters W_Q, W_K, W_V) are concatenated and once again projected with a parameter matrix W^O, resulting in the final output:\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O \text{where } \text{head}_i = \text{Attention}(Q W^Q_i, K W^K_i, V W^V_i) This is a modification of bootstrap aggregating (which aggregates a large collection of decision trees) and can be used for classification or regression. As random forests give an internal estimate of generalization error, cross-validation is unnecessary. In addition, they produce proximities, which can be used to impute missing values, and which enable novel data visualizations. Computationally, random forests are appealing because they naturally handle both regression and (multiclass) classification, are relatively fast to train and to predict, depend only on one or two tuning parameters, have a built-in estimate of the generalization error, can be used directly for high-dimensional problems, and can easily be implemented in parallel. Statistically, random forests are appealing for additional features, such as measures of variable importance, differential class weighting, missing value imputation, visualization, outlier detection, and unsupervised learning. Clustering algorithms used in bioinformatics Data clustering algorithms can be hierarchical or partitional. Hierarchical algorithms find successive clusters using previously established clusters, whereas partitional algorithms determine all clusters at once. Hierarchical algorithms can be agglomerative (bottom-up) or divisive (top-down). Agglomerative algorithms begin with each element as a separate cluster and merge them in successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters. Hierarchical clustering is calculated using metrics on Euclidean spaces, the most commonly used is the Euclidean distance computed by finding the square of the difference between each variable, adding all the squares, and finding the square root of the said sum. An example of a hierarchical clustering algorithm is BIRCH, which is particularly good on bioinformatics for its nearly linear time complexity given generally large datasets. Partitioning algorithms are based on specifying an initial number of groups, and iteratively reallocating objects among groups to convergence. This algorithm typically determines all clusters at once. Most applications adopt one of two popular heuristic methods: k-means algorithm or k-medoids. Other algorithms do not require an initial number of groups, such as affinity propagation. In a genomic setting this algorithm has been used both to cluster biosynthetic gene clusters in gene cluster families(GCF) and to cluster said GCFs. Workflow Typically, a workflow for applying machine learning to biological data goes through four steps: • Recording, including capture and storage. In this step, different information sources may be merged into a single set. • Preprocessing, including cleaning and restructuring into a ready-to-analyze form. In this step, uncorrected data are eliminated or corrected, while missing data maybe imputed and relevant variables chosen. • Analysis, evaluating data using either supervised or unsupervised algorithms. The algorithm is typically trained on a subset of data, optimizing parameters, and evaluated on a separate test subset. • Visualization and interpretation, where knowledge is represented effectively using different methods to assess the significance and importance of the findings. Data errors • Duplicate data is a significant issue in bioinformatics. Publicly available data may be of uncertain quality. • Errors during experimentation. • Erroneous interpretation. • Typing mistakes. • Non-standardized methods (3D structure in PDB from multiple sources, X-ray diffraction, theoretical modeling, nuclear magnetic resonance, etc.) are used in experiments. == Applications ==

Applications

In general, a machine learning system can usually be trained to recognize elements of a certain class given sufficient samples. For example, machine learning methods can be trained to identify specific visual features such as splice sites. Support vector machines have been extensively used in cancer genomic studies. In addition, deep learning has been incorporated into bioinformatic algorithms. Deep learning applications have been used for regulatory genomics and cellular imaging. Other applications include medical image classification, genomic sequence analysis, as well as protein structure classification and prediction. Deep learning has been applied to regulatory genomics, variant calling and pathogenicity scores. Natural language processing and text mining have helped to understand phenomena including protein-protein interaction, gene-disease relation as well as predicting biomolecule structures and functions. Precision/personalized medicine Natural language processing algorithms personalized medicine for patients who suffer genetic diseases, by combining the extraction of clinical information and genomic data available from the patients. Institutes such as Health-funded Pharmacogenomics Research Network focus on finding breast cancer treatments. However, while raw data was becoming increasingly available and accessible, , biological interpretation of this data was occurring at a much slower pace. This made for an increasing need for developing computational genomics tools, including machine learning systems, that can automatically determine the location of protein-encoding genes within a given DNA sequence (i.e. gene prediction). Proteomics Proteins, strings of amino acids, gain much of their function from protein folding, where they conform into a three-dimensional structure, including the primary structure, the secondary structure (alpha helices and beta sheets), the tertiary structure, and the quaternary structure. Protein secondary structure prediction is a main focus of this subfield as tertiary and quaternary structures are determined based on the secondary structure. Automatic feature learning reaches an accuracy of 82-84%. The theoretical limit for three-state protein secondary structure is 88–90%. Machine learning has also been applied to proteomics problems such as protein side-chain prediction, protein loop modeling, and protein contact map prediction. Currently, limitations and challenges predominate in the implementation of machine learning tools due to the amount of data in environmental samples. Supercomputers and web servers have made access to these tools easier. The high dimensionality of microbiome datasets is a major challenge in studying the microbiome; this significantly limits the power of current approaches for identifying true differences and increases the chance of false discoveries. Despite their importance, machine learning tools related to metagenomics have focused on the study of gut microbiota and the relationship with digestive diseases, such as inflammatory bowel disease (IBD), Clostridioides difficile infection (CDI), colorectal cancer and diabetes, seeking better diagnosis and treatments. In addition, random forest (RF) methods and implemented importance measures help in the identification of microbiome species that can be used to distinguish diseased and non-diseased samples. However, the performance of a decision tree and the diversity of decision trees in the ensemble significantly influence the performance of RF algorithms. The generalization error for RF measures how accurate the individual classifiers are and their interdependence. Therefore, the high dimensionality problems of microbiome datasets pose challenges. Effective approaches require many possible variable combinations, which exponentially increases the computational burden as the number of features increases. designed an R package called MegaR. This package allows working with 16S rRNA and whole metagenomic sequences to make taxonomic profiles and classification models by machine learning models. MegaR includes a comfortable visualization environment to improve the user experience. Machine learning in environmental metagenomics can help to answer questions related to the interactions between microbial communities and ecosystems, e.g. the work of Xun et al., in 2021 where the use of different machine learning methods offered insights on the relationship among the soil, microbiome biodiversity, and ecosystem stability. Microarrays Microarrays, a type of lab-on-a-chip, are used for automatically collecting data about large amounts of biological material. Machine learning can aid in analysis, and has been applied to expression pattern identification, classification, and genetic network induction. One of the main tasks is identifying which genes are expressed based on the collected data. Machine learning has been used to aid in modeling these interactions in domains such as genetic networks, signal transduction networks, and metabolic pathways. Evolution This domain, particularly phylogenetic tree reconstruction, uses the features of machine learning techniques. Phylogenetic trees are schematic representations of the evolution of organisms. Initially, they were constructed using features such as morphological and metabolic features. Later, due to the availability of genome sequences, the construction of the phylogenetic tree algorithm used the concept based on genome comparison. With the help of optimization techniques, a comparison was done by means of multiple sequence alignment. Stroke diagnosis Machine learning methods for the analysis of neuroimaging data are used to help diagnose stroke. Historically multiple approaches to this problem involved neural networks. Multiple approaches to detect strokes used machine learning. As proposed by Mirtskhulava, feed-forward networks were tested to detect strokes using neural imaging. As proposed by Titano 3D-CNN techniques were tested in supervised classification to screen head CT images for acute neurologic events. Three-dimensional CNN and SVM methods are often used. Machine learning can be used for this knowledge extraction task using techniques such as natural language processing to extract the useful information from human-generated reports in a database. Text Nailing, an alternative approach to machine learning, capable of extracting features from clinical narrative notes was introduced in 2017. This technique has been applied to the search for novel drug targets, as this task requires the examination of information stored in biological databases and journals. Clustering and abundance profiling of biosynthetic gene clusters Microbial communities are complex assembles of diverse microorganisms, where symbiont partners constantly produce diverse metabolites derived from the primary and secondary (specialized) metabolism, from which metabolism plays an important role in microbial interaction. Metagenomic and metatranscriptomic data are an important source for deciphering communications signals. Molecular mechanisms produce specialized metabolites in various ways. Biosynthetic Gene Clusters (BGCs) attract attention, since several metabolites are clinically valuable, anti-microbial, anti-fungal, anti-parasitic, anti-tumor and immunosuppressive agents produced by the modular action of multi-enzymatic, multi-domains gene clusters, such as Nonribosomal peptide synthetases (NRPSs) and polyketide synthases (PKSs). Diverse studies show that grouping BGCs that share homologous core genes into gene cluster families (GCFs) can yield useful insights into the chemical diversity of the analyzed strains, and can support linking BGCs to their secondary metabolites. and to study the ability of soil to suppress fungal pathogens. Given their direct relationship to catalytic enzymes, and compounds produced from their encoded pathways, BGCs/GCFs can serve as a proxy to explore the chemical space of microbial secondary metabolism. Cataloging GCFs in sequenced microbial genomes yields an overview of the existing chemical diversity and offers insights into future priorities. have emerged with the sole purpose of unveiling the importance of BGCs in natural environments. Decodification of RiPPs chemical structures The increase of experimentally characterized ribosomally synthesized and post-translationally modified peptides (RiPPs), together with the availability of information on their sequence and chemical structure, selected from databases such as BAGEL, BACTIBASE, MIBIG, and THIOBASE, provide the opportunity to develop machine learning tools to decode the chemical structure and classify them. In 2017, researchers at the National Institute of Immunology of New Delhi, India, developed RiPPMiner software, a bioinformatics resource for decoding RiPP chemical structures by genome mining. The RiPPMiner web server consists of a query interface and the RiPPDB database. RiPPMiner defines 12 subclasses of RiPPs, predicting the cleavage site of the leader peptide and the final cross-link of the RiPP chemical structure. Mass spectral similarity scoring Many tandem mass spectrometry (MS/MS) based metabolomics studies, such as library matching and molecular networking, use spectral similarity as a proxy for structural similarity. Spec2vec algorithm provides a new way of spectral similarity score, based on Word2Vec. Spec2Vec learns fragmental relationships within a large set of spectral data, in order to assess spectral similarities between molecules and to classify unknown molecules through these comparisons. For systemic annotation, some metabolomics studies rely on fitting measured fragmentation mass spectra to library spectra or contrasting spectra via network analysis. Scoring functions are used to determine the similarity between pairs of fragment spectra as part of these processes. So far, no research has suggested scores that are significantly different from the commonly utilized cosine-based similarity. == Databases ==

Databases

An important part of bioinformatics is the management of big datasets, known as databases of reference. Databases exist for each type of biological data, for example for biosynthetic gene clusters and metagenomes. General databases by bioinformatics National Center for Biotechnology Information The National Center for Biotechnology Information (NCBI) provides a large suite of online resources for biological information and data, including the GenBank nucleic acid sequence database and the PubMed database of citations and abstracts for published life science journals. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. Resources include PubMed Data Management, RefSeq Functional Elements, genome data download, variation services API, Magic-BLAST, QuickBLASTp, and Identical Protein Groups. All of these resources can be accessed through NCBI. Bioinformatics analysis for biosynthetic gene clusters antiSMASH antiSMASH allows the rapid genome-wide identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genomes. It integrates and cross-links with a large number of in silico secondary metabolite analysis tools. gutSMASH gutSMASH is a tool that systematically evaluates bacterial metabolic potential by predicting both known and novel anaerobic metabolic gene clusters (MGCs) from the gut microbiome. MIBiG MIBiG, the minimum information about a biosynthetic gene cluster specification, provides a standard for annotations and metadata on biosynthetic gene clusters and their molecular products. MIBiG is a Genomic Standards Consortium project that builds on the minimum information about any sequence (MIxS) framework. MIBiG facilitates the standardized deposition and retrieval of biosynthetic gene cluster data as well as the development of comprehensive comparative analysis tools. It empowers next-generation research on the biosynthesis, chemistry and ecology of broad classes of societally relevant bioactive secondary metabolites, guided by robust experimental evidence and rich metadata components. SILVA SILVA is an interdisciplinary project among biologists and computers scientists assembling a complete database of RNA ribosomal (rRNA) sequences of genes, both small (16S, 18S, SSU) and large (23S, 28S, LSU) subunits, which belong to the bacteria, archaea and eukarya domains. These data are freely available for academic and commercial use. Greengenes Greengenes is a full-length 16S rRNA gene database that provides chimera screening, standard alignment and a curated taxonomy based on de novo tree inference. Overview: • 1,012,863 RNA sequences from 92,684 organisms contributed to RNAcentral. • The shortest sequence has 1,253 nucleotides, the longest 2,368. • The average length is 1,402 nucleotides. • Database version: 13.5. Open Tree of Life Taxonomy Open Tree of Life Taxonomy (OTT) aims to build a complete, dynamic, and digitally available Tree of Life by synthesizing published phylogenetic trees along with taxonomic data. Phylogenetic trees have been classified, aligned, and merged. Taxonomies have been used to fill in sparse regions and gaps left by phylogenies. OTT is a base that has been little used for sequencing analyzes of the 16S region, however, it has a greater number of sequences classified taxonomically down to the genus level compared to SILVA and Greengenes. However, in terms of classification at the edge level, it contains a lesser amount of information Ribosomal Database Project Ribosomal Database Project (RDP) is a database that provides RNA ribosomal (rRNA) sequences of small subunits of domain bacterial and archaeal (16S); and fungal rRNA sequences of large subunits (28S). == References ==

Source: Wikipedia ↗

tickerdossier.com tickerdossier.substack.com