DNA annotation

In molecular biology and genetics, DNA annotation or genome annotation is the process of describing the structure and function of the components of a genome, by analyzing and interpreting them in order to extract their biological significance and understand the biological processes in which they participate. Among other things, it identifies the locations of genes and all the coding regions in a genome and determines what those genes do.

History

The first generation of genome annotators used local ab initio methods, which rely solely on the information that can be extracted from the DNA sequence on a local scale, that is, one open reading frame (ORF) at a time. They arose from the need to handle the enormous amount of data produced by the Maxam-Gilbert and Sanger DNA sequencing techniques developed in the late 1970s. The first software used to analyze sequencing reads was the Staden Package, created by Rodger Staden in 1977. It performed several tasks related to annotation, such as base and codon counts. In fact, codon usage was the main strategy used by several early protein coding sequence (CDS) prediction methods, based on the assumption that the most translated regions in a genome contain the codons with the most abundant corresponding tRNAs (the molecules responsible for carrying amino acids to the ribosome during protein synthesis), allowing more efficient translation. This was also known to be the case for synonymous codons, which are often present in proteins expressed at a lower level.

The advent of complete genomes in the 1990s (the first being the genome of Haemophilus influenzae, sequenced in 1995) introduced a second generation of annotators. Like the previous generation, they performed annotation through ab initio methods, but now applied on a genome-wide scale. Many of them relied on Markov models, which can be thought of as directed graphs where nodes represent different genomic signals (such as transcription and translation start sites) connected by arrows representing the scanning of the sequence. For a Markov model to detect a genomic signal, it must first be trained on a series of known genomic signals. The output of Markov models in the context of annotation includes the probabilities of every kind of genomic element in every single part of the genome, and an accurate Markov model will assign high probabilities to correct annotations and low probabilities to incorrect ones.
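
The train-then-score idea behind Markov models can be illustrated with a small sketch. It is a deliberate simplification: it uses a plain first-order Markov chain over nucleotides rather than the richer models used by actual annotators, and the training sequences, function names, and pseudocount are all assumptions made for this example.

```python
import math

ALPHABET = "ACGT"


def train_markov(sequences, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next | current)."""
    # Pseudocounts keep unseen transitions from getting zero probability.
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    model = {}
    for a in ALPHABET:
        total = sum(counts[a].values())
        model[a] = {b: counts[a][b] / total for b in ALPHABET}
    return model


def log_likelihood(seq, model):
    """Log-probability of the transitions observed in seq under the model."""
    return sum(math.log(model[cur][nxt]) for cur, nxt in zip(seq, seq[1:]))


def log_odds(seq, signal_model, background_model):
    """Positive values mean seq looks more like the trained signal."""
    return log_likelihood(seq, signal_model) - log_likelihood(seq, background_model)


# Toy training data (hypothetical): examples of a "signal" and of background.
signal_examples = ["ATGATGATGATG", "ATGATGATG", "GATGATGATGA"]
background_examples = ["ACGTTGCAACGT", "CCGTAGCTTAGC", "TTGACCGTAAGC"]

signal_model = train_markov(signal_examples)
background_model = train_markov(background_examples)

print(log_odds("ATGATGATG", signal_model, background_model))
print(log_odds("CCGTAGCTA", signal_model, background_model))
```

A sequence resembling the training signal receives a positive log-odds score and an unrelated one a negative score, which mirrors the idea that an accurate model assigns high probabilities to correct annotations.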
As more sequenced genomes became available in the early and mid 2000s, together with the numerous protein sequences obtained experimentally, genome annotators began employing homology-based methods, launching the third generation of genome annotation. These new methods allowed annotators not only to infer genomic elements through statistical means (as in previous generations) but also to compare the sequence being annotated with existing, validated sequences. These so-called combiner annotators, which perform both ab initio and homology-based annotation, require fast alignment algorithms to identify regions of homology.

Structural annotation

Structural annotation describes the precise location of the different elements in a genome, such as open reading frames (ORFs), coding sequences (CDS), exons, introns, repeats, splice sites, regulatory motifs, start and stop codons, and promoters. The main steps of structural annotation are:
• Repeat identification and masking.
• Evidence alignment (optional).
• Splice identification (only in eukaryotes).
• Feature prediction (coding and noncoding sequences).

First, the repeats of an assembled genome are masked by using a repeat library. Then, optionally, the masked sequence is aligned with all the available evidence (ESTs, RNAs, and proteins) of the organism being annotated. In eukaryotic genomes, splice sites must be identified. Finally, the coding and noncoding sequences contained in the genome are predicted with the help of databases of known DNA, RNA and protein sequences, as well as other supporting information.

Repeat identification and masking

The first step of structural annotation consists in the identification and masking of repeats, which include low-complexity sequences (such as AGAGAGAG, or homopolymeric segments like TTTTTTTTT) and transposons (which are larger elements with several copies across the genome). Repeats are abundant: three quarters of the human genome is composed of repetitive elements. Identifying repeats is difficult for two main reasons: they are poorly conserved, and their boundaries are not clearly defined. Because of this, repeat libraries must be built for the genome of interest, which can be accomplished with one of the following methods:
• De novo methods. Repeats are identified by detecting and grouping pairs of sequences at different locations whose similarity is above a minimum threshold of sequence conservation in a self-genome comparison, thus requiring no prior information about repeat structure or sequences.
The disadvantage of these methods is that they can identify any repeated sequence, not just transposons, and may include conserved coding sequences (CDS), making careful post-processing an indispensable step to remove these sequences. They may also leave out related regions that have degraded over time and may group elements that have no connection in their evolutionary history.
• Homology-based methods. Repeats are identified by their similarity (homology) to known repeats stored in a curated database. These methods are more likely to find real transposons, even in lower quantities, when compared with de novo methods, but are biased towards previously identified families.
• Structure-based methods. Repeats are identified based on models of their structure, rather than repetition or similarity. They are capable of identifying real transposons (just like the homology-based ones), but are not biased by known elements. However, they are highly specific to each class of repeat, and, as such, are less universally applicable.
• Comparative genomic methods. Repeats are identified as disruptions of one or more sequences in a multiple sequence alignment produced by large insertion regions. Although this strategy avoids the poorly defined boundary problem that exists in other methods, it is highly dependent on assembly quality and the level of activity of transposons in the genomes in question.

After the repetitive regions in a genome have been identified, they are masked. Masking means replacing the letters of the nucleotides (A, C, G, or T) with other letters. By doing so, these regions are marked as repetitive and downstream analyses treat them accordingly.
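
Masking as described above can be sketched in a few lines. The example assumes the two widely used conventions, soft masking (repeats written in lowercase) and hard masking (repeats replaced with N); the function name, the toy sequence, and the interval format (0-based, end-exclusive) are assumptions made for illustration.

```python
def mask_repeats(sequence, repeat_intervals, hard=False):
    """Mask repeat intervals given as 0-based, end-exclusive (start, end) pairs.

    Soft masking lowercases the repeat; hard masking replaces it with 'N'.
    """
    masked = list(sequence.upper())
    for start, end in repeat_intervals:
        # Clamp the interval so out-of-range coordinates are ignored.
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = "N" if hard else masked[i].lower()
    return "".join(masked)


genome = "ACGTAGAGAGAGTTCA"   # toy sequence with a low-complexity AG run
repeats = [(4, 12)]           # hypothetical coordinates from a repeat finder

soft = mask_repeats(genome, repeats)
hard = mask_repeats(genome, repeats, hard=True)
print(soft)
print(hard)
```

Soft masking keeps the underlying bases available to tools that choose to use them, whereas hard masking discards them entirely.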
Repetitive regions may cause performance issues if they are not masked, and may even produce false evidence for gene annotation (for example, treating an open reading frame (ORF) in a transposon as an exon).

Evidence alignment

The next step after genome masking usually involves aligning all available transcript and protein evidence with the analyzed genome, that is, aligning all known expressed sequence tags (ESTs), RNAs and proteins of the organism being annotated with the genome. However, transcripts provide insufficient information for gene prediction because they might be unobtainable from some genes, they may encode operons of more than one gene, and their start and stop codons cannot be determined due to frameshifts and translation initiation factors.

Splice identification

Annotation of eukaryotic genomes has an extra layer of difficulty due to RNA splicing, a post-transcriptional process in which introns (non-coding regions) are removed and exons (coding regions) are joined. Predictors of new exon boundaries usually require efficient data-compression and alignment algorithms, but they are prone to failure at boundaries located in regions with low sequence coverage or high error rates produced during sequencing.

Feature prediction

A genome is divided into coding and noncoding regions, and the last step of structural annotation consists in identifying these features within the genome. In fact, the primary task in genome annotation is gene prediction, which is why numerous methods have been developed for this purpose. Whereas prokaryotic CDS predictors mostly deal with open reading frames (ORFs), which are segments of DNA between the start and stop codons, eukaryotic CDS predictors face a more difficult problem because of the complex organization of eukaryotic genes. CDS prediction methods include the following:
• Ab initio methods. CDS prediction is based solely on the information that can be extracted from the DNA sequence.
• Homology-based methods (also called empirical, evidence-driven, or extrinsic). CDS prediction is based on similarity to known sequences.
Specifically, they align the analyzed sequence with expressed sequence tags (ESTs), complementary DNA (cDNA), or protein sequences.
• Combiners. CDS prediction is done by combining ab initio and homology-based methods.
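
The notion of an ORF as a segment between a start and a stop codon, which prokaryotic CDS predictors build on, can be sketched as a naive scan over the six reading frames. This is only an illustration under simplifying assumptions (ATG as the sole start codon, the three standard stop codons, no scoring), not the method of any particular gene finder; all names are hypothetical.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=2):
    """Return (strand, start, end, orf) tuples for all six reading frames.

    Coordinates are 0-based and end-exclusive on the scanned strand, so
    '-' strand positions refer to the reverse-complemented sequence.
    min_codons counts codons before the stop codon.
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START:
                    start = i                      # open a candidate ORF
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        results.append((strand, start, i + 3, s[start:i + 3]))
                    start = None                   # close it and keep scanning
    return results


orfs = find_orfs("CCATGAAATAGGG")
print(orfs)
```

Eukaryotic genes cannot be handled this way because their coding sequence is interrupted by introns, which is why the text describes eukaryotic prediction as the harder problem.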

Functional annotation

Functional annotation assigns functions to the genomic elements found by structural annotation.

Figure: Gene Ontology graph of the matrilin complex, a component of the extracellular matrix, showing the molecular functions, biological processes, and cellular components in which it is involved. Every box is an ontology term that falls into one of the three GO categories and is color-coded accordingly. Ontology terms are related to each other through specific qualifiers (such as "is a", "part of", etc.), which are represented by different kinds of arrows.

Functional annotation of genes requires a controlled vocabulary (or ontology) to name the predicted functional features. However, because there are numerous ways to define gene functions, the annotation process may be hindered when it is performed by different research groups. As such, a standardized controlled vocabulary must be employed, the most comprehensive of which is the Gene Ontology (GO). It classifies functional properties into one of three categories (molecular function, biological process, and cellular component) and organizes them in a directed acyclic graph, in which every node is a particular function, and every edge (or arrow) between two nodes indicates a parent-child or subcategory-category relationship.

Some conventional methods for functional annotation are homology-based and rely on local alignment search tools. Their limitations include artificially high scores due to the presence of low-complexity regions, and significant variation within a protein family. Functional annotation can also be performed through probabilistic methods. The distribution of hydrophilic and hydrophobic amino acids indicates whether a protein is located in a solution or membrane. Specific sequence motifs provide information on posttranslational modifications and the final location of any given protein. Machine learning methods are also used to generate functional annotations for novel proteins based on GO terms.
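
The directed acyclic graph organization of GO described above can be sketched with a small example. It is an illustration only, not the Gene Ontology's own tooling: the terms below form a tiny hypothetical hierarchy invented for this sketch, and the `ancestors` helper is likewise an assumption. It relies on the common convention that a gene annotated with a term is implicitly annotated with all ancestors of that term.

```python
# Each term maps to a list of (qualifier, parent) edges. A term may have
# several parents, which is what makes the structure a DAG rather than a tree.
# The hierarchy is a toy example, not actual GO content.
parents = {
    "molecular function": [],
    "binding": [("is a", "molecular function")],
    "catalytic activity": [("is a", "molecular function")],
    "DNA binding": [("is a", "binding")],
    "nuclease activity": [("is a", "catalytic activity")],
    "toy DNA-binding nuclease": [
        ("is a", "DNA binding"),
        ("is a", "nuclease activity"),
    ],
}


def ancestors(term, graph):
    """Collect every term reachable by following parent edges upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _qualifier, parent in graph.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


implied = ancestors("toy DNA-binding nuclease", parents)
print(sorted(implied))
```

Because the most specific term has two parents, its annotation propagates along both branches up to the shared root category.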
Generally, these machine learning methods construct a binary classifier for each GO term; the classifiers are then joined to make predictions on individual GO terms (forming a multiclass classifier), for which confidence scores are later obtained. The support vector machine (SVM) is the most widely used binary classifier in functional annotation; however, other algorithms, such as k-nearest neighbors (kNN) and convolutional neural networks (CNN), have also been employed.

Pseudogenes can be identified with one of the following two methods:
• Homology-based method. Pseudogenes are identified by searching for sequences that are similar to functional genes but contain mutations that disrupt their ORF. This method cannot determine the evolutionary relationship between a pseudogene and its parent gene, nor the time elapsed since the event happened.
• Phylogeny-based method. Pseudogenes are identified by means of a phylogenetic analysis. First, a species tree of the species of interest and a phylogenetic tree of the gene (or gene family) of interest are constructed. The two are then compared to identify a species that has lost the gene. Next, within the genome of the species where the gene was not found, a sequence is searched for that is orthologous to the gene identified in the closest species. Finally, if this orthologous sequence has a disruption in its ORF (and it meets other criteria, such as RNA-Seq data analysis, dN/dS ratio, etc.), the sequence is indeed a pseudogene.

Segmental duplications are DNA segments of more than 1000 base pairs that are repeated in the genome with more than 90% sequence identity. Two strategies used for their identification are WGAC and WSSD:
• Whole-Genome Assembly Comparison (WGAC). It aligns the entire genome to itself in order to identify repeated sequences after filtering out common repeats; it does not require having the original reads used for the assembly.
• Whole-genome Shotgun Sequence Detection (WSSD).
It aligns the original reads with the assembled genome and searches for regions with a higher read depth than the average, which usually signal a duplication. Segmental duplications identified by this method but not by WGAC are likely collapsed duplications, which means that they were mistakenly aligned to the same region.

DNA binding sites are regions in the genome sequence that bind to and interact with specific proteins. They play an important role in DNA replication and repair, transcriptional regulation, and viral infection. Binding site prediction involves the use of one of the following two methods:
• Sequence similarity based methods. They identify homologous sequences with known DNA binding sites, or align them with query proteins. Their performance is usually low because DNA binding sequences are less conserved.
• Structure based methods. They employ the three-dimensional structural information of proteins to predict the locations of DNA binding sites.

Noncoding RNA (ncRNA), produced by RNA genes, is a type of RNA that is not translated into a protein. It includes molecules such as tRNA, rRNA, snoRNA, and microRNA, as well as noncoding mRNA-like transcripts. Ab initio prediction of RNA genes in a single genome often yields inaccurate results (with miRNA being an exception), so multi-genome comparative methods are used instead. These methods are specifically concerned with the secondary structures of ncRNA, as they are conserved in related species even when their sequence is not. Therefore, a multiple sequence alignment yields more useful information for their prediction. Homology search may also be employed to identify RNA genes, but this procedure is complicated, especially in eukaryotes, by the presence of a large number of repeats and pseudogenes.
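
The read-depth idea behind WSSD, described earlier in this section, can be illustrated with a toy calculation. The sketch assumes per-base depth values are already available from a read alignment; the window size, the fold threshold, and the function name are assumptions chosen for illustration and do not reproduce the statistical thresholds of actual WSSD implementations.

```python
from statistics import mean


def high_depth_windows(depth, window=4, fold=1.5):
    """Flag non-overlapping windows whose mean depth exceeds fold x the overall mean.

    Returns 0-based, end-exclusive (start, end) coordinates.
    """
    genome_mean = mean(depth)
    flagged = []
    for start in range(0, len(depth) - window + 1, window):
        if mean(depth[start:start + window]) > fold * genome_mean:
            flagged.append((start, start + window))
    return flagged


# Hypothetical per-base read depth; positions 8-11 have roughly double coverage,
# as expected when reads from two copies pile up on one assembled copy.
depth = [10, 11, 9, 10, 10, 12, 10, 9, 21, 19, 20, 22, 10, 9, 11, 10]

candidates = high_depth_windows(depth)
print(candidates)
```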

Visualization

Figure: A snapshot of an annotated GBK file created with Prokka.

Some annotation file formats use controlled vocabularies and ontologies to define their descriptive terminologies and guarantee interoperability between analysis and visualization tools.

Genomic browsers can be divided into web-based genomic browsers and stand-alone genomic browsers. The former use information from databases and can be classified into multiple-species browsers (which integrate the sequences and annotations of multiple organisms and promote cross-species comparative analysis) and species-specific browsers (which focus on one organism and the annotations for that particular species). The latter are not necessarily linked to a specific genome database but are general-purpose browsers that can be downloaded and installed as an application on a local computer.

Visualization tools capable of illustrating the comparative behavior between two or more genomes are essential for comparative analysis, and can be classified into three categories based on how they represent the relationships between the compared genomes:
• Dot plots: This scheme can only show the alignment of two genomes. One genome is represented along the horizontal axis and the other along the vertical axis, and the dots in the plot represent the genomic elements that are similar between the two annotations.
• Linear representation: This representation uses multiple linear tracks to represent multiple genomes and their features, where "track" is a concept that refers to a specific type of genomic feature at a genomic location.
• Circular representation: This representation facilitates comparison of whole microbial or viral genomes. In this visualization mode, concentric circles and arcs are used to represent genomic sections.
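
The dot plot scheme listed above can be sketched with a short example. As a simplifying assumption, a dot is placed wherever two raw sequences share an exact k-mer, whereas genome browsers plot similar annotated elements or alignments; the function names and the text rendering are hypothetical.

```python
def dot_plot(seq_a, seq_b, k=3):
    """Return (i, j) pairs where seq_a[i:i+k] equals seq_b[j:j+k]."""
    # Index every k-mer of seq_b by its start position.
    index = {}
    for j in range(len(seq_b) - k + 1):
        index.setdefault(seq_b[j:j + k], []).append(j)
    dots = []
    for i in range(len(seq_a) - k + 1):
        for j in index.get(seq_a[i:i + k], []):
            dots.append((i, j))
    return dots


def render(dots, width, height):
    """Draw the dots as text: seq_a along the horizontal axis, seq_b along the vertical."""
    grid = [["." for _ in range(width)] for _ in range(height)]
    for i, j in dots:
        grid[j][i] = "*"
    return "\n".join("".join(row) for row in grid)


a, b, k = "ACGTAC", "TTACGT", 3
dots = dot_plot(a, b, k)
print(render(dots, len(a) - k + 1, len(b) - k + 1))
```

Runs of dots along a diagonal indicate a stretch that is conserved in both sequences, while an offset diagonal points to a shift in position between the two.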

Quality control

The quality of the sequence assembly influences the quality of the annotation, so it is important to assess assembly quality before performing the subsequent annotation steps. Community annotation approaches are useful techniques for quality control and standardization in genome annotation. An annotation jamboree that took place in 2002 led to the creation of the annotation standards used by the Sanger Institute's Human and Vertebrate Analysis Project (HAVANA).

Community annotation

Community annotation consists in the engagement of a community (both scientific and nonscientific) in genome annotation projects. It can be classified into six categories, including the following:
• Blessed annotator: A variation of the museum model, applied in the Knockout Mouse Project (KOMP), in which curators go through a training period prior to annotation, and are then given access to annotation tools to continue their work.
• Gatekeeper approach: A combination of the jamboree and cottage industry models. It begins with an annotation workshop, followed by a decentralized collaboration to extend and refine the initial annotation. It has been used for multiple species data.

A community annotation is said to be supervised when there is a coordinator who manages the project by requesting the annotation of specific items from a select number of experts. On the other hand, when anyone can enter a project and coordination is accomplished in a decentralized manner, it is called unsupervised community annotation. Supervised community annotation is short-lived and limited to the duration of the event, whereas the unsupervised counterpart does not have this limitation. However, the latter has been less successful than the former, presumably due to a lack of time, motivation, incentive and/or communication.

Wikipedia has multiple WikiProjects aimed at improving annotation. The Gene WikiProject, for instance, operates a bot that harvests gene data from research databases and creates gene stubs on that basis. The RNA WikiProject seeks to write articles that describe individual RNAs and RNA families in an accessible way.

Applications

Disease diagnosis

Gene Ontology is being used by researchers to establish disease-gene relationships, as GO helps in the identification of novel genes and of the alterations in their expression, distribution and function under different sets of conditions, such as diseased versus healthy. Databases of these disease-gene relationships have been created for different organisms, such as Plant-Pathogen Ontology, Plant-Associated Microbe Gene Ontology or DisGeNET. Others have been implemented in pre-existing databases, like the Rat Disease Ontology in the Rat Genome Database.

Bioremediation

A great diversity of catabolic enzymes involved in hydrocarbon degradation by some bacterial strains are encoded by genes located in their mobile genetic elements (MGEs). The study of these elements is of great importance in the field of bioremediation, since the inoculation of wild or genetically modified strains with these MGEs has recently been sought in order to acquire these hydrocarbon degradation capacities. In 2013, Phale et al. published the genome annotation of a strain of Pseudomonas putida (CSV86), a bacterium known for its preference for naphthalene and other aromatic compounds over glucose as a carbon and energy source. In order to find the MGEs of this bacterium, its genome was annotated using RAST and the NCBI Prokaryotic Genome Annotation Pipeline (PGAP), and nine mobile elements were identified with the Insertion Sequence (IS) Finder database. This analysis resulted in the localization of the upper pathway genes of naphthalene degradation, right next to the genes encoding tRNA-Gly and integrase, as well as the identification of the genes encoding enzymes involved in the degradation of salicylate, benzoate, 4-hydroxybenzoate, phenylacetic acid, and hydroxyphenyl acetic acid, and the recognition of an operon involved in glucose transport in the strain.
Gene Ontology analysis is of great importance in functional annotation; in bioremediation specifically, it can be applied to relate the genes of some microorganisms to their functions and to their role in the remediation of certain contaminants. This was the approach used in the investigation and identification of Halomonas zincidurans strain B6(T), a bacterium with thirty-one genes encoding resistance to heavy metals, especially zinc, and of Stenotrophomonas sp. DDT-1, a strain capable of using DDT as its sole carbon and energy source, to mention a few examples.

Software

Genes in a eukaryotic genome can be annotated using various annotation tools such as FINDER. A modern annotation pipeline such as MOSGA can support a user-friendly web interface and software containerization. Modern annotation pipelines for prokaryotic genomes are Bakta, Prokka and PGAP. The National Center for Biomedical Ontology develops tools for automated annotation of database records based on the textual descriptions of those records. As a general method, dcGO has an automated procedure for statistically inferring associations between ontology terms and protein domains or combinations of domains from the existing gene/protein-level annotations. A variety of software tools have been developed that allow scientists to view and share genome annotations, such as MAKER.

Genome annotation is an active area of investigation and involves a number of different organizations in the life science community, which publish the results of their efforts in publicly available biological databases accessible via the web and other electronic means. Here is an alphabetical listing of ongoing projects relevant to genome annotation:
• Encyclopedia of DNA elements (ENCODE)
• Entrez Gene
• Ensembl
• FlyBase
• GENCODE
• Gene Ontology Consortium
• GeneRIF
• Mouse Genome Informatics
• RefSeq
• UniProt
• Vertebrate and Genome Annotation Project (Vega)
• WormBase

Source: Wikipedia
