of an assembled genome are masked by using a repeat library. Then, optionally, the masked sequence is aligned with all the available evidence (ESTs, RNAs, and proteins) of the organism being annotated. In eukaryotic genomes, splice sites must be identified. Finally, the coding and noncoding sequences contained in the genome are predicted with the help of databases of known DNA, RNA and protein sequences, as well as other supporting information. Structural annotation describes the precise location of the different elements in a genome, such as open reading frames (ORFs), coding sequences (CDS), exons, introns, repeats, splice sites, regulatory motifs, start and stop codons, and promoters. The main steps of structural annotation are:
• Repeat identification and masking.
• Evidence alignment (optional).
• Splice identification (only in eukaryotes).
• Feature prediction (coding and noncoding sequences).
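As a concrete illustration of what locating an element means, the following minimal Python sketch reports the coordinates of open reading frames on the forward strand of a short sequence. It assumes the standard start codon ATG and the stop codons TAA, TAG and TGA. The function name `find_orfs` and the `min_codons` parameter are hypothetical, and real annotation tools also scan the reverse strand and rely on statistical models rather than a plain codon scan.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    Coordinates are 0-based and half-open; `end` includes the stop codon.
    `min_codons` is the minimum number of codons before the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the currently open ATG, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


print(find_orfs("CCATGAAATTTTAGGG"))  # [(2, 14, 2)]
```

Reporting half-open coordinates keeps the result directly usable as a Python slice: `seq[2:14]` yields `ATGAAATTTTAG`.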
===Repeat identification and masking===
The first step of structural annotation consists in the identification and masking of repeats, which include low-complexity sequences (such as AGAGAGAG, or homopolymeric segments like TTTTTTTTT), and transposons (which are larger elements with several copies across the genome). Three quarters of the human genome is composed of repetitive elements. Identifying repeats is difficult for two main reasons: they are poorly conserved, and their boundaries are not clearly defined. Because of this, repeat libraries must be built for the genome of interest, which can be accomplished with one of the following methods:
• De novo methods. Repeats are identified by detecting and grouping pairs of sequences at different locations whose similarity is above a minimum threshold of sequence conservation in a self-genome comparison, thus requiring no prior information about repeat structure or sequences. The disadvantage of these methods is that they can identify any repeated sequence, not just transposons, and may include conserved coding sequences (CDS), making careful post-processing an indispensable step to remove these sequences. They may also leave out related regions that have degraded over time and may group elements that have no connection in their evolutionary history.
• Homology-based methods. Repeats are identified by similarity (homology) to known repeats stored in a curated database. These methods are more likely to find real transposons, even in lower quantities, when compared with de novo methods, but are biased towards previously identified families.
• Structure-based methods. Repeats are identified based on models of their structure, rather than repetition or similarity. They are capable of identifying real transposons (just like the homology-based ones), but are not biased by known elements. However, they are highly specific to each class of repeat, and, as such, are less universally applicable.
• Comparative genomic methods. Repeats are identified as disruptions of one or more sequences in a multiple sequence alignment produced by large insertion regions. Although this strategy avoids the poorly defined boundary problem that exists in other methods, it is highly dependent on assembly quality and the level of activity of transposons in the genomes in question.
After the repetitive regions in a genome have been identified, they are masked. Masking means replacing the letters of the nucleotides (A, C, G, or T) with other letters. By doing so, these regions will be marked as repetitive and downstream analyses will treat them accordingly. Repetitive regions may produce performance issues if they are not masked, and may even produce false evidence for gene annotation (for example, treating an open reading frame (ORF) in a transposon as an exon).
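The masking step can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: it detects two kinds of low-complexity sequence with a regular expression (single-nucleotide runs of at least 8 bases and dinucleotide units repeated at least 4 times, both arbitrary thresholds), and it uses two common conventions that the text above does not specify, namely lowercase letters for "soft" masking and `N` for "hard" masking. The names `find_low_complexity` and `mask_intervals` are hypothetical, and transposons would in practice be located with a repeat library rather than a pattern like this.

```python
import re

# Assumed thresholds: a single base repeated >= 8 times,
# or a 2-base unit repeated >= 4 times (e.g. AGAGAGAG).
LOW_COMPLEXITY = re.compile(r"([ACGT])\1{7,}|([ACGT]{2})\2{3,}")


def find_low_complexity(seq):
    """Return 0-based, half-open (start, end) intervals of simple repeats."""
    return [(m.start(), m.end()) for m in LOW_COMPLEXITY.finditer(seq.upper())]


def mask_intervals(seq, intervals, hard=False):
    """Soft-mask (lowercase) or hard-mask (N) the given intervals."""
    chars = list(seq)
    for start, end in intervals:
        for i in range(start, end):
            chars[i] = "N" if hard else chars[i].lower()
    return "".join(chars)


genome = "ACGTAGAGAGAGCCGTTTTTTTTTACG"
repeats = find_low_complexity(genome)
print(repeats)                                    # [(4, 12), (15, 24)]
print(mask_intervals(genome, repeats))            # ACGTagagagagCCGtttttttttACG
print(mask_intervals(genome, repeats, hard=True)) # ACGTNNNNNNNNCCGNNNNNNNNNACG
```

Soft masking keeps the original bases recoverable, whereas hard masking discards them; which one suits a given downstream tool is a choice for the annotator.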
===Evidence alignment===
The next step after genome masking usually involves aligning all available transcript and protein evidence with the analyzed genome, that is, aligning all known expressed sequence tags (ESTs), RNAs and proteins of the organism being annotated with the genome. However, transcripts alone provide insufficient information for gene prediction because they might be unobtainable from some genes, they may encode operons of more than one gene, and their start and stop codons cannot be determined due to frameshifts and translation initiation factors.
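Evidence alignment is normally carried out with dedicated aligners, but the underlying idea can be illustrated with a small local-alignment routine. The sketch below scores the best local match between a transcript fragment and a genomic sequence using the Smith-Waterman recurrence with a linear gap penalty. The scoring values and the function name `local_alignment` are assumptions chosen for illustration, and the routine ignores introns, which a real transcript-to-genome aligner must model.

```python
def local_alignment(query, target, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment with a linear gap penalty.

    Returns (best_score, query_end, target_end), where the end positions
    are 1-based indices of the last aligned base in each sequence.
    """
    query, target = query.upper(), target.upper()  # tolerate soft-masked input
    prev = [0] * (len(target) + 1)  # previous row of the score matrix
    best = (0, 0, 0)
    for i, q in enumerate(query, start=1):
        curr = [0]
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            if score > best[0]:
                best = (score, i, j)
        prev = curr
    return best


print(local_alignment("GATTACA", "CCCGATTACACCC"))  # (14, 7, 10)
```

Only two rows of the score matrix are kept in memory, which is enough to report the best score and where it ends; recovering the full alignment would require storing the matrix or traceback pointers.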
===Splice identification===
Annotation of eukaryotic genomes has an extra layer of difficulty due to RNA splicing, a post-transcriptional process in which introns (non-coding regions) are removed and exons (coding regions) are joined. Predictors of new exon boundaries usually require efficient data-compression and alignment algorithms, but they are prone to failure at boundaries located in regions with low sequence coverage or high sequencing error rates.
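A very rough sketch of splice-site scanning is shown below. It relies on an assumption not stated in the text above: that most introns begin with the dinucleotide GT (donor site) and end with AG (acceptor site). The function name `candidate_introns` and the length limits are hypothetical. A scan like this produces many false candidates, which is why practical predictors combine it with aligned transcript evidence or statistical models.

```python
def candidate_introns(seq, min_len=6, max_len=10_000):
    """List (start, end) spans that begin with GT and end with AG.

    Coordinates are 0-based and half-open, so seq[start:end] is the
    candidate intron including both dinucleotides.
    """
    seq = seq.upper()
    # Start position of every GT (possible donor site).
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    # End position (exclusive) of every AG (possible acceptor site).
    acceptor_ends = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return [
        (d, a)
        for d in donors
        for a in acceptor_ends
        if min_len <= a - d <= max_len
    ]


print(candidate_introns("CCGTAAACCAGTT"))  # [(2, 11)]
```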
===Feature prediction===
A genome is divided into coding and noncoding regions, and the last step of structural annotation consists in identifying these features within the genome. In fact, the primary task in genome annotation is gene prediction, which is why numerous methods have been developed for this purpose. Whereas prokaryotic CDS predictors mostly deal with open reading frames (ORFs), which are segments of DNA between the start and stop codons, eukaryotic CDS predictors are faced with a more difficult problem because of the complex organization of eukaryotic genes. Approaches to CDS prediction include:
• Homology-based methods (also called empirical, evidence-driven, or extrinsic). CDS prediction is based on similarity to known sequences. Specifically, these methods align the analyzed sequence with expressed sequence tags (ESTs), complementary DNA (cDNA), or protein sequences.
• Combiners. CDS prediction combines homology-based evidence with ab initio (intrinsic) prediction, which relies on signals in the analyzed sequence itself.
==Functional annotation==