Ab initio gene prediction is an intrinsic method based on gene content and signal detection. Because extrinsic evidence is expensive and difficult to obtain for many genes, it is often necessary to resort to ab initio gene finding, in which the genomic DNA sequence alone is systematically searched for certain tell-tale signs of protein-coding genes. These signs can be broadly categorized as either signals, specific sequences that indicate the presence of a gene nearby, or content, statistical properties of the protein-coding sequence itself. Ab initio gene finding might be more accurately characterized as gene prediction, since extrinsic evidence is generally required to conclusively establish that a putative gene is functional.

In the genomes of
prokaryotes, genes have specific and relatively well-understood promoter sequences (signals), such as the Pribnow box and transcription factor binding sites, which are easy to identify systematically. Also, the sequence coding for a protein occurs as one contiguous open reading frame (ORF), which is typically many hundreds or thousands of base pairs long. The statistics of stop codons are such that even finding an open reading frame of this length is a fairly informative sign. (Since 3 of the 64 possible codons in the genetic code are stop codons, one would expect a stop codon approximately every 20–25 codons, or 60–75 base pairs, in a random sequence.) Furthermore, protein-coding DNA has certain periodicities and other statistical properties that are easy to detect in a sequence of this length. These characteristics make prokaryotic gene finding relatively straightforward, and well-designed systems are able to achieve high levels of accuracy.
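
The stop-codon argument above can be made concrete with a small sketch. The code below is illustrative only and is not taken from any particular gene finder. It assumes uniformly random codons, treats ATG as the only start codon (prokaryotes also use alternatives such as GTG and TTG), scans the forward strand only, and uses a hypothetical `min_codons` threshold; the function names are likewise invented for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"  # assumption: alternative start codons are ignored


def expected_stop_spacing():
    """Mean number of codons between stops if all 64 codons are equally likely."""
    return 64 / len(STOP_CODONS)


def chance_orf_probability(n_codons):
    """Probability that n_codons consecutive random codons contain no stop codon."""
    return (61 / 64) ** n_codons


def find_orfs(seq, min_codons=100):
    """Find ORFs on the forward strand in all three reading frames.

    Returns a list of (start, end, frame) tuples, where start is the index
    of the ATG and end is the index just past the stop codon. An ORF is
    reported only if it has at least min_codons codons (stop excluded).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the first ATG since the last stop, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs
```

Under these assumptions, `expected_stop_spacing()` gives roughly 21 codons, consistent with the 20–25 codon figure, and `chance_orf_probability(100)` is below 1%, which is why a long open reading frame is informative. A practical scanner would also examine the reverse complement and account for the base composition of the genome.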

Ab initio gene finding in eukaryotes, especially complex organisms like humans, is considerably more challenging for several reasons. First, the promoter and other regulatory signals in these genomes are more complex and less well-understood than in prokaryotes, making them more difficult to recognize reliably. Two classic examples of signals identified by eukaryotic gene finders are CpG islands and polyadenylation signals that direct the addition of a poly(A) tail. Second, splicing mechanisms employed by eukaryotic cells mean that a particular protein-coding sequence in the genome is divided into several parts (exons), separated by non-coding sequences (introns). (Splice sites are themselves another signal that eukaryotic gene finders are often designed to identify.) A typical protein-coding gene in humans might be divided into a dozen exons, each less than two hundred base pairs in length, and some as short as twenty to thirty. It is therefore much more difficult to detect periodicities and other known content properties of protein-coding DNA in eukaryotes.

Advanced gene finders for both prokaryotic and eukaryotic genomes typically use complex
probabilistic models, such as hidden Markov models (HMMs), to combine information from a variety of different signal and content measurements. The GLIMMER system is a widely used and highly accurate gene finder for prokaryotes. GeneMark is another popular approach. Eukaryotic ab initio gene finders, by comparison, have achieved only limited success; notable examples are the GENSCAN and geneid programs. The GeneMark-ES and SNAP gene finders are, like GENSCAN, based on generalized hidden Markov models (GHMMs). They attempt to address problems that arise when a gene finder is run on a genome sequence it was not trained on. A few recent approaches, such as mSplicer, CONTRAST, and mGene, also use machine learning techniques like support vector machines for gene prediction. They build a discriminative model using hidden Markov support vector machines or conditional random fields to learn an accurate gene prediction scoring function.
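
To illustrate how a hidden Markov model can label a sequence, the following is a minimal sketch of Viterbi decoding for a toy two-state model (coding versus non-coding). It is not the model used by GENSCAN, GLIMMER, or any other program named above; real gene finders use many more states and trained parameters. All probabilities and names here are made up for illustration, and the input is assumed to contain only A, C, G, and T.

```python
import math

STATES = ("noncoding", "coding")

# Hypothetical parameters chosen for illustration, not trained on real data.
START = {"noncoding": 0.9, "coding": 0.1}
TRANS = {
    "noncoding": {"noncoding": 0.95, "coding": 0.05},
    "coding": {"noncoding": 0.05, "coding": 0.95},
}
EMIT = {
    "noncoding": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
    "coding": {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
}


def viterbi(seq):
    """Return the most probable state path for seq under the toy model."""
    seq = seq.upper()
    if not seq:
        return []
    # Log-probabilities avoid numerical underflow on long sequences.
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores = {}
        pointers = {}
        for s in STATES:
            # Best predecessor state for ending in state s at this position.
            best_prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (scores[best_prev]
                             + math.log(TRANS[best_prev][s])
                             + math.log(EMIT[s][base]))
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

With these toy parameters, a GC-rich stretch flanked by AT-rich sequence is labelled `coding` once it is long enough to outweigh the cost of two state switches. The same decoding idea carries over to richer models in which states represent exons, introns, and signal sites.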

Ab initio methods have been benchmarked, with some approaching 100% sensitivity. It has been suggested that signals other than those directly detectable in sequences may improve gene prediction. For example, the role of secondary structure in the identification of regulatory motifs has been reported. In addition, it has been suggested that RNA secondary structure prediction helps splice site prediction.
== Neural networks ==
Artificial neural networks are computational models that excel at machine learning and pattern recognition. Neural networks must be trained with example data before they can generalise to experimental data, and they must be tested against benchmark data. Neural networks are able to come up with approximate solutions to problems that are hard to solve algorithmically, provided there is sufficient training data. When applied to gene prediction, neural networks can be used alongside other ab initio methods to predict or identify biological features such as splice sites. One approach involves using a sliding window, which traverses the sequence data in an overlapping manner. The output at each position is a score reflecting how likely the network considers the window to contain a donor splice site or an acceptor splice site. Larger windows offer more accuracy but also require more computational power. A neural network is an example of a signal sensor, as its goal is to identify a functional site in the genome.
== Combined approaches ==