==Classical genome assembly==
The term genome assembly refers to the process of taking the large number of DNA fragments generated during shotgun sequencing and placing them in the correct order so as to reconstruct the original genome. Sequencing uses automated machines to determine the order of the nucleotide bases in the DNA of interest (the bases in DNA are adenine, cytosine, guanine and thymine) so that genomic analyses of an organism of interest can be carried out.

The advent of next-generation sequencing has brought significant improvements in the speed, accuracy and cost of DNA sequencing and has made the sequencing of entire genomes feasible. Various biotechnology companies have developed different sequencing technologies, each of which produces reads that differ in accuracy and length. These technologies include Roche 454, Illumina, SOLiD, and IonTorrent, which produce relatively short reads (50–700 bases) with high accuracy (>98%).
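
The assembly task defined at the start of this section, placing shotgun fragments in order by their overlaps, can be illustrated with a toy greedy overlap-merge. This is only a sketch under strong assumptions: the reads are error-free, come from a single strand, and contain no repeats. The function names `overlap` and `greedy_assemble`, the `min_len` threshold, and the example sequence are hypothetical and are not taken from any real assembler.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of `a` that is a prefix of `b`.

    Overlaps shorter than `min_len` are ignored and reported as 0.
    """
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap.

    Returns the contigs that remain once no pair overlaps by `min_len` bases.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_a, best_b = 0, None, None
        for a, b in permutations(contigs, 2):
            size = overlap(a, b, min_len)
            if size > best_size:
                best_size, best_a, best_b = size, a, b
        if best_size == 0:
            return contigs
        contigs.remove(best_a)
        contigs.remove(best_b)
        # Append only the part of `best_b` that extends past the overlap.
        contigs.append(best_a + best_b[best_size:])


# Four hypothetical 8-base reads that tile a 17-base toy sequence,
# deliberately listed out of order.
reads = ["TGCAATCC", "ATGGCGTG", "AATCCGTA", "GCGTGCAA"]
print(greedy_assemble(reads))
```

With these reads the sketch merges everything into a single contig. On real data a greedy strategy like this can be misled by repeats and sequencing errors, and comparing every pair of reads quickly becomes expensive, so it should be read as an illustration of the idea rather than a usable assembler.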
Third-generation sequencing includes technologies such as the PacBio RS system, which can produce long reads (up to 23 kb) but has relatively low accuracy.

Genome assembly is normally done by one of two methods: assembly using a reference genome as a scaffold, or de novo assembly. The scaffolding approach can be useful if the genome of a similar organism has already been sequenced. The genome of interest is then assembled by comparing it to the known genome, which serves as the scaffold. De novo genome assembly is used when the genome to be assembled is not similar to that of any organism whose genome has previously been sequenced. It is carried out by assembling single reads into contiguous sequences (contigs), which are then extended in the 3' and 5' directions by overlapping other sequences. The latter approach, de novo assembly, is preferred because it allows more sequences to be conserved. The
de novo assembly of DNA sequences is computationally very challenging and can fall into the NP-hard class of problems if the Hamiltonian-cycle approach is used. This is because millions of sequences must be assembled to reconstruct a genome. Genomes also often contain tandem repeats of DNA segments that can be thousands of base pairs long, which can cause problems during assembly.

One hybrid approach to genome assembly supplements short, accurate second-generation sequencing data (e.g. from IonTorrent, Illumina or Roche 454) with long, less accurate third-generation sequencing data (e.g. from PacBio RS) to resolve complex repeated DNA segments. The main limitation that prevents single-molecule third-generation sequencing from being used alone is its relatively low accuracy, which introduces errors into the sequenced DNA. Conversely, using only second-generation sequencing technologies can miss important parts of the genome or leave them incompletely assembled. Supplementing third-generation reads with short, high-accuracy second-generation sequences can overcome these errors and complete crucial details of the genome. This approach has been used to sequence the genomes of some bacterial species, including a strain of
Vibrio cholerae. Algorithms specific to this type of hybrid genome assembly have been developed, such as the PacBio corrected Reads (PBcR) algorithm. Hybrid genome assembly can also be accomplished using the Eulerian path approach. In this approach, read length does not matter: once a k-mer spectrum has been constructed, the lengths of the original reads are irrelevant.

==Practical approaches==
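
The Eulerian path approach described above can be sketched in a few lines. The example below pools reads of different lengths into one k-mer spectrum, builds a de Bruijn graph from it, and walks an Eulerian path with Hierholzer's algorithm. It is a minimal sketch under stated assumptions: reads are error-free and single-stranded, each k-mer is counted once, and the graph has a single Eulerian path. The names `kmer_spectrum` and `eulerian_assemble`, the choice `k = 5`, and the toy reads are hypothetical and do not describe any particular assembler.

```python
from collections import defaultdict


def kmer_spectrum(reads, k):
    """Collect the distinct k-mers from reads of any length."""
    return {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}


def eulerian_assemble(reads, k):
    """Spell a sequence by walking an Eulerian path in the de Bruijn graph.

    Assumes the reads yield at least one k-mer and that a single
    Eulerian path exists.
    """
    graph = defaultdict(list)   # (k-1)-mer -> successor (k-1)-mers
    balance = defaultdict(int)  # out-degree minus in-degree of each node
    for kmer in sorted(kmer_spectrum(reads, k)):
        left, right = kmer[:-1], kmer[1:]
        graph[left].append(right)
        balance[left] += 1
        balance[right] -= 1

    # Start at the node with one more outgoing than incoming edge, if any;
    # otherwise (a cycle) start anywhere.
    start = next((n for n, b in balance.items() if b == 1), next(iter(graph)))

    # Hierholzer's algorithm, iterative form: follow unused edges until
    # stuck, then back up and record nodes in reverse order.
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()

    # Consecutive nodes overlap by k-2 bases, so each step adds one base.
    return path[0] + "".join(node[-1] for node in path[1:])


# One longer and two shorter hypothetical reads from the same toy sequence.
reads = ["GCGTGCAATCCG", "ATCCGTA", "ATGGCGTG"]
print(eulerian_assemble(reads, 5))
```

Only the k-mers enter the graph, so the 12-base read and the shorter reads contribute on equal terms, which is the sense in which read length stops mattering once the spectrum is built. Real data would also require handling sequencing errors, reverse complements, and repeated k-mers, none of which this sketch attempts.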