Genome assembly refers to the process of taking a large number of short
DNA sequences and reassembling them to create a representation of the original
chromosomes from which the DNA originated. In a shotgun sequencing project, all the DNA from a source (usually a single
organism, anything from a bacterium to a mammal) is first fractured into millions of small pieces. These pieces are then "read" by automated sequencing machines. A genome assembly
algorithm works by taking all the pieces and aligning them to one another, and detecting all places where two of the short sequences, or
reads, overlap. These overlapping reads can be merged, and the process continues. Genome assembly is a very difficult
computational problem, made more difficult because many genomes contain large numbers of identical sequences, known as
repeats. These repeats can be thousands of nucleotides long and occur in different locations, especially in the large genomes of plants and animals. The resulting (draft) genome sequence is produced by combining the information from sequenced
contigs and then employing linking information to create scaffolds. Scaffolds are positioned along the physical map of the chromosomes, creating a "golden path".
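
The overlap-and-merge idea described above can be illustrated with a minimal sketch. It is a toy greedy strategy, not the method of any particular assembler: the function names `overlap_length` and `greedy_assemble`, the `min_overlap` threshold, and the example reads are all assumptions made for illustration. It also assumes error-free reads taken from a single strand, which real sequencing data does not satisfy.

```python
from itertools import permutations


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no overlap of at least `min_overlap` bases exists.
    """
    longest = min(len(left), len(right))
    for size in range(longest, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap.

    Stops when no pair overlaps by at least `min_overlap` bases; the
    sequences that remain are the contigs.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_size, best_pair = 0, None
        # Try every ordered pair (a, b): suffix of a against prefix of b.
        for a, b in permutations(range(len(contigs)), 2):
            size = overlap_length(contigs[a], contigs[b], min_overlap)
            if size > best_size:
                best_size, best_pair = size, (a, b)
        if best_pair is None:
            break  # nothing left to merge: the rest are separate contigs
        a, b = best_pair
        merged = contigs[a] + contigs[b][best_size:]
        contigs = [c for i, c in enumerate(contigs) if i not in (a, b)]
        contigs.append(merged)
    return contigs


# Hypothetical reads sampled from the toy sequence ATGCGTACGTTAGC.
reads = ["ATGCGTAC", "GCGTACGTT", "ACGTTAGC"]
print(greedy_assemble(reads))  # ['ATGCGTACGTTAGC']
```

The sketch also shows why repeats are troublesome: when a repeat is longer than a read, reads from different copies of it look like genuine overlaps, so a greedy merge of this kind can join regions that are not adjacent in the genome.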
==Assembly software==
Originally, most large-scale DNA sequencing centers developed their own software for assembling the sequences that they produced. However, this has changed as the software has grown more complex and as the number of sequencing centers has increased. An example of such software is the Short Oligonucleotide Analysis Package, developed by BGI for de novo assembly of human-sized genomes, alignment, SNP detection, resequencing, indel finding, and structural variation analysis.
==Genome annotation==