Genome The first sequence assemblers began to appear in the late 1980s and early 1990s as variants of simpler
sequence alignment programs to piece together vast quantities of fragments generated by automated sequencing instruments called
DNA sequencers. As the sequenced organisms grew in size and complexity (from small
viruses over
plasmids to
bacteria and finally
eukaryotes), the assembly programs used in these
genome projects needed increasingly sophisticated strategies to handle: •
terabytes of sequencing data which need processing on
computing clusters; • identical and nearly identical sequences (known as
repeats) which can, in the worst case, increase the time and space complexity of algorithms quadratically; •
DNA read errors in the fragments from the sequencing instruments, which can confound assembly. Faced with the challenge of assembling the first larger eukaryotic genomes—the fruit fly
Drosophila melanogaster in 2000 and the human genome just a year later,—scientists developed assemblers like Celera Assembler and Arachne able to handle genomes of 130 million (e.g., the fruit fly
D. melanogaster) to 3 billion (e.g., the human genome) base pairs. Subsequent to these efforts, several other groups, mostly at the major genome sequencing centers, built large-scale assemblers, and an open source effort known as AMOS was launched to bring together all the innovations in genome assembly technology under the
open source framework.
EST Expressed sequence tag or EST assembly was an early strategy, dating from the mid-1990s to the mid-2000s, to assemble individual genes rather than whole genomes. The problem differs from genome assembly in several ways. The input sequences for EST assembly are fragments of the transcribed
mRNA of a cell and represent only a subset of the whole genome. •
Overlap/Layout/Consensus (OLC) approach, which was typical of the Sanger-data assemblers and relies on an overlap graph; •
de Bruijn Graph (DBG) approach, which is most widely applied to the short reads from the Solexa and SOLiD platforms. It relies on K-mer graphs, which performs well with vast quantities of short reads; •
Greedy graph-based approach, which may also use one of the OLC or DBG approaches. With greedy graph-based algorithms, the contigs, series of reads aligned together, grow by greedy extension, always taking on the read that is found by following the highest-scoring overlap. == Technological advances ==