There are numerous programs for de novo sequence assembly and many have been compared in the Assemblathon. The Assemblathon is a periodic, collaborative effort to test and improve the numerous assemblers available. Thus far, two assemblathons have been completed (2011 and 2013) and a third is in progress (as of April 2017). Teams of researchers from across the world choose a program and assemble simulated genomes (Assemblathon 1) and the genomes of model organisms whose that have been previously assembled and annotated (Assemblathon 2). The assemblies are then compared and evaluated using numerous metrics.
Assemblathon 1 Assemblathon 1 was conducted in 2011 and featured 59 assemblies from 17 different groups and the organizers. The goal of this Assembalthon was to most accurately and completely assemble a genome that consisted of two haplotypes (each with three chromosomes of 76.3, 18.5, and 17.7 Mb, respectively) that was generated using Evolver. Numerous metrics were used to assess the assemblies, including: NG50 (point at which 50% of the total genome size is reached when scaffold lengths are summed from the longest to the shortest), LG50 (number of scaffolds that are greater than, or equal to, the N50 length), genome coverage, and substitution error rate. • Software compared: ABySS, Phusion2, phrap, Velvet, SOAPdenovo, PRICE, ALLPATHS-LG • N50 analysis: assemblies by the Plant Genome Assembly Group (using the assembler Meraculous) and ALLPATHS,
Broad Institute, USA (using ALLPATHS-LG) performed the best in this category, by an order of magnitude over other groups. These assemblies scored an N50 of >8,000,000 bases. • Coverage of genome by assembly: for this metric, BGI's assembly via SOAPdenovo performed best, with 98.8% of the total genome being covered. All assemblers performed relatively well in this category, with all but three groups having coverage of 90% and higher, and the lowest total coverage being 78.5% (Dept. of Comp. Sci.,
University of Chicago, USA via Kiki). • Substitution errors: the assembly with the lowest substitution error rate was submitted by the
Wellcome Trust Sanger Institute, UK team using the software SGA. • Overall: No one assembler performed significantly better in others in all categories. While some assemblers excelled in one category, they did not in others, suggesting that there is still much room for improvement in assembler software quality.
Assemblathon 2 Assemblathon 2 improved on Assemblathon 1 by incorporating the genomes of multiples vertebrates (a bird (
Melopsittacus undulatus), a fish (
Maylandia zebra), and a snake (
Boa constrictor constrictor)) with genomes estimated to be 1.2, 1.0, and 1.6Gbp in length) and assessment by over 100 metrics. Each team was given four months to assemble their genome from Next-Generation Sequence (NGS) data, including
Illumina and
Roche 454 sequence data. • Software compared: ABySS, ALLPATHS-LG, PRICE, Ray, and SOAPdenovo • N50 analysis: for the assembly of the bird genome, the
Baylor College of Medicine Human Genome Sequencing Center and ALLPATHS teams had the highest NG50s, at over 16,000,000 and over 14,000,000 bp, respectively. • Presence of core genes: Most assemblies performed well in this category (~80% or higher), with only one dropping to just over 50% in their bird genome assembly (Wayne State University via HyDA). • Overall: Overall, the Baylor College of Medicine Human Genome Sequencing Center utilizing a variety of assembly methods (SeqPrep, KmerFreq, Quake, BWA, Newbler, ALLPATHS-LG, Atlas-Link, Atlas-GapFill, Phrap, CrossMatch, Velvet, BLAST, and BLASR) performed the best for the bird and fish assemblies. For the snake genome assembly, the Wellcome Trust Sanger Institute using SGA, performed best. For all assemblies, SGA, BCM, Meraculous, and Ray submitted competitive assemblies and evaluations. The results of the many assemblies and evaluations described here suggest that while one assembler may perform well on one species, it may not perform as well on another. The authors make several suggestions for assembly: 1) use more than one assembler, 2) use more than one metric for evaluation, 3) select an assembler that excels in metrics of more interest (e.g., N50, coverage), 4) low N50s or assembly sizes may not be concerning, depending on user needs, and 5) assess the levels of heterozygosity in the genome of interest. == See also ==