N50 The N50 statistic describes assembly quality in terms of
contiguity. Given a set of contigs, the
N50 is defined as the length of the shortest contig in the smallest set of largest contigs that together make up at least 50% of the total assembly length. It can be thought of as the point of half of the mass of the distribution; the number of
bases from all contigs longer than the
N50 will be close to the number of bases from all contigs shorter than the
N50. For example, consider 9 contigs with the lengths 2, 3, 4, 5, 6, 7, 8, 9, and 10; their sum is 54, half of the sum is 27, and the size of the genome also happens to be 54. Summing from the largest contig down, 10 + 9 + 8 = 27, which reaches 50% of the assembly. Thus the N50 is 8: the length of the contig that, together with all larger contigs, contains half of the sequence of the assembly. Note: when comparing N50 values from different assemblies, the assemblies must be the same size for the comparison to be meaningful. N50 can be described as a
weighted median statistic such that 50% of the entire assembly is contained in contigs or scaffolds equal to or larger than this value.
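The worked example above can be reproduced with a short Python sketch. The function name `n50` is chosen here for illustration and does not come from any particular assembly tool; the sketch assumes the input is a plain list of contig lengths.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the largest contig down, accumulating length.
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the halfway mark.
        if 2 * running >= total:
            return length
    raise ValueError("lengths must not be empty")


print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8
```

Sorting in descending order and stopping at the first contig that brings the running sum to half of the total mirrors the "10 + 9 + 8 = 27" step in the text.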
L50 Given a set of contigs, each with its own length, the
L50 is defined as the smallest number of contigs whose lengths sum to at least half of the
genome size. In the example above, the L50 is 3.
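A minimal sketch of the L50 count follows. It assumes, as in the example, that the genome size equals the sum of the contig lengths; the function name `l50` is illustrative only.

```python
def l50(lengths):
    """Return the smallest number of largest contigs covering half the total."""
    # Assumption: the total of the contig lengths stands in for the genome size.
    total = sum(lengths)
    running = 0
    # Count contigs from the largest down until half of the total is reached.
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if 2 * running >= total:
            return count
    raise ValueError("lengths must not be empty")


print(l50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 3
```

The three contigs counted here are those of length 10, 9, and 8.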
N90 The
N90 statistic is less than or equal to the
N50 statistic; it is the length for which the collection of all contigs of that length or longer contains at least 90% of the sum of the lengths of all contigs.
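The same procedure generalizes to any percentage. The sketch below uses a hypothetical helper `nx` that takes the threshold as a parameter and applies it to the nine-contig example from the N50 entry.

```python
def nx(lengths, percent):
    """Length L such that contigs of length >= L hold at least percent% of all bases."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range (0, 100]")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare scaled values so an integer percent needs no division.
        if 100 * running >= percent * total:
            return length
    raise ValueError("lengths must not be empty")


contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(nx(contigs, 50))  # 8
print(nx(contigs, 90))  # 4
```

For the 90% threshold the running sum must reach 48.6 of 54 bases, which first happens at 10 + 9 + 8 + 7 + 6 + 5 + 4 = 49, so the N90 is 4, which is not more than the N50 of 8.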
NG50 Note that
N50 is calculated in the context of the assembly size rather than the genome size. Therefore, comparisons of N50 values derived from assemblies of significantly different lengths are usually not informative, even when the assemblies are of the same genome. To address this, the authors of the
Assemblathon competition came up with a new measure called
NG50. The
NG50 statistic is the same as
N50 except that 50% of the known or estimated genome size, rather than of the assembly size, must be contained in contigs of the NG50 length or longer. This allows for meaningful comparisons between different assemblies. In the typical case that the assembly size is not more than the genome size, the NG50 statistic will not be more than the N50 statistic.
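The difference from N50 is only the threshold, as the following sketch suggests. The function name `ng50` and the choice to return `None` when the assembly covers less than half of the genome are assumptions made for illustration; the genome sizes of 70 and 200 are invented test values.

```python
def ng50(lengths, genome_size):
    """Return the NG50, or None if the contigs cover less than half the genome."""
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # The threshold is half the genome size, not half the assembly size.
        if 2 * running >= genome_size:
            return length
    return None


contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(ng50(contigs, 54))   # 8, same as N50 because genome size equals assembly size
print(ng50(contigs, 70))   # 6, lower than N50 because the genome is larger
print(ng50(contigs, 200))  # None, the assembly covers less than half the genome
```

With a genome size of 70 the running sum must reach 35, which requires the contigs of length 10, 9, 8, 7, and 6, so the NG50 drops to 6.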
D50 The
D50 statistic (also termed
D50 test) is similar to the
N50 statistic in definition though it is generally not used to describe genome assemblies. The
D50 statistic is the lowest value
d for which the sum of the largest
d lengths is at least 50% of the sum of all of the lengths.
U50 U50 is the length of the smallest contig such that 50% of the sum of all unique, target-specific contigs is contained in contigs of size U50 or larger.
UL50 UL50 is the number of contigs whose length sum produces U50.
UG50 UG50 is the length of the smallest contig such that 50% of the reference genome is contained in unique, target-specific contigs of size UG50 or larger.
UG50% UG50% is the estimated percent coverage of the UG50 relative to the length of the reference genome. The calculation is 100 × (UG50 / length of the reference genome). The
UG50%, as a percentage-based metric, can be used to compare assembly results from different samples or studies.
==Examples==