The human and mouse reference genomes are maintained and improved by the Genome Reference Consortium (GRC), a group of fewer than 20 scientists from a number of genome research institutes, including the European Bioinformatics Institute, the National Center for Biotechnology Information, the Sanger Institute and the McDonnell Genome Institute at Washington University in St. Louis. The GRC continues to improve reference genomes by building new alignments that contain fewer gaps and by fixing misrepresentations in the sequence.
== Human reference genome ==
The original human reference genome was derived from thirteen anonymous volunteers from
Buffalo, New York. Donors were recruited by advertisement in
The Buffalo News, on Sunday, March 23, 1997. The first ten male and ten female volunteers were invited to make an appointment with the project's
genetic counselors and donate blood from which DNA was extracted. As a result of how the DNA samples were processed, about 80 percent of the reference genome came from eight people, and one male, designated
RP11, accounts for 66 percent of the total. The
ABO blood group system differs among humans, but the human reference genome contains only an
O allele, although the others are
annotated. A comparison between the reference (assembly NCBI36/hg18) and James Watson's genome revealed 3.3 million
single nucleotide polymorphism differences, while about 1.4 percent of his DNA could not be matched to the reference genome at all. For regions known to have large-scale variation, sets of alternate
loci are assembled alongside the reference locus. The latest human reference genome assembly released by the Genome Reference Consortium is GRCh38, first released in 2013. Several patches have been added to update it, the latest being GRCh38.p14, published on 3 February 2022. This build has only 349 gaps across the entire assembly, a great improvement over the first version, which had roughly 150,000 gaps. The number of
genomic clone libraries contributing to the reference has increased steadily over the years to more than 60, although the individual RP11 still accounts for 70% of the reference genome. Genomic analysis of this anonymous male suggests that he is of African-European ancestry.
In 2022, the Telomere-to-Telomere (T2T) Consortium, an open, community-based effort, published the first completely assembled reference genome (version T2T-CHM13), without any gaps in the assembly. It did not contain a Y chromosome until version 2.0. This assembly allows for the examination of centromeric and pericentromeric sequence evolution. The consortium employed rigorous methods to assemble, clean, and validate complex repeat regions, which are particularly difficult to sequence. It used ultra-long-read (>100 kb) sequencing to accurately sequence
segmental duplications. T2T-CHM13 was sequenced from CHM13hTERT, a cell line from an essentially haploid
hydatidiform mole. "CHM" stands for "Complete Hydatidiform Mole", and "13" is its line number. "hTERT" stands for "human
Telomerase Reverse Transcriptase". The cell line has been transfected with the TERT gene, which is responsible for maintaining telomere length and thus contributes to the
cell line's immortality. A hydatidiform mole contains two copies of the same parental genome, and thus is essentially haploid. This eliminates allelic variation and allows better sequencing accuracy.
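The gap counts quoted above (349 in GRCh38.p14 versus roughly 150,000 in the first version) can be made concrete with a small sketch. It assumes that gaps are encoded as runs of `N` characters, as is common in assembly FASTA files; the function names `count_gaps` and `gap_spans` and the toy sequence are hypothetical and are not part of any GRC tool.

```python
import re

# Assumption: an assembly gap appears as a maximal run of 'N' (or 'n') bases.
GAP_PATTERN = re.compile(r"[Nn]+")


def gap_spans(sequence):
    """Return (start, end) positions of every run of N, end exclusive."""
    return [(m.start(), m.end()) for m in GAP_PATTERN.finditer(sequence)]


def count_gaps(sequence, min_run=1):
    """Count runs of N that are at least min_run bases long."""
    return sum(1 for start, end in gap_spans(sequence) if end - start >= min_run)


# Toy sequence, not real genomic data.
toy = "ACGTNNNNACGTACGTNNACGT"
print(gap_spans(toy))   # [(4, 8), (16, 18)]
print(count_gaps(toy))  # 2

# Fold reduction implied by the figures in the text.
print(round(150_000 / 349))  # 430
```

Applied per chromosome and summed, a count of this kind is one way to compare how complete two assemblies are, although published gap statistics may follow different conventions (for example, a minimum run length).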
=== Limitations ===
For much of a genome, the reference provides a good approximation of the DNA of any single individual. However, in regions with high
allelic diversity, such as the
major histocompatibility complex in humans and the
major urinary proteins of mice, the reference genome may differ significantly from other individuals. Because the reference genome is a single distinct sequence, which is what makes it useful as an index or locator of genomic features, it is limited in how faithfully it can represent the human genome and its
variability. Most of the initial samples used for reference genome sequencing came from people of European ancestry. In 2010, the
de novo assembly of genomes from African and Asian populations and their comparison with the NCBI reference genome (version NCBI36) showed that these genomes contained ~5 Mb of sequence that did not align to any region of the reference genome. Projects following the Human Genome Project seek a deeper and more diverse characterization of human genetic variability, which the reference genome is not able to represent. The
HapMap Project, active from 2002 to 2010, aimed to create a map of haplotypes and their most common variations among different human populations. Up to 11 populations of different ancestry were studied, including individuals of the
Han ethnic group from China,
Gujaratis from India, the
Yoruba people from Nigeria and
Japanese people. The
1000 Genomes Project, carried out between 2008 and 2015, aimed to create a database covering more than 95% of the variations present in the human genome, with results that can be used in genome-wide association studies (GWAS) of conditions such as diabetes, cardiovascular disease and autoimmune diseases. A total of 26 ethnic groups were studied in this project, expanding the scope of the HapMap Project to new ethnic groups such as the
Mende people of Sierra Leone, the
Vietnamese people and the
Bengali people. The
Human Pangenome Project, which started its initial phase in 2019 with the creation of the Human Pangenome Reference Consortium, seeks to create the largest map of human genetic variability, taking the results of previous studies as a starting point.
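The 2010 finding mentioned above, that newly assembled genomes contained sequence with no counterpart in the reference, can be illustrated with a toy check. The sketch below is not the alignment-based method used in that work; it swaps in a much simpler k-mer comparison. The function names `kmers` and `novel_fraction`, the choice of `k`, and the sequences are all assumptions made for illustration.

```python
def kmers(seq, k):
    """Return the set of distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def novel_fraction(sample, reference, k=4):
    """Fraction of the sample's distinct k-mers that are absent from the reference."""
    sample_kmers = kmers(sample, k)
    if not sample_kmers:
        # Sample shorter than k: nothing to compare.
        return 0.0
    reference_kmers = kmers(reference, k)
    return len(sample_kmers - reference_kmers) / len(sample_kmers)


# Toy sequences, not real genomic data.
reference = "ACGTACGTTTGA"
print(novel_fraction("ACGTACGTTTGA", reference))  # 0.0, fully represented
print(novel_fraction("ACGTGGGG", reference))      # 0.8, mostly novel sequence
```

A single linear reference can only index sequence that it contains, so any sample sequence scored as novel here would have no coordinate in the reference. This is the kind of gap that pangenome representations aim to close.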
== Mouse reference genome ==
Recent mouse genome assemblies are as follows:
== Other genomes ==