GenBank

The GenBank sequence database is an open access, annotated collection of publicly available nucleotide sequences and their protein translations. GenBank is part of the International Nucleotide Sequence Database Collaboration (INSDC) and is produced and maintained by the National Center for Biotechnology Information (NCBI), a division of the United States National Library of Medicine (NLM), part of the National Institutes of Health (NIH).

Submissions

GenBank accepts nucleotide sequence submissions from individual laboratories and large-scale sequencing projects. Direct submissions are made using the NCBI Submission Portal, which provides guided workflows for data submission, or programmatically using tools such as table2asn. The legacy BankIt submission system is being phased out in favor of the Submission Portal. Upon receipt of a submission, GenBank staff review the data for completeness, biological context, and consistency, assign an accession number, and perform quality assurance checks before release to the public database. Submitted sequences are accessible through Entrez and are available for download via FTP. GenBank supports a variety of submission types, including whole genome shotgun (WGS) assemblies, transcriptome shotgun assemblies (TSA), targeted locus studies (TLS), and high-throughput genomic (HTGS) sequences. Third Party Annotation (TPA) records allow the publication of annotations based on sequences already present in GenBank. Raw sequence reads generated by next-generation sequencing technologies are deposited in the Sequence Read Archive (SRA), rather than in GenBank itself.
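As an illustration of access through Entrez, the following minimal sketch builds a request URL for NCBI's E-utilities EFetch service, which returns a nucleotide record in GenBank flat-file format. The helper name `genbank_flatfile_url` is hypothetical, and the accession U49845 is used only as a sample identifier; the sketch builds the URL and makes no network request.

```python
from urllib.parse import urlencode

# Base address of the E-utilities EFetch endpoint.
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_flatfile_url(accession: str) -> str:
    """Build an EFetch URL for one nucleotide record (hypothetical helper)."""
    params = {
        "db": "nuccore",    # Entrez nucleotide database
        "id": accession,    # accession number of the record
        "rettype": "gb",    # GenBank flat-file format
        "retmode": "text",  # plain-text response
    }
    return f"{EUTILS_EFETCH}?{urlencode(params)}"


url = genbank_flatfile_url("U49845")
print(url)
```

The resulting URL can be passed to any HTTP client, for example `urllib.request.urlopen`, subject to NCBI's usage guidelines for automated requests.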

History

Walter Goad of the Theoretical Biology and Biophysics Group at Los Alamos National Laboratory (LANL) and colleagues established the Los Alamos Sequence Database in 1979, which culminated in the creation of the public GenBank database in 1982. Funding was provided by the National Institutes of Health, the National Science Foundation, the Department of Energy, and the Department of Defense. LANL collaborated on GenBank with the firm Bolt, Beranek, and Newman, and by the end of 1983 more than 2,000 sequences had been collected. An early description of the database was published in 1985, when GenBank contained over five million bases across approximately 6,000 sequence entries. During the late 1980s and early 1990s, responsibility for GenBank transitioned from Los Alamos National Laboratory to the newly established National Center for Biotechnology Information (NCBI) at the United States National Library of Medicine (NLM), part of the National Institutes of Health (NIH). GenBank release documentation from this period reflects a shift from joint contributions by LANL-based database staff and NLM-based indexing teams toward full management by NCBI, indicating a phased transfer of data curation and operational responsibilities.

Growth

GenBank has grown substantially since its inception. Early analyses and release notes have described this growth as approximating a doubling in the number of bases every 18 months, although growth rates have varied over time with changes in sequencing technologies and data submission practices. As of GenBank release 271.0 (April 2026), the database contained over 53.9 trillion bases and 6.27 billion sequence records, reflecting continued large-scale contributions from high-throughput sequencing projects. Although GenBank release statistics summarize the traditional GenBank divisions, several large bulk-oriented sequence collections are tracked separately. Whole Genome Shotgun (WGS), Transcriptome Shotgun Assembly (TSA), and Targeted Locus Study (TLS) records are processed by GenBank but are not distributed as part of regular GenBank release files; instead, they are made available continuously through separate per-project FTP areas. For release 271.0, most sequence data were in WGS records rather than traditional GenBank entries. The following table lists the twenty organisms with the largest number of bases in the traditional GenBank divisions for release 271.0; it excludes WGS, TSA, TLS, metagenomic, mitochondrial, chloroplast, synthetic construct, and several other categories defined in the release notes.
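To make the 18-month doubling figure concrete, the short sketch below works through the arithmetic of idealized exponential growth. It assumes a constant doubling time, which the text notes does not hold exactly in practice; the function names are illustrative only and the outputs are not actual GenBank statistics.

```python
import math


def growth_factor(elapsed_months: float, doubling_months: float = 18.0) -> float:
    """Multiplicative growth after elapsed_months under a constant doubling time."""
    return 2.0 ** (elapsed_months / doubling_months)


def months_to_grow(factor: float, doubling_months: float = 18.0) -> float:
    """Months needed to grow by the given factor under a constant doubling time."""
    return doubling_months * math.log2(factor)


# Two doubling periods (36 months) give a fourfold increase.
print(growth_factor(36))

# Time for a thousandfold increase, converted to years.
print(months_to_grow(1000) / 12)
```

Under this idealized model, a thousandfold increase takes just under 15 years, since 1000 is close to 2 to the power of 10 and ten doublings at 18 months each span 15 years.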

Limitations

An analysis of GenBank and other services for the molecular identification of clinical blood culture isolates using 16S rRNA sequences showed that such analyses were more discriminative when GenBank was combined with other services such as EzTaxon-e and the BIBI databases. GenBank may contain sequences wrongly assigned to a particular species because the initial identification of the organism was wrong. A recent study showed that 75% of mitochondrial cytochrome c oxidase subunit I sequences were wrongly assigned to the fish Nemipterus mesoprion as a result of the continued use of sequences from initially misidentified individuals. The authors provide recommendations on how to avoid further distribution of publicly available sequences with incorrect scientific names. Numerous published manuscripts have identified erroneous sequences in GenBank. These include not only incorrect species assignments (which can have different causes) but also chimeras and accession records with sequencing errors. A recent manuscript on the quality of all cytochrome b records of birds further showed that 45% of the identified erroneous records lack a voucher specimen, which prevents a reassessment of the species identification. Another problem is that sequence records are often submitted as anonymous sequences without species names (e.g. as "Pelomedusa sp. A CK-2014") because the species are either unknown or withheld for publication purposes. However, even after the species have been identified or published, these sequence records are often not updated and thus may cause ongoing confusion.
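Records with placeholder organism names of the kind quoted above can be flagged with a simple pattern check. The following is a rough heuristic sketch, not a GenBank or NCBI Taxonomy rule: it assumes a placeholder name consists of a capitalized genus followed by the abbreviation "sp.", and the function name is hypothetical.

```python
import re

# Assumed pattern: capitalized genus, then "sp." as a separate token.
_UNNAMED_SPECIES = re.compile(r"^[A-Z][a-z]+ sp\.(\s|$)")


def is_unnamed_species(organism: str) -> bool:
    """Return True if the organism name looks like a 'Genus sp.' placeholder."""
    return _UNNAMED_SPECIES.match(organism) is not None


for name in ["Pelomedusa sp. A CK-2014", "Nemipterus mesoprion"]:
    print(name, is_unnamed_species(name))
```

A check like this only identifies candidates for review; it cannot tell whether the species has since been named or published.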

Source: Wikipedia
