UniProt provides four core databases: UniProtKB (with sub-parts Swiss-Prot and TrEMBL), UniParc, UniRef and Proteomes.
===UniProtKB===
The UniProt Knowledgebase (UniProtKB) is a protein database partially curated by experts, consisting of two sections: UniProtKB/Swiss-Prot (containing reviewed, manually annotated entries) and UniProtKB/TrEMBL (containing unreviewed, automatically annotated entries). Release "2023_01" of UniProtKB/Swiss-Prot contains 569,213 sequence entries (comprising 205,728,242 amino acids abstracted from 291,046 references), and release "2023_01" of UniProtKB/TrEMBL contains 245,871,724 sequence entries (comprising 85,739,380,194 amino acids).
====UniProtKB/Swiss-Prot====
UniProtKB/Swiss-Prot is a manually annotated, non-redundant protein sequence database. It combines information extracted from scientific literature and biocurator-evaluated computational analysis. The aim of UniProtKB/Swiss-Prot is to provide all known relevant information about a particular protein. Annotation is regularly reviewed to keep up with current scientific findings.

The manual annotation of an entry involves detailed analysis of the protein sequence and of the scientific literature. Sequences from the same gene and the same species are merged into the same database entry. Differences between sequences are identified, and their cause is documented (for example alternative splicing, natural variation, incorrect initiation sites, incorrect exon boundaries, frameshifts, unidentified conflicts).

A range of sequence analysis tools is used in the annotation of UniProtKB/Swiss-Prot entries. Computer predictions are manually evaluated, and relevant results are selected for inclusion in the entry. These predictions include post-translational modifications, transmembrane domains and topology, signal peptides, domain identification, and protein family classification.

Relevant publications are identified by searching databases such as PubMed. The full text of each paper is read, and information is extracted and added to the entry. Annotation arising from the scientific literature includes, but is not limited to:

Since 22 July 2021, the database also includes structures predicted with AlphaFold2.
===UniParc===
The UniProt Archive (UniParc) is a comprehensive and non-redundant database that contains all the protein sequences from the main publicly available protein sequence databases. Proteins may exist in several different source databases, and in multiple copies in the same database. To avoid redundancy, UniParc stores each unique sequence only once. Identical sequences are merged, regardless of whether they are from the same or different species. Each sequence is given a stable and unique identifier (UPI), making it possible to identify the same protein from different source databases. UniParc contains only protein sequences, with no annotation. Database cross-references in UniParc entries allow further information about the protein to be retrieved from the source databases. When sequences in the source databases change, these changes are tracked by UniParc, and a history of all changes is archived.
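The merging behaviour described above can be illustrated with a minimal Python sketch. It is not UniParc's actual implementation: the class names, the identifier format and the source accessions are hypothetical, and change-history tracking is omitted. It only shows the idea of storing each distinct sequence once, under one stable identifier, while accumulating cross-references to every source record in which that sequence appears.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ArchiveEntry:
    """One unique sequence plus cross-references to the records it came from."""
    upi: str
    sequence: str
    cross_refs: set[tuple[str, str]] = field(default_factory=set)


class SequenceArchive:
    """Toy non-redundant archive: one entry per distinct sequence string."""

    def __init__(self) -> None:
        self._by_sequence: dict[str, ArchiveEntry] = {}

    def add(self, sequence: str, source_db: str, source_id: str) -> ArchiveEntry:
        # Identical sequences share one entry, whatever database or species
        # they were submitted under.
        key = sequence.strip().upper()
        entry = self._by_sequence.get(key)
        if entry is None:
            # Illustrative identifier only; real UPIs are assigned by UniParc.
            upi = f"UPI{len(self._by_sequence) + 1:010X}"
            entry = ArchiveEntry(upi=upi, sequence=key)
            self._by_sequence[key] = entry
        # Remember where this sequence was seen, without storing it again.
        entry.cross_refs.add((source_db, source_id))
        return entry

    def __len__(self) -> int:
        return len(self._by_sequence)


# Hypothetical sequences and accessions, for illustration only.
archive = SequenceArchive()
a = archive.add("MKTAYIAK", "RefSeq", "hypothetical-id-1")
b = archive.add("MKTAYIAK", "Ensembl", "hypothetical-id-2")
c = archive.add("MSLLTEVE", "PDB", "hypothetical-id-3")

print(len(archive), a.upi, sorted(a.cross_refs))
```

Here the first two submissions resolve to the same entry and identifier, so the archive holds two sequences rather than three, and the shared entry carries both cross-references.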
====Source databases====
Currently, UniParc contains protein sequences from the following publicly available databases:
• INSDC EMBL-Bank/DDBJ/GenBank nucleotide sequence databases
• Ensembl
• European Patent Office (EPO)
• FlyBase: the primary repository of genetic and molecular data for the insect family Drosophilidae (FlyBase)
• H-Invitational Database (H-Inv)
• International Protein Index (IPI)
• Japan Patent Office (JPO)
• Protein Information Resource (PIR-PSD)
• Protein Data Bank (PDB)
• Protein Research Foundation (PRF)
• RefSeq
• Saccharomyces Genome Database (SGD)
• The Arabidopsis Information Resource (TAIR)
• TROME
• US Patent Office (USPTO)
• UniProtKB/Swiss-Prot, UniProtKB/Swiss-Prot protein isoforms, UniProtKB/TrEMBL
• Vertebrate and Genome Annotation Database (VEGA)
• WormBase

===UniRef===
The UniProt Reference Clusters (UniRef) consist of three databases of clustered sets of protein sequences from UniProtKB and selected UniParc records. The UniRef100 database combines identical sequences and sequence fragments (from any organism) into a single UniRef entry. The sequence of a representative protein, the accession numbers of all the merged entries and links to the corresponding UniProtKB and UniParc records are displayed. UniRef100 sequences are clustered using the CD-HIT algorithm to build UniRef90 and UniRef50. Each cluster is composed of sequences that have at least 90% or 50% sequence identity, respectively, to the longest sequence. Clustering sequences significantly reduces database size, enabling faster sequence searches. UniRef is available from the UniProt FTP site.

==Funding==