UniProt

Content
Description	UniProt is the Universal Protein Resource, a central repository of protein data created by combining the Swiss-Prot, TrEMBL and PIR-PSD databases.
Data types captured	Protein annotation
Organisms	All
Contact
Research center	EMBL-EBI, UK; SIB, Switzerland; PIR, US.
Primary citation	Ongoing and future developments at the Universal Protein Resource^[1]
Access
Data format	Custom flat file, FASTA, GFF, RDF, XML.
Website	www.uniprot.org
Download URL	www.uniprot.org/downloads & for downloading complete data sets ftp.uniprot.org
Web service URL	Yes – Java API and REST
Tools
Web	Advanced search, BLAST, ClustalO, bulk retrieval/download, ID mapping
Miscellaneous
License	Creative Commons Attribution-NoDerivs
Versioning	Yes
Data release frequency	4 weeks
Curation policy	Yes – manual and automatic. Rules for automatic annotation generated by database curators and computational algorithms.
Bookmarkable entities	Yes – both individual protein entries and searches

UniProt is a comprehensive, high-quality and freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature.

The UniProt consortium

The UniProt consortium comprises the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB), and the Protein Information Resource (PIR). EBI, located at the Wellcome Trust Genome Campus in Hinxton, UK, hosts a large resource of bioinformatics databases and services. SIB, located in Geneva, Switzerland, maintains the ExPASy (Expert Protein Analysis System) servers that are a central resource for proteomics tools and databases. PIR, hosted by the National Biomedical Research Foundation (NBRF) at the Georgetown University Medical Center in Washington, DC, USA, is heir to the oldest protein sequence database, Margaret Dayhoff's Atlas of Protein Sequence and Structure, first published in 1965.^[2] In 2002, EBI, SIB, and PIR joined forces as the UniProt consortium.^[3]

The roots of UniProt databases

Each consortium member is heavily involved in protein database maintenance and annotation. Before the consortium was formed, EBI and SIB together produced the Swiss-Prot and TrEMBL databases, while PIR produced the Protein Sequence Database (PIR-PSD).^[4]^[5]^[6] These databases coexisted with differing protein sequence coverage and annotation priorities.

Swiss-Prot was created in 1986 by Amos Bairoch during his PhD. It was developed by the Swiss Institute of Bioinformatics and subsequently by Rolf Apweiler at the European Bioinformatics Institute.^[7]^[8]^[9] Swiss-Prot aimed to provide reliable protein sequences associated with a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications and variants), a minimal level of redundancy and a high level of integration with other databases. Because sequence data were being generated at a pace exceeding Swiss-Prot's ability to keep up, TrEMBL (Translated EMBL Nucleotide Sequence Data Library) was created to provide automated annotations for those proteins not in Swiss-Prot. Meanwhile, PIR maintained the PIR-PSD and related databases, including iProClass, a database of protein sequences and curated families.

The consortium members pooled their overlapping resources and expertise, and launched UniProt in December 2003.^[10]

Organization of UniProt databases

UniProt provides four core databases: UniProtKB (with sub-parts Swiss-Prot and TrEMBL), UniParc, UniRef, and UniMES.

UniProtKB

UniProt Knowledgebase (UniProtKB) is a protein database partially curated by experts, consisting of two sections: UniProtKB/Swiss-Prot (containing reviewed, manually annotated entries) and UniProtKB/TrEMBL (containing unreviewed, automatically annotated entries).^[11] As of 19 March 2014, release "2014_03" of UniProtKB/Swiss-Prot contains 542,782 sequence entries (comprising 193,019,802 amino acids abstracted from 226,896 references) and release "2014_03" of UniProtKB/TrEMBL contains 54,247,468 sequence entries (comprising 17,207,833,179 amino acids).^[12]^[13]
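The two sections can be told apart in FASTA downloads by the database tag at the start of each header line. The following is a minimal sketch, assuming the commonly documented header layout `>db|Accession|EntryName ProteinName OS=... OX=... GN=... PE=... SV=...`, where `sp` marks Swiss-Prot and `tr` marks TrEMBL. The function name `parse_header` and the returned dictionary keys are illustrative choices, not part of any UniProt library, and the header in the usage note is given only as an example.

```python
import re

# Assumed header layout (not taken from this article):
# >db|Accession|EntryName ProteinName OS=Organism OX=TaxonID GN=Gene PE=n SV=n
HEADER_RE = re.compile(r"^>(sp|tr)\|([^|]+)\|(\S+)\s+(.*)$")
FIELD_RE = re.compile(r"\b(OS|OX|GN|PE|SV)=")


def parse_header(line):
    """Split one UniProtKB-style FASTA header into its parts."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a UniProtKB-style header: {line!r}")
    db, accession, entry_name, rest = match.groups()
    # re.split with a capturing group keeps the field keys in the result:
    # [protein name, "OS", value, "OX", value, ...]
    parts = FIELD_RE.split(rest)
    fields = dict(zip(parts[1::2], (p.strip() for p in parts[2::2])))
    return {
        "reviewed": db == "sp",  # sp = Swiss-Prot, tr = TrEMBL
        "accession": accession,
        "entry_name": entry_name,
        "protein_name": parts[0].strip(),
        "fields": fields,
    }
```

For example, `parse_header(">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens OX=9606 GN=HBA1 PE=1 SV=2")` yields a record whose `reviewed` flag is true and whose `fields["OS"]` is `"Homo sapiens"`.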

UniProtKB/Swiss-Prot

UniProtKB/Swiss-Prot is a manually annotated, non-redundant protein sequence database. It combines information extracted from scientific literature and biocurator-evaluated computational analysis. The aim of UniProtKB/Swiss-Prot is to provide all known relevant information about a particular protein. Annotation is regularly reviewed to keep up with current scientific findings. The manual annotation of an entry involves detailed analysis of the protein sequence and of the scientific literature.^[14]

Sequences from the same gene and the same species are merged into the same database entry. Differences between sequences are identified, and their cause documented (for example alternative splicing, natural variation, incorrect initiation sites, incorrect exon boundaries, frameshifts, unidentified conflicts). A range of sequence analysis tools is used in the annotation of UniProtKB/Swiss-Prot entries. Computer predictions are manually evaluated, and relevant results are selected for inclusion in the entry. These predictions include post-translational modifications, transmembrane domains and topology, signal peptides, domain identification, and protein family classification.^[14]^[15]

Relevant publications are identified by searching databases such as PubMed. The full text of each paper is read, and information is extracted and added to the entry. Annotation arising from the scientific literature includes, but is not limited to:^[10]^[14]^[15]

Protein and gene names
Function
Enzyme-specific information such as catalytic activity, cofactors and catalytic residues
Subcellular location
Protein-protein interactions
Pattern of expression
Locations and roles of significant domains and sites
Ion-, substrate- and cofactor-binding sites
Protein variant forms produced by natural genetic variation, RNA editing, alternative splicing, proteolytic processing, and post-translational modification

Annotated entries undergo quality assurance before inclusion into UniProtKB/Swiss-Prot. When new data becomes available, entries are updated.

UniProtKB/TrEMBL

UniProtKB/TrEMBL contains high-quality computationally analyzed records, which are enriched with automatic annotation. It was introduced in response to increased dataflow resulting from genome projects, as the time- and labour-consuming manual annotation process of UniProtKB/Swiss-Prot could not be broadened to include all available protein sequences.^[10] The translations of annotated coding sequences in the EMBL-Bank/GenBank/DDBJ nucleotide sequence database are automatically processed and entered in UniProtKB/TrEMBL. UniProtKB/TrEMBL also contains sequences from PDB, and from gene prediction resources, including Ensembl, RefSeq and CCDS.^[16]
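To illustrate what "translation of an annotated coding sequence" means, the sketch below converts a nucleotide coding sequence into a protein sequence using the standard genetic code. It is a toy example only: the function name `translate_cds` is hypothetical, the sketch ignores alternative genetic codes and ambiguity codes, and it is not a description of the actual TrEMBL pipeline.

```python
# Standard genetic code, laid out in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_cds(cds):
    """Translate a coding sequence, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")  # X = unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

Here `translate_cds("ATGGCCTAA")` gives `"MA"`: the start codon ATG encodes methionine, GCC encodes alanine, and TAA ends translation.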

UniParc

UniProt Archive (UniParc) is a comprehensive and non-redundant database, which contains all the protein sequences from the main, publicly available protein sequence databases.^[17] Proteins may exist in several different source databases, and in multiple copies in the same database. To avoid redundancy, UniParc stores each unique sequence only once. Identical sequences are merged, regardless of whether they are from the same or different species. Each sequence is given a stable and unique identifier (UPI), making it possible to identify the same protein from different source databases. UniParc contains only protein sequences, with no annotation. Database cross-references in UniParc entries allow further information about the protein to be retrieved from the source databases. When sequences in the source databases change, these changes are tracked by UniParc and a history of all changes is archived.
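The storage model described above (one record per unique sequence, with cross-references pointing back to every source) can be sketched in a few lines. This is an illustration under stated assumptions, not UniParc's implementation: the class names, the `UPI` plus ten hexadecimal digits identifier format, and the source identifiers in the usage note are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ArchiveEntry:
    upi: str
    sequence: str
    # Each cross-reference is a (source database, source identifier) pair.
    cross_refs: set = field(default_factory=set)


class SequenceArchive:
    """Toy non-redundant archive: identical sequences share one entry."""

    def __init__(self):
        self._by_sequence = {}
        self._by_upi = {}
        self._next_id = 1

    def add(self, sequence, source_db, source_id):
        key = sequence.upper()
        entry = self._by_sequence.get(key)
        if entry is None:
            # Assumed identifier format, chosen for illustration only.
            upi = f"UPI{self._next_id:010X}"
            self._next_id += 1
            entry = ArchiveEntry(upi, key)
            self._by_sequence[key] = entry
            self._by_upi[upi] = entry
        entry.cross_refs.add((source_db, source_id))
        return entry.upi

    def get(self, upi):
        return self._by_upi[upi]

    def __len__(self):
        return len(self._by_sequence)
```

Adding the same sequence from two source databases returns the same identifier both times and leaves a single entry carrying two cross-references, which is how the same protein can be recognised across sources.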

Source databases

Currently UniParc contains protein sequences from the following publicly available databases:

INSDC EMBL-Bank/DDBJ/GenBank nucleotide sequence databases
Ensembl
European Patent Office (EPO)
FlyBase: the primary repository of genetic and molecular data for the insect family Drosophilidae
H-Invitational Database (H-Inv)
International Protein Index (IPI)
Japan Patent Office (JPO)
Protein Information Resource (PIR-PSD)
Protein Data Bank (PDB)
Protein Research Foundation (PRF)
RefSeq
Saccharomyces Genome Database (SGD)
The Arabidopsis Information Resource (TAIR)
TROME
US Patent Office (USPTO)
UniProtKB/Swiss-Prot, UniProtKB/Swiss-Prot protein isoforms, UniProtKB/TrEMBL
Vertebrate and Genome Annotation Database (VEGA)
WormBase

UniRef

The UniProt Reference Clusters (UniRef) consist of three databases of clustered sets of protein sequences from UniProtKB and selected UniParc records.^[18] The UniRef100 database combines identical sequences and sequence fragments (from any organism) into a single UniRef entry. The sequence of a representative protein, the accession numbers of all the merged entries and links to the corresponding UniProtKB and UniParc records are displayed. UniRef100 sequences are clustered using the CD-HIT algorithm to build UniRef90 and UniRef50.^[18]^[19] Each cluster is composed of sequences that have at least 90% or 50% sequence identity, respectively, to the longest sequence. Clustering sequences significantly reduces database size, enabling faster sequence searches.
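The idea of clustering against the longest sequence at a fixed identity threshold can be sketched as a greedy procedure: sequences are processed from longest to shortest, and each one either joins the first cluster whose representative it matches closely enough or becomes the representative of a new cluster. The code below is a toy version of that idea, not CD-HIT itself. In particular, the `identity` function is a crude ungapped, position-by-position comparison standing in for a real alignment, and all names are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions over the shorter sequence (ungapped)."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / shorter


def greedy_cluster(seqs, threshold):
    """Cluster {id: sequence} greedily; return [(representative, members)]."""
    clusters = []
    # Longest first, so each representative is the longest cluster member.
    ordered = sorted(seqs.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for seq_id, seq in ordered:
        for rep_id, members in clusters:
            if identity(seqs[rep_id], seq) >= threshold:
                members.append(seq_id)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters.append((seq_id, [seq_id]))
    return clusters
```

Running the function with a threshold of 1.0 groups only identical sequences and fragments, while lower thresholds such as 0.9 or 0.5 merge progressively more distant sequences, which is why the lower-identity sets are smaller and faster to search.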

UniRef is available from the UniProt FTP site.

UniMES

The UniProt Metagenomic and Environmental Sequences (UniMES) database is a repository specifically developed for metagenomic and environmental data.^[20] The predicted proteins from this dataset are combined with automatic classification by InterPro to enhance the original information with further analysis.

UniProtKB contains protein sequences from known species. Data arising from metagenomics studies, by contrast, come from environmental (i.e., uncultured) samples, so the source species may be unknown or not yet identified. UniMES was developed for these data. Data from UniMES are not included in UniProtKB or UniRef, but are included in UniParc.^[20] As of July 2012, UniMES contains only data from the Global Ocean Sampling Expedition (GOS).^[21]

The UniMES clusters provide clustered sets (unimes_cluster100 and unimes_cluster90) of sequences at two resolutions (100% and >90%). In unimes_cluster100, identical sequences and subfragments from unimes.fasta are placed into a single cluster. The unimes_cluster90 set is built by clustering the unimes_cluster100 representative sequences (the longest sequence in each cluster) using the CD-HIT algorithm^[19] such that each cluster is composed of sequences that have at least 90% sequence identity to the representative sequence. Only the representative sequences of the clusters are present in these files.

UniMES is available from the UniProt FTP site.

Funding for UniProt

UniProt is funded by grants from the National Human Genome Research Institute, the National Institutes of Health (NIH), the European Commission, the Swiss Federal Government through the Federal Office of Education and Science, NCI-caBIG, and the Department of Defense.^[11]

References

[1] "Ongoing and future developments at the Universal Protein Resource". PMID 21051339.
[2] Dayhoff, Margaret. Atlas of Protein Sequence and Structure. First published 1965.
[3] http://www.genome.gov/page.cfm?pageID=10005283
[4] PMID 12230036.
[5] PMID 12520019.
[6] PMID 12520024.
[7] (citation text not preserved)
[8] Bairoch (2000).
[9] Séverine Altairac, "Naissance d’une banque de données: Interview du prof. Amos Bairoch". Protéines à la Une, August 2006. ISSN 1660-9824.
[10] PMID 15036160.
[11] PMID 19843607.
[12] UniProtKB/Swiss-Prot release statistics
[13] UniProtKB/TrEMBL release statistics
[14] Annotation of UniProtKB
[15] PMID 14681372.
[16] Where do UniProtKB sequences come from
[17] PMID 15044231.
[18] PMID 17379688.
[19] PMID 11294794.
[20] PMID 18045787.
[21] PMID 17355171.

External links

UniProt
