Abstract
We are now entering a new era in protein sequence and structure annotation, with hundreds of millions of predicted protein structures made available through the AlphaFold database¹. These models cover nearly all known proteins, including those that are challenging to annotate for function or putative biological role using standard homology-based approaches. In this study, we examine the extent to which the AlphaFold database has structurally illuminated this “dark matter” of the natural protein universe at high predicted accuracy. We further describe the protein diversity that these models cover as an annotated interactive sequence similarity network, accessible at https://uniprot3d.org/atlas/AFDB90v4. By searching for novelties from sequence, structure, and semantic perspectives, we uncovered the β-flower fold, added multiple protein families to the Pfam database², and experimentally demonstrated that one of these belongs to a new superfamily of translation-targeting toxin-antitoxin systems, TumE-TumA. This work underscores the value of large-scale efforts in identifying, annotating, and prioritising novel protein families. By leveraging the recent deep learning revolution in protein bioinformatics, we can now shed light on uncharted areas of the protein universe at an unprecedented scale, paving the way to innovations in life sciences and biotechnology.
Supplementary information
Supplementary Table 1
Top 10 GO term predictions for the members of the clusters in the high-resolution sequence similarity network of component 159 (TumE) in Figure 4a and their cognate antitoxin families, as predicted with DeepFRI. For each cluster, the number of models used for predictions (n) is displayed, as well as a boxplot depicting the distribution of DeepFRI scores for each prediction. Boxplots show the quartiles of the dataset. Outliers that fall outside the inter-quartile range are displayed as single points.
Supplementary Table 2
List of 290 connected components with a semantic diversity higher than 20%. 133 new Pfam families will be available in the next two releases of Pfam (36.0 and 37.0), and 17 were found to merge with previously defined Pfam families. For all of these, the corresponding Pfam entry and its description are provided.
Supplementary Table 3
DNA fragments and DNA oligonucleotides used for plasmid construction during the experimental validation and characterisation of TumE and TumA.
About this article
Cite this article
Durairaj, J., Waterhouse, A.M., Mets, T. et al. Uncovering new families and folds in the natural protein universe. Nature (2023). https://doi.org/10.1038/s41586-023-06622-3
DOI: https://doi.org/10.1038/s41586-023-06622-3