Abstract
Metal ions have various important biological roles in proteins, including structural maintenance, molecular recognition and catalysis. Previous methods of predicting metal-binding sites in proteomes were based on either sequence or structural motifs. Here we developed a co-evolution-based pipeline named ‘MetalNet’ to systematically predict metal-binding sites in proteomes. We applied MetalNet to the proteomes of four representative prokaryotic species and predicted 4,849 potential metalloproteins, which substantially expands the currently annotated metalloproteomes. We biochemically and structurally validated previously unannotated metal-binding sites in several proteins, including apo-citrate lyase phosphoribosyl-dephospho-CoA transferase citX, an Escherichia coli enzyme lacking structural or sequence homology to any known metalloprotein (Protein Data Bank (PDB) codes: 7DCM and 7DCN). MetalNet also successfully recapitulated all known zinc-binding sites from the human spliceosome complex. The MetalNet pipeline provides a unique and enabling tool for interrogating the hidden metalloproteome and studying metal biology.

Data availability
The original protein structure dataset containing 9,846 protein sequences and their MSAs can be downloaded from https://doi.org/10.1073/pnas.1702664114. Co-evolved pairs used in model training can be found in Supplementary Dataset. The proteins and related MSAs of prokaryotic species can be downloaded from https://gremlin2.bakerlab.org/db/{species}/fasta/. The Metagenome-pfam MSAs and structural models can be downloaded from https://gremlin2.bakerlab.org/db/UNI/. The human spliceosome dataset can be found in Supplementary Dataset. PDB codes 6ID0, 6ID1, 6ICZ and 6QW6 were used to construct the human spliceosome dataset. We downloaded the information table of protein entities from the PDB server (https://ftp.wwpdb.org/pub/pdb/derived_data/index/entries.idx) to construct an unbiased dataset for comparing methods. UniProt (https://ebi10.uniprot.org) profiles (date: 16 August 2021), the Gene Ontology database (https://www.ebi.ac.uk/QuickGO/) and the Pfam database (http://pfam.xfam.org/) were used in the analysis. The pdbaa database (16 January 2018 release, ftp://ftp.ncbi.nlm.nih.gov/blast/db/pdbaa.tar.gz) was used in BLASTP. The structures of citX reported in this paper have been deposited in PDB with the accession numbers 7DCM (determined by single-wavelength anomalous diffraction) and 7DCN (determined by molecular replacement). Source data are provided with this paper.
Code availability
Our code is available as open source at https://github.com/wangchulab/MetalNet.
Acknowledgements
We thank H. Tang in Chu Wang’s lab and the Institute of Geographic Sciences and Natural Resources Research, CAS for help with ICP measurements. We thank the National Center for Protein Sciences at Peking University, Beijing for help with circular dichroism measurements, and the staff of the Shanghai Synchrotron Radiation Facility and KEK Photon Factory for assistance with X-ray data collection. Funding: C.W. was supported by the National Natural Science Foundation of China (grants 21925701, 91953109 and 92153301).
Author information
Contributions
H.W. and C.W. conceived the project; Y.C., H.W. and Y.L. performed computational analysis with the help of C.S. and S.O.; H.X. purified citX and solved the structure under the guidance of X.S.; B.M., X.C., X.W. and X.Z. contributed to biochemical verification of zinc binding in citX and other proteins; and Y.C., H.W., Y.L. and C.W. analyzed the data and wrote the manuscript with input from all authors.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Chemical Biology thanks Kevin Yang, Rosalin Bonetta Valentino and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Distribution of the metal types of the metalloproteins in the training set.
ZN: zinc; CA: calcium; MG: magnesium; MN: manganese; FE: iron; SF4: [FE4-S4] iron-sulfur clusters; NI: nickel; CU: copper; CO: cobalt; FES: [FE2-S2] iron-sulfur clusters.
Extended Data Fig. 2 Examples of the coevolved CHED network clusters detected in the PDB data.
Metal-binding residues in these sites have the tendency to coevolve with each other and cluster together in a high-order coevolution network. The PDB ID and type of the metal ion bound are listed below the corresponding coevolved network.
Extended Data Fig. 3 Distribution of the number of coevolved connections (‘node degree’) for each residue among all the coevolved CHED pairs in the training set.
On average, the CHED residues involved in metal binding (blue) have higher node degrees in the coevolved networks than the non-metal-binding CHED residues (red).
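The node degree compared in this figure can be illustrated with a minimal sketch. It assumes that coevolved pairs are given as an undirected edge list; the residue labels and pairs below are hypothetical and are not taken from the training set.

```python
from collections import Counter


def node_degrees(pairs):
    """Return the number of distinct coevolved partners of each residue.

    `pairs` is an iterable of (residue_a, residue_b) tuples. Pairs are treated
    as undirected, so duplicates and reversed duplicates are counted once.
    """
    edges = {frozenset(pair) for pair in pairs if pair[0] != pair[1]}
    degree = Counter()
    for edge in edges:
        for residue in edge:
            degree[residue] += 1
    return degree


# Hypothetical coevolved CHED pairs (one-letter code plus position).
pairs = [
    ("C10", "C13"),
    ("C10", "H30"),
    ("C10", "C34"),
    ("C13", "H30"),
    ("C13", "C10"),  # reversed duplicate, counted once
    ("D50", "E90"),
]
degrees = node_degrees(pairs)
```

In this toy network, `C10` has three partners, whereas the isolated pair `D50`/`E90` has a degree of one each, which mirrors the kind of contrast the figure describes.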
Extended Data Fig. 4 Composition of metal-chelating CHED sidechains within the coordination sphere around each specific metal ion.
The statistics are calculated from the metalloprotein structures in the benchmark. Panels a and b show the absolute count and the normalized percentage, respectively. ZN: zinc; CA: calcium; MG: magnesium; MN: manganese; NI: nickel; FE: iron; CU: copper; SF4: [FE4-S4] iron-sulfur clusters; CO: cobalt; FES: [FE2-S2] iron-sulfur clusters.
Extended Data Fig. 5 Overall workflow of MetalNet.
Starting with the information from coevolution analysis, MetalNet uses a machine learning (ML)-based classifier, trained on a benchmark of known metalloproteins, to predict whether an individual coevolved CHED residue pair is metal-binding or not. It then employs a graph-based approach to identify high-order coevolution network clusters formed by these coevolved CHED pairs and thereby generate reliable predictions of metal-binding sites in the query protein. In certain cases, when the coevolution network topology can be matched to that of known metal-binding sites, the method can also infer the type of metal bound in the predicted site. The method does not use any sequence (1D) or structural (3D) homology information to make predictions.
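The two-stage logic of this workflow (score individual pairs, then group the accepted pairs into network clusters) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the classifier is replaced by precomputed scores, clusters are taken to be connected components, and the `score_cutoff` and `min_size` parameters are hypothetical.

```python
from collections import defaultdict, deque

CHED = set("CHED")  # Cys, His, Glu, Asp


def predict_sites(scored_pairs, score_cutoff=0.5, min_size=3):
    """Group accepted coevolved CHED pairs into candidate metal-binding sites.

    `scored_pairs` holds ((aa_i, pos_i), (aa_j, pos_j), score) tuples, where
    `score` stands in for the output of a pair classifier (assumed here).
    Returns connected components with at least `min_size` residues.
    """
    graph = defaultdict(set)
    for res_i, res_j, score in scored_pairs:
        if res_i[0] in CHED and res_j[0] in CHED and score >= score_cutoff:
            graph[res_i].add(res_j)
            graph[res_j].add(res_i)

    seen, clusters = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        # Breadth-first search collects one connected component.
        queue, component = deque([start]), set()
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in graph[node] - seen:
                seen.add(neighbour)
                queue.append(neighbour)
        if len(component) >= min_size:
            clusters.append(sorted(component, key=lambda res: res[1]))
    return clusters


# Hypothetical scored pairs for one query protein.
scored = [
    (("C", 10), ("C", 13), 0.9),
    (("C", 13), ("H", 30), 0.8),
    (("C", 10), ("C", 34), 0.7),
    (("D", 50), ("E", 90), 0.2),   # rejected: score below the cutoff
    (("E", 60), ("D", 61), 0.95),  # accepted pair, but cluster too small
    (("A", 5), ("C", 10), 0.99),   # rejected: alanine is not a CHED residue
]
sites = predict_sites(scored)
```

Here the only reported site is the four-residue cluster C10, C13, H30 and C34. Matching the cluster topology against known sites to infer the metal type, as described above, is not covered by this sketch.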
Extended Data Fig. 6 Comparison of the performance of MetalNet and MIB.
a, Evaluation of the performance of MetalNet and MIB on a ‘prospective’ dataset of metalloproteins that were deposited in PDB after 2016/08. b, Evaluation of the performance of MetalNet and MIB on this prospective metalloprotein dataset after removal of proteins with a single sidechain as the liganding group. In a and b, the calculated precision, recall and F1-scores for MetalNet and MIB are shown on the left in table format for all metal-binding sites (‘All sites’), Zn-specific sites (‘Zn’), Mg/Ca-specific sites (‘Mg/Ca’) and the remaining sites (‘others’). The numbers of correct and incorrect predictions made by MetalNet and MIB are shown on the right in Venn diagram format, with the ‘prospective’ metalloprotein dataset (‘PDB’) as the reference (the dataset can be found in Supplementary Dataset 3).
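For readers less familiar with the three metrics reported in this figure, the following minimal sketch computes them from a set of predictions and a reference set. How a prediction is counted as correct in the paper (per site or per residue) is not stated here, so the sketch simply compares hypothetical identifiers.

```python
def precision_recall_f1(predicted, reference):
    """Compute precision, recall and F1-score for two collections of items.

    An item counts as a true positive when it appears in both collections.
    """
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical identifiers: four predictions, three reference entries.
predicted = {"P1", "P2", "P3", "P4"}
reference = {"P1", "P2", "P5"}
precision, recall, f1 = precision_recall_f1(predicted, reference)
```

With two of four predictions correct and two of three reference entries recovered, precision is 1/2, recall is 2/3 and the F1-score, their harmonic mean, is 4/7.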
Extended Data Fig. 7 Distribution of the minimal distances between the metal-binding residue pairs predicted by MetalNet.
The blue block shows the distribution of residue-residue pairwise distances for metal-binding residues from known metalloproteins in the PDB database. Black solid lines show the distribution of mean distances in GREMLIN models for the metal-binding coevolved sites predicted by MetalNet. The two distributions generally agree well up to 4.5 Å, suggesting that in a large portion of MetalNet predictions the metal-binding residues are indeed in proximity to each other. Since GREMLIN does not take metal binding into consideration during structural modeling, the disagreement beyond 4.5 Å may come from either false-positive predictions by MetalNet or errors in structural modeling by GREMLIN.
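A minimal sketch of a residue-residue minimal distance is given below, assuming it is defined as the shortest distance between any atom of one residue and any atom of the other. That definition, the coordinates and the use of 4.5 Å as a proximity check are assumptions made for illustration.

```python
import math


def min_distance(atoms_a, atoms_b):
    """Return the shortest distance between any atom in A and any atom in B.

    Each argument is a list of (x, y, z) coordinates in angstroms.
    """
    return min(math.dist(a, b) for a in atoms_a for b in atoms_b)


# Hypothetical coordinates for the atoms of two residues.
residue_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
residue_b = [(4.0, 0.0, 0.0), (1.0, 3.0, 0.0)]

distance = min_distance(residue_a, residue_b)
in_proximity = distance <= 4.5  # 4.5 angstroms, the value quoted in the caption
```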
Extended Data Fig. 8 Validation of metal binding in purified Desor_0198 (a), SVEN_5263 (b), CbrC (c) and CitX (d) by ICP-MS.
a,b, The coevolved metal-binding cluster predicted by MetalNet is shown on the left, SDS-PAGE of the purified wild-type protein in the middle and ICP-MS measurement of metal binding by the purified protein on the right. Both proteins were confirmed to bind zinc. c, The coevolved metal-binding cluster predicted by MetalNet is shown on the left, SDS-PAGE of the purified wild-type protein, two single mutants (C56S or C182S) and the double mutant (C56S&C182S) in the middle and ICP-MS analysis of zinc binding in the purified wild-type cbrC and mutants on the right. Partial zinc binding was retained in each of the single mutants, whereas metal-binding activity was completely abolished in the double mutant. In a, b and c, error bars represent the mean and s.d. (n = 3 biologically independent samples). ICP-MS analysis of Fe and Cu binding was performed only once. d, SDS-PAGE of the purified wild-type citX and of four single mutants of the metal-binding residues predicted by MetalNet (C145S, C148S, C155S and H161S). ICP-MS analysis of citX is shown in Fig. 4d. The experiment was repeated twice independently with similar results.
Extended Data Fig. 9 Molecular dynamics (MD) simulations of citX and its mutants.
MD simulations were performed using GROMACS to calculate the root mean square fluctuation (RMSF) for citX (PDB ID: 7DCN) and its mutants (C145S, C148S, C155S and H161S, at the metal-binding residues predicted by MetalNet). The results suggest that all mutants show greater conformational fluctuations at the binding site.
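The RMSF reported here is, for each atom, the square root of the time-averaged squared deviation from that atom's mean position. The sketch below shows the calculation on a toy trajectory. It assumes the frames have already been superposed on a common reference and does not reproduce the GROMACS analysis used in the paper.

```python
import math


def rmsf(trajectory):
    """Return the root mean square fluctuation of each atom.

    `trajectory` is a list of frames, and each frame is a list of (x, y, z)
    coordinates with one entry per atom. Frames are assumed to be aligned.
    """
    n_frames = len(trajectory)
    n_atoms = len(trajectory[0])
    values = []
    for i in range(n_atoms):
        # Mean position of atom i over all frames.
        mean = [sum(frame[i][k] for frame in trajectory) / n_frames for k in range(3)]
        # Mean squared deviation from that mean position.
        msd = sum(math.dist(frame[i], mean) ** 2 for frame in trajectory) / n_frames
        values.append(math.sqrt(msd))
    return values


# Toy trajectory: atom 0 is static, atom 1 oscillates along x around x = 2.
trajectory = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
]
fluctuations = rmsf(trajectory)
```

The static atom has an RMSF of 0 and the oscillating atom an RMSF of 1 (in the units of the input coordinates). A larger per-residue value corresponds to the greater conformational fluctuation described for the mutants.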
Extended Data Fig. 10 Prediction of metal-binding sites in the human spliceosome.
a, Scheme for predicting metal-binding sites in the human spliceosome using coevolution obtained by deepMSA and MSA transformer. b, Zinc-binding sites in the human spliceosome predicted by MetalNet that match well with the experimental structures in PDB. The experimental structures of the corresponding metalloprotein subunits are shown in cartoon representation on the left and the predicted coevolution networks corresponding to the metal-binding sites are shown on the right. Metal-chelating residues are shown as sticks and zinc ions as spheres.
Supplementary information
Supplementary Information
Supplementary Tables 1–5 and Supplementary Fig. 1.
Supplementary Data 1
The list of co-evolved pairs used for training and evaluating the ML model in this study.
Supplementary Data 2
The adjacency list of ‘motifs’ in the co-evolution motif bank.
Supplementary Data 3
The ‘prospective’ metalloprotein dataset that was deposited in PDB after August 2016. It was used for comparing the performance of MetalNet and MIB in an unbiased manner.
Supplementary Data 4
The predicted metal-binding sites by MetalNet in the prokaryotic species dataset.
Supplementary Data 5
The predicted metal-binding sites in the metagenome-Pfam dataset.
Supplementary Data 6
The predicted metal-binding sites in the human spliceosome sequences dataset.
Source data
Source Data Fig. 1
Statistical source data.
Source Data Fig. 3
Statistical source data.
Source Data Fig. 4
Unprocessed gels and statistical source data.
Source Data Extended Data Fig. 1
Statistical source data.
Source Data Extended Data Fig. 3
Statistical source data.
Source Data Extended Data Fig. 4
Statistical source data.
Source Data Extended Data Fig. 6
Statistical source data.
Source Data Extended Data Fig. 7
Statistical source data.
Source Data Extended Data Fig. 8
Unprocessed gels and ICP-MS data.
Source Data Extended Data Fig. 9
MD data.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Cheng, Y., Wang, H., Xu, H. et al. Co-evolution-based prediction of metal-binding sites in proteomes by machine learning. Nat Chem Biol (2023). https://doi.org/10.1038/s41589-022-01223-z
DOI: https://doi.org/10.1038/s41589-022-01223-z