Structure determination of the HgcAB complex using metagenome sequence data: insights into microbial mercury methylation

Bacteria and archaea possessing the hgcAB gene pair methylate inorganic mercury (Hg) to form highly toxic methylmercury. HgcA consists of a corrinoid binding domain and a transmembrane domain, and HgcB is a dicluster ferredoxin. However, their detailed structure and function have not been thoroughly characterized. We modeled the HgcAB complex by combining metagenome sequence data mining, coevolution analysis, and Rosetta structure calculations. In addition, we overexpressed HgcA and HgcB in Escherichia coli, confirmed spectroscopically that they bind cobalamin and [4Fe-4S] clusters, respectively, and incorporated these cofactors into the structural model. Surprisingly, the two domains of HgcA do not interact with each other, but HgcB forms extensive contacts with both domains. The model suggests that conserved cysteines in HgcB are involved in shuttling Hg(II), methylmercury, or both. These findings refine our understanding of the mechanism of Hg methylation and expand the known repertoire of corrinoid methyltransferases in nature.

This is a remarkably thorough manuscript. I am no structural biologist and I cannot comment on the experimental part. Instead, I found the bioinformatics work extremely elegant, comprehensive and overall robust and convincing.
I have just two major comments on the coevolutionary part: 1) They used GREMLIN for contact prediction. How would it compare with other methods, such as direct Boltzmann learning? The proteins are small enough that it should be doable (although, given the present dire times, I am not necessarily insisting that the authors perform it if they can't access suitable computational infrastructure to do it, or find the time while confined at home). 2) How many coevolutionary predictions do they include in the Rosetta protocol? Are they weighted? I think that providing at least some hints of that in the Methods section could give the reader a feeling that it is not all a "black box".

Reviewer 1
In the manuscript titled "Structure Determination of the HgcAB Complex Using Metagenome Sequence Data: Insights into Microbial Mercury Methylation" by Connor J. Cooper, Kaiyuan Zheng, Katherine W. Rush, Alexander Johs, Brian C. Sanders, Georgios A. Pavlopoulos, Nikos C. Kyrpides, Mircea Podar, Sergey Ovchinnikov, Stephen W. Ragsdale, and Jerry M. Parks

Response:
Our strategy was to apply the thoroughly validated protocol developed by members of the Baker lab, one of whom is a coauthor of our paper, to the HgcAB system (Ovchinnikov, Science, 2017, 355, 294-298). This approach relies on contact predictions from GREMLIN and does not use neural networks, which we assume is what was meant by direct Boltzmann learning. Instead, our approach relies on Monte Carlo sampling with the Rosetta energy function to correct any inaccurate contact predictions from GREMLIN. However, the field of protein structure prediction is evolving rapidly and advanced inter-residue contact (and interatomic distance) prediction algorithms based on convolutional neural networks have recently become the state of the art. Most currently used neural network-based approaches use GREMLIN (or its parallel implementation CCMpred) as an "ingredient" to their networks. Based on the Reviewer's suggestion, we compared the contact map predicted by GREMLIN to the deep dilated residual network-based contact prediction server Raptor-X_contact (Xu J., PNAS, 2019, 116, 16856-16865; Wang S, Sun S, Li Z, Zhang R and Xu J. PLoS Comput Biol, 2017, 13, e1005324), which has been shown to be among the most accurate contact predictors available. Comparison of the contact maps from each server indicates that the two give similar results (see figure below). Thus, we would not expect major changes to the model if Raptor-X_contact or another accurate contact predictor were used. However, in future work we do intend to incorporate contacts and distances derived from deep learning into our modeling protocol.
We have revised the text (Methods, MSA generation and coevolution analysis, page 13) as follows: We also compared the contact map predicted by GREMLIN to the deep dilated residual network-based contact prediction server Raptor-X_contact (Xu J., PNAS, 2019, 116, 16856-16865; Wang S, Sun S, Li Z, Zhang R and Xu J. PLoS Comput Biol, 2017, 13, e1005324), which has been shown to be among the most accurate contact predictors available. Comparison of the contact maps from each server indicates that the two give similar results (Figure S#). For consistency with previous work, we used the GREMLIN contacts here.
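For readers who want to make a comparison of this kind quantitative, the sketch below shows one possible way to score the agreement between two predicted contact maps. It is not the procedure used in the manuscript, which reports only a visual comparison. The function names, the choice of the Jaccard index, the top-N cutoff, and the minimum sequence separation are all assumptions made for illustration, and the inputs are assumed to be symmetric L x L score matrices.

```python
import numpy as np

def top_contacts(scores, n_top, min_sep=3):
    """Return the n_top highest-scoring residue pairs (i, j) with j - i >= min_sep.

    `scores` is assumed to be a symmetric L x L matrix of contact scores;
    only the upper triangle is read.
    """
    length = scores.shape[0]
    i_idx, j_idx = np.triu_indices(length, k=min_sep)
    order = np.argsort(scores[i_idx, j_idx])[::-1][:n_top]
    return {(int(i_idx[k]), int(j_idx[k])) for k in order}

def contact_overlap(scores_a, scores_b, n_top, min_sep=3):
    """Jaccard index between the top-ranked contacts of two predictors."""
    a = top_contacts(scores_a, n_top, min_sep)
    b = top_contacts(scores_b, n_top, min_sep)
    return len(a & b) / len(a | b)
```

A value near 1 would indicate that the two predictors rank essentially the same residue pairs highest, and a value near 0 that they disagree. The cutoff `n_top` is a free choice here and would need to be justified for any real comparison.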
2) How many coevolutionary predictions do they include in the Rosetta protocol? Are they weighted? I think that providing at least some hints of that in the Methods section could give the reader a feeling that it is not all a "black box".

Response:
To address the Reviewer's questions, we have added the following text to the revised manuscript (Methods, MSA generation and coevolution analysis, page 13): A single GREMLIN calculation was performed on the paired multiple sequence alignment. The GREMLIN output provides predicted contacts that are ranked based on the strength of the coevolution signal between residue pairs. These raw contacts were then normalized and reweighted according to a previously described model that estimates the contact prediction accuracy from the normalized GREMLIN scores, the number of sequences in the MSA, and the length of the query sequence (Ovchinnikov, eLife, 2015).
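The normalize-then-reweight idea described above can be illustrated with a minimal sketch. It is a toy version only: the published accuracy model is fitted to data and is not reproduced here. The normalization (dividing by the mean of the top 1.5 x L raw scores) is assumed, and the logistic form, `slope`, `midpoint`, and the alignment-depth shift in `contact_weight` are hypothetical placeholders chosen only to show how a weight could depend on the normalized score, the number of sequences, and the query length.

```python
import numpy as np

def scale_scores(raw, seq_len):
    """Normalize raw coevolution scores by the mean of the top 1.5 * L scores.

    Assumed normalization for illustration; seq_len is the query length L.
    """
    raw = np.asarray(raw, dtype=float)
    n_top = min(len(raw), max(1, int(1.5 * seq_len)))
    top_mean = np.sort(raw)[::-1][:n_top].mean()
    return raw / top_mean

def contact_weight(scaled, n_seq, seq_len, slope=4.0, midpoint=1.0):
    """Map a normalized score to a weight between 0 and 1.

    Hypothetical logistic model: slope, midpoint, and the depth-dependent
    shift are placeholders, not the parameters of the published model.
    """
    depth = n_seq / seq_len            # sequences per residue in the MSA
    shift = 1.0 / np.sqrt(depth)       # shallower alignments are penalized
    return 1.0 / (1.0 + np.exp(-slope * (np.asarray(scaled) - midpoint - shift)))
```

In a restraint-driven protocol, weights of this kind would typically scale the strength of each contact restraint so that strongly supported pairs contribute more than weakly supported ones. How the weights enter the Rosetta calculations in this work is described in the cited references rather than in this sketch.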
I have no more comments, and once these have been addressed, I will clearly be in favour of publication, at least as far as the bioinformatics part is concerned.
We hope you find our revised manuscript suitable for publication in Communications Biology.