Electron density-based GPT for optimization and suggestion of host–guest binders

Here we present a machine learning model trained on electron density for the production of host–guest binders. These are read out as simplified molecular-input line-entry system (SMILES) format with >98% accuracy, enabling a complete characterization of the molecules in two dimensions. Our model generates three-dimensional representations of the electron density and electrostatic potentials of host–guest systems using a variational autoencoder, and then utilizes these representations to optimize the generation of guests via gradient descent. Finally, the guests are converted to SMILES using a transformer. The successful practical application of our model to established molecular host systems, cucurbit[n]uril and metal–organic cages, resulted in the discovery of 9 previously validated guests for CB[6] and 7 unreported guests (with association constant Ka ranging from 13.5 M−1 to 5,470 M−1) and the discovery of 4 unreported guests for [Pd2L4]4+ (with Ka ranging from 44 M−1 to 529 M−1).
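To put the association constants quoted above on an energy scale, the standard relation ΔG° = −RT ln Ka can be applied. The sketch below is illustrative only: the temperature of 298.15 K and the 1 M standard state are assumptions (the text above does not state the measurement conditions), and the function name is hypothetical.

```python
import math

R_GAS = 8.314462618  # molar gas constant in J mol^-1 K^-1


def binding_free_energy_kj_per_mol(ka_per_molar, temperature_k=298.15):
    """Standard binding free energy from an association constant.

    Uses dG = -R * T * ln(Ka), with Ka in M^-1 relative to a 1 M standard
    state. The default temperature is an assumption, not a value taken
    from the manuscript.
    """
    if ka_per_molar <= 0:
        raise ValueError("Ka must be positive")
    return -R_GAS * temperature_k * math.log(ka_per_molar) / 1000.0


if __name__ == "__main__":
    # Ka limits quoted in the abstract for CB[6] and the Pd cage guests
    for ka in (13.5, 5470.0, 44.0, 529.0):
        dg = binding_free_energy_kj_per_mol(ka)
        print(f"Ka = {ka:8.1f} M^-1  ->  dG = {dg:6.2f} kJ/mol")
```

A more negative value corresponds to stronger binding, so the upper Ka limit for CB[6] maps to the most favourable free energy of the four.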

justifications and discussion, as noted by reviewers.
-Please be sure that the language and discussions used throughout the manuscript are accessible to a broad audience. You will also need to make some editorial changes so that it complies with our Guide to Authors at https://www.nature.com/natcomputsci/for-authors.
In particular, I would like to highlight the following points of our style: Nature Computational Science titles should give a sense of the main new findings of a manuscript, and should not contain punctuation. Please keep in mind that we strongly discourage active verbs in titles, and that they should ideally fit within 150 characters each (including spaces).
To improve the accessibility of your paper to readers from other research areas, please pay particular attention to the wording of the paper's abstract, which serves both as an introduction and as a brief, non-technical summary in about 150 words. It should include the background and context of the work, 'Here we show' or an equivalent phrase, and then the major results and conclusions of the paper. Because researchers from other sub-disciplines will be interested in your results and their implications, it is important to explain essential but specialised terms concisely. We suggest you show your summary paragraph to colleagues in other fields to uncover any problematic concepts.
We encourage you to archive the data reported in your manuscript in an accessible, persistent repository. If your data are archived prior to the acceptance of your manuscript, please provide us with the full citation as soon as you receive it so that a link to the data can be included in the publication. See http://www.nature.com/authors/policies/availability.html for more information.
If your paper is accepted for publication, we will edit your display items electronically so they conform to our house style and will reproduce clearly in print. If necessary, we will re-size figures to fit single or double column width. If your figures contain several parts, the parts should form a neat rectangle when assembled. Choosing the right electronic format at this stage will speed up the processing of your paper and give the best possible results in print. If you are in doubt about the correct format for your figures after reading our guidelines, please ask the art editors for advice at computationalscience@nature.com.
Please use the following link to submit your revised manuscript and a point-by-point response to the referees' comments (which should be in a separate document to any cover letter): [REDACTED] ** This url links to your confidential homepage and associated information about manuscripts you may have submitted or be reviewing for us. If you wish to forward this e-mail to co-authors, please delete this link to your homepage first. ** To aid in the review process, we would appreciate it if you could also provide a copy of your manuscript files that indicates your revisions by making use of Track Changes or similar mark-up tools. Please also ensure that all correspondence is marked with your Nature Computational Science reference number in the subject line.
In addition, please make sure to upload a Word Document or LaTeX version of your text, to assist us in the editorial stage.
To improve transparency in authorship, we request that all authors identified as 'corresponding author' on published papers create and link their Open Researcher and Contributor Identifier (ORCID) with their account on the Manuscript Tracking System (MTS), prior to acceptance. ORCID helps the scientific community achieve unambiguous attribution of all scholarly contributions. You can create and link your ORCID from the home page of the MTS by clicking on 'Modify my Springer Nature account'. For more information please visit www.springernature.com/orcid. We hope to receive your revised paper within three weeks.

In this article, an electron density based GPT model has been developed and host guest optimizations with gradient descent techniques have been performed to discover novel CB[6] and Pd2L4 host guest pairs. The computationally favourable guest binding predictions were validated with NMR titration experiments. Overall, I think the article is a difficult read, and the authors should work to make it significantly easier to follow for non-experts.
-In the abstract, it is stated "…host guest binders which are transformed into the SMILES format with >98% accuracy" and in page 4 "… Interestingly, this approach only generated chemically plausible molecules…" and in page 9 "…Due to the degeneracy of the SMILES representations generated by our methodology, it was inevitable that duplicate molecules would be obtained…". The audience is consistently referred to the Supporting Information, yet it is cumbersome to match the critical arguments with the corresponding supporting data since the exact SI numberings are omitted. It should be made clear how each of these mentions of SMILES string data comparison corresponds to the state of the data in the analysis steps.
-How does the total model performance change when different chemical environment descriptors such as SELFIES or DeepSMILES are used for generating the molecules? Would the Variational Autoencoder as illustrated in Figure 2B be as powerful with a different molecular representation as it is claimed with SMILES strings?
-What is the chemical diversity of the QM9 database and how does direct insertion of this database's molecules into the target hosts compare with the candidate molecules? Larger heavy atom count libraries such as QM7 database seem to be more suitable for the Pd2L4 host as there is available free volume within the host.
-It is unclear from the main text and the SI which xTB version and which parameters are selected for electrostatic calculations. xTB methods may sometimes give erroneous calculation results if the initial geometry of the system is not valid. How do the xTB method electron density files for the most promising guest molecules (that were generated) compare with electron densities from pairing DFT calculations?
-How transferable is just placing the guest molecule at the centre of the host? Would inclusion of angular rotation parameters enhance and direct the results and let the ESP maps of host and guest match better?
-Why isn't slight overlap of electron densities considered? The candidate guest molecules can have differing degrees of flexibility as illustrated in the number of double, triple bonds section of the SI, as well as the host molecule due to thermal vibrations. A single molecule per host is assumed so this relaxation of the initial placement can potentially add an ensemble of candidates with one more heavier atom.
-In Figure 8, the association constants are weaker than for those guests that are previously known (on the left hand side). Could the authors do a general comparison of this point?
-In the abstract, the "discover" should be "discovery" in the last sentence.
I am reassured to see the authors will publish their model on GitHub; it is important that this is indeed done for any accepted version.
Reviewer #2 (Remarks to the Author):
1. Given there are a lot of approaches that are focused on target-aware drug design based on electron densities [1-4], I think it is necessary to add some comparisons with them, for example, TargetDiff [1] and DecompDiff [4].
2. The paper's current approach only presents a limited view of its results, as it solely showcases a few desirable guests selected by an expert chemist. This approach lacks objectivity and is insufficient to assess the overall performance of the generated molecules before expert intervention. To address this, the paper should consider adopting an evaluation step similar to the one used in the DecompDiff paper, which provides a more comprehensive assessment of the generated results. Furthermore, I suggest conducting an ablation study on the major components of the pipeline. For instance, the paper could demonstrate the performance of generated guests without utilizing the optimization pipeline. This addition would be valuable for readers, as it would clearly highlight the individual contributions of each novel idea in the molecule generation process.
Reviewer #2 (Remarks on code availability): The authors have generously provided the source code, a list of dependent packages, and the trained model for their paper. I haven't had the opportunity to install and run it yet.
Reviewer #3 (Remarks to the Author): Review on the article titled: "Electron Density-Based GPT for Optimization and Suggestion of Host-Guest Binders". This manuscript presents a new approach towards generative modeling of molecules using host-guest systems as an example. In this method, the host molecules are represented by their electron density decorated with electrostatic potential, and new guest molecules are generated with improved host-guest interaction by maximizing the inter-molecular interactions and minimizing the overlap with the host. The overall workflow is two-tiered: first, virtual libraries of potential guest molecules are generated; in this step, electron density volumetric representations of new molecules are generated and optimized by minimizing the overlap and maximizing the electrostatic interaction. Second, potential guest molecules are selected for in vitro testing of binding affinity.
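The optimization summarized above (penalize density overlap with the host, reward complementary electrostatics) can be illustrated with a toy objective. This is a minimal sketch under stated assumptions, not the authors' implementation: the grids are one-dimensional lists, guest density and guest potential are treated as independent variables, the bilinear loss and its weights are hypothetical, and plain projected gradient descent stands in for whatever optimizer and latent space the manuscript uses.

```python
def loss(guest_density, guest_esp, host_density, host_esp,
         w_overlap=1.0, w_esp=1.0):
    """Toy objective: overlap penalty plus electrostatic product term.

    The electrostatic term is negative (favourable) where guest and host
    potentials have opposite signs. Both terms are illustrative assumptions.
    """
    overlap = sum(g * h for g, h in zip(guest_density, host_density))
    electro = sum(e * v for e, v in zip(guest_esp, host_esp))
    return w_overlap * overlap + w_esp * electro


def clamp(value, low, high):
    return max(low, min(high, value))


def optimise_guest(host_density, host_esp, steps=50, lr=0.1):
    """Projected gradient descent on the toy loss with unit weights."""
    n = len(host_density)
    guest_density = [0.5] * n   # arbitrary starting guess
    guest_esp = [0.0] * n
    history = []
    for _ in range(steps):
        history.append(loss(guest_density, guest_esp, host_density, host_esp))
        # Analytic gradients of the bilinear loss: dL/dg = h and dL/de = v.
        # After each step, project back onto the allowed ranges.
        guest_density = [clamp(g - lr * h, 0.0, 1.0)
                         for g, h in zip(guest_density, host_density)]
        guest_esp = [clamp(e - lr * v, -1.0, 1.0)
                     for e, v in zip(guest_esp, host_esp)]
    history.append(loss(guest_density, guest_esp, host_density, host_esp))
    return guest_density, guest_esp, history


# Hypothetical host: dense walls at both ends, empty cavity in the middle,
# negative electrostatic potential throughout.
HOST_DENSITY = [1.0, 0.8, 0.0, 0.0, 0.8, 1.0]
HOST_ESP = [-0.5, -0.3, -0.1, -0.1, -0.3, -0.5]

if __name__ == "__main__":
    density, esp, history = optimise_guest(HOST_DENSITY, HOST_ESP)
    print("initial loss:", history[0], "final loss:", history[-1])
    print("guest density:", density)
```

In this toy setting the guest density is driven to zero wherever the host has density and is left untouched inside the cavity, while the guest potential moves towards the sign opposite to the host potential.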
The language in the manuscript is lucid and easy to read. The authors discussed their findings thoroughly; however, here are a few suggestions/comments that may help in improving the readability of the manuscript: In Page 8: While explaining "Translating electron densities into SMILES", the authors mentioned 3D to 4D expansion of the tensor data in this line "To do so, the input 3D data first had to be transformed and expanded into 4D - enabling 3D convolutions - before this 4D data was transformed into 2D." While the steps are clearly explained, the reason behind this 4D expansion needs a little more explanation. It would be very useful if the authors explain in detail the reason behind expanding the electron density tensors to add another dimension.
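For readers following the shape changes quoted above, one common reason for such an expansion is that 3D convolution layers in common deep-learning frameworks expect an explicit channel axis in addition to the three spatial axes. Whether this is the authors' rationale is an assumption; the sketch below only illustrates the 3D to 4D to 2D shape bookkeeping with plain nested lists, and the grid size and the voxel-by-channel 2D layout are hypothetical.

```python
def shape(nested):
    """Return the shape of a rectangular nested list as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


# Hypothetical 3D electron-density grid: depth x height x width
DEPTH, HEIGHT, WIDTH = 2, 3, 4
volume = [[[0.1 * (z + y + x) for x in range(WIDTH)]
           for y in range(HEIGHT)]
          for z in range(DEPTH)]

# 3D -> 4D: add a leading channel axis with a single channel, giving the
# (channels, depth, height, width) layout that 3D convolutions operate on.
with_channel = [volume]

# 4D -> 2D: one row per voxel and one column per channel, i.e. a sequence
# of feature vectors (an assumed layout for a sequence model).
n_channels = len(with_channel)
tokens = [
    [with_channel[c][z][y][x] for c in range(n_channels)]
    for z in range(DEPTH)
    for y in range(HEIGHT)
    for x in range(WIDTH)
]

if __name__ == "__main__":
    print(shape(volume), shape(with_channel), shape(tokens))
```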
In the section "Quantitative study of the host-guest recognition": the authors performed in vitro binding affinity tests on two different host-guest systems. The authors described the affinities of the new guest molecules in detail, but it would help if they could provide the rationale behind selecting 9 guests for cucurbituril and 4 for the metal-organic cage host from the generative workflow, which is not mentioned in the current version of the manuscript.
In Page 14, 15: the authors presented the selected guests for two host-guest systems in Fig. 7 and 8 respectively. For a better visualization and presentation of the binding affinities, one suggestion would be to convert these figures into tables for known and new guests.

Minor corrections for typos, if applicable:
In the Abstract, the last line, "discovery" instead of "discover"
Page 7: "fitting the closest these volumes" to "fitting the closest to these volumes"

Thank you for submitting your revised manuscript "Electron Density-Based GPT for Optimization and Suggestion of Host-Guest Binders" (NATCOMPUTSCI-23-0741A). It has now been seen by the original referees and their comments are below. The reviewers find that the paper has improved in revision, and therefore we'll be happy in principle to publish it in Nature Computational Science, pending minor revisions to satisfy the referees' final requests and to comply with our editorial and formatting guidelines.
We are now performing detailed checks on your paper and will send you a checklist detailing our editorial and formatting requirements in about a week. Please do not upload the final materials and make any revisions until you receive this additional information from us.
TRANSPARENT PEER REVIEW Nature Computational Science offers a transparent peer review option for original research manuscripts. We encourage increased transparency in peer review by publishing the reviewer comments, author rebuttal letters and editorial decision letters if the authors agree. Such peer review material is made available as a supplementary peer review file. Please remember to choose, using the manuscript system, whether or not you want to participate in transparent peer review. Please note: we allow redactions to authors' rebuttal and reviewer comments in the interest of confidentiality. If you are concerned about the release of confidential data, please let us know specifically what information you would like to have removed. Please note that we cannot incorporate redactions for any other reasons. Reviewer names will be published in the peer review files if the reviewer signed the comments to authors, or if reviewers explicitly agree to release their name. For more information, please refer to our FAQ page (https://www.nature.com/documents/nrtransparent-peer-review.pdf).
Thank you again for your interest in Nature Computational Science. Please do not hesitate to contact me if you have any questions.

Sincerely, Kaitlin McCardle, PhD
Senior Editor, Nature Computational Science

ORCID IMPORTANT: Non-corresponding authors do not have to link their ORCIDs but are encouraged to do so. Please note that it will not be possible to add/modify ORCIDs at proof. Thus, please let your co-authors know that if they wish to have their ORCID added to the paper they must follow the procedure described in the following link prior to acceptance: https://www.springernature.com/gp/researchers/orcid/orcid-fornature-research

Reviewer #1 (Remarks to the Author): The manuscript can now be accepted in my opinion. There are a few small typos to be corrected at proof stage.
Reviewer #1 (Remarks on code availability): It is difficult to find the rotation element within the code repository. The rotation element should live inside bin/optimisers/host_guest_overlapping.py as shown in SI 2.2.1. Yet one has to dive into src/utils/optimiser_utils.py to find elements of rotation within the code, and it is not well documented how to act upon this element. (No user comments) Also, the data paths are not properly formatted to target correct user-independent subdirectories (such as "DATA_FOLDER = '/home/juanma/Data/' # in maddog2020" in line 42 of host_guest_overlapping.py), and comments should be refined to help users with functionalities. Distribution of the code also through either a Docker container or a Google Colab binder document is potentially better, as these approaches will not force the community to download 250 GB+ raw QM9 data.
Reviewer #2 (Remarks to the Author): Thanks for the explanation and revision!
1. Sorry for the confusion. I didn't believe your method was quite similar to those approaches. I just thought these two tasks were similar. It is because when considering computational model or framework design, it appears that designing drugs based on pockets and designing guest molecules based on the host are quite similar. Therefore, I initially believed that it would be beneficial to draw a comparison between methods under these two tasks. However, I acknowledge your assertion that they are indeed distinct tasks within the realm of chemistry, so it is not necessary to compare them.

I appreciate your reminder about the supplementary information and apologize for not considering it in the initial review. Upon reviewing the results provided in the supplementary information, I find the evaluation to be more comprehensive.
Nonetheless, I still believe that displaying the overall performance prior to expert intervention would be very valuable, as it represents an unbiased way to directly showcase your method's effectiveness. However, I understand that if your algorithm is the first in silico design method in the field of guest molecule design, it may be acceptable to omit these preliminary results.
I recommend incorporating a concise summary of the results from the supplementary information into the main text. I believe that would be beneficial for the reader.
Reviewer #2 (Remarks on code availability): Yes, I successfully installed their package and their pretrained model could be successfully loaded.
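Regarding Reviewer #1's remark on the hard-coded DATA_FOLDER path, one conventional way to make such a path user-independent is to read it from an environment variable and fall back to a directory under the user's home folder. The sketch below is a suggestion only and is not taken from the authors' repository: the environment variable name HOST_GUEST_DATA, the default subdirectory name and the helper function are all hypothetical.

```python
import os
from pathlib import Path


def resolve_data_folder(env_var="HOST_GUEST_DATA",
                        default_subdir="host_guest_data"):
    """Return the data directory without hard-coding a user's home path.

    Both the environment variable name and the default subdirectory are
    hypothetical placeholders chosen for this illustration.
    """
    override = os.environ.get(env_var)
    if override:
        # An explicit user setting wins; expand a leading "~" if present.
        return Path(override).expanduser()
    # Otherwise fall back to a folder under the current user's home.
    return Path.home() / default_subdir


# Replaces a literal such as '/home/<user>/Data/' in a script
DATA_FOLDER = resolve_data_folder()

if __name__ == "__main__":
    print("Data will be read from:", DATA_FOLDER)
```

A user would then set the variable once in their shell, or accept the default location, instead of editing the script.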
Reviewer #3 (Remarks to the Author): I would like to thank the authors for carefully reading the comments and suggestions and making the necessary changes or adding the required information to the manuscript.

Message :
Dear Professor Cronin, We are pleased to inform you that your Article "Electron Density-Based GPT for Optimization and Suggestion of Host-Guest Binders" has now been accepted for publication in Nature Computational Science.
Once your manuscript is typeset, you will receive an email with a link to choose the appropriate publishing options for your paper and our Author Services team will be in touch regarding any additional information that may be required.
Please note that Nature Computational Science is a Transformative Journal (TJ).
Authors may publish their research with us through the traditional subscription access route or make their paper immediately open access through payment of an article-processing charge (APC). Authors will not be required to make a final decision about access to their article until it has been accepted. If you have any questions about our publishing options, costs, Open Access requirements, or our legal forms, please contact ASJournals@springernature.com.
Acceptance of your manuscript is conditional on all authors' agreement with our publication policies (see https://www.nature.com/natcomputsci/for-authors). In particular your manuscript must not be published elsewhere and there must be no announcement of the work to any media outlet until the publication date (the day on which it is uploaded onto our web site).
Before your manuscript is typeset, we will edit the text to ensure it is intelligible to our wide readership and conforms to house style.We look particularly carefully at the titles of all papers to ensure that they are relatively brief and understandable.
Once your manuscript is typeset, you will receive a link to your electronic proof via email with a request to make any corrections within 48 hours.If, when you receive your proof, you cannot meet this deadline, please inform us at rjsproduction@springernature.com immediately.
If you have queries at any point during the production process then please contact the production team at rjsproduction@springernature.com.
You may wish to make your media relations office aware of your accepted publication, in case they consider it appropriate to organize some internal or external publicity. Once your paper has been scheduled you will receive an email confirming the publication details. This is normally 3-4 working days in advance of publication. If you need additional notice of the date and time of publication, please let the production team know when you receive the proof of your article to ensure there is sufficient time to coordinate. Further information on our embargo policies can be found here: https://www.nature.com/authors/policies/embargo.html
An online order form for reprints of your paper is available at https://www.nature.com/reprints/author-reprints.html. All co-authors, authors' institutions and authors' funding agencies can order reprints using the form appropriate to their geographical region.
We welcome the submission of potential cover material (including a short caption of around 40 words) related to your manuscript; suggestions should be sent to Nature Computational Science as electronic files (the image should be 300 dpi at 210 x 297 mm in either TIFF or JPEG format). We also welcome suggestions for the Hero Image, which appears at the top of our homepage (http://www.nature.com/natcomputsci); these should be 72 dpi at 1400 x 400 pixels in JPEG format. Please note that such pictures should be selected more for their aesthetic appeal than for their scientific content, and that colour images work better than black and white or grayscale images.
Please do not try to design a cover with the Nature Computational Science logo etc., and please do not submit composites of images related to your work. I am sure you will understand that we cannot make any promise as to whether any of your suggestions might be selected for the cover of the journal.
You can now use a single sign-on for all your accounts, view the status of all your manuscript submissions and reviews, access usage statistics for your published articles and download a record of your refereeing activity for the Nature journals.
To assist our authors in disseminating their research to the broader community, our SharedIt initiative provides you with a unique shareable link that will allow anyone (with or without a subscription) to read the published article.Recipients of the link with a subscription will also be able to download and print the PDF.
As soon as your article is published, you will receive an automated email with your shareable link.
We look forward to publishing your paper.

Figure
Figure legends must provide a brief description of the figure and the symbols used, including definitions of any error bars employed in the figures.
Find out more about Transformative Journals at https://www.springernature.com/gp/open-research/transformative-journals. If your research is supported by a funder that requires immediate open access (e.g. according to Plan S principles, https://www.springernature.com/gp/open-research/plan-scompliance) then you should select the gold OA route, and we will direct you to the compliant route where possible. For authors selecting the subscription publication route, the journal's standard licensing terms will need to be accepted, including self-archiving policies (https://www.springernature.com/gp/open-research/policies/journalpolicies). Those licensing terms will supersede any other terms that the author or any third party may assert apply to any version of the manuscript.