Abstract
The design of molecules and materials with tailored properties is challenging, as candidate molecules must satisfy multiple competing requirements that are often difficult to measure or compute. While molecular structures produced through generative deep learning will satisfy these patterns, they often only possess specific target properties by chance and not by design, which makes molecular discovery via this route inefficient. In this work, we predict molecules with (Pareto-)optimal properties by combining a generative deep learning model that predicts three-dimensional conformations of molecules with a supervised deep learning model that takes these as inputs and predicts their electronic structure. Optimization of (multiple) molecular properties is achieved by screening newly generated molecules for desirable electronic properties and reusing hit molecules to retrain the generative model with a bias. The approach is demonstrated to find optimal molecules for organic electronics applications. Our method is generally applicable and eliminates the need for quantum chemical calculations during predictions, making it suitable for high-throughput screening in materials and catalyst design.
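The generate, screen and retrain cycle described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `generate`, `predict` and `retrain` are hypothetical callables standing in for the generative model, the supervised property model and the biased retraining step. The selection rule is also an assumption: hits are taken to be the non-dominated (Pareto-optimal) candidates with every property minimized, whereas the published workflow may select hits differently.

```python
from typing import Callable, List, Sequence, Tuple

Molecule = str              # stand-in for a generated 3D structure (assumption)
Scores = Tuple[float, ...]  # predicted properties, all treated as "lower is better"


def pareto_front(points: Sequence[Scores]) -> List[int]:
    """Return the indices of points that no other point dominates."""
    front = []
    for i, p in enumerate(points):
        dominated = any(
            # q dominates p if it is no worse everywhere and better somewhere
            all(a <= b for a, b in zip(q, p)) and any(a < b for a, b in zip(q, p))
            for j, q in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


def biasing_loop(
    generate: Callable[[int], List[Molecule]],    # hypothetical generative model
    predict: Callable[[Molecule], Scores],        # hypothetical property model
    retrain: Callable[[List[Molecule]], None],    # hypothetical biased retraining
    n_loops: int,
    n_samples: int,
) -> List[Molecule]:
    """Repeatedly sample molecules, keep the hits and retrain on them."""
    hits: List[Molecule] = []
    for _ in range(n_loops):
        candidates = generate(n_samples)
        scores = [predict(m) for m in candidates]
        hits = [candidates[i] for i in pareto_front(scores)]
        retrain(hits)  # biases the next round of generation towards the hits
    return hits
```

Because every property comes from `predict`, no quantum chemical calculation appears inside the loop, which is the point the abstract makes about high-throughput use. A property that should be maximized can be passed with its sign flipped.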
Data availability
The OE62 dataset is available from ref. 24, and the OE62 + 340k G-SchNet molecule dataset is available on figshare at https://figshare.com/articles/dataset/G-SchNet_for_OE62/20146943 (ref. 64). Quantum chemistry calculations carried out in this study are available on NOMAD under DOI 10.17172/NOMAD/2022.07.02-1 (ref. 65). A supplementary data file showing the number of molecules predicted and used for training in each experiment and each loop is included as Supplementary Data 1.
Code availability
The modified G-SchNet version is available on GitHub (https://github.com/rhyan10/G-SchNetOE62) and tagged as version v0.1 (minted version under DOI 10.5281/zenodo.7430248; ref. 66). The GitHub repository includes scripts to analyze the data and carry out principal component analysis (PCA). SchNet + H is published in ref. 23 and available on http://www.github.com/schnarc (minted version under DOI 10.5281/zenodo.7424017; ref. 67). We include a tutorial for using SchNet + H and G-SchNet models for OE62 on figshare (https://figshare.com/articles/dataset/G-SchNet_for_OE62/20146943), including instructions for installation (ref. 64). Original tutorials for training and using G-SchNet and SchNet + H are available on GitHub with the original code of G-SchNet (https://github.com/atomistic-machine-learning/G-SchNet; ref. 3) and SchNarc (https://github.com/schnarc/SchNarc/tree/develop; ref. 68), respectively.
References
Gómez-Bombarelli, R. et al. Design of efficient molecular organic light-emitting diodes by a high-throughput virtual screening and experimental approach. Nat. Mater. 15, 1120–1127 (2016).
Bilodeau, C., Jin, W., Jaakkola, T., Barzilay, R. & Jensen, K. F. Generative models for molecular discovery: recent advances and challenges. WIREs Comput. Mol. Sci. 12, e1608 (2022).
Gebauer, N. W. A., Gastegger, M. & Schütt, K. T. Symmetry-adapted generation of 3D point sets for the targeted discovery of molecules. Adv. Neural Inf. Process. Syst. 32 (2019).
Tkatchenko, A. Machine learning for chemical discovery. Nat. Commun. 11, 4125 (2020).
Coley, C. W. Defining and exploring chemical spaces. Trends Chem. 3, 133–145 (2021).
Wu, T. C. et al. A materials acceleration platform for organic laser discovery. Adv. Mater. https://doi.org/10.1002/adma.202207070 (2022).
Gryn’ova, G., Lin, K.-H. & Corminboeuf, C. Read between the molecules: computational insights into organic semiconductors. J. Am. Chem. Soc. 140, 16370–16386 (2018).
Li, X.-H. et al. Narrow-bandgap materials for optoelectronics applications. Front. Phys. 17, 13304 (2022).
Xue, D. et al. Advances and challenges in deep generative models for de novo molecule generation. WIREs Comput. Mol. Sci. 9, e1395 (2019).
Meyers, J., Fabian, B. & Brown, N. De novo molecular design and generative models. Drug Discov. Today 26, 2707–2715 (2021).
Sanchez-Lengeling, B. & Aspuru-Guzik, A. Inverse molecular design using machine learning: generative models for matter engineering. Science 361, 360–365 (2018).
Gebauer, N. W. A., Gastegger, M., Hessmann, S. S. P., Müller, K.-R. & Schütt, K. T. Inverse design of 3D molecular structures with conditional generative neural networks. Nat. Commun. 13, 973 (2022).
Li, Y., Pei, J. & Lai, L. Structure-based de novo drug design using 3D deep generative models. Chem. Sci. 12, 13664–13675 (2021).
Elton, D. C., Boukouvalas, Z., Fuge, M. D. & Chung, P. W. Deep learning for molecular design—a review of the state of the art. Mol. Syst. Des. Eng. 4, 828–849 (2019).
Tan, X. et al. Automated design and optimization of multitarget schizophrenia drug candidates by deep learning. Eur. J. Med. Chem. 204, 112572 (2020).
Sumita, M., Yang, X., Ishihara, S., Tamura, R. & Tsuda, K. Hunting for organic molecules with artificial intelligence: molecules optimized for desired excitation energies. ACS Cent. Sci. 4, 1126–1133 (2018).
Bilodeau, C. et al. Generating molecules with optimized aqueous solubility using iterative graph translation. React. Chem. Eng. 7, 297–309 (2022).
Zhavoronkov, A. et al. Deep learning enables rapid identification of potent DDR1 kinase inhibitors. Nat. Biotechnol. 37, 1038–1040 (2019).
Simm, G. N. & Hernández-Lobato, J. M. A generative model for molecular distance geometry. In Proc. 37th International Conference on Machine Learning 8949–8958 (JMLR.org, 2020).
Xu, M., Luo, S., Bengio, Y., Peng, J. & Tang, J. Learning neural generative dynamics for molecular conformation generation. Preprint at https://arxiv.org/abs/2102.10240 (2021).
Axelrod, S. & Gómez-Bombarelli, R. GEOM, energy-annotated molecular conformations for property prediction and molecular generation. Sci. Data 9, 185 (2022).
Ganea, O. et al. GeoMol: torsional geometric generation of molecular 3D conformer ensembles. Adv. Neural Inf. Process. Syst. 34 (2021).
Westermayr, J. & Maurer, R. J. Physically inspired deep learning of molecular excitations and photoemission spectra. Chem. Sci. 12, 10755–10764 (2021).
Stuke, A. et al. Atomic structures and orbital energies of 61,489 crystal-forming organic molecules. Sci. Data 7, 58 (2020).
Golze, D., Dvorak, M. & Rinke, P. The GW compendium: a practical guide to theoretical photoemission spectroscopy. Front. Chem. 7, 377 (2019).
Coley, C. W., Rogers, L., Green, W. H. & Jensen, K. F. SCScore: synthetic complexity learned from a reaction corpus. J. Chem. Inf. Model. 58, 252–261 (2018).
Brown, N., Fiscato, M., Segler, M. H. S. & Vaucher, A. C. GuacaMol: benchmarking models for de novo molecular design. J. Chem. Inf. Model. 59, 1096–1108 (2019).
Ramakrishnan, R., Dral, P. O., Rupp, M. & von Lilienfeld, O. A. Big data meets quantum chemistry approximations: the Δ-machine learning approach. J. Chem. Theory Comput. 11, 2087–2096 (2015).
Lawson, A. J., Swienty-Busch, J., Géoui, T. & Evans, D. in The Future of the History of Chemical Information ACS Symposium Series Vol. 1164, 127–148 (American Chemical Society, 2014).
Joshi, R. P. et al. 3D-Scaffold: a deep learning framework to generate 3D coordinates of drug-like molecules with desired scaffolds. J. Phys. Chem. B 125, 12166–12176 (2021).
Bartók, A. P., Kondor, R. & Csányi, G. On representing chemical environments. Phys. Rev. B 87, 184115 (2013).
Zhang, T., Ramakrishnan, R. & Livny, M. BIRCH: a new data clustering algorithm and its applications. Data Min. Knowl. Discov. 1, 141–182 (1997).
Schubert, E., Sander, J., Ester, M., Kriegel, H. P. & Xu, X. DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Trans. Database Syst. 42, 19 (2017).
Liotta, D. & Monahan, R. Selenium in organic synthesis. Science 231, 356–361 (1986).
Wilbraham, L., Smajli, D., Heath-Apostolopoulos, I. & Zwijnenburg, M. A. Mapping the optoelectronic property space of small aromatic molecules. Commun. Chem. 3, 14 (2020).
Bajusz, D., Rácz, A. & Héberger, K. Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? J. Cheminform. 7, 20 (2015).
Bendikov, M., Wudl, F. & Perepichka, D. F. Tetrathiafulvalenes, oligoacenenes, and their buckminsterfullerene derivatives: the brick and mortar of organic electronics. Chem. Rev. 104, 4891–4946 (2004).
Hu, Y., Chaitanya, K., Yin, J. & Ju, X.-H. Theoretical investigation on the crystal structures and electron transfer properties of cyanated TTPO and their selenium analogs. J. Mater. Sci. 51, 6235–6248 (2016).
Ferri, N. et al. Hemilabile ligands as mechanosensitive electrode contacts for molecular electronics. Angew. Chem. Int. Ed. 58, 16583–16589 (2019).
Manzoor, F. et al. Theoretical calculations of the optical and electronic properties of dithienosilole- and dithiophene-based donor materials for organic solar cells. ChemistrySelect 3, 1593–1601 (2018).
Li, Y., Liu, J., Liu, D., Li, X. & Xu, Y. D–A–π–A based organic dyes for efficient DSSCs: a theoretical study on the role of π-spacer. Comput. Mater. Sci. 161, 163–176 (2019).
Kim, T. H. & Kim, K. S. Acridine derivative and organic electroluminescence device comprising the same. South Korea patent KR101120892B1 (2009).
Seifermann, S. & Choné, R. Organic molecules, in particular for use in optoelectronic devices. Europe patent EP3916072 (2018).
Sharma, V. K., Sohn, M. & McDonald, T. J. in Advances in Water Purification Techniques (ed. Ahuja, S.) 203–218 (Elsevier, 2019).
Fordyce, F. M. in Essentials of Medical Geology: Revised Edition (ed. Selinus, O.) 375–416 (Springer, 2013).
Landrum, G. RDKit: Open-Source Cheminformatics (2006); https://www.rdkit.org/
Blum, V. et al. Ab initio molecular simulations with numeric atom-centered orbitals. Comput. Phys. Commun. 180, 2175–2196 (2009).
Perdew, J. P., Burke, K. & Ernzerhof, M. Generalized gradient approximation made simple. Phys. Rev. Lett. 77, 3865–3868 (1996).
Tkatchenko, A. & Scheffler, M. Accurate molecular van der Waals interactions from ground-state electron density and free-atom reference data. Phys. Rev. Lett. 102, 073005 (2009).
Adamo, C. & Barone, V. Toward reliable density functional methods without adjustable parameters: the PBE0 model. J. Chem. Phys. 110, 6158–6170 (1999).
Perdew, J. P., Ernzerhof, M. & Burke, K. Rationale for mixing exact exchange with density functional approximations. J. Chem. Phys. 105, 9982–9985 (1996).
Ren, X. et al. Resolution-of-identity approach to Hartree–Fock, hybrid density functionals, RPA, MP2 and GW with numeric atom-centered orbital basis functions. New J. Phys. 14, 053020 (2012).
Weigend, F. & Ahlrichs, R. Balanced basis sets of split valence, triple zeta valence and quadruple zeta valence quality for H to Rn: design and assessment of accuracy. Phys. Chem. Chem. Phys. 7, 3297–3305 (2005).
van Setten, M. J. et al. GW100: benchmarking G0W0 for molecular systems. J. Chem. Theory Comput. 11, 5665–5687 (2015).
Ramakrishnan, R., Dral, P. O., Rupp, M. & von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. Sci. Data 1, 140022 (2014).
Ruddigkeit, L., van Deursen, R., Blum, L. C. & Reymond, J.-L. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. J. Chem. Inf. Model. 52, 2864–2875 (2012).
Schütt, K. T., Sauceda, H. E., Kindermans, P.-J., Tkatchenko, A. & Müller, K.-R. SchNet—a deep learning architecture for molecules and materials. J. Chem. Phys. 148, 241722 (2018).
Schütt, K. T. et al. SchNetPack: a deep learning toolbox for atomistic systems. J. Chem. Theory Comput. 15, 448–455 (2019).
Reining, L. The GW approximation: content, successes and limitations. WIREs Comput. Mol. Sci. 8, e1344 (2018).
Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36 (1988).
O’Boyle, N. M. et al. Open Babel: an open chemical toolbox. J. Cheminform. 3, 33 (2011).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Baldi, P. & Nasr, R. When is chemical similarity significant? The statistical distribution of chemical similarity scores and its extreme values. J. Chem. Inf. Model. 50, 1205–1222 (2010).
Westermayr, J., Barrett, R., Gilkes, J. & Maurer, R. J. G-SchNet for OE62. Figshare https://doi.org/10.6084/m9.figshare.20146943.v2 (2022).
Westermayr, J. & Maurer, R. J. Organic molecules from generative autoregressive models. NOMAD https://doi.org/10.17172/NOMAD/2022.07.02-1 (2022).
Westermayr, J. & Barrett, R. G-Schnet for OE62 dataset (v0.1). Zenodo https://doi.org/10.5281/zenodo.7430248 (2022).
Westermayr, J. SchNarc for SchNet + H. Zenodo https://doi.org/10.5281/zenodo.7424017 (2021).
Westermayr, J., Gastegger, M. & Marquetand, P. Combining SchNet and SHARC: the SchNarc machine learning approach for excited-state dynamics. J. Phys. Chem. Lett. 11, 3828–3834 (2020).
Acknowledgements
This work was funded by the Austrian Science Fund (FWF; J 4522-N) (J.W.), the EPSRC Centre for Doctoral Training in Modelling of Heterogeneous Systems (EP/S022848/1) (R.J.M.), the EPSRC-funded Network+ on Artificial and Augmented Intelligence for Automated Scientific Discovery (EP/S000356/10) (R.J.M.) and the UKRI Future Leaders Fellowship program (MR/S016023/1) (R.J.M.). Computational resources have been provided by the Scientific Computing Research Technology Platform of the University of Warwick, the EPSRC-funded Northern Ireland High Performance Computing service (EP/T022175/1) via access to Kelvin2, the EPSRC-funded HPC Midlands+ computing service (EP/P020232/1) via access to Athena and Sulis and the EPSRC-funded High End Computing Materials Chemistry Consortium (EP/R029431/1) for access to the ARCHER2 UK National Supercomputing Service (https://www.archer2.ac.uk). We thank N. Gebauer (TU Berlin) for fruitful discussions on the G-SchNet model. For the purpose of open access, we have applied a Creative Commons Attribution (CC BY) license to any Author Accepted Manuscript version arising from this submission.
Author information
Contributions
R.J.M. conceived the original idea and supervised the research project. R.J.M. and J.W. designed the research project. R.B. and J.W. trained the deep learning models and created the property-guided design workflow. J.G. and J.W. performed the dataset curation, predictions, model validation and data analysis. J.W. performed the quantum chemistry calculations. J.W. and R.J.M. wrote the manuscript with the help of the other authors. The manuscript reflects the contributions of all authors.
Ethics declarations
Competing interests
R.J.M. is an editorial board member of the journal Communications Materials. All other authors declare no competing interests.
Peer review
Peer review information
Nature Computational Science thanks Camille Bilodeau and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Kaitlin McCardle, in collaboration with the Nature Computational Science team.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Sections 1–9, Figs. 1–11 and Table 1.
Supplementary Data
Number of molecules predicted and used for training. Number of molecules used initially, obtained either from OE62 (initial loop), from OE62 + G-SchNet (initial loop for multiproperty biasing) or from G-SchNet alone (remaining loops). Molecules that were generated with G-SchNet are already sorted, hence the number of valid molecules is shown. The number of generated molecules was set to 200,000 for EA, ΔE and multiproperty biasing and to 100,000 for IP and ΔE (knockout) biasing. The third column shows the number of molecules that were selected for biasing G-SchNet. The fourth column shows the percentage of selected molecules with respect to the number of predicted molecules at this iteration.
Source data
Source Data for all Figures
Data depicted in Figs. 1–4.
Source Data Fig. 3
ChemDraw file of molecules depicted in Fig. 3.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Westermayr, J., Gilkes, J., Barrett, R. et al. High-throughput property-driven generative design of functional organic molecules. Nat Comput Sci 3, 139–148 (2023). https://doi.org/10.1038/s43588-022-00391-1
DOI: https://doi.org/10.1038/s43588-022-00391-1