
  • Brief Communication

A deep learning model for predicting selected organic molecular spectra

Abstract

Accurate and efficient molecular spectra simulations are crucial for substance discovery and structure identification. However, the conventional approach of relying on quantum chemistry calculations is computationally expensive, which limits throughput. Here we develop DetaNet, a deep-learning model combining E(3)-equivariant group representations with a self-attention mechanism to predict molecular spectra with improved efficiency and accuracy. By passing high-order geometric tensorial messages, DetaNet is able to generate a wide variety of molecular properties, including scalars, vectors, and second- and third-order tensors, all at the accuracy of quantum chemistry calculations. On this basis, we developed generalized modules to predict four important types of molecular spectra, namely infrared, Raman, ultraviolet–visible, and 1H and 13C nuclear magnetic resonance, taking the QM9S dataset containing 130,000 molecular species as an example. By speeding up the prediction of molecular spectra at quantum chemical accuracy, DetaNet could help progress toward real-time structural identification using spectroscopic measurements.


Fig. 1: DetaNet-predicted IR and Raman spectra and their comparison with DFT reference and experimental data.
Fig. 2: Performance of DetaNet in predicting UV-Vis and NMR spectra.


Data availability

All datasets used in this document are publicly available. The QM9S dataset, including optimized structures, various properties, and IR, Raman, and UV-Vis spectra of the 133,885 molecules, is available at figshare33 (https://doi.org/10.6084/m9.figshare.24235333) or on Code Ocean34 (https://doi.org/10.24433/CO.5808137.v3). The original QM9 dataset16 is available from http://quantum-machine.org/datasets/. We used both the gas- and solvent-phase NMR datasets26 obtained at the mPW1PW91/6-311+G(2d,p) level, which are available at https://moldis.tifrh.res.in/data/QM9NMR. The QM7-X dataset17 is available from https://zenodo.org/record/3905361. Experimental infrared and Raman spectra for comparison are from the Spectral Database for Organic Compounds35 (SDBS) at https://sdbs.db.aist.go.jp. The experimental Raman spectrum of caffeine comes from the RRUFF database36 at https://rruff.info/Ca/D120006. Source data are provided with this paper.

Code availability

The DetaNet model and trained parameters are available from Code Ocean34. The program used for spectrum prediction based on DetaNet is also included in the code. All code is written in Python 3.9 using the PyG37 and e3nn38 libraries.

References

  1. Chen, D., Wang, Z., Guo, D., Orekhov, V. & Qu, X. Review and prospect: deep learning in nuclear magnetic resonance spectroscopy. Chem. Eur. J. 26, 10391–10401 (2020).

  2. Qu, X. et al. Accelerated nuclear magnetic resonance spectroscopy with deep learning. Angew. Chem. 132, 10383–10386 (2020).

  3. Ghosh, K. et al. Deep learning spectroscopy: neural networks for molecular excitation spectra. Adv. Sci. 6, 1801367 (2019).

  4. Ye, S. et al. A machine learning protocol for predicting protein infrared spectra. J. Am. Chem. Soc. 142, 19071–19077 (2020).

  5. Chen, Z. & Yam, V. W.-W. Machine-learned electronically excited states with the MolOrbImage generated from the molecular ground state. J. Phys. Chem. Lett. 14, 1955–1961 (2023).

  6. Grossutti, M. et al. Deep learning and infrared spectroscopy: representation learning with a β-variational autoencoder. J. Phys. Chem. Lett. 13, 5787–5793 (2022).

  7. Behler, J. Atom-centered symmetry functions for constructing high-dimensional neural network potentials. J. Chem. Phys. 134, 074106 (2011).

  8. Schütt, K. T., Arbabzadah, F., Chmiela, S., Müller, K. R. & Tkatchenko, A. Quantum-chemical insights from deep tensor neural networks. Nat. Commun. 8, 13890 (2017).

  9. Schütt, K., Unke, O. & Gastegger, M. Equivariant message passing for the prediction of tensorial properties and molecular spectra. In Int. Conf. Machine Learning 9377–9388 (PMLR, 2021).

  10. Schütt, K. T., Sauceda, H. E., Kindermans, P.-J., Tkatchenko, A. & Müller, K.-R. SchNet—a deep learning architecture for molecules and materials. J. Chem. Phys. 148, 241722 (2018).

  11. Unke, O. T. et al. SpookyNet: learning force fields with electronic degrees of freedom and nonlocal effects. Nat. Commun. 12, 7273 (2021).

  12. Gasteiger, J., Groß, J. & Günnemann, S. Directional message passing for molecular graphs. Preprint at https://doi.org/10.48550/arXiv.2003.03123 (2020).

  13. Thomas, N. et al. Tensor field networks: rotation- and translation-equivariant neural networks for 3D point clouds. Preprint at https://doi.org/10.48550/arXiv.1802.08219 (2018).

  14. Batzner, S. et al. E(3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials. Nat. Commun. 13, 2453 (2022).

  15. Atz, K., Grisoni, F. & Schneider, G. Geometric deep learning on molecular representations. Nat. Mach. Intell. 3, 1023–1032 (2021).

  16. Ramakrishnan, R., Dral, P. O., Rupp, M. & Von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. Sci. Data 1, 140022 (2014).

  17. Hoja, J. et al. QM7-X, a comprehensive dataset of quantum-mechanical properties spanning the chemical space of small organic molecules. Sci. Data 8, 43 (2021).

  18. Zubatyuk, R., Smith, J. S., Leszczynski, J. & Isayev, O. Accurate and transferable multitask prediction of chemical properties with an atoms-in-molecules neural network. Sci. Adv. 5, eaav6490 (2019).

  19. Chen, W.-K., Zhang, Y., Jiang, B., Fang, W.-H. & Cui, G. Efficient construction of excited-state Hessian matrices with machine learning accelerated multilayer energy-based fragment method. J. Phys. Chem. A 124, 5684–5695 (2020).

  20. Christensen, A. S., Bratholm, L. A., Faber, F. A. & Anatole von Lilienfeld, O. FCHL revisited: faster and more accurate quantum machine learning. J. Chem. Phys. 152, 044107 (2020).

  21. Van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).

  22. Kim, S. et al. PubChem 2023 update. Nucleic Acids Res. 51, D1373–D1380 (2023).

  23. Frisch, M. et al. Gaussian 16, Revision B.01 (Gaussian, 2016).

  24. Perdew, J. P., Burke, K. & Ernzerhof, M. Generalized gradient approximation made simple. Phys. Rev. Lett. 77, 3865 (1996).

  25. Schäfer, A., Huber, C. & Ahlrichs, R. Fully optimized contracted Gaussian basis sets of triple zeta valence quality for atoms Li to Kr. J. Chem. Phys. 100, 5829–5835 (1994).

  26. Gupta, A., Chakraborty, S. & Ramakrishnan, R. Revving up 13C NMR shielding predictions across chemical space: benchmarks for atoms-in-molecules kernel machine learning with new data for 134 kilo molecules. Mach. Learn. Sci. Technol. 2, 035010 (2021).

  27. Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems Vol. 30 (2017).

  28. Veit, M., Wilkins, D. M., Yang, Y., DiStasio, R. A. Jr. & Ceriotti, M. Predicting molecular dipole moments by combining atomic partial charges and atomic dipoles. J. Chem. Phys. 153, 024113 (2020).

  29. Elfwing, S., Uchibe, E. & Doya, K. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks 107, 3–11 (2018).

  30. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at https://doi.org/10.48550/arXiv.1412.6980 (2014).

  31. Reddi, S. J., Kale, S. & Kumar, S. On the convergence of Adam and beyond. Preprint at https://doi.org/10.48550/arXiv.1904.09237 (2019).

  32. Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems Vol. 32 (2019).

  33. Zou, Z. & Hu, W. QM9S dataset. Figshare https://doi.org/10.6084/m9.figshare.24235333 (2023).

  34. Deep equivariant tensor attention network (DetaNet). Code Ocean https://doi.org/10.24433/CO.5808137.v3 (2023).

  35. Saito, T. et al. Spectral Database for Organic Compounds (SDBS) (National Institute of Advanced Industrial Science and Technology, 2006).

  36. Lafuente, B., Downs, R. T., Yang, H. & Stone, N. in Highlights in Mineralogical Crystallography (eds Armbruster, T. & Danisi, R. M.) Ch. 1, 1–30 (De Gruyter, 2015).

  37. Fey, M. & Lenssen, J. E. Fast graph representation learning with PyTorch Geometric. Preprint at https://doi.org/10.48550/arXiv.1903.02428 (2019).

  38. Geiger, M. & Smidt, T. e3nn: Euclidean neural networks. Preprint at https://doi.org/10.48550/arXiv.2207.09453 (2022).


Acknowledgements

We acknowledge grants from the National Natural Science Foundation of China (22073053 (W.H.), 22025304 (J.J.), 22033007 (J.J.)), the Young Taishan Scholar Program of Shandong Province (tsqn201909139 (W.H.)), the Natural Science Foundation of Shandong Province (ZR2023MA089 (Y.Z.)), the Program of New Collegiate 20 Items in Jinan (2021GXRC042 (W.H.), 202228031 (Y.Z.)) and the Qilu University of Technology (Shandong Academy of Sciences) Basic Research Project of Science, Education and Industry Integration Pilot (2023PY046 (Y.Z.)). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information

Authors and Affiliations

Authors

Contributions

Z.Z. and W.H. conceived the research, designed the DetaNet model and performed all data analysis. W.H., Y.L., J.J. and Y.Z. jointly supervised the work from the model design to data analysis. Y.Z., L.L., M.W. and J.L. interpreted the data. All authors contributed to the writing of the manuscript.

Corresponding authors

Correspondence to Yujin Zhang, Jun Jiang, Yi Luo or Wei Hu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Computational Science thanks Conrard Tetsassi Feugmo, Feng Wang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Kaitlin McCardle, in collaboration with the Nature Computational Science team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Schematic diagram of DetaNet’s architecture with a color-coded view of individual components.

a. Architecture overview. Irreducible-representation features (\(\bf{T}_{i,l,p}^{n,m}\)) and scalar features (\(\bf{S}_{i}^{n}\)) are used as messages in the interaction layer, where n, p, i, l and m represent the order of the passing interaction layer, the even/odd parity, the atomic number, and the rotation degree and order, respectively. N is the maximum number of interaction layers and \({\mathop{\bf r}\limits^{\rightharpoonup }}_{ij}\) is the position vector from atom j to atom i. \(\bf{t}_{i,l,p}^{m}\) and si are the output irreps tensor and scalar. b. Atomic embedding module based on nuclear and electronic features, where O(Zi) and Q(Zi) represent the nuclear types and the inherent atomic electronic structure. Lz, LQ and Lemb are linear layers that integrate an F-dimensional atomic feature. c. Message module architecture. \({Y}_{l,p}^{m}\) is the spherical harmonic function. \({\Delta }_{M}\bf{S}_{i}^{n}\) and \({\Delta }_{M}\bf{T}_{i,l,p}^{n,m}\) represent the corresponding residuals. d. A radial embedding module that generates the key (wk) and value (wv) weights for the next self-attention module. e. A radial self-attention module. Mq, Mk and Mv represent the query, key and value features. FMq and eij represent the dimension of Mq and the output edge features. f. Atomwise self-attention update module. Uq, Uk and \(\rm{Uv}_{(\it{T})}\) are the query, key and value features of the update module. All symbols L with subscripts indicate learnable linear layers.
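The core idea behind the irreps and scalar messages above can be illustrated without any of DetaNet's actual machinery: scalar (l = 0) features depend only on rotation-invariant quantities such as the interatomic distance, while l = 1 features built from the bond direction rotate together with the molecule. The following is a minimal NumPy sketch of this invariance/equivariance split under stated assumptions (the `messages` function and its `weight` parameter are hypothetical stand-ins, not DetaNet code):

```python
import numpy as np

def messages(r_ij, weight=0.5):
    """Toy message pair for an edge j -> i: an invariant scalar that depends
    only on the distance, and an l = 1 equivariant vector along the unit
    bond direction. Illustrative only; not the DetaNet implementation."""
    d = np.linalg.norm(r_ij)
    scalar = np.exp(-weight * d)      # invariant under rotation of the molecule
    vector = scalar * (r_ij / d)      # rotates with the molecule
    return scalar, vector

# Rotate the molecule by 90 degrees about the z axis
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
r = np.array([1.0, 2.0, 2.0])         # position vector from atom j to atom i

s1, v1 = messages(r)
s2, v2 = messages(R @ r)

print(np.isclose(s1, s2))             # scalar channel is invariant: True
print(np.allclose(v2, R @ v1))        # vector channel is equivariant: True
```

DetaNet's higher-order channels (l = 2, 3, built from spherical harmonics \({Y}_{l,p}^{m}\)) obey the same transformation rule with the corresponding Wigner matrices in place of R.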

Extended Data Fig. 2 Schematic diagram of how the submodules operate using matrix representations, taking water as an example.

a. Illustration of the H2O molecule and the definition of the central (i) and neighboring (j) atoms. b. Matrix representation for the atomic embedding submodule. O(Zi) and Q(Zi) represent the nuclear types and the inherent atomic electronic structure. \({\bf S}_{i}^{0}\) is the generated scalar feature. c. Matrix representation for the message module. \({\mathop{\bf{r}}\limits^{\rightharpoonup }}_{ij}\) is the position vector from atom j to atom i. \({Y}_{l,p}^{m}\) is the spherical harmonic function. \({\Delta }_{M}\bf{S}_{i}^{n}\) and \({\Delta }_{M}\bf{T}_{i,l,p}^{n,m}\) represent the corresponding residuals. d. Matrix representation for the radial embedding module, where wk and wv are the weights corresponding to the key and value. e. Matrix representation for the radial attention module, where Mq, Mk and Mv are the corresponding query, key and value features. f. Matrix representation for the atomwise attention update module. \(\bf{T}_{i,l,p}^{\,n,m}\) and \({\bf S}_{i}^{n}\) are the irreps and scalar features.
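The atomic embedding step in panel b can be sketched as a one-hot encoding of the nuclear type multiplied by a learnable weight matrix. The sketch below uses NumPy with a hypothetical feature dimension F = 4 and omits the electronic-structure features Q(Zi); it only shows the shape bookkeeping for H2O, not the trained embedding:

```python
import numpy as np

rng = np.random.default_rng(0)
F = 4                                  # feature dimension (hypothetical; real models use larger F)
max_Z = 10                             # largest nuclear charge considered

Z = np.array([8, 1, 1])                # H2O: one O atom and two H atoms
one_hot = np.eye(max_Z)[Z - 1]         # O(Z_i): one-hot nuclear-type matrix, shape (3, 10)

W_emb = rng.normal(size=(max_Z, F))    # stand-in for the learnable embedding layer
S0 = one_hot @ W_emb                   # initial per-atom scalar features S_i^0, shape (3, 4)

print(one_hot.shape, S0.shape)
assert (S0[1] == S0[2]).all()          # both H atoms start from identical features
```

The two hydrogen rows are identical by construction; the subsequent message and attention modules are what differentiate atoms according to their geometric environment.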

Extended Data Fig. 3 Ablation studies of DetaNet.

Ablation experiments were performed to test the impact of each module on the model. We list the MAE of the dipole moment vectors and polarizability tensors when any given module is excluded. The model performs best with N = 3 and lmax = 3 when the cutoff function is excluded while the electronic features, the radial self-attention module, the update module and the local part of the output function are retained. N and lmax are the maximum number of interaction layers and the maximum degree, respectively.

Source data

Extended Data Fig. 4 Error distributions and regression plots of DetaNet’s predictions for eight properties.

a. Energy learned from partial QM7-X datasets. b. Atomic forces learned from partial QM7-X datasets. c. Natural population charge learned from the QM9S dataset. d. Electric dipole moment learned from the QM9S dataset. e. Polarizability learned from the QM9S dataset. f. First hyperpolarizability learned from the QM9S dataset. g. Electric quadrupole moment learned from the QM9S dataset. h. Electric octupole moment learned from the QM9S dataset. MAE, RMSE and R2 represent the mean absolute error, the root mean square error and the coefficient of determination, respectively.
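The three error metrics reported in these regression plots have standard definitions, which can be computed in a few lines of NumPy (the toy arrays below are illustrative, not data from the paper):

```python
import numpy as np

def metrics(y_true, y_pred):
    """Mean absolute error, root mean square error and coefficient of
    determination (R^2) for a set of predictions."""
    err = y_pred - y_true
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    ss_res = np.sum(err ** 2)                          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

# Hypothetical reference values and predictions
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

mae, rmse, r2 = metrics(y_true, y_pred)
print(round(mae, 3), round(rmse, 3), round(r2, 3))     # 0.15 0.158 0.98
```

Note that RMSE weights large outliers more heavily than MAE, which is why both are reported alongside R².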

Source data

Extended Data Fig. 5 Complete program for predicting infrared and Raman spectra using DetaNet.

We first performed a frequency analysis by diagonalizing the DetaNet-predicted Hessian matrix to obtain the vibrational frequencies and the corresponding normal coordinates. The infrared absorption intensities and Raman scattering activities were then calculated, via the chain rule, as the first derivatives of the dipole moment and the polarizability, respectively, with respect to the normal coordinates. Here \({\mathop{\bf{r}}\limits^{\rightharpoonup }}_{i}\) is the atomic position and Zi is the atomic number. ω represents the frequency of the absorbed/scattered light and \(\mathop{P}\limits^{\rightharpoonup }\) denotes the normal coordinates. μ and α are the dipole moment and polarizability tensor, respectively.
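The frequency-analysis step (diagonalizing the mass-weighted Hessian to get frequencies and normal coordinates) can be sketched for a toy one-dimensional diatomic, where the answer is known analytically: one zero-frequency translation mode and one vibration at ω = √(k/μ), with μ the reduced mass. This is a minimal NumPy illustration of the textbook procedure, with hypothetical k and masses, not DetaNet's prediction pipeline:

```python
import numpy as np

# Toy 1D diatomic with harmonic force constant k. Its Cartesian Hessian is
# k * [[1, -1], [-1, 1]] (second derivatives of E = k/2 * (x1 - x2)^2).
k, m1, m2 = 4.0, 1.0, 3.0
H = k * np.array([[1.0, -1.0],
                  [-1.0, 1.0]])

# Mass-weight the Hessian: H_mw = M^{-1/2} H M^{-1/2}
Minv_sqrt = np.diag([1.0 / np.sqrt(m1), 1.0 / np.sqrt(m2)])
H_mw = Minv_sqrt @ H @ Minv_sqrt

# Eigenvalues are squared angular frequencies; eigenvectors are the
# (mass-weighted) normal coordinates. eigh returns them in ascending order.
eigvals, modes = np.linalg.eigh(H_mw)
omega = np.sqrt(np.clip(eigvals, 0.0, None))   # clip tiny negative noise

mu = m1 * m2 / (m1 + m2)                       # reduced mass
print(np.isclose(omega[0], 0.0))               # translation mode: True
print(np.allclose(omega[-1], np.sqrt(k / mu))) # vibration at sqrt(k/mu): True
```

In the actual workflow the intensities then follow from the dipole (IR) and polarizability (Raman) derivatives projected onto these normal coordinates.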

Extended Data Fig. 6 Computational efficiency of DetaNet.

a. Comparison of average times in seconds for predicting the vibrational, UV-Vis and NMR spectra. b. Average computational times for predicting vibrational spectra using DFT and DetaNet (CPU) with increasing molecular size. DFT and DetaNet (CPU) timings were measured on an Intel Core i7-8700K processor, and DetaNet (GPU) timings on an NVIDIA RTX 3080 Ti.

Source data

Supplementary information

Supplementary Information

Supplementary Sections 1–11, Tables 1–5, Figs. 1–8 and references.

Peer Review File

Source data

Source Data Fig. 1

Statistical source data for the DetaNet- and DFT-predicted Hessian matrices and the derivatives of the dipole moment and polarizability with respect to the normal coordinates for the 6,500 molecules in the evaluation sets, together with the data points (DFT-predicted, DetaNet-predicted and experimental) for plotting the IR and Raman spectra of cyclohexanone, 2-methylpyrazine and caffeine.

Source Data Fig. 2

Statistical source data for the DetaNet-predicted UV and NMR errors compared with DFT results for the 6,500 molecules in the evaluation sets, and the NMR latent space and labels for t-SNE, together with the data points (DFT- and DetaNet-predicted) for plotting the UV-Vis, 13C NMR and 1H NMR spectra of cyclohexanone, 2-methylpyrazine, hepta-3,5-diyn-2-one, aniline and 5-methoxy-1,3-oxazole-2-carbaldehyde.

Source Data Extended Data Fig. 3

Statistical source data for the MAE of the dipole moment and polarizability obtained from the ablation experiments. The ablations include exclusion of the electronic features, exclusion of radial attention, exclusion of the update module, usage of a Gaussian basis, usage of an additional cutoff function, usage of different maximum numbers of interaction layers, usage of different maximum degrees, exclusion of the local and non-local parts, and usage of non-equivariant linear layers.

Source Data Extended Data Fig. 4

Statistical source data used to plot the comparison between DetaNet and DFT in describing energy, atomic forces, natural population charge, dipole moment, polarizability, first hyperpolarizability, quadrupole moment and octupole moment.

Source Data Extended Data Fig. 6

Statistical source data for the average prediction time for DFT, DetaNet (CPU) and DetaNet (GPU).

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zou, Z., Zhang, Y., Liang, L. et al. A deep learning model for predicting selected organic molecular spectra. Nat Comput Sci 3, 957–964 (2023). https://doi.org/10.1038/s43588-023-00550-y


  • DOI: https://doi.org/10.1038/s43588-023-00550-y
