
  • Brief Communication

A deep learning model for predicting selected organic molecular spectra

Abstract

Accurate and efficient molecular spectra simulations are crucial for substance discovery and structure identification. However, the conventional approach of relying on quantum chemistry calculations is computationally expensive, which limits throughput. Here we develop DetaNet, a deep-learning model combining E(3)-equivariant group representations with a self-attention mechanism to predict molecular spectra with improved efficiency and accuracy. By passing high-order geometric tensorial messages, DetaNet is able to generate a wide variety of molecular properties, including scalars, vectors, and second- and third-order tensors, all at the accuracy of quantum chemistry calculations. On this basis, we developed generalized modules to predict four important types of molecular spectra, namely infrared, Raman, ultraviolet–visible, and 1H and 13C nuclear magnetic resonance, taking the QM9S dataset containing 130,000 molecular species as an example. By speeding up the prediction of molecular spectra at quantum chemical accuracy, DetaNet could help progress toward real-time structural identification using spectroscopic measurements.


Fig. 1: DetaNet-predicted IR and Raman spectra and their comparison with DFT reference and experimental data.
Fig. 2: Performance of DetaNet in predicting UV-Vis and NMR spectra.


Data availability

All datasets used in this document are publicly available. The QM9S dataset, including optimized structures, various properties, and IR, Raman, and UV-Vis spectra of the 133,885 molecules, is available at figshare33 (https://doi.org/10.6084/m9.figshare.24235333) or on Code Ocean34 (https://doi.org/10.24433/CO.5808137.v3). The original QM9 dataset16 is available from http://quantum-machine.org/datasets/. We used both the gas- and solvent-phase NMR datasets26 obtained at the mPW1PW91/6-311+G(2d,p) level, which are available at https://moldis.tifrh.res.in/data/QM9NMR. The QM7-X dataset17 is available from https://zenodo.org/record/3905361. Experimental infrared and Raman spectra for comparison are from the Spectral Database for Organic Compounds35 (SDBS) at https://sdbs.db.aist.go.jp. The experimental Raman spectrum of caffeine comes from the RRUFF database36 at https://rruff.info/Ca/D120006. Source data are provided with this paper.

Code availability

The DetaNet model and trained parameters are available from Code Ocean34. The program used for spectrum prediction based on DetaNet is also included in the code. All code is written in Python 3.9 using the PyG37 and e3nn38 libraries.

References

  1. Chen, D., Wang, Z., Guo, D., Orekhov, V. & Qu, X. Review and prospect: deep learning in nuclear magnetic resonance spectroscopy. Chem. Eur. J. 26, 10391–10401 (2020).

  2. Qu, X. et al. Accelerated nuclear magnetic resonance spectroscopy with deep learning. Angew. Chem. 132, 10383–10386 (2020).

  3. Ghosh, K. et al. Deep learning spectroscopy: neural networks for molecular excitation spectra. Adv. Sci. 6, 1801367 (2019).

  4. Ye, S. et al. A machine learning protocol for predicting protein infrared spectra. J. Am. Chem. Soc. 142, 19071–19077 (2020).

  5. Chen, Z. & Yam, V. W.-W. Machine-learned electronically excited states with the MolOrbImage generated from the molecular ground state. J. Phys. Chem. Lett. 14, 1955–1961 (2023).

  6. Grossutti, M. et al. Deep learning and infrared spectroscopy: representation learning with a β-variational autoencoder. J. Phys. Chem. Lett. 13, 5787–5793 (2022).

  7. Behler, J. Atom-centered symmetry functions for constructing high-dimensional neural network potentials. J. Chem. Phys. 134, 074106 (2011).

  8. Schütt, K. T., Arbabzadah, F., Chmiela, S., Müller, K. R. & Tkatchenko, A. Quantum-chemical insights from deep tensor neural networks. Nat. Commun. 8, 13890 (2017).

  9. Schütt, K., Unke, O. & Gastegger, M. Equivariant message passing for the prediction of tensorial properties and molecular spectra. In Int. Conf. Machine Learning 9377–9388 (PMLR, 2021).

  10. Schütt, K. T., Sauceda, H. E., Kindermans, P.-J., Tkatchenko, A. & Müller, K.-R. SchNet—a deep learning architecture for molecules and materials. J. Chem. Phys. 148, 241722 (2018).

  11. Unke, O. T. et al. SpookyNet: learning force fields with electronic degrees of freedom and nonlocal effects. Nat. Commun. 12, 7273 (2021).

  12. Gasteiger, J., Groß, J. & Günnemann, S. Directional message passing for molecular graphs. Preprint at https://doi.org/10.48550/arXiv.2003.03123 (2020).

  13. Thomas, N. et al. Tensor field networks: rotation- and translation-equivariant neural networks for 3D point clouds. Preprint at https://doi.org/10.48550/arXiv.1802.08219 (2018).

  14. Batzner, S. et al. E(3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials. Nat. Commun. 13, 2453 (2022).

  15. Atz, K., Grisoni, F. & Schneider, G. Geometric deep learning on molecular representations. Nat. Mach. Intell. 3, 1023–1032 (2021).

  16. Ramakrishnan, R., Dral, P. O., Rupp, M. & Von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. Sci. Data 1, 140022 (2014).

  17. Hoja, J. et al. QM7-X, a comprehensive dataset of quantum-mechanical properties spanning the chemical space of small organic molecules. Sci. Data 8, 43 (2021).

  18. Zubatyuk, R., Smith, J. S., Leszczynski, J. & Isayev, O. Accurate and transferable multitask prediction of chemical properties with an atoms-in-molecules neural network. Sci. Adv. 5, eaav6490 (2019).

  19. Chen, W.-K., Zhang, Y., Jiang, B., Fang, W.-H. & Cui, G. Efficient construction of excited-state Hessian matrices with machine learning accelerated multilayer energy-based fragment method. J. Phys. Chem. A 124, 5684–5695 (2020).

  20. Christensen, A. S., Bratholm, L. A., Faber, F. A. & Anatole von Lilienfeld, O. FCHL revisited: faster and more accurate quantum machine learning. J. Chem. Phys. 152, 044107 (2020).

  21. Van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).

  22. Kim, S. et al. PubChem 2023 update. Nucleic Acids Res. 51, D1373–D1380 (2023).

  23. Frisch, M. et al. Gaussian 16, Revision B.01 (Gaussian, 2016).

  24. Perdew, J. P., Burke, K. & Ernzerhof, M. Generalized gradient approximation made simple. Phys. Rev. Lett. 77, 3865 (1996).

  25. Schäfer, A., Huber, C. & Ahlrichs, R. Fully optimized contracted Gaussian basis sets of triple zeta valence quality for atoms Li to Kr. J. Chem. Phys. 100, 5829–5835 (1994).

  26. Gupta, A., Chakraborty, S. & Ramakrishnan, R. Revving up 13C NMR shielding predictions across chemical space: benchmarks for atoms-in-molecules kernel machine learning with new data for 134 kilo molecules. Mach. Learn. Sci. Technol. 2, 035010 (2021).

  27. Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems Vol. 30 (2017).

  28. Veit, M., Wilkins, D. M., Yang, Y., DiStasio, R. A. Jr. & Ceriotti, M. Predicting molecular dipole moments by combining atomic partial charges and atomic dipoles. J. Chem. Phys. 153, 024113 (2020).

  29. Elfwing, S., Uchibe, E. & Doya, K. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks 107, 3–11 (2018).

  30. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at https://doi.org/10.48550/arXiv.1412.6980 (2014).

  31. Reddi, S. J., Kale, S. & Kumar, S. On the convergence of Adam and beyond. Preprint at https://doi.org/10.48550/arXiv.1904.09237 (2019).

  32. Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems Vol. 32 (2019).

  33. Zou, Z. & Hu, W. QM9S dataset. Figshare https://doi.org/10.6084/m9.figshare.24235333 (2023).

  34. Deep equivariant tensor attention network (DetaNet). Code Ocean https://doi.org/10.24433/CO.5808137.v3 (2023).

  35. Saito, T. et al. Spectral Database for Organic Compounds (SDBS) (National Institute of Advanced Industrial Science and Technology, 2006).

  36. Lafuente, B., Downs, R. T., Yang, H. & Stone, N. in Highlights in Mineralogical Crystallography (eds Armbruster, T. & Danisi, R. M.) Ch. 1, 1–30 (De Gruyter, 2015).

  37. Fey, M. & Lenssen, J. E. Fast graph representation learning with PyTorch Geometric. Preprint at https://doi.org/10.48550/arXiv.1903.02428 (2019).

  38. Geiger, M. & Smidt, T. e3nn: Euclidean neural networks. Preprint at https://doi.org/10.48550/arXiv.2207.09453 (2022).


Acknowledgements

We acknowledge grants from the National Natural Science Foundation of China (22073053 (W.H.), 22025304 (J.J.), 22033007 (J.J.)), the Young Taishan Scholar Program of Shandong Province (tsqn201909139 (W.H.)), the Natural Science Foundation of Shandong Province (ZR2023MA089 (Y.Z.)), the Program of New Collegiate 20 Items in Jinan (2021GXRC042 (W.H.), 202228031 (Y.Z.)) and the Qilu University of Technology (Shandong Academy of Sciences) Basic Research Project of Science, Education and Industry Integration Pilot (2023PY046 (Y.Z.)). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information

Authors and Affiliations

Authors

Contributions

Z.Z. and W.H. conceived the research, designed the DetaNet model and performed all data analysis. W.H., Y.L., J.J. and Y.Z. jointly supervised the work from the model design to data analysis. Y.Z., L.L., M.W. and J.L. interpreted the data. All authors contributed to the writing of the manuscript.

Corresponding authors

Correspondence to Yujin Zhang, Jun Jiang, Yi Luo or Wei Hu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Computational Science thanks Conrard Tetsassi Feugmo, Feng Wang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Kaitlin McCardle, in collaboration with the Nature Computational Science team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Schematic diagram of DetaNet’s architecture with a color-coded view of individual components.

a. Architecture overview. Irreducible-representation features (\(\bf{T}_{i,l,p}^{n,m}\)) and scalar features (\(\bf{S}_{i}^{n}\)) are used as messages in the interaction layer, where n, p, i, l and m represent the order of the passing interaction layer, the even/odd parity, the atomic number, and the rotation degree and order, respectively. N is the maximum number of interaction layers and \({\mathop{\bf r}\limits^{\rightharpoonup }}_{ij}\) is the position vector from atom j to atom i. \(\bf{t}_{i,l,p}^{m}\) and si are the output irreps tensor and scalar. b. Atomic embedding module based on nuclear and electronic features, where O(Zi) and Q(Zi) represent the nuclear types and the inherent atomic electronic structure. Lz, LQ and Lemb are linear layers that integrate an F-dimensional atomic feature. c. Message module architecture. \({Y}_{l,p}^{m}\) is the spherical harmonic function. \({\Delta }_{M}\bf{S}_{i}^{n}\) and \({\Delta }_{M}\bf{T}_{i,l,p}^{n,m}\) represent the corresponding residuals. d. A radial embedding module that generates the key (wk) and value (wv) weights for the next self-attention module. e. A radial self-attention module. Mq, Mk and Mv represent the query, key and value features. FMq and eij represent the dimension of Mq and the output edge features. f. Atomwise self-attention update module. Uq, Uk and \(\rm{Uv}_{(\it{T})}\) are the query, key and value features of the update module. All symbols L with subscripts indicate learnable linear layers.
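The core idea behind the irreps and scalar messages above can be illustrated without any of DetaNet's actual machinery: scalar (l = 0) features depend only on rotation-invariant quantities such as the interatomic distance, while l = 1 features built from the bond direction rotate together with the molecule. The following is a minimal NumPy sketch of this invariance/equivariance split under stated assumptions (the `messages` function and its `weight` parameter are hypothetical stand-ins, not DetaNet code):

```python
import numpy as np

def messages(r_ij, weight=0.5):
    """Toy message pair for an edge j -> i: an invariant scalar that depends
    only on the distance, and an l = 1 equivariant vector along the unit
    bond direction. Illustrative only; not the DetaNet implementation."""
    d = np.linalg.norm(r_ij)
    scalar = np.exp(-weight * d)      # invariant under rotation of the molecule
    vector = scalar * (r_ij / d)      # rotates with the molecule
    return scalar, vector

# Rotate the molecule by 90 degrees about the z axis
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
r = np.array([1.0, 2.0, 2.0])         # position vector from atom j to atom i

s1, v1 = messages(r)
s2, v2 = messages(R @ r)

print(np.isclose(s1, s2))             # scalar channel is invariant: True
print(np.allclose(v2, R @ v1))        # vector channel is equivariant: True
```

DetaNet's higher-order channels (l = 2, 3, built from spherical harmonics \({Y}_{l,p}^{m}\)) obey the same transformation rule with the corresponding Wigner matrices in place of R.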

Extended Data Fig. 2 Schematic diagram of how the submodules operate using matrix representations, taking water as an example.

a. Illustration of the H2O molecule and the definition of the central (i) and neighboring (j) atoms. b. Matrix representation for the atomic embedding submodule. O(Zi) and Q(Zi) represent the nuclear types and the inherent atomic electronic structure. \({\bf S}_{i}^{0}\) is the generated scalar feature. c. Matrix representation for the message module. \({\mathop{\bf{r}}\limits^{\rightharpoonup }}_{ij}\) is the position vector from atom j to atom i. \({Y}_{l,p}^{m}\) is the spherical harmonic function. \({\Delta }_{M}\bf{S}_{i}^{n}\) and \({\Delta }_{M}\bf{T}_{i,l,p}^{n,m}\) represent the corresponding residuals. d. Matrix representation for the radial embedding module, where wk and wv are the weights corresponding to the key and value. e. Matrix representation for the radial attention module, where Mq, Mk and Mv are the corresponding query, key and value features. f. Matrix representation for the atomwise attention update module. \(\bf{T}_{i,l,p}^{\,n,m}\) and \({\bf S}_{i}^{n}\) are the irreps and scalar features.
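The atomic embedding step in panel b can be sketched as a one-hot encoding of the nuclear type multiplied by a learnable weight matrix. The sketch below uses NumPy with a hypothetical feature dimension F = 4 and omits the electronic-structure features Q(Zi); it only shows the shape bookkeeping for H2O, not the trained embedding:

```python
import numpy as np

rng = np.random.default_rng(0)
F = 4                                  # feature dimension (hypothetical; real models use larger F)
max_Z = 10                             # largest nuclear charge considered

Z = np.array([8, 1, 1])                # H2O: one O atom and two H atoms
one_hot = np.eye(max_Z)[Z - 1]         # O(Z_i): one-hot nuclear-type matrix, shape (3, 10)

W_emb = rng.normal(size=(max_Z, F))    # stand-in for the learnable embedding layer
S0 = one_hot @ W_emb                   # initial per-atom scalar features S_i^0, shape (3, 4)

print(one_hot.shape, S0.shape)
assert (S0[1] == S0[2]).all()          # both H atoms start from identical features
```

The two hydrogen rows are identical by construction; the subsequent message and attention modules are what differentiate atoms according to their geometric environment.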

Extended Data Fig. 3 Ablation studies of DetaNet.

Ablation experiments were performed to test the impact of each module on the model. We list the MAE of the dipole moment vectors and polarizability tensors when any given module is excluded. The model performs best with N = 3 and lmax = 3 when the cutoff function is excluded while the electronic features, the radial self-attention module, the update module and the local part of the output function are retained. N and lmax are the maximum number of interaction layers and the maximum degree, respectively.

Source data

Extended Data Fig. 4 Error distributions and regression plots of DetaNet’s predictions for eight properties.

a. Energy learned from partial QM7-X datasets. b. Atomic forces learned from partial QM7-X datasets. c. Natural population charge learned from the QM9S dataset. d. Electric dipole moment learned from the QM9S dataset. e. Polarizability learned from the QM9S dataset. f. First hyperpolarizability learned from the QM9S dataset. g. Electric quadrupole moment learned from the QM9S dataset. h. Electric octupole moment learned from the QM9S dataset. MAE, RMSE and R2 represent the mean absolute error, the root mean square error and the coefficient of determination, respectively.
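The three error metrics reported in these regression plots have standard definitions, which can be computed in a few lines of NumPy (the toy arrays below are illustrative, not data from the paper):

```python
import numpy as np

def metrics(y_true, y_pred):
    """Mean absolute error, root mean square error and coefficient of
    determination (R^2) for a set of predictions."""
    err = y_pred - y_true
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    ss_res = np.sum(err ** 2)                          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

# Hypothetical reference values and predictions
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

mae, rmse, r2 = metrics(y_true, y_pred)
print(round(mae, 3), round(rmse, 3), round(r2, 3))     # 0.15 0.158 0.98
```

Note that RMSE weights large outliers more heavily than MAE, which is why both are reported alongside R².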

Source data

Extended Data Fig. 5 Complete program for predicting infrared and Raman spectra using DetaNet.

We first performed a frequency analysis by diagonalizing the DetaNet-predicted Hessian matrix to obtain the vibrational frequencies and the corresponding normal coordinates. The infrared absorption intensities and Raman scattering activities were then calculated, via the chain rule, as the first derivatives of the dipole moment and the polarizability, respectively, with respect to the normal coordinates. Here \({\mathop{\bf{r}}\limits^{\rightharpoonup }}_{i}\) is the atomic position and Zi is the atomic number. ω represents the frequency of the absorbed/scattered light and \(\mathop{P}\limits^{\rightharpoonup }\) denotes the normal coordinates. μ and α are the dipole moment and polarizability tensor, respectively.
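The frequency-analysis step (diagonalizing the mass-weighted Hessian to get frequencies and normal coordinates) can be sketched for a toy one-dimensional diatomic, where the answer is known analytically: one zero-frequency translation mode and one vibration at ω = √(k/μ), with μ the reduced mass. This is a minimal NumPy illustration of the textbook procedure, with hypothetical k and masses, not DetaNet's prediction pipeline:

```python
import numpy as np

# Toy 1D diatomic with harmonic force constant k. Its Cartesian Hessian is
# k * [[1, -1], [-1, 1]] (second derivatives of E = k/2 * (x1 - x2)^2).
k, m1, m2 = 4.0, 1.0, 3.0
H = k * np.array([[1.0, -1.0],
                  [-1.0, 1.0]])

# Mass-weight the Hessian: H_mw = M^{-1/2} H M^{-1/2}
Minv_sqrt = np.diag([1.0 / np.sqrt(m1), 1.0 / np.sqrt(m2)])
H_mw = Minv_sqrt @ H @ Minv_sqrt

# Eigenvalues are squared angular frequencies; eigenvectors are the
# (mass-weighted) normal coordinates. eigh returns them in ascending order.
eigvals, modes = np.linalg.eigh(H_mw)
omega = np.sqrt(np.clip(eigvals, 0.0, None))   # clip tiny negative noise

mu = m1 * m2 / (m1 + m2)                       # reduced mass
print(np.isclose(omega[0], 0.0))               # translation mode: True
print(np.allclose(omega[-1], np.sqrt(k / mu))) # vibration at sqrt(k/mu): True
```

In the actual workflow the intensities then follow from the dipole (IR) and polarizability (Raman) derivatives projected onto these normal coordinates.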

Extended Data Fig. 6 Computational efficiency of DetaNet.

a. Comparison of average times in seconds for predicting the vibrational, UV-Vis and NMR spectra. b. Average computational times for predicting vibrational spectra using DFT and DetaNet (CPU) with increasing molecular size. DFT and DetaNet (CPU) timings were measured on an Intel Core i7-8700K processor, and DetaNet (GPU) timings on an NVIDIA RTX 3080 Ti.

Source data

Supplementary information

Supplementary Information

Supplementary Sections 1–11, Tables 1–5, Figs. 1–8 and references.

Peer Review File

Source data

Source Data Fig. 1

Statistical source data for the DetaNet- and DFT-predicted Hessian matrices and the derivatives of the dipole moment and polarizability with respect to the normal coordinates for the 6,500 molecules in the evaluation sets, together with the data points (DFT-predicted, DetaNet-predicted and experimental) for plotting the IR and Raman spectra of cyclohexanone, 2-methylpyrazine and caffeine.

Source Data Fig. 2

Statistical source data for the DetaNet-predicted UV and NMR errors compared with DFT results for the 6,500 molecules in the evaluation sets, and the NMR latent space and labels for t-SNE, together with the data points (DFT- and DetaNet-predicted) for plotting the UV-Vis, 13C NMR and 1H NMR spectra of cyclohexanone, 2-methylpyrazine, hepta-3,5-diyn-2-one, aniline and 5-methoxy-1,3-oxazole-2-carbaldehyde.

Source Data Extended Data Fig. 3

Statistical source data for the MAE of the dipole moment and polarizability obtained from the ablation experiments. The ablations include exclusion of the electronic features, exclusion of radial attention, exclusion of the update module, usage of a Gaussian basis, usage of an additional cutoff function, usage of different maximum numbers of interaction layers, usage of different maximum degrees, exclusion of the local and non-local parts, and usage of non-equivariant linear layers.

Source Data Extended Data Fig. 4

Statistical source data used to plot the comparison between DetaNet and DFT in describing energy, atomic forces, natural population charge, dipole moment, polarizability, first hyperpolarizability, quadrupole moment and octupole moment.

Source Data Extended Data Fig. 6

Statistical source data for the average prediction time for DFT, DetaNet (CPU) and DetaNet (GPU).

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zou, Z., Zhang, Y., Liang, L. et al. A deep learning model for predicting selected organic molecular spectra. Nat Comput Sci 3, 957–964 (2023). https://doi.org/10.1038/s43588-023-00550-y


  • DOI: https://doi.org/10.1038/s43588-023-00550-y
