Introduction

The fundamental challenge of quantum chemistry, solid-state physics, and many areas of computational materials science is to obtain solutions to the electronic Schrödinger equation for a given system, which in principle provides complete access to its chemical properties. The ground and low-lying excited states typically determine the behavior of a system and are therefore of the most interest in many applications. Understanding and being able to describe excited-state processes1, including a wide variety of important spectroscopy methods such as fluorescence, photoionization, and optical absorption of molecules and solids, is key to the successful design of new materials.

Unfortunately, the Schrödinger equation cannot be solved exactly except in the simplest cases, such as one-dimensional toy systems or a single hydrogen atom. Accordingly, many approximate numerical methods have been developed which provide solutions at varying degrees of accuracy. Time-dependent density functional theory2,3 (TDDFT) is the most popular method due to its computational efficiency but has well-known limitations4,5,6,7,8,9. Higher-accuracy methods have a computational cost that scales rapidly with system size—the well-established full configuration interaction10 (FCI) and coupled cluster11 (CC) techniques scale \(\sim {{{{{{{\mathcal{O}}}}}}}}(\,{{\mbox{exp}}}\,(N))\) (FCI scales exponentially, while truncated CI scales polynomially.) and \(\sim {{{{{{{\mathcal{O}}}}}}}}({N}^{5-10})\) (The scaling of CC depends on the particular method used: CC2 \({{{{{{{\mathcal{O}}}}}}}}({N}^{5})\), CCSD \({{{{{{{\mathcal{O}}}}}}}}({N}^{6})\), CCSD(T) \({{{{{{{\mathcal{O}}}}}}}}({N}^{7})\), CC3 \({{{{{{{\mathcal{O}}}}}}}}({N}^{7})\), CCSDT \({{{{{{{\mathcal{O}}}}}}}}({N}^{8})\), CCSDT(Q) \({{{{{{{\mathcal{O}}}}}}}}({N}^{9})\), CCSDTQ \({{{{{{{\mathcal{O}}}}}}}}({N}^{10})\).) respectively, where N is the number of electrons, thereby severely limiting their practical use. There is thus a huge need for ab initio methods that scale more favorably with system size, allowing the modeling of practically relevant molecules and materials.

Quantum Monte Carlo (QMC) techniques offer a route forward with their favorable scaling (\({{{{{{{\mathcal{O}}}}}}}}({N}^{3-4})\)) and therefore dominate high-accuracy calculations where other methods are too expensive12,13. A state-of-the-art QMC calculation typically involves the construction of a multi-determinant baseline wavefunction through standard electronic-structure methods, which is augmented with a Jastrow factor to efficiently incorporate electron correlation, and then optimized through variational QMC (VMC) to obtain a trial wavefunction. This is then used within fixed-node diffusion QMC (DMC) to obtain a final electronic energy. The fixed-node approximation is used to avoid exponential scaling, with the drawback that the nodal surface of the trial wavefunction cannot be modified, which limits the accuracy of the DMC result14. A more expressive baseline wavefunction can improve upon this but traditional DMC often needs thousands to hundreds of thousands of determinants to reach convergence15. Additionally, DMC only provides the final energy, restricting the calculation of other electronic properties16. Both of these limitations can, in principle, be resolved at the VMC level, with the accuracy of VMC constrained only by the flexibility of the trainable wavefunction ansatz. So far, these techniques have mostly been developed for ground-state calculations, with different extensions proposed to address excited states12,17,18,19,20,21,22,23,24,25,26.

Recently, the new ab initio approach of deep VMC methods has been introduced27,28,29,30 and subsequently further extended and improved31,32,33. In particular, PauliNet27 and FermiNet28 were the first methods to demonstrate that highly accurate ground-state results for molecules could be obtained using deep VMC with lower computational complexity and using orders of magnitude fewer Slater determinants than typically employed in other methods that achieve similar accuracy.

In the same spirit as Carleo and Troyer proposed for optimizing quantum states in lattice models34, VMC is used in order to train a neural network model that represents the many-body wavefunction in an unsupervised fashion, i.e. in contrast to other quantum machine learning approaches the only input to the method is the Hamiltonian, and training data are generated on the fly by sampling from the current wavefunction model and minimizing the variational energy. In both PauliNet and FermiNet deep antisymmetric neural networks are used to represent the fermionic wavefunction in the real space of electron coordinates.

Recently, there has been much interest in developing deep learning methods for excited states35. In this paper, we extend PauliNet towards the ab initio computation of electronic excited states (see the “Methods” section for details). The input is again only the Hamiltonian of the quantum system. By employing a simple energy minimization and numerical orthogonalization procedure, we are able to obtain the lowest excited-state wavefunctions of a given system. The excited-state optimization makes use of a penalty method that minimizes the overlap between the nth excited state and the lower-lying states in the spectrum. Optimization methods that introduce additional constraints have been used in the context of VMC before26 and provide a simple way to obtain orthogonal states without explicit enforcement in the wavefunction ansatzes. Combining these techniques with the expressiveness of neural network ansatzes yields highly accurate approximations to excited states with direct access to the wavefunctions for the evaluation of electronic observables. Neural network-based methods have targeted low-lying excited states of one-dimensional lattice models25, but have not been applied to first-principles systems.

We demonstrate our method on a variety of small- and medium-sized molecules, where we consistently achieve highly accurate total energies, outperforming traditional quantum chemistry methods. We also compute excitation energies, transition dipole moments, and oscillator strengths, the main ground-to-excited transition properties, with the latter two known to be more sensitive to errors in the underlying wavefunctions than energies. In all test systems, we find PauliNet closely matches high-order CC and experimental results. Next, we show that our method can be applied in a straightforward manner to much larger molecules, using the example of benzene where we match significantly more expensive high-level electronic-structure methods. Finally, we demonstrate that PauliNet can be used to compute excited-state potential energy surfaces by modeling an avoided crossing and conical intersection of ethylene, a highly multi-referential problem.

Results

Nearly exact solutions for small atoms and molecules

To demonstrate our method we start by applying it to a range of small atoms and molecules. We optimize the lowest-lying excited states and compute their vertical excitation energies for the ground-state equilibrium geometry (see Supplementary Table I), with each PauliNet wavefunction containing a maximum of 10 determinants. In all systems, we obtain highly accurate total energies and estimates of the first few excitation energies competitive with high-accuracy quantum chemistry methods.

In Fig. 1 the excitation energies of the lowest states are shown for several atoms. For all the atoms the excitation energies are obtained within 4 mHa of the theoretical best estimates (TBE)36. Due to the high degree of symmetry the atoms exhibit degeneracies, that is, multiple orthogonal states can be found with the same energy. Being subject to the orthogonalization constraint, PauliNet approximates all orthogonal states of an energy level individually, which is observed by attaining multiple results at the same energy level. The multiplicity of the exact solution can be obtained theoretically by considering the electronic configurations of the atoms and is reproduced within our experiments.

Fig. 1: Deep VMC obtains highly accurate excited states for single elements.
figure 1

PauliNet results for the excitation energies (with (red) and without (yellow) variance matching (see the “Methods” section for details)) are compared to the theoretical best estimates (TBE) taken from the NIST database36. Multiple PauliNet ansatzes with identical energies correspond to orthogonal degenerate states. For the TBE we have depicted four excitations per atom, taking account of the degeneracies. For all atoms, we find the first excited state with high accuracy. For B, C, and O the ground state is threefold degenerate. For these systems we choose one of the three states to compute excitation energies, resulting in transitions with a relative energy of zero. For Li and Be a further excitation energy is found. While we obtain the second excited state for Be, in Li we miss out intermediate states and instead find the transition from the ground state to the 2D state. This can be related to the generic CASSCF initialization of the ansatzes (see the “Methods” section for details). (The numerical data can be found in Supplementary Table II; source data are provided as a Source Data file).

We then compute a larger number of excited states for LiH, BeH and Be. In each experiment, we optimize eight ansatzes in parallel. In Fig. 2 we illustrate the training process by plotting the convergence of the total energies and excitation energies. Additionally, we plot the training estimates of the pairwise overlaps of the wavefunctions, which remain small throughout the optimization process. We confirm that the final overlaps are near-zero by exhaustively sampling the trained wavefunctions, thereby obtaining well-converged Monte Carlo estimates (see Supplementary Table VI). Based on the degeneracies we find a total of five (LiH), four (BeH), and three (Be) distinct excitation energies, respectively. The excitation energies match those from reference values, and in particular, we find that for all systems studied here we reliably obtain the first excited state, and apart from one case also the second excited state. However, especially for clusters of higher-lying excited states with similar energies, we typically do not find all members of the cluster. In these cases, which states are found depends on the initialization of our ansatzes, as well as the total number of states that are being sought. To give a transparent picture of the capabilities of our method, in this work we have refrained from optimizing the CASSCF baseline in order to find all possible excitations.

Fig. 2: Optimizing low-lying excited states for small molecules.
figure 2

Several excited states of LiH, BeH, and Be are approximated. The convergence of the total energies (upper row), excitation energies (middle row), and the pairwise overlaps between the wavefunctions (bottom row) are shown. For degenerate states, multiple ansatzes attain the same energy. Dotted horizontal lines are excitation energies from FCI calculations and other highly accurate references36,57,58,59. Due to the initialization from the CASSCF baseline, the wavefunctions start with a small overlap, which is retained throughout the optimization. (The numerical data can be found in Supplementary Table II; source data are provided as a Source Data file).

Highly accurate wavefunctions: transition dipole moments and oscillator strengths

Total energies and vertical excitation energies are the primary focus when benchmarking excited-state methods as they are readily available from many theoretical models and provide a good initial guess of a particular method’s accuracy. However, they provide only a partial characterization of the electronic states, and while a method in question may give accurate energies, other quantities of key importance may be inaccurate37,38,39.

Transition dipole moments (TDM) and oscillator strengths are two principal ground-to-excited transition properties and are of great interest. TDMs determine how polarized electromagnetic radiation will interact with a system due to its distribution of charge, and therefore determine transition rates and probabilities of induced state changes. In the electric dipole approximation, the TDM between two states i and j is given by

$${{{{{{{{\bf{d}}}}}}}}}_{ij}=\langle {\psi }_{i}|\hat{{{{{{{{\boldsymbol{\mu }}}}}}}}}|{\psi }_{j}\rangle,$$
(1)

where \(\hat{{{{{{{{\boldsymbol{\mu }}}}}}}}}={\sum }_{k}q{\hat{{{{{{{{\bf{r}}}}}}}}}}_{k}\) is the sum over the position operator of each particle weighted by its charge, with q = −e for electronic systems. We obtain the expectation value by Monte Carlo sampling according to Eq. (15). While the TDM is important for understanding a number of processes, including optical spectra, it is generally a complex-valued vector quantity and not an experimental observable by itself. The closely related oscillator strength is what is inferred through the experiment and is given by

$${f}_{ij}=\frac{2}{3}\Delta E{d}_{ij}^{2},$$
(2)

where ΔE is the excitation energy between states i and j, and \({d}_{ij}^{2}\) is the dipole strength. It is known that, in addition to being more basis-set sensitive, dij and fij are both highly dependent on the quality of the trial wavefunctions40 and represent a more rigorous test for ab initio methods than just energies.

Recently, transition energies and oscillator strengths for a variety of small molecules have been computed using high-order CC calculations, systematically extrapolating to the complete basis set (CBS) limit, and comparing to experimental results where possible, in order to supply a comprehensive set of theoretical benchmarks41,42. In that spirit, we now use these results to benchmark the accuracy of oscillator strengths computed using PauliNet. Furthermore, we also compare multi-reference CC (MR-CC) results where possible43. We compute the first few electronic states for five molecules (BH, CH+, H2O, NH3, CO), such that we obtain the first non-zero oscillator strength (within the dipole approximation) for each. All calculations (CH+ was not included in the CC calculations in refs. 41,42. We instead compare to (MR-)CC results in ref. 43, using the same ground-state equilibrium geometry, which was obtained in a split-valence basis augmented with diffuse and polarization functions. See refs. 43,44 for more details.) are performed at the same ground-state equilibrium geometries as refs. 41, 42 (see Supplementary Table I) and using the same number of determinants (≤10) as in the section “Nearly exact solutions for small atoms and molecules”.

Our results for all systems are shown in Fig. 3. First, we compute the amount of correlation energy recovered in the ground state and find PauliNet matches high-order CC methods (panel a). Second, we compute the excitation energy for each transition and find this to be close to the TBE, on par with CC and much more consistent than TDDFT where the accuracy depends on the molecule and on the exact TDDFT method used (panel b). Finally, we compare the oscillator strengths (for the 0 → 2 transition) in panel c. Even high-order methods such as CC and MR-CC can produce a spectrum of results depending on the expansion and basis set used, with this exacerbated in cheaper methods such as TDDFT (see the example of CO). In all systems, PauliNet compares well with experimental results, demonstrating the quality of deep VMC wavefunctions with just a minimal number of determinants.

Fig. 3: Deep VMC obtains highly accurate excited-state energies and wavefunctions for small molecules.
figure 3

a PauliNet recovers the same amount of correlation energy as high-order CC methods36. (CH+: No better reference energy to compare with.) b Lowest triplet (0 → 1) and singlet (0 → 2) excitation energies obtained using PauliNet (with (red) and without (yellow) variance matching), CC, and TDDFT, with the TBE given. (BH and CH+ exhibit degeneracy for the triplet state; CC is CCSD or higher, except for the triplet state of BH which includes CC2.) c Oscillator strengths computed for the 0 → 2 transitions. PauliNet compares well to experiment in all systems and matches the accuracy of (MR−)CC results, demonstrating the quality of few-determinant PauliNet wavefunctions. (We have omitted a factor of two linked to degeneracy in BH and CO.) Refs: exact correlation energies36,60,61; excitation energies from CC41,42,43,62,63,64,65,66, TDDFT65,66,67,68 and TBE41,42,44,62,69; oscillator strengths from (MR-)CC41 -- 43,70,71, TDDFT72 and experiment73,74,75,76,77. (The numerical data can be found in Supplementary Table III; source data are provided as a Source Data file).

Application to larger molecules

The previous two sections showed that we achieve highly accurate results across a range of small systems. While this is encouraging, traditional high-accuracy methods that are better established are readily available for such small systems. In this section, to demonstrate the potential of excited PauliNet, we show that it can be applied in a straightforward manner to significantly larger molecules. For this objective, we choose the example of the benzene molecule (panel a of Fig. 4). Studies of its electronic structure and other properties are plentiful due to its importance in bio and organic chemistry, and with 42 electrons all-electron calculations will be extremely demanding or even intractable for a high-level description of its electronic states, depending on the theory level used.

Fig. 4: Calculating the two lowest electronic states of the benzene molecule.
figure 4

Inset: Benzene structure. a Convergence of the total energies of the ground state (red) and excited state (light red) with training. Total energies of the ground state from CCSD(T) in the frozen-core approximation with the aug-cc-pVnZ basis set (n = D, T) (dashed blue), and full CCSD(T) at the CBS limit (solid blue) are shown36. b Convergence of the excitation energy with training (with (red) and without (yellow) variance matching). c Excitation energy computed using PauliNet, TDDFT68, CC78, DMC26, CAS-PT (79 and calculations in openMolcas80) and Experiment26,45 (adiabatic (dashed black) and vertical (solid black) excitation energies). (The numerical data can be found in Supplementary Table IV; source data are provided as a Source Data file).

Using a PauliNet ansatz with just 10 determinants, the same as in the much smaller systems, and slightly deeper neural networks (see Supplementary Table VII) we obtain very good total energies for the ground state and first excited state (upper left of Fig. 4). We note the better accuracy than high-level CC calculations, with this signifying highly accurate wavefunctions that can be used to compute other observables, as demonstrated in the previous section. The computed excitation energy is also shown (right of Fig. 4), with PauliNet compared against several experimental and theoretical results. The lower experimental result45 (dashed black line) quantifies an adiabatic excitation energy, i.e. the energy difference between the ground state and the excited state at the corresponding relaxed geometries. This quantity is corrected to obtain the vertical excitation energy26 (solid black line), which omits nuclear relaxation and vibrational effects. As our calculations are performed at the ground-state equilibrium geometry, we are targeting the vertical excitation energy, and therefore consider this corrected experimental result to be closer to the ground truth. We find this to be slightly underestimated by high-order methods (CC, DMC), and slightly overestimated by PauliNet. In other systems (panel b of Fig. 3) we notice a similar trend when comparing to the TBE.

PauliNet formally scales as \({{{{{{{\mathcal{O}}}}}}}}({N}^{4})\) with the number of electrons N, and in practice, we observe a scaling behavior \({{{{{{{\mathcal{O}}}}}}}}({N}^{3})\) for the systems investigated so far, which is related to quadratic scaling of the neural network with an extra factor from the evaluation of the local energy. As PauliNet is currently implemented in a research code, which is not optimized for production purposes, the computational time will have a large prefactor which makes it computationally unfavorable to, e.g. CC methods for small molecules. However, its very favorable scaling in N compared to \({{{{{{{\mathcal{O}}}}}}}}({N}^{5-10})\) of high-level electronic-structure methods dominates for larger molecules, and this is clearly visible in benzene. For instance, ref. 46 used several state-of-the-art methods to obtain accurate benzene ground-state energies, with calculations run on several CPU types in a highly parallel manner (see Supporting Information of ref. 46 for details). PauliNet was run on a single RTX 3090 GPU at a fraction of the number of node hours. Although PauliNet is the computationally cheapest method in this comparison, it provides a significantly better (variational) ground-state energy than all methods (~0.48 Ha lower). As all methods compared in Fig. 4 provide similar excitation energies, these cannot be used to group the methods into more or less accurate, but overall this data indicates that PauliNet and deep VMC methods in general have a very favorable cost/accuracy trade-off for molecules of the size of benzene and beyond.

Multi-reference application: conical intersections

Molecular configurations that produce electronic states with similar energies are fundamental in photochemical applications. Such configurations can lead to several states mixing, meaning they are all necessary for an accurate description of a particular process. Conical intersections are produced when two states become degenerate and require the computation of excited-state potential energy surfaces. The modeling of energy surfaces near degeneracies is inherently multi-reference with significant electronic correlation and is thus a challenging application for electronic-structure methods.

As a final application of excited PauliNet, we compute ground- and excited-state potential energies for ethylene (H2C=CH2) as a function of its torsion and pyramidalization angles (see inset of Fig. 5). Twisting around the C=C bond raises the energy of the ground state while lowering that of the first-excited singlet state, giving rise to an avoided crossing at a torsion angle τ of 90°. From this twisted structure, the energy gap between the two states is further reduced through the pyramidalization of one of the CH2 groups, leading to a conical intersection. These potential energy curves, whose modeling is often too challenging for single-reference methods47,48,49, have been characterized using multi-reference configuration interaction (MR-CI) methods50 which we use for comparison.

Fig. 5: Modeling a conical intersection of ethylene.
figure 5

Inset: Ethylene structure. a Total energies (relative to the ground state of the planar geometry E0) of the ground state and first-excited singlet state as a function of torsion angle τ, with MR-CI50 and TDDFT48 results also plotted for comparison. TDDFT overestimates the barrier (the ground state at τ = 90°) and produces an unphysical cusp, while the MR-CI results which predict an avoided crossing are well reproduced by PauliNet. b Same as above but as a function of pyramidalization angle ϕ (τ = 90°), with the degeneracy of the two states producing a conical intersection. The arrows denote the conical intersection, with PauliNet (ϕ ~ 100°) closely matching the MR-CI result (ϕ ~ 96°). Note: The geometric parameters (bond lengths and angles) vary slightly between the torsion and pyramidalization experiments (see ref. 50). (The numerical data can be found in Supplementary Table V; source data are provided as a Source Data file).

We choose the same ground-state (planar) geometry as ref. 50 (optimized using a small CAS and the aug-cc-pVDZ basis set; see Supplementary Table I) and find the excitation energy between the ground state and first-excited singlet state to be within a few mHa of the MR-CI results. As we vary τ, while keeping all other geometric parameters fixed, we find the energy curves to be well reproduced by PauliNet, with an avoided crossing at τ = 90° (panel a of Fig. 5; curves symmetric about τ = 90°). Single-reference methods, such as TDDFT (see figure), often overestimate the energy of the ground state at τ = 90° (barrier) and produce an unphysical cusp.

Next, we take the same twisted structure (τ = 90°) as ref. 50 (optimized using a small CAS and the aug-cc-pVDZ basis set; see Supplementary Table I) and vary the pyramidalization angle ϕ, while keeping all other geometric parameters fixed. While there is a small discrepancy between PauliNet and the MR-CI results (panel b of Fig. 5), the trend of the energy curves is well described, including the correct minimum of the excited-state curve (~70°) and the conical intersection (PauliNet: ϕ ~ 100°; MR-CI: ϕ ~ 96°). We note that many single-reference methods are unable to even qualitatively describe the conical intersection, instead predicting spurious features49.

Discussion

We have introduced an approach to compute highly accurate excited-state solutions of the electronic Schrödinger equation for molecules by using deep neural networks that are trained in an unsupervised manner with variational Monte Carlo. We have employed the PauliNet architecture27 to approximate the ground- and excited-state wavefunctions, however other architectures such as FermiNet28 or second quantization approaches29 could also be employed, with suitable modifications. As our approach to find excited states only constrains the excited-state wavefunctions, the ability to compute highly accurate and variational absolute ground-state energies is unchanged. In addition, we demonstrate for a number of small molecules containing up to 42 electrons, that excited PauliNet can reliably find the first excitation energies with an accuracy that is on par with high-level electronic-structure methods, whereas cheaper methods such as TDDFT are less consistent in approximating these energies. The accuracy of the excited-state wavefunctions is underlined by an accurate match of oscillator strengths, which depend on the transition dipole moment, a quantity that is more sensitive to the exact form of the wavefunction than the energy. For benzene (42 electrons), PauliNet already requires significantly less computational time than higher-order methods, and this advantage will only improve for larger molecules. Formally, a single PauliNet is an \({{{{{{{\mathcal{O}}}}}}}}({N}^{4})\) method for N electrons, due to the computational cost of the Hartree-Fock or CASSCF baseline, however, in practice we empirically observe an \({{{{{{{\mathcal{O}}}}}}}}({N}^{3})\) dependency for the system sizes tested, as discussed above. In addition, for excited-state calculations n PauliNet replicas are used which gives rise to \({{{{{{{\mathcal{O}}}}}}}}(n{N}^{3})+{{{{{{{\mathcal{O}}}}}}}}({n}^{2}{N}^{2})\), with the latter term arising from the pairwise overlaps and having a much smaller prefactor than the former.

Notably, almost identical excited PauliNet architectures are used across the systems shown in this paper—up to minor modifications such as the budget of Slater determinants and the total number of excited states requested, and a deeper network for benzene to adapt for a potentially more complex wavefunction. Whereas a skilled quantum chemist can usually tune and specialize an existing electronic-structure method to give very high-accuracy results for a given molecule, our aim is the exact opposite: to provide a method that, by leveraging machine-learning tools, is as automated as possible and will work over a wide range of Hamiltonians provided.

We have demonstrated that we can compute ground- and excited-state potential energy surfaces with the example of ethylene where we model an avoided crossing and conical intersection. Here, where single-reference methods often fail, PauliNet performs well against multi-reference CI results. By combining the present approach with recent and ongoing extensions of PauliNet32 and FermiNet33 that variationally compute entire potential energy surfaces, both highly accurate ground- and excited-state energy surfaces are now accessible with deep VMC methods. Future work will investigate the application of PauliNet to other interesting processes where molecular dynamics interacts with excited states.

One of the limitations of the current approach is that it appears difficult to reliably find all excited states up to a given desired number, especially in cases where several excited states have similar energies. This is a complex problem that depends on the Hartree-Fock/CASSCF initialization, on the total number of states requested, on the learning algorithm, and the expressiveness of the architecture and will be studied in more detail elsewhere. However, the first excited state could be reliably found for all molecules studied here, and apart from one exception also the second excited state. This, in combination with the high numerical accuracy and the favorable computational cost, makes deep VMC a promising method to compute both ground- and excited-state properties for small- and medium-sized molecules with dozens or even low hundreds of electrons.

Methods

PauliNet ansatz

At the heart of our approach is the PauliNet ansatz, introduced in ref. 27 and further refined in ref. 51, a multi-determinant Slater–Jastrow-backflow type trial wavefunction that is parametrized by highly expressive deep neural networks:

$$\psi_{\boldsymbol\theta}({{{{{{{\bf{r}}}}}}}})={\mathrm e}^{\gamma({{{{{{{\bf{r}}}}}}}})+{J}_{\boldsymbol\theta}({{{{{{{\bf{r}}}}}}}})} \textstyle \sum\limits_{p} {c_p} \det\left[\tilde\varphi_{\boldsymbol\theta,{\mu_p}i}^\uparrow({{{{{{{\bf{r}}}}}}}})\right] \det \left[\tilde\varphi_{\boldsymbol\theta,{\mu_p}i}^\downarrow({{{{{{{\bf{r}}}}}}}})\right],$$
(3)
$${\tilde{\varphi }}_{{{{{{{{\boldsymbol{\theta }}}}}}}},\mu i}({{{{{{{\bf{r}}}}}}}})={\varphi }_{\mu }({{{{{{{{\bf{r}}}}}}}}}_{i}){f}_{{{{{{{{\boldsymbol{\theta }}}}}}}},\mu i}^{({{{{\rm {m}}}}})}({{{{{{{\bf{r}}}}}}}})+{f}_{{{{{{{{\boldsymbol{\theta }}}}}}}},\mu i}^{({{{{\rm {a}}}}})}({{{{{{{\bf{r}}}}}}}}),$$
(4)

where r = (r1, . . ., rN) is the 3N-dimensional real space of electron coordinates. The structure of our ansatz ensures that the correct physics is encoded: the wavefunction obeys exact asymptotic behavior through the fixed electronic cusps γ, and is antisymmetric with respect to the exchange of like-spin electrons through the use of generalized Slater determinants, guaranteeing the Pauli exclusion principle is obeyed.

The expressiveness of PauliNet is contained in the Jastrow factor Jθ and backflow fθ, which introduce many-body correlation, and are both represented through deep neural networks (denoted by trainable parameters θ). Jθ and fθ are constructed in ways that preserve the antisymmetry of the fermionic wavefunction with respect to exchanging like-spin electrons, as well as its cusp behavior. The Jastrow factor is an exchange-symmetric function, and captures complex correlation effects through augmenting the Slater-determinant baseline, but is incapable of modifying the nodal surface of the determinant expansion. Changes to the nodal surface are possible through the backflow, which acts on the single-electron orbitals φμ directly, transforming them into permutation-equivariant many-electron orbitals \({\tilde{\varphi }}_{\mu }\). fθ is split into multiplicative (m) and additive (a) components (Eq. (4)), and is designed to be equivariant under the exchange of like-spin electrons.

Ground-state optimization

Like traditional VMC methods, PauliNet is based on the variational principle, which guarantees that the energy expectation value of a trial wavefunction ψθ is an upper bound to the true ground-state energy:

$${E}_{0}=\mathop{\min }\limits_{\psi }\langle \psi|\hat{H}|\psi \rangle \le \mathop{\min }\limits_{{{{{{{{\boldsymbol{\theta }}}}}}}}}\langle {\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}}}|\hat{H}|{\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}}}\rangle .$$
(5)

For a given system, a standard quantum chemistry method (Hartree–Fock (HF) for a single determinant; complete active space self-consistent field (CASSCF) for multiple determinants) is performed, with the solution supplemented by the analytically-known cusp conditions, thus producing a reasonable baseline wavefunction. We then optimize the PauliNet ansatz by minimizing the total electronic energy (serving directly as the loss), following the standard VMC trick of evaluating it as an expectation value of the local energy, \({E}_{{{{{{{{\rm{loc}}}}}}}}}({{{{{{{\bf{r}}}}}}}})=\hat{H}\psi ({{{{{{{\bf{r}}}}}}}})/\psi ({{{{{{{\bf{r}}}}}}}})\), over the probability distribution ψθ2:

$${{{{{{{\mathcal{L}}}}}}}}({{{{{{{\boldsymbol{\theta }}}}}}}})={\mathbb{E}}{}_{{{{{{{{\bf{r}}}}}}}} \sim {|\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}}}{|}^{2}}\left[{E}_{{{{{{{{\rm{loc}}}}}}}}}[{\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}}}]({{{{{{{\bf{r}}}}}}}})\right].$$
(6)

This means that, in practice, we alternate between sampling electron positions generated using a Langevin algorithm with the probability of the trial wavefunction serving as the target distribution, and optimizing the trial wavefunction parameters using stochastic gradient descent. For further details, see ref. 27.

Computing excited states

We now introduce the central idea of this paper: a deep VMC method to compute the ground and low-lying excited states of a given electronic system. While we employ PauliNet to represent the individual wavefunctions, the method can also employ FermiNet or other real-space wavefunction representations with suitable modifications.

In a similar spirit to the ground-state optimization process, we first obtain a reasonable baseline for each state by performing a minimal state-averaged CASSCF calculation. This optimizes the energy average for all states in question and yields a single set of orbitals to construct each multi-determinant wavefunction, which in turn are supplemented by the analytically-known cusp conditions. We fix the number of determinants in our ansatz by cutting off the CASSCF expansion based on the absolute values of their determinant coefficients. The choice of the CASSCF baseline ensures that the PauliNet ansatzes for the different excited states are close to orthogonal upon initialization. In contrast to the ground-state calculation, the optimization of excited states requires a more nuanced choice of active space. In principle, we must ensure that the solutions contain determinants with orbitals of the necessary rotational symmetries (the Jastrow factor and backflow correction are rotationally-symmetric modifications of the orbitals) and spin configurations (the choice of the number of spin-up and spin-down electrons does impose restrictions on the states that may be attained by our ansatz). For most systems studied in this paper, a generic choice of the active space was sufficient (see Supplementary Table VIII) and we have not studied the dependence on the CAS initialization in more depth. As shown in previous studies the quality of the orbitals has only a minor effect on the training and does not change the final energy51. If, however, the initialization is not accounted for and the baseline solutions provide a qualitatively wrong spectrum of excited states our ansatzes may be trapped in local minima and miss intermediate excited states (see Fig. 2), even though we keep the Slater-determinant coefficients cp and linear coefficients cμk of the single-electron orbitals φμ(ri) = ∑kcμkϕk(ri) trainable.

Our objective is to calculate the lowest n eigenstates of a given system, that is, find the set of orthogonal states that minimizes the energy expectation value. We approach this challenge by introducing a penalty term to the energy loss function (Eq. (6)) and optimizing the joint loss for n PauliNet instances:

$${{{{{{{\mathcal{L}}}}}}}}({{{{{{{\boldsymbol{\theta }}}}}}}})=\underbrace{\mathop{\sum}\limits_{i}{{\mathbb{E}}_{i}\left[{E}_{{{{{{{{\rm{loc}}}}}}}}}[{\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}},i}]({{{{{{{\bf{r}}}}}}}})\right]}}_{\begin{array}{c}{{\mbox{energy minimization}}}\end{array}}+\alpha \underbrace{\mathop{\sum}\limits_{i\ > \ j}\left(\frac{1}{1-|{S}_{ij}|}-1\right)}_{\begin{array}{c}{{\mbox{overlap penalty}}}\end{array}},$$
(7)

where \({{\mathbb{E}}}_{i}={{\mathbb{E}}}_{{{{{{{{\bf{r}}}}}}}} \sim|{\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}},i}{|}^{2}}\) and Sij is the pairwise overlap between states i and j. The functional form of the overlap penalty is chosen to diverge when two states collapse and behave linearly when states are close to orthogonal (see the next section for details). This allows states to overlap during the optimization procedure while preventing their collapse and eventually driving them to orthogonality when they have settled in a local minimum of the energy. The hyperparameter α weights the two loss terms and can be increased throughout the training to strengthen the orthogonality condition when approaching the final wavefunctions. For a sufficiently large α, the true minimum of the loss function corresponds to the sum of the energies of the lowest-lying excited states with these states having no overlap. Thus, optimizing the penalized loss function (Eq. (7)) leads to an unbiased convergence towards the lowest-lying excited states (see below). In practice a small α is typically sufficient, making a robust choice possible.

To stabilize the training and reduce the computational cost we detach gradients in such a way that we only consider the overlap with the lower-lying states respectively, that is, the ground state is subject to unconstrained energy minimization and the nth excited state introduces n pairwise penalty terms. We compute the overlap of the unnormalized states i and j as the geometric mean of the two Monte Carlo estimates, obtained over distributions ψθ,i2 and ψθ,j2, respectively:

$${S}_{ij}={{{{{{{\rm{sgn}}}}}}}}\left({{\mathbb{E}}}_{i}\left[\frac{{\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}},j}({{{{{{{\bf{r}}}}}}}})}{{\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}},i}({{{{{{{\bf{r}}}}}}}})}\right]\right)\times \sqrt{{{\mathbb{E}}}_{i}\left[\frac{{\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}},j}({{{{{{{\bf{r}}}}}}}})}{{\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}},i}({{{{{{{\bf{r}}}}}}}})}\right]{{\mathbb{E}}}_{j}\left[\frac{{\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}},i}({{{{{{{\bf{r}}}}}}}})}{{\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}},j}({{{{{{{\bf{r}}}}}}}})}\right]}.$$
(8)

The sign of the overlap can be obtained from either of the two estimators, which match in the limit of infinite sampling. If the overlap is close to zero and the signs of the two estimates differ due to statistical noise of the sampling, we consider the states to be orthogonal. Similar to the energy loss, the gradient (We employ gradient clipping to stabilize the training.) of the pairwise overlap can be formulated such that it depends on the first derivative of the log wavefunction with respect to the parameters only (see below for details).

Finally, we note that different states may be modeled at different levels of quality, which can lead to erroneous excitation energies. In order to improve the error cancellation of our ansatzes we employ a variance-matching technique. As the variance of the energy σ2 can be considered a metric of how close a wavefunction is to a true eigenstate, variance-matching procedures can be useful tools21,52,53. Here, we utilize a simple scheme: for single-state quantities such as total energies, we evaluate all wavefunctions at the end of training. For multi-state quantities, such as excitation energies or transition dipole moments, we match states of a similar variance. That is, if final ψθ,i has a lower variance than final ψθ,j, we take ψθ,i at an earlier point in training. This simply involves computing σ2 of the training energies and applying an exponential moving average at each iteration to monitor convergence (see below for details). We find this procedure typically improves the final results.

Loss function and overlap penalty

There are a number of choices of possible loss functions for the optimization of excited states in quantum Monte Carlo20,26,54. In order to assess the feasibility of excited-state optimization with deep neural network ansatzes in variational Monte Carlo we conducted a range of experiments with different types of optimization objectives. Our empirical findings showed that employing a penalty method is the conceptually most straightforward approach and gives stable results when combining it with our implementation of PauliNet. Initially, we started with an overlap penalty term similar to Pathak et al26. However, we found that our optimization could still collapse even if we chose a sufficiently large prefactor (α) and the training could not recover. We therefore switched to an alternative penalty term (Eq. (7)) which diverges upon a collapse of the states. The effect of our penalty term can be illustrated by considering the loss for a two-state system with the exact ground state \(\left|{\psi }_{0}\right\rangle\) and a linear combination of the ground and first excited state \(\left|{\psi }_{1}\right\rangle\) (see Fig. 6):

$$\left|{\psi }_{\epsilon }\right\rangle=\sqrt{1-\epsilon }\left|{\psi }_{1}\right\rangle+\sqrt{\epsilon }\left|{\psi }_{0}\right\rangle .$$
(9)

The overlap and the energy can be obtained as

$$\langle {\psi }_{0}|{\psi }_{\epsilon }\rangle=\sqrt{\epsilon },\quad \langle {\psi }_{\epsilon }|H|{\psi }_{\epsilon }\rangle=(1-\epsilon ){E}_{1}+\epsilon {E}_{0}.$$
(10)

In the vicinity of the orthogonal solution, the Taylor expansion of the penalty term is

$$\frac{1}{1-|S|}-1=|S |+|S{|}^{2}+|S{|}^{3}+...,\,{{{{{{{\rm{at}}}}}}}}\quad|S |=0,$$
(11)

that is, the overlap penalty behaves linearly to first order. This gives rise to a penalty that is locally stable for any prefactor, lower bounded by the S2 penalty term, and diverges if states collapse. For a large enough α parameter the global optimum of the total loss is at zero overlap, that is, the optimization method is incentivized to find exactly orthogonal states without mixing.

Fig. 6: Sketch of the loss function.
figure 6

This figure illustrates the behavior of our loss function for a two-state system. The ground state is kept fixed and the second state is considered to be a linear combination of the ground state and first excited state (Eq. (9)). The scales are to be understood in arbitrary units, as they depend on the choice of hyperparameters and the energies of the system under investigation.

In practice, for the batch sizes used in our calculations, we have not observed a bias due to the non-linear nature of the penalty when applied to sampled expectation values of the overlap. However, it is expected that this is no longer the case in the limit of small batches. In order to elucidate how our loss function behaves in this regard, we compute the two lowest states of LiH using a range of different batch sizes (see Fig. 7). We find the optimization procedure to be robust for the large batch sizes that we typically employ (≥2000), with the excitation energy within 1 mHa of the exact, and the pairwise overlap remaining small throughout training (panel c). For smaller batch sizes, we observe a larger degree of statistical noise in the pairwise overlap, which leads to a less reliable approximation for the excited state and the corresponding excitation energy (panel b).

Fig. 7: Behavior of the loss function with batch size.
figure 7

The ground state and first excited state of LiH are approximated. The convergence of the total energies (a), excitation energy (b), and the pairwise overlap between the wavefunctions (c) are shown. Dotted horizontal lines are excitation energies from highly accurate references57,58. While the optimization works well for the large batches that we typically employ in our calculations (≥2000), this becomes less reliable, at least for the excited state, in the limit of smaller batch sizes. Note: Darker corresponds to a larger batch size in each respective color.

Gradient of the loss function

In order to differentiate the loss function we explicitly formulate the gradient. We consider the general case of a mixed observable:

$${O}_{ij}=\frac{1}{{N}_{i}{N}_{j}}\int\,{d}^{3}{{{{{{{\bf{r}}}}}}}}{\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}},i}({{{{{{{\bf{r}}}}}}}})\left[\hat{O}{\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}},j}({{{{{{{\bf{r}}}}}}}})\right],$$
(12)
$$=\frac{{N}_{i}}{{N}_{j}}{{\mathbb{E}}}_{i}\left[\frac{\hat{O}{\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}},j}({{{{{{{\bf{r}}}}}}}})}{{\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}},i}({{{{{{{\bf{r}}}}}}}})}\right],$$
(13)

where Ni, Nj are the norms of the wavefunctions and \({{\mathbb{E}}}_{i}={{\mathbb{E}}}_{{{{{{{{\bf{r}}}}}}}} \sim|{\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}},i}{|}^{2}}\). By the property of Hermitian matrices, Oij = Oji, we derive an expression that does not depend on the wavefunction norms:

$${O}_{ij}=\sqrt{\frac{{N}_{i}}{{N}_{j}}{{\mathbb{E}}}_{i}\left[\frac{\hat{O}{\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}},j}({{{{{{{\bf{r}}}}}}}})}{{\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}},i}({{{{{{{\bf{r}}}}}}}})}\right]}\sqrt{\frac{{N}_{j}}{{N}_{i}}{{\mathbb{E}}}_{j}\left[\frac{\hat{O}{\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}},i}({{{{{{{\bf{r}}}}}}}})}{{\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}},j}({{{{{{{\bf{r}}}}}}}})}\right]},$$
(14)
$$={{{{{{{\rm{sgn}}}}}}}}\left({{\mathbb{E}}}_{i}\left[\frac{\hat{O}{\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}},j}({{{{{{{\bf{r}}}}}}}})}{{\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}},i}({{{{{{{\bf{r}}}}}}}})}\right]\right)\times \sqrt{{{\mathbb{E}}}_{i}\left[\frac{\hat{O}{\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}},j}({{{{{{{\bf{r}}}}}}}})}{{\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}},i}({{{{{{{\bf{r}}}}}}}})}\right]{{\mathbb{E}}}_{j}\left[\frac{\hat{O}{\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}},i}({{{{{{{\bf{r}}}}}}}})}{{\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}},j}({{{{{{{\bf{r}}}}}}}})}\right]}.$$
(15)

This expression reduces to the pairwise overlaps (Eq. (8)) upon setting \(\hat{O}={{\mbox{Id}}}\). The derivative of this term can be expressed as

$$\partial {O}_{ij}=\frac{1}{{O}_{ij}}\Bigg\{{{\mathbb{E}}}_{i}\Bigg[\left(\frac{\hat{O}{\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}},j}({{{{{{{\bf{r}}}}}}}})}{{\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}},i}({{{{{{{\bf{r}}}}}}}})}-{{\mathbb{E}}}_{i}\left[\frac{\hat{O}{\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}},j}({{{{{{{\bf{r}}}}}}}})}{{\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}},i}({{{{{{{\bf{r}}}}}}}})}\right]\right) \partial \ln|{\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}},i}({{{{{{{\bf{r}}}}}}}})|\Bigg]\times {{\mathbb{E}}}_{j}\left[\frac{\hat{O}{\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}},i}({{{{{{{\bf{r}}}}}}}})}{{\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}},j}({{{{{{{\bf{r}}}}}}}})}\right]+(i {\iff} j)\Bigg\},$$
(16)

where (ij) is an additional term with the two indices interchanged.

By considering the Hamiltonian operator \(\hat{H}\) and setting i = j we recover the gradient of the energy loss27:

$$\partial {E}_{ii}=2{{\mathbb{E}}}_{i}\left[\left(\frac{\hat{H}{\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}},i}({{{{{{{\bf{r}}}}}}}})}{{\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}},i}({{{{{{{\bf{r}}}}}}}})}-{{\mathbb{E}}}_{i}\left[\frac{\hat{H}{\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}},i}({{{{{{{\bf{r}}}}}}}})}{{\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}},i}({{{{{{{\bf{r}}}}}}}})}\right]\right)\partial \ln|{\psi }_{{{{{{{{\boldsymbol{\theta }}}}}}}},i}({{{{{{{\bf{r}}}}}}}})|\right].$$
(17)

Variance matching

As far as relative energies are concerned most computational chemistry methods rely heavily on the cancellation of error. While quantum Monte Carlo methods using neural network-based trial wavefunctions provide highly accurate total energies, the flexibility of these ansatzes is difficult to control which can lead to varying qualities of approximations for different states. In order to account for potential imbalances we utilize the variance of the wavefunctions as a measure of the quality of the approximation (zero-variance principle) and employ a variance-matching scheme. Variance-matching techniques as well as variance extrapolation have typically been applied by optimizing a family of ansatzes and comparing variances across the optimized wavefunctions53. Instead of training multiple ansatzes we checkpoint wavefunctions during the training and compute excitation energies by rewinding the ground state to match the variance of the excited state as depicted in Fig. 8. The mean and variance of each wavefunction are computed over the batch dimension at each step in training and smoothed with an exponential walking average. For the final estimation of excitation energies, the respective wavefunctions are then sampled exhaustively as in the usual evaluation process. While the variance matching hardly impacts the excitation energies for small systems, for larger and harder-to-optimize systems, such as benzene, it becomes increasingly relevant.

Fig. 8: Sketch of the variance-matching procedure.
figure 8

The excitation energy of the benzene calculation at step 4000 is obtained for illustration purposes. The variance (b) of the excited state is higher than that of the ground state and is therefore matched with the variance of the ground state at a previous iteration. The excitation energy is computed by comparing the mean energies (a) at the respective iterations. This acts to reduce the excitation energy and is found to improve the results in all of our experiments.

Spin treatment

PauliNet encodes only the spatial part of the wavefunction and its like-spin antisymmetry explicitly12, while the spin part, which guarantees the opposite-spin antisymmetry, is only implicit. Every spin-assigned spatial ansatz such as PauliNet is always an eigenstate of \({{{{{{{{\mathcal{S}}}}}}}}}_{z}\) with an eigenvalue of \(M=\frac{1}{2}({N}_{\uparrow }-{N}_{\downarrow })\), but it may not be an eigenstate of \({{{{{{{{\mathcal{S}}}}}}}}}^{2}\). The spatial part of eigenstates of \({{{{{{{{\mathcal{S}}}}}}}}}^{2}\) is characterized by specific sets of permutational symmetries involving opposite-spin electrons55. PauliNet does not enforce these symmetries but instead attempts to learn them through the variational principle because eigenstates of the Hamiltonian are also eigenstates of \({{{{{{{{\mathcal{S}}}}}}}}}^{2}\). Therefore, we do not, in general, control the spin of the eigenstates found in the optimization procedure—they are simply found in the order of increasing energy, independent of spin. The spin of a found eigenstate can be obtained in principle by Monte Carlo sampling56. Whether a particular spin state is found in practice may be influenced by the spin of the CASSCF baseline wavefunction, which we, therefore, report in Supplementary Table VIII. In special cases, we may wish to target a specific spin state (e.g., see the section “Multi-reference application: conical intersections”), and for that, we can take advantage of the orbital-assigned backflow of PauliNet. Combined with the freezing of the determinant coefficients, this ensures that PauliNet remains in the same spin state as the CASSCF baseline wavefunction.