Excited state, non-adiabatic dynamics of large photoswitchable molecules using a chemically transferable machine learning potential

Light-induced chemical processes are ubiquitous in nature and have widespread technological applications. For example, photoisomerization can allow a drug with a photo-switchable scaffold such as azobenzene to be activated with light. In principle, photoswitches with desired photophysical properties like high isomerization quantum yields can be identified through virtual screening with reactive simulations. In practice, these simulations are rarely used for screening, since they require hundreds of trajectories and expensive quantum chemical methods to account for non-adiabatic excited state effects. Here we introduce a diabatic artificial neural network (DANN) to accelerate such simulations for azobenzene derivatives. The network is six orders of magnitude faster than the quantum chemistry method used for training. DANN is transferable to azobenzene molecules outside the training set, predicting quantum yields for unseen species that are correlated with experiment. We use the model to virtually screen 3,100 hypothetical molecules, and identify novel species with extremely high predicted quantum yields. The model predictions are confirmed using high accuracy non-adiabatic dynamics. Our results pave the way for fast and accurate virtual screening of photoactive compounds.

Light is a powerful tool for manipulating molecular systems. It can be controlled with high spatial, spectral and temporal precision to facilitate a variety of processes, including energy transfer, intermolecular reactions, and photoisomerization [1]. These processes are used in areas as diverse as synthesis, energy storage, display technology, biological imaging, diagnostics and medicine [1][2][3]. Photoactive drugs, for instance, are photoswitchable compounds whose bioactivity can be toggled through light-induced isomerization. Precise spatiotemporal control of bioactivity allows photoactive drugs to be delivered in high doses with minimal off-target activity and side effects. Such therapeutics are a promising path for the treatment of cancer, neurodegenerative diseases, bacterial infections, diabetes, and blindness [4,5].
Theory plays a key role in explaining and predicting photochemistry because empirical heuristics learned from thermally activated ground state processes typically do not apply to excited states [3]. Computer simulations based on quantum mechanics can achieve impressive accuracy in the prediction of experimental observables. These include the isomerization efficiency and absorption spectrum of photoswitchable compounds [6,7], which are key quantities in the design of photoactive drugs. However, ab initio methods in photochemistry are severely limited by their computational cost [8]. In order to gather meaningful statistics for one molecule, hundreds of replicate simulations are needed, each of which involves thousands of electronic structure calculations performed in series with sub-femtosecond timesteps. The individual quantum chemical calculations are particularly demanding, requiring excited state gradients and some treatment of multi-reference effects. In some cases, both the ground- and excited-state gradients are required at each time step [9][10][11]. Using ab initio methods to compute photochemical properties of tens or hundreds of molecules is impractical, and photodynamic simulations have not yet been used for large-scale virtual screening.

* Corresponding author: rafagb@mit.edu
Among the most accurate and expensive electronic structure methods are multi-configuration perturbation techniques [12][13][14][15][16], but their cost and requirement for manual active space selection limit their use in virtual screening. The photochemistry community has made exciting developments over several years to overcome both of these hurdles. For example, reduced scaling techniques [17,18] and graphics processing units [19] can significantly accelerate multi-reference calculations. The density matrix renormalization group (DMRG) [20,21] and multi-reference density functional theory (DFT) methods [22][23][24][25] have expanded the size of systems that can be treated with high accuracy. DMRG has also been used to automate the selection of active spaces for multi-reference methods [26,27]. Less accurate but more affordable black-box methods include spin-flip time-dependent DFT (SF-TDDFT) and hole-hole Tamm-Dancoff DFT [28], among others [29,30]. Despite these developments, the cost of non-adiabatic simulations remains high. As discussed below, even SF-TDDFT is prohibitively expensive for virtual screening. Semi-empirical methods [31][32][33] are currently the only affordable approach for large-scale screening. They provide qualitatively correct results across many systems, but are ultimately bounded by their approximations, with average energy errors of 15 kcal/mol [32].
A different approach is to use data-driven models in place of quantum chemistry (QC) calculations. Machine learning (ML) models trained on quantum chemical data can now routinely predict ground state energies and forces with sub-chemical accuracy [34][35][36], and take only milliseconds to make predictions. These models have been successfully used in a variety of ground state simulations [35,37,38]. They have also been used to accelerate non-adiabatic simulations in a number of model systems [39][40][41][42][43][44][45]. However, excited state ML has not yet offered affordable photodynamics for hundreds of molecules of realistic size, which is the ultimate goal for predictive simulation in photopharmacology. Further, no excited-state interatomic potentials developed to date are transferable across compounds; existing models therefore require thousands of QC calculations for every new species to serve as training data.
Here we make significant progress toward affordable, large-scale photochemical simulations and virtual screening with ML. To develop a transferable potential we focus on molecules from the same chemical family, studying derivatives of azobenzene, a prototypical photoswitch. The derivatives studied here contain up to 100 atoms, making them the largest systems fit with excited-state ML potentials to date. Combining an equivariant neural network [35] and a physics-informed diabatic model, together with data generated by combinatorial exploration of chemical space, and configurational sampling through active learning, we produce a model that is transferable to large, unseen derivatives of azobenzene. This yields computational savings in excess of six orders of magnitude. Predicted isomerization quantum yields of unseen species are well-correlated with experimental values. The model is used to predict the quantum yield for over 3,100 hypothetical species, revealing rare molecules with extremely high cis-to-trans and trans-to-cis quantum yields.

Results
Azobenzene photoswitches. This work focuses on the photoswitching of azobenzene derivatives, but the methods are general and can be applied to other chemistries and other excited state processes. Azobenzene derivatives can exist as cis and trans conformers. The conformations are local minima in the ground state, but not in the excited state. Photoexcitation of either can therefore induce isomerization into the other (see the potential energy schematics in Figs. 1(a) and 2(b)). A key experimental observable is the quantum yield, defined as the probability that excitation leads to isomerization. The yield depends critically on the dynamics near conical intersections (CIs), configurations in which the excitation energy is zero. In these regions the electrons can return to the ground state with non-zero probability.
Many approaches have been developed over several decades to model such non-adiabatic transitions. These include ab initio multiple spawning [11] and cloning [46]; Ehrenfest dynamics [9,10]; coherent switching with decay of mixing [47]; the variational multi-configurational Gaussian method [48]; exact factorization [49][50][51][52][53]; the multi-configuration time-dependent Hartree (MCTDH) method [54,55]; Gaussian MCTDH [56]; and trajectory surface hopping [57]. A recent review of these methods can be found in Ref. [3]. Surface hopping is a popular approach because of its simplicity and efficiency. In this method, independent trajectories are simulated with stochastic hops between potential energy surfaces (PESs). Depending on the curvature of the PESs and the location of the hop, a trajectory can end in the original isomer or in a new isomer (Figs. 1(a) and 2(b)). The quantum yield is the proportion of trajectories that end in a new isomer. Our goal is to predict the quantum yield of azobenzene derivatives after excitation from the singlet ground state (S0) to the first singlet excited state (S1). This can be accomplished with the surface hopping approach described above, using a fast surrogate ML model to generate the PESs. The impact of considering only the first excited state is discussed in Supplementary Sec. IV.
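Operationally, the yield estimate from a surface-hopping ensemble is simply the reactive fraction of trajectories. A minimal sketch (not the paper's code; classifying the isomer from the final CNNC dihedral is a simplifying assumption made here for illustration):

```python
import numpy as np

def quantum_yield(final_dihedrals_deg, start_isomer="cis"):
    """Fraction of surface-hopping trajectories ending in the other isomer.

    Classifies each trajectory by its final CNNC dihedral: magnitudes near
    0 deg -> cis, near 180 deg -> trans (an illustrative criterion only).
    """
    # Wrap dihedrals into (-180, 180] and take the magnitude.
    d = np.abs((np.asarray(final_dihedrals_deg) + 180.0) % 360.0 - 180.0)
    end_is_trans = d > 90.0
    reactive = end_is_trans if start_isomer == "cis" else ~end_is_trans
    return reactive.mean()

# Toy ensemble: 6 of 10 cis-launched trajectories end near 180 deg.
dihedrals = [178, -175, 170, 160, -165, 179, 5, -12, 8, 20]
print(quantum_yield(dihedrals, "cis"))  # -> 0.6
```

The same counting applies per pathway, which is how the pathway-resolved yields later in the text are combined.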
ML architecture and training. Our model is based on the PaiNN neural network [35], which uses equivariant message-passing to predict molecular properties. In this approach, an initial feature vector is generated for each atom using its atomic number. The vector is then updated through a set of neural network operations involving "messages", which incorporate the distance, orientation, and features of atoms within a cutoff distance. A series of updates leads to information being aggregated from increasingly distant atoms. Once the updates are complete, the atomic features are mapped to molecular energies using a neural network.

Figure 1. Depiction of the potential energy surfaces in azobenzene derivatives. (a) S0 and S1 adiabatic energies, with the CI region shaded in gray. Initial excitation is shown with a vertical zigzag line. Trajectories prior to hopping are shown in black. Reactive and unreactive trajectories after hopping are shown in green and yellow, respectively. (b) Diabatic energies dnm ≡ (H_d)nm. The diagonal diabatic elements cross and become re-ordered along the isomerization coordinate. A CI occurs when the diagonal diabatic elements cross and the off-diagonal element becomes zero.
This architecture can be used to predict energies and, through automatic differentiation, the forces for each state. However, models that predict adiabatic energies have a basic shortcoming for non-adiabatic molecular dynamics (NAMD). Since surface hopping is largely controlled by the energy gap when it is close to zero, small errors in the energies can lead to exponentially large errors in the hopping probability [58,59]. This in turn can cause large errors in observable quantities like the quantum yield. This point is discussed in further detail in Supplementary Sec. II A. Further, since CIs are non-differentiable cusps in the energy gap, they are difficult to fit with neural networks. For N atoms in a molecule, the network must predict two different energies that are exactly equal in 3N − 8 dimensions. We found this to be particularly challenging for trans species that are outside the training set. As shown in Supplementary Sec. VII, small errors in the gap lead to the incorrect prediction that many species never hop to the ground state.
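To see why small gap errors are so damaging, consider a schematic Landau-Zener-type hopping probability, P = exp(−c·ΔE²). This is an illustration only, not the hopping scheme used in this work, and the constant c (which lumps together nuclear velocity and surface slopes) is arbitrary:

```python
import math

def lz_hop_probability(gap, c=50.0):
    """Schematic Landau-Zener-type hopping probability, P = exp(-c * gap^2).

    `c` is an arbitrary illustrative constant, not a fitted physical value.
    """
    return math.exp(-c * gap ** 2)

true_gap, fitted_gap = 0.05, 0.15   # a 0.1 overestimate (arbitrary units)
p_true = lz_hop_probability(true_gap)
p_fit = lz_hop_probability(fitted_gap)
# A small absolute gap error near the CI changes P by a factor of e here.
print(p_true / p_fit)  # -> 2.718... = e
```

Because the gap enters the exponent quadratically, a fixed absolute error matters most exactly where hopping occurs, i.e. near the CI.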
To remedy this issue we introduce a model based on diabatic states, which we call DANN (diabatic artificial neural network; Fig. 2(a)). The approach builds on previous work using neural networks for diabatization [60][61][62]. Much of that work applies only to specific system types, such as semi-rigid molecules [61] and coupled monomers, and none of it has been applied to large systems with significant conformational changes [60,62], such as azobenzene derivatives. Further, whereas previous work developed diabatic states for their favorable theoretical properties, our work uses diabatization to ease the fitting of adiabatic states across chemical space; in particular, it addresses gap overestimation near conical intersections of unseen species, as described in Supplementary Secs. II and VII. We also note that gap overestimation in unseen species is both a newly-identified and newly-addressed problem, as previous work in ML-NAMD focused on single species only [39][40][41][42][43][44][45].
The diabatic energies form a non-diagonal Hamiltonian matrix, H_d, which is diagonalized to yield adiabatic energies. When a 2×2 sub-block of H_d has diagonal elements that cross, and off-diagonal elements that pass through zero, a CI cusp is generated (Fig. 1). The diabatic energies that generate the cusp are smooth, which makes them easier to fit with an interpolating function than the adiabatic energies. In the DANN architecture, smoothness is imposed through a loss function related to the non-adiabatic coupling vector (NACV). The loss minimizes the value that the NACV takes when it is rotated from the adiabatic basis (Eq. (3)) into the diabatic basis. The NACV measures the change in overlap between two wavefunctions after a small nuclear displacement. If the NACV between two states is zero, then their wavefunctions must change slowly in response to a nuclear perturbation. Therefore, their energies cannot form the cusp in Fig. 1(a), and must instead resemble the smooth energies in Fig. 1(b).
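The diabatic-to-adiabatic transformation for a 2×2 block can be sketched directly. The closed-form gap makes explicit that a CI requires both a diagonal crossing and a vanishing off-diagonal element, as the text states:

```python
import numpy as np

def adiabatic_from_diabatic(d00, d11, d01):
    """Adiabatic S0/S1 energies as eigenvalues of a 2x2 diabatic block."""
    H = np.array([[d00, d01], [d01, d11]])
    e0, e1 = np.linalg.eigvalsh(H)  # eigenvalues sorted ascending
    return e0, e1

def gap(d00, d11, d01):
    """Closed-form adiabatic gap: zero only if d00 == d11 AND d01 == 0."""
    return np.sqrt((d11 - d00) ** 2 + 4.0 * d01 ** 2)

e0, e1 = adiabatic_from_diabatic(1.0, 2.0, 0.3)
print(e1 - e0, gap(1.0, 2.0, 0.3))  # both ≈ 1.166 (= sqrt(1.36))
print(gap(1.5, 1.5, 0.0))           # 0.0 -> conical intersection
```

Because the smooth quantities d11 − d00 and d01 each pass linearly through zero, the gap behaves like a cone near the intersection, while the cusp in the adiabatic surfaces emerges automatically from the diagonalization.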
Unlike traditional TDDFT [66], SF-TDDFT provides an accurate description of the CI region [67], and, unlike multi-reference methods, is fairly fast and requires no manual parameter selection. The configurations were sampled from 8,269 azobenzene derivatives, of which 164 were taken from the experimental literature. The remaining molecules were generated from combinatorial substitution using common literature patterns (Supplementary Tables S10 and S11).
The data generation process is shown in Fig. 2. Initial data was generated through ab initio NAMD [37].
The performance of the model is summarized in Table I. Statistics are shown for both seen and unseen species. The former contains species that are in the training set, but with geometries that are outside of it. The geometries were selected with the balanced sampling criteria described in Supplementary Sec. X. Geometries from unseen species were generated with DANN-NAMD using the final trained model. Half of the DANN-NAMD geometries were selected randomly from the full trajectory and half by proximity to a CI (Supplementary Eq. (S13)). 100 configurations were chosen for each molecule.
For species in the training set, all quantities are accurate to within approximately 1 kcal/mol(/Å). Apart from the NACV, all quantities have R² correlation coefficients close to 1. The R² of the NACV is 0.84. This may be somewhat low because diabatization cannot remove the curl component of the NACV in the diabatic basis [68,69]. This would also explain the low R² value for the NACV in Ref. [42], which computed it as the gradient of a scalar. For molecules outside the training set, all quantities apart from the energies have an error below 3 kcal/mol(/Å). The energy gaps and ground state forces have R² correlation coefficients near 1. The gap error of 1.89 kcal/mol should be contrasted with the error of 15 kcal/mol in Ref. [32], which applied semi-empirical methods to azobenzene. The errors in the excited state forces are slightly larger, but still quite low. The correlation coefficient for the force NACV g01 is rather poor. As described in Supplementary Sec. VII, the yields of trans derivatives are better correlated with experiment when using Zhu-Nakamura surface hopping than Tully's method. The latter uses the NACV and the former does not, so part of the difference may be explained by the high error in the force NACV. Nevertheless, there is still reasonable agreement between Tully's method and experiment, suggesting that errors in the force NACV do not spoil the dynamics.

Figure 3(a) shows snapshots from an example DANN-NAMD trajectory, and panel (b) shows random samples of the hopping geometries. Reactive hopping geometries are shown on top, and non-reactive ones are shown below. The molecule is the (aminomethyl)pyridine derivative 26, with the species numbering given in Supplementary Tables S12 and S13. The overlays show cis-trans isomerization proceeding through inversion-assisted rotation, consistent with previous work [70].
The dominant motion is rotation, with the CNNC dihedral angle increasing in magnitude from −10° at equilibrium to −86° at the hopping points. Significant changes also occur in the CNN and NNC angles, with each transitioning from 123° to either 113° or 135°. The predicted PES in the branching space (g, h) is shown beside the geometries. g is the direction of the force coupling and h ∝ ∇_R(ΔE01) is the direction of the gap gradient. Each vector was computed with automatic differentiation using Eq. (1). The diabatic energies, adiabatic energies, and gap are shown from top to bottom. We see that the model generates a true CI, in which the S0 and S1 energies are exactly equal. Further, the degeneracy is lifted in both the g- and h-directions, so that the S1 energy and gap each form a characteristic cone. These hallmarks of CIs are built into the model because the adiabatic energies are eigenvalues of a diabatic matrix. For example, the cone emerges from the fact that d11 − d00 and d01 each pass linearly through zero in different directions [71].

Figure 3(c) indicates that the predicted and experimental quantum yields of unseen species are correlated. The yields are for the 33 cis and trans species with experimental data in Supplementary Table S12. The R² value is 0.42, and the Spearman rank correlation coefficient ρ is 0.74. While the R² value is somewhat low, the Spearman rank correlation is high. The Spearman coefficient measures the accuracy with which the model ranks species by quantum yield. ρ only compares orderings, while R² compares the model error to the error of a mean predictor. This means that ρ is a more forgiving metric, and also a more relevant metric for virtual screening. Since cis isomers have yields 2 to 3 times higher than trans isomers, the high value of ρ means that the model properly separates the isomers into low- and high-yield groups.
Further, as shown in Supplementary Figs. S5 and S7, the model produces meaningful rankings among trans species. The correlation coefficients are ρ = 0.32 using Tully's method [57] and ρ = 0.57 using the Zhu-Nakamura approach [72]. Several molecules are of particular interest; they are color-coded in the plots, with the legend given below. A full list of predictions is given in Supplementary Table S12. We see, for example, that the (aminomethyl)pyridine derivatives 1 and 35 are both predicted to have near-zero yields. These species do not isomerize from trans to cis, because strong N-H hydrogen bonds lock the planar trans conformation in place [73]. Replacing the NH group in 1 with N−CH3 gives species 25. This molecule isomerizes because there is no hydrogen bonding. This, too, is predicted by the model. Further, the hepta-tert-butyl derivative 17 has an experimental and predicted yield of zero. This is likely because of steric interactions among the bulky tert-butyl groups. While able to account for these two different mechanisms, the model fails to predict the subtle electronic effects in species 11 and 29.
Resonance interactions between oxygen lone pairs and the azo group modify the PES, such that there is no rotational CI [74]. There is instead a concerted inversion CI, which occurs too early along the path between trans and cis to allow for isomerization. The changes in the PES may either be too small or too specific to the substituents for the model to predict without fine-tuning. Finally, derivatives with high yields are partly distinguished from those with low but non-zero yields. An example is 21, whose experimental yield of 10% is half that of trans-azobenzene. The model properly identifies this molecule as having a low yield, but also mistakenly does the same for several high-yield species. The accuracy for unseen species could be improved with transfer learning, in which the model is fine-tuned with a small number of calculations from a single molecule (discussed below). This would increase the computational cost, but would still be orders of magnitude less expensive than ab initio NAMD. While meaningful correlations are produced for trans species, the same is not true of cis molecules (ρ = 0.02). This may be because there are no cis derivatives with zero yield. Nevertheless, the model properly identifies 20 as having the highest yield. Further, it does not mistakenly assign a zero yield to any derivative. This is noteworthy because, as shown in Fig. 4(a) and (b), several hypothetical cis species are predicted to have zero yield. Synthesis of non-switching cis derivatives and comparison to predictions could therefore be of interest in the future.
Overall, we observe moderate correlation between predicted and experimental yields. The Spearman correlation is high when including both isomers, moderate for trans isomers, and low for cis isomers. The R² value, a measure of numerical error compared to that of a mean predictor, is moderate when including both isomers and near-zero when separating them. Indeed, the MAEs of the mean predictor are 9.5%, 10.3%, and 17.7% for trans, cis, and all species, respectively. The model MAEs before (after) subtracting the mean signed error are 14.4% (13.5%), 11.5% (11.2%) and 13.2% (13.0%). In addition to model error, sources of error include inaccuracies in SF-TDDFT, approximations in surface hopping, solvent effects, and experimental uncertainty. These are discussed in depth in Supplementary Sec. IV. Each source of error affects both R² and ρ, but is expected to have a larger effect on R². The rank correlation with experiment is encouraging given the difficulty of the task, as captured by the sensitivity of the yield to model errors in the PES [72], and given the sources of error outside the model. Further, as discussed below, DANN provides an excellent starting point for fine-tuned, molecule-specific models that can be used for high-accuracy simulations of single species.

Figure 3(d) shows that DANN-NAMD is extremely fast. The plot shows the node time, defined as t_calc/n_calc, where t_calc is the calculation time per geometry, and n_calc is the number of parallel calculations that can be performed on a single node. We see that ML speeds up calculations by five to six orders of magnitude. The direct comparison of the pre-trained model node times and QC node times is appropriate because the model generalizes to unseen species. This means that it incurs no extra QC cost for any future simulations. The minimum speedup corresponds to the smallest molecules (14 heavy atoms or 24 total atoms), and the maximum to the largest molecules (70 heavy atoms or 99 total atoms).
This reflects the different scaling of the QC and ML calculations. Empirically we see that DANN scales as N^0.49 for N heavy atoms, while SF-TDDFT scales as N^2.8. These values come from fitting the timings to t = A·N^x, where t is the computational time, A and x are fitted constants, and N is the number of heavy atoms. DANN's apparent sub-linear scaling is an artifact of diagonalizing H_d; when the diagonalization is removed, the scaling becomes linear. This is the expected scaling for a message-passing neural network with a fixed cutoff radius. Evidently diagonalizing H_d introduces a large overhead with weak dependence on system size. Nevertheless, we see that DANN is still quite fast.
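Exponents of this kind follow from an ordinary least-squares fit in log-log space. A sketch with synthetic timings (the data here are illustrative, not the measured node times):

```python
import numpy as np

def fit_power_law(n_atoms, times):
    """Fit t = A * N**x by linear least squares on log t vs. log N."""
    x, logA = np.polyfit(np.log(n_atoms), np.log(times), 1)
    return np.exp(logA), x

# Synthetic timings generated with exponent 2.8 (illustrative only),
# spanning the 14-70 heavy-atom range quoted in the text.
N = np.array([14.0, 20.0, 30.0, 45.0, 70.0])
t = 1e-3 * N ** 2.8
A, x = fit_power_law(N, t)
print(round(x, 2))  # -> 2.8
```

On real timings the fit is only approximate, which is why a constant overhead (such as the H_d diagonalization) can masquerade as sub-linear scaling over a limited size range.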
Virtual screening. Having shown that the model is fast and generalizes in the chemical and configurational space of azobenzenes, we next used it for virtual screening of hypothetical compounds. We first retrained the network on all available data, including species that were originally held out, for a total of 631,367 geometries in the training set. We then predicted the quantum yields of 3,100 combinatorial species generated through literature-informed substitution patterns, as in Ref. [75]. This screen served two purposes. The first was to gather statistics about the distribution of photophysical properties of azobenzenes at a scale not accessible to experiments or traditional simulations. The second was to identify molecules with rare desirable properties. In particular, we sought to find molecules with high c→t or t→c quantum yields and redshifted absorption spectra. The former is important because increasing the ratio QY(a→b)/QY(b→a), where QY is the quantum yield, can lead to more complete a→b transformation under steady-state illumination. This is critical for precise spatial control of drug activity when the two isomers have different biological effects [76]. Redshifting is a crucial requirement for photoactive drugs, since human tissue is transparent only in the near-IR [76].
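The role of the yield ratio can be illustrated with the standard two-state photostationary-state expression (neglecting thermal back-isomerization; the absorption coefficients and yields below are illustrative numbers, not values from this work):

```python
def pss_fraction_trans(eps_cis, eps_trans, qy_ct, qy_tc):
    """Photostationary trans fraction for a two-state photoswitch.

    Standard two-state kinetic result: the pumping rates eps * QY
    balance at steady state (thermal relaxation neglected).
    """
    k_ct = eps_cis * qy_ct      # cis -> trans pumping rate (arb. units)
    k_tc = eps_trans * qy_tc    # trans -> cis pumping rate
    return k_ct / (k_ct + k_tc)

# Illustrative case: equal absorption at the pump wavelength, with the
# 54% / 24% average yields quoted below for the screened ensemble.
print(round(pss_fraction_trans(1.0, 1.0, 0.54, 0.24), 2))  # -> 0.69
```

Raising QY(c→t) relative to QY(t→c), or shifting the absorption ratio, pushes this fraction toward 1, i.e. toward complete c→t conversion under illumination.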
The results are shown in Fig. 4. Panels (a) and (c) show the predicted yield vs. mean gap. For each species we averaged the gap over the configurations sampled during neural network ground state MD. The thermal averaging led to a typical blueshift of 0.2-0.3 eV relative to the gaps of single equilibrium geometries. The mean excitation energies are 2.95 eV for cis derivatives and 2.84 eV for trans species; the gaps are 2.98 eV and 2.97 eV for the respective unsubstituted compounds. The average gaps and their differences are similar to experimental measurements for azobenzene [77]. The average c→t and t→c yields are 54% and 24%, respectively, while those of the unsubstituted species are 59% and 37%. These are consistent with experimental results in non-polar solution, for which the base compound has yields of 44-55% and 23-28% [77]; the former is closer to 55% on average. However, the yield of the base trans compound is overestimated with respect to both theory and experiment [7,72,77]. The mean (median) proportion of trajectories ending in the ground state after 2 ps is 92% (100%) for cis species and 31% (17%) for trans species. The standard deviations are 25% and 30%, respectively. Panels (b) and (d) show the yield plotted against the isomeric stability, defined as E_trans − E_cis for trans isomers and E_cis − E_trans for cis isomers. The energy E is the median value over the configurations sampled in the ground state; we used the median to reduce the effect of outlier geometries. On average the trans isomers are more stable than the cis isomers by 0.66 eV (15.3 kcal/mol), consistent with experimental values of over 10 kcal/mol for azobenzene [78]. The stability is of interest for three reasons. First, a large absolute value indicates that one isomer is dominant at room temperature. This is essential for photoactive drugs, and is the case for regular azobenzene.
Second, an inverted stability, in which cis is more stable than trans, allows for stronger absorption at longer wavelengths. This is because the dipole-forbidden n−π* (S1) transition is significantly stronger for cis than for trans [77]. Third, in photopharmacology, one often wants to deliver a drug in inactive form, and activate it with light in a localized region. If trans happens to be active and cis inactive, then localized activation is only possible if cis is more stable.
Several species of interest are shown in Fig. 4. The molecules 165 and 166 have predicted c→t yields of 75 ± 6% and 72 ± 6%, respectively, which are well above the cis average of 55%. The species 169 and 170 have predicted t→c yields of 66 ± 7% and 63 ± 10%, respectively, which are three times the average trans yield. Molecule 167 is highly redshifted, with a mean predicted gap of 2.26 eV (548 nm), and a standard deviation of 0.87 eV. QC calculations on the geometries sampled with DANN gave a gap of 2.26 ± 0.61 eV, in good agreement with predictions. The mean gap is lower than the median of 2.52 eV, which reflects the presence of several ultra-low-gap structures. The low gap and large variance mean that 167 may be able to absorb in the near-IR. The redshifting is likely due to the six electron-donating groups, which increase the HOMO energy, together with the crowding of the four ortho substituents. The latter distorts the molecule, leading to twisted configurations with smaller gaps. Finally, species 168 is more stable in cis form than trans form. The predicted cis stability is −0.79 eV (−18 kcal/mol), in good agreement with the QC prediction of −0.92 eV (−21 kcal/mol). As mentioned above, this inverted stability can be a desirable property for photopharmacology.
To validate the yield results, we performed DANN-NAMD using highly accurate species-specific models. As described in Supplementary Sec. XIV B, we generated a model for each species by refining the base network with data from that species alone. The data was generated through several active learning cycles, resulting in 1,200 to 2,500 training geometries for each compound. We used this approach in place of ab initio NAMD because of the latter's prohibitive cost for large molecules. The QC computational cost for fine-tuning was at most 3% of that of an ab initio simulation, and hence far less demanding. The average gradient calculation for a molecule with 50 atoms took 58 minutes for two surfaces using 8 cores, and the average NACV calculation took 55 minutes. Fine-tuning with 2,000 geometries for a medium-sized molecule would thus take 30,000 core hours. For ab initio NAMD, a conservative estimate of 100 trajectories run for 1 ps each, with only one gradient computed per frame, would take 780,000 core hours.
We also computed the yields of cis and trans azobenzene for comparison. For these species we used full ab initio simulations, because of the central role of the unsubstituted compound as a reference point and because simulations were fairly affordable for such small molecules. Issues with spin contamination also hampered the fine-tuning process for these compounds (see Supplementary Sec. XIV B).
Initially we generated refined models for species 165, 167, 169 and 170. It became clear early on that only 165 and 169 had high yields, and so we focused on those molecules thereafter. Using molecule-specific models, we computed the quantum yields of 165 and 169 to be 66 ± 1% and 37 ± 1%, respectively. The computed yields for cis and trans azobenzene are 60 ± 4% and 26 ± 3%, respectively, which are in excellent agreement with experiment [77]. Both of the new molecules have higher quantum yields than the associated base compounds. The improvement is particularly large for species 169: its yield is 11 points higher than that of trans azobenzene, a relative enhancement of 42%. We argue below that this significant increase has an intuitive physical explanation.
The dynamics of the four molecules are shown in Fig. 5. Panels (a) and (b) show the CNNC dihedral angle vs. time for azobenzene, and panels (d) and (e) show the same for the derivatives. Both the substituted and unsubstituted cis isomers rapidly proceed through a rotational CI, but the derivative rotates much more quickly. Indeed, we see that the isomerization of the derivative is complete within 75 fs, while the base compound takes nearly 130 fs. The excited state lifetimes are 34.2 ± 0.3 fs and 63 ± 3 fs for the derivative and base compound, respectively, indicating that the former reaches the CI earlier than the latter. These observations may explain the enhanced yield, since a higher rotational velocity leads to more efficient isomerization [79]. We also note that the derivative rotates in only the counter-clockwise direction, while cis azobenzene rotates in both directions, but this is not expected to affect the yield.
The two trans molecules behave in qualitatively different ways. In trans azobenzene, the distribution of dihedral angles slowly widens with time (Fig. 5(b)). This is consistent with a rotational barrier [7,72]: different trajectories overcome the barrier at different times, so the torsion angle becomes uniformly distributed. Additionally, as seen in the marginal dihedral distribution of Fig. 5(c), many of the geometries hop near 180°. This agrees with Ref. [7], which identified a non-reactive planar CI and a reactive twisted CI as the main hopping points for trans azobenzene. The non-reactive CI leads exclusively back to trans, while the reactive CI leads to cis and trans in different proportions. Using the method described in Supplementary Sec. XIV C, we found that 26% of the trajectories proceed through the planar CI and 74% through the rotational CI. This is the same distribution reported in Ref. [72]. Approximately 36% of the rotational trajectories generate cis azobenzene, giving an overall yield of 26%. This is in good agreement with previous computational and experimental values [7].
By contrast, nearly all trajectories of 169, including non-reactive trajectories, rotate significantly. This can be seen in the marginal dihedral distribution in Fig. 5(f), in which the hops are tightly localized around 180 ± 77°. Only 5% of the trajectories hop at the planar CI, which is five times lower than for the base compound. Additionally, the yield of the rotational trajectories increases from 36% to 40%. The inhibition of the planar CI pathway, together with the enhancement of the rotational yield, leads to an overall yield increase from 26% to 37%. While the enhanced reactive yield does not have a simple explanation, the reason for the planar pathway inhibition can be clearly seen in Fig. 5(e). Whereas the rotation of 51 is stochastic, leading to a uniform distribution of angles, the rotation of 169 is initially concerted. Nearly all trajectories rotate in unison to a dihedral angle of 180 ± 45° at 300 fs. Past 300 fs, hopping begins and the trajectories separate from each other. Hence they proceed through the rotational reactive CI, and become distributed between 0° and 360° after hopping. The planar non-reactive CI is avoided because of the molecule's initial rotation. This explanation is consistent with the presence of bulky ortho groups, which twist the equilibrium structure and hence weaken the N=N double bond. This lowers the excited state barrier to rotation, which leads to an initial torsion and hence increases the yield.
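As a bookkeeping check, the overall yields implied by the pathway splits above can be recomputed in a couple of lines (a sketch; the function name is ours, the numbers are taken from the text):

```python
# Planar CI is non-reactive, so only rotational trajectories can isomerize.
def overall_yield(frac_planar, rotational_yield):
    return (1.0 - frac_planar) * rotational_yield

# trans azobenzene: 26% planar hops, 36% rotational yield -> ~0.27 overall
yield_trans = overall_yield(0.26, 0.36)
# compound 169: 5% planar hops, 40% rotational yield -> ~0.38 overall
yield_169 = overall_yield(0.05, 0.40)
```

Both values agree with the reported overall yields (26% and 37%) to within about a percentage point.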

Discussion
The DANN model shows high accuracy and transferability among azobenzene derivatives. One limitation is that the unseen species contained functional groups that were present to some degree in the training set. Model performance was generally higher for more highly represented functional groups, though some groups were highly represented yet difficult to fit, while others were weakly represented and well-fit (Supplementary Sec. V). Moreover, the model cannot be applied to other chemical families without additional training data. Further, as shown in Supplementary Sec. VII, it substantially overestimates the excited state lifetime for a number of trans derivatives. On the other hand, semi-empirical methods provide qualitatively correct predictions across a variety of chemistries, but cannot match DANN's in-domain accuracy, and cannot be improved with more reference data. Adding features from semi-empirical calculations, as done in the OrbNet model [80], may therefore prove useful in the future. Recent developments accounting for non-local effects and spin states have improved neural network transferability [36], and could also be beneficial for excited states. The model could be further improved with high-accuracy multireference calculations, solvent effects, and the inclusion of the bright S2 state. The use of spin-complete methods in particular is of crucial importance, since spin contamination prevented fine-tuning the model for the base compounds. It may also have affected the accuracy of the DANN model in general. Thus spin-complete, affordable alternatives are of particular interest [28]. Active learning could be accelerated through differentiable sampling with adversarial uncertainty attacks [81], which would improve the excited state lifetimes. Transfer learning could also be used to improve performance for specific molecules. Only a small number of ab initio calculations would be required to fine-tune the model for an individual species.
Diabatization may also prove to be useful for reactive ground states. Reaction barriers can often be understood as transitions from one diabatic state to another [82]. The diabatic basis may make reactive surfaces easier to fit with neural networks.
In conclusion, we have introduced a diabatic multistate neural network potential trained on over 630,000 geometries at the SF-TDDFT BHHLYP/6-31G* level of theory, covering over 8,000 unique azobenzene molecules. We used DANN-NAMD to predict the isomerization quantum yields of derivatives outside the training set, and the results were well-correlated with experiment. We also identified several hypothetical compounds with high quantum yields, redshifted excitation energies, and inverted stabilities. The network architecture, diabatization approach, and chemical and configurational diversity of the training data allowed us to produce a robust and transferable potential. The model can be applied off-the-shelf to new molecules, producing results that replicate those of SF-TDDFT at orders of magnitude lower computational cost.

Network and training.
As explained in Supplementary Sec. I, a unique challenge for non-adiabatic simulations is their sensitivity to the energy difference between states. Using a typical neural network to predict energies and forces for NAMD leads to some molecules becoming incorrectly trapped in the excited state. This is partly caused by overestimation of the gap and/or an incorrectly shaped PES in the vicinity of the CI. To address this issue we introduce an architecture based on diabatic states, whose smooth variation leads to more accurate neural network fitting ( Fig. 1(b)).
In general, diabatic states must satisfy [83]

(U† (∇_R H^d) U)_nm = −f_n δ_nm + (1 − δ_nm) g_nm,   (1)

where ∇_R is the gradient with respect to R, f_n = −∇_R E_n is the adiabatic force for the nth state, and U diagonalizes the diabatic Hamiltonian through

U† H^d U = diag(E_1, ..., E_M).   (2)

The dependence on R has been suppressed for ease of notation. g_nm is the force coupling,

g_nm(R) = ⟨ψ_n(r; R)| ∇_R Ĥ(r, R) |ψ_m(r; R)⟩,   (3)

where Ĥ(r, R) is the clamped nucleus Hamiltonian, ψ_n(r; R) is the nth adiabatic wavefunction, and the matrix element is an integral over the electronic degrees of freedom r. The vector k_nm(R) is the derivative coupling:

k_nm(R) = ⟨ψ_n | ∇_R ψ_m⟩ = g_nm / (E_m − E_n),   n ≠ m.   (4)

Combined with the following reference geometry conditions (Supplementary Sec. I),

d_nn(R_ref) = E_π(n)(R_ref),   d_nm(R_ref) = 0 for n ≠ m,   (5)

where π is a fixed reordering of the adiabatic states associated with each reference geometry, we arrive at three sets of constraints, Eqs. (1), (2), and (5). In principle only Eqs. (1) and (2) are required for the states to be diabatic. However, we found the reference loss to provide a minor improvement in the gap near CIs (Supplementary Table S1). We use a neural network to map the nuclear positions R_i and charges Z_i to the diabatic matrix elements d_nm, and a loss function to impose Eqs. (1), (2) and (5). The adiabatic energies E_n are generated by diagonalizing H^d, and the forces and couplings by applying Eq. (1) and using automatic differentiation. The design of the network is shown schematically in Fig. 2(a). The general form of the diabatic loss function is

L = L_core + L_ref + L_nacv.   (6)

Here L_core penalizes errors in the adiabatic energies, forces, and gaps; L_ref imposes Eq. (5); and L_nacv imposes Eq. (1) for n ≠ m. The terms are defined explicitly in Supplementary Eqs. (S1)-(S3).
For the network itself we adopt the PaiNN equivariant architecture [35]. In this approach, a set of scalar and vector features for each atom is iteratively updated through a series of convolutions (Fig. 2(a)). In the message block, the features of each atom gather information from atoms within a cutoff distance, using the interatomic displacements. The scalar and vector features for each atom are then mixed in the update phase. Hyperparameters can be found in Supplementary Table S4. Most were taken from Ref. [35], but some were modified based on experiments with azobenzene geometries. Further details of the PaiNN model can be found in Ref. [35]. Once the elements of H^d are generated, the diabatic matrix is diagonalized to yield the transformation matrix U and the adiabatic energies E_n. The vector quantities f_n and g_nm are given by Eq. (1). When non-adiabatic couplings are not required, the f_n can be calculated by directly differentiating the E_n. This is more efficient than Eq. (1), since it requires only M_ad = 2 < M_d(M_d + 1)/2 = 6 gradient calculations. This approach was used for NAMD runs, which required only diabatic energies, adiabatic energies, and adiabatic forces, while Eq. (1) was used for training.
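The diagonalize-then-differentiate step can be illustrated with a toy two-state diabatic matrix (a sketch with made-up matrix elements, not the DANN architecture): rotating the gradient of the diabatic matrix into the adiabatic basis yields the adiabatic forces on the diagonal and the force coupling off it, as in Eq. (1).

```python
import numpy as np

# Toy 2x2 diabatic Hamiltonian along one nuclear coordinate x:
# smooth diagonal elements that cross, plus a smooth constant coupling.
def H_d(x):
    return np.array([[0.5 * x, 0.1],
                     [0.1, -0.5 * x]])

def dH_d(x):
    # analytic derivative of the diabatic matrix w.r.t. x
    return np.array([[0.5, 0.0],
                     [0.0, -0.5]])

x = 0.3
E, U = np.linalg.eigh(H_d(x))   # adiabatic energies = eigenvalues
dH_ad = U.T @ dH_d(x) @ U       # rotate the gradient into the adiabatic basis
forces = -np.diag(dH_ad)        # diagonal -> adiabatic forces f_n
g_01 = dH_ad[0, 1]              # off-diagonal -> force coupling g_nm
```

In DANN the derivative of H^d is obtained by automatic differentiation rather than analytically, but the rotation step is the same.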

Data generation and active learning.
Data was generated in two different ways. First, we searched the literature for azobenzene derivatives that had been synthesized and tested experimentally. This yielded 164 species (82 cis and 82 trans). For these species we performed ab initio NAMD, yielding geometries that densely sampled configurational space. Second, to enhance chemical diversity, we generated nearly 10,000 species through combinatorial azobenzene substitution. This was done using 48 common literature substituents and four common substitution patterns (Supplementary Tables S10 and S11). We then performed geometry optimizations, normal mode sampling, and inversion/rotation about the central N=N bond to generate configurations. QC calculations were performed on 25,212 combinatorial geometries. All calculations were performed with Q-Chem 5.3 [84], using SF-TDDFT [63] with the BHHLYP functional [65] and 6-31G* basis [64].
Two neural networks were trained on the initial data and used to perform DANN-NAMD. Initial positions and velocities for DANN-NAMD were generated from classical MD with the Nosé-Hoover thermostat [85,86]. The initial trajectories were unstable because the networks had not been trained on high-energy configurations. To address this issue we used active learning [37,38] to iteratively improve the network predictions (Fig. 2(b)). After each trajectory we performed new QC calculations on a sample of the generated geometries. For all but the last two rounds of active learning, geometries were selected according to the variance in predictions of two different networks, where the networks were initialized with different parameters and trained with different random batches. In the last two rounds, half the geometries were selected by network variance, and half by proximity to a CI. Further details are given in Supplementary Sec. XIII. The new data was then added to the training set and used to retrain the networks. The cycle was repeated three times with all species and another two times with azobenzene alone. In all, we computed ground-state gradients, excited-state gradients, and NACVs with SF-TDDFT for 641,367 geometries. 96% of the geometries were from the 164 literature species. 88% were generated through ab initio NAMD and 8% through active learning. The remaining 4% were from the combinatorial species. 1.5% were generated through geometry optimizations, 1.5% through inversion/rotation, and 1% through normal-mode sampling.
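The variance-based selection step can be sketched as follows (a minimal illustration; the function and variable names are ours, and a simple per-geometry disagreement score stands in for the ensemble variance):

```python
import numpy as np

def select_by_disagreement(energies_a, energies_b, k):
    """Pick the k geometries where two independently trained networks
    disagree most on the predicted state energies."""
    disagreement = np.abs(np.asarray(energies_a) - np.asarray(energies_b)).max(axis=1)
    return np.argsort(disagreement)[::-1][:k]

# predictions for 5 geometries x 2 states from two networks (made-up numbers)
net_a = np.array([[0.0, 2.0], [0.1, 2.2], [0.0, 1.0], [0.3, 2.1], [0.0, 2.0]])
net_b = np.array([[0.0, 2.1], [0.1, 2.2], [0.5, 1.9], [0.3, 2.1], [0.2, 2.0]])
picked = select_by_disagreement(net_a, net_b, k=2)
```

Geometries where the two networks agree are likely well covered by the training data; the selected ones are sent for new QC calculations.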
We initially set out to train a model using energies and forces alone. Since analytic NACVs are unavailable for many ab initio methods, an adiabatic architecture could have been used with a wider variety of methods. NACVs also add computational overhead, and so generating training data for an adiabatic model would have taken less time. To this end we initially used the Zhu-Nakamura (ZN) surface hopping method [79], which only requires adiabatic energies and forces. However, the issues with adiabatic models described in Supplementary Sec. VII led us to develop the diabatic approach. Since diabatic states can be used with any surface hopping method, we used the diabatic model to perform Tully's fewest switches (FS) surface hopping [57] after the last round of active learning. All results in the main text use the FS method. A comparison of FS and ZN results is given in Supplementary Sec. VII.

Data availability
The quantum chemistry data is available at https://doi.org/10.18126/unc8-336t. A detailed description of how to load and interpret the data is given in the README file. Source data of experimental and predicted quantum yields are provided with this paper.

Code availability
Trained models and computer code are available in the Neural Force Field repository at https://github.com/learningmatter-mit/NeuralForceField. Notebook tutorials explain how to train the models and perform DANN-NAMD.

A. Importance of the energy gap

Accurate prediction of the energy gap is crucial for generating reliable NAMD results. The gap controls the hopping rate, and in turn observables like the photoisomerization quantum yield, excited state lifetime and time-resolved photoelectron spectrum [87]. To understand the importance of the gap, consider that in most approaches, the hopping rate is determined by the derivative form of the NACV [57,88]. The derivative coupling between states n and m can be written as g_nm/ΔE_nm, where ΔE_nm is the energy gap and g_nm is the force coupling [83,89]. The gap in the denominator leads to a singular derivative coupling at CIs, and therefore to a guaranteed hop. The coupling can alternatively be obtained from the gap alone, through its first and second derivatives [44]. The energy difference also features prominently in the Zhu-Nakamura method, in which the hopping rate is approximately exponential in the square of the gap [79]. Therefore, it is crucial to accurately predict the energy gap in any NAMD simulation.
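The 1/ΔE dependence can be made concrete in a few lines (illustrative numbers only; a roughly constant force-coupling magnitude is assumed near the intersection):

```python
# Derivative coupling k_nm = g_nm / dE_nm: as the gap closes, the coupling,
# and with it the hopping probability, diverges.
g_nm = 0.05  # assumed force-coupling magnitude (arbitrary units)
couplings = {gap: g_nm / gap for gap in (1.0, 0.1, 0.01, 0.001)}
```

A factor-of-1000 reduction in the gap inflates the coupling by the same factor, which is why small systematic gap errors near a CI translate directly into hopping-rate errors.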

B. Diabatization
The adiabatic energies of a system, {E_n(R)}, are the energies produced by a QC calculation. They depend on the nuclear coordinates R, and form the usual Born-Oppenheimer PESs. In the adiabatic basis the nuclear kinetic energy operator, related to ∇²_R, is non-diagonal. Its off-diagonal elements are related to the NACVs, which generate transitions between adiabatic PESs [90]. The derivative form of the NACV also becomes singular at CIs, which is an undesirable numerical property. The diabatic basis is designed to remove this singularity [91]. The diabatic Hamiltonian H^d(R) is a rotation of the adiabatic energies into a new basis, given by

H^d(R) = U(R) diag(E_1(R), ..., E_M(R)) U†(R).

Here diag denotes a diagonal matrix, and U(R) is a rotation matrix that depends on the nuclear coordinates (see Eq. (2)). H^d can also be viewed as the clamped nucleus Hamiltonian expressed in the basis of diabatic wave functions. These wave functions are given by ψ_d,n(r; R) = Σ_m U_nm(R) ψ_ad,m(r; R), where ψ_ad,m is the mth adiabatic wave function and r denotes the electronic coordinates.
The diabatic states are defined such that the nuclear kinetic energy is approximately diagonal [91]. The states are instead coupled through off-diagonal elements in the potential energy matrix, known as diabatic couplings. The diabatic couplings are smooth functions of the nuclear coordinates. The diagonal elements are also smooth, maintaining their orbital character and switching energy ordering through a CI ( Fig. 1(b)). In many applications, diabatic states are preferred over adiabatic ones because the singular coupling is removed. Here we prefer them because diabatic energies are easier to fit than adiabatic energies. That is, even if one is only interested in adiabatic energies and not NACVs, it is easier to learn the diabatic energies and then diagonalize H d than to learn the adiabatic energies directly. This is because the diabatic states possess no CI cusps or avoided crossings (Fig. 1), and are thus more easily fit by interpolating functions such as neural networks. As discussed below, diabatic states improve the model accuracy for species outside the training set.
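The smoothness argument can be checked with a minimal two-state model (toy parameters of our own choosing): linearly crossing diagonal diabats with a constant coupling produce adiabats with an avoided crossing, while the diabats themselves stay smooth and simply swap energy ordering.

```python
import numpy as np

# Two-state model: diagonal diabats cross linearly; constant coupling c.
x = np.linspace(-1.0, 1.0, 2001)
d00, d11, c = x, -x, 0.05

# Adiabatic gap E1 - E0 from diagonalizing [[d00, c], [c, d11]]:
gap_adiabatic = np.sqrt((d11 - d00) ** 2 + 4 * c ** 2)

min_gap = gap_adiabatic.min()  # avoided crossing: gap never closes below 2|c|
# The diabats are smooth everywhere and swap energy ordering through x = 0:
swap = (d00[0] < d11[0]) and (d00[-1] > d11[-1])
```

The adiabatic surfaces develop the sharp avoided-crossing feature (and a cusp when c = 0) that makes them hard to interpolate, while d00, d11 and c are trivially smooth.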
While many diabatization methods exist, the most common ones cannot be straightforwardly applied to the current problem. Property-based methods were developed for charge-transfer type problems [92], while orbital-based approaches [93] are not implemented for TDDFT. Diabatic models that are parameterized by electronic structure data [61,83,94] are not designed for systems undergoing large distortions. Approaches that solve for the adiabatic-to-diabatic transformation matrix [95] are difficult to implement in practice, because the matrix varies rapidly near a CI.
Recent work introduced neural network diabatization based on reference geometries [60,96]. In this procedure one assumes that the diabatic Hamiltonian is diagonal at a set of known reference geometries. At such geometries the elements of H^d are equal to the adiabatic energies, but possibly reordered. For example, for two states and two reference geometries, one would have H^d = diag(E_0, E_1) at the first geometry and H^d = diag(E_1, E_0) at the second (Fig. 1(b)). A neural network is then trained to produce H^d, such that its eigenvalues are always equal to the adiabatic energies, and its elements are as above at the reference geometries. These two constraints generate H^d at all intermediate geometries.
This method was successfully applied to thiophenol dissociation, yielding results consistent with the fourfold way [60]. However, as shown by the NACV error in Table S1, this method alone does not generate true diabatic states for azobenzene. To understand why, consider that near a CI, the true diabatic coupling d_01 must closely resemble the coupling shown in Fig. 1. This is because d_01 must be linear in displacements about a CI, and quadratic only for Renner-Teller type intersections [71]. However, for two diabatic states, only the square of the diabatic coupling |d_01|² enters into the expression for the adiabatic energies. The model might then generate an off-diagonal element similar to |d_01|. This error would incur no penalty on the network, because the diagonal elements would properly switch ordering and the adiabatic energies would be correct. Hence the model would not generate smooth diabatic states.
To remedy this issue we used the combined loss described in the Methods section. The NACV component of this loss was also used in Ref. [62]. We note that the number of diabatic states and the choice of orderings in Eq. (5) may depend on the system. For azobenzene we were interested in fitting M_ad = 2 adiabatic states, and in general one should use M_d ≥ M_ad diabatic states. In this work we chose M_d = 3 because it significantly improved on the results of M_d = 2. In fact, with M_d = 2, the R² correlation of the force NACV was negative. The orderings in Eq. (5) were chosen by taking a small sample of azobenzene configurations, training several small models with different diabatic orderings, and picking the one with the best results. Thus no system knowledge is required to choose the orderings. The only system knowledge required is the set of reference geometries¹, but the reference loss can be omitted with negligible impact on model performance.
The diabatic model leads to true CIs, where the ground and excited state energies are exactly equal. To see why, consider Fig. 3(b), which shows the two diagonal diabatic elements (red and blue) and the off-diagonal coupling (light gray). The off-diagonal element passes linearly through zero in the h direction. The diagonal elements cross in the g direction, and so Δ ≡ d_11 − d_00 passes linearly through zero. Therefore, one can begin at a geometry for which Δ = 0, and move in the h direction until d_01 = 0 without changing Δ. The final geometry will therefore be a CI. The fact that both d_01 and Δ are locally linear, and hence must pass through zero, is known theoretically [71] and properly predicted by the model.

C. Network loss
The neural network loss terms are defined as

L_core = ρ_E Σ_n MSE(E_n) + ρ_f Σ_n MSE(f_n) + ρ_Δ Σ_{n<m} MSE(ΔE_nm),   (S1)

L_ref = ρ_ref Σ_n MSE(d_nn),   (S2)

L_nacv = ρ_nacv Σ_{n<m} MSE(g_nm),   (S3)

in which L_ref is evaluated at the reference geometries and the predicted force coupling in L_nacv is multiplied by the phase factor A_n A_m described below,

¹ Typically the reference geometries are reactants and products. For reactions in which the product is not known a priori, one might use the following approach. First, train an adiabatic network on a single molecule. We have found that adiabatic models match or outperform diabatic ones for single species. Then use the model to simulate dynamics. By using active learning to improve the model's coverage of configurational space, the simulations can eventually discover the product. If derivatives of the molecule are expected to have similar reactions, then their products are also now known. With knowledge of the reactants and products, and hence the reference geometries, one can now build a diabatic model. If this is not possible, then one can train the model without L_ref. Table S1 shows that this will likely decrease the model performance near CIs by a small amount.
where ΔE_nm = E_n − E_m is the energy gap. Each loss function is a sum over mean squared error (MSE) terms scaled by different weights ρ. For scalar molecular quantities X the loss is given by MSE(X) = (1/M) Σ_{j=1}^{M} (X̂_j − X_j)², where X̂ is the predicted quantity and the sum runs over the M geometries in a batch. For atomic vectorial quantities the mean is additionally taken over the 3L_j vector components, where L_j is the number of atoms in the jth geometry.
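A literal implementation of these MSE definitions might look as follows (a sketch; for the vector case we average per geometry and then over the batch, which is one reasonable reading of the definition, and the function names are ours):

```python
import numpy as np

def mse_scalar(pred, target):
    """MSE over the M geometries in a batch."""
    pred, target = np.asarray(pred), np.asarray(target)
    return np.mean((pred - target) ** 2)

def mse_vector(pred, target):
    """For atomic vector quantities (e.g. forces), average additionally over
    the 3*L_j components of each geometry; pred/target are lists of (L_j, 3)
    arrays, since geometries may have different atom counts."""
    per_geom = [np.mean((p - t) ** 2) for p, t in zip(pred, target)]
    return np.mean(per_geom)
```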
The loss term L core contains the usual energy and force losses, plus an additional term for gap errors. While the gap term was not used in previous work, we found it to be crucial for the systems studied here. Indeed, without this loss term, one would expect the gap MAE in adiabatic models to be the geometric sum of the energy MAEs. This is what we found when using L core without the gap loss. Table S1 shows that, when using the full core loss, the gap MAE is actually lower than each individual energy MAE.
The reference loss, L_ref, is a sum over geometries R which are considered to be cis or trans. At these geometries the target d_nn are given by Eq. (5). A geometry is considered to be cis or trans if its central CNNC atoms deviate from those of the equilibrium structure by <0.15 Å. The distance is computed as the root-mean-square deviation (RMSD) among the atoms after alignment.
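The alignment-then-RMSD test can be sketched with a standard Kabsch implementation (our own helper, not the authors' code):

```python
import numpy as np

def aligned_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal rigid alignment
    (Kabsch): center both, rotate P onto Q, then compute the RMSD."""
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    V, S, Wt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(V @ Wt))   # guard against improper rotations
    R = V @ np.diag([1.0, 1.0, d]) @ Wt
    diff = P @ R - Q
    return np.sqrt((diff ** 2).sum() / len(P))

def is_isomer(cnnc_coords, cnnc_equilibrium, threshold=0.15):
    """Label a geometry cis or trans when the aligned RMSD of its central
    CNNC atoms to the corresponding equilibrium structure is < 0.15 Angstrom."""
    return aligned_rmsd(cnnc_coords, cnnc_equilibrium) < threshold
```

Because the RMSD is computed after alignment, rigid translations and rotations of the whole molecule do not affect the cis/trans label.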
The NACV loss imposes diabaticity. It involves the force NACV g_nm and a phase correction A_n(R) = ±1. The phase correction is chosen separately for each geometry to minimize the error between the predicted and target force coupling. This factor is necessary because each adiabatic wavefunction can have an arbitrary sign [97]. The signs cancel for diagonal terms like E_n and f_n, but not for off-diagonal terms like g_nm. The A_n account for these arbitrary sign changes.
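For two states the phase factor reduces to an overall sign on the predicted coupling, and the correction amounts to choosing whichever sign gives the smaller error (a minimal sketch; the function name is ours):

```python
import numpy as np

def phase_corrected_error(g_pred, g_target):
    """Squared error of the force coupling, minimized over the arbitrary
    overall phase A_n * A_m = +/-1 of the adiabatic wavefunctions."""
    return min(np.sum((s * g_pred - g_target) ** 2) for s in (1.0, -1.0))

g_target = np.array([0.2, -0.1, 0.3])
err = phase_corrected_error(-g_target, g_target)  # a pure sign flip is absorbed
```

A prediction that differs from the target only by a global sign therefore incurs no loss, as intended.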

D. Data generation
To train a useful model one must generate reliable QC data. TDDFT [98] typically offers a good compromise between speed and accuracy for modeling excited states. However, it has known instabilities near CIs [66], and, as a result of the Brillouin theorem, produces the wrong branching space dimensionality for S0/S1 intersections [99]. These issues, which can be traced back to the single-reference description of the excitation, can be partially alleviated with SF-TDDFT [63]. In this approach the excitation is performed with respect to a high-spin reference state. This yields some transitions that have double-excitation character with respect to the singlet ground state. Here we used SF-TDDFT with the BHHLYP functional [65] and the 6-31G* basis [64]. SF-TDDFT/BHHLYP is well-benchmarked for CIs in a number of molecules [67,100]. Because SF is not spin complete, the singlet states must be identified based on their squared spin values ⟨S²⟩. We used the approach of Ref. [72], which identifies the singlets as the two states with the lowest ⟨S²⟩ among the three excitations of lowest energy. Calculations were performed with the Q-Chem package [84].
Near-CI regions of the combinatorial species were sampled by setting the central CNNC dihedral angle of relaxed geometries to ±90 degrees and/or the CNN/NNC angles to 180 degrees. The other internal coordinates were not changed. The former corresponds to the rotation pathway and the latter to inversion. This led to 11 possible combinations of rotation and inversion, including pure rotation, pure inversion, concerted inversion, and inversion-assisted rotation [77].
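Manipulating the CNNC dihedral requires computing it first; a standard four-point signed dihedral (our own helper, not the authors' tooling) is:

```python
import numpy as np

def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle in degrees for four atoms, e.g. C-N=N-C."""
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    # project the outer bonds onto the plane perpendicular to the central bond
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    return np.degrees(np.arctan2(np.dot(np.cross(b1, v), w), np.dot(v, w)))
```

Setting the CNNC dihedral to ±90° then amounts to rotating all atoms on one side of the central bond about b1 until this function returns the target value.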

II. Model accuracy and ablation studies
Here we compare the DANN model to a model trained without a NACV loss ("−L_nacv"), a model trained without a reference geometry loss ("−L_ref"), an adiabatic model, and a median predictor. The latter predicts the median ground- and excited-state energy for each species, and the median value among all species for other quantities. The MAEs for unseen species are compared in Table S1. Half the geometries were sampled randomly and half by proximity to a CI, as described in Supplementary Sec. IX. Of particular interest are (ΔE)_small, the MAE of the gap when it is under 0.2 eV, and sgn(ΔE)_small, the average overestimation of small gaps by the model. DANN actually underestimates the small gaps, indicating that it does not "miss" conical intersections. Using an adiabatic model or removing the NACV loss increases both the error and overestimation of (ΔE)_small. The same is true of removing L_ref, but the effect is smaller. While the changes in (ΔE)_small appear minor, we show in Supplementary Sec. VII that they lead to noticeable quantum yield differences for a number of species. Also of note is the difference in NACV error among the different models. Removing L_nacv substantially increases the NACV error, even when L_ref is used. Since L_nacv imposes diabaticity, this shows that the reference loss alone does not give accurate diabatic states. We also see that the adiabatic model is much worse at predicting ΔE away from the CI than the diabatic model. This is because of the large loss weight used for (ΔE)_small in the second stage of adiabatic training (see Supplementary Sec. IX). Indeed, the results after the first stage are much better for ΔE, but far worse for (ΔE)_small. Finally, the models without reference or NACV losses have far lower energy errors than DANN. It may be possible to decrease DANN's energy errors by increasing the weight of the energy loss.
A key benefit of the diabatic model is that it accurately predicts the gap near CIs. It also produces the NACVs, adiabatic energies, and forces all with one model. Without the diabatic model, one might learn the force coupling as a separate property. This has been done in other work by taking the gradient of a learnable scalar [42,44]. However, one would still have to divide by the adiabatic gap to obtain the derivative coupling. Thus gap prediction errors from an adiabatic model would still be problematic. A separate option would be to compute the NACV through the Hessian of the gap [44]. This is quite slow, and would have similar problems with the gap.

Table S2 shows the test set statistics for the DANN model trained on all species. This model was used for virtual screening, as described in the main text. Here the test set simply consists of 5,000 held-out geometries. Unlike in Table S1, the test set contains mostly seen species, and the geometries were not generated from DANN-NAMD with the trained model. The results are quite similar to the "seen species" rows in Table I.

III. Experimental data
We performed an extensive literature search to find quantum yields of azobenzene derivatives. To best compare to the vacuum conditions simulated here, we chose quantum yields measured in non-polar solvents. When multiple results were available in different non-polar solvents, we computed the prediction error relative to the mean experimental value. To compare to the simulated S1 dynamics, we selected yields measured at the peak of the longest-wavelength absorption band, which is a dipole-forbidden n−π* transition in unsubstituted azobenzene. All sources used here specifically reported yields at wavelengths close to the n−π* and π−π* absorption peaks, so it was straightforward to choose the appropriate wavelength. Lastly, we only selected yields at or around room temperature, since yields can have a strong temperature dependence [101].

A. Experimental error
There are two main sources of uncertainty in the experimental quantum yield results. The first is the error in the absorption coefficients ε of the two isomers. The absorption strengths are used in standard rate equations to compute the yield; see, for example, Refs. [102,103] and the supplementary of Refs. [104,105]. The error can be traced back to the overlapping absorption spectra of the two isomers, which must be disentangled. The magnitude of the error depends on the method used to compute ε, which itself is related to the year of publication. For example, Ref. [102], published in 1988, indicated that the absorption coefficients were the primary source of error, but did not give an estimate for their uncertainty. They isolated the different isomers using chromatography and computed their absorption coefficients separately. When the isomers could not be isolated in sufficient amounts, the authors used the time-dependent absorption approach of Ref. [106]. This method had relative errors of ±25% for n−π* quantum yields [106] (i.e., yield → yield × (1 ± 0.25)). Ref. [103], published in 2004, reported a yield of 0.7 to 1.0 for compound 20, a large range stemming from the overlap of the isomers' spectra and the method used to separate them [107]. More modern methods have lower errors; for example, Ref. [105], published in 2015, reported a relative error of only ±10% in the supplementary.
The second important source of error is the overlap of the n−π* and π−π* absorption bands. The more strongly the two bands overlap, the higher the error of using only the S1 state in the dynamics. Moreover, many experiments do not irradiate at the precise maximum of the n−π* band. Often a representative n−π* or π−π* wavelength is chosen, and then used for each derivative [101,104], even though the derivatives have different absorption peaks.
To get an idea of the range of errors this could introduce, consider the results of Ref. [108] from 1958. This work computed the quantum yield of unsubstituted azobenzene over several wavelengths. The trans to cis yield was measured as 21% at 405 nm and 27% at 436 nm (10⁻³ M solution). The difference arises because 405 nm light excites more of the π−π* band (313 nm) than 436 nm light, and 405 nm excitation leads to a lower yield because the π−π* yield is only 11%. The difference is even more pronounced for the cis quantum yield, which drops from 55% to 40% moving from 436 nm to 546 nm. This result does not have a simple explanation, as there is no lower energy band beyond 436 nm. In principle, the error due to overlapping bands could be mitigated with an explicit dipole-electric field coupling term [109], which would excite multiple states in different proportions. For derivatives with completely overlapping n−π* and π−π* bands, the S1 approximation would incur a maximum error of the S2 yield minus the S1 yield. This is because the S2 state is much brighter than S1, and would therefore dominate the experimental yield. The yield difference is about 10 percentage points for both cis and trans unsubstituted azobenzene [110,111]. Therefore, in the worst case scenario, we would expect an error of about 10 percentage points from the S1 approximation. The error would likely be closer to the 6% reported in Ref. [108] at wavelengths of moderate overlap.
We can further quantify these errors by computing the range of results from different studies. Measurements of the t → c azobenzene yield range from 20% to 28% between 1979 and 1987 [106,110,112], though this could also be related to solvent effects (Sec. IV B). The yields of compounds 9 and 10 range from 16% to 24% and 44% to 50%, respectively, with references from 1962 and 1988 [101,102]. Aggregating the results of all three compounds gives an average yield range of 7.3 percentage points, and an average relative error of ±14%.
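The aggregation in this paragraph can be reproduced directly (assuming the relative error is taken as half the range over the midpoint, which is our reading of the text):

```python
import numpy as np

# (low, high) yield ranges in percent: azobenzene t->c, compound 9, compound 10
ranges = [(20, 28), (16, 24), (44, 50)]

avg_span = np.mean([hi - lo for lo, hi in ranges])               # ~7.3 points
avg_rel = np.mean([(hi - lo) / (hi + lo) for lo, hi in ranges])  # ~0.14 -> +/-14%
```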
A final source of uncertainty is specific to Refs. [73,74]. Several species were reported to have zero yield over a large range of irradiation wavelengths. However, as noted in the footnote to Table S12, the dipole-allowed S2 transition was highly redshifted and thus overlapped strongly with the S1 transition. The absorbance changes after irradiation were quite small, indicating a near-zero yield, but could have been due to the S1 transition. Hence it is possible that the S1 yield is not precisely zero.

B. Computational error
There are several sources of computational error. The first is error in the PES, which can be decomposed into error from SF-TDDFT and error from the model. While it is intractable to perform SF-TDDFT for all species with experimental data, our results for unsubstituted azobenzene suggest that it is rather accurate. As reported in the main text, we computed yields of 60 ± 4% and 26 ± 3% for c → t and t → c, respectively. Experimental measurements from 1979 to 1987 gave t → c yields between 20% and 28% [106,110-112]. Experimental measurements in 1974 and 1979 gave c → t yields of 55% and 56% [110,111], while a measurement in 1958 gave 48 ± 5% [108]. All measurements were performed in non-polar solvents. The yields computed with the original model were 59% and 37% for c → t and t → c, respectively. Hence the main source of error seems to be the model, rather than SF-TDDFT.
Solvent effects may also affect the yield. These effects can be decomposed into a systematic term and a nonsystematic term. It has been argued that non-polar solvents systematically reduce the quantum yield relative to vacuum [113]. However, careful analysis of the experimental data reported in [114] and cited in [113] does not support this conclusion. Indeed, given the good agreement between non-polar experimental results and both SF-TDDFT and hh-TDA DFT [7], systematic effects of non-polar solvents are likely to be small. Solvent-specific effects are part of the range of reported experimental values, since different works often use different non-polar solvents. Typical ranges were discussed in Sec. IV A.
A final source of error is the approximations in surface hopping. A large body of literature has examined surface hopping's accuracy; see, for example, Refs. [115][116][117][118]. In this work, we can roughly evaluate its performance by comparing its results to those of ab initio multiple spawning (AIMS) [7]. AIMS is a fully quantum mechanical method and can thus be treated as a benchmark. We computed the t → c yield as 26 ± 3% using surface hopping with SF-TDDFT, and Ref. [7] computed the yield as 24% ± 6% using AIMS with hh-TDA DFT. While there may be some error cancellation from the different quantum chemistry methods, the good agreement is still encouraging.

V. Influence of different functional groups
Here we discuss how the model transferability depends on the functional groups in a compound. To answer this question, we analyzed each of the 40 unseen species in the test set, and computed the model error for each geometry. As described in the main text, the geometries were taken from DANN-NAMD with the trained model; half were selected by proximity to a CI and half selected randomly. For each functional group, we aggregated the errors of all geometries that contained the group, and then computed the mean.
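The group-wise error aggregation described above can be sketched as follows. The dictionary format and field names ("groups", "gap_error") are hypothetical; only the aggregate-then-average logic reflects the text.

```python
from collections import defaultdict
from statistics import mean

def aggregate_group_errors(geometries):
    """Aggregate per-geometry model errors by functional group.

    `geometries` is a list of dicts with keys "groups" (functional groups
    present in that geometry) and "gap_error" (model error for the S0/S1 gap).
    Returns the mean error per functional group. Field names are illustrative.
    """
    errors = defaultdict(list)
    for geom in geometries:
        for group in geom["groups"]:
            errors[group].append(geom["gap_error"])
    return {group: mean(errs) for group, errs in errors.items()}

# Toy example: two geometries, one sharing a nitro group with the other
geoms = [
    {"groups": {"NO2", "OMe"}, "gap_error": 0.10},
    {"groups": {"NO2"}, "gap_error": 0.30},
]
print(aggregate_group_errors(geoms))  # NO2 averages 0.20, OMe 0.10
```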
The functional groups with the lowest gap errors are shown in Fig. S1(a), and those with the highest are shown in panel (b). Table S3 shows the number of times that each functional group appears in the training set, together with the errors for all properties. Of the groups with the lowest errors, we see that both B and C are well-represented in the training set, which may explain their accurate results. The other groups each have about 1,000 samples in the training set. This is fewer than for B and C, but not negligible. These groups are also rather simple, which may partly explain their low error.
Of the groups with the highest gap error, only the nitro group G is well-represented in the training set. There are almost 10,000 training geometries with nitro groups, yet the error is still quite high. This may be because NO 2 is a strong electron-withdrawing group, and can thus have a significant effect on the electronic structure of azobenzene. Of the remaining substituents, both H and J are rather complicated, and each has only about 200 samples in the training set. Groups F and I both have about 1,000 samples in the training set, and are thus moderately represented. I may have large errors because of the electron-withdrawing effects of the three fluorines, or because it is found together with J in compound 20. F, the tert-butyl group, likely has large errors simply because it is bulky and thus leads to distorted geometries.
We see that the transferability of the model depends on how well-represented a functional group is in the training set, and how complicated its electronic effects are. The former is supported by Fig. S2, which shows that the functional group error is anti-correlated with its prevalence in the training set. However, the effect is stronger for the gap than for the excited state forces (ρ = −0.35 and ρ = −0.15, respectively), and neither fully explains the error. The remaining error is best explained by analyzing the groups themselves.

VI. Spin contamination
Here we discuss the amount of spin contamination in the training set. Figure S3 shows the expectation value of the squared spin, ⟨S²⟩, for both the ground and excited states. The spin contamination is quite low for the ground state, with ⟨S²⟩ under 1.0 for 96% of geometries. It is much higher for the excited state, with 76%, 82%, and 93% of geometries having ⟨S²⟩ under 1.0, 1.2, and 1.4, respectively. Most of the geometries with excited-state ⟨S²⟩ over 1.2 were near the non-reactive S1/S2 CI, characterized by near-planarity and CNN angles near 108° [7]. The S0/S1 CIs, on the other hand, did not have severe spin contamination. The spin contamination near the S1/S2 CI led to difficulty in fine-tuning the model for unsubstituted azobenzene. We also visually inspected a sample of geometries with extreme spin contamination, ⟨S²⟩ > 1.8; these typically involved stretched or breaking bonds. We kept these geometries in the training set so that the model could learn the high energy of bond breaking.
VII. Surface hopping results with different models

Figure S4 compares ZN hopping statistics of adiabatic and diabatic models for unseen species. Panel (a) shows the distribution of hopping percentages among species. The hopping percentage is defined as the percent of trajectories for a given species that end in the ground state. The y-axis is the percent of all species that correspond to each bin. Panel (b) has an analogous plot, but with the hopping percentage replaced by the lifetime. The lifetime was estimated from an exponential fit of the S1 population, p(t) = exp(−(t − t_on)/τ) Θ(t − t_on) + Θ(t_on − t). Here p ∈ [0, 1] is the S1 population, t is the time, t_on is the fitted turn-on time, τ is the fitted lifetime, and Θ is the Heaviside step function. Trajectories that did not contain any hops were assigned the maximum lifetime of all the other trajectories.
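As an illustration, a delayed-exponential lifetime fit of this form can be done with `scipy.optimize.curve_fit`. The synthetic t_on and τ values below are illustrative, not values from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def s1_population(t, t_on, tau):
    """Delayed exponential: p = 1 before t_on, exp(-(t - t_on)/tau) after."""
    return np.where(t < t_on, 1.0, np.exp(-(t - t_on) / tau))

# Synthetic S1 population with t_on = 20 fs and tau = 50 fs (illustrative)
t = np.linspace(0.0, 500.0, 200)
rng = np.random.default_rng(0)
p = s1_population(t, 20.0, 50.0) + 0.01 * rng.normal(size=t.size)

# Fit turn-on time and lifetime from the population trace
(t_on_fit, tau_fit), _ = curve_fit(s1_population, t, p, p0=[10.0, 100.0])
print(f"t_on ~ {t_on_fit:.1f} fs, tau ~ {tau_fit:.1f} fs")
```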
The cis derivatives have lifetimes around 50 fs, consistent with computational results for cis azobenzene [72]. Nearly 100% of all cis trajectories ended in the ground state. The trans derivatives have a wide distribution of lifetimes. Some are between 1 and 2 ps, similar to trans azobenzene [72]. Others are in the range of tens to hundreds of ps, reflecting the high proportion of trajectories that never returned to the ground state. These lifetimes are very likely incorrect. They may result from barriers between S1 minima and S0/S1 CIs, which are known to exist for trans azobenzene [7]. Relatively small errors in the barrier may lead to large over-estimations of the lifetime.
Excited state barriers make it even more important to have accurate PESs near CIs. Since trajectories spend little time near crossing regions, trapping becomes even more severe when the gap near CIs is overestimated. The diabatic model helps to address this problem: in each plot we can see that the diabatic model leads to more hopping. For example, using the adiabatic model, 21% of species have hopping percentages under 10%. This number is reduced to 13% for the diabatic model.
The diabatic model also improves the quantum yield. Figure S5 shows the ZN quantum yields with the diabatic model, and Fig. S7 shows the diabatic FS yields. The Spearman rank correlation for the trans species is fairly high with both the ZN and FS methods, with the model accurately predicting low yields for a number of species. Figure S6 shows the ZN yield with the adiabatic model. The correlation among all species is reasonable, but for trans species is rather low. This is because of three molecules with zero predicted yield, all of which became trapped in the excited state. The diabatic model only predicts zero yield for one of these. Further, the diabatic model properly predicts zero yield for species 35, while the adiabatic model does not. For these reasons the diabatic model has a fairly high correlation with experiment. Still, it is clear that excited state trapping of trans species is not a fully solved problem. Preferential sampling of excited state barriers may help to further address this issue in the future.

VIII. Architecture
All models were implemented in PyTorch [119]. The model hyperparameters are given in Table S4, and an in-depth explanation of the PaiNN parameters can be found in Ref. [35]. Note that we used five convolutions instead of the three used originally, as this substantially improved model performance. Following DimeNet [120], we allowed the k values in the radial basis functions to be updated during training. For the adiabatic model, we predicted each property as a sum over per-convolution properties, which was also used in DimeNet.
In particular, each convolution had a readout network to convert the atomic features to an output. The final property was obtained by summing each of these outputs. Several variations on the architecture were tested. For example, we trained both two- and three-state diabatic models on 5,000 azobenzene configurations. We found that adding a third diabatic state significantly decreased the error for all properties. We then trained models using all possible three-state reference orderings and chose the best one. We also experimented with intensive pooling for off-diagonal energies (see Supplementary Sec. XVI). In particular, we generated a molecular fingerprint through an attention-weighted average of atomic fingerprints. We then mapped this fingerprint to the d_nm for n ≠ m. Even though the d_nm are intensive for n ≠ m, this approach did not improve model predictions.
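A minimal PyTorch sketch of the per-convolution readout scheme is given below. The linear layers and tanh update are placeholders for the actual PaiNN message-passing blocks; the feature size and number of convolutions are illustrative.

```python
import torch
import torch.nn as nn

class PerConvolutionReadout(nn.Module):
    """Toy sketch of summing per-convolution readouts (as in DimeNet).

    Each convolution gets its own readout head; the final (extensive)
    property is the sum over all heads and all atoms.
    """
    def __init__(self, feat_dim=16, n_convs=5):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Linear(feat_dim, feat_dim) for _ in range(n_convs)]
        )
        self.readouts = nn.ModuleList(
            [nn.Linear(feat_dim, 1) for _ in range(n_convs)]
        )

    def forward(self, atom_feats):
        # atom_feats: (n_atoms, feat_dim)
        total = 0.0
        h = atom_feats
        for conv, readout in zip(self.convs, self.readouts):
            h = torch.tanh(conv(h))           # stand-in for a PaiNN convolution
            total = total + readout(h).sum()  # atomwise readout, pooled by sum
        return total

model = PerConvolutionReadout()
energy = model(torch.randn(10, 16))  # scalar prediction for a 10-atom molecule
```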
We also experimented with several adiabatic models. The model in the main text predicted E0 and E1 directly. This approach was used in previous work for single-molecule non-adiabatic dynamics [39-45] and for the prediction of absorption spectra across chemical space [121]. We also examined three models that predicted E0 directly and E1 as the sum of E0 and a learned gap. We tried three different pooling methods for the learned gap ∆E: taking the mean over atomwise gaps, taking an attention-weighted average over atomwise gaps, and applying a dense readout network to an attention-weighted sum of atomic fingerprints. All approaches performed quite poorly compared to learning E0 and E1 separately as summed atomwise energies. This is problematic, because only adiabatic models that predict E1 as E0 + ∆E can guarantee the positivity of the gap (e.g. by squaring ∆E or applying a softplus function). Hence the adiabatic model in the main text sometimes generated predictions with E1 < E0. This is impossible in the diabatic model by construction.

IX. Training

The training set contained 562,037 geometries from 8,197 species. 5,000 geometries from 308 species were used for validation. The remaining 74,322 geometries from 40 species were held out for testing, so that the predicted yields of unseen species could be compared with experiment. A different random seed was used to determine the training and validation splits for each committee model, and also to initialize the different models. After training, we ran FS DANN-NAMD on the 40 holdout species using the trained diabatic model. For each species we selected 50 geometries randomly and 50 by CI proximity (Eq. (S13)), for a total of 4,000 geometries. These geometries were used as the test set, giving the "unseen" statistics in Table I. The DANN-NAMD geometries from the diabatic network were used for all models, including the adiabatic and ablated ones in Table S1.
The "seen" statistics were generated using the validation set. For all statistics, a phase correction was applied to g 01 to minimize the prediction error, as in Eq. (S3).
The model for screening new azobenzene derivatives was trained on all species. Inclusion of the holdout species provided an additional 74,322 geometries. We used 631,367 geometries for training, 5,000 for validation, and 5,000 for testing. This corresponded to 8,215 training species, 332 validation species, and 303 test species.
Training was performed over energies and forces/force couplings in units of kcal/mol and kcal/mol/Å, respectively. Per-species reference energies were subtracted from each energy. These were obtained by summing atomic reference energies, computed using multi-variable linear regression from (atom type, count) pairs to relaxed geometry energies. Configurations with 10-σ energy and force outliers were removed prior to training. Those with forces or energies ≥ 450 kcal/mol(/Å) from the mean were also removed. We found that more stringent removal of outliers led to less stable trajectories. For example, removing 3-σ outliers and energies/forces ≥ 300 kcal/mol(/Å) led to unstable ground state trajectories for six of the 40 unseen species. The 10-σ and 450 kcal/mol(/Å) criteria led to no diverging ground state trajectories. A small proportion of excited state trajectories were still unstable. We discarded all excited state trajectories that produced NaN geometries or energies; these were at most a few percent of the trajectories for a given species. Models were trained with the Adam algorithm using a batch size of 60. Geometries were sampled for each batch using Eq. (S12), as described below. The loss was given by Eq. (6), using the parameters in Table S5. Note that the range of NACVs is approximately 10 times smaller than the range of forces, so ρ_nacv = 1, ρ_fi = 1 gives much higher weight to the forces. We experimented with a range of NACV coefficients between 0.1 and 10, and found that ρ_nacv = 1 gave the best performance.
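The outlier filter described above can be sketched as follows. It is applied here to energies only for brevity; the paper applies the same criteria to force components.

```python
import numpy as np

def outlier_mask(energies, n_sigma=10.0, abs_cut=450.0):
    """Keep configurations whose (reference-subtracted) energies lie within
    n_sigma standard deviations AND within abs_cut kcal/mol of the mean.
    Returns a boolean mask over configurations."""
    mu, sigma = energies.mean(), energies.std()
    dev = np.abs(energies - mu)
    return (dev <= n_sigma * sigma) & (dev <= abs_cut)

# 20 well-behaved configurations plus one divergent one (illustrative values)
e = np.concatenate([np.zeros(20), [5000.0]])
mask = outlier_mask(e)
print(mask.sum(), "of", mask.size, "configurations kept")
```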
The learning rate was initialized to 10⁻⁴ and reduced by a factor of two if the validation loss had not improved in 10 epochs. The final model was selected as the one with the lowest validation loss. Training was performed on a single 32 GB Nvidia Volta V100 GPU, and took 13 days to complete.
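This schedule corresponds to PyTorch's `ReduceLROnPlateau` with `factor=0.5` and `patience=10`. The toy loop below is a sketch of the schedule only, not the actual training code; the constant validation loss is a stand-in for a real validation loop.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Adam with lr = 1e-4, halved when validation loss stalls for 10 epochs
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = Adam(params, lr=1e-4)
scheduler = ReduceLROnPlateau(optimizer, factor=0.5, patience=10)

for epoch in range(30):
    val_loss = 1.0              # stand-in: loss never improves
    scheduler.step(val_loss)
    if optimizer.param_groups[0]["lr"] < 1e-5:
        break                   # diabatic models: stop once lr reaches 1e-5
```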
For diabatic models, the training was stopped once the learning rate reached 10⁻⁵. Adiabatic models were trained in a two-step process, using different loss functions in each stage. Each step ended once the learning rate fell below a certain value. The following loss function was used, with different loss trade-offs at different stages:

L = ρ_E L_E + ρ_f L_f + ρ_small L_small,

where L_small penalizes errors in gaps under 0.2 eV. The parameters for each stage are given in Table S6, and the ρ_E and ρ_f coefficients are the same as in Table S5. The first stage emphasized energy gaps and gradients, while the second stage emphasized small gaps to fine-tune the model near conical intersections. We also experimented with scheduled training and with using L_small for diabatic models, but did not find any improvements.

X. Balanced sampling
A custom data sampler was used during training because the dataset was imbalanced in the following ways. First, the combinatorial species only had a few geometries each, and so would be very rarely sampled during training. Second, there were more equilibrium trans geometries than cis geometries. Third, there were more equilibrium geometries in general than near-CI configurations. Our sampler addressed these imbalances by giving higher sampling probability to underrepresented species and/or configurations.
To explain the sampling procedure, let us define two types of sampling weights. Sampling weights that are balanced by species are denoted by w, and those that are not are denoted by v. For example, the weights w_cluster(g_{i,A}) and v_cluster(g_{i,A}) are both assigned to the i-th geometry in species A, denoted g_{i,A}, based on the cluster of configurations that it belongs to. This cluster is denoted k, where k ∈ (cis, trans, other). For a geometry g_{i,A} in cluster k, the weights are given as follows:

w_cluster(g_{i,A}) = 1 / (n_clusters · n_spec · n_{k,A}),
v_cluster(g_{i,A}) = n_A / (n_clusters · n_geoms · n_{k,A}).

Here n_{k,A} is the number of geometries of species A in cluster k, and n_A is the number of geometries in species A. n_spec is the total number of species, n_clusters is the total number of clusters, and n_geoms is the total number of geometries. The above definitions ensure that there is an equal probability P_k of sampling any cluster k:

Using w: P_k = Σ_A Σ_{g_i ∈ k,A} w_cluster(g_{i,A}) = Σ_A n_{k,A} / (n_clusters · n_spec · n_{k,A}) = 1 / n_clusters.
Using v: P_k = Σ_A Σ_{g_i ∈ k,A} v_cluster(g_{i,A}) = Σ_A n_A · n_{k,A} / (n_clusters · n_geoms · n_{k,A}) = 1 / n_clusters.

Here we have used the fact that Σ_{g_i ∈ k,A} 1 = n_{k,A}, Σ_A 1 = n_spec, and Σ_A n_A = n_geoms. The difference between the two weights is that w leads to balanced sampling among species, while v does not:

Using w: P_A = Σ_k Σ_{g_i ∈ k,A} w_cluster(g_{i,A}) = 1 / n_spec.
Using v: P_A = Σ_k Σ_{g_i ∈ k,A} v_cluster(g_{i,A}) = n_A / n_geoms.

Here we have used the fact that Σ_k 1 = n_clusters. We see that w gives equal weights for each species, whereas v gives weights proportional to the number of geometries in that species. Similarly to the cluster weights, we define Zhu-Nakamura weights w_ZN and v_ZN, such that geometries with a higher hopping probability p_ZN are sampled more often. The corresponding expressions are:

w_ZN(g_{i,A}) = p_ZN(g_{i,A}) / (n_spec · Σ_{g_j ∈ A} p_ZN(g_{j,A})),
v_ZN(g_{i,A}) = p_ZN(g_{i,A}) / (Σ_B Σ_{g_j ∈ B} p_ZN(g_{j,B})).

The overall sampling weight for a given geometry is determined by each w and v, together with two user-defined weights: P_spec ∈ [0, 1], the importance of species balance, and P_ZN ∈ [0, 1], the importance of sampling geometries with high hopping rates. P_cluster = 1 − P_ZN is the importance of sampling different clusters in a balanced way. The final sampling weight is then given by

W(g_{i,A}) = P_spec [P_ZN w_ZN(g_{i,A}) + P_cluster w_cluster(g_{i,A})] + (1 − P_spec) [P_ZN v_ZN(g_{i,A}) + P_cluster v_cluster(g_{i,A})]. (S12)

In applying Eq. (S12) we used P_spec = 0.6 and P_cluster = 0.5.
We defined a geometry as cis or trans if its RMSD with respect to the corresponding reference structure was ≤ 0.25 Å, and as "other" otherwise. In contrast to the diabatic reference geometries, the RMSD was computed using all atoms, not just the central CNNC atoms. Note that while our clustering approach was specific to cis-trans isomerization, many black-box clustering methods exist, such as hierarchical clustering [81]. Any of these methods could have been used in place of our user-defined clusters.

In principle p_ZN should depend on trajectory-specific factors such as velocity, and is therefore not a function of geometry alone. However, during simulations, hops almost always occurred at gaps below 0.5 eV. Further, since p_ZN is approximately exponential in the square of the gap, we approximated it with

p_ZN(∆E) = exp(−∆E² / (2∆E_0²)),

where ∆E is the energy gap and ∆E_0 = 0.15 eV. This choice of ∆E_0 gave p_ZN(∆E = 0.5 eV) ≈ 3 × 10⁻³, which meant that geometries were usually selected only if their gap was under 0.5 eV.

The sampling probabilities for the training set are shown in Fig. S8. Panel (a) shows the cumulative probability of sampling different species, with the compounds ordered from fewest to most geometries. Without the custom sampler the probability was quite small for most molecules, because the majority were combinatorially generated and had few geometries. 90% of the probability was contained in the last 41 species; this was reduced to 35% when using the balanced sampler. Panel (b) shows that most geometries were trans isomers and few were cis. The custom sampler gave approximately equal probability to the two isomers, and the highest probability to the "other" group. Most geometries in this group had small gaps, and hence were highly weighted by w_ZN and v_ZN. This is demonstrated in panel (c), which shows that near-CI geometries had the highest probabilities when using the balanced sampler.
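The text states that p_ZN was approximated as an exponential in the squared gap, with ∆E_0 = 0.15 eV chosen so that p(0.5 eV) ≈ 3 × 10⁻³. A Gaussian form consistent with those numbers is sketched below; the exact functional form used in the paper is an assumption here.

```python
import numpy as np

def zn_weight(gap_ev, gap0=0.15):
    """Approximate ZN hopping probability used for balanced sampling.
    Gaussian in the gap (assumed form), so p(0) = 1 and p(0.5 eV) ~ 4e-3,
    close to the quoted ~3e-3."""
    return np.exp(-np.asarray(gap_ev) ** 2 / (2 * gap0 ** 2))

print(zn_weight([0.0, 0.15, 0.5]))
```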

XI. Dynamics
Simulations were performed in two stages. First, the starting geometries and velocities were generated for each NAMD trajectory. For ab initio simulations we optimized the ground state geometry of each species and computed its normal modes. These modes were used to sample geometries and velocities from the harmonic oscillator Wigner distribution at temperature T = 300 K [122]. For neural trajectories we ran classical MD for 15 ps with the Nosé-Hoover thermostat [85,86], and selected random samples to start the NAMD trajectories. Classical MD was implemented with ASE [123]. The effective mass Q was set to (3N − 6) · τ² k_B T, where N is the number of atoms, k_B is Boltzmann's constant, T = 300 K, and τ = 25 fs is the relaxation time. Initial atomic velocities were sampled from a Maxwell-Boltzmann distribution at 300 K. Optimized geometries were used for the initial atomic positions. The total linear and angular momenta of the system were set to zero, and the target kinetic energy was set to (3N − 6) · k_B T / 2. The time step of the simulation was set to 0.5 fs, and the neighbor list of the system was updated every 10 time steps (5 fs). All pairs of atoms within 7 Å were considered neighbors. In each step the distance between all pairs of neighbors was computed, and only pairs within 5 Å of each other were used in the model. This procedure meant that atoms entering the 5 Å cutoff between neighbor list updates would not be missed (i.e., an extra 2 Å "skin" was added). All ground state and excited state trajectories in this work were propagated with the velocity Verlet algorithm [124]. Frames for NAMD were taken only after the first 1 ps of ground state MD to allow for equilibration.
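The neighbor-list scheme with a 2 Å skin can be sketched as follows. Brute-force distance computation is used for clarity, and the positions (in Å) are illustrative.

```python
import numpy as np

def build_neighbor_list(pos, list_cutoff=7.0):
    """Pairs within the 7 Å list cutoff; rebuilt every 10 MD steps."""
    diff = pos[:, None, :] - pos[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    i, j = np.where((dist < list_cutoff) & (dist > 0))  # exclude self-pairs
    return list(zip(i.tolist(), j.tolist()))

def model_pairs(pos, nbr_list, model_cutoff=5.0):
    """Each step, keep only listed pairs within the 5 Å model cutoff.
    The 2 Å skin ensures atoms entering 5 Å between rebuilds are caught."""
    out = []
    for i, j in nbr_list:
        if np.linalg.norm(pos[i] - pos[j]) < model_cutoff:
            out.append((i, j))
    return out

pos = np.array([[0.0, 0, 0], [4.0, 0, 0], [6.0, 0, 0], [20.0, 0, 0]])
nbrs = build_neighbor_list(pos)   # includes (0,1), (0,2), (1,2) both ways
print(model_pairs(pos, nbrs))     # only pairs closer than 5 Å remain
```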
Next we ran ZN dynamics [72,79,125] using N trj trajectories, where N trj was 10 during the active learning cycle and 500 during final inference. Hops were restricted to gaps ≤ 0.5 eV to avoid unphysical transitions, the time step was set to 0.5 fs [72,79,125], and the neighbor list was again updated every 10 steps. Excited state trajectories were run for 1.5 ps during active learning and 5 ps for final inference. All other details of the implementation can be found in Refs. [72,79,125]. For each species we performed inference in parallel over 100-250 geometries at a time, depending on the number of atoms in the molecule. This was done by batching together geometries from 100-250 trajectories and evaluating the model on the batch. The batch size depended on the size of the molecule, which determined the GPU memory consumption. FS simulations were also performed for final inference. The ZN parameters above were used for FS DANN-NAMD of molecules in the holdout set. For virtual screening we used 100 trajectories and 2 ps of excited state dynamics. Species with a hopping rate under 10%, or with molecular graphs that changed during ground state dynamics, were deemed unreliable. These molecules were excluded from Fig. 4, but not from the average hopping percentage reported in the main text.
The quantum yield was calculated as Y = n R /n T , where n R is the number of reactive trajectories and n T is the total number of trajectories. The uncertainty was computed as the standard deviation of 1,000 bootstrapped samples. To identify reactivity, we computed the RMSD between the CNNC atoms in the final frame and the corresponding atoms in the optimized cis/trans geometries. A trajectory was considered reactive if it started as cis and ended closer to trans, or vice-versa. Trajectories that ended in the excited state were excluded from n R and n T . The yield was reported as 0 ± 0 if all trajectories ended in the excited state.
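A minimal sketch of the yield and its bootstrap uncertainty follows; the boolean encoding of reactive trajectories and the example counts are illustrative.

```python
import numpy as np

def yield_with_uncertainty(reactive, n_boot=1000, seed=0):
    """Quantum yield Y = n_R / n_T with bootstrap standard deviation.

    `reactive` is a boolean/0-1 array over trajectories that reached S0
    (excited-state-trapped trajectories are excluded beforehand).
    """
    reactive = np.asarray(reactive, dtype=float)
    rng = np.random.default_rng(seed)
    y = reactive.mean()
    samples = rng.choice(reactive, size=(n_boot, reactive.size), replace=True)
    return y, samples.mean(axis=1).std()

reactive = np.array([1] * 26 + [0] * 74)  # e.g. 26 reactive of 100 trajectories
y, dy = yield_with_uncertainty(reactive)
print(f"Y = {y:.2f} +/- {dy:.2f}")
```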

XII. Fewest switches implementation
Tully's surface hopping propagates the nuclei on one electronic surface at a time, called the active surface. The expansion coefficients of the electronic wavefunction are also propagated, and are used to make stochastic changes in the active surface. The coefficient vector c, expressed in a basis denoted by "rep", evolves as [126]

(d/dt) c^rep = −(i H^rep / ħ + T^rep) c^rep. (S14)

The Hamiltonian matrix is H^rep_nm = ⟨ψ^rep_n| H |ψ^rep_m⟩, and the coupling term is

T^rep_nm = ⟨ψ^rep_n| ∇_R |ψ^rep_m⟩ · v, (S15)

where v is the classical velocity of the nuclei. In principle the nuclei can be propagated on a PES in any electronic basis, and hops can be performed between states in that basis. For example, nuclei could be propagated on diabatic PESs and hops could occur between diabatic states. However, it is well-established that surface hopping in the adiabatic basis gives the best results [97]. Hence the nuclei should be propagated in the adiabatic basis, and hops should be decided using c_ad. While the adiabatic basis should be used for hopping, c can be propagated in the diabatic basis, and then transformed into the adiabatic basis for all subsequent manipulations (e.g. deciding on hops, adding decoherence, etc.). Indeed, using Eq. (S14) in the adiabatic basis can require small time steps for so-called trivial crossings, where the derivative couplings (NACVs) are quite narrow and large [127]. Since global diabatic states are almost never available in ab initio simulations, a local diabatization method is often used to propagate c [88,127]. This method is very stable and can be used with a fairly large time step. In this work we have a global diabatic basis that can replace local diabatization. Following Refs. [126,127], we then propagate c as follows:

c_ad(t + ∆t) = P_ad(t, t + ∆t) c_ad(t), (S16)
P_ad(t, t + ∆t) = U†(t + ∆t) P_d(t, t + ∆t) U(t), (S17)
P_d(t, t + ∆t) = ∏_{k=K}^{1} exp[−(i/ħ) H_d(t + k∆t/K) ∆t/K], (S18)
H_d(t + k∆t/K) ≈ H_d(t) + (k/K) [H_d(t + ∆t) − H_d(t)]. (S19)

Here P_ad(t, t + ∆t) is the propagator of c_ad from t to t + ∆t (Eq. (S16)), and ∆t is the timestep. The adiabatic propagator is a transformation of the diabatic propagator into the adiabatic basis (Eq. (S17)), using the transformation matrix U (Eq. (2)).
The diabatic propagator is a matrix product of K sub-propagators (Eq. (S18)), where K is the number of electronic substeps for each nuclear step. "exp" denotes a matrix exponential, not element-wise exponentiation. The k-th sub-propagator uses a linear interpolation between H_d(t) and H_d(t + ∆t) to approximate H_d(t + k∆t/K) (Eq. (S19)). The diabatic Hamiltonian is produced by the neural network model. The probability of hopping from the active state n to state m is [126]

p_{n→m} = (1 − |c_ad,n(t + ∆t)|² / |c_ad,n(t)|²) · Re[c_ad,m(t + ∆t) (P_ad,mn)* (c_ad,n(t))*] / (|c_ad,n(t)|² − Re[c_ad,n(t + ∆t) (P_ad,nn)* (c_ad,n(t))*]), (S20)

where Re(x) is the real part of x. Note that while our approach follows that of SHARC [126], we have written our own code and made it publicly available [128]. Our repository also contains code for surface-hopping with the ZN method.
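As a sketch, the K-substep diabatic propagator described above can be built from matrix exponentials of the linearly interpolated Hamiltonian. The toy two-state Hamiltonian, eV/fs units, and numerical values are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.linalg import expm

HBAR_EV_FS = 0.6582  # hbar in eV*fs

def diabatic_propagator(h_d_t, h_d_tdt, dt, k_sub=25):
    """Product of K sub-propagators, each a matrix exponential of the
    linearly interpolated diabatic Hamiltonian (in eV), latest leftmost."""
    n = h_d_t.shape[0]
    prop = np.eye(n, dtype=complex)
    for k in range(1, k_sub + 1):
        h_k = h_d_t + (k / k_sub) * (h_d_tdt - h_d_t)  # linear interpolation
        prop = expm(-1j * h_k * dt / (k_sub * HBAR_EV_FS)) @ prop
    return prop

# Toy 2-state diabatic Hamiltonian changing over one 0.5 fs nuclear step
h0 = np.array([[0.0, 0.05], [0.05, 1.0]])
h1 = np.array([[0.1, 0.06], [0.06, 0.9]])
p_d = diabatic_propagator(h0, h1, dt=0.5)

# The propagator is unitary since each H_k is Hermitian
assert np.allclose(p_d.conj().T @ p_d, np.eye(2))
```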
In this work we initialized c_ad(t = 0) = [0, 1, 0], used Eqs. (S16)-(S19) to generate c_ad(t + ∆t), and computed the hopping probability with Eq. (S20). c_ad was expressed in a three-state basis because our model used three diabatic states, and hence U was a 3 × 3 matrix. Since the model was only trained on the first and second adiabatic energies, we set p = 0 for all transitions to the third state. Hops to state m were performed if [126]

Σ_{l<m} p_{n→l} < r ≤ Σ_{l≤m} p_{n→l},

where 0 ≤ r ≤ 1 is a random number. The momentum was rescaled after each hop to conserve the total energy. The momentum was multiplied by a constant factor, rather than being re-scaled in the direction of the NACV, to avoid the overhead of a NACV calculation (see below) [88]. If the factor was complex (a so-called "frustrated" hop), then no transition was made [88]. We also tested propagation in the adiabatic basis. In this case we used the NACV to evaluate T, and constructed P_ad from the interpolation of H_ad − iT [126]. Both Eq. (S20) and Tully's original hopping expression [57] were used. The momentum was re-scaled in the direction of the NACV [115]. In all cases the results were very similar to those of the diabatic basis. Diabatic propagation was ultimately chosen because of its reported stability [127], and because of its computational efficiency. In particular, diabatic propagation requires only the diabatic energies and one adiabatic gradient, and therefore uses only one forward and one backward pass through the neural network. By contrast, constructing the NACVs requires gradients for each diabatic matrix element (Eq. (1)). For three diabatic states, this corresponds to one forward pass and six backward passes. While shared convolution layers and caching mean that N gradients do not take N times as long as one gradient, we found that they still added significant time. Hence diabatic propagation was chosen as the most efficient method.
The decoherence correction of Ref. [129] was used to counter the over-coherence of Tully's original method. A sign correction was also used to remove random sign changes in the eigenvectors of H_d. The sign correction A_m for the m-th eigenvector v_m was computed as

A_m = sign[v_m(t) · v_m(t + ∆t)].

The transformation matrix and NACVs were corrected through

U_lm → A_m U_lm, d_nm → A_n A_m d_nm.

Note also that S_nm ≈ ⟨ψ_ad,n(t)|ψ_ad,m(t + ∆t)⟩:

S_nm = Σ_l U_ln(t) U_lm(t + ∆t).

Here we used the fact that there is no derivative coupling between diabatic states, and that the diabatic states at a given time are orthonormal (Eq. (S35)).

All simulations were performed with a time step of 0.5 fs and K = 25 substeps for electronic propagation. Unlike in ZN dynamics, hops were not restricted to gaps under 0.5 eV. We found that restricting hops improved the ZN results but hurt the FS results. For example, with no maximum gap, most neural ZN trajectories starting from trans-azobenzene hopped at ∼1.5 eV. This did not match the results of Ref. [72], which used ab initio ZN with the same level of DFT theory. After adding the restriction, most hops occurred at gaps under 0.1 eV, in agreement with Ref. [72]. By contrast, most FS hops occurred under 0.1 eV even without a maximum restriction, though some still occurred at large gaps. The lifetime and yield without a maximum restriction also better matched previous calculations.

XIII. Active learning
New geometries were selected for QC calculations using two different criteria. In the first three active learning loops, new geometries were chosen based on the prediction variance of two neural networks. Our aim was to select geometries with fairly uncertain predictions. We did not want geometries with extremely high uncertainty, as these usually corresponded to broken graphs that were outside the target learning space for the model. We therefore used a log-normal target distribution for the uncertainty. Target uncertainties were randomly sampled from this distribution, and geometries with the closest variance to the targets were selected. The log-normal probability of obtaining a sample x is given by

P(x) = exp[−(ln(x/s) − µ)² / (2σ²)] / (xσ√(2π)),

where s and σ are positive parameters and µ is a location parameter. Under this distribution, both completely certain and completely uncertain predictions have zero probability, with a peak for predictions with medium uncertainty. The decay of the probability at large uncertainty is slow, such that highly uncertain geometries can still be selected with reasonable probability. We set µ = 0, σ = 1, and s = 7, giving a target distribution with a mode of 2.5 kcal/mol(/Å) and a mean of 11.5 kcal/mol(/Å). Committee variances were computed for both forces and energies for each electronic state, and the largest variance was compared with the target variance from the distribution. For each species we selected 16 geometries from ground state MD and 33 from surface hopping. Only geometries from the 164 literature species were chosen. This led to approximately 8,000 new data points in each active learning cycle. Other, simpler methods of geometry selection are certainly possible; for example, random selection is also known to work quite well [37,38]. Our approach successfully increased the model quality throughout active learning, but we did not thoroughly compare it to other methods. Such a comparison may be of interest in the future.
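The quoted mode and mean follow from the standard log-normal formulas, mode = s·exp(µ − σ²) and mean = s·exp(µ + σ²/2); with µ = 0, σ = 1, and s = 7 these give ≈ 2.6 and ≈ 11.5 kcal/mol(/Å). A sketch of sampling target uncertainties from this distribution:

```python
import numpy as np

# Log-normal target distribution with mu = 0, sigma = 1, scale s = 7
mu, sigma, s = 0.0, 1.0, 7.0
mode = s * np.exp(mu - sigma**2)       # ~ 2.6 kcal/mol(/A)
mean = s * np.exp(mu + sigma**2 / 2)   # ~ 11.5 kcal/mol(/A)
print(mode, mean)

# Sampling: ln(x) ~ Normal(mu + ln(s), sigma)
rng = np.random.default_rng(0)
targets = rng.lognormal(mean=mu + np.log(s), sigma=sigma, size=10_000)
```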
In the next two loops we used only azobenzene, with the goal of densely sampling the CI region. Half the geometries were selected randomly from the excited state dynamics, and half were selected based on the gap.
Because the model overestimated the gap in uncertain regions, we did not want to simply select geometries with low predicted gap. Rather, in each trajectory we identified avoided crossing geometries, and assigned equal sampling probability to all geometries before and after that crossing, up to a maximum predicted gap of 1.7 eV. An avoided crossing geometry was defined as having a gap that was lower than in the previous and subsequent time step. In this way we aimed to identify regions that could have small gaps, even if the model did not predict them to be so.
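The avoided-crossing criterion above (a gap lower than in both the previous and the subsequent time step, with sampling windows capped at a predicted gap of 1.7 eV) can be sketched as follows. The function names are illustrative, not from the actual codebase.

```python
import numpy as np

def avoided_crossing_steps(gaps):
    """Indices whose gap is lower than in the previous and subsequent step."""
    g = np.asarray(gaps)
    return [i for i in range(1, len(g) - 1)
            if g[i] < g[i - 1] and g[i] < g[i + 1]]

def sampling_window(gaps, max_gap=1.7):
    """Mark frames before and after each avoided crossing, walking outward
    while the predicted gap stays below max_gap (eV). All marked frames
    receive equal sampling probability."""
    g = np.asarray(gaps)
    mask = np.zeros(len(g), dtype=bool)
    for c in avoided_crossing_steps(g):
        i = c
        while i >= 0 and g[i] <= max_gap:   # walk backward from the crossing
            mask[i] = True
            i -= 1
        i = c + 1
        while i < len(g) and g[i] <= max_gap:  # walk forward
            mask[i] = True
            i += 1
    return mask
```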

XIV. Validation
A. Ab initio NAMD

Ab initio simulations were used for unsubstituted cis and trans azobenzene. Starting configurations and velocities were generated with ab initio MD using Q-Chem to drive the dynamics. Ten MD simulations were performed for each species. Each simulation was initiated with a different geometry produced by ground-state neural network MD. The simulations were run for 10 ps each using a time step of 0.5 fs, with the BP86 functional [130,131] and 6-31G* basis [64]. Initial velocities were sampled from a thermal distribution at 300 K, with rotation and translation projected out. The nuclei were propagated with the Velocity Verlet algorithm. A Nosé-Hoover chain of length 5, timescale τ = 25 fs, and temperature T = 300 K was used as a thermostat. The Fock extrapolation order was set to six, and twelve Fock matrices from previous steps were used in the extrapolation.
A total of 167 and 178 excited state trajectories were generated for trans and cis azobenzene, respectively. Each trajectory was initialized with a different set of coordinates and velocities, randomly sampled from the ten ground state simulations after 1 ps of equilibration. Spin-flip TDDFT was used with the BHHLYP functional [65] and the 6-31G* basis. DIIS with geometric direct minimization (GDM) [132] was used to improve SCF convergence. We found this to be critical: with the usual DIIS algorithm, the SCF cycle failed to converge for 36% of trajectories, usually within the first 200 fs.
The dynamics were run with in-house scripts, which can be found at https://github.com/learningmatter-mit/NeuralForceField. We propagated the elements of the electronic wavefunction in the adiabatic basis, and corrected the sign of the force NACV by minimizing its change from the previous step. The momentum was re-scaled in the direction of the NACV after a hop [115]. All other parameters were unchanged from the DANN-NAMD simulations. Cis trajectories were propagated for 200 fs, and trans trajectories for a maximum of 1.5 ps. For the former, we extended trajectories that had hopped within the last 50 fs, or had not hopped at all, until at least 50 fs had passed since the hop. For the latter, we stopped a trajectory if it had lasted at least 500 fs and had hopped more than 300 fs earlier.

B. Transfer learning
To validate the yield results for substituted compounds, we performed DANN-NAMD for the top candidates using a set of highly accurate models. Each model was fine-tuned for a single species only, which allowed it to achieve high accuracy on that one molecule. This transfer learning strategy was used in place of ab initio NAMD because the latter would be prohibitively slow for all but the smallest molecules. For example, consider molecule 169, which has only 54 atoms. We found that a single gradient or NACV calculation for this species took approximately 50 minutes with 8 CPU cores. Since trans derivatives must be simulated for at least 1 ps, and since the time step must be no larger than 0.5 fs, we would need to perform 2,000 QC calculations for each trajectory. Assuming parallel calculation of the NACVs and gradients at each step, an ab initio simulation would take 70 days. By contrast, the fine-tuning approach allows us to perform highly accurate simulations for tens to hundreds of species.
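The cost estimate above reduces to simple arithmetic; the numbers below are the ones quoted in the text.

```python
minutes_per_calc = 50      # one gradient or NACV evaluation for molecule 169, 8 CPU cores
timestep_fs = 0.5          # maximum allowed time step
sim_length_fs = 1000.0     # trans derivatives must be simulated for at least 1 ps

steps = int(sim_length_fs / timestep_fs)         # 2,000 QC calculations per trajectory
# assuming the NACVs and gradients at each step run in parallel,
# one calculation's wall time is incurred per step
total_days = steps * minutes_per_calc / 60 / 24  # about 69.4 days, i.e. roughly 70
```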
Each new model was refined from the original network using QC data from a single species. The initial training geometries were sampled from DANN-NAMD simulations with the original DANN model. A committee of three DANN models, each trained on the entire dataset, was used to select the initial geometries (see below). Three fine-tuned models were then trained. Each was initialized with a different random seed and trained with different data splits; 90% of the data was used for training and 10% for validation. The first model was subsequently used for DANN-NAMD. Geometries were selected from these simulations using the newly trained committee, and each geometry received QC calculations. This data was added to the training set, each committee model was re-trained, and the cycle was repeated as in Fig. 2(b). After each cycle we evaluated the model accuracy using the geometries generated by DANN-NAMD.

Table S7. Training parameters used for transfer learning. "Patience" refers to the number of epochs without an improvement in validation loss before the learning rate is reduced. "Factor" is the amount by which the learning rate is lowered.
Initially we selected only 50 geometries in each round of active learning. Once we narrowed our focus to two species, we sampled 500 geometries in each round. The geometries were sampled according to the following four strategies. One third was sampled randomly. One third was chosen by the prediction variance in the excited state forces; those with the highest variance were selected. We used only the uncertainty in the forces, and not the energies, because the former is a better indicator of trajectory instability [81]. One sixth was chosen by proximity to a CI (Eq. (S13) with ∆E = 0.2 eV). The final sixth was chosen to sample excited-state barriers. In particular, we sampled geometries with probability

P ∝ exp[(E − E_min)/(k_B T)].  (S29)

Here E is the excited state energy, E_min is the minimum excited state energy in the trajectory, and T = 300 K. We only sampled configurations from 50 fs or later in the simulations, to avoid the initial high-energy geometries encountered before relaxation. Equation (S29) is inversely proportional to the room-temperature Boltzmann probability, and thus assigns the highest weight to the highest-energy configurations.

Transfer learning has previously been used to account for solvent effects [37] and to reach higher levels of QC theory [37,133] with only a small portion of the training set. Typically the majority of the network weights are frozen during re-training, leaving modifiable parameters in only the final few layers. However, we found that the best results were obtained without any parameter freezing, and therefore allowed all parameters to be modified during re-training. Further, we found it best to start with the normal learning rate rather than a reduced one. We also did not use a reference loss, as its effect on the original DANN results was minor, and possibly even harmful. The full set of training parameters can be found in Table S7. All trajectory parameters were unchanged from the screening phase.
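A minimal sketch of the barrier-sampling weights of Eq. (S29); the function name and the frame bookkeeping are illustrative, not taken from the actual codebase.

```python
import numpy as np

KB_KCAL = 0.0019872041  # Boltzmann constant in kcal/(mol K)

def barrier_sampling_probs(energies, temperature=300.0, skip_fs=50.0, dt_fs=0.5):
    """Normalized weights proportional to exp[(E - E_min)/(k_B T)]: the inverse
    of the Boltzmann probability, favoring high-energy (barrier) configurations.
    Frames earlier than skip_fs are excluded to avoid pre-relaxation geometries."""
    e = np.asarray(energies, dtype=float)  # excited state energies, kcal/mol
    start = int(skip_fs / dt_fs)
    w = np.zeros_like(e)
    e_min = e[start:].min()
    w[start:] = np.exp((e[start:] - e_min) / (KB_KCAL * temperature))
    return w / w.sum()
```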
For final inference we performed DANN-NAMD for 5 ps with 2,000 trajectories, using 500 ps of ground state MD.
The accuracy of the fine-tuned networks is shown in Tables S8 and S9. Statistics are shown for 500 geometries that were sampled from DANN-NAMD for each molecule using the final trained networks. The model errors are well below 1 kcal/mol(/Å) for geometries sampled by all methods other than uncertainty. The error on the uncertainty-selected configurations is under 2 kcal/mol(/Å) in all cases. Since these geometries are specifically chosen to have the highest error, and since their errors are still rather small, we can be confident in the accuracy of the models. We note, however, that the average error for the uncertain geometries depends on how many trajectories are run. Running more trajectories means sampling more configurations, and hence finding more geometries with high uncertainty. As mentioned above, we ran 100 trajectories for 5 ps each for both screening and transfer learning (2,000 trajectories were used for the final predictions). We saved frames every 15 fs, leading to 33,333 geometries in total. Since we picked the 167 most uncertain frames, our sample is roughly the same as choosing the top 0.5% of all geometries with the highest error.
One reason that we did not use transfer learning for the unsubstituted compounds was severe spin contamination. This is an artifact of unrestricted SF-TDDFT, and is a well-documented problem in NAMD for trans azobenzene [72]. After the first round of active learning, we found that the force error for the unsubstituted models increased to 8 kcal/mol/Å. This error did not drop significantly with more data, even though the new geometries were fairly close to the Franck-Condon region. This issue did not occur with any of the derivatives. We found that the error was correlated with ⟨S²⟩, and reasoned that our method of singlet selection (Sec. II D) was likely choosing triplet excited states. This highlights the issues inherent in SF-TDDFT, and reinforces the need for low-cost, spin-complete alternatives [28][29][30]. Lastly, we note that both the ab initio and transfer-learned cis quantum yields were noticeably higher than in other SF-TDDFT studies [72]. This is likely because we used MD to initiate the trajectories, rather than normal-mode or Wigner sampling based on the harmonic approximation. Our experiments showed that normal-mode sampling led to decreased cis yields, closer to those reported in Ref. [72]. Unlike the trans isomer, cis azobenzene is somewhat flexible, with significant torsions occurring during ground-state MD. This indicates that the harmonic approximation should be avoided when possible. Indeed, using MD ground state sampling together with FS dynamics for cis azobenzene, we obtained a yield of 60 ± 4%; this is in excellent agreement with experimental values in non-polar solution, which are close to 55% on average [73]. It is in much better agreement than the value of 34% obtained with FS surface hopping in Ref. [72].

Table S9. As in Table S8, but for species 169. 2,445 geometries were used for fine-tuning.
Since we used the same electronic structure method and the same surface hopping approach, we can be confident that the difference is mainly due to MD sampling.

C. Conical intersection pathways
We labeled the CI pathway of each trans trajectory according to its last hopping geometry. Each geometry was compared to two reference planar CIs and two reference rotational CIs. The trajectory was labeled by the reference CI that was the closest to the hopping geometry. That is, if a trajectory hopped at a geometry closer to one of the rotational CIs, it was labeled a rotational trajectory, and similarly for the planar CI.
The reference CIs were chosen from the set of all hopping geometries in the trajectories. The two rotational CIs were those with dihedral angles closest to 90° and 270°, respectively, and max(α_CNN, α_NNC) closest to 138°. The two planar CIs were those with dihedral angles closest to 186° and 174°, respectively, and both α_CNN and α_NNC closest to 148°. The angles are those of the optimized CI geometries in Ref. [7]. For the derivative 169, we further optimized each of the reference structures to minimize the gap and hence obtain a true CI. For each trajectory we computed the RMSD between the CNNC atoms of the hopping geometry and the CNNC atoms of the reference CIs.
We additionally enforced that a hopping geometry could only be considered planar if |180° − θ| ≤ θ_0, where θ is the CNNC dihedral angle and θ_0 is a cutoff value. We chose θ_0 = 75° for azobenzene and 65° for 169. Without this constraint, many of the hopping geometries labeled "planar" in 169 actually led to isomerization, making a dihedral constraint necessary. On the other hand, when we instead labeled all geometries with |θ − 180°| ≤ 45° as planar [7], we found through visual inspection that many non-reactive CI geometries were mislabeled as rotational. Using the minimal distance to a reference CI, together with a modest dihedral constraint, led to the most qualitatively reasonable results. We confirmed that none of the geometries labeled as "planar CI" with our metric led to switching, which further reinforced the soundness of our approach.
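The labeling procedure can be sketched as follows, using a Kabsch-aligned RMSD over the four CNNC atoms. The helper names and example coordinates are hypothetical; only the decision rule (nearest reference CI, with the planar label gated by the dihedral cutoff) follows the text.

```python
import numpy as np

def kabsch_rmsd(a, b):
    """RMSD between two (N, 3) coordinate sets after optimal rotation (Kabsch)."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    v, s, wt = np.linalg.svd(a.T @ b)
    d = np.sign(np.linalg.det(v @ wt))          # avoid improper rotations
    rot = v @ np.diag([1.0, 1.0, d]) @ wt
    return np.sqrt(((a @ rot - b) ** 2).sum() / len(a))

def label_pathway(hop_cnnc, dihedral_deg, planar_refs, rot_refs, theta0=75.0):
    """Label a trajectory 'planar' or 'rotational' from its last hopping geometry:
    nearest reference CI wins, but 'planar' also requires |180 - theta| <= theta0."""
    d_planar = min(kabsch_rmsd(hop_cnnc, r) for r in planar_refs)
    d_rot = min(kabsch_rmsd(hop_cnnc, r) for r in rot_refs)
    if d_planar < d_rot and abs(180.0 - dihedral_deg) <= theta0:
        return "planar"
    return "rotational"
```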

XV. Figure details
A. Computational speed-up

Here we describe how the speeds of ML and QC calculations were computed. For QC, we first computed the run time of one gradient calculation on a single geometry, denoted t_calc. The node time was then calculated as t_node = t_calc/n_calc, with n_calc = (cores per node)/(cores per job). We assumed 40 cores per node. All QC jobs were performed with 8 cores, and so n_calc was equal to 5.
For ML we performed a batched calculation on ten copies of each geometry, and computed one gradient.
The node time was computed as t_node = t_calc/n_calc, with n_calc = 10 · (total memory per GPU)/(memory used in calculation) · (GPUs per node). We set (total memory per GPU) = 32 GB and (GPUs per node) = 2. We used a script that performed one batched calculation on ten copies of a random geometry, and re-ran it 7,000 times. For each iteration we used the nvidia-smi command with the PID of the current job to access the GPU memory. We re-ran the script many times, rather than running many calculations in one script, because nvidia-smi does not account for all freed memory until a job is finished. That is, after one calculation is finished, nvidia-smi shows that GPU memory is still being occupied by the job, even after all local variables are deleted. Hence running multiple calculations in one script would overestimate the GPU memory.
In the above calculation, we implicitly assumed that multiplying the batch size by x would lead to an x-fold increase in memory. In practice we have found that the memory is increased by less than that. This means that the ML speedup in Fig. 3 is a conservative estimate.
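In code, the node-time bookkeeping reduces to the following. The t_calc values and the ML memory footprint in the example are placeholders for illustration, not measured numbers from the paper.

```python
def qc_node_time(t_calc_s, cores_per_node=40, cores_per_job=8):
    """Effective per-node time for one QC gradient: n_calc jobs fit on one node."""
    n_calc = cores_per_node // cores_per_job
    return t_calc_s / n_calc

def ml_node_time(t_calc_s, mem_used_gb, batch_copies=10,
                 gpu_mem_gb=32.0, gpus_per_node=2):
    """Effective per-node time for one ML gradient, from a batched calculation
    on batch_copies copies of the geometry."""
    n_calc = batch_copies * (gpu_mem_gb / mem_used_gb) * gpus_per_node
    return t_calc_s / n_calc

# example: a 50-minute QC gradient vs. a hypothetical 50 ms batched ML call using 4 GB
speedup = qc_node_time(50 * 60) / ml_node_time(0.05, 4.0)
```

With these placeholder inputs the speedup is of order 10^6, consistent with the six orders of magnitude quoted in the abstract; and because memory grows sub-linearly with batch size, this remains a conservative estimate.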

B. Diabatic energies
Since three diabatic states were used in the DANN model, and since all three were coupled near the CI, it would be incorrect to plot only two states in Fig. 3(b). We therefore applied a fixed rotation matrix U, independent of R, to generate a new diabatic Hamiltonian with only two coupled states. The new diabatic Hamiltonian was given by H̃_d(R) = U† H_d(R) U. Note that any position-independent rotation matrix can be brought outside the gradient in Eq. (1), and so H̃_d is still diabatic. The rotation matrix was chosen to diagonalize H_d at the CI, so that U† H_d(R_CI) U = diag({E}). This led to d̃_00 = d̃_11 and d̃_01 = 0 at the CI. Hence the lowest two rotated states were the most important contributors to E_0 and E_1, and were therefore used in Fig. 3(b).
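A minimal NumPy sketch of this construction; the random symmetric matrices stand in for actual model outputs.

```python
import numpy as np

def two_state_rotation(h_d_ci):
    """Fixed rotation U chosen to diagonalize the 3x3 diabatic Hamiltonian at
    the CI geometry. Because U does not depend on R, applying it at every
    geometry, H'_d(R) = U.T @ H_d(R) @ U, yields a Hamiltonian that is still
    diabatic, with the lowest two rotated states dominating E0 and E1 near the CI."""
    _, u = np.linalg.eigh(h_d_ci)  # columns are eigenvectors, ascending eigenvalues
    return u

def rotate(h_d, u):
    """Apply the same fixed rotation at any geometry R."""
    return u.T @ h_d @ u
```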

XVI. Intensive and extensive quantities
The DANN model predicts each property by summing over atomic contributions, just as in the PaiNN model. This guarantees size-extensivity. However, as explained below, the off-diagonal elements of H d should be intensive, in the sense that atoms not involved in the excitation should not contribute to these values. Nonetheless, we found that summation for these terms gave better results than averaging. This was true even when using a learnable weighted average. Atom-wise summation can still generate accurate d nm , because the readout network can simply map unimportant atoms to zero.
To demonstrate extensivity and intensivity, consider two uncoupled subsystems, A and B. The total clamped-nucleus Hamiltonian is H(r, R) = H_A(r, R) + H_B(r, R). Let the excitation of interest be in subsystem A. In this case the adiabatic states involve only excitations in subsystem A, so that the n-th diabatic wave function is written as

ψ_{d,n}(r; R) = ψ^B_{ad,0}(r; R) Σ_k V_{kn} ψ^A_{ad,k}(r; R).

The diabatic state is a direct product of the ground state wave function of system B, ψ^B_{ad,0}(r; R), and a rotation of the adiabatic states of system A, ψ^A_{ad,k}(r; R). The matrix elements of H_d are then given by

(H_d)_{nm} = (H^A_d)_{nm} + E_B δ_{nm},

where E_B is the ground state adiabatic energy of subsystem B. Hence the diagonal elements of H_d each gain the adiabatic energy of B, while the off-diagonal elements remain the same. To understand this result physically, note that adding a scalar multiple of the identity to a matrix shifts each eigenvalue by that scalar. This means that the eigenvalues of H_d are each shifted by E_B. Therefore the excitation energy is unchanged, which is the expected behavior upon adding an uncoupled system. Note also that extensivity is sometimes described as doubling the energy when the system size is doubled. However, this definition is too narrow, as it applies to only one adiabatic energy at a time. For example, if subsystem B is a copy of subsystem A, then E_B = E_{0,A}, meaning that E_0 = E_{0,A} + E_B = 2E_{0,A}, and so the ground state energy is indeed doubled. However,

E_1 = E_{1,A} + E_B = E_{1,A} + E_{0,A} ≠ 2E_{1,A}.

Therefore, the excited state energy is not doubled. A more general definition is that the energy of the second subsystem is added to each adiabatic energy. This analysis shows that the off-diagonal elements of H_d are intensive; the adiabatic energy gap is also intensive. Intensive here means that adding a subsystem that does not participate in the excitation does not modify the quantity. Such properties can naturally be modeled with an attention mechanism [134][135][136][137] that learns the importance of each atom to the excitation.
For example, the excitation energy can be modeled as an attention-weighted sum over atomwise quantities. However, as discussed in Supplementary Sec. IX, this approach led to worse performance than simply predicting extensive energies for each state.
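The eigenvalue-shift argument above can be verified numerically with a toy two-state diabatic Hamiltonian; the numerical values are arbitrary.

```python
import numpy as np

# hypothetical 2x2 diabatic Hamiltonian for subsystem A (energies in eV)
h_a = np.array([[0.0, 0.3],
                [0.3, 2.0]])
e_b = 1.5  # ground state adiabatic energy of an uncoupled subsystem B

# adding B contributes E_B to every diagonal element: H_d = H_A + E_B * I
h_total = h_a + e_b * np.eye(2)

e_a = np.linalg.eigvalsh(h_a)
e_tot = np.linalg.eigvalsh(h_total)

# every adiabatic energy is shifted by E_B ...
assert np.allclose(e_tot, e_a + e_b)
# ... so the excitation energy (the gap) is intensive: unchanged by adding B
assert np.isclose(e_tot[1] - e_tot[0], e_a[1] - e_a[0])
```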

XVII. Proof of diabaticity
Here we prove Eq. (1) in the main text. Using bra-ket notation for the wave functions, we define the diabatic states as a linear combination of adiabatic states,

|ψ_{d,n}⟩ = Σ_m V*_{nm} |ψ_{ad,m}⟩,  (S30)

where V is a unitary matrix. Multiplying each side by V_{nn′} and summing over n yields

|ψ_{ad,n}⟩ = Σ_m V_{mn} |ψ_{d,m}⟩,  (S31)

where we have renamed the dummy indices, n′ → n and n → m. We have also used the fact that (V†V)_{nm} = δ_{nm} for any unitary matrix, where δ_{nm} is the Kronecker delta. The matrix elements of the Hamiltonian in the diabatic basis are then

⟨ψ_{d,n}| Ĥ |ψ_{d,m}⟩ = Σ_i V_{ni} E_i V*_{mi},  (S32)

where Ĥ is the Hamiltonian operator, and we have used the relation ⟨ψ_{ad,j}| Ĥ |ψ_{ad,i}⟩ = E_i ⟨ψ_{ad,j}|ψ_{ad,i}⟩ = E_i δ_{ij}.

Given the connection between diabatic and adiabatic states, we can relate V to the derivative coupling k_{nm} = ⟨ψ_{ad,n}|∇_R ψ_{ad,m}⟩. By definition, the derivative coupling between any two diabatic states is zero:

0 = ⟨ψ_{d,n}|∇_R ψ_{d,m}⟩ = Σ_k V_{nk} ∇_R V*_{mk} + Σ_{jk} V_{nj} V*_{mk} k_{jk},  (S33)

or, in matrix form, V (∇_R V†) + V k V† = 0. Multiplying on the left by V†, taking the Hermitian conjugate, and using the anti-Hermitian property k*_{mn} = −k_{nm} gives

∇_R V = V k.  (S34)

We have also used the orthonormality of the diabatic states,

⟨ψ_{d,n}|ψ_{d,m}⟩ = δ_{nm},  (S35)

which follows from the orthonormality of the adiabatic wave functions. Meanwhile, by construction, the Hamiltonian produced by the model has eigenvalues equal to the adiabatic energies:

(H_d)_{nm} = Σ_i U_{ni} E_i U*_{mi},  (S36)

where U is the unitary matrix that diagonalizes H_d, i.e. U† H_d U = diag({E_i}). Comparing Eqs. (S36) and (S32), we see that H_d is the representation of Ĥ in a different electronic basis. In particular, if we choose U = V, such that U satisfies Eq. (S34), then H_d is the representation of Ĥ in the diabatic basis.

The model could be trained directly with Eq. (S34). However, the equation is numerically ill-posed, because k_{nm} diverges at conical intersections. It is preferable to work instead with the force coupling g_{nm}. We now show that Eq. (1), which is defined in terms of g_{nm}, holds only if Eq. (S34) is satisfied. The left-hand side of Eq. (1) can be written as

(U† ∇_R H_d U)_{nm} = (U† (∇_R U) E)_{nm} + ∇_R E_n δ_{nm} + (E (∇_R U†) U)_{nm},

where E = diag({E_i}) and we have used the fact that (U†U)_{nm} = δ_{nm}. Substituting Eq. (S34) with V = U, the first term is

(U† U k E)_{nm} = k_{nm} E_m.

Performing the same substitution for the third term gives

(E (∇_R U†) U)_{nm} = −(E k U† U)_{nm} = −E_n k_{nm},

where we have again used the anti-Hermitian property k*_{mn} = −k_{nm}. Adding the three terms gives

(U† ∇_R H_d U)_{nm} = ∇_R E_n δ_{nm} + (E_m − E_n) k_{nm}.

Noting that f_n = −∇_R E_n and g_{nm} = (E_m − E_n) k_{nm} gives Eq. (1) in the main text. Hence Eq. (1) can only hold if Eq. (S34) is true. Therefore, enforcing Eq. (1) ensures that there is no derivative coupling between any pair of diabatic states. Note that in the main text we used three diabatic states, and trained only on energies and couplings between the first two adiabatic states. This still means that all three diabatic states are properly diabatic: if Eq. (1) is satisfied for even one pair of adiabatic states, then the derivative coupling must be zero between all pairs of diabatic states.
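The identity proven in this section can be checked numerically on a toy two-state model with a constant diabatic coupling; everything below is illustrative, with the derivative coupling obtained by finite differences of the adiabatic eigenvectors.

```python
import numpy as np

def h_d(x, c=0.2):
    """Smooth two-state diabatic Hamiltonian with constant coupling c."""
    return np.array([[x, c], [c, -x]])

x0, dx, c = 0.3, 1e-6, 0.2

def adiabats(x):
    e, u = np.linalg.eigh(h_d(x, c))
    return e, u

e0, u0 = adiabats(x0)
e_p, u_p = adiabats(x0 + dx)
e_m, u_m = adiabats(x0 - dx)

# align eigenvector phases before differentiating
for u in (u_p, u_m):
    for j in range(2):
        if np.dot(u[:, j], u0[:, j]) < 0:
            u[:, j] *= -1

de = (e_p - e_m) / (2 * dx)          # adiabatic energy gradients
du = (u_p - u_m) / (2 * dx)
k = u0.T @ du                        # derivative coupling, k_nm = <n| d/dx |m>
g = (e0[None, :] - e0[:, None]) * k  # force coupling, g_nm = (E_m - E_n) k_nm

# the diabatic-Hamiltonian gradient, rotated to the adiabatic basis,
# equals the energy gradients on the diagonal plus the force coupling
lhs = u0.T @ ((h_d(x0 + dx, c) - h_d(x0 - dx, c)) / (2 * dx)) @ u0
rhs = np.diag(de) + g
assert np.allclose(lhs, rhs, atol=1e-6)
```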
Here we provide the motifs and substituents used for combinatorial molecule generation, together with a list of the literature species used for dense configurational sampling. Species with a net charge were excluded from training but are given here for completeness. In some cases only the cis or trans isomer was actually investigated experimentally, but in all cases we reference the publication for both isomers.
Many species in both the training and test set had experimental S_1 yields in non-polar solution. We did not put all such molecules in the test set, because this would mean losing hundreds of thousands of training geometries. Instead we used the 40 species with the fewest QC calculations.

Table S10. Motifs used for combinatorial species generation. Examples of literature species for each motif are also given. The species numbers are those in Tables S12 and S13.