Introduction

The transition from fossil fuels to environment-friendly and renewable sources of energy has become one of the central challenges for today’s society and across scientific disciplines1. Though global energy studies report that there has been record-breaking growth in sustainable energy production, further improvements are desirable2,3,4. Solar energy is a promising solution for the near future, owing to its availability and potentially high output. Since the first practical silicon photovoltaic cells were developed more than 50 years ago5, several methods of harvesting energy from sunlight have been investigated. Today, four generations of photovoltaic cells are recognised6,7, which are categorised based on factors such as the absorbed spectral bandwidth, and the power conversion efficiency and their upper thermodynamic maxima. However, although record conversion efficiencies are reported frequently8,9, many of these prototypes are not suitable for mass production as they use rare or toxic chemicals, or due to their short operational lifetimes. Considering the demand for a global solution, an upgrade to mass-producible first-generation technology could be valuable.

One such strategy aims to exceed the Shockley-Queisser (SQ) limit in single junction cells10 by applying a multiple exciton generation technique called singlet fission (SF)11,12,13. In SF, higher-energy photons initialise local excitations in a SF-absorber material, which then split into pairs of lower-energy excitons. The currently widely accepted mechanism proceeds from a two-site singly excited state (S1,S0), to a coupled singlet of two triplets 1(T1, T1) and finally to two independent spin-dephased triplet excitons (T1 + T1) after decoupling14,15,16. However, recent reports suggest that there may be an additional spatially separated exciton state 1(T1…T1) between formation and fission17, and that quintet spin states may also contribute to the final spin-dephasing mechanism18,19. Ultimately, when the energy of the decoupled T1 states are high enough, both triplet excitons can be used to drive a photovoltaic current20. Note, that while triplet excitons with energies below the PV semiconductor’s band-gap are also generally available for this process, the occurrence of dark recombination currents can lead to a voltage penalty in that case21. Finally, when more than one exciton per photon can be made available as charge carriers the device formally exceeds the SQ limit.

The mechanism of SF also dictates its energy requirement. That is, the singlet state excitation energy E(S1, S0) must be larger than the final state E(T1 + T1), which can be simplified into an approximate monomeric condition E(S1) ≥ 2E(T1). Furthermore, it has been reported that this process becomes less efficient as the inequality increases, which is assumed to be connected to heat loss related reaction channels22. So far, no collective structural design rules have been formulated to allow the free design of organic SF molecules, but several trends that promote SF have been identified from quantum chemistry and experimental studies. For example, in polycyclic aromatic hydrocarbon SF molecules, as well as (BN)2-pyrenes, a relationship between the SF criterion E(SF) = E(S1) − 2E(T1) and the multiple diradical nature of the structure y0, y1 was reported23,24. Here, a correlation between the wavefunctions of the dark excited singlet state (1[TT]) and the radicaloid ground state allows to connect y0 and y1 to the energetic conditions of singlet fission25. Further, both concepts of Hueckel and Baird aromaticity have been found useful in the design of SF molecules when the molecular design is chosen carefully, as both are able to lower the excited triplet state’s excitation energy26,27,28. In addition, a linear correlation between the delocalised π electrons in specific aromatic rings and the first triplet excitation energy E(T1) was found27. Inspired by these connections between electronic structure theory and energetic SF conditions that may serve as design criteria even across different core structure designs, we aim to identify further complementary trends by applying machine-learning assisted design rule extraction to more than four million derivatives of 2,2′-diethenyl cibalackrot and check for the validity of known concepts in this large sample space.

The article is structured as follows. In the results section, we first explain how the chemical space of 2,2′-diethenyl cibalackrot was chosen through pre-screening a smaller set of cibalackrot derivatives with different substituents at ten residue positions. Next, we provide details on the calculation and prediction of excited state energies for the over four million structures, including the specifications and performance of the machine learning model used for predictions. In the second half of the results section, we present a three-step analysis procedure in which we extract common structural designs from the predicted chemical groups of the singlet fission photovoltaics (SF-PV) candidates, generalise these motifs towards distinguishing from non-SF-PV using machine learning-assisted classification and finally infer wavefunction-based rules a posteriori from the logic of the trained classifier.

Results

Pre-screening for a suitable chemical subspace

From the initial full cibalackrot space of 215 million possible substituted structures, a set of 2000 structures was constructed using randomised substitution patterns (cf. histogram in Supplementary Fig. 1 of the ESI). For these structures, we calculated the singlet and triplet excitation energies E(S1) and E(T1) using time-dependent density-functional theory (TDDFT) and the delta self-consistent field (Δ-SCF) method, as described in the Methods section. Figure 1a shows the distribution of these energies for the 1975 converged molecules, where the solid black line represents the energy condition threshold for SF, and the black dashed line represents the silicon band gap energy threshold of 1.12 eV29.

Fig. 1: Excitation energy distributions (left) and chemical compositions (right) of highlighted orange samples from respective distribution.
figure 1

a, b and c; data from pre-screening, training and predictions, respectively. For the predictions, one per mille of random data is shown in the distribution while the histogram covers the full data. Solid and dashed black lines show onsets for SF and the Si band-gap (1.12 eV). Mean energies \(\tilde{E}({{{{\rm{S}}}}}_{{{{\rm{1}}}}})\) and \(\tilde{E}({{{{\rm{T}}}}}_{{{{\rm{1}}}}})\) (black dot) separate the data into useful quadrants (grey lines).

This distribution shows that 3.2% of the initially screened molecules fulfilled the SF energy requirements, which highlights the importance of pre-screening prior to machine learning. The mean energies \(\tilde{E}({{{{\rm{T}}}}}_{{{{\rm{1}}}}})\) and \(\tilde{E}({{{{\rm{S}}}}}_{{{{\rm{1}}}}})\) were 1.27 and 2.27 eV, respectively. It should be noted that, in this random set, there were nine structures that fulfilled the SF energy condition and exceeded the silicon band-gap threshold. However, since these were not enough to draw any conclusions about the group of SF-PV molecules, the pre-screening focused on molecules that fulfilled the SF criterion (cf. orange dots in panel on left of Fig. 1a). By considering their specific substitution patterns, a histogram of the chemical groups at their respective positions in the molecule was extracted for the SF candidates (cf. panel on right Fig. 1a). For generalisation, inversion-symmetry equivalent positions were grouped together according to the colour code in Fig. 2a.

Fig. 2: Structure of studied molecules.
figure 2

a Core scaffold of cibalackrot with residues R1--R10 colour coded with respect to inversion symmetry. The dashed line separates two parts that are referred to as the half-structures. The 5- and 6-membered rings are labelled A, B, C and their inversion symmetric counterparts are labelled A', B', C', respectively. b Chemical substituent groups used to construct the chemical space. The coloured frame around each substituent marks its role as electron withdrawing (blue) or electron donating group (red).

Looking at the overall substitution counts (cf. right of Fig. 1a) shows that molecules which fulfilled the SF condition showed more cyano- (‘CN’) and ethenyl- (‘Vi’, as for vinyl-) groups, and relatively few phenyl-group substitutions (‘Ph’). From an energetic perspective, cyano-groups were often found in structures with lower singlet and triplet excitation energies (compare energetic quadrant analysis, Supplementary Fig. 2 in the ESI). However, the ethenyl-groups on the central positions R5 and R6 strongly reduced E(T1), while having little effect on E(S1).

To investigate the cause of this asymmetric red-shift, we conducted a detailed comparison of a sample 2,2′-diethenyl SF candidate molecule (referred to as CIBAVi) and its 2,2′-dihydrogen counterpart CIBAHyd. The molecules had excitation energies of EVi(S1) = 2.20 eV (EHyd(S1) = 2.41 eV) and EVi(T1) = 1.03 eV (EHyd(T1) = 1.35 eV), respectively. Visualising the spin densities of the Δ-SCF triplet state, the two molecules (Fig. 3a, b, respectively) revealed that a portion of the spin density was delocalised onto the ethenyl units, which hints at a stabilising effect for the triplet excitation energy. Considering that also a close connection between radical stabilising groups at the 2,2′ positions and SF had been reported on differently 2,2′-functionalised cibalackrot molecules earlier27, it is likely that the same effect also adds to the stabilisation here as well. To probe this notion in more generality for also other positions and molecules, the relationship between spin density, the biradical character y0 and the energetic SF criterion will be investigated in more detail in a later chapter of the results sections.

Fig. 3: Wavefunction properties of the first excited triplet and singlet states of two molecules.
figure 3

Spin densities of the first excited triplet state of a CIBAVi and b CIBAHyd. Hole and particle wavefunctions of the first excited singlet state of c CIBAVi and d CIBAHyd. Hole and particle wavefunctions of the first excited triplet state of e CIBAVi and f CIBAHyd. Note that the latter NTOs depict the nature of the spin-forbidden S0 to T1 transition and were obtained through considering a formally spin-free one-particle transition density matrix. All surfaces were created with an isosurface density of 0.0004 using GaussView64. Natural transition orbitals (NTOs) were calculated from the Gaussian output using the ORBKIT programme package65,66,67.

For analysing the character of the excited states and further check for potential reasons for the asymmetric shift, the hole and particle wavefunctions for both the singlet and triplet excited state were calculated from natural transition orbitals (NTO) of TDDFT calculations. Note, that both excitations use the same singlet ground state as initial reference for constructing the one-particle transition density matrices (1-TDM) so that a direct comparison of the transition characters can be made. Since the spin-forbidden singlet to triplet excitation would formally yield zero for the NTOs, a spin-free 1-TDM was used here instead. Also, note that in both cases the transitions reflect a mostly pure HOMO to LUMO transition, as was observed in other cibalackrot derivates earlier as well27.

Looking at the nature of the transitions (Fig. 3c–f), it can be seen that in all cases the hole and particle wavefunctions describe ππ* transitions with similar patterns of sign-changes in the respective wavefunctions. This shows that the introduction of the ethenyl groups does not change the nature of transition but rather extends the delocalisation for both the singlet and triplet excitation. Hence, when judging from a purely spin-free perspective the stabilisation caused by the ethenyl groups should be similar in magnitude for both singlet and triplet state. Hence, it can be inferred that the remaining difference between the excitation energies and cause of asymmetric stabilisation must be related to purely spin-related differences; which could result in a direct correlation between spin-related properties and the energy difference E(S1) − E(T1) (and the singlet fission criterion E(SF) = E(S1) − 2E(T1), similarly), assuming that the transition characters behave in the same way for all chemical modifications. Again, a generalisation of this hypothesis towards a larger number of molecules shall be revisited in more depth in the later sections of the results.

In any case, the asymmetric red-shift caused by the 2,2′-diethenyl modification was deemed favourable for discovering SF-PV structures, and hence these substituents were incorporated into the core structure for all further investigations, focusing the chemical space to a subspace of 4,573,800 structures.

Machine learning of excitation energies

After selecting the 2,2′-diethenyl subspace, a set of 7214 molecules was obtained for machine learning their respective E(S1) and E(T1) energies. Comparing the resulting energy distribution with the one obtained previously (see Fig. 1a, b) revealed that the mean excitation energies (\(\tilde{E}({{{{\rm{T}}}}}_{{{{\rm{1}}}}})=\) 1.09 eV and \(\tilde{E}({{{{\rm{S}}}}}_{{{{\rm{1}}}}})=\) 2.19 eV) were indeed red-shifted, especially to lower triplet excitation energies. This also increases the amount of molecules that satisfy the SF condition in this set to 52.4%.

Using this set of molecules, one machine learning model was trained for each the first singlet and triplet excitation energies, as described in the Methods section. The learning curves of the two models (cf. Fig. 4) displayed convergent prediction behaviour as the size of the training set increased, and they reached mean absolute errors (MAE) of ~0.025 eV, with coefficients of determination R2 above 0.7. The S1 model performed slightly better than the T1 model in terms of the MAE and R2, which we ascribe to the fact that the underlying PM6 wavefunction used for some of the features was of singlet character itself.

Fig. 4: Learning curves of the machine learning models.
figure 4

Learning curves of the a S1 and b T1 machine learning models for a fixed test set of 1445 structures. The solid lines show the mean absolute error (MAE, left y-axis), while the dashed lines show the coefficient of determination (R2, right y-axis).

The histograms of the signed prediction errors (cf. Supplementary Fig. 3 of the ESI) were used to obtain the respective standard deviations of the models, σT = 0.031 eV and σS = 0.034 eV.

Note that despite the potential for higher accuracy offered by current deep neural network models for excited state energy predictions, such models generally require much more training data, which makes them more attractive for much larger chemical spaces than the one considered here30. However, since the training reference values were obtained from theoretical calculations themselves, such deviations can be considered negligible compared to the errors inherent to the density functional theory (DFT) reference itself. Judging from benchmark studies on the performance of DFT methods for the description of singlet fission molecules31,32,33, it is thus fair to assume that the actual excitation energies of our predicted overall distribution will be globally shifted by somewhere between 0.1 and 0.3 eV with respect to possible experimental values, while the trends when comparing individuals inside the distribution are predicted correctly.

Distributions of energies and chemical groups of predictions

Utilising these models, the first singlet and triplet excitation energies for all 4,463,586 PM6-converged molecules were predicted. In the following analyses, the SF-PV space of interest is given by structures that lie close to the SF energy criterion (Epred(S1) − 2Epred(T1) < 0.03 eV) and also exceed the triplet energy threshold of Epred(T1) > 1.12 eV. The energy window of 0.03 eV was chosen as it had the same order of magnitude as a single standard deviation in the models. Take note that while the histogrammatic trends presented in this initial analysis are helpful for understanding what general types of molecules tend to fulfil the SF-PV condition and what underlying design principles might exist, the observations drawn from inside the SF-PV chemical subspace are likely not sufficient for generalisation towards also the non-SF-PV chemical space. Hence, the information highlighted in this chapter must not be confused with actual general design rules that have been checked against the full chemical space. The strategy that was applied to develop globally applicable decision rules, will be addressed in the next part of the results section, instead.

The distribution of singlet and triplet excitation energies inside the predicted chemical space was similar to the training set, as shown in Fig. 1c. The mean energies were \(\tilde{E}({{{{\rm{S}}}}}_{{{{\rm{1}}}}})=\) 2.19 eV and \(\tilde{E}({{{{\rm{T}}}}}_{{{{\rm{1}}}}})=1.10\) eV, with 50.16% of the structures fulfilling the SF energetic condition. In addition, 393,841 (8.82%) structures also satisfied the desired, slightly exoergic SF-PV condition (orange dots in Fig. 1c). Their chemical composition analysis revealed that there were below average amounts of cyano-, ethenyl- and thiophenyl-groups, where the low number of cyano-groups might seem surprising compared to the previous results in Fig. 1b. However, it was observed that structures with higher singlet and triplet excitation energies generally featured fewer cyano-groups. This can be confirmed by chemical composition analysis of the four quadrants surrounding the mean \(\tilde{E}({{{{\rm{S}}}}}_{{{{\rm{1}}}}})\) and \(\tilde{E}({{{{\rm{T}}}}}_{{{{\rm{1}}}}})\) values (cf. Supplementary Fig. 2 in the ESI). The most common residues were hydrogen, followed by fluoro-, methoxy-, chloro- and phenyl-groups, displaying an overall slight preference towards electron donating substituents.

In terms of positioning of the groups, the thiophenyl-group displayed a significant anomaly, as it almost never occurred at the R1 position. Since we assumed a potential design principle was the reason for this anomaly, we attempted to formulate a possible explanation for this observation. If a bulky thiophenyl substituent was located at this position, electronic repulsion between the sulfur and oxygen lone pairs might have had a negative effect on the energy requirements. This effect would be less pronounced for the phenyl-group, which lacks the sterically demanding sulfur lone-pair electrons, thus explaining why phenyl was instead only slightly disfavoured at this position. Probing the suspected design principle against non-SF-PV structures by grouping together all SF-PV and non-SF-PV molecules with none, one or two bulky groups (phenyl- and thiophene) at either (and both) of the R1/R10 positions strengthened this suspicion, as indeed decreasing portions of 54.8%, 38.1% and 28.5% of the respective subspaces matched singlet fission requirements.

However, to confirm the suspected physicochemical origin of our hypothesis would require access to the electron density surrounding the ketone oxygen for all 4,463,586 structures. Since we had no direct access to this property, we instead used an auxiliary descriptor for probing the hypothesis. For this purpose, we used the available preoptimised structural data of all structures and calculated the sum of local nuclear potentials surrounding the ketone oxygen atoms as a rough approximation to the local electronic repulsion. Using this description, however, no conclusive correlation was identified (cf. Supplementary Fig. 5 in the ESI). Hence, while smaller groups at the R1/R10 positions correlate with SF-PV, the exact cause for this remains to be uncovered in future work.

Next, to include two-body relationships, a connectivity diagram was used to illustrate the 200 most commonly co-occurring combinations of groups and positions (see Fig. 5a)34. To facilitate the discussion, we introduce the shorthand notation Grp1Pos1/Grp2Pos2, indicating chemical group Grp1 located at position Pos1 and chemical group Grp2 located at position Pos2, respectively. In this diagram, the brightness of connecting lines represents the number of occurrences for each such combination of sites and groups, given that the probed molecule satisfies the SF-PV criteria (i.e. conditional probability). Owing to the nature of the artificial randomness of structures, the balance between symmetrically equivalent modifications (e.g. OMe7/Hyd4 and OMe4/Hyd7) was not exactly equal. A list of the ten most and least common co-occurring pairs can be found in Supplementary Table 1 of the ESI.

Fig. 5: Most commonly co-occurring combinations of groups and positions.
figure 5

a Plot showing connectivity between pairs of groups and positions, generated using the MNE Python programme package34. The brightness of each node indicates how strongly it is included in the connections. b Graphical representation of the learned feature importance of the Random Forest Classifier, generated using the MNE Python programme package34. The three groups of features, explained in the Methods section, are depicted as follows: Node brightness indicates the importance of individual groups at specific positions (first group of features). Lines connecting two nodes represent the 'and/or' symmetry-equivalent position information for a chemical group (e.g. hydrogen at either/and R4 or R7; second group of features). Finally, the brightness of the colour-coded node borders specifies the importance of typical, average, or uncommon groups (blue, green and red, respectively; third group features). The exact importance values can be found in Supplementary Table 2 of the ESI.

As shown, the most common modification (4.74%) contains a methoxy-group at position R4 and hydrogen at position R7. This percentage was significantly elevated, considering that the mean conditional frequency for all possibilities was 1.56%. In accordance with the previous analysis, there were many methoxy-, fluoro- and hydrogen modifications among the high-percentage co-occurring entries at positions R4 and R7. However, small groups with positive mesomeric effects, such as the methoxy-, fluoro- and chloro-groups, also occurred together in pairs relatively often (cf. connectivity diagram of chemical groups, Supplementary Fig. 4b in the ESI). Similarly, by analysing the least common combinations, we confirmed that there were only few structures among the SF-PV candidates that featured any two of the electron-withdrawing cyano- or bulky thiophenyl-groups simultaneously.

Focusing on just positional arguments in these combinations, the symmetry-equivalent positions R4 and R7 displayed the strongest overall connection and had the highest summed density of connections to any other node (cf. Supplementary Fig. 4a in the ESI). Conversely, R3 and R8 displayed the lowest density of connections, and the weakest connection was found between R3 and R1. Likewise focusing only on the chemical groups regardless of their position, mostly pairs of hydrogen were observed, while there were almost no combinations containing pairs of any of the larger phenyl- and thiophenyl-substituents, or the strongly electron withdrawing cyano-group (cf. Supplementary Fig. 4b in the ESI).

In summary, these considerations indicate that the R4 and R7 positions can be expected to play a significant role in potential design principles for SF-PV in 2,2′-diethenyl cibalackrot. Further, it is implied that such a design principle favours small, electron donating groups at these positions (methoxy- and hydrogen groups in 42.9% of cases, cf. Fig. 1c). However, likely due to the non-local nature of ππ* excitations, note that these positions are also strongly connected to all remaining positions in the molecule (cf. Supplementary Fig. 4a). Hence, a clear physicochemical origin may be difficult to pinpoint without additional data. Take note, that a full ‘top-down search’ for any potential origin property always requires that any such probed origin property is available for all structures of the database (like in the case for the repulsion of ketone groups above). To work around this problem, we therefore instead choose a ‘bottom-up approach’, in which we first train a classification model to identify SF-PV structures from the available information and then search for correlations in the physicochemical properties of a smaller group of representative molecules chosen by the classifier logic.

Random forest classification

To identify generally applicable structural design principles, a random forest classifier was trained using the predicted excitation energies of the database. The features were based exclusively on the chemical groups and their positions, with the aim of translating the energetic region of interest directly into structural design rules relevant to synthesis. A detailed description of the construction of the three groups of features can be found in the Methods section. For training and testing purposes, two types of datasets were created from the full database of predictions. Both types contained all 393,841 SF-PV candidate structures and an equally large amount of non-SF-PV molecules. In the first type, all the non-SF-PV structures were taken from one of the energetic quadrants (cf. Fig. 1c), and in the second type, those structures were taken randomly from anywhere. In the following discussion, the former is called a balanced quadrant set, and the latter is called a balanced random set.

The random forest model was found to be most suitable for classification, when it was trained using the upper-right balanced quadrant set (i.e. energetically close to the SF-PV distribution itself). Using this training method, an accuracy of 80.3% was achieved when performing classification on a random balanced test set. From the confusion matrix of this test set, equally high precision and sensitivity were found (cf. Supplementary Fig. 6 in the ESI). When the balanced random set was applied for training, the resulting random forest model achieved a slightly better accuracy of 82.3%. However, in this case, the precision and sensitivity of the model was unfavourable for the desired application, because there were far more false-positive classifications. In particular, because the actual chemical space is highly imbalanced (91.2% non-SF-PV and 8.8% SF-PV entities), a high sensitivity is desirable. Finally, when the model was trained in any of the other quadrants, the performance in the respective quadrant was highest individually, but it failed to generalise towards any other quadrant. The largest discrepancy was observed when training was conducted using the lower-left balanced quadrant set (testing accuracy of 97.7%), and the resulting model was tested using the upper-right balanced quadrant set (testing accuracy 55.2%). The training statistics for all quadrants can be found in Supplementary Table 3 of the ESI.

From a chemical informatics perspective, these findings can be explained by the fact that the different energetic quadrants do not share enough similarities to learn anything other than the difference between the quadrants themselves. In the previous extreme case, the model practically learnt whether molecules were part of the bottom-left quadrant or not by judging the presence or absence of cyano-groups. However, this approach to classification cannot be used to make decisions about the upper-right quadrant, where few cyano-groups are featured anyway (see chemical composition of energetic quadrants, Supplementary Fig. 2 in the ESI).

To understand the meaning behind the learned design principles, the importance of the features can be interpreted together with the distribution of the observed chemical residues. Figure 5b is a graphical representation of the importance of all the features. The brightness of individual nodes GrpPos reflects how important the singular combination of group and position is for classification. The opacity of the lines between two opposing nodes gives the importance of a positional either–or presence \({{{{\rm{Grp}}}}}_{{{{{\rm{PoS}}}}}_{1},{{{{\rm{PoS}}}}}_{2}}\) (i.e. whether the group is found at either or both of the two positions). Finally, the intensity of the differently coloured edges around the nodes represents the importance of typical, common, or uncommon groups of chemical residues for a specific position (blue, green and red, respectively). A list of the actual importance values can be found in Supplementary Table 2 in the ESI, and a more detailed explanation of the features is given in the ‘Methods’ section.

The most important feature for classification was the presence (or absence) of a thiophenyl group at position R10 (i.e. unc10 in 5b), followed by the inversion-symmetric Boolean of whether or not there was a thiophenyl substituent at position R4 or R7 (i.e. Thi4,7). Furthermore, the chemical distribution of the SF-PV molecules was devoid of thiophenyl across all these positions (cf. Fig. 5a). Comparing in this fashion what features are important for classification and what groups were present in the SF-PV molecules, the absence of Thi4,7,10 can be considered as a generalisable design rule for the specific set of molecules and all 2,2′-diethenyl cibalackrot derivatives. By going through the list of importances in this way, it can be seen that similar clear generalisations include the absence of Ph3,8 and the presence of typ7 (i.e., hydrogen or methoxy substituents at position R7).

In summary, bulky substituents like the phenyl- and thiophenyl-groups are generally confirmed to be unfavourable close to the central positions, whereas the small electron-donating methoxy-group is favoured on the R4 and R7 positions. These findings are consistent with the actual distribution of groups in the energetic region, and were identified as global classification rules to distinguish from the non-SF-PV molecules. Further, note that because this random forest model does not possess chemical intuition (e.g. concepts of electron donating/withdrawing or bulky groups) the design principles learnt by the model can be understood as purely data-driven and free of human bias except for the nature of features that the model learned from.

Transfer towards wavefunction-based design strategies

When using the trained random forest model to predict classes for all 4,463,586 PM6-converged molecules, each structure was associated with its own class probability for each SF-PV and non-SF-PV. Interpreting this probability as a measure of how representative these structures were for their individual group and thus to which extent they represent the related design rules, the most confident 500 SF-PV and 500 non-SF-PV structures were selected for an a posteriori in-depth analysis of their wavefunction characteristics. This way, we attempt to transfer the structural design rules of the random forest classifier towards more general electronic structure characteristics that might be applicable also for other core structures and substituents.

For all selected structures, the SF criterion energy E(SF) = E(S1) − 2E(T1) was calculated from the respective TDDFT and Δ-SCF results after performing a B3LYP structure optimisation, as for the training data. Inspired by the work of Zeng et al.27, we extracted several different wavefunction-related features from the individual rings A–C and A’–C’ and from the α-carbon atoms (i.e., the 2,2′ positions which connect to the ethenyl moieties) and investigated these local properties for correlations with singlet fission. To ensure the analysis was insensitive to inversion symmetry on the artificially random cases of either image or inverted image, all features were summed up inside every pair of inversion-symmetric regions of the core scaffold (i.e. rings A+A’, B+B’ and C+C’ and the 2+2’ carbon atoms that connect to the ethenyl groups). Note that features were only extracted on the original cibalackrot core to improve transferability (i.e. including the ketone oxygen atoms, but excluding the 2,2′-diethenyl substituents).

Besides these local properties, a connection towards the biradical character y0 of each structure was investigated, since this property had been reported to show correlations with singlet fission in earlier works as well27. Here, we evaluated y0 as the occupation number of the lowest unoccupied natural orbital (LUNO) from a broken-symmetry unrestricted Hartree–Fock (BS-UHF) calculation25.

A consistent gradient in the SF criterion energy can be observed with respect to the sum of ground state Mulliken charges and spin densities in specific regions of the core structure (cf. Fig. 6). As shown in the upper row of graphs, charge neutral 2+2’-carbon atoms tended to fulfil the SF condition, with a further separation becoming apparent when the charge in other core regions (A+A’, B+B’ and C+C’) was considered at the same time. Similarly, a separation into clusters was obtained when considering the spin densities of the C+C’ region against the other core regions. Here, low spin densities in both the C+C’ region and other regions are desirable.

Fig. 6: Distributions of Mulliken charges and spin densities colour coded by the SF criterion energy.
figure 6

(Top) Distribution of summed Mulliken charge over different parts of the molecule and at different excitation energies, colour coded by the SF criterion energy. (Bottom) Distribution of summed spin densities over different parts of the molecule and at different excitation energies, colour coded by the SF criterion energy.

We closer examined these local features by considering their correlations with respect to the individual E(S1) and E(T1) excitation energies and found that more charge neutral 2+2’-carbon atoms led to lower triplet energies, while the singlet energies remained mostly unaffected. A similar independence for excited singlet energies was observed for the sum of spins in the C+C’ rings. These observations show that these features may become useful for tuning the energetics in favour of singlet fission and generalise our previously suspected design concept of energetically asymmetric shifts towards the other substituent positions as well.

It is assumed that the independence of E(S1) with respect to the charge on the 2,2′-carbon atoms can be explained by the electro-neutrality of the 2,2′-carbon atoms and their attached ethenyl substituents. Considering that the first singlet excitation is of global ππ* nature, the charge of two singular carbon atoms inside the conjugated network will be efficiently re-distributed among the rest of the core π-network upon excitation anyway, regardless of their initial charge on those specific atoms - thus being of little importance to the overall E(S1) excitation energy. However, since the electronic (spin-)density in the triplet state does not have the same global extent, but is instead accumulated strongly on the C,C’-rings and the two ethenyl substituents (cf. Fig. 3a, b), the observed negative correlation can be related to a situation where more negative local charge on the 2,2′-carbon atoms could lead to a higher energy requirement for E(T1).

As for the second correlated feature, higher spin densities forming in any part of the molecule can be practically associated with an equally larger excitation energy for their formation, explaining the origin of the positive correlation between density and E(T1). This is indirectly confirmed by noting that any combination of axes (i.e. plotting A+A’ against B+B’ instead) would result in an equally correlated picture. The reason why there is no correlation with E(S1), however, is likely connected to the same asymmetric red-shift of E(T1) versus E(S1) as observed for the example study earlier (cf. Fig. 3c–f). While both excitation energies change with the structure in a similar way due to the practically identical hole and particle wavefunctions, only the triplet energy will additionally benefit from spin-related effects.

Finally, the evaluation of the biradical characters y0 of 921 molecules that converged to a broken-symmetry solution implies that a third correlation with respect to the singlet fission energy criterion E(SF) is contained in the structural design principles of the random forest classifier (cf. Fig. 7). Seeing how radical stabilisation was found to be beneficial for SF for substituents at the 2,2′-positions of cibalackrot27, we can thus conclude that also other positions may be used to improve the radical stabilisation further. Take note though, that even if correlations exist for the biradical value y0 inside the chosen samples it is not a straightforward task to identify what groups and positions exactly caused which shift, due to the non-locality of the excited state wavefunction that was briefly mentioned at the end of the first half of the results section.

Fig. 7: Correlation diagram between the singlet fission energy criterion E(SF) and the biradical value y0.
figure 7

Correlation diagram between the singlet fission energy criterion E(SF) and the biradical value y0. Note that 48.3% of the 921 molecules fulfil SF conditions when values of up to E(SF) = −0.06 eV are allowed (see dashed line), which represents two times the standard deviation of the energy prediction models that the classifier learned from.

In summary, the sampled designs follow an asymmetric shift of the T1 versus S1 states’ excitation energies. More specifically, this asymmetric shift is found to be correlated to local changes in the spin density of the C,C’-ring regions and the charges on the 2,2′-carbon atoms. Both findings imply that the purely data-driven classification model has learned a design strategy, where local changes in the electronic structure close to the central part of the molecule (i.e. in the region where most of the spin density is accumulated) play a key role for designing singlet fission molecules. The design principles of the model are further backed up by a simultaneous increase in the radical stabilisation through the biradical character (i.e. y0-value). Overall, this means that our envisioned three-step procedure was successful in learning structural design rules that could be transferred into relevant underlying wavefunction-related properties.

Discussion

Machine learning has become an invaluable asset when designing materials. On one hand, it can be used to predict key properties of molecules in an efficient way to construct databases of millions of molecules in a relatively short amount of time. This makes it possible to screen for candidate molecules in silico and identify those that satisfy the desired criteria, such as the energetic condition for SF35,36. On the other hand, by analysing large databases, it is possible to search for correlations with specific properties to help identify potential chromophores and broaden our physicochemical understanding25,37. In this work, these directions were rigorously combined to search for potential SF structures from a generated database, investigate the common design principles inside their molecular structure and finally translate them into more basic properties, that is, their wavefunction and electronic structure.

For this purpose, we first trained a kernel ridge regression (KRR) model to predict the first excited singlet and triplet excitation energies of the 4,573,800 chemical derivatives of 2,2′-diethenyl cibalackrot in TDDFT quality (MAEs of 0.025 and 0.026 eV, respectively) based on features obtained from semi-empirical PM6 calculations. Using this database of excitation energies, the suitability of each candidate structure for SF-PV was assessed based on two factors: (a) whether the predicted singlet excitation energy was within the range of twice the predicted triplet excitation energy (i.e. E(S1) − 2E(T1) < 0.03 eV allowing only for slightly exoergic candidates); and (b), whether the predicted triplet excitation energy E(T1) was larger than the band gap energy of silicon (1.12 eV29) to avoid voltage penalties in hypothetical devices21. We found that 30.15% of the 2,2′-diethenyl subspace fulfilled the SF condition (a), and 8.82% of the overall chemical space also fulfilled the photovoltaic condition (b). The comparably high percentage of SF candidates was attributed specifically to the double ethenyl functionalisation of the core structure, which greatly (and almost exclusively) decreased the triplet excitation energy compared to the unfunctionalised cibalackrot. This observation is in line with previous studies on the 2,2′-substituent position and was previously attributed to the radical stabilising effect of the ethenyl substituent27. A collection of ten SF-PV structures that were identified within our predictions along with their a posteriori confirmed TDDFT/Δ-SCF excitation energies can be found in Fig. 8. Note that these structures were specifically selected from only symmetric molecules to increase their potential for synthesis with varying degrees of functionalisation as a preview of the diversity of the database. Further, since cibalackrot is a well-known indigoid vat dye with more than 100 years of history and former commercial use38,39, its modifications may be equally promising and available for applications.

Fig. 8: Selection of ten symmetric SF-PV candidate structures.
figure 8

Selection of ten symmetric structures (aj) with varying degrees of substitution that may be target for synthesis. Excitation energies below the molecules were obtained at TDDFT/Δ-SCF level of theory. Structural data for these examples are available in the ESI.

To extract which chemical modifications were beneficial for singlet fission photovoltaics and to generalise towards wavefunction-based properties, a three-step extraction procedure was performed on the predicted database. In the first step, the distribution of chemical groups at different substituent positions in the set of predicted SF-PV candidate molecules was screened. Here, it was found that electron withdrawing groups (cyano- most specifically) were less common in the predicted SF-PV designs at practically any position, whereas electron donating groups (methoxy- and hydrogen substituents at the central positions, most specifically) were overall more often present (cf. Fig. 9a). However, even if these observations from inside the PV-SF set are understood as characteristic for the subspace itself, we would like to raise awareness that they remain difficult to generalise towards global design rules for the full 2,2′-diethenyl cibalackrot space. For example, while it was found that roughly 280,000 out of 400,000 SF-PV structures featured an electron donating group at position R4, another 2,250,000 of non-SF-PV structures fall into this same category. Further, due to the highly non-local nature of the electronic structure, formulating design recommendations for several positions will become increasingly difficult, since making the decision of actually placing a specific group will lead to a changed situation for the remaining positions (example shown in Fig. 9b).

Fig. 9: Structural occurrences of electron withdrawing and donating groups.
figure 9

a Probabilities of finding an electron withdrawing (blue left number) and donating group (red right number) at each position of the 393,841 predicted SF-PV candidates. b Conditional probabilities for encountering a second group, after an electron withdrawing group has been chosen for position R2. The substituents were assigned to the respective groups in accordance with the colour scheme of Fig. 2.

To account for this problem, the second step of the procedure is aimed at extracting design principles that are both applicable to SF-PV molecules while excluding non-SF-PV structures. For this purpose, a random forest classification approach was chosen to treat the non-local nature of the problem. The model achieved an overall classification accuracy of 80.3% with equally high sensitivity and precision when classifying a balanced test set of SF-PV molecules and otherwise random non-SF-PV structures. By analysing the importance values of the classification features (Fig. 5b) together with the chemical distributions of the previous analysis step (Fig. 5a), we further deduced that bulky substituents such as phenyl- and thiophene are disfavoured especially for positions R1 and R10 which are close to the ketone oxygen and confirm that cyano-groups are not only rare in SF-PV molecules, but also carry significant exclusivity towards the non-SF-PV structures. While we are aware that this is a somewhat unusual perspective, note that we consider this combined analysis of which groups were typically present (or not) together with which factors were important in the classifier to be our most compact and concise attempt at capturing all possible structural recommendations in one image, while both paying the necessary respect to the non-locality of the problem and balancing the aforementioned generality and exclusivity that is necessary for formulating a valid recommendation.

In the last step of the analysis we transfer the structural logic learnt by the random forest model towards wavefunction-based properties that are related with SF-PV. This was done by analysing the most likely 500 SF-PV molecules and 500 non-SF-PV molecules according to the model and subsequently perform (TD)DFT, UDFT and broken-symmetry UHF calculations. Relating the SF criterion energy E(SF) = E(S1) − 2E(T1) to different wavefunction-related quantities (cf. Figs. 6 and 7), three meaningful correlated features were found. These are the sum of the ground state charges at the 2,2’-carbon atoms, the sum of T1 spin density inside the C,C’-rings and the occupation numbers of the LUNO of the broken-symmetry UHF solution (i.e. the y0-value as used in Minami et al.25). Note that these are identical or at least in part related to previously reported correlations for singlet fission in a chemical space of cibalackrot that focused on 2,2′-derivates27. Thus, our results not only serve as a proof of concept for the purely data-driven extraction method of design principles for future properties and chemical spaces, but further generalise design rules known to be beneficial for singlet fission towards a much larger chemical space when following the design rules provided by the random forest model.

Methods

Generation of chemical space

For the generation of cibalackrot derivatives, the core structure was decorated with eight different substituent groups (cf. Fig. 2a, b) that are commonly used in SF studies27,36,40. The chloro- and fluoro-groups were chosen to represent small substituents with mesomeric electron-donating and inductive withdrawing effects, the cyano-group was used to represent a medium-sized electron-withdrawing substituent, and the methoxy- and ethenyl-groups were considered as electron-donating medium-sized groups. For eventual tuning of the crystal structure packing41,42, the bulky phenyl-group was added. Finally, the 5-methylthiophenyl substituent (in the following just referred to as thiophenyl) was introduced. For simplicity of comparison with their study on aromaticity, we adopted the naming convention for 5- and 6-membered rings used by Zeng et al.27.

We deemed it difficult to attach large numbers of phenyl- and thiophenyl-groups simultaneously from steric hindrance and potential synthetic perspectives, so we limited the number of such groups on either side of the half-structures (R1–R5 and R6–R10, respectively) to one at a time. Following these rules for chemical space generation and considering the inversion symmetry of the core scaffold, the full catalogue of structures included 215,001,216 molecules.

To identify potential SF candidate molecules from this pool of possibilities, a pre-screening was performed to find a chemical subspace in which to carry out the machine learning study (cf. ‘Results’). For this purpose, a set of 2000 structures was randomly selected from the full space, and it was investigated using quantum chemistry calculations. After this initial screening, the chemical space was narrowed down to 4,573,800 molecules, in which the central residues (R5 and R6) were fixed to ethenyl-groups. From this reduced space, a set of 7500 molecules was randomly selected to serve as the training set for machine learning.

Utilising the inversion symmetry of the core scaffold, initial guess structures for the calculations were efficiently generated from combinations of the half-structures (indicated by the dashed line in Fig 2a). The half-structures themselves were constructed using the Schrodinger Materials Science Suite43. Note that, for the initial guess, no structural optimisation other than that provided by the Schrodinger Suite was performed43. In some instances when the substituents were too close or overlapping, the guess structures were adjusted prior to any quantum chemistry calculations by rotating the substituents slightly apart.

Quantum chemistry calculations

To generate the excitation energies of the chemical space, a machine learning model was developed. For this purpose both labels (i.e. answer data E(S1) and E(T1)) and features (i.e. describing data) for the excitation energies of the first singlet and triplet states, had to be obtained for a set of molecules.

The label excitation energies of the 7500 selected training set structures were obtained with the following quantum chemistry protocol. After pre-optimisation of the guess structures using the semi-empirical PM6 method44, the remaining 7214 converged structures were used in restricted density functional theory (RDFT)45,46 calculations for final structure optimisation. The excited singlet energies E(S1) were calculated using TDDFT47,48 for the RDFT optimised ground state structures. Previous benchmark studies reported that triplet energies from TDDFT calculations tend to be unreliable for the study of SF materials, so the Δ-SCF method was used to calculate the E(T1) energy labels instead32,36.

All quantum chemistry calculations were conducted using the Gaussian programme package49 and utilised the B3LYP functional in the respective restricted or unrestricted version50,51,52,53. Similar to other studies36, we found that combinations with the 6-31G(d’) basis set54,55 had a favourable balance between accuracy (compared to available data from other studies) and computation time for several available standard methods. It should be mentioned that, although we expected the reported combination of the M06-2X functional and the 6-311+G(d,p) basis set to give a more accurate image27, the trade-off with respect to the higher calculation time was considered unfavourable for the high-throughput nature of our study. Calculation of the diracdical character y0 were performed by running BS-UHF calculations at the B3LYP optimised ground state structures at the higher 6-311+G(d,p) basis56.

The features required to train the machine learning model were obtained from the semi-empirical PM6 structure optimisation calculations (which also served as pre-optimisation for the training set). On average, one of these calculations took less than 40 CPU seconds per structure, so they presented a balanced way of obtaining useful electronic structure information and structural features for the 4,573,800 molecules considered. In the following discussion, the generation of the features will be considered together with the central aspects of the machine learning model. For a general overview, Westermayr et al. recently compiled a list of state-of-the-art machine learning models for excited state properties57.

General machine learning model specifications

The machine learning technique applied in this work was KRR,58 as implemented in the scikit-learn Python package59. This is a non-linear method where a sum of products is used to predict a desired property (here, energy Epred) connected to a molecule \(M^{\prime}\) with respect to a training set M with known reference energies. That is,

$${E}^{{{{\rm{pred}}}}}(M^{\prime} ,{{{\bf{M}}}})=\mathop{\sum}\limits_{i}{\alpha }_{i}f(M^{\prime} ,{M}_{i}).$$
(1)

Here, αi are the coefficients to be trained, and the function \(f(M^{\prime} ,{M}_{i})\) determines how similar molecule \(M^{\prime}\) is to the i-th molecule of the training set Mi. Training of the coefficients is performed by minimising the expression

$${\min }_{\alpha }\mathop{\sum}\limits_{j}{({E}^{{{{\rm{pred}}}}}({M}_{j})-{E}_{j}^{{{{\rm{lab}}}}})}^{2}+\lambda \mathop{\sum}\limits_{j}{\alpha }_{j}^{2},$$
(2)

for the training subset. Here the superscript ‘lab’ indicates the label values for the respective excitation energies E. To increase the generality, a regularisation term with respect to the norm of the coefficients was added. The corresponding prefactor λ is a hyperparameter of the model itself, and it does not change during training; instead, it needs to be optimised separately via cross-validation. Through kernelisation, the minimisation problem can be expressed as the matrix inversion problem

$${{{\boldsymbol{\alpha }}}}={({{{\bf{K}}}}+\lambda {{{\bf{I}}}})}^{-1}{{{{\bf{E}}}}}^{{{{\rm{ref}}}}},$$
(3)

in which I is the identity matrix and K is the so-called kernel matrix. Individual elements of the kernel matrix are obtained as Kij = f(Mi, Mj). To design a machine learning model suitable for predicting excitation energies, different descriptive expressions for f correlated to the desired quantity must be defined.

In this work, f took the form of products of Gaussians with respect to the abstract distance measures between two molecules dg(Mi, Mj),

$$f({M}_{i},{M}_{j})=\exp \left(-\left(\mathop{\sum}\limits_{g}\frac{{d}_{g}{({M}_{i},{M}_{j})}^{2}}{2{\sigma }_{g}^{2}}\right)\right).$$
(4)

Here, the width of each respective Gaussian σg is a hyperparameter of the machine learning model. Note that in practice, however, we made an initial guess for the individual σg based on the data distribution in histograms of the training set’s distances dg, and then scaled all σg according to a single, collective hyperparameter σ. Therefore the model only uses two hyperparameters that needed optimisation through cross-validation.

Depending on the functional choice of the distance measures dg, the machine learning model will be more or less capable of relating the features of a set of molecules to their respective labels. To prevent the model from learning the training set without being able to generalise to unknown structures (i.e., to prevent overfitting), we performed stratified 5-fold cross-validation for the selection of hyperparameters and benchmarked models that were trained with differently split compositions of 80% training and 20% testing sets.

Overall, the final models use a total of 54 features belonging to fifteen types of distance measure descriptors dg, the detailed descriptions of which can be found in the ESI. Note that, even though the features were designed to support machine learning correlations, not all of them necessarily reflected physico- and quantum chemical behaviours. Further, the landscape of machine learning assisted chemical design studies is growing rapidly, and by the nature of the task, the developed features are often custom-made solutions for a given problem, rather than universally applicable quantities whose true origin can be easily tracked. Therefore, although we reference special cases where we know of an earlier application or definition of a feature60,61, the list of references is almost certainly incomplete.

Finally, take note that in it’s current form, the models for energy prediction designed here are not transferable to other chemical spaces in their current form, since they were only developed as a tool to predict the excitation energies of cibalackrot-type molecules.

Random forest classifier

To determine a mapping between the predicted excitation energies and chemical structures, a random forest classifier62,63, as implemented in the sci-kit learn Python programme package59, was trained to distinguish SF-PV structures from non-SF-PV structures only using information regarding the chemical substituents and their positions. In total, 120 features were used for training, which were subdivided into three groups of increasing generality as follows.

In the first group of features, each possible combination of the eight chemical substituents at the eight possible functionalisation sites R1–R4 and R7–R10 was evaluated. This resulted in 82 = 64 different Boolean features (e.g., checking for specific modifications such as Ph1). Similarly, a second group was constructed such that any of the pairwise inversion-symmetry equivalent positions counted for the determination of the Boolean. In other words, for all eight chemical groups, the four pairs of positions were evaluated in the same way as for group one (e.g., specific modification Ph1,10 at R1 or R10). This treatment resulted in 4  8 = 32 features for the second group.

Finally, a third group of features was generated based on the predicted chemical group distribution of SF-PV molecules (cf. Fig. 1c). For this purpose, the substituent sites of each molecule were evaluated based on whether they were carrying a substituent that was typical, common, or uncommon in SF-PV molecules. The three cases were distinguished as follows.

In the predicted distribution of chemical groups in SF-PV molecules (cf. Fig. 1c), for each position, a specific portion of molecules was found to carry a chemical group at given pairs of chemical inversion-symmetry equivalent sites, expressed as pPos(Grp). For each site, the mean percentage \({\tilde{p}}^{{{{\rm{Pos}}}}}\) and standard deviation σPos was calculated. The portion of phenyl- and thiophenyl groups was different owing to the choice of structure generation rules per se; therefore, to equalise the picture somewhat, the portion of phenyl- and thiophenyl groups was doubled for the purpose of pPos(Grp). For each position, the distinction between typical, common and uncommon groups was then performed according to:

$${{{\rm{Type}}}}\,\,{{{\rm{of}}}}\,\,{{{{\rm{Grp}}}}}_{{{{\rm{Pos}}}}}=\left\{\begin{array}{l}{{{\rm{typical}}}}\,\,\,\,{{{\rm{for}}}}\,\,{p}^{{{{\rm{Pos}}}}}({{{\rm{Grp}}}}) \,>\, {\tilde{p}}^{{{{\rm{Pos}}}}}+{\sigma }_{p}^{{{{\rm{Pos}}}}}/2,\quad \\ {{{\rm{uncommon}}}}\,\,\,\,{{{\rm{for}}}}\,\,{p}^{{{{\rm{Pos}}}}}({{{\rm{Grp}}}}) \,<\, {\tilde{p}}^{{{{\rm{Pos}}}}}-{\sigma }_{p}^{{{{\rm{Pos}}}}}/2,\quad \\ {{{\rm{common}}}}\,\,\,\,\,{{\mbox{for all remaining cases.}}}\,\quad \end{array}\right.$$
(5)

Following this scheme, for each position of each molecule, one of three cases was assigned. These three cases were subsequently one-hots encoded to obtain three Boolean features for each of the eight positions, giving a total of 24 features for this last group.

To check the generality of the resulting random forest classifier, all structures were randomly subjugated to an inversion operation before feature generation. The model used 300 different decision trees with a maximum depth of 20 distinguishing steps. It was confirmed that for different randomisation seeds, training-testing splits and slightly changed model size and depth parameters, the arising classification rules were approximately identical, within reasonable margins. The training itself was conducted with a balanced data subset of 50% SF-PV and 50% non-SF-PV molecules. From the full chemical space, different subsets were constructed. First, a balanced set with non-SF-PV structures taken from all energetic regions of the E(S1)/E(T1) distribution was considered. Second, non-SF-PV structures were taken from specific energetic quadrants with respect to the mean energies \(\tilde{E}({{{{\rm{S}}}}}_{{{{\rm{1}}}}})\) and \(\tilde{E}({{{{\rm{T}}}}}_{{{{\rm{1}}}}})\) (cf. grey lines on left of Fig. 1c). From each of the quadrants, one additional balanced set was generated. Training was conducted using each of the subsets, which resulted in different classification rules each time. To test the quality of the resulting models, each dataset was split 80:20 into training and testing sets. The quality of each model was determined using the test set of the balanced random set. The best performing model was considered based on accuracy and sensitivity with an emphasis on the latter, since in the actual chemical space much more non-SF-PV structures are expected. It should be noted that although all the datasets used the same SF-PV molecules before the training/testing split, the non-SF-PV entries of each quadrant may or may not have randomly overlapped with the balanced random set.

Once the model was chosen, the probability of each classification, ’SF-PV’ or ’non-SF-PV’, was calculated for all 4,463,586 molecules. From these molecules, the ones with the highest probabilities for SF-PV and non-SF-PV were treated as the most confident; i.e., most congruent with the learned classifier. Consequently, these confident cases were expected to yield the best separation in subsequent investigations of the quantum chemical properties, as performed in the results section.