Introduction

Since the molecular orbital theory was proposed in the 20th century, the concepts of the highest occupied molecular orbital (HOMO) and the lowest unoccupied molecular orbital (LUMO) have been used in various research areas1,2,3,4,5,6,7,8. For example, Fukui et al. found that π-electrons in the HOMO played decisive roles in the reaction of aromatic hydrocarbons and extended this result to the concept of frontier molecular orbitals (i.e., the HOMO and LUMO)9,10. In organic photovoltaics (OPVs), the power conversion efficiency (PCE) can be significantly increased by optimizing the frontier molecular orbital energies of the component materials11,12. Shockley and Queisser suggested that the optimal bandgap in light-harvesting materials is approximately 1.3 eV, which is based on a compromise between the short-circuit current (JSC) and open-circuit voltage (VOC) made to maximize the PCE11. In organic light-emitting diodes (OLEDs), the HOMO and LUMO energies are crucial factors when designing new component materials. For example, ideal host materials should have proper HOMO-LUMO energy gaps, which are required for sufficient spectral overlaps with emitters for efficient energy transfer13. An appropriate alignment of HOMO and LUMO energy levels of component materials in OLEDs is required to transport charge carriers to the emitting layer and trap them there, leading to a high exciton recombination yield14,15.

To design and develop new materials, molecular properties such as HOMO and LUMO energies need to be accurately estimated. Density functional theory (DFT) calculations have been extensively used in many research areas to calculate molecular properties16,17,18,19,20,21. Recently, deep learning (DL) methods based on big data have emerged as a promising solution for the reliable estimation of molecular properties at substantially reduced computational cost22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39. DL methods trained on databases of DFT-calculated molecular properties, including HOMO and LUMO energies, have been reported8,22,40,41,42,43,44,45,46,47,48,49, but their practical use in developing optimal molecules is limited for the following reasons: (1) most databases obtained by DFT calculations contain relatively small molecules50,51,52,53, (2) DFT-calculated values differ from the actual experimental ones54, and (3) molecule-environment interactions are not included in most currently available DFT-calculated databases. Molecular properties such as absorption and emission wavelengths, bandwidths, and HOMO and LUMO energies are significantly influenced by molecule-environment interactions. For practical chemical applications, these limitations of previously developed DL methods based on DFT-calculated databases need to be overcome.

In this work, we built an experimental database containing the HOMO and LUMO energies of various molecules via direct data collection from the literature. We then developed a DL model trained on this experimental database to quickly and reliably predict HOMO and LUMO energies close to the experimental values. The architecture of our DL model is based on our previously developed DL optical spectroscopy (DLOS) model, which has been shown to accurately predict optical and photophysical properties influenced by the local environment55. Compared with DFT calculations, our DL model exhibited better performance in terms of both computational time and prediction error. Interestingly, our DL model was found to rationally recognize the effects of donor–acceptor structures, substituents, conjugation length, heteroatoms, and heavy atoms on the HOMO and LUMO energies. Finally, we demonstrated DL-assisted virtual screening for developing emitter and host molecules for ultra-deep-blue fluorescent OLEDs.

Results and discussion

Experimental database

Our experimental database includes the HOMO and LUMO energies of 3026 organic molecules in solvents or solids, yielding 3362 molecule/solvent combinations; the data collection procedure is described in detail in the Methods. Figure 1a shows the distributions of the HOMO and LUMO energies of molecules in our experimental database, which contains 2990 datapoints for the HOMO energy and 3077 datapoints for the LUMO energy. Figure 1b shows the solvents used for the measurement of the HOMO and LUMO energies. Dichloromethane is the most frequently used solvent for cyclic voltammetry (CV). The experimental errors of the HOMO and LUMO energies are 0.089 and 0.112 eV, respectively (see the Methods for details).

Fig. 1: Experimental database of HOMO and LUMO energies.

a Distributions of HOMO and LUMO energies (EHOMO and ELUMO) of molecules in our experimental database. The number of datapoints (N) is indicated. b Histogram of solvents (CH2Cl2: dichloromethane, CH3CN: acetonitrile, THF: tetrahydrofuran, DMF: N,N-dimethylformamide, ODCB: 1,2-dichlorobenzene, PhCN: benzonitrile, and EtOH: ethanol). c Distributions of the molecular weights of molecules in the QM9 database (in red) and our experimental database (in blue). d Plot of the bandgap (Eg) vs. the transition energy from our experimental database.

In Fig. 1c, the molecular weight distribution in our experimental database is compared with that in the QM9 database, which is one of the DFT-calculated databases. Our experimental database has a much broader molecular weight distribution than the QM9 database and contains much larger molecules that have been developed for practical use in many research areas (Supplementary Fig. 1)56,57,58,59.

In our experimental database of HOMO and LUMO energies, 598 molecule/solvent combinations also appear in our previously reported experimental database of optical properties60. Using the combinations shared by these two databases, the bandgap (the difference between the LUMO and HOMO energies of a given molecule) is plotted against the electronic transition energies (i.e., absorption and emission energies) in Fig. 1d. The bandgap correlates reasonably well with both the absorption and emission energies; it is smaller than the absorption energy but larger than the emission energy. Additionally, the LUMO energy is correlated with the absorption and emission energies, whereas the HOMO energy is not, as shown in Supplementary Fig. 2.

Deep learning model

Our DL model is based on the graph convolutional network (GCN), and its architecture is schematically illustrated in Fig. 2a. As inputs to our DL model, the molecule and the solvent are each represented as a molecular graph that can include up to 150 atoms (excluding hydrogen atoms), which covers most practically used organic molecules. A detailed description of the molecular graph and our DL model can be found in the Methods and elsewhere55. Note that the solvent is included in our DL model because the molecular orbital energies are affected by the local environment of a given molecule, as discussed in detail in the next section. In this work, we used a transfer learning-based DL model to predict the HOMO and LUMO energies because the bandgap and LUMO energy were correlated with the absorption and emission energies (Fig. 1d), which were well predicted by our previously developed DLOS. Transfer learning methods are DL techniques that reuse a model developed for one task as the starting point for training a new model on a similar task61,62,63,64,65,66. Here, we compared two learning methods, i.e., learning from scratch (LS) and transfer learning (TL). In our TL method, our DL model was trained with the experimental database of HOMO and LUMO energies using the pre-trained parameters of our previously developed DLOS as initial conditions (see the Methods for details)55. The TL method has been shown to be more effective with small datasets. The performance of our DL model was examined by varying the training dataset size, as shown in Fig. 2b and Supplementary Tables 1 and 2. The mean absolute errors (MAEs) of the TL method are always smaller than those of the LS method, which indicates that our current DL model benefits from the pre-trained parameters of our previously reported DLOS model. When the dataset size is 50, the MAE of the TL method is about 10.8% smaller than that of the LS method. When the dataset size is ~2000, the MAE of the TL method is only 4.8% smaller than that of the LS method. As the dataset size is increased, the MAEs of the TL and LS methods approach the experimental error of ~0.1 eV.

Fig. 2: Deep learning model.

a Schematic illustration of our DL model used to predict the HOMO and LUMO energies (EHOMO and ELUMO) of 2,6-dinitrotoluene in acetonitrile. See the Methods for the detailed architecture of our DL model. b The mean absolute errors (MAEs) in the test dataset as a function of the training dataset size. The MAE of our DeepHL model is indicated by a magenta star. The solid lines represent the experimental errors of the HOMO (red) and LUMO (blue) energies. c HOMO and LUMO energies of all molecules in our database predicted by our DL model (DeepHL). d HOMO and LUMO energies of 988 organic molecules, calculated using DFT methods (B3LYP/6-31G(d)).

Our final DL model (referred to as ‘DeepHL’), which was re-trained with the combined training and validation datasets, shows high performance in predicting the HOMO and LUMO energies with relatively small MAEs, as shown in Fig. 2c. In addition, the distributions of the prediction error in the training and test datasets are presented in Supplementary Fig. 3. The MAEs of the HOMO and LUMO energies for the test dataset are 0.148 and 0.163 eV, respectively, which are very close to the experimental errors of HOMO and LUMO energies. For the total dataset, the MAEs for the HOMO and LUMO energies are 0.050 and 0.065 eV, respectively.

To compare the accuracy of DeepHL with that of theoretical methods, 988 organic molecules were randomly selected from our database, and their HOMO and LUMO energies were calculated by DFT with the B3LYP functional and the 6-31G(d) basis set, as implemented in the Gaussian 16 software package67. As shown in Fig. 2d, the MAEs of the DFT-calculated HOMO and LUMO energies are 0.425 and 0.839 eV, respectively, which are much larger than the DeepHL prediction errors for the test dataset (0.148 and 0.163 eV for the HOMO and LUMO energies, respectively). Furthermore, DeepHL is superior to DFT calculations in terms of computation time, taking only 0.82 s to predict the HOMO and LUMO energies of the 338 molecules in the test dataset. Note that the HOMO and LUMO energies predicted by DFT calculations depend on the DFT functional and basis set. In fact, it has been reported that DFT-calculated HOMO and LUMO energies deviate from the experimental values with a relatively large error (i.e., the smallest DFT-calculation error = 0.77 eV)54. In this work, B3LYP/6-31G(d) was used because it is commonly used for calculations on large organic materials in OLEDs and solar cells, and has also been used to build computational databases including PubChemQC68.

It should be noted that several DL methods based on DFT-calculated HOMO and LUMO energies have recently been reported, as summarized in Supplementary Table 3. The size of the DFT-calculated databases of HOMO and LUMO energies ranges from ~7000 to ~133,000. The DL models trained with the QM9 database show superior accuracy, as shown in Supplementary Table 3 (refs. 40,41,42,43,44,48,69). However, they are not useful for practical applications in designing and developing new materials because the QM9 database contains only small molecules with up to 9 non-hydrogen atoms (C, N, O, and F)51. For example, hexafluoropropane (C3H2F6, molecular weight = 152 g/mol) is the heaviest molecule in the QM9 database (Supplementary Fig. 1). By contrast, our experimental database contains more practically used molecules than DFT-calculated databases (Supplementary Fig. 1). In addition, DL models based on DFT-calculated databases might inherit the calculation errors shown in Fig. 2d, and might not be able to predict values close to the actual experimental values. In this respect, DeepHL, which is based on the experimental database, is substantially more useful and practical for developing new materials than DL models trained with DFT-calculated databases.

Performance of our DeepHL

Trained with the experimental database, DeepHL can predict the HOMO and LUMO energies of various molecules widely used in optoelectronics. In this section, we show how well DeepHL predicts the HOMO and LUMO energies for the different structural moieties of molecules in our experimental database. Figure 3 summarizes four groups of molecules used in OPVs and OLEDs along with their experimental and DeepHL-predicted HOMO and LUMO energies. For example, the two molecules on the left in Fig. 3a were developed as electron donors in OPVs70,71, and their HOMO energies were accurately predicted by DeepHL. The LUMO energies of ITIC and PC60BM on the right in Fig. 3a, which are electron acceptors, were accurately predicted by DeepHL as well72,73. Figures 3b and c show emitters with various core structures and host molecules commonly used in OLEDs, respectively74,75,76,77,78,79,80,81, and their HOMO and LUMO energies were accurately predicted by DeepHL. Additionally, the molecules shown in Fig. 3d are frequently used as the electron and hole transport layers in OLEDs82,83,84,85, and their HOMO and LUMO energies were predicted within the MAEs.

Fig. 3: Experimental and DeepHL-predicted HOMO and LUMO energies for the selected molecules in our experimental database.

a Molecules used in organic photovoltaics. b Molecules used for OLED emitters. c Molecules used for host materials. d Molecules used for transport materials.

It is interesting to note that the high accuracy of DeepHL results from an understanding of the effects of molecular structures on the HOMO and LUMO energies, including donor–acceptor structures, functional groups, heteroatoms and heavy atoms, and conjugation length. Engineering of donor (D)-acceptor (A) structures is widely used as a design strategy to tune the HOMO and LUMO energies of given molecules. As shown in Fig. 4a, DeepHL can predict the HOMO and LUMO energies of a given D-A type molecule by identifying donor and acceptor moieties in the molecule. For example, three molecules containing triazine (TRZ) as an acceptor in Fig. 4a were predicted to have similar LUMO energies which agree well with the experimental values within the prediction error. Molecules having the same donor moieties, such as dimethyl acridine (DMAC) and phenoxazine (PXZ), are also predicted to have similar HOMO energies. As shown in Supplementary Fig. 5a, DeepHL predicts the HOMO and LUMO energies of DMAC-TRZ as a combination of the HOMO energy of DMAC and the LUMO energy of TRZ. This result agrees well with the DFT calculations shown in Supplementary Fig. 5, indicating that the HOMO and LUMO in DMAC-TRZ are located at DMAC and TRZ, respectively. Additionally, the HOMO and LUMO energies of D–A–D- and A–D–A-type molecules can be accurately predicted. As shown in Supplementary Fig. 6, the LUMO energies of molecules with A–D–A structures of the same acceptor moieties are predicted to be almost the same as the experimental values, whereas the HOMO energies vary with different donor moieties. In the cases of the D–A–D molecules shown in Supplementary Fig. 7, it is observed that the LUMO energies are predicted to gradually decrease as the number of 1,3,4-thiadiazole acceptor moieties is increased.

Fig. 4: Experimental and DeepHL-predicted HOMO and LUMO energies.

a Donor–acceptor-type molecules. Identical moieties are grouped in the colored boxes. b The same core structure with a different number of nitrogen atoms. c Molecules with oxygen or selenium atoms and different conjugation lengths.

DeepHL is also found to accurately represent the effects of functional groups, heteroatoms and heavy atoms, and conjugation on the HOMO and LUMO energies. Functional groups with strong electron-donating ability increase the electron density of the molecules and cause more electron–electron repulsion, leading to increased HOMO and LUMO energies. At the same time, electron-withdrawing groups reduce the HOMO and LUMO energies. The dimethyl amine and nitro groups shown in Supplementary Fig. 8 are electron-donating and electron-withdrawing groups, respectively. The effects of substituents on the HOMO and LUMO energies are precisely reflected in the predicted values. The molecules shown in Fig. 4b have a different number of heteroatoms (nitrogen)86. In Fig. 4b, it is clearly shown that the experimentally measured HOMO and LUMO energies decrease as the number of nitrogen atoms is increased due to the strong inductive effect87. As demonstrated in Fig. 4b, DeepHL can reproduce the tendency of nitrogen atoms to cause a decrease of HOMO and LUMO energies. Furthermore, DeepHL can capture how the conjugation length can change the bandgap, similar to a particle-in-a-box model. As can be seen in Fig. 4c, DeepHL predicts that the bandgap will be decreased when the conjugation length is increased88 or when oxygen is replaced by selenium (i.e., heavy atom)87,89.

Before closing this section, we highlight that DeepHL can predict the HOMO and LUMO energies of molecules in different local environments because it includes the molecule–solvent interactions. As shown in Supplementary Table 4, the HOMO and LUMO energies in solids, dichloromethane, and N,N-dimethylformamide are predicted within the MAEs.

Deep learning-assisted development of emitter and host molecules

The HOMO and LUMO energies are crucial factors when designing organic molecules for use in high-performance OLED devices. A proper alignment of the electronic energies of materials across the multiple layers of OLED devices can facilitate the transportation of charge carriers (i.e., holes and electrons) to the emitting layer and the efficient energy transfer from host molecules to emitters, achieving a high external quantum efficiency (EQE). DeepHL can be effectively used to quickly and accurately predict the HOMO and LUMO energies of component materials for OLED devices. In the following, we will demonstrate how DeepHL can be used to efficiently prescreen newly designed emitter and host molecules for deep-blue OLEDs.

We first consider a typical OLED device structure comprising indium–tin–oxide (ITO, anode), poly(3,4-ethylenedioxythiophene):poly(styrene sulfonate) (PEDOT:PSS, hole injection layer), poly(N-vinylcarbazole) (PVK, hole transport layer), 1,3,5-tris(2-N-phenylbenzimidazolyl) benzene (TPBi, electron transport layer), and lithium fluoride/aluminum (LiF/Al, cathode), as shown in Fig. 5b. For this OLED device structure with deep-blue emission, newly designed emitter and host molecules must satisfy the following requirements. First, the HOMO (LUMO) energy of the host molecule should lie between the HOMO (LUMO) energies of the electron transport layer (TPBi) and the hole transport layer (PVK). Second, the host molecule should have a larger bandgap than the emitter so that the excitation energy of the host is efficiently transferred to the emitter. Third, the emitter should have a bandgap larger than ~2.9 eV for deep-blue emission.

Fig. 5: Deep-blue fluorescent OLED device with TDBA-pyCz and DPAc-Cz developed by DL-assisted virtual screening.

a Molecular structures of DPAc-Cz (host) and TDBA-pyCz (emitter). b Device structure. c CIE 1931 chromaticity diagram. d Current density (J)–Voltage (V)–Luminance (L) curves. e EQE–Luminance (L) curves. f EL spectra at different doping concentrations of emitters at a luminance of 500 cd m−2. A photograph of the device emission is shown in the inset.

We designed 8 host molecules and 10 emitter molecules (Supplementary Figs. 9 and 10, respectively) and then used DeepHL to predict their HOMO and LUMO energies. Finally, we selected DPAc-Cz and TDBA-pyCz as promising host and emitter molecules because the HOMO/LUMO energies of TDBA-pyCz (emitter) and DPAc-Cz (host) were predicted by DeepHL to be −5.84/−2.93 eV and −5.59/−2.37 eV, respectively, which satisfied the aforementioned requirements. It should be noted that none of the newly designed molecules is included in our experimental database, and the top 3 molecules from our experimental database with the highest similarity scores for TDBA-pyCz and DPAc-Cz are presented in Supplementary Fig. 11. In addition, we used the previously developed DLOS to predict the optical and photophysical properties of TDBA-pyCz and DPAc-Cz to ensure that their absorption and emission properties were suitable for deep-blue emission and favorable for efficient energy transfer from DPAc-Cz to TDBA-pyCz.
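The three screening requirements stated above can be written as a simple filter over predicted energies. The following is a minimal sketch, not the authors' code: the `Levels` class and `passes_screen` function are hypothetical names, the host and emitter values are the DeepHL predictions quoted in the text, and the transport-layer energies are placeholder numbers chosen only for illustration (the text does not list the TPBi and PVK values).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Levels:
    """Frontier orbital energies of one material, in eV."""
    homo: float
    lumo: float

    @property
    def gap(self) -> float:
        # Bandgap taken as the LUMO energy minus the HOMO energy.
        return self.lumo - self.homo


def between(x: float, a: float, b: float) -> bool:
    """True if x lies between a and b, in either order."""
    return min(a, b) <= x <= max(a, b)


def passes_screen(host: Levels, emitter: Levels, etl: Levels, htl: Levels,
                  min_emitter_gap: float = 2.9) -> bool:
    # Requirement 1: host HOMO (LUMO) between the transport-layer HOMOs (LUMOs).
    host_aligned = (between(host.homo, etl.homo, htl.homo)
                    and between(host.lumo, etl.lumo, htl.lumo))
    # Requirement 2: the host bandgap exceeds the emitter bandgap.
    host_wider = host.gap > emitter.gap
    # Requirement 3: the emitter bandgap exceeds ~2.9 eV (deep blue).
    deep_blue = emitter.gap > min_emitter_gap
    return host_aligned and host_wider and deep_blue


# DeepHL-predicted values quoted in the text (eV).
host = Levels(homo=-5.59, lumo=-2.37)      # DPAc-Cz
emitter = Levels(homo=-5.84, lumo=-2.93)   # TDBA-pyCz

# Placeholder transport-layer values, NOT taken from the text.
etl = Levels(homo=-6.20, lumo=-2.70)
htl = Levels(homo=-5.50, lumo=-2.20)

result = passes_screen(host, emitter, etl, htl)
```

With these inputs the host bandgap is about 3.22 eV and the emitter bandgap about 2.91 eV, so requirements 2 and 3 hold for the quoted predictions; whether requirement 1 holds depends on the actual transport-layer energies.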

After the suitability of TDBA-pyCz and DPAc-Cz as emitter and host molecules was confirmed by DL prediction, the two molecules were synthesized (see the Supplementary Information for details), and their optical, photophysical, and electrochemical properties were measured, as summarized in Table 1 and Supplementary Table 5. The DL-predicted and experimentally measured optical, photophysical, and electrochemical properties of TDBA-pyCz and DPAc-Cz were found to agree within the prediction error. Notably, the DeepHL-predicted HOMO and LUMO energies agreed very well with the experimental values obtained from cyclic voltammetry (Supplementary Fig. 16), as shown in Fig. 5b, and the emission properties of TDBA-pyCz in DPAc-Cz were suitable for deep-blue emission (Table 1). Furthermore, the DL prediction indicated that efficient energy transfer from DPAc-Cz to TDBA-pyCz was possible because the emission spectrum of DPAc-Cz overlapped well with the absorption spectrum of TDBA-pyCz (Supplementary Fig. 17), which was confirmed by UV-visible absorption and fluorescence spectra measured with TDBA-pyCz, DPAc-Cz, and TDBA-pyCz-doped DPAc-Cz films (Supplementary Fig. 18).

Table 1 DeepHL-predicted HOMO and LUMO energies and DLOS-predicted optical and photophysical properties of TDBA-pyCz and DPAc-Cz (Experiment / Prediction).

Fabrication of deep-blue fluorescent OLED device

Using TDBA-pyCz and DPAc-Cz, OLED devices were fabricated by solution processes, and their performance was fully characterized, as shown in Fig. 5 and Table 2. The OLED devices with TDBA-pyCz and DPAc-Cz were found to exhibit a narrow deep-blue emission at 412 nm (CIE: x = 0.17, y = 0.07; FWHM = 36 nm), satisfying the National Television System Committee standard, and the maximum EQE (EQEmax) was measured to be as high as 6.58%. The time-resolved fluorescence (TRF) signals were measured with a TDBA-pyCz-doped DPAc-Cz film (Supplementary Fig. 20), and the fluorescence lifetime was determined to be 2.4 ns. DFT calculations revealed that TDBA-pyCz exhibits a relatively large singlet-triplet energy gap (ΔEST = 0.42 eV) (Supplementary Fig. 21). The TRF results and DFT calculations confirmed that TDBA-pyCz is a pure fluorescent emitter. The orientation factor of TDBA-pyCz in the DPAc-Cz film was measured to be Θ = 0.187 using angle-dependent fluorescence experiments (Supplementary Figs. 22 and 23), indicating that ~81% of TDBA-pyCz is horizontally aligned along the glass substrate, which is expected to lead to a high out-coupling efficiency. The surface morphology of TDBA-pyCz-doped DPAc-Cz films examined by atomic force microscopy (AFM) shows a small root-mean-square roughness of 0.291 nm (Supplementary Fig. 24), implying that our host and emitter molecules are suitable for the solution process. The current density–voltage curves of the hole-only and electron-only devices presented in Supplementary Fig. 25 reveal that the hole and electron mobilities are reasonably well balanced in the OLED device.

Table 2 EL performance of deep-blue OLEDs with TDBA-pyCz and DPAc-Cz.

In short, the deep-blue fluorescent OLED device with TDBA-pyCz and DPAc-Cz was successfully developed by DL-assisted virtual screening, and exhibited a high EQE (EQEmax = 6.58%), a high horizontal emitter orientation (Θ = 0.187), and reasonably well-balanced hole and electron mobilities. The performance of our solution-processed deep-blue fluorescent OLED device was found to be superior to that of previously reported devices in terms of EQE, emission bandwidth, and emitter orientation, as shown in Supplementary Table 6 and Supplementary Fig. 26.

In this study, we built an experimental database of the HOMO and LUMO energies of 3026 organic molecules in solutions and solid states. We developed a DL model (DeepHL) that reliably and quickly predicts the HOMO and LUMO energies of molecules in different local environments. The high accuracy of DeepHL was shown to result from an understanding of the effect of molecular structure on the HOMO and LUMO energies, including donor–acceptor structures, functional groups, heteroatoms and heavy atoms, and conjugation. Lastly, we demonstrated that DeepHL combined with DLOS can be used to efficiently prescreen newly designed emitter and host molecules optimized for a given OLED device structure. The optical, photophysical, and electrochemical properties of DPAc-Cz and TDBA-pyCz predicted by DeepHL and DLOS were shown to agree very well with the experimental ones. Solution-processed deep-blue fluorescent OLEDs, developed with the aid of our DL predictions (DeepHL and DLOS), exhibited an EQE of 6.58% and a narrow emission bandwidth. Overall, our DL-assisted virtual screening has the potential to accelerate the development of component materials in optoelectronics.

Methods

Building the experimental database of HOMO and LUMO energies

Our experimental database was built by collecting the HOMO and LUMO energies of organic compounds from 860 articles. Overall, our database included the HOMO and LUMO energies of 3026 organic molecules in solvents or solids, yielding 3362 molecule/solvent combinations. In our database, the solution and solid states were labeled to reflect the local environment of given molecules. In the literature, the HOMO and LUMO energies have been measured using ultraviolet photoelectron spectroscopy, inverse photoelectron spectroscopy, and cyclic voltammetry (CV). Additionally, ultraviolet-visible (UV-visible) absorption spectroscopy was applied to measure the optical bandgaps that were used to determine either HOMO or LUMO energy, in the case where one of the energies was not able to be directly measured because of the technical difficulty of the measurements. We found that the HOMO and LUMO energies of some molecules were measured with different experimental methods and/or under different experimental conditions (i.e. in solvents and solid states). The experimental errors of the HOMO and LUMO energies of the same molecules with different experimental methods and/or under different experimental conditions were found to be 0.089 and 0.112 eV for the HOMO and LUMO energies, respectively.

Graph representation of molecules

Molecules and their structural features can be represented by using an adjacency matrix (Aj) and a feature matrix (Xj). The adjacency matrix describes the connectivity of atoms in a given molecule. Single, aromatic, double, and triple bonds are encoded as 1, 1.5, 2, and 3 in the adjacency matrix, respectively, as shown in Fig. 6b. In addition, the diagonal elements are encoded as 1 to represent the atom itself. In our DL model, the maximum number of atoms is 150, which can be readily extended if necessary, giving an adjacency matrix of 150 × 150 elements. The feature matrix is a one-hot encoding of the identity of each atom, the number of hydrogen atoms, the number of connected atoms, aromaticity, hybridization state, ring membership, and formal charge. The total feature matrix size is 150 × 43. An example of the adjacency and feature matrices for 2,6-dinitrotoluene is shown in Fig. 6.

Fig. 6: Graph representation of a molecule.

a Molecular structure. b adjacency matrix (Aj), and c feature matrix (Xj) of 2,6-dinitrotoluene.
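The bond-order encoding of the adjacency matrix described above can be illustrated with a short sketch. This is not the authors' in-house code: the function name `adjacency_matrix` and the bond-list input format are hypothetical, and leaving the padding rows entirely zero (including their diagonal) is an assumption, since the text does not say how unused rows are filled. The example uses the heavy atoms of acetonitrile with a reduced `max_atoms` to keep the output small.

```python
# Bond orders as stated in the text: single 1, aromatic 1.5, double 2, triple 3.
BOND_ORDER = {"single": 1.0, "aromatic": 1.5, "double": 2.0, "triple": 3.0}


def adjacency_matrix(n_atoms, bonds, max_atoms=150):
    """Build a padded, symmetric adjacency matrix as nested lists.

    n_atoms : number of non-hydrogen atoms in the molecule
    bonds   : iterable of (i, j, kind) with 0-based atom indices
    """
    if n_atoms > max_atoms:
        raise ValueError("molecule exceeds the maximum number of atoms")
    A = [[0.0] * max_atoms for _ in range(max_atoms)]
    # Diagonal elements are 1 to represent the atom itself.
    # Assumption: only rows of real atoms get a diagonal entry.
    for i in range(n_atoms):
        A[i][i] = 1.0
    for i, j, kind in bonds:
        A[i][j] = A[j][i] = BOND_ORDER[kind]
    return A


# Heavy atoms of acetonitrile: C(0)-C(1)#N(2), padded to 4 atoms.
A = adjacency_matrix(3, [(0, 1, "single"), (1, 2, "triple")], max_atoms=4)
for row in A:
    print(row)
```

In practice the bond list would come from a cheminformatics toolkit such as RDKit, which the text names as the basis of the in-house code.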

Algorithm of our deep learning model

The detailed algorithm of our DL model is described in Algorithm 1. Aj and Xj are the adjacency and feature matrices of j (the molecule or the solvent), respectively; Hj is the hidden matrix, which is an updated feature matrix of j; reduce_sum represents the summation of all row vectors of the hidden matrix; concat represents the concatenation of vectors; and ∘ denotes function composition. In addition, GCN and MLP stand for the graph convolutional network and multi-layer perceptron, respectively.

Algorithm 1: Deep learning model algorithm

Input: Amol, Xmol, Asol, Xsol (molecule and solvent graphs)

Output: Properties y

1: Hmol(0) ← Xmol
2: Hsol(0) ← Xsol
3: for k in range(Number of GCN layers)    # GCN layers
4:     Hmol ← GCNmol(k)(Amol, Hmol)
5:     Hsol ← GCNsol(k)(Asol, Hsol)
6: endfor
7: zmol(0) ← reduce_sum(Hmol)    # Chemical space layers
8: zsol(0) ← reduce_sum(Hsol)
9: for l in range(Number of MLPs)
10:     zmol ← MLPmol(l)(zmol)
11:     zsol ← MLPsol(l)(zsol)
12: endfor
13: for m in range(Number of interaction layers)    # Interaction layers
14:     z ← MLP(m) ∘ concat(zmol, zsol)
15: endfor
16: y ← MLP(z)    # Output
17: return y

In our DL model (DeepHL), the GCN updates the (l + 1)-th hidden matrix of j (\({{{\mathbf{H}}}}_j^{l + 1}\)) as follows:

$${{{\mathbf{H}}}}_j^{l + 1} = \sigma \left( {{{{\mathbf{A}}}}_j \cdot {{{\mathbf{H}}}}_j^l \cdot {{{\mathbf{W}}}}_j^l + {{{\mathbf{b}}}}_j^l + {{{\mathbf{H}}}}_j^l} \right)$$
(1)

where σ is the rectified linear unit (ReLU), \({{{\mathbf{A}}}}_j\) is the adjacency matrix of j, and \({{{\mathbf{W}}}}_j^l\) and \({{{\mathbf{b}}}}_j^l\) are weight and bias of the l-th layer, respectively.
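Equation (1) can be evaluated by hand on a tiny graph to make the update concrete. The sketch below is illustrative only and uses plain Python lists rather than the Keras layers of the in-house code; the helper names `matmul` and `gcn_layer` and all numerical values are hypothetical. Because the previous hidden matrix is added back inside the activation, the weight matrix must be square so that the shapes match.

```python
def matmul(X, Y):
    """Matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def gcn_layer(A, H, W, b):
    """One update of Eq. (1): H_next = ReLU(A . H . W + b + H)."""
    AHW = matmul(matmul(A, H), W)
    return [[max(0.0, AHW[i][k] + b[k] + H[i][k]) for k in range(len(b))]
            for i in range(len(H))]


# Two bonded atoms (single bond, diagonal 1) with two hidden features each.
A = [[1.0, 1.0],
     [1.0, 1.0]]
H = [[1.0, 0.0],
     [0.0, 1.0]]
W = [[1.0, 0.0],
     [0.0, -2.0]]   # toy weights, not trained values
b = [0.0, 0.5]      # toy bias

H_next = gcn_layer(A, H, W, b)
print(H_next)  # [[2.0, 0.0], [1.0, 0.0]]
```

The second feature of both atoms is clipped to zero by the ReLU, while the skip term keeps the first feature of atom 0 larger than that of atom 1 even though both atoms see the same neighborhood sum.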

After passing through the GCN layers, the row vectors (\({{{\mathbf{h}}}}_{i,j}\)) of \({{{\mathbf{H}}}}_j^{}\) (i.e., \({{{\mathbf{H}}}}_j = ({{{\mathbf{h}}}}_{1,j},{{{\mathbf{h}}}}_{2,j}, \cdots {{{\mathbf{h}}}}_{i,j})^{{{\mathrm{T}}}}\)) are summed to ensure permutation invariance and to produce a chemical space vector of j (\({{{\mathbf{z}}}}_j\)) as follows:

$${{{\mathbf{z}}}}_j = \mathop {\sum}\limits_i {{{{\mathbf{h}}}}_{i,j}}$$
(2)
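The permutation invariance provided by the sum in Eq. (2) is easy to check numerically. A minimal sketch, with a hypothetical helper name `readout` and made-up hidden vectors: reordering the atoms (rows) leaves the chemical space vector unchanged.

```python
def readout(H):
    """Eq. (2): sum the row vectors of the hidden matrix H."""
    return [sum(col) for col in zip(*H)]


# Three atoms with two hidden features each (toy values).
H = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]

# The same atoms listed in a different order.
H_permuted = [H[2], H[0], H[1]]

z = readout(H)
z_permuted = readout(H_permuted)
print(z, z_permuted)  # [9.0, 12.0] [9.0, 12.0]
```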

In our DL model, we used the multi-layer perceptron (MLP), which updates the chemical space vector as follows:

$${{{\mathbf{z}}}}_j^{l + 1} = \sigma \left( {{{{\mathbf{z}}}}_j^l \cdot {{{\mathbf{W}}}}_j^l + {{{\mathbf{b}}}}_j^l} \right)$$
(3)

It should be noted that the last MLP, which is used to calculate the HOMO and LUMO energies, has no activation function because it performs a regression task. The architecture of our DL model, which is in-house code based on the RDKit and Keras packages90,91, has also been described in great detail elsewhere55.

Training procedure of DL model (DeepHL)

A total of 3362 molecule/solvent combinations were randomly divided into 2688, 336, and 338 for the training, validation, and test datasets, respectively. The distributions of HOMO and LUMO energies and molecular weights in the training, validation, and test datasets were found to be very similar, as shown in Supplementary Fig. 4. In addition, we confirmed that there were no duplicate molecule/solvent combinations among the training, validation, and test datasets. The HOMO and LUMO energies were normalized to follow the standard normal distribution. Here, we used two learning methods: learning from scratch (LS) and transfer learning (TL). In the LS method, all parameters (i.e., weights and biases) in the DL model were initialized and optimized over 3000 epochs. The parameters with the lowest validation loss were finally selected. The TL method was performed in two stages. First, the DL model used the optimized parameters of the previously trained DLOS55, and only the parameters in the last hidden layer were optimized. Second, all the parameters in the DL model were fine-tuned with a 1000 times smaller learning rate. We compared the performances of the two methods and further investigated their dependence on the training dataset size. To this end, a group of sub-datasets with sizes of 50, 100, 200, 400, 800, 1000, 1500, 2000, and 2688 was constructed from the training dataset, such that each larger sub-dataset contained the smaller ones. Because small datasets can easily be biased toward a certain composition, a total of 10 groups of sub-datasets with different compositions were used. The dependence of the two training methods (i.e., LS and TL) on the training dataset size was quantified by the MAEs calculated using the test dataset, as shown in Fig. 2b and Supplementary Tables 1 and 2. It was found that the MAEs of the TL method were always smaller than those of the LS method. The TL-based DL model (DeepHL) was finally trained using both the training and validation sets, and the MAEs of the HOMO and LUMO energies of molecules in the test dataset were 0.148 and 0.163 eV, respectively.
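The normalization of the target energies and the MAE metric used above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: the text says only that the energies were normalized to follow the standard normal distribution, so z-scoring with the training-set mean and population standard deviation is an assumption, the helper names are hypothetical, and the energies are made-up numbers.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Return the mean and (population) standard deviation of the targets."""
    return mean(values), pstdev(values)


def normalize(values, mu, sigma):
    return [(v - mu) / sigma for v in values]


def denormalize(values, mu, sigma):
    return [v * sigma + mu for v in values]


def mae(y_true, y_pred):
    """Mean absolute error, in the units of the inputs (eV here)."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


# Made-up HOMO energies (eV) standing in for the training targets.
train_homo = [-5.0, -5.5, -6.0]
mu, sigma = fit_scaler(train_homo)
scaled = normalize(train_homo, mu, sigma)

# Predictions are made in normalized units and mapped back to eV
# before the MAE is computed against experimental values.
restored = denormalize(scaled, mu, sigma)
error = mae([-5.0, -6.0], [-5.1, -5.8])
print(round(error, 3))  # 0.15
```

Fitting the scaler on the training set only, and reusing its mean and standard deviation for the validation and test sets, avoids leaking test statistics into training; whether the original code did this is not stated in the text.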

Although the HOMO and LUMO energies of a given molecule depend on the molecule’s local environment, the MAE of DL predictions can be artificially small when combinations of the same molecule with different solvents are divided between the training and test datasets. To avoid such data leakage, the molecules classified as having the same scaffold by the MurckoScaffold module in RDKit were grouped together and split into the training and test datasets regardless of solvent. In this way, molecules with the same scaffold cannot be included in the training and test datasets at the same time. The prediction MAE of the DL model trained with the scaffold-split datasets was 0.170 eV, which was slightly larger than that of the randomly split datasets (0.155 eV). In short, our DL model shows similar performance regardless of the data splitting method.
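The grouping step of the scaffold split can be sketched independently of the cheminformatics toolkit. In the sketch below the scaffold key of each record is assumed to have been computed beforehand (in the text, with RDKit's MurckoScaffold module); the function name `scaffold_split`, the record layout, the greedy fill strategy, and the example records are all hypothetical.

```python
import random
from collections import defaultdict


def scaffold_split(records, test_fraction=0.1, seed=0):
    """Split records so that no scaffold appears in both train and test.

    records : list of dicts with at least a "scaffold" key
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec["scaffold"]].append(rec)

    # Shuffle whole scaffold groups, never individual records.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    target = test_fraction * len(records)
    train, test = [], []
    for key in keys:
        # Fill the test set group by group until it reaches the target size.
        (test if len(test) < target else train).extend(groups[key])
    return train, test


# Placeholder records; the scaffold labels stand in for precomputed keys.
records = [
    {"molecule": "mol_A", "solvent": "CH2Cl2", "scaffold": "scaffold_1"},
    {"molecule": "mol_A", "solvent": "THF", "scaffold": "scaffold_1"},
    {"molecule": "mol_B", "solvent": "solid", "scaffold": "scaffold_1"},
    {"molecule": "mol_C", "solvent": "CH2Cl2", "scaffold": "scaffold_2"},
    {"molecule": "mol_D", "solvent": "DMF", "scaffold": "scaffold_3"},
    {"molecule": "mol_E", "solvent": "CH3CN", "scaffold": "scaffold_4"},
]

train, test = scaffold_split(records, test_fraction=0.3, seed=0)
train_scaffolds = {r["scaffold"] for r in train}
test_scaffolds = {r["scaffold"] for r in test}
```

Because whole groups are assigned at once, the same molecule in different solvents (mol_A above) always lands on one side of the split, which is the leakage the scaffold split is meant to prevent.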

Materials and synthesis

All reagents were purchased from Sigma–Aldrich, TCI, Acros, and Alfa Aesar and used without purification. All reactions were carried out under nitrogen gas. The chemical structures of the synthesized compounds were analysed through 1H nuclear magnetic resonance (NMR) and 13C NMR spectra recorded in deuterated chloroform using a Varian Mercury 500 MHz spectrometer (Cambridge Isotope Laboratories). The 1H and 13C NMR spectra of the synthesized compounds are presented in Supplementary Figs. 1114.