Background & Summary

High Throughput Virtual Screenings (HTVSs)1,2 have recently been used extensively to identify promising materials in the domain of organic electronics. This powerful technique has often been combined with domain knowledge of the problem, screening modifications of motifs or architectures already known to work for a specific problem, e.g. functionalisation for dye-sensitized solar cells3, donor-acceptor motifs for thermally activated delayed fluorescence (TADF)4, singlet fission (SF)5, and general photovoltaic architectures6. This strategy translates into computational terms the process of experimental discovery based on chemical intuition7,8, and reduces the chemical space to be explored9. However, the findings are bound to fall within the domain of what is already known, which prevents the discovery of new motifs and design rules. Studies based on domain knowledge, such as biradical character for SF10,11 or donor-acceptor motifs for TADF12,13, will not find new design rules. Generative models also tend to find motifs similar to those already known14. Additionally, the identified candidates may not be easy to synthesise in the laboratory or be stable enough to be characterised, despite recent progress in introducing measures of synthetic accessibility in HTVSs15.

In this study, we aim to provide a starting point for computational searches that overcome these limitations by presenting a data set of 48182 organic semiconductors (OSCs) consisting of molecules that were prepared with a documented synthetic pathway and are stable in the solid state, enabling their crystallographic characterisation. The data set is therefore an excellent starting point to identify OSCs for various applications that can guide experimental research. We based our search on the Cambridge Structural Database (CSD)16, from which we selected OSCs with a computational strategy described in the following sections. The CSD dates back to the 1960s–70s, and contains crystal structure data for \( > 1M\) samples prepared for various purposes. Excluding polymorphs17,18,19 or samples measured in different experimental conditions20, the vast majority of molecules in the data set has characterisation data available. Although the CSD does contain entries related to organic materials, it was not built with these applications in mind; any data set derived from the CSD21 is therefore unbiased with respect to the application, though some bias is present due to the choices of research groups to study a certain molecule, or to experimental constraints on the ability to crystallise and characterise the sample. This low bias provides great potential for novel physical insight: when different criteria for the ideal candidates are set on the basis of experimental benchmarks, the most stringent (i.e. most rarely satisfied) ones can be used to translate results into design principles. Additionally, the fact that it contains molecules used for benchmarks in many fields of organic materials research allows testing the reliability of computational screenings for the desired application, by “rediscovering” well known molecules.

Studies of OSCs for technological applications rely on the analysis of various electronic properties, ranging from frontier orbital energies to excited state energies and oscillator strengths. For instance, early searches of materials for organic photovoltaics exploited HOMO and LUMO energies22,23,24,25; high performance non-fullerene acceptors are known to possess a low LUMO/LUMO+1 gap26,27; luminescent materials for new generation organic light emitting diodes (OLEDs) based on TADF28,29, as well as singlet fission candidates30, have been identified by calculating the S1-T1 gap (\(\Delta {E}_{ST}\)); and high mobility semiconductors were discovered by looking at electronic couplings, reorganisation energies and electron-phonon couplings31,32,33. Providing wavefunctions and basic excited state properties for the first few states will enable other researchers to carry out systematic investigations for applications that, to the best of our knowledge, are yet to be explored through computational screenings, such as aggregation induced emitters (AIEgens)34, but also for highly innovative applications based on higher excited states, for which chemical intuition is still limited, e.g. designing anti-Kasha fluorophores35, even displaying delayed fluorescence36.

The data set presented in this work thus contains a collection of spectroscopic properties simulated on the X-ray geometries of existing organic molecules with a simulated HOMO-LUMO gap (\({E}_{gap}\)) below 4 eV, which we therefore define as organic semiconductors; it can be searched for properties relevant to various technological applications. Some data sets offer interesting properties for OSCs relevant for specific applications, e.g. the HOPV for organic photovoltaics37, but they are confined to bounded regions of chemical space, i.e. they exploit domain knowledge about what is known to work. Other data sets offer spectroscopic properties of molecules, such as the QM8 data set38 or the OE62 data set21, but the former is limited in the number and type of heavy atoms and excited states considered, while the latter provides spectroscopic data only for a small fraction of the data set (\(\approx 5K\) entries). The data set we present in this work is thus aimed at complementing the currently available ones in these aspects, which we describe in more detail in the following sections.


The data set of OSCs we present here has been built starting from the Python application programming interface (API) provided within the CSD distribution. To identify OSCs, we started by removing polymeric molecules, disordered solids, and co-crystals from the entries containing X-ray structures. We further reduced the structures to be retained by:

  1. including only the most commonly used elements in typical OSCs in the selection of molecules (H, B, C, N, O, F, Si, P, S, Cl, As, Se, Br, I);

  2. removing entries with more than one molecule type in the unit cell;

  3. removing duplicate entries.
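As a minimal sketch of the element and duplicate filters listed above, the following assumes that each entry exposes a chemical formula string and some key identifying the molecule. The function names and the formula layout are illustrative assumptions and are not part of the CSD API.

```python
import re

# Elements retained in the selection (taken from the list above).
ALLOWED_ELEMENTS = {"H", "B", "C", "N", "O", "F", "Si", "P",
                    "S", "Cl", "As", "Se", "Br", "I"}


def elements_in_formula(formula):
    """Return the set of element symbols found in a formula such as 'C14 H10'."""
    return set(re.findall(r"[A-Z][a-z]?", formula))


def passes_element_filter(formula):
    """True if the formula is non-empty and uses only the allowed elements."""
    found = elements_in_formula(formula)
    return bool(found) and found <= ALLOWED_ELEMENTS


def deduplicate(records):
    """Keep the first record for each key.

    `records` is an iterable of (key, payload) pairs; the choice of key
    (e.g. a canonical identifier of the molecule) is an assumption here.
    """
    seen = set()
    unique = []
    for key, payload in records:
        if key not in seen:
            seen.add(key)
            unique.append((key, payload))
    return unique
```

For example, `passes_element_filter("C10 H10 Fe1")` rejects an iron-containing entry, while `passes_element_filter("C12 H8 S2 Si1")` keeps a purely main-group one.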

X-ray geometries include all heavy atoms, while hydrogen atoms are added and normalised (i.e. placed at a typical X-H distance derived from statistical surveys of neutron diffraction data) using the CSD library’s built-in functions, which exploit such literature data39,40. Because this procedure can introduce errors, e.g. missing hydrogens in diborane moieties, structurally erroneous entries are filtered out by comparing the heavy atom connectivity layer of the InChI41 strings of the CSD entry and of the extracted geometry, followed by a comparison of the chemical formulae of the CSD entry and of the extracted geometry. The data set is up to date with the 2020 version of the CSD; updates starting from the 2021 version are therefore possible.
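The consistency check described above can be illustrated with plain string handling on InChI identifiers. This is a sketch under two assumptions: both structures are available as standard InChI strings, and the chemical formula is read from the InChI formula layer (the formulae may equally be taken from the CSD entry itself). The function names are hypothetical.

```python
def inchi_layers(inchi):
    """Split an InChI string into its version, formula and prefixed layers."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    parts = inchi.split("/")
    layers = {"version": parts[0][len("InChI="):]}
    if len(parts) > 1:
        layers["formula"] = parts[1]
    for part in parts[2:]:
        if part:
            # Keep the first occurrence of each one-letter prefix (c, h, q, ...).
            layers.setdefault(part[0], part[1:])
    return layers


def same_heavy_atom_connectivity(inchi_a, inchi_b):
    """Compare the connectivity ('c') layers of two InChI strings."""
    return inchi_layers(inchi_a).get("c") == inchi_layers(inchi_b).get("c")


def same_formula(inchi_a, inchi_b):
    """Compare the formula layers of two InChI strings."""
    return inchi_layers(inchi_a).get("formula") == inchi_layers(inchi_b).get("formula")


def is_consistent(inchi_entry, inchi_extracted):
    """Entry passes only if both connectivity and formula agree."""
    return (same_heavy_atom_connectivity(inchi_entry, inchi_extracted)
            and same_formula(inchi_entry, inchi_extracted))
```

The two checks are complementary: a missing hydrogen leaves the heavy atom connectivity unchanged but alters the formula, whereas a wrongly assigned skeleton changes the connectivity layer even when the formula is identical.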

This procedure resulted in a reduction of the data set from \(\approx 1M\) to \(\approx 265K\) molecules. To identify OSCs we adopted a three-step computational funnel strategy in combination with a calibration procedure, aimed at estimating the HOMO-LUMO gap (\({E}_{gap}\)) with quantum mechanical (QM) methods of reasonable accuracy. First, we selected three methods of increasing accuracy for our computational funnel: PM7, B3LYP/3-21G*, and B3LYP/6-31G*. Second, we picked a subset of 550 molecules on which we performed single point calculations on the X-ray geometries provided within the CSD, obtaining orbital energies with all three methods. This allowed us to compute calibration curves to estimate the B3LYP/6-31G* HOMO-LUMO gap (\({\widetilde{E}}_{gap}\)) from the lower accuracy ones (see panels b, c, e, and f in Fig. 1), and the associated error distribution. With calibration curves available, we proceeded to compute HOMO-LUMO gaps (\({\widetilde{E}}_{gap}\)) for the entire data set of \(\approx 265K\) molecules (panel d in Fig. 1), estimating the gap that we would obtain if we ran a higher level calculation. Considering the distribution of errors of the calibration curve, at the PM7 level we retained any molecule showing \({\widetilde{E}}_{gap}\le 5.5\) eV as a potential OSC, reducing the data set from \(\approx 265K\) to \(\approx 200K\) molecules. On these molecules, we recomputed the gap at the B3LYP/3-21G* level (panel g in Fig. 1), considering any molecule showing \({\widetilde{E}}_{gap}\le 4\) eV as an OSC, resulting in the \(\approx 50K\) molecules that constitute the data set presented here. 4 eV is a conventional upper limit for semiconductors42, and all the best performing molecules across various applications have a smaller gap. On these molecules, we computed excited state properties at the TD-DFT/M06-2X/def2-SVP level (see Fig. 2), releasing, as part of the data set, the converged ground state wavefunction, and the results for the first three singlet (S1-S3) and triplet (T1-T3) states. A calibration of the TD-DFT method for S1 and T1 excitation energies against \(\approx 100\) data points with available experimental data is presented elsewhere30, and supports the reliability of the method (RMSE \(\approx 0.05\) eV). All QM calculations were carried out with the Gaussian16 software43, and the data provided as part of this release were extracted from output and checkpoint files using the Multiwfn software44 and the CClib python library45.
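The calibration and funnel steps described above can be sketched as follows. The sketch assumes, for illustration only, that each calibration curve is a straight line fitted by least squares to the HOMO (or LUMO) energies of the calibration subset; the functional form actually used is the one shown in Fig. 1, and all function names are hypothetical.

```python
import numpy as np


def fit_calibration(low_level, high_level):
    """Fit a line mapping low-level orbital energies onto high-level ones.

    Returns (slope, intercept). A linear form is an assumption of this sketch.
    """
    slope, intercept = np.polyfit(low_level, high_level, deg=1)
    return slope, intercept


def estimate_gap(homo_low, lumo_low, homo_cal, lumo_cal):
    """Estimated high-level HOMO-LUMO gap (eV) from calibrated orbital energies.

    homo_cal and lumo_cal are (slope, intercept) pairs from fit_calibration.
    """
    homo_est = homo_cal[0] * np.asarray(homo_low, dtype=float) + homo_cal[1]
    lumo_est = lumo_cal[0] * np.asarray(lumo_low, dtype=float) + lumo_cal[1]
    return lumo_est - homo_est


def funnel_step(gap_estimates, threshold_ev):
    """Boolean mask of the molecules retained at one stage of the funnel."""
    return np.asarray(gap_estimates, dtype=float) <= threshold_ev
```

With this layout, the first stage would apply `funnel_step(gaps_from_pm7, 5.5)` and the second `funnel_step(gaps_from_3_21g, 4.0)`; the looser first threshold reflects the wider error distribution of the cheaper method.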

Fig. 1 (a) Computational strategy used to identify OSCs starting from the CSD. (b) Calibration curve to estimate B3LYP/6-31G* HOMO from PM7 HOMO. (c) Calibration curve to estimate B3LYP/6-31G* LUMO from PM7 LUMO. (d) Distribution of estimated B3LYP/6-31G* HOMO-LUMO gap from PM7 energy levels. (e) Calibration curve to estimate B3LYP/6-31G* HOMO from B3LYP/3-21G* HOMO. (f) Calibration curve to estimate B3LYP/6-31G* LUMO from B3LYP/3-21G* LUMO. (g) Distribution of estimated B3LYP/6-31G* HOMO-LUMO gap from B3LYP/3-21G* energy levels.

Fig. 2 Distributions of energy levels computed on X-ray geometries for all entries in the database. Left: frontier molecular orbitals computed at the DFT/M06-2X/def2-SVP level. Right: singlet (S1, S2) and triplet (T1, T2) excited state energies computed at the TD-DFT/M06-2X/def2-SVP level.

These calculations allow for interesting analyses regarding the time evolution of the CSD. For instance, since the deposition date of each entry is known, it is possible to follow how many OSCs were deposited over time, both in absolute and fractional terms. From these analyses (see Fig. 3) we see that, while the absolute number naturally increases over time, the fraction of OSCs within the CSD is roughly constant until ≈2010, and has since approximately doubled, rising from ≈3–4% to ≈7%, in agreement with the evolution of research in the organic materials field.

Fig. 3 Time evolution of OSCs within the CSD. Left: total number of OSCs deposited each year. Right: fraction of OSCs deposited each year.
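The absolute and fractional counts per year can be obtained with a simple tally. The sketch below assumes that the deposition year of every CSD entry considered, and of every entry classified as an OSC, is available as a list of integers; the function name is hypothetical.

```python
from collections import Counter


def yearly_osc_fraction(all_years, osc_years):
    """Return {year: (n_osc, n_total, fraction)} sorted by year.

    all_years: deposition year of every entry considered.
    osc_years: deposition year of the entries classified as OSCs.
    """
    total = Counter(all_years)
    osc = Counter(osc_years)
    return {
        year: (osc.get(year, 0), n, osc.get(year, 0) / n)
        for year, n in sorted(total.items())
    }
```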

Data Records

The curated data set is available from DataCat, the University of Liverpool repository46:

  1. data extracted from QM calculations are provided at the University of Liverpool repository46 in comma-separated values (.csv) format, which can be easily read through common programs or programming languages. A description of the provided data is given in Table 1;

    Table 1 Description of metadata and electronic properties gathered in the database.

  2. the wavefunctions for each entry are provided in a set of 31 sequential archives at the University of Liverpool repository46 allowing for sequential or partial download. Geometries are also given to facilitate analyses. Data are made available in .wfn format;

  3. Gaussian16 QM calculations output files are provided at the University of Liverpool repository46 to allow for additional wavefunction analyses, with the aim to characterise electronic states or transitions, as mentioned in the following sections.
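As an illustration of how the .csv records might be read, the following sketch uses only the Python standard library. The column names (`ID`, `HOMO`, `LUMO`) and the two sample rows are made up for the example; the actual field names are those described in Table 1.

```python
import csv
import io


def read_records(handle, numeric_fields):
    """Read CSV rows as dictionaries, converting the given fields to float."""
    rows = []
    for row in csv.DictReader(handle):
        for name in numeric_fields:
            row[name] = float(row[name])
        rows.append(row)
    return rows


# Hypothetical two-row excerpt; identifiers and values are placeholders.
SAMPLE = "ID,HOMO,LUMO\nENTRY_A,-6.1,-2.3\nENTRY_B,-5.4,-2.9\n"

records = read_records(io.StringIO(SAMPLE), ["HOMO", "LUMO"])

# Select entries whose orbital gap (LUMO - HOMO, in eV) is below 3 eV.
narrow_gap = [r["ID"] for r in records if r["LUMO"] - r["HOMO"] < 3.0]
```

For the released file, `io.StringIO(SAMPLE)` would be replaced by an open file handle, e.g. `open(path, newline="")`.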

Geometries and wavefunctions are provided in .wfn format, the traditional AIM format. We chose this format to provide interested users with data for analyses or subsequent calculations that are independent of the software we used. In fact, .wfn files can be generated or processed with a multitude of tools, among which are the popular software Multiwfn44, the python library IOData47, ORCA48,49 and others50,51. Each .wfn file contains the molecular geometry, as well as the occupied molecular orbitals expressed in the atomic basis and their energies. These data can be used to visualise e.g. geometries and occupied orbitals, but also to run QM calculations with an initial guess to obtain refined properties for applications of interest.

Gaussian16 output files are provided to allow for additional wavefunction analyses on electronic excitations, allowing interested users to avoid repeating calculations that we have already performed.

Technical Validation

The key idea is that new applications of existing molecules can be discovered by searching for useful properties computed for a large data set, thanks to a robust calibration between predicted and experimentally validated data. Crucially, the data set should be as unbiased as possible and unrelated to the property of interest: this way, discoveries are truly unexpected and have a large practical and commercial value. We proved this concept through a range of demonstrations in recent works, covering various application areas. These demonstrations considered an earlier version of the data set consisting of \(\approx 40K\) OSCs. The data set presented here is up to date with the 2020 version of the CSD, and thus contains entries that were not the object of our previous studies; the same strategies can be used on the fraction of molecules not previously considered to discover more potential candidates, in line with our previous findings.

The key applications demonstrated in our previous works are the following:

  1. we showed that it is possible to identify completely new molecules that undergo singlet fission (a property of relevance for solar cells) by calibrating a computational method to yield accurate energies of singlet and triplet excited states and found molecules with the ideal energy level alignment30. The method rediscovered known molecules for singlet fission (true positives), and identified several different families of known compounds with this desirable property;

  2. we proposed a related screening protocol to identify molecules undergoing TADF28, a relevant property in the area of display technologies. The protocol indicated without any adjustable parameter that \(0.3 \% \) of the \(\approx 40K\) molecules considered may undergo TADF. About half of them were known TADF emitters, providing great confidence in the quality of the prediction. The other half of the hits were totally unknown to the field, illustrating in parallel how this approach can lead to completely novel design rules;

  3. we showed that a similar approach can be used to identify novel electron acceptors to be used in organic solar cells to replace expensive and inefficient fullerene derivatives52. Also in this case, about half of the “discovered” molecules were known, the other half being totally novel ones. This work showed that database searching is only the first step and it is possible to modify lead compounds to have other desirable properties, like solubility;

  4. we showed that we can screen for luminescent crystals displaying superradiance or near IR emission53, properties of interest in the areas of light-emitting diodes, organic lasers, and biological imaging. A common theme of all applications, particularly well exemplified by this one, is the ability of large screenings to identify plausible optima for any property; in this case, the maximum red shift that can be observed when a particular molecule is studied in its crystal.

The basis of similar studies can be laid by analysing properties provided in this database similarly to what is shown in Fig. 4. In the left panel, we report T1 vs S1 energies. Potential singlet fission materials fall to the left of the dashed black line, representing the main singlet fission criterion, i.e. S1 = 2T1. Similarly, potential TADF materials fall in the proximity of the dashed blue line, representing the main TADF criterion, i.e. S1 = T1. Colours encode the S1 oscillator strength (\({f}_{{S}_{1}}\)) on a logarithmic scale, since one would be interested in materials able to absorb (singlet fission) or emit (TADF) light efficiently. These types of analyses led us to the work briefly described in points 1 and 2, where we have “rediscovered” well known singlet fission and TADF materials, proving that the starting point, i.e. a reduced version of the data set presented here, is reliable. The same, however, can be done for other properties yet to be studied: for instance, in the right panel of Fig. 4, we report S1 vs S2 energies, useful to identify potential anti-Kasha materials, which fall in the proximity of the dashed black line, representing S2 = 2S1. This is a reasonable criterion according to domain knowledge regarding the role of kinetics in anti-Kasha photoreactions54,55. In this case, colours encode the S2 oscillator strength (\({f}_{{S}_{2}}\)) on a logarithmic scale, since in anti-Kasha materials the fluorescence is expected from a higher excited state.

Fig. 4 Relationships between relevant excited state energies to identify promising materials within the database. Left: T1 vs S1 energies. Potential singlet fission materials fall in proximity of the dashed black line, representing S1 = 2T1. Potential TADF materials fall in proximity of the dashed blue line, representing S1 = T1. Colours encode the S1 oscillator strength (\({f}_{{S}_{1}}\)) through a logarithmic scale, assuming bright states are desirable. Right: S1 vs S2 energies. Potential anti-Kasha materials fall in proximity of the dashed black line, representing S2 = 2S1. Colours encode the S2 oscillator strength (\({f}_{{S}_{2}}\)) through a logarithmic scale, assuming bright states are desirable.
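The three energy criteria discussed for Fig. 4 (S1 = 2T1, S1 = T1, S2 = 2S1) can be turned into a simple first-pass filter. The sketch below is illustrative only: the tolerance of 0.2 eV, the minimum oscillator strength of 0.01, and the function names are arbitrary assumptions and do not reproduce the calibrated protocols of the cited works.

```python
def near_ratio(e_high, e_low, ratio, tolerance_ev):
    """True if e_high lies within tolerance_ev of ratio * e_low (energies in eV)."""
    return abs(e_high - ratio * e_low) <= tolerance_ev


def classify(s1, t1, s2, f_s1, f_s2, tolerance_ev=0.2, min_f=0.01):
    """Return the list of candidate labels for one molecule.

    s1, t1, s2: excitation energies in eV; f_s1, f_s2: oscillator strengths.
    tolerance_ev and min_f are placeholder thresholds chosen for illustration.
    """
    labels = []
    if f_s1 >= min_f and near_ratio(s1, t1, 2.0, tolerance_ev):
        labels.append("singlet fission")   # S1 close to 2 T1, bright S1
    if f_s1 >= min_f and near_ratio(s1, t1, 1.0, tolerance_ev):
        labels.append("TADF")              # S1 close to T1, bright S1
    if f_s2 >= min_f and near_ratio(s2, s1, 2.0, tolerance_ev):
        labels.append("anti-Kasha")        # S2 close to 2 S1, bright S2
    return labels
```

A molecule with S1 = 2.4 eV, T1 = 1.25 eV and a bright S1 would be flagged as a singlet fission candidate, since S1 differs from 2T1 by only 0.1 eV. Any hit from such a filter is a starting point for a calibrated study, not a prediction in itself.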

Usage Notes

Above, we have listed some applications deriving from the data presented here. In general, the starting point for each of those applications was a calibration, against available experimental data, of the computational method used to carry out further analyses. Because we provide the ground state wavefunction for each of our entries, not only will these calibrations be faster, since an initial guess for QM calculations is available, but many more analyses also become accessible. For instance, electronic states or transitions can be thoroughly characterised with packages such as Multiwfn44 or TheoDORE56, which can provide detailed information regarding the nature of an electronic transition (e.g. charge transfer metrics57,58, ghost states59, electronic density difference60, exciton delocalisation61,62, etc.). Additionally, this data set can form the basis of training sets for machine learning models aiming at reproducing the electronic density of molecules63,64, based on experimental X-ray geometries. The availability of CSD identifiers enables the expansion of analyses to molecules in their crystals32, which is fundamental for technological applications of organic semiconductors. Finally, the synthetic approaches that make molecules within the CSD accessible can be easily tracked down thanks to references provided within the data set. This not only provides a prompt source of synthetic routes to be exploited for experimental validation of the results, but is also useful in combination with retrosynthetic planning strategies65,66.