Organic materials repurposing, a data set for theoretical predictions of new applications for existing compounds

We present a data set of 48182 organic semiconductors, consisting of molecules that were prepared with a documented synthetic pathway and are stable in the solid state. We based our search on the Cambridge Structural Database, from which we selected semiconductors with a computational funnel procedure. For each entry we provide a set of electronic properties relevant for organic materials research, and the electronic wavefunction for further calculations and/or analyses. This data set has low bias because it was not built from a set of materials designed for organic electronics, and thus it provides an excellent starting point in the search for new applications of known materials, with great potential for novel physical insight. The data set contains molecules used as benchmarks in many fields of organic materials research, which makes it possible to test the reliability of computational screenings for the desired application by "rediscovering" well-known molecules. This is demonstrated by a series of different applications in the field of organic materials, confirming the potential for the repurposing of known organic molecules.


Background & Summary
High Throughput Virtual Screenings (HTVSs) 1,2 have recently been exploited to a great extent to identify promising materials in the domain of organic electronics. This powerful technique has often been used in combination with domain knowledge of the problem, carrying out screenings of modifications of motifs or architectures known to work for a specific problem, e.g. functionalisation for dye-sensitized solar cells 3 , donor-acceptor motifs for thermally activated delayed fluorescence (TADF) 4 , singlet fission (SF) 5 , and general photovoltaic architectures 6 . This strategy translates into computational terms the process of experimental discovery exploiting chemical intuition 7,8 , and allows the reduction of the chemical space to explore 9 . However, the findings are bound to fall within the domain of what is already known, which prevents the discovery of new motifs and design rules. Studies based on exploiting domain knowledge, like biradical character for SF 10,11 or donor-acceptor motifs for TADF 12,13 , will not find new design rules. Generative models also tend to find motifs similar to those already known 14 . Additionally, the identified candidates may not be easy to synthesise in the laboratory or be stable enough to be characterised, despite recent progress in introducing measures of synthetic accessibility in HTVSs 15 .
In this study, we aim to provide a starting point for computational searches that overcomes these limitations by presenting a data set of 48182 organic semiconductors (OSCs) consisting of molecules that were prepared with a documented synthetic pathway and are stable in the solid state, enabling their crystallographic characterisation. The data set is therefore an excellent starting point to identify OSCs for various applications that can guide experimental research. We based our search on the Cambridge Structural Database (CSD) 16 , from which we selected OSCs with a computational strategy described in the following sections. The CSD dates back to the 1960s-70s, and contains crystal structure data for >1 M samples prepared for various purposes. Excluding polymorphs [17][18][19] or samples measured in different experimental conditions 20 , the vast majority of molecules in the data set has characterisation data available. The CSD was not built with organic materials applications in mind, although it naturally contains entries related to this field. Any data set derived from the CSD 21 is therefore largely unbiased with respect to the application, though some bias remains due to the choices of research groups to study a certain molecule, or to experimental constraints on the ability to crystallise and characterise the sample. This low bias provides great potential for novel physical insight: when different criteria for the ideal candidates are set based on experimental benchmarks, the more stringent (i.e. rarer) ones can be used to translate results into design principles. Additionally, the fact that it contains molecules used as benchmarks in many fields of organic materials research allows testing the reliability of computational screenings for the desired application by "rediscovering" well-known molecules.
Studies of OSCs for technological applications exploit the analyses of various electronic properties, ranging from frontier orbital energies to excited state energies and oscillator strengths. For instance, early searches of materials for organic photovoltaics exploited HOMO and LUMO energies [22][23][24][25] , high performance non-fullerene acceptors are known to possess a low LUMO-LUMO+1 gap 26,27 , luminescent materials for new generation organic light emitting diodes (OLEDs) based on TADF 28,29 as well as singlet fission candidates 30 have been identified by calculating the S1-T1 gap (ΔE_ST), and high mobility semiconductors were discovered by looking at electronic couplings, reorganisation energies and electron-phonon couplings [31][32][33] . Providing wavefunctions and basic excited state properties for the first few states will enable other researchers to carry out systematic investigations for applications that, to the best of our knowledge, are yet to be explored through computational screenings, such as aggregation induced emitters (AIEgens) 34 , but also for extremely innovative applications based on higher excited states, for which chemical intuition is still limited, e.g. designing anti-Kasha fluorophores 35 , even displaying delayed fluorescence 36 .
The data set presented in this work thus contains a collection of spectroscopic properties simulated on the X-ray geometries of existing organic molecules. These molecules show a simulated HOMO-LUMO gap (E_gap) below 4 eV, which is why we define them as organic semiconductors, and the data set can be searched for properties relevant to various technological applications. Some data sets offer interesting properties for OSCs relevant for specific applications, e.g. the HOPV for organic photovoltaics 37 , but they are confined to limited regions of chemical space, i.e. they exploit domain knowledge about what is known to work. Other data sets offer spectroscopic properties of molecules, e.g. the QM8 38 or the OE62 21 data sets, but the former is limited in the number and type of heavy atoms and excited states considered, while the latter provides spectroscopic data only for a small fraction of the data set (≈5 K entries). The data set we present in this work is thus aimed at complementing the currently available ones in these aspects, which we describe in more detail in the following sections.

Methods
The data set of OSCs we present here has been built starting from the Python application programming interface provided within the CSD distribution. To identify OSCs, we started by removing polymeric molecules, disordered solids, and co-crystals from the entries containing X-ray structures. We further reduced the structures to be retained by: 1. including only the most commonly used elements in typical OSCs in the selection of molecules (H, B, C, N, O, F, Si, P, S, Cl, As, Se, Br, I); 2. removing entries with more than one molecule type in the unit cell; 3. removing duplicate entries. X-ray geometries include all heavy atoms, while hydrogen atoms are added and normalised (i.e. placed at a typical X-H distance using statistical surveys of neutron diffraction data) using the CSD library's built-in functions exploiting such literature data 39,40 . Because this procedure can introduce errors, e.g. missing hydrogens in diborane moieties, structurally erroneous entries are filtered out by comparing the heavy atom connectivity layer of the InChI 41 strings of the CSD entry and the extracted geometry, followed by a comparison of the chemical formulae of the CSD entry and the extracted geometry. The data set is up to date with the 2020 version of the CSD, thus updates starting from the 2021 version are possible.
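The element restriction in step 1 can be illustrated with a minimal sketch. It does not use the CSD Python API; it assumes that a chemical formula is available as a plain string with element symbols followed by counts (e.g. "C22 H14"), and the function names are hypothetical.

```python
import re

# Elements retained in the selection, as listed in the text.
ALLOWED_ELEMENTS = frozenset(
    ["H", "B", "C", "N", "O", "F", "Si", "P", "S", "Cl", "As", "Se", "Br", "I"]
)

# An element symbol is one uppercase letter, optionally followed by one lowercase letter.
_ELEMENT_TOKEN = re.compile(r"[A-Z][a-z]?")


def elements_in_formula(formula):
    """Return the set of element symbols found in a formula string (assumed layout)."""
    return set(_ELEMENT_TOKEN.findall(formula))


def has_only_allowed_elements(formula):
    """True if the formula is non-empty and contains only the allowed elements."""
    elements = elements_in_formula(formula)
    return bool(elements) and elements <= ALLOWED_ELEMENTS
```

For example, `has_only_allowed_elements("C22 H14")` passes, while a formula containing a metal such as "C10 H10 Fe1" is rejected.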
This procedure resulted in a reduction of the data set from ≈1 M to ≈265 K molecules. To identify OSCs we adopted a three-step computational funnel strategy in combination with a calibration procedure, aimed at estimating the HOMO-LUMO gap (E_gap) with quantum mechanical (QM) methods of reasonable accuracy. First of all, we selected three methods of increasing accuracy for our computational funnel: PM7, B3LYP/3-21G*, and B3LYP/6-31G*. Second, we picked a subset of 550 molecules on which we performed single point calculations on the X-ray geometries provided within the CSD, obtaining orbital energies with all three methods. This allowed us to compute calibration curves to estimate the B3LYP/6-31G* HOMO-LUMO gap (E_gap) from low accuracy ones (see panels b), c), e), and f) in Fig. 1), and the associated error distribution. With calibration curves available, we proceeded to compute HOMO-LUMO gaps (E_gap) for the entire data set of ≈265 K molecules (panel d) in Fig. 1), estimating the gap that we would obtain if we ran a higher level calculation. Considering the distribution of errors of the calibration curve, at the PM7 level we retained any molecule showing E_gap ≤ 5.5 eV as a potential OSC, reducing the data set from ≈265 K to ≈200 K molecules. On these molecules, we recomputed the gap at the B3LYP/3-21G* level (panel g) in Fig. 1), considering any molecule showing E_gap ≤ 4 eV as an OSC, resulting in the ≈50 K molecules that constitute the data set presented here. 4 eV is a conventional upper limit for semiconductors 42 , and all the best performing molecules across various applications have a smaller gap. On these molecules, we computed excited state properties at the TD-DFT/M06-2X/def2-SVP level (see Fig. 2), releasing, as part of the data set, the converged ground state wavefunction, and the results for the first three singlet (S1-S3) and triplet (T1-T3) states.
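The funnel logic described above can be sketched as follows. This is only an illustration under stated assumptions: the calibration curve is taken to be a straight line fitted by ordinary least squares (the functional form actually used is shown in Fig. 1 and is not specified here), and all function names and numbers are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept (assumed linear calibration)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all x values are identical; cannot fit a line")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def calibrated_gap(low_level_gap, slope, intercept):
    """Estimate the higher-level gap (eV) from a low-level one via the calibration line."""
    return slope * low_level_gap + intercept


def funnel_step(low_level_gaps, slope, intercept, threshold_ev):
    """Keep the identifiers whose calibrated gap does not exceed the threshold."""
    return [
        name
        for name, gap in low_level_gaps.items()
        if calibrated_gap(gap, slope, intercept) <= threshold_ev
    ]
```

In this sketch the first step would call `funnel_step` with PM7 gaps and a threshold of 5.5 eV, and the second with B3LYP/3-21G* gaps and a threshold of 4 eV, each with its own calibration line fitted on the 550-molecule subset.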
A calibration of the TD-DFT method for S1 and T1 excitation energies for ≈100 data points with available experimental data is presented elsewhere 30 , and guarantees the reliability of the method (RMSE ≈ 0.05 eV). All QM calculations were carried out with the Gaussian16 software 43 , and the data provided as part of this release were extracted from output and checkpoint files using the Multiwfn software 44 and the CClib python library 45 .
These calculations allow for interesting analyses regarding the time evolution of the CSD. For instance, since the deposition date of each entry is known, it is possible to follow how many OSCs were deposited over time, both in absolute and fractional terms. From these analyses (see Fig. 3) we see that, while the absolute number is naturally increasing over time, the fractional number of OSCs within the CSD is constant until ≈2010, and since then it has basically doubled, rising from ≈3-4% to ≈7%, which is in agreement with the evolution of research in the organic materials field.
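A minimal sketch of this kind of time analysis is given below. It assumes that each CSD entry is reduced to a hypothetical (deposition year, is-OSC flag) pair, and it computes the fraction per deposition year; whether Fig. 3 uses per-year or cumulative fractions is not specified here.

```python
from collections import Counter


def osc_fraction_by_year(records):
    """Map each year to (number of OSCs, fraction of OSCs among that year's entries).

    `records` is an iterable of (year, is_osc) pairs; this layout is an assumption.
    """
    total = Counter()
    oscs = Counter()
    for year, is_osc in records:
        total[year] += 1
        if is_osc:
            oscs[year] += 1
    return {year: (oscs[year], oscs[year] / total[year]) for year in sorted(total)}
```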

Data records
The curated data set is available from DataCat, the University of Liverpool repository 46 , and comprises:
1. data extracted from QM calculations, provided in comma-separated values (.csv) format, which can be easily read through common programs or programming languages. A description of the provided data is given in Table 1;
2. the wavefunctions for each entry, provided in a set of 31 sequential archives allowing for sequential or partial download. Geometries are also given to facilitate analyses. Data are made available in .wfn format;
3. Gaussian16 QM calculation output files, provided to allow for additional wavefunction analyses, with the aim to characterise electronic states or transitions, as mentioned in the following sections.
Geometries and wavefunctions are provided in .wfn format, the traditional AIM format. We chose this format to provide interested users with data for analyses or subsequent calculations that are independent of the software we used. In fact, .wfn files can be generated or processed with a multitude of tools, among which the popular software Multiwfn 44 , the python library IOData 47 , ORCA 48,49 and others 50,51 . Each .wfn file contains the molecular geometry, as well as the occupied molecular orbitals expressed in the atomic basis and their energies. These data can be used for visualisation of e.g. geometries and occupied orbitals, but also to run QM calculations with an initial guess to obtain refined properties for applications of interest.
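As an illustration of software-independent access, the sketch below extracts the geometry from the text of a .wfn file using only the Python standard library. It assumes the conventional AIM layout, in which each nucleus is written on a line of the form "symbol index (CENTRE n) x y z CHARGE = Z" with Cartesian coordinates in bohr; the function name is hypothetical, and dedicated tools such as those cited above are preferable for production use.

```python
import re

BOHR_TO_ANGSTROM = 0.529177210903

# Assumed nucleus line layout: "  C    1    (CENTRE  1)   x   y   z  CHARGE =  6.0"
_NUCLEUS = re.compile(
    r"^\s*([A-Za-z]{1,2})\s*\d+\s*\(CENTRE\s*\d+\)\s*"
    r"(-?\d+\.\d+)\s*(-?\d+\.\d+)\s*(-?\d+\.\d+)\s*CHARGE\s*=\s*(\d+\.\d+)"
)


def read_wfn_geometry(text):
    """Return a list of (symbol, x, y, z) tuples in angstrom from .wfn file content."""
    atoms = []
    for line in text.splitlines():
        match = _NUCLEUS.match(line)
        if match:
            symbol = match.group(1).capitalize()
            x, y, z = (float(match.group(i)) * BOHR_TO_ANGSTROM for i in (2, 3, 4))
            atoms.append((symbol, x, y, z))
    return atoms
```

The file content can be obtained with `open(path).read()`; lines that do not match the assumed nucleus layout (title, header, primitives, orbitals) are simply skipped.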
Gaussian16 output files are provided to allow for additional wavefunction analyses on electronic excitations, allowing interested users to avoid repeating calculations that we have already performed.

Technical Validation
The key idea is that new applications of existing molecules can be discovered by searching for useful properties computed for a large data set, thanks to a robust calibration between predicted and experimentally validated data. Crucially, the data set should be unbiased and not related to the property of interest: this way, discoveries are truly unexpected and have considerable practical and commercial value. We proved this concept through a range of demonstrations in recent works, covering various application areas. These demonstrations considered an earlier version of the data set consisting of ≈40 K OSCs. The data set presented here is up to date with the 2020 version of the CSD, thus containing entries that were not the objects of our previous studies; the same strategies can be used on the fraction of molecules not previously considered to discover more potential candidates, in line with our previous findings.
The key applications demonstrated in our previous works are the following:
1. we showed that it is possible to identify completely new molecules that undergo singlet fission (a property of relevance for solar cells) by calibrating a computational method to yield accurate energies of singlet and triplet excited states and finding molecules with the ideal energy level alignment 30 . The method rediscovered known molecules for singlet fission (true positives), and identified several different families of known compounds with this desirable property;
2. we proposed a related screening protocol to identify molecules undergoing TADF 28 , a relevant property in the area of display technologies. The protocol indicated, without any adjustable parameter, that 0.3% of the ≈40 K molecules considered may undergo TADF. About half of them were known TADF emitters, providing great confidence in the quality of the prediction. The other half of the hits were totally unknown to the field, illustrating in parallel how this approach can lead to completely novel design rules;
3. we showed that a similar approach can be used to identify novel electron acceptors to be used in organic solar cells to replace expensive and inefficient fullerene derivatives 52 . Also in this case, about half of the "discovered" molecules were known, the other half being totally novel ones. This work showed that database searching is only the first step and it is possible to modify lead compounds to have other desirable properties, like solubility;
4. we showed that we can screen for luminescent crystals displaying superradiance or near IR emission 53 , properties of interest in the areas of light-emitting diodes, organic lasers, and biological imaging.
A common theme of all applications, particularly well exemplified by the last one, is the ability of large screenings to identify plausible optima for any property; in this case, the maximum red shift that can be observed when a particular molecule is studied in its crystal.
The basis of similar studies can be laid by analysing properties provided in this database similarly to what is shown in Fig. 4. In the left panel, we report T1 vs S1 energies. Potential singlet fission materials fall to the left of the dashed black line, representing the main singlet fission criterion, i.e. S1 = 2 T1. Similarly, potential TADF materials fall in the proximity of the dashed blue line, representing the main TADF criterion, i.e. S1 = T1. Colours encode the S1 oscillator strength (f_S1) through a logarithmic scale, since one would be interested in materials able to absorb (singlet fission) or emit (TADF) light with a good performance. These types of analyses led us to the work briefly described in points 1 and 2, where we have "rediscovered" well-known singlet fission and TADF materials, proving that the starting point, i.e. a reduced version of the data set presented here, is reliable. The same, however, can be done for other properties yet to be studied: for instance, in the right panel of Fig. 4, we report S1 vs S2 energies, useful to identify potential anti-Kasha materials, falling in the proximity of the dashed black line, representing S2 = 2 S1. This is a reasonable criterion according to domain knowledge regarding the role of kinetics in anti-Kasha photoreactions 54,55 . In this case, colours encode the S2 oscillator strength (f_S2) through a logarithmic scale, since in anti-Kasha materials the fluorescence is expected from a higher excited state.
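The three energy criteria discussed above can be turned into a simple filter, sketched below. The singlet fission test uses the usual energetic condition S1 ≥ 2 T1 as our reading of the region delimited by the S1 = 2 T1 line; the tolerances defining "proximity" to the TADF and anti-Kasha lines are hypothetical placeholders, not values from the published screenings, and the function name is illustrative only.

```python
def screen_candidate(s1, t1, s2, tadf_tol=0.2, kasha_tol=0.2):
    """Return the set of criteria met by one molecule (all energies in eV).

    tadf_tol and kasha_tol are hypothetical tolerances chosen for illustration.
    """
    labels = set()
    # Singlet fission: the singlet energy can be split into two triplets.
    if s1 >= 2.0 * t1:
        labels.add("singlet_fission")
    # TADF: small, non-negative S1-T1 gap.
    if 0.0 <= s1 - t1 <= tadf_tol:
        labels.add("tadf")
    # Anti-Kasha: S2 close to twice S1.
    if abs(s2 - 2.0 * s1) <= kasha_tol:
        labels.add("anti_kasha")
    return labels
```

In practice, one would apply such a filter to the excitation energies provided in the .csv files and then rank the hits by the relevant oscillator strength, after calibrating the computed energies against experimental data as described above.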

Usage Notes
Above, we have listed some applications deriving from the data presented here. In general, the starting point for each of those applications consisted of a calibration, against available experimental data, of the computational method used to carry out further analyses. Since we provide the ground state wavefunction for each of our entries, not only will these calibrations be faster, because an initial guess for QM calculations is available, but many more analyses also become accessible. For instance, electronic states or transitions can be thoroughly characterised with packages such as Multiwfn 44 or TheoDORE 56 , which can provide detailed information regarding the nature of an electronic transition (e.g. charge transfer metrics 57,58 , ghost states 59 , electronic density difference 60 , exciton delocalisation 61,62 etc.). Additionally, this data set can form the basis of training sets for machine learning models aiming at reproducing the electronic density of molecules 63,64 , based on experimental X-ray geometries. The availability of CSD identifiers enables the expansion of analyses to molecules in their crystals 32 , which is fundamental for technological applications of organic semiconductors. Finally, the synthetic approaches that make molecules within the CSD accessible can be easily tracked down thanks to references provided within the data set. This not only provides a prompt source of synthetic routes to be exploited for the experimental validation of results, but is also useful in combination with retrosynthetic planning strategies 65,66 .

Code availability
Scripts to obtain plots starting from the database are available at the University of Liverpool repository 46 .