Experimental database of optical properties of organic compounds

Experimental databases on the optical properties of organic chromophores are important for the implementation of data-driven chemistry using machine learning. Herein, we present a series of experimental data including various optical properties such as the first absorption and emission maximum wavelengths and their bandwidths (full width at half maximum), extinction coefficient, photoluminescence quantum yield, and fluorescence lifetime. A database of 20,236 data points was developed by collecting the optical properties of organic compounds already reported in the literature. A dataset of 7,016 unique organic chromophores in 365 solvents or in solid state is available in CSV format.

All the optical properties in our database are based on the absorption and emission spectra reported in the originally published papers. To extract the optical properties, the absorption and emission spectra were carefully examined to exclude unreliable experimental results. When collecting the extinction coefficient, absorption maximum, and absorption bandwidth, only background-corrected absorption spectra within the dynamic range (typically, absorbance < 2) were selected. Similarly, when collecting the emission maximum, bandwidth, and quantum yield, only properly measured emission spectra were selected. Φ QY values exceeding 1 were not included. In the absorption spectrum of a given molecule, the first absorption peak was selected and its λ abs, max, σ abs (as the full width at half maximum (FWHM)), and ε max values were obtained. Likewise, the λ emi, max and σ emi (as FWHM) values were obtained from the emission (or fluorescence) spectra. The bandwidths (σ abs and σ emi) were recorded in cm⁻¹ or nm, as provided in the published papers.
Furthermore, the PLQY (Φ QY) measured in degassed media was preferentially collected when reported; otherwise, the PLQY measured in air was collected. The fluorescence (or excited-state) lifetimes (τ) measured by time-resolved fluorescence (TRF) experiments were also collected. When the TRF signal was fit using a multi-exponential function, the lifetime obtained from that fit was recorded. The molecular structures are reported as canonicalized simplified molecular input line entry system (SMILES) strings [8][9][10][11]. For each optical property, a chromophore and solvent pair is provided. For a chromophore in the solid state, the chromophore itself is listed as both chromophore and solvent, and for a chromophore in a solid matrix, the matrix is listed as the solvent.

Data records
The developed database is available at figshare 12 and its format is described in Table 1. The database comprises 20,236 combinations of 7,016 chromophores in 365 solvents, in 17 solid matrices (or hosts), or in the solid state. The SMILES strings of the chromophores and solvents are provided to specify their molecular structures. Each experimental data point from the literature is presented with its corresponding reference, including the digital object identifier (DOI). An example of benzene in cyclohexane is presented in Table 2. Values that are not reported in the references are indicated as NaN (not a number).
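As a minimal sketch of how such a CSV file might be read, the Python snippet below parses a small inline sample with the standard library and converts missing entries (NaN) to None. The column names, the placeholder SMILES strings, and the numerical value are hypothetical and serve only as an illustration; the actual headers are those listed in Table 1.

```python
import csv
import io
import math

# Hypothetical column names and made-up values, for illustration only.
# The real headers are those listed in Table 1 of the data descriptor.
SAMPLE_CSV = """chromophore,solvent,absorption_max_nm,emission_max_nm,quantum_yield
SMILES_A,SMILES_B,450.0,NaN,NaN
"""

NUMERIC_COLUMNS = ("absorption_max_nm", "emission_max_nm", "quantum_yield")


def parse_number(text):
    """Return a float, or None when the entry is missing (NaN or empty)."""
    text = text.strip()
    if not text:
        return None
    value = float(text)
    return None if math.isnan(value) else value


def load_records(handle):
    """Read rows from a CSV file handle and convert the numeric columns."""
    records = []
    for row in csv.DictReader(handle):
        for name in NUMERIC_COLUMNS:
            row[name] = parse_number(row[name])
        records.append(row)
    return records


records = load_records(io.StringIO(SAMPLE_CSV))
```

Mapping NaN to None makes unreported properties explicit, so that later filters can skip them rather than compare against a floating-point NaN.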

Technical Validation
The main purpose of our database is to provide the optical properties of chromophores to the scientific and industrial communities with high quality and reliability.
The validity of the data we collected relies on the peer review of the source articles. To reduce potential errors, we built our database using the following procedure. Two people with sufficient background in spectroscopic measurements separately collected the optical properties from the published papers. A third person cross-checked these two datasets and added them to the database. In addition, outliers such as λ abs, max (λ emi, max) > 950 nm or < 200 nm, λ abs, max > λ emi, max, σ abs or σ emi > 7000 cm⁻¹, τ flu < 0.1 ns, and log₁₀(ε max) < 2.5 were double-checked. Therefore, all the values in the final version of our database were carefully checked against the values and spectra in the originally published papers.
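The outlier criteria listed above can be expressed as a simple screening function. The following Python sketch is only an illustration of those thresholds, not the procedure actually used to build the database; the dictionary keys are hypothetical names, and the example record is made up.

```python
# Thresholds are those quoted in the text; the field names are hypothetical.
def recheck_reasons(rec):
    """Return the reasons, if any, that a record would be double-checked.

    Missing values (None or absent keys) are skipped rather than flagged.
    """
    reasons = []
    lam_abs = rec.get("abs_max_nm")
    lam_emi = rec.get("emi_max_nm")

    for label, lam in (("absorption", lam_abs), ("emission", lam_emi)):
        if lam is not None and (lam > 950 or lam < 200):
            reasons.append(f"{label} maximum outside 200-950 nm")

    if lam_abs is not None and lam_emi is not None and lam_abs > lam_emi:
        reasons.append("absorption maximum beyond emission maximum")

    # The bandwidth threshold applies to values given in wavenumbers.
    for key in ("abs_fwhm_cm1", "emi_fwhm_cm1"):
        width = rec.get(key)
        if width is not None and width > 7000:
            reasons.append(f"{key} above 7000 cm^-1")

    tau = rec.get("lifetime_ns")
    if tau is not None and tau < 0.1:
        reasons.append("lifetime below 0.1 ns")

    log_eps = rec.get("log10_epsilon")
    if log_eps is not None and log_eps < 2.5:
        reasons.append("log10(epsilon) below 2.5")

    return reasons


# Made-up record for illustration only.
example = {"abs_max_nm": 500.0, "emi_max_nm": 480.0, "lifetime_ns": 0.05}
print(recheck_reasons(example))
```

A flagged record is not necessarily wrong; as in the text, a flag only marks the entry for a second look at the original paper.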
A summary of the developed database is provided in Fig. 2. Among the 7,016 chromophores that can be found in our database, 95.2% have molecular weights lower than 1000 g/mol. Moreover, the chromophores contain diverse core structures such as pyrene, coumarin, perylene, porphyrin, boron-dipyrromethene (BODIPY), stilbene, azobenzene, and so on. In addition, the chromophores with molecular weight higher than 1000 g/mol generally comprise long alkyl chains or sugar units, which are introduced to improve the solubility without affecting the optical properties.
The histograms of λ abs, max and λ emi, max in Fig. 2b,c are divided into bins with a 20-nm width and cover a wide range of wavelengths. For example, 63% of the λ abs, max values and 88% of the λ emi, max values are in the visible range (380-700 nm), whereas more than 93% of the chromophores can absorb sunlight (310-750 nm), indicating their potential use as dyes and light-harvesting molecules. Furthermore, our database contains fluorophores covering a wide range of emission wavelengths from the UV to the near infrared (NIR), which are applicable to OLEDs, fluorescence imaging dyes, and fluorescence sensors. In addition, chromophores with various functional groups 13 and individual chromophores measured in various solvents 15,16 are included so that the effects of the functional groups and the solvents (solvatochromism) on the optical properties of the chromophores are well documented.
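Statistics of this kind can be reproduced with a few lines of Python. The sketch below bins wavelengths into 20-nm bins and computes the fraction that falls in a given range; the input list is made up for illustration, and treating both range limits as inclusive is an assumption, since the text does not specify how the boundaries were handled.

```python
def fraction_in_range(values, low, high):
    """Fraction of values with low <= value <= high (inclusive limits assumed)."""
    inside = sum(1 for v in values if low <= v <= high)
    return inside / len(values)


def histogram(values, bin_width):
    """Count values per bin, keyed by the left edge of each bin."""
    counts = {}
    for v in values:
        left = int(v // bin_width) * bin_width
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))


# Made-up absorption maxima in nm, for illustration only.
wavelengths = [350, 395, 410, 640, 720]

visible_fraction = fraction_in_range(wavelengths, 380, 700)
bins_20nm = histogram(wavelengths, 20)
```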
The histogram of the collected Φ QY values is also divided into bins with a width of 0.05 (Fig. 2d). The standards for PLQYs, such as quinine sulfate and rhodamine 6G in solution, are also included 17 . Among the obtained QY data, the Φ QY of 91 data points is 0, and that of 137 data points is 1, whereas for approximately 23% of the QY data, the Φ QY is less than 0.05. Furthermore, the PLQYs of 803 samples in solid state were obtained mainly from OLED molecules 18,19 . In addition, molecules exhibiting aggregation induced emission were also collected for our database 20,21 .
Figure 2e,f display the σ abs and σ emi values that were extracted from the absorption and emission spectra of over 1,600 and 2,800 molecules, respectively. Our database contains 3,292 and 7,198 data points of σ abs and σ emi reported in nm, and 747 and 627 data points of σ abs and σ emi reported in cm⁻¹. Compared with the other optical properties, σ abs and σ emi values were rarely reported in the published papers. Most of the σ abs and σ emi values in nm were extracted directly from the absorption and emission spectra reported in the originally published papers.
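Because the bandwidths are recorded in two different units, it may be useful to compare them on a common scale. The following Python sketch converts an FWHM given in nm to cm⁻¹ using the relation between wavenumber and wavelength, ν̃ (cm⁻¹) = 10⁷ / λ (nm). It assumes that the two half-maximum points lie symmetrically about the peak wavelength, which is an approximation not stated in the source; the function names and input values are illustrative.

```python
def fwhm_nm_to_cm1(peak_nm, fwhm_nm):
    """Convert an FWHM in nm to cm^-1.

    Assumes the half-maximum points are at peak_nm -/+ fwhm_nm / 2,
    i.e. a band that is symmetric on the wavelength axis.
    """
    blue_edge_nm = peak_nm - fwhm_nm / 2.0
    red_edge_nm = peak_nm + fwhm_nm / 2.0
    return 1.0e7 / blue_edge_nm - 1.0e7 / red_edge_nm


def fwhm_nm_to_cm1_approx(peak_nm, fwhm_nm):
    """First-order approximation, valid when fwhm_nm is much less than peak_nm."""
    return 1.0e7 * fwhm_nm / peak_nm ** 2


# Illustrative values only: a 50-nm-wide band centred at 500 nm.
exact = fwhm_nm_to_cm1(500.0, 50.0)
approx = fwhm_nm_to_cm1_approx(500.0, 50.0)
```

For this example the two results differ by well under 1%, but the difference grows for broad bands, so the edge-based form is the safer choice when the width is a sizeable fraction of the peak wavelength.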
The standards for the fluorescence lifetime (τ flu) values reported by Boens et al. 22 as well as other τ flu measurements were also included in our database. The histogram of the collected τ flu values is divided into bins with a width of 1 ns (Fig. 2g) and shows that approximately 5% of the τ flu values are longer than 20 ns.
The ε max values at λ abs, max are recorded as log₁₀(ε max), and their distribution is shown in Fig. 2h as a histogram with a bin width of 0.2. In our database, most of the ε max values are in the range of 10³ to 10⁶ mol⁻¹ dm³ cm⁻¹ (L mol⁻¹ cm⁻¹). Note that the product of Φ QY and ε max is proportional to the brightness, which is the fluorescence intensity per fluorophore. In addition, both Φ QY and ε max are available for 6,663 data points, which can therefore be used to estimate the brightness.
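Since ε max is stored on a logarithmic scale, estimating the brightness requires converting it back before multiplying by Φ QY. A minimal Python sketch of this calculation is given below; the function name and the example values are illustrative, and the result is only proportional to the brightness, as stated above.

```python
def brightness(quantum_yield, log10_epsilon):
    """Return quantum_yield * epsilon_max, or None if either value is missing.

    epsilon_max is recovered from its base-10 logarithm, so the result
    has the units of the extinction coefficient (L mol^-1 cm^-1).
    """
    if quantum_yield is None or log10_epsilon is None:
        return None
    return quantum_yield * 10 ** log10_epsilon


# Made-up values for illustration only.
example_brightness = brightness(0.5, 4.0)
missing_brightness = brightness(None, 4.0)
```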
Estimation of experimental uncertainties of the optical properties in the database. The optical properties of organic compounds were collected from published peer-reviewed papers. In most of the original papers, the experimental uncertainties of the seven optical properties were not reported. In addition, the optical properties were measured under different experimental conditions. Therefore, it is very difficult to estimate the experimental uncertainties accurately. Nevertheless, the experimental uncertainties of the optical properties are roughly estimated in the following way.
Experimental uncertainty of λ abs, max and λ emi, max. Most of the UV-visible absorption and emission spectra reported in the published papers were measured with spectrophotometers and spectrofluorometers from Agilent, Ocean Optics, Hitachi, and JASCO. These and other typical modern absorption and emission spectrometers have a wavelength resolution of less than 1 nm. The maximum wavelengths (λ abs, max and λ emi, max) of absorption and emission spectra can be readily determined within an experimental error of 1 nm. Therefore, the experimental uncertainty of λ abs, max and λ emi, max is estimated to be less than 1 nm.
Experimental uncertainty of σ abs and σ emi. When the absorption and emission bandwidths (σ abs and σ emi), expressed as the full width at half maximum (FWHM), were not directly reported, they were extracted from the absorption and emission spectra in the published papers. The error of this extraction is much smaller than the line thickness of the plotted spectra. The experimental uncertainty of σ abs and σ emi in FWHM is therefore estimated to be at most 2 nm.
Experimental uncertainty of Φ QY. The photoluminescence quantum yield (Φ QY) is the most error-prone of the seven optical properties. The experimental error in Φ QY is affected by several factors, such as the instrument, the measurement method (absolute vs. relative), and the presence of molecular oxygen (O₂). The IUPAC technical report is useful for estimating the error in Φ QY 17. For example, the Φ QY of 9,10-diphenylanthracene in cyclohexane is in the range of 0.9 to 0.97. Because Φ QY is error-prone, its experimental uncertainty is conservatively estimated to be at most 0.1.
Experimental uncertainty of τ flu. The fluorescence lifetime (τ flu) is determined by an exponential fit to the time-resolved fluorescence (TRF) signal. The experimental uncertainty of τ flu arises mainly from the instrument response function (IRF) and the multi-exponential fitting process. Because the IRF determines the time resolution of the TRF spectrometer, the experimental error of τ flu is significant when τ flu is shorter than the IRF. In most cases, the multi-exponential fitting error does not exceed 1%. We collected τ flu values that were substantially longer than the IRF. Therefore, the experimental uncertainty of τ flu is conservatively estimated to be 1%.
Experimental uncertainty of log₁₀(ε max). To determine the extinction coefficient (ε max), the absorbance (A) and the concentration (c) of the chromophore must be known, according to Beer's law (A = εbc, where b is the path length). Considering that the published papers are peer-reviewed, the experimental error in the concentration is assumed to be less than 5%. Therefore, the experimental uncertainty of log₁₀(ε max) is estimated to be about 0.02, which corresponds to log₁₀(1.05).
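The step from a 5% concentration error to an uncertainty of about 0.02 in log₁₀(ε max) can be written out explicitly. The short derivation below is a sketch that assumes the errors in the absorbance A and the path length b are negligible compared with the error in the concentration c, which is the only error source considered in the text.

```latex
\begin{align}
\varepsilon_{\max} &= \frac{A}{b\,c}, \\
\log_{10}\varepsilon_{\max} &= \log_{10}A - \log_{10}b - \log_{10}c, \\
\left|\Delta\log_{10}\varepsilon_{\max}\right|
  &\approx \log_{10}\!\left(1 + \frac{\Delta c}{c}\right)
   = \log_{10}(1.05) \approx 0.021 .
\end{align}
```

The linearized form, (1/ln 10)(Δc/c) ≈ 0.022 for Δc/c = 0.05, gives essentially the same value.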

Code availability
The optical properties of the chromophores were extracted from the scientific literature, and the resulting database is available at https://doi.org/10.6084/m9.figshare.12045567.v2 12. We have opened a user-friendly webpage (http://Deep4Chem.korea.ac.kr/search) where users can search for chromophores in the database. The database behind this webpage will be updated regularly.