## Background & Summary

Organic chromophores used in optoelectronics, organic light emitting diodes (OLEDs), staining, fluorescent dyes, and bioimaging dyes, have been steadily developed. Therefore, it would be useful to reliably and quickly predict the optical properties of newly designed organic chromophores prior to their synthesis. Theoretical calculations based on ab initio and density functional theory methods have been extensively used to characterize the optical properties of newly designed organic chromophores. However, such theoretical calculations require high computational costs. Therefore, data-driven sciences based on machine learning have emerged as a promising alternative method and have been applied in many research areas1,2,3. However, databases are a prerequisite for data-driven sciences based on machine learning. Thus, databases for specific applications need to be available or collected.

The optical properties, such as absorption and emission maximum wavelengths and their bandwidths, extinction coefficient, photoluminescence quantum yield (PLQY), and lifetime, are important factors in characterizing organic chromophores. Therefore, databases on optical properties can be used to model the quantitative structure–property relationship for designing new organic chromophores with desired optical properties. Recently, the absorption peaks and extinction coefficients of small organic molecules have already been obtained using quantum chemical calculations and have been used for machine learning4,5,6. In addition, Beard et al. have reported the datasets of experimental and computational ultraviolet–visible (UV–Vis) absorption spectra7. However, no databases are currently available for the experimental absorption, emission, and fluorescence properties of organic chromophores.

As illustrated in Fig. 1, the absorption properties of organic chromophores are characterized by the first maximum absorption wavelength (λabs, max), bandwidth (σabs), and extinction coefficient (εmax) (Fig. 1a), which are important parameters for the design of chromophores for specific applications in various research fields such as photovoltaics, dyes, and optical filters. Similarly, the emission and fluorescent properties, which are characterized by the maximum emission wavelength (λemi, max), bandwidth (σemi), PLQY (ΦQY), and excited state lifetime (τ) (Fig. 1b, c), are essential for the development of emitters in OLEDs, fluorescent bioimaging dyes, and fluorescent sensors. In this study, we present a reliable and high-quality database of the optical properties of organic compounds that can be used for various purposes in diverse research fields.

## Methods

A total of 1,358 articles containing organic compounds were downloaded from journals of Nature Research, American Chemical Society, Royal Society of Chemistry, Springer, and Elsevier by searching for keywords such as fluorescence, luminescence, emission, OLED, fluorescence lifetime, or PLQY.

In our database, the organic compounds and solvent molecules are limited to a maximum of 150 non-hydrogen atoms and consist of C, N, O, S, F, Cl, Br, I, Se, Te, Si, P, B, Sn, and Ge. Binary or ternary solvent systems are not included. Data points in the solid state include one-component systems (either amorphous or crystalline) and solid solutions such as dopant (chromophore) – host (solvent) systems.

All the optical properties in our database are based on the absorption and emission spectra reported in the originally published papers. To extract the optical properties, the absorption and emission spectra were carefully examined to exclude unreliable experimental results. When collecting the extinction coefficients, absorption maxima, and absorption bandwidths, only background-corrected absorption spectra within the dynamic range (typically, absorbance < 2) were selected. Similarly, when collecting the emission maxima, bandwidths, and quantum yields, only properly measured emission spectra were selected. ΦQY values exceeding 1 were not included. In the absorption spectrum of a given molecule, the first absorption peak was selected and its λabs, max, σabs (as the full width at half maximum (FWHM)), and εmax values were obtained. Likewise, the λemi, max and σemi (as FWHM) values were obtained from the emission (or fluorescence) spectra. The bandwidths (σabs and σemi) were recorded in cm−1 or nm, as provided in the published papers.
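Because the bandwidths are recorded in either nm or cm−1, a user may want to bring them to a common unit. The following is a minimal sketch of one possible conversion, not a procedure used to build the database. It assumes that the half-maximum points lie symmetrically at λmax ± FWHM/2 on the wavelength axis, which is only an approximation for real, asymmetric bands; the function name `fwhm_nm_to_wavenumber` is hypothetical.

```python
def fwhm_nm_to_wavenumber(lambda_max_nm: float, fwhm_nm: float) -> float:
    """Convert a FWHM in nm to cm^-1.

    Assumption (not from the source): the half-maximum points sit at
    lambda_max - fwhm/2 and lambda_max + fwhm/2 on the wavelength axis.
    """
    half_width_nm = fwhm_nm / 2.0
    blue_edge_nm = lambda_max_nm - half_width_nm
    red_edge_nm = lambda_max_nm + half_width_nm
    if blue_edge_nm <= 0:
        raise ValueError("FWHM is too large for the given peak wavelength")
    # Wavenumber in cm^-1 equals 1e7 divided by the wavelength in nm.
    return 1e7 / blue_edge_nm - 1e7 / red_edge_nm
```

For a band at 500 nm with a 50 nm FWHM, this gives about 2005 cm−1, close to the first-order estimate 10^7 × Δλ/λ² = 2000 cm−1.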

Furthermore, the PLQY (ΦQY) measured in degassed media was preferentially collected, if reported. Otherwise, the PLQY measured in air was collected. The fluorescence (or excited state) lifetimes (τ) measured by time-resolved fluorescence (TRF) experiments were also collected. When the TRF signal was fit using a multi-exponential function [$$S(t)=\sum_{i} A_{i}\exp(-t/\tau_{i})$$], where Ai and τi are the amplitude and time constant of the i-th component, the average lifetime [$$\tau=\sum_{i} A_{i}\tau_{i}/\sum_{i} A_{i}$$] was obtained and recorded. The molecular structures are reported in the canonicalized simplified molecular input line entry system (SMILES)8,9,10,11. Each set of optical properties is provided for a pair of chromophore and solvent; for solid states, the chromophore is used as both chromophore and solvent. Moreover, for chromophores in a solid matrix, the solid matrix is used as the solvent.
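The amplitude-weighted average lifetime defined above can be sketched as follows. This is an illustrative helper only; the function name `average_lifetime` and its argument names are assumptions, not part of the database or its tooling.

```python
def average_lifetime(amplitudes, time_constants):
    """Amplitude-weighted average lifetime: sum(A_i * tau_i) / sum(A_i)."""
    if len(amplitudes) != len(time_constants) or not amplitudes:
        raise ValueError("need one time constant per amplitude, and at least one")
    total_amplitude = sum(amplitudes)
    if total_amplitude == 0:
        raise ValueError("the amplitudes must not sum to zero")
    weighted_sum = sum(a * tau for a, tau in zip(amplitudes, time_constants))
    return weighted_sum / total_amplitude
```

For a bi-exponential fit with amplitudes 0.7 and 0.3 and time constants of 1 ns and 5 ns, the function returns approximately 2.2 ns. A single-exponential fit returns its only time constant unchanged.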

## Data Records

The developed database is available at figshare12 and its format is described in Table 1. The database comprises 20,236 combinations of 7,016 chromophores in 365 solvents and 17 solid matrices (or host) or solid states. Furthermore, the SMILES strings of the chromophores and solvents are provided and they indicate their molecular structures. All experimental data from the literature are presented with the corresponding reference, and each digital object identifier (DOI) is also reported. An example of benzene in cyclohexane is presented in Table 2. The data that are not reported in the references are indicated as NaN (not a number).
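Since unreported values are stored as NaN, they need to be filtered out before any statistics are computed. The sketch below shows one way to do this with the Python standard library. The column names and the two sample rows are hypothetical placeholders, not the actual field names of Table 1 or real entries of the database.

```python
import csv
import io
import math

# Hypothetical column names and placeholder rows, for illustration only.
SAMPLE = """chromophore,solvent,absorption_max_nm,quantum_yield
SMILES_A,SMILES_S,400.0,0.5
SMILES_B,SMILES_S,450.0,NaN
"""

TEXT_COLUMNS = {"chromophore", "solvent"}


def load_records(text):
    """Parse CSV text; numeric fields become floats, and 'NaN' becomes float nan."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        record = {}
        for key, value in row.items():
            # float("NaN") yields a floating-point nan in Python.
            record[key] = value if key in TEXT_COLUMNS else float(value)
        records.append(record)
    return records


def reported(records, column):
    """Keep only the records in which the given numeric column is reported."""
    return [r for r in records if not math.isnan(r[column])]
```

With the placeholder rows above, `reported(load_records(SAMPLE), "quantum_yield")` keeps only the first row.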

## Technical Validation

The main purpose of our database is to provide the optical properties of chromophores to the scientific and industrial communities with high quality and reliability.

The validation of the data we collected relies on the validation of peer-reviewed articles. To reduce potential errors, we built our database using the following procedure. Two people, who had sufficient background in spectroscopic measurements, separately collected the optical properties from the published papers. A third person cross-checked these two datasets and added them to the database. In addition, outliers such as λabs, max (λemi, max) > 950 nm or < 200 nm, λabs, max > λemi, max, σabs or σemi > 7000 cm−1, τflu < 0.1 ns, and log10(εmax) < 2.5 were double-checked. Therefore, all the values in the final version of our database were carefully checked against the values and the spectra in the originally published papers.
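The outlier criteria listed above can be expressed as a simple screening function. This is a minimal sketch of how such a check might look, not the authors' actual tooling; the function and parameter names are assumptions, and missing values (None or NaN) are simply skipped.

```python
import math


def _present(value):
    """True when a value is reported (neither None nor NaN)."""
    return value is not None and not math.isnan(value)


def outlier_flags(lambda_abs_nm=None, lambda_emi_nm=None,
                  sigma_abs_cm=None, sigma_emi_cm=None,
                  tau_ns=None, log10_eps=None):
    """Return the double-check criteria that a record triggers."""
    flags = []
    for name, wavelength in (("lambda_abs", lambda_abs_nm),
                             ("lambda_emi", lambda_emi_nm)):
        if _present(wavelength) and (wavelength > 950 or wavelength < 200):
            flags.append(f"{name} outside 200-950 nm")
    if (_present(lambda_abs_nm) and _present(lambda_emi_nm)
            and lambda_abs_nm > lambda_emi_nm):
        flags.append("lambda_abs > lambda_emi")
    # Bandwidth thresholds apply to values expressed in cm^-1.
    for name, width in (("sigma_abs", sigma_abs_cm),
                        ("sigma_emi", sigma_emi_cm)):
        if _present(width) and width > 7000:
            flags.append(f"{name} > 7000 cm^-1")
    if _present(tau_ns) and tau_ns < 0.1:
        flags.append("tau < 0.1 ns")
    if _present(log10_eps) and log10_eps < 2.5:
        flags.append("log10(eps_max) < 2.5")
    return flags
```

A flagged record is not necessarily wrong; as in the procedure above, a flag only marks the entry for a second look at the original paper.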

A summary of the developed database is provided in Fig. 2. Among the 7,016 chromophores that can be found in our database, 95.2% have molecular weights lower than 1000 g/mol. Moreover, the chromophores contain diverse core structures such as pyrene, coumarin, perylene, porphyrin, boron-dipyrromethene (BODIPY), stilbene, azobenzene, and so on. In addition, the chromophores with molecular weight higher than 1000 g/mol generally comprise long alkyl chains or sugar units, which are introduced to improve the solubility without affecting the optical properties.

The histograms of λabs, max and λemi, max in Fig. 2b,c are divided into bins with a 20-nm width, covering a wide range of λabs, max and λemi, max. For example, 63% of λabs, max and 88% of λemi, max are in the visible range (380–700 nm), whereas more than 93% of the chromophores can absorb sunlight (310–750 nm), indicating their potential use as dyes and light harvesting molecules. Furthermore, our database contains fluorophores covering a wide range of emission wavelengths from UV to near infrared (NIR), which are applicable to OLEDs, fluorescence imaging dyes, and fluorescence sensors. In addition, chromophores with various functional groups13,14 and a chromophore in various solvents15,16 are included so that the effects of the functional groups and the solvents (solvatochromism) on the optical properties of the chromophores are well documented.
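The fixed-width binning and the range fractions quoted above can be reproduced from a list of wavelengths along the following lines. This is a sketch under assumptions: the helper names are hypothetical, the bin edges are taken to start at a multiple of the bin width, and the range is treated as inclusive at both ends.

```python
from collections import Counter


def bin_counts(values, width, origin=0.0):
    """Count values per fixed-width bin; keys are the lower bin edges."""
    counts = Counter()
    for value in values:
        index = int((value - origin) // width)
        counts[origin + index * width] += 1
    return dict(sorted(counts.items()))


def fraction_in_range(values, low, high):
    """Fraction of values with low <= value <= high."""
    if not values:
        raise ValueError("no values given")
    inside = sum(1 for value in values if low <= value <= high)
    return inside / len(values)
```

For the illustrative list `[365, 381, 399, 455, 720]` (not database entries), `bin_counts(..., 20)` places two values in the 380 nm bin, and `fraction_in_range(..., 380, 700)` returns 0.6.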

The histogram of the collected ΦQY values is also divided into bins with a width of 0.05 (Fig. 2d). The standards for PLQYs, such as quinine sulfate and rhodamine 6G in solution, are also included17. Among the obtained QY data, the ΦQY of 91 data points is 0, and that of 137 data points is 1, whereas for approximately 23% of the QY data, the ΦQY is less than 0.05. Furthermore, the PLQYs of 803 samples in solid state were obtained mainly from OLED molecules18,19. In addition, molecules exhibiting aggregation induced emission were also collected for our database20,21.

Figure 2e,f display the σabs and σemi values that were extracted from the absorption and emission spectra of over 1,600 and 2,800 molecules, respectively. Our database contains 3,292 and 7,198 data points of σabs and σemi reported in nm and 747 and 627 data points of σabs and σemi reported in cm−1. The σabs and σemi values were rarely reported in the published papers when compared with other optical properties. Most of the σabs and σemi values in nm were extracted directly from the absorption and emission spectra reported in the originally published papers.

The standards for the fluorescence lifetime (τflu) values reported by Boens et al.22 as well as other τflu measurements were also included in our database. The histogram of the collected τflu is divided into bins with a width of 1 ns (Fig. 2g), indicating that approximately 5% of the τflu values are longer than 20 ns.

The εmax values at λabs, max are recorded as log10(εmax), and their distribution is shown as a histogram with a bin width of 0.2 (Fig. 2h). In our database, most of the εmax values are in the range of 10^3–10^6 mol−1 dm3 cm−1 (mol−1 L cm−1). Note that the product of ΦQY and εmax is proportional to the brightness, which is the fluorescence intensity per fluorophore. In addition, the number of data points for which both ΦQY and εmax are reported is 6,663; these can be used to estimate the brightness.
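Because the database stores log10(εmax) rather than εmax itself, the brightness estimate requires converting back to the linear scale first. A minimal sketch follows, assuming the brightness is taken simply as the product ΦQY × εmax (any proportionality constant is omitted); the function name `brightness` is hypothetical.

```python
def brightness(quantum_yield: float, log10_eps_max: float) -> float:
    """Estimate brightness as PLQY times eps_max (in mol^-1 L cm^-1)."""
    if not 0.0 <= quantum_yield <= 1.0:
        raise ValueError("quantum yield must lie between 0 and 1")
    # The database records log10(eps_max), so undo the logarithm first.
    eps_max = 10.0 ** log10_eps_max
    return quantum_yield * eps_max
```

For example, ΦQY = 0.5 and log10(εmax) = 4.0 give a brightness of about 5,000 mol−1 L cm−1.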

Finally, the optical properties of chromophores are solvent-dependent. In our database, 365 solvents are included. Among the 12 most common solvents presented in Fig. 2i, dichloromethane is the most frequently used. Moreover, alkanes with a number of carbon atoms ranging from 2 (1,1-dichloroethane) to 16 (1-chlorohexadecane) and 99 alcohols with one (methanol) to 12 (dodecanol) carbon atoms are reported as solvents. In addition, solid solutions and host molecules, such as 4,4ʹ-bis(carbazol-9-yl)biphenyl, bis[2-(diphenylphosphino)phenyl] ether oxide (DPEPO), and 1,3-bis(N-carbazolyl)benzene (mCP), are included in our database.

### Estimation of experimental uncertainties of the optical properties in the database

The optical properties of organic compounds were collected from published peer-reviewed papers. In most original papers, the experimental uncertainties of the seven optical properties were not reported. In addition, the optical properties were measured under different experimental conditions. Therefore, it is very difficult to accurately estimate the experimental uncertainties. However, the experimental uncertainties of the optical properties are roughly estimated in the following way.

#### Experimental uncertainty of λabs, max and λemi, max

Most of the UV–visible absorption and emission spectra reported in the published papers were measured with spectrophotometers and spectrofluorometers available from Agilent, Ocean Optics, Hitachi, and JASCO. Typical modern absorption and emission spectrometers, including these, have a wavelength resolution of less than 1 nm. The maximum wavelengths (λabs, max and λemi, max) of absorption and emission spectra can be readily determined within an experimental error of 1 nm. Therefore, the experimental uncertainty of λabs, max and λemi, max is estimated to be less than 1 nm.

#### Experimental uncertainty of σabs and σemi

The values of the absorption and emission bandwidths (σabs and σemi), expressed as the full width at half maximum (FWHM), were extracted from the absorption and emission spectra reported in the published papers when they were not directly reported. The error introduced by this extraction is much smaller than the thickness of the plotted spectral line. The experimental uncertainty of σabs and σemi in FWHM is therefore estimated to be a maximum of 2 nm.

#### Experimental uncertainty of ΦQY

The photoluminescence quantum yield (ΦQY) is found to be the most error-prone quantity among the seven optical properties. The experimental error in ΦQY is affected by several factors such as experimental instruments, measuring methods (absolute vs relative), and molecular oxygen (O2). The IUPAC technical report is useful for estimating the error in ΦQY17. For example, the ΦQY of 9,10-diphenylanthracene in cyclohexane is in the range of 0.9 to 0.97. Because ΦQY is error-prone, the experimental uncertainty in ΦQY is conservatively estimated to be a maximum of 0.1.

#### Experimental uncertainty of τflu

The fluorescence lifetime (τflu) is determined by an exponential fit to the time-resolved fluorescence (TRF) signal. The experimental uncertainty of τflu results mainly from the instrument response function (IRF) and the multi-exponential fitting process. Since the IRF determines the time resolution of the TRF spectrometer, the experimental error of τflu is significant when τflu is shorter than the IRF. In most cases, the multi-exponential fitting error does not exceed 1%. We collected τflu values that were substantially larger than the IRF. Therefore, the experimental uncertainty of τflu is conservatively estimated to be 1%.

#### Experimental uncertainty of log10(εmax)

To determine the extinction coefficient (εmax), the absorbance (A) and the concentration (c) of the chromophore must be known, according to Beer’s law (A = εbc, where b is the pathlength). Considering that the published papers are peer-reviewed, the experimental error in the concentration is assumed to be less than 5%. Therefore, the experimental uncertainty of log10(εmax) is estimated to be approximately 0.02, which corresponds to log10(1.05) ≈ 0.021.
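As a brief worked check of this estimate, assume that the absorbance A and the pathlength b are exact and that only the concentration carries a relative error δ (an assumption made here for illustration). Because εmax = A/(bc), the reported value is shifted by a factor 1/(1 ± δ), so that for δ = 0.05:

$$\Delta\log_{10}\varepsilon_{\max}=\left|\log_{10}(1\pm\delta)\right|,\qquad \log_{10}(1.05)\approx 0.021,\qquad -\log_{10}(0.95)\approx 0.022$$

Both signs of the concentration error therefore shift log10(εmax) by roughly 0.02.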