A hybrid organic-inorganic perovskite dataset

Hybrid organic-inorganic perovskites (HOIPs) have been attracting a great deal of attention due to their versatility of electronic properties and fabrication methods. We prepare a dataset of 1,346 HOIPs, which features 16 organic cations, 3 group-IV cations and 4 halide anions. Using a combination of an atomic structure search method and density functional theory calculations, the optimized structures, the bandgap, the dielectric constant, and the relative energies of the HOIPs are uniformly prepared and validated by comparing with relevant experimental and/or theoretical data. We make the dataset available at Dryad Digital Repository, NoMaD Repository, and Khazana Repository (http://khazana.uconn.edu/), hoping that it could be useful for future data-mining efforts that can explore possible structure-property relationships and phenomenological models. Progressive extension of the dataset is expected as new organic cations become appropriate within the HOIP framework, and as additional properties are calculated for the new compounds found.

Figure 1 summarizes the workflow of the dataset preparation. The procedure starts by collecting 16 organic (molecular) cations A+, all of which have been considered in the literature 1,6,7,12. Each of these 16 cations, shown in Fig. 2, is placed at the A site of the ASnI3-based perovskites. This is the starting point for the structure prediction simulations, performed with the minima-hopping method 31,32. The low-energy structures predicted for ASnI3 are subjected to a preliminary filtering step, keeping 135 prototype structures that differ in DFT energy and volume (these quantities are estimated at the lower accuracy level used for the searches). Next, we expand the set of 135 structures by substituting either Ge or Pb for Sn and, similarly, either F, Cl, or Br for I. The resulting 1,620 initial structures were optimized by DFT at the desired level of accuracy (described in the Numerical calculations section), yielding the relative energies and the atomization energies. Then, the band edge positions in k space, the energy bandgap, and the dielectric constant were calculated for the optimized structures. A post-filtering step is finally performed on the whole dataset to remove redundancy (this time identified at the desired accuracy level of the DFT computations), keeping 1,346 distinct data points (summarized in Table 1). Whenever possible, our calculated results are compared with previously computed and/or measured data. Relaxed structures of all the materials are finally converted into the crystallographic information file (cif) format using the pymatgen library 33.

Figure 1. Scheme for preparing the dataset of hybrid organic-inorganic perovskites. Minima-hopping is a structure prediction method that was used for generating an initial set of 135 ASnI3 prototypical structures (where A stands for 16 organic cations), which were used as seeds for the creation of the remaining compounds.

Initial structure accumulation
As briefly described in the Workflow section, our dataset is built up from 135 prototype structures obtained by searching for low-energy structures of 16 HOIPs with chemical formula ASnI3 (in principle, prototype structures of any material can be searched in this way). In the minima-hopping structure prediction simulations, the DFT-level energy is used to construct the potential energy surface (PES) of the composition 31,32. Starting from an initial structure, low-energy minima of the PES are searched by alternately performing DFT-based local optimization runs (to locate the nearby minima) and molecular dynamics runs (to escape the identified minima). Thanks to built-in feedback mechanisms, structure searches using this method are biased toward the low-energy domains of the PES. Because of the large number of minima, the searches were performed at a lower accuracy level of DFT energy, and the minima identified in this step were then refined at the desired level. The power of the minima-hopping method has been demonstrated for several classes of crystalline solids 34-36, including three SnI3-based HOIPs 12.
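The alternation of local optimization, molecular dynamics escape, and feedback described above can be illustrated with a toy sketch. It is not the production setup: the DFT energy is replaced by an analytic one-dimensional function, and the function names, feedback factors, and step counts are all assumptions chosen for illustration.

```python
import math
import random

def energy(x):
    # Toy 1D potential with several local minima (stand-in for the DFT energy).
    return math.sin(3.0 * x) + 0.1 * x * x

def force(x):
    # Negative gradient of the toy potential.
    return -(3.0 * math.cos(3.0 * x) + 0.2 * x)

def relax(x, step=0.02, n_steps=2000):
    # Local optimization by steepest descent (locates the nearby minimum).
    for _ in range(n_steps):
        x += step * force(x)
    return x

def md_escape(x, e_kin, rng, dt=0.01, n_steps=400):
    # Short velocity-Verlet run (unit mass) started with kinetic energy e_kin
    # in a random direction, used to leave the current basin.
    v = rng.choice((-1.0, 1.0)) * math.sqrt(2.0 * e_kin)
    f = force(x)
    for _ in range(n_steps):
        v += 0.5 * dt * f
        x += dt * v
        f = force(x)
        v += 0.5 * dt * f
    return x

def minima_hopping(x0, n_hops=100, e_kin=1.0, e_diff=0.5, seed=0):
    rng = random.Random(seed)
    current = relax(x0)
    visited = [current]
    best = current
    for _ in range(n_hops):
        trial = relax(md_escape(current, e_kin, rng))
        if abs(trial - current) < 1e-3:
            e_kin *= 1.1          # escape failed: raise the kinetic energy
            continue
        if any(abs(trial - m) < 1e-3 for m in visited):
            e_kin *= 1.1          # minimum already known: raise the kinetic energy
        else:
            e_kin /= 1.1          # new minimum: lower the kinetic energy
            visited.append(trial)
        if energy(trial) - energy(current) < e_diff:
            current = trial       # accept the move and tighten the threshold
            e_diff /= 1.05
        else:
            e_diff *= 1.05        # reject the move and loosen the threshold
        if energy(current) < energy(best):
            best = current
    return best, visited
```

Calling `minima_hopping(0.0)` returns the lowest minimum encountered together with the list of distinct minima visited. The feedback on `e_kin` and `e_diff` is what biases the walk toward low-energy regions.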
For each of the 16 ASnI3 HOIPs, the numerous low-energy structures identified are subjected to a filtering step, keeping only those that differ by at least 5 meV/atom in DFT energy and at least 0.1 Å³/atom in volume. After the filtering step, 135 prototypical structures of the 16 HOIPs were selected, three of which are shown in Fig. 3. In the case of isotropic organic cations such as tetramethylammonium, a cubic-like cage formed by the network of Sn and I ions is stabilized in a three-dimensional structure. For anisotropic or polar organic cations, the framework deforms into a two-dimensional planar or pillar motif. Further structural variety can be expected from additional structure searches using different organic cations and/or slightly nonstoichiometric compositions in the HOIP system 6,37. By substituting either Ge or Pb for Sn, and either F, Cl, or Br for I, 1,620 structures of 192 chemically distinct HOIPs were obtained. They are the initial structures used to build up the HOIP dataset.
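A minimal sketch of the preliminary filter and the substitution step follows. It takes the literal reading of the criterion (a candidate is kept only if it differs from every kept structure in both energy and volume); the function names and the tuple layout are hypothetical.

```python
from itertools import product

E_TOL = 0.005  # eV/atom, i.e. 5 meV/atom (threshold quoted in the text)
V_TOL = 0.1    # Å^3/atom (threshold quoted in the text)

def filter_prototypes(candidates):
    """Return the distinct structures among (energy_per_atom, volume_per_atom) tuples.

    Candidates are scanned in order of increasing energy, so the lowest-energy
    member of each group of near-duplicates is the one that survives.
    """
    kept = []
    for e, v in sorted(candidates):
        if all(abs(e - ek) >= E_TOL and abs(v - vk) >= V_TOL for ek, vk in kept):
            kept.append((e, v))
    return kept

def expand(prototype_ids,
           b_cations=("Ge", "Sn", "Pb"),
           x_anions=("F", "Cl", "Br", "I")):
    """Combine every prototype with every B cation and X anion."""
    return list(product(prototype_ids, b_cations, x_anions))

# 135 prototypes x 3 B cations x 4 X anions = 1,620 initial structures;
# 16 organic cations x 3 x 4 = 192 chemically distinct HOIPs.
initial = expand(range(135))
```

The counts reproduce the numbers quoted above: `len(initial)` is 1,620, and `16 * 3 * 4` is 192.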

Numerical calculations
General scheme. Our calculations are performed within the DFT 29,30 formalism, using the projector augmented-wave (PAW) method 38 as implemented in the Vienna Ab initio Simulation Package (vasp) 39-42. The default accuracy level of our calculations is 'Accurate', specified by setting PREC = Accurate in all the runs with vasp. The basis set includes plane waves with kinetic energies up to 400 eV, as recommended by the vasp manual for this level of accuracy. The PAW datasets of version 5.2, which were used to describe the ion-electron interactions, are summarized in Table 2. The van der Waals dispersion interactions are estimated with the non-local density functional vdW-DF2 (ref. 43). The generalized gradient approximation (GGA) functional associated with vdW-DF2, i.e., refitted Perdew-Wang 86 (rPW86) 44, was used for the exchange-correlation (XC) energies. For all the calculations except the bandgap determination, we sample the Brillouin zones, which differ significantly in shape among the compounds, with an equispaced (spacing h_k = 0.20 Å⁻¹), Γ-centered Monkhorst-Pack 45 k-point mesh. A structure is considered equilibrated when the atomic forces are below 0.01 eV/Å. This numerical scheme is consistent with the one we used for preparing the polymer dataset reported in ref. 35.
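To illustrate how a fixed spacing translates into a mesh that adapts to the cell shape, the sketch below derives the number of subdivisions along each reciprocal lattice vector. Two conventions are assumptions here and may differ from the ones actually used: the reciprocal vectors include the factor 2π, and the subdivision count is rounded up.

```python
import math

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def reciprocal_lengths(lattice):
    """Lengths (in Å^-1) of the reciprocal vectors, 2π factor included (assumed)."""
    a1, a2, a3 = lattice
    volume = dot(a1, cross(a2, a3))
    lengths = []
    for u, w in ((a2, a3), (a3, a1), (a1, a2)):
        b = cross(u, w)
        lengths.append(2.0 * math.pi * math.sqrt(dot(b, b)) / abs(volume))
    return lengths

def kmesh(lattice, spacing=0.20):
    """Subdivisions per reciprocal vector for a target k-point spacing in Å^-1."""
    return tuple(max(1, math.ceil(length / spacing))
                 for length in reciprocal_lengths(lattice))
```

For a hypothetical cubic cell with a = 6 Å, each reciprocal vector has length 2π/6 ≈ 1.047 Å⁻¹, so the mesh is 6 × 6 × 6. Doubling the c axis reduces the third subdivision to 3.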
Bandgap determination. The bandgap E_g is perhaps the most desired physical property of HOIPs. Within DFT, E_g is determined as the energy difference between the conduction band minimum (CBM) and the valence band maximum (VBM), identified on a given k-point mesh. For a solid with an arbitrary primitive cell, the locations of VBM and CBM are generally not known beforehand, and the k-point mesh should be very dense in order to locate the band edges accurately. With a mesh of this type, the computation of E_g using the Heyd-Scuseria-Ernzerhof (HSE06) 46,47 exchange-correlation functional, the level of DFT at which the calculated bandgap is expected to be close to the real bandgap, is computationally prohibitive. Although such a computation at the GGA level of DFT is feasible, E_g is generally underestimated by 30% or more 48.

Organic cation A | Cation B and anion X | Total
Ammonium | 2 2 4 3 3 4 3 4 113
Tetramethylammonium | 2 3 3 2 1 3 3 2 1 3 3 3 | 29
Ethylammonium | 9 10 11 12 11 11 12 12 12 10 10 11 | 131
Propylammonium | 8 11 13 12 10 13 13 12 11 10 11 13 | 137
Isopropylammonium | 9 8 9 10 9 9 11 9 12 11 8
The conduction bands and the valence bands computed at the GGA and HSE06 levels of DFT are essentially similar in shape. However, they are shifted as a whole with respect to each other and to the true electronic structures (see, for example, ref. 49). Therefore, our bandgap determination procedure, shown in Fig. 4, includes two steps. First, the locations of VBM and CBM are searched at the GGA level on three different dense k-point meshes. The first two meshes (one centered at Γ = (0,0,0) and the other centered at X = (0.5, 0.5, 0.5)) are equispaced with h_k = 0.15 Å⁻¹, while the third mesh contains k-points distributed along Γ-X-M-Γ-R-M-X-R, the path that has widely been used to represent the electronic band structure of HOIPs 12,50. In the second step, the positions of VBM and CBM identified in the first step are added with zero weight to a Monkhorst-Pack k-point mesh with h_k = 0.20 Å⁻¹ used for sampling the Brillouin zone, thereby determining the energy difference between CBM and VBM at the HSE06 level of DFT. Although this procedure needs some extra work, we expect that the bandgap computed for HOIPs with an arbitrary primitive cell is reliable.
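The two-step idea can be sketched as follows: locate the band edges from GGA eigenvalues pooled over the dense meshes, then append those k-points with zero weight to the coarser weighted mesh used for the hybrid-functional run. The data layout and function names are hypothetical, and the eigenvalues at each k-point are assumed to be sorted in ascending order.

```python
def band_edges(kpoints, eigenvalues, n_occupied):
    """Locate VBM and CBM.

    eigenvalues[i][n] is the energy (eV) of band n at kpoints[i];
    bands 0 .. n_occupied-1 are filled.
    """
    indices = range(len(kpoints))
    vbm_i = max(indices, key=lambda i: eigenvalues[i][n_occupied - 1])
    cbm_i = min(indices, key=lambda i: eigenvalues[i][n_occupied])
    vbm = eigenvalues[vbm_i][n_occupied - 1]
    cbm = eigenvalues[cbm_i][n_occupied]
    return {"vbm_k": kpoints[vbm_i], "cbm_k": kpoints[cbm_i], "gap": cbm - vbm}

def with_zero_weight_points(weighted_mesh, edge_kpoints):
    """Append band-edge k-points, with zero weight, to a list of (k, weight) pairs.

    Zero-weight points do not affect the self-consistent density but still
    yield eigenvalues at exactly the band-edge locations.
    """
    extra = [(k, 0.0) for k in dict.fromkeys(edge_kpoints)]
    return list(weighted_mesh) + extra
```

In this sketch, `band_edges` would be called on the k-points and eigenvalues concatenated from all three GGA meshes. `with_zero_weight_points` then prepares the k-point list for the HSE06 calculation.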
Atomization and relative energy definitions. The atomization energy of each compound is calculated via equation (1), where E_ABX3 is the energy of the HOIP, and n_i and E_i are the number and the energy of an isolated atom of element i, respectively. We also report two kinds of relative energies, defined in equations (2) and (3) with respect to the atomic constituents and the solid constituents.
In equations (2) and (3), E_A0, E_B, E_X2, and E_H2 are the energies of the isolated neutral organic molecule A, the metallic crystal B, and the isolated X2 and H2 molecules, respectively. E_BX2 and E_HX are the energies of the metal halides (BX2) and hydrogen halides (HX), respectively. For the tetramethylammonium cation (C4H12N+), the energy of neutral trimethylamine (C3H9N) was used for E_A0, and the energies of the molecules C2H6 and CH3X are used instead of E_H2 and E_HX in equations (2) and (3), respectively.
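As a hedged reconstruction of equations (1)-(3) from the symbol definitions above, and not necessarily the exact published expressions, the three quantities can be written as below. The stoichiometric coefficients follow from the formula ABX3 with A+ = A0 + H. The sign convention, the absence of any per-atom or per-formula-unit normalization, and the symbol ε_rel2 are assumptions.

```latex
\begin{align}
\varepsilon_{\mathrm{at}} &= E_{ABX_3} - \sum_i n_i E_i, \tag{1}\\
\varepsilon_{\mathrm{rel1}} &= E_{ABX_3} - \left( E_{A^0} + E_{B} + \tfrac{3}{2} E_{X_2} + \tfrac{1}{2} E_{H_2} \right), \tag{2}\\
\varepsilon_{\mathrm{rel2}} &= E_{ABX_3} - \left( E_{A^0} + E_{BX_2} + E_{HX} \right). \tag{3}
\end{align}
```

With this form the tetramethylammonium substitution balances atom by atom: C3H9N plus one half of C2H6 gives C4H12N in equation (2), and C3H9N plus CH3X plus BX2 gives C4H12N·BX3 in equation (3).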

Post-filtering
The preliminary filtering step is performed only on the prototypical structures (ASnI3), based on their DFT energy and volume estimated with limited accuracy during the structure prediction runs. Therefore, an additional filtering step is performed on all relaxed structures obtained from the 1,620 initial structures to remove any possible redundancy. Within this step, all cases with the same chemical composition that differ by less than 2% in unit cell volume Ω, E_g, ε_at, ε_elec, and ε_ion are clustered. All the clustered points were inspected visually, keeping only those materials that are distinct. At the end of this step, we are left with 1,346 distinct compounds (also summarized in Table 1). These compounds constitute our final dataset.
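The automatic part of this step, flagging candidate duplicates for visual inspection, might look like the sketch below. The record keys, the definition of the relative difference (normalized by the larger magnitude), and the requirement that all five quantities agree within the tolerance are assumptions.

```python
from collections import defaultdict

PROPERTIES = ("volume", "gap", "e_at", "eps_elec", "eps_ion")

def rel_diff(a, b):
    """Relative difference, normalized by the larger magnitude (assumed definition)."""
    scale = max(abs(a), abs(b))
    return 0.0 if scale == 0.0 else abs(a - b) / scale

def similar(r1, r2, tol=0.02):
    """True when every compared property differs by less than tol (2% by default)."""
    return all(rel_diff(r1[p], r2[p]) < tol for p in PROPERTIES)

def duplicate_candidates(records, tol=0.02):
    """Return id pairs of same-composition entries that are candidates for merging."""
    by_formula = defaultdict(list)
    for record in records:
        by_formula[record["formula"]].append(record)
    pairs = []
    for group in by_formula.values():
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                if similar(group[i], group[j], tol):
                    pairs.append((group[i]["id"], group[j]["id"]))
    return pairs
```

The function only proposes pairs. The final decision on which entries are distinct remains a manual one, as described above.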

Data Records
The

File format
The information reported in the dataset for a given material is stored in a file named N.cif, where N is a cardinal number identifying the entry in the dataset. The first part of the file contains the optimized structure in the standard cif format, which is compatible with many visualization packages. Other information, including the calculated properties, is provided as comment lines in the second part of the file (the entry with N = 845 serves as the example). While most of the keywords are self-explanatory, we use the keyword Label to provide more detailed information on the HOIP compound, including the common names of the organic cation A, the cation B, and the anion X. The origin of the formula and structure of the organic cation is given by the keyword Organic cation source. The keywords Material class and Geometry class are set to 'Hybrid organic-inorganic perovskite' and 'Bulk crystalline materials', respectively.
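A minimal sketch for reading the property block is given below. The exact syntax of the comment lines is an assumption: each is taken to start with `#` and to separate keyword and value with a colon. The sample text, including the Label value and the cell parameter, is a made-up placeholder rather than an actual entry.

```python
def read_properties(cif_text):
    """Collect 'keyword: value' pairs from comment lines of a cif file.

    Assumes comment lines of the form '# Keyword: value'; lines that are
    not comments, or comments without a colon, are ignored.
    """
    properties = {}
    for line in cif_text.splitlines():
        line = line.strip()
        if not line.startswith("#"):
            continue
        key, sep, value = line.lstrip("#").partition(":")
        if sep and key.strip():
            properties[key.strip()] = value.strip()
    return properties

# Hypothetical file content, for illustration only.
SAMPLE = """\
data_example
_cell_length_a 6.0
# Label: placeholder label
# Material class: Hybrid organic-inorganic perovskite
# Geometry class: Bulk crystalline materials
"""

props = read_properties(SAMPLE)
```

Because the properties live in comments, standard cif readers will skip them. A separate pass over the raw text such as this one is needed to recover them.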

Graphical summary of the dataset
We visualize the calculated quantities in the property space as shown in Fig. 5. Because the relative energy, the unit cell volume, the bandgap, and the dielectric constant are the primary properties reported in this dataset, six plots are shown, namely Ω-ε_rel1, Ω-E_g, E_g^GGA-E_g^HSE06, E_g^HSE06-ε_elec, E_g^HSE06-ε_ion, and E_g^HSE06-ε. Compounds containing different A cations and X anions are represented by symbols of different colors and sizes to clarify the role of the chemical composition in controlling the properties of the HOIPs.
It can be clearly seen that the dataset is clustered according to the X anions, following the sequence F, Cl, Br, and I. As shown in Fig. 5a, most of the F-containing HOIP compounds are more favorable to form, as measured by the relative energy, regardless of the A cation. Bandgap and unit cell volume are strongly correlated, mainly because the electronegativity and the ionic radii of the X anions differ significantly among F, Cl, Br, and I. A simple and strong linear correlation between the GGA-level and HSE06-level bandgaps, with a scale factor of ~1.2, is shown in Fig. 5c. Small bandgap values between 1.5 eV and 1.6 eV, favorable for photovoltaic applications, were found for SnI3-containing HOIP compounds including CH3NH3SnI3, NH3NH2I3SnI3, and C3H8NSnI3. A limit of the form ε_elec ~ 1/E_g^HSE06, shown in Fig. 5d, has also been demonstrated for other classes of materials in the literature 13,35,36,51-62.
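A user who wants to re-derive the GGA-to-HSE06 scale factor from the dataset could use an ordinary least-squares fit such as the sketch below. The numbers are synthetic, generated from an exact factor of 1.2 purely to exercise the function; they are not values from the dataset, and whether the published relation includes an intercept is not specified here.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic bandgaps in eV (illustrative only, not dataset values).
gga_gaps = [1.0, 2.0, 3.0, 4.0]
hse_gaps = [1.2 * g for g in gga_gaps]

slope, intercept = linear_fit(gga_gaps, hse_gaps)
```

On real data the fitted slope and the scatter around the line indicate how far a GGA-level gap can be trusted as a proxy for the HSE06 value.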

Technical Validation
The relative energy computed via equation (2) is physically relevant for examining the relative stability, which is useful for future studies of new HOIPs. As the dataset contains theoretically stable structures, we used the bandgap, the dielectric constant, and the XRD pattern simulated for Cu Kα radiation (1.54056 Å) to validate the calculations. Since available experimental studies of HOIPs seem to be limited to a small subset of the combinatorial possibilities, only a small number of experimental bandgaps could be collected from available resources. These correspond to compounds containing acetamidinium (ACM, C2H7N2), formamidinium (FA, CH5N2), guanidinium (GUA, CH6N3), isopropylammonium (IPA, C3H10N), methylammonium (MA, CH3NH3), and tetramethylammonium (TMA, C4H12N). Four computed bandgaps are also included in the comparison set. As shown in Fig. 6a, the calculated bandgap for the most stable structure of each case (color-coded symbols) agrees well with the data from previous studies (gray symbols correspond to less stable polymorphs).
In order to further validate the HOIP dataset, experimentally measured and theoretically calculated dielectric constants in both the high-frequency and static regimes were collected and compared with the computed dielectric constants. This information is available for a limited number of HOIPs with MA and FA organic cations. Since the computation of the dielectric constant using density functional perturbation theory (DFPT) is highly sensitive to the numerical accuracy of the vibrational frequencies, we used a rather tight convergence criterion of 10⁻⁸ eV for the change in total energy. Figure 6b shows the excellent agreement between previously reported and computed dielectric constants for the selected HOIPs. Finally, we show the XRD spectra calculated for four HOIPs, namely MAPbBr3, MAPbI3, IPAGeI3, and MASnI3, in Fig. 6c-f. Each is compared with the corresponding measured XRD pattern, and the reasonable agreement can be regarded as supporting validation of the computational scheme.

Figure 6. Validation of data computed for some HOIPs by comparison with the measured data available. Bandgaps and dielectric constants computed for the low-energy structures of these compounds are plotted in (a) and (b), respectively, vs. those experimentally measured. In these panels, the lowest-energy structure of each HOIP is indicated by a colored symbol, while data from the energetically competing structures are shown in gray (a) or given within an error bar (b). Experimental data for the bandgaps and dielectric constants of these HOIPs are obtained from refs 8,64-73 and refs 74-83, respectively. In (c-f), the simulated and measured XRD spectra for MAPbBr3 65, MAPbI3 84, IPAGeI3 73, and MASnI3 5,85 are shown. The reported index of reflection orientation is given on top of each significant peak.
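The peak positions in a simulated XRD pattern follow from Bragg's law, λ = 2d sin θ, with the Cu Kα wavelength quoted above. The sketch below converts an interplanar spacing into a 2θ angle. The cubic d-spacing helper is included only for illustration, since the cells in the dataset are generally not cubic, and the function names are hypothetical.

```python
import math

CU_KALPHA = 1.54056  # Å, wavelength quoted in the text

def two_theta(d_spacing, wavelength=CU_KALPHA):
    """Diffraction angle 2θ in degrees from Bragg's law (first order).

    Returns None when the reflection is not reachable (λ / 2d > 1).
    """
    s = wavelength / (2.0 * d_spacing)
    if s > 1.0:
        return None
    return 2.0 * math.degrees(math.asin(s))

def d_cubic(a, h, k, l):
    """Interplanar spacing of the (hkl) planes in a cubic cell of edge a (Å)."""
    return a / math.sqrt(h * h + k * k + l * l)
```

A spacing equal to the wavelength itself gives sin θ = 0.5 and therefore 2θ = 60°. Spacings below half the wavelength produce no reflection.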

Usage Notes
This dataset, which includes 1,346 HOIPs, has been consistently prepared using first-principles calculations. While the HSE06 bandgap E_g^HSE06 is believed to be fairly close to the true bandgap of the materials, the GGA-rPW86 bandgap is also reported for completeness and for further possible analysis. The reported atomization energy and the dielectric constants are also expected to be accurate.