A band-gap database for semiconducting inorganic materials calculated with hybrid functional

Semiconducting inorganic materials with band gaps ranging between 0 and 5 eV constitute major components in electronic, optoelectronic and photovoltaic devices. Since the band gap is a primary material property that affects the device performance, large band-gap databases are useful in selecting optimal materials in each application. While there exist several band-gap databases that are theoretically compiled by density-functional-theory calculations, they suffer from computational limitations such as band-gap underestimation and metastable magnetism. In this data descriptor, we present a computational database of band gaps for 10,481 materials compiled by applying a hybrid functional and considering the stable magnetic ordering. For benchmark materials, the root-mean-square error in reference to experimental data is 0.36 eV, significantly smaller than 0.75–1.05 eV in the existing databases. Furthermore, we identify many small-gap materials that are misclassified as metals in other databases. By providing accurate band gaps, the present database will be useful in screening materials in diverse applications.


Background & Summary
The band gap (E g ) is a fundamental quantity that directly relates to the usability of materials in optical, electronic, and energy applications. For instance, in photovoltaic devices, materials with a direct E g of ∼1.3 eV 1,2 , corresponding to the Shockley-Queisser limit, are favored as photo-absorbers that maximize the solar-cell efficiency. In power electronics, semiconductors with E g ≥ 3 eV are employed to sustain high electric fields 3 . To increase the figure of merit in thermoelectric devices, materials with E g of 10 k B T op where k B and T op are the Boltzmann constant and operating temperature, respectively, are selected 4 . Given the central role of E g , a database of E g over a wide range of materials can expedite the material selection in specific applications by factoring out suboptimal candidates rapidly. Currently, popular material databases such as the Materials Project 5 , the Automatic Flow of Materials Discovery Library (AFLOW) 6 , the Open Quantum Materials Database (OQMD) 7 , and the Joint Automated Repository for Various Integrated Simulations (JARVIS) 8 provide theoretical E g for up to one million inorganic materials. However, most of them were obtained by semilocal functionals with a generalized gradient approximation (GGA), which underestimates E g by typically 30-40% 9 . (MatDB 10 provides accurate quasi-particle band gaps, but the number of entries is limited to hundreds.) To compensate for this, AFLOW provides adjusted E g using a linear fit to experimental data 11 . However, such a universal correction may not address element-dependent error fluctuations. We note that JARVIS provides E g based on meta-GGA 12 , which significantly improves the accuracy. As a related issue, many small-gap semiconductors such as Ge, InAs, PdO, Zn 3 As 2 , and Ag 2 O are misclassified as metals, which can affect the selection of narrow-gap semiconductors for IR sensors 13 , for instance. (In JARVIS, some of these errors are resolved by meta-GGA.)
Besides the underestimation of E g , all the databases consider only the ferromagnetic ordering for spin-polarized systems for computational convenience, which can cause significant errors in E g of antiferromagnetic materials. For instance, the antiferromagnetic NiO has an experimental E g of 4.3 eV 14 , but the computational E g ranges over 2.2-2.6 eV with the ferromagnetic ordering and GGA functionals 5-7 , whereas the correct antiferromagnetic ordering produces 4.5 eV within the hybrid functional.
Addressing limitations in the existing material databases, we herein report a theoretical dataset of fundamental and optical E g computed by employing a hybrid functional and identifying the stable magnetic ordering, thus providing more accurate E g than the existing databases. For the high-throughput computational workflow, we employ the 'Automated Ab initio Modeling of Materials Property Package' (AMP 2 ) 15 , which is a fully automated package for density functional theory (DFT) calculations of crystalline properties. AMP 2 addresses the band-gap underestimation in semilocal functionals with the help of a hybrid functional, thereby producing a more accurate E g , even if the material is incorrectly metallic within the semilocal functional. Furthermore, the package finds the antiferromagnetic ground state based on an effective Ising model. The present database focuses on materials with 0 eV < E g < 5 eV, which covers most semiconducting materials. The target materials are selected from the Inorganic Crystal Structure Database (ICSD) 16 and partly filtered by information from the Materials Project database. In total, the database collects E g for 10,481 materials that encompass most inorganic solids with E g ranging between 0 and 5 eV. For 116 benchmark materials, the root-mean-square error (RMSE) with respect to experimental data is 0.36 eV, significantly smaller than 0.75-1.05 eV in the existing databases. The resulting data are available online at figshare 17 or SNUMAT 18 .

Methods
High-throughput methodology: AMP 2 . The present database is constructed by employing AMP 2 , which is an automation script that operates VASP [19][20][21] . Starting only with the initial crystalline structure, AMP 2 provides band structure, E g , effective mass, density of states (DOS) and dielectric constants of the crystal by automatically pipelining computational procedures. To summarize computational settings relevant to the present work, we employ GGA developed by Perdew-Burke-Ernzerhof (PBE) 22 for the exchange-correlation functional for structural relaxation and for identifying band edges. The E g is obtained by 'one-shot' hybrid functional (specifically, HSE06 23 (simply HSE hereafter)) calculations in which the package estimates E g from HSE eigenvalues at k points of band edges found with PBE (crystal structures are also fixed to those relaxed by PBE). In a previous study 24 , it was demonstrated that band edges from PBE and HSE lie at the same k points, which is confirmed again in the present work with Si, SrS, BAs, BeS, AlAs, AgI, AgGaTe 2 , ZnSiAs 2 , and ZnIn 2 Se 4 . In addition, the small structural differences between PBE and HSE 25 would not affect the band gap significantly, except for small-gap semiconductors (see below). (This is also the case for systems that go from metallic in PBE to insulating in HSE.) These observations support the expectation that the one-shot scheme produces E g close to that of full hybrid calculations. If the material is identified as a metal within PBE, AMP 2 inspects DOS, and if DOS at the Fermi level normalized by the valence band (D F /D VB ) is less than a threshold, the package further tests a possible gap-opening by the one-shot hybrid calculation. The PBE+U method is applied on 3d orbitals 26 only when the material has a finite E g . As for pseudopotentials, we mostly employ those without any tags in the VASP database, which tends to reduce the number of valence electrons. For further details, we refer to the original publication 15 .
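The one-shot scheme described above can be illustrated with a minimal Python sketch. It is not taken from AMP 2 : the function names, the array layout (k points along rows, bands along columns), the fixed number of occupied bands, and the toy eigenvalues are all assumptions made for illustration, and spin polarization is ignored.

```python
import numpy as np

def pbe_band_edges(pbe_eigs, n_occ):
    """Locate the band edges in PBE eigenvalues of shape (n_k, n_bands).

    Returns the k-point indices of the valence-band maximum and the
    conduction-band minimum, assuming n_occ occupied bands at every k point.
    """
    pbe_eigs = np.asarray(pbe_eigs, dtype=float)
    k_vbm = int(np.argmax(pbe_eigs[:, n_occ - 1]))  # highest occupied band
    k_cbm = int(np.argmin(pbe_eigs[:, n_occ]))      # lowest empty band
    return k_vbm, k_cbm

def one_shot_gap(pbe_eigs, hse_eigs_at, n_occ):
    """Estimate the gap from HSE eigenvalues evaluated only at the
    k points where PBE places the band edges.

    hse_eigs_at maps a k-point index to the HSE eigenvalues at that k point.
    """
    k_vbm, k_cbm = pbe_band_edges(pbe_eigs, n_occ)
    vbm = hse_eigs_at[k_vbm][n_occ - 1]
    cbm = hse_eigs_at[k_cbm][n_occ]
    return max(cbm - vbm, 0.0)  # a negative difference is reported as no gap

# Toy eigenvalues in eV (illustrative only, not from the database).
pbe = [[-1.0, 0.5], [-0.2, 1.0], [-0.6, 0.3]]
hse = {1: [-0.5, 1.8], 2: [-0.9, 1.0]}
gap = one_shot_gap(pbe, hse, n_occ=1)
```

In this sketch the hybrid functional is evaluated at only two k points, which mirrors why the one-shot scheme is much cheaper than a full hybrid band-structure calculation.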
Computational parameters used in the present work follow the default setting of AMP 2 except that the package applies the PBE+U method on Ce 4f levels with the U value of 4 eV 27 . (The pseudopotentials for La and Ce treat f levels as valence.) Furthermore, for compounds including Tl, Pb, and Bi, E g is recalculated by including the spin-orbit coupling (SOC) when the default E g without SOC is smaller than 1 eV. (The band edges are also resought with SOC.) This is because typical SOC corrections of ∼0.5 eV would be critical in these cases.
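The SOC rule above amounts to a simple decision, sketched below in Python. The helper names, the regular-expression formula parser, and the example gap values are hypothetical and serve only to illustrate the criterion stated in the text (Tl, Pb, or Bi present, and a gap without SOC below 1 eV).

```python
import re

# Elements for which the text prescribes an SOC recalculation.
HEAVY_ELEMENTS = {"Tl", "Pb", "Bi"}

def elements_in(formula):
    """Extract element symbols from a simple chemical formula string."""
    return set(re.findall(r"[A-Z][a-z]?", formula))

def needs_soc(formula, gap_without_soc_ev, cutoff_ev=1.0):
    """Return True if the gap should be recalculated with spin-orbit coupling."""
    has_heavy = bool(elements_in(formula) & HEAVY_ELEMENTS)
    return has_heavy and gap_without_soc_ev < cutoff_ev

# Illustrative inputs; the gap values are placeholders, not database entries.
flag_a = needs_soc("PbTe", 0.8)   # heavy element and small gap
flag_b = needs_soc("PbI2", 2.5)   # heavy element but gap above the cutoff
flag_c = needs_soc("GaAs", 0.5)   # small gap but no Tl, Pb, or Bi
```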
In identifying the stable collinear magnetic ordering, AMP 2 applies a genetic algorithm to the Ising model 28 . This approach finds the stable magnetic ordering correctly for many compounds. However, the original formulation requires a large supercell to isolate exchange interactions from periodic images, which costs significant computational resources and also suffers from ill-convergence in electronic iterations. To resolve this, we here develop an alternative method for obtaining exchange parameters. First, we choose a minimal supercell under the following two conditions: (i) A magnetic site α and its periodic images in other supercells are more than 5 Å apart (cutoff range for magnetic interactions). (ii) If two magnetic sites α and β (not necessarily belonging to the same supercell) are within 5 Å, then the distance between α and β′, a periodic image of β (β′ ≠ β), is longer than 5 Å except when α-β and α-β′ are symmetrically equivalent. Within the Ising model, the total energy of the supercell (E) can be expressed as follows:

$$E = E_0 + \sum_{I=1}^{m} J_I \left( N_{I,P} - N_{I,A} \right) \qquad (1)$$

where E 0 is the base energy excluding the magnetic interaction, and I is the index for independent exchange interactions (total m interactions) with the maximum range of 5 Å and the exchange parameter of J I . In Eq. (1), N I,P and N I,A are the numbers of parallel and antiparallel spin pairs within the supercell corresponding to the interaction I, respectively. Then, based on the ferromagnetic configuration (all spin-up), diverse spin configurations are sampled by spin-flipping a magnetic pair (both atoms) or a certain magnetic site. The number of resulting equations is larger than m and an optimal {E 0 , J I } can be obtained by the pseudoinverse method. We find that this approach produces essentially the same parameters as the original scheme but is more reliable and efficient.

Figure 1 schematizes the workflow of constructing the database.
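The pseudoinverse step of the exchange-parameter fit described above can be sketched as follows. This is a minimal illustration, not the AMP 2 implementation: it assumes that the Ising energy is linear in the difference between the numbers of parallel and antiparallel pairs for each interaction, and the function name, pair counts, and energies are synthetic.

```python
import numpy as np

def fit_ising(n_parallel, n_antiparallel, energies):
    """Least-squares fit of E0 and J_I from sampled spin configurations.

    n_parallel, n_antiparallel: arrays of shape (n_configs, m) holding the
    pair counts for each independent interaction I.
    energies: total energies of the sampled configurations.
    Assumes E = E0 + sum_I J_I * (N_IP - N_IA).
    """
    diff = np.asarray(n_parallel, float) - np.asarray(n_antiparallel, float)
    # Design matrix: a column of ones for E0, then one column per interaction.
    design = np.hstack([np.ones((diff.shape[0], 1)), diff])
    params = np.linalg.pinv(design) @ np.asarray(energies, float)
    return params[0], params[1:]

# Synthetic test case with m = 2 interactions and 4 configurations.
n_par = np.array([[4, 2], [0, 2], [2, 0], [2, 2]])
n_anti = np.array([[0, 0], [4, 0], [2, 2], [2, 0]])
true_e0, true_j = -10.0, np.array([0.05, -0.02])
energies = true_e0 + (n_par - n_anti) @ true_j

e0, j = fit_ising(n_par, n_anti, energies)
```

Because more configurations are sampled than there are unknowns, the pseudoinverse returns the least-squares solution; with noise-free synthetic energies it recovers the input parameters.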
Starting from the ICSD, we only consider compounds consisting of elements with atomic number (Z) < 84. Among the lanthanides, we limit the elements to La and Ce. We remove structural duplicates and structures with partially occupied sites, and also omit large primitive cells that contain more than 40 atoms. For unary and binary compounds, all the structures are calculated with AMP 2 . For ternary and higher compounds, we utilize information on E g and DOS in the Materials Project database (calculated by PBE) to filter out materials that are likely to be metallic or large-gap insulators. To be specific, we exclude materials with E g GGA larger than 3 eV since they are likely to have E g HSE larger than 5 eV. (Compiling data of 4,421 compounds from the previous screening studies 24,29-31 , we find that 99.7% of materials with E g HSE < 5 eV have E g GGA < 3 eV.) We also include metallic materials with D F /D VB < 0.8 for possible gap opening (see above; a larger threshold is used because of low-resolution DOS in the Materials Project). If a Materials Project entry has incomplete data for E g or DOS, the material is included in the computation list. In this way, we could factor out 5,059 materials from the list of ternary and higher compounds. Finally, we calculate 21,353 materials with AMP 2 . After computation, we collect 10,481 materials with finite E g (unary: 63, binary: 1,919, ternary: 5,074, quaternary: 2,804, quinary: 573, and higher: 48).
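The pre-screening of ternary and higher compounds described above can be written as a short decision function. The sketch below is an illustration under stated assumptions: the function name and argument layout are hypothetical, missing data are represented by None, and the thresholds (3 eV and 0.8) are the ones quoted in the text.

```python
def keep_for_calculation(gap_gga_ev, dos_ratio, gap_max_ev=3.0, ratio_max=0.8):
    """Decide whether a ternary or higher compound enters the computation list.

    gap_gga_ev: PBE band gap from the reference database, or None if missing.
    dos_ratio: D_F/D_VB from the reference database, or None if missing.
    """
    # Incomplete entries are always kept so that nothing is lost by screening.
    if gap_gga_ev is None or dos_ratio is None:
        return True
    if gap_gga_ev > 0.0:
        # Semiconductors and insulators: drop likely large-gap insulators.
        return gap_gga_ev <= gap_max_ev
    # Metals within PBE: keep only those with a low DOS at the Fermi level.
    return dos_ratio < ratio_max

# Illustrative inputs (placeholders, not database entries).
decisions = [
    keep_for_calculation(1.2, 0.0),   # moderate gap: keep
    keep_for_calculation(4.1, 0.0),   # large GGA gap: drop
    keep_for_calculation(0.0, 0.3),   # metal with low D_F/D_VB: keep
    keep_for_calculation(0.0, 1.5),   # metal with high D_F/D_VB: drop
    keep_for_calculation(None, 0.2),  # incomplete entry: keep
]
```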

Data records
All the calculated properties for 10,481 compounds can be downloaded from the Figshare Repository 17 . The whole data including metals are also uploaded to SNUMAT (www.snumat.com), which provides easy search and visualization of materials through its own interactive interface. SNUMAT also supports a REST API 32 for users to search the materials with authorization. Authorization tokens expire 24 hours after they are issued.
File format. The data are stored in the JSON format. The name of the file is X_ICSD#.json, where X is the chemical formula and ICSD# is the ICSD collection code of the initial structure used for calculation. Each JSON file includes information on the final relaxed structure, E g GGA , E g HSE , and DOS. Table 1 summarizes keys for metadata.
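A downloaded record can be read with the Python standard library, as in the minimal sketch below. The helper names are hypothetical, the file name NiO_000000.json is a placeholder rather than a real entry, and the sketch assumes that the formula and the collection code are separated by the last underscore in the file name. No key names are assumed; the keys listed in Table 1 should be consulted after loading.

```python
import json
from pathlib import Path

def split_record_name(path):
    """Split 'X_ICSD#.json' into (formula, ICSD collection code as a string)."""
    stem = Path(path).stem
    formula, _, code = stem.rpartition("_")
    return formula, code

def load_record(path):
    """Load one JSON record and return it as a Python dictionary."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)

# Placeholder file name used only to show the naming convention.
formula, code = split_record_name("NiO_000000.json")
```

After loading, `sorted(load_record(path).keys())` lists the available keys for a given file.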
Graphical representation of the data. In Fig. 2, we present the distribution of E g GGA and E g HSE for 10,481 materials. Most materials with E g HSE > 5 eV (663 cases) are unary or binary compounds for which AMP 2 is applied to the whole structure dataset from ICSD.

Technical Validation
Comparison to experimental measurements and other databases. In Fig. 3a, we compare experimental and theoretical values for 116 benchmark materials with experimental E g between 0 and 5 eV. The list of compounds is shown in Online-only Table 1. For comparison, theoretical results from other databases are also shown in Fig. 3b- ) This confirms that the present database provides more accurate E g than the existing databases on average. In particular, we correctly identify the semiconducting nature of small-gap semiconductors such as AgSbTe 2 , CdO, CoP 3 , Cu 3 AsSe 4 , Cu 3 SbS 4 , Cu 3 SbSe 4 , CuFeS 2 , Ge, Mg 2 Sn, RhSb 3 , and ZnSnSb 2 , which are mostly misreported as metals in other databases. In addition, other databases exhibit pronounced errors for every antiferromagnetic material (CuFeS 2 , CuO, FeF 2 , MnO, MnTe, and NiO) because these materials are considered as ferromagnetic or non-magnetic. (For non-magnetic materials in Online-only Table 1, the E g calculated with pure PBE (without +U and SOC) by AMP 2 agrees well with that from the Materials Project (the mean absolute error is 0.034 eV).)
In most cases, the present database provides E g that agrees well with experiment. However, there are some materials with large errors of ≥0.5 eV such as AgAlTe 2 , Cu 3 AsSe 4 , CuAlSe 2 , CuBr, CuCl, CuFeS 2 , CuO, Ge, IrSb 3 , La 2 S 3 , MnO, RhAs 3 , RhSb 3 , SnO 2 , SrS, and ZnO. For small-gap materials such as Cu 3 AsSe 4 , Ge, and IrSb 3 , E g is sensitive to the lattice parameters that are slightly overestimated by PBE. Employing experimental lattice parameters or those relaxed within HSE significantly improves the results 15 . For Cu-bearing materials, it is known that HSE often exhibits substantial errors in E g due to nonlocal screening effects in Cu, which requires GW calculations 33,34 .
We also note that van der Waals interactions are not described by semilocal functionals, and lattice parameters can be overestimated in layered structures such as transition-metal dichalcogenides 35 . This can significantly affect E g , and so care is needed in referring to E g in layered materials. The present results do not consider finite-temperature effects on E g , which can be significant in some materials, for example, hybrid perovskites 36 . More generally, an E g dataset with the ultimate theoretical accuracy would be obtained by quasiparticle approaches such as GW or Bethe-Salpeter equations 37,38 .
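The benchmark metric used in this section, the RMSE between calculated and experimental E g , can be computed as in the following sketch. The function name is hypothetical. The only values taken from the text are those for NiO (4.3 eV from experiment and 4.5 eV from the hybrid functional with the antiferromagnetic ordering); a single material is of course not a meaningful benchmark and is used only to show the arithmetic.

```python
import math

def rmse(calculated, experimental):
    """Root-mean-square error between two equal-length sequences (in eV)."""
    if len(calculated) != len(experimental) or not calculated:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(c - e) ** 2 for c, e in zip(calculated, experimental)]
    return math.sqrt(sum(squared) / len(squared))

# NiO values quoted in the text: hybrid functional vs. experiment.
nio_error = rmse([4.5], [4.3])
```

For a full benchmark, the two lists would hold the calculated and experimental gaps of all materials in Online-only Table 1, in the same order.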

Code availability
The AMP 2 package used for constructing the present database is available at https://github.com/MDIL-SNU/AMP2 and was released under the GNU General Public License v3 (GPLv3). The package requires pre-installation of the numpy, scipy, spglib, and PyYAML modules. Detailed guidelines and examples can be found in the manual (https://amp2.readthedocs.io/en/latest/).