Background & Summary

The band gap (Eg) is a fundamental quantity that directly relates to the usability of materials in optical, electronic, and energy applications. For instance, in photovoltaic devices, materials with a direct Eg of 1.3 eV1,2, corresponding to the Shockley-Queisser limit, are favored as photo-absorbers that maximize the solar-cell efficiency. In power electronics, semiconductors with Eg ≥ 3 eV are employed to sustain high electric fields3. To increase the figure of merit in thermoelectric devices, materials with Eg of 10 kBTop, where kB and Top are the Boltzmann constant and operating temperature, respectively, are selected4. Given the central role of Eg, a database of Eg over a wide range of materials can expedite material selection for specific applications by rapidly screening out suboptimal candidates. Currently, popular material databases such as the Materials Project5, the Automatic Flow of Materials Discovery Library (AFLOW)6, the Open Quantum Materials Database (OQMD)7, and the Joint Automated Repository for Various Integrated Simulations (JARVIS)8 provide theoretical Eg for up to one million inorganic materials. However, most of these values were obtained with semilocal functionals within the generalized gradient approximation (GGA), which typically underestimates Eg by 30–40%9. (MatDB10 provides accurate quasi-particle band gaps, but the number of entries is limited to hundreds.) To compensate for this underestimation, AFLOW provides adjusted Eg using a linear fit to experimental data11. However, such a universal correction may not address element-dependent error fluctuations. We note that JARVIS provides Eg based on meta-GGA12, which significantly improves the accuracy. As a related issue, many small-gap semiconductors such as Ge, InAs, PdO, Zn3As2, and Ag2O are misclassified as metals, which can affect the selection of narrow-gap semiconductors for IR sensors13, for instance. (In JARVIS, some of these errors are resolved by meta-GGA.)
Besides the underestimation of Eg, all the databases consider only the ferromagnetic ordering for spin-polarized systems for computational convenience, which can cause significant errors in Eg of antiferromagnetic materials. For instance, the antiferromagnetic NiO has an experimental Eg of 4.3 eV14, but the computed Eg ranges over 2.2–2.6 eV with the ferromagnetic ordering and the GGA functional5,6,7, whereas the correct antiferromagnetic ordering yields 4.5 eV with the hybrid functional.

To address these limitations of the existing material databases, we herein report a theoretical dataset of fundamental and optical Eg computed by employing a hybrid functional and identifying the stable magnetic ordering, thus providing more accurate Eg than the existing databases. For the high-throughput computational workflow, we employ the ‘Automated Ab initio Modeling of Materials Property Package’ (AMP2)15, which is a fully automated package for density functional theory (DFT) calculations of crystalline properties. AMP2 addresses the band-gap underestimation in semilocal functionals with the help of a hybrid functional, thereby producing a more accurate Eg, even if the material is incorrectly metallic within the semilocal functional. Furthermore, the package finds the antiferromagnetic ground state based on an effective Ising model. The present database focuses on materials with 0 eV < Eg < 5 eV, which covers most semiconducting materials. The target materials are selected from the Inorganic Crystal Structure Database (ICSD)16 and partly filtered using information from the Materials Project database. In total, the database collects Eg for 10,481 materials that encompass most inorganic solids with Eg ranging between 0 and 5 eV. For 116 benchmark materials, the root-mean-square error (RMSE) with respect to experimental data is 0.36 eV, significantly smaller than the 0.75–1.05 eV found for the existing databases. The resulting data are available online at figshare17 or SNUMAT18.

Methods

High-throughput methodology: AMP2

The present database is constructed by employing AMP2, which is an automation script operating VASP19,20,21. Starting only with the initial crystalline structure, AMP2 provides the band structure, Eg, effective mass, density of states (DOS) and dielectric constants of the crystal by automatically pipelining computational procedures. To summarize the computational settings relevant to the present work, we employ the GGA developed by Perdew-Burke-Ernzerhof (PBE)22 as the exchange-correlation functional for structural relaxation and for identifying band edges. The Eg is obtained by ‘one-shot’ hybrid functional (specifically, HSE0623 (simply HSE hereafter)) calculations in which the package estimates Eg from HSE eigenvalues at the k points of the band edges found with PBE (crystal structures are also fixed to those relaxed by PBE). In the previous study24, it was demonstrated that band edges from PBE and HSE lie at the same k points, which is confirmed again in the present work with Si, SrS, BAs, BeS, AlAs, AgI, AgGaTe2, ZnSiAs2, and ZnIn2Se4. In addition, the small structural differences between PBE and HSE25 would not affect the band gap significantly, except for small-gap semiconductors (see below). (This is also the case for systems that go from metallic in PBE to insulating in HSE.) These observations support the view that the one-shot scheme can produce Eg close to that of full hybrid calculations. If the material is identified as a metal within PBE, AMP2 inspects the DOS, and if the DOS at the Fermi level normalized by that of the valence band (DF/DVB) is less than a threshold, the package further tests a possible gap opening by the one-shot hybrid calculation. The PBE+U method is applied on 3d orbitals26 only when the material has a finite Eg. For pseudopotentials, we mostly employ those without any tags in the VASP database, which tends to reduce the number of valence electrons. For further details, we refer to the original publication15.
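The one-shot scheme described above can be illustrated with a toy sketch. This is not AMP2 code: the function names, the array layout (one row of band energies per k point) and the integer occupation count `n_occ` are assumptions made for illustration, and spin polarization and partial occupations are ignored.

```python
import numpy as np

def band_edges(pbe_eigs, n_occ):
    """Return k-point indices of the VBM and CBM from PBE eigenvalues.

    pbe_eigs: array of shape (n_kpoints, n_bands), ascending per k point.
    n_occ: number of fully occupied bands (illustrative assumption).
    """
    eigs = np.asarray(pbe_eigs, dtype=float)
    k_vbm = int(np.argmax(eigs[:, n_occ - 1]))  # highest occupied level
    k_cbm = int(np.argmin(eigs[:, n_occ]))      # lowest unoccupied level
    return k_vbm, k_cbm

def one_shot_gap(pbe_eigs, hse_eigs_at, n_occ):
    """Gap from HSE eigenvalues evaluated only at the PBE band-edge k points.

    hse_eigs_at: dict mapping k-point index -> band energies from HSE.
    """
    k_vbm, k_cbm = band_edges(pbe_eigs, n_occ)
    vbm = hse_eigs_at[k_vbm][n_occ - 1]
    cbm = hse_eigs_at[k_cbm][n_occ]
    return cbm - vbm

# Made-up eigenvalues (eV) for three k points and two bands.
pbe = [[-1.0, 1.5], [-0.2, 0.9], [-0.6, 0.4]]
hse_at_edges = {1: [-0.5, 1.6], 2: [-0.9, 1.0]}
gap = one_shot_gap(pbe, hse_at_edges, n_occ=1)
```

In this toy input the PBE band edges sit at k points 1 and 2, so the hybrid calculation is needed only at those two points.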

Computational parameters used in the present work follow the default setting of AMP2 except that the package applies the PBE+U method on Ce 4f levels with the U value of 4 eV27. (The pseudopotentials for La and Ce treat f levels as valence.) Furthermore, for compounds including Tl, Pb, and Bi, Eg is recalculated by including the spin-orbit coupling (SOC) when the default Eg without SOC is smaller than 1 eV. (The band edges are also resought with SOC.) This is because typical SOC corrections of 0.5 eV would be critical in these cases.

In identifying the stable collinear magnetic ordering, AMP2 applies a genetic algorithm to the Ising model28. This approach finds the stable magnetic ordering correctly for many compounds. However, the original formulation requires a large supercell to isolate exchange interactions from periodic images, which costs significant computational resources and also suffers from poor convergence in electronic iterations. To resolve this, we here develop an alternative method for obtaining exchange parameters. First, we choose a minimal supercell under the following two conditions: (i) A magnetic site α and its periodic images in other supercells are more than 5 Å apart (the cutoff range for magnetic interactions). (ii) If two magnetic sites α and β (not necessarily belonging to the same supercell) are within 5 Å, then the distance between α and β′, a periodic image of β (β′ ≠ β), is longer than 5 Å except when α-β and α-β′ are symmetrically equivalent. Within the Ising model, the total energy of the supercell (E) can be expressed as follows:

$$E={E}_{0}+\sum_{I=1}^{m}{J}_{I}\left({N}_{I,{\rm{P}}}-{N}_{I,{\rm{A}}}\right),\tag{1}$$

where E0 is the base energy excluding the magnetic interaction, and I is the index for independent exchange interactions (total m interactions) with the maximum range of 5 Å and the exchange parameter of JI. In Eq. (1), NI,P and NI,A are the numbers of parallel and antiparallel spin pairs within the supercell corresponding to the interaction I, respectively. Then, based on the ferromagnetic configuration (all spin-up), diverse spin configurations are sampled by spin-flipping a magnetic pair (both atoms) or a certain magnetic site. The number of resulting equations is larger than m and an optimal {E0, JI} can be obtained by the pseudoinverse method. We find that this approach produces essentially the same parameters as the original scheme but is more reliable and efficient.
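As a minimal sketch of the fitting step in Eq. (1), the overdetermined linear system for {E0, JI} can be solved with a pseudoinverse. The function name, the input layout and the synthetic numbers below are illustrative assumptions and are not taken from AMP2.

```python
import numpy as np

def fit_ising(pair_diffs, energies):
    """Least-squares fit of E0 and J_I in E = E0 + sum_I J_I (N_IP - N_IA).

    pair_diffs: array (n_configs, m) holding N_IP - N_IA per configuration.
    energies:   array (n_configs,) of total energies for those configurations.
    """
    diffs = np.asarray(pair_diffs, dtype=float)
    energies = np.asarray(energies, dtype=float)
    # Design matrix: a column of ones for E0 followed by the pair differences.
    design = np.hstack([np.ones((diffs.shape[0], 1)), diffs])
    params = np.linalg.pinv(design) @ energies
    return params[0], params[1:]

# Synthetic data: five spin configurations, two independent interactions.
diffs = np.array([[4, 2], [0, 2], [4, -2], [-4, 2], [0, -2]])
true_e0 = -10.0
true_j = np.array([0.05, -0.02])
energies = true_e0 + diffs @ true_j

e0, j = fit_ising(diffs, energies)
```

Because the number of sampled configurations exceeds m + 1, the pseudoinverse returns the least-squares solution; with the noise-free synthetic energies used here it recovers the input parameters.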

Selection of materials

Figure 1 schematizes the workflow for constructing the database. Starting from the ICSD, we only consider compounds consisting of elements with atomic number (Z) < 84. Among the lanthanides, we limit the elements to La and Ce. We remove structural duplicates and structures with partially occupied sites, and also omit large primitive cells that contain more than 40 atoms. For unary and binary compounds, all the structures are calculated with AMP2. For ternary and higher compounds, we utilize information on Eg and DOS in the Materials Project database (calculated by PBE) to filter out materials that are likely to be metallic or large-gap insulators. To be specific, we exclude materials with \({E}_{{\rm{g}}}^{{\rm{GGA}}}\) larger than 3 eV since they are likely to have \({E}_{{\rm{g}}}^{{\rm{HSE}}}\) larger than 5 eV. (Compiling data of 4,421 compounds from the previous screening studies24,29,30,31, we find that 99.7% of materials with \({E}_{{\rm{g}}}^{{\rm{HSE}}}\) < 5 eV have \({E}_{{\rm{g}}}^{{\rm{GGA}}}\) < 3 eV.) We also include metallic materials with DF/DVB < 0.8 for possible gap opening (see above; a larger threshold is used because of the low-resolution DOS in the Materials Project). If a Materials Project entry has incomplete data for Eg or DOS, the material is included in the computation list. In this way, we exclude 5,059 materials from the list of ternary and higher compounds. Finally, we calculate 21,353 materials with AMP2. After computation, we collect 10,481 materials with finite Eg (unary: 63, binary: 1,919, ternary: 5,074, quaternary: 2,804, quinary: 573, and higher: 48).
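The pre-screening rule for ternary and higher compounds can be summarized in a short sketch. The function name and the use of `None` for missing Materials Project data are assumptions for illustration; only the 3 eV and 0.8 thresholds come from the text.

```python
from typing import Optional

GGA_GAP_MAX_EV = 3.0   # materials above this PBE gap are excluded
DOS_RATIO_MAX = 0.8    # DF/DVB threshold for metals kept for gap-opening tests

def keep_for_hybrid(gga_gap_ev: Optional[float],
                    dos_ratio: Optional[float]) -> bool:
    """Decide whether a ternary-or-higher compound enters the AMP2 run.

    gga_gap_ev: PBE band gap in eV, or None if the entry is incomplete.
    dos_ratio:  DF/DVB from the PBE DOS, or None if the entry is incomplete.
    """
    # Incomplete records are kept so that they are computed explicitly.
    if gga_gap_ev is None or dos_ratio is None:
        return True
    if gga_gap_ev > 0.0:
        # Gapped in PBE: drop likely large-gap insulators.
        return gga_gap_ev <= GGA_GAP_MAX_EV
    # Metallic in PBE: keep only if the DOS at the Fermi level is low.
    return dos_ratio < DOS_RATIO_MAX

examples = [
    keep_for_hybrid(1.2, 0.0),   # semiconductor in PBE
    keep_for_hybrid(3.5, 0.0),   # likely E_g(HSE) > 5 eV
    keep_for_hybrid(0.0, 0.5),   # metal with low DOS at the Fermi level
    keep_for_hybrid(0.0, 0.9),   # metal with high DOS at the Fermi level
    keep_for_hybrid(None, 0.1),  # incomplete entry
]
```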

Fig. 1

The computational workflow for collecting the dataset. Numbers in the parentheses indicate material counts.

Data Records

All the calculated properties for the 10,481 compounds can be downloaded from the Figshare Repository17. The full dataset, including metals, is also uploaded to SNUMAT (www.snumat.com), which provides easy search and visualization of materials through its own interactive interface. SNUMAT also supports a REST API32 that lets authorized users search the materials. Authorization tokens expire 24 hours after they are issued.

File format

The data are stored in the JSON format. Each file is named X_ICSD#.json, where X is the chemical formula and ICSD# is the ICSD collection code of the initial structure used for the calculation. Each JSON file includes the final relaxed structure, \({E}_{{\rm{g}}}^{{\rm{GGA}}}\), \({E}_{{\rm{g}}}^{{\rm{HSE}}}\), and DOS. Table 1 summarizes the keys for metadata.
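A minimal sketch for handling these files is given below. It assumes that ICSD# in the file name is a purely numeric collection code and makes no assumption about the JSON keys, which are listed in Table 1; the helper names and the example file name are hypothetical.

```python
import json
from pathlib import Path

def parse_record_name(filename):
    """Split 'X_ICSD#.json' into (formula, ICSD collection code)."""
    stem = Path(filename).stem
    formula, _, code = stem.rpartition("_")
    if not formula or not code.isdigit():
        raise ValueError(f"unexpected file name: {filename}")
    return formula, int(code)

def load_record(path):
    """Load one data record as a plain Python dictionary."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)

# '12345' is a placeholder, not a real ICSD collection code.
formula, icsd_code = parse_record_name("NiO_12345.json")
```

After loading, `sorted(load_record(path).keys())` lists the available keys, which can then be matched against Table 1.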

Table 1 Description of metadata keys in JSON file.

Graphical representation of the data

In Fig. 2, we present the distribution of \({E}_{{\rm{g}}}^{{\rm{GGA}}}\) and \({E}_{{\rm{g}}}^{{\rm{HSE}}}\) for 10,481 materials. Most materials with \({E}_{{\rm{g}}}^{{\rm{HSE}}}\) > 5 eV (663 cases) are unary or binary compounds for which AMP2 is applied to the whole structure dataset from ICSD.

Fig. 2

Distribution of \({E}_{{\rm{g}}}^{{\rm{GGA}}}\) and \({E}_{{\rm{g}}}^{{\rm{HSE}}}\). Top and right are occurrence histograms of \({E}_{{\rm{g}}}^{{\rm{GGA}}}\) and \({E}_{{\rm{g}}}^{{\rm{HSE}}}\), respectively.

Technical Validation

Comparison to experimental measurements and other databases

In Fig. 3a, we compare experimental and theoretical values for 116 benchmark materials with experimental Eg between 0 and 5 eV. The list of compounds is shown in Online-only Table 1. For comparison, theoretical results from other databases are also shown in Fig. 3b–d. The RMSE values are 0.36 eV (present work), 1.05 eV (Materials Project), 0.93 eV (AFLOW), 0.75 eV (AFLOW fitted), 1.02 eV (OQMD), and 0.85 eV (JARVIS). (The meta-GGA values of 19 materials, mostly with small Eg, are missing in JARVIS.) This confirms that the present database provides more accurate Eg than the existing databases on average. In particular, we correctly identify the semiconducting nature for small-gap semiconductors such as AgSbTe2, CdO, CoP3, Cu3AsSe4, Cu3SbS4, Cu3SbSe4, CuFeS2, Ge, Mg2Sn, RhSb3, and ZnSnSb2, which are mostly misreported as metals in other databases. In addition, other databases exhibit pronounced errors for every antiferromagnetic material (CuFeS2, CuO, FeF2, MnO, MnTe, and NiO) because these materials are considered as ferromagnetic or non-magnetic. (For non-magnetic materials in Online-only Table 1, the Eg calculated with pure PBE (without +U and SOC) by AMP2 agrees well with those from Materials Project (the mean absolute error is 0.034 eV).)
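For reference, the RMSE used in this comparison follows the standard definition, sketched below. The function name is illustrative and the input values are placeholders rather than entries from the benchmark set.

```python
import math

def rmse(calculated, experimental):
    """Root-mean-square error between two equally long sequences (eV)."""
    if len(calculated) != len(experimental) or not calculated:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(c - e) ** 2 for c, e in zip(calculated, experimental)]
    return math.sqrt(sum(squared) / len(squared))

# Placeholder gaps in eV, not taken from the benchmark table.
calc = [1.0, 2.0, 3.0]
expt = [1.0, 2.0, 5.0]
error = rmse(calc, expt)
```

A single outlier dominates the RMSE, which is why materials misreported as metals or assigned the wrong magnetic ordering weigh heavily in the database comparison.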

Fig. 3

Comparison of Eg for benchmark materials between experimental and theoretical data from (a) this work, (b) Materials Project, (c) AFLOW and (d) OQMD and JARVIS. AFLOW-fitted values are obtained from \({E}_{{\rm{g}}}^{{\rm{fit}}}=1.34\,{E}_{{\rm{g}}}^{{\rm{GGA}}}+0.913\) eV.

In most cases, the present database provides Eg that agrees well with experiment. However, there are some materials with large errors of ≥0.5 eV such as AgAlTe2, Cu3AsSe4, CuAlSe2, CuBr, CuCl, CuFeS2, CuO, Ge, IrSb3, La2S3, MnO, RhAs3, RhSb3, SnO2, SrS, and ZnO. For small-gap materials such as Cu3AsSe4, Ge, and IrSb3, Eg is sensitive to the lattice parameters, which are slightly overestimated by PBE. Employing experimental lattice parameters or those relaxed within HSE significantly improves the results15. For Cu-bearing materials, it is known that HSE often exhibits substantial errors in Eg due to nonlocal screening effects in Cu, which requires GW calculations33,34. We also note that van der Waals interactions are not described by semilocal functionals, and lattice parameters can be overestimated in layered structures such as transition-metal dichalcogenides35. This can significantly affect Eg, and so care is needed when referring to Eg of layered materials. The present results do not consider finite-temperature effects on Eg, which can be significant in some materials, for example, hybrid perovskites36. More generally, an Eg dataset with the ultimate theoretical accuracy would be obtained by quasiparticle approaches such as GW or the Bethe-Salpeter equation37,38.