## Abstract

Modifying the physical properties of materials and designing materials with on-demand characteristics is at the heart of modern technology. Few applications rely on pure materials; most devices and technologies require careful design of material properties through alloying, creating heterostructures of composites, or the controlled introduction of defects. At the same time, such designer materials are notoriously difficult to model, which makes it tempting to apply machine learning methods to them. Unfortunately, only a handful of machine learning-friendly materials databases are currently available. We develop a platform for easy application of machine learning techniques to materials design and populate it with datasets on pristine and defected materials. Here we introduce the 2D Material Defect (2DMD) datasets, which include defect properties of representative 2D materials such as MoS_{2}, WSe_{2}, hBN, GaSe, InSe, and black phosphorus, calculated using DFT. Our study provides a data-driven physical understanding of the complex behavior of defect properties in 2D materials and can guide the development of efficient machine learning models. In addition, as more datasets are added, our database could provide a platform for designing materials with predetermined properties.

## Introduction

The intelligent design of materials with predetermined properties is the central agenda of modern materials science. To this end, multiple materials genome projects have been established. Two-dimensional (2D) materials are a very important subset of the overall materials library, so it is not surprising that many genome projects are devoted specifically to 2D crystals. One of the most exciting opportunities provided by 2D materials is the possibility to controllably alter their properties through chemical modification: materials without bulk are strongly susceptible to the introduction of foreign atoms, either at the surface as adatoms or into the plane of the crystal as substitutions. Yet, up to now, no database has addressed the modification of the properties of 2D crystals by the introduction of defects. Here we present our database of 2D materials and of defects in such crystals. We comprehensively analyze the electronic properties of defects in MoS_{2}. Our database enables an analysis of the defect structures using machine learning methods (which will be described elsewhere).

Defects play a very important role in modifying the mechanical, thermal, electronic, optical, and other properties of solids. Individual defects can act as single-photon emitters or qubits, can be used for single-atom catalysis, and serve many other applications^{1}. Individual defects have been demonstrated to be even more important for modifying the properties of 2D materials^{2,3}. For instance, it has been shown that localized defects embedded in transition metal dichalcogenides (TMDCs) and hexagonal boron nitride (hBN) exhibit single-photon emission (SPE) at cryogenic and room temperatures, which offers advantages over their 3D counterpart, the NV center in diamond^{4,5}.

Controllable defect engineering for the purpose of modifying material properties requires knowledge of the structure-property relation of defects, which is a formidable task considering the vast space of possible host materials and defect configurations. Even though a single defect structure can be calculated within hours with state-of-the-art density functional theory (DFT) methods on modern high-performance computing infrastructure, such computations are not scalable: the properties of each new defect have to be calculated from scratch. The state-of-the-art research paradigm, which integrates high throughput simulations, data science, and machine learning, is an appealing way to accelerate materials exploration. In this spirit, a large number of computational databases such as the Materials Project, the Open Quantum Materials Database (OQMD), the Automatic Flow for Materials Discovery (AFLOWLIB), and the Novel Materials Discovery (NOMAD) Laboratory have been established^{6,7,8,9}. Various machine learning (ML) methods for materials science, especially those based on graph neural networks such as MEGNet, CGNN, SchNet, and GemNet, have been proposed for property prediction^{10,11,12,13}. The growing repositories of computational datasets and the development of machine learning methods have together led to explosive growth of in-silico materials exploration in the areas of 2D materials, catalysts, batteries, photovoltaics, etc.^{14,15,16,17,18}.

Still, great difficulties arise when trying to apply machine learning to predict the properties of defects, which may be due to the lack of defect datasets and the challenge of predicting quantum states. There is also a great deal of uncertainty when machine learning of a “black box” nature encounters the nonlinear quantum properties of defects. For these reasons, there have been few studies of machine learning applied to defects in solids. The reported studies mainly focus on the prediction of key properties of point defects in 2D materials, predictions of vacancy migration and formation energies in alloys, and defect dynamics in 2D TMDCs^{19,20,21}. Even though great effort has been devoted to the screening of 2D materials and the creation of 2D materials databases^{15,16,22}, such effort on defect properties is limited. Only recently, Fabian et al. reported a quantum point defect database containing 503 defect structures in 82 different 2D materials^{23}. The study is comprehensive and includes a wide range of thermodynamic and electronic properties. However, according to our ongoing exploration of machine learning defect prediction, the size and density of the data are still limited, which hinders efficient prediction with AI methods. A comprehensive dataset of defect properties is necessary not only for the engineering of imperfections in solids towards emerging technologies, but also for the refinement of machine learning models suited to defect prediction.

In this paper, we present datasets of defect properties in representative 2D materials (2DMD) created by high throughput DFT calculations. The property distribution in the datasets can be visualized through a binary property map. Interestingly, the band gap vs. formation energy plot shows a nontrivial correlation, which is attributed to the hierarchical impact of defect components on the host crystal. Such property maps can be seen as fingerprints of 2D crystals and the defects introduced into them. The interaction between defects decays with distance in an oscillatory manner due to quantum effects. These quantum oscillations are understood via a simplified quantum mechanical model and visualized through the wave function resonance of defect states. Our study provides a data-driven physical understanding of the electronic properties of defects in 2D materials, which should help in designing machine learning models for accurate and efficient prediction. Moreover, our datasets could serve as a platform for competitive model training, and scalable defect prediction can be expected as more datasets are added.

## Results

### Defect space and the structure of the 2DMD datasets

The defect space of 2D materials is spanned by three variables: the 2D material hosts, the defect components, and the defect configurations (Fig. 1a). There are hundreds of existing 2D materials, and even more are expected to be synthesized in the future. The components of defects are limited to a few types, such as intrinsic vacancies, impurity substitutions, and antisites. However, the configuration space for defects is infinite. Together, these three variables produce a vast defect space for 2D materials, which is impractical to investigate thoroughly by experiment or even by computation.

Considering the extremely large space of defects, probably any practical dataset should be regarded as “small data”. Consequently, to train machine learning methods for efficient prediction of defect properties, the structure of the datasets should be carefully designed. Our datasets fall into two groups, one structured and the other dispersive (Fig. 1b). The structured datasets capture fine-grained features of defects that are correlated with the periodic nature of crystal lattices and the quantum mechanical nature of defect properties. As one example, we generate and compute a structured dataset of TMDCs by screening all possible configurations of selected single, double, and triple-site defects in MoS_{2} and WSe_{2}. The dataset includes properties of simple defects with different configurations and inter-defect separations, and provides very detailed information for machine learning. This dataset is dense but occupies only a small subspace of the whole defect space. The dispersive datasets aim to spread data across the large defect space. This is done here by calculating the properties of defects over a wide range of concentrations for a wide range of 2D materials. So far, we have completed the defect-property calculations for a few popular materials, including MoS_{2}, WSe_{2}, hBN, GaSe, InSe, and black phosphorus. Such a random sampling strategy has the disadvantage of data sparsity. However, it allows us to cover the diverse phase space of defect properties and facilitates the design of universal, transferable machine learning approaches such as active learning^{24}.

### The distribution and the correlation of defect properties

The structured dataset is composed of DFT-computed properties of 5933 defect configurations for MoS_{2} and another 5933 configurations for WSe_{2}. The compositions, numbers, atomic structures, and the variation range of properties of each type of defect for MoS_{2} are summarized in Table 1. The dispersive dataset includes computed properties of defects over a wide range of concentrations for representative 2D materials such as MoS_{2}, WSe_{2}, hBN, GaSe, InSe, and black phosphorus. The datasets, together with output data such as relaxed atomic structures, density of states (DOS), and band structures at the DFT-PBE level, are available for further studies and machine learning training^{25}.

The variation range of the formation energy indicates the amplitude of the interaction between the defects for each defect type. The formation energy of V3 defects, composed of one Mo vacancy and two S vacancies, spans the widest range of ~4.1 eV, whereas that of defects that include one Mo vacancy and only one S vacancy varies within a moderate range of ~2.2 eV. The variation in the formation energy of defects containing two S vacancies is only 0.1 eV. The variation of the formation energy of defects composed mainly of substitution atoms is only a few tens of meV.

Such a spread of energies for different types of defects seems logical. The creation of a Mo vacancy breaks six Mo-S bonds and induces the most significant lattice imperfection, which costs a large formation energy of 7.12 eV. The creation of an S vacancy requires the breaking of only three Mo-S bonds and thus produces a moderate disturbance to the lattice, with a formation energy of 2.65 eV. Due to the similar outer electron configurations of W and Mo, and of Se and S, substitution atoms cause almost negligible changes in the host lattice and electronic structure. The formation energies for W substituting Mo and Se substituting S are 0.167 and 0.279 eV, respectively. The impurity orbitals of both substitutions merge into the valence band or conduction band of MoS_{2} and do not produce any defect level in the band gap.

The variation ranges of the frontier defect orbitals (HOMO and LUMO) for each type of defect indicate to what extent we can manipulate the defects to achieve desired properties. This information is of great importance for applications such as quantum computing and quantum telecommunications (single-photon emitters), catalysis, and many others. Vacancy defects create deep-lying levels that span a range of 0.1–0.4 eV inside the band gap. At the same time, substitutional defects create defect states inside the valence or conduction bands without significant change to the band edges (of the order of 10 meV).

To visualize the distribution of the whole dataset in the property space, we show a binary property map, i.e., a band gap vs. formation energy plot, in Fig. 2. Most interestingly, as shown by the guideline in Fig. 2a, there is an overall trend for the whole dataset: the band gap of defect structures decreases with increasing formation energy and finally converges in groups near 0.3 eV. In other words, achieving deep defect levels inside the band gap (which reduce the gap) costs a large formation energy. This provides a guideline for the manipulation of defects in TMDCs in order to achieve the desired properties. W and Se substitutions produce defect levels without a significant alteration of the electronic structure. Shallow gap states can be introduced by single or double S vacancies. The formation of deep levels requires the presence of a Mo vacancy.

The distribution of defects in the binary property plot can be understood by the hierarchical impact of defects on the host MoS_{2}. For the substitution-only defects (X3, X4, S1, S4, and S5), the formation energies are low because the disturbance of the lattice is negligible (<10 meV). And because no gap state is introduced, the band gap of the substitution defects retains the 1.81 eV separation of the band edges of MoS_{2}. For defects containing one S vacancy (X1, S2, and S6), the band gaps concentrate at ~1.1 eV and the formation energies lie near ~3.0 eV. For defects composed of two vacancies (X2 and S3), the formation energy is ~5.5 eV, approximately double that of a single S vacancy. Interestingly, even though the formation energy has a very narrow span, the band gaps of X2 and S3 span a wide range of 0.6 eV. This indicates that the frontier orbitals of defects become highly tunable by controlling the distance between S vacancies. For defects that include Mo sites (V1, V2, V3, V4, V5, and V6), the band gaps are mainly within 0.5 eV. The formation energy spans 7.0–13.0 eV, depending on the number of vacancies.
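The hierarchy described here can be illustrated with a rough non-interacting estimate that simply adds the single-defect formation energies quoted in the text for MoS_{2}. The sketch below is only that: the labels `V_Mo`, `V_S`, `W_Mo`, and `Se_S` and the function name are hypothetical, and the additive assumption ignores exactly the defect-defect interaction that the dataset is meant to capture.

```python
# Single-defect formation energies in MoS2 quoted in the text (eV).
# The dictionary keys are illustrative labels, not identifiers from the dataset.
SINGLE_DEFECT_EF = {
    "V_Mo": 7.12,   # Mo vacancy
    "V_S": 2.65,    # S vacancy
    "W_Mo": 0.167,  # W substituting Mo
    "Se_S": 0.279,  # Se substituting S
}


def additive_formation_energy(components):
    """Sum single-defect formation energies, ignoring defect-defect interaction.

    `components` maps a defect label to the number of times it occurs.
    """
    return sum(SINGLE_DEFECT_EF[name] * count for name, count in components.items())


two_s_vacancies = additive_formation_energy({"V_S": 2})
mo_plus_two_s = additive_formation_energy({"V_Mo": 1, "V_S": 2})
print(f"two S vacancies:          {two_s_vacancies:.2f} eV")
print(f"one Mo + two S vacancies: {mo_plus_two_s:.2f} eV")
```

Under this assumption, two S vacancies come out close to the ~5.5 eV group and a Mo vacancy with two S vacancies falls inside the 7.0–13.0 eV window; the deviation of the computed values from such sums is what the interaction energy measures.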

### The property distribution of defects as a fingerprint

As shown in Fig. 2b, the trend in the property map of simple defects persists for the high-density defects of MoS_{2}. Considering that the high-density defect dataset contains a wide spread of defect configurations and concentrations, the nontrivial distribution of properties is intrinsic and could serve as a fingerprint for defects in 2D materials. Indeed, the property distribution maps for MoS_{2}, WSe_{2}, hBN, GaSe, InSe, and black phosphorus show distinct characteristics for different 2D material types. That is, the maps for MoS_{2} and WSe_{2}, or for GaSe and InSe, share similar features, while the maps of different crystal families differ substantially. This is shown in the supplemental materials (Supplementary Fig. 2).

Defects with unpaired electrons are of special interest for magneto-optical and information technology applications. In our datasets, no magnetic defects are found in MoS_{2} and WSe_{2}, while such defects can be created in GaSe, InSe, BP, and C-doped hBN. The exchange splitting of states results in an asymmetric distribution of the band gap for the majority and minority states. Interestingly, there seems to be a nontrivial trend toward a wider energy distribution of the majority band gap. This is especially evident for defects in BP and C-doped hBN. Vacancy defects in GaSe, InSe, and BP intrinsically possess local magnetic moments due to the odd number of valence electrons of Ga, In, and P atoms. The introduction of a C impurity at either the B or the N site creates an unpaired electron due to the difference in valence electrons. The total spin of the carbon-impurity complexes in hBN is governed by the difference between the numbers of carbon substitutions on the two sublattices, according to Lieb’s theorem^{26,27}.

For this reason, the magnetic moments in C-doped hBN could reach a very large number if the carbon substitutions in the two sublattices are highly imbalanced (Supplementary Fig. 3). Carbon substitutions in hBN are particularly interesting for the area of single-photon emitters aiming at the identification and engineering of candidate emitting centers in 2D hosts instead of that in bulk materials such as the NV center in diamond^{27,28}.
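The sublattice counting rule can be written down in a few lines. The sketch below assumes that each carbon substitution contributes one unpaired electron and that the Lieb counting rule for a bipartite lattice, S = |N_B − N_N| / 2, applies directly to the numbers of carbon atoms on the B and N sublattices; the function names are hypothetical and the spin-only moment assumes g = 2.

```python
def lieb_total_spin(n_c_on_b, n_c_on_n):
    """Total spin S = |N_B - N_N| / 2 for carbon substitutions in hBN.

    n_c_on_b: number of C atoms substituting B sites
    n_c_on_n: number of C atoms substituting N sites
    """
    if n_c_on_b < 0 or n_c_on_n < 0:
        raise ValueError("substitution counts must be non-negative")
    return abs(n_c_on_b - n_c_on_n) / 2


def magnetic_moment_bohr(n_c_on_b, n_c_on_n):
    """Spin-only magnetic moment in Bohr magnetons (2S, assuming g = 2)."""
    return 2 * lieb_total_spin(n_c_on_b, n_c_on_n)


# Balanced substitutions pair up; imbalanced ones leave a net moment.
print(lieb_total_spin(1, 1))       # balanced pair
print(magnetic_moment_bohr(4, 1))  # strongly imbalanced complex
```

In this picture a balanced pair carries no net spin, while every additional carbon on the majority sublattice adds one Bohr magneton.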

### The quantum fluctuation of defect properties

Complex defects in TMDCs demonstrate nontrivial quantum properties. For V2 defects (one Mo vacancy and one S vacancy), the interaction energy (calculated as the difference between the formation energy of the defect complex and the sum of the formation energies of the defect components), as well as the HOMO and LUMO, oscillates as a function of the separation between the vacancies (Fig. 3). The oscillatory behavior is most obvious at small separations between the vacancies. The local minima in the interaction energy correspond to the configurations in which the S vacancy is at the 1st, 3rd, 6th, and 10th nearest S site to the Mo vacancy. These indices are exactly the triangular numbers \(S_n = n \cdot (n + 1)/2\). Interestingly, the sites that exhibit local minima of the formation energy lie exactly along the zigzag crystalline orientation, as shown in the inset of Fig. 3a.
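The neighbor indices at which the minima occur can be generated and checked with a short helper. This is a minimal sketch of the triangular-number formula quoted above; the function names are illustrative.

```python
from math import isqrt


def triangular_numbers(count):
    """Return the first `count` triangular numbers S_n = n(n + 1)/2 for n = 1..count."""
    return [n * (n + 1) // 2 for n in range(1, count + 1)]


def is_triangular(k):
    """Return True if the neighbor index k is a triangular number.

    k = n(n + 1)/2 has a positive integer solution n exactly when 8k + 1
    is a perfect square.
    """
    if k <= 0:
        return False
    root = isqrt(8 * k + 1)
    return root * root == 8 * k + 1


print(triangular_numbers(4))  # [1, 3, 6, 10]
```

`is_triangular` can be used to flag which nearest-neighbor shells would be expected to host a local minimum if the pattern continued beyond the 10th site, which the text does not claim.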

In metals, the introduction of impurities generally results in Friedel oscillations^{29}. Due to the strong screening by the surrounding electrons, the long-range tail of a charged impurity potential is suppressed, resulting in a power law decay of the defect disturbance^{30,31}. In insulators, however, when the wave function is strongly localized on the defect site, the lattice structure plays a significant role in the defect properties^{1,32,33}.

In crystals with a honeycomb structure, the interaction between defects is strongly influenced by the presence of the two sublattices. For example, it has been predicted and confirmed by both theory and experiment that the coupling between hydrogen atoms on the surface of graphene depends on the sublattices occupied^{34,35}. To understand the defect interaction in TMDCs, we employ a simplified two-orbital picture. We propose that there are two localized defect orbitals \(\phi _A\) and \(\phi _B\) at the lattice sites A and B of the crystal, each occupied by one electron. When the two defects are far apart, the correlation between them is negligible. Each single defect wave function fulfills the single-particle Schrödinger equation

where \(T\) is the kinetic energy of the electron, and \(V_A^1\) and \(V_B^2\) represent the Coulomb potential energy at the A and B sites contributed together by the atomic nuclei and band electrons. When the two defects occupy neighboring lattice sites, the coupling between them should be taken into account, and the two-electron wave function takes the form:

where \(S = \left\langle {\phi _A|\phi _B} \right\rangle\) is the overlap integral of the two orbitals. The positive and negative signs correspond to the bonding and anti-bonding states, respectively. The Hamiltonian of the defect complex is:

In addition to the additive single-defect terms in the parentheses, the Hamiltonian includes the Coulomb potential originating from the neighboring defect sites (represented by \(V_A^1\) and \(V_B^2\)) as well as electron-electron and core-core electrostatic interactions in the screening environment of band electrons (described by the dielectric constant \(\varepsilon\)). In the last two terms, \(r_{12}\) is the distance between the two electrons and \(R_{AB}\) is the separation of the two defect sites. The energies of the bonding \(\left( {E_ + } \right)\) and anti-bonding \(\left( {E_ - } \right)\) states can be expressed as:

where *K* and *J* are the direct Coulomb and exchange integrals, which take into account the electron-core attraction, electron-electron repulsion, and core-core repulsion of the two defects^{32}. The stabilization of the configuration is governed by the competition between the exchange and the direct Coulomb interactions of the two-electron state. According to the definition in Eq. (2), the interaction energy is essentially the stabilization energy, originating mainly from the exchange interaction of the two defect orbitals:
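For orientation, the two-orbital picture described in this section has the structure of the textbook Heitler-London treatment of two one-electron centers. The block below is a sketch under that assumption, written with the symbols defined in the text. The single-defect energies \(E_A\) and \(E_B\), the electron labels 1 and 2, the cross terms \(V_B^1\) and \(V_A^2\), the normalization factor, and the final expression for the interaction energy are standard textbook choices and are not confirmed by the surrounding text.

```latex
% Isolated defects: one electron in each localized orbital
(T + V_A^1)\,\phi_A(1) = E_A\,\phi_A(1), \qquad
(T + V_B^2)\,\phi_B(2) = E_B\,\phi_B(2)

% Two-electron bonding (+) and anti-bonding (-) states
\Psi_\pm = \frac{1}{\sqrt{2\,(1 \pm S^2)}}
  \left[\phi_A(1)\,\phi_B(2) \pm \phi_A(2)\,\phi_B(1)\right]

% Hamiltonian of the defect complex (cross terms in assumed notation)
H = \left(T_1 + V_A^1 + T_2 + V_B^2\right) + V_B^1 + V_A^2
  + \frac{e^2}{\varepsilon\, r_{12}} + \frac{e^2}{\varepsilon\, R_{AB}}

% Energies of the bonding and anti-bonding states
E_\pm = E_A + E_B + \frac{K \pm J}{1 \pm S^2}

% Stabilization of the bonding state relative to two isolated defects
E_{\mathrm{int}} \approx E_+ - (E_A + E_B) = \frac{K + J}{1 + S^2}
```

In this form the sign and size of \(J\), which depends on how the two defect orbitals overlap, decides whether a given configuration is stabilized, which is the mechanism invoked in the discussion of Fig. 3.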

Based on the above picture, we can understand the oscillations in the interaction energy of Fig. 3 in detail. First, we examine the oscillation of the defect wave function and the displacement of electric charge on the surrounding atoms disturbed by the defect states. A Mo vacancy introduces six dangling bonds, which result in several energy levels inside the gap (Fig. 4a). These gap states are highly localized at the vacancy center and decay within a few lattice spacings. As shown in Fig. 4b, due to the resonance between the electron wave and the honeycomb lattice, the wave function of the Mo vacancy has nodes at the Mo sites, where it changes sign^{36}. The wave function therefore oscillates, with maxima around S sites and zeros at Mo sites. Likewise, the wave function of an S vacancy oscillates in a similar way, but with its maxima at Mo sites (Fig. 4c). This trend is consistent with the fluctuation in the interaction energy of the V2 defects shown in Fig. 3a. The wave function oscillation of the defect states is reflected in the atomic charges of neighboring atoms. Fig. 4d plots the atomic charge of S atoms around the Mo vacancy as a function of distance to the vacancy; it shows a fluctuation trend similar to that of the interaction energy in Fig. 3a. In pristine MoS_{2}, the atomic charge gained by S from Mo atoms is calculated to be ~0.6 electrons. For the S atoms nearest to the Mo vacancy it is less than 0.5 electrons, owing to the breaking of the Mo-S bonds. The bond breaking around the vacancy also affects the atomic charges of the other S atoms through the wave function fluctuation of the defect states. However, this effect decays rapidly due to screening.

The coupling of the two vacancies is highly site dependent. When the two vacancies occupy neighboring lattice sites, the two states hybridize according to the phase and amplitude of their overlapping wave functions. Two states hybridize strongly if their wave functions overlap in phase; otherwise, their coupling is weak (as when their wave functions do not overlap or overlap out of phase). This is evident from the HOMO wave functions of the four V2 defects in which the S vacancy occupies the first, second, third, and fourth nearest lattice site to the Mo vacancy (Fig. 4e–h). For the first and third configurations, one vacancy occupies a lattice site where the wave function of the other reaches a peak value, so the dangling bond states hybridize strongly, which results in a large exchange interaction and stabilization energy. Some of the dangling states of the two vacancies hybridize and shift to lower energy, leaving other dangling states unaffected. As a result, the HOMO wave function is constructed from the unaffected dangling states and shows some features of the pristine Mo vacancy (Fig. 4e, g). For the second configuration, even though the separation between the two vacancies is smaller than in the third configuration, the two vacancies occupy lattice sites where their wave functions have a node. The HOMO wave function retains the character of the isolated Mo and S vacancies (Fig. 4f). For defect configurations with a separation larger than the fourth nearest site, either the separation is too large or the wave functions are out of phase, and the hybridization can be neglected (Fig. 4g). On the basis of these wave function plots and the quantum mechanical origin of the stabilization energy, the fluctuation in the interaction energy in Fig. 3 can be understood.

The defect levels in the band gap fluctuate in accordance with the coupling strength of the defect states. This can be seen from the opposite variation trend of the LUMO compared with that of the interaction energy (Fig. 3b). The variation in the coupling strength of defect states gives rise to diverse locality and affinity of defect electrons, resulting in tunable activity toward foreign species. In addition, a wide range of resonant transitions between defect levels could be achieved owing to the different symmetries and separations of defect states. The same rationale can be used to interpret the results for the triple defects, in which additional complexity arises from the participation of the third defect site. These data are presented in Supplementary Fig. 4.

## Discussion

We develop a database of machine learning-friendly datasets on the physical properties of solid-state materials with and without defects. As a starting point, we created a structured dataset of 11,866 defect configurations in TMDCs and a dispersive dataset of 3000 configurations in six representative 2D materials. The database is based on high throughput DFT calculations and unveils complex structure-property correlations through proper data ordering and physical insight. The initial structured dataset spans all possible single, double, and triple defect configurations with components of Mo/W and S/Se vacancies and W/Mo and Se/S substitutions in an 8 × 8 × 1 supercell. The property distribution of the defect configurations was visualized with the band gap vs. formation energy map. This may provide a general guideline for defect engineering in TMDCs to enable emerging technologies. The properties of selected double and triple defects were studied in detail. The fluctuation of defect properties with lattice site and distance is observed and explained in depth through symmetry and quantum mechanical electronic structure analysis. The dispersive dataset was created to span as much of the defect space of 2D materials as possible. Even though the defect configurations in this dataset are highly dispersive, the property maps of the calculated materials retain some nontrivial characteristics that act as fingerprints of the host materials. Our study demonstrates that a properly structured dataset can reveal complex structure-property correlations. This could provide an in-depth guide to the engineering of materials with predetermined properties through chemical modification, alloying, and defect formation. As further datasets of materials with and without defects are added, a powerful platform for the realization of materials with tailored properties will be formed.

## Methods

### Dataset generation

We computed datasets of both simple defects and high-density defects in 2D materials. For simple defects, we created symmetrically inequivalent single, double, and triple-site defects composed of Mo vacancies, S vacancies, W substitutions, and Se substitutions in the 8 × 8 monolayer MoS_{2} supercell. The compositions, configuration numbers, and atomic structures of each type of defect are summarized in Table 1. The same procedure was repeated for WSe_{2}, with Mo and S as the substitution atoms instead. The dataset, containing DFT-computed properties of 5933 defect configurations for MoS_{2} and another 5933 for WSe_{2}, is available online^{25} (https://rolos.com/open/2d-materials-point-defects/). Another dataset of high-density defects (available in the same repository) was created by randomly generating combined vacancy and substitution defects at concentrations of 2.5, 5, 7.5, 10, and 12.5% for representative 2D materials such as MoS_{2}, WSe_{2}, hBN, GaSe, InSe, and black phosphorus (BP). We generated and computed 100 structures for each defect concentration and each material, giving 500 configurations per material and 3000 in total. The availability of such high-density defect datasets, alongside the full set of triple-site defects, will allow researchers to verify complex scenarios that might arise in defect engineering.
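The random sampling used for the high-density dataset can be sketched as follows. This is an illustration rather than the generation script itself: it assumes that the concentration is the fraction of atomic sites that carry a defect, that vacancies and substitutions are drawn with equal probability, and that the supercell has 192 sites (an 8 × 8 MoS_{2} monolayer with three atoms per unit cell). All function and variable names are hypothetical.

```python
import random

CONCENTRATIONS = (0.025, 0.05, 0.075, 0.10, 0.125)  # defect fractions from the text
DEFECT_TYPES = ("vacancy", "substitution")
N_SITES = 192  # assumed: 8 x 8 supercell, 3 atoms per MoS2 unit cell


def sample_defect_sites(n_sites, concentration, rng):
    """Pick distinct defect sites at random and assign each a defect type."""
    n_defects = round(n_sites * concentration)
    sites = rng.sample(range(n_sites), n_defects)
    return {site: rng.choice(DEFECT_TYPES) for site in sorted(sites)}


def build_dataset(n_sites, concentrations, per_concentration=100, seed=0):
    """Generate `per_concentration` random configurations for each concentration."""
    rng = random.Random(seed)  # fixed seed so the sampling is reproducible
    return {
        c: [sample_defect_sites(n_sites, c, rng) for _ in range(per_concentration)]
        for c in concentrations
    }


dataset = build_dataset(N_SITES, CONCENTRATIONS)
total = sum(len(configs) for configs in dataset.values())
print(f"{total} configurations for one material")
```

With five concentrations and 100 structures each, this yields the 500 configurations per material mentioned above; a real workflow would additionally map site indices to atomic species and write out structure files for the DFT runs.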

### DFT calculations

Our calculations are based on density functional theory (DFT) using the PBE functional as implemented in the Vienna Ab Initio Simulation Package (VASP)^{36,37,38}. The interaction between the valence electrons and ionic cores is described within the projector-augmented wave (PAW) approach with a plane-wave energy cutoff of 500 eV^{39}. Spin polarization was included in all the calculations. The initial crystal structures were obtained from the Materials Project database, and the supercell sizes and computational parameters for each material are listed in the supplemental materials (Supplementary Table 1 and Fig. 1). Since very large supercells are used for the defect calculations, the Brillouin zone was sampled with a Γ-point-only Monkhorst-Pack grid for structural relaxation and with denser grids for the subsequent electronic structure calculations. A vacuum space of at least 15 Å was used to avoid interaction between neighboring layers. In the structural energy minimization, the atomic coordinates are allowed to relax until the forces on all atoms are less than 0.01 eV/Å. The energy tolerance is 10^{−6} eV. For defect structures with unpaired electrons, we use standard collinear spin-polarized calculations with the magnetic ions initialized in a high-spin ferromagnetic state (the ion moments can, of course, relax to a low-spin state during the ionic and electronic relaxations). Currently, we focus on the basic properties of defects and do not include spin–orbit coupling (SOC) or charged-state calculations^{40}.
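As an illustration, the relaxation settings listed above could map onto VASP input tags roughly as follows. This is a sketch, not the input actually used: the tags `GGA`, `ENCUT`, `ISPIN`, `EDIFF`, and `EDIFFG` encode values stated in the text, while `IBRION`, `ISIF`, and `NSW` are assumed choices for a fixed-cell ionic relaxation that the text does not specify. The helper `format_incar` is hypothetical.

```python
# Relaxation settings expressed as VASP INCAR tags.
RELAX_INCAR = {
    "GGA": "PE",      # PBE exchange-correlation functional
    "ENCUT": 500,     # plane-wave energy cutoff (eV)
    "ISPIN": 2,       # collinear spin-polarized calculation
    "EDIFF": 1e-6,    # electronic energy tolerance (eV)
    "EDIFFG": -0.01,  # negative value: converge forces below 0.01 eV/Angstrom
    "IBRION": 2,      # assumed: conjugate-gradient ionic relaxation
    "ISIF": 2,        # assumed: relax atomic positions, keep the cell fixed
    "NSW": 200,       # assumed: maximum number of ionic steps
}

# Gamma-point-only k-mesh in VASP KPOINTS format.
GAMMA_KPOINTS = "Gamma-point only\n0\nGamma\n1 1 1\n0 0 0\n"


def format_incar(tags):
    """Render a tag dictionary as INCAR text, one 'KEY = value' per line."""
    return "\n".join(f"{key} = {value}" for key, value in tags.items()) + "\n"


print(format_incar(RELAX_INCAR))
print(GAMMA_KPOINTS)
```

The denser k-grids used for the subsequent electronic structure step and the initial magnetic moments are not given in the text, so they are left out here.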

The formation energy, i.e., the energy required to create a defect, is given by

\[E_f = E_D - E_{{\mathrm{pristine}}} + \mathop {\sum}\limits_i {n_i\mu _i}\]

where \(E_D\) is the total energy of the defect structure, \(E_{{\mathrm{pristine}}}\) is the total energy of the pristine MoS_{2} or WSe_{2} supercell, \(n_i\) is the number of atoms of element \(i\) (Mo, W, S, or Se) transferred from the defect structure to the chemical reservoirs, and \(\mu _i\) is the chemical potential of that element.
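A small helper makes the bookkeeping of the formation energy explicit. The sketch assumes the common convention E_f = E_D − E_pristine + Σ n_i μ_i, with n_i counted as positive for atoms moved from the structure to the reservoir and negative for atoms taken from it; the numbers in the usage lines are made up for illustration and are not values from the dataset.

```python
def formation_energy(e_defect, e_pristine, removed_counts, chemical_potentials):
    """Formation energy E_f = E_D - E_pristine + sum_i n_i * mu_i (eV).

    removed_counts[el] is n_i: the number of atoms of element `el` moved from
    the structure to the reservoir (negative when atoms are added, for example
    the incoming atom of a substitution).
    """
    reservoir_term = sum(
        n * chemical_potentials[el] for el, n in removed_counts.items()
    )
    return e_defect - e_pristine + reservoir_term


# Illustrative numbers only (eV); not taken from the dataset.
mu = {"Mo": -10.0, "W": -12.0, "S": -4.0}

# S vacancy: one S atom goes to the reservoir.
e_f_vacancy = formation_energy(-993.0, -1000.0, {"S": 1}, mu)

# W substituting Mo: one Mo atom leaves, one W atom enters.
e_f_substitution = formation_energy(-1001.5, -1000.0, {"Mo": 1, "W": -1}, mu)

print(e_f_vacancy, e_f_substitution)
```

Passing the counts as a dictionary keeps vacancies and substitutions in one code path, at the cost of relying on the caller to get the sign of each count right.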

The interaction energy of a defect complex with respect to its defect components is defined as

\[E_{{\mathrm{int}}} = E_f\left( D \right) - \mathop {\sum}\limits_i {E_f^i}\]

where \(E_f(D)\) is the formation energy of the double-site or triple-site defect complex and \(E_f^i\) is the formation energy of the \(i\)-th single-site defect component. Negative (positive) interaction energy indicates the tendency for the individual defects to attract (repel) each other.
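The definition translates directly into code. In the sketch below, the interaction energy is the formation energy of the complex minus the sum over its single-site components; the component values 7.12 eV (Mo vacancy) and 2.65 eV (S vacancy) are the MoS_{2} values quoted in the Results, whereas the complex formation energy of 9.0 eV is a made-up number used only to exercise the function. The names `interaction_energy` and `tendency` are hypothetical.

```python
def interaction_energy(e_f_complex, e_f_components):
    """E_int = E_f(D) - sum_i E_f^i, all in eV."""
    return e_f_complex - sum(e_f_components)


def tendency(e_int, tol=1e-6):
    """Classify the sign of the interaction energy."""
    if e_int < -tol:
        return "attract"
    if e_int > tol:
        return "repel"
    return "neutral"


# Components: Mo vacancy and S vacancy in MoS2 (values quoted in the text).
# The complex energy of 9.0 eV is hypothetical.
e_int = interaction_energy(9.0, [7.12, 2.65])
print(f"E_int = {e_int:.2f} eV -> defects {tendency(e_int)}")
```

The tolerance in `tendency` avoids labeling numerically zero interaction energies as attraction or repulsion.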

To parameterize the electronic properties of defects, we inspect the positions of the highest occupied states, the lowest unoccupied states, and the separation of these two levels in the defect structures. For convenience, we adopt the terms highest occupied molecular orbital (HOMO), lowest unoccupied molecular orbital (LUMO), and band gap for these electronic properties. This is reasonable, considering that the defect states in TMDCs are highly localized and molecule-like. In practice, we would like to evaluate the positions of defect levels with respect to the valence band maximum (VBM). However, due to finite cell effects, the calculated valence band edge is generally strongly disturbed by defects and hard to identify. Moreover, the whole Kohn–Sham energy spectrum shifts due to the introduction of defects. For these reasons, we chose the deepest Kohn–Sham orbital as a reference, considering that it is less affected by defects. Accordingly, the HOMO of a defect with respect to the pristine VBM is normalized according to

\[E_{{\mathrm{HOMO}}} = \left( {E_{{\mathrm{HOMO}}}^D - E_1^D} \right) - \left( {E_{{\mathrm{VBM}}}^{{\mathrm{pristine}}} - E_1^{{\mathrm{pristine}}}} \right)\]

where \(E_{{\mathrm{HOMO}}}^D\) is the energy of the highest occupied Kohn–Sham state of the defect structure at the \({\Gamma}\) point, \(E_{{\mathrm{VBM}}}^{{\mathrm{pristine}}}\) is the energy of the valence band maximum of pristine MoS_{2} or WSe_{2}, and \(E_1^D\) and \(E_1^{{\mathrm{pristine}}}\) are the energies of the first Kohn–Sham orbital of the defect structure and of the pristine MoS_{2} or WSe_{2} structure, respectively. Since the defect states are localized, the defect bands are flat in wave vector space. The variations in energy in k-space are therefore small, and we extracted all the energies at the \({\Gamma}\) point. The regularized LUMO energy is defined similarly as

\[E_{{\mathrm{LUMO}}} = \left( {E_{{\mathrm{LUMO}}}^D - E_1^D} \right) - \left( {E_{{\mathrm{VBM}}}^{{\mathrm{pristine}}} - E_1^{{\mathrm{pristine}}}} \right)\]

where \(E_{{\mathrm{LUMO}}}^D\) is the energy of the lowest unoccupied Kohn–Sham state of the defect structure at the \({\Gamma}\) point.
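The alignment described above, in which the deepest Kohn–Sham orbital serves as the common reference, can be sketched as a single function applied to both frontier levels. The eigenvalues in the usage lines are made-up illustrative numbers, and the function names are hypothetical.

```python
def normalized_level(e_level_defect, e1_defect, e_vbm_pristine, e1_pristine):
    """Defect level relative to the pristine VBM, aligned via the deepest orbital.

    Returns (E_level^D - E_1^D) - (E_VBM^pristine - E_1^pristine) in eV.
    """
    return (e_level_defect - e1_defect) - (e_vbm_pristine - e1_pristine)


def band_gap(homo, lumo):
    """Separation of the normalized frontier levels (eV)."""
    return lumo - homo


# Illustrative Gamma-point eigenvalues (eV); not taken from the dataset.
e1_defect, e1_pristine, e_vbm_pristine = -60.3, -60.0, -2.0
homo = normalized_level(-1.2, e1_defect, e_vbm_pristine, e1_pristine)
lumo = normalized_level(-0.4, e1_defect, e_vbm_pristine, e1_pristine)
print(f"HOMO = {homo:.2f} eV, LUMO = {lumo:.2f} eV, gap = {band_gap(homo, lumo):.2f} eV")
```

Because the same reference shift is applied to both levels, the gap is unaffected by the alignment; only the absolute positions of the HOMO and LUMO depend on it.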

PBE functionals are well known to underestimate the band gap. To reproduce the experimental band gap accurately, more advanced hybrid functionals or many-body methods should be employed. However, such methods are computationally much more expensive and impractical for high-throughput calculations. Since we focus on the general trends of the formation energy, HOMO, and LUMO across a wide range of defect structures, the main conclusions based on PBE functionals should be transferable to the results of other methods.

## Data availability

All data and code are available at https://rolos.com/open/2d-materials-point-defects/.

## Acknowledgements

This research is supported by the Ministry of Education, Singapore, under its Research Center of Excellence award to the Institute for Functional Intelligent Materials (I-FIM, project No. EDUNC-33-18-279-V12). K.S.N. is grateful to the Royal Society (UK, grant number RSRP\R\190000) for support. P.H. acknowledges the support of the National Key Research and Development Program (No. 2021YFB3802400) and the National Natural Science Foundation (No. 52161037) of China. The authors particularly acknowledge the HPC support from Dr. Miguel Dias Costa. The computational work for this article was performed on computational resources at the National Supercomputing Center of Singapore (NSCC), the Centre for Advanced 2D Materials, the NUS HPC, and the HSE University.

## Author information

### Contributions

P.H., A.U., and K.S.N. conceived the research; P.H. and R.L. performed most of the calculations; P.H., R.L., M.F., N.K., and A.R.A.-M. participated in the data analysis; D.V.A., A.T., and A.H.C.N. participated in the discussion. All authors contributed to the drafting of the work and approved the final version of the manuscript.

## Ethics declarations

### Competing interests

The authors declare no competing interests.

## Rights and permissions

**Open Access** This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

## About this article

### Cite this article

Huang, P., Lukin, R., Faleev, M. *et al.* Unveiling the complex structure-property correlation of defects in 2D materials based on high throughput datasets.
*npj 2D Mater Appl* **7**, 6 (2023). https://doi.org/10.1038/s41699-023-00369-1
