Defects control the properties of many functional materials and devices1, like solar cells2,3, batteries4,5, catalysts6,7,8, and quantum computers9,10,11,12. To discover better materials for these applications it is thus necessary to predict how their defects behave. However, defect calculations are computationally demanding. The large supercells and high level of theory required to obtain robust predictions typically limit point defect analysis to in-depth studies of specific materials. In a move towards data-driven defect workflows13, defect databases14,15,16,17,18,19,20 and surrogate models have been developed to predict defect properties, like the dominant defect type18, formation19,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35 and migration35 energies, and charge transition levels19,25,36. By learning the relationship between defect structure and properties, these models enable high-throughput studies that quickly evaluate and screen a group of materials based on their defect behaviour.27,28,30,37

Despite progress in accelerating defect predictions, most high-throughput studies are limited in scope. Typically, their training datasets are generated assuming the ideal defect structure inherited from the crystal host. This structure often lies within a local minimum, which traps a gradient-based optimisation algorithm in a metastable arrangement38,39,40,41. The resulting incorrect geometries render the predicted defect properties, such as equilibrium concentrations39,41,42,43, charge transition levels39,41,42 and recombination rates39, inaccurate8,44,45,46,47. However, defect structure searching is often too expensive for high-throughput studies that target thousands of defects30 or materials with complex (defect) energy landscapes, like alloys, disordered solids, and low-symmetry crystals.

In this study, we aim to reduce the computational burden of defect structure searching by introducing a machine-learning surrogate model. We build a dataset containing a set of point defect structures, energies, forces and stresses from first-principles, and use it to fine-tune a universal machine-learning force field (MLFF) and qualitatively explore the energy landscape across 132 defects. Defect reconstructions often follow common motifs41, especially when comparing similar defects in families of related compounds. By learning the plausible reconstructions undergone by defects in similar hosts, a surrogate model can be used to optimise the initial sampling structures and thus identify the promising, low-energy configurations (Fig. 1), as previously shown for surface adsorbates48,49 and transition state searches50.

Fig. 1: Schematic of a machine-learning surrogate model used to accelerate defect structure searching.

The computationally efficient model learns the plausible defect reconstructions (local minima in the potential energy surface) and thus reduces the number of candidate structures relaxed with expensive first-principles density functional theory (DFT) calculations.


To assess the ability of a surrogate model to learn defect reconstructions, we will focus on one of the most common — and often strongest in terms of energy-lowering — reconstruction motifs: dimerisation41,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73. Dimers/trimers have been previously reported for numerous vacancies and interstitials, including \({{V}_{{{{\rm{Se}}}}}}^{0}\) in ZnSe, CuInSe2 and CuGaSe251, \({{V}_{{{{\rm{S}}}}}}^{0}\) in ZnS51, \({{V}_{{{{\rm{Cd}}}}}}^{0}\) in CdTe39, \({{V}_{{{{\rm{Sb}}}}}}^{0,+1,+2}\) in Sb2S/Se340,42, \({{V}_{{{{\rm{Ti}}}}}}^{0,-1}\) and \({{V}_{{{{\rm{Zr}}}}}}^{0}\) in CaZrTi2O745, \({{V}_{{{{\rm{Sb}}}}}}^{0}\) in Sb2O574, \({{{{{\rm{O}}}}}_{{{{\rm{i}}}}}}^{0}\) in In2O355, ZnO58, Al2O359, MgO60,61, CdO62, SnO263,64, PbO265, CeO266, BaSnO375, In2ZnO467 and LiNi0.5Mn1.5O476, \({{{{\rm{Ag}}}}}_{{{{\rm{i}}}}}^{0}\) in AgCl and AgBr53, \({{V}_{{{{\rm{I}}}}}}^{-}\), \({{{{{\rm{I}}}}}_{{{{\rm{i}}}}}}^{0}\), \({{{{{\rm{Pb}}}}}_{{{{\rm{i}}}}}}^{0}\), \({{{{\rm{Pb}}}}}_{{{{{\rm{CH}}}}}_{3}{{{{\rm{NH}}}}}_{3}}^{0}\) and \({{{{\rm{I}}}}}_{{{{{\rm{CH}}}}}_{3}{{{{\rm{NH}}}}}_{3}}^{0}\) in CH3NH3PbI368,69,70,71, \({{{{\rm{Pb}}}}}_{i}^{0}\) in CsPbBr352, (CH3NH3)3Pb2I754, (CH3NH3)2Pb(SCN)2I272 and \({{{{{\rm{Sn}}}}}_{{{{\rm{i}}}}}}^{0}\) in CH3NH3SnI357. While cation dimerisation has been reported in several hosts (AgCl/Br, CuInSe2, CuGaSe2, ZnS/Se, CdTe, Sb2S/Se3, CH3NH3PbI3, CsPbBr3, (CH3NH3)3Pb2I7, CH3NH3SnI3)41,51,52,53,54, anion dimers are more common and will be the focus of our study.

To target dimerisation, we consider cation vacancies in low-symmetry metal sulfides/selenides, whose covalent character and soft structures favour dimer formation41,42,56. Our first-principles dataset spans 50 hosts (exemplified in Fig. 2a) and 132 neutral cation vacancies, covering 25 elements (Fig. 2b) and 6 space groups. The configurational landscape of each vacancy was explored with the ShakeNBreak method41,77 by applying 15 chemically-guided distortions to the unperturbed defect structure, followed by geometry optimisation with DFT (see Methods), resulting in a diverse set of trajectories for each defect and the dataset shown in Fig. 2c.

Fig. 2: Distribution of the defect dataset generated with first-principles calculations.

a Example host structures and their respective space groups. b Number of configurations containing each element. c Two-dimensional projection of structural similarity for defect configurations. Each configuration is represented with the feature vector generated by the M3GNet model79,80 (trained on the bulk formation energies of the Materials Project database) and the vector dimensions are reduced using t-distributed stochastic neighbour embedding (t-SNE)131,132. The defect configurations are coloured by their host composition (with similar colours indicating compositions with similar MEGNet133 feature vectors), showing that related chemical systems cluster near each other. For clarity, in (b) and (c) 10 evenly spaced steps are selected from each relaxation trajectory.

Fig. 3: Analysis of the point defect dataset.

a Distribution of the number of holes produced upon defect formation. b Correlation between the number of anion–anion bonds formed and the number of holes created per defect. The label, colour, and size of each circle indicate the percentage of defects, for a given number of holes created, that form that number of anion–anion bonds.

Defect reconstructions

By analysing our first-principles dataset, we find that 29.9% of the neutral defects undergo symmetry-breaking reconstructions that are missed both by the standard modelling approach and when applying a rattle distortion (with energy differences between the identified ground state and the relaxed ideal configuration greater than 0.5 eV; Supplementary Table 1, Supplementary Fig. 1). Rattle distortions (i.e. randomised displacements) have been used in recent studies37 as the prevalence of defect reconstructions has become more recognised. While rattling helps to break the symmetry of the initial defect configuration and escape PES saddle points, it often fails to identify reconstructions with significant energy barriers (i.e. bond formation), highlighting the need for structure searching.

The identified reconstructions are driven by anion–anion bond formation, with the number of new bonds determined by the number of valence electrons lost upon defect formation (Supplementary Fig. 2). In general, energy-lowering structural reconstructions at defects tend to be driven by the localisation of excess charge introduced by the defect formation, through various bonding (re-)arrangements. Here, excess charge refers to the change in valence electrons available for bonding, which is determined by the oxidation state of the original defect atom and the defect charge state—and in fact is the chemical guiding principle used in ShakeNBreak to target likely distortion pathways. For instance, upon forming a neutral antimony vacancy (\({{V}_{{{{\rm{Sb}}}}}}^{0}\)) in Sb2(S/Se)3 (where Sb is in the +3 oxidation state), we have removed three bonding electrons and so we have three excess holes. Further changes in the defect charge state will then alter this excess charge (e.g. 2 excess holes in the −1 charge state, or zero excess charge in the ‘fully-ionised’ -3 charge state). Similarly, for a neutral Li vacancy in Li4SnS4, we have removed 1 bonding electron and so we have 1 excess hole (and zero excess charge for the fully ionised -1 state). Analogously for an anion vacancy behaving as a donor defect (as in most semiconductors), it would contribute x excess electrons where -x is the oxidation state of the anion in that compound. Defects resulting in one hole (e.g. \({{V}_{{{{\rm{Li}}}}}}^{0}\) in Li4SnS4) can easily accommodate the missing charge without strong reconstructions, while defects with two or more holes (e.g. \({{V}_{{{{\rm{Bi}}}}}}^{0}\) in BiSI) tend to form anion dimers or trimers, as shown in Fig. 3b. 
As a result, anion–anion bonds are more favourable for more positive defect charge states, and can stabilise unexpected defect oxidation states, as observed previously for \({{V}_{{{{\rm{Sb}}}}}}^{+1}\) in Sb2(S/Se)341,42 and \({{{{{\rm{O}}}}}_{{{{\rm{i}}}}}}^{+1,+2}\) in several metal oxides41,55,58.
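The electron-counting rule described above can be written as a short function. This is a minimal sketch and not the ShakeNBreak implementation: the function name `excess_holes` and its sign convention (positive for holes, negative for excess electrons) are assumptions made here for illustration, and it covers vacancies only.

```python
def excess_holes(oxidation_state, charge_state=0):
    """Excess charge introduced by a vacancy (hypothetical helper).

    oxidation_state: formal oxidation state of the removed atom
                     (e.g. +3 for Sb in Sb2S3, -2 for S).
    charge_state:    defect charge state (0 for the neutral defect).

    Returns a positive number for excess holes and a negative number
    for excess electrons; zero corresponds to the fully ionised defect.
    """
    return oxidation_state + charge_state


# Examples taken from the text
print(excess_holes(+3, 0))   # neutral V_Sb in Sb2(S/Se)3
print(excess_holes(+3, -1))  # V_Sb in the -1 charge state
print(excess_holes(+1, 0))   # neutral V_Li in Li4SnS4
```

With this convention, a neutral anion vacancy with oxidation state −x returns −x, that is, x excess electrons, matching the donor case described in the text.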

There are some exceptions to this trend, where systems are able to accommodate three or more holes without undergoing strong reconstructions. One example is hosts with d/f metals that adopt multiple stable oxidation states (e.g. Fe, Co, Cu), which can accommodate a hole by adopting a higher oxidation state8. To verify this trend, we compared two isostructural AIIIBIS2 systems which only differ in the identity of the B cation: \({{V}_{{{{\rm{Ga}}}}}}^{0}\) in GaCuS2 and GaAgS2; and \({{V}_{{{{\rm{In}}}}}}^{0}\) in InCuS2 and InAgS2 (Supplementary Fig. 3). In (Ga/In)AgS2, two of the holes localise in a S–S bond formed by the vacancy nearest neighbours (NN), while the third hole is split between the remaining two NNs. In contrast, in (Ga/In)CuS2, no dimer forms since three holes are localised in three of the vacancy NNs and five of the Cu ions closer to the vacancy — with these Cu ions showing shorter Cu–S bonds. The different behaviour of Cu and Ag can be rationalised by considering their second ionisation energies (I2(Cu): 20.3 eV, I2(Ag): 21.5 eV)78, where the low I2 (and thus higher d states) of Cu(I) favours cation oxidation, while the higher I2 of Ag(I) results in a sulfur dimer accommodating two of the holes (Supplementary Fig. 3).

In addition to systems with d/f elements, defects with nearby anion–anion bonds can localise the positive charge in these bonds and thus avoid forming new ones. This behaviour is exemplified by RhSe2, where the two symmetry-inequivalent Rh vacancies show different reconstructions. The first vacancy site is surrounded by four Se–Se bonds (Supplementary Fig. 4b), and thus can accommodate the four holes by depopulating the anti-bonding orbitals of these bonds. In contrast, the second site has only one Se–Se bond neighbouring the vacancy (Supplementary Fig. 4c), and thus has to form an additional Se dimer to accommodate the positive charge.

Beyond chalcogenide dimers, other rearrangements to accommodate positive charge involve chalco-halide (e.g. S–Cl formed by \({{V}_{{{{\rm{Bi}}}}}}^{0}\) in AgBiSCl2) and halide-halide bond formation (e.g. Cl dimers formed by \({{V}_{{{{\rm{Sb}}}}}}^{0}\) in SbSCl9) (Supplementary Fig. 5). Here we note that the zero-dimensional character of SbSCl9 enables this defect to undergo strong distortions forming two Cl dimers (Supplementary Fig. 5). Overall, we highlight the common reconstruction motifs exhibited by different defects in various host structures (Supplementary Fig. 2), facilitating the requisite diversity for a model to learn the plausible reconstructions for a group of related defects.

Model training

To develop a model that can be applied for defect structure searching in unseen compositions, we first split our dataset by composition into training, validation and test sets (Supplementary Fig. 6), amounting to 68%, 5% and 27% of defects, respectively. The validation set is then augmented with 5% of the configurations selected for the systems in the training set, to ensure that the structural diversity of the training set is also included for validation (thus evaluating how the model performs for a large diversity of defects and compositions and also how it extrapolates to unseen compositions). This results in training, validation and test sets of 11,955 (63%), 2,100 (11%), and 4,830 (26%) configurations, respectively, where configuration denotes a point defect structure with its associated energy, forces and stresses.
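A composition-wise split keeps every defect of a given host in the same subset, so that the test set only contains unseen compositions. The sketch below illustrates one way this could be done; the greedy assignment, the function name `split_by_composition` and the example records are assumptions for illustration and do not reproduce the actual split (Supplementary Fig. 6).

```python
import random
from collections import defaultdict


def split_by_composition(records, fractions=None, seed=0):
    """Assign whole compositions to train/val/test (illustrative sketch).

    records:   iterable of (composition, defect_name) pairs.
    fractions: target fraction of defects per subset.
    """
    if fractions is None:
        fractions = {"train": 0.68, "val": 0.05, "test": 0.27}
    by_comp = defaultdict(list)
    for comp, defect in records:
        by_comp[comp].append(defect)

    comps = sorted(by_comp)
    random.Random(seed).shuffle(comps)

    total = sum(len(d) for d in by_comp.values())
    targets = {name: frac * total for name, frac in fractions.items()}
    counts = {name: 0 for name in fractions}
    splits = {name: [] for name in fractions}
    for comp in comps:
        # Put the composition in the subset furthest below its target size
        name = max(fractions, key=lambda n: targets[n] - counts[n])
        splits[name].append(comp)
        counts[name] += len(by_comp[comp])
    return splits


# Hypothetical records: (host composition, defect label)
records = [("Sb2S3", "V_Sb"), ("Sb2Se3", "V_Sb"), ("BiSI", "V_Bi"),
           ("Li4SnS4", "V_Li"), ("Li4SnS4", "V_Sn"), ("TlGeS2", "V_Tl"),
           ("TlGeS2", "V_Ge"), ("AgBiSCl2", "V_Bi"), ("RhSe2", "V_Rh")]
splits = split_by_composition(records)
```

Because compositions rather than individual defects are assigned, the realised fractions only approximate the targets, particularly for small datasets.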

To sample the training data, we compared two approaches: (i) a manual method where we sample 10 evenly spaced frames from each relaxation (MS) and (ii) the Dimensionality-Reduced Encoded Clusters approach (DIRECT)79, which aims to select a robust training set from a complex configurational space. Surprisingly, we find that, when using datasets of similar sizes, the MS approach performs better—with the DIRECT approach only outperforming MS when the final DIRECT dataset is larger than the MS one (Supplementary Table 3). This is because the DIRECT approach mainly samples structures from the initial ionic steps (Supplementary Fig. 9), which correspond to high distortions and thus lead to larger errors for the low energy structures (Supplementary Fig. 10).
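The manual sampling (MS) approach amounts to picking evenly spaced ionic steps from each relaxation trajectory. A minimal sketch is given below; the helper name `evenly_spaced_indices` is hypothetical, and including both the first and last ionic step is an assumption, since the text does not specify how the end points are treated.

```python
def evenly_spaced_indices(n_steps, n_frames=10):
    """Indices of n_frames evenly spaced ionic steps (illustrative sketch).

    The first and last steps are always included (an assumption).
    Short trajectories are returned in full.
    """
    if n_frames < 2:
        raise ValueError("n_frames must be at least 2")
    if n_steps <= n_frames:
        return list(range(n_steps))
    step = (n_steps - 1) / (n_frames - 1)
    return [round(i * step) for i in range(n_frames)]


print(evenly_spaced_indices(100))
```

For a trajectory longer than `n_frames`, the spacing between selected steps exceeds one, so the rounded indices are always distinct.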

As a surrogate model, we aim for a method that takes an initial defect structure and outputs the energy and structure of the locally relaxed configuration. Machine-learning force fields are ideal for this task since they can map regions of the potential energy surface (PES) by learning the energies, forces, and, optionally, stresses of a set of training structures. Specifically, we focus on universal graph-based MLFFs, which are trained on relaxation data from diverse databases of bulk crystals80,81,82,83, and thus already incorporate general chemical behaviour. Accordingly, we use a universal MLFF as a base model and fine-tune it with a training set of defect configurations. We have compared different model architectures (M3GNet80, CHGNet81 and MACE84), elemental reference energies, structure featurisation parameters (graph cutoffs, readout layers) and fine-tuning strategies, which are discussed in detail in the Supporting Information (SI) (Supplementary Notes 1.B). In addition, we compared a model trained on just defect structures with one trained on both defect and bulk structures, with the second case improving performance (Supplementary Table 10 and Supplementary Fig. 12). From these benchmarks, the optimal model architecture and parameters were selected: an M3GNet model80 with radial and 3-body cutoffs of 5 Å and 4 Å, respectively, and the weighted atom readout function80,85 (further details in Methods).

Overall, we note that the mean absolute errors for the absolute energies in the validation and test sets are significant (MAEE,test = 31.2 meV atom−1, Table 1), but comparable to those obtained in MLFFs used for bulk structure searching of carbon (MAEE,test = 64.8 meV atom−1)86. However, a more meaningful metric for our purpose is the error in the energy of each defect configuration relative to its ground state structure (MAEE,rel,test = 11.3 meV atom−1). Further, we mostly care about the low-energy region of the potential energy surface, which can be measured by calculating the relative energy errors for configurations less than ≈ 5 eV above the global minimum, resulting in an MAE of 3.6 meV atom−1 (≈ 0.29 eV for an 80-atom supercell).
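The relative-energy metric can be illustrated with a short sketch. The function name `relative_energy_mae` and the choice of reference (the configuration that DFT identifies as the ground state, used for both DFT and MLFF energies) are assumptions made here; the toy energies are invented for illustration only.

```python
def relative_energy_mae(e_dft, e_mlff, n_atoms):
    """Per-atom MAE of energies relative to the DFT ground state (sketch).

    e_dft, e_mlff: total energies (eV) of the same configurations.
    n_atoms:       number of atoms in the supercell.
    """
    # Index of the DFT ground state, used as the common reference
    ref = min(range(len(e_dft)), key=e_dft.__getitem__)
    errors = [
        abs((m - e_mlff[ref]) - (d - e_dft[ref])) / n_atoms
        for d, m in zip(e_dft, e_mlff)
    ]
    return sum(errors) / len(errors)


# Toy energies (eV) for three configurations of one defect
e_dft = [-400.0, -399.2, -398.4]
e_mlff = [-402.0, -401.0, -400.8]  # constant offset plus small errors
mae = relative_energy_mae(e_dft, e_mlff, n_atoms=80)

# Converting a per-atom error to a supercell energy, as in the text
supercell_error_eV = 3.6e-3 * 80  # approximately 0.29 eV
```

A constant offset between the MLFF and DFT energies cancels in this metric, which is why it is lower than the MAE of the absolute energies and more relevant for ranking configurations.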

Beyond these metrics, we calculate the Spearman correlation coefficient (ρ) to measure how well the MLFF and DFT energies are monotonically related (i.e. if greater DFT energies correspond to greater MLFF energies87). While the value of ρ for the test set is significantly lower than those obtained with MLFFs developed for bulk structure searching for a single composition (0.72 versus 0.98–0.99987), this was expected considering that our dataset spans a diverse range of compositions and a wide range of energies. While the errors are high, we note that this does not prevent the model from being used as a qualitative surrogate of the DFT PES for structure searching (i.e. identification of local minima), as previously observed for surface adsorbates88,89.
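As described in Methods, the Spearman coefficient is calculated for each defect independently and then averaged across defects. The sketch below implements this with the standard library only (rank the energies, then take the Pearson correlation of the ranks); the function names and the toy energies are assumptions for illustration.

```python
def average_ranks(values):
    """Ranks starting at 1, with ties given their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_spearman(energies_by_defect):
    """Average of the per-defect Spearman coefficients."""
    rhos = [spearman(dft, mlff) for dft, mlff in energies_by_defect.values()]
    return sum(rhos) / len(rhos)


# Toy (DFT, MLFF) energies in eV for two hypothetical defects
toy = {
    "defect_A": ([0.0, 0.4, 1.1], [0.1, 0.3, 0.9]),  # same ordering
    "defect_B": ([0.0, 0.4, 1.1], [0.9, 0.3, 0.1]),  # reversed ordering
}
rho = mean_spearman(toy)
```

The coefficient is undefined when all energies of a defect are identical (zero variance in the ranks), so such cases would need to be excluded before averaging.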

Model performance

To evaluate the model performance, we apply the trained model to a robust test set, which includes 13 unseen compositions and 32 defects (accounting for 26% and 26.5% of the total number of compositions and defects in our dataset, respectively; Supplementary Fig. 6). For each defect, the MLFF is used to relax the 15 distorted structures generated with ShakeNBreak77 to sample the defect PES. The MLFF-relaxed structures are then compared to identify the different local minima in the MLFF PES using the SOAP fingerprint90 of the defect site. These local minima are then further relaxed with DFT. By comparing the ground state identified from the MLFF+DFT approach and full DFT search, we find the former to correctly identify the DFT ground state for 88% of test defects, while simultaneously reducing the number of DFT calculations required by 73% (Table 2) and accelerating structure searching by a factor of 13 (Supplementary Notes 1.C4). In addition, it identifies a more favourable structure than the ones found in the DFT search for VGe,9 in TlGeS2, with an energy lowering of 0.5 eV (Supplementary Fig. 15). The 12% of failed cases, where the MLFF ground state structure differed from the DFT one, mostly involve complex hosts. For instance, \({V}_{{{{\rm{Sn}}}}}\) in Li4SnS4 has a complex DFT energy surface, which traps most of the relaxations in very high energy basins (Supplementary Fig. 19). PESs of similar complexity are displayed by the iso-structural systems that were included in the training set (Li4GeS4 and Li4TiS4; Supplementary Fig. 19), which biases our training data to the high energy region of the PES for these compositions and thus hinders learning the low-energy region. Accordingly, the training data for these systems can be improved by reducing the magnitude of the distortion used by ShakeNBreak to generate their sampling structures, which would improve model performance. Other defects for which the surrogate model misses the most stable structure are VTl,0 in TlGeS2 and VBi in BiSeI, yet in both cases the DFT and MLFF+DFT structures are very similar and differ by small energy differences (0.1 and 0.2 eV, respectively) (Supplementary Figures 16 and 17). In all failed cases, while the model misses the full DFT ground state, it still correctly predicts a favourable reconstruction that lowers the energy compared to the relaxed ideal configuration.

Beyond identifying the correct ground state in the majority of cases, the model has indirectly learned the correlation between the number of holes and the number of formed dimers. For defects with 1 missing electron, the candidate structures generated by the surrogate model rarely contained anion–anion bonds, while for defects with more missing electrons, the model often identifies at least one local minimum with a dimer.

The decreased performance observed for out-of-sample compositions less similar to the training set posed the question of what performance could be achieved if targeting a family of more related systems. To consider more similar host compositions, we select the chalcohalide systems from our dataset and split them composition-wise into training, validation and test sets as described in Supplementary Notes 1.C5. After training the model on the training set and applying it to the unseen test defects (details in Supplementary Notes 1.C5), we find that the model identifies the correct ground state for all test cases, and achieves lower mean absolute errors compared to the full model. This suggests that higher accuracy can be achieved when targeting more similar host structures, which is likely the case in most high-throughput defect studies.

Our current trained model is limited to neutral cation vacancies in metal sulfides/selenides. However, the approach can be extended to a different compositional space or defect type by first generating a custom training set through first-principles calculations and using it to fine-tune the universal bulk MLFF.

Application to alloys

Beyond high-throughput studies of many single-phase materials, the surrogate model can also be used to accelerate structure searching in alloys or disordered solids, which is computationally challenging due to the high number of local host compositions and inequivalent defects to consider91. The distinct local or site environments of a given defect can significantly affect its properties34,35,92,93,94,95,96,97,98,99,100,101,102,103, altering formation and migration energies by up to 1.5 eV34,35,93,100,101,102,103. Properly sampling various site environments is key to characterise the defect behaviour in such cases.

We consider the case of cadmium vacancies in the CdSexTe(1−x) (x = 0, 0.2, 0.3, 0.5, 0.6, 0.8, 0.9, 1) pseudo-binary alloy. For each composition, a supercell is generated through random substitution of Te sites, and the Cd sites with a unique nearest neighbour chemical environment are considered (e.g. Cd surrounded by 4 Te; by 3 Te and 1 Se; by 2 Se and 2 Te, etc) (Fig. 4a). The configurational landscape of each vacancy is explored with the ShakeNBreak method (14 sampling structures), using the relaxations from the pure compositions as the training and validation data while the mixed systems (0 < x < 1) are reserved as the test set.

Fig. 4: Structures for VCd in the CdSexTe1−x alloys.

a Inequivalent defect environments for two of the CdSexTe1−x alloys (x = 0.5, 0.8). b Examples of the ground state configurations only identified through the finer MLFF+DFT search. These reconstructions are driven by either forming a dimer (e.g. Te–Te bond formation), forming a more favourable anion–anion bond (Te–Te instead of Se–Te; Se–Te instead of Se–Se) or forming the same type of anion bond (Se–Se) but breaking weaker anion–cation bonds between the defect nearest neighbours and the defect next nearest neighbours.

After fine-tuning the surrogate model (MLFF) on the training configurations (details in Methods), it is applied to all alloys to perform the structure searching calculations, allowing a more extensive sampling than for the DFT search (31 sampling structures). From the MLFF-relaxed structures, the unique configurations are selected for further relaxation with DFT and compared with the results from the DFT-only search. This comparison shows that the model successfully identifies the ground state for all defects, even in cases where the defects form Te-Se bonds not seen in the training set – which only included the Te-Te and Se-Se bonds formed by VCd(CdTe) and VCd(CdSe), respectively. Although Te-Se bonds were not present in our defect training set, they were included in the Materials Project (bulk) training set, thus suggesting the benefit of transfer learning for model generalisability.

More significantly, for 70% of the defects, the model identifies a more favourable ground state missed in the coarser DFT search (with a mean energy change of −0.4 eV, Supplementary Notes 1.E). These reconstructions are driven by forming a more favourable anion–anion bond (Fig. 4b) and were missed in the DFT-only search due to the coarser sampling performed. This illustrates the advantage of the faster surrogate model to tackle defects with complex configurational landscapes that require a more exhaustive exploration than a DFT-based search would allow, like alloys, compositionally disordered materials104,105,106,107, and low-symmetry or multinary systems with many degrees of freedom in their PES.


By building a dataset for defect structure searching, we have demonstrated the prevalence of defect reconstructions missed by the standard modelling approach – and thus the need to perform structure searching in high-throughput defect studies. To reduce the associated computational burden, we have developed a surrogate model by fine-tuning a universal machine-learning force field on defect configurations. By qualitatively learning the defect configurational landscapes, the trained model successfully predicts low-energy defect structures for unseen defect environments in unseen compositions, thereby reducing the number of DFT calculations by 73%. While our current model is limited to neutral cation vacancies in metal chalcogenides, the methodology can be applied to different defect types or compositional spaces. In addition, our openly-available dataset could be used to measure the out-of-distribution performance of universal MLFFs108 by testing the ability to extrapolate from learned bulk motifs to defect environments.

Beyond accelerating structure searching in high-throughput studies, this approach is ideal for systems with a complex defect landscape, like alloys, disordered, or low-symmetry materials where their many inequivalent defects make it intractable to explicitly calculate all of them with accurate DFT methods. By using a surrogate model, we can consider a range of alloy compositions and all inequivalent defects, while performing a more exhaustive sampling of the PES — thereby identifying more favourable reconstructions missed in the (coarser) DFT-based search. Beyond (pseudo-)binary alloys, this approach could be extended to model more chemically complex systems, like high-entropy alloys, where the MLFF could be trained on defects of the constituent binary systems and applied to the ternary, quaternary, or high-entropy alloys.

A current limitation of this strategy is the handling of defects in distinct charge states, which have different energy landscapes and structural configurations (e.g. a defect in two different charge states can have a common local structure with different energies). The approach could handle the potential energy landscape for each charge state independently (e.g. training a separate model for defects in the -1 charge state). To consider different charge states simultaneously, the net charge state can be encoded as a graph global attribute109. However, a more descriptive encoding could be achieved by using fourth-generation MLFFs that include atomic charges110. Beyond accounting for the defect charge, another improvement could be MLFFs that are fine-tuned on-the-fly during geometry optimisations. As shown for surface adsorbates88,89, this strategy would accelerate the defect geometry optimisation by skipping many ionic steps that are performed with the surrogate model. Overall, we note the promise of surrogate models to accelerate and increase the accuracy of defect modelling, whether this is by improving structure searching, accounting for metastable configurations111,112, enabling the calculation of defect formation entropies109,112, accelerating defect migration studies113 or going beyond the dilute limit107.


High-throughput vacancies in chalcogenide hosts

The conventional supercell approach for modelling defects in periodic solids was used114. To reduce periodic image interactions, supercell dimensions of at least 10 Å in each direction115 were employed. To explore the configurational landscape of each defect, we used the ShakeNBreak code77, with a distortion increment of 0.1 and the default rattle standard deviation (10% of the nearest neighbour distance in the bulk supercell). This strategy results in 14 sampling structures. In addition to these, to ensure that dimerisation was properly sampled, we also generated a sampling structure where two defect neighbours were pushed towards each other with a separation of 2 Å, resulting in a total of 15 initial configurations. Due to the limitation of universal MLFFs to describe charge, we only considered one charge state of the defects. We chose the neutral state for several reasons. First, it is often stable for cation vacancies in metal chalcogenides. Secondly, it is usually included within the potential charge states to be calculated for a given defect (e.g. generally ranging from the fully ionised state to (at least) the neutral one), and it often has a complex potential energy landscape. However, we note that we did not check whether it was the thermodynamically stable state for each defect.

All reference calculations were performed with Density Functional Theory using the exchange-correlation functional HSE06116 and the projector augmented wave method117, as implemented in the Vienna Ab initio Simulation Package118,119. Calculations for the pristine unit cells were performed using a plane wave energy cutoff of 585 eV and sampling reciprocal space with a Monkhorst-Pack mesh of density 900 k-points/site. The convergence thresholds for the geometry optimisations were set to 10−6 eV and 10−5 eV Å−1 for energy and forces, respectively. Defect relaxations were performed with the Γ-point approximation, which is accurate enough for defect structure searching41, and with a plane wave energy cutoff of 350 eV. The energy and force thresholds for defect relaxations were set to 10−4 eV and 10−2 eV Å−1, respectively. We note that these settings were selected for an efficient exploration of the defect configurational landscape due to the high number of relaxations required for structure searching. In a full defect study aiming for high accuracy, once the ground state configuration is identified with these settings, it should be further relaxed with tighter convergence thresholds and account for spin-orbit coupling when necessary (elements from period five/six and below).

To automate the generation of input files, we designed a workflow using aiida120,121,122, pymatgen123,124,125, pymatgen-analysis-defects126,127, ASE128, doped129 and ShakeNBreak77. This code is available from The datasets and trained models are available from the Zenodo repository with

To generate the training and test sets for the machine learning model, we processed the DFT defect relaxation data by removing unreasonably high-energy configurations (e.g. structures with positive energies), as they decreased model performance. After cleaning the data, 10 evenly-spaced ionic steps were selected from each relaxation. We used the M3GNet model80, as implemented in ref.85, with radial and 3-body cutoffs of 5 Å and 4 Å, respectively, and the weighted atom readout function. The loss function was defined as a combination of the mean squared errors for the energies, forces and stresses, with respective weights of 1, 1 and 0.1 (refs. 80,85). For fine-tuning, the model was initialised with the weights from the trained bulk crystal model85 and then trained on the defect training set (see Supplementary Notes 1.B5 for further details). A batch size of 4 and an exponential learning rate scheduler with an initial rate of 5 × 10−4 were used. The model was trained on a Quadro RTX6000 GPU until the validation errors were converged (30 epochs, 5.3 hours) (Supplementary Fig. 13).
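The weighted loss described above can be written out explicitly. The sketch below is framework-independent and is not the training code of the M3GNet implementation: the function names, the dictionary layout and the toy values are assumptions, and per-atom normalisation of the energies is omitted for brevity.

```python
def mse(pred, ref):
    """Mean squared error between two equal-length sequences."""
    return sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(pred)


def total_loss(pred, ref, weights=(1.0, 1.0, 0.1)):
    """Weighted sum of energy, force and stress MSEs (illustrative sketch).

    pred, ref: dicts with flattened 'energy', 'forces' and 'stress' values.
    weights:   (energy, forces, stress) weights, 1, 1 and 0.1 in the text.
    """
    w_e, w_f, w_s = weights
    return (
        w_e * mse(pred["energy"], ref["energy"])
        + w_f * mse(pred["forces"], ref["forces"])
        + w_s * mse(pred["stress"], ref["stress"])
    )


# Toy values chosen so each term is easy to check by hand
pred = {"energy": [1.0], "forces": [0.0, 0.0], "stress": [2.0]}
ref = {"energy": [0.0], "forces": [1.0, 1.0], "stress": [0.0]}
loss = total_loss(pred, ref)  # 1*1 + 1*1 + 0.1*4
```

The smaller stress weight means that stress errors contribute less to the gradient than energy and force errors of comparable magnitude.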

MLFF geometry optimisation was performed with the FIRE algorithm130, as implemented in the ASE package128, until the mean force was lower than 10−5 eV Å−1 or the number of ionic steps exceeded 1500, which were found to be reasonable thresholds. After relaxing the sampling structures with the model, we identified the different local minima or configurations by calculating the cosine distance between the SOAP descriptors90 of the defect site in each pair of configurations, which was found to be an effective metric for identifying different defect motifs. We note that using the SOAP fingerprint of the defect site was more robust than considering the energies or the root mean squared displacement between the structures. The first case can miss local minima if these have similar energies in the MLFF PES, while the second was more sensitive to structural differences far from the defect site. The parameters used to generate the SOAP descriptor were: r = 5 Å, nmax = 10, lmax = 10, σ = 1.0 Å, for the local cutoff, number of radial basis functions, maximum degree of spherical harmonics, and the standard deviation of the Gaussian functions used to expand the atomic density, respectively. To evaluate the correlation between DFT and MLFF energies, the Spearman coefficient was calculated for each defect independently, and then averaged across defects.
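The selection of distinct local minima from descriptor vectors can be sketched as follows. This assumes the SOAP vectors of the defect site have already been computed and are passed in as plain lists of floats; the greedy lowest-energy-first selection and the distance threshold of 0.01 are hypothetical choices for illustration, since the text does not state the threshold or the grouping procedure.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two descriptor vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def unique_configurations(descriptors, energies, threshold=0.01):
    """Indices of structurally distinct configurations (illustrative sketch).

    Configurations are visited from lowest to highest energy; one is kept
    only if its descriptor differs from every kept one by more than
    `threshold` (a hypothetical value).
    """
    order = sorted(range(len(descriptors)), key=energies.__getitem__)
    kept = []
    for i in order:
        if all(
            cosine_distance(descriptors[i], descriptors[j]) > threshold
            for j in kept
        ):
            kept.append(i)
    return kept


# Toy two-dimensional "descriptors": the first two are nearly identical
descriptors = [[1.0, 0.0], [0.9999, 0.001], [0.0, 1.0]]
energies = [0.0, 0.1, 0.5]
kept = unique_configurations(descriptors, energies)
```

Because only the descriptor of the defect site enters the comparison, two configurations with similar MLFF energies but different local motifs are still kept as separate candidates for DFT relaxation.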

Application to the CdSexTe(1−x) alloy

To generate the supercells for the mixed compositions in CdSexTe(1−x) (x = 0.2, 0.3, 0.5, 0.6, 0.8, 0.9), we used random substitution of Te sites with Se. For each supercell, we consider the Cd sites with a unique nearest neighbour chemical environment as vacancy sites (e.g. Cd surrounded by 4 Te; by 3 Te and 1 Se; by 2 Se and 2 Te etc), and generate the vacancy high-symmetry structures with pymatgen123,124,125. For the DFT-based exploration of the PES, we apply ShakeNBreak with default parameters, generating 14 sampling structures, which were relaxed with DFT as previously described.

To generate the dataset, we again processed the defect relaxation data by removing unreasonably high-energy configurations (>15 eV above the defect ground state configuration). As the training set, we used a combination of defect and bulk configurations: 45 evenly-spaced frames from the 14 relaxations of VCd in CdTe and CdSe, and 20 frames from the relaxations of each pristine system, resulting in a total of 1420 frames. For validation, we selected 5 unseen configurations from the 14 relaxations of VCd in CdTe and CdSe (total of 140 frames). The M3GNet surrogate model was trained with similar parameters as previously described and until the errors were converged (80 epochs, see Supplementary Fig. 26).

To perform a finer exploration of the PES with the surrogate model, we applied a set of bond distortions generated by ShakeNBreak (−0.6, −0.5, −0.4, −0.3, −0.2) to all unique pair combinations of nearest neighbours (e.g. for a VCd surrounded by two Te and two Se anions, Te(1), Te(2), Se(1) and Se(2), we considered the pairs Te(1)-Te(2), Te(1)-Se(1), Te(1)-Se(2), Te(2)-Se(1), Te(2)-Se(2), and Se(1)-Se(2)). By default, for a defect with two missing electrons like \({{V}_{{{{\rm{Cd}}}}}}^{0}\), ShakeNBreak only applies the bond distortions to the two atoms closest to the defect. This is typically a reliable approach for most pure systems, but can miss reconstructions for alloys with complex defect environments (e.g. VCd surrounded by a mix of Te and Se anions). The model application and analysis were performed as described in the previous section.
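The finer sampling can be enumerated directly from the list of nearest neighbours. The sketch below only generates the (pair, distortion) combinations and does not build any structures; the function name `pairwise_distortions` is hypothetical.

```python
from itertools import combinations


def pairwise_distortions(neighbours, distortions=(-0.6, -0.5, -0.4, -0.3, -0.2)):
    """All (atom_a, atom_b, distortion) combinations (illustrative sketch).

    neighbours:  labels of the nearest neighbours of the vacancy.
    distortions: bond distortion factors applied to each pair.
    """
    return [
        (a, b, d)
        for a, b in combinations(neighbours, 2)
        for d in distortions
    ]


# V_Cd surrounded by two Te and two Se anions, as in the example above
neighbours = ["Te(1)", "Te(2)", "Se(1)", "Se(2)"]
trials = pairwise_distortions(neighbours)
print(len(trials))  # 6 pairs x 5 distortions
```

For four nearest neighbours this gives 6 pairs and 30 distorted trial structures per vacancy site. How these are combined with any additional (for example, undistorted) structure to reach the total number of sampling structures quoted in the text is not specified here.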