Introduction

Ab-initio calculations have become the method of choice for the atomistic description of many materials1. However, relevant physical properties of a crystalline material often depend not only on its pristine structure but on various lattice defects2,3,4,5,6. While advances in sample preparation for low-dimensional materials have concurrently improved control over the occurrence of such defects, their influence is still significant for many investigations. Conversely, tailoring of material properties by defect engineering7, chemical doping or functionalization inherently introduces point defects into the material8. An accurate description of how these defects modify the electronic structure of a system is thus key for exploring potential applications of novel materials.

While density-functional theory (DFT) typically yields a high-level description for moderately sized systems, simulating realistic devices used for measurements involves system sizes beyond the realm of these methods. Tight-binding (TB) models offer a quantum mechanical description for coherent electronic structure simulations with a scalability far better suited to experimental length scales9. For bulk systems, empirical tight-binding parameters10,11 can be fit against ab-initio (e.g., DFT) or measured band structures (BS). More rigorous approaches aim for directly calculating effective tight-binding Hamiltonians by projecting DFT orbitals onto a suitably spatially localized basis12,13,14, as implemented by the associated program package PAOFLOW15,16. These approaches work very well for automatically finding accurate bulk tight-binding descriptions of occupied orbitals that are well described by a suitable, known localized basis. By contrast, iterative methods such as maximally localized Wannier functions (e.g., wannier90, refs. 17,18,19) try to determine an optimal localized basis, which often requires cumbersome convergence procedures. Once converged, they typically yield TB models of the highest quality20,21,22,23,24. A recent machine-learning approach for high-throughput investigation also produced accurate bulk TB parametrizations25 for pristine materials.

Computing accurate TB parameters for defect structures presents a challenge to established parametrization approaches. Simulating a defect requires large supercells to prevent artifacts from interacting with periodic images and to ensure accurate geometry relaxation at the edge of the supercell. In addition, breaking translational as well as (some) point group symmetries at the defect site vastly increases the number of independent TB parameters. Empirical bulk parametrizations lack the flexibility to describe defect systems with different local environments (e.g., different coordination numbers) than the pristine system26, while Wannier projections become increasingly difficult to converge for larger cells. The resulting TB Hamiltonians also lack sparsity, and typically include finite long-distance interactions beyond even 5th nearest neighbors22. Since the efficiency of TB models partly stems from operating with sparse matrices, there is motivation to find sparse TB representations with minimal loss of representability. Simply truncating long-range interactions generally produces a significant loss of accuracy22. A quantitative description of defects, which is key to understanding their influence on the electronic structure, thus seems out of reach using established parameterization techniques.

In recent years, machine learning (ML) has facilitated new research lines in materials science and chemistry27,28,29,30,31,32,33,34,35,36,37,38. Here, we apply ML methods to generate TB parametrizations for defect structures in novel materials. We aim for an ML based scheme that achieves Wannier TB accuracy, while being automated and thus easy to use. Ideally, we want to be able to tune the sparseness of our machine-learned TB parameters at will to obtain a desired balance of accuracy and efficiency. To remain accurate despite fewer tuning parameters implied by improved sparsity, we will adjust the parametrization to specific energy regions of interest (i.e., close to the Fermi edge). We benchmark several test cases to demonstrate the accuracy of our approach, its efficiency and the effect of sparseness on accuracy and speed.

For simplicity and to focus on the ML technique, the graphene benchmark system we consider features a comparatively simple orbital structure, with only the pz orbitals contributing close to the Fermi edge. To account for coupling between orbitals of different angular momenta or different atomic species, it is straightforward to extend our scheme to distance-dependent, element-specific Slater–Koster parameters. All directional and orbital contributions are then accounted for by the well-known Slater–Koster formulas10, while the distance-dependence still allows for the flexibility to accurately describe defect structures. We showcase this extension for a Se-divacancy in WSe2 in Supplementary Note 6.

This paper is structured as follows: we introduce a model for mapping the desired TB Hamiltonian matrix to a vector of parameters that is compatible with ML algorithms. After establishing the necessary approximations and retrieval of DFT input data, we compare the efficiency and accuracy of different ML techniques for calculating TB parameters. We find multi-layer perceptrons (MLPs), i.e., neural networks, to be optimal for the task at hand. We present a detailed workflow of MLPs used for determining an optimal set of TB parameters for a given atomic structure. We explicitly generate parametrizations for two common defects (see insets in Table 2 and the methods section “Methods” for calculation details) in single layer graphene (SLG). The final section of this work focuses on validating and testing our machine-learned parametrizations. We consider the influence of defects on the local density of states, electronic transport as well as the level spectrum of a smoothly confined graphene quantum dot (GQD).

Results

TB model

The TB approximation projects the Schrödinger equation for electrons, a partial differential equation, onto a basis of tightly bound (i.e., well localized) orbitals \(\left|i\right\rangle\) at site i, yielding an algebraic equation. A system with \({n}_{o}\) orbitals can then be described by a TB Hamiltonian

$${{{\mathcal{H}}}}=\mathop{\sum }\limits_{i}^{{n}_{o}}{s}_{i}{\hat{c}}_{i}^{{\dagger} }{\hat{c}}_{i}+\mathop{\sum}\limits_{\langle i,j\rangle }{\gamma }_{ij}{\hat{c}}_{i}^{{\dagger} }{\hat{c}}_{j}.$$
(1)

\({\hat{c}}_{i}^{{\dagger} }({\hat{c}}_{i})\) are the creation (annihilation) operators of a quasiparticle at site i with position ri, \({s}_{i}=\left\langle i\right|{{{\mathcal{H}}}}\left|i\right\rangle\) the onsite (diagonal) matrix elements and \({\gamma }_{ij}=\left\langle i\right|{{{\mathcal{H}}}}\left|j\right\rangle\) the hopping amplitudes between sites i and j. For sufficiently localized orbitals, the magnitude of γij quickly decays for increasing distance \(\left|{{{{\bf{r}}}}}_{i}-{{{{\bf{r}}}}}_{j}\right|\) between orbitals. Omitting such elements below a certain threshold (e.g., 1 meV) makes \({{{\mathcal{H}}}}\) sparse.

Starting from a full DFT Hamiltonian, optimal values for si, γij can be directly and exactly calculated using maximally localized Wannier functions17,18,19,39. In practice, however, the final degree of localization — i.e., the distance beyond which overlaps between orbitals are smaller than the defined threshold — may be several unit cells22. To obtain a more sparse description, one can directly fit a small set of TB parameters si, γij to reproduce the DFT BS in an energy region of interest. The second sum in Eq. (1) then only runs over the n-th nearest-neighbor (NN) sites (Fig. 1a), where \(\left|{{{{\bf{r}}}}}_{i}-{{{{\bf{r}}}}}_{j}\right| \,<\, {r}_{{{{\rm{NN}}}}}\) with a cutoff radius rNN controlling the sparseness.

Fig. 1: Tight-binding model of graphene lattice.

a Nearest-neighbor interactions in a hexagonal graphene lattice. The central gray atom has three first-nearest neighbors (1NN, dark blue), six second-nearest neighbors (2NN, green), etc. b Interaction of a defect supercell with its periodic images (defect region highlighted in orange). The center cell itself is described by \({{{{\mathcal{H}}}}}^{(0,0)}\), the interactions with its neighboring cells by \({{{{\mathcal{H}}}}}^{({\lambda }_{x},{\lambda }_{y})}\). For large supercells, only the nearest-neighbor interactions between cells, i.e., \({\lambda }_{x},{\lambda }_{y}\in \{-1,0,1\}\), are non-zero.

Without loss of generality, we restrict our analysis to two-dimensional systems. We account for the Bloch phase of the periodic wave function by adding corresponding phase factors in the periodic images of the Hamiltonian. The periodic Hamiltonian matrices \({{{{\mathcal{H}}}}}^{({\lambda }_{x},{\lambda }_{y})}\) determine the interaction of sites in the original cell (0, 0) with sites in the periodic image of the cell (λx, λy) translated along a linear combination of lattice vectors {λxRx, λyRy}. The entire Hamiltonian then reads

$${{{\mathcal{H}}}}({{{\bf{k}}}})=\mathop{\sum}\limits_{{\lambda }_{x},{\lambda }_{y}}{e}^{{{{\rm{i}}}}{{{\bf{k}}}}\cdot ({\lambda }_{x}{{{{\bf{R}}}}}_{{{{\bf{x}}}}}+{\lambda }_{y}{{{{\bf{R}}}}}_{{{{\bf{y}}}}})}{{{{\mathcal{H}}}}}^{({\lambda }_{x},{\lambda }_{y})}.$$
(2)

Note that the set of si, γij entirely determines the matrix elements of \({{{{\mathcal{H}}}}}^{({\lambda }_{x},{\lambda }_{y})}\) while the grouping into periodic cells just accounts for the periodicity of the lattice. A system of interest is thus fully described by a set of lattice vectors and parameters si, γij yielding the Hamiltonian matrices \(\{{{{{\mathcal{H}}}}}^{({\lambda }_{x},{\lambda }_{y})}\}\). The indices \({\lambda }_{x},{\lambda }_{y}\in [-m,m]\) with \(m\in {{\mathbb{N}}}_{0}\) determine the range of non-zero interactions between periodically shifted unit cells. In practice, we truncate at m = 1 given the large defect supercells in this work (see Fig. 1b).
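As an illustration of Eq. (2), the Bloch sum over periodic images can be assembled numerically as in the following minimal sketch. The function name `bloch_hamiltonian`, the dictionary layout of the blocks, and the toy two-orbital cell with made-up hopping `t` are assumptions for this example and are not taken from the original implementation.

```python
import numpy as np

def bloch_hamiltonian(k, blocks, R_x, R_y):
    """Assemble H(k) from real-space blocks H^(lx, ly), following Eq. (2).

    blocks maps (lx, ly) -> (n_o x n_o) array; this layout is an assumption.
    """
    R_x, R_y = np.asarray(R_x, float), np.asarray(R_y, float)
    n_o = next(iter(blocks.values())).shape[0]
    H_k = np.zeros((n_o, n_o), dtype=complex)
    for (lx, ly), H_block in blocks.items():
        phase = np.exp(1j * np.dot(k, lx * R_x + ly * R_y))
        H_k += phase * H_block
    return H_k

# Toy example (hypothetical values): two orbitals per cell, coupled along x only.
t = -1.0
blocks = {
    (0, 0): np.array([[0.0, t], [t, 0.0]]),
    (1, 0): np.array([[0.0, 0.0], [t, 0.0]]),
    (-1, 0): np.array([[0.0, t], [0.0, 0.0]]),  # Hermitian conjugate of the (1, 0) block
}
H_k = bloch_hamiltonian(np.array([0.3, 0.0]), blocks, R_x=[1.0, 0.0], R_y=[0.0, 1.0])
is_hermitian = np.allclose(H_k, H_k.conj().T)
```

Because the (−1, 0) block is the Hermitian conjugate of the (1, 0) block, the assembled H(k) is Hermitian for every k, which is the property the parameter count in the next paragraph relies on.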

Our objective is to use our TB Hamiltonian for transport calculations of SLG in realistic device settings, i.e., SLG including defects. We can therefore restrict the TB Hamiltonian to the carbon pz orbitals, which determine the electronic structure of SLG close to the Fermi energy (see Supplementary Information for details).

Having reduced the TB Hamiltonian to only the pz orbitals of carbon, we now consider a further reduction of the number of free parameters for the TB Hamiltonian. If we were to only enforce hermiticity, our TB Hamiltonian of Eq. (2) would feature \(\frac{{n}_{{{{\rm{o}}}}}({n}_{{{{\rm{o}}}}}+1)}{2}+4{n}_{{{{\rm{o}}}}}^{2}\) independent parameters si, γij, which quickly gets out of hand. For a medium-sized defect supercell with 70 orbitals, this would require more than 20,000 independent parameters. We can, however, employ the residual symmetries of a defect structure to further reduce the number of parameters our ML model needs to optimize. To obtain a robust framework, we aim for a simple mapping between the hopping matrix elements γij and local geometry information.
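The parameter count quoted above can be checked with a few lines of arithmetic. The reading of the two terms given in the comments (upper triangle of the central block, plus four neighbor blocks that are not fixed by Hermitian conjugation at m = 1, with each matrix element counted as one parameter) is an interpretation of the formula rather than a statement from the text; the function name is hypothetical.

```python
def n_free_parameters(n_o: int) -> int:
    """Evaluate n_o (n_o + 1) / 2 + 4 n_o^2 for a cell with n_o orbitals."""
    central = n_o * (n_o + 1) // 2  # upper triangle of H^(0,0), diagonal included
    neighbors = 4 * n_o**2          # four neighbor blocks; the other four follow by conjugation
    return central + neighbors

count_70 = n_free_parameters(70)    # 2485 + 19600 = 22085
```

For 70 orbitals the formula gives 22,085 parameters, the same order of magnitude as the estimate in the text, and the quadratic growth with n_o is what makes a symmetry-aware reduction necessary.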

Finding such a simple mapping seems daunting as coordination numbers of atoms around the defect site will in general differ substantially from those in the bulk. A general mapping therefore seems to require detailed information about the local chemical environment. We avoid additional, complex geometrical parameters by exploiting that for the pristine bulk lattice, there are only a few distances (the nearest-neighbor spacings, Fig. 4a) while a relaxed defect geometry features many different distances. We generate the γij purely as a function of distance, \({\gamma }_{ij}=\gamma (\left|{{{{\bf{r}}}}}_{i}-{{{{\bf{r}}}}}_{j}\right|)\), to obtain an efficient and compact representation of the final TB Hamiltonian. A sufficiently fine, discontinuous mapping between atomic distance and hopping parameters essentially implies assigning an individual hopping parameter to each unique distance, except for degeneracies implied by symmetries, which should indeed have the same hopping interaction. A parametrization on distance alone thus yields a hermitian Hamiltonian correctly accounting for symmetries by construction. We can also simply choose a cutoff length rNN above which no orbitals share a finite hopping value, to obtain a sparser description. We discretize the interval [0, rNN] into nc equidistant bins l with \(l\in [1,{n}_{c}]\) using

$${\gamma }_{ij}=\gamma (\left|{{{{\bf{r}}}}}_{i}-{{{{\bf{r}}}}}_{j}\right|)={\gamma }_{l},\quad l={{{\rm{ceil}}}}\left(\frac{\left|{{{{\bf{r}}}}}_{i}-{{{{\bf{r}}}}}_{j}\right|}{{{\Delta }}r}\right),$$
(3)

with Δr = rNN/nc the discretization step, and ceil(x) the ceiling function picking the smallest integer l with l ≥ x.
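The binning rule of Eq. (3) can be sketched as follows. The function name `bin_index`, the choice to return `None` for onsite terms and for distances beyond the cutoff, and the example distances are assumptions made for illustration only.

```python
import math

def bin_index(distance, r_nn, n_c):
    """Return l = ceil(distance / dr) from Eq. (3), or None outside (0, r_nn].

    Distance zero (onsite terms) is handled separately and also returns None here.
    """
    dr = r_nn / n_c
    if distance <= 0.0 or distance > r_nn:
        return None
    return min(math.ceil(distance / dr), n_c)  # guard against round-off at the cutoff

# Two slightly different (made-up) bond lengths in Angstrom.
coarse = (bin_index(1.423, 5.0, 50), bin_index(1.447, 5.0, 50))    # same bin
fine = (bin_index(1.423, 5.0, 500), bin_index(1.447, 5.0, 500))    # separate bins
```

With a coarse grid the two bond lengths share one hopping parameter, while a finer grid separates them; pairs with exactly equal distances always land in the same bin, which is how symmetry-related hoppings stay tied together.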

We append a minimal set of onsite terms {si} (accounting for symmetries) to the set of hopping values {γl} with \(l\in [1,{n}_{c}]\) to obtain a full TB parameterization, denoted for brevity as {γl}. We can then establish a bijective mapping from this list of interactions to full Hamiltonian matrices and vice versa. rNN provides a tunable parameter for the desired sparseness of our TB model (i.e., the maximum distance over which two orbitals interact).

The number of bins nc controls the coarseness of the discretization and can be adapted depending on the distribution of inter-orbital distances in a given structure. As long as the discretization Δr is fine enough, we only establish a convenient way of simultaneously addressing all symmetry-related interactions. For the two SLG defects we choose as benchmark systems, we decrease Δr until the number of different γl no longer increases (i.e., each value γl only addresses the hopping terms connected by symmetry, Δr ≈ \(10^{-4}\) Å). At first glance, this prescription for grouping and setting the relevant interaction elements in a TB Hamiltonian seems quite similar to introducing an exponential dependence on distance in Slater–Koster parametrizations10,11,40. However, the discrete distance-hopping map only decouples symmetries and hermiticity from the parameter search and introduces little to no unnecessary simplification; in particular, it does not enforce a specific functional dependence on the distance. We do not need to consider the local geometric configuration (screening) of interacting orbital pairs as long as the discretization is fine enough to distinguish all different hoppings not related by symmetry. Indeed, we do not aim for a smooth mapping γ(rij), but rather for a distinct hopping parameter for all different couplings. Consequently, two neighboring values γl and γl+1 can in principle take entirely different values.

From TB parameters {γl} one can easily calculate a TB BS by diagonalizing the k-space Hamiltonian of Eq. (2) to obtain band energies \({\epsilon }_{b,{{{\bf{k}}}}}^{{{{\rm{TB}}}}}\) and eigenfunctions \(\left|{\psi }_{b,{{{\bf{k}}}}}\right\rangle\) via the eigenvalue problem:

$${{{\mathcal{H}}}}({{{\bf{k}}}})\left|{\psi }_{b,{{{\bf{k}}}}}\right\rangle ={\epsilon }_{b,{{{\bf{k}}}}}\left|{\psi }_{b,{{{\bf{k}}}}}\right\rangle$$
(4)

The full set of TB parameters thus straightforwardly yields a BS at minimal numerical cost (\(\{{\gamma }_{l}\}\to {{{\mathcal{H}}}}\to {\epsilon }_{b,{{{\bf{k}}}}}^{{{{\rm{TB}}}}}[{\gamma }_{l}]\)).
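The forward problem of Eq. (4) can be illustrated on pristine graphene with a single nearest-neighbor hopping, for which the result is known analytically. This is a minimal sketch: the hopping value of −2.7 eV, the choice of lattice vectors, and all function names are illustrative assumptions, not parameters or code from this work.

```python
import numpy as np

A_LAT = 2.46  # graphene lattice constant in Angstrom
R_X = A_LAT * np.array([1.0, 0.0])
R_Y = A_LAT * np.array([0.5, np.sqrt(3.0) / 2.0])

def h_k_graphene(k, s=0.0, gamma_1=-2.7):
    """2x2 Bloch Hamiltonian with onsite energy s and 1NN hopping gamma_1 (illustrative)."""
    f = 1.0 + np.exp(-1j * np.dot(k, R_X)) + np.exp(-1j * np.dot(k, R_Y))
    return np.array([[s, gamma_1 * f], [gamma_1 * np.conj(f), s]])

def band_energies(k_points, **params):
    """Forward problem: diagonalize H(k) at each k-point, as in Eq. (4)."""
    return np.array([np.linalg.eigvalsh(h_k_graphene(k, **params)) for k in k_points])

gamma_point = np.array([0.0, 0.0])
# Dirac point for this choice of lattice vectors.
dirac_point = (2.0 * np.pi / A_LAT) * np.array([1.0 / 3.0, 1.0 / np.sqrt(3.0)])
bands = band_energies([gamma_point, dirac_point])
```

At the Γ point the two bands sit at ±3|γ1|, and at the Dirac point they touch at the onsite energy, as expected for this textbook model. For a defect supercell the same diagonalization is applied to the much larger matrix built from the full parameter list.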

Inverse band structure problem

Obtaining a BS from Eqs. (2) and (4) for a given Hamiltonian \({{{{\mathcal{H}}}}}^{({\lambda }_{x},{\lambda }_{y})}\) is straightforward. However, to find the optimal Hamiltonian that best reproduces a given DFT BS \(\{{\epsilon }_{b,{{{\bf{k}}}}}^{{{{\rm{DFT}}}}}\}\) we need to solve the inverse problem (\(\{{\epsilon }_{b,{{{\bf{k}}}}}\}\to {{{\mathcal{H}}}}\), Fig. 2). There is no straightforward (or unique) solution to this problem as highlighted by the plethora of TB parametrizations for any given material. Since \(\{{\epsilon }_{b,{{{\bf{k}}}}}^{{{{\rm{TB}}}}}\}\) can be quickly evaluated, generating pairs of (arbitrary) sets {γl, si} and the resulting BS \(\{{\epsilon }_{b,{{{\bf{k}}}}}^{{{{\rm{TB}}}}}\}\) on the TB level is easy. We can then use ML algorithms to identify the set of TB parameters which produces a TB BS in closest agreement with DFT.

Fig. 2: Schematic of the inverse BS problem.

For a given Hamiltonian, calculating a BS is trivial. By contrast, there is no constructive algorithm to obtain a Hamiltonian from a BS. Using ML, we aim to find such an inverse mapping from BS data (scalar energy values ϵb,k for each band b and k-point k) to a minimal list of TB parameters {γl, si} (for each distance and onsite class l) which describes a full TB Hamiltonian \({{{\mathcal{H}}}}\).

To select a ML algorithm suitable for the inverse problem, we need to quantitatively compare different approaches. We grade several ML approaches both in terms of computational efficiency (how quickly do we arrive at an answer) as well as quality. To obtain a quantitative criterion for the quality of a parametrization we evaluate the difference of the final converged result to the DFT BS \(\{{\epsilon }_{b,{{{\bf{k}}}}}^{{{{\rm{DFT}}}}}\}\),

$${\delta }_{\epsilon }[\{{\epsilon }_{b,{{{\bf{k}}}}}\}]=\mathop{\sum }\limits_{j}^{{n}_{{{{\rm{k}}}}}}\mathop{\sum }\limits_{b}^{{n}_{{{{\rm{b}}}}}}{\left({\epsilon }_{b,{{{{\bf{k}}}}}_{j}}-{\epsilon }_{b,{{{{\bf{k}}}}}_{j}}^{{{{\rm{DFT}}}}}\right)}^{2}.$$
(5)

To tackle such a relatively high-dimensional, non-uniquely solvable inversion problem, we test four candidate methods: variations of gradientless descent41,42 (GLD), multilayer perceptrons (MLPs), convolutional neural networks (CNNs), and Bayesian optimization via Gaussian process regression (GPR43). We include gradientless descent, conceptually the simplest of these, as a reference method to assess the benefit of more intricate approaches. All our ML methods produce reasonable parameter sets as exemplified by the small errors (δϵ) in Table 1. Comparing also the time required to obtain a parametrization, we observe considerable differences between the approaches and therefore selected only the MLP for our final benchmarks. Below we briefly introduce each approach and discuss its pros and cons.

Table 1 ML comparison: comparing performance [in terms of BS error δε, see Eq. (5)] and time efficiency of several ML approaches to the inverse BS problem of the double vacancy in SLG.

a. Bayesian Optimization trains a Gaussian process that maps input TB parameters {γl} to the BS mismatch δϵ. An acquisition function (see Supplementary Information for details) tailored to minimize δϵ then decides which new ({γl}, \(\{{\epsilon }_{b,{{{\bf{k}}}}}^{{{{\rm{TB}}}}}\}\))-pair is added to the data set. Such an active learning strategy results in compact datasets. However, given the low computational cost of generating ({γl}, \(\{{\epsilon }_{b,{{{\bf{k}}}}}^{{{{\rm{TB}}}}}\}\))-pairs, the Bayesian optimization is dominated by the high cost of GPR training (Table 1). In the high-dimensional search space, the time saved by avoiding unnecessary evaluations of the forward problem (i.e., TB → δϵ mapping) is smaller than the additional time needed to fit the Gaussian process.

b. Gradientless Descent is a zeroth-order, model-free optimization technique41,42 that does not rely on an underlying gradient estimate (such an estimate can become expensive to obtain in high-dimensional spaces). It solves the inverse problem by repeated application of the forward problem. Despite reasonable δϵ, the extracted parametrizations seem to perform less convincingly for derived quantities (see Supplementary Information).

c. Multilayer Perceptrons are shallow feed-forward neural networks. In previous work, some of us have shown that neural networks can accurately predict spectra from the atomic positions alone34. Here we demonstrate that multilayer perceptrons (MLPs) can also solve the inverse problem directly by mapping band structures onto TB parameters. We add regularization via dropout layers and train them on (\(\{{\epsilon }_{b,{{{\bf{k}}}}}^{{{{\rm{TB}}}}}\},\{{\gamma }_{l}\}\))-pairs. We then make a final TB parameter prediction for \(\{{\epsilon }_{b,{{{\bf{k}}}}}^{{{{\rm{DFT}}}}}\}\). In our investigations, MLPs outperform all alternative approaches in accuracy at approximately equal or even lower computational cost (Table 1). We attribute this to the strong interdependence between the different TB parameters: an almost identical BS can be described by several different parameter sets, while changing only a single parameter (with the others fixed) will substantially change the BS. Such a structure is better represented by the fully connected network as opposed to model-free optimization schemes optimizing the different parameters individually.

d. Convolutional Neural Networks are reasonably deep, sparsely connected neural networks that are designed for automatic feature extraction from the input BS. CNNs excel at exploiting correlations in their input data (e.g., the continuous lines forming a BS). Despite a reduction of trainable parameters compared to MLPs, the convolutional setups we benchmarked resulted in significantly longer training times and slightly worse BS losses.
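To make the reference method of item b concrete, the following simplified sketch minimizes the band-structure error of Eq. (5) by accepting random steps only when they lower the error. It is a stripped-down variant for illustration, not the implementation benchmarked in Table 1; the two-band toy forward problem, the step radii, and all names are assumptions.

```python
import numpy as np

def band_error(eps, eps_ref):
    """Band-structure mismatch of Eq. (5): sum of squared differences."""
    return float(np.sum((np.asarray(eps) - np.asarray(eps_ref)) ** 2))

def toy_bands(params, c_k):
    """Toy forward problem (an assumption): two bands s -/+ gamma * c_k."""
    s, gamma = params
    return np.stack([s - gamma * c_k, s + gamma * c_k])

def gradientless_descent(objective, x0, radii, n_steps, rng):
    """Try random steps of several radii; keep a step only if it lowers the objective."""
    x, best = np.asarray(x0, float), objective(x0)
    for _ in range(n_steps):
        for radius in radii:
            direction = rng.standard_normal(x.size)
            candidate = x + radius * direction / np.linalg.norm(direction)
            value = objective(candidate)
            if value < best:
                x, best = candidate, value
    return x, best

rng = np.random.default_rng(0)
c_k = np.linspace(0.0, 3.0, 30)
eps_ref = toy_bands((0.2, -2.7), c_k)  # stands in for the DFT reference
objective = lambda p: band_error(toy_bands(p, c_k), eps_ref)
x0 = np.array([0.0, -2.0])
initial_error = objective(x0)
x_fit, final_error = gradientless_descent(objective, x0, [0.5, 0.1, 0.02], 200, rng)
```

Each iteration costs only forward evaluations, which is why the method is cheap per step, but every parameter direction is explored blindly, in line with the weaker performance on derived quantities noted above.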

We provide more details on all candidate methods in Supplementary Note 4, and focus on our final, most efficient algorithm below.

In the following we provide a step-by-step guide to our ML approach shown graphically in Fig. 3.

Fig. 3: Schematic flow chart of ML algorithm.

Produces a machine-learned TB parametrization of a defect system. The hued section on the right can be replaced with an initial γ(SK) from Slater–Koster theory for materials with challenging bulk cells. Cornered green nodes represent calculation processes and rounded blue nodes represent data.

Data set generation

Before we can query our MLP to predict hopping parameters for the DFT BS of a defect system we need to procure appropriate training data in the form of BSs and their corresponding parameter lists. We do so entirely on the TB level, i.e., without requiring any DFT input, by randomly sampling the vicinity of a reasonable initial guess in TB parameter space. We first determine an initial distance-hopping map \(\gamma (\left|{{{{\bf{r}}}}}_{i}-{{{{\bf{r}}}}}_{j}\right|)\) (used to create {γl, si} following Eq. (3)) based on the TB parameters of the pristine material, to obtain an initial TB Hamiltonian \({H}_{{{{\rm{TB}}}}}^{(0)}\). We assume that some reasonable parametrization of the pristine material exists, since it is far simpler to extract a 10th-NN TB description for the bulk material than for a defect structure. For materials where even the bulk cell proves challenging to wannierize, one could resort to empirical or recent machine-learning approaches25 for the initial parameter set. We initialize the distance-hopping map γ(0)(δr) as a piece-wise linear interpolation between the ten distance-hopping pairs extracted for the bulk material (see blue line and red markers in Fig. 4). We have validated this interpolated initialization for several defects in graphene and found that even such a (physically unmotivated) prescription for a TB parametrization outperforms a common Slater–Koster parametrization of graphene (see Table 1 and dashed green line in Fig. 4).
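The piece-wise linear initialization described above can be sketched with a standard interpolation routine. The three distance-hopping pairs below are placeholders chosen for readability, not the ten bulk values used in this work, and the function name and cutoff value are likewise hypothetical.

```python
import numpy as np

# Placeholder bulk distance-hopping pairs (Angstrom, eV); NOT values from the paper.
bulk_distances = np.array([1.42, 2.46, 2.84])
bulk_hoppings = np.array([-2.8, 0.1, -0.2])

def gamma_initial(distance, r_nn):
    """Piece-wise linear initial map gamma^(0), set to zero beyond the cutoff r_nn."""
    distance = np.asarray(distance, float)
    values = np.interp(distance, bulk_distances, bulk_hoppings)
    return np.where(distance <= r_nn, values, 0.0)

gamma_mid = gamma_initial(1.94, r_nn=3.0)  # halfway between the first two pairs
gamma_far = gamma_initial(3.5, r_nn=3.0)   # beyond the cutoff
```

Any relaxed bond length in the defect cell then receives a starting value between the neighboring bulk values, which is only meant as a seed for the random sampling that follows.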

Fig. 4: Distance-hopping map of interaction elements.

We plot all entries of a Wannier parametrization for the double vacancy defect in SLG (black dots) together with a Slater–Koster based initialization γ(SK)(δr) (dashed green line, taken from ref. 40) and another initialization γ(0)(δr), defined as piece-wise linear interpolation (solid blue line) between the 10th-NN distance-hopping values (red crosses) of a bulk single-layer graphene cell. The hoppings at distance zero represent the onsite energies si.

Having obtained a starting guess for \({H}_{{{{\rm{TB}}}}}^{(0)}\), we calculate the training dataset by solving the forward problem (\({{{\mathcal{H}}}}\to {\epsilon }_{b,{{{\bf{k}}}}}\)) many times with random fluctuations added to \({H}_{{{{\rm{TB}}}}}^{(0)}\) (see Supplementary Information for details). We generate samples until further increase of the dataset size no longer reduces the BS error. We add relative and absolute noise to randomly selected parameters, carefully choosing noise amplitudes to sufficiently explore the relevant search space (for details see Supplementary Note 1). We then train the MLP to correlate changes in the shape of bands to corresponding modifications of values for specific TB parameters.
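The sampling step can be pictured as follows: a random subset of parameters is perturbed by a combination of relative and absolute noise around the initial guess. The noise amplitudes, the selection probability, and the function name are placeholders; the values actually used are given in Supplementary Note 1 and are not reproduced here.

```python
import numpy as np

def sample_parameters(gamma_0, n_samples, rng, rel_amp=0.1, abs_amp=0.01, p_select=0.5):
    """Draw noisy parameter vectors around gamma_0 (all amplitudes are placeholders)."""
    gamma_0 = np.asarray(gamma_0, float)
    shape = (n_samples, gamma_0.size)
    selected = rng.random(shape) < p_select  # randomly selected parameters
    relative = rel_amp * gamma_0 * rng.standard_normal(shape)
    absolute = abs_amp * rng.standard_normal(shape)
    return gamma_0 + selected * (relative + absolute)

rng = np.random.default_rng(1)
gamma_0 = np.array([0.0, -2.8, 0.1, -0.2])  # made-up initial parameter list
samples = sample_parameters(gamma_0, n_samples=1000, rng=rng)
unchanged = sample_parameters(gamma_0, n_samples=5, rng=rng, p_select=0.0)
```

Each sampled parameter vector would then be passed through the forward problem to obtain its band structure, giving one training pair. The absolute term matters for parameters whose initial value is close to zero, where purely relative noise would barely move them.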

Multilayer perceptron model

As alluded to in Section “Inverse band structure problem”, we adopt a multilayer perceptron to map BSs to TB parameters. The MLP takes all BS data {ϵb,k} as 1D vector \(({\overrightarrow{\epsilon }}_{{k}_{0}},\ldots ,{\overrightarrow{\epsilon }}_{{k}_{n}})\) and outputs TB parameters as another 1D vector {γl} holding the different hopping values for every distance as well as the minimal set of onsite energies necessary for building the entire TB Hamiltonian. We find optimal performance using three hidden network layers and choose their sizes via linear interpolation of the sizes for input- and output layers (see Supplementary Information and Methods section for details).

The range of BS inputs and TB parameter outputs covers several orders of magnitude. This wide spread necessitates Gaussian scaling of both inputs and outputs across all samples. Drop-out regularization (20% at the input layer) effectively avoids overfitting. By applying the distance-hopping map procedure to TB Hamiltonians of the two defect structures, we obtain a number of output parameters that strongly varies with the desired sparseness of the model (see Table 2). The sampling density of our BS in k space determines the number of input neurons in our network. We find sampling the Brillouin zone path with 30 points (i.e., \(30\times {n}_{o}\) input values for the network) to be a sufficient compromise between resolving BS features and keeping the input layer size manageable.
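The two preprocessing choices described here, hidden-layer widths interpolated between input and output size and scaling of the data across samples, can be sketched as below. Reading "Gaussian scaling" as standardization to zero mean and unit variance is an interpretation, and the orbital count of 70, the output size of 100, and the function names are assumptions for the example.

```python
import numpy as np

def hidden_layer_sizes(n_in, n_out, n_hidden=3):
    """Hidden-layer widths linearly interpolated between input and output sizes."""
    sizes = np.linspace(n_in, n_out, n_hidden + 2)[1:-1]
    return [int(round(size)) for size in sizes]

def standardize(data):
    """Scale each column to zero mean and unit variance across samples."""
    mean, std = data.mean(axis=0), data.std(axis=0)
    std = np.where(std > 0.0, std, 1.0)  # leave constant columns unscaled
    return (data - mean) / std, mean, std

layers = hidden_layer_sizes(n_in=30 * 70, n_out=100)  # 70 orbitals, 100 outputs: assumed
rng = np.random.default_rng(2)
bands = rng.normal(loc=3.0, scale=5.0, size=(200, 6))  # synthetic stand-in for BS samples
scaled, mean, std = standardize(bands)
```

The stored `mean` and `std` are needed again at prediction time, both to scale the DFT band structure fed into the network and to undo the scaling of the predicted parameters.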

Table 2 Sparseness of TB parametrization: number of independent TB parameters for a given sparseness in both defect structures under consideration.

We emphasize that we aim to train a single-use network tailored to one specific defect in a given material, as opposed to training a general MLP for predicting parameters for different defects. Such a general network would fail to capture the peculiarities and details of the individual defects. Our training approach is very robust and straightforward, enabling a much faster workflow than manually converging a well-behaved Wannier parametrization. Indeed, for large systems converging a Wannier parametrization can even prove quite elusive, while our MLP-based approach should still work.

Training

We train the MLP on Ns = 150,000 data points, since performance converges and does not improve further by providing more samples (see Supplementary Information). We use a custom loss function that accounts for both parameter loss and BS mismatch of the predictions:

$${{{\mathcal{L}}}}={{{{\mathcal{L}}}}}_{\gamma }+{a}_{\epsilon }{{{{\mathcal{L}}}}}_{\epsilon }$$
(6)
$${{{{\mathcal{L}}}}}_{\epsilon }=\mathop{\sum }\limits_{b=1}^{{n}_{{{{\rm{b}}}}}}\mathop{\sum }\limits_{j=1}^{{n}_{{{{\rm{k}}}}}}{\left({\epsilon }_{b,{k}_{j}}^{({{{\rm{p}}}})}-{\epsilon }_{b,{k}_{j}}^{({{{\rm{t}}}})}\right)}^{2}$$
(7)
$${{{{\mathcal{L}}}}}_{\gamma }=\mathop{\sum }\limits_{l}^{{n}_{{{{\rm{p}}}}}}{\left({\gamma }_{l}^{({{{\rm{p}}}})}-{\gamma }_{l}^{({{{\rm{t}}}})}\right)}^{2}$$
(8)

Here, aϵ is a weighting factor, \({\epsilon }_{b,{k}_{j}}^{({{{\rm{t}}}})}\) (\({\epsilon }_{b,{k}_{j}}^{({{{\rm{p}}}})}\)) is the true (predicted) energy of band b at k-point kj, and \({\gamma }_{l}^{({{{\rm{t}}}})}\) (\({\gamma }_{l}^{({{{\rm{p}}}})}\)) is the true (predicted) value of the hopping (or si) for distance bin l, which we know for each pair of random Hamiltonian and associated BS in the training set. While an exact solution of the inverse band structure problem implies zero parameter loss, \({{{{\mathcal{L}}}}}_{\gamma }=0\), we find that adding a physical observable, i.e., the actual BS mismatch \({{{{\mathcal{L}}}}}_{\epsilon }\), to the loss function improves convergence. We achieve optimal performance for aϵ ≈ \(5\times 10^{-4}\) (see Supplementary Information).
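For a single training sample, the combined loss of Eqs. (6)–(8) can be evaluated as in the sketch below. Only the evaluation of the loss is shown; how the predicted bands are obtained inside the training loop is not specified here, and the function names and example numbers are assumptions.

```python
import numpy as np

def parameter_loss(gamma_pred, gamma_true):
    """Eq. (8): squared error on the TB parameters."""
    return float(np.sum((np.asarray(gamma_pred) - np.asarray(gamma_true)) ** 2))

def band_loss(eps_pred, eps_true):
    """Eq. (7): squared error on the bands, summed over bands and k-points."""
    return float(np.sum((np.asarray(eps_pred) - np.asarray(eps_true)) ** 2))

def total_loss(gamma_pred, gamma_true, eps_pred, eps_true, a_eps=5e-4):
    """Eq. (6): parameter loss plus weighted band-structure loss."""
    return parameter_loss(gamma_pred, gamma_true) + a_eps * band_loss(eps_pred, eps_true)

# Made-up numbers: one parameter off by 1, four band energies off by 10 each.
loss = total_loss(
    gamma_pred=[1.0, 2.0], gamma_true=[0.0, 2.0],
    eps_pred=np.full((2, 2), 10.0), eps_true=np.zeros((2, 2)),
)
```

In this example the parameter term contributes 1.0 and the band term 5 × 10⁻⁴ × 400 = 0.2, showing how the small weight keeps the much more numerous band-energy terms from dominating the loss.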

Models for sparse parametrizations

The numerical effort in using a given TB parametrization strongly depends on the sparsity of the TB Hamiltonian, i.e., the number of non-zero hopping elements γij. To improve performance, one can introduce a smaller cutoff length rNN requiring that all interactions beyond the NN-th nearest neighbor are set to zero. We denote this as xNN for the models generated in this work. Generating sparser TB models barely requires changes to our ML workflow yet enables vast performance gains for subsequent application of the TB models (Eq. (11)). The initial parameters γ(0)(δr) can again be taken from the piece-wise linearly interpolated bulk parameters (but cut off at rNN). We will end up with fewer individual parameters (see Table 2) in a sparser TB description, generally allowing for a less accurate fit. However, in many applications the interesting physics is confined to a specific energy region, most commonly around the Fermi edge. Depending on the desired sparseness it proved beneficial to introduce additional weighting \(w({\bar{\epsilon }}_{b}^{({{{\rm{t}}}})})\) into the BS loss function:

$${{{{\mathcal{L}}}}}_{\epsilon }=\mathop{\sum }\limits_{b=1}^{{n}_{{{{\rm{b}}}}}}\mathop{\sum }\limits_{j=1}^{{n}_{{{{\rm{k}}}}}}{\left({\epsilon }_{b,{k}_{j}}^{({{{\rm{p}}}})}-{\epsilon }_{b,{k}_{j}}^{({{{\rm{t}}}})}\right)}^{2}w\left({\bar{\epsilon }}_{b}^{({{{\rm{t}}}})}\right)$$
(9)
$${\bar{\epsilon }}_{b}^{({{{\rm{t}}}})}=\frac{1}{{n}_{{{{\rm{k}}}}}}\mathop{\sum }\limits_{j=1}^{{n}_{{{{\rm{k}}}}}}{\epsilon }_{b,{k}_{j}}^{({{{\rm{t}}}})}$$
(10)

Restricting long-range interactions increasingly compromises the accurate reconstruction of the entire band structure. We achieved the best results by focusing on the energy bands close to the charge neutrality point (E = 0): reducing the number of input bands for the MLP altogether mimics a step function for \(w({\bar{\epsilon }}_{b}^{({{{\rm{t}}}})})\) and reduces both network size and computational cost for training. Employing a zero-centered Gaussian distribution with appropriate width for \(w({\bar{\epsilon }}_{b}^{({{{\rm{t}}}})})\) achieves similar results at higher computational cost.
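The band weighting of Eqs. (9) and (10) can be sketched with both weight choices mentioned above, a step function and a zero-centered Gaussian. The window half-width, the Gaussian width, the three made-up bands, and all function names are illustrative assumptions.

```python
import numpy as np

def band_means(eps_true):
    """Eq. (10): k-averaged energy of each band; eps_true has shape (n_b, n_k)."""
    return np.mean(eps_true, axis=1)

def step_weight(mean_energy, half_width):
    """Weight 1 for bands whose mean lies within half_width of E = 0, else 0."""
    return (np.abs(mean_energy) <= half_width).astype(float)

def gaussian_weight(mean_energy, sigma):
    """Zero-centered Gaussian weight of width sigma."""
    return np.exp(-0.5 * (mean_energy / sigma) ** 2)

def weighted_band_loss(eps_pred, eps_true, weights):
    """Eq. (9): band-resolved squared error multiplied by the band weight."""
    per_band = np.sum((eps_pred - eps_true) ** 2, axis=1)
    return float(np.sum(weights * per_band))

eps_true = np.array([[-5.0, -5.0], [0.0, 0.0], [5.0, 5.0]])  # three flat toy bands
eps_pred = eps_true + 1.0
w_step = step_weight(band_means(eps_true), half_width=1.0)   # keeps only the middle band
loss_step = weighted_band_loss(eps_pred, eps_true, w_step)
w_gauss = gaussian_weight(band_means(eps_true), sigma=2.0)
```

With the step weight, bands far from E = 0 drop out of the loss entirely, so they can equally well be removed from the network input, which is the cheaper route described in the text.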

Our machine-learned TB parameters cannot be directly verified as they are not physical observables. Their exact values are not necessarily unique as long as they accurately reproduce derived quantities. We thus test the quality and validity of our extracted parametrizations with respect to BS, local density of states (LDOS), quantum transport and GQD spectra, which we found to be highly sensitive to the local electronic configuration of defects in recent work44.

Benchmarks

For each defect, we calculate the LDOS on both the TB and DFT level thus enabling direct comparison to DFT results (as compared to the additional benchmarks discussed below in which the Wannier TB parametrization is the only reference). LDOS and BS are shown for the double vacancy and flower defect in Figs. 5 and 6, respectively.

Fig. 5: Electronic structure analysis of vacancy defect.

a BS of the SLG double vacancy supercell along ΓMXΓ of both DFT calculation and MLP TB-model. b pz-projected density of states of the supercell. c Cosine similarity of the local density of states between different TB models and the DFT result. d LDOS at the three energies (left to right) indicated by vertical dash-dotted gray lines in b and c for DFT, Wannier, MLP respectively (top to bottom, colored boxes match line colors in b and c).

Our 10th-NN ML TB model displays excellent agreement with the DFT BS (Fig. 5a) over a large energy window. While exact symmetries are captured via the distance-hopping map, the exact width of some avoided crossings proves to be the most challenging aspect for the MLP and shows noticeable disagreement. In terms of the total density of states (DOS), the 10th-NN ML-TB-model is on par with the Wannier-TB-model. While neither can capture all the features of the ab-initio DOS, both reproduce it much better than general Slater–Koster models (see Figs. 5b and 6b). Since the deviations from the DFT DOS are present for both the machine-learned and the Wannier parametrization, we ascribe them to approximations of the TB formalism rather than a deficiency of our MLP algorithm.

Fig. 6: Electronic structure analysis of flower defect.

a BS of the SLG flower defect supercell along ΓMXΓ of both DFT calculation and MLP TB-model. b pz-projected density of states of the supercell. c Cosine similarity of the local density of states between different TB models and the DFT result. d LDOS at the three energies (left to right) indicated by vertical dash-dotted gray lines in b and c for DFT, Wannier, MLP respectively (top to bottom).

The spatial information of the LDOS provides an even more detailed comparison, which we analyze both visually (Figs. 5d–f and 6d–f) at relevant energies (indicated as dash-dotted vertical gray lines in Figs. 5b, c and 6b, c) and numerically via the cosine similarity of individually normalized LDOS distributions with respect to the DFT results over the entire energy range (Figs. 5c and 6c). The results show that the MLP parametrizations capture not only the total DOS but also its spatial distribution very well (on par with Wannier) over a wide energy range (see Supplementary Information).
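The cosine-similarity measure used above can be sketched in a few lines. This is a minimal illustration, assuming the LDOS of each method is stored as an array of shape (energies, sites); the function name and the toy numbers are hypothetical and not taken from our data.

```python
import numpy as np

def ldos_cosine_similarity(ldos_ref, ldos_model):
    """Cosine similarity between two LDOS maps, evaluated per energy.

    Both inputs have shape (n_energies, n_sites). Each energy slice is
    treated as a vector over sites, so the overall LDOS magnitude at a
    given energy drops out and only the spatial pattern is compared.
    """
    ref = np.asarray(ldos_ref, dtype=float)
    mod = np.asarray(ldos_model, dtype=float)
    overlap = np.sum(ref * mod, axis=1)
    norms = np.linalg.norm(ref, axis=1) * np.linalg.norm(mod, axis=1)
    similarity = np.zeros_like(overlap)
    # Energies with vanishing LDOS in either map are assigned similarity 0.
    np.divide(overlap, norms, out=similarity, where=norms > 0)
    return similarity

# Toy data (hypothetical): two energies, three sites.
reference = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 0.0]])
model = np.array([[2.0, 4.0, 0.0], [1.0, 0.0, 0.0]])
print(ldos_cosine_similarity(reference, model))
```

A value close to 1 indicates that the model places spectral weight on the same sites as the reference at that energy, independent of the total DOS.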

State-of-the-art modular recursive Green’s function methods (MRGM)45 (see “Methods”) profit immensely from sparse Hamiltonian matrices. Applying our sparse ML-TB-parametrizations to electronic transport calculations is therefore especially interesting. We study the different TB-parametrizations by embedding the defect supercells at five random but reproducible positions within a 15 nm wide zig-zag SLG ribbon of length ≈130 nm (Fig. 7b). Employing our MRGM code, we obtain the energy-dependent transmission T(E), which uniquely portrays the multiple scattering events occurring in systems with several defects, and compare T(E) for the different parametrizations.

Fig. 7: Transport benchmark.

Energy-dependent transmission T(E) for different TB parametrizations of the a double vacancy and c flower defect in SLG (vertically offset for clarity). b Scattering density plots for the three lowest modes at E = 0.7 eV in the double vacancy setup with ribbon-width and embedded defect positions indicated.

The 10th-NN ML-TB parametrizations accurately reproduce the transmission signature T(E) for both defects (Fig. 7a, c). Our results also highlight the limited transferability26 of Slater–Koster parametrizations to different defect geometries: while the SK-TB-parameters for the double vacancy (Fig. 7a) produce a somewhat useful transmission curve, their performance degrades drastically when applied to the flower defect (Fig. 7c).

Our sparser ML-TB parametrizations with interactions only up to the 3rd- or 5th-nearest neighbor still outperform the SK-parametrization. The loss in accuracy when enforcing very sparse Hamiltonians (3rd-NN) is a priori hard to quantify. While the TB description of the double vacancy seems more robust with respect to restricting long-range interactions than that of the flower defect (compare Fig. 7a and Fig. 7c), the 5th-nearest-neighbor parametrization seems to strike an appropriate balance between computational performance gain

$$\begin{array}{l}t_{\mathrm{Transport}}^{10\mathrm{NN}}\,:\,10\,\mathrm{m}\,42\,\mathrm{s}\\ t_{\mathrm{Transport}}^{5\mathrm{NN}}\,:\,1\,\mathrm{m}\,26\,\mathrm{s}\\ t_{\mathrm{Transport}}^{3\mathrm{NN}}\,:\,0\,\mathrm{m}\,49\,\mathrm{s}\end{array}$$
(11)

and accuracy.
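To give a feeling for how the neighbor cutoff controls sparsity, the following sketch counts the sites in each neighbor shell of a pristine honeycomb lattice. It is an illustration only, assuming an ideal lattice with unit bond length (relaxed defect geometries will distort the shells); the function names are hypothetical.

```python
import numpy as np

def honeycomb_patch(n_cells=8, a_cc=1.0):
    """Return site positions of a honeycomb patch centered on an A-site."""
    a1 = a_cc * np.array([1.5, np.sqrt(3.0) / 2.0])
    a2 = a_cc * np.array([1.5, -np.sqrt(3.0) / 2.0])
    basis = (np.zeros(2), np.array([a_cc, 0.0]))  # A and B sublattice
    sites = [i * a1 + j * a2 + b
             for i in range(-n_cells, n_cells + 1)
             for j in range(-n_cells, n_cells + 1)
             for b in basis]
    return np.array(sites)

def neighbor_shells(n_shells):
    """Radii and site counts of the first n_shells neighbor shells."""
    distances = np.linalg.norm(honeycomb_patch(), axis=1)
    distances = np.round(distances[distances > 1e-9], 6)  # drop the origin
    radii, counts = np.unique(distances, return_counts=True)
    return radii[:n_shells], counts[:n_shells]

for cutoff in (3, 5, 10):
    _, counts = neighbor_shells(cutoff)
    # Non-zero entries per Hamiltonian row: on-site term plus all hoppings.
    print(f"{cutoff}NN cutoff: {1 + counts.sum()} entries per row")
```

The number of non-zero matrix elements per orbital grows quickly with the cutoff shell, which is consistent with the strong dependence of the transport run time on the chosen sparseness.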

Another highly sensitive probe of our parametrizations comes in the form of smoothly confined SLG quantum dots46,47. We consider the influence of nearby lattice defects on the level spectrum of GQDs44 as a benchmark for how well different TB-parametrizations model the local electronic configuration. Smoothly confining electrons in SLG retains the valley degeneracy which, omitting spin, yields doubly degenerate states. In the vicinity of a lattice defect, this degeneracy is lifted as a function of defect-GQD distance44 (see Fig. 8). The resulting level spectra as a function of GQD displacement XT work as a unique fingerprint of the electronic structure of a defect.

Fig. 8: Quantum dot benchmark of vacancy defect.

Level spectrum landscapes calculated with different TB parametrizations of the double vacancy in SLG compared against the Wannier parametrization [a MLP(10NN), b Slater–Koster, c MLP(5NN), d MLP(3NN)]. Inset shows schematic sketch of the underlying system: we calculate the level spectrum (orbital and valley quantum number, spin is omitted) as a function of the position of an STM-tip (brown) induced (smoothly confined) GQD relative to an embedded defect in a large graphene flake (gray rectangle). Dotted gray lines represent the level structure of a pristine GQD with doubly degenerate orbitals.

We again find excellent agreement between the Wannier and the 10th-NN ML-TB parametrization. Conventional approaches such as Slater–Koster heavily underestimate the induced valley splittings Δτ and fail to capture the characteristic asymmetry of the lowest splitting for the double vacancy (Fig. 8c, d). The sparse ML parametrizations (3rd-NN or 5th-NN) still work quite well. Both slightly underestimate the induced splittings but manage to reproduce some of the asymmetry of the splittings for the double vacancy. The sparse ML-TB descriptions work especially well for the flower defect in this benchmark: qualitative agreement remains excellent and the quantitative changes to the induced valley splittings with increasing sparseness remain minor. Here, the Slater–Koster model strongly overestimates the splittings and fails to reproduce several of the sharp avoided crossings.

Discussion

The ML TB parametrizations yield accuracy on par with a full Wannier description, yet at substantially reduced cost. Once the sparseness levels are set, no human input or convergence issues appear during the parametrization step, and the improved sparseness greatly reduces computational demands in applications. The learning phase proceeds in an automated way, allowing for high-throughput simulations of different defects. More complicated materials such as transition-metal dichalcogenides will require even more parameters, and thus grouping of interactions by atom and orbital type. The same general algorithm should again work to provide tailored defect models.

The comparatively poor performance (see Table 1) of effective Slater–Koster methods strongly highlights the need for more accurate defect descriptions tailored to the corresponding electronic structure, which simply cannot be captured without additional DFT calculations. The remaining minor discrepancies in the highly sensitive GQD benchmark underline how the long-range interactions dictated by the underlying physics ultimately determine the accuracy of effective short-range descriptions: since we cut off long-range hoppings in the TB Hamiltonian, the sparse parametrization underestimates the range of the change in electronic structure induced by the defect. As a consequence, energy splittings between the two valley states are underestimated for small point defects like a vacancy (Fig. 8): only a tiny fraction of the quantum dot wavefunction (those few orbitals close to the defect) can actually contribute to the defect-induced energy shift. By contrast, an extended defect like the flower (Fig. 9) is much better described.

Fig. 9: Quantum dot benchmark of flower defect.

Level spectrum landscapes calculated with different TB parametrizations of the flower defect in SLG compared against the Wannier parametrization [a MLP(10NN), b Slater–Koster, c MLP(5NN), d MLP(3NN)].

Our comprehensive benchmarks (LDOS, transport, quantum states directly influenced by the defects) clearly outline the prowess of ML in obtaining DFT-quality results of defects in devices without substantial additional cost beyond the initial DFT calculation of the defect. Our sparse description of a defect system can be understood as a constrained optimization problem where ML offers elegant ways to find the sparse description with an optimal balance between accuracy and efficiency.

We have successfully implemented an ML algorithm to derive a TB Hamiltonian that accurately reproduces the BS details for general defect supercell structures in SLG. Given our universal treatment of symmetries and geometry information (distance-hopping map), this method can be applied to arbitrary material classes. The model requires a target BS and geometry information as inputs and allows for optimization towards a predefined sparseness of the desired TB description.

Our approach can be generalized to systems with relevant spin texture by either introducing additional distance-hopping maps (γ, γ, γ) or employing a split-off spin–orbit coupling term. For materials with a richer orbital structure (e.g., TMDs with dominant contributions from five d orbitals on the metal site and three p orbitals on the chalcogen site), one may adopt a mixture of the Slater–Koster10 and discrete distance-hopping-map approaches by following the usual scheme for the angle-dependent assignment of interactions (i.e., direction cosines for the spherical harmonic nature of the respective orbitals) but promoting the typical Slater–Koster parameters (e.g., Vpp−σ, Vpp−π, Vpd−σ, Vpd−π, Vdd−σ, Vdd−π, Vdd−δ, …) to discretized distance-dependent maps (in principle identical to γ). An MLP can then learn these maps following the same algorithm as outlined above. Using such a scheme for Se divacancies in WSe2 accurately reproduces all midgap defect states, including their different orbital characters (see Supplementary Note 6).
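As an illustration of this mixed scheme, the sketch below assembles the p–p hopping block from the standard Slater–Koster direction-cosine rule while reading Vpp−σ and Vpp−π from discretized distance maps. The map entries, the tolerance and the function names are hypothetical placeholders, not fitted values; in the scheme described above the map entries would be the quantities learned by the MLP.

```python
import numpy as np

# Hypothetical discretized distance maps: shell distance (Å) -> hopping (eV).
V_PP_SIGMA = {2.5: 1.2, 3.3: 0.4}
V_PP_PI = {2.5: -0.5, 3.3: -0.1}

def lookup(distance_map, distance, tol=0.05):
    """Return the map entry whose shell distance matches, else zero."""
    for shell, value in distance_map.items():
        if abs(shell - distance) < tol:
            return value
    return 0.0  # beyond the chosen sparseness cutoff

def pp_block(bond_vector):
    """3x3 hopping block between (px, py, pz) orbitals on two sites.

    Slater-Koster rule: E_ab = n_a n_b V_sigma + (delta_ab - n_a n_b) V_pi,
    with n the unit vector (direction cosines) along the bond.
    """
    r = np.asarray(bond_vector, dtype=float)
    distance = np.linalg.norm(r)
    n = r / distance
    v_sigma = lookup(V_PP_SIGMA, distance)
    v_pi = lookup(V_PP_PI, distance)
    return v_pi * np.eye(3) + (v_sigma - v_pi) * np.outer(n, n)

print(pp_block([2.5, 0.0, 0.0]))
```

For a bond along x, the block is diagonal with Vpp−σ for the px pair and Vpp−π for the py and pz pairs, which is the expected limiting case of the direction-cosine rule.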

The conducted benchmarks included DOS analysis, multi-defect scattering in electronic transport calculations as well as simulations of the defect-induced splittings in a GQD. We found both qualitative and quantitative agreement between the Wannier TB parameters (reference system) and the ML TB parameters of our MLP-based approach. Given the considerably less complex input (energy values and atomic positions) compared with that required by state-of-the-art iterative projection-based methods (full DFT solution including Bloch states), our method should prove better suited for high-throughput material analysis.

Methods

Machine learning

Our proposed neural network architectures (MLPs and CNNs) may be conveniently implemented via all common ML packages. We build our model via TensorFlow (v2.2.0) and the Keras API (v2.3.0-tf). We use the Adam48 optimizer with a learning rate β ≈ \(10^{-5}\) and a train/validation split of 75/25 on a total of 200,000 samples. With batch sizes of 2048, this results in a fully trained model after roughly 1500 epochs. The Gaussian process regression employed in our Bayesian optimization scheme is implemented via the scikit-learn Python package49.

Density functional calculations

We perform DFT structural and electronic optimization with the VASP software package50,51,52,53. The double vacancy real space cell measures 6 × 6 pristine unit cells whereas the flower defect is modeled in an 8 × 8 cell. Both calculations encompass 25 Å vacuum in z direction and use a 3 × 3 × 1 Monkhorst–Pack k-space grid. Our exchange-correlation functional of choice is Perdew–Burke–Ernzerhof (PBE) in a generalized gradient approximation. Both geometries are fully relaxed (using a conjugate gradient algorithm) to residual forces less than \(10^{-2}\) eV/Å. The plane-wave energy cutoff is set to 500 eV and the systems are electronically converged to δE ≈ \(10^{-9}\) eV.

Maximally localized Wannier transformation

The benchmark TB descriptions for the defects in this work have been generated with the Wannier9017,18,19,39 software package. The double vacancy requires 175 Wannier functions initialized as atom-centered pz and bond-centered s orbitals optimized with an outer energy window of [−28.5 eV, 12.4 eV] and an inner window of [−28.5 eV, −0.12 eV]. Disentangling the conduction bands from those virtual bands not included in the localized basis converges after 440 iterations while spread minimization converges after 187,089 iterations. The slightly larger flower defect requires 320 Wannier functions, again initialized as atom-centered pz and bond-centered s orbitals, optimized with an outer energy window of [−28.5 eV, 12.4 eV] and an inner window of [−28.5 eV, −0.12089 eV]. Disentangling converges after 599 iterations while spread minimization converges after 99,980 iterations. In both cases, the Monkhorst–Pack k-space grids are taken over from the DFT calculations.

Electronic transport

We evaluate transport in the Landauer–Büttiker approximation using the energy-dependent Green’s function G(E) of the scattering structure54. By projecting \(G\left|{\chi }_{i}\right\rangle\) onto the incoming wave in mode i we obtain a scattering state (see, e.g., Fig. 7b). By sandwiching G between incoming mode i and outgoing mode j we obtain the transmission amplitude \({t}_{ji}\propto \langle {\chi }_{j}|G|{\chi }_{i}\rangle\), where the proportionality factor is given by the square root of the relative group velocities \(\sqrt{{v}_{j}/{v}_{i}}\). The total transmission is the sum of all squared transmission amplitudes \(T={\sum }_{i,j}{\left|{t}_{ji}\right|}^{2}\).
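The Green’s function route to T(E) can be illustrated on a toy problem. The sketch below is not the MRGM code used in this work: it assumes a single-mode 1D tight-binding chain with nearest-neighbor hopping, attaches two semi-infinite leads through their analytic self-energy, and evaluates the transmission in the equivalent single-channel form T = Γ_L Γ_R |G_{N1}|². All names and parameter values are hypothetical.

```python
import numpy as np

def chain_transmission(energy, onsite, hopping=1.0):
    """Transmission through a 1D chain coupled to two ideal leads.

    onsite holds the on-site energies of the scattering region; the leads
    have zero on-site energy. Valid inside the band, |energy| < 2*|hopping|.
    """
    onsite = np.asarray(onsite, dtype=float)
    n = onsite.size
    h = np.diag(onsite).astype(complex)
    h -= hopping * (np.eye(n, k=1) + np.eye(n, k=-1))
    # Retarded self-energy of a semi-infinite chain: (E - i*sqrt(4t^2 - E^2)) / 2.
    band_term = np.sqrt(4.0 * hopping**2 - energy**2)
    sigma = 0.5 * (energy - 1j * band_term)
    sigma_left = np.zeros((n, n), dtype=complex)
    sigma_right = np.zeros((n, n), dtype=complex)
    sigma_left[0, 0] = sigma
    sigma_right[-1, -1] = sigma
    green = np.linalg.inv(energy * np.eye(n) - h - sigma_left - sigma_right)
    gamma = -2.0 * sigma.imag  # lead broadening, identical for both leads
    return gamma**2 * abs(green[-1, 0])**2

clean = chain_transmission(0.3, np.zeros(5))
defect = chain_transmission(0.5, [0.0, 0.0, 0.8, 0.0, 0.0])
print(clean, defect)
```

A clean chain transmits perfectly, whereas a single on-site defect of strength ε reduces the transmission to (4t² − E²)/(4t² − E² + ε²), which provides a simple analytic check of the numerical result.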