Introduction

The state-of-the-art theoretical framework for computing material properties of crystals at their ground state is density functional theory (DFT)1,2. DFT describes the total energy as a functional of the electron density, \(E\left[\rho \right]\), for a given atomic configuration {R}, by exploiting the conjugate relationship between the electrostatic potential of the nuclei, V({R}), and the ground-state electron density ρ. By solving the resulting quantum mechanical equations for the electrons, DFT provides a route to the total energy, the forces on each atom, the stress on the unit cell, and several other ground-state properties of materials. Yet the cost of solving these quantum mechanical equations, together with the need to work with the extensive electronic wavefunctions and density, hinders the application of this method to systems beyond a few thousand atoms.

A way to reduce the computational cost lies in the realization that the same conjugate relationship between ρ and V guarantees the existence of a functional that maps the electrostatic potential of the nuclei to the total energy; it is therefore possible to describe ground-state properties as a functional of the atomic positions in the structure, without working explicitly with the electron density. The exact form of such a functional is, however, unknown. One approach to approximating this unknown functional is to use artificial neural networks (ANNs). ANNs, and machine learning techniques in general, have been shown to yield reasonably accurate functional approximations for a wide range of applications, and have already been adopted with success for several materials science problems3,4,5,6,7,8,9,10,11,12,13,14,15.

ANNs can be seen as an attractive alternative to the classical approach to constructing interatomic interaction models (also known as force fields (FFs)), in which physical intuition is used to fix the form of the approximate functional for E[V({R})]. While physically meaningful forms can describe the interatomic interaction compactly, with only a few parameters to be fitted, the rigidity of the functional form reduces the predictive power of this approach in exploratory studies. In particular, for highly polymorphic materials such as carbon, where several different bonding types and structures exist, the lack of transferability of a model from one structure to another results in many different interaction models, each with limited applicability. For example, among the several empirical FFs for carbon, the non-reactive, short-range, bond-order-based Tersoff16 model can describe dense sp3 carbon structures, while a highly parametric reactive force field (ReaxFF)17 that explicitly includes long-range van der Waals (vdW) interactions and Coulomb energy through a charge equilibration scheme18 is needed for structures with sp2 hybridization. Furthermore, even though these empirical FFs give a qualitative understanding of materials properties, they are quantitatively inaccurate when compared with both ab initio methods and experiments19,20,21,22.

Interatomic interaction models based on ANNs do not have a fixed functional form beyond the network architecture, and their parameters are fitted to vast amounts of ab initio quantum mechanical data in the hope of assimilating the physics of the system into the parametrization. The transferability constraint of classical FFs, which stems from their rigid form, is thus traded for a transferability challenge for neural networks, which stems from the (lack of) variety and completeness of the training set. To address this challenge and generate truly transferable ANN interatomic interaction models, training data must be obtained from an efficient and thorough sampling of the potential energy landscape. Such sampling of this very rugged and high-dimensional landscape with ab initio electronic structure tools is a formidable challenge.

In this work, we integrate an evolutionary algorithm (EA) with molecular dynamics (MD) and clustering techniques in a self-consistent manner to sample the potential energy landscape and obtain data with high variability. The workflow we introduce extends the training data iteratively, similar to other active learning approaches that have previously appeared in the literature19,23,24,25,26. Unlike these methods, which aim at constructing an optimal dataset for a specified part of the potential energy landscape, our workflow targets an unbiased training dataset, which is necessary for the increased transferability expected of a general-purpose potential. Moreover, for reliable materials modeling, it is crucial to have indicators that signal when the limit of transferability is crossed. We address this aspect of ANN models by studying the relationship between data variability and the transferability of the trained network via unsupervised data analysis. We demonstrate the performance of this approach on the challenging example of crystalline and amorphous carbon structures.

This study is a continuation of similar efforts in the literature: the first ANN interaction model for elemental carbon was developed in 2010 by Khaliullin et al.19 to study graphite–diamond coexistence. The network was trained on an adaptive training set, where the starting configurations were manually selected from randomly distorted graphite and diamond phases, relaxed under a range of external pressures (from −10 to 200 GPa) at zero temperature. Configurations for new training data were then obtained by using this model in finite-temperature MD simulations, which in turn were used to refine the network, until self-consistency was reached in the prediction error on the new structures. More recently, in 2019, a hybrid model, in which an ANN potential for the short-range interaction is supplemented with a theoretically motivated analytical term for long-range dispersion, was developed to address the properties of monolayer and multilayer graphene, with encouraging results22. As we will demonstrate in this work, ANN models such as these, built on data sampled solely from a limited part of the potential energy landscape, can nevertheless be highly non-transferable. This transferability challenge for carbon has been observed with kernel-based machine learning models as well.

In 2017, a kernel-based model, specifically a Gaussian approximation potential (GAP), was constructed21 using data from MD melt-quench trajectories of liquid and amorphous carbon, to study amorphous structures. Motivated by its suboptimal behavior on crystalline phases, the authors developed another GAP model, with specialized training data obtained via MD, for graphene27. It is worth noting that a strategy combining kernel-based model generation with crystal structure prediction was recently suggested by Bernstein et al.28. Since the computational cost of training and evaluating a kernel-based model grows with the training set, however, this approach is suitable only for small-scale configuration space sampling. Alternatively, a sparsification approach, such as the clustering-based one recently proposed in ref. 29, can be used. In comparison, the evaluation cost of a neural network is independent of the size of the training dataset, a feature that is exploited in the current study for accurate prediction of elastic and vibrational properties. It should be mentioned that regression-based machine-learnt potential models other than GAP also exist, e.g., the spectral neighbor analysis potential (SNAP)8 and the moment tensor potential (MTP)30. A recent work comparing them concludes that GAP has the highest accuracy, but also the highest computational cost, which increases with the size of the training dataset31. SNAP and MTP use lower-cost regression strategies to correlate the local atomic environment with its contribution to the total energy.

In this work we use a systematic approach to construct a highly flexible and transferable neural network potential (NNP) and demonstrate its application to the development of a general NNP for carbon. We compare its performance with respect to other potential models previously optimized for specific phases and discuss the implications of our results for the trade-off between transferability and specialization.

Results

Self-consistent training and validation

The NNP is constructed following the self-consistent approach sketched in Fig. 1 (a minimal pseudocode sketch of one cycle is given below). This recursive data-generation and fitting cycle starts with a trial FF, which is used to generate an initial set of configurations via the EA. In the absence of an established FF model for a new material, rough approximations such as Lennard–Jones, or low-cost DFT calculations on small unit cells, can be used for the very first iteration. EAs are commonly used in crystal structure prediction studies as they allow efficient sampling of the configuration space; their success in thorough sampling is demonstrated by their ability to predict new crystal structures before experimental observation32,33. As the exploration of the configuration space proceeds, a single-point DFT calculation is performed on each distinct polymorph generated by the EA. These structures are then clustered using a distance measure. From each cluster, a representative example is manually selected and a classical MD simulation over a given pressure and temperature range is performed. This additional MD step samples the whole neighborhood of the equilibrium configuration of each polymorph, enabling accurate prediction of structural properties for every polymorph. The dataset obtained this way is used to train a neural network model, and the trained NNP is then used to start a new iteration of the self-consistent cycle. The clustering and representative selection keep the training set diverse by preventing the energetically favorable structures that are easily accessed by the EA from dominating the whole training set. The iterative procedure is repeated until no new structures are found.

Fig. 1: The self-consistent scheme.
figure 1

The initial step to start the process (yellow arrow) can be performed with a classical force field, as shown here; alternatively, any comprehensive dataset of structures, such as those in the Aflowlib72, Materials Genome Initiative73, or Nomad74 repositories, can be used to generate the first neural network potential model (blue triangle) to be refined through the self-consistent cycle. Once an initial potential model is chosen, the evolutionary algorithm enables a diverse set of structures to be sampled. The following clustering-based pruning of structures further ensures that no single polymorph biases the dataset, i.e., at each step only novel structures (red and blue disks for the particular step highlighted above) are considered, further refined, and added to the dataset. The subsequent MD simulations sample the potential energy surface of each polymorph. Finally, DFT calculations performed on a subset of MD-sampled structures are added to the ab initio dataset obtained thus far. The ab initio dataset augmented this way is then used to train the next neural network potential model (a darker blue triangle), starting the next cycle of the self-consistent scheme, until no new structures are found by the evolutionary algorithm.
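For readers who prefer a compact summary, the cycle can be written as Python-style pseudocode. This is a hedged sketch only: the helper functions (run_evolutionary_search, run_dft, cluster_structures, pick_representatives, run_md, train_nnp, contains_new_structures) are hypothetical placeholders standing in for the calls to USPEX, Quantum ESPRESSO, LAMMPS, and the network training code described in "Methods".

```python
# Hedged, Python-style pseudocode of the self-consistent cycle.
# All helper functions are hypothetical placeholders for the external codes
# (USPEX for the EA, Quantum ESPRESSO for DFT, LAMMPS for MD, PANNA for training).

def self_consistent_training(initial_potential, max_iterations=10):
    potential = initial_potential        # e.g., a rough classical force field
    dataset = []                         # accumulated ab initio training data
    for _ in range(max_iterations):
        # 1. Explore the configuration space with the current potential.
        polymorphs = run_evolutionary_search(potential)

        # 2. Single-point DFT on each distinct polymorph, then cluster by a
        #    structural distance so no single polymorph dominates the dataset.
        labelled = [run_dft(structure) for structure in polymorphs]
        clusters = cluster_structures(labelled)
        if not contains_new_structures(clusters, dataset):
            break                        # converged: the EA finds nothing new

        # 3. Sample the neighborhood of one representative per cluster via MD
        #    and label the sampled snapshots with single-point DFT.
        for representative in pick_representatives(clusters):
            snapshots = run_md(representative, potential)
            dataset += [run_dft(s) for s in snapshots]

        # 4. Train the next NNP on the augmented dataset and iterate.
        potential = train_nnp(dataset)
    return potential, dataset
```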

While the iterative expansion of a training set is not a new idea, our implementation pushes its limits in diversity and balance: we use a full EA to sample configurations, without anchoring the search in any known polymorph or in rigid transformations between polymorphs as in refs. 25 or 26. This makes our method applicable to materials with unexplored phase space and prevents any bias toward known phases. We then use clustering, which yields a balanced set despite the tendency of the EA to sample stable configurations more often. Finally, starting from a representative configuration for each cluster, we perform MD simulations so that the equilibrium properties of every polymorph are well described, independent of their stability with respect to the ground state. We refrain from using active learning methods that depend on the agreement between networks (as in ref. 23), since network prediction errors are not guaranteed to be uncorrelated; e.g., two networks may agree on the wrong result, especially if under-parametrized. We also refrain from expanding the training set with structures obtained solely through MD trajectories as in ref. 34, because of the risk of missing significant polymorphs that would be sampled only rarely, and with decreasing frequency, i.e., requiring longer and longer MD runs to produce significant additions to the dataset. Instead, a coherent integration of EA, clustering, and MD yields an unbiased, balanced, and diverse dataset. Further details of the self-consistent training used in this work are given in "Methods" and the expansion of the dataset explored at each step is shown in Supplementary Fig. 1.

The performance of the NNP at each self-consistent loop is evaluated during training via a validation scheme. Figure 2 shows the evolution of the NNP energy accuracy on the training and validation sets as a function of training steps at each self-consistent iteration (Fig. 2a–c). The training root-mean-square error (RMSE) is the instantaneous RMSE computed on the elements of the batch considered at that training step, while the validation RMSE is computed on all configurations in the validation set. The RMSE on the validation set agrees with the training RMSE throughout the training, an indication that the model does not overfit the training dataset. The analysis of the force prediction error at different stages of training gives similar results and can be found in Supplementary Fig. 2. The increase in energy and force RMSE from iteration 1 to 3 results from the increased diversity of atomic environments: at each self-consistent iteration, the diversity of the dataset increases as new structures are explored (see Table 1), while the number of parameters of the network, and therefore its capacity, is kept fixed. It is worth noting that the prediction error is not distributed according to a Gaussian distribution function but to a fatter-tailed one (see Fig. 2d). Therefore, while the RMSE given here is a good measure for comparing training and validation errors with one another, it overestimates the average NNP prediction error in general.

Fig. 2: The evolution of the distribution of error in energy prediction.
figure 2

The RMSE in the prediction of the per-atom energy for potentials trained at the first, second, and third iteration of the self-consistent cycle is given in a–c, respectively. The blue lines are the RMSE on a given batch of 128 configurations during training. The networks are evaluated during training on the full validation sets of ≈3000, ≈5200, and ≈12,000 configurations for the first, second, and third iterations, respectively (red dots with lines as a guide to the eye). The final training and validation RMSE are reported in Table 1. d Error distribution for the validation dataset at the third iteration. The black dashed line is a normalized Gaussian fit, resulting in an RMSE of 11.3 meV, which clearly fails to fit the fat-tailed distribution. The error distributions of the energy and force predictions for all iterations are given in Supplementary Fig. 3.

Table 1 Training and validation RMSE.

To demonstrate how the overall accuracy of the NNPs changes with each iteration, we check their performance on a dataset of 197 distinct carbon structures. These structures were obtained by Deringer and co-workers35 via a random search of carbon crystal structures with a GAP developed for liquid and amorphous carbon systems21 and are distributed online36. They represent 197 different crystal configurations of carbon, classified according to the topology of the carbon network. For consistency, their energies are recalculated with the same DFT parameters as explained in "Methods". Figure 3 shows the energy ranking as predicted by NNP, GAP, Tersoff, and ReaxFF. The NNP accuracy improves with each iteration, and the third-iteration NNP agrees remarkably well with the DFT results, performing better than all the other methods tested. It is noteworthy that the final NNP carries no signature of the ReaxFF used in the initial step to explore the configuration space. Both classical potentials, Tersoff and ReaxFF, perform very poorly compared with the machine-learnt ones, and the NNP outperforms the GAP results published in refs. 21,35, although GAP was fitted on ab initio data obtained with the local density approximation (LDA) exchange-correlation functional37. For a fair comparison, we train a new NNP using the same training dataset structures obtained via the self-consistent procedure, but with the LDA functional. This potential, referred to as NNP-LDA, performs similarly to the NNP highlighted in this work, and likewise outperforms all the other potentials. In the rest of the work, results denoted NNP refer to the potential trained with the rVV10 functional unless otherwise specified.

Fig. 3: Prediction of energy ordering of carbon structures.
figure 3

The energy ordering predicted by the NNP is compared to DFT and other models for the 197 distinct carbon structures reported in ref. 35. a The prediction performance of the NNP visibly improves at successive iterations of the self-consistent cycle. b The prediction performance of the GAP21, reactive force field (ReaxFF)75, and Tersoff16 models is reported alongside the final NNP model (blue line). For comparison, we train a new model with the LDA exchange-correlation functional, named NNP-LDA (red line). The neural network potentials for the two functionals overlap for the majority of the structures, as do the DFT results (see Supplementary Fig. 4). Further analysis of the prediction error in energy (instead of ranking) and an analysis of the similarity between this dataset and the one used in NNP generation are reported in Supplementary Fig. 5.

Structural and elastic properties

In this section, we discuss the performance of the NNP on the structural and elastic properties of selected carbon polymorphs, namely diamond, graphite, and graphene (see Tables 2–4). The equilibrium lattice parameters are obtained by minimizing the total energy until the force components on each atom are lower than 26 meV Å−1 for both DFT and NNP simulations. We also include results obtained with the Tersoff potential, as well as other DFT and machine learning studies from the literature.

Table 2 Elastic properties of diamond.
Table 3 Elastic properties of graphite.
Table 4 Elastic properties of graphene.

In the case of diamond, all machine learning methods agree reasonably well with the DFT results they were trained on, both for the equilibrium volume and for the elastic constants. The largest deviation is seen in the C12 prediction of GAP, with a 24% relative error. For all properties tested, the predictions of the NNP of the current study are within a relative error of 5% with respect to DFT. It should be noted that the variation between DFT studies employing different exchange-correlation functionals is larger than the difference between machine-learnt models and the DFT results they are trained to reproduce. The Tersoff potential, although it predicts the equilibrium volume well, fails to predict C44.

In the more challenging case of graphite, C11 and C12 relate to the in-plane elastic properties, while C33 probes the relationship between strain and stress between the planes, which are held together by vdW interactions. C13 and C44 couple the strong in-plane interactions with the weak out-of-plane ones: C13 can be seen as a measure of interlayer dilation upon layer compression, and C44 as a measure of the response to shear deformation. The performance of the NNP in predicting graphite elastic constants is aligned with this overview: for all potentials reported in Table 3, the in-plane lattice parameter and elastic constants are better predicted than those related to the out-of-plane interaction, indicating that more data or better training is needed to describe these more delicate properties. Yet it is encouraging that the general-purpose NNP of the current work performs at least as well as other NNPs from the literature that were developed with a focus on vdW systems such as graphite and multilayer graphene. In the "Discussion", we examine how focusing on a particular system could further improve these predictions.

Vibrational properties

Phonon dispersion relations give a complete picture of the elastic properties of a material, and reproducing the dispersion relations obtained via DFT is a stringent accuracy criterion for model potentials. Here we examine the performance of the NNP through its prediction of the phonon dispersion of diamond and graphene as a function of the lattice parameter, up to a 1% deviation from the equilibrium structure. This is a relevant range for the thermal expansion of these materials: for instance, the change in the lattice parameter of diamond at temperatures up to 2000 K is found to be below 1%38, and thermal expansion increases the graphene lattice parameter by less than 1% at temperatures up to 2500 K39.

The predictions of the NNP for the phonon dispersion of diamond and graphene are depicted in Fig. 4. There is overall good agreement between NNP and DFT in the case of diamond. In the case of graphene, there is a slight disagreement for the transverse optical mode around the K point. The same trend is observed in other machine-learnt potentials22,27 and likely results from electronic structure features associated with this special point coupling to the lattice vibrations. For both structures, the predicted phonon frequencies decrease when the crystal expands and increase when it is compressed, as expected. An exception is the soft flexural mode of graphene close to the Γ point. The instability of graphene upon compression can be seen in the small imaginary frequency of this mode (shown as negative). This feature is predicted with DFT and is successfully reproduced with the NNP, pointing at the capacity of the NNP to predict important structural stability indicators.

Fig. 4: Phonon dispersion at equilibrium and deformed geometries.
figure 4

The phonon dispersion along the high-symmetry lines of diamond and graphene is reported in a–c and d–f, respectively. The value at the top of each graph represents the percentage of expansion (positive) or compression (negative) of the lattice parameter. The dotted black line marks the maximum frequency, in THz, at the Γ point at the equilibrium lattice parameter.

The phonon dispersion of graphite, shown in Fig. 5, displays imaginary frequencies (plotted as negative) for small wave vectors close to Γ, along the direction perpendicular to the graphene planes. These phonon modes are particularly soft and are very sensitive to the accuracy of the forces predicted by the NNP. We verify this hypothesis with an alternative loss function for NNP training, one that minimizes the relative force error rather than the absolute one used so far (see "Methods"). With a loss function based on the relative error, configurations with small forces affect the NNP parameter optimization more strongly. We retrain the NNP starting from the previously optimized parameters and report the graphite phonon dispersion obtained with the retrained NNP in Fig. 5b. This approach evidently improves the NNP prediction for structures with small forces, e.g., close to equilibrium conditions. The phonon dispersions of diamond and graphene obtained with this NNP are given in Supplementary Fig. 7 and demonstrate that the general quality of the NNP is only slightly modified, mostly for the high-frequency modes. Further tuning of the retraining parameters and loss function can be used to achieve higher accuracy in the desired range of energy and force distributions.

Fig. 5: Phonon dispersion of graphite.
figure 5

The phonon dispersion along the high symmetry lines is reported for a an NNP trained with the whole dataset at the last iteration, b an NNP retrained with the whole dataset but with the minimization of the relative error on forces, and c an NNP trained with all the data within D = 0.05 from diamond and graphite (D12, as described in “Discussion”). The small imaginary frequencies are lifted by modifying the NNP training loss function, or by training on data close to graphite in structure.

An alternative approach commonly used in the literature to improve NNP predictions is to bias the training set toward configurations of a certain polymorph. To show the effect of this approach, we train the NNP model from scratch, this time using a biased dataset containing only structures from the close neighborhoods of diamond or graphite. The results reported in Fig. 5c show that this approach indeed yields better agreement with DFT, with no imaginary phonon frequencies. However, as examined further in the "Discussion", while this NNP model predicts well the properties of configurations around its references, i.e., diamond or graphite, it is highly non-transferable to other regions of the potential energy surface of carbon.

Amorphous carbon structures

Last, we test the ability of the NNP to construct amorphous carbon structures in a range of densities from 1.5 to 3.5 g cm−3, generated via the melt-and-quench method following the steps highlighted in ref. 21. We start from a 216-atom simple-cubic simulation cell with randomized velocities at 9000 K and perform an MD simulation first at 9000 K with a Nose–Hoover thermostat40 for 4 ps, followed by another at 5000 K for 4 ps, then a fast exponential quench to 300 K at a rate of 10 K fs−1 (total duration ~0.5 ps), and finally 4 ps of evolution with the thermostat fixed at 300 K.

The radial distribution functions (RDFs) of the liquid and amorphous phases are given in Fig. 6a. The liquid is less ordered than the amorphous configurations at all densities, for all potentials considered. In ref. 21, it was shown that both DFT and GAP give a non-zero first minimum for the liquid phase at about 1.9 Å, a feature that is not properly described by the screened Tersoff potential41. Similarly, the NNP of this work captures the non-zero first minimum in the liquid phase, while the original Tersoff potential does not. In the case of the amorphous phase, historically one of the first validation cases for the Tersoff potential, the agreement is overall better. A more detailed comparison with the RDFs reported in ref. 21 and with experiments is given in Supplementary Fig. 8 and shows that the NNP successfully reproduces peak positions and widths across the densities considered.

Fig. 6: Performance of the NNP on amorphous phases of carbon.
figure 6

a Radial distribution function for liquid (left) and amorphous (right) carbon, for our NNP and the Tersoff potential, at increasing densities (top to bottom). b Percentage of tetrahedrally coordinated atoms in amorphous carbon structures as a function of density, comparing the NNP at the rVV10 and LDA levels and the Tersoff potential with results taken from ref. 21 for the GAP and screened Tersoff potentials, as well as experimental results from refs. 42,43. c Young's modulus of amorphous carbon as a function of density for the NNP at the rVV10 level and for Tersoff, compared with results taken from ref. 21 for the GAP and screened Tersoff potentials, as well as experimental results from refs. 76,77. Error bars represent the standard deviation over ten random initializations of the particle velocities for the melt-quench cycles at each density.

To quantify the short-range order of the amorphous structures, we calculate the sp3 concentration as the fraction of carbon atoms with at least four neighbors within a 1.85 Å radius. In Fig. 6b, we show the behavior of this quantity as a function of density, comparing with the results of ref. 21 and those obtained with the regular and screened Tersoff potentials41. All methods underestimate the experimental observations yet show a similar general trend with density.
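For illustration, the counting criterion used here can be sketched as follows; this is a minimal version assuming a cubic periodic cell and the minimum-image convention, and the function name is ours.

```python
import numpy as np

def sp3_fraction(positions, cell_length, cutoff=1.85):
    """Fraction of atoms with at least four neighbors within `cutoff` (Å).

    positions:   (N, 3) array of Cartesian coordinates in Å
    cell_length: edge of the cubic periodic cell in Å; the minimum-image
                 convention used below requires cutoff < cell_length / 2
    """
    n_atoms = len(positions)
    coordination = np.zeros(n_atoms, dtype=int)
    for i in range(n_atoms):
        delta = positions - positions[i]                       # displacements
        delta -= cell_length * np.round(delta / cell_length)   # minimum image
        dist = np.linalg.norm(delta, axis=1)
        # count neighbors within the cutoff, excluding the atom itself
        coordination[i] = np.count_nonzero((dist > 1e-8) & (dist <= cutoff))
    return float(np.mean(coordination >= 4))
```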

There are quantitative differences among the predictions of the theoretical models; in particular, the differences between the NNP and GAP predictions are more significant at medium and low densities. This may be attributed to the fact that the DFT dataset used to construct the GAP potential is built with LDA, while in this study the DFT dataset for the NNP is built with an accurate exchange-correlation functional that includes vdW interactions from first principles. In the low-density region, vdW interactions allow bonding beyond the typical sp3 bond length, such that low-energy configurations can be constructed with fewer sp3 and more sp2 bonds; at high densities and shorter length scales, vdW interactions are of lesser significance. This becomes more evident when comparing the sp3 count predicted with NNP-LDA, which agrees more closely with the GAP result, revealing the role of the underlying DFT reference in the prediction of the properties of amorphous materials with machine-learnt potential models.

The bonding character between atoms strongly affects the elastic properties of materials. Hence, comparing the elastic properties observed in experiments with those predicted by theory is another way of assessing the theoretical prediction of the sp3 count in amorphous structures. To do so, we first find the metastable configurations closest in phase space to the amorphous structures examined so far, by further quenching the dynamics from 300 to 0 K and then performing geometry relaxation at fixed volume until the force components on the atoms are below 1 mRy bohr−1. Figure 6c shows the Young's modulus of these metastable amorphous structures as a function of density. The agreement with experiment is remarkable, hinting that the discrepancy between the theoretical and experimental sp3 counts seen in Fig. 6b might stem from an inconsistency in definitions between theory and experiment, i.e., the neighbor count within 1.85 Å used in theory underestimates the experimentally measured value, which is obtained by comparing the electron energy-loss spectroscopy peak area to that of graphitized carbon42,43.

We emphasize that the NNP was not constructed specifically for the description of amorphous C, nor did it include amorphous or melt structures hand-picked to represent these configurations. Despite this, the self-consistent approach yields an NNP, which describes these structures well at all volumes considered, validating successful extrapolation of the potential beyond the training set (see Supplementary Fig. 9 for energy analysis of liquid and amorphous structures compared to the training set).

Discussion

The accuracy of a neural network model is often measured by the distribution of the prediction error on a test dataset, in particular via mean and standard deviation of error. But as is the case with training sets, test sets are also not standardized between studies. Therefore the accuracy of potentials tested on different datasets cannot be compared. Here we study the effect of the training and test sets on the apparent accuracy of networks, and measure the impact of these sets on the transferability of NNPs.

For every configuration in a dataset, we first define its Euclidean distance from a reference atomic environment (e.g., cubic diamond, graphite). The distance between the reference configuration α and a given configuration β is defined as:

$${d}_{\alpha \beta }=\frac{1}{2}{\left(\frac{1}{{N}_{\beta }^{{\rm{at}}}}\mathop{\sum }\limits_{i = 1}^{{N}_{\beta }^{{\rm{at}}}}| {{\bf{g}}}_{\alpha }-{{\bf{g}}}_{\beta }^{i}{| }^{2}\right)}^{1/2}$$
(1)

where \({\bf{g}}=\frac{{\bf{G}}}{| {\bf{G}}| }\), with G being a fingerprint vector that describes the atomic environment of all atoms in the unit cell for a given configuration, and \({N}_{\beta }^{{\rm{at}}}\) is the number of atoms in configuration β. In this work, for the definition of the atomic environment, we use the well-established atom-centered symmetry functions of Behler and Parrinello44, with modifications from refs. 45,46. The same descriptors are used as input to the neural network architecture (see "Methods" for a detailed description of the descriptor vectors and their use in neural network training).

Then, we construct a dataset by considering only configurations within a given cutoff distance D from this reference. Following this strategy, we build four datasets: three of them referenced from cubic diamond, with D values of 0.05, 0.10, and 0.15, and a fourth one referenced from either cubic diamond or graphite, with D = 0.05 (denoted D12). For each D, 20% of the dataset is set aside for validation and the remaining 80% is used for training. We train four different NNPs on these four sets from scratch, and test each on the respective validation dataset. A minimal sketch of this distance-based selection is given below.
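The selection based on Eq. (1) can be sketched as follows, assuming the normalized per-atom descriptor vectors are already available as NumPy arrays; the function names are ours, for illustration only.

```python
import numpy as np

def distance_from_reference(g_ref, g_config):
    """Distance of Eq. (1) between a reference environment and a configuration.

    g_ref:    normalized descriptor of the reference environment, shape (M,)
    g_config: normalized per-atom descriptors of configuration beta, shape (N_at, M)
    """
    diff = g_config - g_ref[None, :]
    mean_sq = np.mean(np.sum(diff**2, axis=1))   # (1/N_at) sum_i |g_ref - g_i|^2
    return 0.5 * np.sqrt(mean_sq)

def select_within_cutoff(g_ref, configurations, D):
    """Keep only configurations whose distance from the reference is <= D."""
    return [g for g in configurations if distance_from_reference(g_ref, g) <= D]
```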

In Fig. 7a, we report the training and validation RMSE in energy prediction as the cutoff distance D from the reference structure increases. An RMSE as low as 2.4 (2.5) meV/atom for training (validation) can be obtained when the training and validation configurations are very similar, i.e., within a distance of 0.05 from the diamond reference. However, the prediction error of this NNP increases dramatically, to an RMSE as high as 473 meV/atom, when it is tested on structures farther away in the input space. This confirms the common observation that the prediction error of a neural network depends strongly on the similarity between training and test environments47. On the other hand, when the model is trained and tested using the complete set, a prediction RMSE of 22.1 meV/atom is obtained for the energy, while for the configurations within D = 0.05 from diamond the prediction RMSE is still considerably small, 7.7 meV/atom. The analysis for forces follows the same trend as for energies. The RMSE values for energies and forces are given in Supplementary Table I.

Fig. 7: The relationship between model transferability and the similarity of training and validation datasets.
figure 7

a Validation error of networks trained on different datasets as a function of the distance of the validation set from diamond. Numerical values are given in Supplementary Table I for energies and forces. b Representative structures at given distances from diamond, the reference structure. The structures at 0.05 or lower are recognizably related to the reference, while at 0.10 and 0.15 compressed and/or defective layered structures are visible. At 0.30 and above, configurations with several double bonds and carbon chains appear. c Energy per atom as a function of volume for structures in the dataset, colored according to their distance cutoff D from diamond. The black dot corresponds to the reference diamond structure. The complete dataset includes structures with larger volume that are omitted here for clarity; the complete volume range is given in Supplementary Fig. 10.

Hence, for a fixed network architecture, a trade-off must be struck between achieving a small error on configurations similar to a reference structure and obtaining reliable predictions for general configurations from the full potential energy surface. The other entries in these tables confirm this analysis: the more diverse the training set, the more robust the resulting potential outside its training basin. Therefore, a reliable NNP for multiple carbon polymorphs, as targeted here, requires a diverse training set drawn from a wide region of the potential energy surface.

In summary, we have presented a self-consistent technique for generating an accurate and transferable NNP. Since neural networks encode the physics of a system into their parametrization through data, the dataset plays a crucial role in the performance of the resulting NNP. The method described in this work achieves a comprehensive dataset via a balanced integration of an evolutionary algorithm, unsupervised machine learning in the form of clustering, and MD. As the training dataset is central to all machine learning models, we believe this generation method may be adopted by, and would be beneficial to, other ML approaches as well.

The distance-based analysis also provides an a posteriori measure of the profound diversity of the final dataset achieved via the self-consistent method. MD, together with the EA and clustering, successfully explores a wide range of configurations on an equal footing, so that the dataset shown in Fig. 7c covers the energy and volume landscape rather homogeneously. This is in line with the observation that at each iteration the dataset diversity increases, and the validation RMSE may also increase, since the network is tasked with a more complex functional approximation problem.

The presented workflow requires minimal human intervention. As the potential is improved iteratively, even rough starting models can be used for the very first step, and we have shown that the converged potential does not carry the limitations of the initial model. Therefore, not only is this workflow ready for the high-throughput automation schemes envisioned for the future of experimentation, but it is also robust with respect to the lack of prior information about a system, as is often the case for novel materials.

Many new materials with practical applications can be expected to be multicomponent systems. As the phase space of possible compounds grows larger and remains largely unexplored, truly automated and unbiased approaches for efficient exploration will become essential. We believe that our dataset generation approach (which can be coupled to any other ML approximator with multicomponent capability, e.g., ref. 48) is particularly suited to such systems. The workflow and the underlying neural network49 and electronic structure codes are publicly available and open-source.

The self-consistent NNP generation procedure is entirely system independent, and we have demonstrated its successful application to the challenging case of carbon, for which classical and machine-learnt potentials are abundant in the literature. We show that for the diamond, graphite, and graphene phases, the NNP reported in this work performs considerably better than Tersoff, a classical potential, and overall better than existing machine-learnt potentials for structural and elastic properties. Recently, a new GAP model trained on a large dataset with a wide range of polymorphs was published50. Based on our reproduction of the ab initio reference and the ML results of this model, a preliminary comparison is given in Supplementary Fig. 11 and Supplementary Table II; the NNP is found to perform as well as or better than this model for all properties studied.

When predicting the graphite phonon dispersion, the NNP gave very good agreement for the majority of the modes, yet predicted instability for the very soft modes related to the interlayer interaction. We traced this behavior to the accuracy required in predicting such small forces. To increase the accuracy with a fixed neural network architecture, we built the training set only with structures in the vicinity of graphite according to a fingerprint-based distance measure. The resulting potential provided accurate phonon frequencies but showed poor generalization to a wider range of structures compared with a more comprehensive potential trained on the entire dataset. This example highlights the need for a procedure to standardize the accuracy measures of NNPs and, more pressingly, the need to build error estimates into the process of generating NNPs.

Methods

Evolutionary algorithm for configuration space search

In iterative schemes, a good starting point often means that fewer iterations are needed to reach convergence. In a realistic use case of NNPs, it is reasonable to expect that only a moderately well-fitting potential would be available as a starting point. To demonstrate this, we start the self-consistent cycle using a Li–C ReaxFF model to generate the initial configurations. This model is fit to DFT results with vdW corrections and is designed to describe Li–C environments and defective graphite well, but not the wide range of solid carbon polymorphs considered in this work. We generate the initial configurations with 16 and 24 carbon atoms per unit cell at 0, 10, 20, 30, 40, and 50 GPa via the EA as implemented in USPEX51,52. At each pressure, we start with a population of 30 (50) randomly generated structures for 16 (24) atoms per unit cell, and evolve it through the following evolutionary operations with the given ratios: heredity (two parent structures are combined), 50%; mutation (a distortion matrix is applied to a structure), 25%; and generation of new random structures, 25%.

At each generation, structures are optimized in five successive steps: (a) constant pressure and temperature MD at 0.1 GPa and 50 K, respectively, for 0.3 ps with time step of 0.1 fs, (b) relaxation of cell parameters and internal coordinates until force components are <0.26 eV Å−1, (c) constant pressure and temperature MD at 0.1 GPa and 50 K, respectively, for 0.3 ps with time step of 0.1 fs, (d) relaxation of cell parameters and internal coordinates until force components are <0.026 eV Å−1, and (e) a final relaxation of cell parameters and internal coordinates until force components are <0.0026 eV Å−1.

Only the 70% most energetically stable parents are allowed to participate in the creation of the new generation. In the heredity step, only sufficiently distinct structures (whose cosine distance, as defined in the next section, is greater than a given threshold) are considered as parents. This threshold is fixed at 0.008 in the first iteration, small enough to allow deformed structures of the same polymorph to act as parents. To enhance the diversity of the structures in the subsequent iterations, the threshold is increased to 0.05, so that the parents can be expected to come from different polymorphs.

Each structure search is evolved for up to 50 generations in the first iteration and 30 in the subsequent ones. The configuration space search performed this way produces a wide range of sp2, sp3, and mixed sp2/sp3 structures, including defective layered structures.

Clustering

Initially, an unsupervised, bottom-up, distance-based hierarchical clustering approach with single linkage is applied to all structures obtained with the EA to identify the unique polymorphs. In later iterations, clustering is applied only to those structures for which the NNP prediction differs from the DFT ground-truth energy by more than 5 meV/atom; this way, polymorphs that are already well described by the NNP are not over-sampled. To measure the similarity between structures during clustering, we use the fingerprint-based cosine distance defined in refs. 53,54. For a single species in the unit cell, and in its discretized form, the fingerprint of a configuration becomes:

$$F[k]=\frac{1}{2}{\mathop{\sum}\limits_{i\in{\rm{cell}}}}\mathop{\sum}\limits_{j}\frac{{\rm{erf}}\left[\frac{(k+1){{\Delta }}-{R}_{ij}}{\sqrt{2}\sigma }\right]-{\rm{erf}}\left[\frac{k{{\Delta }}-{R}_{ij}}{\sqrt{2}\sigma }\right]}{4\pi {R}_{ij}^{2}\frac{{N}^{2}}{V}{{\Delta }}}-1$$
(2)

where the first sum runs over all atoms i in the unit cell, the second sum runs over all atoms j within a spherical cutoff radius \({R}_{\max }\), and Rij is the distance between atoms i and j. The numerator describes the integral of a Gaussian density of width σ over a bin of size Δ. N is the number of atoms in the unit cell and V is the unit cell volume.

The cosine distance between structures 1 and 2 is defined as:

$${D}_{{\rm{cosine}}}(1,2)=\frac{1}{2}\left(1-\frac{{{\bf{F}}}_{{\bf{1}}}\cdot {{\bf{F}}}_{{\bf{2}}}}{| {{\bf{F}}}_{{\bf{1}}}| | {{\bf{F}}}_{{\bf{2}}}| }\right).$$
(3)

The dimension of the F-vector is set to \({R}_{\max }/{{\Delta }}=125\), with \({R}_{\max }=10\) Å and Δ = 0.08 Å in this work. Two configurations closer to one another than a distance threshold are assigned to the same cluster. In this work, the threshold is tuned to yield ~100–150 clusters at each step, which results in an affordable computational cost for the remaining calculations of the self-consistent cycle.
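The fingerprint, the cosine distance, and the single-linkage clustering can be sketched as follows. This is an illustrative implementation only: the Gaussian width σ is an assumed value (it is not specified above), and neighbor-list construction is left to the caller.

```python
import numpy as np
from scipy.special import erf
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

def fingerprint(pair_distances, n_atoms, volume,
                r_max=10.0, delta=0.08, sigma=0.03):
    """Discretized fingerprint F[k] of Eq. (2) for a single-species cell.

    pair_distances: flat array of all i-j distances R_ij (i in the cell,
                    j within r_max); sigma is the Gaussian width, whose
                    value here is an assumption.
    """
    n_bins = int(round(r_max / delta))            # 125 bins for the values above
    edges = np.arange(n_bins + 1) * delta
    F = -np.ones(n_bins)                          # the "-1" baseline of Eq. (2)
    for R in pair_distances:
        # integral of a normalized Gaussian centered at R over each bin
        gauss = 0.5 * (erf((edges[1:] - R) / (np.sqrt(2.0) * sigma))
                       - erf((edges[:-1] - R) / (np.sqrt(2.0) * sigma)))
        F += gauss / (4.0 * np.pi * R**2 * (n_atoms**2 / volume) * delta)
    return F

def cosine_distance(F1, F2):
    """Cosine distance of Eq. (3)."""
    return 0.5 * (1.0 - np.dot(F1, F2) / (np.linalg.norm(F1) * np.linalg.norm(F2)))

def cluster_by_fingerprint(fingerprints, threshold):
    """Single-linkage hierarchical clustering: structures closer than
    `threshold` in cosine distance end up in the same cluster."""
    n = len(fingerprints)
    dmat = np.zeros((n, n))
    for a in range(n):
        for b in range(a + 1, n):
            dmat[a, b] = dmat[b, a] = cosine_distance(fingerprints[a], fingerprints[b])
    Z = linkage(squareform(dmat), method="single")
    return fcluster(Z, t=threshold, criterion="distance")
```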

Molecular dynamics (MD)

We manually select a representative structure from each cluster and perform a 0.5-ns classical NPT MD simulation with a Nose–Hoover thermostat and barostat. In these simulations, the external pressure and temperature are ramped from −50 GPa and 100 K to 50 GPa and 1000 K over the course of the 0.5 ns. The characteristic relaxation times of the thermostat and barostat are chosen as 50 and 100 fs, respectively. By sampling a snapshot of the dynamics every 5 ps, 100 configurations are selected. All MD simulations are performed with the LAMMPS package55. In addition, 440 randomly selected graphene atomic configurations from the libAtoms repository36 are added to the selection. This set constitutes the structures on which ab initio total energy calculations are then performed before they are added to the training set.

First principles calculations

The first-principles calculations performed on all structures visited during the EA configuration space search and the MD refinement described earlier employ the following parameters: the plane-wave kinetic energy cutoffs for the wavefunctions and the charge density are 80 and 480 Ry, respectively. The rVV1056 exchange-correlation functional, which incorporates non-local vdW correlations, is employed. A Brillouin zone sampling with a resolution of 0.034 × 2π Å−1 for the 3D carbon structures and 0.014 × 2π Å−1 for graphene is used. These parameters are found to yield 1 mRy/atom precision for diamond, graphite, and graphene. All DFT calculations are performed with the Quantum ESPRESSO package57,58. Elastic properties are computed through the thermopw framework59, while vibrational properties are obtained with the PHON package60.

In the first self-consistent iteration, the training set comprises all generated structures lying within 10 eV of the lowest-energy one, resulting in a total of ~16,000 configurations. In the subsequent iterations of the self-consistent procedure, we use all configurations whose energy per atom is within 1.2 eV of the lowest one; these are added to the previously selected configurations, amounting to a total of about 30,000 configurations in the second and 60,000 configurations in the third and final iteration. Of these configurations, 20% are set aside for validation and the remaining 80% are used in the NNP training.

Neural network architecture

In this work, we adopt the Behler–Parrinello approach to atomistic neural networks44 where the total energy of a system of N atoms is defined as the sum of atomic energy contributions

$$E=\mathop{\sum }\limits_{i=1}^{N}{E}_{i}({G}_{i}),$$
(4)

where Ei is the energy contribution of atom i and Gi is its local environment descriptor vector. As described in detail in the next section, we choose descriptors with 144 components per atomic environment. The contribution of an atom to the total energy is obtained by feeding its environment descriptor to a feed-forward, all-to-all-connected neural network. Here we build a network with two hidden layers of 64 and 32 nodes, respectively, both with Gaussian activation functions, and a single-node output layer with linear activation. The resulting network has a total of 11,393 parameters, i.e., (144 × 64) + (64 × 32) + (32 × 1) = 11,296 weights and 64 + 32 + 1 = 97 biases. The energies of all atoms are then summed to obtain the total energy of the configuration. The force on each atom can be obtained analytically

$${{\bf{F}}}_{i}=-\mathop{\sum}\limits_{j}\mathop{\sum}\limits_{\mu }\frac{\partial {E}_{j}}{\partial {G}_{j\mu }}\frac{\partial {G}_{j\mu }}{\partial {{\bf{R}}}_{i}}$$
(5)

where the atom index, j, runs over all the atoms within the cutoff distance of atom i, and index μ runs over the descriptor components.
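For illustration, the energy evaluation of Eq. (4) with the architecture described above can be sketched in a few lines of NumPy; the parameter layout is our assumption, and in practice the network and the analytic forces of Eq. (5) are handled by the PANNA code.

```python
import numpy as np

def gaussian(x):
    """Gaussian activation used in both hidden layers."""
    return np.exp(-x**2)

def total_energy(descriptors, params):
    """Behler-Parrinello total energy of Eq. (4).

    descriptors: (N_atoms, 144) array, one descriptor vector per atom
    params:      dict with weights W1 (144x64), W2 (64x32), W3 (32x1)
                 and biases b1 (64,), b2 (32,), b3 (1,)
    Every atomic environment passes through the same network; the per-atom
    energies are summed to give the configuration energy.
    """
    h1 = gaussian(descriptors @ params["W1"] + params["b1"])   # (N, 64)
    h2 = gaussian(h1 @ params["W2"] + params["b2"])            # (N, 32)
    e_atom = h2 @ params["W3"] + params["b3"]                  # (N, 1), linear output
    return float(np.sum(e_atom))

# Parameter count, matching the text:
# 144*64 + 64*32 + 32*1 = 11,296 weights and 64 + 32 + 1 = 97 biases.
```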

During training, the weight and bias parameters, collectively denoted W, are optimized with the Adam algorithm61 using gradients computed on randomly selected subsets (minibatches) of the data. The loss function of this stochastic optimization problem is defined as the sum of two contributions: one using the total energy (Eq. (6)) and one using the force on each atom (Eq. (7)):

$${{\mathcal{L}}}^{{\rm{E}}}(W)=\mathop{\sum}\limits_{c\in {\rm{batch}}}{\left({E}_{c}^{{\rm{DFT}}}-{E}_{c}(W)\right)}^{2}+\exp \left[a\tanh \left(\frac{1}{a}\mathop{\sum}\limits_{c\in {\rm{batch}}}{\left(\frac{{E}_{c}^{{\rm{DFT}}}-{E}_{c}(W)}{{N}_{c}}\right)}^{2}\right)\right],$$
(6)

where \({E}_{c}^{{\rm{DFT}}}\) is the ground-truth total energy obtained via DFT and Ec is the NN prediction of the total energy of a given configuration c, consisting of Nc atoms in the unit cell. The second term of this equation exponentially penalizes outliers while keeping the exponent bounded; a is a constant that tunes this penalty, and a = 5 is used in this study. The force contribution to the loss is given by:

$${{\mathcal{L}}}^{F}(W)={\gamma }_{F}\mathop{\sum}\limits_{c\in {\rm{batch}}}\mathop{\sum }\limits_{i=1}^{{N}_{c}}{\left|{{\bf{F}}}_{i}^{{\rm{DFT}}}-{{\bf{F}}}_{i}\right|}^{2},$$
(7)

where, for any atom i of configuration c, \({{\bf{F}}}_{i}^{{\rm{DFT}}}\) is the ground-truth force obtained via DFT and Fi is the NN prediction for it. γF is a user-defined parameter that controls the scale of this loss component; the results reported are obtained with γF = 0.5. The relative-error loss highlighted in "Results" is defined as

$${{\mathcal{L}}}^{F}(W)={\gamma }_{F}\mathop{\sum}\limits_{c\in {\rm{batch}}}\frac{1}{{N}_{c}}\mathop{\sum }\limits_{i=1}^{{N}_{c}}\frac{{\left|{{\bf{F}}}_{i}^{{\rm{DFT}}}-{{\bf{F}}}_{i}\right|}^{2}}{{\left|{{\bf{F}}}_{i}^{{\rm{DFT}}}\right|}^{2}+{f}_{0}^{2}},$$
(8)

where f0 is a regularizer constant, chosen as f0 = 260 meV Å−1 in this work.

An L2-norm regularization term is also added with a small coefficient γR = 10−4 to prevent weights from becoming spuriously large

$${{\mathcal{L}}}^{R}(W)={\gamma }_{R}\frac{| W{| }^{2}}{2}.$$
(9)

The total loss is thus defined as:

$${\mathcal{L}}(W)={{\mathcal{L}}}^{{\rm{E}}}(W)+{{\mathcal{L}}}^{F}(W)+{{\mathcal{L}}}^{R}(W).$$
(10)
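A minimal NumPy sketch of the individual loss terms of Eqs. (6)-(9), assuming energies in eV and forces in eV Å−1 (the actual implementation resides in the training code):

```python
import numpy as np

def energy_loss(E_dft, E_pred, n_atoms, a=5.0):
    """Energy loss of Eq. (6) over one minibatch.

    E_dft, E_pred: total energies per configuration, shape (batch,)
    n_atoms:       number of atoms per configuration, shape (batch,)
    """
    diff = E_dft - E_pred
    plain_term = np.sum(diff**2)
    per_atom_sq = np.sum((diff / n_atoms)**2)
    outlier_term = np.exp(a * np.tanh(per_atom_sq / a))   # bounded exponent
    return plain_term + outlier_term

def force_loss(F_dft, F_pred, gamma_F=0.5):
    """Absolute force loss of Eq. (7) for one configuration (sum over the batch)."""
    return gamma_F * np.sum((F_dft - F_pred)**2)

def relative_force_loss(F_dft, F_pred, gamma_F=0.5, f0=0.26):
    """Relative force loss of Eq. (8) for one configuration; f0 = 0.26 eV/Å."""
    num = np.sum((F_dft - F_pred)**2, axis=1)   # |dF_i|^2 per atom
    den = np.sum(F_dft**2, axis=1) + f0**2
    return gamma_F * np.mean(num / den)         # (1/N_c) * sum over atoms

def l2_loss(weights, gamma_R=1e-4):
    """L2 regularization of Eq. (9); `weights` is a flat parameter vector."""
    return gamma_R * 0.5 * np.sum(weights**2)
```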

All models are trained starting from random weights and a starting learning rate α0 = 0.001. The learning rate is decreased exponentially with optimization step t following the relationship \(\alpha (t)={\alpha }_{0}\,{r}^{t/\tau }\) with decay rate r = 0.96 and decay step τ = 3200. A batch size of 128 data points is used throughout the study.
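Written out explicitly, the decay schedule is simply (a trivial sketch):

```python
def learning_rate(step, alpha0=1e-3, r=0.96, tau=3200):
    """Exponentially decaying learning rate: alpha(t) = alpha0 * r**(t / tau)."""
    return alpha0 * r ** (step / tau)

# e.g., after 32,000 optimization steps: 1e-3 * 0.96**10 ≈ 6.6e-4
```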

Atomic environment descriptors

We use the Behler–Parrinello symmetry functions44 as local atomic descriptors. These functions include a two-body and a three-body term, referred to as the radial and angular descriptors, respectively. We use a modified version of the original angular descriptor45, as implemented and detailed in the PANNA package46. The radial descriptor function is defined as:

$${G}_{i}^{{\rm{Rad}}}[s]=\mathop{\sum}\limits_{j\ne i}{e}^{-\eta {\left({R}_{ij}-{R}_{s}\right)}^{2}}{f}_{c}({R}_{ij}),$$
(11)

where η and a set of Gaussian-centers Rs are user-defined parameters of the descriptor. The sum over j runs over all atoms whose distance Rij from the central atom i is within the cutoff distance Rc. The cutoff function, fc is defined as:

$${f}_{c}({R}_{ij})=\left\{\begin{array}{ll}\frac{1}{2}\left[\cos \left(\frac{\pi {R}_{ij}}{{R}_{c}}\right)+1\right]&{R}_{ij}\le {R}_{c}\\ 0&{R}_{ij} \, > \, {R}_{c}.\end{array}\right.$$
(12)

The angular part of the descriptor with central atom i is defined as:

$$\begin{array}{lll}{G}_{i}^{{\rm{Ang}}}[s]&=&{2}^{1-\zeta }\mathop{\sum}\limits_{j,k\ne i}{\left(1+\cos ({\theta }_{ijk}-{\theta }_{s})\right)}^{\zeta }\\ &&\times {e}^{-\eta {\left({R}_{ij}/2+{R}_{ik}/2-{R}_{s}\right)}^{2}}\\ &&\times {f}_{c}({R}_{ij}){f}_{c}({R}_{ik}).\end{array}$$
(13)

The sum runs over all pairs of neighbors of atom i, indexed j and k, with distances Rij and Rik within the cutoff radius Rc and forming an angle θijk at the central atom. Here η, ζ, and the sets of θs and Rs are the user-defined parameters of the descriptor.

We note that the descriptor as written in Eq. (13) has a discontinuous derivative with respect to the atomic positions when atoms are collinear. To restore continuity, we replace the \(\cos ({\theta }_{ijk}-{\theta }_{s})\) term with the following expression:

$$2\frac{\cos ({\theta }_{ijk})\cos ({\theta }_{s})+\sqrt{1-\cos {({\theta }_{ijk})}^{2}+\epsilon \sin {({\theta }_{s})}^{2}}\sin ({\theta }_{s})}{1+\sqrt{1+\epsilon \sin {({\theta }_{s})}^{2}}}$$
(14)

where we introduce a small regularization parameter ϵ, such that the expression approaches \(\cos ({\theta }_{ijk}-{\theta }_{s})\) in the limit ϵ → 0. In this work, ϵ = 0.001 is used, while values between 0.001 and 0.01 were found to yield stable dynamics and equivalent network potentials for any practical purpose.

The radial descriptors are parametrized with η = 16.0 Å−2, with 32 equidistant Gaussian centers Rs distributed between 0.5 and 4.6 Å. For the angular part, η = 10.0 Å−2, ζ = 23.0, 8 equidistant Rs are distributed between 0.5 and 4.0 Å, and 14 θs are chosen between π/28 and 27π/28 with spacing π/14. The cutoff Rc is 4.6 Å for the radial and 4.0 Å for the angular descriptors. The resulting descriptor has a total of 32 + 14 × 8 = 144 components per atomic environment.
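A compact NumPy sketch of the full descriptor of Eqs. (11)-(14) with the parameters listed above; neighbor lists and periodic images are assumed to be resolved by the caller, and each unordered pair of neighbors is counted once, which is our reading of the pair sum in Eq. (13).

```python
import numpy as np

# Descriptor parameters stated in the text
RC_RAD, RC_ANG = 4.6, 4.0                      # cutoffs (Å)
ETA_RAD, ETA_ANG, ZETA = 16.0, 10.0, 23.0
RS_RAD = np.linspace(0.5, 4.6, 32)             # 32 radial Gaussian centers (Å)
RS_ANG = np.linspace(0.5, 4.0, 8)              # 8 angular Gaussian centers (Å)
THETA_S = np.pi / 28 + np.arange(14) * np.pi / 14   # 14 angular centers
EPS = 0.001                                    # regularizer of Eq. (14)

def f_cut(R, Rc):
    """Cutoff function of Eq. (12)."""
    return np.where(R <= Rc, 0.5 * (np.cos(np.pi * R / Rc) + 1.0), 0.0)

def smooth_cos(cos_theta, theta_s):
    """Regularized replacement for cos(theta_ijk - theta_s), Eq. (14)."""
    sin_part = np.sqrt(1.0 - cos_theta**2 + EPS * np.sin(theta_s)**2)
    return (2.0 * (cos_theta * np.cos(theta_s) + sin_part * np.sin(theta_s))
            / (1.0 + np.sqrt(1.0 + EPS * np.sin(theta_s)**2)))

def descriptor(center, neighbors):
    """144-component descriptor of one atom: Eqs. (11) and (13).

    center:    (3,) position of the central atom
    neighbors: (M, 3) positions of its neighbors within the largest cutoff
    """
    d = neighbors - center
    R = np.linalg.norm(d, axis=1)

    # Radial part, Eq. (11): 32 components
    g_rad = np.sum(np.exp(-ETA_RAD * (R[:, None] - RS_RAD)**2)
                   * f_cut(R, RC_RAD)[:, None], axis=0)

    # Angular part, Eq. (13): 8 x 14 = 112 components
    g_ang = np.zeros((len(RS_ANG), len(THETA_S)))
    for j in range(len(R)):
        for k in range(j + 1, len(R)):
            cos_t = np.dot(d[j], d[k]) / (R[j] * R[k])
            radial = np.exp(-ETA_ANG * (0.5 * (R[j] + R[k]) - RS_ANG)**2)
            angular = (1.0 + smooth_cos(cos_t, THETA_S))**ZETA
            g_ang += (2.0**(1.0 - ZETA) * radial[:, None] * angular[None, :]
                      * f_cut(R[j], RC_ANG) * f_cut(R[k], RC_ANG))

    return np.concatenate([g_rad, g_ang.ravel()])   # 32 + 112 = 144 components
```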