A Systematic Approach to Generating Accurate Neural Network Potentials: the Case of Carbon

The availability of affordable and widely applicable interatomic potentials is key to unlocking the riches of modern materials modelling. Artificial neural network based approaches for generating potentials are promising; however, neural network training requires large amounts of data, sampled adequately from an often unknown potential energy surface. Here we propose a self-consistent approach, based on the crystal structure prediction formalism and guided by unsupervised data analysis, to construct an accurate, inexpensive and transferable artificial neural network potential. Using this approach, we construct an interatomic potential for Carbon and demonstrate its ability to reproduce first principles results on elastic and vibrational properties for diamond, graphite and graphene, as well as energy ordering and structural properties of a wide range of crystalline and amorphous phases.


I. INTRODUCTION
The state-of-the-art theoretical framework for computing material properties of crystals at their ground state is Density Functional Theory (DFT) 1,2 . DFT allows one to describe the total energy as a functional of the electron density, E[ρ], for a given atomic configuration {R}, by taking advantage of the conjugate relationship between the electrostatic potential of the nuclei V ({R}) and the ground state electron density ρ. By solving the expensive quantum mechanical equations that result from this definition for the electrons, DFT outlines a path to determine the total energy, the forces on each atom, the stress due to the crystal structure, and several other ground-state properties of materials. Yet the cost of solving the quantum mechanical equations, as well as having to work with the extensive electronic wavefunctions and density, hinders the application of this method to systems beyond a few thousand atoms.
A way to reduce the computational cost lies in the realization that the same conjugate relationship between ρ and V guarantees that a functional exists which maps the electrostatic potential of the nuclei to the total energy; hence it is possible to describe ground state properties as a functional of the positions of atoms in the structure, without having to work explicitly with the electron density. Yet the exact form of such a functional is unknown. One approach to approximating this unknown functional uses artificial neural networks (ANNs). ANNs, and machine learning techniques in general, have been shown to yield reasonably accurate functional approximations for a wide range of applications, and have already been adopted with success for some materials science problems [3][4][5][6][7][8][9][10][11][12][13][14][15] .
ANNs can be seen as an attractive alternative to the classical approach for constructing interatomic interaction models (also known as force fields), where physical intuition is used to fix the form of the approximate functional for E[V ({R})]. While physically meaningful forms can describe the interatomic interaction in a compact way, with only a few parameters to be fitted, the rigidity of the functional form reduces the predictive power of this method in exploratory studies. In particular, for highly polymorphic materials such as Carbon, where several different bonding types and structures exist, the lack of transferability of a model from one structure to another results in many different interaction models, each with limited applicability. For example, among the several empirical force fields for carbon, the nonreactive, short-range, bond-order-based Tersoff 16 model can describe dense sp 3 carbon structures, while a highly parametric reactive force field (ReaxFF) 17 that explicitly includes long-range van der Waals interactions and the Coulomb energy through a charge equilibration scheme 18 is needed for structures with sp 2 hybridization. Furthermore, even though these empirical force fields give a qualitative understanding of materials properties, they are quantitatively inaccurate when compared to both ab initio methods and experiments [19][20][21][22] .
Interatomic interaction models based on ANNs do not have a fixed functional form beyond the network architecture, and their parameters are fitted to vast amounts of ab initio quantum mechanical data in the hope of assimilating the physics of the system into the parametrization. Hence the transferability constraint of classical force fields, due to their rigid form, is traded for a transferability challenge in the case of neural networks, due to the (lack of) variety and completeness of the training set. To address this challenge and generate truly transferable ANN interatomic interaction models, training data must be obtained from an efficient and thorough sampling of the potential energy landscape. Such sampling of this very rugged and high-dimensional landscape with ab initio electronic structure tools is a formidable challenge.
In this work we integrate an evolutionary algorithm with molecular dynamics in a self-consistent manner to sample the potential energy landscape and obtain data with high variability. Moreover, for reliable materials modelling, it is crucial to have indicators that signal when the limit of transferability is crossed. We address this aspect of ANN models by studying the relationship between data variability and transferability of the trained network via unsupervised data analysis. We demonstrate the performance of this approach on the challenging example of crystalline and amorphous Carbon structures.
This study is a continuation of similar efforts in the literature: The first ANN interaction model for elemental carbon was developed in 2010 by Khaliullin et al. 19 to study graphite-diamond co-existence. The network was trained on an adaptive training set, where the starting configurations were manually selected from randomly distorted graphite and diamond phases, relaxed under a range of external pressures (from -10 to 200 GPa) at zero temperature. Then, configurations for new training data were obtained using this model in finite temperature molecular dynamics (MD) simulations, which in turn were used to refine the network, until self-consistency was reached in the prediction error on the new structures. More recently, in 2019, a hybrid model, where an ANN potential for the short-range interaction is supplemented with a theoretically motivated analytical term to model long-range dispersion, was developed in order to address the properties of monolayer and multilayer graphene, with encouraging results 22 . As we will demonstrate in this work, ANN models such as these, built on data sampled solely from a limited part of the potential energy landscape, can however be highly nontransferable. This transferability challenge for Carbon has been observed with kernel-based machine learning models as well.

FIG. 1. Sketch of the self-consistent scheme to generate an accurate and transferable neural network potential. The initial step to start the process (yellow arrow) can be performed with a classical force field, as shown here; alternatively, any comprehensive dataset of structures, such as the ones in the Aflowlib 23 , Materials Genome Initiative 24 or Nomad 25 repositories, can be used to generate the first neural network potential model (blue triangle) to be refined through the self-consistent cycle. Once an initial potential model is chosen, the Evolutionary Algorithm enables a diverse set of structures to be sampled. The following clustering-based pruning of structures further ensures that no single polymorph biases the dataset, i.e. at each step only novel structures (red and blue disks for the particular step highlighted above) are considered, further refined, and added to the dataset. The subsequent MD simulations sample the potential energy surface of each polymorph. Finally, DFT calculations performed on a subset of the MD-sampled structures are added to the ab initio dataset obtained thus far. The ab initio dataset augmented this way is then used to train the next neural network potential model (a darker blue triangle), starting the next cycle of the self-consistent scheme, until no new structures are found by the Evolutionary Algorithm.
In 2017, a kernel-based model was constructed 21 using data from MD melt-quench trajectories of liquid and amorphous Carbon, to study amorphous structures. Motivated by its non-optimal behavior on crystalline phases, the authors developed another model, with specialized training data obtained via MD, for graphene 26 . It is worthwhile to note that a strategy combining kernel-based model generation with crystal structure prediction was recently suggested by Bernstein et al. 27 . However, since the computational cost of training or evaluating a kernel-based model grows with the size of the training set, this approach is suitable for small-scale configuration space sampling. In comparison, the evaluation cost of a neural network is independent of the size of the training dataset, a feature that is exploited in the current study for accurate prediction of elastic and vibrational properties. In this work we use a systematic approach to construct a highly flexible and transferable neural network potential (NNP) and demonstrate its application to the development of a general NNP for Carbon. We compare its performance with that of other potential models previously optimized for specific phases and discuss the implications of our results for the trade-off between transferability and specialization.

A. Self-Consistent Training and Validation
The neural network potential is constructed following the self-consistent approach sketched in Fig. 1. This recursive data-creating and fitting cycle starts with a trial force field (FF), which is used to generate an initial set of configurations via an evolutionary algorithm. Evolutionary algorithms (EA) are commonly used in crystal structure prediction studies, as they allow efficient sampling of the configuration space. Their success in thorough sampling is demonstrated by their ability to predict new crystal structures before experimental observation 28,29 . As the exploration of the configuration space continues, a single-point DFT calculation is performed on each distinct polymorph generated by the EA. These structures are then clustered using a distance measure. From each cluster, a representative example is manually selected and a classical molecular dynamics simulation over a given pressure and temperature range is performed. The additional MD simulation step allows the sampling of the whole neighborhood of the equilibrium configuration of each polymorph, resulting in accurate prediction of structural properties for every polymorph. The dataset obtained this way is used to train a neural network model. The trained NNP is then used to start a new iteration of the self-consistent cycle. This increases the training set diversity by preventing the energetically favorable structures that are easily accessed by the EA from dominating the whole training set. The iterative procedure highlighted above is repeated until no new structures are found. The methodological details of the self-consistent training used in this work are given in the Methods section.
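The control flow of this cycle can be sketched as follows. The snippet is a toy illustration only: a one-dimensional scalar stands in for a crystal structure, and simple stubs replace the EA search, the fingerprint-based clustering, the MD sampling and the DFT labelling; none of the function names correspond to actual code used in this work.

```python
import random

def toy_energy(x):
    # Toy double-well "potential energy surface" standing in for DFT.
    return (x**2 - 1.0)**2

def run_evolutionary_search(rng, n=20):
    # Stand-in for the EA (USPEX): propose candidate "structures" (scalars).
    return [rng.uniform(-2.0, 2.0) for _ in range(n)]

def cluster_id(x, width=0.5):
    # Stand-in for fingerprint-based clustering: bin configurations.
    return round(x / width)

def self_consistent_cycle(max_iterations=10, seed=0):
    rng = random.Random(seed)
    seen_clusters, dataset = set(), []
    iterations_used = 0
    for _ in range(max_iterations):
        iterations_used += 1
        candidates = run_evolutionary_search(rng)
        # Keep only structures from clusters not yet represented.
        novel = [x for x in candidates if cluster_id(x) not in seen_clusters]
        if not novel:
            break  # self-consistency: the EA finds no new polymorphs
        for x in novel:
            seen_clusters.add(cluster_id(x))
            # Stand-in for MD sampling around the structure + DFT labelling.
            samples = [x + 0.05 * k for k in (-1, 0, 1)]
            dataset.extend((s, toy_energy(s)) for s in samples)
        # A real implementation would retrain the NNP on `dataset` here.
    return dataset, iterations_used

dataset, n_iter = self_consistent_cycle()
```

The essential point is the termination criterion: the loop stops as soon as an EA generation produces no configuration belonging to a previously unseen cluster.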
The performance of the NNP at each self-consistent loop is evaluated during training via the validation scheme. Fig. 2 shows the evolution of the NNP energy accuracy on the training and validation sets as a function of the self-consistent iteration (see Table I), while the number of parameters of the network, and therefore its capacity, is kept fixed. It is worth noting that the prediction error is not distributed according to a Gaussian distribution function but a fatter-tailed one (see Fig. 2d). Therefore, while the RMSE given here is a good measure to compare training and validation errors with one another, it overestimates the average NNP prediction error in general.
To demonstrate how the general accuracy of the NNPs changes with each iteration, we check their performance on a dataset of 197 distinct carbon structures. These structures were obtained by Deringer and coworkers 30 via a random search of carbon crystal structures with a Gaussian approximation potential (GAP) developed for liquid and amorphous carbon systems 21 , and are distributed online 32 . For consistency, their energies are re-calculated with the same DFT parameters as explained in the Methods section. Fig. 3 shows the energy ranking as predicted by NNP, GAP, Tersoff and ReaxFF. It can be seen that the NNP accuracy improves with each iteration. The third-iteration NNP agrees remarkably well with the DFT results and performs better than all the other methods tested. It is noteworthy that the final NNP carries no signature of the ReaxFF used in the initial step to explore the configuration space. Both classical potentials, Tersoff and ReaxFF, perform very poorly compared to the machine learnt ones, and the NNP outperforms the GAP results published in Refs. 21 and 30, even though GAP was fitted to ab initio data obtained with the LDA exchange correlation functional 33 . For a fair comparison, we train a new NNP, using the same training dataset obtained via the self-consistent procedure, but with the LDA functional. This potential, referred to as NNP-LDA, performs similarly to the NNP highlighted in this work, and similarly outperforms all the other potentials. In the rest of the work, the results denoted by NNP refer to the potential trained with the rVV10 functional, unless otherwise specified.

B. Structural and elastic properties
In this section, we discuss the performance of the NNP on the structural and elastic properties of selected Carbon polymorphs, namely diamond, graphite and graphene (see Table II). The equilibrium lattice parameters are obtained by minimizing the total energy until the force components on each atom are lower than 26 meV/Å, for both DFT and NNP simulations. We also include results obtained with the Tersoff potential, as well as other DFT and machine learning studies in the literature.
In the case of diamond, all machine learning methods agree reasonably well with the DFT results they were trained on, both for the equilibrium volume and for the elastic constants. The largest deviation is seen in the C 12 prediction of the GAP potential, with a 24% relative error. For all properties tested, the predictions of the NNP of the current study are within a relative error of 5% with respect to DFT. It should be noted that the variation between DFT studies employing different exchange correlation functionals is larger than the difference between machine learnt models and the DFT results they are trained to reproduce. The Tersoff potential, although it predicts the equilibrium volume well, fails to predict C 44 .
In the more challenging case of graphite, C 11 and C 12 relate to the in-plane elastic properties, while C 33 probes the relationship between strain and stress between the planes, which are held together by van der Waals interactions. C 13 and C 44 couple the strong in-plane interaction with the weak out-of-plane ones: C 13 can be seen as a measure of interlayer dilation upon layer compression, and C 44 as a measure of the response to shear deformation.

TABLE II. Elastic properties of Diamond, Graphite and Graphene. For Diamond (top), all machine learnt potentials reproduce their reference DFT lattice parameter with less than 1% relative error. When the comparison extends to elastic constants and bulk modulus, however, only the NNP described in this work shows a consistently close agreement between DFT and the potential model (< 3% relative error), the range of variation also observed between two different experiments. For Graphite (middle), we report the lower bound for the bulk modulus using the Reuss average, i.e. 1/B0 ≡ s11 + s22 + s33 + 2(s12 + s23 + s31). The robust intraplanar structural features of graphite are captured well by all machine learnt potentials, while the weaker interplanar interaction and, in particular, the elastic properties that couple the two are more challenging to capture. This is true even for the hybrid potential hNN-Gr of Ref. 22, where the distance dependence of the long-range interaction is manually set to r −6 and the potential is tailor-fit to describe multilayer graphene systems. For Graphene (bottom), the 2D elastic constants were computed from the normalized 3D stress as σ2D(ε) = E(ε)/A0 at a given strain ε, where E(ε) is the total energy at strain ε and A0 = √3 a 2 /2 is the area of the graphene plane. E is the Young modulus and ν is Poisson's ratio. The elastic constant is converted to bulk properties in GPa by dividing by the interlayer distance c/2 of graphite reported in Table II.
The performance of the NNP in predicting the graphite elastic constants is in line with this overview: For all potentials reported in the middle panel of Table II, the in-plane lattice parameter and elastic constants are better predicted than those related to the out-of-plane interaction, indicating that more data or better training is needed to describe these more delicate properties. Yet it is encouraging that the general purpose NNP of the current work performs at least as well as other NNPs from the literature that were developed with a focus on van der Waals systems such as graphite and multilayer graphene. In section III we discuss how focusing on a particular system could further improve these predictions.
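As a concrete illustration of how an elastic constant can be extracted from total-energy calculations at small strains (as done for graphene in the Table II caption), the sketch below fits a quadratic to synthetic energy-strain data. The modulus, reference energy and lattice parameter are illustrative numbers, not results of this work.

```python
import numpy as np

# Illustrative numbers only: a graphene-like lattice constant, an assumed
# 2D modulus C_true, and an arbitrary reference energy E0.
a = 2.46                             # lattice parameter, Å
A0 = np.sqrt(3.0) / 2.0 * a**2       # graphene unit cell area, Å^2
C_true = 21.7                        # assumed 2D elastic constant, eV/Å^2
E0 = -18.0                           # reference energy, eV

# Synthetic harmonic energy-strain data: E(eps) = E0 + 0.5 * C * A0 * eps^2.
eps = np.linspace(-0.01, 0.01, 11)   # strains within +/- 1%, as in the text
energy = E0 + 0.5 * C_true * A0 * eps**2

# Quadratic fit: the eps^2 coefficient equals 0.5 * C * A0.
coeffs = np.polyfit(eps, energy, 2)
C_fit = 2.0 * coeffs[0] / A0
```

With DFT or NNP total energies in place of the synthetic `energy` array, `C_fit` recovers the curvature of the energy-strain relation; dividing by the graphite interlayer distance c/2 would convert it to a bulk value in pressure units, as described in the Table II caption.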

C. Vibrational properties
Phonon dispersion relations give a complete picture of the elastic properties of a material, and the reproduction of dispersion relations obtained via DFT is a tight accuracy criterion for model potentials. Here we examine the performance of the NNP through its prediction of the phonon dispersion of diamond and graphene as a function of lattice parameter, up to a 1% deviation from the equilibrium structure. This is a relevant range for the thermal expansion of these materials as, for instance, the change in the lattice parameter of diamond at temperatures up to 2000 K is found to be below 1% 42 . Similarly, thermal expansion increases the graphene lattice parameter by less than 1% at temperatures up to 2500 K 43 .
The predictions of the NNP for the phonon dispersion of diamond and graphene are depicted in Fig. 4. There is an overall good agreement between NNP and DFT in the case of diamond. In the case of graphene, there is a slight disagreement for the transverse optical mode around the K point. The same trend is observed in other machine learnt potentials 22,26 and is likely the result of the electronic structure associated with this special point coupling to the lattice vibrations. For both structures, the predicted phonon frequencies decrease when the crystal expands and increase when it is compressed, as expected. An exception is the soft flexural mode of graphene close to the Γ point. The instability of graphene upon compression can be seen via the small imaginary frequency of this mode (shown as negative). This feature is predicted by DFT and is successfully reproduced by the NNP, pointing at the capacity of the NNP to predict important indicators of structural stability.
The phonon dispersion of graphite, shown in Fig. 5, displays negative frequencies for low wave vectors close to Γ, along the direction perpendicular to the graphene plane. These phonon modes are particularly soft and are very sensitive to the level of accuracy of the forces predicted by the NNP. In order to verify this hypothesis, we retrain an NNP model (from scratch) using data from small neighborhoods of diamond or graphite (within a distance of 0.05 as defined in equation 1 of the Discussion section). There are no imaginary frequencies for this potential and a general agreement with DFT is obtained. However, as will be further examined later (see Discussion), this NNP model only predicts properties of configurations around diamond and graphite, and it is found to be highly nontransferable to other regions of the potential energy surface of carbon.

D. Amorphous carbon structures
Lastly, we test the ability of the NNP to construct amorphous carbon structures in a range of densities from 1.5 to 3.5 g/cm 3 , generated via the melt-and-quench method following the steps highlighted in Ref. 21. We start from a 216-atom simple cubic simulation cell with randomized velocities at 9000 K and perform a molecular dynamics simulation first at 9000 K with a Nosé-Hoover thermostat 48 for 4 picoseconds (ps), followed by another at 5000 K for 4 ps, then a fast exponential quench to 300 K at a rate of 10 K/fs (total duration ∼0.5 ps); finally, we let the system evolve for 4 ps with the thermostat fixed at 300 K.
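The thermostat set-point of this melt-quench protocol can be written as a simple piecewise schedule; the sketch below encodes the stages described above (the ∼0.5 ps quench duration follows from (5000 − 300) K at 10 K/fs).

```python
def target_temperature(t_ps):
    """Thermostat set-point (K) at time t (ps) for the melt-quench protocol
    described in the text: 4 ps melt at 9000 K, 4 ps at 5000 K, a quench
    to 300 K at 10 K/fs, then a 300 K anneal."""
    quench_rate = 10.0 * 1000.0                 # 10 K/fs expressed in K/ps
    t_quench = (5000.0 - 300.0) / quench_rate   # 0.47 ps, i.e. ~0.5 ps
    if t_ps < 4.0:
        return 9000.0                           # melt stage
    if t_ps < 8.0:
        return 5000.0                           # liquid equilibration
    if t_ps < 8.0 + t_quench:
        return 5000.0 - quench_rate * (t_ps - 8.0)  # linear quench
    return 300.0                                # final anneal
```

In practice such a schedule would be handed to the MD engine (here, LAMMPS) through its thermostat controls; the Python function is only a compact restatement of the protocol.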
The radial distribution functions (RDFs) of the liquid and amorphous phases are given in Fig. 6a. The liquid is less ordered than the amorphous configurations at all densities, for all potentials considered. In Ref. 21, it was shown that both DFT and GAP give a non-zero first minimum for the liquid phase at about 1.9 Å, which is not properly described by the screened Tersoff potential 49 . Similarly, the NNP of this work captures the non-zero first minimum in the liquid phase. The density dependence of the RDF found in this work also agrees well with the one reported in Ref. 21.
In order to quantify the short-range order of the amorphous structures, we calculate the sp 3 concentration by computing the fraction of carbon atoms with at least four neighbours within a 1.85 Å radius. In Fig. 6b we show the behavior of this quantity as a function of density, comparing with the results of Ref. 21 and those obtained with the regular and screened Tersoff potentials 49 . All methods underestimate the experimental observations yet show a similar general trend with density.
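This neighbor-counting definition of the sp 3 fraction is straightforward to implement; a minimal numpy sketch for a cubic periodic cell is given below (an O(N²) loop for clarity; a production code would use a neighbor list).

```python
import numpy as np

def sp3_fraction(positions, box, cutoff=1.85):
    """Fraction of atoms with at least four neighbours within `cutoff` (Å),
    using the minimum-image convention in a cubic box of side `box` (Å)."""
    positions = np.asarray(positions, dtype=float)
    n = len(positions)
    counts = np.zeros(n, dtype=int)
    for i in range(n):
        d = positions - positions[i]
        d -= box * np.round(d / box)                  # minimum-image convention
        r = np.sqrt((d**2).sum(axis=1))
        counts[i] = np.count_nonzero(r < cutoff) - 1  # exclude the atom itself
    return float(np.mean(counts >= 4))
```

For example, for a single tetrahedrally coordinated atom with four neighbours at 1.5 Å in a large box, only the central atom is counted as sp 3 , giving a fraction of 1/5.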
There are quantitative differences among the predictions of the theoretical models; in particular, the difference between the NNP and GAP predictions is more significant at medium and low densities. This may be attributed to the fact that the DFT dataset used to construct the GAP potential is built with the local density approximation (LDA), while in this study the DFT dataset for the NNP is built with an accurate exchange-correlation functional that includes van der Waals (vdW) interactions from first principles. In the low density region, vdW interactions allow bonding beyond the typical sp 3 bond length, such that low energy configurations can be constructed with fewer sp 3 and more sp 2 bonds; at high densities and shorter length scales, vdW interactions are of lesser significance. This is made more evident by the sp 3 count predicted with NNP-LDA, which agrees more closely with the GAP result, revealing the role of the underlying DFT reference in the prediction of the properties of amorphous materials with machine learnt potential models.
The bonding character between atoms strongly affects the elastic properties of materials. Hence, comparing the elastic properties observed in experiments with those predicted by theory is another way of assessing the theoretical prediction of the sp 3 count in amorphous structures. In order to do that, we first find the metastable configurations closest in phase space to the amorphous structures examined so far, by further quenching the dynamics from 300 K to 0 K and then performing geometry relaxation at fixed volume until the force components on the atoms are below 1 mRy/bohr. Fig. 6c shows the Young's modulus of these metastable amorphous structures as a function of density. The agreement with experiment is remarkable, hinting that the discrepancy between the theoretical and experimental sp 3 counts seen in Fig. 6b might stem from an inconsistency in definitions between theory and experiment, i.e. the neighbor count within 1.85 Å used in theory underestimates the experimentally measured value, which is obtained via comparison of the electron energy-loss spectroscopy (EELS) peak area to graphitized carbon 44,45 .

III. DISCUSSION
The accuracy of a neural network model is often measured by the distribution of the prediction error on a test dataset, in particular via the mean and standard deviation of the error. But, as is the case with training sets, test sets are not standardized between studies. Therefore the accuracies of potentials tested on different datasets cannot be directly compared. Here we study the effect of the training and test sets on the apparent accuracy of networks, and measure the impact of these sets on the transferability of neural network potentials.
For every configuration in a dataset we first define its Euclidean distance from a reference atomic environment (e.g. cubic diamond, graphite). The distance between the reference configuration α and a given configuration β is defined as

d(α, β) = |g_α − g_β| ,   (1)

where g = G/|G|, with G being a "fingerprint" vector that describes the atomic environment of all atoms in the unit cell for a given configuration, obtained by averaging the per-atom descriptors over the N at β atoms of configuration β, where N at β is the number of atoms in configuration β. In this work, for the definition of the atomic environment, we use the well-established atom-centered symmetry functions of Behler and Parrinello 50 . Then, we construct a dataset by considering only configurations within a given cutoff distance D from this reference. Following this strategy we build four datasets, three of which are referenced from cubic diamond with D values of 0.05, 0.10 and 0.15; the fourth one is referenced from either cubic diamond or graphite with D=0.05 (denoted by D 12 ). For each D, 20% of the dataset is set aside for validation and the remaining 80% is used for training. We train four different NNPs on these four sets from scratch, and test each on the respective validation dataset.
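This pruning-by-distance step can be sketched in a few lines of numpy, assuming the fingerprint of each configuration is already available as a single vector G (the construction of G from Behler-Parrinello symmetry functions is not reproduced here).

```python
import numpy as np

def normalized_fingerprint(G):
    """g = G/|G| for a per-configuration fingerprint vector G (in the paper,
    built from Behler-Parrinello symmetry functions; any 1D array here)."""
    G = np.asarray(G, dtype=float)
    return G / np.linalg.norm(G)

def distance_to_reference(G_ref, G_beta):
    """Euclidean distance between normalized fingerprints, as in Eq. (1)."""
    return float(np.linalg.norm(
        normalized_fingerprint(G_ref) - normalized_fingerprint(G_beta)))

def select_within(G_ref, fingerprints, D):
    """Indices of configurations within cutoff distance D of the reference."""
    return [i for i, G in enumerate(fingerprints)
            if distance_to_reference(G_ref, G) <= D]
```

Note that the normalization makes the distance insensitive to the overall magnitude of G, so two fingerprints that differ only by a scale factor are treated as identical environments.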
In the top panel of Fig. 7, we report the training and validation RMSE in energy prediction as the cutoff distance D from the reference structure increases. We show that an RMSE as low as 2.4 (2.5) meV/atom for training (validation) can be obtained when training and validation configurations are very similar, i.e. within a distance of 0.05 from the diamond reference. However, the prediction error of this NNP dramatically increases as it is tested on structures farther away in the input space, to an RMSE as high as 473 meV/atom. This confirms the common observation that the prediction error of a neural network depends strongly on the similarity between training and test environments 53 . On the other hand, when the model is trained and tested using the complete set, a prediction RMSE of 22.1 meV/atom is obtained for the energy, while, for the configurations within D = 0.05 from diamond, the prediction RMSE is still considerably small, 7.7 meV/atom. The analysis for forces follows the same trend as for energies. The RMSE values for energies and forces are given in the Supplementary Material. Hence, it can be deduced that, for a fixed network architecture, a trade-off must be struck between having a small error on configurations similar to a reference structure and obtaining reliable predictions for general configurations from the full potential energy surface. The other entries in these tables confirm this analysis: the more diverse the training set, the more robust the resulting potential is outside its training basin. Therefore, for a reliable NNP for multiple C polymorphs, as targeted here, a diverse training set from a wide region of the potential energy surface is necessary.
In summary, in this work we have presented a self-consistent technique for generating an accurate and transferable neural network potential. Since neural networks encode the physics of a system into their parametrization through data, the dataset plays a crucial role in the resulting NNP performance. The self-consistent method described in this work achieves an accurate and comprehensive dataset via a balanced integration of an evolutionary algorithm, unsupervised machine learning in the form of clustering, and molecular dynamics. The presented workflow for the self-consistent refinement of the potential requires minimal human intervention. Therefore it is not only ready for high-throughput automation schemes, as envisioned for the future of experimentation, but also robust with respect to the lack of prior information about a system, as is often the case with novel materials. The workflow and the underlying neural network 54 and electronic structure codes are publicly available and open-source.
The self-consistent NNP generation procedure is entirely system independent, and we have demonstrated its successful application to the challenging case of Carbon, for which classical and machine-learnt potentials are abundant in the literature. We show that, for the diamond, graphite and graphene phases, the NNP reported in this work performs considerably better than Tersoff, a classical potential, and overall better than the existing machine learnt potentials for structural and elastic properties. When predicting the graphite phonon dispersion, the NNP yields very good agreement for the majority of the modes, yet predicts instability for the very soft modes that relate to the interlayer interaction. We have traced this behaviour to the accuracy required in predicting such small forces. To increase accuracy at a fixed neural network architecture, we built a training set containing only structures in the vicinity of graphite according to a fingerprint-based distance measure. The resulting potential provided accurate phonon frequencies but showed poor generalization to a wider range of structures, compared to a more comprehensive potential trained on the entire dataset. This example highlights the need for a procedure to standardize the accuracy measure of NNPs and, more pressingly, the need to build error-estimate measures into the process of generating NNPs.

A. Evolutionary Algorithm for Configuration Space Search
We start the self-consistent cycle using ReaxFF 31 , a reactive FF fitted for the Li-C system, to generate the initial configurations with 16 and 24 Carbon atoms per unit cell at 0, 10, 20, 30, 40 and 50 GPa via the EA as implemented in USPEX 55,56 . At each pressure, we start with a population of 30 (50) randomly generated structures for 16 (24) atoms per unit cell, and evolve it through the following evolutionary operations with the given ratios: heredity (two parent structures are combined), 50%; mutation (a distortion matrix is applied to a structure), 25%; generation of new random structures, 25%.
At each generation, structures are optimized in five successive steps: (a) constant pressure and temperature molecular dynamics at 0.1 GPa and 50 K respectively for 0.3 ps with time step of 0.1 fs, (b) relaxation of cell parameters and internal coordinates until force components are less than 0.26 eV/Å, (c) constant pressure and temperature molecular dynamics at 0.1 GPa and 50 K respectively for 0.3 ps with time step of 0.1 fs, (d) relaxation of cell parameters and internal coordinates until force components are less than 0.026 eV/Å, and (e) a final relaxation of cell parameters and internal coordinates until force components are less than 0.0026 eV/Å.
Only the 70% most energetically stable "parents" were allowed to participate in the process of creating the new generation. In the heredity step, only sufficiently distinct structures (whose cosine distance, as defined in the next section, is greater than a given threshold) are considered as parents. This threshold is fixed at 0.008 in the first iteration, as it is small enough to allow deformed structures from the same polymorph to be parents. In order to enhance the diversity of the structures in the subsequent iterations, the threshold is increased to 0.05 so that the parents can be expected to be from different polymorphs.
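The parent-selection rule above can be sketched as a greedy filter: rank by energy, keep the most stable 70%, and admit a structure as a parent only if it is sufficiently distinct from every parent accepted so far. The snippet is an illustrative sketch, not the USPEX implementation.

```python
def select_parents(structures, distance, threshold, keep_fraction=0.7):
    """Greedy sketch of the parent-selection rule. `structures` is a list of
    (energy, descriptor) pairs and `distance` a callable on two descriptors
    (e.g. the fingerprint cosine distance of the Clustering section)."""
    ranked = sorted(structures, key=lambda s: s[0])        # lowest energy first
    pool = ranked[: max(1, int(keep_fraction * len(ranked)))]
    parents = []
    for energy, desc in pool:
        # Admit only structures farther than `threshold` from all parents.
        if all(distance(desc, p) > threshold for _, p in parents):
            parents.append((energy, desc))
    return parents
```

Raising the threshold (0.008 in the first iteration, 0.05 afterwards) makes this filter stricter, so that parents are increasingly likely to come from different polymorphs rather than from deformations of the same one.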
Each structure search is evolved up to a maximum of 50 generations at the first iterations and 30 in the subsequent ones. The configuration space search performed this way produces a wide range of sp 2 , sp 3 and mixture of sp 2 and sp 3 structures, including defective layered structures.

B. Clustering
Initially, an unsupervised, bottom-up, distance-based hierarchical clustering approach with single linkage is applied to all structures obtained with EA, in order to identify the unique polymorphs. In the later iterations, clustering is applied only to those structures where the NNP prediction differs from the DFT ground-truth energy by more than 5 meV/atom; that way, polymorphs that are already well described by the NNP are not over-sampled. During clustering, to measure the similarity between structures, we use the fingerprint-based cosine distance defined in Refs. 57 and 58. For a single species in the unit cell, and in its discretized form, the fingerprint component for radial bin k becomes

F(k) = \sum_i \sum_j \frac{\int_{k\Delta}^{(k+1)\Delta} g_\sigma(R - R_{ij})\, dR}{4\pi R_{ij}^2 (N^2/V) \Delta} - 1, (2)

where the first sum runs over all atoms i in the unit cell, the second sum runs over all atoms j within a spherical cutoff radius R_max, and R_ij is the distance between atoms i and j. The numerator is the integral of a normalized Gaussian density g_σ of width σ over a bin of size Δ. N is the number of atoms in the unit cell and V is the unit cell volume.
The cosine distance between structures 1 and 2 is defined as

d_{cos}(1,2) = \frac{1}{2}\left(1 - \frac{F_1 \cdot F_2}{|F_1||F_2|}\right).

The dimension of the F-vector is set to R_max/Δ = 125, with R_max = 10 Å and Δ = 0.08 Å in this work. Two configurations closer to one another than a distance threshold are assigned to the same cluster. In this work the threshold is tuned to yield approximately 100-150 clusters at each step, which keeps the computational cost of the remaining calculations of the self-consistent cycle affordable.
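Assuming the standard normalization of Refs. 57 and 58, in which the distance is half of one minus the cosine similarity (so that it ranges from 0 for identical fingerprints to 1 for anti-aligned ones), the clustering criterion reduces to a few lines:

```python
import numpy as np

def cosine_distance(f1, f2):
    """d = (1 - F1.F2 / (|F1||F2|)) / 2, mapping fingerprints to [0, 1]."""
    f1, f2 = np.asarray(f1, float), np.asarray(f2, float)
    cos = np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2))
    return 0.5 * (1.0 - cos)

def same_cluster(f1, f2, threshold=0.05):
    """Structures closer than the threshold (0.008 or 0.05 in the text)
    are assigned to the same cluster."""
    return cosine_distance(f1, f2) < threshold

a = np.ones(125)    # 125-component fingerprint, as in the text
b = -np.ones(125)   # fingerprint components can be negative (F -> -1 far away)
print(cosine_distance(a, a))   # ~0 (identical structures)
print(cosine_distance(a, b))   # ~1 (maximally dissimilar)
```

Single-linkage hierarchical clustering then merges any two clusters that contain a pair of structures below this distance threshold.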

C. Molecular Dynamics
We manually select a representative structure from each cluster and perform a 0.5 ns classical NPT molecular dynamics simulation with a Nosé-Hoover thermostat and barostat. In these simulations the external pressure and temperature are ramped from −50 GPa and 100 K to 50 GPa and 1000 K over the course of the 0.5 ns. The characteristic relaxation times of the thermostat and barostat are chosen as 50 fs and 100 fs, respectively. By sampling a snapshot of the dynamics every 5 ps, 100 configurations are selected. All molecular dynamics simulations are performed with the LAMMPS package 59 . In addition, 440 randomly selected graphene atomic configurations from the libAtoms repository 32 are added to the selection. This constitutes the set of structures on which ab initio total energy calculations are then performed, with the results added to the training set.
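In LAMMPS input syntax, such a ramp could look roughly like the following fragment (a sketch under assumed metal units and a 1 fs time step, which the text does not specify; not the authors' actual script):

```
units           metal            # eV, ps, bar
timestep        0.001            # 1 fs (illustrative; the actual step is not stated)

# Nose-Hoover thermostat/barostat: T ramped 100 -> 1000 K,
# isotropic pressure ramped -50 GPa -> 50 GPa (-5e5 -> 5e5 bar),
# damping times 0.05 ps (50 fs) and 0.1 ps (100 fs)
fix             ramp all npt temp 100.0 1000.0 0.05 iso -500000.0 500000.0 0.1

# one snapshot every 5 ps -> 100 frames over the 0.5 ns run
dump            traj all custom 5000 ramp.lammpstrj id type x y z
run             500000           # 0.5 ns at 1 fs/step
```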

D. First Principles Calculations
The first-principles calculations performed on all structures visited during the EA configuration-space search and the MD refinement described earlier employ the following parameters. The plane-wave kinetic energy cutoffs for the wavefunctions and the charge density are 80 Ry and 480 Ry, respectively. The rVV10 60 exchange-correlation functional, which incorporates non-local van der Waals correlations, is employed. Brillouin-zone sampling with a resolution of 0.034×2π Å⁻¹ for the 3D carbon structures and 0.014×2π Å⁻¹ for graphene is used. These parameters are found to yield 1 mRy/atom precision for diamond, graphite and graphene. All DFT calculations were performed with the Quantum ESPRESSO package 61,62 . Elastic properties are computed through the thermo_pw framework 63 , while vibrational properties are obtained with the PHON package 64 .
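A pw.x input fragment with these parameters might look as follows (a sketch only; ibrav, nat, the convergence threshold and the k-point grid are illustrative placeholders that depend on the specific structure being computed):

```
&control
   calculation = 'vc-relax'      ! cell + ionic relaxation
/
&system
   ibrav = 0, nat = 16, ntyp = 1 ! illustrative placeholders
   ecutwfc = 80.0                ! wavefunction cutoff, Ry
   ecutrho = 480.0               ! charge-density cutoff, Ry
   input_dft = 'rvv10'           ! rVV10 van der Waals functional
/
&electrons
   conv_thr = 1.0d-8             ! illustrative
/
! grid chosen per structure to match the ~0.034 x 2*pi/Angstrom resolution
K_POINTS automatic
  8 8 8 0 0 0
```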
In the first self-consistent iteration, the training set is made up of all generated structures lying within 10 eV of the lowest-energy one, resulting in a total of ∼16000 configurations. In the subsequent iterations of the self-consistent procedure, we use all configurations whose energy per atom is within 1.2 eV of the lowest one; these are added to the previously selected configurations, amounting to a total of about 30000 configurations in the second iteration and 60000 in the third and final one. Of these configurations, 20% are set aside for validation and the remaining 80% are used in the NNP training.
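The 80/20 split can be sketched as follows (a minimal illustration, not the authors' actual tooling):

```python
import random

def split_dataset(configs, val_fraction=0.2, seed=0):
    """Shuffle and split configurations into training and validation sets
    (80%/20% as in the text)."""
    rng = random.Random(seed)
    shuffled = configs[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]

# e.g. the ~60000 configurations of the final iteration
train, val = split_dataset(list(range(60000)))
print(len(train), len(val))   # 48000 12000
```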

E. Neural Network Architecture
In this work we adopt the Behler-Parrinello approach to atomistic neural networks 50 , where the total energy of a system of N atoms is defined as the sum of atomic energy contributions:

E = \sum_{i=1}^{N} E_i(G_i),

where E_i is the energy contribution of atom i and G_i is its local-environment descriptor vector. As described in detail in the next section, we choose descriptors with 144 components per atomic environment. The contribution of an atom to the total energy is obtained by feeding its environment descriptor to a feed-forward, all-to-all-connected neural network. Here we build a network with two hidden layers, with 64 and 32 nodes for the first and second layer respectively, both with Gaussian activation functions, and a single-node output layer with linear activation. The resulting network has a total of 11393 parameters, i.e. (144 × 64) + (64 × 32) + (32 × 1) = 11296 weights and 64 + 32 + 1 = 97 biases. The energies of all atoms are then summed to obtain the total energy of the configuration. The force on each atom can be obtained analytically:

F_i = -\nabla_{R_i} E = -\sum_j \sum_\mu \frac{\partial E_j}{\partial G_j^\mu} \frac{\partial G_j^\mu}{\partial R_i},

where the atom index j runs over all atoms within the cutoff distance of atom i, and the index µ runs over the descriptor components.
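The architecture and its parameter count can be sketched in plain NumPy. The Gaussian activation is taken here as exp(−x²), an assumption about the exact functional form, and the weights are random rather than trained:

```python
import numpy as np

def make_layer(n_in, n_out, rng):
    """Random all-to-all-connected layer: weight matrix plus bias vector."""
    return rng.standard_normal((n_in, n_out)) * 0.1, np.zeros(n_out)

def atomic_energy(g, params):
    """Feed one 144-component environment descriptor through the
    two-hidden-layer network (64 and 32 Gaussian nodes, linear output)."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h1 = np.exp(-(g @ w1 + b1) ** 2)   # Gaussian activation (assumed exp(-x^2))
    h2 = np.exp(-(h1 @ w2 + b2) ** 2)
    return (h2 @ w3 + b3)[0]           # linear single-node output

def total_energy(descriptors, params):
    """Behler-Parrinello total energy: sum of atomic contributions."""
    return sum(atomic_energy(g, params) for g in descriptors)

rng = np.random.default_rng(0)
params = [make_layer(a, b, rng) for a, b in ((144, 64), (64, 32), (32, 1))]
n_params = sum(w.size + b.size for w, b in params)
print(n_params)   # 11393 = 11296 weights + 97 biases, as in the text

e = total_energy(rng.standard_normal((5, 144)), params)  # toy 5-atom system
```

In a production potential the forces would come from differentiating this function with respect to atomic positions through the descriptors.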
During training, the weight and bias parameters W are optimized with the Adam algorithm 65 , using gradients obtained on randomly selected subsets (minibatches) of the data. The loss function of this stochastic optimization problem is defined as the sum of two contributions: one based on the total energy (Eq. 6) and one based on the force on each atom (Eq. 7). In the energy term, E^DFT_c is the ground-truth total energy obtained via DFT and E_c is the NN prediction for the total energy of a given configuration c, consisting of N_c atoms in the unit cell. The second part of this term exponentially penalizes outliers while keeping the exponent normalized; a is a constant that allows tuning of this penalty, and a = 5 is used in this study. In the force contribution to the loss, for any atom i of configuration c, F^DFT_i is the ground-truth force obtained via DFT and F_i is the NN prediction for it. γ_F is a user-defined parameter that controls the scale of this loss component; the results reported are obtained with γ_F = 0.5.
An L₂-norm regularization term with a small coefficient γ_R = 10⁻⁴ is also added to prevent weights from becoming spuriously large, so that the total loss is the sum of the energy, force and regularization contributions. All models are trained starting from random weights and an initial learning rate α₀ = 0.001. The learning rate is decreased exponentially with optimization step t following α(t) = α₀ r^{t/τ}, with decay rate r = 0.96 and decay step τ = 3200. A batch size of 128 data points is used throughout the study.
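The learning-rate schedule is easy to verify numerically; the following is a direct transcription of α(t) = α₀ r^{t/τ} with the values quoted above:

```python
def learning_rate(t, alpha0=0.001, r=0.96, tau=3200):
    """Exponentially decayed learning rate: alpha(t) = alpha0 * r**(t / tau)."""
    return alpha0 * r ** (t / tau)

print(learning_rate(0))       # 0.001 (initial rate)
print(learning_rate(3200))    # ~0.00096, i.e. one factor of r after tau steps
```

After ~56000 steps the rate has roughly halved (0.96^(56000/3200) ≈ 0.49), giving a sense of the training horizon this schedule implies.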

F. Atomic Environment Descriptors
We use Behler-Parrinello symmetry functions 50 as local atomic descriptors. These functions include a two-body and a three-body term, referred to as the radial and angular descriptor, respectively. We use a modified version of the original angular descriptor 51 , as implemented and detailed in the PANNA package 52 . The radial descriptor function is defined as:

G_i^{rad}[s] = \sum_{j \neq i} e^{-\eta (R_{ij} - R_s)^2} f_c(R_{ij}),

where η and the set of Gaussian centers R_s are user-defined parameters of the descriptor. The sum over j runs over all atoms whose distance R_ij from the central atom i is within the cutoff distance R_c. The cutoff function f_c is defined as:

f_c(R_{ij}) = \begin{cases} \frac{1}{2}\left[\cos\left(\frac{\pi R_{ij}}{R_c}\right) + 1\right] & R_{ij} \le R_c \\ 0 & R_{ij} > R_c. \end{cases}

The angular part of the descriptor with central atom i is defined as:

G_i^{ang}[s] = 2^{1-\zeta} \sum_{j,k \neq i} \left[1 + \cos(\theta_{ijk} - \theta_s)\right]^\zeta e^{-\eta\left[\frac{1}{2}(R_{ij} + R_{ik}) - R_s\right]^2} f_c(R_{ij}) f_c(R_{ik}). (12)

The sum runs over all pairs of neighbours of atom i, indexed j and k, whose distances R_ij and R_ik are within the cutoff radius R_c and which form an angle θ_ijk with the central atom. Here η, ζ, and the sets of θ_s and R_s are the user-defined parameters of the descriptor. We note that the descriptor as written in Eq. 12 has a discontinuous derivative with respect to atomic positions when atoms are collinear. To restore continuity, we replace the cos(θ_ijk − θ_s) term with the following expression:

\frac{2\left[\cos\theta_{ijk}\cos\theta_s + \sqrt{1 - \cos^2\theta_{ijk} + \epsilon^2 \sin^2\theta_s}\,\sin\theta_s\right]}{1 + \sqrt{1 + \epsilon^2 \sin^2\theta_s}}, (13)

where we introduce a small normalization parameter, ε, such that the expression approaches cos(θ_ijk − θ_s) in the limit ε → 0. In this work ε = 0.001 was used, while values between 0.001 and 0.01 were found to yield stable dynamics and equivalent network potentials for any practical purpose.
The radial descriptors are parametrized with η = 16.0 Å⁻², with 32 equidistant Gaussian centers R_s distributed between 0.5 Å and 4.6 Å. For the angular part, η = 10.0 Å⁻², ζ = 23.0, 8 equidistant R_s are distributed between 0.5 Å and 4.0 Å, and 14 θ_s are chosen between π/28 and 27π/28 with spacing π/14. The cutoff R_c is 4.6 Å for the radial and 4.0 Å for the angular descriptors, respectively. The resulting descriptor has a total of 32 + 14 × 8 = 144 components per atomic environment.
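With these parameters, the radial part of the descriptor can be sketched as follows; the exact endpoint convention for the equidistant Gaussian centers is an assumption:

```python
import numpy as np

# Radial descriptor hyperparameters from the text
ETA_RAD = 16.0                            # 1/Angstrom^2
RC_RAD = 4.6                              # Angstrom
RS_RAD = np.linspace(0.5, 4.6, 32)        # 32 equidistant centers (endpoints assumed)

def cutoff(r, rc):
    """Smooth cosine cutoff f_c(R): 0.5*(cos(pi R / Rc) + 1) inside Rc, 0 outside."""
    return np.where(r <= rc, 0.5 * (np.cos(np.pi * r / rc) + 1.0), 0.0)

def radial_descriptor(dists, eta=ETA_RAD, rs=RS_RAD, rc=RC_RAD):
    """BP radial symmetry functions for one atom, given neighbour distances:
    G[s] = sum_j exp(-eta (R_ij - R_s)^2) f_c(R_ij)."""
    d = np.asarray(dists, float)[:, None]               # neighbours x 1
    return np.sum(np.exp(-eta * (d - rs[None, :]) ** 2) * cutoff(d, rc), axis=0)

g = radial_descriptor([1.54, 2.52, 2.95])   # a few diamond-like neighbour distances
print(g.shape)        # (32,) radial components
print(32 + 14 * 8)    # 144 components in the full radial + angular descriptor
```

The angular part follows the same pattern with the 14 × 8 grid of (θ_s, R_s) pairs and the regularized cosine term.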

V. DATA AVAILABILITY
The neural network potential described in this work is released in PANNA 52 format compatible with several molecular dynamics packages via OPENKIM 66 . A native LAMMPS plugin version is also given in the Supplementary Material.