Introduction

Due to their remarkable properties, nanoclusters have gained attention in heterogeneous catalysis.1,2,3 Nanoclusters behave differently from bulk metals, and their catalytic properties are sensitive to changes in size and morphology.4,5,6,7 For example, gold clusters with a diameter of a few nanometres exhibit non-metallic properties due to quantum size effects.8 In the last decade, significant advances have been made in producing nanoparticles with defined composition, size and morphology.1,9,10,11,12 These developments have opened up a tremendous combinatorial and structural space for rational catalyst design, where nanoscale experiments and computational screening can be used to optimize catalysts.13

In this context, the development of new materials for the scalable production of hydrogen is a key challenge, with massive impact on clean-energy technologies.14,15,16 At the cathode of electrolytic water splitting into hydrogen and oxygen, the hydrogen evolution reaction (HER) takes place. To make hydrogen production competitive with other energy storage technologies, the expensive noble metals currently used, especially platinum group metals (PGM), categorised as critical by the European Commission,17 need to be replaced. Some bimetallic alloy nanoclusters, such as copper–titanium,18 exhibit catalytic activity towards HER, so binary combinations of metals are of high interest, particularly if the fraction of PGMs can be significantly reduced.19 Beyond metals, one candidate to replace PGMs is MoS2 nanoclusters. Recent studies of single-layer MoS2 have shown that its electronic band structure can be fine-tuned at the nanoscale.20 Although the material is otherwise semiconducting, the edges of triangular- to hexagonal-shaped nanoclusters exhibit metallic character, and these edges are likely the active sites for HER.20,21,22

The configurational space offered by the wide variety of nanocluster materials, active sites and environmental conditions means that a conventional approach to catalyst optimization, using ab initio methods, is particularly challenging. Hence, very recently there has been a surge in attempts to apply machine learning (ML) approaches to modelling catalytic systems.23,24,25,26,27,28

In this work, we begin by considering the latest developments in descriptors for ML in materials science, as yet untested in nanocatalytic systems, and compare them in terms of accuracy and efficiency for characterizing a particular catalytic reaction, HER:

$$\mathrm{H}^+ + \mathrm{e}^- \to \tfrac{1}{2}\mathrm{H}_2.$$

This stands out as a relatively simple reaction with one intermediate state—adsorbed hydrogen on the catalyst surface. The rate of the reaction on a catalyst surface (denoted below as *) is determined by the hydrogen adsorption free energy ΔGH of the elementary Volmer step:

$$\mathrm{H}^+ + \mathrm{e}^- + {}^* \to \mathrm{H}^*.$$

According to the Sabatier principle, hydrogen should neither bind too weakly nor too strongly. This general principle explains why ΔGH can reasonably describe catalytic activity. Optimally, nanoclusters should have adsorption sites with ΔGH ≈ 0 to be considered catalyst candidates.29,30 Since this quantity is accessible by ab initio methods, directly from the adsorption energy of hydrogen, materials can be pre-screened computationally. Our approach is to build a large data set of hydrogen adsorption energies on a variety of nanoclusters, characterize this with appropriate structural descriptors, and then train a model to predict these energies for an arbitrary site based on its description.

Results

Potential energy scan of sample clusters

As initial data sets, we started by mapping out the energy landscape of hydrogen adsorbed on the surface of one sample cluster for each system, MoS2 and AuCu.

The two nanoclusters were fully scanned with respect to the hydrogen position and are depicted in Fig. 1. Figure 1a shows a potential energy scan of a triangular-shaped sample MoS2 cluster with molybdenum-terminated edges (Fig. 1b). The cluster Au40Cu40-H had a flatter potential energy surface (Fig. 1c) than MoS2-H, and no clear patterns were apparent. MoS2-H, on the other hand, had three distinct global minima at the edges where hydrogen bound to molybdenum. Since the cluster had near-C3 symmetry, the local environments of the three minima were essentially equivalent. When hydrogen was bound at corner sites, ΔEH increased, while the highest-energy positions were observed on the surface sulphur atoms. Even though the C3 symmetry of the cluster was broken, ΔEH remained similar at different edges and corners.

Fig. 1

a Hydrogen position scan on the surface of a triangular-shaped MoS2 cluster (b). c Hydrogen position scan on the surface of a Au40Cu40 cluster (d)

Machine learning on single clusters

The data sets MoS2(single) and Au40Cu40(single) each contained 10,000 DFT-based single-point calculations of ΔEH for hydrogen positioned on the surface of the same cluster. We were interested in how many points were needed to predict the potential energy surface by interpolation. However, we did not conduct this interpolation in real space, but in feature space using kernel ridge regression (KRR). Thus, two points far away from each other in real space were close in feature space if their structures were similar. The feature space was spanned by the descriptors Atom-Centered Symmetry Functions (ACSF), Many-Body Tensor Representation (MBTR) or Smooth Overlap of Atomic Positions (SOAP). The goal was to reach an accuracy of 0.1 eV, which would allow us to make reasonable predictions of ΔEH for an arbitrary system.
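The workflow behind these learning curves can be summarised in a short sketch. The following is a minimal illustration using the DScribe and scikit-learn libraries, not the authors' production pipeline; the file names, the assumption that hydrogen is the last atom in each frame, and the hyper-parameter values are all illustrative, and the DScribe argument names follow recent releases of that library.

```python
# Minimal sketch of the descriptor + KRR workflow (illustrative, not the
# authors' exact pipeline). Assumes one cluster+H geometry per frame, with
# the adsorbed hydrogen as the last atom, and energies in a plain text file.
import numpy as np
from ase.io import read
from dscribe.descriptors import SOAP  # argument names per recent DScribe
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

frames = read("mos2_single.xyz", index=":")   # hypothetical file name
energies = np.loadtxt("delta_E_H.txt")        # hypothetical file name

soap = SOAP(species=["Mo", "S", "H"], r_cut=6.0, n_max=10, l_max=9,
            periodic=False)

# One local SOAP vector per structure, centred on the adsorbed hydrogen.
X = np.array([soap.create(f, centers=[len(f) - 1])[0] for f in frames])

X_tr, X_te, y_tr, y_te = train_test_split(X, energies, train_size=300)
krr = KernelRidge(kernel="rbf", alpha=1e-8, gamma=1e-4)  # tune by CV
krr.fit(X_tr, y_tr)
print("MAE (eV):", mean_absolute_error(y_te, krr.predict(X_te)))
```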

Figure 2a shows learning curves for predicting ΔEH at random positions around the triangular MoS2 cluster (Fig. 1b). In this example only, we included results for the Coulomb Matrix (CM) descriptor in order to see how it fares at adsorption energy prediction. When we transformed the global descriptor into a local CM, we observed improved accuracy. This was due to the strong dependence of ΔEH on the local environment. In general, the CM had a significantly higher mean absolute error (MAE), which might be due to its values ranging over many orders of magnitude,31 see also Fig. 3. In fairness to CM, its accuracy can be increased somewhat by randomly sorting it, thus smoothening the feature space.32 ACSF performed comparably to ACSFH (ACSF with hydrogen as the only center atom) and MBTR with a training set larger than 3000, and reached the threshold of 0.1 eV at about 900 training points. ACSFH required only about 400 training points. SOAP and MBTR, on the other hand, had a MAE of 0.1 eV with only 300 training points, while SOAP also performed best at large training set sizes.

Fig. 2

Learning curves for different data sets show the MAE for different training set sizes. The descriptors CM, SOAP, MBTR and ACSF were used as features in KRR to predict ΔEH. The following data sets were used: a MoS2(single), b Au40Cu40(single), c MoS2(multi), d AuCu(multi)

Fig. 3

Mean of data point pairs on the axes of Δ(ΔEH) and (dis)similarity, defined as the Euclidean distance \(d = \left\| x - x\prime \right\|_2\) between descriptor vectors, within bins of size 0.1. The coloured area highlights the standard deviation in those bins. The data set MoS2(multi) was used to compare the descriptors CM (cyan, offset 1.0 eV), SOAP (red, offset 0.7 eV), MBTR (blue, offset 0.3 eV) and ACSF (green)

Figure 2b shows learning curves for predicting ΔEH at random positions around a medium-sized AuCu cluster. SOAP and MBTR again performed equally well, reaching the threshold of 0.1 eV with about 300 training points. Remarkably, ACSFH reached 0.1 eV MAE with only 100 training points, but it exhibited a shallow learning curve. Although ACSF had a lower accuracy at small training set sizes, it overtook ACSFH and MBTR with a training set larger than 3000. The low error with a large training set makes ACSF an excellent choice for molecular dynamics simulations where high accuracy is needed, for example simulations over many time steps where even small errors can propagate rapidly. A machine learning potential fitted to a large DFT data set provides energies close to the reference method.31 SOAP showed a similarly steep learning curve to ACSF, but offset to a lower MAE at all training set sizes.

To summarise the results for both test systems, ACSF needed a large training set, but then it was as good as or even better than MBTR. This was due to the many symmetry functions used. If symmetry functions were eliminated by feature selection, the performance of ACSF at smaller training set sizes would likely improve.

Indeed, a principal component analysis revealed that, for both data sets, 130 components explained 99% of the variance. A sensible choice was to restrict the features to ACSFH, the local version of ACSF. As expected, ACSFH performed better than global ACSF for smaller training set sizes. Systematic feature selection using e.g., mutual information could further reduce the MAE for small training set sizes. Ultimately, ACSFH, MBTR and SOAP showed comparable MAE at smaller training set sizes.
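A variance check of this kind is a one-liner with scikit-learn; the sketch below uses a random stand-in matrix in place of the real ACSF features.

```python
# Sketch of the PCA variance check described above. X_acsf is a random
# stand-in for the real ACSF feature matrix (rows = data points).
import numpy as np
from sklearn.decomposition import PCA

X_acsf = np.random.rand(1000, 500)      # stand-in ACSF features
pca = PCA(n_components=0.99)            # keep 99% of the variance
pca.fit(X_acsf)
print("components for 99% variance:", pca.n_components_)
```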

Machine learning on multiple clusters

In the next step, we were interested in whether it was possible to interpolate between hydrogen adsorption sites on different clusters. The data sets MoS2(multi) and AuCu(multi) each contained around 10,000 DFT-based single-point calculations of ΔEH. The data set MoS2(multi) consisted of hydrogen positioned on the surface of 91 MoS2 clusters, with a total of 110 points chosen randomly for each cluster.

Figure 2c shows the learning curve for predicting ΔEH at random positions around multiple MoS2 clusters. The descriptor SOAP reached a MAE of 0.1 eV with a training set size of 4000 (or 44 per cluster). It was estimated before that learning the potential energy surface of a single cluster required 300 training points (MoS2(single)). This comparison clarified that learning on different clusters simultaneously was beneficial and that interpolation in compound space was possible with similar nanoclusters. MBTR got as low as 0.13 eV with a training set size of 9000. The size of ACSF depended on the number of atoms in the system. Since the nanoclusters had different sizes and compositions, it did not make sense to compare atoms other than hydrogen with each other. Hence, the local version of ACSF, ACSFH, was used. Similar to MBTR, it did not reach the threshold of 0.1 eV, but got as close as 0.11 eV with 9000 training points. Since SOAP (here a local descriptor) and MBTR (here a global descriptor) might each contain information the other does not, we tried combining the two. In this case, however, the combined and equally weighted features of MBTR and SOAP did not improve the overall accuracy.

To verify that the results were independent of the system, we repeated the analysis with the data set AuCu(multi) containing 24 small copper–gold clusters with a fixed size of 13 atoms, but different compositions. A total of 420 hydrogen positions were randomly chosen on the surface of each cluster.

Figure 2d shows the learning curve for predicting ΔEH at random positions around multiple AuCu clusters. A MAE of around 0.11 eV was reached at 9000 training points with MBTR and ACSFH. With SOAP, only 2000 training points, or 80 per cluster, were needed to achieve a MAE lower than 0.1 eV. It was estimated before that learning the potential energy surface of a single copper–gold cluster required around 300 training points. Again, this comparison confirmed that learning on different clusters was possible, suggesting that the approach should transfer to other nanocluster systems. Furthermore, the fact that MBTR and SOAP combined did not improve the overall accuracy strongly suggests that the relevant information is contained around the adsorption site. Since SOAP outperformed the other descriptors even though it only contained information about the local environment around hydrogen, it became apparent that size effects of nanoclusters play a minor role (<0.1 eV in our model) in defining ΔEH.

The log–log plots of Fig. 2 emphasize the empirical linear relationship log(MAE) = a − b log(N) for large N, in agreement with ref. 33. The linear relationship in our data sets started at around N = 500–2000, where the different error decay rates became apparent. ACSF and SOAP displayed their superiority over ACSFH and MBTR in this regime.
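The decay rate b can be read off by a straight-line fit in log–log space; the numbers below are made-up placeholders, not the values behind Fig. 2.

```python
# Fit log(MAE) = a - b*log(N) to the tail of a learning curve.
# N and mae are made-up placeholders, not the values behind Fig. 2.
import numpy as np

N = np.array([2000, 4000, 6000, 9000])       # training set sizes
mae = np.array([0.14, 0.11, 0.095, 0.085])   # MAE in eV (placeholder)

slope, a = np.polyfit(np.log(N), np.log(mae), 1)
print(f"decay rate b = {-slope:.2f}")        # b is minus the fitted slope
```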

The purpose of the above data sets was to compare descriptors as well as to investigate the benefit of merging data from diverse structures. The generalization error of the best-performing descriptor should be estimated slightly higher, since the test sets acted as validation sets to pick the best descriptor. An estimate of the generalization error will be presented for MoS2 in Fig. 5.

To visualise that similar local environments indeed do not give vastly different ΔEH, 1000 data point pairs were selected with the lowest (dis)similarity \(d = \left\| x - x\prime \right\|_2\) between descriptor vectors, the descriptor being SOAP, MBTR or ACSF. In Fig. 3, a histogram plot shows pairs of local environments at a certain (dis)similarity d (taken from the data set MoS2(multi)) and the mean of their difference in energy, Δ(ΔEH). The mean difference in ΔEH increased monotonically with d. As depicted by the increasing standard deviation, the more dissimilar the data points were, the wider the spread of ΔEH, which indicated that the property changed smoothly in feature space. On average, MBTR had a slightly higher Δ(ΔEH) than SOAP or ACSF. For comparison, CM exhibited a much less smooth feature space. In summary, SOAP outperformed MBTR and ACSFH, and the information needed to explain adsorption energies is contained in the local environment. The property of interest, ΔEH, changed smoothly in the feature space spanned by SOAP even though clusters of different sizes were present.
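The binning behind Fig. 3 can be reproduced with a few lines of NumPy/SciPy; the arrays below are random stand-ins for the real descriptor matrix and energies.

```python
# Sketch of the Fig. 3 analysis: bin descriptor-space pair distances and
# average the corresponding adsorption energy differences. X and y are
# random stand-ins for real descriptors and energies.
import numpy as np
from scipy.spatial.distance import pdist

X = np.random.rand(500, 100)     # stand-in descriptor matrix
y = np.random.rand(500)          # stand-in ΔE_H values (eV)

d = pdist(X)                     # pairwise descriptor distances
dE = pdist(y[:, None])           # pairwise |ΔE_H(i) - ΔE_H(j)|

bins = np.arange(0.0, d.max() + 0.1, 0.1)
idx = np.digitize(d, bins)
mean_dE = [dE[idx == i].mean() for i in np.unique(idx)]
std_dE = [dE[idx == i].std() for i in np.unique(idx)]
```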

As depicted in Fig. 3, similar adsorption sites have similar ΔEH. In order to achieve predictive power with as few training points as possible, clustered data points should be avoided; instead, points should be selected such that they are approximately evenly spaced. The data set MoS2(single) is a good example showing that the accuracy depends on whether the training points are chosen randomly or distributed evenly. Since significantly more data points were sampled on the sulphur surface of MoS2 than on its Mo-terminated edges, we suspected a biased data set. Descriptors can be used to select a data set that is evenly distributed in feature space (spanned by the descriptor).

The greedy algorithm farthest point sampling (FPS) was applied to obtain a set of the most dissimilar training points;34 a minimal implementation is sketched after Fig. 4. In Fig. 4, the MAE of random training and test sets was plotted and contrasted against FPS-sampled training and test sets. Using FPS improved the overall accuracy significantly at smaller training set sizes, but the effect soon became less apparent. The choice of the test set did not significantly affect the MAE. At a large enough training set size of 500–1500, selecting training points no longer made a difference. However, when the training set size was in the range of interest (MAE around 0.1 eV), the difference was significant. We interpreted this result to mean that the randomly selected data set was biased and not evenly distributed. In order to reduce data set size, descriptors could be used to scan local environments and represent them evenly, without bias towards more abundant structural patterns.

Fig. 4

The data set MoS2(single) was sampled randomly or with FPS in SOAP feature space, and the mean absolute errors compared. Random training and testing is shown in red, whereas FPS-sampled training with FPS-sampled or random testing is shown in green or blue, respectively
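FPS itself is a few lines of NumPy: start from one point and repeatedly add the point farthest from everything already selected. This is a generic sketch of the algorithm, not the authors' implementation.

```python
# Greedy farthest point sampling (FPS): repeatedly add the point with the
# largest distance to its nearest already-selected neighbour.
import numpy as np

def farthest_point_sampling(X, n_select, seed=0):
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(X)))]
    # distance from every point to the nearest selected point so far
    d_min = np.linalg.norm(X - X[selected[0]], axis=1)
    while len(selected) < n_select:
        nxt = int(np.argmax(d_min))
        selected.append(nxt)
        d_min = np.minimum(d_min, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(selected)

X = np.random.rand(1000, 50)               # stand-in SOAP feature matrix
train_idx = farthest_point_sampling(X, 200)
```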

Prediction of energy distribution of potential energy scan

Next, we investigated to what degree the potential energy surface of a single cluster can be inferred from a data set of multiple clusters. The data set MoS2(multi) was used as a training set to predict ΔEH on the surface of the sample cluster of MoS2(single), where a large test set was available. It should be mentioned that the sample MoS2 cluster was part of the data set MoS2(multi) with 110 data points.

Figure 5a shows the parity plot of ΔEH for the test set MoS2(single). Here, SOAP was chosen as the descriptor. An overall MAE of 0.13 eV was reached. In the sparsely sampled high-energy region, the error was significantly higher than average. In the sparsely sampled low-energy region, however, the error was much lower. Since stable adsorption sites will not be found in the high-energy region, the accuracy of predictions could be improved further by sampling more in the low-energy region. As can be seen from the dashed line, the errors introduced by predicting ΔEH with descriptors were statistical rather than systematic, since the predictions were centred around y = x. Figure 5a also shows the distribution of ΔEH for the test set. When focusing on global rather than local properties, the MAE does not have to be as low as 0.1 eV; rather, the energy distribution should be predicted accurately. The predicted energy distribution was in good agreement with the DFT energy distribution. Depending on the desired accuracy, smaller data sets than the ones we used might suffice to reliably predict the energy distribution.

Fig. 5

Parity plots of predicted against calculated ΔEH (a) and ΔGH (b), together with histograms of predicted (red) and calculated (black) energy distributions. a The data set of multiple clusters MoS2(multi) was used as a training set and the single-cluster data set MoS2(single) was used as the displayed test set. b The data set MoS2(multi) was used as a training set and a data set of local minima on frozen clusters was used as the displayed test set

Finally, we tested whether the ΔGH of local minima on the potential energy surface could be predicted accurately from single-point calculations alone, going from ΔEH to ΔGH by adding a constant shift. Hydrogen on top of around 1000 MoS2 surface atoms of the data set MoS2(multi) was relaxed while the cluster itself was kept frozen. SOAP descriptors were created at the relaxed positions. The data set MoS2(multi) was used as a training set to predict ΔGH at the relaxed hydrogen adsorption sites. Figure 5b shows the resulting parity plot. An overall MAE of 0.12 eV was reached, similar to before; however, the plot showed several outliers. This was probably due to the fact that local environments of the low-energy region were under-represented in the data set MoS2(multi). Denser sampling in the region of interest could reduce the likelihood of outliers and further lower the MAE.

Figure 5b also shows the distribution of ΔGH for the sampled hydrogen adsorption sites. The predicted energy distribution was in good agreement with the DFT energy distribution. There seemed to be no systematic over- or under-estimation of the property. KRR failed to predict the lowest-energy adsorption sites below ΔGH = −0.4 eV. This was again due to poor sampling in the low-energy region. Even though only random positions were taken on the surfaces of several nanoclusters, the combined database allowed extrapolation to the local minima with satisfactory accuracy. A smarter selection of points in the feature space spanned by a descriptor opens up a new way of finding adsorption sites on similar systems.

To show the limitation of this method, we attempted an aggressive extrapolation from the data set AuCu(multi), which contains only 13-atom clusters, to predict ΔEH on the surface of the much larger sample cluster Au40Cu40. Figure 6 shows a parity plot using the previously best-performing descriptor, SOAP.

Fig. 6

Parity plot of predicted against calculated ΔEH together with a histogram of predicted (red) and calculated (black) energy distributions. The data set of multiple clusters AuCu(multi) was used as a training set and the single-cluster data set Au40Cu40(single) was used as the displayed test set

SOAP showed a learning tendency with slight under-estimation. However, the MAE of 0.25 eV was too high, mainly due to the under-estimation of the high-energy regime. It can also be noted that the parity plot featured two clusters of points, which indicated that only part of the local environments of Au40Cu40 were represented in the training set.

Discussion

We analysed the performance of state-of-the-art atomic structural descriptors (SOAP, MBTR and ACSF) when used to predict the hydrogen adsorption (free) energy on the surface of nanoclusters. As expected, we found that none of the descriptors, which had been designed for molecules and crystals, is optimized for nanoclusters. In general, we observed that learning on one cluster at a time required unnecessarily large training sets to achieve good accuracy—this can be improved by merging data from many different nanoclusters in the training set. Since SOAP performed significantly better, we deem it a good choice for adsorption energy predictions. Our data sets did not require the inclusion of global information, as could be seen from the combination of SOAP and MBTR, so the local environment dominates the influence on the adsorption energy. It is, however, possible that a global addition improves the learning when, e.g., dopants or defects are added. Descriptor improvements might be possible by combining other descriptors, optimising the weighting functions or other parameters of MBTR and SOAP, or even by constructing a new descriptor encompassing the special structural features of nanoclusters such as size, shape and surface morphology. Recently, a multi-scale SOAP kernel has been developed which could incorporate missing information while still retaining the local nature of the descriptor.34 This new approach will be the subject of future work. Nevertheless, given sufficient training data, all descriptors except CM performed satisfactorily when used as features in KRR.

We identified a few shortcomings of ACSF, MBTR and SOAP with respect to the description of nanoclusters. SOAP, in the implementation used here, only considers the local environment of hydrogen within a certain cutoff. There are, however, global SOAP descriptors which take into account the local environments of all atoms; their performance on nanoclusters will be investigated in the future. ACSF, in order to be size-consistent, was feature-selected to the local descriptor ACSFH, whose accuracy improved only slowly with increasing training set size. Better performance with smaller training set sizes could be achieved by feature-selecting symmetry functions. MBTR, as a global, size-consistent descriptor, could not exhibit its conceptual advantage over the local descriptors, since the local environment mostly determined ΔEH.

Many interesting studies could build upon the presented results. In the future, we plan to make more complex databases where the compound space is enlarged by defects or dopants. Ternary metallic clusters, with their increased compositional space, are particularly challenging for conventional ab initio approaches and could be systems of interest for ML optimization. In terms of the DFT data generation itself, by including information about local similarity encoded in the descriptors, it should also be possible to reduce the number of relaxation steps needed to find local minima. In conclusion, our results demonstrate that predicting properties based on descriptors removes redundancy from a batch of similar nanocluster calculations—the near-symmetric structures with repeating patterns offer many similar local environments perfectly suited to descriptor methods.

Methods

Density functional theory calculations

All electronic structure calculations were performed with the CP2K package35 at the density functional theory (DFT) level, where orbitals and electron density were represented by Gaussian and plane-wave (GPW) basis sets. The exchange-correlation energy was approximated using the spin-polarized GGA functional of Perdew–Burke–Ernzerhof (PBE).36 Short-ranged double-ζ valence plus polarization molecularly optimized basis sets (MOLOPT-SR-DZVP)37 and norm-conserving Goedecker–Teter–Hutter (GTH) pseudopotentials38,39,40 were assigned to all atom types. Van der Waals interactions were taken into account with the D3 method of Grimme et al. with Becke–Johnson damping (DFT-D3(BJ)).41,42 The energy cutoff for the auxiliary plane-wave basis was set to 550 Ry and the cutoff of the reference grid was set to 60 Ry. Atomic positions were optimised using the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm until the maximum force component reached 0.02 eV/Å. At least 8 Å of vacuum was added in all Cartesian directions of the simulation box. The crystal structures of bulk gold, copper and MoS2 obtained at this DFT level were in good agreement with experiments. In Supplementary Information, it is shown that the double-ζ basis set gives results in good agreement with TZV2P.
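For orientation, a setup along these lines can be scripted through ASE's CP2K interface. This is our hedged reconstruction, not the authors' input files; the extra input block follows standard CP2K section syntax, and file names are placeholders.

```python
# Hedged sketch of a comparable CP2K setup via ASE (an assumption; the
# authors' actual input files are not reproduced here).
from ase.calculators.cp2k import CP2K
from ase.io import read
from ase.optimize import BFGS
from ase.units import Rydberg

# Extra CP2K input: relative cutoff of the reference grid and D3(BJ) vdW.
extra_inp = """
&FORCE_EVAL
  &DFT
    &MGRID
      REL_CUTOFF 60
    &END MGRID
    &XC
      &VDW_POTENTIAL
        POTENTIAL_TYPE PAIR_POTENTIAL
        &PAIR_POTENTIAL
          TYPE DFTD3(BJ)
          PARAMETER_FILE_NAME dftd3.dat
          REFERENCE_FUNCTIONAL PBE
        &END PAIR_POTENTIAL
      &END VDW_POTENTIAL
    &END XC
  &END DFT
&END FORCE_EVAL
"""

cluster = read("cluster_plus_H.xyz")        # hypothetical file name
cluster.center(vacuum=8.0)                  # >= 8 Å vacuum on all sides
cluster.calc = CP2K(xc="PBE",
                    basis_set="DZVP-MOLOPT-SR-GTH",
                    pseudo_potential="GTH-PBE",
                    cutoff=550 * Rydberg,   # auxiliary PW cutoff
                    uks=True,               # spin-polarized
                    inp=extra_inp)
BFGS(cluster).run(fmax=0.02)                # force criterion in eV/Å
```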

For relaxed hydrogen structures, we calculated the Gibbs free energy of adsorbed hydrogen, ΔGH, as

$${\mathrm{\Delta}}G_H = {\mathrm{\Delta}}E_H + {\mathrm{\Delta}}E_{\rm ZPE} - T{\mathrm{\Delta}}S_H,$$
$${\mathrm{\Delta}}E_H = E_{\rm Cluster+H} - E_{\rm Cluster} - \tfrac{1}{2}E_{H_2} + E_{\rm BSSE},$$

where ECluster+H, ECluster and \(E_{H_2}\) denote the total energies of the cluster with adsorbed hydrogen, the solitary cluster and molecular hydrogen in the gas phase, respectively. The term EBSSE corrected for the basis set superposition error. The term ΔEZPE − TΔSH was approximated by literature values at standard conditions; in the case of MoS2, the zero-point energy minus the entropic term was estimated as 0.29 eV in ref. 43. For the system AuCu, ΔSH was approximated by \(\frac{1}{2}{\mathrm{\Delta}}S_{H_2}^0\), half the entropy of H2 in the gas phase at standard conditions, as in ref. 43. The zero-point energies of hydrogen on copper (0.17 eV) and gold (0.14 eV) from ref. 44 differed only slightly and were averaged as an approximation, which resulted in ΔEZPE − 298 K·ΔSH ≈ 0.22 eV. This approximation resulted in a constant shift in adsorption energy.
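Since the correction is a constant shift per system, converting a DFT ΔEH to ΔGH is simple arithmetic, as in this small sketch (the helper function is ours; the shift values are the ones quoted above):

```python
# Constant-shift conversion from ΔE_H to ΔG_H (helper function is ours;
# shift values in eV are the literature estimates quoted above).
def delta_G_H(delta_E_H, system="MoS2"):
    shift = {"MoS2": 0.29, "AuCu": 0.22}  # ΔE_ZPE - TΔS_H at 298 K
    return delta_E_H + shift[system]

print(delta_G_H(-0.35, "MoS2"))  # -0.06 eV, close to the ΔG_H ≈ 0 optimum
```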

Nanocluster data sets

We created several DFT data sets based on nanoclusters of the 2D-material MoS2 and the metal alloy AuCu. Two nanoclusters were fully scanned with respect to the hydrogen position. They are as follows:

  • a triangular MoS2 cluster with Mo-terminated edges

  • a medium-sized near-spherical Au40Cu40 cluster

The structures are depicted in Fig. 1. The single-cluster data sets, named hereafter MoS2(single) and Au40Cu40(single), comprised 10,000 single-point calculations of single hydrogen atoms adsorbed on the surface. Hydrogen was positioned randomly at a distance of 130–220 pm from the cluster, where the random points were at least 0.1 Å from each other (a sketch of this placement rule follows Fig. 7). Furthermore, data sets containing hydrogen adsorbed on different nanoclusters were produced in a similar fashion. The small AuCu clusters contained 13 atoms each, with the number of gold atoms ranging from 4 to 9; we wanted to analyse clusters of the same size, but with different compositions. For each of those 24 clusters, we calculated 420 data points of adsorbed hydrogen. The combined data set, named hereafter AuCu(multi), had a size of around 10,000 points. Analogously, the data set MoS2(multi) comprised 91 different MoS2 nanoclusters, so that it also contained around 10,000 data points. MoS2 clusters of different size (ranging from 4 to 11 Mo atoms at the edge), shape and edge-termination were chosen based on ref. 22. In order to create clusters of different shapes, ranging from triangular to hexagonal, corners were capped, leaving behind 3 additional sulphur-terminated edges. First, one Mo atom was capped, then 3, then 6, until the cluster had a hexagonal shape. Different edge types were also present in the data set, with sulphur coverages of 0, 25, 50 and 100% equally represented. A few examples are shown in Fig. 7; further edge structures can be found in ref. 22.

Fig. 7

Four example MoS2 clusters illustrate different sizes, shapes and edge-terminations: a small triangular cluster, b hexagonal cluster with a sulphur coverage of 50%, c triangular cluster with capped corners, terminated by 100% sulphur, and d triangular and Mo-terminated (sulphur coverage 0%)
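The placement rule described above (1.3–2.2 Å from the nearest cluster atom, at least 0.1 Å between accepted points) can be implemented by simple rejection sampling; the function below is our illustrative reconstruction, not the authors' script.

```python
# Rejection sampling of hydrogen positions: accept a random point if its
# distance to the nearest cluster atom lies in 1.3-2.2 Å (130-220 pm) and
# it is at least 0.1 Å from every previously accepted point.
import numpy as np
from scipy.spatial import cKDTree

def sample_h_positions(cluster_xyz, n_points, seed=0):
    rng = np.random.default_rng(seed)
    tree = cKDTree(cluster_xyz)
    lo, hi = cluster_xyz.min(0) - 3.0, cluster_xyz.max(0) + 3.0
    accepted = []
    while len(accepted) < n_points:
        p = rng.uniform(lo, hi)
        if not 1.3 <= tree.query(p)[0] <= 2.2:
            continue
        if accepted and np.min(
                np.linalg.norm(np.asarray(accepted) - p, axis=1)) < 0.1:
            continue
        accepted.append(p)
    return np.array(accepted)

positions = sample_h_positions(np.random.rand(20, 3) * 5.0, 100)
```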

Structural descriptors

In general, with a large enough data set containing nanocluster structures, the locations of the hydrogen adsorption sites and their corresponding ΔGH, it is fairly straightforward to develop a predictive model with the help of ML. Ab initio calculations require only the atomic types and relative positions of atoms as input. Hence, Cartesian coordinate or Z-matrix formats contain all the information needed to calculate the total energy of a nanocluster and then derive ΔGH. Those formats, however, have a disadvantage when it comes to interpolation of data or ML. The same structure can be constructed in many different ways—as a result, similar structures might not be treated as similar by the ML algorithm, and discontinuities appear. ML in general requires the input data to be in compact form and in a smooth feature space.

Another structural representation (descriptor) is needed which fulfils several criteria, summarised here.45 A good structural descriptor is:

  • invariant with respect to rotation, translation and homo-nuclear permutation

  • unique—there should be only one way to construct a descriptor for any given structure

  • non-degenerate—no two sets of descriptor features are identical for structures with different relevant properties

  • continuous in the spanned feature space
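The first criterion is easy to verify numerically. The following sketch rotates a molecule and checks that its raw Cartesian coordinates change while a SOAP vector does not; the DScribe API and parameter values are assumptions for illustration.

```python
# Rotation-invariance check: Cartesian coordinates change under rotation,
# a SOAP descriptor (DScribe, assumed API) does not.
import numpy as np
from ase.build import molecule
from dscribe.descriptors import SOAP

atoms = molecule("H2O")
soap = SOAP(species=["H", "O"], r_cut=4.0, n_max=4, l_max=3, periodic=False)

coords_before = atoms.get_positions().copy()
vec_before = soap.create(atoms, centers=[0])

atoms.rotate(37, "z")  # arbitrary rotation about the z axis

print(np.allclose(coords_before, atoms.get_positions()))         # False
print(np.allclose(vec_before, soap.create(atoms, centers=[0])))  # True
```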

Efforts to develop efficient descriptors in materials science have led to a family of approaches successfully applied to molecules and crystals.46,47 In particular, we consider the following popular descriptors (a detailed description of each of the descriptors is available in Supplementary Information):

  • CM is a global descriptor based on pairwise Coulomb repulsion of the nuclei.48

  • ACSF49—for each atom in a system, ACSF expresses radial and angular interactions with neighbouring atoms through symmetry functions.

  • SOAP47,50—SOAP represents the local environment around a center atom by Gaussian-smeared neighbour atom positions made rotationally invariant.

  • MBTR51—MBTR is a global descriptor which groups interactions by atomic type and puts them into a tensor.

Descriptor hyper-parameters

The structural descriptors CM, ACSF, MBTR and SOAP have method-specific parameters which can be fitted to the investigated system. A few performance tests showed that the MAE was sensitive to some of these hyper-parameters. The radial cutoff of the local CM was optimised to 6 Å. The rows and columns of the matrix were sorted with respect to the L2-norm. Regarding ACSF, only the radial cutoff Rc was optimised. For the other parameters, all combinations of sensible values inspired by Behler49 were used to construct symmetry functions. Table 1 shows the values used for the parameters ζ, κ, η, λ and Rs, which in combination formed the symmetry functions of Supplementary Eqs. (S2)–(S5). ACSFH denotes the symmetry functions with hydrogen as the center atom.

Table 1 List of parameters of ACSF

The performance of MBTR depended on several hyper-parameters, namely the Gaussian broadening parameters σ(k2) and σ(k3) as well as the decay exponent d. The other parameters, such as σ(k1) = 5 Å and the grid fineness n(k1) = 100, n(k2) = 900, n(k3) = 360, were kept constant for all data sets. SOAP can in principle be made global by matching local environments with each other, but we used it only locally in this work. The performance of the SOAP descriptor was to a small degree sensitive to the radial cutoff Rc. Other parameters, such as the highest angular contribution lmax = 9 and the highest radial contribution nmax = 10, were kept constant. The aforementioned descriptor parameters were scanned and evaluated on around 1000 data points, a subset of the training set. The optimal parameters are listed in Table 2; a sketch of such a scan follows Table 2.

Table 2 Optimised descriptor hyper-parameters for different data sets
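One way to run such a scan is to score each candidate cutoff by cross-validated KRR error on the subset, as in this sketch (the function and variable names are ours; DScribe argument names follow recent releases of that library):

```python
# Sketch of a descriptor hyper-parameter scan: score each candidate SOAP
# cutoff by cross-validated KRR error on a ~1000-point subset. Names and
# values are illustrative.
import numpy as np
from dscribe.descriptors import SOAP
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import cross_val_score

def score_r_cut(frames, energies, r_cut):
    soap = SOAP(species=["Mo", "S", "H"], r_cut=r_cut, n_max=10, l_max=9)
    X = np.array([soap.create(f, centers=[len(f) - 1])[0] for f in frames])
    mae = -cross_val_score(KernelRidge(kernel="rbf"), X, energies,
                           scoring="neg_mean_absolute_error", cv=5).mean()
    return mae

# best_r_cut = min([4.0, 5.0, 6.0, 7.0],
#                  key=lambda rc: score_r_cut(frames, energies, rc))
```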

Kernel ridge regression

For medium-sized data sets (1,000–10,000 points), kernel ridge regression is a fast and accurate ML method. In ref. 52, KRR performed best at predicting atomization energies when paired with the descriptor HDAD (histograms of distances, angles and dihedrals), which is conceptually similar to the descriptors we used; this supported our choice of KRR. Of the ML models in ref. 52, graph convolution neural networks were not applicable to our descriptors, leaving random forest regression as the only other sensible choice. However, as shown in Supplementary Information, its performance is significantly worse than that of KRR in our case. In order to predict the properties of new data points, the descriptor features of the training set x are compressed into the kernel matrix K

$${\bf{K}} = \left[ {\begin{array}{*{20}{c}} {K\left( {x_1,x_1} \right)} & \cdots & {K\left( {x_1,x_N} \right)} \cr \vdots & \ddots & \vdots \cr {K\left( {x_N,x_1} \right)} & \cdots & {K\left( {x_N,x_N} \right)} \end{array}} \right],$$

where x1, …, xN are feature vectors of N training points and K(xi, xj) is a symmetric positive semi-definite kernel function (e.g., Gaussian kernel). The property y of a new data point xpred is predicted by inverting the kernel matrix

$$y\left( {x_{{\rm pred}}} \right) = {\bf{k}}_{{\bf{pred}}}^T({\bf{K}} + \lambda {\bf{I}})^{ - 1}{\bf{y}}_{{\bf{train}}},$$

and regularising it by λ. The vector ytrain consists of the properties y1, …, yN of the training set. The kernel vector kpred is defined as:

$${\bf{k}}_{{\bf{pred}}} = \left[ {\begin{array}{*{20}{c}} {K\left( {x_{{\rm pred}},x_1} \right)} \cr \vdots \cr {K\left( {x_{{\rm pred}},x_N} \right)} \end{array}} \right].$$

The method benefits from a continuous feature space and a unique descriptor-property relation. It is worth mentioning that it works well even with large descriptor sizes and small training sets. The computational cost, however, scales with \({\cal O}\left( {N^3} \right)\), which makes it computationally expensive or infeasible for large data sets (>10,000).

The calculated adsorption energies of the training sets were interpolated by kernel ridge regression using the radial basis function kernel

$$K(x,x\prime ) = {\mathrm{exp}}\left( { - \gamma \left\| {x - x\prime } \right\|_2} \right).$$

Based on a comparison of different kernels in Supplementary Information, the RBF kernel performs on par with the SOAP kernel.50 The resulting kernel matrices were used to predict the (free) adsorption energies of the test sets. The exponent γ of the radial basis function and the regularization parameter λ were optimised by fivefold cross-validation. A NumPy transcription of these equations is sketched below.
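The equations above translate almost line-for-line into NumPy; the following is a bare-bones sketch using the same unsquared-norm Gaussian kernel, not an optimised implementation.

```python
# Bare-bones KRR, transcribing the equations above: build K, solve
# (K + λI)·w = y_train, and predict with the kernel vector k_pred.
import numpy as np

def kernel(A, B, gamma):
    # K(x, x') = exp(-gamma * ||x - x'||_2), with the unsquared L2 norm
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return np.exp(-gamma * d)

def krr_predict(X_train, y_train, X_pred, gamma=1e-4, lam=1e-8):
    K = kernel(X_train, X_train, gamma)               # kernel matrix
    w = np.linalg.solve(K + lam * np.eye(len(K)), y_train)
    return kernel(X_pred, X_train, gamma) @ w         # y(x_pred)
```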

When the features of MBTR and SOAP were combined to a new descriptor, they were weighted within the kernel:

$$K(x,x\prime ) = {\mathrm{exp}}\left( { - \gamma \left( {\left\| {x_{MBTR} - x_{MBTR}^\prime } \right\|_2 + q\left\| {x_{SOAP} - x_{SOAP}^\prime } \right\|_2} \right)} \right)$$

where \(q = \frac{{n_{{\rm MBTR}}}}{{n_{{\rm SOAP}}}}\) is the quotient of the number of features in MBTR and SOAP. This accounted for the different descriptor sizes and thus ensured equal weighting of the two descriptors.
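As a final sketch, the weighted combination corresponds to the following kernel function (our transcription of the formula above):

```python
# Weighted MBTR+SOAP kernel, transcribing the combined formula above.
import numpy as np

def combined_kernel(x_mbtr, xp_mbtr, x_soap, xp_soap, gamma):
    q = x_mbtr.size / x_soap.size            # q = n_MBTR / n_SOAP
    d = (np.linalg.norm(x_mbtr - xp_mbtr)
         + q * np.linalg.norm(x_soap - xp_soap))
    return np.exp(-gamma * d)
```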

Data availability

The DFT data that support the findings of this study are available in the NOMAD repository with the identifiers https://doi.org/10.17172/NOMAD/2018.06.12-2 and https://doi.org/10.17172/NOMAD/2018.06.12-1.53,54 The structures and adsorption energies of the data sets can be found as Supplementary Material.