Artificial-intelligence-driven discovery of catalyst genes with application to CO2 activation on semiconductor oxides

Catalytic-materials design requires predictive modeling of the interaction between catalyst and reactants. This is challenging due to the complexity and diversity of structure-property relationships across the chemical space. Here, we report a strategy for a rational design of catalytic materials using the artificial intelligence approach (AI) subgroup discovery. We identify catalyst genes (features) that correlate with mechanisms that trigger, facilitate, or hinder the activation of carbon dioxide (CO2) towards a chemical conversion. The AI model is trained on first-principles data for a broad family of oxides. We demonstrate that surfaces of experimentally identified good catalysts consistently exhibit combinations of genes resulting in a strong elongation of a C-O bond. The same combinations of genes also minimize the OCO-angle, the previously proposed indicator of activation, albeit under the constraint that the Sabatier principle is satisfied. Based on these findings, we propose a set of new promising catalyst materials for CO2 conversion. Here the authors demonstrate an artificial-intelligence based approach to identify catalytic materials features that correlate with mechanisms that trigger, facilitate, or hinder CO2 catalytic reactions.

T he need for converting stable molecules such as carbon dioxide (CO 2 ), methane, or water into useful chemicals and fuels is growing quickly along with the depletion of fossil-fuel reserves and the pollution of the environment [1][2][3] . Such a conversion does not have a satisfactory solution, so far. In particular, CO 2 conversion remains one of the most important societal and technological challenges 1,2,[4][5][6][7][8] .
The general understanding in heterogeneous catalysis is that a stable molecule such as CO 2 needs to be "prepared" before its catalytic conversion occurs. This leads to the notion of molecular activation 9 . However, on one hand, this notion encompasses a very wide variety of processes (adsorption, photo-excitation, application of electric field, etc.) and materials (including compositional and structural variability), and it remains unclear which properties of the catalytic material and the adsorbed molecule determine the final chemistry, what is the relationship between the two sets of properties, and how general this relationship may be. On the other hand, finding the set of descriptive parameters of a catalytic material that characterize the catalytic performance in a particular process, or even in general for a given reactant, would be very valuable, because it would allow us to quickly search for promising candidate catalysts using rational design [10][11][12][13][14][15][16][17] . We call these properties materials genes. The genes do not necessarily correlate with catalytic activity by themselves. Similar to biological genes, their role depends on the combination in which they occur, and can be either beneficial or detrimental to the catalytic activity.
Several strategies exist to find such properties for a given reaction. One way is to explore the free-energy surface for each catalyst candidate, which is a slow and resource-consuming process, and currently computationally unfeasible for many materials on a high-throughput basis. An alternative approach consists in searching for a correlation between experimentally determined material's properties and its catalytic performance. Such a strategy requires consistent experimental measurements at well-defined conditions for a set of materials. To the best of our knowledge, such consistent data have not been reported so far for CO 2 conversion on semiconductor oxides. Moreover, available publications usually do not report unsuccessful experimental results. These issues and a strategy to address them have been recently discussed in our publication 18 .
Yet another strategy is to find an indicator of activation, namely, a property of the system that directly indicates the certain catalytic performance of the material 10 . Indicators are distinguished from materials genes based on a qualitatively different level of computational complexity. The indicator can still be unfeasible or hard for a high-throughput study of hundreds of thousands or millions of materials. However, when it can be calculated for a few tens or hundreds of materials in a reasonable time, these data can then be used to find materials genes that control the value of the indicator. Since a direct search for a relationship between the indicator and catalytic performance of material would also require a consistent set of data of turnover frequency (TOF), selectivity, and yield values, one could instead consider several most promising indicators, find out which materials are good catalysts, and then check which indicators correlate with this observation. This approach also addresses the problem of defining activation in terms of the adsorbed-molecule properties as potential indicators of catalytic activity.
Catalytic conversion of CO 2 requires activation of other reactants as well, e.g., molecular hydrogen, water, or methane. In particular, hydrogen can serve as an environmentally friendly reagent that can be produced by water electrolysis or photosplitting avoiding extra CO 2 emissions [19][20][21] . Also, oxygen vacancies have been proposed as active sites for CO 2 conversion on some materials 22 . Therefore, predictions of catalytic activity of materials for CO 2 conversion can be refined based on analysis of activation of other reactants and defects. An additional challenge is to ensure that the useful products, as well as the surface catalytic activity, are preserved under the conditions of activation and subsequent conversion. While the strong C-O double bonds in CO 2 can be weakened or even broken by adsorption at a solid surface at an elevated temperature, this may also lead to too strong adsorption or further dissociation of the molecule, so that the catalytic surface is poisoned by carbonate or carbon deposits. Weak adsorption, on the other hand, means no activation.
In this work, we combine first-principles calculations with an artificial-intelligence (AI) method, subgroup discovery (SGD), to identify pristine materials properties that optimize indicators of catalytic CO 2 activation. Moreover, SGD allows identifying one or more distinct combinations of materials features (genes) that promote activation. We focus on oxide materials as candidate catalysts. Oxides are structurally and compositionally stable under realistic temperatures and can be less expensive than the traditional precious metal-containing catalysts [23][24][25] . Activation of other reactants and defects are not considered. As shown below, meaningful predictions can be made based solely on the analysis of the adsorption properties of CO 2 on pristine surfaces. This confirms that these properties are good indicators of activation with a viable optimization pathway at least for the chosen class of materials. The Sabatier principle is taken into account by ensuring that the adsorption energy is not too large or too small. In order to ensure reproducibility of our AI data analysis, we provide all necessary metadata (input parameters) and workflow in the easily accessible form of a Jupyter notebook 26 . We argue that, with the ever-growing importance and complexity of AI, such detailed and tutorial documentation is a necessity of good scientific practice. Our approach is applicable to a wider class of materials and molecules, not limited to oxides or CO 2 . Our study by no means encompasses all possible mechanisms of CO 2 conversion on oxide surfaces, but it offers a clear design path among many possible ones.
Results CO 2 activation. We find that on semiconductor oxide surfaces CO 2 is chemisorbed exclusively when the carbon atom binds to surface O-atoms. All other minima of the potential-energy surface are found to be either metastable or correspond to physisorption. Therefore, there are as many different potential chemisorption sites as there are unique O-atoms at the surface. The dataset includes all non-equivalent surface O-atoms on the 141 considered surfaces of 71 materials, which sum up to 255 unique adsorption sites. Among these sites on about 4% (10 out of 255) CO 2 prefers to physisorb, i.e., any chemisorbed state is metastable with respect to the physisorbed one. The physisorption can be easily identified by an almost linear geometry of the adsorbed molecule, and a C-O bond distance very close to the C-O bond length in a gas-phase CO 2 molecule, 1.17 Å.
We considered six different candidate indicators of CO 2 activation, including OCO-angle and C-O bond distance. The bending of the OCO-angle in the adsorbed CO 2 molecule relative to the gas-phase value of 180°(linear configuration) has been previously proposed 27 and is widely accepted as a good indicator of activation. For gas-phase CO 2 , it is understood that the C-O double bond is weakened when an electron is added to the lowest unoccupied orbital, because it is of antibonding (π*) character with a concomitant bending of the molecule. There is a one-toone mapping between the C-O bond length l(C-O) and the OCO-angle in gas-phase CO 2 δ− for a range of δ > 0 (red curve in Fig. 1). However, this is not the case for the adsorbed CO 2 (dots in Fig. 1). There is a subset of adsorbed CO 2 that is close to the ARTICLE NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-022-28042-z red line, but there are many cases where l(C-O) is substantially larger for a given OCO-angle. This is in contrast to metal alloy nanoparticle catalysts, where there is a better correlation between OCO-angle and l(C-O) 28 . Also, a longer C-O bond reflects a weakening and readiness for further chemical transformations. Thus, the bond elongation itself may be an alternative indicator of activation. A look at the adsorbed CO 2 structures reveals that, on sites following the gas-phase correlation, the molecule adsorbs in nearly symmetric adsorption structures with nearly equal length of the two C-O bonds. In the other cases one O-atom of CO 2 is close to surface cation(s), leading to a pronounced asymmetry of the adsorbed molecule.
Other considered potential indicators of activation include Hirshfeld charge 29 of adsorbed CO 2 (a direct indicator of the charge transferred to CO 2 ), the dipole moment of the surface along the surface normal per adsorbed CO 2 molecule (includes charge transfer to the molecule, as well as adsorption induced surface relaxation), the difference in Hirshfeld charges of C and O-atoms in an adsorbed CO 2 molecule (indicates the ionicity of C-O bonds), and the difference in Hirshfeld charges of the O-atoms in the adsorbed molecule (indicates asymmetry of the adsorbed molecule) 9,29 .
Subgroup discovery. To find out which properties (features) of the clean surfaces determine when a given activation indicator is maximized or minimized, we employ the subgroup-discovery (SGD) approach [30][31][32][33][34] . Given a dataset and a target property known for all data points, the SGD algorithm identifies subgroups with "outstanding characteristics" (see further for the criteria for being outstanding) and describes them by means of conjunction of basic propositions (selectors) of the kind "(f 1 < a) AND (f 2 ≥ b) AND ...", where f i is a feature and a, b are threshold values also found by SGD. In the framework of SGD, we call the selected primary features {f 1 , f 2 , ...} materials genes. Thus, SGD identifies both the outstanding subgroups and the relevant materials genes for a given target property.
Obviously, the selectors should only contain features that are much easier to evaluate than the target property. In the presented work, the considered features include properties of gas-phase atoms that build the material, and properties of the pristine material (properties of the bulk phase and of the pristine relaxed surface). Overall 46 primary features have been considered. The full list is presented Supplementary Table 3. Our strategy is to provide an almost exhaustive list of features, and use data analytics to select materials genes from this list. Some of these features have been explored previously as descriptors of catalytic activity for semiconducting and metallic oxides [35][36][37][38] . O 2p-band center features have been shown to correlate with catalytic properties of both semiconducting and metallic oxides 35,37 . In particular, most of the features (or closely related ones) mentioned in ref. 36 , inspired by the work of Grasselli 39 , are included in our set, except oxygen vacancy formation energy, which is relevant for the oxidation catalysis, while here we are interested in partial or complete reduction. Additional important features in our work (see below) include features related to the polarizability of surface cations, which describe the long-range surface response to charged adsorbates. A subset of features from our list has been recently used successfully for predicting catalytic properties of metallic oxides 38 , along with additional features relevant specifically for metallic oxides (such as partial electronic state fillings).
The features selected by the SGD are summarized in Table 1.
The outstanding subgroup should satisfy several criteria. It should be statistically relevant; therefore the subgroups of too small size should be penalized. Target-property values (OCOangle, C-O bond length, etc.) for subgroup samples should be as different as possible from corresponding gas-phase values since their change upon adsorption indicates CO 2 activation 33 . To achieve this, two requirements are imposed simultaneously: (i) The target-property values for subgroup members should be smaller or larger (depending on the target) than a certain value (a cutoff), and (ii) the target-property values are minimized or maximized within the cutoff. The latter condition gives preference to subgroups with smaller or larger target-property values among similarly sized subgroups within the cutoff. The value of the cutoff is a parameter. As it approaches the optimal value of an activation indicator among all data points, additional or alternative materials genes and their combinations leading to stronger activation are identified. We explore the whole range of the parameter for each target property (for OCO-angle-123°, 124°, 126°, 128°, 130°, and 132°; for l(C-O)-1.26 Å, 1.28 Å, and 1.30 Å).
In addition to these criteria, we consider the requirement that adsorption energies are not too strong and not too weak for most of the samples in a subgroup. Strong activation (i.e., strong weakening of the C-O bonds) can be achieved by strong binding to the surface. It is well known that good catalytic performance requires a balanced adsorption strength. This is known as Sabatier principle. In addition to the practical value of identifying subgroups that satisfy this principle, comparison of subgroup selectors obtained with and without this requirement helps to identify combinations of materials features that promote desired changes in target properties and at the same time yield intermediate adsorption energies. Sabatier principle is reflected by a characteristic volcano-type behavior of catalytic activity as a function of adsorption energy of reactants and intermediates. The position of the top of the volcano depends on particular reactions and conditions. It can be estimated from condition |ΔG|~0, where ΔG is the Gibbs free energy of adsorption. For CO 2 adsorption at room temperature and partial CO 2 pressure of 1 atm this condition corresponds to about −0.5 eV adsorption energy 40 . At temperatures around 450°C (typical conditions for CO 2 methanation 41 ) ΔG = 0 corresponds to adsorption energy −1.7 eV 41 . Therefore, for catalytic conversion at low or moderate temperatures this implies that CO 2 adsorption energies should be in the range from between −2.0 and −0.5 eV.
These requirements are implemented in the following quality functions that are maximized during the search for subgroups. In particular, for OCO-angle minimization we use: and for C-O bond maximization the following quality function was applied: where Y is the whole dataset, Z-a subgroup, s-size (number of data points), min and maxminimal or maximal value of the target property, α g and l g are the gas-phase values of OCO-angle and C-O bond distance, 180°and 1.17 Å, respectively, and θ cut is the Heaviside step function which is equal 1 if all data points in the subgroup satisfy the cutoff condition and 0 otherwise. Thus, larger values of the quality function F(Z) are obtained for those subgroups in which minimal (maximal) value of a target property is close to the maximal (minimal) value of the whole sampling with respect to the gas-phase value of CO 2 molecule. The use of maximum/minimum instead of a median is done to ensure that a target property is optimal for as many members of a subgroup as possible. The gas-phase reference values are usually significantly Mulliken electronegativity, minimal and maximal in the pair of gas-phase atoms A and B r −1 min , r −1 max Radii of the maximum value of the Kohn-Sham radial wave functions of the spin-unpolarized spherically symmetric atom for HOMO-1, maximum (max) and minimum (min) in the pair of atoms A and B r +1 min , r +1 max Radii of the maximum value of the Kohn-Sham radial wave functions of the spin-unpolarized spherically symmetric atom for LUMO, maximum (max) and minimum (min) in the pair of atoms A and B M Energy at which the surface O 2p-band projected density of states (PDOS) is maximal Distances from surface O-atom to the first-, second-, and third-nearest cations W Work function W, as the negative of the valence-band maximum (W = −VBM) with respect to vacuum level q min , q max Minimal and maximal Hirshfeld charges of cations in the pair A and B, calculated as an average for all surface cations of a given type Local-order parameter with l = 5 or 6 PC Weighted surface O 2p-band center α O , C 6 O Polarizability and C 6 -coefficient for surface O-atom obtained from many-body dispersion scheme α min , α max , C 6 min , C 6 max Polarizability and C 6 -coefficient for cations, minimal and maximal in the pair A and B, calculated as an average for all surface cations of a given type different from the "chemisorption" subset. Therefore, the term in squared brackets in Eqs. (1) and (2) can noticeably contribute only when the sizes of candidate subgroups are similar. The term u(p) in Eqs. (1) and (2) is added in order to account for Sabatier principle in SGD framework. We have implemented a multitask quality function, where a factor u(p) increases the quality of subgroups with adsorption energies falling within this range. This is formulated in terms of the information gain 34 , i.e., reduction of the normalized Shannon entropy. We perform the SGD for each target property both explicitly accounting for the Sabatier principle and without it. The latter case is equal to u(p) = 1 in Eqs. (1) and (2) 34 .
We note that SGD is qualitatively different from machinelearning classification/regression techniques such as neural networks, kernel regression methods, or decision-tree regression (DTR 42 ) (e.g., random forest). SGD is typically referred to as a supervised descriptive rule-induction technique 43 , i.e., it uses the labels assigned to the data points (the values of the target property) in order to identify patterns in the data distribution (the statistically exceptional data groups) and the rules defining them (the selectors), by optimizing a quality function which is a function of the distribution of values of the target property 43 . While there are apparent similarities between SGD and DTR as both methods yield models in terms of physically interpretable selectors (usually, inequalities) on a selected subset of the input features, the analogy stops at this level, as SGD focuses at (and only at) subgroups from the very beginning and says nothing about the data that are not in the subgroup. In contrast, DTR determines a global partitioning of the input space by minimizing a global quality function, i.e., the quality of a single subset is secondary with respect to the resulting quality of all subsets partitioning the whole dataset. In other words, for finding distinct combinations of materials genes driving desirable changes in a particular target property (possibly different combinations leading to the same result), the SGD approach has significantly higher flexibility and reliability. This is demonstrated below for a DTR analysis for our target properties.
The metadata and workflow for the AI analysis are documented in the Jupyter notebook 26 .
Results of the subgroup discovery. The SGD for OCO-angles was done with Eq. (1) for the quality function, and OCO as a target property, since smaller angles indicate larger charge transferred to the molecular π * orbital. The subgroup selectors obtained with different OCO-angle cutoffs (126°, 128°, 130°, and 132°) with or without the adsorption energy constraint are listed in Table 2 (for more details see the Supplementary Table 4). Analysis of these subgroups reveals that the angle reduction is determined by an interplay of several factors: an electron transfer from the cations to surface O-atoms, delocalization of electron density between cations and O-atoms, and coordination of the surface O-atoms. Without the Sabatier principle constraint, the OCO-angle reduction below 132°is mainly due to the electron accumulation at the O-atom of the clean surface. This is expressed by the conditions of more negative Hirshfeld charge on O-atoms (q O < …), not very low IP of at least one cation (IP max > …), and increased polarizability of the surface O-atom on which CO 2 is adsorbed (C 6 O > ...). Upon adsorption of CO 2 , this charge on the surface O-atom is readily available for transfer to CO 2 . When the Sabatier principle constraint is introduced, the OCO < 132°subgroup also includes sites with a pronounced electron transfer to CO 2 , but with a lower-energy O 2p-band maximum (M < ...) with respect to vacuum level, and a larger kurtosis (kurt > ...). These conditions imply reduced interelectronic repulsion around the surface O-atom achieved by partial delocalization of the charge density.
At lower OCO cutoffs, the subgroup selectors include coordination descriptors Q i , i = 5, 6. Without Sabatier principle, sites with larger Q i are selected, and vice versa. Larger Q i indicates lower coordination of the O-atom. This reduces electron repulsion and therefore facilitates electron transfer to the O-atom of the clean surface. However, this also increases the bonding strength of CO 2 to the surface. This explains why selectors of subgroups obtained with Sabatier principle include the opposite conditions (Q 5 < ...).
Other surface features describing electron distribution are related to Madelung potential: electrostatic potential and field (φ 1.4 , φ 2.6 , and Δφ = φ 1.4 − φ 2.6 ) and distances between the O-atom and surface cations. More open surface structure with larger distances between cations at the O site facilitates charge transfer to adsorbed CO 2 molecule, since the Madelung potential from the nearby cations is reduced. This is reflected in the appearance of propositions involving features d 1 , d 2 , and d 3 . For example, for the OCO ≤ 130°subgroups, imposing energy constraint changes proposition (d 1 > ...) to (d 1 < ...), which implies an increased energy cost for transferring electrons to CO 2 . Larger electric fields Δφ around the adsorption site imply stronger localization of electron density on O-atoms, and thus also improve the efficiency of charge transfer to the adsorbed molecule.
The smaller OCO subgroups with Sabatier principle also include propositions implying increased polarizability of both cations (C 6 min > ...). Another support-defining condition is that the radius of the lowest unoccupied orbital for the metal atoms should not be small (r +1 ≥ ...). This requirement is true for most cations with negative electron affinities ( Supplementary Fig. 4). Analysis of adsorbed CO 2 structures and Hirshfeld charges reveals that this condition together with the higher polarizability of cations at the pristine surface encompasses two scenarios: (i) additional electron transfer to CO 2 upon adsorption and (ii) stronger binding between O-atoms in CO 2 and surface cations. When scenario (ii) dominates, CO 3 δ− anion lies nearly horizontally at the surface, and is bound with nearby cations by chemical bonds via its oxygen atoms. Such a structure leads to small OCO-angles in CO 3 δ− (around 120°), even if charge transfer is limited. Thus, increased bending of adsorbed CO 2 occurs due to charge transfer over larger distances and/or distortion of the adsorbed molecule and the surface, both leading to weaker adsorption. The cases where both scenarios are active include the same sites as in the subgroups with elongated l(C-O), as described below.
In order to obtain the subgroups of adsorption sites with larger l(C-O), we performed the SGD with the quality function Eq. (2) and l(C-O) as target property. The results for l(C-O) cutoffs 1.26, 1.28, and 1.30 Å are summarized in Table 2 and Supplementary  Table 5. In contrast to OCO, the analysis of the obtained top subgroups shows a much less pronounced or no effect of imposing Sabatier principle on the distribution of adsorption energies within the subgroups. This is because sites with too strong adsorption are excluded based on l(C-O) threshold alone, without the need to introduce the energy constraint. For example, the range of l(C-O) for the top l(C-O) > 1.26 Å subgroup without constraining adsorption energies is the same as for the top OCO < 130°subgroup, but it contains significantly more sites with intermediate adsorption energies.
Electron transfer to an adsorbed CO 2 molecule increases both the OCO bending and C-O bond elongation. The main difference between OCO and l(C-O) subgroups is that in the latter an additional mechanism of increasing l(C-O) is in effect, namely a covalent bonding between one O-atom of the CO 2 molecule and the nearest surface cation. This can be concluded from the analysis of adsorption geometries, and correlates with the presence of proposition (EA max ≤ 0.005 eV), selecting cation species that can accept electron density, e.g., from an O-atom in adsorbed CO 2 molecule. Other proposition that appears in most selectors of top subgroups is (d 2 > 2.14 Å) or (d 2 > 2.22 Å)larger distances to the second nearest cation from an O-atom. Larger elongation of the C-O bond is achieved by the asymmetry of the cation types at the surface, where one can bind an O-atom of the adsorbed CO 2 , while the other (located further away) cannot. An example asymmetric CO 2 adsorption structure is shown in Supplementary Fig. 5.
Other propositions indicate a moderate charge transfer to adsorbed CO 2  In general, we find that subgroups obtained with smaller cutoffs do not have a strong overlap with subgroups with larger cutoffs for OCO. In particular, for subgroups with close cutoffs the overlap can be smaller than 50% of the smaller subgroup (but is never below 30%). Interestingly, for l(C-O) the situation is opposite: subgroups with tighter cutoffs are mostly contained in the subgroups for more relaxed constraints. This means that, while larger values of l(C-O) are mainly controlled by the same or additional genes, smaller values of OCO are due to alternative genes. The overlap of OCO subgroups becomes even smaller when Sabatier principle is included, confirming the absence of a universal mechanism for OCO-angle reduction that is compatible with moderate adsorption energy.
In summary, we find that, while an increased electron density at the O adsorption site is necessary for chemisorption and leads to both OCO bending and C-O bond elongation in an adsorbed CO 2 molecule, there are additional actuators for these effects that are different for different target properties. The OCO-angle is in general minimized by increasing electron transfer to the O site. However, this also leads to strong adsorption for many materials (Fig. 2). To satisfy Sabatier principle, the electron transfer to CO 2 must be moderate. This is achieved by delocalization of charge density around O sites and/or by distortion of the adsorbed molecule due to the formation of covalent bonds between O-atoms in CO 2 and surface cations. The largest C-O bond elongations are achieved when both charge transfer to adsorbed CO 2 and the covalent interaction are present, and local geometry around surface O-atom provides the asymmetry in adsorption structure. This mechanism automatically fulfills the Sabatier principle.
The subgroups found by SGD for the dipole moment induced by CO 2 adsorption, its total Hirshfeld charge, and the difference of charges on C and O-atoms significantly overlap with the subgroup of smaller OCO-angles. The subgroup found by maximizing the difference of Hirshfeld charges on O-atoms of an adsorbed CO 2 largely overlaps with the subgroup of sites delivering larger l(C-O). In general, these indicators are not better than OCO or l(C-O). Therefore, below we focus on OCOangle and l(C-O) as indicators of CO 2 activation. More details about the other indicators can be found in Supplementary Discussion.
Comparison with experimental results. To address the question which of the discussed properties can serve as an indicator of the catalytic activity, we compare our predictions to reported experimental results (Table 3). It should be stressed that the available experimental data are scarce, and results are difficult to compare quantitatively. We consider thermally and, for completeness, some photo-driven catalysis and thus also include supported metal catalysts with the considered oxides as support. Despite possibly different mechanisms for CO 2 conversion in the different types of catalysis, we believe that the properties of adsorbed CO 2 molecule can still serve as indicators of catalytic activity. Thus, it is possible that under such a daunting situation a reliable indicator of CO 2 activation can still be identified. As described below, our analysis confirms this hope.
First, we consider materials with the sites from subgroups obtained by minimization of OCO-angle without Sabatier principle constraint 27 . For quite many materials from these subgroups, independent of the cutoff value, there are no reports of successful CO 2 conversion, even when they are used as supports for metal nanoparticles (Table 3). This is explained by the fact that absolute adsorption energies for these materials are above 2 eV (Fig. 2 left, Supplementary Table 4), indicating that their surfaces will be permanently poisoned by carbonate species at low or intermediate temperatures. This means that on materials with these sites hardly any reaction of CO 2 conversion can proceed at low, especially room temperature. Moreover, as shown in Table 3, even at increased temperatures, 700-750°C, the activity of these materials is low. Some of them have been considered as candidates for carbon capture and storage (CaO, SrO, BaO, and Na 2 O) 44 , which implies the formation of stable carbonates rather than CO 2 transformation. Thus, we conclude that OCO-angle alone is not a good indicator of enhanced catalytic activity in CO 2 conversion.
On the other hand, several of the materials with sites from l(C-O) > 1.30 Å subgroups (independent on either with or without Sabatier principle constraint) are known as good materials for CO 2 conversion (Table 3) in different reactions proceeding at room or higher temperatures. For these sites, the absolute adsorption energies already satisfy the Sabatier principle (Fig. 2, left), as discussed above. We note that, contrary to what one may expect, there is no correlation between the adsorption energy and the value of l(C-O) (see Supplementary Fig. 5). Although there is a general trend, there are also significant variations in l(C-O) for given adsorption energy.
Interestingly, some of the materials with sites in the l(C-O) > 1.30 Å subgroups were studied as supports for metallic nanoparticles. For instance, Ni/LaAlO 3 is a catalyst for dry reforming of methane 45 at 700°C. It was shown that its catalytic performance is higher in terms of CO 2 and CH 4 conversion rates compared to Ni/La 2 O 3 and Ni/Al 2 O 3 45 . All sites on considered lanthanum (III) oxide surfaces belong to the subgroup of OCO < 132°without Sabatier constraint, whereas the sites on Al 2 O 3 do not enter any of the two subgroups. KNbO 3 has been studied only with Pt nanoparticles and as a composite with g-C 3 N 4 in photocatalytic reduction of CO 2 into CH 4 46,47 . Pt-KNbO 3 is~2.5 times more photoactive than Pt-NaNbO 3 46 , whereas the NaNbO 3 is known to be photoactive even without nanoparticles 48 . This seems to suggest that l(C-O) is a good indicator of CO 2 activation for both unsupported and supported catalysts even at increased temperatures. Hence, the other materials with the sites from this subgroup are promising new There is also a set of materials [ternaries A 2+ B 4+ O 3 (A = Ca, Sr, Ba, B = Zr, Ti, Ge, Sn, Si) with a perovskite structure] containing both the surfaces with sites from the smaller OCO subgroups without Sabatier constraint and the surfaces with sites from the larger l(C-O) subgroups (Table 3). These two types of sites are located on different surfaces. Thus, based on the above results, a material for which a surface with sites from the l(C-O) > 1.30 Å subgroups has lower formation energy and is more abundant than the surface with sites from smaller OCO subgroups without Sabatier constraint is expected to be a good catalyst. To explore this possibility, we analyze the surfaces of these materials in more detail. Their most stable surfaces are AO-terminated (001) facets containing sites from the smaller OCO subgroup. The formation energies of ABO 3 -terminated (110) surfaces with larger l(C-O) sites are higher: for BaZrO 3 , SrZrO 3 , CaZrO 3 , and SrTiO 3 the differences in formation energies are 0.049, 0.027, 0.013, and 0.037 eV/Å 2 , respectively. The zirconates and SrTiO 3 were found to catalyze the water gasshift reaction under increased temperatures, 700-1100°C 49 . At room temperature the photocatalytic activity of SrTiO 3 was found to be significantly decreased 50 . We attribute the latter finding to the strong carbonation of its most stable surface, which is consistent with the calculated high absolute value of CO 2 adsorption energy (−2.4 eV) for this surface. Thus, the activity of SrTiO 3 at 700°C and higher temperatures is consistent with the estimates of the CO 2 chemical potential given above. The difference in formation energies of the most stable CaOterminated (001) surface and the stoichiometric (110) surface for CaTiO 3 is less pronounced compared to zirconates and other titanates (CaO-terminated (001) is more stable than the (110) surface by only 0.009 eV/Å 2 ). Thus, the (110) facets, which contain sites from the long l(C-O) subgroup, may be present on catalyst particles at the reaction conditions. This can explain the observed activity of CaTiO 3 in CO 2 conversion not only at high but also at room temperature. We note that the activity of this material was also attributed to the presence of TiO 2 nanoparticles on the surface 51 at reaction conditions.
The OCO subgroup that includes most of the known good catalysts and a minimal number of inactive materials is OCO < 132°with Sabatier principle. It contains the sites on discussed above LaAlO 3 , KNbO3, and NaNbO 3 catalysts, but also on nonactive YInO 3 according to ref. 52 (Table 3). This subgroup contains in addition the sites on a well-known CO 2 conversion catalyst Ga 2 O 3 . We should mention that the catalytic activity of Ga 2 O 3 has been attributed to its reducibility. According to Pan and coworkers 53 CO 2 molecules are activated via dissociation on surface O-vacancies. However, in ref. 54 only one Ga 2 O 3 (100) surface was considered for which no energetically stable CO 2 chemisorption structures were obtained with the PBE functional. We show in Supplementary Table 1 and Supplementary Fig. 1 that this functional underestimates CO 2 adsorption energies. Moreover, in our study we considered also other surfaces and Table 3 The catalytic performance of materials which contain the sites from larger l(C-O)) or/and smaller OCO subgroups. ARTICLE found stable CO 2 chemisorption structures on these surfaces. Thus, activation of CO 2 on Ga 2 O 3 can indeed proceed on O-atoms as discussed in our study, even without surface O-vacancies. The subgroups with small OCO cutoffs, 123°and 124°, do not contain any sites on known active or non-active catalysts. OCO < 132°subgroup with Sabatier principle contains a large number of sites with elongated C-O bonds. The overlap of this subgroup with l(C-O) > 1.30 Å subgroups is 19 samples (70% of the latter).
To demonstrate the advantages of SGD over DTR in finding materials genes and their optimal combinations, we have done a comparison of found SGD subgroups with DTR performance for l(C-O). DTR terminal nodes (leaves) with the largest average l(C-O) (Supplementary Figs. 2 and 3) include surface sites on materials prone to extremely strong carbonation (Table 2), and also sites at which CO 2 prefers to physisorb, with l(C-O) = 1.17 Å. Also, one cannot check the effect of imposing the constraint as there is no standard way to mix regression and classification in DTR. Thus, DTR in contrast to SGD is not able to separate different activation modes and even fails sometimes in distinguishing activation from non-activation.
Best materials for CO 2 reduction among calculated ones. Now those good indicators of activation are identified (OCO with Sabatier principle and l(C-O)), all calculated materials can be ranked according to the value of these indicators (smaller OCO or larger l(C-O) indicate C-O bond weakening and therefore higher catalytic activity, provided adsorption energy is moderate). The resulting list of the most promising catalysts for CO 2 conversion is presented in Table 4.  Table 4 that belong to both l(C-O) > 1.30 Å and OCO < 132°subgroups are the most promising catalysts, followed by materials that belong to one of the subgroups, with the performance decreasing further down the list. Taking into account the number of active surface cuts and Sabatier principle, we conclude that NaSbO 3 is the most promising unexplored catalyst for temperatures up to 340°C (for CO 2 pressures around 1 atm). Other A +1 B +5 O 3 type promising materials are KSbO 3 (for temperatures up to 110°C) and RbNbO 3 (up to 360°C) that belong to both subgroups, and LiSbO 3 (230°C), CsNbO 3 (260°C), CsVO 3 (110°C), NaVO 3 (130°C), belonging to one of the subgroups (listed in the order of decreasing performance). There are also several promising A +3 B +3 O 3 oxides with surfaces belonging to one or both subgroups, listed in the order they appear first time in the table: ScAlO 3 (up to 550°C), GaAlO 3 (230°C), GaInO 3 (340°C), rhombohedral InAlO 3 (120°C)-these and other In-containing materials are of course very expensive, but we list them here for completeness, LaGaO 3 (210°C), ScGaO 3 (240°C), YAlO 3 (330°C).
From Table 4 it can be seen that not all promising materials belong to one of the found subgroups. This means that there are other optimal materials gene combinations that are not identified by SGD as statistically significant based on the current dataset. Such combinations may be unique for a given material, or they may be found when more data for different materials are considered. Among these materials the most promising are: InScO 3 (up to 430°C), MgSnO 3 (430°C), CaGeO 3 (570°C), orthorhombic InAlO 3 (230°C), CaSiO 3 (420°C), SrSiO 3 (460°C), SrGeO 3 (480°C), and BaSnO 3 (up to 550°C).

Discussion
We have developed the subgroup-discovery strategy for finding improved oxide-based catalysts for the conversion of chemically inert molecules such as CO 2 into useful chemicals or fuels. For this purpose we identified a new indicator of CO 2 activation, namely the large C-O bond distance of the adsorbed molecule. This artificial-intelligence approach identifies the materials genes that correlate most strongly with the activation of the adsorbed molecule. Specifically, these are the following clean surface properties: Hirshfeld charges of O-atom at which CO 2 adsorbs (q O ) and of surface cations (q min , q max ), surface geometric features [coordination descriptors Q i , i = 5, 6, distances between the surface O-atom and the nearest surface cations (d i , i = 1-3)], electrostatic potential and electric field above the adsorption site (Δφ, φ 2.6 ), polarizability and C 6 coefficients for surface atoms (C 6 min , C 6 O , α max ), radii of HOMO and LUMO of the cation species (r +1 max , r +1 min , r HOMO min ), ionization potential, electron affinity, and electronegativity of surface cation species (IP max , EA max , EN min ), features of O 2p DOS (kurt, M, PC, U), conduction band minimum (CBM), energies of the lowest unoccupied projected eigenstates of surface cation species (L max , L min ), and surface work function (W). The found subgroup selectors predict whether a given candidate material belongs to the class of promising catalysts. The peculiarity of the large C-O bond indicator is that it automatically satisfies Sabatier principle for low and middle-temperature CO 2 conversion.
The present study shows also that the previously proposed indicator for CO 2 activation, the decrease of the OCO-angle 27 , is not appropriate and even correlates with strong adsorption so that poisoning by carbonation is likely which may be useful for carbon capture and storage (CCS) but not for carbon capture and utilization (CCU). When Sabatier principle is purposely included in the SGD search for small OCO, found subgroups substantially overlap with large l(C-O) subgroups (70%), although still contain a few sites on inactive materials for CO 2 conversion.
The subgroup analysis revealed an alternative mechanism of CO 2 activation by adsorption, namely bonding of an O-atom in CO 2 with a surface cation(s), combined with only moderate electron transfer from the surface to the molecule, which results not only in reduction of OCO-angles, but also in pronounced elongation and weakening of the C-O bond. Although the latter can be achieved also by a larger charge transfer, it results in stronger binding of CO 2 molecule to the surface and poisoning of the catalyst, contrary to the new mechanism. The same new mechanism is revealed when Sabatier principle is included when searching for small OCO subgroups.
We also demonstrated that a standard regression technique (DTR), which gives prediction models in a physically interpretable form similar to subgroup discovery (selectors based on identified descriptor), fails to identify the optimal combinations of materials genes and the activation in general. This failure is traced back to the fact that DTR is a global approach, which minimizes error in the prediction of the value of a target property for the whole dataset. As a result, different combinations of genes leading to the optimal value of the same target property are intermixed, and the combination that leads to the most optimal value is not identified. On the contrary, subgroup discovery finds unique local subsets in the data independent of the rest of the data. This makes it more suitable for identifying different combinations of materials genes that result in activation.
The other four considered potential indicators (charge at the adsorbed CO 2 , adsorption induced dipole moment, the difference of charges on O-atoms and on C and O-atoms of adsorbed CO 2 ) were found to reproduce the results of SGD obtained for OCOangles or C-O bond distances with significant overlap with corresponding subgroups. Based on our results, we propose several new promising oxidebased catalysts for CO 2 conversion (Table 4). Although the present work has focused on oxides only, the overall strategy is general and can be applied to any other family of materials. This work also emphasizes the importance of documenting metadata and workflows for AI data analysis in materials science in order to ensure the reproducibility of AI models and data analysis results.

Methods
Ab initio calculations. The calculations are performed using density-functional theory (DFT) with the PBEsol exchange-correlation functional 55 as implemented in FHI-aims code 56 using 'tight' basis sets. The functional is chosen based on a comparison of calculated bulk lattice constants 55 and CO 2 adsorption energy to the available experimental results and high-level calculations (CCSD(T) and validated hybrid); see Supporting Information (SI) for more details on the computational setup. Nevertheless, it is expected that, because of the large set of systems inspected and the small variations introduced by the functional choice, the main trends will hold even when using another functional.
Studied materials. The dataset includes 71 semiconductor oxide materials, with 141 surfaces. The materials are ternary (ABO 3 ) and binary oxides with metal cations A and B from groups 1-5 (including La) and groups 12-15 of the periodic table. The full list of materials and surface cuts is given in Supplementary Notes, and the dataset is available in ref. 26 . In this study we considered only stoichiometric surface reconstructions obtained by atomic relaxation of stoichiometric bulk-like initial surface geometries. While this seems to be a limitation, our results show that indicators of activation calculated with this assumption correlate with experimental activity for known good oxide catalysts. This does not imply that surfaces of these materials do not reconstruct, but that the properties of unreconstructed surfaces can be used as descriptors for catalysis at reconstructed and defected surfaces under realistic conditions. The inclusion of surface reconstructions in the training data will further improve the predictions and will be a subject of future work.
The details of SGD. The SGD was done with the RealKD code (https://bitbucket. org/realKD/), modified to include quality functions described by Eqs. (1) and (2) in which the information gain was defined as: here p is the number of samples in a subgroup within the required adsorption energy range divided by the total number of samples in the subgroup. Since Shannon entropy is a symmetric parabola-like function around 0.5, we set here F(Z) = 0 for p ≤ 0.5. Also, x·ln(x) = 0 for x = 0. The search of subgroups is performed using a Monte-Carlo scheme adapted for these tasks 34 .
The cutoff values x, y, ... used for setting propositions (feature-1 < x, feature-2 ≥ y, etc.) are obtained by k-means clustering, as implemented within RealKD. That is, for a desired number n = k − 1 of cutoff values a set of k representative values of a given feature and k groups (clusters) of the data points are determined that minimize the deviation of all the feature values from the representative values. Thus, each value of the feature in the dataset is assigned to a particular cluster, and the cutoffs are determined as the arithmetic mean between the closest feature values in neighboring clusters. The number k is a parameter, and different k-values can in principle result in different cutoff values. It is worth noting that, due to the stochastic Monte-Carlo sampling, the exact definitions of the subgroups may vary for consecutive runs of the SGD algorithm. We have tested k = 12, 14, and 16 and rerun the algorithm several times for each k. While the results indeed depend on the run and on the k value, the subgroups maximizing the quality function have largely or entirely overlapping populations, and selectors with the same or similar propositions. Here we report selectors that appear most often and have high population and quality function values.
Decision-tree regression. The DTR analysis was performed using Python scikitlearn libraries. DTR is a supervised learning method in which the training set is repeatedly split into patterns (so-called leaves) by means of propositions built from primary features. The fitting of a model is done with respect to the cost function, which encloses the deviation of fitted values of a target property from the actual values. In this study we considered two cost functions-mean squared error (MSE) and mean absolute error (MAE). The search for the most optimal partitioning (the so-called tree) is done with the greedy algorithm. To obtain the most optimal TR model, we used a standard approach for supervised machine learning-leave-oneout cross-validation with respect to the hyperparameters-minimal size of a leaf, maximal depth. The minimal size of a leaf is a bottom threshold of the population of a pattern, since too small size might result in overfitting. Maximal depth is a limit for the maximal number of splits in a tree.

Data availability
The dataset is available in the NOMAD AI Toolkit 26 .

Code availability
A Jupyter notebook is available in the NOMAD AI Toolkit 26 .