Introduction

Artificial neural networks (NNs) and statistical regression are commonly used to automate the discovery of patterns and relations in data. NNs return “black-box” models, where the underlying functions are typically used for prediction only. In standard regression, the functional form is determined in advance, so model discovery amounts to parameter fitting. In symbolic regression (SR)1, 2, the functional form is not determined in advance, but is instead composed from operators in a given list (e.g., + , − , × , and ÷) and calculated from the data. SR models are typically more “interpretable” than NN models, and require less data. Thus, for discovering laws of nature in symbolic form from experimental data, SR may work better than NNs or fixed-form regression3; integration of NNs with SR has been a topic of recent research in neuro-symbolic AI4,5,6. A major challenge in SR is to identify, out of many models that fit the data, those that are scientifically meaningful. Schmidt and Lipson3 identify meaningful functions as those that balance accuracy and complexity. However, many such expressions exist for a given dataset, and not all are consistent with the known background theory.

Another approach would be to start from the known background theory, but there are no existing practical reasoning tools that generate theorems consistent with experimental data from a set of known axioms. Automated Theorem Provers (ATPs), the most widely-used reasoning tools, instead solve the task of proving a conjecture for a given logical theory. Computational complexity is a major challenge for ATPs; for certain types of logic, proving a conjecture is undecidable. Moreover, deriving models from a logical theory using formal reasoning tools is especially difficult when arithmetic and calculus operators are involved (e.g., see the work of Grigoryev et al.7 for the case of inequalities). Machine-learning techniques have been used to improve the performance of ATPs, for example, by using reinforcement learning to guide the search process8. This research area has received much attention recently9,10,11.

Models that are derivable, and not merely empirically accurate, are appealing because they are arguably correct, predictive, and insightful. We attempt to obtain such models by combining a novel mathematical-optimization-based SR method with a reasoning system. This yields an end-to-end discovery system, which extracts formulas from data via SR, and then furnishes either a formal proof of derivability of the formula from a set of axioms, or a proof of inconsistency. We present novel measures that indicate how close a formula is to a derivable formula, when the model is provably non-derivable, and we calculate the values of these measures using our reasoning system. In earlier work combining machine learning with reasoning, Marra et al.12 use a logic-based description to constrain the output of a GAN neural architecture for generating images. Scott et al.13 and Ashok et al.14 combine machine-learning tools and reasoning engines to search for functional forms that satisfy prespecified constraints. They augment the initial dataset with new points in order to improve the efficiency of learning methods and the accuracy of the final model. Kubalik et al.15 also exploit prior knowledge to create additional data points. However, these works only consider constraints on the functional form to be learned, and do not incorporate general background-theory axioms (logic constraints that describe the other laws and unmeasured variables that are involved in the phenomenon).

Results

Discovery as a formal mathematical problem

Our automated scientific discovery method aims to discover an unknown symbolic model y = f *(x) (bold letters indicate vectors) where x is the vector (x1, …, xn) of independent variables, and y is the dependent variable. The discovered model f (an approximation of f *) should fit a collection of m data points, ((X1, Y1), …, (Xm, Ym)), be derivable from a background theory, have low complexity, and have bounded prediction error. More specifically, the inputs to our system are 4-tuples \(\langle \mathcal{B},\mathcal{C},\mathcal{D},\mathcal{M}\rangle\) as follows.

  • Background Knowledge \(\mathcal{B}\): a set of domain-specific axioms expressed as logic formulae. They involve x, y, and possibly more variables that are necessary to formulate the background theory. In this work we focus mainly on first-order-logic formulae with equality, inequality and basic arithmetic operators. We assume that the background theory \(\mathcal{B}\) is complete, that is, it contains all the axioms necessary to comprehensively explain the phenomena under consideration, and consistent, that is, the axioms do not contradict one another. These two assumptions guarantee that there exists a unique derivable function \(f_{\mathcal{B}}\) that logically represents the variable of interest y. Note that although the derivable function is unique, there may exist different functional forms that are equivalent on the domain of interest. For example, on the two-point domain {0, 1} for a variable x, the two functional forms f (x) = x and f (x) = x² both define the same function.

  • A Hypothesis Class \({{{{{{{\mathcal{C}}}}}}}}\): a set of admissible symbolic models defined by a grammar, a set of invariance constraints to avoid redundant expressions (e.g., A + B is equivalent to B + A) and constraints on the functional form (e.g., monotonicity).

  • Data \({{{{{{{\mathcal{D}}}}}}}}\): a set of m examples, each providing certain values for x1, …, xn, and y.

  • Modeler Preferences \({{{{{{{\mathcal{M}}}}}}}}\): a set of numerical parameters (e.g., error bounds on accuracy).
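As a rough illustration of how these four inputs fit together, they can be pictured as a single record. This is only a sketch and not the system's actual data model: the class name, field names, and all values below are hypothetical, and the axioms are stored as plain strings rather than parsed logic formulae.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DiscoveryProblem:
    """Hypothetical container for the 4-tuple <B, C, D, M>."""
    background: List[str]                          # B: axioms as formula strings
    operators: List[str]                           # part of C: grammar operators
    data: List[Tuple[Tuple[float, ...], float]]    # D: (x, y) examples
    preferences: Dict[str, float] = field(default_factory=dict)  # M

    def num_examples(self) -> int:
        """m, the number of data points."""
        return len(self.data)

    def num_inputs(self) -> int:
        """n, the number of independent variables (read from the first example)."""
        return len(self.data[0][0]) if self.data else 0


# Made-up values, for illustration only.
problem = DiscoveryProblem(
    background=["m1*d1 = m2*d2", "d = d1 + d2"],
    operators=["+", "-", "*", "/", "sqrt"],
    data=[((1.0, 2.0), 3.0), ((2.0, 1.0), 4.0)],
    preferences={"max_relative_error": 0.02},
)
```

Keeping the modeler preferences in a free-form dictionary is a convenience of this sketch; constraints on the functional form (e.g., monotonicity) would need a richer representation.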

Generalized notion of distance

In general, there may not exist a function \(f\in \mathcal{C}\) that fits the data exactly and is derivable from \(\mathcal{B}\). This could happen because the symbolic model generating the data might not belong to \(\mathcal{C}\), the sensors used to collect the data might give noisy measurements, or the background knowledge might be inaccurate or incomplete. To quantify the compatibility of a symbolic model with data and background theory, we introduce the notion of distance between a model f and \(\mathcal{B}\). Roughly, it reflects the error between the predictions of f and the predictions of a formula \(f_{\mathcal{B}}\) derivable from \(\mathcal{B}\) (thus, the distance equals zero when f is derivable from \(\mathcal{B}\)). Figure 1 visualizes the two notions of distance used in this work, the distance ε( f ) between a model and the data and the distance β( f ) between a model and the background theory, for the problem of learning Kepler’s third law of planetary motion from solar-system data and background theory.

Fig. 1: Visualization of relevant sets and their distances.

The numerical data, background theory, and a discovered model are depicted for Kepler’s third law of planetary motion giving the orbital period of a planet in the solar system. The data consists of measurements (m1, m2, d, p) of the mass of the sun m1, the orbital period p and mass m2 for each planet and its distance d from the sun. The background theory amounts to Newton’s laws of motion, i.e., the formulae for centrifugal force, gravitational force, and equilibrium conditions. The 4-tuples (m1, m2, d, p) are projected into (m1 + m2, d, p). The blue manifold represents solutions of \({f}_{{{{{{{{\mathcal{B}}}}}}}}}\), which is the function derivable from the background-theory axioms that represents the variable of interest. The gray manifold represents solutions of the discovered model f. The double arrows indicate the distances β ( f ) and ε( f ).

Integration of statistical and symbolic AI

Our system consists mainly of an SR module and a reasoning module. The SR module returns multiple candidate symbolic models (or formulae) that express y as a function of x1, …, xn and fit the data. For each of these models, the system outputs the distance ε( f ) between f and \(\mathcal{D}\) and the distance β( f ) between f and \(\mathcal{B}\). We also refer to ε( f ) and β( f ) as errors.

These functions are also tested to see if they satisfy the specified constraints on the functional form (in \({{{{{{{\mathcal{C}}}}}}}}\)) and the modeler-specified level of accuracy and complexity (in \({{{{{{{\mathcal{M}}}}}}}}\)). When the models are passed to the reasoning module (along with the background theory \({{{{{{{\mathcal{B}}}}}}}}\)), they are tested for derivability. If a model is found to be derivable from \({{{{{{{\mathcal{B}}}}}}}}\), it is returned as the chosen model for prediction; otherwise, if the reasoning module concludes that no candidate model is derivable, it is necessary to either collect additional data or add constraints. In this case, the reasoning module will return a quality assessment of the input set of candidate hypotheses based on the distance β, removing models that do not satisfy the modeler-specified bounds on β. The distance (or error) β is computed between a function (or formula) f, derived from numerical data, and the derivable function \({f}_{{{{{{{{\mathcal{B}}}}}}}}}\) which is implicitly defined by the set of axioms in \({{{{{{{\mathcal{B}}}}}}}}\) and is logically represented by the variable of interest y. The distance between the function \({f}_{{{{{{{{\mathcal{B}}}}}}}}}\) and any other formula f depends only on the background theory and the formula f and not on any particular functional form of \({f}_{{{{{{{{\mathcal{B}}}}}}}}}\). Moreover, the reasoning module can prove that a model is not derivable by returning counterexample points that satisfy \({{{{{{{\mathcal{B}}}}}}}}\) but do not fit the model.

Interplay between data and theory in AI-Descartes

SR is typically solved with genetic programming (GP)1,2,3, 16; however, methods based on mixed-integer nonlinear programming (MINLP) have recently been proposed17,18,19. In this work, we develop a new MINLP-based SR solver (described in the Supplementary Information). The input consists of a subset of the operators {\(+,-, \times , \div ,\, \sqrt {\,} , \log ,\exp\)}, an upper bound on expression complexity, and an upper bound on the number of constants used that do not equal 1. Given a dataset, the system formulates multiple MINLP instances to find an expression that minimizes the least-squares error. Each instance is solved approximately, subject to a time limit. Both linear and nonlinear constraints can be imposed. In particular, dimensional consistency is imposed when physical dimensions of variables are available.
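To make the input/output contract of an SR step concrete, the following toy sketch ranks a handful of fixed candidate forms by least-squares error. It deliberately swaps in brute-force enumeration with a single fitted constant for the MINLP formulation described above, so it illustrates only the idea of searching over expressions built from an operator list. The candidate list, the data, and all function names are assumptions made for this example.

```python
import math

# Toy data generated from y = 2 * x**1.5 (an assumption for illustration only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x ** 1.5 for x in xs]

# Hypothetical candidate forms g(x); each model is y = c * g(x) with one constant c.
candidates = {
    "x": lambda x: x,
    "x*x": lambda x: x * x,
    "sqrt(x)": lambda x: math.sqrt(x),
    "sqrt(x*x*x)": lambda x: math.sqrt(x * x * x),
}


def fit_constant(g, xs, ys):
    """Closed-form least-squares c for y = c * g(x), plus the summed squared error."""
    gs = [g(x) for x in xs]
    c = sum(gv * y for gv, y in zip(gs, ys)) / sum(gv * gv for gv in gs)
    sse = sum((c * gv - y) ** 2 for gv, y in zip(gs, ys))
    return c, sse


def rank_candidates(candidates, xs, ys):
    """Return (name, c, sse) triples sorted by increasing squared error."""
    results = [(name, *fit_constant(g, xs, ys)) for name, g in candidates.items()]
    return sorted(results, key=lambda r: r[2])


ranking = rank_candidates(candidates, xs, ys)
best_name, best_c, best_sse = ranking[0]
```

A real solver must also search over expression structure and several constants at once, which is what makes the problem a mixed-integer nonlinear program rather than a one-parameter fit.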

We use KeYmaera X20 as a reasoning tool; it is an ATP for hybrid systems and combines different types of reasoning: deductive, real-algebraic, and computer-algebraic reasoning. We also use Mathematica21 for certain types of analysis of symbolic expressions. While a formula found by any grammar-based system (such as an SR system) is syntactically correct, it may contradict the axioms of the theory or not be derivable from them. In some cases, a formula may not be derivable as the theory may not have enough axioms; the formula may be provable under an extended axiom set or an alternative one (e.g., using a relativistic set of axioms rather than a “Newtonian” one).

An overview of our system seen as a discovery cycle is shown in Fig. 2. Our discovery cycle is inspired by Descartes, who advanced the scientific method and emphasized the role that logical deduction, and not empirical evidence alone, plays in forming and validating scientific discoveries. Our present approach differs from implementations of the scientific method that obtain hypotheses from theory and then check them against data; instead we obtain hypotheses from data and assess them against theory. A more detailed schematic of the system is depicted in Fig. 3, where the colored components correspond to the system we present in this work, and the gray components refer to standard techniques for scientific discovery that we have not yet integrated into our current implementation.

Fig. 2: An interpretation of the scientific method as implemented by our system.

The colors match the respective components of the system in Fig. 3.

Fig. 3: System overview.

Colored components correspond to our system, and gray components indicate standard techniques for scientific discovery (human-driven or artificial) that have not been integrated into the current system. The colors match the respective components of the discovery cycle of Fig. 2. The present system generates hypotheses from data using symbolic regression, which are posed as conjectures to an automated deductive reasoning system, which proves or disproves them based on background theory or provides reasoning-based quality measures.

Experimental validation

We tested the different capabilities of our system on three problems (more details in the Methods section). First, we considered the problem of deriving Kepler’s third law of planetary motion, providing reasoning-based measures to analyze the quality and generalizability of the generated formulae. Extracting this law from experimental data is challenging, especially when the masses involved are of very different magnitudes. This is the case for the solar system, where the solar mass is much larger than the planetary masses. The reasoning module helps in choosing between different candidate formulae and identifying the one that generalizes well: using our data and theory integration we were able to re-discover Kepler’s third law. We then considered Einstein’s time-dilation formula. Although we did not recover this formula from data, we used the reasoning module to identify the formula that generalizes best. Moreover, analyzing the reasoning errors with two different sets of axioms (one with “Newtonian” assumptions and one relativistic), we were able to identify the theory that better explains the phenomenon. Finally, we considered Langmuir’s adsorption equation, whose background theory contains material-dependent coefficients. By relating these coefficients to the ones in the SR-generated models via existential quantification, we were able to logically prove one of the extracted formulae.

Discussion

We have demonstrated the value of combining logical reasoning with symbolic regression in obtaining meaningful symbolic models of physical phenomena, in the sense that they are consistent with background theory and generalize well in a domain that is significantly larger than the experimental data. The synthesis of regression and reasoning yields better models than can be obtained by SR or logical reasoning alone.

Improvements or replacements of individual system components and introduction of new modules such as abductive reasoning or experimental design22 (not described in this work for the sake of brevity) would extend the capabilities of the overall system. A deeper integration of reasoning and regression can help synthesize models that are both data driven and based on first principles, and lead to a revolution in the scientific discovery process. The discovery of models that are consistent with prior knowledge will accelerate scientific discovery, and enable going beyond existing discovery paradigms.

Methods

We next describe in detail the methodologies used to address the three problems studied to validate our method: Kepler’s third law of planetary motion, relativistic time dilation, and Langmuir’s adsorption equation.

Kepler’s third law of planetary motion

Kepler’s law relates the distance d between two bodies (e.g., the sun and a planet in the solar system) and their orbital periods. It can be expressed as

$$p=\sqrt{\frac{4{\pi }^{2}{d}^{3}}{G\left({m}_{1}+{m}_{2}\right)}} \,,$$
(1)

where p is the period, G is the gravitational constant, and m1 and m2 are the two masses. It can be derived using the following axioms of the background theory \(\mathcal{B}\), describing the center of mass (axiom K1), the distance between bodies (axiom K2), the gravitational force (axiom K3), the centrifugal force (axiom K4), the force balance (axiom K5), the period (axiom K6), and positivity constraints (axiom K7):

$$\begin{array}{lll}
\mathrm{K}1.\quad \mbox{Center of mass} && m_{1}d_{1}=m_{2}d_{2}\\
\mathrm{K}2.\quad \mbox{Distance between bodies} && d=d_{1}+d_{2}\\
\mathrm{K}3.\quad \mbox{Gravitational force} && F_{g}=\frac{Gm_{1}m_{2}}{d^{2}}\\
\mathrm{K}4.\quad \mbox{Centrifugal force} && F_{c}=m_{2}d_{2}w^{2}\\
\mathrm{K}5.\quad \mbox{Force balance} && F_{g}=F_{c}\\
\mathrm{K}6.\quad \mbox{Period definition} && p=\frac{2\pi }{w}\\
\mathrm{K}7.\quad \mbox{Positivity constraints} && m_{1} > 0,\ m_{2} > 0,\ p > 0,\ d_{1} > 0,\ d_{2} > 0.
\end{array}$$
(2)
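For readers who want the intermediate steps, one way to reach Eq. (1) from axioms K1–K7 is sketched below. This is standard algebra written out for illustration and is not reproduced from the Supplementary Information; the positive root is selected because p > 0 (K7) forces w > 0 in K6.

$$\begin{aligned}
d_{2} &= \frac{m_{1}d}{m_{1}+m_{2}} && \text{(K1, K2)}\\
\frac{Gm_{1}m_{2}}{d^{2}} &= m_{2}d_{2}w^{2} = \frac{m_{1}m_{2}\,d\,w^{2}}{m_{1}+m_{2}} && \text{(K3, K4, K5)}\\
w^{2} &= \frac{G(m_{1}+m_{2})}{d^{3}} \\
p &= \frac{2\pi }{w} = \sqrt{\frac{4\pi^{2}d^{3}}{G(m_{1}+m_{2})}} && \text{(K6, K7)}
\end{aligned}$$

The unmeasured variables d1, d2, Fg, Fc, and w are all eliminated along the way, which is why the final law involves only d, m1, m2, and p.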

We consider three real-world datasets: planets of the solar system (from the NASA Planetary Fact Sheet23), the solar-system planets along with exoplanets from the TRAPPIST-1 and GJ 667 systems (from the NASA exoplanet archive24), and binary stars25. These datasets contain measurements of pairs of masses (a sun and a planet for the first two, and two suns for the third), the distance between them, and the orbital period of the planet around its sun in the first two datasets or the orbital period around the common center of mass in the third dataset. The data we use is given in the Supplementary Information. Note that the datasets do not contain measurements for a number of variables in the axiom system, such as d1, d2, and Fg.

The goal is to recover Kepler’s third law (Eq. (1)) from the data, that is, to obtain p as the above-stated function of d, m1 and m2.

The SR module takes as input the set of operators {\(+,-, \times , \div , \sqrt {\,}\)} and outputs a set of candidate formulae. None of the formulae obtained via SR are derivable, though some are close approximations to derivable formulae. We evaluate the quality of these formulae by writing a logic program for calculating the error β of a formula with respect to a derivable formula. We use three measures, defined below, to assess the correctness of a data-driven formula from a reasoning viewpoint: the pointwise reasoning error, the generalization reasoning error, and variable dependence.

Pointwise reasoning error

The key idea is to compute a distance between a formula generated from the numerical data and some derivable formula that is implicitly defined by the axiom set. The distance is measured by the \({l}_{2}\) or \({l}_{\infty}\) norm applied to the differences between the values of the numerically-derived formula and a derivable formula at the points in the dataset. This definition can be extended to other norms.

We compute the relative error of a numerically derived formula f (x), evaluated at the m data points Xi (i = 1, …, m), with respect to a formula \(f_{\mathcal{B}}(\mathbf{x})\) derivable from the axioms, via the following expressions:

$$\beta_{2}^{r}=\sqrt{\sum\limits_{i=1}^{m}\left(\frac{f(\mathbf{X}^{i})-f_{\mathcal{B}}(\mathbf{X}^{i})}{f_{\mathcal{B}}(\mathbf{X}^{i})}\right)^{2}}\quad \mbox{and}\quad \beta_{\infty }^{r}=\max\limits_{1\le i\le m}\left\{\frac{|f(\mathbf{X}^{i})-f_{\mathcal{B}}(\mathbf{X}^{i})|}{|f_{\mathcal{B}}(\mathbf{X}^{i})|}\right\}$$
(3)

where \({f}_{{{{{{{{\mathcal{B}}}}}}}}}({{{{{{{{\bf{X}}}}}}}}}^{i})\) denotes a derivable formula for the variable of interest y evaluated at the data point Xi.

The KeYmaera formulation of these two measures for the first formula of Table 1 can be found in the Supplementary Information. Absolute-error variants of the first and second expressions in Eq. (3) are denoted by \({\beta }_{2}^{a},{\beta }_{\infty }^{a}\), respectively. The numerical (data) error measures \({\varepsilon }_{2}^{r}\) and \({\varepsilon }_{\infty }^{r}\) are defined by replacing \({f}_{{{{{{{{\mathcal{B}}}}}}}}}({{{{{{{{\bf{X}}}}}}}}}^{i})\) by Yi in Eq. (3). Analogous to \({\beta }_{2}^{a}\) and \({\beta }_{\infty }^{a}\), we also define absolute-numerical-error measures \({\varepsilon }_{2}^{a}\) and \({\varepsilon }_{\infty }^{a}\).
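The arithmetic behind these measures can be sketched as follows. The sketch assumes that reference values are already available as numbers. For the reasoning errors, f_B is only implicitly defined by the axioms and the bounds are obtained with the reasoning system, so this snippet illustrates the norms themselves and not that computation. Function names and sample values are hypothetical.

```python
import math


def relative_errors(pred, ref):
    """Return (l2, linf) norms of the relative differences (pred - ref) / ref.

    With ref = f_B(X^i) this mirrors (beta_2^r, beta_inf^r) in Eq. (3);
    with ref = Y^i it mirrors (eps_2^r, eps_inf^r).
    """
    if len(pred) != len(ref):
        raise ValueError("pred and ref must have the same length")
    rel = [(p - r) / r for p, r in zip(pred, ref)]
    l2 = math.sqrt(sum(e * e for e in rel))
    linf = max(abs(e) for e in rel)
    return l2, linf


def absolute_errors(pred, ref):
    """Absolute-error variants: the same norms without dividing by the reference."""
    if len(pred) != len(ref):
        raise ValueError("pred and ref must have the same length")
    diff = [p - r for p, r in zip(pred, ref)]
    return math.sqrt(sum(e * e for e in diff)), max(abs(e) for e in diff)


# Made-up numbers, for illustration only.
f_values = [1.1, 1.8, 3.3]     # candidate formula evaluated at the data points
ref_values = [1.0, 2.0, 3.0]   # reference values at the same points
beta2_r, betainf_r = relative_errors(f_values, ref_values)
beta2_a, betainf_a = absolute_errors(f_values, ref_values)
```

In this made-up example every point is off by 10% in relative terms, so the relative measures treat the three points equally while the absolute measures are dominated by the largest value.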

Table 1 Error values of candidate solutions for the Kepler dataset

Table 1 reports in columns 5 and 6 the values of \({\beta }_{2}^{r}\) and \({\beta }_{\infty }^{r}\), respectively. It also reports the relative numerical errors \({\varepsilon }_{2}^{r}\) and \({\varepsilon }_{\infty }^{r}\) in columns 3 and 4, measured by the \({l}_{2}\) and \({l}_{\infty}\) norms, respectively, for the candidate expressions given in column 2 when evaluated on the points in the dataset. We minimize the absolute \({l}_{2}\) error \({\varepsilon }_{2}^{a}\) (and not the relative error \({\varepsilon }_{2}^{r}\)), when obtaining candidate expressions via symbolic regression.

The pointwise reasoning errors β2 and \({\beta }_{\infty }\) are not very informative if SR yields a low-error candidate expression (measured with respect to the data), and the data itself satisfies the background theory up to a small error, which indeed is the case with the data we use; the reasoning errors and numerical errors are very similar.

Generalization reasoning error

Even when one can find a function that fits given data points well, it is challenging to obtain a function that generalizes well, that is, one which yields good results at points of the domain not equal to the data points. Let \({\beta }_{\infty,S}^{r}\) be calculated for a candidate formula f (x) over a domain S that is not equal to the original set of data points as follows:

$$\beta_{\infty,S}^{r}=\max\limits_{\mathbf{x}\in S}\left\{\frac{|f(\mathbf{x})-f_{\mathcal{B}}(\mathbf{x})|}{|f_{\mathcal{B}}(\mathbf{x})|}\right\},$$
(4)

where we consider the relative error and, as before, the function \(f_{\mathcal{B}}(\mathbf{x})\) is not known, but is implicitly defined by the axioms in the background theory. We call this measure the relative generalization reasoning error. If we do not divide by \(f_{\mathcal{B}}(\mathbf{x})\) in the above expression, we get the corresponding absolute version \(\beta_{\infty,S}^{a}\). For the Kepler datasets, we let S be the smallest multi-dimensional interval (or Cartesian product of intervals on the real line) containing all data points. In column 7 of Table 1, we show the relative generalization reasoning error \(\beta_{\infty,S}^{r}\) on the Kepler datasets with S defined as above. If this error is roughly the same as \(\beta_{\infty }^{r}\), the pointwise relative reasoning error for \(l_{\infty}\) (e.g., for the solar system dataset), then the formula extracted from the numerical data is as accurate at points in S as it is at the original data points.
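The construction of S and the maximum in Eq. (4) can be illustrated numerically. The sketch below replaces the reasoning-based computation with plain grid sampling and assumes an explicit reference function is available, so it yields only a lower bound on the true maximum. The functions, data, and names are hypothetical stand-ins.

```python
import itertools
import math


def bounding_box(points):
    """Smallest axis-aligned box containing all points: one (lo, hi) per variable."""
    return [(min(col), max(col)) for col in zip(*points)]


def grid(box, steps=11):
    """Evenly spaced grid over the box (steps points per axis, endpoints included)."""
    axes = [[lo + (hi - lo) * k / (steps - 1) for k in range(steps)] for lo, hi in box]
    return itertools.product(*axes)


def sampled_relative_gen_error(f, f_ref, box, steps=11):
    """Max relative error over grid samples of the box (a lower bound on Eq. (4))."""
    return max(abs(f(*x) - f_ref(*x)) / abs(f_ref(*x)) for x in grid(box, steps))


def f_ref(m, d):
    """Hypothetical stand-in for the derivable function f_B."""
    return math.sqrt(d ** 3 / m)


def f(m, d):
    """Hypothetical candidate model that ignores the mass m."""
    return math.sqrt(d ** 3)


points = [(1.0, 1.0), (1.0, 2.0), (1.0, 4.0)]   # made-up (m, d) data, m fixed at 1
S = bounding_box(points)
err_on_S = sampled_relative_gen_error(f, f_ref, S)
```

Because m never varies in this made-up data, the candidate agrees with the reference everywhere on S even though it omits m, which is the situation the variable-dependence analysis is designed to expose.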

Variable dependence

In order to check if the functional dependence of a candidate formula on a specific variable is accurate, we compute the generalization reasoning error over a domain \({S}^{{\prime} }\) where the domain of this variable is extended by an order of magnitude beyond the smallest interval containing the values of the variable in the dataset. Thus we can check whether there exist special conditions under which the formula does not hold. We modify the endpoints of an interval by one order of magnitude, one variable at a time. If we notice an increase in the generalization reasoning error while modifying intervals for one variable, we deem the candidate formula to be missing a dependency on that variable. A missing dependency might occur, for example, because the exponent for a variable is incorrect, or because that variable is not considered at all when it should be. One can get further insight into the type of dependency by analyzing how the error varies (e.g., linearly or exponentially). Table 1 provides, in columns 8–10, results regarding the candidate formulae for Kepler’s third law. For each formula, the dependencies on m1, m2, and d are indicated by 1 or 0 (for correct or incorrect dependency). For example, the candidate formula \(p=\sqrt{0.1319{d}^{3}}\) for the solar system does not depend on either mass, and the dependency analysis suggests that the formula approximates the phenomenon well in the solar system, but not for larger masses.

The best formula for the binary-star dataset, \(\sqrt{{d}^{3}/(0.9967{m}_{1}+{m}_{2})}\), has no missing dependency (all ones in columns 8–10), that is, it generalizes well; increasing the domain along any variable does not increase the generalization reasoning error.
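The one-variable-at-a-time test can be sketched as below. Two assumptions are made loudly here: the error is estimated by grid sampling against an explicit reference function rather than with the reasoning system, and "one order of magnitude" is interpreted as dividing the lower endpoint and multiplying the upper endpoint by 10. All names, functions, and ranges are hypothetical.

```python
import itertools
import math


def max_rel_error(f, f_ref, box, steps=9):
    """Sampled max relative error of f against f_ref over an axis-aligned box."""
    axes = [[lo + (hi - lo) * k / (steps - 1) for k in range(steps)] for lo, hi in box]
    return max(abs(f(*x) - f_ref(*x)) / abs(f_ref(*x)) for x in itertools.product(*axes))


def widen(box, index, factor=10.0):
    """Extend one variable's interval by `factor` at both ends (positive variables)."""
    lo, hi = box[index]
    return box[:index] + [(lo / factor, hi * factor)] + box[index + 1:]


def dependency_flags(f, f_ref, box, tol=1e-6):
    """1 if widening variable i leaves the error unchanged (up to tol), else 0."""
    base = max_rel_error(f, f_ref, box)
    return [int(max_rel_error(f, f_ref, widen(box, i)) <= base + tol)
            for i in range(len(box))]


def f_ref(m, d):
    """Hypothetical stand-in for the derivable function f_B."""
    return math.sqrt(d ** 3 / m)


def f(m, d):
    """Hypothetical candidate model that ignores the mass m."""
    return math.sqrt(d ** 3)


box = [(0.9, 1.1), (1.0, 4.0)]   # made-up data ranges for (m, d)
flags = dependency_flags(f, f_ref, box)
```

In this toy setting the flag for m is 0 and the flag for d is 1: widening the mass interval increases the error, while widening the distance interval does not, mirroring the 1/0 encoding used in columns 8–10 of Table 1.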

Figure 4 provides a visualization of the two errors \({\varepsilon }_{2}^{r}\) and \({\beta }_{2}^{r}\) for the first three functions of Table 1 (solar-system dataset) and the ground truth \({f}^{*}\).

Fig. 4: Depiction of symbolic models for Kepler’s third law of planetary motion giving the orbital period of a planet in the solar system.

The models produced by our SR system are represented by points (ε, β), where ε represents distance to data, and β represents distance to background theory. Both distances are computed with an appropriate norm on the scaled data.

Relativistic time dilation

Einstein’s theory of relativity postulates that the speed of light is constant, and implies that two observers in relative motion to each other will experience time differently and observe different clock frequencies. The frequency \({{{\mathcal{f}}}}\) for a clock moving at speed v is related to the frequency \({{{{\mathcal{f}}}}}_{0}\) of a stationary clock by the formula

$$\frac{{{{{\mathcal{f}}}}}-{{{{\mathcal{f}}}}}_{0}}{{{{{\mathcal{f}}}}}_{0}}=\sqrt{1-\frac{{v}^{2}}{{c}^{2}}}-1 \,,$$
(5)

where c is the speed of light. This formula was recently confirmed experimentally by Chou et al.26 using high precision atomic clocks. We test our system on the experimental data reported by Chou et al.26 which consists of measurements of v and associated values of \((\;{{{\mathcal{f}}}}-{{{{{\mathcal{f}}}}}_{0}})/{{{{{\mathcal{f}}}}}_{0}}\), reproduced in the Supplementary Information. We take the axioms for derivation of the time dilation formula from the work of Behroozi27 and Smith28. These are also listed in the Supplementary Information and involve variables that are not present in the experimental data.
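For a sense of scale, the right-hand side of Eq. (5) can be evaluated directly. The sketch below uses an algebraically equivalent form that avoids subtracting two nearly equal numbers when v is much smaller than c, and compares it with the leading term of its Taylor expansion. The function names and the sample speeds are assumptions chosen for illustration and are not taken from the experimental data.

```python
import math

C_LIGHT = 299_792_458.0  # speed of light in m/s (exact SI value)


def fractional_shift(v, c=C_LIGHT):
    """Right-hand side of Eq. (5), sqrt(1 - v^2/c^2) - 1, in a cancellation-free form."""
    x = (v / c) ** 2
    if x > 1.0:
        raise ValueError("speed must not exceed c")
    # sqrt(1 - x) - 1 == -x / (1 + sqrt(1 - x)); the right-hand form stays
    # accurate in floating point when x is tiny.
    return -x / (1.0 + math.sqrt(1.0 - x))


def low_speed_approx(v, c=C_LIGHT):
    """Leading Taylor term of Eq. (5) for v much smaller than c: -v^2 / (2 c^2)."""
    return -0.5 * (v / c) ** 2


shift_slow = fractional_shift(10.0)             # a clock moving at 10 m/s
shift_fast = fractional_shift(0.6 * C_LIGHT)    # a clock moving at 0.6 c
```

At everyday speeds the exact expression and the quadratic term are numerically indistinguishable, which suggests why data collected at low velocities can be fit well by formulae other than Eq. (5).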

In Table 2 we give some functions obtained by our SR module (using {\(+,-, \times , \div , \sqrt {\,}\)} as the set of input operators) along with the numerical errors of the associated functions and generalization reasoning errors. The sixth column gives the set S as an interval for v for which our reasoning module can verify that the absolute generalization reasoning error of the function in the first column is at most 1. The last column gives the interval for v for which we can verify a relative generalization reasoning error of at most 2%. Even though the last function has low relative error according to this metric, it can be ruled out as a reasonable candidate if one assumes the target function should be continuous (it has a singularity at v = 1). Thus, even though we cannot obtain the original function, we obtain another which generalizes well, as it yields excellent predictions for a very large range of velocities.

Table 2 Candidate functions derived from time dilation data, and associated error values

In this case, our system can also help rule out alternative axioms. Consider replacing the axiom that the speed of light is a constant value c by a “Newtonian” assumption that light behaves like other mechanical objects: if emitted from an object with velocity v in a direction perpendicular to the direction of motion of the object, it has velocity \(\sqrt{{v}^{2}+{c}^{2}}\). Replacing c by \(\sqrt{{v}^{2}+{c}^{2}}\) (in axiom R2 in the Supplementary Information to obtain R2’) produces a self-consistent axiom system (as confirmed by the theorem prover), albeit one leading to no time dilation. Our reasoning module concludes that none of the functions in Table 2 is compatible with this updated axiom system: the absolute generalization reasoning error is greater than 1 even on the dataset domain, as is the pointwise reasoning error. Consequently, the data is used indirectly to discriminate between axiom systems relevant for the phenomenon under study; SR poses only accurate formulae as conjectures.

Langmuir’s adsorption equation

The Langmuir adsorption equation (Nobel Prize in Chemistry, 1932)29 describes a chemical process in which gas molecules contact a surface, and relates the loading q on the surface to the pressure p of the gas:

$$q=\frac{{q}_{\max }{K}_{{{{{{{{\rm{a}}}}}}}}}p}{1+{K}_{{{{{{{{\rm{a}}}}}}}}}p} \, .$$
(6)

The constants \({q}_{\max }\) and Ka characterize the maximum loading and the adsorption strength, respectively. A similar model for a material with two types of adsorption sites yields:

$$q=\frac{{q}_{\max\!,1}{K}_{{{{{{{{\rm{a}}}}}}}},1}p}{1+{K}_{{{{{{{{\rm{a}}}}}}}},1}p}+\frac{{q}_{\max\!,2}{K}_{{{{{{{{\rm{a}}}}}}}},2}p}{1+{K}_{{{{{{{{\rm{a}}}}}}}},2}p} \,,$$
(7)

with parameters for maximum loading and adsorption strength on each type of site. The parameters in Eqs. (6) and (7) are fit to experimental data using linear or nonlinear regression, and depend on the material, gas, and temperature.

We used data from Langmuir’s 1918 publication29 for methane adsorption on mica at a temperature of 90 K, and also data from the work of Sun et al.30 (Table 1) for isobutane adsorption on silicalite at a temperature of 277 K. In both cases, observed values of q are given for specific values of p; the goal is to express q as a function of p. We give the SR module the operators { + , − , × , ÷}, and obtain the best fitting functions with two and four constants. The code ran for 20 minutes on 45 cores, and seven of these functions are displayed for each dataset.

To encode the background theory, following Langmuir’s original theory29, we elicited the following set \({{{{{{{\mathcal{A}}}}}}}}\) of axioms:

$$\begin{array}{lll}
\mathrm{L}1.\quad \mbox{Site balance} && S_{0}=S+S_{\mathrm{a}}\\
\mathrm{L}2.\quad \mbox{Adsorption rate model} && r_{\mathrm{ads}}=k_{\mathrm{ads}}pS\\
\mathrm{L}3.\quad \mbox{Desorption rate model} && r_{\mathrm{des}}=k_{\mathrm{des}}S_{\mathrm{a}}\\
\mathrm{L}4.\quad \mbox{Equilibrium assumption} && r_{\mathrm{ads}}=r_{\mathrm{des}}\\
\mathrm{L}5.\quad \mbox{Mass balance on}\ q && q=S_{\mathrm{a}}.
\end{array}$$
(8)

Here, S0 is the total number of sites, of which S are unoccupied and Sa are occupied (L1). The adsorption rate rads is proportional to the pressure p and the number of unoccupied sites (L2). The desorption rate rdes is proportional to the number of occupied sites (L3). At equilibrium, rads = rdes (L4), and the total amount adsorbed, q, is the number of occupied sites (L5) because the model assumes each site adsorbs at most one molecule. Langmuir solved these equations to obtain

$$q=\frac{{S}_{0}({k}_{{{{{{{{\rm{ads}}}}}}}}}/{k}_{{{{{{{{\rm{des}}}}}}}}})p}{1+({k}_{{{{{{{{\rm{ads}}}}}}}}}/{k}_{{{{{{{{\rm{des}}}}}}}}})p} \,,$$
(9)

which corresponds to Eq. (6), where \({q}_{\max }={S}_{0}\) and Ka = kads/kdes. An axiomatic formulation for the multi-site Langmuir expression is described in the Supplementary Information. Additionally, constants and variables are constrained to be positive (e.g., S0 > 0, S > 0, and Sa > 0) or non-negative (e.g., q ≥ 0).
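For reference, the elimination that takes axioms L1–L5 to Eq. (9) can be written out in a few lines. This is standard algebra included as an illustrative sketch; the division in the third line is valid because the constants and p are non-negative, so the denominator is positive.

$$\begin{aligned}
k_{\mathrm{ads}}\,p\,S &= k_{\mathrm{des}}\,S_{\mathrm{a}} && \text{(L2, L3, L4)}\\
k_{\mathrm{ads}}\,p\,(S_{0}-S_{\mathrm{a}}) &= k_{\mathrm{des}}\,S_{\mathrm{a}} && \text{(L1)}\\
S_{\mathrm{a}} &= \frac{S_{0}\,k_{\mathrm{ads}}\,p}{k_{\mathrm{des}}+k_{\mathrm{ads}}\,p} = \frac{S_{0}(k_{\mathrm{ads}}/k_{\mathrm{des}})\,p}{1+(k_{\mathrm{ads}}/k_{\mathrm{des}})\,p} \\
q &= S_{\mathrm{a}} && \text{(L5)}
\end{aligned}$$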

The logic formulation to prove is:

$$(\mathcal{C}\wedge \mathcal{A})\to f\,,$$
(10)

where \(\mathcal{C}\) is the conjunction of the non-negativity constraints, \(\mathcal{A}\) is the conjunction of the axioms, the union of \(\mathcal{C}\) and \(\mathcal{A}\) constitutes the background theory \(\mathcal{B}\), and f is the formula we wish to prove.

SR can only generate numerical expressions involving the (dependent and independent) variables occurring in the input data, with specific numerical values for the constants; for example, the expression f = p/(0.709p + 0.157). By contrast, expressions built from the variables and constants of the background theory, such as Eq. (9), contain the constants in symbolic form: \(k_{\rm ads}\) and \(k_{\rm des}\) appear explicitly in Eq. (9), whereas SR only generates a numerical instance of the ratio of these constants. Thus, we cannot use Formula (10) directly to prove formulae generated by SR. Instead, we replace each numerical constant of the formula by a logic variable \(c_i\): for example, the formula f = p/(0.709p + 0.157) is replaced by \(f^{\prime}=p/(c_{1}p+c_{2})\), introducing two new variables \(c_1\) and \(c_2\). We then quantify the new variables existentially and define a new set of non-negativity constraints \(\mathcal{C}^{\prime}\). In the example above, \(\mathcal{C}^{\prime}=c_{1} > 0\,\wedge\,c_{2} > 0\).
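The replacement of numerical constants by logic variables is mechanical and can be sketched in a few lines. The following Python fragment is a minimal illustration, not the implementation used in this work: the function name `abstract_constants`, the string representation of expressions, and the choice to treat every numeric literal as a positive constant are all assumptions made for the example.

```python
import re

# Matches an unsigned numeric literal that is not part of an identifier
# such as "c1" or "S0" (the lookbehind rejects a preceding word char or dot).
_NUMBER = re.compile(r"(?<![\w.])(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][-+]?\d+)?")


def abstract_constants(expr: str):
    """Replace each numeric literal in `expr` by a fresh symbol c1, c2, ...

    Returns the abstracted expression f', the mapping from each new symbol
    to the numeric value it replaced, and the positivity constraints C'
    as a string. (Hypothetical helper, for illustration only.)
    """
    values = {}

    def fresh(match):
        name = f"c{len(values) + 1}"
        values[name] = float(match.group(0))
        return name

    abstracted = _NUMBER.sub(fresh, expr)
    constraints = " & ".join(f"{name} > 0" for name in values)
    return abstracted, values, constraints


f_prime, values, c_prime = abstract_constants("p/(0.709*p + 0.157)")
print(f_prime)   # p/(c1*p + c2)
print(c_prime)   # c1 > 0 & c2 > 0
```

The numeric values are kept alongside the abstracted expression so that a witness found by the reasoner can later be compared with what SR fitted.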

The final formulation to prove is:

$$\exists c_{1}\cdots \exists c_{n}\;(\mathcal{C}\wedge \mathcal{A})\to (f^{\prime}\wedge \mathcal{C}^{\prime}).$$
(11)

For example, \(f^{\prime}=p/(c_{1}p+c_{2})\) is proved true if the reasoner can prove that there exist values of \(c_1\) and \(c_2\) such that \(f^{\prime}\) satisfies the background theory \(\mathcal{A}\) and the constraints \(\mathcal{C}\). Here \(c_1\) and \(c_2\) can be functions of the constants \(k_{\rm ads}\), \(k_{\rm des}\), \(S_0\), and/or real numbers, but not of the variables q and p.
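For the single-site case, one such witness can be read off by dividing the numerator and denominator of Eq. (9) by \(S_{0}\,k_{\rm ads}/k_{\rm des}\), which gives \(c_{1}=1/S_{0}\) and \(c_{2}=k_{\rm des}/(k_{\rm ads}S_{0})\). The sketch below spot-checks this identity with exact rational arithmetic. It is only a sanity check at a few pressures with arbitrarily chosen constants, whereas the reasoner establishes the statement for all admissible values; the function names are ours.

```python
from fractions import Fraction


def langmuir_q(S0, k_ads, k_des, p):
    """Eq. (9): amount adsorbed predicted by the background theory."""
    K = k_ads / k_des
    return S0 * K * p / (1 + K * p)


def witness(S0, k_ads, k_des):
    """Candidate values of c1, c2 expressed through the theory's constants."""
    c1 = 1 / S0
    c2 = k_des / (k_ads * S0)
    return c1, c2


def f_prime(c1, c2, p):
    """Abstracted SR expression f' = p / (c1*p + c2)."""
    return p / (c1 * p + c2)


# Arbitrary positive rationals, chosen only for illustration.
S0, k_ads, k_des = Fraction(7, 5), Fraction(9, 2), Fraction(3, 4)
c1, c2 = witness(S0, k_ads, k_des)

# Exact equality holds because Fractions avoid floating-point rounding.
for p in (Fraction(1, 10), Fraction(1), Fraction(25, 3)):
    assert f_prime(c1, c2, p) == langmuir_q(S0, k_ads, k_des, p)

# The witness also satisfies the positivity constraints C'.
assert c1 > 0 and c2 > 0
```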

We also consider background knowledge in the form of a list of desired properties of the relation between p and q, which helps trim the set of candidate formulae. Thus, we define a collection \(\mathcal{K}\) of constraints on f, where q = f(p), enforcing monotonicity or certain types of limiting behavior (see Supplementary Information). We use Mathematica21 to verify that a candidate function satisfies the constraints in \(\mathcal{K}\).
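To make the role of such constraints concrete, the sketch below checks two illustrative properties numerically on a grid: that f(0) = 0 and that f is non-decreasing in p. This is a rough stand-in for the symbolic verification done in Mathematica, not a reproduction of it; the exact contents of \(\mathcal{K}\) are given in the Supplementary Information, and the function name, grid, and tolerance here are assumptions.

```python
def satisfies_constraints(f, p_max=10.0, n=200, tol=1e-12):
    """Numerically spot-check two illustrative properties of q = f(p).

    - f(0) == 0 (no adsorption at zero pressure)
    - f is non-decreasing on a uniform grid over [0, p_max]
    A grid check can miss violations between grid points.
    """
    grid = [p_max * i / n for i in range(n + 1)]
    values = [f(p) for p in grid]
    zero_at_origin = abs(values[0]) <= tol
    monotone = all(b >= a - tol for a, b in zip(values, values[1:]))
    return {"zero_at_origin": zero_at_origin, "monotone": monotone}


def langmuir_like(p):
    """Example candidate with the constants quoted in the text."""
    return p / (0.709 * p + 0.157)


print(satisfies_constraints(langmuir_like))
# {'zero_at_origin': True, 'monotone': True}
```

A candidate that fails such a screen can be discarded before any attempt at formal derivation, which is the trimming effect described above.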

In Table 3, column 1 gives the data source, and column 2 gives the “hyperparameters” used in our SR experiments: we allow either two or four constants in the derived expressions. Furthermore, as the first constraint C1 from \(\mathcal{K}\) can be modeled by simply adding the data point p = q = 0, we also experiment with an “extra point”.

Table 3 Results on two datasets for the Langmuir problem

Column 3 displays a derived expression, while columns 4 and 5 give, respectively, the relative numerical errors \(\varepsilon_{2}^{r}\) and \(\varepsilon_{\infty}^{r}\). If the expression can be derived from our background theory, then we indicate that in column 6. Column 7 indicates the number of constraints from \(\mathcal{K}\) that each expression satisfies, verified by Mathematica. These results are visualized in Fig. 5. Among the top two-constant expressions, f1 fits the data better than f2; however, f2 is derivable from the background theory, whereas f1 is not.
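As a reminder of what the two error columns measure, the sketch below computes a root-mean-square relative error and a maximum relative error over the data points. The formulas in the docstring are our assumption about the intended definitions and should be checked against the definitions given earlier in the paper; the data values are synthetic and are not taken from either dataset.

```python
import math


def relative_errors(f, ps, qs):
    """Return (rms relative error, max relative error) of f on the data.

    Assumed definitions (to be checked against the paper's own):
      eps_2   = sqrt(mean(((f(p_i) - q_i) / q_i) ** 2))
      eps_inf = max(|f(p_i) - q_i| / |q_i|)
    """
    rel = [abs(f(p) - q) / abs(q) for p, q in zip(ps, qs)]
    eps_2 = math.sqrt(sum(r * r for r in rel) / len(rel))
    eps_inf = max(rel)
    return eps_2, eps_inf


# Synthetic values for illustration only.
ps = [1.0, 2.0, 4.0]
qs = [1.0, 2.0, 5.0]
eps_2, eps_inf = relative_errors(lambda p: p, ps, qs)
print(eps_2, eps_inf)
```

Under these definitions the maximum relative error is never smaller than the root-mean-square one, so a large gap between the two columns points to a few poorly fitted points.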

Fig. 5: Symbolic regression solutions to two adsorption datasets.

Fig. 5a refers to the methane adsorption on mica at a temperature of 90 K, while Fig. 5b refers to the isobutane adsorption on silicalite at a temperature of 277 K. f2 and g2 are equivalent to the single-site Langmuir equation; g5 and g7 are equivalent to the two-site Langmuir equation.

When we search for four-constant expressions on Langmuir’s dataset29, we get much smaller errors than with Eq. (6) or even Eq. (7), but we do not obtain the two-site formula (Eq. (7)) as a candidate expression. For the dataset from Sun et al.30, g2 has a form equivalent to Langmuir’s one-site formula, and g5 and g7 have forms equivalent to Langmuir’s two-site formula, with appropriate values of \(q_{\max,i}\) and \(K_{{\rm a},i}\) for i = 1, 2.

System limitations and future improvements

Our results on three problems and associated data are encouraging and provide the foundations of a new approach to automated scientific discovery. However, our work is only a first, albeit crucial, step towards completing the missing links in automating the scientific method.

One limitation of the reasoning component is the assumption that the background theory is correct and complete. Incompleteness could be partially addressed by introducing abductive reasoning31 (as depicted in Fig. 3). Abduction is a logical technique that aims to find explanations for an observation (or a set of observations), given a logical theory. The explanation axioms are produced so that they satisfy two conditions: (1) the explanation axioms are consistent with the original logical theory, and (2) the observation can be deduced from the enhanced theory (the original logical theory combined with the explanation axioms). In our context, the logical theory corresponds to the set of background-knowledge axioms that describe a scientific phenomenon, the observation is one of the formulas extracted from the numerical data, and the explanations are the axioms missing from the incomplete background theory.
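In symbols, the two conditions on an abductive explanation can be stated as follows. Here \(\mathcal{B}\) is the background theory and f the observed formula, as above, while \(\mathcal{E}\) denotes the set of explanation axioms; the symbol \(\mathcal{E}\) is introduced here only for illustration.

$$\begin{array}{ll}(1)\quad \mathcal{B}\cup \mathcal{E}\not\vdash \bot & \mbox{(the explanation is consistent with the theory)} \\ (2)\quad \mathcal{B}\cup \mathcal{E}\vdash f & \mbox{(the observation is derivable from the enhanced theory)}\end{array}$$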

However, the availability of background-theory axioms in machine-readable format for physics and other natural sciences is currently limited. Acquiring axioms could potentially be automated (or partially automated) using knowledge-extraction techniques. Extraction from technical books or articles that describe a natural-science phenomenon can be done with, for example, deep-learning methods (e.g., the work of Pfahler and Morik32, Alexeeva et al.33, or Wang and Liu34), from either natural-language plain text or semi-structured text such as LaTeX or HTML. Despite recent advances in this research field, the quality of the existing tools remains inadequate for the scope of our system.

Another limitation of our system, which depends heavily on the tools used, is its scaling behavior. Excessive computational complexity is a major challenge for automated theorem provers (ATPs): for certain types of logic (including the one that we use), proving a conjecture is undecidable. Deriving models from a logical theory using formal reasoning tools is even more difficult when complex arithmetic and calculus operators are involved. Moreover, the run-time variance of a theorem prover is very large: the system can at times solve some “large” problems while having difficulties with some “smaller” ones. Recent developments in the neuro-symbolic area use deep-learning techniques to enhance standard theorem provers (e.g., see Crouse et al.8). This research is still at an early stage, and much remains to be done. We envision that the performance and capability (in terms of speed and expressivity) of theorem provers will improve with time.

Symbolic regression tools, including the one we developed based on solving mixed-integer nonlinear programs (MINLP), often take an excessive amount of time to explore the space of possible symbolic expressions and find one that has low error and low expression complexity, especially with noisy data. In practice, the worst-case solution time for MINLP solvers (including BARON) grows exponentially with the encoding size of the input data (additional details in the Supplementary Information). However, both MINLP solver performance and genetic-programming-based symbolic regression solvers are active areas of research.
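A rough sense of how quickly the space of candidate expressions grows can be obtained by counting syntax trees. The sketch below counts binary expression trees with a given number of operator nodes, assuming four binary operators and two kinds of leaf (the variable p or a constant placeholder). It illustrates combinatorial growth only: it is not the MINLP formulation used in this work, and it counts algebraically equivalent expressions separately.

```python
from math import comb


def count_expression_trees(n_ops, n_operators=4, n_leaf_symbols=2):
    """Count binary expression trees with exactly `n_ops` operator nodes.

    Shapes: Catalan(n_ops) full binary trees with n_ops internal nodes.
    Labels: each internal node picks one operator, and each of the
    n_ops + 1 leaves picks one leaf symbol.
    """
    catalan = comb(2 * n_ops, n_ops) // (n_ops + 1)
    return catalan * n_operators ** n_ops * n_leaf_symbols ** (n_ops + 1)


for n_ops in range(1, 7):
    print(n_ops, count_expression_trees(n_ops))
```

Each additional operator multiplies the count by more than a factor of ten under these assumptions, which is one reason why pruning by complexity bounds and background knowledge matters in practice.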

Our proposed system could also benefit from improvements in individual components (especially in the functionality available). For example, KeYmaera supports differential equations only in time and not in other variables, and does not support higher-order logic; BARON cannot handle differential equations.

Beyond improving individual components, our system can be improved by introducing techniques such as experimental design (not described in this work but envisioned in Fig. 3). A fundamental question in the holistic view of the discovery process is what data should be collected to give us maximum information regarding the underlying model. The goal of optimal experimental design (OED) is to find an optimal sequence of data acquisition steps such that the uncertainty associated with the inferred parameters, or some predicted quantity derived from them, is minimized with respect to a statistical or information theoretic criterion. In many realistic settings, experimentation may be restricted or costly, providing limited support for any given hypothesis as to the underlying functional form. It is therefore critical at times to incorporate an effective OED framework. In the context of model discovery, a large body of work addresses the question of experimental design for predetermined functional forms, and another body of research addresses the selection of a model (functional form) out of a set of candidates. A framework that can deal with both the functional form and the continuous set of parameters that define the model behavior is obviously desirable22; one that consistently accounts for logical derivability or knowledge-oriented considerations35 would be even better.