Main

Many clinical studies are driven by the research questions that are statistical at first glance but causal by nature1. For instance2, how safe and efficient is a vaccine against viral infection? How are clinicopathological variables3 related to cancer patient survival? From the causal perspective, the common ground of these problems starts with determining the causal variables of the outcome of interest. In clinical medicine, randomized controlled trials (RCTs) are considered to be the gold standard to investigate cause–effect relationships4. In a prototypical RCT, a participant is randomly assigned to the experimental or control arm and the outcome of interest is observed. In the context of causal inference, such a randomization can be modelled with do-intervention5.

In this Article, we develop a new automated causal inference method (AutoCI) and apply this to two large-scale, practice-changing RCTs of patients with endometrial carcinoma conducted in the Netherlands from 1990–1997 (PORTEC 1; refs. 6,7) and 2002–2006 (PORTEC 2; refs. 8,9), with full clinicopathological datasets and mature outcome data. Endometrial carcinoma is the most common type of gynaecological cancer for women in developed countries10. The majority of women that are diagnosed with early stage endometrial cancer (EC) have a favourable prognosis and are treated with surgery11. Approximately 15–20% of patients have an unfavourable prognosis with a high risk of distant metastasis11. For those patients, different adjuvant therapies such as vaginal brachytherapy, external beam radiotherapy and chemoradiation are recommended on the basis of their risk group12. The two trials (PORTEC 1 and 2) used in this study made a key contribution to clinical practice by investigating how these therapies impact the risk of recurrence rates and survival6,8. According to the latest ESGO/ESTRO/ESP guidelines12 for the management of patients with endometrial carcinoma, the risk classification is based on a series of clinical and pathological variables such as tumour grading (Grade), lymphovascular space invasion (LVSI), myometrial invasion and so on, as well as molecular variables including—but not limited to—polymerase epsilon mutant EC (POLEmut), mismatch repair deficient (MMRd) EC, p53 abnormal EC (p53abn) and EC with no specific molecular profile (NSMP)12. Correlative statistical methods were used in a recent study3 to investigate the hazardous relevance of these variables to EC recurrence. There is strong evidence to suggest that these variables impact EC recurrence, but a systematic investigation to support this understanding from a modern causal inference perspective has not been performed.

Causal inference addresses the determination of cause–effect relationships from data13,14,15,16. When given either observational data or the inclusion of additional interventional data, clinical studies aim to either (1) quantify the causal effect of a treatment given the outcome13 or (2) infer the underlying causal structure of relationships between patient and treatment characteristics and relevant outcomes5,17.

The former can be well formulated as the difference between the outcome expectations conditioned on different treatments (average causal effect)13. A wide range of studies have built on this methodology, including—but not limited to—target trial specification18, target trial emulation19,20 and extending inferences from randomized trials to new target populations21.
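As a minimal sketch of the average causal effect described above, written as E[Y | do(T = 1)] − E[Y | do(T = 0)], the following Python snippet estimates it by the difference in arm means, which is a valid estimator only under randomized assignment. The function name and the toy data are hypothetical and are not taken from the PORTEC trials.

```python
from statistics import mean


def average_causal_effect(outcomes, treated):
    """Difference in mean outcome between the experimental and control arms.

    Under randomization, conditioning on the assigned arm coincides with the
    do-intervention, so the difference in means estimates the causal effect.
    """
    y1 = [y for y, t in zip(outcomes, treated) if t]
    y0 = [y for y, t in zip(outcomes, treated) if not t]
    if not y1 or not y0:
        raise ValueError("both arms need at least one participant")
    return mean(y1) - mean(y0)


# Hypothetical toy data: 1 = recurrence-free, 0 = recurrence.
outcomes = [1, 1, 0, 1, 0, 0, 1, 0]
treated = [True, True, True, True, False, False, False, False]
ace = average_causal_effect(outcomes, treated)
```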

In comparison to the causal effect identification, we refer to (2) as (causal) structure identification5. The goal of structure identification is often to learn the entire causal structure, that is, a directed acyclic graph composed of nodes and edges that connect nodes; however, this is generally a non-deterministic polynomial-time hard problem16,22. To learn the cause–effect relations without inferring the entire causal structure, invariant causal prediction (ICP)23 was proposed to determine the set of causal variables given an outcome variable. Under the ICP framework, we use the random variable concept (informally, a function that assigns a set of possible samples to a measurable quantity) and define causal variable as follows (see Supplementary Table 1 for summarized terminologies used in the paper).

Table 1 The comparison of causal variable determination for the PORTEC dataset among ICP, NICP and the proposed method
Table 2 The comparison of causal variable determination for the PORTEC dataset among ICP, NICP and the proposed method

Definition 1

Assume there exists a set of environments u ∈ U. Given a collection of random variables X = (X1, X2, …, Xn) and an outcome variable Y, if \({{{{\bf{X}}}}}_{{S}^{* }}=({X}_{{S}_{1}^{* }},\ldots ,{X}_{{S}_{j}^{* }})\) exists with indices S* ⊆ {1, …, n} such that

$$Y={f}^{*}({{{{\boldsymbol{X}}}}}_{{S}^{*}}^{u})+{\delta }^{u}\ \forall u\in U,\ {{{\rm{where}}}}\ {f}^{*}:{{\mathbb{R}}}^{| {S}^{*}| }\mapsto {\mathbb{R}}.$$
(1)
$${\delta }^{u}\ {{{\rm{are}}}}\left\{\begin{array}{ll}\,{{\mbox{identically distributed (i.d.)}}}\,&\,{{\mbox{if}}} \, \exists \, {{\mbox{hidden confounders}}}\,\\ \,{{\mbox{i.d. and}}}\,{\delta }^{u} \perp\!\!\!\!\!\perp {{{{\bf{X}}}}}_{{S}^{* }}^{u}&\,{{\mbox{else}}}\end{array}\right.$$
(2)

Then \({{{{\bf{X}}}}}_{{S}^{* }}\) are the plausible causal variables (under U). Here, \({{{{\bf{X}}}}}_{{S}^{* }}^{u}=({X}_{{S}_{1}^{* }}^{u},\ldots ,{X}_{{S}_{j}^{* }}^{u})\) are the corresponding random variables to \({{{{\bf{X}}}}}_{{S}^{* }}=({X}_{{S}_{1}^{* }},\ldots ,{X}_{{S}_{j}^{* }})\) created under the environment u. For example, the (experimental) environment u can arise via do-intervention17 (do(X0 = c)) on X0, in which case \({X}_{0}^{u}\) only takes the fixed value c. It is worth noting that the cause–effect relation \(f^{*}\) in equation (1) stays invariant and independent of U. Next we introduce the definition of identifiable causal variables.

Definition 2

Following the specification of U, X = (X1, X2, …, Xn) and Y in Definition 1, if a \({{{{\bf{X}}}}}_{\bar{S}}=({X}_{{\bar{S}}_{1}},\ldots ,{X}_{{\bar{S}}_{j}})\) exists with indices \(\bar{S}\subseteq \{1,\ldots ,n\}\) such that

$$\bar{S}:= \bigcap \{S\subseteq \{1,\ldots ,n\}| \ {{{{\bf{X}}}}}_{S}\,{{\mathrm{are}} \, {\mathrm{plausible}}\, {\mathrm{causal}}\, {\mathrm{variables}}}\,\},$$
(3)

then \({{{{\bf{X}}}}}_{\bar{S}}\) are the identifiable causal variables (under U), as they are referred to henceforth.
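To make equation (3) concrete, the sketch below enumerates every index subset, keeps those accepted by a plausibility check and intersects them. The `is_plausible` callback is a hypothetical stand-in for the invariance tests used by ICP-style methods, indices are 0-based here rather than 1-based, and returning the empty set when nothing is accepted is a convention chosen for this illustration. The loop visits all 2^n subsets, which is why exhaustive search becomes costly as n grows.

```python
from itertools import combinations


def identifiable_set(n, is_plausible):
    """Intersect all index sets S of {0, ..., n-1} that pass is_plausible."""
    accepted = [
        set(subset)
        for size in range(n + 1)
        for subset in combinations(range(n), size)
        if is_plausible(frozenset(subset))
    ]
    if not accepted:
        return set()  # convention for this sketch: nothing is identified
    return set.intersection(*accepted)


# Hypothetical oracle: a set is plausible iff it contains the parents {0, 2}.
true_parents = {0, 2}
result = identifiable_set(4, lambda s: true_parents <= s)
```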

Vanilla ICP23 was initially presented and verified on linear cause–effect relations. Heinze-Deml and co-workers24 next defined u as a random variable that is neither the descendant nor the parent of Y, and conducted multiple conditional independence tests on nonlinear cause–effect settings (NICP). Gamella and Heinze-Deml25 recently suggested investigating the stable set of variables instead, which is a relaxation of the set of identifiable causal variables (AICP). This progress in ICP has opened unprecedented paths to interpret complex datasets, especially those collected from RCTs. Finding which variables determine whether a treatment works or whether a patient will have a recurrence using ICP methods has great scientific potential. In application to the clinical domain, data interpretation by causal inference methods could improve our understanding of disease and aid in the design of new experiments and clinical trials; however, non-negligible efforts are required to adapt the existing ICPs to the clinical domain. This is due to: (1) the complexity and multitude of variables that are considered relevant for treatment outcomes including patient level characteristics (patient demographics, text data from clinical records), information derived from images (radiology, pathology) and molecular data (genomic sequencing); and (2) the incompatibility between the error-tolerant implementations developed for simulated datasets and the safety-critical applications relevant for medical decisions. Candidate ICP methods therefore need to be robust against noise and need to provide meaningful outputs that can be related to clinical risk in order to inform patient stratification.

Results

Clinical variables overview

The PORTEC 1 and 2 trials6,8 recruited 714 (1990–1997) and 427 (2002–2006) patients with early stage endometrial carcinoma, respectively; 305 cases from PORTEC 1 (42.7%) and 335 cases from PORTEC 2 (78.5%) with complete clinicopathological datasets were aligned and used in the experiments. Clinicopathologic characteristics of these subgroups were similar to the original trial populations (less than or equal to 17.3% absolute difference in frequency of any variable). Importantly, there was no substantial difference in the variable of interest (mean and five-year recurrence free survival (RFS)) for causal variable identification between the excluded and included patient datasets, supporting that our analysis is representative of the overall study population (Supplementary Fig. 1 and Supplementary Tables 2 and 3). Considering PORTEC 1 and 2 as the two experimental environments, we aimed to determine the causal pathological, molecular and immune-related variables of EC recurrence status.

Pathological variables

Pathological criteria tumour grading (Grade)26,27, LVSI28 and myometrial invasion26,27 are examined in the study, all of which are important indicators for an elevated risk of EC recurrence (see also ref. 12). All variables were reevaluated on formalin-fixed paraffin-embedded tumour material by specialized gynecopathologists to guarantee variable consistency for the two environments (trials).

Molecular variables

The molecular classification of EC distinguishes four subtypes with validated prognostic impact: (1) ultra-mutated EC with DNA-polymerase epsilon exonuclease domain mutations (POLEmut), with an excellent prognosis; (2) hypermutated EC with MMRd, with an intermediate prognosis; (3) copy-number-high EC with frequent TP53 mutations (p53abn), with an unfavourable prognosis; and (4) copy-number-low EC with no specific molecular profile (NSMP), with an intermediate prognosis27,29. Pathogenic POLE mutations were detected by next-generation sequencing of POLE hotspot exons30. Mismatch repair deficient (MMRd) EC and p53 status were determined by immunohistochemistry31. Cases with more than one classifying feature were classified according to the dominant molecular feature on the basis of pathogenicity32. Over-expression of L1CAM by tumour cells was assessed by immunohistochemistry using a cut-off of ≥10% for positivity, and is associated with an increased risk of metastasis and death33,34.

Immune variable

(Intraepithelial) CD8+ T-cell infiltration is an independent favourable prognostic indicator in early stage EC3. To quantify CD8+ T-cell infiltration in tissue samples of the PORTEC 1 and 2 trials, we compute CD8+ cell density derived from tissue microarrays by immunohistochemistry and image analysis3,35. Specifically, tissue microarrays capture cancer tissue samples from each patient in a highly standardized manner, allowing for the highly accurate evaluation of tumour and microenvironment-related factors in cancer samples for investigation with clinical outcomes36.

Based on existing domain knowledge and biological understanding3,29,31,37,38,39, we consider the pathological (P), molecular (M) and immune (I) variables to be the proxies of causal variables (Sprox) (see Supplementary Table 3 for more characteristics details).

Sanity-check variables

To investigate the robustness of the causal inference models, we intentionally include a randomized number as the Patient ID and the vital tissue area of each tissue microarray core (where the tissue area is the sum of randomly sampled tumour and stroma areas from each case) in our subsequent analysis as non-causal variables. Based on prior domain knowledge, the two variables should not be causally related to cancer outcomes. The inclusion of these two non-causal variables thus serves as an important benchmark for comparison of the methods presented in this study. Importantly, as the two variables are expected not to impact clinical outcome and we attempt to verify that they do not impact the outcome, this strategy can also be considered as a typical use case of negative control40 in RCT studies.

PORTEC experiments overview

As presented in Fig. 1 (top), our proposed AutoCI is composed of two key components: (1) a program synthesis language that searches the type-safe function candidates automatically, and (2) a novel causal differentiable learning scheme that determines the causal variables. In the following we first present the experimental results with emphasis on the individual components (1) and (2). We then demonstrate the overall performance on the main and ablation studies for the complete AutoCI method, confirming that the integration of program synthesis with a causal differentiable learning scheme is a critical step towards automated causal inference for clinical applications.

Fig. 1: The overall model illustration and performance of the proposed AutoCI.

Top: an illustrative scheme of the proposed AutoCI. In the syntax (top left), the type T includes atomic type (ATM), function type (FUC) and abstract data type (ADT), the program prg contains neural network (NN), function composition (COMP), concatenation (CAT), filter (FILTER), predicate (PRED) and so on. In the causal differentiable learning, causal prob. indicates causal probability. The outcome variable RFS means recurrence free survival. Bottom left: the sampled numbers of type-safe functions versus generic functions. Here the size is the maximum amount of nn and PRED functions allowed during the program synthesis. Bottom middle: the learning curve of the JS for top-four type-safe functions achieved in the case with pathological, molecular and immune variables. Bottom right: the running time of determining the causal variables for P, PM and PMI. Here the proposed AutoCI utilizes the function COMP(NN, CAT(FILTER(PRED))).

Automated type-safe function search

Type-safe functions are favourable candidates to strengthen the security and reliability of critical software applications41. However, the manual verification of type-safe properties can be cumbersome, especially when examining thousands of function candidates. Figure 1 (bottom left) shows that the AutoCI can efficiently filter out a subset of promising type-safe differentiable functions (see also similar results in ref. 42), and this can be accomplished in a short period of time: 42.26 s, 118.56 s, 119.44 s for size = {3, 4, 5}. By automatically excluding a large number of generic functions that are not type-safe, it can greatly improve the development efficiency and algorithm safety compared with manual function design. After obtaining the type-safe candidates, we execute the causal differentiable learning scheme (Fig. 1, top) on the candidates and determine the set of predicted causal variables Spred. Here we utilize the Jaccard similarity (JS)25 as a key metric to measure the prediction accuracy,

$${\mathrm{JS}}({S}_{{\mathrm{pred}}},{S}_{{\mathrm{prox}}})=\frac{| {S}_{{\mathrm{pred}}}\cap {S}_{{\mathrm{prox}}}| }{| {S}_{{\mathrm{pred}}}\cup {S}_{{\mathrm{prox}}}| }$$
(4)
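Equation (4) can be sketched in a few lines of Python. The variable sets below are hypothetical examples rather than results from the study, and scoring two empty sets as 1.0 is an assumption made for this illustration, since the equation itself is undefined in that case.

```python
def jaccard_similarity(s_pred, s_prox):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two variable sets."""
    s_pred, s_prox = set(s_pred), set(s_prox)
    union = s_pred | s_prox
    if not union:
        return 1.0  # assumption: two empty sets agree perfectly
    return len(s_pred & s_prox) / len(union)


# Hypothetical sets: two proxies recovered, one missed, one spurious variable.
s_prox = {"Grade", "LVSI", "Myometrial invasion"}
s_pred = {"Grade", "LVSI", "Tissue area"}
score = jaccard_similarity(s_pred, s_prox)
```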

Figure 1 (bottom middle) demonstrates the JS accuracy with the growing epochs for the top-four type-safe candidates (see also Supplementary Table 4). We conclude that the function f = COMP(NN, CAT(FILTER(PRED))) achieves the optimal JS score (for example, 91.9 ± 0.06%) and thus is used as the default function for further analysis. As pointed out in Valkov et al.42, HOUDINI allows us to transfer high-level modules across learning tasks. More specifically, the type-safe candidates discovered via the search algorithm are agnostic of disease-specific features for hazard analysis. Independent of the hazard analysis conducted for cancer studies, these type-safe candidates can therefore be re-used and fine-tuned to perform causal variable identification given data on the survival outcome for each patient. Depending on the JS score achieved by the candidates we determine the optimal learned type-safe model. As a result, the proposed AutoCI approach can pave the way towards an efficient causal analysis for many of the real-world cancer studies.

Determining causal variables with clear differentiation

We compare the AutoCI with the state-of-the-art ICP methods, ICP and NICP. Although competitive results are achieved by AICP on the toy experiments, AICP requires the regeneration of additional interventional data in each learning step. Such a learning scheme is incompatible with the real-world RCT setting, hence AICP is not applicable to the PORTEC experiments. As ICP and NICP explicitly accept or reject the variable of interest, we report yes or no in Table 1. For the proposed AutoCI, we report the mean of the causal probabilities for each variable. Overall, the proposed AutoCI outperforms the ICP and NICP in terms of differentiating causal and non-causal variables (≥50% versus <25% for PMI), demonstrating its advantages over the SOTA methods by a clear margin (Table 1). When examining the individual variables of interest, we can see that ICP fails to determine the proxy variables to be the causal ones, whereas for NICP all of the variables, including the sanity-check ones, are considered to be causal. These results clearly contrast with the methodological comparisons of the ICP methods on the toy data (Fig. 2), which respect the normal distributions. If we decompose the proposed causal learning scheme, we witness the suppression of the causal probability on the sanity-check variables over the warm-up stage (steps 1 and 2 of the pseudo code in Fig. 1), whereas the causal variables do not show deterioration of performance. This clear differentiation of causal and non-causal variables can aid the definition of meaningful cut-offs by AutoCI on a given cohort guided by clinical expertise.

Fig. 2: The learning performance and time of the optimal type-safe f = COMP(NN, CAT(FILTER(PRED))) on toy datasets.

Left: the learning curve of the JS for top-four type-safe functions (finite sample setting). Middle: the learning curve of the JS for top-four type-safe functions (ABCD setting). Right: the running time of the compared methods for the finite sample and ABCD settings. Here the proposed AutoCI utilizes the function COMP(NN, CAT(FILTER(PRED))).

Hazard analysis of the individual variables

In parallel to the causal variable determination, the corresponding HR analysis on EC recurrence is also performed. In the scenario in which unknown spurious (non-causal) variables are included in the hazard analysis, the causal cut-off can help reduce the noise introduced by non-causal variables. For instance, Table 2 reports the decreased hazard of tissue area (0.90; 0.88–0.92) and Patient ID (0.92; 0.91–0.94) achieved in the warm-up stage for the PMI case. Without causal analysis, one may falsely conclude that larger tissue area leads to a slightly lower risk of cancer recurrence. Besides, the HR achieved within the warm-up and complete learning stage remains stable and consistent with the standard clinical interpretation, that is, the values assigned to the variable indeed correctly correspond to either poor or favourable outcomes. This learning scheme can therefore help deliver reliable outputs that are understandable for clinical experts.

Ablation study with hidden confounders

To elaborate the robustness of AutoCI, we conducted ablation studies with the influence of confounding. This is achieved by step-wise inclusion of P and PM. As shown in Tables 1 and 2, the proposed AutoCI presents consistent advantages over the existing ICP methods in terms of learning meaningful causal probabilities for both non-causal and causal variables. For instance, the causal probabilities of tissue area and Patient ID are determined to be 35.96% and 20.61% for P, and 24.94% and 17.48% for PM, respectively, whereas the causal probabilities of all of the proxy variables remain close to or above 50%. In the hazard analysis, the results of the confounding studies P and PM are also consistent with the results reported for the main study (PMI). For instance, L1CAM and p53abn are assigned increased hazards for PM and PMI, indicating a poor prognostic outcome. These results are in agreement with clinical understanding27,29,33. The numerical improvements from P to PMI in the accuracy of probabilistic predictions (Fig. 3) provide complementary evidence confirming the robustness of AutoCI. When compared with the warm-up stage of AutoCI, we further observe a reduction in variance in the evaluation of the complete causal-aware AutoCI. Finally, the running time of the P, PM and PMI studies grows only mildly for the proposed method, in contrast to the dramatic increase in time complexity for ICP and NICP, where the bottleneck of the ICPs lies in the exponential increase O(2^S) in search space with the growing number of variables S (equation (3)).

Fig. 3: The evaluation metrics of hazard analysis conducted on the PORTEC dataset.

Box plots of the concordance index (left), Brier score (middle), and the binomial log-likelihood (right) that are derived from n = 640 patients included in the PORTEC dataset, where the box bounds the interquartile range (IQR) divided by the median, and whiskers extend to ±1.5 × IQR beyond the box.

Discussion

In this study, we proposed a novel automated causal inference algorithm (AutoCI). Taking two large RCTs as the experimental environments, we reinterpret the clinical variables of interest from a causal perspective. The proposed AutoCI demonstrates consistent advantages compared with existing ICP methods in determining causal variables with the presence of hidden confounders. Complementary to the standard hazard analysis, AutoCI provides an automatic tool for medical data analysis to investigate causal association of patient variables from a new perspective, offering informative and critical evidence to support clinical interpretation. Specifically, the accurate determination and exclusion of the spurious (non-causal) variables is a key step to enable more precise patient stratification in the future.

Design choices of AutoCI

Unlike generic algorithms, clinical algorithms must deal with unique challenges in terms of ensuring safety43 and robustness. Error-prone algorithms can potentially lead to critical errors in medical care. Driven by the need to develop safety-critical applications, AutoCI is carried out with a type-safe program synthesis method42. By further incorporating a newly proposed causal-aware module into this framework, we are able to synthesize a subset of differentiable type-safe candidates well suited for causal-aware learning. Compared to the laborious and error-prone manual function design, this implementation improves the efficiency and safety of AutoCI. Moreover, to achieve robustness on real clinical tasks, we introduce a novel causal differentiable learning scheme that utilizes the Fréchet inception distance (FID)44. As a whole, the proposed AutoCI is the seamless integration of both components.

Comparison with existing ICPs

Application of the prior ICP methods has confirmed the feasibility of causal variable learning on toy experiments (Table 3 and Fig. 2). This is substantiated by the outstanding results in the absence of confounders; however, the error-tolerant implementations of prior ICPs on the synthesized experiments are not well-tailored for real clinical applications, especially in the presence of hidden confounders. Compared with ICP, AICP and NICP, AutoCI presents robust results on both toy and PORTEC experiments. With the inclusion of confounders, AutoCI demonstrated a robust differentiation between causal and non-causal variables for PORTEC, and achieves superior quantitative scores on both the finite sample and ABCD settings.

Table 3 The comparison of causal variable determination for the toy datasets between ICPs and the proposed method

Clinical interpretations

Importantly, the hazard analysis and ranking of clinicopathological and molecular variables using AutoCI (Table 1) is generally consistent with the common biological and clinical interpretation. Taking pathological variables as an example, studies26,27,28 indeed show that grade, deep myometrial invasion and LVSI are important independent predictors of early EC recurrence. The causal probabilities provided by AutoCI thereby give additional information on the relevance of each variable for the determination of outcome, and the likelihood of each variable is consistent with domain expertise. Lymphovascular space invasion is considered to be a critical predictor independent of molecular subgroup and is ranked with the highest causal probability, while grade and myometrial invasion are indeed weaker but independent prognostic indicators.

Furthermore, AutoCI correctly identifies the prognostic associations of the molecular variables of EC, assigns the appropriate hazards for outcome and ranks the molecular subgroups in the order of causal probability that would be expected by an expert’s domain knowledge. Specifically, the molecular factors with the highest adverse risk, p53 abnormality (3.01 (PMI), 3.22 (PM)) and L1CAM over-expression (2.16 (PMI), 2.33 (PM))33 are recognized as such, whereas the POLEmut is consistently associated with a reduced risk of disease relapse as confirmed in previous studies30. Adding an immune variable further refines the model, as expected by domain expertise3, and highlights a causal relationship between cytotoxic T-cell responses and EC recurrence in early stage EC. In summary, AutoCI correctly quantifies and ranks causal pathological, molecular and immune variables for patient outcomes in the clinical trial setting.

Going beyond academic toy models, the proposed AutoCI extends the researcher's current statistical toolbox with a new causal-driven method that can assign the causal likelihood to prognostic and predictive variables. As such, this method will enable identification of clinically relevant variables among the ever-increasing number of biomarkers in cancer research that show statistical correlation with clinical outcome. Hence, although the direct real-world application of this method is primarily scientific, subsequent clinical validation and development may enable better selection of (bio)markers to stratify patients for cancer treatments and prediction of prognosis.

Limitation

Despite the clear cut-off between non-causal and proxy variables provided by AutoCI, some of the proxy variables present borderline hazard ratios, for instance, MMRd. Due to small effect size, clinical studies45,46 usually associate MMRd with intermediate patient prognosis, not much different from the prognosis of NSMP EC; however, from the biological perspective, MMRd is highly relevant for a well-defined cascade of molecular changes in cancer cells with favourable prognostic impact as proven by a large number of well-designed experimental and translational studies31,47. The loss of DNA mismatch repair capability in cancer cells leads to a strong increase in tumour mutational burden caused by mismatch, frameshift and insertion/deletion mutations. Due to the structure of eukaryotic DNA, frameshift mutations frequently lead to the translation of truncated peptides that are highly immunogenic, and contribute to the induction of an effective anti-tumoral immune response47. Consistent with biological understanding, AutoCI identifies MMRd as causally related to outcome in the present study, although the lack of a strong association with EC recurrence requires further investigation.

Conclusion

In this study we investigate the causal variable determination among multiple RCTs and present its advantages over the compared methods on both toy and PORTEC experiments. For clinical application, further validation of this methodology in independent clinical trial datasets will be needed to ensure generalisation.

Methods

As per the AutoCI abbreviation, automated and causal are the two building blocks of the proposed method. Concretely, the automation component is implemented with a type-safe program synthesis language, HOUDINI42. We also introduce a novel differentiable causal learning scheme that is built upon ICP.

Type-safe program synthesis

HOUDINI42 is a typed language with a rich set of pythonic higher-order functions such as MAP, FOLD, COMP (Pythonic: map(), reduce(), lambda x: f(g(x))) and so on (Fig. 1, top left). Relying on the built-in method for program search, it allows us to efficiently search promising type-safe differentiable program candidates. Compared with other program synthesis languages48,49,50,51, HOUDINI rules out the error-prone functions that undermine the software safety and presents itself as an ideal candidate for our task (see Supplementary Table 5 for further comparison).

Despite the rich built-in functions provided by HOUDINI, it lacks an explicit flow control mechanism. Driven by the need to integrate causal-aware learning, we introduce the predicate module (PRED) containing the function (cau) (see also Fig. 1, top left)

$${\mathrm{cau}}({{{\mathbf{x}}}};{{{\mathbf{\uptheta }}}})={\mathrm{mask}}\odot {\mathrm{sigmoid}}({{{\mathbf{\uptheta }}}})\odot {{{\bf{x}}}},$$
(5)

where θ represents the learnable weights (normalized by sigmoid), ⊙ is the element-wise multiplication, mask is the vector containing 0 or 1 manipulated in step 5 of Fig. 1 (top), and mask ⊙ sigmoid(θ) represents the causal probability for each variable. Together with the newly introduced higher-order function FILTER, we are able to synthesize type-safe causal-aware programs.
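The gating in equation (5) can be illustrated with a small pure-Python sketch. It is not the HOUDINI implementation; the function names and input values are hypothetical, and plain lists stand in for the tensors that a differentiable framework would use.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real weight to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def causal_probabilities(mask, theta):
    """mask ⊙ sigmoid(θ): one causal probability per variable."""
    return [m * sigmoid(t) for m, t in zip(mask, theta)]


def cau(x, theta, mask):
    """Equation (5): mask ⊙ sigmoid(θ) ⊙ x, applied element-wise."""
    if not (len(x) == len(theta) == len(mask)):
        raise ValueError("x, theta and mask must have the same length")
    return [p * xi for p, xi in zip(causal_probabilities(mask, theta), x)]


# Hypothetical values: the second variable is masked out (mask = 0).
x = [2.0, -1.0, 4.0]
theta = [0.0, 0.0, 3.0]
mask = [1, 0, 1]
probs = causal_probabilities(mask, theta)
gated = cau(x, theta, mask)
```

A masked variable contributes a constant zero to the output, so the gradient of any downstream loss with respect to that input also vanishes.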

Causal differentiable learning

In Definition 1, the ICP does not make assumptions about the function \(f^{*}\) (equation (1)). In real-world applications, it is reasonable to specify the search space of \(f^{*}\). If we assume that \(f^{*}\) is differentiable, then we have a trivial extension:

$$\begin{array}{ll} f:& {\mathbb{R}}^{n} \mapsto {\mathbb{R}} \\ & \underbrace{X_{i_0}, \ldots, X_{i_{|S^*|}}}_{{\bf{X}}_{S^*}}, \underbrace{X_{i_{|S^*| + 1}}, \ldots, X_{i_n}}_{{\bf{X}}_{S^{*c}}} \rightarrow f^*({\bf{X}}_{S^*}), \quad \forall u \in U, \end{array}$$
(6)

where f remains differentiable with regards to all of the variables X. More importantly, the gradient norms with regards to the non (plausible) causal variables \({{{{\bf{X}}}}}_{{S}^{* c}}\) should vanish, that is, \(\parallel {\nabla }_{{S}^{* c}}f\parallel =0\). Motivated by the extension, we first assume \(f^{*}\) to be differentiable in Definition 1, then we have the claim:

Claim 1

Following the specification of U, X = (X1, X2, …, Xn) and Y in Definition 1, if \({{{{\bf{X}}}}}_{\hat{S}}=({X}_{{\hat{S}}_{1}},\ldots ,{X}_{{\hat{S}}_{j}})\) with indices \(\hat{S}\subseteq \{1,\ldots ,n\}\) are the identifiable causal variables, then there exists a differentiable function \(f({{{\bf{X}}}}):{{\mathbb{R}}}^{n}\mapsto {\mathbb{R}}\) satisfying equations (1) and (2) such that f has the maximum amount \(| {\hat{S}}^{c}|\) of variables with \(\parallel {\nabla }_{{\hat{S}}^{c}}f\parallel =0\).

From this perspective, we can reduce the ICPs to learning an invariant differentiable function f, where f has the (maximum amount of) vanishing gradient norms on the non-causal variables. Such reduction enables us to smoothly integrate the ICP into the modern differentiable learning framework.
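The vanishing-gradient property can be checked numerically on a toy function. In the sketch below, the function `f`, the evaluation point and the tolerance are hypothetical choices, and central finite differences stand in for the automatic differentiation that a learning framework would provide.

```python
def numerical_gradient(f, x, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def f(x):
    # Hypothetical relation: only x[0] and x[2] influence the outcome.
    return 2.0 * x[0] + x[2] ** 2


x0 = [1.0, 5.0, 3.0, -2.0]
grad = numerical_gradient(f, x0)
# Indices whose partial derivative vanishes are candidate non-causal variables.
non_causal = [i for i, g in enumerate(grad) if abs(g) < 1e-8]
```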

Algorithm design. To impose the vanishing gradient norms, we seek the mask vector in equation (5) as the solution. Initially, we assign mask = 1 for all the variables. Assume we mask a causal variable Xi by flipping maski = 0, then it should greatly disturb the learning errors across multiple environments. If this is the case, we restore maski = 1 and take Xi as the causal variable. Otherwise we reject the variable Xi and maski remains as 0, which imposes the zero gradient with respect to Xi. To quantify the disturbance when variables of interest are missing, existing ICPs use several statistical tests52,53,54. These tests struggle to capture the nuances of distributions related to higher-dimensional clinical data. Motivated by the recent success in complex vision data55, we utilize the FID44, which is derived from the Wasserstein distance56. More specifically, the square root of the FID between Gaussian distributions is exactly the W2 Wasserstein distance56 and is therefore a metric, that is, it satisfies the three axioms (identity, symmetry and triangle inequality), whereas the statistical tests used in refs. 23,24,25 are generally not mathematical metrics. As a result, the maximum FID (mFID) of U is proposed to measure the distribution difference,

$${\mathrm{mFID}}=\mathop{\max }\limits_{u\in U}{\mathrm{FID}}\,({{{{{\mu }}}}}_{u},{{{{{\mu }}}}}_{{u}^{c}}),$$
(7)

where \({{{{\mu }}}}_{u},{{{{{\mu }}}}}_{{u}^{c}}\) are the distributions of the data sampled from the environment(s) \(\{u\}\) and its complement \({\{u\}}^{c}\). Supplementary Table 6 presents a side-by-side comparison of mFID, F-test + t-test23,25 and Levene-test + Wilcoxon-test24, all of which are applied to train the same type-safe function COMP(NN, CAT(FILTER(PRED))) under the proposed causal differentiable learning scheme. As shown in Supplementary Table 6, the proposed mFID outperforms the compared statistical tests by a clear margin. For the pseudocode of the proposed algorithm, see the top of Fig. 1 (the ‘causal differential learning’ box).
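Equation (7) can be illustrated with a short NumPy sketch. It assumes the standard closed form of the FID between Gaussian fits, \(\parallel m_a-m_b\parallel^{2}+{\rm{Tr}}(\Sigma_a+\Sigma_b-2(\Sigma_a\Sigma_b)^{1/2})\), and computes the trace of the matrix square root from the eigenvalues of \(\Sigma_a\Sigma_b\). The function names and the `samples` array are hypothetical; which quantity is compared across environments in AutoCI is not specified in this section.

```python
import numpy as np

def fid_gaussian(a, b):
    """FID between Gaussian fits of two sample arrays (rows = observations)."""
    mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(b, rowvar=False))
    # Tr((cov_a cov_b)^{1/2}) equals the sum of square roots of the eigenvalues
    # of cov_a @ cov_b, which are real and non-negative for covariance matrices.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(np.sum((mu_a - mu_b) ** 2)
                 + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

def max_fid(samples, env):
    """mFID: maximum over environments u of FID(samples in u, samples in u^c)."""
    return max(fid_gaussian(samples[env == u], samples[env != u])
               for u in np.unique(env))

# Toy check (assumed data): the second environment is the first shifted by 3
# along one axis, so the covariances match and the FID reduces to 3^2 = 9.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
shifted = base + np.array([3.0, 0.0])
samples = np.vstack([base, shifted])
env = np.repeat([0, 1], 200)
print(max_fid(samples, env))
```

Taking the square root of `fid_gaussian` gives the W2 distance between the two Gaussian fits, which is the metric property referred to in the text.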

Proof of concept

For concept validation, we first conduct experiments on toy datasets. We compare the proposed AutoCI with the state-of-the-art methods ICP, NICP and AICP. Specifically, we follow the two experimental protocols presented in AICP25: the finite sample setting and the ABCD setting57. The former represents the ideal scenario in which the same amount of data (1,000 samples) is drawn from both the observational and the experimental (interventional) environments, whereas the latter simulates a more realistic case in which limited experimental data (10 samples) are collected alongside a large amount of observational data (1,000 samples). The data of both settings are generated from randomly chosen linear structural causal models. In our experiments, 400 structural causal models are tested to ensure the reliability of our results. For the compared ICP methods, we applied the optimal strategies discussed in the paper and fine-tuned the parameters to the experiments. Careful parallelization and code optimization are also performed for the ICP methods. We use 16 cores of an Intel(R) Core(TM) i7-7820X CPU @ 3.60 GHz to train the ICP methods in parallel and an NVIDIA TITAN V GPU (12 GB) to train the AutoCI. For the proposed AutoCI, we use the standard Adam optimizer58 with a learning rate of 0.02 throughout the experiments. For the warm-up stage (steps 1 and 2 of Fig. 1) of causal differentiable learning, we adopt eight epochs. The batch size is set to 64 for all experiments. We calibrate λ = 5 and λ = 1 for the toy and PORTEC experiments, respectively, on a small subset of unused data. For the toy experiment, we apply the MSE loss to supervise the learning process and report the result obtained by training the AutoCI once, where a non-causal variable is one with 0 causal probability (equation (5)). For the PORTEC experiment, we utilize the partial likelihood to learn the hazard coefficient59.
To fully utilize the PORTEC patient data and incorporate them into the differentiable Cox model59, the molecular subtype variables POLEmut, MMRd and p53abn are each assigned 1 if present and 0 otherwise (including NSMP). To ensure that the PORTEC results are representative, we independently train the AutoCI 64 times and average the causal probability for each variable. Complementary to the JS score, we also report \({\rm{FWER}}=P({S}_{{\rm{PRED}}}\nsubseteq {S}_{{\rm{prox}}})\) (type-I error).
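The partial-likelihood objective and the subtype encoding described above can be sketched as follows. This is an illustration under assumptions rather than the authors' implementation: it uses the Breslow form of the negative log partial likelihood, assumes no tied event times, and evaluates the loss in plain NumPy (a differentiable version would compute the same expression in an autodiff framework). The names `neg_log_partial_likelihood` and `encode_subtype` are hypothetical, as is the simplification that each patient carries a single subtype label.

```python
import numpy as np

def neg_log_partial_likelihood(risk, time, event):
    """Negative Cox log partial likelihood (Breslow form, no tied times assumed).

    risk  : predicted log-hazard score per patient
    time  : observed follow-up time per patient
    event : 1 if the event was observed, 0 if censored
    """
    risk = np.asarray(risk, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event)
    order = np.argsort(-time)                 # sort by descending time
    risk, event = risk[order], event[order]
    # Running log-sum-exp over patients still at risk (time >= current time).
    log_risk_set = np.logaddexp.accumulate(risk)
    return float(-np.sum((risk - log_risk_set)[event == 1]))

def encode_subtype(subtype):
    """Binary indicators for POLEmut, MMRd and p53abn; NSMP maps to all zeros."""
    return {name: int(subtype == name) for name in ("POLEmut", "MMRd", "p53abn")}

# With equal scores and three observed events, the risk sets have sizes
# 3, 2 and 1, so the loss is log(3) + log(2) + log(1) = log(6).
print(neg_log_partial_likelihood([0.0, 0.0, 0.0], [1.0, 2.0, 3.0], [1, 1, 1]))
print(encode_subtype("NSMP"))
```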

As shown in Table 3, our AutoCI achieved competitive JS and FWER scores compared with the ICP methods. The proposed method is more resistant to the influence of hidden confounders, and all the results reach >90% JS accuracy. This is achieved by the optimal type-safe function COMP(NN, CAT(FILTER(PRED))) (Fig. 2, left and middle; Supplementary Tables 7 and 8). These advantages also confirm the effectiveness of the proposed causal learning scheme using the mFID metric. Similar to the PORTEC experiments, due to the exhaustive subset search required in equation (3), the time complexity of ICP and NICP rises dramatically from the ABCD setting to the finite sample setting (Fig. 2, right).
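For readers who wish to reproduce scores of this kind, the two metrics can be computed as in the sketch below. It assumes that JS denotes the Jaccard similarity between the predicted and the reference variable sets, and that the FWER is estimated as the fraction of runs in which the predicted set is not contained in the reference set. The function names and the convention that two empty sets have similarity 1 are assumptions made for illustration.

```python
def jaccard(pred, ref):
    """Jaccard similarity |pred ∩ ref| / |pred ∪ ref| (1.0 if both are empty)."""
    pred, ref = set(pred), set(ref)
    union = pred | ref
    return 1.0 if not union else len(pred & ref) / len(union)

def fwer(pred_sets, ref_sets):
    """Fraction of runs whose predicted set is not a subset of the reference set."""
    errors = [not set(p) <= set(r) for p, r in zip(pred_sets, ref_sets)]
    return sum(errors) / len(errors)

# Three hypothetical runs: only the second predicts a variable (4) outside
# its reference set, so one run in three counts as a type-I error.
preds = [{1}, {1, 4}, set()]
refs = [{1, 2}, {1, 2}, {3}]
print([jaccard(p, r) for p, r in zip(preds, refs)])
print(fwer(preds, refs))
```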

Ethics statement

The PORTEC study protocols were approved by the Dutch Cancer Society and by the medical ethics committees at participating centers. Both studies were conducted in accordance with the principles of the Declaration of Helsinki. All patients provided informed consent for study participation. The PORTEC 1 trial was registered at the Daniel Den Hoed Cancer Center (DDHCC) Trial Office. The PORTEC 2 trial was registered at ClinicalTrials.gov under the identifier NCT00376844.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.