## Introduction

Since Galileo Galilei, insight into the causes behind the phenomena we observe has come from two strands of modern science: observational discoveries and carefully designed experiments that intervene in the system of interest under well-controlled conditions. In one of Galilei’s early experiments—albeit a thought experiment1—, the law of falling bodies is discovered by dropping two cannonballs of different masses from the tower of Pisa and measuring the effect of mass on the rate of fall to the ground. Discovering physical laws this way is a challenging problem when studying large-scale complex dynamical systems such as the Earth system, because replicated interventional experiments are either infeasible or ethically problematic2. Surely, we should not conduct large-scale experiments on the Earth’s atmosphere: anthropogenic climate change already represents a rather uncontrolled long-term experiment. While randomized controlled experiments are a standard approach in medicine and the social sciences3,4, the main current alternative within most disciplines of Earth sciences are computer simulation experiments. However, these are very expensive, time-consuming, and require substantial amounts of expert knowledge, which in turn may impose strong mechanistic assumptions on the system2. Fortunately, recent decades have seen an explosion in the availability of large-scale time series data, both from observations (satellite remote sensing5, station-based, or field site measurements6), and from Earth system model outputs2. Such data repositories, together with increasing computational power7, open up novel ways to use data-driven methods for the alternative strand of modern science: observational causal discoveries.

In recent years, rapid progress has been made in computer science, physics, statistics, philosophy, and applied fields to infer and quantify potential causal dependencies from time series data without the need to intervene in systems. Although the truism that correlation does not imply causation holds, the key idea shared by several approaches follows Reichenbach’s common cause principle8: if variables are dependent then they are either causal to each other (in either direction) or driven by a common driver. To estimate causal relationships among variables, different methods take different, partially strong, assumptions. Granger9 addressed this question quantitatively using prediction, while in the last decades a number of complementary concepts emerged, from nonlinear dynamics10,11 based on attractor reconstruction, to computer science exploiting statistical independence relations in the data4,12. More recently, research in statistics and machine learning utilizes the framework of structural causal models (SCMs)13 for this purpose. Causal inference is growing to become a mature scientific approach14.

In contrast to data-driven machine learning methods such as probabilistic modeling15, kernel machines16, or in particular deep learning17, which mainly focus on prediction and classification, causal inference methods aim at discovering and quantifying the causal interdependencies of the underlying system. Although interpreting deep learning models is an active area of research18, extracting the causes of particular phenomena, e.g., hurricanes, from a deep learning black box is usually not possible. Therefore, causal inference methods are crucial in complementing predictive machine learning to improve our theoretical understanding of the underlying system19.

Unfortunately, many causal inference methods are still only known within a small community of methodological developers and rarely adopted in applied fields like Earth system sciences. Yet, data-based inference of causation was already proposed in the early 20th century by the geneticist Wright20, but it has not been widely adopted partly due to the fierce opposition of statisticians like Pearson14. In Earth system sciences, besides simulation experiments, (Pearson) correlation and regression methods are still the most commonly used tools. However, causal inference methods do have the potential to substantially advance the state-of-the-art—if the underlying assumptions and methodological challenges are taken into consideration.

With this Perspective, we aim to bridge the gap between potential users and developers of methods for causal inference. We discuss the potential of applying causal inference methods to four key generic problems that are also common in other fields: causal hypothesis testing, causal network analysis, exploratory causal driver detection, and causal evaluation of physical models. First, we provide examples where causal inference methods have already led to important insights in Earth system sciences before giving an overview of different methodological concepts. Next, we highlight key generic problems in Earth system sciences and outline new ways to tackle these within causal inference frameworks. These problems are translated into challenges from a methodological perspective. Finally, as a way forward, we give recommendations for further methodological research as well as new ways in which causal inference methods and traditional physical modeling can complement each other, in particular in the context of climate change research. This Perspective is accompanied by a website (causeme.net) hosting a causality benchmark platform to spur more focused methodological research and provide benchmarks useful not only in Earth system sciences, but also in related fields with similar methodological challenges.

## Example applications of causal inference methods

As in many other fields, methods based on correlation and univariate regression are still the most common data-based tools to analyze relationships in Earth system sciences. Such association approaches are useful in daily practice, but provide few insights into the causal mechanisms that underlie the dynamics of a system. Causal inference methods can overcome some of the key shortcomings of such approaches. In this section, we discuss application examples where causal inference methods have already led to important insights before providing a systematic overview.

Concurrently to Wright’s20 seminal works on causation in the 1920s, Walker was the first to introduce systematic correlation and regression analysis into climate science21. He discovered the temperature and pressure relationships between the East and West Pacific giving rise to the Walker circulation, which has by now been established not only from observational studies, but also detailed physical simulation experiments22. In Fig. 1a, we illustrate these relationships using different methods: classical correlation, standard bivariate Granger causality (GC), and PCMCI23,24 (described later) that is better suited to this problem. Whereas GC and standard correlation analysis results in unphysical links, the example demonstrates that with the correct application of an appropriate method the Walker circulation can be inferred from data alone.

Similary, Kretschmer et al.25 investigated possible Arctic mechanisms which could be pivotal to understand northern hemisphere mid-latitude extreme winters in Eurasia and North America. Arctic teleconnection patterns are much less understood than tropical ones and data-driven causality analyses are especially important because different climate models partly give conflicting results26,27. In Fig. 1b we highlight the Arctic teleconnection pathways of the stratospheric Polar vortex that were extracted from observational data alone: here causal inference methods have confirmed previous model simulation studies, finding that Arctic sea ice extent in autumn is an important driver of winter circulation in the mid-latitudes28.

Finally, Fig. 1c shows an example from ecology demonstrating that traditional regression analysis is unable to identify the complex nonlinear interactions among sardines, anchovy, and sea surface temperature in the California Current ecosystem. A nonlinear causal state-space reconstruction method11 here extracts the underlying ecologically plausible network of interactions, revealing that sea surface temperatures are a common driver of both sardine and anchovy abundances.

These examples demonstrate how causal inference methods can help in distinguishing direct from indirect links and common drivers from observational time series, while classical correlation methods are ambiguous to interpret and can lead to incorrect conclusions.

Next to Granger’s seminal works in economics9,29, observational causal inference methods have mostly been applied in neuroscience30,31 and bioinformatics32,33 where observational causal inference can also be combined with interventional experiments. The challenges for causal inference on Earth system data, especially the spatio-temporal and nonlinear nature of the system, are more similar to those in neuroscience as further discussed in the application and challenges sections.

## Overview of causal inference methods

Observational causal inference from time series has come a long way since Wiener’s34 and Granger’s9 seminal works in the 1950s and 1960s and a plethora of different methods have been developed since then. Importantly, in the past few decades the works of Pearl, Spirtes, Glymour, Scheines, and Rubin3,4,12,35 have grounded causal reasoning and inference as a rigorous mathematical framework, elucidating the conditions under which discovering causal graphical models, also called Bayesian networks36, from purely observational data is at all possible. These are known as identifiability conditions in the field of statistics and causal inference. Many causal inference methods for time series are grounded on the assumptions of time-order (causes precede effects), Causal Sufficiency, meaning that all direct common drivers are observed, and the Causal Markov Condition, stating that in a graphical model a variable Y is independent of every other variable (that is not affected by Y) conditional on Y’s direct causes, among other, more technical, assumptions12,24. However, recent work shows that some of these assumptions can be relaxed. Peters et al.13 summarize recent progress of methods that utilize assumptions on the noise structure and dependency types in the framework of SCMs. Many causal inference methods are not restricted to time series to infer causal relations.

### Granger causality

The concept of Granger causality9 was the first formalization of a practically quantifiable causality definition from time series. The original idea, based on work by Wiener34, is to test whether omitting the past of a time series X in a time series model including Y’s own and other covariates’ past increases the prediction error of the next time step of Y (Fig. 2a). The concept of GC can be implemented with different time series models. Classically, the Granger causality test is based on linear autoregressive modeling (see Box 1), but nonlinear dependencies can be modeled with more complex time series models or even the information-theoretic analog transfer entropy37. While bivariate time series models do not explicitly account for indirect links or common drivers as shown in Fig. 1a, more variables can be included in multivariate extensions of GC. Nevertheless, as illustrated in Box 1 GC is limited to lagged causal dependencies and, furthermore, has known deficiencies in the presence of subsampled time series and other issues38. GC has a long history of applications across a wide range of scientific domains, including Earth system science39,40,41.

### Nonlinear state-space methods

While GC and also the other frameworks discussed here view systems as having interactions that arise from an underlying stochastic process, convergent cross-mapping11 (CCM) and related methods10,42 take a different dynamical systems perspective. These methods assume that interactions occur in an underlying dynamical system and attempt to uncover causal relationships based on Takens’ theorem and nonlinear state-space reconstruction. Thus, for these methods to apply it is necessary to demonstrate that a deterministic nonlinear attractor can be recovered from the data. In this sense it is thought to be complementary to the more statistical approaches discussed here. As illustrated in Fig. 2b, a causal relationship between two dynamical variables X and Y can be established if they belong to a common dynamical system, which can be reconstructed from time-delay embedding of each of the observed time series. More specifically, if variable X can be predicted using the reconstructed system based on the time-delay embedding of variable Y, then we know that X had a causal effect on Y. Nonlinear state-space methods have been applied to ecology11,43 as shown in Fig. 1c, as well as in climate science44.

### Structural causal model framework

GC requires a time delay between cause and effect to identify causal directionality. If causation occurs almost instantaneously, or at least faster than the observable sampling interval, then causal directions cannot be identified in general. Many causal network learning algorithms, on the other hand, are also applicable to contemporaneous dependencies, but they can only identify causal graphs up to a Markov-equivalence class. For example, under the Faithfulness assumption, measuring that X is conditionally independent of Y given Z, while all other (conditional) relationships are dependent, gives rise to three different causal graphs that are Markov-equivalent if no additional information about time-order is available: X$$\leftarrow$$ZY, XZY, or X$$\leftarrow$$Z$$\leftarrow$$Y. As illustrated in Box 1, the simplest example of Markov equivalence are two contemporaneously dependent variables where the causal direction cannot be inferred with conditional independence-based methods. Structural causal models (SCMs) (Fig. 2d) can identify causal directions in such cases because they permit assumptions about the functional class of models (e.g., linear or nonlinear, additivity, noise distributions)54,55,56. Other methods exploit heterogeneity in the data by searching for models that are invariant over space or time57,58,59,60,61. For an overview see references13,38. Most of these principles extend to settings with temporal dependence as further elaborated in the Way forward section. SCMs have not yet been applied in Earth system sciences except for one work in remote sensing62.

## Key generic problems in Earth system sciences

### Causal hypothesis testing

We start by illustrating the challenges associated with a key causal hypothesis testing problem in climate research. Mid-latitude weather (including extreme events) is largely determined by nonlinear dynamical interactions between jet streams, storm tracks, and low-frequency teleconnections63. These dynamical processes are partially not well represented in the latest climate models. Hence, understanding drivers and favorable boundary conditions of weather-determining circulation regimes is crucial to improve (sub-)seasonal predictions, evaluate climate models, and reduce uncertainty in regional climate projections64. Important questions (Fig. 3a) in this context include: what drives the strength, position, and shape of the jet stream? What is the relative importance of tropical and Arctic processes26,28,65? Uncovering causal relations from the observational record here raises a number of challenges. To name just a few, first, time series representing the climatologically relevant subprocesses need to be extracted from typically gridded spatio-temporal datasets25,66, as illustrated in Fig. 3a. This can, for example, be achieved by averaging over corresponding regions, defining an index describing the jet stream position, or a more data-driven approach using dimension-reduction methods66. Secondly, reconstructing the causal relations between these extracted variables is challenging because different nonlinear processes can interact on vastly different time scales from fast synoptic and cloud-radiative processes to multi-year variability driven by slow oceanic processes67. Last, the distributions of climate variables, for example precipitation, are often non-Gaussian. Similar data characteristics also occur in neuroscience where first different subprocesses of the brain need to be reconstructed, e.g., from spatio-temporal electroencephalography measurements, and time series reflect a multitude of processes operating on different frequencies30,68.

### Causal complex network analysis

Network analysis of complex systems is a rapidly growing field69 and the network perspective may help to identify aggregate and emergent properties of the human brain68 or the Earth system66. For example, a phenomenon such as El Niño results from the complex interplay between multiple processes in the tropical Pacific70 and has a large effect on the global climate system. In standard approaches68,71, nodes are defined as the time series at different grid locations and links are typically based on correlations between the grid point time series. A common network measure is the node degree, which quantifies the number of processes linked to a node. However, defined based on correlations, network measures69 do not allow for a causal interpretation such as the information flow within the system71. Grounding network theory in causal networks allows to better interpret network measures66,72: an example for linear measures is reproduced in Fig. 3b. Like for the other generic problems, the challenges lie in high-dimensional nonlinear spatio-temporal data, and here also in a proper definition of network measures that takes into account causal interactions and accounts for the spatial definition of nodes. Causal network comparison metrics can then be utilized for a causal evaluation of physical models (see last paragraph in this section).

### Exploratory detection of causes of extreme impacts

In the Earth system, as well as in many other complex systems, the most devastating impacts are often related to multiple, compound or synergistic drivers73. For instance, devastating wildfires need dry and hot conditions, available fuel, and an ignition source. Many impacts are related to threshold behavior74, and multiple drivers contribute to the tipping of the system75,76. Consider the example shown in Fig. 3c where only the synergistic combination of extreme inland precipitation and extreme storm surge leads to coastal floods77. Causal inference methods can be helpful in identifying the relevant drivers from a typically large number of potential drivers that may be correlated with impacts78. Causal methods further allow us to identify regime shifts in functional relationships that are, e.g., triggered by extreme conditions. The challenges here include high-dimensionality, synergistic effects, and the often small sample size of observed impacts, and are relevant also in other fields such as neuroscience68.

### Causal evaluation of physical models

In many disciplines of Earth system sciences, models of the system or subsystem play a fundamental role in understanding relevant processes. Models differ regarding which subprocesses are resolved and the type of parametrization used. Biogeochemical models, for instance, help to understand element cycles and are a crucial basis for carbon-climate feedbacks in the coupled Earth system. At a higher level, climate models2,79 simulate the interactions of the atmosphere, water bodies, land surface and the cryosphere. In all cases, and at all levels, models are based partly on differential equations representing known processes and partly on semi-empirical relationships representing unknown processes or approximating known processes that cannot be resolved at the global scale due to numerical issues80. Due to the nonlinear nature of the system, small differences in parameterization can potentially lead to large deviations in overall model characteristics. A key task is to evaluate which model better simulates the real system. Currently, such evaluations are based on simple descriptive statistics like mean and variance, climatologies, and spectral properties of model output and observations2,79. However, even though a particular model might well fit descriptive statistics of the observational data, for example, the global distribution of gross primary production (GPP) (Fig. 3d), the model might not well simulate the physical mechanisms affecting GPP, given that multiple model formulations and parameterizations, even when wrong, can fit the observations equally well, a problem known as underdetermination or equifinality81. As a complementary criterion we propose to compare reconstructed causal dependencies of models and observational data (Fig. 3d). The underlying premise is that causal dependencies are more directly linked to the physical processes and are, therefore, more robust against overfitting than simple statistics and, hence, models that are causally similar to observations will also yield more reliable future projections. As for the previous example, also here the challenges lie in extracting suitable causal variables from often noisy station-based measurements or high-dimensional spatio-temporal fields and also the fact that processes can interact nonlinearly involving different spatio-temporal scales. In addition, model output may not satisfy the conditions underlying some causal inference methods, e.g., if dependencies are purely deterministic. Finally, suitable evaluation and comparison statistics based on causal networks need to be defined (see paragraph on causal complex network analysis). In Earth system sciences, model evaluation can help to build more realistic models to improve projections of the future, which is highly relevant for policy making82.

## Challenges from a methodological perspective

### Process challenges

At the process level, a number of challenges arise due to the time-dependent nature of the processes giving rise to strong autocorrelation (Fig. 4, point 1) and time delays (Fig. 4, point 2). Next, ubiquitous nonlinearity (Fig. 4, point 3), also in the form of state-dependence (Fig. 4, point 4) and synergy (see Fig. 3c), requires a careful selection of the estimation method (see nonlinear methods in method overview section). Note that sometimes variables from model output can be deterministically related via a set of equations, which poses a serious problem for many, but not all, causal methods12,24. As mentioned in the jet stream example, a geoscientific time series will typically contain signals from different processes acting on vastly different time scales, e.g., oceanic and atmospheric ones, which may need to be disentangled to better interpret causal links (Fig. 4, point 5). A basic assumption in a number of statistical methods used in causal inference frameworks (e.g., linear regression) is the assumption that the noise distribution is Gaussian, which is violated by processes featuring heavy tails and extreme events (e.g., precipitation; Fig. 4, point 6). On the other hand, some methods turn non-Gaussianity into an advantage54 (Fig. 2d).

### Data challenges

At the data aggregation level, our generic examples demonstrate that a major challenge is to define and reconstruct the causally relevant variables that represent the subprocesses of interest (Fig. 4, point 7). These variables have to be extracted from typically high-dimensional spatio-temporal gridded datasets (e.g., from satellite observations or model output) or station data measurements, which can be done by dimensionality reduction methods. Moreover, these extracted variables should be interpretable and represent physical subprocesses of the system.

Often, relevant drivers cannot be measured, which requires to consider the possibility of unobserved variables (Fig. 4, point 8) regarding a causal interpretation of the estimated graph, since they may render detected links spurious (see also Box 1). Arguably, identifying the absence of a causal link, implying that a physical mechanism is unlikely24, is a more robust finding, which requires less strong assumptions (no Causal Sufficiency). Another aspect of Causal Sufficiency is that not taking into account important drivers, such as anthropogenic climate forcings, may render time series nonstationary. Time series pose a particular challenge regarding time-subsampling (Fig. 4, point 9), which can also be considered as a case of unobserved samples of a variable, and time-aggregation (Fig. 4, point 10) which can let causal dependencies appear contemporaneous or even cyclic. The standard GC cannot deal with contemporaneous links, which can be identified using network learning algorithms or SCMs (see also Box 1).

On the data quality side, satellites, as well as station instruments, are plagued by all kinds of measurement errors (Fig. 4, point 11) such as observational noise, systematic biases, and also missing values (notably cloud occlusions or sensor malfunctioning). These may also be causally related to the measured process, constituting a form of selection bias (Fig. 4, point 12).

While in Earth system sciences the data will often attain a continuous range of values (e.g., temperature), variables can also be of a discrete type (Fig. 4, point 13), either due to quantization, or as categorical data. For example, one may be interested in causal drivers of an index representing different weather regimes or a time series of rarely occurring extreme events, which additionally raises the challenge of class imbalance—many 0 and few 1. Causal inference problems with such data require a suitable choice of methods, for example, conditional independence tests adapted to mixed data types. For paleo-climate data, the assumption of a time order is challenged since the measurement time points typically are given only with uncertainty (Fig. 4, point 14).

### Computational and statistical challenges

From a computational and statistical point of view, scalability is a crucial issue, both regarding sample size (Fig. 4, point 15) and high dimensionality (Fig. 4, point 16). While larger sample sizes (long time series) are typically always beneficial for more reliable causal inferences, the computational time of methods may scale unfavorably with sample size (e.g., cubically for some kernel methods16). The more variables are taken into account for explaining a potentially spurious relationship, the more credible a causal discovery becomes. However, many variables together with large time lags to account for physical time delays (e.g., to identify atmospheric teleconnections), lead to high dimensionality which may strongly affect statistical reliability. This compromises statistical power, that is, the probability to detect a true causal link, and potentially also the control of false positives at a desired significance level23,24. Low-statistical power implies that, especially, weak causal effects with low signal-to-noise ratio, which are sometimes of interest, are not well detected. Last, uncertainty estimation (Fig. 4, point 17) that also takes into account potentially available data uncertainties (measurement value as well as dating uncertainties, see points 11 and 14), poses a major challenge for causal inference methods.

Most of the challenges discussed in this section are the same for correlation or regression methods which are, in addition, ambiguous to interpret and often lead to incorrect conclusions as shown in the examples section. We therefore emphasize that there is no strong reason to avoid adoption and exploration of modern causal inference techniques. Each of the methods summarized in the method overview section addresses one or several of these challenges. In Table 1 we list key strengths and suggest future research directions further discussed in the next section.

Finally, a crucial challenge when interpreting the output of causal inference methods is that causal conclusions are based on the assumptions underlying the different methods12,13,24. These assumptions should, but often cannot, be tested and it is important to make them transparent and discuss how different assumptions would alter conclusions for a particular application.

## Way forward

### Avenues of further methodological research

The preceding Earth system sciences challenges (Fig. 4) are rather generic for complex dynamical systems and apply to many other fields. The challenges point to a way forward to advance causal inference methods for such systems. In the short term, our example applications demonstrate that the existing methods already address some of the mentioned challenges. For example, PCMCI was developed to address high-dimensional time-lagged linear and nonlinear causal discovery and takes into account autocorrelation23,24 and CCM11 was specifically built to account for nonlinear state-dependent relationships. The largest potential for short-term methodological advancements lies in combining different conceptual approaches in order to address multiple challenges.

First, to give some examples, such as those listed in Table 1, causal network learning algorithms that deal well with high-dimensional data are limited by their inability to identify causal directionality among Markov equivalence classes12. This shortcoming can be alleviated by combining causal network learning algorithms with the SCM framework and making additional assumptions on (independence of) mechanisms4,13,57,83 that permits to identify causal directions in these cases. Secondly, novel methods can incorporate ideas from theory on causal discovery in the presence of unobserved variables and selection bias12,47, time-sub-sampling84,85, time-aggregation and cyclic feedbacks86, and measurement error87. Thirdly, filtering methods as preprocessing steps, e.g., based on wavelets88, can help to disentangle causal relations on different time scales, in the simplest example by filtering out a confounder like the seasonal cycle.

In the mid-term, it is worth exploring methods that have not been applied to Earth system data, but whose theoretical properties may render them suitable for the challenges at hand. For example, further methods that are based on the principle of independent mechanisms4,13,57,83 such as prediction invariance13,58,59,61 or causal discovery from non-stationary data60 can potentially make use of the ubiquitously present nonstationarity and external perturbations in Earth system data to infer causal structure. While the black-box character of most machine learning algorithms and deep learning in particular does not lend itself directly to causal discovery, such tools can nevertheless be useful in many aspects of causal discovery. For example, Chalupka et al.89 use neural networks to reconstruct causal features from gridded time series datasets. Also conditional independence tests can be based on deep learning90 and causal inference can be phrased as a classification problem91. And the other way around: causal knowledge, as argued by Pearl, should be incorporated into machine learning to yield more robust predictions and classifications, for example, in such unresolved problems as extrapolation and domain adaptation14.

### Validation and a benchmark platform

Method development and comparison require benchmark datasets with known causal ground truth for validation. Ideally, such ground truth comes from expert knowledge on real data or real experiments that can also be used for falsification of causal relationships predicted from observational causal inference methods. Unfortunately, in Earth system sciences such datasets currently exist only for expert-labeled causal relations among few variables (e.g., some bivariate examples in ref. 92). To some extent, out-of-sample predictions can provide partial validation, but the main alternative in Earth system sciences is experiments from physical simulation models. Such experiments, however, are computationally expensive and carry the challenge how these have to be designed. A more tractable approach is to generate synthetic data with simple model systems that mimic properties and challenges of geoscientific data, but where the underlying ground truth is known. These can then be used to study the performance of causal inference methods for different challenges in realistic finite sample situations. From a practitioner’s perspective, it is important to find out which method is best suited for a particular task with particular challenges and for a particular set of assumptions. Synthetic data, adapted to the problem at hand, can be used to choose the right method including method parameters. As a first step to close the gap between method users and developers, we accompany this Perspective by a causality benchmark platform (causeme.net) with synthetic models mimicking real data challenges on which causal inference methods can be compared. Next to method comparison, the platform also calls for submissions of real and modeled data sets where the causal structure is known with high confidence. Insights from such benchmark studies are relevant also for many other fields.

### Combining observational causal inference and physical modeling

In the long term, we envision that the two main approaches to understand the Earth system (observational data analysis and Earth system modeling) should become more and more integrated. On the one hand, the generic problem of model evaluation has outlined ways on how causal inference methods can be used to identify weaknesses of physical models and guide model improvement. Furthermore, the currently often heuristic parametrization schemes in physical models can be guided by causal analyses of the respective variables, similar to the proposal to utilize machine learning to systematically replace parametrization schemes19,93. Causal discovery can also help to design computationally expensive physical model experiments more efficiently: causal relationships estimated from climate model control runs79 (long model runs with fixed pre-industrial conditions) can provide guidance on where numerical experiments are useful and where causal effects are not to be expected.

On the other hand, physical constraints, either from theoretical knowledge or from experimental (modeling) results, can be used to regularize causal inference methods, for example, by defining variables, restricting functional classes, identifying expected noise distributions, time lags and time aggregation, or general data preprocessing. Even more integrated, novel causal inference methods can make combined use of observational as well as experimental data94,95 which has already led to fruitful insights in genetics. In Earth system sciences, also information from real experiments on subsystems can be incorporated, not on a large climatic scale2, but for example from ecosystem96 and mesocosm experiments97 in ecological labs.

### Detecting and attributing climate change

Detection and attribution approaches quantify the evidence for a causal link between external drivers of climate change and long-term changes in climatic variables2. The goal is to first detect a change and then attribute this change to the contributions of multiple anthropogenic and natural forcings, and from internal variability2. Importantly, the focus lies on the effects of long-term forcings on long-term climatic trends or also changes in, e.g., the frequency of extreme weather events. Such research questions require counterfactual worlds, which can only be constructed with climate models, that are then statistically analyzed. For example, the optimal fingerprinting method2 is based on attributing detected long-term responses to fingerprint patterns using multiple linear regression. Hannart et al.98 discuss the inclusion of Pearl’s4 causal counterfactual theory for a more rigorous foundation of detection and attribution studies.

Nevertheless, observational causal inference methods can help to improve climate models as discussed above and can also directly be used to analyze climate feedbacks in paleo-climate data44, which is still challenging due to scarce available data and dating uncertainties (Fig. 4). Furthermore, the recent concept of emergent constraints attempts to identify an observable statistical relationship between a feature of interest and a future climate change signal. For example, climate sensitivity, i.e., the response of global mean temperature to greenhouse gas emissions, can be constrained this way82. The underlying premise is, however, that today’s dependencies between the predictors and climate sensitivity represent actual physical processes that also hold under future climate change. Here causal discovery can give more robust insights by identifying causal predictors that are more likely to hold under future climate change scenarios.

## Conclusions

The current state-of-the-art in data analysis of the Earth system is still dominated by correlation and regression methods, despite the fact that these methods often lead to ambiguous and confounded results. Existing causality methods can already yield deeper insights from hypothesis testing to the causal evaluation of physical models—if the particular challenges of Earth system sciences are properly addressed. A major impediment to a much wider adoption of causal inference methods is the lack of a reliable benchmark database. We aim to fill this gap by the accompanying platform causeme.net which also includes links to accessible software packages. Applying and interpreting causal inference methods and integrating these with physical modeling, however, will also require more in-depth training on methods in Earth system sciences. Moreover, data-driven causality analyses need to be designed carefully: They should be guided by expert knowledge of the system (requiring expertise from the relevant field) and interpreted based on the assumptions and limitations of the causality method used (requiring expertise from the causal inference method). Sensibly applied causal inference methods promise to substantially advance the state-of-the-art in understanding complex dynamical systems from data also in many other fields with similar challenges as in Earth system sciences, if domain scientists and method developers closely work together—and join the ‘causal revolution’14.