Deep learning of causal structures in high dimensions under data limitations

Causal learning is a key challenge in scientific artificial intelligence as it allows researchers to go beyond purely correlative or predictive analyses towards learning underlying cause-and-effect relationships, which are important for scientific understanding as well as for a wide range of downstream tasks. Here, motivated by emerging biomedical questions, we propose a deep neural architecture for learning causal relationships between variables from a combination of high-dimensional data and prior causal knowledge. We combine convolutional and graph neural networks within a causal risk framework to provide an approach that is demonstrably effective under the conditions of high dimensionality, noise and data limitations that are characteristic of many applications, including in large-scale biology. In experiments, we find that the proposed learners can effectively identify novel causal relationships across thousands of variables. Results include extensive (linear and nonlinear) simulations (where the ground truth is known and can be directly compared against), as well as real biological examples where the models are applied to high-dimensional molecular data and their outputs compared against entirely unseen validation experiments. These results support the notion that deep learning approaches can be used to learn causal networks at large scale. Learning causal relationships between variables in large datasets is an outstanding challenge in various scientific applications. Lagemann et al. introduce a deep neural network approach combining convolutional and graph models intended for causal learning in high-dimensional biomedical problems.


Introduction
Causality remains an important open area in machine learning, statistics and related fields [see e.g. Peters et al., 2017, Arjovsky et al., 2019], and the task of identifying causal relationships between variables is key in many scientific domains, in particular biomedicine [see e.g. Glymour et al., 2016, Hill et al., 2016]. The rich body of work in learning causal structures includes, among other methods, PC [Spirtes et al., 2000], LiNGAM [Shimizu et al., 2006], IDA [Maathuis et al., 2009], GIES [Hauser and Bühlmann, 2012], RFCI [Colombo et al., 2012], ICP [Peters et al., 2016] and MRCL [Hill et al., 2019]. However, learning causal structures from data remains challenging, particularly under conditions (such as high dimensionality, limited data sizes and the presence of hidden variables) seen in many real-world problems.
In this paper, we propose a deep architecture for causal learning that is motivated in particular by questions involving high-dimensional biomedical data. The approach we put forward operates within a paradigm that views causal questions through the lens of expected loss or risk (see below). The learners proposed allow for the integration of partial knowledge concerning a subset of causal relationships and then seek to generalize beyond what is initially known to learn relationships between all observed variables. This corresponds to a common scientific use-case, in which some prior knowledge is available at the outset (from previous experiments or scientific background knowledge) but where the aim is to go beyond what is known to learn a model spanning all available variables. Much of the literature in learning causal structures involves statistical formulations that allow explicit description of the relevant data-generating distributions (including both observational and interventional distributions) and are in that sense "generative" [see, e.g., Heinze-Deml et al., 2018, and references therein]. Taking a different approach, a number of recent papers, including Lopez-Paz et al. [2015], Mooij et al. [2016], Hill et al. [2019] and Noè et al. [2019], have considered learning discrete indicators of causal relationships between variables (without necessarily learning full details of the underlying data-generating models), and this is related to notions of causal expected loss or risk [Eigenmann et al., 2020]. Such indicators may encode, for example, whether, for a pair of variables A and B, A has a causal influence on B, B on A, or neither.
The approach we propose, called "Deep Discriminative Causal Learning" (D²CL), is in the latter vein. We consider a version of the causal structure learning problem in which the desired output consists of binary indicators of causal relationships between observed variables [Hill et al., 2019, Eigenmann et al., 2020], which can be represented as a directed graph with nodes corresponding to the variables. Available multivariate data $X$ are transformed to provide inputs to a neural network whose outputs are estimates of the causal indicators. As detailed below, D²CL differs from classical causal structure learning (e.g. based on causal graphical models) in several ways. First, the objective is different: rather than giving access to all interventional distributions, D²CL outputs indicators of causal links. Second, D²CL is highly non-parametric, relying on the learners to detect relevant regularities. Third, D²CL is demonstrably scalable to large numbers of variables (and is in fact unsuitable for small problems spanning only a few variables; see Discussion). The assumptions underlying the approach are also different in nature from the kinds of assumptions usually made in causal structure learning and concern higher-level regularities in the data-generating processes, as discussed further below.
The remainder of the paper is organized as follows. We first introduce the D²CL methodology. We then present empirical results on both synthetic, gold-standard problems and real molecular biological data. In the latter case, model results are systematically checked against entirely unseen interventional experiments. Finally, we discuss open questions and limitations.

Methods
We propose an end-to-end neural approach to learn causal networks from a combination of empirical data $X$ and prior causal knowledge $\Pi$. In this section, we describe the proposed methodology, starting with notation and a problem statement and going on to present the learning scheme and architecture.

Notation
Observed variables with index set $V = \{1, \ldots, p\}$ are denoted $X_1, \ldots, X_p$. The variables will be identified with vertices in a directed graph $G$ whose vertex and edge sets are denoted $V(G)$ and $E(G)$, respectively. We occasionally overload $G$ to refer also to the corresponding binary adjacency matrix, using $G_{ij}$ to refer to entry $(i, j)$ of the adjacency matrix, as will be clear from context. Where needed to make the distinction clear, we use $G^*$ to denote a true (unknown) graph and $\hat{G}$ an estimate thereof. We use linear indexing of variable pairs to aid formulation as a machine learning problem. Specifically, an ordered pair $(i, j) \in V \times V$ has an associated linear index $k \in \mathcal{K} = \{1, \ldots, K\}$, where $K$ is the total number of variable pairs of interest.
Where useful, we make the mapping explicit, denoting the linear index corresponding to a pair $(i, j)$ as $k(i, j)$ and the variable pair corresponding to a linear index $k$ as $(i(k), j(k))$. The linear indices of pairs whose causal relationships are unknown and of interest are $\mathcal{U} \subset \mathcal{K}$, and those of pairs known in advance via input knowledge $\Pi$ are $\mathcal{T}(\Pi) \subset \mathcal{K}$ (the notation emphasizes the fact that the set $\mathcal{T}$ is, in general, determined by the input knowledge $\Pi$). In all experiments $\mathcal{T}(\Pi)$ and $\mathcal{U}$ are disjoint, i.e., no prior causal information is available on the pairs $\mathcal{U}$ of interest.
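The linear indexing can be illustrated with a minimal Python sketch. The helper `build_pair_index` is hypothetical, and taking the "pairs of interest" to be all ordered pairs with $i \neq j$ is an assumption made only for this illustration.

```python
from itertools import permutations


def build_pair_index(p):
    """Map ordered pairs (i, j), i != j, over V = {1, ..., p} to linear indices 1..K.

    Assumption: every ordered pair of distinct variables is "of interest".
    """
    pairs = list(permutations(range(1, p + 1), 2))
    k_of = {pair: k for k, pair in enumerate(pairs, start=1)}   # k(i, j)
    pair_of = {k: pair for pair, k in k_of.items()}             # (i(k), j(k))
    return k_of, pair_of


k_of, pair_of = build_pair_index(4)
K = len(k_of)                       # total number of ordered pairs
i, j = pair_of[k_of[(2, 3)]]        # round trip through the linear index
```

Because the pairs are ordered, $(2, 3)$ and $(3, 2)$ receive different linear indices, which is what allows directed relationships to be labelled separately.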

Problem statement
We focus on the setting in which available inputs are:
(I1) Empirical data: an $n \times p$ data matrix $X$ whose columns correspond to variables $X_1, \ldots, X_p$.
(I2) Causal background knowledge $\Pi$ providing information on a subset $\mathcal{T}(\Pi) \subset \mathcal{K}$ of causal relationships.
For (I2), we assume that the prior knowledge $\Pi$ can be viewed as information concerning the causal status of a subset of variable pairs. That is, for some variable pairs $(X_i, X_j)$ the correct binary indicator $G^*_{ij}$, representing the presence/absence of an edge in the target graphical object, is provided as an input. In terms of linear indexing, these can be viewed as available "labels" of causal status for the pairs $\mathcal{T}(\Pi) \subset \mathcal{K}$. No specific assumption is made on the data $X$, but in line with our focus on generalizing to unseen causal relationships, it is assumed that it does not contain interventional data corresponding to the pairs in $\mathcal{U}$. Furthermore, in all experiments, not only are the sets $\mathcal{T}$ and $\mathcal{U}$ disjoint, but we enforce the stronger requirement that $u \in \mathcal{U} \implies \nexists\, j : k(i(u), j) \in \mathcal{T}$, meaning that all interventions on which models are tested are entirely novel, i.e. unrepresented in the inputs to the learner. Thus, the learning task can be formulated as follows: given the inputs (I1) and (I2), the goal is to estimate, for each ordered pair of variables $(X_i, X_j)$ with unknown causal relationship, whether or not $X_i$ has a causal influence on $X_j$, or equivalently to learn the underlying graph $G^*$.
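One way to obtain training and test pairs that respect this stronger requirement is to split by the source (intervened-on) variable, so that no source variable of a test pair appears as a source in training. The sketch below is a hypothetical illustration; the function name, the random split and the `test_fraction` parameter are assumptions, not the protocol used in the experiments.

```python
import random


def split_by_source(pairs, test_fraction=0.3, seed=0):
    """Split ordered pairs (i, j) into training pairs T and test pairs U.

    Assumption: sources are held out at random. Every pair whose source i is
    held out goes to U, so no source seen in U has any pair in T.
    """
    sources = sorted({i for i, _ in pairs})
    rng = random.Random(seed)
    rng.shuffle(sources)
    n_test = max(1, int(round(test_fraction * len(sources))))
    test_sources = set(sources[:n_test])
    T = [(i, j) for i, j in pairs if i not in test_sources]
    U = [(i, j) for i, j in pairs if i in test_sources]
    return T, U


pairs = [(i, j) for i in range(1, 6) for j in range(1, 6) if i != j]
T, U = split_by_source(pairs)
```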

Summary of learning scheme
With the notation above, the goal is to learn a graph whose nodes correspond to the variables $X_1, \ldots, X_p$ and whose edges represent causal relationships. To this end, we train a parameterized network $F_\theta$, i.e. a nonlinear function $F$ with a set of unknown, trainable parameters $\theta$. This is possible since we know for each pair $k \in \mathcal{T}$ the causal status $G^*_{i(k), j(k)}$ based on input information $\Pi$. The architecture we use as $F_\theta$ is detailed below, but for now assume this has been specified. Then, given the data $X$ and prior input $\Pi$, we learn parameters $\theta(X, \Pi)$ under a loss that is supervised by the (causal) inputs/labels $Y_k = G^*_{i(k), j(k)}$ for all pairs $k \in \mathcal{T}(\Pi)$. In contrast to MRCL [Hill et al., 2019], which is semi-supervised and does not scale to high dimensions, our approach is supervised and aimed at high-dimensional problems, and unlike Noè et al. [2019] we use a deep learning framework that learns causally informed embeddings. We share with Eigenmann et al. [2020] an emphasis on causal risk, but our focus is on learning rather than risk estimation.
At this stage, the trained network $F_{\theta(X,\Pi)}$ allows assignment of causal status to any pair, since it gives an estimate of the entire graph including those pairs whose causal status was unknown. Specifically, the output is given by $\hat{G}_{ij} = F_{\theta(X,\Pi)}(i, j)$, where $(i, j)$ are ordered variable pairs. Note that the overall estimate depends solely on the data $X$ and causal information $\Pi$. By default, no change is made for pairs $\mathcal{T}$ whose status was known at the outset. Eigenmann et al. [2020] studied causal notions of risk based on loss functions of the form $L(\hat{G}, G^*)$ that compare a graph estimate $\hat{G}$ with ground truth $G^*$. In our setting, we consider a classification-type loss on the variable pairs $k$, where the causal status of known pairs $\mathcal{T}(\Pi)$ provides the training "labels". We therefore use the corresponding binary cross-entropy loss, augmented by additional terms that, for instance, prevent exploding weights.
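As a rough illustration of a classification-type loss over the labelled pairs, the sketch below computes a mean binary cross-entropy with an optional penalty on the weights. The function name and arguments are hypothetical, and the L2 penalty is an assumed stand-in for the unspecified "additional terms"; the text does not state which regularizers are used.

```python
import math


def causal_bce(probs, labels, params=(), weight_decay=0.0, eps=1e-12):
    """Mean binary cross-entropy over labelled pairs k in T, plus an L2 penalty.

    probs[k]  : predicted probability of a causal edge for pair k
    labels[k] : known causal status Y_k in {0, 1}
    Assumption: the regularizer is a plain L2 weight penalty.
    """
    bce = 0.0
    for q, y in zip(probs, labels):
        q = min(max(q, eps), 1.0 - eps)          # clip to avoid log(0)
        bce -= y * math.log(q) + (1 - y) * math.log(1.0 - q)
    bce /= len(labels)
    penalty = weight_decay * sum(w * w for w in params)
    return bce + penalty


uninformative = causal_bce([0.5, 0.5], [1, 0])   # equals log(2)
confident = causal_bce([0.9, 0.1], [1, 0])       # lower loss for correct, confident outputs
```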
In the D²CL framework, the notion of causal influence encoded by the edges is rooted in the application setting and input information $\Pi$, since causal semantics are inherited via the problem setting rather than specified by a generative model (see Hill et al. [2019] for related discussion). Indeed, in the experiments below we show examples in which D²CL is used to learn either direct or indirect/ancestral causal relationships, depending on the setting and inputs. We direct the interested reader to Appendix A for further discussion of assumptions.

Architecture details
CNN tower. To capture distributional information from empirical data $X$, a preprocessing step is required. In principle, this could be done via a variety of multi-dimensional transformations of $X$. We consider the simplest possible case, namely, for a pair $(i, j)$, to consider only the corresponding columns $i$ and $j$ in the data matrix $X$. Specifically, we use the $n \times 2$ submatrix $X_{(\cdot,[i\,j])}$ to form a bivariate kernel density estimate (KDE) $f_{ij}$. Note that this is in general asymmetric in the sense that $f_{ij} \neq f_{ji}$, which is important since we want to learn ordered/directed relationships. Evaluations of the KDE at equally spaced grid points on the plane (i.e. numerical values from the induced density function) are treated as the input to the CNN. The KDE itself is a standard bivariate approach using automated bandwidth selection following Silverman [1986] and Turlach [1993]. This provides an "image" of the data and allows us to leverage standard tools from computer vision. Furthermore, we concatenate channelwise the numerical KDE values on the regularly spaced grid with a positional encoding of the grid points.
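To make the "image" construction concrete, the following sketch evaluates a bivariate KDE on a regular grid using only the Python standard library. It is an illustration under stated assumptions: the product-Gaussian kernel, the rule-of-thumb bandwidth $h = \sigma n^{-1/6}$, the grid limits (data range) and the grid size are choices made here and need not match the settings used in the experiments. The positional encoding channel is omitted.

```python
import math
import random
import statistics


def kde_image(x, y, grid_size=16):
    """Evaluate a product-Gaussian KDE of the sample pairs (x, y) on a regular grid.

    Assumptions: rule-of-thumb bandwidths h = sigma * n^(-1/6) for d = 2,
    and a grid spanning the observed range of each variable.
    """
    n = len(x)
    hx = statistics.stdev(x) * n ** (-1 / 6)
    hy = statistics.stdev(y) * n ** (-1 / 6)
    x_lo, x_hi, y_lo, y_hi = min(x), max(x), min(y), max(y)
    gx = [x_lo + (x_hi - x_lo) * a / (grid_size - 1) for a in range(grid_size)]
    gy = [y_lo + (y_hi - y_lo) * b / (grid_size - 1) for b in range(grid_size)]
    norm = 1.0 / (n * 2.0 * math.pi * hx * hy)
    image = []
    for u in gx:
        kx = [math.exp(-0.5 * ((u - xi) / hx) ** 2) for xi in x]
        row = [
            norm * sum(k * math.exp(-0.5 * ((v - yi) / hy) ** 2) for k, yi in zip(kx, y))
            for v in gy
        ]
        image.append(row)
    return image


rng = random.Random(0)
col_i = [rng.gauss(0.0, 1.0) for _ in range(200)]
col_j = [math.tanh(a) + 0.3 * rng.gauss(0.0, 1.0) for a in col_i]
f_ij = kde_image(col_i, col_j)   # image for the ordered pair (i, j)
f_ji = kde_image(col_j, col_i)   # image for the reversed pair (j, i)
```

In this sketch the reversed pair yields the transposed image, so the two orderings give different inputs whenever the estimated density is not symmetric about the diagonal.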
The specific network architecture of our CNN tower is inspired by a ResNet-54 architecture [He et al., 2016]. From a high-level perspective, it consists of a stem, five stages with [3, 4, 6, 3, 3] ResNet blocks and multiple fully connected layers that transform the high-level feature maps into a latent space that is merged with the output of the GNN tower. The first ResNet block at each stage downsamples the spatial dimensions of the output of the previous stage by a factor of two. To enhance the computational efficiency of the bottleneck layers in each ResBlock, channel down- and up-sampling exploiting 1 × 1 convolutions is performed before and after each feature extraction CNN layer [Szegedy et al., 2015]. We replaced ReLU activations by the parametric counterpart PReLU [He et al., 2015]. Following Xie et al. [2017], we chose a full pre-activation of the convolutional layers, i.e. normalization-activation-convolution.
GNN tower. The GNN tower leverages the SEAL architecture of Zhang and Chen [2018] and the resulting graph convolutional neural network (GCNN) for link prediction. The underlying notion is that a heuristic function predicts scores for the existence of a link. However, instead of employing predefined heuristics (such as the Katz coefficient or PageRank), an adaptive function is learned in an end-to-end fashion, which is formulated as a graph classification problem on enclosing subgraphs. Here, for a node pair of interest $(i, j)$, the GNN tower is intended to learn causally relevant node features and state embeddings based on a local 1-hop enclosing subgraph extracted from an initial input graph $\hat{G}_0$. This is done as follows. For node pair $(i, j)$, we first extract a set $N$ of neighbouring nodes comprising all nodes connected to either $i$ or $j$ in $\hat{G}_0$. Then, the edge structure within the subgraph $G_{ij}$ is reconstructed by pulling out all edges from $\hat{G}_0$ for which the parent and child node are in $N$. The order of the nodes is shuffled for each subgraph. The node features in every input subgraph consist of structural node labels that are assigned by a Double-Radius Node Labeling (DRNL) heuristic [Zhang and Chen, 2018] and the individual data features. In a first step, the distances between node $i$ and all other nodes of the local subgraph except node $j$ are computed. The same is repeated for node $j$. A hashing function then transforms the two distance labels into a DRNL label that assigns the same label to nodes that are on the same "orbit" around the center nodes $i$ and $j$. During the training process the DRNL label is transformed into a one-hot encoded vector and passed to the first graph convolutional layer. In contrast to traditional CNNs, GCNNs do not benefit strongly from very deep architecture design [Chen et al., 2019, Li et al., 2018]. Therefore, our GNN tower consists only of four sequentially stacked graph convolutional layers. The activation function is the hyperbolic tangent. Since the number of nodes in the enclosing subgraph for each pair of variables $(i, j)$ is different, a SortPooling layer [Zhang et al., 2018] is applied to select the top $k$ nodes according to their structural role within the graph. Afterwards, 1-dimensional convolutions extract features from the selected state embeddings.
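The subgraph extraction and labelling steps can be sketched as follows. This is an illustrative reading rather than the implementation used in the experiments: the helper names are hypothetical, distances are computed on the undirected skeleton (an assumption), and the hashing function is the DRNL formula of Zhang and Chen [2018], $1 + \min(d_i, d_j) + \lfloor d/2 \rfloor (\lfloor d/2 \rfloor + d \bmod 2 - 1)$ with $d = d_i + d_j$, with label 1 for the two center nodes and 0 for nodes unreachable from either center.

```python
from collections import deque


def enclosing_subgraph(edges, i, j):
    """Return i, j and all nodes adjacent to either, plus the edges among them."""
    nodes = {i, j}
    for a, b in edges:
        if a in (i, j):
            nodes.add(b)
        if b in (i, j):
            nodes.add(a)
    sub_edges = [(a, b) for a, b in edges if a in nodes and b in nodes]
    return nodes, sub_edges


def bfs_distances(nodes, edges, source, masked):
    """Shortest-path distances from source, ignoring direction and the masked node."""
    adj = {v: set() for v in nodes if v != masked}
    for a, b in edges:
        if a != masked and b != masked:
            adj[a].add(b)
            adj[b].add(a)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def drnl_labels(nodes, edges, i, j):
    """Double-Radius Node Labeling for the subgraph around the pair (i, j)."""
    d_i = bfs_distances(nodes, edges, i, masked=j)
    d_j = bfs_distances(nodes, edges, j, masked=i)
    labels = {i: 1, j: 1}
    for v in nodes - {i, j}:
        if v not in d_i or v not in d_j:
            labels[v] = 0                     # unreachable from one of the centers
            continue
        d = d_i[v] + d_j[v]
        labels[v] = 1 + min(d_i[v], d_j[v]) + (d // 2) * (d // 2 + d % 2 - 1)
    return labels


G0_edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6)]   # toy initial graph
sub_nodes, sub_edges = enclosing_subgraph(G0_edges, 1, 2)
labels = drnl_labels(sub_nodes, sub_edges, 1, 2)
```

Nodes at the same pair of distances from the two centers receive the same label, which is the "orbit" property described above.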
Embedding fusion. Each tower outputs an embedding; these are concatenated and further processed by multiple fully connected layers. Finally, the last layers output the log-likelihood of a directed edge from node $i$ to node $j$.
Implementation summary. All network architectures were implemented in the open-source framework PyTorch [Paszke et al., 2019]. The GNN was implemented based on the Deep Graph Library [Wang et al., 2019]. All modules were initialized using random weights. During training, we applied an Adam optimizer [Kingma and Ba, 2015] starting at an initial learning rate $\varepsilon_0 = 0.0001$. Furthermore, the learning rate was reduced by a factor of five once the evaluation metrics stopped improving for 15 consecutive epochs. The minimum learning rate was set to $\varepsilon_{\min} = 10^{-8}$. The training predictions were supervised on the binary cross-entropy loss between estimated and ground-truth edge labels. Every network architecture was trained for 100 epochs, using multiple GPU nodes simultaneously, each equipped with eight Nvidia Tesla V100s.
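The learning-rate rule described above can be sketched in plain Python. The class below is a hypothetical illustration of a reduce-on-plateau schedule with the stated values (initial rate 0.0001, factor five, 15 epochs, minimum $10^{-8}$); the exact plateau criterion and the scheduler actually used are not specified in the text, so the bookkeeping here is an assumption.

```python
class PlateauSchedule:
    """Divide the learning rate by `factor` after `patience` epochs without improvement."""

    def __init__(self, lr=1e-4, factor=5.0, patience=15, min_lr=1e-8):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("-inf")
        self.stale = 0            # epochs since the metric last improved

    def step(self, metric):
        """Update with an evaluation metric (higher is better); return the current rate."""
        if metric > self.best:
            self.best, self.stale = metric, 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr = max(self.lr / self.factor, self.min_lr)
                self.stale = 0
        return self.lr


schedule = PlateauSchedule()
rates = [schedule.step(0.8) for _ in range(100)]   # a metric that never improves
```

PyTorch provides a comparable rule as `torch.optim.lr_scheduler.ReduceLROnPlateau` (with `factor`, `patience` and `min_lr` arguments), although its patience bookkeeping differs slightly from this sketch.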

Results
We assess the proposed approaches in comparison to a range of existing methods, using both simulated data and real biological data. In the case of the simulations, we have access to the true, underlying causal graph, and hence can assess results by direct comparison with the ground truth. For the real data examples, we test the model output against the outcome of entirely unseen interventional experiments. In all experiments, simulated or real, model output is tested with respect to causal relationships that are entirely unseen in the sense that (i) the variable pairs on which the model output is tested are disjoint from those pairs whose causal relationships are provided as inputs during training, and (ii) no data used to define the gold-standard causal relationships against which the model output is tested appear in inputs to the models.
Gold-standard simulated benchmark data
We first tested D²CL using linear and non-linear simulations. These involved generating data $X$ (and obtaining prior knowledge $\Pi$) from a (linear or non-linear) structural equation model (SEM) with noise, based on a known underlying causal graph $G^*$. The protocol is outlined in Figure 2a. In brief, data were generated via structural equations of the form $X_i = f_i(\mathrm{Pa}_{G^*}(X_i), U_{X_i})$, $i = 1, \ldots, p$, where $p$ is the total number of variables, $\mathrm{Pa}_{G^*}(X_i)$ is the set of parents of node $i$ in the true graph $G^*$, the $U_{X_i}$ are noise variables (exogenous and jointly independent) and the $f_i$ are functions unknown to the learners. Functional forms used include simple linear functions, multi-layer perceptrons (MLPs) with hyperbolic tangent activations, MLPs with leaky ReLU activations, leaky ReLU, a polynomial of order three and the hyperbolic tangent. Varying the magnitude of the noise terms allowed us to control the signal-to-noise ratio (SNR), while varying $p$ allowed us to understand the effect of dimensionality. Results were evaluated against the true, gold-standard causal structure $G^*$ and hence tested in causal (and not correlational or predictive) terms.
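A toy version of such a simulation is sketched below. It is a sketch under stated assumptions and not the benchmark generator: the function name, the additive noise, the unit edge weights, the shared nonlinearity and the small example graph are all chosen here for illustration.

```python
import math
import random


def simulate_sem(parents, n, noise_sd=1.0, f=math.tanh, seed=0):
    """Draw n samples from X_i = f(sum of parent values) + U_i over a DAG.

    parents[i] lists the parents of node i. Assumptions: nodes are numbered in
    topological order, noise is additive Gaussian, and all edge weights are 1.
    """
    rng = random.Random(seed)
    p = len(parents)
    data = []
    for _ in range(n):
        x = [0.0] * p
        for i in range(p):
            signal = f(sum(x[a] for a in parents[i])) if parents[i] else 0.0
            x[i] = signal + rng.gauss(0.0, noise_sd)   # smaller noise_sd -> higher SNR
        data.append(x)
    return data


parents = [[], [0], [1], [0, 2]]                 # edges 0->1, 1->2, 0->3, 2->3
X = simulate_sem(parents, n=500, noise_sd=0.5)   # n x p data matrix
p = len(parents)
G_star = [[1 if i in parents[j] else 0 for j in range(p)] for i in range(p)]
```

Here `G_star[i][j] = 1` encodes a direct edge from node `i` to node `j`, which is the ground truth against which a learner's output would be scored.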
Figure 2b shows results for a problem of dimension $p = 1500$ using a nonlinear transition function (the hyperbolic tangent; other functions/configurations are shown in Appendix B) and varying SNR. (For these first results, we restricted the dimension of the problem to facilitate comparison with approaches that may not scale to larger problems; higher-dimensional examples appear below.) Overall, D²CL remains effective across a broad range of SNRs, as well as for a range of linear and nonlinear problems and problem sizes (Appendix B). These results support the notion that D²CL can learn direct causal edges in systems spanning many variables. We note that the comparison with existing approaches is not one-to-one, since in many cases methods differ in their expected inputs and outputs. For example, IDA is aimed at analysis of observational data, hence the comparison is unfair since our approach also has access to background causal information $\Pi$. GIES allows for interventional data, but requires different inputs. Due to these differences in input/output requirements, we emphasize that comparisons here are provided for completeness but with the caveat that the various methods are intended for different use-cases (and furthermore make assumptions that are likely not met in the real biological data below).
The graph $G^*$ in the above examples encodes direct causal relationships, since there is an edge from one node to another if the former appears in the equation for the latter. However, in many real-world examples, interest focuses also on indirect effects that may be mediated by other nodes. For example, if node A has a direct effect on B, and B on C, intervention on A may change C, even though A does not itself appear in the equation for C. To study the ability to identify such indirect effects, we next tested the various methods on the task of learning indirect edges. This was done in the same way as above, but with the inputs $\Pi$ being indirect edges and output tested against the true indirect graph.
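One simple way to read the "indirect graph" is as the transitive closure of the direct graph, so that A is linked to C whenever a directed path from A to C exists. The sketch below computes this with Warshall's algorithm; equating the indirect ground truth with the transitive closure is an assumption made for illustration, and the function name is hypothetical.

```python
def ancestral_graph(G):
    """Transitive closure of a binary adjacency matrix (Warshall's algorithm).

    G[i][j] = 1 means a direct edge i -> j; the result has a 1 at (i, j)
    whenever a directed path of length >= 1 leads from i to j.
    """
    p = len(G)
    A = [row[:] for row in G]
    for m in range(p):                 # allow paths routed through node m
        for i in range(p):
            if A[i][m]:
                for j in range(p):
                    if A[m][j]:
                        A[i][j] = 1
    return A


G_direct = [[0, 1, 0],     # A -> B
            [0, 0, 1],     # B -> C
            [0, 0, 0]]
G_indirect = ancestral_graph(G_direct)   # additionally contains A -> C
```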
Results appear in Figure 2c. D²CL performs well across a range of SNRs and also in other linear/nonlinear problem configurations (Appendix B). IDA performs well in the case of a linear SEM but not for functions based on nonlinear MLPs. These results support the notion that D²CL can learn indirect causal edges over many variables under conditions of noise and nonlinearity.
Next, we sought to study performance in the context of real biological data. To this end, we leveraged a large set of gene deletion experiments in yeast [Kemmeren et al., 2014], which have previously been used for causal learning [Peters et al., 2016, Meinshausen et al., 2016, Hill et al., 2019]. These data involve measuring gene expression in yeast cells under each of a large number of interventional (gene deletion) experiments. To define causal status, we followed the approach of Hill et al. [2019], considering changes under intervention relative to the observational distribution.
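In biological experiments, causal effects may be indirect, and our goal in the analysis is to learn a directed graph with nodes corresponding to $p$ observed genes and edges representing (possibly indirect) causal influences. Such edges are scientifically interesting as they are relatively amenable to experimental verification [as noted in Zhang, 2008, Noè et al., 2019]. Cycles are common in biology [see e.g. Alon, 2019] and we do not enforce acyclicity [see Hyttinen et al., 2012, and references therein, for discussion of cyclic causality]. A fuller discussion of the causal interpretation of laboratory experiments is beyond the scope of this paper, but relevant work includes Eberhardt and Scheines [2007], Hyttinen et al. [2012] and Kocaoglu et al. [2017], and we direct the interested reader to these references for further discussion.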
Since causal background knowledge is an input to our approach, it is relevant to consider performance as a function of the amount of such input. To this end, we fixed the problem size to $p = 1000$ and varied the number of interventions $m$ whose effects were available to the learner. Since each experiment involves only a subset of the entire yeast genome, latent variables are present by design. The input prior knowledge $\Pi$ is derived from the causal status, but, as in all experiments, is strictly disjoint with respect to any test edges.
Results are shown in Figure 3a-c, including the area under the ROC curve (AUC; computed with respect to an experimentally determined gold standard, as in Hill et al. [2019]). Interestingly, the two towers differ in some ways: the CNN tower degrades slowly with fewer causal inputs, while the performance of the GNN tower degrades faster. GIES [Hauser and Bühlmann, 2012] was not effective in this setting (result not shown; findings are in line with Hill et al. [2019] using the same data); however, we note that GIES requires different inputs to our approach and its assumptions are likely violated in this setting. Next, to shed light on data efficiency, we varied the sample size $n$ of the data matrix $X$. Results are shown in Figure 3d-f. Finally, we tested performance in a higher-dimensional example spanning all $p = 5535$ available genes (cf. Figure 3g-k) and found that D²CL remains effective at genome scale. Interestingly, while the CNN tower performs particularly well, the GNN tower degrades more. This may be because larger $p$ leads to a larger number of variable pairs (which is helpful for the CNN), but also to a (rapid) increase in the number of nodes and edges in the GNN subgraphs and hence a harder GNN learning task in practice.
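For readers less familiar with the metric, the AUC over test pairs can be computed as the probability that a randomly chosen true causal pair receives a higher score than a randomly chosen non-causal pair. The sketch below uses this pairwise definition; the function name and the toy scores are hypothetical.

```python
def causal_auc(scores, labels):
    """AUC as the fraction of (edge, non-edge) pairs ranked correctly; ties count 1/2.

    scores[k] : model score for test pair k
    labels[k] : gold-standard causal status in {0, 1}
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((sp > sn) + 0.5 * (sp == sn) for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))


auc = causal_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])   # 3 of 4 pairs ordered correctly
```

The quadratic pairwise loop is adequate for illustration; for large numbers of test pairs a rank-based computation would be preferable.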
D²CL leverages prior causal knowledge; however, in practice, available causal inputs $\Pi$ may be incorrect, e.g. due to flawed initial experiments or errors in the known science. To study sensitivity to flawed causal inputs, we introduced errors into $\Pi$. This was done by perturbing 10% of the inputs (i.e. labelling causal pairs as non-causal and vice versa) at the outset. Figure 4a shows corresponding results; the networks seem reasonably robust in this sense. These experiments point also to a benefit of the dual network variants: when one tower underperforms, the combined network still performs well, as it (automatically) adapts to rely on the effective tower. This aspect is further investigated in Figure 4b. To test the impact of a failing tower on overall performance, the embedding of either tower was modified right before the fusion layer. We considered four different modifications. In case (i), the complete embedding of one tower was set to zero, effectively removing all information from this tower. In the other cases, we applied Gaussian noise with magnitude (ii) $\sigma = 1.0$, (iii) $\sigma = 2.0$ and (iv) $\sigma = 5.0$. The results support the notion that even when one tower fails, the second can compensate so that D²CL still provides useful output.
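Both perturbations are simple to emulate. The sketch below flips a fixed fraction of binary input labels and, separately, zeroes an embedding vector or adds zero-mean Gaussian noise to it. The function names are hypothetical, and treating the noise as additive is an assumption; the text specifies only its scale.

```python
import random


def corrupt_labels(labels, fraction=0.10, seed=0):
    """Flip a random `fraction` of binary causal labels (1 -> 0 and 0 -> 1)."""
    rng = random.Random(seed)
    n_flip = int(round(fraction * len(labels)))
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if k in flip_idx else y for k, y in enumerate(labels)]


def perturb_embedding(embedding, sigma=None, seed=0):
    """Zero the embedding (sigma=None) or add N(0, sigma^2) noise (assumed additive)."""
    if sigma is None:
        return [0.0] * len(embedding)
    rng = random.Random(seed)
    return [e + rng.gauss(0.0, sigma) for e in embedding]


clean = [0, 1] * 50
noisy = corrupt_labels(clean)                          # 10 of 100 labels flipped
zeroed = perturb_embedding([0.2, -1.3, 0.7])           # case (i)
jittered = perturb_embedding([0.2, -1.3, 0.7], 1.0)    # case (ii), sigma = 1.0
```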
Causal relations are in general directed and asymmetric. Given an image representation, the CNN tower extracts feature maps for (ordered) node pairs. The two-dimensional convolutional operation $S(i, j) = \sum_m \sum_n I(m, n)\, K(i - m, j - n)$ that convolves image $I$ with kernel $K$ would produce the same feature map for two causal images $I_{k \to l}$ and $I_{l \to k}$ if and only if $I_{k \to l}$ and $I_{l \to k}$ were identical. In other words, unless the probability distribution $P(X_i, X_j)$ is perfectly symmetrical around the center of the causal image, the CNN tower can extract causal features that differ depending on direction. Figure 4c shows a low-dimensional representation of the feature maps of the converged CNN tower; the feature maps differ by direction, supporting the notion that the representations learned are asymmetric.
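A small numerical example illustrates the point. The sketch below implements the full two-dimensional convolution directly from the formula and applies one kernel to a toy image and to its transpose, the latter standing in for the reversed pair. The toy image, the kernel values and the use of the transpose as the reverse-direction image are assumptions made for illustration.

```python
def conv2d_full(I, K):
    """Full 2-D convolution S(i, j) = sum_m sum_n I(m, n) * K(i - m, j - n)."""
    h, w, kh, kw = len(I), len(I[0]), len(K), len(K[0])
    S = [[0.0] * (w + kw - 1) for _ in range(h + kh - 1)]
    for m in range(h):
        for n in range(w):
            for a in range(kh):          # a = i - m
                for b in range(kw):      # b = j - n
                    S[m + a][n + b] += I[m][n] * K[a][b]
    return S


def transpose(I):
    return [list(col) for col in zip(*I)]


I_kl = [[0, 1, 0],
        [0, 0, 0],
        [0, 0, 0]]                  # toy image with off-diagonal mass
I_lk = transpose(I_kl)              # stand-in for the reversed pair
kernel = [[1, 2],
          [3, 4]]

S_kl = conv2d_full(I_kl, kernel)
S_lk = conv2d_full(I_lk, kernel)    # differs from S_kl because I_kl != I_lk
```

An image that equals its own transpose would give identical feature maps in both directions, which corresponds to the symmetric case excluded in the text.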

Conclusions
Our model leverages deep learning tools to learn causal relationships between variables in a scalable manner. However, and in contrast to well-established approaches based on causal graphical models, it provides only structural output rather than a probability model of the underlying system. It would therefore be interesting to consider coupling our approach, as a first learning step, with a graphical-model-based analysis in a second step. This would amount to using the flexible and scalable discriminative approach as a filter to render subsequent causal modelling more tractable. Despite some initial ideas presented here (see also Appendix A), there remain open questions concerning the theoretical properties of the kind of approach studied here. In particular, precise conditions on the underlying system needed to ensure that the classification-type approach can guarantee recovery of specific causal structures remain to be elucidated. An interesting observation is that the proposed approach may benefit from a "blessing of dimensionality", since the learning problem will typically enjoy a larger number of examples as the dimension $p$ grows. Conversely, and in contrast to established statistical-causal models, our approach (at the current stage) cannot be used in the small-$p$ regime, since then the number of examples will be too small for deep learning.
the distribution over relations of variables and noise terms to system-specific distributions.
Note that no particular assumption is made on the individual functions $f_i$, only that they are mutually related on a higher level. Furthermore, the generators themselves need not be known or directly estimated; it is only important that they are shared across the applied setting $W$. Note that a model learned for setting $W$ will not in general be able to classify pairs in an entirely different applied setting $W'$ (since the generators may then differ strongly), i.e. we do not seek to learn "universal" patterns that apply to all causal relations in any system whatsoever. The classification task of D²CL aims at telling apart causal relationships, related by the system-specific function generator $F_W$, from non-causal ones. We note that in real systems the $f_i$ may be coupled via constraints on global functionality and hence be non-independent; however, the good performance seen in the Main Text lends empirical support to the approach. We emphasize that while the ideas above provide some initial intuition, further work is needed to better understand the properties of the kind of approach studied here from a theoretical point of view.

Figure 1: Overview of the D²CL architecture, training and inference. D²CL combines empirical data with prior causal knowledge to learn causal relationships between variables. This is done using a neural architecture with two components: a CNN tower aimed at learning distributional features and a GNN tower that detects structural regularities. The CNN and GNN embeddings are then merged through multiple layers to estimate the probability of a directed causal relationship. During inference the network generalizes beyond the initial inputs to provide a global estimate spanning all variables of interest.

Figure 2: Results, simulated data. (a) Overview. Data were simulated from known, gold-standard causal graphs with which the output of the learners was compared. Empirical data were generated using a directed causal graph of specified dimension $p$ using linear and nonlinear structural equation models with noise (see text). (b) Results for an illustrative nonlinear case (the hyperbolic tangent), at varying noise levels, for direct causal relationships. Causal area under the ROC curve (AUC; with respect to the causal ground truth graph) is shown as a function of signal-to-noise ratio (SNR) for an experiment with $p = 1500$ variables and a sample size of $n = 1024$. D²CL (blue) is compared with: Pearson correlations (yellow; this is a non-causal baseline); IDA (red); and SCL (green). (c) Results for indirect causal relationships, with other settings as in (b). Here, causal AUC is with respect to a graph encoding causal, but potentially indirect, relationships. (Results shown are averages over five data sets at each specified SNR.)

Figure 3: Results, biological data. Causal learning methods, including D²CL, were applied to gene expression measurements from yeast cells. Performance was quantified using causal ROC curves (and the area under the curves, or AUC) computed with respect to a causal ground truth obtained from entirely unseen interventional experiments (see text). Panels (a)-(c): number of interventions whose effects are available to the learner varied as shown (with problem dimension fixed to $p = 1000$ and sample size to $n = 706$). Panels (d)-(f): sample size $n$ varied as shown (with problem dimension fixed to $p = 1000$ and number of available interventions to $m = 753$). Panels (g)-(k): D²CL results for a higher-dimensional setting spanning all available genes with $p = 5535$ (with $n = 706$ and $m = 753$). [D²CL variants shown include the CNN tower alone, the GNN tower alone and the combined architecture; methods compared against include IDA, LV-IDA, Kendall correlations (as a non-causal baseline) and SCL (see text). For D²CL variants with a GNN component, two different initial graph estimates were used, based respectively on Pearson correlation coefficients ("Pearson") and on a lightweight regression ("Lasso"; see text for details).]

Figure 4: Sensitivity to incorrect causal inputs and additional results on causal direction. (a) Robustness to incorrect causal inputs. Sensitivity of D²CL to errors in prior/input causal knowledge $\Pi$ was studied by artificially introducing errors into $\Pi$, with 10% of inputs corrupted (see text). Results are quantified via causal AUC (with respect to the correct ground truth). (b) Ablation-like study in which failures of either the CNN (orange) or the GNN (blue) tower within D²CL are artificially introduced. The affected embedding is either set to zero or zero-mean Gaussian noise with varying scale is applied. The unaffected case is given as a dashed black line. (c) Causal direction analysis. Low-dimensional representations of latent feature maps of the converged CNN tower at two different layer depths. Edges A → B are shown as dots and reverse edges B → A as x-shaped markers. An edge and its corresponding reverse are indicated by the same color. For improved readability, ten (randomly chosen) pairs are highlighted in colors and larger markers. [D²CL variants include: a CNN tower alone; a GNN tower for two different initial graph estimates; and the complete architecture. Initial graph estimates for the GNN and combined models are based either on Pearson correlation coefficients ("Pearson") or on a lightweight regression ("Lasso"; see text).]