Author Correction: Improving the accuracy of medical diagnosis with causal machine learning

A Correction to this paper has been published: https://doi.org/10.1038/s41467-021-21494-9

For a given variable X and a directed acyclic graph (DAG) G, we denote the set of parents of X as Pa(X), the set of children of X as Ch(X), the set of all ancestors of X as Anc(X), and the set of all descendants of X as Dec(X). If we perform a graph cut operation on G, removing a directed edge from Y to X, we denote the variable X in the new DAG generated by this cut as X \Y.
Functions: Bernoulli variables are represented interchangeably as Boolean variables, with 1 ↔ 'True' and 0 ↔ 'False'. For a given instantiation of a Bernoulli/Boolean variable X = x, we denote the negation of x as x̄: for example, if x = 1 (0), then x̄ = 0 (1). We denote the Boolean AND function as ∧, and the Boolean OR function as ∨.

Supplementary note 2: structural causal models
First we define structural causal models (SCMs), sometimes also called structural equation models or functional causal models. These are widely applied and studied probabilistic models, and their relation to other approaches such as Bayesian networks is well understood [2,3]. The key characteristic of SCMs is that they represent variables as functions of their direct causes, along with an exogenous 'noise' variable that is responsible for their randomness.
Definition 1 (structural causal model). An SCM consists of:

1. a set of exogenous (unobserved) latent variables U = {u_1, . . ., u_n},

2. a set of observed variables V = {v_1, . . ., v_n},

3. a directed acyclic graph G, called the causal structure of the model, whose nodes are the variables U ∪ V,

4. a collection of functions F = {f_1, . . ., f_n}, where f_i is a mapping from (U ∪ V) \ v_i to v_i. The collection F forms a mapping from U to V. This is symbolically represented as v_i = f_i(pa_i, u_i), for i = 1, . . ., n, where pa_i denotes the parent nodes of the ith observed variable in G.
Note that the causal structure and generative functions are typically provided by expert opinion, though in some instances the causal structure can be learned from data [4,5]. As the collection of functions F forms a mapping from noise variables U to observed variables V, the distribution over noise variables induces a distribution over observed variables, given by

P(v_i) := Σ_{u | v_i = f_i(Pa(v_i), u)} P(u), for i = 1, . . ., n. (1)

We can hence assign uncertainty over observed variables despite the underlying dynamics being deterministic.
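Equation (1) can be made concrete with a small numerical sketch. The model below is a hypothetical two-variable example (not part of the paper's diagnostic network), with X = u_1 and Y = X ∨ u_2; the marginal P(Y) is obtained by summing the probability mass of every latent state consistent with the outcome:

```python
# Hypothetical two-variable SCM (not the paper's network): exogenous noise
# u1 ~ Bern(0.3), u2 ~ Bern(0.1); observed X = u1 and Y = X OR u2.
P_u = {(u1, u2): (0.3 if u1 else 0.7) * (0.1 if u2 else 0.9)
       for u1 in (0, 1) for u2 in (0, 1)}

def f_Y(x, u2):
    # structural equation for Y
    return x | u2

def P_Y(y):
    # equation (1): sum the noise mass over all latent states that yield Y = y
    return sum(p for (u1, u2), p in P_u.items() if f_Y(u1, u2) == y)

print(round(P_Y(1), 2))  # 0.3 + 0.7 * 0.1 = 0.37
```

Here the marginal is recovered by brute-force enumeration of U, illustrating how a distribution over deterministic mechanisms induces a distribution over outcomes.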
In order to formally define a counterfactual query, we must first define the interventional primitive known as the do-operator [3]. Consider an SCM with functions F. The effect of intervention do(X = x) in this model corresponds to creating a new SCM with functions F_{X=x}, formed by deleting from F all functions f_i corresponding to members of the set X and replacing them with the set of constant functions X = x. That is, the do-operator forces variables to take certain values, regardless of the original causal mechanism. This represents the operation whereby an agent intervenes on a variable, fixing it to take a certain value. Probabilities involving the do-operator, such as P(Y = y | do(X = x)), correspond to evaluating ordinary probabilities in the SCM with functions F_{X=x}, in this case P(Y = y). Where appropriate, we use the more compact notation Y_x to denote the variable Y following the intervention do(X = x).
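A minimal sketch of the do-operator, on another hypothetical toy model: the structural equation for X is deleted and X is clamped to a constant before the remaining equations are evaluated:

```python
# do(X = x) deletes f_X and clamps X (hypothetical toy model):
# noise u ~ Bern(0.2); X = u; Y = X OR u.
P_u = {0: 0.8, 1: 0.2}

def P_Y_do_X(x, y):
    # In F_{X=x}, X is a constant; the remaining equation Y = X OR u is kept.
    return sum(p for u, p in P_u.items() if (x | u) == y)

print(P_Y_do_X(0, 1))  # only u = 1 yields Y = 1: 0.2
print(P_Y_do_X(1, 1))  # X forced on, so Y = 1 always: 1.0
```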
Next we define noisy-OR models, a specific class of SCMs for Bernoulli variables that are widely employed as diagnostic models [6-15]. The noisy-OR assumption states that a variable Y is the Boolean OR of its parents X_1, X_2, . . ., X_N, where the inclusion or exclusion of each causal parent in the OR function is decided by an independent probability or 'noise' term. The standard approach to defining noisy-OR is to present the conditional independence constraints generated by the noisy-OR assumption [16],

P(Y = 0 | X_1, . . ., X_N) = ∏_{i=1}^{N} P(Y = 0 | only(X_i = 1))^{x_i}, (2)

where P(Y = 0 | only(X_i = 1)) is the probability that Y = 0 conditioned on all of its (endogenous) parents being 'off' (X_j = 0) except for X_i alone. We denote P(Y = 0 | only(X_i = 1)) = λ_{X_i,Y} by convention.
The utility of this assumption is that it reduces the number of parameters needed to specify a noisy-OR network to O(N), where N is the number of directed edges in the network. All that is needed to specify a noisy-OR network are the single-variable marginals P(X_i = 1) and, for each directed edge X_i → Y_j, a single λ_{X_i,Y_j}. For this reason, noisy-OR has been a standard assumption in Bayesian diagnostic networks, which are typically large and densely connected and so could not be efficiently learned and stored without additional assumptions on the conditional probabilities. We now define the noisy-OR assumption for SCMs.
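The O(N) parameterisation can be sketched as follows: given one λ per edge and a leak term, any entry of the CPT column P(Y = 0 | x) is recovered as a product (a sketch with hypothetical values; `lam_leak` plays the role of the leak parameter introduced below):

```python
# Per-edge parameters lambda_{X_i,Y} plus a leak term specify the whole CPT
# (a sketch; lam_leak plays the role of lambda_{L_Y}, all values hypothetical).
def noisy_or_p_off(x, lams, lam_leak):
    """P(Y = 0 | parents x): each 'on' parent i fails to activate Y w.p. lams[i]."""
    p = lam_leak
    for xi, lam in zip(x, lams):
        if xi:
            p *= lam
    return p

# two parents, both on: P(Y = 0) = lam_leak * lam_1 * lam_2
print(round(noisy_or_p_off([1, 1], [0.4, 0.5], 0.9), 3))  # 0.18
```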
Definition 2 (noisy-OR SCM). A noisy-OR network is an SCM of Bernoulli variables, where for any variable Y with parents Pa(Y) = {X_1, . . ., X_N} the following conditions hold:

1. Y is the Boolean OR of its parents, where for each parent X_i there is a Bernoulli variable U_i whose state determines whether we include that parent in the OR function or not, i.e. Y = ∨_{i=1}^{N} (x_i ∧ ū_i), so Y = 1 if any parent is on, x_i = 1, and is not ignored, u_i = 0 (ū_i = 1, where 'bar' denotes the negation of u_i).
2. The exogenous latent encodes the likelihood of ignoring the state of each parent in (1), P(u_Y) = P(u_1, u_2, . . ., u_N). The probability of ignoring the state of a given parent variable is independent of whether you have or have not ignored any of the other parents, i.e. P(u_1, . . ., u_N) = ∏_{i=1}^{N} P(u_i).

3. For every node Y there is a parent 'leak node' L_Y that is singly connected to Y and is always 'on', with a probability of ignoring given by λ_{L_Y}.

The leak node (assumption 3) represents the probability that Y = 1 even if X_i = 0 ∀ X_i ∈ Pa(Y). This allows Y = 1 to be caused by an exogenous factor (outside of our model). For example, the leak nodes allow us to model the situation that a disease spontaneously occurs, even if all risk factors that we model are absent, or that a symptom occurs but none of the diseases that we model have caused it. It is conventional to treat the leak node associated with a variable Y as a parent node L_Y with P(L_Y = 1) = 1. Every variable in the noisy-OR SCM has a single, independent leak node parent.
Given Definition 2, why is the noisy-OR assumption justified for modelling diseases? First, consider assumption (1), that the generative function is a Boolean OR of the individual parent 'activation functions' x_i ∧ ū_i. This is equivalent to assuming that the activations from diseases or risk factors to their children never 'destructively interfere'. That is, if D_i is activating symptom S, and so is D_j, then this joint activation never cancels out to yield S = 0. As a consequence, all that is required for a symptom to be present is that at least one disease is causing it, and likewise for diseases being caused by risk factors. This property of noisy-OR, whereby an individual cause is also a sufficient cause, is a natural assumption for disease modelling, where diseases are (typically by definition) sufficient causes of their symptoms, and risk factors are defined such that they are sufficient causes of diseases. For example, if preconditions R_1 = 1 and R_2 = 1 are needed to cause D = 1, then we can represent this as a single risk factor R = R_1 ∧ R_2. Assumption 2 states that a given disease (risk factor) has a fixed likelihood of activating a symptom (disease), independent of the presence or absence of any other disease (risk factor). In the noisy-OR model, the likelihood that we ignore the state of a parent X_i of variable Y is given by P(u_i = 1) = λ_{X_i,Y} = P(Y = 0 | only(X_i = 1)), and so is directly associated with a (causal) relative risk. In the case that child Y has two parents, X_1 and X_2, noisy-OR assumes that this joint relative risk factorises as P(Y = 0 | X_1 = 1, X_2 = 1) = λ_{X_1,Y} λ_{X_2,Y}. Whilst it is likely that interactions between causal parents will mean that these relative risks are not always multiplicative, this is assumed to be a good approximation. For example, we assume that the likelihood that a disease fails to activate a symptom is independent of whether or not any other disease similarly fails to activate that symptom.
As noisy-OR models are typically presented as Bayesian networks, the above definition of noisy-OR is nonstandard. We now show that the SCM definition yields the Bayesian network definition (2).
Theorem 1 (noisy-OR CPT). The conditional probability distribution of a child Y given its parents {X_1, . . ., X_N} and obeying Definition 2 is given by

P(Y = 0 | X_1, . . ., X_N) = ∏_{i=1}^{N} λ_{X_i,Y}^{x_i},

where λ_{X_i,Y} = P(u_i = 1).

Proof. For Y = 0, the negation of y, denoted ȳ, is given by

ȳ = ∧_{i=1}^{N} (x̄_i ∨ u_i).

The CPT is calculated from the structural equations by marginalizing over the latents, i.e. we sum over all latent states that yield Y = 0. Equivalently, we can marginalize over all exogenous latent states multiplied by the above Boolean function, which is 1 if the condition Y = 0 is met, and 0 otherwise.
This is identical to the noisy-OR CPT (2), where we denote λ_{X_i,Y} = P(u_i = 1). The leak node is included as a parent X_L where P(X_L = 1) = 1, with a (typically large) probability of being ignored λ_L. This node represents the likelihood that Y will be activated by some causal influence outside of the model, and is included to ensure that P(Y = 1 | ∧_{i=1}^{n}(X_i = 0)) ≠ 0. As the leak node is always on, its notation can be suppressed and it is standard to write the CPT as

P(Y = 0 | X_1, . . ., X_N) = λ_L ∏_{i=1}^{N} λ_{X_i,Y}^{x_i}.
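Theorem 1 can be checked numerically on a small example: enumerating the ignore-latents of the structural equation Y = ∨_i (x_i ∧ ū_i) reproduces the product-form CPT (a two-parent sketch with hypothetical λ values; the leak is omitted for brevity):

```python
import itertools

# Brute-force check of Theorem 1 on a hypothetical 2-parent example (leak omitted):
# marginalizing Y = OR_i(x_i AND NOT u_i) over independent ignore-latents u_i
# reproduces the product CPT P(Y = 0 | x) = prod_i lam_i^{x_i}.
lams = [0.4, 0.7]  # lam_i = P(u_i = 1), the chance that parent i is ignored

def p_off_structural(x):
    total = 0.0
    for u in itertools.product((0, 1), repeat=len(x)):
        p_u = 1.0
        for ui, lam in zip(u, lams):
            p_u *= lam if ui else (1 - lam)
        y = any(xi and not ui for xi, ui in zip(x, u))
        if not y:
            total += p_u
    return total

def p_off_cpt(x):
    p = 1.0
    for xi, lam in zip(x, lams):
        p *= lam ** xi
    return p

for x in itertools.product((0, 1), repeat=2):
    assert abs(p_off_structural(x) - p_off_cpt(x)) < 1e-12
print("structural marginal matches product CPT")
```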

Supplementary note 3: Twin diagnostic networks
In this supplementary note we derive the structure of diagnostic twin networks. First we provide a brief overview of the twin-network approach to counterfactual inference; see [1] and [17] for more details on this formalism. Recalling the definition of the do-operator from the previous section, we define counterfactuals as follows.
Definition 3 (Counterfactual). Let X and Y be two subsets of variables in V. The counterfactual sentence 'Y would be y (in situation U), had X been x' is the solution Y = y of the set of equations F_x, succinctly denoted Y_x(U) = y.
As with observed variables in Definition 1, the latent distribution P (U ) allows one to define the probabilities of counterfactual statements in the same manner they are defined for standard probabilities (1).
Reference [3] provides an algorithmic procedure for computing arbitrary counterfactual probabilities for a given SCM. First, the distribution over latents is updated to account for the observed evidence. Second, the do-operator is applied, representing the counterfactual intervention. Third, the new causal model created by the application of the do-operator in the previous step is combined with the updated latent distribution to compute the counterfactual query. In general, denote E as the set of factual evidence. The above can be summarised as,

1. (abduction) The distribution of the exogenous latent variables P(u) is updated to obtain P(u | E).

2. (action) Apply the do-operation to the variables in set X, replacing their structural equations with the constant functions X = x.

3. (prediction) Use the modified model to compute the probability of Y = y.
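The three steps can be sketched on a hypothetical toy SCM (u_1, u_2 play the role of the exogenous latents; the model and numbers are for illustration only):

```python
# Abduction-action-prediction on a hypothetical toy SCM (for illustration only):
# u1 ~ Bern(0.3) with X = u1;  u2 ~ Bern(0.1) with Y = X OR u2.
# Factual evidence E: X = 1, Y = 1.  Query: would Y = 1, had X been 0?
P_u = {(u1, u2): (0.3 if u1 else 0.7) * (0.1 if u2 else 0.9)
       for u1 in (0, 1) for u2 in (0, 1)}

# 1. abduction: condition the latents on E (E forces u1 = 1; u2 remains free).
post = {u: p for u, p in P_u.items() if u[0] == 1 and (u[0] | u[1]) == 1}
Z = sum(post.values())
post = {u: p / Z for u, p in post.items()}

# 2. action: do(X = 0) -- replace f_X with the constant 0.
# 3. prediction: evaluate Y = X OR u2 in the modified model under P(u | E).
p_cf = sum(p for (u1, u2), p in post.items() if (0 | u2) == 1)
print(round(p_cf, 3))  # P(u2 = 1) = 0.1
```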
The issue with applying this approach to our large diagnostic models is that the first step, updating the exogenous latents, is in general intractable for models with large tree-width. The twin-network formalism, introduced in [1], is a method which reduces and amortises the cost of this procedure. Rather than explicitly updating the exogenous latents, performing an intervention, and performing belief propagation on the resulting SCM, twin networks allow us to calculate the counterfactual by performing belief propagation on a single 'twin' SCM, without requiring the expensive abduction step. The twin network is constructed as a composite of two copies of the original SCM where copied variables share their corresponding latents [1]. We refer to pairs of copied variables as 'dual variables'. Nodes on this twin network can then be merged following simple rules outlined in [17], further reducing the complexity of computing the counterfactual query. We now outline the process of constructing the twin diagnostic network for the two counterfactual queries we are interested in: those with single counterfactual interventions, and those where all counterfactual variables bar one are intervened on. We assume the DAG structure of our diagnostic model is a three-layer network [A]. The top layer nodes represent risk factors, the second layer diseases, and the third layer symptoms. We assume no directed edges between nodes belonging to the same layer. To construct the twin network, first the SCM in [A] is copied. In [B] the network on the left will encode the factual evidence in our counterfactual query, and we refer to this as the factual graph. The network on the right in [B] will encode our counterfactual interventions and observations, and we refer to this as the counterfactual graph. We use an asterisk X* to denote the counterfactual dual variable of X.
As detailed in [1], the twin network is constructed such that each node on the factual graph shares its exogenous latent with its dual node, so u*_{X_i} = u_{X_i}. These shared exogenous latents are shown as dashed lines in figures [B-E]. First, we consider the case where we perform a counterfactual intervention on a single disease. As shown in [B], we select a disease node in the counterfactual graph to perform our intervention on (in this instance D*_2). Once the counterfactual intervention has been applied, it is possible to greatly simplify the twin network graph structure via node merging [17]. In SCMs a variable takes a fixed deterministic value given an instantiation of all of its parents and its exogenous latent. Hence, if two nodes have identical exogenous latents and parents, they are copies and can be merged into a single node. By convention, when we merge these identical dual nodes we map X* → X (dropping the asterisk). Dual nodes which share no ancestors that have been intervened upon can therefore be merged. As we do not perform interventions on the risk factor nodes, all (R_i, R*_i) are merged (note that for the sake of clarity we do not depict the exogenous latents for risk factors).
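The shared-latent construction can be sketched numerically on a hypothetical two-variable model (X = u_1, Y = X ∨ u_2): because the factual and counterfactual copies share their exogenous latents, the counterfactual query becomes ordinary conditioning on one joint model, with no explicit abduction step:

```python
# A twin-network version of the counterfactual computation (hypothetical
# two-variable model: X = u1, Y = X OR u2): the factual and counterfactual copies
# share their exogenous latents, so the counterfactual query becomes ordinary
# conditioning on one joint model, with no explicit abduction step.
P_u = {(u1, u2): (0.3 if u1 else 0.7) * (0.1 if u2 else 0.9)
       for u1 in (0, 1) for u2 in (0, 1)}

num = den = 0.0
for (u1, u2), p in P_u.items():
    x, y = u1, u1 | u2            # factual copy
    y_star = 0 | u2               # counterfactual copy under do(X* = 0)
    if x == 1 and y == 1:         # condition on the factual evidence
        den += p
        num += p * (y_star == 1)
print(round(num / den, 3))        # P(Y* = 1 | X = 1, Y = 1, do(X* = 0)) = 0.1
```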

Supplementary Figure 2: Simplification of twin network through node merging (panels labelled 'Factual graph' and 'Counterfactual graph')

Next, we merge all dual factual/counterfactual disease nodes that are not intervened on, as their latents and parents are identical (shown in [D]). Finally, any symptoms that are not children of the disease we have intervened on (D_2) can be merged, as all of their parent variables are identical. The resulting twin network is shown in [E]. Note that we have also removed any superfluous symptom nodes that are unevidenced, as they are irrelevant to the query.
In the case that we intervene on all of the counterfactual diseases except one, following the node merging rule outlined above, we arrive at a model with a single disease that is a parent of both factual and counterfactual symptoms, as shown in Figure [F].

Supplementary Figure 3: Final twin network for expected sufficiency
We refer to the SCMs shown in figures [E] and [F] as 'twin diagnostic networks'.The counterfactual queries we are interested in can be determined by applying standard inference techniques such as importance sampling to these models [18].
Before proceeding, we motivate our choice of counterfactual query for the task of diagnosis.
An observation will often have multiple possible causes, which constitute competing explanations. For example, the observation of a symptom S = 1 can in principle be explained by any of its parent diseases. In the case that a symptom has multiple associated causes (diseases), rarely is a single disease necessary to explain a given symptom, unless the symptom is uniquely generated by that disease. Equivalently, the symptoms associated with a disease tend to be present in patients suffering from that disease, without requiring a secondary disease to be present. This can be summarised by the following assumption: any single disease is a sufficient cause of any of its associated symptoms. Under this assumption, determining the likelihood that a disease is causing a symptom reduces to simple deduction: removing all other possible causes and seeing if the symptom remains. We call this the assumption of causal sufficiency and note that it is a standard assumption in most of medicine, and is often taken as part of the definition of the symptoms of a disease.
The question of how we can define and quantify causal explanations in general models is an area of active research [19-22], and the approach we propose here cannot be applied to all conceivable SCMs, as counterfactual inferences are valid only up to a set of modelling assumptions [23]. For example, if a symptom can be present only if two parent diseases D_1 and D_2 are both present, then neither of these parents in isolation is a sufficient cause (individually, D_1 = 1 and D_2 = 1 are necessary but not sufficient to cause S = 1). This case would violate the assumption of causal sufficiency. In supplementary note 6 we present a different counterfactual query that does not require causal sufficiency, and captures causality in this case by reasoning about necessary treatments.
The assumption of causal sufficiency is obeyed by noisy-OR models, as in these models all diseases are individually sufficient to generate any symptom. This is ensured by the OR function, which states that a symptom S is the Boolean OR of its parents' individual activation functions, s = ∨_{i=1}^{N} [d_i ∧ ū_{D_i,S}], where the activation function from parent D_i is f_i = d_i ∧ ū_{D_i,S}. Thus, any single activation is sufficient to explain S = 1 and we can quantify the expected sufficiency of each disease individually. An example of a model that would violate this property is a noisy-AND model, where s = ∧_{i=1}^{N} [d_i ∧ ū_{D_i,S}], i.e. all parent diseases must be present in order for the symptom to be present.
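The contrast between the two generative functions can be sketched directly (hypothetical disease/latent states; `s_noisy_or` and `s_noisy_and` are illustrative helper names):

```python
# Contrast of generative functions (sketch, hypothetical states): under noisy-OR a
# single active, unignored parent suffices for S = 1; under noisy-AND every parent
# must contribute, so no single disease is a sufficient cause.
def s_noisy_or(d, u):
    return any(di and not ui for di, ui in zip(d, u))

def s_noisy_and(d, u):
    return all(di and not ui for di, ui in zip(d, u))

d, u = [1, 0], [0, 0]        # only disease 1 is present (and not ignored)
print(s_noisy_or(d, u))      # True: one sufficient cause explains S = 1
print(s_noisy_and(d, u))     # False: causal sufficiency is violated
```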
Given these properties of noisy-OR models (and of disease models in general), we propose our measure for quantifying how well a disease explains the patient's symptoms: the expected sufficiency. For a given disease, this measures the number of symptoms that we would expect to remain if we intervened to nullify all other possible causes of symptoms. This counterfactual intervention is represented by the causal model shown in figure [F] in supplementary note A.2.

Definition 2 (expected sufficiency). The expected sufficiency of disease D_k determines the number of positively evidenced symptoms we would expect to persist if we intervene to switch off all other possible causes of the symptoms,

E_suff(D_k, E) := Σ_{S'} |S'_+| P(S' | E, do(Pa(S_+) \ D_k = 0)),

where the expectation is calculated over all possible counterfactual symptom evidence states S' and S'_+ denotes the positively evidenced symptoms in the counterfactual symptom evidence state. Pa(S_+) \ D_k denotes the set of all parents of the set of positively evidenced symptoms S_+ excluding D_k, and do(Pa(S_+) \ D_k = 0) denotes the counterfactual intervention setting Pa(S_+) \ D_k → 0. E denotes the set of all factual evidence.
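The definition can be made concrete by brute-force enumeration on a toy twin network with two diseases, one symptom, and a leak (all numbers hypothetical). With a single symptom, |S'_+| is either 0 or 1, so the expected sufficiency reduces to the probability that the counterfactual symptom remains on once every cause other than D_k (here, D_j and the leak) is switched off:

```python
import itertools

# Brute-force twin-network enumeration on a tiny hypothetical noisy-OR model:
# diseases Dk, Dj -> one symptom S, plus a leak; all numbers are made up.
p_dk, p_dj = 0.2, 0.3                  # disease priors
lam_k, lam_j, lam_L = 0.4, 0.5, 0.9    # ignore probabilities (lam_L for the leak)

num = den = 0.0
for dk, dj, uk, uj, uL in itertools.product((0, 1), repeat=5):
    p = ((p_dk if dk else 1 - p_dk) * (p_dj if dj else 1 - p_dj)
         * (lam_k if uk else 1 - lam_k) * (lam_j if uj else 1 - lam_j)
         * (lam_L if uL else 1 - lam_L))
    s = (dk and not uk) or (dj and not uj) or (not uL)  # factual symptom
    s_cf = dk and not uk        # counterfactual: Dj and the leak are forced off
    if s:                       # factual evidence: S = 1
        den += p
        num += p * bool(s_cf)   # |S'_+| is 0 or 1 with a single symptom
print(round(num / den, 4))      # expected sufficiency of Dk
```

Note that the factual and counterfactual symptom share the latents (dk, uk), which is exactly the twin-network shared-latent construction.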
To evaluate the expected sufficiency we must first determine the dual symptom CPTs in the corresponding twin network (figure [F]).
Lemma 1. For a given symptom S and its counterfactual dual S*, with parent diseases D and under the counterfactual interventions do(D* \ D*_k = 0) and do(u*_L = 0), the joint conditional distribution is given by

P(s, s* | ∧_{i≠k} d_i, d_k, do(∧_{i≠k} D*_i = 0), do(u*_L = 0)) = δ(d_k − 1)(1 − λ_{D_k,S}) δ(s − 1) δ(s* − 1) + λ_{D_k,S}^{d_k} P(s_{\k} = s | ∧_{i≠k} d_i) δ(s*),

where ∧_{i≠k} D*_i is the set of all counterfactual disease nodes excluding D_k, ∧_{i≠k} d_i is the given instantiation of all disease nodes excluding D_k, and u*_L denotes the leak node for the counterfactual symptom. s_{\k} denotes the state of the factual symptom node S under the graph surgery removing any directed edge from D_k to S.
Proof. The CPT for the dual symptom nodes S, S* is given by marginalizing over the shared exogenous latents, where we have used the fact that the latent variables and the disease variables together form a Markov blanket for S, S*, and we have used the conditional independence structure of the twin network, shown in Figure [F], which implies that S and S* share only a single variable, D_k, in their Markov blankets. With the full Markov blanket specified, including the exogenous latents, the CPTs in (14) are deterministic functions, each taking the value 1 if their conditional constraints are satisfied. Note that the product of these two functions is equivalent to a function that is 1 if both sets of conditional constraints are satisfied and zero otherwise, and marginalizing over all latent variable states multiplied by this function is equivalent to the definition of the CPT for SCMs given in equation (1), where the CPT is determined by a conditional sum over the exogenous latent variables. Given the definition of the noisy-OR SCM in (3), these functions take the form of products of Boolean clauses in the disease states and latents. Taking the product of these functions gives the function g_{s,s*}(u, d, u_L), where u denotes a given instantiation of the free latent variables u_{D_1,S}, . . ., u_{D_N,S}, and where we have used

Σ_{u_{D_i,S}} P(u_{D_i,S}) (d̄_i ∨ u_{D_i,S}) = P(u_{D_i,S} = 1) + P(u_{D_i,S} = 0) d̄_i = λ_{D_i,S}^{d_i},

and ∏_i λ_{D_i,S}^{d_i} can immediately be identified as P(s = 0 | D) by (11).
From this we can identify the conditional distributions of the factual and counterfactual symptoms. Finally, we can express the result in terms of s_{\k}, where s_{\k} is the instantiation of S_{\k}, the variable generated by removing any directed edge D_k → S (or equivalently, replacing λ_{D_k,S} with 1).
Given our expression for the symptom CPT on the twin network, we now derive the expression for the expected sufficiency.
Theorem 1. For the noisy-OR networks described in supplementary notes A.1-A.4, the expected sufficiency of disease D_k is given by the expression derived below, where S_± denotes the positive and negative symptom evidence, R denotes the risk-factor evidence, and S_{\k} denotes the set of symptoms S with all directed arrows from D_k to S ∈ S removed.
Proof. Starting from the definition of the expected sufficiency, we must find expressions for all CPTs P(S' | E, do(D \ D_k = 0), do(U_L = 0)) with |S'_+| ≠ 0 (terms with S'_+ = ∅ do not contribute to (20)). Let S*_A = {S* s.t. S ∈ S_−, S* ∈ S'_−} (symptoms that remain off following the counterfactual intervention), S*_B = {S* s.t. S ∈ S_+, S* ∈ S'_+} (symptoms that remain on following the counterfactual intervention), and S*_C = {S* s.t. S ∈ S_+, S* ∈ S'_−} (symptoms that are switched off by the counterfactual intervention). Lemma 1 implies that P(S = 0, S* = 1 | d, do(∧_{i≠k} D*_i = 0), do(u*_L = 0)) = 0, and therefore these three cases are sufficient to characterise all possible counterfactual symptom states S'. Therefore, to evaluate (20), we need only determine expressions for the terms involving S*_A, S*_B and S*_C, where U*_L denotes the set of all counterfactual leak nodes for the symptoms S*_A, S*_B, S*_C. Note that we only perform counterfactual interventions, i.e. interventions on counterfactual variables. As the exogenous latents are shared by the factual and counterfactual graphs, U*_L = U_L, but we maintain the notation for clarity. First, note that the factual symptoms S_± on the twin network [F] are conditionally independent of the counterfactual interventions. To compute the terms Q, we express Q as a marginalization over the factual diseases which, together with the interventions on the counterfactual diseases and leak nodes, constitute a Markov blanket for each dual pair of symptoms. Substituting in the CPT derived in Lemma 1 yields an expression for Q. The only terms in (20) with |S'_+| ≠ 0 have S*_B ≠ ∅, therefore the term δ(d_k − 1) is present, and Q simplifies to P(S_A = 0, . . .), where in the last line we have performed the marginalization over the factual diseases, giving (26), and where we have dropped the subscript C from S_C. Given our expression for the expected sufficiency, we now derive a simplified expression that is very similar to the posterior P(D_k = 1 | R, S_±).
Theorem 2 (Simplified expected sufficiency). Proof. Starting with the expected sufficiency given in Theorem 1, we can perform the change of variables X = S'_+ \ S' to give an expression where, in the last line, we apply the inclusion-exclusion principle to decompose an arbitrary joint state over Bernoulli variables P(A = 0, B = 1) as a sum over the powerset of the variables B in terms of marginals where all variables are instantiated to 0,

P(A = 0, B = 1) = Σ_{X ⊆ B} (−1)^{|X|} P(A = 0, X = 0).

By the definition of noisy-OR (7), we can replace the graph operation represented by \k by dividing the CPT by the product of the corresponding λ_{D_k,S} terms. This allows E_suff to be expressed as (33). We now aggregate the terms in the power sum that yield the same marginal on the symptoms (e.g. for fixed Z). Every X ⊆ S_+ \ Z yields a single marginal P(S_− = 0, Z = 0, D_k = 1 | R), and therefore if we express (33) as a sum in terms of Z, each term P(S_− = 0, Z = 0, D_k = 1 | R) aggregates a coefficient K_Z of the form E_suff(D_k, E) = Σ_{Z ⊆ S_+} K_Z P(S_− = 0, Z = 0, D_k = 1 | R), where A = S_+ \ Z. This can be further simplified using the identity (37). Using (37) we can simplify the coefficient (34). Rearranging (33) as a summation over Z and substituting in (38) gives (39), which can be expressed as (40). Note that if we fix τ(k, Z) = 1 ∀ Z, we recover P(D_k = 1 | R, S_±), which is the standard posterior of disease D_k under evidence E = R ∪ S_± (this follows from the inclusion-exclusion principle, and can be easily checked by applying marginalization to express P(S_±, D_k = 1 | R) in terms of marginals where all symptoms are instantiated as 0). Note that (40) can be seen as a counterfactual correction to the Quickscore algorithm of [10] (although we do not assume independence of diseases as the authors of [10] do).
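The inclusion-exclusion step used above is a generic identity for Bernoulli variables, and can be verified numerically on an arbitrary joint distribution (a randomly generated three-variable example, not the diagnostic model itself):

```python
import itertools
import random

# Numerical check of the inclusion-exclusion decomposition for Bernoulli variables,
#   P(A = 0, B = 1) = sum over X subset of B of (-1)^|X| P(A = 0, X = 0),
# on a random joint distribution over (a, b1, b2) (a generic identity, not
# specific to the diagnostic model).
random.seed(0)
w = {s: random.random() for s in itertools.product((0, 1), repeat=3)}
Z = sum(w.values())
P = {s: v / Z for s, v in w.items()}

lhs = sum(p for (a, b1, b2), p in P.items() if a == 0 and b1 == 1 and b2 == 1)

rhs = 0.0
for X in ((), (0,), (1,), (0, 1)):  # subsets of B = {b1, b2}, by index
    rhs += (-1) ** len(X) * sum(
        p for (a, b1, b2), p in P.items()
        if a == 0 and all((b1, b2)[i] == 0 for i in X))

print(abs(lhs - rhs) < 1e-12)  # True
```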

Supplementary note 5: properties of the expected sufficiency
In this supplementary note, we show that the expected sufficiency (42) obeys our four postulates, including an additional postulate of sufficiency.
The expected sufficiency satisfies the following four properties. Proof. Postulate 1 dictates that the measure should be proportional to the posterior probability of the disease. Postulate 2 states that if the disease has no causal effect on the symptoms presented then it is a poor diagnosis and should be discarded. Postulate 3 states that the (tight) upper bound of the measure for a given disease (in the sense that there exists some disease model that achieves this upper bound, namely deterministic models) is the number of positive symptoms that the disease can explain. This allows us to differentiate between diseases that are equally likely causes, but where one can explain more symptoms than another. Postulate 4 states that if it is possible that D_k is causing at least one symptom, then the measure should be strictly greater than 0.
Starting from the definition of the expected sufficiency, given the conditional independence structure of the twin network [F] we can express the counterfactual symptom marginals in terms of the disease posterior. If D_k = 0, then due to the counterfactual interventions the counterfactual states have all parents (including leaks) instantiated to 0, which implies that S'_+ = ∅ by (2). Hence this case never contributes to the expected sufficiency, as the expectation is over |S'_+|. For D_k = 1, we recover that P(S' | E, do(D \ D_k = 0), do(U_L = 0)) ∝ P(D_k = 1 | E), and therefore E_suff(D_k, E) ∝ P(D_k = 1 | E). For postulate 2, if there are no symptoms that are descendants of D_k, then E_suff(D_k, E) = 0. This follows immediately from the fact that if D_k is not an ancestor of any of the symptoms, then all counterfactual symptoms have all parents instantiated as 0 and S'_+ = ∅. For postulate 4, we can only prove this property under additional assumptions about our disease model (see supplementary note 2 for a noisy-AND counterexample). First, note that E_suff(D_k, E) is a convex sum with positive semi-definite coefficients |S'_+|. The measure is therefore strictly positive if there is a single positively evidenced symptom that is a descendant of D_k, D_k has a positive causal influence on that child, and our disease model permits that every disease be capable of causing its associated symptoms in isolation, i.e. P(S = 1 | only(D_k = 1)) > 0.

Supplementary note 6: expected disablement

In this supplementary note we turn our attention to our second diagnostic measure: the expected disablement. This measure is closer to typical treatment measures, such as the effect of treatment on the treated [24]. We use the twin diagnostic network outlined in supplementary note 3, figure [E] (shown below), to simulate counterfactual treatments. We focus on the simplest case of single disease interventions, and propose a simple ranking measure whereby the best treatments are those that remove the most symptoms.

Definition 3 (expected disablement). The expected disablement of disease D_k determines the number of positive symptoms that we would expect to switch off if we intervened to turn off D_k,

E_dis(D_k, E) := Σ_{S'} |S_+ \ S'_+| P(S' | E, do(D_k = 0)), (46)

where E is the factual evidence and S_+ is the set of factual positively evidenced symptoms. The expectation is calculated over all possible counterfactual symptom evidence states S', and S'_+ denotes the positively evidenced symptoms in the counterfactual symptom evidence state. do(D_k = 0) denotes the counterfactual intervention setting D_k → 0.

Decisions about which treatment to select for a patient generally take into account variables such as cost and cruelty. These variables can simply be included in the treatment measure. For example, the cruelty of specific symptoms can be included in the expectation (46) by weighting each positive symptom accordingly. The cost of treating a specific disease is included simply by multiplying (46) by a cost weight, and likewise for including the probability of the intervention succeeding. For now, we focus on computing the counterfactual probabilities, which we can then use to construct arbitrarily weighted expectations.
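As with the expected sufficiency, the definition can be checked by brute-force enumeration on a toy noisy-OR model (hypothetical numbers). With a single positive symptom, the expectation reduces to the probability that the symptom switches off under do(D_k = 0):

```python
import itertools

# Brute-force enumeration on a tiny hypothetical noisy-OR model (same style as
# for the expected sufficiency): how often does do(Dk = 0) switch the symptom off?
p_dk, p_dj = 0.2, 0.3
lam_k, lam_j, lam_L = 0.4, 0.5, 0.9

num = den = 0.0
for dk, dj, uk, uj, uL in itertools.product((0, 1), repeat=5):
    p = ((p_dk if dk else 1 - p_dk) * (p_dj if dj else 1 - p_dj)
         * (lam_k if uk else 1 - lam_k) * (lam_j if uj else 1 - lam_j)
         * (lam_L if uL else 1 - lam_L))
    s = (dk and not uk) or (dj and not uj) or (not uL)    # factual symptom
    s_cf = (dj and not uj) or (not uL)                    # under do(Dk = 0)
    if s:                          # factual evidence: S = 1
        den += p
        num += p * (not s_cf)      # symptom switched off by the intervention
print(round(num / den, 4))         # expected disablement of Dk
```

As expected, the counterfactual symptom can only switch off, never on, since switching off a disease removes one of the OR-ed activations.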
To calculate (46), note that the only CPTs that differ from the original noisy-OR SCM are those for unmerged dual symptom nodes (i.e. children of the intervention node D_k). The disease layer forms a Markov blanket for the symptom layer, d-separating dual symptom pairs from each other. Therefore we derive the CPT for dual symptoms and their parent diseases.
Supplementary Figure 4: Final twin network for expected disablement (panels labelled 'Factual graph' and 'Counterfactual graph')

Lemma 2. For a given symptom S and its counterfactual dual S*, with parent diseases D and under the counterfactual intervention do(D*_k = 0), the joint conditional distribution on the twin network is given by the expression derived below.

Proof. First note that for this marginal distribution the intervention do(D*_k = 0) is equivalent to setting the evidence D*_k = 0, as we specify the full Markov blanket of (s, s*). Let D_{\k} denote the set of parents of (s, s*) not including the intervention node D*_k or its dual D_k. We wish to compute the conditional probability by marginalizing over the exogenous latents, where p(u_s) is the product distribution over all exogenous noise terms for S, including the leak term. We proceed as before by expressing this as a marginalization over the CPT of the dual states P(s = 0, s* = 0 | ·). For s_i = 0, the generative functions are Boolean constraints as in Lemma 1. First we compute the joint state, where we have used the Boolean identities a ∧ a = a and a ∨ (b ∧ c) = (a ∨ b) ∧ (a ∨ c). Next, we calculate the single-symptom conditionals. Note that λ x + x̄ = λ^x. We can now express the joint conditional distribution over dual symptom pairs. As we are always intervening to switch off diseases, P(s = 0, s* = 1 | ∧_{i≠k} D_i = d_i, D_k = d_k, D*_k = 0) = 0, as expected (switching off a disease will never switch on a symptom). This simplifies our expression for the conditional distribution, which then simplifies further using (49). We have arrived at expressions for the CPTs over dual symptoms in terms of CPTs on the factual graph, and hence our counterfactual query can be computed on the factual graph alone. The third term in (52) can be rewritten using the definition of noisy-OR (7): in the case that λ_{D_k,S} > 0, we recover an equivalent expression in terms of the factual CPT, which follows from the definition of the noisy-OR CPT (2). Lemma 2 allows us to express the expected disablement in terms of factual probabilities. As we have seen, the
intervention do(D*_k = 0) can never result in counterfactual symptoms that are on when their dual factual symptoms are off, so we need only enumerate over counterfactual symptom states where S'_+ ⊆ S_+, as these are the only counterfactual states with non-zero weight. From this it also follows that s ∈ S_− =⇒ s* ∈ S'_−. The counterfactual CPT in (46) is represented on the twin network [F] by the expression derived in Lemma 2.

Theorem 4 (Simplified noisy-OR expected disablement). For the noisy-OR networks described in supplementary note 2, the expected disablement of disease D_k is given by the expression derived below, where S_± is the set of factual positive (negative) evidenced symptom nodes and R is the risk factor evidence.
Proof. From the above discussion, the non-zero contributions to the expected disablement are those enumerated above. Applying Bayes' rule, and noting that the factual evidence states are not children of the intervention node D*_k, gives the following. Let us now consider the probabilities $Q = P(S^*_- = 0, C^* = 0, S^*_+ \setminus C^* = 1, S_+, S_- \mid R, \mathrm{do}(D^*_k = 0))$. We can express these as marginalizations over the disease layer, which d-separates dual symptom pairs from each other. First, we express Q in the instance where we assume all $\lambda_{D_k,S} > 0$.
$E(D_k, E)$ is a sum of products of Q's; therefore, if every Q is continuous as $\lambda_{D_k,S} \to 0$ for all S, we can derive $E(D_k, E)$ for positive $\lambda_{D_k,S}$ and take the limit $\lambda_{D_k,S} \to 0$ where appropriate. We can consider each term in isolation, as a product of continuous functions is continuous. Each term in Q derives from one of the dual-symptom CPTs above; the first is a linear function of $\lambda_{D_k,S}$ and therefore continuous in the limit $\lambda_{D_k,S} \to 0$. Secondly, the next term is again a linear function of $\lambda_{D_k,S}$ and so is continuous in the limit $\lambda_{D_k,S} \to 0$. The remaining terms do not involve $\lambda_{D_k,S}$, so these are also both continuous in the limit.
We therefore proceed under the assumption that $\lambda_{D_k,S} > 0$ for all S. Applying Lemma 1 simplifies (62). Note that the only Q not multiplied by a factor with |C| = 0 in (61) have C = ∅, and so $\delta(d_k - 1)$ is always present. Marginalizing over all disease states gives (66). As before, we simplify this using a change of variables and the inclusion–exclusion principle. Changing variables $C \to S_+ \setminus C$, which along with (66) gives (67). Next we apply the inclusion–exclusion principle, giving (68). We can now proceed as before and remove the graph-cut operation on the set Z, using the definition of noisy-OR (2); therefore we obtain (70). Clearly each term for a given X is zero unless $\lambda_{D_k,S} < 1$ for all $S \in X$, and so we can restrict ourselves to $X \subseteq S_+ \cap \mathrm{Ch}(D_k)$. Furthermore, if any $\lambda_{D_k,S} = 0$ for $S \in X$, then the symptom marginal (which is linearly dependent on $\lambda_{D_k,S}$) is 0 (there is zero probability of observing this symptom to be off if $D_k = 1$), and this term in the sum is zero. Therefore we can restrict the sum to $X \subseteq S_+ \cap \mathrm{Ch}(D_k)$.

The table shows the mean position of the true disease for the associative (A) and counterfactual (C, expected sufficiency) algorithms over all 1671 cases. Results are stratified by the rareness of the disease (given the age and gender of the patient). For each disease rareness category, the number of cases N is given, along with the number of cases where the associative algorithm ranked the true disease higher than the counterfactual algorithm (Wins (A)), the number where the counterfactual algorithm ranked the true disease higher (Wins (C)), and the number where the two algorithms ranked the true disease in the same position (Draws), for all cases and for each disease rareness class.
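The inclusion–exclusion step above amounts to a signed enumeration over subsets of the positively evidenced symptoms. A minimal sketch of the generic form of that sum; the helper names and the independent-symptom example are ours, purely for illustration:

```python
from itertools import combinations
from math import prod

def inclusion_exclusion(symptoms, p_all_off):
    """Compute sum over Z subseteq symptoms of (-1)^|Z| * p_all_off(Z):
    the generic inclusion-exclusion form used to rewrite an 'all
    symptoms on' query in terms of 'subset off' marginals."""
    return sum(
        (-1) ** len(Z) * p_all_off(Z)
        for r in range(len(symptoms) + 1)
        for Z in combinations(symptoms, r)
    )

# With independent symptoms, P(all of Z off) is a product of per-symptom
# off-probabilities, and the sum collapses to prod_s (1 - q_s) = 0.7 * 0.5.
q = {"s1": 0.3, "s2": 0.5}
val = inclusion_exclusion(list(q), lambda Z: prod(q[s] for s in Z))
```

In the proof the marginals are not independent products but marginalizations over the disease layer; only the signed-subset structure is shown here.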
In Figure [C], blue circles represent observations and red circles represent interventions. The do-operation severs any directed edges going into D* and fixes D* = 0, as shown in [D] below.
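The do-operation described here is pure graph surgery: delete every incoming edge of the intervened node and clamp its value. A toy sketch, assuming a dictionary-based parent-list representation of the DAG (our own choice, not the paper's code):

```python
def do(parents, node, value, evidence):
    """Perform do(node = value): sever all directed edges into `node`
    (empty its parent set) and fix it to `value`. `parents` maps each
    node to a list of its parents; `evidence` maps nodes to values."""
    cut_parents = dict(parents)
    cut_parents[node] = []          # remove all edges into the intervened node
    new_evidence = dict(evidence)
    new_evidence[node] = value      # clamp the intervened variable
    return cut_parents, new_evidence

# do(D* = 0) on a tiny twin-network fragment R -> D* -> S*
parents = {"R": [], "D*": ["R"], "S*": ["D*"]}
cut, ev = do(parents, "D*", 0, {})
```

Note that the outgoing edge D* → S* is kept: the intervention only removes causes of D*, so its effect on the counterfactual symptom still propagates.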

We prove this identity iteratively. First, consider the function
$$S(B) := \sum_{A \subseteq B} \prod_{a \in A} (1 - a) \prod_{a' \in B \setminus A} a'.$$
Now consider $S(B \cup \{c\})$. This sum can be divided into two sums, one where $c \in A$ and the other where $c \notin A$. Therefore
$$S(B \cup \{c\}) = (1 - c)S(B) + cS(B) = S(B).$$
Since for the empty set $S(\emptyset) = 1$, it follows that $S(B) = 1$ for all finite sets $B$. Next, consider the function
$$G(B) := \sum_{A \subseteq B} |A| \prod_{a \in A} (1 - a) \prod_{a' \in B \setminus A} a',$$
which is the form of the sum we wish to compute in (34). Proceeding as before, we have
$$G(B \cup \{c\}) = cG(B) + (1 - c)G(B) + (1 - c)S(B).$$
Using $S(B) = 1$ we arrive at the recursive formula $G(B \cup \{c\}) = G(B) + (1 - c)$. Starting with $G(\emptyset) = 0$, and building the set $B$ by recursively adding elements $c$, we arrive at the identity
$$G(B) = \sum_{b \in B} (1 - b).$$
For the noisy-OR CPT,
$$P(s = 0 \mid D_{\setminus k}, D_k) = \sum_{u_s} p(u_s)\, P(s = 0 \mid D_{\setminus k}, D_k, u_s) = \sum_{u_{L_S}} P(u_{L_S})\, u_{L_S} \prod_{D_i \in D} \sum_{u_{D_i,S}} P(u_{D_i,S}) \left( u_{D_i,S} \vee \bar{d}_i \right) = P(u_{L_S} = 1) \prod_{D_i \in D} \left[ P(u_{D_i,S} = 1) + P(u_{D_i,S} = 0)\, \bar{d}_i \right] = \lambda_{L_S} \prod_{D_i \in D} \left( \lambda_{D_i,S}\, d_i + \bar{d}_i \right) \quad (49)$$
and similarly for $P(s^* = 0 \mid \wedge_{i \neq k} D_i = d_i, D^*_k = d^*_k)$, where $d_k$ is the instantiation of $D_k$ on the factual graph. The term $\delta(d_k - 1)$ is equivalent to fixing the observation $D_k = 1$ on the factual graph. If $\lambda_{D_k,S} = 0$ then
$$\lambda_{L_S} \left( 1 - \lambda_{D_k,S}^{d_k} \right) \prod_{D_i \in D_{\setminus k}} \lambda_{D_i,S}^{d_i} = \lambda_{L_S} \prod_{D_i \in D_{\setminus k}} \lambda_{D_i,S}^{d_i}\, \delta(d_k - 1). \quad (55)$$
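The two identities established in this proof, $S(B) = 1$ and $G(B) = \sum_{b \in B}(1 - b)$, are easy to check numerically by brute-force enumeration over subsets. A sketch; the function names and the bitmask subset encoding are ours:

```python
def S(B):
    """S(B) = sum over subsets A of B of prod_{a in A}(1-a) * prod_{a' in B\\A} a'.
    Subsets A are encoded as bitmasks over the indices of B."""
    total = 0.0
    for mask in range(1 << len(B)):
        term = 1.0
        for i, b in enumerate(B):
            term *= (1 - b) if (mask >> i) & 1 else b
        total += term
    return total

def G(B):
    """G(B) = sum over subsets A of B of |A| * prod_{a in A}(1-a) * prod_{a' in B\\A} a'."""
    total = 0.0
    for mask in range(1 << len(B)):
        term = bin(mask).count("1")  # |A|
        for i, b in enumerate(B):
            term *= (1 - b) if (mask >> i) & 1 else b
        total += term
    return total

B = [0.2, 0.7, 0.4]
# The proof's identities: S(B) == 1 and G(B) == sum(1 - b for b in B)
```

For S, the sum telescopes because each element contributes a factor $(1 - b) + b = 1$; G then picks up exactly one $(1 - b)$ term per element, matching the recursion $G(B \cup \{c\}) = G(B) + (1 - c)$.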

TABLE II: Results for experiment 1.