Individual treatment effect estimation in the presence of unobserved confounding using proxies: a cohort study in stage III non-small cell lung cancer

Randomized Controlled Trials (RCT) are the gold standard for estimating treatment effects but some important situations in cancer care require treatment effect estimates from observational data. We developed “Proxy based individual treatment effect modeling in cancer” (PROTECT) to estimate treatment effects from observational data when there are unobserved confounders, but proxy measurements of these confounders exist. We identified an unobserved confounder in observational cancer research: overall fitness. Proxy measurements of overall fitness exist like performance score, but the fitness as observed by the treating physician is unavailable for research. PROTECT reconstructs the distribution of the unobserved confounder based on these proxy measurements to estimate the treatment effect. PROTECT was applied to an observational cohort of 504 stage III non-small cell lung cancer (NSCLC) patients, treated with concurrent chemoradiation or sequential chemoradiation. Whereas conventional confounding adjustment methods seemed to overestimate the treatment effect, PROTECT provided credible treatment effect estimates.


A.1 Treatment Effect Estimation with PROTECT
In this section we introduce additional implementation details of the PROTECT method.

A.1.1 Additional assumptions
In addition to the DAG, the PROTECT method relies on two other assumptions that are very common in treatment estimation from observational data. The first assumption is that of overlap. For the X-conditional average treatment effect to exist, when there are confounders L, the joint distributions of X, L of patients with different treatments should fully overlap [10]. The second assumption is the Stable-Unit Treatment Value Assumption [10]. This assumption has two components: consistency meaning that the observed factual outcome under a certain treatment is equal to the outcome that would be observed when intervening to make the same treatment decision; and no interference, meaning that assigning a treatment to a certain patient does not influence the interventional distribution of other patients.

A.1.2 Marginalizing out Tumor Behavior for Estimation
The PROTECT DAG has two latent factors: tumor behavior and patient fitness. Since the latent factors are not observed they need to be marginalized out, both during the model estimation phase and for posterior predictions on new data. As conditioning on the latent factor for tumor behavior is not needed to satisfy the backdoor rule, we can also do inference over the acyclic directed mixed graph (ADMG) [19] that results from marginalizing out tumor behavior, without harming the ability to estimate the conditional treatment effect. ADMGs are a generalization of DAGs that allow for bi-directed arrows, indicating the presence of an unobserved confounder [19]. The AMDG that results from marginalizing out tumor behavior has fewer parameters to estimate and was therefore used for model estimation. See Figure A.1.2 for the resulting ADMG. There are multiple bi-directed arrows in the ADMG as a result of marginalizing out tumor behavior. To model these dependencies, the original ancestral order is used. This means for instance that the outcome will be modeled conditional on stage and not vice-versa, as stage is an ancestor of the outcome in both the original DAG and the ADMG with tumor behavior marginalized out.

A.1.3 Target Estimand
Here we derive a non-parametric expression for estimating the model from the observed data. Let t x ∈ {0, 1} be the treatment indicator where. Let y denote the outcome. Let X and W be vectors of covariates (causes and proxies of fitness respectively, in line with the AMDG in Figure A.1.2). Let X B denote the proxies and causes of tumor behavior from the original ADMG (histology, weight loss and stage IIIA vs IIIB vs IIIC in the case of stage III Non-Small Cell Lung Cancer). We use Pearl's do-operator to indicate intervening on a variable  [17]. We are interested in the conditional average treatment effect (CATE) as the difference in expected survival under the different treatments: CATE(X, W, X B ) = E y|do(t x = 1), X, W, X B − E y|do(t x = 0), X, W, X B (1) The presented AMDG suggests a causal factorization of the conditional distribution in equation (1). We will now derive our target estimand of the joint distribution of the observable variables (y, t x , W ) given the control variables (X, X B ).
Proof. Let y be the outcome, t x the treatment variable, F ∈ R a latent factor, W possibly multidimensional proxy variables for F , and X possibly multidimensional causes for F , X B possibly multidimensional causes and proxies for tumor behavior, then a) by the law of total probability 4 b) by the chain rule of of probability c) by conditional independencies implied by the ADMG In 4 the outcome distribution conditions on all confounders, thereby satisfying the backdoor rule. This implies we can use Rule 2 of the rules of do-calculus and exchange observing t x (as we do in the observational distribution from which our samples are drawn), with intervening on t x (the interventional distribution that is the target of our research) [16].
After estimating the joint distribution, we can calculate the CATE for a patient by conditioning 4 on the observed proxy variables W for both t x = 0 and t x = 1 and marginalizing over F , and then calculating the difference in the expected value of y for t x = 1 and t x = 0. The average treatment effect can be estimated by calculating the mean CATE over all observed patients.

A.1.4 Estimation in a Marginalized DAG
For any application of PROTECT, the number of possible cause and proxy variables of patient fitness may be large. Moreover, different research groups investigating the same application may come up with different sets of cause and proxy variables of fitness. A natural question is whether the estimated treatment effect depends on the chosen set of proxies. Under the assumption that there is a finite number of potential cause and proxy variables of fitness, we will show that omitting some these variables will not necessarily bias the treatment effect estimate. For this we need two things to hold: 1. The DAG with fewer variables is indeed the DAG that is obtained after marginalizing out the unused variables from the full DAG 2. Given the DAG with fewer variables, the conditional average treatment effect is identified from the observed data We defer the discussion of the second requirement to subsection A.1.4. To demonstrate the first requirement we will consider a hypothetical application of PROTECT to a case where there exist three binary proxies of fitness (w 1 , w 2 , w 3 ) and no cause variables for fitness. A research groups tries to estimate the treatment effect but only has measurements of two of the three proxy variables. See Figure A.1.4 for the DAG for this situation.
As the researchers only have samples from the target estimand for this study is CATE(w 1 , w 2 ) instead of CATE(w 1 , w 2 , w 3 ).
If the group had access to the joint distribution p(y, t x , w 1 , w 2 , w 3 ), they could calculate CATE(w 1 , w 2 ) by first summing out w 3 from the joint distribution and then calculating the CATE. The question is whether this will lead to the same estimand as when estimating CATE(w 1 , w 2 ) from samples from p(y, t x , w 1 , w 2 ) directly. We now show that this is indeed the case.
In 1 the back-door requirement is still fulfilled so the treatment effect can be estimated. The average treatment effect is calculated by calculating the expectation of the CATE over the proxy variables. As the CATE(w 1 , w 2 ) can be estimated from samples of p(y, t x , w 1 , w 2 ) and the order of taking the expectation over proxy variables does not matter, the inferred average treatment effect would be the same as first estimating CATE(w 1 , w 2 , w 3 ) and then taking the expectation over all three proxy variables. A similar argument can be made for cause variables of F .

A.1.5 Treatment Effect Estimation
As mentioned in the main text, if the joint distribution of observed variables and the latent confounder can be estimated from the observed data, the treatment effect can be estimated. This is because the back-door adjustment formula can be applied using the estimated distributions of survival given treatment and fitness, and fitness given the proxy variables and cause variables of fitness [16,12], see also equation 4. The crucial question is whether the joint distribution of observed variables and the latent confounder can be correctly estimated from the observed data. Without constraints on the joint distribution, this is not possible. Fortunately, the DAG provides conditional independency constraints on the joint distribution. For instance, the treatment choice is independent of performance score, once the value of fitness is known. However, this is not enough to identify the joint distribution in general and additional assumptions are needed. These assumptions can be provided by the form of error distributions of the observed variables and latent confounder or by functional forms of relationships between these variables.
In PROTECT, parametric forms for the structural equations in the DAG are specified. By assuming parametric models and thus reducing the family of structural causal models under consideration, the question of identification of the treatment effect reduces to whether the parameters for the structural equations can be uniquely estimated from observed data. This is not an easy question to answer in general. If all observed variables and the latent factor followed linear models with Gaussian error distributions, a well-known sufficient condition for uniqueness of the parameters is when there are at least three independent proxies per latent factor [2]. This textbook result of parameter identification is based on comparing the number of unknown parameters in the statistical model with the number of unique entries in the covariance matrix of the observed variables. The latent factor fitness in our DAG has four dependent variables (two proxies, the treatment and overall survival). Treatment and survival also have a direct dependency relationship that would require one additional parameter to be estimated in a linear Gaussian setting. The requirement of estimating this single extra parameter is offset by having one more observed variable, so this model would still be identified. However, many settings will be more complex than standard linear structural equation models, and analytical identification proofs are often intractable to obtain [20]. Instead, empirical checks can be performed to see if there is any evidence of non-uniqueness of the treatment effect estimate given the observed data and modeling assumptions. In a Bayesian parameter estimation setting, this amounts to checking the posterior distribution over parameters for multi-modality.

A.1.6 Model Selection
In PROTECT, the joint distribution of observed variables and unobserved variables is estimated by specifying parametric distributions for all structural equations implied by the DAG. Different parameterizations may lead to different inferences regarding the treatment effect. To reduce the dependence of the treatment effect estimate on these choices, we now introduce a data-driven model selection procedure. We describe it for a general DAG with a latent confounder with proxy variables and causes.
In figure 3 we present a general latent confounder model with two proxies w 1 , w 2 , a treatment variable t and outcome y. Each observed variable has a possibly empty set of control variables x (.) and the sets of control variables are allowed to overlap. Causes of the latent factor F are permitted but are irrelevant to the model selection procedure so they are omitted here. The latent factor F is a cause of the proxies, treatment and outcome, and is the only confounder of t and y. The proxies, treatment and outcome are assumed to be random variables conditional on their parents in the graph. This graph is necessarily an abstraction of complicated clinical and biological processes. In reality, the process that is responsible for treatment decisions and outcomes may be better described with a multi-dimensional latent factor F . Considering this multi-dimensional F , it is likely that there are dimensions of F that are related to treatment but not to the outcome (F t ) and vice-versa (F y ). The dimensions of F that are related to both treatment and outcome are denoted F t,y and constitute the true confounder. See Figure 4. Possible clinical interpretations of these dimensions are: • F t : information that the pre-treatment variables provide that is thought to t y w 1 w 2 x t x w1 x w2 x y F Figure 3: Exeample of a DAG where PROTECT could be applied be relevant to efficacy of treatment (and is used as such to select patients for treatment), but in reality holds no information on treatment efficacy or outcome • F y : information that the pre-treatment variables have on the outcome that is not known to the physician (this could be treatment effect modification that is not known to the physician), or otherwise is not used to make treatment decisions.
• F t,y : information that the pre-treatment variables have on the outcome or treatment efficacy and that is utilized in the treatment decision process To estimate the treatment effect, the modeled latent factorF should contain as much information on F t,y as possible. Both F t and F y are irrelevant for the estimation of the treatment effect in terms of bias.
When the parameterization of the model has limited expressiveness (e.g.F is a one-dimensional latent variable), it is not guaranteed that the model parameters that maximize the joint likelihood of the observed data will learn anF that has information about F t,y . For example, when the mutual information between the pre-treatment variables and treatment is much higher than the mutual information between pre-treatment variables and outcome, it could be that a model with a single dimensionalF will learnF = F t . Indeed, this is not an unrealistic scenario. The treatment selection process is mostly determined by the pre-treatment variables. Outcomes like overall survival have much higher intrinsic randomness given these pre-treatment variables as they depend on 1. the true biological state of the patient and the tumor that may never be fully known and 2. random events in the future that have not occurred at the time of the treatment choice. In the following paragraph we introduce a set of model checks based on observed data to evaluate whether a given parameterization of the graphical model in the DAG is compatible with the assumptions laid out in the DAG and the requirement of modelingF = F t,y Hyperparameter Selection When there are multiple choices for parameterizing the joint model summarized in hyperparameter ψ (e.g. number of dimensions forF , parameterization choices for any of the conditional distributions, including priors for parameters), each value of ψ will imply a posterior distribution p post,ψ (θ) = p ψ (θ|D train ) over global parameters θ after conditioning on training data D train . To infer if a chosen hyperparameter ψ is compatible with the assumptions on the data generating mechanism in the DAG, a number necessary constraints implied by the DAG can be checked on held-out data D test . Let p ψ (y (i) |t (i) , w (i) , x (i) , θ) be the predictive density for the outcome of observation i, conditional on treatment, proxies and control variables for the outcome, of the model with hyperparameter ψ and global parameter value θ. Let l ψ (y (i) |t (i) , w (i) , x (i) ) be the log point-wise predictive density for observation i from the test data, given hyperparameter value ψ: Here the expectation over θ can be approximated e.g. by using monte-carlo samples of θ from x w2 x y Figure 4: A more complex true data generating process may underlie the observed data which is modeled as in Figure 3 F t is a latent factor that only influences the proxy variables and the treatment assignment. F y is a latent factor that only influences the proxy variables and the outcome. F t,y is a latent factor that influences the proxy variables and both the treatment and the outcome. F t,y is the only confounder.
the posterior distribution. Let L ψ (y (i) |t (i) , w (i) , x (i) ) be the expectation of the log point-wise predictive likelihood (elpd) [22] over the held-out data, and let the elpd for different conditional distributions be similarly defined, then for t, y and proxies w j the following minimal criteria should hold: In words this means that for each observed variable, the predictive likelihood of the model with the latent variable that conditions on the other observed variables should be greater than a corresponding baseline model that conditions only on the direct parents in the DAG, excluding the latent factor. To calculate the elpd for the baseline models, a separate regression model was formulated for each target using the same linear formula, outcome distribution and priors as in the DAG model, but without the latent factor. If the predictive likelihood of the model with latent factor is higher than the baseline regression model without the latent factor, then the latent factor effectively transmits information between observed variables in the model indexed by ψ. However, this still does not guarantee that the inferredF is the actual confounder F t,y . If we select a hyperparameter ψ = ψ t that leads toF = F t then equations 5 may still be satisfied. We can formulate more strict requirements. If we mutilate the DAG in Figure 4 by removing F t,y and F y and their outgoing arrows, the following equality can be shown by applying the conditional independence of F t |t of y|t implied by the mutilated DAG: Equivalently, if we select a hyperparameter ψ = ψ y that leads toF = F y then L ψy (w j |y, t, w −j , x wj ) = L ψy (w j |y, w −j , x wj ). Note that this equality does not hold if either F t,y , or the combination of F t and F y are in the DAG. We can now define a necessary condition for theF , estimated from the observed data, not to be independent of F t,y . For all w j : Hyperparameter settings for which one of these conditions does not hold for one of the proxies should be rejected. Note that this does not rule out the possibility thatF is some function of F t and F y and is still not related to F t,y .

A.2 Methods for stage III NSCLC application
In this subsection more implementation details for the application of PROTECT to stage III Non-Small Cell Lung cancer (NSCLC) are provided.

A.2.1 Missing Data
Proxies of fitness Values of proxies of fitness are assumed to be missing at random conditional on the observed variables. We further assumed independent priors for the proxy missingness mechanism. Together with the missing at random assumption, this makes the missingness model ignorable, meaning that it does not contribute to the estimation of the other parameters. We therefore did not model missingness in proxies. Since we model the joint distribution of proxies, treatment and survival, marginalization of the missing values of proxy variables is trivial by not adding terms to the likelihood for values that are not observed.
Weight loss Roughly 15% of the values for weight loss were missing. Though weight loss is a known prognostic factor and is a standard part of the pretreatment history taking, the physician may sometimes forget to ask about it, or forget to note it down in the electronic health record. It is likely that high values of weight loss have a higher probability of being registered. A patient with high weight loss may self report it and the physician is more likely to report it, as it is more notable. This renders the missingness mechanism for weight loss "Missing Not at Random". When the missingness in a covariate is dependent on the value of the covariate (but not on the outcome), estimating parameters of a regression model for the outcome using the complete cases is not biased by the missingness, while imputation may lead to bias in the estimation of the treatment effect [23]. A problem with the complete case method here is that allthough the conditional treatment effect may be validly estimated, the average treatment effect may be different in the population of interest than in the complete case population, as the populations may differ with respect to the distribution of weight loss. If the effect of treatment depends on weight loss (according to our ADMG, either through the effect of tumor behavior on survival, or through the effect of tumor behavior on fitness), the average treatment effect estimated from the complete cases with respect to weight loss is a biased estimator for the average treatment effect in the target population. Through a sensitivity analysis presented in section B.2.2 we assess what the effect of this missingness on the estimated average treatment effect may be.

A.2.2 Pre-processing, Parametric Models and Priors
In this subsection we describe details on the data pre-processing, the parametric forms of each conditional distribution and the priors for all parameters. The Gaussian distribution parameterized with mean µ and standard deviation σ is denoted N (µ, σ), the Half-Normal distribution with standard deviation σ is denoted HN(σ), the Bernoulli distribution with probability p is denoted as Bern(p), the Uniform distribution over values between a and b is denoted U(a, b) Covariate pre-processing In order to be able to use the same distribution to model each of the proxies, proxy variables were binarized. Dichotomization of variables is not generally recommended due to the loss of information. However, as our primary goal is to identify the conditional average treatment effect, we expect the potential loss of information from the dichotomization to be small compared to the extra modeling challenges when dealing with proxies with different outcome distributions. Our exact variable definitions were: • ECOG performance score: 1 or higher vs 0 (higher means worse overall health according to the ECOG standard [15]) • eGFR: less (or greater) than 60 milliliter / minute / 1.73m 2 The first value (1 for ECOG, and less than 60 for eGFR) was dummy coded as 0, and the second value was dummy coded as 1. There were only two patients with stage IIIC, therefore they were grouped with IIIB patients. Histology types were grouped as Adenocarcinoma, Squamous Cell Carcinoma or other, missing values were classified as other. Weight loss was defined as any weight loss of ≥ 3% of the original weight in the 6 months preceding the start of follow-up. Age was scaled to zero mean and standard deviation 1. Details on the definition of the treatment variable and outcome are given in the main text.
Survival The Cox-proportional hazards model is characterized by a partial likelihood that leaves the baseline hazard unspecified. Performing Bayesian inference with a proportional hazards model will require specifying a likelihood for the baseline hazard. We chose a parametric survival model, using the Adaptive Power Generalized Weibull (APGW) distribution as described in [4]. The APGW has two parameters that control the shape of the baseline hazard function. With these two parameters, the APGW can model a wide range of baseline hazard function shapes. The APGW accommodates non-proportional hazards effects by letting one or more of the shape parameters depend on covariates. It can be parameterized as an accelerated failure-time model or as a (proportional) hazards model. Using the proportional hazards formulation of the APGW makes it very similar to Cox-proportional hazards regression, but with a parametric baseline hazard function. The β coefficients in this version have the same interpretation as the parameters in the familiar Cox-proportional hazards regression, namely log hazard ratios. The reference value for the treatment effect is presented as a hazard ratio [1]. This value assumes a proportional hazard model for overall survival. Therefore, we chose to parameterize the outcome model similarly using a linear model for the log-hazard ratio in the proportional hazards formulation of the APGW. Potential treatment effect modification for variables was accommodated by adding parameters for product terms of the variable and the treatment. The APGW was parameterized as follows: β Xy→y , β tx→y ∼ N (0, 2.5) β Xy * tx→y , β F * tx→y ∼ N (0, 0.1) Here, t x = 1 is concurrent chemoradiation, F is the inferred latent fitness, X y are the other direct parents of survival in the ADMG, σ F →y is a hyperparameter that was determined per the model selection method in subsection A.1.6. Using this parameterization, we can now express the CATE (equation 1) for patient i in terms of a log hazard ratio β: = β tx→y + F (i) β F * tx→y + X (i) y β Xy * tx→y The average treatment effect (ATE) is estimated with the mean CATE over the observed population.

Latent Factor
The latent factor F is modeled using a conditional Gaussian distribution with fixed standard deviation 1, followed by the logistic function (also known as the sigmoid function, denoted σ). The logistic function was used to fix the overall scale of the latent factor to prevent compensatory scaling of F when adjusting hyperparameters that control the magnitude of the effect of F on t x or y.
In the conditional mean function for F , the control variables X F were precentered to have zero mean.
Proxies We parameterize the proxies as conditionally independent Bernoulli distributed random variables. The conditional probability was implemented using the logistic link function. For each proxy w j : The prior HN (2.5) implements the assumption that a higher value of latent fitness leads on average to better values for the proxy variables of fitness.
Treatment The treatment indicator was modeled as a Bernoulli distributed random variable, with a linear model and the logistic link function, similar to the proxy variables.
X tx are all observed variables that are direct parents of treatment in the ADMG, except for F . σ F →tx is a hyperparameter that was determined per the model selection method in subsection A.1.6.

Structural Causal Model
The entire generative model can now be formalized in a single structural causal model. Let X (.) denote a N d (.) design matrix for N patients with d (.) control variables for different control variable sets, β (.) global parameters, APGW(.) the 4-parameter Adapted Power Generalized Weibull distribution [4], w j binary proxies, t x treatment variable, y the survival outcome consisting of positive real number indicating time, and a binary indicator for deceased or censored.
β y = F β F →y + X y β Xy→y + t x (β tx→y + F β F * tx→y + Xβ X * tx→y ) + β 0 y ∼ APGW(0, β y , α 0 , ν 0 ) Effect Modifiers Predicting treatment effects for new patients requires the estimation of the conditional average treatment effect (CATE). In the case of a binary treatment variable t x and covariates x, w, the CATE is defined as: To evaluate whether some variables are effect modifiers (i.e. the treatment effect differs between different values of this variable), we employ the definition of a conditional effect modifier by VanderWeele [21]. For covariates x, w and treatment t x , w is said to be an effect modifier of t x conditional on x if for some value of x there exist two values w 1 = w 2 of w for which CATE(x, w 1 ) = CATE(x, w 2 ). Note that the estimation of a treatment effect modifier is dependent on the scale on which the CATE is measured. For a linear log-hazard ratio model (e.g. the Cox proportional hazards model or the proportional hazards APGW), this reduces to the familiar statistical interaction term β w * tx : Filling in equation 21 in 20 will lead to the familiar result that the conditional average treatment effect is the average treatment effect plus wβ w * tx . This treatment interaction term quantifies the difference in treatment effects between w = 0 and w = 1. Due to the linearity of this model in 21, the effect modification of w is the same for all values of x. In our log-hazard model formulation, effect modification is similarly modeled by adding product terms between treatment and control variables, but also between treatment and the latent factor F . Since the conditional distribution of p(F |X, W ) is non-linear in X and W , effect modification by proxies W and control variables X can no longer be equated to the interaction terms in the linear log-hazard model, and the effect modification of some variable x i may depend on the value of other control variables and proxy variables.
We can estimate potential effect modification by plugging our log-hazard model β(t x , F, X y ) defined in 18 in equation 20: This will estimate the log-hazard ratio between concurrent and sequential treatment as a function of X and W . We have omitted the decency of these quantities on the global parameters θ here. As said before, the effect modification of a variable can depend on the value of the other variables due to the nonlinearities in our model. To summarize the average effect modification for a unit change in a variable we use partial dependence functions [8]. Specifically we look at the average change in CATE for a unit difference in a variable x j , averaged over the marginal distribution of other variables x −j .
Where the expectation is taken over the observed data.

A.2.3 Model selection
As the treatment variable and the outcome variable follow different distributions there is no way to express the parameters of these models on a single scale. Specifically, the log odds ratio of fitness to treatment (β F →tx ) is incommensurable with the log hazard ratio of fitness to survival (β F →y ). This creates a problem with specifying the right scale for the prior distributions of these two parameters. If the scale of the priors for one of the parameters is greater than the other, parameters that maximize the joint likelihoood can be biased towards modeling the variable with the prior with the greater scale. Therefore, the prior standard deviations on the parameters from F to y and from F to t x were treated as hyperparameters. Values for the prior standard deviations were {0.1, 1.0, 2.5, 10.0, 100.0}, resulting in a grid of 125 hyperparameter combinations. We used 5-fold cross validation to select hyperparameters that were not refuted by the assumptions in the DAG as outlined in A.1.6. In a final inference step we follow a Bayesian Model Averaging approach by formulating a mixture model over all acceptable hyperparameter settings ψ j .
Where each ψ j corresponds to an acceptable setting of the hyperparameters. This mixture model was evaluated once on the full dataset where global parameters and model weights were inferred jointly, resulting in our final model estimate.

A.2.4 Implementation
All probabilistic models were implemented using the open source probabilistic programming language NumPyro [18], version 0.4.1 with JAX [3] version 0.2.7 as a back-end. We used the No-U-Turn Hamiltonian Monte Carlo sampling algorithm to simulate 7500 samples per chain (following 2500 warm-up samples) in 4 independent chains from the posterior distribution for each hyperparameter setting. For posterior predictive densities on held-out data we numerically integrated the densities with respect to F by using a fixed grid of points for F , the noise term ofF . Between −5.0 and 5.0, 250 equally spaced values for F were used. From the original samples over global parameters, 1500 samples equally divided over the chains were used. The samples of global parameters were selected by slicing the original chain to reduce auto-correlation between the samples. After applying the model selection procedure we estimated the final model using 12 independent chains with 7500 samples per chain (following 2500 warm-up samples). Model evaluation was performed in R, version 4.0.3. The code that implements the statistical models will be made freely accessible online. Due to privacy regulations the clinical data cannot be made available. We performed a sensitivity analysis to asses the robustness of our inferences with respect to a potential unobserved confounder. Specifically, we test the effect of an unobserved confounder U on the point estimate of the average treatment effect (AT E). The latent factor U is assumed to be a direct cause of both treatment and survival, independent of the latent factor fitness, see Figure B.1.1.
Assuming a true data generating mechanism (under the sensitivity parameters γ U →y , γ U →tx ): β y = β 0 + F β F →y + X y β Xy→y + t x (β tx→y + F β F * tx→y + X y β X * tx→y ) + γ U →y U Where β y is the log-hazard ratio. Whereas we estimated the treatment effect using the equivalent model for survival but omitting the term from U to y.
As in this hypothetical setting of the sensitivity analysis we did not condition on all confounders when |γ U →tx | > 0 and |γ U →y | > 0, the estimated treatment effect is a biased estimate of the true treatment effect. For different settings of sensitivity parameters γ = {γ U →tx , γ U →y } we re-estimated the posterior distribution over global parameters, this time including U as an additional latent factor with fixed coefficients to treatment and survival, as specified by γ. In order to be able to clinically reason about the likelihood of the existence of an omitted confounder U that would be strong enough to alter the conclusions of our modeling, we need to be able to compare the effects of this hypothetical unobserved confounder on treatment ( γ U →tx ) and survival ( γ U →y ), relative to the effects of the modeled confounder for fitness F ( β F →tx , β F →y ). To make sure that the comparison of parameters was valid we fixed the standard deviation of U to 1, and we re-scaled the parameters β F →tx , β F →y to the values Figure 6: Results of sensitivity analysis. For different combinations of sensitivity parameters γ U →y (y-axis) and γ U →tx (x-axis) we re-estimated the model, with the additional latent variable U . Results for the estimated average treatment effect are indicated by the level lines, marked by the point estimate of the hazard ratio for overall survival of concurrent versus sequential chemoradiation. The black line indicates the null-effect (hazard ratio = 1). The red dotted line indicates the treatment effect reported in the meta-analysis of randomized controlled trials (hazard ratio = 0.84) [1]. Four reference points are added to gauge how the sensitivity parameters relate to the parameters of the estimated latent confounder F to treatment and survival. The reference point 1 * F corresponds to the value ofβ F →y andβ F →tx in the original model. 0.5 * F is the point where these coefficients are both divided by two. −1 * F is the point where the sign ofβ F →y is changed, and −0.5 * F is defined analogous to 0.5 * F they would have if F were also scaled to have a standard deviation of 1. We investigate the effect of potential unobserved confounding on the treatment effect estimate for all combinations of γ U →y ∈ {−1, −0.5, 0, 0.5, 1}β F →y and γ U →tx ∈ {0, 0.5, 1}β F →tx .

B.1.2 Results
The results of the sensitivity analysis are presented in Figure 6. If there were a confounder that was more than half as strong as the modeled latent confounder F but opposite in sign of survival, the point estimate of the AT E would be greater than the estimate from the RCTs [1]. The interpretation of this latent confounder is that it increases the likelihood of being treated, but reduces the likelihood of survival. A clinical example may be that patients who have more aggressive tumors are treated more aggressively in order to improve survival outcomes.

B.2 Missing Weight Loss
We now describe the sensitivity analysis for the effect of different missingness mechansisms for weight loss on the estimate of the average treatment effect.

B.2.1 Methods
Let M denote the missingness indicator for weight loss X, then p(M = 1|X = x) denotes the probability of missingness, conditional on the value of X. The central assumption of this sensitivity analysis is that the missingness in weight loss is random conditional on the value of weight loss itself. Specifically, that the probability of missingness is lower when weight loss is present.
Let π obs := p(X = 1|M = 0), the probability of weight loss in the observed cases and π * := p(X = 1) denote the unknown true marginal probability of weight loss. Using Bayes rule, we can express π obs in terms of π * , p(M = 1) and p(M = 0|X = 1): Now we can write: Since we cannot estimate p(M = 0|X = 1) from the observed data, we will substitute it with the p(M = 0|X = 1, α) using equation 27, conditional on the sensitivity parameter α.
Now we have that given the observed data and the sensitivity parameter α, the unobserved quantity of interest is identified, so we can proceed with the sensitivity analysis.

Re-weighting by weight loss
For different values of α we assign weights to patients based on their value for weight loss.
By assigning patients with weight loss a higher weight than patients without weight loss, we can artificially create a population with a greater prevalence of weight loss. Specifically, for each α we define weights such that the weighted sum of the observed values of weight loss is equal to π * α .
By further requiring that the sum of weights is equal to the total number of patients, the weights are now uniquely defined as: Using these weights, for each α we calculate a new average treatment effect by taking the weighted mean of the conditional treatment effects:

B.2.2 Results
The results of this sensitivity analysis are presented in Table B.2.2 and Figure  B.2.2. The ATE shows a minor shift in the direction of concurrent being more effective when the dependency of missingness on the actual value of weight loss becomes stronger. This is explained by the fact that patients with weight loss are estimated to have less benefit of concurrent treatment. If missingness were always observed if it was present, this means that the true average value of weight loss is lower in the population than when weight loss is missing completely at random. Results of sensitivity analysis to missing data in weight loss. α: sensitivity parameter, M missingness indicator, X indicator for presence of weight loss, π * α true marginal probability of weight loss under sensitivity parameter α, w 1 weight for patients with weight loss, w 0 weight for patients without weight loss, hazard ratio: point estimate of the average treatment effect of concurrent versus sequential chemoradiation for overall survival. CI: 95% credible interval

C.1 Methods
A central assumption in our study is that the average treatment effect reported in randomized trials is not directly transportable to our population due to differences between the patient populations. We assume however that the conditional treatment effect is transportable. This assumption implies that a patient in our population has the same benefit and harms of treatment as a similar patient in the randomized trials has. Similar here means that they have the same values for the covariates, latent tumor behavior and latent fitness. Using our estimate of the conditional treatment effect we can extrapolate our results to calculate what the estimated average treatment effect would be if our population were more alike the population from the randomized trials.
Matching the RCT Population Average descriptive statistics on the study population of the RCTs (e.g. the mean age) are available from [1]. However, matching our population to theirs is not possible based on these criteria alone. In addition to the published inclusion and exclusion criteria for each RCT included in [1] we should expect hidden confounding between inclusion in the trial and overall survival [7]. Patients included in the randomized trials should be able to undergo all treatment arms. This means that patients who were deemed clinically unfit for concurrent treatment can be expected to be underrepresented in the RCT population. The assessment of this fitness for inclusion in the RCT is essentially the same as the assessment of fitness for concurrent treatment in 25 Tx Y W F RCT Figure 8: Directed Acyclic Graph of inclusion in RCTs. The same (unobserved) confounding that influences treatment decisions in real-world practice will also influence inclusion in the RCTs. In addition to the unobserved confounder fitness, the concrete exclusion criteria based on proxy measurements (e.g. performance score) also determine inclusion in the RCT our real-world population, with the addition of strict inclusion criteria from the trials. See Figure 8 for a DAG that depicts this mechanism. Using the predicted probability of concurrent treatment from the model, we can restrict our population to a subpopulation with higher fitness using different cut-offs for the predicted probability of concurrent treatment.

C.2 Results
We found that the estimated Average Treatment Effect increased in favor of concurrent treatment when restricting the population to higher predicted probability of concurrent treatment, see Figure 9. When trying to match the RCT by restricting the population to those with a higher predicted probability of concurrent treatment, the estimated average treatment effect becomes closer to that of the RCTs [1]. Added is a linear regression fit line with 95% prediction confidence band. Reference lines are added for the null-effect (hazard ratio 1) and the effect reported in the randomized trials (hazard ratio 0.84) [1].

D.1 Cohort
Patients were recruited from 9 different hospitals in the Utrecht region of the Netherlands. Specifically, the University Medical Center Utrecht, the Antonius Hospital, locations Nieuwegein and Woerden, the Meander Medical Center Amersfoort, the Diakonessen Hospital, Utrecht, the Beatrix Hospital, Gorinchem, the Haaglanden Medical Center, The Hague, the Amsterdam Medical Center, Amsterdam, the Gelderse Vallei Hospital, Ede, and the Antoni van Leeuwenhoek Hospital. These hospitals represent a rich ensemble of university hospitals, specialized oncological hospitals and both large and small secondary care institutions.
All patients were referred to the University Medical Center Utrecht for radiotherapy, but their care was coordinated by the physician in the referring hospital.
Baseline data of our cohort and that of the published meta-analysis of randomized clinical trials [1] are presented in Table 1. Percentages were calculated on the available data. Note that our cohort is older, has worse performance scores and is less frequently treated with concurrent therapy.    example 1: laryngeal carcinoma For locally advanced unresectable squamous cell laryngeal carcinoma, concurrent chemoradiation is recommended. For patients over 70 years of age or who are unfit for concurrent treatment, only radiotherapy is recommended [13]. RCTs comparing concurrent chemoradiation with radiotherapy alone will be conducted in populations where concurrent chemoradiation is a feasible treatment. Therefore, the RCTs provide no evidence on the treatment effect of concurrent treatment for older and weaker patients. In real-world clinical practice, older and weaker patients may receive the concurrent treatment for example when they have a strong preference for this treatment.

D.2 Supplemental
Observational studies comparing both treatments will have to deal with the same unobserved confounder as in the NSCLC case: the overall fitness of the patient.
To estimate the treatment effect in the older and weaker population, PROTECT can be used to estimate the treatment effect from observational data.
example 2: esophageal cancer The second example concerns stage III squamous cell esophageal cancer. For these patients, neoadjuvant chemoradiation followed by esophagectomy is considered the most effective treatment while definitive chemoradiation without surgery is recommended for patients who are unfit for surgery or refuse surgery [11]. Whether chemoradiation followed by surgery should be preferred for every patient who is fit enough for surgery remains unanswered, as all RCTs are conducted with only patients who are fit for surgery and receive some form of surgical treatment [9,5], or with patients who are unfit for surgery and never receive surgical treatment [14,6]. In real-world clinical practice there will be patients who are deemed fit enough for surgery but refuse the surgery due to reasons not related to their prognosis, e.g. personal preference. Retrospective comparisons of chemoradiation with surgery versus chemoradiation alone will arguably be biased due to the same unobserved confounding as in the case of stage III NSCLC: the overall fitness of the patient, this time specifically fitness for surgical treatment. To answer the question whether surgery should be preferred for every patient who is fit enough, the PROTECT method could be applied to to observational data to mitigate the confounding bias.