G-computation, propensity score-based methods, and targeted maximum likelihood estimator for causal inference with different covariates sets: a comparative simulation study

Controlling for confounding bias is crucial in causal inference. Distinct methods are currently employed to mitigate the effects of confounding bias. Each requires the introduction of a set of covariates, which remains difficult to choose, especially regarding the different methods. We conduct a simulation study to compare the relative performance results obtained by using four different sets of covariates (those causing the outcome, those causing the treatment allocation, those causing both the outcome and the treatment allocation, and all the covariates) and four methods: g-computation, inverse probability of treatment weighting, full matching and targeted maximum likelihood estimator. Our simulations are in the context of a binary treatment, a binary outcome and baseline confounders. The simulations suggest that considering all the covariates causing the outcome led to the lowest bias and variance, particularly for g-computation. The consideration of all the covariates did not decrease the bias but significantly reduced the power. We apply these methods to two real-world examples that have clinical relevance, thereby illustrating the real-world importance of using these methods. We propose an R package RISCA to encourage the use of g-computation in causal inference.

methods have been proposed to estimate marginal causal effects in observational studies, amongst which propensity score (PS)-based methods are increasingly used in epidemiology and medical research 3 .
Propensity score-based methods make use of the PS in four different ways to account for confounding, namely matching, stratification, conditional adjustment 4 and inverse probability of treatment weighting (IPTW) 5 . Stratification and conditional adjustment on PS are associated with the highest bias [6][7][8] , because the two methods estimate the conditional treatment effect rather than the marginal causal effect. Matching on PS remains the most common approach with a usage rate of 83.8% in 303 surgical studies using PS-based methods 9 and 68.9% in 296 medical studies (without restriction regarding the field) also using PS-methods 10 . The IPTW appears to be less biased and associated with a lower variance than matching in several studies 8,[11][12][13][14] . Nevertheless, in particular settings, full matching (FM) was associated with lower mean square error (MSE) in other studies [15][16][17] .
Multivariable models, even non-linear ones, can also be used to indirectly estimate the marginal causal effect with g-computation (GC) 18 . This method is also called the parametric g-formula 1 or (g-)standardisation 19 in the literature. Snowden et al. 20 and Wang et al. 21 detailed the corresponding methodology for estimating the average treatment (i.e., marginal causal) effect on the entire population (ATE) or only on the treated (ATT), respectively. The ATE is the average effect, at the population level, of moving an entire population from untreated to treated. The ATT is the average effect of treatment on those subjects who ultimately received the treatment 22 . Furthermore, some authors 23,24 have proposed combinations of GC and PS to improve the estimation of the marginal causal effect. These methods are known as doubly robust estimators (DRE) because they require the specification of both the outcome (for GC) and treatment allocation (for PS) mechanisms to minimise the impact of model misspecification. Indeed, these estimators are consistent as long as either the outcome model or the treatment model is estimated correctly 25 .
Each of these methods carries out the adjustment in different ways, but all of these methods rely on the same condition: a correct specification of the PS or the outcome model 1 . In practice, a common issue is choosing the set of covariates to include to obtain the best performance in terms of bias and precision. Three simulation studies 7,26,27 have investigated this issue for PS-based methods. They studied four sets of covariates: those causing the outcome, those causing the treatment allocation, those are a common cause of both the treatment allocation and the outcome, and all the covariates. For the rest of this paper, we called these strategies the outcome set, the treatment set, the common set and the entire set, respectively. These studies argued in favour of the outcome or common sets for PS-based methods, but it is not immediately clear that such works will generalise to other methods of causal inference. Brookhart et al. 26 and Lefebvre et al. 27 focused on count and continuous outcomes. Austin et al. 7 investigated binary outcomes on matching, stratification and adjustment on PS. However, GC and DRE also require the correct specification of the outcome model with a potentially different set of covariates. Recent works have shown that efficiency losses can accompany the inclusion of unnecessary covariates [28][29][30][31] . De Luna et al. 32 also highlighted the variance inflation caused by the treatment set. In contrast, VanderWeele and Shpitser 33 suggested the inclusion of both the outcome and the treatment sets.
Before selecting the set of covariates, one needs to select the method to employ. Several studies have compared the performances of GC, PS-based methods and DRE in a point treatment study to estimate the ATE 13,23,25,[34][35][36] . Half of these studies investigated a binary outcome 13,25,34 . Only Colson et al. 17 studied the ATT, but they focused on a continuous outcome. Except in Neugebauer and van der Laan 25 , these studies only investigated the ATE (or ATT) defined as a risk difference. The CONSORT recommended the presentation of both the absolute and the relative effect sizes for a binary outcome, "as neither the relative measure nor the absolute measure alone gives a complete picture of the effect and its implications" 37 . None of these studies was interested in the set of covariates necessary to obtain the best performance.
In our study, we sought to compare different sets of covariates to consider to estimate a marginal causal effect. Moreover, we compared GC, PS-based methods and DRE for both the ATE and ATT, either in terms of risk difference or marginal causal OR. Three main types of outcome are used in epidemiology and medical research: continuous, binary and time-to-event outcomes. We focused on a binary outcome because i) a continuous outcome is often appealing for linear regression where the two conditional and marginal causal effects are collapsible 38 , and ii) time-to-event analyses present additional methodological difficulties, such as the time-dependant covariate distribution 39 . We also limit our study to a binary treatment, as in the current literature, and the extension to three or more modalities is beyond the scope of our study.
The paper is structured as follows. In the next section, the methods are detailed. The third section presents the design and results of the simulations. In the fourth section, we consider two real data sets. Finally, we discuss our results in the last section.

Methods
Setting and notations. Let A denote the binary treatment of interest ( = A 1 for treated patients and 0 otherwise), Y denote the binary outcome ( = Y 1 for events and 0 otherwise), and L denote a set of baseline covariates. Consider a sample of size n in which one can observe the realisations of these random variables: a, y, and l, respec- as the expected proportions of event if the entire (ATE) or the treated (ATT) populations were treated ( = do A ( 1)) or untreated ( = do A ( 0)), respectively 40 . From these probabilities, the risk difference can be estimated as π π π ∆ = − 1 0 and the log of the marginal causal OR estimated as θ π π = logit( )/logit( ) 1 0 , where logit(•) = log(•/(1 − •)). The methods described bellow allow for the estimation of both the ATE and the ATT effects.
Causal inference requires the three following assumptions, called identifiability conditions: i) The values of exposure under comparisons correspond to well-defined interventions that, in turn, correspond to the versions of treatment in the data. ii) The conditional probability of receiving every value of treatment, though not decided by the investigators, depends only on the measured covariates. iii) The conditional probability of receiving every value of the treatment is greater than zero, i.e., is positive. These assumptions are known as consistency, (conditional) exchangeability and positivity, respectively 1 . However, PS-based methods rely on treatment allocation modelling to obtain a pseudo-population in which the confounders are balanced across treatment groups. Covariate balance can be checked by computing the standardised difference of the covariates included in the PS between the two treatment groups 10 . In contrast, GC relies on outcome modelling to predict hypothetical outcomes for each subject under each treatment regimen. Note that one can ignore the lack of positivity if one is willing to rely on Q-model extrapolation 1 . As is the case for standard regression models, these methods also require the assumptions of no interference, no measurement error and no model misspecification.
Weighting on the inverse of the propensity score. Formally, the PS is = the probability that subject i ( = … i n 1, , ) will be treated according to his or her characteristics L i at the time of the treatment allocation 4 . It is often estimated using a logistic regression. The IPTW makes it possible to reduce confounding by correcting the contribution of each subject i by a weight ω i . For ATE, Xu et al. 41 The use of stabilised weights has been shown to produce a suitable estimate of the variance even when there are subjects with extremely large weights 5,41 . For ATT, Morgan and Based on ω i , the following weighted univariate logistic regression can be fitted: To obtain θ var( ), we used a robust sandwich-type variance estimator 5 with the R package sandwich 43 . full Matching on the propensity score. The FM minimises the average within-stratum differences in the PS between treated and untreated subjects 16 . Then, two weighting systems can be applied in each stratum, making it possible to estimate either the ATE or the ATT unlike other matching methods which can only estimate the ATT 44 . If t and u denote the number of treated and untreated subjects in a given stratum, one can define the weight for a subject i in this stratum as ω = = 16 . In the latter case, the weights of untreated subjects are rescaled such that the sum of the untreated weights across all the matched sets is equal to the number of untreated subjects: 45 . From the resulting paired data set, we fitted a weighted univariate logistic regression, and the rest of the data analysis is tantamount to IPTW. We used the R package MatchIt 45 to generate the pairs.

G-computation. Consider the following multivariable logistic regression
This regression is frequently called the Q-model 20 . Once fitted, one can compute for all subjects the two expected probabilities of events if they were treated or untreated 20 . For ATE, one can then The same procedure can be performed amongst the treated patients for ATT 21 . For implementation in practice, consider a treated subject ( = A 1 i ) included in the fit of the Q-model. Thanks to this model, one can then compute for this subject his or her predicted probabilities of the event if he or she received the treatment ( . Computing these predicted probabilities for all the subjects, one can obtain two vectors of probabilities if the entire sample were treated or not. The corresponding means correspond to π 1 and π 0 , respectively. We obtained θ var( ) by simulating the parameters of the multivariable logistic regression assuming a multinormal distribution 46 . Note that we could have used bootstrap resampling instead. However, regarding the computational burden of bootstrapping and the similar results obtained by Aalen et al. 46 , the variance estimates in the simulation study were only based on parametric simulations. We used both bootstrap resampling and parametric simulations in the applications.
targeted Maximum Likelihood estimator. Amongst the several existing DREs, we focused on the targeted maximum likelihood estimator (TMLE) 24 , for which estimators of ATE and ATT have been proposed 47 . The TMLE begins by fitting the Q-model to estimate the two expected hypothetical probabilities of events π 1 and π 0 . An additional "targeting" step involves estimation of the treatment allocation mechanism, i.e., the PS i i , which is then used to update the initial estimates obtained by GC. In the presence of residual confounding, the PS provides additional information to improve the initial estimates. Finally, the updated estimates of π 1 and π 0 are used to generate π ∆  or θˆ. We used the efficient influence curve to obtain standard errors 47,48 . A recent tutorial provides a step-by-step guided implementation of TMLE 49 .

Simulation study
Design. We used a close data generating procedure from previous studies on PS models 7,50 . We generated the data in three steps. i) Nine covariates (L 1 , …, L 9 ) were independently simulated from a Bernoulli distribution with a parameter equal to 0.5 for all covariates. ii) We generated the treatment A according to a Bernoulli distribution with a probability obtained by the logistic model with the following linear predictor: We fixed the parameter γ 0 at −3.3 or −5.2 to obtain a percentage of treated patients equal to 50% for scenarios related to ATE and 20% for ATT, respectively. iii) We simulated the event Y using a Bernoulli distribution with a probability obtained by the logistic model with the following linear predictor: 10 9 . We set the parameter β 1 for a conditional OR at 0 (the null hypothesis is no treatment effect) or 2 (the alternative hypothesis is a negative impact of treatment). We also fixed the parameter β 0 at −3.65 and −3.5 to obtain a percentage of the event close to 50% in ATE and ATT, respectively. Figure 1 presents the values of the regression coefficients γ 1 to γ 9 and β 1 to β 10 . We considered four covariates sets as explained in the introduction: the outcome set included the covariates L 1 to L 6 , the treatment set included the covariates L L L L L L , , , 8 , the common set included the covariates L L L L , , , 5 , and the entire set included the covariates L 1 to L 9 . For each of the four methods and the four covariate sets, we studied the performance under different sample sizes: = n 100, 300, 500 and 2000. For each scenario, we randomly generated 10 000 data sets. We computed the theoretical values of π 1 and π 0 by averaging the values of π 1 and π 0 obtained from univariate logistic models (treatment as the only covariate) fitted from data sets simulated as above, except that the treatment A was simulated independently of the covariates L 50 . We reported the following criteria: i) the percentage of non-convergence, ii) the mean absolute bias (e.g., 51 , the empirical coverage rate of the nominal 95% confidence intervals (CIs), defined as the percentage of 95% CI including the theoretical value, the type I error, defined as the percentage of rejection of the null hypothesis under the null hypothesis, and the statistical power, defined as the percentage of rejections of the null hypothesis under the alternative hypothesis. The MSE was our primary performance measure of interest because it combines bias and variance. We assumed that the identifiability conditions hold in these scenarios. We further performed the same simulations by omitting L 1 in the PS or in the Q-model to evaluate the impact of an unmeasured confounder. We performed all the analyses using R version 3.6.0 52 .

Results
convergence. Non-convergence only occurred for ATT estimation when sample sizes were lower or equal to 300 subjects (see Fig. 2). The GC, IPTW and FM had a minimal convergence percentage higher than 98%, even under small sample size (n = 100). Similarly, TMLE experienced some difficulty in converging for ATT estimation in the medium-sized sample (n = 300). However, they experienced severe difficulty in converging in the small sample with a convergence percentage of approximately 92%.

Mean bias.
As expected with the common set, the mean absolute bias of θ was close to zero for GC, IPTW and TMLE when the three identifiability assumptions hold with a maximum at −0.028 given moderate sample size (n = 300) under the alternative hypothesis for ATT estimation ( Table 1). Note that the three other covariate sets led to a bias close to zero with a maximum of 0.053 for TMLE with the entire set given small sample size (n = 100) under the alternative hypothesis for ATE estimation (Table 2). Furthermore, FM was also associated with a similar bias with a maximum of 0.082 given a small sample size (n = 100), with the treatment set under the alternative hypothesis for the ATE estimation. With an unmeasured confounder, the bias increased in all scenarios with a  Variance. For all methods, the outcome set led to the lowest MSE, followed closely by the common set.
G-computation led to the lowest MSE and FM to the highest. In ATT, IPTW had lower MSE than TMLE. Note that the VEB was particularly high for FM in all ATE scenarios with a minimum of −17.5% (n = 500 with the outcome set). For the ATT, FM also had a higher VEB than other methods, apart from TMLE with the treatment or entire sets in sample sizes of fewer than 2000 subjects. In the presence of an unmeasured confounder, the MSE increased in all scenarios in agreement with the increase in bias. The VEBs did not change notably with an unmeasured confounder. coverage and error rates. G-computation produced coverage rates close to 95%, except for ATE in a small sample size leading to an anti-conservative 95% CIs with a minimum of 91.7% with the entire set under the null hypothesis. Anti-conservatives 95% CIs were also produced by FM in all scenarios, and by TMLE given a small sample size. Conversely, conservative 95% CIs were obtained when using TMLE for the ATT with the entire or the treatment sets, and when using IPTW for ATT or ATE with the outcome or the common sets.
Lending confidence to these results, the type I error was close to 5% for GC in all scenarios and may vary for other methods. The power was more impacted by the choice of the covariate set. The outcome set led to the highest power for GC.

Applications
We illustrated our findings by using two real data sets. First, we compared the efficiency of two treatments, i.e., Natalizumab and Fingolimod, sharing the same indication for active relapsing-remitting multiple sclerosis. Physicians preferentially use Natalizumab in practice for more active disease, indicating possible confounders. Given the absence of a clinical trial with a direct comparison of their efficacy, Barbin et al. 53 recently conducted an observational study. We reused their data. Second, we sought to study barbiturates that can lead to a reduction of the patient functional status. Indeed, barbiturates are suggested in Intensive Care Units (ICU) for the treatment of refractory intracranial pressure increases. However, the use of barbiturates is associated with haemodynamic repercussions that can lead to brain ischaemia and immunodeficiency, which may contribute to the occurrence of infection. These applications were conducted in accordance with the French law relative to clinical noninterventional research. According to the French law on Bioethics (July 29, 1994; August 6, 2004; and July 7, 2011, Public Health Code), the patients' written informed consent was collected. Moreover, data confidentiality was ensured in accordance with the recommendations of the French commission for data protection (Commission Nationale Informatique et Liberté, CNIL decisions DR-2014-558 and DR-2013-047 for the first and the second application, respectively).
To define the four sets of covariates, we asked experts (D.L. for multiple sclerosis and M.L. for ICU) which covariates were causes of the treatment allocation and which were causes of the outcome, as proposed by VanderWeele and Shpitser 33 . We checked the positivity assumption and the covariate balance (see OSI). We applied B-spline transformations for continuous variables when the log-linearity assumption did not hold. natalizumab versus fingolimod to prevent relapse in multiple sclerosis patients. The outcome was at least one relapse within one year of treatment initiation. Six hundred and twenty-nine patients from the French national cohort OFSEP were included (www.ofsep.org). The first part of Table 3 presents a description of their baseline characteristics.
All included patients could have received either treatment. Therefore, we sought to estimate the ATE. The first part of Table 4  www.nature.com/scientificreports www.nature.com/scientificreports/ varying slightly depending on method and set of covariates) than one in which all patients received Fingolimod (approximately 28%). This difference of approximately 8% is clinically meaningful and suggests the superiority of Natalizumab over Fingolimod to prevent relapses at one year. This result was concordant with the recent clinical literature 53,54 . impact of barbiturates in the icU on the functional status at three months. We define an unfavourable functional outcome by a 3-month Glasgow Outcome Scale (GOS) lower than or equal to 3. We used the data from the French observational cohort AtlanREA (www.atlanrea.org) to estimate the ATT of barbiturates because physicians recommended these drugs to a minority of severe patients. The second part of Table 3 presents the baseline characteristics of the 252 included patients.
The second part of Table 4 presents the results according to the different possible methods and covariate sets. G-computation and TMLE lead to the conclusion of a significant negative effect of barbiturates regardless of the covariate set considered with an OR [95% CI] ranging from 0.43 [0.25; 0.76] for GC with the common set to 0.51 [0.29; 0.90] for TMLE with the entire set. By contrast, the results were discordant when using different covariate sets for IPTW and FM. We report, for instance, OR estimates obtained by FM ranging from 1.520 with the outcome set to 2.300 with the common set. In line with the simulation study, the estimated standard errors were higher for these methods (0.294 and 0.293 for GC and TMLE when the outcome set was considered, respectively) leading to lower power. Note also that standardised differences were higher than 10% for the IPTW with the entire set (see OSI) and for FM with the outcome, the treatment and the entire sets.
Depending on the methods and sets of covariates included, we estimated that from 18% to 20% of patients treated with barbiturates had an unfavourable GOS at three months. If these patients had not received barbiturates, the methods estimate that from 30% to 35% would have had an unfavourable GOS at three months. For the patients, this difference is meaningful but full clinical relevance depends also on the effect of barbiturates on other clinically relevant outcomes, such as death or ventilator-associated pneumonia. However, the results obtained by GC or TMLE differ with those obtained by Majdan et al. 55 , who did not find any significant effect of barbiturates on the GOS at six months. Two main methodological reasons can explain this difference: the GOS was at six months rather than three months post-initiation, and the authors used multivariate logistic regression leading to a different estimand.

Discussion
The aim of this study was to better understand the different sets of covariates to consider when estimating the marginal causal effect.
The results of our simulation study, limited to the studied scenarios, highlight that the use of the outcome set was associated with the lower bias and variance, principally when associated with GC, for both ATE and ATT. As expected, an unmeasured confounder led to increased bias, regardless of method employed. Although we do not report an impact on the variance, the effect's over-or under-estimation leads to the corresponding over-or under-estimation of power and compromises the validity of the causal inference.
The performance of FM is lower than that of the other studied methods, especially for the variance. Our results were in line with King and Nielsen 56 , who argued for halting the use of PS matching for many reasons such as covariate imbalance, inefficiency, model dependence and bias. Nonetheless, Colson et al. 17 found slightly higher MSE for GC than FM. Their more simplistic scenario, with only two simulated confounders leading to little covariate imbalance, could explain the difference with our results. Moreover, is unclear whether they accounted for the matched nature of the data, as recommended by Austin and Stuart 16 or Gayat et al. 50 .
While DRE offers protection against model misspecification 23,34,36 , our simulation study resulted in the finding that GC was more robust to the choice of the covariate set than the other methods, TMLE included. This result was particularly important when the treatment set was taken into account, which fits with the results of Kang and Schafer 35 : when both the PS and the Q-model were misspecified, DRE had lower performance than GC. Furthermore, GC was associated with lower variance than DRE in several simulation studies 13,17,35 , which accords with our results.
The first application to multiple sclerosis (ATE) illustrated similar results between the studied methods. In contrast, the second application (ATT) to severe trauma or brain-damaged patients showed different results between the methods. In agreement with simulations, the estimations obtained with GC or TMLE were similar n method set mean bias log OR www.nature.com/scientificreports www.nature.com/scientificreports/ in terms of logOR estimation and variance regardless of the covariate set considered. Estimations obtained with IPTW or FM were highly variable, depending on the covariate set employed: some indicated a negative impact of barbiturates and others did not. These results also tended to demonstrate that GC or TMLE had the highest statistical power. Variances obtained by parametric simulations or by bootstrap resampling were similar (results not displayed).
One can, therefore, question the relative predominance of the PS-based approach compared to GC, although there are several potential explanations. First, there appears to be a pre-conceived notion according to which multivariable non-linear regression cannot be used to estimate marginal absolute and relative effects 57 . Indeed, under logistic regression, the mean sample probability of an event is different from the event probability of a subject with the mean sample characteristics. Second, while there is an explicit variance formula for the IPTW 58 , the equivalent is missing for the GC. The variance must be obtained by bootstrapping, simulation or the delta method. Third, several didactic tutorials on PS-based methods can be found, for instance [59][60][61] .
We still believe that PS-based methods may have value when multivariate modelling is complex, for instance, for multi-state models 62 . In future research, it would be interesting to examine whether the use of potentially better settings would provide equivalent results, such as the Williamson estimator for IPTW 58 , the Abadie-Imbens estimator for PS matching 63 , or bounded the estimation of TMLE, which can also be updated several times 36 . We also emphasise that we did not investigate these methods when the positivity assumption does not hold. Several authors have studied this problem 13,25,35,36,64 . G-computation was less biased than IPTW or DRE except in Porter et al. 36 , where the violation of the positivity assumption was also associated with model misspecifications. The robustness of GC to non-positivity could be due to a correct extrapolation into the missing sub-population, which is not feasible with PS 1 . Other perspectives of this work are to extend the problem to i) time-to-event, continuous or multinomial outcomes and ii) multinomial treatment. However, implementing GC using continuous treatment raises many important considerations concerning the research question and resulting inference 64 .  www.nature.com/scientificreports www.nature.com/scientificreports/ To facilitate its use in practice, we have implemented the estimation of both ATE and ATT, and their 95% CI, from a logistic model in the existing R package entitled RISCA (available at cran.r-project.org/web/packages/ RISCA). We provide an example of R code in the appendix. Note that the package did not consider the inflation of the type I error rate due to the modelling steps of the Q-model. Users also have to consider novel strategies for post-model selection inference.
In the applications, we classified covariates into sets based on experts knowledge 33 . However, several statistical methods can be useful when no clinical knowledge is available. Heinze et al. 65 proposed a review of the most used, while Witte and Didelez 66 reviewed strategies specific to causal inference. Alternatively, data-adaptive methods have recently been developed, such as the outcome-adaptive LASSO 67 to select covariates associated with both the outcome and the treatment allocation. Nevertheless, according to our results, it may be preferable to focus on constructing the best outcome model based on the outcome set. For instance, the consideration of a super learner 68,69 , merging models and modelling machine learning algorithms may represent an exciting perspective 70 .
Finally, we emphasise that the conclusions from our simulation study cannot be generalised to all situations. They are consistent with the current literature on causal inference, but theoretical arguments are missing for generalisation. Notably, our results must be considered in situations where both the PS and the Q-model are correctly specified and where positivity holds.
To conclude, we demonstrate in a simulation study that adjusting for all the covariates causing the outcome improves the estimation of the marginal causal effect (ATE or ATT) of a binary treatment in a binary outcome. Considering only the covariates that are a common cause of both the outcome and the treatment is possible when the number of potential confounders is large. The strategy consisting of considering all available covariates, i.e., no selection, did not decrease the bias but significantly decreased the power. Amongst the different studied methods, GC had the lowest bias and variance regardless of covariate set considered. Consequently, we recommend that the use of the GC with the outcome set, because of its highest power in all the simulated scenarios. For  Table 4. Results of the two applications. *Treatment and common sets contain same covariates. π 0 : Percentage of event in the Natalizumab (or control) group, π 1 : Percentage of event in the Fingolimod (or Barbiturates) group, SE: standard error.