Nearest Neighbour Propensity Score Matching and Bootstrapping for Estimating Binary Patient Response in Oncology: A Monte Carlo Simulation

Nearest Neighbour (NN) propensity score (PS) matching methods are commonly used in pharmacoepidemiology to estimate treatment response using observational data. Unfortunately, there is limited evidence on the optimal approach for accurately estimating binary treatment response and, more so, to estimate its variance. Bootstrapping, although commonly used to accurately estimate variance, is rarely used together with PS matching. In this Monte Carlo simulation-based study, we examined the performance of bootstrapping used in conjunction with PS matching, as opposed to different NN matching techniques, on a simulated dataset exhibiting varying levels of real world complexity. Thus, an experimental design was set up that independently varied the proportion of patients treated, the proportion of outcomes censored and the amount of PS matches used. Simulation results were externally validated on a real observational dataset obtained from the Belgian Cancer Registry. We found all investigated PS methods to be stable and concordant, with k-NN matching to be optimally dealing with the censoring problem, typically present in chronic cancer-related datasets, whilst being the least computationally expensive. In contrast, bootstrapping used in conjunction with PS matching, being the most computationally expensive, only showed superior results in small patient populations with long-term largely unobserved treatment effects.

PS NN matching and treatment effect. In NN PS matching, each treated patient is matched to one or more patients from the control group based on the closest = = | Pr Z X PS ( 1 ) i i value 3,4,15 . In principle, any regression technique can be used to develop the propensity model as long as it provides reasonable fit to the data. It is not necessary that the chosen technique produces calibrated probabilities as units are matched on a score 16 . Optimal selection of variables X i is based on observed variables which affect the outcome of interest, because this is associated with better PS estimations 17 .
Let T and C be the set of N T treated and N C control patients respectively and OS i T and OS i C the observed continuous outcomes of the treated and control units, respectively. Denote by C(i) the set of N i C control units matched (using NN based on PS scores) to the treated unit i ∈ T. Define the weights = w ij N 1 i C if j ∈ C(i) and = w 0 ij otherwise. We define the subject-specific treatment effect (STE) for the treated, derived from the ATE 13,18 , as the estimation of the subject-specific survival gain SG i s : Following oncology guidelines and depending on disease severity, patients can be labelled with 'response' whenever their SG is longer than a threshold of λ months.
One-by-one matching. The most common implementation of PS matching in practice is one-by-one matching, in which pairs of treated and control units are formed. Using one-by-one nearest neighbour PS matching = N ( 1) i C , one treated unit i ∈ T is matched to one control unit j ∈ C. When the OS of treated, control or both are censored, the estimated SG s will be highly uncertain (see Supplementary Material). Hence, for those matched pairs where censoring is problematic, the binary response-label based on the estimated SG s becomes highly uncertain. For those cases, the SG cannot be assessed if any of the following conditions apply: • when j ∈ C(i) and i ∈ T are censored • when j ∈ C(i) is censored and www.nature.com/scientificreports www.nature.com/scientificreports/ Denote κ = NA ij when i and j are censored or when j is censored and and ρ i = NA otherwise. Formula (1) becomes: with OS i T assumed to be a constant. We can make a crude estimation of the SG i s variance by taking the variance of the entire control set C, One-by-k matching. Using one-by-k nearest neighbour PS matching ( = N k i C = 50), one treated unit i ∈ T is matched to k nearest control units. Labelling for matched units subject to the censoring problem cannot be estimated if any of the following conditions are satisfied: When none of these conditions are met the response label of treated unit i ∈ T can be estimated. However, if ∃ ∈ j C i j ( ): is censored, j cannot contribute to the estimation of this label.
, given that the summation can be calculated under the conditions given by the definition of κ C i ( ) , and ρ = NA i otherwise. if j ∈ C(i) and w ij = 0 otherwise 19 . The value of α is set to 5 to ensure weights close to zero for maximal distances while having large enough differences in weights for small distances. Matched control units with equal PS as the treated unit contribute to the mean with a weight equal to one, while matched control units with distance approaching one contribute only with a weight approaching zero.

Matching through bootstrapping.
Using the complex bootstrap method as described by Austin (2014), b bootstrap samples are drawn from the original control group with sample size equal to the control group 9 . In each of the bootstrap samples, the PS model is estimated, and one-by-one PS matching is performed for creating matched pairs, forming the set C(i) of control units In this k-NN bootstrapping method, one treated patient can be matched multiple times to the same control patient, i.e. j ∈ C(i) can occur multiple times in C(i), lowering the heterogeneity of the matched sets. The estimation of the subject-specific gain in survival  SG i s and its variance  V i s can be calculated as given by Eqs. (4) and (5) in one-by-k matching 14 .

Material and Methods
We used simulated datasets with three levels of patient heterogeneity to examine the performance of the different matching techniques over a series of Monte Carlo simulations. There, performance was evaluated based on their ability to estimate the individual SG under these three scenarios. In this section, we describe the design of the datasets and the Monte Carlo simulations. The results were externally validated by examination of case studies for treated metastatic colorectal patients.
Simulated data generation. Data was simulated in R following the data-generating process described by Austin (2014), generating 1000 patients with 10 baseline covariates X 1 − X 10 , of which seven affecting treatment selection (X 1 − X 7 ) and OS outcome (X 3 − X 10 ) 9 . Very weak, weak, moderate, strong and very strong effects of the covariates on treatment selection and OS outcome is introduced by the regression coefficients i i i were determined using logistic regression, following: www.nature.com/scientificreports www.nature.com/scientificreports/ for low, medium and high heterogeneity respectively. Treatment status was generated from a Bernoulli distribution on the subject-specific PS p Z Be p : ( ) i i i , through which the intercept α 0,treat indirectly affects the proportion of patients treated in the simulation. The OS outcome was generated as described by Bender (2005) and Austin (2014) 9,12 , that is, based on the linear predictor , with u a random number from the uniform distribution and equal to 0.00002. The conditional hazard ratio exp(β treat ) was fixed to 0.8. The true SG for the treated SG i was generated from the OS outcome as produced by the linear predictors for Z i = 1 and 0: . From this, the corresponding average true SG, i.e. the "true ATE" SG, and the variance of the true SG, i.e. the "true variance" V, is calculated. The censoring status of the subjects' survival was drawn from a Binominal distribution given the probability of being censored case study data. Patients were collected from the Belgian Cancer Registry (BCR), a population based cancer registry. We used ICD-10 codes (C18 up to and including C20) to select 10426 metastatic colorectal patients (stadium IV carcinoma) diagnosed between 2006 and 2014 with vital status information updated until July 1, 2017 (Table 1). Patients were classified in five groups according to their targeted treatment assignment: 2784, 845, 308 and 31 patients received bevacizumab, cetuximab, panitumumab, and aflibercept respectively. 6458 patients were not treated with the targeted medicine and were classified as the control group (irrespective of radiotherapeutic and/or chemotherapeutic treatments). Of these five groups, 26% (731), 15% (127), 11% (35), 52% (16) and 15% (965) had censored survival, respectively. OS, the RCT's primary endpoint, was used as the main indication of treatment effect. Selected variables were taken from the full standard set of variables nationally collected by the BCR and Inter Mutualistic Agency, which were further limited for relevance by BCR oncologists.
The data set consisted of (a) the patient's OS and censoring status; (b) (historical) treatment paths i.e. radiotherapeutic and/or chemotherapeutic treatment and treatment with the four targeted medicines; and (c) patient and tumour characteristics, i.e. age, sex, tumour differentiation grade, topography, tumour location (left/right), total amount of tumours, WHO performance score at diagnosis and TNM classification. Multiple imputation was used for handling missing data for PS-relevant variables assuming data was missing completely at random 20-23 .
Analysis on simulated data. Using the data-generating process described above, three types of heterogeneity were simulated by using the regression coefficients denoting very weak to very strong impact on treatment selection and survival, which were iterated 1000 times. For each of these heterogeneity types (low, medium and high), three factors were varied: the proportion of patients treated (given no censoring), the proportion of outcomes censored (given 20% of patient treated) and the number of nearest neighbours used in matching (given 20% of patient treated and 20% of outcomes censored). (See Supplementary Materials for more information). For all these scenarios and datasets, the PS is estimated using a logistic regression model 3,4 , with selected observed variables being those affecting the survival time (X 3 − X 10 ) 16  www.nature.com/scientificreports www.nature.com/scientificreports/ weighted k-NN and complex bootstrapping described above) are performed to estimate the STE, i.e. the estimated subject-specific gain in survival  SG i s , and STE variance  V i s for each treated unit i ∈ T, given by formula (4 and 5). These are then investigated for the different PS NN methods by calculating their means over all units, SG s the mean STE) and V s (the mean STE variance), and comparing the latter with the simulated "true variance" V. We propose the PS NN method with the smallest relative difference )/V the best estimator of the true variance. Lastly, the variance of  V s , i.e. var(V s ), is compared between the PS NN methods, as well as the proportion of patients subject to the label-censoring problem as defined by the rules of formula (4)-(5). These analyses were carried out across the 1000 iterations of the Monte Carlo simulation conducted in R. Therefore, results of these analysis are reported as averaged values over the iterations.

Simulation Results
The following section describes the results of the label-censoring problem and the variance estimations for the three PS NN methods on the simulated datasets with low, medium and high heterogeneity. impact of number of units matched. The relative difference δ var between the true variance V and the mean estimated STE variance V s together with the resulting variance of the STE variance var(V s ) and the proportion of labels censored are reported in Fig. 1 for varying amount of NN units k considered during matching. The three panels show the different levels of heterogeneity.
As expected, the amount of predicted labels that are censored decrease with increasing amount of matched units k considered during the three PS NN matching methods for all heterogeneity sets, this at a similar pace until k = 5. Hence all methods perform equally well for solving the label-censoring problem, regardless of heterogeneity.
Similarly, no difference is found between the methods for estimating variance in low heterogeneous groups unless for computational complexity using bootstrapping. However, for increasing heterogeneity, we observe the bootstrap method to have less accurate predictions of variances, showing higher relative differences δ var although lower variances of  V s for small k. Hence, for high heterogeneity the bootstrap method would be inferior to the k-NN matching methods based on both accuracy and computational complexity, while for low heterogeneity the bootstrap method is inferior on computational complexity alone.
impact of proportion of outcomes censored. The relative difference δ var between the true variance V and the mean estimated STE variance V s and the resulting variance of the STE variance var(V s ) are reported in Fig. 2 for varying amount of outcomes OS censored. The three panels show the different levels of heterogeneity.
We see the relative difference δ var to be quite unaffected by the proportion of OS outcomes censored for all heterogeneity sets. Hence, the estimation of V s remains constant, even though increased outcomes censored means less units j ∈ C(i) contribute to the estimation of V i s for every unit i ∈ T. However, we can verify that this has an effect on the accuracy of this estimation, because the variances of  V s have an increasing trend for low heterogeneity. This trend disappears for higher heterogeneity because of both δ var and especially var(V s ) fluctuate. For all levels of heterogeneity, the bootstrap method would be inferior to the k-NN matching method based on computational complexity.
impact of proportion of patients treated. The relative difference δ var between the true variance V and the mean estimated STE variance V s and the resulting variance of the STE variance var(V s ) are reported in Fig. 3 for varying amounts of patients treated in the population. The three panels show the different levels of heterogeneity. www.nature.com/scientificreports www.nature.com/scientificreports/ As expected, δ var increases and becomes more uncertain (  V s variance increases) for increasing proportion of treated patients in all heterogeneity sets, as the control group C (the pool to which treated units are matched) becomes smaller. No difference is found between the different methods, except for a slightly higher  V s variance for low heterogeneity. However, as for each proportion of treated units a different linear predictor was simulated, affecting the OS outcome of the dataset, we see δ var and  V s variance fluctuates, especially for high heterogeneity. Therefore, results for δ var for the different PS NN methods are inconclusive.

case Study
In this section, the performance of the three different PS NN techniques are examined on a case study of metastatic colorectal cancer patients treated with bevacizumab, cetuximab, panitumumab or aflibercept as a targeted medicine. Numerical results of the one-by-one, one-by-25 (weighted) and 25-bootstrap PS NN matching techniques are depicted in Fig. 4 and Table 1.
The results show that the three methods are stable and concordant. Only for the aflibercept case, with a small treated population (31) and high amount of survival censoring (52% or 16 out of 31), we observe a difference between the k-NN techniques and the bootstrap method. Specifically, the censoring problem reduces dramatically with increasing k with lower estimated V s and  V s variance for the bootstrap method as opposed to the k-NN techniques.

Discussion
Bootstrapping, a method commonly used to accurately estimate variance, is rarely used together with PS matching. In this Monte Carlo simulation-based study, we examined the performance of the complex bootstrap method, as described by Austin (2014), to estimate binary treatment response and variance in the domain of oncology. Specifically, the subject-specific survival gain (that is, the individual treatment effect) and its variance together with its ability to mitigate the problem of label-censoring, obtained from the individual treatment effect as a binary treatment response label, were the main factors under investigation. The Monte Carlo study was based  www.nature.com/scientificreports www.nature.com/scientificreports/ on simulated datasets containing 1000 patients with varying levels of heterogeneity found in real world patient populations. Counterintuitively, we found that the estimation of survival gain variance in patient populations with a high patient heterogeneity does not benefit from using the complex bootstrapping method instead of (weighted) k-NN. Indeed, we expected the relevant matches to be small for increasing k and increasing heterogeneity, implying that k-NN PS matching would contain a large set of irrelevant matches for large k, as opposed to the complex bootstrapping method, which always matches a treated patient to the one closest control patient. As a consequence, while also taking into account the computational complexity we found the bootstrap method to not show to favourable results even for high heterogeneous patient populations. Additionally, no major differences were found between the k-NN and weighted k-NN method, because the resulting weights were approximately equal to one in most cases of the simulated data. While it can be argued that this behaviour would change if one chooses a value of the parameter of the exponential distance function that is better suited to the data at hand, we note that this parameter cannot be tuned in practice because, as opposed to that in simulation study, the real variance is unknown before estimation.
Applying these methods to four colorectal cancer treatments with varying amount of patients treated and unobserved outcomes, we found all three PS methods to be stable and concordant. From the analysis, we can conclude that the computationally cheapest method, being k-NN PS matching, should be used in most of the cases. However, for the aflibercept case, where a small amount of patients are assigned to the treatment while the majority of survival times are censored, we did observe the bootstrap method to have favourable estimations. This result was expected because bootstrapping is a statistical method often used for estimating variance in fairly small datasets 13 .
Note that some concerns may arise when using bootstrapping in conjunction with PS matching in observational studies. First, one specified PS model was used for each resampled control group in the analysis of this study, which may be inappropriate in high heterogeneous patient populations. However, identifying the best fitted model for each bootstrapped sample would be highly unpractical. Second, for comparison reasons, a low number

Data availability
The case study data that support the findings of this study are available from the Belgian Cancer Registry but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of the Belgian Cancer Registry.