Multiple imputation for analysis of incomplete data in distributed health data networks

Distributed health data networks (DHDNs) leverage data from multiple sources or sites such as electronic health records (EHRs) from multiple healthcare systems and have drawn increasing interests in recent years, as they do not require sharing of subject-level data and hence lower the hurdles for collaboration between institutions considerably. However, DHDNs face a number of challenges in data analysis, particularly in the presence of missing data. The current state-of-the-art methods for handling incomplete data require pooling data into a central repository before analysis, which is not feasible in DHDNs. In this paper, we address the missing data problem in distributed environments such as DHDNs that has not been investigated previously. We develop communication-efficient distributed multiple imputation methods for incomplete data that are horizontally partitioned. Since subject-level data are not shared or transferred outside of each site in the proposed methods, they enhance protection of patient privacy and have the potential to strengthen public trust in analysis of sensitive health data. We investigate, through extensive simulation studies, the performance of these methods. Our methods are applied to the analysis of an acute stroke dataset collected from multiple hospitals, mimicking a DHDN where health data are horizontally partitioned across hospitals and subject-level data cannot be shared or sent to a central data repository.

I n the past two decades, enormous amounts of health data have been collected and digitized, partly due to increasingly broader adoption of electronic health records (EHRs) by many healthcare systems. Pooling such big health data from multiple institutions such as healthcare systems and health insurance companies into a single database increases sample sizes for subsequent data analyses and, more importantly, the pooled data can provide a more representative sample of a larger population of interest. As such, it offers great promises in improving the validity, robustness and generalizability of research findings. However, pooling data from multiple institutions may not always be feasible or desirable. First, when the amount of data is massive and continues to grow, it may not be feasible or efficient to transmit data between institutions or store all data in one central repository. Second, for big health data, it may be desirable to store them in a distributed fashion and take advantage of advances in parallel computing. Third, most importantly, due to government regulations, institutional policies, and privacy concerns, it may not be possible to transfer big health data at the patient level from one institution to another or there are extremely high hurdles for such data transferring that may take years to clear. For example, Veteran's Health Administration policies require its EHR data to remain only within VA's facilities. In addition, improper disclosure of individual-level data has serious implications, such as discrimination for employment, insurance, or education 1 . In addition, the current standard practice of data de-identification through removing indiviual identifiers is inadequate for privacy protection in the era of big data, as a large body of research has demonstrated that given some background information of an individual, an adversary can learn (from "de-identified" data) sensitive information about the victim 2-6 .
To address these challenges, distributed health data networks (DHDNs) that can store and analyze EHRs data from multiple sites without sharing individual-level data have drawn increasing interests in recent years 7,8 . Examples of DHDNs 9 , include the vaccine safety datalink, the health care systems research network, the sentinel initiative, and most recently the patient-centered SCAlable national network for effectiveness research (pSCAN-NER) 10 that is part of PCORnet. To enhance scalability and privacy protection in distributed analysis, the PopMedNet platform 11,12 has been developed to provide software enabled governance over shared data. DHDNs eliminate the need to create, maintain, and secure access to central data repositories, minimize the need to disclose protected health information outside the data-owning entity, and mitigate many security, proprietary, legal, and privacy concerns. In this work, we focus on horizontally partitioned data 13 , meaning that different data custodians such as hospitals and healthcare providers have the same set of features for different sets of patients. For example, several healthcare systems are interested in analyzing pooled data from their EHRs to improve the precision and generalizbility of analysis results. However, due to the aforementioned concerns, they are not allowed or are reluctant to share individual-level data with others, despite the substantial benefits from such collaboration. DHDNs would lower the hurdles for them to collaborate in a distributed analysis environment 14 , highlighted needed methods contributions to analysis of distributed EHRs data.
As EHRs are collected as part of healthcare delivery, missing data are pervasive in EHRs and DHDNs 8,15 . Missing data problem reduces the usable sample size and hence analysis power. Improper handling of missing data is known to compromise the validity of analysis and yield biased results, and could subsequently lead to inappropriate healthcare and health policy decisions. To choose the best way forward in handling missing data, the pattern and mechanism of missingness need to be considered 16 . Three main missing data mechanisms are missing completely at random (MCAR), missing at random (MAR), and missing not at random 17 . Most of the existing methods for handling missing data rely on the assumption of MAR, i.e., missingness only depends on observed data, and which is the focuse of our current work as well.
Multiple imputation (MI) 17 is arguably the most popular method for handling missing data largely due to its ease of use. MI methods replace each missing value with samples from its posterior predictive distribution. The predictive imputation model is estimated from the observed data, which have no missing values. The missing values are imputed multiple times in order to account for the the uncertainty of imputation, and then each imputed dataset is used to fit the analysis model parameters θ 18 proposed a simple method for combining these analysis results from multiple imputed datasets, which is known as Rubin's rule. In the presence of general missing data patterns, the MI by chained equations (MICE) method is widely adopted and has been shown to achieve superior performance in practice 19,20 .
While there has been a large body of literature on handling missing data, there has been little work on handling distributed incomplete data such as missing data in DHDNs. Of note, while pSCANNER 10 has developed a suit of software tools for privacypreserving distributed data analysis, it currently has no tools for handling distributed missing data.
To enhance protection of patient privacy, we investigate distributed MI methods for handling missing data that do not require sharing individual level data between sites. Under MAR, one straightforward privacy-preserving MI approach for horizontally partitioned incomplete data would be to conduct MI within each institution/site and then perform the distributed analysis. We call this approach the independent MI (iMI). The iMI has a number of limitations. In particular, it fails to leverage data from other sites, which leads to large variability in imputation and loss of power in subsequent analysis. This becomes more pronounced as the proportion of missing data in individual sites increases. In the extreme case when one variable is missing for all observations in a single site, this variable cannot be imputed in that site using the iMI approach. As a result, the data from this site may not be used in any subsequent analysis where that variable is needed 21 proposed a privacypreserving lazy decision-tree imputation algorithm for data that are horizontally partitioned between two sources. As their algorithm is designed for only single imputation, it is challenging to conduct proper statistical inference such as hypothesis testing using their singly imputed dataset that underestimates the uncertainty of imputation. In addition, it is not directly applicable to general missing data patterns and the case of more than two sources, and their complex decision tree algorithm may overfit the data and may not be communication efficient.
Since communicating data between sites in distributed learning can be a costly operation, we seek to develop communicationefficient distributed MI approaches. The aforementioned naïve iMI approach is communication-efficient as it involves no communication between sites. We propose two additional communicationefficient approaches, inspired by the inference methods for distributed complete data (CD); the average mixture approach (AVGM) and the communication-efficient surrogate likelihood (CSL) approach. In the AVGM approach 22 , each site finds the local estimate using the data available at the site, and then these estimates are averaged to find the global estimate. The CSL approach 23 uses the curvature information from a central site and the pooled derivative at a point near the true parameter. AVGM is expected to perform better when the samples are evenly distributed across the sites, while CSL is expected to perform well when the central site has the majority of samples. We will use these two approaches to develop distributed MI approaches, avgmMI and cslMI, for univariate missing data patterns. In addition, we develop the another distributed MI method that uses only the aggregated statistics from each site that are sufficient to obtain the same global estimate as if one had access to data pooled from all sites, and we call this method siMI. siMI can be communication efficient for linear regression models but not for nonlinear regression models for which the fitting algorithm is iterative. Of note, similar to iMI, when one variable is missing for all observations in a single site, this variable cannot be imputed in that site using avgmMI and cslMI and hence the data from this site may not be used in any subsequent analysis where that variable is needed. However, siMI would enable the use of data from these sites, which is one advantage of siMI over the other methods. Using these techniques, we also develop distributed MICE methods for general missing data patterns. However, since the standard MICE algorithm involves fitting of the imputation model multiple times, these direct extensions may not be as communication efficient except for iMICE which requires no communication.
Our work represents the first attempt to develop MI methods that allow proper statistical inference such as hypothesis testing in analysis of horizontally partitioned incomplete data in DHDNs. The remainder of the article is organized as follows. In the section "Results", we assess the strengths and weaknesses of the proposed distributed imputation methods in simulation studies; we then apply the methods to analysis of an acute stroke dataset collected from EHRs of multiple hospitals, mimicking a DHDN setting. The section "Discussion" provides some concluding remarks. In the section "Methods", we first briefly review the standard MI and the two distributed analysis methods for CD, AVGM, and CSL, respectively, and then present our communication-efficient distributed MI and MICE methods, respectively.

Results
Simulation studies. We conduct simulation studies to investigate strengths and limitations of the four privacy-preserving distributed MI methods described in the section "Methods" under the MAR assumption. We consider a linear regression model as the "analysis model" where y is the N × 1 vector of responses Y, x 1 , …, x p are the N × 1 covariate vectors for variables X 1 through X p , θ = (θ 0 , θ 1 , …, θ p ) denotes the model parameters of interest, and ϵ $ N ð0; σ 2 IÞ is the N × 1 vector of errors. We investigate both univariate and general missing data patterns. We apply each distributed MI method to the simulated missing data and then fit the analysis model using the imputed data to evaluate the imputation performance in terms of bias and SD of regression coefficient estimates, and communication costs. To benchmark the performance of the distributed MI methods, we compare their results with the results from the CD analysis which fits the analysis model using the full data before missing values are generated, and the results from the complete case analysis which fits the analysis model using only the set of complete cases that have all variables observed after missing values are generated.
In the second scenario, we make only one change from the first scenario, which is X 1 is now a binary variable. Given X 2 , instead of sampling X 1 from a normal distribution, we generate X 1 from a Bernoulli distribution Bð1; pÞ with probability p ¼ f1 þ expðÀ0:2 þ 0:5X 2 Þg À1 . The outcome variable Y and the missingness of X 1 are generated in the same way as in the first scenario. The resulting missing rate is about 50%.
The third scenario considers general missing data patterns. We have p = 5 predictor variables, and X 1 -X 3 have missing values. The fully observed variables X 4 and X 5 are independent and identically distributed as N ð0; 1Þ. Given X 4 and X 5 , we sample X 1 -X 3 from a multivariate normal distribution N ðμ X ; Σ X Þ with μ X ¼ ð0:3 À 0:3X 4 À 0:1X 5 Þ1; Σ X ¼

5:
The outcome Y is generated by Missing values in X 1 -X 3 are generated based on the logistic regression models for the missing indicators δ 1 -δ 3 .
resulting in 20% of missing rates for each missing variable and 50% of complete case rate.
Let K be the number of data sites distributed over the network. We consider two different numbers of sites (K = 5, 10) and three different sample sizes N = 250, 500, and 1000. We also look at two different types of distributions among the samples over the sites. In the first type (U), the samples are unevenly distributed. The first site has the majority of the samples and each site except the first has 15 samples only. In the second type (E), the samples are evenly distributed over the K sites. Table 1 lists all 15 settings of K, N, and the sample distribution type which are considered in this study. To evaluate the performance, we compute bias, standard deviation (SD), and root mean squared error of the estimates for θ from 1000 Monte Carlo datasets, which are , where θ 0 is the true value of θ. Tables 2-4 summarize the results for scenarios 1-3, respectively. Note that the CC method is biased regardless of the sample size N, which is as expected since the missing mechanism is not MCAR. Overall, the biases of all MI methods deteriorate as N decreases and K increases. However, the changes vary with the method, the type of sample distribution, and the type of imputed variable.
We can see that iMI and iMICE are less biased when the samples are evenly distributed, as each individual imputation model can be fitted stably. In contrast, when most of sites do not have enough samples, the individual and hence the aggregated estimates are less stable. Note that avgmMI and avgmMICE are hardly affected by the type of sample distribution when the missing variable is continuous (scenarios 1 and 3). However, they are substantially influenced when the missing variable is binary (scenario 2). The difference is even bigger when K = 10. Conversely, cslMI and cslMICE are worse when the samples are evenly distributed, obviously because the sample size at the central site is smaller. Note that they utilize the curvature information from the central site only and the initial estimate is obtained from the central site as well. Therefore, its performance is sensitive to the sample size at the central site. In particular, note that a few cases of cslMICE failed to converge in evenly distributed case when K = 10 and N = 250. However, the performance of the CSL based methods is comparable to that of siMI and siMICE when the central site has the majority of the samples. Note that the estimates of siMI and siMICE are as unbiased as comparable to CD in all settings, although they suffer a bit larger SDs.
Note that iMI and iMICE do not require any communication for imputation. The avgmMI approach only requires two one-way communications; one to fit the imputation model by AVGM and another to deliver the aggregated estimates to all sites for imputation. The cslMI approach requires one more one-way communication; two to fit the imputation model by CSL and another to deliver the estimated estimates to all sites for imputation. However, the cslMI method transmits vectors only in the first two communications, while avgmMI sends an estimate vector and a covariance matrix in every communication. The siMI approach requires as many communications as avgmMI does when the imputation model is a linear regression. However, siMI requires more communications when the imputation model is nonlinear as shown in Table 3.
As we can see in Table 4, the communication costs of the proposed MICE methods are huge except iMICE. Due to the iterative nature of the MI by chained equation, the number of required communications is proportional to the number of imputations, M.
Analysis of real data. The Georgia Coverdell Acute Stroke Registry (GCASR), covering nearly 80% of acute stroke admissions in the state of Georgia in USA, was set up to monitor and improve the care of acute stroke patients in the prehospital and hospital settings. The GCASR dataset analyzed in this section includes 68,287 patients from 75 hospitals in Georgia with clinically diagnosed acute stroke between 2005 and 2013. The data collected from EHRs in each hospital include a total of 203 variables, many of which have missing values due to various reasons. The goal of our analysis is to fit a linear regression model for assessing the effect of 14 features on the outcome variable of arrival-to-computed tomography time, an important quality indicator for acute stroke care, in the presence of missing data. The features of interest include patient-related characteristics such as age and gender, and pre-hospital-related characteristics such as EMS notification.
To assess the performance of the distributed imputation methods, we consider the case where EHRs data from individual hospitals cannot be pooled or sent to a central data repository, mimicking a DHDN, and we seek to impute missing values in this distributed set-up while protecting data privacy. Among these features of interest, only gender and race are observed for all patients, and the missing rates for the other variables range from 0.04 to 50.73%. Of note, in some hospitals one or more variables (e.g., NIH stroke score, EMS prenotification, and NPO) are missing for all observations. Since iMICE cannot be used to impute missing values in a hospital with one or more variables missing for all observations, such hospitals were removed in the first set of analyses resulting in 67,944 observations from 66 hospitals. The sample size in each hospital ranges from 18 to 4,333 with median 578. The number of complete cases across all hospitals is 13,353.
As with the simulation study, we used the CC, iMICE, avgmMICE, cslMICE, and siMICE methods. The CD analysis is not applicable to real data analysis. In addition, since the cslMICE approach is sensitive to the sample size of the central site, we consider two versions of cslMICE, namely, cslMICE(M) and cslMICE(m). For cslMICE(M), the central site is chosen to be the one with the most samples (4333). For cslMICE(m), the central site is the hospital with the median sample size (578). For each imputation method, we generate M = 20 imputed datasets. To benchmark the performance of the distributed MI methods, we also include the results from the complete case analysis, noting that the CD analysis is not applicable in the real data example.
To compare the performance of distributed imputation methods without being complicated by the choice of distributed method for fitting the analysis model, we chose to fit the analysis model using the imputed data pooled across all hospitals. The analysis results from the M = 20 imputed datasets are combined using the Rubin's rule. Specifically, let b θ m and d Varð b θ m Þ be the regression coefficient estimate and its estimated variance (or variance-covariance) from the m-th imputed dataset. Then the overall coefficient estimate is given by Figure 1 presents the parameter estimates and associated 95% confidence intervals for each regression coefficient in the linear regression model of interest. Of note, the hospitals in which at least one variable is missing for all observations are removed for iMICE since the missing values in such hospitals cannot be imputed using iMICE. For each method other than siMICE, we counted the number of discrepancies in statistical significance defined at α = 0.05 or in sign/direction of estimated effect compared to siMICE. Table 5 reports the number of discrepancies along with the number of communications required for each imputation method.
Since the results from the siMICE method are the same as the results from the standard MICE using pooled data, the latter is omitted from Fig. 1, and the results from the siMICE method are  (1) n (2) n (3) n (4) n (5) n (6) n (7) n (8) n (9) n (10) -1  250  250  -1  500  500  -1  1000  1000  U  5  250  190  15  15  15  15  U  5  500  440  15  15  15  15  U  5  1000  940  15  15  15  15  U  10  250  115  15  15  15  15  15  15  15  15  15  U  10  500  365  15  15  15  15  15  15  15  15  15  U  10  1000  865  15  15  15  15  15  15  15  15  15  E  5  250  50  50  50  50  50  E  5  500  100  100  100  100    N is the total number of samples. See Table 1 for local sample sizes. Results are based on 1000 Monte Carlo datasets.  N is the total number of samples. See Table 1  used to benchmark the other methods. As shown in Table 5, siMICE incurs substantially higher communication costs than the other distributed MI methods. Compared to the results from siMICE, the CC analysis yields substantially different parameter estimates for multiple regression coefficients as well as disagreement in statistical significance or in direction of estimated effect for eight features. This demonstrates the need for adequate handling missing data in the analysis of the GCASR data.
The results from the other distributed MI methods are closer to the results from siMICE than the CC analysis. Though, cslMICE (m), which uses the hospital with the medium sample size as the central site, exhibits the second largest number of discrepancies after CC, including disagreement for four features. While offering substantial savings in communication costs and yielding similar parameter estimates for some features, iMICE shows notable discrepancies for "NIH stroke score," "Serum total lipid", and "NPO" when compared to siMICE. On the other hand, cslMICE (M), which uses the hospital with the largest sample size as the central site, and avgmMICE yield the smallest number of discrepancies from siMICE, specifically only two discrepancies as shown in Table 5. In addition, Fig. 1 provides more granular information about comparing cslMICE(M) and avgmMICE with siMICE. The locations of 95% confidence intervals obtained by avgmMICE are most similar to those obtained by siMICE, with the only notable discrepancy for "family history of stroke." The lengths of confidence intervals obtained by cslMICE(M) are most similar to those obtained by siMICE, which could be attributed to the fact that it uses the curvature information of the central site with a large sample size. As a comparison, avgmMICE yields wider intervals for a number of features including "EMS prenotification," "history of TIA," "history of cardiac valve prosthesis," and "family history of stroke." To assess the performance of the distributed MI methods when sample sizes in individual hospitals are moderate to small, we conducted another set of analyses after removing the hospitals with more than T patients where T = 500, 300, 100. When T = 500, the total sample size decreases to 5307 patients from 26 hospitals. The number of patients in each hospital ranges from 18 to 462 with a median of 163 and the number of complete cases is only 362. In this set of analyses, we exclude cslMICE(m). As T decreases and, in other words, the number of patients per hospital continues to decrease, the discrepancies between the results from siMICE and the results from the other distributed MI methods including iMICE, avgmMICE and cslMICE(M) become greater. Particularly, the discrepancies between siMICE and cslMICE(M) tend to grow faster than the discrepancies between siMICE and avgmMICE, suggesting that cslMICE(M) is more sensitive to moderate to small sample sizes in all sites than avgmMICE.

Discussion
In this paper, we consider the problem of distributed incomplete data where data from multiple sites are not allowed to be combined, due to institutional policies or privacy concerns. We have developed and investigated four MI approaches that allow proper statistical inference such as hypothesis testing in analysis of horizontally partitioned incomplete data in DHDNs.
Our numerical experiments provide insights into the strengths and weaknesses of these methods. The proposed distributed imputation methods except for iMI/iMICE enable the use of data from all sites including sites with one or more variables missing for all observations. siMI has been shown in our numerical studies to yield comparable performance as the standard MI using pooled data, but it is not communication efficient for generalized linear imputation models. While cslMI and avgmMI are more communication efficient for all imputation models, their Table 4 Simulation results for scenario 3 where three continuous variables N is the total number of samples. See Table 1      performance may be sensitive to sample sizes in individual sites. In particular, the performance of cslMI may become unstable as the sample size in the central site becomes small to moderate. While avgmMI is less sensitive to small sample sizes, our simulations show that it tends to yield larger bias when imputing binary variables. Of note, siMI may be particularly appealing when analyzing data for uncommon diseases for which the sample size can be small in each individual site and missing data can further complicate data analysis. On the other hand, given that existing networks have 5 to upwards of 70 or more sites and can have millions to billions of records, the avgmMI and cslMI approaches can be very appealing options compared to the iMI or siMI approaches if all the data are used in analysis. Unlike the other proposed methods, iMI requires no communication between sites but may lead to unstable results as the sample sizes in some sites become small to moderate. As shown in the real data example, when one variable, say X 1 , is missing for all observations in a single site, X 1 in that site cannot be imputed using iMI/iMICE and hence the data from the site may not be used in subsequent analysis involving X 1 . The choice of distributed imputation approaches may also depend on whether data heterogeneity across multiple sites can be adequately adjusted for in imputation models. If we are able to adequately account for the heterogeneity by say including covariates that capture the heterogeneity or random effects for sites in imputation models, the siMI, cslMI, and avgmMI methods that borrow information across sites can enhance the efficiency of imputation and hence the power of subsequent analysis of imputed datasets. However, if that is not the case, then the iMI approach may be preferred.
We have investigated the extensions of the distributed MI methods for general missing patterns through the use of chained equations (MICE). Although these methods are privacypreserving and yield good performance, they are not communication efficient as demonstrated in the numerical experiments. In cases where communication costs are of critical concern, more communication-efficient imputation methods are needed for handling general missing data as potential future work. Another potential limitation is that siMI, cslMI and avgmMI may not always be privacy-preserving as the summary statistics transmitted between individual sites and a central server may still leak individual-level information 24 . Particularly, the siMI method needs to transfer the entire design matrix between sites, which poses higher risk of leaking individual patients' information. To address this issue, a differential privacy step 25 can be added to further strengthen the privacy-preserving property.
In practice, robust imputation methods such as predictive mean matching (PMM) and random forest (RF) imputation are widely used. It is of future interest to develop distributed versions of generic imputation methods include PMM and RF, which, however, can be very challenging. Such distributed generic imputation methods are expected to require additional communication overhead and more general definition of sufficient statistics. Particularly, the information to be exchanged for a distributed generic imputation method needs to be carefully investigated, while taking into account statistical validity, privacypreserving property, and communication costs.

Methods
Ethical approval. This study was reviewed by the Institutional Review Board at the University of Pennsylvania which determined that the study does not meet the criteria for human subject research since it involves only secondary analysis of deidentified data from an existing database and does not involve new data collection.
Notation. To fix ideas, suppose that we are interested in fitting the analysis model (1) of outcome Y on p features X 1 , …, X p , using a random sample of N observations. We define x 0 = 1 for the N × 1 vector of ones and denote the values for the i- We consider horizontally partitioned data from K institutions or sites, all of which have the same set of features recorded for all of their subjects. y and X, known as the "pooled" outcome vector and the "pooled" design matrix respectively, can be decomposed by sites as follows: . . .
where y (k) and X (k) are the data from the kth site with n (k) subjects, X (k) is an n (k) × (p + 1) matrix, and N ¼ P K k¼1 n ðkÞ . We first consider a univariate missing pattern where only X 1 has missing values and the other variables are fully observed where N c denotes the number of complete cases. We then consider general missing data patterns.
Multiple imputation. An MI method replaces each missing value multiple times from its predictive distribution based on the observed data, accounting for the uncertainty of imputation. Each of the imputed datasets is analyzed separately as if it were fully observed. The results across all imputed datasets are then combined following Rubin's rule. For example, if X 1 which has missing values is continuous, we can use a Bayesian linear regression model for imputation where ζ $ N ð0; τ 2 Þ with priors πðτ 2 Þ / IGð1=2; 1=2Þ; αjτ 2 $ N ð0; τ 2 λ À1 IÞ; where IG and N refer to the inverse gamma distribution and the multivariate Gaussian distribution, respectively. Let Z = [1, y, x 2 , …, x p ], and let Z c be the N c × (p + 1) submatrix of Z loaded with the complete cases only. Similarly, let x 1,c be the subvector of x 1 with the complete cases. The posterior distribution of (τ 2 , α) is given by The MI method samples (τ 2 , α) from Equation (3), imputes the missing values of X 1 according to Eq. (2) with random errors added, and fits the analysis model (1) using the imputed full data. This procedure is repeated multiple times.
When X 1 that has missing values is binary, we can use a Bayesian logistic regression model for imputation with prior α $ N 0; λ À1 I À Á . Let b α be the maximum A posteriori estimator. Note that, as N c tends to infinity, we have αÞÞ, W c is the sub-diagonal-matrix of W for the complete cases, and expitðxÞ ¼ logit À1 ðxÞ ¼ 1 1þe Àx . Therefore, the MI method samples α from N ðb α; ðZ T c W c Z c þ λIÞ À1 Þ, imputes the missing values according to the Bernoulli distribution, x i1 $ Bð1; expitðe z T i αÞÞ, and fits the analysis model using the imputed full data. This procedure is repeated multiple times. A regularization parameter λ can be used to avoid numerical difficulties particularly when the sample size at a site is less than the dimension of the parameters. We choose the value of λ to be as small as possible so that the bias caused by regularization can be negligible.
Communication-efficient inference for distributed CD. One straightforward approach to analyze distributed data is to transmit the minimally sufficient information from all sites to the central site such that it would enable reproducing the results from analyzing data pooled from all sites. We call this approach the SI (suffcient information) method which will be extended to the sufficient information MI (siMI) in the next section. In linear regression, for example, we only need X (k)T X (k) and X (k)T y (k) to obtain the same least-square estimates for the regression coefficients as if we had the data pooled from all sites. This can be seen from the following equation: In addition, this approach requires only one single communication between each site and a central server where the computations described in Eq. (4) are conducted. To extend this strategy to the generalized linear models (GLMs), we note that a standard algorithm for fitting GLMs goes through Newton iterations, each of which requires the derivative and the curvature information of the global loglikelihood that involves regression coefficients. For example, in the logistic regression, the parameters are updated as follows for each iteration.
where π ðkÞ i ¼ expitðe x ðkÞT i θ ðtÞ Þ, W (k) = diag(π (k) )diag(1 − π (k) ), and s is the step size. For each iteration, the central site has to transfer θ (t) to other sites and all sites have to transfer X (k)T W (k) X (k) and X (k)T (y (k) − π (k) ) back to the central site. Since the number of required round-trip communications is same as that of Newton iterations, the SI method may not be communication efficient. This approach is privacy preserving in the sense that the subject-level data are not shared outside of each site and hence are protected. Again, this approach generates the same results as the analysis of the pooled data.
We also consider two alternative distributed analysis methods that are communication efficient 22 proposed the average mixture algorithm. Each site estimates the model parameters using the data available at the site only, and then combine the estimates to find the global estimate for the parameters. Let b θ ðkÞ be the estimate from the kth site. Then, we have where w k ≥ 0 and ∑ k w k = 1. Note that the weights w k should reflect the sample size of each site. This method is communication efficient as it requires a single one-way communication only. The resulting estimator can achieve the best rate of convergence in asymptotics 22 . However, the local estimates can be volatile, especially when the sample size of the site is small. Therefore, the finite sample characteristic of b θ avgm can be quite different. We call this approach the AVGM (average mixture) algorithm. Jordan et al. 23 proposed using the curvature information from one central site, say site 1, and the global derivative at a point near the true parameter. The estimator is defined as the minimizer of the CSL, which is defined as e LðθÞ ¼ L 1 ðθÞ À h∇L 1 ðθÞ À ∇LðθÞ; θi; where L is the gross average loglikelihood (loss function), L 1 is the local average loglikelihood at the central site (site 1), θ is a point close to the true θ, and 〈⋅,⋅〉 denotes the inner product. Note that it requires one round-trip communication in order to calculate ∇LðθÞ. The solution achieves the optimal convergence rate if θ converges fast enough 23 . However, the finite sample performance may deteriorate if the sample size at the central site is small, as it utilizes the curvature information from the central site only and the initial solution θ is also obtained from the central site only. We call this approach the CSL method. Jordan et al. 23 proposes multiple versions of CSL methods including the ones that repeat the whole procedure using the current solution as a new initial coefficient θ. Since those approaches require more communications, we do not consider all of them and restrict our focus on the most communication-efficient version, the one described above. The two communication-efficient methods AVGM and CSL are also privacy-preserving in the sense that the individual level data are not shared between sites.
Distributed MI for univariate missing data pattern. One straightforward way to impute missing values for distributed data is to impute missing data in each site separately using a standard MI method. We call this approach the iMI method. When using iMI, no subject-level data are shared across sites and thus no communication is required. As such, it is communication-efficient and privacy-preserving. However, this approach has a number of limitations as discussed in the "Introduction" section. 1 for k ← 1 to K do 2 Find the estimates b α ðkÞ and Covðb α ðkÞ Þ at site k; Sample α 1 , …, α M independently from N ðb α avgm ; Covðb α avgm ÞÞ; 8 Send α 1 , …, α M to all sites; 9 for m ← 1 to M do 10 for k ← 1 to K do Impute the missing data at site k based on α m 11 Fit the analysis model and obtain b θ m and Covð b θ m Þ; 12 end 13 Combine the results by Rubin's rule to obtain b θ and Covð b θÞ; The cslMI method fits the imputation model using the CSL method.
b α csl ¼ arg min α e LðαÞ; where α is chosen as the local solution at the central site.

Algorithm 4
CSL MI algorithm 1 Find the estimate α ¼ arg min α L 1 ðαÞ, which is the optimal estimate at site 1; 2 Send α to all sites and receive ∇L k ðαÞj α¼α back; 3 Find b α csl ¼ arg min α e LðαÞ and Covðb α csl Þ ¼ 1 N c ∇ 2 L 1 ðαÞ À1 j α¼b α csl ; 4 Sample α 1 , …, α M independently from N ðb α csl ; Covðb α csl ÞÞ; 5 Send α 1 , …, α M to all sites; 6 for m ← 1 to M do 7 for k ← 1 to K do Impute the missing data at site k based on α m ; 8 Fit the analysis model and obtain b θ m and Covð b θ m Þ; 9 end 10 Combine the results by Rubin's rule to obtain b θ and Covð b θÞ; The aforementioned algorithms can be used for a wide range of imputation models including, but not limited to, GLMs. But, when applied to a linear imputation model, the algorithms need to be designed to draw the error variance parameter τ 2 in different ways. The iMI and siMI methods can follow the procedure in (3) without modification. However, a direct application of the AVGM and CSL approaches would sample the error variance τ 2 from Gaussian, which does not ensure its positiveness. To address this issue, we propose the following alternative procedures for sampling τ 2 .
For avgmMI, we use SSE = ∑ k SSE (k) where SSE (k) is the sum of squared errors from site k.