Prior fluid and electrolyte imbalance is associated with COVID-19 mortality

Background The COVID-19 pandemic represents a major public health threat. Risk of death from the infection is associated with age and pre-existing comorbidities such as diabetes, dementia, cancer, and impairment of immunological, hepatic or renal function. It remains incompletely understood why some patients survive the disease, while others do not. As such, we sought to identify novel prognostic factors for COVID-19 mortality. Methods We performed an unbiased, observational retrospective analysis of real world data. Our multivariable and univariable analyses make use of U.S. electronic health records from 122,250 COVID-19 patients in the early stages of the pandemic. Results Here we show that a priori diagnoses of fluid, pH and electrolyte imbalance during the year preceding the infection are associated with an increased risk of death independently of age and prior renal comorbidities. Conclusions We propose that future interventional studies should investigate whether the risk of death can be alleviated by diligent and personalized management of the fluid and electrolyte balance of at-risk individuals during and before COVID-19.


Quality control, transformation and handling of missing data
In the first step, we kept all lab and vital parameters, which were available for at least 1,000 patients, leaving 249 variables for primary analysis. We set implausible measurements to missing (e.g. negative values, percentages greater than 100, "0" if not biologically sensible, etc.). For continuous variables, observations more than +/-3 interquartile ranges from the median were set to missing, unless more extreme values were deemed to be plausible according to the judgement of an experienced physician. We performed log-transformation of variables when the Shapiro-Wilk test 4 showed less significant deviation from normality on the log-scale than it did on the original scale. We confirmed the decision on transformation via visual inspection of the density plots. The accordingly processed and quality-controlled data were used for univariate association analysis (cf. below).
For multivariate analysis, we additionally removed all patients from the analysis, for which less than 10 different variables were available, leaving 55,757 patients for analysis. Variables available for less than 10,000 patients were mean-imputed, resulting in a remaining missing rate of 12.4% in total. Those remaining missing values were imputed using the missForest Rpackage 5 . The package iteratively imputes missing values, and is suited for mixed-type data. It averages over unpruned trees, using the built-in out-of-bag error estimates of random forest.
We computed pairwise Pearson correlation coefficients 6 of variables, for the purpose of sanity checking of variables via to be expected correlations, and to investigate potential collinearity.

Association analysis
As primary analysis, we performed survival analysis, in particular Cox regression 7 , to analyze time-to-death from the date of COVID-19 diagnosis. Univariate analysis was conducted using age, sex, ethnicity, race, insurance status, and US region/division as covariate parameters for adjustment. We applied a Bonferroni-correction with the number of variables (m=250) to account for multiple testing and judged variable association to be significant if it met an α-level of α=0.05/m=2*10 -4 . We used the R survival package 8 and in particular the survfit() function for survival times and the coxph() function for Cox regression. Median survival times were computed using the methodology as described in 9 and visualized as Kaplan-Meier plots 10 . For our main model, we computed Martingale residual 11 and visualized them as suggested in 12 . An interpolation curve was fitted with the R lowess() function 13 . We tested if the Cox proportional hazards (PH) assumption 14 was fulfilled and computed Schoenfeld residuals 15 .
In order to allow comparison of, by default scale-dependent, hazard ratios (HRs) between different variables, we report the 2-standard-deviations hazard ratio "HR_2SD". It is computed as HR_2SD=HR^{2*SD}, where SD is the standard deviation of the respective variable. Note that for a binary variable, the usual HR and HR_2SD coincide if the two binary outcomes are about equally frequent (e.g. for sex) and that HR_2SD is numerically smaller than the standard HR when the binary outcomes differ in frequency. This effect makes the HR_2SD more comparable to HRs for quantitative variables and implies also a stronger accordance with the actual prognostic power of the variable: for binary variables where one outcome is very rare, the impact on prognosis is comparatively smaller than for a binary variable with two equally frequent outcomes, since only a small portion of patients has the rare variable status and has the elevated risk. Using HR_2SD instead of HR compensates for this effect.
The association analysis between a priori measurements of hypotensive DBP and a priori median albumin levels (n=45,819 patients) was performed using logistic regression. To be consistent with the Cox regression the results are shown as 2-standard-deviations odds ratio "OR_2SD".

Model development
As secondary analysis, multivariable modelling was performed. We pursued two approaches in parallel. First, we performed a backward selection procedure on the Cox regression model of all eligible variables. We iteratively removed the variable with least impact on model performance, until all remaining parameters were significant at α1=0.05/250=2*10 -4 (Bonferroni-correction, twosided). By construction, the procedure controls the family-wise error rate at α=0.05. In parallel, we derived a regularized Lasso model 16 . We fitted a L1 (Lasso) regularized Cox-Proportional Hazards Model using glmnet version 3.02 16 , with the concordance index (C-index) 17 as the performance measure. The regularization parameter λ was optimized using ten-fold crossvalidation. We selected λ such that we extracted the most regularized model with a C-index within one standard error of the best performing model.
To compare model performance between the two approaches, we used the C-index and and receiver operating characteristics area under the curve (ROC-AUC) for censored data, implemented in the R package timeROC 18 . ROC-AUC was computed for different time points (5, 10, 20, 30, 60s-days survival).
For the final model, we checked if recently suggested requirements for developing multivariable prediction models were fulfilled 19 .

Supplementary Results
The plot of the martingale residuals of the multivariate model (Supplementary Figure S1) and the red fitting line, which increases very slightly and only at the end, suggest that linearity on the log hazard ratio scale is a reasonable model assumption. Also the Schoenfeld residuals (Supplementary Figure S2) are distributed symmetrically, and do not suggest relevant systematic deviations from the proportional hazards assumption.