Using machine learning to predict COVID-19 infection and severity risk among 4510 aged adults: a UK Biobank cohort study

Many risk factors have emerged for novel 2019 coronavirus disease (COVID-19). It is largely unknown how these factors collectively predict COVID-19 infection risk, as well as risk for a severe infection (i.e., hospitalization). Among aged adults (69.3 ± 8.6 years) in UK Biobank, COVID-19 data were downloaded for 4510 participants with 7539 test cases. We downloaded baseline data from 10 to 14 years ago, including demographics, biochemistry, body mass, and other factors, as well as antibody titers for 20 common to rare infectious diseases in a subset of 80 participants with 124 test cases. Permutation-based linear discriminant analysis was used to predict COVID-19 risk and hospitalization risk. Probability and threshold metrics included receiver operating characteristic curves to derive area under the curve (AUC), specificity, sensitivity, and geometric mean. Model predictions using the full cohort were marginal. The "best-fit" model for predicting COVID-19 risk was found in the subset of participants with antibody titers, which achieved excellent discrimination (AUC 0.969, 95% CI 0.934–1.000). Factors included age, immune markers, lipids, and serology titers to common pathogens like human cytomegalovirus. The hospitalization "best-fit" model was more modest (AUC 0.803, 95% CI 0.663–0.943) and included only serology titers, again in the subset group. Accurate risk profiles can be created using standard self-report and biomedical data collected in public health and medical settings. It is also worthwhile to further investigate whether prior host immunity predicts current host immunity to COVID-19.


Supplementary Text 1.
To reiterate, the main objectives of this report are to use linear discriminant analysis (LDA) for probabilistic determination via Bayesian classification of: 1) a two-class grouping defined as a positive vs. negative COVID-19 test case; and 2) a two-class grouping nested within positive COVID-19 tests, defined as a test case occurring in a hospital vs. non-hospital setting. For positive tests, these settings are considered proxies for mild vs. severe COVID-19 disease status1. UK Biobank later confirmed disease severity using electronic health records and death certificates from pseudonymized Public Health England and other public records data. Because two or more test cases could be nested within a given participant, this would normally violate independence and potentially invalidate results. While one could use a single random test case per participant, this reduces sample size and does not represent real-world data and within-subject variability (e.g., changes in COVID-19 status or infection severity). Thus, for estimation robust to non-independence, we used Mundry and Sommer's permuted LDA approach2. The unit of randomization across participants was the grouping of all test cases originally nested in a given participant. The null hypothesis was rejected if the model fit to the original, non-randomized data outperformed the models fit to randomized data in at least 95% of permutations. As recommended, 1000 permutations were run for each model using macros provided by the authors and Python scripting for automation. To ensure that models were stable and generalizable, we used a typical "holdout" method, with 70% and 30% of data used for the training and test samples respectively3.
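For illustration, the following is a minimal Python sketch of this participant-level permutation scheme, not the macro-based implementation used in the study. The data frame, its columns ('participant_id', 'covid_positive'), and the feature names are synthetic placeholders; the sketch shuffles each participant's test cases as an intact block (here, among participants with the same number of test cases) and compares the observed holdout AUC to the permutation null.

# Minimal sketch of participant-level permutation testing for LDA.
# All data and column names are synthetic placeholders, not UK Biobank
# fields, and this is not the macro-based code used in the study.
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in data: 300 participants with 1-3 test cases each.
rows = []
for pid in range(300):
    status = int(rng.integers(0, 2))
    for _ in range(int(rng.integers(1, 4))):
        rows.append({"participant_id": pid, "age": rng.normal(70, 8),
                     "crp": rng.gamma(2, 1), "hdl": rng.normal(1.4, 0.3),
                     "covid_positive": status})
df = pd.DataFrame(rows)
features = ["age", "crp", "hdl"]

def holdout_auc(data):
    """Fit LDA with a 70/30 train/test holdout; return the test-set AUC."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        data[features], data["covid_positive"], test_size=0.30,
        random_state=0, stratify=data["covid_positive"])
    lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
    return roc_auc_score(y_te, lda.decision_function(X_te))

def permute_outcome_blocks(data):
    """Shuffle outcomes across participants while keeping each participant's
    block of test cases intact; blocks are swapped only among participants
    with the same number of test cases so block lengths always line up."""
    out = data.sort_values("participant_id").reset_index(drop=True)
    blocks = [g["covid_positive"].to_numpy()
              for _, g in out.groupby("participant_id")]
    sizes = np.array([len(b) for b in blocks])
    order = np.arange(len(blocks))
    for s in np.unique(sizes):
        idx = np.where(sizes == s)[0]
        order[idx] = rng.permutation(idx)
    out["covid_positive"] = np.concatenate([blocks[i] for i in order])
    return out

observed = holdout_auc(df)
null_aucs = [holdout_auc(permute_outcome_blocks(df)) for _ in range(1000)]
p_value = np.mean([a >= observed for a in null_aucs])
print(f"observed AUC = {observed:.3f}, permutation p = {p_value:.3f}")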
As discussed in the main text, forced-entry models were first conducted for each predictor, and then among sets of similar predictors (e.g., demographics, vital signs). It was known from the outset that model overfitting would likely occur when a predictor set had dozens of features (e.g., biochemistry). This procedure was done for two reasons: 1) to show clinicians, researchers, and policymakers how a set of common features would discriminate COVID-19 groups in ideal circumstances, even where model overfit is frankly likely; and 2) to contrast such models with the comparable or superior performance of stepwise models, which had substantial feature reduction and enough observations relative to predictors (n > p) to guard against overfitting, as sketched below.
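As a rough illustration of such stepwise feature reduction, forward sequential selection wrapped around LDA approximates a stepwise procedure. This sketch assumes a generic feature matrix X (a pandas DataFrame) and binary labels y, and does not reproduce the study's exact stepwise entry/removal criteria.

# Illustrative forward stepwise-style feature reduction around LDA.
# X (a pandas DataFrame of features) and y (binary labels) are assumed;
# the tolerance and CV settings are illustrative, not the study's values.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SequentialFeatureSelector

selector = SequentialFeatureSelector(
    LinearDiscriminantAnalysis(),
    n_features_to_select="auto", tol=0.01,   # stop when AUC gain < 0.01
    direction="forward", scoring="roc_auc", cv=5)
selector.fit(X, y)
print(selector.get_feature_names_out())      # the reduced feature set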
We now explain why LDA was used. Foremost, prediction models derived using LDA are straightforward for a general audience to interpret, which is appropriate for the journal in question. For example, the Wilks' lambda statistic allows a clear interpretation of how well a given predictor distinguishes between classes, and of its directionality (e.g., higher age in years predicts increased likelihood of positive COVID-19 classification). Equally important, LDA creates models that maximally separate classes of interest, where a new observation's data can be used to determine to which class that observation would belong. Since it is of central importance to have equally valid diagnostic assessment of who is and is not at risk for COVID-19, and of whether a positive case would be mild or severe, LDA is most appropriate. As a generative, supervised learning classification technique, LDA is also well suited to complex, high-dimensional datasets composed of a few to hundreds of features per data category, where it can remove most redundant or dependent features that do not maximize model fit. Reducing the feature set size reduces the risk of model overfitting4,5. This procedure minimizes the High Dimensional, Low Sample Size (i.e., p >> n, or "small n, big p") problem4 by reducing the likelihood of the within-class covariance matrix approaching singularity6 and destabilizing parameter estimation.
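To make the Wilks' lambda interpretation concrete: in the one-way, single-predictor case it reduces to the ratio of within-class to total sum of squares. A minimal sketch with made-up numbers follows.

# One-way Wilks' lambda for a single predictor: the ratio of within-class
# to total sum of squares (values near 0 mean strong class separation,
# values near 1 mean none). All numbers are made up for illustration.
import numpy as np

def wilks_lambda(x, y):
    x, y = np.asarray(x, float), np.asarray(y)
    ss_total = np.sum((x - x.mean()) ** 2)
    ss_within = sum(np.sum((x[y == g] - x[y == g].mean()) ** 2)
                    for g in np.unique(y))
    return ss_within / ss_total

age = np.array([62., 70., 75., 68., 80., 76., 73., 78.])   # years
covid = np.array([0, 0, 0, 0, 1, 1, 1, 1])                 # 0=neg, 1=pos
print(wilks_lambda(age, covid))   # ~0.47: age moderately separates classes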
LDA has several key assumptions that we wish to address: a relative lack of outliers, multivariate normality, a lack of multicollinearity, and independence of data values between participants. To begin, UK Biobank removes extreme values during data quality control before posting datasets to its data showcase7. We further log-transformed all quantitative variables to normalize distributions and "bring in" outliers, defined as data points >3 SD from the mean. As described in the main text, we also removed biochemistry variables that were multicollinear (e.g., direct vs. total bilirubin). While some antigens of the same pathogen approached multicollinearity, removing them from feature selection led to identical results, and thus they were kept in. Participant-level data were not dependent on data from other participants. To be clear, however, multiple observations were nested within a given participant, which would violate independence. Because the permutation LDA testing randomizes which participants have a group of one to several COVID-19 test cases, these models are robust to non-independence.
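A minimal sketch of this preprocessing follows. The >3 SD rule comes from the text; the clipping step and the 0.9 correlation cutoff for flagging multicollinear pairs are illustrative assumptions, not the authors' exact procedure.

# Sketch of the preprocessing described above; thresholds are illustrative.
import numpy as np
import pandas as pd

def preprocess(frame, cols, sd_limit=3.0, corr_limit=0.9):
    out = frame.copy()
    for c in cols:
        out[c] = np.log1p(out[c])                 # normalize skew
        mu, sd = out[c].mean(), out[c].std()
        # Clip residual outliers beyond +/- sd_limit SDs ("bring in").
        out[c] = out[c].clip(mu - sd_limit * sd, mu + sd_limit * sd)
    # Flag near-multicollinear pairs (e.g., direct vs. total bilirubin).
    corr = out[cols].corr().abs()
    flagged = [(a, b) for i, a in enumerate(cols) for b in cols[i + 1:]
               if corr.loc[a, b] > corr_limit]
    return out, flagged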
While other machine learning techniques are also appropriate for classification, we discuss why they were not used. Logistic regression is attractive because it makes no distributional assumptions. However, it assumes observations are independent, which does not hold for COVID-19 testing, in which a participant will often have multiple test cases. Logistic regression also requires a large number of observations to provide reliable estimates. Furthermore, it does not produce stable models when classes are well separated: under (quasi-)complete separation, coefficient estimates can inflate toward infinity. As there are very clear immunologic differences that determine whether someone has COVID-19, and a clear demarcation between mild vs. severe symptom presentation, we believe logistic regression estimates might be inflated and thus less accurate. Finally, despite their methodological differences, LDA and logistic regression may perform the same with real data8, where LDA may be more conservative; such conservatism was one of our goals for this proof-of-principle study.
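A toy demonstration of the separation problem mentioned above: with completely separated classes, a near-unregularized logistic regression coefficient inflates, while the LDA coefficient stays finite. All values are synthetic and chosen only to force separation.

# Toy separation demo: logistic regression vs. LDA on separated classes.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.r_[rng.normal(0, 1, 50), rng.normal(10, 1, 50)].reshape(-1, 1)
y = np.r_[np.zeros(50), np.ones(50)]

logit = LogisticRegression(C=1e6, max_iter=10_000).fit(X, y)  # ~unregularized
lda = LinearDiscriminantAnalysis().fit(X, y)
print("logistic coef:", logit.coef_[0, 0])   # very large under separation
print("LDA coef:     ", lda.coef_[0, 0])     # finite and stable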
More complex algorithms than LDA were also not considered, due to feature complexity, the need for transparent model estimates, and sample size. First, the dataset contains many features for biochemistry markers, antibody load to specific antigens, and, to a lesser degree, immune factors. Data reduction is therefore important to determine which features are most useful for COVID-19 prediction and should receive attention. Clustering methods are not suitable because the dimensionality of the space is too high and model fit is likely to be poor. For newer machine learning techniques, such as deep learning, it is often unclear which set of features is selected, or what each feature's relative contribution is, when a given prediction is made. This is unacceptable for predicting COVID-19 infection or severity risk. For researchers, it is unknown how various risk factors converge to affect risk, and this information is necessary to better understand underlying mechanisms. In population health or the clinic, certain features have prohibitive time or cost constraints (e.g., body compartment imaging; ordering one versus multiple antigen tests). More importantly, it is critical for clinicians, policymakers, and other stakeholders to be able to point out exactly which features led, or would lead, to a predicted outcome. Finally, deep learning, support vector machines (SVM), and similar approaches also require much larger sample sizes to train and adapt a classifier that produces robust estimates. By comparison, our dataset had only several thousand testing datapoints in the "full" sample and just over one hundred in the sub-group that had serological data.
We recognize that LDA has several limitations and used non-parametric estimation to minimize these issues. To begin, in simulation studies LDA performs comparably to logistic regression when predictor distributions are normal or near normal, but fits worse when there are clear normality violations9. While we log-transformed quantitative measures with appreciable skewness (>3 SD), normality nonetheless remained a concern, particularly for the serology sub-group, which had 124 observations. To reduce potential problems, bootstrapping10 was used (95% CI, 1000 iterations) to estimate model coefficients. This allows unbiased estimation of generalized absolute error, taking into account potential model overfit by substantially varying training and test sets from the selected sample. Nevertheless, with the serology sub-group, the small n, big p problem may still be a concern. Regularized LDA has been a popular choice for overcoming within-class covariance singularity, where cross-validation presents a reasonable solution11. Because regularization was computationally prohibitive in tandem with bootstrapping, we instead used a simple "leave-one-out" approach with bootstrap estimates.
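A minimal sketch combining the two safeguards described here, bootstrap confidence intervals for LDA coefficients and leave-one-out validation, follows. Synthetic arrays of the same size as the serology subset stand in for the real data.

# Bootstrapped LDA coefficients plus leave-one-out validation (sketch).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
n, p = 124, 5                        # small n relative to p, as in the text
X = rng.normal(size=(n, p))          # synthetic stand-in features
y = rng.integers(0, 2, n)            # synthetic binary outcome

# Bootstrap 95% CIs for the coefficients (1000 resamples).
coefs = []
while len(coefs) < 1000:
    idx = rng.integers(0, n, n)              # resample with replacement
    if len(np.unique(y[idx])) < 2:           # skip one-class resamples
        continue
    coefs.append(LinearDiscriminantAnalysis().fit(X[idx], y[idx]).coef_[0])
lo, hi = np.percentile(coefs, [2.5, 97.5], axis=0)

# Leave-one-out accuracy as a guard against small-n overfit.
loo_acc = cross_val_score(LinearDiscriminantAnalysis(), X, y,
                          cv=LeaveOneOut()).mean()
print(lo, hi, loo_acc)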

Supplementary Text 2.
This section addresses why all COVID-19 test case data were used, and to what degree participants had within-subject variability in testing data. Typically, all test cases are examined to determine if a given person was ever positive for COVID-19 and, if infected, whether that infection was mild or required hospitalization (i.e., severe). These data are then used to create binary categorical outcomes of risk. While this approach is straightforward, it may lead to misestimation and bias that limit the generalizability of results.
First, for adults tested more than once for COVID-19 in the UK, there are meaningful differences in key risk factors. As shown by Chadeau-Hyam and colleagues12 in their supplemental material, participants with multiple tests (40.5%) were more likely to have lower educational attainment, lower average household income, greater economic hardship (e.g., renting vs. owning, unemployment), and a substantially greater likelihood of having comorbid conditions. This did not appear to be due to occupational bias, as participants with multiple tests were less likely to be frontline healthcare workers (15.46% vs. 9.04%). In effect, ignoring these differences may give rise to sampling bias. This bias can lead to gross over- or underestimation of the "true" resilience or risk conferred by a predictor variable, in the same way that COVID-19 prevalence is frequently misestimated13. By examining all test cases and using LDA-based permutation testing14 to ensure robustness of model estimates at the test case level, we have a more meaningful idea of how a given person reacts to COVID-19 exposures over time. In this way, such models: 1) better control for the nonrandom sampling that will occur during a worldwide pandemic; and 2) are more robust and generalizable to "real-world" conditions, in which outpatients and particularly inpatients are likely to have more than one COVID-19 test done.
Collider bias is also a concern, as highlighted by Griffith et al. using UK Biobank data15. Briefly, a collider is a variable influenced by both a given risk factor and an outcome. For COVID-19, the authors indicate this bias occurs when restricting analyses to the subset of participants who ever developed COVID-19 or were hospitalized (i.e., had severe disease). This sampling may distort patterns of association and limit generalizability to the general population. Further, it can distort "individual-level causal effects." As an example, between March and late April 2020, symptomatic emergency responders and healthcare workers were more frequently tested in hospital settings than community members, whereas testing policy changed after late April to more broadly address community testing. Collider variables here would be occupation (e.g., healthcare worker), degree of exposure, and the date or dates on which a given sample was taken. While Griffith et al. go on to discuss collider bias in tested vs. non-tested UK Biobank participants, we argue that "throwing out" all but one test case per participant introduces collider bias. Specifically, this would artificially overweight participants who had one test case and underweight participants who had multiple test cases, the latter being more likely to belong to one of several disadvantaged groups prone to COVID-19 infection and severe infection12.
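A small simulation makes the collider argument concrete: if a risk factor and an outcome independently raise the chance of being tested, then restricting analysis to the tested subset induces a spurious association between them. All variables below are synthetic.

# Toy collider simulation: 'risk' and 'outcome' are independent in the
# population, but both raise the chance of being tested; conditioning on
# the tested subset induces a spurious (here negative) association.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
risk = rng.normal(size=n)            # e.g., occupational exposure
outcome = rng.normal(size=n)         # independent of risk by construction
p_tested = 1 / (1 + np.exp(-(risk + outcome - 2)))   # testing = collider
tested = rng.random(n) < p_tested

print(np.corrcoef(risk, outcome)[0, 1])                   # ~ 0
print(np.corrcoef(risk[tested], outcome[tested])[0, 1])   # clearly negative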
While using all test case data can mitigate these issues to some degree, a new problem that arises is within-subject variability. We operationally define this variability as a participant who, based on a new COVID-19 test instance, "switches" groups for either the Disease Risk variable (negative; positive) or the Severity Risk variable (mild; severe). To test the effect of within-subject variability, we used Chadeau-Hyam et al.'s approach12 to categorize all UK Biobank participants into a three-level categorical variable, Testing Frequency:
• Adults who had one COVID-19 test
• Adults who had two COVID-19 tests
• Adults who had three or more COVID-19 tests
Using a linear mixed effects model (sketched below), we compared the "2 tests" group vs. the "1 test" group, and then the "3 or more tests" group vs. the "2 tests" group. We first tested disease risk using the 'Result' variable from UK Biobank. We next assessed severity risk using the 'Origin' variable from UK Biobank. Critically, if a given test case's result was negative (i.e., Result = 0), that test case was not included in the severity risk analysis. Put another way, because there is no disease present and therefore no affiliated symptoms, it does not matter whether a negative test was done in an outpatient or inpatient setting. We emphasize that this is the prescribed methodology1 and has been used in numerous published UK Biobank studies on COVID-19 risk.
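The group comparison could be sketched as follows, using a random intercept per participant and fitting the binary outcome as a linear probability model with statsmodels. The data frame `tests` and the column names ('participant_id', 'result', 'testing_freq') are hypothetical placeholders for the UK Biobank fields; this is not the authors' exact model specification.

# Mixed model sketch: binary 'result' regressed on Testing Frequency group,
# with a random intercept per participant (linear probability model).
# `tests` is an assumed long-format frame with one row per test case.
import statsmodels.formula.api as smf

model = smf.mixedlm("result ~ C(testing_freq, Treatment('1_test'))",
                    data=tests, groups=tests["participant_id"])
fit = model.fit()
print(fit.summary())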
For participants with 2 tests, there was a 2% higher likelihood of the second test being COVID-19 positive (32.3% to 34.3%, p < .001) vs. the first test. Similarly, there was a 6% higher likelihood of the second test of a COVID-19-positive patient reflecting severe instead of mild symptoms (55.2% to 61.4%, p < .001). For participants with 3 or more tests, there was no higher likelihood of being COVID-19 positive (34.3% to 34.3%, p = 0.988) vs. the second test. Curiously, for participants in the 3-or-more-test group who tested positive for COVID-19, there was a 3.7% lower likelihood of having severe instead of mild symptoms (65.1% vs. 61.4%). Finally, there were no significant interactions to suggest that number of tests or Testing Frequency group influenced sensitivity or specificity. Likewise, model estimates for positivity or severity did not change when number of tests or Testing Frequency was introduced as a covariate.
These results suggest that repeated testing reflects a significant but modest increase in the likelihood that a COVID-19 test will be positive. For severity risk, our results suggest that the change across testing groups reflects the hospital standard of care at that time1. Specifically, from March to mid-April 2020, and to a lesser extent from mid-April to mid-June 2020, symptomatic COVID-19 participants would initially be tested at hospital for COVID-19, be admitted to hospital if positive, have a 2nd test done up to a week later (often a PCR) to confirm positivity and check viral load, and potentially be tested again up to a week after that. In many cases, for patients who clearly still had severe symptoms, a 3rd test was not done. A 3rd test was more likely to occur if the patient's symptoms improved and there was a possibility of discharge.
Table legends (consolidated). Abbreviations: Area Under the Curve (AUC); Confidence Interval (CI); Geometric Mean (G-Mean). For the infection risk tables, specificity and sensitivity are the likelihoods of correctly detecting that COVID-19 infection for a test case was negative or positive, respectively; for the severity risk tables, they are the likelihoods of correctly detecting that a positive COVID-19 test case was mild or severe, respectively. G-Mean is the degree to which a given predictor or antigen correctly predicts both true negatives and true positives. Wilks' λ represents the relative strength of a given predictor in contributing to the final model fit. Seroprevalence is the proportion of participants whose assay values were high enough that they were considered positive for a given disease. "Blue" or "orange" vs. "white" shading is used to better visualize predictors within a set of similar variables, antigens specific to a given pathogen, or predictors that loaded for a given model. P values less than .05 were considered significant, and applicable predictors, antigens, and classifier statistics are bolded. *The CagA antigen was excluded from analysis because roughly half of the sample analyte values were lost to lab error.