A machine learning approach identifies 5-ASA and ulcerative colitis as being linked with higher COVID-19 mortality in patients with IBD

Inflammatory bowel diseases (IBD), namely Crohn’s disease (CD) and ulcerative colitis (UC) are chronic inflammation within the gastrointestinal tract. IBD patient conditions and treatments, such as with immunosuppressants, may result in a higher risk of viral and bacterial infection and more severe outcomes of infections. The effect of the clinical and demographic factors on the prognosis of COVID-19 among IBD patients is still a significant area of investigation. The lack of available data on a large set of COVID-19 infected IBD patients has hindered progress. To circumvent this lack of large patient data, we present a random sampling approach to generate clinical COVID-19 outcomes (outpatient management, hospitalized and recovered, and hospitalized and deceased) on 20,000 IBD patients modeled on reported summary statistics obtained from the Surveillance Epidemiology of Coronavirus Under Research Exclusion (SECURE-IBD), an international database to monitor and report on outcomes of COVID-19 occurring in IBD patients. We apply machine learning approaches to perform a comprehensive analysis of the primary and secondary covariates to predict COVID-19 outcome in IBD patients. Our analysis reveals that age, medication usage and the number of comorbidities are the primary covariates, while IBD severity, smoking history, gender and IBD subtype (CD or UC) are key secondary features. In particular, elderly male patients with ulcerative colitis, several preexisting conditions, and who smoke comprise a highly vulnerable IBD population. Moreover, treatment with 5-ASAs (sulfasalazine/mesalamine) shows a high association with COVID-19/IBD mortality. Supervised machine learning that considers age, number of comorbidities and medication usage can predict COVID-19/IBD outcomes with approximately 70% accuracy. We explore the challenge of drawing demographic inferences from existing COVID-19/IBD data. Overall, there are fewer IBD case reports from US states with poor health ranking hindering these analyses. Generation of patient characteristics based on known summary statistics allows for increased power to detect IBD factors leading to variable COVID-19 outcomes. There is under-reporting of COVID-19 in IBD patients from US states with poor health ranking, underpinning the perils of using the repository to derive demographic information.


Methods
Dataset. COVID-19/IBD repository. We acquired data from IBD patients who tested positive for COV-ID-19 from the Surveillance Epidemiology of Coronavirus Under Research Exclusion Inflammatory Bowel Disease (SECURE-IBD) 21 repository jointly created by researchers of University of North Carolina, Chapel Hill and Mount Sinai, NY, USA. This international, pediatric and adult database based on collaborative participation (1) monitors and reports on outcomes of COVID-19 occurring in IBD patients and (2) provides the IBD community with updates on affected numbers based on demography. The repository is populated with data that is compiled within the UNC REDCap (Research Electronic Data Capture) system, a secure, web-based electronic data capture tool hosted at the University of North Carolina at Chapel Hill. Disease activity (mild, moderate, remission or unknown) was determined based on the assessment of a physician during a COVID-19 related emergency room visit or upon hospitalization. These data include several patient characteristics, namely, age, sex, smoking status, medication usage, severity of IBD, number of preexisting conditions (comorbidities), along with COVD-19 outcomes, namely whether the patient was an outpatient, hospitalized, admitted to the ICU, ventilated and/or deceased. Data is based on country-wise and US state-wise counts of voluntarily reported COVID-19 cases in IBD patients.
American health rankings. We also acquired data from a comprehensive repository of US national health statistics provided on a state-to-state basis 22 . Scores for a multitude of health-related features are provided that are based on a history of environmental and socioeconomic data. The repository offers an annual report with a variety of features, where the major classes include behaviors, community environment, policy, clinical care and outcomes with a plethora of sub-categories (see health ranking sub-criteria in the "Results" section). Each subcategory is associated with a list of state names ranked on calculated scores in the order of healthy to unhealthy. www.nature.com/scientificreports/ Preprocessing and dataset generation. Using data in the SECURE-IBD repository (2.1.1), we generated a table for each feature (age, sex, smoking status, medication usage, severity of IBD, and the number of comorbidities) that summarizes the percentage of the COVID-19/IBD population that fall into the following outcomes: outpatients (OP), hospitalized and recovered (H-R), and hospitalized resulting in death (H-D) (see Table 1). We propose a random sampling approach using these feature tables to generate a complete patient dataset for any number of desired patients with the following set of characteristics: Patient ID, Gender, Age, Medication Usage, Number of Comorbidities, Smoking Status, IBD subtype, IBD Severity, Country, State, Hospitalization Status. For each simulated patient p, we generated these features and outcomes as follows: 1. Determine the patient gender ( f i = male or female) based on the rule: Here n f i is the number of people with gender f i . Given N = j n(f j ) , the number of male and female patients follow the multinomial distribution given by: N! f i n(f i ) × f i pr f i n(f i ) . Therefore, the expected number of patients of any given feature value f i is calculated as n f i = N × p f i . 2. Given the hospitalization status s, we now calculate all other features, one at a time, based on the conditional probability of a feature given status p (f | s).

Observations
We assume that gender is conditionally independent of other features f, i.e., pr (f |gender) = pr(f).
However, this approach preserves the probabilities between input features and outcomes as summarized in the SECURE-IBD summary tables, since the likelihood of observing a feature f in the generated dataset is given by It is noteworthy that Step 2 of the random sampling approach ensures that the simulated patient dataset preserves the likelihood of each feature value given the observed variable (i.e., hospitalization status) from the original dataset. By summing up the marginal probabilities in Eq. (2), we show that the likelihood of finding a feature value in the generated dataset corresponds to the probabilities defined in the SECURE-IBD summary tables. Supervised learning methods. Supervised machine learning (ML) algorithms learn a function that maps the input training data (i.e., features) to some output labels 23 . In this work, we consider the following supervised learning techniques, evaluating each using cross-fold validation (see [24][25][26][27][28][29][30][31][32][33][34] for more details on these ML approaches).
• Support vector machine (SVM) is used for classification and regression problems that map sample data to high-dimensional feature spaces. SVM operates on hyperplanes-decision boundaries in high-dimensional space-that define the class for the data points. The objective of SVM is to maximize the separation between the training data points and the learned hyperplane, and later use this separation to classify new samples. SVM is memory efficient and effective for datasets with few samples 24,25 . • Stochastic gradient descent (SGD) is an iterative strategy that fits the training data to an objective function that is used to classify new samples 26,27 . SGD is a stochastic variant of the popular gradient descent (GD) optimization model 27,28 . In GD, the optimizer starts at a random point in the search space and reaches the lowest point of the function by traversing along the slope. Unlike GD that requires calculating the partial derivative for each feature at each data point, SGD achieves computational efficiency by estimating derivatives on randomly chosen data points. Table 1. We subdivided patients into nine age groups and calculated the proportion of patients within each subgroup that were outpatients (OP), hospitalized resulting in death (H-D) and hospitalized and recovered (H-R) outcomes.

Total OP (%) H-D (%) H-R (%)
0 -9 years  17  88  0  12   10-19 years  296  94  0  6   20-29 years  649  90  0  10   30-39 years  634  85  0  15   40-49 years  508  79  0  21   50-59 years  470  70  2  28   60-69 years  274  55  7  www.nature.com/scientificreports/ • Nearest centroid (NC) is a classification approach that represents each learned class from the training data by the centroid of its members. Subsequently, it assigns each new sample data point to the cluster whose centroid is the closest. NC is particularly effective for non-convex classes and does not suffer from any additional dependencies on model parameters 29 . • Decision trees (DTs) are a classification and regression technique that assigns target labels based on decision rules inferred from data features of the training samples 30,31 . DT maintains the decision rules using a tree. A new data point is repeatedly assessed using the conditional statement at a particular tree node and branches off to a new node based on this conditional until a leaf node is reached. The new data point is then assigned the class of the leaf node. • Gaussian Naive Bayes (NB) are a class of fast, probabilistic learning techniques that apply the Bayes' theorem to assign labels to the new data points 32 . • Supervised neural network (SNN) or multilayer perceptron (MLP) is a deep artificial neural network comprising several neural network units, called perceptrons. Each perceptron is a function that combines any input with learned weights (or hyperparameters) to generate an output value. MLP consists of an input layer that receives the input data, a set of hidden layers serving as computational engines and an output layer that makes a prediction based on the input. MLP training operates in two stages: a forward pass and a backward pass. The forward pass propagates the signal from the training input to the output layer, measuring the output prediction against the ground truth. The backward pass pushes data from the known output towards the input layer, modulating the hyperparameters to enable the prediction to best fit the ground truth of the training data 33 .
Note that supervised ML approaches generally yield reliable prediction accuracy. However, they often suffer from overfitting or convergence issues 34 . Each of the above approaches has its advantages and disadvantages. For instance, SVM works well when the underlying distribution of the data is not known. However, it is prone to overfitting when the number of features is much greater than the number of samples. SGD converges quickly for large datasets, but it is restrictive because it may require fitting a large number of hyper-parameters. Conversely, DT involves almost no hyper-parameters but often entails slightly higher training time. Unlike DT, NB requires less training time but works on the intrinsic assumption that all the attributes are mutually independent. Finally, NC is a fast method but is not robust to outliers or missing data. In the context of our work, we try out several features to get a broad sense of the best features that are applicable in most scenarios. The supervised and unsupervised ML approaches were implemented using the Python Scikit-learn library 34 . Metrics. We use the following metrics to evaluate each of the classifiers and to determine the importance of sample features: • Accuracy function measures the fraction of matches between the predicted and actual labels in a multi-label classification, i.e., the ratio of correctly predicted observations to the total observations. It can be calculated as: • In the above equation, TP, TN, FP, FN denote true positive, true negative, false positive and false negative, respectively. Other metrics of accuracy are precision (pr), recall (re) and F-score (fs), measured as: • Feature importance is calculated by employing the extra trees classifier estimator that fits randomized decision trees (called extra-trees) on data samples 35 . The memory and computation overhead of this approach can be controlled by regulating the size of the extra trees. The nodes in the tree are split into sub-trees resulting in high accuracy (i.e., drop in impurity). Thus, feature importance is measured as the total reduction in impurity affected by that feature 35 . • Multiple linear regression (MR) is a statistical approach that measures the linear relationship between the independent and the dependent variables x and y of a function y = g(x). MR generates this linear relationship y = β 0 + β 1 x 1 + β 2 x 2 + · · · + ǫ , where β i is the coefficient that captures the contribution of feature f i towards the dependent variable y, while β 0 and ǫ are the intercept and error terms, respectively 36 . In the context of this work, the independent variables are the gender, age group, medication usage, number of comorbidities, smoking status, IBD subtype, and IBD severity, while the dependent variable is the outcome (namely, outpatient, hospitalized/recovered and hospitalized/deceased). • Multinomial logistic regression (MLR) is also a statistical tool that fits the data to a line to find the association between the independent and dependent variables. Unlike MR, the data is passed through a logistic function that predicts the target or dependent variable. Moreover, the dependent variable in MR is continuous, while in MLR it is categorical, i.e., assuming a limited number of possible discrete values.

Statistical operations.
Given any pair of vectors v and v , we perform these statistical performance measures: between v and v measures the strength of a linear association between two variables, where the value PCC = 1 is a perfect positive correlation and − 1 is perfect negative correlation. • Principal component analysis (PCA) is a dimension reduction approach that projects each data point onto the first few principal components to achieve a lower dimensional representation of the data, while preserving most of the variation in the data. We consider the first two principal components, which capture the highest variance in the data 37 . Therefore, features contributing the most towards the first and second principal components are called primary and secondary factors, respectively. For instance, let there be |P|= 5 patients with following medications: Sulfasalazine/ mesalamine, Budesonide, Oral/parenteral steroids, 6MP/azathioprine monotherapy, Sulfasalazine/mesalamine and following outcomes: outpatients, hospitalized recovered, and hospitalized deceased. Then, v med=Budesonide = { 0,1,0,0,0} and v out=death = { 0,0,0,0,1} and cos(v med=Budesonide , v med=Budesonide ) = 0 suggests no association (by cooccurrence) between feature medication equals Budesonide and outcome equals death.
• Kendall's tau is a measure of the correlation for ordinal data. Values close to 1 indicate strong agreement in rank order, while − 1 indicate strong disagreement 38 . • Z-score is the number of standard deviations by which a data point is above or below the mean value. For any data point x is calculated as: • One sample proportion Z-test is a standard hypothesis testing approach. Given the number of trials and successor trials, one can test a null hypothesis (H 0 ) whether the proportion (i.e., fraction of successful trials) of the data equals a prespecified value.
Ethics approval. As

Results
Analytical design of the study. We outline the experimental design of our work in Fig. 1. First, we apply a random sampling approach on the summary statistics (introduced in "Preprocessing and dataset generation" section) of COVID-19 outcomes of IBD patients (outpatients, hospitalized/recovered and hospitalized/ www.nature.com/scientificreports/ deceased) to generate a large-scale clinical dataset. We then validate the dataset by comparing the trends in the odds ratio of logistic regression analysis of the generated dataset against that of the real data (as reported in the Brenner et al. 39 ). Next, we apply randomized decision trees (i.e., extra trees classifier) to identify the subset of key features contributing to the outcome. We perform the following analyses: (1) use regression analysis to quantify the contribution of the features towards the outcome and principal component analysis to identify the interrelationship between the primary and second features on the patient COVID-19 outcomes; (2) apply supervised and unsupervised machine learning approaches to estimate the accuracy of the significant features in predicting the outcomes; (3) calculate cosine similarity to determine the association between use of each medication and subsequent outcome, and (4) perform correlation studies on American Health Rankings (discussed in "American health rankings" section) to reveal the perils of predicting the locations with the highest concentration of COVID-19/IBD patients in the United States. We discuss the implications behind our findings in light of other reported studies on COVID-19/IBD patients.
Generation of patient data. We first applied our random sampling approach ("Preprocessing and dataset generation" section) to generate a COVID-19/IBD clinical dataset for 20,000 patients (Table 2) based on the summary statistics of 1739 Crohn's disease and 1323 UC patients reported by Brenner et al. 39 . In their study, they performed multivariable logistic regression to calculate the effects of age, sex, IBD subtype (CD vs UC/IBD-U), disease activity, smoking status, body mass index ≥ 30, and the number of comorbidities (0, 1, ≥ 2) on the primary outcome of severe COVID-19, defined as a composite of ICU admission, ventilator use, and/or death. They incorporated tumor necrosis factor (TNF) antagonist use (versus not) and sulfasalazine/5-aminosalicylate (5-ASA) use (vs not), as these were the two most commonly reported medication classes, and systemic corticosteroid use (vs not) on the basis of increased risk of infectious complications. A secondary outcome was the composite of any hospitalization and/or death. Finally, they calculated adjusted odds ratios (aOR) and 95% confidence intervals (CI) for (1) ICU/Vent/Death, (2) Hospitalization or Death and (3) Death, for each demographic or disease characteristic. In order to validate our proposed random sampling approach, we evaluated whether the relative order of the features ranked in the decreasing order of odds ratio that we similarly calculated based on our generated dataset matches the ordering based on the original data using Kendall's tau score. We ordered the input features in the generated dataset in the decreasing order of odds ratio for two outcome scenarios, namely (a) ICU/Ventilation/Death and (b) Death (see Supplementary Fig. S1). The ordered odds ratio on the generated dataset and real data shows Kendall's tau score (on a scale of − 1 to + 1) of (a) 0.73 and (b) 0.55, respectively, against that of the reported COVID-19/IBD data. This shows that the data we generated through random sampling preserves the trends of the original COVID-19/IBD data.
Age, medication usage, number of comorbidities and IBD severity are the key features affecting COVID-19 outcomes. We created ranked subgroups based on quantitative ranges of the features age group, number of comorbidities, smoking status and IBD severity (Table 3). Using feature importance and multiple linear regression analysis (see "Metrics" section), we sought to identify the patient features that help discriminate among the three COVID-19 disease outcomes: outpatients, hospitalized-recovered, and hospitalizeddeceased. We found that age group, medication usage and number of comorbidities have the highest importance Table 2. 5 rows of the dataset showing the features and outcomes for a patient with the features medication usage (medication), number of comorbidities (comorbidity), smoking status (smoking), IBD subtype (condition) and IBD severity (severity).  www.nature.com/scientificreports/ (also refer to "Metrics" section for details on feature importance), while IBD subtype (CD or UC), smoking status and gender have the least independent feature importance (Fig. 2a).
We found that age and number of comorbidities are correlated with the number of patients who died (Pearson's correlation coefficient > 0.35), while smoking status and IBD severity are largely uncorrelated (Fig. 2b). Interestingly, age is contingent on a variety of factors such as genetic background and its cumulative effect of generic responses to biological stress and environmental exposures, and there is evidence to suggest that the number of comorbidities increases with age and is larger in individuals 65 years and older 40 . Given the high potential of mutual information between age and number of comorbidities, we studied two feature combinations with either age or the number of comorbidities, specifically (a) medication usage, IBD subtype, IBD severity and age and (b) medication usage, IBD subtype, IBD severity and number of comorbidities, against outcomes (outpatients, hospitalized/recovered and hospitalized/deceased). Multiple linear regression analysis shows that age and number of comorbidities have the highest coefficients in their respective groups, followed by IBD severity and IBD subtype. This indicates that age and number of comorbidities are the key features affecting COVID-19/ IBD outcomes (refer to Supplementary Tables S1, S2).
Next, we analyzed the interrelationship between the COVID-19/IBD outcome and medication usage. To this end, we considered our generated dataset of |P|= 20,000 COVID-19/IBD patients, 11 IBD medications (see Fig. 3a) along with the three COVID-19 disease outcomes. For each medication m and disease outcome d, we calculated two vectors v d = 1 outcome(p)=d p ∈ P} and v m = 1 medication(p)=m p ∈ P} , where the indicator variable 1 c = 1 if condition c is true, and 0 otherwise. Essentially, this vector represents for each patient, the combination of medications being used. We applied the cosine similarity 1 − cos(v m , v d ) (see "Statistical operations" section) to compare the profiles between each medication and the outcome vector. Figure 3a shows that while using anti-TNF without immunomodulators (namely 6MP/AZA/MTX, used to treat Crohn's disease 41 ) exhibits a high similarity with the outpatient outcome, 5-ASAs (sulfasalazine/mesalamine) consistently align well with all three outcomes by exhibiting one of the highest similarities (0.4, 0.3 and 0.15) with OP, H-R and H-D among the list of administered medications. This suggests that 5-ASA usage has the highest degree of overlap with all the outcomes among all medications (see "Statistical operations" section for the details).
IBD severity, smoking status, gender and IBD subtype are secondary factors affecting outcome. We apply principal components analysis (PCA; "Statistical operations" section) on the 7 input features of patient data and identified two components, PC1 and PC2, that account for 18% and 14% of the variance in the data (see Fig. 3b). It is noticeable that PC1 clearly demarcates the OP and H-D patients. Figure 3c once again shows that the primary factors, namely, age, medication usage, and the number of comorbidities, have the highest contributing weights along PC1; conversely, IBD severity, smoking status, gender and IBD subtype are strong contributors along PC2, making them key secondary factors-clinical characteristics that help distinguish the outcomes of any two patients if their primary factors are identical. When looking more closely at medication usage, we see that 5-ASA (sulfasalazine/mesalamine) usage is associated with most of the COVID-19/IBD deaths-approximately 47% and 68% deaths in two sample groups (defined hereafter). This lends credence to the high cosine similarity between sulfasalazine/mesalamine and H-D in Fig. 3a. Also, UC accounts for nearly three times the death than that of Crohn's disease (3.7% and 1.3%, respectively.).
As discussed earlier in Fig. 2b, we consider three outcomes, outpatients (OP), hospitalized/recovered (H-R) and hospitalized/deceased (H-D). We noted that there is a more prominent separation between samples along PC2 at about 1.5 (dashed line in Fig. 3b). In the subsequent discussion, we term the samples above and below PC2 = 1.5 as the upper group (~ 3.5% of total patients) and lower sample group (~ 96.5% of total patients), respectively. We calculated the pairwise mean Euclidean distance across all pairs of samples such that outcome (p 1 ) = outcome (p 2 ) to gauge the relative difference between samples corresponding to an outcome on the PCA plot. Figure 4a,b show that for both upper and lower PC2 sample groups, samples with outcomes OP and H-R Given this high variation in the H-D samples, we further analyzed the proportions of each outcome in the upper and lower groups. We found that in samples from the upper group, despite having fewer samples, there is a higher proportion (~ 3% of cases in the upper group) of H-D outcomes (Fig. 4c). A greater proportion of COVID-19/IBD cases corresponded to Crohn's disease in both upper and lower regions (Fig. 4d), while UC patients, despite encompassing in fewer cases overall, account for a higher proportion of deaths, about 3% and 5% of total cases in the upper and lower groups, respectively (Fig. 4e). We performed statistical t-tests and found that UC patients appear more vulnerable to COVID than Crohn's patients (Supplementary Fig. S2). Given that gender, IBD subtype and smoking are key secondary features contributing to IBD deaths, as highlighted along PC2 (Fig. 3c), the two clusters (upper and lower regions, Fig. 4) suggest a high proportion of IBD patients who    www.nature.com/scientificreports/ are male and are smokers will have an outcome that results in death (Fig. 4f,g). Note that all smokers are in the upper region.

Supervised learning on age, medication usage and number of comorbidities predict outcomes.
To further evaluate the relationships between IBD features and COVID-19 outcomes suggested by the PCA, we performed classification experiments using multiple machine-learning methods (Fig. 5, Supplementary Table S3). We estimated the accuracy, precision and F-score (defined in "Metrics" section) of predicting COVID-19/IBD outcome using support vector machine (SVM), stochastic gradient descent (SGD), nearest centroid (NC), decision tree classifier (DTC), supervised neural network (SNN) and Naive Bayes (NB) classifiers. In Table 4, we summarize the hyperparameters employed in the supervised learning models. Given that the hospitalized-recovered (H-R) and hospitalized-death (H-D) account for a small fraction of the COVID-IBD dataset, we applied Synthetic Minority Oversampling 42 to augment the input dataset to give enough training data for each outcome. We observe that the supervised machine learning classifiers with the combined feature set of age, number of comorbidities and medication usage significantly outperforms the mean accuracy of the individuals' features. Figure 5a shows that overall, the majority of the accuracy scores for classifiers using just the three primary features exceed those classifiers that include all features. There is less deviation in precision and f-scores across approaches when using all features. The (1) precision from DTC, SNN, NB (Fig. 5b) and (2) the f-scores from SGD, SNN, NB (Fig. 5c) for classifiers using the three primary features outperform those using all features. This suggests that the predictions based on primary features exhibit a more consistent overall accuracy in outcome prediction. However, considering all the features during outcome prediction may yield an improved true positive rate, as manifested in the steady precision and f-scores across the different ML approaches.

COVID-19 numbers in IBD patients do not align with state health indices. SECURE-IBD data-
base provides the number of COVID cases in IBD patients reported in each state. We intended to investigate whether the frequency of cases correlated with the healthiness of each state. Using state health indices for nearly 50 criteria (Table 5), we calculated the Pearson correlation of the ranked list of states based on each state health index with the ranked list of states based on the increasing number of COVID-19 in IBD patients scaled by the population of the state. A high correlation would imply that healthy states have fewer COVID-IBD cases, and vice versa.
Although our initial hypothesis was that the healthy states would report few COVID-19 cases in IBD patients, we observed that the correlations with healthiness were negative for over 70% of the criteria. We provide the correlation and p values for the complete range of criteria reported in Fig. 6 (refer Supplementary Tables S4, S5 for full list). We tabulated the top 10 criteria (and correlations corresponding to the p values) in the increasing order of p values. Figure 6a-c show the Pearson correlation coefficients between (a) overall COVID cases and health ranking, (b) overall covid deaths and health rankings and (c) COVID-IBD cases and health rankings, with the negative correlations marked red. Across nearly all health indices, the healthy states report a higher number of COVID-19 cases in IBD patients. While this suggests that healthy states have more COVID-19/IBD  www.nature.com/scientificreports/ cases, we believe that this may also imply an underreporting from the US states ranking low on the American health index rankings.

Discussion
In this work, we carried out a comprehensive analysis of the clinical factors affecting the COVID-19 outcomes of IBD patients using machine learning methods. Current studies are severely hindered by the lack of patient samples. Thus, we employed a random sampling approach to generate a large COVID-19/IBD clinical dataset from published summary statistics. To verify the validity of these generated samples, we demonstrate that the relative rank of the odds ratio of the logistic regression analysis on the generated dataset largely aligns with that of the real COVID-19/IBD dataset. Our random sampling approach employs Bayesian statistics and a multinomial distribution making it highly generalizable with regard to the number of observable features. The results from this work are significant in several ways. First, we showed that the sampling method we designed can preserve the summary statistics of the original data for the patient features, namely gender, age, medication usage, number of comorbidities, smoking status, IBD subtype, and IBD severity. This indicates that our methodology is robust and could be used more generally in cases where individual patient data is limited or unknown but summary statistics are available. Second, we determined key primary and secondary IBD patient features related to COVID-19 outcomes and evaluated their effect on COVID-19 mortality and their potential utility in predicting outcomes. Using several methods, such as random forest classifier, multiple linear regression and correlation analysis, we established, in keeping with almost all existing findings, that age, medication usage and the number of comorbidities is the primary features, while gender (male), smoking and IBD severity are secondary features affecting COVID-19 outcomes in IBD patients. These findings are consistent with existing studies on 79 and 232 patients with IBD with COVID-19 in Italy and the United States that show that old age and comorbidities pose a far greater risk of negative COVID-19 outcome than other factors 14,43 . Finally, our correlation studies reveal the non-uniformity in the reporting of COVID-IBD cases. Specifically, there seems to be reduced reporting from US states with poor health rankings, suggesting that more rigorous data collection is necessary before the repository can be used to derive demographic inferences.
We specifically note that to date, the relationship between IBD patients that smoke and COVID-19 outcomes have been unclear. Our principal components analysis suggests that male patients with ulcerative colitis and that smoke face high risks of mortality, especially elderly patients with multiple preexisting conditions. Existing studies suggest that smoking affects the pulmonary immune function, increasing the risk of contracting infectious diseases, and excessive smoking has been linked with the progression of COVID-19 44,45 . Our findings further suggest that the male IBD patients, in particular, are more likely to die due to COVID, lending credence to prior clinical studies that have shown that females are less susceptible to viral infections and reduced cytokine production, and that female patients have higher macrophage and neutrophil activity and antibody production and response 46 .
Of keen interest in IBD is whether any medications exacerbate the infection rate and/or severity of COVD-19. Our analysis shows that having UC and using 5-ASAs (sulfasalazine/mesalamine) are linked with higher COVID-19 deaths. In the early stages of the pandemic, there was a documented case of an 80-year-old female with a 3-year history of UC, in maintenance with mesalamine, who had a fever and bloody diarrhoea. She was diagnosed with COVID-19 pneumonia and passed away after 14 days of hospitalization 47 . As previously mentioned, there is considerable uncertainty among gastroenterologists and patient support groups about the effects of IBD medications on COVID-19 susceptibility and progression, and whether treatment guidelines need to be www.nature.com/scientificreports/ modified in the COVID era 15 . The present work supports reports that suggested 5-ASA usage was associated with adverse clinical outcomes such as hospitalization or death 39,48 . These findings on primary (age, medication usage, and the number of comorbidities) and secondary features (IBD severity, smoking status, gender and IBD subtype) increase our understanding of vulnerable IBD patient groups. Our findings suggest that gastroenterologists should weigh the risks and benefits before recommending 5-ASA to elderly patients with multiple comorbidities. Our supervised machine learning classifiers using the three most discriminatory features (i.e., age, number of comorbidities and medication usage) show approximately 70% accuracy in predicting outcome, outperforming the accuracy for the complete feature set in a majority of the classification approaches. This suggests that these discriminatory features can be reasonably applied to prognosticate the risk associated with each IBD patient. Lastly, we demonstrated that the present COVID-19/IBD data repository appears to have fewer cases reported from unhealthy US states. This suggests that the current repository may not be useful for deriving accurate demographic information. Therefore, our proposed random sampling approach may be utilized in the future to generate a large-scale synthetic COVID-19/IBD data repository while taking into consideration the innate bias of data reporting.   Figure 6. The health ranking criteria and their sub-categories for US states. Pearson correlation between (a) overall COVID cases and health ranking, (b) overall covid deaths and health rankings and (c) COVID-IBD cases and health rankings.