An explainable model of host genetic interactions linked to COVID-19 severity

We employed a multifaceted computational strategy to identify the genetic factors contributing to increased risk of severe COVID-19 infection from a Whole Exome Sequencing (WES) dataset of a cohort of 2000 Italian patients. We coupled a stratified k-fold screening, to rank variants more associated with severity, with the training of multiple supervised classifiers, to predict severity based on screened features. Feature importance analysis from tree-based models allowed us to identify 16 variants with the highest support which, together with age and gender covariates, were found to be most predictive of COVID-19 severity. When tested on a follow-up cohort, our ensemble of models predicted severity with high accuracy (ACC = 81.88%; AUCROC = 96%; MCC = 61.55%). Our model recapitulated a vast literature of emerging molecular mechanisms and genetic factors linked to COVID-19 response and extends previous landmark Genome-Wide Association Studies (GWAS). It revealed a network of interplaying genetic signatures converging on established immune system and inflammatory processes linked to viral infection response. It also identified additional processes cross-talking with immune pathways, such as GPCR signaling, which might offer additional opportunities for therapeutic intervention and patient stratification. Publicly available PheWAS datasets revealed that several variants were significantly associated with phenotypic traits such as “Respiratory or thoracic disease”, supporting their link with COVID-19 severity outcome.


Introduction
The coronavirus disease 2019 (COVID-19) pandemic, caused by the infection with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), is challenging at an unprecedented level health, economical and societal systems worldwide. The SARS-CoV-2 infection is characterized by a large variation in consequence ranging from asymptomatic to life-threatening conditions such as viral pneumonia and acute respiratory distress syndrome (ARSD). ARSD is caused by an exaggerated host immune response leading to lung injury, which starts at the epithelial-interstitium-endothelial interface with increased vascular permeability and extravasation of immune cells, mostly macrophages, and granulocytes. Infected epithelial cells and debris bind immune cell receptors, triggering the release of in ammatory cytokines (predominantly IL-6, IL-1, and TNF-α) and activating broblasts, resulting in a cytokine release syndrome (Marini & Gattinoni, 2020).
Established host risk factors for disease severity, such as increasing age, male gender, and higher body mass index1, do not explain all variability in disease severity observed across individuals. Genetic factors contributing to COVID-19 susceptibility and severity may provide novel biological insights into disease pathogenesis mechanisms, new drug targets as well as new means for patient strati cation, also considered that, despite the recent development of vaccines, treating the disease remains an important goal in the clinics. The rst genetic factors described to contribute to COVID-19 severity were rare loss-offunction variants in genes involved in type I interferon (IFN) ( While GWAS studies provide solid foundations of the host genetic factors individually associated with COVID-19 severity, they most often fail to provide an organic picture about their interplay. By learning nonlinear patterns from data in a human interpretable fashion, explainable machine learning algorithms might help in understanding the multifactorial nature of the interactions between host genetics and COVID-19, at the same time providing effective tools for risk and severity forecasting. In 2020, Italian GEN-COVID Multicenter Study started to investigate how the combination of common and rare variants could determine COVID-19 severity in a pilot study including WES data of a rst small cohort of hospitalized patients (Benetti et al., 2020). Previous and ongoing efforts entailed machine learning techniques (i.e. LASSO logistic regression models) in combination with a boolean representation of genetic variants to identify the most informative features associated with the severity which were used to compile an Integrated PolyGenic Score for COVID-19 severity predictions ) . In this study, we combined variant case-control screening, supervised binary classi ers training, feature importance analysis, and dimensionality reduction techniques with pathway enrichment and phenotype association studies to identify a few dozens of genetic variants contributing to increased risk of severe COVID-19 infection from a Whole Exome Sequencing (WES) dataset of a cohort of Italian patients.

Results
Comparing genetic variation in severe and asymptomatic individuals We considered the Whole Exome Sequencing (WES) dataset of germline variants from 1982 European descent patients provided by the GEN-COVID Multicenter Study group . All subjects were classi ed according to the grading scheme by the World Health Organization (WHO) de ned by the following categories: 0=not hospitalized (a-or pauci-symptomatic); 1=hospitalized without respiratory support; 2=hospitalized O2 supplementation; 3=hospitalized CPAP-biPAP; 4=hospitalized intubated; 5=dead. Demographic (sex, age, and ethnicity) and clinical data (family history, pre-existing chronic conditions, and SARS-CoV-2 related symptoms) were also collected (Fig. 1A).
We started our analysis from a total of 1.057M simple variants which were screened to identify mutations associated with severe patients, likely representing risk factors, from those associated with asymptomatic patients, more likely contributing to protection. We grouped severe patients from clinical groups 5, 4, and 3 which were contrasted against the asymptomatic ones, considered as controls (group 0). We further re ned the grading classi cation by retaining only those patients with severity grade matching the prediction from an ordered logistic regression model using age as input feature for sexstrati ed patients (see Methods), yielding a total of 841 samples (518 severe, 323 asymptomatic; Fig. 1B). We employed log odds ratio statistics, using an additive model, to screen variants signi cantly associated with either severe or asymptomatic groups (see Methods). In order to nd a robust set of variants to be used as features set for downstream ML and pathways analysis, we performed the screening on the majority portion (training set) of a randomly split dataset (keeping 80% of the samples for training and 20% for testing). To ensure robustness, we repeated the splitting procedure ve times, employing a strati ed 5-fold cross-validation scheme, by performing the screening on the training set and nally retaining those variants found to be signi cantly enriched in each of the ve splits ( Fig. 1D; see Methods). We found on average 1130 variants signi cantly enriched across the ve folds (Table S1).

Genetic variants predict severity through supervised ML classi ers
We embedded the strati ed 5-Fold screening within a supervised classi er training procedure ( Fig. 1D; see Methods). For each random split of the dataset, the training set (80% of the original dataset) was used both for variant screening and model training, which was then tested on the corresponding held-out portion (20% of the dataset). For each screened random split, we trained distinct models using a strati ed 5-Fold Cross-validation (5-Fold CV) grid search to estimate optimal hyperparameters for supervised classi ers training ( Fig. 1D; see Methods). Speci cally, for each of the ve random splits we trained four independent supervised classi er algorithms, i.e. Logistic regression (LogReg), Support Vector Machine (SVC), Random Forest (RF) and Extreme Gradient Boosting (XGBoost), yielding a total of 20 models. Each of them was then nally tested on the corresponding held-out set from the random split and performance assessments were collected. XGBoost was the algorithm that displayed the smallest drops between training and testing accuracies, achieving the best average performance during testing across the ve folds ( Fig. 2A; Table S2). In more details, the best XGBoost model had the following performances: Precision=77.27%, Recall=83.33%, MCC=46.69%, ROC_AUC=80, Accuracy=75%, F1-score=80.2% (Table  S2).
Overall, we found that 3217 unique variants (out of a total of 3258 unique, screened variants), corresponding to 2546 unique genes, had non-zero coe cients in at least one of the ve, tree-based models (i.e. RF or XGBoost). However, the XGBoost classi er led to a sharper reduction of relevant variants (1086, corresponding to 1049 genes, with non-zero feature importance in at least one model), consisting of a subset of those identi ed with the RF models. As expected, clinical covariates such as age and gender were found among the features with the highest median of the distribution of importance coe cients collected from XGBoost models (Fig. 2B). Among this shortlist, only 16 variants (and corresponding genes; Table S1) consistently received non-zero coe cients in all tree-based models, out of which 9 variants were found to be enriched (Fig. 2B,C). To con rm the predictive performance of these variants, we re-trained the models by considering only this subset of variants, plus age and gender covariates, and we calculated aggregated performances by considering the median of the probabilities outputted by each model for each sample in the testing set (see Methods). While age and gender covariates alone retained high predictive power (ROC_AUC=80%), the addition of these most informative genetic features led to an increase of performances (ROC_AUC=86%, best model ROC_AUC=91%; Fig. 2D; Table S3).
Remarkably, when we nally tested the ensemble of models trained with only these informative variants on a follow-up cohort of 618 individuals (122 asymptomatic, 496 severe; Fig. 1C) we achieved good performances, either at the individual model level or at the ensemble one (Table S4). In fact, when computing aggregated metrics by considering the median of the probability distribution collected from the ensemble of models (Table S4,5; see Methods), the ensemble of models was able to identify severe patients with good accuracy (ACC=81.88%) and with a ROC_AUC=96%, performing considerably better than the ones obtained by training with only covariates or variants ( Fig. 2E; TableS4).
Risk and protective genetic factors impinge on modular, interconnected networks underlying distinct biological processes.
We focused our attention on the subset of variants receiving non-zero feature importance in at least one XGBoost model which we functionally analyzed to provide a mechanistic explanation for their interaction with COVID-19 infection. We performed pathway analysis by mapping mutated genes in a functional interaction (FI) network (i.e. Reactome FI network; see Methods). We built a general FI network (Fig. 3B), as well as networks speci c for clinical groups, by grouping variants and genes enriched in severe and asymptomatic patients (Fig. 3A). Pathway analysis on group-speci c networks revealed patterns of signi cantly enriched processes in either asymptomatic or severe patients (Fig. 3A).
In severe patients we found signi cantly enriched processes associated to cardiomyopathies, e.g.
The general FI network comprised a total of 344 mutated genes and 630 functional interactions, remarking a high degree of interconnection between affected genes, which participate in different, crosstalking biological processes. Cluster analysis on the general FI network revealed distinct modules characterized by the enrichment of speci c pathways, and by a variable composition in terms of variants enriched in either severe or asymptomatic patients. Intriguingly, we found out that no cluster exclusively contained variants enriched in severe or asymptomatic patients. In detail, the largest cluster (i.e. Module 1; 43 nodes) encompassed Fanconi anemia pathway (FDR= 2.46e −07 ) and DNA repair processes such as HDR through HRR or SSA (FDR= 4.51e −06 ) or Homologous recombination (FDR=1.76e −03 ) (Fig. 3B). In this cluster, we found that the gene characterized by the variant with the strongest model support (ms) (i.e. fraction of tree-based models assigning non-zero feature importance; see Methods) is MYBBP1A rs117615621, which is enriched in asymptomatic patients (log odds ratio (lor)= -1.34; pval= 0.0065; ms=90%; Table S1,S8).
The second-largest module (Module 2; 42 nodes) involves genes mediating signal transduction cascades such as those mediated by Ras GTPases, e.g. Rap1 signaling pathway (FDR=1.01e −04 ) or MAP kinases, e.g. MAPK signaling pathway (FDR=5.95e −04 ) ( Fig. 3B, C). We also found processes more directly linked to the immune and in ammatory response to the virus, such as the JAK-STAT signaling pathway (FDR=1.11e −03 ), Cytokine-cytokine receptor interaction (FDR=1.92e −03 ), and Interleukin-6 family signaling (FDR=1.92e −03 ) (Fig. 3B, C). All these three pathways are participated by the CNTFR gene, which codes for the alpha subunit of the receptor for the ciliary neurotrophic factor, and is affected by a novel variant (chr9:34557898:A: T) enriched in severe patients (lor=1.230663067; pval=0.00021727; Table S1). Intriguingly this variant was ranked in the top20 of genes with the highest median importance (Fig. 2B) and received 100% model support (Fig. 2C), indicating that all the tree-based models considered it as important for the classi cation of severity. Another variant with 100% support affecting a gene within the same cluster is rs150021157, also signi cantly enriched among severe patients (lor=1.373871841; pval=0.001927211; Table S1,S8), affecting the PCSK5 gene, a serine endoprotease which processes various proteins including various cytokines, NGF, renin and which has been reported to regulate the viral life cycle (Decroly et al., 1996).
The third-largest module (Module 3; 38 nodes) is characterized by the Regulation of nuclear SMAD2/3 signaling pathway (FDR=1.95e −03 ) as the most enriched pathway, therefore being tightly interconnected with cluster 2. It was previously shown that SARS nucleocapsid proteins interact with SMAD3 and modulate TGF-β signaling (Zhao et al., 2008), another pathway signi cantly enriched in Module 3 (FDR= 0.014). The latter pathway has also been con rmed to drive a chronic immune response in severe COVID-19 (Ferreira-Gomes et al., 2021).
We found additional interesting, potentially relevant pathways in the remaining modules.  Table S8). The PLEC gene, involved in laments network by anchoring intermediate laments to desmosomes or hemidesmosomes via interlinks with microtubules and micro laments, belongs to this cluster and it's affected by the variant rs140300753 (lor=1.16, pval=0.002881778, ms=100%), which is enriched in severe and received 100% support from tree-based models (Table S1). In Module 5 one of the most signi cantly enriched pathways is Cilium Assembly (FDR= 2.64e −04 ), which entails CEP131 affected by the variant rs2659015, which is enriched in asymptomatic patients (lor=-1.92; pval=0.001517767) and which received 100% model support. Intriguingly CEP131 has been recently found to be signi cantly regulated by phosphorylation during viral infection (Vanderboom et al., 2021).
In addition to several other immune response-related processes (e.g. MHC class II antigen presentation in Module 5, FDR=7.13e −03 ; Table S8), in the remaining clusters we found additional processes with high translational and therapeutic potential. For instance, we found several GPCR-signaling instances signi cantly enriched in Modules 6 (e.g. G alpha (i) signaling events, FDR= 3.69e −04 ) and 8, which exclusively entails GPCR-downstream signaling pathways and where again the G alpha (i) signaling events (FDR= 2.56e-09) and G alpha (q) signaling events (FDR= 4.83e-08) are the two downstream pathways most signi cantly over-represented( Fig. S3; Table S8).
We also found that a few genes whose variants have been identi ed through our pipeline are among the ones carrying top associations to severity as assessed from studies of the COVID-19 HGI (https://app.COVID-19hg.org/variants)(COVID-19 Host Genetics Initiative, 2021). In detail, variants of 9 out of the 43 genes identi ed from GWAS studies are also present in our list, including: ABO, ARL17A, ARL17B, DPP9, LRRC37A, LRRC37A2, RAVER1, TMEM65, ZBTB11 (Table S1).
Severe patients tend to cluster together using only more informative variants.
We applied unsupervised clustering and dimensionality reduction techniques (i.e. Principal Component Analysis (PCA)) to group patients based on the genetic distance calculated by considering the most informative variants selected after screening and supervised machine learning procedure. By projecting the patients on the rst two PCs followed by k-means clustering (see Methods), we detected three groups of patients on the original cohort ( Fig. 4A-C). The two largest clusters were separated by PC1. The largest one, 515 patients, was characterized by a majority of severe cases (78% of the total). The second cluster was instead characterized by a prevalence of asymptomatic patients (70%) of the total. Finally, a third small cluster was identi ed through the combined usage of PC1 and PC2 and it was characterized almost exclusively by severe patients (95% of 24 patients in total). Notably, the severity of this cluster is only partially explained on the basis of either gender (59% males and 37% females; Fig. 4C) or age (Fig.S4A). This cluster was characterized by peculiar genetic features, with a smaller number of variants and a neat prevalence of risk over protection factors (Fig. S4B). Remarkably, a total of 7 (out of 9 overall enriched in severe patients) variants with 100% support from XGB models were also found in this cluster (Table S9). Network analysis of the mutated genes in this predominantly severe cluster highlighted several common processes as well as candidates for drug targeting. In particular, several GPCRs (ADRB2, ADRA1, GRM6), ion channels (GRIN1, CACNA1G), (receptor tyrosine) kinases (NTRK1, CSF1R, GAK) and nuclear hormone receptors (AR, THRB) participate to this network and can be readily targeted by approved drugs (Fig. 4C; Table S10). Important variants are associated to disease traits linked to COVID-19 severe phenotypes To provide further evidence of a functional relationships between our variants and COVID-19 severe phenotypes, we checked available open-access integrative resources (i.e. Open Target Genetics initiative (Ghoussaini et al., 2021)) which aggregate human GWAS and functional genomics data to link between GWAS-associated loci, variants, and likely causal genes. In particular, we considered Phenome Wide Association Study (PheWAS) analysis considering a wide range of diseases and traits to identify the phenotypes associated with our variants (see Methods). Intriguingly, we found that many variants identi ed through our approach are associated with traits or phenotypes which might be linked with either risk or protection from severe consequences to the viral infection.
For example, by considering variants with non-zero importance in at least one XGB model we found that those enriched in severe patients were 70% of the total associated with the category "respiratory or thoracic diseases" (see Figure 5A). Among the speci c traits with strong associations to more supported variants, we found instances such as "Doctor diagnosed emphysema" (ITPKA, rs41277684; LTK, rs35932273), the latter variant associated also to "Other alveolar and parietoalveolar pneumopathy", "Respiratory disorders in diseases classi ed elsewhere" (KCNB1, rs34467662), "Chronic bronchitis/emphysema" (C12orf43;HNF1A, rs11065390; SLC47A2, rs34399035), "Acute sinusitis" (SHANK2, rs146204677), "Pleural plaque" (CFAP74, rs141833643), "Allergic asthma" (SYTL2, rs61740616 and rs35751209), "Symptoms and signs involving the circulatory and respiratory systems" (PCSK5, rs150021157) (Fig.5B). Although more weakly associated and supported by our models, we also found several associations with chronic obstructive pulmonary disease (COPD) both in "respiratory or thoracic diseases" and in "infectious disease" categories (Table S11). Other disease categories displaying a net prevalence of phenotypic associations for variants enriched among severe were "immune system disease", with multiple variants associated with speci c traits such as "Autoimmune diseases" "Immunode ciency with predominantly antibody defects" or "Noninfectious disorders of lymphatic channels", and "pancreatic disease" (Fig.5A; Table S11).
Two of the variants enriched among severe patients which were found by our models to be invariably relevant for severity classi cation (i.e. PCSK5 rs150021157 and PLEC rs140300753) were signi cantly associated with the "Abnormalities of breathing" phenotype (pval=0.0000040 and pval=0.00016, respectively), suggesting that patients carrying these variants might be at higher risk due to pre-existing di culties of breathing ( Fig.S5; Table S11).
Other general categories of traits that might be linked to severe COVID-19, such as "Cardiovascular disease" or "Infectious disease" showed similar distributions of associations of risk or mitigation factors ( Figure S6). Interestingly other categories, such as "Integumentary system disease" showed instead a prevalence of associations with mitigation factors ( Figure S6).

Discussion
In this study, we have set up a multifaceted computational strategy to dissect patient genetic variants which might interplay with the SARS-Cov2 virus to increase the risk of, or to protect from, a severe response to infection.
We integrated into a strati ed k-fold scheme a pipeline to perform variant features screening followed by machine learning model training and testing to robustly identify variants associated with severe response to COVID-19 infection. Our pipeline allowed a drastic reduction of the initial number of variants by several orders of magnitudes: from an initial set of approximately 1M unique variants derived from WES to 1k variants receiving non-zero feature importance in at least one of the tree-based modes. By only considering the variants with full support, i.e. always found to have non-zero feature importance in all the tree-based models, we further reduced the pool to only 16 variants. Models retrained with only full-support variants (plus age and gender as covariates) achieved superior performances (median ROC_AUC=86%, best model ROC_AUC=91%). Although models trained with only patients age and gender already showed good performances in severity prediction (median ROC_AUC=80%), con rming the predictive power of these covariates, the increase in performance followed by the inclusion of curated genetic information provides the foundation for integrated tools for COVID-19 severity forecast and patient strati cation. When tested on a follow-up cohort of more than 600 our models achieved remarkable performances in identifying severe patients with good accuracy (ACC=81.88% and ROC_AUC=96%), performing considerably better than the ones obtained by training with only covariates or variants ( Fig. 2E; TableS4).
The interpretability of our models allowed us to shed new light on the complex landscape of genetic interactions interplaying with virus genetics to contribute to a severe response to COVID-19 in an Italian cohort. Among the 16 variants with 100% support, only 6 genes (37%) were annotated in the largest pathway knowledgebase, i.e. Reactome (Jassal et al., 2020), suggesting that unannotated variants might modulate the interaction with the virus through yet-to-be-discovered biological mechanisms. Intriguingly, we found that two of these highly supported variants, i.e. chr9:34557898:A:T (CNTFR) and rs150021157 (PCSK5) interact withinin the second-largest module identi ed on the interaction network of the genes affected by mutations within our study. This cluster, which is moreover the only one characterized by two fully supported variants, is highly enriched in pathways linked to immune response and in ammation, such as the such as JAK-STAT signaling pathway, Cytokine-cytokine receptor interaction, and Interleukin-6 family signaling.
We found that variants enriched in severe patients are involved in cardiomyopathies processes, supporting the established notion that patients with heart disease or its risk factors are at greater risk of severe consequences following COVID-19 infection, including hospitalization, ventilation, or death (Harrison et al., 2021). Additional processes signi cantly enriched among severe mutations was ECM, whose importance in mediating the interaction with viral particles have been highlighted by a nitypuri cation proteomics experiments (Gordon et al., 2020). Recent experiments also con rmed a role for integrins in binding to UV-inactivated viral particles, through which outside-inside signaling is elicited via binding to Gα13 (Simons et al., 2021). Vesicle-mediated transport, such as clathrin-mediated endocytosis, has been shown to mediate a key entry point for SARS (Wang et al., 2008). Moreover, C-type leptin receptors have been shown to engage with the virus inducing robust proin ammatory responses in myeloid cells that correlated with COVID-19 severity (Qiao et al., 2021).
We also found several GPCR signaling instances signi cantly enriched among the network modules. In particular, one cluster of the network was highly speci c for these terms, in particular G i and G q signalling processes, which mediate vascular in ammation. In particular, the G q pathway contributes to regulating calcium signaling, which is one of the most enriched processes in our dataset and which leads to endothelial barrier disruption via adherens junction disassembly (Birch et al., 2021). On the other hand, G q signaling might also contribute to transactivate JAK-STAT pathway via (ERK)1/2 signaling (Birch et al., 2021), the latter in turn also activated by G i signaling (Goldsmith & Dhanasekaran, 2007). It has also been recently shown that the C5a-C5aR1 axis, which also signals intracellularly through G q , plays a key role in On the other hand, some of the processes that we found signi cantly enriched among asymptomatic patients have been previously put in connection to SARS viral infection. For example, members of the machinery for DNA damage response have been shown to interact and affect the response to several DNA and RNA viruses (Lilley et al., 2007) and it has been recently demonstrated that these pathways are also triggered by SARS-CoV2 in vitro cellular models (Victor et al., 2021). The Fanconi anemia pathway is tightly linked to DNA repair processes involving homologous recombination and genome integrity (Michl et al., 2016). We therefore speculate that patients carrying variants on these pathways might differently interact with the virus, modulating a milder response to viral infection.
In general we found multiple, recurrent disease traits associated with the variants identi ed. The variants rs150021157 and rs140300753, characterized by full support during supervised learning, also provide an example of associations to phenotypes that might play a role in COVID-19 severity, such as "Abnormalities of breathing phenotype". Some categories show a prevalence of associations with risk factors, such as "respiratory or thoracic disease", including speci c traits such as chronic bronchitis, emphysema or COPD (the latter also found in the "infectious disease" category). Other categories enriched for associations with variants enriched in severe patients are "immune system disorders", including traits such as immunode ciency with antibody defects, or "pancreas disease", including several instances mainly associated to Type 2 diabetes, which is a known risk factor for severe COVID-19 (Onder et al., 2020) and whose molecular connection to cytokine storm in ammatory response has now begun to emerge (Melvin et al., 2021). Taken together, these results further corroborate our analysis.
Our model is complementary to previous and ongoing efforts entailing machine learning techniques (i.e. LASSO logistic regression models) and a boolean representation of genetic variants to identify the most informative features associated to severity to compile an Integrated PolyGenic Score for COVID-19 severity predictions  . While we expect that some of the variants identi ed in this study might be speci c for the Italian population, we believe that our approach could be readily trained on different cohorts to identify additional biomarkers for patient strati cation in the clinics. Our capability to understand and forecast the genetic factors contributing to COVID-19 disease severity will certainly bene t from the availability of larger sequencing cohorts, the usage of more advanced methods for case-control associations in WES studies, new methodological advancement in the explainable AI eld, as well as on our prior-or data-driven knowledge of biological mechanisms linking genetic variants to disease phenotypes.

Dataset and Pre-processing
We used the whole-exome sequencing (WES) dataset of 1982 European descent patients collected from the GEN-COVID Multicenter Study group coordinated by the University of Siena (https://clinicaltrials.gov/ct2/show/NCT04549831) . The WES dataset contained a total of 1.057M unique simple variants. Patients were classi ed according to the grading scheme by the World Health Organization (WHO). The grading classi cation contained the following categories: 0=not hospitalized (a-or pauci-symptomatic); 1=hospitalized without respiratory support; 2=hospitalized O2 supplementation; 3=hospitalized CPAP-biPAP; 4= hospitalized intubated; 5=dead. We considered patients from more severe groups, i.e. 3,4, and 5, as cases, and asymptomatic patients from group 0, as controls, for a total of 1078 patients. We further re ned the grading classi cation based on an ordinal logistic model which uses age as input feature for sex-strati ed patients  and we retained only those patients whose grading classi cation was concordant with the one adjusted by age. This yielded a nal set of 841 samples for downstream analysis.

Strati ed K-fold split of sample cohort into train and test sets
We embedded a strategy for variant screening in, strati ed k-fold cross-validation (using the Strati edKFold function from the scikit-learn library) scheme to generate 5 random splits, into a training and testing test, of the original dataset. Each fold was constituted by an 80% training set which was also employed for variant screening and a 20% testing set. The variants in the test set were curated from the variants screened in the training set. Through the strati ed 5-fold approach, we made sure that all the samples of the dataset were employed for testing.

Variant screening
We employed a Log-Odds Ratio (LOR) statistics to perform case-control association and to screen variants associated with either severe or asymptomatic patients in each of the training sets for each of the ve folds generated.
GATK best-practices were used to de ne the variant calling pipeline, as previously described . Then, a contingency table to measure the enrichment of reference (Ref) or alternative (Alt) alleles in either severe or control groups was de ned by employing an additive model, whereby homozygous genotype (1/1) has twice the risk (or protection) of the heterozygous type (0/1 or 1/0). We employed the Table2x2 function from the statsmodels library to calculate LORs values and associated p-values and con dence intervals from the the contingency table in Figure S7, respectively employing the functions log_oddsratio, log_oddsratio_pvalue() and log_oddsratio_con nt(). We ltered variants with the following characteristics: p − valuecript > and |LOR| ≥ 1. Variants with LOR > 1 are enriched among severe, while those with LOR < -1 are enriched among asymptomatics.

Feature matrix generation
For each split, we generated a feature matrix for the training set by assigning the allele counts of each screened variant for each sample of the training: i.e. 0 for genotype 0/0, 1 for genotypes 1/0 or 0/1, 2 for genotype 1/1. The feature matrix for the test set was de ned by considering only variants identi ed as signi cant after screening the training set of the corresponding split and by assigning the allele count of each sample of the test set. We also included as additional features age, which was normalized, and gender, which was binarized by setting males to 0 and females to 1. Severe patients from group "3+4+5" were given the classi cation label "1", the asymptomatic patients from group 0 were given the label "0".

Feature selection: Removal of Multicollinearity
We employed feature selection techniques to further reduce the number of considered features initially screened through the Log-Odds-Ratio statistics. We tried several approaches, including Lasso, ElasticNet and Multicollinearity, in combination with supervised training approaches (see below). After training several classi ers with the variants selected with each of these methods on a smaller cohort of 1200 samples (data not shown), we found that removing multicollinearity from features by considering variant allele counts with correlation coe cients corr.≤|0.8|) gave the best results. The screened features with little or no effects of multicollinearity formed the nal 80% training sets in each fold and the nal 20% corresponding validation sets used for training the supervised machine learning models.

Supervised Binary Classi cation
We trained supervised learning models for binary classi cation tasks by employing several algorithms, i.e. Support Vector Machine, Logistic Regression, Random Forest, and Extreme Gradient Boosting classi ers. Support Vector Classi er (SVC) a popular machine learning method that classi es data points utilizing the concept of hyper-plan and kernel tricks to nd ts that best separate the data cloud. In this study, we used the popular Jupyter notebook and scikit-learn python package to import the "sklearn.svm" SVC classi er model. We rst set the SVC default regularization parameter "C" to 1, the class weight to "balanced" in order to account for imbalanced classi cation problems in the dataset. The default linear kernel was used rst with the prediction probability set to true. The GridSearchCV was used to select the best hyperparameter values for the estimator "C", "gamma", and the kernel (Linear, Radial Basis Function (RBF), and polynomial) that are critical to the performance of the SVC classi er. The best GridSearchCV estimator hyperparameter values that were used to train our dataset were identi ed as the RBF kernel, C = 10, and gamma set to 0.1.
Logistic Regression a binary classi cation regression model that uses the logistic function to estimate the parameters of the logistic model. We import from the scikit-learn package the "sklearn.linear_model" the Logistic Regression model function. We rst set the default logistic model classi er parameters; "class weight = balanced", C = 0.3 and solver = sag. The best GirdSearchCV estimator values used to train our dataset uses the regularization penalty of l1 (Lasso), C = 0.7, and solver = saga.
Random Forest(RF): an ensemble learning method that employs a bagging strategy. Multiple decision trees are trained using the same learning algorithm, and then predictions are aggregated from the individual decision tree. From the "sklearn.ensemble" library, we import the Random Forest Classi er function. The RF default model parameters use a class weight set to "balanced", maximum depth (max_depth) of the decision trees was set to 80, the number of features (max_features) was set to 2, minimum samples (min_samples_leaf) leaf of 3, minimum samples split (min_samples_split) of 10, and the number of trees (n_estimators) in the forest was set to 300. The GridSearchCV best model estimator parameters were "bootstrap = True", "max_depth" = 110, "max_features" = 2, "min_samples_leaf" = 5, "min_samples_split" = 10, and "n_estimators" = 100.
Extreme Gradient Boosted Trees classi er (XGBoost) an ensemble learning classi er family that utilizes boosting strategy to combine a set of weak learners and delivers improved prediction accuracy. We import from the XGBoost package "xgboost" library and xgboost function. We de ned the data matrix (training feature set and classi cation label). We set the default XGBoost classi er model parameters class weight to "balanced", learning objective to "binary logistic". The best GridSearchCV estimator parameters values we used to train the dataset were "learning_rate" = 0.01, "max_depth" = 3, "n_estimators" = 140.
In summary, for each of the four ML models, we performed a parameter optimization through grid search (GridSearchCV), using the accuracy_score during grid search as a scoring method. We performed a 5folds cross-validation, by splitting 80% for training and 20% for validation in each fold, repeated three times, using the Strati edKFold function with n_splits=5 and n_repeats=3. We also set the class weight parameter to ''balanced" in each of the ML algorithms employed. Both model training and hyperparameters optimization was done with a Python Jupyter notebook interactive web-based development environment using the scikit-learn and the xgboost packages. Model performances on the testing set were evaluated through the following metrics: Accuracy, F1, Precision, Recall, Matthew correlation coe cient (MCC), ROC_AUC.
A consensus voting approach was used to aggregate validation prediction probability scores of the four ML algorithms (SVC, Logistic Regression, Random Forest, and XGBoost classi ers) from each of the (20%) testing sets from each fold by considering the median of the probability distribution collected from the ensemble of models. The features (variants) that received non-zero weight during training of the supervised ML methods (Random Forest and XGBoost classi ers) in each fold were combined across the 5-fold for further interpretation.
We performed a randomization test (i.e. Salzberg's test) to assess over-tting (Salzberg, 1997), where we replace the original phenotypic labels of the training matrix with randomly assigned labels while preserving the ratio of the number of positive (severe) and negative (asymptomatic) patients (Table S12).

Feature importance scores
The feature importance assigns weight scores to individual features that interact to predict a particular event in the model. Feature importances for RandomForest and XGBoost models were calculated as the mean decrease in impurity for the feature using the feature importances function from xgboost. The feature importance (weights) scores assigned from these models' predictions were aggregated across the 5-folds to generate a non-zero panel of variants for further downstream analysis.
Final testing on a follow-up cohort We tested the best performing models trained using most supported variants with and without covariates on a followup cohort of sequenced, Italian patients. An initial set of 838 samples corresponding to grading groups 0, 3, 4 and 5 were re ned by applying the same ordered logistic regression classi cation adjusted_by_age, which yielded a nal set of 618 individuals (122 asymptomatic, 496 severe). We curated the allele counts of the 16 most informative variants, identi ed in the rst stage of the analysis and model training, from this new set of patients and we used them, together with age and gender, as features for the testing. We evaluated the performances of the ensemble of the 20 models both on an individual as well as on an aggregated level, by calculating aggregated metrics obtained from the median of the probability distribution outputted by the ensemble of the 20 models on the testing samples.

Principal Component Analysis (PCA) and clustering
The variants with non-zero weights from best performing tree-based models were remapped back into the feature space to form a new feature count matrix covering 100% of the samples (i.e. 841 individuals). This reduced feature matrix was analyzed using Principal Component Analysis (PCA) techniques to reduce the dimensional space. In order for us to do this, we utilized the "sklearn.decomposition" library to import the PCA function. We standardized the feature count matrix using the "sklearn.preprocessing" library to import the Standard Scaler function. We transformed the normal feature count matrix considering the 1st and 2nd PCA components. We further employ the K-means clustering technique (using the "sklearn.cluster" library to import the "KMeans" function) to visualize and cluster the 2D PCA components (1st and 2nd dimensions). We set the default cluster size to 3, the maximum iteration (max_iter=1000), and a tolerance value (tol=1e-04). Clusters of patients that express interesting severity patterns were further analyzed using the pathway enrichment for biological interpretations and implications.

Pathway Enrichment Analysis
The pathway enrichment analysis was done using the ReactomeFIViz plugin (Wu et al., 2014) available in Cytoscape (Lotia et al., 2013). The genes corresponding to variants with non-zero feature importance from XGBoost were used to construct a Functional Interaction (FI) network. The general FI network comprised all the genes affected by variants with non-zero feature importances in both patient groups. Node diameter is proportional to the number of variants with non-zero coe cients in any tree-based models. Node color is instead proportional to the LOR with the highest absolute value among the variants associated with a given gene. Modules within the network were identi ed through spectral partition clustering (Newman, 2006). Reactome pathways over-representation analysis (FDR<0.1) was calculated on either the whole network or for each individual module. We also generated group-speci c networks by keeping separated genes with variants enriched in severity from those enriched in asymptomatic and performed pathway over-representation analysis (FDR < 0.1) on the distinct networks.

Retrieving associations between variants and disease traits or phenotypes
We retrieved associations among the variants identi ed in our study and disease traits or phenotypes through the Open Targets Genetics platform (Ghoussaini et al., 2021). We interrogated the database using the GraphQL query language embedded in a python script and by inputting the variant coordinates (given by chromosome nr, position, Ref, and Alt allele). For each PheWAS association, we retrieved the following data: eaf, beta, se, nTotal, nCases, oddsRatio, studyId and pval. Only PheWAS with oddsRatio > 1 and pval <0.001 were considered. The statistics were done only for the variants with non-zero feature importance from XGBoost models.

Data availability
All the scripts and models generated are available at the following URL: https://github.com/raimondilab/An-explainable-model-of-host-genetic-interactions-linked-to-Covid19severity Declarations Acknowledgements F.R. was supported by the Italian Ministry of University and Research through the Department of excellence "Faculty of Sciences" of Scuola Normale Superiore. We gratefully acknowledge computational resources of the Center for High Performance Computing (CHPC) at SNS. We are grateful to Luigi Ambrosio for his initiative and for helpful discussions. This study is part of the GEN-COVID Multicenter Study, https://sites.google.com/dbm.unisi.it/gen-COVID, the Italian multicenter study aimed at identifying the COVID-19 host genetic bases. Specimens were provided by the COVID-19 Biobank of Siena, which is part of the Genetic Biobank of Siena, member of BBMRI-IT, of Telethon Network of Genetic Biobanks (project no. GTB18001), of EuroBioBank, and of RD-Connect. We thank the CINECA consortium for providing computational resources and the Network for Italian Genomes (NIG; http://www.nig.cineca.it) for its support. We thank private donors for the support provided to AR (Department of Medical