Presence of rare potential pathogenic variants in subjects under 65 years old with very severe or fatal COVID-19

Rare variants affecting host defense against pathogens could be involved in COVID-19 severity and may help explain fatal outcomes in young and middle-aged patients. Our aim was to report the presence of rare genetic variants in certain genes, by using whole exome sequencing, in a selected group of COVID-19 patients under 65 years who required intubation or resulting in death (n = 44). To this end, different etiopathogenic mechanisms were explored using gene prioritization-based analysis in which genes involved in immune response, immunodeficiencies or blood coagulation were studied. We detected 44 different variants of interest, in 29 different patients (66%). Some of these variants were previously described as pathogenic and were located in genes mainly involved in immune response. A network analysis, including the 42 genes with candidate variants, showed three main components, consisting of 25 highly interconnected genes related to immune response and two additional networks composed by genes enriched in carbohydrate metabolism and in DNA metabolism and repair processes. In conclusion, we have detected candidate variants that may potentially influence COVID-19 outcome in our cohort of patients. Further studies are needed to confirm the ultimate role of the genetic variants described in the present study on COVID-19 severity.


Results
Clinical and demographic characteristics. A total of 44 unrelated patients with very severe COVID- 19) that required intubation, non-invasive ventilatory support or did not survive to SARS-CoV-2 infection were included in the present study. A total of 16 patients (36%) did not survive, most of them received invasive ventilatory support (intubation, 44%) and 31% received exclusively non-invasive ventilatory support ( Table 1). The median age was 46 years, and patients were mainly males (70%, Table 1). Most patients were of European ancestry (86%), except for 7 Admixed Americans and one patient from another ethnicity. Main clinical characteristics of this cohort such as pre-existing diseases and COVID-19 management are summarized in Table 1. The most frequent comorbidities were obesity (defined as body mass index > 33 kg/m 2 ; 39%), hypertension (18%), respiratory disease (16%) and oncohematological antecedents (14%). Detailed clinical data of each patient are provided in Supplementary Table 1.

Identification of candidate variants. We have detected 44 different variants of interest located in 42
genes. These variants were identified in 29 patients, 11 of them carrying 2 or more candidate variants in different genes (Fig. 1A). Available data about pathogenic predictors, MAF and pathogenic annotation from public databases (ClinVar and HGMD) are summarized in Supplementary Table 2. A total of 12 (26%) variants were previously described as likely pathogenic or displayed strong evidence for being considered as likely pathogenic following ACMG criteria ( Table 2).
Thirty eight percent of the identified variants (n = 17, 15 different variants and 1 variant found in 2 cases) was located in 15 genes involved in immune response (Fig. 1B). Eight of these candidate variants were found in genes with an autosomal dominant inheritance mode (such as PLCG2, RNASEL, TLR4 or STAT3), six variants were detected in genes with an autosomal recessive inheritance mode (CFI, FCN3, HAVCR2 and IL17RC) and three variants were identified in X-linked genes (TLR7, TLR8 and G6PD). The detected LoF variant in TLR7 was found in a 30-year-old male who was included in a case-series recently reported 16 . The p.Arg378* variant on IL17RC, which was not described in public databases (GnomAD 23 , ExAc 24 or 1000 genome project 25 ), was detected in two unrelated patients from our cohort (MAF of 2.3%). The patient carrying the variant p.Ser755Cys located in the TLR8 gene was also hemizygous for a pathogenic variant in G6PD (both XL genes) and did not show pre-existing comorbidities or risk factors at the time of the SARS-CoV-2 infection (Supplementary Table 1). Besides, one of the deceased patients, clinically diagnosed with a primary immunodeficiency (FJD_0728); carried a variant on the STAT3 gene (p.Asn175His), in addition to one pathogenic variant in the recessive SRD5A3 gene (Table 2).
Additionally, two variants were found in genes related to coagulation (PLAU) and cardiovascular risk (LDLR) in two different cases (    Table 3), was performed to detect functional interactions among them. The network (Fig. 2) shows three main components. First component consists of 25 highly interconnected genes, 15 involved in immune response and enriched in cell signaling compared to the rest of the network (Fig. 2). Two additional network components were identified, one composed by 6 genes and enriched in carbohydrate metabolism and a third component with 5 genes enriched in DNA metabolism and repair processes, both compared to the rest of the network.

Discussion
Understanding inter-individual clinical variability in COVID-19 has important implications for the identification of high-risk patients, clinical decision-making and the development of individualized treatments. In the present study, a group of young and middle-aged patients with very severe COVID-19 were selected for a genetic study, in which 44 different variants of interest have been detected. As expected, most of the detected variants (40%) were encoded by genes directly related to immune response, as the gene panel used in this exploratory study was enriched in immune-related pathways and included additional immune genes than those reported in previous studies 11,13 . Innate immunity is crucial for early antiviral response; thus, LoF variants in related genes could affect the onset of the immune response and also alter the appropriate clearance of the infection by adaptative response 9,26,27 . In this sense, pathogenic rare variants in 13 candidate genes involved in TLR3-and IRF7-dependent type I IFN pathways showed a higher risk to severe SARS-CoV-2 infection, which could explain up to 3.5% of severe cases 11 . However, these initial findings were not replicated in subsequent studies 12,28 . In our study, we did not find any LoF variant in those 13 genes, but we have detected seven likely pathogenic variants in other genes directly related to immune response (IFN pathways, mainly).We have confirmed the presence of a TLR7 variant in a male also participating in a recently published study 16 , suffering from very severe COVID-19 and without relevant comorbidities or risk factors at the time of the infection. Therefore, our results support the genetic screening of TLR7 variants in young men in absence of pre-existing conditions as a preventing biomarker that may help clinical management of this subset of patients. Even more, we have found two variants in other Toll-like receptor (TLRs) genes, TLR4 (Chr9) and TLR8 (ChrX), in two males under 50 years of age requiring intubation. Of www.nature.com/scientificreports/ note, TLRs are crucial in innate response by recognizing pathogen-associated molecular patterns from different microorganisms 29 , being TLR3, 7 and 8 key sensors of RNA viruses 30 . Furthermore, nearly 40% of the variants detected in the present study were located in immune response genes, some of them with a high probability of intolerance to heterozygous LoF variation (pLI ≥ 0.9) 31 ;thus, a single LoF variant may lead to a severe clinical phenotype due to haploinsufficiency in genes such as CARD11, STAT3 or NFKB1(pLI = 1, each). In addition to allelic dosage, subjects carrying the same genotypes can display variable expressivity and additional common or rare genetic variants may modify the penetrance of monogenic variants (polygenic risk) 32 . In this sense, 30% of our patients carried more than one variant of interest. Even more, other well-known COVID-19 risk factors, such as age, comorbidities, or environmental factors may affect monogenic variants penetrance to the final observed phenotype 33 .
Additionally, we found 5 patients (11%) carrying heterozygous variants in genes related to glycosylation defects. Congenital defects of glycosylation (CDG) is a group of rare diseases caused mainly by recessive genes 34 . Clinical manifestations of CDG include neurological, cardiovascular, and hematologic involvement and recurrent infections, among others 35 . An increased risk of thrombotic events and bleeding complications have been related to abnormal glycosylation of coagulation factors 36 and thrombosis is one of the most important complications of COVID-19. Therefore, patients carrying a defective copy may experience a more severe course of SARS-CoV-2 infection due to the importance of glycosylation in immune response 35 . In contrast, ACE2 is a protein extensively glycosylated and previous studies showed that cellular SARS-CoV-2 entry is reduced by blocking the N-glycan and O-glycan formation 37 . Thus, it is difficult to conclude about the effect of these defective variants on the glycosylation status of the monoallelic carriers and the impact of those variants on SARS-CoV-2 clearance.
Moreover, four variants were detected in damaged DNA binding genes and a cluster including three of these genes (POLD1, MSH6 and REQL4), in addition to SLX4 and FANCD2, was detected in the network analysis. There is evidence that senescence is in part caused by accumulated DNA damage 38 and severity of some pathologies, as COVID-19, has been related to cell senescence, particularly in the elderly 39 . In addition, premature cellular senescence could be induced by viral infections 40 ; therefore, COVID-19 patients with pathogenic variants in damaged DNA binding genes may be more likely to develop cellular senescence and severe COVID-19.
Interestingly, one of the candidate variants identified was in the canonical splice site of a key player of the coagulant pathway, PLAU, that has been previously related to bleeding disorders, tandem duplication of this gene is related to Quebec platelet disorder (MIM #601709) in a dominant model. Therefore, we could hypothesize that this variant may impair thrombosis resolution, as demonstrated previously in a knocked-out model 41 and inferred by the critical role of PLAU in the natural thrombus resolution by its fibrinolytic function 42 . Anticoagulant and fibrinolytic gene expression has been found dramatically down-regulated in the lung of COVID-19 patients compared to controls 43 . Thus, COVID-19 patients with loss of functions in the PLAU gene may be more likely to develop a thrombotic event. Moreover, we found a variant in the LDRL gene, classified as a variant of uncertain significance in relation to familial hypercholesterolemia 44 . Patients with impaired cholesterol metabolism could display a higher risk of COVID-19 severe outcomes due to the intimate relationship of hypercholesterolemia, metabolic syndrome, and heart disease 45 . Therefore, variants predisposing to hypercholesteremia such as LDLR pathogenic variants may confer a higher risk of suffering severe COVID-19 disease, even in the absence of other relevant comorbidities 46 .
Our study has several limitations. First, we have a limited sample size. Despite recruiting more than 3500 in the Stop_Coronavirus cohort, the stringent cut-off for age (< 65 years) and outcome (only very severe COVID-19) led us to select those patients displaying an extreme phenotype in our cohort. Second, we have analyzed only the coding region; thus, we could have missed a second pathogenic allele (deep intronic regions or CNVs) in monoallelic patients that could help us explain the COVID-19 outcome. Besides, together with the effect of the detected genetic variants, it is necessary to consider the possible additional effect of pre-existing conditions related to COVID-19 severity in the patients on the outcome.
In conclusion, our descriptive study in very severe COVID-19 patients has reported the presence of rare variants in certain biological pathways such as immune response. Moreover, two additional signaling pathways have been detected including genes involved in carbohydrate metabolisms and DNA repair. Further studies are needed to confirm the ultimate role of the variants described in the present study on COVID-19 severity.

Patients and methods
Subjects and clinical data. A case series study was performed by selecting a subgroup of patients (n = 44) from the Spanish STOP_Coronavirus 47 cohort, which comprises more than 3,500 COVID-19 patients, from 4 hospitals (three from Madrid and one from Murcia). Extreme phenotypes were selected from our STOP_ Coronavirus cohort using a similar design to previous case series studies 13 . Inclusion criteria were young and middle-aged patients (age under 65 years) with a confirmatory test of SARS-CoV-2 infection that presented ARDS (survivors or deceased). More information about the Spanish STOP_Coronavirus cohort is provided in Supplementary methodology. Cases were retrospectively and prospectively enrolled from March to May 2020 and followed-up until December 2020. SARS-CoV-2 infection was confirmed by a positive PCR (n = 41) and/ or serological test (n = 3, IgG and IgM both positives). COVID-19 patients were recruited from four hospitals in Spain: Hospital Universitario Fundación Jiménez Díaz (HUFJD), Hospital Universitario Infanta Elena (HUIE) and Hospital Universitario 12 de Octubre (H12O) in Madrid, and Hospital Clínico Universitario Virgen de la Arrixaca in Murcia (HVAM).
Clinical data obtained in HUFJD and HUIE were extracted from the patients' electronic medical records using batch-based complex queries and then reviewed and refined manually by two clinicians and two clinician researchers. At H12O and HVAM, clinical data were manually collected by researchers from electronic medical records. Clinical information included primary demographic data, comorbidities, COVID-19 symptoms, www.nature.com/scientificreports/ laboratory findings, treatments, related complications from COVID-19, ICU admissions, and outcomes (Supplementary Table 1). Descriptive statistics (mean and SD) were calculated for main clinical and demographic data (Table 1). This study was approved by the research ethics committees of HUFJD, HVAM and H12O. Wherever was possible, patients provided written or verbal informed consent to participate in this study. Due to the health emergency, the research ethics committees of each center waived the requirement for informed consent for the STOP_Coronavirus cohort. All samples were de-identified (pseudonymized) and clinical data were managed in accordance with national legislation and institutional requirements.
Ancestry inference. Principal component analysis (PCA) based on the variance-standardized relationship matrix was used to infer the ancestry of each patient and classify them as one of the selected ancestry groups (European, African, admixed American, and East Asian) using a set of 1000 genome samples (phase 3) as a reference population. For PCA, we used previously collected genetic data from our cohort (unpublished) obtained with the Applied Biosystems™ Axiom™ Spain Biobank Array (COL32017 1217, Thermo Fisher Scientific Inc.), which contains 758,740 variants. PCA was performed using Plink software version 1.9 48 .
Kinship test. To assess kinship, we used previously collected genetic data from our cohort 2 obtained with the Applied Biosystems™ Axiom™ Spain Biobank Array (COL32017 1217, Thermo Fisher Scientific Inc.), which contains 758,740 variants. Autosomal SNPs (MAF > 5%) were pruned with PLINK 3 using a window of 1000 markers, a step size of 80 and a r 2 of 0.1. A subset of 131,937 independent SNPs was used to evaluate kinship (IBD estimation) in PLINK 3 . Only one individual from each pair of individuals with PI_HAT > 0.25 (seconddegree relatives) that showed a Z0, Z1, and Z2 coherent pattern (according to theoretically expected values for each relatedness level), was removed.
Whole exome sequencing analysis. DNA was isolated from EDTA-collected peripheral blood samples using an automated DNA extractor (BioRobot EZ1, QIAGEN GmbH). DNA samples were subjected to library construction using SureSelect Human All Exon V6 (Agilent Technologies, Santa Clara, CA, USA) and sequenced on a Novaseq 6000 instrument (Illumina, San Diego, CA, USA), following the manufacturer's protocol. Pairedend reads of 2 × 150 bp were generated per sample to provide an on-target coverage of minimum 100 ×, with a total coverage of 12 GB/sample.
For WES analysis we applied an in-house maintained bioinformatics pipeline using bwa v0.7.17 49 for mapping to the GRCh37/hg19 human genome assembly, gatk v4. Single variant analysis. To search for candidate variants involved in the pathophysiology of severe COVID-19, we used a candidate virtual gene panel summarized in Supplementary Table 4. Candidate gene panel included 330 genes mainly involved in type I IFN immunity, primary immunodeficiencies, and genes related to coagulation (panel 1). Moreover, 234 additional genes were selected by using the COVID-19 severity and susceptibility panel published in PanelApp 53 , by selecting only green-labelled genes (panel 2). Besides, other functionally related genes were included by using our GLOWgenes prioritization method (www. glowg enes. org) using the 564 genes from panels 1 and 2 as a seed set. Top 300 prioritized genes were selected and included as panel 3 (Supplementary methodology). Thus, a total of 864 genes (564 candidates and 300 selected by GLOWgenes) were included in the final panel.
The PriorR v.2.1 package (https:// github. com/ TBLab FJD/ PriorR) was used for variant filtering and prioritization. Variants were filtered according to a minor allele frequency (MAF) < 0.01 in population databases [the 1000 genomes project 25 , the Exome Aggregation Consortium (ExAc) 24 , and the Genome Aggregation Database 23 (GnomAD)]. Synonymous, intronic and non-coding variants were excluded from the analysis. ClinVar (ncbi. nlm.nih.gov/clinvar/) and the Human Gene Mutation Database 54 (HGMD) were used to identify variants previously reported as pathogenic and those described as likely benign/benign variants were discarded. The impact of missense variants was assessed using several predictor tools (DANN 55 72 , and GeneSplicer 73 ) using the Alamut software (Interactive Biosoftware, Rouen, France). The potential pathogenicity of prioritized variants was assessed using the Varsome tool 74 following ACGM criteria 75 . Network analysis of genes with candidate variants. Genes carrying at least one of the candidate variants (Supplementary Table 3) were submitted to the STRING database v11.5 76 and interactions with a STRING combined score ≥ 400 were downloaded as a file (.tsv) in short tabular text output format from the Exports tab. Cytoscape 77 version 3.4.0 was used for visualization. Clusters were defined as subgraphs with any two nodes (genes) connected to each other by edges, and not connected to other nodes in the graph, this normally called network components and the most extreme version of a cluster. We applied BINGO 78 Cytoscape app for the enrichment analysis extracting over-represented Gene Ontology (GO) biological processes terms comparing their annotation in every cluster to the rest of the network including genes not grouped in clusters. In the net- www.nature.com/scientificreports/ work representation, the STRING combined score, which represents the interaction confidence, is used to characterize edges between genes. Functions enriched for every cluster were selected as having an FDR < 0.05.

Data availability
The datasets used and/or analysed during the current study available from the corresponding author on reasonable request.