ACE2 gene variants may underlie interindividual variability and susceptibility to COVID-19 in the Italian population

In December 2019, an initial cluster of interstitial bilateral pneumonia emerged in Wuhan, China. A human-to-human transmission was assumed and a previously unrecognized entity, termed coronavirus disease-19 (COVID-19) due to a novel coronavirus (SARS-CoV-2) was described. The infection has rapidly spread out all over the world and Italy has been the first European country experiencing the endemic wave with unexpected clinical severity in comparison with Asian countries. It has been shown that SARS-CoV-2 utilizes angiotensin converting enzyme 2 (ACE2) as host receptor and host proteases for cell surface binding and internalization. Thus, a predisposing genetic background can give reason for interindividual disease susceptibility and/or severity. Taking advantage of the Network of Italian Genomes (NIG), here we mined whole-exome sequencing data of 6930 Italian control individuals from five different centers looking for ACE2 variants. A number of variants with a potential impact on protein stability were identified. Among these, three more common missense changes, p.(Asn720Asp), p.(Lys26Arg), and p.(Gly211Arg) were predicted to interfere with protein structure and stabilization. Rare variants likely interfering with the internalization process, namely p.(Leu351Val) and p.(Pro389His), predicted to interfere with SARS-CoV-2 spike protein binding, were also observed. Comparison of ACE2 WES data between a cohort of 131 patients and 258 controls allowed identifying a statistically significant (P value < 0.029) higher allelic variability in controls compared with patients. These findings suggest that a predisposing genetic background may contribute to the observed interindividual clinical variability associated with COVID-19, allowing an evidence-based risk assessment leading to personalized preventive measures and therapeutic options.


Introduction
In December 2019, a new infectious respiratory disease emerged in Wuhan, Hubei province, China [1 -3]. An initial cluster of infections likely due to animal-to-human transmission was rapidly followed by a human-to-human transmission [4]. The disease was recognized to be caused by a novel coronavirus (SARS-CoV-2) and termed coronavirus disease-19 . The infection spread within China and all over the world, and it has been declared as pandemic by the World Health Organization (WHO) on 2nd March 2020. The symptoms of COVID-19 range from fever, dry cough, fatigue, congestion, sore throat, and diarrhea to severe interstitial bilateral pneumonia with a ground-glass image at the CT scan. While recent studies provide evidence of a high number of asymptomatic or paucisymptomatic patients who represent the main reservoir for the infection progression, the severe cases can rapidly evolve towards a respiratory distress syndrome which can be lethal [5]. Although age and comorbidity have been described as the Members  main determinants of disease progression towards severe respiratory distress, the high variation in clinical severity among middle-age adults and children would likely suggest a strong role of the host genetic asset. A high sequence homology has been shown between SARS-associated coronavirus (SARS-CoV) and SARS-CoV-2 [6]. Recent studies modeled the spike protein to identify the receptor for SARS-CoV-2 and indicated that angiotensin converting enzyme 2 (ACE2) is the receptor for this novel coronavirus [7,8]. Zhou et al. conducted virus infectivity studies and showed that ACE2 is essential for SARS-CoV-2 to enter HeLa cells [9]. Although the binding strength between SARS-CoV-2 and ACE2 is weaker than that between SARS-CoV and ACE2, it is considered as much high as threshold necessary for virus infection. The spike glycoprotein (S-protein), a trimeric glycoprotein in the virion surface (giving the name of crown -corona in latin-), mediates receptor recognition throughout its receptor binding domain (RBD) and membrane fusion [10,11]. Based on recent reports, SARS-CoV-2 protein binds to ACE2 through Leu455, Phe486, Gln493, Ala501, and Tyr505. It has been postulated that residues 31, 41, 82, 353, 355, and 357 of the ACE2 receptor map to the surface of the protein interacting with SARS-CoV-2 spike protein [12], as previously documented for SARS-CoV. Following interaction, cleavage of the C-terminal segment of ACE2 by proteases, such as transmembrane protease serine 2 (TMPRSS2), enhances the spike protein-driven viral entry [13,14]. Thus, it is possible, in principle, that genetic variability of the ACE2 receptor is one of the elements modulating virion intake and thus disease severity. ACE2 is located on chromosome X. Although it is one of the genes escaping X inactivation several lines of evidence suggest that a different degree of X-chromosome inactivation (XCI) is present in distinct tissues [15].
Taking advantage of the Network of Italian Genomes (NIG), a consortium established to generate a public database (NIG-db) containing aggregate variant frequencies data for the Italian population (http://www.nig.cineca.it/), here we describe the genetic variation of ACE2 in the Italian population, one of the newly affected countries by the SARS-CoV-2 outbreak causing COVID-19. Three common c.2158A>G p.(Asn720Asp), c.77A>G p.(Lys26Arg), and c.631G>A p.(Gly211Arg) variants and 27 rare missense variants were identified, 9 of which had not previously been reported in public databases. We show that p.(Asn720Asp), which lies in a residue located close to the cleavage sequence of TMPRSS2, likely affects the cleavagedependent virion intake. Along with the other two common variants, this substitution is represented in the Italian and European populations but is extremely rare in the Asian population. We also show that two rare variants, namely, c.1051C>G p.(Leu351Val) and c.1166C>A p.(Pro389His) are predicted to cause conformational changes impacting RBD interaction. As the uncertainty regarding the transmissibility and severity of disease rise, we believe that a deeper characterization of the host genetics and functional characterization of variants may help not only in understanding the pathophysiology of the disease but also in envisaging risk assessment.

Italian population randomization
The work has been realized in the context of the NIG, with the contribution of centers: Azienda Ospedaliera Universitaria Senese, Azienda Ospedaliera Universitaria Policlinico Sant'Orsola-Malpighi di Bologna, Città della Salute e della Scienza di Torino, Università della Campania "Luigi Vanvitelli", Ospedale Pediatrico Bambino Gesù. The NIG (http://www.nig.cineca.it/) aim is to create a shared database (NIG-db) containing data from nucleic acids sequencing of Italian subjects. This database allows defining an Italian Reference Genome for the identification of genes responsible for genetic diseases or Italian population susceptibility to complex disorders and for the detection of genetic variants responsible for interindividual differences in disease progression ad /or drug response among the Italian population. Individuals coming to our centers were offered to participate to the NIG study and blood withdrawal was performed upon informed consent. Individuals provided signed informed consents at each participating center for whole-exome sequencing analysis (WES), and clinical and molecular data storage and usage. All subjects were unrelated, healthy, and of Italian ancestry. Italian origin was ascertained asking for parents and grandparents origin. DNA has been stored in the Telethon Network of Genetic Biobanks (project no. GTB12001), funded by Telethon Italy.

COVID-19 patients enrollment
The study was consistent with Institutional guidelines and approved by the University Hospital (Azienda Ospedaliera Universitaria Senese) Ethical Committee, Siena, Italy (Prot n. 16929, dated March 16, 2020). Written informed consent was obtained from all patients and controls. Peripheral blood samples in EDTA-containing tubes and detailed clinical data were collected. All these data were inserted in a section of the established and certified Biobank and Registry of the Medical Genetics Unit of the Hospital dedicated to COVID-19. The cohort of COVID-19 patients consists of 131 individuals out of whom 34 females and 97 males belonging to the GEN-COVID MULTICENTER STUDY ([16], Late Breaking Abstract ESHG 2020.2 Virtual Conference "WES profiling of COVID-19"). The cohort of controls consists of 258 italian individuals (129 males and 129 females). All patients are of Italian ethnicity. The median age is 64 years (range 31-98): median age for women 66 years and for males 63 years. The population was clustered into four qualitative severity groups depending on the respiratory impairment and the need for ventilation: high care intensity group (those requiring invasive ventilation), intermediate care intensity group (those requiring noninvasive ventilation i.e., CPAP and BiPAP, and high-flows oxygen therapy), low care intensity group (those requiring conventional oxygen therapy) and very low care intensity group (those not requiring oxygen therapy).

Whole-exome sequencing
Targeted enrichment and massively parallel sequencing were performed on genomic DNA extracted from circulating leukocytes of 6930 individuals. Genomic DNA was extracted from peripheral blood samples using standard procedures. Exome capture was carried out using Sur-eSelect Human All Exon V4/V5/V6/V7 ( Broad Institute, Harvard, USA). Alignment of raw reads against reference genome Hg19, variant calling and annotation were attained using in-house pipelines [17][18][19] which take advantage of the GATK Best Practices workflow [20] and of Annovar, VEP [21,22]. The genome aggregation database gnomAD (https://gnomad.broadinstitute.org/) was used to assess allele frequency for each variant among different populations. The mean depth of coverage of each ACE2 exon in all participants was 55×. Variants with a depth of coverage lower that 20× were filtered out according to ASHG Guidelines for germline variants [23]. The

Computational studies
The structure of native human angiotensin converting enzyme-related carboxypeptidase (ACE2) was downloaded from Protein Data Bank (https://www.rcsb.org/) (PDB ID code 1R42) [24]. The DUET program [25] was used to predict the possible effect of amino acids substitutions on the protein structure and function, based on the use of machine-learning algorithms exploiting the threedimensional structure to quantitatively predict the effects of residue substitutions on protein functionality. Molecular dynamics (MD) simulations of wild-type and variant ACE2 proteins were carried out in GROMACS 2019.3 [26] to calculate root mean square deviation (RMSD) to define structural stability. The graphs were plotted by the XMGrace software [27]. MD simulations were performed using a high parallel computing infrastructure (HPCS) with 660 cpu within 21 different nodes, 190T of RAM, 30T hard disk partition size, and six NVIDIA TESLA gpu with CUDA support. PyMOL2.3 was used as a molecular graphic interface. The protein structures were solvated in a triclinic box filled with TIP3P water molecules and Na + /Cl − ions were added to neutralize the system. The whole systems were then minimized with a maximal force tolerance of 1000 kJ mol −1 nm −1 using the steepest descendent algorithm. The optimized systems were gradually heated to 310 K in 1 ns in the NVT ensemble, followed by 10 ns equilibration in the NPT ensemble at 1 atm and 310 K, using the V-Rescale thermostat and Berendsen barostat [28,29]. Subsequently, a further 100 ns MD simulations were performed for data analysis.

ACE2 variants identification
The extent of variability along the entire ACE2 coding sequence and flanking intronic stretches was assessed using 6930 Italian WES, out of which 4171 males and 2759 females which sum up to 9689 alleles. Identified variants and predicted effects on protein stability are summarized in Tables 1, 2, and Table S1, and represented in exomes. In addition to these variants, 28 rare missense variants were identified, out of which ten had not previously been reported in GnomAD database and nine truncating variants that had not been reported in gnomAD database (Table 1 and Supplementary Table 1). Out of these variants, two fall in the neck domain, which is essential for dimerization and one in the intracellular domain. Many of them truncate the protein in different positions of the Protease domain embedded in the extracellular domain, which contains the receptor binding site for SARS-CoV-2. Only three truncating variants have been previously described for ACE2 likely due to a low-tolerance for loss-of-function variants. In line with this evidence, all these variants were very rare and no homozygous females were detected for the identified variants. Three missense changes c.1517T>C p.    (Fig. 3a), the c.1166C>A p.(Pro389His) and c.1051C>G p.(Leu351Val) variants show a difference in comparison with the native protein with a gradual increase in RMSD value, which stabilizes at an average of 0.5 nm (Fig. 3a). Finally, the c.631G>A p.
(Gly211Arg) shows a bigger difference with a higher increase in RMSD value, which stabilizes at an average of 0.6 nm (Fig. 3a). Structural analysis between WT and mutant c.1517T>C p.(Val506Ala) MD simulations showed that the c.1517T>C p.(Val506Ala) forms a hydrophobic center together with Leu456, Leu503, and Phe516 with minimum differences in protein rearrangements when the residue is mutated in Ala as reported in Fig. 2 and Supplementary Video S1. The c.77A>G p.(Lys26Arg) is located at the N-terminus and the sidechain engages a hydrogen bond with Asn90 thus determining a minimal destabilizing predicted effect as shown in Table 2  During MD simulations, we have also investigated the surrounding region of ACE2 WT and previously selected variants by calculating change in solvent accessibility surface area (SASA). Differences in average SASA value would suggest for the native protein a wider surface exposed to solvent and subsequently a different ability to interact with spike SARS-CoV-2 in comparison with the studied variants (Fig. 3b).

Differences in ACE2 allelic variability in COVID-19 patients compared with controls
In order to shed light on the role of ACE2 variants on interindividual variability and susceptibility to COVID-19 in Italian population we performed WES analysis on a cohort of 131 patients and 258 controls who agreed in   participating to the study (see "Materials and methods"). Data analysis of ACE2 variants identified a different distribution of variants in controls compared with patients ( Fig. 4)

Discussion
According to recent reports, ACE2 is essential for SARS-CoV-2 to enter cells. Recent single-cell RNA studies have also shown that ACE2 is expressed in human lung cells [30]. The majority of ACE2-expressing cells are alveolar type 2 cells. Other ACE2-expressing cells include alveolar type 1 cells, airway epithelial cells, fibroblasts, endothelial cells, and macrophages although their ACE2-expressing cell ratio is low and variable among individuals. The expression and distribution of the ACE2 receptor can thus justify the route of infection and the main localization at the alveolar level. Although the different density of ACE2 receptors in the upper respiratory tract among individuals can partially give reason for the clinical variability, which  (5 Å)   Y180, L456, R460, P500, A501, S502, L503, F504, H505, N506, S507   Y207, E208, V209, N210, G211, V212, Y215, D216, Y217, P565, T567   H373, H374, E375, M376, G377, H378, I379, A380, Y381, F315, H401, V404, G405, M408   L262, P263, A264, H265, L266, L267, W271, W478, V487, V488, E489, P490, W165   Y497, C498, D499, P500, A501, S502, G173, R177, L176, Y180, W459, W473, M474   A242, Y243, V244, R245, A246, K247, L248,  DUET program results that display predicted change in folding free energy upon ACE2 missense variant (ΔΔG in kcal/mol). In the first three columns are reported single missense variants with specific position on ACE2 protein. The residues in the first column highlighted in gray are involved in N-glycosylation pattern NxT/S, therefore those missense variants determine the loss glycosylation of Asparagine 53 and 90, respectively. In the fourth column is reported ΔΔG analysis predict effects of missense variants on protein stability using an integrated computational approach. The column "Interaction Network around (5 Å)" shows for each single missense variant the residues around 5 Å. In this column, we highlight in green residues involved in spike SARS-CoV protein interaction, in yellow residues involved in Zinc coordination and finally in magenta residues of Asn involved in N-glycosylation. The last column defines the outcome of protein stability for each single missense variant. An increasing negative value for the ΔΔG is correlated with a higher destabilizing effect, while a positive value is associated with a variant predicted as stabilizing.
ranges from asymptomatic/paucisymptomatic patients to severely affected ones, it could not be the only reason for such variability. In addition, recent works did not observe significantly different viral loads in nasal swabs between symptomatic and asymptomatic patients [31]. Italy has been the first European country that experienced the COVID-19 outbreak with a rapid increase in the positive cases in a very short-time period and a morbidity and lethality (~10%) definitely higher in comparison with Asian countries, such as China (4%) and South Korea (1.2%) [32]. These considerations raise the possibility of a predisposing genetic background accounting for or contributing to the wide interindividual clinical variability, as well for the differential morbidity and lethality observed among different countries, population awareness, and constrictive measures apart. We integrated genomic WES data produced by five Italian centers (Siena, Naples, Turin, Bologna, and Rome) interconnected by the NIG in the attempt to identify variation encompassing the ACE2 gene, which could account for a difference in SARS-CoV-2 spike binding affinity, processing, or internalization. Previous studies showed that the residues near lysine 31, and tyrosine 41, 82-84, and 353-357 in human ACE2 are important for the binding of S-protein to coronavirus [12]. In line with previous reports [33], we did not find polymorphism or rare variants in these residues in the Italian population. However, we identified three variants namely c.2158A>G p.(Asn720Asp), c.1166C>A p.(Pro389His), and c.1051C>G p.(Leu351Val), one of which polymorphic c.2158A>G p.(Asn720Asp), moderately expressed in the Italian and European non-Finnish populations and with a very low allele frequency or not occurring in the Eastern Asia population. These variants which surround residual essentials for the SARS-CoV-2 spike protein binding were predicted to likely affect the cleavage-dependent virion intake, such as the polymorphic c.2158A>G p.(Asn720Asp) (allele frequency 0.011) which lies four amino acids from the cleavage sequence of TMPRSS2 or to have a substantial impact on protein structure and spike protein interaction by MD simulation (Fig. 3a). The relatively frequent c.631G>A p.(Gly211Arg) (allele frequency 0.0012, 12/6930 individuals) was predicted to confer a wide flexibility to the region because of the ability to engage different interactions with the nearby amino acid residues. Along with these more common variants we also identified very rare variants such as the c.1166C>A p.(Pro389His) and the c. Finnish European population, that could give reason for a different affinity for the SARS-CoV-2 spike protein (Figs. 2, 3a and Supplementary Video S4). Interestingly all the studied variants affect residues highly conserved among species (Supplementary Fig. S1). Given their rarity in other populations, we cannot exclude that these variants can Fig. 3 Structure superimposition snapshot between wild-type protein and variant proteins. a Root mean square deviation (RMSD) trends for the backbone of ACE2 WT (black line) and some selected variants (colored lines, see legend) during 100 ns of simulation. The molecular dynamics simulation shows a good stability for all systems with exception of G211R mutants. RMSD is a parameter used to define the stability of an element. Wild type shows a steady course in the RMSD value, stabilizing at an average of 0.2 nm, while, the G211R variant shows a gradual increase in RMSD value, stabilizing at an average of 0.6 nm. b SASA graphical representation of ACE2 WT (black line) and ACE2 variants (colored lines, see legend). partially account for the clinical outcome observed in the Italian population. WES data generated from a wide cohort of COVID-19 Italian patients revealed a statistically significant (P < 0,029) higher allelic heterogeneity for ACE2 in controls compared with patients with a higher chance to find at least one ACE2 variant in the cohort of controls compared with the cohort of patients. Therefore, it is plausible to think that the effect of allelic variability on ACE2 conformation would at least partially account for the interindividual clinical differences and likely modulate clinical severity. This finding reinforces the hypothesis that at least some of the identified variants or the cumulative effect of few of them confer a different susceptibility to virus cell entry and consequently to disease onset and progression. We cannot exclude that also silent variants such as the c.2247G>A (p. Val749Val) with no effect on the protein could play a role because of an unpredictable impact at a posttranscriptional level.
Notably, morbidity and lethality have been reported definitely higher in men compared with women (~70% vs. 30%, 20th March 2020 ISS report). Although several parameters have been brought to case to explain this difference, i.e., smoking, differences in ACE2 localization and/or density in alveolar cells, hormonal asset, it is noteworthy that ACE2 is located on chromosome X and that given the low allele frequency of the identified variants the rate of homozygous women is extremely low (see Results section). The XCI is incomplete in humans and some genes show a degree of XCI escape which vary between individuals and tissues [34]. ACE2 is one of the genes escaping X inactivation, but it belongs to a subgroup of X-chromosome genes showing a higher expression in men in several tissues thus mostly suggesting that ACE2 gene XCI is present although different in distinct tissues [15]. Therefore, the impact of X inactivation on the alternate expression of the two alleles would guarantee, in the affected tissues, a heterogeneous population of ACE2 molecules, some of which protective towards the infection until the point of a complete or almost complete protection in the case of a X inactivation skewed towards the less SARS-CoV-2-binding prone allele. This hypothesis would justify the high rate of asymptomatic or paucisymptomatic patients. However, the presented data does not allow to confirm a clear cause-effect relationship and, since most of the identified variants have very low frequencies, further functional studies are needed to validate these results. ACE2 is definitely one of the main molecules whose genetic heterogeneity can modulate infection and disease progression; however, a deeper characterization of the host genetics and functional variants in other pathwayrelated genes may help in understanding the pathophysiology of the disease opening up the way to a stratified risk assessment and to tailored preventive measures and treatments. questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

Compliance with ethical standards
Conflict of interest The authors declare that they have no conflict of interest.
Ethical approval The NIG study was approved by the Ethical Committee of each center, Prot Name NIG Prot n 10547_2016, approved CEAVSE on 18.07.2016 n 88/16. The COVID-19 patients study was approved by the University Hospital (Azienda Ospedaliera Universitaria Senese) Ethical Committee, Siena, Italy (Prot n. 16929, dated March 16, 2020).
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons. org/licenses/by/4.0/.