Binding Ability Prediction between Spike Protein and Human ACE2 Reveals the Adaptive Strategy of SARS-CoV-2 in Humans

SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) is a novel coronavirus that has caused a global outbreak of COVID-19 over the past six months. A relatively high divergence in the spike protein of SARS-CoV-2 enables it to transmit across species efficiently. In particular, we believe that adaptive mutations in the receptor-binding domain (RBD) of the SARS-CoV-2 spike protein might be essential to its high transmissibility among humans. We therefore collected 2,142 high-quality genome sequences of SARS-CoV-2 from 160 regions in over 50 countries, reconstructed their phylogeny, and analyzed the interaction between spike protein polymorphisms and human ACE2 (hACE2). Phylogenetic analysis of SARS-CoV-2 and coronaviruses from other hosts shows that SARS-CoV-2 most likely originated from Bat-CoV (RaTG13) found in the horseshoe bat, and that a recombination event may have occurred in the spike protein of Pangolin-CoV, giving it the ability to infect humans. Moreover, the direct-binding sites of the RBD are more conserved than the S gene of SARS-CoV-2 as a whole, and we noticed that the spike protein of SARS-CoV-2 may be undergoing consensus evolution to adapt better to human hosts. We simulated 3,860 amino acid mutations in the spike protein RBD (T333-C525) of SARS-CoV-2 and calculated their stability and binding affinity to hACE2 (S19-D615). Our analysis indicates that SARS-CoV-2 could infect humans from different populations with no preference, and that the higher divergence in the spike protein of SARS-CoV-2 at the early stage of this pandemic may be a good indicator of the pathway by which SARS-CoV-2 was transmitted from its natural reservoir to human beings.


Introduction
Coronaviruses are commonly found in nature and infect only mammals and birds 1-3 . Among 46 species, only seven can infect humans 4,5 . Aside from SARS-CoV and MERS-CoV, which cause deadly pneumonia in humans by crossing the species barrier 3,6,7 , SARS-CoV-2 has now caused a global pandemic of respiratory disease within six months of the first confirmed case in Wuhan, China 8 . It has been identified as a novel member of the β-coronaviruses in the family Coronaviridae, which are positive-sense single-stranded RNA viruses with a protein envelope 2 . To date, COVID-19 caused by SARS-CoV-2 has infected more than six million people and caused over 380 thousand deaths worldwide. Compared with SARS-CoV and MERS-CoV, SARS-CoV-2 spreads more rapidly and is highly infectious to humans [8][9][10] . It is crucial to understand the origin of this coronavirus and its strategy for adapting so efficiently to human hosts, and to apply this knowledge to controlling the pandemic and developing effective therapeutics and vaccines against COVID-19.
Phylogenomic reconstruction of the relationships among various coronaviruses 11 showed that SARS-CoV-2 shares around 70% genome sequence similarity with SARS-CoV 11 and is more closely related to the bat coronavirus RaTG13 in the spike (S) gene 1 . Being closely related to SARS-like coronaviruses, SARS-CoV-2 has a genome structure similar to that of other beta-coronaviruses, arranged in the order 5'-replicase ORF1ab-S-envelope (E)-membrane (M)-N-3', with a number of open reading frames (ORFs) whose functions resemble those of SARS-CoV 12 . RaTG13 is highly similar to SARS-CoV-2, especially at the gene level, but the two differ in some crucial genomic structures. One of the most notable features is the insertion of a polybasic (furin) cleavage site (PRRA residues) at the junction between the two subunits (S1, S2) of the S protein 1,2,12,13 . Although some studies show that bats could be the reservoir host for many coronaviruses, including SARS-CoV, the reservoir host of SARS-CoV-2 remains unclear 4,10,14 . Given the global spread of this epidemic, revealing the origin of the pandemic has drawn a lot of attention. The evolutionary history of SARS-CoV-2 may explain its infectiousness and transmissibility among different animal hosts and provide evidence about whether this virus is natural or artificial.
The spike glycoprotein on the surface of SARS-CoV-2 is the key to entering target cells: it forms homotrimers that protrude from the viral surface, recognize the host cell receptor, and cause membrane fusion 15 . The spike protein contains S1 and S2 subunits. The receptor-binding domain (RBD) lies on S1 and can bind to the peptidase domain (PD) of angiotensin-converting enzyme 2 (ACE2), while S2 is responsible for membrane fusion during viral infection 14,16 . In SARS-CoV, the RBD of the spike protein is the most diverse part of the whole genome, and six of its amino acids (Y442, L472, N479, D480, T487, and Y491) have been found to play a key role in binding to the ACE2 receptor and, further, in transmissibility across the species boundary 16 . As in SARS-CoV, RBD mutations acquired during adaptation to different host cells along transmission have also been observed in SARS-CoV-2. Dissecting the key mutations in the spike protein RBD that affect binding to the ACE2 receptor is therefore important for understanding the molecular mechanism by which SARS-CoV-2 infects human cells. Recent studies show that the ACE2 receptor in host cells mediates the entry of SARS-CoV-2 by interacting with the spike protein, and that its binding capacity to the spike protein determines the transmissibility of this coronavirus across species, particularly among humans 14,17 .
The present study intends to 1) reveal the phylogenetic relationships of SARS-CoV-2 identified in different populations at the genomic level, and provide genetic evidence based on the structural protein-coding genes (S, M, N, E) for identifying the nucleotide variations of SARS-CoV-2 collected from different regions amid the COVID-19 pandemic; 2) predict the adaptive mutations in the spike protein of SARS-CoV-2 based on the specific variation of the S gene, and find the differences in stability among spike protein mutants and in their affinities for the human ACE2 receptor; 3) explore whether there are ACE2 variants in different populations that may affect the infectivity of COVID-19. To this end, we analyzed the variants of ACE2 in a large cohort including 1000 local Chinese people and other human populations, and identified polymorphisms that may influence the binding between ACE2 and the spike protein of coronaviruses. Furthermore, we predicted the affinities of the spike protein for the ACE2 variants to understand whether those changes would render individuals resistant or susceptible to SARS-CoV-2 at the molecular level.
Our study could thus explain part of the origin of SARS-CoV-2 at the phylogenetic level, based on both the whole genome and multiple key genes, allow the elucidation of population risk profiles, and help advance therapeutics such as a rationally designed soluble ACE2 receptor for the management of COVID-19.
[...] was carried out by the Kimura two-parameter correction method for calculation at the nucleic acid level. To avoid the prediction error caused by selecting outgroups with a distant evolutionary relationship, a complex outgroup was adopted in this study, and the sequences of MERS-CoV and SARS-CoV were selected as outgroups to predict the genetic relationship.
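The Kimura correction mentioned above (commonly called the Kimura two-parameter, or K2P, model) can be illustrated with a minimal Python sketch. It computes d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitions and transversions. The sketch assumes two pre-aligned nucleotide sequences of equal length and skips columns with gaps or ambiguous bases; the function name `k2p_distance` is ours for illustration, and this is not the pipeline used in this study.

```python
import math

PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def k2p_distance(seq_a: str, seq_b: str) -> float:
    """Kimura two-parameter distance between two aligned nucleotide sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases (an assumption of this sketch)
        compared += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1  # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine

    if compared == 0:
        raise ValueError("no comparable sites")

    p = transitions / compared
    q = transversions / compared
    w1 = 1 - 2 * p - q
    w2 = 1 - 2 * q
    if w1 <= 0 or w2 <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(w1 * math.sqrt(w2))
```

Separating transitions from transversions is the point of the model: the two kinds of substitution are allowed to occur at different rates, so the corrected distance is larger than the raw proportion of differing sites.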

Phylogeny of SARS-CoV-2
The Maximum Likelihood (ML) phylogenetic trees were constructed from 2,147 genome sequences of SARS-CoV-2, with SARS-CoV and MERS-CoV selected as the outgroups ( Figure 1). After alignment, we merged identical sequences into one clade and kept a single label for them.
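Merging identical aligned sequences, as described above, can be sketched in a few lines of Python. The function name `collapse_identical` and the choice to keep the first label seen as the representative are assumptions made for illustration; the text only states that one label was kept.

```python
def collapse_identical(records):
    """Group (label, aligned_sequence) pairs that share the same sequence.

    Returns a list of (representative_label, sequence, all_labels) tuples,
    in order of first appearance.
    """
    groups = {}
    for label, sequence in records:
        # Case-insensitive comparison; dicts preserve insertion order.
        groups.setdefault(sequence.upper(), []).append(label)
    return [(labels[0], sequence, labels) for sequence, labels in groups.items()]


# Usage with placeholder labels and toy sequences (not real data).
records = [("strain_a", "ACGT"), ("strain_b", "ACGT"), ("strain_c", "ACGA")]
for representative, sequence, members in collapse_identical(records):
    print(representative, sequence, len(members))
```

Keeping the full member list next to the representative label makes it possible to map collapsed tips back to the original strains later.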
The genomic tree showed that all SARS-CoV-2 strains were closely related to the SARS-like virus found in the horseshoe bat from Yunnan (RaTG13). A coronavirus from a pangolin collected in Guangdong province was the sister taxon of RaTG13 and is closer to the bat virus than to the viruses from other pangolins from Guangxi. SARS-CoV-2 is not closely related to SARS-CoV at the nucleic acid level [...] Australia, European countries and USA, and the case document showed that this patient was confirmed after a seven-day trip to Italy and had shown no symptoms before the journey. Some branches were poorly supported by the bootstrap values and exhibited polytomies, which indicates that SARS-CoV-2 was adapting to human hosts globally and that it is hard to determine the origin of SARS-CoV-2 solely from nucleic acid data. Furthermore, as in the genomic tree, the multiple-gene sequences of SARS-CoV-2 are divergent from the outgroups, and we detected some mutations by comparing S gene sequences ( Figure 3).
As shown in Figure 3, we extracted and aligned the S gene sequences of SARS coronaviruses from different hosts, including the pangolin coronavirus closest to SARS-CoV-2 and RaTG13 from a horseshoe bat (Rhinolophus spp.). We found no large fragment shift in the pangolin S gene, which was less similar to the SARS-CoV-2 S gene than RaTG13 was; however, the insertions present in the pangolin sequence indicate that recombination in the coronavirus spike protein could occur during cross-species transmission.
Comparing the S gene sequences of different strains, we found that the similarity between SARS-CoV-2 and SARS-CoV-bat (98%) is higher than that between SARS-CoV-2 and the coronaviruses from pangolins (85%). The phylogenetic tree based on the S gene (Fig. 4) was better resolved than the trees based on genomic sequences and multiple genes. Nevertheless, the phylogenies reconstructed from the genome, the S gene, and multiple genes all indicate that SARS-CoV-2 is closely related to RaTG13 and the virus isolated from pangolin.
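A similarity figure like those quoted above is typically a percent identity over aligned sequences. The Python sketch below shows one common definition; the text does not state which definition was used, so the handling of gaps here (columns where both sequences have a gap are ignored, a gap against a base counts as a mismatch) is an assumption, as is the function name `percent_identity`.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of compared alignment columns in which the two sequences agree."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")

    matches = compared = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" and b == "-":
            continue  # column is a gap in both sequences: not informative
        compared += 1
        if a == b:
            matches += 1  # a gap against a base never matches

    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


# Toy example: 4 of the 5 compared columns agree.
print(percent_identity("ACGT-A", "ACGA-A"))  # 80.0
```

Different gap conventions can shift the reported percentage by a few points, so the convention should be stated alongside any identity value.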

Polymorphism prediction of spike protein in SARS-CoV-2
The S gene has been studied as the key gene for SARS-CoV-2 binding to the host cell receptor. It is less conserved than the SARS-CoV-2 genome as a whole, and some studies suggest that the high divergence found in the spike protein RBD, and specifically in the sites that bind directly to the ACE2 receptor, plays an important role in the adaptation of SARS-CoV-2 to different animal hosts or to populations in different regions. In combination with the phylogeny based on the S gene, we also examined all mutations in the S gene and their binding capacity with the human ACE2 receptor. 23 point mutations were predicted to significantly influence the affinity and stability of the spike protein (Fig. 5), of which 9 polymorphisms exhibited increased affinity and stability while 14 showed decreased affinity and stability (Table 1). The analysis outcomes for all of the over 3000 polymorphisms in the spike protein are shown in Supplementary Table 1. Figure 5 and Table 1 show the missense mutations in the spike protein RBD that significantly change (cutoff = 3) its affinity and stability with the ACE2 receptor. Interestingly, residues L455, Q498 and N501 each have two potential mutations that lead to opposite affinities: affinity is increased when leucine changes to methionine at AA455, glutamine changes to tryptophan at AA498, and asparagine changes to tyrosine at AA501, and reduced when leucine changes to alanine, glutamine changes to alanine, and asparagine changes to glycine. Polymorphisms at the same residue causing two opposite effects on affinity suggest that mutations at those positions may lead SARS-CoV-2 in different adaptation directions to fit different hosts. Furthermore, Phe456, Gln493, and Phe486 each showed two mutations that both reduce affinity.
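The scale of such an in silico scan can be sketched as follows. The RBD window T333-C525 spans 193 residues, and 193 × 20 = 3,860, which matches the number of simulated mutations given in the abstract if every position is scanned against all 20 amino acids, wild type included. That reading is our inference rather than something the text states, and the function below is a hypothetical illustration, not the tool used in this study.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def enumerate_point_mutations(sequence, start, include_wild_type=True):
    """List single-residue substitutions as labels such as 'N501Y'.

    sequence: one-letter amino acid string for the scanned window
    start:    residue number of the first character in `sequence`
    """
    mutations = []
    for offset, wild in enumerate(sequence):
        position = start + offset
        for new in AMINO_ACIDS:
            if new == wild and not include_wild_type:
                continue
            mutations.append(f"{wild}{position}{new}")
    return mutations


# Window size implied by the residue range T333-C525.
window_length = 525 - 333 + 1
print(window_length, window_length * 20, window_length * 19)

# Short placeholder fragment (not the full RBD sequence).
print(enumerate_point_mutations("TNLC", start=333)[:3])
```

With the wild-type residue excluded, the same window would give 193 × 19 = 3,667 substitutions, so the two conventions are easy to tell apart from the total alone.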
The stability of the spike protein with different mutations is shown in Table 1. Only 5 mutants increased the stability of the spike protein (G446W, G496A, Q498W, N501Y and G502Y), which is not consistent with the affinity results.
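The cutoff-based selection described above can be sketched as below, treating affinity and stability as two separately thresholded scores. Several things here are assumptions: that each mutation carries one predicted affinity change and one predicted stability change, that "significant" means an absolute value at or above the cutoff of 3, and that positive values mean an increase. The mutation names and scores in the usage example are invented placeholders, not results from this study.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    mutation: str
    affinity_change: float
    stability_change: float


def classify(value: float, cutoff: float = 3.0) -> str:
    """Label a predicted change relative to a symmetric cutoff."""
    if value >= cutoff:
        return "increased"
    if value <= -cutoff:
        return "decreased"
    return "not significant"


def summarize(predictions, cutoff: float = 3.0):
    """Map each mutation to its (affinity, stability) labels."""
    return {
        p.mutation: (
            classify(p.affinity_change, cutoff),
            classify(p.stability_change, cutoff),
        )
        for p in predictions
    }


# Placeholder values only, chosen to show that the two labels can disagree.
example = [
    Prediction("mut_a", 3.5, 0.2),
    Prediction("mut_b", -4.1, -3.0),
    Prediction("mut_c", 1.0, 3.2),
]
for name, labels in summarize(example).items():
    print(name, labels)
```

Scoring the two properties independently is what allows a mutant to raise affinity without raising stability, the kind of mismatch noted above.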
To match the spike protein mutations reported in the SARS-CoV-2 database of the China National Center for Bioinformation (https://bigd.big.ac.cn/ncov) with our predictions, we collected a total of 1,150 spike protein polymorphisms from the database, selected 643 missense variants, and analyzed their affinity and stability in binding to ACE2 (Supplementary Table 2). We focused on 76 missense variants in the region T333-C525 of the spike protein: 9 variants significantly change its affinity for ACE2 ( Figure 6A), and 13 variants change the structural stability of the spike protein RBD ( Figure 6B).
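Restricting reported variants to the RBD window, as described above, amounts to parsing each variant label and filtering on residue number. The Python sketch below assumes labels in the one-letter form used in this paper (for example V483A); the helper names are hypothetical, and the database's actual export format may differ.

```python
import re

# One-letter wild type, residue number, one-letter replacement, e.g. "V483A".
VARIANT_PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_variant(label: str):
    """Split a label such as 'V483A' into ('V', 483, 'A')."""
    match = VARIANT_PATTERN.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognized variant label: {label!r}")
    wild, position, new = match.groups()
    return wild, int(position), new


def is_missense(label: str) -> bool:
    """True when the amino acid actually changes."""
    wild, _, new = parse_variant(label)
    return wild != new


def in_region(label: str, start: int = 333, end: int = 525) -> bool:
    """True when the residue number falls inside the window T333-C525."""
    _, position, _ = parse_variant(label)
    return start <= position <= end


# V483A, V367F and G446V are mentioned in this paper; D614G is included
# only as an example of a spike variant outside the RBD window.
labels = ["V483A", "V367F", "G446V", "D614G"]
rbd_missense = [v for v in labels if is_missense(v) and in_region(v)]
print(rbd_missense)  # ['V483A', 'V367F', 'G446V']
```

The window bounds are inclusive, matching the way the region T333-C525 is written.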

Polymorphisms in ACE2 affect the binding ability to spike protein of SARS-CoV-2
We calculated the population frequency of 388 missense variants in ACE2 collected from GnomAD (Supplementary Table 3) and the local population ( Table 2). Their stability and affinity to the spike protein of SARS-CoV-2 were analyzed in this study, and the results showed no significant differences in either affinity or stability.

Discussion and Conclusions
Revealing the evolutionary origin of SARS-CoV-2 is of great value for understanding its cross-species transmission pathway and for guiding the long-term prevention of zoonotic coronavirus infection. Because SARS-CoV-2 is a novel coronavirus, only extremely limited morphological information can be used in phylogenetic analysis; genomic data now provide a helpful approach to identifying its divergence along with the transmission among [...]

The spike protein is key to the first step of viral infection of host cells, receptor recognition, in most SARS coronaviruses 5,17 , and the S gene is more divergent at the genetic level than other coronavirus genes 14 . During adaptation to the human host cellular environment, mutations occur in the spike protein, some of which enhance the binding affinity and stability of the spike protein-hACE2 complex; this enhancement would increase the transmissibility of SARS-CoV-2 among humans and bring more severe disease 16 [...] Table 1). This study focused on 23 variants of the spike protein RBD that cause significantly higher or lower affinity to ACE2 and stability, with the cutoff for significance set at three (Table 1). Among the 9 mutations that strongly enhance the binding between the spike protein and ACE2, residue G446 (Figure 7) has also been reported by CDC to date. For the 1,150 variants (634 missense mutations in total) in the spike protein of SARS-CoV-2 reported by CDC, we looked up the affinity and stability in our analysis outcomes. Among the 76 missense variants located in the region T333-C525, five spike protein variants enhance its binding affinity to hACE2 (Table 3) and only three of them increase the stability of the spike-hACE2 complex (Table 4). Additionally, we collected the genome sequences of SARS-CoV-2 that harbor those spike protein variants and found that variant V483A occurs in 26 strains from the USA, V367F occurs in 12 strains from Hong Kong, Australia, and other European countries, and variant G446V, which strongly enhances the affinity, was identified in a strain from Australia (Supplementary Table 2). We mapped those strains onto our genomic phylogenetic tree and found that they were dispersed at positions close to SARS-bat and SARS-pangolin. As COVID-19 has spread globally, several studies have tried to find where this novel coronavirus came from, but few reports analyze the origin of SARS-CoV-2 according to the divergence pattern in the spike protein.
In our study, multiple spike protein variants were found in different countries and clustered in the ancestral direction of the phylogeny, which might suggest that SARS-CoV-2 infection started to occur at multiple sites and not only in Wuhan city. However, because of limited detection capability and the restricted availability of samples from infected animals, the variants available in the national database are incomplete, and more data need to be collected for further studies.
Furthermore, the SARS-CoV-2 strains with spike protein variants relative to the SARS-CoV-2 reference sequence (NC_045512.2) were more divergent at phylogenetic positions closer to SARS-bat and SARS-pangolin, which might indicate that consensus evolution occurred in the S gene and enabled SARS-CoV-2 to adapt well to human hosts. However, since it is difficult to determine the exact time at which patient zero was infected by SARS-CoV-2, the dates of sample collection and data extraction could mislead the results. We believe further clinical information from all countries is needed for research; this is a global concern that requires more cooperation and collaboration rather than political games or constant blame. We have shared all the analyses of over 3000 spike protein variants in this study to help the world track the mutations of SARS-CoV-2; they can also be useful for selecting potential druggable targets and neutral inhibitors to prevent further damage from the COVID-19 pandemic.
Aside from the mutations in the spike protein for adapting to new hosts, hACE2 also shows polymorphisms in the binding region, and host-virus interaction over time exerts natural selection on both the virus and host cells 30 . Therefore, variants in the hACE2 receptor would also play a role in SARS-CoV-2 infection. Cao and his colleagues (2020) 31 [...] Researchers found three unique hACE2 variants in the Italian population, P389H, W69C, and L351V, that might correspond to the high fatality 33 .
We compared the unique variants found in the local and Italian populations by mapping them onto the hACE2 protein structure ( Figure 8) and found that the Italian ones were closer to the binding region than the local ones, but in silico simulation indicated that none of them changes the affinity or stability of the complex of the SARS-CoV-2 spike protein and hACE2 (Table 2). Generally, we did not find any variants in ACE2 that would significantly increase or decrease the affinity and stability of spike [...]

Figure 1
The ML phylogenetic tree of different SARS-CoV-2 strains from various regions around the world (partial; the full tree is in Supplementary Figure 1). Bootstrap values are mapped on the branches, as well as the colors annotated for all clades.

Figure 2
The ML phylogenetic tree of the S, E, M, and N gene sequences of SARS-CoV-2 from various strains (partial; the full tree is in Supplementary Figure 2). Bootstrap values are mapped on the branches, as well as the colors annotated for all clades.

Figure 3
Alignment of the S gene from different SARS viruses.

Figure 4
Phylogenetic tree based on S gene (Partial, the full tree was presented in Supplementary Figure 3), the bootstrap values were mapped on the branch as well as the colors annotated for all clades.

Figure 5
Identified polymorphisms in the spike protein RBD mapped onto the structure of the SARS-CoV-2 spike protein in complex with human ACE2. Cyan = spike protein; orange = ACE2 binding directly to the spike protein. The 23 point mutations shown cause a significant change in affinity at the direct binding sites of the SARS-CoV-2 spike protein; blue represents decreased affinity and purple increased affinity.

Figure 6
Reported polymorphisms in the spike protein RBD mapped onto the structure of the SARS-CoV-2 spike protein in complex with human ACE2. Cyan = spike protein; orange = ACE2; blue represents decreased affinity and stability and red increased affinity and stability. A: Affinity; B: Stability.

Figure 7
Residue G446 variants in documents from CDC (G446V) and in our prediction (G446W)

Figure 8
Polymorphic sites of the spike protein from the local population and Italy (Citation); the Italian ones are marked in green and the local ones in red.