Emerging SARS-CoV-2 variants follow a historical pattern recorded in outgroups infecting non-human hosts

Katoh, Kazutaka; Standley, Daron M.

doi:10.1038/s42003-021-02663-4

Article
Open access
Published: 22 September 2021

Communications Biology volume 4, Article number: 1134 (2021)

Abstract

The ability to predict emerging variants of SARS-CoV-2 would be of enormous value, as it would enable proactive design of vaccines in advance of such emergence. We estimated the diversity of each site in a multiple sequence alignment (MSA) of the Spike (S) proteins from close relatives of SARS-CoV-2 that infected bats and pangolins before the pandemic. We then compared the locations of high diversity sites in this MSA with those of mutations found in multiple emerging lineages of human-infecting SARS-CoV-2. This comparison revealed a significant correspondence, which suggests that a limited number of sites in this protein are repeatedly substituted in different lineages of this group of viruses. It follows, therefore, that the sites of future emerging mutations in SARS-CoV-2 can be predicted by analyzing their relatives (outgroups) that have infected non-human hosts. We discuss a possible evolutionary basis for these substitutions and provide a list of frequently substituted sites that potentially include the sites of future emerging variants in SARS-CoV-2.

Introduction

In December 2020, three SARS-CoV-2 variants with increased infectivity emerged in England, South Africa, and Brazil. The fact that certain mutations in the Spike (S) protein had occurred independently in these variants prompted us to reexamine our September 2020 study of the evolution of this protein¹. In our original study, we characterized the importance of each residue position in the S protein by comparing its diversity in SARS-CoV-2 with that in relatives (outgroups) that infected bats or pangolins, using a simple equation:

$$\mathit{Importance} = \mathit{diversity}(\text{SARS-CoV-2} + \text{outgroup}) - \mathit{diversity}(\text{SARS-CoV-2}), \qquad (1)$$

where diversity(x) is defined as the number of different amino acids observed at the site in question in virus group x. This equation, which was meant to be descriptive rather than predictive, identified 20 positions of high importance. We were thus surprised to find that, of these 20 positions, four were characteristic of the above emerging variants: Histidine 69, Valine 70, Glutamic acid 484, and Asparagine 501. These sites coincide with four of the five residues (69, 70, 417, 484, 501) that have mutated independently in two or more of the three emerging lineages or in a lineage transmitted between humans and mink². We reanalyzed the underlying sequence data and found that the importance values of these sites were determined primarily by diversity(outgroup), rather than diversity(SARS-CoV-2). In hindsight, this is somewhat expected, as the latter term was close to unity at the time when we performed the analysis (i.e., before the emergence of new variants).
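As a rough illustration of Eq. 1, the sketch below computes diversity and importance column by column on a toy alignment. The function names, the invented toy sequences, and the choice to ignore gap and unknown characters are assumptions made for this example; the authors' own counting script is the one referred to under Code availability.

```python
from typing import List, Sequence

# Assumption: gaps and unknown residues are not counted as amino acids.
IGNORED = {"-", "X", "?"}

def diversity(column: Sequence[str]) -> int:
    """Number of distinct amino acids observed in one alignment column."""
    return len({aa.upper() for aa in column if aa.upper() not in IGNORED})

def importance(sars2_rows: List[str], outgroup_rows: List[str]) -> List[int]:
    """Eq. 1 per column: diversity(SARS-CoV-2 + outgroup) - diversity(SARS-CoV-2)."""
    n_cols = len(sars2_rows[0])
    scores = []
    for i in range(n_cols):
        sars2_col = [row[i] for row in sars2_rows]
        out_col = [row[i] for row in outgroup_rows]
        scores.append(diversity(sars2_col + out_col) - diversity(sars2_col))
    return scores

# Toy aligned sequences (invented residues, not real Spike data).
sars2 = ["MHVNE", "MHVNE", "MHVNE"]
outgroup = ["MHINQ", "MYVTE", "M-VSE"]

print(importance(sars2, outgroup))  # [0, 1, 1, 2, 1]
```

Because the SARS-CoV-2 rows are identical in this toy case, diversity(SARS-CoV-2) is 1 at every column and the score is driven entirely by the outgroup rows, which mirrors the observation made in the text.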

A natural question, then, is why a limited set of sites with high diversity in outgroups have also recently been substituted in SARS-CoV-2. Two extreme scenarios are possible as the evolutionary mechanism behind such frequent substitutions: (i) neutral evolution and (ii) positive selection. These scenarios give opposite predictions about the functional relevance of frequently substituted residues. Scenario (i) predicts that the frequently substituted sites are not functionally important, because they are under weak functional constraints, while scenario (ii) predicts that functionally important sites have changed because they were positively selected. Although the truth may lie between these two extremes, we tested which scenario is more likely by using the distribution of residues that are known to be important for infection of host cells.

Results and discussion

Known functionally important sites

Currently available functional information supports scenario (ii). When the distribution of residues with high diversity(outgroup) is viewed as a heatmap on the spike molecular surface (Fig. 1a, b), it is apparent that these residues are not evenly distributed but form clusters in the N-terminal domain (NTD), the receptor binding domain (RBD), and the S1/S2 cleavage site, which are thought to be important for interaction with human cells. More specifically, Glutamic acid 484 and Asparagine 501 are structurally close to the interface with the host cell receptor ACE2, an interface that is, in turn, targeted by neutralizing antibodies. Histidine 69 and Valine 70, on the other hand, are far from the ACE2 binding site but proximal to a recently reported epitope for infection-enhancing antibodies^3,4. The 69/70 deletion mutant also occurred in an immunosuppressed individual who underwent convalescent plasma therapy⁵, suggesting that the mutation is a direct response to host antibodies. These two residues have also been reported to bind sialic acids⁶. There are also high diversity sites (around Alanine 684) adjacent to the S1/S2 cleavage site of SARS-CoV-2, as indicated in Table 1. The changes in this region seem to be host specific: HSMSS[LF]R in pangolin; QTQTNSR in two lineages of bat; QTQTNSPRRAR (which includes a polybasic insertion recognized by the host’s protease^7,8) in human. These changes might reflect adaptation to new hosts in the past. Further substitutions in this region, such as at Proline 681, could change infectivity in humans⁹.

**Fig. 1: Diversity and other indices mapped on structure of the S protein, visualized by ChimeraX²⁸.**

Table 1 High diversity residues.

Possible positive selection in SARS-CoV-2 and outgroups

Modification of the regions discussed above could thus affect infectivity or enable the virus to escape from the host’s immune system, albeit temporarily: the change will inevitably be counteracted by a shift in the antibody repertoire of the host, resulting in an effective “arms race”, as reviewed in references^10,11. In this scenario, the sites with higher diversity imply direct or indirect host–pathogen interactions and are thus in a constant state of flux. This interpretation is consistent with previous studies that reported the possibility of adaptive evolution toward human infection in SARS-CoV-2^12,13 and in other coronaviruses¹⁴. More recently, reports suggest that mutations in the B.1.1.7/B.1.351/P.1 lineages result in reduced binding affinity to some but not all neutralizing antibodies^15,16. For close outgroups of SARS-CoV-2, where functional information in non-human hosts is not available, we explored the possibility of positive selection using Bayes Empirical Bayes analysis¹⁷ implemented in the PAML program¹⁸, excluding human-infecting lineages. The fourth column in https://mafft.cbrc.jp/alignment/pub/sarscov2/fulllist.tsv shows the sites estimated to have had more nonsynonymous substitutions than synonymous substitutions in the outgroup sequences, although this estimation is sensitive to sequence selection and alignment ambiguity.

Correspondence of high diversity sites between SARS-CoV-2 and outgroups

Assuming adaptive evolution, it is conceivable that different sites are positively selected in different hosts to “optimize” infectivity; however, the analysis of outgroups revealed that such sites can overlap, presumably because they are involved in a common mechanism of host-pathogen interaction in different lineages, and that frequent changes at these sites had already occurred in outgroups before the pandemic. We note that the correspondence between the positions of emerging mutations found in multiple human-infecting variants and those with high diversity(outgroup) is significant by Fisher’s exact test, regardless of whether the original outgroup (Fig. 1a) or a broad outgroup (Fig. 1b) is used (see the lines marked with an asterisk, *, in Fig. 1e, f). By contrast, the positions of sporadic mutations that are found in just a single human variant show less clear or no correspondence with diversity(outgroup) (see the lines marked with a dagger, †, in Fig. 1e, f). The former type of mutation (found in multiple variants) is likely to affect interactions with host factors and to spread in humans, although it is difficult to differentiate between parallel evolution and recombination between lineages. The proposed simple method is suitable for predicting such sites because they appear to be under positive selection in independent lineages, including outgroups. Some sites have high diversity in the outgroup but are not (yet) mutated in the current population of SARS-CoV-2. Such sites are counted as mispredictions in this statistical test, but they may mutate in the future. Indeed, residues close to the S1/S2 cleavage site were found to have high diversity(outgroup) in our initial analysis in 2020, before the emergence of several variants of concern (red sites in Fig. 1a). Subsequently, substitutions in this region were found in multiple variants infecting humans (red sites in Fig. 1c).

Prediction of position of emerging mutations

To anticipate new variants of SARS-CoV-2 as early as possible, a straightforward strategy would be to intensively collect a large amount of sequence data from human-infecting lineages¹⁹. Our observation above leads to a complementary strategy: prepare against new variants in advance by decoding the long history of host-pathogen interactions recorded in the outgroup sequences infecting non-human hosts. Unfortunately, current efforts have focused almost exclusively on the former strategy, and available outgroup sequences are limited. If richer sequence data from outgroups infecting bats, pangolins, and other possible hosts become available, they would not only shed light on the origin of SARS-CoV-2²⁰ but also give us an advantage in the arms race with this virus.

Limitations

This analysis has several limitations. First, genetic changes can be caused by recombination, not only by point mutations and insertions/deletions. Indeed, the receptor binding motif of SARS-CoV-2 was reported to have been acquired from a lineage infecting pangolins²¹. It is possible that some changes in outgroups were also caused by recombination. Our analysis treats a recombination event simply as simultaneous changes at successive sites in the recipient genome. As a result, relative to the number of evolutionary events, diversity is overestimated if recombination between lineages occurred and the donor is not included in the alignment, while diversity is underestimated when the donor is included in the alignment. Second, the diversity in the outgroup can be calculated only when the corresponding part exists in the outgroups. This method cannot be applied to human-specific insertions. Third, the predicted position of a mutation can be ambiguous because of alignment ambiguity. Figure 2a shows the multiple sequence alignment (MSA) used in Saputri et al.¹, and Fig. 2b shows an alternative MSA. Since the insertion in the pangolin P2V strain should be independent of the insertion in the human and RaTG13 sequences, gaps can be placed at different positions in these two groups. Thus, some other MSAs (e.g., Fig. 2c) are also possible. When different MSAs are used, the position of high diversity sites can shift. Even in this case, a high diversity region should exist nearby, because the alignment ambiguity itself is due to high diversity. This problem occurs more frequently when a wider range of outgroups is included. Thus, to obtain a prediction at residue-level resolution, a large amount of data from close outgroups is necessary.
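The third limitation can be made concrete with a small, purely hypothetical example: the same four ungapped sequences are aligned in two equally plausible ways that differ only in where one gap is placed. The sequences, function names, and the decision to ignore gaps when counting are assumptions for illustration and are not taken from Fig. 2.

```python
from typing import List

def diversity_profile(rows: List[str]) -> List[int]:
    """Distinct amino acids per column, ignoring gap characters."""
    return [len({aa for aa in column if aa != "-"}) for column in zip(*rows)]

def peak_column(profile: List[int]) -> int:
    """0-based index of the first column with maximal diversity."""
    return profile.index(max(profile))

# Two alternative alignments of the same sequences (invented residues).
# Only the gap in the second row is placed differently.
alignment_1 = ["MAHVSGT", "MAQ-SGT", "MAYVSGT", "MAHISGT"]
alignment_2 = ["MAHVSGT", "MA-QSGT", "MAYVSGT", "MAHISGT"]

profile_1 = diversity_profile(alignment_1)
profile_2 = diversity_profile(alignment_2)

print(profile_1, peak_column(profile_1))  # [1, 1, 3, 2, 1, 1, 1] 2
print(profile_2, peak_column(profile_2))  # [1, 1, 2, 3, 1, 1, 1] 3
```

The highest-diversity column moves by one position between the two alignments, yet in both cases the elevated values stay within the same two-column neighbourhood, which is the behaviour described in the text.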

**Fig. 2: MSA around residues 69, 70, 417, 484, and 501, visualized by Jalview²⁹.**

Methods

Sequence data to calculate diversity

According to the interpretation of positive selection resulting in an “arms race”, it is possible that positions of mutations in future emerging variants can be predicted simply by identifying sites with high diversity in outgroups, where adversarial host-pathogen interactions have been occurring for longer than they have for SARS-CoV-2 and humans. Because of their potential importance in the design of vaccines against future emerging variants, we calculated diversity(outgroup) for each residue position under two definitions of the outgroup: one identical to that used in our original analysis (6 sequences) and a broader one (18 sequences) that increases the amount of data used in the calculation. Both datasets are available at https://mafft.cbrc.jp/alignment/pub/sarscov2/. The sequence data were taken from GISAID²² and GenBank to cover the major lineages of outgroups appearing in recent reports^23,24. As noted in the Introduction, diversity(x) is defined as the number of different amino acids observed at the site in question, where x is either of the two outgroups. Amino acid residues with high diversity(outgroup) are listed in Table 1.
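A listing of this kind could be produced along the following lines. This is a minimal sketch under stated assumptions: the function name, the toy sequences, and the cutoff of three distinct residues are hypothetical, since the threshold behind Table 1 is not given in this text. Alignment columns are mapped to residue numbers of a reference row, and columns where the reference has a gap are skipped because they have no reference residue number.

```python
from typing import List, Tuple

def high_diversity_sites(
    reference: str, outgroup: List[str], threshold: int = 3
) -> List[Tuple[int, str, int]]:
    """Return (reference position, reference residue, diversity(outgroup))
    for columns whose outgroup diversity reaches the threshold."""
    sites = []
    ref_pos = 0
    for col, ref_aa in enumerate(reference):
        if ref_aa == "-":
            continue  # column absent from the reference sequence
        ref_pos += 1  # 1-based residue number in the ungapped reference
        observed = {row[col] for row in outgroup if row[col] != "-"}
        if len(observed) >= threshold:
            sites.append((ref_pos, ref_aa, len(observed)))
    return sites

# Toy aligned rows (invented residues, not real Spike data).
reference = "MF-VNE"
outgroup = ["MFAVNE", "MY-ISQ", "MH-VTE"]

print(high_diversity_sites(reference, outgroup))  # [(2, 'F', 3), (4, 'N', 3)]
```

Running the same function once per outgroup definition (for example, a 6-sequence and an 18-sequence set) would give two lists that can be compared site by site.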

Tree

Figure 3 shows a phylogenetic tree of the S protein from SARS-CoV-2 and its relatives, including remote ones. The tree was inferred by the neighbor-joining method²⁵ applied to a distance matrix estimated with the Poisson correction from an amino acid sequence alignment built by MAFFT²⁶. The outgroup sequences used here are highlighted in this figure. The accession numbers of the sequences are given in this tree.
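For reference, the Poisson correction converts the observed proportion p of differing sites between two aligned protein sequences into a distance d = -ln(1 - p). The sketch below computes such a matrix for toy sequences; the sequence names and residues are invented, and skipping columns where either sequence has a gap is an assumption, not a detail stated in the paper. The tree-building step itself is not shown.

```python
from math import log
from typing import Dict, Tuple

def poisson_distance(seq_a: str, seq_b: str) -> float:
    """Poisson-corrected distance d = -ln(1 - p) between two aligned sequences."""
    compared = differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # assumption: ignore columns with a gap in either sequence
        compared += 1
        differing += a != b
    if compared == 0:
        raise ValueError("no comparable columns")
    p = differing / compared
    if p >= 1.0:
        raise ValueError("sequences differ at every site; distance is undefined")
    return -log(1.0 - p)

def distance_matrix(aligned: Dict[str, str]) -> Dict[Tuple[str, str], float]:
    """All pairwise Poisson-corrected distances, keyed by (name, name)."""
    return {
        (x, y): poisson_distance(aligned[x], aligned[y])
        for x in aligned
        for y in aligned
    }

# Toy aligned sequences (invented, not real Spike data).
toy = {"A": "MKVLATGE", "B": "MRVLA-GE", "C": "MRILSTGD"}
D = distance_matrix(toy)
for pair in [("A", "B"), ("A", "C"), ("B", "C")]:
    print(pair, round(D[pair], 4))
```

A matrix of this form is the input that a neighbor-joining implementation expects.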

Statistics and reproducibility

In Fig. 1e, f, p values were calculated by Fisher’s exact test under the null hypothesis that the diversity of each site in outgroups (Fig. 1a or b) and the distribution of emerging mutations (Fig. 1c) are independent of each other. Positions of the mutations found in multiple variants and the sporadic mutations in Fig. 1c were taken from Peacock et al.²⁷.
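One way to set up such a test is sketched below, using only the Python standard library. The paper does not state here whether the test was one- or two-sided or how sites were split into high and low diversity, so the one-sided form, the set-based inputs, and all site numbers in the example are assumptions for illustration.

```python
from math import comb
from typing import Set, Tuple

def contingency(
    high_diversity: Set[int], mutated: Set[int], all_sites: Set[int]
) -> Tuple[int, int, int, int]:
    """2x2 counts (a, b, c, d): rows = high/low diversity, columns = mutated/not.
    Both input sets are assumed to be subsets of all_sites."""
    a = len(high_diversity & mutated)
    b = len(high_diversity - mutated)
    c = len(mutated - high_diversity)
    d = len(all_sites) - a - b - c
    return a, b, c, d

def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided p value: probability of an overlap of at least `a`
    under independence, with both margins of the table held fixed."""
    row1, col1, n = a + b, a + c, a + b + c + d
    total = comb(n, col1)
    tail = sum(
        comb(row1, k) * comb(n - row1, col1 - k)
        for k in range(a, min(row1, col1) + 1)
    )
    return tail / total

# Hypothetical example: 20 sites, 4 with high diversity, 4 mutated, 3 shared.
all_sites = set(range(1, 21))
high = {3, 7, 12, 15}
mutated = {7, 12, 15, 18}

table = contingency(high, mutated, all_sites)
print(table)                                    # (3, 1, 1, 15)
print(round(fisher_exact_greater(*table), 4))   # 0.0134
```

The p value is the upper tail of a hypergeometric distribution, so a small value indicates that the overlap between the two site sets is larger than expected if they were independent.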

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

Sequence data used here are available at https://mafft.cbrc.jp/alignment/pub/sarscov2/.

Code availability

A script to count the number of amino acids in each site is available at https://mafft.cbrc.jp/alignment/pub/sarscov2/.

References

1. Saputri, D. S. et al. Flexible, functional, and familiar: characteristics of SARS-CoV-2 spike protein evolution. Front. Microbiol. 11, 2112 (2020).
2. Lassaunière, R. et al. Working paper on SARS-CoV-2 spike mutations arising in Danish mink, their spread to humans and neutralization data. https://files.ssi.dk/Mink-cluster-5-short-report_AFO2 (2021).
3. Li, D. et al. The functions of SARS-CoV-2 neutralizing and infection-enhancing antibodies in vitro and in mice and nonhuman primates. bioRxiv https://doi.org/10.1101/2020.12.31.424729 (2021).
4. Liu, Y. et al. An infectivity-enhancing site on the SARS-CoV-2 spike protein is targeted by COVID-19 patient antibodies. Cell 184, 3452–3466.e18 (2021).
5. Kemp, S. A. et al. SARS-CoV-2 evolution during treatment of chronic infection. Nature 592, 277–282 (2021).
6. Baker, A. N. et al. The SARS-COV-2 spike protein binds sialic acids and enables rapid detection in a lateral flow point of care diagnostic device. ACS Cent. Sci. 6, 2046–2052 (2020).
7. Hoffmann, M. et al. A multibasic cleavage site in the spike protein of SARS-CoV-2 is essential for infection of human lung cells. Mol. Cell 78, 779–784 (2020).
8. Peacock, T. P. et al. The furin cleavage site in the SARS-CoV-2 spike protein is required for transmission in ferrets. Nat. Microbiol. 6, 899–909 (2021).
9. Lubinski, B. et al. Functional evaluation of proteolytic activation for the SARS-CoV-2 variant B.1.1.7: role of the P681H mutation. bioRxiv https://doi.org/10.1101/2021.04.06.438731 (2021).
10. Meyerson, N. R. & Sawyer, S. L. Two-stepping through time: mammals and viruses. Trends Microbiol. 19, 286–294 (2011).
11. Bonsignori, M. et al. Antibody-virus co-evolution in HIV infection: paths for HIV vaccine development. Immunol. Rev. 275, 145–160 (2017).
12. Tegally, H. et al. Detection of a SARS-CoV-2 variant of concern in South Africa. Nature 592, 438–443 (2021).
13. Kang, L. et al. A selective sweep in the Spike gene has driven SARS-CoV-2 human adaptation. Cell 184, 4392–4400.e4 (2021).
14. Kistler, K. E. & Bedford, T. Evidence for adaptive evolution in the receptor-binding domain of seasonal coronaviruses OC43 and 229E. eLife 10, e64509 (2021).
15. Zhou, D. et al. Evidence of escape of SARS-CoV-2 variant B.1.351 from natural and vaccine-induced sera. Cell 184, 2348–2361 (2021).
16. Hoffmann, M. et al. SARS-CoV-2 variants B.1.351 and P.1 escape from neutralizing antibodies. Cell 184, 2384–2393 (2021).
17. Yang, Z. et al. Bayes empirical Bayes inference of amino acid sites under positive selection. Mol. Biol. Evol. 22, 1107–1118 (2005).
18. Yang, Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24, 1586–1591 (2007).
19. Pater, A. A. et al. Emergence and evolution of a prevalent new SARS-CoV-2 variant in the United States. bioRxiv https://doi.org/10.1101/2021.01.11.426287 (2021).
20. Andersen, K. G. et al. The proximal origin of SARS-CoV-2. Nat. Med. 26, 450–452 (2020).
21. Li, X. et al. Emergence of SARS-CoV-2 through recombination and strong purifying selection. Sci. Adv. 6, eabb9153 (2020).
22. Elbe, S. & Buckland-Merrett, G. Data, disease and diplomacy: GISAID’s innovative contribution to global health. Global Challenges 1, 33–46 (2017).
23. Guo, H. et al. Identification of a novel lineage bat SARS-related coronaviruses that use bat ACE2 receptor. bioRxiv https://doi.org/10.1101/2021.05.21.445091 (2021).
24. Zhou, H. et al. Identification of novel bat coronaviruses sheds light on the evolutionary origins of SARS-CoV-2 and related viruses. Cell 184, 1–12 (2021).
25. Saitou, N. et al. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4, 406–425 (1987).
26. Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013).
27. Peacock, T. P. et al. SARS-CoV-2 one year on: evidence for ongoing viral adaptation. J. Gen. Virol. 102, 001584 (2021).
28. Pettersen, E. F. et al. UCSF ChimeraX: Structure visualization for researchers, educators, and developers. Protein Sci. 30, 70–82 (2021).
29. Waterhouse, A. M. et al. Jalview Version 2-a multiple sequence alignment editor and analysis workbench. Bioinformatics 25, 1189–1191 (2009).

Acknowledgements

This work was supported by JSPS KAKENHI Grant Number T20K067670 and by the Platform Project for Supporting Drug Discovery and Life Science Research [Basis for Supporting Innovative Drug Discovery and Life Science Research (BINDS)] from AMED under Grant Number 21am0101108j0005. We thank Keisuke Takahashi, Shunsuke Teraguchi, Tokiko Watanabe, and Songling Li for helpful discussions regarding the preparation of the manuscript.

Author information

Authors and Affiliations

Research Institute for Microbial Diseases, Osaka University, 3-1 Yamadaoka, Suita, 565-0871, Japan
Kazutaka Katoh & Daron M. Standley

Authors

Kazutaka Katoh
Daron M. Standley

Contributions

D.M.S. conceived the study. K.K. designed and performed the analysis. K.K. and D.M.S. wrote the manuscript. Both authors approved the manuscript.

Corresponding authors

Correspondence to Kazutaka Katoh or Daron M. Standley.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Communications Biology thanks Mushtaq Hussain and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Karli Montague-Cardoso.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Katoh, K., Standley, D.M. Emerging SARS-CoV-2 variants follow a historical pattern recorded in outgroups infecting non-human hosts. Commun Biol 4, 1134 (2021). https://doi.org/10.1038/s42003-021-02663-4

Received: 22 April 2021
Accepted: 08 September 2021
Published: 22 September 2021
DOI: https://doi.org/10.1038/s42003-021-02663-4
