replying to V. Soni et al. Nature Communications https://doi.org/10.1038/s41467-024-46261-4 (2024)
In comments on our paper “Within-host genetic diversity of SARS-CoV-2 lineages in unvaccinated and vaccinated individuals,” Soni et al. argue that the methods we employed for detecting natural selection are unreliable. Our study examined nucleotide diversity (π) [1], the mean number of pairwise differences per nucleotide site, which is a common metric for quantifying within-host viral polymorphism [2]. Comparison of π at nonsynonymous (πN) and synonymous (πS) sites is thought to provide evidence for positive (πN > πS or πN/πS > 1) or purifying (πN < πS or πN/πS < 1) selection acting on amino acid changes [3,4]. This method has been used to study the intrahost evolution of viruses like influenza, often with evidence of positive selection in regions encoding immune epitopes [5]. Intrahost πN and πS have also been examined in SARS-CoV-2 [6–10], and our study [11] compared πN – πS across distinct COVID-19 patient subsets. We found that breakthrough infections in 2- or 3-dose Comirnaty and CoronaVac vaccinated individuals do not show elevated viral πN and may not change the direction of selection. These negative conclusions inherently control for viral demographic factors like bottlenecks that operate similarly in each patient, allowing straightforward interpretation of πN – πS differences.
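As a rough illustration of the definition of π given above, the following sketch computes the mean fraction of differing read pairs per site from allele counts. The function names, the input layout (one dictionary of allele counts per site), and the toy counts are assumptions made for illustration; this is not the SNPGenie implementation, and it does not partition sites into nonsynonymous and synonymous classes.

```python
from itertools import combinations


def pairwise_diff_fraction(counts):
    """Fraction of read pairs that differ at one site, given allele counts."""
    n = sum(counts.values())
    if n < 2:
        return 0.0  # fewer than two reads: no pairs to compare
    # Pairs that differ = sum over unordered allele pairs of count_a * count_b.
    differing = sum(a * b for a, b in combinations(counts.values(), 2))
    total_pairs = n * (n - 1) / 2
    return differing / total_pairs


def nucleotide_diversity(sites):
    """Mean pairwise difference per site across a list of sites."""
    return sum(pairwise_diff_fraction(c) for c in sites) / len(sites)


# Hypothetical toy data: three sites covered by 100 reads each.
sites = [
    {"A": 100},           # monomorphic site
    {"A": 90, "G": 10},   # low-frequency variant
    {"C": 50, "T": 50},   # intermediate-frequency variant
]
pi = nucleotide_diversity(sites)
print(f"pi = {pi:.4f}")
```

In a πN versus πS comparison, the same pairwise counting would be applied separately to nonsynonymous and synonymous differences and normalized by the corresponding numbers of sites.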
Soni et al. [12] challenge our null hypothesis of πN – πS = 0 (i.e., πN = πS), instead proposing that simulation is necessary for defining a precise expectation under neutrality. Indeed, πN – πS has widely recognized limitations [13]; for detecting positive selection, it is both overly conservative (may fail to detect positive selection when it has occurred) and susceptible to false positives (may spuriously detect positive selection when it has not occurred). Value is therefore placed on complementing the metric with other approaches. While recognizing these points, we believe the criticisms of Soni et al. may not be entirely valid. In fact, their own simulations demonstrate that selection is often readily detectable using a simple πN versus πS method.
First, Soni et al. employ analytical methods that do not reflect our study [11]. In our approach, the codon is treated as the observational unit, such that πN and πS values for each codon are averaged across all 2,820 intrahost samples or subsets thereof. Selection is then evaluated with a Z-test of the null hypothesis πN – πS = 0 by bootstrapping codons. This detects codon-specific patterns that are consistent across samples; takes advantage of the independent diversity generated in each sample; and compensates for the typically small number of intrahost single nucleotide variants (iSNVs) that pass quality control for any one sample. In contrast, Soni et al. [12] use the sample as the observational unit and report values of πN and πS for 200 replicates, analogous to only 200 samples. Their simulations also fail to recapitulate key aspects of the observed biological data, including πN – πS values and numbers of iSNVs per sample (Supplementary Fig. 1).
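The codon-bootstrap Z-test described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the analysis code used in the study: it takes per-codon πN and πS values that have already been averaged across samples, resamples codons with replacement, uses the bootstrap standard deviation as the standard error, and assumes a normal reference distribution. The function name, the number of bootstrap replicates, and the toy input values are all hypothetical.

```python
import math
import random
import statistics


def bootstrap_z_test(pi_n, pi_s, n_boot=1000, seed=0):
    """Z-test of H0: mean(piN) - mean(piS) = 0, resampling codons."""
    if len(pi_n) != len(pi_s):
        raise ValueError("pi_n and pi_s must have one value per codon")
    rng = random.Random(seed)
    n = len(pi_n)
    observed = statistics.fmean(pi_n) - statistics.fmean(pi_s)

    boot_diffs = []
    for _ in range(n_boot):
        # Resample codon indices with replacement, keeping piN/piS paired.
        idx = [rng.randrange(n) for _ in range(n)]
        boot_diffs.append(sum(pi_n[i] - pi_s[i] for i in idx) / n)

    se = statistics.stdev(boot_diffs)  # bootstrap standard error
    if se == 0:
        raise ValueError("no variation among bootstrap replicates")
    z = observed / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal P value
    return observed, z, p


# Hypothetical per-codon values in which piS tends to exceed piN.
pi_n = [0.001 * (i % 3) for i in range(60)]
pi_s = [0.002 * (i % 4) for i in range(60)]
diff, z, p = bootstrap_z_test(pi_n, pi_s)
print(f"piN - piS = {diff:.4f}, Z = {z:.2f}, P = {p:.2e}")
```

A negative Z here would be read as evidence of purifying selection and a positive Z as a candidate signal of positive selection, subject to the caveats discussed in the text.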
Next, Soni et al. report no statistical tests. However, based on data simulated with SLiM [14], they suggest that large variances make πN > πS probable even under purifying selection alone. This claim relies on the visual inspection of standard deviations in their Figs. 1–3. To assess it, we used the models of Soni et al. to simulate intrahost data for 100 samples, estimating standard errors of mean πN and πS as in our study. Purifying selection is highly significant for all models (P ≤ 5.0 × 10^−7, Z-tests) (Supplementary Fig. 1). Purifying selection is detected even using their own sample-based approach (P ≤ 1.6 × 10^−6, Wilcoxon Signed Rank tests). Thus, in contrast to their conclusions, a relatively small number of samples has sufficient statistical power to detect widespread selection using both methods.
Soni et al. then offer several simulations of positive selection. First, directional selection is modelled by introducing a single highly beneficial mutation (i.e., a selective sweep) in the context of a neutral/deleterious distribution of mutational fitness effects (DFE). Because the fraction of nonsynonymous mutations that are beneficial (fb) in this scenario is ~0.00007%, it is not surprising that πN – πS fails to detect positive selection. Specifically, πN – πS is tailored to detecting pervasive (multi-site), incomplete positive selection that is ‘caught in the act’. Population genetics theory suggests that the substitution of beneficial mutations takes an average of approximately \(2\ln(2N_e s)/s\) generations [15]. For selection coefficients (s) of 0.01–0.1 and intrahost effective population sizes (N_e) of 10^3–10^5, this implies an average of 45–644 days for SARS-CoV-2 (i.e., 106–1,520 replication cycles of 610 minutes [16]). A selective sweep is therefore not expected to complete over the course of a typical acute infection within a host. Furthermore, within-host viral evolution is likely to involve trade-offs, compensatory mutations, shifting fitness landscapes, and potentially balancing selection as a result of intrahost heterogeneity and frequency dependence [17]. In all cases, segregating nonsynonymous mutations will elevate πN.
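The sweep-duration range quoted above can be checked with a few lines of arithmetic. The sketch below evaluates the approximation 2 ln(2 N_e s)/s at the two parameter combinations that give the shortest and longest times within the stated ranges, then converts generations to days using the 610-minute replication cycle. The function and constant names are hypothetical.

```python
import math

CYCLE_MINUTES = 610          # replication cycle length quoted in the text
MINUTES_PER_DAY = 60 * 24


def sweep_generations(n_e, s):
    """Approximate generations for a beneficial mutation to substitute."""
    return 2 * math.log(2 * n_e * s) / s


def generations_to_days(generations):
    """Convert replication cycles to days."""
    return generations * CYCLE_MINUTES / MINUTES_PER_DAY


# Shortest case: strong selection (s = 0.1), small population (N_e = 10^3).
fast = sweep_generations(1e3, 0.1)
# Longest case: weak selection (s = 0.01), large population (N_e = 10^5).
slow = sweep_generations(1e5, 0.01)

print(f"{fast:.0f} to {slow:.0f} generations")
print(f"{generations_to_days(fast):.0f} to {generations_to_days(slow):.0f} days")
```

Both ends of the range are much longer than a typical acute infection, which is the basis for the statement that a sweep is not expected to complete within a host.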
In a second scenario of positive selection, Soni et al. set fb to 1.0% or 9.7% (s = 0.05–0.13) in the context of a DFE derived from Flynn et al. for Mpro (nsp5) [18]. We again used their models to simulate 100 samples (Fig. 1). Although they claim that πN – πS cannot detect selection, positive selection was highly significant at the whole-genome level for fb = 9.7% (πN/πS = 4.43, P < 2.2 × 10^−16), whereas purifying selection was detected for fb = 1.0% (πN/πS = 0.90, P = 0.0033; Z-tests). Thus, under the simulation parameters of Soni et al., positive selection becomes highly significant for fb somewhere in the range 1–10%, due to multiple beneficial mutations segregating at intermediate frequencies.
To estimate fb for SARS-CoV-2, we utilized the fitness effect calculations of Bloom and Neher [19]. The central 95% of synonymous mutational effects was considered a null (neutral) distribution, such that nonsynonymous mutations were classified as beneficial if their effects fell above the 97.5th percentile of synonymous mutations. Results are summarized in Table 1. For the whole genome, fb is 1.5%. For individual ORFs, fb ranges from 0.8% (ORF1ab) to 6.6% (ORF7a). For sliding windows of 30 codons such as used in our study [11], fb ranges from 0% to 13.7%. Maximum regional fb values occur near Spike codons ~127–175 and ~461–512, overlapping the antigenically important amino-terminal (NTD) and receptor-binding (RBD) domains [20]. Thus, at the levels of whole ORFs and functional domains, fb for SARS-CoV-2 often falls in a range that allows detection of positive selection by πN – πS.
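The classification rule above (a nonsynonymous mutation counts as beneficial when its effect exceeds the 97.5th percentile of synonymous effects) can be sketched as below. The function name and the toy effect values are hypothetical and are not taken from the Bloom and Neher dataset; the percentile interpolation method is also an assumption, since the text does not specify one.

```python
import statistics


def beneficial_fraction(syn_effects, nonsyn_effects):
    """Fraction of nonsynonymous effects above the synonymous 97.5th percentile."""
    # n=40 yields 39 cut points at 2.5% steps; the last is the 97.5th percentile.
    cuts = statistics.quantiles(syn_effects, n=40, method="inclusive")
    upper = cuts[-1]
    n_beneficial = sum(1 for effect in nonsyn_effects if effect > upper)
    return n_beneficial / len(nonsyn_effects)


# Hypothetical synonymous effects: 41 evenly spaced values from -5.0 to 5.0.
syn = [(i - 20) * 0.25 for i in range(41)]
# Hypothetical nonsynonymous effects; only 5.0 and 6.0 exceed the cutoff of 4.75.
nonsyn = [-6.0, -4.5, -3.0, -1.0, 0.0, 0.5, 2.0, 4.75, 5.0, 6.0]

fb = beneficial_fraction(syn, nonsyn)
print(f"fb = {fb:.1%}")
```

Applying the same function to the mutations within a single ORF or a 30-codon window would give the regional fb values of the kind summarized in Table 1.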
Last, we modified the simulations of Soni et al. by introducing a DFE based on the nonsynonymous fitness effect estimates of Bloom and Neher [19]. Whole-genome mutation effect fractions (bottom row of Table 1) were used as a background. Deleterious and beneficial selection coefficients (s) were modelled using gamma (mean = −0.32, shape = 1.70) and exponential (mean = 0.087) distributions, respectively. Under these parameters, at the whole-genome level, selection was not significant (πN/πS = 1.03, P = 0.51) (Fig. 1b bottom). At the level of 30-codon sliding windows, we considered regions with πN > πS to be candidates for positive selection at various P value cut-offs, detecting 131 true positives (windows with at least one beneficial mutation) and 0 false positives for P < 0.0124. Thus, even under a nonideal scenario where the precise genomic targets of selection (codons with beneficial mutations) differ stochastically across samples, sliding windows are a reasonable candidate generator for regions undergoing positive selection.
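To make the DFE parameterization above concrete, the following sketch draws selection coefficients from a gamma distribution with mean −0.32 and shape 1.70 (deleterious) and from an exponential distribution with mean 0.087 (beneficial). It is an illustration in Python, not the SLiM configuration used for the simulations. Python's `random.gammavariate` takes shape and scale, so the scale is derived here as |mean| / shape and the draw is negated; the function names, seed, and sample size are hypothetical.

```python
import random
import statistics

DEL_MEAN, DEL_SHAPE = -0.32, 1.70   # gamma parameters quoted in the text
BEN_MEAN = 0.087                    # exponential mean quoted in the text


def draw_deleterious(rng):
    """Draw a deleterious selection coefficient (negative gamma variate)."""
    scale = abs(DEL_MEAN) / DEL_SHAPE   # gamma mean = shape * scale
    return -rng.gammavariate(DEL_SHAPE, scale)


def draw_beneficial(rng):
    """Draw a beneficial selection coefficient (exponential variate)."""
    return rng.expovariate(1.0 / BEN_MEAN)  # rate = 1 / mean


rng = random.Random(1)
del_s = [draw_deleterious(rng) for _ in range(20000)]
ben_s = [draw_beneficial(rng) for _ in range(20000)]

print(f"mean deleterious s = {statistics.fmean(del_s):.3f}")
print(f"mean beneficial s  = {statistics.fmean(ben_s):.3f}")
```

This sketch does not truncate extreme draws or mix the two classes with a neutral class; the mixing proportions would come from the whole-genome fractions in Table 1.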
All simulation results reported by Soni et al. and herein are subject to many limitations and likely do not reflect biological reality. First, DFEs were derived from functional assays [18] or clinical isolates [19] and therefore describe between-host evolution, but it is known that purifying selection is weaker within hosts [6,21]. Second, the models may contain important misspecifications, including (1) sequencing coverage of only 100 effective reads (median coverage in our study was 20,782 reads); (2) 2/3 of sites nonsynonymous (compared to ~3/4 in most real ORFs); (3) s > 1.0 in a SLiM non-Wright-Fisher context (Soni et al. Figure 2); (4) intrahost dynamics that may deviate from expected viral population sizes; and (5) no tendency for the same site to be under similar selection pressures across multiple samples (e.g., no convergent selected changes). Model complexity potentiates increased misspecification bias, and it is important for both biological parameters and analytical methods to match between simulated and empirical data.
To summarize, πN – πS has limitations. Care must be exercised, as factors other than positive selection can yield πN > πS, especially in short genome regions where πS is subject to stochastic fluctuation. The expected value of πN/πS depends on fb and DFE properties. More work is needed to determine the precise values of fb necessary for detecting positive selection, intrahost DFEs, and additional criteria for lowering the false-discovery rate (e.g., a minimum πN cutoff). All parameters are likely to vary by host, virus, lineage, and many other contexts. SLiM offers unprecedented opportunities for simulating complex evolutionary scenarios in order to test specific hypotheses [14]. Nevertheless, we maintain that simple methods like πN – πS have value. In the same way, simple dN/dS analyses continue to yield highly informative results [22] even though viral consensus sequences do not incorporate real-world complexity, and each site in a genome may in reality follow its own ‘model’ of evolution which changes over time [23]. As the aphorism suggests, the question is not whether models are realistic, but rather whether they are useful [24]. While more advanced methods are always welcome, there is no one ‘right’ way to analyze evolutionary genomics data [23].
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
All input data, intermediate files, and simulated data have been deposited at Zenodo under accession code https://doi.org/10.5281/zenodo.10552831. Data for estimating fb were obtained from the aamut_fitness_all.csv file of Bloom and Neher [19] (public_2023-10-01 dataset; accessed 2023/10/05). Figure source data are provided as a Source Data file. Source data are provided with this paper.
Code availability
Simulation and analysis scripts have been deposited at Zenodo under accession code https://doi.org/10.5281/zenodo.10552831.
References
1. Nei, M. & Li, W.-H. Mathematical model for studying genetic variation in terms of restriction endonucleases. Proc. Natl Acad. Sci. USA 76, 5269–5273 (1979).
2. Lauring, A. S. Within-host viral diversity: a window into viral evolution. Annu. Rev. Virol. 7, 63–81 (2020).
3. Nelson, C. W. & Hughes, A. L. Within-host nucleotide diversity of virus populations: Insights from next-generation sequencing. Infection Genet. Evol. 30, 1–7 (2015).
4. Nelson, C. W., Moncla, L. H. & Hughes, A. L. SNPGenie: estimating evolutionary parameters to detect natural selection using pooled next-generation sequencing data. Bioinformatics 31, 3709–3711 (2015).
5. Moncla, L. H. et al. Selective bottlenecks shape evolutionary pathways taken during mammalian adaptation of a 1918-like avian influenza virus. Cell Host Microbe 19, 169–180 (2016).
6. Nelson, C. W. et al. Dynamically evolving novel overlapping gene as a factor in the SARS-CoV-2 pandemic. eLife 9, e59633 (2020).
7. Lythgoe, K. A. et al. SARS-CoV-2 within-host diversity and transmission. Science 372, eabg0821 (2021).
8. Bashor, L. et al. SARS-CoV-2 evolution in animals suggests mechanisms for rapid variant selection. Proc. Natl Acad. Sci. USA 118, e2105253118 (2021).
9. Tonkin-Hill, G. et al. Patterns of within-host genetic diversity in SARS-CoV-2. eLife 10, e66857 (2021).
10. San, J. E. et al. Transmission dynamics of SARS-CoV-2 within-host diversity in two major hospital outbreaks in South Africa. Virus Evol. 7, veab041 (2021).
11. Gu, H. et al. Within-host genetic diversity of SARS-CoV-2 lineages in unvaccinated and vaccinated individuals. Nat. Commun. 14, 1793 (2023).
12. Soni, V., Terbot II, J. W. & Jensen, J. D. Population genetic considerations regarding the interpretation of within-patient SARS-CoV-2 polymorphism data. Nat. Commun. This issue (2023).
13. Kryazhimskiy, S. & Plotkin, J. B. The population genetics of dN/dS. PLoS Genet. 4, e1000304 (2008).
14. Haller, B. C. & Messer, P. W. SLiM 3: forward genetic simulations beyond the Wright–Fisher model. Mol. Biol. Evol. 36, 632–637 (2019).
15. Walsh, B. & Lynch, M. Evolution and Selection of Quantitative Traits (Oxford University Press, 2018).
16. Terbot, J. W. et al. Developing an appropriate evolutionary baseline model for the study of SARS-CoV-2 patient samples. PLoS Pathog. 19, e1011265 (2023).
17. Daugherty, M. D. & Malik, H. S. Rules of engagement: molecular insights from host-virus arms races. Annu. Rev. Genet. 46, 677–700 (2012).
18. Flynn, J. M. et al. Comprehensive fitness landscape of SARS-CoV-2 Mpro reveals insights into viral resistance mechanisms. eLife 11, e77433 (2022).
19. Bloom, J. D. & Neher, R. A. Fitness effects of mutations to SARS-CoV-2 proteins. Virus Evol. 9, vead055 (2023).
20. Carabelli, A. M. et al. SARS-CoV-2 variant biology: immune escape, transmission and fitness. Nat. Rev. Microbiol. https://doi.org/10.1038/s41579-022-00841-7 (2023).
21. Holmes, E. C. The Evolution and Emergence of RNA Viruses (Oxford University Press, 2009).
22. Lucaci, A. G. et al. RASCL: rapid assessment of selection in CLades through molecular sequence analysis. PLoS ONE 17, e0275623 (2022).
23. Hughes, A. L., Friedman, R. & Glenn, N. L. The future of data analysis in evolutionary genomics. Curr. Genomics 7, 227–234 (2006).
24. Box, G. E. P. Science and Statistics. J. Am. Stat. Assoc. 71, 791–799 (1976).
25. Jungreis, I. et al. Conflicting and ambiguous names of overlapping ORFs in the SARS-CoV-2 genome: a homology-based resolution. Virology 558, 145–151 (2021).
Acknowledgements
The authors acknowledge the Research Grants Council of HK theme-based research schemes (T11-705/21-N (L.L.M.P.)), Health and Medical Research Fund (COVID190205 (L.L.M.P.)), and InnoHK grant (L.L.M.P.) for the Centre for Immunology and Infection. H.G. was supported by the RGC Postdoctoral Fellowship Scheme (PDFS2324-7S03 (H.G.)) by the University Grants Committee of Hong Kong. C.W.N. was supported by the NCI Research Participation Program administered by the Oak Ridge Institute for Science and Education (ORISE) through an interagency agreement between the U.S. Department of Energy (DOE) and the National Institutes of Health (NIH). ORISE is managed by ORAU under DOE contract number DESC0014664. All opinions expressed are the authors’ and do not necessarily reflect the policies and views of their organizations. The authors thank Jesse Bloom, Ben Haller, Sarah P. Otto, Helen Piontkivska, April (Xinzhu) Wei, Zachary Ardern, Louise H. Moncla, Ming-Hsueh Lin, Lisa Mirabello, and Meredith Yeager for feedback.
Author information
Contributions
C.W.N., L.L.M.P., and H.G. conceived of the project and wrote the manuscript; C.W.N. performed simulations and analyses; L.L.M.P. provided funding for the project.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Nelson, C.W., Poon, L.L.M. & Gu, H. Reply to: Population genetic considerations regarding the interpretation of within-patient SARS-CoV-2 polymorphism data. Nat Commun 15, 3239 (2024). https://doi.org/10.1038/s41467-024-46262-3
DOI: https://doi.org/10.1038/s41467-024-46262-3