Origin and cross-species transmission of bat coronaviruses in China

Bats are presumed reservoirs of diverse coronaviruses (CoVs) including progenitors of Severe Acute Respiratory Syndrome (SARS)-CoV and SARS-CoV-2, the causative agent of COVID-19. However, the evolution and diversification of these coronaviruses remain poorly understood. Here we use a Bayesian statistical framework and a large sequence data set from bat-CoVs (including 630 novel CoV sequences) in China to study their macroevolution, cross-species transmission and dispersal. We find that host-switching occurs more frequently and across more distantly related host taxa in alpha- than beta-CoVs, and is more highly constrained by phylogenetic distance for beta-CoVs. We show that inter-family and -genus switching is most common in Rhinolophidae and the genus Rhinolophus. Our analyses identify the host taxa and geographic regions that define hotspots of CoV evolutionary diversity in China that could help target bat-CoV discovery for proactive zoonotic disease surveillance. Finally, we present a phylogenetic analysis suggesting a likely origin for SARS-CoV-2 in Rhinolophus spp. bats.


Corresponding author(s): Peter Daszak, Zhengli Shi
Last updated by author(s): May 25, 2020

Software and code

Data collection
No software was used for data collection.

Data analysis
Nucleotide sequences were aligned using MUSCLE 3.8.31. Bayesian phylogenetic analyses were performed in BEAST 1.8.4. We used TempEst 1.5 to assess the temporal structure within our datasets. Convergence of the MCMC runs was confirmed using Tracer 1.6. Maximum clade credibility (MCC) trees annotated with discrete traits were generated in TreeAnnotator 1.8.4 and visualized using SpreaD3 0.9.6. The Mean Phylogenetic Distance (MPD) and Mean Nearest Taxon Distance (MNTD) statistics were calculated in R 3.5.1. Mantel tests were performed in ARLEQUIN 3.5. A median-joining network was reconstructed in Network 10.0.

Data
All sequence data generated for this project have been deposited in GenBank, the NIH genetic sequence database (https://www.ncbi.nlm.nih.gov/genbank/). All accession numbers are available in Supplementary Note 1.
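
The MPD and MNTD statistics named in the data-analysis statement can be illustrated with a minimal Python sketch. This is not the authors' R code: the function names `mpd` and `mntd`, the nested-dictionary distance format, and the four taxa with their distances are all hypothetical, chosen only to show the arithmetic. The sketch assumes a symmetric matrix of pairwise phylogenetic (patristic) distances is already available and that no abundance weighting is applied.

```python
from itertools import combinations


def mpd(dist, taxa):
    """Mean Phylogenetic Distance: average distance over all distinct pairs of taxa."""
    pairs = list(combinations(taxa, 2))
    if not pairs:
        raise ValueError("MPD needs at least two taxa")
    return sum(dist[a][b] for a, b in pairs) / len(pairs)


def mntd(dist, taxa):
    """Mean Nearest Taxon Distance: average distance from each taxon to its closest relative in the set."""
    taxa = list(taxa)
    if len(taxa) < 2:
        raise ValueError("MNTD needs at least two taxa")
    nearest = [min(dist[a][b] for b in taxa if b != a) for a in taxa]
    return sum(nearest) / len(nearest)


# Hypothetical patristic distances among four host taxa (illustrative values only).
dist = {
    "A": {"A": 0.0, "B": 2.0, "C": 6.0, "D": 6.0},
    "B": {"A": 2.0, "B": 0.0, "C": 6.0, "D": 6.0},
    "C": {"A": 6.0, "B": 6.0, "C": 0.0, "D": 4.0},
    "D": {"A": 6.0, "B": 6.0, "C": 4.0, "D": 0.0},
}

hosts = ["A", "B", "C", "D"]
print(mpd(dist, hosts))   # 30 / 6 pairs = 5.0
print(mntd(dist, hosts))  # (2 + 2 + 4 + 4) / 4 = 3.0
```

In this toy example MNTD is lower than MPD because each taxon has one close relative in the set. MPD averages over all pairs and so reflects deep structure in the tree, whereas MNTD looks only at nearest neighbours and reflects clustering near the tips.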

nature research | reporting summary
We studied the diversity of bat coronaviruses in China using sequences generated from our own field-collected samples (oral and rectal swabs) and sequences available in GenBank to infer their cross-species transmission history and spatial spread. Bats were captured using mist nets at their roost sites or feeding areas. Each captured bat was placed in a cotton bag; all sampling was non-lethal, and bats were released at the site of capture immediately after sample collection. Oral and fecal swabs were collected. These sample types are suitable for detecting coronaviruses in wild animals.
Field samples were collected by experienced wildlife zoologists using standard procedures approved by the Tufts University IACUC (proposal #G2017-32) and the Wuhan Institute of Virology, Chinese Academy of Sciences, IACUC (proposal WIVA05201705). RNA was extracted and tested for the presence of coronaviruses as described in the Methods.
Our samples were collected in 15 Chinese provinces (Anhui, Beijing, Guangdong, Guangxi, Guizhou, Hainan, Henan, Hubei, Hunan, Jiangxi, Macau, Shanxi, Sichuan, Yunnan, and Zhejiang) from December 2010 to June 2015 when bats were not hibernating. Samples were collected each month in different locations.
Positive results in bat genera not previously known to harbor a specific CoV lineage were repeated (PCR + sequencing) for confirmation. Field species identifications were also confirmed and re-confirmed by cytochrome b (cytb) DNA barcoding using DNA extracted from the feces or swabs. Only viral detection and barcoding results confirmed at least twice were included in this study. Viral detections in bat genera not previously known to harbor a specific CoV lineage that were not confirmed at least twice were excluded from the study.