Genomic recombination events may reveal the evolution of coronavirus and the origin of SARS-CoV-2

To trace the evolution of coronaviruses and reveal the possible origin of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which causes the coronavirus disease 2019 (COVID-19), we collected and thoroughly analyzed 29,452 publicly available coronavirus genomes, including 26,312 genomes of SARS-CoV-2 strains. We observed coronavirus recombination events among different hosts including 3 independent recombination events with statistical significance between some isolates from humans, bats and pangolins. Consistent with previous records, we also detected putative recombination between strains similar or related to Bat-CoV-RaTG13 and Pangolin-CoV-2019. The putative recombination region is located inside the receptor-binding domain (RBD) of the spike glycoprotein (S protein), which may represent the origin of SARS-CoV-2. Population genetic analyses provide estimates suggesting that the putative introduced genetic sequence within the RBD is undergoing directional evolution. This may result in the adaptation of the virus to hosts. Unsurprisingly, we found that the putative recombination region in S protein was highly diverse among strains from bats. Bats harbor numerous coronavirus subclades that frequently participate in recombination events with human coronavirus. Therefore, bats may provide a pool of genetic diversity for the origin of SARS-CoV-2.


Recombination between bat and pangolin coronaviruses may represent to the origin of SARS-CoV-2.
We performed multiple sequence alignment for SARS-CoV-2 strains and proximal outgroups and identified 3 independent recombination events by RDP4, software to detect recombination 32 . Each event was supported by evidence from at least six statistical tests (requiring a P-value < 0.05 in each test) ( Table 1). The phylogenetic tree of sequences in the recombination region was different from the phylogenetic tree built using the whole genome (Fig. 1). The three recombination events were also reflected in pairwise identity plots (Fig. S1). For further validation of the three recombination events, we also calculated pairwise genetic distances between coronaviruses, which were related to the three putative recombination events or outgroups. We performed calculations in the recombination region as well as the flanking sequences. The results (Table 2, Tables S1, S2) were consistent with the phylogenetic trees ( Fig. 1) and pairwise identity plots (Fig. S1). For all three recombination events, the genetic distance, calculated for the recombination region, between the presumed recombinant and the presumed minor parent was the lowest among all the genetic distances between the putative recombinant and other outgroups. According to the definition in RDP4, the minor parent is the parental sequence that contributes the smaller fraction of the recombinant sequence, while the major parent is the parental sequence that contributes the larger fraction.
Two of the three potential recombination events may have altered the structures of two different pangolinrelated coronavirus isolates, namely, an isolate possibly evolved from Pangolin-CoV-2017 and an isolate similar to Pangolin-CoV-2019. A 1260 bp fragment in some strains representing the ancestors of SARS-CoV-2, Bat-CoV-RaTG13 and Pangolin-CoV-2019 and a 1182 bp fragment in some strains similar to Pangolin-CoV-2017 may be recombinationally integrated sequences donated by bat isolates (bat-SL-CoVZC45 or bat-SL-CoVZXC21), suggesting that recombination between coronaviruses from bats and pangolins is not rare. One of these two recombinationally intergrated RNA fragments is located inside polyprotein 1ab (pp1ab, open reading frame 1 (ORF1)), referred to as RI_RNA_ORF1 in this manuscript, and the other fragment spans the 3′ end of ORF1 and the 5′ beginning of the S protein, referred to as RI_RNA_Boundary in this manuscript ( Fig. 2A).
Our analysis confirmed that the 228 bp long sequence within the SARS-CoV-2 S protein ( Fig. 2A) is likely to be an integrated sequence resulting from recombination between some strains similar to Bat-CoV-RaTG13 (NCBI accession No. MN996532) and some strains similar to Pangolin-CoV-2019 (NCBI accession No. MT121216; Table 1, Fig. 1D, Figs. S1C, S2). This recombination was significant in 6 independent statistical tests (Table 1). Moreover, we further validated of this recombination by performing sliding window analysis on sequence differences (Fig. S3) between SARS-CoV-2 and other coronaviruses proximal to SARS-CoV-2 in the phylogenetic tree (Fig. 1A). The recombination event was also validated by genetic distance analyses (Table 2). To reveal whether some other coronavirus strains contributed to the integrated sequence, we searched for recombination events using all reported coronavirus genomes (Table S3). We did not find any other recombination that may contribute to the 228 bp sequence. However, SARS-CoV-2 has not yet been isolated and identified from bats or pangolins. At the whole-genome level, Bat-CoV-RaTG13 shows higher identity with SARS-CoV-2 than Pangolin-CoV-2019. Our results suggested with high probability that SARS-CoV-2 originated from a bat Table 1. Three putative recombination events between bat and pangolin coronaviruses. 'Position' refers to the start and end of the reference genome MN908947 (SARS-CoV-2). 'NS' means not significant. The major parent and minor parent are the presumed parent contributing the larger fraction of the sequence and the presumed parent contributing the smaller fraction of the sequence, respectively. In cells, following the strain name, a representative strain ID is listed within a pair of small brackets. P-values based on seven statistical tests are also listed. Plots of alignments supporting these recombination events are shown in Fig. S1. Sequence IDs in brackets are exemplary sequences of the described strains. www.nature.com/scientificreports/ coronavirus after recombinational integration of a RNA fragment from a pangolin coronavirus into the S protein gene (Fig. 2B). This putative integrated RNA fragment, referred to as RI_RNA_S in this manuscript, encodes a 76 AA long peptide and is located in the RBD (Fig. S2), which may influence the host preference of the virus. This recombination event may have played a key role in the origin of SARS-CoV-2.
Evolutionary pattern of the putative recombinationally integrated fragment in the S protein. To understand the evolutionary role of the recombination that may have led to the origin of SARS-CoV-2, we performed sliding window analysis and genetic tests for coronavirus populations. We observed that RI_RNA_S has peaks of fixation index (Fst) values calculated between human and bat coronaviruses, between human and pangolin coronaviruses and between human and pangolin coronaviruses (Fig. 2C, Figs. S4, S5A). These Fst peaks have values higher than the 0.05 or 0.1 threshold when treating the nearby region as the background, indicating that they are significant or weakly significant (Fig. 2D, for specifics, see Materials and Methods). The significances of the human-bat and human-pangolin Fst peaks was also confirmed by comparing the distribution of the values inside RI_RNA_S and that in the flanking region (Fig. 2D). We also observed that Fst values between coronaviruses from other pairs of species mostly had peaks at RI_RNA_S (85.7%, 18/21, Fig. S5). Twelve of 21 of these peaks were confirmed by testing the difference in distribution between RI_RNA_S and the flanking region (Fig. 2D, Fig. S5B). The increase in differentiation reflected by the Fst peak suggests that RI_RNA_S is a featured segment. In other words, RI_RNA_S may be used to predict which host a coronavirus belongs to. RI_RNA_S may be important for coronavirus adaption to new hosts. Consistently, we observed a pair of CLR peaks adjacent to RI_RNA_S not only for SARS-CoV-2 strains collected in April ( Fig. 2E) but also for those collected in March (Fig. S6). The two CLR peaks for April strains showed significance (threshold, 0.05) or weak significance (threshold, 0.1) in the nearby region. The two CLR peaks for March strains had values higher than the 0.05 threshold when treating the nearby region as the background, while one showed significance when considering the whole genome. These results suggested that the putative recombinationally integrated sequence in SARS-CoV-2 underwent adaptation. www.nature.com/scientificreports/ We did not observe obvious fluctuations in the CLR or Tajima's D within RI_RNA_S for any SARS-CoV-2 strains. One explanation for this result is that the RBD region is highly conserved in SARS-CoV-2. Compared with other human coronavirus, SARS-CoV exhibits a pair of CLR peaks adjacent to the corresponding region of RI_RNA_S (Fig. S4). The CLR peak of SARS-CoV near the 3′ right of RI_RNA_S is significant when considering the whole genome.
Bats may provide a genomic pool for the origin of novel human coronavirus. There was a sharp decrease in RBD diversity (Pi) for human, camel and cow coronaviruses compared with bat coronaviruses. In contrast, for all coronaviruses isolated from bats, the Pi values in the RBD were high and there was a Pi peak at RI_RNA_S (Fig. S4). Considering that there are 12 reported clades of bat coronaviruses, population structure may contribute to the Pi peak. Therefore, we performed sliding window analyses on bat clades. We chose clades with more than 10 genome samples to ensure an adequate sample size. We found that RI_RNA_S has a Pi peak in 5 bat clades (Fig. S7A), and one shows significance (Rhinacovirus) within the region nearby. We performed the same analysis for 7 human clades. No clade had a Pi peak in RI_RNA_S with statistical significance (Fig. S7B). To avoid of the effects of population structure, we performed sliding window calculations of Pi in different clades. We calculated statistics to assess the differences between hosts. We found that bats had a higher Pi value for the nearby region of RI_RNA_S (from 21,500 to 25,000 bp) than other hosts (Fig. 3A). Bats also had a higher Pi for RI_RNA_S than other hosts (Fig. S8A). These findings highlight the high diversity of RBD sequences in coronavirus isolates from bats, which may provide a genetic pool for recombination that drives the evolution of coronaviruses in general and SARS-CoV-2 specifically.
Bat coronaviruses show a higher Tajima's D than human coronaviruses (all coronaviruses isolated from humans, including SARS-CoV-2, SARS-CoV, MERS-CoV, and 229E-CoV) (P-value = 5.157e−08, Wilcoxon rank sum test) in the RBD region (Fig. S4). To test whether there is a difference in selection between bat and human coronaviruses, we slid the analysis window and calculated Tajima's D, clade by clade, as we did for Pi above. We detected differences in Tajima's D between bats and other hosts in RI_RNA_S and the nearby region of RI_RNA_S (Fig. S8B,C). Bat coronavirus did not show a deviation from neutrality (Tajima's D = 0). However, there was no significant difference in Tajima's D between bats and humans (Fig. S8B,C). Thus, the difference in Tajima's D between bat and human coronaviruses in Fig. S4 may result from population structure.
Through a search in CoVdb 31 , we found that bat isolates had the highest number of subclades among 32 reported hosts (Fig. 3B), indicating that coronaviruses in bats may have differentiated at higher levels than Table 2. Estimates of evolutionary divergence between coronavirus sequences obtained using the Tajima  www.nature.com/scientificreports/ those from other hosts because of population structure. Previous work indicated that most human coronaviruses originated from bats 33 . Interestingly, our analysis results show that bats rank first in terms of coronavirus recombination event quantity among 32 regular hosts (Fig. 3C). For recombination between coronaviruses from different hosts, the bat-human pair had the highest frequency (Fig. 3D, Fig. S9). A total of 43.5% (37/85) of human-related coronavirus recombination events were bat related. One hundred percent (10/10) of pangolinrelated coronavirus recombination events were also bat related. The comparably high frequency of recombination between human and bat coronaviruses as well as between pangolin and bat coronaviruses (Fig. S9) may explain the origin of SARS-CoV-2. www.nature.com/scientificreports/

Recombination between bat and pangolin coronaviruses. It is likely that Pangolin-CoV-2017 or
Pangolin-CoV-2019 contributed to the origin of SARS-CoV-2, although the two recombination events referred in this work showed no direct contribution to the origin of SARS-CoV-2. The putative integrated sequence from Bat-SL-CoV in Pangolin-CoV-2019 (Figs. 1, 2A) may have made the viral genome less similar to SARS-CoV-2. Nevertheless, recombination among coronaviruses between bats and pangolins may have generated other novel strains that can be transmitted between species. SARS-CoV-2 may be just a recent example. We observed a CLR peak at RI_RNA_ORF1 for SARS-CoV-2 strains (Fig. S10). The CLR peak was significant in the nearby region and confirmed by testing the distribution differences (P-value = 9.997e−08). The CLR peaks were re-evaluated and confirmed using all SARS-CoV-2 strains collected in March (Fig. S11A). We observed a nonsignificant peak for those in April (Fig. S11B). RI_RNA_Boundary showed a peak in the CLR calculated using human isolates (Fig. S12). The values in the peak were significantly higher than those in the flanking region (extended by 1000 bp, P-value = 7.375e−06). We also observed a peak in the CLR calculated using SARS-CoV-2 strains collected in April (Fig. S11D). We observed peaks in the CLR calculated based on SARS-CoV-2 strains collected in March (Fig. S11C), but these peaks were not significant. There was a peak in Fst (significant in nearby region, threshold, 0.05) calculated between human and bat coronaviruses at RI_RNA_Boundary (Fig. S12), which was also confirmed by comparing the distributions in RI_RNA_Boundary and the flanking region (P-value = 7.11e−05). The same was observed for Fst values between coronaviruses from most other pairs of species at RI_RNA_Boundary (76.2%, 16/21, Fig. S13). Nine of the 16 peaks of Fst values showing local significance were also confirmed by testing the distribution (Table S4). These results indicated that these putative recombinants were evolutionarily active regions. www.nature.com/scientificreports/

Discussion
Previous efforts were made to detect evidence of recombination, but did not support a relationship between recombination and the origin of SARS-CoV-2 1-3 . In contrast, other studies revealed that recombination could be associated with the origin of COVID-19 15 . In relation to this issue, our observations indicate that SARS-CoV-2 possibly originated from recombination between bat and pangolin coronaviruses. We reached this conclusion through comprehensive analyses of all reported coronavirus genomes from different hosts. Important evidence was provided by genomic sequence analysis of Pangolin-CoV-2019 (412,860) and Bat-CoV-RaTG13 (MN996532). The putative recombinationally integrated sequence provided by some strains similar to Pangolin-CoV-2019 was located inside the RBD of the S protein region. Our analyses indicated that RI_RNA_S is under positive selection in SARS-CoV-2 populations. These results supported the evolutionary importance of RI_RNA_S. However, more experiments are needed to understand whether and how RI_RNA_S functions differently in SARS-CoV-2 and Bat-CoV-RaTG13.
Unlike human coronavirus genomes, coronavirus genomes isolated from other hosts are limited in terms of public availability. To overcome this, we pooled all coronaviruses from different subclades and collection times that were isolated from the same host. Although such pooling can inflate genetic diversity levels, we used the same pipeline for all coronavirus strains to reduce deviation from reality. Bat coronaviruses had a higher Tajima's D than human coronaviruses. Considering that the number of bat samples was smaller (176) than the number of human samples (972), the low Tajima's D for human coronaviruses may have been caused by sample size differences, as the larger the sample size is, the more negative Tajima's D might be. The negative Tajima's D for human coronaviruses was located inside a negative valley, as shown in Fig. S4. Thus, it is hard to infer whether sample size led to the difference in Tajima's D between human and bat coronaviruses.
Our analyses provide further support that SARS-CoV-2 originated from bats, considering that bat isolates may be the major parent contributing the largest fraction of sequences. However, we still cannot conclude that SARS-CoV-2 originated from bats because of the lack of direct evidence. There is high genetic diversity in the S-protein of coronavirus strains from bats, but not in strains from other hosts, suggesting that bats are a reservoir of genetic diversity upon which natural selection can act. Compared to other hosts, bats also have more coronavirus subclades in terms of taxonomy. Bat coronaviruses may have more chances to take part in recombination than coronaviruses from other hosts and thus play the most important role in the origin and recombination of human coronaviruses among all known coronavirus hosts. Thus, avoiding contact with wild bats should be important for preventing future coronavirus associated pandemic diseases in humans.

Methods
Identification of recombination events. We collected genomic sequences of coronaviruses from NCBI, GISAID (http:// www. gisaid. org) and CoVdb 31 . We collected 3140 non-SARS-CoV-2 coronavirus (Table S5) and 26,312 SARS-CoV-2 strains (Table S6). Using all coronavirus strains, we performed whole-genome alignments by CUDA ClustalW 34 and built a phylogenetic tree, according to which we observed that bat-and pangolinisolated strains were proximal to SARS-CoV-2. We chose SARS-CoV-2, severe acute respiratory syndrome coronavirus (SARS-CoV), Middle East respiratory syndrome coronavirus (MERS-CoV) and strains whose hosts are bats or pangolins, and then performed recombination detection by RDP4 32 , which used RDP (the algorithm used to test for recombinants in RDP4 software) 35 , GENECONV 36 , Bootscan 37 , Maxchi 38 , Chimaera 39 , SiSscan 40 and 3Seq 41 as statistical test methods for recombinants. We chose to perform a full exploratory scan using all methods in the software. In this way, we identified three putative recombination events between bat and pangolin coronaviruses. We used MEGA 42 to perform local alignments, to build maximum likelihood trees by the Jukes-Cantor model 43 and to test the phylogeny by 5000 bootstrap replicates. For further verification of the recombination detected, we estimated the evolutionary divergence between coronavirus sequences using the Tajima-Nei model 44 by Mega 42 . We chose one strain in each coronavirus as a target. We performed analyses on RI_RNA_S, RI_RNA_ORF1, RI_RNA_Boundary and the nearby 2000 bp sequences of these fragments. We also wrote Perl scripts to perform sliding window calculations of the nucleotide differences between SARS-CoV-2 and other coronaviruses proximal to SARS-CoV-2 in the phylogenetic tree shown in Fig. 1.
To obtain an overall and general view of possible recombination events within all reported coronaviruses, we retrieved all coronavirus genomes from CoVdb 31 and then filtered out unique genomic sequences using CD-HIT 45 , requiring an identity > 95% and a coverage > 95% to speed up postanalyses. With these unique genomic sequences, we performed whole-genome sequence alignment and used RDP4 32 to search for possible recombination events. We also performed a full exploratory scan using all methods provided in RDP4. In this way, we identified 1149 putative recombination events. We discarded the cases in which the recombination signal could have been caused by an evolutionary process other than recombination. With these identifications procedures combined, we identified 532 putative independent recombination events ( Table S3).
Annotation of coronavirus strains. We annotated these recombination events based on the information provided by CoVdb 31 , where the classification of coronaviruses was first based on NCBI taxonomy. The properties of unclassified strains were identified by searching against a manually curated reference set, which include representative sequences of subclades. Other information, such as the host and collection region, was retrieved from the NCBI or GISAID and then curated manually.

Population genetic analyses.
We used the online tools in CoVdb 31 to perform population genetic analysis. In the platform, coronavirus strains isolated from the same host are grouped together. For viruses isolated from humans, those related to the same disease are also grouped together.

Statistics.
We calculated the statistics referred to in this work by R. A heatmap was created by the R package "pheatmap" and TBTools 53 . Most data extraction and clustering work was performed by writing Perl pipelines. When calculating statistics for all coronavirus recombination events shown in Fig. 3D and Fig. S9, we did not differentiate which was the recombinant, major parent or minor parent. They were all considered to take part in or related to recombination. To test the significance of a peak of a population genetic track after sliding window analysis, we extended the target region by a distance of nearly 1000 bp to the left and right and used the extended region as the background. We tried to identify where the values of the target region were located in the distribution of the background, as shown in Fig. 2D. If the maximum value of a peak was higher than the 0.05 threshold, the peak inside the target region was considered to be significant in the region nearby, or locally significant. If it was higher than the 0.1 threshold, the peak was considered weakly significant. We also performed a Wilcoxon rank sum test between the distribution of track values in the target region and that in the flanking region, including the 1000 bp sequences to the left and right, to validate the significance of the peak. Moreover, we tried to identify the position of the peak value in the distribution of all the values in the whole genome. If a peak was inside the top 5% section (P value < 0.05), it was considered to have whole-genome significance. www.nature.com/scientificreports/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.