Data-driven recombination detection in viral genomes

Recombination is a key molecular mechanism for the evolution and adaptation of viruses. The first recombinant SARS-CoV-2 genomes were recognized in 2021; as of today, more than ninety SARS-CoV-2 lineages are designated as recombinant. In the wake of the COVID-19 pandemic, several methods for detecting recombination in SARS-CoV-2 have been proposed; however, none could faithfully confirm manual analyses by experts in the field. We hereby present RecombinHunt, an original data-driven method for the identification of recombinant genomes, capable of recognizing recombinant SARS-CoV-2 genomes (or lineages) with one or two breakpoints with high accuracy and within reduced turn-around times. ReconbinHunt shows high specificity and sensitivity, compares favorably with other state-of-the-art methods, and faithfully confirms manual analyses by experts. RecombinHunt identifies recombinant viral genomes from the recent monkeypox epidemic in high concordance with manually curated analyses by experts, suggesting that our approach is robust and can be applied to any epidemic/pandemic virus.


Main
Recombination is a key molecular mechanism used by RNA viruses to boost their evolution. Recombination occurs both in viruses with segmented and non-segmented genomes; parental strains to a recombinant virus are referred to as "donor" and "acceptor". Recombination requires co-circulation and co-infection in the same host; the clinical and epidemiological relevance is substantial since recombinant viral strains have been associated with altered viral host tropism, enhanced virulence, host immune evasion, and the development of resistance to antivirals 4,5 . In light of these considerations, and in hindsight from the recent global scale COVID-19 epidemic, the need for the development of novel and rapid methods to identify recombination has been increasingly recognized by international health authorities and researches 6,7 . Phylogenetic analyses are essential to monitoring the spread and evolution of viruses [8]. All phylogeny-based approaches assume that the shared history of pathogens, isolated from different hosts, can be described by a branching phylogenetic tree. Recombination breaks this assumption and impacts the application of phylogenetic methods for the reconstruction of chains of contagion, viral evolution, and ultimately genomic surveillance of pathogens 9,10 .
During the COVID-19 pandemic SARS-CoV-2 has accumulated in excess of 130K distinct nucleotide mutations, leading to the emergence of more than 2K lineages. In the first three years of the COVID-19 pandemic, limited levels of recombination were observed, although an increased number of recombinant lineages has been reported more recently 11 . As SARS-CoV-2 evolved and differentiated, several recombination eventsinvolving SARS-CoV-2 variants of concern (VOCs) -have been recognized, highlighting once more the importance of recombination as a molecular mechanism for the generation of genomic and phenotypic diversity in epidemic/pandemic viruses 3,[12][13][14][15][16][17][18] .
More recently, as the Omicron SARS-CoV-2 variant became dominant worldwide, an excess of 60 recombinant lineages have been identified only within Omicron. XE (also known as V-22APR-02 in Public Health England) was considered the most concerning Omicron lineage during 2022 since it had a reported growth advantage over BA.2. None of the recombinants detected until the Autumn of 2022 spread fast enough to become dominant, while greater concern came from sublineages of Omicron 2 and from Omicron 4 and 5 (e.g., XBF). The recombinant XBB lineage, the ancestor of XBB.1.5 and XBB.1.16 is the only Variants Of Interest according to the World Health Organization at the time of writing 19 ).
The fast emergence of SARS-CoV-2 viral lineages and their potential epidemiological implications has called for continuous monitoring of viral genome evolution in the last three years. The largest available col-lection of genomes has been curated by GISAID 20 , which reached 15.2 million deposited viral sequences in April 2023. Genomic surveillance efforts initially focused on the monitoring of single amino acid changes. As the COVID-19 pandemic progressed, research interests shifted toward the study of mutational signatures and variants associated with increased transmission rates and reduced antigenicity, and possibly hampering testing, treatment, and vaccine development [21][22][23] . A number of methods were proposed to allow automatic early detection of variants [24][25][26][27][28] . Instead, interest in the automatic identification of recombination in SARS-CoV-2 started at a later stage.
Recombination in viral genomes is often identified using algorithms implemented in programs such as Simplot 29 ; 3SEQ 30 and its improved version 1 ; RDP3 (Recombination Detection Program version 3 31 , including four previously proposed tools) and its extensions RDP4 32 / RDP5 33 ; and RAPR 34 (Recombination Analysis PRogram). In short, these methods apply phylogenetics approaches to identify recombination hotspots and pinpoint patterns of interest with matrix-based visualizations. However, none of these methods was specifically devised to analyze/deal with large amounts of data/genomes. Moreover, some of these algorithms account for all polymorphic sites equally, regardless of phylogenetic information, and hence might be prone to systematic errors if/when applied to a sparse, arbitrary selection of genome sequences.
A number of studies already applied these methods to SARS-CoV-2. For instance, Pollett et al. 35 employed RDP4 on 100K sequences (dataset of August 2021); 3SEQ was applied by Boni et al. 36 and Jackson et al. 37 for analysis of mosaic signals and breakpoint identification; while 11 combined 3SEQ and RDP5 for assigning parent lineages.
Ignatieva et al. 38 proposed KwARG, a parsimony-based method to reconstruct possible genealogical histories of SARS-CoV-2 and disentangle recombination based on a statistical framework. Similar to other comparable works 37,39 , the method however suffers from a limited resolution and can not fully resolve pairs of recombinant donors/acceptor sequences at the lineage level.
Zhou et al. 40 recently introduced VirusRecom, a novel algorithm that uses information theory to infer recombination. A recombination event is modeled as a transmission process of "information" that uses a weighted information content to quantify the contribution of recombination to a given region in the viral genome. The method is applied only to simulated data and a limited selection of recombinant lineages (XD, XE, and XF); these settings cannot be considered a comprehensive evaluation.
The RIPPLES method 9 and related RIVET software 41 are more conceptually similar to the method described in this study and have been recently applied to the complete collection of SARS-CoV-2 genome se-quences; a detailed comparison between this method and the novel approach developed in this work is reported below.
Novelty: A computational approach to scout recombinations Currently, the hunt for recombinant SARS-CoV-2 viral lineages is manually performed by experts who report evidence on the Pango designation GitHub repository in the form of issues; they are broadly documented and allow for discussion with other peers 42 . Here we present a new automatic method (RecombinHunt, Fig. 1) for effectively and efficiently detecting recombination by analyzing the complete data corpus of SARS-CoV-2.
Our approach relies exclusively on big data-driven methods, as opposed to manual analyses performed and maintained by trained experts 9 ; the method produces results that are easily inspectable on simple visual reports (see Supplementary Files 1-4), thus it is completely transparent to its users.
We considered 15,271,031 SARS-CoV-2 genomes, downloaded from the GISAID database 20 on April 1st, 2023. Genome sequences were aligned to the SARS-CoV-2 reference genome and nucleotide mutations were identified by the HaploCoV pipeline 28 . To mitigate the impacts of sequencing and assembly errors, genome sequences of uncertain/low quality were excluded (i.e., records associated with low coverage, percentage of unknown bases ≥ 2%, unknown length, and incomplete metadata). We then retained 5,255,228 viral genomes, for which we only considered the assigned Pango lineage and the list of mutations. Overall, a total of 2,345 distinct lineages were represented; previously designated Variants of Concern (VOCs) lineages were associated with the largest number of assigned genome sequences, namely B.1.1.7 (Alpha), BA.2 (Omicron 2), BA.1.1 (Omicron 1), and AY.4 (Delta).
In the dataset, a total of 57 distinct recombinant lineages (denoted according to the Pango convention for labeling recombinant lineages under the letter 'X') with at least one high-quality sequence are present. Mutations frequencies were estimated across the complete collection of Pango lineages and in the complete collection of SARS-CoV-2 genome sequences. For every lineage in the SARS-CoV-2 reference nomenclature, mutations with a frequency above a 75% threshold were denoted as characteristic mutations; the list of characteristic mutations for a lineage is denoted as the lineage mutation space in RecombinHunt.

The approach
RecombinHunt accepts as input a target genome sequence, in the form of a list of mutations. Genome positions with a nucleotide mutation in the target are denoted as the target mutations space. Candidate "donor" and "acceptor" lineages are defined on the basis of the counts of their mutations in the target mutation space, we denote as "donor" the lineage with the higher count. For every lineage, the union of the lineage and target mutation space is denoted as extended target space. A cumulative likelihood ratio score is derived according to the following procedure: at each position of the extended target space, we compute the logarithmic ratio between the frequency of the mutation in the lineage and in the complete collection of SARS-CoV-2 genomes.
This score is added if the mutation is shared by the both target and the lineage, it is subtracted if the mutation is observed in the lineage but not in the target.
The workflow is represented in Extended Data Fig. 1. Likelihood ratio values are computed for all possible lineages; the lineage L1 associated with the maximum value is assigned to the target. If L1 mutation space differs from the target mutation space at most in two positions, then the non-recombinant model is selected, and the target is assigned to L1; else, L1 is designated as the candidate donor. Note that L1 covers the majority of the mutations of the target, located in the genome segment that starts from one of the two ends (either 5' or 3') -denoted as L1's end -and reaches its maximum value at a position designated as max-L1. Upon the identification of a candidate donor, the one-breakpoint (1BP, Fig. 1b) and two-breakpoint (2BP, Fig. 1c) models are compared in parallel.
In the 1BP model, we search for a lineage L2, starting at the opposite end of the genome -denoted as L1 opp end. We consider the L2 lineage associated with the maximum likelihood ratio value (max-L2); if such lineage is not different from L1 in at least three mutations, we recede to the non-recombination case, else we designate it as a candidate acceptor. The interval between coordinates max-L1 and max-L2, where the donor and acceptor lineages reach their maximum likelihood ratio, define the 'breakpoint range', which is then reduced to a single position (see Methods and Extended Data Fig. 2). An example of a 1BP use case is shown in Fig. 1b  In the 2BP model, the candidate donor L1 lineage is assigned to both ends of the genome; this case is explored only when the target sequence has at least three mutations of L1 at the L1 opp end of the genome. We designate as L1 opp the portion of the genome between the L1 opp end and the point where L1 opp 's likelihood ratio is maximum, denoted as max-L1 opp . A candidate acceptor L2 lineage is searched in the space between max-L1 opp and max-L1; the lineage L2 with the maximum likelihood ratio is selected, yielding to a breakpoint range, positioned either between max-L1 and max-L2 or between max-L1 opp and max-L2; if such breakpoint range is greater than one mutation, it is reduced to a single mutation. In Fig. 1c

Results
Our method was executed on 51 of the 57 lineages designated as recombinant by Pango at the time of the writing. Two lineages were excluded since they had three breakpoints and other four lineages were disregarded since the defining Pango issue was unclear/controversial. The "ground truth", i.e., the description of the recombinant lineage in terms of donor, acceptor, and breakpoint, was reconstructed directly from the corresponding Pango designation issues 42,43 .
Recombinant lineages were removed from the list of Pango designations, and an ideal consensus-genome was reconstructed for every lineage by considering the ordered list of nucleotide mutations shared by >75% high-quality genomes assigned to that lineage.
Three main criteria were used to validate RecombinHunt results with respect to the ground truth: (1) the correct model recombinant (1BP or 2BP) represented a statistically significant improvement (p-value < 10 −2 ), compared with the other models; (2) designations of both the donor and the acceptor lineages were correct (i.e., all found candidates are the same or descendants of those in the ground truth); and (3) the breakpoint position, as determined by our method, was within the range of genomic positions indicated in the Pango issue.
Results are shown in Table 1, partitioned as follows: (i) cases with 1BP approximately in the middle of the SARS-CoV-2 genome (from XA to XZ); (ii) 5' proximal 1BP cases (from XAA to XW); (iii) 3' proximal 1BP cases (from XAH to XT); (iv) cases with 2BP (from XAC to XD).
Notably, RecombinHunt results were in complete agreement with the ground truth for 40 recombinant lineages (37 with one breakpoint and 3 with two breakpoint recombinations). The remaining 11 lineageswhich did not fully agree with the Pango designation -are stratified into three conceptually distinct groups, discussed below; A detailed report illustrating all eleven cases is provided in Supplementary File 5. (G3) Three cases (XAW, XAT, and XP) highlight some limitations of our approach. In XAW, while L1 is correctly identified, none of the lineages defined in the Pango nomenclature provides a good match for 46 out of 103 total target mutations characteristic of the XAW lineage; in this case, the parent acceptor lineage might not be defined in the reference nomenclature. In XAT and XP, recombination as defined by the ground truth is supported by a limited number of mutations at the 3' end of the genome. As these mutations have a relatively high frequency (range 0.17-0.34) in the complete collection of SARS-CoV-2 genomes, they contribute to a modest drop in the log-likelihood ratio score. Both XAT and XP lineages are flagged as non-recombinant by RecombinHunt; we speculate that our method loses sensitivity due to attraction to local maxima and limited/insufficient drops in the cumulative likelihood ratio score.

Reproducibility on individual high-quality sequences
To evaluate 1) whether RecombinHunt could systematically flag recombinant viral genomes; and 2) the reproducibility of the results, our method was applied directly to individual genome sequences. For every recombinant lineage, at most one hundred of randomly selected, high-quality sequences were analyzed; when high-quality sequences were fewer than one hundred we considered all of them. Results are shown in Table 2, in the same order and grouping of Table 1 to facilitate the comparison.
As outlined in Table 2 (comparing '%non-rec', '%1BP', and '%2BP' with the 'consensus'), in 46 out of 51 lineages, RecombinHunt's results of the large majority of single genome sequences (value in bold type) are consistent with those obtained on the corresponding lineage-level genome consensus. The most notable exceptions are: XBG/XAZ (consensus 1BP, whereas respectively 69% and 40% of sequences are 2BP); XBH/XBM (consensus 2BP, whereas respectively 86% and 56% of sequences are 1BP); and XAT (consensus is nonrecombinant, whereas 57% of sequences are 1BP). Fig. 3 report the observed frequencies of inferred breakpoint genomic positions. In the first three groups of plots (from XA to XT), for 29 lineages the distribution of results obtained on single sequences is in large agreement with the genome consensus level analysis -see that the mode of the blue bar plots is close to the light blue bar, indicating the breakpoint position of the consensus-genome. Some discrepancies however are observed, which can be summarized in two main cases: (1) Additional breakpoints. The consensus-genome analysis indicated one breakpoint, but two breakpoints are detected in single sequences: this occurs in eight lineages, the most evident case is XBG, whose panel in Fig. 3 shows one light-blue bar and a distribution of double breakpoint positions in orange. (2) Missing breakpoints. Two breakpoints were identified based on the consensus-genome analysis, but only one breakpoint is detected in the single sequences: this occurs in two lineages, XBH and XBM, whose panels in Fig. 3 show two light orange bars and several blue bars, representing single sequences with one breakpoint.

Histograms in
Note that in XAC, XAW, XBL, and XD both the consensus-genome and the majority of sequences were recognized as two-breakpoint recombinants, and that all single genome sequences assigned to XAR were flagged as non-recombinant, in agreement with the consensus-genome result. The Nextstrain dataset was processed according to our quality filtering pipeline. Out of a total of 6,983,419 sequences, 3,984,308 sequences and 61 distinct recombinant Pango lineages were retained. After removing the cases with three or more breakpoints and those for which the ground truth is uncertain, we obtained 51 recombinant cases. A consensus genome was computed for these lineages by retaining mutations with a frequency above the 75% threshold. Subsequently, RecombinHunt was applied. Detailed results are reported in Extended Data Table 1. RIVET reports a total of 17 inter-lineage distinct recombination cases on the complete Nextstrain dataset; out of these, two have more than two breakpoints and cannot be addressed by RecombinHunt; moreover, XBN was not considered due to the uncertainty of the ground truth. Noticeably, RecombinHunt reports 33 cases designated as recombinant in Pango that are not identified as recombinant by RIVET. Only 14 cases are recorded by both methods and allow a direct comparison. The comparison is summarised in Extended Data Table 2.

Comparison with the RIPPLES method
For every lineage, the donor/acceptor candidates and breakpoints positions found by RIVET and Recombin-Hunt are compared against the ground truth. Note that in the seven cases (XAK, XBB, XBD, XD, XW, XBT, and XBU) RIVET reported donor/acceptor lineages that are not defined in Pango (e.g., miscBA.5BA.2.75 or miscBA1BA2Post17k), thus candidate parent lineages could not be evaluated.

RecombinHunt on monkeypox
The complete collection of viral genome sequences from the recent monkeypox epidemic 46  Monkeypox genome sequences and associated metadata were accessed through the dedicated Nexstrain resource 49 , low-quality genome sequences were discarded according to the same criteria used for SARS-CoV-2, and mutations were identified by applying the HaploCoV workflow 28 on a collection of 2,526 high-quality sequences. A total of 4,932 distinct mutations were detected. Mutations characteristic of each lineage -as defined in the reference nomenclature -were identified. A threshold of 9% was applied, to account for the high levels of intra-lineage diversity observed in monkeypox 50 .
A total of 374 1BP and 331 2BP candidate recombinant genomes were identified by RecombinHunt (see Supplementary Table 1

Discussion
We introduce RecombinHunt, a method purely based on big data, for the automatic identification of recombinant viral genomes in epidemic/pandemic scenarios. We record the frequency of nucleotide mutations occurring within lineages, as defined by a reference nomenclature (or otherwise determined clusters), and within all the genomes of a given sequence collection. These data are subsequently used to score viral genomes by means likelihood-based approach and detect recombinant sequences of two lineages, respectively labeled as the "donor" and the "acceptor". The method is general and can be applied to all viral species; as demonstrated by its application on distinct collections of SARS-CoV-2 genomes (retrieved from the EpiCov database of GI-SAID and the GenBank database curated by Nextstrain) and on genomic sequences from the recent monkeypox epidemic (also curated by Nextstrain). In terms of comparison with the Pango ground truth, RecombinHunt outperforms all the other currently available methods and correctly identifies a significantly larger proportion of recombination events flagged by expert manual analysis. In addition, RecombinHunt is rapid, and can be applied to the analysis of pandemic-scale data; the evaluation of the recombinant cases takes about 13 minutes on the GISAID dataset (15M sequences), and 8 minutes on the Nextstrain dataset (6.4M sequences) using a common portable computer.
We thoroughly evaluated our results in comparison with 51 unambiguous recombinant lineages defined by the Pango SARS-CoV-2 nomenclature for which sufficient genome sequences are available. Initially, lineage levels consensus-genomes were reconstructed and evaluated. Strikingly a complete agreement with the ground truth was observed for 40/51 cases. Of the remaining eleven lineages, six were not recognized as recombinant, two had alternative solutions with a higher likelihood, according to RecombinHunt, and three outlined some potential methodological limitations of our approach. Importantly -in the large majority of these cases -differences in the determinations were based on a relatively low number of target mutations (2 or fewer), a scenario that is not completely incompatible with convergent evolution.
When applied to high-quality individual sequences, with some well-explainable exceptions, the method produced highly consistent results with those recovered at the lineage level, and the same correct outcome was reported in the vast majority of the individual sequences. Moreover, the small discrepancies observed in our analyses might not necessarily reflect errors, and could be suggestive of intra-lineage heterogeneity and/or microevolution in some SARS-CoV-2 recombinant lineages. Interestingly, for a single lineage (XAT), results obtained on single genome sequences were more aligned with the ground truth than the lineage-level consensus.
Strikingly, once applied to the monkeypox virus, our method was able to replicate the classification of viral sequences recently indicated as recombinant by using a sophisticated ad-hoc method based on expert manual annotation. A large number of additional candidate recombinant genomes (705 cumulatively) were also detected, suggesting previously unreported recombination events in monkeypox.
Our results demonstrate that RecombinHunt is highly accurate and reliable, and represents a major breakthrough for the detection of recombinant viruses in large-scale epidemics/pandemics. The method can be applied to any mid-size available collection of nucleotide mutations for viral species (e.g., Zika and Ebola) and facilitates the detection of recombinant viral genomes in current and future viral outbreaks.  Figure 1 | RecombinHunt has three possible outcomes: no recombination, 1 breakpoint recombination, or 2 breakpoints recombination. a, Example of likelihood ratio profile for a non-recombined genome with N1 = 66 mutations, only featuring one donor lineage corresponding to BA.2.3.13. b, Example of likelihood ratio profile for a recombined genome assigned to XBE Pango lineage with N2 = 72 mutations, breakpoint at the 63rd mutation, donor lineage BA.5.2.6 (from 5'-end to 63rd mutation), and acceptor lineage BE.4 (from 64th mutation to 3'-end). c, Example of likelihood ratio profile for a recombined genome assigned to XD Pango lineage with N3 = 67 mutations, two breakpoints at the 23rd and 52nd mutations, donor lineage AY.4 (from 5'-end to 23rd mutation and from 53rd mutation to 3'-end), and acceptor lineage BA.1.22 (from 24th to 52nd mutation).   (left table) and BE.4 (right table) are compared with the following candidates, who scored lower maximum likelihood ratios. For each lineage candidate, tables report the number of sequences; the position max-L1 (resp. max-L2), i.e., the beginning of the breakpoint; and the maximum likelihood ratio value reached on that coordinate. The following columns correspond to the result of the comparison of the row candidate with the first of the     (or less) available genomes for each recombinant lineage. For each Pango lineage, we indicate the number of considered highquality sequences-100 random ones are selected when more are available. In the following columns, we indicate percentages of sequences with no breakpoint, one breakpoint, or two breakpoints (bold-type for solid majority). The following column reports the percentage of sequences for which the p-value of the most-probable model (for each sequence) is ≤ 10 −5 . The last column reports the model chosen for the consensus-genome and * when that model is in contrast with the majority of sequences (underlined).  Table 2. On the x-axis, the consensus-genome mutations; on the y-axis, the count of single sequences with a breakpoint detected on the x position. A light blue stripe indicates the RH 1BP position; two light orange stripes indicate the RH 2BP positions (see Table 1). Blue bar plots count the sequences whose 1BP is located at a given mutation; orange bar plots count the sequences whose 2BPs are located at given positions (without distinction between the first and the second one).

Data collection and genome quality filtering
Since January 5th, 2020, when the first complete genome sequence of SARS-CoV-2 was released on Gen-

Mutation-lineage probability
We compute the probability of every genomic mutation in the collection of high-quality SARS-CoV-2 genome sequences defined above 1) in the entire database; and 2) in each existing Pango Lineage 53 ). Probabilities are approximated with the corresponding frequency, i.e., the ratio between the number of genomes holding the mutation and the total number of genomes (1) or the total number of genomes assigned to a lineage (2).
Lineages with less than 10 high-quality sequences (i.e., XA, XAQ, XU, and XC in the GISAID dataset) are excluded -to avoid small denominators.

Screening potential recombinant genomes
The complete workflow of the RecombinHunt method is shown in Extended Data Fig. 1. The input is a sequence (hereon called target) represented as an ordered set of nucleotide mutations, either of an existing genome or of a consensus-genome, corresponding to the set of mutations with a frequency above a certain threshold in a given lineage.
The search of candidates is based on the computation of the cumulative logarithmic ratio between the probability of a given mutation m to occur in a given lineage L i and the probability of that mutation to occur in any SARS-CoV-2 genome. More formally, we associate a function likelihood_ratio (see Eq. 1) to each lineage L i tested on a target genome T in a range R from a start to an end mutation on T : The first candidate -denoted as L1 -is searched in the whole space of mutations of the genome, both starting from the 5'-genome end and the 3'-genome end. When the high majority of mutations in the target are also included in L1 (i.e., all except at most 2) the non-recombinant model is considered as the one that best describes the target. Else, the method evaluates two alternative models, (1) with one breakpoint, separating L1 and L2, or (2) with double breakpoint, having an L2 stretch in between two L1 stretches (with a pattern L1-L2-L1).
In both cases, L1 is selected as the first candidate contributing to the largest portion of the genome in the space of mutations of the target; it extends from the 5'-end until the breakpoint (L1 has the >> direction) or from the 3'-end to the breakpoint (L1 has the << direction). The breakpoint is marked at the mutation where the profile of the L1's likelihood_ratio (computed according to ) reaches its maximum.
Then, in the case of a single breakpoint, a second candidate, denoted as L2 and different from L1, is searched in the space of mutations of the genome that was not already covered by L1. This case is only considered when there exists an L2 (within the Pango lineages) characterized by at least 3 mutations in that space; else, we recede back to the non-recombinant case. The resulting composition of L1 and L2 also identifies the position of the breakpoint; when the two candidates cover the entire genome space, the breakpoint corresponds to a pair of adjacent positions. Else, we call gap the uncovered portion of the genome and propose a gap-resolution procedure. See the below paragraph, for further details.
In the case of a double breakpoint, L1 is considered also from the opposite side of the genome (called L1 opp ); if in this region L1 has at least 3 mutations of the target, the model L1-L2-L1 is explored. A second, central candidate L2 is then searched between the two stretches of L1. After finding L2, a gap-resolution procedure may be necessary.
The test obtaining the lowest p-value determines the final output.

Gaps resolution
This procedure is applied when the target positions corresponding to maximum likelihood (i.e., max-L1 for L1 and max-L2 for L2) are not adjacent. In such cases, the breakpoint is set as the position p that minimizes the cumulative likelihood loss for both the adjacent regions. The region to the left of the gap is extended up to p, while the positions starting from p + 1 are assigned to the right-adjacent region. One example of the procedure is shown in Extended Data Fig. 2.

Comparison of recombinant vs non-recombinant models
We finally estimate the error probability of the recombination hypothesis against the non-recombination hypothesis. For each target sequence T , we compute three global_likelihood (G) functions, respectively representing the cases where the target is 1) completely explained by lineage L 1 (G L 1 ); 2) completely explained by L 2 (G L 2 ); or 3) partially explained by L 1 and by L 2 in different portions (G L 1 ,L 2 ) (hence supportive of a recombination). The function cumulatively adds (or subtracts) a contribution for each mutation in the target mutations space, composed by the combination of T with the involved lineages.
Each term of the sum corresponds to the natural logarithm of the probability of the specific mutation to characterize the lineage L i ; the contribution is positive when the mutation occurs on T , and negative when it does not. In Eq. 2, we use the general term L i to represent L 1 for G L 1 and L 2 for G L 2 . In the composite case of G L 1 ,L 2 , instead, L i is composed of portions of L 1 and L 2 according to the best 1BP or 2BP solution.
Then, by means of AIC, we compare G L 1 ,L 2 versus G L 1 and G L 1 ,L 2 versus G L 2 . Small p-values indicate that the recombination model better explains the considered target, as opposed to single lineages. We finally use the smallest p-value for reporting the significance of the recombinant vs. non-recombinant output.
Identification of groups of similar candidates L1 and L2 may be represented by additional -similar -candidates. To exhaustively define the group of lineages that are not inconsistent with the roles of, respectively, acceptor and donor, three conditions are checked: 1) the AIC criteria is used to assess how well the stretch of L1 (respectively, L2) is explained by each of the ten best candidates according to the maximum likelihood_ratio value reached in correspondence of the breakpoint position (see Eq. 1). The candidates from the 2nd to 10th position ones are compared to the first candidate, using the AIC and a hypothesis test with p-values ≥ 10 −5 , meaning that they are sufficiently different not to be considered acceptable alternatives for (or be differentiated from) L1 (resp. L2); 2) candidates that reach the maximum of likelihood_ratio in locations that are apart from the position of the first candidate (i.e., more than one mutation apart) are not to be considered as acceptable alternatives for L1 (resp. L2); 3) candidates that do not belong to the same phylogenetic branch or sub-tree as the first candidate are not to be considered acceptable alternatives for L1 (resp. L2). When a candidate meets all of the three requirements, it is incorporated into a group of alternative candidates, equally explaining the recombination model proposed by RecombinHunt.

Collection of ground truth information from Pango designation issues
The hunt for recombinants is manually performed by volunteers who report evidence on the Pango designation GitHub repository in the form of issues, broadly documented and discussed with peers 42 . We extracted from the Pango designation file 43 all the entries whose keys start with 'X' at the first level of nomenclature (i.e., without any dot). Then, we matched them to the lineage_notes.txt file 54 , retrieving the issues where those designations are discussed. We inspected all the issues present on April 1st, 2023.
Two kinds of information were retrieved: 1) the donor and acceptor lineage candidates of recombination (directly from the alias_key.json file 43 and cross-checked withFocosi & Maggi 55 , when related information was available); 2) the interval-based position of the breakpoints, manually scouted in the discussions of the issues. As per (1), several candidates 43 are reported with the name* symbol, indicating that -at the time of designation -it was not possible to assign a precise lineage, but it was possible to assign an entire sub-tree of the phylogeny (with root in name). As per (2), sometimes issues' threads reported conflicting intervals; in these cases, we considered the union of such options. For the purposes of RecombinHunt, intervals in genomic coordinates are translated into target mutations space coordinates that depend on the target sequence observed in the given task.

Data availability
Original sequences and metadata are publicly accessible through the GISAID and Nextstrain platforms. The specific sample accessions of the genomes on which RecombinHunt has been applied are reported in Supplementary Tables 2-4 Extended Data Figure 1 | Overall RecombinHunt workflow. An input viral sequence is considered. The donor lineage is searched based on the cumulative likelihood ratio. Then, three branches are considered: non-recombinant model, one-breakpoint recombination model, and two-breakpoint recombination model. The preferred model is chosen using statistical testing, based on the Aikake information criterion.  Extended Data Figure 2 | Reduction of breakpoint range performed by RecombinHunt. The procedure is applied when a precise position of the breakpoint cannot be determined from the likelihood ratio information. Here, the case of running Recombin-Hunt on the consensus-genome of XAF is shown. The L1 candidate is found to cover positions 1-9, whereas L2 covers 12-65. The initial gap corresponds to a stretch from the 9th and the 12th mutation of the XAF consensus-genome, then resolved to only include the 9th-10th mutations, laying on the [10448, 11288] nucleotides-interval of the SARS-CoV-2 reference genome.