Identification of CRF89_BF, a new member of an HIV-1 circulating BF intersubtype recombinant form family widely spread in South America

Circulating recombinant forms (CRFs) contribute substantially to the HIV-1 pandemic. Among 105 CRFs described in the literature, 16 are BF intersubtype recombinants, most of South American origin, of which CRF12_BF is the most widely spread. A BF recombinant cluster identified in Bolivia was suggested to represent a new CRF_BF. Here we find that it belongs to a larger cluster incorporating 39 viruses collected in 7 countries from 3 continents, 22 of them in Spain, most from Bolivian or Peruvian individuals, and 12 in South America (Bolivia, Argentina, and Peru). This BF cluster comprises three major subclusters, two associated with Bolivian and one with Peruvian individuals. Near full-length genome sequence analyses of nine viruses, collected in Spain, Bolivia, and Peru, revealed coincident BF mosaic structures, with 13 breakpoints, 6 and 7 of which coincided with CRF12_BF and CRF17_BF, respectively. In a phylogenetic tree, they grouped in a clade closely related to these CRFs, and more distantly to CRF38_BF and CRF44_BF, all circulating in South America. These results allowed to identify a new HIV-1 CRF, designated CRF89_BF. Through phylodynamic analyses, CRF89_BF emergence was estimated in Bolivia around 1986. CRF89_BF is the fifth CRF member of the HIV-1 recombinant family related to CRF12_BF.


Materials and methods
Samples from HIV-1-infected individuals were collected in 14 Spanish regions for a molecular epidemiological study. An ~ 1.4 kb pol fragment in protease-reverse transcriptase (Pr-RT) was amplified by RT-PCR/nested PCR from plasma RNA as described previously 33 and sequenced with the Sanger method using a capillary automated sequencer. Near full-length genome (NFLG) sequences were obtained for selected samples by amplification in four overlapping segments from plasma RNA and sequenced by the Sanger method, as described 29,34 . Newly derived sequences are deposited in GenBank under accessions KX818199, KX818200, MW344906-MW344922, and MW802822-MW802825 (Table 1).
Sequences were aligned with MAFFT v7 35 . Initial phylogenetic trees with all Pr-RT sequences obtained by us were constructed via approximate maximum likelihood in FastTree2 36 , using the general time reversible evolutionary model with CAT approximation for among-site rate heterogeneity and assessment of node support with Shimodaira-Hasegawa (SH)-like local support values 37 . Subsequent maximum likelihood (ML) trees with sequences of interest were constructed in IQ-Tree 38 , using the best-fit substitution model determined by the program 39 , with assessment of node support with the ultrafast bootstrap approximation approach 40 . Trees were visualized with MEGA v7.0 41 . A phylogenetic network of NFLG sequences was also constructed with SplitsTree4 42 . In this analysis, the HKY + G + I evolutionary model was used (GTR is not available) and a 95% confidence network was estimated. Mosaic structures were analyzed by bootscanning 43 with SimPlot v1.3.5 44 , with tree construction using the neighbor-joining method and a window width of 250 nucleotides. Recombinant segments identified with Sim-Plot were further phylogenetically analyzed through ML with IQ-Tree and PhyML v3.0 45 (with assessment of node support in the PhyML analyses with the approximate likelihood ratio test, Shimodaira Hasegawa-like (aLRT SH-like) procedure 37 ) and through Bayesian inference with MrBayes v3.2 46 . The analysis with MrBayes was performed using the GTR + G + I substitution model. We ran two simultaneous independent runs and 8 chains 2-5 million generations long, ensuring that both runs reached convergence, as determined by an average standard deviation of split frequencies < 0.01. We discarded the first 50% of the trees in the posterior distribution as burn-in. The existence of adequate phylogenetic signal in short (< 200 nt) segments was analyzed through likelihood mapping 47 with IQ-Tree.
The time of emergence and most probable country location of the most recent common ancestor (MRCA) of the identified cluster and subclusters were estimated using Pr-RT sequences with the Bayesian Markov chain Monte Carlo (MCMC) coalescent method implemented in BEAST v1.8.4 51 . For this analysis, the positions in the alignment corresponding to codons containing antiretroviral drug resistance mutations in any of the sequences, as determined with Stanford University's database HIVdb program 52 , were removed. Prior to the BEAST analysis, the existence of temporal signal in the dataset was analyzed with Tempest 53 . Since according to this analysis there was insufficient temporal signal, we used as a prior parameter a normally-distributed substitution rate (1.33 × 10 -3 ± 2.57 × 10 -4 subst./site/year) estimated from 65 CRF12_BF sequences, which exhibited an adequate temporal signal (r 2 = 0.389 in TempEst analysis) ( Supplementary Fig. S1). The BEAST analysis was performed using the SRD06 codon-based evolutionary model 54 , an uncorrelated lognormal relaxed clock model and the Bayesian Skyline Plot population growth model 55

Results
In an HIV-1 molecular epidemiological study in Spain, we identified a phylogenetic cluster of 19 Pr-RT sequences from samples collected in five regions, nested within the CRF12_BF clade. In bootscan analyses, sequences from this cluster (henceforth, BF cluster) exhibited 5'-B/F/B/F-3' recombinant structures that were very similar to each other (Supplementary Fig. S2a-f). Their structures also showed some similarity with that of CRF12_BF, from which they differed in a longer subtype B segment in the Pr-RT junction (Supplementary Fig. S2g-i). To determine whether additional sequences in databases clustered with viruses of the BF cluster, all BF recombinant Pr-RT sequences ≥ 900 nt long deposited in the Los Alamos HIV-1 sequence database 56 and an additional BF recombinant sequence described in 57 , not available in the Los Alamos database but deposited in GenBank 58 (accession MF109665), were downloaded and phylogenetically analyzed with FastTree2. We found that 20 additional database sequences fell in the BF cluster, which was also well supported in a ML tree constructed with IQ-Tree (Fig. 1). Of the 39 viruses belonging to the BF cluster, 22 were collected in Spain, 5 in Bolivia, 4 in Argentina, 3 in Peru, 2 in the United Kingdom, 2 in Japan, and 1 in Sweden. Epidemiological data from samples collected in Spain, available for all samples processed by us (Table 1) and from one database sequence, indicated that individuals in the BF cluster residing in Spain were predominantly male (except for 6 out of 19), of South American origin (from Bolivia or Peru, but 5 were native Spanish), and infected via heterosexual contact (but 5 of 18 were MSM). Transmission route information was also available from one sample collected in Peru (DEURF13PE006), which was from a MSM. Available clinical data for the samples collected in Spain and studied by us are shown in Supplementary Table 1.
In the BF cluster there were three well supported subclusters. One comprised 14 sequences, all collected in Western Europe, mostly in Spain, but also 2 in the UK and 1 in Sweden; for 10 of the 11 viruses collected in Spain, the country of origin of the patient was known, which was Bolivia in 8 and Spain in 2. A second subcluster comprised 5 viruses collected in La Paz, Bolivia. The third subcluster comprised 12 sequences from samples collected in Spain (n = 7), Peru (n = 3), and Japan (n = 2), with all but one samples from Spain being from Peruvian individuals. These sublcusters were designated Euro-Bolivian, Bolivian, and Peruvian, respectively (Fig. 1). Interestingly, 4 out of 5 viruses (M1131, MS0254, TO0275, and DEURF13PE006) grouping in a sub-subcluster within the Peruvian subcluster were from MSM.
We obtained NFLG sequences from seven viruses in the BF cluster: three from samples collected in the city of Bilbao from individuals without known epidemiological links, two (P2633 and P3177) from Bolivian individuals  18 and Peru (DEURF13PE006), respectively, the viruses of the BF cluster grouped in a strongly supported clade (100% ultrafast bootstrap support) closely related to CRF12_BF and CRF17_BF and more distantly to CRF38_BF and CRF44_BF (Fig. 2). The tree also showed that BOL0137, belonging to the Bolivian cluster, is phylogenetically related to the Euro-Bolivian cluster, a relationship which was not apparent in the tree of the Pr-RT segment. The clustering of the viruses in the BF cluster and their segregation from other South American CRF_BFs was also supported in a 95% confidence network constructed with SplitsTree4 program (Supplementary Fig. S3). The mosaic structures of the 9 NFLG sequences were analyzed through bootscan analyses, which showed a complex BF1 recombinant pattern, with highly similar structures and genomes predominantly of F1 subsubtype (Fig. 3). Two ~ 7 kb-long sequences from viruses collected in UK 57 branching in the BF cluster also showed mosaic structures highly similar to those of the NFLG (Supplementary Fig. S4).
We examined the intersubtype transitions around breakpoints detected by bootscanning by looking at similarities between viruses in the BF cluster and the 75% consensuses of the B and F1 parental clades in the positions where they differ ( Supplementary Fig. S5). We found agreement in these transitions among the viruses of the BF cluster, with some minor differences explained by mutations occurring near the breakpoints due to viral evolution since emergence of their common ancestor.

Figure 2.
Maximum likelihood tree of NFLG sequences of the BF cluster and of references of CRF_BFs from the Southern Cone of South America and of subtypes. Names of sequences obtained by us are in bold type. In database sequences located in the BF cluster, the country of sample collection is indicated before the virus name with the two-letter ISO country code. In reference sequences, the subtype or CRF is indicated before the virus name. Only bootstrap values ≥ 80% are shown. www.nature.com/scientificreports/ To further examine the subtype assignment of the recombinant segments identified with bootscanning, we analyzed them with ML trees constructed with IQ-Tree and PhyML and with Bayesian trees constructed with MrBayes. Prior to these analyses, the existence of adequate phylogentic signal in three short (< 200 nt) recombinant segments, of subtype B in bootscan analyses (HXB2 positions 1085-1340, 7975-8069, and 8466-8640), was analyzed with likelihood mapping. Since our interest in these analyses was solely to determine the phylogenetic relationship of the viruses in the BF cluster with B and F1 subtype references, likelihood mapping analyses were performed in alignments including subtype references and only one BF virus at a time. The results showed that there was adequate phylogenetic signal in the 3 segments to determine phylogenetic relationship of each virus in the BF cluster with subtype references, with > 70% fully resolved quartets in all cases (Supplementary Table 2). The phylogenetic trees of recombinant segments confirmed the subtype assignments inferred with bootscan analyses, which were coincident among all viruses in the BF cluster (Fig. 4). In the Bayesian tree of the 8466-8640 fragment, the posterior probability support for the assemblage of viruses of the BF cluster with subtype B references was 0.94; however, none of the trees in the posterior (optimal) distribution, clustered the BF viruses with F1 references. The coincidence of subtype assignments among all viruses of the BF cluster included the short segment in the gp41-tat overlap (HXB2 positions 8466-8640) in which all viruses, except P4464 and DEURF13PE006, appeared to be of subtype B in bootscan analyses, with phylogenetic trees showing that all of them were of subtype B in this segment (Fig. 4). The phylogenetic trees also showed that the short subtype B segment around nt 8000 (HXB2 positions 7975-8069) was also found in CRF17_BF and CRF38_BF viruses ( Supplementary Fig. S6).
Intersubtype breakpoint locations were also analyzed with GARD, RDP4, and jpHMM. The breakpoint locations determined by these programs are shown in Supplementary Table 3. These were generally consistent with the analyses of bootscanning and phylogenetic trees of partial segments, although jpHMM failed to detect breakpoints delimiting two short subtype B segments in pol and env in all viruses, and some breakpoints in some viruses failed to be detected by some programs. Importantly, no breakpoint failed to be detected by all three programs.
Thus, these analyses show that viruses of the BF cluster have a coincident mosaic structure, which exhibits some similarity to those of CRF12_BF and CRF17_BF, but differs from both in the presence of a short subtype B segment in gag, absent from CRF12_BF and CRF17_BF, and from CRF12_BF also in the presence of a short subtype B fragment in env around HXB2 position 8000, which is absent from CRF12_BF ( Supplementary Fig. S6). Additionally, they also differ in breakpoint positions in p17 gag , RT and vpu ( Supplementary Fig. S5), where These results, therefore, show that NFLG sequences of viruses of a BF recombinant cluster group in a clade separate from other CRFs and exhibit a coincident and distinctive mosaic structure, indicating that they represent a new HIV-1 CRF, which was designated CRF89_BF. The mosaic structure of CRF89_BF inferred from bootscan analyses, ML phylogenetic trees of partial segments, analyses with GARD, RDP4, and jpHMM, and intersubtype consensus transitions around breakpoints detected by these methods (Fig. 5) indicates that it is predominantly of subtype F, with 13 breakpoints delimiting 7 subtype B and 7 subtype F segments. Its close relationship with CRF12_BF and CRF17_BF and more distant relationship with CRF38_BF and CRF44_BF are supported by phylogenetic clustering (Fig. 2) and coincidences in 6, 7, 4, and 3 breakpoints, respectively (Fig. 5).
We analyzed amino acid residues found in all or most of the CRF89_BF viruses and absent or uncommon in the related CRF_BFs, numbers 12, 17, 38, and 44, and in the parental Brazilian F1 strain. Twelve characteristic CRF89_BF residues were found, distributed in Gag, Pol, Tat, Rev, and Vif proteins (Supplementary Table 4).
We estimated the time of emergence and country of origin of the MRCA of CRF89_BF using Pr-RT sequences and a Bayesian coalescent method. Since the CRF89_BF alignment lacked a sufficient temporal signal, we used as a prior parameter a substitution rate estimated from 65 CRF12_BF sequences, which exhibited an adequate temporal signal (Supplementary Fig. S1). For the BEAST analysis, the country of origin of the individual, when known, was used. This was done because we found no definitive evidence of the epidemic spread of CRF89_BF in Spain (as reflected in clustering among Spanish individuals, which was seen only in two individuals-CU0019 and CU0020-residing in the same city), and, therefore, we assumed that Bolivian and Peruvian immigrants (who represented all foreign-born individuals in the data set) had probably acquired HIV-1 in their countries of origin. For individuals whose country of origin was unknown, country of sample collection was used as location trait. According to this analysis (Fig. 6), the mean estimated time of the MRCA (tMRCA) of CRF89_BF was 1986 (95% HPD, 1978-1992) and its most probable location was Bolivia (PP = 0.851), with the second most probable location being Argentina (PP = 0.097). Estimated times and locations of MRCAs of clusters were 1992 (1985-1997)

Discussion
The HIV-1 epidemic in Argentina and the neighboring countries of Uruguay, Chile, Paraguay, and Bolivia is characterized by the cocirculation of B subtype and BF1 recombinant viruses [16][17][18][19][20][21][22][23][24][25][26] . Most of the BF1 recombinant forms in these countries appear to derive from a common recombinant ancestor, as inferred from coincident breakpoints and clustering in phylogenetic trees 17,18,23,[27][28][29] . As the subtype F fragments of these recombinants cluster with viruses of the F subtype strain circulating in Brazil and there is no evidence of the circulation of this strain in other South American countries, it has been proposed that the common ancestor of these recombinants www.nature.com/scientificreports/ might have originated in Brazil. Subsequent recombination events would have given rise to a great diversity of recombinant forms 17,29 , some of which became circulating, of which CRF12_BF, CRF17_BF, CRF38_BF, and CRF44_BF had been identified previously 17,18,23,27,28 . Due to their common ancestry and similarity in recombination structures, all these viruses have been proposed to constitute a CRF "family" 31,32 (similarly, other CRF families could be the CRF_BGs from Cuba, numbers 20, 23, and 24 59 , CRF_BGs from Spain and Portugal, numbers 14 and 73 60 , and CRF_01Bs from Malaysia, numbers 33, 53, 58, and 74 61,62 ). The first to be identified in the CRF_BF family from the Southern Cone of South America was CRF12_BF, which is widely circulating in Argentina and Uruguay [16][17][18] (Fig. 1)] forms part of a larger cluster, comprising 39 viruses collected in two other South American countries (Peru and Argentina), three European countries (Spain, United Kingdom, and Sweden), and Japan, with samples collected in Spain representing a majority, although most of them are from Bolivian or Peruvian individuals (Fig. 1). We show through the analysis of 9 NFLG sequences, 7 of them newly derived from samples collected in Spain and two from databases from samples collected in Bolivia and Peru, that the identified cluster represents a new CRF derived from subtypes B and F1, designated CRF89_BF (Figs. 2 and 3). This CRF is closely related to CRF12_BF and CRF17_BF, as deduced from multiple breakpoint coincidences and close phylogenetic clustering, and more distantly to CRF38_BF and CRF44_BF. CRF89_BF has a complex mosaic structure with 13 breakpoints, delimiting 7 subtype F and 7 subtype B fragments. One of the subtype B segments, in gag, is absent from CRF12_BF and related CRFs and another segment in env is absent from www.nature.com/scientificreports/ CRF12_BF, but found in CRF17_BF and CRF38_BF. Breakpoint coincidence with different CRF_BFs from the Southern Cone suggests a complex scenario of BF recombinant generation in this area through successive rounds of recombination with subtype B viruses, as previously proposed 28 . However, it seems unlikely that CRF89_BF derives from CRF12_BF or CRF17_BF, since in the NFLG phylogenetic tree the CRF89_BF clade is not nested within CRF12_BF or CRF17_BF radiations, but forms a separate clade (Fig. 2), and exhibits several differences in breakpoint locations from both CRFs (Supplementary Fig. S5).
In a phylogenetic tree of Pr-RT, CRF89_BF comprised three major clusters. One comprised exclusively samples collected in Western Europe (Spain, UK, and Sweden); however, out of 10 individuals with data on country of origin (all residing in Spain), 8 were Bolivian and only 2 were Spanish, whose viruses branch interspersed among those from Bolivian individuals. Therefore, it seems reasonable to assume that this cluster (which was designated Euro-Bolivian cluster) originated and spread initially in Bolivia, and its finding in Western Europe reflects the importation of infections acquired in Bolivia rather than local circulation of CRF89_BF. Otherwise, clustering of CRF89_BF strains among native European individuals would be expected but was not seen. Failure to identify viruses collected in Bolivia within the Euro-Bolivian cluster may be due to the low number of HIV-1 sequences from Bolivia available in public databases. A second CRF89_BF cluster comprises all five samples collected in Bolivia, all from La Paz. In a phylogenetic tree of NFLG, one virus of this cluster is closely related to viruses of the Euro-Bolivian cluster. The third cluster comprises 3 sequences from Peru, 6 from Peruvians residing in Spain, 1 from a Spaniard residing in Spain, and two from Japan, the last ones closely related to a Peruvian virus. Similarly to the case of the Euro-Bolivian cluster, we assume that this cluster represents a variant originated and circulating in Peru and that its presence in Spain and Japan probably reflects the importation of infections acquired in Peru, rather than the local circulation of CRF89_BF. It is interesting to point out that although a small proportion of HIV-1 BF recombinant viruses have been identified in Peru (approximately 2% 19,63 ), no evidence has been published of their circulation among the local Peruvian population. Therefore, the results presented here would be the first evidence indicating that an HIV-1 BF1 recombinant form, in this case CRF89_BF, is most likely circulating in Peru. It is also interesting to note that although heterosexual transmission is predominant among CRF89_BF infections, all 4 infections with information on transmission route in a subcluster of 5 individuals within the Peruvian cluster were found in MSM. This reflects the circulation of CRF89_BF among Peruvian MSM and the linkage between HIV-1 heterosexual and MSM transmission networks. A similar linkage was observed in a CRF02_AG cluster in Spain, although in this case the spread was from an MSM to a heterosexual network 64 .
According to phylodynamic estimations, CRF89_BF probably emerged in Bolivia around the mid-1980s, with its major clusters emerging around the first half of the 1990s, two of them in Bolivia and one in Peru (Fig. 6). These estimations were done assuming that CRF89_BF infections in Bolivian and Peruvian individuals residing in Spain acquired their infections in their country of origin, which seems a reasonable assumption, as discussed above. However, since we could not rule out that subclusters of more recent origin comprising viruses sampled in Spain reflected local transmissions, a second analysis assuming HIV-1 acquisition in Spain of the most recently diagnosed infections of subclusters comprising Bolivian or Peruvian individuals was performed, yielding similar results ( Supplementary Fig. S4). The MRCA of CRF89_BF, according to our estimations, would be around 11 years more recent than that of CRF12_BF ( Supplementary Fig. S8). However, we cannot rule out an earlier emergence of CRF89_BF, since estimations could have changed with a more representative sampling of Bolivian HIV-1.
In Bolivia, CRF89_BF was detected in only 5 samples from La Paz, 4 collected in 2005 and 1 in 1997. In 2005, CRF89_BF represented 13.3% HIV-1 samples collected in La Paz sequenced for Pr-RT. However, given the low proportion of Bolivian HIV-1 strains sequenced and the fact that no sequences from samples collected after 2005 are available in public databases, the current prevalence of CRF89_BF in Bolivia and its geographical spread in that country cannot be accurately estimated. Considering that in one of the major CRF89_BF clusters 8 of 10 viruses, all of which were collected in Europe, were from Bolivian individuals, and that 18% of the HIV-1-infected Bolivian individuals residing in Spain studied by us harbored CRF89_BF viruses, we hypothesize that CRF89_BF could be circulating widely in some areas of Bolivia.
The identification of CRF89_BF infections in Spain and other European countries, mainly in South American immigrants, reflects the increasing relation between the South American and European HIV-1 epidemics, which is also reflected in the expansion in Western Europe of clusters of South American strains of subtypes C 64-68 and F1 33,69-71 , of CRF12_BF 72 , and of CRF17_BF 67 , and in the identification in Western Europe of CRFs derived from parental strains of South American ancestry [73][74][75] .
The identification of CRF89_BF and other CRFs in NFLG sequences is relevant for molecular epidemiological studies because it allows for the proper characterization of HIV-1 strains circulating in different geographic areas and population groups. In this regard, some CRF89_BF viruses were misclassified as CRF12_BF viruses in GenBank submissions (accessions MF403410, MF403416). Such misclassification may not be irrelevant, since, even though both CRFs exhibit similar mosaic structures, they are not identical and form separate clades. It should also be pointed out that even relatively minor genetic differences in viral genomes may result in important biological differences. Examples in HIV-1 are CXCR4 coreceptor usage in CRF14_BG, which is associated with only four amino acid residues in the Env V3 loop 76 , all or most of which are absent in viruses of the closely related CRF73_BG 60 , which has a very similar, but not identical, mosaic structure; and differences in pathogenic potential or therapeutic response associated with clusters within HIV-1 genetic forms 77,78 . The identification of CRF89_BF may be also relevant for the development and testing of vaccines intended for use in areas where this CRF circulates, considering the correlation of susceptibility to protective immune responses with HIV-1 clades and with intraclade genetic diversity 5 .