Current and historic HIV-1 molecular epidemiology in paediatric and adult population from Kinshasa in the Democratic Republic of Congo

HIV-1 diversity may impact monitoring and vaccine development. We describe the most recent data of HIV-1 variants and their temporal trends in the Democratic Republic of Congo (DRC) from 1976 to 2018 and in Kinshasa from 1983–2018. HIV-1 pol sequencing from dried blood collected in Kinshasa during 2016–2018 was done in 340 HIV-infected children/adolescents/adults to identify HIV-1 variants by phylogenetic reconstructions. Recombination events and transmission clusters were also analyzed. Variant distribution and genetic diversity were compared to historical available pol sequences from the DRC in Los Alamos Database (LANL). We characterized 165 HIV-1 pol variants circulating in Kinshasa (2016–2018) and compared them with 2641 LANL sequences from the DRC (1976–2012) and Kinshasa (1983–2008). During 2016–2018 the main subtypes were A (26.7%), G (9.7%) and C (7.3%). Recombinants accounted for a third of infections (12.7%/23.6% Circulant/Unique Recombinant Forms). We identified the first CRF47_BF reported in Africa and four transmission clusters. A significant increase of subtype A and sub-subtype F1 and a significant reduction of sub-subtype A1 and subtype D were observed in Kinshasa during 2016–2018 compared to variants circulating in the city from 1983 to 2008. We provide unique and updated information related to HIV-1 variants currently circulating in Kinshasa, reporting the temporal trends of subtypes/CRF/URF during 43 years in the DRC, and providing the most extensive data on children/adolescents.


Children/adolescents (%) Adults (%) Total (%)
Subjects with pol sequence 55 ( Cluster identification between non-epidemiologically related adults and children/adolescents. Among the 165 pol sequences from Kinshasa we identified 4 independent transmission clusters ( Table 2 and Sup. Fig. 2-5) supported by 100% bootstrap, presenting recent transmission (genetic distance lower than 0.01) in three of them: cluster 1, cluster 3 and cluster 4. Infected individuals from clusters 1, 2 and 3 carried subtype A pol sequences, although they were not ascribed to previously characterized sub-subtypes A or CRF including A at pol coding region. They appeared in 3 different branches in PhyML trees after including all LANL sequences ascribed to subtype A and their sub-subtypes circulating in the DRC since 1976 (Sup. Fig. 2-4). Subjects from cluster 4 harbored subtype H viruses, clustering apart from other subtype H LANL sequences from the DRC (Sup. Fig. 5). Table 2 shows the epidemiological data collected from clinical reports of each patient. All clusters included 2 subjects, except cluster 1 including a 34.7 year-old woman, a 6.2 year-old boy and a 14.3 yearold adolescent male with no reported epidemiological link (Sup. Fig. 2). Clusters 2 and 3 included two adults and cluster 4 two children.
Transmission networks involved a common recombinant fragment in recent paediatric/adolescents sequences from Kinshasa. Among   26.7%, p < 0.05) and sub-subtype F1 (0% vs. 4.8%, p < 0.05) and a significant reduction in sub-subtype A1 (11% vs. 0%, p < 0.05) and subtype D (7.7% vs. 2.4%, p < 0.05) compared to variants from Kinshasa circulating from 1983 to 2008, whereas the proportion of CRFs remained similar across both sequence sets (12.1 vs. 13.2%, respectively). We detected CRF47_BF virus for the first time in the country using pol sequences collected after 2016. URF tended to decrease over time (31% vs. 23.6%). The remaining variants remained stable over time (Fig. 3D).      (Table 4). Figure 4, shows the high genetic diversity over time in pol sequences sampled in the DRC since the 1976 (year of first available pol sequence) to 2018, with some variations by periods of time.

Discussion
The DRC is known as the origin of HIV-1 pandemic and the epicenter for the selection and spreading of many HIV-1 variants to neighboring countries 28 . As a consequence of high HIV-1 genetic heterogenicity (probably the highest diversity rate in the world), the accurate phylogenetic reconstructions have not been easy to interpret 29 . For this reason, our first approach was to reanalyze the available HIV-1 sequences deposited in LAN from the DRC, observing 8.8% of misclassified sequences according to the new reclassification of HIV-1 variants and phylogenetic programs used in this work. Global geographical patterns in HIV-1 variants distribution are changing over time due to several factors, such as population movements or the dense transmission networks. These factors are contributing to an unpredictable HIV-1 pandemic 30,31 . Consequently, a continuous and accurate molecular epidemiology surveillance is necessary for increasing our knowledge about the evolving HIV-1 epidemic, especially in those geographical areas with high genetic diversity rate, where multiple HIV-1 variants co-circulate 20,32 , such as the DRC. In these settings, the selection of new recombinant forms is easier and the viral evolution could be faster than in those regions with low diversity of HIV-1 33 .
Due to the lack of updated data on HIV molecular epidemiology in the DRC, we present the information of 165 pol sequences obtained in adult and paediatric populations, representing the most recent data on circulating HIV-1 subtypes and recombinant forms in Kinshasa ranging 2016-2018. We also compared the temporal trends of HIV-1 variants in the DRC for a 43-years period, based on 28 1,10,24,28,29,34-56 published studies. Sup. Table 1 shows the sampling year, sample type, study population, number of analyzed sequences, coding regions, subtyping method and sequence submission to databases in each 28 published studies characterizing HIV-1 variants in the DRC. Since HIV is prone to recombination during retrotranscription, high levels of recombinant forms after coinfections or superinfections are expected, especially in places with high viral diversity, such as in the DRC 24,34 . In this scenario HIV-1 can accelerate adaptation to the host, favouring emerging variants with unknown pathogenicity 57 .
Previous studies have also reported heterogeneous geographic distribution of group M variants across regions in the DRC 58 , with high prevalence of recombinants. However, HIV recombinant rates in our study and previously reported in the country could be underestimated since most of them were classified by using short partial genome sequences (Sup. Table 1) and not full genomes. Thus, the recombinant rate in the DRC could be even higher if more viral regions are assessed. Although the biological and clinical impact remains to be clarified, there is growing evidence that recombination has played a significant role in the early history of the HIV-1 pandemic and in viral evolution 59 . We reported the first identification of CRF47_BF in Africa and the DRC, a recombinant variant first described in Spain 60 . It would also suggest a possible importation of subtype B to the DRC.
This study confirmed subtype A as the most prevalent subtype, which has maintained a high rate of infections in the DRC over the last three decades 61 , explaining the high rate of URF carrying subtype A fragments circulating in Kinshasa described in the last years. Kinshasa has been proposed as the origin of A, G and F1 subtypes 28 , common found HIV-1 variants in our series. Although subtype D was associated with faster disease progression 13 , was also initially emerged in Kinshasa, its prevalence decreased in the city during 2016-2018 compared to previous years. Although a significant increase of subtype C was reported from 1997 to 2002 in Kinshasa 38 , in other countries close to the DRC 62 and worldwide 15 , we observed a similar prevalence of this variant in Kinshasa comparing the two time periods under study. The reasons for these findings are unknown.
We also found that almost 4 out of 10 variants circulating in Kinshasa during both periods (1983-2008 and 2016-2018) carried recombinant sequences in pol. When CRFs and URFs diversities were analysed according to age of patients, CRFs diversity in pol was higher in adults and URF diversity higher in children/adolescents in the last period. The trends of HIV-1 genetic diversity should be further explored in the coming years, also including other HIV-1 genes. The high misclassification in sub-subtype A and U pol sequences from the DRC deposited in LANL reveals the need for reclassification by phylogeny with updated reference sequence sets and new reported HIV-1 variants.
Phylogenetic approaches and distance-based analysis represent the most common strategy to identify recent transmission networks 63 , and it could be useful to identify other HIV infected and uninfected persons at highest risk of transmission who could benefit from HIV prevention interventions 64 . Some reports have demonstrated that genetic distance restrictive thresholds between 0.01 and 0.02 substitutions per site have been more strongly www.nature.com/scientificreports/ associated with probable transmission partners than traditional epidemiological connections [65][66][67] , and a distance of 0.015 could serve as a use proxy for epidemiological relatedness in a surveillance setting 66 . Our study identified that 3 out of 4 transmission clusters were recent (genetic distance thresholds 0.01-0.02 [65][66][67] ). We implemented a strategy based on traceability of genetic fragments, to know the potential network among infected people (Sup. Fig. 6). This strategy, "recombination network", gave us a dense net and probably a more complete view. The recombination network found involving pol sequences from 17 children/adolescents with no epidemiological link according to clinical data confirmed the important role of recombination in an HIV pandemic and the importance of common ancestor identification to understand recombination origin and spread. As the children are overrepresented in this network, we should suspect that this network is even denser. This study has some limitations that need to be considered. Firstly, samples from 2016-2018 were collected in two reference hospitals in Kinshasa, which cannot be fully representative of the situation in a city and even more at a country-wide level 58 . However, the inclusion of all pol sequences from LANL collected in the country over more than 4 decades, from the general population and risk groups, provides a good overview of the huge diversity of the country. A second limitation is the length of recovered pol sequences, which differed across samples, complicating bioinformatic analysis. A third limitation is that some reports in the DRC did not submit sequences to databases, and they were not available in LANL (Sup. Table 1). These reported HIV-1 variants could therefore not be reevaluated by new PhyML analysis to be included in the temporal tend analysis of HIV-1 variants. Thus, we encourage the necessity of sequence submission in all HIV epidemiology studies worldwide before publication, and the inclusion of the sampling year in all submitted sequences, absent in 214 (8%) of pol sequences downloaded from LANL in that country. The lack of routine resistance testing during clinical follow-up of HIV-infected subjects in the DRC limits the pol sequences availability in the DRC. Furthermore, the lack of complete epidemiological information from subjects involved in transmission or recombination clusters limits full understanding of a cluster´s origin. Finally, 51.5% of DBS did not yield any viral sequences, possibly due to the low viral load (40-1000 HIV-1 RNA copies/dot) in more than half of specimens and to the low plasma volume in the 2 dots used for HIV-1 RNA extraction, limiting the ability to get positive PCR amplifications of HIV in these samples.
Since none of sequences downloaded form LANL was sampled after 2012, our sequence set was the most recent in the DRC to date and the highest in number of analyzed pol sequences in Kinshasa. Our study reinforces the use of dried blood as a field-friendly, useful, convenient and alternative specimen type to whole blood or plasma in HIV molecular epidemiology surveillance studies in developing countries or settings with limitations as regards the collection, storing, transportation of plasma or when low blood volume is available 68 Table 1). Although two previous studies reported the full HIV-1 genome (including pol) from a vertically infected 12-month-old baby 54 or gag sequences from 15 children in Kimpese, rural DRC 53 , our study presents the most extensive data regarding HIV-infected children and adolescents in the country. The updated high genetic diversity observed in the DRC also represents a real challenge for future vaccine development and for efficiency of antiretroviral treatment, diagnostic and monitoring tests of HIV infection 69 .
In conclusion, we report the most recent data related to HIV-1 variants circulating in Kinshasa, the geographical origin of the pandemic, and unique and updated information on the temporal trends of HIV-1 subtypes, CRF and URF in the DRC during 43-year period (from 1976 to 2018) and in Kinshasa from 1983-2018 after reclassification of several available LANL sequences using phylogenetic approach. The data provided increase and update the knowledge of HIV molecular epidemiology in the DRC. Active transmission clusters were detected, and a new strategy offers us a more complete view of transmission networks. Of concern is an overrepresentation of children was observed in the recombination network. Continued molecular surveillance will be essential to determine and trace rare unique recombinant forms or emerging strains of HIV in the country. www.nature.com/scientificreports/ sequences from all HIV-1 variants described to date with available sequence were included (groups, subtypes, sub-subtypes, CRFs). One HIV-1 group N sequence was used as an outgroup. Sequences from groups P/O/N were also downloaded to discard non group-M infections. To alleviate the burden of computer-time required to reconstruct large phylogenies fast algorithms, phylogenetic trees (PhyMLtree) were reconstructed by maximumlikelihood (ML) method with RAxML v8.0 (Randomized Axelerated Maximum Likelihood) 74 using the general time reversible plus proportion of invariable sites plus gamma distribution parameter (GTR + I + G) evolutionary model. To estimate the bootstrap values on the inferred topology by RAxML, Shimodaira-Hasegawa (SH) test using FastTree program was used (support > 90%) 75 . Sequences not clustering with any variant were analyzed using Recombination Detection Program (RDP3v4.13) 76 , identifying the subtypes involved in eventual recombination events and hypothetical recombination breakpoints. To further confirm the detected putative recombination events, new phylogenetic analyses were performed using the sequence fragments assigned to different subtypes according to the proposed breakpoint position(s) defined by RPD3. The topologies obtained with each fragment were compared to SH, expected likelihood weight, and Kishino-Hasegawa tests using the TREE-PUZZLE 5.2 program. In the positive cases, the recombinant sequences were redefined as URFs; otherwise, sequences with a most recent common ancestor to subtypes sub-subtypes A1-A6 or F1-F2 were identified as subtypes A or F, respectively. The remaining cases were appointed as U. Since it was not possible to sequence the same fragment for all sequences, different regions were analyzed separately. Transmission clusters in our 165 study sequences from Kinshasa (2006-2018) were defined as viral sequences belonging to the same subtype/CRF/URF and grouped into a single and well-supported cluster (monophyletic clade) with 100% bootstrap values. When only partial pol sequences could be recovered from new samples, each available sequence fragment present in all viruses from the cluster was used for analysis separately. In this way, new recombination events were detected. Sequences with different recombination events but sharing some recombinant fragment and common progenitor virus were considered as transmission networks.

Material and methods
Genetic diversity (D = 1 − ∑f 2 ) of HIV variants in the DRC was analyzed over time, a measure of variability that takes into account the frequencies (f) of all variants. For that purpose, all 2802 pol LANL sequences from the DRC were downloaded and reclassified as previously described. Among them, 2588 (92.3%) reported the sampling year, all belonging to the 1976-2012 period. Viral genetic diversity among HIV-infected children and adults with samples collected in Kinshasa between 2016 and 2018 was also calculated. We analyzed the genetic distance (number of base substitutions per site) or the average evolutionary divergence over all sequence pairs by using the Tamura-Nei model 93 (TN93) 77 , according to previous reports 67 . The rate variation among sites was modeled with a gamma distribution and analysis was conducted in MEGA6. We identified a recent transmission cluster in pol sequences showing maximum pairwise genetic distance lower than 0.01, according to previous reports [65][66][67] . TN93 was used because it is the most general nucleotide substitution model for which distances can be estimated directly from counts of nucleotide pairs in aligned sequences.
Ethical aspects. The project was approved by the Human Subjects Review Committees at Monkole Hospital/University of Kinshasa (Kinshasa, DRC), University Hospital Ramón y Cajal (Madrid, Spain) and University of Navarra (Pamplona, Spain). Informed consent of all enrolled adults and of parents or guardians of enrolled children was obtained. All methods were carried out in accordance with relevant guidelines and regulations. Patients' names were codified at sampling to maintain confidentiality. Statistical analysis. Differences in prevalence of HIV-1 variants were tested using T-test and p values < 0.05 were considered statistically significant. Descriptive statistical analysis was performed, median and interquartile range (IQR) was also calculated. Statistical analyses were conducted using Prism 6.0 software from GraphPad version 8.0.1 (San Diego, CA, USA).