The molecular epidemiology of HIV-1 in the Comunidad Valenciana (Spain): analysis of transmission clusters

HIV infections are still a very serious concern for public heath worldwide. We have applied molecular evolution methods to study the HIV-1 epidemics in the Comunidad Valenciana (CV, Spain) from a public health surveillance perspective. For this, we analysed 1804 HIV-1 sequences comprising protease and reverse transcriptase (PR/RT) coding regions, sampled between 2004 and 2014. These sequences were subtyped and subjected to phylogenetic analyses in order to detect transmission clusters. In addition, univariate and multinomial comparisons were performed to detect epidemiological differences between HIV-1 subtypes, and risk groups. The HIV epidemic in the CV is dominated by subtype B infections among local men who have sex with men (MSM). 270 transmission clusters were identified (>57% of the dataset), 12 of which included ≥10 patients; 11 of subtype B (9 affecting MSMs) and one (n = 21) of CRF14, affecting predominately intravenous drug users (IDUs). Dated phylogenies revealed these large clusters to have originated from the mid-80s to the early 00 s. Subtype B is more likely to form transmission clusters than non-B variants and MSMs to cluster than other risk groups. Multinomial analyses revealed an association between non-B variants, which are not established in the local population yet, and different foreign groups.

Worldwide, the most prevalent HIV-1 subtype is C, accounting for around 50% of all cases. However, the HIV epidemic in Europe, particularly among MSM, is mainly driven by subtype B 11 , with frequent reported transmission clusters affecting this group [12][13][14][15][16][17][18][19][20] . However, there is evidence for an increased introduction of non-B subtypes 11 . For instance, in a sample of 206 patients from Spain, Abecasis and colleagues 11 found that CRF02_AG was the second most prevalent HIV-1 variant after subtype B (prevalence <2%).
The Comunidad Valenciana (CV) is the fourth largest region in Spain (~5 million inhabitants), representing >10% of the total population. Genotypic tests of resistance to antiviral drugs are performed routinely for the design of individualized antiretroviral treatments. Here, we have analysed more than 1,800 HIV-1 pol sequences obtained between 2004 and 2014 from different patients in the CV. We have used molecular phylogenetic analyses to complement HIV surveillance tasks of the Public Health Directorate of the CV Regional Government. Specifically, we have used this information to infer the distribution of HIV-1 subtypes, to analyse the introductions (and further local expansion) of this virus in the CV, to identify the emergence of new viral variants in this region, and also to analyse which risk groups are currently more vulnerable to HIV infection. The results obtained from this work may be useful in establishing and reinforcing preventive measures in specific target groups.
Phylogenetic analyses revealed the existence of 270 transmission clusters, with sizes ranging from 2 to 111 patients (Fig. 1A). In total, 1039 patients from the dataset were included in a transmission cluster (57.5%), with 302 patients (16.7%) included in large clusters of 10 or more patients (Table 1 Male  818  10  13  5  24  11  49  930   Female  115  6  6  8  25  6  24  190   UNK  579  18  15  7  17  11  37  684   Geographical origin   Spain  579  6  5  3  11  8  25  Twelve large transmission clusters, which included at least 10 patients from the CV (Fig. 2), were detected: 11 were from subtype B and included a total of 281 patients ( Fig. 2A). One cluster of 21 patients corresponded to CRF14_BG (Cluster K, Fig. 2B). Patients infected with subtype B were more likely to form large transmission clusters than those infected with non-B variants (FET: p-value = 3.4 × 10 −7 , OR = 2.94, 95% CI = 1.84-4.92). Nine of the large clusters were classified as MSM; the other three included mostly IDUs, although in a proportion lower than 50% (  21 ) representing tree height (the time elapsed between tMRCA and the sampling date of the last sequence) and evolutionary rate estimates from the posterior distribution of each large cluster are shown in Fig. 3. A significant, negative correlation between median tree height and substitution rate estimates of these large clusters was obtained (R = −0.70, p-value = 0.013).
A total of 88 subtype B, and 9 non-B transmission clusters did not include patients of foreign origin. These subtype B clusters included 178 Spanish patients and 122 of unknown origin, and the non-B clusters included 13 Spanish patients and 14 of unknown origin. A total of 49 subtype B, and 7 non-B transmission clusters were formed by both Spanish and non-Spanish patients. Altogether, these mixed subtype B clusters included 200 Spanish, 87 non-Spanish (mostly from Latin America) and 113 of unknown origin, and the mixed non-B clusters included 18 Spanish, 10 non-Spanish (mostly from Latin America or Eastern Europe) and 13 of unknown origin. All large transmission clusters but one (cluster G, for which there was not information available in 17 of its 19 patients) presented this characteristic. Finally, a total of 16 subtype B and 25 non-B transmission clusters included only patients of foreign origin. These subtype B clusters included 27 foreign patients (mostly from Latin America) and 16 of unknown origin, and the non-B clusters included 37 foreign patients (mostly from Africa/Middle East and Latin America) and 20 of unknown origin (Supplementary Table S1).
Multinomial analyses were performed using a subset of 906 patients for whom there was information for all the variables considered: sex, risk group, country of origin, clustering status and age. We considered as baseline the most frequent category for each variable (subtype B, Spaniard, male, MSM, age between 21 and 29 years and not clustering), and a significant model was obtained (McFadden R 2 = 0.32, Chi 2 = 421.34, p-value < 2.2 × 10 −16) ). We checked that this subset was representative of the global set (n = 1804) by means of FETs. The p-values obtained for all the analysed variants were >0.05, including the distribution of patients from the different cluster-size categories (no cluster, small cluster, medium cluster and large cluster) that would result if only sequences included in the multinomial analysis had been taken into account for cluster detection, (FET: P > 0.10; Supplementary  Table S2). This dataset was also genetically representative of the whole dataset ( Supplementary Fig. S13). The  Table S3).

Discussion
We have studied the HIV epidemic in the Comunidad Valenciana by analysing, with molecular and evolutionary tools, 1804 sequences obtained between 2004 and 2014 from different patients. Our results indicate that the HIV epidemic in the CV is dominated by HIV-1 subtype B infections among local MSM. However, non-B infections represented an important number of cases, with a prevalence higher than 15%, being CRF02_AG the most prevalent non-B variant (prevalence = 3.7%), in agreement with previous estimates for Spain 11,22 .
Overall, the detection of transmission clusters demonstrates the importance of the domestic spread of HIV-1 in the CV. Most patients from the whole dataset (>57%) were included in local transmission clusters, of sizes ranging from 2 to 111 individuals. Local transmission was especially important among MSM, who were more likely to be included in a transmission cluster, as well as for Spanish natives, than other risk groups. Considering the 12 largest clusters (size ≥10 patients), this transmission risk was the most frequent in 9 of them. Excluding transmission cluster A (MSM, tMRCA = 1984.7), MSM clusters were of more recent origin as estimated using Bayesian coalescent analyses, especially clusters B (n = 13), E (n = 18), H (n = 19) and L (n = 111), which originated after year 2000. Previous analyses in the Spanish regions of Madrid 23 and the Basque Country 24 detected lower proportions of clustering patients (18 and 27% of their analysed sequences, respectively). These results suggest that the importance of local transmissions may not be the same in different Spanish regions. In addition, they also found evidences for an increased vulnerability to HIV of the Spanish MSM community in recent years. A more detailed analysis of cluster L is provided elsewhere 25 .
Immigrants were disproportionally represented in the dataset (almost 1/3 of the patients were of non-Spanish origin), reflecting their higher vulnerability to HIV infection. Multinomial analyses evidenced the significant association between all non-B groups analysed and different foreign populations: Eastern Europe (subtypes A1, F1, CRf14_BG and the rare variants), Africa and the Middle East (subtypes F1, G, CRF02_AG and rare variants) and Latin America (rare variants). These associations were in agreement with the geographical distributions of these variants [26][27][28] . These results, along with the fact that most non-B patients, either clustering or not, were of non-Spanish origin (CRF14_BG was the only exception) suggest that along the analysed time-span non-B HIV variants were not well established among Spanish locals in this region. This is exemplified by the fact that almost half of the non-B transmission clusters did not include any Spanish patient. Furthermore, non-B patients clustered in a significantly lower proportion than patients infected with subtype B, especially when considering clusters with at least 10 patients, thus displaying significantly lower local transmission efficiency. Previous molecular epidemiology studies of HIV-1 in Western Europe also reported non-B variants to be associated with the migrant population 11 , mainly affecting HTs. In the Spanish region of Madrid, where non-B variants have been found with a prevalence of up to 37% 23 , more than 70% of non-B sequences were found to belong to non-Spanish patients 29 . Other works in different Western European countries have reported differences in the prevalence of  Although 11 of the 12 largest transmission clusters were of subtype B, one of the largest clusters found (n = 21, median tMRCA = 1990.6) corresponded to the CRF14_BG, and included a high number of IDUs from Spain. CRF14 has been associated to a predominance of CXCR4 tropism, which usually leads to faster AIDS onset 33,34 . Previous phylogeographic analyses have suggested that CRF14_BG originated in the Iberian Peninsula 33 , and its prevalence is increasing in some Eastern European countries, such as Romania, boosted by migration between these countries and Spain 35 . Similarly to Romania, this CRF has been found to be common among IDUs in Greece, where a large transmission network of CRF14_BG of recent origin (tMRCA = 2009) has been reported to affect more than 70 people, and which origin has been linked to strains from South-western Europe 36 .
We also detected two other transmission clusters of smaller size (n = 5 and n = 4) affecting Spanish-native MSMs of another highly pathogenic variant (CRF19_cpx), which we previously reported as the first evidence of expansion of this variant outside Cuba 37 . Despite the prevalence of these CRFs remaining low in the CV, the effective expansion of these highly pathogenic HIV variants, evidenced by the detection of transmission clusters, is of especial interest because it may hamper the control of the local HIV epidemic, especially among vulnerable populations such as IDUs or MSMs.  Table 3. Results of the multinomial analysis (only significant associations between each HIV variant and the categories compared at each variable are shown). ***P < 0.001; **0001 < P < 0.01; *0.01 < P < 0.05; $0.05 < P < 0. The significant association between non-B variants and immigrants reported in this work and in previous molecular epidemiology analyses of HIV-1 in Madrid, Spain 23,29 , as well as the aforementioned examples of emergence of CRFs associated with migratory waves, suggest that high migration and tourism rates in some Spanish regions may contribute to the high genetic diversity of their HIV-1 epidemics. Thus, although our results suggest that non-B HIV groups are not well established in the local population, with only a few of them including patients of both Spanish and non-Spanish origin, this situation could change soon without intensified HIV prevention campaigns focused on vulnerable groups, including migrants, IDUs and MSM.
Although the large number of sequences for which no epidemiological information was available could be a potential limitation of our study, the results reported are representative of the sampled population and represent one of the most comprehensive analysis of the HIV pandemics in Spain 22,23,29,38 .
We have found a significant, negative correlation between tree height and evolutionary rate estimates obtained when comparing the 12 largest clusters. Several publications have addressed this time-dependency on the evolutionary rate (TDRP), that is, virus lineages estimated to have a recent origin yield higher estimated evolutionary rates than older viruses [39][40][41] . In our study the most plausible explanation for this phenomenon is the overestimation of evolutionary rates in the most recent clusters, caused by the presence of deleterious mutations over which purifying selection has not had time to act. This phenomenon might be potentiated by the bottlenecks that occur at viral transmission, which might originate a transient accumulation of deleterious mutations. Consequently, these results suggest that TDRP is an important factor to consider in molecular epidemiology, even when datasets are obtained from the same population and at short timescales (in this case 10 years, from 2004 to 2014).
It is also important to mention that many works on the molecular epidemiology of HIV-1 remove resistance-associated positions from the analysis, in order to prevent spurious clustering by convergent evolution. In our analyses, we did not remove those positions because their impact on the detection of transmission clusters has been demonstrated to be irrelevant 42 . Furthermore, a great majority of our sequences were known to derive from treatment-naïve patients (only 66 patients were known to have been treated).
Additionally, although the identification or detection of transmission clusters is commonly based only on phylogenetic cluster support, it is also often based on criteria that combine phylogenetic support, usually high bootstrap values, with genetic distance (GD) thresholds 43 . In this work, we did not consider any GD threshold value as criterion to define transmission clusters. This is because the rationale for genetic distance cut-offs is rarely provided, with very different thresholds being reported in the literature 44 , and larger GD values are expected for older transmission events. In this way, choosing distance-based thresholds can lead to underestimating transmission clusters, especially those that span long time periods 43 , as it is the case of our dataset. Nevertheless, a great majority of GDs estimated from the pairwise comparisons performed here were lower than 0.045 s/s. Thus, almost all the transmission clusters reported here would be characterized by both high phylogenetic support and low GD.
In conclusion, our results provide evidence that the HIV-1 epidemic in the CV is dominated by subtype B, especially among local MSMs. Although there was an important number of non-B cases, they occurred mostly among immigrants. This suggests that non-B infections are not well established in the local population. However, the detection of transmission clusters of non-B variants associated to a higher pathogenicity and affecting Spanish patients, urges to increase efforts on HIV testing and prevention campaigns to prevent their further expansion.

Materials and Methods
Dataset. A total of 1804 PR/RT sequences were obtained from newly HIV diagnosed people at six different hospitals and two HIV counselling and testing centres (CIPS) from the three provinces in the CV between 2004 and 2014. The sequences comprised the complete PR and the first 1005 nucleotides (335 amino acids) of the RT (1302 nt in total), and were obtained and subtyped as detailed in Patiño-Galindo et al. 25 .
Detection of local transmission clusters. Detection of transmission clusters, was performed following the same two-step procedure detailed in Patiño-Galindo et al. 25 . Briefly, in the first step, we obtained independent phylogenetic trees for each HIV-1 subtype and CRF detected, which included reference sequences retrieved from the Los Alamos HIV database spanning the analysed PR/RT region (n = 1787 subtype B, 2097 A1, 720 F1, 854 G + CRF14_BG and 2157 CRF02_AG, as well as 5720 sequences from all the HIV-1 subtypes/CRFs that were rare in the CV dataset). The initial trees were reconstructed with FastTree 2.1 in order to detect potential transmission clusters, defined as clades with SH-like support ≥0.70, and containing ≥90% sequences from the CV dataset 17 .
In the second step, these potential clusters were confirmed with ML phylogenies obtained with PhyML 3.0 45 in which the following additional reference sequences were included: (i) the 10 sequences with the highest similarity to each CV sequence from the potential clusters, as retrieved with a BLASTN search at the NCBI server (https:// blast.ncbi.nlm.nih.gov/Blast.cgi); (ii) 854 subtype B, 77 A1, 74 F1, 173 G + CRF14_BG, 101 CRF02_AG as well as 482 reference sequences from the other HIV-1 subtypes/CRFs. Only those potential clusters which contained more than 90% of sequences from the CV and grouped in the ML tree with aLRT support ≥0.98 were considered as confirmed. Clusters were then classified depending on the major risk of transmission (≥50%) for the corresponding patients from the CV.
Dated phylogenies. Dated phylogenies of large transmission clusters (those containing at least 10 patients from the CV 46 ) were obtained with BEAST v1.8.1 47 , similarly as in Patiño-Galindo et al. 25 . In those clusters with low root-to-tip divergence vs sampling date correlation (R < 0.4), a log-normal prior (median = 0.0025 per site and year, s/s/y, 95% HPD upper limit = 0.0035 s/s/y) 17 was placed on the ucld.mean parameter.

Statistical analyses.
A multinomial logistic regression analysis was performed in order to identify the main predictors of HIV-1 subtype/CRF group distribution, considering relevant epidemiologic variables: country of Scientific RepoRts | 7: 11584 | DOI:10.1038/s41598-017-10286-1 origin, sex, risk group, collection date, age and clustering status. Due to the lack of epidemiological data for many sequences from the dataset, only 906 of the 1804 sequences were included in this analysis. Seven groups of different HIV-1 subtypes and CRFs were used: A1 (n = 14), B (n = 752), F1 (n = 13), G (n = 11), 02_AG (n = 42) and 14_BG (n = 15). All HIV-1 variants with fewer than 10 patients sampled were pooled as "Other variants" (n = 59). Prior to the multinomial analysis, univariate analyses (Fisher's Exact Tests) were performed for the aforementioned variables, in order to exclude from the multinomial analysis those with non-significant p-values. Only the variable "collection date" (p-value > 0.70) was excluded. In the multinomial analysis, the most representative category of each variable was used as "baseline category" (Subtype B, Spaniard, male, MSM, age between 21 and 29 years and not clustering; Supplementary Table S4). All the statistical analyses were performed using R 48 . The mlogit R package 49 was used for the multinomial analysis.
Estimation of pairwise genetic distances. Pairwise genetic distances (GD) between individuals within transmission clusters were estimated, as substitutions per site (s/s), with the TN93 + Γ(4CAT) model, using Perl and R scripts (available upon request) dependent on the ape R package 50 .
Sequence information. GenBank accession numbers for the CV sequences used in this study are HF567872-HF567912 and MF403205-MF404967.