Using nearly full-genome HIV sequence data improves phylogeny reconstruction in a simulated epidemic

HIV molecular epidemiology studies analyse viral pol gene sequences due to their availability, but whole genome sequencing allows to use other genes. We aimed to determine what gene(s) provide(s) the best approximation to the real phylogeny by analysing a simulated epidemic (created as part of the PANGEA_HIV project) with a known transmission tree. We sub-sampled a simulated dataset of 4662 sequences into different combinations of genes (gag-pol-env, gag-pol, gag, pol, env and partial pol) and sampling depths (100%, 60%, 20% and 5%), generating 100 replicates for each case. We built maximum-likelihood trees for each combination using RAxML (GTR + Γ), and compared their topologies to the corresponding true tree’s using CompareTree. The accuracy of the trees was significantly proportional to the length of the sequences used, with the gag-pol-env datasets showing the best performance and gag and partial pol sequences showing the worst. The lowest sampling depths (20% and 5%) greatly reduced the accuracy of tree reconstruction and showed high variability among replicates, especially when using the shortest gene datasets. In conclusion, using longer sequences derived from nearly whole genomes will improve the reliability of phylogenetic reconstruction. With low sample coverage, results can be highly variable, particularly when based on short sequences.

Most studies on HIV molecular epidemiology now use the portion of the viral pol gene that contains the protease (PR) and reverse transcriptase (RT) coding regions. This is because these partial pol sequences (around 1.3 Kb long) are routinely sequenced for genotypic resistance testing [1][2][3] . Although initially the env gene was considered to present the strongest phylogenetic signal, it was argued that some env fragments were too short and/or variable for a robust analysis 4 . After pol was demonstrated to accurately reconstruct HIV transmission 5 , its analysis for phylogenetic studies became the standard owing to the very large datasets available for analysis (e.g., the UK 6 and Swiss 7 sequence databases). In the last few years, the increasing availability of HIV whole genome sequences has made possible the analysis of other genetic regions, which has raised discussion about whether full-length genome trees should be used or which viral genes provide the best trees.
A few studies have previously approached this question by analysing HIV transmission networks in which the timing and direction of transmission were known [8][9][10][11] . They have suggested that the combination of more than one gene provides the best estimation of the true tree. However, all were limited to very few patients and, in some cases, short nucleotide sequences. The lack of a known, large phylogeny prevents providing a definitive comparison that would answer this question, but simulated data provide an approximation that allows having both the true tree and a recombination-free dataset.
Such data were generated in the context of the PANGEA_HIV Methods Comparison Exercise 12 (http://www. pangea-hiv.org), for which an HIV epidemic in an African village was simulated using an agent-based model in which all sexual contacts were recorded, and those that gave rise to transmissions created a transmission tree which was recorded. Here, we used these HIV datasets to evaluate the effect of utilising viral sequence datasets of different length and from several viral genes and with different sampling depths to reconstruct the known simulated phylogenies.

Results
From the simulated HIV sequence data generated for the PANGEA_HIV project, we produced different combinations of sampling density (100%, 60%, 20% and 5%) and viral gene use (gag-pol-env, gag-pol, gag, pol, env and partial pol). Sixty per cent represents approximately the sampling coverage in the UK HIV Drug Resistance Database 13 , whereas 5% represent the range in HIV sequence coverage that is believed to be relevant for cohorts in many African countries. For example, in the region of KwaZulu-Natal, South Africa, the sampling density is estimated to be between 4% and 8%, according to the specific cohort, (Prof. Tulio de Oliveira, pers. comm.). This sub-sampling was randomly replicated 100 times and ML trees were constructed, whose topology was then compared to that of the corresponding true tree. The results of the CompareTree metric ( Fig. 1A) show that the proportion of correct tree splits increased with the length of the sequences used. The genome datasets showed the best performance considering all the sampling coverage levels together (Table 1) Thus, the proportion of correct tree splits increased in direct proportion to the length of the sequences used. A linear regression analysis showed a statistically significant positive correlation between the metric and a logarithmic transformation of the sequence length, yielding a correlation value of R 2 = 0.83 (p < 10 −16 ; see also  In the 20% sampling level there was considerable overlap in performance among the larger fragments, but that of the smaller regions was substantially poorer. With 5% sampling coverage levels, the results showed the largest confidence intervals, revealing a substantial variability among the replicates, although some of these replicates outperformed estimations from the levels with higher sampling coverage. Although quantitatively small, these differences in accuracy of tree reconstruction are important for identifying transmission clusters. We tested the impact of these differences using a standard methodology to detect transmission networks from the trees generated in this study by comparing the proportion of clusters found in the true tree ("true clusters") that were also found when analysing the ML trees. We did this using the gag-pol-env sequence and the partial pol sequences (as is the norm in the vast majority of studies) in the 100% sampled dataset, and we discovered that the use of gag-pol-env detected a significantly higher proportion of true clusters (778 out of 788 true clusters in gag-pol-env (98.73%) versus 774 out of 827 true clusters in partial pol (93.59%), chi-square test p = 1.95 × 10 −7 ). Thus, even in the fully sampled dataset, the reconstruction of trees from partial sequences implies a significant and important difference in the outcome.

Discussion
We have used simulated HIV sequence data to show how the use of genes of different lengths can affect the correct reconstruction of the true viral phylogeny. The proportion of correct trees increased in almost direct proportion to the length of the sequences used. Thus, the 7 Kb gag-pol-env nearly full-genome sequences were best at reconstructing the true tree.
The 60% sampling coverage provides the most similar results to the analyses of the complete datasets, which emphasises the superior reliability of studies based on high densely sampled epidemics. In contrast, lower sampling depths (20% and 5%, which resemble the sampling settings found in Africa and developing areas) greatly reduced the accuracy of tree reconstruction -visible in the high variability between the replicates-especially when using the short clinical pol dataset.
We presumably obtained values higher than expected in a real-world analysis, particularly because there is a complete fit between the evolutionary model used to simulate the sequence data and the model used for analysing it. In addition, the good performance of the env analyses is partly due to the fact that its characteristic insertion/ deletion variation was not simulated. Nevertheless the fact that env trees can outperform the pol trees, suggests that, in principle, the higher evolutionary rate in env can improve reconstruction.
Here we used a metric that is proportional to the RF metric -the most widely used method to estimate the distance/similarity between two phylogenetic trees. While this might be a simplistic metric, it is an intuitive and powerful method to compare trees, although its limitation is that it does not provide a means to state that one tree is significantly more similar to the true tree than a second tree is.
Our results demonstrate that the length of the sequence increases the reliability of phylogeny reconstruction in simulated data. In the simulations, different evolutionary rates applied to the gag-pol and env genes, as seen in real datasets. These were of 1.91 × 10 −3 for gag-pol (or pol) and 3.83 × 10 −3 for env, i.e. the evolutionary rate for env was twice that of gag-pol. Thus, the amount of variation that we find in env (length = 2508 nt) would be equivalent to an approximately 5 Kb-long gag-pol sequence. This could explain that, in some replicates, env outperforms pol (length = 3000 nt). However, there was no insertion/deletion variation in the simulated sequences and in analysing real datasets this apparent superiority of env over more conserved genes is constrained by errors in alignment if hypervariable regions are included.
Although we did not perform a bootstrapping analysis of the reconstructed trees, previous analyses have further demonstrated that support for groupings in the tree is increased when longer sequences are used, and clustering found in full-length datasets can be missed when using sub-genomic regions [14][15][16] . Given the difficulty in generating and/or handling full genome datasets, our results demonstrate that gag-pol provides a dependable approximation; however it should be kept in mind that, at this point and considering we analysed a simulated dataset, the good performance of gag-pol could be more attributable to these genes' combined length than to their particular characteristics.
In conclusion, thanks to the more affordable generation of full HIV genomes, as is the goal of the PANGEA_ HIV consortium 17 , the use of longer genetic regions (such as concatenated gag, pol and env or gag-pol) will allow for a more reliable reconstruction of transmission events. The traditional short pol sequences generated for resistance testing that are used in most molecular epidemiology studies are substantially less reliable, especially with low sampling depths. An effort to generate highly sampled datasets is also needed to increase our ability to reconstruct real HIV epidemics. Methods HIV epidemic simulation. The PANGEA_HIV phylodynamic Methods Comparison Exercise 12 (http:// www.pangea-hiv.org/Projects#phylodynamic) created a simulation resembling an African Village, which was based on high-and low-risk households and a small sex worker group. These simulations made use of the Discrete Spatial Phylo Simulator adapted to HIV-specific components (DSPS-HIV), which is an individual-based stochastic simulator. Using a specifiable contact network, the DSPS-HIV models HIV transmissions and records all sexual contacts. Selecting those which gave rise to transmissions produced the transmission tree. To generate the HIV sequences associated to these transmissions events, viral phylogenies that reflect between-and within-host viral evolution were simulated down the transmission tree using VirusTreeSimulator (https://github.com/ PangeaHIV/VirusTreeSimulator). In order to reconstruct ancestral subtype C sequences to be used as starting point of the simulation, a dataset of Southern African full genome subtype C sequences was downloaded from Los Alamos database (http:// www.hiv.lanl.gov/). It included 100 sequences selected to represent a balanced number of sequences per calendar year , and were sampled in South Africa (n = 46), Botswana (n = 41), Zambia (n = 8) and Malawi (n = 5). The GenBank accession numbers corresponding for these 100 sequences are provided in the Supplementary Table 1. This dataset was separated into gag, pol and env and ancestral sequences for each gene were reconstructed using BEAST v1.8.1 18 applying GTR + 4Γ + I as nucleotide substitution model and Bayesian skyride as demographic model.
These ancestral sequences were used as starting point to simulate sequences along these viral phylogenies using π BUSS 19 , with substitution rates parameterized from the aforementioned analyses of Southern African sequences. To increase realism, different substitution rates applied to different genes (with a rate twice as high for env as for gag and pol) and different codon positions (1st and 2nd vs 3rd). Finally, the simulations were parameterized to emulate prevalence and incidence estimates from the peak of the African HIV epidemic in the late 1980s-early 1990s [20][21][22] , before treatment roll-out, so the date of the root of the sequences coincides with the subtype C common ancestor in the 1940s 23 .

Analysis dataset.
We sampled all HIV simulated sequences corresponding to all infected individuals (one sequence per individual) in a 5-year period -between years 40 and 45 after the simulated epidemic started. From these simulated HIV sequences we created different combinations of sequence sampling depths and genomic regions. The full dataset contained 4662 sequences, and we adopted sub-sampling levels of 60% 20% and 5% sampling density which therefore included, respectively, 2798, 933 and 233 sequences. These sequences were chosen at random from the dataset with 100% sampling coverage. For the 60%, 20% and 5% sampling coverage levels we generated 100 independent sub-samples to test the reproducibility of the analyses.
The fully-sampled simulated sequence dataset as well as the true transmission tree are available at http://hiv. bio.ed.ac.uk/datasets/Yebra2016_Tree_Comparison_dataset.zip.
Phylogenetic tree comparison. We obtained the top-scoring maximum likelihood (ML) tree for each of these datasets using RAxML v8.2 24 under the GTR + Γ substitution model. For the nearly full genome trees, we applied a partition analysis in RAxML to accommodate for different evolutionary models in gag-pol versus env.
The Robinson-Foulds (RF) 25 metric is the most widely used measure of phylogenetic tree similarity. Given two phylogenetic trees, this metric counts the number of splits or clades induced by one of the trees but not the other. Here, we use an approximation to the RF metric implemented in the CompareTree program (http://meta. microbesonline.org/fasttree/treecmp.html), which also calculates the fraction of splits in the query tree (i.e., the reconstructed trees) that are shared with the reference one (i.e., the true trees). Unlike the RF metric, this value represents a proportion (therefore it ranges from 0 to 1), providing a metric that is more intuitive and easier to interpret and compare. We use the proportion of shared splits as an indicator of the fidelity in reconstructing the corresponding, sub-sampled true tree.
Finally, in order to evaluate the implications of the topology differences, a phylogenetic cluster comparison analysis was performed in the fully sampled dataset using the Cluster Picker and Cluster Matcher programs 26 . Statistical analyses. We compared the results from different genes and/or sampling coverage levels by using a two-sample Student's t-test. When comparing to the fully sampled datasets (100% sampling coverage), for which only point estimations were obtained because replicates cannot be produced, a one-sample t-test was performed to test whether the corresponding mean distribution was significantly different than the point estimation of the 100% sampling coverage level. Finally, we applied a linear regression analysis to explore the relationship between the results and the sequence length. All this calculations were produced in R 27 version 3.1.2.