A cross-sectional study to characterize local HIV-1 dynamics in Washington, DC using next-generation sequencing.

Washington, DC continues to experience a generalized HIV-1 epidemic. We characterized the local phylodynamics of HIV-1 in DC using next-generation sequencing (NGS) data. Viral samples from 68 participants from 2016 through 2017 were sequenced and paired with epidemiological data. Phylogenetic and network inferences, drug resistant mutations (DRMs), subtypes and HIV-1 diversity estimations were completed. Haplotypes were reconstructed to infer transmission clusters. Phylodynamic inferences based on the HIV-1 polymerase (pol) and envelope genes (env) were compared. Higher HIV-1 diversity (n.s.) was seen in men who have sex with men, heterosexual, and male participants in DC. 54.0% of the participants contained at least one DRM. The 40-49 year-olds showed the highest prevalence of DRMs (22.9%). Phylogenetic analysis of pol and env sequences grouped 31.9-33.8% of the participants into clusters. HIV-TRACE grouped 2.9-12.8% of participants when using consensus sequences and 9.0-64.2% when using haplotypes. NGS allowed us to characterize the local phylodynamics of HIV-1 in DC more broadly and accurately, given a better representation of its diversity and dynamics. Reconstructed haplotypes provided novel and deeper phylodynamic insights, which led to networks linking a higher number of participants. Our understanding of the HIV-1 epidemic was expanded with the powerful coupling of HIV-1 NGS data with epidemiological data.

HIV-1 diversity in DC. We performed in-depth phylodynamic profiling of HIV sequences from 171 PCR gene products that passed quality thresholds, including 62 PR/RT, 62 int, and 47 env amplicons. The subtyping analyses showed that all of the participants belonged to subtype B; therefore, subsequent analyses included data from all participants. Our participants were dispersed amongst and showed a star-like pattern with other DC HIV PR/RT sequences (see Supplementary Fig. S1 online). The env(c) data showed higher nucleotide diversity (π) and Watterson genetic diversity (θ) than pol(c) ( Table 2). Males had a higher diversity, though not significant, than females (haplotype diversity: PR/RT: p = 0.2039, int: p = 0.9571, env: p = 0.3404). Participants whose risk factor was IDU (n = 6) had 50% less diversity than those with MSM and HRH risk, though again not significant (haplotype diversity: PR/RT: p = 0.7323, int: p = 0.7861, env: p = 0.6560). Non-Hispanic black participants had a higher genetic diversity for the pol(c) gene than for the env(c) gene ( Table 2). The average haplotype diversity when calculated with the number of reconstructed haplotypes by PredictHaplo showed that env had more haplotype diversity compared to PRRT and int ( Table 3). The average number of haplotypes per participant was the same for PR/RT and int (2 haplotypes) and slightly higher for env (4). Four of the six participants that had a higher number of haplotypes reconstructed (7-12 haplotypes in one or more gene regions) also had a higher average haplotype diversity estimate of 0.634 (range: 0.392-0.777), while the other two had a very low average haplotype diversity estimate (0.032, range: 0.027-0.038). Participants with HRH and MSM risk factors were found to have an average of 3 reconstructed haplotypes, with haplotype diversities of 0.338 and 0.335, respectively. www.nature.com/scientificreports www.nature.com/scientificreports/ Transmission clusters. Transmission clusters were assessed by phylogenetic methods and HIV-TRACE, a genetic-distance based clustering method. Phylogenetic methods found support (>70% bootstrap or >0.95 posterior probability) for 33.8% of the sequences associated with six transmission clusters in pol(c) and for 31.9% of the sequences associated with seven clusters in env(c) (highlighted in Fig. 1). All of the clusters were comprised of two to three sequences, except one cluster in pol(c) which had twelve sequences. Ten of the twelve sequences in the large pol(c) cluster contained a DRM within RT. Furthermore, 11 of the 12 sequences in this large cluster were included on the same sequencing run; therefore, we were unable to undoubtfully discriminate between laboratory artifacts or batch effects and HIV infection to interpret this transmission cluster. However, there were other samples from that same sequencing run that did not cluster with these twelve sequences and formed a dyad cluster. The most common DRM was T215C, which does not reduce NRTI susceptibility, and was found in eight of the twelve sequences. HIV-TRACE grouped only a few sequences into two clusters for pol(c) (12.8%) and into a single cluster for env(c) (2.9%) (Fig. 2I). More transmission clusters were estimated with the haplotypes reconstructed from PredictHaplo (Fig. 2II, Table 5). For PR, RT, and int (genes also used in past DC HIV studies 28,29 ), 82.1% of our participants were incorporated into transmission clusters. Unique to our dataset was the use of envelope to predict transmission clusters; 35.8% of our participants were included in V1V2 and V3 transmission clusters. Haplotypes were not concatenated for this analysis, but little overlap of cluster composition was found between gene regions. Often one participant's haplotype would cluster with another participant in one gene region, but it would cluster with a different participant in a different gene region. Thus different gene regions displayed different transmission clusters that would not have been detected if one only analyzes a single gene. Clusters predicted in all gene regions were mainly composed of two participants (42 of the 55 total clusters). Likewise, the majority of the transmission clusters predicted by past DC HIV studies 28,29 were comprised of two or three participants, regardless of the gene region, clustering estimation method, or haplotype/consensus construction approach.

Discussion
Collectively, our study aimed to characterize the local and recent phylodynamics of a subset of DC Cohort participants from Washington, DC metro area living with HIV. By combining clinical and behavioral data with NGS data, we were able to identify transmission clusters across groups with different demographics and risk behaviors. Additionally, this study reconstructed sequence variants present within a participant (i.e., intra-host) and investigated associations between the reported participant characteristics and transmission clusters.  www.nature.com/scientificreports www.nature.com/scientificreports/ Our population genetic estimators indicate that HIV-1 env(c) is more genetically diverse than pol(c). Similar diversity estimates were found in other studies 19,[29][30][31] . Our estimate of genetic diversity for pol(c) in DC (θ = 0.079) was lower than those reported for subtype B in Pérez-Losada et al. 29 (θ = 0.084 and 0.090), but higher than those currently reported for the US subtype B sequences in Los Alamos HIV-1 database (θ = 0.075 for int; θ = 0.067 for PR/RT). This same trend was seen in env(c), where our DC HIV-1 genetic diversity (θ = 0.202) was greater than that reported for V1-V3 from Los Alamos HIV-1 database (θ = 0.168) across the US. Haplotype diversity estimated from the PredictHaplo results also found env genes more genetically diverse than PR/RT and int.

PR/RT int env
Notably, there were interesting differences in diversity estimates among risk groups. Participants who were infected through injection drug use (IDU) were found to have about half of the diversity of MSM and HRH participants. This low diversity, potentially associated with low multiplicity of HIV-1 infection, is not unusual within IDU individuals 32,33 . In both pol(c) and env(c), males also showed higher genetic diversity, which could be attributed to half (51%) of the males in this study being infected through sex with other men (16% of the males had an unknown risk factor). Our measurements of HIV-1 diversity in HRH were also high and similar to those of MSM; however, we did not observe differences in HIV genetic diversity by race or ethnicity.
The HIV-1 subtype B epidemic in DC is highly diverse, and our results here agree with previous conclusions suggesting a mature epidemic 29,34 . High genetic diversity could result from risk groups intermingling and viral strains being exchanged and the transient nature of the DC metro area population. DC is an international stop for some, a temporary residence for others, and home for many. This constant influx of incomers could have a boosting effect on the DC HIV-1 population by consistently introducing new viral strains into the pool. Treatment and vaccine development can be compromised by high genetic diversity. As HIV-1 continues to evolve and, as seen here, high genetic diversity levels are kept constant over time 34 , resistance to vaccines and ART drugs may increase, which could ultimately lead to treatment and prevention failures 35 .
A Drug Resistant Mutation (DRM) prevalence of 48.6-54.0% was detected in this subset of the DC Cohort, depending on use of consensus sequence or haplotypes. Lower DRM prevalence rates were previously reported for the DC area 28,29,34,36 (17.3-37.9% between 1994 and 2016). A much higher rate (66%) was reported in a smaller study of ART treatment-naïve and experienced pediatric patients in Rhode Island 37 . DRM rates over 50% are also seen in large sequence databases in the UK and Switzerland 29 . Our study found fewer codons affected by a DRM and fewer DRMs in our participants than previous studies 28, 29 of the DC epidemic. We only found 32 and 38 codons to be affected when analyzing consensus sequences and haplotypes, respectively, whereas Pérez-Losada et al. 29 found 83 codons affected for subtype B, however that study contained 20 times more sequences than ours. Moreover, our study found three novel DRM sites that were not identified in either previous study: P145PAST (IN Major) and A128APST, Q146QH (IN Accessory).
More recently, a study by Kuhnert and colleagues 38 reported on the fitness of fourteen HIV-1 resistance mutations, of which seven were detected in RT in our study. Three of the seven DRMs (codons: 41 L, 67 N, 184 V) were NRTI-related and the other four DRMs (codons: 103 N, 108I, 138 A, 181 C) were NNRTI-related. Six DC Cohort participants contained the 184 V DRM, which was found by Kuhnert et al. 38 to have the highest transmission cost -i.e., the success (low transmission cost) or lack thereof (high transmission cost) of transmission of hosts infected by drug resistant strains. Because of this high cost to the virus, the mutation resulted in very short transmission chains despite evolving frequently under treatment failure 38 . Of the six DC Cohort participants, all were included in a cluster when using the haplotypes, but none of the participants clustered with each other. Additionally, this mutation was found to be persistent in the DC HIV-1 viral population since 2005 34 . Previous studies showed that the most important NNRTI mutation currently is 103 N because of its connection to first-line treatment failure [39][40][41] ; a quarter of our DC Cohort participants with one or more DRMs contained this mutation, which was also found to be at a low frequency in the DC HIV-1 viral population since 2005 34 .
In treatment-naïve participants, both when using consensus sequences and haplotypes, we estimated a low prevalence rate of DRMs (14.3%). Similarly, low prevalence rates have been seen in the past in the US ( 42 ; 15% between 1999 and 2011) and even lower in treatment-naïve individuals in Europe ( 41,(43)(44)(45) ; 10% between 2001 and 2013). Moreover, we also found fewer DRMs in treatment-naïve participants compared to a recent study of PLWH in DC 28 (22.5% between 1994 and 2013). Kassaye et al. 28 also observed a downward trend of overall prevalence of DRMs over time in treatment-naïve individuals. Our results suggest a further decrease in overall prevalence of DRMs in the current DC HIV-1 epidemic. A lower prevalence of DRMs in surveillance versus targeted treatment-naïve studies could result from sampling design. A surveillance study of DC would likely provide the more accurate picture of DRM trends in the population.
Novel sites that continue to evolve in the DC epidemic and have not become fixed in the population are of serious concern for future drug therapy and conferring resistance to these drugs. More and different codons were predicted to be under positive selection in the haplotype sequences than the consensus sequences. FUBAR predicted sites in V1V2 are likely being impacted by the immune system, which the virus is actively trying to evade (diversifying selection). V3 is associated with co-receptor binding 46 ; codons predicted here could be rising advantageous mutations by HIV-1 to adapt to the host cells' response against the virus. Five codons (PR: 37, RT: 35, and int: 201, 265, 283) were identified by both our study, in both haplotypes and consensus sequences, and Pérez-Losada et al. 29 as sites under selection for subtype B. Since none of these sites corresponded to any known Stanford DRMs, these may be newly evolving resistance mutations in the DC HIV-1 epidemic. Given that all of our participants were on dual-or multiple-drug regimens, these sites may also be indicative of potential escape mechanisms by the virus in response to multiple-drugs treatments. Thus, these amino acid replacements are candidates for fitness testing with and without associated drugs to infer their ability to confer drug resistance, their relative fitness status in different environments, and their transmissibility across individuals.
Additionally, identifying transmission clusters is critical to recognizing groups who may be at risk of contracting HIV-1 or who may already be infected but are not yet aware of their diagnosis. Phylogenetic studies suggest that transmission clusters greatly contribute to the spread of HIV-1 within the population 47 ; therefore, identifying Scientific RepoRtS | (2020) 10:1989 | https://doi.org/10.1038/s41598-020-58410-y www.nature.com/scientificreports www.nature.com/scientificreports/ high-risk groups, whether that is based on risk behavior or geographical location 48 , can help public health officials to better target prevention efforts and treatment options. The spread of infection is often associated with early HIV-1 infection 47 , consequently, molecular surveillance of the DC epidemic should continue in order to identify potential areas or clusters of transmission and, thus, help lower the HIV-1 incidence.
Towards this goal, we combined NGS sequence data with clinical characteristics to obtain a dynamic picture of the evolution of HIV-1 in the DC metro area. We detected high levels of clustering using haplotypes (32.8-64.2%, not including the V1V2 region), as also seen in other HIV-1 cohorts (24-65%) 29,35,[49][50][51][52][53][54][55][56][57][58][59][60][61] . Other studies, including one completed with Sanger sequencing in DC 28,31,36,52,62 , however, have found lower levels of clustering (7-17%), in agreement with those reported here for the consensus sequence phylogenetic clusters (pol(c) and env(c)) and the V1V2 region for haplotypes (9.0%). Therefore, a more comprehensive understanding of HIV-1 transmission events in DC has been achieved when evaluating multiple genes together, rather than primarily focusing on polymerase genes that are typically screened for DRMs in clinical settings or used in investigations at the local health department level. By excluding envelope genes, informative transmission events can be missed, which could hinder community health prevention and intervention efforts. In an ideal setting, using all the genetic information available would be most favorable when investigating local HIV-1 phylodynamics.
In agreement with a recent study of HIV-1 transmission clusters in Chicago 59 , we also found association of risk factors within clusters. More HRH participants fell in our haplotype transmission networks compared to MSM, IDU, and participants with unknown risk (HRH = 23, MSM = 18, UNK = 11 each & IDU = 6). A total of 58.3% of the clusters that included an HRH participant also had an MSM participant. Likewise, a US study that included 12 major US cities 63 found transmission clusters that contained overlap between participants who were MSM and HRH. Mixing of risk types in HIV-1 subtype B transmission clusters has also been observed in Switzerland, Iceland, and Nordic European countries 60,64,65 . Contrarily, Kouyos et al. 65 found segregation based on location among individuals who were included in a transmission cluster despite having overlapping risk factors. Risk groups may be mixing due to underreporting of risk behaviors or bisexual behavior [65][66][67] . This heterogeneity of risk groups in transmission clusters suggests that focusing on individuals within city areas (e.g., wards in Washington, DC) to concentrate resources and information may help in addressing the HIV-1 epidemic.
Otherwise, we were unable to determine the mode of transmission for the "unknown modes of transmission" group (16.2% of our sample). Nonetheless, our results suggest that the mode of transmission may not be as important for prevention and intervention efforts as the location where transmission events are occurring. Likewise, Morgan et al. 59 suggested not targeting efforts towards risk groups, but rather age groups, particularly younger people, in Chicago. The average age of the DC participants included in a haplotype transmission cluster was 46.6 years of age. If DC's younger population is being the most affected, as suggested by the new cases identified by the DC DOH in 2016 and 2017 1,3 , taking a spatial dynamic approach to intervention with continued surveillance may help. Through surveillance studies, further adapted location-based prevention efforts can be employed.
Notably, our analysis has some limitations. We included only 68 participants of the approximately 10,000 people enrolled in the DC Cohort 3 , whereas past studies conducted in DC included 700 (Kassaye et al.) and 1,500 participants (Pérez-Losada et al.). However, the demographics in our sample size are similar to PLWH in DC 3,4 . As a prospective study conducted as part of an ongoing HIV-1 surveillance program associated with the DC Cohort, we capitalized on all the current cases that met our inclusion criteria (see Matierals and Methods: DC cohort). These past studies of HIV-1 diversity in Washington, DC were historical in nature and, therefore, had larger sample sizes available. Our study also applied a powerful next-generation sequencing approach instead of Sanger sequencing (previous DC studies), to characterize the current HIV-1 epidemic. With the implementation of NGS, mapping very diverse short reads to a reference genome poses alignment issues 14 , which can add difficulty to downstream analyses. We circumnavigated this alignment issue by using HAPHPIPE, where the reads Gene   www.nature.com/scientificreports www.nature.com/scientificreports/ were mapped against a tailored reference genome, thus resulting in higher alignment rates and fewer errors. Nonetheless, aligning very diverse reads still remains an issue. Although new sites under selection were identified using NGS, their clinical relevance as potential DRMs requires further validation. We also recognize that we used conservative genetic distance cutoff values for determining transmission clusters, which could result in lower www.nature.com/scientificreports www.nature.com/scientificreports/ numbers of transmission clusters 68 ; however, this conservative estimate reduced the number of false positive transmission clusters. Finally, due to the nature of predicting transmission clusters and the potential for missing individuals, we are not able to determine the direction of infection within the transmission clusters. We were also unable to rule out batch effects or laboratory artifacts accounting for any transmission cluster which participants were included in the same sequencing run.

conclusions
This study showed that NGS and epidemiological data can be used to characterize the current phylodynamics of a subset of people living with HIV, enhancing our understanding of the diversity and local dynamics of the HIV-1 epidemic in the DC area. HIV-1 diversity in DC is high and seems to remain stable over time. Furthermore, NGS of the envelope gene provided sufficient coverage to compare transmission cluster inference across HIV-1 gene regions 14,20 . Additional transmission clusters were identified when using HIV-1 intra-host haplotypes instead of consensus sequences, which led to networks linking a higher number of participants. Moreover, transmission clusters varied across genes, with each gene suggesting a different transmission story. Hence, using multiple HIV-1 genes or whole genomes is recommended to infer more reliable transmission clusters. Inferred clusters should then be linked to locations in DC to target transmission intervention efforts. Additionally, HIV-1 drug resistance was only found when using haplotypes in a single young adult in our cross-sectional sample of the cohort. Future studies should also focus on age groups and geographic regions rather than only risk factors. As the DC area maintains significant rates of HIV-1 infection, integrating present and past molecular data from previous studies conducted in DC in 2017 29 and 2013 28 will help to paint a comprehensive picture of the HIV-1 transmission and evolution of drug resistance in this high prevalence urban U.S. city. Future HIV-1 phylodynamic studies should also include more participants, particularly young adults, and newly diagnosed persons to provide a comprehensive view of DRM prevalence in treatment-naïve individuals in the DC area. Studies revealing the severity of transmitted drug resistance in the DC population may provide physicians and public health workers with additional information to design more effective treatment plans for newly diagnosed individuals and intervention strategies for targeted key populations.

ethics. Institutional Review Board (IRB071029) approval was obtained from The George Washington
University IRB (which serves as the IRB of Record for eight of the participating sites), the DC DOH IRB, and the remaining site IRBs. Informed consent was obtained and documented prior to conducting study procedures. Sample collections from participants were performed in accordance with relevant guidelines and regulations.
Dc cohort. Participants from the DC Cohort were recruited for this molecular epidemiology sub-study from January 2016 through May 2017. Eligibility criteria included current DC Cohort enrollment, ≥18 years of age, HIV-1 diagnosis within prior 12 months of enrollment or detectable HIV-1 viral load of ≥1,500 copies/mL, ability to provide written informed consent, and completion of a behavioral survey; a total of 104 participants met the eligibility criteria. Blood samples were collected at the clinical sites and transported to George Washington University for processing, targeted amplification, library preparation and NGS. Sample sequences were paired with clinical and demographic data retrieved from the database from the DC Cohort (Table 1). Clinical and demographic characteristics collected included age, race/ethnicity, sex at birth, gender, country of birth, state of residence, zip code, HIV-1 risk factor, presence of co-infections (e.g., chlamydia, gonorrhea, syphilis, trichomoniasis, and Hepatitis B and C), duration of infection, CD4 count, viral load, ART exposure, ART regimen type, date of sample, and date of HIV-1 diagnosis. The paired data were de-identified and analyzed using the approaches described below. Sequence analyses. The raw sequence data for each patient were processed through HAPHPIPE (https:// github.com/gwcbi/haphpipe), a HAplotype reconstruction and PHylodynamics PIPEline for genome-wide assembly of viral consensus sequences and haplotypes 69 . Briefly, HAPHPIPE includes modules for quality trimming, error correction, assembly, and haplotype reconstruction. We put the raw sequencing FASTQ files through quality control and quality trimming with Trimmomatic 70 . Error correction of the reads was completed with an earlier version of HAPHPIPE that used BLESS 71 , and the cleaned reads were mapped against the current HIV-1 subtype B reference sequence HXB2 (Genbank accession: K03455) 72 . Through iterative refinement, the cleaned reads were then mapped back to the reference sequence generated in the mapping step with Bowtie2 73 . This iterative refinement step was completed twice, first using only a random subsampling of the reads (25% subsampling) participants within a cluster. d MSM = men who have sex with men; HRH = heterosexuals; IDU = injection drug users; UNK = unknown. e Average age in years. f Percentage of participants within a cluster that contained one or more DRMs. g Overlap was only assessed between the concatenated consensus pol and env genes and between genes (PR, RT, int, V1V2, and V3) with transmission clusters generated with haplotypes. Reported as the percentage of unique participants within the cluster that are found in another cluster in a different gene within the same sequence type (consensus vs haplotype sequences). Overlap was not assessed between sequence types. h One participant had a risk factor of other.
www.nature.com/scientificreports www.nature.com/scientificreports/ and the fast-local mapping option to speed up the computational time, and second using all of the sequence reads and the very sensitive mapping option to further refine the individually crafted reference sequence. A consensus sequence was generated from the refined reference sequence, and a final refinement step was concluded with BLAST 74 against this refined consensus sequence. The resulting sequences were filtered to include participants that contained a passing amplicon, defined as having > 95% of the amplicon covered by 10x or greater read coverage. Amplicons that did not pass this filter were removed, and this subset was then used for subsequent phylodynamic analyses (Table 1).
Sequence data for each PCR amplicon (PR/RT, int, and env) were aligned individually using MAFFT with the L-INS-i algorithm 75 in Geneious (ver. 9.1.6) 76 . Protease (PR) and reverse transcriptase (RT) were extracted from the PR/RT amplicon, and env was further divided into the variable regions: gp120 V1V2 (HXB2 coordinates: 6615-6812) and gp120 V3 (HXB2: 6984-7349). PR, RT, and int were concatenated into the pol(c) gene region, and V1V2 and V3 were concatenated into the env(c) gene region. Concatenated gene regions will be distinguished from whole gene regions by adding "(c)" to the end of the gene name. Each gene (PR, RT, int, V1V2, and V3) was extracted from the amplicon data to fulfill different purposes: (1) remove any nucleotides belonging to other genes, for example the env PCR amplicon contained primer sequences, and therefore nucleotides belonging to the vpu gene region; (2) simulate amplicon sizes that could be covered end to end by paired-end reads to be consistent and comparable to future NGS studies with HIV-1 when using PrimerID or other local haplotype phasing techniques 8,77 ; and (3) account for differences in PCR performance between and within samples by extracting a common, high-coverage region. Therefore, missing data in this dataset were low, and often only due to amplification failure of an entire amplicon. Concatenating the genes into their respective gene regions (pol and env) retained variants in genes that are often studied, such as protease and reverse transcriptase for drug resistant mutations. It also allowed comparisons to past studies based on Sanger sequencing that used either parts of genes or whole genes. Our overall goal was to keep the integrity of the individual genes while using as much of the NGS data as possible.
Identification of subtypes and drug resistant mutations. HIV-1 subtype identification was completed for each concatenated gene region (pol(c) and env(c)) using the REGA subtyping tool (version 3) 78,79 . A total of 170 subtype reference sequences from the Los Alamos HIV-1 database (LANL; http://www.hiv.lanl.gov/) were included to assign the patient sequences to a particular subtype clade and validate the findings from REGA using phylogenetic methods described below. Drug resistant mutations were identified aligning the consensus concatenated gene nucleotide sequences with reference strains in the Stanford HIV Drug Resistance Database (https://hivdb.stanford.edu) using the HIVdb program 80 . Nucleotide positions under positive selection were identified using Fast Unconstrained Bayesian AppRoximation (FUBAR) 81 in HyPhy 82 . Recombination in our HIV-1 data was accounted for with GARD 83,84 . phylogenetic analyses. Phylogenetic estimations were completed for each concatenated gene region. The best-fit model of molecular evolution 85 was estimated for each pol(c) and env(c) from the data using jModel-Test2 86 in CIPRES Science Gateway 87 . Amino acid positions corresponding to identified DRMs described above were removed prior to phylogenetic estimations to avoid potential bias due to selection. A maximum likelihood phylogenetic estimate using RAxML 88 was made for each region with the 3 codon-position partitions 89 . The branch support for the RAxML phylogenetic trees was estimated with a bootstrap approach with 1,000 replicates 90 . Bayesian trees were inferred using MrBayes 91 . Four Markov chains (one cold and three heated) were run for 8 × 10 8 generations sampling every 2,000 steps for each gene region, and each run was repeated twice. The output was analyzed in Tracer 92 to assess convergence and mixing of the chains. Subtypes references for subtype D (GenBank accessions: K03454, AY371157, AY253311, U88824) and circulating recombinant forms CRF28,42-BF (GenBank accessions: FJ213781, FJ358521, FJ670529) and CRF10-CD (GenBank accessions: AF289548, AF289549, AF289550) were pulled from LANL and used as proper outgroups for the phylogenetic analyses 93 . Additional RT sequences from DC 29 were included to observe how our data related to other DC sequences. We visualized the epidemiological data on the resulting trees with the Interactive Tree of Life (iTOL).

Haplotype reconstruction. For the identification of transmission clusters and testing for associations
of clinical variables to transmission clusters, it is ideal to characterize within patient viral variation as individual sequence variants (haplotypes) instead of combining all of the individual reads into a single consensus sequence 94,95 . Therefore, haplotypes for each patient were predicted from the sequence data using HAPHPIPE's haplotype stages. Haplotype reconstruction was performed on each PR/RT, int, and env targeted PCR amplicons using PredictHaplo 96 . Each gene region (PR, RT, V1V2, and V3) was then extracted from the corresponding targeted amplicon and, using the methods described below, transmission clusters were estimated using the predicted haplotypes. No concatenation of the individual genes to form the regions pol(c) and env(c) was done with the haplotypes.
Identification of transmission clusters. Transmission clusters were assessed for each pol(c) and env(c), as well as for each of the gene regions with the haplotypes, using phylogenetic methods 89,91 and the genetic-distance based clustering method HIV-TRACE 97 . Phylogenetic transmission networks were defined as clades with bootstrap proportions ≥70 or posterior probabilities ≥95%. Genetic distance thresholds of 0.01 29,62 and 0.02 98 substitutions/site were used for pol and env in HIV-TRACE, respectively, to identify potential transmission events. Ambiguities were handled with the HIV-TRACE option "average" to avoid biases and false positives and minimum overlap was 1/genetic distance threshold and adjusted for size of amplicon, as recommended. Default settings were used for the remaining parameters. Transmission clusters were compared between gene regions.