# Inferring HIV-1 transmission networks and sources of epidemic spread in Africa with deep-sequence phylogenetic analysis

## Abstract

To prevent new infections with human immunodeficiency virus type 1 (HIV-1) in sub-Saharan Africa, UNAIDS recommends targeting interventions to populations that are at high risk of acquiring and passing on the virus. Yet it is often unclear who and where these ‘source’ populations are. Here we demonstrate how viral deep-sequencing can be used to reconstruct HIV-1 transmission networks and to infer the direction of transmission in these networks. We are able to deep-sequence virus from a large population-based sample of infected individuals in Rakai District, Uganda, reconstruct partial transmission networks, and infer the direction of transmission within them at an estimated error rate of 16.3% [8.8–28.3%]. With this error rate, deep-sequence phylogenetics cannot be used against individuals in legal contexts, but the error is sufficiently low for population-level inferences into the sources of epidemic spread. The technique presents new opportunities for characterizing source populations and for targeting of HIV-1 prevention interventions in Africa.

## Introduction

Large generalized epidemics of human immunodeficiency virus type 1 (HIV-1) continue to cause substantial mortality and morbidity across much of sub-Saharan Africa1. Rates of new infections have been reduced by adoption of prevention measures, especially antiretroviral therapy and medical male circumcision1,2. Despite progress, incidence levels remain well above elimination thresholds3. There remains an urgent need to better understand the drivers of transmission such as differential transmission by sex and age groups, especially among young women who account for 74% of new infections among adolescents in sub-Saharan Africa4. This may enable better targeting of prevention measures to infected people who most likely act as sources of new infection, and thus reduce transmission amongst groups most likely to sustain the epidemic. HIV-1 evolves faster than transmissions occur, so that viral sequences obtained from an individual tend to be characteristic of that individual within weeks after infection5,6. Therefore, viral genetic data have the potential to yield novel insights into the drivers of transmission by identifying who may have been a transmitter, and then by generalizing these findings to identify risk factors that can be directly targeted for prevention7,8.

Currently, phylogenetic tools to identify sources of transmission are based on Sanger sequencing, which generates a single HIV-1 consensus sequence per virus sample from an individual9,10,11,12,13. Typically one sample per individual is sequenced, and so the entire viral population from one individual is reduced into a single consensus sequence, which is insufficient to determine in which direction infections occurred14. For this reason source attribution methods have required data on dates of infection15,16,17 or modelling assumptions on the epidemic9,10,12,18,19. An advantage of source attribution methods based on additional modelling assumptions is that they may be applied with relatively small sample sizes, although it can be hard to disentangle assumptions from conclusions. For example, in ref. 12, it was assumed that young women are predominantly infected by older men in KwaZulu-Natal, South Africa, and it is unclear to what extent the same conclusion is based on data20. There is consequently a need for broadly applicable source attribution methods that are not dependent on external modelling assumptions to provide independent evidence.

Here, we demonstrate that HIV-1 transmission networks and the direction of transmission within them can be reconstructed from deep-sequence data of a large population-based sample of infected individuals with phyloscanner21, a recently developed software package for viral phylogenetic inference from deep-sequence data. The accuracy in reconstructing the direction of transmission is sufficient to infer source populations, i.e. the most likely drivers of the epidemic, without assumptions on the epidemic. This finding turns into practice the theoretical prediction by Romero-Severson et al.22 that individuals should be represented by clusters (in short: subgraphs) of viral sequences in phylogenies when many sequence reads per individual are available, and that the phylogenetic ordering of subgraphs should allow inference of the likely direction of transmission between individuals. Figure 1 illustrates this principle. Leitner and Romero-Severson23 investigated which phylogenetic orderings of subgraphs (in short: subgraph topologies) can be expected among known transmission pairs. The primary aim of this study is the opposite, to establish what epidemiologic inferences can be made from observed patterns in deep-sequence phylogenies. Our population-level analysis is based on deep-sequence data that was cross-sectionally collected from 40 communities in the Rakai region of Southern Uganda. Rakai communities are predominantly small agrarian and semi-urban trading centres as well as fishing communities alongside Lake Victoria. The area was the initial epicentre of the HIV-1 epidemic in Eastern Africa, and today remains among the highest burdened districts in Uganda with an overall adult HIV prevalence that ranges from 9–26% among inland trading and agrarian communities to 38–43% among lakeside fishing communities24,25.

We report first that it is feasible to obtain population-based samples of HIV-1 deep-sequence data that represent a large proportion of infected individuals with unsuppressed virus in a local setting in Africa. Second, we demonstrate that deep-sequence phylogenetic analysis can be scaled from pairs in whom transmission has been suspected to population-based samples of HIV-1 epidemics. We reconstruct partial transmission networks in the absence of self-reported sexual contact information and identify pairs of individuals in whom transmission and the direction of transmission is phylogenetically inferred with high statistical support, which we call source−recipient pairs. Third, we assess the strength of deep-sequence phylogenetic inferences on direct transmission between two individuals (in short: linkage) in a large population-based sample, and the direction of transmission between two individuals via potentially unsampled intermediates. Our major finding is that the direction of transmission from a source case to a recipient could be frequently estimated with high statistical support, and that accuracy levels are sufficient for inferences into the drivers of epidemic spread at the population-level.

## Results

### Large deep-sequence data set of an African HIV-1 epidemic

Between August 2011 and January 2015, 25,882 individuals aged 15–49 years were surveyed in 40 communities of the Rakai Community Cohort Study (RCCS) in Uganda (Table 1). The survey included the four largest fishing sites along Lake Victoria because of their high population-level HIV prevalence (~40%)25 and hypothesized role in epidemic spread. 5142 participants were HIV-positive. Reflecting previous guidelines on initiation of antiretroviral therapy (ART) during the observation period, 3878 (75.4%) infected study participants reported no ART use at time of survey. Self-reported ART use was previously validated as a proxy for actual ART use26, and 90% of individuals who reported using ART also had suppressed virus titres below 1000 copies per millilitre plasma blood2. This prompted us to focus on viral sequencing among individuals who did not report ART use. Deep-sequencing of the virus genomes was performed on 3758/3878 (96.9%) samples using the Gall et al. protocol27, generating thousands of short viral sequence fragments (reads) per individual. Sequencing success was comparatively modest28. We restricted our analysis to samples from 2652 individuals that satisfied minimum criteria on read length and depth for phylogeny reconstruction and subsequent inferences (see Methods and Supplementary Figure 1). Women and individuals of 35 years or more were under-represented in this data set when compared to infected participants, whereas individuals in fishing sites were over-represented. The overall sequence sampling fraction was high, 68.4% (2652/3878) among infected participants who did not report ART use (Fig. 2), and an estimated 65.6% (2652/4043) among infected participants with unsuppressed virus (see Methods). 
If we assume that individuals who were not present or did not participate at survey visits were infected with unsuppressed virus in proportion to the enrolled population, an additional 1837 individuals likely did not have suppressed viraemia, leading to an estimated sequence sampling fraction of 45.1% (2652/5880) among eligible, infected individuals with unsuppressed virus. Accounting for the previous finding that ~30% of individuals were infected by a person outside the cohort11, we thus expect that in approximately three of ten cases (0.451 × 0.7), our data contain the transmitter of a sequenced individual.
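The arithmetic behind these two figures can be written out using only the counts quoted above. The factor 0.7 is the complement of the ~30% of infections attributed to sources outside the cohort, and treating the two factors as independent is a simplifying assumption.

$$
\frac{2652}{4043 + 1837} = \frac{2652}{5880} \approx 0.451, \qquad 0.451 \times (1 - 0.3) \approx 0.316
$$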

### Scaling deep-sequence phylogenetics to large data sets

We first investigated the types of deep-sequence phylogenetic patterns that arise in known epidemiologic relationships. Our population-based sample comprised 331 concordant HIV-1-positive couples who self-identified as sexual partners. Based on previous partner analyses16,17, we expected that virus was transmitted in approximately 70% of couples, and that the remaining couples were separately infected by other individuals. Figure 1d illustrates a typical scan of deep-sequence phylogenies across the genome for three male−female pairs. In each phylogeny, subgraphs of reads from two individuals could either be ancestral to each other (pink if virus of the female was ancestral and blue if virus of the male was ancestral), siblings (purple), intermingled (yellow), or disconnected by one or more other individuals (grey, see Methods for full definitions and Supplementary Tables 1–3 for command line specifications of the phyloscanner software). In addition, the shortest patristic distance between subgraphs of reads from two individuals (in short: subgraph distance) reflected genetic similarity of their viruses (y-axis). Figure 3a summarizes these deep-sequence phylogenetic patterns across known couples. We found, first, that the distribution of subgraph distances separating partners was bimodal (Fig. 3a, showing the median distance per pair across all their phylogenies after standardizing for differences in evolutionary rates across the genome). Most couples were either phylogenetically closely related or distantly related, with intermediate distances being very rare. This suggested that transmission likely occurred among phylogenetically closely related couples, and allowed us to define distance thresholds below which transmission was likely and above which transmission could be ruled out in this population (respectively <0.025 substitutions per site and >0.05 substitutions per site, see Fig. 3a). 
Additional analysis of whole-genome consensus sequences further supported these findings and thresholds (Supplementary Note 2 and ref. 29). Second, we found that the large majority (166/178, 93.3%) of phylogenetically close couples also had ancestral subgraphs in most deep-sequence phylogenies, indicating in line with Leitner and Romero-Severson23 that ancestral subgraph topologies are strongly over-represented among true transmission pairs.

Crucially, molecular epidemiologic analyses aim to infer unknown epidemiologic relationships from observed phylogenetic patterns in a population-based sample. This is a harder analytical problem compared to characterizing phylogenetic patterns among known epidemiologic relationships as in Fig. 3a, because only a tiny proportion of all pairs of individuals in a population-based sample are transmission pairs. We calculated the same phylogenetic patterns among all 3,515,226 possible pairs in our sample of 2652 individuals (see Methods), and summarized them in Fig. 3b as for the couples. With the exception of the 331 couples, sexual contacts were not known among any other of the ~3.5 million possible pairs. We found that ancestral subgraph topologies centred among pairs who were phylogenetically close: of 814 pairs with mostly ancestral subgraphs, 694 (85.3%) had phylogenetically close virus below our threshold for likely direct transmission (0.025 substitutions per site). However, 48 (5.9%) pairs had divergent virus above our threshold for ruling out direct transmission (0.05 substitutions per site). In addition, ancestry missed 118 (14.5%, 118/(694 + 118)) phylogenetically close pairs that had intermingled or sibling subgraphs in most of their deep-sequence phylogenies. Therefore, we used all types of subgraph topologies in combination with subgraph distance for inference of transmission networks from deep-sequence data. It is possible to approximate the likelihood of deep-sequence phylogenetic patterns under mathematical models of within-host viral evolution and transmission30. However, such models do not fully reproduce empirical observations such as preferential transmission of founder viruses31, and can be computationally prohibitive at large scales. 
For these reasons we adopted a statistical approach that is based on counting phylogenetic patterns across the genome, and calculating the proportion of deep-sequence phylogenies in support of no linkage $$({\hat \mu _{ij}})$$, linkage $$({\hat \lambda _{ij}})$$, and direction of transmission given linkage $$({\hat \delta _{ij}})$$; see Fig. 4 and Methods. Starting with subgraph distance, direct transmission could be ruled out for 3,513,800/3,515,226 (99.96%) pairs, leaving only 1426 potential transmission pairs. Next, we also considered information in subgraph topologies. This left 1191 potential transmission pairs that formed 446 transmission networks in the population-based sample of 2652 individuals, i.e. groups of individuals that had predominantly phylogenetically close and topologically adjacent (ancestral, intermingled or sibling) subgraphs.
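The counting approach can be illustrated with a minimal sketch. It assumes that each genomic window contributes one label for a pair and that all windows count equally; the label strings, the function name `summarise_pair`, and the choice of denominator for the direction score (all linked windows) are illustrative assumptions rather than the estimators defined in Methods, which may treat overlapping windows differently.

```python
from collections import Counter

# Hypothetical per-window labels for one pair (i, j). The names are
# illustrative only and are not phyloscanner output fields.
LINKED_LABELS = {"i->j", "j->i", "i~j"}


def summarise_pair(window_labels):
    """Return naive proportions (mu, lam, delta_ij) across genomic windows.

    mu       : share of windows in which the pair is unlinked
    lam      : share of windows in which the pair is linked (any direction state)
    delta_ij : among linked windows, share that support i as the source
    Assumption: every window counts equally; window overlap is ignored.
    """
    n = len(window_labels)
    if n == 0:
        raise ValueError("need at least one window")
    counts = Counter(window_labels)
    n_linked = sum(counts[label] for label in LINKED_LABELS)
    mu = counts["unlinked"] / n
    lam = n_linked / n
    delta_ij = counts["i->j"] / n_linked if n_linked else float("nan")
    return mu, lam, delta_ij


# Ten made-up windows: 6 support i -> j, 2 are linked without direction,
# 1 falls in the grey zone and 1 is unlinked.
labels = ["i->j"] * 6 + ["i~j"] * 2 + ["grey"] + ["unlinked"]
mu, lam, delta_ij = summarise_pair(labels)
```

With the 0.6 thresholds used in this study, the made-up pair above would count as linked with i as the likely source, since both `lam` and `delta_ij` exceed 0.6.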

Unlike typical phylogenetic clusters11,12,32,33, these transmission networks contained information on the direction of transmission (Fig. 5). Two hundred and sixty-one networks comprised just two individuals, while 36 had more than five individuals. As expected given the uncertainty in our inferences, larger networks included cycles of possible transmission flows and recipients with more than one probable source case, implying that multiple transmission chains were consistent with our phylogenetic data. We next identified the most likely transmission chains using graph theory (see Methods). This retained 888 phylogenetic linkages in 446 most likely transmission chains, of which 351 linkages had low statistical support ($$\hat \lambda _{ij} \le 0.6$$, see Fig. 4 and Methods for choice of threshold) and 537 linkages had high statistical support $$({\hat \lambda _{ij} \hskip 1.5pt > \hskip 1.5pt 0.6})$$.
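Grouping linked pairs into networks and tallying network sizes amounts to finding connected components of the pair graph. The sketch below shows that step only, with a small union-find structure and made-up identifiers; it is an illustration under these assumptions and does not implement the selection of most likely transmission chains described in Methods.

```python
from collections import Counter


def transmission_networks(pairs):
    """Group individuals into networks, i.e. connected components of the pair graph."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())


# Hypothetical identifiers, for illustration only. F, G and H form a cycle,
# as can happen when several transmission chains fit the data.
pairs = [("A", "B"), ("B", "C"), ("D", "E"), ("F", "G"), ("G", "H"), ("H", "F")]
networks = transmission_networks(pairs)
size_tally = Counter(len(members) for members in networks)
```

If each retained chain is a tree, the number of linkages equals the number of individuals minus the number of chains, which agrees with the counts reported in this study (1334 − 446 = 888).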

### Viral deep-sequence data cannot prove HIV-1 transmission

We hypothesized that many of the 537 highly supported phylogenetic linkages were false discoveries in that transmission did not occur directly between the paired individuals. Our population-based sample did not capture all members of ongoing transmission chains, and so transmission likely occurred via unsampled intermediates in some cases. 80/537 (14.9%) of highly supported phylogenetic linkages were between two women even though HIV-1 is predominantly sexually transmitted in Africa, and extremely rarely transmitted sexually between women34. Considering that there were almost twice as many possible male−female combinations as female−female combinations, we calculate in Supplementary Note 3 that up to 35.4% of phylogenetically close male−female pairs of the population-based sample may not represent direct transmission events. Figure 4b illustrates this fundamental problem further: subgraph distances and topologies were not sufficient to clearly separate pairs of individuals from the population sample into two groups of closely related or distantly related pairs.

In prior work, Romero-Severson et al.22 proposed that direct transmission can be established with near certainty when viral sequences from two individuals are heavily intermingled in deep-sequence phylogenies. This prediction, while based on theoretical evolutionary principles and simulation, implies that deep-sequence phylogenies could be used in criminal cases of HIV-1 transmissions, and thus has important public health and human rights implications.

We revisited this hypothesis in our data, and found 34 phylogenetically close pairs with intermingled subgraphs across the majority of the genome. In two instances, the phylogenetically linked individuals were female (Fig. 6, corresponding deep-sequence phylogenies are reported in Supplementary Data 1), suggesting they were likely infected by a common unobserved male partner. Based on this, the phylogenetic linkages in transmission networks that we inferred from our deep-sequence data may indicate—but cannot prove—direct transmission. The difference between the theoretical expectations of Romero-Severson et al.22 and our observations may be explained by limited phylogenetic resolution in our reads, or may reflect greater complexity in HIV-1 evolutionary dynamics35.

These findings put into context that 81 (15.1%) of the 537 highly supported phylogenetic linkages were between two men. Given that the proportions of same-sex linkages were nearly identical among men and women, our phylogenetic transmission networks provide no evidence of extensive sub-epidemics amongst men who have sex with men in rural Rakai, although we cannot rule out that these exist, because widely stigmatized key populations may have been undersampled36.

### The direction of transmission can be frequently inferred

We further analysed the remaining 376 highly supported male−female linkages to infer the direction of transmission (i.e. who might have infected whom, potentially via unsampled intermediates). Amongst the population-based sample, we inferred the phylogenetically likely source for 293/376 (77.9%) of linked male−female pairs (Fig. 5, $$\hat \delta _{ij} \hskip 1.5pt > \hskip 1.5pt 0.6$$, see Methods for choice of thresholds). In comparison, 176/376 (46.8%) of highly supported male−female linkages were between couples, and the phylogenetically likely source could be inferred in 133/176 (75.6%) couples. Inferences of these source−recipient pairs did not depend strongly on our cut-off choices (Supplementary Table 4).

### Inferring the direction of transmission has a small error

We cross-validated our findings on the direction of transmission using HIV-1 testing history and clinical data that provided independent evidence that one direction of transmission was much more likely than the other. In 36 pairs (18 couples and 18 pairs between whom sexual contact was not known), one individual tested HIV-1 negative after the other had already tested positive, and the negative individual subsequently seroconverted. The phylogenetically inferred source ($$\hat \lambda _{ij} \hskip 1.5pt > \hskip 1.5pt 0.6$$ and $$\hat \delta _{ij} \hskip 1.5pt > \hskip 1.5pt 0.6$$) was consistent with clinical evidence in 27/36 pairs, inconsistent in 4/36 pairs, and could not be inferred reliably in 5/36 pairs (Table 2; corresponding deep-sequence phylogenies are reported in Supplementary Data 2). The false discovery rate for estimating the direction of transmission amongst pairs with epidemiologically known direction of transmission was therefore 12.9% (4/31 pairs with an inferred direction) with 95% confidence interval [5.1–28.9%].

In 35 pairs, one individual had a CD4 cell count above 800 cells per mm³ blood, indicative of being close to time of infection, while their partner was already immuno-compromised with a CD4 cell count below 400 cells per mm³ blood. The phylogenetically inferred source was consistent with clinical evidence in 19/35 pairs, inconsistent in 5/35 pairs, and could not be inferred reliably in 11/35 pairs. In two of the five inconsistent cases, CD4 data were only weakly indicative of the direction of transmission, and it is possible that the resulting error rate of 20.8% [9.2–40.5%] for these pairs with CD4 data is an overestimate (Supplementary Note 4).

Amongst all pairs, the false discovery rate was 16.3% [8.8–28.3%]. Error rates varied slightly depending on the exact configuration of parameters in the phyloscanner analyses, though not substantially (Supplementary Tables 5–6). Similar error rates were observed in phylogenetic analysis of 454 deep-sequence data over a 320 bp region of the env gene among 33 couples with known direction of transmission and confirmed linked infection in the HPTN 052 trial37. Our findings are based on deep-sequencing of a population-based sample, and thus extend previous results to population-level inferences among individuals between whom sexual contact is not necessarily known a priori.
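The reported false discovery rates can be recomputed from the counts above as inconsistent / (inconsistent + consistent), with pairs whose direction could not be inferred left out of the denominator. The sketch below pairs this with a Wilson score interval, which is an assumption: the text does not state which interval was used, although Wilson bounds land within about 0.1 percentage points of the reported ones. The function name is illustrative.

```python
from math import sqrt


def fdr_with_wilson(inconsistent, consistent, z=1.959964):
    """False discovery rate and Wilson score interval (default z gives ~95%).

    Only pairs with an inferred direction enter the denominator.
    """
    n = inconsistent + consistent
    p = inconsistent / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, centre - half_width, centre + half_width


# Counts quoted in the text.
testing_history = fdr_with_wilson(inconsistent=4, consistent=27)  # serial testing pairs
cd4_based = fdr_with_wilson(inconsistent=5, consistent=19)        # CD4 count pairs
pooled = fdr_with_wilson(inconsistent=4 + 5, consistent=27 + 19)  # all pairs

for name, (rate, low, high) in [("testing", testing_history),
                                ("CD4", cd4_based),
                                ("pooled", pooled)]:
    print(f"{name}: {100 * rate:.1f}% [{100 * low:.1f}-{100 * high:.1f}%]")
```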

## Discussion

A central application of pathogen sequencing is to identify how infectious diseases continue to spread in human populations, and how new infections can be averted most effectively38,39,40,41. Most molecular epidemiologic studies are based on analysis of Sanger sequences, and typically identify clusters of genetically related infections in an effort to characterize ongoing transmission sources11,32,33,42. These approaches fail to distinguish sources from recipients of transmission within such clusters, making epidemiological inferences relevant to public health intervention challenging7. In contrast, deep-sequence phylogenetic analyses are based on thousands of reads per individual, and thereby provide more information into the epidemiologic relationship of individuals beyond distance measures, through the topological ordering between subgraphs of viral reads from individuals. Prior work assessed the potential of deep-sequence phylogenetic analyses on simulations and on known transmission pairs for whom at least five viral sequences were available per individual22,23,43. Here, we demonstrate that large population-based samples of standard deep-sequence output can be used to infer directed transmission networks of generalized HIV-1 epidemics in sub-Saharan Africa with phyloscanner21. Combining the patristic distance between viral subgraphs and their topological ordering in deep-sequence phylogenies, our analysis uncovered 446 partially sampled HIV-1 transmission networks in Rakai comprising 1334 individuals.

We were not able to rule out the possibility that sources were indirectly linked to recipients through unobserved individuals (i.e. intermediate partners) with deep-sequence phylogenetic analysis. Almost one third (161/537) of phylogenetically highly supported linkages were between individuals of the same gender, in line with incomplete sequence coverage. We also found two pairs with phylogenetic patterns previously considered strong enough to virtually exclude the possibility of common sources or recipients, but in whom both individuals were female. These findings have important implications for criminal prosecution of people living with HIV in at least 72 countries with laws penalizing HIV transmission14,44: even with deep-sequencing, transmission of HIV-1 cannot be proven between two individuals. Thus, communicating the limitations of deep-sequencing data is essential to prevent its misuse in criminal prosecutions. For example, we opted to visually interrupt linkages in phylogenetic transmission networks (Fig. 5), in order to highlight the possibility of unsampled cases along inferred source−recipient relationships.

We found that when many reads from different individuals are analysed together, they tend to form subgraphs with consistent ordering in deep-sequence phylogenies from across the genome. This observation enabled us to infer the source of transmission in 77.9% of 376 phylogenetically linked male−female pairs. The accuracy of our viral phylogenetic inferences regarding directionality was validated on 71 male−female pairs with clinical data that suggested transmission in one direction, with an overall false discovery rate of 16.3% [8.8–28.3%], and was thus not substantially different in a population-based sample compared to analysis of couples with known direction of transmission37. At this error rate, phyloscanner and similar approaches21,37,43 allow inferences into population-level transmission networks and the epidemiologic sources of ongoing viral spread from sequence data alone.

Our study has several weaknesses. First, sequence sampling of the infected population in RCCS communities remained incomplete. Phylogenetic inferences are expected to improve with higher sampling fraction45, though in practice, complete sequence sampling is hard to achieve. This study enrolled participants before immediate provision of ART was recommended in national guidelines, so that a relatively large proportion of infected individuals did not report ART use at first study visit, and could be sequenced. To perform similar phylogenetic analyses of ongoing viral spread in sub-Saharan Africa in the future, it is thus important to collect and store samples prior to ART initiation, and to investigate alternative sequencing protocols46. Second, relatively modest deep-sequencing quality compromised the length of deep-sequence reads28. Analyses were based on relatively short read alignments of 250 bp that primarily covered the gag gene, rather than the whole genome (Supplementary Figure 1). It is thus plausible that deep-sequence phylogenetic analyses may be more accurate than reported in this study as deep-sequence output with longer reads and greater coverage is becoming available47. Third, we found that inferring the direction of transmission became more challenging as the virus was increasingly closely related within individuals. We thus predict that the direction of transmission may be less frequently inferable in situations when the virus spreads more rapidly between persons, as in high-risk sexual networks among men having sex with men9,15, or among injecting drug users48. For the same reason, sources of infections may be less accurately and/or less frequently inferable for pathogens that generate within-host viral diversity at a slower pace than HIV-1 39,49,50.

Whole-genome deep-sequencing is now the tool of choice in clinical practice and epidemiologic investigation for a broad range of bacterial infectious disease pathogens, and increasingly used for viral pathogens, and especially HIV-1 8,38,39,49,50. Here we establish that HIV-1 phylogenetic analyses can be scaled to large population-based samples of deep-sequence data, and that the direction of transmission can be frequently inferred in reconstructed HIV-1 transmission networks. At present, more than 15,000 individuals have been deep-sequenced and linked to demographic records across sub-Saharan Africa in order to understand who is at the core and driving new infections where the burden of HIV-1 is highest, how the epidemic regenerates from older to younger generations, and how spread can be most effectively interrupted in generalized epidemics7,8. The phyloscanner method is applicable to these data, and we hypothesize that this innovation will help identify the key drivers of HIV-1 transmission in regions that are hardest hit by the virus, and in turn facilitate tailoring of interventions to achieve epidemic control.

## Methods

### Sample selection

Data for this study come from the Rakai Community Cohort Study (RCCS), a population-based study of HIV-1 incidence in Rakai District, Uganda. Procedures for the RCCS have been described in detail elsewhere2. Briefly, the RCCS conducts a census in all communities to identify eligible individuals 2 weeks before the survey. Eligible individuals include those able to give consent and between the ages of 15 and 49 years. Eligible individuals who provide written informed consent are administered a survey on their demographics, sexual behaviours and health-care seeking practices. Individuals are also asked to name their cohabitating sexual partners in order to identify couples, and to provide a serum sample for HIV-1 testing and future laboratory studies, including HIV-1 viral sequencing. Data for this particular study were collected between 2011 and 2015 from 40 agrarian, trading and fishing communities.

### Ethics

The study was independently reviewed and approved by the Ugandan Virus Research Institute, Scientific Research and Ethics Committee, Protocol GC/127/13/01/16; the Ugandan National Council of Science and Technology; and the Western Institutional Review Board, Protocol 200313317. All study participants provided written informed consent at baseline and follow-up visits using institutional review board-approved forms.

### Sampling fraction

To estimate the number of infected participants with unsuppressed virus, we first calculated the expected number of infected participants who did not use antiretrovirals at time of survey, and thus had unsuppressed virus. Participant-reported ART use was previously validated as a proxy of actual ART use with a specificity of 99%26, giving 3878/0.99 individuals. To this, we added the expected number of participants who reported ART use but did not have suppressed virus. Ten per cent of participants reporting ART use had plasma viral loads above 1000 copies/ml plasma blood2, giving 1264 × 0.1 individuals, and 4043 in total. The sampling fraction was therefore estimated at 2652/4043 (65.6%) among infected participants with unsuppressed virus.
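The calculation can be retraced in a few lines. This is a sketch using only the counts quoted in the text (5142 HIV-positive participants, of whom 3878 reported no ART use, and 2652 sequenced individuals); the variable names are illustrative.

```python
hiv_positive = 5142            # HIV-positive survey participants
no_art_reported = 3878         # of these, reported no ART use
specificity = 0.99             # specificity of self-reported ART use as a proxy
unsuppressed_among_art = 0.10  # share of ART reporters with viral load > 1000 copies/ml
sequenced = 2652               # individuals retained for phylogenetic analysis

art_reported = hiv_positive - no_art_reported  # 1264 participants

# Expected number not on ART, corrected for the specificity of self-report.
expected_not_on_art = no_art_reported / specificity

# Expected number who reported ART use but were not virally suppressed.
expected_unsuppressed_on_art = art_reported * unsuppressed_among_art

expected_unsuppressed = expected_not_on_art + expected_unsuppressed_on_art
sampling_fraction = sequenced / expected_unsuppressed

print(int(expected_unsuppressed), f"{100 * sampling_fraction:.1f}%")
```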

### HIV-1 deep-sequencing

Serum samples from HIV-1 seropositive persons who did not self-report ART use over the analysis period were shipped to University College London Hospital, London, United Kingdom for viral RNA extraction. RNA extraction was automated on QIAsymphony SP workstations with the QIAsymphony DSP Virus/Pathogen Kit (Cat. No. 937036, 937055; Qiagen, Hilden, Germany), followed by one-step reverse transcription polymerase chain reaction (RT-PCR)27. Deep-sequencing was performed on Illumina MiSeq and HiSeq instruments in the DNA pipelines core facility at the Wellcome Trust Sanger Institute, Hinxton, United Kingdom.

Deep-sequencing reads were assembled with the shiver sequence assembly software51. Where no contigs could be generated with IVA52, contigs were generated with SPAdes and metaSPAdes v3.10 53,54, after excluding reads classified as Homo sapiens by Kraken v0.10.5-beta55. Contigs with at least 300 bp matching known HIV-1 diversity were used for shiver analysis.

Phyloscanner version 1.1.2 21 was used to merge paired-end reads, and only merged reads of at least 250 bp in length were retained for phylogeny reconstruction. Subsequent deep-sequence inferences were performed on individuals whose reads covered the HIV-1 genome at a depth of at least 30 reads for 750 bp or more. Individuals who did not have sequencing output meeting these criteria were excluded.
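The two inclusion criteria can be expressed as simple filters. The sketch below assumes that per-position read depth is available as a list and that the 750 bp need not be contiguous, which the text does not specify; the function names are hypothetical and are not part of phyloscanner.

```python
MIN_READ_LENGTH = 250  # bp, merged paired-end reads
MIN_DEPTH = 30         # reads covering a genome position
MIN_COVERED_BP = 750   # positions that must reach MIN_DEPTH


def retained_read_lengths(merged_read_lengths):
    """Keep only merged reads that are long enough for phylogeny reconstruction."""
    return [length for length in merged_read_lengths if length >= MIN_READ_LENGTH]


def meets_depth_criterion(depth_per_position):
    """True if at least MIN_COVERED_BP positions have depth >= MIN_DEPTH.

    Assumption: qualifying positions are counted across the genome and
    do not have to form one contiguous stretch.
    """
    covered = sum(1 for depth in depth_per_position if depth >= MIN_DEPTH)
    return covered >= MIN_COVERED_BP


# Made-up depth profiles for two individuals.
included = meets_depth_criterion([35] * 800 + [5] * 200)
excluded = meets_depth_criterion([35] * 700 + [0] * 300)
kept = retained_read_lengths([180, 250, 249, 310])
```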

### Deep-sequence phylogenetic analysis

It proved computationally intractable to reconstruct viral trees from all deep-sequence reads of all individuals simultaneously. To address this challenge, samples were divided into batches of 50−75 individuals, and phyloscanner was run on all possible pairs of batches to assess deep-sequence phylogenetic relationships in all pairs of individuals in the population-based sample. The phyloscanner command line specification for this first analysis stage is given in Supplementary Tables 1 and 2. Shell scripts were used to handle calculations in parallel, and are available upon request. From stage 1 output, we identified potentially phylogenetically close pairs and, from those, networks of pairs that were connected through at least one common, phylogenetically close individual. Networks were extended to include spouses of partners in networks, couples in no network, and the ten most closely related individuals from stage 1 as controls. For computational considerations, reads of individuals that differed at one nucleotide position were merged. In a second analysis stage, phyloscanner was used to confirm potential transmission pairs by considering also the topological configuration of subgraphs in deep-sequence phylogenies, and to resolve the ordering of transmission events within transmission networks. The phyloscanner command line specification for stage 2 is given in Supplementary Table 3. In this stage, reads of individuals that differed at one nucleotide position were not merged.
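The stage 1 batching scheme can be sketched as follows. The batch size of 52 is an assumption chosen because it lies in the stated 50−75 range and divides 2652 evenly; the identifiers and function names are hypothetical, and each run is assumed to pool the individuals of two distinct batches.

```python
from itertools import combinations


def make_batches(ids, batch_size):
    """Split the list of individuals into consecutive batches."""
    return [ids[k:k + batch_size] for k in range(0, len(ids), batch_size)]


def batch_pair_runs(batches):
    """One run per unordered pair of distinct batches; a run pools both batches."""
    return list(combinations(range(len(batches)), 2))


def pairs_covered(batches, runs):
    """Unordered pairs of individuals that co-occur in at least one run."""
    seen = set()
    for a, b in runs:
        pooled = batches[a] + batches[b]
        seen.update(frozenset(pair) for pair in combinations(pooled, 2))
    return seen


ids = [f"ind{k:04d}" for k in range(2652)]
batches = make_batches(ids, 52)
runs = batch_pair_runs(batches)

# Small check that pooling pairs of batches covers every pair of individuals.
small_batches = make_batches(list(range(10)), 3)
small_cover = pairs_covered(small_batches, batch_pair_runs(small_batches))
```

Under these assumptions, 51 batches give 1275 runs of 104 individuals each, and pairs within a batch are covered by any run that includes that batch.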

### Phylogenetic relationships of virus from two individuals

Viral phylogenetic analysis with phyloscanner is based on subgraphs, which are sets of tips and internal nodes of a phylogeny that are attributed to one individual with a parsimony-based algorithm21. A single individual can have multiple subgraphs in one tree. The following statistics were calculated to characterize the phylogenetic relationship between two individuals i and j in one phylogeny:

• Subgraph distance between i and j (Δij): The distance between any two subgraphs u, v is the shortest patristic distance between any nodes or tips of u and v, and Δij is the minimum of these distances over all subgraphs u from i and v from j. Deep-sequence phylogenies from different parts of the genome had markedly different branch lengths, reflecting evolutionary rate variation across the genome. Prior to calculating subgraph distances, we standardized phylogenies by multiplying branch lengths with the ratio of expected branch lengths in the genomic window from which the tree was reconstructed, divided by the expected branch lengths in the gag and polymerase genes (Supplementary Table 2).

• Adjacency of i and j (Aij): True if the shortest path between at least one subgraph u from i and v from j is not attributed to any sampled individual other than i and j, and false otherwise.

• Paths from i to j (Pij): number of subgraphs from j which have as ancestor a subgraph from i.
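As an illustration of the subgraph distance defined above, the following sketch takes the minimum patristic distance over all pairs of nodes from the subgraphs of the two individuals. It assumes that subgraphs are given as sets of node labels and that a patristic distance function on standardized branch lengths is supplied by the caller; both are hypothetical simplifications and not part of phyloscanner's interface.

```python
def subgraph_distance(subgraphs_i, subgraphs_j, patristic):
    """Minimum patristic distance between any subgraph of i and any of j.

    subgraphs_i, subgraphs_j: lists of sets of node labels.
    patristic: callable (node_x, node_y) -> distance in substitutions/site.
    """
    return min(
        patristic(x, y)
        for u in subgraphs_i
        for v in subgraphs_j
        for x in u
        for y in v
    )


# Hypothetical distances between the nodes of two individuals.
distances = {("i1", "j1"): 0.04, ("i1", "j2"): 0.02,
             ("i2", "j1"): 0.03, ("i2", "j2"): 0.01}
delta_ij = subgraph_distance([{"i1"}, {"i2"}], [{"j1", "j2"}],
                             lambda x, y: distances[(x, y)])
```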

Analyses were then based on the following phylogenetic relationship types between two individuals i and j in a viral tree:

• Phylogenetically unlinked (Uij): Aij = 0 or Δij > 0.05 substitutions per site.

• Phylogenetic linkage grey zone (Gij): Aij = 1 and Δij ∈ [0.025, 0.05] substitutions per site.

• Phylogenetically linked and i source (i → j): Aij = 1 and Pij ≥ 1 and Pji = 0 and Δij < 0.025 substitutions per site.

• Phylogenetically linked and j source (j → i): Aij = 1 and Pji ≥ 1 and Pij = 0 and Δij < 0.025 substitutions per site.

• Phylogenetically linked with no evidence for direction of transmission (i ~ j): Aij = 1 and Pji ≥ 1 and Pij ≥ 1 and Δij < 0.025 substitutions per site (intermingled), or Aij = 1 and Pji = 0 and Pij = 0 and Δij < 0.025 substitutions per site (sibling).
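The five relationship types above amount to a small decision rule on Aij, Δij, Pij and Pji. A minimal sketch follows, with a hypothetical function name and string labels; the thresholds are those stated in the list.

```python
def classify_relationship(adjacent, distance, paths_ij, paths_ji,
                          linked_below=0.025, unlinked_above=0.05):
    """Classify the relationship of individuals i and j in one phylogeny.

    adjacent: A_ij (bool); distance: standardized subgraph distance;
    paths_ij, paths_ji: P_ij and P_ji.
    """
    if not adjacent or distance > unlinked_above:
        return "unlinked"
    if distance >= linked_below:
        return "grey zone"
    if paths_ij >= 1 and paths_ji == 0:
        return "i->j"
    if paths_ji >= 1 and paths_ij == 0:
        return "j->i"
    # Intermingled (both path counts >= 1) or sibling (both zero).
    return "i~j"
```

For example, an adjacent pair at distance 0.01 with Pij = 2 and Pji = 0 is classified as linked with i as source.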

### Evidence for transmission and direction of transmission

To capture uncertainty in inferences, relationship types between reads from two individuals were evaluated on a large number of deep-sequence phylogenies that corresponded to sliding and overlapping read alignments (as shown in Fig. 1d). For each pair of individuals, the number of deep-sequence phylogenies in which i and j had each of the above five relationship types was counted (as shown in Fig. 4). The raw counts were adjusted for overlap in read alignments from which the deep-sequence phylogenies were constructed as described in Supplementary Note 1, and are denoted by kU (unlinked), kG (grey zone), $$k_{i \to j}$$ (i source), $$k_{j \to i}$$ (j source), $$k_{i\sim j}$$ (no evidence for direction). After adjusting for overlap, the counts were interpreted as phylogenetically independent observations, leading to Binomial probability models for each count. Evidence for direct transmission (λij) was based on the count kL = $$k_{i \to j} + k_{j \to i} + k_{i\sim j}$$ ≥ 0, and the binomial likelihood

$$p\left( {k_{\mathrm L},n{\mathrm{|}}\lambda _{ij}} \right) = \frac{{{\mathrm{\Gamma }}(n + 1)}}{{{\mathrm{\Gamma }}(k_{\mathrm L} + 1){\mathrm{\Gamma }}(n - k_{\mathrm L} + 1)}}\lambda _{ij}^{k_{\mathrm L}}(1 - \lambda _{ij})^{n - k_{\mathrm L}},$$
(1)

where n = $$k_{i \to j} + k_{j \to i} + k_{i\sim j}$$ + kG + kU > 0 and Γ is the Gamma function, with maximum likelihood estimate $$\hat \lambda _{ij} = k_{\mathrm L}/n$$. Evidence for ruling out direct transmission (μij) was based on kU and total n as above. Evidence for the direction of transmission given linkage (δij) was based on $$k_{i \to j}$$ and total $$k_{i \to j} + k_{j \to i}$$. Posterior density estimates of λij, μij and δij are available analytically when a Beta prior density on these parameters is chosen. We here chose a flat Beta prior density with both shape parameters set to 1, so that e.g. the posterior density for direct transmission is

$$p\left( {\lambda _{ij}{\mathrm{|}}k_{\mathrm L},n} \right) = \frac{{{\mathrm{\Gamma }}(n + 2)}}{{{\mathrm{\Gamma }}(k_{\mathrm L} + 1){\mathrm{\Gamma }}(n - k_{\mathrm L} + 1)}}\lambda _{ij}^{k_{\mathrm L}}(1 - \lambda _{ij})^{n - k_{\mathrm L}}.$$
(2)

The confidence intervals shown in Supplementary Notes 2 and 4 are 95% highest density intervals of Eq. (2). In principle, the parameters of the Beta prior could be chosen to reflect additional data such as seroconversion histories; however, care should be taken when specifying informative priors based on variables such as age differences or age-specific disease prevalence20, in order to avoid circular inferences on who may have infected whom.
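To make the bookkeeping concrete, the following sketch computes the maximum likelihood estimates and the Beta posterior parameters for λij, μij and δij from overlap-adjusted counts. It assumes the flat Beta(1, 1) prior described above, under which a count k out of a total t gives a Beta(k + 1, t − k + 1) posterior; the function and key names are hypothetical, and highest density intervals are not computed here.

```python
def evidence_summaries(k_ij, k_ji, k_undirected, k_grey, k_unlinked):
    """Summaries for linkage (lambda), exclusion (mu) and direction (delta)."""
    k_linked = k_ij + k_ji + k_undirected
    n = k_linked + k_grey + k_unlinked
    if n <= 0:
        raise ValueError("at least one phylogeny is required")

    def beta_posterior(k, total):
        # Flat Beta(1, 1) prior and binomial likelihood give
        # a Beta(k + 1, total - k + 1) posterior.
        return {
            "mle": k / total if total > 0 else None,
            "alpha": k + 1,
            "beta": total - k + 1,
            "posterior_mean": (k + 1) / (total + 2),
        }

    return {
        "lambda": beta_posterior(k_linked, n),
        "mu": beta_posterior(k_unlinked, n),
        "delta": beta_posterior(k_ij, k_ij + k_ji),
    }


# Hypothetical adjusted counts for one pair of individuals.
summary = evidence_summaries(k_ij=20, k_ji=2, k_undirected=4,
                             k_grey=3, k_unlinked=6)
```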

### Most likely transmission chains

Pairs of individuals between whom transmission was not excluded (transmission was excluded when $$\hat \mu _{ij} > 0.6$$) defined a set of connected graphs, which we call (partially observed) transmission networks. For each network, we defined its adjacency matrix with entries $$\hat \tau _{ij} = k_{i \to j} + k_{i\sim j}/2$$ for i ≠ j and $$\hat \tau _{ij} = 0$$ otherwise. Every spanning tree c of a network defines a possible transmission chain, and was associated with a transmission flow score over its directed edges, $$\hat \tau _c = \mathop {\prod }\nolimits_{ij \in c} \hat \tau _{ij}$$. The most likely transmission chain, defined by $$\hat c^{{\mathrm {ML}}} = {\mathrm{argmax}}_c\,\hat \tau _c$$, was calculated with Edmonds’s algorithm as implemented in the RBGL R package, version 1.55.1 56.
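The optimization can be illustrated on a small network. The sketch below is not Edmonds's algorithm and does not use RBGL: it is a brute-force enumeration of directed spanning trees, which is only feasible for a handful of individuals, and it is shown solely to make the objective (the product of the edge scores) explicit. Function names and the example counts are hypothetical.

```python
from itertools import product
from math import prod


def _reaches_root(node, parent_of, root):
    """Follow parent links from node; False if a cycle is met first."""
    seen = set()
    while node != root:
        if node in seen:
            return False
        seen.add(node)
        node = parent_of[node]
    return True


def most_likely_chain(nodes, tau):
    """Directed spanning tree maximizing the product of edge scores.

    tau: dict mapping a directed edge (i, j) to its score tau_ij.
    Returns (score, sorted list of directed edges).
    """
    best_score, best_edges = 0.0, []
    for root in nodes:
        others = [v for v in nodes if v != root]
        # Every non-root individual gets exactly one source.
        options = [[u for u in nodes if u != v and tau.get((u, v), 0) > 0]
                   for v in others]
        for parents in product(*options):
            parent_of = dict(zip(others, parents))
            if not all(_reaches_root(v, parent_of, root) for v in others):
                continue
            score = prod(tau[(parent_of[v], v)] for v in others)
            if score > best_score:
                best_score = score
                best_edges = sorted((parent_of[v], v) for v in others)
    return best_score, best_edges


# Hypothetical scores tau_ij = k_{i->j} + k_{i~j}/2 for three individuals.
tau = {("A", "B"): 22.0, ("B", "A"): 4.0,
       ("B", "C"): 15.0, ("C", "B"): 5.0,
       ("A", "C"): 3.0, ("C", "A"): 1.0}
score, chain = most_likely_chain(["A", "B", "C"], tau)
```

In practice, a maximum spanning arborescence on log-transformed scores gives the same chain without enumeration.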

### Classification of linked pairs and sources

Pairs in most likely transmission chains were classified as (epidemiologically) linked when $$\hat \lambda _{ij} = k_{\mathrm L}/n > c$$, where n is as above and c = 0.6, and otherwise as potentially linked. The threshold c was determined as follows. Under model (1), kL ~ Binomial(n, λij), where λij indicates the strength of phylogenetic evidence for linkage. The threshold c was motivated by the condition that the posterior probability of λij > 50% should be larger than α = 80% or alternatively α = 95%, i.e.

$$p\left( {\lambda _{ij} > 0.5{\mathrm{|}}k_{\mathrm L},n} \right) > \alpha .$$
(3)

We simplified this criterion by choosing c ∈ (0, 1) such that Eq. (3) holds for all kL > nc for a typical whole-genome analysis. For the Rakai analysis, read alignments had a length of 250 bp, resulting in n = 35 non-overlapping alignments and deep-sequence phylogenies, and so with Eq. (2), we obtain c = 0.57 for α = 80% and c = 0.64 for α = 95%. The thresholds were similar for analyses based on read alignments of length 350 bp, resulting in n = 25 deep-sequence phylogenies, and c = 0.59 for α = 80% and c = 0.67 for α = 95%. This suggested choosing as default values c = 0.6 for α = 80% and c = 0.66 for α = 95%, with the present analysis based on c = 0.6 for all linkage and direction classifications.
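The criterion in Eq. (3) can be explored numerically with the sketch below. It assumes integer counts and the flat prior, in which case the tail probability of the Beta(kL + 1, n − kL + 1) posterior equals a binomial cumulative probability, so only the standard library is needed. Because the overlap-adjusted counts need not be integers and the exact procedure used to derive c is not given in the text, the values obtained this way may differ from the reported thresholds. Function names are hypothetical.

```python
from math import comb


def posterior_prob_above(k, n, x=0.5):
    """P(lambda > x | k, n) under a flat Beta(1, 1) prior.

    For integer k, the upper tail of Beta(k + 1, n - k + 1) at x equals
    P(Binomial(n + 1, x) <= k).
    """
    return sum(comb(n + 1, j) * x ** j * (1 - x) ** (n + 1 - j)
               for j in range(k + 1))


def smallest_count_meeting(n, alpha, x=0.5):
    """Smallest k with P(lambda > x | k, n) > alpha, or None if none exists.

    The probability increases with k, so the criterion then also holds
    for every larger count.
    """
    for k in range(n + 1):
        if posterior_prob_above(k, n, x) > alpha:
            return k
    return None


k_min = smallest_count_meeting(n=35, alpha=0.80)
```

A threshold c can then be read off as any value with nc just below the smallest qualifying count.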

### Reporting Summary

Further information on experimental design is available in the Nature Research Reporting Summary linked to this article.

## Data availability

The deep-sequence phylogenies and basic individual-level data analysed during the current study are available in the Dryad repository, https://doi.org/10.5061/dryad.7h46hg2. HIV-1 reads are available on reasonable request through the PANGEA consortium (www.pangea-hiv.org) or the corresponding author. Please contact project manager Lucie Abeler-Dörner (lucie.abeler-dorner@bdi.ox.ac.uk) for further details. Additional individual-level data are available on reasonable request to RHSP or the corresponding author.

## Code availability

Code is available from https://github.com/BDI-pathogens/phyloscanner (version 1.1.2) and https://github.com/olli0601/Phyloscanner.R.utilities (version 0.7) under the GNU General Public License v3.0.

## References

1. UNAIDS. UNAIDS Data 2017, Document JC2910E. http://www.unaids.org/en/resources/documents/2017/2017_data_book (2017).
2. Grabowski, M. K. et al. HIV prevention efforts and incidence of HIV in Uganda. N. Engl. J. Med. 377, 2154–2166 (2017).
3. UNAIDS. Fast-track: ending the AIDS epidemic by 2030, Document JC2686. http://www.unaids.org/en/resources/documents/2014/JC2686_WAD2014report (2014).
4. UNAIDS. Empower young women and adolescent girls: fast-track the end of the AIDS epidemic in Africa, Document JC2746. http://www.unaids.org/en/resources/documents/2015/JC2746 (2015).
5. Salazar-Gonzalez, J. F. et al. Deciphering human immunodeficiency virus type 1 transmission and early envelope diversification by single-genome amplification and sequencing. J. Virol. 82, 3952–3970 (2008).
6. Maldarelli, F. et al. HIV populations are large and accumulate high genetic diversity in a nonlinear fashion. J. Virol. 87, 10313–10323 (2013).
7. Dennis, A. M. et al. Phylogenetic studies of transmission dynamics in generalized HIV epidemics: an essential tool where the burden is greatest? J. Acquir. Immune Defic. Syndr. 67, 181–195 (2014).
8. Pillay, D. et al. PANGEA-HIV: phylogenetics for generalised epidemics in Africa. Lancet Infect. Dis. 15, 259–261 (2015).
9. Volz, E. et al. HIV-1 transmission during early infection in men who have sex with men: a phylodynamic analysis. PLoS Med. 10, e1001568 (2013).
10. Stadler, T., Kuhnert, D., Bonhoeffer, S. & Drummond, A. J. Birth-death skyline plot reveals temporal changes of epidemic spread in HIV and hepatitis C virus (HCV). Proc. Natl. Acad. Sci. USA 110, 228–233 (2013).
11. Grabowski, M. K. et al. The role of viral introductions in sustaining community-based HIV epidemics in rural Uganda: evidence from spatial clustering, phylogenetics, and egocentric transmission models. PLoS Med. 11, e1001610 (2014).
12. de Oliveira, T. et al. Transmission networks and risk of HIV infection in KwaZulu-Natal, South Africa: a community-wide phylogenetic study. Lancet HIV 4, e41–e50 (2017).
13. Le Vu, S. et al. Comparison of cluster-based and source-attribution methods for estimating transmission risk using large HIV sequence databases. Epidemics 23, 1–10 (2018).
14. Barre-Sinoussi, F. et al. Expert consensus statement on the science of HIV in the context of criminal law. J. Int. AIDS Soc. 21, e25161 (2018).
15. Ratmann, O. et al. Sources of HIV infection among men having sex with men and implications for prevention. Sci. Tr. Med 8, 320ra2 (2016).
16. Eshleman, S. H. et al. Analysis of genetic linkage of HIV from couples enrolled in the HIV Prevention Trials Network 052 trial. J. Infect. Dis. 204, 1918–1926 (2011).
17. Campbell, M. S. et al. Viral linkage in HIV-1 seroconverters and their partners in an HIV-1 prevention clinical trial. PLoS ONE 6, e16986 (2011).
18. Volz, E. M. et al. Molecular epidemiology of HIV-1 subtype B reveals heterogeneous transmission risk: implications for intervention and control. J. Infect. Dis. 217, 1522–1529 (2018).
19. Didelot, X., Fraser, C., Gardy, J. & Colijn, C. Genomic infectious disease epidemiology in partially sampled and ongoing outbreaks. Mol. Biol. Evol. 34, 997–1007 (2017).
20. Grabowski, M. K. & Lessler, J. Phylogenetic insights into age-disparate partnerships and HIV. Lancet HIV 4, e8–e9 (2017).
21. Wymant, C. et al. PHYLOSCANNER: inferring transmission from within- and between-host pathogen genetic diversity. Mol. Biol. Evol. 35, 719–733 (2017).
22. Romero-Severson, E. O., Bulla, I. & Leitner, T. Phylogenetically resolving epidemiologic linkage. Proc. Natl. Acad. Sci. USA 113, 2690–2695 (2016).
23. Leitner, T. & Romero-Severson, E. Phylogenetic patterns recover known HIV epidemiological relationships and reveal common transmission of multiple variants. Nat. Microbiol. 3, 983–988 (2018).
24. Serwadda, D. et al. Slim disease: a new disease in Uganda and its association with HTLV-III infection. Lancet 2, 849–852 (1985).
25. Chang, L. W. et al. Heterogeneity of the HIV epidemic in agrarian, trading, and fishing communities in Rakai, Uganda: an observational epidemiological study. Lancet HIV 3, e388–e396 (2016).
26. Grabowski, M. K. et al. The validity of self-reported antiretroviral use in persons living with HIV: a population-based study. AIDS 32, 363–369 (2018).
27. Gall, A. et al. Universal amplification, next-generation sequencing, and assembly of HIV-1 genomes. J. Clin. Microbiol. 50, 3838–3844 (2012).
28. Ratmann, O. et al. HIV-1 full-genome phylogenetics of generalized epidemics in sub-Saharan Africa: impact of missing nucleotide characters in next-generation sequences. AIDS Res. Hum. Retroviruses 33, 1083–1098 (2017).
29. Rose, R. et al. Identifying transmission clusters with cluster picker and HIV-TRACE. AIDS Res. Hum. Retrovir. 33, 211–218 (2017).
30. Romero-Severson, E. O. et al. Donor-recipient identification in para- and poly-phyletic trees under alternative HIV-1 transmission hypotheses using approximate Bayesian computation. Genetics 207, 1089–1101 (2017).
31. Carlson, J. M. et al. HIV transmission. Selection bias at the heterosexual HIV-1 transmission bottleneck. Science 345, 1254031 (2014).
32. Hue, S. et al. HIV type 1 in a rural coastal town in Kenya shows multiple introductions with many subtypes and much recombination. AIDS Res. Hum. Retrovir. 28, 220–224 (2012).
33. Novitsky, V. et al. Phylogenetic relatedness of circulating HIV-1C variants in Mochudi, Botswana. PLoS ONE 8, e80589 (2013).
34. Chan, S. K. et al. Likely female-to-female sexual transmission of HIV–Texas, 2012. MMWR Morb. Mortal. Wkly. Rep. 63, 209–212 (2014).
35. Fraser, C. et al. Virulence and pathogenesis of HIV-1 infection: an evolutionary perspective. Science 343, 1243727 (2014).
36. Hladik, W. et al. Men who have sex with men in Kampala, Uganda: Results from a bio-behavioral respondent driven sampling survey. AIDS Behav. 21, 1478–1490 (2017).
37. Rose, R. et al. Phylogenetic methods inconsistently predict direction of HIV transmission among heterosexual pairs in the HPTN052 cohort. J. Infect. Dis., https://doi.org/10.1093/infdis/jiy734 (2018).
38. De Silva, D. et al. Whole-genome sequencing to determine transmission of Neisseria gonorrhoeae: an observational study. Lancet Infect. Dis. 16, 1295–1303 (2016).
39. Fifer, H. et al. Sustained transmission of high-level azithromycin-resistant Neisseria gonorrhoeae in England: an observational study. Lancet Infect. Dis. 18, 573–581 (2018).
40. Dellicour, S. et al. Phylodynamic assessment of intervention strategies for the West African Ebola virus outbreak. Nat. Commun. 9, 2222 (2018).
41. Poon, A. F. et al. Near real-time monitoring of HIV transmission hotspots from routine HIV genotyping: an implementation case study. Lancet HIV 3, e231–e238 (2016).
42. Oster, A. M., France, A. M. & Mermin, J. Molecular epidemiology and the transformation of HIV prevention. JAMA 319, 1657–1658 (2018).
43. Skums, P. et al. QUENTIN: reconstruction of disease transmissions from viral quasispecies genomic data. Bioinformatics 34, 163–170 (2018).
44. Bernard, E. J., Cameron, S., HIV Justice Network & GNP+. Advancing HIV Justice 2: Building momentum in global advocacy against HIV criminalisation. http://www.hivjustice.net/wp-content/uploads/2016/05/AHJ2.final2_.10May2016.pdf (2016).
45. Yebra, G. et al. Using nearly full-genome HIV sequence data improves phylogeny reconstruction in a simulated epidemic. Sci. Rep. 6, 39489 (2016).
46. Novitsky, V. et al. Long-range HIV genotyping using viral RNA and proviral DNA for analysis of HIV drug resistance and HIV clustering. J. Clin. Microbiol. 53, 2581–2592 (2015).
47. Bonsall, D. et al. A comprehensive genomics solution for HIV surveillance and clinical monitoring in a global health setting. Preprint at bioRxiv, https://www.biorxiv.org/content/early/2018/08/23/397083 (2018).
48. Sypsa, V. et al. Rapid decline in HIV incidence among persons who inject drugs during a fast-track combination prevention program after an HIV outbreak in Athens. J. Infect. Dis. 215, 1496–1505 (2017).
49. Chewapreecha, C. et al. Dense genomic sampling identifies highways of pneumococcal recombination. Nat. Genet. 46, 305–309 (2014).
50. Paterson, G. K. et al. Capturing the cloud of diversity reveals complexity and heterogeneity of MRSA carriage, infection and transmission. Nat. Commun. 6, 6560 (2015).
51. Wymant, C. et al. Easy and accurate reconstruction of whole HIV genomes from short-read sequence data. Virus Evol. 4, vey007 (2018).
52. Hunt, M. et al. IVA: accurate de novo assembly of RNA virus genomes. Bioinformatics 31, 2374–2376 (2015).
53. Bankevich, A. et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19, 455–477 (2012).
54. Nurk, S., Meleshko, D., Korobeynikov, A. & Pevzner, P. A. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 27, 824–834 (2017).
55. Wood, D. E. & Salzberg, S. L. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15, R46 (2014).
56. Carey, V., Long, L. & Gentleman, R. RBGL: an interface to the BOOST graph library, version 1.55.1. http://bioconductor.org/packages/release/bioc/html/RBGL.html (2017).

## Acknowledgements

We thank the participants of the RHSP RCCS, as well as the PANGEA-HIV steering committee for their input and their comments on a previous version of this article. Computations were performed at the Imperial College Research Computing Service, https://doi.org/10.14469/hpc/2232. This study was supported by the National Institute of Mental Health (K23MH086338, R01MH107275); the National Institute of Allergy and Infectious Diseases (R01AI110324, U01AI100031, R01AI102939); the National Institute of Child Health and Development (R01HD070769, R01HD050180); the Division of Intramural Research, National Institute for Allergy and Infectious Diseases, National Institutes of Health; the Bill & Melinda Gates Foundation (22006.02, OPP1084362); the Johns Hopkins University Center for AIDS Research (P30AI094189); and the European Research Council (Advanced Grant PBDR-339251).

## Author information

O.R., M.K.G., A.L.B., TdO, P.K., D.P., T.C.Q., M.J.W., D.S., R.H.G., C.F. conceived the study; M.K.G., J.K., G.K., O.L., T.C.Q., M.J.W., D.S., R.H.G., A.G., D.B. selected, provided and prepared sequence and patient data; L.A.-D., A.H., T.G. provided managerial and logistical support, including data tracking; M.K.G., C.W., T.G. assembled deep-sequence reads; O.R., M.H. performed computations and statistical analyses; O.R., M.K.G., C.F. evaluated statistical analyses; O.R. wrote the first version of the manuscript; all authors reviewed and approved the statistical analysis and final version of the manuscript.

Correspondence to Oliver Ratmann.

## Ethics declarations

### Competing interests

The authors declare no competing interests.

Journal peer review information: Nature Communications thanks Denise Kühnert, Thomas Leitner, and the anonymous reviewers for their contribution to the peer review of this work.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A full list of consortium members appears at the end of the paper.
