Distinct antibody-mediated selection in a narrow-source HIV-1 outbreak among Chinese former plasma donors

The HIV-1 envelope protein mutates rapidly to evade recognition and killing, and is a major target of the humoral immune response and of vaccine development. Identification of common antibody epitopes for vaccine development has been complicated by large variations at both the virus level (different infecting founder strains) and the host genetic level. We studied HIV-1 envelope gp120 evolution in 12 Chinese former plasma donors infected with a purportedly single founder virus, with the aim of identifying common antibody epitopes under immune selection. We found five amino acid sites to be under significant positive selection in ≥50% of the patients, and 22 sites housing mutations consistent with antibody-mediated selection. Despite strong selection pressure, some sites housed a limited repertoire of amino acids. Structural modelling revealed that most sites were located on the exposed distal edge of the Gp120 trimer, whilst wholly invariant sites clustered within the centre of the protein complex. Four sites, flanking the V3 hypervariable loop of Gp120, represent novel antibody epitopes that may be suitable as vaccine candidates.


INTRODUCTION
The human immunodeficiency virus type 1 (HIV-1) glycoprotein Gp120 is a 120 kDa surface-expressed protein that is essential for viral entry into the cell. It is encoded by the env gene, and consists of five variable regions (V1-V5) interspersed between five conserved regions (C1-C5) 1 . The Gp120 forms heterodimers with Gp41 which themselves trimerise, studding the viral membrane at a density of around fourteen copies per virion 2 . Whilst the cellular immune response against HIV-1 targets epitopes dispersed throughout the viral genome, the accessibility of Gp120 on the cell surface makes it the major target of humoral responses.

The humoral response against HIV-1 Gp120 develops rapidly within around four weeks of detectable plasma viral loads 3 , but neutralising antibodies (NAbs) typically only develop after several months of infection 4 . Around two hundred antibodies have been described that recognise the Gp120 protein (LANL Immunology Database; http://www.hiv.lanl.gov/content/immunology), and many of the epitopes cluster within the V3 loop. However, the interplay between Gp120 and the adaptive immune response is complex, and the role that antibodies play in the control of infection is a contentious issue. The loss of neutralising activity has been associated with faster disease progression in some individuals 5 . In addition, studies in macaques have indicated that B lymphocyte depletion-associated reductions in NAb titre inversely correlate with viral load, suggesting that the humoral response may contribute at least in part to the control of viral replication 6,7 . However, whilst NAbs do exert selection pressure on the virus 8,9 , the breadth of response does not correlate with or predict progression to AIDS 5,10,11 .

It is commonly believed that the reason why antibody responses may play a limited role in the control of HIV-1 is because the virus can mutate easily to escape neutralization by these

METHODS
Cohort characteristics and sampling
The SM cohort comprises HIV-1 patients from a small rural community in Henan province, China, as described previously 23 . Between 1993 and 1995, the patients were infected with a narrow source of subtype B' virus during an illegal paid plasma donation scheme. Few individuals knew that they were infected until HIV screening programmes were implemented in China in 2004. Patients were then recruited to the cohort and gave informed consent for their samples to be used. (Table S1). Ethical approval was obtained from Beijing You'an Hospital and the University of Oxford Tropical Ethics Committee (OxTREC).

HIV-1 env gp120 sequencing and sequence assembly
Viral RNA was isolated from cryopreserved plasma samples and purified using the QIAamp Viral RNA Extraction Kit (Qiagen), followed by reverse transcription using the SuperScript III Reverse Transcriptase System (Invitrogen). Env gp120 C2-V5 was amplified by nested touchdown PCR (primers listed in Table S2). Amplified DNA was purified using the MinElute Gel Extraction Kit (Qiagen), and ligated into a pCR4-TOPO sequencing vector using the TOPO TA Cloning Kit for Sequencing (Invitrogen). Chemically competent One Shot MAX Efficiency DH5α E. coli (Invitrogen) were transformed with the prepared plasmids, and cultured overnight at 37 °C. Eighteen colonies were selected for colony PCR (M13F and M13R primers, Table S2), and the resulting products were purified using ExoSAP-IT (Affymetrix) and sequenced to generate forward and reverse reads (Source BioScience). Contigs were assembled and checked manually using Geneious v9.0.5 26 (HXB2 gp120 positions 661-1455, accession number K03455). A multiple sequence alignment was generated using MUSCLE 27 , and then manually edited in MEGA v6.06 28 . Sequences with premature stop codons or frame-shifts were excluded to control for intra-patient clustering of sequences.

Inference of infecting founder strain
A consensus sequence was generated for each patient for each time-point, with an ambiguity threshold of 10%. One sequence was selected per patient to generate a dataset with sequences evenly distributed across the sampling period. The sequences were aligned and codon-stripped to a final alignment length of 759 nucleotides. Patient SM007 was excluded from this analysis because it was not possible to conclusively rule out dual or superinfection, as preliminary data exploration demonstrated that sequences from this patient did not resolve monophyletically.

As the SM cohort patients were infected with a narrow source of virus, the sequence of the most recent common ancestor (MRCA) was used as a surrogate for the infecting founder. The sequence of the reconstructed ancestor at the tree root from each run following burn-in was extracted, and a consensus sequence was generated from an alignment of these sequences (26,000 sequences) with an ambiguity threshold of 10%.

The ratio of non-synonymous (dN) to synonymous (dS) mutations was estimated for each codon in patient-specific alignments. Renaissance counting 32,33 was implemented through BEAST v1.8.2, and the HKY85 nucleotide substitution model 34 , three-site codon partitioning, and a strict molecular clock with tip-dating of time-stamped sequences were applied. Significant selection was defined as a 95% highest posterior density (HPD) interval that did not encompass 1. An alignment representing all patients was then constructed from these data, and the dN/dS estimates were combined across each aligned position. The proportion of patients with virus showing evidence of significant selection pressure in each cohort was calculated.
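The selection criterion described above can be sketched in a few lines of Python. This is a minimal illustration only, not the authors' pipeline (their analysis used BEAST and custom R scripts): the `CodonEstimate` structure, the function names, and the choice to use patients with an estimate at a position as the denominator are all assumptions introduced here. It takes per-codon 95% HPD bounds as given, labels a codon as positively selected when the whole interval lies above 1 and negatively selected when it lies below 1, and then computes the proportion of patients with positive selection at each aligned position.

```python
from dataclasses import dataclass


@dataclass
class CodonEstimate:
    """Posterior summary of dN/dS for one codon in one patient.

    Hypothetical structure for illustration; not taken from the study.
    """
    patient: str
    position: int      # aligned codon position
    hpd_lower: float   # lower bound of the 95% HPD interval
    hpd_upper: float   # upper bound of the 95% HPD interval


def classify(est: CodonEstimate) -> str:
    """Label a codon by whether its 95% HPD interval excludes 1."""
    if est.hpd_lower > 1.0:
        return "positive"
    if est.hpd_upper < 1.0:
        return "negative"
    return "neutral"  # interval encompasses 1: no significant selection


def proportion_positive(estimates: list[CodonEstimate]) -> dict[int, float]:
    """Fraction of patients under significant positive selection per position.

    Assumption: the denominator is the number of patients that have an
    estimate at that aligned position.
    """
    patients_at: dict[int, set[str]] = {}
    positive_at: dict[int, set[str]] = {}
    for est in estimates:
        patients_at.setdefault(est.position, set()).add(est.patient)
        if classify(est) == "positive":
            positive_at.setdefault(est.position, set()).add(est.patient)
    return {
        pos: len(positive_at.get(pos, set())) / len(patients)
        for pos, patients in patients_at.items()
    }
```

Positions whose proportion reaches 0.5 would correspond to the "50% or more of the patients" criterion used in the Results.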

Variant characterisation
A variant was defined as any amino acid in any position in the alignment that differed from that present in the inferred founder. Owing to the extensive degree of variation, the hypervariable loops were conservatively stripped from the alignment prior to analysis (final length 179 amino acids). Major variants were defined as variants found in greater than 15% of the sequences. Whilst major variants are canonically defined as those present at a frequency greater than 5%, this value was conservatively tripled as the amplicon was approximately three times more variable than the full-length HIV-1 genome 35 .
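The variant definitions above can be expressed as a short Python sketch. It is illustrative only and is not the authors' code: the function name `major_variants`, the 1-based position convention, and the decision to ignore alignment gaps (`-`) are assumptions made here. Only the 15% threshold and the comparison against the inferred founder come from the text.

```python
from collections import Counter

# From the text: major variants occur in more than 15% of the sequences.
MAJOR_VARIANT_THRESHOLD = 0.15


def major_variants(founder, sequences, threshold=MAJOR_VARIANT_THRESHOLD):
    """Return {(position, amino_acid): frequency} for major variants.

    A variant is any residue differing from the founder at that position.
    Positions are 1-based alignment columns. Gap characters are skipped,
    which is an assumption of this sketch.
    """
    if not sequences:
        return {}
    if any(len(seq) != len(founder) for seq in sequences):
        raise ValueError("all sequences must be aligned to the founder length")

    result = {}
    total = len(sequences)
    for i, founder_aa in enumerate(founder):
        # Count every residue observed in this alignment column.
        counts = Counter(seq[i] for seq in sequences)
        for aa, n in counts.items():
            if aa == founder_aa or aa == "-":
                continue
            freq = n / total
            if freq > threshold:
                result[(i + 1, aa)] = freq
    return result
```

Called with the inferred founder and one patient's aligned amino acid sequences, the function returns only those substitutions exceeding the threshold; lowering `threshold` to 0.05 would reproduce the canonical definition mentioned in the text.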

RESULTS
Cohort characteristics
HIV-1 env gp120 sequences were recovered from 12 patients in the SM cohort (seven female and five male). The HLA types of these patients were representative of cohort frequencies 23 . Sequence recovery was 74% from the available specimens, with ten of the patients yielding sequences from two or more time points (Fig. S1A). PCR success was associated with the viral load of the sample. The final dataset of 575 sequences is available from Genbank under accession numbers MF078678-MF079252.

Across the specimens sampled, median viral load was 7,388 copies ml⁻¹ of plasma (interquartile range (IQR): 1,612-30,403); median absolute CD4⁺ lymphocyte count was 337 cells µl⁻¹ (IQR: 248-400); and median CD4 percentage of lymphocytes was 24% (IQR: 15-30). Demographic and clinical characteristics of the patients sampled are shown in Table S1.

The median evolutionary rate ratio of positions 1+2 to position 3 among the participants was 0.861 (IQR: 0.779-0.892), indicating overall purifying selection. The overall ratio was <1 in all patients except SM021 (1.306). Next, we evaluated the intrapatient dN/dS ratio of each codon by renaissance counting 33 . The majority of codons in Gp120 were under either significant positive or negative selection in seven of the ten patients with longitudinal samples (Fig. 1A), and neutral evolution was comparatively rare. Consistent with the 1+2:3 codon rate ratios, only one patient (SM021) appeared to have more residues under significant positive than negative selection pressure.

Within the variable loops, substantial negative selection could be seen in V3, but not in V4 and V5 (Fig. 1B). The negative selection in V3 corresponded with a marked increase in the density of neutralising antibody epitopes. Five codons, corresponding to positions T297, A337, S348, D415 and S468 in the HXB2 Gp120 (accession number K03455), were under significant positive selection in 50% or more of the patients irrespective of HLA profile (Fig. 1A and Table S1). These sites were further mapped to a homology-modelled structure of the SM cohort Gp120 consensus, and clustered either within or immediately flanking the variable loops on the distal exposed edge of the protein complex (Fig. 1C).

Significant selection pressure was exerted by the humoral response
We next considered how the virus evolves in response to significant positive selection pressure. For each position in the partial Gp120 sequence, deviations from the inferred founder were detected and 288 unique amino acid variants were recorded across 128 variable sites (Fig. 2A). The remaining 51 positions were completely invariant (29%). Of the variants identified, many were present in a single sequence or a small collection of sequences and are likely of limited biological relevance. Major variants were therefore resolved, and to prevent overrepresentation by particular patients, a single time-point was selected for each. Thirty-one major variants were detected in total, across 29 sites (Fig. 2A).

In patients with longitudinally-sampled sequences, the presence or absence of each major variant was recorded at each time point. Whether these sites were under significant positive selection pressure in that patient was also determined (Fig. 2B). Twenty-four of the 29 sites housing major variants were under significant positive selection pressure in at least one patient, and of these sites, a higher proportion exhibited fluctuating patterns of variant emergence than a consistent pattern wherein the variant was present at all time points sampled (p<0.01, two-tailed Fisher's exact test).

Mapping these 24 sites to the homology-modelled structure of the SM cohort Gp120 consensus revealed that all but two were visible on the surface of the protein and likely accessible to

We next considered the biophysical diversity of amino acids in each of the 22 sites of interest (Fig. 3). In all positions, the biophysical properties of the amino acid in the inferred founder were preserved in approximately 50% or more of the sequences. Little divergence was seen in sites where the inferred founder residue was hydrophobic, with other properties being somewhat more variable. Tabulating the amino acids present in each site also reveals that seven positions switch back and forth between just two or three amino acids.

Four novel sites were identified and consistent with antibody activity
The 22 sites were also cross-checked for existing antibody epitopes in the LANL Immunology Database (Table 1). Eighteen of the 22 sites were part of previously described antibody epitopes from different sources (human, mice, or both). In detail, five positions (T236, Q344, I345, V424, and D474) were contained within known human antibody epitopes, whilst 16 positions were contained within epitopes reported in mice. Four sites (V283, E290, L333, S334) flanking the V3 hypervariable loop (which itself houses the majority of antibody epitopes) were identified against which no antibodies have yet been reported. Two of these four sites (V283 and L333) primarily switch between I/V and I/L, respectively.

DISCUSSION
In line with previous observations 48 , mapping the ratio of non-synonymous to synonymous substitutions showed that the majority of sites within the C2-V5 region of Env Gp120 were under negative selection in all but one patient out of ten. Whilst V4 and V5 loops exhibited a dearth of negative selection, the V3 hypervariable loop contained substantial negative selection. Of the five variable loops in Gp120, V3 is the most conserved with amino acid variation restricted to approximately 20% of the loop's residues 49 . It is also likely that V3 is subject to stronger functional constraints due to its important role in co-receptor binding 50-53 . Moreover, it has been shown that deletion of V3 abrogates viral infectivity 54 .

Several sites within each patient showed evidence of significant positive selection, and five of these were common to at least half of the patients sampled. Structural modelling demonstrated that all but two of the 24 positively selected sites were found on exposed regions of the outer face of the protein complex. CTL epitopes in the HIV-1 Nef protein have been reported to cluster in hydrophobic regions 55 , whilst more recent evidence suggests that their distribution may be random across the genome 56 . Such strong clustering on the surface of the protein is therefore more consistent with antibody-mediated than CTL-mediated selection pressure. The exceptions in terms of surface exposure were positions 345 and 424, which were buried within the protein. Notably, position 345 was found to be under significant positive selection pressure in only one patient, SM176. This position is contained within a known HLA-A11-restricted CTL epitope, and HLA-A11 is one of the HLA alleles expressed by patient SM176. It is therefore feasible that this variant has emerged in response to CTL-mediated selection pressure in this patient. Conversely, position 424 is important in CD4 binding 57 , and is contained within a known human antibody epitope. Mutation of this residue to methionine has been shown to increase susceptibility to neutralisation 58 .

We were able to assign most positions to known antibody epitopes in humans and mice. We also identified four novel sites consistent with humoral activity, which are not contained within any known antibody epitopes reported in the literature. These sites flank the V3 hypervariable loop, which is the most epitope-dense region of Gp120 (LANL Immunology Database; https://www.hiv.lanl.gov/content/immunology/), although this may be due to a bias in reporting stemming from the extensive study of V3 in vaccine design rather than a genuine increase in immune activity. Whilst numerous antibodies against V3 have been described, the cross-neutralisation potential of these antibodies is generally low, as reviewed by Hartley et al. 59 . Glycosylation, sequence variation, masking by V1-V2, and the specific amino acid make-up of the loop may contribute to this, as reviewed by Pantophlet et al. 60 . However, some monoclonal and polyclonal antibodies specific to epitopes within V3 have been demonstrated to neutralise diverse HIV-1 strains in vitro 61-64 . Two of the new sites exhibit particularly limited amino acid diversity and as such may warrant further investigation as potential components of vaccines targeting shared epitopes of very low diversity within V3.

Indeed, despite evidence for significant antibody-mediated selection pressure, some sites were relatively conserved in terms of their composition, containing only a limited number of biophysically similar amino acids. This is likely due to functional or structural constraints on the protein, and may reduce the ability of the virus to successfully escape antibodies targeting these regions.
We also identified sites containing biophysically diverse amino acids that may be contained within epitopes eliciting highly effective antibody responses which cycle continuously between a limited number of biophysically distinct forms throughout chronic infection, as predicted by a previous modelling study 21 . Consistent with this model, we found evidence within individuals that some sites contain major variants that appear and disappear over time, such as proline to leucine in position 369 and arginine to serine in position 444.

In summary, our data provide insight into how and where the surface of Gp120 is mutating over the course of clinically latent infection, and indicate that humoral immunity is the likely driving factor of such changes. Our detailed analysis of HIV-1 evolution and selection within hosts infected with a narrow-source virus allowed us to identify amino acid sites under positive selection that was likely attributable to host factors. Importantly, this comprehensive mapping resulted in the identification of both previously described and novel constrained antibody and T-cell epitopes. These sites are likely crucial to viral envelope function, and may aid in the development of future drugs and vaccines.

The datasets generated and analysed during the current study are available in the Genbank repository (accession numbers: MF078678-MF079252). Custom R scripts used in the analysis of these data are available from the authors on request.

respectively. The dotted lines denote 50% of patients. Antibody epitope clustering is shown in grey, whereby intensity denotes number of epitopes spanning that residue as reported in the LANL Immunology Database (http://www.hiv.lanl.gov/content/immunology). Sequences have been aligned to the HXB2 Gp120 reference sequence (accession number K03455), and position is relative to this alignment. C) Homology-modelled structure of the SM cohort consensus Gp120 sequence in surface representation. Variable loops V3, V4 and V5 are shown in grey and sites under significant positive selection in 50% or more patients are shown in purple. Structure has been modelled on a glycosylated HIV-1 Gp120 trimer (RCSB PDB 3J5M) 40 . For clarity, two molecules in the trimer are shown in line representation in grey.