Geographically-stratified HIV-1 group M pol subtype and circulating recombinant form sequences

Accurate classification of HIV-1 group M lineages, henceforth referred to as subtyping, is essential for understanding global HIV-1 molecular epidemiology. Because most HIV-1 sequencing is done for genotypic resistance testing pol gene, we sought to develop a set of geographically-stratified pol sequences that represent HIV-1 group M sequence diversity. Representative pol sequences differ from representative complete genome sequences because not all CRFs have pol recombination points and because complete genome sequences may not faithfully reflect HIV-1 pol diversity. We developed a software pipeline that compiled 6,034 one-per-person complete HIV-1 pol sequences annotated by country and year belonging to 11 pure subtypes and 70 CRFs and selected a set of sequences whose average distance to the remaining sequences is minimized for each subtype/CRF and country to generate a Geographically-Stratified set of 716 Pol Subtype/CRF (GSPS) reference sequences. We provide extensive data on pol diversity within each subtype/CRF and country combination. The GSPS reference set will also be useful for HIV-1 pol subtyping.


Background & Summary
Accurate classification of HIV-1 group M lineages, henceforth referred to as subtyping, has been essential for understanding the evolution of divergent HIV-1 in the context of the global pandemic. The classification of such sequences is complicated by the HIV-1's high mutation rate and propensity to develop new recombinant forms when two different virus strains infect the same cell. Indeed, HIV-1 group M sequences can be classified into many different lineages referred to as pure subtypes and circulating recombinant forms (CRFs). The HIV research community led by researchers at the Los Alamos National Laboratories (LANL) HIV Sequence Database established criteria for the taxonomic recognition of the pure subtypes A, B, C, D, F, G, H, J, and K and for an increasing number of CRFs 1,2 . A new CRF has generally been established and assigned a number when a recombinant virus with unique breakpoints has been sequenced in its entirety and identified in three or more epidemiologically unlinked individuals. Recombinant viruses that do not share breakpoints with an established CRF are classified as unique recombinant forms (URFs) 3,4 .
Several online programs use phylogenetic or bioinformatic approaches to determine the subtype/CRF of a submitted sequence. Each of these programs relies on a set of~9,700 bp full-genome reference sequences for each subtype and for a large proportion of CRFs [5][6][7][8][9] . The vast majority of HIV-1 sequencing, however, is done for genotypic resistance testing and is therefore confined to the pol gene (2,841 bps; 947 amino acids) which encodes the protease, reverse transcriptase (RT), and integrase enzymes. We therefore sought to develop a set of representative pol sequences for subtyping of HIV-1 pol sequences. Selecting representative pol sequences for each subtype/CRF based solely on pol will differ from selecting them based on complete genome sequences because not all CRFs have breakpoints in pol. Furthermore, sequences selected based on their complete genome may not reflect the diversity of HIV-1 pol because there are fewer complete genome sequences than pol sequences.
As many subtypes and CRFs exhibit regional divergence 10-13 , we developed a systematic approach for selecting a Geographically-Stratified Pol Subtype/CRF (GSPS) reference dataset. For each distinct subtype/CRF and country combination, we characterized the extent of diversity in the pol gene and applied a partitioning around medoids (PAM) algorithm to identify the smallest number of centrally located sequences that would minimize the average distance to the closest leaf (ADCL) of the complete set of subtype/CRF/country sequences 14 . This approach is designed to select a subset of sequences that is designed to represent the diversity of a larger collection of sequences 14 .
It should be noted, however, that the collection of published HIV-1 sequences may not perfectly represent the entirety of global HIV-1 sequences. Indeed, as we indicate in this study, many CRFs are likely over-represented in public sequence databases compared with their prevalence in HIV-1 infected persons. To counter this fact, we supplemented our GSPS reference dataset with a set of additional sequences that were closest to the progenitors of many of the reported CRFs.  Figure 1 shows the complete process of (1) Identifying complete pol sequences from GenBank; (2) Extracting annotation from GenBank, LANL 1 , the HIV Drug Resistance Database 15 , and associated PubMed publications; (3) Performing sequence quality control; (4) Assigning a consensus subtype to those pol sequences for which there was an agreement using two or more of the following subtyping classifications: LANL 1 , Rega subtyping tool 5 , and COMET 6 ; and (5) Selecting the sequences that comprise the GSPS reference set. In addition, the manuscript also characterizes the extent of pol diversity within each subtype/CRF and country combination and demonstrates the potential usefulness of the GSPS set of sequences for subtyping studies.

Compilation of HIV-1 group M complete pol sequences
We accessed the repository of NCBI-GenBank Flat File Release 220.0 (release date of June 15, 2017) 16 and downloaded all files containing GenBank virus sequence records (gbvrl1.seq.gz -gbvrl49.seq.gz). Using a custom Perl script (GB_to_Fasta.pl), we parsed each of the 2,293,121 virus sequence records and extracted data from the following GenBank fields: "TITLE", "AUTHORS", "TaxonID", and, if available, "PUBMED", "collection_date", "country", and "notes". We created an fasta sequence file in which each sequence header contained the extracted sequence annotations. We then converted the fasta sequence file to a BLAST-searchable database using MAKEBLASTDB (NCBI blast+ package) 17 .
To retrieve sequences homologous to the HIV-1 subtype B pol consensus amino acid sequence, we searched the database using TFASTX (FASTA package) 18 . TFASTX is a local sequence alignment program which searches translated nucleotide sequences using an amino acid query sequence. Unlike TBLASTN (NCBI blast+ package), it can extend an alignment across reading frames 19 . Using an e-value of 0.0000001, TFASTX returned 54,781 High-scoring Segment Pairs (HSPs). We then selected the 12,871 HSPs with a similarity score >80 and 947 amino acids. To remove frameshifts and insertions and to generate a nucleotide sequence alignment of 2,841 bps in length, we used the global nucleotide sequence alignment program, GGSEARCH (FASTA package). A second custom Perl script (Gene_to_Sequences. pl) was developed to streamline the above processes.
The 12,871 HIV-1 complete pol sequences were grouped into 631 submission sets containing the same submission "Title" and "Authors". Approximately 85% of the 631 submission sets were linked to one or more publications containing a PubMed ID. To describe the provenance of each sequence (i.e., whether it was the only sequence obtained from an individual vs. whether it was one of multiple sequences obtained from an individual), we created a Person ID descriptor for each sequence using annotations extracted from the GenBank record, LANL, HIVDB, and published papers.
From the set of 12,871 complete HIV-1 pol sequences, we excluded 1,104 sequences including 55 nongroup M sequences and 1,049 sequences that either lacked annotation for country or sample year (n = 339), lacked a Person ID descriptor (n = 234), or had poor sequence quality (n = 476) defined as the absence of ≥10 amino acid positions or the presence of ≥3 stop codons, ≥20 highly unusual mutations, or ≥5 signature APOBEC-associated mutations 20 . These exclusions left us with 11,767 well-characterized pol sequences from 6,763 persons. One sequence per person was then selected for subtyping. If one complete pol sequence was available from a person, we selected the complete pol sequence from that person. Otherwise, we used the random number generator implemented in mysql with a seed and chose one random sequence from that person.

Subtype assignment
For the 6,763 one-per-person complete pol sequences, we recorded the subtype classification in the LANL database, which is based on the author-defined subtype and additional analyses performed by LANL. This classification is based on the entire sequence in a GenBank record and is therefore influenced by sequence outside of the pol region. We also submitted each of the pol sequences to the online subtyping programs Rega and COMET 5,6 . Sequences for which two or more of three classifications agreed were assigned a consensus subtype/CRF. Overall, 6,034 (89.2%) of the 6,763 sequences were assigned a consensus subtype/CRF: 11 pure subtypes including A2 and F2 (65.6% of 6,034) and 70 CRFs (34.4% of 6,034). The sequences assigned a subtype included (i) 5,091 sequences (84.4% of 6,034) for which the classifications assigned by Rega, COMET, and LANL yielded the same classification; (ii) 875 (14.5% of 6,034) for which two of the three approaches yielded the same classifications; and (iii) 68 (1.1% of 6,034) for recently recognized CRFs in LANL but for which reference sequences did not exist in Rega and COMET.
Thirty-nine sequences assigned to seven CRFs were reclassified because they did not have recombination breakpoints in pol: CRF14_BG, CRF15_01B, CRF32_06A1, CRF41_CD, CRF46_BF1, CRF57_BC and CRF88_BC were reclassified based on their pol sequence to subtype G, CRF01_AE, CRF06_cpx, subtype D, subtype F1, subtype C, and subtype C, respectively. Twenty sequences assigned to CRF70_BF1, CRF71_BF1 and CRF72_CRF1 were not included because the breakpoints in their pol genes were inconsistent 21 . The majority of the remaining 709 sequences (10.5% of 6,763), for which the three approaches diverged, were URFs typically reported as unassigned by Rega and COMET. They were excluded from subsequent analyses because URFs are not recognized established lineages. Most of these URFs contained sequence segments consistent with CRF01 and B (Southeast Asia), B and C (South East Asia), B and F (South America), and A and D (East Africa).
The overall distribution of subtypes in the 6,763 pol sequences in order of frequency was as follows: subtype B (32.2%), subtype C (18.0%), CRF01_AE (17.9%), Other CRFs (11.2%), URFs (10.5%), subtype A (4.6%), CRF02_AG (1.9%), subtype D (1.3%), and subtype G (1.1%). The remaining four uncommon subtypes F, H, J, and K comprised 1.4% of the dataset. Table 1 compares this distribution to the estimated global distribution of subtypes from a 2011 publication describing the WHO/UNAIDS Network characterizing the global and regional distribution of subtypes 22 . This comparison shows that our dataset of complete pol sequences was more likely to contain viruses belonging to subtype B, CRF01_AE, other CRFs, and URFs and less likely contain viruses belonging to subtypes C, A, and CRF02_AG compared those in the WHO/UNAIDS publication. The 6,034 one-per-person, subtype-assigned pol sequences were from 88 countries in the following seven regions: Asia (43.5%), Sub-Saharan Africa (26.4%), North America (12.9%), Europe (8.2%), Latin America and Caribbean (7.2%), former Soviet Union (1.5%) and North Africa and Middle East (0.4%) (Complete.Set.txt, Data Citation 1). The countries with the greatest number of subtype/CRFs were China (18), Cameroon (16), DRC (15), Spain (12) and Brazil (11). Twenty-nine countries had 3 to 9 subtype/ CRFs and 55 countries had 1 to 2 subtype/CRFs. The number of subtype/CRFs per country was influenced in large part by the frequency with which laboratories in a country performed complete pol sequencing.
The subtypes present in the largest number of countries were B (42 countries), C (30 countries), A1 (24 countries), CRF02_AG (20 countries), CRF01_AE (15 countries), D (13 countries), G (13 countries), To identify a set of representative sequences, we applied the k-medoids method using the partitioning around medoids (PAM) algorithm implemented in the Reference Package PReparare (rppr) program in the pplacer package 14 . Medoids are representative members of a dataset whose average distance from all the members in the dataset is minimized. The rppr program takes a phylogenetic tree and the number (k) of medoids (leaves or sequences) to be subsampled that minimize the average distance to the closest leaf (ADCL). We performed this analysis for each set of three or more sequences from the same country within a subtype/CRF.  Overall, there were 294 distinct combinations of subtypes/CRFs and country (subtype/CRF/country) of which 155 contained three or more sequences. For these 155 sequence sets, we created a neighbourjoining tree rooted at its midpoint using the TN93 substitution model. We then used rppr to identify the set of sequences that minimized ADCL for each k with k ranging from 1 to the number of sequences in the tree (n)-1.
After plotting, k vs. ADCL, for each subtype/CRF/country combination (ADCL.txt, Data Citation 1), we decided to select the minimum set of k sequences that lowered ADCL below 4.0% because this could be achieved for 72% (111) of the 155 subtype/CRF/country combinations. However, for 14 (9%) subtype/ CRF/country combinations for which there was much sequence diversity and for which the k required to lower ADCL below 4.0% was high, we selected the set of k sequences that lowered ADCL below a higher threshold: (i) 4.5% for subtype A1 in Kenya, Tanzania, and Uganda; for subtype B in Cyprus, Germany, Denmark, Spain; for subtype C in Botswana, Ethiopia, Kenya, Tanzania, and South Africa; and for subtype G in Nigeria; and (ii) 5.0% for subtype B in Brazil. For the remaining 30 subtype/CRF/country combinations (19%) for which the ADCL remained above 4.5% regardless of k, we included all 141 sequences.
For 139 subtype/CRF/country combinations with just one or two sequences, we included just one sequence unless the two sequences belonging to that subtype/CRF/country differed by more than 4.0%.  Overall the above procedures yielded 684 sequences including (i) 523 sequences from the 155 subtype/ CRF/country combinations containing three or more sequences and (ii) 161 sequences from the 139 subtype/CRF/country combinations containing just one or two sequences.

Supplementing GSPS sequences with subtype sequences closest to CRF progenitors
The set of publicly available HIV-1 sequences, particularly sequences contributed by research laboratories, is likely to contain a higher relative proportion of CRF than non-CRF sequences compared with their prevalence in HIV-1 infected patients because CRF sequences may be more likely to be published. If a set of reference sequences contains a CRF sequence but does not contain the non-CRF ancestors that gave rise to the CRF, some non-recombinant sequences may appear to be more similar to the CRF reference sequence than to the parent subtype. This is particularly likely to occur when the CRF is dominated by one subtype and contains just a small segment of another subtype.
To counter this effect, we added 32 additional reference sequences from subtypes A1, A2, B, C, D, F1, G, and CRF01_AE using the following approach. We used the LANL CRF gene maps to identify 49 CRFs comprising one or two of the following subtype/CRFs: A1, A2, B, C, D, F1, G, and CRF01_AE in their pol genes. We then created 60 neighbour-joining trees containing each of the sequences belonging to 49 CRFs and each of the sequences from that CRF's dominant pol subtype according to the CRF gene map. Each of these trees comprised just the region of pol shared by the CRF and its parent subtype. For several CRFs, for which more than one subtype comprised nearly one-half of pol, we created trees with that CRF and both parent subtypes. Although CRF01_AE is itself a CRF, it is also the parent for several other CRFs. We used R packages including ape and phytools to identify the most recent common ancestor (MRCA) of the CRF and the sequence within the parent subtype with the shortest patristic distance to the MRCA. Overall 48 distinct sequences from 55 trees were identified of which 32 were not already included in our set of 684 representative sequences, yielding a final total of 716 GSPS sequences (GSPS.Set.txt, Data Citation 1). Table 2 (available online only) describes the complete set of 6,034 one-person HIV-1 group M pol sequences and the 716 GSPS sequences. For each subtype/CRF, Table 2 (available online only) contains the number of sequences, the average pairwise distance (PWD) using the TN93 substitution model, and the number of sequences per country. It also describes the number of sequences per country in the GSPS panel and the overall ADCL for each subtype/CRF using the 716 GSPS sequences. Table 3 (available online only) lists the mean and maximum PWDs within each subtype/CRF for both the set of complete pol sequences and the GSPS subset. The data in the table indicates that the mean and maximum PWDs in the complete dataset and in the GSPS subset were very similar.

Code availability
Scripts used for the present reference datasets generation are available from the site https://github.com/ hivdb/Gene_to_Sequences

Data Records
All reference datasets are available from Dryad Digital Repository (Data Citation 1). The complete set of 6,034 one-per-person, subtype-assigned HIV-1 group M complete pol sequences is available in a tabdelimited file containing the aligned pol sequence, GenBank accession number, GenBank author list, Eastern Africa (n = 7), Southern Africa (n = 4), China (1), India (2) CRF08_BC has been reported primarily in China. CRF22_01A1 A1 9 Cameroon (9) CRF22_01A1 has been reported primarily in Cameroon. Subtype A1 is also common in Cameroon Brazil (4) CRF28_BF has been reported primarily in Brazil. Subtype B is also common in Brazil.
CRF51_01B B 2 Japan (1), Philippine (1) CRF51_01B has been reported primarily in Southeast Asia and Japan. Subtype B is common in Japan.
China (6) CRF64_BC has been reported primarily China. Subtype C is also common in China.
Japan (5) CRF69_01B has been reported primarily in Japan. Subtype B is common in Japan GenBank submission title, PubMed ID, country, sample year, and assigned subtype/CRF (Complete.Set. txt). The GSPS reference set of 716 sequences is also available in a tab-delimited file containing the GenBank accession number, country, sample year, PubMed ID, and assigned subtype/CRF (GSPS.Set.txt). Each of the 716 GSPS sequences is also included in a Fasta format (GSPS.Fasta.txt). The relationship between k and ADCL for each of the 155 subtype/CRF/country combinations with three or more sequences is listed in a tab-delimited file (ADCL.txt). A phylogenetic tree containing the 716 GSPS sequences is provided in Newick tree format (GSPS.tre) and in pdf format (Figure 2). Phylogenetic trees for each subtype/CRF are also provided in pdf format (GSPS.subtype.tre.pdf.zip). These trees were constructed using neighbour joining with branch length optimized by maximum likelihood method with GTR evolution model using the R package phangorn and then rooted at the mid-point. Figure 3 contains a map in which countries are color-coded according to number of complete pol sequences available. For those subtype/CRFs with published time to MRCA (tMRCA), Figure 4 (which is described in the Technical Validation section) plots the tMRCA year vs. the median intra-subtype/CRF PWD in the complete set of 6,034 sequences.

Technical Validation
We performed three analyses to assess the how well the set of 716 GSPS sequences represent the diversity of published pol sequences. We also assess the potential usefulness of this set of sequences for subtyping. First, we determined whether the diversity within each subtype/CRF correlated with the age of the subtype/CRF as would be expected by viruses diverging from a common ancestor. Second, we compared the centrality of the GSPS sequences and NCBI Viral Genotyping Program reference sequences 8 within the complete pol data set. We also determined whether the subtype/CRF of the closest GSPS sequence was the same as the subtype/CRF of the closest NCBI sequence. Third, we compared the genetic distance of the first 1,300 bps of the GSPS sequences to nearly 7,000 1,300 bp pol sequences for which the COMET and Rega subtyping programs assigned the same (i.e., consensus) subtype.

Time to the most recent common ancestor (tMRCA) analysis
A strong correlation is expected between the diversity of sequences and tMRCA for each lineage as less divergence has taken place within lineages that originated more recently. To determine whether the diversity within each subtype/CRF in the 6,034 sequences from which the GSPS set was derived correlated with a published age of the subtype/CRF, we reviewed published papers that used molecular clock methods to estimate subtype and CRF tMRCAs. For those subtype/CRFs with published tMRCAs, we plotted the tMRCA year vs. the median intra-subtype/CRF PWD in the complete set of 6,034 sequences ( Figure 4). Although the proportion of subtype/CRF sequences in the complete set is not representative of global epidemiology and the bias might have influenced the diversity of subtype/CRF in this set, there was a strong correlation between the published tMRCAs and the diversity of sequences in  the complete set (Pearson's correlation coefficient; r 2 = 0.78; p o0.001). Table 4 lists the published papers with the tMRCA estimates used in this analysis.

NCBI reference set comparison
To evaluate the centrality of the GSPS reference sequence set for distance-based subtyping, we compared the genetic distance of the GSPS sequences to the non-GSPS sequences in the complete pol data set with the genetic distances of the NCBI Viral Genotyping Program reference sequences 8 to the non-NCBI reference sequences in the complete pol data set. we compared the centrality of the 716 GSPS sequences with the 316 subtype reference sequences used by the NCBI Viral Genotyping Tool 8 . For this comparison, we used 5,185 test sequences-the complete set of 6,034 sequences excluding those sequences that were in either the GSPS or NCBI reference set. We then identified the subtype/CRF of the sequences in GSPS set and in the NCBI set with the fewest nucleotide differences from each test sequence.
For 96.0% of the test sequences, the subtype of the closest GSPS and NCBI reference sequences were the same. The median distance to the closest GSPS and NCBI reference sequences were 2.9% (IQR: 2.1-3.7%) and 4.2% (IQR: 2.9-5.0%), respectively (Wilcoxon rank-sum test: po 0.001). Thus, the GSPS set performed better at minimizing the distance to the test sequences. The increased centrality of the GSPS reference set may reflect the larger number of references in this set and/or the greater representativeness of its sequences.
Of the 4.0% (n = 208) of sequences for which the subtype/CRF of the closest matching GSPS and NCBI sequences were not identical, most were not true discordances. For 126 sequences, the GSPS and NCBI subtypes were the same for the pol gene (i.e., the NCBI subtype was one of the eight subtype/CRFs that were reclassified by us because they did not contain pol recombination points). For 21 sequences, the GSPS subtype/CRF was one of the most recent CRFs that was not yet included in the NCBI set.
The subtype/CRF classifications of the remaining 61 sequences (1.2% of 5,185) with discordant classifications are shown in Table 5. Each of these sequences was closest to a CRF in the NCBI Reference Set but closest to a pure subtype in the GSPS set. For 26 of these sequences, the NCBI classification was likely to be incorrect because the test sequences were from Sub-Saharan Africa but their closest reference sequence was either CRF08_BC that has been reported primarily in China 23 and CRF31_BC that has been reported primarily in Brazil 24 . The fact that NCBI CRF reference sequences were the closest sequence to likely pure subtype test sequences, suggests that the NCBI reference set may not have a sufficient number of representative pure subtype sequences. For the remaining 35 sequences, it was not possible to use isolate data to help infer the likely subtype/CRF. Genetic distance to 1,300 bp sequences with a consensus COMET/Rega subtype To evaluate the potential utility of the GSPS reference sequence set for distance-based subtyping for partial pol sequences, we downloaded all one-per-patient HIV-1 group M sequences (n = 50,335) encompassing the first 1300 bp of pol from LANL. We excluded the 6,034 complete pol sequences from the download because they were used to derive our GSPS set. We then randomly selected 1500 subtype B and 5500 non-subtype B sequences and aligned the first 1,300 bp of each of these sequences. For 6,115 (87.4%) of sequences, COMET and Rega assigned the same subtype/CRF. For the remaining 885 sequences, either COMET or Rega were discordant or reported that the sequence might have been a URF.
For 6,088 (99.6%) of the 6,115 sequences, the subtype/CRF of the closest sequence in the GSPS sequence set agreed with the consensus subtype/CRF assigned by COMET and Rega. The subtype/CRF classifications of the remaining 27 (0.4%) sequences with a discordance are shown in Table 6. The extraordinarily high concordance between the subtype/CRF of the closest GSPS sequence and the consensus subtype/CRF assigned by COMET/Rega likely reflects the fact that the test dataset contained only established subtypes and CRFs and did not contain URFs. We have not been able to identify any explanation for the few discordances listed in Table 6.