Adaptive evolution of proteins in hepatitis B virus during divergence of genotypes

Hepatitis B virus (HBV) is classified into several genotypes, correlated with different geographic distributions, clinical outcomes and susceptible human populations. It is crucial to investigate the evolutionary significance behind the diversification of HBV genotypes, because it improves our understanding of their pathological differences and pathogen-host interactions. Here, we performed comprehensive analysis of HBV genome sequences collected from public database. With a stringent criteria, we generated a dataset of 2992 HBV genomes from eight major genotypes. In particular, we applied a specified classification of non-synonymous and synonymous variants in overlapping regions, to distinguish joint and independent gene evolutions. We confirmed the presence of selective constraints over non-synonymous variants in consideration of overlapping regions. We then performed the McDonald-Kreitman test and revealed adaptive evolutions of non-synonymous variants during genotypic differentiation. Remarkably, we identified strong positive selection that drove the differentiation of PreS1 domain, which is an essential regulator involved in viral transmission. Our study presents novel evidences for the adaptive evolution of HBV genotypes, which suggests that these viruses evolve directionally for maintenance or improvement of successful infections.

significant attentions [18][19][20] . A strict selection is reported for the PreS1 domain 19 , while its overlapping region, so called the Spacer domain of P ORF, is relaxed and prone to non-synonymous mutations 19,20 .
In general, our current understanding of the selective pressures over the HBV proteins is mainly based on the ratio of non-synonymous vs. synonymous substitution rate d N /d S 21 , which is a commonly accepted statistical method of testing neutrality for viral protein variants 6,[17][18][19][20]22 . However, we argue that using the d N /d S statistic for HBV has a few limitations. First off, the power of d N /d S ratio to indicate the direction of selection in overlap genes is under question. The underlying assumption using d N /d S ratio to imply positive or negative selection is to treat d S as the neutrally evolving rate 21 , which is in conflict with the fact that synonymous sites of overlapping genes often cause non-synonymous alterations of another ORF. Further support to this statement is the decreased synonymous substitution rate observed in overlapping vs. non-overlapping regions of HBV 5 . Second, the outcome of the codon-based statistic regarding overlapping genes reflects a mixture of independent and joint evolutions of multiple ORFs, therefore it cannot distinguish the exact selection pressure over single gene. A solution to this problem is to take into account independently evolving sites, as the evolutions of overlapping genes are largely independent at specific codon positions 18 . Third, d N /d S represents an average substitution rate ratio over the entire phylogeny of input strains, regardless of defined genetic groups such as genotypes. However, the evolution of HBV is regarded to be dynamic between short-term and long-term events 11 . Thus, alternative methods considering the inconsistency of genotypic evolutions will provide novel information to the standing questions.
In the present study, we applied stringent criteria to collect HBV genome sequences representing 8 major genotypes (A-H) from public database. We used a specified definition of protein-altering and neutral variants in consideration of gene overlaps to investigate inner-genotypic polymorphisms and inter-genotypic fixations. We applied the McDonald-Kreitman (MK) test and its related statistics 23,24 to our HBV dataset to examine whether positive selection drives the differentiation of protein-coding genes between genotypes. Through the present study, we aimed to provide a novel insight into HBV protein evolution in presence of natural selection.

Construction of HBV genomic dataset of eight genotypes.
To study the genotypic diversity of HBV, we constructed a dataset of 6765 HBV genome sequences, which is approximately the entire public collection after exclusion of redundancy and poor quality (See Materials and Methods). We subsequently used a fragment typing approach 25,26 to predict genotypes of the sequences and to remove putative inter-genotypic recombinants (n = 2713) (Supplementary Table S1), thereby ensuring the remaining sequences consist of no admixture between genotypes. To validate the method, we randomly selected three sequences from each genotype (in total 24 pure strains), as well as 10 sequences characterized as putative recombinants (Supplementary Table S2). Then, we analyzed them using jpHMM, which is a probabilistic model-based method for predicting inter-genotypic recombinants 27 . All 24 pure strains, as well as 8 inter-genotypic recombinants were characterized by both genotyping methods (Supplementary Table S2, Supplementary Figure S1). Although 2 out of the 10 recombinants were identified with ambiguous recombination events (FJ361772, KJ586811), the consistent predictions of randomly selected pure strains by both methods suggested good quality of the remaining dataset (Supplementary  Table S2). After exclusion of inter-genotypic recombinants, we filtered the remaining dataset to remove strains infecting non-human primates (n = 105) (Supplementary Table S3), strains with insertion/deletion polymorphisms (n = 941) and population outliers (n = 14) (Supplementary Table S4 Table S4).
The remaining samples showed a clear division of genetic background with respect to their genotypes. In the Principal Component Analysis (PCA) 28 of all 2992 genomes, strains from each genotype were clustered and no obvious admixture between clusters was observed (Fig. 1A). Phylogenetic analysis also confirmed the classification: the root branches of each clustered genotype sub-clade were robust under bootstrapping test and no outlier was observed (Fig. 1B). Given these, we applied the 2992 sequences to our further analysis as the final dataset, which represented distinct and unmixed genetic groups.
Selective constraints on genome of HBV from different genotypes. With the 2992 genome dataset, we next examined the nucleotide polymorphisms within eight genotypes to see whether the inner-genotypic evolution of HBV was constrained by selection. The extent of polymorphisms on different genomic regions was measured by the Z-transformed pair-wise nucleotide difference Z π , using a 100-bp sliding window with a step size of 50 bp (See Materials and Methods). The highly diverse regions, referring to the peaks on curves of Z π , were mostly correlated with non-overlapping regions ( Fig. 2A), indicating presence of constraints on regions with dual ORFs. A general reduction of diversity on overlapped regions, including the S vs. P and X vs. P, were observed in genotypes A-F and H. Specifically, there was a peak at X vs. P region for genotype G, but we found its sample number (n = 27) and genetic variation too small for giving a robust estimation. Moreover, an exceptional peak was observed at the overlapping region PreS vs. P, correlated with the fact that the Spacer domain of P ORF within this region is mutation-prone 19,20 . Interestingly, this signal exhibited a genotype-specific pattern, which is dominant in genotype F and G, and slight or absent in other genotypes ( Fig. 2A).
As we described in the former section, our current understanding of selective pressures acting on HBV is controversial in respect of overlapping gene evolution. To solve this problem, we proposed a specified definition of variant types regarding overlapping regions. Briefly, we defined three types of non-synonymous variants and one type of synonymous variants: 1) non-overlapping non-synonymous (NNS) variants; 2) independent non-synonymous (INS) variants, which is present in overlapping regions and causes only one amino acid alteration; 3) dual non-synonymous (DNS) variants, which causes amino acid mutations of two overlapping genes; and 4) synonymous variants, which consist of both overlapping and non-overlapping mutations that cause no protein alterations (An example of determining 4 types of variants was showed in Supplementary Figures S3 and  S4). In particular, we have referred to the previously described concept of independently evolving sites 18 when we defined INS variants, nevertheless the two definitions are still different as INS variants are not restricted to a specific codon position. More concretely, the independently evolving site of S ORF is the P3/S2 codon position 18 , whilst the P3/S2 site contains both INS and DNS variants of S gene. In most cases, INS variant is a better estimator for independent gene evolution, as a minority of variants at independently evolving sites (like P3/S2) are jointly non-synonymous mutations of multiple ORFs.
Based on the new classification of variants, we next compared the distribution of allele frequencies between non-synonymous (INS + DNS + NNS) and synonymous mutations. The frequency of a variant (allele) was defined as the proportion of strains carrying the specific mutation within a genotype (See Materials and Methods). According to the histograms, more non-synonymous variants than synonymous variants were found to be singletons or enriched at low frequencies (Fig. 2B). This feature indicates that the non-synonymous variants  of HBV genome tend to be preserved at low genetic diversities compared with synonymous variants, suggesting an overall strict selection (on average) acting on the HBV proteins within genotypes.

Evolution of non-synonymous variants in genotypic differentiation.
To figure out whether the diversification of HBV genotypes is driven by natural selection, we subsequently performed the MK test and computed its related statistic α and AS (see Materials and Methods). Briefly, the MK test measures positive selection by comparing the fixed/polymorphic mutation count ratio between putatively selected sites (e.g. non-synonymous sites) and synonymous sites. In the case of the present study, the fixed mutations were defined as inter-genotypic differences, and the polymorphic mutations were defined as inner-genotypic variations (The method was described in Materials and Methods, and examples of determining fixed and polymorphic mutations were showed in Supplementary Figures S3 and S4).
One common concern of the MK test is that the presence of slightly deleterious mutations will downwardly bias the estimation of positive selection. This effect can be reduced by excluding variants at low frequencies, although the underestimation still exist 29 . However, in our data, the power of the MK test to detect positive selection is satisfactory. As expected, the level of adaptive selection AS in low-frequency non-synonymous variants (Derived Allele Frequency < 1% and singletons) is much lower than neutral sites (AS < 0) (Fig. 3A). After excluding these variants, the number of fixed mutations is lower than or nearly neutral in NNS sites (AS = −0.145), but slightly higher in INS (AS = 0.236) and DNS (AS = 0.438) sites (Fig. 3B, Table 1). Meanwhile, the estimation of AS is more consistent among genotypes in high-frequency variants, indicating a relatively constant ratio of non-synonymous to synonymous substitution rate among different genotypes (Fig. 3B, Supplementary  Table S5). By χ 2 tests 30 over the 2 × 2 mutation count table regarding three types of non-synonymous sites (See Materials and Methods), we identified the signature of AS which showed statistical significance in DNS variants (P-value = 0.0114), close to significance in INS variants (P-value = 0.0651), and no significance in NNS variants (P-value = 0.2735) ( Table 1). It should be noted that the heterogeneity of AS among DNS, INS and NNS variants did not prove that adaptive evolution is determined by the type of variants, because the three types of non-synonymous variants distributed differently among proteins. The essential conclusion is that adaptive evolution does affect the fixation of non-synonymous variants, and is likely enriched in overlapping regions because DNS and INS show more positive AS value than NNS variants. Moreover, the extent of adaptive evolution can       Table 2). This is explained by the fact that PreS1 is largely overlapped with the less functionally relevant Spacer domain of P ORF 31 , thus evolution of DNS sites on PreS1 is less affected by gene overlaps.
Given the remarkable signature of adaptive evolution on PreS1, we next analyzed its 119 amino acid positions to search for fixed amino acid differences among genotypes. In particular, we identified 47 positions where at least one genotype possessed an alternatively dominant amino acid, and characterized a total of 76 alternative amino acid variants (Supplementary Table S7). This number is slightly lower than that of nucleotide variants (INS + DNS in PreS1: 19 + 68 = 87 nucleotide variants), because some amino acid changes comprise more than one step of non-synonymous nucleotide substitutions. We showed in Table 3 the genotype-specific amino acids of PreS1, which were mutated and became dominant in 3 or less out of the 8 genotypes. According to the definition of the MK's α statistic, α of PreS1 INS variants equals 0.555 and DNS variants equals 0.689, suggesting that ~55-70% of the differentiated amino acids (42-53 out of the 76 variants), were driven by positive selection. Thus, the differentiated amino acids of PreS1, which we listed here, comprise a number of potential genetic determinants of HBV fitness.
The other signatures of positive AS were found in P and X. Whilst the absolute AS values are approximate in PreS2 (AS = 0.265), P (AS = 0.267) and X (AS = 0.340), we found that they were of statistical significance in P (P-value = 0.0312), close to significance in X (P-value = 0.0669) but no significance in PreS2 (P-value = 0.4839). Given these results, we assume that the evolutions of P, X and PreS2 are driven by slightly positive selection, however only the longest P gene has accumulated adequate mutations to reach the power of a statistical test.
In contrast, C gene represents the most conserved part of the genome during genotypic differentiation, with the most negative AS value (AS = −1.855, P-value = 1.08 −10 ) among all tested sites (Fig. 4A, Table 2). It is controversial whether C gene indeed has less number of fixed variants than neutral sites, because AS is possibly underestimated. What is certain, however, is that C gene shows much less tendency to undergo adaptive evolution than PreS1/PreS2/S, P and X genes. The most negative AS value suggests that C gene is the least related to viral adaptations among all HBV genes, therefore the sequence tends to remain stable among lineages.
Based on these findings, we concluded that the PreS1 domain has experienced adaptive evolution during differentiation of HBV genotypes, evidenced by elevated numbers of fixed INS and DNS variants, while the C gene remains conserved. Although the overlap of PreS1 and the Spacer domain gives rise to more flexibility in their DNS variants, it cannot fully explain the increasing proportion of fixed INS variants for PreS1. Our results suggest that the function of L surface protein is crucial for the evolution of HBV genotypes.  Table 3. Genotype-specific amino acid variants in PreS1 region. a The viral genotype(s) and their specific PreS1 amino acid residues are listed. Only genotype-specific variants in three or less genotypes are showed. b The genotype-specific amino acid variants are represented by their positions on PreS1 peptide (from 1st to 119th residues) and the amino acid symbol.

Discussion
The initial step of HBV entry is attachment to the membrane of hepatocytic cell. This process is mediated by the interaction between the PreS1 domain of L protein and the HBV entry receptor of human hepatocytes, namely NTCP 32 . In particular, the N terminal 75 or 77 amino acids of PreS1 are responsible for its activity in viral infection 33,34 . On a different note, the PreS domain (PreS1 + PreS2) also plays an essential role in the interaction with host immune response, as it contains several immunogenic T-and B-cell epitopes [35][36][37][38] . Deletion mutations in the PreS1 domain are correlated with occult HBV infections (surface antigen level in serum is undetectable) in genotype C 39,41 , which is regarded as a potential mechanism to escape clearance mediated by immune cells 41 . Given these critical functions in viral infections, our data shows that PreS1 is of vital importance in genotypic evolutions, represented by a high AS signature. Indeed, both attachment to hepatocytes and survival under immune response are highly correlated with viral fitness, to ensure successful infection in human hosts.
Besides the adaptive feature of PreS1 domain, another interesting finding of out study is the conservation of C gene among genotypes. The C gene encodes nucleocapsid protein, which serves as the container of partially double-strand DNA and is enveloped by three types of viral surface proteins (L, M, S proteins) 42 . After entry into the cytoplasm, the capsid is released from the envelope and transported towards the nucleus 43 . Its interaction with host is relatively weak, as the mature capsid (DNA-containing) is merely exposed after successful invasion. Thus, C gene exhibits a trend to preserve its sequence between heterogeneous viral genotypes, probably because it is not directly involved in adaptation to host.
The advantageous strains of a viral species are ones that survive better in the host and show higher efficiency of replications and infections. The signature of adaptive evolution in PreS1 suggests that HBV gains fitness potentially through improving infections or evading immune responses. Nevertheless, the remaining questions are what exact adaptation HBVs have experienced and what outcomes have been resulted from different PreS1 sequences. One explanation for the diversification of PreS1 domain is that the humans correlated with each viral genotype adapt to infections differently (e.g. by acquirement and fixation of different resistant mutations), thus the viruses evolve accordingly and constantly into diverse directions. This hypothetic model is an application of the classical evolutionary theory called "Red Queen hypothesis" 44 , which proposes an ever-existing evolutionary race between two counter-organisms caused by the conflict of fitness. This theory is also successful in describing the process where virus diverges to infect different mammal species 45 . Regarding the functions of PreS1, its strong signature of selection suggests an ever-changing adaptation of the viruses merely to ensure successful infection, as the defensive mechanisms of hosts might also be updated constantly. However, a limitation of this hypothesis is the lack of proof that humans have carried HBV-resistant mutations as a consequence of adaptive process, parallel with HBV genotypic divergence. An association study of Chinese population has surveyed common variants in regulatory regions of NTCP, but failed in finding correlation with HBV susceptibility 46 . Several genome wide approaches have discovered in Asian populations that variants of human leukocyte antigen (HLA) DP loci affect HBV infections [47][48][49][50][51] , but it remains unclear whether alleles correlated with better clinical outcomes are favored by positive selection.
Another possible outcome of PreS1 differentiation is correlated with the change of viral transmission route. The shift of main transmission route (vertical vs. horizontal), which HBV genotypes have experienced, is regarded as a consequence of geographic and demographic features of their infected populations 10,11 . More concretely, genotypes in endemic regions (e.g. genotype B, C) mainly spread through perinatal or vertical transmission, whilst genotypes widely spreading in and between continents (e.g. genotype A, D) often transmit horizontally 10,11 . The findings of our study now raises a thoughtful question: is this shift an adaptive process? As commonly expected, traits that favor vertical transmission show advantages in restricted human populations, while qualities that assist horizontal infections are beneficial in epidemic regions. Meanwhile, a signature of positive selection should be present on their genetic determinants. Our data shows that PreS1 exhibited the highest extent of adaptive evolution compared with other HBV genes. Although PreS1 is directly correlated with viral infection, it remains unclear whether the L protein or other HBV proteins contribute to the preference of transmission route among genotypes. A future direction regarding this issue is to look for the genetic determinants of vertical-and horizontal-transmission preference in HBV if they do exist.
An additional contribution of this study is the application of a new definition of protein-altering and neutral variants of overlapping genes, by which the independent or joint gene evolutions could be clearly described using statistical test of neutrality. Based on Nei and Gojobori's d N /d S statistic 21 , a number of studies have reported a general trend for evolution of overlapping regions in virus: one reading frame is subject to strict selection (d N /d S < 1), whilst the other reading frame underlies relaxed selection (d N /d S > 1) 6,19,22 . In HBV, the PreS1 domain of S ORF is found to be strictly constrained, whilst the Spacer domain of its overlapping P ORF is prone to non-synonymous mutations 19 . However, we argue that these outcomes of codon-based statistics might be misread, because the synonymous sites of overlapping regions, which are often non-synonymous in one of the reading frames, do not evolve neutrally. In fact, a decreased rate of synonymous mutations (d S ) in overlapping versus non-overlapping regions has been demonstrated in various viral species 5,6 . In such case, the codon-based estimation of selective pressure in overlapping genes is under question, because the neutral substitution rate d S is occasionally underestimated. Thus, we propose the use of an accurate definition of neutral variants, as well as an improved classification of independent and joint amino acid mutations referring to previous work 18 . We applied the MK test to evaluate adaptive evolution of overlapping genes, because of its flexibility in dealing with any putative variants (e.g. INS, DNS, and even non-coding) 23 as long as the neutral substitutions are well defined.
In conclusion, this study provides a framework to understand adaptive evolution of viral genomes, and improves current methodologies to better handle with genetic data in respect of gene overlaps. Using a classical statistical test of neutrality, this study reveals signatures of adaptive evolution in HBV proteins, thereby sheds a light into the exploration of virus-human co-evolution and future treatment to hepatitis B.

Materials and Methods
Data collection and pre-processing. The key word "Hepatitis B virus genome" was used to search against NCBI nucleotide database. The result was downloaded as fasta format, containing 8653 sequences. A filtering step was adopted to remove incorrect and low quality sequences, defined as: 1) sequence length <3100 or >3300 bp (out of range for a common and complete HBV genome); 2) sequences with >10 ambiguous bases; 3) redundant sequences. A total of 1888 sequences were removed in this step.
The circular genome sequences were modified to start with the ORF of P gene. All nucleotide sequence alignments in the present study were performed using ClustalW-MPI 52,53 , with all parameters set as defaults. BioEdit 54 was used to manually modify the mismatches of codons.

Exclusion of inter-genotypic recombinants and population outliers.
Inter-genotypic recombinants are potential admixture of genetically distant viral populations (in particular, each distinct HBV genotype is defined as one viral population), which largely bias the estimation of inner-and inter-genotypic diversities. To detect inter-genotypic recombinants, we adopted a fragment typing approach 25,26 . First, the alignment of 6765 HBV sequences were devided into 250-bp fragments. Then, for each fragment, consensus sequences of 8 major human HBV genotypes (A-H) and 5 primate HBV species (chimpanzee, gorilla, gibbon, orangutan, woolly monkey) were generated based on subsets of the sequences with known genotypes (according to NCBI annotation). For the reason that majority of genotype B strains in public data were considered as B/C hybrid 26 , the consensus sequence of genotype B was calculated based on 32 pure B j strains (AB010289-92, AB014366, AB073838, AB073842-58, AB106884-85, AB205121, D00329, D23677-79, D50521-22) 26 . The consensus fragmental sequences were used to construct blast database. Then, for each genome sequence, we performed BLASTN search to find the best hit for each 250 bp fragment and assign its putative "fragmental genotype" (Supplementary Table S1). Genomes with identical genotypes for all 250-bp fragments were considered as pure strains without inter-genotypic recombination, for example, "A1-A2-A3-A4-A5-A6-A7-A8-A9-A10-A11-A12-A13" for pure genotype A. To validate the method of detecting inter-genotypic recombinants and pure genotypes, we randomly select 24 pure strains (3 from each genotype) and 10 potential recombinants defined by fragment typing (Supplementary Table S2), and submit their sequences to jpHMM web-server, which is probabilistic model-based software to predict and visualize inter-genotypic recombinants 27 . Visualized results of four example strains were showed in Supplementary Figure S1. After excluding inter-genotypic recombinants and strains infecting non-human primates, 3947 sequences of human HBV genotype A-H were preserved (Supplementary Table S3).
Based on separate alignments of genome sequences per genotype, we further excluded 941 sequences with insertions or deletions relative to their own genotypic consensus sequence (Supplementary Table S4), to ensure the high quality of sequence alignment for further analysis. In the final step, we visualized the genetic distances within the 3006 sequences by PCA (performed using our in-house script in R language) and excluded 14 population outliers (Supplementary Figure S1, Supplementary Table S4). The outliers were determined by visual inspection and manual check on the values of principal components in each genotype.
Population genetic analysis. Phylogenetic analysis of the 2992 HBV genomes were performed using MEGA5 55 . The genetic distance between sequences, measured by the number of substitutions per site, was constructed using the Maximum Composite Likelihood model. The phylogenetic tree was constructed using Neighbor Joining method 56 , with 100-time bootstrapping test.
Statistic π (pair-wise nucleotide differences) were calculated using DnaSP 57 with a 100-bp sliding window at 50-bp step to measure the extent of genetic polymorphism within different viral genotypes. A standardized Z π is then calculated as where μ stands for the mean and σ stands for the standard deviation of π from all 100-kb windows. Derived alleles of a polymorphic site are defined as the alternative nucleotides distinguished from the dominant one. For example, at a single nucleotide polymorphic (SNP) site, 7 of 10 genotype H strains exhibits nucleotide residue A, another two has T and one has C. Therefore, A is the major allele of the SNP site, while T and C are two derived alleles, and their derived allele frequency (DAF) are 20% and 10%.
Categorization of variants in overlapped regions. The variants were summarized into several categories according to their protein sequence outcomes: 1) non-overlapping non-synonymous (NNS) mutations; 2) independent non-synonymous (INS) mutations, which occur in overlapping ORFs, but only cause single protein alteration (for example, a non-synonymous mutation on PreS1 but synonymous on P); 3) dual non-synonymous (DNS) mutations, which cause dual protein alterations from two different ORFs at the same time; and 4) synonymous mutations, which caused no amino acid change in either overlapping or non-overlapping regions. The inner-genotype polymorphic mutations and inter-genotype fixed mutations were characterized using our in-house perl script. Mutations were categorized into four variant types described above by comparing the mutated codons with the consensus codons of its genotype, which represented the ancestral state of the polymorphic site (Supplementary Figure S3). Fixed mutations were determined similarly based on an alignment of eight HBV genotypic consensus genomes (Supplementary Figure S4).
Analysis of positive selection during population differentiation. The present study used the MK test and its related statistics to measure the positive selection during genotypic differentiation. The original MK approach used the statistic α to measure the proportion of divergence driven by positive selection 24 . α is calculated as Scientific RepoRts | 7: 1990 | DOI:10.1038/s41598-017-02012-8 If the evolution of X is not driven by selection, the ratio of fixed mutations and polymorphic mutations on X should equal that of neutral sites, therefore AS X = 0. When AS X > 0, positive selection drives the fixation of X and increase the proportion of fixed mutation between genotypes. AS X < 0 suggests an opposite trend of adaptive evolution, where less fixed mutations are observed than the expected number. Confidence intervals of AS X were computed from a non-parametric bootstrapping procedure as described previously 23 to evaluate robustness of statistic AS X based randomly selected genomic sites. χ 2 tests 30 were performed on D X , D S , P X , P S to infer the probability for the observed number of fixed mutations in X under neutral evolution.