Hepatitis B virus (HBV) is a major cause of human hepatitis. There is considerable uncertainty about the timescale of its evolution and its association with humans. Here we present 12 full or partial ancient HBV genomes that are between approximately 0.8 and 4.5 thousand years old. The ancient sequences group either within or in a sister relationship with extant human or other ape HBV clades. Generally, the genome properties follow those of modern HBV. The root of the HBV tree is projected to between 8.6 and 20.9 thousand years ago, and we estimate a substitution rate of 8.04 × 10⁻⁶ to 1.51 × 10⁻⁵ nucleotide substitutions per site per year. In several cases, the geographical locations of the ancient genotypes do not match present-day distributions. Genotypes that today are typical of Africa and Asia, and a subgenotype from India, are shown to have an early Eurasian presence. The geographical and temporal patterns that we observe in ancient and modern HBV genotypes are compatible with well-documented human migrations during the Bronze and Iron Ages (refs 1, 2). We provide evidence for the creation of HBV genotype A via recombination, and for a long-term association of modern HBV genotypes with humans, including the discovery of a human genotype that is now extinct. These data expose a complexity of HBV evolution that is not evident when considering modern sequences alone.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
B.B. thanks D. Tserendulam for help, wisdom and guidance. We thank S. Rankin and the staff of the University of Cambridge High Performance Computing service and the National High-throughput Sequencing Centre (Copenhagen). This work was supported by: The Danish National Research Foundation, The Danish National Advanced Technology Foundation (The Genome Denmark platform, grant 019-2011-2), The Villum Kann Rasmussen Foundation, KU2016, European Union FP7 programme ANTIGONE (grant agreement No. 278976), and European Union Horizon 2020 research and innovation programmes, COMPARE (grant agreement No. 643476), VIROGENESIS (grant agreement No. 634650). The National Reference Center for Hepatitis B and D Viruses is supported by the German Ministry of Health via the Robert Koch Institute (Berlin). B.B. was supported by Taylor Family-Asia Foundation Endowed Chair in Ecology and Conservation Biology. A.D.M.E.O. was supported by N-RENNT of the Ministry of Science and Culture of Lower Saxony, Germany.
Reviewer information
Nature thanks P. Simmonds, B. Shapiro, C. Pepperell and the other anonymous reviewer(s) for their contribution to the peer review of this work.
Extended data figures and tables
The frequencies of the mismatches observed between the HBV reference sequences (Supplementary Table 3) and the reads are shown as a function of distance from the 5′ end. C > T (5′) and G > A (3′) mutations are shown in red and blue, respectively. All other possible mismatches are shown in grey. Insertions are shown in purple, deletions in green and clippings in orange. The count of reads matching HBV for each sample is shown in parentheses. a, Damage patterns for RISE563, DA222, DA119, RISE254, DA195, DA27, DA51, RISE386, RISE387, DA29, DA45, RISE416 and RISE154. b, Damage patterns for DA222 without (left) and with (right) USER treatment. c, Damage patterns with 10, 20, 50, 100, 200, 500 and 1,000 reads sampled from RISE563, in which each opaque line corresponds to one replicate set of reads.
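The position-wise mismatch frequency described in this caption can be illustrated with a minimal sketch. It is not the pipeline used in the study: the function name `c_to_t_by_position`, the gap-free pairwise input format and the toy reads are all assumptions made for illustration.

```python
from collections import Counter


def c_to_t_by_position(alignments, max_pos=25):
    """Return the C>T mismatch frequency at each distance from the 5' end.

    `alignments` is an iterable of (reference, read) string pairs that are
    already aligned, gap-free and of equal length (a simplifying assumption).
    Positions with no reference C are reported as None.
    """
    ref_c = Counter()   # per position: number of reads whose reference base is C
    c_to_t = Counter()  # per position: how many of those reads show T instead
    for ref, read in alignments:
        for pos, (r, q) in enumerate(zip(ref.upper(), read.upper())):
            if pos >= max_pos:
                break
            if r == "C":
                ref_c[pos] += 1
                if q == "T":
                    c_to_t[pos] += 1
    return [c_to_t[p] / ref_c[p] if ref_c[p] else None for p in range(max_pos)]


# Toy data, invented for illustration only (not reads from the study).
toy = [
    ("CACGT", "TACGT"),  # C>T at position 0
    ("CCGTA", "TCGTA"),  # C>T at position 0, C retained at position 1
    ("CAGCT", "CAGCT"),  # no mismatch
]
freqs = c_to_t_by_position(toy, max_pos=5)
```

In authentic ancient DNA the C > T frequency is expected to be highest at the first positions and to decay with distance from the 5′ end. The G > A signal at the 3′ end could be tallied the same way by reversing each pair and counting reference G against read A.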
This figure shows 26 Orthohepadnaviridae sequences (dataset 1, see Methods), including the ancient HBV sequences. Ancient genotype A sequences are shown in red, the ancient genotype B sequence in orange, ancient genotype D sequences in blue and novel genotype sequences in green. The tree was constructed in PhyML (ref. 60), optimizing for topology, branch lengths and rates, with 100 bootstraps (see Methods). Internal nodes with < 70% bootstrap support are shown as polytomies.
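Displaying weakly supported nodes as polytomies amounts to dissolving each such internal node into its parent. A minimal sketch of that operation follows. The `Node` class and `collapse_low_support` function are hypothetical, branch lengths are ignored, and this is not the tooling used to draw the figure.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str = ""
    support: Optional[float] = None  # bootstrap percentage; None for tips and root
    children: List["Node"] = field(default_factory=list)


def collapse_low_support(node, threshold=70.0):
    """Return a copy of the tree with weakly supported internal nodes removed.

    The children of a removed node are attached directly to its parent,
    which turns the parent into a polytomy. The root is never removed.
    """
    new_children = []
    for child in node.children:
        collapsed = collapse_low_support(child, threshold)
        is_internal = bool(collapsed.children)
        is_weak = collapsed.support is not None and collapsed.support < threshold
        if is_internal and is_weak:
            new_children.extend(collapsed.children)  # dissolve into the parent
        else:
            new_children.append(collapsed)
    return Node(node.name, node.support, new_children)


# Toy tree, invented for illustration: ((A,B)95,((C,D)40,E)80)
tree = Node(children=[
    Node(support=95, children=[Node("A"), Node("B")]),
    Node(support=80, children=[
        Node(support=40, children=[Node("C"), Node("D")]),
        Node("E"),
    ]),
])
result = collapse_low_support(tree)
```

In the toy tree the (C,D) node has 40% support, so C, D and E end up as three children of the same 80% node, while the well-supported (A,B) node is kept.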
RDP4 (ref. 51) was used to analyse the set of 12 ancient sequences plus a representative set of 15 modern human and non-human primate sequences (see Methods). The seven recombination programs used by RDP4 suggested that all genotype A sequences are recombinants, with the genotype D sequence HBV-DA51 as the minor parent and an unknown major parent. The obvious interpretation is that recombination formed an ancestor of the oldest sequences, evidence of which is still present in the less-ancient and the modern representatives. The figure shows the graphical evidence and predicted recombination break-point distribution for the two oldest genotype A sequences, HBV-RISE386 and HBV-RISE387, according to three of the RDP4 methods (MaxChi, Bootscan and RDP). In all subplots, the predicted location of the break points is shown as a dashed vertical line and the surrounding grey area shows the 99% confidence interval for the break point. Subplots on the same row share their y axis and those in the same column share their x axis. a, HBV-RISE386 analysed by MaxChi. b, HBV-RISE386 analysed by Bootscan. c, HBV-RISE386 analysed by RDP. d, HBV-RISE387 analysed by MaxChi. e, HBV-RISE387 analysed by Bootscan. f, HBV-RISE387 analysed by RDP.
The sequences from dataset 2 (see Methods) and the ancient sequences were aligned in MAFFT (ref. 59). The tree was constructed in PhyML (ref. 60), optimizing for topology, branch lengths and rates, with 100 bootstraps (see Methods). Internal nodes with < 70% bootstrap support are shown as polytomies. Ancient genotype A sequences are shown in red, ancient genotype B sequences in orange, ancient genotype D sequences in blue and novel genotype sequences in green. Taxon names indicate: genotype or subgenotype, GenBank accession number, age, abbreviation of country of sequence origin, region of sequence origin, host species and optional additional remarks. Note that the maximum likelihood tree shows topological uncertainty (polytomies) in areas where the BEAST2 (ref. 25) tree (Fig. 2) is well resolved. This is the case for two reasons. First, BEAST2 always produces a fully resolved binary topology without polytomies. Second, and more important, BEAST2 creates a time tree and uses tip dates to constrain the possible topologies under consideration. Thus, BEAST2 can know that certain topologies are unlikely or impossible, whereas maximum likelihood cannot and thus inherently has greater uncertainty regarding tree topology.
a, Regression of root-to-tip distances and ages performed in SciPy (http://www.scipy.org). One hundred and twenty-four branch lengths were extracted using TempEst (ref. 62) from trees inferred using neighbour joining, maximum likelihood and Bayesian methods. Shaded areas show 95% confidence intervals. Slopes are 1.01 × 10⁻⁵, 1.20 × 10⁻⁵ and 4.21 × 10⁻⁶, and correlation coefficients are 0.45 (R² = 0.2), 0.36 (R² = 0.13) and 0.51 (R² = 0.26), for maximum likelihood, Bayesian and neighbour joining trees, respectively. b, Date randomization tests under the strict clock model. The median and 95% HPD interval for the substitution rates are given. The rate for the correctly dated tree is shown in red. Dates were randomized within all sequences, within the ancient sequences only, and within each genotype. We performed three replicates of each. None of the 95% HPD intervals for the randomized runs overlaps with the 95% HPD intervals for the correctly dated runs, suggesting the presence of a temporal signal in the data.
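The regression in panel a can be sketched as an ordinary least-squares fit of root-to-tip distance on sampling date, in which the slope approximates the substitution rate. The caption states that SciPy was used; the sketch below uses only the Python standard library so that it is self-contained (`scipy.stats.linregress` would give the same slope, intercept and correlation coefficient). The function name and the toy numbers are invented for illustration and are not values from the study. Reading the x-intercept as a rough root date is a general convention of root-to-tip analysis, not a claim made in the caption.

```python
import math


def root_to_tip_regression(dates, distances):
    """Ordinary least-squares fit of root-to-tip distance on sampling date.

    Returns (slope, intercept, r, root_date), where the slope approximates
    the substitution rate (substitutions per site per year) and root_date
    is the x-intercept, that is, the date at which the fitted distance is zero.
    """
    n = len(dates)
    mean_x = sum(dates) / n
    mean_y = sum(distances) / n
    sxx = sum((x - mean_x) ** 2 for x in dates)
    syy = sum((y - mean_y) ** 2 for y in distances)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dates, distances))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    root_date = -intercept / slope
    return slope, intercept, r, root_date


# Toy data, invented for illustration: calendar years (negative = BCE) and
# distances that lie exactly on a line with rate 1e-5 and root at -10000.
dates = [-2500, -1000, 500, 2000]
distances = [0.075, 0.090, 0.105, 0.120]
slope, intercept, r, root_date = root_to_tip_regression(dates, distances)
```

Real data scatter around the line, which is why the caption reports correlation coefficients well below 1 alongside the slopes.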
This file is in PDF format and contains: Three Supplementary Tables: SI Tables 1 and 2 describe the number of reference genomes and accession numbers of sequences used to design capture probes. SI Table 3 contains additional information for the HBV positive samples. A Supplementary Methods section, showing: 1) An investigation into the dependence of damage patterns on the number of reads, 2) Lists of accession numbers for sequences included in the different analyses, and 3) The three phylogenetic trees used for the regression analysis, inferred using neighbour joining, maximum likelihood and Bayesian methods.