Main

HBV is transmitted perinatally or horizontally via blood or genital fluids3. The estimated global prevalence is 3.6%, ranging from 0.01% (UK) to 22.38% (South Sudan)4. In high endemicity areas, in which prevalence is over 8%, 70–90% of the adult population show evidence of past or present infection5 (http://www.who.int/mediacentre/factsheets/fs204/en/). The young and the immunocompromised are most likely to develop chronic HBV infection, which can result in high viraemia over years to decades3. Approximately 257 million people are chronically infected and around 887,000 people died in 2015 owing to associated complications (http://www.who.int/mediacentre/factsheets/fs204/en/).

Despite the prevalence and public health impact of HBV, its origin and evolution remain unclear6,7. Inference of HBV nucleotide substitution rates is complicated by the fact that the virus genome consists of four overlapping open reading frames8, and that mutation rates differ between phases of chronic infection9. Studies based on heterochronous sequences, sampled over a relatively short time period, find higher substitution rates, whereas rates estimated using external calibrations tend to be lower, leading to a wide range of estimated HBV substitution rates (7.72 × 10⁻⁴ to 3.7 × 10⁻⁶ substitutions per site per year)10,11,12. Human HBV is classified into at least nine genotypes (A–I) based on sequence similarity of at least 92.5% within genotypes13, with a heterogeneous global distribution7,8 (Fig. 1a). Attempts to explain the origin of genotypes using human migrations have been inconclusive. The hypothesis that HBV co-evolved with modern humans as they left Africa 60–100 thousand years ago (ka) has been contested owing to the basal phylogenetic position of genotypes F and H, which are found exclusively in the Americas6. HBV also infects non-human primates, and the human and other great ape HBVs are interspersed in the phylogenetic tree, possibly owing to cross-species transmission14. Given the variability of estimated substitution rates, the incongruence of the tree topology with some human migrations and the mixed topology of the non-human primate and human HBV sequences in the phylogenetic tree, there remains considerable uncertainty about the evolutionary history of HBV.

Fig. 1: Geographical distribution of analysed samples and modern genotypes.

a, Distribution of modern human HBV genotypes7. Genotypes relevant to this Letter are shown in colour. Coloured shapes indicate the locations of the HBV-positive samples included for further analysis. b, Locations of analysed Bronze Age samples1 are shown as circles and Iron Age and later samples2 are shown as triangles. Coloured markers indicate HBV-positive samples. Ancient genotype A samples are found in regions in which genotype D predominates today, and HBV-DA27 is of sub-genotype D5 which today is found almost exclusively in India.

Recent advances in the sequencing of ancient DNA (aDNA) have yielded important insights into human evolution, past population dynamics15 and diseases16,17. However, ancient sequences have been recovered for only a handful of exogenous human viruses, including influenza virus (sample approximately 100 years old)18, variola virus (sample approximately 350 years old)19 and HBV (samples approximately 340 and 450 years old)20,21. The knowledge gained from these cases emphasizes the general importance of ancient sequences for the direct study of long-term viral evolution. HBV has several characteristics that make it a good candidate for detection in an aDNA virus study: its extended high viraemia during chronicity3, the relative stability of its virion22, and its small, circular and partially double-stranded DNA genome8.

Shotgun sequence data were previously generated from 167 Bronze Age1 and 137 predominantly Iron Age2 individuals from central to western Eurasia with a sample age range of approximately 7.1–0.2 thousand years (kyr) old. We identified reads that matched the HBV genome in 25 samples (Table 1, Extended Data Table 1a and Supplementary Table 3), spanning a period of almost 4,000 years, from several different cultures and with a broad geographical range (Fig. 1b, Table 1, Extended Data Table 1a and Supplementary Table 3). Using TaqMan PCR, we tested two samples (DA195 and DA222) with high genome coverage and two samples (DA85 and DA89) with low genome coverage for the presence of HBV. The high-coverage samples tested positive, whereas the low-coverage samples tested negative (Extended Data Table 1b). This is consistent with shotgun sequencing being more effective than targeted PCR for analysing highly degraded DNA23. On the basis of the availability of sample material, libraries from 14 samples were selected for targeted enrichment (capture) of HBV DNA fragments (Supplementary Tables 1, 2). This resulted in increased genome coverage and an average of a 2.4-fold increase in the number of HBV-positive reads (Extended Data Table 1a and Supplementary Table 3). We obtained 17.9–100% HBV genome coverage from the sequence data, with genomic depth ranging from 0.4× to 89.2× (Table 1 and Extended Data Table 1a). We selected 12 samples for phylogenetic analyses. Criteria for inclusion were at least 50% genome coverage and clear aDNA damage patterns after capture (Extended Data Fig. 1).
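The coverage and depth figures quoted above can be illustrated with a minimal sketch. It assumes that mapped reads are available as 0-based, half-open (start, end) spans on the reference; the function names are hypothetical and this is not the pipeline used in the study. Only the 50% coverage criterion is modelled; the aDNA damage criterion requires a separate check.

```python
from typing import Iterable, Tuple


def coverage_and_depth(read_spans: Iterable[Tuple[int, int]],
                       genome_length: int) -> Tuple[float, float]:
    """Return (fraction of positions covered, mean depth).

    read_spans holds 0-based, half-open (start, end) intervals. An end
    beyond genome_length is wrapped around, as an illustrative way of
    handling reads that span the origin of a circular genome.
    """
    depth = [0] * genome_length
    for start, end in read_spans:
        for pos in range(start, end):
            depth[pos % genome_length] += 1
    covered = sum(1 for d in depth if d > 0)
    return covered / genome_length, sum(depth) / genome_length


def passes_coverage_criterion(coverage_fraction: float,
                              minimum: float = 0.5) -> bool:
    """Inclusion rule from the text: at least 50% genome coverage."""
    return coverage_fraction >= minimum
```

For example, two 60-nucleotide reads overlapping by 30 positions on a toy 100-nucleotide genome give 90% coverage at a mean depth of 1.2×.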

Table 1 Overview of samples used for phylogenetic analyses

For an initial phylogenetic grouping, we estimated a maximum likelihood tree using the ancient HBV genomes together with modern human, non-human primate, rodent and bat HBV genomes (dataset 1, see Methods). All ancient viruses fell within the diversity of Old World primate HBV genotypes, which includes all human and other great ape genotypes with the exception of human genotypes F and H (Extended Data Fig. 2).

Recombination is known to occur in HBV24. We found strong evidence that an ancient sequence (HBV-DA51) and an unknown parent recombined to form the ancient genotype A sequences. Although this cannot literally be the case owing to sample ages, the logical interpretation is that an ancestor of HBV-DA51 was involved in the recombination. The same recombination is also suggested for the two modern genotype A sequences that were included in the analysis. The ancient genotype B (HBV-DA45), a modern genotype B and two modern genotype C sequences were not similarly flagged, which suggests that the possible recombination occurred after genotypes A, B and C had diverged. The predicted recombination break points (Extended Data Table 2 and Extended Data Fig. 3) correspond closely to the polymerase gene. It is therefore possible that the polymerase from an unknown parent and the remainder of the genome from an HBV-DA51 ancestor recombined to form the now-ubiquitous genotype A about 7.4–9 ka (Fig. 2, Extended Data Table 3b and Methods). Similar recombination events that involved the creation of genotypes E, G and a currently circulating B/C recombinant have previously been identified24.

Fig. 2: Dated maximum clade credibility tree of HBV.

A log-normal relaxed clock and coalescent exponential population prior were used. Grey horizontal bars indicate the 95% HPD interval of the age of the node. Larger numbers on the nodes indicate the median age and 95% HPD interval of the age (in parentheses) under a strict clock and Bayesian skyline tree prior. Clades of genotypes C (except clade C4), E, F, G and H are collapsed and shown as dots. The figure includes a possible tenth genotype, J, based on a single human isolate. Taxon names for ancient samples indicate era (BA, Bronze Age; IA, Iron Age or later), sample name, sample age in years, ISO 3166 three-letter abbreviation of country of sequence origin, and region of sequence origin. Taxon names for modern samples indicate human genotype or subgenotype or host species if non-human, GenBank accession number, sample age in years, ISO 3166 three-letter abbreviation of country of sequence origin, and region of sequence origin.

For detailed phylogenetic analyses, we used a set of 112 reference human and non-human primate HBV sequences (dataset 2, see Methods). A maximum likelihood phylogenetic tree based on these reference sequences and the 12 ancient sequences was constructed (Extended Data Fig. 4). Regression of root-to-tip genetic distances against sampling dates, as well as date randomization tests, showed a clear temporal signal in the data (Extended Data Fig. 5 and Supplementary Figs. 13), suggesting that molecular clock models can be applied. A dated coalescent phylogeny was constructed using BEAST225 (Fig. 2). The molecular clock was calibrated using tip dates. Strict and relaxed log-normal molecular clocks were tested with coalescent constant, exponential and Bayesian skyline population priors (Extended Data Table 3a). Model comparisons favoured a relaxed molecular clock model with log-normally distributed rate variation and a coalescent exponential population prior (Extended Data Table 3a). The median root age of the resulting tree is estimated to be 11.6 kyr (95% highest posterior density (HPD) interval: 8.6–15.3 kyr) and the median clock rate is 1.18 × 10⁻⁵ substitutions per site per year (95% HPD interval: 9.21 × 10⁻⁶ to 1.45 × 10⁻⁵ substitutions per site per year). Under a strict molecular clock, a coalescent Bayesian skyline population prior was favoured, in which case the median root age is 15.6 kyr (95% HPD interval: 13.7–17.8 kyr) and the median substitution rate is 9.48 × 10⁻⁶ substitutions per site per year (95% HPD interval: 8.3 × 10⁻⁶ to 1.07 × 10⁻⁵ substitutions per site per year) (Extended Data Table 3a–c).
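The root-to-tip regression mentioned above can be sketched as an ordinary least-squares fit of genetic distance against sampling date. This is a simplified illustration rather than the analysis performed in the study: it assumes that root-to-tip distances have already been extracted from a rooted tree, and the function name is hypothetical. The slope approximates the substitution rate, the x-intercept approximates the date of the root, and a high R² indicates temporal signal.

```python
from typing import Sequence, Tuple


def root_to_tip_regression(sampling_years: Sequence[float],
                           distances: Sequence[float]
                           ) -> Tuple[float, float, float]:
    """Least-squares fit of distance = slope * year + intercept.

    Returns (slope, root_year, r_squared), where slope is in
    substitutions per site per year and root_year is the x-intercept.
    """
    if len(sampling_years) != len(distances) or len(sampling_years) < 2:
        raise ValueError("need at least two paired observations")
    n = len(sampling_years)
    mean_x = sum(sampling_years) / n
    mean_y = sum(distances) / n
    sxx = sum((x - mean_x) ** 2 for x in sampling_years)
    syy = sum((y - mean_y) ** 2 for y in distances)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(sampling_years, distances))
    if sxx == 0 or syy == 0 or sxy == 0:
        raise ValueError("regression undefined: no usable variation")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    root_year = -intercept / slope
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, root_year, r_squared
```

Sampling dates can be expressed as calendar years, with negative values for dates before the common era, so that ancient and modern tips share one axis.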

Under all model parameterizations used here, the substitution rate that we find is lower than rates estimated from phylogenies built using either modern heterochronous sequences10 or sequences from mother-to-child transmissions26 but higher than rates inferred using external calibrations based on human migrations11. A lower rate is consistent with previous work27 in which it was shown that, although mutation rates may be high, mutations within an individual often revert back to the genotype consensus and thus rarely lead to long-term sequence change. It is also consistent with the time-dependent rate phenomenon, observed for many viruses, which suggests that short-term evolutionary rates are higher than long-term rates28.

The ancient HBV genome data enable us to formally evaluate hypotheses concerning HBV origins using path sampling of calibrated phylogenies based on appropriate external divergence date assumptions. We tested several calibration points that would be implied by a co-expansion of HBV with humans after leaving Africa for support of congruence between migrations and geographical locations of HBV clades11. We find weak evidence for the split of the F and H clade occurring between 13.4 and 25.0 ka under a strict, but not a relaxed, clock model. We do not find support for the divergence of subgenotype C3 strains between 5.1 and 12.0 ka (hypothesized to have led to its distribution in different regions of Polynesia11) or for divergence of Haitian A3 strains from other genotype A strains between 0.2 and 0.5 ka under either strict or relaxed clock models (Extended Data Table 3d).

In the dated coalescent phylogeny, four ancient sequences (from youngest to oldest: HBV-DA119, HBV-DA195, HBV-RISE386 and HBV-RISE387) group with genotype A. The first three are well within the 7.5% nucleotide divergence criterion that was used to delimit membership in HBV genotypes, and HBV-RISE387 is right on this limit (7.51%)13 (Extended Data Table 4a). The three oldest samples lack a six-nucleotide insertion at the carboxyl end of the core gene (C) that is present in all modern genotype A viruses8 (Table 2). HBV-RISE387 encodes a stop codon in its pre-core peptide that would have ablated the expression of the immune modulator HBe antigen (HBeAg), a phenomenon that is known to occur in modern HBV infections (Table 2). This characteristic viral mutant is usually found in chronic HBV carriers who seroconverted from HBeAg to anti-HBe. RISE386 and RISE387 have archaeological dates only about 100 years apart and both come from the Bulanovo site in Russia, but their viruses have only 93.34% sequence identity (Extended Data Table 4b), which indicates the existence of substantial localized HBV diversity about 4.2 ka.

Table 2 Genome properties of ancient sequences included in phylogenetic analyses

The ancient sequence HBV-DA45 phylogenetically groups with genotype B and has 97.65% sequence identity with modern genotype B (Extended Data Table 4a).

Sequences HBV-DA27, HBV-DA29, HBV-DA51 and HBV-DA222 phylogenetically group with the modern genotype D. They have high sequence identity (96.99–98.74%) with modern genotype D sequences (Extended Data Table 4a), and have the typical 33-nucleotide deletion in the preS1 region of the S gene, encoding the three HBV surface proteins8 (Table 2).

Sequences HBV-RISE154, HBV-RISE254 and HBV-RISE563 are in a sister relationship with the chimpanzee–gorilla HBV clade (Fig. 2). HBV-RISE254 and HBV-RISE563 have the same 33-nucleotide deletion in the preS1 sequence that is shared with non-human primate HBVs and human genotype D (Table 2). HBV-RISE563 does not encode a functional pre-core peptide (Table 2). On the basis of sequence similarity across the whole genome, HBV-RISE563 and HBV-RISE254 together might be classified as a new human HBV genotype that is extinct today, and HBV-RISE154 might possibly be classified as another (Extended Data Table 4). However, HBV-RISE154 has low genome coverage, which precludes an exact calculation. The sister relationship of these three sequences with modern chimpanzee and gorilla HBVs could be interpreted as a consequence of relatively recent transmission(s) of HBV from humans to non-human primates14. However, other scenarios and confounding factors are possible, as these lineages are deeply separated in the tree. Incomplete lineage sorting combined with viral extinction (possibly boosted by massive recent reductions in great ape populations) should be considered. More data on current and, if possible, ancient HBVs will be necessary to reach definitive conclusions.

The geographical locations of some of the ancient virus genotypes do not match the present-day genotype distribution, and also do not match dates and/or locations inferred in previous studies of HBV. Although the data presented here are limited, they provide important spatiotemporal reference points in the evolutionary history of HBV. Their synopsis suggests a more complicated ancestry of present-day genotypes than previously assumed, especially in light of recent insights into the history of human migration.

We find genotype A in south-western Russia by 4.3 ka (in samples RISE386 and RISE387) in individuals belonging to the Sintashta culture, and in a Hungarian sample (DA195) from the Scythian culture. The western Scythians are related to the Bronze Age cultures of western steppe populations2 and their shared ancestry suggests that the modern genotype A may descend from this ancient Eurasian diversity and not, as previously hypothesized, from African ancestors29,30. This is also consistent with the phylogeny (Fig. 2), as well as the fact that the three oldest ancient genotype A sequences (HBV-DA195, HBV-RISE386 and HBV-RISE387) lack the six-nucleotide insertion found in the youngest (HBV-DA119) and in all modern genotype A sequences. The ancestors of subgenotypes A1 and A3 could have been carried into Africa subsequently, via migration from western Eurasia31.

The ancient HBV genotype D sequences were all found in Central Asia. HBV-DA27, found in Kazakhstan and dated to 1.6 ka, falls basal to the modern subgenotype D5 sequences that today are found in the Paharia tribe from eastern India32. DA27 and the Paharia people in India are linked by their East Asian ancestry2,33.

Based on the observation that genotypes go extinct and can be created by recombination, the ancient sequence data show that the diversity that we observe today is only a subset of the diversity that has ever existed. These data support a scenario in which all present-day HBV diversity arose only after the split of the Old World and New World genotypes (25–13.4 ka). Any attempt to interpret the currently known HBV tree based on human migrations that happened before this event will necessarily result in anomalies that cannot be reconciled, such as the basal position of genotypes F and H and the apical position of subgenotype C4, which is found exclusively in indigenous Australians8. If HBV did co-evolve with ancient modern humans as they left Africa as previously proposed6, most of the pattern of earlier diversity has been replaced by changes that happened after the split of the Old and New World genotypes. Genotypes F and H would therefore be remnants of the earlier now-extinct diversity, and the arrival of subgenotype C4 in Australia would have taken place long after the split between Old and New World genotypes, as supported by the tree in Fig. 2. Alternatively, there could have been a New World origin of HBV or the virus could have been introduced into humans from a different host. Our data do not allow us to speculate either way.

To our knowledge, we report the oldest exogenous viral sequences recovered from DNA of humans or any vertebrate, and show that it is possible to recover viral sequences from samples of this age. We show that humans throughout Eurasia were widely infected with HBV for thousands of years. Despite the age of the samples and the imperfect diagnostic test, our dataset contained a high proportion of HBV-positive individuals. The actual ancient prevalence during the Bronze Age and thereafter might have been higher, reaching or exceeding the prevalence typically found in contemporary indigenous populations5. This clearly establishes the potential of HBV as a powerful proxy tool for research into human spread and interactions. The data from ancient genomes reveal aspects of complexity in HBV evolution that are not apparent when only modern sequences are considered. They show the existence of ancient HBV genotypes in locations incongruent with their present-day distribution, contradicting previously suggested geographical or temporal origins of genotypes or sub-genotypes; provide evidence for the creation of genotype A via recombination and for the emergence of the genotype outside Africa; reveal at least one now-extinct human genotype and ancient genotype-level localized diversity; and demonstrate that the viral substitution rate obtained from modern heterochronously sampled sequences is probably misleading. Together, these findings suggest that the difficulty in formulating a coherent theory for the origin and spread of HBV may be due to genetic evidence of an earlier evolutionary scenario being overwritten by relatively recent alterations, as has previously been suggested in the context of recombination24. The lack of ancient sequences limits our understanding of the evolution of HBV and very probably of other viruses. Discovery of additional ancient viral sequences may provide a clearer picture of the true origin and early diversification of HBV, enable us to address questions of palaeo-epidemiology, and broaden our understanding of the contributions of natural and cultural changes (including migrations and medical practices) to human disease burden and mortality.

Methods

No statistical methods were used to predetermine sample size. The experiments were not randomized and investigators were not blinded to allocation during experiments and outcome assessment.

HBV datasets

The following HBV datasets were used in the present study. Full listings of accession numbers are given in the Supplementary Methods.

Dataset 1

Dataset 1 comprises 26 HBV genomes, covering all species in the Orthohepadnaviridae. This includes one sequence each from the human HBV genotypes (A–J), orangutan, chimpanzee, gorilla, gibbon, woolly monkey, woodchuck, ground squirrel, Arctic ground squirrel and horseshoe bat, four sequences from roundleaf bats, and three sequences from tent-making bats, largely following a previous publication34.

Dataset 2

Dataset 2 comprises 124 HBV genomes, from humans and non-human primates. This set contains 92 sequences from a previous publication11 (excluding their incomplete sequences), 7 additional genotype D sequences, the Korean mummy genotype C sequence20, the 12 ancient sequences from the present study and 12 full genomes selected from a set of 9,066 full HBV genomes downloaded from NCBI35 on 24 August 2017 (Entrez query: hepatitis b virus[organism] not rna[title] not clone[title] not clonal[title] not patent[title] not recombinant[title] not recombination[title] and 3000:4000[sequence length]) corresponding to the closest, non-artificial match for each of the ancient sequences. Dates for these sequences were acquired by looking for a date of sample collection in the NCBI entry, or the paper in which the sequence was first published. If a range of dates was mentioned, the mean was used. If no date of sample collection was found in this way, either the year of the publication of the paper, or the year of addition of the sequence to GenBank was used, whichever was earlier.
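The date-assignment rule described above can be written as a short function. This is a sketch under stated assumptions: the function and parameter names are hypothetical, a collection date is passed as a (first year, last year) pair, and the mean of a range is taken as its midpoint.

```python
from typing import Optional, Tuple


def assign_sample_year(collection_range: Optional[Tuple[int, int]],
                       publication_year: Optional[int],
                       genbank_year: Optional[int]) -> float:
    """Apply the fallback order described in the text.

    1. If a collection date (or range) is known, use its mean.
    2. Otherwise use the earlier of the publication year and the year
       the sequence was added to GenBank.
    """
    if collection_range is not None:
        first, last = collection_range
        return (first + last) / 2
    candidates = [year for year in (publication_year, genbank_year)
                  if year is not None]
    if not candidates:
        raise ValueError("no date information available")
    return float(min(candidates))
```

A single known collection year can be passed as a pair with equal endpoints, for example `(2001, 2001)`.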

Dataset 3

Dataset 3 comprises 124 HBV genomes, from humans, non-human primates and a variety of other Orthohepadnaviridae host species, including woolly monkey, roundleaf bat, tent-making bat, ground squirrel, Arctic ground squirrel, woodchuck and snow goose. This set contains 113 sequences that were obtained from the union of 91 sequences from Paraskevis et al.11 and 29 from Drexler et al.34, plus 11 additional sequences (giving 124 sequences in total).

Dataset 4

Dataset 4 comprises 3,505 HBV genomes. Of these, 3,384 are from a previous publication36, divided into ten human genotypes. To these, we added 17 chimpanzee, 56 gorilla, 12 gibbon and 36 orangutan full HBV genome sequences downloaded from NCBI on 18 January 2017, resulting in 14 genome categories.

Dating of ancient samples

Sample ages were determined by direct ¹⁴C dating. These ages were calibrated using OxCal37 (version 4.3) with the IntCal13 curve38. Table 1 shows the ¹⁴C age and standard deviation for each sample. This is followed by the median probability calibrated age before present (cal. BP). RISE386 was ¹⁴C dated twice, with ages (standard deviation) of 3,740 (33) and 3,775 (34); a rounded mean of 3,758 (34) was used for its calibration. DA29 was dated at 822 years using ¹⁴C and also at about 700 years using multi-proxy methods: the former date was used for consistency. The dates for DA119, DA222, RISE548, RISE556, RISE568 and RISE597 are best estimates, based on sample context.

Data and data processing

We analysed 101 Bronze Age samples published in Allentoft et al.1, 137 predominantly Iron Age samples published in Damgaard et al.2 and 66 additional samples from the Bronze Age. A total of 114.58 × 10⁹ Illumina HiSeq 2500 sequencing reads were processed.

AdapterRemoval39 (version 2.1.7) was used with its default settings to remove adaptors from all sequences, to trim N bases from the ends of reads and to trim bases with quality ≤ 2. Reads were aligned against a human genome (GRCh38, https://www.ncbi.nlm.nih.gov/grc/human) using BWA40 (version 0.7.15-r1140, mem algorithm). Reads that did not match the human genome were then mapped against the NCBI viral protein reference database containing 274,038 viral protein sequences (downloaded on 31 August 2016) using DIAMOND41 (version 0.8.25). Protein matches were grouped into their corresponding viruses. Reads matching HBV were found in 25 samples.

The non-human reads from the samples that had more than three reads matching HBV using DIAMOND were selected for a subsequent BLAST42 (version 2.4.0) analysis. A BLAST database was made from dataset 3, and samples were matched using BLASTn (with arguments -task blastn -evalue 0.01). Matching reads with bit scores greater than 50 for all samples (except DA222 (70) and DA45 (55)) were selected for subsequent processing. The number of reads selected from the BLAST matches, per sample, is shown in Table 1, with additional detail in Extended Data Table 1. Across all samples 11,149 reads matched against HBV sequences.
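The bit-score selection step above can be sketched as follows. The sketch assumes that BLAST results were written in the standard 12-column tabular format (`-outfmt 6`), in which the bit score is the twelfth column; the output format is not stated in the text, and the function name is hypothetical. The thresholds are those given above: greater than 50 by default, 70 for DA222 and 55 for DA45.

```python
from typing import Iterable, List

DEFAULT_MIN_BITSCORE = 50.0
# Sample-specific thresholds quoted in the text.
SAMPLE_MIN_BITSCORE = {"DA222": 70.0, "DA45": 55.0}


def select_hbv_reads(sample: str,
                     blast_tabular_lines: Iterable[str]) -> List[str]:
    """Return sorted read IDs with a hit above the sample's threshold.

    Assumes tab-separated BLAST output with the query ID in column 1
    and the bit score in column 12 (an assumption about the format).
    """
    threshold = SAMPLE_MIN_BITSCORE.get(sample, DEFAULT_MIN_BITSCORE)
    selected = set()
    for line in blast_tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.rstrip("\n").split("\t")
        if float(fields[11]) > threshold:
            selected.add(fields[0])
    return sorted(selected)
```

A read is kept as soon as any one of its hits exceeds the threshold, so multiple hits for the same read do not produce duplicates.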

PCR confirmation

Real-time PCR was established using primers and TaqMan probes as previously described43, which were used to amplify a 91-base-pair amplicon of the HBV genome. Primers and probe were added to QuantiTect PCR mix (Qiagen #204343) in a final concentration of 400 nM or 200 nM, respectively, in a total reaction volume of 25 μl, including 5 μl template. Using the Roche LC480 or Agilent Mx3006p instruments, PCRs were incubated for 15 min at 95 °C followed by 45 cycles of 15 s at 94 °C and 60 s at 60 °C, measuring fluorescence from the 6-carboxy-fluorescein/BHQ1-labelled probe and the passive dye (ROX) at the end of each cycle.

Careful precautions were taken to prevent PCR contamination. PCR mastermixes were prepared in dedicated aDNA clean laboratory facilities, in which no prior targeted work has been carried out on HBV. aDNA extracts and non-template controls (NTCs) were added into PCR reactions in this location too, and were not subsequently opened. Positive control material was handled in laboratories in a physically separated building. Here, standard material, diluted to 5–50 copies per reaction, was added to duplicate PCR reactions along with additional NTCs.

Virus capture

Fourteen samples with sufficient sample material were selected for virus capture (DA27, DA29, DA45, DA51, DA85, DA89, DA119, DA195, DA222, RISE254, RISE386, RISE416, RISE568 and RISE556). The viral reference genomes for probes were selected as follows. The International Committee on Taxonomy of Viruses (ICTV) 2012 listed 2,618 viral species. As many had no associated reference genomes or merely partial sequence information, we selected 2,599 sequences of full-length viral genomes, available from GenBank (June 2014), representing viral species found in vertebrates excluding fish. Sequences shorter than 1,000 nucleotides were discarded. Sequences with identical length and organism identification were regarded as duplicates and thus reduced to one. For a number of specific viral taxa for which a large number of similar reference sequences are available, we manually selected representative genomes or genome segments (Supplementary Tables 1, 2). For example, among 72 available hepatitis C virus genome sequences, we selected one genome per subtype (subtypes 1a–1c, 1g; 2a–c, 2i, 2k; 3a, 3b, 3i, 3k; 4a–4d, 4f, 4g, 4k–4r, 4t; 5a; 6a–6u; 7a). Likewise, 12 HIV-1 genomes were selected to represent group M (subtypes A–D, F1, F2, H, J, K, N, O and P). For influenza A virus, we included only sequences from segment 7 and segment 5 that encode the conserved matrix proteins M1/M2 and the nucleocapsid protein NP, respectively. We selected 82 M1/M2 segments and 115 NP segments among the available segment sequences. All available segments were included from genomes belonging to Arenaviridae, Bunyavirales and Reoviridae. For members of Poxviridae for which full genomes were unavailable (skunkpox, raccoonpox and volepox viruses) sequences representing the conserved gene encoding the DNA-dependent RNA polymerase were included (n = 22). In addition, two partial genomes of squirrelpox virus were included. By mistake, two and nine partial sequences were included from Iridoviridae (1.5–2.5 kb) and Coronaviridae (1.3–14.5 kb), respectively, already represented by full genomes. Likewise, sequences representing Merkel cell polyomavirus and KI polyomavirus were not included among the reference genomes used for probe design. SeqCap EZ hybridization probes were designed and synthesized by Roche NimbleGen based on the resulting reference sequences.

Capture was performed on double-indexed libraries prepared from aDNA, following the manufacturer’s protocol (version 4.3) with the following modifications. In brief, 1.8 to 2.2 μg of pooled libraries were hybridized at 47 °C for 65–70 h with low complexity C0T-1 DNA, specific P5/P7 adaptor-blocking oligonucleotides each containing a hexamer motif of inosine nucleotides to match individually indexed adapters, hybridization buffer containing 10% formamide, and the capture probes. Dynabeads M-270 (Invitrogen) were used to recover the hybridized library fragments. After washing and eluting the libraries, the post-capture PCR amplification was performed with KAPA Uracil+ polymerase (Kapa Biosystems). PCR cycling conditions were as follows: 1 cycle of 3 min at 95 °C, followed by 14 cycles of: 20 s denaturation at 98 °C, 15 s annealing at 65 °C and 30 s elongation at 72 °C, ending with 5 min at 72 °C. The amplified captured libraries were purified using AMPureXP beads (Agencourt).

Shotgun sequencing data were generated as previously described1. Sequencing of target-enriched libraries was performed on Illumina Hiseq2500 SR80bp, V4 chemistry.

The resulting reads were compared to dataset 2 using BLASTn (with arguments -task blastn -evalue 0.01). Matching reads with bit scores greater than 50 for all samples (except DA222 (70) and DA45 (55)) were selected for subsequent processing. In total, 6,757 reads matched HBV in the capture data.

Sequence authenticity

The following evidence leads us to believe that the ancient HBV sequences are authentic and that the possibility of contamination can be excluded.

(1) Standard precautions for working with aDNA were applied44.

(2) Sequences were checked for typical aDNA damage patterns using mapDamage45 (version 2.0.6). Whenever sufficient amounts of data were available (>200 HBV reads), we found C>T mutations at the 5′ end, typical of aDNA46 (see Extended Data Fig. 1a, c).

(3) Capture was performed on sample DA222 DNA extracts with and without pre-treatment by uracil-specific excision reagent (USER)47. After USER treatment (3 h at 37 °C) of the aDNA extract, the damage pattern is eliminated (Extended Data Fig. 1b).

(4) As the ancient viruses are from three different HBV genotypes (A, B and D) and a clade in sister relationship to chimpanzee and gorilla HBVs, any argument that samples were contaminated would have to account for this diversity as well as the sequence novelty.

(5) HBV sequences were identified in 25 of 304 analysed samples (Table 1), showing that the findings cannot be due to a ubiquitous laboratory contaminant.

(6) Despite the low frequency of positive samples, we sequenced extraction blanks to provide additional evidence against the possibility that the HBV sequences stemmed from sporadic incorporation, amplification and sequencing of background reagent contaminants into the aDNA libraries. The negative extraction controls were amplified for 40 PCR cycles, and BLAST was used to match the read sequences against dataset 3, with the same parameters used for the ancient samples. Because the ancient HBV-positive reads used to assemble genomes all had bit scores of at least 50 (see ‘Data and data processing’), we filtered the negative extraction control BLAST output for reads with a bit score ≥ 45. No reads (out of 23 million) matched any HBV genome at that level.

(7) HBV is a blood-borne virus that is mainly transmitted by exposure to infectious blood and that does not occur in the environment3, making contamination during archaeological excavation extremely unlikely.
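The damage-pattern check in point (2) can be illustrated with a simplified sketch. It is not the mapDamage implementation: it assumes gapless, equal-length (reference, read) string pairs that are both written 5′ to 3′ in read orientation, and the function name is hypothetical. For each of the first few read positions it reports the fraction of reference cytosines that appear as thymine in the read; authentic aDNA is expected to show an elevated fraction at the first positions.

```python
from typing import Iterable, List, Optional, Tuple


def five_prime_c_to_t_frequency(aligned_pairs: Iterable[Tuple[str, str]],
                                max_positions: int = 10
                                ) -> List[Optional[float]]:
    """Per-position C>T frequency counted from the 5' end of each read.

    Returns one value per position; None marks positions at which no
    reference cytosine was observed.
    """
    reference_c = [0] * max_positions
    c_to_t = [0] * max_positions
    for reference, read in aligned_pairs:
        usable = min(max_positions, len(reference), len(read))
        for i in range(usable):
            if reference[i].upper() == "C":
                reference_c[i] += 1
                if read[i].upper() == "T":
                    c_to_t[i] += 1
    return [c_to_t[i] / reference_c[i] if reference_c[i] else None
            for i in range(max_positions)]
```

After USER treatment, as described in point (3), the same profile would be expected to flatten towards the background mismatch rate.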

Consensus sequences

Reads from the original sequencing and from the capture were aligned to a reference genome (Supplementary Table 3) in Geneious48 (version 9) using medium sensitivity/fast and iterate up to 5 times. Because aDNA damage often clusters towards read termini46, the resulting alignments were carefully curated by hand to remove non-matching termini of reads if the majority of the read showed a very good match with the reference sequence.

Genotyping

All reads used to construct the ancient HBV consensus sequences were matched against the full NCBI nucleotide database (downloaded 28 December 2016) using BLAST. Of these reads, 97.5% had HBV as their top match. All ancient consensus sequences were matched against the full HBV genomes of dataset 4 with the Needleman–Wunsch algorithm49, as implemented in EMBOSS50 (version 6.6.0.0). For each ancient sequence, the percentage of sequence identity with the most similar representative of each modern genotype and four non-human primate species is listed in Extended Data Table 4a. The Needleman–Wunsch algorithm was also used to calculate the pairwise sequence similarity between all ancient sequences (Extended Data Table 4b).
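For readers unfamiliar with global alignment, the following sketch shows how a Needleman–Wunsch alignment yields a percentage identity. It is not the EMBOSS implementation: the scoring values (match 1, mismatch −1, linear gap −1) are simplifying assumptions, whereas EMBOSS uses a substitution matrix with affine gap penalties, and identity is computed here as matching columns over alignment length, which is also an assumption about the convention.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two sequences; return the two gapped strings."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = n, m
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Identical, non-gap columns as a percentage of alignment length."""
    if not aligned_a:
        return 0.0
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * same / len(aligned_a)


aligned = needleman_wunsch("ACGT", "ACT")
print(aligned, percent_identity(*aligned))
```

The quadratic memory use of this toy version is acceptable for genomes of about 3.2 kb such as HBV, but an established implementation is preferable for real analyses.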

Recombination analysis

The recombination detection program51 version 4 (RDP4) was used to search for evidence of recombination within the 12 ancient sequences and a selection of 15 modern human and non-human primate sequences (Supplementary Methods). Recombination with HBV-RISE387 as the recombinant and HBV-DA51 as one parent was suggested at positions 1567–2256 by seven recombination methods (RDP52, GENECONV53, BootScan54, MaxChi55, Chimaera56, SiScan57 and 3Seq58) with P values from 1.179 × 10−6 to 5.336 × 10−11 (Extended Data Table 2). The same recombination was suggested for all four ancient genotype A and two modern genotype A sequences. Graphical evidence of the recombination and the predicted break point distribution for sequences HBV-RISE386 and HBV-RISE387 from three methods (MaxChi, BootScan and RDP) is shown in Extended Data Fig. 3.

Initial maximum likelihood phylogenies

An initial maximum likelihood tree was generated to ascertain whether the ancient sequences fall within the primate HBV clades. Dataset 1 and the ancient sequences were aligned in MAFFT59 (version 7). The maximum likelihood tree was constructed using PhyML60 (version 20160116), optimizing topology, branch lengths and rates. We used a general time reversible (GTR) substitution model, with base frequencies determined by maximum likelihood, and a maximum likelihood-estimated proportion of invariant sites and 100 bootstraps (Extended Data Fig. 2). Furthermore, a maximum likelihood tree (Extended Data Fig. 4) was generated based on a MAFFT alignment of dataset 2 and the ancient sequences, using the same parameters as above.

Dated coalescent phylogenies

To check for a temporal signal in the data, root-to-tip regressions and date randomization tests were performed. For the root-to-tip regression, input trees were calculated using dataset 2 with the addition of a woolly monkey sequence (GenBank accession number AF046996) as an outgroup. Three phylogenetic algorithms were used: neighbour joining, maximum likelihood (PhyML) and Bayesian (MrBayes61 (version 3.2.5)) methods (Supplementary Figs. 1–3). Root-to-tip distances were extracted using TempEst62 (version 1.5). For maximum likelihood and Bayesian methods, root-to-tip distances (in substitutions per site) were extracted from optimized tree topologies (maximum likelihood and maximum clade credibility trees, respectively). For the neighbour joining method, root-to-tip distances were averaged over 1,000 bootstrap replicates. Regression analyses were performed with SciPy (version 0.16.0; http://www.scipy.org). For the date randomization tests, we used three different approaches to randomize tip dates. First, tip dates were randomized between all sequences in the phylogeny. Second, tip dates were randomized only among the ancient sequences presented in this Letter, as well as the Korean mummy sequence (GenBank accession number JN315779), while the modern sequences retained their correct ages. Third, dates were randomized within a clade. For each of the three approaches, we performed three independent randomizations. This resulted in a total of nine analyses, which were run for 100,000,000 generations each, under the relaxed log-normal clock model and coalescent exponential tree prior. We also ran the same analyses under a strict clock and coalescent Bayesian skyline tree prior, which were run for 20,000,000 generations. We used a GTR substitution model with unequal base frequencies, four gamma rate categories, estimated gamma distribution of rate variation and estimated proportion of invariant sites, as found by bModelTest63 (version 1.0.4). None of the analyses using the relaxed clock converged (effective sample size < 200). This is most probably because the mis-specification of the dates leads to incongruence between the sequence and time information. Under the strict clock model, all runs converged, and none of the 95% HPD intervals of the root age overlapped between the randomized and the non-randomized runs, fulfilling the criteria for evidence of a temporal signal64.
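The two temporal-signal checks described above can be illustrated with a short sketch. It is written in plain Python rather than with the SciPy routines used in the study, and all function names and example values are hypothetical. The regression slope is an estimate of the substitution rate and the x-intercept an estimate of the root date. The randomization helper permutes dates among a chosen subset of tips (all tips, only the ancient tips, or the tips of one clade), and the final helper tests whether two HPD intervals overlap.

```python
import random


def root_to_tip_regression(dates, distances):
    """Ordinary least squares of root-to-tip distance on sampling date.

    Returns (slope, intercept, r_squared, root_date); root_date is the
    x-intercept, or None if the slope is zero.
    """
    n = len(dates)
    mean_x = sum(dates) / n
    mean_y = sum(distances) / n
    sxx = sum((x - mean_x) ** 2 for x in dates)
    syy = sum((y - mean_y) ** 2 for y in distances)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dates, distances))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    root_date = -intercept / slope if slope != 0 else None
    return slope, intercept, r_squared, root_date


def randomize_tip_dates(tip_dates, tips_to_shuffle, rng):
    """Permute dates among the selected tips; all other tips keep their dates."""
    names = list(tips_to_shuffle)
    shuffled = [tip_dates[name] for name in names]
    rng.shuffle(shuffled)
    randomized = dict(tip_dates)
    randomized.update(zip(names, shuffled))
    return randomized


def intervals_overlap(a, b):
    """True if closed intervals a=(low, high) and b=(low, high) overlap."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical tips: calendar years (negative = BCE) and distances in subs/site.
dates = [-2000.0, 0.0, 2000.0]
distances = [0.01, 0.03, 0.05]
print(root_to_tip_regression(dates, distances))

tip_dates = {"ancient1": -2000, "ancient2": -500, "modern1": 2000}
print(randomize_tip_dates(tip_dates, ["ancient1", "ancient2"], random.Random(1)))
```

Under the criterion applied here, a temporal signal is supported when `intervals_overlap` is false for the root-age HPD interval of the correctly dated run against that of every randomized run.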

Dated phylogenies were estimated using BEAST225 (version 2.4.4 prerelease). We used a MAFFT alignment of dataset 2. Using bModelTest63, we selected a GTR substitution model with unequal base frequencies, four gamma rate categories, estimated gamma distribution of rate variation and estimated proportion of invariant sites. Proper priors were used throughout. Path sampling, as implemented in BEAST2, was performed to select between a strict and a relaxed log-normal clock, and between a coalescent constant, coalescent exponential and coalescent Bayesian skyline tree prior (Extended Data Table 3a). Marginal likelihood values were compared using a Bayes factor test. A Bayes factor in the range of 3–20 implies positive support, 20–150 strong support and > 150 overwhelming support65. The relaxed log-normal clock model in combination with a coalescent exponential tree prior was favoured (Extended Data Table 3a). For the final tree, a Markov chain Monte Carlo analysis was run until parameters reached an effective sample size > 200, sampling every 2,000 generations. Convergence and mixing were assessed using Tracer66 (version 1.6). The final tree files were subsampled to contain 10,000 or 10,710 (for the relaxed log-normal clock, coalescent exponential tree prior) trees, with the first 25% of samples discarded as burn-in. Maximum clade credibility trees were made using TreeAnnotator25 (version 2.4.4 prerelease).
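The Bayes factor categories quoted above can be applied as in the following sketch. It assumes that path sampling returns a log marginal likelihood for each model, so that the log Bayes factor is the difference of the two values; the comparison is done on the log scale to avoid overflow for large differences. The function name, the handling of exact boundary values and the label for values below 3 are illustrative choices, not taken from the study.

```python
import math


def bayes_factor_support(log_ml_a, log_ml_b):
    """Classify support for model A over model B from log marginal likelihoods."""
    log_bf = log_ml_a - log_ml_b  # log of the Bayes factor (A over B)
    if log_bf > math.log(150):
        return "overwhelming"
    if log_bf >= math.log(20):
        return "strong"
    if log_bf >= math.log(3):
        return "positive"
    return "below the quoted ranges"


# Hypothetical log marginal likelihoods for illustration only.
print(bayes_factor_support(-1000.0, -1004.0))
```

A difference of 4 log units, as in the example, corresponds to a Bayes factor of about 55 and therefore falls in the 20–150 range.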

To formally test the ‘Out of Africa’ hypothesis of HBV evolution, calibration points were tested using path sampling as implemented in BEAST2. Calibration points were constrained as follows. For the split of genotypes F and H, the most-recent common ancestor (MRCA) of all genotype F and H sequences was constrained using a uniform (13,400:25,000) distribution, as this is the range of estimates for when the Americas were first colonized67,68. For the split of subgenotype A3 in Haiti, the MRCA of FJ692598 and FJ692611 was constrained using a uniform (200:500) distribution, owing to the timing of the slave trade to Haiti69. For the split of C3 in Polynesia, the MRCA of X75656 and X75665 was constrained using a uniform (2,000:5,100) distribution, owing to the range of estimates for the MRCA of Polynesian populations11,70. Calibration points were tested under both a relaxed log-normal clock, coalescent exponential tree prior, and a strict clock, Bayesian skyline tree prior.
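The effect of a uniform calibration on an MRCA age can be sketched as a log prior density: it is constant inside the bounds and negative infinity outside them, so any tree whose MRCA age falls outside the interval is rejected. The snippet below is an illustration only and does not reproduce the BEAST2 configuration; the dictionary keys are hypothetical labels, ages are taken to be in years before present (an assumption), and bounds are written with the lower value first.

```python
import math


def uniform_log_density(age, lower, upper):
    """Log density of a uniform(lower, upper) prior evaluated at `age`."""
    if lower >= upper:
        raise ValueError("lower bound must be smaller than upper bound")
    if lower <= age <= upper:
        return -math.log(upper - lower)
    return float("-inf")


# Calibration intervals as described in the text (lower, upper).
CALIBRATIONS = {
    "F/H split (Americas)": (13400, 25000),
    "A3 split (Haiti)": (200, 500),
    "C3 split (Polynesia)": (2000, 5100),
}

for label, (lower, upper) in CALIBRATIONS.items():
    midpoint = (lower + upper) / 2
    print(label, uniform_log_density(midpoint, lower, upper))
```

Because the density inside the interval depends only on its width, a narrow calibration such as the Haitian one constrains the tree far more tightly than the broad interval used for the colonization of the Americas.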

Reporting summary

Further information on experimental design is available in the Nature Research Reporting Summary linked to this paper.

Data availability

The complete sequences in this study have been deposited in the European Nucleotide Archive under sample accession numbers ERS2295383–ERS2295394. All other data are available from the corresponding author upon reasonable request.