Machine-learning based patient classification using Hepatitis B virus full-length genome quasispecies from Asian and European cohorts

Chronic infection with Hepatitis B virus (HBV) is a major risk factor for the development of advanced liver disease including fibrosis, cirrhosis, and hepatocellular carcinoma (HCC). The relative contribution of virological factors to disease progression has not been fully defined and tools aiding the deconvolution of complex patient virus profiles is an unmet clinical need. Variable viral mutant signatures develop within individual patients due to the low-fidelity replication of the viral polymerase creating ‘quasispecies’ populations. Here we present the first comprehensive survey of the diversity of HBV quasispecies through ultra-deep sequencing of the complete HBV genome across two distinct European and Asian patient populations. Seroconversion to the HBV e antigen (HBeAg) represents a critical clinical waymark in infected individuals. Using a machine learning approach, a model was developed to determine the viral variants that accurately classify HBeAg status. Serial surveys of patient quasispecies populations and advanced analytics will facilitate clinical decision support for chronic HBV infection and direct therapeutic strategies through improved patient stratification.

(HBeAg negative status, anti-HBe positive phase) is associated with reduced viral replication and diminished long-term complications, for example, the risk of hepatocellular carcinoma 18 . Progression to seroconversion, either spontaneous or treatment-induced, may take years; contemporary studies have demonstrated differential rates of seroconversion associated with the duration and type of therapeutic regimen in addition to individual patient clinical profiles 19,20 . All virological factors associated with HbeAg seroconversion have not been elucidated and deeper knowledge of these may aid the definition of patients likely to seroconvert and direct appropriate therapeutic strategies.
In addition to HBV DNA load, ALT levels, HBV surface antigen (HBsAg), the loss of plasma HBeAg and seroconversion to HBeAg represent valuable clinical waymarks of a functional cure 21 . The plasma levels of these markers are used as proxies of viral activity in the liver; inferences about the spectrum of viral quasispecies in plasma reflecting those in hepatocytes has not been qualitatively assessed. Furthermore, HBeAg negative patients may have ongoing complex clinical profiles the virological basis of which has not been established.
The progression of chronic hepatitis B is related to the prevailing viral and host immune activities. The tolerogenic activity of HBeAg facilitates the establishment of HBV infection in vivo -this tolerance to infection is lost during the anti-HBe positive phase and immune-escape HBeAg-negative mutants may be selected 22 . Increased sequence diversity, associated with stochastic mutations, can result in changes in immune tolerance and reactivity with concomitant changes in selection pressure and evolution of the virus. The dynamic interplay and drivers of viral evolution are complex and difficult to resolve linearly; the use of advanced analytic tools may assist in the deconvolution of the prevailing virological factors that contribute to serocoversion.
Many studies have disclosed variants with statistical associations with clinical metrics in different regions of the HBV genome [23][24][25][26] , but associations between all variants in all HBV coding regions and clinical parameters has not been established. For example, the nG1896A (W28*) mutation in the precore and nA1762T/nG1764A double mutation in the basal core promoter have temporal correlations with seroconversion and have been shown to be associated with different clinical courses 3,11 . Next generation sequencing (NGS) technologies are optimal for uncovering the spectra of variants that exist within an infected individual and in the wider population for disease surveillance and healthcare planning 27 . Ultra-deep NGS enables the detection of viral variants at low allele frequency with much greater sensitivity and confidence. To date most studies have focused on particular regions of the HBV genome, e.g. HBs major hydrophilic region (MHR) 28 , core 29 , and reverse transcriptase (RT) 30 or on whole genome sequencing 9 at low coverage and, most recently, on single virions 31 . Here we present the first comprehensive survey of the diversity HBV quasispecies through ultra-deep sequencing of the complete HBV genome across two distinct patient populations and, in addition, explore quasispecies diversity between plasma and liver samples.
The use of big data advanced analytics is an emergent field with potential for extensive application in healthcare 32,33 including applications where there is disease and treatment heterogeneity and for prescriptive analytics (precision medicine and clinical decision support). The current paradigm for classification of HBV infected patients utilises a series of virological and biochemical factors to infer viral activity and liver damage, as well as histopathological analysis of liver biopsies for fibrosis scoring (Fig. 1). In complex viral disease profiles, where single laboratory tests may not provide sufficient insight into the clinical history and progression of a patient advanced analytic techniques, including machine learning, may unlock insights into host and pathogen factors that may represent novel biomarkers for prognostics and therapeutic strategies. Consideration of advanced analytics applied to complex disease of virological origin has not been extensively evaluated -this study sought to investigate the feasibility of combining big data and advanced analytics to drive clinical insight in chronic hepatitis B infection.
Machine Learning (ML) encompasses a field of data-driven techniques for the study and construction of predictive algorithms that allow classification of factors/classes or estimation of quantitative traits from complex, multi-dimensional data (hundreds to thousands of co-variates) without a priori models 34 . In essence, ML approaches identify patterns revealing novel relationships between covariates or allow inferences to be made about future events. We have applied a random forest machine learning approach to classify the HBeAg status of untreated patients with chronic HBV infection using the standard HBeAg diagnostic test as a benchmark. Using the allele frequency of HBV variants we identify novel associations between viral variants and HBeAg status. Our analysis demonstrates a proof-of-concept of the utility of machine learning approaches to classify HBV infected patients and offers the prospect of exploring additional markers for therapeutic decision making and prognostic support.

Methods
Patients and samples. Patient samples were derived from two retrospective cohorts representing a single point in time for each patient; the first, a Western European cohort defined as Dataset A, the second, a Chinese cohort, defined as Dataset B, Table 1. In both cases the collection of patient samples was undertaken with informed and written consent and in accordance with the Declaration of Helsinki. Metadata (demographic and clinical data) was anonymized at point of collection and all data analysis was undertaken blinded and without access to patient identification keys. Projects were reviewed and authorized by the  This approach defines patients broadly into four classes that will inform the clinical decision for standard-of-care including the use of interferon and/or nucleoside analogues. Using virus whole genome sequencing to catalogue all nucleotide variants occurring at >1% machine learning approaches are explored to determine whether classification of HBeAg status could be recapitulated from a diverse patient population, extend our understanding of the virological factors associated with HbeAg status and evaluate whether this type of approach may be extended to novel markers of clinical status that will inform clinical decision making for stratification of patients in clinical trials and in the appropriate patient selection for use of next-generation treatment modalities for HBV. The study sought to answer three questions (solid boxes): Q1 -that plasma HBV quasispecies profiles were representative of those in the liver; Q2 -that machine learning approaches could accurately recapitulate classification by a routinely used clinical marker; and Q3 -whether this approach has wider utility in clinical decision support and the deconvolution of complex clinical history. www.nature.com/scientificreports www.nature.com/scientificreports/ as previously described 35 ) and plasma were obtained from ten patients in accordance with the approved study protocol. For Dataset B the HBV infection of patients was confirmed by using commercial enzyme-linked immunosorbent assay kits (Wantai BioPharm, Beijing, China) and quantified by using real-time fluorescence quantitative PCR. Patient liver biochemistry and virological metrics (serum qualitative HBsAg, anti-HBs, HBeAg, anti-HBeAg, HBV viral load) at baseline was obtained from the respective electronic medical records. Machine learning. To classify HBeAg status from viral mutant signatures for both datasets only samples from untreated patients were used (n = 182 Dataset A; n = 170 Dataset B) to limit effects on sample variant profile attributable to treatment methods. The input to the machine learning was a matrix of viral variant allele frequencies (0.01-0.99) and the associated HBeAg status ('positive' or 'negative') as defined by standard diagnostic tests. A random forest machine learning approach was employed to establish the variants that best classified HBeAg status. A series of test and training partitions and cross-validation steps were undertaken to optimize the model before testing against independent data excluded from model generation. Comparison of random forest models was based upon the following metrics: accuracy, balanced accuracy, sensitivity, specificity, and kappa values. For

Comprehensive survey of HBV variants. Supplementary tablers liver HBV variants. To investigate
whether plasma samples sufficiently represent the hepatic quasispecies population paired liver biopsy and plasma samples were collected as part of Dataset A (n = 10). The frequency of HBV variants found in either the liver or plasma were compared to define systematic differences in the quasispecies populations. The majority of variants demonstrated small differences in allele frequency (1-2%), as shown in Fig. 2A. A small number of variants (n = 56) demonstrated a difference of 10% or greater. These differences between plasma and liver populations were found to be patient-associated, i.e. the majority of patients demonstrated only small differences in allele frequency between the tissues (Fig. 2B), whilst one patient accounted for the majority of variants showing large differences in allele frequency (Supplementary Table 2a,b).
Quasispecies diversity arises from rare and unique viral variants. Dataset A (n = 182 patients) represented a cohort of diverse ethnic background from Western Europe. The dataset consisted of five HBV genotypes (A-E); all samples were obtained from untreated patients. A graphical overview of Dataset A is presented in Fig. 3A. Variants with an adjusted allele frequency greater than 1% were used for analysis. A total of 4615 variants were detected above this threshold and used as features to predict HBeAg status (Fig. 3C). In the absence of HBsAg levels it was not possible to classify the clinical status of these patients by EASL guidelines 21 . Dataset B was derived exclusively from an ethnically Chinese (Han) patient cohort. After filtering Dataset B represented n = 207 patients of whom n = 170 were treatment naïve; a further n = 37 received either treatment with a nucleoside analogue or interferon therapy (Fig. 3B). Using the EASL (European Association for the Study of the Liver) guidelines all samples from this cohort were classed as 'chronic hepatitis' . Dataset B was comprised of genotypes B and C; a total of n = 3039 variants were defined at >1% frequency with n = 1245 variants common to both genotypes (Fig. 3D). The intersection of genotypes B and C found n = 1640 variants common to both datasets (Jaccard Index = 0.45).   www.nature.com/scientificreports www.nature.com/scientificreports/ Viral variant heterogeneity differs with genotype. The average number of variants per patient and per genotype was considered for Dataset A (genotypes A-E represented). In general, within a viral genotype most variants were found only once across all the patients, Fig. 3E and Supplementary Fig. 4. To consider the overlap between patients across genotypes and define 'genotype-associated' variants we considered variants occurring at least twice within a genotype (Fig. 3F). Genotype D demonstrated the greatest sequence diversity with low variant coverage, i.e. few variants found consistently in all patient samples. Significant differences in Shannon entropy between HBeAg status groups was also noted (Supplementary Materials & methods; Supplementary Figs.

5-7)
Variants associated with drug resistance present in untreated patients. Low frequency mutations aid viral adaptation to selection pressures and so are particularly relevant to development of drug resistance and treatment failure 36 . Genomic variants in the reverse transcriptase/polymerase (RT/POL) gene, leading to amino acid substitutions conferring single and multi-drug resistance, were found in plasma samples of untreated patients in both datasets (Table 2). Drug-resistance associated mutations were found in 15-17% of patient samples from genotype B and D in dataset A. Patients infected with HBV genotype C demonstrated the most frequent resistance-associated variants in Dataset A (26%) and Dataset B (16% untreated; 29% treated).
Phylogenetic analysis recapitulates genotype groups across cohorts. Consensus whole genome nucleotide sequences were gathered for each patient plasma sample for each dataset (Dataset A, n = 182; Dataset B, n = 207), paired plasma and liver samples (n = 10) variant profiles if present (total samples, n = 399), and five reference genomes (genotypes A-E) for phylogenetic analysis. Genotypes B and C grouped together in the appropriate clades, Fig. 4, regardless of geographical origin. The remaining Dataset A samples from genotypes A, D, and E also grouped together in their appropriate clades alongside the reference genomes used for mapping of sequencing reads. Despite the extensive sequence diversity recorded and clinical heterogeneity phylogenetic clades resolved entirely by genotype across the two datasets.

Machine learning defines novel viral variants classifying HBeAg status. Defining classifier variants
for HBeAg status. Machine learning was used to define patterns of viral variant allele frequencies that demonstrated a strong association with HBeAg status and would act as robust classifiers and uncover novel virological factors. A model with a balanced accuracy of 1 (Accuracy, Sensitivity, and Specificity = 1) was found using test data from Dataset A (range balanced accuracy 0.8-1), Fig. 5A,B, although the relative contribution of each variant to classification accuracy was small. The highest-ranking variables contributing to this model included known pre-core and basal core promotor mutants (n1896GA, n1934AT, n1753TC). In general, the variants with greatest relevance to the model were found in the precore/core region, with some variants found in the HBsAg, X or RT/ POL genes. The majority of variants were mis-sense, with nG1896A and nC2351T defined as stop gains.
Viral variant allele frequencies from Datasets A and B were combined (n = 352 samples and n = 2119 common variants). This represented a low overlap of shared variants across datasets (Jaccard index = 0.38). Random forest www.nature.com/scientificreports www.nature.com/scientificreports/ classification of HBeAg status utilised samples from treatment naïve patients from both cohorts with n = 100 HBeAg negative and n = 252 HBeAg positive samples. The most accurate models were achieved when variants with near-zero variance were excluded from combined dataset leaving n = 432 variants for machine learning. The best performing model had a balanced accuracy of 0.98 (accuracy = 0.97; sensitivity = 0.96; specificity = 1; kappa: 0.93) against test data derived from both data sets (0.9/0.1 ratio of train/test data), Fig. 5C,D. By combining the datasets the relative contribution of each variant to classification accuracy to improved and altered the ranking of the variants with the greatest contribution. The distribution of the top-ranking variants in the HBV genome is presented in Fig. 6A,B. Using this model to classify HBeAg status of the additional n = 37 samples (HBeAg status: n = 10 negative; n = 27 positive) from treated patients (not seen in the training of the model) was highly accurate (balanced accuracy = 1).

Conclusions
Hepatitis B virus has a baseline mutation rate of 1-7 e-05 base mutations/year and this diversity is fundamental to an understanding of the associations with the course of clinical disease, mutational dynamics, evaluation of novel strains, susceptibility to therapeutic interventions and development of immunity 27 . We report the first study to undertake ultra-deep sequencing of the whole HBV genome in almost 400 patients from two distinct cohorts. Ultra-deep sequencing demonstrated that, in general, variants arising within a single patient represented unique or rare mis-sense mutations that occur at the level of detection for the platform used (>1%). A subset of variants www.nature.com/scientificreports www.nature.com/scientificreports/ were found exclusively or with high frequency within the distinct genotypes. Differences were evident between genotypes in terms of the heterogeneity of viral populations with genotype D samples revealing greater sequence diversity with many low frequency mutants within individuals and more variation between patients than genotype A; greater similarity was evident between genotypes B and C.
The quasispecies populations of paired liver and plasma samples were analysed in a small patient cohort (n = 10). Few variants demonstrated a differential allele frequency greater than 10 percentage points between liver and plasma; we concluded that plasma samples represented an adequate proxy for inferring the diversity in the viral quasispecies population in chronically infected patients. In work by Nishijima and colleagues 37 , to determine the correspondence of hepatic quasispecies with plasma populations in patients undergoing liver transplantation, no differences in Shannon entropy was found. Likewise, no significant differences have been found between multiple paired liver samples and plasma in HCV-infected patients 38 . Coffin, et al., reported location and disease phase-specific differences in variants 39 although limited differences were found between liver, polymorphonuclear cells (PBMCs) and plasma over an extended period of analysis 10 . This corroboration of quasispecies populations in liver and plasma supports the further investigation of plasma biomarkers to evaluate the presence of disease-associated variants.
Contemporary sequencing studies have focused on particular regions of the HBV genome, e.g. HBs MHR 28 , core 29 , and RT 30 , or small whole genome sequencing studies 9 . Using ultra-deep sequencing, with high coverage, of the complete HBV genome we employed a random forest algorithm 40 , an ensemble method that builds 'forests' of binary decision trees to improve the classification accuracy of weak predictors, to define HBeAg status in patients across genotypes and demographics. Although machine learning approaches have been applied recently for virus data 41-43 these methods have been applied sporadically to HBV 41,[44][45][46] . The analysis presented here demonstrates the utility of unbiased, data-driven approaches to reveal novel aspects of HBV biology. Harnessing the power of two cohorts ML defined patterns of viral variant allele frequencies that accurately classified HBeAg status that was not possible from a single dataset; this model contained both variants with known temporal associations with HBeAg seroconversion, confirming the validity of the approach, and novel variants that aided the discrimination of HBeAg status that would not have been discovered by traditional statistical approaches.
HBe loss has a temporal correlation with elevated frequencies of a precore mutation at nG1896A and/or nG1764A/nA1764T basal core promotor mutations 47 although these mutations are not present in all genotypes/ sub-genotypes during seroconversion 22 . The ML models confirmed nG1896A as the most predictive mutation in classifying the HBeAg status of chronically infected HBV patients. Additionally, we note the strong contribution from nA1934T, nC2501T, nG1899A in HBeAg classification together with genotype-associated variants. The combined models included nC2501T, an intergenic variant found exclusively in genotype A in Dataset A at an allele frequency >0.91 in patients (n = 58) of both HBeAg classes. This variant is not associated with the six nucleotide insertion in genotype A. This highlights genotype specific discrimination in the combined model where the majority of samples were genotype B (n = 187). Similarly, the nA1934T precore/core variant was associated with genotype A, D and E samples and not found in genotype B and C samples in either dataset indicating another genotype partition in the combined model. The nA1934T mutation is reflected in a Thr12Ser amino acid substitution and is a mutation within the MHC class II restricted T-cell epitope (CD4 + T h epitope [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15][16][17][18][19][20] in the core protein 48,49 known to be associated with clinical reactivation during lamivudine treatment 50 and HCC 51 . This suggests that these variants contribute to partitioning of genotypes with subsequent classification of HBeAg status by the frequency of nG1896A, nG1899A, nG1764A/A1764T mutations. This study identifies novel patterns of viral variants associated with existing HBeAg status, however, makes no inference about the mechanism by which this status was reached (no historical clinical data) or what the model means with respect to patient outcomes (data was not part of a longitudinal study). Equally, we cannot comment on patient profiles that are not represented in the study population, e.g. HBeAg negative inactive carriers with low HBV DNA loads. The ML model was formed on the known HBeAg status as defined by standard diagnostic techniques, i.e. the development of an alternative diagnostic test was not the intent of this study. Although we present a classification model with high discriminative accuracy this does not translate directly to changes in clinical practice and decision support. To make such a model applicable we require prospective studies with serial sampling to capture patients in the process of seroconversion and follow treatment groups. Further, such a model requires to be calibrated to the population of interest (e.g. our study uses two clinical cohorts with differing healthcare approaches) and be applied to a clear risk-sensitive decision point in the clinical setting (e.g. change in therapeutic regimen) 52 . We are, however, encouraged by the performance of the general model in discriminating HBeAg status in the n = 37 patients receiving treatment from the Dataset B cohort which served as an independent test group. By including a diverse sample population for feature-selection it was possible to establish a general model for HBeAg classification with broad relevance to the clinical population. Furthermore, we characterize the incidence of resistance-associated mutants in naïve patients; previous work has shown the presence of these mutants in naïve patients 30,53 and our work further supports the future requirement for baseline sequencing of infected individuals to tailor therapeutic regimens.
The utility of ML approaches to clinical decision making in infectious diseases is not currently widely appreciated, however, the application of deep sequencing and ML analysis to identify data patterns could facilitate the targeting of specific therapeutic interventions to high risk groups, aid stratification of patients for more effective clinical trial design, link models to clinical decision support tools and, through incorporating patient demographic data, facilitate epidemiological and healthcare planning through a deeper understanding of the relationship between compound factors 27,32,33,54 . Our study demonstrates that plasma HBV quasispecies adequately represent the viral populations within hepatocytes and that these profiles, when interrogated with machine learning approaches, can recapitulate classification of patients by clinical marker status in addition to revealing novel biology.