Rare Pathogenic Variants Predispose to Hepatocellular Carcinoma in Nonalcoholic Fatty Liver Disease

Nonalcoholic fatty liver disease (NAFLD) is a rising cause of hepatocellular carcinoma (HCC). We examined whether inherited pathogenic variants in candidate genes (n = 181) were enriched in patients with NAFLD-HCC. To this end, we resequenced peripheral blood DNA of 142 NAFLD-HCC, 59 NAFLD with advanced fibrosis, and 50 controls, and considered 404 healthy individuals from 1000 G. Pathogenic variants were defined according to ClinVar, likely pathogenic as rare variants predicted to alter protein activity. In NAFLD-HCC patients, we detected an enrichment in pathogenic (p = 0.024), and likely pathogenic variants (p = 1.9*10−6), particularly in APOB (p = 0.047). APOB variants were associated with lower circulating triglycerides and higher HDL cholesterol (p < 0.01). A genetic risk score predicted NAFLD-HCC (OR 4.96, 3.29–7.55; p = 5.1*10−16), outperforming the diagnostic accuracy of common genetic risk variants, and of clinical risk factors (p < 0.05). In conclusion, rare pathogenic variants in genes involved in liver disease and cancer predisposition are associated with NAFLD-HCC development.

Nonalcoholic fatty liver disease (NAFLD) is now the leading cause of hepatic damage worldwide 1 . In a minority of affected individuals, NAFLD can progress to cirrhosis and hepatocellular carcinoma (NAFLD-HCC), which is emerging as a leading cause of liver-related mortality [2][3][4] .
Aging, male sex, presence of type 2 diabetes (T2D) and cirrhosis are clinical risk factors for NAFLD-HCC 2 . However, due to the high prevalence of NAFLD in the general population, the low incidence of NAFLD-HCC, but the frequent occurrence of this kind of cancer in non-cirrhotic individuals, and the competition with metabolic comorbidities as a cause of death, mass screening of NAFLD-HCC is currently unfeasible. As a result, most cancers are diagnosed at advanced stages, leading to a dismal prognosis for those affected 5,6 . Therefore, identification of new biomarkers able to improve NAFLD-HCC risk stratification are of paramount clinical importance for the development of targeted and cost-effective surveillance programs.
Heritability is involved in HCC predisposition 7 , and NAFLD has a strong genetic component 8 , as does its progression to advanced disease 9 . A common genetic variation encoding for the I148M variant in PNPLA3, the major inherited determinant of hepatic fat accumulation, predisposes to NAFLD-HCC independently of the effect of fibrosis 10,11 . We have also showed that other common genetic variants influencing hepatic fat accumulation, namely those in TM6SF2 and MBOAT7, may improve the ability to discriminate NAFLD patients at risk of HCC 11 . However, possibly because of the still large fraction of missing heritability, carriage of these variants was not specific enough to identify patients at risk of NAFLD-HCC to be implemented in clinical practice 12,13 .
Part of the missing heritability in NAFLD may be accounted for by carriage of rare genetic variants with a large effect size. Indeed, rare germline genetic variants in TERT and other loci are associated with occurrence of NAFLD-cirrhosis and familial HCC 14,15 , supporting the hypothesis that rare genetic variants contribute to NAFLD-HCC risk and phenotype variability. However, a systematic evaluation of candidate genes in a large number of affected individuals has not been performed so far.
Taking advantage of next generation whole exome sequencing (WES), the aim of this study was therefore to examine whether variants in candidate genes involved in liver disease and cancer predisposition, that are either pathogenic (that is already linked to a pathological phenotype) or rare and predicted with stringent criteria to alter protein activity (likely pathogenic), are enriched in NAFLD-HCC as compared to healthy individuals and patients with advanced NAFLD. The main outcome was the overall enrichment in pathogenic/likely pathogenic variants, hypothesizing that their identification might be useful in discriminating disease risk and ultimately inform a stratified approach to HCC screening or surveillance in the ever-increasing numbers of patients with NAFLD.
Such an approach enabled us to evaluate a large panel of genes, which would have been not possible by using targeted panels, while restricting at the same time the analysis to those for which there is already solid evidence of a causal relationship with liver disease/cancer.

Methods
Study cohorts and design. The evaluated cohorts and study flow chart are presented in Fig. 1.
The discovery NAFLD-HCC cohort included 72 Italian patients and 70 UK patients, who were enrolled between January 2010 and 2016. All were of Caucasian ancestry.
The diagnosis of HCC was based on the EASL-EORTC clinical practice guidelines for management of hepatocellular carcinoma 16 . Secondary causes of steatosis were excluded on history, including alcohol abuse (≥30 g/day in M/F) and the use of drugs known to precipitate steatosis. Viral and autoimmune hepatitis, hereditary hemochromatosis, Wilson's disease, overt alpha-1-antitrypsin deficiency and present or previous infection with HBV (HBsAg) and HCV were ruled out using standard clinical and laboratory evaluation as well as liver biopsy features. www.nature.com/scientificreports www.nature.com/scientificreports/ Fifty-nine patients with advanced fibrosis due to NAFLD (histological stage F3-F4 or clinically overt cirrhosis) recruited at the Italian institutions during the same period were used as controls. A local ethnically matched control group of comparable sex distribution including 50 Italian healthy blood donors without clinical and biochemical evidences of liver disease, NAFLD, metabolic abnormalities and no alcohol abuse 14 , and the 404 non-Finnish European (NFE) healthy individuals included in the 1000 Genomes database (http://www.internationalgenome.org), for whom complete exome data were publicly available were used as further controls (including 91 Italian and 107 UK individuals).
The study protocol conformed to the ethical guidelines of the 1975 Declaration of Helsinki, was approved by the Ethical committee of the involved Institutions (Fondazione IRCCS Ca' Granda Policlinico and University of Newcastle upon Tyne), and was performed according to the recommendations of the hospitals involved. Informed consent was obtained from each patient or responsible guardian.
The clinical features of individuals included in the study are presented in Table 1. There were four sequential steps to the study (Fig. 1). The first step consisted in whole exome sequencing, variant analysis, identification and prioritization. The second addressed the possibility of enrichment in already known pathogenic variants in candidate genes in NAFLD-HCC cases vs. controls, and identified the most mutated genes and the diagnostic yield for Mendelian monogenic disorders. The third step encompassed the identification of rare variants predicted to alter protein function in the same candidate genes that might be associated with disease predisposition. Finally, a genetic risk score (GRS) for NAFLD-HCC was developed and assessed for its diagnostic accuracy.
Whole exome sequencing, variants identification, annotation and prioritization. The WES sequencing and analytical pipeline is presented in Supplementary Fig. 1.
Briefly, for the 258 samples included in the EPIDEMIC-NAFLD cohort, which were resequenced for this project (Exome sequencing for the Identification of Inherited Variants Involved in HCC development in NAFLD), DNA was extracted from peripheral blood mononuclear cells, and quantified by a Qubit 2.0 analyzer using the Qubit dsDNA BR Assay Kit (Thermo-Fisher Scientific, Waltham, MA, USA). Samples purity was evaluated using a Nanodrop 1000 spectrophotometer (Thermo-Fisher, Waltham, MA, USA) and integrity was assessed by gel electrophoresis.
DNA libraries were enriched for exome sequencing by the SureSelect Human All Exon v5 kit (Agilent, Cernusco sul Naviglio, Milan, Italy). Sequencing was subsequently performed on the HiSeq4000 platform (Illumina, city). Raw reads quality control was performed using FastQC software (Brabaham bioinformatics, Cambridge, UK). Reads mapping on human GRCh37 genome was performed using MEM algorithm of Burrows Wheeler Aligner (BWA) version 0.7.10 17 . Reads with low quality alignments and duplicate reads were filtered out using Samtools 18 to generate high quality bam files. Mapping quality control was performed using Picard-tools (http://broadinstitute.github.io/picard) and Bedtools 19 . Sequencing mean depth was of 73x, and no samples exhibit a mean depth lower than 50x (Supplementary Fig. 2 panels a,b). Sequencing resulted in a good target coverage: almost all samples exhibited more than 90% coverage of the target at 20x depth. Importantly, sequencing   www.nature.com/scientificreports www.nature.com/scientificreports/ statistics in terms of input reads, high quality mapped reads, mean depth and coverage, did not show variations among the different cohorts ( Supplementary Fig. 2

panel c).
Variant calling was performed following GATK best practices 20 . Briefly, indel local realignment, base quality recalibration and variants calling (Haplotypecaller algorithm) were performed using GATK version 3.3.0 21 . GVCF joint and variants filtering using variant quality score recalibration (VQSR) method were performed. Variants quality score log-odds (VQSLOD) above 99% tranche were considered true positives. To avoid the possibility of calling somatic variants due to the presence of circulating tumor DNA, variants present in <20% of total reads were discarded. Indel left normalization was performed using BCFtools software 22 . Variants annotation was performed using both variant effect predictor (VEP) 23 and ANNOVAR 24 tools.
Variants filtering was performed using VCFtools 25 to exclude variants over VQSLOD threshold and variants which were called in less than 95% of samples. All intronic and synonymous variants according to VEP prediction were excluded from the analyses. Multidimensional scaling of identity-by-state distances analysis was conducted on EPIDEMIC study samples exploiting SNPRelate R Bioconductor package 26 . As shown in Supplementary Fig. 3, in the EPIDEMIC project samples, the first component of variability was explained by the geographic origin of the patients (Italy vs. UK). Whole exome sequencing reads from 1000 genomes project phase 3 (1000 G) 27 non-Finnish Europeans (NFE; 404 samples) were processed using the same pipeline described for the EPIDEMIC samples.
Candidate genes selection and classification of variants. Candidate genes were selected according to the literature updated at January 2016, among those whose variants were robustly linked with cancer predisposition syndromes 28 , or mutated in HCC 29 , or predisposing to hereditary liver diseases 30 , or involved in predisposition to telomeres diseases 31 , or in iron and lipid metabolism and NAFLD 32 . The complete list of 181 candidate genes and their classification is presented in the Supplementary File, sheet "Candidate genes". For Step 2 (enrichment in rare pathogenic variants in candidate genes and diagnostic rate), variants reported as "likely pathogenic" in the Clinvar (https://www.ncbi.nlm.nih.gov) database 33 , located in candidate genes, and with a minor allele frequency (MAF) <0.05 in 1000 G NFE, ExAC databases and in local healthy controls, were selected. For Step 3 (enrichment in rare variants predicted to alter protein function, novel likely pathogenic), we used stringent criteria, that is selection of variants determining an alteration of protein sequence (missense, nonsense, splice sites), located in candidate genes, and with a MAF <0.001 in ExAC NFE, MAF <0.005 in the EPIDEMIC project samples, and a CADD Phred >10 (Top 10% of damaging variants) 34 .
Statistical analysis. For descriptive statistics, continuous variables are shown as mean and standard deviation or median and interquartile range for highly skewed biological variables. Variables with skewed distributions were logarithmically or inverse normally transformed before analyses. All genetic analyses were calculated by using an additive model.
Fisher's Exact test, multivariate or univariate generalized linear models were used when appropriate. Models were adjusted for clinically relevant covariates, as specified in the Results section. Gene enrichment in rare variants was assessed using the cohort allelic sum test (CAST) approach 35,36 . The association between the frequency of variants in genes significantly enriched in NAFLD-HCC vs. healthy controls was next validated against the cumulative frequency observed in NFE individuals included in the ExAC project (n = 33370) by Fisher's exact test, adjusted for the number of comparisons. A NAFLD-HCC risk score was developed as previously described 11,37 . The GRS for HCC was calculated by regressing the number of pathogenic/likely pathogenic variant collapsed at the level of single candidate genes and common genetic risk factors for in PNPLA3, TM6SF2 and MBOAT7 against the presence of HCC. To internally validate the GRS, β coefficients were adjusted using the Jack-knife resampling method. The diagnostic accuracy of different models for NAFLD-HCC prediction was compared by two-sided Venkatraman test 38 . The population attributable risk (PAR) of GRS for NAFLD-HCC was estimated as previously described for case-control studies 39 .
GRS gene functions were explored by pathway enrichment analysis exploiting Ingenuity Pathway Analysis software (Qiagen, Valencia, USA) with default parameters.
Protein features were obtained from Uniprot database (www.uniprot.org) coding variants of interest genes were mapped into reported protein domains and regions. Variants enrichment in protein domains was evaluated by CAST Burden test approach. Lollipop diagrams were generated using Mutation Mapper software (http://www. cbioportal.org/mutation_mapper.jsp).
Statistical analyses were carried out using R statistical analysis software version 3.3.2 (http://www.R-project. org/). P values < 0.05 were considered statistically significant.

Results
Pathogenic variants in candidate genes are enriched in NAFLD-HCC. We first examined pathogenic variants in HCC cases and controls. The list of pathogenic variants identified in candidate genes is presented in the Supplementary File, sheet "Pathogenic". We identified 68 variants previously linked to pathological phenotypes, which met the inclusion criteria. There was an enrichment in the number of pathogenic variants per individual in candidate genes in NAFLD-HCC patients, as compared to patients with advanced fibrosis, healthy individuals and the 1000 G cohort (Fig. 2a  A comutation plot reporting genes interested by pathogenic variants in the newly characterized EPIDEMIC cohorts (excluding 1000 G), as well as common variants previously associated with NAFLD-HCC is reported in Fig. 3. Among the single genes, we found a significant enrichment of variants in the APOB gene predisposing to familial hypobetalipoproteinemia (two variants in cases and none in controls, p = 0.047). Furthermore, we (2019)  Role of rare variants likely determining an alteration in protein activity. The list of rare variants in candidate genes predicted to alter protein activity (likely pathogenic variants) is provided in the Supplementary File, sheet "likely pathogenic". We observed an enrichment in rare pathogenic variants in NAFLD-HCC cases vs. controls ( Fig. 2b and Supplementary Table 2; OR 3.5 95% c.i. 2.2-inf, p = 1.9*10 −6 ). Genes significantly enriched in likely pathogenic variants in NAFLD-HCC cohorts are presented in Table 2. In the overall series or national cohorts, we found a significant enrichment in variants in Telomerase complex genes (RTEL1, TERF2), DNA and  www.nature.com/scientificreports www.nature.com/scientificreports/ oxidative damage response (RB1), and we also highlighted genes involved in regulation of cell growth and proliferation (STK11, TSC1, TSC2, NF2, SMAD4). Genes involved in regulation of hepatic lipid metabolism, including APOB and MBOAT7 were also enriched in rare likely pathogenic variants, and we detected an enrichment in variants of SQSTM1. Most of the associations remained consistent when the cumulative frequency of variants in the candidate genes was compared to that observed in NFE individuals included in the ExAC database. A comutation plot reporting genes significantly enriched in likely pathogenic variants in the EPIDEMIC cohorts is reported in Supplementary Fig. 4.
In order to check the efficacy of the criteria adopted (frequency and predicted impact) for the identification of likely pathogenic variants, we assessed whether variants in APOB (pathogenic/likely pathogenic), which are associated with a clear phenotype that can be assessed by common biochemical tests, influence circulating lipid levels. Results are shown in Supplementary Table 3. Supporting the validity of our selection algorithm, in patients for whom data were available carriage of APOB variants was associated with 46% higher HDL cholesterol and 44% lower triglycerides (p = 0.008 and p = 0.001, respectively), consistent with a hypobetalipoproteinemia phenotype.
All in all, these data suggest that rare genetic variants with a very high likelihood of impacting protein activity are associated with the predisposition to develop NAFLD-HCC.

Genetic risk score development and validation.
To examine whether evaluation of common and rare germline genetic variants may be clinically helpful in the stratification of NAFLD-HCC risk, we developed a weighted GRS for this condition and tested its diagnostic accuracy. The GRS coefficients are shown in Supplementary Table 4. In the total cohort of 655 individuals, the GRS was associated with HCC risk (OR 435, 95% c.i. 111-1903; p < 2*10 −16 , OR 4.96, 95% c.i. 3.29-7.55; p = 5.1*10 −16 for high vs. low GRS). The GRS had an AUROC of 0.74, 95% c.i. 0.69-0.79 for predicting NAFLD-HCC in the total cohort, as compared to 0.79, 95% c.i. 0.73-0.85 and to 0.69, 95% c.i. 0.62-0.75 in the discovery and validation cohorts, respectively (Fig. 4a). The diagnostic thresholds, sensitivity and specificity values are reported in Table S5. The best GRS threshold had a 61% sensitivity and 72% specificity to detect NAFLD-HCC.
The GRS improved the ability to discriminate NAFLD-HCC risk as compared to evaluation of the PNPLA3 I148M variant alone and of a combination of the PNPLA3 I148M and TM6SF2 E167K variants (p < 2*10 −16 ; Fig. 4b and Supplementary Table 5).
The PAR of the overall panel of genetic variants considered in the GRS for the risk of NAFLD-HCC is presented in Supplementary Table 7. In the overall cohort, the genetic variants considered accounted for 48% (95% c.i. 38-59%) of NAFLD-HCC phenotype variability.
Characterization of specific variants. The localization of likely pathogenic variants in selected genes (RETL1, APOB, SQSTM1) in individuals with (n = 142) and without (n = 513) HCC is shown in Supplementary  Fig. 5a-c. For APOB, the rate of truncating variants, determining a higher likelihood of severe impairment in protein function, was higher in HCC vs. non-HCC individuals (p = 0.009). Furthermore, variants in SQSMT1 tended to localize in a protein region reported to interact with NTRK1 (p = NS).
The summary output of Ingenuity pathway analysis of genes included in the GRS (which were mutated in NAFLD-HCC cases) is presented in Supplementary Table S8. As expected, there was a significant enrichment in pathways related to cancer, hepatocellular carcinoma, liver disease, lipid metabolism, and hereditary disorders. Interestingly, there was an over-representation of genes involved in FXR/RXR activation, and regulation of cell cycle at the G1/S checkpoint.

Discussion
NAFLD-HCC is an emerging complication of metabolic conditions such as obesity and T2D 2 . Due to the very high prevalence of the population at risk, classic screening strategies are presently unfeasible. Therefore, novel noninvasive biomarkers are urgently needed to improve disease risk stratification. Indeed, although carriage of the common PNPLA3 I148M variant is a strongly associated with NAFLD-HCC development, taken by itself it is not sufficiently accurate to stratify the risk of this condition 12 .   www.nature.com/scientificreports www.nature.com/scientificreports/ Here we show that in patients with NAFLD-HCC, pathogenic and likely pathogenic variants in genes linked to liver disease and cancer predisposition are enriched as compared to healthy individuals. Furthermore, we have replicated this result in two independent cohorts.
Although further validation is required before target gene resequencing can be recommended in clinical practice, these findings have potential clinical implications. In the present cohort, resequencing of the candidate genes panel led to the detection of likely predisposing genetic conditions in a large fraction of patients presenting with NAFLD-HCC. This may also aid in the identification of family members for whom screening would be cost effective, or specific preventive treatments may be considered. Furthermore, evaluation of a comprehensive GRS, which takes into consideration rare variants, in individuals with NAFLD may allow a more accurate HCC risk stratification and the implementation of targeted surveillance. Indeed, the comprehensive GRS showed superior diagnostic accuracy as compared to the evaluation of common genetic risk factors, including the PNPLA3 I148M variant alone 12 , or a combination of PNPLA3 I148M and TM6SF2 E167K variants 40 . Furthermore, the GRS improved patient stratification, when considered together with classical risk factors for NAFLD-HCC. The clinical utility of GRS assessment should however be tested in familial and prospective studies evaluating patients with NAFLD and other liver diseases.
Furthermore, the present findings may also have pathophysiological implications worthy of exploration. Firstly, although PNPLA3 I148M is the major genetic determinant of NAFLD-HCC, no other rare loss-of-function mutations were identified in this gene among the affected patients. This would support the notion that the I148M acquires new functions able to trigger hepatic fat accumulation, alteration of retinol metabolism, and carcinogenesis [41][42][43] . Secondly, consistently with previous data 11 , other variants favoring hepatocellular fat retention were associated with NAFLD-HCC, including common and rare variants in TM6SF2 and MBOAT7 genes. Thirdly, variants in APOB, responsible for hypobetalipoproteinemia, were collectively observed in a high proportion of Italian patients (15%), and there was a significant enrichment in pathogenic and truncating mutations in this gene in the overall cohort of NAFLD-HCC patients. APOB genetic variants leading to the synthesis of a dysfunctional ApoB100 protein and to a consequent impairment in the export of lipids from hepatocytes within very low-density lipoproteins are responsible for the development of severe hepatic steatosis (hypobetalipoproteinemia, an autosomal dominant disease). At the same time, some APOB variants that lead to the alteration of the first portion of the protein result also in altered activity of ApoB48, the protein isoform expressed by enterocytes. This results in retention of chylomicrones, malabsorption of fat and liposoluble vitamins (retinol -vitamin A, vitamin E and vitamin D), known to play a protective role in liver disease progression, and possibly in the alteration of the intestinal barrier. Most importantly, individuals carrying APOB mutations had a circulating lipid profile consistent with hypobetalipoproteinemia, providing functional validation of the pathogenicity of the genetic mutations identified. Notably, in line with a causal role of hepatocellular lipid retention in promoting NAFLD-HCC, somatic mutations in APOB also frequently occur during hepatic carcinogenesis 44 . The mechanism connecting APOB mutations with carcinogenesis is still not completely understood. Induction of hepatocellular lipid accumulation, oxidative stress, and the loss of a possible tumor suppressive activity of APOB are some of the hypothesis that have been raised 45,46 . Therefore, the identification of APOB mutations in subjects with NAFLD-HCC would be to allow the diagnosis, in these cases mostly unrecognized, of familial hypobetalipoproteinemia in the first-degree relatives, allowing to establish adequate HCC surveillance.
An additional finding was the novel association between variants in SQSTM1 and NAFLD-HCC. SQSMT1 encodes for p62, a component of Mallory-Denk Bodies and hyaline granules 47 . Protein p62 aggregates accumulate in the cytoplasm of damaged liver cells in NASH and HCC 47 , and may promote hepatocytes transformation through the activation of antioxidants and mTOR pathways 48,49 . In keeping, we also identified variants in genes regulating cell growth via the insulin signaling and mTOR pathways, and, in line with previous findings from our group 14,31 , in the telomere regulation machinery. However, as the present study was not designed to this aim, enrichment in variants in specific genes will have to be confirmed in larger cohorts.
This study has some limitations. First, the design was cross-sectional with retrospective data collection, so that GRS for NAFLD-HCC will need to be validated in future prospective studies including individuals with NAFLD and other liver diseases at high baseline risk. However, since HCC remains a relatively rare complication of NAFLD (which affects almost one in three individuals in the general population) the risk of misclassifying healthy controls was minimal, and would have rather led to reduction of the study power. Furthermore, the sample size was relatively limited, and therefore we focused our attention on pathogenic variants in candidate genes, which are already known to cause disease. As we considered healthy individuals from the general population as controls, results are potentially applicable to NAFLD-HCC genetic screening at population level without prior knowledge of liver disease severity status. If it will be proven cost-effective and ethically acceptable, this approach may assist in stratifying the risk of liver disease and HCC, besides of other chronic degenerative diseases. Additional studies are required to discover new variants predisposing to NAFLD-HCC, which were not examined in this study. Moreover, we could not evaluate a control group with advanced fibrosis due to NAFLD for the UK NAFLD-HCC validation cohort, in which a different pattern of mutations was observed as compared to the Italian cohort. Finally, findings may only be applicable to individuals of European descent.
In conclusion, rare pathogenic variants in candidate genes involved in the predisposition to liver disease or cancer are associated with an increased risk of developing NAFLD-HCC.