Introduction

Liver disease accounts for approximately 2 million deaths per year worldwide. In the United States, the mortality rate for chronic liver diseases (CLD) increased 31% from 2000 to 2015, making it the fifth leading cause of death in 2017 for persons aged 45–64 years1. The history of liver genetic diseases dates back to 1865–1890, when Trousseau and von Recklinghausen described hemochromatosis2. The cloning, mapping, and functional characterization of the homeostatic iron regulator (HFE) gene in the 1990s paved the way for molecular diagnosis of hemochromatosis3. The advent of next-generation sequencing (NGS) approaches has led to the discovery of genetic disorders causing liver disease phenotypes such as fibrolamellar hepatocellular carcinoma4, recurrent acute liver failure5,6,7, and idiopathic non-cirrhotic portal hypertension8,9. These findings demonstrate the power of NGS for identifying novel genetic forms of liver diseases.

NGS has been successfully deployed in clinical care to diagnose monogenic forms of neurologic, developmental, cardiac, or renal disorders10,11. While genetic testing of single genes or small gene panels has been used for some suspected hereditary liver diseases12,13,14,15,16, NGS approaches have not been widely adopted into the routine evaluation of liver disease. As sequencing costs decline and clinical utility is demonstrated, a standardized genetic diagnostic pipeline for liver disease could benefit patients and clinicians by enabling efficient clinical diagnoses and early recognition of rare genetic disorders that may manifest as a common liver phenotype and may not be recognized based on their clinical workup17. In this paper, we outline an analytic approach and conduct a clinical sequence interpretation for exome sequencing (ES) data from 10,801 individuals (Supplementary Table 1), including 758 patients with CLD as encountered at various stages of their diagnostic workflow. We present the diagnostic utility of ES for liver diseases, highlight special considerations, and elaborate on the potential for misclassification in the genetic workup for liver diseases.

Results

Characterization of 502 liver genes with Mendelian hepatobiliary disorders

In a comprehensive search for Mendelian genetic disorders with any liver abnormalities prompting clinical referral to a hepatologist, we manually curated a total of 959 genes. Of these, 502 had a confirmed abnormal and broad liver disease phenotype, with 193 genetic disorders having primarily liver disease, for example, ABCB11 or ATP8B1 causing progressive familial intrahepatic cholestasis. Other genes lead to liver abnormalities that present clinically as a secondary manifestation. For instance, patients with Fanconi anemia may present with hepatocellular carcinoma18, and individuals with inborn errors of immunity may have acute or chronic liver infection19 (Supplementary Fig. 1). We then annotated inheritance modes and detailed clinical phenotypes related to these 502 genes. In total, 75% of genes were associated with a recessive mode of inheritance (363 AR and 15 XLR). Sixty-two autosomal genes could result in both dominant and recessive disorders, and 62 other genes were associated with exclusively dominant disorders (61 AD and 1 XLD) (Supplementary Table 2). The most common clinical presentation of Mendelian hepatobiliary disorders was hepatomegaly, manifesting in 236/502 disorders (47%). Other common clinical manifestations included metabolic disease (25%), liver fibrosis or cirrhosis (25%), elevated hepatic transaminase level (20%), and cholestasis (19%) (Fig. 1A). Most of the genes (298/502, 59%) were associated with a developmental or congenital disorder with liver manifestations (Supplementary Table 2). The 62 genes exclusively associated with dominant inheritance showed significantly higher pLI (Fig. 1B) and missense Z scores (Fig. 1C) compared to the 378 genes associated with recessive diseases. Of the 62 genes associated with both dominant and recessive inheritance, 16 had a pLI score above 0.8, and nine of these sixteen were involved in the immune system. Two genes, STAT1 and INSR, have a missense Z score above three (Supplementary Information 1).
In conclusion, we curated a total of 502 genes that can be related to liver phenotypes with Mendelian inheritance and evaluated the value of this gene list in ES data analysis.

Figure 1

A summary of liver phenotypes in Mendelian genetic disorders. (A) Inheritance mode, annotated clinical liver phenotypes, and biological effects of 502 genes related to Mendelian disorders. The liver phenotypes and inheritance were curated based on OMIM, ClinGen, and a literature search. AD: autosomal dominant disorder; AR: autosomal recessive disorder; XLD: X-linked dominant disorder; XLR: X-linked recessive disorder; AD and AR: genes for which both autosomal dominant and autosomal recessive inheritance were reported. The lower right box shows the numbers of genes with corresponding biological effects and inheritance mode. (B) Box plot of pLI scores of 502 genes in three groups based on inheritance mode. The dark line inside the box represents the median pLI score. The top of the box is the 75th percentile and the bottom of the box is the 25th percentile. The endpoints of the whiskers are at a distance of 1.5*IQR, where the interquartile range (IQR) is the distance between the 25th and 75th percentiles. The points outside the whiskers are marked as dots and are considered extreme points. (C) Violin plot of missense Z scores of 502 genes in three different groups based on inheritance mode. P values in B and C for differences between dominant and recessive genes were determined using ANOVA.

Assessment of the frequency of candidate pathogenic/likely pathogenic variants

To investigate the prevalence of candidate pathogenic variants in the liver genes, we analyzed ES data from 10,801 individuals, agnostic to the clinical phenotype. Of these, 758 patients were diagnosed with CLD. In addition, two control cohorts were used to evaluate the gene-list-based ES analysis: 7,856 self-identified healthy individuals and 2,187 patients from CUIMC with chronic kidney disease (CKD) (Supplementary Table 1 and Supplementary Table 3)20,21. Based on automated filtering (DP > 9, VQSR filter = PASS, Qual > 49, QD ≥ 2, GQ ≥ 20, MQ ≥ 40, percentage of alt reads > 0.25, MAF < 0.01)20, we initially identified a similar distribution of candidate pathogenic variants, either “DM” in HGMD or “Pathogenic” in ClinVar, across the three cohorts: 1567 (20.2%) in healthy controls, 416 (19.0%) in the CKD cohort, and 159 (21%) in the CLD cohort (Fig. 2A,B, Supplementary Table 4). This implausibly high frequency of variants for monogenic liver disorders suggested variant misclassification. Consistent with this conjecture, an analysis of CADD scores and the maximal MAF from ExAC and gnomAD indicated that many of these variants had allele frequencies too high to be disease causing and had been erroneously reported as pathogenic prior to the availability of public variant databases22,23 (Fig. 2C, Supplementary Information 2). We next used the maximal MAF, MAF ≤ 10⁻⁴ for dominant disorders and MAF ≤ 10⁻³ for recessive disorders, to filter variants, followed by manual review of 403 variants (Fig. 2A,B, Supplementary Table 5)20,23,24. This resulted in 112 variants being classified as P/LP based on ACMG-AMP classification (including 78 PTVs, Fig. 2D), detected across 45 genes in a total of 100 individuals (0.93% of the three cohorts).
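The automated quality thresholds listed in this paragraph can be read as a single pass/fail predicate per variant call. The following is a minimal Python sketch under that reading; the `Variant` class and its field names are hypothetical and chosen for illustration only, and they are not the schema of ATAV or of the pipeline used in this study.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """One variant call. Field names are illustrative assumptions."""
    dp: int              # read depth at the site
    vqsr_pass: bool      # True when the VQSR filter is PASS
    qual: float          # variant quality score
    qd: float            # quality by depth
    gq: float            # genotype quality
    mq: float            # mapping quality
    alt_fraction: float  # fraction of reads supporting the alternate allele
    maf: float           # global minor allele frequency


def passes_automated_filter(v: Variant) -> bool:
    """Apply the thresholds quoted in the text, all of which must hold."""
    return (
        v.dp > 9
        and v.vqsr_pass
        and v.qual > 49
        and v.qd >= 2
        and v.gq >= 20
        and v.mq >= 40
        and v.alt_fraction > 0.25
        and v.maf < 0.01
    )


# Two made-up calls: one well supported, one with insufficient depth.
good_call = Variant(dp=30, vqsr_pass=True, qual=500.0, qd=12.0,
                    gq=99.0, mq=60.0, alt_fraction=0.48, maf=0.0002)
low_depth_call = Variant(dp=8, vqsr_pass=True, qual=500.0, qd=12.0,
                         gq=99.0, mq=60.0, alt_fraction=0.48, maf=0.0002)

print(passes_automated_filter(good_call))       # True
print(passes_automated_filter(low_depth_call))  # False
```

Because every condition is a hard cutoff, a true pathogenic variant that misses any single threshold is dropped, which is why a later manual review with relaxed thresholds can recover additional calls.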
Subsequent to this filtering and manual annotation process, the prevalence of these P/LP variants differed significantly between healthy controls (51/7856, 0.65%), patients with CKD (25/2187, 1.14%), and patients with CLD (24/758, 3.17%) (χ² test OR: 5.00, 95% CI 3.06–8.18, p value = 4.55e−12, Fig. 2A,E). In summary, a search for rare variants in 502 genes associated with liver phenotypes led to a significant enrichment of P/LP variants in the CLD cohort.
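As a worked check of the effect size, the odds ratio for carrying a P/LP variant in the CLD cohort relative to healthy controls can be computed from the counts given above. The sketch below assumes a Wald (log-scale) 95% confidence interval; the text does not state which interval method was used, so this is an assumption, and the function name is hypothetical.

```python
import math


def odds_ratio_wald_ci(case_pos, case_total, ctrl_pos, ctrl_total, z=1.96):
    """Odds ratio with a Wald confidence interval from a 2x2 table.

    The Wald interval is an assumed method, not one named in the text.
    """
    a = case_pos                 # carriers among cases
    b = case_total - case_pos    # non-carriers among cases
    c = ctrl_pos                 # carriers among controls
    d = ctrl_total - ctrl_pos    # non-carriers among controls
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Counts from the text: 24/758 CLD patients vs 51/7856 healthy controls.
or_cld, lo_cld, hi_cld = odds_ratio_wald_ci(24, 758, 51, 7856)
print(f"OR = {or_cld:.2f}, 95% CI {lo_cld:.2f}-{hi_cld:.2f}")
# prints "OR = 5.00, 95% CI 3.06-8.18"
```

Under these assumptions the result agrees with the reported OR of 5.00 (95% CI 3.06–8.18), which suggests the reported comparison is CLD versus healthy controls.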

Figure 2

A search for candidate variants and ACMG-AMP classification revealed an enrichment of P/LP variants in the CLD cohort. (A) Approach used to identify pathogenic/likely pathogenic variants in the CLD cohort. We started with a search for all the candidate pathogenic variants with global AF less than 1% in gnomAD for 10,801 WES samples and obtained an implausibly high frequency of monogenic disorders. We then applied a stringent filter based on maximal population MAF and manually annotated a total of 403 variants based on ACMG-AMP guidelines. We concluded that a total of 112 variants are pathogenic/likely pathogenic and that 1% of individuals might benefit from further clinical evaluation. (B) A diagram of variant filtering and the candidate pathogenic variant search for monogenic liver disease genes. Variants were classified as follows: DM in HGMD but not pathogenic in ClinVar (cyan); pathogenic in ClinVar but not DM in HGMD (orange); pathogenic in ClinVar and DM in HGMD (green); and new protein-truncating variants not reported in HGMD or ClinVar (purple). (C) Variants with high population MAF in dominant disorders with liver phenotypes: the X-axis is the CADD Phred score of each variant; the Y-axis is the -log10 of the highest MAF, which was extracted from the following subpopulations: African/African American (AFR), Latino (AMR), Ashkenazi Jewish (ASJ), Finnish (FIN), Non-Finnish European (NFE), East Asian (EAS), South Asian (SAS) and Other (OTH) from ExAC and gnomAD data. Circle size indicates the total number of individuals carrying the variant. If 20 or more individuals were found to be carriers, the gene name and count are given. (D) Schematic presentation of individuals in each cohort with pathogenic/likely pathogenic variants, the majority of which are PTVs. (E) A Venn diagram showing a total of 45 genes found in at least one affected individual from the three cohorts. Five genetic disorders were found in all three cohorts.

Second-level annotation of the CLD cohort identifies additional pathogenic variants

To maximize the identification of diagnostic variants in the CLD cohort, we performed a second-level manual assessment, using the more relaxed sequence quality thresholds that we had previously deployed to optimize diagnostic yield in other cohorts21,25. This second-level analysis led to the identification of 16 additional diagnostic variants that explained the liver phenotypes in 14 additional patients (13 genes, Fig. 3A). All 16 variants had been missed because of the high-stringency sequence quality thresholds, and all were confirmed by Sanger sequencing. In addition, we evaluated four well-known pathogenic variants or risk alleles for liver disease that have a MAF above 1%: HFE C282Y and H63D, and SERPINA1 E264V (Pi*S) and E342K (Pi*Z). We found two patients with P/LP variants in HFE (one with a homozygous HFE C282Y variant, and one with an H63D/c.340+1G>A genotype, Table 1). Both had high serum iron transferrin saturation and ferritin levels, and clinical presentations consistent with hereditary hemochromatosis. For SERPINA1, three patients in the CLD cohort had a homozygous Pi*Z genotype, and all of them had a clinical diagnosis of alpha-1 anti-trypsin deficiency (Table 1). Altogether, this second-level analysis increased the diagnostic yield in the liver cohort to 43/758 cases (5.7%, Fig. 3A).

Figure 3

Genetic diagnoses and clinical implications of ES findings in the liver disease cohort. (A) A total of 43 CLD patients with P/LP variants identified by three search approaches; (B) A total of 25 genetic disorders were found in the CLD cohort. Red stars indicate the genetic disorders causing primarily liver diseases; (C) An investigation of clinical phenotypes and genetic diagnosis in CLD patients with P/LP variants; (D) Clinical implications of the genetic findings.

Table 1 HFE and SERPINA1 variants in three cohorts.

Genetic diagnoses and their clinical implications

Overall, we identified a total of 25 genetic disorders in the liver disease cohort, with Alagille syndrome, alpha-1 anti-trypsin deficiency, cystic fibrosis, and progressive familial intrahepatic cholestasis-2 detected in at least three patients each (Fig. 3B). There were no differences observed in sex, race, or ethnicity between the patients with or without a genetic diagnosis in the liver disease cohort. In a univariate analysis, younger age, a clinical diagnosis of congenital liver disorders, and abnormally elevated serum transaminase activities of unknown cause were associated with a higher rate of genetic diagnosis (Table 2). We next performed a case-level review to assess concordance between genotype and phenotype. Among 43 liver disease patients with P/LP variants (Supplementary Information 3), we confirmed a previous clinical diagnosis for eleven, identified a genetic disease that partially explained the phenotype for eleven, reclassified disease for seven, identified a molecular subtype of inherited liver diseases for six, and identified a cause for undiagnosed liver diseases for five. We also recommended further workup in three patients to confirm or refute the liver diagnosis (Fig. 3C). In addition, we examined the phenotypic concordance for the 25 kidney patients carrying P/LP variants in liver genes: 15/25 patients had a corresponding liver phenotype, which was mostly attributable to P/LP variants in genes such as PKD1, MODY genes, or ciliopathy genes causing both kidney and liver disease (Supplementary Information 3). Benefits of a genetic diagnosis included the ability to guide familial testing and obtain an early diagnosis of affected family members for 24 families, or to perform surveillance for known complications, such as brain aneurysms in individuals carrying a pathogenic variant in PKD1. Four patients with HFE and PFIC2 will be followed clinically for progression to the stages of disease at which cancer screening is appropriate.
Patients with PGM1 and PHKA2 pathogenic variants, diagnostic of congenital disorders of glycosylation, can benefit from selective nutritional management. Other implications for better treatment include targeted therapy, clinical trials, or surgical options. For example, a review of clinicaltrials.gov identified 255 clinical trials enrolling patients with the monogenic forms of liver disease identified in this study (Fig. 3D).

Table 2 Clinical characteristics for monogenic diagnoses in the liver cohort from ES analysis.

Discussion

Our primary goal was to evaluate the utility of ES for diagnosis of liver disease. Clinical genetic testing for heritable liver diseases is currently available but is mostly used in pediatric populations. For instance, one lab provides a panel of 72 genes for well-defined monogenic liver diseases, especially cholestasis and biliary atresia26. To guide the ES analysis, we developed a list of 502 genes associated with a Mendelian disease with potential liver phenotypes (Fig. 4). This work constitutes an initial attempt at a gene list for monogenic liver disease, but the list will have to be continuously annotated and updated to include new information about genes and variants. For example, we updated the list to include several genes (TULP3 (ref. 27), KIF12, USP53 (ref. 28), KCNN3 (refs. 29,30), and GIMAP5 (ref. 9)) that were implicated in monogenic disorders associated with liver phenotypes during the course of this study. We also removed some genes which, in retrospect, did not have a secure causal relationship with CLD. In the future, the creation of a liver disease workgroup, for instance, under the ClinGen platform or PanelApp31, will accelerate the development of a reference gene list for CLD.

Figure 4

A summary of the genetic analytic strategy and outcomes for liver diseases.

The current challenge of genetic analysis is to determine the pathogenicity of variants. In this work, we focused on genes associated with monogenic disorders and omitted analysis of risk factors, such as PNPLA3. Consistent with prior studies of other genetic disorders, our variant-level analyses indicated that many previously reported P/LP variants for liver diseases are too common to be pathogenic and are erroneously annotated in reference databases. We report liver disease genes with the most frequently encountered false-positive P/LP variants to help with the reannotation of reference databases (Supplementary Information 3). We also performed a manual annotation of the data, which confirmed that the application of hard filters for allele frequency and sequence quality may lead to the omission of true pathogenic variants (Fig. 4). For example, in addition to the high-frequency CFTR, HFE and SERPINA1 pathogenic variants, two patients with progressive familial intrahepatic cholestasis type 3 carried an ABCB4 Ala934Thr missense variant, which has a MAF of 1.2% in African-American populations and should be interpreted as a pathogenic variant (Supplementary Information 2). Likewise, a pathogenic p.Lys414fs variant in the carnitine palmitoyltransferase II (CPT II) gene has an allele frequency above 0.1% in the Ashkenazi Jewish population32. Thus, a balanced disease-specific approach was necessary for maximizing the diagnostic rate. A case-level review indicated that the genetic results were consistent with the clinical findings in the majority of liver and kidney disease cases, validating our approach. The genetic findings had many implications for diagnosis, risk stratification, surveillance, treatment and management, including potential eligibility for clinical trials.
For the patients who did not have a concordant liver disease phenotype, the P/LP variants may be non-penetrant, disease may develop in the future, or the variant may be downgraded in the future based on evidence of non-pathogenicity. We note that our study is limited by the lack of clinical information for most self-reported healthy controls, which hampers our ability to determine the causality of P/LP variants in this cohort.

Altogether, our single-center study indicates a significant diagnostic utility for ES in the evaluation of patients with CLD. However, clinical genetic diagnosis based on ES is currently limited by several pitfalls. First, ES cannot detect P/LP variants in intronic or poorly covered regions. Second, we did not perform homozygous CNV calling on the ES data and may have missed heterozygous CNVs or small genomic deletions, such as DNAJB1-PRKACA in fibrolamellar hepatocellular carcinoma. Third, because we required alternative alleles to account for more than 30% of total reads from genomic DNA extracted from blood, mosaic and somatic genetic disorders could not be ruled out. Lastly, individuals with a single heterozygous P/LP variant in a gene with recessive inheritance were excluded from further analysis in this manuscript. Ideally, those patients should be identified, with further efforts to investigate the corresponding clinical phenotypes. If the clinical phenotypes are consistent with a recessive disorder, searching for an additional in-trans variant may be important to guide the genetic diagnosis. Therefore, combining WGS and RNA-Seq of liver biopsies may increase the genetic diagnosis rate. Future studies will have to evaluate the diagnostic utility across varied healthcare settings, apply different genetic testing strategies, and prospectively demonstrate the impact of genetic testing on clinical decision-making, cost-effectiveness, and genetically stratified clinical trials.

Material and methods

Developing a list of monogenic disorders associated with liver phenotypes

We first composed a gene list, or “liver gene list”, to identify genes causing monogenic diseases with a wide range of liver manifestations. We used the Online Mendelian Inheritance in Man database (OMIM), Orphanet, and the Human Phenotype Ontology (HPO) database33 to search for potential genes with Mendelian inheritance that had been associated with or shown to be causative in liver disease before December 2018. For the search, we used a total of 30 keywords or phrases (Supplementary Fig. 1), then manually reviewed OMIM and related literature34,35,36,37,38,39. We excluded: (1) genes not reported to be linked to any abnormal liver phenotypes; (2) genes within a locus reported from linkage analysis without any known pathogenic variants; (3) genes only discovered in GWAS but lacking any evidence for Mendelian inheritance; (4) genes with only somatic variants reported in abnormal liver phenotypes; (5) genes within a locus associated with abnormal liver phenotypes due to chromosomal abnormalities. The selected genes were annotated for biological functions, clinical liver presentations, and gene constraint scores40. We annotated the inheritance mode of liver genes based on OMIM and ClinGen, then manually curated the list of genes by reviewing relevant literature. The current gene list is an initial attempt to catalog monogenic liver diseases and remains a work in progress. We anticipate that this list will require regular updates and curation and may serve as the basis for a reference liver gene list that can be curated by an expert group, such as ClinGen41.

Cohorts

We analyzed ES data obtained by sequencing of genomic DNA extracted from peripheral blood of 758 patients with CLD. We enrolled patients from both pediatric and adult liver clinics at Columbia University Irving Medical Center (CUIMC) who were interested in and able to consent to participating in genetics research, without setting inclusion or exclusion criteria (Table 2). In the CLD cohort, 53.7% of participants were female, and 33.6% of participants were under 22 years of age. Of the patients with CLD, 182 (24%) were diagnosed with nonalcoholic fatty liver disease (NAFLD) or nonalcoholic steatohepatitis (NASH) and 128 (16%) with autoimmune hepatitis (AIH), primary biliary cholangitis (PBC), or primary sclerosing cholangitis (PSC); patients with viral hepatitis (n = 125) and alcoholic hepatitis (n = 27) were also included. A few cases of acute liver failure (n = 5), hepatocellular carcinoma (HCC, n = 3), hepatoblastoma (n = 7), or cardiogenic liver cirrhosis (n = 9) were sequenced and analyzed together with the rest of the cohort. Selection bias may have occurred, as we attempted to enroll those who might have a genetic cause of liver disease. The CKD cohort was included because we had access to health records through CUIMC, enabling us to evaluate the penetrance of monogenic liver disorders in a cohort not ascertained for liver disease. Written informed consent was obtained from each patient, and the study protocol conformed to the ethical guidelines of the 1975 Declaration of Helsinki as reflected in a priori approval by the CUIMC Institutional Review Board.

Sequence analysis and variant annotation

Sample preparation, target enrichment, sequencing, read alignment, and variant calling have been described previously20,21. We focused on variants that were predicted to have at least moderate to strong biological effects on protein function and excluded those in intergenic and promoter regions. We used stringent quality filters and removed potential technical false-positive insertions and deletions (indels) using ATAV as previously described20,42. We excluded variants failing quality cutoffs in gnomAD or those identified as sequencing artifacts through a comparison with in-house control sequencing data. Current guidelines recommend considering all variants with a minor allele frequency (MAF) of less than 1% at the population level. Thus, we filtered the variants based on an overall MAF of less than 1% in the Genome Aggregation Database (gnomAD)43. Variants previously reported as pathogenic were identified using HGMD and ClinVar. We included only those annotated as pathogenic/likely pathogenic (P/LP) in ClinVar or disease-causing mutation (DM) in HGMD without any conflicting evidence within each database. In addition, we identified novel protein-truncating variants (PTVs) not previously reported in either HGMD or ClinVar. As the initial yield of individuals carrying candidate pathogenic variants was significantly higher than expected, we employed a stringent filter by inheritance mode and subpopulation MAF based on the data from gnomAD and the Exome Aggregation Consortium (ExAC): MAF ≤ 10⁻⁴ for dominant disorders and MAF ≤ 10⁻³ for recessive disorders20,23,24. We used the Loss-Of-Function Transcript Effect Estimator (LOFTEE) filter to exclude PTVs with a false prediction. A detailed description of the genetic terminology used in this study has been provided previously20.
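The inheritance-aware frequency filter described above can be sketched as a comparison of the highest subpopulation MAF against a cutoff chosen by inheritance mode. This is a minimal illustration only: the function names, the dictionary layout, and the example frequencies are hypothetical, and how genes with both dominant and recessive inheritance were handled is not specified in the text, so the sketch leaves that choice to the caller.

```python
# Cutoffs quoted in the text; the dictionary keys are illustrative labels.
MAF_CUTOFF = {"dominant": 1e-4, "recessive": 1e-3}


def max_subpopulation_maf(subpop_mafs):
    """Highest MAF across subpopulations (0.0 when the variant is absent)."""
    return max(subpop_mafs.values(), default=0.0)


def passes_maf_filter(subpop_mafs, inheritance):
    """Keep a variant only if its maximal subpopulation MAF is at or below
    the cutoff for the given inheritance mode."""
    return max_subpopulation_maf(subpop_mafs) <= MAF_CUTOFF[inheritance]


# Made-up frequencies for illustration (population labels as in gnomAD).
rare_everywhere = {"AFR": 2e-5, "NFE": 5e-5, "EAS": 0.0}
enriched_in_one = {"AFR": 1.2e-2, "NFE": 1e-5, "EAS": 0.0}
intermediate = {"AFR": 0.0, "NFE": 5e-4, "EAS": 0.0}

print(passes_maf_filter(rare_everywhere, "dominant"))   # True
print(passes_maf_filter(enriched_in_one, "recessive"))  # False
print(passes_maf_filter(intermediate, "dominant"))      # False
print(passes_maf_filter(intermediate, "recessive"))     # True
```

Using the maximum across subpopulations rather than the global MAF is the stricter choice: a variant that is rare overall but common in one ancestry group is still removed, which is also why well-established founder variants need a separate manual review.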

Manual variant classification and clinical data review

Two independent genetic analysts performed a first-tier, stringent analysis of the CLD cohort to reach a consensus classification according to the ACMG-AMP guidelines44. We next performed a second-level manual curation of the CLD cohort using lower-stringency filters, which identified several well-defined pathogenic variants that had been excluded because they either had a MAF above 1% in some ethnic subpopulations or did not pass the stringent sequencing quality filters. This procedure had been successfully used to increase diagnostic yield in prior studies44,45. Subsequently, a multidisciplinary group of experts, including genetic counselors, geneticists, molecular pathologists, and clinicians, reviewed the available clinical information for individuals carrying P/LP variants to assess phenotypic concordance with the associated disease and its mode of inheritance. If diagnostic evidence was insufficient based on chart review, a follow-up plan was recommended to clarify the significance of the genetic findings.

Statistical analysis

We compared the probability of being loss-of-function intolerant (pLI) and missense Z scores of genes across the three inheritance groups using an analysis of variance (ANOVA) test. We compared clinical variables between those with and those without a genetic diagnosis using the Chi-squared test. All statistical and genetic analyses were done in R statistical software (Version 4.0.0). A p-value of < 0.05 was considered significant after correction for multiple hypothesis testing.
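For readers unfamiliar with the test, the one-way ANOVA F statistic used to compare a score across three groups can be sketched as follows. The analyses in this study were run in R; the Python below is only an illustration of the calculation, the function name is hypothetical, and the scores are made-up toy values, not data from the study.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F statistic, df between groups, df within groups)."""
    k = len(groups)                              # number of groups
    n = sum(len(g) for g in groups)              # total observations
    grand_mean = mean(x for g in groups for x in g)
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = k - 1
    df_within = n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy scores for three hypothetical gene groups (not study data).
toy_groups = [
    [0.9, 0.8, 1.0],   # e.g. dominant-only genes
    [0.1, 0.2, 0.0],   # e.g. recessive genes
    [0.5, 0.4, 0.6],   # e.g. genes with both modes
]
f_stat, df_b, df_w = one_way_anova_f(toy_groups)
print(f"F({df_b}, {df_w}) = {f_stat:.1f}")
```

Converting the F statistic to a p-value requires the F distribution, which is not in the Python standard library; a statistics package would be needed for that step.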

Ethics declaration statement

I attest that the research included in this report was conducted in a manner consistent with the principles of research ethics, such as those described in the Declaration of Helsinki and/or the Belmont Report. In particular, this research was conducted with the voluntary, informed consent of all research participants, free of coercion or coercive circumstances, and received Columbia University Irving Medical Center Institutional Review Board (IRB) approval consistent with the principles of research ethics and the legal requirements of the lead authors' jurisdictions.