Cholestatic liver disease encompass a broad range of diagnoses that can present with fatigue, pruritus or jaundice1. Several genes are implicated, including the ABCB4 gene coding for the canalicular phosphatidylcholine floppase ABCB4, and the ABCB11 gene coding for the bile salt export pump (BSEP). Homozygous mutations in ABCB4 and ABCB11 cause a spectrum of disease from mild cholestasis to severe progressive familial intrahepatic cholestasis (PFIC), PFIC3 and PFIC2 respectively2,3. ABCB4 variants also increase the risk of developing drug-induced intrahepatic cholestasis, gallstone disease, gallbladder and bile duct carcinoma, liver cirrhosis and abnormal liver function tests4. Other canalicular transporters and their regulators are implicated in the pathogenesis of cholestatic liver disease e.g. ATP8B1 (a P-type ATPase that flips phospholipids into the cytoplasmic leaflet of the membrane)3, NR1H4 (farnesoid X receptor (FXR)) gene5, and TJP2 (tight junction protein 2)6. While homozygous mutations of these genes are implicated in rare cases of severe familial cholestasis, the evidence base for a role of heterozygous mutations in milder forms of liver disease is limited. ABCB4 and ABCB11 are involved in 20% of patients with severe intrahepatic cholestasis of pregnancy (ICP). Heterozygous ABCB4 variants have also been reported in ICP7,8,9,10,11,12,13. ICP is the commonest gestational liver disease14 and may be complicated by preterm birth, meconium-stained amniotic fluid and stillbirth, in association with maternal serum bile acid concentrations ≥ 40 µmol and the timing of its occurrence during pregnancy15,16,17. In Europeans the incidence is 0.62% versus 1.24% in women of Indian and 1.46% of Pakistani origin18. Despite the increased prevalence of ICP and other liver conditions like non-alcoholic fatty liver disease in South Asian populations they often remain undiagnosed and under-investigated19,20.

Genes and Health is a long-term population-based cohort that assesses the health and disease in British Bangladeshi and British Pakistani people. Using the Genes & Health genomics (whole exome sequencing (WES)) and data linkage to electronic health records (EHR)21, this study investigated rare variation in a unique British Bangladeshi and Pakistani cohort around 5 candidate loci (ABCB4, ABCB11, ATP8B1, NR1H4, and TJP2) implicated in cholestatic liver disease.


Genotype to phenotype analysis

Screening of the 5 candidate genes identified 300 variants; and 180 (166 non-synonymous and 14 loss of function (LoF) variants), were included in the analysis (Table 1). Most variants identified were heterozygous unless otherwise specified. A small number of volunteers carried more than one variant, and this is summarised in Supplementary Table 1. None of these volunteers displayed a disease phenotype. Further variant interpretation including scoring details of individual in-silico predictors, the impact of the coding substitution on disease propensity and conservation information for all variants are presented in Supplementary Table 2.

Table 1 Overall summary of mutational burden discovered in the Genes and Health cohort for all five gene candidates.

Phenotype to genotype analysis

We were able to validate variants of interest in 15 cases reporting ICP, most of whom had documented raised BA concentrations (Table 2). Further, variants discovered in 10 cases with raised BA but an uncertain or no diagnosis of ICP based on their EHR (Supplementary Table 3). This secondary analysis missed 2 individuals from the primary analysis with a diagnosis of ICP as there were no bile acid concentrations available for them. This analysis demonstrated a pragmatic approach to identifying disease causing variants and demonstrates the value of large genomic cohorts linked to electronic health data records.

Table 2 Variants identified and TSBA concentration in volunteers with a diagnosis of ICP based on electronic health records.

Rare variant burden analysis

We observed in cases versus controls a significant enrichment of rare variants in ABCB11 (p = 0.00247). There was no enrichment in ABCB4 (p = 0.39138), ATP8B1 (p = 0.57957), TJP2 (p = 0.17390), or NR1H4 (p = 0.70232). A further Fisher’s exact analysis compared percentage of rare variants in ICP cases in Genes & Health compared to Genomics England demonstrating that the rare genetic burden was significantly increased in tier 1 gene candidates in British Pakistani and Bangladeshi (ABCB4, p = 0.0191; ABCB11, p = 0.0191) but not in ATP8B1 (p = 0.4857) NR1H4 (p = 0.2286) or TJP2 (p = 0.1039).

ABCB4 variants

There was a total of 68 ABCB4 variants identified (Table 1 and Fig. 1). 41 variants fulfilled the inclusion criteria. Variants were identified in people with cholestatic liver disease phenotypes: ICP (n = 5 out of 88 women in the analysis), gallstone disease (n = 7), and cholangiocarcinoma (n = 1) (Table 3). For some identified variants a known cholestatic phenotype had previously been reported in the literature (n = 9) whilst others had no phenotype reported (n = 19) (Supplementary Table 4). We identified novel LoF variants (n = 4): three frameshift (one associated with an ICP phenotype) and one introduction of a premature stop codon (no associated phenotype) (Table 4).

Figure 1
figure 1

ABCB4 variant summary in a 2-dimensional illustration. 41 variants are divided into their phenotypic presentation and coloured by: No phenotype previously reported (n = 19), cholestatic phenotype reported in the literature (n = 9), gallstone disease (n = 7), cholangiocarcinoma (n = 1), and intrahepatic cholestasis of pregnancy (n = 5). Bold border represents variants that are unique to the Genes & Health cohort. Topo2 software (Johns S.J., TOPO2, Transmembrane protein display software, was used for illustration95.

Table 3 Non-synonymous variants identified in the five gene candidates associated with a cholestatic phenotype in the Genes and Health cohort.
Table 4 Loss of function variants identified in the five gene candidates.

ABCB11 variants

There were 77 ABCB11 variants identified of which 48 were included in the analysis (Table 1 and Fig. 2). The associated cholestatic liver disease phenotypes identified were: ICP (n = 5 out of 83 women in the analysis), and gallstone disease (n = 10) (Table 3). Some were linked to cholestatic phenotypes previously reported in the literature (n = 14), whilst for 19, no phenotype had been previously reported (n = 19) (Supplementary Table 5). Likely novel LoF variants were identified (n = 3): a frameshift, a splice-acceptor and an introduction of premature stop codon variant. These variants were not associated with a known disease phenotype (Table 4).

Figure 2
figure 2

ABCB11 variant summary in a 2-dimensional illustration. 48 variants are divided into their phenotypic presentation and coloured by: No phenotype previously reported (n = 19), cholestatic phenotype reported in the literature (n = 14), gallstone disease (n = 10), and intrahepatic cholestasis of pregnancy (n = 5). Bold border represents variants that are unique to the Genes & Health cohort. Topo2 software (Johns S.J., TOPO2, Transmembrane protein display software, was used for illustration95.

ATP8B1 variants

We identified a total of 50 ATP8B1 variants and 31 were included in the final analysis (Table 1 and Fig. 3). There were 7 variants associated with gallstone disease; three appeared to be in linkage disequilibrium (LD) noted in three volunteers: M674T, I577V, and H78Q. The R384H variant was associated with gallstone disease and with an ICP phenotype (in separate individuals, n = 2 out of 33 women in the analysis). A further variant was associated with hepatitis-induced liver cirrhosis (I513T), and a final variant (D70N) was associated with liver cirrhosis, secondary malignant neoplasm of liver and bile duct, and gallstone disease (Table 3). In addition, previously reported cholestatic liver disease phenotypes (n = 7) and variants with no previous reported phenotype were seen (n = 15) (Supplementary Table 6). Finally, we identified 4 novel LoF variants: a frameshift/splice region, frameshift, splice-acceptor/coding-sequence, and stop-gain variant. The latter variant was associated with a gallstone disease phenotype (Table 4).

Figure 3
figure 3

ATP8B1 variant summary in a 2-dimensional illustration. 31 variants are divided into their phenotypic presentation and coloured by: No phenotype previously reported (n = 15), cholestatic phenotype reported in the literature (n = 7), gallstone disease (n = 7), liver cirrhosis (n = 1), Liver cirrhosis and multiple cholestatic phenotype (n = 1), and intrahepatic cholestasis of pregnancy (n = 2). Bold border represents variants that are unique to the Genes & Health cohort. Topo2 software (Johns S.J., TOPO2, Transmembrane protein display software, was used for illustration95.

NR1H4 variants

There were 22 NR1H4 variants in the Genes & Health cohort and 9 variants in the final analysis (Table 1 and Supplementary Fig. 2). We only identified an ICP phenotype (n = 2 out of 33 women in the analysis) (Table 3) and otherwise novel variants that had no previous phenotype reported (n = 7) (Supplementary Table 7). Furthermore, one novel LoF variant was identified without demonstrating a phenotype (Table 4).

TJP2 variants

There were 83 TJP2 variants identified of which 37 were analysed (Table 1). People with TJP2 variants had ICP (n = 3 out of 103 women in the analysis), gallstone disease (n = 8), previously reported cholestatic liver disease phenotype (n = 4), and 22 did not have a previously reported phenotype (Table 3, Supplementary Table 8). There were two novel LoF variants without a clinical phenotype (Table 4).

Protein structure and molecular modelling

A flow chart illustrating the variants included in this analysis is shown in Supplementary Fig. 3. Results of the protein structure and molecular modelling software tools are presented in Supplementary Table 9.

Some novel variants are in regions of transporters for which we can hypothesis a mechanistic impact. Of particular interest are Q1106H in ABCB4 and D191A in ABCB11. These ABC B-family transporters share 48% amino acid identity and are very likely have a common mechanism of action. The two amino acids are conserved in both proteins, and we propose that they are involved in energy transduction through the transporter in order to couple the substrate efflux cycle to the ATP binding and catalysis cycle.

In ABCB4 and ABCB11, two transmembrane domains (TMDs) bind the transport substrates (phosphatidylcholine (PC) and bile acids), respectively. The conformational changes required for substrate transport are driven by ATP hydrolysis at the interface between two nucleotide binding domains (NBDs). The TMDs and NBDs must therefore be intimately coupled, and this is achieved via four ‘coupling helices’ (CH) located at the base of the long intracellular loops extending from the transmembrane alpha helices of the TMDs (Supplementary Fig. 4A).

Q1106 (ABCB4) and D191 (ABCB11) are particularly interesting because they are located at this interface that is conserved in both ABCB4 and ABCB11. Q1106 is in a groove on the surface of NBD2 where it interacts with CH2 (Supplementary Fig. 4B).

In the PC-bound conformation of ABCB4, Q1106 forms a weak electrostatic interaction with the peptide bond of G270. In the ATP-bound conformation (from which PC has most likely been released), Q1106 now interacts with Q272 which illustrates the movement of CH2 and its changing juxtaposition with the NBD during the transport cycle; essentially, a hinge region. The geometry of these interactions will not be preserved if Q1106 is replaced by histidine. In ABCB11, this triad is preserved in Q1150 and G295, with E297 providing a conservative change for the glutamine in CH2 (with respect to formation of an equivalent electrostatic bond with Q1150).

In the sole structural model that we have for ABCB11, D191 is in CH1 where it interacts with Y472 in a surface groove of NBD1 and also, intriguingly, with R946 which is in CH4, suggesting that CH1 and CH4 likely work together in energy transduction through the transporter (Supplementary Fig. 4C).

These electrostatic bonds will not be possible if D191 is replaced with alanine. This triad and its bond architecture is perfectly conserved in ABCB4 in the ATP bound conformation through amino acids D166, Y446 and R902. However, in ABCB4 there is also an additional electrostatic interaction between the carbonyl of the D166 peptide bond and the side chain of Q1171. Q1171 (which is conserved in ABCB11) is adjacent to the ABC signature motif 1172LSGGQ1176 which is involved in coordination of ATP and provides a direct mechanism for how CH1 influences, and responds to, the ATP catalytic cycle of these transporters.


This study identified novel variants implicated in the aetiopathogenesis of cholestatic liver disease that occur uniquely in this British Bangladeshi and Pakistani cohort36,79,80,81. There have not been any other studies of this magnitude analysing the burden of mutational variation in cholestatic liver disease in a large South Asian cohort. Using a genotype to phenotype approach we discovered novel likely pathogenic variants that appear to be unique to this cohort. We then employed a phenotype to genotype analysis using the ICP phenotype as an exemplar, which offered a pragmatic interrogation of electronic health records to identify rare genetic variants that are likely pathogenic. Thus, this study improves representation of this distinct population especially as prevalence of cholestatic liver disease is increased in the Genes & Health cohort, e.g. 1.54% are affected by ICP compared to white Europeans (0.62%). This study demonstrates the importance of multi-ancestry genomic research and offers the potential of tailored treatment for this population.

In the Genes & Health cohort, out of 194 variants meeting inclusion criteria we identified 53 that had a cholestatic liver disease phenotype reported in their linked EHR. Of those, 16 are unique to this British Bangladeshi and Pakistani population and a large number were predicted to be likely pathogenic or known pathogenic based on in-silico prediction tools. In addition, there were 35 variants that were previously reported in the literature with a cholestatic phenotype. However, 87 variants had no previously reported phenotype; 67 were novel (34% of all variants analysed in this study) as they were also not previously reported in the GnomAD population database. Despite that, 9 were considered likely pathologic and 5 known pathogenic. It is important to consider that heterozygosity as noted in most cases means that they are likely rescued by the wild-type allele but at higher risk of disease in later life or during times of liver stress, e.g. during pregnancy.

Our findings reflect the difficulty with interpretation of rare variants in clinically important genes when there is no previous evidence in the literature or functional data to interpret them further. The ACMG rare variant interpretation guideline30 provides a standardised analysis pathway. However, it relies in part on the interpretation of the variant in the context of the literature and does not account for specific genes and diseases. It also may not be robust for flexible membrane proteins which do not work by lock and key mechanism. For example, the ABCB11 variant V444A, considered as benign by the ACMG criteria, has been reported to increase the risk of ICP, hepatitis C disease progression, and drug-induced liver injury although the exact functional mechanisms are not clear yet55,82.

By employing computational protein modelling software tools, we were able to identify variants that likely have a significant impact on the conformation of the protein and could therefore be of clinical significance. It is important to bear in mind that all these tools have inherent flaws and are beyond the scope of this paper to discuss in detail. By taking ICP as a cholestatic liver disease example we were able to highlight further difficulties with rare variant interpretation in gestational syndromes as the inherent transient nature of the disease makes variant interpretation challenging. However, ICP is a clinically relevant example as the gestational disease consequences are not just relevant to their current pregnancy but also can result in later hepatobiliary disorders such as cancer, immune-mediated and cardiovascular diseases83. In addition, they have a higher gallstone-related morbidity and a strong positive association between ICP and hepatitis C exists as well84.


The use of electronic health records to determine phenotype is extremely useful but dependent upon appropriate information having been coded. Participants with at-risk variants may not have presented yet with symptoms of disease but still be at high risk of developing complications at a later stage in their life, particularly given that the median age of volunteers in this study was around 45 years. It demonstrates the difficulty with interpreting variants when recalling the genotype first.


In this study we provide the first comprehensive evaluation of gene candidates associated with cholestatic liver diseases in a unique cohort of British Bangladeshi and Pakistani origin demonstrating the importance of characterising genetic disease in diverse ethnic groups. Our findings have demonstrated the increased mutational burden of cholestatic liver disease in British Bangladeshi and Pakistani people who thus far remain understudied despite their distinct genetic background and increased risk of developing ICP in comparison to other populations. We were able to identify novel variants that have not been previously identified and are likely implicated in disease. We demonstrated the ability to identify participants at risk both by a phenotype or genotype first approach. This demonstrates the importance of providing more personalised care in a clinical setting as identification of high-risk individuals and their family members enables early intervention to prevent adverse outcomes, for example hepato-protective drugs such as UDCA, in addition to hepatic surveillance. Furthermore, it provides the necessary foundation for improved therapy and drug development.


Study population

A detailed description of the Genes & Health cohort has been described by Finer et al.21. Ethical approval for the study was provided by the South East London National Research Ethics Committee (14/LO/1240) including consent for publishing All Genes & Health volunteers consented to lifelong EHR linkage, DNA extraction and genetic tests. All research was conducted in accordance with NHS Health Research Authority guidelines and regulations. An individual application to support data access for this study was granted by Genes & Health (reference S00037) taking into consideration community prioritisation, acceptability and scientific merit.

Exome sequencing samples from 5236 Genes & Health volunteers reporting parental relatedness were available for analysis in variant call format files. For the initial analysis a genotype to phenotype approach was employed interrogating 5 gene candidate loci (Table 5). For the rare burden analysis female volunteers without ICP served as controls (n = 3048). In a secondary analysis, a phenotype to genotype analysis was used to validate these findings, using ICP as the exemplar. For this approach, electronic health records allowed total serum bile acid concentrations ≥ 10 µmol to be retrieved from a network of acute hospitals that provide maternity care to (n = 15,500) women per year living in east London to identify patients with a diagnosis of liver disease in pregnancy (ICD 10 diagnosis code O26.6), see Supplementary Fig. 1. Maternal health records were screened by an experienced clinician to verify a diagnosis of ICP.

Table 5 Description of the five gene candidates.

Exome sequencing & bioinformatic pipeline

Low/mid exome sequencing was performed as previously described85. The exome sequencing data is being held under a data access agreement at the European Genotype-phenome Archive ( under accession numbers EGAD00001005469. Minor allele frequency (MAF) was set at < 5% to include rare and low-frequency genetic variants to allow for a comprehensive evaluation.

Variant annotation

All protein-altering missense, non-sense, frameshift indels or splice site variants identified in the candidate gene set underwent the same processing as described below. Synonymous variants were excluded from further analysis. Variants were filtered and annotated if they met any of the following inclusion criteria (MAF < 5%): 1. associated with a phenotype; 2. known in the literature; 3. no recorded GnomAD allele frequency; 4. predicted to be likely pathogenic (LP) based on all 7 in-silico predictors. To assess the likelihood of functional impact of variants a variety of in-silico tools were employed, including Polyphen86, SIFT87, CADD88, Revel89, MetaLR90, MetaSVM90 and M_CAP91. Open-source databases (Leiden Open Variation Database—LoVD, and ClinVar) including a commercial database (Mastermind) were interrogated to assess whether variants were reported previously in the literature. The American College of Medical Genetics/Association for Molecular Pathology (ACMG/AMP) guidelines for rare variant interpretation was used to assess variant pathogenicity. The guidelines consider a variant to be LP if there is a > 90% certainty it being disease-causing, but below a higher “pathogenic” threshold92.

Rare variant burden analysis

To assess the significance of any rare variant burden of SNPs in all 5 gene candidates the exactCMC function in RVTESTS93 was used. The burden was calculated as the proportion of all ICP cases versus control in the Genes & Health cohort and who had at least one alternate allele. Variants with an allele frequency of 0.01 or less were included. ICP cases (n = 18) from the Genomics England database (Project ID 747)—a predominantly European genetic cohort—were accessed to serve as a direct comparison to rare variant burden in Genes & Health.

Protein structure and modelling analysis

Inclusion criteria for further protein structure and modelling analysis required variants to have an associated phenotype, and/or be predicted to be known pathogenic based on all 7 in-silico tools. VarMap was used to assess protein sequence variants94. 2D representations were designed using the open source tool TOPO295. Variants that were LP or associated with a phenotype underwent further analysis using: (1) Dynamut96 (2) CUPSAT97 (3) SNPMuSiC98. 3D structural representations were generated using PyMOL software (The PyMOL Molecular Graphics System, Version 2.0 Schrödinger, LLC.) using ABCB4 with phosphatidylcholine (PDB ID: 7NIV)99, ABCB11 open structure (PDB ID: 6LR0)100, ATP8B1 (PDB ID: 7PY4)101, and NR1H4 (PDB ID: 1OSH)102. There is no 3D protein structure available for TJP2.