Introduction

Genome-wide association studies (GWAS) of complex phenotypes often identify non-coding variants and fail to distinguish causal variants from commonly co-inherited variants associated through linkage disequilibrium1. Differential gene expression (DGE) studies in turn, while offering more mechanistic insight, fail to distinguish between cis- and trans-regulatory variation1. In addition, allelic dosage compensation can hide mono-allelic downregulation from DGE analysis. Using RNA-sequencing data, allele-specific expression (ASE) determines the relative expression of individual alleles to find allelic imbalance caused by cis-acting regulatory mechanisms1,2,3. These include cis-acting expression and splicing quantitative trait loci (eQTLs and sQTLs), nonsense-mediated decay (NMD), X-inactivation, imprinting, and RNA interference through non-coding RNAs (ncRNA)1,2,3,4,5. The detection of allelic imbalance can be performed on a per-sample basis, which allows for the discovery of variants with low minor allele frequencies (MAF)3. Thus, ASE analysis enables researchers to find regulatory genomic differences regardless of total gene expression or direct variant-phenotype correlations. In complex phenotypes with low explained heritability and inter-individual differences in pathophysiology, this could contribute to the identification of causal mechanisms and therapeutic targets.

Dilated cardiomyopathy (DCM) is a complex genetic disorder characterized by dilation of the left ventricle and impaired systolic function6,7. Around 15–30% of DCM cases are familial, but the currently known relevant genes and variants still fail to explain 70–80% of all cases8. Family members of affected patients often show no evidence for familial DCM and are diagnosed with sporadic DCM7,9,10. Previous studies have shown eQTL enrichment for GWAS hits in two separate DCM cohorts, indicating regulatory mechanisms play a role in this phenotype11. The unexplained heritability and assumed regulatory mechanisms make DCM a fitting case for ASE analysis.

In this study, we have developed and applied an open-source ASE analysis pipeline in R that performs ASE analysis on the individual and population level as well as group comparisons. The aim of this pipeline is to enable the analyses of, often readily available, RNA-sequencing data in novel ways that help elucidate cis-regulatory processes involved in the development of complex genetic disorders. We find evidence supporting the case that cis-regulatory variation contributes to the phenotypic heterogeneity of DCM, discover new candidate genes, and show the benefit of performing ASE analysis in addition to GWA and differential gene expression studies.

Results

Allele-specific expression analysis pipeline

The pipeline consists of three steps, starting with RNA sequencing data preprocessing using the Genome Analysis Toolkit (GATK), followed by the general ASE score statistics and the analyses for biological interpretation (Fig. 1) (https://github.com/macsbio/AlleleSpecificExpression). We chose to represent ASE as the absolute deviation from a heterozygous biallelic frequency of 0.5, as per the standard guidelines3.

Figure 1
figure 1

Overview of the pipeline. This analysis plan allows the inclusion of genotyping data to increase data retrieval and statistical power. The data can be evaluated on three distinct levels, with suiting visualizations for all of them in order to create interpretable results. The integration of WGS or WES data, the group comparisons, and the visualizations are novel additions to established ASE analysis pipelines.

Setting the ASE score threshold to distinguish true heterozygous loci from homozygous loci

The integration of genotype data allowed the determination of an ASE score threshold to distinguish between true heterozygous loci and homozygous loci with RNA sequencing artifacts. Performing Youden’s J statistic on a receiver-operating characteristic (ROC) determined an ASE score of 0.966 as the optimal threshold to distinguish between true heterozygous loci and RNA sequencing errors. The total number of heterozygous and homozygous loci was 167,329 and 719,769 respectively (Fig. 2).

Figure 2
figure 2

ASE density by zygosity status. This figure shows the density plots of ASE values for genotyped loci based on the zygosity status of the loci. The ASE threshold for homozygosity in non-genotyped loci is indicated with the dotted line.

Analyzing ASE on a population level

Among all the SNPs, the total number of statistically significantly imbalanced SNPs per individual, as determined by a cutoff of q < 0.05 ranged from 210 to 8327, mean = 2093, with the percentage amongst all measured SNPs for an individual ranging between 8.9 and 81.1%, mean = 28.3% (Supplementary Materials for the full SNP-q-value list). Shared imbalance, the number of times a gene showed significant imbalance for at least one locus in each of the subjects, showed an exponentially decreasing pattern (Supplementary Fig. 1). Most genes only showed significant imbalance in one or a few of the subjects, whereas only a few genes showed imbalance in more than half of the subjects. The three genes with the highest shared imbalance showed imbalance in 79 of the samples; ABLIM1, TNNT2, and AKAP13, all of which have known isoforms resulting from alternative splicing12,13,14. In concordance with previous studies, the genes with at least one significantly imbalanced SNP showed significant enrichment for eQTLs, p = 6.9E−3, and sQTLs, p = 5.7E−610.

Known DCM-associated genes are more often imbalanced

Since regulatory variation is known to contribute at least in part to DCM we expected to find more allelic imbalance in genes that are known to harbor variants that are associated with DCM. Thus, we compared the ASE p-value distribution for 12 genes in which confirmed DCM-associated variants most commonly occur to the p-value distribution for the rest of the dataset (Fig. 3)15,16. Furthermore, variants in genes classified as having a moderate, limited, or disputed link to DCM according to Hershberger et al. showed high shared imbalance across the samples and similar, albeit less pronounced, q-value inflation17 (Supplementary Fig. 2). When only taking the most significant SNP per gene per sample, the percentage of significant (q < 0.05) gene hits of the total dataset was 38%, while for the 12 genes with established DCM-associated variants it was 74%. The established DCM-associated genes, as described before, were more frequently observed to be significantly imbalanced in multiple samples, mean = 52 imbalanced patients (Supplementary Table 1). These findings confirm the mechanistic role of these genes in DCM.

Figure 3
figure 3

QQ-plot of p-value inflation. This figure shows the test statistic inflation for SNPs located in 12 genes with established DCM-associated variants compared to the remaining SNPs.

Differential ASE in the phenogroups comparisons

The DCM cohort we analyzed in this study is part of a larger set that has previously been clustered with a machine learning algorithm applied to clinical markers in order to find subsets of DCM patients with distinct phenotypical features10. These subsets of patients were called phenogroups and will be referred to as such from herewith (see methods for a more detailed description). In this paragraph the results and visualizations for the group comparisons are described following the Mann–Whitney U test between each phenogroup and the others as well as the Kruskal–Wallis test results between all combinations of phenogroups (see Supplementary Material for SNP-p-value lists). The Gene Ontology analyses highlighted slightly different processes when looking at genes with significant differential imbalance in one phenogroup versus the others. Metabolic processes, specifically protein metabolism and modifications, and intracellular transport processes were pronounced in phenogroup 1 (mild) and 3 (arrhythmogenic), whereas phenogroup 2 (immune) and 4 (severe) showed more pronounced effects in actin filament-based movement. Phenogroup 3 and 4 shared an enrichment for cardiac muscle contraction. Only phenogroup 1 had some enrichment for immune-related processes, specifically neutrophil activation. The topGO results are provided in the Supplementary Material. Significant differential imbalance between all four phenogroups was found for several SNPs located in genes with known cardiomyopathy links other than DCM such as posterior myocardial infarction, the top 5 hits were EZH1, NIBAN1, C7, CDIN1, and ADPRHL1 (Fig. 4A). The differential imbalance for these SNPs could be clearly visualized in boxplots (Fig. 4B, example for rs9766 in EZH1, Supplementary Material for the next four most significant hits). Genes with a SNP showing statistically differential imbalance in this analysis were visualized as a network connecting functionally related genes including their representative SNPs with median ASE scores indicated per cluster (Fig. 4C). The full network consisted of several subgraphs of varying sizes (Supplementary Material).

Figure 4
figure 4

Overview of results visualizations for differential imbalance between all four phenogroups. (A) Manhattan plot indicating loci significantly differentially imbalanced between all four phenogroups. Note that each dot represents a SNP, annotated with the gene it is located in. (B) Boxplot showing the distribution of ASE scores and number of measurements by phenogroup for rs9766, located in EZH1, the most significant differentially imbalance SNP between the four phenogroups. (C) Subgraph of the network displaying a group of functionally related genes with median ASE scores for the corresponding SNP per phenogroup.

Discussion

In this paper, we describe the development and application of a pipeline for ASE analysis based on standard best practices combined with novel aspects such as the incorporation of genotype data and the ability to analyze allelic imbalance on an individual, population, and group comparison level with intuitive results visualizations.

The pipeline uncovered many potential candidate genes, both known and novel in their relation to cardiomyopathy. The three genes with an imbalance in 79 of the samples, ABLIM1, TNNT2, and AKAP13, all have known isoforms due to alternative splicing12,13,14. Thus, these genes, as well as other commonly imbalanced genes in the dataset, might be differentially spliced and therefore showing allelic imbalance in people with DCM. This confirms prior research that showed a relation between splicing variation and DCM development10. Biologically, they are important for cell structure maintenance, with TNNT2 being a troponin complex subunit and ABLIM1 a mediator for actin-cytoplasmic interactions. AKAP13 serves as a guanine nucleotide exchange factor for RhoA small GTPase, which is an actin regulator. Additionally, it has been shown to be essential for cardiac development in mice and has been linked to human cardiomyopathies14,18. Interestingly, ABLIM1, has been linked to DCM through RBM24 mediated alternative splicing in knockout mice models19.

When looking at genes that showed differential imbalance between the autoimmunity related phenogroup 2 and the rest of the cohort, four out of the top five most significant hits were related to inflammatory processes like programmed cell death and autophagy (Supplementary Table 2)20,21,22,23. One of these genes, APIP, has been shown to carry out a cardioprotective role in the inflammatory process following myocardial infarction20. TFEB is a protein degradation promotor previously linked to autophagy and lysosomal related cardiac disorders22. Increased levels of PPP1R3B, a glycogen synthesis regulator also involved in inflammatory processes, decreases risk for myocardial infarction24. To our knowledge, the autophagy and intracellular protein trafficking gene COPZ1 has not been linked to DCM previously.

For a more general overview, the Gene Ontology enrichment results on biological processes were used. While we observed little overlap with the processes attributed to the phenogroups based on a combination of gene expression and clinical data as described by Verdonschot et al. we observed additional processes that could be further investigated11. General cell structure and muscle fiber processes where enriched in all phenogroups, indicating that more research into the genetic regulation of these molecular processes and how these effect cardiomyocyte structure and function in DCM could be useful. Similarly, even though the 12 known DCM-associated genes showed significant imbalance in many of the samples, only MYH7 was significantly differentially imbalanced between the four phenogroups. One explanation could be that most DCM patients have imbalances in the same core genes which would mean no differential imbalance in those genes9. For example, as described by Heinig et al., many DCM patients show imbalance in a wide variety of TTN loci10. The differences between phenotypic groups within DCM are more likely caused by regulatory changes in other, less disease-specific mechanisms such as inflammation and metabolism10. In addition, there is no reason to assume that the different phenogroups found by clustering on clinical markers are necessarily related to differential allelic imbalance. However, allelic imbalance might be a partial explanation for the heterogeneous disease progression found in DCM patients with otherwise similar or identical genetic markers. In the across phenogroups analysis, one gene that showed differential imbalance between the four groups, ADPRHL1 as seen in Fig. 4A, codes for a protein that is key to myofibril assembly and chamber development. The gene has previously been associated with posterior myocardial infarction25,26. In this same analysis, the most significant differentially imbalanced hit was located in EZH1, a H3K27 methylation mediator involved in cardiac reprogramming27,28. Our results suggest that differential regulation of these genes may play a role in the etiology of different DCM subtypes.

There were several methodological considerations we encountered while designing the pipeline. The primary reason for using an ASE score from 0.5 to 1 was that this removed the distinction between reference and alternative alleles, or major and minor alleles. Since ASE is determined at the individual level, determining reference versus alternative was inapplicable. In addition, our analysis serves as a proxy for underlying regulatory variation, regardless of the direction of the imbalance. Preserving the individual alleles would therefore have added unnecessary complexity to the pipeline.

As opposed to previous ASE analysis pipelines, we decided not to aggregate the ASE scores of multiple SNPs within the same gene3. The rationale behind this is that some cis-regulatory events that cause ASE are location specific. For example, if splicing variation occurs for one of multiple exons in one allele, but not in the other allele nor for the remaining exons, ASE can only be detected for that single exon. The measured imbalance in the spliced exon would be reduced, or potentially lost, when aggregating multiple ASE measurements on the same gene. Thus, we decided to treat alleles on the same gene individually to retain a larger number of positive findings for further exploration.

Due to the threshold to classify homozygosity, all truly imbalanced heterozygous measurements with ASE above that threshold have been disregarded. This is also true for all fully imbalanced loci, where only one of the two alleles is expressed. Which could arise due to, for example, parental imprinting, and nonsense mutations. We are unaware of methodologies to circumvent this problem other than integrating genotype data for all samples.

Ultimately, this pipeline offers regulatory genetic analysis on RNA sequencing data, a commonly available genome-wide omics data. Our pipeline provides added insight into the bio-molecular etymology underlying complex regulatory genetic disorders. The visualizations, which align with the most commonly used visualizations in GWA and DGE studies, offer an intuitive understanding of the results for applications in a clinical (genetics) setting.

Conclusion

We have shown that allele-specific expression analysis is able to pinpoint disease-relevant genes under cis-regulatory variation. Our analysis on a unique DCM cohort shows that allelic imbalances are detected in known DCM genes; furthermore, imbalances detected in novel DCM genes, while not yet experimentally or clinically linked to development of DCM, may ultimately be shown to reflect novel disease-relevant processes. Our ASE analysis pipeline can be applied at the individual and the population level and thereby play a role in research on both rare and common complex phenotypes.

Methods

Data description

The data set consists of a group of 87 RNA sequenced DCM patients from the Maastricht Cardiomyopathy Registry29. All patients were diagnosed according to World Health Organization criteria and the current European Society of Cardiology guidelines. Patients with a left ventricular ejection fraction of < 50% in the absence of obstruction of > 50% of a major coronary artery branch, pericardial diseases, congenital heart diseases, cor pulmonale, and active myocarditis were included. Unless contraindicated, patients received guideline-directed medical therapy titrated to the maximally tolerated dose and defibrillator device therapy. Endomyocardial biopsies were obtained in routine diagnostic care for each patient. All RNA was isolated from spare biopsies and sequenced using the TruSeq mRNA sample preparation kit (Illumina, San Diego, CA, USA) and the NextSeq 500 sequencing chip (Illumina, San Diego, CA, USA). All patients received genetic counseling and testing using a cardiomyopathy-associated gene panel with either single molecule molecular inversion probes or whole-exome sequencing. Whole-exome sequencing data was available for 35 patients, obtained with the Affymetrix GW6 platform (Affymetrix, Santa Carla, FL, USA). In addition, they were assigned classification labels based on a machine learning algorithm that performs clustering on DCM patients based on 28 distinct clinical features11. This resulted in four phenotypically distinct patient clusters (phenogroups) with increasing disease severity. Phenogroup 1 was mainly characterized by a moderate ejection fraction with low creatine levels and overall mild disease symptoms, phenogroup 2 by high creatine levels and an overrepresentation of auto-immune disease diagnoses, phenogroup 3 by the presence of atrial fibrillation, and phenogroup 4 by a low ejection fraction and other end-stage symptoms. All patients gave written informed consent before inclusion and the relevant guidelines and regulations were strictly adhered to29.

Data preparation

RNA sequencing data processing

RNA sequencing data was processed using the Genome Analysis Toolkit (GATK, version 3.8.1) for variant calling and read counting. All following steps were performed in R version 4.0.230. The fraction of reference and alternative allele reads was calculated for all loci. As a means of standardization, the ASE scores were set as the absolute deviation of the allelic read fraction from 0.5 to 1.0 as commonly done by others (Eq. 1) (ASE score calculation)2,3.

$$ASE \; score=\left|\left(\frac{{Read \;count}_{reference}}{{Read \;count}_{total}}\right)-0.5\right|+0.5$$
(1)

Distinguishing high imbalance from artifacts

The subset of samples with available whole-exome sequencing data was used to determine a homozygosity threshold based on the ASE score since the true zygosity for these samples was known. A receiver operating characteristic (ROC) analysis was performed using a model of zygosity (1 for homozygosity, 0 for heterozygosity) as a function of the ASE score31. Next, Youden’s J statistic was applied to find the threshold that maximizes the distance to the ROC diagonal31. Due to unequal sample sizes between homozygous and heterozygous loci, resampling was applied. The mean ASE score threshold was then used to filter out likely homozygous loci within the non-genotyped subset.

Testing for statistical significance

Statistical tests were performed on the ASE scores on a per sample basis as well as between phenogroups. Within-sample ASE significance was determined using a binomial test, where the expected probability of finding a certain allele in the total read count at a particular locus was set to be the median of all median ASE values per sample (0.647). Multiple testing correction was performed for the binomial test results using the Benjamini–Hochberg method32. For the group comparisons, a Wilcoxon rank-sum test was applied to calculate the statistical significance of differential imbalance between pairs of phenogroups. To investigate ASE variation between all four phenogroups, we performed the Kruskal–Wallis Rank-Sum Test. Non-parametric tests were chosen due to the non-normality of the ASE scores as well as the sparse measurements per SNP across the samples.

Biological interpretation

All SNP identifiers in the output results were mapped to Ensembl gene identifiers (ENSG IDs) and HUGO Gene Name Committee (HGNC) symbols33. For genes with multiple ASE measurements, the SNP with the lowest p-value was taken to represent the gene for that individual, since several of the potential underlying mechanisms of ASE, such as splicing and transcript truncation (followed by NMD), are position-specific and cannot be accurately aggregated with imbalances in other exons4,10. Ensembl Gene Ontology data and SNP-to-gene mapping were subsequently used to perform biological interpretation. The R-package topGO was used to perform Gene Ontology enrichment analysis using the parent–child for all significant gene hits from each analysis34,35. A gene was considered statistically significant if the lowest q-value for an ASE event within that gene was < 0.05. Genes with a SNP showing statistically significant differential imbalance between the four phenogroups were visualized as a network connecting functionally related genes including their representative SNPs with median ASE scores indicated per cluster. This network was created by taking the related genes of differentially imbalanced SNPs and, after translating gene identifiers to protein identifiers, using the STRING database to find protein–protein interactions with a cut-off of 0.936.

Ethics approval and consent to participate

All patients gave written informed consent before inclusion29. An independent Medical Ethics Committee of the Maastricht University Medical Center (MUMC+) has approved this registry.