Distinct epigenomic patterns are associated with haploinsufficiency and predict risk genes of developmental disorders

Han, Xinwei; Chen, Siying; Flynn, Elise; Wu, Shuang; Wintner, Dana; Shen, Yufeng

doi:10.1038/s41467-018-04552-7

Download PDF

Article
Open access
Published: 30 May 2018

Distinct epigenomic patterns are associated with haploinsufficiency and predict risk genes of developmental disorders

Xinwei Han ORCID: orcid.org/0000-0003-3930-8427^1,2^na1^nAff8,
Siying Chen ORCID: orcid.org/0000-0003-1534-6525^1,3^na1,
Elise Flynn^1,3,
Shuang Wu⁴,
Dana Wintner⁵ &
…
Yufeng Shen^1,6,7

Nature Communications volume 9, Article number: 2138 (2018) Cite this article

4916 Accesses
22 Citations
6 Altmetric
Metrics details

Subjects

Abstract

Haploinsufficiency is a major mechanism of genetic risk in developmental disorders. Accurate prediction of haploinsufficient genes is essential for prioritizing and interpreting deleterious variants in genetic studies. Current methods based on mutation intolerance in population data suffer from inadequate power for genes with short transcripts. Here we show haploinsufficiency is strongly associated with epigenomic patterns, and develop a computational method (Episcore) to predict haploinsufficiency leveraging epigenomic data from a broad range of tissue and cell types by machine learning methods. Based on data from recent exome sequencing studies on developmental disorders, Episcore achieves better performance in prioritizing likely-gene-disrupting (LGD) de novo variants than current methods. We further show that Episcore is less-biased by gene size, and complementary to mutation intolerance metrics for prioritizing LGD variants. Our approach enables new applications of epigenomic data and facilitates discovery and interpretation of novel risk variants implicated in developmental disorders.

Evidence for 28 genetic disorders discovered by combining healthcare and research data

Article 14 October 2020

Investigating the role of common cis-regulatory variants in modifying penetrance of putatively damaging, inherited variants in severe neurodevelopmental disorders

Article Open access 15 April 2024

Human and mouse essentiality screens as a resource for disease gene discovery

Article Open access 31 January 2020

Introduction

Haploinsufficiency (HIS) due to hemizygous deletions or heterozygous likely-gene-disrupting (LGD) variants plays a central role in the pathogenesis of various diseases. Recent large-scale exome and genome sequencing studies of developmental disorders, including autism, intellectual disability, developmental delay, and congenital heart disease (CHD)^1,2,3,4,5, have estimated that de novo LGD mutations explain the cause of a significant portion of patients with these developmental disorders, and the enrichment rate of de novo LGD variants indicates about half of these variants are associated with disease risk. However, relatively few genes have multiple LGD variants (“recurrence”) in a cohort^1,2,6, lacking of which provides insufficient statistical evidence to distinguish individual risk genes from the ones with random mutations⁷. On the other hand, most of the enrichment of LGD variants can be explained by HIS genes⁶. Therefore, a comprehensive catalog of HIS genes can greatly help interpreting and prioritizing mutations in genetic studies.

Currently, there are two main approaches of predicting HIS genes based on high-throughput data. Huang et al. use a combination of genetic, transcriptional and protein–protein interaction features from various sources to estimate haploinsufficient probabilities for 12,443 genes⁸. Using similar input information, Steinberg et al. generated the probabilities for more (over 19,700) human genes by a Support Vector Machine (SVM) model⁹. The other approach is based on mutation intolerance^10,11,12 in populations that do not have early onset developmental disorders. Lek et al.¹¹ estimated each gene’s probability of HIS (pLI: Probability of being Loss-of-function Intolerant) based on the depletion of rare LGD variants in over 60,000 exome sequencing samples. Although effective, the Exome Aggregation Consortium (ExAC) pLI is biased towards genes with longer transcripts or higher background mutation rates, since the statistical power of assessing the significance depends on a relatively large expected number of rare LGD variants from background mutations.

We sought to predict HIS using epigenomic data that are orthogonal to genetic variants and generally independent of gene size. Our method is motivated by recent studies indicating that specific epigenomic patterns are associated with genes that are likely haploinsufficient. Specifically, genes with increased breadth of H3K4me3, typically associated with actively transcribing promoters, are enriched with tumor suppressor genes¹³, which are predominantly haploinsufficient based on somatic mutation patterns¹⁴. Another study reported H3K4me3 breadth regulates transcriptional precision¹⁵, which is critical for dosage sensitivity. These observations led us to hypothesize that haploinsufficient genes are tightly regulated by a combination of transcription factors and epigenomic modifications to achieve spatiotemporal precision of gene expression, and such regulation can be detected by distinct patterns of epigenomic marks in relevant tissues and cell types. Based on this model, we develop a Random Forest–based method (“Episcore”) using epigenomic data from the Epigenomic Roadmap¹⁶ and the Encyclopedia of DNA Elements (ENCODE) Projects¹⁷ as input features and a few hundreds of curated HIS genes as positive training data. To assess the efficacy of prioritizing candidate risk variants in real-world genetic studies, we use large data sets of de novo mutations from recent studies of birth defects and neurodevelopmental disorders and show that Episcore performs better than existing methods. Additionally, Episcore is less-biased by gene length or background mutation rate and complementary to mutation-based metrics in HIS-based gene prioritization. Our analysis indicates that epigenomic features in stem cells, brain tissues, and fetal tissues contribute more to Episcore prediction than others.

Results

HIS genes have distinct epigenomic features

To examine the correlation of gene HIS and epigenomic patterns, we analyzed ChIP-seq data from Roadmap and ENCODE projects, including active (H3K4me3, H3K9ac, and H2A.Z) and repressive (H3K27me3) promoter modifications, and marks associated with enhancers (H3K4me1, H3K27ac, DNase I hypersensitivity sites). We used the width of called ChIP-seq peaks for promoter features and counted the interacting number of promoters and enhancers within pre-defined topologically associated domains (TADs) for enhancer features. As each histone modification is characterized in multiple cell types, we refer to the combination of an epigenomic modification and a cell type as one epigenomic feature.

Figure 1a shows the correlation among epigenomic features, and the correlation of epigenomic features and ExAC pLI score. As expected, active promoter or enhancer marks are highly correlated with each other and with ExAC pLI score, and they are anti-correlated with repressor marks in general. The repressor marks from stem cells or fetal tissues have positive correlations with active marks and ExAC pLI scores, suggesting many genes with bivalent marks in stem cells are likely haploinsufficient.

To further investigate the association of HIS and patterns of epigenomic modifications, we compiled a list of 287 known HIS genes (Supplementary Data 1) involved in a wide range of human diseases (Supplementary Data 2) from a recent study^8,18 and human-curated ClinGen dosage sensitivity map. We also collected a list of 574 haplosufficient (HS) genes, of which one copy of each gene had been deleted in two or more subjects based on a copy-number variation (CNV) study in 2026 healthy individuals¹⁹. For promoter features, HIS and HS genes clearly have distinct distributions of peak length (Fig. 1b–d). HIS genes on average have wider peaks of both the active marker H3K4me3 (Fig. 1b) and the repressive marker H3K27me3 (Fig. 1c), suggesting the difference between HIS and HS genes is not only on the level of expression but also on distinct mechanisms of regulation. Furthermore, other epigenomic modifications associated with active promoters, including H2A.Z and H3K9ac, also display wider peaks upstream of HIS genes (Supplementary Fig. 1A and B). In addition, HIS and HS genes also differ in the number of interacting enhancers. We adopted a recently published method EpiTensor²⁰, which decomposes a 3D tensor representation of histone modifications, DNase-Seq, and RNA-Seq data to find associations between distant genomic regions. When restricted to pre-defined TADs, associated regions identified by EpiTensor correspond well to enhancer−promoter interactions found by Hi-C. EpiTensor revealed that HIS genes have a median of nine interacting enhancers, while HS genes have a median of 0 (p < 10^–4, permutation test, Supplementary Fig. 1C). When averaged across tissues, HIS genes shift towards a larger number of mean interacting enhancers, as compared to HS genes (Fig. 1d), supporting the notion that HIS genes have more regulatory complexity.

Among these 287 known HIS genes, 129 genes (45%) have pLI smaller than 0.9 or missing value. Some of these genes are well-known disease risk genes under dominant genetic models, such as TGFB1²¹, RUNX1²², SOX2²³, SUMO1²⁴, NKX2-5²⁵, EYA4²⁶, CAV1²⁷, PAX2²⁸, GATA6²⁹, ZIC2³⁰, and WT1³¹. These known HIS genes with pLI < 0.9 have significantly smaller number of expected loss-of-function variants¹¹ than an average gene (Supplementary Fig. 2A), and intermediate selection coefficient of heterozygous LGD variants (S_het)¹² (Supplementary Fig. 2B), pointing to two particular areas (genes that are either short or under intermediate negative selection) in which HIS prediction can be improved.

Predicting haploinsufficiency with epigenomic features

To leverage the strong association between epigenomic patterns and gene HIS, we developed a computational method to predict HIS using Random Forest (Fig. 2a) and other supervised learning models (Supplementary Fig. 3A and B). The input features included peak length of four promoter marks (H3K4me3, H3K9ac, H2A.Z, and H3K27me3) and the number of EpiTensor-inferred interacting enhancers in various tissues. Performance evaluation by tenfold cross validation and AUC (Area Under Curve) in ROC (Receiver Operating Characteristic) curves showed that all of these methods achieved high AUC values of 0.86~0.88 (Fig. 2b and Supplementary Fig. 3A and B). As Random Forest performs the best, results from Random Forest are chosen as final metrics measuring the probability of being haploinsufficient, termed “Episcore” (Supplementary Data 3). Despite completely different input data are used, Episcore and ExAC pLI score displayed overall concordance. The distribution of pLI is generally bi-modal, with modes at 1 and 0¹¹. The genes with Episcore > 0.6 are much more likely to have pLI values close to 1 than genes with Episcore < 0.4, and the opposite trend at pLI close to 0 (Supplementary Fig. 3C). Among 3463 genes with Episcore > 0.6, 1518 have pLI scores < 0.5. Some of these genes have been implicated in human diseases under a dominant model, such as HEY2³², ASF1A³³, and HAND2³⁴ (Supplementary Table 1). Similarly to the ones with low pLI values in the positive training set, these genes have lower background mutation rate (which is primarily determined by transcript size) than the ones with large pLI values (Supplementary Fig. 3D), and are generally under less severe selection measured by S_het¹² (Supplementary Fig. 3E).

Better prioritization of de novo LGD variants by Episcore

A major goal of predicting HIS is to facilitate prioritization of variants identified in genetic studies of developmental disorders. We compared Episcore with pLI scores from ExAC¹¹, S_het values¹², and ranks of mouse heart expression level³⁵, using de novo LGD variants identified in a recently published whole exome sequencing study DDD (Deciphering Developmental Disorders consortium) of 1365 trio families with CHD³⁶. LGD variants include frameshift, nonsense and canonical splice site mutations. We only included genes with all four metrics for comparison, although we note Episcore (19,430 genes) made predictions for more genes than pLI (18,225 genes), S_het (17,200 genes) and ranks of mouse heart expression level (17,624 genes, due to loss in ortholog matching). Different predictions are compared by the enrichment rate of variants. For the same number of top-ranked genes by each metric, we calculated the number of LGD variants located in these genes and estimated the number of LGD variants based on background mutation rate³⁷. Across a wide range of top-ranked genes, Episcore showed larger enrichment than ExAC pLI, S_het, or heart expression level (Fig. 3a and Supplementary Fig. 4A). We also applied the same approach to de novo synonymous variants identified in the CHD data set and observed no enrichment (Supplementary Fig. 4B). Additionally, we compared these predictions by precision-recall-like curve (PR-like) based on enrichment. Since the total number of positive variants (true disease-causing variants) is unknown, we used estimated number of “true positives” instead of “true positive rate (recall)” in this comparison. For top-ranked genes from each method, the number of true positives was estimated by subtracting expected number of LGD variants based on background mutation rate from the observed in these genes. We measured precision by dividing the estimated number of true positives by the total number of observed LGD variants in these genes. Across a wide range of precision, Episcore consistently showed superior recall compared to pLI, S_het, and heart expression level (Fig. 3b) and to earlier methods based on combination of genetic and protein interaction network data^8,9 (Supplementary Fig. 4C and D). The performance advantage over other HIS-related score does not change after excluding the genes used in training (Supplementary Fig. 4E and F).

We obtained a second CHD WES cohort of 2645 parent−offspring trios from the Pediatric Cardiac Genomics Consortium (PCGC)³⁸ to emulate a replication design. We used the larger data (PCGC CHD) as discovery and the DDD data as replication. We found that the genes with a single LGD variant in PCGC data and “replicated” with at least one LGD variant in the DDD data have much higher Episcore, than the genes with a singleton LGD in PCGC data or genes with LGD variants in controls (unaffected siblings in Simons Simplex Collection autism study³⁹) (Supplementary Fig. 5).

Episcore is complementary to mutation intolerance metrics

Haploinsufficiency predicted by mutation intolerance in a general population (such as ExAC pLI metric) is intrinsically biased towards genes with longer CDS (coding sequence) or higher background mutation rates. Fig. 3c, d shows the distribution of genes with pLI scores >0.9 shifts towards longer CDS length or higher background mutation rate, as compared to the distribution of known HIS disease risk genes, while top 20% genes ranked by Episcore have similar distribution to known HIS disease risk genes or genes with expression level ranked in top 20% in developing heart³⁵.

Since Episcore and pLI use distinct types of input data, a combination of these two scores might achieve better performance. We obtained de novo mutation data of 4293 trio families affected by developmental disorders, mostly with intellectual disabilities (DDD ID), from a recent study⁶. Genes with de novo LGD mutations in DDD ID cases are notably under more severe selection than the ones in CHD (Supplementary Fig. 4G). We used a logistic regression to integrate Episcore and pLI in this data set. Specifically, we used a total of 45 genes with de novo LGD variants in three or more probands as positives, and randomly sampled 45 genes from genes with no observed de novo LGD variant as negatives to estimate coefficients in the logistic model. Both Episcore and pLI have significant coefficients (p < 10⁻³), supporting these two methods convey complementary information. We found that the resulting meta-score achieved overall better precision and true positives than Episcore or pLI alone (Fig. 3e, f), while maintaining similar enrichment burden as good as any method alone in a broad range of gene ranks.

Fetal tissues and stem cells highly associated with HIS

To evaluate the association of each epigenomic feature to HIS, we calculated Spearman correlation coefficients between each feature and Episcore. These correlation coefficients were analyzed in two ways. We first grouped them based on the molecular entities they represent, such that the same epigenomic modification from different tissues would be in one group. Each of the five resulting categories has distinct distributions of Spearman correlation coefficients, suggesting different contributions to Episcore (Fig. 4a). Except for the repressive mark H3K27me3, most of them have larger correlation coefficients than gene expression values, suggesting these features and the model do not merely reflect expression abundance but also epigenomic regulation specific to HIS genes. Measured by mean decrease of Gini index, these groups of features have similar trend in contribution to Episcore prediction (Supplementary Fig. 6).

We then grouped correlation coefficients based on tissue and cell types, converted correlation coefficient of each epigenomic modification to a Z-score using the mean and standard deviation across the tissue or cell type, and finally averaged the Z-score of all epigenomic modification for each tissue or cell type. The averaged Z-score represents the importance of this tissue or cell type to HIS prediction. In general, stem cells and neural tissues have large average Z-scores (Fig. 4b). Interestingly, for tissues in the same category, fetal tissues usually have larger average Z-scores than postnatal tissues.

Finally, to illustrate the contribution of different tissues to HIS, we examined in detail the histone modifications around TSS of several known HIS genes. Fig. 4c shows RBFOX2 and CWC22. RBFOX2 is a CHD risk gene recently discovered through de novo LGD variants⁵, and it has expansive H3K4me3 and H3K9ac peaks in stem/fetal cells and heart and brain tissues, but not in blood cells. Consistently, it has a reverse pattern in H3K27me3, extensive in blood cells but limited in other tissues. On the contrary, CWC22, a known house-keeping gene, shows consistent but narrow peaks of active marks across tissues.

Discussion

In this study we showed there is a strong correlation between epigenomics patterns and gene HIS, and developed a computational method (Episcore) to predict HIS using epigenomic features. Episcore had superior yet complementary performance in prioritization of de novo LGD variants in CHD and neurodevelopmental disorders, compared to mutation intolerance metrics such as ExAC pLI¹¹.

Existing HIS prediction methods based on intolerance of mutations have limited statistical power in genes with small transcript size or under less severe negative selection. Network-based methods⁸ are often biased towards well-studied genes⁹ and pathways. Epigenomic data have several advantages to address these issues: (a) they are orthogonal to genetic mutations, and therefore provide additional information that could improve power; (b) they are much less biased by transcript size, and will be most helpful to predict HIS of genes with short transcripts; (c) the bias with selection coefficient is a reflection of the training data, which empirically is much smaller than mutation intolerance metrics; (d) the ability to generate large amount of data without bias towards well-studied genes. These advantages contribute to the superior performance of Episcore in prioritizing de novo LGD variants from exome sequencing studies.

There are likely a variety of mechanisms underlining the correlation of epigenomics patterns and HIS. First, broad H3K4me3 peaks contributed most to Episcore prediction of HIS. Broad H3K4me3 peaks are associated with reduced transcriptional noise at cell population and single cell levels¹⁵, which is likely required to maintain precise expression levels of HIS genes in specific cell types and developmental stages. Second, a previous study found regulatory complexity is required to achieve cell-type-specific expression patterns of the lineage-defining genes in hematopoietic differentiation⁴⁰. Consistently, we found the number of enhancers interacting with the promotor of a gene is highly correlated with predicted HIS score. Third, many HIS genes are regulators that define cell lineages during differentiation. Bivalent chromatin domains in embryonic stem cells, in which both active marker H3K4me3 and repressor marker H3K27me3 are present, are generally associated with lineage control genes⁴¹. We observed that H3K27me3 are positively correlated with H3K4me3 in stem cells, and both are correlated with mutation intolerance (Fig. 1a, c) and Episcore (Fig. 4a). Finally, we found epigenomic features from stem cells and fetal tissues contribute most to prediction, highlighting the importance of developmental role in determining gene HIS.

Our data suggest Episcore is generally better for prioritizing genes with a broader range of selection coefficient or genes with smaller transcript size, whereas pLI performs better for genes under most severe negative selection. Episcore is currently limited by availability and resolution of epigenomic data, especially cell-type-specific data from complex tissues or organs such as the brain, and data at various developmental stages. Complex developmental disorders, such as autism, involve a large number of cell types during a broad range of developmental stages. It is critical to generate and integrate more fine-grained epigenomic data from cells of specific types at specific time points in order to improve genetic discoveries in studies of such diseases. We expect such data sets will become available in near future from ongoing projects^42,43,44, and will enable us to improve prediction of HIS and facilitate novel discoveries in genetic studies.

Methods

Collection and preprocessing of training genes

In this study, we used Ensembl release 75 for gene annotation and TSS (transcription start site) locations. All genomic coordinates are based on hg19 human genome assembly. Any non-hg19 coordinates were lifted over to hg19 using UCSC LiftOver tool (https://genome.ucsc.edu/cgi-bin/hgLiftOver). Conversion of gene symbols to Ensembl IDs were based on annotation tables downloaded from Ensembl BioMart.

Positive training set data (curated haploinsufficient genes) were collected from these two sources: (1) haploinsufficent training genes used in previous studies^8,18 and (2) genes with haploinsufficient score of 3 in ClinGen Dosage Sensitivity Map (http://www.ncbi.nlm.nih.gov/projects/dbvar/clingen/). For the negative training set (curated haplosufficient genes), we used genes deleted in two or more healthy people, based on CNVs detected in 2026 normal individuals¹⁹. Only genes with half or more of its length covered by any deletion were considered “deleted” in an individual.

The raw training set may have some false positives and false negatives, as it contained results from automated literature mining that is known to give noisy output. To optimize the performance, we did the following pruning of the raw training set: (1) we only kept protein-coding genes in autosomes, as non-protein-coding genes or genes on sex chromosomes may be under different mechanism of epigenomic regulation; (2) from the positive training set, we removed genes with sufficient contradictory evidence (ExAC pLI ≤ 0.1 and expected loss-of-function variants >10¹¹); and (3) from the negative training set, we removed genes with sufficient contradictory evidence (pLI ≥ 0.2 and expected loss-of-function variants >10). After pruning, the positive training set has 287 genes and the negative training set has 574 genes. The full list of training genes is available in Supplementary Table 1.

Preprocessing of epigenomic feature data

The uniformly processed peak calling results of Roadmap and ENCODE projects were downloaded from http://egg2.wustl.edu/roadmap/web_portal/processed_data.html. For promoter features (H2A.Z, H3K27me3, H3K4me3, and H3K9ac), “GappedPeaks” were used to allow for broad domains of ChIP-seq signal. The assignment of a GapppedPeak to a gene follows these steps in order: (1) for each gene, only TSS of Ensembl canonical transcripts were used; (2) assigned a GappedPeak to a TSS if the GappedPeak overlaps with the upstream 5 kb to downstream 1 kb region around the TSS. This definition of basal cis-regulatory region around promoter follows GREAT tool⁴⁵. Assigning one GappedPeak to multiple TSS was allowed; (3) for TSS having more than 1 GappedPeak assigned, kept the closest one; (4) for genes with multiple TSS and hence multiple assigned GappedPeaks, kept the longest GappedPeak. After these four steps, if one gene had been associated with a GappedPeak, then we used the width of the peak as an epigenomic feature in the following machine learning models. If a gene had no associated GappedPeak, then the peak width is 0.

To calculate the number of interacting enhancers of a gene, we used two approaches. In a naïve approach, we counted peaks of ChIP-seq signals that are associated with enhancers. The ChIP-seq signals we used include H3K4me1, H3K27ac, and DNase I hypersensitivity site, and each ChIP signal was counted and recorded separately. We used “NarrowPeak” instead of “GappedPeak” in the counting to better estimate the number of interacting enhancers, as enhancer regions are not long and GappedPeak has the risk of merging nearby ChIP-seq signals. For each gene, we counted peaks in (a) the surrounding TAD, based on TADs reported in ref.⁴⁶; or (b) +/− 20kb of each TSS. (Only TSSs of Ensembl canonical transcripts were used. For genes with multiple TSS and thus several numbers of interacting enhancers, we kept the largest one.) In a more advanced approach, we adapted EpiTensor²⁰ to infer gene−enhancer relationship. We made a few changes when using EpiTensor: (a) we used normalized coverage of ChIP-seq signal instead of raw coverage in Zhu et al.²⁰; (b) we used the coverage of H3K27ac, H3K27me3, H3K36me3, H3K4me1, H3K4me3, H3K9me3, DNase I, and RNA-seq as input for EpiTensor to balance between more input data types and more cell types included, as not every cell type has all these histone modifications characterized. With these input features, we were able to achieve better performance than the aforementioned counting method (Supplementary Fig 1C); (c) we used enhancer annotation from 15-state chromHMM (http://egg2.wustl.edu/roadmap/web_portal/chr_state_learning.html#core_15state), while the original EpiTensor paper²⁰ used results of an earlier version. Based on the output of EpiTensor, which predicts enhancer−promoter pairs, we counted the number of interacting enhancers for each gene in various tissues.

Finally, the results of peak width and number of interacting enhancers were consolidated into a matrix, with each row being a gene and each column representing a combination of a tissue and a data type, e.g. “H3K4me3 peak width in fetal heart”. One combination of a tissue and a data type was referred to as one epigenomic feature. This matrix was used as input for machine learning models described in the following section.

Machine learning approaches to predict haploinsufficiency

We applied several machine learning approaches, including Random Forest, Support Vector Machine (SVM), and SVM with LASSO feature selection. Random Forest was implemented using R package “randomForest”. SVM was implemented using R package “e1071”. LASSO was implemented using R package “glmnet”, with alpha value equal to 1. For each machine learning method, we assessed the performance based on 100 runs of tenfold cross-validation. In each run, 10% of the training genes were randomly selected and left out to form a test set for validation. The remaining data were used to train the model, after which the test set was used to calculate model sensitivity and specificity. We used R package “ROCR” to make an ROC curve based on the 100 runs and calculated AUC values.

Finally, we used all training genes used to train the model, and then estimate the probabilities of being positive (i.e. probabilities of being HIS) for all genes. The whole process was repeated 30 times and we took the arithmetic mean of the 30 sets of probabilities as the final results.

Episcore and other metrics in variant prioritization

We used two approaches to compare Episcore and other metrics in variant prioritization, based on “enrichment of de novo LGD variants”, estimated “number of true-positives”, and “precision”. The formula to calculate these three statistics is as follows.

For any gene i, the number of expected de novo LGD variants in each gene, E_i, was calculated as:

$${{E}}_i = {{2}} \times {{N}} \times {{r}}_{{i}},$$

where N is the number of cases in the sequencing cohort and r_i is gene-specific LGD mutation rate. LGD variants include nonsense, frameshift, and canonical splice site mutations. The background mutation rate per gene of each mutation type was obtained from Samocha et al.³⁷. For each gene, r_i is the sum of background mutation rate of nonsense, frameshift, and canonical splice site mutations.

For a set of genes, the enrichment of de novo LGD variants, D, was calculated as:

$${{D}} = \frac{{{M}}}{{\mathop {\sum }\nolimits_{{i}} {{E}}_{{i}}}},$$

where M is the total number of observed de novo LGD variants in this gene set. In this study, we used results from two whole exome sequencing studies on CHD^5,36 and another whole exome sequencing study on various developmental disorders⁶.

For any gene set, the number of true positives, TP, was calculated as:

$${\mathrm{TP}} = {{M}} - \mathop {\sum }\limits_{{i}} {{E}}_{{i}}.$$

For any gene set, the precision (positive predictive value), PPV, was calculated as:

$${\mathrm{PPV}} = \frac{{{{M}} - \mathop {\sum }\nolimits_i E_i}}{{{M}}}.$$

For each metric (Episcore, pLI, etc.), a series of top-ranked genes were selected, such as top 500 genes, top 2000 genes, etc. In the first approach, enrichment of de novo LGD variants, D, was calculated for any set of top-ranked genes, and then enrichment values were plotted and compared, as shown in Fig. 3a. In the second approach, the number of true positives, TP, and the precision (true discovery rate), PPV, were calculated for any set of top-ranked genes. TP and PPV were plotted and compared, as shown in Fig. 3b. If the number of all true positives (N) in a study is known, we can calculate recall as R = TP/N. Although N is generally unknown, it is a constant; therefore, TP is proportional to R. In this study, we use TP as a proxy of recall.

To examine the utility of Episcore in prioritizing genes with only one LGD mutation, we utilized two independent CHD cohorts: DDD (Deciphering Developmental Disorders consortium) CHD³⁶ and PCGC (Pediatric Cardiac Genomics Consortium) CHD³⁸. Both these two study included trios from an earlier CHD study³⁵ to increase detection power. To avoid duplication, we removed these earlier trios from DDD CHD data.

Epigenomic features critical for the prediction

We calculated a Spearman correlation coefficient between each epigenomic feature and Episcore. One epigenomic feature here corresponds to a data type (like H3K4me3 peak width) in certain tissue/cell type (e.g. fetal heart). To examine which data types are more important, we plotted these Spearman correlation coefficients by data type, e.g. correlation coefficients from H3K4me3 peak width were plotted in one section. To examine what tissue/cell types are more important, we calculated averaged z-score for each tissue/cell type. The average z-score is calculated following these two steps: (1) we converted every Spearman correlation coefficient to a Z-score using mean and standard deviation specific to each data type and (2) for each tissue/cell type, we averaged the Z-scores from various data types.

Code availability

Episcore prediction is implemented using the random forest R package, “randomForest”: https://cran.r-project.org/web/packages/randomForest/index.html.

R scripts used for model training and Episcore prediction are available on GitHub: https://github.com/ShenLab/episcore.

Data availability

Ensembl release 75 for gene annotation and TSS (transcription start site) locations: http://feb2014.archive.ensembl.org/downloads.html.

Human genome assembly hg19 for genome coordinates: https://genome.ucsc.edu/cgi-bin/hgTables.

Epigenomic feature data from Roadmap and ENCODE projects: http://egg2.wustl.edu/roadmap/web_portal/processed_data.html.

All other data supporting the findings of this study are available within the paper and its supplementary information files.

References

De Rubeis, S. et al. Synaptic, transcriptional and chromatin genes disrupted in autism. Nature 515, 209–215 (2014).
Article PubMed PubMed Central CAS Google Scholar
Iossifov, I. et al. The contribution of de novo coding mutations to autism spectrum disorder. Nature 515, 216–221 (2014).
Article ADS PubMed PubMed Central CAS Google Scholar
Hamdan, F. F. et al. De novo mutations in moderate or severe intellectual disability. PLoS Genet. 10, e1004772 (2014).
Article PubMed PubMed Central CAS Google Scholar
Deciphering Developmental Disorders, S. Large-scale discovery of novel genetic causes of developmental disorders. Nature 519, 223–228 (2015).
Article ADS CAS Google Scholar
Homsy, J. et al. De novo mutations in congenital heart disease with neurodevelopmental and other congenital anomalies. Science (New York, NY) 350, 1262–1266 (2015).
Article ADS CAS Google Scholar
Deciphering Developmental Disorders, S. Prevalence and architecture of de novo mutations in developmental disorders. Nature 542, 433–438 (2017).
Article CAS Google Scholar
He, X. et al. Integrated model of de novo and inherited genetic variants yields greater power to identify risk genes. PLoS Genet. 9, e1003671 (2013).
Article PubMed PubMed Central CAS Google Scholar
Huang, N., Lee, I., Marcotte, E. M. & Hurles, M. E. Characterising and predicting haploinsufficiency in the human genome. PLoS Genet. 6, e1001154 (2010).
Article PubMed PubMed Central CAS Google Scholar
Steinberg, J., Honti, F., Meader, S. & Webber, C. Haploinsufficiency predictions without study bias. Nucleic Acids Res. 43, e101 (2015).
Article PubMed PubMed Central CAS Google Scholar
Petrovski, S., Wang, Q., Heinzen, E. L., Allen, A. S. & Goldstein, D. B. Genic intolerance to functional variation and the interpretation of personal genomes. PLoS Genet. 9, e1003709 (2013).
Article PubMed PubMed Central CAS Google Scholar
Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
Article PubMed PubMed Central CAS Google Scholar
Cassa, C. A. et al. Estimating the selective effects of heterozygous protein-truncating variants from human exome data. Nat. Genet. 49, 806–810 (2017).
Article PubMed PubMed Central CAS Google Scholar
Chen, K. et al. Broad H3K4me3 is associated with increased transcription elongation and enhancer activity at tumor-suppressor genes. Nat. Genet. 47, 1149–1157 (2015).
Article PubMed PubMed Central CAS Google Scholar
Davoli, T. et al. Cumulative haploinsufficiency and triplosensitivity drive aneuploidy patterns and shape the cancer genome. Cell 155, 948–962 (2013).
Article PubMed PubMed Central CAS Google Scholar
Benayoun, B. A. et al. H3K4me3 breadth is linked to cell identity and transcriptional consistency. Cell 158, 673–688 (2014).
Article PubMed PubMed Central CAS Google Scholar
Roadmap Epigenomics, C. et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).
Article CAS Google Scholar
Consortium, E. P. et al. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
Article ADS CAS Google Scholar
Dang, V. T., Kassahn, K. S., Marcos, A. E. & Ragan, M. A. Identification of human haploinsufficient genes and their genomic proximity to segmental duplications. Eur. J. Hum. Genet. 16, 1350–1357 (2008).
Article PubMed CAS Google Scholar
Shaikh, T. H. et al. High-resolution mapping and analysis of copy number variations in the human genome: a data resource for clinical and research applications. Genome Res. 19, 1682–1690 (2009).
Article PubMed PubMed Central CAS Google Scholar
Zhu, Y. et al. Constructing 3D interaction maps from 1D epigenomes. Nat. Commun. 7, 10812 (2016).
Article ADS PubMed PubMed Central CAS Google Scholar
Kinoshita, A. et al. Domain-specific mutations in TGFB1 result in Camurati-Engelmann disease. Nat. Genet. 26, 19–20 (2000).
Article PubMed CAS Google Scholar
Taketani, T. et al. Mutation of the AML1/RUNX1 gene in a transient myeloproliferative disorder patient with Down syndrome. Leukemia 16, 1866–1867 (2002).
Article PubMed CAS Google Scholar
Fantes, J. et al. Mutations in SOX2 cause anophthalmia. Nat. Genet. 33, 461–463 (2003).
Article PubMed CAS Google Scholar
Alkuraya, F. S. et al. SUMO1 haploinsufficiency leads to cleft lip and palate. Science 313, 1751 (2006).
Article PubMed Google Scholar
Benson, D. W. et al. Mutations in the cardiac transcription factor NKX2.5 affect diverse cardiac developmental pathways. J. Clin. Invest. 104, 1567–1573 (1999).
Article PubMed PubMed Central CAS Google Scholar
Wayne, S. et al. Mutations in the transcriptional activator EYA4 cause late-onset deafness at the DFNA10 locus. Hum. Mol. Genet. 10, 195–200 (2001).
Article PubMed CAS Google Scholar
Cao, H., Alston, L., Ruschman, J. & Hegele, R. A. Heterozygous CAV1 frameshift mutations (MIM 601047) in patients with atypical partial lipodystrophy and hypertriglyceridemia. Lipids Health Dis. 7, 3 (2008).
Article PubMed PubMed Central CAS Google Scholar
Sanyanusin, P. et al. Mutation of the PAX2 gene in a family with optic nerve colobomas, renal anomalies and vesicoureteral reflux. Nat. Genet. 9, 358–364 (1995).
Article PubMed CAS Google Scholar
Kodo, K. et al. GATA6 mutations cause human cardiac outflow tract defects by disrupting semaphorin-plexin signaling. Proc. Natl. Acad. Sci. USA 106, 13933–13938 (2009).
Article ADS PubMed PubMed Central Google Scholar
Brown, S. A. et al. Holoprosencephaly due to mutations in ZIC2, a homologue of Drosophila odd-paired. Nat. Genet. 20, 180–183 (1998).
Article PubMed CAS Google Scholar
Hastie, N. D. Dominant negative mutations in the Wilms tumour (WT1) gene cause Denys-Drash syndrome—proof that a tumour-suppressor gene plays a crucial role in normal genitourinary development. Hum. Mol. Genet. 1, 293–295 (1992).
Article PubMed CAS Google Scholar
Reamon-Buettner, S. M. & Borlak, J. HEY2 mutations in malformed hearts. Hum. Mutat. 27, 118 (2006).
Article PubMed Google Scholar
Giannakou, A. et al. Copy number variants in Ebstein anomaly. PLoS ONE 12, e0188168 (2017).
Article PubMed PubMed Central CAS Google Scholar
Sun, Y. M. et al. A HAND2 loss-of-function mutation causes familial ventricular septal defect and pulmonary stenosis. G3 (Bethesda) 6, 987–992 (2016).
Article CAS Google Scholar
Zaidi, S. et al. De novo mutations in histone-modifying genes in congenital heart disease. Nature 498, 220–223 (2013).
Article ADS PubMed PubMed Central CAS Google Scholar
Sifrim, A. et al. Distinct genetic architectures for syndromic and nonsyndromic congenital heart defects identified by exome sequencing. Nat. Genet. 48, 1060–1065 (2016).
Article PubMed CAS PubMed Central Google Scholar
Samocha, K. E. et al. A framework for the interpretation of de novo mutation in human disease. Nat. Genet. 46, 944–950 (2014).
Article PubMed PubMed Central CAS Google Scholar
Jin, S. C. et al. Contribution of rare inherited and de novo variants in 2,871 congenital heart disease probands. Nat. Genet. 49, 1593–1601 (2017).
Article PubMed PubMed Central CAS Google Scholar
Krumm, N. et al. Excess of rare, inherited truncating mutations in autism. Nat. Genet. 47, 582–588 (2015).
Article PubMed PubMed Central CAS Google Scholar
Gonzalez, A. J., Setty, M., & Leslie, C. S. Early enhancer establishment and regulatory locus complexity shape transcriptional programs in hematopoietic differentiation. Nat. Genet. 47, 1249–1259 (2015).
Article PubMed PubMed Central CAS Google Scholar
Vastenhouw, N. L. & Schier, A. F. Bivalent histone modifications in early embryogenesis. Curr. Opin. Cell Biol. 24, 374–386 (2012).
Article PubMed PubMed Central CAS Google Scholar
Stunnenberg, H. G., International Human Epigenome, C. & Hirst, M. The International Human Epigenome Consortium: a blueprint for scientific collaboration and discovery. Cell 167, 1145–1149 (2016).
Article PubMed CAS Google Scholar
Psych, E. C. et al. The PsychENCODE project. Nat. Neurosci. 18, 1707–1712 (2015).
Article CAS Google Scholar
Dekker, J. et al. The 4D nucleome project. Nature 549, 219–226 (2017).
Article ADS PubMed PubMed Central CAS Google Scholar
McLean, C. Y. et al. GREAT improves functional interpretation of cis-regulatory regions. Nat. Biotechnol. 28, 495–501 (2010).
Article PubMed PubMed Central CAS Google Scholar
Dixon, J. R. et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376–380 (2012).
Article ADS PubMed PubMed Central CAS Google Scholar

Download references

Acknowledgements

We thank Dr. Zhao Chen for advices on setting up and running EpiTensor. The work was supported by NIH grant R01GM120609 (S.C. and Y.S.).

Author information

Xinwei Han
Present address: Constellation Pharmaceuticals, 215 First Street, Cambridge, MA, 02142, USA
These authors contributed equally: Xinwei Han, Siying Chen.

Authors and Affiliations

Department of Systems Biology, Columbia University, New York, NY, 10032, USA
Xinwei Han, Siying Chen, Elise Flynn & Yufeng Shen
Department of Pediatrics, Columbia University, New York, NY, 10032, USA
Xinwei Han
The Integrated Program in Cellular, Molecular and Biomedical Studies, Columbia University, New York, NY, 10032, USA
Siying Chen & Elise Flynn
Department of Biostatistics, Columbia University, New York, NY, 10032, USA
Shuang Wu
Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, NY, 14853, USA
Dana Wintner
Department of Biomedical Informatics, Columbia University, New York, NY, 10032, USA
Yufeng Shen
JP Sulzberger Columbia Genome Center, Columbia University, New York, NY, 10032, USA
Yufeng Shen

Authors

Xinwei Han
View author publications
You can also search for this author in PubMed Google Scholar
Siying Chen
View author publications
You can also search for this author in PubMed Google Scholar
Elise Flynn
View author publications
You can also search for this author in PubMed Google Scholar
Shuang Wu
View author publications
You can also search for this author in PubMed Google Scholar
Dana Wintner
View author publications
You can also search for this author in PubMed Google Scholar
Yufeng Shen
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

All authors contributed to data analysis, interpretation, and manuscript writing. X.H. and S.C. curated and processed epigenomics data sets. Y.S. conceived and designed the study.

Corresponding author

Correspondence to Yufeng Shen.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Supplementary Information

Description of Additional Supplementary Files

Supplementary Data 1

Supplementary Data 2

Supplementary Data 3

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Han, X., Chen, S., Flynn, E. et al. Distinct epigenomic patterns are associated with haploinsufficiency and predict risk genes of developmental disorders. Nat Commun 9, 2138 (2018). https://doi.org/10.1038/s41467-018-04552-7

Download citation

Received: 07 November 2017
Accepted: 08 May 2018
Published: 30 May 2018
DOI: https://doi.org/10.1038/s41467-018-04552-7

This article is cited by

An unsupervised deep learning framework for predicting human essential genes from population and functional genomic data
- Troy M. LaPolice
- Yi-Fei Huang
BMC Bioinformatics (2023)
dbCNV: deleteriousness-based model to predict pathogenicity of copy number variations
- Kangqi Lv
- Dayang Chen
- Xiuming Zhang
BMC Genomics (2023)
Pathological variants in genes associated with disorders of sex development and central causes of hypogonadism in a whole-genome reference panel of 8380 Japanese individuals
- Naomi Shiga
- Yumi Yamaguchi-Kabata
- Junichi Sugawara
Human Genome Variation (2022)
X-CNV: genome-wide prediction of the pathogenicity of copy number variations
- Li Zhang
- Jingru Shi
- Tieliu Shi
Genome Medicine (2021)
Novel candidate genes in esophageal atresia/tracheoesophageal fistula identified by exome sequencing
- Jiayao Wang
- Priyanka R. Ahimaz
- Wendy K. Chung
European Journal of Human Genetics (2021)

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.