Here we ask the question “How much information do epigenomic datasets provide about human genomic function?” We consider nine epigenomic features across 115 cell types and measure information about function as a reduction in entropy under a probabilistic evolutionary model fitted to human and nonhuman primate genomes. Several epigenomic features yield more information in combination than they do individually. We find that the entropy in human genetic variation predominantly reflects a balance between mutation and neutral drift. Our cell-type-specific FitCons scores reveal relationships among cell types and suggest that around 8% of nucleotide sites are constrained by natural selection.
All raw data for this study are publicly available from the sources described in the Supplementary Note. The cell-type-specific and integrated FitCons2 scores are available as UCSC genome browser tracks at http://compgen.cshl.edu/fitCons2/. Additional data generated during the course of our analyses can be obtained from the corresponding author upon reasonable request.
The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
Yue, F. et al. A comparative encyclopedia of DNA elements in the mouse genome. Nature 515, 355–364 (2014).
The GTEx Consortium. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science 348, 648–660 (2015).
Kundaje, A. et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).
Doolittle, W. F. Is junk DNA bunk? A critique of ENCODE. Proc. Natl Acad. Sci. USA 110, 5294–5300 (2013).
Eddy, S. R. The ENCODE project: missteps overshadowing a success. Curr. Biol. 23, R259–R261 (2013).
Ernst, J. & Kellis, M. Discovery and characterization of chromatin states for systematic annotation of the human genome. Nat. Biotechnol. 28, 817–825 (2010).
Hoffman, M. M. et al. Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nat. Methods 9, 473–476 (2012).
Ritchie, G. R., Dunham, I., Zeggini, E. & Flicek, P. Functional annotation of noncoding sequence variants. Nat. Methods 11, 294–296 (2014).
Shihab, H. A. et al. An integrative approach to predicting the functional effects of non-coding and coding sequence variation. Bioinformatics 31, 1536–1543 (2015).
Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).
Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931–934 (2015).
Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).
Fu, Y. et al. FunSeq2: a framework for prioritizing noncoding regulatory variants in cancer. Genome Biol. 15, 480 (2014).
Gronau, I., Arbiza, L., Mohammed, J. & Siepel, A. Inference of natural selection from interspersed genomic elements based on polymorphism and divergence. Mol. Biol. Evol. 30, 1159–1171 (2013).
Arbiza, L. et al. Genome-wide inference of natural selection on human transcription factor binding sites. Nat. Genet. 45, 723–729 (2013).
Gulko, B., Hubisz, M. J., Gronau, I. & Siepel, A. A method for calculating probabilities of fitness consequences for point mutations across the human genome. Nat. Genet. 47, 276–283 (2015).
Huang, Y. F., Gulko, B. & Siepel, A. Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data. Nat. Genet. 49, 618–624 (2017).
Iwasa, Y. Free fitness that always increases in evolution. J. Theor. Biol. 135, 265–281 (1988).
Barton, N. H. & Coe, J. B. On the application of statistical physics to evolutionary biology. J. Theor. Biol. 259, 317–324 (2009).
Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562 (2002).
Taipale, J. Informational limits of biological organisms. EMBO J. 37, e96114 (2018).
Gao, T. et al. EnhancerAtlas: a resource for enhancer annotation and analysis in 105 human cell/tissue types. Bioinformatics 32, 3543–3551 (2016).
Andersson, R. et al. An atlas of active enhancers across human cell types and tissues. Nature 507, 455–461 (2014).
Arner, E. et al. Transcribed enhancers lead waves of coordinated transcription in transitioning mammalian cells. Science 347, 1010–1014 (2015).
GTEx Consortium. Genetic effects on gene expression across human tissues. Nature 550, 204–213 (2017).
Harrow, J. et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 22, 1760–1774 (2012).
Liu, F. et al. The human genomic melting map. PLoS Comput. Biol. 3, e93 (2007).
Zerbino, D. R. et al. Ensembl 2018. Nucleic Acids Res. 46, D754–D761 (2018).
The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).
The UK10K Consortium. The UK10K project identifies rare variants in health and disease. Nature 526, 82–90 (2015).
Mallick, S. et al. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature 538, 201–206 (2016).
Telenti, A. et al. Deep sequencing of 10,000 human genomes. Proc. Natl Acad. Sci. USA 113, 11901–11906 (2016).
Song, Q. et al. A reference methylome database and analysis pipeline to facilitate integrative and comparative epigenomics. PLoS ONE 8, e81148 (2013).
Zerbino, D. R., Wilder, S. P., Johnson, N., Juettemann, T. & Flicek, P. R. The Ensembl Regulatory Build. Genome Biol. 16, 56 (2015).
We thank R. Ramani for assistance with browser track development, D. McCandlish for comments on the manuscript, N. Dukler for calculating the number of bits required to encode the reference human genome, and other members of the Siepel laboratory for helpful discussions. This research was supported by US National Institutes of Health grants R01-GM102192 and R35-GM127070 (to A.S.). The content is solely the responsibility of the authors and does not necessarily represent the official views of the US National Institutes of Health.
The authors declare no competing interests.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Integrated supplementary information
The estimated tree is nearly identical to the one estimated from all 115 cell types (Fig. 2). The main changes are in the subtree beneath node 20 (highlighted in gray). Beneath this node, the original and resampled trees are still quite similar, with differences primarily in the order in which decision rules are applied. The impact on the estimated values of ρ is minimal, and the classes for which ρ does change contain few sites (~0.1% of the genome). Four additional trees (not shown) were estimated from random samples of 57 cell types, and all of them were similarly consistent with the tree in Fig. 2.
Supplementary Fig. 2 Reductions in entropy per site due to selection plotted as a function of estimated ρ for the 61 classes.
Reductions in entropy per site due to selection (vertical axis) plotted as a function of estimated ρ (see Supplementary Table 3) for the 61 classes. Sizes of circles reflect numbers of sites. Coding classes (CDS) are shown in blue, and noncoding classes (NCD) are shown in orange. The three classes with negative estimates of the reduction in entropy due to selection are shown in red.
Violin plots of FitCons2 scores similar to the one shown in Fig. 4 for two additional cell types: HUVEC (top) and H1hESC (bottom). Notice the similarity in the annotation-dependent distributions across cell types, despite differences in the genomic locations of the active regions.
The dendrogram is derived from a ‘Manhattan’ or L1 distance matrix defined such that the distance between every pair of cell types is equal to the sum of the absolute differences of their nucleotide-specific FitCons2 scores (see Methods). Clustering was done using the Ward-D2 clustering method in R. Major groups in the dendrogram correspond to cell types associated with (clockwise from top left) blood and the immune system (brown), internal organs (red), the digestive system (gray), neural tissues (blue), skin and connective tissue (purple), and stem cells (green). Insets show examples of closely related cell types from each group. Notice that the digestive cell types are nested within the internal organ-related cell types. Within the neural tissue cluster, separate groups are evident for embryonic and adult brain tissues (blue inset; embryonic cell types highlighted at bottom). Similarly, fetal cell types form subclusters within the internal organ (red inset, entire group) and digestive system (gray inset, gray background) groups. SUS, sites under selection (see Supplementary Note).
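As an illustration of the clustering described in this legend, the following is a minimal Python sketch and not the authors' R code. It assumes a hypothetical matrix `scores` with one row per cell type and one column per nucleotide site. SciPy's `"ward"` linkage uses the same unsquared-distance update as R's `ward.D2`, but applying Ward linkage to L1 rather than Euclidean distances is a heuristic, and the toy data are invented purely for demonstration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def cluster_cell_types(scores: np.ndarray) -> np.ndarray:
    """Cluster cell types from a (n_cell_types, n_sites) score matrix.

    Returns a SciPy linkage matrix describing the dendrogram.
    """
    # Manhattan (L1) distance: sum of absolute per-site score differences
    # between every pair of cell types, in condensed form.
    l1_distances = pdist(scores, metric="cityblock")
    # Ward linkage applied to the precomputed distances (assumption: this
    # mirrors hclust(..., method = "ward.D2") in R).
    return linkage(l1_distances, method="ward")


# Toy data (hypothetical): two groups of three cell types with different
# score levels across 1,000 sites.
rng = np.random.default_rng(0)
group_a = 0.05 + 0.01 * rng.random((3, 1000))
group_b = 0.30 + 0.01 * rng.random((3, 1000))
toy_scores = np.vstack([group_a, group_b])

tree = cluster_cell_types(toy_scores)
# Cut the dendrogram into two clusters.
labels = fcluster(tree, t=2, criterion="maxclust")
```

With real data, `scipy.cluster.hierarchy.dendrogram(tree)` could be used to draw the resulting tree.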
Supplementary Fig. 5 Genome browser display showing FitCons2 scores in the promoter region and first few exons of the MIER2 gene.
In the red callout at the lower left, FitCons1 highlights a regulatory locus of about 200 bp upstream of MIER2 via an elevated score surrounding an enhancer-associated ChromHMM feature. FitCons2 refines this locus by identifying a binding site of ~8 bp (AP1, red) with a sharply elevated score (cluster 48, ρ = 0.31). A second component of this locus contains a smRNA signal immediately adjoining a GWAS hit. An elevated LINSIGHT score provides evidence for the importance of the binding site but does not identify the adjoining variation, suggesting a cell-type-specific effect. The TSS and core promoter are highlighted in a green callout (center left). Active TF binding sites within the promoter are indicated by the green FitCons2 class (16, ρ = 0.43), while elevated scores identify the start codon and more conserved codon positions (blue; 07, ρ = 0.67). At the boundary of the first exon, FitCons2 scores spike, indicating increased selective pressure at both intronic (14, ρ = 0.92) and exonic (04, ρ = 0.93) splice boundaries. At center right, FitCons2 scores alternate with a period of three, reflecting reduced constraint at the third codon position. The scores are elevated at the splice site (orange; 24, ρ = 0.16) and then drop off in the intron.
Supplementary Fig. 6 Genome browser display showing FitCons2 scores at the BCL3 gene and upstream enhancers in E023, a derived adipocyte cell type.
At far left (gold highlight), elevated FitCons2 scores identify individual TFBSs in the enhancer region. Enhancers associated with BCL3 are identified in brown (at top), and both ChromHMM and DNase-seq features support elevated FitCons2 scores. The TSS and core promoter (purple highlight) also show elevated FitCons2 scores and classes indicating promoter activity. Individual binding sites can be observed via the green FitCons2 class 40 (ρ = 0.43). The blue highlight shows elevated selective pressure at an active intronic splice site (red; class 14, ρ = 0.92), followed by a periodic pattern mirroring codon structure (blue; classes 00, ρ = 0.75, and 05, ρ = 0.62). FitCons2 scores here are elevated by the presence of a strong RNA-seq feature (dark green). The central brown highlight shows several areas identified as intronic enhancers for BCL3 (brown at top) including a cluster of TFBSs (gold). In the final detail (dark green), the smRNA-seq feature (turquoise) drives an elevated score that surrounds an 8-bp locus in the 3ʹ UTR of this gene, an annotated microRNA-binding site (BCL3:miR-19).
Supplementary Fig. 7 Genome browser display showing FitCons2 scores for multiple cell types at a super-enhancer on chromosome 13.
a, Genome browser display showing FitCons2 scores for multiple cell types at a cell-type-specific super-enhancer on chromosome 13. Super-enhancer SE33394 appears active, and obtains high scores, in the H1hESC cell type (A) but not in the GM12878 (B) or HUVEC (C) cell types. Although the SE33394 target LECT1 is nearly 3 Mb away, transcription levels at the gene follow FitCons2 enhancer scores in the corresponding cell type. b, LECT1 (which encodes a protein associated with suppression of angiogenesis) is transcribed in H1hESC (A) but not GM12878 (B) or HUVEC (C) cells. Highlighted in gold in a, SE33394 contains five loci identified as distal regulatory modules (blue at top) as well as a FANTOM5 enhancer (green), and is flanked by two GWAS hits associated with blood-related phenotypes (highlighted in red, rs10507601 and rs9527419). Scores at each genomic position are aggregated across cell types to generate an integrated FitCons2 score (bottom). This integrated score identifies elements of the super-enhancer exhibiting the potential for cell-type-specific activity, without requiring epigenomic data from any particular cell type.
a, Sensitivity of various computational prediction methods (see Supplementary Note) for cell-type-specific transcription factor binding sites (TFBSs). Sensitivity is evaluated using 55,024 motif matches for 12 transcription factors in ChIP–seq peaks for H1-hESC cells (ref. 6; Supplementary Methods). Sensitivity is plotted against total coverage outside of annotated coding regions as the prediction threshold for each method is varied. Results for two sets of FitCons1 and FitCons2 scores are shown: integrated scores across cell types (I) and cell-type-specific scores for H1-hESC cells. For reference, the vertical gray bar shows the expected fraction of the noncoding genome that is under selection according to FitCons2 (that is, the average score in noncoding regions). b, Receiver operating characteristic (ROC) curves for human disease-associated (pathogenic) single-nucleotide variants (SNVs) listed in HGMD (1,495 HGMD SNVs and 15,042 matched negative controls). The same computational methods are shown, but in this case only integrated scores are used for FitCons1 and FitCons2. The area-under-the-curve (AUC) statistic is listed after each label in the key. False positives are assessed using likely benign variants matched by distance to the nearest transcription start site (Supplementary Methods).
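For readers unfamiliar with the statistics in panel b, the following Python sketch shows one standard way to trace an ROC curve and compute its AUC from per-variant scores and pathogenic/benign labels. It is an illustrative assumption and not the evaluation code used in the study. The function names and toy values are hypothetical.

```python
import numpy as np


def roc_curve_points(scores, labels):
    """Return (fpr, tpr) as the score threshold sweeps from high to low."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(-scores, kind="stable")
    sorted_scores = scores[order]
    sorted_labels = labels[order]
    # Cumulative true and false positives when calling the top-k variants.
    tp = np.cumsum(sorted_labels)
    fp = np.cumsum(~sorted_labels)
    # Keep one point per distinct score so tied scores share a threshold.
    last = np.r_[np.nonzero(np.diff(sorted_scores))[0], len(scores) - 1]
    tpr = np.r_[0.0, tp[last] / labels.sum()]
    fpr = np.r_[0.0, fp[last] / (~labels).sum()]
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


# Toy example (hypothetical): three pathogenic and three benign variants.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_auc = auc(*roc_curve_points(toy_scores, toy_labels))  # 8/9, about 0.889
```

The AUC equals the probability that a randomly chosen positive outscores a randomly chosen negative; in the toy data, 8 of the 9 positive/negative pairs are ordered correctly.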
Sensitivity for TFBS prediction as a function of total noncoding coverage (as in Supplementary Fig. 9a) for the K562 cell type.
Precision (vertical axis) versus recall (horizontal axis) for HGMD. This plot is based on the same data as Supplementary Fig. 9b.
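As a rough guide to reading a precision-recall plot of this kind, the Python sketch below computes precision and recall at each score threshold. It is an assumption-laden illustration with hypothetical names and toy values, not the study's code. The only figures taken from the text are the class sizes reported for HGMD (1,495 positives and 15,042 matched controls), which imply the precision a random ranking would achieve.

```python
import numpy as np


def precision_recall_points(scores, labels):
    """Return (precision, recall) as the score threshold sweeps high to low."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(-scores, kind="stable")
    sorted_scores = scores[order]
    sorted_labels = labels[order]
    tp = np.cumsum(sorted_labels)            # true positives among top-k calls
    called = np.arange(1, len(scores) + 1)   # number of variants called
    # Keep one point per distinct score so tied scores share a threshold.
    last = np.r_[np.nonzero(np.diff(sorted_scores))[0], len(scores) - 1]
    precision = tp[last] / called[last]
    recall = tp[last] / labels.sum()
    return precision, recall


# Toy example (hypothetical scores and labels).
precision, recall = precision_recall_points(
    [0.9, 0.8, 0.7, 0.6, 0.5, 0.4], [1, 1, 0, 1, 0, 0]
)

# Baseline precision of a random ranking, from the HGMD class sizes above.
n_pos, n_neg = 1495, 15042
baseline_precision = n_pos / (n_pos + n_neg)  # about 0.09
```

Because positives make up only about 9% of the HGMD evaluation set, precision well above that baseline is informative even when it is far below 1.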
Supplementary Fig. 13 Receiver operating characteristic curves and precision–recall curves for ClinVar.
Receiver operating characteristic curves (left) and precision–recall curves (right) for ClinVar.