We describe a new computational method for estimating the probability that a point mutation at each position in a genome will influence fitness. These 'fitness consequence' (fitCons) scores serve as evolution-based measures of potential genomic function. Our approach is to cluster genomic positions into groups exhibiting distinct 'fingerprints' on the basis of high-throughput functional genomic data, then to estimate a probability of fitness consequences for each group from associated patterns of genetic polymorphism and divergence. We have generated fitCons scores for three human cell types on the basis of public data from ENCODE. In comparison with conventional conservation scores, fitCons scores show considerably improved prediction power for cis regulatory elements. In addition, fitCons scores indicate that 4.2–7.5% of nucleotides in the human genome have influenced fitness since the human-chimpanzee divergence, and they suggest that recent evolutionary turnover has had limited impact on the functional content of the genome.
Mardis, E.R. A decade's perspective on DNA sequencing technology. Nature 470, 198–203 (2011).
Wold, B. & Myers, R.M. Sequence census methods for functional genomics. Nat. Methods 5, 19–21 (2008).
ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
Shen, Y. et al. A map of the cis-regulatory sequences in the mouse genome. Nature 488, 116–120 (2012).
Neph, S. et al. An expansive human regulatory lexicon encoded in transcription factor footprints. Nature 489, 83–90 (2012).
Lappalainen, T. et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501, 506–511 (2013).
Cooper, G.M. & Shendure, J. Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data. Nat. Rev. Genet. 12, 628–640 (2011).
Mayor, C. et al. VISTA: visualizing global DNA sequence alignments of arbitrary length. Bioinformatics 16, 1046–1047 (2000).
Margulies, E.H., Blanchette, M., Program, N.C.S., Haussler, D. & Green, E.D. Identification and characterization of multi-species conserved sequences. Genome Res. 13, 2507–2518 (2003).
Boffelli, D. et al. Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science 299, 1391–1394 (2003).
Ovcharenko, I., Boffelli, D. & Loots, G.G. eShadow: a tool for comparing closely related sequences. Genome Res. 14, 1191–1198 (2004).
Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050 (2005).
Cooper, G.M. et al. Distribution and intensity of constraint in mammalian genomic sequence. Genome Res. 15, 901–913 (2005).
Asthana, S., Roytberg, M., Stamatoyannopoulos, J. & Sunyaev, S. Analysis of sequence conservation at nucleotide resolution. PLOS Comput. Biol. 3, e254 (2007).
Pollard, K.S., Hubisz, M.J., Rosenbloom, K.R. & Siepel, A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 20, 110–121 (2010).
Graur, D. et al. On the immortality of television sets: “function” in the human genome according to the evolution-free gospel of ENCODE. Genome Biol. Evol. 5, 578–590 (2013).
Niu, D.K. & Jiang, L. Can ENCODE tell us how much junk DNA we carry in our genome? Biochem. Biophys. Res. Commun. 430, 1340–1343 (2013).
Doolittle, W.F. Is junk DNA bunk? A critique of ENCODE. Proc. Natl. Acad. Sci. USA 110, 5294–5300 (2013).
Eddy, S.R. The ENCODE project: missteps overshadowing a success. Curr. Biol. 23, R259–R261 (2013).
McDonald, J.H. & Kreitman, M. Adaptive protein evolution at the Adh locus in Drosophila. Nature 351, 652–654 (1991).
Fay, J.C., Wyckoff, G.J. & Wu, C.I. Positive and negative selection on the human genome. Genetics 158, 1227–1234 (2001).
Andolfatto, P. Adaptive evolution of non-coding DNA in Drosophila. Nature 437, 1149–1152 (2005).
Eyre-Walker, A., Woolfit, M. & Phelps, T. The distribution of fitness effects of new deleterious amino acid mutations in humans. Genetics 173, 891–900 (2006).
Boyko, A.R. et al. Assessing the evolutionary impact of amino acid mutations in the human genome. PLoS Genet. 4, e1000083 (2008).
Wilson, D.J., Hernandez, R.D., Andolfatto, P. & Przeworski, M. A population genetics–phylogenetics approach to inferring natural selection in coding sequences. PLoS Genet. 7, e1002395 (2011).
Ward, L.D. & Kellis, M. Evidence of abundant purifying selection in humans for recently acquired regulatory functions. Science 337, 1675–1678 (2012).
Khurana, E. et al. Integrative annotation of variants from 1092 humans: application to cancer genomics. Science 342, 1235587 (2013).
Arbiza, L. et al. Genome-wide inference of natural selection on human transcription factor binding sites. Nat. Genet. 45, 723–729 (2013).
Narlikar, L. et al. Genome-wide discovery of human heart enhancers. Genome Res. 20, 381–392 (2010).
Ritchie, G.R., Dunham, I., Zeggini, E. & Flicek, P. Functional annotation of noncoding sequence variants. Nat. Methods 11, 294–296 (2014).
Ernst, J. & Kellis, M. Discovery and characterization of chromatin states for systematic annotation of the human genome. Nat. Biotechnol. 28, 817–825 (2010).
Hoffman, M.M. et al. Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nat. Methods 9, 473–476 (2012).
Hoffman, M.M. et al. Integrative annotation of chromatin elements from ENCODE data. Nucleic Acids Res. 41, 827–841 (2013).
Gronau, I., Arbiza, L., Mohammed, J. & Siepel, A. Inference of natural selection from interspersed genomic elements based on polymorphism and divergence. Mol. Biol. Evol. 30, 1159–1171 (2013).
Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).
Boyle, A.P. et al. Annotation of functional variation in personal genomes using RegulomeDB. Genome Res. 22, 1790–1797 (2012).
Erwin, G.D. et al. Integrating diverse datasets improves developmental enhancer prediction. PLOS Comput. Biol. 10, e1003677 (2014).
Gerstein, M.B. et al. Architecture of the human regulatory network derived from ENCODE data. Nature 489, 91–100 (2012).
Core, L.J. et al. Analysis of nascent RNA identifies a unified architecture of transcription initiation regions at mammalian promoters and enhancers. Nat. Genet. 46, 1311–1320 (2014).
Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562 (2002).
Cooper, G.M. et al. Characterization of evolutionary rates and constraints in three mammalian genomes. Genome Res. 14, 539–548 (2004).
Lindblad-Toh, K. et al. Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature 438, 803–819 (2005).
Lindblad-Toh, K. et al. A high-resolution map of human evolutionary constraint using 29 mammals. Nature 478, 476–482 (2011).
Ponting, C.P., Nellaker, C. & Meader, S. Rapid turnover of functional sequence in human and other genomes. Annu. Rev. Genomics Hum. Genet. 12, 275–299 (2011).
Chiaromonte, F. et al. The share of human genomic DNA under selection estimated from human-mouse genomic alignments. Cold Spring Harb. Symp. Quant. Biol. 68, 245–254 (2003).
Meader, S., Ponting, C.P. & Lunter, G. Massive turnover of functional sequence in human and other mammalian genomes. Genome Res. 20, 1335–1343 (2010).
Smith, N.G., Brandstrom, M. & Ellegren, H. Evidence for turnover of functional noncoding DNA in mammalian genome evolution. Genomics 84, 806–813 (2004).
Ponting, C.P. & Hardison, R.C. What fraction of the human genome is functional? Genome Res. 21, 1769–1776 (2011).
Rands, C.M., Meader, S., Ponting, C.P. & Lunter, G. 8.2% of the human genome is constrained: variation in rates of turnover across functional element classes in the human lineage. PLoS Genet. 10, e1004525 (2014).
Lunter, G., Ponting, C.P. & Hein, J. Genome-wide identification of human functional DNA using a neutral indel model. PLOS Comput. Biol. 2, e5 (2006).
Kellis, M. et al. Defining functional DNA elements in the human genome. Proc. Natl. Acad. Sci. USA 111, 6131–6138 (2014).
Pheasant, M. & Mattick, J.S. Raising the estimate of functional human sequences. Genome Res. 17, 1245–1253 (2007).
Gronau, I., Hubisz, M.J., Gulko, B., Danko, C.G. & Siepel, A. Bayesian inference of ancient human demography from individual genome sequences. Nat. Genet. 43, 1031–1034 (2011).
Harrow, J. et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 22, 1760–1774 (2012).
We thank L. Arbiza for helpful discussions and assistance with early analyses and G. Cooper for constructive criticism of our validation experiments and comparisons with CADD. This research was supported by US National Institutes of Health grant GM102192, a David and Lucile Packard Fellowship for Science and Engineering (to A.S.) and a postdoctoral fellowship from the Cornell Center for Comparative and Population Genomics (to I.G.). The content is solely the responsibility of the authors and does not necessarily represent the official views of the US National Institutes of Health.
The authors declare no competing financial interests.
Integrated supplementary information
Supplementary Figure 1
Each of the 624 clusters is represented by a single point, with its x coordinate given by the fitCons score calculated as shown in Figure 1 and its y coordinate given by the mean placental mammalian phyloP score for the associated genomic positions (ref. 15). The clusters naturally fall into two groups, corresponding to coding sequences (CDSs) with higher scores (green crosses) and noncoding sequences with lower scores (blue Xs). Three groups of outliers are shown, representing noncoding clusters with elevated fitCons scores relative to their phyloP scores. Cluster A consists of 1,200 genomic positions in narrow DNase-seq peaks with no RNA-seq signal, yet with chromatin modifications indicating transcriptional activity. These sites are strongly enriched for ChIP-seq–supported TFBSs and may contain enhancers with weakly expressed eRNAs not detectable from the available RNA-seq data. The two clusters in B contain 92.8 kb of sequence defined by high RNA-seq signals, broad DNase-seq peaks and Pol II binding and are strongly enriched for 3′ UTR and ncRNA annotations. Cluster C contains 52.7 kb of sequence with no DNase-seq signal but some RNA-seq signal, along with insulator-associated chromatin modifications. This class is strongly enriched for eQTLs and CTCF-binding sites, suggesting transcriptional silencing activity. Thus, all four of these clusters (A, the two in B, and C) appear to be rich in regulatory sequences that could plausibly have experienced weak natural selection during most of mammalian evolution but have come under stronger selection recently on the human lineage.
Supplementary Figure 2 Receiver operating characteristic (ROC) curves for cell type–specific regulatory elements.
Three types of regulatory elements were considered: (a) transcription factor binding sites (TFBSs), (b) expression QTLs (eQTLs) and (c) enhancers identified by chromatin marks. Separate curves are shown for fitCons, phastCons (ref. 12), CADD (ref. 35), GERP (ref. 13) and phyloP (ref. 15) scores. In b, a curve is also shown for the RegulomeDB database (ref. 36), and in c a curve is also shown for EnhancerFinder (ref. 37). True positive rates were estimated by the fraction of nucleotides in annotated elements having scores that exceed a given score threshold, and false positive rates were estimated by the fraction of nucleotides in a matched set of ‘negative’ elements having scores that exceed the same threshold (see the Online Methods for details). Each curve is generated by varying this threshold across the full range of scores for the corresponding method. In this case, only elements ‘active’ in the cell type for which the fitCons scores were produced (HUVECs) were considered (Online Methods; see Supplementary Fig. 3 for the results for a pooled set of elements across cell types). AUC values, shown in parentheses, represent areas under the ROC curve and provide an overall measure of predictive power. The apparent performance of RegulomeDB on eQTLs, particularly at low false positive rates, is somewhat influenced by the explicit inclusion of eQTL data in its scoring scheme.
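The threshold sweep described in this caption can be illustrated with a minimal sketch. It assumes the per-nucleotide scores for annotated ('positive') and matched 'negative' elements have already been extracted into two plain lists; the function names `roc_points` and `auc` are hypothetical and are not taken from the authors' software. The sketch uses a `>=` comparison at each observed score value, which gives the same curve as a strict "exceeds" test with the threshold placed just below that value.

```python
def roc_points(pos_scores, neg_scores):
    """Return (false positive rate, true positive rate) pairs for a threshold sweep.

    pos_scores: scores of nucleotides in annotated elements (assumed input).
    neg_scores: scores of nucleotides in matched negative elements (assumed input).
    """
    # Sweep thresholds from the highest observed score down to the lowest.
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: nothing is called positive
    for t in thresholds:
        tpr = sum(s >= t for s in pos_scores) / len(pos_scores)
        fpr = sum(s >= t for s in neg_scores) / len(neg_scores)
        points.append((fpr, tpr))
    return points


def auc(points):
    """Area under a piecewise-linear curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    # Toy scores, invented purely for illustration.
    positives = [0.9, 0.8, 0.4]
    negatives = [0.7, 0.3, 0.1]
    print(auc(roc_points(positives, negatives)))
```

The construction of the matched negative set is described in the Online Methods and is not reproduced here.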
Supplementary Figure 3 Receiver operating characteristic (ROC) curves for regulatory elements pooled across cell types.
Three types of regulatory elements were considered: (a) transcription factor binding sites (TFBSs) derived from ENCODE ChIP-seq data for 19 different cell types (ref. 28), (b) expression QTLs (eQTLs) for lymphoblastoid cells from 462 individuals (ref. 6) and (c) enhancers identified by chromatin marks in 11 cell types (ref. 38). Separate curves are shown for fitCons, phastCons (ref. 12), CADD (ref. 35), GERP (ref. 13) and phyloP (ref. 15) scores. In b, a curve is also shown for the RegulomeDB database (ref. 36), and in c a curve is also shown for EnhancerFinder (ref. 37). The fitCons scores used here are computed by aggregating functional information across HUVEC, H1 hESC and GM12878 cells (Online Methods). Note that some regulatory elements might not be active in any of the three cell types. The apparent performance of RegulomeDB on eQTLs, particularly at low false positive rates, is somewhat influenced by the explicit inclusion of eQTL data in its scoring scheme.
Supplementary Figure 4 ROC and ROC-like curves for high-information-content positions in transcription factor binding sites.
These panels parallel previous figures except that, in this case, only positions in ChIP-seq–annotated transcription factor binding sites with strong nucleotide preferences (relative frequency of preferred allele ≥ 90% in motif model) are considered. Shown are (a) coverage as a function of total noncoding coverage (as in Fig. 5a); (b) a receiver operating characteristic (ROC) curve for elements active in HUVECs (as in Supplementary Fig. 2a); and (c) a ROC curve based on elements active in various cell types and integrated fitCons scores (as in Supplementary Fig. 3a). These curves are highly similar to the ones based on whole binding sites, despite known correlations between natural selection and information content for at least some transcription factors (refs. 15,28), apparently because these correlations tend to be fairly weak and transcription factor specific and generally occur below the prediction thresholds of interest.
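The position filter in this caption (preferred allele with relative frequency ≥ 90% in the motif model) can be sketched as follows. The representation of the motif model as a list of per-position nucleotide count dictionaries, and the function name `high_information_positions`, are assumptions made for illustration; the source does not specify the data format used.

```python
def high_information_positions(motif_counts, min_freq=0.90):
    """Return 0-based motif positions whose preferred nucleotide has
    relative frequency >= min_freq.

    motif_counts: list of dicts mapping 'A', 'C', 'G', 'T' to counts
                  (a hypothetical representation of the motif model).
    """
    selected = []
    for index, column in enumerate(motif_counts):
        total = sum(column.values())
        # Skip empty columns to avoid division by zero.
        if total > 0 and max(column.values()) / total >= min_freq:
            selected.append(index)
    return selected


if __name__ == "__main__":
    # Toy three-position motif, invented for illustration.
    motif = [
        {"A": 95, "C": 2, "G": 2, "T": 1},     # strongly prefers A
        {"A": 40, "C": 30, "G": 20, "T": 10},  # weak preference
        {"A": 0, "C": 0, "G": 92, "T": 8},     # strongly prefers G
    ]
    print(high_information_positions(motif))
```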
Supplementary Figure 5
Curves are shown for a set of 2,053 putative enhancers identified in GM12878 lymphoblastoid cells based on characteristic patterns of divergent transcription initiation, as measured by a variant of GRO-seq that enriches for 5′-7meGTP-capped RNAs (ref. 39). The tested enhancers were identified by starting with the ‘unstable/unstable’ (UU) pairs of divergent transcription start sites from ref. 39 and eliminating those that fell within 2 kb of a known gene. Each enhancer was assumed to consist of a 200-bp interval centered on the midpoint between the paired transcription start sites. Shown are curves for both cell type–integrated (FitConsI) and GM12878-specific (FitConsGM) fitCons scores, as well as for EnhancerFinder (ref. 37), CADD (ref. 35), phastCons (ref. 12), GERP (ref. 13) and phyloP (ref. 15). The coarse, stair-step appearance of the FitConsGM curve reflects a lack of diversity in the functional genomic fingerprints coinciding with these enhancers, and the improvement in the FitConsI curve suggests a gain in power from considering overlapping enhancers in other cell types. Notice that EnhancerFinder and CADD perform fairly well on this set, but the conservation-based methods perform poorly.
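The interval construction in this caption can be sketched as below. Several details are assumptions not stated in the source: coordinates are treated as 0-based and half-open, a pair is treated as 'within 2 kb of a known gene' when its midpoint lies within 2,000 bp of a gene interval on the same chromosome, and the function name `enhancer_intervals` is hypothetical.

```python
def enhancer_intervals(tss_pairs, genes, exclusion=2000, width=200):
    """Build fixed-width enhancer intervals from divergent TSS pairs.

    tss_pairs: list of (chrom, tss_a, tss_b) tuples for paired start sites.
    genes:     list of (chrom, start, end) tuples for known genes.
    Pairs whose midpoint lies within `exclusion` bp of a gene are dropped
    (an assumed reading of the 2-kb filter).
    """
    intervals = []
    for chrom, tss_a, tss_b in tss_pairs:
        midpoint = (tss_a + tss_b) // 2
        near_gene = any(
            gene_chrom == chrom
            and gene_start - exclusion <= midpoint < gene_end + exclusion
            for gene_chrom, gene_start, gene_end in genes
        )
        if not near_gene:
            half = width // 2
            intervals.append((chrom, midpoint - half, midpoint + half))
    return intervals


if __name__ == "__main__":
    # Toy coordinates, invented for illustration.
    genes = [("chr1", 10000, 20000)]
    pairs = [
        ("chr1", 50000, 50100),  # far from the gene: kept
        ("chr1", 21000, 21100),  # about 1 kb downstream of the gene: dropped
        ("chr2", 15000, 15100),  # different chromosome: kept
    ]
    print(enhancer_intervals(pairs, genes))
```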
Supplementary Figure 6 Comparison of original fitCons scores (FitCons) with an alternative set of scores based on ancestral repeats as neutral sites (FitConsAR).
Each point represents a particular functional genomic class. The two sets of scores are highly correlated overall (R² = 0.95), suggesting that they are not highly sensitive to the choice of neutral sites. Surprisingly, however, the scores based on ARs are slightly reduced overall (genomic average of 0.058 versus 0.075), apparently owing to reduced estimates of neutral divergence rates for ARs. Notice that this trend is the opposite of what would be expected if the ARs were under less constraint than our more inclusive set of putatively neutral sites, as one might surmise would be true. We speculate that it may be a consequence of unusual properties of transposable elements, such as AT richness, hypermethylation or exapted functional elements. The ARs used for this analysis consisted of families of RepeatMasker-identified repeats having an average divergence from the consensus of >15%, excluding simple sequence repeats, microsatellites, rRNAs, tRNAs and other potentially problematic families (871 Mb of sequence in total).
Supplementary Figure 7 Coverage of regulatory elements as a function of total noncoding coverage for fitCons scores based on ancestral repeats.
As in Figure 5, coverage of each type of element is shown as the score threshold is adjusted to alter the total coverage of noncoding sequences in the genome. FitCons scores based on ancestral repeats (FitConsAR) are compared with ordinary fitCons scores (FitCons) and scores from phastCons (ref. 12), CADD (ref. 35), GERP (ref. 13), phyloP (ref. 15) and RegulomeDB (ref. 36). Notice that the FitCons and FitConsAR scores behave almost identically at low levels of coverage and show only modest differences at higher levels of coverage. See Supplementary Figure 6 for details regarding ancestral repeats.
Supplementary Figure 8 FitCons scores for the same functional fingerprint in differing cell types are strongly correlated.
FitCons scores for all functional classes for (a) HUVECs versus H1 hESCs, (b) HUVECs versus GM12878 cells, and (c) GM12878 cells versus H1 hESCs. Although the individual positions assigned to each class vary widely according to cell type, the fitCons scores remain relatively constant, with Pearson correlations ≥ 0.93 and Spearman correlations ≥ 0.87 between pairs of cell types.
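The two correlation measures quoted in this caption can be computed from per-class scores as in the following sketch. It assumes the scores for the same functional classes in two cell types are available as two equal-length lists in matching order; the helper names are hypothetical, and the Spearman coefficient is computed here as the Pearson correlation of average ranks, a standard definition rather than something specified in the source.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Toy per-class scores for two cell types, invented for illustration.
    cell_type_1 = [0.05, 0.10, 0.20, 0.60]
    cell_type_2 = [0.06, 0.09, 0.25, 0.55]
    print(pearson(cell_type_1, cell_type_2), spearman(cell_type_1, cell_type_2))
```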
Supplementary Figure 9
Mean fitCons score for (a) 100-bp promoters and (b) eQTLs that are active in one cell type and inactive in another, based on RNA-seq data for the associated gene (Online Methods). Error bars represent the standard errors of the aggregated fitCons scores (Online Methods). FitCons scores computed using functional genomic data from H1 hESCs (orange bars) for elements active in H1 hESCs and inactive in HUVECs (H1 hESC+/HUVEC–) are significantly higher than those for elements inactive in H1 hESCs and active in HUVECs (H1 hESC–/HUVEC+). The opposite pattern is observed for fitCons scores computed using functional genomic data from HUVECs (purple bars).
Supplementary Figure 10 Receiver operating characteristic (ROC) curves comparing integrated fitCons scores with cell type–specific fitCons scores.
The top row shows the predictive performance of fitCons scores for elements ‘active’ in the HUVEC cell type: (a) TFBSs, (b) eQTLs and (c) enhancers. Three versions of the fitCons score are shown: cell type–specific scores based on HUVECs (FitConsHU) and H1 hESCs (FitConsH1) and scores based on integrated data from all three cell types (FitConsI). Notice that the FitConsI scores perform as well as those based on the ‘active’ cell type (FitConsHU), whereas those based on a different cell type (FitConsH1) perform substantially worse. The bottom row shows the same fitCons scores applied to elements aggregated from a broad range of cell types: (d) TFBSs, (e) eQTLs and (f) enhancers. In this case, FitConsI outperforms both sets of cell type–specific scores. Thus, the integrated scores (FitConsI) appear to improve performance in a cell type–general setting without much cost in the cell type–specific setting.
Cite this article
Gulko, B., Hubisz, M., Gronau, I. et al. A method for calculating probabilities of fitness consequences for point mutations across the human genome. Nat Genet 47, 276–283 (2015). https://doi.org/10.1038/ng.3196