The human noncoding genome defined by genetic diversity

Abstract

Understanding the significance of genetic variants in the noncoding genome is emerging as the next challenge in human genomics. We used the power of 11,257 whole-genome sequences and 16,384 heptamers (7-nt motifs) to build a map of sequence constraint for the human species. This build differed substantially from traditional maps of interspecies conservation and identified regulatory elements among the most constrained regions of the genome. Using new Hi-C experimental data, we describe a strong pattern of coordination over 2 Mb where the most constrained regulatory elements associate with the most essential genes. Constrained regions of the noncoding genome are up to 52-fold enriched for known pathogenic variants as compared to unconstrained regions (21-fold when compared to the genome average). This map of sequence constraint across thousands of individuals is an asset to help interpret noncoding elements in the human genome, prioritize variants and reconsider gene units at a larger scale.

Fig. 1: k-mer structure of the genome and composition of the constrained human genome.
Fig. 2: Coordinated constraint of genes and cis or distal regulatory elements.
Fig. 3: Distribution of pathogenic variants across the genome.
Fig. 4: Performance and complementarity of CDTS and other metrics for noncoding variants.

Acknowledgements

We thank Human Longevity, Inc., for financial support.

Author information

Contributions

J.d.I., J.C.V. and A.T. conceived and designed the study; J.d.I., I.B., E.H.M.W., H.-C.Y., M.A.H., N.S. and E.F.K. performed the analyses; V.L. established the search capability; M.M.F. and W.H.B. performed sequencing; D.Y., I.J. and B.R. performed pcHi-C; and J.d.I., E.H.M.W., B.R. and A.T. wrote the manuscript.

Corresponding author

Correspondence to Amalio Telenti.

Ethics declarations

Competing interests

J.d.I., E.H.M.W., H.-C.Y., V.L., M.A.H., N.S., E.F.K., W.H.B. and J.C.V. are employees of Human Longevity, Inc.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Genetic ancestry of the study population.

a, Number of genomes sharing each ancestry. b, Principal-component analysis (PCA) of the study population. PCA was performed using PLINK (1.9) on 162,997 ancestry-informative markers. Genomes are colored based on their major ancestries. EUR, European; AFR, African; EAS, East Asian; CSA, Central South Asian; ARM, Native American; ADMIX, admixed population group.

Supplementary Figure 2 Heptamer metrics in the human genome.

a, Cumulative distribution function of the total number of occurrences of each heptamer in the genome. Each dot (n = 16,384) represents a heptamer. b, Cumulative distribution function of the autosomal count scores. The count score represents the fraction of occurrences of a heptameric sequence in which the middle nucleotide varies. Every circle (n = 16,384) represents a heptameric sequence. The size of the circles is proportional to the number of occurrences of the heptamer in the genome (plotted in a). c, Cumulative distribution function of the autosomal frequency scores. The frequency score represents the fraction of SNVs at the middle nucleotide of a heptamer that have an allelic frequency >0.0001. Every circle (n = 16,384) represents a heptameric sequence. The size of the circles is proportional to the number of occurrences of the heptamer in the genome (plotted in a). d, Cumulative distribution function of the autosomal tolerance scores. The tolerance score represents the probability of the middle nucleotide in a heptamer varying with an AF >0.0001. Every circle (n = 16,384) represents a heptameric sequence. The size of the circles is proportional to the number of occurrences of the heptamer in the genome (plotted in a). e, Comparison of tolerance scores computed separately on autosomes and on chromosome X. Each dot (n = 16,384) represents a heptamer. The r² represents the fraction of the variation explained by a linear regression model. The dashed line represents x = y. AF, allelic frequency; SNV, single-nucleotide variant.

Supplementary Figure 3 Distribution of genomic elements within the CDTS spectrum.

a, The bar plot displays the cumulative territory fraction covered by each element family at different percentiles (1 to 100). “Others” refers to ENCODE element families that did not cover a substantial part of the genome individually (such as transcription factor binding sites; Methods). The elements appear in the same order as in the legend. b, Size-normalized distribution of super-enhancer annotation. The relative enrichment of the fraction of enhancer bins overlapping with super-enhancer annotation is calculated with regard to the 100th percentile. Super-enhancers were subcategorized depending on the number of cell types in which they were annotated, represented by the lines of multiple shades of gray. c, The bar plot displays the distribution of the total number of nucleotides within the percentile slices for each element family. The boxes within a bar indicate the fraction of elements in each percentile slice (e.g., 23% of the promoters are within the 1st percentile). The element families are ordered on the x axis by the fraction of elements within the 1st-percentile slice. The coloring of the boxes is in the same order as in the legend. CDS, coding sequence; ncRNA, noncoding RNA; Prom., promoter; FC, fold change; CDTS, context-dependent tolerance score.

Supplementary Figure 4 Distribution of chromosomes within the CDTS spectrum.

a, The bar plot displays the cumulative territory fraction covered by autosomes and chromosome X throughout the CDTS spectrum for unrelated individuals in the study (n = 7,794). b, The bar plot displays the cumulative territory fraction covered by each chromosome throughout the CDTS spectrum for unrelated individuals in the study. c, The bar plot displays the cumulative territory fraction covered by autosomes and chromosome X throughout the CDTS spectrum for all individuals in the study (n = 11,257). d, The bar plot displays the cumulative territory fraction covered by each chromosome throughout the CDTS spectrum for all individuals. e, The bar plot displays the cumulative territory fraction covered by autosomes and chromosome X throughout the CDTS spectrum for unrelated individuals (merged from this study and the gnomAD Consortium; n = 23,290). f, The bar plot displays the cumulative territory fraction covered by each chromosome throughout the CDTS spectrum for unrelated individuals merged from this study and the gnomAD Consortium. The coloring of the boxes is in the same order as in the legend. The difference in chromosome X distribution for the smaller population reflects the lack of power to discriminate variation at the allelic frequency threshold used. The distributions of chromosome X in “all individuals” and “unrelated individuals (merged from this study and the gnomAD Consortium)” are very similar, indicating that the distribution stabilizes after reaching a sufficient number of chromosome X alleles. The autosome distribution is not subject to the same noise in the smaller study population, as both males and females provide two allele counts each. CDTS, context-dependent tolerance score; HLI, Human Longevity, Inc.; gnomAD, Genome Aggregation Database (http://gnomad.broadinstitute.org/).

Supplementary Figure 5 Robustness of the approach with different study populations.

The bar plots display the cumulative territory fraction covered by each element family in the different percentile slices (indicated on the x axis). The percentiles are based on the rank of CDTS values. The similarity in distributions indicates that the CDTS metric is robust to downsampling and to the choice of study population. “Others” refers to ENCODE element families that did not cover a substantial part of the genome individually (such as transcription factor binding sites; Methods). The elements appear in the same order as in the legend depicted below the bar plots. Every bar plot was obtained by computing CDTS with a different study population or subset of study populations. Unrelated are a subset of All. Unrelated EUR, AFR and ADMIX are a subset of unrelated. CDS, coding sequence; ncRNA, noncoding RNA; EUR, European; AFR, African; ADMIX, admixed population group.
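As an illustration of rank-based percentile slices, the following minimal sketch assigns each genomic bin to a slice from 1 to 100 according to the rank of its score. It assumes that lower CDTS values correspond to lower percentiles (the most constrained end of the spectrum) and that ties are broken by input order. Both are assumptions of this sketch, and the function name is hypothetical.

```python
def percentile_slices(scores):
    """Map each score to a percentile slice (1..100) based on its rank.

    scores: list of per-bin values (hypothetical input, e.g. CDTS per bin).
    The lowest-ranked 1% of bins land in slice 1, the highest in slice 100.
    """
    n = len(scores)
    # Indices sorted by ascending score; sorted() is stable, so ties keep
    # their input order.
    order = sorted(range(n), key=lambda i: scores[i])
    slices = [0] * n
    for rank, idx in enumerate(order):
        slices[idx] = rank * 100 // n + 1
    return slices


toy_slices = percentile_slices(list(range(200, 0, -1)))
```

With 200 toy bins, each slice receives exactly two bins. Real data with many tied scores would need an explicit tie-handling rule.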

Supplementary Figure 6 Comparison of CDTS between study populations.

a, The heat map compares the CDTS percentiles computed with two different study populations: unrelated EUR (n = 4,436) and unrelated AFR (n = 1,087). The counts are normalized by the size of the respective percentile slices. The intensity of the coloring reflects the number of normalized counts. Overall matched CDTS percentiles are particularly dense at both ends of the spectrum. b, The figure illustrates the R² obtained through linear regression when comparing the CDTS percentiles of all study populations presented in Supplementary Fig. 5 (all, n = 11,257; unrelated, n = 7,794; unrelated AFR, n = 1,087; unrelated EUR, n = 4,436; unrelated ADMIX, n = 1,763; 1000 Genomes, n = 2,504). The linear regression for each comparison was computed with the percentile-slice-size-normalized counts, as depicted in a. There is strong agreement in genome domains that have high constraint across ancestries. However, we observed occasional differences among ancestry groups that will merit attention to separate technical noise (sequencing, alignment, limited data for some populations) from biologically relevant differences. One possibility is that recent population growth may have resulted in changes in the patterns of deleterious genetic variation and genome structure with consequences for fitness and disease architecture. Unrelated are a subset of All. Unrelated EUR, AFR and ADMIX are a subset of unrelated. EUR, European; AFR, African; ADMIX, admixed population group.

Supplementary Figure 7 Comparison of conserved regions assessed with CDTS and GERP.

a, Element family composition in the 1st-percentile regions of CDTS (the bar labeled as “CTDS 1st”), GERP (“GERP 1st”) and the overlap region of CDTS and GERP (“Intersection”). Boxes in the bar correspond to different element families. “Others” refers to ENCODE element families that did not cover a substantial part of the genome individually (such as transcription factor binding sites; Methods). The coloring of the boxes is in the same order as in the legend. b, Absolute length of the 1st-percentile regions of CDTS, GERP and the overlap region of CDTS and GERP. Bins without GERP score, due to insufficient multiple-species alignments in the region, were not considered in the ranking process. This explains the total length difference between the 1st-percentile regions of CDTS and GERP. c, Element family composition in the first ten percentile regions of CDTS (the bar labeled as “CTDS 1–10th”), GERP (“GERP 1–10th”) and the overlap region (“Intersection”). d, Absolute length of the first ten percentile regions of CDTS, GERP and the overlap region of CDTS and GERP. CDS, coding sequence; ncRNA, noncoding RNA; CDTS, context-dependent tolerance score; GERP, Genomic Evolutionary Rate Profiling.

Supplementary Figure 8 CDTS distribution near coding regions.

a, Mean CDTS values are depicted for a 15-kb window up- and downstream of first exons (n = 39,948 for “All genes/isoforms”, shown in purple; n = 9,176 for “Essential genes/isoforms”, shown in red; Methods). The regional profile is distinct, indicating a general pattern of constraint around exons with more profound constraint around exons of essential genes. “Any annotation” indicates any sequence surrounding the first exon. b,c, The apparent symmetry for regions up- and downstream of the first exons shown in a disappears when only regions annotated as promoters and introns (upstream and downstream, respectively, of the exons) are considered, particularly in the immediate vicinity of the coding region (c). The asymmetric pattern supports the specific coordination between promoters and exons. d, The bar plots display the cumulative territory fraction covered by each element family upstream and downstream of the first exon (indicated on the x axis). As every protein-coding isoform was used to increase the power of the analysis, the annotation upstream/downstream of the first exon consists of a mixture of genomic elements. “Others” refers to ENCODE element families that did not cover a substantial part of the genome individually (such as transcription factor binding sites; Methods). The coloring of the boxes is in the same order as in the legend. e, Paired analysis of promoter:intron CDTS percentile. The upward signal indicates that the asymmetry in constraint surrounding the first exon is present in most genes/isoforms. f,g, GERP (f) and Eigen (g) mean percentile distributions in the vicinity of the first exon. The same set of exons was used as in a–e. Shaded regions represent 95% CIs. CDTS, context-dependent tolerance score; GERP, Genomic Evolutionary Rate Profiling.

Supplementary Figure 9 Properties of topologically associating domains.

a, The plot depicts the cumulative distribution function of the mean CDTS values (in ≤10-kb windows) inside and outside TADs. TAD and non-TAD regions were divided into 10-kb windows (the overhang windows were discarded if smaller than 1 kb). The most constrained TAD windows are those identified by Hi-C as present in five or more cell types. TAD in at least one cell type versus no TAD: Kolmogorov–Smirnov two-sided test, P = 2.2 × 10⁻¹⁶. The total number of windows per group was as follows: non-TAD (n = 19,999 covering 139 Mb), TAD ≥1 cell type (n = 331,471 covering 2.4 Gb), TAD ≥2 cell types (n = 134,486 covering 911 Mb), TAD ≥5 cell types (n = 4,558 covering 29 Mb). b, The plot depicts the cumulative distribution function of the mean CDTS values (in ≤10-kb windows) of anchor and loop regions within TADs. Anchor and loop regions were divided into 10-kb windows (the overhang windows were discarded if smaller than 1 kb). The anchor regions are consistently more constrained than the loops within the same TADs. Anchor in at least one cell type versus loop in at least one cell type: Kolmogorov–Smirnov two-sided test, P = 2.7 × 10⁻¹⁴. The total number of windows per group is as follows: anchor ≥1 cell type (n = 2,954 covering 17 Mb), anchor ≥2 cell types (n = 1,321 covering 7 Mb), anchor ≥5 cell types (n = 38 covering 0.2 Mb), loop ≥1 cell type (n = 271,356 covering 1.8 Gb), loop ≥2 cell types (n = 117,581 covering 753 Mb), loop ≥5 cell types (n = 4,020 covering 24 Mb). TAD, topologically associating domain.
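The windowing rule described in this caption (10-kb windows, with a trailing overhang discarded if smaller than 1 kb) can be sketched as follows. The sketch assumes half-open coordinates and treats an overhang of exactly 1 kb as retained. The function name and defaults are illustrative and are not the authors' code.

```python
def split_into_windows(start, end, window=10_000, min_overhang=1_000):
    """Split the region [start, end) into consecutive windows.

    Full-size windows are always kept; the final, shorter overhang window
    is kept only if it spans at least `min_overhang` bases.
    """
    windows = []
    pos = start
    while pos < end:
        stop = min(pos + window, end)
        if stop - pos >= min_overhang:
            windows.append((pos, stop))
        pos = stop
    return windows


kept = split_into_windows(0, 25_000)     # 5-kb overhang is kept
dropped = split_into_windows(0, 20_500)  # 500-bp overhang is discarded
```

Under this rule, a region shorter than 1 kb yields no windows at all, which is one plausible reading of why the captions refer to “≤10-kb windows”.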

Supplementary Figure 10 The distribution of pathogenic variants.

a, The distribution of pathogenic variants across the different percentile slices is normalized by the size of protein-coding and noncoding regions in the respective percentile slices. The relative enrichment is calculated with regard to the 100th percentile. The total number of pathogenic variants was as follows: n = 120,608 protein-coding variants (dark blue) and n = 15,741 noncoding variants (orange), including n = 1,369 variants that are located more than 10 bp from a splice-site position (red). b, The distribution of noncoding pathogenic variants is depicted for CDTS (pink) and GERP (green). As expected, GERP best captured the larger set of variants (n = 15,741), which mostly consisted of splice-site variants. c, Outside of the exon boundaries (>10 bp; n = 1,369), both methods are enriched for pathogenic noncoding variants at their lowest percentiles; however, the enrichment is more striking with the CDTS metric. GERP misclassifies variants at the least conserved regions. CDTS, context-dependent tolerance score; GERP, Genomic Evolutionary Rate Profiling; FC, fold change.
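The size-normalized relative enrichment described in panel a can be illustrated with a short sketch. It assumes per-slice variant counts and per-slice territory sizes (in base pairs) are already available as dictionaries. The names and the toy numbers are hypothetical and are not values from the study.

```python
def relative_enrichment(variant_counts, territory_bp, reference_slice=100):
    """Fold change of variant density per slice versus a reference slice.

    variant_counts: {slice: number of pathogenic variants in that slice}
    territory_bp:   {slice: number of bases covered by that slice}
    Density is count / territory; each density is divided by the density
    of `reference_slice` (the 100th percentile in the caption).
    """
    ref_count = variant_counts[reference_slice]
    ref_bp = territory_bp[reference_slice]
    if ref_count == 0:
        raise ValueError("reference slice has no variants; ratio undefined")
    return {
        s: (variant_counts[s] * ref_bp) / (territory_bp[s] * ref_count)
        for s in variant_counts
    }


toy_fc = relative_enrichment(
    {1: 30, 50: 6, 100: 3},           # hypothetical variant counts
    {1: 1_000, 50: 2_000, 100: 3_000},  # hypothetical territory sizes (bp)
)
```

In this toy example the first slice has a 30-fold higher variant density than the reference slice, and the reference slice has a fold change of 1 by construction.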

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–10 and Supplementary Table 4

Life Sciences Reporting Summary

Supplementary Table 1

Size-normalized distribution of histone and transcription factor binding sites

Supplementary Table 2

Noncoding pathogenic variants from ClinVar and HGMD

Supplementary Table 3

Description of noncoding variants associated with Mendelian traits

Supplementary Table 5

Distal interacting regions and associated genes identified by pcHi-C

Cite this article

di Iulio, J., Bartha, I., Wong, E.H.M. et al. The human noncoding genome defined by genetic diversity. Nat Genet 50, 333–337 (2018). https://doi.org/10.1038/s41588-018-0062-7
