Abstract
Understanding the significance of genetic variants in the noncoding genome is emerging as the next challenge in human genomics. We used the power of 11,257 whole-genome sequences and 16,384 heptamers (7-nt motifs) to build a map of sequence constraint for the human species. This build differed substantially from traditional maps of interspecies conservation and identified regulatory elements among the most constrained regions of the genome. Using new Hi-C experimental data, we describe a strong pattern of coordination over 2 Mb where the most constrained regulatory elements associate with the most essential genes. Constrained regions of the noncoding genome are up to 52-fold enriched for known pathogenic variants as compared to unconstrained regions (21-fold when compared to the genome average). This map of sequence constraint across thousands of individuals is an asset to help interpret noncoding elements in the human genome, prioritize variants and reconsider gene units at a larger scale.
Acknowledgements
We thank Human Longevity, Inc., for financial support.
Author information
Contributions
J.d.I., J.C.V. and A.T. conceived and designed the study; J.d.I., I.B., E.H.M.W., H.-C.Y., M.A.H., N.S. and E.F.K. performed the analyses; V.L. established the search capability; M.M.F. and W.H.B. performed sequencing; D.Y., I.J. and B.R. performed pcHi-C; and J.d.I., E.H.M.W., B.R. and A.T. wrote the manuscript.
Ethics declarations
Competing interests
J.d.I., E.H.M.W., H.-C.Y., V.L., M.A.H., N.S., E.F.K., W.H.B. and J.C.V. are employees of Human Longevity, Inc.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Integrated supplementary information
Supplementary Figure 1 Genetic ancestry of the study population.
a, Number of genomes sharing each ancestry. b, Principal-component analysis (PCA) of the study population. PCA was performed using PLINK (1.9) on 162,997 ancestry-informative markers. Genomes are colored based on their major ancestries. EUR, European; AFR, African; EAS, East Asian; CSA, Central South Asian; ARM, Native American; ADMIX, admixed population group.
Supplementary Figure 2 Heptamer metrics in the human genome.
a, Cumulative distribution function of the total number of occurrences of each heptamer in the genome. Each dot (n = 16,384) represents a heptamer. b, Cumulative distribution function of the autosomal count scores. The count score represents the fraction of the middle nucleotide in a heptameric sequence that varies. Every circle (n = 16,384) represents a heptameric sequence. The size of the circles is proportional to the number of occurrences of the heptamer in the genome (plotted in a). c, Cumulative distribution function of the autosomal frequency scores. The frequency score represents the fraction of SNVs at the middle nucleotide in a heptamer that vary with an allelic frequency >0.0001. Every circle (n = 16,384) represents a heptameric sequence. The size of the circles is proportional to the number of occurrences of the heptamer in the genome (plotted in a). d, Cumulative distribution function of the autosomal tolerance scores. The tolerance score represents the probability of the middle nucleotide in a heptamer varying with an AF >0.0001. Every circle (n = 16,384) represents a heptameric sequence. The size of the circles is proportional to the number of occurrences of the heptamer in the genome (plotted in a). e, Comparison of tolerance scores computed separately on autosomes and on chromosome X. Each dot (n = 16,384) represents a heptamer. The r² represents the fraction of the variation explained by a linear regression model. The dashed line represents x = y. AF, allelic frequency; SNV, single-nucleotide variant.
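The three heptamer metrics defined in this caption can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the toy reference string and the `variant_af` mapping (0-based position to allelic frequency) are hypothetical, and the scores are read literally from the caption as simple ratios per heptamer. Under that reading the tolerance score equals the product of the count score and the frequency score.

```python
from collections import defaultdict
from itertools import product

AF_THRESHOLD = 0.0001  # allelic-frequency cutoff quoted in the caption


def all_heptamers():
    """Enumerate every 7-nt motif over A, C, G, T (4**7 = 16,384)."""
    return ["".join(p) for p in product("ACGT", repeat=7)]


def heptamer_scores(sequence, variant_af):
    """Return {heptamer: (count_score, frequency_score, tolerance_score)}.

    sequence   -- toy reference string over A/C/G/T (assumption)
    variant_af -- hypothetical {0-based position: allelic frequency} of SNVs
    """
    occurrences = defaultdict(int)  # times each heptamer is seen
    varying = defaultdict(int)      # occurrences whose middle base has an SNV
    above = defaultdict(int)        # ... and that SNV has AF > threshold

    # Slide a 7-nt window; `mid` is the index of its middle nucleotide.
    for mid in range(3, len(sequence) - 3):
        hept = sequence[mid - 3:mid + 4]
        occurrences[hept] += 1
        if mid in variant_af:
            varying[hept] += 1
            if variant_af[mid] > AF_THRESHOLD:
                above[hept] += 1

    scores = {}
    for hept, n in occurrences.items():
        count_score = varying[hept] / n
        frequency_score = above[hept] / varying[hept] if varying[hept] else 0.0
        tolerance_score = above[hept] / n
        scores[hept] = (count_score, frequency_score, tolerance_score)
    return scores
```

For example, calling `heptamer_scores("AAAAAAAAAA", {3: 0.01, 4: 0.00005})` sees four occurrences of `AAAAAAA`, two of which vary at the middle position and one of which varies above the cutoff.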
Supplementary Figure 3 Distribution of genomic elements within the CDTS spectrum.
a, The bar plot displays the cumulative territory fraction covered by each element family at different percentiles (1 to 100). “Others” refers to ENCODE element families that did not cover a substantial part of the genome individually (such as transcription factor binding sites; Methods). The elements appear in the same order as in the legend. b, Size-normalized distribution of super-enhancer annotation. The relative enrichment of the fraction of enhancer bins overlapping with super-enhancer annotation is calculated with regard to the 100th percentile. Super-enhancers were subcategorized depending on the number of cell types in which they were annotated, represented by the lines of multiple shades of gray. c, The bar plot displays the distribution of the total number of nucleotides within the percentile slices for each element family. The boxes within a bar indicate the fraction of elements in each percentile slice (e.g., 23% of the promoters are within the 1st percentile). The element families are ordered on the x axis by the fraction of elements within the 1st-percentile slice. The coloring of the boxes is in the same order as in the legend. CDS, coding sequence; ncRNA, noncoding RNA; Prom., promoter; FC, fold change; CDTS, context-dependent tolerance score.
Supplementary Figure 4 Distribution of chromosomes within the CDTS spectrum.
a, The bar plot displays the cumulative territory fraction covered by autosomes and chromosome X throughout the CDTS spectrum for unrelated individuals in the study (n = 7,794). b, The bar plot displays the cumulative territory fraction covered by each chromosome throughout the CDTS spectrum for unrelated individuals in the study. c, The bar plot displays the cumulative territory fraction covered by autosomes and chromosome X throughout the CDTS spectrum for all individuals in the study (n = 11,257). d, The bar plot displays the cumulative territory fraction covered by each chromosome throughout the CDTS spectrum for all individuals. e, The bar plot displays the cumulative territory fraction covered by autosomes and chromosome X throughout the CDTS spectrum for unrelated individuals (merged from this study and the gnomAD Consortium; n = 23,290). f, The bar plot displays the cumulative territory fraction covered by each chromosome throughout the CDTS spectrum for unrelated individuals merged from this study and the gnomAD Consortium. The coloring of the boxes is in the same order as in the legend. The difference in chromosome X distribution for the smaller population reflects the lack of power to discriminate variation at the allelic frequency threshold used. The distributions of chromosome X in “all individuals” and in “unrelated individuals (merged from this study and the gnomAD Consortium)” are very similar, which indicates that the distribution stabilizes after reaching a sufficient number of chromosome X alleles. The autosome distribution is not subject to the same noise in the smaller study population, as both males and females provide two allele counts each. CDTS, context-dependent tolerance score; HLI, Human Longevity, Inc.; gnomAD, Genome Aggregation Database (http://gnomad.broadinstitute.org/).
Supplementary Figure 5 Robustness of the approach with different study populations.
The bar plots display the cumulative territory fraction covered by each element family in the different percentile slices (indicated on the x axis). The percentiles are based on the rank of CDTS values. The similarity in distributions indicates that the CDTS metric is robust to downsampling and to the choice of study population. “Others” refers to ENCODE element families that did not cover a substantial part of the genome individually (such as transcription factor binding sites; Methods). The elements appear in the same order as in the legend depicted below the bar plots. Every bar plot was obtained by computing CDTS with a different study population or subset of study populations. Unrelated are a subset of All. Unrelated EUR, AFR and ADMIX are a subset of unrelated. CDS, coding sequence; ncRNA, noncoding RNA; EUR, European; AFR, African; ADMIX, admixed population group.
Supplementary Figure 6 Comparison of CDTS between study populations.
a, The heat map compares the CDTS percentiles computed with two different study populations: unrelated EUR (n = 4,436) and unrelated AFR (n = 1,087). The counts are normalized by the size of the respective percentile slices. The intensity of the coloring reflects the number of normalized counts. Overall matched CDTS percentiles are particularly dense at both ends of the spectrum. b, The figure illustrates the R2 obtained through linear regression when comparing the CDTS percentiles of all study populations presented in Supplementary Fig. 5 (all, n = 11,257; unrelated, n = 7,794; unrelated AFR, n = 1,087; unrelated EUR, n = 4,436; unrelated ADMIX, n = 1,763; 1000 Genomes, n = 2,504). The linear regression for each comparison was computed with the percentile-slice-size-normalized counts, as depicted in a. There is strong agreement in genome domains that have high constraint across ancestries. However, we observed occasional differences among ancestry groups that will merit attention to separate technical noise (sequencing, alignment, limited data for some populations) from biologically relevant differences. One possibility is that recent population growth may have resulted in changes in the patterns of deleterious genetic variation and genome structure with consequences for fitness and disease architecture. Unrelated are a subset of All. Unrelated EUR, AFR and ADMIX are a subset of unrelated. EUR, European; AFR, African; ADMIX, admixed population group.
Supplementary Figure 7 Comparison of conserved regions assessed with CDTS and GERP.
a, Element family composition in the 1st-percentile regions of CDTS (the bar labeled as “CTDS 1st”), GERP (“GERP 1st”) and the overlap region of CDTS and GERP (“Intersection”). Boxes in the bar correspond to different element families. “Others” refers to ENCODE element families that did not cover a substantial part of the genome individually (such as transcription factor binding sites; Methods). The coloring of the boxes is in the same order as in the legend. b, Absolute length of the 1st-percentile regions of CDTS, GERP and the overlap region of CDTS and GERP. Bins without GERP score, due to insufficient multiple-species alignments in the region, were not considered in the ranking process. This explains the total length difference between the 1st-percentile regions of CDTS and GERP. c, Element family composition in the first ten percentile regions of CDTS (the bar labeled as “CTDS 1–10th”), GERP (“GERP 1–10th”) and the overlap region (“Intersection”). d, Absolute length of the first ten percentile regions of CDTS, GERP and the overlap region of CDTS and GERP. CDS, coding sequence; ncRNA, noncoding RNA; CDTS, context-dependent tolerance score; GERP, Genomic Evolutionary Rate Profiling.
Supplementary Figure 8 CDTS distribution near coding regions.
a, Mean CDTS values are depicted for a 15-kb window up- and downstream of first exons (n = 39,948 for “All genes/isoforms”, shown in purple; n = 9,176 for “Essential genes/isoforms”, shown in red; Methods). The regional profile is distinct, indicating a general pattern of constraint around exons with more profound constraint around exons of essential genes. “Any annotation” indicates any sequence surrounding the first exon. b,c, The apparent symmetry for regions up- and downstream of the first exons shown in a disappears when only regions annotated as promoters and introns (upstream and downstream, respectively, of the exons) are considered, in particular in the immediate vicinity of the coding region (c). The asymmetric pattern supports the specific coordination between promoters and exons. d, The bar plots display the cumulative territory fraction covered by each element family upstream and downstream of the first exon (indicated on the x axis). As every protein-coding isoform was used to increase the power of the analysis, the annotation upstream/downstream of the first exon consists of a mixture of genomic elements. “Others” refers to ENCODE element families that did not cover a substantial part of the genome individually (such as transcription factor binding sites; Methods). The coloring of the boxes is in the same order as in the legend. e, Paired analysis of promoter:intron CDTS percentile. The upward signal indicates that the asymmetry in constraint surrounding the first exon is present in most genes/isoforms. f,g, GERP (f) and Eigen (g) mean percentile distributions in the vicinity of the first exon. The same set of exons was used as in a–e. Shaded regions represent 95% CIs. CDTS, context-dependent tolerance score; GERP, Genomic Evolutionary Rate Profiling.
Supplementary Figure 9 Properties of topologically associating domains.
a, The plot depicts the cumulative distribution function of the mean CDTS values (in ≤10-kb windows) inside and outside TADs. TAD and non-TAD regions were divided into 10-kb windows (the overhang windows were discarded if smaller than 1 kb). The most constrained TAD windows are those identified by Hi-C as present in five or more cell types. TAD in at least one cell type versus no TAD: Kolmogorov–Smirnov two-sided test, P = 2.2 × 10⁻¹⁶. The total number of windows per group was as follows: non-TAD (n = 19,999 covering 139 Mb), TAD ≥1 cell type (n = 331,471 covering 2.4 Gb), TAD ≥2 cell types (n = 134,486 covering 911 Mb), TAD ≥5 cell types (n = 4,558 covering 29 Mb). b, The plot depicts the cumulative distribution function of the mean CDTS values (in ≤10-kb windows) of anchor and loop regions within TADs. Anchor and loop regions were divided into 10-kb windows (the overhang windows were discarded if smaller than 1 kb). The anchor regions are consistently more constrained than the loops within the same TADs. Anchor in at least one cell type versus loop in at least one cell type: Kolmogorov–Smirnov two-sided test, P = 2.7 × 10⁻¹⁴. The total number of windows per group is as follows: anchor ≥1 cell type (n = 2,954 covering 17 Mb), anchor ≥2 cell types (n = 1,321 covering 7 Mb), anchor ≥5 cell types (n = 38 covering 0.2 Mb), loop ≥1 cell type (n = 271,356 covering 1.8 Gb), loop ≥2 cell types (n = 117,581 covering 753 Mb), loop ≥5 cell types (n = 4,020 covering 24 Mb). TAD, topologically associating domain.
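The windowing rule in this caption (windows of at most 10 kb, with overhang windows shorter than 1 kb discarded) can be sketched as follows. This is an illustration, not the authors' code: the function name `tile_region` is hypothetical, coordinates are assumed to be half-open as in BED files, and a region shorter than 1 kb is treated as an overhang and dropped, which the caption does not state explicitly.

```python
def tile_region(start, end, window=10_000, min_overhang=1_000):
    """Split the half-open interval [start, end) into consecutive windows.

    Each window spans at most `window` bp. The final, shorter "overhang"
    window is kept only if it is at least `min_overhang` bp long.
    Assumes window >= min_overhang, so full-size windows are always kept.
    """
    windows = []
    pos = start
    while pos < end:
        stop = min(pos + window, end)
        if stop - pos >= min_overhang:
            windows.append((pos, stop))
        pos = stop
    return windows
```

A 25-kb region therefore yields two full windows plus one 5-kb overhang, whereas a 20.5-kb region yields only the two full windows because its 500-bp overhang is dropped.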
Supplementary Figure 10 The distribution of pathogenic variants.
a, The distribution of pathogenic variants across the different percentile slices is normalized by the size of protein-coding and noncoding regions in the respective percentile slices. The relative enrichment is calculated with regard to the 100th percentile. The total number of pathogenic variants was as follows: n = 120,608 protein-coding variants (dark blue) and n = 15,741 noncoding variants (orange), including n = 1,369 variants that are located more than 10 bp from a splice-site position (red). b, The distribution of noncoding pathogenic variants is depicted for CDTS (pink) and GERP (green). GERP, as expected, best captured the larger set of variants (n = 15,741) that mostly consisted of splice-site variants. c, Outside of the exon boundaries (>10 bp; n = 1,369), both methods are enriched for pathogenic noncoding variants at their lowest percentiles; however, the enrichment is more striking with the CDTS metric. GERP misclassifies variants at the least conserved regions. CDTS, context-dependent tolerance score; GERP, Genomic Evolutionary Rate Profiling; FC, fold change.
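The size normalization and relative enrichment described in panel a can be written as a short calculation: the variant density of each percentile slice (count divided by slice size) is divided by the density of the 100th-percentile slice. The sketch below assumes hypothetical inputs (`variant_counts` and `slice_sizes`, both keyed by percentile) and a hypothetical function name; the numbers used to exercise it are invented and are not taken from the study.

```python
def fold_enrichment(variant_counts, slice_sizes, reference_slice=100):
    """Size-normalized fold change of each percentile slice vs. a reference.

    variant_counts -- {percentile: number of pathogenic variants in slice}
    slice_sizes    -- {percentile: number of nucleotides in slice}
    """
    # Variants per nucleotide in every slice; missing counts mean zero.
    densities = {
        pct: variant_counts.get(pct, 0) / size
        for pct, size in slice_sizes.items()
    }
    reference = densities[reference_slice]
    if reference == 0:
        raise ValueError("reference slice contains no variants")
    # Express each density relative to the reference slice.
    return {pct: density / reference for pct, density in densities.items()}
```

With invented counts of 50 variants in a 1,000-nt first-percentile slice and 2 variants in a 2,000-nt 100th-percentile slice, the first slice would be reported as roughly 50-fold enriched and the reference slice as 1-fold by construction.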
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–10 and Supplementary Table 4
Supplementary Table 1
Size-normalized distribution of histone and transcription factor binding sites
Supplementary Table 2
Noncoding pathogenic variants from ClinVar and HGMD
Supplementary Table 3
Description of noncoding variants associated with Mendelian traits
Supplementary Table 5
Distal interacting regions and associated genes identified by pcHi-C
About this article
Cite this article
di Iulio, J., Bartha, I., Wong, E.H.M. et al. The human noncoding genome defined by genetic diversity. Nat Genet 50, 333–337 (2018). https://doi.org/10.1038/s41588-018-0062-7
DOI: https://doi.org/10.1038/s41588-018-0062-7