Abstract
The rate of single-nucleotide polymorphism varies substantially across the human genome and fundamentally influences evolution and the incidence of genetic disease. Previous studies have only considered the immediately flanking nucleotides around a polymorphic site (the site's trinucleotide sequence context) to study polymorphism levels across the genome. Moreover, the impact of larger sequence contexts has not been fully clarified, even though context substantially influences rates of polymorphism. Using a new statistical framework and data from the 1000 Genomes Project, we demonstrate that a heptanucleotide context explains >81% of variability in substitution probabilities, highlighting new mutation-promoting motifs at ApT dinucleotide, CAAT and TACG sequences. Our approach also identifies previously undocumented variability in C-to-T substitutions at CpG sites, which is not immediately explained by differential methylation intensity. Using our model, we present informative substitution intolerance scores for genes and a new intolerance score for amino acids, and we demonstrate clinical use of the model in neuropsychiatric diseases.
References
Hodgkinson, A. & Eyre-Walker, A. Variation in the mutation rate across mammalian genomes. Nat. Rev. Genet. 12, 756–766 (2011).
Ehrlich, M. & Wang, R.Y. 5-methylcytosine in eukaryotic DNA. Science 212, 1350–1357 (1981).
Rideout, W.M. III, Coetzee, G.A., Olumi, A.F. & Jones, P.A. 5-methylcytosine as an endogenous mutagen in the human LDL receptor and p53 genes. Science 249, 1288–1290 (1990).
Arbiza, L. et al. Genome-wide inference of natural selection on human transcription factor binding sites. Nat. Genet. 45, 723–729 (2013).
Yang, Y. et al. Clinical whole-exome sequencing for the diagnosis of mendelian disorders. N. Engl. J. Med. 369, 1502–1511 (2013).
Hwang, D.G. & Green, P. Bayesian Markov chain Monte Carlo sequence analysis reveals varying neutral substitution patterns in mammalian evolution. Proc. Natl. Acad. Sci. USA 101, 13994–14001 (2004).
Blake, R.D., Hess, S.T. & Nicholson-Tuell, J. The influence of nearest neighbors on the rate and pattern of spontaneous point mutations. J. Mol. Evol. 34, 189–200 (1992).
Neale, B.M. et al. Patterns and rates of exonic de novo mutations in autism spectrum disorders. Nature 485, 242–245 (2012).
Michaelson, J.J. et al. Whole-genome sequencing in autism identifies hot spots for de novo germline mutation. Cell 151, 1431–1442 (2012).
Fromer, M. et al. De novo mutations in schizophrenia implicate synaptic networks. Nature 506, 179–184 (2014).
Lawrence, M.S. et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 499, 214–218 (2013).
Samocha, K.E. et al. A framework for the interpretation of de novo mutation in human disease. Nat. Genet. 46, 944–950 (2014).
1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012).
International HapMap Consortium. A haplotype map of the human genome. Nature 437, 1299–1320 (2005).
Campbell, M.C. & Tishkoff, S.A. African genetic diversity: implications for human demographic history, modern human origins, and complex disease mapping. Annu. Rev. Genomics Hum. Genet. 9, 403–433 (2008).
Schaffner, S.F. The X chromosome in population genetics. Nat. Rev. Genet. 5, 43–51 (2004).
Nachman, M.W. & Crowell, S.L. Estimate of the mutation rate per nucleotide in humans. Genetics 156, 297–304 (2000).
Mugal, C.F. & Ellegren, H. Substitution rate variation at human CpG sites correlates with non-CpG divergence, methylation level and GC content. Genome Biol. 12, R58 (2011).
Okae, H. et al. Genome-wide analysis of DNA methylation dynamics during early human development. PLoS Genet. 10, e1004868 (2014).
Hovestadt, V. et al. Decoding the regulatory landscape of medulloblastoma using DNA methylation sequencing. Nature 510, 537–541 (2014).
Walser, J.-C. & Furano, A.V. The mutational spectrum of non-CpG DNA varies with CpG content. Genome Res. 20, 875–882 (2010).
Kamiya, H. et al. Mutagenicity of 5-formylcytosine, an oxidation product of 5-methylcytosine, in DNA in mammalian cells. J. Biochem. 132, 551–555 (2002).
Deaton, A.M. & Bird, A. CpG islands and the regulation of transcription. Genes Dev. 25, 1010–1022 (2011).
Levinson, G. & Gutman, G.A. Slipped-strand mispairing: a major mechanism for DNA sequence evolution. Mol. Biol. Evol. 4, 203–221 (1987).
Panchin, A.Y., Mitrofanov, S.I., Alexeevski, A.V., Spirin, S.A. & Panchin, Y.V. New words in human mutagenesis. BMC Bioinformatics 12, 268 (2011).
Lanfear, R., Welch, J.J. & Bromham, L. Watching the clock: studying variation in rates of molecular evolution between species. Trends Ecol. Evol. 25, 495–503 (2010).
Kong, A. et al. Rate of de novo mutations and the importance of father's age to disease risk. Nature 488, 471–475 (2012).
Bustamante, C.D. et al. Natural selection on protein-coding genes in the human genome. Nature 437, 1153–1157 (2005).
Fu, W. et al. Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants. Nature 493, 216–220 (2013).
Cooper, G.M. & Shendure, J. Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data. Nat. Rev. Genet. 12, 628–640 (2011).
Stenson, P.D. et al. The Human Gene Mutation Database: building a comprehensive mutation repository for clinical and molecular genetics, diagnostic testing and personalized genomic medicine. Hum. Genet. 133, 1–9 (2014).
Petrovski, S., Wang, Q., Heinzen, E.L., Allen, A.S. & Goldstein, D.B. Genic intolerance to functional variation and the interpretation of personal genomes. PLoS Genet. 9, e1003709 (2013).
Georgi, B., Voight, B.F. & Bućan, M. From mouse to human: evolutionary genomics analysis of human orthologs of essential genes. PLoS Genet. 9, e1003484 (2013).
Uddin, M. et al. Brain-expressed exons under purifying selection are enriched for de novo mutations in autism spectrum disorder. Nat. Genet. 46, 742–747 (2014).
De Rubeis, S. et al. Synaptic, transcriptional and chromatin genes disrupted in autism. Nature 515, 209–215 (2014).
Epi4K Consortium & Epilepsy Phenome/Genome Project. De novo mutations in epileptic encephalopathies. Nature 501, 217–221 (2013).
Deciphering Developmental Disorders Study. Large-scale discovery of novel genetic causes of developmental disorders. Nature 519, 223–228 (2015).
Hamdan, F.F. et al. De novo mutations in moderate or severe intellectual disability. PLoS Genet. 10, e1004772 (2014).
Rauch, A. et al. Range of genetic mutations associated with severe non-syndromic sporadic intellectual disability: an exome sequencing study. Lancet 380, 1674–1682 (2012).
de Ligt, J. et al. Diagnostic exome sequencing in persons with severe intellectual disability. N. Engl. J. Med. 367, 1921–1929 (2012).
Ginsburg, D. & Bowie, E.J. Molecular genetics of von Willebrand disease. Blood 79, 2507–2519 (1992).
Iossifov, I. et al. The contribution of de novo coding mutations to autism spectrum disorder. Nature 515, 216–221 (2014).
Orosco, L.A. et al. Loss of Wdfy3 in mice alters cerebral cortical neurogenesis reflecting aspects of the autism pathology. Nat. Commun. 5, 4692 (2014).
Eyre-Walker, A. & Eyre-Walker, Y.C. How much of the variation in the mutation rate along the human genome can be explained? G3 (Bethesda) 4, 1667–1670 (2014).
Kimura, M. & Ohta, T. On some principles governing molecular evolution. Proc. Natl. Acad. Sci. USA 71, 2848–2852 (1974).
Ségurel, L., Wyman, M.J. & Przeworski, M. Determinants of mutation rate variation in the human germline. Annu. Rev. Genomics Hum. Genet. 15, 47–70 (2014).
Hussin, J.G. et al. Recombination affects accumulation of damaging and disease-associated mutations in human populations. Nat. Genet. 47, 400–404 (2015).
Koren, A. et al. Genetic variation in human DNA replication timing. Cell 159, 1015–1026 (2014).
Siepel, A. & Haussler, D. Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol. Biol. Evol. 21, 468–488 (2004).
Acknowledgements
We thank C. Brown, M. Bucan, P. Babb, K. Siewert, K. Johnson, S. Bumgarner and two anonymous reviewers for helpful comments on the manuscript. B.F.V. is grateful for support of the work from the Alfred P. Sloan Foundation (BR2012-087), the American Heart Association (13SDG14330006), the W.W. Smith Charitable Trust (H1201) and the US National Institutes of Health/National Institute of Diabetes and Digestive and Kidney Diseases (R01DK101478).
Author information
Contributions
V.A. and B.F.V. conceived and designed the experiments, developed the model, performed the statistical analysis, developed and contributed analysis tools, and wrote the manuscript. B.F.V. supervised the research.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Integrated supplementary information
Supplementary Figure 1 Illustration of the intuition supporting our substitution probability model.
(a) Defining the non-Bayesian probability and Bayesian posterior probability of nucleotide substitution for a 7-mer context. Here we use the example CTACGAT, where position 4 is the polymorphic site and the three nucleotides located 5′ and 3′ constitute the remainder of that site’s local 7-mer sequence context. We count (i) the number of occurrences of that 7-mer context found in the reference genome and (ii) the number of times we observe a polymorphic substitution at position 4. The example shown here is a C-to-T substitution. To generate the posterior probabilities, we sum the observed counts of occurrences and substitutions with a count obtained from the modeled prior. We apply this mathematics to all 7-mer sequence contexts for all substitution classes and then merge the reverse-complementary pairs (the A-to-C class was merged with the T-to-G class, etc.). This results in a total of 24,576 parameters, each representing a unique 7-mer sequence context. (b) Illustration showing how the same 7-mer sequence context on different codon frames leads to different types of amino acid change. Depicted are three cases where a C-to-T substitution that occurs in the sequence context CTA[C/T]GAT at either position 1, 2 or 3 of a codon results in a synonymous, nonsynonymous or nonsense change in amino acid identity.
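The counting scheme in panel a can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`revcomp`, `canonical`, `enumerate_parameters`, `posterior_probability`) are hypothetical, the convention of folding every context onto the strand whose central base is A or C is one possible way to merge reverse-complementary pairs, and the prior is represented simply as pseudo-counts added to the observed counts, as the caption describes in words.

```python
from itertools import product

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(context, alt):
    """Fold a (7-mer, alternate allele) pair onto one strand.

    Assumed convention: keep the strand whose central base (position 4,
    index 3) is A or C; otherwise use the reverse complement, so that,
    for example, the T-to-G class is merged with the A-to-C class.
    """
    if context[3] in "AC":
        return context, alt
    return revcomp(context), alt.translate(COMPLEMENT)


def enumerate_parameters():
    """List every distinct (context, alternate allele) parameter."""
    params = set()
    for kmer in map("".join, product("ACGT", repeat=7)):
        for alt in "ACGT":
            if alt != kmer[3]:
                params.add(canonical(kmer, alt))
    return params


def posterior_probability(n_sub, n_ctx, prior_sub, prior_ctx):
    """Add prior pseudo-counts to the observed substitution and context counts.

    prior_sub and prior_ctx are hypothetical stand-ins for the counts
    obtained from the modeled prior.
    """
    return (n_sub + prior_sub) / (n_ctx + prior_ctx)
```

With this folding, `len(enumerate_parameters())` gives 24,576 (4^6 flanking combinations × 6 merged substitution classes), and `canonical("ATCGTAG", "A")` maps a G-to-A change on the opposite strand onto the caption's example, a C-to-T change in CTACGAT.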
Supplementary Figure 2 Scatter plot of nucleotide substitution probabilities for each 7-mer sequence context, inferred from 1000 Genomes and HapMap variants.
The substitution probabilities in both cases are strongly correlated with each other (R² = 0.91, P << 10⁻¹⁰⁰).
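As a rough illustration of how such a concordance statistic could be computed, the sketch below calculates a squared Pearson correlation between two equally ordered lists of per-context probabilities. That R² here denotes the squared Pearson coefficient is an assumption, and the function name `r_squared` is hypothetical.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return (sxy * sxy) / (sxx * syy)
```

In practice, the two inputs would be the probabilities for the same 7-mer contexts, listed in the same order, estimated from each data set.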
Supplementary Figure 3 Genome-wide nucleotide substitution probabilities are correlated across different human populations.
(a) The nucleotide substitution probabilities estimated from the 1-mer model for three human population groups (African, European and Asian) obtained from the 1000 Genomes Project. (b) The nucleotide substitution probabilities estimated from the 7-mer context in the same three populations. Because the x axis for this plot represents 24,576 sequence contexts, it was not practical to list them individually as was done in a. The contexts are represented graphically, sorted from lowest to highest nucleotide substitution probability, as observed in the African group. Data for the European and Asian groups were then represented according to the order obtained for the African group, to make comparison possible across the populations for any given sequence context.
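The ordering used for panel b (sort contexts by the African-group probability, then reuse that order for the other groups) might be sketched as follows. The function name `order_by_reference` and the dictionary layout (context label mapped to probability) are assumptions made for illustration only.

```python
def order_by_reference(reference, others):
    """Sort contexts by the reference group's probability, lowest first,
    and return the other groups' values in that same order.

    reference: dict mapping context label -> probability
    others: dict mapping group name -> dict of context label -> probability
    """
    order = sorted(reference, key=reference.get)
    aligned = {
        name: [probs[context] for context in order]
        for name, probs in others.items()
    }
    return order, aligned
```

Because every group is indexed by the same ordered list of contexts, values at a given x-axis position refer to the same sequence context across populations.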
Supplementary Figure 5 C-to-T substitution probabilities and methylation patterns.
Probabilities of C-to-T substitutions are shown for the following sequence contexts: CpG Me−, CpG 7-mer contexts that were unmethylated in all sperm samples; CpG Me+, CpG 7-mer contexts that were methylated in all sperm samples. ***P << 10⁻¹⁰⁰.
Supplementary Figure 6 Correlation between average methylation intensity and probability of C-to-T substitution in the CpG 7-mer context.
(a) Scatterplot of average methylation intensity in brain samples against substitution probability at each 7-mer CpG context. (b) Scatterplot of average methylation intensity in oocyte samples against substitution probability at each 7-mer CpG context. (c) Scatterplot of average methylation intensity in blood samples against substitution probability at each 7-mer CpG context. (d) Scatterplot of average methylation intensity in blastocyst samples against substitution probability at each 7-mer CpG context. In all cases, the substitution probability is moderately correlated (R² ~0.3) with methylation intensity at each 7-mer CpG sequence context.
Supplementary Figure 7 Substitution probabilities at 7-mer CpG sequence contexts and the distance of the contexts from genes.
Box-and-whisker plot of the distances between sequence contexts that contain a CpG site (C at polymorphic position 4, fixed G at position 5) and the gene nearest to that context found in the human reference genome. LOW plots the distances from sequence contexts with the smallest 1% of substitution probabilities in the C-to-T substitution class (n = 10). ALL represents the distances from all sequence contexts containing a CpG (n = 1,024). HIGH represents the distances from sequence contexts with the largest 1% of substitution probabilities in the C-to-T substitution class (n = 10). Each distribution is significantly different from the others (pairwise P << 10⁻¹⁰⁰ by Wilcoxon rank-sum test).
Supplementary Figure 8 Methylation intensity values in various sequence contexts containing a CpG site.
Box-and-whisker plot of methylation intensity values in various sequence contexts containing a CpG site. Methylation intensity represents the average intensity values across all sperm samples. Poly-CpG represents sequence contexts that segregate additional CpG dinucleotides beyond the CpG site at positions 4 and 5 (note that a 7-mer sequence context with a CpG site can segregate up to two additional CpG dinucleotides). Each distribution is significantly different from the others (pairwise P < 10⁻⁵ by Wilcoxon rank-sum test).
Supplementary Figure 9 Nucleotide substitution probabilities and recombination rate.
Scatterplot of nucleotide substitution probabilities inferred from only 1000 Genomes regions with a high recombination rate (>3 cM/Mb in the YRI population) and separately from regions with a low recombination rate (<0.05 cM/Mb in the YRI population) for each change in a 7-mer sequence context. The substitution probabilities in both cases are strongly correlated with each other (R² = 0.97, P << 10⁻¹⁰⁰).
Supplementary Figure 10 Human substitution probabilities are strongly correlated with human-chimpanzee and human-macaque divergence rates.
(a) Scatterplot of nucleotide substitution probabilities against nucleotide divergence rates between human and chimpanzee at each 7-mer sequence context. (b) Scatterplot of nucleotide substitution probabilities against nucleotide divergence rates between human and macaque at each 7-mer sequence context. In both cases, the substitution probabilities and divergence rates are strongly correlated with each other (R² = 0.96, P << 10⁻¹⁰⁰).
Supplementary Figure 11 Substitution probabilities across the variant frequency spectrum.
Scatterplot of nucleotide substitution probabilities inferred from only 1000 Genomes low to high frequency variants (MAF ≥1%) and separately from rare variants (singletons and doubletons only) for each change in a 7-mer sequence context. The substitution probabilities in both cases are strongly correlated with each other (R² = 0.98, P << 10⁻¹⁰⁰).
Supplementary Figure 12 Nucleotide substitution probabilities in the coding genome.
Posterior probabilities of nucleotide substitution for each type of amino acid substitution in the coding genome, estimated using the 7-mer coding context model. Sequence contexts are further stratified by color to indicate the presence of a CpG (C at the polymorphic position 4 and G at position 5, for C-to-A, C-to-G and C-to-T substitution classes = CpG+; otherwise, CpG−) and where evidence of substitution was only observed in the intergenic region. The inset shows a magnified view specifically of the distribution for nonsense substitutions.
Supplementary Figure 13 Violin plot for trends in amino acid replacement types across different amino acids.
(a) Note that the mean probability is different for glycine and tyrosine substitutions, although the expected trend holds (synonymous > missense > nonsense). (b) Some amino acid substitutions deviate from this expected trend owing to the CpG context in the coding genome.
Supplementary Figure 14 The 7-mer context model improves power to detect pathogenic variants.
Log10 ratios of substitution probabilities for the 3-mer model with codon context for coding sequences matched to noncoding sequences for each type of amino acid replacement. We consider all variants from the 1000 Genomes Project (African, yellow) or the Human Gene Mutation Database (HGMD; orange). Larger values indicate fewer substitutions in the coding genome than expected from matched noncoding sequences (intolerance), consistent with selective constraint acting on these replacements. **P < 10⁻⁵³; NS, not significant by Wilcoxon rank-sum test.
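The log ratio described in this caption could be computed along the following lines. This is only a sketch: the direction of the ratio (noncoding over coding, so that larger values mean fewer coding substitutions than expected) is inferred from the caption's wording, and the function name `log10_intolerance` is hypothetical.

```python
import math


def log10_intolerance(p_noncoding, p_coding):
    """Log10 ratio of the matched noncoding substitution probability
    to the coding substitution probability.

    Positive values mean fewer substitutions in coding sequence than
    expected from matched noncoding sequence.
    """
    if p_noncoding <= 0 or p_coding <= 0:
        raise ValueError("probabilities must be positive")
    return math.log10(p_noncoding / p_coding)
```

Under this convention, a replacement type whose coding probability is one-tenth of its matched noncoding probability would score about 1, and a replacement type with no depletion would score 0.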
Supplementary Figure 16 Comparison and correlation of various gene score measures.
(a,b) Comparison of our presented gene score (Aggarwala) built from the 1000 Genomes African group using the coding 7-mer model with the scores presented by Petrovski et al. (a) and Samocha et al. (b). Note that in a, not all HGNC gene IDs could be mapped to Ensembl 75 genes, and in b only a subset of gene scores was publicly available.
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–16 and Supplementary Note. (PDF 2799 kb)
Supplementary Tables 1–17
Supplementary Tables 1–17. (XLSX 16896 kb)
About this article
Cite this article
Aggarwala, V., Voight, B. An expanded sequence context model broadly explains variability in polymorphism levels across the human genome. Nat Genet 48, 349–355 (2016). https://doi.org/10.1038/ng.3511
DOI: https://doi.org/10.1038/ng.3511