An expanded sequence context model broadly explains variability in polymorphism levels across the human genome

Aggarwala, Varun; Voight, Benjamin F

doi:10.1038/ng.3511

Analysis
Published: 15 February 2016


Varun Aggarwala¹ & Benjamin F Voight²,³

Nature Genetics volume 48, pages 349–355 (2016)


Abstract

The rate of single-nucleotide polymorphism varies substantially across the human genome and fundamentally influences evolution and incidence of genetic disease. Previous studies have only considered the immediately flanking nucleotides around a polymorphic site—the site's trinucleotide sequence context—to study polymorphism levels across the genome. Moreover, the impact of larger sequence contexts has not been fully clarified, even though context substantially influences rates of polymorphism. Using a new statistical framework and data from the 1000 Genomes Project, we demonstrate that a heptanucleotide context explains >81% of variability in substitution probabilities, highlighting new mutation-promoting motifs at ApT dinucleotide, CAAT and TACG sequences. Our approach also identifies previously undocumented variability in C-to-T substitutions at CpG sites, which is not immediately explained by differential methylation intensity. Using our model, we present informative substitution intolerance scores for genes and a new intolerance score for amino acids, and we demonstrate clinical use of the model in neuropsychiatric diseases.


**Figure 1: C-to-T substitution probabilities and methylation patterns in 7-mer CpG sequence contexts.**

**Figure 2: Posterior probabilities of all classes of nucleotide substitution in the intergenic noncoding genome, estimated using the 7-mer context model.**

**Figure 3: Prioritizing pathogenic variants and causal genes using constraint scores.**

**Figure 4: Application of gene and amino acid intolerance scores to *de novo* autism spectrum disorder mutational data.**


References

Hodgkinson, A. & Eyre-Walker, A. Variation in the mutation rate across mammalian genomes. Nat. Rev. Genet. 12, 756–766 (2011).
Ehrlich, M. & Wang, R.Y. 5-methylcytosine in eukaryotic DNA. Science 212, 1350–1357 (1981).
Rideout, W.M. III, Coetzee, G.A., Olumi, A.F. & Jones, P.A. 5-methylcytosine as an endogenous mutagen in the human LDL receptor and p53 genes. Science 249, 1288–1290 (1990).
Arbiza, L. et al. Genome-wide inference of natural selection on human transcription factor binding sites. Nat. Genet. 45, 723–729 (2013).
Yang, Y. et al. Clinical whole-exome sequencing for the diagnosis of mendelian disorders. N. Engl. J. Med. 369, 1502–1511 (2013).
Hwang, D.G. & Green, P. Bayesian Markov chain Monte Carlo sequence analysis reveals varying neutral substitution patterns in mammalian evolution. Proc. Natl. Acad. Sci. USA 101, 13994–14001 (2004).
Blake, R.D., Hess, S.T. & Nicholson-Tuell, J. The influence of nearest neighbors on the rate and pattern of spontaneous point mutations. J. Mol. Evol. 34, 189–200 (1992).
Neale, B.M. et al. Patterns and rates of exonic de novo mutations in autism spectrum disorders. Nature 485, 242–245 (2012).
Michaelson, J.J. et al. Whole-genome sequencing in autism identifies hot spots for de novo germline mutation. Cell 151, 1431–1442 (2012).
Fromer, M. et al. De novo mutations in schizophrenia implicate synaptic networks. Nature 506, 179–184 (2014).
Lawrence, M.S. et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 499, 214–218 (2013).
Samocha, K.E. et al. A framework for the interpretation of de novo mutation in human disease. Nat. Genet. 46, 944–950 (2014).
1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012).
International HapMap Consortium. A haplotype map of the human genome. Nature 437, 1299–1320 (2005).
Campbell, M.C. & Tishkoff, S.A. African genetic diversity: implications for human demographic history, modern human origins, and complex disease mapping. Annu. Rev. Genomics Hum. Genet. 9, 403–433 (2008).
Schaffner, S.F. The X chromosome in population genetics. Nat. Rev. Genet. 5, 43–51 (2004).
Nachman, M.W. & Crowell, S.L. Estimate of the mutation rate per nucleotide in humans. Genetics 156, 297–304 (2000).
Mugal, C.F. & Ellegren, H. Substitution rate variation at human CpG sites correlates with non-CpG divergence, methylation level and GC content. Genome Biol. 12, R58 (2011).
Okae, H. et al. Genome-wide analysis of DNA methylation dynamics during early human development. PLoS Genet. 10, e1004868 (2014).
Hovestadt, V. et al. Decoding the regulatory landscape of medulloblastoma using DNA methylation sequencing. Nature 510, 537–541 (2014).
Walser, J.-C. & Furano, A.V. The mutational spectrum of non-CpG DNA varies with CpG content. Genome Res. 20, 875–882 (2010).
Kamiya, H. et al. Mutagenicity of 5-formylcytosine, an oxidation product of 5-methylcytosine, in DNA in mammalian cells. J. Biochem. 132, 551–555 (2002).
Deaton, A.M. & Bird, A. CpG islands and the regulation of transcription. Genes Dev. 25, 1010–1022 (2011).
Levinson, G. & Gutman, G.A. Slipped-strand mispairing: a major mechanism for DNA sequence evolution. Mol. Biol. Evol. 4, 203–221 (1987).
Panchin, A.Y., Mitrofanov, S.I., Alexeevski, A.V., Spirin, S.A. & Panchin, Y.V. New words in human mutagenesis. BMC Bioinformatics 12, 268 (2011).
Lanfear, R., Welch, J.J. & Bromham, L. Watching the clock: studying variation in rates of molecular evolution between species. Trends Ecol. Evol. 25, 495–503 (2010).
Kong, A. et al. Rate of de novo mutations and the importance of father's age to disease risk. Nature 488, 471–475 (2012).
Bustamante, C.D. et al. Natural selection on protein-coding genes in the human genome. Nature 437, 1153–1157 (2005).
Fu, W. et al. Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants. Nature 493, 216–220 (2013).
Cooper, G.M. & Shendure, J. Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data. Nat. Rev. Genet. 12, 628–640 (2011).
Stenson, P.D. et al. The Human Gene Mutation Database: building a comprehensive mutation repository for clinical and molecular genetics, diagnostic testing and personalized genomic medicine. Hum. Genet. 133, 1–9 (2014).
Petrovski, S., Wang, Q., Heinzen, E.L., Allen, A.S. & Goldstein, D.B. Genic intolerance to functional variation and the interpretation of personal genomes. PLoS Genet. 9, e1003709 (2013).
Georgi, B., Voight, B.F. & Bućan, M. From mouse to human: evolutionary genomics analysis of human orthologs of essential genes. PLoS Genet. 9, e1003484 (2013).
Uddin, M. et al. Brain-expressed exons under purifying selection are enriched for de novo mutations in autism spectrum disorder. Nat. Genet. 46, 742–747 (2014).
De Rubeis, S. et al. Synaptic, transcriptional and chromatin genes disrupted in autism. Nature 515, 209–215 (2014).
Epi4K Consortium & Epilepsy Phenome/Genome Project. De novo mutations in epileptic encephalopathies. Nature 501, 217–221 (2013).
Deciphering Developmental Disorders Study. Large-scale discovery of novel genetic causes of developmental disorders. Nature 519, 223–228 (2015).
Hamdan, F.F. et al. De novo mutations in moderate or severe intellectual disability. PLoS Genet. 10, e1004772 (2014).
Rauch, A. et al. Range of genetic mutations associated with severe non-syndromic sporadic intellectual disability: an exome sequencing study. Lancet 380, 1674–1682 (2012).
de Ligt, J. et al. Diagnostic exome sequencing in persons with severe intellectual disability. N. Engl. J. Med. 367, 1921–1929 (2012).
Ginsburg, D. & Bowie, E.J. Molecular genetics of von Willebrand disease. Blood 79, 2507–2519 (1992).
Iossifov, I. et al. The contribution of de novo coding mutations to autism spectrum disorder. Nature 515, 216–221 (2014).
Orosco, L.A. et al. Loss of Wdfy3 in mice alters cerebral cortical neurogenesis reflecting aspects of the autism pathology. Nat. Commun. 5, 4692 (2014).
Eyre-Walker, A. & Eyre-Walker, Y.C. How much of the variation in the mutation rate along the human genome can be explained? G3 (Bethesda) 4, 1667–1670 (2014).
Kimura, M. & Ohta, T. On some principles governing molecular evolution. Proc. Natl. Acad. Sci. USA 71, 2848–2852 (1974).
Ségurel, L., Wyman, M.J. & Przeworski, M. Determinants of mutation rate variation in the human germline. Annu. Rev. Genomics Hum. Genet. 15, 47–70 (2014).
Hussin, J.G. et al. Recombination affects accumulation of damaging and disease-associated mutations in human populations. Nat. Genet. 47, 400–404 (2015).
Koren, A. et al. Genetic variation in human DNA replication timing. Cell 159, 1015–1026 (2014).
Siepel, A. & Haussler, D. Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol. Biol. Evol. 21, 468–488 (2004).


Acknowledgements

We thank C. Brown, M. Bucan, P. Babb, K. Siewert, K. Johnson, S. Bumgarner and two anonymous reviewers for helpful comments on the manuscript. B.F.V. is grateful for support of the work from the Alfred P. Sloan Foundation (BR2012-087), the American Heart Association (13SDG14330006), the W.W. Smith Charitable Trust (H1201) and the US National Institutes of Health/National Institute of Diabetes and Digestive and Kidney Diseases (R01DK101478).

Author information

Authors and Affiliations

Genomics and Computational Biology Program, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
Varun Aggarwala
Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
Benjamin F Voight
Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
Benjamin F Voight

Authors

Varun Aggarwala
Benjamin F Voight

Contributions

V.A. and B.F.V. conceived and designed the experiments, developed the model, performed the statistical analysis, developed and contributed analysis tools, and wrote the manuscript. B.F.V. supervised the research.

Corresponding author

Correspondence to Benjamin F Voight.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Illustration of the intuition supporting our substitution probability model.

(a) Defining the non-Bayesian probability and Bayesian posterior probability of nucleotide substitution for a 7-mer context. Here we use the example CTACGAT, where position 4 is the polymorphic site and the three nucleotides located 5′ and 3′ constitute the remainder of that site’s local 7-mer sequence context. We count (i) the number of occurrences of that 7-mer context found in the reference genome and (ii) the number of times we observe a polymorphic substitution at position 4. The example shown here is a C-to-T substitution. To generate the posterior probabilities, we sum the observed counts of occurrences and substitutions with a count obtained from the modeled prior. We apply this mathematics to all 7-mer sequence contexts for all substitution classes and then merge the reverse-complementary pairs (the A-to-C class was merged with the T-to-G class, etc.). This results in a total of 24,576 parameters, each representing a unique 7-mer sequence context. (b) Illustration showing how the same 7-mer sequence context on different codon frames leads to different types of amino acid change. Depicted are three cases where a C-to-T substitution that occurs in the sequence context CTA[C/T]GAT at either position 1, 2 or 3 of a codon results in a synonymous, nonsynonymous or nonsense change in amino acid identity.
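The counting scheme in panel a can be sketched in Python. This is a minimal illustration and not the authors' implementation: the function names and toy counts are hypothetical, and the prior is represented simply as pseudo-counts added to the observed occurrences and substitutions, because the legend does not say how the prior counts are modeled.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(context: str, alt: str) -> tuple:
    """Fold a (7-mer, alternate allele) pair onto the strand whose central
    reference base is A or C, so reverse-complementary classes are merged
    (for example, T-to-G is folded into A-to-C)."""
    if context[3] in "AC":
        return context, alt
    return reverse_complement(context), alt.translate(COMPLEMENT)


def substitution_probability(n_sub, n_ctx, prior_sub=0.0, prior_ctx=0.0):
    """Substitutions at position 4 divided by occurrences of the 7-mer.
    With zero prior counts this is the plain (non-Bayesian) estimate;
    nonzero prior counts (an assumed form) give a posterior-style estimate."""
    return (n_sub + prior_sub) / (n_ctx + prior_ctx)


# Enumerate every 7-mer and every possible substitution at position 4,
# then count the distinct classes left after merging reverse complements.
n_parameters = len({
    canonical("".join(left) + ref + "".join(right), alt)
    for left in product("ACGT", repeat=3)
    for right in product("ACGT", repeat=3)
    for ref in "ACGT"
    for alt in "ACGT"
    if alt != ref
})
```

Here `n_parameters` evaluates to 24,576 (4⁶ flanking contexts × 6 merged substitution classes), matching the count given in the legend. As a usage example, `canonical("ATCGTAG", "A")` folds a G-to-A change onto the opposite strand as a C-to-T change in the context CTACGAT.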

Supplementary Figure 2 Scatter plot of nucleotide substitution probabilities for each 7-mer sequence context, inferred from 1000 Genomes and HapMap variants.

The substitution probabilities in both cases are strongly correlated with each other (R² = 0.91, P << 10⁻¹⁰⁰).

Supplementary Figure 3 Genome-wide nucleotide substitution probabilities are correlated across different human populations.

(a) The nucleotide substitution probabilities estimated from the 1-mer model for three human population groups (African, European and Asian) obtained from the 1000 Genomes Project. (b) The nucleotide substitution probabilities estimated from the 7-mer context in the same three populations. Because the x axis for this plot represents 24,576 sequence contexts, it was not practical to list them individually as was done in a. The contexts are represented graphically, sorted from lowest to highest nucleotide substitution probability, as observed in the African group. Data for the European and Asian groups were then represented according to the order obtained for the African group, to make comparison possible across the populations for any given sequence context.

Supplementary Figure 4 Comparison of observed and expected C-to-T substitution probabilities within a 7-mer CpG sequence context.

Supplementary Figure 5 C-to-T substitution probabilities and methylation patterns.

Probabilities of C-to-T substitutions are shown for the following sequence contexts: CpG Me⁻, CpG 7-mer contexts that were unmethylated in all sperm samples; CpG Me⁺, CpG 7-mer contexts that were methylated in all sperm samples. ***P << 10⁻¹⁰⁰.

Supplementary Figure 6 Correlation between average methylation intensity and probability of C-to-T substitution in the CpG 7-mer context.

(a) Scatterplot of average methylation intensity in brain samples against substitution probability at each 7-mer CpG context. (b) Scatterplot of average methylation intensity in oocyte samples against substitution probability at each 7-mer CpG context. (c) Scatterplot of average methylation intensity in blood samples against substitution probability at each 7-mer CpG context. (d) Scatterplot of average methylation intensity in blastocyst samples against substitution probability at each 7-mer CpG context. In all cases, the substitution probability is moderately correlated (R² ~0.3) with methylation intensity at each 7-mer CpG sequence context.

Supplementary Figure 7 Substitution probabilities at 7-mer CpG sequence contexts and the distance of the contexts from genes.

Box-and-whisker plot of the distances between sequence contexts that contain a CpG site (C at polymorphic position 4, fixed G at position 5) and the gene nearest to that context found in the human reference genome. LOW plots the distances from sequence contexts in the bottom 1% of substitution probabilities in the C-to-T substitution class (n = 10). ALL represents the distances from all sequence contexts containing a CpG (n = 1,024). HIGH represents the distances from sequence contexts in the top 1% of substitution probabilities in the C-to-T substitution class (n = 10). Each distribution is significantly different from the others (pairwise P << 10⁻¹⁰⁰ by Wilcoxon rank-sum test).

Supplementary Figure 8 Methylation intensity values in various sequence contexts containing a CpG site.

Box-and-whisker plot of methylation intensity values in various sequence contexts containing a CpG site. Methylation intensity represents the average intensity values across all sperm samples. Poly-CpG represents sequence contexts that segregate additional CpG dinucleotides beyond the CpG site at positions 4 and 5 (note that a 7-mer sequence context with a CpG site can segregate up to two additional CpG dinucleotides). Each distribution is significantly different from the others (pairwise P < 10⁻⁵ by Wilcoxon rank-sum test).
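The combinatorics behind the poly-CpG category can be checked with a short enumeration. The sketch below is illustrative only; the helper name `additional_cpgs` and the list names are hypothetical and do not come from the paper.

```python
from itertools import product


def additional_cpgs(context: str) -> int:
    """Count CpG dinucleotides in a 7-mer other than the focal CpG
    at positions 4 and 5 (string indices 3 and 4)."""
    if len(context) != 7 or context[3:5] != "CG":
        raise ValueError("expected a 7-mer with CpG at positions 4 and 5")
    return sum(context[i:i + 2] == "CG" for i in range(6) if i != 3)


# All 7-mers with C at position 4 and G at position 5:
# three free bases on the 5' side and two on the 3' side.
cpg_contexts = [
    "".join(left) + "CG" + "".join(right)
    for left in product("ACGT", repeat=3)
    for right in product("ACGT", repeat=2)
]

# Contexts carrying at least one extra CpG, and the largest number seen.
poly_cpg = [c for c in cpg_contexts if additional_cpgs(c) > 0]
max_extra = max(additional_cpgs(c) for c in cpg_contexts)
```

The enumeration yields 1,024 CpG contexts, the same number reported as ALL in Supplementary Figure 7, and `max_extra` is 2, in line with the note that a 7-mer CpG context can carry up to two additional CpG dinucleotides (for example ACGCGCG).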

Supplementary Figure 9 Nucleotide substitution probabilities and recombination rate.

Scatterplot of nucleotide substitution probabilities inferred from only 1000 Genomes regions with a high recombination rate (>3 cM/Mb in the YRI population) and separately from regions with a low recombination rate (<0.05 cM/Mb in the YRI population) for each change in a 7-mer sequence context. The substitution probabilities in both cases are strongly correlated with each other (R² = 0.97, P << 10⁻¹⁰⁰).

Supplementary Figure 10 Human substitution probabilities are strongly correlated with human-chimpanzee and human-macaque divergence rates.

(a) Scatterplot of nucleotide substitution probabilities against nucleotide divergence rates between human and chimpanzee at each 7-mer sequence context. (b) Scatterplot of nucleotide substitution probabilities against nucleotide divergence rates between human and macaque at each 7-mer sequence context. In both cases, the substitution probabilities and divergence rates are strongly correlated with each other (R² = 0.96, P << 10⁻¹⁰⁰).

Supplementary Figure 11 Substitution probabilities across the variant frequency spectrum.

Scatterplot of nucleotide substitution probabilities inferred from only 1000 Genomes low to high frequency variants (MAF ≥1%) and separately from rare variants (singletons and doubletons only) for each change in a 7-mer sequence context. The substitution probabilities in both cases are strongly correlated with each other (R² = 0.98, P << 10⁻¹⁰⁰).

Supplementary Figure 12 Nucleotide substitution probabilities in the coding genome.

Posterior probabilities of nucleotide substitution for each type of amino acid substitution in the coding genome, estimated using the 7-mer coding context model. Sequence contexts are further stratified by color to indicate the presence of a CpG (C at the polymorphic position 4 and G at position 5, for C-to-A, C-to-G and C-to-T substitution classes = CpG⁺; otherwise, CpG⁻) and where evidence of substitution was only observed in the intergenic region. The inset shows a magnified view specifically of the distribution for nonsense substitutions.

Supplementary Figure 13 Violin plot for trends in amino acid replacement types across different amino acids.

(a) Note that the mean probability is different for glycine and tyrosine substitutions, although the expected trend holds (synonymous > missense > nonsense). (b) Some amino acid substitutions deviate from this expected trend owing to the CpG context in the coding genome.

Supplementary Figure 14 The 7-mer context model improves power to detect pathogenic variants.

Log₁₀ ratios of substitution probabilities for the 3-mer model with codon context for coding sequences matched to noncoding sequences for each type of amino acid replacement. We consider all variants from the 1000 Genomes Project (African, yellow) or the Human Gene Mutation Database (HGMD; orange). Larger values indicate fewer substitutions in the coding genome than expected from matched noncoding sequences (intolerance), consistent with selective constraint acting on these replacements. **P < 10⁻⁵³; NS, not significant by Wilcoxon rank-sum test.
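To make the direction of the score concrete, the following sketch computes a log₁₀ ratio for one sequence context. It assumes the ratio is taken as the noncoding probability over the coding probability, which is the orientation under which larger values mean fewer coding substitutions than expected; the legend does not state the orientation explicitly. The function name and the example probabilities are hypothetical.

```python
import math


def log10_intolerance(p_noncoding: float, p_coding: float) -> float:
    """Log10 ratio of the matched noncoding substitution probability to the
    coding substitution probability (orientation is an assumption).
    Positive values mean fewer coding substitutions than expected."""
    if p_noncoding <= 0 or p_coding <= 0:
        raise ValueError("probabilities must be positive")
    return math.log10(p_noncoding / p_coding)


# Toy values: a context that is ten times less polymorphic in coding
# sequence than in matched noncoding sequence scores close to 1.
example_score = log10_intolerance(0.02, 0.002)
```

Under this convention a score of 0 indicates no depletion in coding sequence, and negative scores indicate more coding substitutions than the noncoding expectation.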

Supplementary Figure 15 The gene scores calculated from 1000 Genomes or EVS (European populations) data sets are correlated with each other.

Supplementary Figure 16 Comparison and correlation of various gene score measures.

(a,b) Comparison of our presented gene score (Aggarwala), built from the 1000 Genomes African group using the coding 7-mer model, with the scores presented by Petrovski et al. (a) and Samocha et al. (b). Note that in a, not all HGNC gene IDs could be mapped to Ensembl 75 genes, and in b, only a subset of gene scores was publicly available.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–16 and Supplementary Note. (PDF 2799 kb)

Supplementary Tables 1–17

Supplementary Tables 1–17. (XLSX 16896 kb)


About this article

Cite this article

Aggarwala, V., Voight, B. An expanded sequence context model broadly explains variability in polymorphism levels across the human genome. Nat Genet 48, 349–355 (2016). https://doi.org/10.1038/ng.3511


Received: 14 April 2015
Accepted: 22 January 2016
Published: 15 February 2016
Issue Date: April 2016
DOI: https://doi.org/10.1038/ng.3511
