  • Brief Communication

Identifying the favored mutation in a positive selective sweep

Abstract

Most approaches that capture signatures of selective sweeps in population genomics data do not identify the specific mutation favored by selection. We present iSAFE (for “integrated selection of allele favored by evolution”), a method that enables researchers to accurately pinpoint the favored mutation in a large region (5 Mbp) by using a statistic derived solely from population genetics signals. iSAFE does not require knowledge of demography, the phenotype under selection, or functional annotations of mutations.

Figure 1: Characterization of the SAFE method.
Figure 2: Illustration of the iSAFE method.
Figure 3: iSAFE performance.

References

  1. Vitti, J.J., Grossman, S.R. & Sabeti, P.C. Annu. Rev. Genet. 47, 97–120 (2013).
  2. Fan, S., Hansen, M.E.B., Lo, Y. & Tishkoff, S.A. Science 354, 54–59 (2016).
  3. Schrider, D.R., Mendes, F.K., Hahn, M.W. & Kern, A.D. Genetics 200, 267–284 (2015).
  4. Field, Y. et al. Science 354, 760–764 (2016).
  5. Azad, P. et al. J. Mol. Med. (Berl.) 95, 1269–1282 (2017).
  6. Stobdan, T. et al. Mol. Biol. Evol. 34, 3154–3168 (2017).
  7. Grossman, S.R. et al. Science 327, 883–886 (2010).
  8. Ronen, R. et al. PLoS Genet. 11, e1005527 (2015).
  9. Wang, M. et al. Mol. Biol. Evol. 31, 3068–3080 (2014).
  10. Voight, B.F., Kudaravalli, S., Wen, X. & Pritchard, J.K. PLoS Biol. 4, e72 (2006).
  11. Sabeti, P.C. et al. Science 312, 1614–1620 (2006).
  12. Ohashi, J., Naka, I. & Tsuchiya, N. Mol. Biol. Evol. 28, 849–857 (2011).
  13. Tishkoff, S.A. et al. Science 293, 455–462 (2001).
  14. Heffelfinger, C. et al. Eur. J. Hum. Genet. 22, 551–557 (2014).
  15. Wilde, S. et al. Proc. Natl. Acad. Sci. USA 111, 4832–4837 (2014).
  16. Coop, G. et al. PLoS Genet. 5, e1000500 (2009).
  17. Campbell, C.D. et al. Nat. Genet. 44, 1277–1281 (2012).
  18. Galinsky, K.J., Loh, P.-R., Mallick, S., Patterson, N.J. & Price, A.L. Am. J. Hum. Genet. 99, 1130–1139 (2016).
  19. Beleza, S. et al. PLoS Genet. 9, e1003372 (2013).
  20. Cornelis, M.C. et al. Mol. Psychiatry 20, 647–656 (2015).
  21. Ferrer-Admetlla, A., Liang, M., Korneliussen, T. & Nielsen, R. Mol. Biol. Evol. 31, 1275–1291 (2014).
  22. Pybus, M. et al. Bioinformatics 31, 3946–3952 (2015).
  23. Garud, N.R., Messer, P.W., Buzbas, E.O. & Petrov, D.A. PLoS Genet. 11, e1005004 (2015).
  24. DeGiorgio, M., Huber, C.D., Hubisz, M.J., Hellmann, I. & Nielsen, R. Bioinformatics 32, 1895–1897 (2016).
  25. Ronen, R., Udpa, N., Halperin, E. & Bafna, V. Genetics 195, 181–193 (2013).
  26. Pavlidis, P., Živkovic, D., Stamatakis, A. & Alachiotis, N. Mol. Biol. Evol. 30, 2224–2234 (2013).
  27. Chen, H., Patterson, N. & Reich, D. Genome Res. 20, 393–402 (2010).
  28. Sabeti, P.C. et al. Nature 449, 913–918 (2007).
  29. Sabeti, P.C. et al. Nature 419, 832–837 (2002).
  30. Nielsen, R. et al. Genome Res. 15, 1566–1575 (2005).
  31. Kim, Y. & Nielsen, R. Genetics 167, 1513–1524 (2004).
  32. Shriver, M.D. et al. Hum. Genomics 1, 274–286 (2004).
  33. Ewing, G. & Hermisson, J. Bioinformatics 26, 2064–2065 (2010).
  34. Nachman, M.W. & Crowell, S.L. Genetics 156, 297–304 (2000).
  35. Jensen-Seaman, M.I. et al. Genome Res. 14, 528–538 (2004).
  36. Gravel, S. et al. Proc. Natl. Acad. Sci. USA 108, 11983–11988 (2011).
  37. Szpiech, Z.A. & Hernandez, R.D. Mol. Biol. Evol. 31, 2824–2827 (2014).
  38. 1000 Genomes Project Consortium. Nature 526, 68–74 (2015).
  39. Zerbino, D.R. et al. Nucleic Acids Res. 46, D754–D761 (2018).
  40. International HapMap Consortium. Nature 449, 851–861 (2007).

Acknowledgements

This research was supported in part by the NSF (grant DBI-1458557 to A.A. and V.B.) and the NIH (grant R01GM114362 to V.B.).

Author information

Contributions

A.A., S.M., and V.B. conceived and designed the experiments and wrote the manuscript with input from all other authors; A.A., J.J.V., and A.I. performed the experiments; A.A. analyzed the data; A.A. and M.B. developed software tools; and P.C.S., S.M., and V.B. provided guidance throughout the study.

Corresponding authors

Correspondence to Ali Akbari or Vineet Bafna.

Ethics declarations

Competing interests

V.B. is a cofounder, has an equity interest in, and receives income from Digital Proteomics, LLC. The terms of this arrangement have been reviewed and approved by the University of California, San Diego, in accordance with its conflict-of-interest policies. Digital Proteomics was not involved in the research presented here.

Integrated supplementary information

Supplementary Figure 1 Empirical SAFE distribution.

(a) φ and κ as estimators of f. Empirical analysis of 10,000 neutrally evolving populations (about 3 million SNPs) with the default parameter set shows that φ and κ are (biased) estimators of the allele frequency f (f = i/n for all integers i ∈ [1, n−1]). (b) The top panel is the probability density function (PDF) of the SAFE score for 10,000 neutrally evolving populations (about 3 million SNPs with minor allele frequency > 0.05) with the default parameter set. The bottom panel plots the quantiles of the SAFE score against the quantiles of the normal distribution. The coefficient of determination (R² = 0.9997) for the QQ-plot shows that a Gaussian distribution is a good approximation to the SAFE score distribution.
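The QQ-plot check in panel (b) can be illustrated with a minimal sketch. It assumes that R² is the squared Pearson correlation between the sorted scores and the matching standard-normal quantiles; the legend does not state how R² was computed, so this is one common convention and not necessarily the authors' procedure. The function name `qq_r_squared`, the plotting positions, and the toy samples are all hypothetical.

```python
import random
from statistics import NormalDist, mean


def qq_r_squared(scores):
    """Squared correlation between sorted scores and standard-normal quantiles.

    ASSUMPTION: R^2 of a QQ-plot is taken as the squared Pearson correlation
    of its points; the source does not specify the exact computation.
    """
    n = len(scores)
    xs = sorted(scores)
    std_normal = NormalDist()
    # Plotting positions (i + 0.5) / n avoid the infinite quantiles at 0 and 1.
    qs = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    mx, mq = mean(xs), mean(qs)
    cov = sum((x - mx) * (q - mq) for x, q in zip(xs, qs))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_q = sum((q - mq) ** 2 for q in qs)
    return cov * cov / (var_x * var_q)


# Toy data only: a Gaussian sample and a skewed (exponential) sample.
rng = random.Random(0)
gaussian_like = [rng.gauss(0.0, 1.0) for _ in range(5000)]
skewed = [rng.expovariate(1.0) for _ in range(5000)]

r2_gauss = qq_r_squared(gaussian_like)
r2_skewed = qq_r_squared(skewed)
print(f"R^2 (Gaussian sample): {r2_gauss:.4f}")
print(f"R^2 (skewed sample):   {r2_skewed:.4f}")
```

A value close to 1 indicates that the sample quantiles fall nearly on a straight line against the normal quantiles, which is the sense in which the legend calls the Gaussian a good approximation.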

Supplementary Figure 2 SAFE evaluation.

Performance of the SAFE score evaluated in different scenarios with 1000 simulations per bin. In each panel, we change one parameter while the other parameters keep their default values (see Online Methods). The population size is fixed at N = 20,000. The dashed (dotted) line represents the median (quartiles). In the bottom-right panel, white represents the result for a fixed-size population model with default parameters, and gray represents a model of human demography for the EUR population (see Online Methods; Supplementary Figure 14). The onset times of selection were in the post-bottleneck (23 kya to present) epochs.

Supplementary Figure 3 iSAFE evaluation.

(a,b,c) Performance of iSAFE measured by the rank of the favored variant and the distance of the favored variant from the peak in 1000 simulations per bin. The dashed (dotted) line represents the median (quartiles). (d) Performance of iSAFE compared to iHS and SCCT, measured by the rank of the favored variant in 5000 simulations on a 5 Mbp region around ongoing hard sweeps (ν0 = 1/N; 0.1 < ν < 0.9) with a fixed population size (N = 20,000) and default values for the other simulation parameters. In the left panel, for any rank r on the x-axis, the y-value represents the proportion of samples in which the favored allele had rank ≤ r. In the right panel, solid (dashed) lines represent the mean (respectively, median) value of the favored allele rank. (e) iSAFE performance upon addition of outgroup samples. No deterioration is seen for low frequencies of the favored variant, but iSAFE performance improves dramatically when the favored mutation is near fixation or fixed. The dashed (dotted) line represents the median (quartiles). This comparison is based on 1000 simulations of 5 Mbp genomic regions simulated using a model of human demography (Supplementary Fig. 14). The time of onset of selection was chosen at random (using the distribution in Supplementary Fig. 14) after the out-of-Africa event, in the lineage of the EUR population (the target population). When the onset of selection is before the split of EUR and EAS (> 23 kya), both EUR and EAS are under selection.
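The rank-based summaries used in panel (d) can be sketched as follows. This is a minimal illustration on made-up scores, not the authors' evaluation code: the helper names `rank_of_favored` and `proportion_within_rank` are hypothetical, and ranking by descending score with ties counted in favor of the favored variant is an assumption.

```python
from statistics import mean, median


def rank_of_favored(scores, favored_index):
    """1-based rank of the favored variant when variants are sorted by
    descending score (ASSUMPTION: ties do not worsen the rank)."""
    favored = scores[favored_index]
    return 1 + sum(1 for s in scores if s > favored)


def proportion_within_rank(ranks, r):
    """Fraction of simulations in which the favored variant had rank <= r."""
    return sum(1 for k in ranks if k <= r) / len(ranks)


# Toy data only: (per-variant scores, index of the favored variant).
simulations = [
    ([0.10, 0.90, 0.30], 1),
    ([0.50, 0.20, 0.40], 2),
    ([0.05, 0.70, 0.60, 0.65], 0),
    ([0.80, 0.10], 0),
]

ranks = [rank_of_favored(scores, idx) for scores, idx in simulations]
print("ranks:", ranks)
for r in (1, 2, 4):
    print(f"proportion with rank <= {r}:", proportion_within_rank(ranks, r))
print("mean rank:", mean(ranks), "median rank:", median(ranks))
```

Plotting `proportion_within_rank` against r gives the kind of cumulative curve described for the left panel, while the mean and median of `ranks` correspond to the summaries in the right panel.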

Supplementary Figure 4 Demo I: iSAFE versus CMS in a model of human demography.

Comparison of iSAFE and CMS signals in a model of human demography (see Online Methods; Supplementary Figure 14). Solid horizontal lines separate replicates based on the favored allele frequency (ν) in EUR as the target population, and dotted vertical lines separate different replicates. The rank of the favored mutation (solid red circle) for each test is shown in the top-right corner.

Supplementary Figure 5 Demo II: iSAFE without outgroup samples.

iSAFE on ongoing selective sweeps with different favored allele frequencies (ν) in a 5 Mbp region. The position of the favored mutation is selected from the range [2.5 Mbp, 5 Mbp]. The other simulation parameters take the default values for a fixed population size (see Online Methods), and outgroup samples are not available.

Supplementary Figure 6 Demo III: iSAFE and selection strength.

iSAFE on a 5 Mbp region with different selection strengths, Ns ∈ {0, 100, 200, 500, 1000}. The left panels show the Ψe,w matrix. The middle panels show the iSAFE score as a function of the variant position. The right panels show the derived allele frequency as a function of the iSAFE score.

Supplementary Figure 7 Peak iSAFE.

(a) Empirical analysis of 5000 simulations on a 5 Mbp region with a wide range of selection strengths (Ns ∈ {10, 50, 100, 200, 300, 400, 500, 1000}) shows a difference in iSAFE performance beyond a threshold of 0.1 for the peak iSAFE score. (b) Rank of the favored mutation as a function of the peak iSAFE score (bottom x-axis) or P value (top x-axis; see Online Methods) for the same data as in part a. Each gray dot represents the favored mutation of one simulation, across a wide range of selection coefficients. Performance deteriorates for peak iSAFE scores below 0.1. The dashed (dotted) line represents the median (quartiles).

Supplementary Figure 8 iSAFE versus CMS on well-characterized selective sweeps.

iSAFE and CMS scores (right and left panels, respectively) on 8 well-characterized selective sweeps (Supplementary Table 1). The rank of the putative favored mutation (red star) in the 5 Mbp region is shown in parentheses.

Supplementary Figure 9 iSAFE on targets of selection.

iSAFE on 5 Mbp regions reported to be under selection. The putative favored mutation is shown as a blue square when it is among the top-ranked iSAFE mutations, and as a blue triangle when the signal of selection is very weak (peak iSAFE ≪ 0.1). The right axis is the empirical P value (see Online Methods). (a,b) PCDH15 and ADH1B loci with 207 samples (414 haplotypes) from the CHB+JPT populations. (c) PSCA locus with 108 samples (216 haplotypes) from the YRI population. (d,e,f) ASPM, FUT2, and F12 loci with 91 samples (182 haplotypes) from the GBR population.

Supplementary Figure 10 iSAFE on the GRM5-TYR locus.

The tyrosinase (TYR) gene, encoding an enzyme involved in the first step of melanin production, is present in a large region under selection. A nonsynonymous mutation, rs1042602 (blue), in the TYR gene is reported as a candidate favored variant. A second, intronic variant, rs10831496 (red), in the GRM5 gene, 396 kbp upstream of TYR, has been shown to have a strong association with skin color. In contrast, iSAFE ranks mutation rs672144 (turquoise) as the top candidate for the favored variant out of ~22,000 mutations in the region (5 Mbp; see Supplementary Note 2). (a) This variant was the top-ranked mutation not only in CEU (Fig. 3c) but also in EUR, EAS, AMR, and SAS. The signal of selection is strong in all populations (iSAFE > 0.5, P ≪ 1.3e-8 for all) except AFR, which does not show a signal of selection in this region. We plotted the haplotypes carrying rs672144 in all 5008 haplotypes (2504 samples) of 1000GP and found (Fig. 3d) that two distinct haplotypes carry the mutation, both with high frequencies maintained across a large stretch of the region, suggestive of a soft sweep with standing variation. (b) Frequencies of the derived alleles of rs10831496, rs672144, and rs1042602 are shown in red, turquoise, and blue, respectively. The iSAFE candidate (rs672144) may not have been reported earlier because it is near fixation in all populations of 1000GP except AFR (f = 0.27). (c) Each row is a haplotype and each column is a variant in the EUR populations of 1000GP. In total there are 1006 haplotypes (503 samples). Carrier haplotypes of the derived alleles of rs10831496, rs672144, and rs1042602 are shaded in red, turquoise, and blue, respectively. To keep the plot readable, we removed low-frequency SNPs (fEUR < 0.2) and SNPs that are near fixation in the whole 1000GP (f1000GP > 0.95). The previously suggested candidates rs1042602 and rs10831496 are fully linked to rs672144, but not to each other. The EUR haplotypes can be partitioned into 4 clusters. Each of the 4 clusters shows high homozygosity, suggestive of selection. However, rs1042602 can explain the sweep only in clusters C1+C2, and rs10831496 only in C1+C3. Only rs672144 explains all 4 clusters, providing a simpler explanation of selection in this region.

Supplementary Figure 11 iSAFE on the OCA2-HERC2 locus.

The mutation rs1448484 is the top-ranked iSAFE mutation in all populations of 1000GP except the African population, which does not show any signal of selection in this region. rs12913832 is a candidate favored mutation for selection in Europeans, proposed by Wilde et al. (2014). Supplementary Table SN2.2 provides the iSAFE rank of some other candidate mutations associated with pigmentation in this region (see Supplementary Note 2).

Supplementary Figure 12 iSAFE on the KITLG locus.

iSAFE top rank mutations (circles) and candidate mutation rs642742 (blue triangle) proposed by Miller et al. (2007). See Supplementary Note 2 and Supplementary Table SN2.3 for more details.

Supplementary Figure 13 iSAFE on the TRPV6 locus.

Ten mutations (rs11772526, rs4725602, rs11763225, rs7796010, rs11762011, rs13239916, rs4145394, rs10808023, rs10808021, and rs4726591) are highly linked and are the top 10 iSAFE candidate mutations in all 1000GP populations except AFR, where there is no signal of selection. See Supplementary Note 2 and Supplementary Table SN2.3 for more details.

Supplementary Figure 14 Simulation of selection on human demography.

(a) A model of human demography described by Fig. 4 and Table 2 of Gravel et al. (2011). The model assumes an out-of-Africa split at time TB, with a bottleneck that reduced the effective population size from NAf to NB, allowing for migrations at rate mAf-B. The African population stays constant at NAf up to the present generation. The model assumes a second split between European and Asian populations at time TEuAs, with a bottleneck reducing the Asian and European populations to NAs0 and NEu0, respectively. The bottleneck was followed by exponential growth at rates rAs and rEu, as well as migrations among all three sub-populations, leading to the current populations from which East Asian (EAS), European (EUR), and African (AFR) individuals were sampled. We used default values for simulation parameters not assigned here (see Online Methods). (b) We simulated 1000 selective sweeps on a 5 Mbp region based on the model of human demography, with selection coefficient s = 0.05 and starting favored allele frequency ν0 = 0.001. Selection begins at a random time after the out-of-Africa event, in the lineage of the EUR population (the target population). When the onset of selection is before the split of EUR and EAS (> 23 kya), both EUR and EAS are under selection.

Supplementary Figure 15 Maximum difference in derived allele frequency (MDDAF).

We simulated 25,000 instances of AFR, EUR, and EAS populations, based on a model of human demography (see Online Methods; Supplementary Fig. 14). (a) The MDDAF score of mutations as a function of the derived allele frequency in the target population, DT. (b) Distribution of the MDDAF score for mutations with DT > 0.9. (c) Empirical P value of the MDDAF score for mutations with DT > 0.9. The dashed red lines represent the value 0.78, above which MDDAF, given DT > 0.9, has a P value less than 0.001.
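A minimal sketch of an MDDAF-style score and its empirical P value follows. The legend does not give the formula, so the definition used here (target derived allele frequency minus the smallest derived allele frequency among the other populations) is an assumption, as are the function names `mddaf` and `empirical_p_value` and the one-sided P-value convention. The null scores are synthetic placeholders, not the simulation results summarized in the figure.

```python
def mddaf(target_daf, other_dafs):
    """ASSUMED definition: derived allele frequency in the target population
    minus the smallest derived allele frequency among the other populations."""
    return target_daf - min(other_dafs)


def empirical_p_value(observed, null_scores):
    """Fraction of null scores at least as large as the observed score
    (ASSUMPTION: a one-sided, upper-tail empirical P value)."""
    return sum(1 for s in null_scores if s >= observed) / len(null_scores)


# Toy example: a mutation at 0.95 in the target population and at
# 0.10 and 0.40 in two other populations (made-up frequencies).
score = mddaf(0.95, [0.10, 0.40])

# Placeholder null distribution: 1000 evenly spaced scores in (0, 1).
# In practice these would come from neutral simulations.
null_scores = [(i + 0.5) / 1000 for i in range(1000)]
p = empirical_p_value(score, null_scores)

print(f"MDDAF score: {score:.2f}")
print(f"empirical P value against the toy null: {p:.3f}")
```

With a null distribution drawn from neutral simulations in place of the placeholder, the same comparison yields the kind of threshold the legend reports for mutations with DT > 0.9.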

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–15, Supplementary Table 1 and Supplementary Notes 1–2 (PDF 3847 kb)

Life Sciences Reporting Summary (PDF 132 kb)

Supplementary Software

iSAFE (v 0.1): integrated Selection of Allele Favored by Evolution (ZIP 948 kb)

Cite this article

Akbari, A., Vitti, J., Iranmehr, A. et al. Identifying the favored mutation in a positive selective sweep. Nat Methods 15, 279–282 (2018). https://doi.org/10.1038/nmeth.4606

  • DOI: https://doi.org/10.1038/nmeth.4606
