Most approaches that capture signatures of selective sweeps in population genomics data do not identify the specific mutation favored by selection. We present iSAFE (for “integrated selection of allele favored by evolution”), a method that enables researchers to accurately pinpoint the favored mutation in a large region (∼5 Mbp) by using a statistic derived solely from population genetics signals. iSAFE does not require knowledge of demography, the phenotype under selection, or functional annotations of mutations.
Vitti, J.J., Grossman, S.R. & Sabeti, P.C. Annu. Rev. Genet. 47, 97–120 (2013).
Fan, S., Hansen, M.E.B., Lo, Y. & Tishkoff, S.A. Science 354, 54–59 (2016).
Schrider, D.R., Mendes, F.K., Hahn, M.W. & Kern, A.D. Genetics 200, 267–284 (2015).
Field, Y. et al. Science 354, 760–764 (2016).
Azad, P. et al. J. Mol. Med. (Berl.) 95, 1269–1282 (2017).
Stobdan, T. et al. Mol. Biol. Evol. 34, 3154–3168 (2017).
Grossman, S.R. et al. Science 327, 883–886 (2010).
Ronen, R. et al. PLoS Genet. 11, e1005527 (2015).
Wang, M. et al. Mol. Biol. Evol. 31, 3068–3080 (2014).
Voight, B.F., Kudaravalli, S., Wen, X. & Pritchard, J.K. PLoS Biol. 4, e72 (2006).
Sabeti, P.C. et al. Science 312, 1614–1620 (2006).
Ohashi, J., Naka, I. & Tsuchiya, N. Mol. Biol. Evol. 28, 849–857 (2011).
Tishkoff, S.A. et al. Science 293, 455–462 (2001).
Heffelfinger, C. et al. Eur. J. Hum. Genet. 22, 551–557 (2014).
Wilde, S. et al. Proc. Natl. Acad. Sci. USA 111, 4832–4837 (2014).
Coop, G. et al. PLoS Genet. 5, e1000500 (2009).
Campbell, C.D. et al. Nat. Genet. 44, 1277–1281 (2012).
Galinsky, K.J., Loh, P.-R., Mallick, S., Patterson, N.J. & Price, A.L. Am. J. Hum. Genet. 99, 1130–1139 (2016).
Beleza, S. et al. PLoS Genet. 9, e1003372 (2013).
Cornelis, M.C. et al. Mol. Psychiatry 20, 647–656 (2015).
Ferrer-Admetlla, A., Liang, M., Korneliussen, T. & Nielsen, R. Mol. Biol. Evol. 31, 1275–1291 (2014).
Pybus, M. et al. Bioinformatics 31, 3946–3952 (2015).
Garud, N.R., Messer, P.W., Buzbas, E.O. & Petrov, D.A. PLoS Genet. 11, e1005004 (2015).
DeGiorgio, M., Huber, C.D., Hubisz, M.J., Hellmann, I. & Nielsen, R. Bioinformatics 32, 1895–1897 (2016).
Ronen, R., Udpa, N., Halperin, E. & Bafna, V. Genetics 195, 181–193 (2013).
Pavlidis, P., Živkovic, D., Stamatakis, A. & Alachiotis, N. Mol. Biol. Evol. 30, 2224–2234 (2013).
Chen, H., Patterson, N. & Reich, D. Genome Res. 20, 393–402 (2010).
Sabeti, P.C. et al. Nature 449, 913–918 (2007).
Sabeti, P.C. et al. Nature 419, 832–837 (2002).
Nielsen, R. et al. Genome Res. 15, 1566–1575 (2005).
Kim, Y. & Nielsen, R. Genetics 167, 1513–1524 (2004).
Shriver, M.D. et al. Hum. Genomics 1, 274–286 (2004).
Ewing, G. & Hermisson, J. Bioinformatics 26, 2064–2065 (2010).
Nachman, M.W. & Crowell, S.L. Genetics 156, 297–304 (2000).
Jensen-Seaman, M.I. et al. Genome Res. 14, 528–538 (2004).
Gravel, S. et al. Proc. Natl. Acad. Sci. USA 108, 11983–11988 (2011).
Szpiech, Z.A. & Hernandez, R.D. Mol. Biol. Evol. 31, 2824–2827 (2014).
1000 Genomes Project Consortium. Nature 526, 68–74 (2015).
Zerbino, D.R. et al. Nucleic Acids Res. 46, D754–D761 (2018).
International HapMap Consortium. Nature 449, 851–861 (2007).
This research was supported in part by the NSF (grant DBI-1458557 to A.A. and V.B.) and the NIH (grant R01GM114362 to V.B.).
V.B. is a cofounder, has an equity interest in, and receives income from Digital Proteomics, LLC. The terms of this arrangement have been reviewed and approved by the University of California, San Diego, in accordance with its conflict-of-interest policies. Digital Proteomics was not involved in the research presented here.
Integrated supplementary information
(a) φ and κ as estimators of f. Empirical analysis of 10,000 neutrally evolving populations (about 3 million SNPs), simulated with the default parameter set, shows that φ and κ are (biased) estimators of the allele frequency f (f = i/n for all integers i ∈ [1, n−1]). (b) The top panel shows the probability density function (PDF) of the SAFE score for 10,000 neutrally evolving populations (about 3 million SNPs with minor allele frequency > 0.05), simulated with the default parameter set. The bottom panel plots the quantiles of the SAFE score against the quantiles of the normal distribution. The coefficient of determination of the QQ plot (R² = 0.9997) shows that a Gaussian distribution is a good approximation to the SAFE score distribution.
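The QQ-plot check in panel (b) can be illustrated with a small sketch. It assumes that R² is the squared Pearson correlation between the sorted sample values and the matching standard-normal quantiles; the caption does not state how R² was computed, so this is an assumption. The function name `qq_r_squared`, the plotting positions `(i + 0.5) / n`, and the toy samples are all hypothetical and stand in for real SAFE scores.

```python
import random
from statistics import NormalDist, mean

def qq_r_squared(scores):
    """Squared correlation between sorted scores and standard-normal quantiles."""
    xs = sorted(scores)
    n = len(xs)
    std_normal = NormalDist()
    # Theoretical quantiles at plotting positions (i + 0.5) / n (an assumed convention).
    qs = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    mx, mq = mean(xs), mean(qs)
    cov = sum((x - mx) * (q - mq) for x, q in zip(xs, qs))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_q = sum((q - mq) ** 2 for q in qs)
    return cov * cov / (var_x * var_q)

# Toy data only: a Gaussian sample and a clearly skewed sample for contrast.
rng = random.Random(0)
gaussian_like = [rng.gauss(0.0, 1.0) for _ in range(5000)]
skewed = [rng.expovariate(1.0) for _ in range(5000)]

r2_gaussian = qq_r_squared(gaussian_like)
r2_skewed = qq_r_squared(skewed)
```

A value of R² close to 1 indicates that the QQ plot is nearly a straight line, which is the sense in which the caption calls the Gaussian a good approximation.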
Performance of the SAFE score evaluated in different scenarios with 1000 simulations per bin. In each panel, one parameter is varied while the other parameters keep their default values (see Online Methods). The population size is fixed at N = 20,000. The dashed (dotted) line represents the median (quartiles). In the bottom-right panel, white represents the result for a fixed-size population model with default parameters and gray represents a model of human demography for the EUR population (see Online Methods; Supplementary Fig. 14). The onset of selection fell in the post-bottleneck epochs (23 kya to present).
(a,b,c) Performance of iSAFE measured by the rank of the favored variant and the distance of the favored variant from the peak in 1000 simulations per bin. The dashed (dotted) line represents the median (quartiles). (d) Performance of iSAFE compared to iHS and SCCT, measured by the rank of the favored variant in 5000 simulations of a 5 Mbp region around ongoing hard sweeps (ν0 = 1/N; 0.1 < ν < 0.9) with a fixed population size (N = 20,000) and default values for the other simulation parameters. In the left panel, for any rank r on the x-axis, the y-axis value represents the proportion of simulations in which the favored allele had rank ≤ r. In the right panel, solid (dashed) lines represent the mean (respectively, median) value of the favored allele rank. (e) iSAFE performance upon addition of outgroup samples. No deterioration is seen for low frequencies of the favored variant, but iSAFE performance improves dramatically when the favored mutation is near fixation or fixed. The dashed (dotted) line represents the median (quartiles). This comparison is based on 1000 simulations of 5 Mbp genomic regions under a model of human demography (Supplementary Fig. 14). The time of onset of selection was chosen at random (using the distribution in Supplementary Fig. 14) after the out-of-Africa event, in the lineage of the EUR population (the target population). When the onset of selection is before the split of EUR and EAS (> 23 kya), both EUR and EAS are under selection.
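The curve in the left panel of (d) is a cumulative distribution of favored-allele ranks. A minimal sketch of how such a curve could be tabulated follows; the function name `rank_cdf` and the example ranks are hypothetical and are not taken from the simulations described here.

```python
def rank_cdf(favored_ranks, max_rank):
    """For each r in 1..max_rank, the fraction of simulations with favored-allele rank <= r."""
    n = len(favored_ranks)
    counts = [0] * (max_rank + 1)
    for rank in favored_ranks:
        # Ranks are 1-based; ranks above max_rank never contribute to the curve.
        if 1 <= rank <= max_rank:
            counts[rank] += 1
    curve = []
    running = 0
    for r in range(1, max_rank + 1):
        running += counts[r]
        curve.append(running / n)
    return curve

# Hypothetical favored-allele ranks from eight simulations.
example_ranks = [1, 1, 3, 2, 10, 1, 4, 25]
curve = rank_cdf(example_ranks, 5)
```

Here `curve[r - 1]` is the proportion of simulations in which the favored allele ranked within the top r candidates, so a method whose curve rises faster localizes the favored mutation more tightly.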
Comparing iSAFE and CMS signals in a model of human demography (see Online Methods; Supplementary Fig. 14). Solid horizontal lines separate replicates based on the favored allele frequency (ν) in EUR as the target population, and dotted vertical lines separate different replicates. The rank of the favored mutation (solid red circle) for each test is shown in the top-right corner.
iSAFE on ongoing selective sweeps with different favored allele frequencies (ν) in a 5 Mbp region. The position of the favored mutation is selected from the range [2.5 Mbp, 5 Mbp]. Other simulation parameters take the default values for a fixed population size (see Online Methods), and outgroup samples are not available.
iSAFE on a 5 Mbp region with different selection strengths, Ns ∈ [0, 100, 200, 500, 1000]. Left panels show the Ψe,w matrix. Middle panels show the iSAFE score as a function of the variant position. Right panels show the derived allele frequency as a function of the iSAFE score.
(a) Empirical analysis of 5000 simulations of a 5 Mbp region with a wide range of selection strengths (Ns ∈ [10, 50, 100, 200, 300, 400, 500, 1000]) shows that the performance of iSAFE differs on either side of a threshold of 0.1 for the peak iSAFE score. (b) Rank of the favored mutation as a function of the peak iSAFE score (bottom x-axis) or P value (top x-axis; see Online Methods) for the same data as in part a. Each gray dot represents the favored mutation of one simulation, drawn from a wide range of selection coefficients. The performance deteriorates for iSAFE scores below 0.1. The dashed (dotted) line represents the median (quartiles).
iSAFE and CMS scores (right and left panels, respectively) on 8 well-characterized selective sweeps (Supplementary Table 1). The rank of the putative favored mutation (red star) in the 5 Mbp region is shown in parentheses.
iSAFE on 5 Mbp regions reported to be under selection. The putative favored mutation is shown as a blue square when it is among the top-ranked iSAFE mutations, and as a blue triangle when the signal of selection is very weak (peak iSAFE << 0.1). The right axis shows the empirical P value (see Online Methods). (a,b) PCDH15 and ADH1B loci with 207 samples (414 haplotypes) from the CHB+JPT populations. (c) PSCA locus with 108 samples (216 haplotypes) from the YRI population. (d,e,f) ASPM, FUT2, and F12 loci with 91 samples (182 haplotypes) from the GBR population.
The Tyrosinase (TYR) gene, encoding an enzyme involved in the first step of melanin production, lies in a large region under selection. A nonsynonymous mutation, rs1042602 (blue), in the TYR gene is reported as a candidate favored variant. A second, intronic variant, rs10831496 (red), in the GRM5 gene, 396 kbp upstream of TYR, has been shown to have a strong association with skin color. In contrast, iSAFE ranks mutation rs672144 (turquoise) as the top candidate for the favored variant out of ~22,000 mutations in the region (5 Mbp; see Supplementary Note 2). (a) This variant was the top-ranked mutation not only in CEU (Fig. 3c), but also in EUR, EAS, AMR, and SAS. The signal of selection is strong in all populations (iSAFE > 0.5, P << 1.3e-8 for all of them) except AFR, which does not show a signal of selection in this region. We plotted the haplotypes carrying rs672144 in all 5008 haplotypes (2504 samples) of 1000GP and found (Fig. 3d) that two distinct haplotypes carry the mutation, both with high frequencies maintained across a large stretch of the region, suggestive of a soft sweep on standing variation. (b) Frequencies of the derived alleles of rs10831496, rs672144, and rs1042602 are shown in red, turquoise, and blue, respectively. The iSAFE candidate (rs672144) may not have been reported earlier because it is near fixation in all populations of 1000GP except AFR (f = 0.27). (c) Each row is a haplotype and each column is a variant in the EUR populations of 1000GP. In total there are 1006 haplotypes (503 samples). Carrier haplotypes of the derived alleles of rs10831496, rs672144, and rs1042602 are shaded in red, turquoise, and blue, respectively. To make the plot readable, we removed low-frequency SNPs (fEUR < 0.2) and SNPs that are near fixation in the whole 1000GP (f1000GP > 0.95). The previously suggested candidates rs1042602 and rs10831496 are fully linked to rs672144, but not to each other. The EUR haplotypes can be partitioned into 4 clusters.
Each of the 4 clusters shows high homozygosity, suggestive of selection. However, rs1042602 can only explain the sweep in clusters C1+C2, and rs10831496 can only explain C1+C3. Only rs672144 explains all 4 clusters, providing a simpler explanation of selection in this region.
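The SNP filter used for the haplotype plot in panel (c) can be written as a simple predicate. The sketch below assumes that a SNP is kept when its EUR frequency is at least 0.2 and its frequency in the whole 1000GP is at most 0.95, which is how we read the two thresholds in the caption; the function name `keep_snp`, the SNP identifiers, and the frequencies are hypothetical.

```python
def keep_snp(f_eur, f_1000gp, min_eur=0.2, max_global=0.95):
    """Keep SNPs that are neither rare in EUR nor near fixation in the whole 1000GP."""
    return f_eur >= min_eur and f_1000gp <= max_global

# Hypothetical SNPs with made-up derived allele frequencies.
snps = [
    {"id": "snpA", "f_eur": 0.05, "f_1000gp": 0.10},  # dropped: rare in EUR
    {"id": "snpB", "f_eur": 0.45, "f_1000gp": 0.50},  # kept
    {"id": "snpC", "f_eur": 0.99, "f_1000gp": 0.97},  # dropped: near fixation globally
    {"id": "snpD", "f_eur": 0.20, "f_1000gp": 0.95},  # kept: on both boundaries
]

kept = [s["id"] for s in snps if keep_snp(s["f_eur"], s["f_1000gp"])]
```

Note that a variant such as rs672144, described above as near fixation in all non-AFR populations, can still pass the second threshold if its AFR frequency pulls the 1000GP-wide frequency below 0.95.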
The mutation rs1448484 is the top-ranked iSAFE mutation in all 1000GP populations except AFR, which does not show any signal of selection in this region. rs12913832 is a candidate favored mutation for the selection in Europeans, proposed by Wilde et al. (2014). Supplementary Table SN2.2 provides the iSAFE rank of some other candidate mutations associated with pigmentation in this region (see Supplementary Note 2).
Ten mutations (rs11772526, rs4725602, rs11763225, rs7796010, rs11762011, rs13239916, rs4145394, rs10808023, rs10808021, and rs4726591) are highly linked and are the top 10 iSAFE candidate mutations in all the 1000GP populations except AFR, where there are no signals of selection. See Supplementary Note 2 and Supplementary Table SN2.3 for more details.
(a) A model of human demography described by Fig. 4 and Table 2 of Gravel et al. (2011). The model assumes an out-of-Africa split at time TB, with a bottleneck that reduced the effective population from NAf to NB, allowing for migrations at rate mAf-B. The African population stays constant at NAf up to the present generation. The model assumes a second split between European and Asian populations at time TEuAs, with a bottleneck reducing the Asian and European populations to NAs0 and NEu0, respectively. The bottleneck was followed by exponential growth at rates rAs and rEu, as well as migrations among all three sub-populations, leading to the current populations from which East Asian (EAS), European (EUR), and African (AFR) individuals were sampled. We used default values for simulation parameters not assigned (see Online Methods). (b) We simulated 1000 selective sweeps on a 5 Mbp region based on the model of human demography, with selection coefficient s = 0.05 and starting favored allele frequency ν0 = 0.001. Selection begins at a random time after the out-of-Africa event, in the lineage of the EUR population (the target population). When the onset of selection is before the split of EUR and EAS (> 23 kya), both EUR and EAS are under selection.
We simulated 25,000 instances of AFR, EUR, and EAS populations, based on a model of human demography (see Online Methods; Supplementary Fig. 14). (a) The MDDAF score of mutations as a function of the derived allele frequency in the target population, DT. (b) Distribution of the MDDAF score for mutations with DT > 0.9. (c) Empirical P value of the MDDAF score for mutations with DT > 0.9. The dashed red lines represent the value 0.78, where MDDAF, given DT > 0.9, has a P value less than 0.001.
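The empirical P value in panel (c) can be sketched as a tail fraction of a simulated null distribution. The code below assumes the P value of an observed score is the fraction of null scores at least as large; the exact convention is given in Online Methods and may differ. The function names and the toy null scores are hypothetical, and in the setting of this figure the null sample would be restricted to neutral mutations with DT > 0.9.

```python
def empirical_p_value(observed, null_scores):
    """Fraction of null scores that are at least as large as the observed score."""
    exceed = sum(1 for s in null_scores if s >= observed)
    return exceed / len(null_scores)

def threshold_for_p(null_scores, alpha):
    """Smallest null score whose empirical P value falls below alpha (None if no score qualifies)."""
    for s in sorted(set(null_scores)):
        if empirical_p_value(s, null_scores) < alpha:
            return s
    return None

# Toy null distribution: 1000 evenly spaced scores in [0, 1); not real MDDAF values.
toy_null = [i / 1000 for i in range(1000)]

p_at_09 = empirical_p_value(0.9, toy_null)
cutoff_005 = threshold_for_p(toy_null, 0.05)
```

Read this way, the dashed red line at 0.78 plays the role of `threshold_for_p` with `alpha = 0.001` applied to the simulated MDDAF scores.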
About this article
Cite this article
Akbari, A., Vitti, J., Iranmehr, A. et al. Identifying the favored mutation in a positive selective sweep. Nat Methods 15, 279–282 (2018). https://doi.org/10.1038/nmeth.4606