Identifying the favored mutation in a positive selective sweep

Akbari, Ali; Vitti, Joseph J; Iranmehr, Arya; Bakhtiari, Mehrdad; Sabeti, Pardis C; Mirarab, Siavash; Bafna, Vineet

doi:10.1038/nmeth.4606

Brief Communication
Published: 19 February 2018

Nature Methods volume 15, pages 279–282 (2018)

Abstract

Most approaches that capture signatures of selective sweeps in population genomics data do not identify the specific mutation favored by selection. We present iSAFE (for “integrated selection of allele favored by evolution”), a method that enables researchers to accurately pinpoint the favored mutation in a large region (∼5 Mbp) by using a statistic derived solely from population genetics signals. iSAFE does not require knowledge of demography, the phenotype under selection, or functional annotations of mutations.

**Figure 1: Characterization of the SAFE method.**

**Figure 2: Illustration of the iSAFE method.**

References

Vitti, J.J., Grossman, S.R. & Sabeti, P.C. Annu. Rev. Genet. 47, 97–120 (2013).
Fan, S., Hansen, M.E.B., Lo, Y. & Tishkoff, S.A. Science 354, 54–59 (2016).
Schrider, D.R., Mendes, F.K., Hahn, M.W. & Kern, A.D. Genetics 200, 267–284 (2015).
Field, Y. et al. Science 354, 760–764 (2016).
Azad, P. et al. J. Mol. Med. (Berl.) 95, 1269–1282 (2017).
Stobdan, T. et al. Mol. Biol. Evol. 34, 3154–3168 (2017).
Grossman, S.R. et al. Science 327, 883–886 (2010).
Ronen, R. et al. PLoS Genet. 11, e1005527 (2015).
Wang, M. et al. Mol. Biol. Evol. 31, 3068–3080 (2014).
Voight, B.F., Kudaravalli, S., Wen, X. & Pritchard, J.K. PLoS Biol. 4, e72 (2006).
Sabeti, P.C. et al. Science 312, 1614–1620 (2006).
Ohashi, J., Naka, I. & Tsuchiya, N. Mol. Biol. Evol. 28, 849–857 (2011).
Tishkoff, S.A. et al. Science 293, 455–462 (2001).
Heffelfinger, C. et al. Eur. J. Hum. Genet. 22, 551–557 (2014).
Wilde, S. et al. Proc. Natl. Acad. Sci. USA 111, 4832–4837 (2014).
Coop, G. et al. PLoS Genet. 5, e1000500 (2009).
Campbell, C.D. et al. Nat. Genet. 44, 1277–1281 (2012).
Galinsky, K.J., Loh, P.-R., Mallick, S., Patterson, N.J. & Price, A.L. Am. J. Hum. Genet. 99, 1130–1139 (2016).
Beleza, S. et al. PLoS Genet. 9, e1003372 (2013).
Cornelis, M.C. et al. Mol. Psychiatry 20, 647–656 (2015).
Ferrer-Admetlla, A., Liang, M., Korneliussen, T. & Nielsen, R. Mol. Biol. Evol. 31, 1275–1291 (2014).
Pybus, M. et al. Bioinformatics 31, 3946–3952 (2015).
Garud, N.R., Messer, P.W., Buzbas, E.O. & Petrov, D.A. PLoS Genet. 11, e1005004 (2015).
DeGiorgio, M., Huber, C.D., Hubisz, M.J., Hellmann, I. & Nielsen, R. Bioinformatics 32, 1895–1897 (2016).
Ronen, R., Udpa, N., Halperin, E. & Bafna, V. Genetics 195, 181–193 (2013).
Pavlidis, P., Živkovic, D., Stamatakis, A. & Alachiotis, N. Mol. Biol. Evol. 30, 2224–2234 (2013).
Chen, H., Patterson, N. & Reich, D. Genome Res. 20, 393–402 (2010).
Sabeti, P.C. et al. Nature 449, 913–918 (2007).
Sabeti, P.C. et al. Nature 419, 832–837 (2002).
Nielsen, R. et al. Genome Res. 15, 1566–1575 (2005).
Kim, Y. & Nielsen, R. Genetics 167, 1513–1524 (2004).
Shriver, M.D. et al. Hum. Genomics 1, 274–286 (2004).
Ewing, G. & Hermisson, J. Bioinformatics 26, 2064–2065 (2010).
Nachman, M.W. & Crowell, S.L. Genetics 156, 297–304 (2000).
Jensen-Seaman, M.I. et al. Genome Res. 14, 528–538 (2004).
Gravel, S. et al. Proc. Natl. Acad. Sci. USA 108, 11983–11988 (2011).
Szpiech, Z.A. & Hernandez, R.D. Mol. Biol. Evol. 31, 2824–2827 (2014).
1000 Genomes Project Consortium. Nature 526, 68–74 (2015).
Zerbino, D.R. et al. Nucleic Acids Res. 46, D754–D761 (2018).
International HapMap Consortium. Nature 449, 851–861 (2007).

Acknowledgements

This research was supported in part by the NSF (grant DBI-1458557 to A.A. and V.B.) and the NIH (grant R01GM114362 to V.B.).

Author information

Authors and Affiliations

Department of Electrical & Computer Engineering, University of California San Diego, La Jolla, California, USA
Ali Akbari, Arya Iranmehr & Siavash Mirarab
Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, Massachusetts, USA
Joseph J Vitti & Pardis C Sabeti
Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
Joseph J Vitti & Pardis C Sabeti
Department of Computer Science & Engineering, University of California San Diego, La Jolla, California, USA
Mehrdad Bakhtiari & Vineet Bafna

Contributions

A.A., S.M., and V.B. conceived and designed the experiments and wrote the manuscript with input from all other authors; A.A., J.J.V., and A.I. performed the experiments; A.A. analyzed the data; A.A. and M.B. developed software tools; and P.C.S., S.M., and V.B. provided guidance throughout the study.

Corresponding authors

Correspondence to Ali Akbari or Vineet Bafna.

Ethics declarations

Competing interests

V.B. is a cofounder, has an equity interest in, and receives income from Digital Proteomics, LLC. The terms of this arrangement have been reviewed and approved by the University of California, San Diego, in accordance with its conflict-of-interest policies. Digital Proteomics was not involved in the research presented here.

Integrated supplementary information

Supplementary Figure 1 Empirical SAFE distribution.

(a) φ and κ as estimators of f. An empirical analysis of 10,000 neutrally evolving populations (about 3 million SNPs), simulated with the default parameter set, shows that φ and κ are (biased) estimators of the allele frequency f (f = i/n for all integers i ∈ [1, n−1]). (b) The top panel shows the probability density function (PDF) of the SAFE score for 10,000 neutrally evolving populations (about 3 million SNPs with minor allele frequency > 0.05), simulated with the default parameter set. The bottom panel plots the quantiles of the SAFE score against the quantiles of the normal distribution. The coefficient of determination of the QQ-plot (R² = 0.9997) shows that a Gaussian distribution is a good approximation to the SAFE score distribution.
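The QQ-plot check in panel b can be illustrated with a short sketch. It is not the authors' code: the function name `qq_r_squared`, the plotting positions (i + 0.5)/n, and the synthetic samples are all assumptions made for illustration, and the SAFE score itself is not computed here.

```python
import random
from statistics import NormalDist, mean

def qq_r_squared(scores):
    """Squared correlation between sorted scores and standard-normal quantiles."""
    xs = sorted(scores)
    n = len(xs)
    # Assumed plotting positions (i + 0.5) / n keep the quantiles finite.
    qs = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    mx, mq = mean(xs), mean(qs)
    sxy = sum((x - mx) * (q - mq) for x, q in zip(xs, qs))
    sxx = sum((x - mx) ** 2 for x in xs)
    sqq = sum((q - mq) ** 2 for q in qs)
    return sxy * sxy / (sxx * sqq)

# Synthetic stand-ins for a score distribution (not simulated SAFE scores).
rng = random.Random(0)
gaussian_like = [rng.gauss(0.0, 1.0) for _ in range(5000)]
skewed = [rng.expovariate(1.0) for _ in range(5000)]

r2_gauss = qq_r_squared(gaussian_like)
r2_skewed = qq_r_squared(skewed)
print(f"R^2 (Gaussian sample): {r2_gauss:.4f}")
print(f"R^2 (skewed sample):   {r2_skewed:.4f}")
```

An R² close to 1 indicates that the sample quantiles fall nearly on a straight line against the normal quantiles, which is the sense in which the caption calls the Gaussian a good approximation.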

Supplementary Figure 2 SAFE evaluation.

Performance of the SAFE score evaluated in different scenarios, with 1000 simulations per bin. In each panel, one parameter is varied while the other parameters keep their default values (see Online Methods). The population size is fixed at N = 20,000. The dashed (dotted) lines represent the median (quartiles). In the bottom-right panel, white represents the result for a fixed-size population model with default parameters, and gray represents a model of human demography for the EUR population (see Online Methods; Supplementary Fig. 14). The onset times of selection fall in the post-bottleneck epochs (23 kya to the present).

Supplementary Figure 3 iSAFE evaluation.

(a,b,c) Performance of iSAFE measured by the rank of the favored variant and the distance of the favored variant from the peak, with 1000 simulations per bin. The dashed (dotted) lines represent the median (quartiles). (d) Performance of iSAFE compared with iHS and SCCT, measured by the rank of the favored variant in 5000 simulations of a 5 Mbp region around ongoing hard sweeps (ν₀ = 1/N; 0.1 < ν < 0.9) with a fixed population size (N = 20,000) and default values for the other simulation parameters. In the left panel, for any rank r on the x-axis, the y value represents the proportion of samples in which the favored allele had rank ≤ r. In the right panel, solid (dashed) lines represent the mean (respectively, median) value of the favored allele rank. (e) iSAFE performance upon addition of outgroup samples. No deterioration is seen for low frequencies of the favored variant, but iSAFE performance improves dramatically when the favored mutation is near fixation or fixed. The dashed (dotted) lines represent the median (quartiles). This comparison is based on 1000 simulations of 5 Mbp genomic regions under a model of human demography (Supplementary Fig. 14). The time of onset of selection was chosen at random (using the distribution in Supplementary Fig. 14) after the out-of-Africa event, in the lineage of the EUR population (the target population). When the onset of selection precedes the split of EUR and EAS (> 23 kya), both EUR and EAS are under selection.
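The rank-based curve described for panel d (left) can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names, the tie-breaking rule, and the toy score vectors are hypothetical and do not come from the paper.

```python
def favored_rank(scores, favored_index):
    """1-based rank of the favored variant when variants are sorted by descending score."""
    favored_score = scores[favored_index]
    # Assumption: only strictly higher scores rank ahead of the favored variant.
    return 1 + sum(1 for s in scores if s > favored_score)

def proportion_within_rank(ranks, r):
    """Fraction of simulations in which the favored variant had rank <= r."""
    return sum(1 for k in ranks if k <= r) / len(ranks)

# Toy data: (per-variant scores, index of the favored variant) per simulation.
simulations = [
    ([0.10, 0.42, 0.05], 1),
    ([0.30, 0.20, 0.25], 1),
    ([0.50, 0.45, 0.10], 1),
    ([0.01, 0.02, 0.90], 2),
]

ranks = [favored_rank(scores, idx) for scores, idx in simulations]
curve = {r: proportion_within_rank(ranks, r) for r in (1, 2, 3)}
print(ranks)   # [1, 3, 2, 1]
print(curve)   # {1: 0.5, 2: 0.75, 3: 1.0}
```

Plotting `curve` against r gives the cumulative curve of the left panel; the mean and median of `ranks` correspond to the summary lines of the right panel.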

Supplementary Figure 4 Demo I: iSAFE versus CMS in a model of human demography.

Comparison of iSAFE and CMS signals in a model of human demography (see Online Methods; Supplementary Fig. 14). Solid horizontal lines separate replicates by the favored allele frequency (ν) in EUR, the target population, and dotted vertical lines separate individual replicates. The rank of the favored mutation (solid red circle) for each test is shown in the top-right corner.

Supplementary Figure 5 Demo II: iSAFE without outgroup samples.

iSAFE on ongoing selective sweeps with different favored allele frequencies (ν) in a 5 Mbp region. The position of the favored mutation is selected from the range [2.5 Mbp, 5 Mbp]. Other simulation parameters take the default values for a fixed population size (see Online Methods), and outgroup samples are not available.

Supplementary Figure 6 Demo III: iSAFE and selection strength.

iSAFE on a 5 Mbp region with different selection strengths, Ns ∈ [0, 100, 200, 500, 1000]. The left panels show the Ψ_{e,w} matrix. The middle panels show the iSAFE score as a function of the variant position. The right panels show the derived allele frequency as a function of the iSAFE score.

Supplementary Figure 7 Peak iSAFE.

(a) An empirical analysis of 5000 simulations of a 5 Mbp region, covering a wide range of selection strengths (Ns ∈ [10, 50, 100, 200, 300, 400, 500, 1000]), shows a difference in the performance of iSAFE beyond a threshold of 0.1 for the peak iSAFE score. (b) Rank of the favored mutation as a function of the peak iSAFE score (bottom x-axis) or P value (top x-axis; see Online Methods) for the same data as in part a. Each gray dot represents the favored mutation of one simulation. The performance deteriorates for iSAFE scores below 0.1. The dashed (dotted) lines represent the median (quartiles).

Supplementary Figure 8 iSAFE versus CMS on well-characterized selective sweeps.

iSAFE and CMS scores (right and left panels, respectively) on 8 well-characterized selective sweeps (Supplementary Table 1). The rank of the putative favored mutation (red star) within the 5 Mbp region is shown in parentheses.

Supplementary Figure 9 iSAFE on targets of selection.

iSAFE on 5 Mbp regions reported to be under selection. The putative favored mutation is shown as a blue square when it is among the top-ranked iSAFE mutations, and as a blue triangle when the signal of selection is very weak (peak iSAFE << 0.1). The right axis is the empirical P value (see Online Methods). (a,b) PCDH15 and ADH1B loci with 207 samples (414 haplotypes) from the CHB+JPT populations. (c) PSCA locus with 108 samples (216 haplotypes) from the YRI population. (d,e,f) ASPM, FUT2, and F12 loci with 91 samples (182 haplotypes) from the GBR population.

Supplementary Figure 10 iSAFE on the GRM5-TYR locus.

The Tyrosinase (TYR) gene, which encodes an enzyme involved in the first step of melanin production, lies in a large region under selection. A nonsynonymous mutation, rs1042602 (blue), in the TYR gene is reported as a candidate favored variant. A second, intronic variant, rs10831496 (red), in the GRM5 gene, 396 kbp upstream of TYR, has been shown to have a strong association with skin color. In contrast, iSAFE ranks mutation rs672144 (turquoise) as the top candidate for the favored variant out of ~22,000 mutations in the region (5 Mbp; see Supplementary Note 2). (a) This variant is the top-ranked mutation not only in CEU (Fig. 3c) but also in EUR, EAS, AMR, and SAS. The signal of selection is strong in all populations (iSAFE > 0.5, P << 1.3e-8 for all) except AFR, which does not show a signal of selection in this region. We plotted the haplotypes carrying rs672144 in all 5008 haplotypes (2504 samples) of 1000GP and found (Fig. 3d) that two distinct haplotypes carry the mutation, both with high frequencies maintained across a large stretch of the region, suggestive of a soft sweep with standing variation. (b) Frequencies of the derived alleles of rs10831496, rs672144, and rs1042602 are shown in red, turquoise, and blue, respectively. The iSAFE candidate (rs672144) may not have been reported earlier because it is near fixation in all populations of 1000GP except AFR (f = 0.27). (c) Each row is a haplotype and each column is a variant in the EUR populations of 1000GP, for a total of 1006 haplotypes (503 samples). Carrier haplotypes of the derived alleles of rs10831496, rs672144, and rs1042602 are shaded in red, turquoise, and blue, respectively. To keep the plot readable, we removed low-frequency SNPs (f_EUR < 0.2) and SNPs that are near fixation in the whole 1000GP (f_1000GP > 0.95). The previously suggested candidates rs1042602 and rs10831496 are fully linked to rs672144, but not to each other. The EUR haplotypes can be partitioned into 4 clusters. Each of the 4 clusters shows high homozygosity, suggestive of selection. However, rs1042602 can explain the sweep only in clusters C1+C2, and rs10831496 only in C1+C3. Only rs672144 explains all 4 clusters, providing a simpler explanation of selection in this region.
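The cluster argument in panel c amounts to a set-coverage check, sketched below. The cluster memberships are taken from the caption; the label "C4" for the fourth cluster and all variable names are assumptions made for illustration.

```python
# Four EUR haplotype clusters; "C4" is an assumed label for the fourth cluster.
clusters = {"C1", "C2", "C3", "C4"}

# Clusters whose haplotypes carry the derived allele, as stated in the caption.
carried_by = {
    "rs1042602": {"C1", "C2"},
    "rs10831496": {"C1", "C3"},
    "rs672144": {"C1", "C2", "C3", "C4"},
}

# A candidate "explains" the sweep only if every cluster carries it.
explains_all = [rsid for rsid, cs in carried_by.items() if cs >= clusters]
unexplained = {rsid: sorted(clusters - cs) for rsid, cs in carried_by.items()}

print(explains_all)  # ['rs672144']
print(unexplained)
```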

Supplementary Figure 11 iSAFE on the OCA2-HERC2 locus.

The mutation rs1448484 is the top-ranked iSAFE mutation in all populations of 1000GP except the African population, which does not show any signal of selection in this region. rs12913832 is a candidate favored mutation for selection in Europeans, proposed by Wilde et al. (2014). Supplementary Table SN2.2 provides the iSAFE rank of some other candidate mutations associated with pigmentation in this region (see Supplementary Note 2).

Supplementary Figure 12 iSAFE on the KITLG locus.

Top-ranked iSAFE mutations (circles) and the candidate mutation rs642742 (blue triangle) proposed by Miller et al. (2007). See Supplementary Note 2 and Supplementary Table SN2.3 for more details.

Supplementary Figure 13 iSAFE on the TRPV6 locus.

Ten mutations (rs11772526, rs4725602, rs11763225, rs7796010, rs11762011, rs13239916, rs4145394, rs10808023, rs10808021, and rs4726591) are highly linked and are the top 10 iSAFE candidate mutations in all 1000GP populations except AFR, where there is no signal of selection. See Supplementary Note 2 and Supplementary Table SN2.3 for more details.

Supplementary Figure 14 Simulation of selection on human demography.

(a) A model of human demography described by Fig. 4 and Table 2 of Gravel et al. (2011). The model assumes an out-of-Africa split at time T_B, with a bottleneck that reduced the effective population size from N_Af to N_B, allowing for migration at rate m_Af-B. The African population stays constant at N_Af up to the present generation. The model assumes a second split, between the European and Asian populations, at time T_EuAs, with a bottleneck reducing the Asian and European populations to N_As0 and N_Eu0, respectively. The bottleneck was followed by exponential growth at rates r_As and r_Eu, as well as migration among all three subpopulations, leading to the current populations from which East Asian (EAS), European (EUR), and African (AFR) individuals were sampled. We used default values for simulation parameters not assigned here (see Online Methods). (b) We simulated 1000 selective sweeps on a 5 Mbp region based on the model of human demography, with selection coefficient s = 0.05 and starting favored allele frequency ν₀ = 0.001. Selection begins at a random time after the out-of-Africa event, in the lineage of the EUR population (the target population). When the onset of selection precedes the split of EUR and EAS (> 23 kya), both EUR and EAS are under selection.
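To give a feel for the sweep parameters in panel b, the sketch below evaluates a deterministic logistic approximation of the favored allele frequency. This is an assumption-laden illustration, not the simulation procedure of the paper: it assumes a simple genic selection model in which the odds of the favored allele grow by a factor e^s per generation, and it ignores drift, demography, and migration. The function names are hypothetical, and the simulator used by the authors may parameterize s differently.

```python
import math

def logistic_frequency(nu0, s, t):
    """Deterministic favored allele frequency after t generations (assumed genic model)."""
    growth = math.exp(s * t)
    return nu0 * growth / (1.0 - nu0 + nu0 * growth)

def generations_to_reach(nu0, s, target):
    """Generations needed to go from frequency nu0 to target under the same model."""
    return math.log(target * (1.0 - nu0) / (nu0 * (1.0 - target))) / s

nu0, s = 0.001, 0.05  # values quoted in the caption
t_half = generations_to_reach(nu0, s, 0.5)
trajectory = [logistic_frequency(nu0, s, t) for t in range(0, 401, 100)]

print(f"generations to reach 0.5: {t_half:.1f}")
print([round(v, 4) for v in trajectory])
```

Under these assumptions the favored allele needs on the order of a hundred generations to become common, which is why the onset time relative to the population splits determines whether one or both populations carry the sweep.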

Supplementary Figure 15 Maximum difference in derived allele frequency (MDDAF).

We simulated 25,000 instances of the AFR, EUR, and EAS populations, based on a model of human demography (see Online Methods; Supplementary Fig. 14). (a) The MDDAF score of mutations as a function of the derived allele frequency in the target population, D_T. (b) Distribution of the MDDAF score for mutations with D_T > 0.9. (c) Empirical P value of the MDDAF score for mutations with D_T > 0.9. The dashed red lines mark the value 0.78, where MDDAF, given D_T > 0.9, has a P value less than 0.001.
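A minimal sketch of the two quantities in this caption follows. The exact MDDAF definition is given in the Online Methods, which are not reproduced here, so the formula below (target frequency minus the smallest frequency among the target and the other populations) is an assumption, as are the function names, the simple "fraction at least as large" P value, and the toy numbers.

```python
def mddaf(d_target, d_others):
    """Assumed definition: D_T minus the smallest derived allele frequency observed."""
    return d_target - min([d_target, *d_others])

def empirical_p_value(observed, neutral_scores):
    """Fraction of neutral scores that are at least as large as the observed score."""
    return sum(1 for s in neutral_scores if s >= observed) / len(neutral_scores)

# Toy example: derived allele frequency 0.95 in the target, 0.10 and 0.40 elsewhere.
score = mddaf(0.95, [0.10, 0.40])

# Hypothetical neutral MDDAF scores standing in for the simulated null distribution.
neutral = [0.05, 0.20, 0.35, 0.50, 0.65, 0.80, 0.30, 0.10, 0.90, 0.15]
p = empirical_p_value(score, neutral)

print(f"MDDAF = {score:.2f}, empirical P = {p:.2f}")
```

With a null distribution built from many neutral simulations restricted to D_T > 0.9, the same calculation would yield the kind of threshold shown by the dashed red lines.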

About this article

Cite this article

Akbari, A., Vitti, J., Iranmehr, A. et al. Identifying the favored mutation in a positive selective sweep. Nat Methods 15, 279–282 (2018). https://doi.org/10.1038/nmeth.4606

Received: 17 May 2017
Accepted: 08 January 2018
Published: 19 February 2018
Issue Date: 01 April 2018
DOI: https://doi.org/10.1038/nmeth.4606
