Positive selection acts on regulatory genetic variants in populations of European ancestry that affect ALDH2 gene expression

ALDH2 is a key enzyme in alcohol metabolism that protects cells from acetaldehyde toxicity. Using iHS, iSAFE and FST statistics, we identified regulatory acting variants affecting ALDH2 gene expression under positive selection in populations of European ancestry. Several SNPs (rs3184504, rs4766578, rs10774625, rs597808, rs653178, rs847892, rs2013002) that function as eQTLs for ALDH2 in various tissues showed evidence of strong positive selection. Very large pairwise FST values indicated high genetic differentiation at these loci between populations of European ancestry and populations of other global ancestries. Estimating the timing of positive selection on the beneficial alleles suggests that these variants were recently adapted approximately 3000–3700 years ago. The derived beneficial alleles are in complete linkage disequilibrium with the derived ALDH2 promoter variant rs886205, which is associated with higher transcriptional activity. The SNPs rs4766578 and rs847892 are located in binding sequences for the transcription factor HNF4A, which is an important regulatory element of ALDH2 gene expression. In contrast to the missense variant ALDH2 rs671 (ALDH2*2), which is common only in East Asian populations and is associated with greatly reduced enzyme activity and alcohol intolerance, the beneficial alleles of the regulatory variants identified in this study are associated with increased expression of ALDH2. This suggests adaptation of Europeans to higher alcohol consumption.

The Neolithic transition from a hunter-gatherer lifestyle to an agricultural one, about 9000-13,000 years ago, included substantial changes in food processing and dietary habits associated with plant and animal domestication 1. One of the key questions in biological anthropology is whether these changes resulted in selective pressure influencing the expression of genes in the human genome. Identifying such loci has the potential to detect the underlying genetic variants contributing to the risk for various human diseases such as autoimmune diseases, cancer, or cardiovascular disease 1. Alcohol consumption and culture-related drinking behavior are probably among the major changes in human dietary habits and lifestyle over the last 10,000 years. Production of larger amounts of alcoholic beverages had probably begun by the early Neolithic. A recent study reports archaeological evidence for cereal-based beer brewing by the semi-nomadic Natufians at Raqefet Cave (Mount Carmel in the north of Israel) dating back 11,700-13,700 years 2. Today, large amounts of alcohol are consumed in many societies. Recent data from the World Health Organization (WHO) show that worldwide about 3 million deaths and 132.6 million disability-adjusted life years are attributable to the harmful use of alcohol 3. In particular, Europe stands out in the WHO data as the region with the highest alcohol consumption and the highest burden of alcohol-related diseases. Heavy episodic drinking (binge drinking) as well as chronic alcohol consumption are associated with several very harmful effects such as alcoholic liver disease, intestinal inflammation, cancer, hypertension, brain damage including adverse behavioural changes, and decreased fertility 4,5. While heavy alcohol consumption can cause complex negative physiological effects, positive effects of light to moderate alcohol consumption have also been reported. Light to moderate alcohol consumption has been associated with a reduced risk
of some forms of cardiovascular disease and autoimmune diseases 6-10. The most commonly

www.nature.com/scientificreports/

5). Moreover, the seven SNPs identified as being under positive selection (by the iHS and FST statistics) were also identified by iSAFE as top-ranked mutations with iSAFE scores > 0.304, i.e., above the significance threshold (Fig. 2). These SNPs function as eQTLs for ALDH2, and the beneficial alleles are associated with increased ALDH2 gene expression in various human tissues (according to the set of tissues represented in GTEx) such as esophagus-mucosa, skin, muscle-skeletal, brain-nucleus accumbens, artery-tibial, artery-aorta, and thyroid. The average allele frequencies for these SNPs are given in Table 1; for most of these SNPs the frequency of the derived beneficial alleles reaches almost 50% in the European populations. In contrast, the derived alleles are very rare (< 0.3%) in populations of African and East Asian ancestry and at low frequency in populations of South Asian ancestry (< 7% with the exception of rs847892). We also compared the allele frequencies at these loci with those of ancient Eurasians, including ancient hunter-gatherers (8.2-7.5 kya), using data from a previous study 50. The allele frequency data from that study show that, for the SNPs rs3184504, rs4766578, rs10774625, and rs653178, the ancestral alleles were fixed in ancient European hunter-gatherers. As expected, the Neanderthal and Denisovan data on the UCSC Genome Browser also show only ancestral alleles at these loci. In contrast, early European farmers (8.4-4.2 kya) and individuals with steppe ancestry (5.4-3.6 kya) had derived-allele frequencies between 8 and 25% at these loci. The analysis of the selection coefficient (s) revealed that s ranged from 0.04 to 0.1, suggesting very strong positive selection acting on these SNP eQTLs (Table 1). The corresponding allele trajectory plots, inferred by the Clues method, are presented in Supplementary Figure S2. In the European sample GBR we estimated
the timing of positive selection on the derived beneficial alleles (using the Startmrca method) to be about 3.0 to 3.7 kya, with the exception of SNP rs847892, which we date to 6.0 kya (Table 1). This range of estimates is very similar to the TMRCA estimates calculated for the other European samples (TSI and FIN) (Supplementary Table 6). We also calculated the TMRCA for the derived alleles of the East Asian-specific polymorphisms (missense variants) at rs671-A/G and rs3782886-C/T in the East Asian population CHB, which yielded estimates of 5.8 kya (CI: 4.8-6.7) and 5.4 kya (CI: 3.3-6.5), respectively. In addition, we used the Clues method to obtain allele ages for the seven SNPs under positive selection. For the SNPs rs3184504, rs4766578 and rs10774625, Clues calculated a timing of selection (2.6 kya to 4.5 kya) similar to that obtained with the Startmrca method (Supplementary Table S7). However, the timing of selection for the SNPs rs597808, rs653178 and rs2013002 was estimated to be much older, ranging from 7.4 kya to 14.1 kya, and for rs847892 from 21.3 kya to 30.1 kya. We further included in the analysis the ALDH2 promoter variant rs886205-A/G, which is located at −360 bp from the ATG start codon of the ALDH2 gene 51. This promoter variant shows very large genetic differentiation, with a global locus-specific FST = 0.378 (s.d. = 0.055). In the 1000 Genomes data the derived allele A is the common allele in European and South Asian populations, with average frequencies of about 83% and 71%, respectively. In contrast, in populations of African and East Asian ancestry the common allele is the ancestral allele G, with frequencies of about 78% and 84%, respectively. For the ALDH2 promoter variant, a study showed (in in vivo and in vitro experiments) that the −360G (ancestral) allele has a significantly lower basal transcriptional activity than the −360A (derived) allele 52. Our LD analysis revealed that the positively selected SNPs are in complete LD (D' = 1) with the ALDH2 promoter variant
rs886205 (Table 2). The chromatin state data from RegulomeDB showed that the identified SNPs are associated with active transcription start sites (TSS), enhancers and strong transcription in different tissues (Table 3). Importantly, the positively selected SNPs rs4766578 and rs847892 are located in the binding motif for the transcription factor hepatocyte nuclear factor 4 alpha (HNF4A). This transcription factor is an important regulatory element of ALDH2. The mapped phenotypes (Table 3) show that the positively selected SNPs are associated with various traits and diseases, in particular with blood pressure, cardiovascular disease, cholesterol level and autoimmune diseases. The variants rs597808 and rs2013002 are also associated with alcohol drinking and physiological traits such as blood pressure 53. We pooled related traits (Supplementary Table 8) into four main trait categories, namely autoimmune diseases (AIS), blood pressure (BP), cardiovascular disease (CDS) and cancer, to test the null hypothesis that trait and allele state are independent. We found a significant relationship between allele state and trait (χ² = 28.828, df = 3, p value = 2.4e−06); the derived beneficial alleles are positively associated with AIS, BP and CDS, whereas the ancestral alleles are associated with cancer.
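As a rough cross-check of the reported test statistic, the upper-tail probability of a χ² distribution with 3 degrees of freedom has a closed form, so the p value can be recomputed from χ² = 28.828 alone, without the underlying contingency table (which is not reproduced here). The sketch below is illustrative only; the function name `chi2_sf_df3` is hypothetical and is not part of the authors' analysis pipeline.

```python
import math


def chi2_sf_df3(x: float) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable with df = 3.

    Closed form for df = 3:
        erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)
    """
    if x < 0:
        raise ValueError("chi-square statistic must be non-negative")
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)


# Test statistic reported in the text (chi-square = 28.828, df = 3).
p_value = chi2_sf_df3(28.828)
print(f"p = {p_value:.2e}")
```

The printed value should agree with the reported p value to about two significant figures. The closed form holds only for df = 3; for other degrees of freedom a general-purpose statistics library would be the more natural choice.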

Discussion
This study provides evidence of positive selection across the human chromosomal region 12q24.12. This finding is in line with two previous studies 55,56. We identified seven SNPs (rs3184504, rs4766578, rs10774625, rs597808, rs653178, rs847892, rs2013002) that are under positive selection and show very large global locus-specific FST values (> 0.3), indicating high genetic differentiation between populations of European ancestry and populations of other global ancestries (Table 1). The GTEx data show that these SNPs function primarily as eQTLs for the ALDH2 gene. We further found that this genomic region is enriched in eQTLs that influence ALDH2 gene expression. A high number of these SNP eQTLs had significant iHS scores in the populations of European ancestry. In contrast, cis-eQTLs of the other genes located at chr12q24.12 showed no significant iHS values. In addition, the iHS results are supported by the iSAFE analysis, which ranked the identified SNPs (eQTLs) as top-ranked mutations, with iSAFE scores > 0.304. This indicates that the targets of positive selection are regulatory variants that influence ALDH2 gene expression. The derived beneficial alleles at these SNP eQTLs are associated with increased expression of ALDH2 in multiple human tissues. However, in the GTEx database, no ALDH2 cis-eQTLs are reported for liver tissue. Nonetheless, the two positively selected SNPs rs4766578 and rs847892 are located in binding sequences for the transcription factor HNF4A. That transcription factor is considered to be a master regulator of liver-specific gene expression 57 and is an important regulatory element of ALDH2 gene expression 35,58. Positive selection leads to changes in the allele frequencies at the transcription factor binding sites, which could potentially lead to significant changes in the binding specificity of the liver-specific transcription factor HNF4A.
Therefore, we are inclined to hypothesize that individuals carrying the positively selected haplotypes will have higher basal expression of ALDH2 than individuals lacking the positively selected haplotypes. In addition, the RegulomeDB data indicate that the positively selected SNPs are located in regions carrying active enhancer histone marks in different tissues, including the liver. Moreover, the positively selected SNP eQTLs are in complete LD with the ALDH2 promoter variant rs886205. This promoter polymorphism influences individual differences in acetaldehyde elimination. The ancestral allele G, the common allele in populations of African and East Asian ancestry, has a lower basal transcriptional activity than the derived allele A, the common allele in populations of European and South Asian ancestry 52. These results suggest that higher transcriptional activity and increased ALDH2 expression in individuals of European ancestry represent a form of genetic adaptation to increased alcohol consumption, possibly enabling faster detoxification of acetaldehyde. The derived beneficial alleles at these loci reach almost 50% in the European populations, whereas in African and East Asian populations the frequencies are very low (< 0.3%). The ancestral alleles at these positively selected loci appear to be fixed in ancient European hunter-gatherers, but in early farmers and individuals with steppe ancestry the frequencies of the derived alleles already range between 8 and 25% 50. For the SNPs rs3184504, rs4766578 and rs10774625, the Clues method found evidence of very strong positive selection with s = 0.1, corresponding to an allele age of about 2.6 kya to 4.5 kya (Table 1). This is in line with the timing of positive selection on the beneficial alleles in the European population GBR estimated by the Startmrca method, which ranges from about 3.0 kya to 3.7 kya (except for rs847892, for which the TMRCA was estimated at 6.0 kya).

Table 3. GTEx and RegulomeDB data on SNPs under positive selection in European populations (GBR, TSI, FIN). Also given is a summary of reported traits from the NHGRI-EBI GWAS catalogue. GTEx eQTL-eGene interactions with p < 0.0001. RegulomeDB rank: 2b: TF binding + any motif + DNase Footprint + DNase peak; 3a: TF binding + any motif + DNase peak; 4-5: TF binding + DNase peak; 6: motif hit. The RegulomeDB probability score ranges from 0 to 1, with 1 being most likely to be a regulatory variant (for further details see 54). Transcription factor HNF4A, an important regulatory element of ALDH2 gene expression, is given in bold.

However, in contrast to the Startmrca method, the Clues method estimated the allele age for the other SNPs to be much older, at about 7.4 kya to 14.1 kya (again with the exception of rs847892, for which the allele age was estimated at about 21.3 kya to 30.1 kya). Nevertheless, the strong putative selection (s = 0.1) acting on several SNPs indicates that selection at these loci is much more intense than, for example, at the lactase persistence SNP rs4988235, for which s = 0.0161 was calculated 59. We further calculated the TMRCA for the East Asian-specific derived alleles rs671-A and rs3782886-C (using the Startmrca method), yielding estimates of 5.8 kya (CI: 4.8-6.7) and 5.4 kya (CI: 4.3-6.5), respectively. The variant rs3782886, which is in LD with rs671, shows signals of very recent selection over the past 2000-3000 years in the Japanese population, as reported in a recent study 49. Notably, rs671-A and rs1229984-A (ADH1B locus) were found in a subsequent study to be significantly associated with better survival in the Japanese population 60. The estimated TMRCA for the derived
alleles in our study suggests that these alleles spread in East Asia at a much earlier time than the beneficial alleles in populations of European ancestry. Archaeological evidence indicates early production of fermented alcohol in China 61. Analysis of starch granules, phytoliths and fungi in food residues adhering to 8000-7000-year-old alcohol-making pottery vessels suggests that alcoholic beverages were already being produced in East Asia in the early Neolithic 62. For Europe, archaeologically recognizable brewing material in Central European lakeside settlements shows that alcoholic beverages were being produced in this region in the late Neolithic period, about 6000 years ago 63. A recent study suggests that fermented alcoholic beverages such as beer were already extensively consumed in Central Europe during the Iron Age 64. Later, in Greco-Roman antiquity, a richly developed viticulture with high wine production was achieved and, in this period, wine became part of the daily diet of many people 65. Alcohol consumption has apparently increased steadily since then in Europe, especially in the nineteenth century. In Germany, for example, the high level of consumption in the early nineteenth century, in particular of strong spirits, was referred to, in analogy to the plague, as Branntweinpest (brandy plague). Since the rs671-A allele leads to an inactive enzyme and thus to an excess of toxic acetaldehyde in cells, with its negative physiological effects, we suggest that this allele may explain the differences in the signature of positive selection between populations of European and East Asian ancestry. The ALDH2 enzyme plays a critical role in the detoxification of both acetaldehyde and ROS-generated aldehyde adducts such as 4-hydroxy-2-nonenal and malondialdehyde. This enzyme thus has cytoprotective effects, reducing oxidative stress 66,67. In particular, the ALDH2*2 variant (rs671-A/A), which is common only in individuals of East Asian ancestry, has been intensively
studied in East Asians. While individuals with the rs671-A allele have a reduced risk of developing alcoholism, the allele increases their cancer risk 68,69. Nevertheless, this allele was found to be associated with better survival in the Japanese population 60. In European populations this allele is virtually absent. In our study, however, the identified variants that are under recent positive selection in European populations act as regulatory variants and are associated with increased ALDH2 gene expression in various human tissues. This suggests that individuals carrying these beneficial alleles should be able to clear higher amounts of acetaldehyde and ROS-generated aldehyde adducts from the body more quickly. However, a recent study reports higher methylation in the ALDH2 promoter region in alcohol-dependent patients compared to controls 70. Furthermore, that study suggests that positive and negative regulatory elements interact at the ALDH2 promoter to induce genotype-mediated epigenetic changes, leading to differential transcriptional activity of this gene. In addition, a GWAS reported that the SNPs rs597808 and rs2013002, which were found in this study to be under positive selection, are associated with alcohol consumption and risk of developing hypertension 53. We therefore suggest that individuals carrying the positively selected alleles may be able to consume more alcohol (over longer time periods), but may also have a higher likelihood of becoming heavy drinkers and alcohol dependent. This, in turn, could lead to increased methylation of the ALDH2 promoter, resulting in decreased ALDH2 gene expression. Accordingly, the protective effects of ALDH2 against oxidative damage through acetaldehyde would be lost, resulting in an increased risk of numerous oxidative stress-related diseases such as cancer, diabetes, inflammatory disorders and cardiovascular conditions such as hypertension and stroke. Indeed, we found that the derived beneficial alleles are positively associated with AIS, BP and CDS, whereas the ancestral alleles are associated with cancer.
To conclude, we found that very strong positive selection (with s ranging between 0.04 and 0.1) acts on regulatory variants affecting ALDH2 gene expression in populations of European ancestry. Estimation of the timing of positive selection on the beneficial alleles suggests that these variants were recently adapted, approximately 3000 to 3700 years ago. The timing of selection and the signals of very strong selection make the chromosomal region chr12q24.12 one of the most intensely selected regions in the genomes of individuals of European ancestry. In contrast to the known functional consequence of the ALDH2*2 variant (rs671) in East Asians, which is associated with alcohol intolerance, in Europeans the beneficial derived alleles are associated with increased ALDH2 gene expression. This suggests local adaptation to higher alcohol consumption in Europeans. We further hypothesize that higher ALDH2 expression leads to an increased detoxification capacity for acetaldehyde, but possibly also to an increased likelihood of chronic alcohol abuse, which in turn leads to decreased ALDH2 expression and thus to increased cell toxicity from EtOH-derived acetaldehyde as well as from ROS-generated aldehydes.
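To give a feel for what selection coefficients of this magnitude imply, the sketch below uses a deliberately simple textbook model: deterministic genic (haploid) selection, in which the log-odds of the allele frequency increase by ln(1 + s) per generation. It is not the Clues or Startmrca method, it ignores drift, dominance and demography, and its definition of s may differ from the one used for the estimates in Table 1. The function name, the start and end frequencies (8% and 50%, loosely taken from the frequencies quoted in the text) and the generation time of 29 years are all assumptions made for illustration.

```python
import math


def generations_to_reach(p0: float, p1: float, s: float) -> float:
    """Generations needed for an allele to rise from frequency p0 to p1.

    Deterministic genic selection: p' = p(1 + s) / (1 + p*s), which means
    logit(p) grows by ln(1 + s) every generation.
    """
    if not (0.0 < p0 < 1.0 and 0.0 < p1 < 1.0):
        raise ValueError("frequencies must lie strictly between 0 and 1")
    if s <= 0.0:
        raise ValueError("selection coefficient must be positive")

    def logit(p: float) -> float:
        return math.log(p / (1.0 - p))

    return (logit(p1) - logit(p0)) / math.log(1.0 + s)


YEARS_PER_GENERATION = 29.0  # assumed generation time, not taken from the text

for s in (0.04, 0.1):  # range of selection coefficients reported in the text
    gens = generations_to_reach(0.08, 0.5, s)
    print(f"s = {s}: {gens:.1f} generations, about {gens * YEARS_PER_GENERATION:.0f} years")
```

Under these simplifying assumptions the sweep time scales roughly as 1/s, which is why a coefficient near 0.1 is compatible with a large frequency change over a short period.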

Genomic data
We downloaded the phased genomic datasets from the 1000 Genomes project (phase 3; ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) 71. Only 1000 Genomes data were used in this study according to the Declaration of Helsinki. We obtained SNP data from 12 human populations: three representative populations each of African ancestry (AFR), European ancestry (EUR), South Asian ancestry (SAS) and East Asian ancestry (EAS) (population names in accordance with the 1000 Genomes project; see Supplementary Table 9). We excluded

Estimating timing of positive selection and selection coefficient (s)
We estimated the timing of selection on a beneficial allele using the R package Startmrca 29. The method applies a Markov chain Monte Carlo (MCMC) simulation that samples over the unknown ancestral haplotype to generate a sample of the posterior distribution for the time to the most recent common ancestor (TMRCA). The model takes advantage of both the length of the ancestral haplotype on each chromosome and the accumulation of derived mutations on the ancestral haplotype. The model requires a sample (panel) containing the haplotypes with the selected allele and a reference panel of haplotypes without the selected allele. In this study, populations of European ancestry were used as samples, and populations of the other analysed genetic ancestries were used as reference panels. Because the calculated TMRCA estimates for the populations of European ancestry were very similar regardless of the reference panels used, we report in this study only the TMRCA estimates calculated for the European populations, using the European populations both as sample panel and as reference panel. We also estimated the TMRCA for the East Asian-specific functional variants rs671 (ALDH2 locus) and rs3782886 (BRAP locus) 49 using the East Asian population CHB as sample and reference panel (see Supplementary Table 9 for the corresponding population names). After normalising the TMRCA data we calculated 95% credible intervals (CI = 0.95) for the timing estimates using the equal-tailed interval method implemented in the R package bayestestR 83. We used recombination rates from the sex-averaged recombination map from deCODE to model recombination rate variation across the human

Table 1. SNPs under positive selection at the human chromosomal region 12q24.12 in populations of European ancestry (GBR, TSI, FIN). Given are the iHS scores and the calculated (−log) p-values (values in bold are significant after Bonferroni correction, p < 1 × 10^-5), the timing (t) of positive selection on the derived beneficial allele in thousand years ago (kya) with 95% credible interval (CI) (rounded to one decimal place), the estimated selection coefficients (s) in GBR, the average allele frequency in % for the derived beneficial allele/ancestral allele in the different ancestries, and the global locus-specific FST values (sd = standard deviation) calculated across all analysed populations.
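To illustrate how the allele frequency contrast described in the text translates into large FST values, the sketch below computes Wright's FST from expected heterozygosities for a single biallelic locus, weighting all populations equally. This simple definition is an assumption made for illustration and is not necessarily the estimator used for the values in Table 1; the function name and the two example frequencies (0.5 and 0.003, approximating the derived-allele frequencies quoted for European versus African and East Asian populations) are likewise illustrative.

```python
from typing import Sequence


def wright_fst(freqs: Sequence[float]) -> float:
    """Wright's F_ST = (H_T - H_S) / H_T for one biallelic locus.

    freqs holds the frequency of the same allele in each population;
    all populations receive equal weight.
    """
    if not freqs:
        raise ValueError("at least one population frequency is required")
    if any(p < 0.0 or p > 1.0 for p in freqs):
        raise ValueError("allele frequencies must lie in [0, 1]")

    p_bar = sum(freqs) / len(freqs)
    h_total = 2.0 * p_bar * (1.0 - p_bar)  # heterozygosity of the pooled population
    if h_total == 0.0:
        return 0.0  # locus is monomorphic everywhere, so there is no differentiation
    h_sub = sum(2.0 * p * (1.0 - p) for p in freqs) / len(freqs)  # mean within-population value
    return (h_total - h_sub) / h_total


# Derived allele near 50% in one population and near 0.3% in another.
print(f"FST = {wright_fst([0.5, 0.003]):.3f}")
```

With these two example frequencies the pairwise value exceeds 0.3, in line with the qualitative statement that the loci are highly differentiated; the global values in the table are computed across all analysed populations and need not match this two-population figure.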

Figure 2. iSAFE scores plotted for SNPs surrounding the chr12q24.12 region (5.6 Mbp window) for the population GBR (European genetic ancestry); also indicated are the SNPs identified by the iHS statistic as being under positive selection; the top-ranked SNPs are above the threshold score of iSAFE > 0.304.