Genome-wide association study identifies 74 loci associated with educational attainment

Abstract

Educational attainment is strongly influenced by social and other environmental factors, but genetic factors are estimated to account for at least 20% of the variation across individuals (ref. 1). Here we report the results of a genome-wide association study (GWAS) for educational attainment that extends our earlier discovery sample (refs 1, 2) of 101,069 individuals to 293,723 individuals, and a replication study in an independent sample of 111,349 individuals from the UK Biobank. We identify 74 genome-wide significant loci associated with the number of years of schooling completed. Single-nucleotide polymorphisms associated with educational attainment are disproportionately found in genomic regions regulating gene expression in the fetal brain. Candidate genes are preferentially expressed in neural tissue, especially during the prenatal period, and enriched for biological pathways involved in neural development. Our findings demonstrate that, even for a behavioural phenotype that is mostly environmentally determined, a well-powered GWAS identifies replicable associated genetic variants that suggest biologically relevant pathways. Because educational attainment is measured in large numbers of individuals, it will continue to be useful as a proxy phenotype in efforts to characterize the genetic influences of related phenotypes, including cognition and neuropsychiatric diseases.

Figure 1: Manhattan plot for EduYears associations (n = 293,723).
Figure 2: Genetic correlations between EduYears and other traits.
Figure 3: Overview of biological annotation.

References

  1. Rietveld, C. A. et al. GWAS of 126,559 individuals identifies genetic variants associated with educational attainment. Science 340, 1467–1471 (2013).

  2. Rietveld, C. A. et al. Replicability and robustness of genome-wide-association studies for behavioral traits. Psychol. Sci. 25, 1975–1986 (2014).

  3. Yang, J. et al. Genomic inflation factors under polygenic inheritance. Eur. J. Hum. Genet. 19, 807–812 (2011).

  4. Bulik-Sullivan, B. K. et al. LD score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47, 291–295 (2015).

  5. Sudlow, C. et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015).

  6. Fowler, T., Zammit, S., Owen, M. J. & Rasmussen, F. A population-based study of shared genetic variation between premorbid IQ and psychosis among male twin pairs and sibling pairs from Sweden. Arch. Gen. Psychiatry 69, 460–466 (2012).

  7. Tambs, K., Sundet, J. M., Magnus, P. & Berg, K. Genetic and environmental contributions to the covariance between occupational status, educational attainment, and IQ: a study of twins. Behav. Genet. 19, 209–222 (1989).

  8. Thompson, L. A., Detterman, D. K. & Plomin, R. Associations between cognitive abilities and scholastic achievement: Genetic overlap but environmental differences. Psychol. Sci. 2, 158–165 (1991).

  9. Bulik-Sullivan, B. et al. An atlas of genetic correlations across human diseases and traits. Nat. Genet. 47, 1236–1241 (2015).

  10. Ardlie, K. G. et al.; GTEx Consortium. Human genomics. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science 348, 648–660 (2015).

  11. Pers, T. H. et al. Biological interpretation of genome-wide association studies using predicted gene functions. Nat. Commun. 6, 5890 (2015).

  12. Allen Institute for Brain Science. BrainSpan atlas of the developing human brain http://www.brainspan.org (2015).

  13. Purcell, S. M. et al. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 460, 748–752 (2009).

  14. Krapohl, E. et al. The high heritability of educational achievement reflects many genetically influenced traits, not just intelligence. Proc. Natl Acad. Sci. USA 111, 15273–15278 (2014).

  15. Branigan, A. R., McCallum, K. J. & Freese, J. Variation in the heritability of educational attainment: An international meta-analysis. Social Forces 92, 109–140 (2013).

  16. Heath, A. C. et al. Education policy and the heritability of educational attainment. Nature 314, 734–736 (1985).

  17. Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228–1235 (2015).


Acknowledgements

This research was carried out under the auspices of the Social Science Genetic Association Consortium (SSGAC). This research has also been conducted using the UK Biobank Resource. This study was supported by funding from the Ragnar Söderberg Foundation (E9/11), the Swedish Research Council (421-2013-1061), The Jan Wallander and Tom Hedelius Foundation, an ERC Consolidator Grant (647648 EdGe), the Pershing Square Fund of the Foundations of Human Behavior, and the NIA/NIH through grants P01-AG005842, P01-AG005842-20S2, P30-AG012810, and T32-AG000186-23 to NBER, and R01-AG042568 to USC. We thank S. Cunningham, N. Galla and J. Rashtian for research assistance. A full list of acknowledgments is provided in the Supplementary Information.

Author information

Contributions

Study design and management: D.J.B., D.Ce., T.E., M.J., P.D.K. and P.M.V. Quality control and meta-analysis: A.O., G.B.C., T.E., M.A.F., C.A.R. and T.H.P. Stratification: P.T., J.P.B., C.A.R. and J.Y. Genetic overlap: J.P.B., M.A.F. and P.T. Biological annotation: J.J.L., T.E., T.H.P., J.K.P., J.H.B., J.P.B., L.F., V.E., G.A.M., M.A.F., S.F.W.M., P.Ti., R.A.P., R.d.V. and H.J.W. Prediction and mediation: J.P.B., M.A.F. and J.Y. G×E: D.Co., S.F.L., K.O.L., S.O. and K.T. Replication in UKB: M.A.F. and C.A.R. SSGAC advisory board: D.Co., T.E., A.H., R.F.K., D.I.L., S.E.M., M.N.M., G.D.S. and P.M.V. All authors contributed to and critically reviewed the manuscript. Authors not listed above contributed to the recruitment, genotyping, or data processing for the contributing components of the meta-analysis. For a full list of author contributions, see Supplementary Information section 8.

Corresponding authors

Correspondence to Peter M. Visscher, Philipp D. Koellinger, David Cesarini or Daniel J. Benjamin.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Additional information

Results can be downloaded from the SSGAC website (http://ssgac.org/Data.php). Data for our analyses come from many studies and organizations, some of which are subject to an MTA, and are listed in the Supplementary Information.

Extended data figures and tables

Extended Data Figure 1 Q–Q plot of the genome-wide association meta-analysis of 64 EduYears results files (n = 293,723).

Observed and expected P values are on a −log10 scale (two-tailed). The grey region depicts the 95% confidence interval under the null hypothesis of a uniform P value distribution. The observed λGC is 1.28. (As reported in Supplementary Information section 1.5.4, the unweighted mean λGC is 1.02, the unweighted median is 1.01, and the range across cohorts is 0.95–1.15.)
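
The genomic inflation factor quoted in this legend can be illustrated with a minimal sketch. It assumes the usual definition of λGC (the median observed 1-d.f. χ² statistic divided by the median expected under the null) and two-tailed P values as input; the function names are hypothetical and this is not the authors' pipeline.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()
# Median of a chi-square distribution with 1 degree of freedom (about 0.4549).
MEDIAN_CHI2_1DF = _STD_NORMAL.inv_cdf(0.75) ** 2


def p_to_chi2(p):
    """Convert a two-tailed P value (0 < p <= 1) to a 1-d.f. chi-square statistic."""
    # Using the lower tail keeps precision for very small P values.
    z = _STD_NORMAL.inv_cdf(p / 2.0)
    return z * z


def lambda_gc(p_values):
    """Genomic inflation factor: median observed chi-square over the null median."""
    return median(p_to_chi2(p) for p in p_values) / MEDIAN_CHI2_1DF


# Illustrative input only: evenly spaced P values mimic the uniform null,
# so the resulting inflation factor should be close to 1.
null_like = [(i + 0.5) / 1000 for i in range(1000)]
print(round(lambda_gc(null_like), 3))
```

A value well above 1, such as the 1.28 reported here, indicates inflated test statistics, which in a polygenic trait may reflect true signal rather than confounding (refs 3, 4).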

Extended Data Figure 2 The distribution of effect sizes of the 74 lead SNPs.

a, SNPs ordered by absolute value of the standardized effect of one more copy of the education-increasing allele, with 95% confidence intervals. b, SNPs ordered by R². Effects on EduYears are benchmarked against the top 74 genome-wide significant hits identified in the largest GWAS conducted to date of height and body mass index (BMI), and the 48 associations reported for waist-to-hip ratio adjusted for BMI (WHR). These results are based on the GIANT consortium’s publicly available results for pooled analyses restricted to European-ancestry individuals: https://www.broadinstitute.org/collaboration/giant/index.php/GIANT_consortium.
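
The link between the two orderings in this figure can be sketched as follows. The sketch assumes an additive model, Hardy–Weinberg equilibrium and a phenotype standardized to unit variance, under which a SNP's variance explained is approximately 2p(1 − p)β². This formula and the function name are assumptions for illustration; the legend does not state how the authors computed R².

```python
def snp_r2(beta_std, allele_freq):
    """Approximate variance explained by one SNP.

    beta_std    -- per-allele effect in phenotypic standard deviations
    allele_freq -- frequency of the effect allele (between 0 and 1)
    """
    if not 0.0 <= allele_freq <= 1.0:
        raise ValueError("allele_freq must lie in [0, 1]")
    # Genotype variance under Hardy-Weinberg equilibrium is 2p(1 - p).
    return 2.0 * allele_freq * (1.0 - allele_freq) * beta_std ** 2


# Illustrative numbers only (not taken from the study).
print(snp_r2(beta_std=0.02, allele_freq=0.5))
```

Because R² depends on allele frequency as well as effect size, the ordering of SNPs in panel b need not match the ordering in panel a.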

Extended Data Figure 3 Assessing the extent to which population stratification affects the estimates from the GWAS.

a, LD score regression plot with the summary statistics from the GWAS. Each point represents an LD score quantile for a chromosome (the x and y coordinates of the point are the mean LD score and the mean χ² statistic of variants in that quantile). That the intercept is close to 1 and that the χ² statistics increase linearly with the LD scores suggest that the bulk of the inflation in the χ² statistics is due to true polygenic signal and not to population stratification. b, Estimates and 95% confidence intervals from individual-level and within-family regressions of EduYears on polygenic scores, for scores constructed with sets of SNPs meeting different P value thresholds. In addition to the analyses shown here, we conduct a sign concordance test, and we decompose the variance of the polygenic score. Overall, these analyses suggest that population stratification is unlikely to be a major concern for our 74 lead SNPs. See Supplementary Information section 3 for additional details.
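
The reasoning in panel a can be illustrated with a simplified sketch: an ordinary least-squares line fitted to mean χ² against mean LD score per bin. LD score regression as described in ref. 4 uses a weighted regression with jackknife standard errors, so the unweighted fit below is a stand-in for illustration, and the bin values are invented.

```python
def ols_line(x, y):
    """Return (intercept, slope) of an unweighted least-squares fit of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical bins: mean LD score and mean chi-square statistic per quantile.
mean_ld_score = [20.0, 60.0, 100.0, 140.0, 180.0]
mean_chi2 = [1.0 + 0.002 * ld for ld in mean_ld_score]

intercept, slope = ols_line(mean_ld_score, mean_chi2)
# An intercept near 1 with a positive slope is the pattern expected when
# inflation comes from polygenic signal rather than stratification.
print(round(intercept, 3), round(slope, 4))
```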

Extended Data Figure 4 Replication of 74 lead SNPs in the UK Biobank data.

Estimated effect sizes (in years of schooling) and 95% confidence intervals of the 74 lead SNPs in the meta-analysis sample (n = 293,723) and the UK Biobank replication sample (n = 111,349). The reference allele is the allele associated with higher values of EduYears in the meta-analysis sample. SNPs are in descending order of R² in the meta-analysis sample. Of the 74 lead SNPs, 72 have the anticipated sign in the replication sample, 52 replicate at the 0.05 significance level, and 7 replicate at the 5 × 10⁻⁸ significance level.
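
To give a sense of how unlikely 72 concordant signs out of 74 would be by chance, the sketch below computes a one-sided binomial tail probability under a null of 50% concordance. This null is an assumption made for illustration; the benchmark used by the authors may differ (see Supplementary Information).

```python
from math import comb


def sign_test_p(n_concordant, n_total):
    """One-sided P(X >= n_concordant) for X ~ Binomial(n_total, 0.5)."""
    tail = sum(comb(n_total, k) for k in range(n_concordant, n_total + 1))
    return tail / 2 ** n_total


# 72 of the 74 lead SNPs have the anticipated sign in the replication sample.
# The tail count is C(74,72) + C(74,73) + C(74,74) = 2776, so the
# probability is 2776 / 2**74, roughly 1.5e-19.
print(sign_test_p(72, 74))
```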

Extended Data Figure 5 Q–Q plots for the 74 lead EduYears SNPs (or LD proxies) in published GWAS of other phenotypes.

SNPs with concordant effects on both phenotypes are pink, and SNPs with discordant effects are blue. SNPs outside the grey area pass Bonferroni-corrected significance thresholds that correct for the total number of SNPs we tested (P < 0.05/74 = 6.8 × 10⁻⁴) and are labelled with their rs numbers. Observed and expected P values are on a −log10 scale. For the sign concordance test: *P < 0.05, **P < 0.01 and ***P < 0.001.
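
The Bonferroni threshold in this legend is simple arithmetic, sketched below. The SNP labels and P values in the usage example are placeholders, not results from the study.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


def passes_bonferroni(p_values, n_tests, alpha=0.05):
    """Map each SNP label to True if its P value is below the corrected threshold."""
    threshold = bonferroni_threshold(alpha, n_tests)
    return {snp: p < threshold for snp, p in p_values.items()}


# 0.05 / 74 is about 6.76e-4, which rounds to the 6.8e-4 quoted in the legend.
print(bonferroni_threshold(0.05, 74))

# Placeholder SNP labels and P values for illustration.
print(passes_bonferroni({"snp_a": 1e-5, "snp_b": 0.01}, n_tests=74))
```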

Extended Data Figure 6 Regional association plots for four of the ten prioritized SNPs for mental health, brain anatomy, and anthropometric phenotypes identified using EduYears as a proxy phenotype.

a, Cognitive performance; b, hippocampus; c, intracranial volume; d, neuroticism. The four were selected because very few genome-wide significant SNPs have been previously reported for these traits. Data sources and methods are described in Supplementary Information section 3. The R² values are from the hg19 / 1000 Genomes Nov 2014 EUR reference samples. The figures were created with LocusZoom (http://csg.sph.umich.edu/locuszoom/). Mb, megabases.

Extended Data Figure 7 Application of fgwas to EduYears.

See Supplementary Information section 4.2 for further details. a, The results of single-annotation models. ‘Enrichment’ refers to the factor by which the prior odds of association at an LD-defined region must be multiplied if the region bears the given annotation; this factor is estimated using an empirical Bayes method applied to all SNPs in the GWAS meta-analysis regardless of statistical significance. Annotations were derived from ENCODE and a number of other data sources. Plotted are the base 2 logarithms of the enrichments and their 95% confidence intervals. Multiple instances of the same annotation correspond to independent replicates of the same experiment. b, The results of combining multiple annotations and applying model selection and cross-validation. Although the maximum-likelihood estimates are plotted, model selection was performed with penalized likelihood. c, Reweighting of GWAS loci. Each point represents an LD-defined region of the genome, and shown are the regional posterior probabilities of association (PPAs). The x axis gives the PPA calculated from the GWAS summary statistics alone, whereas the y axis gives the PPA upon reweighting on the basis of the annotations in b. The orange points represent genomic regions where the PPA is equivalent to the standard GWAS significance threshold only upon reweighting.
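
The definition of enrichment as a multiplier on prior odds, and the reweighting in panel c, can be sketched with basic odds arithmetic. This is a deliberately simplified illustration: it assumes a single region-level Bayes factor summarizing the GWAS evidence and a single annotation, whereas fgwas fits a hierarchical model across all regions and SNPs. All function names and numbers are hypothetical.

```python
from math import log2


def prob_to_odds(p):
    """Convert a probability (0 <= p < 1) to odds."""
    return p / (1.0 - p)


def odds_to_prob(odds):
    """Convert odds back to a probability."""
    return odds / (1.0 + odds)


def regional_ppa(prior_prob, bayes_factor, enrichment=1.0):
    """Posterior probability of association for one region.

    The prior odds are multiplied by the annotation enrichment (1.0 means
    no annotation effect) and by the Bayes factor from the GWAS evidence.
    """
    posterior_odds = prob_to_odds(prior_prob) * enrichment * bayes_factor
    return odds_to_prob(posterior_odds)


# Illustrative values only.
baseline = regional_ppa(prior_prob=0.01, bayes_factor=50.0)
reweighted = regional_ppa(prior_prob=0.01, bayes_factor=50.0, enrichment=4.0)
print(baseline, reweighted, log2(4.0))  # panel a plots log2 of the enrichment
```

In this simplified picture, a region whose GWAS evidence alone falls short of a given PPA cutoff can cross it once an enriched annotation raises its prior odds, which is the situation the orange points in panel c describe.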

Extended Data Figure 8 Tissue-level biological annotation.

a, The enrichment factor for a given tissue type is the ratio of variance explained by SNPs in that group to the overall fraction of SNPs in that group. To benchmark the estimates for EduYears, we compare the enrichment factors to those obtained when we use the largest GWAS conducted to date on BMI, height, and waist-to-hip ratio adjusted for BMI. The estimates were produced with the LDSC Python software, using the LD scores and functional annotations introduced in ref. 17 and the HapMap3 SNPs with minor allele frequency >0.05. Each of the ten enrichment calculations for a particular cell type is performed independently, while each controlling for the 52 functional annotation categories in the full baseline model. The error bars show the 95% confidence intervals. b, We took measurements of gene expression by the Genotype-Tissue Expression (GTEx) Consortium and determined whether the genes overlapping EduYears-associated loci are significantly overexpressed (relative to genes in random sets of loci matched by gene density) in each of 37 tissue types. These types are grouped in the panel by organ. The dark bars correspond to tissues where there is significant overexpression. The y axis is the significance on a −log10 scale.
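
The enrichment factor defined in panel a can be written out as a short sketch. It assumes that "variance explained by SNPs in that group" is expressed as a share of the total SNP-explained variance before dividing by the share of SNPs; the parameter names and numbers are hypothetical, and the actual estimates come from the LDSC software rather than from this arithmetic alone.

```python
def enrichment_factor(h2_group, h2_total, n_snps_group, n_snps_total):
    """Share of explained variance in a SNP group divided by its share of SNPs."""
    share_of_variance = h2_group / h2_total
    share_of_snps = n_snps_group / n_snps_total
    return share_of_variance / share_of_snps


# Illustrative numbers only: a group holding 10% of SNPs that accounts for
# 30% of the explained variance has an enrichment factor of about 3.
print(enrichment_factor(h2_group=0.03, h2_total=0.10,
                        n_snps_group=100_000, n_snps_total=1_000_000))
```

A factor of 1 means the group explains variance in proportion to its size; values above 1 indicate enrichment.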

Extended Data Figure 9 Gene-level biological annotation.

a, The DEPICT-prioritized genes for EduYears measured in the BrainSpan Developmental Transcriptome data (red curve) are more strongly expressed in the brain prenatally rather than postnatally. The DEPICT-prioritized genes exhibit similar gene expression levels across different brain regions (grey lines). Analyses were based on log2-transformed RNA-seq data. Error bars represent 95% confidence intervals. b, For each phenotype and disorder, we calculated the overlap between the phenotype’s DEPICT-prioritized genes and genes believed to harbour de novo mutations causing the disorder. The bars correspond to odds ratios. c, DEPICT-prioritized genes in EduYears-associated loci exhibit substantial overlap with genes previously reported to harbour sites where mutations increase risk of intellectual disability and autism spectrum disorder (Supplementary Table 4.6.1).

Extended Data Figure 10 The predictive power of a polygenic score (PGS) varies in Sweden by birth cohort.

Five-year rolling regressions of years of education on the PGS (left axis in all four panels), share of individuals not affected by the comprehensive school reform (a, right axis), and average distance to nearest junior high school (b, right axis), nearest high school (c, right axis) and nearest college/university (d, right axis). The shaded area displays the 95% confidence intervals for the PGS effect.
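
A rolling regression of this kind can be sketched as follows: for each five-year window of birth years, years of education are regressed on the PGS and the slope is recorded. The sketch uses a bivariate least-squares fit without covariates and synthetic data; the authors' specification, including any controls, is described in the Supplementary Information and is not reproduced here.

```python
def ols_slope(x, y):
    """Slope of an unweighted least-squares fit of y on x, or None if x is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        return None
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def rolling_pgs_effect(records, window=5):
    """Slope of education on PGS for each window of consecutive birth years.

    records -- list of (birth_year, pgs, edu_years) tuples
    Returns a dict keyed by the first birth year of each window.
    """
    years = [r[0] for r in records]
    first, last = min(years), max(years)
    effects = {}
    for start in range(first, last - window + 2):
        subset = [(pgs, edu) for year, pgs, edu in records
                  if start <= year < start + window]
        if len(subset) < 2:
            continue
        xs, ys = zip(*subset)
        slope = ols_slope(xs, ys)
        if slope is not None:
            effects[start] = slope
    return effects


# Synthetic data: the PGS effect drops from 0.5 to 0.2 for later cohorts.
synthetic = [(year, pgs, 10.0 + (0.5 if year < 1935 else 0.2) * pgs)
             for year in range(1930, 1940) for pgs in (-1.0, 0.0, 1.0)]
for start, slope in sorted(rolling_pgs_effect(synthetic).items()):
    print(start, round(slope, 3))
```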

Supplementary information

Supplementary Information

This file contains Supplementary Text and Data – see contents page for details. (PDF 4376 kb)

Supplementary Data

This file contains Supplementary Tables. (XLSX 4252 kb)

About this article

Cite this article

Okbay, A., Beauchamp, J., Fontana, M. et al. Genome-wide association study identifies 74 loci associated with educational attainment. Nature 533, 539–542 (2016). https://doi.org/10.1038/nature17671

  • DOI: https://doi.org/10.1038/nature17671
