Evidence for specificity of polygenic contributions to attainment in English, maths and science during adolescence

How well one does at school is predictive of a wide range of important cognitive, socioeconomic, and health outcomes. The last few years have shown marked advancement in our understanding of the genetic contributions to, and correlations with, academic attainment. However, there exists a gap in our understanding of the specificity of genetic associations with performance in academic subjects during adolescence, a critical developmental period. To address this, the Avon Longitudinal Study of Parents and Children was used to conduct genome-wide association studies of standardised national English (N = 5983), maths (N = 6017) and science (N = 6089) tests. High SNP-based heritabilities (h2SNP) for all subjects were found (41–53%). Further, h2SNP for maths and science remained after removing shared variance between subjects or IQ (N = 3197–5895). One genome-wide significant single nucleotide polymorphism (rs9529641, p = 4.86 × 10⁻⁸) and four gene-level associations with science attainment (MEF2C, BRINP1, S100A1 and S100A13) were identified. Rs9529641 remained significant after removing the variance shared between academic subjects. The findings highlight the benefits of using environmentally homogeneous samples for genetic analyses and indicate that finer-grained phenotyping will help build more specific biological models of variance in learning processes and abilities.

Gene-based association analyses. To gain insight into functional pathways associated with AA, we performed gene and gene-set analyses in MAGMA 23 ("Methods"). Results of the genome-wide gene-based test for association are shown in Table 1 and Supplementary Figs. S3 and S4. In total, four genome-wide significant gene-based associations with science were identified (threshold p = 0.05/17,875 genes tested = 2.80 × 10⁻⁶): MEF2C (Myocyte Enhancer Factor 2C; p = 2 × 10⁻⁸), BRINP1 (Bone Morphogenetic Protein/Retinoic Acid Inducible Neural-Specific 1; p = 4 × 10⁻⁷), S100A1 (S100 Calcium Binding Protein A1; p = 8 × 10⁻⁷) and S100A13 (S100 Calcium Binding Protein A13; p = 1 × 10⁻⁶). No significant gene-based associations were found for maths or English. We further examined gene expression patterns for the four genes using the Genotype-Tissue Expression (GTEx) portal (https://gtexportal.org; Table 1, Supplementary Figs. S5–S8). Gene-set analysis revealed two gene-sets showing genome-wide significant association with science attainment during adolescence (11–14 years).
Figure (caption fragment). The y-axis shows the p-value and the x-axis shows position on chromosome 13. Points show other SNPs located in this region: the purple SNP is the lead SNP and the other colours show the level of LD shared with the lead SNP.
Replication of SNP and gene-level associations. Replication of the SNP and gene associations was performed using data from the Twins Early Development Study (TEDS, N = 2330) 24. We failed to replicate the genome-wide significant SNP (rs9529641) and gene-level associations for science attainment in this independent longitudinal cohort, although the smaller sample and different phenotype complicate interpretation (see "Methods" and Supplementary Note 3). We also cross-referenced our SNP and gene associations with two closely related GWAS 11,25. Rs9529641 was significantly associated with both EduYrs3 and Intelligence (Table 2).
At the gene level, MEF2C was also significantly associated with both EduYrs3 and Intelligence, and BRINP1 and S100A13 with EduYrs3 ( (Table S7).
Bivariate genetic correlations (rg) between academic subjects, estimated by LDSC regression performed in Unix, were high (rg = 0.62–0.75) but significantly less than 1, illustrating a degree of genetic specificity (Table 3).
AA and IQ regressed analyses. Academic attainment across subjects is correlated at both the phenotypic and genetic level (Tables 3 and S1), and previous research demonstrates that school performance is highly correlated with general cognitive function, an association partly underpinned by generalist genes 7. We therefore sought to examine genetic contributions to subject-specific variance independent of performance in the other two subjects, or independent of IQ. This was achieved by regressing verbal and non-verbal IQ out of the academic attainment scores for English (N = 3197; verbal r = 0.548, SE = 0.011; non-verbal r = 0.209, SE = 0.015), maths (N = 3212; verbal r = 0.491, SE = 0.012; non-verbal r = 0.298, SE = 0.015) and science (N = 3260; verbal r = 0.587, SE = 0.010; non-verbal r = 0.269, SE = 0.015), or by regressing out attainment in the other two subjects (N = 5895) (see "Methods"). Univariate GWAS of the AA-regressed and IQ-regressed attainment measures for each subject failed to identify any genome-wide significant associations (Supplementary Fig. S9). LDSC-based h2SNP estimates were lower than in the original analyses, but still significantly larger than 0 for AA-regressed and IQ-regressed science and maths scores (p's < 0.05) (Table 4). GREML h2SNP estimates were similar, ranging from 0.08 (AAreg-English) to 0.21 (IQreg-maths) (Supplementary Table S7).
Table 3. SNP-based heritability estimates, and genetic and phenotypic correlations between academic subjects. Genetic correlations are presented below the diagonal (bold), SNP-based heritability on the diagonal (underline) and phenotypic correlations above the diagonal (italics). Standard errors are in parentheses. Due to the genetically homogeneous nature of this sample, LDSC SNP heritability and genetic correlation estimates are reported with the h2 intercept constrained to 1. Unconstrained LDSC estimates are provided in Supplementary Table S7. Phenotypically, science is significantly more correlated with English (z = 5.80, p = 3.35 × 10⁻⁹) and maths (z = 20.58, p = 2.08 × 10⁻⁹⁴) than these are with each other. There are no significant differences in the genetic correlations between subjects (p's > 0.05). ***p ≤ 0.001. a These correlations were significantly less than 1 (Maths-English, p = 0.04; Science-English, p = 0.03; Science-Maths, p = 0.001).
Finally, to check the specificity of the SNP association identified with science, we performed a hypothesis-driven lookup in the AAreg and IQreg GWAS results (Table 5) using a Bonferroni-corrected p-value threshold. Using a z-test, we showed that rs9529641 is significantly more associated with science than English (z = 2.48, p = 0.01) but not maths (z = 1.06, p = 0.14). It was significantly more associated with AAreg science than both AAreg English (z = 3.54, p < 0.001) and AAreg maths (z = 2.12, p = 0.02), and it was also significantly more associated with IQreg science than IQreg English (z = 1.65, p = 0.05), but not IQreg maths (z = 0.94, p = 0.17).
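The z-tests reported above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the text does not state how the z statistics were formed or whether the p-values are one- or two-sided. The helper `difference_z` assumes independent estimates, which is a simplification because the subject samples overlap. The helper `one_sided_p` shows that the reported p-values are numerically consistent with upper-tail (one-sided) tests. Both function names are hypothetical.

```python
from statistics import NormalDist


def one_sided_p(z: float) -> float:
    """Upper-tail p-value for a standard normal z statistic."""
    return 1.0 - NormalDist().cdf(z)


def difference_z(beta_a: float, se_a: float, beta_b: float, se_b: float) -> float:
    """z statistic for beta_a - beta_b.

    Assumption: the two estimates are treated as independent, so the
    variances simply add. Overlapping samples would need a covariance term.
    """
    return (beta_a - beta_b) / (se_a**2 + se_b**2) ** 0.5


# z statistics as reported in the text for rs9529641
reported_z = {
    "science vs English": 2.48,
    "science vs maths": 1.06,
    "AAreg science vs AAreg English": 3.54,
    "AAreg science vs AAreg maths": 2.12,
    "IQreg science vs IQreg English": 1.65,
    "IQreg science vs IQreg maths": 0.94,
}

for label, z in reported_z.items():
    print(f"{label}: z = {z:.2f}, one-sided p = {one_sided_p(z):.4f}")
```

Rounded to two decimal places, the one-sided values match the p-values given in the text, which suggests (but does not prove) that one-sided tests were used.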
Genetic correlations between AA and related phenotypes. We estimated genetic correlations between subject attainment scores and the GWAS summary statistics of 13 cognitive, educational, psychiatric and personality phenotypes available in the LD hub resource for European samples (http://ldsc.broadinstitute.org/ldhub/). Table 6 shows the magnitude and direction of the genetic relationships for all three academic traits. As expected, genetic correlations with adult academic attainment (years of schooling) and general intelligence were consistently high and positive (rg = 0.89 to 1.26). Genetic correlations with personality and psychiatric traits were lower, with some variation across subjects, although not statistically different after correction (see Supplementary Table S8 for full details).

Discussion
This study reports the first GWAS of science attainment and the largest published GWAS of maths and English attainment using national standardised tests. The current dominant framework for assessing the genetic contributions to variance in academic attainment and its relationship to other outcomes uses a broad measure of attainment 11 , which prevents interpretation of the specificity of the relationships identified. We sought to overcome this issue by differentiating between performance in the academic subjects of English, maths and science using national standardised tests of performance to investigate the degree of genetic specificity and overlap with other cognitive, educational, psychiatric and personality traits. Despite clear evidence for the role of general mechanisms in building the brain, the brain also supports different functional operations (e.g. counting vs reading), which must be supported by a degree of genetic specificity. To understand these mechanisms better, analyses of more specific phenotypes are needed.
The three GWAS of maths, science and English found one genome-wide significant SNP (rs9529641) associated with science attainment. Rs9529641 is not an expression quantitative trait locus, i.e. it does not influence gene expression, but the nearby gene NBEA is largely expressed in the brain, and de novo variants in NBEA have been reported in neurodevelopmental cohorts 28. Rs9529641 was also nominally associated with maths (p = 0.0001) but not English (p = 0.011). This leaves open the possibility that the difference in p-values between the subjects is due to chance and the slightly different sample sizes; however, the significant differences in effect sizes suggest otherwise. Rs9529641 also reaches p ≤ 0.001 in both the EduYrs3 11 and Intelligence 25 large GWAS meta-analyses. Furthermore, after regressing out variance explained by performance in the other two academic subjects, this association remained significant and specific to science. What predicts variance in science academic achievement is an understudied topic compared to maths and English. Research to date suggests a pattern similar to the genetic results found here, with considerable similarities but also some subtle differences in the patterns of association with executive function skills, vocabulary and IQ 14,29. There is some evidence that science achievement may be more dependent than mathematics or English on central executive working memory processes and more complex aspects of EF, such as planning 29,30.
Table 4. SNP-based heritability (h2SNP) estimates for academic subjects after controlling for attainment in other subjects (AAreg), and after controlling for IQ (IQreg). SNP-based heritabilities were estimated using LD score regression with h2 intercepts constrained to 1. Bolded estimates show those that are significantly greater than 0, *p < 0.05, **p < 0.01. Standard errors are in parentheses.
When regressing out IQ, the science association with rs9529641 was not significant after correction for multiple testing; however, the effect size remained the same. Although the SNP association with science failed to replicate in the independent Twins Early Development Study sample, we note the relatively small sample available and that the TEDS attainment measures were based on teacher ratings and not standardised across participants. Gene-based association analyses identified four genes associated with attainment in science, but none for maths or English. The strongest signal came from the MEF2C gene, which also shows strong evidence of association in the EduYrs3 and Intelligence GWAS studies. Notably, studies looking at general cognitive ability have sought MEF2C associations in hypothesis-driven tests 31, and more recently it has been associated with intelligence and years in education in a large meta-analysis 32 and with depression in the largest depression GWAS to date 33. MEF2C, located on chromosome 5, has been linked to synaptic plasticity, memory and learning 34. It is primarily expressed in the brain, and haploinsufficiency of MEF2C is associated with severe cognitive impairment, stereotypic movements, epilepsy and cerebral malformation 35. Evidence from animal models also suggests that it is involved in the development of memory and the consolidation of information 34. Over-expression of MEF2C has also been implicated in poor developmental and cognitive outcomes 36,37 and it has been associated with Alzheimer's disease 38. Given its apparently more general role in cognition, and the lower but not trivial associations with maths (p = 5.12 × 10⁻⁶) and English (p = 9.76 × 10⁻⁵), we are not suggesting MEF2C is a science-specific gene. Rather, we suggest there may be evidence for a 'dosage effect', where the brain mechanisms that are built through the MEF2C gene may explain more variance in science than in English and maths.
A similar effect may be at play with rs9529641, where the mechanism linked to this SNP leads to greater individual differences in science than in English and maths.
The second most strongly associated gene was BRINP1, which is also primarily expressed in the brain, and is involved in protein binding. BRINP1 has been implicated in a wide range of processes related to cognition and behaviour 39,40. The final two associated genes were S100A1 and S100A13, both of which are members of the S100 protein family that encode calcium binding proteins and are involved in the regulation of a wide range of intra- and extracellular processes. These include cell cycle progression, differentiation and possibly stimulation of Ca²⁺ release 41,42. After correction for multiple testing, none of the four genes was significantly associated with science after regressing out variance explained by maths and English or IQ.
SNP-based heritability estimates were moderate for all three subjects, and substantially closer to the twin-based heritability estimates of 65% for maths (h2SNP = 47%) and 54% for science (h2SNP = 54%) 15 than is often the case with DNA-based estimates (e.g. maths ability h2SNP = 0.16 11). This is unusual because twin estimates capture all additive genetic effects that contribute to a phenotype, whereas GWAS estimates include only additive effects of common SNPs. These high estimates might be driven by the homogeneous nature of the sample, both environmentally and ancestrally, as well as the use of a standardised assessment of academic ability. Moreover, they suggest that the majority of the genetic variance contributing to individual differences in academic attainment in adolescence comes from the additive effects of common, rather than rare, genetic variation. We found a significant overlap of common genetic variants influencing variability in the three academic subjects, indicated by the large, but significantly smaller than 1, genetic correlations (rg = 0.62–0.75).
Table 6. LDSC-based genetic correlations between English, maths and science and 13 related educational and psychological traits. See Supplementary Table S8 for more details. Bolded correlation coefficients are significant at p < 0.05 (uncorrected). Underlined coefficients are significant after Bonferroni correction, p ≤ 0.001 (0.05/39). Red cells represent negative correlations and blue cells positive correlations. Intercepts were not constrained in any of these analyses. Statistical differences in correlation coefficients between academic subjects and a given trait were tested with a z-test; none of the correlations were significantly different from one another. Estimates > 1 are not uncommon in LDSC regression, particularly if there is some sample overlap, and suggest that the true value is close to 1 and estimated with error.
The genetic correlations estimated using LD score regression indicate a small to moderate, but potentially informative, degree of specificity. To confirm this, we performed further GWA analyses of each subject with the variance of the other two subjects removed, and separately with IQ controlled for. Maths and science maintained heritability significantly higher than 0 in both instances. Interestingly, the IQ-regressed measures retained more heritability than the AA-regressed measures, suggesting that the subject attainment measures shared more heritable variance with each other than with IQ.
Phenotypically, science was significantly more correlated with English and maths than these were with each other (Table 3), suggesting academic performance in science might incorporate variance from the other two subjects 43 . Whilst no significant differences in genetic correlations between academic subjects were found, we note the opposite pattern of associations, with English and maths being the most highly genetically correlated. One possible explanation is that the factors contributing to the phenotypic correlation in performance between science and maths, and science and English, are under greater environmental influence than the factors contributing to the correlation between English and maths performance, which correlate more for genetic reasons. Note, it is possible that this is a specific effect of this period of development and may not be found earlier or later.
Genetic correlations with cognitive and academic attainment in other studies were found to be high and consistent across the academic subjects. Genetic correlations with personality traits varied more across academic subjects, although these differences in estimates were not statistically significant. Associations between cognition and mental health have been noted in a number of genetic and non-genetic studies 44,45. The fact that pairwise genetic correlations between AA and autism, and AA and ADHD, go in opposite directions is interesting because recent work has shown a positive shared genetic basis to ASD and ADHD symptomology in the ALSPAC sample 46. Depression and anorexia represent another pair of traits that have been found to be correlated phenotypically and genetically in twin studies 47, but each shows the opposite direction of genetic overlap with the three academic subjects. Further analysis using multivariate models would be needed to directly assess whether there are specific cognitive or academic abilities that may differentiate between these disorders.
Broadly, the results suggest that although there are underlying cognitive features which contribute to variance across all three academic subjects, there are other (both genetic and non-genetic) factors which contribute to subject-specific variance. These results reflect the conclusions of multivariate twin studies that examine genetic covariance between cognitive ability and subject attainment in a large UK twin cohort. For example, Kovas et al. (2005) investigated the genetic overlap between mathematics performance, reading and general intelligence in childhood. They reported considerable genetic correlations between mathematics and reading (twin-based rg = 0.74) and between mathematics and 'g' (twin-based rg = 0.67), but noted that approximately a third of the genetic variance in mathematics was independent of both of these factors, suggesting some degree of genetic specificity 7. A subsequent study controlling for performance in maths, English and 'g' investigated the extent to which there was genetic specificity in science attainment in childhood 48. The authors report a twin-based h2 of 49% for science and genetic effects beyond the other factors, which were therefore specific to science.
Limitations. Whilst this study represents the largest GWAS to date for science and English attainment, it is still underpowered to detect common variants of very small effect. Recently developed multivariate GWAS approaches such as genomic structural equation modelling 49 may help shed light on the specific causes of the observed genetic correlations. Furthermore, although this study used an adolescent sample, longitudinal genetic studies will be necessary to fully understand how genetic influences unfold over development. While we believe the use of a single, homogeneous UK cohort allowed greater sensitivity to explore our research questions, it also limits the generalisability of the results to different populations. In particular, differences in schooling, such as a greater focus on drilling in mathematics, or differences intrinsic to the language spoken, such as grammatical complexity or spelling irregularities, may impact genetic associations with specific school subjects. However, if GWAS results are population specific because variants become more relevant in particular contexts, then polygenic scores may have to be population or environment specific to be accurate. Finally, regressing out attainment in other subjects or IQ (both of which are heritable) risks inducing collider bias, which can bias estimates towards or away from the true associations. However, we note that the focus of the AAreg and IQreg analyses was on exploring the specificity of science attainment associations, rather than identification of novel associations 50.

Conclusion
In this study, we performed a series of univariate GWAS of English, maths and science standardised national attainment scores, estimated SNP-based heritability and assessed shared genetic architecture with educational, cognitive, behavioural and psychiatric phenotypes. We found that rs9529641, MEF2C and BRINP1 were significantly and robustly associated with science attainment. We also found differences in SNP-based heritability estimates and genetic correlations with other cognitive traits indicating, as with the phenotypic data, a degree of overlap and specificity between academic subjects. These findings suggest that understanding the sources of individual differences in academic attainment may facilitate a better understanding of the causal paths to later educational outcomes and mental health disorders. Future studies should examine these genetic relationships within a multivariate framework to allow the separation of general versus specific effects at the level of individual DNA sequence variants.
Measures. Attainment in English, maths, and science was assessed using National Curriculum standardised tests at 11 and 14 years of age. At age 11 (end of Key Stage 2) and age 14 (end of Key Stage 3), national exams, known as the SATs, were obligatory in schools across the UK when these data were collected. Pupils sat the tests under exam conditions and scripts were externally marked, standardised, and given a curriculum level 1–9 (low to high). At ages 11 and 14 the English SATs assess reading, grammar, punctuation and spelling, in addition to comprehension and interpretation of a studied text. Maths is assessed at both ages by written SATs that cover all areas of mathematics including conceptual understanding, mathematical reasoning and problem solving. At age 11, the maths SAT also includes a 'mental maths' component in which the students are asked questions orally and, under timed conditions, must record their answers having performed the computations in their head.
The science SAT at ages 11 and 14 assesses the development of scientific thinking and knowledge, experimental skills and strategies, analysis and evaluation, scientific vocabulary, units, symbols and nomenclature.
To get the most reliable score of attainment only individuals with data at both 11 and 14 years (r = 0.67 to 0.81) were included. To remove variance associated with sex and age at testing, each academic subject score was first regressed on age and sex (at each time point) and the residuals from the linear regression were summed together to create a final score for each subject. This resulted in sample sizes of 5983 for English, 6017 for maths and 6089 for science.
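The scoring procedure described above (regress each time point's score on age and sex, then sum the residuals) can be sketched as follows. This is a minimal illustration on simulated data, assuming ordinary least squares with an intercept. The variable names, simulated effect sizes and the helper `residualise` are hypothetical and are not taken from the study, whose data preparation was done in R.

```python
import numpy as np


def residualise(y, covariates):
    """Return OLS residuals of y after regressing on covariates plus an intercept."""
    X = np.column_stack([np.ones(len(y)), covariates])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef


# Simulated data (hypothetical values, not ALSPAC data)
rng = np.random.default_rng(0)
n = 500
sex = rng.integers(0, 2, n).astype(float)
age11 = rng.normal(11.0, 0.3, n)  # age at the Key Stage 2 test
age14 = rng.normal(14.0, 0.3, n)  # age at the Key Stage 3 test
ability = rng.normal(0.0, 1.0, n)  # latent trait shared across time points
score11 = 4.0 + 0.5 * ability + 0.3 * (age11 - 11.0) + 0.2 * sex + rng.normal(0.0, 0.5, n)
score14 = 5.5 + 0.6 * ability + 0.3 * (age14 - 14.0) + 0.2 * sex + rng.normal(0.0, 0.5, n)

# Step 1: remove age and sex effects separately at each time point
res11 = residualise(score11, np.column_stack([age11, sex]))
res14 = residualise(score14, np.column_stack([age14, sex]))

# Step 2: sum the residuals to form the final attainment score
final_score = res11 + res14

print("correlation between time points:", np.corrcoef(res11, res14)[0, 1])
print("correlation of final score with sex:", np.corrcoef(final_score, sex)[0, 1])
```

Because each residual is orthogonal to sex by construction, the summed score is uncorrelated with sex up to floating-point error.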
To assess subject-specific genetic effects, AA regressed scores were created by removing the variance shared with the other two subjects from each of the final scores. This resulted in a sample of 5895 individuals for each subject. Finally, to assess genetic effects independent of general cognitive ability, IQ regressed scores were created by removing the variance shared with IQ from each of the individual subject scores, leaving a smaller sample due to the lower availability of IQ scores than academic attainment measures (see Table 7). IQ was calculated using a combined measure of Vocabulary and Matrix Reasoning raw scores taken from the Wechsler Abbreviated Scale of Intelligence 53 at age 15. In the Vocabulary subtest, participants were asked the meaning of a list of gradually more complex words. The Matrix Reasoning subtest consisted of a multiple-choice visual puzzle in which the participants were presented with a series of pictures and had to choose the missing image.
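Removing the variance shared with the other two subjects can be illustrated with a linear regression whose residuals become the AA regressed score. The sketch below uses simulated scores and assumes ordinary least squares. The helper `regress_out` and all simulated quantities are hypothetical, and the study's exact model specification is not given in the text.

```python
import numpy as np


def regress_out(target, predictors):
    """Residuals of target after OLS regression on predictors plus an intercept."""
    X = np.column_stack([np.ones(len(target)), predictors])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return target - X @ coef


# Simulated subject scores sharing a common factor (hypothetical values)
rng = np.random.default_rng(1)
n = 1000
shared = rng.normal(size=n)
english = shared + rng.normal(scale=0.8, size=n)
maths = shared + rng.normal(scale=0.8, size=n)
science = shared + rng.normal(scale=0.8, size=n)

# AA regressed science: the part of science not linearly predicted by English and maths
science_aareg = regress_out(science, np.column_stack([english, maths]))

# Proportion of the original science variance left after the adjustment
retained = science_aareg.var() / science.var()
print(f"variance retained in AAreg science: {retained:.2f}")
print("correlation with English:", np.corrcoef(science_aareg, english)[0, 1])
```

The same function would apply to the IQ regressed scores by passing the IQ measures as predictors instead of the other two subjects.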
Genotyping and quality control. Genotyping and imputation were performed by ALSPAC. Adolescents from ALSPAC were genotyped using the Illumina HumanHap550 quad chip by 23andMe, subcontracting the Wellcome Trust (Wellcome Sanger Institute, Cambridge, UK) and the Laboratory Corporation of America (Burlington, NC, US). The raw genotype data were subjected to standard quality control procedures to identify individuals and SNPs for exclusion. Samples that passed quality control stages were phased and imputed using the Haplotype Reference Consortium panel of ~31,000 phased whole genomes and Impute V3 54. SNP and sample quality control were repeated post-imputation (see Supplementary Note 2 for full details).
Table 7. Sample descriptive statistics. Academic attainment scores are the sum of residuals from age and sex regressed linear models of individual SAT scores at age 11 (Key Stage 2) and 14 (Key Stage 3). Attainment in other subjects (e.g. English-AA) or IQ (e.g. English-IQ) was then regressed out to provide AA and IQ regressed scores to assess independent genetic effects.
Statistical analysis. All data preparation was performed using R 3.4 55. Scores were regressed on the first 10 ancestry principal components to control for population structure and then quantile normalized in SNPtest 56.
In total, nine GWA analyses were performed using (1) individual subject scores for attainment in English, maths and science, (2) AA regressed scores for attainment in English, maths and science, and (3) IQ regressed scores for attainment in English, maths and science. Each univariate GWA analysis was performed in SNPtest v.2 using an additive linear model and imputation probability calls 57. Independent SNP association signals were identified by LD clumping in PLINK v1.9, with a genome-wide significance threshold for index SNPs and a threshold of 0.2 for LD clumping 58. Gene-based association analyses, which test for association between aggregated SNP effects across each gene, were performed using MAGMA within the FUMA programme, using the summary statistics from each GWAS 23,59. Significantly associated genes were identified as those surviving Bonferroni correction for multiple testing (p = 0.05/17,875 genes tested = 2.80 × 10⁻⁶). Competitive gene-set analyses were also carried out in MAGMA using 10,673 gene sets (5915 GO terms, 4758 curated gene sets) obtained from MsigDB v5.2. Functional interrogation of gene-based associations was conducted using the GTEx portal (https://gtexportal.org/home/).
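The Bonferroni arithmetic behind the gene-level threshold is easy to reproduce. The sketch below recomputes it and applies it to the gene-based p-values reported in the Results and Discussion. The function name `bonferroni_threshold` is illustrative, and the p-values are copied from the text as rounded there.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


# Gene-based analysis: 17,875 genes tested at a family-wise alpha of 0.05
gene_threshold = bonferroni_threshold(0.05, 17_875)
print(f"gene-level threshold: {gene_threshold:.2e}")

# Gene-based p-values for science as reported in the text
science_gene_p = {"MEF2C": 2e-8, "BRINP1": 4e-7, "S100A1": 8e-7, "S100A13": 1e-6}
significant = [gene for gene, p in science_gene_p.items() if p < gene_threshold]
print("significant for science:", significant)

# MEF2C p-values for the other two subjects, also from the text
mef2c_other = {"maths": 5.12e-6, "English": 9.76e-5}
print({subject: p < gene_threshold for subject, p in mef2c_other.items()})
```

All four science genes fall below the threshold, whereas the MEF2C p-values for maths and English do not, in line with the reported absence of gene-level hits for those subjects.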
The proportion of variance in science, maths or English accounted for by all the SNPs on the array passing QC, i.e. SNP heritability (h2SNP), was estimated using two methods that have differing modelling assumptions of the underlying genetic architecture 60, with a view to gaining consensus estimates of SNP heritability for academic attainment. GREML was implemented in the GCTA software package in Unix to provide h2SNP estimates using individual-level genetic data 27. LD score regression (LDSC 26) in Unix was used to estimate h2SNP using the GWAS summary statistics. Genetic correlations were estimated with the cognitive, educational, psychiatric and personality traits available in, and using, LD hub (http://ldsc.broadinstitute.org). See Supplementary Note 4. Due to the homogeneous nature of the ALSPAC sample, the LDSC h2 intercept was constrained to 1. We note that this will result in lower standard errors 26 and also report unconstrained estimates. The AA-regressed and IQ-regressed attainment scores were excluded from this LDSC analysis because their low SNP heritability estimates (and large standard errors) led to low heritability z-scores (z < 4). Finally, z-tests were used to assess whether heritability results were significantly larger than 0, whether correlations were significantly smaller than 1 and whether correlations were significantly different from each other (p < 0.05).
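The three kinds of z-test mentioned above can be written out explicitly. The sketch below is one plausible formulation, not the authors' code. It assumes normally distributed estimates, one-sided tests for "greater than 0" and "less than 1", a two-sided test for a difference between correlations, and independent estimates in that comparison. The function names and the standard errors in the example are hypothetical; only the rg range (0.62–0.75) comes from the text.

```python
from math import sqrt
from statistics import NormalDist

_norm = NormalDist()


def p_greater_than_zero(estimate: float, se: float) -> float:
    """One-sided p-value for H1: estimate > 0 (e.g. SNP heritability)."""
    return 1.0 - _norm.cdf(estimate / se)


def p_less_than_one(rg: float, se: float) -> float:
    """One-sided p-value for H1: rg < 1 (e.g. a genetic correlation)."""
    return 1.0 - _norm.cdf((1.0 - rg) / se)


def p_correlations_differ(r1: float, se1: float, r2: float, se2: float) -> float:
    """Two-sided p-value for H1: r1 != r2, assuming independent estimates."""
    z = (r1 - r2) / sqrt(se1**2 + se2**2)
    return 2.0 * (1.0 - _norm.cdf(abs(z)))


# Example with hypothetical standard errors
print(p_greater_than_zero(0.50, 0.10))            # heritability of 0.50, SE 0.10
print(p_less_than_one(0.75, 0.125))               # rg of 0.75, SE 0.125
print(p_correlations_differ(0.75, 0.125, 0.62, 0.15))
```

With overlapping samples the independence assumption in the last function is unlikely to hold exactly, so a covariance term between the two estimates would be needed for a more careful comparison.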
Replication. Replication of independent significantly associated SNPs and genes was performed using data from the Twins Early Development Study. TEDS is a longitudinal study investigating the cognitive and behavioural development of twins born in England and Wales between January 1994 and December 1996 (www.teds.ac.uk) 24. TEDS participants completed various web and telephone-based tests and questionnaires at regular intervals over childhood and adolescence designed to assess various aspects of cognition, language and behaviour, which are described in detail elsewhere 61. The available sample consisted of 2352 individuals (one member of each twin pair) for whom academic attainment data at age 14 and genome-wide SNP genotyping data were available (full details can be found in Supplementary Note 3). The TEDS cohort is a few years younger than the ALSPAC cohort (recruited 1994–1996) and, as school exam procedures had changed during this time, national exams (SATs) were no longer obligatory. Although Key Stage 3 (KS3; age 14) SAT assessments were given to some TEDS pupils, they were teacher rated, not nationally standardised. TEDS KS3 scores are therefore not directly comparable with ALSPAC scores and capture school and teacher effects. Phenotype and genotype data were retained for 2352 unrelated individuals for maths and 2330 for science. Linear genotype-phenotype regressions for SNP rs9529641 and the SNPs in genes S100A1, S100A13, BRINP1 and MEF2C were performed separately for each TEDS genotyping array platform (OEE or Affy) in SNPtest 57, with scores regressed on the first 10 ancestry principal components and quantile normalized. Platform-specific results were then meta-analysed using METAL 62. Gene-level replication was performed using MAGMA 59.
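Combining the platform-specific results can be illustrated with a fixed-effect, inverse-variance-weighted meta-analysis. This is only a sketch of one scheme: METAL also offers a sample-size-weighted z-score scheme, and the text does not say which was used. The function name and the effect sizes in the example are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def fixed_effect_meta(betas, ses):
    """Inverse-variance-weighted fixed-effect meta-analysis.

    Returns the pooled effect, its standard error, the z statistic and a
    two-sided p-value. Assumes the input estimates come from independent samples.
    """
    weights = [1.0 / se**2 for se in ses]
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    pooled_se = sqrt(1.0 / sum(weights))
    z = pooled_beta / pooled_se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return pooled_beta, pooled_se, z, p


# Hypothetical per-platform estimates for one SNP (not TEDS results)
beta, se, z, p = fixed_effect_meta(betas=[0.10, 0.30], ses=[0.10, 0.10])
print(f"pooled beta = {beta:.3f}, SE = {se:.4f}, z = {z:.2f}, p = {p:.4f}")
```

With equal standard errors the pooled effect is the simple average of the two estimates, and the pooled standard error shrinks by a factor of the square root of two.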