Discovery of 42 genome-wide significant loci associated with dyslexia

Reading and writing are crucial life skills but roughly one in ten children are affected by dyslexia, which can persist into adulthood. Family studies of dyslexia suggest heritability up to 70%, yet few convincing genetic markers have been found. Here we performed a genome-wide association study of 51,800 adults self-reporting a dyslexia diagnosis and 1,087,070 controls and identified 42 independent genome-wide significant loci: 15 in genes linked to cognitive ability/educational attainment, and 27 new and potentially more specific to dyslexia. We validated 23 loci (13 new) in independent cohorts of Chinese and European ancestry. Genetic etiology of dyslexia was similar between sexes, and genetic covariance with many traits was found, including ambidexterity, but not neuroanatomical measures of language-related circuitry. Dyslexia polygenic scores explained up to 6% of variance in reading traits, and might in future contribute to earlier identification and remediation of dyslexia.


METHODS 23andMe Genotyping and imputation
Samples were genotyped on one of five genotyping platforms. The V1 and V2 platforms were variants of the Illumina HumanHap550 + BeadChip, including about 25,000 custom SNPs selected by 23andMe, with a total of about 560,000 SNPs. The V3 platform was based on the Illumina OmniExpress + BeadChip, with custom content to improve the overlap with our V2 array, with a total of ~950,000 SNPs. The V4 platform is a fully custom array, including a lower redundancy subset of V2 and V3 SNPs with additional coverage of lower-frequency coding variation, and ~570,000 SNPs. The v5 platform, in current use, is an Illumina Infinium Global Screening Array (~640,000 SNPs) supplemented with ~50,000 SNPs of custom content. Samples that failed to reach 98.5% call rate were excluded from the study.
Individuals were only included if they had > 97% European ancestry, as determined through an analysis of local ancestry (see 1 for further details on the methodology used). Briefly, this analysis first partitions phased genomic data into short windows of ~100 SNPs. Within each window, a support vector machine is used to classify individual haplotypes into one of 31 reference populations. The support vector machine classifications are then fed into a Hidden Markov Model (HMM) that accounts for switch errors and incorrect assignments and gives probabilities for each reference population in each window. Finally, simulated admixed individuals are used to recalibrate the HMM probabilities so that the reported assignments are consistent with the simulated admixture proportions. The reference population data are derived from public data sets (the Human Genome Diversity Project, HapMap and 1000 Genomes) and from 23andMe research participants who have reported having four grandparents from the same country.
A maximal set of unrelated individuals was chosen for each analysis using a segmental identity-bydescent (IBD) estimation algorithm 2 . Individuals were defined as related if they shared more than 700 cM IBD, including regions where the two individuals share either one or both genomic segments identical-by-descent. This level of relatedness (roughly 20% of the genome) corresponds approximately to the minimal expected sharing between first cousins in an outbred population. For the purposes of GWAS, if a case was found to be related to a control, the case was preferentially kept in the sample.
Participant genotype data were imputed against a single unified imputation reference panel, combining the May 2015 release of the 1000 Genomes Phase 3 haplotypes and the UK10K imputation reference panel. Data for each genotyping platform were phased and imputed separately. Variants that were only genotyped on the 'V1' platform were flagged due to small sample size, and variants on chrM or chrY, because many of these are not currently called reliably. Using trio data, variants that failed a test for parent-offspring transmission were also flagged; specifically, the child's allele count was regressed against the mean parental allele count and variants with fitted β < 0.6 and p < 10 -20 for a test of β<1 were flagged. Variants with a Hardy-Weinberg p < 10 -20 in Europeans, or a call rate of < 90%, were also flagged. Genotyped variants were also tested for batch effects and variants with p < 10 -50 by analysis of variance of genotypes against a factor dividing genotyping date into 20 roughly equal-sized buckets were flagged. For imputed GWAS results, variants with average r 2 < 0.5 or minimum r 2 < 0.3 in any imputation batch were flagged, as well as SNPs that had strong evidence of an imputation batch effect, using an analysis of variance of the imputed dosages against a factor representing imputation batch; results with p < 10 -50 were flagged. Each variant flagged by QC on genotyped or imputation data were excluded from the GWAS analysis.

Chinese Reading Study sample
Participants 3,127 Grade 3 to Grade 6 primary students aged nine to 14 years were recruited from three cities and four districts in China (Xi'an-YT, Xi'an-CB, Qingyang, and Baotou). In total, 2,476 participants were eligible for subsequent genotyping and association analysis. Ethical approval was obtained for each cohort at the local level and written informed consent was obtained from all the participants' parents.

Phenotypic measures
Reading accuracy: A Chinese character recognition test was employed to measure each child's reading accuracy [3][4][5] . The test consisted of 150 single Chinese characters selected from China's Elementary School Textbooks (1996). The average frequency of the characters was 182 per million (ranging from 0 to 2,282), and the reliability of this test was 0.95 3 . Each child was individually tested and was required to read aloud each character at a time.
Reading fluency: A word list reading task 3 was used to measure each child's reading fluency. In this task, children were asked to name a list of 180 two-character words as rapidly and accurately as possible. All these words were from primary school textbooks and have been learned before Grade 3, such as "我们 (we)" and "太阳 (sun)". The mean frequency of these words was 212.77 per million 6 .
Since words included in this task were all simple, this task was administrated to test children's reading fluency. The total time for naming the whole word list was recorded as the measurement of reading fluency.
Genotype quality control, imputation, and analysis DNA was extracted from saliva samples, and individuals were genotyped using the Illumina Asian screening array (650K) by Beijing Compass Biotechnology. Quality control was performed using standard quality control metrics. Eight samples were excluded as they had sex discrepancies between the records and the genetically inferred data 7,8 . Next, we removed 53 samples who had unexpected duplicates or probable relatives (PI-HAT > 0.20). Then, SNPs were filtered out if they showed a variant call rate < 0.95, a minor allele frequency (MAF) < 0.01, a missing genotype data (mind) < 0.90, or a Hardy-Weinberg Equilibrium (HWE) p < 10 -5 within each dataset.
For imputation, autosomal variants were aligned to the 1000G genomes phase 1v3 reference panel.
Imputation was performed using the Michigan imputation Server 4.0 in 5Mb chunks with 500kb buffers, filtering out variants that were monomorphic in the Genome Asia Pilot (GAsP). Chunks with 51% genotyped variants or concordance rate < 0.92 were fused with neighbouring chunks and reimputed. Finally, imputed variants were filtered out for r 2 < 0.60, MAF < 0.02, mind < 0.1, HWE p < 10 -5 using Plink (v1.90). After quality control procedures had been performed, 2,415 children with 4,261,603 SNPs were included in the final analysis. Association analyses were performed using PLINK, fitting an additive model to the linear regression model with adjustment for sex, age, and the first two principal components 8 .

Biological annotations
Genome-wide significant variants and the closest gene(s) were annotated using external reference data through FUMA v1.3.6a 9 (unless otherwise specified) and evaluated for functional or regulatory impact. Specifically, we considered the following annotations of SNPs reaching genome-wide significance (p < 5 x 10 -8 ) (Supplementary Table 10 o Function: Whether a variant is intergenic or the functional region in which the variant is located within a gene or RNA locus (e.g., 5' UTR).
 Combined Annotation Dependent Depletion (CADD) score: A score of the deleteriousness of variants computed from 63 integrated annotations 10 . The higher the score, the more deleterious a variant is: 12.37 is the threshold indicated by the study of potentially actionable exonic pathogenic single-nucleotide variants in European-and African ancestry patients 11 .
 RegulomeDB category (RDB): A variant classification system in which variants are grouped according to evidence of having a functional consequence from Category 6 (minimal evidence) to Category 1a (likely to affect binding and linked to expression of a gene target) 12 .
 Chromatin state: The minimum and the most common 15-core chromatin state across 127 tissue/cell types predicted by ChromHMM 13 from 15 (quiescent/low) to 1 (active TSS).

 GWAS Catalog: SNP-trait associations reported in the NHGRI-EBI Catalog of human GWAS 14 ,
including for each variant: the trait(s), the effect allele(s), the PubMed ID(s), the study title(s) and the study sample size(s) (Supplementary Table 2). And the following annotations of genes which were significant in genome-wide gene-based tests (Supplementary Table 12):  Probability of Loss-of-function Intolerance (pLI) score: A score of intolerance to functional mutation from the ExAC database 15 ranging from zero to one. The closer the score is to one, the more intolerant the gene is to loss-of-function mutations. The threshold suggested by Lek,et al. 15 for likely disease-causing variants is ≥ 0.9.
 Non-coding Residual Variation Intolerance Score (ncRVIS): A score of intolerance to mutation to non-coding variants 16 . Where ncRVIS is zero, the gene has the average number of noncoding variants given its total mutational burden; when ncRVIS is greater than zero, the gene has less non-coding variation than expected; when ncRVIS is less than zero, it has more. The ncRVIS percentile reflects the rank of the gene amongst all genes. The more negative the ncRVIS, or the lower the percentile, the more intolerant to non-coding variation the gene is.
 Residual Variation Intolerance Score (ncRVIS) percentile: As for ncRVIS score but the percentile of the average RVIS score for the whole gene sequence.
 Non-coding Genomic Evolutionary Rate Profiling (ncGERP) score: Identifies constraint in noncoding regions by quantifying deficits in substitutions 16 . It is calculated by taking the average GERP++ score (see Davydov,et al. 17 ) across the non-coding sequence. The higher the ncGERP score, the fewer substitutions are present than what would be expected as a result of a neutral rate of evolution, and thus the more conserved are the non-coding regions of the gene. The ncGERP percentile reflects the rank of the gene amongst all genes.
 Protein-coding Genomic Evolutionary Rate Profiling (pcGERP) percentile: As for ncGERP score but the percentile of the average GERP score for protein-coding sequence 16 .
 Non-coding Combined Annotation Dependent Depletion (CADD) score: As for CADD score but the average variant score across the non-coding sequence of the gene 16 .
 Non-coding Genome-Wide Annotation of Variants (ncGWAVA) score: Predicts the combined functionality of non-coding variants across non-coding sequence 16 . It is the average GWAVA score (see Ritchie,et al. 18 ) of variants in the non-coding sequence, ranging from zero to one.
The closer ncGWAVA is to one, the more likely the variants in non-coding regions of the gene are functional.
 Expression in the brain: Average log2 expression in transcripts per million (TPM) per tissue type per gene from the GTEx v8 dataset 19 Table 15).

Evolutionary analysis
Enrichment of heritability was estimated for the following evolutionary annotations (as described in Tilot,et al. 20 ):  Human Gained Enhancers and Promoters: These regulatory regions were identified based on differential H3K27ac and H3K4me2 patterns in the adult and foetal brain tissues of humans, macaques and mice [19,20], and shown to be present to a significantly lesser degree in macaques and mice. Thus, these regulatory elements were gained in the last 30 million years of human evolution and may be involved in the emergence of human-specific traits 21,22 .    computer and laboratory technicians and colleagues who assisted in the quality control and preparation of the imputed GWAS data and the pharmacies and hospitals that were involved.

Adolescent Brain Cognitive Development SM Study
Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive Development SM (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. publication is the work of the authors and they will serve as guarantor for the ALSPAC contribution to this article.

The Basque Center on Cognition, Brain and Language (BCBL)
Data collection was supported by the Basque Government through the BERC program, and by the Agencia Estatal de Investigación through BCBL Severo Ochoa excellence accreditation.

Brisbane Adolescent Twins Study (BATS)
The research was supported by the Australian Research Council (A7960034, A79906588, A79801419, DP0212016 and DP0343921), with genotyping funded by the NHMRC (Medical Bioinformatics Genomics Proteomics Program, 389891).

Colorado Learning Disabilities Research Center (CLDRC)
The CLDRC was supported by a NICHD grant P50 HD 27802.

The Early Language in Victoria Study (ELVS)
Funding by the NHMRC grant number 436958 is gratefully appreciated.

UK Dyslexia
Cohort recruitment and data collection was supported by Wellcome Trust (076566/Z/05/Z and 075491/Z/04) and The Waterloo Foundation (grants to JBTa and SP; 797-1720). Genotype data were generated and analyses funded by the EU (Neurodys, 018696) and the Royal Society (UF100463 grant to SP).

York
The York cohort was funded by Wellcome Trust Programme Grant 082036/B/07/Z.