Introduction

Refractive error has become a major worldwide health concern, with the prevalence of the disease, particularly myopia (nearsightedness), becoming more frequent in both the United States1 and Europe2 and reaching epidemic proportions in parts of East Asia3,4. Refractive error is caused when the optics of the eye fail to project the focal point of light on the retina, causing a blurred image. Myopia is the refractive error mostly resulting from eye elongation, which can lead to serious ocular complications like myopic macular degeneration, glaucoma and retinal detachment5,6,7,8, and is the second most common cause of blindness9,10,11.

Refractive error is a highly complex trait that is known to have both an environmental and genetic etiology. Established environmental factors include prolonged near work, education, and little outdoor exposure12. Genome-wide association studies (GWAS) and genetic linkage studies have identified multiple associated variants for refractive error13,14,15,16,17,18. The Consortium for Refractive Error and Myopia (CREAM) has reported numerous risk variants using large-scale, multiethnic datasets19,20,21,22, explaining ~18% of phenotypic variance22.

Despite estimates that 50% to 80% of refractive error variance is determined by genetic factors23,24,25,26, much of the refractive error heritability remains unaccounted for19,21. Since GWAS are particularly designed to identify common variants, some of the missing heritability may lie with rare variants (minor allele frequency [MAF] ≤ 0.01), which may be highly penetrant and exert a large effect on the phenotype27. Gene-based association tests, such as burden-style tests28,29, offer increased power to find rare variants not identified by GWAS.

This study performs a large-scale rare variant analysis on refractive error using multiethnic cohorts from CREAM. We used an initial discovery dataset consisting of over 13,000 Indo-Europeans and four replication datasets consisting of European ancestry Americans, European ancestry Australians, European ancestry Britons, and Eastern Asian ancestry Singaporeans. Gene-based tests were performed on each of the five cohorts and meta-analysis was performed subsequently. Pathway analysis was conducted on genome-wide significant genes and genes were prioritized based on annotation and biologic relevance to the trait.

Results

Overview of all analyses

Across the three (i.e., VT, CMC and ACAT) multiethnic meta-analyses, the three Indo-European meta-analyses and the three EACC analyses, we identified a total of 129 unique genes that were significantly associated with the refractive error phenotype (Supplementary Data 35). We found no statistically significant difference in p-value or the number of unique genome-wide significant genes when adding the PRS as covariates.

Multiethnic meta-analyses

Forty-three genome-wide significant genes were found using EMMAX-VT (Fig. 1a), 11 genome-wide significant genes using the EMMAX-CMC (Fig. 1b), and 28 genome-wide significant genes using ACAT (Fig. 1c).

Fig. 1: P-values of the multiethnic meta-analysis.
figure 1

The gene-based p-values of the meta-analysis association study combining all five cohorts (N = 27,006) using the (a) EMMAX-VT test, (b) EMMAX-CMC test, and (c) ACAT. The line represents the genome-wide significant threshold of 1 ×10−5. These plots are based on results in Supplementary Data 911, respectively.

Sixty-eight unique genes were identified across the three tests (Fig. 2). Four genes were significant across all three tests - GDF15 (19p13.11), PDCD6IP (3p22.3), RRM2 (2p25.1), and ST6GALNAC5 (1p31.1). GDF15 (19p13.11) was one of the top two significant genes in all three approaches (EMMAX-VT P = 5.12 × 10−9, EMMAX-CMC P = 1.12 × 10−9, ACAT P = 1.95 × 10−9). GDF15, PDCD6IP, and RRM2 all replicated in at least one cohort; ST6GALNAC5 only appeared in IECC and thus could not be replicated.

Fig. 2: Overlap between three tests in the multiethnic meta-analysis.
figure 2

A Venn diagram showing the overlap and unique significant genes in the multiethnic meta-analysis using the three different tests: EMMAX-VT (green), EMMAX-CMC (red), and ACAT (blue). These plots are based on results in Supplementary Data 68.

Overall, using a replication p-value of 0.05, 25 genes were replicated using the EMMAX-VT approach: 11 in the ACAT approach and 4 in the EMMAX-CMC approach. Three genes — HCAR1, CCDC9, and NINJ2 — were replicated in more than one replication cohort, all in the EMMAX-VT approach. MRPS27 in EMMAX-VT (REHS and EPIC-Norfolk) and GDF15 in ACAT (IECC and REHS) had genome-wide significant p-values in two cohorts. If we use the more stringent replication threshold of 3.87 × 10−4, then replications are observed for GDF15 (VT, CMC, ACAT) and MRPS27 (VT) with PDCD6IP (VT), NDC80 (VT) and LOXHD1 (ACAT) all having replication p-values very close to these thresholds. The list of all genome-wide significant genes for each test can be found in Supplementary Data 68, while the full results of all p-values can be found in Supplementary Data 911. Note that beta is provided for the individual CMC analyses and a direction for the individual VT analyses, as VT does not output a beta.

Indo-European meta-analyses

As it is possible that Eastern Asians differ in genetic risk factor profile from Indo-Europeans, we performed meta-analyses on the four Indo-European ancestry cohorts. Forty-nine genes were genome-wide significant in the EMMAX-VT approach (Fig. 3a), 13 genes in the EMMAX-CMC approach (Fig. 3b), and 29 genes in the ACAT approach (Fig. 3c). Four genes overlapped between all three tests — GDF15, PIK3CA, RRM2, and ST6GALNAC5 (Fig. 4). The signal at PIK3CA was unique to the Indo-European meta-analysis. GDF15 and RRM2 were both replicated in one cohort, while PIK3CA and ST6GALNAC5 only appeared in IECC.

Fig. 3: P-values of the Indo-European meta-analysis.
figure 3

The gene-based p-values of the meta-analysis association study (N = 22,139) combining the four Indo-European derived cohorts using the (a) EMMAX-VT test, (b) EMMAX-CMC test, and (c) ACAT. The line represents the genome-wide significant threshold of 1 ×10−5. These plots are based on results in Supplementary Data 1517, respectively.

Fig. 4: Overlap between three tests in the Indo-European meta-analysis.
figure 4

A Venn diagram showing the overlap and unique significant genes in the Indo-European cohorts meta-analysis using the three different tests: EMMAX-VT (green), EMMAX-CMC (red), and ACAT (blue). These plots are based on results in Supplementary Data 1214.

Overall, 24 genes were replicated at p = 0.05 in EMMAX-VT, 8 genes in ACAT, and 4 genes in EMMAX-CMC. NINJ2 in the EMMAX-VT and STON1 and SND1 in EMMAX-CMC were replicated in multiple cohorts. The list of all genome-wide significant genes for each test can be found in Supplementary Data 1214, while the full results of all p-values can be found in Supplementary Data 1517.

EACC analysis

We also report the standalone results of EACC analysis. Thirty-one genome-wide significant genes were found in EACC using EMMAX-VT (Fig. 5a), 5 genome-wide significant genes using EMMAX-CMC (Fig. 5b), and 22 genome-wide significant genes using ACAT (Fig. 5c). GSTM5 (1p13.3) and WEE1 (11p15.4) overlapped in all three tests (Fig. 6). SERTAD3 (chromosome 19) and ZNF25 (chromosome 10) were genome-wide significant and only appeared in EACC, i.e., rare variants in these two genes did not exist in the other cohorts. 51 unique genome-wide significant genes were identified, 39 novel to the EACC analyses. The list of all genome-wide significant genes for each test can be found in Supplementary Data 1820.

Fig. 5: P-values of the analysis using the Eastern Asian EACC only.
figure 5

The gene-based p-values of the EACC association analysis (N = 4867) using the (a) EMMAX-VT test, (b) EMMAX-CMC test, and (c) ACAT. The line represents the genome-wide significant threshold of 1 ×10−5. These plots are based on results in Supplementary Data 911 respectively.

Fig. 6: Overlap between three tests in the Eastern Asian EACC analysis.
figure 6

A Venn diagram showing the overlap and unique significant genes in the EACC analysis using the three different tests: EMMAX-VT (green), EMMAX-CMC (red), and ACAT (blue). These plots are based on results in Supplementary Data 1820.

Cohort unique genes

In addition to the two genes in the EACC EMMAX-VT analysis, there were 6 significantly associated genes that only had rare variants within a single cohort; no other rare variants existed in the other cohorts for these genes. EDN3 and CHMP1B in the IECC EMMAX-VT analysis and PRLH in the IECC ACAT analysis. KLF1 appeared only in the EPIC-Norfolk cohort, in both the EMMAX-VT and EMMAX-CMC analyses. The list of cohort unique genes appears in Supplementary Data 21.

Independent replication in UK Biobank

We extracted the variants from the 129 significant unique genes and performed replication analyses in the UK Biobank. There were 7 genes with a P < 0.05 in EMMAX-CMC and 9 genes with a P < 0.05 in EMMAX-VT (Supplementary Data 22). P4HTM, CCDC170, and CPB1 were found in both analyses. STON1 was also replicated in the UK Biobank analyses; this gene had a significant meta-analysis p-value in the EMMAX-CMC analysis. Interestingly, the p-value in all cohorts was <0.053.

Pathway and expression analysis on all significant genes

We performed IPA pathway analysis on the 129 unique genes. While this did not result in any genome-wide significant canonical pathways, the upstream regulators analysis identified over 172 associated transcription factors. The two highest were the cytokine CSF2, which is known to regulate neuroglia after retinal injuries30, and the Transcription factor (TF) MEF2C, which is known to be expressed in the retina and controls photoreceptor gene expression31 (Supplementary Data 23). The fourth ranked p-value was the Raf kinases, which are known to be involved in retinal development32 and cell survival;33 the fifth ranked p-value was TBX5, which is expressed in the retina and involved in eye morphogenesis34,35. Causal network analysis identified 288 associated pathways (Supplementary Data 24), including the TRPC5 pathway, which regulates axonal outgrowth in developing ganglion cells36.

The top overall associated physiological system functions were organ morphology, organismal development and embryonic development, while the top molecular/cellular functions were cell cycle and cellular assembly/organization. Cancer and organismal injuries/abnormalities were the top overall associated phenotypes (Supplementary Data 25). Six genes were associated with ophthalmic phenotypes: CHST6, GCNT2, P4HTHM, USH2A, GRHL2, and MAPT.

FUMA analysis found that the top enriched tissues were heart, brain, muscle, and adipose tissue (Supplementary Fig. 1a). The top functional categories were cytoskeleton organization, cell cycle processes, mitotic nuclear division, and organelle organization (Supplementary Fig. 1b.

Biological plausibility and prioritization of genes

Of the 129 genome-wide significant genes from the six meta-analyses, 27.9% (36/129) have a known expression in human ocular tissue. 51.2% (66/129) of these genes showed evidence for a human ocular phenotype.

Seven genes had a biological plausibility score higher than 3 — PER3 (internally replicated, expressed in ocular tissue and associated ocular phenotype, i.e., score of 5) and PDCD6IP, MAPT, CHST6, GRHL2, USH2A, and P4HTM (all with a score of 4). An additional 11 genes had a score of 3 — GDF15, RRM2, HSPH1, TPR, KRT81, SPHK1, GSTM5, THSD7A, WEE1, and BUB1B (Fig. 7). Table 1 provides the p-values and effect sizes of the prioritized genes. Detailed background for the prioritization of the genes can be found in Supplementary Data 26A–F. Supplementary Data 1 provides the p-values and effect sizes (when available) for each gene. Supplementary Data 27 provides the average SER for minor allele carriers versus non-carriers for each variant in these prioritized genes; please note that this table uses the single variant results which is restricted to MAC > = 3; some variants with MAC < 3 were used in the gene-based tests but will not be present in Supplementary Data 27. P-values and betas for each of the individual rare variants are also provided. In general, PDCD6IP, MAPT, and USH2A variants had the most negative average SER for carriers of the given rare variant (cases in the table), although genes GRL2, CHST6, PDCD6IP, and USH2A all had variants with high positive SER for rare variant carriers as well. Betas tended to conform with difference between rare variant carrier SER and SER in noncarriers (controls in the table), with many of the top variants having large betas. Perhaps the most interesting fact with respect to the betas is that most of the single variant betas tended to be positive and led to increased myopization (negative SER). However, there were still negative betas for some variants with more hyperopic mean SERs in carriers versus non-carriers, particularly in the IECC and the genes PDCD6IP and USH2A across cohorts.

Fig. 7: Prioritization of top genes from all 129 genome-wide significant genes.
figure 7

The top genes ranked by our prioritization schema. The figure contains the chromosome, basepair position, gene name, as well as the meta-analysis p-value and the individual cohort p-values for each gene. It also contains which test the given significant meta-analysis p-value refers to, and how many times the gene replicated in our internal analyses. Finally, it contains information regarding gene expression, whether the gene has a known ocular phenotype in mice or humans, overlap with the GWAS performed by Hysi et al., and the final overall prioritization score. This figure is based on results shown in Supplementary Data 26.

Table 1 P-values and effect sizes of prioritized genes.

The highest overall biological plausibility score belonged to the circadian rhythm gene PER3 (1p36). It was genome-wide significant in both the all cohorts ACAT and Indo-European only meta-analyses (P = 1.08 × 10−6 and 1.15 × 10−6, respectively); it was genome-wide significant in REHS and replicated in IECC. CMC betas were 0.1666, 0.1574, −0.1976, −0.1102, −0.518 for IECC, EACC, BDES, EPIC-Norfolk, and REHS respectively; none of the CMC p-values were significant, however (Supplementary Data 28). Circadian rhythm genes have been shown to be associated with refractive error22 and PER3 is located near the site of a known myopia locus (MYP14) at which the causal gene has not been identified37,38,39. PER3 was expressed in ON and OFF bipolar cells. Defects in this gene are associated with familial advanced sleep phase syndrome (OMIM 616882) and may contribute to other circadian phenotypes by altering the sensitivity to light40. In defocus experiments in chicks using −15D lenses, PER3 expression decreased by −1.26-fold in the retina41. Further chick defocusing experiments, showed that PER3 expression in the retina varies under altered visual conditions42. Recently published data from the Raine Study suggest that falling asleep later was associated with a higher risk of myopia progression43.

Five genes had a score just below PER3, including the apoptosis gene PDCD6IP (3p22.3). This gene was found to be genome-wide significant in all-cohorts meta-analyses using all three tests (P = 1.07 × 10−7, 1.45 × 10−7, and 4.88 × 10−6, respectively). Further PDCD6IP had a P of <0.006 in both the EACC and IECC cohorts and did not appear in the other cohorts. Both betas in the CMC test were negative and with a large effect size for IECC (beta = −2.5) (Supplementary Data 28). Most rare variants in this gene in the EACC and IECC samples result in mean SER’s in carriers that were smaller (more negative) than in non-carriers, which meant that the CMC test would be powerful to detect this association (Supplementary Data 27). It is particularly interesting because PDCD6IP has two low single variant p-values in both IECC and EACC (0.00556 and 0.00548, respectively) and there are no rare variants in this gene in any of the other cohorts. PDCD6IP is expressed in ganglion cells of peripheral retina and plays a role in programmed cell death in uveal melanoma44 and may play a role in cornea lymphangiogenesis and vascular responses45.

MAPT (17q21.32) encodes tau proteins responsible for stabilizing microtubules; it was found to be genome-wide significant in the all cohorts EMMAX-VT analysis (P = 8.57 × 10−7). It was genome-wide significant in REHS and replicated in EPIC-Norfolk. Betas from the CMC test were −0.4342, 0.3137, −0.4965, −0.171, and −0.8015 for IECC, EACC, BDES, EPIC-Norfolk, and REHS respectively (Supplementary Data 28). Again, none of the CMC test p-values were significant. Abnormal MAPT was present in human glaucoma patients with uncontrolled intraocular pressure46 Cowan et al. showed that MAPT was expressed in several cell types in both the peripheral and foveal human retina: horizontal cells, rod bipolar cells, ON and OFF bipolar cells GLY and GABA amacrine cells and ganglion cells47. A knock-out mouse model showed decreased total retina thickness.

CHST6 (16q23.1) was genome-wide significant in both the all cohorts and Indo-European only EMMAX-VT meta-analyses (P = 8.99 × 10−7 and 2.42 × 10−7, respectively). The gene was genome-wide significant in IECC and replicated in BDES; it was also nearly replicated in EPIC-Norfolk. Though the CMC p-values were not significant, the beta for BDES was particularly large (0.95) (Supplementary Data 28). CHST6 plays a role in maintaining corneal transparency. Mutations in this gene may result in macular corneal dystrophy (OMIM 217800), which is characterized by bilateral, progressive corneal opacification and a reduction of corneal sensitivity48. The mouse phenotype of a knock-out model corresponded to that of human, i.e., abnormal cornea morphology and decreased corneal (stroma) thickness. Since our reference expression database did not contain any corneal tissue, we couldn’t score this category.

The transcription factor GRHL2 (8q22.3) was genome-wide significant in the all cohorts EMMAX-VT meta-analysis (P = 1.42 × 10−6). It was genome-wide significant in REHS and replicated in IECC. Though the p-values for EMMAX-CMC were not significant, REHS had a large beta value of 0.87 (Supplementary Data 28). Mutations in GRHL2 may lead to posterior polymorphous corneal dystrophy49 (OMIM 618031), characterized by a variable phenotype ranging from an irregular posterior corneal surface with occasional opacities, corneal edema, reduced visual acuity, secondary glaucoma, and corectopia.

The transmembrane prolyl hydroxylase P4HTM (3p21.31) was only genome-wide significant in EACC using EMMAX-VT (P = 1.00 × 10−7). However, this gene was replicated independently in the UKBB analysis. Betas for the non-significant EMMAX-CMC test were −0.1769, −02.025, 0.6106, −0.1632, −0.1177 for IECC, EACC, BDES, EPIC-Norfolk, and REHS respectively (Supplementary Data 28). P4HTM has been shown to be expressed in different ocular cells (including horizontal cells and bipolar cells). It is associated with HIDEA, a severe autosomal recessive disorder that is characterized by multiple symptoms, including eye abnormalities50 (OMIM 618493) and knock-out mice models have shown abnormal eye morphology51.

The membrane gene USH2A (1q41) was genome-wide significant in the EACC ACAT analysis (P = 7.55 × 10−9). The EMMAX-CMC tests were not significant which is reflected in the betas which were all quite small except for 0.82 in the BDES sample (Supplementary Data 28). This reflects the wide variation in effect on SER exhibited by different rare variants in this gene, with some individual variants leading to much more myopic mean SER’s in carriers compared to non-carriers while other rare variants led to more hyperopic mean SERs in carriers compared to non-carriers. (Supplementary Data 27). USH2A is well known to cause both Usher syndrome, which includes retinitis pigmentosa (RP) and mild to moderate hearing loss, as well as RP without hearing loss52. It is known to be expressed in the retina53 and has been recently shown to be associated with high myopia54

Pathway and expression analysis on top prioritized genes

We ran the IPA and FUMA analyses on the seven top prioritized genes. IPA did not identify any canonical pathways as significant; the only pathway shared across the genes was the 14-3-3-mediated signaling pathway (MAPT and PDCD6IP). The 14-3-3 proteins are a diverse group of signaling proteins.

Upstream regulator analysis found several transcription regulators of at least two genes include NKX2-1 (GRHL2 and MAPT), PSEN1 (MAPT and PER3), and SIRT1 (MAPT and PDCD6IP) (Supplementary Data 29). In the causal network analysis, the master regulator with the highest p-value covering multiple genes was the cytokine macrophage migration inhibitory factor (MIF) (Supplementary Data 29), which covered five genes. Interestingly, MIF is an essential factor in the development of zebrafish eyes55 and has been found to be a potential regulator of diabetic retinopathy56. MIF inhibitors may also be protective to photoreceptors57. Causal network analysis can be found in Supplementary Data 30 and the top functional analysis for disease result was hereditary eye disease (Supplementary Data 31). FUMA showed the top tissue expression occurred in the small intestinal terminal ileum, skeletal muscle, and the brain cortex; the latter being probably the best proxy for eye tissue (Supplementary Fig. 2a). A heat map of the expression of the seven genes across all GTEx tissues is given in Supplementary Fig. 2b).

Potential causal variants in the prioritized genes

We used annotation from wANNOVAR to identify potential causal variants within the top genes identified by the prioritization method (Table 2). For the two prioritized genes that were significant in the ACAT analyses, we were able to look at single variant p-values in addition to annotation to determine potential causal variants. There were three good candidate variants in PDCD6IP, which was genome-wide significant in IECC and replicated in EACC. rs199990824 (3:3879764) appeared in the EACC only, was predicted to be damaging by SIFT and MutationTaster, and had a CADD score of 26. The minor allele of rs199990824 appeared in 37 carriers (all heterozygotes) with an average SER of −2.04 D (SD = 3.29) compared to the non-carrier average of −0.44 D (SD = 2.27) and the overall cohort average of −0.45 D (SD = 2.28); the single variant P was 0.000183. In the IECC, the best potential causal variant was rs62620697 (3: 33905532), which was predicted damaging by MutationTaster, had a CADD score of 23.8, and had a low single variant p-value of 0.002632. Carriers (N = 9) of rs62620697 had an average SER of −2.17D (SD = 6.87) compared to that of non-carriers with an average SER of 0.20 (SD = 2.27). rs145293758 also had a low p-value (0.000311) but was not predicted damaging.

Table 2 Potential missense causal variants in prioritized genes.

Potential candidate variants were also identified in PER3, which was genome-wide significant in REHS and replicated in IECC. The REHS signal was primarily driven by two variants - rs147327372 and rs144178755, which had single variant p-values of 1.72 × 10−8 and 0.004953, respectively. However, neither variant was predicted to be damaging by the prediction algorithms nor appeared in the other European cohorts and were not significant individually, although rs147327372 did have a p-value of 0.046 in EPIC-Norfolk in the single variant tests.

The signals in the other four genes, identified primarily by the two burden-style tests, were driven by a cumulative effect of several variants. In this case, we relied primarily on annotation and reported variants that were generally agreed upon by multiple prediction programs. Five good candidate variants were located in MAPT: rs139796158 (17:44055786), rs76375268 (17:44060807), rs63750072 (17:44060859), rs143956882 (17:44067341) and rs63750191 (17:44101481). All these variants were nonsynonymous variants and predicted damaging by three of the four databases (except for rs76375268, which was predicted damaging by two). rs139796158, rs143956882, and rs63750191 all had CADD scores >26. In CHST6, the best candidate variant was the missense variant rs140699573 (16:75512734). It was predicted damaging by SIFT, PolyPhen2, MutationTaster, and FATHMM and has a CADD score of 27.4. In GRHL2, the best candidate variant was rs142411476 (8:102570910). It was predicted damaging by two databases and had a CADD score of 22. In P4HTM, two variants of interest were identified: rs140290144 (3:49002551) and rs144279528 (3:49043292). These variants were predicted damaging by MutationTaster and had CADD scores of 22.1 and 27.3, respectively. Finally, in USH2A, three variants (rs554957414 (1:216138793), rs148135241 (1:216373416), and rs201527662 (1:216419934) were all predicted damaging by the five prediction algorithms and had CADD scores above 22.

Structural analysis of prioritized candidate proteins

In addition to the annotation, we also performed protein structural modeling of all coding variants within the prioritized genes (98 variants across 6 genes/proteins) and calculated free energy difference (ΔΔG) between wildtype and mutant proteins (Supplementary Data 32); positive ΔΔG indicates a shift from a more stable to a less stable isoform. More detailed information on the structural analysis can be found in the Supplemental Methods.

In PDCD6IP, both rs145293758 (3:33905587) and rs200697599 (3: 33840234) were predicted to be highly destabilizing to protein structure (Supplementary Fig. 3a). The variant rs145293758 leads to replacement of a proline (Pro737) for an asparagine near phosphorylation sites in the protein’s self-associating domain, which could disrupt phosphorylation. rs200697599 (Ile5) and rs199990824 (Asp376; 3:33879764) result in changes to the protein’s BRO1 domain, which is involved in endosomal targeting. The isoleucine to serine mutation at rs200697599 could introduce a phosphorylation site at the N-terminus while the asparagine to aspartic acid mutation at rs199990824 could disrupt hydrogen bonds. Recall that both rs145293758 and rs199990824 were identified as potential causal variants for refractive error in IECC and EACC, respectively, based on their annotation, and single variant p-values (Supplementary Data 27).

For PER3, several variants may affect structure, including rs140974114, which results a serine (Ser751) to aspartic acid substitution at the protein’s nuclear localization signal and could disrupt hydrogen bonds and rs200140283, which results in an alanine (Ala681) to glycine substitution in the CSNK1E binding domain. Further potential disruptions occur at rs139315125 (His416), which takes place in the nuclear export signal 3 and rs77418803 (Ser919), which occurs near the nuclear export signal 2. The model is provided in Supplementary Fig. 3b).

Of the variants in MAPT, two were predicted to be destabilizing (rs76375268 at Gly213 and rs63750191 at Gln741) (Supplementary Fig. 3c). Further, rs73314997 (Ser318) and rs143956882 (Ser427) are located near known pathogenic mutations for frontotemporal dementia and Pick disease of the brain, respectively.

Three variants on the luminal domain of CHST6 were found to have a mild effect on protein stability. Two of these variants (rs201349198 at Ala326 and rs140699573 at Gln331) are positioned near variants known to cause macular corneal dystrophy (MCD) near the C-terminus. This suggests the C-terminus is sensitive to mutations enabling interference with keratan sulfation, which could cause a loss of function that can lead to a milder disease phenotype such as refractive error. The model can be found in Supplementary Fig. 4a.

In GRHL2, variants were only predicted to have a mild effect on protein structure and were not located near known pathogenic variants (Supplementary Fig. 4b).

For P4HTM, rs140290144 is predicted to be moderately destabilizing (Supplementary Fig. 4c). It substitutes a valine for a buried isoleucine (Ile227) between two calcium binding sites; potential disruption of these calcium binding sites can result in loss of function. Similarly, rs144279528 occurs in the Fe-dependent 2-OG dioxygenase domain close to an iron binding residue. Substitution of asparagine from the wildtype aspartic acid (Asp386) could have an impact on iron binding by introducing a glycosylation (due to location on protein surface) or disruption of hydrogen bonding.

Of particular interest in the protein modeling was that of usherin (USH2A), the known retinitis pigmentosa gene. Five variants were predicted to be highly destabilizing, particularly rs554957414 with a ΔΔG value of 99.19 kcal/mol). Three of these variants, including rs554957414 (Pro2329), result in the loss of proline and the loss of that ring structure could cause an increase in conformational flexibility and account for such high destabilization predictions (Supplementary Fig. 5). Further, a mutation at rs201527662 (Cys934) results in the replacement of cysteine with tryptophan and will disrupt a disulfide bond between two cysteines.

We also compared the ΔΔG of these five candidate variants with the ΔΔG of all USH2A ClinVar (n = 63) and gnomAD (n = 1870) variants using the Wilcoxon rank sum test. A significant difference between the ClinVar variants and gnomAD variants was found (P = 0.0008) and the ΔΔG values of our candidate variants was much more similar to the known pathogenic variants than the putatively benign GnomAD variants (Supplementary Fig. 6).

Potential causal variants in other genome-wide significant genes

We also identified variants within the other 122 genome-wide significant genes that had a high potential to be damaging. This included 25 variants across the five cohorts; the results are found in Supplementary Data 33. Like our prioritized genes, we also performed protein modeling on these variants (Supplementary Data 34).

Notable findings from the structural analysis include a valine to phenylalanine substitution (Val105) that would disrupt a helix in ALG3, which has been implicated in congenital disorders of glycosylation that have ocular phenotypes58 (Supplementary Fig. 7a). We also identified multiple glycine substitutions in TNFRSF13B in areas associated with heparan sulfate – glycosaminoglycan biosynthesis; heparan sulfate has been shown to play a role in eye pathologies59 (Supplementary Fig. 7b).

Discussion

In this large scale, gene-based analysis of rare variants in refractive error, 129 associated genes were identified. Though many of the genes were associated with eye conditions or ocular development, only ten genes had previously been identified with refractive error or myopia: six with myopia including two with high myopia — USH2A and GDF1554,60 — and ten with refractive error. Pathway analysis revealed that 59 of these genes were involved in cell cycle, organ morphology, and embryonic development and 21 of these genes had upstream regulators that were directly involved in retinal development or eye morphogenesis. Given the substantial level of missing heritability still present within the refractive error, it is likely that at least some of this heritability is explained by rare variants within these genes. The fact that the significance of these genes and the explained variance of refractive error due to these genes did not significantly change after inclusion of GRS in the analysis, suggests that these association signals are independent from the effects of known common refractive error risk variants.

This large scale meta-analysis used gene-based tests for rare variants in refractive error, and was undertaken to identify rare variants that may be partially responsible for missing heritability, particularly within the CREAM data set21. The CREAM data set is well-suited for this type of rare variant analysis. First, we were able to combine many smaller cohorts into two mega-analyses – IECC (N = 11,505) and EACC (N = 4867). These meta-analyses greatly boosted power to detect variants with a MAF ≤ 0.01 and allowed more rare variants to be combined into a single, gene-based marker. In addition, we had three cohorts >1000 subjects to observe replication and perform the combined meta-analyses. Genes identified in this study were done so across a very large pool of subjects, lowering the potential for type I error.

The multiethnic composition of this dataset also allowed for observation both across and within ethnicities. We have delineated how rare variants in some genes were found only in Indo-Europeans and others in Eastern Asians, as well as some that cut across the ethnic divide. Thus, we were able to identify risk genes that might contain rare variants that affect SER within a particular population (such as ST6GALNAC5 in IECC), or more universally, like PDCD6IP.

PER3, PDCD6IP, MAPT, CHST6, P4HTM, USH2A, and GRHL2 are good candidate genes, all known to be associated with ocular abnormalities. PER3 is a circadian rhythm gene; circadian rhythm is associated with refractive error22. PDCD6IP and MAPT are both expressed in the retina while CHST6, and GRHL2 are both involved in corneal dystrophy49,61. P4HTM affects eye morphology in mice knockouts;55 it is also notable for being replicated in the UKBB analysis. USH2A is expressed in the retina and is a known RP gene52,53.

Five of these prioritized genes were found to be regulated by the cytokine MIF, which has been shown to regulate zebrafish eye development55 and have protective effects for photoreceptors57. More work on the MIF network with respect to refractive error is needed. We were further able to identify potential causal variants in these prioritized genes and, using structural analysis, were even able to determine the effect on protein stability.

STON1, C5AR1, and WDFY3 were all replicated in UKBB. C5AR1 is expressed in retinal Müller cells, which are known to play a role in retinal disease62. STON1 is associated with AMD63 while WDFY3 is associated with inherited retinal dystrophies64. Other potential interesting candidates include GDF15, which was a top significant gene across all four meta-analyses, and has been found to be significantly overexpressed in highly myopic eyes60 and patients with vitreoretinal disorders65 and may also be a potential molecular marker of neurodegeneration in glaucoma66, and MRPS27. This gene was genome-wide significant in the meta-analysis and in two individual cohorts, REHS and EPIC-Norfolk. While MRPS27 is not known to be associated with eye disease, a common variant in this gene was found to be genome-wide significant in the GWAS meta-analysis of refractive error conducted by Hysi et al.22. Other candidate genes with known links to eye disease/functions include HCAR1 with glaucoma67,68 and EPB41L2 with a potential role in phototransduction69.

One final interesting set of genes was those that were genome-wide significant within a single cohort. This implies that there may be rare risk variants unique to a certain population that are fixed in other populations. This includes ST6GALNAC5, which was genome-wide significant in IECC in both EMMAX-VT and ACAT (P = 5.84 × 10−7, 9.03 × 10−10). This gene catalyzes the transfer of sialic acid; polysialic acid has been shown to prevent vascular damage in retina70 and to stimulate the generation of new rods in the retinas of developing zebrafish71. Other interesting significant genes unique to a single cohort included SERTAD3 in EACC, which is overexpressed in retinoblastoma72 and KLF1 in EPIC-Norfolk, which may be expressed in the eye73. We also note that gene-based analyses for refractive error had been previously performed in BDES74. Of the five significant genes from that analysis, two were replicated at P ≤ 0.05 — PTCHD2 and CRISP3. PTCHD2 is located near the known myopia locus MYP14 on 1p36.2239,75 and CRISP3 is expressed in the retina74,76.

This study used multiple tests (EMMAX-VT, EMMAX-CMC and ACAT) to identify significant genes and looked at overlap to find more robust signals. By using multiple tests that differ slightly in design, we were able to cast a wider net in our search. The ACAT test was particularly useful for identifying potential causal variants within a candidate gene, as it allowed us to observe which variants had significant single variant p-values. This enabled us to zero in on potential causal variants in genes like PDCD6IP and PER3, though we note that highlighting any potential causal variants are speculative at this point. We also felt it prudent to not give more weight to the result of one test over another and instead take the largest number of unique, significant genes since this was a discovery study, though we did try to give more weight to the genes that were identified by all three tests, such as PDCD6IP.

We note that the three tests did not always agree, though the two burden-style tests agreed more often than ACAT. This is not surprising given the different nature of the tests. Both EMMAX-VT and EMMAX-CMC were burden-style tests that create a new, gene-based marker on which the p-value is calculated. The ACAT test was an aggregation-style test created from single variant p-values that does not create a new gene-based marker77. This is a critical distinction; it means that the markers analyzed in the burden-style tests and the ACAT tests are different. The ACAT analyses may have been slightly underpowered with respect to the burden-style tests, as we used a minimum allele count of three in our analyses. For EMMAX-VT and EMMAX-CMC this was calculated across all variants within a gene and for ACAT at each individual variant, which resulted in certain variants being removed from the ACAT analysis that were present in the burden style analyses. Therefore, genes present in all three analyses indicate a more robust association with refractive error.

Since this is an exome microarray study, there were still large portions of the genome that would not have been covered in this work. Thus, there are almost certainly additional rare risk variants for refractive error in these cohorts that were not genotyped in this study. The goal of this discovery study was to provide an initial starting point for further analysis; we plan whole genome sequencing on high-risk individuals identified by this study. These non-genotyped variants could explain why we did not see replication with previous refractive error GWAS findings21,22. Some of the genes identified in the common variant GWAS may have included rare risk variants that were specific to a particular population that was not used in this study.

Another challenge is that due to the gene-based nature of this work, it is critical to remember that the gene-based markers across the cohorts are often made up of different variants. This means that the gene-based marker for gene A in IECC might be made up of three variants, and in REHS might be made up of seven variants, two of which are shared across the two cohorts. This means that it was possible that some cohorts may have had association tests that were less significant because of inclusion of non-significant rare variants that did not appear in other cohorts.

We also note that this was an exploratory analysis to determine candidate genes, and one of our goals was to cast a wide net to capture potential candidates. Therefore, we chose a more liberal replication significance threshold, which may allow for potential type I errors but would also ensure that a good candidate gene would not be missed or because functional rare variants did not appear in that cohort.

We also note that while we did utilize eye expression data in this study, we were limited to expression from retinal tissue only. We are actively seeking expression data from additional eye tissue, particularly corneal and scleral tissue, to further prioritize these genes.

This work identified 129 genome-wide significant genes for refractive error using the gene-based rare variant approach. Most of these genes are novel for association with refractive error but many have associations with other ocular abnormalities. This is the largest gene-based study of rare variants performed on refractive error. The fact that we found over 100 significant genes shows that rare variants (MAF ≤ 0.01) do account for some of the missing refractive error heritability not identified in the common variant GWAS. We were able to prioritize seven of these genes as our best candidate genes for causality based on biological function – PDCD6IP, MAPT, CHST6, GRHL2, USH2A, P4HTM, and PER3 –as well as GDF15 and MRPS27 based on the strength of association. Validation studies, including replication within additional cohorts, are planned to identify the best candidates for functional studies to unravel the pathophysiology of refractive error and myopia. We also plan further analysis with the conversion of our quantitative refractive error phenotype to binary phenotypes to test for association with myopia, hyperopia, and astigmatism.

Methods

Cohort details, genotyping and joint recalling of exome array data

Fourteen population-based CREAM cohorts that had exome chip genotypes on individuals with refractive error measurements were used in this study. These 14 cohorts were: Singapore Chinese Eye Study (SCES), Singapore Malay Eye Study (SiMES), Singapore Indian Eye Study (SINDI), Age Related Eye Study (AREDS), Rotterdam Study I (RSI), Erasmus Rucphen Family (ERF), Raine Eye Health Study (REHS) of the Raine Study, Beaver Dam Eye Study (BDES), Estonian Genome Center for the University of Tartu (EGCUT), Finnish Twin Study on Aging (FITSA), Ogliastra, Croatia-Korcula, TwinsUK, and EPIC-Norfolk. Each individual cohort is described in further detail in Supplementary Note 1. All studies were performed in accordance with the Declaration of Helsinki and approved by the institutional review boards of the participating institutions. All participants provided written informed consent. The Institutional Review Board of the National Institutes of Health (NIH) determined that the analyses of deidentified data performed in the current study and the meta-analysis qualified as “not human subjects research” and did not require specific protocol approval. The study was performed under guidelines agreed to in Data Use Agreements between the individual participating studies and the NIH and the Erasmus Medical Center where these analyses took place.

Thirteen cohorts had been genotyped on the Illumina HumanExome-12 v 1.0 or v 1.1, or the Illumina HumanCoreExome-12 v1.0; EPIC-Norfolk was genotyped on Affymetrix UK BioBank Axiom Array. The 13 cohorts on the Illumina arrays were jointly recalled to obtain a larger sample size of rare variants (here defined as variants with a MAF ≤ 0.01), as recalling genotypes simultaneously across all samples increases the ability to call rare variants with more discrete distinction between allele calls and sensitivity for low-frequency (high-intensity) loci. All data were recalled for HG19 using GenomeStudio® v2011.1 (Illumina Inc., San Diego, CA, USA) per microarray platform and PLINK78. We note that these exome-based genotyping arrays consist of previously validated, high confidence rare variants, reducing the likelihood that findings might be the result of artifacts or genotyping errors that might affect sequencing studies. Further, since the imputation of very rare variants is difficult, only genotyped rare variants were used in this study; there were no imputed variants.

Combination of cohorts for mega-analysis

To increase power on rare variants, we sought to combine as many cohorts as possible into a mega-analysis. We thus performed principal components analysis (PCA) on all our cohorts after pruning the datasets for linkage disequilibrium using the pcair, part of the R package GENESIS. Pcair is designed to perform PCA in samples with cryptic relatedness and provides accurate ancestry inference that is not confounded by family structure79. For reference, we included individuals from all 11 HapMap reference panels in the PCA.

PCA showed two major groupings based on known ethnicity. The first consisted of the Han Chinese SCES and Malaysian SiMES cohorts, which were combined into the Eastern Asian combined cohort (EACC); we realize that technically the Malaysian population are Southeast Asians, but for simplicity will refer to this cohort as Eastern Asian. The second dataset consisted of the eight European cohorts (RSI, Croatia-Korcula, FITSA, EGCUT, TwinsUK, ERF, AREDS, and Ogliastra) and the one Indian cohort (SINDI). These cohorts were combined into the Indo-European combined cohort (IECC).

Analysis was performed on five discrete cohorts – IECC, EACC, EPIC-Norfolk, BDES, and REHS. The IECC analysis was performed in the Netherlands, while the EACC was performed in the United States as well as in the Netherlands. The BDES, EPIC-Norfolk, and REHS analyses were performed in their countries of origin (the United States, the United Kingdom, and Australia, respectively) as was legally required; these studies served on a per study basis as replication cohorts. A breakdown of all cohorts and the combined cohort with which they are grouped is provided in Supplementary Data 1.

Statistics and reproducibility

Quality control of the genotype data was performed as follows. For the combined cohorts, all raw cohort data were merged into a single file. All five cohorts then underwent identical quality control using PLINK78. Any individual not genotyped at 99% of all variants was removed and any variant not genotyped at 99% was also removed. Variants with a HWE p-value less than a Bonferroni-corrected p-value (defined as 0.05/total number of variants in the dataset) were also excluded. We also checked for batch effects and calculated the identity-by-descent (IBD) value of all individuals in the cohort, removing duplicates and twins. Many of the datasets exhibited cryptic relatedness amongst subjects (especially the Ogliastra study, which enrolled participants on the Italian island of Sardinia). Related individuals were not removed from the cohorts, as our analysis methods corrected for relatedness.

After QC, IECC had 13,097 individuals with 150,619 variants, EACC had 4867 individuals with 98,750 variants, BDES had 1740 individuals with 105,671 variants, REHS had 1,020 individuals with 92,313 variants, and EPIC-NORFOLK had 6282 individuals with 637,160 variants.

The refractive error phenotype analyzed here was defined as the quantitative phenotype mean spherical equivalent (SER), measured in diopters (D). Refractive error measurements in both eyes were taken from all participants and SER was calculated by adding the spherical refractive error + half the cylindrical refractive error in each eye, then taking the mean of both eyes. Individuals who had undergone procedures that could alter refraction, e.g., cataract surgery, laser refractive error procedures, retinal detachment surgery, and other ophthalmic conditions that may influence refraction were excluded from these analyses. The average spherical equivalents and standard deviations of each cohort are provided in Supplementary Data 1.

Gene-based association analysis was performed using a gene-based version of EMMAX80,81. EMMAX uses a kinship matrix to correct for population stratification and cryptic relatedness, which are present in these cohorts. EMMAX has been modified to perform gene-based burden-style tests, including the variable threshold (VT)29 and the combined multivariate and collapsing (CMC)28 methods through the software EPACTS (https://genome.sph.umich.edu/wiki/EPACTS), which we will term EMMAX-VT and EMMAX-CMC, respectively82.

We analyzed all five cohorts with EMMAX-VT and EMMAX-CMC using genetic variants with a maximum MAF = 0.01. We only included variants that were in an exon of a gene (as defined by RefSeq), including both nonsynonymous and synonymous variants. Common variants (MAF > 0.01) and variants with a MAF ≤ 0.01 that mapped to an intergenic region were excluded from the analysis. Any gene with a minor allele count (MAC) of less than three for the cohort was dropped from the analysis.

Initial analyses were performed without any covariates. We performed two follow-up analyses using age, sex, and education level (low, intermediate, and high). One covariate analysis included all three covariates, while the second used age and sex only (education level removed). We note that the inclusion of covariates resulted in no significant difference between significant genes; for brevity we only discuss the results without covariates. In addition, the Ogliastra cohort did not have data on age and education, thus ~3000 individuals were removed from the IECC covariate analyses. Hence, the covariate analyses are underpowered with respect to the analyses without covariates

We also performed gene-based analysis using the Aggregated Cauchy Association Test (ACAT)77. ACAT is a novel method that allows individual p-values to be combined into a gene-based p-value that is particularly useful for rare variants. To take advantage of this method, we analyzed all variants with a MAF ≤ 0.01 (with a minimum allele count of 3) using the original, single variant-based version of EMMAX80,81. We then combined the EMMAX p-values for each gene using the ACAT package implemented through R. Only nonsynonymous and synonymous exonic variants were included in the analysis.

Meta-analysis was then performed across our discovery cohorts. The burden-style tests that created a single p-value for a gene precluded the use of popular meta-analysis programs such as METAL, which require the input of reference and alternative alleles. Instead the gene-based p-values from the EMMAX-VT, EMMAX-CMC and ACAT were combined across studies using the classic method described by Fisher83. Fisher’s method was implemented through the R package metap84. We defined genome-wide significant as 1 ×10−5, based on the standard for gene-based studies. Replication was defined as a gene having a P ≤ 0.05 in one cohort after being found to be genome-wide significant in one of the other four cohorts. We note that this replication value is liberal and may lead to an inflation in false positives. However, as this is a discovery analysis, we were willing to allow some extra false positives in order to capture as many true positives as possible. A more stringent replication p-value of 3.9e−04 was also used to adjust for 129 attempted replications and these more stringently replicated genes were also reported.

We performed two separate meta-analyses. The first combined all five cohorts (IECC, EACC, BDES, EPIC-Norfolk, and REHS), which will be referred to as the multiethnic meta-analysis. The second combined the four ethnically Indo-European cohorts (IECC, BDES, REHS, and EPIC-Norfolk), which will be referred to as the Indo-European meta-analysis. The Indo-European meta-analysis was designed to identify any genes that might be significant in Indo-European-derived individuals but not significant in Eastern Asians; thus, we also report the Eastern Asian analyses p-values.

To investigate whether signals identified by the rare variant analysis were being partially driven by common variants, we calculated polygenic risk scores (PRS) for all cohorts using common variants identified in previous GWAS22. PRS were calculated for each subject using PLINK (Supplementary Data 2). All rare variant analyses were then repeated using the PRS values for each subject as a covariate. We compared the explained variance (R2) of our top individual genes between the analysis with and without including PRS (Supplementary Data 3 and 4).

Independent replication of the genome-wide significant genes was performed in the UK Biobank (UKBB) via extraction of all rare variants comprising the genome-wide significant genes and repeating the same analyses.

Pathway and expression analysis

All genome-wide significant genes in the four meta-analyses and the EACC analyses were analyzed using Ingenuity Pathway Analysis (IPA) (QIAGEN Inc., https://digitalinsights.qiagen.com/products-overview/discovery-insights-portfolio/analysis-and-visualization/qiagen-ipa/)85. We performed various analyses through IPA, including canonical pathway analysis (identifying which genes are in known pathways), upstream regulator analysis (which identifies genes, RNAs, and proteins that regulate the genes in the dataset), and causal network analysis (which expands the pathway analysis to include the upstream regulators in the pathway analysis). IPA also identified disease phenotypes, cellular/molecular functions, and physiological networks associated with the genes in the dataset. Additional pathway and expression analysis were also performed with Functional Mapping and Annotation of GWAS86,87 (FUMA), which provided tissue-enrichment information from GTEx and gene-group information from MsigDB. We repeated the IPA and FUMA analyses for our top prioritized genes from the schema proposed below.

Gene prioritization based on biological function

To prioritize genes according to biological background, we evaluated genes following a modified schedule proposed by Fritsche et al.88 and further adapted by Tedja et al.21. Genes were ranked based on points equally assigned for the presence of replication, expression and biological plausibility. Evidence for ocular expression was based on single-cell expression data from adult human retina and developed organoids47. Biological plausibility was based on the presence of an ocular phenotype in OMIM and/or DisGeNET89 as well as an ocular phenotype in a knock-out mouse model of this gene (Mouse Genome Informatics and International Mouse Phenotyping Consortium databases). The prioritization score ranged from zero to seven. In addition, we performed a look-up of the top-genes to screen for drugs that had these genes as target using SuperTarget90, PharmGkb91, STITCH v5.092 and DrugBank v5.093.

Variant annotation for potential causal variants

We performed annotation to identify potential causal variants within the significant genes. Therefore, we annotated all exonic variants from genome-wide significant genes using wANNOVAR94,95,96, which collates functional predictions from popular prediction algorithms like SIFT97, PolyPhen298, MutationTaster99, CADD100, and FATHMM101. We initially looked at the top-ranked genes in the prioritization approach described above, giving preference to variants that appeared to either be driving the gene-based association analysis or variants that the five annotation algorithms agreed upon as being damaging. We further expanded this approach to all significant genes identified in the meta-analyses.

Structural analysis of variants

We also performed structural analysis of all coding variants within our top prioritized genes, as well as all mutations predicted to be deleterious in all genome-wide significant genes. We examined 1) all coding variants tested within six prioritized novel candidate myopia genes and 2) the predicted deleterious variants in USH2A, a non-prioritized but genome-wide significant gene with the highest number of predicted deleterious variants. The first set comprises 98 variants across 6 proteins, including 26 of special interest, which were looked at more closely and are covered below. Those are labeled in yellow in figures and represent missense variants predicted to have a deleterious effect by at least one commonly used variant effect predictor-tool (Table 2) or/and which displayed single variant association p-value below the nominal threshold of 5%. Crystal structures were obtained from the Protein Data Bank;102 when crystal structures were not available, homology models were used for visualization and energy calculations. We used both FoldX RepairPDB and Position Scan103 to predict differences in free energy between the wildtype and mutant proteins (ΔΔG, measured in kcal/mol). ChimeraX104 was used to visualize affected proteins. We also incorporated prior information from publicly available databases (OMIM, PFam, ClinVar, gnomAD, UniProt, RCSB PDB) and predicted functional effects (Missense3D105). A more detailed explanation of each individual protein can be found in Supplementary Note 2.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.