Cross-ethnic meta-analysis identifies association of the GPX3-TNIP1 locus with amyotrophic lateral sclerosis

Cross-ethnic genetic studies can leverage power from differences in disease epidemiology and population-specific genetic architecture. In particular, the differences in linkage disequilibrium and allele frequency patterns across ethnic groups may increase gene-mapping resolution. Here we use cross-ethnic genetic data in sporadic amyotrophic lateral sclerosis (ALS), an adult-onset, rapidly progressing neurodegenerative disease. We report analyses of novel genome-wide association study data of 1,234 ALS cases and 2,850 controls. We find a significant association of rs10463311 spanning GPX3-TNIP1 with ALS (p = 1.3 × 10−8), with replication support from two independent Australian samples (combined 576 cases and 683 controls, p = 1.7 × 10−3). Both GPX3 and TNIP1 interact with other known ALS genes (SOD1 and OPTN, respectively). In addition, GGNBP2 was identified using gene-based analysis and summary statistics-based Mendelian randomization analysis, although further replication is needed to confirm this result. Our results increase our understanding of genetic aetiology of ALS.

For people of European ancestry, the lifetime risk of amyotrophic lateral sclerosis (ALS) is 0.3-0.5% 1,2, with peak age of onset of 58-63 years 3 and median survival of 2-4 years 4. Investigations of families with multiple affected individuals have led to the identification of mutations that segregate with disease in a number of genes, including SOD1, C9orf72, TARDBP, FUS and TBK1 5,6. However, about 90% of cases 5 ('sporadic ALS' (sALS)) present with sparse or no family history. Nonetheless, genome-wide association studies (GWAS) have provided direct evidence of a genetic contribution to sALS, with estimates that 8.5% 7 of variance in liability is tagged by common single-nucleotide polymorphisms (SNPs). Currently, only a small proportion of this variation (0.2% of variance in liability) 7 is accounted for by the six common loci (C9orf72, UNC13A, SARM1, MOBP, SCFD1, C21orf2) identified as significant based on association analysis of 12,577 cases and 23,475 controls 7. The SNP-heritability estimate implies that more risk loci will be detected with increasing sample size, as found for other complex genetic diseases 8. Whole-exome sequencing (WES) studies, designed to identify genes enriched for rare variants, have also been conducted for sALS. The largest study, comprising 2,874 cases and 6,405 controls, identified TBK1 as a novel ALS risk gene 6, with GWAS support for association of common loci (p = 6.6 × 10 −8) 7. Rare variant burden analysis in a WES of 1,022 index familial cases identified p.Arg261His in NEK1 as an ALS-associated variant, and follow-up in large samples suggests that this variant together with NEK1 loss-of-function mutations account for ~3% of ALS cases 9.
To date, the largest genetic studies for ALS are in subjects of European ancestry, but common variants associated with disease are likely to be ancient and shared across ethnicities. Given sufficient power, cross-ethnic genetic studies can aid fine mapping of disease loci, exploiting differences in allele frequency and linkage disequilibrium (LD). In China, the lifetime risk of ALS is estimated to be lower (0.1%) 1 and its mean age of onset is estimated to be a few years earlier than in Europe 4,10. High-penetrance mutations in known ALS genes identified in Europeans have been detected in Chinese cases 11, but the frequency of the C9orf72 expansion is much lower (0.3%) 12 than in Europeans (frequency 7%) 5, and it may have arisen on a different haplotype background 12.
In a cross-ethnic meta-analysis of the largest GWAS for ALS in Europeans 7, together with a new Chinese data set, we identify the GPX3-TNIP1 locus to be significantly associated with ALS (p = 1.3 × 10 −8). This association is replicated in two independent Australian cohorts with a combined p-value of 1.7 × 10 −3. Previous studies indicate functional relevance of both GPX3 and TNIP1 13-18. The identification of this locus contributes to a better understanding of the genetic aetiology of ALS.

Results
Genome-wide association analysis. We conduct a genome-wide (GW) association analysis in a Chinese sample of 1,234 sALS cases and 2,850 controls (Supplementary Table 1 and Supplementary Figs 1−3). The genomic inflation factor λ GC of 1.02 and λ 1000 of 1.01 showed no evidence for inflation in test statistics. The combined effect of all common genetic variants on ALS liability (SNP-heritability) estimated from the Chinese GWAS data is 15.1% (SE: 4%; p = 9.5 × 10 −5) using GCTA-GREML 19 and 15.0% (SE: 3.5%) using LD score regression 20 (intercept 1.0, which also shows no evidence of population stratification). Given the SE, these estimates are not significantly different from the estimate of 8.5% (SE 0.5%) from European data 7. Partitioning of the SNP-heritability by chromosome showed a significant positive correlation with chromosome length (Supplementary Fig. 4a), consistent with a polygenic architecture. Based on minor allele frequency (MAF) bin, the SNP-heritability was attributed to SNPs across the MAF range, but SEs per bin were large (Supplementary Fig. 4b); similar analyses based on European data suggested that less common SNPs tagged more variation than other MAF classes 7.
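The comparison of the Chinese and European SNP-heritability estimates can be illustrated with a simple two-sample z-test. This is a minimal sketch, assuming the two estimates are independent and approximately normally distributed; the function name is hypothetical and the test is an illustration, not necessarily the comparison used in the study.

```python
import math

def two_sided_p_for_difference(est_a, se_a, est_b, se_b):
    """z statistic and two-sided p value for est_a - est_b.

    Assumes the two estimates are independent and approximately normal,
    so the SE of the difference is sqrt(se_a^2 + se_b^2).
    """
    z = (est_a - est_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p

# Values reported in the text: 15.1% (SE 4%) in Chinese, 8.5% (SE 0.5%) in Europeans
z, p = two_sided_p_for_difference(0.151, 0.04, 0.085, 0.005)
print(f"z = {z:.2f}, p = {p:.2f}")
```

Under these assumptions the difference does not reach nominal significance at the 0.05 level, because the SE of the Chinese estimate dominates the SE of the difference.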
No individual SNPs passed the GW significant p value threshold of 5 × 10 −8, and none of the significant SNPs reported in the European GWAS 7 replicated in our samples (p > 0.05). We also checked the associations of the two GW significant SNPs reported in a previous GWAS of a Chinese cohort of ALS patients 21. However, we could not replicate the associations reported in that study. We note that despite evidence for population stratification, principal components derived from SNP data of the previous study were not included as covariates in their association analysis. The p values for rs6703183 and rs8141797 are 0.07 and 0.12 in our Chinese samples and 0.66 and 0.94 in European GWAS results, respectively. Direction of effect sign tests (Supplementary Table 2) and polygenic risk scoring analyses (Supplementary Fig. 5) provided no conclusive evidence of shared risk loci (Nagelkerke's R 2 = 0.002; p = 0.01). These results are not unexpected given the size of our sample and effect sizes estimated in Europeans. The Chinese GWAS sample had 80% power to identify common genetic variants of genotype relative risk of 1.4 and 1.8 for risk allele frequency of 0.2 and 0.05, respectively, at the GW threshold of significance p = 5 × 10 −8.
Meta-analysis. Meta-analysis of our results with those of the European GWAS 7 identified a new GW significant locus at chromosome 5q33.1 (rs10463311, risk allele C, odds ratio (OR) 1.11, 95% confidence interval (CI): 1.06-1.14, p logistic = 2.9 × 10 −8 ;
Functional relevance of GPX3 and TNIP1. Both GPX3 and TNIP1 are genes that could have functional relevance for ALS. The protein glutathione peroxidase 3 (GPX3) is an antioxidant molecule functionally related to superoxide dismutase 1 (SOD1) 13; many SOD1 single-nucleotide variants are pathogenic for ALS. In a mass spectrometric screen of sera of SOD1 H46R rats compared to their wild-type (WT) controls in the presymptomatic stage (12 weeks of age) of ALS, Gpx3 was detected as one of the two significant results (1.3-fold increase in expression) 14. In the same study, Gpx3 expression was significantly lower (0.74-fold, p = 0.009) compared to WT controls by disease end stage, a finding which was replicated in blood sera of sporadic ALS cases (n = 18) and controls (n = 35) (GPX3 0.41-fold lower, p = 0.008) 14. Both GPX3 and TNIP1 are functionally associated with NF-κB, the master regulator of inflammation 17,19, with upregulation of NF-κB associated with death of motor neurones 15. Protein-protein interaction analysis 18 links GPX3 to SOD1 and TNIP1 to OPTN, and OPTN also harbours mutations associated with familial ALS 5. TNIP1 is associated with a wide range of immune disorders 22,23, although our most associated SNP (rs10463311) is not in LD with specific SNPs associated with these disorders 24. We investigated differential expression of GPX3 and TNIP1 between ALS patients and controls, but given small sample sizes, the results were not conclusive (Supplementary Note 1, Supplementary Table 3, Supplementary Fig. 6).
In a pleiotropy-informed analysis 25 applied to the European GWAS summary statistics 7, rs10463311 was identified as an ALS-associated SNP, providing additional, albeit not fully independent, support for this locus.
Gene-based association analysis. No genes were significantly associated with ALS from gene-based association analysis implemented in fastBAT 26 of Chinese data (based on Bonferroni correction for ~18,000 autosomal genes, significance declared at 2.8 × 10 −6), but meta-analysed results (Supplementary Table 4) identified multiple genes (reflecting LD and overlapping gene boundaries) at the previously reported chromosome 5, 9, 14 and 17 GWAS loci. Two new loci on chromosome 17 (17q12 and 17q21.2) were also significant (minimum genic p = 3.3 × 10 −7 and 1.2 × 10 −7, respectively). The former locus was also supported by summary statistic-based Mendelian randomization (SMR) analysis 27, which combines the disease-SNP association with gene expression-SNP association results and has a GW significance threshold of p SMR < 8.4 × 10 −6 (Supplementary Fig. 7; Supplementary Data 2), with the most significant association for GGNBP2 (European only p SMR = 4.6 × 10 −6; meta-analysis p SMR = 9.8 × 10 −6). The two replication samples did not provide support for the GGNBP2 SNP implicated from the SMR analysis (Supplementary Table 5); larger sample sizes are needed to confirm the association and to provide evidence to exclude ZNHIT3 (p SMR = 3.1 × 10 −5) or MYO19 (p SMR = 2.2 × 10 −4) as contributing to the association in this region. Gene-set pathway analysis implemented in MAGMA and applied to the Chinese/European meta-analysis results did not find any pathways significantly associated with ALS after multiple testing correction (Supplementary Table 6).

Discussion
In summary, using a cross-ethnic design we identify association of the GPX3-TNIP1 locus with ALS. This locus was identified by combining GWAS results from our Chinese data with the largest European GWAS data 7 and replicated in independent Australian samples. In addition, GGNBP2 was identified using gene-based analysis and SMR analysis, although further replication is needed to confirm this result. The discovery of a novel risk locus significantly advances our understanding of ALS aetiology.

Methods
Chinese ALS cases and controls. DNA extraction. In the Chinese cohort, genomic DNA was extracted from whole blood using the DNA Extraction Kit (Beijing Aide Lai Biotechnology Co. Ltd., Beijing, China). In the Australian replication cohorts, the majority of DNA was extracted from fresh whole blood using manual extraction protocols, except for 90% (118 out of 131) of UNSW/UM control samples, where DNA was extracted from frozen whole blood or lymphocytes using an automated purification system, Qiagen Autopure LS (Qiagen, Valencia, CA, USA).
Genome-wide association study. We performed GW genotyping in the discovery cohort using the Illumina HumanOmni ZhongHua-8 v1.0 and v1.1 arrays. These arrays contain 900,015 (v1.0) and 894,517 (v1.1) variants, respectively. Before testing for the association between each variant and disease status, we carried out quality control (QC) steps to identify and exclude poor quality samples and genetic variants. We excluded individuals based on the following QC filters: (i) genotyping call rate <99% (134 individuals); (ii) sex mismatch between genotype and clinical information (6 individuals); (iii) ancestry outliers (6 SDs from HapMap-CHB means of PC1 and PC2; 30 individuals); and (iv) duplicated or related individuals (genetic relationship matrix >0.05; 195 individuals). We excluded genetic variants based on the following criteria: (i) genotype call rate <99%; (ii) MAF <1%; (iii) deviation from Hardy-Weinberg equilibrium (HWE; p < 10 −6); and (iv) differential missingness in genotypes between cases and controls (p < 10 −6). After these QC steps, 1,234 cases and 2,850 controls with genotypic information from 753,038 markers remained for the subsequent analyses. We imputed unobserved genotypes to the 1000 Genomes Project Phase 1 v3 reference panel (all ethnicities) using samples and markers that passed QC. We implemented a two-step process, i.e., haplotyping using HAPI-UR 32 and imputation using IMPUTE 33. We imputed 38,033,906 SNPs, but after QC (i.e., excluding markers with MAF <0.01, imputation quality score <0.80 and HWE p < 10 −6), 6,613,544 SNPs were available for analysis.
Validation sample genotyping. The first validation sample was genotyped on the Illumina Human Core Exome Array. QC and imputation followed the same pipeline as for the Chinese samples. After QC, 145 cases and 116 controls were available for analysis. For the second validation sample, SNPs were genotyped via TaqMan assay; the reaction mix included 1.0 μl of genomic DNA (10 ng/μl), 0.25 μl Custom TaqMan genotyping assay 20× (Life Technologies), 2.5 μl TaqMan SNP genotyping MasterMix 2× (Life Technologies) and 6.25 μl MilliQ water. The thermocycler program included 30 s at 60°C, 10 min at 95°C, followed by 40 cycles of 15 s at 95°C and 1 min at 60°C and a final step of 30 s at 60°C. Fluorescent signals were analysed on a Viia7 Real-Time PCR System and genotypes were determined by allelic discrimination using the Viia7 Real-Time PCR System Software (Life Technologies). Genotype calling rates were 94% for rs4958872 (LD r 2 = 1 proxy for rs10463311) and 91% for rs9906189. After QC, 431 cases and 567 controls were available for analysis.
Genetic association analysis. The association analysis between genetic variants and disease was conducted using a linear mixed model framework implemented in GCTA (mlma-loco) 34 . To compare the results, we also used a logistic regression model by fitting five principal components as covariates. Genomic inflation factor was calculated as the median of Chi-square test statistics divided by its expected value (0.455).
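The genomic inflation factor described above can be sketched in a few lines. This is a minimal illustration, assuming 1-d.f. chi-square association statistics; the rescaling to 1,000 cases and 1,000 controls (λ 1000) uses a commonly applied formula that is an assumption here, since the text does not state how λ 1000 was computed. Function names are hypothetical.

```python
import statistics
from statistics import NormalDist

# Median of a 1-d.f. chi-square distribution, obtained as the square of the
# 75th percentile of the standard normal distribution (about 0.455).
EXPECTED_MEDIAN_CHI2_1DF = NormalDist().inv_cdf(0.75) ** 2

def genomic_inflation(chi2_stats):
    """lambda_GC: median observed chi-square statistic over its expected value."""
    return statistics.median(chi2_stats) / EXPECTED_MEDIAN_CHI2_1DF

def lambda_1000(lambda_gc, n_cases, n_controls):
    """Rescale lambda_GC to a study of 1,000 cases and 1,000 controls.

    Assumed formula (not stated in the text):
    1 + (lambda - 1) * (1/n_cases + 1/n_controls) / (1/1000 + 1/1000)
    """
    scale = (1 / n_cases + 1 / n_controls) / (1 / 1000 + 1 / 1000)
    return 1 + (lambda_gc - 1) * scale

# Sample sizes and lambda_GC reported for the Chinese GWAS
print(round(lambda_1000(1.02, 1234, 2850), 3))
```

With the reported λ GC of 1.02 and the post-QC sample sizes, this rescaling gives a value that rounds to the reported λ 1000 of 1.01.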
Gene-based analysis. To test for the association between a set of variants within a gene (±50 kb) and ALS, we used GCTA-fastBAT 26 with SNP association analysis p values as input. This test complements SNP-disease association analysis by identifying genes that may harbour multiple independent associations that individually have not achieved significance. For the Chinese data analysis, we used our own GWAS data as the reference to calculate LD; for the European analysis, we used ARIC samples (dbGAP accession phs000090.v1.p1).
Whole-genome estimation analysis. Genomic relationship matrix (GRM) restricted maximum likelihood (GREML) analysis using GCTA 19,35,36 was used to estimate the total contribution of common genetic variants to the liability of ALS (the SNP-heritability). This analysis fits all SNPs simultaneously in a linear mixed model framework to estimate the proportion of variance in disease liability explained by all SNPs. To avoid bias, for example, due to common environmental factors, we excluded related individuals based on GRM values >0.05. A lifetime disease risk of 0.002 was used in the conversion of the estimate to the liability scale 37 (compared to 0.0025 used in the European conversion, although the results are robust to these choices). LD-score regression 20 was applied to GWAS summary statistics as an alternative method to estimate the contribution of common genetic variants to variation in the liability of ALS.
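The conversion of a SNP-heritability estimate from the observed (0/1) scale to the liability scale can be sketched as follows. This is a minimal illustration, assuming the widely used transformation for ascertained case-control samples, h2_liability = h2_observed × K²(1 − K)² / (z² P(1 − P)), where K is the lifetime risk, P the proportion of cases in the sample and z the height of the standard normal density at the liability threshold. The function name is hypothetical, and the formula is assumed rather than quoted from the text.

```python
from statistics import NormalDist

def observed_to_liability(h2_observed, prevalence, case_fraction):
    """Convert an observed-scale heritability to the liability scale.

    prevalence    : lifetime risk K in the population
    case_fraction : proportion P of cases in the analysed sample
    """
    std_normal = NormalDist()
    threshold = std_normal.inv_cdf(1 - prevalence)  # liability threshold
    z = std_normal.pdf(threshold)                   # density at the threshold
    k, p = prevalence, case_fraction
    return h2_observed * (k * (1 - k)) ** 2 / (z ** 2 * p * (1 - p))

# Multiplicative conversion factor for the Chinese sample (1,234 cases, 2,850 controls)
case_fraction = 1234 / (1234 + 2850)
for lifetime_risk in (0.002, 0.0025):
    factor = observed_to_liability(1.0, lifetime_risk, case_fraction)
    print(lifetime_risk, round(factor, 3))
```

Evaluating the factor at both lifetime risks mentioned in the text (0.002 and 0.0025) shows how sensitive the conversion is to the assumed risk.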
Genetic overlap analysis. We considered estimation of the genetic correlation between ALS risk in Europeans and Chinese, using popcorn 38 (the cross-ethnicity LD-score regression method), but calculated 39 that the relatively small sample size for the Chinese cohort would generate an unacceptably large SE. Instead we used polygenic risk scoring (PRS) to investigate the genetic relationship between ALS in the two ethnicities. PRS were estimated for all Chinese cases and controls as the sum of risk alleles weighted by the log OR of association estimated in the European GWAS. Eight PRS were constructed for each individual using independent SNPs (pruned to r 2 < 0.25 in a 200-kb window) that are significant at p value thresholds of 0.001, 0.005, 0.01, 0.05, 0.10, 0.25, 0.5 and 1. We also constructed a PRS using all SNPs without pruning for LD because of the difference in allele frequencies and LD between ethnicities. Association between the case-control status and PRS was evaluated by logistic regression. Binomial sign tests were also used to evaluate evidence of overlap in signal between Chinese and European association statistics.
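The PRS construction described above (sum of risk alleles weighted by log OR, restricted to SNPs passing a p value threshold) can be sketched as follows. This is a minimal illustration on made-up data: the SNP identifiers, odds ratios and p values are hypothetical, the function names are not from any published tool, LD pruning is assumed to have been done beforehand, and the use of "less than or equal to" at the threshold is an assumption.

```python
import math

def select_snps(p_values, threshold):
    """SNPs whose discovery-GWAS p value passes the threshold (assumed <=)."""
    return {snp for snp, p in p_values.items() if p <= threshold}

def polygenic_risk_score(dosages, odds_ratios, included_snps):
    """Sum of risk-allele counts (0, 1 or 2) weighted by log(OR)."""
    return sum(
        math.log(odds_ratios[snp]) * dosages[snp]
        for snp in included_snps
        if snp in dosages
    )

# Hypothetical discovery statistics (standing in for European GWAS results)
odds_ratios = {"snp_a": 1.10, "snp_b": 1.20, "snp_c": 1.05}
p_values = {"snp_a": 0.0004, "snp_b": 0.03, "snp_c": 0.60}

# Hypothetical risk-allele counts for one target individual
dosages = {"snp_a": 2, "snp_b": 1, "snp_c": 0}

# The eight p value thresholds listed in the text
for threshold in (0.001, 0.005, 0.01, 0.05, 0.10, 0.25, 0.5, 1):
    snps = select_snps(p_values, threshold)
    print(threshold, round(polygenic_risk_score(dosages, odds_ratios, snps), 4))
```

Each individual thus receives one score per threshold, and each set of scores can then be tested against case-control status by logistic regression.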
Meta-analysis. Inverse variance meta-analysis was conducted between the largest GWAS for ALS in Europeans 7 and our Chinese GWAS results using METAL 40.
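For a single SNP, the fixed-effect inverse variance weighting can be sketched as follows. This is a minimal illustration of the weighting scheme only, not METAL's implementation; the function name and the example effect sizes are hypothetical, and effects are assumed to be log odds ratios aligned to the same allele in both studies.

```python
import math

def inverse_variance_meta(betas, ses):
    """Fixed-effect inverse variance meta-analysis of per-study effects.

    betas : per-study effect estimates (e.g. log odds ratios)
    ses   : per-study standard errors
    Returns the pooled effect, its standard error and a two-sided p value.
    """
    weights = [1 / se ** 2 for se in ses]
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    z = pooled_beta / pooled_se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return pooled_beta, pooled_se, p

# Hypothetical log(OR) and SE for one SNP in a large and a smaller study
beta, se, p = inverse_variance_meta([0.10, 0.12], [0.02, 0.06])
print(f"pooled OR = {math.exp(beta):.3f}, SE = {se:.4f}, p = {p:.2e}")
```

Because weights are the inverse of the sampling variance, the larger study dominates the pooled estimate, which is why a smaller cohort mainly contributes at loci where its effect is consistent with the larger one.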
In silico functional analyses. To help interpret the biological function of the SNP- and gene-ALS associations, gene-set pathway analyses were performed using MAGMA 41; this method was selected based on results of a method comparison study 42. Gene-set pathway analyses aim to identify sets of biological pathways that are relevant to disease based on a set of disease-associated variants 42. We also conducted SMR analysis 27, which combines the GWAS summary statistics with gene expression association results. Here we used gene expression from blood 43 as this is currently the largest gene expression quantitative trait loci data set. The SMR test identifies pleiotropic association of a variant that affects both the expression level of a gene and the trait. The SMR-HEIDI test attempts to determine whether the effect of the disease-associated gene on gene expression reflects a single causal variant, thus prioritizing loci for functional follow-up studies.