Large eQTL meta-analysis reveals differing patterns between cerebral cortical and cerebellar brain regions

The availability of high-quality RNA-sequencing and genotyping data of post-mortem brain collections from consortia such as CommonMind Consortium (CMC) and the Accelerating Medicines Partnership for Alzheimer’s Disease (AMP-AD) Consortium enable the generation of a large-scale brain cis-eQTL meta-analysis. Here we generate cerebral cortical eQTL from 1433 samples available from four cohorts (identifying >4.1 million significant eQTL for >18,000 genes), as well as cerebellar eQTL from 261 samples (identifying 874,836 significant eQTL for >10,000 genes). We find substantially improved power in the meta-analysis over individual cohort analyses, particularly in comparison to the Genotype-Tissue Expression (GTEx) Project eQTL. Additionally, we observed differences in eQTL patterns between cerebral and cerebellar brain regions. We provide these brain eQTL as a resource for use by the research community. As a proof of principle for their utility, we apply a colocalization analysis to identify genes underlying the GWAS association peaks for schizophrenia and identify a potentially novel gene colocalization with lncRNA RP11-677M14.2 (posterior probability of colocalization 0.975).


Introduction
Defining the landscape of genetic regulation of gene expression in a tissue-specific manner is useful for understanding both normal gene regulation and how variation in gene expression can alter disease risk. In the latter case, a variety of approaches now leverage the association between genetic variants and gene expression changes, including colocalization analysis 1-7 , transcriptome-wide association studies (TWAS) 8,9 , and gene regulatory network inference [10][11][12][13][14][15][16] .
There has been a relative lack of expression quantitative trait loci (eQTL) studies from the brain. Because of the more accessible nature of tissues such as blood or lymphoblastoid cell lines (LCLs), much of the large-scale identification of expression quantitative trait loci (eQTL) has occurred in these tissues [17][18][19][20] . For most other tissues, obtaining samples for RNA sequencing (RNA-seq) requires invasive biopsy, and brain tissues are typically only available in post-mortem brain samples. One effort, the Genotype-Tissue Expression (GTEx) project 21,22 , has profiled a broad range of tissues (42 distinct) for eQTL discovery, however, samples sizes in brain have been small (typically 100-150). Recently, efforts to understand gene expression changes in neuropsychiatric [23][24][25] and neurodegenerative diseases [26][27][28][29][30][31][32][33] have generated brain RNAseq from disease and normal tissue, as well as genomewide genotypes. These analyses have found little evidence for widespread disease-specific eQTL, as well as high cross-cohort overlap 24,34 , meaning that most eQTL detected are disease-condition independent. This makes it possible to perform meta-analysis despite differences in disease ascertainment of the samples, to generate a well-powered brain eQTL resource for use in downstream research.
Here we generate a public eQTL resource from cerebral cortical tissue using 1433 samples from 4 cohorts from the CommonMind Consortium (CMC) 23,24 and Accelerating Medicines Partnership for Alzheimer's Disease (AMP-AD) Consortium 29,30 , as well as eQTL for cerebellum using 261 samples from AMP-AD. We show that eQTL discovered in GTEx, which consists of control individuals (without disease) only, are replicated in this larger brain eQTL resource. We further show widespread differences in regulation between cerebral cortex and cerebellum. To demonstrate one example of the utility of these data, we apply a colocalization analysis, which seeks to identify expression traits whose eQTL association pattern appears to co-occur at the same loci as the clinical trait association, to identify putative genes underlying the GWAS association peaks for schizophrenia 35 .

Results
We generated eQTL from the publicly available AMP-AD (ROSMAP 26,27,34 (Data Citation 1), Mayo RNAseq 28,36,37 (Data Citation 2)) and CMC (MSSM-Penn-Pitt 24 (Data Citation 3), HBCC (Data Citation 4)) cohorts with available genotypes and RNA-seq data, using a common analysis pipeline (Supplementary Table 1) (https://www.synapse.org/#!Synapse:syn17015233). Analyses proceeded separately by cohort. Briefly, the RNA-seq data were normalized for gene length and GC content prior to adjustment for clinical confounders, processing batch information, and hidden confounders using Surrogate Variable Analysis (SVA) 38 . Genes having at least 1 count per million (CPM) in at least 50% of samples were retained for downstream analysis (Supplementary Table 2). Genotypes were imputed to the Haplotype Reference Consortium (HRC) reference panel 39 . eQTL were generated adjusting for diagnosis (AD, control, other for AMP-AD cohorts and schizophrenia, control, bipolar/other for CMC cohorts) and principal components of ancestry separately for ROSMAP, Mayo temporal cortex (TCX), Mayo cerebellum (CER), MSSM-Penn-Pitt, and HBCC. For HBCC, which had a small number of samples derived from infant and adolescents, we excluded samples with age-of-death less than 18, to limit heterogeneity due to differences between the mature and developing brain.
We then performed a meta-analysis using the eQTL from cortical brain regions from the individual cohorts (dorsolateral prefrontal cortex (DLPFC) from ROSMAP, MSSM-Penn-Pitt, and HBCC and TCX from Mayo). The meta-analysis identifies substantially more eQTL than the individual cohorts (Table 1, Fig. 1). There is a strong relationship between the sample size in the individual cohorts and metaanalysis and the number of significant eQTL and genes with eQTL (Fig. 1b, 1c). Notably, the meta-analysis identified significant eQTL (at FDR ≤ 0.05) in >1000 genes for which no eQTL were observed in any individual cohort. Notably, we find significant eQTL for 18,295 (18,433 when considering markers with minor allele frequency (MAF) down to 1%) of the 19,392 genes included in the analysis.
We then compared our cortical eQTL to GTEx (v7) 21 , which is the most comprehensive brain eQTL database available in terms of number of available brain tissues (Table 1, Table 2). Due to the substantially larger power in these data, we find > 3.8 million eQTL not identified in GTEx cortical regions (Anterior Cingulate Cortex, Cortex or Frontal Cortex) and we find eQTL for >11,000 genes with no eQTL in these cortical regions in GTEx. We first evaluated the replication within our cortical and cerebellar eQTL of the region specific eQTL identified in GTEx. The cortical eQTL generated through the current analyses strongly replicate the eQTL available through GTEx, not only for cortical regions, but for all brain regions including cervical spinal cord (Table 2). Interestingly, the replication in these cortical eQTL of eQTL derived from the two GTEx cerebellar brain regions (cerebellum and cerebellar hemisphere) is consistently lower than for other brain regions represented in GTEx. However, replication of GTEx cerebellar eQTL is high when compared to the cerebellar eQTL generated in this analysis from the Mayo Clinic CER samples. We also performed the reverse comparison, by examining the replication of our eQTL in those region-specific eQTL identified in GTEx. Unsurprisingly, the replication levels were substantially lower, due to the lower power in the GTEx analyses.
Additionally, we compared our eQTL to a publically available fetal brain eQTL resource 40 and found good replication of these eQTL as well (estimated replication rate π 1 = 0.909 for the cortical meta-analysis, and π 1 = 0.861 for cerebellum), though somewhat lower than the replication observed in the GTEx cohorts, which are comprised of adult-derived samples.
Finally, as a proof of concept, we performed a colocalization analysis between our eQTL meta-analysis and the Psychiatric Genomics Consortium (PGC) v2 schizophrenia GWAS summary statistics 35 . Seventeen genes showed posterior probability of colocalization using coloc 7 (PP(H 4 )) >0.7 (Table 3), with 3 showing PP(H 4 ) > 0.95 (FURIN, ZNF823, RP11-677M14.2). FURIN, having previously identified as a candidate through colocalization 24 has recently been shown to reduce brain-derived neurotrophic factor (BDNF) maturation and secretion when inhibited by miR-338-3p 41 . ZNF823 has been identified in previous colocalization analyses 42,43 . RP11-677M14.2, a lncRNA located inside NRGN, while not previously identified through colocalization analysis, has been shown to be down-regulated in the amygdala of schizophrenia patients 44

Discussion
Using resources generated in the AMP-AD and CMC consortia, we have generated a well-powered brain eQTL resource for use by the scientific community. Unsurprisingly, we see a strong relationship between the number of significant eQTL, as well as genes with significant eQTL, and sample size using analyses from the individual cohorts and meta-analysis of those cohorts. This result has previously been shown for lower sample sizes 21 . We also show higher replication of GTEx eQTL in the meta-analysis relative to the individual cohorts.
Notably, we find significant eQTL for nearly every gene in our analysis, which include all but very lowly expressed genes (less than 1 cpm in more than 50% of samples). The wide discovery of eQTL is potentially beneficial for analyses utilizing these results, such as colocalization analysis or TWAS imputation, because more genes with significant eQTL means more genes can be evaluated with these approaches. Because we have discovered eQTL for most genes, further increasing sample size will not substantially increase the number of genes with significant eQTL, however it is likely that the number of significant eQTL associations within each gene would continue to increase, along with the accuracy of estimated effect sizes. This will result in a more accurate landscape of regulatory association, which will improve the ability to fine-map causal regions, and colocalize eQTL signal with clinical traits of interest. Thus, it will be valuable to continue to update this meta-analysis with additional data from these consortia and other resources as they become available, and continue to improve this resource as future data permits. Future work may also focus on using well-powered analysis to study the landscape of causal variation and covariation in gene regulation.
We found distinct eQTL patterns across cerebral cortical and cerebellar brain regions in our resource. Specifically, comparison of eQTL from our resource with those from GTEx shows high replication for the majority of brain regions. However, cerebellar regions show consistently lower replication with the cerebral cortical eQTL generated here. In contrast, the cerebellar eQTL generated from the Mayo Clinic study replicate GTEx cerebellar eQTL at a substantially higher rate, suggesting a different pattern of regulatory variation affecting expression in cerebellum versus other brain regions. Indeed, epigenomic analyses show substantial differences between cerebellar and cerebral cortical regions [45][46][47][48] , particularly in methylation patterns, which could drive different eQTL association patterns. This is further corroborated by the observation of substantial coexpression differences between cerebellar and other brain regions 49 . These effects could be due to differences in cell type composition, with cerebellar regions consisting of substantially more neurons than other brain regions 50 . This is supported by a gene enrichment analysis of genes showing cerebellum-specific eQTL patterns, which showed that most of the top gene sets were related to axon and neuron generation and differentiation (Supplementary Table 3). One recent report suggests that there are also widespread differences in histone modifications within cell types derived from cerebellar and cortical regions 51 , though this effect had not been noted in other studies. In particular, Ma et al 51 observed that both neuronal and non-neuronal cell types show differing histone modifications across tissue of origin. Further work is necessary to confirm this finding and to develop models to deconvolve the cell-type specific regulatory effects in different brain regions [52][53][54] , however our analysis demonstrates that this meta-analysis is representative of eQTL across the majority of brain regions, with the exception of cerebellum. Future meta-analytic analyses may also cast a wider net in terms of brain regions included.
The replication of fetal eQTL, while significant, is somewhat lower than the replication of adult eQTL represented in GTEx. This may be due to multiple factors. The fetal eQTL analysis was generated from brain homogenate, rather than dissected brain regions, though the lower replication likely also reflects broad transcriptional differences between developing and mature brain 55 . These transcriptional differences may also explain why we find substantially more eQTL than a recently published, similarly sized eQTL analysis which uses samples from across developmental and adult timepoints 25 , and why this meta-analysis shows higher replication of GTEx eQTL.
Finally, the lack of widespread disease-specific eQTL observed in schizophrenia (CMC) 24 and Alzheimer's (ROSMAP) 34 , as well as a strong overlap among eQTL derived from samples from individuals with these diseases, as well as normal individuals from these and other cohorts such as GTEx 24,34 , suggests that diseasespecific eQTL, if they exist, are likely few in number and/or small in effect size. Thus, the heterogeneous samples derived from different disease-based cohorts can be metaanalyzed to create a general-purpose brain eQTL resource representing adult gene regulation, despite comprising samples with different disease backgrounds, along with normal controls. Therefore, these eQTL will be useful both within and outside these specific disease contexts. For example, since these eQTL are not disease specific they may be used to understand healthy gene expression regulation in the brain, as well as to infer colocalization of eQTL signatures with disease risk for any disease whose tissue etiology is from the brain, since these signatures are reflective of normal brain regulation. It should be stated that while many eQTL are not disease specific, i.e. they are identified under various central nervous system (CNS) disease diagnoses and in control brains, they may still contribute to common CNS diseases as previously demonstrated 24,[31][32][33]42,43 . While we have demonstrated a proof-of-concept colocalization analysis with a previously published schizophrenia GWAS, these eQTL are a broadly useful resource for studying neuropsychiatric and neurodegenerative disorders, as well as for understanding the landscape of gene regulation in brain.
For the AMP-AD studies (ROSMAP, Mayo RNAseq), Picard v2.2.4 (https://broadinstitute.github.io/picard/) was used to generate FASTQ files from the available BAM files, using the Picard SamToFastq function. Picard SortSam was first applied to ensure that R1 and R2 reads were correctly ordered in the intermediate SAM file before converting to FASTQ. The converted FASTQs were aligned to the GENCODE24 (GRCh38) reference genome using STAR v2.5.1b, with twopassMode set as Basic. Gene counts were computed for each sample by STAR by setting quantMode as GeneCounts.

RNA-seq Normalization
To account for differences between samples, studies, experimental batch effects and unwanted RNA-seq-specific technical variations, we performed library normalization and covariate adjustments for each study separately using fixed/mixed effects modeling. The workflow consisted of following steps: 1. Gene filtering: Out of ~56K aligned and quantified genes only genes showing at least modest expression were used in this analysis. Genes that were expressed more than 1 CPM (read Counts Per Million total reads) in at least 50% of samples in each tissue and diagnosis category was retained for analysis. Additionally, genes with available gene length and percentage GC content from BioMart December 2016 archive were subselected from the above list. This resulted in approximately 14K to 16K genes in each study.

Calculation of normalized expression values:
Sequencing reads were then normalized in two steps. First, conditional quantile normalization (CQN) 58 was applied to account for variations in gene length and GC content. In the second step, the confidence of sampling abundance was estimated using a weighted linear model using voom-limma package in bioconductor 59,60 . The normalized observed read counts, along with the corresponding weights, were used in the following steps. 3. Outlier detection: Based on normalized log2(CPM) of expression values, outlier samples were detected using principal component analysis (PCA) 61,62 and hierarchical clustering. Samples identified as outliers using both the above methods were removed from further analysis. 4. Covariate imputation: Before identifying associated covariates, important missing covariates were imputed. Principally, post-mortem interval (PMI), or the latency between death and tissue collection, which is frequently an important covariate for the analysis of gene expression from post-mortem brain tissue, was imputed for a portion of samples in Mayo RNAseq data for which true values were unavailable. Genomic predictors of PMI were estimated using ROSMAP and MSSM (an additional RNA-seq study available through AMP-AD) samples and were used to impute missing values as necessary. 5. Covariate identification: Normalized log2(CPM) counts were then explored to determine which known covariates (both biological and technical) should be adjusted. Except for the HBCC study, we used a stepwise (weighted) fixed/mixed effect regression modeling approach to select the relevant covariates having a significant association with gene expression. Here, covariates were sequentially added to the model if they were significantly associated with any of the top principal components, explaining more than 1% of variance of expression residuals. For HBCC, we used a model selection based on Bayesian information criteria (BIC) to identify the covariates that improve the model in a greater number of genes than making it worse. 6. SVA adjustments: After identifying the relevant known confounders, hiddenconfounders were identified using the Surrogate Variable Analysis (SVA) 38 .
We used a similar approach as previously defined 24 to find the number of surrogate variables (SVs), which is more conservative than the default method provided by the SVA package in R 63 . The basic idea of this approach is that for an eigenvector decomposition of permuted residuals each eigenvalue should explain an equal amount of the variation. By the nature of eigenvalues, however, there will always be at least one that exceeds the expected value. Thus, from a series of 100 permutations of residuals (white noise) we identified the number of covariates as shown in Supplementary Table 1. We applied the "irw" (iterative re-weighting) version of SVA to the normalized gene expression matrix, along with the covariate model described above to obtain residual gene expression for eQTL analysis. 7. Covariate adjustments: We performed a variant of fixed/mixed effect linear regression, choosing mixed-effect models when multiple tissues or samples, were available per individual, as shown here: gene expression ~ Diagnosis + Sex + covariates + (1|Donor), where each gene in linearly regressed independently on Diagnosis, identified covariates and donor (individual) information as random effect. Observation weights (if any) were calculated using the voom-limma 59,60 pipeline, which has a net effect of up-weighting observations with inferred higher precision in the linear model fitting process to adjust for the mean-variance relationship in RNA-seq data. The Diagnosis component was then added back to the residuals to generate covariate-adjusted expression for eQTL analysis. All these workflows were applied separately for each study. For the AMP-AD studies, gene locations were lifted over to GRCh37 for comparison with the genotype imputation panel (described below). For HBCC, samples with age < 18 were excluded prior to analysis. Supplementary Table 1 shows the covariates and surrogate variables identified in each study.

AD Diagnosis Harmonization
Prior to RNA-seq normalization, we harmonized the LOAD definition across AMP-AD studies. AD controls were defined as patients with a low burden of plaques and tangles, as well as lack of evidence of cognitive impairment. For the ROSMAP study, we defined AD cases to be individuals with a Braak 64 greater than or equal to 4, CERAD score 65 less than or equal to 2, and a cognitive diagnosis of probable AD with no other causes (cogdx=4), and controls to be individuals with Braak less than or equal to 3, CERAD score greater than or equal to 3, and cognitive diagnosis of 'no cognitive impairment' (cogdx = 1). For the Mayo Clinic study, we defined disease status based on neuropathology, where individuals with Braak score greater than or equal to 4 were defined to be AD cases, and individuals with Braak less than or equal to 3 were defined to be controls. Individuals not meeting "AD case" or "control" criteria were retained for analysis, and were categorized as "other" for the purposes of RNA-seq adjustment.

Genotype QC and Imputation
Genotype QC was performed using PLINK v1.9 66 . Markers with zero alternate alleles, genotyping call rate ≤ 0.98, Hardy-Weinberg p-value < 5e−5 were removed, as well as individuals with genotyping call rate < 0.90. Samples were then imputed to HRC (Version r1.1 2016) 39 , as follows: if necessary marker positions were lifted-over to GRCh37 and aligned to the HRC loci using HRC-1000G-check-bim-v4.2 (http://www.well.ox.ac.uk/~wrayner/tools/), which checks the strand, alleles, position, reference/alternate allele assignments and frequencies of the markers, removing A/T & G/C single nucleotide polymorphisms (SNPs) with minor allele frequency (MAF) > 0.4, SNPs with differing alleles, SNPs with > 0.2 allele frequency difference between the genotyped samples and the HRC samples, and SNPs not in reference panel. Imputation was performed via the Michigan Imputation Server 67 using Eagle v2.3 68 as the phasing algorithm. Imputation was done separately by cohort and by chip within cohort, and markers with R 2 ≥ 0.7 and minor allele frequency (MAF) ≥ 0.01 (within cohort) were retained for analysis.

Genetic Ancestry Inference
GEMTOOLs 69 was used to infer ancestry and compute ancestry components separately by cohort. The number of significant ancestry components were also estimated by the GEMTOOLs algorithm. For MSSM-Penn-Pitt and HBCC, which are multi-ethnic cohorts, only Caucasian samples were retained for eQTL analysis.
eQTL Analysis eQTL were generated separately in each cohort using MatrixEQTL 70 adjusting for harmonized Diagnosis and inferred Ancestry components using "cis" gene-marker comparisons: Expression ~ Genotype + Diagnosis + PC 1 + … + PC n, , where PC k is the k th ancestry component, using Expression variables which were previously covariate adjusted as described above. Here we define "cis" as ± 1 MB around the gene, and GRCh37 gene locations were used for consistency with the marker imputation panel. Meta-analysis was performed via fixed-effect model 71 using an adaptation of the metareg function in the gap package in R. Given that multiple tissues were present, we also evaluated a random-effect model, but found it to be highly conservative in this case. In order to assess potential inflation of Type 1 error, we performed 5 permutations of this analysis, starting by permuting gene expression, relative to genotype and ancestry components, within diagnosis for each cohort, and performing meta-analysis of the permuted eQTL. We found that Type 1 error was well controlled (Fig. 1a).

Comparison with GTEx and Fetal eQTL
Full summary statistics for the GTEx v7 21 eQTL for all available brain regions were obtained from the GTEx Portal (https://gtexportal.org/), and fetal eQTL were obtained from Figshare 72 . Markers and genes present in the external eQTL as well as our analysis were retained for comparison. The replication rate was estimated as the π 1 statistic using the qvalue package 73 in R as follows: the meta-analysis p-values for significant (at FDR 0.05) GTEx were used to estimate the replication rate of GTEx eQTL in the meta-analysis. Analogous methods were used to estimate all other replication rates.

Pathway Analysis of Cerebellar eQTL Genes
In order to identify whether genes showing cerebellar-specific eQTL patterns showed any biological coherence, we performed a pathway analysis as follows. For genes with at least 5 significant cerebellar eQTL, we computed the correlation of effect-size between cerebellum eQTL and cortical eQTL for the loci that were significant in cerebellum. We then selected genes for which the effect-sizes were uncorrelated (pvalue > 0.05) between the two tissues as showing cerebellar-specific eQTL patterns, and performed a pathway analysis with GO biological processes, cellular components and molecular function using a Fisher's exact test. Note that due to the (powermediated) greater detection of eQTL in cortex, we did not perform the reverse comparison.

Coloc Analysis
We applied Approximate Bayes Factor colocalization (coloc.abf) 7 from the coloc R package to the summary statistics from the PGC2 Schizophrenia GWAS 35 downloaded from the PGC website (http://pgc.unc.edu), and the summary statistics from the eQTL meta-analysis. Each gene present in the meta-analysis was compared to the GWAS in turn, and suggestive and significant GWAS peaks with p-value < 5e-6 were considered for analysis.

Code availability
An R package with all code for the gene expression normalization is available at https://github.com/Sage-Bionetworks/ampad-diffexp. All other analyses were generated using packages publically available from their respective authors.

Data Records
eQTL results for the ROSMAP, Mayo TCX, Mayo CER and cortical meta-analysis are found in Data Citation 5, Data Citation 6, Data Citation 7 and Data Citation 8, respectively, in the AMP-AD Knowledge Portal. These results include SNP (location, rsid, alleles, and allele frequency) and gene (location, gene symbol, strand and biotype) information, as well as estimated effect size (beta), statistic (z), p-value, FDR, and expression-increasing allele.

Acknowledgements
For the ROSMAP and Mayo RNAseq studies, the results published here are in whole or in part based on data obtained from the AMP-AD Knowledge Portal (doi: 10 Table 3: Top colocalized genes as inferred between meta-analysis eQTL and the PGC2 schizophrenia GWAS.