Identifying gene targets for brain-related traits using transcriptomic and methylomic data from blood

Understanding the difference in genetic regulation of gene expression between brain and blood is important for discovering genes for brain-related traits and disorders. Here, we estimate the correlation of genetic effects at the top-associated cis-expression or -DNA methylation (DNAm) quantitative trait loci (cis-eQTLs or cis-mQTLs) between brain and blood (rb). Using publicly available data, we find that genetic effects at the top cis-eQTLs or mQTLs are highly correlated between independent brain and blood samples (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\hat r_b = 0.70$$\end{document}r^b=0.70 for cis-eQTLs and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\hat r_ b = 0.78$$\end{document}r^b=0.78 for cis-mQTLs). Using meta-analyzed brain cis-eQTL/mQTL data (n = 526 to 1194), we identify 61 genes and 167 DNAm sites associated with four brain-related phenotypes, most of which are a subset of the discoveries (97 genes and 295 DNAm sites) using data from blood with larger sample sizes (n = 1980 to 14,115). Our results demonstrate the gain of power in gene discovery for brain-related phenotypes using blood cis-eQTL/mQTL data with large sample sizes.

G enome-wide association studies (GWAS) have discovered thousands of genetic variants associated with complex traits and diseases [1][2][3] . Most trait-associated variants reside in non-coding regions of the genome 4,5 , suggesting that genetic variants may affect the trait through regulation of gene expression 6,7 . With the advances in microarray and sequencing technologies, genome-wide genotype and gene expression data available from relatively large samples have been generated to identify genetic variants affecting transcription abundance [8][9][10] , i.e., expression quantitative trait loci (eQTLs). Current eQTL studies are biased toward the most accessible tissues (e.g., blood), which are often not the most relevant tissues to the traits and diseases of interest. The Genotype-Tissue Expression (GTEx) project [11][12][13] provides a comprehensive resource of data to investigate the genetic variation of gene expression across a broad range of tissues and cell types. Recent studies have utilized the GTEx data to demonstrate that genetic correlation of gene expression between tissues in local regions (i.e., ±1Mb of the transcription start site) is much higher than that in distal regions 14 , consistent with the conclusions from the latest GTEx release 13 , and that there is no evidence for the tissue-relevant eQTLs being enriched for associations with complex traits 15 .
For studies that integrate GWAS results with eQTL or DNA methylation QTL (mQTL) data to identify putative functional genes and regulatory elements for brain-related phenotypes and diseases 16,17 , the statistical power is limited by the small sample sizes of the brain eQTL or mQTL data (often in the order of 100s). On the other hand, there are blood eQTL and mQTL data available from thousands of individuals 8,9 and the sample sizes of some of the ongoing projects have reached 10,000s (e.g., the GoDMC and eQTLGen consortia). The questions are to what extent the cis-genetic effects on gene expression and DNA methylation (DNAm) in blood differ from those in brain and whether we can gain power to detect associations of genes (or DNAm sites) with brain-related traits by using the cis-eQTL (or cis-mQTL) effects estimated from a large blood sample as proxies for those in brain. Liu et al. 14 extended the stratified linkage disequilibrium (LD) score regression method to estimate genetic correlation (r g ) of gene expression between tissues at all SNPs in local or distal regions and showed that the mean estimate of pairwise r g at all local SNPs (i.e. cis-genetic correlation) was~0.75 in 11 GTEx tissues but they did not estimate r g between brain and blood. In this study, we use a summary-data-based method to estimate the correlation of effect sizes of the top-associated cis-eQTLs (or cis-mQTLs) between blood and brain for genes expressed (or CpG sites methylated) in both tissues, accounting for errors in their estimated effects. We demonstrate by simulation and analysis of real data the gain of power by using cis-eQTL or cis-mQTL effects estimated in blood as proxies of those in brain to identify putative functional genes for brain-related complex traits and diseases.

Results
Correlation of cis-eQTL effects between brain and blood. To quantify the similarity of genetic effects at the top-associated cis-eQTLs (or cis-mQTLs) between two tissues, we used a summarydata-based approach to estimate the correlation of cis-effects between two tissues (r b ) correcting for errors in the estimated cis-eQTL (or cis-mQTL) effects and sample overlap (Supplementary Fig. 1 and Methods). We showed by simulation (Supplementary Note 1) that r b is a good estimator of correlation of the true values of cis-genetic effects ( Supplementary Fig. 2). Note that the r b method is distinct from the Spearman or Pearson correlation approach 13 because the latter does not account for errors in the estimated eQTL effects and thereby leads to an underestimation of the correlation of true eQTL effects. We applied our method to estimater b at the top cis-eQTLs between different brain regions and between brain and blood in one data set, and between brain and blood in two data sets using summary-level data from GTEx v6 (whole blood and 10 brain regions) 11 , the CommonMind Consortium (CMC; dorsolateral prefrontal cortex) 18 , the Religious Orders Study and Memory and Aging Project (ROSMAP) 19 , and the Brain eQTL Almanac project (Braineac; 10 brain regions) 20 (Methods and Supplementary Table 1). All eQTL effects were re-scaled based on the expression level per gene in standard deviation (SD) units. For the GTEx, CMC and ROS-MAP data, which are based on RNA sequencing (RNA-Seq), we matched the data sets by Ensembl Gene IDs. For the Braineac data that are based on gene expression microarray, we matched the data sets by gene symbols and removed genes tagged by multiple gene expression probes to ensure a one-to-one match for genes between data sets. The main aim of this study is to quantify the extent to which cis-eQTL data in blood can be used in the SMR analysis 21 (see below) to identify genes associated with brain-related phenotypes and disorders. If we had selected the top-associated cis-eQTLs in blood and compared their effects with those in brain, we would likely suffer a form of winner's curse. To avoid the potential ascertainment bias, we selected the top cis-eQTLs in a reference tissue, i.e., GTEx-muscle (n = 361) or CMC (n = 467; independent of GTEx), using a stringent Pvalue threshold that is required for the SMR analysis 21 (see below), and estimated r b between brain and blood using these SNPs ( Supplementary Fig. 3). Although this strategy uses only a quarter of all genes, the estimates of r b should be valid (see below). Note that the estimates of local and distal r g at all SNPs 14 would be more informative for other gene-trait association methods such as TWAS 22 and MetaXcan 23 that use all SNPs in a prediction analysis framework. We chose SMR (URLs) because of one of its features (i.e., the HEIDI test) to filter out associations due to linkage 21 .
We selected the top-associated cis-eQTLs at P eQTL < 5 × 10 −8 for 4257 genes in GTEx-muscle and matched the selected genes with those in the other data sets (the number of matched genes ranged from 1113 to 3841) (Supplementary Table 2, i.e., up to 90%, with the lower numbers matched representing data sets with gene expression data for fewer genes). Note that all the matched genes were expressed in both tissues (i.e., genes which have at least 10 samples with reads per kilobase per million mapped reads (RPKM) > 0.1 and raw read counts >6) 13 . It should also be noted that our analysis below shows that the test-statistic for the difference in gene expression between tissues was almost independent of the test-statistic for the difference in SNP effect on gene expression between tissues, therefore selecting genes by cis-eQTL P-values would not bias mean gene expression in any specific tissue. We used the Jackknife approach that removes one gene at a time to estimate the sampling variance ofr b (Methods) assuming the estimated top cis-eQTL effects for different genes are independent. This assumption was approximately met given the small LD correlations among the 4257 cis-eQTLs and the subtle difference between the mean Jackknife sampling variance and the observed sampling variance in simulation (Supplementary Fig. 4).
Results showed that the effects of the top-associated cis-eQTLs were highly correlated between all the brain regions in GTEx after correcting for estimation errors, with a meanr b of 0.94 (s.e. = 0.004; Fig. 1). These estimates are higher than the Spearman correlation estimates reported in a previous study 13 because the Spearman correlation does not account for errors in the estimated SNP effects and therefore underestimates the correlation of true effects especially when the sample size is small. The two cerebellum measures ("brain cerebellar hemisphere" and "brain cerebellum") appeared to be outliers. The correlation between "brain cerebellar hemisphere" and "brain cerebellum" was almost perfect (r b ¼ 0:99 and s.e. = 0.002), but the correlations between the two cerebellum regions and the other regions (meanr b ¼ 0:89 and s.e. = 0.006) were significantly smaller than the pairwise correlations between the other regions (meanr b ¼ 0:98 and s.e. = 0.003). We performed the same analysis in the Braineac data and observed similar results as above ( Supplementary Fig. 5). The estimates of r b between brain and blood in GTEx varied from 0.74 to 0.79 across different brain regions with a mean estimate of 0.77 (s.e. = 0.010), similar to the mean estimate of local r g between GTEx-blood and 10 other non-brain GTEx tissues reported in a previous study 14 . The estimate of r b between CMC (brain) and GTEx-blood was 0.74 (s.e. = 0.014), suggesting that the betweensample genetic heterogeneity is small, in line with the strong correlations between CMC and GTEx brain regions (meanr b ¼ 0:87 and s.e. = 0.010). The estimates of r b from ROSMAP were remarkably similar to those from CMC, providing an important replication of the result. The correlations related to Braineac were notably lower than those related to CMC (Fig. 1), which is likely due to the difference in transcriptomics technology between the two studies (microarray vs. RNA-Seq). It is of note that the results were robust to scale transformation of the eQTL effects ( Supplementary Fig. 6), the exclusion of cis-eQTLs in or near the promoter regions ( Supplementary Fig. 7), the exclusion of housekeeping genes 24,25 (Supplementary Fig. 8), the inclusion of secondary cis-eQTLs identified from a conditional analysis 26 ( Supplementary Fig. 9), or the adjustment of gene expression data for confounding (e.g., batch effects) predicted from the data ( Supplementary Fig. 10). In addition, we selected the topassociated cis-eQTLs at P eQTL < 5 × 10 −8 from the CMC data, and found that the estimates of r b among the brain regions and B r a i n _ N u c l e u s _ a c c u m b e n s _ b a s a l _ g a n g l i a B r a i n _ P u t a m e n _ b a s a l _ g a n g l i a  Fig. 1 Estimated correlation of genetic effects of cis-eQTLs between tissues. We estimated r b between brain regions, between brain and blood tissues, and between data sets. The top-associated cis-eQTLs (one for each gene) were selected from GTEx-muscle at P eQTL < 5 × 10 −8 . Shown in each cell is the estimate of r b with its standard error given in the parentheses (Methods). In the Braineac data, the eQTLs effect sizes were estimated from gene expression levels averaged across 10 brain regions between brain and blood in GTEx remained largely unchanged ( Supplementary Fig. 11), suggesting that our results are also robust to the ascertainment of the cis-eQTLs.
cis-eQTLs with tissue-specific effects. The strong correlation of cis-eQTL effects between brain and blood ( Fig. 1) does not preclude eQTLs with detectable difference in effect size between tissues. Of the 1388 cis-eQTLs with P eQTL < 5 × 10 −8 in GTEx-muscle and available in CMC and GTEx-blood (Supplementary Table 2), 308 (22%) showed significant difference in effect between CMC and GTEx-blood after Bonferroni correction for multiple testing (P difference < 0.05/1388) (Methods). Note that the substantial proportion of eQTLs with significant between-tissue differences in effect does not contradict the large estimate of r b above ( Fig. 1) because the power to detect a difference in effect depends on sample size 13 ( Supplementary Fig. 12). Previous studies have indicated that chromatin state at promoters is largely invariant across diverse cell types whereas enhancers are marked with highly cell-type-specific histone-modification patterns 27 , that functional variants (predicted by chromatin activity data) in enhancers are less likely to be shared across many tissues compared with those in promoters 28 , and that cell-type-specific eQTLs are more dispersedly distributed around the transcription start site than eQTLs affected expression in multiple cell types 29,30 . These results seem to indicate that tissue-specific eQTLs are enriched in distal regulatory elements (i.e., enhancers).
To address this hypothesis, we computed the statistics to test for the between-tissue difference in eQTL effect (denoted by T D ) and tested the inflation (or deflation) of mean T D of cis-eQTLs in the functional categories annotated by the Roadmap Epigenomics Mapping Consortium (REMC) 31 (Methods). The result showed that although cis-eQTLs are enriched in genomic regions of active chromatin state (e.g., promoters and enhancers) and deflated in inactive regions, the mean T D of cis-eQTLs between CMC and GTEx-blood was almost evenly distributed across all the functional categories with no evidence of inflation in the enhancer regions (Fig. 2). The result remained largely unchanged if we repeated the enrichment analysis based on T D between GTExcerebellum and GTEx-blood ( Supplementary Fig. 13). Note that these results do not contradict the observation from a recent study that eQTLs detected in specific tissues in GTEx tend to be most enriched among the variants predicted to be functional in relevant REMC tissues 28 . There were some examples where the cis-eQTLs with tissue-specific effects in brain and blood were located in enhancers ( Supplementary Fig. 14). These examples, however, were rare because only 14 of the 308 eQTLs with P difference < 0.05/1388 were located in enhancers and only 4 of the 14 enhancers appeared to be tissue specific.   GTEx-cerebellum and GTEx-blood. The 3569 genes were ascertained with at least one cis-eQTL with P eQTL < 5 × 10 −8 in GTEx-muscle and expressed in GTEx-cerebellum and GTEx-blood (i.e. genes which have at least 10 samples with RPKM >0.1 and raw read counts >6). In this analysis, we used cis-eQTL effects in SD units and gene expression levels in log 2 (RPKM) units to avoid confounding of the correlation by the mean-variance relationship in gene expression In addition, there are a large number of genes showing differences in expression level between tissues 11 . It is not clear whether these differences are partly driven by the differences in eQTL effect. We sought to address this question by examining the correlation between test-statistic for difference in cis-eQTL effect (in SD units) and test-statistic for difference in mean expression level of the corresponding gene (in log 2 (RPKM) units) between GTEx-cerebellum and GTEx-blood for the 3569 genes each with a cis-eQTL at P eQTL < 5 × 10 −8 in GTEx-muscle (Supplementary Table 2). Note that the cis-eQTL effects were re-scaled based on the expression level per gene in SD units so that the correlation was not confounded by the mean-variance relationship in gene expression. That is, if the difference in eQTL effect and that in expression level were both computed in RPKM units, genes with larger differences in mean between tissues are more likely to have differences in variance because of the mean-variance relationship, giving rise to differences in eQTL effect even if the eQTL effects are not different in SD units. We found that the correlation was marginal (r = 0.003) (Fig. 3). This is analogous to the observation that there is a large difference in mean height between men and women but the effects of all autosomal SNPs on height in men are almost identical to those in women 32,33 . However, these results also suggest that an eQTL with identical effect on gene expression in SD units in different tissues could show different effects in RPKM units if the variance of gene expression varies across tissues, which might explain the results from recent studies that genetic variants in or near genes differentially expressed in a particular tissue are enriched for associations with a complex trait 34,35 .
Correlation of cis-mQTL effects between brain and blood. Having shown that cis-eQTL effects are highly correlated between brain and blood, we then turned to estimate the correlation of genetic effects on DNAm between the two tissues by applying the r b method to mQTL data. We analyzed summary-level mQTL data from five studies based on the Illumina HumanMethyla-tion450K array: fetal brain from Hannon et al. (n = 166) 36 , brain cortical region from ROSMAP (n = 468) 19 Table 3). All the mQTL effects are in SD units. We matched the SNPs in common across data sets, selected the top-associated cis-mQTL at P mQTL < 1 × 10 −10 for 26,840 DNAm probes in the data from Hannon et al. (because only SNPs with P mQTL < 1 × 10 −10 are available in this data set) and matched the selected probes with those in the other data sets (the number of matched probes ranged from 4892 to 6561) (Supplementary Table 4). The correlation of cis-mQTL effects between two brain samples (ROS-MAP and Jaffe et al.) was very strong (r b ¼ 0:92 and s.e. = 0.002), similar to that between two blood samples (r b ¼ 0:92 between BSGS and LBC with s.e. = 0.003) (Fig. 4a). It is of note that both estimates of r b were smaller than unity, reflecting some degree of heterogeneity between studies. The mean brain-blood r b estimate from two samples was 0.78 (s.e. = 0.006) (Fig. 4a), higher than that for cis-eQTLs (meanr b ¼ 0:70 and s.e. = 0.015) shown above (Fig. 1). The result remained largely unchanged if the cis-mQTLs were selected at P mQTL < 5 × 10 −8 in the LBC data ( Supplementary Fig. 15), again showing the robustness of our results to the choice of reference tissue. In addition, of the 5416  Similarity and difference in cis-mQTL effects between brain and blood. a Estimated r b for cis-mQTLs between brain and blood from four independent data sets. The cis-mQTLs (one for each DNAm probe) were selected at P mQTL < 1 × 10 −10 using data from the Hannon et al. study. Shown in each cell is the estimate of r b with its standard error given in the parentheses (Methods). b The distribution of cis-mQTLs across 14 functional categories derived from RMEC (Methods). c Estimated enrichment of T D (testing for the difference in cis-mQTL effect between Jaffe-brain and LBC-blood) in each functional category (Methods). Error bars represent 95% confidence intervals around the estimates. The black dash line represents the fold enrichment of 1 cis-mQTLs, 1847 (34%) showed significantly different effects between brain (Jaffe et al.) and blood (LBC) after correcting for multiple testing (P difference < 0.05/5416). We then tested whether cis-mQTLs in any of the REMC functional categories tend to have higher T D between brain and blood (see above). There were small but significant enrichments of T D in enhancer regions (e.g., transcribed enhancer, active enhancer and weak enhancer) (Fig. 4c), and one of them survived multiple-testing correction (Supplementary Table 5).
Meta-analysis of brain eQTL data from correlated samples. We know from the r b analysis above that cis-eQTLs are almost perfectly correlated in different brain regions. We then sought to combine data from the brain regions to increase the power of detecting eQTLs for follow-up analysis (e.g., identification of putative functional genes for brain-related traits and diseases). However, if there is sample overlap between two tissues and the phenotypic correlation is nonzero, the estimation errors of the SNP effects from the two tissues will be correlated. We implemented in the SMR software package (URLs) a summary-databased method, which only requires summary-level data in the cisregions to account for sample overlaps, to meta-analyze cis-eQTL data in correlated samples (MeCS) (Methods). MeCS is very similar to existing meta-analysis approaches such as MTAG 39 or the Han et al. method 40 that account for sample overlaps. However, there is a small but important distinction. That is, MeCS uses "null" SNPs (e.g., P eQTL > 0.01) to quantify sampling correlation of the estimated SNP effects between two data sets (θ), similar to the strategy used in the latest version of METAL (method unpublished, URLs), whereas MTAG 39 usesθ estimated by the intercept of bivariate LD score regression 41 that relies on the assumption of an infinitesimal model which is invalid in cis-eQTL regions 42 . Han et al. 40 suggest the use of the number of overlapping individuals 43 or z-statistics to computeθ for summary-data-based analysis. However, a meta-analysis of cis-eQTL effects from two tissues requires the correlation of expression level between the tissues (because θ = r p ρ with r p being the correlation of expression level and ρ being the proportion of sample overlap 44 ) which is not available in summary data, andθ estimated by the correlation of z-statistics in the cisregion could be biased by the strong local genetic correlation 14 .
We showed by simulations thatθ could be estimated with high accuracy from summary data of the "null SNPs" in cis-region using a simple correlation approach (Supplementary Note 1, Supplementary Figs. 16 and 17), that the MeCS test-statistics were well calibrated under the null hypothesis ( Supplementary Fig. 16), and that the MeCS estimates of meta-analysis effect sizes were well estimated under the alternative hypothesis ( Supplementary  Fig. 17). We compared MeCS to a univariate analysis of the mean expression phenotype across tissues and found that the estimates of effect size and SE from the two approaches were highly consistent ( Supplementary Fig. 18). Note that in comparison with the separate analysis in individual tissues, the gain of power for MeCS increased with the decrease of correlation in expression phenotype between tissues, more so for meta-analysis using individuallevel data ( Supplementary Fig. 19).
We applied MeCS to data from 10 brain regions in GTEx (we referred to the meta-analyzed data as GTEx-brain hereafter).    5 Identification of genes and DNAm sites associated with four brain-related traits. Genes (DNAm sites) associated with the brain-related traits were identified by a SMR analysis of GWAS data with eQTL (mQTL) data from brain and blood samples. The four brain-related traits are smoking, IQ, SCZ, and EduYears. a, c show the number of genes (DNAm sites) with at least one significant SNP at P < 5 × 10 −8 in different data sets. b, d show the number of genes (DNAm sites) associated with traits identified in different data sets. Sample sizes of the brain studies: GTEx-brain (n There were strong sample overlaps among the ten brain regions (mean overlap = 70.4%) and the mean correlation in expression level between pairwise brain regions across all the expressed genes was moderate (mean r p = 0.33). The gain of power by the metaanalysis was demonstrated by the observation that the mean χ 2 statistic for cis-eQTLs (selected from GTEx-blood at P eQTL < 5 × 10 −8 ) in GTEx-brain was larger than that in any individual brain region ( Supplementary Fig. 20c). The association test-statistic for a SNP can be written as χ 2 ¼ 1 þ n eff q 2 1Àq 2 , where n eff is the effective sample size and q 2 is the variance explained by a SNP 45 . We therefore can approximately estimate n eff of GTEx-brain assuming constant mean q 2 across brain regions (Supplementary Note 2). Note that this assumption is justified by the highly consistent estimates of variance of cis-eQTL effects across genes in different brain regions ( Supplementary Fig. 21) along with a mean r b estimate of 0.94 between pairwise brain regions for cis-eQTL effects in SD units. The estimate of n eff of GTEx-brain was 233, approximately 2.6 times larger than the actual sample size of brain tissue in GTEx (mean n =~89 across 10 brain regions) ( Supplementary Fig. 20d). To further increase the power of detecting brain eQTLs, we meta-analyzed GTEx-brain, CMC, and ROSMAP (referred to as Brain-eMeta hereafter). The gain of power is demonstrated by the increased number of genes with at least one cis-eQTL with P eQTL < 5 × 10 −8 in Brain-eMeta as compared with that in GTEx-brain, CMC, or ROSMAP (Fig. 5a).

Identifying DNAm and genes for brain-related phenotypes.
With the Brain-eMeta eQTL data (n eff = 1194) obtained from the meta-analysis above, we applied the SMR approach 21,46 to test for associations of gene expression levels with four brain-related phenotypes, i.e., ever-smoked (smoking), fluid intelligence score (IQ), years of education (EduYears), and schizophrenia (SCZ). GWAS data were from published meta-analyses for EduYears and SCZ 47,48 , and from analyses of the full release of the UK Biobank data for smoking and IQ (Methods and Supplementary Table 6). LD data required for the HEIDI test 21 were estimated from genotyped/imputed data of the Health and Retirement Study (HRS) 49 . LD r 2 from HRS were strongly correlated with those from CMC ( Supplementary Fig. 22), consistent with the observation from previous studies 26 . For power comparison, we included in the SMR analysis an additional set of blood eQTL data from a sample of 14,115 individuals from the eQTLGen Consortium. Only the genes with at least one cis-eQTL at P eQTL < 5 × 10 −8 (one of the basic assumptions of SMR) in both Brain-eMeta and eQTLGen were included. We further excluded genes in the major histocompatibility complex (MHC) region because of the complexity of this region, leaving 3943 genes for analysis. We identified 61 genes associated with the traits using the brain eQTL data, 41 of which (67.2%) were in common with a larger set of genes (97) identified using the eQTLGen blood eQTL data (Fig. 5b). Despite the heterogeneity between the two eQTL data sets (Brain-eMeta was based on RNA-Seq and eQTLGen was based on microarray), the strong overlap between the two sets of results is consistent with the strong correlation of eQTL effects between brain and blood estimated above. For SCZ, 19 out of the 24 genes identified using brain eQTL data were replicated using blood eQTL data with an additional 27 genes identified only in the blood data because of its larger sample size ( Supplementary  Fig. 23). We repeated the SMR analysis using blood eQTL data from the Consortium for the Architecture of Gene Expression (CAGE; n = 2765) 9 and observed similar results (Fig. 5b) although the power of CAGE was lower than that of eQTLGen (63 genes identified using CAGE versus 97 genes identified using eQTLGen).
We also performed the SMR analysis to detect associations between DNAm sites and the brain-related phenotypes 16 using brain mQTL data from Jaffe et al. (n = 526) and blood cis-mQTL data from a meta-analysis of LBC and BSGS (n = 1980) (Methods). We only included in the analysis DNAm probes with at least one cis-mQTL with P mQTL < 5 × 10 −8 in both the brain and blood data sets. We identified 167 DNAm sites associated with the traits (P SMR < 1.8 × 10 −6 ) using the brain mQTL data, 133 of which (79.6%) were in common with the set of 295 DNAm sites identified using the blood mQTL data (Fig. 5d and Supplementary Fig. 24). The brain to blood "replication" rate slightly decreased when we rejected the associations with P HEIDI < 0.05 ( Supplementary Fig. 25), likely because of the HEIDI test being over-conservative especially as sample size increases 21 . These results further demonstrate the feasibility and gain of power of using the cis-genetic effects on gene expression or DNAm estimated in blood to identify putative target genes and regulatory DNA elements for brain-related phenotypes.

Discussion
We estimated the correlation (r b ) of genetic effects at the topassociated cis-eQTLs/mQTLs between brain and blood. Because the r b method accounts for estimation errors,r b can be interpreted as an estimate of correlation of true cis-eQTL effects between brain and blood, as demonstrated by simulations (Supplementary Fig. 2). We applied the method to summary-level eQTL data from GTEx and found that genetic effects on gene expression at the top-associated cis-eQTLs were almost perfectly correlated between different brain regions (meanr b ¼ 0:94 for cis-eQTLs), especially between the non-cerebellar regions (mean r b ¼ 0:98 and s.e. = 0.003), in contrast to the modest phenotypic correlation in gene expression levels (mean r p = 0.33). It is therefore sensible to run a meta-analysis of the cis-eQTL effects across brain regions to gain power of detecting eQTLs for the whole brain ( Supplementary Fig. 19). This can be done even if the brain regions are from different samples. We also found that the cis-eQTL effects were highly correlated between brain and blood in GTEx (meanr b ¼ 0:77 for cis-eQTLs), and the estimate only slightly decreased using data from different samples (mean r b ¼ 0:70). These estimates were significantly different from 1, suggesting there are real genetic differences between tissues. The genetic differences are partly due to cell-type-specific genetic effects regardless whether cell composition covariates have been included in the eQTL analysis or not. This is because adjusting for cell composition only removes the mean differences in gene expression level among cell types rather than cell-type-specific genetic effects. On the other hand, however, the strong betweentissue correlation in cis-eQTL effects does not contradict the result that many genes showed differential expression between brain and blood because the difference in cis-eQTL effect is almost independent of the mean difference in gene expression level (Fig. 3). Our results reinforce that very large sample sizes are needed to generate eQTL data sets in a specific tissue (e.g., blood) to increase the power of detecting cis-eQTLs regardless of the relative expression level of the tissue.
Our results also provide some guidelines about the use of discovery-replication paradigm to compare eQTL effects between tissues (i.e., detecting eQTLs in one tissue at a stringent P-value threshold and replicating the effects in another tissue after correcting for multiple tests) 13,29 . Here, we often saw a low to moderate replication rate even if there is no genetic difference between the tissues. This is because the replication rate is a function of the sample size of the validation set ( Supplementary  Fig. 12) and the sample sizes of eQTL studies in non-blood tissues are often limited. If we apply the discovery-replication paradigm to the GTEx data, only~10.7% of eQTLs discovered in GTExmuscle could be replicated in GTEx-hippocampus (although the estimates from the recent methods 50,51 based on the discovery-replication paradigm were much higher) (Supplementary Table 7), which could potentially lead to a wrong conclusion that a large proportion of cis-eQTLs are tissue specific (note that the r b estimate between the two tissues was 0.81). We therefore do not recommend the use of the discovery-replication paradigm to quantify the tissue-specific effects especially in small samples.
We applied the SMR and HEIDI methods to identify genes and DNAm sites associated with brain-related phenotypes through pleiotropy using summary data from GWAS and cis-eQTL/ mQTL studies with large sample sizes (n max = 453,693 for GWAS, n max = 14,115 for eQTL and n max = 1980 for mQTL). We identified a number of genes and DNAm sites that showed pleiotropic associations with the phenotypes, consistent with a plausible model that the SNP effects on the phenotypes are mediated by genetic regulation of expression levels of the target genes and/or DNAm levels at the CpG sites. We repeated the analyses using eQTL and mQTL data from brain samples with much smaller sample sizes (n max = 1194 for eQTL and n max = 526 for mQTL). Due to the lower power of the data sets, the number of genes or DNAm sites detected in the brain sample was much smaller than that using the blood sample (Fig. 5, Supplementary Figs. [23][24][25], with at least 50% of genes (DNAm sites) in common between the two sets. These results provide strong justification for the use of blood samples to discover genes related to brain phenotypes and diseases. In practice, we recommend using a blood data set with large sample size for discovery, and an additional data set from brain for replication. This paradigm is certainly applicable to other phenotypes and their related tissues.
We conclude with several caveats. First, our estimation of r b is based on genes expressed in both brain and blood (i.e., genes only expressed in one tissue were not included in the estimation). Therefore, the estimate of r b needs to be interpreted with a restriction to genes expressed in both tissues. Although only a quarter (4257) of all genes were selected in our analysis (with at least one cis-eQTL at P eQTL < 5 × 10 −8 in GTEx-muscle), up to 90% of those selected genes were expressed in both brain and blood, reflecting the high proportion of all genes expressed in both tissues. Second, we focused our analyses only on cis-eQTLs and cis-mQTLs because trans-eQTLs and trans-mQTLs data were not available in most data sets used in our study. Although most SNP-based heritability for gene expression levels are attributed to cis-eQTLs 9 , trans-eQTLs may also have an important role in regulating gene expression especially for tissue-specific effects 14 .
The methods developed in this study can be applied to trans-eQTL/mQTL data with minimal modification. Because the variance explained by individual trans-eQTL/mQTL is small on average 9,38 , very large sample sizes (e.g., 10,000s) are required to detect trans-eQTLs to be useful for the SMR analysis 21 . Third, the r b analysis was focused on the correlation at the top-associated cis-eQTLs/mQTLs with relatively large effects (i.e., P < 5 × 10 −8 in a reference tissue) because the SMR test only uses cis-eQTLs/ mQTLs at P < 5 × 10 −8 . The estimate of r b was slightly lower for cis-eQTLs/mQTLs selected at a less stringent threshold (Supplementary Fig. 26), consistent with the observation in simulation (Supplementary Fig. 27). However, this does not change our conclusion about the use of the top-associated cis-eQTLs/mQTLs identified in a large blood sample to identify putative target genes for brain-related traits. Last but not least, the MeCS method requires the correlation of errors in the estimated SNP effects between two samples (θ), which is estimated by a simple correlation approach at the null SNPs in the cis-region. This approach, however, is not applicable to eQTL or mQTL summary data that have been ascertained by P-value. It will also be challenging to estimate θ if only a small number of cis-SNPs are available in the summary data. We therefore recommend eQTL and mQTL studies to make more cis-SNPs available without ascertainment (e.g., all the cis-SNPs in ±2Mb of a gene or DNAm probe). Despite these caveats, our findings shed light on the genetic architecture underlying the regulation of gene expression across tissues and provide important guidance for studies in the future to identify functional genes for human complex traits.

Methods
Summary data of cis-eQTL, cis-mQTL, and GWAS. This study is approved by the University of Queensland Human Research Ethics Committee (approval number: 2011001173). All the analyses of eQTL/mQTL data were performed based on summary-level data. A summary description of all the data sets can be found in Supplementary Table 1, Supplementary Table 3, and Supplementary Table 6. All the samples were of European descent and the summary data available to us were derived from individual-level data that passed stringent quantify control (QC) 9,11,[18][19][20][36][37][38] . The SNPs in all eQTL/mQTL data sets were from imputation of the genotyped data to the 1000 Genomes Project (1KGP) reference panels 52 , and only the SNPs with MAF > 0.01 were included in analyses.
The eQTL summary-level data were from six studies, i.e., the Genotype-Tissue Expression (GTEx) 11 v6, the CommonMind Consortium (CMC) 18 , Religious Orders Study and Memory and Aging Project (ROSMAP) 19 , the Brain eQTL Almanac project (Braineac) 20 , the Architecture of Gene Expression (CAGE) 9 , and eQTLGen. In GTEx, ROSMAP, and CMC, gene expression levels were measured by RNA-Seq. Genes in GTEx and ROSMAP were annotated by GENCODE 53 v19 and v14, respectively, and genes in CMC were annotated by Ensembl. We accessed the GTEx eQTL summary statistics of~9.3 million SNPs for~32,000 genes in 44 tissues (including 10 brain regions) through GTEx portal (URLs). The sample sizes of GTEx tissues ranged from 70 to 361 with an average of 160. We accessed the CMC summary data from Synapse (accession: syn2759792). The CMC eQTL summary statistics (ascertained at FDR < 0.2) of~1.1 million SNPs for 14,366 genes were derived from individual-level data in dorsolateral prefrontal cortex of 467 subjects, 209 of which were schizophrenia patients. We accessed the ROSMAP eQTL summary statistics of~6.4 million SNPs for 12,979 genes, which were derived from individual-level data in dorsolateral prefrontal cortex of 494 subjects. We accessed the Braineac eQTL summary statistics of~6.2 million SNPs for 25,490 genes, which were derived from data in 10 brain regions of 134 subjects free of neurodegenerative disorders 20 . The gene expression levels in Braineac were measured by Affymetrix Human Exon 1.0 ST Arrays. For blood eQTL data, we used eQTL summary data from CAGE 9 (38,624 gene expression probes and~8 million SNPs on 2765 subjects) and eQTLGen (44,556 gene expression probes and 10 million SNPs on 14,115 subjects). Gene expression levels in CAGE and eQTLGen were measured by Illumina gene expression arrays. We mapped the probes to genes based on the annotations provided by Illumina. The eQTL summary data available in GTEx, CAGE, and eQTLGen were from previous analyses of standardized gene expression levels with mean 0 and variance 1, whereas expression levels in the other data sets (i.e., CMC, ROSMAP, and Braineac) were not standardized. To harmonize the units across data sets, we rescaled the effect size and standard error (SE) of each eQTL in the CMC, ROSMAP, and Braineac based on the z-statistic, allele frequency and sample size using the method described in Zhu et al. 21 so that the eQTL effects in all data sets can be interpreted in standard deviation (SD) units. mQTL summary statistics were from five data sets: brain cortical region from ROSMAP study (n ind = 468, n probe = 420,103, n snp = 5 million) 19 ; fetal brain from We computed the effect size and SE of each mQTL from their zstatistics using the method described in Zhu et al. 21 We included in the analysis four brain-related complex traits, i.e., ever-smoked (smoking), fluid intelligence score (IQ), years of education (EduYears), and schizophrenia (SCZ). GWAS summary statistics for EduYears (n = 293,723) and SCZ (36,989 cases and 113,075 controls) were from the latest meta-analyses 47,48 , and summary data for smoking (n = 453,693) and IQ (n = 146,819) were from GWAS analyses of the latest release of the UK Biobank (UKB) data 56 . Quality control and imputation of the UKB data have been detailed elsewhere 56 . We used 456,426 individuals of European descent and 7,288,503 common SNPs (MAF > 0.01) imputed from the Haplotype Reference Consortium (HRC) 57 reference panel in the analysis. IQ was measured by 13 fluid intelligence questions and detailed description of the measurement can be found at the UKB website (URLs). We adjusted IQ (n = 146,819) by age and sex, and standardized the adjusted phenotype by rank-based inverse-normal transformation. The GWAS analyses were performed in BOLT-LMM 58 using all 7.3 million SNPs with a subset of 0.7 million SNPs in common with HapMap3 59 used to control for population structure and polygenic effects. We used self-reported "ever-smoked" as a dichotomous phenotype for smoking (208,988 cases and 244,705 controls). We analyzed the data in BOLT-LMM based a linear model with age and sex fitted as covariates, and transformed the effect size of each SNP on the observed 0-1 scale to odds ratio (OR) using LMOR 60

(URLs).
Correlation of cis-eQTL effects between tissues. Letb be the estimated effect at the top-associated cis-eQTL for a gene (i.e., one SNP per gene). We can modelb aŝ where b is the true effect and e is the estimation error. We assume that b and e are random variables when interrogated across genes, i.e., b $ N 0; var b ð Þ ð Þand e $ N 0; var e ð Þ ð Þ . The covariance of the estimated cis-eQTL effects between tissues i and j across genes can be partitioned into the covariance of true cis-eQTL effects and the covariance of estimation errors (if there is a sample overlap), i.e., where var e i ð Þ and varðe j Þ are the variance of the estimation errors across genes in tissues i and j, respectively, and r e is the correlation of estimation errors across genes between two tissues, i.e., r e ¼ corðe i ; e j Þ. We know from Bulik-Sullivan et al. 41 and Zhu et al. 44 that r e % r p ρ, where ρ ¼ N s ffiffiffiffiffiffiffi N i N j p measures the sample overlap with N i and N j being the sample sizes in tissues i and j, respectively, and N s being the number of overlapping individuals, and r p is the correlation of gene expression levels between two tissues in the overlapping sample. If i = j, then r e = 1 and var b i ð Þ ¼ varðb i Þ À var e i ð Þ, where var(b i ) is the variance of true cis-eQTL effects across genes in tissue i. We therefore can estimate the correlation of true cis-eQTL effect sizes across genes between tissues i and j aŝ where c varðb i Þ and c varðb j Þ (i.e., the estimates of varðb i Þ and varðb j Þ) are the observed sample variances ofb i andb j , respectively, in a set of genes, and c covðb i ;b j Þ is the observed sample covariance betweenb i andb j in the set of genes.
However, c var e i ð Þ, c varðe j Þ andr e are not observable. We know that SE 2 ofb of a SNP is an estimate of the variance of e over repeated experiments for a gene. We therefore can approximate c varðeÞ by the average of SE 2 across genes (one SNP per gene). We also know from Eq. (2) that if b i = b j = 0, Þ is the observed sample correlation betweenb i andb j in the set of genes. In practice, we computedr e for each gene using "null" SNPs (P eQTL > 0.01) in the cis-region by a simple correlation approach and took the average across genes. The sampling variance ofr b over repeated experiments can be computed via Jackknife approach leaving one gene out at a time.
wherer bðÀtÞ is the estimate with the t-th gene left out andr bð:Þ ¼ 1 m P tr bðÀtÞ . The method is derived based on eQTL data but can be applied to data from genetic studies of different types of molecular phenotypes (e.g., DNAm and histone modification).
Enrichment of tissue-specific eQTLs in functional categories. We used chromatin state data from 23 blood samples (T-cell, B-cell, and hematopoietic stem cells) and 10 brain samples generated by the NIH Roadmap Epigenomics Mapping Consortium (REMC) 31 . There were 25 chromatin states predicted by ChromHMM 61 based on the imputed data of 12 histone-modification marks 31 . We classified the 25 chromatin states into 14 main functional categories by combining functionally relevant annotations 62 . We tested the difference in eQTL effect for a gene between two tissues (i and j) using the method below. Let The sampling variance ofd over repeated experiments can be written as whereb i andb j are the estimated effect sizes of the top-associated cis-eQTL for a gene in two tissues i and j, Vðb i Þ and Vðb j Þ are the sampling variances ofb i andb j , respectively, over repeated experiments, and θ is sampling correlation betweenb i andb j for the gene over repeated experiments. In practice,Vðb i Þ andVðb j Þ can be estimated by SE 2 ofb i andb j , andθ can be approximated by the sample correlation betweenb i andb j across the "null" SNPs (e.g., P eQTL > 0.01) in the cis-region for the gene. The significance ofd can therefore be assessed by a Wald test, i.e., T D ¼d 2 vard ð Þ $ χ 2 1 . To test the enrichment of T D statistics in functional annotations, we allocated the cis-eQTLs to the 14 functional categories described above by physical position, and calculated the mean T D of each category. We assessed the enrichment using the inflation factor λ = T DðiÞ T D , where T DðiÞ is the mean T D of the cis-eQTLs in a category i, and T D is the mean T D of all the cis-eQTLs. We then used the Jackknife approach (leaving one gene out at a time) described above to compute the variability of λ. Note that although we described the enrichment test method above based on cis-eQTLs, the method can be applied to data from genetic studies of different types of molecular phenotypes (e.g., DNAm and histone modification).
Meta-analysis of cis-eQTL data from correlated samples. We know from Eq.
(1) that the estimated effect of a cis-eQTL for a gene can be partitioned into two components, i.e., the true effect size (b) and the estimation error (e). For multiple tissues, the joint distribution of the estimates can be written aŝ b $ N 1b; S ð Þ ð7Þ whereb ¼ ½b 1 ;b 2 ; ;b t , S is the sampling (co)variance matrix ofb over repeated experiments with S ij ¼ Cðb i ;b j Þ. S ij ¼ θ ij S i S j when i ≠ j, where θ ij is sampling correlation betweenb i andb j for the gene over repeated experiments. S 2 i and S 2 j are the sampling variance of b b i and b b j , respectively, over repeated experiments. If i = j, then θ ij ¼ 1 and S ij ¼ S 2 i . In practice,θ ij can be approximated by the sample correlation of the estimated SNP effects between a pair of tissues across the "null" SNPs (e.g., P eQTL > 0.01) in the cis-region for each gene. Similar to the summarydata-based meta-analysis methods that account for correlated estimation errors 39,40,63 , we can estimate combined effect aŝ Vb ¼ 1 Data availability. Brain-eMeta eQTL summary data are available at http:// cnsgenomics.com/software/smr/#Download. The eQTLGen summary data are available through application to the eQTLGen consortium. All the other data sets used in this study are from the public domain. The software tools are available at the URLs above.