Introduction

Genome-wide association studies (GWASs) have been carried out for numerous complex traits and diseases, identifying tens of thousands of single-nucleotide polymorphisms (SNPs) associated with these phenotypes. However, our understanding of most traits’ genetic basis remains incomplete, in part due to the limited power and interpretability of the traditional GWAS approach that correlates one trait with one SNP at a time. Recently, statistical methods that jointly model multiple phenotypes have quickly gained popularity in human genetics research1,2,3. Leveraging pervasive pleiotropy in the human genome, these methods have enhanced the statistical power to identify genetic associations1,4,5,6,7, improved the accuracy of genetic risk prediction8,9, revealed novel genetic sharing across diverse phenotypes10,11,12, and provided insights into the genetic basis of a variety of diseases and traits13,14.

Genetic similarity between traits can be modeled at different scales. Methods that identify SNPs associated with multiple phenotypes have achieved some success15,16,17. However, most complex human traits and their genetic overlaps are highly polygenic, with top SNPs showing weak to moderate effects18,19,20. Thus, single SNP-based methods modeling pleiotropy effects may not be sufficient to characterize the full landscape of genetic similarity of complex traits. An alternative approach is to estimate the genetic correlation between different traits10,12,21,22. These methods effectively utilize genome-wide genetic data, including SNPs that do not reach statistical significance in GWAS, to quantify the overall genetic sharing between two traits. In addition, recent methodological advances have enabled estimation of genetic correlation with GWAS summary statistics10,11,23, making these approaches widely applicable to a large number of complex phenotypes. With these advances, genetic correlation analysis has become a routine procedure in post-GWAS analysis and was implemented in almost all large-scale GWASs published in the past few years.

However, despite improved statistical power and wide applications, genetic correlation approaches fail to provide detailed, mechanistic insights because they reduce complex genetic sharing to a single metric. Two recent methods improved genetic correlation analysis by providing local12 and annotation-stratified estimates11. However, these methods rely on strong prior evidence about which local region or functional annotation to investigate. When applied to hypothesis-free scans, statistical power is reduced. In this work, we introduce LOGODetect (LOcal Genetic cOrrelation Detector), a method that uses scan statistics to identify genome segments harboring local genetic correlation between two complex traits. Compared to other methods, LOGODetect does not pre-specify candidate regions of interest and instead automatically detects regions with shared genetic components with high resolution and statistical power. In addition, LOGODetect only uses GWAS summary statistics as input and is robust to sample overlap between GWASs. We demonstrate its performance through extensive simulations and analysis of well-powered GWASs for seven distinct but genetically correlated neuropsychiatric traits24,25. Our analysis implicates a collection of hub regions (small genome segments harboring local genetic correlations for multiple trait pairs) in the genome that underlie the risk for several of these traits.

Results

Method overview

Our goal is to identify genome segments showing consistent association patterns with two different traits. Here, we provide an overview of our approach and the technical details are discussed in the “Methods” section. We propose the following scan statistic:

$$Q(R) = \frac{{\mathop {\sum }\nolimits_{i \in R} z_{1i}z_{2i}}}{{\left( {\mathop {\sum }\nolimits_{i \in R} l_i} \right)^\theta }}$$
(1)

to quantify the extent of local genetic similarity in a genome region, where R is the index set for all SNPs in the region, z1i and z2i are the association z-scores for the ith SNP with two traits, li is the linkage disequilibrium (LD) score for the ith SNP10, and θ controls the impact of LD. Q(R) extends the scan statistic proposed for single trait analysis26,27 to the framework of detecting local genetic correlation. The scan statistic Q(R) is an LD score-weighted inner product of local z-scores from two GWASs and is conceptually similar to local genetic correlation: regions with high absolute values of Q(R) show concordant association patterns across multiple SNPs in the region, and the sign of Q(R) shows whether the correlation is positive or negative. Of note, when the candidate region is the whole genome and θ is equal to 1, the scan statistic is an estimator for the global genetic covariance11. In our framework, we do not assume that per-SNP genetic covariance is the same for all SNPs across the genome, but instead assume that genetic covariance is localized in some small genome regions. Therefore, we use the scan statistic in a local region as a metric to detect significant local genetic sharing. A full discussion of the functional form of the scan statistic Q(R) is provided in Supplementary Notes. We search for genome segments with the highest |Q(R)| values by scanning the genome while allowing the segment size to vary (Fig. 1). Because we assume that the global genetic covariance can be attributed solely to some small regions, the identified segments should collectively recapitulate a large proportion of the genetic covariance of the two traits. We therefore select the optimal tuning parameter θ by maximizing the aggregated genetic covariance of all the identified regions. Statistical evidence of genetic sharing is assessed using a Monte Carlo approach.
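As an illustration only, the scan statistic in Eq. (1), a brute-force scan over variable-size windows, and a Monte Carlo p-value for the maximal |Q(R)| can be sketched as follows. The function names, the `max_len` cap on window size, and in particular the null model (independent standard normal z-scores, ignoring LD between SNPs and sample overlap) are assumptions made for this sketch; the actual procedure is the one described in the “Methods” section.

```python
import numpy as np


def scan_statistic(z1, z2, ld, theta):
    """Q(R) = sum(z1 * z2) / (sum(ld) ** theta) for the SNPs in one region."""
    z1, z2, ld = (np.asarray(a, dtype=float) for a in (z1, z2, ld))
    return float(np.sum(z1 * z2) / np.sum(ld) ** theta)


def best_segment(z1, z2, ld, theta, max_len):
    """Brute-force scan: return (|Q|, start, end) for the contiguous window
    of at most max_len SNPs with the largest |Q(R)| (end is exclusive)."""
    prod = np.asarray(z1, dtype=float) * np.asarray(z2, dtype=float)
    # Prefix sums give each window's numerator and denominator in O(1).
    num = np.concatenate(([0.0], np.cumsum(prod)))
    den = np.concatenate(([0.0], np.cumsum(np.asarray(ld, dtype=float))))
    n = len(prod)
    best = (0.0, 0, 0)
    for start in range(n):
        for end in range(start + 1, min(start + max_len, n) + 1):
            q = (num[end] - num[start]) / (den[end] - den[start]) ** theta
            if abs(q) > best[0]:
                best = (abs(q), start, end)
    return best


def monte_carlo_p(z1, z2, ld, theta, max_len, n_draws=200, seed=0):
    """Empirical p-value for the maximal |Q(R)|.

    ASSUMPTION: null z-scores are drawn as independent standard normals,
    which ignores LD and sample overlap; this is a simplification.
    """
    rng = np.random.default_rng(seed)
    n = len(ld)
    observed = best_segment(z1, z2, ld, theta, max_len)[0]
    null_max = [
        best_segment(rng.standard_normal(n), rng.standard_normal(n),
                     ld, theta, max_len)[0]
        for _ in range(n_draws)
    ]
    exceed = sum(1 for m in null_max if m >= observed)
    # Add-one correction keeps the estimate strictly positive.
    return (1 + exceed) / (1 + n_draws)
```

The nested loop is quadratic in the window cap and is meant to make the definition concrete rather than to be efficient; the sign of the statistic for a reported window can be recovered by calling `scan_statistic` on that window.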

Fig. 1: LOGODetect workflow.

a The inputs of LOGODetect include GWAS summary statistics for two traits and a reference panel for LD estimation. b Scan statistic is defined over a region, as the LD-weighted inner product of two z-score vectors in this region. A large absolute value of the scan statistic would hint at local genetic correlation. c LOGODetect identifies genome segments showing consistent associations with two different traits.

Simulation results

We conducted simulations to compare the performance of LOGODetect with three existing methods: ρ-HESS12, coloc28, and gwas-pw17. ρ-HESS estimates local genetic correlation in pre-specified genomic regions based on a fixed-effect model, whereas coloc and gwas-pw are Bayesian approaches that estimate the posterior probability of colocalization for two traits. Note that the definition of genetic covariance (correlation) in our study is consistent with the traditional definition of covariance (correlation) of additive genetic effects under a random-effect model10.

We used HAPGEN229 to simulate genotypes for 100,000 samples based on 503 individuals with European ancestry from the 1000 Genomes Project Phase 3 data30, and assessed the type I error of the four approaches under a variety of settings (see the “Methods” section; Supplementary Notes). First, we simulated phenotypes under an infinitesimal model in which genetic effects were assumed to be the same for all SNPs. We evaluated our method across a range of heritability combinations for two traits. We then compared different methods in two additional model settings representing diverse genetic architecture: a heritability-enrichment model where 3% of randomly selected SNPs explain 30% of trait heritability and the LDAK model31 with MAF-dependent and LD-dependent architecture. In addition, we investigated whether overlapping samples between two studies, mis-specified models with non-normal effects, and binary phenotypes would bias the inference. The family-wise type I error rate of our method was well-calibrated in all simulation settings with varying heritability values, extent of sample overlap, and genetic architecture (Supplementary Tables 1–8), showcasing the statistical robustness of LOGODetect. Type I errors for ρ-HESS were too conservative when heritability varied from 0.01 to 0.05 but showed substantial inflation when heritability was large (e.g. 0.2) (Supplementary Table 9).

We also assessed the statistical power of LOGODetect under various settings. Three different metrics (i.e. point detection rate, segment detection rate, and G-score) were used to quantify the statistical power (see the “Methods” section). Signal points detection rate and signal segments detection rate measure the sensitivity at the SNP level and segment level, respectively. However, they do not reflect specificity of the method, as both metrics will be 1 if the identified region is the entire genome. G-score is a more informative alternative, which can jointly quantify specificity and sensitivity. First, we evaluated the power of LOGODetect with different θ under a heritability enrichment model, where a higher level of heritability was attributed to correlated regions (Supplementary Fig. 1). LOGODetect with adaptive θ achieved universally higher statistical power in three measures compared to the fixed θ approach, which demonstrated that maximizing aggregated genetic covariance of the identified regions could lead to a reasonable estimate of θ.
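To make the two sensitivity metrics concrete, a minimal sketch is given below. The exact definitions are those in the “Methods” section; the versions here are assumptions based on the description above, in particular the rule that a true segment counts as detected when at least one of its SNPs is covered by an identified region. G-score is not sketched because its formula is not given in this part of the text.

```python
import numpy as np


def point_detection_rate(true_mask, found_mask):
    """Fraction of true signal SNPs that fall inside any identified region."""
    true_mask = np.asarray(true_mask, dtype=bool)
    found_mask = np.asarray(found_mask, dtype=bool)
    return float((true_mask & found_mask).sum() / true_mask.sum())


def segment_detection_rate(true_segments, found_mask):
    """Fraction of true signal segments touched by an identified region.

    true_segments is a list of (start, end) SNP index pairs, end exclusive.
    ASSUMPTION: one covered SNP is enough for a segment to count as detected.
    """
    found_mask = np.asarray(found_mask, dtype=bool)
    hits = [bool(found_mask[start:end].any()) for start, end in true_segments]
    return float(np.mean(hits))
```

Passing a `found_mask` that is true everywhere returns 1 for both functions, which is the degenerate case noted above and the reason a metric that also penalizes overly large regions is needed.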

Further, we compared different methods under a heritability enrichment model. As heritability increases, LOGODetect showed improvements in all three measures of statistical power without inflating the type I error (Fig. 2a–c). LOGODetect achieved greater signal points detection rates compared to ρ-HESS when heritability is low to moderate, compared to gwas-pw when heritability is moderate to high, and compared to coloc in all heritability settings (Fig. 2a). Moreover, LOGODetect showed almost universally higher signal segments detection rates and G-scores compared to the other three methods (Fig. 2b, c). LOGODetect achieved only slightly lower signal segments detection rates than ρ-HESS in one exceptional case when heritability is 0.05. We obtained consistent results under the heritability enrichment model with varying proportion of heritability (Fig. 2d–f). The gain in G-score can be attributed to the fact that LOGODetect flexibly and precisely identifies true signal regions, while ρ-HESS, coloc, and gwas-pw pre-specify candidate regions, which in general are much larger than the true signal regions, regardless of the disease phenotype. We also investigated whether sample overlap or binary phenotypes would affect the performance of our method. In addition, we compared the statistical power of different approaches under mis-specified models, including the LDAK model31 with MAF-dependent and LD-dependent effect sizes, non-infinitesimal models with sparse effects, and infinitesimal models with heavy-tailed effect distributions. Details of simulation settings are shown in the Supplementary Notes. We obtained consistent results under all simulation settings (Supplementary Figs. 2–10). Finally, the presence of correlated signal regions in the genome did not inflate type I error in non-signal regions (Supplementary Tables 10 and 11).

Fig. 2: Assessment of statistical power under a heritability-enrichment model with varying trait heritability and heritability enrichment.

The Y-axis shows the statistical power assessed by three different metrics: a, d signal points detection rate measures sensitivity at the SNP level, b, e signal segments detection rate measures sensitivity at the segment level, and c, f G-score jointly measures specificity and sensitivity. The heritability represents the trait heritability on chromosome 1 and the proportion of heritability represents the proportion of the trait heritability explained by the signal regions. Significance cutoffs for gwas-pw are adjusted so that the empirical type I error rate is controlled at 0.05. Details on the above three metrics and the adjustment procedure for the significance cutoff are discussed in the “Methods” section. Source data are provided as a Source Data file.

Application to seven neuropsychiatric traits

Previous studies have revealed pervasive pleiotropy32,33,34 and genetic covariance35,36,37,38 among neuropsychiatric traits. However, there is limited understanding of the specific genetic loci contributing to multiple traits. We applied LOGODetect to study the pairwise local genetic correlation between seven neuropsychiatric traits (Supplementary Table 12): bipolar disorder (BIP; n = 51,710), schizophrenia (SCZ; n = 105,318), major depressive disorder (MDD; n = 173,005), neuroticism (NEU; n = 390,278), attention-deficit/hyperactivity disorder (ADHD; n = 53,293), autism spectrum disorder (ASD; n = 46,350), and intelligence (IQ; n = 269,867), using summary statistics from the latest GWASs39,40,41,42,43,44,45. We adaptively selected the best θ by maximizing the genetic covariance in all identified regions (Supplementary Table 13). In total, we identified 410 regions (merged into 227 non-overlapping segments) showing concordant associations with multiple traits (FDR < 0.05; Fig. 3a and Supplementary Figs. 11–28). Of the 410 regions, 274 showed positive correlations (Supplementary Data 1). The size of the identified genome segments varied from 4 KB to 1.6 MB (Supplementary Fig. 29). The number of significant segments identified in our analysis is proportional to the absolute value of genetic correlation between each trait pair (Supplementary Fig. 30; correlation r = 0.23). We identified 56 shared genomic regions for BIP and SCZ (Fig. 3b; genetic correlation rg = 0.68, p = 9.14e−87), 53 regions for SCZ and IQ (Supplementary Fig. 18; genetic correlation rg = −0.23, p = 4.36e−28), 40 regions for MDD and NEU (Supplementary Fig. 26; rg = 0.78, p = 6.38e−41), and 261 regions for 16 other trait pairs, which is consistent with the strong genetic overlap between these traits46,47,48,49. Overall, we identified strong genetic sharing (higher genetic correlation and more shared genome segments) among BIP, SCZ, MDD, and NEU and among MDD, ADHD, ASD, and IQ. Sharing between these two clusters was relatively weaker, which is consistent with previous reports50.

Fig. 3: LOGODetect identifies genome regions contributing to multiple neuropsychiatric traits.

a Heatmap shows the genetic correlation estimates (upper triangle) and the number of segments with local genetic correlation identified by LOGODetect (lower triangle) between the seven neuropsychiatric traits; barplot shows the observed-scale heritability estimates and standard errors of the seven traits using LDSC10. b Mirrored Manhattan plot for BIP and SCZ. The 56 shared genome regions identified by LOGODetect are highlighted in red. One locus on chromosome 6 with \(- \log _{10}P > 20\) in SCZ is truncated at 20 for visualization purposes.

LOGODetect identifies precise regions with genetic sharing

To benchmark the performance of LOGODetect against existing approaches, we also applied ρ-HESS, coloc, and gwas-pw to the same seven neuropsychiatric traits. We first assumed full sample overlap, as suggested in the original paper that introduced ρ-HESS. In total, ρ-HESS detected 778 regions for BIP and SCZ, and 304 regions for SCZ and IQ (FDR < 0.05; Supplementary Table 14). It only detected three regions for MDD and NEU, and failed to detect any significant local genetic correlation for any pair among MDD, ADHD, and ASD. We also estimated the shared sample sizes based on the reported size of cohorts included in multiple studies (Supplementary Table 15), and used these approximated values as inputs for ρ-HESS to correct for sample overlap bias. The results remained consistent (Supplementary Table 14). The colocalization methods also detected strong genetic sharing between BIP and SCZ, between SCZ and NEU, and between SCZ and IQ (posterior probability > 0.95; Supplementary Table 14).

We used the analysis of BIP and SCZ as an example to further illustrate the performance of LOGODetect. We used genetic covariance enrichment to quantify the precision of identified signal regions (Supplementary Notes). First, regions identified by LOGODetect showed the highest enrichment of genetic covariance compared to other methods (Fig. 4a). Although ρ-HESS identified more shared regions between BIP and SCZ, the enrichment of genetic covariance was 9.4-fold higher in the regions identified by LOGODetect, which is concordant with the simulation results based on G-scores. Second, we broke down the regions identified by ρ-HESS, coloc, and gwas-pw into two subsets: regions that overlap and regions that do not overlap with those identified by LOGODetect. The regions overlapping with LOGODetect results showed a higher enrichment for genetic covariance, while enrichment in the regions identified by other methods alone was substantially lower (Fig. 4b). Third, to avoid comparison at an arbitrary significance cutoff, we ranked the regions identified by LOGODetect, ρ-HESS, coloc, and gwas-pw by the corresponding p-values or posterior probabilities, and evaluated the proportion of explained genetic covariance at various thresholds. LOGODetect substantially outperformed other methods, explaining more genetic covariance with the same proportion of SNPs (Fig. 4c; Supplementary Figs. 31–50). We also used estimated overlapping sample sizes to de-bias ρ-HESS estimates and results remained consistent (Supplementary Fig. 51).

Fig. 4: LOGODetect identifies precise genomic regions harboring local genetic correlations.

Genetic covariance and its corresponding enrichment were calculated using stratified-LDSC10. a Genetic covariance fold enrichment (i.e. the ratio between the proportion of total genetic covariance and the proportion of the total SNP counts) in regions identified by LOGODetect, ρ-HESS, coloc, and gwas-pw, respectively. b Genetic covariance fold enrichment in regions identified by ρ-HESS, coloc, and gwas-pw that also overlapped with LOGODetect findings, and regions identified by ρ-HESS, coloc, and gwas-pw alone. c Genetic covariance explained and proportion of SNPs covered by regions identified by LOGODetect, ρ-HESS, coloc, and gwas-pw.
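The fold enrichment defined in panel a is a ratio of two proportions and can be written as a one-line helper. The function and argument names below are illustrative, and the numbers in the usage note are made up for the example rather than taken from the analysis.

```python
def fold_enrichment(cov_in_regions, cov_total, snps_in_regions, snps_total):
    """Share of total genetic covariance captured by the regions,
    divided by the share of all SNPs that the regions contain."""
    cov_share = cov_in_regions / cov_total
    snp_share = snps_in_regions / snps_total
    return cov_share / snp_share
```

For instance, regions that carry 30% of the genetic covariance while covering 3% of SNPs would give `fold_enrichment(0.03, 0.10, 30_000, 1_000_000)`, approximately 10.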

There are two reasons why our method showed improved performance compared to the other methods. First, ρ-HESS and the colocalization methods pre-specify regions of interest, which are generally much larger than the signal regions harboring true genetic sharing (Supplementary Fig. 52), while our scanning approach is data-adaptive and can precisely identify the boundaries for signal regions. Second, both BIP (heritability h2 = 0.35) and SCZ (h2 = 0.43) have high SNP heritability. As demonstrated in the simulations, regions identified by ρ-HESS may include a non-negligible proportion of false positive findings.

Further, we evaluated the identified regions in an independent replication cohort. We tested whether the significantly correlated regions between BIP and SCZ can be replicated in the UK Biobank (UKBB). The summary statistics of BIP (ncase = 1064, ncontrol = 365,476) and SCZ (ncase = 571, ncontrol = 365,476) in the UKBB were collected (Supplementary Table 16). Due to the unbalanced case-control ratio and limited effective sample size, we used aggregated genetic covariance to evaluate the replication (Supplementary Notes). Stratified-LDSC was not applicable due to the imbalanced sample sizes of cases and controls; we therefore applied GNOVA11 for stratified genetic covariance analysis of the regions identified by each of the four methods in the UKBB data. The regions identified by LOGODetect and ρ-HESS both showed significant genetic covariance, but the regions identified by LOGODetect have a 6.7-fold higher genetic covariance enrichment than those of ρ-HESS, which demonstrates again that LOGODetect can more precisely detect the true signal regions (Table 1; Supplementary Fig. 53). Regions identified by gwas-pw showed no significant genetic covariance, while regions identified by coloc showed significant genetic covariance with the opposite sign.

Table 1 Stratified genetic covariance analysis on UKBB replication cohorts.

We also replicated findings for body-mass index (BMI) and height, for which independent replication cohorts of large sample size are available (Supplementary Notes). We identified 24 regions with significant local genetic correlation in the discovery analysis. 17 of 24 regions identified in the discovery stage were successfully replicated, suggesting the effectiveness of LOGODetect to identify replicable genomic regions with local genetic correlations (Supplementary Table 17).

Tissue enrichment of hub regions shared by neuropsychiatric traits

We used 66 GenoSkyline-Plus tissue-specific functional annotations51 to investigate the functional relevance of the genomic regions found to harbor local genetic correlations among seven neuropsychiatric traits (Supplementary Table 18). We used permutation tests to assess the enrichment of genome regions shared by multiple traits in these annotation tracks. Genome regions identified by LOGODetect were significantly enriched in eight brain regions (minimum enrichment = 1.50, p = 4.00e−4) (Fig. 5a). In addition to brain tissues, regions shared by neuropsychiatric traits were also strongly enriched in mononuclear cells from peripheral blood (enrichment = 1.93, p = 1.00e−5) and pancreatic islets (enrichment = 2.11, p = 1.00e−5). Of note, annotated functional regions in mononuclear cells and pancreatic islets have substantial overlaps with annotations of brain tissues (Fig. 5b). After conditioning on functional regions in the brain, the enrichment in pancreatic islets was substantially reduced (enrichment = 1.1, p = 0.224; Fig. 5c), while enrichment in mononuclear cells remained significant (enrichment = 1.66, p = 3.55e−3).
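A permutation test of this kind can be sketched as below. This is a generic illustration under stated assumptions, not the procedure used in the analysis: the genome is represented as an ordered array of bins, both the identified regions and the annotation are boolean masks over those bins, and the null distribution is built by circularly shifting the region mask, which preserves segment lengths. All names are hypothetical.

```python
import numpy as np


def permutation_enrichment(region_mask, annot_mask, n_perm=1000, seed=0):
    """Return (fold enrichment, empirical p-value) for region/annotation overlap.

    ASSUMPTION: the null is generated by circular shifts of the region mask.
    Enrichment is the observed overlap divided by the mean null overlap,
    so the annotation must overlap at least some shifted copies.
    """
    rng = np.random.default_rng(seed)
    region_mask = np.asarray(region_mask, dtype=bool)
    annot_mask = np.asarray(annot_mask, dtype=bool)
    observed = int((region_mask & annot_mask).sum())
    # Shift by 1 .. size-1 bins so the unshifted mask is never in the null.
    shifts = rng.integers(1, region_mask.size, size=n_perm)
    null = np.array([(np.roll(region_mask, s) & annot_mask).sum() for s in shifts])
    enrichment = observed / null.mean()
    p_value = (1 + int((null >= observed).sum())) / (1 + n_perm)
    return float(enrichment), float(p_value)
```

With the add-one correction, the smallest attainable p-value is 1/(n_perm + 1), so the number of permutations bounds the resolution of the reported p-values.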

Fig. 5: Tissue-specific enrichment of genome regions conferring risk for multiple neuropsychiatric traits.

a Permutation test results over 66 cell-type-specific annotations. Fold enrichment is labeled next to each bar. b The overlap of predicted functional regions in pancreatic islets, mononuclear cells from peripheral blood, and eight brain regions. c Enrichment in the predicted functional regions in pancreatic islets and mononuclear cells from peripheral blood after conditioning on the annotation overlap with brain regions.

To further assess whether enrichments are truly tissue-specific, we performed conditional analysis on six generic annotations (i.e., coding regions, enhancers, introns, promoters, 5′UTRs, and 3′UTRs, extended by a 500-bp window around each annotation) from Finucane et al.52. After conditioning on these annotations, the enrichment in brain tissues remained significant (minimum enrichment = 1.37, p = 1.98e−3), suggesting that the observed enrichment in the functional genome of these brain tissues was not driven by generic annotations alone (Supplementary Fig. 54).

We also ran Gene Ontology-enrichment analysis using FUMA53. The 968 genes in regions detected by LOGODetect were significantly enriched for 83 GO terms (Supplementary Table 19) after multiple testing correction, including RNA metabolic process (p = 5.36e−13), nucleolus (p = 9.30e−6), and protein arginine deiminase activity (p = 7.35e−9).

Hub regions contributing to multiple neuropsychiatric traits

Next, we investigated hub regions shared by five or more traits. Among the 227 non-overlapping genome regions identified in our analysis, 91 regions were identified in two or more different trait pairs (Supplementary Data 2). The five regions identified in at least seven pair-wise analyses are summarized in Supplementary Table 20. Notably, LOGODetect consistently identified these hub regions in more trait pairs compared to other methods. These hub regions show consistent associations with multiple neuropsychiatric traits and can potentially reveal key mechanisms and pathways underlying the shared genetics across traits.

The region showing significant correlation in nine pair-wise analyses is a locus spanning 711 KB on chromosome 11 (Fig. 6). Interestingly, two independent peaks were identified in this region between SCZ and NEU and between MDD and NEU. SNPs in this region have previously reached genome-wide significance in the SCZ40 (lead SNP rs2514218; p = 2.42e−12), NEU45 (lead SNP rs35738585; p = 2.47e−17), and IQ GWAS44 (lead SNP rs2885208; p = 4.58e−8). Additionally, SNPs at this locus showed consistent associations with BIP (lead SNP rs10502165; p = 3.90e−5). More importantly, this genome region covers the NTAD (NCAM1-TTC12-ANKK1-DRD2) gene cluster. Multiple variants of DRD2 and NCAM1 are reported to be associated with BIP, SCZ, MDD, and NEU54,55,56. Also, multiple eQTLs for DRD2 (lead SNP rs6589381; p = 1.10e−14) and NCAM1 (lead SNP rs1079021; p = 9.20e−16) are located in the region.

Fig. 6: Putative target genes for the hub region in chr11 shared by nine neuropsychiatric trait pairs.

Locuszoom plot, recombination rate, and the gene names are provided. The colored bands denote the location of the significant region and the trait pair in which it was detected.

Another region on chromosome 11 spans 488 KB and shows significant correlations in seven pair-wise analyses (Supplementary Fig. 55). IGSF9B, a potential risk gene for SCZ40 and IQ44, and its multiple eQTLs (lead SNP rs558709; p = 1.80e−13) are located in this genomic region. The third hub region is located on chromosome 14 spanning 694 KB and shows significant correlations in seven trait pairs (Supplementary Fig. 56). Gene PRKD1 largely overlaps with this locus, and FOXG1, which is an associated gene for SCZ40 and IQ44, lies about 200 KB away from the locus. In addition, multiple eQTLs for PRKD1 (lead SNP rs80019464; p = 6.40e−5) and FOXG1 (lead SNP rs138384350; p = 6.10e−7), are located in the region. The fourth region on chromosome 3 spans 258 KB and was identified in seven pairs (Supplementary Fig. 57). Notably, most parts of this genomic region are covered by the gene FOXP1, which is an implicated risk gene for SCZ40 and IQ44. The final region spans 450 KB on chromosome 10. This region was identified in seven trait pairs (Supplementary Fig. 58) and largely overlaps with SORCS3, a previously implicated risk gene for MDD and ADHD42,57,58.

Discussion

Through simulations and analyses of GWAS data, we demonstrated that our method effectively identifies genetic regions that show concordant associations across multiple complex traits with high resolution and statistical power. Compared to existing approaches, LOGODetect has greater statistical power and is robust across various heritability settings and in the presence of sample overlap. Applied to well-powered GWASs for seven phenotypically distinct but genetically correlated neuropsychiatric traits, LOGODetect identified numerous shared genomic regions, including hub regions that showed consistent effects for more than four traits. The regions identified by LOGODetect explain a larger portion of genetic covariance than those identified by existing approaches. Furthermore, the enrichment holds true in independent replication studies.

Two genes (i.e. DRD2 and NCAM1) are located in the hub region on chromosome 11 (Fig. 6). DRD2, also known as dopamine receptor D2, encodes the D2 subtype of the dopamine receptor. The dopamine hypothesis of schizophrenia suggests that dopaminergic pathways are overactive in schizophrenia59. In addition, multiple variants are reported to be associated with psychiatric disorders54. NCAM1, short for neural cell adhesion molecule 1, plays an important role in formation of plexiform layers, neurite fasciculation, nerve–muscle interactions and other aspects of neural development60. Expression of PSA-NCAM is increased under antidepressant treatment, whereas it is reduced in animal models of depression and in depressed patients56. Notably, NCAM1 was identified by LOGODetect as an implicated gene for MDD, but it could not be identified by the other three methods.

Other identified hub regions also included a handful of interesting candidate genes. IGSF9B (Supplementary Fig. 55) encodes a brain-specific cell adhesion molecule that is highly expressed in GABAergic interneurons and is concentrated at hippocampal and cortical inhibitory synapses, where it contributes to their development61. Interestingly, the promotion of inhibitory synapse development by IGSF9B is coupled with NLGN2, loss-of-function variants of which were found in autism and schizophrenia patients62,63. PRKD1 (Supplementary Fig. 56) encodes a serine/threonine protein kinase that is important in many cellular processes and regulates neuronal polarity, synapse formation, and synaptic plasticity64,65,66. FOXG1 (Supplementary Fig. 56) encodes the fork-head box protein G1, which is strongly expressed in neural tissues and operates as a transcriptional repressor essential in brain development67. It was suggested that the PRKD1 locus regulates FOXG1 in a cis-acting way, and is associated with the FOXG1 syndrome including mental retardation, absent language, and dyskinesia67. FOXP1 (Supplementary Fig. 57) is a member of the FOXP transcription factor subfamily. It is expressed in the cerebral cortex, striatum, and spinal cord of the central nervous system, and has been shown to regulate striatum development, motor neuron migration, and midbrain dopamine neuron differentiation68. FOXP1 is associated with ASD, speech delay, and intellectual disability69,70. SORCS3 (Supplementary Fig. 58) is highly expressed in the CA1 region of the hippocampus, and is involved in synaptic depression and spatial learning ability71,72. It is also known to play an important role in protein networks associated with PICK1, NGF, and PDGF-BB73,74, which have been implicated in ADHD, ASD, MDD, and SCZ75,76,77,78.

Our method still has some limitations. First, the goal of LOGODetect is to identify genomic regions harboring local genetic correlations. We do not give explicit estimation of local genetic correlation, but the sign of the correlation can be inferred. Although local genetic correlation in identified regions can be estimated by other methods (e.g., ρ-HESS) in principle, this remains a statistically challenging problem. As shown in our simulations, the estimation is inaccurate. Under the null hypothesis that local genetic correlation is zero, ρ-HESS underestimates the standard error of local genetic covariance when heritability is high, which leads to inflated type I error rates, and it overestimates the standard error of local genetic covariance when heritability is low, which leads to deflated statistical power. We note that the deflation of type I error observed for ρ-HESS is not contradictory to results published in the ρ-HESS paper12. ρ-HESS was formulated as an estimation problem instead of the hypothesis testing problem in our manuscript. In their paper, the authors showed simulation results demonstrating that the local genetic correlation can be accurately estimated when the true parameter is 0. However, the evidence shown in the ρ-HESS paper could not rule out deflation when the method is used for inference. These problems are further exacerbated when ρ-HESS is applied to very small local genomic regions identified by LOGODetect. Second, LOGODetect scans a large number of genomic regions to search for local regions where the scan statistic significantly deviates from the null distribution. We currently do not have an analytical solution to derive or approximate the theoretical null distribution. Instead, a Monte Carlo approach is employed to quantify the null distribution of the maximal scan statistic, which is computationally expensive. Third, we proposed an empirical method to select the tuning parameter based on the aggregated genetic covariance of the identified regions. Although there is no theoretical guarantee, we have conducted extensive simulations to demonstrate that the empirical strategy to estimate θ works well with frequently used genetic models and leads to superior performance of LOGODetect compared to competing methods, in terms of error control and statistical power. Fourth, our simulations are conducted with simulated genotypes based on the European ancestry individual data in the 1000 Genomes Project. It would be interesting to simulate data with various population structures to test our method. In real data applications, GWAS summary statistics are usually corrected for the genomic control factor; thus we expect population structure to have a minor impact on the performance of LOGODetect. Fifth, several recent methods have been proposed to jointly model more than two GWAS traits to infer the structure of shared genetics across multiple phenotypes14,47,79. A future direction is to generalize our method to search for genomic regions shared by more than two traits. Finally, LOGODetect studies genetic correlation from GWAS data using a bivariate random effect model, which defines the genetic correlation as the correlation between SNP effects10,18,21,80. Under this model, the definition of genetic correlation is consistent with the traditional definition of correlation of additive genetic effects10. Yet the concept should be distinguished from the additive genetic correlation, since the estimation could be biased toward the partial effects of tag SNPs, and the causal effects of untagged SNPs would be absorbed into the random error term81.

Taken together, we have introduced LOGODetect, a scan statistic method to identify local genetic regions showing correlated effects with multiple neuropsychiatric traits. Complementary to single SNP-based approaches for pleiotropy mapping17,28 and genetic correlation estimation methods utilizing genome-wide data10,21, our method elucidates the shared genetic architecture between two traits by identifying local genomic segments that are concordant. The candidate genes and regions we identified may be tapping into a set of transdiagnostic mechanisms that underlie all of psychopathology (i.e., the “p” or general factor47). In practice, LOGODetect can be used in combination with other methods to further improve statistical power and biological interpretability. For example, it may be of interest to first screen the genome by identifying larger genetic regions12,82 or certain functional annotations11 enriched for the shared genetics between two traits. Then, LOGODetect can be applied to these candidate regions to identify the precise genetic segments that explain such sharing. Since high-dimensional sampling remains a challenge, a multi-tier analytical strategy would improve statistical power and reduce the computational burden of the analysis. We believe that LOGODetect has addressed some key limitations in the current practice of cross-trait genetic correlation analysis and will benefit complex trait genetics research.

Methods

Genetic model

Suppose two standardized traits y1 and y2 follow the linear model with random effects:

$$\begin{array}{l}{\mathbf{y}}_1 = {\mathbf{X\beta }} + {\mathbf{\epsilon }},\\ {\mathbf{y}}_2 = {\mathbf{Z\gamma }} + {\mathbf{\delta }},\end{array}$$
(2)

where X and Z are fixed and standardized genotype matrices with M columns (i.e. the number of SNPs is M); ε and δ are non-genetic effects; β and γ are M-dimensional vectors denoting genetic effects. They follow the multivariate normal distribution:

$$\left[ {\begin{array}{*{20}{c}} {\mathbf{\beta }} \\ {\mathbf{\gamma }} \end{array}} \right] \sim {\mathrm{N}}\left( {\begin{array}{*{20}{c}} {\left[ {\begin{array}{*{20}{c}} 0 \\ 0 \end{array}} \right]\quad ,\quad } & {\left[ {\begin{array}{*{20}{c}} {\frac{{h_\beta ^2}}{M}{\mathbf{I}}_{\mathrm{{M}}}} & {\frac{{\rho _{\mathrm{{g}}}}}{K}{\tilde{\mathbf{I}}}_{\mathrm{{M}}}} \\ {\frac{{\rho _{\mathrm{{g}}}}}{K}{\tilde{\mathbf{I}}}_{\mathrm{{M}}}} & {\frac{{h_\gamma ^2}}{{\mathrm{{M}}}}{\mathbf{I}}_{\mathrm{{M}}}} \end{array}} \right]} \end{array}} \right),$$

where \(h_\beta ^2\) and \(h_\gamma ^2\) denote the heritability of the two traits; ρg is the global genetic covariance between the two traits; \({\tilde{\mathbf{I}}}_{\mathrm{{M}}}\) is a diagonal matrix whose ith diagonal element equals 1 if the effects of the ith SNP on the two traits (i.e. βi and γi) are correlated and equals 0 otherwise; K is the number of SNPs such that βi and γi are correlated, i.e., \(K = {\mathrm{tr}}[{\tilde{\mathbf{I}}}_{\mathrm{{M}}}]\). β and γ are independent of the non-genetic effects ε and δ. The statistical model described here is similar to the polygenic model used in genetic correlation estimation10. The difference is that we allow local genetic sharing and do not assume the global genetic covariance to be equally attributed to all SNPs in the whole genome. Compared to the local genetic correlation estimation method in the literature12, we do not assume genetic effects to be fixed. Instead, our framework is a direct generalization of the model developed for global genetic correlation estimation10,11. Under the alternative hypothesis, we denote the non-overlapping genetic regions that contribute to both traits as \(R_1, \ldots ,R_r\) and their union as \({\cal{R}} = \cup _{j = 1}^rR_j\), such that \({\tilde{\mathbf{I}}}_{\mathrm{{M}}}[i,i] = 1\) if and only if \(i \in {\cal{R}}\). Under the null hypothesis, the two traits share no genetic covariance, i.e., \({\cal{R}} = \emptyset\).

Scan statistic and scanning procedure

We use a scan statistics approach to identify regions showing correlated effects between different traits. This type of approach has been used for burden test in a single-trait setting83. Suppose \(n_1,n_2\) are the sample sizes for two GWASs, respectively, and we first consider the simpler case that there is no sample overlap between two GWASs. Additionally, we denote the association z-scores for two traits as

$$\begin{array}{l}{\mathbf{z}}_1 = \left[ {z_{11},z_{12}, \ldots ,z_{1M}} \right]^{\mathrm{{T}}} = \frac{1}{{\sqrt {n_1} }}{\mathbf{X}}^{\mathrm{{T}}}{\mathbf{y}}_1,\\ {\mathbf{z}}_2 = \left[ {z_{21},z_{22}, \ldots ,z_{2M}} \right]^{\mathrm{{T}}} = \frac{1}{{\sqrt {n_2} }}{\mathbf{Z}}^{\mathrm{{T}}}{\mathbf{y}}_2.\end{array}$$
(3)

Then, we can define the scan statistic:

$$Q\left( R \right) = \frac{{\mathop {\sum }\nolimits_{i \in R} z_{1i}z_{2i}}}{{\left( {\mathop {\sum }\nolimits_{i \in R} l_i} \right)^\theta }},$$
(4)

where R is the index set for SNPs in a genomic region, \(l_i\) is the LD score80 for the ith SNP computed within a 1 MB window, and \(\theta\) is a tuning parameter that controls how strongly the LD structure is penalized. If SNPs in the region \(R\) show strong, concordant effects on both traits, then the inner product \(\mathop {\sum }\nolimits_{i \in R} z_{1i}z_{2i}\) will tend to have a larger absolute value and therefore yield a scan statistic with larger absolute value. On the contrary, if two traits are genetically independent in the local region, then the corresponding scan statistic would be close to 0. Therefore, the scan statistic is informative for detecting local genetic correlation. The purpose of the LD score term in the denominator is to normalize the effect of LD. The expected absolute value of \(\mathop {\sum}\nolimits_{i \in R} {z_{1i}z_{2i}}\) is larger in regions with strong LD (Supplementary Fig. 59; Supplementary Notes). Without the normalization term in the denominator, the method would favor regions with large LD that may not be of biological interest.
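As a minimal illustration of Eq. (4), the following Python sketch computes \(Q(R)\) for a contiguous region. It is not the LOGODetect implementation; the function name `scan_statistic` and the half-open index convention are assumptions made for this example.

```python
import numpy as np

def scan_statistic(z1, z2, ld_scores, start, end, theta):
    """Q(R) for the contiguous region R = {start, ..., end - 1}.

    z1, z2: association z-scores of the two traits, aligned SNP by SNP.
    ld_scores: per-SNP LD scores (assumed strictly positive).
    theta: tuning parameter controlling the LD penalty.
    """
    numerator = np.sum(z1[start:end] * z2[start:end])
    denominator = np.sum(ld_scores[start:end]) ** theta
    return numerator / denominator
```

With θ = 0 the statistic reduces to the raw inner product of the z-scores, and larger θ penalizes regions with a larger total LD score more heavily.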

Finally, we use the maximal scan statistic over all possible regions as the test statistic:

$$Q_{{\mathrm{{max}}}} = \mathop {{\max }}\limits_{\left| R \right| \le C} \left| {Q\left( R \right)} \right|,$$
(5)

where C is a pre-specified parameter that defines the upper bound on the number of SNPs in a region. In practice, C can be set based on the number of SNPs in the dataset (e.g. the average number of SNPs in 1 million bases). LOGODetect takes advantage of this flexible framework to scan local regions with varying sizes. Compared to a sliding-window approach based on a pre-specified window size, our method is more appealing since the size of a signal region could vary substantially by locus and by trait. We use a Monte Carlo approach to assess the distribution of \(Q_{{\mathrm{{max}}}}\) under the null hypothesis. We draw 5000 pseudo-samples \(\left[ {\begin{array}{*{20}{c}} {{\mathbf{z}}_1} \\ {{\mathbf{z}}_2} \end{array}} \right]\) under the null distribution using the procedure detailed in the section “Monte Carlo simulation of pseudo-z-score vectors”. Then, we estimate the empirical null distribution of \(Q_{{\mathrm{{max}}}}\) and its 95% quantile, \(Q_{0.95}\). Taken together, the scanning procedure works as follows. We scan the genome to find \(R_1\) such that \(\left| {Q\left( {R_1} \right)} \right|\) reaches the maximum. If \(\left| {Q\left( {R_1} \right)} \right| \ge Q_{0.95}\), we claim that \(R_1\) is a significant signal region and remove these SNPs from the analysis. Then, we repeat the procedure on the remaining SNPs until no region is declared significant. This procedure controls the family-wise type I error rate. Calculating \(Q\left( R \right)\) over all possible candidate regions is computationally expensive, so in practice we constrain \(\left| R \right|\) to be a multiple of 10, which reduces the computational burden by ~10-fold with minimal reduction in accuracy. Finally, regions that are no more than 100 KB away from each other are merged into a single region.
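The scanning procedure can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the LOGODetect code: the function names are hypothetical, window sums are obtained from prefix sums, removed SNPs are simply dropped from the arrays (so a later window may span a removed gap), LD scores are assumed strictly positive, and the final merging of regions within 100 KB is omitted.

```python
import numpy as np

def max_scan(prod, ld, theta, max_len, step=10):
    """Return (max |Q|, start, end) over contiguous windows whose length
    is a multiple of `step` and at most `max_len` (half-open indices)."""
    cs_prod = np.concatenate(([0.0], np.cumsum(prod)))
    cs_ld = np.concatenate(([0.0], np.cumsum(ld)))
    best = (0.0, 0, 0)
    for length in range(step, min(max_len, prod.size) + 1, step):
        num = cs_prod[length:] - cs_prod[:-length]      # window sums of z1*z2
        den = (cs_ld[length:] - cs_ld[:-length]) ** theta
        q = np.abs(num / den)
        i = int(np.argmax(q))
        if q[i] > best[0]:
            best = (float(q[i]), i, i + length)
    return best

def null_threshold(null_z1, null_z2, ld, theta, max_len, step=10, level=0.95):
    """Empirical quantile of Q_max over rows of null pseudo-z-score draws."""
    qmax = [max_scan(a * b, ld, theta, max_len, step)[0]
            for a, b in zip(null_z1, null_z2)]
    return float(np.quantile(qmax, level))

def detect_regions(z1, z2, ld, theta, max_len, q95, step=10):
    """Greedy scan: report the maximal region while it reaches q95,
    drop its SNPs, and rescan the remaining SNPs."""
    idx = np.arange(z1.size)              # original indices still in play
    prod, ld = z1 * z2, np.asarray(ld, dtype=float)
    found = []
    while idx.size >= step:
        q, s, e = max_scan(prod[idx], ld[idx], theta, max_len, step)
        if q < q95:
            break
        found.append((int(idx[s]), int(idx[e - 1]) + 1, q))
        idx = np.delete(idx, np.s_[s:e])
    return found
```

Here `q95` would come from `null_threshold` applied to the Monte Carlo pseudo-samples, and `max_len` plays the role of C.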

Choice of parameter θ

Parameter θ affects the size of identified regions. A relatively long segment may not have a large absolute value of scan statistic, due to the penalty in the denominator \(\left( {\mathop {\sum}\nolimits_{i \in R} {l_i} } \right)^\theta\). A larger θ implies a stronger penalty and hence is more likely to detect smaller signal segments. In particular, when θ equals 1, \(\left| {Q\left( R \right)} \right|\) will attain a local maximum with \(R\) containing only one variant. A reasonable range for θ is between 0 and 1. In practice, it is important to choose the “best” θ adaptively from the data. We used the proportion of genetic covariance of the identified regions as the metric. We varied the value of θ in the candidate set, and chose the best θ such that the corresponding identified regions have the largest genetic covariance. In general, one can use any subset of [0, 1] as the candidate set of θ values. However, extensively searching for θ substantially increases the computation time. In practice, we suggest that the set {0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7} is sufficient. Denote the regions detected by LOGODetect under parameter θ as \(\hat R_1^\theta , \ldots ,\hat R_m^\theta\). We denote their union as \(\hat {\cal{R}}^\theta\) and denote the genetic covariance in \(\hat {\cal{R}}^\theta\) as \(\rho \left( {\hat {\cal{R}}^\theta } \right)\). In theory, \(\rho \left( {\hat {\cal{R}}^\theta } \right) = \mathop {\sum}\nolimits_{i = 1}^m {\left| {\hat R_i^\theta \mathop { \cap } \cal{R}} \right|} \frac{{\rho _g}}{K}\), where \({\cal{R}}\) is the union of the true signal regions, \(\rho _{\mathrm{{g}}}\) is the global genetic covariance, and \(K = |{\cal{R}}|\) is the number of SNPs in \({\cal{R}}\). In practice, the set of true signal regions \({\cal{R}}\) is unknown, and \(\rho \left( {\hat {\cal{R}}^\theta } \right)\) can be estimated using the stratified-LDSC10. 
Let \(\pi \left( \theta \right) = \frac{{\rho \left( {\hat {\cal{R}}^\theta } \right)}}{{\rho _{\mathrm{{g}}}}}\) be the proportion of genetic covariance explained by \(\hat {\cal{R}}^\theta\) to the global genetic covariance. We assume that \(\rho \left( {\hat {\cal{R}}^\theta } \right) = 0\) and π(θ) = 0 if \(\hat {\cal{R}}^\theta = \emptyset\). We calculate \(\pi \left( \theta \right)\) for a candidate set of θ values, and then we determine θ adaptive to data via the following optimization problem:

$$\hat \theta = \arg \mathop {{\max }}\nolimits_\theta \left| {\pi \left( \theta \right)} \right|.$$
(6)
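The selection rule in Eq. (6) amounts to a small grid search. The sketch below assumes that the genetic covariance of the detected regions has already been estimated for each candidate θ (for example with stratified LDSC, which is not reproduced here); the function name `select_theta` and the dictionary input are illustrative assumptions.

```python
def select_theta(rho_by_theta, rho_global):
    """Pick the theta maximizing |pi(theta)| = |rho(R_hat^theta) / rho_g|.

    rho_by_theta: dict mapping each candidate theta to the estimated
        genetic covariance of its detected regions, or None when no
        region was detected (treated as 0, as described in the text).
    rho_global: estimated global genetic covariance rho_g (non-zero).
    """
    pi = {t: 0.0 if r is None else r / rho_global
          for t, r in rho_by_theta.items()}
    best = max(pi, key=lambda t: abs(pi[t]))
    return best, pi
```

For instance, with hypothetical estimates `{0.4: 0.02, 0.5: 0.05, 0.6: None}` and `rho_global = 0.1`, the rule selects θ = 0.5.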

Monte Carlo simulation of pseudo-z-score vectors

In order to simulate the null distribution of \(Q_{{\mathrm{{max}}}}\), we need to generate pseudo-z-score vectors. When two GWASs do not have sample overlap, it can be verified that

$${\mathrm{var}}\left[ {{\mathbf{z}}_1} \right] = \frac{1}{{n_1}}\left[ {\frac{{h_\beta ^2}}{M}{\mathbf{X}}^{\mathrm{{T}}}{\mathbf{XX}}^{\mathrm{{T}}}{\mathbf{X}} + \left( {1 - h_\beta ^2} \right){\mathbf{X}}^{\mathrm{{T}}}{\mathbf{X}}} \right],$$
(7)
$${\mathrm{cov}}\left[ {{\mathbf{z}}_1,{\mathbf{z}}_2} \right] = \frac{{\rho _{\mathrm{{g}}}}}{{\sqrt {n_1n_2} K}}{\mathbf{X}}^{\mathrm{{T}}}{\mathbf{X}}\widetilde {\mathbf{I}}_M{\mathbf{Z}}^{\mathrm{{T}}}{\mathbf{Z}}.$$
(8)

And similarly for \({\mathrm{var}}[{\mathbf{z}}_2]\). Therefore, under H0, the combined z-score vector

$$\left[ {\begin{array}{*{20}{c}} {{\mathbf{z}}_1} \\ {{\mathbf{z}}_2} \end{array}} \right] \sim {\mathrm{N}}\left( {\begin{array}{*{20}{c}} {\left[ {\begin{array}{*{20}{c}} 0 \\ 0 \end{array}} \right]} & , & {\left[ {\begin{array}{*{20}{c}} {\frac{1}{{n_1}}\left[ {\frac{{h_\beta ^2}}{M}{\mathbf{X}}^{\mathrm{{T}}}{\mathbf{XX}}^{\mathrm{{T}}}{\mathbf{X}} + \left( {1 - h_\beta ^2} \right){\mathbf{X}}^{\mathrm{{T}}}{\mathbf{X}}} \right]} & 0 \\ 0 & {\frac{1}{{n_2}}\left[ {\frac{{h_\gamma ^2}}{M}{\mathbf{Z}}^{\mathrm{{T}}}{\mathbf{ZZ}}^T{\mathbf{Z}} + \left( {1 - h_\gamma ^2} \right){\mathbf{Z}}^{\mathrm{{T}}}{\mathbf{Z}}} \right]} \end{array}} \right]} \end{array}} \right),$$

asymptotically. In practice, individual-level genotype data are hard to obtain due to privacy concerns, so it is useful to base the analysis only on summary statistics. Using a reference panel (e.g. the 1000 Genomes Project Phase 3 data30), \(\frac{1}{{n_1}}{\mathbf{X}}^{\mathrm{{T}}}{\mathbf{X}}\) and \(\frac{1}{{n_2}}{\mathbf{Z}}^{\mathrm{{T}}}{\mathbf{Z}}\) can be estimated as V, and \(\frac{1}{{n_1^2}}{\mathbf{X}}^{\mathrm{{T}}}{\mathbf{XX}}^{\mathrm{{T}}}{\mathbf{X}}\) and \(\frac{1}{{n_2^2}}{\mathbf{Z}}^{\mathrm{{T}}}{\mathbf{ZZ}}^{\mathrm{{T}}}{\mathbf{Z}}\) can be estimated as \(\widetilde {{\mathbf{V}}^2} = \frac{{n - 1}}{{n - 2}}{\mathbf{V}}^2 - \frac{M}{{n - 2}}{\mathbf{V}}\), where n is the sample size of the reference panel and V is the LD matrix of the reference panel. The heritability of the two traits, \(h_\beta ^2,h_\gamma ^2\), can be estimated through LDSC80. After plugging in the reference LD matrix, we have

$$\left[ {\begin{array}{*{20}{c}} {{\mathbf{z}}_1} \\ {{\mathbf{z}}_2} \end{array}} \right] \sim {\mathrm{N}}\left( {\begin{array}{*{20}{c}} {\left[ {\begin{array}{*{20}{c}} 0 \\ 0 \end{array}} \right]\quad ,} & {\left[ {\begin{array}{*{20}{c}} {\frac{{n_1h_\beta ^2}}{M}\widetilde {{\mathbf{V}}^2} + \left( {1 - h_\beta ^2} \right){\mathbf{V}}} & 0 \\ 0 & {\frac{{n_2h_\gamma ^2}}{M}\widetilde {{\mathbf{V}}^2} + \left( {1 - h_\gamma ^2} \right){\mathbf{V}}} \end{array}} \right]} \end{array}} \right),$$

asymptotically under the null.
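To make the plug-in step concrete, the following sketch builds one diagonal block of the null covariance from a reference LD matrix. The function name `null_covariance` is an assumption, and V is taken here to be the full M × M LD matrix of the SNPs under consideration.

```python
import numpy as np

def null_covariance(V, n_gwas, h2, M, n_ref):
    """Null covariance of one trait's z-scores from a reference LD matrix V.

    Uses the corrected estimate of the squared LD matrix,
    V2_tilde = (n - 1)/(n - 2) * V @ V - M/(n - 2) * V, with n = n_ref,
    and returns n_gwas * h2 / M * V2_tilde + (1 - h2) * V.
    """
    V2_tilde = (n_ref - 1) / (n_ref - 2) * (V @ V) - M / (n_ref - 2) * V
    return n_gwas * h2 / M * V2_tilde + (1.0 - h2) * V
```

Calling this once with \((n_1, h_\beta ^2)\) and once with \((n_2, h_\gamma ^2)\) gives the two diagonal blocks; the off-diagonal blocks are zero under the null when the two GWASs share no samples.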

The random multivariate normal vectors have a complex covariance structure, which is computationally challenging as the dimension of the vector can be as high as \(10^7\) in GWAS. We developed a computationally tractable method that leverages the LD structure in the genome. First, we split the high-dimensional vector z into sub-vectors \({\mathbf{z}} = \left[ {{\mathbf{z}}_{\left( 1 \right)},{\mathbf{z}}_{\left( 2 \right)}, \ldots ,{\mathbf{z}}_{(m)}} \right]\). These sub-vectors are defined by genomic position, each spanning a 1 MB genome block, i.e. chr1: 0–1 MB, chr1: 1–2 MB, etc. We denote the variance matrix of z as Σ, which can be written in block matrix form. Denote \({\mathbf{{\Sigma} }}_{i,j} = {\mathrm{cov}}[{\mathbf{z}}_{\left( i \right)},{\mathbf{z}}_{(j)}]\) as the submatrix of Σ, with rows indexed by the \(i\)th block \({\mathbf{z}}_{\left( i \right)}\) and columns indexed by the \(j\)th block \({\mathbf{z}}_{\left( j \right)}\). Then we use a block-wise tridiagonal matrix to approximate Σ by shrinking \({\mathbf{{\Sigma} }}_{i,j}\) to 0 if \(\left| {i - j} \right| \ge 2\). This approximation is reasonable in the context of GWAS since SNPs should be independent if they are physically far apart. Then, we can use an iterative approach to generate each block \({\mathbf{z}}_{\left( i \right)}\) by conditioning on the previous block \({\mathbf{z}}_{\left( {i - 1} \right)}\) via the conditional normal distribution:

$$\left( {{\mathbf{z}}_{\left( i \right)}{\mathrm{|}}{\mathbf{z}}_{\left( {i - 1} \right)} = {\mathbf{a}}} \right) \sim {\mathrm{N}}\left( {{\mathbf{{\Sigma} }}_{i,i - 1}{\mathbf{{\Sigma} }}_{i - 1,i - 1}^{ - 1}{\mathbf{a}},\;{\mathbf{{\Sigma} }}_{i,i} - {\mathbf{{\Sigma} }}_{i,i - 1}{\mathbf{{\Sigma} }}_{i - 1,i - 1}^{ - 1}{\mathbf{{\Sigma} }}_{i - 1,i}} \right).$$

In practice, \({\mathbf{{\Sigma} }}_{i,i}\) may be rank deficient and therefore not invertible. We adopt the truncated singular value decomposition (TSVD) method84 and use the top q singular values and their corresponding singular vectors to calculate the inverse matrix. For numerical stability, we choose q to be as large as possible such that the condition number is below 1000 (ref. 85). Finally, we standardize each pseudo z-score vector so that it has the same mean and variance as the z-score vector in real data.
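The block-wise sampler can be sketched as below. This is an illustration under assumptions rather than the authors' code: `tsvd_inverse` and `sample_blocks` are hypothetical names, the conditional covariance is taken as \({\mathbf{\Sigma}}_{i,i} - {\mathbf{\Sigma}}_{i,i-1}{\mathbf{\Sigma}}_{i-1,i-1}^{-1}{\mathbf{\Sigma}}_{i-1,i}\) (the standard conditional normal result), and the final standardization to the mean and variance of the observed z-scores is left out.

```python
import numpy as np

def tsvd_inverse(A, max_cond=1000.0):
    """Truncated-SVD pseudo-inverse of a symmetric PSD matrix A.

    Keeps the largest singular values whose ratio to the top singular
    value stays below `max_cond` (the condition-number cap)."""
    U, s, Vt = np.linalg.svd(A)
    keep = s > s[0] / max_cond
    return (Vt[keep].T / s[keep]) @ U[:, keep].T

def sample_blocks(diag_blocks, off_blocks, rng):
    """Draw z = [z_(1), ..., z_(m)] under a block-tridiagonal covariance.

    diag_blocks[i] is Sigma_{i,i}; off_blocks[i] is Sigma_{i+1,i}.
    Each block is drawn conditionally on the previous block only."""
    first = diag_blocks[0]
    z = [rng.multivariate_normal(np.zeros(len(first)), first)]
    for i in range(1, len(diag_blocks)):
        B = off_blocks[i - 1] @ tsvd_inverse(diag_blocks[i - 1])
        mean = B @ z[-1]
        cov = diag_blocks[i] - B @ off_blocks[i - 1].T
        cov = (cov + cov.T) / 2.0          # remove numerical asymmetry
        z.append(rng.multivariate_normal(mean, cov))
    return np.concatenate(z)
```

Because only one pair of adjacent blocks is handled at a time, memory use scales with the block size rather than with the genome-wide number of SNPs.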

Application to binary traits

So far, we have based the derivation on the setting in which both input traits are continuous. This is a common approach to introducing genetic correlation methodology10,11. However, most genetic correlation methods, including LOGODetect, can be directly applied to GWAS summary statistics of binary outcomes10,11. It is known that under the liability threshold model, the following formulas hold10:

$$h_{\beta ,{\mathrm{{obs}}}}^2 = \frac{{h_\beta ^2\phi \left( {\tau _1} \right)^2S_1\left( {1 - S_1} \right)}}{{P_1^2\left( {1 - P_1} \right)^2}},$$
(9)
$$\rho _{{\mathrm{{g,obs}}}} = \rho _{\mathrm{{g}}}\frac{{\sqrt {\phi ^2\left( {\tau _1} \right)\phi ^2\left( {\tau _2} \right)S_1\left( {1 - S_1} \right)S_2(1 - S_2)} }}{{P_1\left( {1 - P_1} \right)P_2(1 - P_2)}},$$
(10)

where \(h_{\beta ,{\mathrm{{obs}}}}^2\) and \(\rho _{{\mathrm{{g,obs}}}}\) denote heritability and genetic covariance on the observed scale, respectively; P1 and P2 denote the population prevalence of the two traits; S1 and S2 denote the sample prevalence of the two traits; \(\tau _1 = {\mathrm{{\Phi} }}^{ - 1}\left( {1 - P_1} \right),\quad \tau _2 = {\mathrm{{\Phi} }}^{ - 1}\left( {1 - P_2} \right)\) are the liability thresholds; and \(\phi\) and \({\mathrm{{\Phi} }}\) denote the standard normal density and its cumulative distribution function, respectively. When applying LOGODetect to binary traits, we replace \(h_\beta ^2,h_\gamma ^2\) (i.e., heritability on the liability scale) with \(h_{\beta ,{\mathrm{{obs}}}}^2,h_{\gamma ,{\mathrm{{obs}}}}^2\) (i.e., heritability on the observed scale).
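Eqs. (9) and (10) can be evaluated with the Python standard library alone. In this sketch the function names are hypothetical, and the liability threshold is assumed to be placed at the population prevalence, \(\tau = {\mathrm{\Phi}}^{-1}(1 - P)\), which is the usual liability-threshold convention.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def _density_at_threshold(pop_prev):
    """phi(tau) with tau = Phi^{-1}(1 - population prevalence)."""
    return _STD_NORMAL.pdf(_STD_NORMAL.inv_cdf(1.0 - pop_prev))

def h2_observed(h2_liab, pop_prev, samp_prev):
    """Observed-scale heritability from liability-scale heritability."""
    phi = _density_at_threshold(pop_prev)
    return (h2_liab * phi ** 2 * samp_prev * (1 - samp_prev)
            / (pop_prev ** 2 * (1 - pop_prev) ** 2))

def rho_observed(rho_liab, pop1, samp1, pop2, samp2):
    """Observed-scale genetic covariance from its liability-scale value."""
    num = (_density_at_threshold(pop1) * _density_at_threshold(pop2)
           * sqrt(samp1 * (1 - samp1) * samp2 * (1 - samp2)))
    den = pop1 * (1 - pop1) * pop2 * (1 - pop2)
    return rho_liab * num / den
```

When both prevalences equal 0.5 the conversion factor is 2/π for either quantity, which gives a quick sanity check of an implementation.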

Extension for sample overlaps

Suppose there are \(n_{\mathrm{{s}}}\) shared samples in the two GWASs. Then the linear models can be restated as

$$\begin{array}{l}\left[ {\begin{array}{*{20}{c}} {{\mathbf{y}}_{1,{\mathrm{{ns}}}}} \\ {{\mathbf{y}}_{1,{\mathrm{{s}}}}} \end{array}} \right] = \left[ {\begin{array}{*{20}{c}} {{\mathbf{X}}_{\mathrm{{{ns}}}}} \\ {{\mathbf{X}}_{\mathrm{{s}}}} \end{array}} \right]{\mathbf{\beta }} + \left[ {\begin{array}{*{20}{c}} {{\mathbf{\epsilon }}_{{\mathrm{{ns}}}}} \\ {{\mathbf{\epsilon }}_{\mathrm{{s}}}} \end{array}} \right],\\ \left[ {\begin{array}{*{20}{c}} {{\mathbf{y}}_{2,{\mathrm{{ns}}}}} \\ {{\mathbf{y}}_{2,{\mathrm{{s}}}}} \end{array}} \right] = \left[ {\begin{array}{*{20}{c}} {{\mathbf{Z}}_{{\mathrm{{ns}}}}} \\ {{\mathbf{Z}}_{\mathrm{{s}}}} \end{array}} \right]{\mathbf{\gamma }} + \left[ {\begin{array}{*{20}{c}} {{\mathbf{\delta }}_{{\mathrm{{ns}}}}} \\ {{\mathbf{\delta }}_{\mathrm{{s}}}} \end{array}} \right],\end{array}$$
(11)

where \(\left[ {\begin{array}{*{20}{c}} {{\mathbf{y}}_{1,{\mathrm{{ns}}}}} \\ {{\mathbf{y}}_{1,{\mathrm{{s}}}}} \end{array}} \right],\left[ {\begin{array}{*{20}{c}} {{\mathbf{y}}_{2,{\mathrm{{ns}}}}} \\ {{\mathbf{y}}_{2,{\mathrm{{s}}}}} \end{array}} \right]\) are the standardized phenotypes of all individuals in each GWAS. \(\left[ {\begin{array}{*{20}{c}} {{\mathbf{X}}_{{\mathrm{{ns}}}}} \\ {{\mathbf{X}}_{\mathrm{{s}}}} \end{array}} \right] = {\mathbf{X}},\left[ {\begin{array}{*{20}{c}} {{\mathbf{Z}}_{{\mathrm{{ns}}}}} \\ {{\mathbf{Z}}_{\mathrm{{s}}}} \end{array}} \right] = {\mathbf{Z}}\) are standardized genotypes of all individuals in each GWAS. \({\mathbf{\epsilon }}_{{\mathrm{{ns}}}},{\mathbf{\epsilon }}_{\mathrm{{s}}},{\mathbf{\delta }}_{{\mathrm{{ns}}}},{\mathbf{\delta }}_{\mathrm{{s}}}\) are the non-genetic effects where \({\mathrm{cov}}\left[ {{\mathbf{\epsilon }}_{\mathrm{{s}}},{\mathbf{\delta }}_{\mathrm{{s}}}} \right] = \rho _{\mathrm{{e}}}{\mathbf{I}}_{n_{\mathrm{{s}}}}\). It can be shown that

$${\mathrm{cov}}\left[ {{\mathbf{z}}_1,{\mathbf{z}}_2} \right] = \frac{{\rho _{\mathrm{{g}}}}}{{\sqrt {n_1n_2} K}}{\mathbf{X}}^{\mathrm{{T}}}{\mathbf{X}}\tilde I_{\mathrm{{M}}}{\mathbf{Z}}^{\mathrm{{T}}}{\mathbf{Z}} + \frac{{\rho _{\mathrm{{e}}}}}{{\sqrt {n_1n_2} }}{\mathbf{X}}_{\mathrm{{s}}}^{\mathrm{{T}}}{\mathbf{Z}}_{\mathrm{{s}}},$$
(12)

The forms of \({\mathrm{var}}\left[ {{\mathbf{z}}_1} \right]\) and \({\mathrm{var}}\left[ {{\mathbf{z}}_2} \right]\) are the same as in the setting without sample overlap. Using a reference panel, \(\frac{1}{{n_{\mathrm{{s}}}}}{\mathbf{X}}_{\mathrm{{s}}}^{\mathrm{{T}}}{\mathbf{Z}}_{\mathrm{{s}}}\) can be replaced by V. Therefore, under \(H_0\), the combined z-score vector asymptotically follows the multivariate normal distribution

$$\left[ {\begin{array}{*{20}{c}} {{\mathbf{z}}_1} \\ {{\mathbf{z}}_2} \end{array}} \right] \sim {\mathrm{N}}\left( {\begin{array}{*{20}{c}} {\left[ {\begin{array}{*{20}{c}} 0 \\ 0 \end{array}} \right]} & , & {\left[ {\begin{array}{*{20}{c}} {\frac{{n_1h_\beta ^2}}{M}\widetilde {{\mathbf{V}}^2} + \left( {1 - h_\beta ^2} \right){\mathbf{V}}} & {\frac{{\rho _en_s}}{{\sqrt {n_1n_2} }}{\mathbf{V}}} \\ {\frac{{\rho _en_s}}{{\sqrt {n_1n_2} }}{\mathbf{V}}} & {\frac{{n_2h_\gamma ^2}}{M}\widetilde {{\mathbf{V}}^2} + \left( {1 - h_\gamma ^2} \right){\mathbf{V}}} \end{array}} \right]} \end{array}} \right)$$

Note that the variance matrix can be split into two terms as

$$\begin{array}{l}{\mathrm{Var}}\left[ {\begin{array}{*{20}{c}} {{\mathbf{z}}_1} \\ {{\mathbf{z}}_2} \end{array}} \right] = \left[ {\begin{array}{*{20}{c}} {\frac{{n_1h_\beta ^2}}{M}\widetilde {{\mathbf{V}}^2} + \left( {1 - h_\beta ^2} \right){\mathbf{V}}} & {\frac{{\rho _en_{\mathrm{{s}}}}}{{\sqrt {n_1n_2} }}{\mathbf{V}}} \\ {\frac{{\rho _en_{\mathrm{{s}}}}}{{\sqrt {n_1n_2} }}{\mathbf{V}}} & {\frac{{n_2h_\gamma ^2}}{M}\widetilde {{\mathbf{V}}^2} + \left( {1 - h_\gamma ^2} \right){\mathbf{V}}} \end{array}} \right]\\ = \left[ {\begin{array}{*{20}{c}} {\frac{{n_1h_\beta ^2}}{M}\widetilde {{\mathbf{V}}^2} + \left( {1 - h_\beta ^2 - \frac{{\rho _en_s}}{{\sqrt {n_1n_2} }}} \right){\mathbf{V}}} & 0 \\ 0 & {\frac{{n_2h_\gamma ^2}}{M}\widetilde {{\mathbf{V}}^2} + \left( {1 - h_\gamma ^2 - \frac{{\rho _en_{\mathrm{{s}}}}}{{\sqrt {n_1n_2} }}} \right){\mathbf{V}}} \end{array}} \right] + \left[ {\begin{array}{*{20}{c}} {\frac{{\rho _en_s}}{{\sqrt {n_1n_2} }}{\mathbf{V}}} & {\frac{{\rho _en_{\mathrm{{s}}}}}{{\sqrt {n_1n_2} }}{\mathbf{V}}} \\ {\frac{{\rho _en_{\mathrm{{s}}}}}{{\sqrt {n_1n_2} }}{\mathbf{V}}} & {\frac{{\rho _en_{\mathrm{{s}}}}}{{\sqrt {n_1n_2} }}{\mathbf{V}}} \end{array}} \right],\end{array}$$
(13)

if \(\rho _en_{\mathrm{{s}}}\) is positive, and can be split into two terms as

$${\mathrm{var}}\left[ {\begin{array}{*{20}{c}} {{\mathbf{z}}_1} \\ {{\mathbf{z}}_2} \end{array}} \right] = \left[ {\begin{array}{*{20}{c}} {\frac{{n_1h_\beta ^2}}{M}\widetilde {{\mathbf{V}}^2} + \left( {1 - h_\beta ^2 + \frac{{\rho _en_{\mathrm{{s}}}}}{{\sqrt {n_1n_2} }}} \right){\mathbf{V}}} & 0 \\ 0 & {\frac{{n_2h_\gamma ^2}}{M}\widetilde {{\mathbf{V}}^2} + \left( {1 - h_\gamma ^2 + \frac{{\rho _en_{\mathrm{{s}}}}}{{\sqrt {n_1n_2} }}} \right){\mathbf{V}}} \end{array}} \right] + \left[ {\begin{array}{*{20}{c}} { - \frac{{\rho _en_{\mathrm{{s}}}}}{{\sqrt {n_1n_2} }}{\mathbf{V}}} & {\frac{{\rho _en_{\mathrm{{s}}}}}{{\sqrt {n_1n_2} }}{\mathbf{V}}} \\ {\frac{{\rho _en_{\mathrm{{s}}}}}{{\sqrt {n_1n_2} }}{\mathbf{V}}} & { - \frac{{\rho _en_{\mathrm{{s}}}}}{{\sqrt {n_1n_2} }}{\mathbf{V}}} \end{array}} \right],$$
(14)

if \(\rho _en_{\mathrm{{s}}}\) is negative. We can independently simulate pseudo-samples from normal distributions with mean 0 and each of the two variance terms, respectively. Finally, by adding up the two vectors simulated with respect to the different variance terms, we get the pseudo z-score vector of interest. The parameters \(h_\beta ^2,h_\gamma ^2,\rho _e n_{\mathrm{{s}}}\) appearing in the z-score null distribution are not of direct interest, but we need their values for Monte Carlo sampling of \(\left[ {\begin{array}{*{20}{c}} {{\mathbf{z}}_1} \\ {{\mathbf{z}}_2} \end{array}} \right]\). We adopt LDSC10 to estimate them. Note that LDSC is based on a random-effect, random-design model, which differs from our fixed-design assumption, yet we believe this has little consequence.
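The two-term decomposition in Eqs. (13) and (14) can be sampled as follows. The sketch is illustrative only: `sample_overlap_null` is a hypothetical name, `A1` and `A2` stand for the diagonal blocks \(\frac{n_1 h_\beta^2}{M}\widetilde{{\mathbf{V}}^2} + (1 - h_\beta^2){\mathbf{V}}\) and its trait-2 analogue, and `c` stands for \(\rho_e n_{\mathrm{s}}/\sqrt{n_1 n_2}\). Both signs of `c` are handled by drawing one shared component with covariance |c|V and flipping its sign for the second trait when `c` is negative.

```python
import numpy as np

def sample_overlap_null(A1, A2, V, c, rng):
    """Draw (z1, z2) with var[z1] = A1, var[z2] = A2, cov[z1, z2] = c * V.

    Independent part: block-diagonal with blocks A1 - |c| V and A2 - |c| V
    (assumed positive semi-definite). Shared part: one draw with
    covariance |c| V, added to z1 and added (c >= 0) or subtracted
    (c < 0) for z2."""
    zero = np.zeros(V.shape[0])
    ind1 = rng.multivariate_normal(zero, A1 - abs(c) * V)
    ind2 = rng.multivariate_normal(zero, A2 - abs(c) * V)
    shared = rng.multivariate_normal(zero, abs(c) * V)
    sign = 1.0 if c >= 0 else -1.0
    return ind1 + shared, ind2 + sign * shared
```

For large blocks, the dense `multivariate_normal` calls would in practice be replaced by the block-wise conditional sampler described in the preceding section.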

Genome partition and FDR control

We separated the genome into 204 LD blocks using ldetect86. Each LD block spans 15 MB on average. We applied LOGODetect to each LD block separately and identified the local regions with p-value < 0.05 under a family-wise type I error control. We aggregated all the candidate regions across different LD blocks, and applied Benjamini–Hochberg procedure87 to control FDR with a cutoff of 0.05, accounting for the multiple testing problem concerning all LD blocks.
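For reference, the Benjamini–Hochberg step-up procedure applied to the aggregated candidate regions can be written in a few lines. The function name is an assumption; the input is one p-value per candidate region.

```python
import numpy as np

def benjamini_hochberg(pvalues, alpha=0.05):
    """Boolean mask of hypotheses rejected by the BH step-up procedure."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Compare the k-th smallest p-value with alpha * k / m.
    passed = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])   # largest rank passing its threshold
        reject[order[: k + 1]] = True
    return reject
```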

Computation time

The major computation step in LOGODetect is to compute the maximal scan statistic in real data and in Monte Carlo samples. The computation time depends on the number of SNPs in GWAS. For a typical GWAS with 6 million SNPs, it takes about 12 h on a 2.5GHz cluster with 22 computation cores.

Simulation settings

Based on 503 individuals with European ancestry from the 1000 Genomes Project Phase 3 data, we simulated genotype data for 100,000 individuals with minor allele frequency (MAF) > 5% on chromosome 1 using HAPGEN229. 336,532 variants remained in the dataset after removing strand-ambiguous SNPs. Samples were randomly divided into two subsets with equal sample size, each with 50,000 individuals. We used each subset to simulate the phenotype data.

First, we performed simulations under the null hypothesis to see whether our approach would produce false positive findings. We first followed the infinitesimal model, in which all normalized SNPs have the same effect-size variance, and the per-normalized-SNP genetic effect was drawn from a normal distribution \({\mathrm{N}}(0,\frac{{h^2}}{{336,532}})\) for both traits. Next, to more realistically model a polygenic genetic architecture with different levels of genetic effects, we attributed 30% of the trait heritability to 5000 randomly chosen SNPs, while the remaining SNPs explained 70% of the trait heritability. The per-SNP genetic effect was drawn from a normal distribution \({\mathrm{N}}(0,0.3* \frac{{h^2}}{{5000}})\) for SNPs with high heritability enrichment, and from \({\mathrm{N}}(0,0.7* \frac{{h^2}}{{331,532}})\) for SNPs with low heritability enrichment. The trait heritability \(h^2\) was set to vary from 0.01 to 0.05 in each scenario. Note that a heritability value of 0.01 or 0.05 on chromosome 1 approximately corresponds to a heritability value of 0.12 or 0.60 in the whole genome, which are realistic values for typical GWAS traits. Each simulation setting was repeated 100 times.

Next, we performed simulations to assess the statistical power under a heritability enrichment model. We randomly selected N = 5 segments, each containing L = 1000 SNPs, as the signal regions shared between two traits. We attributed \(p = 0.3\) of trait heritability to the signal regions. The genetic effect size for the SNPs in the signal regions follows a multivariate normal distribution

$$\left[ {\begin{array}{*{20}{c}} {\beta _i} \\ {\gamma _i} \end{array}} \right] \sim {\mathrm{N}}\left( {\begin{array}{*{20}{c}} {\left[ {\begin{array}{*{20}{c}} 0 \\ 0 \end{array}} \right]} & , & {\left[ {\begin{array}{*{20}{c}} {\frac{{ph^2}}{{NL}}} & {\frac{{ph^2\rho }}{{NL}}} \\ {\frac{{ph^2\rho }}{{NL}}} & {\frac{{ph^2}}{{NL}}} \end{array}} \right]} \end{array}} \right).$$

The genetic effect size for the SNPs outside the signal regions follows a different multivariate normal distribution without local genetic correlation.

$$\left[ {\begin{array}{*{20}{c}} {\beta _i} \\ {\gamma _i} \end{array}} \right] \sim {\mathrm{N}}\left( {\begin{array}{*{20}{c}} {\left[ {\begin{array}{*{20}{c}} 0 \\ 0 \end{array}} \right]} & , & {\left[ {\begin{array}{*{20}{c}} {\frac{{\left( {1 - p} \right)h^2}}{{336532 - NL}}} & 0 \\ 0 & {\frac{{\left( {1 - p} \right)h^2}}{{336532 - NL}}} \end{array}} \right]} \end{array}} \right).$$

The trait heritability \(h^2\) was set to vary from 0.01 to 0.05, and the correlation of genetic effect size of two traits ρ was set to 0.9. Each simulation setting was repeated 100 times. Further simulation settings are described in detail in the Supplementary Notes.
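The effect-size draws of the power simulation can be sketched as follows. This is not the simulation code used in the study: the function name is hypothetical, and segments are placed on a grid of L-SNP slots, an assumption made here only to guarantee that they do not overlap.

```python
import numpy as np

def simulate_power_effects(M, N, L, p, h2, rho, rng):
    """Draw (beta, gamma) for M SNPs with N signal segments of L SNPs each.

    Signal SNPs share a fraction p of h2 and have effect correlation rho;
    the remaining SNPs share (1 - p) * h2 and are uncorrelated."""
    slots = rng.choice(M // L, size=N, replace=False)
    signal = np.concatenate([np.arange(s * L, (s + 1) * L)
                             for s in np.sort(slots)])
    var_bg = (1 - p) * h2 / (M - N * L)
    beta = rng.normal(0.0, np.sqrt(var_bg), M)
    gamma = rng.normal(0.0, np.sqrt(var_bg), M)
    var_sig = p * h2 / (N * L)
    corr = np.array([[1.0, rho], [rho, 1.0]])
    # Draw unit-variance correlated pairs, then rescale to var_sig.
    draws = np.sqrt(var_sig) * rng.multivariate_normal(np.zeros(2), corr,
                                                       size=N * L)
    beta[signal], gamma[signal] = draws[:, 0], draws[:, 1]
    return beta, gamma, signal
```

Phenotypes would then follow from the linear model, e.g. `y1 = X @ beta + eps` with non-genetic variance \(1 - h^2\), using the genotypes of the corresponding sample subset.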

We adjusted the significance cutoff of different approaches to achieve the same type I error. For coloc and gwas-pw, in those heritability settings with empirical type I error >0.05, we increased the cutoff of the posterior probabilities so that the empirical type I error is controlled at 0.05.

Evaluate model performance

We use three different metrics to quantify the performance of our approach. Denote the true signal segments as \(R_1, \ldots ,R_J\), and the segments detected by LOGODetect as \(\hat R_1, \ldots ,\hat R_K\). We define the signal points detection rate as the number of true signal SNPs detected by LOGODetect divided by the number of true signal SNPs, that is \(\frac{{\mathop {\sum }\nolimits_{j = 1}^J \left| {R_j \cap (\mathop {\cup}\nolimits_{k = 1}^K {\hat R_k} )} \right|}}{{\mathop {\sum }\nolimits_{j = 1}^J \left| {R_j} \right|}}\). Similarly, we define the signal segments detection rate as the number of true signal segments detected by LOGODetect divided by the number of true signal segments, namely \(\frac{{\mathop {\sum }\nolimits_{j = 1}^J I\left\{ {R_j \cap \left( {\mathop {\cup}\nolimits_{k = 1}^K {\hat R_k} } \right) \ne \emptyset } \right\}}}{J}\), where a true signal segment counts as detected if it overlaps with at least one detected segment. The signal points detection rate and the signal segments detection rate measure the sensitivity at the SNP level and the segment level, respectively. To take the extent of the overlap into consideration, we also followed ref. 88 to define \(S( {R_j} )\), the G-score with respect to a signal region \(R_j\), as \(\mathop {{\max }}\limits_{1 \le k \le K} \frac{{\left| {\hat R_k\mathop { \cap }R_j} \right|}}{{\sqrt {\left| {\hat R_k} \right|\left| {R_j} \right|} }}\), and further define the G-score measure as \(\frac{1}{J}\mathop {\sum}\nolimits_{j = 1}^J {S(R_j)}\). The G-score aims to measure the specificity and sensitivity together. The three metrics were also applied to quantify ρ-HESS, coloc, and gwas-pw.
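The three metrics can be computed directly from interval endpoints. In this sketch, segments are represented as half-open SNP-index intervals `(start, end)` and detected segments are assumed not to overlap each other; both the representation and the function names are assumptions for illustration.

```python
def _overlap(a, b):
    """Number of SNP indices shared by half-open intervals a and b."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def evaluate(true_segs, detected_segs):
    """Return (signal points detection rate, signal segments detection
    rate, G-score) for lists of (start, end) intervals."""
    J = len(true_segs)
    hit_snps = sum(_overlap(r, d) for r in true_segs for d in detected_segs)
    point_rate = hit_snps / sum(r[1] - r[0] for r in true_segs)
    segment_rate = sum(
        any(_overlap(r, d) > 0 for d in detected_segs) for r in true_segs
    ) / J
    g_score = sum(
        max((_overlap(r, d) / ((d[1] - d[0]) * (r[1] - r[0])) ** 0.5
             for d in detected_segs), default=0.0)
        for r in true_segs
    ) / J
    return point_rate, segment_rate, g_score
```

For example, with true segments `[(0, 100), (200, 300)]` and a single detected segment `(50, 150)`, the three metrics are 0.25, 0.5, and 0.25.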

Implementation of different methods

We used ldetect86 to pre-specify 1703 approximately LD-independent blocks (spanning 1.6 Mb on average) as candidate genomic regions, as suggested by ρ-HESS and gwas-pw. We also used these LD-independent blocks as candidate genomic regions for coloc. In simulation studies, we used 133 approximately LD-independent regions in chromosome 1 as the pre-specified genomic regions for ρ-HESS, coloc, and gwas-pw. For ρ-HESS, the 1000 Genomes Project Phase 3 data30 was used as the reference panel, the number of eigenvectors used in the truncated-SVD for LD matrix inversion is determined as 50 by default, and the minimum eigenvalue cut off in truncated-SVD is determined as 1.0 by default, as suggested by the ρ-HESS software (https://huwenboshi.github.io/hess/). ρ-HESS reported the estimate and significance of local genetic correlation for each candidate genomic region, and we applied Benjamini–Hochberg procedure87 to control FDR with a cutoff of 0.05, accounting for the multiple testing problem concerning all genomic regions. Coloc (https://CRAN.R-project.org/package=coloc) and gwas-pw (https://github.com/joepickrell/gwas-pw) estimated the posterior probability that two traits shared at least one causal SNP for each genomic region, and those genomic regions with posterior probability above 0.95 are determined as identified regions.

We used LDSC (https://github.com/bulik/ldsc) to estimate heritability in each chromosome. Stratified-LDSC was used to estimate genetic covariance of the identified regions. In detail, we manually created two annotations: the identified regions and the remaining genome, then we ran the standard LDSC software to calculate the genetic covariance and the proportion of genetic covariance of each annotation. For both LDSC and stratified-LDSC, LD scores were computed with the standard LDSC software from 503 individuals with European ancestry from the 1000 Genomes Project Phase 3 data. Both methods were applied with an unconstrained intercept, using all SNPs as observations in the dependent variable and LD scores as regression weights.

Application of LOGODetect to seven neuropsychiatric traits

We applied LOGODetect to seven neuropsychiatric traits. The European ancestry genotype data from the 1000 Genomes Project was used as the reference panel to estimate the LD matrix. For each GWAS dataset, indels and SNPs not present in the reference panel were removed. SNPs with MAF < 0.01 in the reference panel were also removed. Then for each trait pair, we filtered out all the strand-ambiguous SNPs and took the overlap of the remaining SNPs. For SNPs whose effect alleles were the same in the two GWASs, the original z-scores were used. For SNPs whose effect alleles were reversed in the two GWASs, we reversed the sign of the z-score in the second GWAS accordingly. Thus, the allele coding schemes between any two studies were consistent. Then we applied LOGODetect to perform the downstream analysis.
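The harmonization of a trait pair can be sketched as follows. The data layout (a dictionary from SNP ID to effect allele, other allele, and z-score) and the function names are assumptions; SNPs whose alleles match in neither orientation are simply dropped in this sketch, and filtering against the reference panel (presence and MAF) is not shown.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def is_strand_ambiguous(a1, a2):
    """True for A/T and C/G SNPs, whose strand cannot be resolved."""
    return COMPLEMENT.get(a1) == a2

def harmonize(gwas1, gwas2):
    """Align z-scores of two GWASs to the effect allele of the first.

    gwasX maps SNP id -> (effect_allele, other_allele, z).
    Returns SNP id -> (z1, z2) for the shared, unambiguous SNPs."""
    out = {}
    for snp in gwas1.keys() & gwas2.keys():
        a1, b1, z1 = gwas1[snp]
        a2, b2, z2 = gwas2[snp]
        if len(a1) != 1 or len(b1) != 1:        # skip indels
            continue
        if is_strand_ambiguous(a1, b1):
            continue
        if (a1, b1) == (a2, b2):
            out[snp] = (z1, z2)                 # same effect allele
        elif (a1, b1) == (b2, a2):
            out[snp] = (z1, -z2)                # reversed effect allele
    return out
```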

Enrichment analysis

We aggregated 227 non-overlapping segments identified by LOGODetect in seven neuropsychiatric traits and investigated whether these segments were enriched in predicted functional regions for a given tissue or cell type. Tissue or cell type-specific functional regions were defined using GenoSkyline-Plus annotations and dichotomized with a cutoff of 0.5. The dichotomized annotation is robust to the choice of cutoff because the raw GenoSkyline-Plus annotation scores follow a bimodal pattern. To assess the statistical significance of enrichment, we randomly selected 227 non-overlapping segments across the genome while matching their sizes with the detected segments, and calculated the overlaps with GenoSkyline-Plus annotations. We repeated the permutation procedure 100,000 times to evaluate the significance of the observed overlap.
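The permutation procedure can be illustrated with the following Python sketch. It makes several simplifying assumptions that are not part of the analysis above: the genome is treated as a single linear coordinate system, intervals are half-open (start, end) pairs, random placement uses rejection sampling, and the empirical p-value uses the common (hits + 1) / (permutations + 1) convention. All function names are hypothetical.

```python
import random


def overlap_bp(segments, annotation):
    """Base pairs shared by two lists of half-open (start, end) intervals.

    Assumes the intervals within each list do not overlap one another.
    """
    total = 0
    for s, e in segments:
        for a, b in annotation:
            total += max(0, min(e, b) - max(s, a))
    return total


def random_segments(sizes, genome_length, rng, max_tries=10_000):
    """Place one segment per size at random, rejecting overlapping placements."""
    placed = []
    for size in sizes:
        for _ in range(max_tries):
            start = rng.randrange(0, genome_length - size + 1)
            cand = (start, start + size)
            if all(cand[1] <= s or cand[0] >= e for s, e in placed):
                placed.append(cand)
                break
        else:
            raise RuntimeError("could not place a segment without overlap")
    return placed


def permutation_p_value(segments, annotation, genome_length, n_perm=1000, seed=0):
    """Empirical p-value for the observed overlap under size-matched placement."""
    rng = random.Random(seed)
    sizes = [e - s for s, e in segments]
    observed = overlap_bp(segments, annotation)
    hits = sum(
        overlap_bp(random_segments(sizes, genome_length, rng), annotation) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)
```

A genome-wide version would additionally need to respect chromosome boundaries, and the nested loop in `overlap_bp` would typically be replaced by an interval tree or a sorted sweep when run 100,000 times.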

We also assessed whether the detected regions were enriched in non-brain tissue types after adjusting for the overlap of brain and non-brain annotations. Specifically, for the pancreatic islets cell type annotation, we removed the annotations that overlap with any of the eight significant brain cell type annotations to define the conditional annotation of pancreatic islets. The same procedure was used to define the conditional annotation of mononuclear cells from peripheral blood. Afterwards, permutation tests were performed on these two conditional annotations. We also performed conditional analysis on six generic annotations from Finucane et al.52, namely coding regions, enhancers, introns, promoters, 5′UTRs and 3′UTRs (each extended by a 500-bp window), by removing the regions of overlap between each generic annotation and the brain tissue-specific annotations (merged from the eight significant brain cell type annotations). We used permutation tests to assess the statistical significance of enrichment in the conditional analyses.
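One way to construct such a conditional annotation is to subtract the brain annotation from the target annotation at base-pair resolution, as in the sketch below. Whether overlapping elements are trimmed (as here) or discarded entirely is an assumption of this example, as are the function name and the use of half-open (start, end) intervals on a single chromosome.

```python
def subtract_intervals(target, mask):
    """Return the parts of `target` not covered by any interval in `mask`.

    Intervals are half-open (start, end) pairs on one chromosome. Mask
    intervals may overlap each other, as can happen when several cell type
    annotations are merged.
    """
    result = []
    mask = sorted(mask)
    for start, end in sorted(target):
        cursor = start  # leftmost position not yet accounted for
        for a, b in mask:
            if b <= cursor:
                continue  # mask interval lies entirely to the left
            if a >= end:
                break  # mask interval lies entirely to the right
            if a > cursor:
                result.append((cursor, a))  # uncovered gap before the mask
            cursor = max(cursor, b)
            if cursor >= end:
                break
        if cursor < end:
            result.append((cursor, end))  # uncovered tail
    return result


# Toy example: one annotated element with two masked stretches inside it.
conditional = subtract_intervals([(0, 100)], [(10, 20), (50, 60)])
```

The resulting intervals can then be passed to the same permutation test that is applied to the unconditional annotations.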

Using GENCODE V33lift37 on the UCSC genome browser, we extracted 968 genes with recognized Ensembl IDs in the genomic regions found to harbor local genetic correlations among seven neuropsychiatric traits. We used FUMA53 to run the Gene Ontology enrichment analysis with these 968 genes.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.