Detecting local genetic correlations with scan statistics

Genetic correlation analysis has quickly gained popularity in the past few years and provided insights into the genetic etiology of numerous complex diseases. However, existing approaches oversimplify the shared genetic architecture between different phenotypes and cannot effectively identify precise genetic regions contributing to the genetic correlation. In this work, we introduce LOGODetect, a powerful and efficient statistical method to identify small genome segments harboring local genetic correlation signals. LOGODetect automatically identifies genetic regions showing consistent associations with multiple phenotypes through a scan statistic approach. It uses summary association statistics from genome-wide association studies (GWAS) as input and is robust to sample overlap between studies. Applied to seven phenotypically distinct but genetically correlated neuropsychiatric traits, we identify 227 non-overlapping genome regions associated with multiple traits, including multiple hub regions showing concordant effects on five or more traits. Our method addresses critical limitations in existing analytic strategies and may have wide applications in post-GWAS analysis.


Supplementary Note 1. Functional form of scan statistic
We have proposed the scan statistic as follows, to search for regions harboring local genetic correlation: .
As we mentioned in the main text, this is a generalization of the scan statistic proposed in Jeng et al 1 and Li et al 2 , which detects signal regions associated with a single trait. The scan procedure in Jeng et al. was based on the mean of the marginal test statistics in a candidate region. We refer to this as the mean scan procedure. The scan statistic for a given region is defined as , where | | denotes number of SNPs in region . The authors showed that the mean scan procedure is asymptotically optimal as long as SNPs are independent and the SNPs in the signal region have the same effect size. In our paper, we extended the mean scan framework to the context of detecting local genetic correlation. We replace | | = ∑ 1 ∈ in the mean scan procedure, with ∑ ∈ , to account for LD. Notably, under the case that all the SNPs have the same LD effect, e.g.
are the same for all SNPs, ∑ ∈ just reduces to | | up to a multiple constant. Following Li et al 2 , we relax the choice of power from 0.5 to a numeric in [0,1] to ensure that the scan statistic is comparable for different region sizes.
Second, the scan statistics we proposed has good genetics interpretation. It has been shown that under the polygenic model that per-SNP genetic covariance is the same for all SNPs, and if two GWASs are independent, then the following equation holds 3 : where the 1 , 2 represents the sample size for two GWASs, is the SNP count and is the (global) genetic covariance. If we let be 1 and be the whole genome, then the expectation of the scan statistic ( ) reduces to √ 1 2 , which is exactly the (global) genetic covariance multiplied by a constant. This implicates that when the candidate region is the whole genome, the scan statistic actually represents the strength of (global) genetic covariance. In our framework, we do not assume the polygenic model that per-SNP genetic covariance is the same for all SNPs across genome, but assume that genetic covariance is localized in some small genome regions. Therefore, we use the scan statistic in a local region, as a metric to detect significant local genetic sharing.

Supplementary Note 2. Simulation settings
Based on 503 individuals with European ancestry from the 1000 Genomes Project Phase 3 data 4 , we simulated genotype data for 100,000 individuals with minor allele frequency (MAF) greater than 5% on chromosome 1 using HAPGEN2 5 . 336,532 variants remained in the dataset after removing strand ambiguous SNPs. Samples were randomly divided into two subsets with equal sample size, each with 50,000 individuals. We used each subset to simulate the phenotype data. We note that this genotype simulation approach is consistent with the simulation setting in the ρ-HESS paper 6 .
First, we performed simulations under the null hypothesis. We simulated phenotype under different genetic architecture and evaluated whether our approach would produce false positive findings. Under the infinitesimal model, we assumed the effect size level of all the normalized SNPs are the same, and the per-normalized-SNP genetic effect was drawn from a normal distribution N(0, was false positive inflation with shared sample, we drew the effect size under the infinitesimal model as above, except that in this setting 25,000 samples were shared by two traits with environmental correlation 0.5. To study scenarios with different heritability, we fixed the heritability of the second trait at 0.02, and vary the heritability of the first trait. To simulate a binary trait, we first simulated the continuous liability following the infinitesimal model, then assigned the samples with liability greater than 50% quantile as cases and the rest as controls. We also considered scenarios with model mis-specification. Under the LDAK model 7 with MAF-dependent and LDdependent effect sizes, we drew the genetic effect of the -th SNP from a normal distribution N(0, ℎ 2 ) , where ℎ 2 ∝ [ * (1 − )] 0.75 * , is the minor allele frequency, and is the weight computed by the LDAK software. Under the infinitesimal model with t-distributed effects, per-SNP heritability for each variant was set to be the same as before, but the effect size of each SNP is proportional to tdistribution with 10 degrees of freedom. Under the non-infinitesimal model with sparse effects, we split up chromosome 1 into two parts (i.e., chr1:1-116,000,000 and chr1:116,000,001-249,143,646), each containing half of the SNPs. For the first trait, we randomly sampled 10,000 variants from the first half of chr1 as causal variants. Similarly, we randomly sampled 10000 variants from the second half of chr1 as causal variants for the second trait. Per-SNP heritability for all the causal variants were the same as previous simulations. The trait heritability ℎ 2 was set to vary from 0.01 to 0.05 in each scenario. Note that if we assume heritability is distributed proportionally in the genome, then the heritability values of 0.01 and 0.05 for chromosome 1 will correspond to the approximate heritability values of 0.12 and 0.60 for the whole genome, respectively, which are on the same scale of SNP heritability for various traits. Each simulation setting was repeated for 100 times.
Next, we performed simulations to assess the statistical power. Under the heritability enrichment model, we randomly selected = 5 segments, each containing = 1000 SNPs, as the signal regions shared between two traits. We attributed = 0.3 trait heritability to the signal regions. The genetic effect size for the SNPs in the signal regions follows a multivariate normal distribution .
The genetic effect size for the SNPs outside the signal regions follows a different multivariate normal distribution without local genetic correlation .
We evaluated the statistical power in different scenarios, with parameter ℎ 2 varying from 0.01 to 0.05, the correlation of genetic effect size of two traits varying from 0.2 to 0.8, proportion of heritability in signal regions varying from 0.1 to 0.5, and length of signal region varying from 250 to 2,000. To consider binary scenarios, we simulated two binary traits under the liability threshold model, where the continuous liabilities followed the same heritability enrichment model as previous simulations. We evaluated the statistical power in different scenarios, with total sample size fixed at 50,000 and sample prevalence varying from 0.01 to 0.5, and with effective sample size ( = 4 1 + 1 ) fixed at 10,000 and sample prevalence varying from 0.1 to 0.5. In addition, we evaluated statistical power under mis-specified models. To model LDdependent and MAF-dependent effect sizes, we simulated two phenotypes using the LDAK 7 framework. Under the non-infinitesimal model with sparse effects, we randomly sampled 10 causal regions for each trait, among which 5 causal regions were shared by both traits. Under the infinitesimal model with t-distributed effects, effect sizes for SNPs in signal regions were generated as follows: and genetic effect sizes for the SNPs outside the signal regions were generated as follows: where , , independently follow 10 distribution. The above genetic model guaranteed that the effect sizes are generated proportional to the t-distribution, and the variance explained by all SNPs is exactly equal to the trait heritability, ℎ 2 . We varied the heritability ℎ 2 as previous simulations. Each simulation setting was repeated for 100 times. Finally, we defined an identified region as false positive if its genome distance to the nearest true signal region is larger than 500 KB (to account for LD effect).
In simulation studies of ρ -HESS, we used the 133 approximately LD-independent regions 8 in chromosome 1 (1.6 Mb in width on average) as the pre-specified genomic regions, as recommended by the original paper of ρ-HESS 6 .
We adjusted the significance cutoff of different approaches to achieve the same type I error. For coloc and gwas-pw, in those heritability settings with empirical type I error greater than 0.05, we increased the cutoff of the posterior probabilities so that the empirical type I error is controlled at 0.05.

Supplementary Note 3. Interpretation of genetic covariance enrichment
First, we provide a clear definition for genetic covariance enrichment. Suppose the genetic covariance between two traits explained by additive genetic effects is , then the per-SNP genetic covariance is where is the toal number of SNPs in the analysis. For a given genomic region (or multiple regions which we consider as a SNP set) containing SNPs, we denote the local genetic covariance explained by SNPs within this genomic region as , , and the per-SNP genetic covariance within this genomic region is , . We define the genetic covariance enrichment of this given region to be , / . This term quantifies the ratio of per-SNP heritability within the given genomic region and that in the genome. Such a definition is conceptually similar to the heritability enrichment widely used in the field 9 and can be quantified using tools such as GNOVA 10 . If a region shows strong enrichment for genetic covariance, it means that SNPs in this region have more contributions to the global genetic covariance compared to randomly selected regions in the genome.
In addition, to demonstrate that genetic covariance enrichment is a fair and effective metric for comparing different approaches, we performed two simulations (i.e. under the alternative and null) and evaluated the genetic covariance enrichment respectively. First, we simulated two genetically correlated traits and applied four methods (i.e., LOGODetect, -HESS, gwas-pw, and coloc) to these two traits. We used 22 autosomes genotype data from the Wellcome Trust Case Control Consortium (WTCCC) cohort to perform simulations. We randomly selected = 100 segments, each containing = 100 SNPs, as the signal regions shared between two traits. We attributed = 0.3 trait heritability to the signal regions. The effect size correlation of shared signal SNPs, , is set to be 0.9. Heritability ℎ 2 is set to be 0.5 for both traits. Under this setting, the genetic covariance enrichment is higher in the regions identified by LOGODetect than regions identified by the other three methods (Supplementary Figure 60 solid line), which is concordant with the real data results between BIP and SCZ. To investigate whether the higher genetic covariance enrichment for the regions identified by LOGODetect are potential bias introduced by using LD scores in calculating the scan statistic, we performed null simulations. For each trait, we attributed 30% of the trait heritability to 10000 randomly chosen SNPs, while the remaining SNPs explain 70% of the trait heritability. We applied the four methods, LOGODetect, ρ-HESS, coloc, and gwas-pw to these simulated independent traits, and the corresponding top regions are identified. Then we evaluated the proportion of genetic covariance explained by these top regions with respect to the correlated traits under the alternative, and we found no significant enrichment (Supplementary Figure  60 dashed line). This demonstrates that LOGODetect does not favor regions with high LD scores and will not induce systematic bias in calculating genetic covariance enrichment. Thus, the observed enrichment of genetic covariance in real data analysis of LOGODetect regions is attributed to the shared genetic architecture instead of the systematic bias. We concluded that genetic covariance enrichment is a fair metric comparing performance of different methods, and LOGODetect actually identify more precise regions than other methods.

Supplementary Note 4. Replication of findings by LOGODetect
We applied LOGODetect to the UKBB summary statistics of BIP and SCZ, and investigated how many of 33 identified BIP-SCZ genomic regions could be replicated. However, In the UKBB, GWAS of BIP and SCZ had very limited case counts (n case =1,064 for BIP; n case =571 for SCZ) despite the large total sample size (n total =366,540 for BIP; n total =366,047 for SCZ), which led to insufficient statistical power for direct replication. We calculated the effective sample size, , for each study using the formula = 4 1 + 1 = 4 (1 − ) , where is the within-sample prevalence. for BIP and SCZ were 49,367 and 99,863 respectively in discovery GWASs, and only 4,244 and 2,280 in the UKBB.
In the following, we assessed the impact on statistical power given reduced . For simplicity, we treated all traits as continuous and assumed the total sample size in GWAS to be = . Using z-score in GWAS summary statistics as input, we mimicked the down-sampled z-score analogous to the approach PUMAS 11 as 0 = √ 0 + N ( , − 0 ), where 0 is the sample size of the random subsample and is the LD (correlation) matrix and is approximated using empirical LD in a reference panel.
Using this approach, we decreased the of SCZ and BIP GWASs to that in the UKBB replication cohort. As expected, the number of detected regions declined as decreased (Supplementary Figure 61). We do not expect to see any significant replication given the strength of local genetic correlation in input GWASs and sample size in the UKBB. Therefore, directly applying LOGODetect to the UKBB for replication would be statistically underpowered.
Instead of applying the direct replication approach, we chose to assess the enrichment of aggregated genetic covariance across all identified local genomic segments to replicate our findings. Here, we provided more empirical justifications to this approach. We repeated the down-sampling analysis described above but assessed the enrichment of genetic covariance in all identified regions in the replication dataset instead of statistical significance of each segment. We found that the detected regions are still expected to be substantially enriched for genetic covariance even when decreased to the level of UKBB replication cohort (Supplementary Figure 62A), whereas randomly selected regions show no enrichment for genetic covariance when varies (Supplementary Figure 62B).
In addition, in spite of the seven neuropsychiatric traits which we had investigated in the main text, we applied our method to two anthropometric traits: height and bodymass index (BMI), to see whether findings by LOGODetect can be replicated with a direct approach. We conducted analysis on summary statistics of height 12 (n=253,288) and BMI 13 (n=236,231) from the GIANT consortium and replicated our findings in the UKBB (n=455,332 and 454,841). We identified 14 regions with significant local genetic correlation in the discovery analysis. 10 of 14 regions identified in the discovery stage were successfully replicated, suggesting the effectiveness of LOGODetect to identify replicable genomic regions with local genetic correlations.

Supplementary Note 5. |∑ ∈ | is larger in regions with strong LD
We observed the pattern that |∑ 1 2 ∈ | is larger in regions with strong LD in real GWASs. We used BIP and SCZ as an example. We partitioned the genome into 30,957 blocks, each spanning 200 SNPs. Then we grouped these blocks into three equallysized categories (i.e., high LD, medium LD, and low LD) according to their LD strength (sum of LD scores of SNPs in each block). Each category contains 10,319 blocks. Two sample t-tests for each category pair suggest significantly larger |∑ and are in perfect LE (linkage equilibrium) with the -th SNP, which means 1 = 2 = ⋯ = −1 , 1 = 2 = ⋯ = −1 , and 1 = 1 = 0 . Suppose the environment effect is independent with the genotype. We ignore the distinction between independence in population level and in sample level, which means = 0, = 0, = 0. Note that 1 = 1 √ 1 1 , 2 = 1 √ 2 2 , it can be shown that, Similarly we have |∑ To prove the latter inequality, one only needs to show that if 1 , 2 … are independently and identically distributed random variables, then | . This indicates that the expected absolute value of ∑ 1 2 ∈ is larger in regions with strong LD.

Supplementary Figures
Supplementary Figure 1. Assessment of statistical power under a heritability enrichment model with varying trait heritability. (A-C) show statistical power assessed by three measures: Signal points detection rate, Signal segments detection rate and Gscore. Heritability represents the trait heritability for both traits on chromosome 1. For each trait, we randomly choose N=5 segments, each containing L=1000 SNPs, as the signal regions. The heritability for the signal regions is set to be 30% of trait heritability. The correlation of genetic effect size of two traits ρ is set to be 0.9. Each simulation setting was repeated 100 times.

Supplementary Figure 2. Assessment of statistical power for four methods under a heritability enrichment model with varying correlation.
(A-C) show statistical power assessed by three measures: Signal points detection rate, Signal segments detection rate and G-score. Correlation represents the correlation of genetic effect size of two traits. For each trait, we randomly choose N=5 segments, each containing L=1000 SNPs, as the signal regions. The heritability for the signal regions is set to be 30% of trait heritability. The trait heritability for two traits is set to be 0.03. Each simulation setting is repeated for 100 times. Here, significance cutoffs for coloc and gwas-pw were adjusted so that the empirical type I error rate for each method is controlled at 0.05.

Supplementary Figure 3. Assessment of statistical power for four methods under a heritability enrichment model with varying signal region length.
(A-C) show statistical power assessed by three measures: Signal points detection rate, Signal segments detection rate and G-score. Horizontal axis is in log scale. Signal region length denoted as L, represents the length of one true signal region. For each trait, we randomly choose N=5 segments, each containing L SNPs, as the signal regions. The trait heritability is set to be 0.03 for both traits. The heritability for the signal regions is set to be 30% of trait heritability. The correlation of genetic effect size of two traits ρ is set to be 0.9. Each simulation setting is repeated for 100 times. Here, significance cutoffs for coloc and gwas-pw were adjusted so that the empirical type I error rate for each method is controlled at 0.05. Heritability represents the trait heritability for the first trait. The trait heritability for the second trait is set to be 0.02. For each trait, we randomly choose N=5 segments, each containing L=1000 SNPs, as the signal regions. The heritability for the signal regions is set to be 30% of trait heritability. The correlation of genetic effect size of two traits ρ is set to be 0.9. Each simulation setting is repeated for 100 times. Here, significance cutoffs for coloc and gwas-pw were adjusted so that the empirical type I error rate for each method is controlled at 0.05. Figure 5. Assessment of statistical power for four methods under a heritability enrichment model with 50% shared samples. (A-C) show statistical power assessed by three measures: Signal points detection rate, Signal segments detection rate and G-score. Heritability represents the trait heritability for both traits. For each trait, we randomly choose N=5 segments, each containing L=1000 SNPs, as the signal regions. The heritability for the signal regions is set to be 30% of trait heritability. The correlation of genetic effect size of two traits ρ is set to be 0.9. Each simulation setting is repeated for 100 times. Here, significance cutoffs for coloc and gwas-pw were adjusted so that the empirical type I error rate for each method is controlled at 0.05. Figure 6. Assessment of statistical power for four methods under a liability threshold model with varying sample prevalence and fixed total sample size.

Supplementary
(A-C) show statistical power assessed by three measures: Signal points detection rate, Signal segments detection rate and G-score. Here we assume the liability threshold model that two binary traits are determined by unobserved continuous liabilities, where the liabilities follow the same heritability enrichment model in previous simulations. Proportion represents the sample prevalence (the proportion of cases in whole samples). Population prevalence and sample prevalence were set to be the same, varying from 0.01 to 0.5. Total sample size is fixed at 50,000 for each trait. For each trait, we randomly choose N=5 segments, each containing L=1000 SNPs, as the signal regions. The trait heritability is set to be 0.03 for both traits. The heritability for the signal regions is set to be 30% of trait heritability. The correlation of genetic effect size of two traits ρ is set to be 0.9. Each simulation setting is repeated for 100 times. Here, significance cutoffs for coloc and gwaspw were adjusted so that the empirical type I error rate for each method is controlled at 0.05. The heritability for the signal regions is set to be 30% of trait heritability. The correlation of genetic effect size of two traits ρ is set to be 0.9. Each simulation setting is repeated for 100 times. Here, significance cutoffs for coloc and gwas-pw were adjusted so that the empirical type I error rate for each method is controlled at 0.05. Here we assume the infinitesimal model with t-distributed effects. For each trait, we randomly choose N=5 segments, each containing L=100 SNPs, as the signal regions. The heritability for the signal regions is set to be 30% of trait heritability. The correlation of genetic effect size of two traits ρ is set to be 0.9. Each simulation setting is repeated for 100 times. Here, significance cutoffs for coloc and gwas-pw were adjusted so that the empirical type I error rate for each method is controlled at 0.05. Here we assume a sparse genetic model with few causal SNPs. We randomly sampled 10 causal regions for each trait, among which 5 causal regions were shared by both traits. Each causal region contains L SNPs. We assumed the per-SNP heritability for all the causal variants to be the same. The trait heritability is set to be 0.03 for both traits. The heritability for the signal regions is set to be 30% of trait heritability. The effect size correlation of shared causal regions, ρ, is set to be 0.9. Each simulation setting is repeated for 100 times. Here, significance cutoffs for coloc and gwaspw were adjusted so that the empirical type I error rate for each method is controlled at 0.05. Here the X-axis denotes the squared root of the effective sample size of the simulated BIP cohort times that of the simulated SCZ cohort. For each panel, the box on the right corresponds to effective sample size in the BIP-SCZ discovery cohorts (√ 1 * 2 = 70,214), the box on the left corresponds to effective sample size in the BIP-SCZ UKBB replication cohorts (√ 1 * 2 = 3,111). Each box represents 100 independent simulations for each effective sample size. Here, the bottom and upper end of the whisker denote the minima and maxima respectively, the bottom, middle, and upper line of the box denote the first quartile 1 , median 2 , and the third quartile 3 respectively, and the dots beyond the whiskers denote outliers (data points that fall outside [ 1 − 1.5 * , 3 + 1.5 * ], where = 3 − 1 ).

Supplementary Tables
Supplementary Table 1   The column of Genetic correlation, s.e., and P represent the estimated genetic correlation, its corresponding standard error, and p value for each disease pair respectively using LDSC. Column of LOGODetect and ρ-HESS represent the numbers of detected significant segments using these approaches with FDR<0.05 respectively. Column of coloc and gwas-pw represent the numbers of detected significant segments using these approaches with PP (posterior probability) >0.95 respectively. Each row represents the proportion of cell type specific annotation overlapped with six generic annotations. The two significantly enriched brain tissue types, i.e. cingulate gyrus and angular gyrus, are highlighted with yellow background. Here we listed out the five hub regions with significant local genetic correlation in at least seven trait pairs identified by LOGODetect. The second to fifth columns show the trait pairs with significant genetic correlation at the corresponding locus identified by LOGODetect, ρ-HESS, coloc, and gwas-pw, respectively.