# High-definition likelihood inference of genetic correlations across human complex traits

## Abstract

Genetic correlation is a central parameter for understanding shared genetic architecture between complex traits. By using summary statistics from genome-wide association studies (GWAS), linkage disequilibrium score regression (LDSC) was developed for unbiased estimation of genetic correlations. Although easy to use, LDSC only partially utilizes LD information. By fully accounting for LD across the genome, we develop a high-definition likelihood (HDL) method to improve precision in genetic correlation estimation. Compared to LDSC, HDL reduces the variance of genetic correlation estimates by about 60%, equivalent to a 2.5-fold increase in sample size. We apply HDL and LDSC to estimate 435 genetic correlations among 30 behavioral and disease-related phenotypes measured in the UK Biobank (UKBB). In addition to 154 significant genetic correlations observed for both methods, HDL identified another 57 significant genetic correlations, compared to only another 2 significant genetic correlations identified by LDSC. HDL brings more power to genomic analyses and better reveals the underlying connections across human complex traits.
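The sample-size equivalence quoted above can be checked with a little arithmetic. The sketch below assumes that the sampling variance of an estimator scales as 1/N (an assumption of this sketch, not a statement taken from the paper), and it also confirms that 30 phenotypes give 435 distinct pairs. The function name is illustrative.

```python
from math import comb


def equivalent_sample_size_factor(variance_reduction: float) -> float:
    """Fold increase in N that would give the same variance reduction.

    Assumes Var(estimate) is proportional to 1/N.
    """
    if not 0.0 <= variance_reduction < 1.0:
        raise ValueError("variance_reduction must be in [0, 1)")
    return 1.0 / (1.0 - variance_reduction)


# A 60% reduction in variance leaves 40% of the original variance.
factor = equivalent_sample_size_factor(0.60)

# Number of unordered trait pairs among 30 phenotypes.
n_pairs = comb(30, 2)

print(factor, n_pairs)
```

Under the same assumption, the relationship can be inverted: a k-fold increase in sample size corresponds to a variance reduction of 1 − 1/k.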

## Data availability

The individual-level genotype and phenotype data are available by application from the UKBB (http://www.ukbiobank.ac.uk/). The UKBB GWAS summary statistics by the Neale laboratory can be obtained from http://www.nealelab.is/uk-biobank/. Source data are provided with this paper.

## Code availability

HDL software is available at https://github.com/zhenin/HDL/. LDSC software is available at https://github.com/bulik/ldsc/. PLINK 2.0 (https://www.cog-genomics.org/plink/2.0/) was used to extract individual-level data of imputed SNPs from the UKBB. PLINK 1.9 (https://www.cog-genomics.org/plink/) and LDAK (http://dougspeed.com/ldak/) were used in LD correlation calculation and simulations.

## Acknowledgements

We thank the UKBB resource, approved under application nos. 14302 and 19655, for the individual-level genotype data used in LD correlation calculation and simulations. X.S. was in receipt of a Swedish Research Council starting grant (no. 2017-02543). Y.P. received a Swedish Research Council grant (no. 2016-04194). We thank the Edinburgh Compute and Data Facility (ECDF) for providing high-performance computing resources.

## Author information

### Contributions

X.S. and Y.P. initiated and coordinated the study. Z.N. performed data analysis. All authors contributed to method development and manuscript writing.

### Corresponding author

Correspondence to Xia Shen.

## Ethics declarations

### Competing interests

The authors declare no competing interests.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Extended data

### Extended Data Fig. 1 Relative efficiency of HDL against LDSC when 100% SNPs are causal.

In each heritability group, we generated 100 pairs of traits, where the true genetic correlation and phenotypic correlation are both 0.5. In the high heritability group, the heritabilities of the two traits are 0.6 and 0.8, respectively; in the low heritability group, they are 0.2 and 0.4, respectively. The 307,519 array SNPs of ~336,000 UKBB genomic British individuals were used to simulate true phenotypes and to compute the LD matrix for both HDL and LDSC. The P-values are from Levene’s test for variance heterogeneity. Inside each box, the line indicates the median value, the central box indicates the interquartile range (IQR), and whiskers extend up to 1.5 times the IQR.
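To illustrate how a test for variance heterogeneity compares two sets of estimates, here is a minimal sketch of a Levene-type W statistic using only the Python standard library. The caption does not say which centring was used, so the `center` argument (median or mean) is an assumption of this sketch, and the input values are made-up toy numbers rather than results from the paper. The P-value, which comes from comparing W with an F distribution on (k − 1, N − k) degrees of freedom, is not computed here.

```python
from statistics import mean, median


def levene_statistic(groups, center="median"):
    """Levene-type W statistic for equality of variances across groups.

    groups: list of lists of numbers (one list per method).
    center: "median" (Brown-Forsythe variant) or "mean" (original Levene).
    """
    centre_fn = median if center == "median" else mean
    # Absolute deviations of each observation from its own group's centre.
    z = []
    for g in groups:
        c = centre_fn(g)
        z.append([abs(x - c) for x in g])
    k = len(z)
    n_total = sum(len(zi) for zi in z)
    group_means = [mean(zi) for zi in z]
    grand_mean = sum(sum(zi) for zi in z) / n_total
    between = sum(len(zi) * (zb - grand_mean) ** 2
                  for zi, zb in zip(z, group_means))
    within = sum((x - zb) ** 2
                 for zi, zb in zip(z, group_means) for x in zi)
    return (n_total - k) / (k - 1) * between / within


# Toy example: the second group is three times as spread out as the first.
toy_a = [-1.0, 0.0, 1.0]
toy_b = [-3.0, 0.0, 3.0]
w = levene_statistic([toy_a, toy_b])
print(w)
```

A larger W indicates stronger evidence that the groups differ in variance; two groups with identical spread give W = 0.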

### Extended Data Fig. 2 Relative efficiency of HDL against LDSC under different model setups when 10% SNPs with MAF > 1% are causal.

52,914 out of 529,139 array SNPs with MAF > 1% were randomly selected as causal variants. 100 pairs of traits were generated, where the true genetic correlation and phenotypic correlation are both 0.5. The true phenotypes of trait $$i$$ are generated from the model $${\mathbf{y}}_i = \mathop {\sum}\nolimits_{k = 1}^M {\mathbf{X}}_{ik}\beta _{ik} + \epsilon_i$$, where $${\mathbf{X}}_{ik} = ({\mathbf{Z}}_{ik} - 2p_k{\mathbf{1}})[2p_k(1 - p_k)]^{\alpha /2}$$; $${\mathbf{Z}}_{ik}$$ are the original genotypes of SNP $$k$$ for trait $$i$$; $$p_k$$ is the MAF of SNP $$k$$; $$M$$ is the number of causal variants. Four scenarios were simulated: (1) α = −1, and the marginal distribution of $$\beta_{ik}$$ is $$N(0,h_i^2/M)$$; (2) α = −1, and the marginal distribution of $$\beta_{ik}$$ is $$N(0,w_kh_i^2/M)$$, where $$w_k$$ is the LDAK weight of SNP $$k$$, which is inversely proportional to its LD score; (3) α = −0.25, and the marginal distribution of $$\beta_{ik}$$ is $$N(0,h_i^2/M)$$; and (4) α = −0.25, and the marginal distribution of $$\beta_{ik}$$ is $$N(0,w_kh_i^2/M)$$. After the effects $$\beta_{ik}$$ were generated, they were rescaled by multiplying by the same constant so that the true heritabilities were 0.5 for both traits. The 307,519 array SNPs of ~336,000 UKBB genomic British individuals were used to simulate true phenotypes and to compute the LD matrix for both HDL and LDSC. The P-values are from Levene’s test for variance heterogeneity. Inside each box, the line indicates the median value, the central box indicates the interquartile range (IQR), and whiskers extend up to 1.5 times the IQR.
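The generative model in this caption can be sketched for a single trait as follows. This is a toy illustration under strong simplifying assumptions that are not in the paper: genotypes are drawn independently for each SNP (so there is no LD), only one trait is simulated (so the genetic correlation between a pair of traits is not modelled), and the function name, MAF grid and sample size are hypothetical.

```python
import random
from statistics import pvariance


def simulate_trait(n, mafs, h2, alpha, weights=None, seed=0):
    """Simulate one trait; returns (phenotypes, genetic_values).

    mafs:    minor allele frequency p_k of each causal SNP.
    weights: optional per-SNP weights w_k (all 1.0 if omitted).
    """
    rng = random.Random(seed)
    m = len(mafs)
    if weights is None:
        weights = [1.0] * m
    # Effect sizes: beta_k ~ N(0, w_k * h2 / M); gauss takes a standard deviation.
    betas = [rng.gauss(0.0, (w * h2 / m) ** 0.5) for w in weights]
    genetic = []
    for _ in range(n):
        g = 0.0
        for p, beta in zip(mafs, betas):
            # Genotype in {0, 1, 2}: two independent allele draws (no LD assumed).
            z = (rng.random() < p) + (rng.random() < p)
            # X = (Z - 2p) * [2p(1 - p)]^(alpha / 2)
            x = (z - 2 * p) * (2 * p * (1 - p)) ** (alpha / 2)
            g += x * beta
        genetic.append(g)
    # Multiply all effects by one constant so the sample genetic variance is h2.
    scale = (h2 / pvariance(genetic)) ** 0.5
    genetic = [scale * g for g in genetic]
    # Add environmental noise with variance 1 - h2.
    noise_sd = (1.0 - h2) ** 0.5
    phenotypes = [g + rng.gauss(0.0, noise_sd) for g in genetic]
    return phenotypes, genetic


# Hypothetical inputs: 50 causal SNPs with MAF spread over 5% to 45%.
mafs = [0.05 + 0.4 * k / 49 for k in range(50)]
y, g = simulate_trait(n=500, mafs=mafs, h2=0.5, alpha=-0.25)
print(len(y), round(pvariance(g), 3))
```

Setting `alpha=-1` standardises each genotype to unit variance, whereas `alpha=-0.25` lets SNPs with higher MAF contribute more genetic variance for the same effect-size distribution.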

### Extended Data Fig. 3 Relative efficiency of HDL against LDSC under different model setups when 10% SNPs with 5% > MAF > 1% are causal.

52,914 out of 221,620 array SNPs with 5% > MAF > 1% were randomly selected as causal variants. 100 pairs of traits were generated, where the true genetic correlation and phenotypic correlation are both 0.5. The true phenotypes of trait $$i$$ are generated from the model $${\mathbf{y}}_i = \mathop {\sum}\nolimits_{k = 1}^M {\mathbf{X}}_{ik}\beta _{ik} + \epsilon_i$$, where $${\mathbf{X}}_{ik} = ({\mathbf{Z}}_{ik} - 2p_k{\mathbf{1}})[2p_k(1 - p_k)]^{\alpha /2}$$; $${\mathbf{Z}}_{ik}$$ are the original genotypes of SNP $$k$$ for trait $$i$$; $$p_k$$ is the MAF of SNP $$k$$; $$M$$ is the number of causal variants. Four scenarios were simulated: (1) α = −1, and the marginal distribution of $$\beta_{ik}$$ is $$N(0,h_i^2/M)$$; (2) α = −1, and the marginal distribution of $$\beta_{ik}$$ is $$N(0,w_kh_i^2/M)$$, where $$w_k$$ is the LDAK weight of SNP $$k$$, which is inversely proportional to its LD score; (3) α = −0.25, and the marginal distribution of $$\beta_{ik}$$ is $$N(0,h_i^2/M)$$; and (4) α = −0.25, and the marginal distribution of $$\beta_{ik}$$ is $$N(0,w_kh_i^2/M)$$. After the effects $$\beta_{ik}$$ were generated, they were rescaled by multiplying by the same constant so that the true heritabilities were 0.5 for both traits. The 307,519 array SNPs of ~336,000 UKBB genomic British individuals were used to simulate true phenotypes and to compute the LD matrix for both HDL and LDSC. The P-values are from Levene’s test for variance heterogeneity. Inside each box, the line indicates the median value, the central box indicates the interquartile range (IQR), and whiskers extend up to 1.5 times the IQR.

### Extended Data Fig. 4 Relative efficiency of HDL using imputed reference panel against LDSC.

100 pairs of traits were generated, where the true heritabilities, genetic correlation and phenotypic correlation are all 0.5. The 1,029,876 imputed SNPs of ~336,000 UKBB genomic British individuals were used to simulate true phenotypes. LDSC and LDSC.1kG stand for the LDSC software using the UKBB imputed reference panel and the default 1000 Genomes reference panel, respectively. 102,988 (10% of 1,029,876) randomly sampled SNPs were set as causal variants. The P-values are from Levene’s test for variance heterogeneity. Inside each box, the line indicates the median value, the central box indicates the interquartile range (IQR), and whiskers extend up to 1.5 times the IQR.

### Extended Data Fig. 5 Relative efficiency and standard error of LDSC estimate among 30 phenotypes in UK Biobank.

Each dot represents genetic correlation results for one pair of traits among 435 pairs. The x-axis represents the standard error of the LDSC estimate. The y-axis represents the relative efficiency of HDL against LDSC. HDL reference panel: UKBB imputed SNPs; LDSC reference panel: 1000 Genomes (default). Colors indicate the number of binary traits in the pair.

### Extended Data Fig. 6 Genetic correlation estimates from HDL and LDSC among 30 phenotypes in UK Biobank based on directly genotyped variants on the array.

Lower triangle: HDL estimates; upper triangle: LDSC estimates. The areas of the squares represent the absolute values of the corresponding genetic correlations. After Bonferroni correction for 435 tests at the 5% significance level, genetic correlation estimates that are significantly different from zero in both methods are marked with a dot; estimates that are significantly different from zero in only one method are marked with an asterisk and a black square. HDL reference panel: UKBB array SNPs; LDSC reference panel: UKBB array SNPs.
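The Bonferroni rule in this caption amounts to comparing each P-value with 0.05/435. The sketch below shows that threshold and applies it to a two-sided Wald test of an estimate against zero. The Wald test with a normal reference distribution is an assumption of this sketch (the caption does not state how the P-values were obtained), and the estimates and standard errors are made-up numbers.

```python
from statistics import NormalDist


def bonferroni_significant(estimate, se, n_tests, alpha=0.05):
    """Return (is_significant, p_value) for H0: estimate == 0.

    Uses a two-sided Wald test with a standard normal reference
    distribution and a Bonferroni-adjusted threshold alpha / n_tests.
    """
    z = estimate / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return p_value < alpha / n_tests, p_value


threshold = 0.05 / 435  # per-test significance threshold for 435 pairs
strong = bonferroni_significant(0.5, 0.1, n_tests=435)  # z = 5
weak = bonferroni_significant(0.2, 0.1, n_tests=435)    # z = 2
print(threshold, strong[0], weak[0])
```

Because the threshold is fixed by the number of tests, a method with smaller standard errors will clear it for more trait pairs even when the point estimates are similar.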

### Extended Data Fig. 7 Relative efficiency of HDL using imputed reference panel against LDSC for the estimation of heritability.

a, 100 traits were generated using 14,867 imputed SNPs on chromosome 22 of ~336,000 UKBB genomic British individuals, where the true heritability was set to 0.05. LDSC and LDSC.1kG stand for the LDSC software using the UKBB imputed reference panel and the default 1000 Genomes (1kG) reference panel, respectively. 1,487 (10% of 14,867) randomly sampled SNPs were set as causal variants. b, The relative efficiency, calculated as the ratio of the estimated variances of the LDSC estimates to those of the HDL estimates, was evaluated for 30 GWAS of real phenotypes in UKBB. HDL reference panel: UKBB imputed SNPs; LDSC reference panel: 1000 Genomes (default). Inside each box, the line indicates the median value, the central box indicates the interquartile range (IQR), and whiskers extend up to 1.5 times the IQR.
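As a small illustration of the relative efficiency defined in panel b, the sketch below computes the ratio in two ways: from replicate estimates (as one might in a simulation) and from a pair of reported standard errors (as one might for a real GWAS). Both function names and all numbers are hypothetical.

```python
from statistics import variance


def relative_efficiency(ldsc_estimates, hdl_estimates):
    """Ratio of sample variances: Var(LDSC estimates) / Var(HDL estimates)."""
    return variance(ldsc_estimates) / variance(hdl_estimates)


def relative_efficiency_from_se(se_ldsc, se_hdl):
    """Same ratio from standard errors, since variance = SE squared."""
    return (se_ldsc / se_hdl) ** 2


# Toy replicate estimates: the first set is twice as spread out as the second.
re_replicates = relative_efficiency([0.0, 2.0, 4.0], [1.0, 2.0, 3.0])

# Toy standard errors for one trait.
re_from_se = relative_efficiency_from_se(0.10, 0.05)

print(re_replicates, re_from_se)
```

A value above 1 means the HDL estimates are less variable than the LDSC estimates for the same data.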

### Extended Data Fig. 8 Comparison of the heritability estimates from HDL and default LDSC across 30 UKBB phenotypes.

The default LDSC uses the 1000 Genomes reference panel. HDL uses UKBB imputed markers as reference. R represents the correlation between the two sets of estimates. The red dashed line represents identity.

### Extended Data Fig. 9 Example of the eigenvalues of an LD matrix.

5,420 genotyped variants on chromosome 22 for UKBB genomic British individuals were used to generate the LD matrix. The red dashed line represents the cutoff where the leading eigenvalues and corresponding eigenvectors capture 90% of the information of the LD matrix.
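One way to read the 90% cutoff is as the smallest number of leading eigenvalues whose sum reaches 90% of the sum of all eigenvalues. That interpretation is an assumption of this sketch, as are the function name and the toy eigenvalues; the eigenvalues are taken as given and are assumed non-negative, as for a positive semi-definite LD matrix.

```python
def n_leading_eigenvalues(eigenvalues, fraction=0.90):
    """Smallest k such that the k largest eigenvalues sum to at least
    `fraction` of the total."""
    vals = sorted(eigenvalues, reverse=True)
    total = sum(vals)
    running = 0.0
    for k, value in enumerate(vals, start=1):
        running += value
        if running >= fraction * total:
            return k
    return len(vals)


# Toy spectrum: cumulative shares are 60%, 85%, 95% and 100%.
toy_eigenvalues = [6.0, 2.5, 1.0, 0.5]
k90 = n_leading_eigenvalues(toy_eigenvalues, fraction=0.90)
k80 = n_leading_eigenvalues(toy_eigenvalues, fraction=0.80)
print(k90, k80)
```

The retained eigenvalues and their eigenvectors then give a low-rank approximation of the matrix, which is cheaper to store and to work with than the full matrix.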

### Extended Data Fig. 10 HDL results where the LD matrix is approximated by different numbers of leading eigenvalues and eigenvectors.

After eigen-decomposition of the LD matrix, the leading eigenvalues explaining different proportions of the variance of the LD matrix, together with their corresponding eigenvectors, were used to approximate the LD matrix. In each heritability group, we generated 100 pairs of traits, where the true genetic correlation and phenotypic correlation are both 0.5. In the high heritability group, the heritabilities of the two traits are 0.6 and 0.8, respectively; in the low heritability group, they are 0.2 and 0.4, respectively. The 307,519 array SNPs of ~336,000 UKBB genomic British individuals were used to simulate true phenotypes and to compute the LD matrix for HDL. 30,752 SNPs are causal (10% of 307,519). Inside each box, the line indicates the median value, the central box indicates the interquartile range (IQR), and whiskers extend up to 1.5 times the IQR.

## Supplementary information

### Supplementary Information

Supplementary Note, Table 1 and Figs. 1–6

## Source data

Statistical source data are provided for Figs. 1–3 and Extended Data Figs. 1–10.

## Cite this article

Ning, Z., Pawitan, Y. & Shen, X. High-definition likelihood inference of genetic correlations across human complex traits. Nat Genet 52, 859–864 (2020). https://doi.org/10.1038/s41588-020-0653-y