A framework for detecting noncoding rare-variant associations of large-scale whole-genome sequencing studies

Abstract

Large-scale whole-genome sequencing studies have enabled analysis of noncoding rare-variant (RV) associations with complex human diseases and traits. Variant-set analysis is a powerful approach to study RV association. However, existing methods have limited ability to analyze the noncoding genome. We propose a computationally efficient and robust noncoding RV association detection framework, STAARpipeline, to automatically annotate a whole-genome sequencing study and perform flexible noncoding RV association analysis, including gene-centric analysis and fixed window-based and dynamic window-based non-gene-centric analysis by incorporating variant functional annotations. In gene-centric analysis, STAARpipeline uses STAAR to group noncoding variants based on functional categories of genes and incorporate multiple functional annotations. In non-gene-centric analysis, STAARpipeline uses SCANG-STAAR to incorporate dynamic window sizes and multiple functional annotations. We apply STAARpipeline to identify noncoding RV sets associated with four lipid traits in 21,015 discovery samples from the Trans-Omics for Precision Medicine (TOPMed) program and replicate several of them in an additional 9,123 TOPMed samples. We also analyze five non-lipid TOPMed traits.
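
The fixed window-based grouping mentioned above can be illustrated with a minimal Python sketch. It is not the STAARpipeline implementation (which is an R package) and performs no association test; it only shows how rare variants might be assigned to overlapping fixed-size windows. The function names, the 1-kb step and the half-open interval convention are assumptions made for illustration; the 2-kb window size and the MAF < 0.01 cutoff are taken from the extended data captions of this article.

```python
def sliding_windows(start, end, size=2000, step=1000):
    """Yield half-open (window_start, window_end) intervals covering [start, end).

    size=2000 mirrors the 2-kb windows named in the captions; step=1000 is a
    hypothetical choice for this sketch, not a value stated in the text.
    """
    pos = start
    while pos < end:
        yield (pos, min(pos + size, end))
        pos += step


def group_rare_variants(variants, start, end, maf_cutoff=0.01, size=2000, step=1000):
    """Map each window to the sorted positions of the rare variants it contains.

    `variants` is an iterable of (position, minor_allele_frequency) pairs.
    A variant counts as rare when 0 < MAF < maf_cutoff. Windows that contain
    no rare variant are omitted from the result.
    """
    rare = sorted(
        pos for pos, maf in variants
        if 0 < maf < maf_cutoff and start <= pos < end
    )
    groups = {}
    for w_start, w_end in sliding_windows(start, end, size, step):
        members = [p for p in rare if w_start <= p < w_end]
        if members:
            groups[(w_start, w_end)] = members
    return groups


if __name__ == "__main__":
    # Toy input: (position, MAF). The variant at 1500 is common and is dropped.
    toy = [(100, 0.001), (1500, 0.2), (2500, 0.005), (3900, 0.009)]
    for window, members in group_rare_variants(toy, 0, 4000).items():
        print(window, members)
```

Because the windows overlap, a single variant can belong to more than one variant set; each set would then be passed to a set-based test in a real analysis.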

Fig. 1: Workflow of STAARpipeline.

Data availability

This paper used the TOPMed Freeze 5 WGS data and phenotype data of lipids, CRP, eGFR, FG, FI and TL. The genotype and phenotype data are available in dbGaP. The TOPMed data were from the following 14 studies, under the provided accession numbers:

Framingham Heart Study (phs000974.v1.p1), Old Order Amish Study (phs000956.v1.p1), Jackson Heart Study (phs000964.v1.p1), Multi-Ethnic Study of Atherosclerosis (phs001416.v1.p1), GWAS of Adiposity in Samoans (phs000972), Women’s Health Initiative (phs001237), Atherosclerosis Risk in Communities Study (phs001211), Cleveland Family Study (phs000954), Cardiovascular Health Study (phs001368), Diabetes Heart Study (phs001412), Genetic Study of Atherosclerosis Risk (phs001218), Genetic Epidemiology Network of Arteriopathy (phs001345), Genetics of Lipid Lowering Drugs and Diet Network (phs001359) and San Antonio Family Heart Study (phs001215).

The functional annotation data are publicly available and were downloaded from: GRCh38 CADD v1.4 (https://cadd.gs.washington.edu/download), ANNOVAR dbNSFP v3.3a (https://annovar.openbioinformatics.org/en/latest/user-guide/download/), LINSIGHT (https://github.com/CshlSiepelLab/LINSIGHT/), FATHMM-XF (http://fathmm.biocompute.org.uk/fathmm-xf/), CAGE (https://fantom.gsc.riken.jp/5/data/), GeneHancer (https://www.genecards.org/) and Umap/Bismap (https://bismap.hoffmanlab.org/). In addition, recombination rate and nucleotide diversity were obtained from work by Gazal et al. (ref. 54). The tissue-specific functional annotations were downloaded from ENCODE (https://www.encodeproject.org/report/?type=Experiment). The assembled functional annotation data from these sources are available at http://favor.genohub.org/.

Code availability

STAARpipeline is implemented as an open-source R package available at https://github.com/xihaoli/STAARpipeline/ (ref. 55) and https://content.sph.harvard.edu/xlin/software.html. STAARpipelineSummary is implemented as an open-source R package available at https://github.com/xihaoli/STAARpipelineSummary/ (ref. 56) and https://content.sph.harvard.edu/xlin/software.html. The scripts used to generate the results have been archived on Zenodo using https://doi.org/10.5281/zenodo.6871408 (ref. 57). Data analysis was performed in R (3.6.1). STAAR v0.9.6, STAARpipeline v0.9.6 and STAARpipelineSummary v0.9.6 were used in simulation and real data analysis, and seqMeta v1.6.7 was used in simulation. Wget v1.14 was used for downloading the annotation data. FAVORannotator v1.0.0 (https://github.com/xihaoli/STAARpipeline-Tutorial/) was used to functionally annotate the whole-genome data.

References

  1. Manolio, T. A. et al. Finding the missing heritability of complex diseases. Nature 461, 747–753 (2009).

  2. Wainschtein, P. et al. Assessing the contribution of rare variants to complex trait heritability from whole-genome sequence data. Nat. Genet. 54, 263–273 (2022).

  3. Hernandez, R. D. et al. Ultrarare variants drive substantial cis heritability of human gene expression. Nat. Genet. 51, 1349–1355 (2019).

  4. Taliun, D. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590, 290–299 (2021).

  5. Flannick, J. et al. Exome sequencing of 20,791 cases of type 2 diabetes and 24,440 controls. Nature 570, 71–76 (2019).

  6. Van Hout, C. V. et al. Exome sequencing and characterization of 49,960 individuals in the UK Biobank. Nature 586, 749–756 (2020).

  7. Zhang, F. & Lupski, J. R. Non-coding genetic variants in human disease. Hum. Mol. Genet. 24, R102–R110 (2015).

  8. Khurana, E. et al. Role of non-coding sequence variants in cancer. Nat. Rev. Genet. 17, 93–108 (2016).

  9. Lee, P. H. et al. Principles and methods of in-silico prioritization of non-coding regulatory variants. Hum. Genet. 137, 15–30 (2018).

  10. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).

  11. Moore, J. E. et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature 583, 699–710 (2020).

  12. Bansal, V., Libiger, O., Torkamani, A. & Schork, N. J. Statistical analysis strategies for association studies involving rare variants. Nat. Rev. Genet. 11, 773–785 (2010).

  13. Lee, S., Abecasis, G. R., Boehnke, M. & Lin, X. Rare-variant association analysis: study designs and statistical tests. Am. J. Hum. Genet. 95, 5–23 (2014).

  14. Kiezun, A. et al. Exome sequencing and the genetic basis of complex traits. Nat. Genet. 44, 623 (2012).

  15. Li, B. & Leal, S. M. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am. J. Hum. Genet. 83, 311–321 (2008).

  16. Wu, M. C. et al. Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet. 89, 82–93 (2011).

  17. Li, X. et al. Dynamic incorporation of multiple in silico functional annotations empowers rare-variant association analysis of large whole-genome sequencing studies at scale. Nat. Genet. 52, 969–983 (2020).

  18. Morrison, A. C. et al. Practical approaches for whole-genome sequence analysis of heart- and blood-related traits. Am. J. Hum. Genet. 100, 205–215 (2017).

  19. Li, Z. et al. Dynamic scan procedure for detecting rare-variant association regions in whole-genome sequencing studies. Am. J. Hum. Genet. 104, 802–814 (2019).

  20. He, Z., Xu, B., Buxbaum, J. & Ionita-Laza, I. A genome-wide scan statistic framework for whole-genome sequence data analysis. Nat. Commun. 10, 3018 (2019).

  21. Natarajan, P. et al. Deep-coverage whole genome sequences and blood lipids among 16,324 individuals. Nat. Commun. 9, 3391 (2018).

  22. Li, Z., Liu, Y. & Lin, X. Simultaneous detection of signal regions using quadratic scan statistics with applications to whole genome association studies. J. Am. Stat. Assoc. 117, 823–834 (2022).

  23. Bocher, O. & Génin, E. Rare-variant association testing in the non-coding genome. Hum. Genet. 139, 1345–1362 (2020).

  24. Fishilevich, S. et al. GeneHancer: genome-wide integration of enhancers and target genes in GeneCards. Database 2017, bax028 (2017).

  25. FANTOM Consortium. A promoter-level mammalian expression atlas. Nature 507, 462–470 (2014).

  26. Andersson, R. et al. An atlas of active enhancers across human cell types and tissues. Nature 507, 455–461 (2014).

  27. Breslow, N. E. & Clayton, D. G. Approximate inference in generalized linear mixed models. J. Am. Stat. Assoc. 88, 9–25 (1993).

  28. Chen, H. et al. Control for population structure and relatedness for binary traits in genetic association studies via logistic mixed models. Am. J. Hum. Genet. 98, 653–666 (2016).

  29. Chen, H. et al. Efficient variant set mixed model association tests for continuous and binary traits in large-scale whole-genome sequencing studies. Am. J. Hum. Genet. 104, 260–274 (2019).

  30. Zhou, H., Arapoglou, T., Li, X., Li, Z. & Lin, X. FAVOR Essential Database. https://doi.org/10.7910/DVN/1VGTJI (Harvard Dataverse V1, 2022).

  31. Harrow, J. et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 22, 1760–1774 (2012).

  32. Frankish, A. et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 47, D766–D773 (2019).

  33. Kinsella, R. J. et al. Ensembl BioMarts: a hub for data retrieval across taxonomic space. Database 2011, bar030 (2011).

  34. Povysil, G. et al. Rare-variant collapsing analyses for complex traits: guidelines and applications. Nat. Rev. Genet. 20, 747–759 (2019).

  35. Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).

  36. Huang, Y.-F., Gulko, B. & Siepel, A. Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data. Nat. Genet. 49, 618–624 (2017).

  37. Rogers, M. F. et al. FATHMM-XF: accurate prediction of pathogenic point mutations via extended features. Bioinformatics 34, 511–513 (2017).

  38. Liu, Y. et al. ACAT: a fast and powerful P value combination method for rare-variant analysis in sequencing studies. Am. J. Hum. Genet. 104, 410–421 (2019).

  39. Buniello, A. et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47, D1005–D1012 (2019).

  40. Stilp, A. M. et al. A system for phenotype harmonization in the National Heart, Lung, and Blood Institute Trans-Omics for Precision Medicine (TOPMed) Program. Am. J. Epidemiol. 190, 1977–1992 (2021).

  41. Moutsianas, L. et al. The power of gene-based rare-variant methods to detect disease-associated variation and test hypotheses about complex disease. PLoS Genet. 11, e1005165 (2015).

  42. Raffield, L. M. et al. Allelic heterogeneity at the CRP locus identified by whole-genome sequencing in multi-ancestry cohorts. Am. J. Hum. Genet. 106, 112–120 (2020).

  43. Lin, B. M. et al. Whole-genome sequence analyses of eGFR in 23,732 people representing multiple ancestries in the NHLBI trans-omics for precision medicine (TOPMed) consortium. EBioMedicine 63, 103157 (2021).

  44. DiCorpo, D. et al. Whole-genome sequence association analysis of fasting glucose and fasting insulin levels in diverse cohorts from the NHLBI TOPMed Program. Commun. Biol. 5, 756 (2022).

  45. Taub, M. A. et al. Genetic determinants of telomere length from 109,122 ancestrally diverse whole-genome sequences in TOPMed. Cell Genom. 2, 100084 (2022).

  46. Schaffner, S. F. et al. Calibrating a coalescent simulation of human genome sequence variation. Genome Res. 15, 1576–1583 (2005).

  47. Lee, S., Wu, M. C. & Lin, X. Optimal tests for rare-variant effects in sequencing association studies. Biostatistics 13, 762–775 (2012).

  48. Zaidi, A. A. & Mathieson, I. Demographic history mediates the effect of stratification on polygenic scores. Elife 9, e61548 (2020).

  49. Gogarten, S. M. et al. Genetic association testing using the GENESIS R/Bioconductor package. Bioinformatics 35, 5346–5348 (2019).

  50. Zheng, X. & Davis, J. W. SAIGEgds-an efficient statistical tool for large-scale PheWAS with mixed models. Bioinformatics 37, 728–730 (2020).

  51. Peloso, G. M. et al. Association of low-frequency and rare coding-sequence variants with blood lipids and coronary heart disease in 56,000 whites and blacks. Am. J. Hum. Genet. 94, 223–232 (2014).

  52. Moon, S., Lee, Y., Won, S. & Lee, J. Multiple genotype-phenotype association study reveals intronic variant pair on SIDT2 associated with metabolic syndrome in a Korean population. Hum. Genomics 12, 48 (2018).

  53. Dong, C. et al. Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies. Hum. Mol. Genet. 24, 2125–2137 (2015).

  54. Gazal, S. et al. Linkage disequilibrium–dependent architecture of human complex traits shows action of negative selection. Nat. Genet. 49, 1421–1427 (2017).

  55. Li, X. & Li, Z. xihaoli/STAARpipeline: STAARpipeline_v0.9.6. version 0.9.6 https://doi.org/10.5281/zenodo.6871504 (2022).

  56. Li, X. & Li, Z. xihaoli/STAARpipelineSummary: STAARpipelineSummary_v0.9.6. version 0.9.6 https://doi.org/10.5281/zenodo.6871524 (2022).

  57. Li, X. & Li, Z. xihaoli/STAARpipeline-Tutorial: v0.9.6. version 0.9.6 https://doi.org/10.5281/zenodo.6871408 (2022).

Acknowledgements

This work was supported by grants R35-CA197449, U19-CA203654, R01-HL113338, U01-HG012064 and U01-HG009088 (to X. Lin); R01-HL142711 and R01-HL127564 (to P.N. and G.M.P.); R35-HL135824 (to C.J.W.); 75N92020D00001, HHSN268201500003I, N01-HC-95159, 75N92020D00005, N01-HC-95160, 75N92020D00002, N01-HC-95161, 75N92020D00003, N01-HC-95162, 75N92020D00006, N01-HC-95163, 75N92020D00004, N01-HC-95164, 75N92020D00007, N01-HC-95165, N01-HC-95166, N01-HC-95167, N01-HC-95168, N01-HC-95169, UL1-TR-000040, UL1-TR-001079, UL1-TR-001420, UL1-TR001881, DK063491, R01-HL071051, R01-HL071205, R01-HL071250, R01-HL071251, R01-HL071258, R01-HL071259 and UL1-RR033176 (to J.I.R. and X.G.); U01-HL72518, HL087698, HL49762, HL59684, HL58625, HL071025, HL112064, NR0224103 and M01-RR000052 (to the Johns Hopkins General Clinical Research Center); NO1-HC-25195, HHSN268201500001I, 75N92019D00031 and R01-HL092577-06S1 (to R.S.V. and L.A.C.); the Evans Medical Foundation and the Jay and Louis Coffman Endowment from the Department of Medicine, Boston University School of Medicine (to R.S.V.); HHSN268201800001I and U01-HL137162 (to K.M.R. and M.P.C.); R01-HL133040 (to D.E.W.); R35-HL135818, R01-HL113338, and HL436801 (to S.R.); KL2TR002490 (to L.M.R.); R01-HL92301, R01-HL67348, R01-NS058700, R01-AR48797 and R01-AG058921 (to N.D.P. and D.W.B.); R01-DK071891 (to N.D.P., B.I.F. and D.W.B.); M01-RR07122 and F32-HL085989 (General Clinical Research Center of the Wake Forest University School of Medicine); the American Diabetes Association and P60-AG10484 (Claude Pepper Older Americans Independence Center of Wake Forest University Health Sciences); U01-HL137181 (to J.R.O.); and R01-HL141944 (to R.A.M.). R.A.M. receives support as the Sarah Miller Coulson Scholar in the Johns Hopkins Center for Innovative Medicine; HHSN268201600018C, HHSN268201600001C, HHSN268201600002C, HHSN268201600003C and HHSN268201600004C (to C.L.K.); R01-HL113323, U01-DK085524, R01-HL045522, R01-MH078143, R01-MH078111, and R01-MH083824 (to H.H.H.G., R.D., J.E.C. and J.B.); R01-DK117445 and R01-MD012765 (to N.F. and B.M.L.); U01-DK078616 and R01-DK078616 (to J.B.M. and A.K.M.); 18CDA34110116 from American Heart Association (to P.S.d.V.); HHSN268201800010I, HHSN268201800011I, HHSN268201800012I, HHSN268201800013I, HHSN268201800014I and HHSN268201800015I (to A.C.); R01-HL153805 and R03-HL154284 (to B.E.C.); HHSN268201700001I, HHSN268201700002I, HHSN268201700003I, HHSN268201700005I and HHSN268201700004I (to E.B.); U01-HL072524, R01-HL104135-04S1, U01-HL054472, U01-HL054473, U01-HL054495, U01-HL054509 and R01-HL055673-18S1 (to D.K.A.). Molecular data for the TOPMed program were supported by the NHLBI. Core support, including centralized genomic read mapping and genotype calling along with variant QC metrics and filtering, was provided by the TOPMed Informatics Research Center (3R01HL-117626-02S1; contract HHSN268201800002I). Core support, including phenotype harmonization, data management, sample-identity QC and general program coordination, was provided by the TOPMed Data Coordinating Center (R01HL-120393, U01HL-120393 and contract HHSN268201800001I). We gratefully acknowledge the studies and participants who provided biological samples and data for TOPMed. We gratefully acknowledge the support from The Samoan Obesity, Lifestyle and Genetic Adaptations Study Group. The full study-specific acknowledgements are detailed in the Supplementary Note.

Author information

Contributions

Z.L., X. Li and X. Lin designed the experiments. Z.L., X. Li, H.Z. and X. Lin performed the experiments. Z.L., X. Li, H.Z., S.M.G., M.S.S., T.A., C.Q., Y.L., H.C., R.S., R.D., D.K.A., L.F.B., J.C.B., T.W.B., J.B., E.B., D.W.B., J.A.B., B.E.C., M.P.C., A.C., L.A.C., J.E.C., P.S.d.V., R.D., B.I.F., H.H.H.G., X.G., R.R.K., C.L.K., B.G.K., L.A.L., A.W.M., L.W.M., B.D.M., M.E.M., A.C.M., T.N., J.R.O., N.D.P., P.A.P., B.M.P., L.M.R., S.R., A.P.R., M.S.R., K.M.R., S.S.R., J.A.S., K.D.T., R.S.V., D.E.W., J.G.W., L.R.Y., W.Z., J.I.R., C.J.W., P.N., G.M.P. and X. Lin acquired, analyzed or interpreted data. G.M.P., P.N. and NHLBI TOPMed Lipids Working Group provided administrative, technical or material support. Z.L., X. Li, S.M.G. and X. Lin drafted the manuscript and revised it according to co-authors’ suggestions. All authors critically reviewed the manuscript, suggested revisions as needed, and approved the final version.

Corresponding authors

Correspondence to Zilin Li or Xihong Lin.

Ethics declarations

Competing interests

S.M.G. is now an employee of Regeneron Genetics Center. J.B.M. is an Academic Associate for Quest Diagnostics R&D. For B.D.M., the Amish Research Program receives partial support from Regeneron Pharmaceuticals. M.E.M. reports grants from Regeneron Pharmaceuticals unrelated to the present work. B.M.P. serves on the Steering Committee of the Yale Open Data Access Project funded by Johnson & Johnson. L.M.R. is a consultant for the TOPMed Administrative Coordinating Center (through Westat). S.R. reports support from Jazz Pharma, Eli Lilly and Apnimed, unrelated to the present work. The spouse of C.J.W. works at Regeneron Pharmaceuticals. P.N. reports investigator-initiated grants from Amgen, Apple, AstraZeneca, Boston Scientific and Novartis, personal fees from Apple, AstraZeneca, Blackstone Life Sciences, Foresite Labs, Novartis and Roche/Genentech, is a co-founder of TenSixteen Bio, is a shareholder of geneXwell and TenSixteen Bio, and reports spousal employment at Vertex, all unrelated to the present work. X. Lin is a consultant of AbbVie Pharmaceuticals and Verily Life Sciences. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Yukinori Okada and the other, anonymous, reviewer for their contribution to the peer review of this work. Primary Handling editor: Lin Tang, in collaboration with the Nature Methods team. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Rare variant (MAF < 0.01) distribution in the discovery phase using TOPMed cohorts (n = 21,015).

Variant categories are defined by GENCODE VEP categories.

Extended Data Fig. 2 Manhattan plots and Q-Q plots for unconditional gene-centric noncoding analysis and sliding window analysis of high-density lipoprotein cholesterol (HDL-C) in the discovery phase (n = 21,015).

a, Manhattan plots for unconditional gene-centric noncoding analysis of protein-coding genes. The horizontal line indicates a genome-wide STAAR-O P value threshold of \(3.57 \times 10^{-7}\). The significance threshold is defined by the Bonferroni correction for multiple comparisons (\(0.05/(20,000 \times 7) = 3.57 \times 10^{-7}\)). Different symbols represent the STAAR-O P value of the protein-coding gene using different functional categories (upstream, downstream, UTR, promoter_CAGE, promoter_DHS, enhancer_CAGE, enhancer_DHS). Promoter_CAGE and promoter_DHS are the promoters that overlap Cap Analysis of Gene Expression (CAGE) sites and DNase hypersensitivity (DHS) sites for a given gene, respectively. Enhancer_CAGE and enhancer_DHS are the enhancers in GeneHancer-predicted regions that overlap CAGE sites and DHS sites for a given gene, respectively. b, Quantile-quantile plots for unconditional gene-centric noncoding analysis of protein-coding genes. Different symbols represent the STAAR-O P value of the gene using different functional categories (upstream, downstream, UTR, promoter_CAGE, promoter_DHS, enhancer_CAGE, enhancer_DHS). c, Manhattan plots for unconditional gene-centric noncoding analysis of ncRNA genes. The horizontal line indicates a genome-wide STAAR-O P value threshold of \(2.50 \times 10^{-6}\). The significance threshold is defined by the Bonferroni correction for multiple comparisons (\(0.05/20,000 = 2.50 \times 10^{-6}\)). d, Quantile-quantile plots for unconditional gene-centric noncoding analysis of ncRNA genes. e, Manhattan plot for 2-kb sliding windows. The horizontal line indicates a genome-wide P value threshold of \(1.88 \times 10^{-8}\). The significance threshold is defined by the Bonferroni correction for multiple comparisons (\(0.05/(2.66 \times 10^{6}) = 1.88 \times 10^{-8}\)). f, Quantile-quantile plot for 2-kb sliding windows. In panels a, c and e, chromosomes are indicated by the colors of the dots. In all panels, STAAR-O is a two-sided test.
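
The three Bonferroni thresholds quoted in the caption above can be checked with a short Python sketch. The test counts (20,000 × 7, 20,000 and 2.66 × 10^6) and the 0.05 level are taken directly from the caption; the function and dictionary key names are illustrative only. Reading the factor of 7 as the seven listed functional categories is an interpretation of the caption, not something it states explicitly.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


ALPHA = 0.05  # family-wise error rate used in the caption

# Test counts as written in the caption; key names are hypothetical labels.
thresholds = {
    "protein_coding_gene_centric": bonferroni_threshold(ALPHA, 20_000 * 7),
    "ncrna_gene_centric": bonferroni_threshold(ALPHA, 20_000),
    "sliding_window_2kb": bonferroni_threshold(ALPHA, 2.66e6),
}

for name, value in thresholds.items():
    # Scientific notation with three significant figures, as in the caption
    print(f"{name}: {value:.2e}")
```

Rounded to three significant figures, the computed values agree with the 3.57 × 10^-7, 2.50 × 10^-6 and 1.88 × 10^-8 thresholds given in the caption.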

Extended Data Fig. 3 Manhattan plots and Q-Q plots for unconditional gene-centric noncoding analysis and sliding window analysis of low-density lipoprotein cholesterol (LDL-C) in the discovery phase (n = 21,015).

a, Manhattan plots for unconditional gene-centric noncoding analysis of protein-coding genes. The horizontal line indicates a genome-wide STAAR-O P value threshold of \(3.57 \times 10^{-7}\). The significance threshold is defined by the Bonferroni correction for multiple comparisons (\(0.05/(20,000 \times 7) = 3.57 \times 10^{-7}\)). Different symbols represent the STAAR-O P value of the protein-coding gene using different functional categories (upstream, downstream, UTR, promoter_CAGE, promoter_DHS, enhancer_CAGE, enhancer_DHS). Promoter_CAGE and promoter_DHS are the promoters that overlap Cap Analysis of Gene Expression (CAGE) sites and DNase hypersensitivity (DHS) sites for a given gene, respectively. Enhancer_CAGE and enhancer_DHS are the enhancers in GeneHancer-predicted regions that overlap CAGE sites and DHS sites for a given gene, respectively. b, Quantile-quantile plots for unconditional gene-centric noncoding analysis of protein-coding genes. Different symbols represent the STAAR-O P value of the gene using different functional categories (upstream, downstream, UTR, promoter_CAGE, promoter_DHS, enhancer_CAGE, enhancer_DHS). c, Manhattan plots for unconditional gene-centric noncoding analysis of ncRNA genes. The horizontal line indicates a genome-wide STAAR-O P value threshold of \(2.50 \times 10^{-6}\). The significance threshold is defined by the Bonferroni correction for multiple comparisons (\(0.05/20,000 = 2.50 \times 10^{-6}\)). d, Quantile-quantile plots for unconditional gene-centric noncoding analysis of ncRNA genes. e, Manhattan plot for 2-kb sliding windows. The horizontal line indicates a genome-wide P value threshold of \(1.88 \times 10^{-8}\). The significance threshold is defined by the Bonferroni correction for multiple comparisons (\(0.05/(2.66 \times 10^{6}) = 1.88 \times 10^{-8}\)). f, Quantile-quantile plot for 2-kb sliding windows. In panels a, c and e, chromosomes are indicated by the colors of the dots. In all panels, STAAR-O is a two-sided test.

Extended Data Fig. 4 Manhattan plots and Q-Q plots for unconditional gene-centric noncoding analysis and sliding window analysis of triglycerides (TGs) in the discovery phase (n = 21,015).

a, Manhattan plots for unconditional gene-centric noncoding analysis of protein-coding genes. The horizontal line indicates a genome-wide STAAR-O P value threshold of \(3.57 \times 10^{-7}\). The significance threshold is defined by the Bonferroni correction for multiple comparisons (\(0.05/(20,000 \times 7) = 3.57 \times 10^{-7}\)). Different symbols represent the STAAR-O P value of the protein-coding gene using different functional categories (upstream, downstream, UTR, promoter_CAGE, promoter_DHS, enhancer_CAGE, enhancer_DHS). Promoter_CAGE and promoter_DHS are the promoters that overlap Cap Analysis of Gene Expression (CAGE) sites and DNase hypersensitivity (DHS) sites for a given gene, respectively. Enhancer_CAGE and enhancer_DHS are the enhancers in GeneHancer-predicted regions that overlap CAGE sites and DHS sites for a given gene, respectively. b, Quantile-quantile plots for unconditional gene-centric noncoding analysis of protein-coding genes. Different symbols represent the STAAR-O P value of the gene using different functional categories (upstream, downstream, UTR, promoter_CAGE, promoter_DHS, enhancer_CAGE, enhancer_DHS). c, Manhattan plots for unconditional gene-centric noncoding analysis of ncRNA genes. The horizontal line indicates a genome-wide STAAR-O P value threshold of \(2.50 \times 10^{-6}\). The significance threshold is defined by the Bonferroni correction for multiple comparisons (\(0.05/20,000 = 2.50 \times 10^{-6}\)). d, Quantile-quantile plots for unconditional gene-centric noncoding analysis of ncRNA genes. e, Manhattan plot for 2-kb sliding windows. The horizontal line indicates a genome-wide P value threshold of \(1.88 \times 10^{-8}\). The significance threshold is defined by the Bonferroni correction for multiple comparisons (\(0.05/(2.66 \times 10^{6}) = 1.88 \times 10^{-8}\)). f, Quantile-quantile plot for 2-kb sliding windows. In panels a, c and e, chromosomes are indicated by the colors of the dots. In all panels, STAAR-O is a two-sided test.

Extended Data Fig. 5 Manhattan plots and Q-Q plots for unconditional gene-centric noncoding analysis and sliding window analysis of total cholesterol (TC) in the discovery phase (n = 21,015).

a, Manhattan plots for unconditional gene-centric noncoding analysis of protein-coding genes. The horizontal line indicates a genome-wide STAAR-O P value threshold of \(3.57 \times 10^{-7}\). The significance threshold is defined by the Bonferroni correction for multiple comparisons (\(0.05/(20,000 \times 7) = 3.57 \times 10^{-7}\)). Different symbols represent the STAAR-O P value of the protein-coding gene using different functional categories (upstream, downstream, UTR, promoter_CAGE, promoter_DHS, enhancer_CAGE, enhancer_DHS). Promoter_CAGE and promoter_DHS are the promoters that overlap Cap Analysis of Gene Expression (CAGE) sites and DNase hypersensitivity (DHS) sites for a given gene, respectively. Enhancer_CAGE and enhancer_DHS are the enhancers in GeneHancer-predicted regions that overlap CAGE sites and DHS sites for a given gene, respectively. b, Quantile-quantile plots for unconditional gene-centric noncoding analysis of protein-coding genes. Different symbols represent the STAAR-O P value of the gene using different functional categories (upstream, downstream, UTR, promoter_CAGE, promoter_DHS, enhancer_CAGE, enhancer_DHS). c, Manhattan plots for unconditional gene-centric noncoding analysis of ncRNA genes. The horizontal line indicates a genome-wide STAAR-O P value threshold of \(2.50 \times 10^{-6}\). The significance threshold is defined by the Bonferroni correction for multiple comparisons (\(0.05/20,000 = 2.50 \times 10^{-6}\)). d, Quantile-quantile plots for unconditional gene-centric noncoding analysis of ncRNA genes. e, Manhattan plot for 2-kb sliding windows. The horizontal line indicates a genome-wide P value threshold of \(1.88 \times 10^{-8}\). The significance threshold is defined by the Bonferroni correction for multiple comparisons (\(0.05/(2.66 \times 10^{6}) = 1.88 \times 10^{-8}\)). f, Quantile-quantile plot for 2-kb sliding windows. In panels a, c and e, chromosomes are indicated by the colors of the dots. In all panels, STAAR-O is a two-sided test.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Li, Z., Li, X., Zhou, H. et al. A framework for detecting noncoding rare-variant associations of large-scale whole-genome sequencing studies. Nat Methods 19, 1599–1611 (2022). https://doi.org/10.1038/s41592-022-01640-x

  • DOI: https://doi.org/10.1038/s41592-022-01640-x
