  • Technical Report

GATK-gCNV enables the discovery of rare copy number variants from exome sequencing data

An Author Correction to this article was published on 23 January 2024

This article has been updated

Abstract

Copy number variants (CNVs) are major contributors to genetic diversity and disease. While standardized methods, such as the Genome Analysis Toolkit (GATK), exist for detecting short variants, technical challenges have confounded uniform large-scale CNV analyses from whole-exome sequencing (WES) data. Given the profound impact of rare and de novo coding CNVs on genome organization and human disease, we developed GATK-gCNV, a flexible algorithm to discover rare CNVs from sequencing read-depth information, complete with open-source distribution via GATK. We benchmarked GATK-gCNV in 7,962 exomes from individuals in quartet families with matched genome sequencing and microarray data, finding up to 95% recall of rare coding CNVs at a resolution of more than two exons. We used GATK-gCNV to generate a reference catalog of rare coding CNVs in WES data from 197,306 individuals in the UK Biobank, and observed strong correlations between per-gene CNV rates and measures of mutational constraint, as well as rare CNV associations with multiple traits. In summary, GATK-gCNV is a tunable approach for sensitive and specific CNV discovery in WES data, with broad applications.
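As an illustration of what a recall figure of this kind measures, the following minimal Python sketch counts how many truth-set CNVs are recovered by an exome callset. The matching rule (same sample, same CNV type, at least 50% reciprocal overlap), the `Cnv` record and the toy coordinates are assumptions made for this example; they are not taken from the paper, whose benchmarking criteria are not described in this excerpt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cnv:
    """Hypothetical CNV record; coordinates are 1-based and inclusive."""
    sample: str
    contig: str
    start: int
    end: int
    kind: str  # "DEL" or "DUP"


def reciprocal_overlap(a: Cnv, b: Cnv) -> float:
    """Smaller of the two fractions of each interval covered by the other."""
    if a.contig != b.contig:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start) + 1
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start + 1), shared / (b.end - b.start + 1))


def recall(truth: list[Cnv], calls: list[Cnv], min_overlap: float = 0.5) -> float:
    """Fraction of truth CNVs matched by at least one call (assumed rule)."""
    if not truth:
        raise ValueError("truth set is empty; recall is undefined")
    hits = sum(
        any(
            c.sample == t.sample
            and c.kind == t.kind
            and reciprocal_overlap(t, c) >= min_overlap
            for c in calls
        )
        for t in truth
    )
    return hits / len(truth)


# Toy data, invented for illustration only.
truth = [
    Cnv("s1", "chr1", 100, 199, "DEL"),
    Cnv("s1", "chr2", 500, 999, "DUP"),
]
calls = [
    Cnv("s1", "chr1", 120, 219, "DEL"),  # 80 shared bases of 100: matched
    Cnv("s1", "chr2", 500, 600, "DUP"),  # covers about 20% of the truth CNV: missed
]
print(recall(truth, calls))
```

In practice the overlap threshold and the unit of comparison (bases versus exons) change the reported recall considerably, so any such number is only meaningful alongside its matching rule.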

Fig. 1: GATK-gCNV pipeline steps.
Fig. 2: Calling and benchmarking of GATK-gCNV callset in a cohort of more than 7,000 samples with matching deep WGS sequencing.
Fig. 3: A high-quality rare CNV callset was generated on 200,624 exomes from the UKBB using GATK-gCNV.

Data availability

The SSC benchmarking raw sequencing data can be accessed through NHGRI AnVIL; accession ID: phs000298; databank URL: https://anvilproject.org/data. SSC CNVs can be accessed through SFARIBase (base.sfari.org), accession IDs: SFARI_DS340921 (CNVs). Approval by the Simons Foundation for Autism Research Initiative (SFARI) is required.

Access to the UKBB raw sequencing data and the CNV data generated here will be provided by the UKBB (https://www.ukbiobank.ac.uk).

GENCODE V33 annotation can be found at https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_33/gencode.v33.annotation.gtf.gz
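For readers who want to derive exon-level target intervals from a GTF annotation such as the one above, the sketch below shows one possible approach in Python: collect exon records and merge those that overlap or abut. It is not the procedure the authors used to build their interval set, and the records in `SAMPLE_GTF` are toy lines invented for this example rather than real GENCODE content.

```python
import io


def exon_intervals(gtf_lines):
    """Return sorted, de-duplicated (contig, start, end) tuples for exon records.

    GTF is tab-separated with 1-based, inclusive coordinates; column 3 holds
    the feature type and columns 4 and 5 hold start and end.
    """
    intervals = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header or comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        intervals.add((fields[0], int(fields[3]), int(fields[4])))
    return sorted(intervals)


def merge_overlapping(intervals):
    """Merge sorted intervals on the same contig that overlap or are adjacent."""
    merged = []
    for contig, start, end in intervals:
        if merged and merged[-1][0] == contig and start <= merged[-1][2] + 1:
            prev = merged[-1]
            merged[-1] = (contig, prev[1], max(prev[2], end))
        else:
            merged.append((contig, start, end))
    return merged


# Toy records (hypothetical contig "chrT"), not real annotation.
SAMPLE_GTF = "\n".join([
    "##description: toy records for illustration",
    'chrT\ttoy\tgene\t100\t900\t.\t+\t.\tgene_id "G1";',
    'chrT\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chrT\ttoy\texon\t150\t250\t.\t+\t.\tgene_id "G1"; transcript_id "T2";',
    'chrT\ttoy\texon\t800\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
])

targets = merge_overlapping(exon_intervals(io.StringIO(SAMPLE_GTF)))
print(targets)
```

For the real, gzip-compressed file, the same functions could be fed from `gzip.open(path, "rt")`, where `path` is a local copy of the download.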

Code availability

GATK-gCNV is distributed as part of the GATK jar release. For an example workspace on Terra, with recommended parameters, please see https://app.terra.bio/#workspaces/help-gatk/Germline-CNVs-GATK4.

GATK-gCNV evaluation and benchmarking code is available at https://github.com/broadinstitute/GATK-gCNV-publication/tree/master/evaluation_code.

CMA-CNV validation code is available at https://github.com/talkowski-lab/cnv-validation.

GenomeSTRiP version 2.00.1982 is available at http://software.broadinstitute.org/software/genomestrip/.

The MoChA version 2022-01-14 WDL is available at https://software.broadinstitute.org/software/mocha/mocha.20220114.wdl.

Acknowledgements

We thank L. Lichtenstein, Y. Farjoun, B. Neale and N. Lennon for insightful discussions at various stages of this project, and S. Zaheri for carefully reviewing and providing feedback on the manuscript. This work was supported by grants from the Simons Foundation for Autism Research Initiative (573206), the SPARK project and SPARK analysis projects (606362 and 608540) and the National Institutes of Health (MH115957, HD081256, HG008895 and HG011450). J.M.F. was supported by an Autism Speaks Postdoctoral Fellowship and R.L.C. was supported by NSF GRFP 2017240332.

Author information

Contributions

M.B., E.B. and M.E.T. designed these studies and analyses. M.B., D.I.B. and S.K.L. developed and implemented the GATK-gCNV model and the inference algorithm. A.N.S. contributed model enhancements and developed sample-clustering and batch-processing workflows. X.Z., A.N.S. and J.M.F. conducted benchmarking studies of GATK-gCNV performance. A.N.S., M.B. and S.K.L. developed WDL workflows for Terra integration and scalable analysis. M.E.T., J.M.F., E.B., H.B., S.K.L., M.W. and L.D.G. supervised aspects of this project at various stages of development. J.M.F., R.L.C., H.B. and K.J.K. contributed to association analyses. I.W. and J.M.F. generated the CNV callsets. J.M.F., I.W., R.L.C., A.S.J. and H.B. conducted quality control on generated callsets. M.B., J.M.F., R.L.C., H.B. and M.E.T. wrote the manuscript, which was edited by all authors.

Corresponding authors

Correspondence to Mehrtash Babadi or Michael E. Talkowski.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Genetics thanks Birte Kehr, Christian Marshall and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Note, Supplementary Figs. 1–10 and Supplementary Tables 2 and 3.

Reporting Summary

Peer Review File

Supplementary Table 1

CNV-phenotype association analysis in UK Biobank.

Supplementary Data 1

Set of 7,981 target regions for PCA batching.

Supplementary Data 2

Standardized set of intervals based on GENCODE V33 annotation.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Babadi, M., Fu, J.M., Lee, S.K. et al. GATK-gCNV enables the discovery of rare copy number variants from exome sequencing data. Nat Genet 55, 1589–1597 (2023). https://doi.org/10.1038/s41588-023-01449-0

  • DOI: https://doi.org/10.1038/s41588-023-01449-0
