Inferring compound heterozygosity from large-scale exome sequencing data

Abstract

Recessive diseases arise when both copies of a gene are impacted by a damaging genetic variant. When a patient carries two potentially causal variants in a gene, accurate diagnosis requires determining that these variants occur on different copies of the chromosome (that is, are in trans) rather than on the same copy (that is, in cis). However, current approaches for determining phase, beyond parental testing, are limited in clinical settings. Here we developed a strategy for inferring phase for rare variant pairs within genes, leveraging genotypes observed in the Genome Aggregation Database (v2, n = 125,748 exomes). Our approach estimates phase with 96% accuracy, both in trio data and in patients with Mendelian conditions and presumed causal compound heterozygous variants. We provide a public resource of phasing estimates for coding variants and counts per gene of rare variants in trans that can aid interpretation of rare co-occurring variants in the context of recessive disease.

Fig. 1: Overview of phasing approach using the expectation–maximization method in gnomAD.
Fig. 2: Phasing accuracy as a function of variant AF.
Fig. 3: Phasing accuracy using population-specific versus cosmopolitan Ptrans estimates.
Fig. 4: Phasing accuracy as a function of distance between variant pairs.
Fig. 5: Counts of genes with variants in trans in gnomAD.

Data availability

The gnomAD v2 dataset can be accessed at https://gnomad.broadinstitute.org. We made use of prior quality control processing of these and related data. In addition, we downloaded HapMap2 genetic maps from https://github.com/joepickrell/1000-genomes-genetic-maps.

We provide both web-based look-up tools and downloads for the data generated here. A look-up tool to find the likely co-occurrence pattern between two rare (global AF in gnomAD exomes <5%) coding, flanking intronic (from position −1 to −3 in acceptor sites and +1 to +8 in donor sites) or 5′/3′ UTR variants can be found at https://gnomad.broadinstitute.org/variant-cooccurrence

Additionally, we display the per-gene counts tables that provide the details of the number of individuals with two rare variants, stratified by AF and functional consequence, on each gene’s main page. One table provides the details of counts of individuals with two heterozygous variants and includes the predicted phase, while the second table provides the details of individuals with homozygous variants. Both can be found by clicking on the ‘Variant Co-occurrence’ tab on each gene’s main page.

All variant co-occurrence tables can be downloaded from https://gnomad.broadinstitute.org/downloads#v2-variant-cooccurrence

Code availability

The code used to compute Ptrans estimates for variant pairs and to determine the number of individuals carrying rare, compound heterozygous variants can be found at https://github.com/broadinstitute/gnomad_chets

The code has also been uploaded to Zenodo (https://doi.org/10.5281/zenodo.10034663).

Acknowledgements

We thank all members of the gnomAD team for helpful comments and suggestions, and we particularly recognize the members of the gnomAD methods and browser teams who worked hard over many years to provide cleaned datasets, easy-to-use browsers and visualizations. This work was supported by the National Human Genome Research Institute (NHGRI; U24HG011450 to H.L.R. and M.J.D.; UM1HG008900 to D.G.M. and H.L.R.; U01HG011755 to A.O.-L. and H.L.R.).

Author information

Contributions

M.H.G., L.C.F., S.L.S., J.K.G., A.O.-L., K.J.K., D.G.M. and K.E.S. conceived and designed experiments. M.H.G., L.C.F., S.L.S. and J.K.G. performed the analyses. N.A.W., P.W.D. and M.S. developed visualizations for the web browser. E.G. and M.S.-B. performed variant curation. S.B., G.T., B.M.N., J.N.H., H.L.R., M.J.D., A.O.-L. and K.J.K. provided data and analysis suggestions. J.N.H., D.G.M. and K.E.S. supervised the work. M.H.G., L.C.F., S.L.S., J.K.G. and K.E.S. completed the primary writing of the manuscript with input and approval of the final version from all other authors.

Corresponding author

Correspondence to Kaitlin E. Samocha.

Ethics declarations

Competing interests

L.C.F. is currently an employee of, and owns stock in, Vertex Pharmaceuticals. B.M.N. is a member of the scientific advisory board at Deep Genomics and Neumora (f/k/a RBNC Therapeutics). H.L.R. has received support from Illumina and Microsoft to support rare disease gene discovery and diagnosis. M.J.D. is a founder of Maze Therapeutics and Neumora Therapeutics (f/k/a RBNC Therapeutics). A.O.-L. has consulted for Tome Biosciences and Ono Pharma USA and is a member of the scientific advisory board for Congenica and the Simons Foundation SPARK for Autism study. K.J.K. is a consultant for Tome Biosciences and Vor Biosciences and a member of the Scientific Advisory Board of Nurture Genomics. D.G.M. is a paid advisor to GlaxoSmithKline, Insitro, Variant Bio and Overtone Therapeutics and has received research support from AbbVie, Astellas, Biogen, BioMarin, Eisai, Google, Merck, Microsoft, Pfizer and Sanofi-Genzyme. K.E.S. has received support from Microsoft for work related to rare disease diagnostics. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Genetics thanks Arnaldur Gylfason, Tobias Marschall and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Publicly available browser for sharing phasing data.

a, Sample gnomAD browser output for two variants (GRCh37 1-55505647-G-T and 1-55523855-G-A) in the gene PCSK9. At the top, a table subdivided by genetic ancestry group displays how many individuals in gnomAD v2 from each group are consistent with the two variants occurring on different haplotypes (trans), and how many are consistent with the variants occurring on the same haplotype (cis). Below that, a 3×3 table contains the 9 possible combinations of genotypes for the two variants of interest. The number of individuals in gnomAD v2 that fall in each combination is shown, colored by whether the combination is consistent with the variants falling on different haplotypes (red) or the same haplotype (blue), or is indeterminate (purple). The estimated counts for the four possible haplotypes of the two variants, as calculated by the EM algorithm, are displayed on the bottom right. The probability of being in trans for this particular pair of variants is >99%. b, Variant co-occurrence tables on the gene landing page. For each gene (GBA1 shown), the top table lists the number of individuals carrying pairs of rare heterozygous variants by inferred phase, allele frequency (AF), and predicted functional consequence. The number of individuals with homozygous variants is tabulated in the same manner and presented as a comparison below. AF thresholds of ≤ 5%, ≤ 1%, and ≤ 0.5% are displayed across six predicted functional consequences (combinations of pLoF, various evidence strengths of predicted pathogenicity for missense variants, and synonymous variants). Both variants in the variant pair must be annotated with a consequence at least as severe as the consequence listed (that is, pLoF + strong missense also includes pLoF + pLoF).
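
The step from the 3×3 genotype table to the four estimated haplotype counts in panel a can be illustrated with a minimal Python sketch of a two-variant expectation–maximization procedure. This is an illustration under stated assumptions, not the code in the gnomad_chets repository: the function names, the starting value, the convergence tolerance and the example counts are all hypothetical. It also assumes that the probability of being in trans is taken as the product of the two single-variant haplotype counts divided by the sum of that product and the product of the reference-only and double-variant haplotype counts.

```python
from typing import Optional, Sequence, Tuple

Counts = Tuple[float, float, float, float]  # (ref/ref, ref/alt, alt/ref, alt/alt)


def em_haplotype_counts(
    gt: Sequence[Sequence[int]], max_iter: int = 100, tol: float = 1e-9
) -> Counts:
    """Estimate haplotype counts from a 3x3 genotype table.

    gt[i][j] is the number of individuals carrying i alternate alleles at
    variant 1 and j alternate alleles at variant 2 (hypothetical layout).
    """
    n11 = gt[1][1]  # double heterozygotes: the only phase-ambiguous cell
    # Haplotypes contributed by the eight unambiguous genotype combinations.
    base_rr = 2 * gt[0][0] + gt[0][1] + gt[1][0]
    base_ra = gt[0][1] + 2 * gt[0][2] + gt[1][2]
    base_ar = gt[1][0] + 2 * gt[2][0] + gt[2][1]
    base_aa = gt[1][2] + gt[2][1] + 2 * gt[2][2]

    p_cis = 0.5  # assumed starting point: double hets split evenly
    for _ in range(max_iter):
        # E-step: distribute double heterozygotes between cis and trans.
        rr = base_rr + n11 * p_cis
        aa = base_aa + n11 * p_cis
        ra = base_ra + n11 * (1.0 - p_cis)
        ar = base_ar + n11 * (1.0 - p_cis)
        # M-step: update the probability that a double het is in cis.
        denom = rr * aa + ra * ar
        new_p = rr * aa / denom if denom > 0 else p_cis
        converged = abs(new_p - p_cis) < tol
        p_cis = new_p
        if converged:
            break

    return (
        base_rr + n11 * p_cis,
        base_ra + n11 * (1.0 - p_cis),
        base_ar + n11 * (1.0 - p_cis),
        base_aa + n11 * p_cis,
    )


def p_trans(counts: Counts) -> Optional[float]:
    """Assumed definition: share of the trans configuration among double hets."""
    rr, ra, ar, aa = counts
    denom = rr * aa + ra * ar
    return None if denom == 0 else ra * ar / denom


# Made-up example: the variants are only ever seen apart, so trans is favored.
trans_like = [[90, 5, 0], [5, 0, 0], [0, 0, 0]]
# Made-up example: the variants are only ever seen together, so cis is favored.
cis_like = [[90, 0, 0], [0, 10, 0], [0, 0, 0]]

print(p_trans(em_haplotype_counts(trans_like)))
print(p_trans(em_haplotype_counts(cis_like)))
```

In this sketch only the double-heterozygote cell is reassigned at each iteration, since every other genotype combination determines its two haplotypes directly. The function returns `None` when neither configuration has any support, for example when one of the variants is never observed.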

Supplementary information

Supplementary Information

Supplementary Note and Supplementary Figs. 1–9.

Reporting Summary

Peer Review File

Supplementary Tables

Supplementary Tables 1 and 2: The dataset includes phasing information for diagnostic variants from CMG patients and manual curation of rare, compound heterozygous loss-of-function variants.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Guo, M.H., Francioli, L.C., Stenton, S.L. et al. Inferring compound heterozygosity from large-scale exome sequencing data. Nat Genet 56, 152–161 (2024). https://doi.org/10.1038/s41588-023-01608-3

  • DOI: https://doi.org/10.1038/s41588-023-01608-3
