Abstract
Recessive diseases arise when both copies of a gene are impacted by a damaging genetic variant. When a patient carries two potentially causal variants in a gene, accurate diagnosis requires determining that these variants occur on different copies of the chromosome (that is, are in trans) rather than on the same copy (that is, in cis). However, current approaches for determining phase, beyond parental testing, are limited in clinical settings. Here we developed a strategy for inferring phase for rare variant pairs within genes, leveraging genotypes observed in the Genome Aggregation Database (v2, n = 125,748 exomes). Our approach estimates phase with 96% accuracy, both in trio data and in patients with Mendelian conditions and presumed causal compound heterozygous variants. We provide a public resource of phasing estimates for coding variants and counts per gene of rare variants in trans that can aid interpretation of rare co-occurring variants in the context of recessive disease.
Data availability
The gnomAD v2 dataset can be accessed at https://gnomad.broadinstitute.org. We made use of prior quality control processing of these and related data. In addition, we downloaded HapMap2 genetic maps from https://github.com/joepickrell/1000-genomes-genetic-maps.
We provide both web-based look-up tools and downloads for the data generated here. A look-up tool to find the likely co-occurrence pattern between two rare (global AF in gnomAD exomes <5%) coding, flanking intronic (from position −1 to −3 in acceptor sites and +1 to +8 in donor sites) or 5′/3′ UTR variants can be found at https://gnomad.broadinstitute.org/variant-cooccurrence
Additionally, we display the per-gene counts tables that provide the details of the number of individuals with two rare variants, stratified by AF and functional consequence, on each gene’s main page. One table provides the details of counts of individuals with two heterozygous variants and includes the predicted phase, while the second table provides the details of individuals with homozygous variants. Both can be found by clicking on the ‘Variant Co-occurrence’ tab on each gene’s main page.
All variant co-occurrence tables can be downloaded from https://gnomad.broadinstitute.org/downloads#v2-variant-cooccurrence
Code availability
The code used to calculate Ptrans for variant pairs and to determine the number of individuals carrying rare, compound heterozygous variants can be found at https://github.com/broadinstitute/gnomad_chets
The code has also been uploaded to Zenodo (https://doi.org/10.5281/zenodo.10034663).
Acknowledgements
We thank all members of the gnomAD team for helpful comments and suggestions, and we particularly recognize the members of the gnomAD methods and browser teams who worked hard over many years to provide cleaned datasets, easy-to-use browsers and visualizations. This work was supported by the National Human Genome Research Institute (NHGRI; U24HG011450 to H.L.R. and M.J.D.; UM1HG008900 to D.G.M. and H.L.R.; U01HG011755 to A.O.-L. and H.L.R.).
Author information
Contributions
M.H.G., L.C.F., S.L.S., J.K.G., A.O.-L., K.J.K., D.G.M. and K.E.S. conceived and designed experiments. M.H.G., L.C.F., S.L.S. and J.K.G. performed the analyses. N.A.W., P.W.D. and M.S. developed visualizations for the web browser. E.G. and M.S.-B. performed variant curation. S.B., G.T., B.M.N., J.N.H., H.L.R., M.J.D., A.O.-L. and K.J.K. provided data and analysis suggestions. J.N.H., D.G.M. and K.E.S. supervised the work. M.H.G., L.C.F., S.L.S., J.K.G. and K.E.S. completed the primary writing of the manuscript with input and approval of the final version from all other authors.
Ethics declarations
Competing interests
L.C.F. is currently an employee of, and owns stock in, Vertex Pharmaceuticals. B.M.N. is a member of the scientific advisory board at Deep Genomics and Neumora (f/k/a RBNC Therapeutics). H.L.R. has received support from Illumina and Microsoft to support rare disease gene discovery and diagnosis. M.J.D. is a founder of Maze Therapeutics and Neumora Therapeutics (f/k/a RBNC Therapeutics). A.O.-L. has consulted for Tome Biosciences and Ono Pharma USA and is a member of the scientific advisory board for Congenica and the Simons Foundation SPARK for Autism study. K.J.K. is a consultant for Tome Biosciences and Vor Biosciences and a member of the Scientific Advisory Board of Nurture Genomics. D.G.M. is a paid advisor to GlaxoSmithKline, Insitro, Variant Bio and Overtone Therapeutics and has received research support from AbbVie, Astellas, Biogen, BioMarin, Eisai, Google, Merck, Microsoft, Pfizer and Sanofi-Genzyme. K.E.S. has received support from Microsoft for work related to rare disease diagnostics. The remaining authors declare no competing interests.
Peer review
Peer review information
Nature Genetics thanks Arnaldur Gylfason, Tobias Marschall and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Extended data
Extended Data Fig. 1 Publicly available browser for sharing phasing data.
a, Sample gnomAD browser output for two variants (GRCh37 1-55505647-G-T and 1-55523855-G-A) in the gene PCSK9. At the top, a table subdivided by genetic ancestry group displays how many individuals in gnomAD v2 from each group are consistent with the two variants occurring on different haplotypes (trans) and how many are consistent with the variants occurring on the same haplotype (cis). Below that, a 3×3 table contains the nine possible combinations of genotypes for the two variants of interest. The number of individuals in gnomAD v2 who fall into each combination is shown, and each cell is colored by whether it is consistent with the variants falling on different haplotypes (red) or the same haplotype (blue), or whether it is indeterminate (purple). The estimated counts of the four possible haplotypes for the two variants, as calculated by the expectation-maximization (EM) algorithm, are displayed on the bottom right. The probability of being in trans for this particular pair of variants is >99%. b, Variant co-occurrence tables on the gene landing page. For each gene (GBA1 shown), the top table lists the number of individuals carrying pairs of rare heterozygous variants by inferred phase, allele frequency (AF) and predicted functional consequence. The number of individuals with homozygous variants is tabulated in the same manner and presented for comparison below. AF thresholds of ≤5%, ≤1% and ≤0.5% are displayed across six predicted functional consequences (combinations of pLoF, various evidence strengths of predicted pathogenicity for missense variants, and synonymous variants). Both variants in a pair must be annotated with a consequence at least as severe as the one listed (that is, pLoF + strong missense also includes pLoF + pLoF).
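The relationship between the 3×3 genotype table, the EM-estimated haplotype counts and the probability of being in trans can be illustrated with a minimal Python sketch. It is not the gnomAD implementation (that code is in the repository listed under Code availability). The function names, the starting values and the toy genotype counts are hypothetical, and the Ptrans formula is an assumption: the product of the two single-variant haplotype counts divided by the sum of that product and the product of the reference and double-variant haplotype counts.

```python
def em_haplotype_counts(gt, max_iter=200, tol=1e-10):
    """Estimate counts of the four two-variant haplotypes with a simple EM.

    gt[i][j] is the number of individuals carrying i alternate alleles at
    variant 1 and j alternate alleles at variant 2 (the 3x3 genotype table).
    Returns [ref_ref, ref_alt, alt_ref, alt_alt], where ref_alt carries only
    variant 2 and alt_ref carries only variant 1.
    """
    n_double_het = gt[1][1]  # the only genotype whose phase is ambiguous
    # Haplotypes that can be read off directly from unambiguous genotypes.
    fixed = [
        2 * gt[0][0] + gt[0][1] + gt[1][0],  # ref_ref
        2 * gt[0][2] + gt[0][1] + gt[1][2],  # ref_alt
        2 * gt[2][0] + gt[1][0] + gt[2][1],  # alt_ref
        2 * gt[2][2] + gt[1][2] + gt[2][1],  # alt_alt
    ]
    # Assumed starting point: split double heterozygotes evenly.
    counts = [c + n_double_het / 2 for c in fixed]
    for _ in range(max_iter):
        # E-step: share of double heterozygotes expected to be in cis.
        cis = counts[0] * counts[3]
        trans = counts[1] * counts[2]
        w_cis = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: add the expected haplotypes of double heterozygotes.
        updated = [
            fixed[0] + n_double_het * w_cis,
            fixed[1] + n_double_het * (1 - w_cis),
            fixed[2] + n_double_het * (1 - w_cis),
            fixed[3] + n_double_het * w_cis,
        ]
        converged = max(abs(u - c) for u, c in zip(updated, counts)) < tol
        counts = updated
        if converged:
            break
    return counts


def p_trans(counts):
    """Assumed Ptrans formula; returns None when it is undefined."""
    ref_ref, ref_alt, alt_ref, alt_alt = counts
    cis = ref_ref * alt_alt
    trans = ref_alt * alt_ref
    if cis + trans == 0:
        return None
    return trans / (cis + trans)


# Toy tables (made-up numbers, not gnomAD data).
never_together = [[1000, 10, 0], [10, 0, 0], [0, 0, 0]]   # carriers never overlap
always_together = [[1000, 0, 0], [0, 10, 0], [0, 0, 0]]   # carriers always overlap

print(p_trans(em_haplotype_counts(never_together)))
print(p_trans(em_haplotype_counts(always_together)))
```

In the first toy table the two variants are never seen in the same individual, so no haplotype carrying both is estimated and the pair looks like it is in trans. In the second, every carrier has both variants, so the EM assigns nearly all double heterozygotes to the cis configuration.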
Supplementary information
Supplementary Information
Supplementary Note and Supplementary Figs. 1–9.
Supplementary Tables
Supplementary Tables 1 and 2: The dataset includes phasing information for diagnostic variants from CMG patients and manual curation of rare, compound heterozygous loss-of-function variants.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Guo, M.H., Francioli, L.C., Stenton, S.L. et al. Inferring compound heterozygosity from large-scale exome sequencing data. Nat Genet 56, 152–161 (2024). https://doi.org/10.1038/s41588-023-01608-3
DOI: https://doi.org/10.1038/s41588-023-01608-3