Jasmine and Iris: population-scale structural variant comparison and analysis

Abstract

The availability of long reads is revolutionizing studies of structural variants (SVs). However, because SVs vary across individuals and are discovered through imprecise read technologies and methods, they can be difficult to compare. Addressing this, we present Jasmine and Iris (https://github.com/mkirsche/Jasmine/), for fast and accurate SV refinement, comparison and population analysis. Using an SV proximity graph, Jasmine outperforms six widely used comparison methods, including reducing the rate of Mendelian discordance in trio datasets by more than fivefold, and reveals a set of high-confidence de novo SVs confirmed by multiple technologies. We also present a unified callset of 122,813 SVs and 82,379 indels from 31 samples of diverse ancestry sequenced with long reads. We genotype these variants in 1,317 samples from the 1000 Genomes Project and the Genotype-Tissue Expression project with DNA and RNA-sequencing data and assess their widespread impact on gene expression, including within medically relevant genes.
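The idea of comparing calls through an SV proximity graph can be illustrated with a small sketch. This is not the Jasmine implementation: the tuple layout, the `max_dist` threshold, the Euclidean distance over (start, length), and the rule that a merged cluster holds at most one call per sample are all assumptions made for illustration, and the edge ordering borrows from Kruskal's algorithm (ref. 27).

```python
from itertools import combinations


def merge_svs(variants, max_dist=100.0):
    """Greedy merging over a toy SV proximity graph (illustrative only).

    variants: list of (sample, sv_type, start, length) tuples.
    Returns clusters as sorted lists of indices into `variants`.
    """
    parent = list(range(len(variants)))
    # Samples represented in each cluster, indexed by the cluster root.
    samples = [{v[0]} for v in variants]

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Edges join calls of the same type from different samples that lie
    # within max_dist of each other in (start, length) space.
    edges = []
    for i, j in combinations(range(len(variants)), 2):
        sample_i, type_i, start_i, len_i = variants[i]
        sample_j, type_j, start_j, len_j = variants[j]
        if type_i != type_j or sample_i == sample_j:
            continue
        dist = ((start_i - start_j) ** 2 + (len_i - len_j) ** 2) ** 0.5
        if dist <= max_dist:
            edges.append((dist, i, j))

    # Process the closest pairs first; skip a merge if it would put two
    # calls from the same sample into one cluster.
    for _, i, j in sorted(edges):
        root_i, root_j = find(i), find(j)
        if root_i == root_j or samples[root_i] & samples[root_j]:
            continue
        parent[root_j] = root_i
        samples[root_i] |= samples[root_j]

    clusters = {}
    for i in range(len(variants)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical calls from three samples.
calls = [
    ("A", "INS", 1000, 300),
    ("B", "INS", 1010, 305),
    ("C", "INS", 1005, 298),
    ("A", "INS", 1060, 300),
    ("B", "DEL", 1000, 300),
]
clusters = merge_svs(calls)
```

In this toy input the three nearby insertions from samples A, B and C merge into one cluster, while the second insertion from sample A and the deletion stay separate. The all-pairs edge construction is quadratic; a spatial index such as a k-d tree (ref. 28) would be the usual way to find nearby calls at scale.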

Fig. 1: Structural variant inference pipeline.
Fig. 2: Mendelian discordance in the HG002 Ashkenazim trio.
Fig. 3: Structural variant inference across sequencing technologies in HG002.
Fig. 4: De novo variant discovery in HG002.
Fig. 5: Population-scale inference from public datasets.
Fig. 6: Functional impact of structural variants from Jasmine.
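The Mendelian discordance metric named in the abstract and in Fig. 2 can be illustrated with a short sketch. The definition used here is a simplifying assumption, not the paper's exact procedure: a merged variant counts as discordant if it is called in the child but in neither parent, and the rate is taken over all variants seen in the trio. The function name and variant IDs are hypothetical.

```python
def mendelian_discordance(child, father, mother):
    """Fraction of trio variants called in the child but in neither parent.

    Each argument is a set of merged variant IDs called in that sample.
    Presence/absence only; genotypes are ignored in this simplified sketch.
    """
    all_variants = child | father | mother
    if not all_variants:
        return 0.0
    discordant = child - (father | mother)
    return len(discordant) / len(all_variants)


# Hypothetical merged callset for one trio.
child = {"v1", "v2", "v3", "v4"}
father = {"v1", "v2", "v5"}
mother = {"v2", "v3"}
rate = mendelian_discordance(child, father, mother)
```

Here only `v4` is child-specific, so one of the five trio variants is discordant. Under this definition, a merging method that fails to match a child's call to the equivalent parental call inflates the rate, which is why the metric is informative about comparison quality.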

Data availability

The sequencing data used in this study are available from the publications listed in Supplementary Table 1 and Supplementary Table 2. All variant calls and associations are available at http://data.schatz-lab.org/jasmine/.

Code availability

The Jasmine and Iris code and documentation are available open source at https://github.com/mkirsche/Jasmine/ and https://github.com/mkirsche/Iris/. The versions used in the paper are archived in Zenodo for Jasmine (ref. 62) and Iris (ref. 63). These methods are also available in Bioconda and Galaxy to simplify use on the command line or within the Galaxy graphical user interface. The versions of all software packages used in the manuscript are described in Supplementary Table 3.

References

  1. Alonge, M. et al. Major impacts of widespread structural variation on gene expression and crop improvement in tomato. Cell 182, 145–161 (2020).

  2. Alkan, C., Coe, B. P. & Eichler, E. E. Genome structural variation discovery and genotyping. Nat. Rev. Genet. 12, 363–376 (2011).

  3. Chiang, C. et al. The impact of structural variation on human gene expression. Nat. Genet. 49, 692–699 (2017).

  4. Aganezov, S. et al. Comprehensive analysis of structural variants in breast cancer genomes using single-molecule sequencing. Genome Res. 30, 1258–1273 (2020).

  5. Nattestad, M. et al. Complex rearrangements and oncogene amplifications revealed by long-read DNA and RNA sequencing of a breast cancer cell line. Genome Res. 28, 1126–1135 (2018).

  6. Brandler, W. M. et al. Paternally inherited cis-regulatory structural variants are associated with autism. Science 360, 327–331 (2018).

  7. Jeffares, D. C. et al. Transient structural variations have strong effects on quantitative traits and reproductive isolation in fission yeast. Nat. Commun. 8, 14061 (2017).

  8. Sedlazeck, F. J., Lee, H., Darby, C. A. & Schatz, M. C. Piercing the dark matter: bioinformatics of long-range sequencing and mapping. Nat. Rev. Genet. 19, 329–346 (2018).

  9. Mahmoud, M. et al. Structural variant calling: the long and the short of it. Genome Biol. 20, 246 (2019).

  10. Zook, J. M. et al. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat. Biotechnol. 32, 246–251 (2014).

  11. Sirén, J. et al. Pangenomics enables genotyping of known structural variants in 5202 diverse genomes. Science 374, abg8871 (2021).

  12. Narzisi, G. et al. Accurate de novo and transmitted indel detection in exome-capture data using microassembly. Nat. Methods 11, 1033–1036 (2014).

  13. Korlach, J. et al. Real-time DNA sequencing from single polymerase molecules. Methods Enzymol. 472, 431–455 (2010).

  14. Jain, M., Olsen, H. E., Paten, B. & Akeson, M. The Oxford Nanopore MinION: delivery of nanopore sequencing to the genomics community. Genome Biol. 17, 239 (2016).

  15. Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162 (2019).

  16. Goodwin, S., McPherson, J. D. & McCombie, W. R. Coming of age: ten years of next-generation sequencing technologies. Nat. Rev. Genet. 17, 333–351 (2016).

  17. Jain, C., Rhie, A., Hansen, N. F., Koren, S. & Phillippy, A. M. Long-read mapping to repetitive reference sequences using Winnowmap2. Nat. Methods https://doi.org/10.1038/s41592-022-01457-8 (2022).

  18. Sedlazeck, F. J. et al. Accurate detection of complex structural variations using single-molecule sequencing. Nat. Methods 15, 461–468 (2018).

  19. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).

  20. Jiang, T. et al. Long-read-based human genomic structural variation detection with cuteSV. Genome Biol. 21, 189 (2020).

  21. Chaisson, M. J. P. et al. Multi-platform discovery of haplotype-resolved structural variation in human genomes. Nat. Commun. 10, 1784 (2019).

  22. Audano, P. A. et al. Characterizing the major structural variant alleles of the human genome. Cell 176, 663–675 (2019).

  23. Beyter, D. et al. Long-read sequencing of 3,622 Icelanders provides insight into the role of structural variants in human diseases and other traits. Nat. Genet. https://doi.org/10.1038/s41588-021-00865-4 (2021).

  24. Byrska-Bishop, M. et al. High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. Cell 185, 3426–3440 (2022).

  25. GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020).

  26. Vaser, R., Sović, I., Nagarajan, N. & Šikić, M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 27, 737–746 (2017).

  27. Kruskal, J. B. On the shortest spanning subtree of a graph and the traveling salesman problem. Proc. Am. Math. Soc. https://doi.org/10.1090/s0002-9939-1956-0078686-7 (1956).

  28. Bentley, J. L. Multidimensional binary search trees used for associative searching. Comm. ACM https://doi.org/10.1145/361002.361007 (1975).

  29. Jalili, V. et al. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2020 update. Nucleic Acids Res. 48, W395–W402 (2020).

  30. Iossifov, I. et al. The contribution of de novo coding mutations to autism spectrum disorder. Nature 515, 216–221 (2014).

  31. Renaux-Petel, M. et al. Contribution of de novo and mosaic mutations to Li-Fraumeni syndrome. J. Med. Genet. 55, 173–180 (2018).

  32. Veltman, J. A. & Brunner, H. G. De novo mutations in human genetic disease. Nat. Rev. Genet. https://doi.org/10.1038/nrg3241 (2012).

  33. Belyeu, J. R. et al. De novo structural mutation rates and gamete-of-origin biases revealed through genome sequencing of 2,396 families. Am. J. Hum. Genet. 108, 597–607 (2021).

  34. Shi, J. et al. Structural variant selection for high-altitude adaptation using single-molecule long-read sequencing. Preprint at bioRxiv https://doi.org/10.1101/2021.03.27.436702 (2021).

  35. Ebert, P. et al. Haplotype-resolved diverse human genomes and integrated analysis of structural variation. Science 372, eabf7117 (2021).

  36. Larson, D. E. et al. svtools: population-scale analysis of structural variation. Bioinformatics 35, 4782–4787 (2019).

  37. Eggertsson, H. P. et al. GraphTyper2 enables population-scale genotyping of structural variation using pangenome graphs. Nat. Commun. 10, 5402 (2019).

  38. Cooper, G. M. et al. A copy number variation morbidity map of developmental delay. Nat. Genet. 43, 838–846 (2011).

  39. Zook, J. M. et al. A robust benchmark for detection of germline large deletions and insertions. Nat. Biotechnol. 38, 1347–1355 (2020).

  40. Ellegren, H. Microsatellites: simple sequences with complex evolution. Nat. Rev. Genet. https://doi.org/10.1038/nrg1348 (2004).

  41. Ranallo-Benavidez, T. R. et al. Optimized sample selection for cost-efficient long-read population sequencing. Genome Res. https://doi.org/10.1101/gr.264879.120 (2021).

  42. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature https://doi.org/10.1038/nature15393 (2015).

  43. Chen, S. et al. Paragraph: a graph-based structural variant genotyper for short-read sequence data. Genome Biol. 20, 291 (2019).

  44. Lappalainen, T. et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501, 506–511 (2013).

  45. Hormozdiari, F., Kostem, E., Kang, E. Y., Pasaniuc, B. & Eskin, E. Identifying causal variants at loci with multiple signals of association. Genetics 198, 497–508 (2014).

  46. Schatz, M. C. et al. Inverting the model of genomics data sharing with the NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space. Cell Genom. 2, 100085 (2022).

  47. Scott, A. J., Chiang, C. & Hall, I. M. Structural variants are a major source of gene expression differences in humans and often affect multiple nearby genes. Genome Res. https://doi.org/10.1101/gr.275488.121 (2021).

  48. Mezzar, S. et al. Phytol-induced pathology in 2-hydroxyacyl-CoA lyase (HACL1) deficient mice. Evidence for a second non-HACL1-related lyase. Biochim. Biophys. Acta Mol. Cell Biol. Lipids 1862, 972–990 (2017).

  49. Caltabiano, R. et al. Macrophage migration inhibitory factor (MIF) and its homologue d-dopachrome tautomerase (DDT) inversely correlate with inflammation in discoid lupus erythematosus. Molecules 26, 184 (2021).

  50. Torres-Mora, J. et al. Malignant melanotic schwannian tumor: a clinicopathologic, immunohistochemical, and gene expression profiling study of 40 cases, with a proposal for the reclassification of ‘melanotic schwannoma’. Am. J. Surg. Pathol. 38, 94–105 (2014).

  51. Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).

  52. Aganezov, S. et al. A complete reference genome improves analysis of human genetic variation. Science 376, eabl3533 (2022).

  53. Wigginton, J. E., Cutler, D. J. & Abecasis, G. R. A note on exact tests of Hardy–Weinberg equilibrium. Am. J. Hum. Genet. 76, 887–893 (2005).

  54. Navarro Gonzalez, J. et al. The UCSC Genome Browser database: 2021 update. Nucleic Acids Res. 49, D1046–D1057 (2021).

  55. Zerbino, D. R., Wilder, S. P., Johnson, N., Juettemann, T. & Flicek, P. R. The Ensembl regulatory build. Genome Biol. 16, 56 (2015).

  56. Fu, Y. et al. FunSeq2: a framework for prioritizing noncoding regulatory variants in cancer. Genome Biol. 15, 480 (2014).

  57. Abel, H. J. et al. Mapping and characterization of structural variation in 17,795 human genomes. Nature 583, 83–89 (2020).

  58. Rentzsch, P., Witten, D., Cooper, G. M., Shendure, J. & Kircher, M. CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res. 47, D886–D894 (2019).

  59. Huang, Y.-F., Gulko, B. & Siepel, A. Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data. Nat. Genet. 49, 618–624 (2017).

  60. Hubisz, M. J., Pollard, K. S. & Siepel, A. PHAST and RPHAST: phylogenetic analysis with space/time models. Brief. Bioinform. 12, 41–51 (2011).

  61. Chuang, L.-S. et al. A frameshift in CSF2RB predominant among Ashkenazi Jews increases risk for Crohn’s disease and reduces monocyte signaling via GMCSF. Gastroenterology 151, 710–723 (2016).

  62. Kirsche, M. Jasmine: Population-scale structural variant merging. Jasmine software release v1.1.0 from https://github.com/mkirsche/Jasmine. Zenodo. https://doi.org/10.5281/zenodo.5586905 (2021).

  63. Kirsche, M. Iris: Structural variant breakpoint and sequence refinement. Iris software release v1.0.4 from https://github.com/mkirsche/Iris. Zenodo. https://doi.org/10.5281/zenodo.5586965 (2021).

Acknowledgements

We thank F. Sedlazeck and M. Alonge for helpful discussions. This work was supported, in part, by National Science Foundation grants DBI-1350041 (to M.C.S.), IOS-1732253 (to M.C.S.) and IOS-1758800 (to M.C.S.) and National Institutes of Health grants NCI U01CA253481 (to M.C.S.), NCI U24CA231877 (to M.C.S.), NHGRI U41HG006620 (to M.C.S.), NHGRI U24HG010263 (to M.C.S.), NIH R03CA272952 (to M.C.S.) and NIGMS R35GM139580 (to A.B.). This work was also supported in part by the Mark Foundation for Cancer Research award 19-033-ASP (to M.C.S.) and a Microsoft Research Fellows award (to A.B.). Part of this research project was conducted using computational resources at the Maryland Advanced Research Computing Center (MARCC). We also thank the investigators and the patient donors from the Human Pangenome Reference Consortium, GTEx, and 1000 Genomes for making their data available.

Author information

Contributions

M.K. was the principal author of the Jasmine and Iris software, and led most of the presented analyses. G.P. contributed to the genotyping and eQTL analysis of the 1000 Genomes cohort. R.S. contributed to the genotyping of the 1000 Genomes cohort. B.N. led the genotyping and eQTL analysis of the GTEx cohort. A.B. assisted in the analysis of the GTEx cohort. S.A. helped design the software methods and the overall research strategy. M.C.S. oversaw all aspects of the research and analysis. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Sergey Aganezov or Michael C. Schatz.

Ethics declarations

Competing interests

S.A. has become an employee at Oxford Nanopore. R.S. has become an employee at Illumina. M.K. has become an employee at Variant Bio.

Peer review

Peer review information

Nature Methods thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Lin Tang, in collaboration with the Nature Methods team. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Tables 1–3, Figs. 1–47 and Notes 1 and 2.

Reporting Summary

Peer Review File

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Kirsche, M., Prabhu, G., Sherman, R. et al. Jasmine and Iris: population-scale structural variant comparison and analysis. Nat Methods (2023). https://doi.org/10.1038/s41592-022-01753-3

  • DOI: https://doi.org/10.1038/s41592-022-01753-3
