
Ultrafast prediction of somatic structural variations by filtering out reads matched to pan-genome k-mer sets

Abstract

Variant callers typically produce massive numbers of false positives for structural variations, such as cancer-relevant copy-number alterations and fusion genes resulting from genome rearrangements. Here we describe an ultrafast and accurate detector of somatic structural variations that reduces read-mapping costs by filtering out reads matched to pan-genome k-mer sets. The detector, which we named ETCHING (for efficient detection of chromosomal rearrangements and fusion genes), reduces the number of false positives by leveraging machine-learning classifiers trained with six breakend-related features (clipped-read count, split-read count, supporting paired-end read count, average mapping quality, depth difference and total length of clipped bases). When benchmarked against six callers on reference cell-free DNA, validated biomarkers of structural variants, matched tumour and normal whole genomes, and tumour-only targeted sequencing datasets, ETCHING was 11-fold faster than the second-fastest structural-variant caller at comparable performance and memory use. The speed and accuracy of ETCHING may aid large-scale genome projects and facilitate practical implementations in precision medicine.
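
The read-filtering idea in the abstract can be illustrated with a minimal sketch. It is not ETCHING's implementation: the function names, the in-memory Python set, and the rule "keep a read only if at least one of its k-mers is absent from the reference set" are all assumptions made for illustration. Reverse complements are ignored here, and a production tool would need a far more compact k-mer index.

```python
from typing import Iterable, Iterator, List, Set


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every k-mer of seq (upper-cased); yields nothing if len(seq) < k."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_kmer_set(genomes: Iterable[str], k: int) -> Set[str]:
    """Collect all k-mers seen in a collection of reference sequences."""
    reference: Set[str] = set()
    for genome in genomes:
        reference.update(kmers(genome, k))
    return reference


def keep_read(read: str, reference: Set[str], k: int) -> bool:
    """Assumed rule: keep a read if any of its k-mers is missing from the set.

    Reads shorter than k have no k-mers and are therefore filtered out.
    """
    return any(kmer not in reference for kmer in kmers(read, k))


def filter_reads(reads: Iterable[str], reference: Set[str], k: int) -> List[str]:
    """Return only the reads that are not fully explained by the reference set."""
    return [read for read in reads if keep_read(read, reference, k)]


if __name__ == "__main__":
    # Toy sequences standing in for a pan-genome (hypothetical data).
    reference = build_kmer_set(["ACGTACGTGG", "TTGACCA"], k=4)
    print(filter_reads(["ACGTAC", "ACGTTTGA"], reference, k=4))
```

In this toy run the first read is discarded because all of its 4-mers occur in the reference sequences, whereas the second read is kept because it contains 4-mers that do not. Only the kept reads would need to go on to the more expensive mapping step.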

Fig. 1: Overview and schematic flow diagram of ETCHING.
Fig. 2: Benchmarking ETCHING and other SV callers over a cancer cell line (HCC1395) and LUAD and PRAD WGS datasets.
Fig. 3: SV analysis for the WGS data from 26 MM samples.
Fig. 4: Complex rearrangements and FGs in MM samples.
Fig. 5: SV and FG predictions from TPS data.

Data availability

WGS data from 26 MM samples, RNA-seq data from 24 matched samples and PacBio long-read sequencing data from two multiple-myeloma samples can be downloaded from the Korean Nucleotide Archive (KONA; PRJKA220342; https://www.kobic.re.kr/kona/) with controlled access. TPS data from reference materials are available at http://big.hanyang.ac.kr/ETCHING. Genomes used to build PGK and PGK2 are listed in Supplementary Table 1. WGS data from 46 BRCA, 20 PRAD and 32 LUAD samples were downloaded from TCGA (https://cancergenome.nih.gov). kLUAD WGS datasets (49) were acquired from a previous study (ref. 33). WGS and PacBio long-read sequencing data from HCC1395/HCC1395BL were downloaded from the NCBI Sequence Read Archive (SRA) under accession number SRP162370. Cancer-panel datasets were downloaded from SRA under accession number SRP042598. NSCLC cancer-panel data were acquired from a previous study (ref. 47). Source data are provided with this paper.

Code availability

All source code and binaries of ETCHING (version 1.4.0) and the in-house code (LR_Filter and ETCHING_bench) used in the study are available at http://big.hanyang.ac.kr/ETCHING and on GitHub (https://github.com/ETCHING-team). ETCHING was designed for 64-bit Linux systems with at least 16 GB of RAM. An image file containing all code, models and demo data is available on the Amazon Elastic Compute Cloud (ID: ami-07c7a7d8934784df9; Region: us-east-1 (Northern Virginia)).

References

  1. Chiang, C. et al. The impact of structural variation on human gene expression. Nat. Genet. 49, 692–699 (2017).
  2. Sharp, A. J., Cheng, Z. & Eichler, E. E. Structural variation of the human genome. Annu. Rev. Genomics Hum. Genet. 7, 407–442 (2006).
  3. Mitelman, F., Johansson, B. & Mertens, F. The impact of translocations and gene fusions on cancer causation. Nat. Rev. Cancer 7, 233–245 (2007).
  4. Stratton, M. R., Campbell, P. J. & Futreal, P. A. The cancer genome. Nature 458, 719–724 (2009).
  5. Beroukhim, R. et al. The landscape of somatic copy-number alteration across human cancers. Nature 463, 899–905 (2010).
  6. Stankiewicz, P. & Lupski, J. R. Structural variation in the human genome and its role in disease. Annu. Rev. Med. 61, 437–455 (2010).
  7. Macintyre, G., Ylstra, B. & Brenton, J. D. Sequencing structural variants in cancer for precision therapeutics. Trends Genet. 32, 530–542 (2016).
  8. Di Fiore, P. P. et al. erbB-2 is a potent oncogene when overexpressed in NIH/3T3 cells. Science 237, 178–182 (1987).
  9. Slamon, D. J. et al. Human breast cancer: correlation of relapse and survival with amplification of the HER-2/neu oncogene. Science 235, 177–182 (1987).
  10. Soda, M. et al. Identification of the transforming EML4-ALK fusion gene in non-small-cell lung cancer. Nature 448, 561–566 (2007).
  11. Lugo, T. G., Pendergast, A. M., Muller, A. J. & Witte, O. N. Tyrosine kinase activity and transformation potency of bcr-abl oncogene products. Science 247, 1079–1082 (1990).
  12. Abyzov, A., Urban, A. E., Snyder, M. & Gerstein, M. CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res. 21, 974–984 (2011).
  13. Chen, K. et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat. Methods 6, 677–681 (2009).
  14. Ye, K., Schulz, M. H., Long, Q., Apweiler, R. & Ning, Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 25, 2865–2871 (2009).
  15. Wang, J. et al. CREST maps somatic structural variation in cancer genomes with base-pair resolution. Nat. Methods 8, 652–654 (2011).
  16. Schroder, J. et al. Socrates: identification of genomic rearrangements in tumour genomes by re-aligning soft clipped reads. Bioinformatics 30, 1064–1072 (2014).
  17. Rausch, T. et al. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics 28, i333–i339 (2012).
  18. Yang, L. et al. Diverse mechanisms of somatic structural variations in human cancer genomes. Cell 153, 919–929 (2013).
  19. Layer, R. M., Chiang, C., Quinlan, A. R. & Hall, I. M. LUMPY: a probabilistic framework for structural variant discovery. Genome Biol. 15, R84 (2014).
  20. Chen, X. et al. Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinformatics 32, 1220–1222 (2016).
  21. Cameron, D. L. et al. GRIDSS: sensitive and specific genomic rearrangement detection using positional de Bruijn graph assembly. Genome Res. 27, 2050–2060 (2017).
  22. Wala, J. A. et al. SvABA: genome-wide detection of structural variants and indels by local assembly. Genome Res. 28, 581–591 (2018).
  23. Chong, Z. et al. novoBreak: local assembly for breakpoint detection in cancer genomes. Nat. Methods 14, 65–67 (2017).
  24. Moncunill, V. et al. Comprehensive characterization of complex structural variations in cancer by directly comparing genome sequence reads. Nat. Biotechnol. 32, 1106–1112 (2014).
  25. ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium. Pan-cancer analysis of whole genomes. Nature 578, 82–93 (2020).
  26. Collins, R. L. et al. A structural variation reference for medical and population genetics. Nature 581, 444–451 (2020).
  27. Cameron, D. L., Di Stefano, L. & Papenfuss, A. T. Comprehensive evaluation and characterisation of short read general-purpose structural variant calling software. Nat. Commun. 10, 3240 (2019).
  28. Gong, T., Hayes, V. M. & Chan, E. K. F. Detection of somatic structural variants from short-read next-generation sequencing data. Brief Bioinform. https://doi.org/10.1093/bib/bbaa056 (2020).
  29. Zhang, J. et al. INTEGRATE: gene fusion discovery using whole genome and transcriptome data. Genome Res. 26, 108–118 (2016).
  30. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://arxiv.org/abs/1303.3997 (2013).
  31. Wright, M. N. & Ziegler, A. ranger: A fast implementation of random forests for high dimensional data in C++ and R. J. Stat. Softw. https://doi.org/10.18637/jss.v077.i01 (2017).
  32. Chen, T. & Guestrin, C. XGBoost: a scalable tree boosting system. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794 (Association for Computing Machinery, 2016).
  33. Lee, J. J. et al. Tracing oncogene rearrangements in the mutational history of lung adenocarcinoma. Cell 177, 1842–1857 e1821 (2019).
  34. Xia, L. C. et al. SVEngine: an efficient and versatile simulator of genome structural variations with features of cancer clonal evolution. Gigascience https://doi.org/10.1093/gigascience/giy081 (2018).
  35. Kosugi, S. et al. Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing. Genome Biol. 20, 117 (2019).
  36. Derrien, T. et al. Fast computation and applications of genome mappability. PLoS ONE 7, e30377 (2012).
  37. Avet-Loiseau, H. et al. High incidence of translocations t(11;14)(q13;q32) and t(4;14)(p16;q32) in patients with plasma cell malignancies. Cancer Res. 58, 5640–5645 (1998).
  38. Avet-Loiseau, H. et al. Rearrangements of the c-myc oncogene are present in 15% of primary human multiple myeloma tumors. Blood 98, 3082–3086 (2001).
  39. Chakravarty, D. et al. OncoKB: a precision oncology knowledge base. JCO Precis. Oncol. https://doi.org/10.1200/PO.17.00011 (2017).
  40. Mertens, F., Johansson, B., Fioretos, T. & Mitelman, F. The emerging complexity of gene fusions in cancer. Nat. Rev. Cancer 15, 371–381 (2015).
  41. Chesi, M. et al. IAP antagonists induce anti-tumor immunity in multiple myeloma. Nat. Med. 22, 1411–1420 (2016).
  42. Raponi, S. et al. Biallelic BIRC3 inactivation in chronic lymphocytic leukaemia patients with 11q deletion identifies a subgroup with very aggressive disease. Br. J. Haematol. 185, 156–159 (2019).
  43. Blakemore, S. J. et al. Clinical significance of TP53, BIRC3, ATM and MAPK-ERK genes in chronic lymphocytic leukaemia: data from the randomised UK LRF CLL4 trial. Leukemia 34, 1760–1774 (2020).
  44. Frazzi, R. BIRC3 and BIRC5: multi-faceted inhibitors in cancer. Cell Biosci. 11, 8 (2021).
  45. Uhrig, S. et al. Accurate and efficient detection of gene fusions from RNA sequencing data. Genome Res. 31, 448–460 (2021).
  46. Abo, R. P. et al. BreaKmer: detection of structural variation in targeted massively parallel sequencing data using kmers. Nucleic Acids Res. 43, e19 (2015).
  47. Shin, H. T. et al. Junction Location Identifier (JuLI): accurate detection of DNA fusions in clinical sequencing for precision oncology. J. Mol. Diagn. 22, 304–318 (2020).
  48. Kokot, M., Dlugosz, M. & Deorowicz, S. KMC 3: counting and manipulating k-mer statistics. Bioinformatics 33, 2759–2761 (2017).
  49. Zito Marino, F. et al. A new look at the ALK gene in cancer: copy number gain and amplification. Expert Rev. Anticancer Ther. 16, 493–502 (2016).
  50. Pasini, L. et al. TrkA is amplified in malignant melanoma patients and induces an anti-proliferative response in cell lines. BMC Cancer 15, 777 (2015).
  51. Huang, M. E. et al. Use of all-trans retinoic acid in the treatment of acute promyelocytic leukemia. Blood 72, 567–572 (1988).
  52. Slovak, M. & Campbell, L. International System of Human Cytogenetic Nomenclature (ISCN) (Karger, 2009).
  53. Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods 14, 417–419 (2017).
  54. Frankish, A. et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 47, D766–D773 (2019).
  55. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
  56. Robinson, J. T. et al. Integrative genomics viewer. Nat. Biotechnol. 29, 24–26 (2011).

Acknowledgements

We thank all BIGLab members for critical reading of the manuscript and for comments, S. Kim and S. Yoon of Seoul National University and W.-Y. Park of the Samsung Medical Center for providing resources. The results shown in this study are in part based on data generated by the TCGA Research Network (https://www.cancer.gov/tcga). This work was supported by the National Research Foundation (NRF) funded by the Ministry of Science & ICT (2014M3C9A3063541, 2020R1A4A1018398, 2021R1A2C3005835, 2022M3A9I2082294 and 2022M3E5F1018502 to J.-W.N.) and by the Korean Health Technology R&D Project, Ministry of Health and Welfare, Republic of Korea (HI15C3224 to J.-W.N.).

Author information

Contributions

J.S., M.-H.C., D.Y. and V.A.M. performed analyses. J.S., M.-H.C., D.Y. and V.A.M. contributed to writing the code. J.S., M.-H.C. and B.N. contributed to parallel computing. J.L., J.W.P. and M.S.Y. contributed to the data processing of benchmarking datasets. S.K., S.-H.S., Y.K., S.-S.Y. and Y.S.J. provided validation datasets. Y.J.K. and J.-G.J. performed experimental validations. J.S., D.Y. and J.-W.N. contributed to the writing of the manuscript. D.B., T.-M.K. and J.-W.N. supervised the project. J.-W.N. conceived the idea.

Corresponding author

Correspondence to Jin-Wu Nam.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Biomedical Engineering thanks Ryan Layer and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Comparison of running time.

a, Wall-clock times of ETCHING and other tools from FASTQ input files to SV predictions on benchmarking BRCA samples, including mapping of the FASTQ files from tumour and normal samples. Running times were measured using 30 threads on DELL PowerEdge R830 servers. b,c, Running times of ETCHING (from FASTQ) and other tools (from pre-mapped BAM) to SV predictions, in CPU time on a single thread (b) and in wall-clock time on 30 threads (c).

Extended Data Fig. 2 Benchmarking on BRCA and kLUAD samples.

Benchmarking results in auPR for ETCHING versus other tools on six BRCA (a) and nine kLUAD (b) samples by SV type.
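
For readers unfamiliar with the metric, the sketch below shows one common way to compute an area under the precision-recall curve from scored SV calls. It is an illustration only: the step-wise summation (equivalent to average precision), the function names and the `n_truth` parameter are assumptions, and the interpolation actually used for the benchmark is not stated here.

```python
from typing import List, Sequence, Tuple


def pr_points(scores: Sequence[float], labels: Sequence[int],
              n_truth: int) -> List[Tuple[float, float]]:
    """Return (recall, precision) after each call, ranked by descending score.

    labels[i] is 1 if call i matches a truth-set SV and 0 otherwise;
    n_truth is the total number of SVs in the truth set.
    """
    if n_truth <= 0:
        raise ValueError("n_truth must be positive")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points: List[Tuple[float, float]] = []
    true_positives = 0
    for rank, (_, is_true) in enumerate(ranked, start=1):
        true_positives += int(is_true)
        points.append((true_positives / n_truth, true_positives / rank))
    return points


def au_pr(scores: Sequence[float], labels: Sequence[int], n_truth: int) -> float:
    """Step-wise area: sum over calls of (gain in recall) x (precision)."""
    area = 0.0
    previous_recall = 0.0
    for recall, precision in pr_points(scores, labels, n_truth):
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area


if __name__ == "__main__":
    # Four hypothetical calls, three of them true, against a truth set of four SVs.
    print(au_pr([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 1], n_truth=4))
```

Because recall is computed against the whole truth set, a caller that misses true SVs cannot reach an area of 1 even if every call it makes is correct.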

Extended Data Fig. 3 Benchmarking on simulation data.

Benchmarking results for ETCHING versus other tools on simulation data sets.

Extended Data Fig. 4 Systematic benchmarking of SV prediction by ETCHING versus other tools over different genomic contexts and SV sizes with HCC1395 data.

a, The number of SV calls by each tool for defined SV size categories: 100 bp ≤ L < 1 kb (100 bp–1 kb), 1 kb ≤ L < 1 Mb (1 kb–1 Mb) and L ≥ 1 Mb (≥1 Mb), as well as inter-chromosomal rearrangements (Inter-Chr., that is, TRAs), where L is SV size. b, The SV ratios associated with repetitive elements, different MP scores and different GC ratios. c, Recall and precision of the SV callers for SVs that overlap repeats. d, Recall and precision of the SV callers for SVs in regions with different genomic MP scores. e, Recall and precision of the SV callers for SVs located in regions with different GC ratios.
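
The size categories in panel a can be expressed as a small binning function. This is a sketch for illustration, not code from the study: the function name and label strings are hypothetical, 1 kb is taken as 1,000 bp and 1 Mb as 1,000,000 bp, and SVs shorter than 100 bp (which the caption does not categorize) are returned as `None`.

```python
from typing import Optional

KB = 1_000        # assumed: 1 kb = 1,000 bp
MB = 1_000_000    # assumed: 1 Mb = 1,000,000 bp


def sv_size_category(length: Optional[int],
                     inter_chromosomal: bool = False) -> Optional[str]:
    """Map an SV to one of the size categories described in the caption.

    Inter-chromosomal rearrangements have no meaningful length, so they are
    assigned their own category regardless of the length argument.
    """
    if inter_chromosomal:
        return "inter-chr"
    if length is None or length < 100:
        return None          # outside the categories listed in the caption
    if length < KB:
        return "100bp-1kb"   # 100 bp <= L < 1 kb
    if length < MB:
        return "1kb-1Mb"     # 1 kb <= L < 1 Mb
    return ">=1Mb"           # L >= 1 Mb


if __name__ == "__main__":
    for size in (50, 100, 999, 1_000, 999_999, 1_000_000):
        print(size, sv_size_category(size))
    print("translocation", sv_size_category(None, inter_chromosomal=True))
```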

Extended Data Fig. 5 Validation of SVs using PacBio long-reads on HCC1395.

a, The number of SVs of each SV type. b, The area under PR curves (auPR) of ETCHING and other tools on the gold-standard SV sets. c, The validation rates of ETCHING and other tools.

Extended Data Fig. 6 Benchmarking on 32 LUAD samples.

Benchmarking results for ETCHING versus other tools on 32 LUAD samples by SV type for five different performance metrics.

Extended Data Fig. 7 Benchmarking on 20 PRAD samples.

Benchmarking results for ETCHING versus other tools on 20 PRAD samples by SV type for five different performance metrics. Note that each boxplot has 8 dots because 13 samples with low SV counts were pooled and treated as a single sample.

Extended Data Fig. 8 SVs detected by each tool.

Summary of detected SV biomarkers (a) and actionable targets (b) by DELLY, LUMPY, Manta, SvABA, novoBreak, and GRIDSS.

Extended Data Fig. 9 SV and FG prediction from TPS data, paired with WT alleles (regarded as matched-normal).

a, The TP calls (labelled as ‘Found’ in orange) and false negatives (labelled as ‘Missed’ in grey) of SV callers for cfDNA reference materials – Complete Reference (CR), Complete Mutation Mix (CMM), and Mutation Mix v2 (MMv2) – with different mutant allele ratios (0.5 to 5.0%; grey to black). CR and CMM include NCOA4-RET, EML4-ALK, and CD74-ROS1 FGs, and MMv2 includes NCOA4-RET and TPR-ALK FGs. The total TP for each tool is indicated in the lower right corner. b, Benchmarking SV callers on the reference materials including ETCHING with PGK.

Extended Data Fig. 10 PML-RARA detection.

a, Wall-clock times for detecting PML-RARA fusions on WGS data of seven APML samples. b, PML-RARA fusions detected by each tool.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sohn, Ji., Choi, MH., Yi, D. et al. Ultrafast prediction of somatic structural variations by filtering out reads matched to pan-genome k-mer sets. Nat. Biomed. Eng 7, 853–866 (2023). https://doi.org/10.1038/s41551-022-00980-5

  • DOI: https://doi.org/10.1038/s41551-022-00980-5
