Long-read mapping to repetitive reference sequences using Winnowmap2

Abstract

Approximately 5–10% of the human genome remains inaccessible due to the presence of repetitive sequences such as segmental duplications and tandem repeat arrays. We show that existing long-read mappers often yield incorrect alignments and variant calls within long, near-identical repeats, as they remain vulnerable to allelic bias. In the presence of a nonreference allele within a repeat, a read sampled from that region could be mapped to an incorrect repeat copy. To address this limitation, we developed a new long-read mapping method, Winnowmap2, by using minimal confidently alignable substrings. Winnowmap2 computes each read mapping through a collection of confident subalignments. This approach is more tolerant of structural variation and more sensitive to paralog-specific variants within repeats. Our experiments highlight that Winnowmap2 successfully addresses the issue of allelic bias, enabling more accurate downstream variant calls in repetitive sequences.
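
The allelic-bias failure mode described above can be made concrete with a toy example. The sketch below is not Winnowmap2's algorithm and does not reproduce any figure from the paper. It assumes made-up 20-base sequences, an error-free read, and a simple mismatch count in place of a real alignment score; the helper names (`mismatches`, `best_copy`, `substitute`) are hypothetical. In this constructed case the sample's nonreference alleles in copy 1 happen to match the bases of copy 2, so a mapper that scores the whole read picks the wrong copy.

```python
# Toy illustration only: hypothetical sequences and a mismatch-count score.

def mismatches(a, b):
    """Count positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def best_copy(read, copies):
    """Return the name of the reference copy with the fewest mismatches."""
    return min(copies, key=lambda name: mismatches(read, copies[name]))

def substitute(seq, changes):
    """Return seq with the bases at the given 0-based positions replaced."""
    bases = list(seq)
    for pos, base in changes.items():
        bases[pos] = base
    return "".join(bases)

# Two near-identical reference repeat copies that differ at positions 3, 11 and 18.
reference = {
    "copy1": "ACGTACGTTAGCCGATTGCA",
    "copy2": "ACGAACGTTAGGCGATTGGA",
}

# Assumption: the sequenced sample carries nonreference alleles in copy 1
# at positions 3 and 11, and these coincide with the bases of copy 2.
sample_copy1 = substitute(reference["copy1"], {3: "A", 11: "G"})

# An error-free read sampled from copy 1 of the sample.
read = sample_copy1

for name, seq in reference.items():
    print(name, mismatches(read, seq))
print("whole-read best hit:", best_copy(read, reference))
```

The read has two mismatches against its true origin (copy 1) and only one against copy 2, so the whole-read score favors the incorrect copy even though position 18 still supports copy 1.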

Fig. 1: Illustration of allelic bias in near-identical genomic repeats.
Fig. 2: Visualization of the alignment pileup near the mutated bases of chromosome 8 using the IGV tool (ref. 47).
Fig. 3: A comparison of false-negative and false-positive rates of SV calls using four read mapping methods.
Fig. 4: Wall-clock time and memory usage of four mapping methods.
Fig. 5: Comparison of Winnowmap2 and minimap2 using the GIAB SV benchmark set defined for the HG002 human sample.
Fig. 6: The left plots indicate the size distribution of SVs computed by the Winnowmap2-Sniffles pipeline using the HG004 and HG007 samples.

Data availability

This study used publicly available data for evaluation. The complete gapless genome assembly of the CHM13 human cell line (v.1.0), chromosome 8 (v.9) and chromosome X (v.0.7) can be accessed from GitHub at https://github.com/marbl/CHM13#downloads. ONT and PacBio HiFi sequencing data for the HG002, HG003 and HG004 samples are available at https://precision.fda.gov/challenges/10. The E. coli K12 nanopore sequencing data used for training the NanoSim simulator are available at the European Nucleotide Archive (PRJEB36648). The GIAB SV call set (v.0.6) for the HG002 human sample is available at https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio.

Code availability

Winnowmap2 code is available at https://github.com/marbl/Winnowmap.

References

  1. Miga, K. H. et al. Telomere-to-telomere assembly of a complete human X chromosome. Nature https://doi.org/10.1038/s41586-020-2547-7 (2020).

  2. Logsdon, G. A. et al. The structure, function and evolution of a complete human chromosome 8. Nature https://doi.org/10.1038/s41586-021-03420-7 (2021).

  3. Chaisson, M. J. & Tesler, G. Mapping single molecule sequencing reads using basic local alignment with successive refinement (blasr): application and theory. BMC Bioinform. 13, 238 (2012).

  4. Sović, I. et al. Fast and sensitive mapping of nanopore sequencing reads with graphmap. Nat. Commun. 7, 1–11 (2016).

  5. Lin, H.-N. & Hsu, W.-L. Kart: a divide-and-conquer algorithm for NGS read alignment. Bioinformatics 33, 2281–2287 (2017).

  6. Suzuki, H. & Kasahara, M. Introducing difference recurrence relations for faster semi-global alignment of long sequences. BMC Bioinform. 19, 33–47 (2018).

  7. Sedlazeck, F. J. et al. Accurate detection of complex structural variations using single-molecule sequencing. Nat. Methods 15, 461–468 (2018).

  8. Jain, C., Dilthey, A., Koren, S., Aluru, S. & Phillippy, A. A fast approximate algorithm for mapping long reads to large reference databases. J. Comput. Biol. 25, 766 (2018).

  9. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).

  10. Haghshenas, E., Sahinalp, S. C. & Hach, F. lordfast: sensitive and fast alignment search tool for long noisy read sequencing data. Bioinformatics 35, 20–27 (2019).

  11. Jain, C. et al. Weighted minimizer sampling improves long read mapping. Bioinformatics 36, i111–i118 (2020).

  12. Zeni, A. et al. Logan: high-performance gpu-based x-drop long-read alignment. In Proc. 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 462–471 (IEEE, 2020).

  13. Prodanov, T. & Bansal, V. Sensitive alignment using paralogous sequence variants improves long-read mapping and variant calling in segmental duplications. Nucleic Acids Res. 48, e114 (2020).

  14. Marco-Sola, S., Moure, J. C., Moreto, M. & Espinosa, A. Fast gap-affine pairwise alignment using the wavefront algorithm. Bioinformatics 37, 456–463 (2021).

  15. Degner, J. F. et al. Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics 25, 3207–3212 (2009).

  16. Schwartz, S., Oren, R. & Ast, G. Detection and removal of biases in the analysis of next-generation sequencing reads. PloS ONE 6, e16685 (2011).

  17. Vijaya Satya, R., Zavaljevski, N. & Reifman, J. A new strategy to reduce allelic bias in RNA-seq readmapping. Nucleic Acids Res. 40, e127 (2012).

  18. Stevenson, K. R., Coolon, J. D. & Wittkopp, P. J. Sources of bias in measures of allele-specific expression derived from RNA-seq data aligned to a single reference genome. BMC Genomics 14, 536 (2013).

  19. Brandt, D. Y. et al. Mapping bias overestimates reference allele frequencies at the HLA genes in the 1000 Genomes Project phase I data. G3 Genes Genom. Genet. 5, 931–941 (2015).

  20. Günther, T. & Nettelblad, C. The presence and impact of reference bias on population genomic studies of prehistoric human populations. PLoS Genet. 15, e1008302 (2019).

  21. Mikheenko, A., Bzikadze, A. V., Gurevich, A., Miga, K. H. & Pevzner, P. A. Tandemtools: mapping long reads and assessing/improving assembly quality in extra-long tandem repeats. Bioinformatics 36, i75–i83 (2020).

  22. Dilthey, A., Cox, C., Iqbal, Z., Nelson, M. R. & McVean, G. Improved genome inference in the MHC using a population reference graph. Nat. Genet. 47, 682–688 (2015).

  23. Paten, B., Novak, A. M., Eizenga, J. M. & Garrison, E. Genome graphs and the evolution of genome inference. Genome Res. 27, 665–676 (2017).

  24. Li, H., Feng, X. & Chu, C. The design and construction of reference pangenome graphs with minigraph. Genome Biol. 21, 1–19 (2020).

  25. Li, H., Ruan, J. & Durbin, R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18, 1851–1858 (2008).

  26. Sudmant, P. H. et al. Diversity of human copy number variation and multicopy genes. Science 330, 641–646 (2010).

  27. Vollger, M. R. et al. Long-read sequence and assembly of segmental duplications. Nat. Methods 16, 88–94 (2019).

  28. Jeffares, D. C. et al. Transient structural variations have strong effects on quantitative traits and reproductive isolation in fission yeast. Nat. Commun. 8, 1–11 (2017).

  29. Schneider, V. A. et al. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 27, 849–864 (2017).

  30. Nurk, S. et al. The complete sequence of a human genome. Science 376, eabj6987 https://doi.org/10.1126/science.abj6987 (2022).

  31. Roberts, M., Hayes, W., Hunt, B. R., Mount, S. M. & Yorke, J. A. Reducing storage requirements for biological sequence comparison. Bioinformatics 20, 3363–3369 (2004).

  32. Hollox, E. J., Armour, J. A. & Barber, J. C. Extensive normal copy number variation of a β-defensin antimicrobial-gene cluster. Am. J. Hum. Genet. 73, 591–600 (2003).

  33. Yang, C., Chu, J., Warren, R. L. & Birol, I. Nanosim: nanopore sequence read simulator based on statistical characterization. GigaScience 6, gix010 (2017).

  34. Ono, Y., Asai, K. & Hamada, M. PBSIM: Pacbio reads simulator–toward accurate genome assembly. Bioinformatics 29, 119–121 (2013).

  35. Jain, C., Koren, S., Dilthey, A., Phillippy, A. M. & Aluru, S. A fast adaptive algorithm for computing whole-genome homology maps. Bioinformatics 34, i748–i756 (2018).

  36. Zook, J. M. et al. A robust benchmark for detection of germline large deletions and insertions. Nat. Biotechnol. https://doi.org/10.1038/s41587-020-0538-8 (2020).

  37. Sharp, A. J. et al. Segmental duplications and copy-number variation in the human genome. Am. J. Hum. Genet. 77, 78–88 (2005).

  38. Chaisson, M. J. et al. Multi-platform discovery of haplotype-resolved structural variation in human genomes. Nat. Commun. 10, 1–16 (2019).

  39. Audano, P. A. et al. Characterizing the major structural variant alleles of the human genome. Cell 176, 663–675 (2019).

  40. McCartney, A. M. et al. Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies. Preprint at bioRxiv https://doi.org/10.1101/2021.07.02.450803 (2021).

  41. Ho, S. S., Urban, A. E. & Mills, R. E. Structural variation in the sequencing era. Nat. Rev. Genet. https://doi.org/10.1038/s41576-019-0180-9 (2019).

  42. Kamath, G. M., Shomorony, I., Xia, F., Courtade, T. A. & David, N. T. Hinge: long-read assembly achieves optimal repeat resolution. Genome Res. 27, 747–756 (2017).

  43. Bzikadze, A. V. & Pevzner, P. A. Automated assembly of centromeres from ultra-long error-prone reads. Nat. Biotechnol. 38, 1309–1316 (2020).

  44. Nurk, S. et al. Hicanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. 30, 1291–1305 (2020).

  45. Bankevich, A., Bzikadze, A. V., Kolmogorov, M., Antipov, D. & Pevzner, P. A. Multiplex de Bruijn graphs enable genome assembly from long, high-fidelity reads. Nat. Biotechnol. https://doi.org/10.1038/s41587-022-01220-6 (2022).

  46. Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with HiFiasm. Nat. Methods 18, 170–175 (2021).

  47. Robinson, J. T. et al. Integrative genomics viewer. Nat. Biotechnol. 29, 24–26 (2011).

  48. Gel, B. & Serra, E. karyoploter: an r/bioconductor package to plot customizable genomes displaying arbitrary data. Bioinformatics 33, 3088–3090 (2017).

  49. Quinlan, A. R. & Hall, I. M. Bedtools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).

  50. Shafin, K. et al. Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nat. Biotechnol. https://doi.org/10.1038/s41587-020-0503-6 (2020).

  51. Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162 (2019).

Acknowledgements

We thank K. Shafin, A. Mikheenko, M. Kirsche and S. Nurk for providing useful feedback regarding Winnowmap2. We also acknowledge H. Li for responding to our queries regarding minimap2 code. Winnowmap2 and Winnowmap were developed on top of minimap2 code. This research was supported in part by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health and funding from the Indian Institute of Science.

Author information

Contributions

C.J. designed, implemented and tested the Winnowmap2 algorithm. A.R., S.K., N.H. and A.P. provided valuable feedback while designing the algorithm and benchmark. C.J. prepared an initial draft of the manuscript. N.H. and A.P. edited the manuscript.

Corresponding author

Correspondence to Chirag Jain.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Fritz Sedlazeck, Rayan Chikhi and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling editor: Lin Tang, in collaboration with the Nature Methods team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–4, Supplementary Tables 1–4 and Supplementary Note 1

Reporting Summary

About this article

Cite this article

Jain, C., Rhie, A., Hansen, N.F. et al. Long-read mapping to repetitive reference sequences using Winnowmap2. Nat Methods 19, 705–710 (2022). https://doi.org/10.1038/s41592-022-01457-8

  • DOI: https://doi.org/10.1038/s41592-022-01457-8
