Abstract
Approximately 5–10% of the human genome remains inaccessible due to the presence of repetitive sequences such as segmental duplications and tandem repeat arrays. We show that existing long-read mappers often yield incorrect alignments and variant calls within long, near-identical repeats because they remain vulnerable to allelic bias: when a repeat contains a nonreference allele, a read sampled from that region can be mapped to an incorrect repeat copy. To address this limitation, we developed Winnowmap2, a new long-read mapping method based on minimal confidently alignable substrings. Winnowmap2 computes each read mapping from a collection of confident subalignments. This approach is more tolerant of structural variation and more sensitive to paralog-specific variants within repeats. Our experiments show that Winnowmap2 addresses the issue of allelic bias, enabling more accurate downstream variant calls in repetitive sequences.
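The idea of a minimal confidently alignable substring can be illustrated with a small toy sketch. It is not the Winnowmap2 implementation: ungapped mismatch counts stand in for alignment scores, a fixed mismatch margin stands in for a mapping-quality threshold, and the function names, sequences and `margin` parameter are hypothetical. For each start position in a read, the sketch finds the shortest substring that matches one repeat copy better than every other copy, then lets those substrings vote on the read's origin.

```python
from collections import Counter


def mismatches(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def minimal_confident_substrings(read, copies, margin=1):
    """Toy version: for each start i, find the shortest read[i:j] whose best
    repeat copy beats the runner-up by at least `margin` mismatches.

    Assumes at least two copies, each the same length as the read and
    compared at the same offsets (no gaps); a real mapper would align instead.
    """
    results = []
    for i in range(len(read)):
        for j in range(i + 1, len(read) + 1):
            scores = sorted(
                (mismatches(read[i:j], seq[i:j]), name)
                for name, seq in copies.items()
            )
            if scores[1][0] - scores[0][0] >= margin:
                results.append((i, j, scores[0][1]))
                break  # shortest confident substring for this start found
    return results


# Two near-identical repeat copies that differ at one paralog-specific position.
copy_a = "ACGTACGTTAGCCGATAGGCTTAACGGATC"
copy_b = copy_a[:20] + "G" + copy_a[21:]

# A read sampled from copy A that carries a nonreference allele at position 5.
read = copy_a[:5] + "T" + copy_a[6:]

mcas = minimal_confident_substrings(read, {"A": copy_a, "B": copy_b})
votes = Counter(name for _, _, name in mcas)
print(votes.most_common(1)[0][0])
```

In this toy, every confident substring has to extend across the paralog-specific position, and starts beyond it yield no confident substring at all. The nonreference allele adds the same penalty against both copies, so it does not change which copy wins.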
Data availability
This study used publicly available data for evaluation. The complete gapless genome assembly of the CHM13 human cell line (v.1.0), chromosome 8 (v.9) and chromosome X (v.0.7) can be accessed from GitHub at https://github.com/marbl/CHM13#downloads. ONT and PacBio HiFi sequencing data for the HG002, HG003 and HG004 samples are available at https://precision.fda.gov/challenges/10. The E. coli K12 nanopore sequencing data used to train the NanoSim simulator are available at the European Nucleotide Archive (PRJEB36648). The GIAB SV call set (v.0.6) for the HG002 human sample is available at https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio.
Code availability
The Winnowmap2 code is available at https://github.com/marbl/Winnowmap.
Acknowledgements
We thank K. Shafin, A. Mikheenko, M. Kirsche and S. Nurk for providing useful feedback regarding Winnowmap2. We also acknowledge H. Li for responding to our queries regarding the minimap2 code. Winnowmap2 and Winnowmap were developed on top of the minimap2 code. This research was supported in part by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health and funding from the Indian Institute of Science.
Author information
Contributions
C.J. designed, implemented and tested the Winnowmap2 algorithm. A.R., S.K., N.H. and A.P. provided valuable feedback while designing the algorithm and benchmark. C.J. prepared an initial draft of the manuscript. N.H. and A.P. edited the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Methods thanks Fritz Sedlazeck, Rayan Chikhi and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling editor: Lin Tang, in collaboration with the Nature Methods team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Figs. 1–4, Supplementary Tables 1–4 and Supplementary Note 1
About this article
Cite this article
Jain, C., Rhie, A., Hansen, N.F. et al. Long-read mapping to repetitive reference sequences using Winnowmap2. Nat Methods 19, 705–710 (2022). https://doi.org/10.1038/s41592-022-01457-8
DOI: https://doi.org/10.1038/s41592-022-01457-8