Skip to main content

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

  • Article
  • Published:

Merfin: improved variant filtering, assembly evaluation and polishing via k-mer validation

Abstract

Variant calling has been widely used for genotyping and for improving the consensus accuracy of long-read assemblies. Variant calls are commonly hard-filtered with user-defined cutoffs. However, it is impossible to define a single set of optimal cutoffs, as the calls heavily depend on the quality of the reads, the variant caller of choice and the quality of the unpolished assembly. Here, we introduce Merfin, a k-mer based variant-filtering algorithm for improved accuracy in genotyping and genome assembly polishing. Merfin evaluates each variant based on the expected k-mer multiplicity in the reads, independently of the quality of the read alignment and variant caller’s internal score. Merfin increased the precision of genotyped calls in several benchmarks, improved consensus accuracy and reduced frameshift errors when applied to human and nonhuman assemblies built from Pacific Biosciences HiFi and continuous long reads or Oxford Nanopore reads, including the first complete human genome. Moreover, we introduce assembly quality and completeness metrics that account for the expected genomic copy numbers.

This is a preview of subscription content, access via your institution

Access options

Buy this article

Prices may be subject to local taxes which are calculated during checkout

Fig. 1: Algorithms and results used in Merfin.
Fig. 2: CHM13 evaluation and polishing.
Fig. 3: HG002 human trio polishing and evaluation.
Fig. 4: Polishing and evaluation of VGP pseudo-haploid assemblies.
Fig. 5: Merfin results against quality scores.

Similar content being viewed by others

Data availability

HG002 variant-call data were downloaded from https://data.nist.gov/od/id/mds2-2336 (SEX9X, NFT0L, 23O09 and QUE7Q). Sequencing data and assemblies for CHM13, HG002 and VGP genomes are available at https://github.com/marbl/CHM13, https://github.com/human-pangenomics/HG002_Data_Freeze_v1.0 and https://vgp.github.io/.

Source data used for generating all figures in this manuscript are available at https://github.com/gf777/misc/tree/master/merfin/paper/figures.

The K* tracks for HiFi and Illumina of the CHM13 are browsable in the associated UCSC browser (http://genome.ucsc.edu/cgi-bin/hgTracks?db=hub_2395475_t2t-chm13-v1.1). All variant calls used in the genotyping benchmarks, k-mer databases, fitted histogram tables and K* tracks are available to download at https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=publications/MERFIN_2021/ with a step-by-step guideline available at https://github.com/arangrhie/merfin/wiki/Best-practices-for-Merfin. All data are publicly open for download with no restrictions.

Code availability

A stable release and the source code for Merfin and examples from this work are available under Apache License 2.0 at GitHub (https://github.com/arangrhie/merfin) and Zenodo (https://doi.org/10.5281/zenodo.5527270)52. The only dependency is the k-mer counter Meryl, which comes with the release. Merfin can be run in five modes (1) the -filter mode scores each variant or variants within distance k and their combinations by error k-mers for improved genotyping; (2) the -completeness mode generates completeness metrics; (3) the -dump mode computes KC, KR, K* for each base in the assembly along with QV and QV* for each sequence; (4) the -hist mode provides a K* histogram and genome-wide QV and QV* averages; and (5) the -polish mode scores each variant or variants within distance k and their combinations by the K* for polishing. Merfin is fully parallelized using OpenMP. A combination of bash and Rscript used for data analysis and visualization is available at https://github.com/gf777/misc/tree/master/merfin/paper/figures. A Code Ocean capsule of the package is provided (https://doi.org/10.24433/CO.2292349.v1)53.

References

  1. Olson, N. D. et al. precisionFDA Truth Challenge v2: calling variants from short- and long-reads in difficult-to-map regions. Preprint at bioRxiv https://doi.org/10.1101/2020.11.13.380741 (2021).

  2. Koboldt, D. C. Best practices for variant calling in clinical sequencing. Genome Med. 12, 91 (2020).

    Article  Google Scholar 

  3. Guo, Y., Ye, F., Sheng, Q., Clark, T. & Samuels, D. C. Three-stage quality control strategies for DNA re-sequencing data. Brief. Bioinform. 15, 879–889 (2014).

    Article  CAS  Google Scholar 

  4. Giani, A. M., Gallo, G. R., Gianfranceschi, L. & Formenti, G. Long walk to genomics: history and current approaches to genome sequencing and assembly. Comput. Struct. Biotechnol. J. 18, 9–19 (2020).

    Article  CAS  Google Scholar 

  5. Rhie, A. et al. Towards complete and error-free genome assemblies of all vertebrate species. Nature 592, 737–746 (2021).

    Article  CAS  Google Scholar 

  6. Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162 (2019).

    Article  CAS  Google Scholar 

  7. Watson, M. & Warr, A. Errors in long-read assemblies can critically affect protein prediction. Nat. Biotechnol. 37, 124–126 (2019).

    Article  CAS  Google Scholar 

  8. Nurk, S. et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. 30, 1291–1305 (2020).

    Article  CAS  Google Scholar 

  9. Walker, B. J. et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS ONE 9, e112963 (2014).

    Article  Google Scholar 

  10. Chin, C.-S. et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat. Methods 10, 563–569 (2013).

    Article  CAS  Google Scholar 

  11. Vaser, R., Sović, I., Nagarajan, N. & Šikić, M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 27, 737–746 (2017).

    Article  CAS  Google Scholar 

  12. McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).

    Article  CAS  Google Scholar 

  13. Garrison, E. & Marth, G. Haplotype-based variant detection from short-read sequencing. Preprint at arXiv https://arxiv.org/abs/1207.3907 (2012).

  14. Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).

    Article  CAS  Google Scholar 

  15. Li, H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27, 2987–2993 (2011).

    Article  CAS  Google Scholar 

  16. Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 21, 245 (2020).

    Article  CAS  Google Scholar 

  17. Mapleson, D., Garcia Accinelli, G., Kettleborough, G., Wright, J. & Clavijo, B. J. KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies. Bioinformatics 33, 574–576 (2017).

    CAS  PubMed  Google Scholar 

  18. Kundu, R., Casey, J. & Sung, W.-K. HyPo: super fast and accurate polisher for long read genome assemblies. Preprint at bioRxiv. https://doi.org/10.1101/2019.12.19.882506 (2019).

  19. Jain, C. et al. Weighted minimizer sampling improves long read mapping. Bioinformatics 36, i111–i118 (2020).

    Article  CAS  Google Scholar 

  20. Jain, C., Rhie, A., Hansen, N., Koren, S. & Phillippy, A. M. Long read mapping to repetitive reference sequences using Winnowmap2. Nat. Methods (2022).

  21. Phillippy, A. M., Schatz, M. C. & Pop, M. Genome assembly forensics: finding the elusive mis-assembly. Genome Biol. 9, R55 (2008).

    Article  Google Scholar 

  22. Nurk, S. et al. The complete sequence of a human genome. Science 376, eabj6987 https://doi.org/10.1126/science.abj6987 (2022).

  23. Vollger, M. R. et al. Segmental duplications and their variation in a complete human genome. Science 376, eabj6965 https://doi.org/10.1126/science.abj6965 (2022).

  24. Gershman, A. et al. Epigenetic patterns in a complete human genome. Science 376, eabj5089 https://doi.org/10.1126/science.abj5089 (2022).

  25. Mc Cartney, A. M. et al. Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies. Nat. Methods (2022).

  26. Jarvis, E.D. et al. Automated assembly of high-quality diploid human reference genomes. Preprint at bioRxiv https://doi.org/10.1101/2022.03.06.483034 (2022).

  27. Krusche, P. et al. Best practices for benchmarking germline small-variant calls in human genomes. Nat. Biotechnol. 37, 555–560 (2019).

    Article  CAS  Google Scholar 

  28. Ranallo-Benavidez, T. R., Jaron, K. S. & Schatz, M. C. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nat. Commun. 11, 1432 (2020).

    Article  CAS  Google Scholar 

  29. Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).

    Article  CAS  Google Scholar 

  30. Huddleston, J. et al. Discovery and genotyping of structural variation from long-read haploid genome sequence data. Genome Res. 27, 677–685 (2017).

    Article  CAS  Google Scholar 

  31. Miga, K. H. et al. Telomere-to-telomere assembly of a complete human X chromosome. Nature 585, 79–84 (2020).

    Article  CAS  Google Scholar 

  32. Zook, J. M. et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci. Data 3, 160025 (2016).

    Article  CAS  Google Scholar 

  33. Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).

    Article  CAS  Google Scholar 

  34. Koren, S. et al. De novo assembly of haplotype-resolved genomes with trio binning. Nat. Biotechnol. https://doi.org/10.1038/nbt.4277 (2018).

  35. Li, H. et al. A synthetic-diploid benchmark for accurate variant-calling evaluation. Nat. Methods 15, 595–597 (2018).

    Article  Google Scholar 

  36. Wagner, J. et al. Curated variation benchmarks for challenging medically relevant autosomal genes. Nat. Biotechnol. https://doi.org/10.1038/s41587-021-01158-1 (2022).

  37. Chin, C.-S. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods 13, 1050–1054 (2016).

    Article  CAS  Google Scholar 

  38. O’Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 44, D733–D745 (2016).

    Article  Google Scholar 

  39. NCBI. Gnomon - the NCBI eukaryotic gene prediction tool. https://www.ncbi.nlm.nih.gov/genome/annotation_euk/gnomon/ (2017).

  40. Fofanov, Y. et al. How independent are the appearances of n-mers in different genomes? Bioinformatics 20, 2421–2428 (2004).

    Article  CAS  Google Scholar 

  41. Zhao, X., Weber, A. M. & Mills, R. E. A recurrence-based approach for validating structural variation using long-read sequencing technology. Gigascience 6, 1–9 (2017).

    CAS  PubMed  PubMed Central  Google Scholar 

  42. Benjamini, Y. & Speed, T. P. Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res. 40, e72 (2012).

    Article  CAS  Google Scholar 

  43. Yang, C. et al. Evolutionary and biomedical insights from a marmoset diploid genome assembly. Nature https://doi.org/10.1038/s41586-021-03535-x (2021).

  44. Cheng, H. et al. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).

    Article  CAS  Google Scholar 

  45. Altemose, N. et al. Complete genomic and epigenetic maps of human centromeres. Science 376, eabl4178 https://doi.org/10.1126/science.abl4178 (2022).

  46. Robinson, J. T. et al. Integrative genomics viewer. Nat. Biotechnol. 29, 24–26 (2011).

    Article  CAS  Google Scholar 

  47. Morgulis, A., Gertz, E. M., Schäffer, A. A. & Agarwala, R. WindowMasker: window-based masker for sequenced genomes. Bioinforma. Oxf. Engl. 22, 134–141 (2006).

    Article  CAS  Google Scholar 

  48. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).

    Article  CAS  Google Scholar 

  49. Kapustin, Y., Souvorov, A., Tatusova, T. & Lipman, D. Splign: algorithms for computing spliced alignments with identification of paralogs. Biol. Direct 3, 20 (2008).

    Article  Google Scholar 

  50. Lowe, T. M. & Eddy, S. R. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955–964 (1997).

    Article  CAS  Google Scholar 

  51. Nawrocki, E. P. et al. RFAM 12.0: updates to the RNA families database. Nucleic Acids Res. 43, D130–D137 (2015).

    Article  CAS  Google Scholar 

  52. Formenti, G., Rhie, A. & Walenz, B. arangrhie/merfin: Merfin v1.0. Zenodo https://doi.org/10.5281/zenodo.5527270 (2021).

  53. Formenti, G., Rhie, A. & Walenz, B. Merfin - improved Arrow polishing through vcf filtering and Genomescope2 kmer models. Code Ocean https://doi.org/10.24433/CO.2292349.v1 (2022).

Download references

Acknowledgements

We thank T. Rhyker Ranallo-Benavidez and M.C. Schatz for the useful discussion on adapting Genomescope2 models. We also thank the communities of the T2T, HPRC and VGP consortia for their constant support. G.F. and E.D.J. were supported by Rockefeller University and HHMI funds. A.R., B.P.W., S.K. and A.M.P. were supported by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health (NIH) (1ZIAHG200398). The work of F.T.-N. was supported by the Intramural Research Program of the National Library of Medicine, NIH. K.S. was supported by NIH/NHGRI (R01HG010485, U41HG010972, U01HG010961, U24HG011853 and OT2OD026682). E.W.M. was partially supported by the German Federal Ministry of Education and Research (01IS18026C). Part of this work used the computational resources of the NIH HPC Biowulf cluster (https://hpc.nih.gov).

Author information

Authors and Affiliations

Authors

Contributions

A.R., G.F., B.P.W. and E.W.M. implemented Merfin. G.F. and A.R. performed the validation analyses. K.S. performed the GIAB variant-calling analysis on HG002. F.T. generated the gene annotations for the VGP genomes. S.K. contributed to the conceptual development. G.F. and A.R. wrote the manuscript. G.F., A.R., E.D.J. and A.M.P. conceived the study. All authors reviewed, edited and approved the manuscript.

Corresponding authors

Correspondence to Giulio Formenti or Arang Rhie.

Ethics declarations

Competing interests

S.K. has received travel funds to speak at symposia organized by Oxford Nanopore. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Chuan-Le Xiao and the other, anonymous, reviewer for their contribution to the peer review of this work. Primary Handling editor: Lin Tang, in collaboration with the Nature Methods team. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Flowchart diagram of each mode in Merfin.

Text inside gray boxes on the top represents input files required (solid) or optional (dashed) for Merfin. a, genotyping (-filter) and polishing (-polish) modes. b, K* histogram (-hist) and K* completeness (-completeness) modes. Steps listed in bullet points are marked in gray if it is only applicable in -polish (a) or -completeness (b) mode.

Extended Data Fig. 2 Genome-wide density distribution of the K* using Illumina k-mers.

When the assembly is in agreement with the raw data, the K* is normally distributed with mean 0 and the smaller the standard deviation the higher the agreement. T2T-CHM13v1.0 shows a less dispersed distribution of the K* compared to v0.7.

Extended Data Fig. 3 A region of negative K* highlighting sequencing bias.

An example of low coverage in both HiFi and Illumina reads associated with high guanine content, and specifically a GA-rich repeat (heatmap). GA bias has been reported in PacBio HiFi data, and results in gaps in the assembly that in CHM13 were filled with Nanopore data22. The K* both from HiFi and Illumina k-mers (top tracks) recapitulate the coverage drop. Nanopore coverage appears less affected. Position Chr. 12:~129,862,000 bp.

Extended Data Fig. 4 The K* can identify issues in the assembly at the base level.

a, 40 bp window with K* close to 0, highlighting perfect agreement of the assembly with the raw reads. Position Chr18:~7,000,000 bp. b, A region of negative K* in coincidence with two heterozygous indels. Position Chr1:~105,008,350 bp.

Extended Data Fig. 5 Coverage titration experiment and impact on QV*.

The QV* is only marginally influenced by the coverage of the dataset being considered.

Extended Data Fig. 6 Haplotype phasing before and after polishing with Merfin.

In both parental assemblies, the haplotypes remained fully phased, and the size of the blocks substantially increased compared to the unpolished version (a,b) after polishing with Merfin (c,d). A theoretical human genome size of 3.1 Gbp was used to normalize NG* values.

Extended Data Fig. 7 VGP assembly pipeline.

Compared to the previous v1.6, the introduction of Merfin in v1.7 (green) resulted in a minimal change of the workflow, but in a generalized improvement in QV scores and gene annotations. Pipeline available at https://github.com/VGP/vgp-assembly.

Extended Data Fig. 8 Phase block analysis of zebra finch pseudo-haploid assembly.

a, Phase blocks in the primary assembly after mapping the reads to both the primary and alternate assemblies. b, Phase blocks in the primary assembly after mapping the reads to the primary only. c, Phase blocks in the alternate assembly after mapping the reads to both the primary and alternate assemblies. d, Phase blocks in the alternate assembly. In all cases, the application of Merfin filtering minor heterozygous variants (green) leads to block sizes better or comparable to prior polishing methods alone (blue). Unpolished assembly in gray. Results of Merfin without filtering in red. A genome size of ~1.03 Gbp derived from Genomescope2 was used to normalize NG* values.

Extended Data Fig. 9 Effect of merfin correction on the kinetochore scaffold 1 (KNL1) annotation.

a, Deleterious presence of an extra A around position 1,321,620 of scaffold_7 (red box) in the polished, non-merfin-corrected sequence is indicated by a 1-base gap in the alignments of zebra finch PacBio IsoSeq SRR8695295.20794.1 and KNL1 transcripts from three other Passeriformes songbirds. This insertion causes a disruption in the frame and a premature stop codon in the translated sequence (see amino acid sequence in red). b, Corresponding span in the merfin-corrected assembly, with gapless alignments of the IsoSeq read and Passeriformes transcripts, and uninterrupted translation.

Supplementary information

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Formenti, G., Rhie, A., Walenz, B.P. et al. Merfin: improved variant filtering, assembly evaluation and polishing via k-mer validation. Nat Methods 19, 696–704 (2022). https://doi.org/10.1038/s41592-022-01445-y

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1038/s41592-022-01445-y

This article is cited by

Search

Quick links

Nature Briefing AI and Robotics

Sign up for the Nature Briefing: AI and Robotics newsletter — what matters in AI and robotics research, free to your inbox weekly.

Get the most important science stories of the day, free in your inbox. Sign up for Nature Briefing: AI and Robotics