Merfin: improved variant filtering, assembly evaluation and polishing via k-mer validation

Formenti, Giulio; Rhie, Arang; Walenz, Brian P.; Thibaud-Nissen, Françoise; Shafin, Kishwar; Koren, Sergey; Myers, Eugene W.; Jarvis, Erich D.; Phillippy, Adam M.

doi:10.1038/s41592-022-01445-y

Article
Published: 31 March 2022

Merfin: improved variant filtering, assembly evaluation and polishing via k-mer validation

Nature Methods volume 19, pages 696–704 (2022)Cite this article

7577 Accesses
21 Citations
102 Altmetric
Metrics details

Subjects

Abstract

Variant calling has been widely used for genotyping and for improving the consensus accuracy of long-read assemblies. Variant calls are commonly hard-filtered with user-defined cutoffs. However, it is impossible to define a single set of optimal cutoffs, as the calls heavily depend on the quality of the reads, the variant caller of choice and the quality of the unpolished assembly. Here, we introduce Merfin, a k-mer based variant-filtering algorithm for improved accuracy in genotyping and genome assembly polishing. Merfin evaluates each variant based on the expected k-mer multiplicity in the reads, independently of the quality of the read alignment and variant caller’s internal score. Merfin increased the precision of genotyped calls in several benchmarks, improved consensus accuracy and reduced frameshift errors when applied to human and nonhuman assemblies built from Pacific Biosciences HiFi and continuous long reads or Oxford Nanopore reads, including the first complete human genome. Moreover, we introduce assembly quality and completeness metrics that account for the expected genomic copy numbers.

Access through your institution

Buy or subscribe

This is a preview of subscription content, access via your institution

Access options

Access through your institution

Buy this article

Purchase on Springer Link
Instant access to full article PDF

Buy now

Prices may be subject to local taxes which are calculated during checkout

**Fig. 1: Algorithms and results used in Merfin.**

**Fig. 2: CHM13 evaluation and polishing.**

**Fig. 3: HG002 human trio polishing and evaluation.**

**Fig. 4: Polishing and evaluation of VGP pseudo-haploid assemblies.**

**Fig. 5: Merfin results against quality scores.**

Tradeoffs in alignment and assembly-based methods for structural variant detection with long-read sequencing data

Article Open access 19 March 2024

Pangenome-based genome inference allows efficient and accurate genotyping across a wide spectrum of variant classes

Article Open access 11 April 2022

Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome

Article 12 August 2019

Data availability

HG002 variant-call data were downloaded from https://data.nist.gov/od/id/mds2-2336 (SEX9X, NFT0L, 23O09 and QUE7Q). Sequencing data and assemblies for CHM13, HG002 and VGP genomes are available at https://github.com/marbl/CHM13, https://github.com/human-pangenomics/HG002_Data_Freeze_v1.0 and https://vgp.github.io/.

Source data used for generating all figures in this manuscript are available at https://github.com/gf777/misc/tree/master/merfin/paper/figures.

The K* tracks for HiFi and Illumina of the CHM13 are browsable in the associated UCSC browser (http://genome.ucsc.edu/cgi-bin/hgTracks?db=hub_2395475_t2t-chm13-v1.1). All variant calls used in the genotyping benchmarks, k-mer databases, fitted histogram tables and K* tracks are available to download at https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=publications/MERFIN_2021/ with a step-by-step guideline available at https://github.com/arangrhie/merfin/wiki/Best-practices-for-Merfin. All data are publicly open for download with no restrictions.

Code availability

A stable release and the source code for Merfin and examples from this work are available under Apache License 2.0 at GitHub (https://github.com/arangrhie/merfin) and Zenodo (https://doi.org/10.5281/zenodo.5527270)⁵². The only dependency is the k-mer counter Meryl, which comes with the release. Merfin can be run in five modes (1) the -filter mode scores each variant or variants within distance k and their combinations by error k-mers for improved genotyping; (2) the -completeness mode generates completeness metrics; (3) the -dump mode computes K_C, K_R, K* for each base in the assembly along with QV and QV* for each sequence; (4) the -hist mode provides a K* histogram and genome-wide QV and QV* averages; and (5) the -polish mode scores each variant or variants within distance k and their combinations by the K* for polishing. Merfin is fully parallelized using OpenMP. A combination of bash and Rscript used for data analysis and visualization is available at https://github.com/gf777/misc/tree/master/merfin/paper/figures. A Code Ocean capsule of the package is provided (https://doi.org/10.24433/CO.2292349.v1)⁵³.

References

Olson, N. D. et al. precisionFDA Truth Challenge v2: calling variants from short- and long-reads in difficult-to-map regions. Preprint at bioRxiv https://doi.org/10.1101/2020.11.13.380741 (2021).
Koboldt, D. C. Best practices for variant calling in clinical sequencing. Genome Med. 12, 91 (2020).
Article Google Scholar
Guo, Y., Ye, F., Sheng, Q., Clark, T. & Samuels, D. C. Three-stage quality control strategies for DNA re-sequencing data. Brief. Bioinform. 15, 879–889 (2014).
Article CAS Google Scholar
Giani, A. M., Gallo, G. R., Gianfranceschi, L. & Formenti, G. Long walk to genomics: history and current approaches to genome sequencing and assembly. Comput. Struct. Biotechnol. J. 18, 9–19 (2020).
Article CAS Google Scholar
Rhie, A. et al. Towards complete and error-free genome assemblies of all vertebrate species. Nature 592, 737–746 (2021).
Article CAS Google Scholar
Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162 (2019).
Article CAS Google Scholar
Watson, M. & Warr, A. Errors in long-read assemblies can critically affect protein prediction. Nat. Biotechnol. 37, 124–126 (2019).
Article CAS Google Scholar
Nurk, S. et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. 30, 1291–1305 (2020).
Article CAS Google Scholar
Walker, B. J. et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS ONE 9, e112963 (2014).
Article Google Scholar
Chin, C.-S. et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat. Methods 10, 563–569 (2013).
Article CAS Google Scholar
Vaser, R., Sović, I., Nagarajan, N. & Šikić, M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 27, 737–746 (2017).
Article CAS Google Scholar
McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
Article CAS Google Scholar
Garrison, E. & Marth, G. Haplotype-based variant detection from short-read sequencing. Preprint at arXiv https://arxiv.org/abs/1207.3907 (2012).
Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).
Article CAS Google Scholar
Li, H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27, 2987–2993 (2011).
Article CAS Google Scholar
Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 21, 245 (2020).
Article CAS Google Scholar
Mapleson, D., Garcia Accinelli, G., Kettleborough, G., Wright, J. & Clavijo, B. J. KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies. Bioinformatics 33, 574–576 (2017).
CAS PubMed Google Scholar
Kundu, R., Casey, J. & Sung, W.-K. HyPo: super fast and accurate polisher for long read genome assemblies. Preprint at bioRxiv. https://doi.org/10.1101/2019.12.19.882506 (2019).
Jain, C. et al. Weighted minimizer sampling improves long read mapping. Bioinformatics 36, i111–i118 (2020).
Article CAS Google Scholar
Jain, C., Rhie, A., Hansen, N., Koren, S. & Phillippy, A. M. Long read mapping to repetitive reference sequences using Winnowmap2. Nat. Methods (2022).
Phillippy, A. M., Schatz, M. C. & Pop, M. Genome assembly forensics: finding the elusive mis-assembly. Genome Biol. 9, R55 (2008).
Article Google Scholar
Nurk, S. et al. The complete sequence of a human genome. Science 376, eabj6987 https://doi.org/10.1126/science.abj6987 (2022).
Vollger, M. R. et al. Segmental duplications and their variation in a complete human genome. Science 376, eabj6965 https://doi.org/10.1126/science.abj6965 (2022).
Gershman, A. et al. Epigenetic patterns in a complete human genome. Science 376, eabj5089 https://doi.org/10.1126/science.abj5089 (2022).
Mc Cartney, A. M. et al. Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies. Nat. Methods (2022).
Jarvis, E.D. et al. Automated assembly of high-quality diploid human reference genomes. Preprint at bioRxiv https://doi.org/10.1101/2022.03.06.483034 (2022).
Krusche, P. et al. Best practices for benchmarking germline small-variant calls in human genomes. Nat. Biotechnol. 37, 555–560 (2019).
Article CAS Google Scholar
Ranallo-Benavidez, T. R., Jaron, K. S. & Schatz, M. C. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nat. Commun. 11, 1432 (2020).
Article CAS Google Scholar
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
Article CAS Google Scholar
Huddleston, J. et al. Discovery and genotyping of structural variation from long-read haploid genome sequence data. Genome Res. 27, 677–685 (2017).
Article CAS Google Scholar
Miga, K. H. et al. Telomere-to-telomere assembly of a complete human X chromosome. Nature 585, 79–84 (2020).
Article CAS Google Scholar
Zook, J. M. et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci. Data 3, 160025 (2016).
Article CAS Google Scholar
Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).
Article CAS Google Scholar
Koren, S. et al. De novo assembly of haplotype-resolved genomes with trio binning. Nat. Biotechnol. https://doi.org/10.1038/nbt.4277 (2018).
Li, H. et al. A synthetic-diploid benchmark for accurate variant-calling evaluation. Nat. Methods 15, 595–597 (2018).
Article Google Scholar
Wagner, J. et al. Curated variation benchmarks for challenging medically relevant autosomal genes. Nat. Biotechnol. https://doi.org/10.1038/s41587-021-01158-1 (2022).
Chin, C.-S. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods 13, 1050–1054 (2016).
Article CAS Google Scholar
O’Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 44, D733–D745 (2016).
Article Google Scholar
NCBI. Gnomon - the NCBI eukaryotic gene prediction tool. https://www.ncbi.nlm.nih.gov/genome/annotation_euk/gnomon/ (2017).
Fofanov, Y. et al. How independent are the appearances of n-mers in different genomes? Bioinformatics 20, 2421–2428 (2004).
Article CAS Google Scholar
Zhao, X., Weber, A. M. & Mills, R. E. A recurrence-based approach for validating structural variation using long-read sequencing technology. Gigascience 6, 1–9 (2017).
CAS PubMed PubMed Central Google Scholar
Benjamini, Y. & Speed, T. P. Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res. 40, e72 (2012).
Article CAS Google Scholar
Yang, C. et al. Evolutionary and biomedical insights from a marmoset diploid genome assembly. Nature https://doi.org/10.1038/s41586-021-03535-x (2021).
Cheng, H. et al. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
Article CAS Google Scholar
Altemose, N. et al. Complete genomic and epigenetic maps of human centromeres. Science 376, eabl4178 https://doi.org/10.1126/science.abl4178 (2022).
Robinson, J. T. et al. Integrative genomics viewer. Nat. Biotechnol. 29, 24–26 (2011).
Article CAS Google Scholar
Morgulis, A., Gertz, E. M., Schäffer, A. A. & Agarwala, R. WindowMasker: window-based masker for sequenced genomes. Bioinforma. Oxf. Engl. 22, 134–141 (2006).
Article CAS Google Scholar
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
Article CAS Google Scholar
Kapustin, Y., Souvorov, A., Tatusova, T. & Lipman, D. Splign: algorithms for computing spliced alignments with identification of paralogs. Biol. Direct 3, 20 (2008).
Article Google Scholar
Lowe, T. M. & Eddy, S. R. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955–964 (1997).
Article CAS Google Scholar
Nawrocki, E. P. et al. RFAM 12.0: updates to the RNA families database. Nucleic Acids Res. 43, D130–D137 (2015).
Article CAS Google Scholar
Formenti, G., Rhie, A. & Walenz, B. arangrhie/merfin: Merfin v1.0. Zenodo https://doi.org/10.5281/zenodo.5527270 (2021).
Formenti, G., Rhie, A. & Walenz, B. Merfin - improved Arrow polishing through vcf filtering and Genomescope2 kmer models. Code Ocean https://doi.org/10.24433/CO.2292349.v1 (2022).

Download references

Acknowledgements

We thank T. Rhyker Ranallo-Benavidez and M.C. Schatz for the useful discussion on adapting Genomescope2 models. We also thank the communities of the T2T, HPRC and VGP consortia for their constant support. G.F. and E.D.J. were supported by Rockefeller University and HHMI funds. A.R., B.P.W., S.K. and A.M.P. were supported by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health (NIH) (1ZIAHG200398). The work of F.T.-N. was supported by the Intramural Research Program of the National Library of Medicine, NIH. K.S. was supported by NIH/NHGRI (R01HG010485, U41HG010972, U01HG010961, U24HG011853 and OT2OD026682). E.W.M. was partially supported by the German Federal Ministry of Education and Research (01IS18026C). Part of this work used the computational resources of the NIH HPC Biowulf cluster (https://hpc.nih.gov).

Author information

These authors contributed equally: Giulio Formenti, Arang Rhie.

Authors and Affiliations

Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
Giulio Formenti & Erich D. Jarvis
Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, USA
Giulio Formenti & Erich D. Jarvis
Howard Hughes Medical Institute, Chevy Chase, MD, USA
Giulio Formenti & Erich D. Jarvis
Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
Arang Rhie, Brian P. Walenz, Sergey Koren & Adam M. Phillippy
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
Françoise Thibaud-Nissen
UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
Kishwar Shafin
Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
Eugene W. Myers

Authors

Giulio Formenti
View author publications
You can also search for this author in PubMed Google Scholar
Arang Rhie
View author publications
You can also search for this author in PubMed Google Scholar
Brian P. Walenz
View author publications
You can also search for this author in PubMed Google Scholar
Françoise Thibaud-Nissen
View author publications
You can also search for this author in PubMed Google Scholar
Kishwar Shafin
View author publications
You can also search for this author in PubMed Google Scholar
Sergey Koren
View author publications
You can also search for this author in PubMed Google Scholar
Eugene W. Myers
View author publications
You can also search for this author in PubMed Google Scholar
Erich D. Jarvis
View author publications
You can also search for this author in PubMed Google Scholar
Adam M. Phillippy
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

A.R., G.F., B.P.W. and E.W.M. implemented Merfin. G.F. and A.R. performed the validation analyses. K.S. performed the GIAB variant-calling analysis on HG002. F.T. generated the gene annotations for the VGP genomes. S.K. contributed to the conceptual development. G.F. and A.R. wrote the manuscript. G.F., A.R., E.D.J. and A.M.P. conceived the study. All authors reviewed, edited and approved the manuscript.

Corresponding authors

Correspondence to Giulio Formenti or Arang Rhie.

Ethics declarations

Competing interests

S.K. has received travel funds to speak at symposia organized by Oxford Nanopore. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Chuan-Le Xiao and the other, anonymous, reviewer for their contribution to the peer review of this work. Primary Handling editor: Lin Tang, in collaboration with the Nature Methods team. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Flowchart diagram of each mode in Merfin.

Text inside gray boxes on the top represents input files required (solid) or optional (dashed) for Merfin. a, genotyping (-filter) and polishing (-polish) modes. b, K* histogram (-hist) and K* completeness (-completeness) modes. Steps listed in bullet points are marked in gray if it is only applicable in -polish (a) or -completeness (b) mode.

Extended Data Fig. 2 Genome-wide density distribution of the K* using Illumina k-mers.

When the assembly is in agreement with the raw data, the K* is normally distributed with mean 0 and the smaller the standard deviation the higher the agreement. T2T-CHM13v1.0 shows a less dispersed distribution of the K* compared to v0.7.

Extended Data Fig. 3 A region of negative K* highlighting sequencing bias.

An example of low coverage in both HiFi and Illumina reads associated with high guanine content, and specifically a GA-rich repeat (heatmap). GA bias has been reported in PacBio HiFi data, and results in gaps in the assembly that in CHM13 were filled with Nanopore data²². The K* both from HiFi and Illumina k-mers (top tracks) recapitulate the coverage drop. Nanopore coverage appears less affected. Position Chr. 12:~129,862,000 bp.

Extended Data Fig. 4 The K* can identify issues in the assembly at the base level.

a, 40 bp window with K* close to 0, highlighting perfect agreement of the assembly with the raw reads. Position Chr18:~7,000,000 bp. b, A region of negative K* in coincidence with two heterozygous indels. Position Chr1:~105,008,350 bp.

Extended Data Fig. 5 Coverage titration experiment and impact on QV*.

The QV* is only marginally influenced by the coverage of the dataset being considered.

Extended Data Fig. 6 Haplotype phasing before and after polishing with Merfin.

In both parental assemblies, the haplotypes remained fully phased, and the size of the blocks substantially increased compared to the unpolished version (a,b) after polishing with Merfin (c,d). A theoretical human genome size of 3.1 Gbp was used to normalize NG* values.

Extended Data Fig. 7 VGP assembly pipeline.

Compared to the previous v1.6, the introduction of Merfin in v1.7 (green) resulted in a minimal change of the workflow, but in a generalized improvement in QV scores and gene annotations. Pipeline available at https://github.com/VGP/vgp-assembly.

Extended Data Fig. 8 Phase block analysis of zebra finch pseudo-haploid assembly.

a, Phase blocks in the primary assembly after mapping the reads to both the primary and alternate assemblies. b, Phase blocks in the primary assembly after mapping the reads to the primary only. c, Phase blocks in the alternate assembly after mapping the reads to both the primary and alternate assemblies. d, Phase blocks in the alternate assembly. In all cases, the application of Merfin filtering minor heterozygous variants (green) leads to block sizes better or comparable to prior polishing methods alone (blue). Unpolished assembly in gray. Results of Merfin without filtering in red. A genome size of ~1.03 Gbp derived from Genomescope2 was used to normalize NG* values.

Extended Data Fig. 9 Effect of merfin correction on the kinetochore scaffold 1 (KNL1) annotation.

a, Deleterious presence of an extra A around position 1,321,620 of scaffold_7 (red box) in the polished, non-merfin-corrected sequence is indicated by a 1-base gap in the alignments of zebra finch PacBio IsoSeq SRR8695295.20794.1 and KNL1 transcripts from three other Passeriformes songbirds. This insertion causes a disruption in the frame and a premature stop codon in the translated sequence (see amino acid sequence in red). b, Corresponding span in the merfin-corrected assembly, with gapless alignments of the IsoSeq read and Passeriformes transcripts, and uninterrupted translation.

Supplementary information

Reporting Summary

Peer Review Information

Supplementary Table 1

Rights and permissions

Reprints and permissions

About this article

Cite this article

Formenti, G., Rhie, A., Walenz, B.P. et al. Merfin: improved variant filtering, assembly evaluation and polishing via k-mer validation. Nat Methods 19, 696–704 (2022). https://doi.org/10.1038/s41592-022-01445-y

Download citation

Received: 13 July 2021
Accepted: 07 March 2022
Published: 31 March 2022
Issue Date: June 2022
DOI: https://doi.org/10.1038/s41592-022-01445-y

This article is cited by

Constructing telomere-to-telomere diploid genome by polishing haploid nanopore-based assembly
- Joshua Casey Darian
- Ritu Kundu
- Wing-Kin Sung
Nature Methods (2024)
Genome assembly in the telomere-to-telomere era
- Heng Li
- Richard Durbin
Nature Reviews Genetics (2024)
Technology-enabled great leap in deciphering plant genomes
- Lingjuan Xie
- Xiaojiao Gong
- Longjiang Fan
Nature Plants (2024)
The size and composition of haplotype reference panels impact the accuracy of imputation from low-pass sequencing in cattle
- Audald Lloret-Villas
- Hubert Pausch
- Alexander S. Leonard
Genetics Selection Evolution (2023)
Identification of errors in draft genome assemblies at single-nucleotide resolution for quality assessment and improvement
- Kunpeng Li
- Peng Xu
- Yuannian Jiao
Nature Communications (2023)