Curated variation benchmarks for challenging medically relevant autosomal genes

Wagner, Justin; Olson, Nathan D.; Harris, Lindsay; McDaniel, Jennifer; Cheng, Haoyu; Fungtammasan, Arkarachai; Hwang, Yih-Chii; Gupta, Richa; Wenger, Aaron M.; Rowell, William J.; Khan, Ziad M.; Farek, Jesse; Zhu, Yiming; Pisupati, Aishwarya; Mahmoud, Medhat; Xiao, Chunlin; Yoo, Byunggil; Sahraeian, Sayed Mohammad Ebrahim; Miller, Danny E.; Jáspez, David; Lorenzo-Salazar, José M.; Muñoz-Barrera, Adrián; Rubio-Rodríguez, Luis A.; Flores, Carlos; Narzisi, Giuseppe; Evani, Uday Shanker; Clarke, Wayne E.; Lee, Joyce; Mason, Christopher E.; Lincoln, Stephen E.; Miga, Karen H.; Ebbert, Mark T. W.; Shumate, Alaina; Li, Heng; Chin, Chen-Shan; Zook, Justin M.; Sedlazeck, Fritz J.

doi:10.1038/s41587-021-01158-1

Article
Published: 07 February 2022

Nature Biotechnology volume 40, pages 672–680 (2022)

Abstract

The repetitive nature and complexity of some medically relevant genes pose a challenge for their accurate analysis in a clinical setting. The Genome in a Bottle Consortium has provided variant benchmark sets, but these exclude nearly 400 medically relevant genes due to their repetitiveness or polymorphic complexity. Here, we characterize 273 of these 395 challenging autosomal genes using a haplotype-resolved whole-genome assembly. This curated benchmark reports over 17,000 single-nucleotide variations, 3,600 insertions and deletions and 200 structural variations each for human genome references GRCh37 and GRCh38 across HG002. We show that false duplications in either GRCh37 or GRCh38 result in reference-specific, missed variants for short- and long-read technologies in medically relevant genes, including CBS, CRYAA and KCNE1. When masking these false duplications, variant recall can improve from 8% to 100%. Forming benchmarks from a haplotype-resolved whole-genome assembly may become a prototype for future benchmarks covering the whole genome.
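As a reading aid for the recall figures quoted above, the following minimal Python sketch shows how recall is conventionally computed from true-positive and false-negative counts. The counts are hypothetical, chosen only so that the arithmetic reproduces 8% and 100%; they are not taken from the paper, and the function name `recall` is an illustrative assumption rather than part of any benchmarking tool.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of benchmark variants that a callset recovers: TP / (TP + FN)."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("benchmark contains no variants")
    return true_positives / total


# Hypothetical counts for a single gene (assumed values, not data from the study).
before_masking = recall(true_positives=2, false_negatives=23)
after_masking = recall(true_positives=25, false_negatives=0)
print(f"recall before masking: {before_masking:.0%}, after masking: {after_masking:.0%}")
```

In this illustration, every benchmark variant that was missed because reads mapped to the falsely duplicated copy moves from the false-negative count to the true-positive count once the duplication is masked.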

**Fig. 1: GIAB developed a process to create new phased small variant and SV benchmarks for 273 CMRGs.**

**Fig. 2: The new CMRG benchmark contains more challenging variants and regions than previous benchmarks.**

**Fig. 3: The new benchmark covers the gene *SMN1*, which was previously excluded due to mapping challenges for all technologies in the highly identical segmental duplication.**

**Fig. 4: The benchmark resolves the gene *CBS*, which has a highly homologous gene (*CBSL*) due to a false duplication in GRCh38 that is not in HG002 or GRCh37.**

**Fig. 5: The new CMRG small variant benchmark includes more challenging variants and identifies more false negatives in a standard short-read callset (Illumina–BWA-MEM–GATK) than the previous v4.2.1 benchmark in these challenging genes.**

Data availability

The PacBio HiFi reads used to generate the hifiasm assembly for the benchmark are in the NCBI Sequence Read Archive with accession numbers SRR10382245, SRR10382244, SRR10382249, SRR10382248, SRR10382247 and SRR10382246. The v1.00 benchmark VCF and BED files, as well as Liftoff gene annotations, assembly–assembly alignments and variant calls, are available at https://trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/CMRG_v1.00/, and as a DOI at https://doi.org/10.18434/mds2-2475. This is released as a separate benchmark from v4.2.1 because it covers only a small fraction of the genome, it has different characteristics from the mapping-based v4.2.1, and v4.2.1 includes only small variants. Using v4.2.1 and the CMRG benchmarks as two separate benchmarks enables users to obtain broader performance metrics for most of the genome and for a small set of particularly challenging genes, respectively. The masked GRCh38 reference, recently updated to v2 with additional false duplications from the Telomere-to-Telomere Consortium, is available at https://trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/references. When benchmarking, we recommend using the v3.0 GA4GH/GIAB stratification BED files intended for use with hap.py, which are available at https://trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/. These stratifications include BED files corresponding to false duplications and collapsed duplications in GRCh38. All data have no restrictions, as the HG002 sample has an open consent from the Personal Genome Project.
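Because the benchmark is distributed as a VCF paired with a BED file of benchmark regions, a common first step is to check whether a variant position falls inside those regions. The sketch below is a minimal illustration of that check, not the logic used by hap.py or by the authors. It relies only on the standard conventions that BED intervals are 0-based and half-open while VCF positions are 1-based; the function names `load_bed` and `in_benchmark`, the assumption of non-overlapping intervals and the example coordinates are all hypothetical.

```python
from bisect import bisect_right


def load_bed(lines):
    """Parse BED lines into {chrom: sorted list of (start, end)} (0-based, half-open)."""
    regions = {}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines and header lines
        chrom, start, end = line.split()[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    for intervals in regions.values():
        intervals.sort()
    return regions


def in_benchmark(regions, chrom, pos):
    """True if the 1-based VCF position lies in a BED interval (assumes no overlaps)."""
    intervals = regions.get(chrom, [])
    zero_based = pos - 1  # convert the VCF coordinate to BED coordinates
    # Index of the last interval whose start is <= zero_based.
    idx = bisect_right(intervals, (zero_based, float("inf"))) - 1
    return idx >= 0 and intervals[idx][0] <= zero_based < intervals[idx][1]


# Hypothetical intervals, for illustration only.
example_regions = load_bed(["chr21\t100\t200", "chr21\t300\t400"])
print(in_benchmark(example_regions, "chr21", 101))  # first base of the first interval
print(in_benchmark(example_regions, "chr21", 201))  # just past the first interval
```

For real evaluations, a dedicated benchmarking tool such as hap.py, used with the stratification files described above, is the appropriate choice, since it also handles differences in variant representation that a positional check cannot.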

Code availability

Scripts used to develop the CMRG benchmark and generate figures and tables for the manuscript are available at https://github.com/usnistgov/cmrg-benchmarkset-manuscript. The previously developed assembly, which was used as the basis of this benchmark, was from hifiasm v0.11.

A variety of open source software was used for variant calling for the evaluations of the benchmark, including NextDenovo2.2-beta.0, DRAGEN 3.6.3, NeuSomatic’s submission for the PrecisionFDA truth challenge v2 (ref. 12) (BWA-MEM⁵⁰ version 0.7.17-r1188 (https://github.com/lh3/bwa) and GATK version gatk-4.1.4.1 (https://gatk.broadinstitute.org/hc/en-us)), Parabricks_DeepVariant (Parabricks Pipelines DeepVariant v3.0.0_2 (https://developer.nvidia.com/clara-parabricks)), Sentieon (DNAscope) version sentieon_release_201911 (https://www.sentieon.com/products/#dnaseq), BWA-MEM and Strelka2 (BWA-MEM version 0.7.17-r1188 (https://github.com/lh3/bwa) and Strelka2 version 2.9.10 (https://github.com/Illumina/strelka)), BWA-MEM⁵⁰ (v0.7.8), Picard tools (https://broadinstitute.github.io/picard/) (ver. 1.83), GATK⁵² (v3.4-0), GATK (v3.5), BWA-MEM v0.7.15-r1140, SAMtools⁵³ v1.3, Picard v2.10.10, GATK v3.8, DELLY⁵⁴ v0.8.5, GRIDSS⁵⁵ v2.9.4, LUMPY⁵⁶ v0.3.1, Manta⁵⁷ v1.6.0, Wham⁵⁸ v1.7.0, NanoPlot⁶⁰ v1.27.0, Filtlong v0.2.0, minimap2 (refs. 40,60) v2.17-r941, cuteSV v1.0.8, Sniffles⁶¹ v1.0.12, SURVIVOR⁵⁹ v1.0.7, BWA v0.7.15, GATK v3.6, Java v1.8.0_74 (OpenJDK), Picard Tools v2.6.0, Sambamba⁶³ v0.6.7, Samblaster⁶⁴ v0.1.24, Samtools v1.9, DeepVariant v1.0 and Liftoff³² v1.4.0.

References

Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162 (2019).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
Nurk, S. et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. 30, 1291–1305 (2020).
Shafin, K. et al. Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nat. Biotechnol. 38, 1044–1053 (2020).
Mahmoud, M. et al. Structural variant calling: the long and the short of it. Genome Biol. 20, 246 (2019).
De Coster, W., Weissensteiner, M. H. & Sedlazeck, F. J. Towards population-scale long-read sequencing. Nat. Rev. Genet. 22, 572–587 (2021).
Mandelker, D. et al. Navigating highly homologous genes in a molecular diagnostic setting: a resource for clinical next-generation sequencing. Genet. Med. 18, 1282–1289 (2016).
Ebbert, M. T. W. et al. Systematic analysis of dark and camouflaged genes reveals disease-relevant genes hiding in plain sight. Genome Biol. 20, 1–23 (2019).
Lincoln, S. E. et al. One in seven pathogenic variants can be challenging to detect by NGS: an analysis of 450,000 patients with implications for clinical sensitivity and genetic test implementation. Genet. Med. 23, 1673–1680 (2021).
Zook, J. M. et al. An open resource for accurately benchmarking small variant and reference calls. Nat. Biotechnol. 37, 561–566 (2019).
Zook, J. M. et al. A robust benchmark for detection of germline large deletions and insertions. Nat. Biotechnol. 38, 1347–1355 (2020); erratum 38, 1357 (2020).
Olson, N. D. et al. precisionFDA Truth Challenge V2: calling variants from short- and long-reads in difficult-to-map regions. Preprint at bioRxiv https://doi.org/10.1101/2020.11.13.380741 (2020).
Wagner, J. et al. Benchmarking challenging small variants with linked and long reads. Preprint at bioRxiv https://doi.org/10.1101/2020.07.24.212712 (2020).
Chin, C.-S. et al. A diploid assembly-based benchmark for variants in the major histocompatibility complex. Nat. Commun. 11, 4794 (2020).
Goldfeder, R. L. et al. Medical implications of technical accuracy in genome sequencing. Genome Med. 8, 24 (2016).
Ball, M. P. et al. A public resource facilitating clinical use of genomes. Proc. Natl Acad. Sci. USA 109, 11920–11927 (2012).
Tate, J. G. et al. COSMIC: the Catalogue Of Somatic Mutations In Cancer. Nucleic Acids Res. 47, D941–D947 (2019).
Ross, M. G. et al. Characterizing and measuring bias in sequence data. Genome Biol. 14, R51 (2013).
Prior, T. W., Leach, M. E. & Finanger, E. Spinal muscular atrophy. In GeneReviews [Internet] (University of Washington, 2020).
Biros, I. & Forrest, S. Spinal muscular atrophy: untangling the knot? J. Med. Genet. 36, 1–8 (1999).
Leiding, J. W. & Holland, S. M. Chronic granulomatous disease. In GeneReviews [Internet] (University of Washington, 2016).
Innan, H. A two-locus gene conversion model with selection and its application to the human RHCE and RHD genes. Proc. Natl Acad. Sci. USA 100, 8793–8798 (2003).
Hayakawa, T. et al. Coevolution of Siglec-11 and Siglec-16 via gene conversion in primates. BMC Evol. Biol. 17, 228 (2017).
Garg, P. et al. Pervasive cis effects of variation in copy number of large tandem repeats on local DNA methylation and gene expression. Am. J. Hum. Genet. https://doi.org/10.1016/j.ajhg.2021.03.016 (2021).
Lennerz, J. K. et al. Addition of H19 ‘loss of methylation testing’ for Beckwith-Wiedemann syndrome (BWS) increases the diagnostic yield. J. Mol. Diagn. 12, 576–588 (2010).
Nurk, S. et al. The complete sequence of a human genome. Preprint at bioRxiv https://doi.org/10.1101/2021.05.26.445798 (2021).
Aganezov, S. et al. A complete reference genome improves analysis of human genetic variation. Preprint at bioRxiv https://doi.org/10.1101/2021.07.12.452063 (2021).
Boisson, B. et al. Rescue of recurrent deep intronic mutation underlying cell type–dependent quantitative NEMO deficiency. J. Clin. Invest. 129, 583–597 (2018).
1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Schmidt, K., Noureen, A., Kronenberg, F. & Utermann, G. Structure, function, and genetics of lipoprotein (a). J. Lipid Res. 57, 1339–1359 (2016).
Li, H., Feng, X. & Chu, C. The design and construction of reference pangenome graphs with minigraph. Genome Biol. 21, 265 (2020).
Shumate, A. & Salzberg, S. L. Liftoff: accurate mapping of gene annotations. Bioinform. 37, 1639–1643 (2020).
Theunissen, F. et al. Structural variants may be a source of missing heritability in sALS. Front. Neurosci. 14, 47 (2020).
Guo, Y. et al. Improvements and impacts of GRCh38 human reference on high throughput sequencing data analysis. Genomics 109, 83–90 (2017).
Pan, B. et al. Similarities and differences between variants called with human reference genome HG19 or HG38. BMC Bioinform. 20, 101 (2019).
Miller, C. A. et al. Failure to detect mutations in U2AF1 due to changes in the GRCh38 reference sequence. Preprint at bioRxiv https://doi.org/10.1101/2021.05.07.442430 (2021).
Li, H. et al. Exome variant discrepancies due to reference-genome differences. Am. J. Hum. Genet. 108, 1239–1250 (2021).
Collins, R. L. et al. A structural variation reference for medical and population genetics. Nature 590, E55 (2021).
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinform. 26, 841–842 (2010).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinform. 34, 3094–3100 (2018).
Krusche, P. et al. Best practices for benchmarking germline small-variant calls in human genomes. Nat. Biotechnol. 37, 555–560 (2019).
Van der Auwera, G. A. & O’Connor, B. D. Genomics in the Cloud: Using Docker, GATK, and WDL in Terra (O’Reilly Media, 2020).
Farek, J. et al. xAtlas: scalable small variant calling across heterogeneous next-generation sequencing experiments. Preprint at bioRxiv https://doi.org/10.1101/295071 (2018).
Edge, P. & Bansal, V. Longshot enables accurate variant calling in diploid genomes from single-molecule long read sequencing. Nat. Commun. 10, 4660 (2019).
Shafin, K. et al. Haplotype-aware variant calling with PEPPER-Margin-DeepVariant enables high accuracy in nanopore long-reads. Nat. Methods 18, 1322–1332 (2021).
Sahraeian, S. M. E. et al. Deep convolutional neural networks for accurate somatic mutation detection. Nat. Commun. 10, 1041 (2019).
Walker, B. J. et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One 9, e112963 (2014).
Patterson, M. et al. WhatsHap: weighted haplotype assembly for future-generation sequencing reads. J. Comput. Biol. 6, 498–509 (2015).
Zook, J. M. et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci. Data 3, 160025 (2016).
Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://arxiv.org/abs/1303.3997 (2013).
Regier, A. A. et al. Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects. Nat. Commun. 9, 4038 (2018).
Poplin, R. et al. Scaling accurate genetic variant discovery to tens of thousands of samples. Preprint at bioRxiv https://doi.org/10.1101/201178 (2018).
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinform. 25, 2078–2079 (2009).
Rausch, T. et al. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinform. 28, 333–339 (2012).
Cameron, D. L. et al. GRIDSS: sensitive and specific genomic rearrangement detection using positional de Bruijn graph assembly. Genome Res. 27, 2050–2060 (2017).
Layer, R. M., Chiang, C., Quinlan, A. R. & Hall, I. M. LUMPY: a probabilistic framework for structural variant discovery. Genome Biol. 15, R84 (2014).
Chen, X. et al. Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinform. 32, 1220–1222 (2016).
Kronenberg, Z. N. et al. Wham: identifying structural variants of biological consequence. PLoS Comput. Biol. 11, e1004572 (2015).
Jeffares, D. C. et al. Transient structural variations have strong effects on quantitative traits and reproductive isolation in fission yeast. Nat. Commun. 8, 14061 (2017).
De Coster, W., D’Hert, S., Schultz, D. T., Cruts, M. & Van Broeckhoven, C. NanoPack: visualizing and processing long-read sequencing data. Bioinform. 34, 2666–2669 (2018).
Sedlazeck, F. J. et al. Accurate detection of complex structural variations using single-molecule sequencing. Nat. Methods 15, 461–468 (2018).
Jiang, T. et al. Long-read-based human genomic structural variation detection with cuteSV. Genome Biol. 21, 189 (2020).
Tarasov, A., Vilella, A. J., Cuppen, E., Nijman, I. J. & Prins, P. Sambamba: fast processing of NGS alignment formats. Bioinform. 31, 2032–2034 (2015).
Faust, G. G. & Hall, I. M. SAMBLASTER: fast duplicate marking and structural variant read extraction. Bioinform. 30, 2503–2505 (2014).
Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).

Acknowledgements

We thank the Genome Reference Consortium for their curation efforts of GRCh37 and GRCh38 (https://www.genomereference.org), especially V.A. Schneider and P.A. Kitts from the National Institutes of Health (NIH)/NCBI for developing the falsely duplicated regions that should be masked in GRCh38. We thank S. Miller at NIST for helping make available benchmark sets and READMEs. Certain commercial equipment, instruments or materials are identified to adequately specify experimental conditions or reported results. Such identification does not imply recommendation or endorsement by NIST, nor does it imply that the equipment, instruments or materials identified are necessarily the best available for the purpose. C.F. was funded by Instituto de Salud Carlos III (PI20/00876) and Ministerio de Ciencia e Innovación (RTC-2017-6471-1; AEI/FEDER, UE), cofinanced by the European Regional Development Fund ‘A Way of Making Europe’ from the European Union, and Cabildo Insular de Tenerife (CGIEU0000219140). J.M.L.-S. was funded by Consejería de Educación-Gobierno de Canarias and Cabildo Insular de Tenerife (BOC 163, 24/08/2017). F.J.S. and M.M. were supported by the NIH (UM1 HG008898). C.X. was supported by the Intramural Research Program of the National Library of Medicine, NIH. K.H.M. was supported by the NIH/National Human Genome Research Institute (R01 1R01HG011274-01 and U01 1U01HG010971). H.L. was supported by the NIH (R01 HG010040 and U01 HG010961). C.E.M. acknowledges funding from the WorldQuant Foundation, NASA (NNX14AH50G), the National Institutes of Health (R01MH117406, R01CA249054, R01AI151059, P01CA214274) and the Leukemia and Lymphoma Society (LLS) (MCL7001-18, LLS 9238-16, LLS-MCL7001-18).

Author information

These authors contributed equally: Chen-Shan Chin, Justin M. Zook, Fritz J. Sedlazeck.

Authors and Affiliations

Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
Justin Wagner, Nathan D. Olson, Lindsay Harris, Jennifer McDaniel & Justin M. Zook
Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA
Haoyu Cheng & Heng Li
DNAnexus, Inc., Mountain View, CA, USA
Arkarachai Fungtammasan, Yih-Chii Hwang, Richa Gupta & Chen-Shan Chin
Pacific Biosciences, Menlo Park, CA, USA
Aaron M. Wenger & William J. Rowell
Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
Ziad M. Khan, Jesse Farek, Yiming Zhu, Aishwarya Pisupati, Medhat Mahmoud & Fritz J. Sedlazeck
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
Chunlin Xiao
Genomic Medicine Center, Children’s Mercy Kansas City, Kansas City, MO, USA
Byunggil Yoo
Roche Sequencing Solutions, Santa Clara, CA, USA
Sayed Mohammad Ebrahim Sahraeian
Department of Pediatrics, Division of Genetic Medicine, University of Washington and Seattle Children’s Hospital, Seattle, WA, USA
Danny E. Miller
Department of Genome Sciences, University of Washington, Seattle, WA, USA
Danny E. Miller
Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Santa Cruz de Tenerife, Spain
David Jáspez, José M. Lorenzo-Salazar, Adrián Muñoz-Barrera, Luis A. Rubio-Rodríguez & Carlos Flores
CIBER de Enfermedades Respiratorias, Instituto de Salud Carlos III, Madrid, Spain
Carlos Flores
Research Unit, Hospital Universitario N.S. de Candelaria, Santa Cruz de Tenerife, Spain
Carlos Flores
New York Genome Center, New York, NY, USA
Giuseppe Narzisi, Uday Shanker Evani & Wayne E. Clarke
Bionano Genomics, San Diego, CA, USA
Joyce Lee
Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY, USA
Christopher E. Mason
Invitae, San Francisco, CA, USA
Stephen E. Lincoln
UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
Karen H. Miga
Sanders-Brown Center on Aging, University of Kentucky, Lexington, KY, USA
Mark T. W. Ebbert
Department of Internal Medicine, Division of Biomedical Informatics, University of Kentucky, Lexington, KY, USA
Mark T. W. Ebbert
Department of Neuroscience, University of Kentucky, Lexington, KY, USA
Mark T. W. Ebbert
Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
Alaina Shumate
Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA
Alaina Shumate

Contributions

Conceptualization: J.W., N.D.O., A.F., K.H.M., S.E.L., M.T.W.E., H.L., C.-S.C., J.M.Z. and F.J.S. Data curation: J.W., N.D.O. and J.M. Formal analysis – benchmark: J.W., N.D.O., J.M. and J.M.Z. Formal analysis – assembly: H.C., A.S., H.L. and C.-S.C. Methodology: J.W., H.C., H.L., C.-S.C., J.M.Z. and F.J.S. Project administration: J.W., J.M.Z. and F.J.S. Resources: C.X. Software: J.W. and N.D.O. Supervision: C.-S.C., J.M.Z. and F.J.S. Validation: J.W., N.D.O., L.H., J.M., H.C., A.F., Y.-C.H., R.G., A.M.W., W.J.R., Z.M.K., J.F., Y.Z., A.P., M.M., C.X., B.Y., S.M.E.S., D.J., J.M.L.-S., A.M.-B., L.A.R.-R., C.F., G.N., U.S.E., S.E.C., J.L., H.L., C.-S.C., J.M.Z. and F.J.S. Visualization: J.W., N.D.O., H.C., H.L. and C.-S.C. Writing – original draft: J.W., L.H., C.-S.C., J.M.Z. and F.J.S. Writing – review and editing: J.W., N.D.O., D.E.M., J.L., C.E.M., S.E.L., M.T.W.E., C.-S.C., J.M.Z. and F.J.S.

Corresponding authors

Correspondence to Chen-Shan Chin, Justin M. Zook or Fritz J. Sedlazeck.

Ethics declarations

Competing interests

A.M.W. and W.J.R. are employees and shareholders of Pacific Biosciences. A.F., Y.-C.H., R.G. and C.-S.C. are employees and shareholders of DNAnexus. S.M.E.S. is an employee of Roche. J.L. is a former employee and shareholder of Bionano Genomics. S.E.L. was an employee of Invitae. F.J.S. has sponsored travel from Pacific Biosciences and Oxford Nanopore Technologies. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Biotechnology thanks Adam Ameur, Christian Marshall and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figures 1–17, Notes 1–5 and Table 1.

Reporting Summary.

Supplementary Data 1

Additional characteristics of high-priority clinical genes.

Supplementary Data 2

Overlaps of the 5,038 genes on GRCh38 primary assembly between both HG002 GRCh38 v4.2.1 and HG002 hifiasm v0.11.

Supplementary Data 3

Benchmarking of the hifiasm v0.11 assembly-based variants called with dipcall against the GIAB v4.2.1 benchmark for HG002.

Supplementary Data 4

Benchmarking statistics against CMRG benchmark and evaluation callsets.

Supplementary Data 5

Manual curation results for evaluation and common errors in v0.02.03 small variant benchmark.

Supplementary Data 6

Primer designs and reaction conditions for Long-Range PCR and Sanger confirmation.

Supplementary Data 7

Genes excluded from the CMRG benchmarks, with likely reasons for exclusion annotated for GRCh38 in the last column.

Supplementary Data 8

Commands for BWA-GATK variant calling on normal GRCh38 reference.

Supplementary Data 9

Commands for BWA-GATK variant calling on v1 masked GRCh38 reference.

About this article

Cite this article

Wagner, J., Olson, N.D., Harris, L. et al. Curated variation benchmarks for challenging medically relevant autosomal genes. Nat Biotechnol 40, 672–680 (2022). https://doi.org/10.1038/s41587-021-01158-1

Received: 07 June 2021
Accepted: 10 November 2021
Published: 07 February 2022
Issue Date: May 2022
DOI: https://doi.org/10.1038/s41587-021-01158-1
