DeepConsensus improves the accuracy of sequences with a gap-aware sequence transformer

Baid, Gunjan; Cook, Daniel E.; Shafin, Kishwar; Yun, Taedong; Llinares-López, Felipe; Berthet, Quentin; Belyaeva, Anastasiya; Töpfer, Armin; Wenger, Aaron M.; Rowell, William J.; Yang, Howard; Kolesnikov, Alexey; Ammar, Waleed; Vert, Jean-Philippe; Vaswani, Ashish; McLean, Cory Y.; Nattestad, Maria; Chang, Pi-Chuan; Carroll, Andrew

doi:10.1038/s41587-022-01435-7

Article
Published: 01 September 2022

Nature Biotechnology volume 41, pages 232–238 (2023)

Abstract

Circular consensus sequencing with Pacific Biosciences (PacBio) technology generates long (10–25 kilobases), accurate ‘HiFi’ reads by combining serial observations of a DNA molecule into a consensus sequence. The standard approach to consensus generation, pbccs, uses a hidden Markov model. We introduce DeepConsensus, which uses an alignment-based loss to train a gap-aware transformer–encoder for sequence correction. Compared to pbccs, DeepConsensus reduces read errors by 42%. This increases the yield of PacBio HiFi reads at Q20 by 9%, at Q30 by 27% and at Q40 by 90%. With two SMRT Cells of HG003, reads from DeepConsensus improve hifiasm assembly contiguity (NG50 4.9 megabases (Mb) to 17.2 Mb), increase gene completeness (94% to 97%), reduce the false gene duplication rate (1.1% to 0.5%), improve assembly base accuracy (Q43 to Q45) and reduce variant-calling errors by 24%. DeepConsensus models could be trained for the general problem of analyzing the alignment of other types of sequences, such as unique molecular identifiers or genome assemblies.
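The abstract reports accuracy on the Phred scale (Q20, Q30, Q40, Q43, Q45) and contiguity as NG50. As a reading aid, the sketch below shows the standard definitions of both quantities. The function names are hypothetical and chosen for illustration; this is not code from DeepConsensus, and the toy contig lengths are made up.

```python
import math


def phred_to_error(q: float) -> float:
    """Per-base error probability implied by a Phred-scaled quality Q."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Phred-scaled quality implied by a per-base error probability p."""
    return -10 * math.log10(p)


def ng50(contig_lengths, genome_size):
    """Length of the contig at which the running sum of contig lengths,
    taken longest first, reaches half of genome_size.

    Returns None if the assembly covers less than half of genome_size.
    Passing the total assembly length as genome_size gives N50 instead.
    """
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if 2 * total >= genome_size:
            return length
    return None


if __name__ == "__main__":
    # Q20, Q30 and Q40 correspond to 1 error in 100, 1,000 and 10,000 bases.
    for q in (20, 30, 40):
        print(f"Q{q}: error probability {phred_to_error(q):.0e}")

    # Moving from Q43 to Q45 scales the error rate by 10**(-0.2), roughly 0.63.
    ratio = phred_to_error(45) / phred_to_error(43)
    print(f"Q43 -> Q45 error ratio: {ratio:.3f}")

    # Illustrative only: a uniform 42% cut in error rate is a gain of about 2.4 Q.
    print(f"Q gain for a 42% error reduction: {error_to_phred(0.58):.2f}")

    # Toy contigs (hypothetical lengths) against an assumed genome size of 200.
    print("NG50:", ng50([50, 40, 30, 20, 10], 200))
```

Because the Phred scale is logarithmic, the two-point gain in assembly base accuracy (Q43 to Q45) corresponds to roughly 37% fewer base errors. NG50 differs from N50 only in the denominator: it uses an estimated genome size rather than the assembly's own total length, which makes values comparable across assemblies of the same genome.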

**Fig. 2: DeepConsensus improves the accuracy of CCS reads.**

**Fig. 3: DeepConsensus improves the contiguity and quality of the genome assemblies generated with hifiasm.**

**Fig. 4: DeepConsensus improves variant-calling performance of DeepVariant.**

Data availability

Sequencing data, predictions and analysis files are available at https://console.cloud.google.com/storage/browser/brain-genomics-public/research/deepconsensus/publication.

Code availability

Code and pretrained models are available at https://github.com/google/deepconsensus. Sequencing data are available from the following sources:

∙ Sequel II data from Novogene⁴² at https://console.cloud.google.com/storage/browser/brain-genomics-public/research/sequencing

∙ 15-kb HG002 and 24-kb HG002 reads from PacBio at https://console.cloud.google.com/storage/browser/brain-genomics-public/research/deepconsensus/publication/sequencing

∙ Sequel II data from PacBio at https://downloads.pacbcloud.com/public/dataset/HG002_SV_and_SNV_CCS/

∙ HG002 diploid assembly at https://obj.umiacs.umd.edu/marbl_publications/hicanu/hg002_hifi_hicanu_combined.fasta.gz
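Several of the locations above are Google Cloud Console browser links. A console link of the form `https://console.cloud.google.com/storage/browser/<bucket>/<path>` refers to the Cloud Storage location `gs://<bucket>/<path>`, which is the form command-line and client tools expect. The helper below is a minimal sketch of that conversion; its name is hypothetical, and it assumes the links keep the console layout shown in this section.

```python
from urllib.parse import urlparse

CONSOLE_HOST = "console.cloud.google.com"
CONSOLE_PREFIX = "/storage/browser/"


def console_url_to_gs_uri(url: str) -> str:
    """Convert a Cloud Console storage-browser URL to a gs:// URI.

    Hypothetical helper for illustration; raises ValueError for any URL
    that is not a console storage-browser link.
    """
    parsed = urlparse(url)
    if parsed.netloc != CONSOLE_HOST or not parsed.path.startswith(CONSOLE_PREFIX):
        raise ValueError(f"not a Cloud Console storage-browser URL: {url}")
    # Everything after the prefix is "<bucket>/<object path>".
    return "gs://" + parsed.path[len(CONSOLE_PREFIX):].rstrip("/")


if __name__ == "__main__":
    print(console_url_to_gs_uri(
        "https://console.cloud.google.com/storage/browser/"
        "brain-genomics-public/research/deepconsensus/publication"
    ))
```

The other links in the list (the PacBio download site and the UMD object store) are ordinary HTTPS locations and do not need this conversion.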

References

1. Bentley, D. R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008).
2. Travers, K. J., Chin, C.-S., Rank, D. R., Eid, J. S. & Turner, S. W. A flexible and efficient template format for circular consensus sequencing and SNP detection. Nucleic Acids Res. 38, e159 (2010).
3. Mc Cartney, A. M. et al. Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies. Nat. Methods 19, 687–695 (2022).
4. Altemose, N. et al. Complete genomic and epigenetic maps of human centromeres. Science 376, eabl4178 (2022).
5. Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162 (2019).
6. Olson, N. D. et al. precisionFDA Truth Challenge V2: calling variants from short- and long-reads in difficult-to-map regions. Cell Genom. 2, 100129 (2022).
7. Nurk, S. et al. The complete sequence of a human genome. Science 376, 44 (2022).
8. Logsdon, G. A., Vollger, M. R. & Eichler, E. E. Long-read human genome sequencing and its applications. Nat. Rev. Genet. 21, 597–614 (2020).
9. Chin, C.-S. et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat. Methods 10, 563–569 (2013).
10. Chin, C.-S. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods 13, 1050–1054 (2016).
11. Vaser, R., Sović, I., Nagarajan, N. & Šikić, M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 27, 737–746 (2017).
12. Walker, B. J. et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS ONE 9, e112963 (2014).
13. Shafin, K. et al. Haplotype-aware variant calling enables high accuracy in nanopore long-reads using deep neural networks. Nat. Methods 18, 1322–1332 (2021).
14. Vaswani, A. et al. Attention is all you need. Preprint at https://doi.org/10.48550/arXiv.1706.03762 (2017).
15. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. Preprint at https://doi.org/10.48550/arXiv.1810.04805 (2018).
16. Dosovitskiy, A. et al. An image is worth 16 × 16 words: transformers for image recognition at scale. Preprint at https://doi.org/10.48550/arXiv.2010.11929 (2020).
17. Rao, R. et al. MSA transformer. Preprint at bioRxiv https://doi.org/10.1101/2021.02.12.430858 (2021).
18. The AlphaFold team. AlphaFold: a solution to a 50-year-old grand challenge in biology. DeepMind https://deepmind.com/blog/article/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology
19. Mensch, A. & Blondel, M. Differentiable dynamic programming for structured prediction and attention. Proc. 35th International Conference on Machine Learning 80, 3462–3471 (2018).
20. Nurk, S. et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. 30, 1291–1305 (2020).
21. Lal, A. et al. Improving long-read consensus sequencing accuracy with deep learning. Preprint at bioRxiv https://doi.org/10.1101/2021.06.28.450238 (2021).
22. Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
23. Gurevich, A., Saveliev, V., Vyahhi, N. & Tesler, G. QUAST: quality assessment tool for genome assemblies. Bioinformatics 29, 1072–1075 (2013).
24. Li, H. et al. A synthetic-diploid benchmark for accurate variant-calling evaluation. Nat. Methods 15, 595–597 (2018).
25. Wagner, J. et al. Benchmarking challenging small variants with linked and long reads. Cell Genom. 2, 100128 (2022).
26. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
27. Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).
28. Krusche, P. et al. Best practices for benchmarking germline small-variant calls in human genomes. Nat. Biotechnol. 37, 555–560 (2019).
29. Hu, J., Fan, J., Sun, Z. & Liu, S. NextPolish: a fast and efficient genome polishing tool for long-read assembly. Bioinformatics 36, 2253–2255 (2020).
30. Warren, R. L. et al. ntEdit: scalable genome sequence polishing. Bioinformatics 35, 4430–4432 (2019).
31. Morisse, P., Marchet, C., Limasset, A., Lecroq, T. & Lefebvre, A. Scalable long read self-correction and assembly polishing with multiple sequence alignment. Sci. Rep. 11, 761 (2021).
32. Islam, S. et al. Quantitative single-cell RNA-seq with unique molecular identifiers. Nat. Methods 11, 163–166 (2014).
33. Shafin, K. et al. Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nat. Biotechnol. 38, 1044–1053 (2020).
34. Avsec, Ž. et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods 18, 1196–1203 (2021).
35. Huang, Z. et al. CCNet: criss-cross attention for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 603–612 (2020).
36. Choromanski, K. et al. Rethinking attention with performers. Preprint at https://doi.org/10.48550/arXiv.2009.14794 (2020).
37. Wang, S., Li, B. Z., Khabsa, M., Fang, H. & Ma, H. Linformer: self-attention with linear complexity. Preprint at https://doi.org/10.48550/arXiv.2006.04768 (2020).
38. Katharopoulos, A., Vyas, A., Pappas, N. & Fleuret, F. Transformers are RNNs: fast autoregressive transformers with linear attention. Preprint at https://doi.org/10.48550/arXiv.2006.16236 (2020).
39. Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118, e2016239118 (2021).
40. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at https://doi.org/10.48550/arXiv.1412.6980 (2014).
41. Zook, J. M. et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci. Data 3, 160025 (2016).
42. Baid, G. et al. An extensive sequence dataset of gold-standard samples for benchmarking and development. Preprint at bioRxiv https://doi.org/10.1101/2020.12.11.422022 (2020).

Acknowledgements

We thank F. Liu of the Google TensorFlow Model Garden team for improving our use of the open-source implementation of the transformer architecture.

Author information

These authors contributed equally: Gunjan Baid, Daniel E. Cook, Maria Nattestad, Pi-Chuan Chang, Andrew Carroll.

Authors and Affiliations

Google LLC, Mountain View, CA, USA
Gunjan Baid, Daniel E. Cook, Kishwar Shafin, Taedong Yun, Felipe Llinares-López, Quentin Berthet, Anastasiya Belyaeva, Howard Yang, Alexey Kolesnikov, Waleed Ammar, Jean-Philippe Vert, Ashish Vaswani, Cory Y. McLean, Maria Nattestad, Pi-Chuan Chang & Andrew Carroll
Pacific Biosciences, Menlo Park, CA, USA
Armin Töpfer, Aaron M. Wenger & William J. Rowell

Contributions

G.B., P.-C.C. and A.C. conceived the study. G.B. and D.E.C. wrote DeepConsensus and trained models. G.B., D.E.C., K.S., T.Y., M.N. and A.B. performed experiments with DeepConsensus reads and made figures and documentation. F.L.-L., Q.B. and J.-P.V. conceived and implemented the alignment loss strategy, which D.E.C. integrated into DeepConsensus. A.M.W., W.J.R. and A.T. provided insight into PacBio data, identified areas for improvement, suggested informative features and provided code for preprocessing and evaluation. W.A. experimented with embedding strategies. A.K. and A.T. contributed to efficient processing of PacBio reads. H.Y. coordinated data acquisition and research agreements. J.-P.V., A.V., C.Y.M., M.N., P.-C.C. and A.C. provided guidance on experimental design, architecture and code review. G.B., D.E.C., K.S., T.Y., F.L.-L., Q.B., A.M.W., W.J.R., M.N., J.-P.V., A.V., C.Y.M., P.-C.C. and A.C. wrote the paper.

Corresponding author

Correspondence to Andrew Carroll.

Ethics declarations

Competing interests

G.B., D.E.C., K.S., T.Y., F.L.-L., Q.B., A.B., M.N., H.Y., A.K., W.A., J.-P.V., A.V., C.Y.M., P.-C.C. and A.C. are employees of Google LLC and own Alphabet stock as part of the standard compensation package. A.M.W., A.T. and W.J.R. are full-time employees and shareholders of Pacific Biosciences. This study was funded by Google LLC.

Peer review

Peer review information

Nature Biotechnology thanks Justin Zook, Andrey Bzikadze and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Extended data

Extended Data Fig. 1 DeepConsensus with longer reads improves genome assembly contiguity.

(a) HG002 read length distribution for 15-kb and 24-kb DeepConsensus reads from two SMRT Cells. (b) Contiguity of the HG002 hifiasm assembly with 15-kb and 24-kb DeepConsensus reads from two SMRT Cells. (c) HG002 variant-calling performance for 15-kb and 24-kb reads from DeepConsensus for two SMRT Cells.

Supplementary information

Supplementary Information

Supplementary Figs. 1–11, Supplementary Tables 1–29 and documentation of software commands used.

Reporting Summary

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Baid, G., Cook, D.E., Shafin, K. et al. DeepConsensus improves the accuracy of sequences with a gap-aware sequence transformer. Nat Biotechnol 41, 232–238 (2023). https://doi.org/10.1038/s41587-022-01435-7

Received: 28 October 2021
Accepted: 15 July 2022
Published: 01 September 2022
Issue Date: February 2023
DOI: https://doi.org/10.1038/s41587-022-01435-7
