Skip to main content

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

  • Article
  • Published:

DeepConsensus improves the accuracy of sequences with a gap-aware sequence transformer

Abstract

Circular consensus sequencing with Pacific Biosciences (PacBio) technology generates long (10–25 kilobases), accurate ‘HiFi’ reads by combining serial observations of a DNA molecule into a consensus sequence. The standard approach to consensus generation, pbccs, uses a hidden Markov model. We introduce DeepConsensus, which uses an alignment-based loss to train a gap-aware transformer–encoder for sequence correction. Compared to pbccs, DeepConsensus reduces read errors by 42%. This increases the yield of PacBio HiFi reads at Q20 by 9%, at Q30 by 27% and at Q40 by 90%. With two SMRT Cells of HG003, reads from DeepConsensus improve hifiasm assembly contiguity (NG50 4.9 megabases (Mb) to 17.2 Mb), increase gene completeness (94% to 97%), reduce the false gene duplication rate (1.1% to 0.5%), improve assembly base accuracy (Q43 to Q45) and reduce variant-calling errors by 24%. DeepConsensus models could be trained to the general problem of analyzing the alignment of other types of sequences, such as unique molecular identifiers or genome assemblies.

This is a preview of subscription content, access via your institution

Access options

Buy this article

Prices may be subject to local taxes which are calculated during checkout

Fig. 1: Overview of DeepConsensus.
Fig. 2: DeepConsensus improves the accuracy of CCS reads.
Fig. 3: DeepConsensus improves the contiguity and quality of the genome assemblies generated with hifiasm.
Fig. 4: DeepConsensus improves variant-calling performance of DeepVariant.

Similar content being viewed by others

Data availability

Sequencing data, predictions and analysis files are available at https://console.cloud.google.com/storage/browser/brain-genomics-public/research/deepconsensus/publication.

Code availability

Code and pretrained models are available at https://github.com/google/deepconsensus. Sequencing data are available from the following sources:

∙ Sequel II data from Novogene42 at https://console.cloud.google.com/storage/browser/brain-genomics-public/research/sequencing

∙ 15-kb HG002 and 24-kb HG002 reads from PacBio at https://console.cloud.google.com/storage/browser/brain-genomics-public/research/deepconsensus/publication/sequencing

∙ Sequel II data from PacBio at https://downloads.pacbcloud.com/public/dataset/HG002_SV_and_SNV_CCS/

∙ HG002 diploid assembly at https://obj.umiacs.umd.edu/marbl_publications/hicanu/hg002_hifi_hicanu_combined.fasta.gz

References

  1. Bentley, D. R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  2. Travers, K. J., Chin, C.-S., Rank, D. R., Eid, J. S. & Turner, S. W. A flexible and efficient template format for circular consensus sequencing and SNP detection. Nucleic Acids Res. 38, e159 (2010).

    Article  PubMed  PubMed Central  Google Scholar 

  3. Mc Cartney, A. M. et al. Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies. Nat. Methods 19, 687–695 (2022).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  4. Altemose, N. et al. Complete genomic and epigenetic maps of human centromeres. Science. 376, eabl4178 (2022).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  5. Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  6. Olson, N. D. et al. precisionFDA Truth Challenge V2: calling variants from short- and long-reads in difficult-to-map regions. Cell Genom. 2, 100129 (2022).

  7. Nurk, S. et al. The complete sequence of a human genome. Science 376, 44 (2022).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  8. Logsdon, G. A., Vollger, M. R. & Eichler, E. E. Long-read human genome sequencing and its applications. Nat. Rev. Genet. 21, 597–614 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  9. Chin, C.-S. et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat. Methods 10, 563–569 (2013).

    Article  CAS  PubMed  Google Scholar 

  10. Chin, C.-S. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods 13, 1050–1054 (2016).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  11. Vaser, R., Sović, I., Nagarajan, N. & Šikić, M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 27, 737–746 (2017).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  12. Walker, B. J. et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS ONE 9, e112963 (2014).

    Article  PubMed  PubMed Central  Google Scholar 

  13. Shafin, K. et al. Haplotype-aware variant calling enables high accuracy in nanopore long-reads using deep neural networks. Nat. Methods 18, 1322–1332 (2021).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  14. Vaswani, A. et al. Attention is all you need. Preprint at https://doi.org/10.48550/arXiv.1706.03762 (2017).

  15. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. Preprint at https://doi.org/10.48550/arXiv.1810.04805 (2018).

  16. Dosovitskiy, A. et al. An image is worth 16 × 16 words: transformers for image recognition at scale. Preprint at https://doi.org/10.48550/arXiv.2010.11929 (2020).

  17. Rao, R. et al. MSA transformer. Preprint at bioRxiv https://doi.org/10.1101/2021.02.12.430858 (2021).

  18. The AlphaFold team. AlphaFold: a solution to a 50-year-old grand challenge in biology. DeepMind https://deepmind.com/blog/article/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology

  19. Mensch, A. & Blondel, M. Differentiable dynamic programming for structured prediction and attention. Proc. 35th International Conference on Machine Learning 80, 3462–3471 (2018).

  20. Nurk, S. et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. 30, 1291–1305 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  21. Lal, A. et al. Improving long-read consensus sequencing accuracy with deep learning. Preprint at bioRxiv https://doi.org/10.1101/2021.06.28.450238 (2021).

  22. Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  23. Gurevich, A., Saveliev, V., Vyahhi, N. & Tesler, G. QUAST: quality assessment tool for genome assemblies. Bioinformatics 29, 1072–1075 (2013).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  24. Li, H. et al. A synthetic-diploid benchmark for accurate variant-calling evaluation. Nat. Methods 15, 595–597 (2018).

    Article  PubMed  PubMed Central  Google Scholar 

  25. Wagner, J. et al. Benchmarking challenging small variants with linked and long reads. Cell Genom. 2, 100128 (2020).

  26. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  27. Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).

    Article  CAS  PubMed  Google Scholar 

  28. Krusche, P. et al. Best practices for benchmarking germline small-variant calls in human genomes. Nat. Biotechnol. 37, 555–560 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  29. Hu, J., Fan, J., Sun, Z. & Liu, S. NextPolish: a fast and efficient genome polishing tool for long-read assembly. Bioinformatics 36, 2253–2255 (2020).

    Article  CAS  PubMed  Google Scholar 

  30. Warren, R. L. et al. ntEdit: scalable genome sequence polishing. Bioinformatics 35, 4430–4432 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  31. Morisse, P., Marchet, C., Limasset, A., Lecroq, T. & Lefebvre, A. Scalable long read self-correction and assembly polishing with multiple sequence alignment. Sci. Rep. 11, 761 (2021).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  32. Islam, S. et al. Quantitative single-cell RNA-seq with unique molecular identifiers. Nat. Methods 11, 163–166 (2014).

    Article  CAS  PubMed  Google Scholar 

  33. Shafin, K. et al. Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nat. Biotechnol. 38, 1044–1053 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  34. Avsec, Ž. et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods 18, 1196–1203 (2021).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  35. Huang, Z. et al. CCNet: criss-cross attention for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 603–612 (2020).

  36. Choromanski, K. et al. Rethinking attention with performers. Preprint at https://doi.org/10.48550/arXiv.2009.14794 (2020).

  37. Wang, S., Li, B. Z., Khabsa, M., Fang, H. & Ma, H. Linformer: self-attention with linear complexity. Preprint at https://doi.org/10.48550/arXiv.2006.04768 (2020).

  38. Katharopoulos, A., Vyas, A., Pappas, N. & Fleuret, F. Transformers are RNNs: fast autoregressive transformers with linear attention. Preprint at https://doi.org/10.48550/arXiv.2006.16236 (2020).

  39. Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118, e2016239118 (2021).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  40. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at https://doi.org/10.48550/arXiv.1412.6980 (2014).

  41. Zook, J. M. et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci. Data 3, 160025 (2016).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  42. Baid, G. et al. An extensive sequence dataset of gold-standard samples for benchmarking and development. Preprint at bioRxiv https://doi.org/10.1101/2020.12.11.422022 (2020).

Download references

Acknowledgements

We thank F. Liu of the Google TensorFlow Model Garden team for improving our use of open-source implementation of the transformer architecture.

Author information

Authors and Affiliations

Authors

Contributions

G.B., P.-C.C. and A.C. conceived the study. G.B. and D.E.C. wrote DeepConsensus and trained models. G.B., D.E.C., K.S., T.Y., M.N. and A.B. performed experiments with DeepConsensus reads and made figures and documentation. F.L.-L., Q.B. and J.-P.V. conceived and implemented the alignment loss strategy, which D.E.C. integrated into DeepConsensus. A.M.W., W.J.R. and A.T. provided insight into PacBio data, identified areas for improvement, suggested informative features and provided code for preprocessing and evaluation. W.A. experimented with embedding strategies. A.K. and A.T. contributed to efficient processing of PacBio reads. H.Y. coordinated data acquisition and research agreements. J.-P.V., A.V., C.Y.M., M.N., P.-C.C. and A.C. provided guidance on experimental design, architecture and code review. G.B., D.E.C., K.S., T.Y., F.L.-L., Q.B., A.M.W., W.J.R., M.N., J.-P.V., A.V., C.Y.M., P.-C.C. and A.C. wrote the paper.

Corresponding author

Correspondence to Andrew Carroll.

Ethics declarations

Competing interests

G.B., D.E.C., K.S., T.Y., F.L.-L., Q.B., A.B., M.N., H.Y., A.K., W.A., J.-P.V., A.V., C.Y.M., P.-C.C. and A.C. are employees of Google LLC and own Alphabet stock as part of the standard compensation package. A.M.W., A.T. and W.J.R. are full-time employees and shareholders of Pacific Biosciences. This study was funded by Google LLC.

Peer review

Peer review information

Nature Biotechnology thanks Justin Zook, Andrey Bzikadze and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 DeepConsensus with longer reads improves genome assembly contiguity.

(a) HG002 read length distribution for 15kb and 24kb DeepConsensus reads from two SMRT Cells. (b) Contiguity of the HG002 hifiasm assembly with 15kb and 24kb DeepConsensus reads from two SMRT Cells. (c) HG002 variant calling performance for 15kb and 24kb reads from DeepConsensus for two SMRT Cells.

Supplementary information

Supplementary Information

Supplementary Figs. 1–11, Supplementary Tables 1–29 and documentation of software commands used.

Reporting Summary

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Baid, G., Cook, D.E., Shafin, K. et al. DeepConsensus improves the accuracy of sequences with a gap-aware sequence transformer. Nat Biotechnol 41, 232–238 (2023). https://doi.org/10.1038/s41587-022-01435-7

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1038/s41587-022-01435-7

This article is cited by

Search

Quick links

Nature Briefing AI and Robotics

Sign up for the Nature Briefing: AI and Robotics newsletter — what matters in AI and robotics research, free to your inbox weekly.

Get the most important science stories of the day, free in your inbox. Sign up for Nature Briefing: AI and Robotics