Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold

Steinegger, Martin; Mirdita, Milot; Söding, Johannes

doi:10.1038/s41592-019-0437-4

Brief Communication
Published: 24 June 2019

Nature Methods volume 16, pages 603–606 (2019)

Abstract

The open-source de novo protein-level assembler, Plass (https://plass.mmseqs.com), assembles six-frame-translated sequencing reads into protein sequences. It recovers 2–10 times more protein sequences from complex metagenomes and can assemble huge datasets. We assembled two redundancy-filtered reference protein catalogs, 2 billion sequences from 640 soil samples (soil reference protein catalog) and 292 million sequences from 775 marine eukaryotic metatranscriptomes (marine eukaryotic reference catalog), the largest free collections of protein sequences.

**Fig. 2: Sensitivity and precision of protein sequences assembled from synthetic reads and associated false discovery rate (FDR) of functional annotations.**

**Fig. 3: Plass assembles more protein sequences from various environments than the state of the art.**

Data availability

The assembled protein sequence sets are available in FASTA format under a Creative Commons Attribution CC-BY 4.0 License at https://plass.mmseqs.com. All scripts and benchmark data including command-line parameters necessary to reproduce the benchmark and analysis results presented are available at https://github.com/martin-steinegger/plass-analysis.

Code availability

Plass is GPLv3-licensed open-source software. The source code and binaries for Plass can be downloaded at https://github.com/soedinglab/plass.

Acknowledgements

We are grateful to C. Notredame and C. Seok for hosting M.S. at the Centre for Genomic Regulation and Seoul National University for 12 and 30 months, respectively. We thank S. Sunagawa, F. Meyer and A. Sczyrba for helpful discussions, and T. Brown for his early analysis and detailed feedback on Plass results. We thank all who contributed metagenomic datasets used to build SRC and MERC, in particular contributors to the TARA ocean project and the US Department of Energy Joint Genome Institute (http://www.jgi.doe.gov). This work was supported by the EU’s Horizon 2020 Framework Programme (Virus-X, grant no. 685778).

Author information

Authors and Affiliations

Quantitative and Computational Biology Group, Max-Planck Institute for Biophysical Chemistry, Göttingen, Germany
Martin Steinegger, Milot Mirdita & Johannes Söding
Department of Chemistry, Seoul National University, Seoul, Korea
Martin Steinegger
Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA
Martin Steinegger

Contributions

M.S. and J.S. designed the research study. M.S. and M.M. developed code and performed the analyses. M.S. and J.S. wrote the manuscript.

Corresponding authors

Correspondence to Martin Steinegger or Johannes Söding.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information: Lei Tang was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Schematic comparison of a nucleotide- and a protein assembly.

On top is the final protein assembly followed by the stacked overlapping protein reads. The small gray section highlights the multiple protein sequence alignment of the overlapping reads and below the respective nucleotide alignment. Less ambiguity is visible on the protein level due to conservative mutations (mutations with similar biochemical properties) compared to the nucleotide level, resulting in an assembly that is more robust to microdiversity in the population.
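The reduced ambiguity at the protein level can be illustrated with a small sketch. It covers only the simplest case, synonymous codon changes, in which reads that differ at the nucleotide level translate to identical protein fragments; conservative amino acid substitutions, which the caption also refers to, are not modeled. The two reads are invented for illustration, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(nucleotides):
    """Translate a nucleotide string in frame 0, ignoring a trailing partial codon."""
    return "".join(
        CODON_TABLE[nucleotides[i:i + 3]]
        for i in range(0, len(nucleotides) - 2, 3)
    )


def count_mismatches(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical reads from two closely related strains (illustrative only).
read_a = "ATGCTGAAAGGT"
read_b = "ATGTTAAAGGGC"

nucleotide_mismatches = count_mismatches(read_a, read_b)
protein_mismatches = count_mismatches(translate(read_a), translate(read_b))
```

Here the reads disagree at several nucleotide positions, yet their translations are identical, so an overlap that looks ambiguous to a nucleotide assembler is unambiguous after translation.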

Supplementary Figure 2 Overlap of assemblies of Plass with Megahit and metaSPAdes.

(a) Left: 38.3% of the amino acids in the Plass-assembled proteins of set 1 are covered by alignments to proteins in the Megahit assembly at a minimum sequence identity cut-off of 99%. Conversely, 83.2% of the proteins in the Megahit-assembled set 1 are covered by alignments with Plass-assembled proteins. Right: same as on the left, but comparing the Plass assembly with the metaSPAdes assembly. (b) Same as (a), but for protein set 2.

Supplementary Figure 3 Effect of neural network filter to remove wrong translation frames.

Sensitivity and precision in set 1 (a) and set 2 (b). Top: assembly sensitivity is the fraction of reference sequence amino acids that match an assembled protein sequence. Bottom: assembly precision is the fraction of assembled amino acids that match a reference protein at the minimum sequence identity on the x-axis. Plass merges fragments at a minimum sequence identity of 90%; Plass-97 uses a threshold of 97%.
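The two measures defined in this caption are both coverage fractions, and a minimal sketch of how such a fraction could be computed is given below. It is not the authors' benchmark code (their scripts are in the plass-analysis repository). The alignment tuple layout `(sequence name, start, end, identity)` with half-open coordinates and the function name `covered_fraction` are assumptions made for illustration.

```python
def covered_fraction(seq_lengths, alignments, min_identity):
    """Fraction of all residues covered by at least one accepted alignment.

    seq_lengths: dict mapping sequence name -> length in amino acids.
    alignments:  iterable of (name, start, end, identity), with half-open
                 coordinates [start, end) on the named sequence.
    """
    covered = {name: set() for name in seq_lengths}
    for name, start, end, identity in alignments:
        if identity >= min_identity:
            # A set of positions counts overlapping alignments only once.
            covered[name].update(range(start, end))
    total = sum(seq_lengths.values())
    if total == 0:
        return 0.0
    return sum(len(positions) for positions in covered.values()) / total


# Toy data (illustrative only): alignments of assembled proteins onto references.
reference_lengths = {"ref1": 100, "ref2": 50}
hits_on_references = [
    ("ref1", 0, 60, 0.95),
    ("ref1", 40, 80, 0.99),
    ("ref2", 0, 50, 0.85),
]

sensitivity_at_90 = covered_fraction(reference_lengths, hits_on_references, 0.90)
sensitivity_at_80 = covered_fraction(reference_lengths, hits_on_references, 0.80)
```

Calling the same function with the assembled proteins as `seq_lengths` and the alignment coordinates on the assembled side would give the precision in the sense of the caption.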

Supplementary Figure 4 Comparison of Megahit assignment using the 2bLCA protocol.

The MMseqs2 taxonomy assignment workflow uses three steps to assign a taxonomic label to a query sequence. (1) We search with the query sequence against a reference database and extract the aligned subsequence of the best hit. (2) This sequence is matched again against the reference database. Each hit with an E-value smaller than the best hit E-value from the previous search is accepted. (3) We compute the lowest common ancestor based on the taxonomic labels of all accepted hits.
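Steps (2) and (3) can be sketched on a toy taxonomy as follows. This is an illustrative sketch, not the MMseqs2 implementation: the parent map, the hit list, the E-values and the function names are all hypothetical, and the two database searches of steps (1) and (2) are replaced by precomputed values.

```python
def lineage(taxon, parent):
    """Return the path from a taxon up to the root of the taxonomy."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path


def lowest_common_ancestor(taxa, parent):
    """Deepest node shared by the lineages of all given taxa, or None if empty."""
    root_first_paths = [lineage(t, parent)[::-1] for t in taxa]
    lca = None
    for nodes in zip(*root_first_paths):
        if len(set(nodes)) != 1:
            break
        lca = nodes[0]
    return lca


def accept_hits(hits, best_evalue):
    """Step (2): keep hits whose E-value is smaller than the first-search best E-value."""
    return [taxon for taxon, evalue in hits if evalue < best_evalue]


# Hypothetical toy taxonomy (child -> parent) and second-search hits.
PARENT = {
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Firmicutes": "Bacteria",
    "Escherichia": "Proteobacteria",
    "Salmonella": "Proteobacteria",
    "Bacillus": "Firmicutes",
}
second_search_hits = [("Escherichia", 1e-50), ("Salmonella", 1e-45), ("Bacillus", 1e-10)]
best_evalue_first_search = 1e-40

accepted = accept_hits(second_search_hits, best_evalue_first_search)
assigned_taxon = lowest_common_ancestor(accepted, PARENT)
```

In this toy case the weak Bacillus hit is rejected by the E-value criterion, so the query is assigned at the phylum level rather than being pushed up to Bacteria.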

Supplementary Figure 5 Plass ORF extraction and start codon prediction (ORF calling).

Plass extracts two sets of ORFs. ORF set 1 contains all translated ORFs with at least 45 codons. ORF set 2 contains all translated ORFs with at least 20 codons starting with a putative ATG start codon that is the first ATG codon after a stop codon in the same frame. (Start codon prediction) Plass predicts start codons with a consensus method using a multiple sequence alignment of ORF set 1 and 2. Wherever at least 20% of all methionines in one column are marked by a prepended asterisk, it removes the preceding residues from all other sequences and prepends an asterisk to all sequences to mark the start.
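The extraction of the two ORF sets can be sketched as below. This is a simplified illustration rather than Plass's implementation. It assumes the standard genetic code, counts ORF length in translated residues without the stop codon, treats read ends as ORF boundaries, and maps codons containing ambiguous bases to `X`. The function names are hypothetical, and the consensus start codon prediction is not covered.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frames(read):
    """Yield the translation of each of the six reading frames of a read."""
    for strand in (read, read.translate(COMPLEMENT)[::-1]):
        for offset in range(3):
            codons = (strand[i:i + 3] for i in range(offset, len(strand) - 2, 3))
            yield "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def extract_orfs(read, min_len_any=45, min_len_atg=20):
    """Return (ORF set 1, ORF set 2) following the thresholds in the caption."""
    set1, set2 = [], []
    for protein in six_frames(read):
        for index, segment in enumerate(protein.split("*")):
            # ORF set 1: any stop-free stretch of at least min_len_any codons.
            if len(segment) >= min_len_any:
                set1.append(segment)
            # ORF set 2: starts at the first ATG (M) that follows a stop codon
            # in the same frame, so the first segment of a frame is skipped.
            if index > 0 and "M" in segment:
                candidate = segment[segment.index("M"):]
                if len(candidate) >= min_len_atg:
                    set2.append(candidate)
    return set1, set2


# Toy reads (illustrative only).
read_with_start = "TAAATG" + "GCT" * 24 + "TAA"   # stop, ATG, 24 Ala codons, stop
read_without_stop = "GCT" * 50                    # 50 Ala codons, no stop, no ATG

short_set1, short_set2 = extract_orfs(read_with_start)
long_set1, long_set2 = extract_orfs(read_without_stop)
```

The first toy read is too short for ORF set 1 but yields an ATG-initiated ORF for set 2; the second contributes only to set 1, because it contains no stop codon from which a start could be inferred.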

Supplementary Figure 6 Taxonomy evaluation of the soil metagenome assembly.

(a) We investigate the taxonomic composition of the 8 most abundant taxa (all other taxa are pooled in ‘Others’) in the soil assemblies from Fig. 2d (blue: Megahit, red: Plass) and the assemblies of the 12 soil samples from Fig. 2e (light blue: Megahit, light red: Plass). On top we show the read count ratios between Plass and Megahit, for both the single and 12 soil assemblies. The inset gives the fraction of reads in the single and the 12 soil samples that could be mapped to an assembled protein sequence. (b) We show the count of assembled amino acids within various coverage ranges for Megahit (blue) and Plass (red) in the single soil sample.

Supplementary information

Supplementary Information

Supplementary Figs. 1–6

Reporting Summary

Source data

Source Data

About this article

Cite this article

Steinegger, M., Mirdita, M. & Söding, J. Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nat Methods 16, 603–606 (2019). https://doi.org/10.1038/s41592-019-0437-4

Received: 01 August 2018
Revised: 15 March 2019
Accepted: 05 May 2019
Published: 24 June 2019
Issue Date: July 2019
DOI: https://doi.org/10.1038/s41592-019-0437-4
