Environmental meta-omics is rapidly expanding as sequencing capabilities improve, computing technologies become more accessible, and associated costs are reduced. The in situ snapshots of marine microbial life afforded by these data provide a growing knowledge of the functional roles of communities in ecosystem processes. Metaproteomics allows for the characterization of the dynamic proteome of a complex microbial community. It has the potential to reveal impacts of microbial metabolism on biogeochemical transport, storage and cycling (for example, Hawley et al., 2014), while additionally clarifying which taxonomic groups perform these roles. Previous work illuminated many of the important functions and interactions within marine microbial communities (for example, Morris et al., 2010), but a review of ocean metaproteomics literature revealed little standardization in bioinformatics pipelines for detecting peptides and inferring and annotating proteins. As the prevalence of these data sets grows, there is a critical need to develop standardized approaches for mass spectrometry (MS) proteomic spectrum identification and annotation to maximize the scientific value of the data obtained. Here, we demonstrate that bioinformatics decisions made throughout the peptide identification process are as important for data interpretation as the choice of sampling protocol and the experimental design of bacterial community manipulations. Our analysis offers a best practices guide for environmental metaproteomics.

MS-based metaproteomics is now practical due to advances in duty cycle and increased mass accuracy for both precursor and fragment masses. These improvements allow for the detection of over 10^4 tandem mass spectra from a single data-dependent acquisition MS analysis of a mixed microbial sample. These spectra must then be associated with peptides from thousands of proteins from diverse taxonomic groups. The most common approach is database searching: scoring observed tandem mass spectra against theoretical peptide spectra generated in silico from a protein or peptide database (Eng et al., 1994). However, the approach to database selection, or construction, can vary dramatically. In an ocean metaproteomics experiment, the two main approaches for creating a protein identification database are to (1) leverage vast quantities of public sequence data or (2) sequence and assemble a metagenome. Further, when public sequence data are used, a wide range of databases and sequence selection methods is employed. As the field of environmental proteomics grows, the integrity of metaproteomics data sets and our ability to directly compare them across time and space depend on the adoption of a standardized procedure for peptide identification and annotation. Here, we reveal how highly influential the protein database selection is to the biological interpretations of a metaproteomics experiment.
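To illustrate what generating peptides in silico from a protein database involves, the following is a minimal Python sketch of a tryptic digest. It assumes the common rule that trypsin cleaves after lysine (K) or arginine (R) unless the next residue is proline (P). The function names `cleave` and `tryptic_peptides` are hypothetical and are not taken from Comet or any other search engine, and no modifications such as methionine oxidation are modeled.

```python
def cleave(protein):
    """Split a protein sequence at tryptic sites (after K or R, not before P)."""
    fragments = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Trailing fragment that does not end in K or R
        fragments.append(protein[start:])
    return fragments


def tryptic_peptides(proteins, missed_cleavages=0):
    """Return the set of unique peptides across all proteins.

    Allowing missed cleavages joins up to `missed_cleavages` + 1
    adjacent fragments into a single candidate peptide.
    """
    peptides = set()
    for protein in proteins:
        fragments = cleave(protein)
        for i in range(len(fragments)):
            last = min(i + missed_cleavages + 1, len(fragments))
            for j in range(i + 1, last + 1):
                peptides.add("".join(fragments[i:j]))
    return peptides


if __name__ == "__main__":
    # Toy sequences, for illustration only
    database = ["MKRPAK", "AAKMK"]
    print(sorted(tryptic_peptides(database, missed_cleavages=0)))
    print(len(tryptic_peptides(database, missed_cleavages=1)))
```

Counting unique peptides this way shows why the number of candidate peptides, and therefore the search space, grows with both database size and the number of missed cleavages allowed.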

We applied four database selection techniques in order to perform peptide detection, protein inference, and taxonomic and functional assignments from MS-based, oceanic, microbial community metaproteomics (Figure 1). The metaproteome in question represents a diverse and relatively under-sequenced area of the ocean, the Pacific Arctic. Our results from this study offer a path forward as well as a caution for investigators that the biological conclusions drawn from metaproteomics data are highly database specific.

Figure 1

Schematic of the workflow for the database searches of the metaproteomics samples. The width of the cylinder depicting each database is scaled to the number of unique tryptic peptides in that database (Supplementary Information 4).

Our study followed traditional procedures currently employed in ocean metaproteomics (details in Supplementary Information 1). Water samples were collected and selectively filtered from the Bering Strait as described in May et al. (2016) and incubated shipboard over 10 days (T0=day 0, T10=day 10). Bacterial community proteomes from the incubations were analyzed on a Q-Exactive-HF (Thermo Fisher Scientific, Waltham, MA, USA) and the resulting data were searched against four different peptide identification databases (Supplementary Information 2): (1) a site/time-specific metagenome collected concurrently with the incubated water; (2) NCBI’s env_NR database; (3) an Arctic-bacterial database of NCBI protein sequences from known polar taxonomic groups (Supplementary Information 3); and (4) a North Pacific database derived from a subset of the Ocean Microbiome sequencing project (Sunagawa et al., 2015; Supplementary Information 4). Peptides were identified and proteins were inferred using Comet v. 2015.01 rev. 2 (Eng et al., 2012, 2015), followed by peptide and protein match scoring (Pedrioli, 2010; Deutsch et al., 2015) at a false discovery rate threshold of 0.01 (Supplementary Information 5). Proteins from all databases were annotated using BLASTp (Altschul et al., 1990; Camacho et al., 2009) against the UniProtKB TrEMBL database (downloaded April 28, 2015) with an e-value cutoff of 1E-10 (Supplementary Information 6). Shifts in community biological functions over the 10-day incubation were quantified using a Gene Ontology (GO) analysis in which peptide spectrum matches were associated with GO terms. Additionally, the sensitivity of peptide scoring as a function of database size was investigated by searching the site/time-specific metagenome database with increasing numbers of decoy peptides.
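The false discovery rate threshold mentioned above can be made concrete with a short sketch. Note that this is not the procedure used in the study, where false discovery rates were calculated from Trans-Proteomic Pipeline probabilities. The code below instead shows the simpler target-decoy counting estimate (decoy matches divided by target matches above a score cutoff). The function name `accept_at_fdr` and the `(score, is_decoy)` input format are assumptions made for illustration.

```python
def accept_at_fdr(psms, max_fdr=0.01):
    """Return target PSM scores accepted at the given FDR.

    psms: iterable of (score, is_decoy) pairs, higher score = better match.
    The FDR at each rank is estimated as decoys / targets among all PSMs
    at or above that rank; the deepest rank meeting max_fdr is kept.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    targets = 0
    decoys = 0
    keep = 0  # number of top-ranked PSMs retained
    for rank, (_score, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= max_fdr:
            keep = rank
    return [score for score, is_decoy in ranked[:keep] if not is_decoy]


if __name__ == "__main__":
    # Toy PSM list, for illustration only
    toy = [(10.0, False), (9.0, False), (8.0, True), (7.0, False)]
    print(accept_at_fdr(toy, max_fdr=0.01))
    print(accept_at_fdr(toy, max_fdr=0.5))
```

Under this estimate, every additional high-scoring decoy raises the score that a true match must reach, which is one way to see why padding a database with unexpressed sequences reduces the number of confident identifications.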

The number of experimental spectra that yielded peptide spectrum matches differed markedly among databases. The highest number of confidently scored unique peptide matches and protein inferences resulted from the search against the site/time-specific metagenome database. This number of peptide matches was augmented 1.5 times by searching the same data against unassembled reads. This ‘metapeptide’ approach (May et al., 2016) avoids sequence loss and potential noise introduced by read assembly (for example, Cantarel et al., 2011). The peptides identified by the four assembled databases overlapped relatively little, suggesting that the different databases cover different parts of the acquired metaproteome (May et al., 2016). In a direct comparison of the unassembled metagenome peptides and env_NR, the metagenome contained more peptides from the metaproteome (May et al., 2016). Additionally, database size, especially in the cases of env_NR and North Pacific, had a substantial impact on search sensitivity, making statistically confident detection of peptides difficult (Supplementary Information 7; May et al., 2016). In agreement with others, we found that large database searches suffer from a loss of statistical power from multiple hypothesis testing against the vast number of sequences unrepresented in the expressed metaproteome (Nesvizhskii, 2010; Jagtap et al., 2013; Tanca et al., 2013). This paradox of too many sequences resulting in too few identifications will become increasingly problematic with the availability of more sequence data. Our results point to the success obtained by searching a metaproteome-specific database that excludes non-specific sequences, while balancing the need to retain a sufficient amount of sequence variation.

Taxonomic and functional interpretations resulting from searches of the same metaproteome against different databases were divergent, suggesting that each database would yield a different biological conclusion. The four resulting community taxonomy profiles diverged even at the phylum level, and these differences were amplified at finer taxonomic levels (Figure 2). The metagenome also yielded a greater variety of taxa at ranks more specific than class compared to env_NR (May et al., 2016). In addition to taxonomic discrepancies, the functional response to the 10-day incubation differed depending on the database used, a difference that has been noted by others (Rooijers et al., 2011; Tanca et al., 2013). In our Arctic microbiome, there was little agreement among database searches in the ten GO terms that changed the most between the beginning and end of the incubation experiment (Table 1, Supplementary Information 8). These GO terms would be considered the most significant contributors to changes in community function in this particular experiment, and would lead to substantially different interpretations depending on the database selected. Such differences in functional assignments among search results can direct downstream analyses and interpretations, and they are of critical importance when inferring and reporting community function. Our results and others (for example, Rooijers et al., 2011) stress the importance of database choice for metaproteomics functional assignments and community biological process, especially in the case of a previously uncharacterized, complex community.
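The direction-of-change categories used for GO terms (Supplementary Information 8) can be sketched as follows. The thresholds, a log2 fold change above 1 or below -1 and a minimum of 50 peptide spectrum matches, come from the supplementary description. Everything else is an assumption made for illustration: the function name `classify_go_change`, the use of summed counts from both time points as the ‘total PSM count’, the ratio orientation (T10 over T0), and the pseudocount of 1 that avoids division by zero.

```python
import math


def classify_go_change(psm_t0, psm_t10, min_total=50, pseudocount=1.0):
    """Classify the change in PSM count for one GO term between T0 and T10.

    Returns 'X' when the term is not detected above `min_total` PSMs,
    otherwise 'positive', 'negative' or 'none' based on log2(T10 / T0).
    """
    if psm_t0 + psm_t10 <= min_total:
        return "X"
    log2_fold_change = math.log2((psm_t10 + pseudocount) / (psm_t0 + pseudocount))
    if log2_fold_change > 1:
        return "positive"
    if log2_fold_change < -1:
        return "negative"
    return "none"


if __name__ == "__main__":
    # Hypothetical PSM counts (T0, T10) for one GO term in two databases
    counts_by_database = {"database A": (10, 100), "database B": (40, 50)}
    for name, (t0, t10) in counts_by_database.items():
        print(name, classify_go_change(t0, t10))
```

Applying the same classification to the PSM counts from each database search makes disagreements explicit: a GO term can be ‘positive’ in one search and ‘none’ or ‘X’ in another, even though the spectra are identical.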

Figure 2

A heat map depicting the amount of agreement of taxonomic assignments at the phylum level derived from inferred proteins across searched databases. For each phylum, a colored box represents the number of proteins (log(x+1)-transformed) associated with that phylum for each set of search results (red=highly abundant phylum; blue=low/non-existent phylum). The results are ordered by phylum abundance in the site/time-specific metagenome search results.

Table 1 Ten GO terms with the biological aspect ‘biological process’ with the greatest log fold change from each database search; five that changed the most to have higher abundance at T10 (light gray) and five that have higher abundance at T0 (dark gray)

In addition to differences in peptide search results, the true complexity in annotating detected proteins was obscured by the standard approach that uses only the top BLAST hit as the defined protein annotation. The BLAST algorithm returns a list of possible hits with associated Expect values (e-values) when a sequence is searched; to better understand the downstream effects of this approach, we included up to 500 BLAST results per protein. On average, 403 protein matches per metagenome sequence were returned that passed the e-value cutoff of 1E-10. Disagreements in functional and taxonomic assignment among the BLAST hits for a single protein are very common, even when the results all have very low e-values (Supplementary Information 9). This casts doubt on the ‘top’ BLAST hit as the correct annotation for the protein of interest, even though this is common practice in ‘omics’ literature. Inaccuracy or lack of precision of protein annotation via BLAST methodology would further obscure an accurate interpretation of metaproteomics data when combined with an uninformed database choice.

The selection of a protein database for peptide identifications is one of the most critical bioinformatics decisions for accurate biological and ecological interpretation of in situ community functions. Although more time and money are required to complete a site/time-specific metagenome, we have demonstrated that these investments lay the groundwork for more complete metaproteome interpretation (Tanca et al., 2013; May et al., 2016). Whether or not a metagenome is assembled, data interpretation must proceed with care. Based on current and previous work, we propose a general best practices guide (Figure 3) to identifying peptides and inferring biological function and taxonomic distributions of natural microbial assemblages: (1) for previously uncharacterized communities, construct as accurate and efficient a database as possible by (a) using the metapeptide approach (May et al., 2016), (b) sequencing the metagenome and utilizing gene prediction software (for example, Hyatt et al., 2012) or (c) constructing the most accurate database possible to avoid loss of sensitivity due to large search space when metagenome sequencing is not possible; (2) when annotating proteins, go beyond the top BLAST hit to base the annotation for taxonomy and function on an agreement among BLAST hits above a specific e-value threshold (Supplementary Information 9); (3) to increase peptide identifications, leverage publicly available sequences via the more statistically robust multi-step or iterative searches (for example, Jagtap et al., 2013; Kertesz-Farkas et al., 2015). As researchers begin to explore these different search methods with a variety of metaproteomics data sets, this approach will provide the most robust search methods and most reliable taxonomic and functional inference for environmental metaproteomics.
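Recommendation (2) can be illustrated with a small sketch of a consensus taxonomic assignment: the lowest taxonomic unit shared by all BLAST hits that pass an e-value cutoff. This is one simple way to express ‘agreement among BLAST hits’ and is not necessarily the exact procedure used in this study. The function name `consensus_lineage`, the `(evalue, lineage)` input format, and the inclusive comparison against the cutoff are assumptions.

```python
def consensus_lineage(hits, max_evalue=1e-10):
    """Return the lineage shared by all hits passing the e-value cutoff.

    hits: list of (evalue, lineage) pairs, where lineage is a list of taxon
    names ordered from superkingdom down to the most specific rank known.
    An empty list means the passing hits disagree at the superkingdom level
    (or that no hit passed the cutoff).
    """
    lineages = [lineage for evalue, lineage in hits if evalue <= max_evalue]
    if not lineages:
        return []
    shared = []
    # zip stops at the shortest lineage, so only ranks present in every hit are compared
    for taxa_at_rank in zip(*lineages):
        if len(set(taxa_at_rank)) != 1:
            break
        shared.append(taxa_at_rank[0])
    return shared


if __name__ == "__main__":
    # Hypothetical hits for a single query protein
    hits = [
        (1e-50, ["Bacteria", "Proteobacteria", "Gammaproteobacteria"]),
        (1e-30, ["Bacteria", "Proteobacteria", "Alphaproteobacteria"]),
        (1e-5, ["Archaea", "Euryarchaeota"]),  # fails the cutoff and is ignored
    ]
    print(consensus_lineage(hits))
```

In this toy case the top hit alone would assign the protein to Gammaproteobacteria, whereas the consensus supports an assignment only at the phylum level, which is the kind of loss of precision that a top-hit annotation conceals.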

Figure 3

Depiction of the recommended ‘best practices in metaproteomics’ workflow. The ocean circles represent data derived from the same sample. (1) Selection of an accurate and efficient database is followed by (2) finding the consensus BLAST hit among the group of best hits, and (3) re-searching the data against more sequences to achieve greater metaproteome coverage using a robust multi-step or iterative algorithm.

Supplementary Information is available at ISME Journal’s website.

1: Detailed methods for metagenome sequence, metaproteomics MS, database searching and biological interpretation of data.

2: Minimum, maximum and mean protein lengths for each protein identification database used in this study.

3: Taxonomic groups used to create the Arctic-bacterial database. The first and second columns list the group name and taxonomic level for the protein sequences that were downloaded from NCBI, followed by the complete taxonomic tree. For each taxonomic group, citations are given from peer-reviewed literature that were used to infer this group’s presence near our study site. The second tab in the workbook has the full citations listed.

4: Summary of total unique protein and peptide sequences in each database. Three different scenarios are given for peptide sequence generation to construct a database: (1) 3 missed cleavages and oxidized methionine; (2) 0 missed cleavages, no oxidation; (3) 3 missed cleavages, no oxidation.

5: Comet parameter file used to run all database searches.

6: Query protein, top UniProt BLAST hit and corresponding e-value are provided for all proteins detected with high confidence. Protein lists for the different database searches can be found in the different sheets of the Excel workbook.

7: Adding large numbers of random decoy peptides to an 11-million-peptide metagenome-derived database depressed peptide detection sensitivity. Horizontal axis is the number of peptides in each search database (11 million metagenome peptides, with increasing numbers of random decoy peptides). The vertical axis is the number of metagenome peptides detected at a false discovery rate of 0.01 as determined by forward–reverse database search for five different sample files. False discovery rate was calculated from Trans Proteomic Pipeline probabilities.

8: Direction of log2 fold change for GO terms detected at total PSM count >50 in T0 vs T10, T0′ vs T10′, T0 vs T0′, and T10 vs T10′ (a prime, ′, denotes a technical replicate). A log2 fold change >1 is ‘positive’, <-1 is ‘negative’, between -1 and 1 is ‘none’, and if a GO term was not detected at above 50 PSMs in a database there is an ‘X’. Results for each database (site/time-specific metagenome, env_NR, Arctic-bacterial and North Pacific) are listed in separate columns for each comparison.

9: A heatmap representing the granularity of taxa returned from a BLAST search (e-value cutoff of 1E-10) as a function of percent identity threshold. Each colored bin represents the number of protein hits at a given lowest common taxonomic unit level for up to 500 protein hits. Horizontal axis: minimum percent sequence identity between query protein and BLAST hits. Vertical axis: rank of the lowest common taxonomic unit representing all BLAST hits above the threshold. Color indicates the natural log of the number of query proteins that fall into each bin, according to the scale at right. ‘None’ indicates hits that were assigned to multiple superkingdoms.