Introduction

Metagenomic studies have dramatically expanded our knowledge of the microbiota in various environments (e.g., the human intestinal tract, oceans and soil)1,2,3,4,5,6. However, fully enumerating all the DNA sequences carried by each microorganism in a given environment remains impossible despite the great increase in sequencing depth in recent years. Therefore, estimating the amount of sequencing needed to fulfill the objectives of a given study is especially important. Many studies have estimated the metagenomic content in a given environment7,8,9,10. However, most of these studies did not consider the number of SSU rDNA copies per prokaryotic genome or assumed that all prokaryotes have the same genome size8,9. Prokaryotic genomes range from 139 kb to 13,034 kb in size and carry 1 to 15 SSU rDNA copies. This variation significantly hinders the accurate estimation of the sequencing amount. In this study, we introduce a novel computational approach for estimating the amount of sequencing required from previously reported data (i.e., SSU rDNA distribution from targeted SSU rDNA sequencing, prokaryotic genome size and SSU rDNA copies of different genomes).

Results

Maximum potential error for estimates for prokaryotic genome size and SSU rDNA copies under different taxonomic categories

To evaluate the maximum potential error for estimates for prokaryotic genome size and SSU rDNA copies under different taxonomic categories (i.e., species, genus, family, order, class and phylum), we introduced a ratio (R) calculated according to equation (1). Higher R values indicate larger maximum potential errors. We calculated the R values for prokaryotic genome size and SSU rDNA copies for each taxonomic category (Supplementary dataset 1). The results indicate a significant difference in the R values for genome size among the different taxonomic categories (one-way ANOVA, F = 87.037, p < 0.001; Figure 1). Both the LSD and the Student–Newman–Keuls post hoc tests showed no significant differences between the R values at the species and genus levels or between those at the class and phylum levels, whereas significant differences were detected among the remaining levels. Similar results were obtained for the SSU rDNA copies except for the comparison between the genus and species levels. Therefore, the diversity of both SSU rDNA copies and prokaryotic genome sizes could significantly impair the accurate estimation of the amount of metagenomic sequencing required.

Figure 1

Box plots showing the distribution of R values for prokaryotic genome size and SSU rDNA copies at different taxonomic categories.

Ci difference of the human faecal samples among different taxonomic categories

The abundances of different prokaryotes in a particular environment are generally different. Most individuals in an ecosystem belong to a few numerically dominant species, whereas most species are present at low abundance. For instance, Sogin et al.11 reported that the relative abundance of different operational taxonomic units (OTUs) in deep sea samples varies by more than three orders of magnitude and that only a few bacterial populations dominate the diffuse flow and deep water samples from the North Atlantic. Eckburg et al.12 pointed out that all 1524 archaeal SSU rDNA sequences from six major subdivisions of the human colon and faecal samples belong to a single phylotype (Methanobrevibacter smithii), with only 395 bacterial phylotypes assigned to 7 phyla (301 phylotypes belong to Firmicutes). Therefore, we cannot accurately calculate the amount required for sequencing if the abundance of each prokaryote in a given environment is unknown. To demonstrate and verify the feasibility of our estimation method, we retrieved the SSU rDNA sequence datasets of human faecal microbiota from the study by Eckburg et al.12. Briefly, 2339 SSU rDNA sequences (1060 from specimen A, 617 from specimen B and 662 from specimen C) were retrieved.

Equations (2) and (3) provide the calculations for the genomic proportion of the ith taxonomic group (Pi) and the coverage of the ith taxonomic group (Ci) in the human faecal samples (Supplementary dataset 2). The estimated Pi and Ci were not significantly different among the taxonomic categories (one-way ANOVA, Pi: FA = 0.000, pA = 1.000; FB = 0.000, pB = 1.000; FC = 0.000, pC = 1.000; Ci: FA = 0.087, pA = 0.967; FB = 0.135, pB = 0.939; FC = 0.209, pC = 0.890; Figure 2). This result indicates that applying the correction at different taxonomic categories yields similar estimates of Pi and Ci. Hence, Pi and Ci can be estimated using high-throughput sequencing data.

Figure 2

Rank–proportion curves (A, B, C) and rank–coverage curves (D, E, F) of human faecal samples with 3G metagenomic data per specimen.

(A, D) specimen A; (B, E) specimen B; (C, F) specimen C. Species indicates the taxonomic categories corrected to the species level. When a taxonomic group did not have a finished genomic sequence in the Integrated Microbial Genomes (IMG) system (http://img.jgi.doe.gov/cgi-bin/w/main.cgi), its genome size and SSU rDNA copies were replaced with those of the higher taxonomic group to which it belongs. Genus, family and order indicate the taxonomic categories corrected to the corresponding taxon level.

Genomic coverage estimation of a single species

The validity of metagenomic datasets is usually assessed using the genomic coverage of a single species10,13. The genome of a single species can be accurately assembled from a complex metagenomic dataset when it shows at least 20× coverage13. To assess the genomic coverage of each OTU in a metagenomic study, the amount of DNA and the genomic coverage of each OTU were estimated using the 3G metagenomic data per specimen (Supplementary dataset 2). The ratio of the coverage to the proportion of each OTU was not significantly different among the taxonomic categories in any of the specimens (one-way ANOVA, FA = 1.350, pA = 0.258; FB = 1.528, pB = 0.206; FC = 0.899, pC = 0.442; Figure 3). Therefore, the coverage of each OTU was neither overestimated nor underestimated in the different taxonomic categories. Fewer than 15 genomes in the 3G metagenomic data per specimen reached 20× coverage. This result indicates that 3G of metagenomic sequencing is still insufficient for determining the genetic content of faecal samples. Equation (3) indicates that the genomic coverage can be calculated for different sequencing amounts and that the corresponding number of species whose genomes reach 20× coverage can then be counted. Therefore, the amount of sequencing required can be inferred from the desired number of species whose genomes reach 20× coverage.

Figure 3

Correlation between the proportion of OTU and prokaryotic genomic coverage in the 3G metagenomic data per specimen.

(A) Distribution of the estimated coverage with the proportion of OTU at different taxonomic levels; (B, C, D) distribution diversity of the estimated coverage at different taxonomic levels with the proportion of OTU from specimens A, B and C; (E, F, G, H) distribution diversity of the estimated coverage of the three specimens with the proportion of OTU at species, genus, family and order levels.

Estimation of the required minimum amount of sample for metagenomic sequencing for a given goal

To provide an example, we estimated the minimum amount of metagenomic sequencing required for the specimen A faecal microbiota from Eckburg et al.12. Equation (4) indicates that the minimum amount required for metagenomic sequencing can be estimated from the Pi calculated in Supplementary dataset 2. Based on the calculations, if the goal is to ensure that the genomes of species with a relative abundance of more than 1% reach 20× coverage, the estimated amounts for sequencing were 7.15 Gb at the species level, 7.04 Gb at the genus level, 8.14 Gb at the family level and 6.41 Gb at the order level (Supplementary dataset 3). To obtain more reliable estimates, we plotted the regression curves of the estimated amounts for sequencing vs. the rank of the genomic proportion of each taxon and calculated the regression equation (Figure 4). The regression equation indicated that the estimated amounts for sequencing specimen A were 7.00 Gb at the species level, 6.93 Gb at the genus level, 7.10 Gb at the family level and 6.54 Gb at the order level. These results imply that at least 7 Gb of sequencing is required to enumerate the gene contents of prokaryotes with a relative abundance of more than 1% in the human faecal microbiota.
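The goal-driven estimate described here can be sketched in Python. This is a minimal illustration, not the authors' implementation: it assumes that equation (4) has the form A = Ci × (AXgenome)i / Pi, that the genomic proportion Pi serves as the relative abundance for the 1% threshold, and that the answer is the largest A among the qualifying taxa. The function name, the taxon names and all numbers are hypothetical, and the regression step is not reproduced.

```python
def minimum_sequencing_amount(taxa, target_coverage=20.0, min_proportion=0.01):
    """Smallest sequencing amount (bp) at which every taxon whose genomic
    proportion exceeds `min_proportion` reaches `target_coverage`.

    `taxa` maps a taxon name to (genomic proportion P_i, mean genome size in bp).
    Assumed relation: A = C_i * genome_size_i / P_i.
    """
    amounts = [
        target_coverage * genome_size / proportion
        for proportion, genome_size in taxa.values()
        if proportion > min_proportion
    ]
    if not amounts:
        raise ValueError("no taxon exceeds the proportion threshold")
    # The rarest qualifying taxon with the largest genome sets the requirement.
    return max(amounts)


# Hypothetical community: (P_i, mean genome size in bp).
example_taxa = {
    "taxon_a": (0.50, 5.0e6),
    "taxon_b": (0.02, 3.0e6),
    "taxon_c": (0.005, 2.0e6),  # below the 1% threshold, so ignored
}
needed_bp = minimum_sequencing_amount(example_taxa)
print(f"{needed_bp / 1e9:.2f} Gb")
```

In this toy community the requirement is set by taxon_b, the rarest taxon above the threshold, which illustrates why the estimate depends strongly on where the abundance cut-off is placed.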

Figure 4

Correlation between the estimated sequencing amounts and the genomic proportion of each taxon.

(A) Corrected to the species level; (B) corrected to the genus level; (C) corrected to the family level; and (D) corrected to the order level.

Detection efficiency of a particular gene

Comparing the compositions of gene families or orthologs to analyze the functional differences of microbiota is a crucial component of microbial ecology. Metagenomic sequencing provides an effective way to enumerate the compositions of gene families or orthologs. Assessing the detection efficiency of each gene is a basic issue in this case, especially for rare gene families or orthologs. To estimate the detection efficiency of a particular gene, we used the SSU rDNA as the example target gene. We assumed that any SSU rDNA read could confirm the presence of a particular gene and that Lread was 100 bp13. The estimated detection efficiency of SSU rDNA from the selected OTUs with different OTU proportions in faecal microbiota was calculated according to equation (6). Among the different taxonomic categories, the ratio of the detection efficiency to the proportion of each OTU was significantly different in specimens B and C but not in specimen A (one-way ANOVA, FA = 0.899, pA = 0.450; FB = 3.020, pB = 0.039; FC = 5.404, pC = 0.002; Figure 5). The LSD and the Student–Newman–Keuls post hoc tests also showed that the efficiency at the order level in specimens B and C was significantly different from that at the other levels. These results indicate that the detection efficiency of a specific gene was possibly overestimated at the order level. The inaccuracy could be caused by differences in prokaryotic genome size. Therefore, a target gene should be affiliated to the family level or a lower taxonomic category when estimating the detection efficiency of a particular gene.

Figure 5

Correlation between the detection efficiency of SSU rDNA and the proportion of OTUs.

(A) specimen A; (B) specimen B; (C) specimen C.

Discussion

The maximum potential error for estimating prokaryotic genome sizes and SSU rDNA copies at higher taxonomic levels was significantly higher than that at lower levels. Fortunately, prokaryotes can be classified into different species based on their SSU rDNA sequences14,12 and the taxonomic affiliations of the SSU rDNA sequences generated using Illumina and pyrosequencing platforms can reach the genus level15,16,17,18. The taxonomic affiliations of sequences generated from these high-throughput sequencing platforms are expected to reach a lower taxonomic category (i.e., species level) as high-throughput sequencing technologies are optimised, especially with increasing read lengths.

Although we tried to perform an accurate estimation, two main factors influenced the accuracy. First, many prokaryotic genomes are absent from the public databases (e.g., Barnesiella, Butyricicoccus, Butyricimonas, Oscillibacter, Peptococcus and Succiniclasticum), so only the average genome size and average SSU rDNA copies of a higher taxonomic category can be used for them. Obviously, this replacement will overestimate the R value. However, this limitation will diminish as more prokaryotic genomes are sequenced in the future. Second, some of the SSU rDNA sequences were not accurately affiliated with a genus or a species. Consequently, we had to apply a higher taxonomic category to estimate the amount required for sequencing. Although sequencing the SSU rDNA of all prokaryotic species is impossible, supplementing the databases with SSU rDNA sequences, especially those of novel species, will partially overcome this shortage in the future.

In the present study, the reference data for prokaryotic genome size and SSU rDNA copies were not selected based on habitat. However, the prokaryotes in different environments may encounter different selective stresses; therefore, the genome sizes and gene compositions may differ. For instance, Frese et al.19 found that Lactobacillus reuteri has host-specific genomic features. They found that L. reuteri F275 isolated from a healthy human adult is unable to colonise Lactobacillus-free mice and it has a genome approximately 279 kb smaller than that of rodent isolate L. reuteri 100–23. Therefore, the maximum potential error should be reduced when the reference data sets are constructed based on the habitat-selected prokaryotic genomes and the estimated amount required for sequencing should be more accurate.

Tamames et al.10 argued that the number of sequence reads belonging to unobserved OTUs should be subtracted from the total number of reads when estimating the coverage of a species using the composition of SSU rDNA. Although the unobserved OTUs significantly affect the alpha-diversity of a given metagenome and the genomic coverage of rare species, our results indicate that they did not affect the genomic coverage of dominant and common species, as shown by the unchanged proportion of dominant and common species with the increasing total number of reads (unpublished data by Ni et al.).

Because plasmid DNA, mitochondrial DNA and eukaryotic DNA are also present in environmental metagenomes7, their proportions in the metagenome should also be considered to estimate the amount for sequencing more accurately. We did not consider the composition of archaeal genomes; thus, the coverage of the bacterial genomes may have been overestimated. Given the proportions of plasmid, mitochondrial, archaeal and eukaryotic DNA in the entire metagenome, a more accurate estimation could be conducted using a corrected equation that replaces the denominator in equation (2) with the total metagenomic DNA.

In conclusion, the required minimum amount for metagenomic sequencing could be estimated using the proposed method when the abundances of different prokaryotes in particular samples are known.

Methods

In all microbial diversity studies using polymerase chain reaction (PCR)-based methods, the targeted genes (e.g., SSU rDNA) are assumed to have equal amplification efficiency according to the following classical PCR equation: $N_{Cq} = N_0 \cdot (1 + E)^{Cq}$, where $N_0$ is the amount of the target sequence before PCR, $N_{Cq}$ is the amount of PCR products after $Cq$ rounds of amplification and $E$ is the amplification efficiency, which is assumed to be constant until the onset of PCR exhaustion20. The proportion of amplified DNA sequences from different microorganisms in the PCR products was assumed to equal that in the original sample. Therefore, we could infer the proportion of bacteria in an environmental sample according to the bacterial SSU rDNA sequences, which can be acquired through targeted SSU rDNA sequencing. Furthermore, the required sequencing amount could be inferred when the genome size of each bacterium is known. Thus, the structure and function of bacteria in a particular environment can be investigated using high-throughput sequencing technologies. The detection limit of a specific gene could be estimated when the size of the gene and its copy number are given.
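The equal-efficiency assumption can be checked numerically with a short Python sketch. It only illustrates the classical PCR equation quoted above; the function names, the template amounts, the efficiency of 0.9 and the 25 cycles are hypothetical values chosen for the example.

```python
def pcr_products(n0, efficiency, cycles):
    """Product amount after `cycles` rounds: N_Cq = N0 * (1 + E) ** Cq."""
    return n0 * (1.0 + efficiency) ** cycles


def proportions(amounts):
    """Normalise a list of amounts so that they sum to 1."""
    total = sum(amounts)
    return [a / total for a in amounts]


# Hypothetical starting template amounts for three taxa (arbitrary units).
templates = [600.0, 300.0, 100.0]

# Every template is amplified with the same efficiency E for the same Cq.
amplified = [pcr_products(n0, efficiency=0.9, cycles=25) for n0 in templates]

before = proportions(templates)
after = proportions(amplified)
print(before)
print(after)
```

Because every template is multiplied by the same factor (1 + E)^Cq, the factor cancels in the normalisation and the proportions are unchanged. If E differed between taxa, the proportions after amplification would drift away from those in the original sample.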

To obtain the reference data for the prokaryotic genome sizes and SSU rDNA copies, the genomic data of 1627 sequenced and annotated prokaryotes, including 111 archaea and 1516 bacteria, were retrieved from the Integrated Microbial Genomes system21. These prokaryotic genomes belong to 620 known genera and 2 unclassified genera (Supplementary dataset 4). The prokaryotic genome sizes range from 138,927 bp to 13,033,779 bp (3,497,775 ± 46,278.101 bp, s.e.m) and the SSU rDNA copies range from 1 to 15 (3.94 ± 0.068, s.e.m) (Supplementary dataset 4 and Figure S1).

To evaluate the maximum potential error in the estimation of the prokaryotic genome sizes and SSU rDNA copies at different taxonomic categories (i.e., species, genus, family, order, class and phylum), we introduced a ratio (R) calculated according to the following equation:

$$R = \frac{E'}{AX} = \frac{|E - AX|}{AX} \qquad (1)$$

where E is the extreme value of genome sizes or SSU rDNA copies, AX is the mean of the corresponding genome size or SSU rDNA copy and E′ is the margin between E and AX. Higher R values indicate larger maximum potential errors.
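A minimal Python sketch of the ratio follows. It assumes that R is the margin E′ divided by the mean AX and that E is the group member lying farthest from the mean; both are readings of the definitions above rather than a confirmed formula. The function name and the genome sizes are hypothetical.

```python
from statistics import mean


def max_potential_error_ratio(values):
    """Assumed form of R: largest deviation from the group mean (E'),
    divided by the group mean (AX)."""
    ax = mean(values)
    # E' is the margin between the most extreme member E and the mean AX.
    e_margin = max(abs(v - ax) for v in values)
    return e_margin / ax


# Hypothetical genome sizes (bp) for the members of one taxonomic group.
genome_sizes = [2_000_000, 3_000_000, 7_000_000]
r_value = max_potential_error_ratio(genome_sizes)
print(r_value)  # mean is 4,000,000 bp and the largest margin is 3,000,000 bp
```

Under this reading, a group containing a single genome has R = 0, and R grows as the group mean becomes a poorer stand-in for its most atypical member.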

The genomic proportion of the ith taxonomic group (Pi) was calculated using the following equation:

$$P_i = \frac{\dfrac{N_i}{(AX_{ssu})_i} \times (AX_{genome})_i}{\sum_{j} \dfrac{N_j}{(AX_{ssu})_j} \times (AX_{genome})_j} \qquad (2)$$

where Ni is the abundance of the SSU rDNA copies of the ith taxonomic group, (AXssu)i and (AXgenome)i are the mean number of SSU rDNA copies and the mean genome size of the ith taxonomic group, respectively, and Ni/(AXssu)i is the number of individuals of the ith taxonomic group. The introduction of Ni/(AXssu)i into equation (2) eliminates the bias of SSU rDNA copy number on Pi. Equation (2) also eliminates the bias of genome size differences on Pi.
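The correction can be illustrated with a short Python sketch. It assumes that Pi is the DNA contributed by group i, that is (Ni/(AXssu)i) × (AXgenome)i, divided by the sum of the same quantity over all groups, as the definitions above suggest. The function name, taxon names and values are hypothetical.

```python
def genomic_proportions(ssu_counts, mean_ssu_copies, mean_genome_sizes):
    """Genomic proportion P_i per taxon, corrected for SSU rDNA copy number
    and genome size (assumed form of equation (2))."""
    # N_i / (AX_ssu)_i: estimated number of individuals in each taxon.
    individuals = {t: ssu_counts[t] / mean_ssu_copies[t] for t in ssu_counts}
    # Individuals times mean genome size: DNA contributed by each taxon.
    dna = {t: individuals[t] * mean_genome_sizes[t] for t in ssu_counts}
    total_dna = sum(dna.values())
    return {t: amount / total_dna for t, amount in dna.items()}


# Hypothetical SSU rDNA read counts, mean copy numbers and genome sizes (bp).
counts = {"taxon_a": 600, "taxon_b": 200, "taxon_c": 100}
copies = {"taxon_a": 6, "taxon_b": 2, "taxon_c": 1}
sizes = {"taxon_a": 5.0e6, "taxon_b": 2.5e6, "taxon_c": 2.5e6}

p = genomic_proportions(counts, copies, sizes)
print(p)
```

In this example the raw SSU rDNA read fractions are about 0.67, 0.22 and 0.11, yet all three taxa have the same estimated number of individuals (100 each). The corrected genomic proportions differ only because taxon_a has a genome twice as large as the other two.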

The coverage of the ith taxonomic group (Ci) was calculated using the following equation:

$$C_i = \frac{A \times P_i}{(AX_{genome})_i} \qquad (3)$$

where A is a given sequencing amount of the microbial metagenome. When a target Ci is given, the sequencing amount of a microbial metagenome was calculated using the following conversion equation:

$$A = \frac{C_i \times (AX_{genome})_i}{P_i} \qquad (4)$$
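The coverage relation and its inversion can be sketched in Python, assuming that Ci = A × Pi / (AXgenome)i and, conversely, A = Ci × (AXgenome)i / Pi. The function names and the numbers are hypothetical and serve only to show the arithmetic.

```python
def coverage(sequencing_amount, proportion, mean_genome_size):
    """Expected genomic coverage C_i of a taxon: the share of the sequenced
    bases assigned to the taxon, divided by its mean genome size."""
    return sequencing_amount * proportion / mean_genome_size


def required_amount(target_coverage, proportion, mean_genome_size):
    """Sequencing amount A (bp) needed for a taxon to reach `target_coverage`."""
    return target_coverage * mean_genome_size / proportion


# Hypothetical taxon: 1% of the metagenomic DNA, mean genome size of 3 Mb.
c_at_3gb = coverage(3.0e9, proportion=0.01, mean_genome_size=3.0e6)
a_for_20x = required_amount(20.0, proportion=0.01, mean_genome_size=3.0e6)
print(c_at_3gb)         # about 10x coverage from 3 Gb of sequence
print(a_for_20x / 1e9)  # about 6 Gb needed for 20x coverage
```

The two functions are inverses of each other: feeding the output of `required_amount` back into `coverage` returns the target coverage.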

The detected probability of a given gene (P) was calculated using the following equation:

where L is the gene length, n is the number of gene copies and Pj is the proportion of the taxonomic group(s) that contain the gene in the metagenome. Pj was calculated using equation (2). If a single read from the specific gene is sufficient to confirm its presence, equation (5) should be corrected using the following equation:

where Lread is the length of a single read of the high-throughput sequencing.