Introduction

The ocean, the largest habitat in the biosphere, is a microbial-dominated ecosystem holding an estimated \(10^{29}\) prokaryote cells1. Exploration of the ocean biodiversity associated with this huge prokaryote pool was long hindered by limitations in the cultivation of marine prokaryotes2. This barrier was partially overcome by efficient sequencing approaches, typically targeting the genes that code for the 16S region of rDNA, which allow the definition and enumeration of the operational taxonomic units (OTUs) present in a sample, thereby providing a culture-free basis to assess biodiversity somewhat equivalent to that of species numbers3. In the past decade, global ocean expeditions and the research based on them have used these technological developments to estimate the total number of prokaryote OTUs in the ocean4,5,6,7,8. For instance, the TARA Oceans Expedition explored prokaryote biodiversity in the upper ocean and detected 35,650 prokaryote OTUs5 in a set of globally distributed samples, with the exception of the Arctic, while the Malaspina Expedition gave a minimum estimate of the number of prokaryote OTUs in the deep ocean that is an order of magnitude lower, at around 3,700 OTUs4. The TARA Expedition estimated the total richness to be 37,470 OTUs based on the Chao estimator, which defines a lower bound on species richness. This result should therefore be interpreted as at least 37,470 OTUs in the upper ocean.

The fraction of the total volume of the ocean sampled by any study is minimal and thus requires extreme extrapolation (over 20 orders of magnitude) from the number of species found in the samples to an estimate for the global ocean. The approach used is that of rarefaction curves, a development first introduced in 1943 by Fisher et al. to provide a basis to estimate the species richness of Malaysian butterflies9, subsequently popularized by Sanders (1968)10 to compare benthic invertebrate species richness from marine surveys with different sample sizes. Rarefaction curves use resampling techniques to develop a curve of the number of species against the number of samples collected11. Initially introduced to evaluate how comprehensive an assessment of species numbers based on a sampling set was, the approach was subsequently used to infer the total number of species in the ecosystem investigated as that corresponding to the asymptote of the curve12. This approach was adopted to deliver estimates of the prokaryote species richness in the global ocean4,5. These estimates correspond mathematically to minimum estimates (e.g., Chao estimator)13, yet their precision has not been assessed. Indeed, beyond the apparent asymptote in rarefaction curves, other estimators have been proposed to estimate species richness13,14,15,16. Marine prokaryote communities are characterized by the presence of a few abundant OTUs and a large number of rare OTUs2, suggesting a much broader distribution of OTU abundance than that required to reliably apply rarefaction curves to estimate the global biodiversity of prokaryotes. Here we examine the scaling of prokaryote diversity in the ocean as a step toward better understanding the extent to which current assessments may underestimate prokaryote diversity in the global ocean.
We do so using an array of novel approaches, including assessments across the global ocean coupled with experimental and in silico tests, to establish the scaling of ocean microbial diversity and explore its implications for the discovery of microbial diversity.
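
As a point of reference for the minimum estimates mentioned above, the classic abundance-based Chao1 estimator can be sketched in a few lines. The function name `chao1` and the toy abundance vector are illustrative assumptions and do not come from the TARA Oceans or Malaspina analyses; the formula is the standard one, with the bias-corrected form used when no doubletons are present.

```python
def chao1(abundances):
    """Chao1 lower-bound richness estimate from per-OTU read counts.

    S_chao1 = S_obs + F1^2 / (2 * F2), where F1 and F2 are the numbers
    of OTUs observed exactly once and exactly twice. When F2 == 0 the
    bias-corrected form S_obs + F1 * (F1 - 1) / 2 is used instead.
    """
    counts = [c for c in abundances if c > 0]
    s_obs = len(counts)
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    if f2 > 0:
        return s_obs + f1 * f1 / (2.0 * f2)
    return s_obs + f1 * (f1 - 1) / 2.0


# Toy example (hypothetical counts): 7 observed OTUs, 3 singletons, 2 doubletons.
toy_counts = [10, 5, 2, 2, 1, 1, 1]
print(chao1(toy_counts))
```

Because the correction term depends only on singletons and doubletons, the estimate is driven by the rare OTUs, which is why it behaves as a lower bound when the rare tail is long.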

Results

Prokaryote diversity in the upper and deep ocean

The distributions of prokaryote OTUs in the upper ocean and deep ocean samples of the TARA Oceans5 and Malaspina4 Expeditions conform to broad distributions with power-law behavior, \(P(x)\sim x^{-1-\alpha}\), where x represents the abundance measured in number of reads, and the tail of the distribution is characterized by a scaling exponent α = 1.57 for the upper ocean and α = 0.89 for the deep ocean (Fig. 1), similar to the classic power-law describing the number of species per taxa of Willis and Yule (1922)17. A comparison to other broad distributions (lognormal, Weibull) shows that a distribution with a power-law tail (either pure power-law or truncated power-law) is most likely to be the best fitting (Table 1). This finding implies that, in the upper ocean, the most abundant 1% of OTUs account for 40% of the sequences while the least abundant 90% of sampled OTUs account for only 10% of the sequences; in the deep ocean, the most abundant 1% of OTUs account for more than 70% of the sequences while the least abundant 90% of sampled OTUs account for only 8% of the sequences.
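
The concentration figures quoted above (share of sequences held by the most abundant 1% of OTUs) can be computed from any OTU table with a short helper. This is a minimal sketch: the function name `read_share` and the toy community are assumptions made for illustration, not the expedition data.

```python
def read_share(abundances, top_fraction):
    """Fraction of all reads held by the top `top_fraction` of OTUs.

    OTUs are ranked by abundance; at least one OTU is always included.
    """
    ranked = sorted(abundances, reverse=True)
    k = max(1, int(round(top_fraction * len(ranked))))
    return sum(ranked[:k]) / sum(ranked)


# Hypothetical community: one dominant OTU and 99 singletons.
toy = [100] + [1] * 99
print(read_share(toy, 0.01))
```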

Figure 1

Abundance distribution of prokaryote OTUs in the upper and deep ocean. The rank vs abundance distribution for the (A) upper ocean and (B) deep ocean shows broad distributions with power-law tails. The abundance-rank distribution, \(r\sim x^{-\alpha}\), where r is the rank of abundance x, has the same functional dependence (only the ranks have to be normalized between 0 and 1) as the complementary cumulative distribution CCD, \(\mathrm{CCD}(x)=\sum_{i=x}^{\infty }P(i)\), where P(i) is the abundance distribution. Thus, if the abundance-rank distribution is given by \(r\sim x^{-\alpha}\), the abundance distribution decays as \(P(x)\sim x^{-1-\alpha}\). (A) For the upper ocean, the abundance distribution shows a double power-law decay separated at a characteristic scale of 2,313 reads: for abundances x < 2,313, the scaling exponent is 0.37 (blue line); for abundances x > 2,313, the scaling exponent is α = 1.57 (see Materials and Methods). (B) For the deep ocean, the abundance-rank distribution is characterized by a power-law decay, corresponding to \(P(x)\sim x^{-1-\alpha}\), with an exponent of α = 0.89 (red line).

Table 1 Comparing fitting models to the prokaryote abundance distribution.

Theoretical scaling

Prokaryote diversity and, in general, species diversity can be characterized by magnitudes like the Shannon and Simpson indices, which, by giving greater weight to the larger, common species, provide estimators with less uncertainty13 (Supplementary Table 1). However, the presence of rare species impacts the estimation of species richness. Species richness scales with sampling effort as a consequence of the power-law tail of the distribution of prokaryote abundance. Let us assume that the number of OTUs of abundance x, \(n_x\), is given by \(n_x=Ax^{-1-\alpha}\), where A is a normalizing constant, the scaling exponent satisfies α > 0, and the abundances are in the range \(x\in [1,N_{\max}]\). Thus, the total species richness, S, is given by \(S=\sum_{x=1}^{N_{\max}}n_x\). In the limit of large \(N_{\max}\), the richness can be approximated as \(S=A\zeta (1+\alpha )\), that is, \(A=S/\zeta (1+\alpha )\), where ζ is the Riemann zeta function. The total number of reads N can be obtained by \(N=\sum_{x=1}^{N_{\max}}x\,n_x\). For α > 1, we obtain

$$N=\frac{\zeta (\alpha )}{\zeta (1+\alpha )}S$$
(1)

For α < 1, using the continuous limit \(N=A{\int }_{1}^{{N}_{\max }}{x}^{-\alpha }dx=\frac{1}{(1-\alpha )\zeta (1+\alpha )}S({N}_{\max }^{1-\alpha }-1)\) and the assumption that \({N}_{\max }^{1-\alpha }\gg 1\), we obtain

$$N=\frac{1}{(1-\alpha )\zeta (1+\alpha )}S{N}_{\max }^{1-\alpha }$$
(2)

Finally, the abundance of the most abundant OTU can be evaluated as the value \(N_{\max}\) at which there is only one group with abundance greater than or equal to \(N_{\max}\), that is, in the continuous limit, \({\int }_{{N}_{\max }}^{\infty }{n}_{x}\,dx=1\). This leads to \(S\propto {N}_{\max }^{\alpha }\) (a detailed calculation can be found in ref. 18).

Combining the previous expressions, we obtain the following scaling laws: \(S\propto {N}_{{\max }}^{\alpha }\) and for α < 1

$$S\propto {N}_{\max }^{\alpha }\propto {N}^{\alpha }$$
(3)

while for α > 1

$$S\propto {N}_{\max }^{\alpha }\propto N.$$
(4)

The same scaling laws are obtained in the Yule model19 (which can also be mapped to the Simon model20,21), where the scaling exponent α is related to the ratio between speciation rate g and group growth s, \(\alpha \) = g/s. Systems showing distributions with power-law tails are ubiquitous: several methodologies have been described to fit and compare different functional forms as well as mechanisms to explain their origin18,22,23,24.
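
The two scaling regimes can be checked numerically. The sketch below evaluates the sums for S and N directly for \(n_x=Ax^{-1-\alpha}\), fixing A by the continuous-limit condition that a single OTU has abundance of at least \(N_{\max}\), and then measures the log-log slope of S against N between two values of \(N_{\max}\). The function names and the two \(N_{\max}\) values are assumptions chosen for illustration; convergence to the asymptotic exponent becomes slow as α approaches 1.

```python
from math import log


def richness_and_reads(alpha, n_max):
    """Return (S, N) for n_x = A * x**(-1 - alpha), x = 1..n_max.

    A is fixed by the continuous-limit condition that exactly one OTU
    has abundance >= n_max, i.e. A * n_max**(-alpha) / alpha = 1.
    """
    a_const = alpha * n_max ** alpha
    s = sum(a_const * x ** (-1.0 - alpha) for x in range(1, n_max + 1))
    n = sum(a_const * x ** (-alpha) for x in range(1, n_max + 1))
    return s, n


def local_exponent(alpha, n_max_small=10**3, n_max_large=10**5):
    """Log-log slope of S versus N between two system sizes."""
    s1, n1 = richness_and_reads(alpha, n_max_small)
    s2, n2 = richness_and_reads(alpha, n_max_large)
    return log(s2 / s1) / log(n2 / n1)


# Expect a slope close to alpha for alpha < 1 and close to 1 for alpha > 1.
for alpha in (0.5, 1.57):
    print(alpha, round(local_exponent(alpha), 3))
```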

Empirical and in silico scaling

The scaling of species richness and the distribution of species abundances are two sides of the same coin. The power-law distribution of prokaryote species abundance implies that species richness (S) scales with sampling effort (N, the total number of reads) as \(S\sim N^{\gamma}\), where (i) γ equals the exponent of the rank-abundance power-law (i.e., γ = α) when this exponent is α < 1, as observed in the deep ocean (Malaspina Expedition, Fig. 2), and (ii) S is proportional to sampling effort (i.e., γ = 1) for larger exponents α > 1, such as observed for the upper ocean (TARA Expedition, Fig. 2). Indeed, the power-law scaling of species richness with sampling effort implicit in the power-law distribution of prokaryote species abundance (Fig. 1) implies that the asymptote of rarefaction curves is artifactual and that the number of species does not approach any asymptote at the sampling effort thus far deployed by global expeditions (Fig. 2). This expectation was confirmed by producing an in silico global ocean microbiome with an underlying distribution of prokaryote species abundance with the same shape and exponent as those empirically derived for the upper and deep ocean (dotted lines in Fig. 2). The in silico data were obtained, first, by expanding the empirically fitted data to larger populations and, second, by randomly generating OTU abundances from the expanded distributions (see Materials and Methods). These simulations showed that increasing sampling effort, expressed as the total number of 16S reads sequenced, about 30 to 50 times relative to that applied to the upper and deep ocean by the TARA Oceans (\(3.3\times 10^{6}\) reads, ref. 5) and Malaspina (\(1.8\times 10^{6}\) reads, ref. 4) Expeditions, respectively, would lead to estimates of prokaryote species abundance 4.2 and 1.2 times greater than inferred on the basis of rarefaction curves for the upper and deep ocean, respectively (Fig. 2 and Supplementary Fig. 1).
The estimators are calculated for a global population of \(10^{8}\) reads, which corresponds to 1 liter of upper ocean water (\(10^{5}\) prokaryote cells/ml) and 10 liters of deep ocean water (\(10^{4}\) prokaryote cells/ml) (Supplementary Table 1).
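
To illustrate how the expected number of OTUs depends on the number of reads, the sketch below computes the analytical rarefaction curve of Hurlbert (sampling without replacement), which gives the expectation that resampling-based rarefaction approximates. It is a stand-in for illustration and not the simulation procedure used here; the function names and the toy counts are assumptions.

```python
from math import exp, lgamma


def log_binomial(n, k):
    """Natural log of the binomial coefficient C(n, k), for 0 <= k <= n."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def expected_richness(counts, n_reads):
    """Expected number of OTUs in a random subsample of `n_reads` reads.

    E[S_n] = sum_i (1 - C(N - N_i, n) / C(N, n)), where N is the total
    number of reads and N_i the number of reads of OTU i.
    """
    total = sum(counts)
    if not 0 <= n_reads <= total:
        raise ValueError("n_reads must lie between 0 and the total reads")
    log_all = log_binomial(total, n_reads)
    richness = 0.0
    for c in counts:
        if total - c < n_reads:
            # The subsample is too large to miss this OTU entirely.
            richness += 1.0
        else:
            richness += 1.0 - exp(log_binomial(total - c, n_reads) - log_all)
    return richness


# Hypothetical community with a few abundant and many rare OTUs.
toy_counts = [500, 200, 50, 10] + [1] * 40
for n in (10, 100, 800):
    print(n, round(expected_richness(toy_counts, n), 2))
```

Plotting such a curve on linear axes tends to suggest saturation, whereas the same values on log-log axes reveal whether richness is still growing as a power of the number of reads.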

Figure 2

Number of species as a function of the number of reads. The expected number of OTUs in a random sampling of the total population grows sublinearly with sampling size, \(S\sim N^{\gamma}\). (A) In the upper ocean (continuous black line), we can identify a first quasi-linear regime with γ = 0.90 (95% confidence interval <0.01) and a second regime with γ = 0.33 (confidence interval <0.01), while (B) in the deep ocean (continuous red line) the exponent is γ = 0.62 (confidence interval <0.01). The number of OTUs in the upper ocean (horizontal dotted black line) is estimated at 35,650 OTUs5, and in the deep ocean (horizontal dotted red line) the maximum number of OTUs found4 is 3,695.

Mesocosm experiment

We challenged the mathematically derived predictions, tested and confirmed by the in silico experiment, by enclosing a plankton community of the Central Red Sea in duplicate, and sampling and sequencing it every day for 20 days25 (cf. Materials and Methods). The number of prokaryote OTUs detected in the sampled Central Red Sea community continued to increase with additional sampling effort (Fig. 3), and their abundance followed a power-law distribution with an average exponent of α = 0.53, comparable to that obtained for the deep ocean (α = 0.89) and for the less abundant OTUs of the upper ocean (α = 0.36) (Fig. 3D). In line with the upper and deep ocean cases, a comparative analysis performed for all the samples of the mesocosm experiment in three experimental conditions (control, single dose Nitrate-Phosphate addition and single dose Nitrate-Phosphate-Silicate addition) shows that a distribution with a power-law decay (either as a pure power-law or a truncated power-law) is the most likely fit (Supplementary Tables 2–7). The results confirmed the expectation that the number of OTUs retrieved in this community increased, on average, with the power 0.46 of the cumulative number of 16S reads sequenced, without a clear asymptotic behavior despite exhaustive sampling (Fig. 3A–C and Tables 1 and 2).

Figure 3

Scaling of the number of OTUs with the number of reads in an experiment. The number of prokaryote OTUs as a function of the number of reads is plotted, in a log-log scale, every two days as the experiment runs for 20 days in different conditions: (A) control (Mesocosm C1 and C2), (B) single dose nitrate-phosphate addition (NP) (Mesocosm C3 and C4), and (C) single dose nitrate-phosphate-silicate addition (NPS) (Mesocosm C5 and C6). For all the conditions, we plot two replicates. The number of OTUs, S, scales with the number of reads, N, as \(S\sim N^{\gamma}\), with γ = 0.44, 0.40 (control), 0.38, 0.40 (NP), 0.48, 0.52 (NPS). The insets show the same data in linear scale (same ranges as main plots), where an apparent saturation asymptote is observed. (D) Abundance vs rank plot for one of the controls for successive days from bottom to top. The exponent of a power-law distribution fit, \(P(x)\sim x^{-1-\alpha}\), for the aggregated data after 20 days (black line) is α = 0.52.

Table 2 Scaling exponents and confidence interval for the mesocosm experiment.
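
A scaling exponent γ of the kind reported for the mesocosms can be estimated from cumulative (reads, OTUs) pairs by ordinary least squares on log-transformed values. This is a minimal sketch under that assumption; the fitting procedure actually used for Table 2 is not specified here, and the function name and the synthetic data (generated from \(S=3N^{0.46}\)) are illustrative only.

```python
from math import exp, log


def fit_scaling_exponent(reads, otus):
    """Least-squares fit of log S = log c + gamma * log N.

    Returns (gamma, c) for the power law S = c * N**gamma.
    """
    xs = [log(r) for r in reads]
    ys = [log(s) for s in otus]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gamma = sxy / sxx
    prefactor = exp(mean_y - gamma * mean_x)
    return gamma, prefactor


# Synthetic, noise-free example (not experimental data).
reads = [10_000, 20_000, 50_000, 100_000, 200_000]
otus = [3.0 * n ** 0.46 for n in reads]
gamma, prefactor = fit_scaling_exponent(reads, otus)
print(round(gamma, 3), round(prefactor, 3))
```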

Discussion

The results presented show that the abundance of different prokaryotic species in the ocean is described by a power-law distribution, which implies that the total number of OTUs continues to increase with increasing sampling effort, with a power given by that of the rank-abundance power-law. The dependence of the estimated richness on sampling effort is not an exclusive property of a power-law distribution and has also been reported for lognormal distributions, both theoretically26 and empirically7,23. We expect that the effort-dependence of species richness applies to distributions with sufficiently long tails, which are characterized by the presence of many rare species (OTUs). Thus, in the presence of a rare biosphere2, the effort-dependence of richness estimates is the expected outcome. Hence, the estimates that the upper and deep ocean contain ca. 37,000 and 3,700 prokaryote OTUs4,5, respectively, derived from rarefaction curves, are underestimates (Fig. 2). The estimation of diversity based on sampling effort (both the number of samples collected and the sequencing depth applied to each sample) still represents a challenge and requires broad extrapolations. We have addressed the estimation of prokaryote diversity with the parsimonious assumption that the sampled distribution represents the population distribution, further supported by the relatively conserved shape of these abundance distributions when sampling is replicated, as in our mesocosm experiment (see Supplementary Tables 2–7). Thus, we have explored the estimation of prokaryote diversity derived from fitting different underlying distributions to the upper and deep ocean and the mesocosm experiment. Future research increasing sampling effort, both for individual communities and for locations across the ocean, is likely to yield OTU counts much higher than these estimates.
The power-law distribution of species richness is not a new observation in ecology27,28,29,30,31 but is rooted in the seminal work of Willis and Yule showing a power-law distribution of species membership within taxa17. Indeed, a recent estimate of oceanic prokaryote species richness, derived by extrapolating across more than 20 orders of magnitude the relationship between species numbers and number of cells sampled to match the \(10^{29}\) prokaryote cells estimated in the global ocean, led to an estimate of \(10^{10}\) different OTUs for this ecosystem7. Whereas the estimate derived from such wild extrapolation rests on a number of assumptions and does not necessarily reflect the shape of the species abundance distribution of oceanic prokaryotes, it supports our empirical, mathematical, modeling and experimental results, which indicate that the number of prokaryote OTUs in the ocean is far larger than currently estimated. A much-enhanced sampling effort is, therefore, required to unveil the prokaryote diversity concealed within the rare biosphere. Enhanced sampling efforts should be deployed both to retrieve the least abundant components of any one community and to benefit from the dynamics of microbial populations, which can bring otherwise rare components of the microbial biosphere to a level of abundance where they may be retrieved in sequencing projects (e.g., ref. 32). Efforts to achieve an inventory of prokaryotic OTUs in the ocean will require far more exhaustive sampling than deployed to date, combined with sound extrapolation approaches rooted in the observed abundance distributions of prokaryotic OTUs.

Materials and Methods

Data and experimental design

We have analyzed three datasets: two empirical datasets from global expeditions and one experimental dataset. From the TARA expedition we collected the abundance of 18,022 OTUs from the surface water and deep chlorophyll maximum layers in 63 and 46 sites, respectively, containing 3,323,839 reads5 (available at http://ocean-microbiome.embl.de/companion.html). From the Malaspina expedition, we collected the abundance of 3,695 free-living and particle-attached OTUs from 30 globally distributed sites in the bathypelagic ocean4 (available at https://github.com/GuillemSalazar/MolEcol_2015). The experimental data reported the OTU abundance every day for a period of 20 days in three experimental conditions: (a) control (referred to as Mesocosm C1 and C2), (b) single dose Nitrate-Phosphate addition (referred to as C3 and C4), and (c) single dose Nitrate-Phosphate-Silicate addition (referred to as C5 and C6) (Nitrate = 2 µM, Phosphate = 0.12 µM, Silicate = 3.75 µM)25. Samples range from an average of 11,126 ± 5,400 (SD) reads, leading to 337 ± 100 (SD) OTUs, on the first day to an aggregated number of 212,761 ± 22,000 (SD) reads and 1,331 ± 56 (SD) OTUs after completion of the experiment. Raw reads, on which the OTU counts were based, have been deposited in the NCBI Sequence Read Archive under the accession number SRP051855.

Statistical analysis

Abundance distribution

The model fits of the power-law, truncated power-law, lognormal, and stretched exponential distributions were obtained with Maximum Likelihood Estimation applied to the empirical data33. For the upper ocean, we also fitted a double power-law distribution.

In silico prokaryote diversity: upper ocean

We proposed a distribution with two power-law regimes, with the parameter values (scaling exponents and transition point) obtained as described below: \(P(x)=Ax^{-1-\delta}\) for abundances \(x\le x_c\), and \(P(x)=Bx^{-1-\alpha}\) for \(x>x_c\). The condition that the distribution is continuous at \(x_c\) (\(Ax_c^{-1-\delta}=Bx_c^{-1-\alpha}\)) and the normalization (ΣP(x) = 1, approximated by an integral over x ≥ 1) lead to the values \(A=\alpha \delta /(\alpha +(\delta -\alpha )x_c^{-\delta})\) and \(B=Ax_c^{\alpha -\delta}\). We assigned to the exponents α and δ and to the transition point \(x_c\) the values obtained from the Maximum Likelihood Estimation: α = 1.54, δ = 0.36, and \(x_c\) = 2,313.

In silico prokaryote diversity: deep ocean

We proposed a shifted power-law to capture the power-law tail and the deviation at the head of the distribution: \(P(x)=\alpha ((x+x_0)/(1+x_0))^{-1-\alpha}\). The parameters α and \(x_0\) can be obtained by the Maximum Likelihood Estimation: \(\alpha =N_{\mathrm{OTU}}/\sum \log ((x_0+x_i)/(1+x_0))\), and \((x_0+1)\sum 1/(1+x_i)=N_{\mathrm{OTU}}\,\alpha /(1-\alpha )\). To solve these implicit equations, we proposed initial values of \(x_0\) and α, evaluated the previous expressions, and obtained new values \(x_0'\) and α′. We repeated these steps until we reached the condition \(|x_0'-x_0|<T\), for some convergence value T. For \(T=10^{-6}\), the values we obtained are α = 0.89 and \(x_0\) = 20.34.

Akaike Information Criterion (AIC)

The Akaike Information Criterion is defined as AIC = −2 log L + 2V, where L is the maximum likelihood of a fitted model, and V is the number of free parameters. The delta Akaike Information Criterion is calculated as ΔAIC = AIC − AICmin, where AICmin corresponds to the minimum value over all the candidate models, and AIC is the value of the candidate model. The Akaike weight

$${w}_{i}(\mathrm{AIC})=\frac{\exp (-\frac{1}{2}{\Delta }_{i}\mathrm{AIC})}{{\sum }_{k=1}^{M}\exp (-\frac{1}{2}{\Delta }_{k}\mathrm{AIC})}$$

can be interpreted as the probability that model i is the best model (in the AIC sense, that it minimizes the Kullback–Leibler discrepancy), given the data and the set of candidate models (e.g., Burnham & Anderson, 2001).
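
The definitions above translate directly into a few lines of code. The sketch below computes AIC values, ΔAIC and Akaike weights from maximized log-likelihoods and parameter counts; the function name and the example numbers are hypothetical and do not correspond to the fits reported in Table 1.

```python
from math import exp


def akaike_weights(log_likelihoods, n_params):
    """Akaike weights for a set of candidate models.

    log_likelihoods: maximized log-likelihood (log L) of each model.
    n_params: number of free parameters V of each model.
    """
    aic = [-2.0 * ll + 2.0 * v for ll, v in zip(log_likelihoods, n_params)]
    aic_min = min(aic)
    deltas = [a - aic_min for a in aic]
    raw = [exp(-0.5 * d) for d in deltas]
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical comparison of three candidate models.
weights = akaike_weights([-1200.0, -1203.5, -1210.0], [1, 2, 2])
print([round(w, 4) for w in weights])
```

Subtracting the minimum AIC before exponentiating keeps the computation numerically stable when the log-likelihoods are large in magnitude.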

Extrapolation of abundance distributions for larger number of samples

For the upper ocean, the abundance distribution is fitted to a double power-law defined as \(P(x)=Ax^{-1-\delta}\) for \(x<x_c\) and \(P(x)=Bx^{-1-\alpha}\) for \(x>x_c\). A continuity condition (\(Ax_c^{-1-\delta}=Bx_c^{-1-\alpha}\)) and the normalization condition (\(1={\int }_{1}^{\infty }P(x)dx\)) give the values for the constants A and B as \(A=\alpha \delta (\alpha +(\delta -\alpha )x_c^{-\delta})^{-1}\) and \(B=Ax_c^{\alpha -\delta}\). In order to fit this distribution, we have to obtain estimates for the two exponents δ and α and for the cutoff \(x_c\). We first use the maximum likelihood method implemented in ref. 30, which fits the exponent for the tail α and the value of the cutoff \(x_c\). Then we adjust the value of the exponent for the range [1, \(x_c\)] by using the same method, only fixing the minimum value to 1 and disregarding any data over the cutoff value \(x_c\). In order to extract the behavior of the parameters for an increasingly large ecosystem, we used increasingly large random aggregations of samples from the TARA Oceans Expedition (139 samples in total). The average parameters for aggregations of samples with a similar total number of reads are shown in the left column of Supplementary Fig. 2 in black, and the error bars reflect their standard deviation. Next, in order to extrapolate these parameters to larger numbers of reads, we fitted the estimated parameters to some simple curves (shown in red in Supplementary Fig. 2). The results were \(x_c=0.0002\,N_{\mathrm{reads}}^{1.1}+52.6\), \(\delta =0.32\,(1+0.71\exp (-N_{\mathrm{reads}}/570007))\) and \(\alpha =1.42\,(1-0.2\exp (-N_{\mathrm{reads}}/110185))\). Note that the values of the scaling exponent of the tail of the distribution α are in agreement with recently reported estimates34. For the in silico generation of larger samples, we extrapolated the parameter values to the value corresponding to the desired number of reads and generated random numbers from the corresponding distribution up to the desired number of reads, using the method of the inversion of the cumulative distribution.
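
The inversion of the cumulative distribution for the double power-law can be sketched as follows. The code treats abundance as a continuous variable on [1, ∞), which is an assumption of this illustration (any rounding to integer read counts is left out), and the function names are hypothetical. The parameter values are the upper-ocean maximum likelihood estimates quoted in Materials and Methods (α = 1.54, δ = 0.36, \(x_c\) = 2,313).

```python
import random


def double_power_law(alpha, delta, x_c):
    """Return (cdf, quantile) for the continuous double power-law.

    P(x) = A * x**(-1 - delta) for 1 <= x <= x_c
    P(x) = B * x**(-1 - alpha) for x > x_c
    with A and B fixed by continuity at x_c and normalization.
    """
    a = alpha * delta / (alpha + (delta - alpha) * x_c ** (-delta))
    b = a * x_c ** (alpha - delta)
    u_c = a * (1.0 - x_c ** (-delta)) / delta  # CDF value at x_c

    def cdf(x):
        if x <= x_c:
            return a * (1.0 - x ** (-delta)) / delta
        return 1.0 - b * x ** (-alpha) / alpha

    def quantile(u):
        # Inverse of cdf for u in [0, 1).
        if u <= u_c:
            return (1.0 - delta * u / a) ** (-1.0 / delta)
        return (alpha * (1.0 - u) / b) ** (-1.0 / alpha)

    return cdf, quantile


cdf, quantile = double_power_law(alpha=1.54, delta=0.36, x_c=2313.0)
rng = random.Random(0)  # fixed seed so the sketch is reproducible
sample = [quantile(rng.random()) for _ in range(5)]
print([round(x, 1) for x in sample])
```

Drawing a uniform number u and returning `quantile(u)` yields abundances distributed according to P(x); repeating the draw until the accumulated abundance reaches the desired number of reads would mimic the generation step described above.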

For the deep ocean, the abundance distribution is fitted to a shifted power-law \(P(x)=A(x+x_0)^{-1-\alpha}\) with a maximum possible value for the abundance \(x_{\max}\). The value of A is given by the normalization condition (\(1={\int }_{1}^{x_{\max}}P(x)dx\)) and is \(A=\alpha ((1+x_0)^{-\alpha}-(x_{\max}+x_0)^{-\alpha})^{-1}\). In this case, we again need to estimate three parameters to fit the distribution. In order to estimate the parameters, we first fitted the exponent α and the shifting parameter \(x_0\) by solving iteratively the equations from maximum likelihood:

$$\alpha =S{\left(\mathop{\sum }\limits_{i=1}^{S}\log \frac{{x}_{i}+{x}_{0}}{1+{x}_{0}}\right)}^{-1}$$
$${x}_{0}=\alpha S{\left((1+\alpha )\mathop{\sum }\limits_{i=1}^{S}\frac{1}{{x}_{i}+{x}_{0}}\right)}^{-1},$$

where S stands for the number of data points. With those estimated parameters, we estimated the maximum abundance \(x_{\max}\) through the average abundance \(\langle x\rangle \) found in the data by solving the implicit equation \(\langle x\rangle ={\int }_{1}^{x_{\max}}xP(x)dx\):

$$\langle x\rangle =\frac{\alpha }{1-\alpha }\frac{{({x}_{\max }+{x}_{0})}^{1-\alpha }-{(1+{x}_{0})}^{1-\alpha }}{{(1+{x}_{0})}^{-\alpha }-{({x}_{\max }+{x}_{0})}^{-\alpha }}-{x}_{0}$$

The parameters are shown in the right column of Supplementary Fig. 2; again, average estimates are shown in black, with standard deviations shown as error bars, and the simple fitted curves used for the extrapolation are shown in red. In this case the simple curves fitted were \(x_0=0.000003\,N_{\mathrm{reads}}^{1.1}-1\), \(\alpha =0.88\,(1-0.45\exp (-N_{\mathrm{reads}}/363263))\) and \(\langle x\rangle =0.00042\,N_{\mathrm{reads}}^{0.97}+23.6\).
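
The implicit equation relating \(\langle x\rangle \) to \(x_{\max}\) has no closed-form solution, but the mean of the truncated distribution increases monotonically with \(x_{\max}\), so a bracketing search is sufficient. The sketch below uses bisection on a logarithmic scale; this solver, the function names and the round-trip example are assumptions for illustration, since the numerical method actually used is not stated. The values α = 0.89 and \(x_0\) = 20.34 are the deep-ocean estimates quoted in Materials and Methods.

```python
def mean_abundance(x_max, alpha, x0):
    """Mean of the shifted power-law truncated to [1, x_max] (alpha != 1)."""
    num = (x_max + x0) ** (1.0 - alpha) - (1.0 + x0) ** (1.0 - alpha)
    den = (1.0 + x0) ** (-alpha) - (x_max + x0) ** (-alpha)
    return alpha / (1.0 - alpha) * num / den - x0


def solve_x_max(target_mean, alpha, x0, tol=1e-10, max_doublings=200):
    """Find x_max such that mean_abundance(x_max) equals target_mean."""
    lo = 1.0 + 1e-6
    if target_mean <= mean_abundance(lo, alpha, x0):
        raise ValueError("target mean is too small for x_max > 1")
    hi = 2.0
    for _ in range(max_doublings):
        if mean_abundance(hi, alpha, x0) >= target_mean:
            break
        hi *= 2.0
    else:
        # Can happen for alpha > 1, where the mean stays bounded.
        raise ValueError("target mean is not reachable")
    while hi - lo > tol * hi:
        mid = (lo * hi) ** 0.5  # bisect on a logarithmic scale
        if mean_abundance(mid, alpha, x0) < target_mean:
            lo = mid
        else:
            hi = mid
    return (lo * hi) ** 0.5


# Round trip: compute the mean for a known cutoff, then recover the cutoff.
alpha, x0 = 0.89, 20.34
mean_value = mean_abundance(1.0e6, alpha, x0)
print(solve_x_max(mean_value, alpha, x0))
```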

The estimation for a larger number of reads was performed as for the upper ocean but using the proper shifted power-law distribution as given by the extrapolated parameters.