Introduction

Biodiversity has traditionally been considered a consequence of environmental processes, such as niche partitioning, resource distribution, and disturbances. In the last several decades, a new view of biodiversity as a predictor of environmental processes and functions has gained interest1,2,3 and developed into the research field now known as biodiversity-ecosystem function (BEF)4,5,6. Bacteria interact intimately with their surrounding environment and ecosystem, and thus bacterial diversity has an important role in BEF research7,8. However, even arriving at a reasonable description of bacterial diversity is challenging because of the intrinsic properties of bacteria (e.g., debatable species concept, hyperdiversity, variable 16S rRNA gene copy number) and technological difficulties9,10,11,12. One of the challenges in bacterial diversity estimation is the capture of rare taxa (the rare biosphere), which often account for a large portion of microbial diversity13,14,15; the experimental determination of the uncertainty involved is not yet available. Since 2005, second-generation sequencing technologies have drastically advanced the capacity and depth of microbial community sampling by sequencing. However, there is still bias associated with the experimental procedures, and sampling by sequencing is also known to be a less-than-complete representation16. Thus, reproducible estimation of biodiversity is not yet available17. One way to overcome this problem is to use statistical and mathematical biodiversity estimations18. However, most mathematical and statistical approaches to biodiversity estimation were developed for investigating less diverse organisms (e.g., plants and animals), which poses an inherent challenge in applying these tools to bacterial communities because of their hyperdiversity. Therefore, a framework that accommodates those challenges is needed for reasonable bacterial diversity estimation using currently available experimental resources.

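The Hill number (Na)19 was proposed as a unified diversity concept that defines biodiversity as the reciprocal of a mean proportional abundance and weights taxa differently according to their abundances, as follows:

$$N_a = \left( \sum_{i=1}^{S} p_i^{a} \right)^{1/(1-a)}$$

where p_i is the proportional abundance of taxon i and S is the number of taxa; N1 is taken as the limit a → 1.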

Parameter a determines the special cases of the Hill number: for example, N0 is the number of taxa, N1 is the exponential of the Shannon index, and N2 is the reciprocal of the Simpson index19. Because of its generality and its flexibility in controlling the effect of rare taxa on the biodiversity measure, the Hill number may be an excellent framework for bacterial diversity studies9. Recently, Haegeman et al.20 showed, using a series of sequence data sets, that the uncertainty associated with Hill numbers quickly increases to an uncontrollable range when a < 1.
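As a minimal sketch of how these special cases can be computed from a vector of per-taxon read counts, the following Python function evaluates Na for any a, treating a = 1 as the limiting case. The function name `hill_number` and the example community are hypothetical and are not taken from the study.

```python
import math

def hill_number(counts, a):
    """Hill number N_a for a list of per-taxon counts (zero counts are ignored)."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    if a == 1:
        # Limit a -> 1: exponential of the Shannon index.
        return math.exp(-sum(p * math.log(p) for p in props))
    return sum(p ** a for p in props) ** (1.0 / (1.0 - a))

# Hypothetical community: two dominant taxa and eight singletons.
counts = [50, 40, 1, 1, 1, 1, 1, 1, 1, 1]
for a in (0, 1, 2):
    print(a, round(hill_number(counts, a), 2))
```

In this toy community N0 counts all ten taxa, whereas N1 and N2 fall toward the number of dominant taxa, which illustrates how a larger a downweights rare taxa.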

The consensus in bacterial diversity studies is that a fully exhaustive census would require an extremely large amount of resources for most natural ecosystems11,21. We argue that the lack of saturation (the failure to reach an asymptote) in rarefaction curves is due to the vast size of the rare biosphere; thus, a saturated estimate of bacterial diversity may be obtainable with reasonable sequencing effort by using the Hill number framework, which weights rare taxa differentially. The goal of this study is to investigate the use of the Hill number as a framework for reliable diversity estimation at a given sequencing depth.

Results and Discussion

The taxa-accumulation curves of the Amazon and Texas mine studies (Fig. 1, S1 and S2) show both similarities and differences in their patterns. The richness measures (N0 and Chao1 index) are far from saturated in both studies, and the degree of diversity coverage increased as the parameter a increased. The degree of coverage, however, was much lower in the Texas mine study; only N2 provided enough coverage to reach an asymptote. The difference is apparently due to the depth of sampling (sequencing), which is discussed further below with the in silico analysis. A higher a makes the measure less sensitive to the contribution of rare taxa to overall biodiversity (γ diversity) and therefore more robust, with reduced uncertainty20.

Figure 1

Smoothed taxa-accumulation curves (TACs) with different Hill numbers (A, N0; C, N1; D, N2) and the Chao1 index (B) for the Amazon (66 samples) and Texas mine (36 samples) studies together. The inset is the Texas mine rarefaction curve, shown alone to better represent the trend, given the large difference in sequence reads between the two data sets. Taxa (N0) represents unique OTUs at a 97% similarity cutoff.

This analysis revealed an interesting pattern between soil bacterial communities from two distinct ecosystems measured at very different sequencing depths. The observed taxa richness (N0) is fairly similar, but the difference becomes greater as a increases (Table S1), in that the Texas mine soil bacterial community is much more diverse than that of the Amazon soil samples. This is at least partly due to the abundant rare taxa, which likely caused the rather low sampling completeness in the Texas mine samples (~32%) compared to the Amazon samples (~65%)22. In the case of the Chao1 index, the large numbers of singletons and doubletons in the Texas mine samples inflate the index, which is defined, in part, by the ratio of the square of the singleton frequency (F1) to twice the doubleton frequency (F2), that is, F1^2/(2F2) (Fig. 2B). It is impossible to determine how many of those singletons and doubletons represent real rare taxa and how many are sequencing artifacts. Because of this uncertainty, the Hill number may be useful, as it allows control over the contribution of rare taxa to the diversity estimate. Significant deviation (D = 0.17, P < 0.001) from a log-normal model also indicates incomplete sampling in the Texas mine microbial communities23. The large difference in the proportion of rare taxa between the two data sets also resulted in distinctive taxa abundance patterns (Fig. 2 and S3). Since the Texas mine samples were from a chronosequence of reclamation, the Zipf model is conceptually fitting24. However, under-sampling of the Texas data set may also be contributing to the distinctive taxa abundance patterns. To test the relationship between sampling degree and biodiversity coverage in the TAC, we randomly subsampled the Amazon data to varying degrees between 25,000 and 400,000 reads (Fig. S4). Sufficient biodiversity coverage in the TAC seems to be obtained with ~200,000 reads, resulting in reliable biodiversity measures (N1 and N2).
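The sensitivity of Chao1 to singletons and doubletons can be illustrated with a short sketch. The function below uses the classic estimator, observed richness plus F1^2/(2F2), and falls back to the bias-corrected form when no doubletons are present. The name `chao1` and the example counts are hypothetical, and the exact variant computed by the software used in this study is not assumed here.

```python
from collections import Counter

def chao1(counts):
    """Chao1 richness estimate from per-taxon read counts."""
    observed = sum(1 for c in counts if c > 0)
    freq = Counter(counts)
    f1, f2 = freq[1], freq[2]  # numbers of singleton and doubleton taxa
    if f2 > 0:
        # Classic form: S_obs + F1^2 / (2 * F2)
        return observed + f1 * f1 / (2.0 * f2)
    # Bias-corrected form, used here only when there are no doubletons.
    return observed + f1 * (f1 - 1) / 2.0

# Hypothetical counts: 9 observed taxa, 4 singletons, 2 doubletons.
print(chao1([50, 40, 3, 2, 2, 1, 1, 1, 1]))  # 9 + 16/4 = 13.0
```

Because F1 enters as a square, a modest excess of singletons, whether real rare taxa or sequencing artifacts, raises the estimate sharply.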

Figure 2

Rank abundance distribution plots (Whittaker plots) for the Amazon (A) and Texas mine (B) studies. The best-fit taxa abundance distribution (TAD) models are a log-normal distribution for the Amazon data and a Zipf distribution for the Texas mine data.
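For readers unfamiliar with Whittaker plots, the sketch below builds the rank-abundance pairs that such a plot displays and computes a least-squares slope on log-log axes. Under a Zipf model, abundance is proportional to a power of rank, so the points fall on a straight line in log-log space. This is only a rough heuristic with hypothetical function names; it is not the model-fitting procedure of the vegan or sads packages used in the study.

```python
import math

def rank_abundance(counts):
    """Return (rank, abundance) pairs, most abundant taxon first."""
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    return list(enumerate(ranked, start=1))

def loglog_slope(pairs):
    """Least-squares slope of log(abundance) against log(rank)."""
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(abund) for _, abund in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Hypothetical community whose abundances decay roughly as 1/rank.
pairs = rank_abundance([1000, 500, 333, 250, 200, 167, 143, 125])
print(round(loglog_slope(pairs), 2))
```

A slope that stays nearly constant across ranks is compatible with a Zipf-like pattern, whereas a curve that bends downward at high ranks is more typical of a log-normal shape.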

The two data sets used here were suitable because they were prepared using almost identical procedures while their sequencing depths were vastly different. A recent study using a mock community concluded that microbial composition results are influenced by the primers and sequencing platforms used25; thus, the comparable experimental procedures increase the credibility of the results. The difference in sequencing coverage is also useful because it can show the scale independence of the analyses and results.

In conclusion, the hyperdiverse nature of microbiota in most ecosystems often results in random sampling and under-sampling, which hampers reliable diversity estimation even with the technological advancements made by second-generation sequencing technologies. Until a series of significant technological advancements in sampling coverage is available, the Hill number and TAC approach may be a suitable framework for reliable estimation of diversity and for further applications in research areas such as BEF and dimensions of biodiversity.

Methods

We used a smoothed taxa-accumulation curve (TAC), which is often mislabeled as a rarefaction curve, to investigate a reliable approach to estimating bacterial diversity from two 454 pyrosequence data sets. One data set is from soil samples in a chronosequence of reclaimed surface mine sites in East Texas (Texas study), and the other is from soil samples from an Amazonian rainforest that was converted to agricultural fields (Amazon study). Both data sets were prepared by very similar experimental and analytical procedures. Briefly, both studies used a PowerSoil DNA Isolation kit (MoBio Laboratories) for DNA extraction following the manufacturer’s instructions and a 454 GS FLX Sequencer (454 Life Sciences) for 16S rRNA gene sequencing of the V4-V5 region (~350 bp). The quality-processed sequences were analyzed using mothur software (v. 1.23.1)26 with the SILVA and Ribosomal Database Project (RDP) databases for alignment and classification.

The depth of sequencing was quite different between the two studies: ~31,000 reads in the Texas mine study, which compared mine reclamation techniques (crosspit spreader, CP, and mixed overburden, MO), and ~400,000 reads in the Amazon study, which compared forest and converted pasture. First, unique taxa (OTU0.97) richness (N0), the Chao1 index27, the exponential Shannon index (N1), and the reciprocal Simpson index (N2) were calculated and then used in TAC construction using EstimateS 9.128 and R 3.1.329. Rank abundance distribution (RAD) plots were prepared using the vegan (2.2-1) and sads (0.2.4) packages.
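The idea of relating sequencing depth to diversity coverage can be sketched by repeatedly drawing reads at random, without replacement, at increasing depths and averaging a Hill number at each depth. The code below is an illustrative, read-based approximation with hypothetical function names and a hypothetical community; it does not reproduce the EstimateS procedure used in the study.

```python
import math
import random

def hill_number(counts, a):
    """Hill number N_a for per-taxon counts (zero counts are ignored)."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    if a == 1:
        return math.exp(-sum(p * math.log(p) for p in props))
    return sum(p ** a for p in props) ** (1.0 / (1.0 - a))

def subsample_counts(counts, depth, rng):
    """Draw `depth` reads without replacement; return per-taxon counts."""
    reads = [taxon for taxon, c in enumerate(counts) for _ in range(c)]
    sub = [0] * len(counts)
    for taxon in rng.sample(reads, depth):
        sub[taxon] += 1
    return sub

def accumulation_curve(counts, depths, a, replicates=20, seed=0):
    """Mean N_a over random subsamples at each depth (depth <= total reads)."""
    rng = random.Random(seed)
    curve = []
    for depth in depths:
        values = [hill_number(subsample_counts(counts, depth, rng), a)
                  for _ in range(replicates)]
        curve.append((depth, sum(values) / replicates))
    return curve

# Hypothetical community: two dominant taxa plus 200 singletons (1,000 reads).
community = [500, 300] + [1] * 200
for a in (0, 2):
    print(a, accumulation_curve(community, [100, 500, 1000], a))
```

In this toy example the N0 curve keeps rising until every read is included, while the N2 curve levels off at shallow depths, in line with the contrast between richness and higher-order Hill numbers described in the text.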

Additional Information

Accession codes: Sequence data used for this study are available from the NCBI Sequence Read Archive (SRA) under accession number SRP026369 (Texas mine data) and from FigShare, http://dx.doi.org/10.6084/m9.figshare.1547935 (Amazon data).

How to cite this article: Kang, S. et al. Hill number as a bacterial diversity measure framework with high-throughput sequence data. Sci. Rep. 6, 38263; doi: 10.1038/srep38263 (2016).

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.