
Hill number as a bacterial diversity measure framework with high-throughput sequence data

  • Scientific Reports 6, Article number: 38263 (2016)
  • doi:10.1038/srep38263


Abstract

Bacterial diversity is an important parameter for measuring bacterial contributions to the global ecosystem. However, even the task of describing bacterial diversity is challenging due to biological and technological difficulties. One of the challenges in bacterial diversity estimation is the appropriate measurement of rare taxa, but the uncertainty about the size of the rare biosphere is yet to be experimentally determined. One approach is to use the generalized diversity measure, Hill number (Na), to control the variability associated with rare taxa by differentially weighting them. Here, we investigated Hill number as a framework for microbial diversity measurement using a taxa-accumulation curve (TAC) with soil bacterial community data from two distinct studies by 454 pyrosequencing. Reliable biodiversity estimation was obtained when an increase in Hill number arose as the coverage became stable in TACs for a ≥ 1. In silico analysis also indicated that a certain level of sampling depth was desirable for reliable biodiversity estimation. Thus, to assess bacterial diversity from second generation sequencing, Hill number can be a good diversity framework at a given sequencing depth, at least until technology advances far enough to overcome the under- and random-sampling issues of current sequencing approaches.


Introduction

Biodiversity has traditionally been considered to be a consequence of environmental processes, such as niche partitioning, resource distribution, and disturbances. In the last several decades, a new view of biodiversity as the predictor of environmental processes and functions gained interest1,2,3 and developed into the research field now regarded as biodiversity-ecosystem function (BEF)4,5,6. Bacteria have an intimately interactive relationship with their surrounding environment and ecosystem, and thus, bacterial diversity has an important role in BEF research7,8. However, even determining a reasonable description of bacterial diversity is challenging due to the intrinsic properties of bacteria (e.g., debatable species concept, hyperdiversity, variable 16S rRNA gene copy number) and technological difficulties9,10,11,12. One of the challenges in bacterial diversity estimation is the capture of rare taxa (rare biosphere), which often occupy large portions of microbial diversity13,14,15; the experimental determination of the uncertainty involved is not yet available. Since 2005, second generation sequencing technologies have drastically advanced the capacity and depth of microbial community sampling by sequencing. However, there is still bias associated with the experimental procedures, and sampling by sequencing is also known to be a less-than-complete representation16. Thus, reproducible estimation of biodiversity is not yet available17. One way to overcome this problem is to use statistical and mathematical biodiversity estimations18. However, most mathematical and statistical approaches to biodiversity estimation were developed for investigating less diverse organisms (e.g., plants and animals), which poses an inherent challenge in applying these tools to the analysis of bacterial communities due to their hyperdiversity. Therefore, a framework accommodating those challenges is needed for reasonable bacterial diversity estimation using currently available experimental resources.

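Hill number (Na)19 was proposed as a unified diversity concept by defining biodiversity as a reciprocal mean proportional abundance and differentially weighting taxa based on their abundances as follows:

$$N_a = \left( \sum_{i=1}^{S} p_i^{a} \right)^{1/(1-a)}$$

where p_i is the proportional abundance of the i-th taxon and S is the number of taxa; N1 is defined as the limit a → 1.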

Parameter a determines special cases of Hill number: for example, N0 is the number of taxa, N1 is the exponential of the Shannon index, and N2 is the reciprocal Simpson index19. Because of its generality and flexibility in controlling the effects of rare taxa on biodiversity measures, Hill number may be an excellent framework for bacterial diversity studies9. Recently, Haegeman et al.20 showed, using a series of sequence data sets, that the uncertainty associated with Hill numbers quickly increased to an uncontrollable range when a < 1.
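As a rough illustration of how the parameter a reweights rare taxa, the sketch below computes N0, N1 and N2 from a vector of taxon counts using the standard Hill formula, N_a = (Σ p_i^a)^(1/(1-a)), with N1 taken as the limit a → 1. The function name and the example counts are hypothetical and are not taken from either data set in this study.

```python
import math

def hill_number(counts, a):
    """Hill number N_a for a list of taxon counts (zero counts are ignored)."""
    total = sum(counts)
    p = [c / total for c in counts if c > 0]
    if a == 1:
        # Limit a -> 1: exponential of the Shannon entropy.
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    return sum(pi ** a for pi in p) ** (1.0 / (1.0 - a))

# Hypothetical community: two dominant taxa and a tail of rare ones.
counts = [50, 30, 10, 5, 2, 1, 1, 1]
for a in (0, 1, 2):
    print("N%d = %.2f" % (a, hill_number(counts, a)))
```

For a perfectly even community all three values coincide with the number of taxa; the more uneven the community, the further N1 and N2 fall below N0, because rare taxa contribute less as a grows.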

The consensus in bacterial diversity studies is that a fully exhaustive census may require an extremely large amount of resources for most natural ecosystems11,21. We argue that the “unsaturation” or asymptotic result in those rarefaction curves is due to the vast size of the rare biosphere; thus, saturated bacterial diversity may be obtainable with reasonable sequencing effort using the Hill number diversity framework, which places differential weight on rare taxa. The goal of this study is to investigate the use of Hill number as a framework for reliable diversity estimation at a given sequencing depth.

Results and Discussion

The taxa-accumulation curves of the Amazon and Texas mine studies (Fig. 1, S1 and S2) show both similarities and differences in their patterns. The richness measures (N0 and Chao1 index) are far from saturated in both studies, and as the parameter a increased, the degree of diversity coverage increased as well. The degree of coverage, however, was much lower in the Texas mine study; only N2 provided enough coverage to reach an asymptote. Apparently, the difference is due to the depth of sampling (sequencing), which is discussed further below with the in silico analysis. A higher a represents decreased sensitivity to the contributions of rare taxa to the overall biodiversity (γ diversity) and hence a more robust estimate with reduced uncertainty20.

Figure 1

Smoothed taxa-accumulation curves (TACs) with different Hill numbers (A, N0; C, N1; D, N2) and the Chao1 index (B) for the Amazon (66 samples) and Texas mine (36 samples) studies together. The inset shows the Texas mine curve alone to better represent its trend, given the large difference in sequence reads between the two data sets. Taxa (N0) represents unique OTUs at a 97% similarity cutoff.

This analysis revealed an interesting pattern between the soil bacterial communities measured at very different sequencing depths from two distinct ecosystems. The observed taxa richness (N0) is fairly similar, but the difference becomes greater as a increases (Table S1), in that the Texas mine soil bacterial community is much more diverse than that of the Amazon soil samples. This is at least partly due to the abundant rare taxa, which should have caused the rather low sampling completeness in the Texas mine samples (~32%) compared to the Amazon samples (~65%)22. In the case of the Chao1 index, the large numbers of singletons and doubletons in the Texas mine samples inflate the index, which is defined, in part, as the ratio of the square of the singleton frequency (F1) to twice the doubleton frequency (F2) (Fig. 2B). It is impossible to determine how many of those singletons and doubletons are real rare taxa and how many are sequencing artifacts. Because of this uncertainty, however, Hill number may be useful in that it allows the contributions of rare taxa to the diversity estimate to be controlled. Significant deviation (D = 0.17, P < 0.001) from a log-normal model also indicates incomplete sampling in the Texas mine microbial communities23. The large difference in the proportion of rare taxa between the two data sets also resulted in distinctive taxa abundance patterns (Fig. 2 and S3). Since the Texas mine samples were from a chronosequence of reclamation, the Zipf model is conceptually fitting24. However, under-sampling of the Texas data set may be contributing to the distinctive taxa abundance patterns as well. To test the relationship between sampling degree and biodiversity coverage in TAC, we randomly subsampled the Amazon data to varying depths between 25,000 and 400,000 reads (Fig. S4). Sufficient biodiversity coverage using TAC seems to be obtained with ~200,000 reads, resulting in reliable biodiversity measures (N1 and N2).
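The Chao1 definition and the subsampling test described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the community vector is invented, the helper names are hypothetical, the fallback used when no doubletons are present is the commonly used bias-corrected form (not something stated in the text), and reads are drawn without replacement, which may differ from the procedure used for Fig. S4.

```python
import random
from collections import Counter

def chao1(counts):
    """Chao1 richness estimate: S_obs + F1^2 / (2 * F2)."""
    s_obs = sum(1 for c in counts if c > 0)
    f1 = sum(1 for c in counts if c == 1)  # singletons
    f2 = sum(1 for c in counts if c == 2)  # doubletons
    if f2 == 0:
        # Assumption: bias-corrected fallback when there are no doubletons.
        return s_obs + f1 * (f1 - 1) / 2.0
    return s_obs + f1 * f1 / (2.0 * f2)

def inverse_simpson(counts):
    """Reciprocal Simpson index (Hill number N2)."""
    total = sum(counts)
    return 1.0 / sum((c / total) ** 2 for c in counts if c > 0)

def subsample(counts, depth, rng):
    """Draw `depth` reads without replacement; return per-taxon counts."""
    reads = [taxon for taxon, c in enumerate(counts) for _ in range(c)]
    picked = Counter(rng.sample(reads, depth))
    return [picked.get(taxon, 0) for taxon in range(len(counts))]

# Hypothetical community: 5 abundant taxa plus 200 rare ones.
community = [2000, 1500, 1000, 500, 250] + [3] * 200
rng = random.Random(42)  # fixed seed so the run is repeatable
for depth in (250, 1000, 4000):
    sub = subsample(community, depth, rng)
    print(depth, round(chao1(sub), 1), round(inverse_simpson(sub), 2))
```

Comparing the two printed columns across depths gives a feel for how a richness estimator driven by singletons and doubletons responds to sequencing depth relative to an index dominated by abundant taxa.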

Figure 2

Rank abundance distribution plots (Whittaker plots) for the Amazon (A) and Texas mine (B) studies. The best-fit taxa abundance distribution (TAD) models are a log-normal distribution for the Amazon data and a Zipf distribution for the Texas mine data.

The two data sets used here were suitable because they were prepared using almost identical procedures but at vastly different sequencing depths. A recent study using a mock community concluded that microbial composition results are influenced by the primers and sequencing platforms used25; thus, the comparable experimental procedures increase the credibility of the results. The wide range of sequencing coverage is also useful because it can show the scale independence of the analyses and results.

In conclusion, the hyperdiverse nature of microbiota in most ecosystems often results in random- and under-sampling, thus hampering reliable diversity estimation even with the technological advancements made by second generation sequencing technologies. Until significant technological advancements in sampling coverage are available, the Hill number and TAC approach may be a suitable framework for reliable estimation of diversity and for further applications in research areas such as BEF and dimensions of biodiversity.


Methods

We used a smoothed taxa-accumulation curve (TAC), which is often mislabeled as a rarefaction curve, to investigate a reliable approach to estimating bacterial diversity from two 454 pyrosequencing data sets. One data set is from soil samples in a chronosequence of reclaimed surface mine sites in East Texas (Texas study) and the other is from soil samples from an Amazonian rainforest that was converted to agricultural fields (Amazon study). Both data sets were prepared by very similar experimental and analytical procedures. Briefly, both studies used a PowerSoil DNA Isolation kit (MoBio Laboratories) for DNA extraction following the manufacturer’s instructions and a 454 GS FLX Sequencer (454 Life Sciences) for 16S rRNA gene sequencing of the V4-V5 region (~350 bp). The quality-processed sequences were analyzed using mothur software (v. 1.23.1)26 with the SILVA and Ribosomal Database Project (RDP) databases for alignment and classification.

The depth of sequencing was quite different between the two studies: ~31,000 reads in the Texas mine samples, which compared mine reclamation techniques (crosspit spreader, CP, and mixed overburden, MO), and ~400,000 reads in the Amazon samples, which compared forest and converted pasture. First, unique taxa (OTU0.97) richness (N0), the Chao1 index27, the exponential Shannon index (N1), and the reciprocal Simpson index (N2) were calculated and then used in TAC construction with EstimateS 9.1 (ref. 28) and R 3.1.3 (ref. 29). Rank abundance distribution (RAD) plots were prepared using the vegan (2.2-1) and sads (0.2.4) packages.
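One way to picture a smoothed TAC is as a sample-based accumulation curve averaged over many random orderings of the samples, with the Hill number recomputed on the pooled counts at each step. The sketch below follows that idea; it is not a reimplementation of EstimateS or of the R code used here, and the function names, the number of permutations and the toy sample-by-taxon counts are all hypothetical.

```python
import math
import random

def hill_number(counts, a):
    """Hill number N_a for a list of taxon counts (zero counts are ignored)."""
    total = sum(counts)
    p = [c / total for c in counts if c > 0]
    if a == 1:
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    return sum(pi ** a for pi in p) ** (1.0 / (1.0 - a))

def smoothed_tac(samples, a, n_perm=100, seed=0):
    """Mean Hill number of the pooled counts after 1..n samples,
    averaged over n_perm random orderings of the samples.
    `samples` is a list of count lists sharing one taxon order."""
    rng = random.Random(seed)
    n = len(samples)
    totals = [0.0] * n
    for _ in range(n_perm):
        order = list(range(n))
        rng.shuffle(order)
        pooled = [0] * len(samples[0])
        for k, idx in enumerate(order):
            pooled = [x + y for x, y in zip(pooled, samples[idx])]
            totals[k] += hill_number(pooled, a)
    return [t / n_perm for t in totals]

# Toy data: 4 samples x 6 taxa (hypothetical counts).
samples = [
    [30, 10, 0, 1, 0, 0],
    [25, 12, 2, 0, 1, 0],
    [28, 9, 0, 0, 0, 1],
    [31, 11, 1, 1, 0, 0],
]
for a in (0, 1, 2):
    print(a, [round(v, 2) for v in smoothed_tac(samples, a)])
```

In a curve of this kind, a flattening of the later points is what the text refers to as the coverage becoming stable.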

Additional Information

Accession codes: Sequence data used for this study are available from the NCBI Sequence Read Archive (SRA) under accession number SRP026369 (Texas Mine data) and from FigShare, http://dx. (Amazon data).

How to cite this article: Kang, S. et al. Hill number as a bacterial diversity measure framework with high-throughput sequence data. Sci. Rep. 6, 38263; doi: 10.1038/srep38263 (2016).

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


References

  1. et al. Biodiversity and ecosystem functioning: Maintaining natural life support processes (Ecological Society of America, Washington DC, 1999).
  2. Declining biodiversity can alter the performance of ecosystems. Nature 368, 734–737 (1994).
  3. Biodiversity: population versus ecosystem stability. Ecology 77, 350–363 (1996).
  4. Biodiversity and ecosystem multifunctionality. Nature 448, 188–190, doi: 10.1038/nature05947 (2007).
  5. et al. Biodiversity and ecosystem functioning: current knowledge and future challenges. Science 294, 804–808, doi: 10.1126/science.1064088 (2001).
  6. Biodiversity and ecosystem functioning decoupled: invariant ecosystem functioning despite non-random reductions in consumer diversity. Oikos 125, 424–433, doi: 10.1111/oik.02220 (2016).
  7. et al. Loss in microbial diversity affects nitrogen cycling in soil. ISME J. 7, 1609–1619, doi: 10.1038/ismej.2013.34 (2013).
  8. The unseen majority: soil microbes as drivers of plant diversity and productivity in terrestrial ecosystems. Ecol. Lett. 11, 296–310, doi: 10.1111/j.1461-0248.2007.01139.x (2008).
  9. The tragedy of the uncommon: understanding limitations in the analysis of microbial diversity. ISME J. 2 (2008).
  10. et al. A unifying quantitative framework for exploring the multiple facets of microbial biodiversity across diverse scales. Environ. Microbiol. 15, 2642–2657, doi: 10.1111/1462-2920.12156 (2013).
  11. et al. Pyrosequencing enumerates and contrasts soil microbial diversity. ISME J. 1, 283–290 (2007).
  12. The Variability of the 16S rRNA Gene in Bacterial Genomes and Its Consequences for Bacterial Community Analyses. PLOS One 8, e57923, doi: 10.1371/journal.pone.0057923 (2013).
  13. Linking community and ecosystem processes: The role of minor species. Ecosystems 9, 119–127 (2006).
  14. et al. Microbial diversity in the deep sea and the underexplored “rare biosphere”. Proc. Natl. Acad. Sci. USA 103, 12115–12120 (2006).
  15. Ecology and exploration of the rare biosphere. Nat Rev Micro 13, 217–229, doi: 10.1038/nrmicro3400 (2015).
  16. et al. Random Sampling Process Leads to Overestimation of β-Diversity of Microbial Communities. mBio 4, doi: 10.1128/mBio.00324-13 (2013).
  17. et al. Reproducibility of pyrosequencing data for biodiversity assessment in complex communities. Methods in Ecology and Evolution 5, 881–890, doi: 10.1111/2041-210X.12230 (2014).
  18. Counting the uncountable: statistical approaches to estimating microbial diversity. Appl. Environ. Microbiol. 67, 4399–4406 (2001).
  19. Diversity and evenness: a unifying notation and its consequences. Ecology 54, 427–432 (1973).
  20. et al. Robust estimation of microbial diversity in theory and in practice. ISME J. 7, 1092–1101, doi: 10.1038/ismej.2013.10 (2013).
  21. The rational exploration of microbial diversity. ISME J. 2, 997–1006 (2008).
  22. Undersampling bias: the null hypothesis for singleton species in tropical arthropod surveys. J. Anim. Ecol. 78, 573–584 (2009).
  23. A meta-analysis of species-abundance distributions. Oikos 119, 1149–1155 (2010).
  24. Methods for fitting dominance/diversity curves. J. Veg. Sci. 2, 35–46 (1991).
  25. 16S rRNA gene sequencing of mock microbial populations- impact of DNA extraction method, primer choice and sequencing platform. BMC Microbiol. 16, 1–13, doi: 10.1186/s12866-016-0738-z (2016).
  26. et al. Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl. Environ. Microbiol. 75, 7537–7541 (2009).
  27. Nonparametric estimation of the number of classes in a population. Scand J Statist 11, 265–270 (1984).
  28. EstimateS: Statistical estimation of species richness and shared species from samples. Version 9 (2013).
  29. R: A language and environment for statistical computing. (R Foundation for Statistical Computing, Vienna, Austria, 2015).



The authors would like to recognize and thank Dr. Brendan Bohannan for the valuable comments.

Author information


  1. Department of Biology, Baylor University, Waco, TX, USA

    • Sanghoon Kang
  2. Department of Land, Air and Water Resources, University of California, Davis, Davis, CA, USA

    • Jorge L. M. Rodrigues
  3. Department of Soil & Crop Sciences, Texas A&M University, College Station, TX, USA

    • Justin P. Ng
    • Terry J. Gentry




S.K. designed the research; J.L.M.R., J.P.N. and T.J.G. conducted the research. S.K. analyzed the data, and S.K. and J.L.M.R. wrote the paper.

Competing interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to Sanghoon Kang.




This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit