1,520 reference genomes from cultivated human gut bacteria enable functional microbiome analyses

Zou, Yuanqiang; Xue, Wenbin; Luo, Guangwen; Deng, Ziqing; Qin, Panpan; Guo, Ruijin; Sun, Haipeng; Xia, Yan; Liang, Suisha; Dai, Ying; Wan, Daiwei; Jiang, Rongrong; Su, Lili; Feng, Qiang; Jie, Zhuye; Guo, Tongkun; Xia, Zhongkui; Liu, Chuan; Yu, Jinghong; Lin, Yuxiang; Tang, Shanmei; Huo, Guicheng; Xu, Xun; Hou, Yong; Liu, Xin; Wang, Jian; Yang, Huanming; Kristiansen, Karsten; Li, Junhua; Jia, Huijue; Xiao, Liang

doi:10.1038/s41587-018-0008-8

Download PDF

Resource
Open access
Published: 04 February 2019

1,520 reference genomes from cultivated human gut bacteria enable functional microbiome analyses

Yuanqiang Zou ORCID: orcid.org/0000-0003-2380-264X^1,2,3^na1,
Wenbin Xue^1,2^na1,
Guangwen Luo^1,2,4^na1,
Ziqing Deng ORCID: orcid.org/0000-0001-8726-0160^1,2^na1,
Panpan Qin ORCID: orcid.org/0000-0002-7307-8736^1,2,5^na1,
Ruijin Guo^1,2,
Haipeng Sun^1,2,
Yan Xia^1,2,5,
Suisha Liang^1,2,6,
Ying Dai^1,2,
Daiwei Wan^1,2,
Rongrong Jiang^1,2,
Lili Su^1,2,
Qiang Feng^1,2,
Zhuye Jie^1,2,
Tongkun Guo^1,2,
Zhongkui Xia^1,2,
Chuan Liu^1,2,6,
Jinghong Yu^1,2,
Yuxiang Lin^1,2,
Shanmei Tang^1,2,
Guicheng Huo⁴,
Xun Xu^1,2,
Yong Hou ORCID: orcid.org/0000-0002-0420-0726^1,2,
Xin Liu ORCID: orcid.org/0000-0003-3256-2940^1,2,7,
Jian Wang^1,8,
Huanming Yang^1,8,
Karsten Kristiansen ORCID: orcid.org/0000-0002-6024-0917^1,2,3,9,
Junhua Li ORCID: orcid.org/0000-0001-6784-1873^1,2,10,
Huijue Jia ORCID: orcid.org/0000-0002-3592-126X^1,2,11 &
…
Liang Xiao ORCID: orcid.org/0000-0003-0836-4397^1,2,6,9,12

Nature Biotechnology volume 37, pages 179–185 (2019)Cite this article

50k Accesses
303 Citations
227 Altmetric
Metrics details

Subjects

Abstract

Reference genomes are essential for metagenomic analyses and functional characterization of the human gut microbiota. We present the Culturable Genome Reference (CGR), a collection of 1,520 nonredundant, high-quality draft genomes generated from >6,000 bacteria cultivated from fecal samples of healthy humans. Of the 1,520 genomes, which were chosen to cover all major bacterial phyla and genera in the human gut, 264 are not represented in existing reference genome catalogs. We show that this increase in the number of reference bacterial genomes improves the rate of mapping metagenomic sequencing reads from 50% to >70%, enabling higher-resolution descriptions of the human gut microbiome. We use the CGR genomes to annotate functions of 338 bacterial species, showing the utility of this resource for functional studies. We also carry out a pan-genome analysis of 38 important human gut species, which reveals the diversity and specificity of functional enrichment between their core and dispensable genomes.

Assessing GPT-4 for cell type annotation in single-cell RNA-seq analysis

Article Open access 25 March 2024

Wenpin Hou & Zhicheng Ji

A distinct Fusobacterium nucleatum clade dominates the colorectal cancer niche

Article Open access 20 March 2024

Martha Zepeda-Rivera, Samuel S. Minot, … Christopher D. Johnston

A host–microbiota interactome reveals extensive transkingdom connectivity

Article 20 March 2024

Nicole D. Sonnert, Connor E. Rosen, … Noah W. Palm

Main

The human gut microbiota refers to the all the microorganisms that inhabit the human gastrointestinal tract. Diverse roles of the gut microbiota in human health and disease have been recognized^1,2. Metagenomic studies have transformed our understanding of the taxonomic and functional diversity of human microbiota, but more than half of the sequencing reads from a typical human fecal metagenome cannot be mapped to existing bacterial reference genomes^3,4. The lack of high-quality reference genomes has become an obstacle for high-resolution analyses of the human gut microbiome.

Although the previously reported Integrated Gene Catalog (IGC) has enabled metagenomic, metatranscriptomic and metaproteomic analyses^3,5,6, the gap between compositional and functional analyses can only be filled by individual bacterial genomes. Genes that co-vary among samples can be clustered into metagenomic linkage groups⁷, metagenomic clusters⁸ and metagenomic species^9,10, whose annotation depends on alignment to the limited number of existing reference genomes. Other metagenomics-based analyses of the gut microbiome—for example, single nucleotide polymorphisms (SNPs), indels and copy number variations—rely on the coverage and quality of reference genomes^11,12,13.

Despite the rapid increase in the number of sequenced bacterial and archaeal genomes, reference genomes for gut bacteria are underrepresented. It is estimated that <4% of the bacterial genomes in the US National Center for Biotechnology Information (NCBI) database belong to the human gut microbiota. Rather, the focus has been on clinically relevant pathogenic bacteria, which are overrepresented in the microbial databases. The first catalog of 178 reference bacterial genomes for the human microbiota was reported by the Human Microbiome Project (HMP)¹⁴ in 2010. To date, the HMP has sequenced >2,000 microbial genomes cultivated from human body sites, 437 of which are gut microbiota (data accessed 8 September 2017). However, the number of reference gut bacterial genomes is still far from saturated.

We present a reference catalog of genomes of cultivated human gut bacteria (named the CGR), established by culture-based isolation of >6,000 bacterial isolates from fecal samples of healthy individuals. The CGR comprises 1,520 nonredundant, high-quality draft bacterial genomes, contributing at least 264 new reference genomes to the gut microbiome. After inclusion of CGR genomes, the mapping rate of selected metagenomic datasets improved from around 50% to over 70%. In addition to improving metagenomic analyses, the CGR will improve functional characterization and pan-genomic analyses of the gut microbiota at high resolution.

Results

Expanded catalog of gut bacterial genomes

We obtained 6,487 bacterial isolates from fresh fecal samples donated by 155 healthy volunteers by using 11 different media under anaerobic conditions (Supplementary Fig. 1a and Supplementary Table 1). Notably, more than half of the isolates were cultured from MPYG medium (Supplementary Fig. 1b). All the isolates were subjected to 16 S rRNA gene amplicon sequencing analysis, and 1,759 nonredundant isolates that provided broad coverage of the phylogenetic tree were selected for whole-genome sequencing (Supplementary Fig. 1c and Supplementary Table 2). After de novo assembly of the next-generation sequencing reads, we identified 104 isolates that contained more than one genome. These assembled sequences were then parsed into 212 genomes using our in-house pipeline (Supplementary Table 3). Briefly, multi-genomes were split at scaffold level on the basis of G + C content versus sequencing depth. The closest reference genomes for the spilt scaffolds were determined on the basis of average nucleotide identity (ANI), and the mis-split scaffolds were mapped back to their closest reference genome to get the final split genome (see Methods). In total, we obtained a collection of 1,867 newly assembled genomes, 1,520 (81.4%) of which fulfilled the HMP’s criteria for high-quality draft genomes and exceeded 95% genome completeness and less than 5% contamination as evaluated by CheckM. The genome sizes and G + C contents of CGR ranged from 0.2 to 7.9 Mbp and 26.56% to 64.28%, respectively. A total of 5,749,641 genes were predicted from the annotation data (Supplementary Table 4).

Taxonomic annotation of CGR was carried out using a self-constructed, efficient ANI-based pipeline (Supplementary Fig. 2). The 1,520 high-quality genomes were classified into 338 species-level clusters (ANI ≥ 95%, a species delineation corresponding to 70% DNA–DNA hybridization), which covered all the major phyla of the human gut microbiota, including Firmicutes (211 clusters, 796 genomes), Bacteroidetes (60 clusters, 447 genomes), Actinobacteria (54 clusters, 235 genomes), Proteobacteria (10 clusters, 36 genomes) and Fusobacteria (3 clusters, 6 genomes) (Fig. 1a and Supplementary Table 5). Among these 338 clusters, 134 clusters (corresponding to 264 genomes) were not annotated to any present reference genomes in NCBI (Fig. 1a), and 50 clusters did not fall within any sequenced genera (Supplementary Table 5). To corroborate the presence of novel species in CGR, we carried out additional taxonomic identification using 16 S rRNA gene analysis. A species was recognized as novel if its 16 S rRNA gene sequence had < 98.7% similarity with known species in the EzBioCloud database (see Methods). Overall, we identified 350 distinct bacterial species (based on operational taxonomic units), including 149 candidate novel species, 42 of which represent candidate novel genera. These results underscore the value of the individual reference genomes provided by the CGR.

**Fig. 1: Phylogenetic tree of 1,520 isolated gut bacteria based on whole-genome sequences.**

Despite the variation of individual microbiota at the genus level, the CGR identified bacterial populations with broad diversity, covering eight out of nine core genera in the Chinese gut microbiota¹⁵. More than 80 species were novel in comparison with the previously sequenced species from a reported 1,000 cultured bacterial species from the human gastrointestinal tract¹⁶ (Supplementary Fig. 3a). Moreover, the CGR successfully identified 38 genera that were of low relative abundance ( < 1%) according to the IGC⁶, which is a large catalog of reference genes derived from a collection of ~1,250 metagenomic samples from individuals on three continents (Supplementary Fig. 3b). Among them, 7 genera were identified with more than 20 genomes (Bifidobacterium, Collinsella, Coprobacillus, Dorea, Streptococcus, Prevotella and Parabacteroides). The CGR also identified another 9 genera that were not detected by IGC⁶ (Butyricicoccus, Butyricimonas, Catenibacterium, Dielma, Erysipelatoclostridium, Megamonas, Melissococcus, Peptoclostridium and Vagococcus) (Supplementary Fig. 3b). These results underscore the contribution of the CGR to the existing database of gut bacterial whole genomes.

Improvement in metagenomic and SNP analyses

The existing reference genomes for metagenomic sequence mapping are far from saturated. For example, the genomes or draft genomes of bacteria and archaea used in a recent study allowed mapping of less than half of the sequences in the fecal metagenome^3,4. To illustrate the value of the CGR to metagenomic analyses, we performed sequence mapping using previous metagenomic data⁶ with or without CGR. For Chinese samples, the read mapping rate in the original study that used the IGCR dataset (3,449 reference genomes from IGC⁶) was 52.00%, which was significantly improved to 76.88% after the inclusion of the CGR dataset (Fig. 2a and Supplementary Table 6). Since all the samples in the CGR were from China, it is reasonable to assume that this genome dataset contributes substantially to the Chinese fecal metagenome. To evaluate the contribution of the CGR to the mapping of non-Chinese metagenomes, we carried out a similar analysis using metagenomic data from American, Spanish and Danish fecal samples. Notably, the metagenomic read mapping ratios of these samples all increased substantially (Fig. 2a), although to a lesser extent compared with that of Chinese samples (Supplementary Fig. 4a). The improvement of mapping rates in both Chinese and non-Chinese samples indicates that the CGR covers a considerable number of gut bacterial species shared by people between these countries. To reveal the improvement of gene and protein diversity enabled by the CGR, we compared the gene and protein cumulative curve based on genomes used in a previous IGC study and after addition of the CGR (Supplementary Fig. 4b,c). The number of gene and protein families increased with inclusion of the first 1,500 genomes, but more or less plateaued at around 3,000 genomes. The addition of our CGR genomes led to a substantial increase in the number of added gene and protein families as a function of genome number. A total of 373,555 gene clusters and 149,945 protein clusters were added by inclusion of the CGR, corresponding to a 22% and 16% increase in known gene and protein sequence diversity, respectively.

**Fig. 2: Contribution of CGR to metagenomic and SNP analyses.**

To further illustrate the utility of the CGR, we used it to analyze gut microbiome SNPs in a cohort of 250 samples from the TwinsUK registry¹⁷. We generated a new set of 282 nonredundant representative genomes from the CGR (see Methods, Supplementary Fig. 5 and Supplementary Table 7), which number nearly doubled the 152 reference genomes used in the original TwinsUK analysis¹⁷. To highlight the new reference genomes identified by analysis with the existing genomes and the CGR genomes, we performed an ANI-based alignment of the 282 genomes with the previously reported 152 genomes. Among the 192 newly added reference genomes, 85 were classified species while 107 were unclassified species (Fig. 2b). A high SNP density was found in Ruminococcus sp. CAG:108 (Clu 21), unclassified Firmicutes (Clu 157), Eubacterium rectale (Clu 6), Escherichia coli (Clu 22), and Ruminococcus sp. CAG:57 (Clu 19), suggesting a high degree of variations in the genomes of these species, while Lactobacillus gasseri (Clu 241), Enterococcus fecalis (Clu 316), Enterococcus durans (Clu 274) and Streptococcus mutans (Clu 217) showed lower SNP density. A total of 9.14 million SNPs were identified. The number of SNPs was increased for some species due to the newly added high-quality reference genomes in the CGR. We conclude that the CGR is a valuable resource for metagenomic studies because of the significant improvement in metagenomic resolution it enables.

Functions of gut microbiome bacteria

To better elucidate functions of the gut microbiota, we annotated gene functions in 1,520 CGR genomes using KEGG (the Kyoto Encyclopedia of Genes and Genomes)¹⁸. Functional pathways at KEGG level 2 showed that pathways involved in carbohydrate and amino acid metabolism are abundant in all isolated strains, suggesting that these are core functions of the gut microbiota (Supplementary Fig. 6). We also analyzed KEGG level 3 pathways and focused on those enriched at the phylum or genus level (Fig. 3a). We found that lipopolysaccharide biosynthesis (ko00540) genes were widely distributed in the phyla Fusobacteria, Bacteroidetes and Proteobacteria, the main phyla of gram-negative bacteria. Genes involved in glycan degradation (ko00531 and ko00511) were abundant in the genomes of the Bacteroidetes phylum. This observation is consistent with the notion that members of Bacteroidetes are prominent human gut symbionts that help degrade glycans in the diet and the gut mucosa¹⁹. The members of the Bacteroidetes also possess a high proportion of genes involved in sphingolipid metabolism (ko00600), glycosphingolipid biosynthesis (ko00601, ko00603 and ko00604) and steroid hormone biosynthesis (ko00140). Sphingolipids and hormone biosynthesis are ubiquitous in eukaryotic cells but not present in most bacteria. These results suggest that members of the Bacteroidetes not only participate in energy metabolism in the gut, but may also act in sphingolipid and hormone signaling in mammalian cells. The Proteobacteria showed relatively high abundance in genes involved in degradation of xenobiotics (ko01220), possibly contributing to the degradation of environmental chemicals and pharmaceuticals in the gut.

**Fig. 3: Functional landscape of gut microbiota.**

The signal transduction system (two-component system, ko02020) and xenobiotics degradation (KEGG level 2 pathway) were ubiquitous in the genera Paenibacillus, Bacillus, Klebsiella, Escherichia, Citrobacter and Enterobacter, which are also presented in environmental niches, such as soil and water. The abundant signal transduction and xenobiotics degradation systems allow these genera to sense and respond to various stresses and toxic substance presented in natural environments. Cell motility (chemotaxis, ko02030; flagellar assembly, ko02040) was conserved in the genera Roseburia, Paenibacillus, Bacillus, Escherichia, Citrobacter and Enterobacter, but varied within the genera Clostridium and Eubacterium.

Next we investigated functions and pathways that are annotated in the KEGG database, but not categorized as KEGG pathways (Fig. 3b and Supplementary Table 9). Virulence factors and antibiotic resistance genes were annotated using the Virulence Factors Database (VFDB)²⁰ and Comprehensive Antibiotic Resistance Database (CARD)²¹, respectively. Virulence factors and antibiotic resistance are clinically relevant and are abundant in the Proteobacteria phylum, suggesting that this phylum may be a reservoir for opportunistic pathogens. We examined the distribution of genes involved in responses to stresses frequently encountered by gut bacteria: oxygen tolerance and acid resistance. Oxygen tolerance was reflected by the number of genes encoding catalase and superoxide dismutase, two detoxification enzymes that scavenge reactive oxygen species produced during aerobic respiration. As expected, the facultative anaerobic bacteria in the genera Paenibacillus, Bacillus, Klebsiella, Escherichia, Citrobacter and Enterobacter were more oxygen tolerant. In addition to the previously reported Bacteroides fragilis²², other members of Bacteroidetes also showed moderate oxygen tolerance. Notably, bacteria in the Bacteroidetes phylum and the Bifidobacterium genus generally lacked acid resistance genes, suggesting that potential probiotics based on these organisms may suffer impaired survival in the acidic stomach environment after oral administration. Finally, we examined the distribution of six bacterial functions in the CGR that might have beneficial effects on human health. Amino acid and vitamin B synthesis genes were widely present in various gut bacteria, suggesting that gut microbiota might be an alternative source for nutrients that are sparse in vegetarian diets. Genes encoding bacterial bile salt hydrolases, which transform primary bile acids into secondary bile acids in the human intestine, were also ubiquitous in most gut bacteria. Genes encoding β-galactosidases, which might attenuate problems associated with lactose intolerance, were relatively abundant in the phylum Bacteroidetes. Genes involved in bacteriocin synthesis in gut bacteria were relatively rare and did not show phylum- or genus-specific distribution.

Core and pan-genomes of underrepresented gut bacteria

We carried out a pan-genome analysis of 36 species or clusters that contain more than ten genomes, as well as two other species enriched in healthy controls compared with patients with type 2 diabetes in previous studies^7,23,24, Fecalibacterium prausnitzii (cluster 63, seven genomes) and butyrate-producing bacterium SS3_4 (cluster 45, nine genomes). These clusters covered the phyla Firmicutes, Bacteroidetes, Actinobacteria and Proteobacteria (Supplementary Fig. 7a and Supplementary Table 8a). The pan-genome of a cluster can be defined as the sum of the core genes and dispensable genes (including unique genes and accessory genes) of all the members within that cluster²⁵. Our pan-genome analysis showed that Eubacterium rectale (cluster 37) contained the lowest proportion of core genes (12%); the remaining genes fell into accessory and unique genomes (38% and 40%, respectively). In contrast, Eubacterium 3_1 (cluster 6) contained the largest proportion of core genes (53%) (Supplementary Fig. 7b). The pan-genome fitting curves showed that most clusters in Bacteroidetes displayed an ‘open’ pan-genome and had a relatively large pan-genome size, with Bacteroides vulgatus having the largest pan-genome size at 14,970 genes (Supplementary Figs. 8 and 9 and Supplementary Table 8b). In contrast, members in the phylum Actinobacteria tend to represent a relatively ‘closed’ pan-genome, which was only slightly expanded by the addition of CGR genomes. These results suggest that gut bacterial genomes are variable in the Bacteroidetes phylum, less variable in the Firmicutes and Proteobacteria, and fairly conserved in the Actinobacteria.

We also explored the distribution of genes involved in butyrate synthesis and antibiotic resistance in the pan-genomes of gut bacteria. Functional annotation showed that six clusters contained the complete acetyl-CoA to butyrate biosynthesis pathway (Fig. 4a). Among them, F. prausnitzii, E. rectale, butyrate-producing bacterium SS3_4 and Roseburia sp. CAG:45 harbored the complete pathway in their core genome, suggesting that the butyrate-producing function was highly conserved in these species. This result is consistent with the reported butyrate-producing capacity of these species^26,27,28. To explore the distribution of antibiotic resistance within the established pan-genomes, we annotated 25 antibiotic resistance genes (ARGs) in each pan-genome. Consistent with a previous report²⁹, the tetracycline resistance gene was widely present in the dispensable genome of these clusters (Fig. 4b). Notably, Escherichia coli contained almost all ARGs (23 of 25) in its pan-genome, with half of these present in the core genome (Fig. 4b). In contrast, Bifidobacterium species, including B. bifibium, B. adolescentis, B. longum and B. pseudocatenulatum, rarely contained ARGs in their pan-genomes.

**Fig. 4: Pan-genome analysis of 38 representative clusters.**

To obtain a better understanding of the distribution of bacterial functions in the core and dispensable genomes, we annotated the genomes using the Clusters of Orthologous Groups (COG) database³⁰. This revealed that several housekeeping functions were significantly enriched in the core genome, including post-translational modification, protein turnover and chaperones (O, P = 7.28 × 10^–12); translation, ribosomal structure and biogenesis (J, P = 7.28 × 10^–12); energy production and conversion (C, P = 7.28 × 10^–12); amino acid transport and metabolism (E, P = 7.28 × 10^–12); nucleotide transport and metabolism (F, P = 7.28 × 10^–12); coenzyme transport and metabolism (H, P = 1.46 × 10^–11); lipid transport and metabolism (I, P = 2.40 × 10^–10); and inorganic ion transport and metabolism (P, P = 2.40 × 10^–10) (Supplementary Fig. 10). By contrast, COG categories enriched in the dispensable genome included cell wall-membrane-envelope biogenesis (M, P = 2.70 × 10^–9); cell motility (N, P = 3.11 × 10^–5); signal transduction mechanisms (T, P = 0.00039); intracellular trafficking secretion and vesicular transport (U, P = 1.22 × 10^–7); defense mechanisms (V, P = 7.28 × 10^–12); transcription (K, P = 3.64 × 10^–11); replication recombination and repair (L, P = 7.28 × 10^–12); and function unknown (S, P = 0.03111). The remaining COG categories showed no significant differences in core and dispensable genome.

Discussion

We used 11 culturing conditions for isolation of gut bacteria and archived more than 6,000 isolates. From this collection of isolates, we generated 1,520 high-quality draft reference genomes. The high coverage of the resulting CGR at the genus and species levels (including low-abundance species) demonstrates the value of culture-based methods for strain isolation from the gut microbiota. In line with this, a large number of gut bacterial species that were previously considered as unculturable have been successfully cultivated in two recent studies^31,32. Although there was some overlap between the novel species archived by CGR and in these two studies, the CGR contains 659 additional genomes (representing 209 clusters or species). Our cultivation methods can be applied to expand the CGR until it is saturated with the genomes of culturable gut bacteria. After that, single-cell sequencing can be used to investigate genomes of unculturable bacteria, with an overall aim of defining a saturated set of reference genomes of gut microbiota to underpin a better understanding of gut microbiome biology.

We applied out CGR genome dataset to assign functions to gut bacteria. For example, we found that virulence factors and antibiotic resistance genes are enriched in Klebsiella, Escherichia, Citrobacter and Enterobacter, which are opportunistic pathogens frequently isolated in clinical samples³³. The abundance of signal transduction and cell motility genes in these bacteria could further contribute to their pathogenicity^34,35. Notably, the Proteobacteria also possess abundant genes for degradation of xenobiotics, which might affect drug metabolism of patients in drug therapy. In line with this, a recent study reported that intratumor Proteobacteria can metabolize chemotherapeutic drugs into inactive forms and thus attenuate the efficacy of cancer therapies³⁶. The genes involved in beneficial functions such as glycan degradation and vitamin B synthesis are enriched in the Bacteroides genus, consistent with its mutualistic role in the human gut. Notably, we found that Bacteroides species contain a considerable number of genes involved in sphingolipid and steroid hormone synthesis, suggesting their potential for modulating signaling in mammalian cells. In support of this, a recent study reported that Bacteroides fragilis can take advantage of sphingolipid signaling to enable symbiosis in the intestine³⁷. It is noteworthy that genes involved in glycan degradation and sphingolipid metabolism were also enriched in the genus Bifidobacterium, another well-known gut commensal microbe. However, genes involved in both pathways were not abundant in the Prevotella genus, suggesting a distinct function of Prevotella compared with other members of the Bacteroidetes phylum. This might account for observed negative correlations between the relative abundances of Prevotella and Bacteroides in the gut microbiota³⁸. The potential role of gut bacteria in metabolism of estrogens has long been recognized³⁹, but detailed mechanistic studies are still lacking. It will be interesting to explore the implication of this unique function of gut bacteria in hormone-related health or disease. The CGR also enabled the identification of several potential bacteriocin-producing bacteria strains, which merit further verification.

The CGR will improve metagenomic analyses, genome variation analyses, functional characterization and pan-genome analyses. The isolated gut bacteria strains have been deposited in the China National GeneBank (CNGB) and may be useful for studies that aim to alter microbiota functions, as novel probiotics, or for verification of disease-associated bacterial markers.

Methods

Anaerobic cultivation of fecal bacteria

Fecal samples were collected from 155 healthy human donors not taking any drugs during the last month before sampling. Detailed information is given in Supplementary Table 2. The samples were immediately transferred to an anaerobic chamber (Bactron Anaerobic Chamber, Bactron IV-2, Shellab, USA), homogenized in pre-reduced phosphate buffered saline (PBS) supplemented with 0.1% cysteine, and then diluted and spread on agar plates with different growth media (Supplementary Table 1). Plates were incubated under anaerobic condition in an atmosphere of 90% N₂, 5% CO₂ and 5% H₂ at 37 °C for 2–3 d. Single colonies were picked and streaked onto new plates to obtain single clones. All the strains were stored in a glycerol suspension (20%, v/v) containing 0.1% cysteine at –80 °C. The collection of the 155 samples was approved by the Institutional Review Board on Bioethics and Biosafety of BGI under number BGI-IRB17005-T1. All protocols were in compliance with the Declaration of Helsinki and explicit informed consent was obtained from all participants. Bacteria in the CGR (Culturable Genome Reference) are deposited in and are available from the E-BioBank (EBB) of the China National GeneBank (http://ebiobank.cngb.org/index.php?g=Content&m=Hql&a=sample_5&id=393).

Whole-genome sequencing and de novo assembly

DNA extraction

Isolates cultivated to stationary phase were centrifuged at 7,227g at 4 °C for 10 min, and the resulting pellets were resuspended in 1 ml of Tris-EDTA. For bacterial cell lysis, 50 µl of 10% SDS and 10 µl of proteinase K (20 mg/ml) were added, and the solution was incubated at 55 °C in a water bath for 2 h. The released genomic DNA was extracted using the phenol-chloroform method⁴⁰.

Genome sequencing and assembly

Paired-end libraries with an insert size of 500 bp were constructed and sequenced on Illumina Hiseq 2000 platform to obtain about 100 × clean data for each sample. The reads were assembled using SOAPdenovo 2.04⁴¹ to form scaffolds from which the rRNA genes were extracted by RNAMMer 1.2⁴². An in-house pipeline was used to obtain the best assembly containing an orthogonal design to investigate L,M,d,D,L,u,G (arguments of SOAPdenovo) and a single-factor design to investigate K (argument of SOAPdenovo) by comprehensively considering contig average length, longest scaffold and rRNA score. Libraries with an insert size of 240 bp were constructed and sequenced on the ionProton platform, which produced about 100 × clean data for each sample. The reads were assembled through SPAdes (version 3.1.0)⁴³ to form scaffolds.

Assessment of genome quality

Six high-quality draft assembly criteria from the Human Microbiome Project (HMP)¹⁴ and rRNA (5 s, 16 s and 23 s) completeness were adopted to ensure the assembly quality. The criteria are (i) 90% of the genome assembly must be included in contigs > 500 bp, (ii) 90% of the assembled bases must be at > 5 × read coverage, (iii) the contig N50 must be > 5 kb, (iv) scaffold N50 must be > 20 kb, (v) average contig length must be > 5 kb, and (vi) > 90% of the core genes^44,45 must be present in the assembly.

Splitting for multi-genome isolates

The multi-genomes in isolates were initially identified using CheckM⁴⁶ (contamination > 5%) and confirmed by manual inspection of the plot of G + C percentage vs. sequencing depth. An in-house pipeline was developed to split the scaffolds of multi-genomes into single genomes. Briefly, scaffolds in multi-genomes were first split on the basis of G + C percentage vs. sequencing depth values using the dbscan function of R (package “fpc”). The “complete” and “contamination” of split genomes were checked using CheckM. For split genomes with “complete” > 100% or “contamination” > 15%, an additional species-designating pipeline was used to obtain their closest reference (with ANI value > 90%). Finally, the mis-split scaffolds in each split genome were mapping back to the closest reference genome using BLASTn (-e 1e-5 -F F -m 8, blastn hits’ length > 90 nt, query scaffold coverage ≥ 50%) to obtain the final split genomes.

Massive species and genus assignment process

NCBI-retrieved prokaryotic genomes

All complete genomes (update time 19 November 2014) and draft genomes (update time 8 August 2014) on the NCBI ftp site were downloaded to a local server. Items with more than one NCBI taxonomy identifier (taxid) or genome sequence not available or of non-prokaryotic source were removed, and of redundant items, only one was kept. As a result, 24,552 genomes, 19,116 genome-scale amino acid sequences, and their taxonomic information were obtained.

Average nucleotide identity (ANI)⁴⁷ for species level taxonomic assignment

The taxonomic assignment of each query genome was determined by the taxonomic information of all the NCBI-available prokaryotic genomes. The tetra-base signature profiles of all the genomes and each query genome were acquired. A Pearson correlation test was performed between each query genome and all the genomes, resulting in a reference list sorted by decreasing correlation coefficient for each query genome. Then pairwise ANI alignment was performed between query and reference genomes one by one according to the reference list (tetra-base profile’s Pearson correlation test: correlation coefficient > 0 and P < 0.001) until the ANI value was larger than 95% in the top 500 items (defined as assigned in this case) or reference item number exceeded 500 without any ANI value being larger than 95% (defined as not assigned in this case).

Percentage of conserved proteins (POCP)⁴⁸ for genus-level taxonomic assignment

The taxonomic assignment of each query genome was determined by the taxonomic information of all the NCBI-available prokaryotic genomes. The tetra-base signature profiles of all the reference genomes and the query genomes with no species assignment based on ANI were acquired. A Pearson correlation test was performed between each query genome and all the reference genomes, resulting in a reference list sorted by decreasing correlation coefficient for each query genome. Then the POCP calculation was performed between query and reference genomes one by one according to the reference list until the POCP value was larger than 50% in the top 500 items (defined as “assigned” in this case) or reference items number exceeded 500 without any POCP value being larger than 50% (defined as “not assigned” in this case).

16S rRNA sequence analysis and novel species determination

16 S rRNA gene sequences were extracted from the isolate genomes using RNAmmer⁴², except for 16 genomes where extraction failed. The sequences were quality-control processed in EzBioCloud (http://www.ezbiocloud.net)⁴⁹. The species-level operational taxonomic units (OTUs) were classified using mothur⁵⁰ with an identity of 98.7% as a species-level cut-off, and cut-offs of 94.5% and 86.5% were used for genera and families⁵¹, respectively.

Comparison of CGR with genome datasets from other studies

To compare the new genomes and novel species archived in CGR with those identified in two recent studies, we downloaded 215 genomes reported by Browne et al.³² and 169 genomes reported by Lagier et al.³¹. We adopted a similar ANI pipeline as described above for species-level comparison by replacing the NCBI references with these newly downloaded genomes. “Map” was defined if the pairwise ANI value between a query genome in our 1,520 high-quality genomes and any one of references genomes (tetra-base profile’s Pearson correlation test: correlation coefficient > 0 and P < 0.001) was larger than 95%; if not, the species was defined as “unmap.”

Construction of species clusters

Pairwise ANI alignment was performed among the 1,520 high-quality genomes, and then hclust from the R package was used for hierarchical clustering with distance of 0.05 (equivalent to 95% ANI). A set of 40 universally conserved single-copy genes encoding proteins in bacteria and archaea was used for construction of a phylogenetic tree. Marker genes were detected and aligned using specI⁵² and prank⁵³. Alignments were trimmed by trimal⁵⁴ and concatenated with in-house scripts. A phylogenetic tree was inferred using the maximum likelihood method with RAxML (version 8.2.8)⁵⁵ for the clusters’ representative genomes (N50 longest among cluster) with Rhizobium selenitireducens ATCC BAA 1503 (taxoid:1336235) as an outgroup, and was visualized in iTOL (http://itol.embl.de/)⁵⁶ online.

Genome function annotation

The 1,520 high-quality genomes were functionally annotated. Genes were identified using Genemark⁵⁷. The translated amino acid sequences of coding genes were aligned with RAPSearch (-s f -e 1e-2 -v 100 -u 2)⁵⁸ against the Kyoto Encyclopedia of Genes and Genomes (KEGG version 76)^18,59 (query match length higher than 50%) or with BLASTp (-e 1e-2 -F T -b 100 -K 1 -a 1 -m 8) against the Antibiotic Resistance Genes Database (ARDB) (both query and subject match length higher than 40%, with identity higher than the ARDB-recommended thresholds)⁶⁰, the Virulence Factor Database (VFDB)^20,61 (query match length higher than 50%, with identity higher than 60%), and the bacteriocin database (downloaded from BAGEL3⁶², with identity higher than 60%). Annotation of genes against the Comprehensive Antibiotic Resistance Database (CARD)²¹ was performed using Resistance Gene Identifier available as a downloadable command-line tool in the download section of the CARD website using default parameters.

Mapping ratio of metagenomic samples

The metagenomic reads⁶ were first aligned to the reference genomes of IGCR (3,449 sequenced prokaryotic genomes from IGC⁶) using SOAP2⁶³ (default parameters, except -m 100 -x 1000 -r 1 -l 30 -v 5 -c 0.95 -u). The unmapped reads were then aligned to the newly sequenced genomes of CGR. The read mapping ratio of different samples was calculated, and the difference between samples was determined by Wilcoxon test in R.

Analysis of gene and protein diversity

Gene clusters

5,749,641 genes in the 1,759 CGR genomes and 11,330,042 genes in 3,449 IGCR genomes were clustered using CD-HIT⁶⁴ with default parameters, except -G 0 -aS 0.9 -c 0.95 -M 0 -d 0 -r 1 -g 1, which amounts to 95% local sequence identity over 90% alignment coverage for the shorter sequence. A cluster is composed of two or more genes. An accumulative curve of gene clusters was drawn according to the sample name alphabetically with IGCR at the front part and CGR at the latter part.

Protein clusters

5,749,641 protein sequences translated from genes in the 1,759 CGR genomes and 11,330,042 protein sequences translated from genes in 3,449 IGCR genomes were clustered using the kClust algorithm⁶⁵ with default parameters, which amounts to 20–30% maximum pairwise sequence identity over 80% alignment length with the longest sequence or seed of the cluster. A cluster is composed of two or more protein sequences. An accumulative curve of protein clusters was drawn according to the sample name alphabetically with IGCR at the front part and CGR at the latter part.

SNP identification and similarity score

1,520 genomes from the CGR were aligned with the sequenced reads from the 250 TwinsUK samples using SOAP2 with identity ≥90%. Representative genomes used for SNP analysis were identified according to three criteria described previously¹⁷. The resulting 282 genomes (Supplementary Table 7) that fulfilled these criteria were used as references for SNP calling using SAMtools (frequency > 1% and supported by ≥4 reads) as previously described^10,17,20. The reference genomes used in a previous study¹⁸ (152 genomes) were compared with that from CGR of this study (282 genomes) to identify shared and new reference genomes using ANI ≥ 95% as a threshold (species level).

Pan genome analysis for 38 cluster

Clusters containing more than ten genomes (from CGR and NCBI), as well as Fecalibacterium prausnitzii (seven genomes) and butyrate-producing bacterium SS3_4 (nine genomes), were used for pan-genome analysis using the Bacterial Pan Genome Analysis tool (BPGA) pipeline⁶⁶. The set of genes shared by all the members of cluster was defined as core genes, while genes partially shared in members (accessory genes) and unique to single members (unique genes) in a cluster were defined as dispensable gene⁶⁷. The pan-genome fitting curves of 38 clusters were generated by the BPGA workflow and plotted in R (v.3.3.3). The functions of genes in the pan-genomes of 38 clusters were annotated by KEGG and ARDB, using arguments identical to those used for functional annotation of genomes. The acetyl-CoA-to-butyrate biosynthesis pathway was generated according to a previous study⁶⁸, and the associated enzymes were identified according to the functional annotation and BLAST to the NCBI protein database (cut-off 1e–5, identity ≥70%, coverage ≥70%). The COG database³⁰ was also used to identify the functional distribution in the core and dispensable sections via the BPGA pipeline. The significance of the difference between COG distribution in core and dispensable genomes was examined using Wilcoxon test as implemented in R (v.3.3.3).

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

The assembly draft genomes and annotation information for the 1,520 CGR strains are deposited in the NCBI under accession code PRJNA482748, and these data are also available in the China National GeneBank (CNGB) Nucleotide Sequence Archive (CNSA; accession code CNP0000126). All bacterial strains in the CGR have been deposited in the CNGB, a nonprofit, public-service-oriented organization in China. The accession code for each strain is given in Supplementary Table 5 (Genebank_id). Researchers can explore strain information and request strains via http://ebiobank.cngb.org/index.php?g=Content&m=Hql&a=sample_5&id=393#.

References

Wang, J. & Jia, H. Metagenome-wide association studies: fine-mining the microbiome. Nat. Rev. Microbiol. 14, 508–522 (2016).
Article CAS Google Scholar
Lynch, S. V. & Pedersen, O. The human intestinal microbiome in health and disease. N. Engl. J. Med. 375, 2369–2379 (2016).
Article CAS Google Scholar
Qin, J. et al. A human gut microbial gene catalog established by metagenomic sequencing. Nature 464, 59–65 (2010).
Article CAS Google Scholar
Sunagawa, S. et al. Metagenomic species profiling using universal phylogenetic marker genes. Nat. Methods 10, 1196–1199 (2013).
Article CAS Google Scholar
Methé, B. A. et al. A framework for human microbiome research. Nature 486, 215–221 (2012).
Article Google Scholar
Li, J. et al. An integrated catalog of reference genes in the human gut microbiome. Nat. Biotechnol. 32, 834–841 (2014).
Article CAS Google Scholar
Qin, J. et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 490, 55–60 (2012).
Article CAS Google Scholar
Karlsson, F. H. et al. Gut metagenome in European women with normal, impaired and diabetic glucose control. Nature 498, 99–103 (2013).
Article CAS Google Scholar
Le Chatelier, E. et al. Richness of human gut microbiome correlates with metabolic markers. Nature 500, 541–546 (2013).
Article Google Scholar
Nielsen, H. B. et al. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nat. Biotechnol. 32, 822–828 (2014).
Article CAS Google Scholar
Schloissnig, S. et al. Genomic variation landscape of the human gut microbiome. Nature 493, 45–50 (2013).
Article Google Scholar
Hu, Y. et al. Metagenome-wide analysis of antibiotic resistance genes in a large cohort of human gut microbiota. Nat. Commun. 4, 2151 (2013).
Article Google Scholar
Greenblum, S., Carr, R. & Borenstein, E. Extensive strain-level copy-number variation across human gut microbiome species. Cell 160, 583–594 (2015).
Article CAS Google Scholar
Nelson, K. E. et al. A catalog of reference genomes from the human microbiome. Science 328, 994–999 (2010).
Article CAS Google Scholar
Zhang, J. et al. A phylo-functional core of gut microbiota in healthy young Chinese cohorts across lifestyles, geography and ethnicities. ISME J. 9, 1979–1990 (2015).
Article Google Scholar
Rajilić-Stojanović, M. & de Vos, W. M. The first 1000 cultured species of the human gastrointestinal microbiota. FEMS Microbiol. Rev. 38, 996–1047 (2014).
Article Google Scholar
Xie, H. et al. Shotgun metagenomics of 250 adult twins reveals genetic and environmental impacts on the gut microbiome. Cell Systems 3, 572–584.e573 (2016).
Article CAS Google Scholar
Kanehisa, M. & Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28, 27–30 (2000).
Article CAS Google Scholar
Koropatkin, N. M., Cameron, E. A. & Martens, E. C. How glycan metabolism shapes the human gut microbiota. Nat. Rev. Microbiol. 10, 323–335 (2012).
Article CAS Google Scholar
Chen, L. et al. VFDB: a reference database for bacterial virulence factors. Nucleic Acids Res. 33, D325–D328 (2005).
Article CAS Google Scholar
Jia, B. et al. CARD 2017: expansion and model-centric curation of the comprehensive antibiotic resistance database. Nucleic Acids Res. 45(D1), D566–D573 (2017).
Article CAS Google Scholar
Sund, C. J. et al. The Bacteroides fragilis transcriptome response to oxygen and H2O2: the role of OxyR and its effect on survival and virulence. Mol. Microbiol. 67, 129–142 (2008).
Article CAS Google Scholar
Aw, W. & Fukuda, S. Understanding the role of the gut ecosystem in diabetes mellitus. J. Diabetes Investig. 9, 5–12 (2018).
Article Google Scholar
Aw, W. & Fukuda, S. Toward the comprehensive understanding of the gut ecosystem via metabolomics-based integrated omics approach. Semin. Immunopathol. 37, 5–16 (2015).
Article CAS Google Scholar
Medini, D., Donati, C., Tettelin, H., Masignani, V. & Rappuoli, R. The microbial pan-genome. Curr. Opin. Genet. Dev. 15, 589–594 (2005).
Article CAS Google Scholar
Louis, P. & Flint, H. J. Diversity, metabolism and microbial ecology of butyrate-producing bacteria from the human large intestine. FEMS Microbiol. Lett. 294, 1–8 (2009).
Article CAS Google Scholar
Van den Abbeele, P. et al. Butyrate-producing Clostridium cluster XIVa species specifically colonize mucins in an in vitro gut model. ISME J. 7, 949–961 (2013).
Article Google Scholar
Louis, P., Young, P., Holtrop, G. & Flint, H. J. Diversity of human colonic butyrate-producing bacteria revealed by analysis of the butyryl-CoA:acetate CoA-transferase gene. Environ. Microbiol. 12, 304–314 (2010).
Article CAS Google Scholar
de Vries, L. E. et al. The gut as reservoir of antibiotic resistance: microbial diversity of tetracycline resistance in mother and infant. PLoS One 6, e21644 (2011).
Article Google Scholar
Tatusov, R. L. et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4, 41 (2003).
Article Google Scholar
Lagier, J.-C. et al. Culture of previously uncultured members of the human gut microbiota by culturomics. Nat. Microbiol. 1, 16203 (2016).
Article CAS Google Scholar
Browne, H. P. et al. Culturing of ‘unculturable’ human microbiota reveals novel taxa and extensive sporulation. Nature 533, 543–546 (2016).
Article CAS Google Scholar
Guentzel, M.N. Escherichia, Klebsiella, Enterobacter, Serratia, Citrobacter, and Proteus. Medical Microbiology Ch. 25 (University of Texas Medical Branch, Galveston, Texas, USA, 1996).
Josenhans, C. & Suerbaum, S. The role of motility as a virulence factor in bacteria. Int. J. Med. Microbiol. 291, 605–614 (2002).
Article CAS Google Scholar
Gotoh, Y. et al. Two-component signal transduction as potential drug targets in pathogenic bacteria. Curr. Opin. Microbiol. 13, 232–239 (2010).
Article CAS Google Scholar
Geller, L. T. et al. Potential role of intratumor bacteria in mediating tumor resistance to the chemotherapeutic drug gemcitabine. Science 357, 1156–1160 (2017).
Article CAS Google Scholar
An, D., Na, C., Bielawski, J., Hannun, Y. A. & Kasper, D. L. Membrane sphingolipids as essential molecular signals for Bacteroides survival in the intestine. Proc. Natl Acad. Sci. USA 108(Suppl. 1), 4666–4671 (2011).
Article CAS Google Scholar
Ley, R. E. Gut microbiota in 2015: Prevotella in the gut: choose carefully. Nat. Rev. Gastroenterol. Hepatol. 13, 69–70 (2016).
Article CAS Google Scholar
Rowland, I., Wiseman, H., Sanders, T., Adlercreutz, H. & Bowey, E. Metabolism of oestrogens and phytoestrogens: role of the gut microflora. Biochem. Soc. Trans. 27, 304–308 (1999).
Article CAS Google Scholar
Giraffa, G., Rossetti, L. & Neviani, E. An evaluation of chelex-based DNA purification protocols for the typing of lactic acid bacteria. J. Microbiol. Methods 42, 175–184 (2000).
Article CAS Google Scholar
Luo, R. et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience 1, 18 (2012).
Article Google Scholar
Lagesen, K. et al. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 35, 3100–3108 (2007).
Article CAS Google Scholar
Bankevich, A. et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19, 455–477 (2012).
Article CAS Google Scholar
Callister, S. J. et al. Comparative bacterial proteomics: analysis of the core genome concept. PLoS One 3, e1542 (2008).
Article Google Scholar
Li, L., Stoeckert, C. J. Jr & Roos, D. S. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 13, 2178–2189 (2003).
Article CAS Google Scholar
Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015).
Article CAS Google Scholar
Richter, M. & Rosselló-Móra, R. Shifting the genomic gold standard for the prokaryotic species definition. Proc. Natl Acad. Sci. USA 106, 19126–19131 (2009).
Article CAS Google Scholar
Qin, Q. L. et al. A proposed genus boundary for the prokaryotes based on genomic insights. J. Bacteriol. 196, 2210–2215 (2014).
Article Google Scholar
Kim, O. S. et al. Introducing EzTaxon-e: a prokaryotic 16S rRNA gene sequence database with phylotypes that represent uncultured species. Int. J. Syst. Evol. Microbiol. 62, 716–721 (2012).
Article CAS Google Scholar
Schloss, P. D. et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl. Environ. Microbiol. 75, 7537–7541 (2009).
Article CAS Google Scholar
Yarza, P. et al. Uniting the classification of cultured and uncultured bacteria and archaea using 16S rRNA gene sequences. Nat. Rev. Microbiol. 12, 635–645 (2014).
Article CAS Google Scholar
Mende, D. R., Sunagawa, S., Zeller, G. & Bork, P. Accurate and universal delineation of prokaryotic species. Nat. Methods 10, 881–884 (2013).
Article CAS Google Scholar
Löytynoja, A. Phylogeny-aware alignment with PRANK. Methods Mol. Biol. 1079, 155–170 (2014).
Article Google Scholar
Capella-Gutiérrez, S., Silla-Martínez, J. M. & Gabaldón, T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25, 1972–1973 (2009).
Article Google Scholar
Stamatakis, A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30, 1312–1313 (2014).
Article CAS Google Scholar
Letunic, I. & Bork, P. Interactive tree of life (iTOL)v3: an online tool for the display and annotation of phylogenetic and other trees. Nucleic Acids Res. 44(W1), W242–W245 (2016).
Article CAS Google Scholar
Besemer, J. & Borodovsky, M. GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses. Nucleic Acids Res. 33, W451–W454 (2005).
Article CAS Google Scholar
Ye, Y., Choi, J. H. & Tang, H. RAPSearch: a fast protein similarity search tool for short reads. BMC Bioinformatics 12, 159 (2011).
Article CAS Google Scholar
Du, J. et al. KEGG-PATH: Kyoto encyclopedia of genes and genomes-based pathway analysis using a path analysis model. Mol. Biosyst. 10, 2441–2447 (2014).
Article CAS Google Scholar
Liu, B. & Pop, M. ARDB—Antibiotic Resistance Genes Database. Nucleic Acids Res. 37, D443–D447 (2009).
Article CAS Google Scholar
Chen, L., Zheng, D., Liu, B., Yang, J. & Jin, Q. VFDB 2016: hierarchical and refined dataset for big data analysis—10 years on. Nucleic Acids Res. 44(D1), D694–D697 (2016).
Article CAS Google Scholar
van Heel, A. J., de Jong, A., Montalbán-López, M., Kok, J. & Kuipers, O. P. BAGEL3: Automated identification of genes encoding bacteriocins and (non-)bactericidal posttranslationally modified peptides. Nucleic Acids Res. 41, W448–W453 (2013).
Article Google Scholar
Li, R. et al. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25, 1966–1967 (2009).
Article CAS Google Scholar
Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).
Article CAS Google Scholar
Hauser, M., Mayer, C. E. & Söding, J. kClust: fast and sensitive clustering of large protein sequence databases. BMC Bioinformatics 14, 248 (2013).
Article Google Scholar
Chaudhari, N. M., Gupta, V. K. & Dutta, C. BPGA—an ultra-fast pan-genome analysis pipeline. Sci. Rep. 6, 24373 (2016).
Article CAS Google Scholar
Bosi, E. et al. Comparative genome-scale modelling of Staphylococcus aureus strains identifies strain-specific metabolic capabilities linked to pathogenicity. Proc. Natl Acad. Sci. USA 113, E3801–E3809 (2016).
Article CAS Google Scholar
Vital, M., Howe, A. C. & Tiedje, J. M. Revealing the bacterial butyrate synthesis pathways by analyzing (meta)genomic data. MBio 5, e00889 (2014).
Article Google Scholar

Download references

Acknowledgements

We gratefully acknowledge colleagues at BGI-Shenzhen for DNA extraction, library construction, sequencing and discussions. This research was supported by the National Natural Science Foundation of China (grants 81670606 and 81673850), the Shenzhen Municipal Government of China (JCYJ20160229172757249 and JCYJ20170818111103886).

Author information

These authors contributed equally: Yuanqiang Zou, Wenbin Xue, Guangwen Luo, Ziqing Deng, Panpan Qin.

Authors and Affiliations

BGI-Shenzhen, Shenzhen, China
Yuanqiang Zou, Wenbin Xue, Guangwen Luo, Ziqing Deng, Panpan Qin, Ruijin Guo, Haipeng Sun, Yan Xia, Suisha Liang, Ying Dai, Daiwei Wan, Rongrong Jiang, Lili Su, Qiang Feng, Zhuye Jie, Tongkun Guo, Zhongkui Xia, Chuan Liu, Jinghong Yu, Yuxiang Lin, Shanmei Tang, Xun Xu, Yong Hou, Xin Liu, Jian Wang, Huanming Yang, Karsten Kristiansen, Junhua Li, Huijue Jia & Liang Xiao
China National Genebank, BGI-Shenzhen, Shenzhen, China
Yuanqiang Zou, Wenbin Xue, Guangwen Luo, Ziqing Deng, Panpan Qin, Ruijin Guo, Haipeng Sun, Yan Xia, Suisha Liang, Ying Dai, Daiwei Wan, Rongrong Jiang, Lili Su, Qiang Feng, Zhuye Jie, Tongkun Guo, Zhongkui Xia, Chuan Liu, Jinghong Yu, Yuxiang Lin, Shanmei Tang, Xun Xu, Yong Hou, Xin Liu, Karsten Kristiansen, Junhua Li, Huijue Jia & Liang Xiao
Laboratory of Genomics and Molecular Biomedicine, Department of Biology, University of Copenhagen, Copenhagen, Denmark
Yuanqiang Zou & Karsten Kristiansen
Key Laboratory of Dairy Science, College of Food Sciences, Northeast Agricultural University, Harbin, Heilongjiang, China
Guangwen Luo & Guicheng Huo
BGI Education Center, University of Chinese Academy of Sciences, Shenzhen, China
Panpan Qin & Yan Xia
Shenzhen Engineering Laboratory of Detection and Intervention of Human Intestinal Microbiome, Shenzhen, China
Suisha Liang, Chuan Liu & Liang Xiao
BGI-Qingdao, BGI-Shenzhen, Qingdao, China
Xin Liu
James D. Watson Institute of Genome Sciences, Hangzhou, China
Jian Wang & Huanming Yang
Qingdao-Europe Advanced Institute for Life Sciences, Qingdao, China
Karsten Kristiansen & Liang Xiao
School of Bioscience and Biotechnology, South China University of Technology, Guangzhou, China
Junhua Li
Macau University of Science and Technology, Taipa, Macau, China
Huijue Jia
Department of Digestive Diseases, Huashan Hospital of Fudan University, Shanghai, China
Liang Xiao

Authors

Yuanqiang Zou
View author publications
You can also search for this author in PubMed Google Scholar
Wenbin Xue
View author publications
You can also search for this author in PubMed Google Scholar
Guangwen Luo
View author publications
You can also search for this author in PubMed Google Scholar
Ziqing Deng
View author publications
You can also search for this author in PubMed Google Scholar
Panpan Qin
View author publications
You can also search for this author in PubMed Google Scholar
Ruijin Guo
View author publications
You can also search for this author in PubMed Google Scholar
Haipeng Sun
View author publications
You can also search for this author in PubMed Google Scholar
Yan Xia
View author publications
You can also search for this author in PubMed Google Scholar
Suisha Liang
View author publications
You can also search for this author in PubMed Google Scholar
Ying Dai
View author publications
You can also search for this author in PubMed Google Scholar
Daiwei Wan
View author publications
You can also search for this author in PubMed Google Scholar
Rongrong Jiang
View author publications
You can also search for this author in PubMed Google Scholar
Lili Su
View author publications
You can also search for this author in PubMed Google Scholar
Qiang Feng
View author publications
You can also search for this author in PubMed Google Scholar
Zhuye Jie
View author publications
You can also search for this author in PubMed Google Scholar
Tongkun Guo
View author publications
You can also search for this author in PubMed Google Scholar
Zhongkui Xia
View author publications
You can also search for this author in PubMed Google Scholar
Chuan Liu
View author publications
You can also search for this author in PubMed Google Scholar
Jinghong Yu
View author publications
You can also search for this author in PubMed Google Scholar
Yuxiang Lin
View author publications
You can also search for this author in PubMed Google Scholar
Shanmei Tang
View author publications
You can also search for this author in PubMed Google Scholar
Guicheng Huo
View author publications
You can also search for this author in PubMed Google Scholar
Xun Xu
View author publications
You can also search for this author in PubMed Google Scholar
Yong Hou
View author publications
You can also search for this author in PubMed Google Scholar
Xin Liu
View author publications
You can also search for this author in PubMed Google Scholar
Jian Wang
View author publications
You can also search for this author in PubMed Google Scholar
Huanming Yang
View author publications
You can also search for this author in PubMed Google Scholar
Karsten Kristiansen
View author publications
You can also search for this author in PubMed Google Scholar
Junhua Li
View author publications
You can also search for this author in PubMed Google Scholar
Huijue Jia
View author publications
You can also search for this author in PubMed Google Scholar
Liang Xiao
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

H.J., J.L., L.X., Y.Z. and W.X. conceived and designed the project. H.J, J.L. and L.X monitored the project. Y.Z., W.X., Y.D., D.W. and R.J. collected samples and performed experiments. Y.Z., W.X., G.L., P.Q., Z.D., R.G., H.S., Y.X., S.L., Q.F., Z.J., L.S., T.G., X.X., Y.H., X.L., J.W., H.Y., Y.L., S.T., G.H., C.L., Z.X., J.Y., K.K., L.X., J.L. and H.J. analyzed and interpreted the data. Y.Z. and Z.D. wrote the paper. K.K., H.J. and L.X. revised the paper. All authors commented on the manuscript.

Corresponding authors

Correspondence to Junhua Li, Huijue Jia or Liang Xiao.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Cultivation and genome sequencing of the gut microbiota.

(a) The 155 feces samples from healthy volunteers grouped by ages and sex. (b) The number of isolates achieved by 11 different culture media under anaerobic condition. (c) The workflow of the cultivation and sequencing of isolated gut bacteria.

Supplementary Figure 2 Workflow for species annotation of sequenced genomes.

Species assignment was carried out using an average-nucleotide identity (ANI)-based pipeline. Genomes not assigned by ANI were subjected to genus annotation by POCP.

Supplementary Figure 3 Diversity and novelty of gut bacterial genomes archived in CGR.

(a) The number of bacterial species archived in CGR belonging to 9 core genera of the human gut microbiota in Chinese. The archived bacterial species were compared to the previously reported 1,000 cultured bacterial species in the human gastrointestinal tract, with the known species shown in white and novel species shown in black. (b) Low abundance (<1%) gut bacterial genera identified in CGR. Grey box indicates the relative abundance of each genus from 1267 samples, according to the previous IGC study. Red dot indicates the number of species in each genus archived in this study. Each boxplot illustrates the estimated median (centre line), upper and lower quartiles (box limits), 1.5 × interquartile range (whiskers).

Supplementary Figure 4 The improvement in metagenomic analysis by CGR.

(a) The improvement of reads mapping ratio in metagenomic analysis by CGR (relevant to Fig. 2a). The percentage of improvement is calculated by the following formula: (CGR-ICG)/(100-ICG). The percentage of improvement for Chinese (n=368) is significantly higher than American (n=139, P=8×10^-20), Spanish (n=320, P=9×10^-33), and Danish (n=109, P=2×10^-31) individuals. The significance of improvement was determined by unpaired Wilcoxon rank-sum test (two.sided). ICG represents the reads mapping ratio calculated from 3,449 reference genomes (ICGR in Fig. 2a), CGR represents the reads mapping ratio calculated from the addition of 1,520 reference genomes (ICGR+CGR in Fig. 2a). Each boxplot illustrates the estimated median (centre line), upper and lower quartiles (box limits), 1.5 × interquartile range (whiskers), and outlier (points) of the reads mapping ratio. (b)(c) Gene and protein sequence diversity increased by CGR. Increase in number of new gene families (b) and protein families (c) across added genomes from ICGR (blue) and CGR (red).

Supplementary Figure 5 SNP density in the 282 reference genomes with a cumulative coverage of at least 10× in the 250 samples from the TwinsUK registry.

The reference genomes are ordered according to the cumulative coverage, with new reference genomes generated by this study highlighted in red.

Supplementary Figure 6 Functional annotation of 1,520 genomes in CGR.

The gene functions in the genomes are annotated using KEGG pathways, with level 2 functions shown in the figure. The stack bar on the out-most layer represents the number of genes with given functions in each genome. The phylogenetic tree is plotted according to Fig. 1.

Supplementary Figure 7 Statistics for the pan-genome analysis of the 38 clusters.

(a) Genomes for each cluster used in the pan-genome analysis. (b) Composition of core genes, unique genes, and accessory genes in the genomes of the 38 clusters. The clusters were ordered by the proportion of core genes.

Supplementary Figure 8 Pan-genome fitting curves of the 38 clusters.

The pan-genome fitting curves of 38 representative clusters, from Firmicutes (orange), Bacteroidetes (blue), Actinobacteria (violet), Proteobacteria (green), and Fusobacteria (grey). The pan-genome size is accumulated from all combinations of strains contained in each cluster.

Supplementary Figure 9 Pan- and core-genome analysis of the 38 clusters.

The number of gene families in the pan (cyan) and core (pink) genomes are plotted as a function of the number of genomes of the 38 clusters. Box plots indicate 25 th and 75 th percentiles with medians shown as horizontal lines and whiskers set at 10 th and 90 th percentiles.

Supplementary Figure 10 COG distribution in the core genome and the dispensable genome.

The percentage of 20 COGs in the core genome (pink) was compared to that in the pan-genomes (cyan) of 38 clusters. The significance of improvement was determined by two-side Wilcoxon rank-sum test (*,P< 0.05; **, P < 0.01; ***, P < 0.001). The exact P value is 0.931 for D, 2.70×10^-9 for M, 3.11×10^-5 for N, 7.28×10^-12 for O, 3.88×10^-4 for T, 1.22×10^-7 for U, 7.28×10^-12 for V, 7.28×10^-12 for J, 3.64×10^-11 for K, 7.28×10^-12 for L, 7.28×10^-12 for C, 0.261 for G, 7.28×10^-12 for E, 7.28×10^-12 for F, 1.46×10^-11 for H, 2.40×10^-10 for I, 0.874 for Q, 2.40×10^-10 for P, 0.365 for R, and 0.031 for S. Each boxplot illustrates the estimated median (centre line), upper and lower quartiles (box limits), 1.5 × interquartile range (whiskers), and outlier (points) of the COG percentage.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–10

Reporting Summary

Supplementary Table 1

Culture media used for isolation of gut bacteria

Supplementary Table 2

Information for the 6,487 isolates and 1,759 sequenced strains

Supplementary Table 3

List of separated genomes

Supplementary Table 4

Statistics for sequencing data of 1,867 strains

Supplementary Table 5

Taxonomic information for 1,520 genomes

Supplementary Table 6

Improvement of metagenomic reads mapping ratio by CGR

Supplementary Table 7

Improvement of SNP analysis by CGR

Supplementary Table 8

Pan-genome analysis of the 38 clusters

Supplementary Table 9

The function annotation of 1,520 genomes

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Zou, Y., Xue, W., Luo, G. et al. 1,520 reference genomes from cultivated human gut bacteria enable functional microbiome analyses. Nat Biotechnol 37, 179–185 (2019). https://doi.org/10.1038/s41587-018-0008-8

Download citation

Received: 26 October 2017
Accepted: 13 December 2018
Published: 04 February 2019
Issue Date: February 2019
DOI: https://doi.org/10.1038/s41587-018-0008-8

This article is cited by

Deciphering the gut microbiome of grass carp through multi-omics approach
- Ming Li
- Hui Liang
- Zhigang Zhou
Microbiome (2024)
Gut microbiota dynamics and fecal SCFAs after colonoscopy: accelerating microbiome stabilization by Clostridium butyricum
- Zhenhui Chen
- Lu Yu
- Hongying Fan
Journal of Translational Medicine (2024)
Comparison of the DNBSEQ platform and Illumina HiSeq 2000 for bacterial genome assembly
- Tongyuan Hu
- Jianwei Chen
- Yuanqiang Zou
Scientific Reports (2024)
Discovery and structural mechanism of DNA endonucleases guided by RAGATH-18-derived RNAs
- Kuan Ren
- Fengxia Zhou
- Zhiwei Huang
Cell Research (2024)
Integrative metagenomic and metabolomic analyses reveal the potential of gut microbiota to exacerbate acute pancreatitis
- Jianjun Liu
- Qiulong Yan
- Dong Shang
npj Biofilms and Microbiomes (2024)

Subjects

Abstract

Similar content being viewed by others

Main

Results

Expanded catalog of gut bacterial genomes

Improvement in metagenomic and SNP analyses

Functions of gut microbiome bacteria

Core and pan-genomes of underrepresented gut bacteria

Discussion

Methods

Anaerobic cultivation of fecal bacteria

Whole-genome sequencing and de novo assembly

DNA extraction

Genome sequencing and assembly

Assessment of genome quality

Splitting for multi-genome isolates

Massive species and genus assignment process

NCBI-retrieved prokaryotic genomes

Average nucleotide identity (ANI)47 for species level taxonomic assignment

Percentage of conserved proteins (POCP)48 for genus-level taxonomic assignment

16S rRNA sequence analysis and novel species determination

Comparison of CGR with genome datasets from other studies

Construction of species clusters

Genome function annotation

Mapping ratio of metagenomic samples

Analysis of gene and protein diversity

Gene clusters

Protein clusters

SNP identification and similarity score

Pan genome analysis for 38 cluster

Reporting Summary

Data availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Competing interests

Additional information

Integrated supplementary information

Supplementary information

Rights and permissions

About this article

Cite this article

Share this article

This article is cited by

Search

Quick links

Average nucleotide identity (ANI)⁴⁷ for species level taxonomic assignment

Percentage of conserved proteins (POCP)⁴⁸ for genus-level taxonomic assignment