The gastrointestinal microbiome is a complex ecosystem with functions that shape human health. Studying the relationship between taxonomic alterations and functional repercussions linked to disease remains challenging. Here, we present an integrative approach to resolve the taxonomic and functional attributes of gastrointestinal microbiota at the metagenomic, metatranscriptomic and metaproteomic levels. We apply our methods to samples from four families with multiple cases of type 1 diabetes mellitus (T1DM). Analysis of intra- and inter-individual variation demonstrates that family membership has a pronounced effect on the structural and functional composition of the gastrointestinal microbiome. In the context of T1DM, consistent taxonomic differences were absent across families, but certain human exocrine pancreatic proteins were found at lower levels. The associated microbial functional signatures were linked to metabolic traits in distinct taxa. The methodologies and results provide a foundation for future large-scale integrated multi-omic analyses of the gastrointestinal microbiome in the context of host–microbe interactions in human health and disease.
Metagenomic studies of faecal samples have provided deep insights into the structure and functional potential of microbial communities within the human gastrointestinal tract1,2, revealing their inter-individual variability3 and apparent intra-individual stability4,5. Case–control studies have associated microbial traits in individuals with patterns of diseases, thereby implicating gastrointestinal microbial communities in a range of human disorders6,
These deep insights notwithstanding, metagenomics offers little information on which microbial traits are actually contributed to human physiology. More specifically, previous studies analysing functional ‘omes’, such as the metatranscriptomes or the metaproteomes, in gastrointestinal samples9,11,
Families play an important role in the formation of the microbiota of individuals3,21,22. More specifically, greater similarities have been observed between the microbiota of family members and cohabiting partners than between unrelated individuals23. Furthermore, the microbiota of siblings have been found to be closer to each other than to their parents'22. In the present study, healthy family members were used as a stringent control group in type 1 diabetes mellitus (T1DM), a disease supposedly influenced by an interplay of genetics and environment.
The worldwide increase in the incidence of T1DM (ref. 24), especially in individuals with a low genetic risk25, suggests that environmental factors may trigger the autoimmune destruction of insulin-producing beta cells in the endocrine pancreas that causes T1DM. Gastrointestinal agents have been implicated in this process due to the immune-modulatory role of the gastrointestinal tract26,27 and the potential involvement of the pancreatic ducts in the inflammation that culminates in beta-cell destruction28. Case–control studies of the microbial community structure during29,
In this Article, we expand our recently developed methodologies for integrated multi-omic analyses of microbial consortia36,37 to systematically explore the variability of the different omes and their relationships in the context of a case study of familial T1DM. Our study allowed us to link the expression of disease-associated microbial functions to distinct taxa, which demonstrates the necessity for integrated multi-omic analyses of microbiota in the context of human disease research.
Study cohort and microbial community structures
The study cohort consisted of 20 individuals from four families of at least two generations presenting at least two cases of T1DM (Fig. 1a). T1DM was in all cases diagnosed for more than five years and successfully controlled by insulin replacement therapy, and the auto-antibody status of the healthy individuals did not indicate immediate risk of developing T1DM (for detailed descriptions see Supplementary Section 1 and Supplementary Table 1).
To obtain overviews of the microbial community structures in 53 well-preserved faecal samples collected over a period of two to four months, we calculated relative abundances of metagenomic operational taxonomic units (mOTUs)38 from 3.9 ± 0.1 Gbp (mean ± standard deviation (s.d.)) of metagenomic data per sample (Supplementary Section 2, Supplementary Fig. 1 and Supplementary Table 2). In most cases, mOTU-based taxonomic profiles were stable over time within individuals. Given that family members share environment, nutrition and genetic background, families formed more distinct groups than unrelated individuals, and samples did not cluster according to T1DM status (Supplementary Fig. 2). Similar to earlier analyses38, 43 ± 14% of the detected microorganisms in each sample belonged to mOTUs without a sequenced isolate genome. The relative abundance of single mOTUs not assigned to any phylum reached up to 9% (Supplementary Section 2). These results indicate that analytical approaches purely based on isolate reference genomes would not be able to capture the taxonomic and functional diversity in these microbial communities.
Given the considerable proportion of taxa not represented by sequenced isolate genomes in the mOTU analyses, we devised a reference genome-independent workflow for the integration of metagenomic, metatranscriptomic and metaproteomic data (Fig. 1b). A total of 8.0 ± 1.0 Gbp (mean ± s.d.) of metatranscriptomic sequencing data and 185,000 ± 13,000 peptide mass spectra were obtained in addition to metagenomic data from biomolecular fractions of 36 of the faecal samples collected from 18 of the individuals (Supplementary Table 3). We approached the multi-omic data integration using a de novo genomic assembly strategy, which allowed us to identify differences in encoded and expressed microbial functions in the context of intra-individual variation, family membership and T1DM. Our workflow, which involves the co-assembly of metagenomic and metatranscriptomic reads, resulted in longer total assembled genomic contigs (287 ± 43 Mbp), longer maximal contig lengths (310 ± 170 kbp; N50 (the largest contig length such that 50% of all assembled nucleotides are contained in contigs of at least this size): 660 ± 130 bp) and enhanced read usage (88 ± 4% and 88 ± 2% of the metagenomic and metatranscriptomic reads mapping back to the assembly, respectively) when compared to assemblies based on single omes (Supplementary Section 3, Supplementary Fig. 3 and Supplementary Table 3). Annotations with functional categories (KEGG orthologous groups, MetaCyc and Swiss-Prot enzymes, Pfam or TIGR-Pfam domains) were obtained for 489,000 ± 75,000 predicted open reading frames (ORFs; representing 82 and 66% of the metagenomic and metatranscriptomic reads, respectively; Supplementary Section 3, Supplementary Figs 4 and 5 and Supplementary Table 3).
For protein identification from the metaproteomic data, we assembled sample-specific search databases using the microbial ORFs of the co-assemblies, including strain-resolved variants. Given that a significant proportion of human proteins are typically present in faecal metaproteomes and that these may reflect host–microbiome interactions11, we also sequenced the whole genomes of study participants (Supplementary Table 3) and added allele-specific individualized human protein sequences to the proteomic search databases. This allowed us to identify 2,400 ± 1,600 microbial and 200 ± 70 human protein groups per sample (920 ± 570 microbial and 90 ± 30 human proteins uniquely identified; Supplementary Section 4 and Supplementary Table 3).
Genomic context and taxonomy of expressed genes
Bacteria accounted for the majority of metagenomic, metatranscriptomic and metaproteomic information (Supplementary Fig. 6). To link the bacterial and archaeal genes to their genomic context and thereby allow for the integration of functional data from multiple omic levels (Fig. 1b), assembled contigs were binned using an algorithm that makes use of multiple features, including genomic signatures, metagenomic coverage information and the presence of single-copy essential genes (Supplementary Section 5 and Supplementary Table 4). Contig bins (Fig. 2a–c), which represent (partial) population-level genomes, were analysed in terms of their functional annotations as well as protein-coding phylogenetic markers. Taxonomic profiles from the binning were consistent with the assembly-independent mOTU-based analysis results (Fig. 2d, Supplementary Figs 2 and 7 and Supplementary Section 6). Furthermore, we were able to use the information from the binned genomic reconstructions to trace specific functions of interest to different taxa encoding or expressing them (Fig. 2e, Supplementary Fig. 8 and Supplementary Section 7).
In addition, non-human eukaryotic genes, transcripts and proteins were detected that originated from food-borne organisms (plants and fungi) and their viruses, as well as active gastrointestinal eukaryotes (fungi and protists; non-human eukaryotic genes accounted for 0.07 ± 0.05% of the metagenomic reads, 0.8 ± 1% of the metatranscriptomic reads and 2.6 ± 3.1% of the identified proteins, Supplementary Section 8). Viral biomolecules represented less than 0.1% in terms of predicted ORFs as well as mappable reads and identified proteins (Supplementary Section 9).
Metagenomic data and functional omic data
To assess how metagenomic taxonomic profiles might be reflected at the metatranscriptomic level, mOTU abundance profiles were calculated using the metatranscriptomic reads as input, in addition to the previous metagenome-based analyses. A similar set of mOTUs was detected on the metagenomic and metatranscriptomic levels (Fig. 3a), and their abundance profiles correlated strongly (Spearman's ρ = 0.81 ± 0.15; P < 2.2 × 10−16; Fig. 3b and Supplementary Section 10). When comparing the relative mOTU abundances at the metagenomic and metatranscriptomic levels, apparent differences in the inferred activities of different mOTUs in the collected faecal samples were related to their prevalence and activity at distinct sites within the gastrointestinal tract9, where the most transcriptionally active organisms are known residents of the lower gastrointestinal tract, and organisms with low activity are associated with the oral cavity and upper respiratory tract (Fig. 3b and Supplementary Section 10).
Overall, although a strong correlation was observed between metagenomic and metatranscriptomic reads mapping to the phylogenetic marker genes (which are expected to be constitutively expressed), a wide range of expression levels was observed for the other ORFs encoded by the reconstructed genomes (Supplementary Fig. 9). This explains the overall weak correlation between metagenomic and metatranscriptomic depths of coverage at the level of all ORFs (Spearman's ρ = 0.06 ± 0.09, Fig. 3c). Within the metaproteome, only a limited proportion of the predicted ORFs were detectable (Fig. 3c,d and Supplementary Table 3). However, ORFs encoding identified proteins typically exhibited higher metatranscriptomic coverage (Kruskal–Wallis test P < 2.2 × 10−16, Fig. 3e). Stronger variations in the metatranscriptomic levels of non-housekeeping functions were observed, most probably due to these being differentially regulated or expressed by different taxa (Supplementary Fig. 9). This may also explain why, despite a similar overlap of detected functional categories (Fig. 3f), community-wide metagenomic and metatranscriptomic functional profiles were less correlated (Spearman's ρ = 0.41 ± 0.08; P < 2.2 × 10−16; Fig. 3g) than the mOTU abundance profiles (Supplementary Section 11). Metagenomic and metaproteomic functional profiles did not concur (gene-wise ρ = 0.14 ± 0.08; functional category-wise ρ = 0.10 ± 0.05; P < 0.07, Fig. 3h), but protein abundances tended to correlate with transcript abundances (gene-wise ρ = 0.33 ± 0.12, Supplementary Fig. 9; functional category-wise ρ = 0.40 ± 0.16, P < 2.2 × 10−16, Fig. 3i), indicating a stronger dependence of protein expression on corresponding transcript levels.
Individuality of gut microbiota on all omic levels
The microbial consortia of the gut are known to be among the most stable human-associated microbial communities2, and sample donors remain identifiable from these based on metagenomic data when compared to data sets from earlier samples39. We investigated whether the patterns of individuality are also discernible on the other omic levels. At the metagenomic level, the observed community structures were stable over time in most individuals (Fig. 4a). More specifically, the intra-individual distances between metagenomic mOTU abundance profiles were significantly smaller than the corresponding inter-individual distances, and donors thus remained recognizable based on mOTU abundance profiles (Supplementary Section 12 and Supplementary Fig. 10). Furthermore, significantly greater intra-individual than inter-individual similarities were observed in the metatranscriptome-based taxonomic profiles, in the functional profiles on all omic levels including the metagenome, and even in the human protein profiles (Wilcoxon rank sum tests between Jensen–Shannon or Soerensen distances: P < 0.05, Fig. 4a). Intra-individual variability was greater in the metatranscriptome and metaproteome, thereby hampering recognition of samples from the same donor (Supplementary Section 12). Taken together, these results indicate that the stability of the community structure within the same individual2 is reflected at the functional levels, which nonetheless display a higher level of plasticity. Consequently, if the gastrointestinal microbiome exhibits changes in individuals with T1DM, signatures may be reflected across the different omes.
Family resemblances are apparent at all omic levels
To place the observed individual microbiome signatures into family contexts, we compared intra- and inter-family distances of community-level taxonomic and functional profiles. Smaller intra- than inter-family distances were observed between community structures (Fig. 4a). To assess whether the observed family-specific traits were also conserved on the other omic levels, multiple co-inertia analysis was performed for taxonomic profiles based on the metagenomic and metatranscriptomic data, for functional profiles on all omic levels, as well as for the human proteome. A remarkable level of similarity among all six levels was observed, with particularly tight clusters observable in several families (in particular in family M04; Fig. 4b). Accordingly, intra-family distances in terms of taxonomic and functional profiles were also smaller than inter-family distances (Fig. 4a, Supplementary Section 13 and Supplementary Fig. 11).
Functional individuality of discrete genomes
In addition to community stability, the persistence of individual-specific strains has been documented4,5. To explore whether strain specificity also translates to the functional level, we analysed the encoded functional potential and corresponding expression profiles of closely related reconstructed genomes recovered from different samples. Our results revealed that the functional potentials of related genomes reconstructed from samples of the same donor were significantly more similar than samples from different donors (Wilcoxon signed rank test between Soerensen dissimilarity indices of all tested taxa: P < 0.05, Fig. 4c, Supplementary Section 14 and Supplementary Fig. 10). Furthermore, the expression profiles of these genomes displayed stronger intra-individual than inter-individual correlations (Wilcoxon signed rank test between Spearman's correlation coefficients of all tested taxa: P < 0.05, Fig. 4d, Supplementary Section 14 and Supplementary Fig. 10). Individuality is therefore a characteristic of gastrointestinal whole microbial communities on all functional levels, and is also apparent in the functional potential and expression profiles of specific populations common to different individuals.
Family-independent effects of T1DM
As discussed above, on the level of the observed community structures as well as the metatranscriptomic and metaproteomic functional profiles, large differences between families were identified. Additionally, nutritional records of the days prior to faecal sampling revealed that the diets of family members were very similar. In line with medical recommendations, the individuals with T1DM followed the same diets as their healthy family members, and their diet was no more or less variable or rich (Supplementary Sections 1 and 15, Supplementary Table 1, Supplementary Figs 12 and 13). Family membership thus appears to stratify the cohort more strongly than all other factors known to influence gastrointestinal microbial community composition, such as age or body mass index (Supplementary Table 1 and Supplementary Fig. 14). Consequently, we analysed T1DM-specific differences in the microbial taxonomic and functional profiles while accounting for family membership.
No differences in taxonomic or functional diversity and richness were observed between individuals with T1DM and their healthy relatives at any omic level (Supplementary Fig. 15). However, we found one highly active mOTU, classified to the order Clostridiales, to be more abundant in the metatranscriptomes of the individuals with T1DM (main effect with false-discovery-rate (FDR)-adjusted P < 0.05 and absolute fold change >2; Supplementary Section 16 and Supplementary Fig. 15).
Differential transcript abundances of diverse functions (Supplementary Fig. 16) suggested differences in the availability or microbial use of cellulose and subtle differences in the immune status (genes for reactive oxygen species (ROS) detoxification, cell surface proteins and complement-inactivating staphopain A) of individuals with T1DM (Supplementary Section 17). The latter group of functions was also reflected in the proteins, which exhibited the greatest differences in terms of their abundance (not significant after FDR adjustment; Supplementary Fig. 16 and Supplementary Section 17). T1DM-specific functions could mostly be traced to genomes of taxa whose abundances were not affected by T1DM (Supplementary Figs 18 and 19), which indicates that expression of potentially disease-related community functions may be independent of the abundances of distinct taxa inferred from the metagenomic data alone (Supplementary Section 17).
Human protein excretion in T1DM and microbial functions
Little is known about changes in the function of the gastrointestinal tract with respect to T1DM traits, such as the proposed heightened innate inflammatory state27. Among the ten human proteins with the greatest changes in abundance in the faeces of individuals with T1DM (Supplementary Table 5), we identified two with higher abundance in the individuals with T1DM, namely lactotransferrin (LTF) and the lactotransferrin receptor intelectin (ITLN1), which maintain an important functional position at the crossroads of iron uptake, innate immunity and insulin function40,
Furthermore, we found four proteins that are preferentially expressed in the exocrine pancreas45, α-amylase proteins AMY2A and AMY2B (Fig. 5a,b), carboxypeptidase CPA1 and the less characterized CUZD1 (Supplementary Fig 18), to be less abundant in the stool of individuals with T1DM, potentially reflecting a weakening of exocrine function. To what extent T1DM affects the exocrine pancreas is still a matter of debate46, and our results suggest that this may be the case at least for certain digestive enzymes (Supplementary Section 15). We analysed whether potential repercussions of the apparent T1DM-related decrease in digestive enzymes (which would probably affect starch availability in the lower intestine47) were reflected in the collected samples. Because we were unable to identify taxa whose abundances correlated with the measured abundances of the pancreatic enzymes, we focused on the microbial genes with known metabolic functions (Supplementary Section 18). We uncovered several significant correlations between microbial transcript levels and the relative abundances of the amylases AMY2A and AMY2B (24 with P < 0.05 after FDR adjustment; |Spearman's ρ| > 0.75; Supplementary Table 5), including functions related to microbial central carbon metabolism and related cofactors (Fig. 5c), in particular thiamine biosynthesis (Supplementary Section 18 and Supplementary Fig. 20). These functions were transcribed by distinct taxa in the different samples (for example, thiazole synthase, Fig. 2e). Low thiamine blood levels have been linked to complications of T1DM48 and colonic uptake transporters of thiamine have been described49. However, the observed patterns in transcript abundances of thiazole synthase and plasma thiamine levels were not correlated in the small subset of individuals who did not take thiamine supplements, precluding any conclusions (Supplementary Section 18 and Supplementary Fig. 20). The observed patterns thus suggest the need for a detailed follow-up in a dedicated future study involving a larger cohort of families who explicitly do not take thiamine supplements. Furthermore, in light of the T1DM-related findings (Fig. 6), the levels of pancreatic and immune-related proteins should be measured alongside functional microbial profiles in longitudinal pre-diabetes studies to establish the timing of their decrease and elucidate whether they play a role in the changes in the microbiota associated with diabetes development.
By dissecting the gastrointestinal microbial community structures resolved using the individual omic data sets, we have shown that community structures are reflected across all omic levels. Nonetheless, the apparent higher complexity and variability exhibited by the functional omic levels necessitates an integrated analysis of the individual omic data sets18,50 to facilitate a reliable link between specific microbial genes and their genomic context. Linking genes of interest to genomic context has been attempted in other studies of T1DM by correlating abundances of functional traits and organisms32. However, as demonstrated, transcript levels cannot be expected to simply depend on the abundances inferred from metagenomic data. We therefore developed the presented framework for genome reconstructions, which served as the basis for the subsequent integration of the multi-omic data.
All omic levels exhibited individuality and family specificity, which has previously been described only for metagenomes3,
Written informed consent was obtained from all subjects enrolled in the study. This study was approved by the Comité d'Ethique de Recherche (CNER; reference no. 201110/05) and the National Commission for Data Protection in Luxembourg.
Anthropometric, nutritional, clinical data and blood and stool sampling
The study design was an observational study of four purposely selected family groups containing at least two members with T1DM and healthy individuals in two generations or more, from existing patient cohorts from the Centre Hospitalier du Luxembourg. Recruited families were seen three times at intervals of between four and eight weeks to have data and samples collected (Supplementary Table 1). On enrolment, study participant pedigrees were drawn, medical history was collected and a ‘Food Frequency Questionnaire’ was completed (estimating the foods eaten over the last 12 months). During every visit, anthropometric data were recorded. All participants were weighed and measured using the SECA combined scale and measuring device. Body mass index was calculated as kg m−2. Blood pressure was taken in the sitting position with a SureSigns VM6.
Daily recall questionnaires for food intake in the 24 h preceding the visit were filled in on visits 2 and 3. Nutritional content of the reported food was estimated based on the food frequency or daily recall questionnaires using FETA software51.
Faecal samples were self-collected at three time points (if bowel movement permitted sampling) and immediately placed on dry ice and subsequently transported and stored without thawing. Blood was collected using standard venepuncture and processed using validated processes52. Samples and associated data used in this study were processed and stored at IBBL (Integrated BioBank of Luxembourg) following ISO17025:2005 standards and the International Society for Biological and Environmental Repositories (ISBER) best practices.
Islet cell antibodies were measured using the test from INOVA Diagnostics. Insulin antibodies, tyrosine phosphatase-related islet antigen 2 antibodies, glutamate decarboxylase 2 antibodies and zinc-transporter 8 antibodies were measured using enzyme-linked immunosorbent assay (ELISA) kits from RSR. Measurements of glucose, glutamate-oxalacetate-transaminase, glutamate-pyruvate-transaminase, γ-glutamyltransferase, cholesterol, low-density cholesterol, high-density cholesterol and triglycerides were performed on a Cobas c501 analyser (Roche) with appropriate reagents (Roche). Quantification of glycated haemoglobin was performed on an Integra 400 analyser (Roche) with a kit from the same company. Insulin and the C-peptide were measured on a Cobas e601 (Roche) with matching kits (Roche). Leptin and pro-insulin were quantified using ELISA kits (Beckman Coulter). Thiamine levels in plasma were measured in technical duplicates using a vitamin B1 ELISA kit (Cloud-Clone) according to the manufacturer's protocol.
Extraction of biomolecular fractions from faecal samples
Biomolecular fractions were extracted from unthawed, frozen faecal subsamples (150 mg) after pretreatment of the weighed subsamples with 1.5 ml RNAlater ICE (LifeTechnologies) overnight. The faeces-RNAlater ICE mixture was homogenized by bead-beating, as previously described53. Differential centrifugation and extraction using the All-In-One kit (Norgen Biotek) to recover DNA and proteins were carried out as previously described53. DNA fractions were supplemented with DNA extracted from 200 mg subsamples using the MOBIO Power Soil Kit.
RNA was extracted from faecal subsamples (250 mg), which were pretreated with RNAlater ICE (LifeTechnologies) as described above. Homogenization and differential centrifugation were carried out as described previously53 and the pellet was lysed with MOBIO glass bead tubes. RNA was extracted using the MOBIO PowerMicrobiome RNA Isolation Kit, according to the manufacturer's recommendations. The integrity of isolated RNA fractions was assessed using an Agilent Bioanalyzer2000. Only fractions with an RNA integrity number > 7 were subjected to sequencing.
Metagenomic DNA and RNA sequencing
DNA was sequenced with 101 bp (per paired end) on a HiSeq2000 system (Illumina) by BGI. Libraries with an insert size of 350 bp were constructed from metagenomic DNA by fragmentation by sonication (Covaris), end-repair, adenylation, adapter ligation and amplification of adapter-ligated DNA fragments using appropriate enzymes (Enzymatics). Library amplification and cluster generation were performed using a TruSeq PE Cluster Kit V3–cBot–HS (Illumina) and flow cells were sequenced to 101 bp per paired end.
Strand-specific cDNA libraries were constructed from rRNA-depleted RNA fractions and sequenced to 150 bp (per paired end) on an Illumina HiSeq2500 by Oxford Gene Technologies. In detail, rRNA was depleted from the RNA fractions using Epibio's Universal Bacteria Ribozero Kit. Stranded cDNA libraries were constructed using the NEBNext Ultra Directional RNA Library Prep Kit. Libraries were sequenced to 150 bp (paired-end) using TruSeq Rapid SBS Kits v1 (Illumina).
Filtering and trimming of metagenomic and metatranscriptomic reads
The in silico analysis results presented in this Article were obtained using the high-performance computing facilities of the University of Luxembourg54. Metagenomic and metatranscriptomic paired-end reads in fastq format for each individual and each time point were processed separately using the MOCAT pipeline55. First, reads were trimmed using the MOCAT trimming and quality filtering step (MOCAT.pl –rtf) with the following parameter settings: readtrimfilter_length_cutoff=40, readdtrimfilter_qual_cutoff=20, readtrimfilter_use_sanger_scale=yes, readtrimfilter_trim_5_prime=yes and readtrimfilter_use_precalc_5prime_trimming=no. This step generated paired- and single-end reads, which were then mapped to the human genome (hg19) using the MOCAT screening step (MOCAT.pl –s hg19), which used SOAPaligner v2.21 (ref. 56) and then filtered mapping reads out. Metatranscriptomic reads were then also screened against the human RNA transcripts as defined for hg19 by Ensembl (release 66) to additionally remove reads overlapping splice sites in human transcripts. MOCAT screening was run with the following parameter setting: screen_length_cutoff=30, screen_percent_cutoff=90, screen_soap_seed_length=30, screen_soap_max_mm=10, screen_soap_cmd=-M 4 and screen_save_format=sam. This step resulted in two sets of reads in fastq format (human and non-human), each consisting of paired- and single-end reads. To assess the rRNA content of the sequencing libraries, metatranscriptomic reads were independently mapped against the prokaryotic and eukaryotic parts of the SILVA database57 using Bowtie2 (ref. 58). The non-human files were used for taxonomic analyses as well as assembly (Supplementary Figs 21 and 22).
Read-based taxonomic analyses, mOTU analyses and use of an integrated gene catalogue
mOTU analysis for taxonomic profiling was separately performed for trimmed and filtered metagenomic and metatranscriptomic reads for each sample using the MOCAT pipeline38,55 and the mOTU.v1.padded reference database (Supplementary Fig. 21), which comprises genes for the ribosome-binding ATPase (YchF; GTP1/OBG family), phenylalanyl-tRNA-synthetase alpha subunit, arginyl-tRNA-, seryl-tRNA-, cysteinyl-tRNA-, leucyl-tRNA- and valyl-tRNA-synthetase, tRNA A37 threonylcarbamoyltransferase (TsaD), and two signal recognition particle GTPases (COG0012, COG0016, COG0018, COG0172, COG0215, COG0495, COG0525, COG0533, COG0541 and COG0552, respectively)38. Metagenomic and metatranscriptomic codes to recognize donors were built from mOTU relative abundances and evaluated using Idability39 with the --meta_mode setting.
Taxonomic profiling of reads and contigs was also performed using Kraken59 with a database comprising 9,464 prokaryotic and 2,326 viral genomes. Metagenomic reads were mapped against the integrated gene catalogue (IGC)60 and genus-level abundance profiles were calculated as described in the publication of the integrated gene catalogue60. These profiles were concatenated with the profiles of 1,267 samples distributed with the IGC. A Jensen–Shannon divergence matrix was calculated using the phyloseq61 implementation over all profiles, and partitioning around medoids62 was performed to obtain enterotype annotations for each sample.
Co-assembly of metagenomic and metatranscriptomic reads
Trimmed and filtered metagenomic and metatranscriptomic reads were merged and made non-redundant using cd-hit-dup from the CD-HIT software suite63. The non-redundant set of non-human paired-end reads was then assembled using a combined metagenomic and metatranscriptomic assembly pipeline (Supplementary Fig. 22).
Reads were first converted from fastq to fasta format using the fq2fa script (fq2fa -merge -filter) provided by IDBA-UD (ref. 64). IDBA-UD was then run using its pre-error-correction step for correcting reads (idba_ud --pre_correction). In the next step, the contigs assembled by IDBA-UD were extended using the unused paired- and single-end reads and the Velvet assembly tool65 (version 1.2.07). For this, the paired-end reads were mapped onto the previously assembled contigs using SOAPaligner v2.21 (ref. 56; with parameter -r 2 -M 4 -l 30 -v 10 -p 8 -u unmapped.fa). The unused single-end reads were combined with the unmapped reads and cd-hit-dup from the CD-HIT software suite63 was used to remove duplicate reads. Velvet was then run over a range of k-mer sizes (27, 31, 35, 39, 43, 47, 51, 55, 59, 63) with the IDBA-UD contigs as long read input (velveth parameters: -long contig.fa, velvetg parameter: -conserveLong yes). Afterwards, all newly assembled contig sequences from the different k-mer runs and the original IDBA-UD contigs were concatenated and clustered using cd-hit from the CD-HIT software suite (parameter: -c 0.99) to remove redundancy. Newbler from MeGAMerge 1.1 (ref. 66) was then used to join and extend the clustered contigs. Minimal contig length was set to 125 bps.
Gene predictions, functional and phylogenetic annotation
Prodigal v2.60 (ref. 67) (parameter: -p meta) was used for gene prediction on the final set of contigs. Essential single-copy genes68 among the protein predictions were detected as described69. Marker genes for phylogenetic analysis were predicted using Amphora2 (ref. 70) and fetchMG (ref. 38). rRNA genes were predicted using Barrnap (http://www.vicbioinformatics.com/software.barrnap.shtml). Functional gene annotations were generated using HMMER 3.1 (refs 71,72) against multiple pre-compiled and in-house annotation databases. Existing databases for PFAM (ref. 73) and TIGRFAM (ref. 74) were downloaded from their web pages. Additionally, hidden Markov models (HMMs) were trained for all KEGG orthologous groups (KOs) after aligning the protein sequences using T-Coffee (ref. 75). Enzyme-specific HMMs were generated for MetaCYC (ref. 76) and UNIPROT (ref. 77) using only enzyme annotations with at least three digit Enzyme Commission (EC) numbers, as described previously78. The best hit in each HMM set was associated to each gene, if the HMM score was higher than the binary logarithm of the number of target genes (19.6 ± 0.2), in accordance with the recommendations in the HMMer manual.
Determination of metagenomic and metatranscriptomic depths of coverage of contigs and ORFs
Metagenomic and metatranscriptomic paired-end and singleton reads of each sample were mapped against contigs using Bowtie2 (ref. 58) and default settings. Average depths of coverage were calculated using SAMtools (ref. 79) and a custom perl script. Quality-trimmed metagenomic paired-end and singleton reads were mapped against predicted protein- or RNA-coding regions using Bowtie2 with default settings and in addition to depth of coverage, the numbers of mapping reads were counted. Stranded metatranscriptomic paired-end and singleton reads were mapped against predicted protein- or RNA-coding regions using Bowtie2 with --fr --nofw settings to obtain coverage depth and read counts for reads mapping in the sense direction.
To obtain whole-community metagenomic or metatranscriptomic functional profiles, sums of all read counts for functionally annotated genes were calculated for every functional category. If two functions were assigned to the same gene with the same likelihood, the number of reads for this gene was divided equally between both functions.
Binning of contigs from metagenomic and metatranscriptomic co-assembly
Two-dimensional maps for every contig ≥1 kbp were generated for each sample as described previously80. In short, for every contig ≥1 kbp, pentanucleotide frequencies were computed, which represent the genomic fragments as points in a 512-dimensional space (frequencies of individual pentamers and their reverse complements were summed). This matrix was subsequently transformed by log ratio and centred. Following this, the high-dimensional data were embedded into a two-dimensional space using Barnes–Hut stochastic neighbour embedding and the resulting two-dimensional maps were used for clustering. As rRNA-coding regions have been shown to challenge clustering based on genomic signature80, 16S/18S and 23S/28S sequences were removed from the contigs prior to the calculation of pentamer frequencies as long as a minimum length of 1 kbp was retained. Two-dimensional maps of contigs were visualized in R. Clusters were binned using an automatic workflow based on DBSCAN (ref. 81) (Supplementary Fig. 23). Briefly, two-dimensional coordinates were clustered using the dbscan function in the R package fpc. In the first pass, the parameters were ten minimum neighbourhood points and the Eps value was estimated based on the recommendations of the authors of DBSCAN. Number and multiplicity of essential genes68,69 were used to judge completeness of clusters. Clusters with multiple copies of the same essential genes were divided further by analysing the metagenomic coverage depth of the essential genes using the Hartigans' diptest as implemented in the R package diptest. If unimodality was rejected, a normal mixture model (implemented in the R package mixtools, ref. 82) was used to find a cutoff to bin the contigs by metagenomic depth of coverage. These two steps were repeated three times on all overcomplete bins, and the number of minimum neighbourhood points was raised by 2 in each step.
Binned population-level genomic complements were linked to mOTUs used in the taxonomic analysis by calling the marker genes using fetchMG (ref. 38) on the predicted ORFs, aligning the predicted marker genes against the database used for mOTU analysis and assigning to each called marker gene the taxonomy of the closest hit.
Human DNA sequencing and analysis
Genomic DNA was isolated from buffy coats of human blood samples using validated processes according to the requirements of Complete Genomics (CG). Whole-genome sequencing (WGS) was performed by CG using their proprietary paired-end nanoarray-based sequencing-by-ligation technology83. All sequencing data quality control, mapping and variant calling were carried out by CG as part of their sequencing service using the Standard Sequencing Service pipeline version 2.4. Sequencing reads were mapped against the NCBI build 37. For gene annotations, NCBI build 37.2 (RefSeq) was used. For details on coverage and called variants see Supplementary Table 3. All subsequent analyses were performed using the family WGS analysis pipeline as previously described84. In short, as input for the WGS analysis pipeline, we first combined all the variants from all genomes of one family into the union of variants using the CGAtools (CG Analysis Tools) listvariant command and CG var-files as input. We used CGAtools version 1.6 as provided by CG (available from http://cgatools.sourceforge.net). Next, we used the CGAtools testvariant command to test each genome for the presence of each variant. Only variants that were called in at least one genome of a family as high-quality calls (VQHIGH) by CG were used for further analysis.
Database generation for metaproteomic analyses
Sample-specific search databases were constructed from individual-specific pseudo-phased human proteins including homo- and heterozygous variants, as well as microbial proteins predicted based on the combined metagenomic and metatranscriptomic assemblies after inclusion of all called variants (Supplementary Fig. 24). In detail, for each human genome, small variants (single nucleotide polymorphisms and small insertions, deletions and substitutions shorter than 200 bp) were annotated using ANNOVAR (ref. 85) (version 2015Mar12) using the NCBI RefSeq release 60 and the Ensembl release 74 genome annotations. An individualized personal reference genome for each sample was generated by incorporating short homozygous variants into the hg19 reference genome. For each RefSeq protein-coding isoform, the new coding (CDS) mRNA sequence was extracted from the personalized reference genome and then translated. To also include heterozygous exonic amino acid changing variants, only exonic variants that were annotated as non-synonymous, stop gain, stop loss, (non-)frameshift insertions, deletions or substitutions were kept. Because no phasing information for the heterozygous variants was available, we generated pseudo-phased proteins by adding one copy per protein including all homo- and heterozygous variants. For each individual genome, a FASTA protein sequence database was then created by combining two FASTA databases: (1) REFSEQ protein sequences from the personalized human genome reference hg19 and (2) protein sequences containing heterozygous variants (proteins were marked if they contained homo- or heterozygous amino acid changes, respectively, or no change at all). Proteins with newly introduced stop codons were cut after the stop codon.
For the metaproteomic databases, variants were called per faecal sample on all predicted ORFs from the alignment results of mapping metagenomic and metatranscriptomic reads against the contigs using Platypus variant caller86 and vcftools (ref. 87). Variants with low mapping quality (MQ flag) were discarded. Multiple variants called on the same ORF were combined to yield the minimal number of variant proteins without contradictory overlap of variants using a custom perl script. The variant proteins as well as the consensus protein from the assembly were kept. Incomplete predicted proteins (or variant proteins) were removed, if no full tryptic peptide was expected from the prediction.
Protein liquid chromatography (LC) and mass spectrometry
An aliquot of 15 µg isolated protein fraction was reduced (dithiothreitol) and alkylated (iodoacetamide) and the proteins precipitated using the Clean-up Kit (GE Healthcare) according to the manufacturer's instructions. The protein pellets were resuspended at a concentration of 500 µg ml–1 in 50 mM ammonium bicarbonate to perform a complete trypsin digestion. Digested protein (3.5 µg) was purified on a Ziptip C18 (Millipore), dried and resuspended in 100 mM ammonium formiate (pH 10) at 333 ng µl–1. Digested protein (3 µg) was injected on a Nano 2D UPLC – Orbitrap MS system (2D-nanoAcquity UPLC (Waters) and Q-Exactive (Thermo)). MPDSMIX (Waters) was spiked in at 150 fmol of ADH (yeast alcohol dehydrogenase) digest per injection. Measurements were performed on technical triplicates.
A 2D LC method with three steps of 180 min was run. The three steps were run on a column at high pH with increasing percentages of acetonitrile, and the eluted peptides were diluted and loaded on a low pH column, where each step consisted of a gradient of 5 min from 99% of A (A = 0.1% formic acid in water; B = acetonitrile) to 93% of A followed by a gradient of 135 min from 93% of A to 65% of A.
Mass spectrometry was performed using a TopN-MSMS method (N = 12), with parameters as follows: mass range from 400 to 1,750 m/z, resolution 70,000, AGC target 106 or maximum injection time 200 ms. MS2 parameters for spectrum acquisition were an isolation window of 1.6 m/z, normalized collision energy (NCE) of 25, resolution of 17,500, AGC target of 105 or maximum injection time of 50 ms.
The database searches were performed using Proteome Discoverer (v 1.4, Thermo Scientific) and Sequest HT on the protein predictions. Proteomics measurements and database searches were carried out by the proteomics facility of the Center of Analytical Research and Technology – Groupe Interdisciplinaire de Génoprotéomique Appliquée (CART-GIGA, Liège, Belgium). Peptide spectrum matches with at least a high confidence and a delta Cn better than 0.05 were considered, with fragment mass tolerance of 0.02 Da and precursor mass tolerance of 5 ppm. Peptides of 6–144 amino acids with a maximum of two missed cleavage sites were considered. Peptides were grouped by mass and sequences and validated based on q-values with FDR < 0.01. A minimal number of one unique peptide was required for protein identification. The strict maximum parsimony principle was applied and the grouping of proteins that could not be differentiated into protein groups based on their peptide masses was enabled. Areas under the ion-chromatography curves for proteins were quantified based on the top three unique peptides. Relative protein abundances of human proteins were calculated by addition of the areas under the curve of all protein groups containing one or more variant of one or more isoform of a single gene product. Similarly, the sum was calculated of the areas under the curve of all protein groups containing only genes and/or variants of genes annotated with the same function.
Integrated data handling and comparative, ecological and network analyses
All functional and taxonomic annotations, read coverage data, positions and frequencies of variants and relative protein abundances for the contigs and their associated ORFs were stored in a MongoDB (v. 3.0.3) collection (Supplementary Fig. 25). Retrieval of data for further analyses in R (ref. 88) was accomplished using the package rmongodb.
Bray–Curtis dissimilarity indices were calculated, and permutational MANOVA (multivariate analysis of variance; aka ‘adonis’89) was performed using the R package vegan on sum-normalized relative abundances. Unseen90 mOTUs were estimated using ‘estimateR’ from vegan. Total Soerensen dissimilarity indices were calculated using the R package betapart (ref. 91) on sum-normalized relative abundances. Jensen–Shannon divergences were calculated using the implementation in the R package phyloseq (ref. 61), also on sum-normalized relative abundances. Principal coordinate analyses were carried out using the R package ape 3.2 (ref. 92). Multiple co-inertia analysis was performed using the R package omicade4 (ref. 93) using scaled data (in accordance with the manual's recommendations).
Median numbers per individual of metagenomic or metatranscriptomic reads mapping to the protein-coding mOTU (ref. 38) marker genes and of metagenomic and metatranscriptomic reads mapping to functionally annotated genes were used for differential analyses by applying the R package DESeq2 (ref. 94) (see Supplementary Table 1 for numbers of samples per individual). To find differences between the individuals with T1DM and healthy individuals, the main effect of T1DM in a model accounting for family membership and T1DM was analysed. For differences between relative protein abundances, Wilcoxon rank sum tests or Kruskal–Wallis tests were applied.
Network analysis was performed on a gene-centric generalized network of KOs, built as described in ref. 37 by using shared products/educts of reactions related to KOs detected at the metatranscriptomic level, to link these KOs using a custom R script and functions from R packages graph, igraph (ref. 95) and BioNet (ref. 96). Top-scoring modules were extracted using BioNet and Heinz (ref. 97), using P values of differential or correlation analyses (without multiple testing adjustment). Metatranscriptomic abundances, fold changes or correlations within KEGG pathways were visualized using the R package pathview (ref. 98).
Data and code availability
Metagenomic and metatranscriptomic sequencing reads can be accessed from NCBI BioProject PRJNA289586. Assembled contigs and gene predictions can be accessed at MG-RAST (ref. 99) (submission IDs are shown in Supplementary Table 7). The mass spectrometry proteomics data have been deposited at the ProteomeXchange Consortium (http://proteomecentral.proteomexchange.org) via the PRIDE partner repository100 with the data set identifier PXD003791. The 200 binned population-level genomes with >93% completeness can be accessed via the guest account at RAST (ref. 101) (submission IDs are shown in Supplementary Table 4). Custom scripts are available at https://git-r3lab.uni.lu/anna.buschart/MuStMultiomics.
The authors thank the staff of the Clinical and Epidemiological Investigation Center (CIEC) Luxembourg for undertaking the sample and data collection from participants in this study. The authors thank B. Phillips for input and feedback during the conception of the study and S. Collignon, K. Greenhalgh, P. do Rosario Martins Conde and A. Kaysen for technical assistance with biomolecular extraction and quality control. The authors thank D. Baiwir and G. Mazucchelli (CART-GIGA) for measurements and assistance. The in silico analysis results presented in this Article were obtained using the high-performance computing facilities of the University of Luxembourg, whose administrators are acknowledged. The authors thank M. Brunkow at the Institute for Systems Biology (ISB), Seattle, who provided project management, and acknowledge support from the Family Genomics Group of Leroy Hood (ISB) for the human whole-genome sequencing data. The present work was supported by an ATTRACT programme grant (ATTRACT/A09/03) and CORE programme grant (CORE/15/BM/10404093) to P.W. and Aide à la Formation Recherche grants to C.C.L. (AFR PHD/4964712) and L.W. (AFR PHD-2013-5824125), as well as by BIOMARKAPD—a project under the aegis of an EU Joint Programme—on Neurodegenerative Diseases (JPND), all funded by the Luxembourg National Research Fund (FNR). Additional support was provided by ‘le plan Technologies de la Santé par le Gouvernment du Grand Duché de Luxembourg’ through the Luxembourg Centre for Systems Biomedicine, University of Luxembourg to P.M. Sample collection, processing and storage were supported by IBBL under the Personalised Medicine Consortium Diabetes programme.
Anthropometric, demographic, medical and nutritional information on the study cohort
Taxonomic and functional profiles based on the metagenomic, metatranscriptomic and metaproteomic data
Statistics on metagenomic, metatranscriptomic and metaproteomic data, as well as human whole-genome sequencing data
Population-level genomes recovered from combined metagenomic and metatranscriptomic data by binning
Human stool proteins and microbial functional transcripts correlating to alpha-amylase levels
Results of differential analyses of metatranscriptomic and metaproteomic functions.
List of sets of open reading frames predicted on assembled contigs with MG-RAST13 accession IDs.
Results of comparison of correlation analysis with binning for tracing functions to taxonomic entities