Gut microbiome diversity detected by high-coverage 16S and shotgun sequencing of paired stool and colon samples

The gut microbiome plays a fundamental role in human health and disease. However, studying the complex structure and function of the gut microbiome using next-generation sequencing is challenging and prone to reproducibility problems. Here, we obtained cross-sectional colon biopsies and faecal samples from nine participants in our COLSCREEN study and sequenced them at high coverage using Illumina paired-end shotgun sequencing (for faecal samples) and IonTorrent 16S sequencing (for paired faecal and colon biopsy samples). The metagenomes consisted of between 47 and 92 million reads per sample, and the targeted sequencing covered more than 300,000 reads per sample across seven hypervariable regions of the 16S gene. Our data are freely available and accompanied by code for the presented metagenomic analysis using up-to-date bioinformatics algorithms. These results will contribute to informed decisions when designing comprehensive microbiome analyses and also provide data for further benchmarking of gut microbiome analysis methods.

However, conserved regions are not entirely identical across groups of bacteria and archaea, which can affect the PCR amplification step. Notably, among the conserved regions of the 16S gene, the central regions are the most conserved, suggesting that they are less susceptible to producing bias during PCR amplification 12 . Furthermore, an in silico study has shown that the V4-V6 regions perform best at reproducing the full taxonomic distribution of the 16S gene 13 . In another study, a constructed mock sample was sequenced with IonTorrent technology, demonstrating that the V4 region (followed by V2 and V6-V7) was the most consistent for estimating the full bacterial taxonomic distribution of the sample 14 . In addition, other methodological factors such as the actual primer sequence, the sequencing technology and the number of PCR cycles used may impact microbiome detection when using 16S sequencing. However, the relative ratios of taxonomic abundance have been shown to be consistent regardless of the experimental strategy used 15 .
Beyond 16S sequencing, shotgun metagenomics allows not only taxonomic profiling at species level 16,17 , but may also enable strain-level detection of particular species 18 , as well as functional characterization and de novo assembly of metagenomes 19 . Moreover, a plethora of new computational methods and query databases are currently available for comprehensive shotgun metagenomics analysis 20 . However, shotgun metagenomics is more expensive than 16S sequencing and may not be feasible when the amount of host DNA in a sample is high 21 . Nevertheless, provided sufficient sequencing coverage, taxonomic profiling of shotgun metagenomes is rather robust and mostly depends on the input DNA quality and the bioinformatics analysis tools 22 . Taken together, 16S and shotgun microbiome profiles from the same samples are not entirely identical, but rather represent the relative microbiome composition captured by each methodological approach 23-26 . Accordingly, comparative studies have already revealed that faecal, rectal swab and colon biopsy samples collected from the same individuals usually produce different microbiome structures, although consistent relative taxon ratios and particular core profiles are also detected 27 .
In this study, we characterized the gut microbiome signature of nine participants with paired faecal and colon tissue samples. Our data show a high concordance between different sequencing methods and classification algorithms for the full microbiome in both sample types. However, clear deviations depending on the sample, method, genomic target and depth of sequencing were also observed, which warrant consideration when conducting large-scale microbiome studies.
Accompanying this dataset, we also provide the full source code for the bioinformatics analysis, available and thoroughly documented in a GitLab repository. We expect that this annotated, high-quality gut microbiome dataset will provide useful insights for designing comprehensive microbiome analyses in the future, as well as be of use to researchers wishing to test their bioinformatics analysis pipelines.

Methods
Subjects and sampling. The COLSCREEN study is a cross-sectional study designed to recruit participants from the Colorectal Cancer Screening Program conducted by the Catalan Institute of Oncology. This program invites men and women aged 50-69 to perform a biennial faecal immunochemical test (FIT, OC-Sensor, Eiken Chemical Co., Japan). Patients with a positive test result (≥20 µg Hb/g faeces) are referred for colonoscopy examination. A detailed description of the screening program is provided elsewhere 28,29 . Exclusion criteria were as follows: gastrointestinal symptoms; family history of hereditary or familial colorectal cancer (two first-degree relatives with CRC, or one in whom the disease was diagnosed before the age of 60 years); personal history of CRC, adenomas or inflammatory bowel disease; colonoscopy in the previous five years or a FIT within the last two years; terminal disease; and severe disabling conditions. Participants provided written informed consent and underwent a colonoscopy. A week prior to the colonoscopy preparation, participants were asked to provide a faecal sample and store it at home at −20 °C. On the day of the colonoscopy, participants delivered the faecal sample. Participants also completed a self-administered risk-factor questionnaire reporting antibiotic, probiotic and anti-inflammatory drug intake in the previous months (Table 1). Patients reporting any antibiotic or probiotic intake in the month prior to sampling were not included in this study.
All stool samples were stored at −80 °C, while colonic mucosa biopsy samples were retrieved during the colonoscopy. Four biopsies of normal tissue were obtained from each colon segment (4 from the ascending colon, 4 from the transverse colon, 4 from the descending colon, and 4 from the rectum). If a tumour or a polyp was biopsied or removed, a biopsy of it was also obtained when the endoscopist considered it feasible.
Biopsy samples were immediately transferred to RNAlater (Qiagen) and stored at −80 °C. One biopsy of normal tissue from the ascending colon of each of the nine individuals was selected and used in this study.
Colonic lesions were classified according to the "European guidelines for quality assurance in CRC" 30 . For the present study, we selected patients with no lesions in the colonoscopy, patients with intermediate-risk lesions (3-4 tubular adenomas measuring <10 mm with low-grade dysplasia, or ≥1 adenoma measuring 10-19 mm) and patients with high-risk lesions (≥5 adenomas, or ≥1 adenoma measuring ≥20 mm). We analysed 18 biological samples (9 faecal samples and 9 colon tissue samples) from 9 participants: n = 3 negative colonoscopy, n = 3 high-risk lesions and n = 3 intermediate-risk lesions (Table 2). Our CRC screening programme follows the Public Health laws and the Organic Law on Data Protection. All procedures performed in the study involving data from human participants were in accordance with the ethical standards of the institutional research committee, and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards. The protocol of the study was approved by the Bellvitge University Hospital Ethics Committee, registry number PR084/16.
DNA extraction and sequencing. Total faecal DNA was extracted using the NucleoSpin Soil kit (Macherey-Nagel, Düren, Germany) with a protocol involving a repeated bead-beating step during sample lysis for complete bacterial DNA extraction. Total DNA from the snap-frozen gut epithelial biopsy samples was extracted using an in-house proteinase K (final concentration 0.1 μg/μL) extraction protocol, also with a repeated bead-beating step during sample lysis. All extracted DNA samples were quantified using the Qubit dsDNA kit (Thermo Fisher Scientific, Massachusetts, USA) and Nanodrop (Thermo Fisher Scientific, Massachusetts, USA) to ensure sufficient quantity and quality of input DNA for shotgun and 16S sequencing. DNA yields from the extraction protocols are shown in Table 2.
Metagenomic sequencing libraries were prepared from at least 2 μg of total DNA using the Nextera XT DNA Sample Prep Kit (Illumina, San Diego, USA). An equimolar pool of libraries was prepared based on Agilent High Sensitivity DNA chip (Agilent Technologies, CA, USA) results combined with SybrGreen quantification (Thermo Fisher Scientific, Massachusetts, USA). The indexed libraries were sequenced in one lane of a HiSeq 4000 run with 2 × 150 bp paired-end reads, producing a minimum of 50 million reads per sample at high quality scores. In total, 92.15% of the base calls of the whole sequencing run had a quality score of Q30 or higher (i.e. an error rate of at most 1 in 1000).
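The relationship between a Phred quality score Q and the base-calling error probability P follows the standard definition Q = −10·log10(P), which is where the "Q30 = 1 error in 1000" figure above comes from. A minimal sketch (the helper name is ours):

```python
def phred_to_error(q: float) -> float:
    """Convert a Phred quality score to its base-calling error probability.

    Follows the standard definition Q = -10 * log10(P), so P = 10^(-Q/10).
    """
    return 10 ** (-q / 10)

# Q30 corresponds to an error probability of 0.001, i.e. 1 miscalled base in 1000;
# Q20 corresponds to 0.01, i.e. 1 in 100.
q30_error = phred_to_error(30)
q20_error = phred_to_error(20)
```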
Targeted 16S sequencing libraries were prepared using the Ion 16S Metagenomics Kit (Life Technologies, Carlsbad, USA) in combination with the Ion Plus Fragment Library Kit (Life Technologies, Carlsbad, USA), loaded on a 530 chip and sequenced on the Ion Torrent S5 system (Life Technologies, Carlsbad, USA). The protocol was designed for microbiome analysis using the Ion Torrent 510/520/530 Kit-Chef template preparation system (Life Technologies, Carlsbad, USA) and included two primer sets that selectively amplify seven hypervariable regions (V2, V3, V4, V6, V7, V8 and V9) of the 16S gene. At least 10 ng of total DNA was used for 16S library preparation, and libraries were re-amplified using the Ion Plus Fragment Library Kit to reach the minimum template concentration. Equimolar pools of libraries were prepared based on Agilent High Sensitivity DNA chip (Agilent Technologies, CA, USA) estimates. Library preparation and 16S sequencing were performed with the technological infrastructure of the Centre for Omic Sciences (COS).
Bioinformatics analysis. Bioinformatics analysis was performed by running in-house pipelines. Shotgun reads were first introduced into a pipeline comprising removal of human reads and sample quality control. High-quality reads resulting from this pipeline were further analysed under three different approaches: taxonomic classification, functional classification and de novo assembly. Additionally, we subsampled the high-quality shotgun reads to analyse the loss of observed alpha diversity at lower sequencing depths.
Targeted 16S sequencing reads, on the other hand, were first subjected to a pipeline that identifies the variable regions and separates reads accordingly. Further denoising and classification analyses were performed separately for each 16S variable region, as explained in the following sections.
Removal of human reads. Prior to submission of the raw sequence data to the European Nucleotide Archive (ENA), human reads were removed from the metagenome samples in order to comply with privacy regulations. Raw reads were aligned to the human genome (GRCh38) using Bowtie2 with options --very-sensitive-local and -k 1. A FASTQ file was then generated from the reads that did not align (carrying SAM flag 12) using Samtools. These FASTQ files were deposited in the ENA.
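SAM flag 12 combines bit 0x4 (read unmapped) and bit 0x8 (mate unmapped), so selecting reads "carrying SAM flag 12" retains only pairs in which neither mate aligned to the human genome. A minimal sketch of that bitwise test (flag bit values are from the SAM specification; the helper name is ours):

```python
READ_UNMAPPED = 0x4  # SAM flag bit: this read did not align
MATE_UNMAPPED = 0x8  # SAM flag bit: its mate did not align

def pair_fully_unmapped(flag: int) -> bool:
    """True when both the read and its mate failed to align (bits 0x4 and 0x8 both set)."""
    mask = READ_UNMAPPED | MATE_UNMAPPED  # == 12
    return flag & mask == mask

# Flag 77 (paired + read unmapped + mate unmapped + first in pair) -> kept.
# Flag 99 (properly aligned read pair) -> discarded.
```

This is the same filter that `samtools fastq -f 12` applies, which requires all bits in the given mask to be set.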
Shotgun read quality control. Shotgun samples were quality controlled using FastQC. Sequences were then deduplicated using Clumpify from the BBTools suite, followed by quality trimming (PHRED > 20) on both ends and adapter removal using BBDuk. Read pairs in which either read was shorter than 75 bases were discarded. Results of this quality control pipeline are shown in Table 3.
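The pair-discarding rule above can be sketched as a simple filter over trimmed read pairs (a toy in-memory version for illustration; the actual pipeline applies this within BBDuk's paired-mode trimming):

```python
MIN_LEN = 75  # minimum post-trimming read length, as in the pipeline

def filter_pairs(pairs):
    """Keep only read pairs where BOTH mates are at least MIN_LEN bases after trimming.

    `pairs` is a list of (read1_sequence, read2_sequence) tuples.
    """
    return [(r1, r2) for r1, r2 in pairs
            if len(r1) >= MIN_LEN and len(r2) >= MIN_LEN]

# Toy example: the second pair is dropped because its mate was trimmed to 40 bases.
pairs = [("A" * 150, "C" * 150), ("A" * 150, "C" * 40)]
kept = filter_pairs(pairs)
```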

Shotgun taxonomic and functional profiling.
Pre-processed paired-end shotgun sequences were classified using three different classifiers: Kraken2 (a k-mer matching algorithm), MetaPhlAn2 (a marker-gene mapping algorithm) and Kaiju (a read mapping algorithm). These three programs were chosen to cover the three main algorithmic approaches used in taxonomic classification 20 .
Kraken2 was run against a reference database containing all RefSeq bacterial and archaeal genomes (built in May 2019) with a 0.1 confidence threshold. Following classification by Kraken2, Bracken was used to re-estimate bacterial abundances at taxonomic levels from species to phylum using a read length parameter of 150. MetaPhlAn2 was run with default parameters on the mpa_v20_m200 marker database. Kaiju was run against the proGenomes database (built in February 2019) using default parameters. The corresponding taxonomic profiles at family level are shown in Fig. 1a.
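Family-level profiles like those in Fig. 1a are obtained by rolling species-level read counts up the taxonomy. A minimal sketch of that aggregation step (the taxon names, counts and lineage map below are illustrative, not taken from the actual Kraken2/Bracken output):

```python
from collections import Counter

def rollup_to_family(species_counts, species_to_family):
    """Aggregate species-level read counts to family level via a lineage map."""
    fam = Counter()
    for species, n_reads in species_counts.items():
        fam[species_to_family[species]] += n_reads
    return dict(fam)

# Illustrative species-level counts and lineage assignments.
counts = {"Bacteroides vulgatus": 120, "Bacteroides dorei": 80, "Escherichia coli": 50}
lineage = {"Bacteroides vulgatus": "Bacteroidaceae",
           "Bacteroides dorei": "Bacteroidaceae",
           "Escherichia coli": "Enterobacteriaceae"}
families = rollup_to_family(counts, lineage)
# families == {"Bacteroidaceae": 200, "Enterobacteriaceae": 50}
```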
Functional profiling of the concatenated metagenomic paired-end sequences was performed using the HUMAnN2 pipeline with default parameters, obtaining gene family (UniRef90), functional groups (KEGG orthogroups) and metabolic pathway (MetaCyc) profiles. ChocoPhlAn and UniRef90 databases were retrieved in October 2018.
De novo assembly. High-quality metagenomic reads were assembled using metaSPAdes with default parameters and binned into putative metagenome-assembled genomes (MAGs) using MetaBAT. CheckM was used to assess the quality of the MAGs and filter them to comply with strict quality requirements (completeness > 90%, contamination < 5%, number of contigs < 300, N50 > 20,000). A total of 112 high-quality MAGs were assembled from the nine high-coverage metagenomes and assigned a species-level taxonomy using PhyloPhlAn2. Assembled species shared by at least two of the nine samples are listed in Table 4. Subsampled pseudo-samples were then classified using Kraken2 and HUMAnN2. From this classification, Shannon-index alpha diversity profiles were computed at the species, genus and phylum levels, as well as at the UniRef90, KO and MetaCyc pathway levels, using the R package vegan.
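The strict MAG quality thresholds above can be expressed as a simple predicate over per-MAG statistics reported by CheckM (the dictionary field names here are illustrative, not CheckM's actual column names):

```python
def is_high_quality(mag: dict) -> bool:
    """Apply the strict MAG quality thresholds used in this study:
    completeness > 90%, contamination < 5%, fewer than 300 contigs, N50 > 20,000.
    """
    return (mag["completeness"] > 90
            and mag["contamination"] < 5
            and mag["n_contigs"] < 300
            and mag["n50"] > 20_000)

good_mag = {"completeness": 95.2, "contamination": 1.3, "n_contigs": 120, "n50": 45_000}
bad_mag = {"completeness": 85.0, "contamination": 1.3, "n_contigs": 120, "n50": 45_000}
```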
Splitting 16S samples by region. As the Ion 16S Metagenomics Kit contains several primers in the PCR mix, the resulting FASTQ files contained sequencing reads belonging to different variable regions. Hence, an in-house Python program was written to identify the variable region(s) present in each read. FASTQ files were then split into new subfiles in which all sequences belonged to the same region.
First, we positioned the 16S conserved regions on a reference sequence following 12 . Regions 5 and 7 were truncated to match the reference E. coli sequence. Each sequencing read was then assigned to its corresponding variable region by mapping.
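Once the conserved regions are positioned on the E. coli reference, each read can be assigned to the variable region(s) it overlaps from its alignment coordinates. A minimal sketch of that assignment (the boundary coordinates below are approximate E. coli 16S positions chosen for illustration only; the published pipeline uses its own coordinate table):

```python
# Approximate variable-region boundaries on the E. coli 16S gene (illustrative only).
REGIONS = {"V2": (137, 242), "V3": (433, 497), "V4": (576, 682),
           "V6": (986, 1043), "V7": (1117, 1173), "V8": (1243, 1294)}

def assign_region(aln_start: int, aln_end: int):
    """Return the variable regions that an aligned read overlaps.

    A read overlaps a region when the two intervals intersect; reads spanning
    two regions (e.g. V6-V7) receive both labels.
    """
    return [name for name, (start, end) in REGIONS.items()
            if aln_start < end and aln_end > start]

v3_v4_read = assign_region(430, 640)  # a read spanning both V3 and V4
v6_read = assign_region(1000, 1040)   # a read contained within V6
```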
Analysis of the regions covered in our samples revealed a prevalence of V3, followed by V4, V2, V6-V7 and V7-V8 (Table 5). For each sample, each set of sequences from the same variable region(s) was subsequently extracted from the original FASTQ files with an in-house Python script (code available).
16S denoising and taxonomic binning. 16S sequences were denoised following the standard DADA2 pipeline, with adaptations to fit our single-end read data. For this analysis, the per-region read sets obtained in the previous step were introduced into the pipeline as separate input files. Taxonomic classification of the high-quality sequences was performed using IdTaxa from the DECIPHER package. A summary of quality estimates of the DADA2 pipeline is shown in Table 6. Taxonomic assignment at family level by region and source material is shown in Fig. 1b.
Statistical analysis. For the statistical analysis of the bacterial abundance data, we used compositional data analysis methods 31 .
Count matrices of the classified taxa were subjected to centred log-ratio (CLR) transformation after removing low-abundance features and including a pseudo-count. Here, we used the codaSeq.filter, cmultRepl and codaSeq.clr functions from the CoDaSeq and zCompositions packages. Principal component analysis (PCA) biplots were generated from the centred log ratios using the prcomp function in R.

Data Records
The raw sequence data generated in this work were deposited in the European Nucleotide Archive (ENA). Faecal metagenomic sequences are available under accession PRJEB33098 32 . Faecal 16S sequences are available under accession PRJEB33416 33 and tissue 16S sequences are available under accession PRJEB33417 34 . Human sequences were removed from the shotgun samples prior to ENA submission, as described above.

Technical Validation
Prior to analysis, shotgun sequencing reads were subjected to quality and adapter trimming as described above. Moreover, reads were deduplicated to avoid compositional biases caused by PCR duplicates. Quality control and denoising of 16S reads were performed within the DADA2 denoising pipeline rather than as an independent data-processing step.
In order to validate the 16S variable-region assignment, we selected reads that were assigned to a species by the assignSpecies function in DADA2, which searches for unambiguous full-sequence matches in the SILVA database. These pre-processed 16S reads were aligned to a full-length 16S gene from those species in the SILVA database (version 132, gene codes shown in Table 7). The reads mapped consistently within regions of the 16S gene in agreement with the variable region assigned by our pipeline. That is, each read aligned between the start and end loci reported in Table 7, corresponding to the estimated 16S variable region in the genome of the particular microbial species. These results suggest that our read-level 16S region assignment was largely correct.
Table 6. DADA2 results. The total number of reads entering the pipeline and passing all quality controls is indicated, as well as the percentage of reads filtered at each step.
To define the taxonomic structure of the microbiome, we compared three different classifier algorithms, based on full-genome k-mer matching (Kraken2), protein-level read alignment (Kaiju) or gene-specific markers (MetaPhlAn2) (Fig. 1a). A common core microbiome structure was observed regardless of the taxonomic classification method. However, particular deviations in relative abundance were observed between these methods. To estimate differences in microbiome community structure, we performed a PCA of CLR-transformed data, which revealed a clear clustering by taxonomic classification method (Fig. 2b). Importantly, however, Kraken2 and Kaiju family-level classifications ordered the samples identically along the second component, which likely reflects consistency in classification regardless of the method used.
Both the variable region analysed and the source material (faeces or tissue) revealed differential distributions of the bacterial taxa (Fig. 1b). Indeed, when analysing CLR-transformed taxonomic profiles, samples clustered mostly by source material (Fig. 2a). Notably, the V7-V8 data deviated most from all other variable regions in the principal components (Fig. 2a).

Altogether, a clear difference in community structure was observed between 16S and shotgun sequences from the same faecal samples (Fig. 2c). Regardless, the samples were ordered identically along the second component, indicating consistency of the detected microbial signature.
Finally, we subsampled the original high-quality reads to lower coverages and computed alpha diversity at different taxonomic and functional levels in order to estimate the sequencing depth necessary to capture the observed microbial diversity in a given sample (Fig. 3). These alpha diversity profiles demonstrated a gradual drop in diversity as sequencing coverage decreased. The drop was more noticeable for features with higher diversity, particularly at the species level or when using gene families (UniRef90). Altogether, in the case of species, sequencing coverages as low as 1 million read pairs appeared to capture the taxonomic diversity present in a sample, in line with previous findings 35 . In this study, we demonstrate that our high-coverage dataset from nine participants provided sufficient sequencing depth to capture the majority of the known bacterial taxa and functional groups present in the samples.
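The subsampling experiment amounts to repeatedly drawing reads without replacement from the classified counts and recomputing the Shannon index at each depth. A minimal sketch on a toy taxon profile (the study computed the index with the R package vegan; function names and the profile below are illustrative):

```python
import math
import random

def shannon(counts):
    """Shannon diversity index H' = -sum(p_i * ln p_i) over non-zero taxa."""
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)

def subsample(counts, depth, rng):
    """Draw `depth` reads without replacement from a taxon count vector."""
    pool = [i for i, n in enumerate(counts) for _ in range(n)]
    out = [0] * len(counts)
    for i in rng.sample(pool, depth):
        out[i] += 1
    return out

rng = random.Random(42)
profile = [500, 300, 150, 40, 10]       # toy taxon abundances
full_depth_h = shannon(profile)
subsampled_h = shannon(subsample(profile, 100, rng))
# As subsampling depth increases, the estimated index approaches the
# full-depth value; rare taxa (and hence diversity) are lost first.
```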

Usage Notes
For reproducibility purposes, sequencing data were deposited as raw reads. However, human sequencing reads were removed from the dataset prior to uploading in order to prevent participant identification. Thus, reads need to be trimmed and, if necessary, deduplicated before being reused.
For 16S data, reads were uploaded without any manipulation. Hence, reads from different variable regions are present in the same FASTQ file. We suggest that researchers run the read classification scripts in order to choose variable regions for the analysis. Following that, reads will still need to be quality controlled, either directly or through denoising algorithms such as DADA2.

Code availability
Software versions used are listed in Table 8.
Code for sequence quality control and trimming, shotgun and 16S metagenomic profiling, and generation of the figures in this paper is freely available and thoroughly documented at https://gitlab.com/JoanML/colonbiome-pilot. This repository includes instructions for the analysis and for reproducing the figures in this paper from the publicly available samples, as well as the pipelines used for the analysis. The repository is arranged in folders, each containing a README:
• qc: Scripts for quality control and preprocessing of samples
• analysis_shotgun: Scripts to run the metagenomics analysis software
• regions_16s: In-house scripts for splitting IonTorrent reads into new FASTQ files
• analysis_16s: DADA2 pipeline adapted to this dataset
• assembly: Scripts to run the assembly, binning and quality control software
• figures: Scripts used to generate the figures in this manuscript
• shannon_index_subsamples: Scripts used to compute alpha diversity in subsampled FASTQs