Colorectal cancer is a heterogenous and mostly sporadic disease, the development of which is associated with microbial dysbiosis. Recent advances in subtype classification have successfully stratified the disease using molecular profiling. To understand potential relationships between molecular mechanisms differentiating the subtypes of colorectal cancer and composition of gut microbial community, we classified a set of 34 tumour samples into molecular subtypes using RNA-sequencing gene expression profiles and determined relative abundances of bacterial taxonomic groups. To identify bacterial community composition, 16S rRNA amplicon metabarcoding was used as well as whole genome metagenomics of the non-human part of RNA-sequencing data. The generated data expands the collection of the data sources related to the disease and connects molecular aspects of the cancer with environmental impact of microbial community.
transcription profiling design • disease analysis objective
transcription profiling assay • rRNA 16S
RNA sequencing • amplicon sequencing
human gut metagenome • Homo sapiens • colorectum
Machine-accessible metadata file describing the reported data (ISA-Tab format)
Background & Summary
Colorectal cancer (CRC) is one of the most common types of cancer worldwide, in terms of both incidence and mortality1. Most cases of CRC are sporadic with no known genetic link. Environmental factors are therefore likely to play a critical role in the development of the disease, and a key characteristic of the colon is that it houses the largest proportion of the human microbiome, suggesting that this might play a role in causing CRC. Recent data points to the importance of the microbial communities in the gut, the microbiome, and possible links to the development of CRC2,3,4,5. If this is the case, understanding the role of the microbiome in CRC will have profound effects on cancer rates, since it is potentially relatively easily to manipulate, using diet, pre- and probiotics and faecal transplants6,7,8,9. However, despite the intense interest in the field and increasing evidence pointing to a role for the microbiome in CRC, convincing connections with clinical parameters and outcome are rarely seen.
CRC is a highly heterogeneous disease, with varying clinical outcomes, response to therapy, and morphological features, and molecular subtyping systems based on CpG-island methylation, microsatellite instability and gene mutations have shown strong associations with outcome and response to therapy in CRC10,11,12,13.
Contrary to other microbiome studies, where CRC is treated as a single disease entity, we focused on the association between Consensus Molecular Subtypes (CMS) of colorectal cancer and gut microbiome patterns in the accompanying primary publication14. We stratified a set of CRC tumour samples into CMS according to their gene expression profiles15 and assessed differences in bacterial communities among CMS. The gene expression profiles were generated using RNA sequencing, and 16S rRNA metabarcoding as well as metagenomic analysis of non-human portion of the RNA sequencing data were employed for bacterial taxa quantification. We analysed the enrichment/depletion of bacterial species in one subtype compared to the other subtypes and showed enrichment of certain oral bacteria associated with CMS, which was validated using targeted quantitative PCR.
The data generated in this study combine various views of each sample as multiple different methods were used to obtain information about the samples. This allows us to study associations between the results of the particular methods. Making the raw sequencing data available together with the scripts used for data processing and analysis, we enable reuse of the data and extend the collection of the data sources related to CRC, for which the aetiology is not yet well understood.
Here, we present a more condensed version of the methods that led to data and analyses in the primary publication14. The workflow is shown in Fig. 1 and the names of the partial processes (depicted in blue in the figure) are used as titles in this section to structure the text. We make the raw sequencing data freely available in NCBI Sequence Read Archive16, and scripts together with more downstream analysis results are accessible as the Zenodo dataset17.
Sample collection & handling
Tumour tissue was collected from 34 patients undergoing surgery for resection of colorectal tumours. None of the patients had received chemotherapy prior to surgery, and all patients provided written, informed consent. This study was carried out with approval from the University of Otago Human Ethics Committee (ethics approval number: H16/037). Table 1 shows patient metadata for the cohort. At the time of surgery, CRC tumour cores were taken and immediately frozen in liquid nitrogen and initially stored at −80 °C. They were subsequently transferred to RNAlater ICETM (Qiagen), and equilibrated for at least 48 hours at −20 °C, prior to nucleic acid extraction. RNA and DNA were extracted from 15–20 mg each of tissue using RNEasy Plus Mini Kit (Qiagen) and DNeasy Blood and Tissue Mini Kit (Qiagen), respectively. Tissue disruption was carried out using a Retsch Mixer Mill. RNA extraction included a DNAse treatment step, and DNA extraction included overnight incubation with proteinase K, and treatment with RNAse A. Purified nucleic acids were quantified using the NanoDrop 2000c spectrophotometer (Thermo Scientific, Asheville, NC, USA), and stored at −80 °C. Nucleic acids were extracted from all tumour samples in a single batch by one operator, to avoid inter-batch variation.
Library preparation and ribosomal RNA depletion was carried out using Illumina TruSeq stranded total RNA library prep V1 and Ribo-Zero Gold. The ribosomal RNA depletion step has potentially removed a portion of bacterial ribosomal RNA alongside of the human one, hence losing some information on bacteria. However, the same method of depletion was used on all the samples thus the potential loss would effect all of them in a similar manner. RNA sequencing was carried out using the Illumina HiSeq. 2500 V4 platform, to produce 125 bp paired end reads. Each sample library was split equally to two HiSeq lanes and the sequences from the two lanes were merged for each sample during the data processing phase.
Read mapping, Gene expression quantification, and Profile classification
Adapters and low quality segments were removed from the sequenced reads using fastq-mcf from EA Utils18 and SolexaQA++19. The cleaned reads were mapped to the GRCh38 reference human genome with STAR20 and the read count for each HAVANA annotated gene in every sample was calculated with htseq-count21. The read counts were transformed to gene expression profiles measured in transcripts-per-million (TPM) with DESeq222. The published CMS classifier15 was used to assign a molecular subtype of the disease to each sample based on the gene expression profiles (for more details see14). We identified six samples as CMS1, 13 samples as CMS2 and nine samples as CMS3. No samples were classified as CMS4, and six samples were unclassified.
Assignment of reads to bacterial taxa
A Kraken23 database was built containing all NCBI Refseq complete genomes or chromosome-level genomes (January 2017) and additional genomes of bacteria proposed to play a role in CRC, disregarding their genome status. The used bacterial genomes are listed in the files Supplementary_table_K1.xlsx (all complete and chromosome-level genomes) and Supplementary_table_K2.xlsx (of interest specifically for CRC) in the folder data/kraken of the accompanying repository. All RNA-seq reads that were not uniquely mapped to the human genome reference sequence were used as input to Kraken using this custom database for taxonomic classification per sample. Altogether, 2231 different bacterial species were detected in at least one sample and only 1.4% of the analysed reads were not assigned to any bacterial species. We visualised bacterial abundances per CRC subtype using Krona24 and the interactive plots are available at http://crc.sschmeier.com.
Differential analysis of bacterial species in CMS
We analysed the enrichment/depletion of bacterial species in one subtype compared to the other subtypes employing a strategy similar to differential expression analysis. Using edgeR25, we identified bacterial taxa with considerable abundance differences among the subtypes. For each CRC sample we used the assigned CMS subtype, the list of identified bacterial species, and the read counts corresponding to the identified species as input data. We treated all samples of a certain CMS subtype as replicates belonging to the subtype and ran differential analysis of each CMS subtype against all the other classified samples. This analysis identified bacterial species that are enriched (or depleted) in a subtype as compared to all other subtypes. For further details regarding the analysis, please refer to the primary publication14.
16S rRNA metabarcoding
Libraries containing 16S rRNA were prepared with 20 ng of DNA for each sample using primer pairs flanking the V3 and V4 hypervariable regions of the 16S rRNA gene and Illumina sequencing adaptors and barcodes were added using limited cycle PCR. Amplicon sequencing was carried out using the Illumina MiSeq platform, and paired end reads of length 250 bp were generated.
Metabarcoding data analysis
Short overlapping forward and reverse reads coming from the same fragment were joined together with FLASh26 to form sequences of the V3-V4 hypervariable 16S rRNA region. Afterwards, low quality regions were removed from the resulting fragments with SolexaQA++19. Microbiome analysis was carried out with the QIIME bioinformatics pipeline27 using the Greengenes database28 for taxonomy assignment. No further normalisation of the data was performed.
Sequenced genomic data from both RNA-seq and 16S rRNA metabarcoding experiments are stored in the Sequence Read Archive as the study SRP11776316. Data resulting from the analyses presented here are located in the folder data of the Zenodo repository17. The data are separated into several subfolders:
The folder expr contains raw read counts in subfolder raw_counts, tpm-based expression profiles of all samples stored in file tpm.readyForClassifier.tsv and also file CMSclassifiedCRC.tpm.havana.tsv containing the CMS subtype classification. These files are the main outcomes of gene expression profile classifications.
Results of the metagenomics analysis of the non-human genomic content of RNA-seq are located in folder kraken together with two tables (Supplementary_table_K*.xlsx) containing lists of bacterial species used in this metagenomics analysis.
The folder 16S contains the biom file otu_table.biom resulting from the 16S rRNA metabarcoding analysis with QIIME and two partial abundance tables otu_table_sorted_*.txt.gz. The abundance tables are derived from the biom file and were used further for data visualisation in the primary publication as well as for the metagenomics method comparison.
RNA-seq raw data quality
The quality of raw sequenced reads from RNA-seq experiments was assessed with FASTQC and was very good. A pair of representative per base quality plots of corresponding forward and reverse read pairs for one sample is shown in Fig. 2a). Regardless of the raw data quality, all the samples underwent routine data cleaning to ensure that no base was called with a Phred quality below 20. In Table 2, we show number of reads passing various data processing stages together with relative proportion of the reads passing two different stages.
16S rRNA sequencing raw data quality
In Fig. 2b), we show quality of the 16S rRNA sequencing raw data for sample CRC_16. The other samples’ 16S rRNA quality plots looked similar. It can be seen that per base quality varied a little bit more along the 16S rRNA reads when compared to the RNA-seq reads, but overall the quality was very good for the 16S rRNA sequencing as well. Please note that the read length for the 16S rRNA sequencing was twice the read length of the RNA-seq, which together with differences between the used sequencing instruments explains differences in the quality plots. All the 16S rRNA samples underwent routine data cleaning to ensure that no base was called with a Phred quality below 20. In Table 3, we show number of reads passing various data processing stages together with relative proportion of the reads passing two different stages.
All the code used to process the genomic data is freely available as a part of the provided Zenodo repository17 and the code is located in the folder named scripts. The scripts folder also contains dependencies listed in the file used_packages_and_their_versions.tsv and the used parameter values listed in used_parameters.tsv. Depending on the scripts’ functionality, they are separated into various folders:
The folder rnaseq-subtype-classification contains scripts used for read mapping, gene expression quantification, and profile classification.
The folder kraken/human-unmapped contains scripts to assign reads to bacterial taxa.
The folder kraken/diff-expr-taxa contains scripts for differential analysis of bacterial species in CMS.
The folder 16S-metabarcoding contains scripts for metabarcoding data analysis.
Ferlay, J. et al. Cancer incidence and mortality worldwide: sources, methods and major patterns in globocan 2012. International Journal of Cancer 136, E359–E386 (2015).
Ahn, J. et al. Human gut microbiome and risk for colorectal cancer. Journal of the National Cancer Institute 105, 1907–1911 (2013).
Marchesi, J. R. et al. Towards the human colorectal cancer microbiome. PLoS One 6, e20447 (2011).
Gao, Z., Guo, B., Gao, R., Zhu, Q. & Qin, H. Microbiota disbiosis is associated with colorectal cancer. Frontiers in Microbiology 6, 20 (2015).
Sobhani, I. et al. Microbial dysbiosis in colorectal cancer (CRC) patients. PLoS One 6, e16393 (2011).
Rauch, M. & Lynch, S. The potential for probiotic manipulation of the gastrointestinal microbiome. Current Opinion in Biotechnology 23, 192–201 (2012).
Preidis, G. A. & Versalovic, J. Targeting the human microbiome with antibiotics, probiotics, and prebiotics: gastroenterology enters the metagenomics era. Gastroenterology 136, 2015–2031 (2009).
Gupta, A. & Khanna, S. Fecal microbiota transplantation. The Journal of the American Medical Association 318, 102–102 (2017).
Nakatsu, G. et al. Gut mucosal microbiome across stages of colorectal carcinogenesis. Nature Communications 6, 8727 (2015).
Jass, J. Classification of colorectal cancer based on correlation of clinical, morphological and molecular features. Histopathology 50, 113–130 (2007).
Leggett, B. & Whitehall, V. Role of the serrated pathway in colorectal cancer pathogenesis. Gastroenterology 138, 2088–2100 (2010).
Network, C. G. A. et al. Comprehensive molecular characterization of human colon and rectal cancer. Nature 487, 330 (2012).
Domingo, E. et al. Use of multivariate analysis to suggest a new molecular classification of colorectal cancer. The Journal of Pathology 229, 441–448 (2013).
Purcell, R. V., Visnovska, M., Biggs, P. J., Schmeier, S. & Frizelle, F. A. Distinct gut microbiome patterns associate with consensus molecular subtypes of colorectal cancer. Scientific Reports 7, 11590 (2017).
Guinney, J. et al. The consensus molecular subtypes of colorectal cancer. Nature Medicine 21, 1350 (2015).
NCBI Sequence Read Archive, http://identifiers.org/insdc.sra:SRP117763 (2017).
Schmeier, S., Visnovska, M., Biggs, P. J., Purcell, R. V. & Frizelle, F. A. Scripts and data attached to colorectal cancer study by Purcell, 2017. Zenodo, https://doi.org/10.5281/zenodo.2604777 (2018).
Aronesty, E. Comparison of sequencing utility programs. The Open Bioinformatics Journal 7, 1–8 (2013).
Cox, M. P., Peterson, D. A. & Biggs, P. J. SolexaQA: At-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics 11, 485 (2010).
Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
Anders, S., Pyl, P. T. & Huber, W. HTSeq–a python framework to work with high-throughput sequencing data. Bioinformatics 31, 166–169 (2015).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq. 2. Genome Biology 15, 550 (2014).
Wood, D. E. & Salzberg, S. L. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology 15, R46 (2014).
Ondov, B. D., Bergman, N. H. & Phillippy, A. M. Interactive metagenomic visualization in a web browser. BMC Bioinformatics 12, 385 (2011).
Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010).
Magoč, T. & Salzberg, S. L. FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics 27, 2957–2963 (2011).
Caporaso, J. G. et al. QIIME allows analysis of high-throughput community sequencing data. Nature Methods 7, 335 (2010).
DeSantis, T. Z. et al. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Applied and Environmental Microbiology 72, 5069–5072 (2006).
The authors would like to thank Helen Morrin and the staff at the Cancer Society Tissue Bank, Christchurch for their enthusiasm and support for this project, the patients involved, for generously participating in this study, as well as New Zealand Genomics Limited (NZGL) for the sequencing and support in study design. Funding sources (Rachel Purcell): Maurice and Phyllis Paykel Trust, Gut Cancer Foundation (NZ), with support from the Hugh Green Foundation, Colorectal Surgical Society of Australia and New Zealand (CSSANZ).
The authors declare no competing interests.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
ISA-Tab metadata file
About this article
Cite this article
Visnovska, T., Biggs, P.J., Schmeier, S. et al. Metagenomics and transcriptomics data from human colorectal cancer. Sci Data 6, 116 (2019). https://doi.org/10.1038/s41597-019-0117-3