Parvimonas micra, Peptostreptococcus stomatis, Fusobacterium nucleatum and Akkermansia muciniphila as a four-bacteria biomarker panel of colorectal cancer

Dysbiosis of the gut microbiome has been associated with the pathogenesis of colorectal cancer (CRC). We profiled the microbiome of gut mucosal tissues from 18 CRC patients and 18 non-CRC controls of the UKM Medical Centre (UKMMC), Kuala Lumpur, Malaysia. The results were then validated using a species-specific quantitative PCR in 40 CRC and 20 non-CRC tissues samples from the UMBI-UKMMC Biobank. Parvimonas micra, Fusobacterium nucleatum, Peptostreptococcus stomatis and Akkermansia muciniphila were found to be over-represented in our CRC patients compared to non-CRC controls. These four bacteria markers distinguished CRC from controls (AUROC = 0.925) in our validation cohort. We identified bacteria species significantly associated (cut-off value of > 5 fold abundance) with various CRC demographics such as ethnicity, gender and CRC staging; however, due to small sample size of the discovery cohort, these results could not be further verified in our validation cohort. In summary, Parvimonas micra, Fusobacterium nucleatum, Peptostreptococcus stomatis and Akkermansia muciniphila were enriched in our local CRC patients. Nevertheless, the roles of these bacteria in CRC initiation and progression remains to be investigated.


Results
Sample collection and subject demographics. A total of 36 subjects (newly diagnosed CRC patients, n = 18; healthy controls, n = 18) were enrolled for the discovery phase of the study. Demographic distribution between the subjects for both CRC and control groups were similar in terms of age, sex and ethnicity. For CRC patients, majority of the tumours were located at the left-sided colon (83.3%) and of Dukes' B stage (61.1%). Non-CRC subjects were observed to be patients who attended the clinic for exploratory colonoscopy screening due to symptoms such as abdominal pain, altered bowel habit or family history of CRC. The demographic characteristics of the discovery cohort is shown in Table 1.
For the validation cohort, 40 CRC tissue samples and 20 control samples (18 from discovery cohort, two from UMBI-UKMMC Biobank) were included into the study. Tissues for all CRC stages were available for this part of the study, and the majority were of the Dukes' C stage (57.5%). Similar to the discovery phase, most tumours were located at the left-sided colon (90.0%). The demographic characteristics of the validation cohort is shown in Table 2.
Altered mucosal microbiome landscape in CRC . 16S rRNA gene sequencing generated a total of 7,940,453 high quality reads (mean ± SD ~ 220,568 ± 144,691) for all samples of the discovery cohort. Paired-end reads were clustered into Operational Taxonomic Units (OTU) at 97% similarity and taxonomic classifications were assigned to bacterial genera. From our core diversity analysis, we identified a total of 21 bacterial phyla and 358 genera for all included samples. Comparison of alpha diversity abundance revealed significant difference between the mucosal microbiome architecture of CRC patients compared to controls. Principal coordinate analysis (PCoA) plot of unweighted UniFrac analysis highlighted distinct β-diversity microbiome clusters (Fig. 1). Overall, mucosal microbiome composition differed significantly between CRC and non-CRC control subjects according to unweighted UniFrac distance.
At the phylum level, patients with CRC were found to harbour increased abundance of Fusobacteria, Verrucomicrobia and Synergitetes; while nine genera found to be significantly enriched in CRC were Filifactor, Prevotella, Peptostreptococcus, Akkermansia, Parvimonas, Lachnobacterium, Bulleidia, Dialister and Fusobacterium. Interestingly, Faecalibacterium, Dorea, Sutterella, Propionibacterium, Neisseria and Anaerofuctis were significantly depleted in CRC patients (p < 0.05). Over-representation of Fusobacterium nucleatum, Intestinimonas butyriciproducens, Peptostreptococcus stomatis, Eubacterium coprostanoligenes, Ruminococcus bromii, Bacteroides fragilis, Akkermansia muciniphila, Ruminococcus callidus, Parvimonas micra, and Gemella morbillorum was found in more than 66% of CRC patients. In contrast, Haemophilus parainfluenzae, Atopobium parvulum and Clostridium oroticum were significantly depleted in CRC patients. On the other hand, we also found several novel CRCrelated bacteria species such as Intestinimonas butyriproducens, Eubacterium coprostanoligenes and Ruminococcus bromii which have yet to be associated with CRC. Table 3 shows top 12 bacterial species over-represented in CRC (> 1.5-fold, occurrence in > 66% CRC samples) patients compared to controls. Figure 2 shows LEfSe analysis of bacterial taxa in CRC and non-CRC control subjects. Fusobacterium, Peptostreptococcus, Parvimonas and Akkermansia were amongst bacteria genera enriched in CRC patients, indicating a state of dysbiosis of the mucosal microbiome architecture in CRC.  Figure 1. β-diversity of gut microbiome in CRC patients compared to non-CRC controls. Principal coordinates analysis (PCoA) plot of gut microbiome β-diversity in the subjects of this study based on unweighted UniFrac analysis. www.nature.com/scientificreports/ Eubacterium coprostanoligenes, Solobacterium moorei and Intestinimonas butyriciproducens are available; however, we encountered difficulty in primer design for these bacteria to validate their abundance via qPCR. This was mostly due to high similarity in regions suitable for primer design among different species of the same bacterial genus. Amplification of larger regions can differentiate these bacteria; however, qPCR products should be < 500 bp in size. In addition, at the time of this analysis, the whole genome sequences of Christensenella timonensis and Ruminococcus bromii are still not available. Therefore, validation of bacteria abundance was only carried out in five bacteria, namely, Fusobacterium nucleatum (Fn), Akkermansia muciniphila (Am), Parvimonas micra (Pm), Peptostreptococcus stomatis (Ps) and Bacteroides fragilis (Bf) (Fig. 3). Interestingly, bivariate correlation analysis demonstrated strong positive correlation (Spearman's r = 0.85-0.918, p < 0.001) between 16S rRNA gene sequencing (discovery phase) and qPCR (validation phase) abundance for Fn, Am, Pm and Ps (Table 4). On the other hand, Bf qPCR abundance was found to be poorly correlated to 16S rRNA gene sequencing abundance. We found all Fn, Am, Pm and Ps to show significant enriched abundance in the CRC patients of our validation cohort (p ≤ 0.001), where all patients had an abundance of more than 66% of these bacteria, while non-CRC controls had only less than 30%. Supplementary Dataset 1 shows qPCR data for both discovery (16S rRNA gene sequencing vs qPCR assay) and validation (validation cohort on qPCR assay) phase experiments.
PICRUSt analysis of CRC-associated gut microbiome. Thirty-nine significant metagenome functions (KEGG Level 3) of both CRC and non-CRC control subjects' gut microbiome were identified. Enriched

Discussion
Gut microbiome dysbiosis is a hallmark of CRC. In our study, we profiled the gut microbiome of our study subjects, and observed respective clustering of CRC and non-CRC groups via unweighted UniFrac analysis. Interestingly, Pm, Ps, Fn and Am were further identified as the predominant bacteria associated with CRC locally. Bacteria found over-represented in our CRC patients were also reported in other geographical areas of the world 7,8,12 . Interestingly, our results are largely in concordance with a recently published report on a separate     13 , with the exception that Bf was not found to be consistently over-represented in our CRC patients. Bf, designated as the "driver" bacteria of CRC tumorigenesis via production of genotoxic molecules, are hypothesized to be one of the earliest colonizers on the colon mucosa of CRC patients 14 . These bacteria will be outcompeted by tumour-foraging opportunistic "passenger" bacteria such as Fn in subsequent stages of CRC. Incidentally, we had a lower number of early-stage CRC patients (Dukes' A, n = 2, 3.4%, only in validation cohort) compared to that of the other Malaysian study (stage I, n = 5, 11.4%). Indeed, we did not manage to enrol any Dukes' A patient in our discovery cohort; this might have also caused the poor correlation between sequencing (discovery) and qPCR (validation) results for Bf. Our study identifies Fn as being commonly over-represented for all the demographic subtypes of the CRC patients in our study. The role of Fn in CRC tumorigenesis has been investigated and reported in many studies.  www.nature.com/scientificreports/ The bacteria has been shown to induce tumorigenesis via Toll-like receptor 4 signalling to MYD88, causing initiation of inflammatory nuclear factor-kappa B (NF-kB) signalling pathway 4,15 . Fn has also been reported to adhere and invade into epithelial cells, stimulating the β-catenin pathway and causing activation of pro-inflammatory and oncogenic events 16 . On the other hand, the roles of Am, Pm and Ps in CRC tumorigenesis are still unclear. Intriguingly, even though Am has been reported to be reduced in patients with inflammatory bowel diseases 17 , it has been reported to cause tumorigenesis in mouse models 18 . Pm and Ps are oral commensals which occasionally may turn into pathogens and have been reported to be associated with CRC. These two bacteria have been identified to be abundant in some CRC gut microbiome studies carried out in Canada, USA and China 7,8,19,20 . In our study, Fn appears to be consistently over-represented in all CRC demographics, and high abundance (> 2 fold, data not shown) of Pm, Ps, Fn and Am (our CRC microbiome predictive panel) were observed across all cancer patients. Our findings suggest that over-abundance of these four bacteria could implicate colorectal carcinogenesis. Fn might have been integral in creating a pro-inflammatory environment and tumorigenesis of colon cells 4,15,16 . Meanwhile, Pm and Ps, which are oral microbiome and biofilm producers, secrete biofilm which protect cancer cells from the host's immune system 21 . Am, where it has been recently reported to be positively correlated with the host's immunotherapeutic response in mice, might be playing a role in reducing the pro-carcinogenesis effect of pathogenic bacteria such as Fn 22,23 . Nevertheless, the exact roles played by these four bacteria in CRC are still unknown, and remains to be investigated. Future experiments involving co-culture of these four bacteria with colon cells could be conducted to further investigate this. In a recent study, meta-analyses of the CRC microbiome were performed through comparisons across multiple datasets and populations, whereby Fusobacterium, Porphyromonas, Parvimonas, Peptostreptococcus, Gemella, Prevotella, and Solobacterium species were found to be enriched in CRC patients 24,25 . Another study which also performed multi-cohort analysis of CRC gut microbiome showed significant abundance of B. fragilis, F. nucleatum, P. asaccharolytica, P. micra, P. intermedia, A. finegoldii, and T. acidaminovorans in CRC subjects 26 . Interestingly, results from these two cohorts and our study all showed over-representation of Fusobacterium and Parvimonas. The Fn, Pm, Ps, Am four-bacterial panel combination seems to be unique for the Malaysian CRC population, compared to results from other studies.
In this study, we observed that metabolic functions involving DNA repair, ribosome activity and aminoacyl-tRNA biosynthesis were more abundant in the gut microbiome of CRC patients compared to non-CRC controls. It has been hypothesized that CRC-associated bacteria secretes toxins which will cause DNA damage in host cells and carcinogenesis 14 . Host cells subsequently respond by mounting an immune response via reactive oxygen species (ROS)-mediated pathways to eliminate the infecting bacteria 27 . In return, we suspect that the CRC gut microbiome would then require upregulation of pathways associated with DNA repair to survive the ROS attack. Protection of CRC-associated gut bacteria from the host immune system allows bacteria growth and further development of tumours in colon cells with the activation of ribosomal activity and aminoacyl-tRNA biosynthesis 28,29 . On another note, we observe slight elevation in polycyclic aromatic hydrocarbon (PAH) degradation metabolism in the CRC microbiome; incidentally, PAH metabolism is linked with red meat consumption and carcinogensis 30,31 .
Till date, various studies have been carried out to profile the CRC gut microbiome but there are differences in the approach including sequencing method and bioinformatics workflow 32 . In our study, tumour / colonic mucosal tissue samples were used for DNA extraction, as these samples have been proposed, compared to stool samples, to give a more accurate picture of the gut microbiome landscape on tumour environments and enable functional studies into the role of these bacteria in tumorigenesis [33][34][35][36] . In addition, we were also able to perform species-level taxonomical classification for each microbiota using the One Codex bioinformatics platform. Many previous studies published only genera-level identification of gut microbiota using QIIME, a commonly-used bioinformatics pipeline. One Codex is a new pipeline using the assignment-first approach in taxonomical classification 37 , using the K-mer-based analysis which is also used by Kraken 38 and CLARK 39 . At the time of our study, it is still not widely used for 16S rRNA gene analysis; nevertheless, we find the platform robust and user-friendly, allowing us to species-classify the gut microbiome of our study samples and further use these results for qPCR primer design in our validation phase of the study. In regards to this, the results from many earlier studies of CRC gut microbiome profiling were not further verified in a biological cohort 9,20,[40][41][42][43] . In our study, bacteria found to be significantly abundant in the discovery phase CRC samples were also found to be over-represented in our CRC validation cohort. We showed the utility of qPCR for the validation of 16S rRNA gene sequencing results. We also found good AUROC, Sn, Sp, PPV and NPV values in our Pm-Ps-Fn-Am four-bacteria CRC biomarker panel, showing the importance of these bacteria in CRC.
There were some limitations in our study. The number of samples used in our study was smaller compared to other studies, but we showed that the results were reproducible in the validation cohort. On the other hand, due to limited numbers, bacteria identified as significantly different in various CRC demographics and specific to certain ethnicity, gender and CRC staging could not be further verified in our validation cohort. It is also unclear how these bacteria might contribute to the observed demographic-specific difference in abundance found in our study. In addition, at the time of study, due to unavailability of whole genome sequences of some bacteria found over-represented in our CRC patients, we could not design primers to validate their abundance using qPCR assay. Therefore, the over-representation of Ruminococcus callidus, Eubacterium coprostanoligenes, Intestinimonas butyriciproducens and Ruminococcus bromii in CRC remains to be confirmed. Some recent publications reported the possible role of these bacteria in gut health and pathology, but the number of reports were too few to allow definite conclusions [44][45][46] .
Furthermore, while usage of mucosal tissues in 16S rRNA gene sequencing enabled us to identify the gut microbiome present on tumours and deduce their roles in tumorigenesis, results from mucosal tissue sequencing might be discrepant from those found in stool samples of CRC patients. Therefore, the four-bacteria panel identified to be over-represented in tissues of our CRC patients could not be used for population screening of the www.nature.com/scientificreports/ disease. The currently available immunochemical fecal occult blood test (iFOBT) will be a more suitable method for this purpose. Nevertheless, iFOBT only has a 40% sensitivity to detect patients with advanced adenomas 47 .
To this end, identification of over-abundant gut bacteria found in stool samples of advanced adenoma and earlystage CRC patients, and utilization of these bacteria as screening markers for early detection of CRC could be further explored.

Conclusion
In summary, we identified Pm, Ps, Fn and Am as bacteria significantly abundant in our cohort of Malaysian CRC patients. The exact role of these bacteria in CRC initiation and progression remains to be investigated in further studies. The study was carried out in two phases: discovery (gut microbiome profiling via 16S rRNA gene sequencing) and validation (qPCR amplification of 5 significant CRC-associated bacterial markers found in the discovery phase). Patients and controls for the discovery phase were recruited from those undergoing colonoscopy and tumour removal surgery at the Hospital Canselor Tuanku Muhriz, UKM Medical Centre (UKMMC), Kuala Lumpur, from 2015 to 2017. Written informed consent was obtained from each patient and control prior to colonoscopy. Subjects in this phase included individuals presenting with digestive symptoms and asymptomatic individuals undergoing colon screening. Exclusion criteria included history of any cancer or colon resection, subjects with gut diseases such as inflammatory bowel disease (IBD) and polyps, consumption of antibiotics for the past 3 months and unsuccessful colonoscopy procedure. Endoscopic pinch biopsies were performed from tumour sites, immediately flash frozen in liquid nitrogen after collection and stored at − 80˚C until further analysis. After colonoscopy results were obtained, subjects with confirmed CRC were grouped as "patients", while subjects whose tissues contained no evidence of active gut pathology were grouped as "non-CRC". For the validation phase, matched tissue samples were selected from the UMBI-UKMMC Biobank. Patients with recurrent CRC, history of other cancers, inflammatory bowel disease, and those who had either radiotherapy or chemotherapy prior surgery were excluded from this phase of the study. For controls, the corresponding samples from the discovery phase, together with archived control tissues from the UMBI-UKMMC Biobank were used.

Methodology
DNA extraction and 16S rRNA gene sequencing. Genomic DNA from flash frozen tissue samples was extracted using the QIAGEN DNA Micro Kit (discovery phase) and the Machery-Nagel NucleoSpin Tissue Kit (validation phase) as per manufacturers' protocols. The quality of the extracted DNA was determined by gel electrophoresis and the Nanodrop 2000c. Amplicon libraries for the discovery phase were prepared according to Illumina's 16S Metagenomic Sequencing Library Preparation protocol with some slight modifications. Briefly, amplicons were generated using primers targeting the V3/V4 region of 16S rRNA gene. PCR products were purified using E-Gel EX SizeSelect 2% agarose and subsequently attached with Nextera XT forward and reverse indices for barcoding. The amplicons were then purified using AMPure XP beads and quality-checked using the Agilent Bioanalyzer High Sensitivity DNA kit to determine library size. Libraries were quantified using Illumina Library Quantification kit (KAPA Biosystems) and normalized to 2 nM prior sequencing using a 2 × 250 bp MiSeq Reagent kit v2 on an Illumina MiSeq sequencer.
Bioinformatics analyses. Sample de-multiplexing was performed using Illumina's BCL2FASTQ algorithm by MiSeq Software Reporter. Raw FASTQ files were exported and processed by Trimmomatic v0.34 48 for adapter trimming and quality filtering. Forward and reverse sequences of each sample were assembled using SeqPrep (https ://githu b.com/jstjo hn/SeqPr ep) and converted to FASTA via FASTX-Toolkit (http://hanno nlab.cshl.edu/ fastx _toolk it/) prior to analysis using the QIIME v1.9.1 49,50 and One Codex software 51 . In the QIIME analysis, assembled reads were clustered into Operational Taxonomic Units (OTUs) using UCLUST 52 and aligned against the GreenGenes 16S rRNA gene database version 2013.05 (http://green genes .lbl.gov) at 97% similarity threshold. Core diversity analyses were performed to determine alpha and beta-diversity of the samples. For specieslevel analysis, assembled FASTA files were uploaded to One Codex platform (https ://www.oneco dex.com/platf orm/) and aligned against the Targeted Loci database for species-level taxonomic classification. LEfSe (Linear discriminant analysis effect size) was used to compare the relative abundance of different taxa between groups, where a p-value of less than 0.05 for the Kruskal-Wallis rank-sum test and a size-effect threshold of 2.0 on the logarithmic LDA score were applied for discriminative microbial biomarkers 53 . The gene functions of mucosal-associated microbiome for each group were predicted using Phylogenetic Investigation of the Communities by Reconstruction of Unobserved States (PICRUSt v1.1.3) 54 . After normalisation for 16S rRNA copy numbers, metagenomes were predicted based on KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway database 55 and summarised using KEGG from level 1, 2 and 3 metabolic functions. Differences in predicted metabolic function abundance between groups were identified using Statistical Analysis for Metagenomic Profile (STAMP) 56 . The STAMP software was used to assess significant statistical differences between the predicted metabolic function profiles using Welch's t-test corrected for multiple-testing by Benjamini-Hochberg false discovery rate (FDR). Corrected p-values below 0.05 were considered significant. www.nature.com/scientificreports/ Quantitative PCR (qPCR). qPCR was used to determine the relative abundance of candidate bacterial markers. Only bacterial species with available genome sequences for primer design were tested for relative abundance. Table 8 shows primers sequence of candidate markers designed using AlleleID v7.84 (PREMIER Biosoft, USA). All reactions were performed on a CFX96 Touch Real-time PCR Detection system in a 10µL reaction volume using SsoAdvanced Universal SYBR Green Super mix (Bio-Rad, USA). Each sample was assayed for 40 cycles in a triplicate reaction and relative abundance of each marker was calculated in reference to total bacteria DNA. Primers for total bacteria were adapted from a previous study 57 . Abundance of the tested bacterial markers was calculated as a relative unit normalised to the total bacteria of that sample, using the 2 −ΔΔCt method (where ΔΔCt = the average ΔCt value of each target-the average ΔCt value of total bacteria).
Statistical analysis. Taxonomic differences of gut microbiome composition between CRC and control was analysed using the Mann-Whitney test, where a minimum fold change of > 1.5 in bacteria abundance and > 66% occurrence in CRC was considered as significantly enriched. To determine gut microbiome differences between different demographics, the Mann-Whitney test was also used, with a cut-off value of > 5 fold abundance. For CRC staging, Dukes' B was classified as early CRC, while Dukes' C and D were classified as advanced CRC. Spearman correlation coefficient analysis was used to investigate the correlation between bacterial candidate marker abundance detected from sequencing and qPCR techniques. Diagnostic value for the bacterial markers in identifying CRC patients were evaluated by calculating the area under the receiver-operating characteristic (ROC) curve. The best cut-off values were determined by ROC analyses from maximized Youden index and smallest distance value. Sensitivity and specificity values were compared to find the best panel combination that gives high positive predictive value (PPV) and negative predictive value (NPV). Logistic regression model was applied to obtain probability plot values for estimating the CRC incidence among all subjects. ROC curves were constructed from the logistic regression for four-bacteria panel data. All tests were performed by GraphPad Prism 7.0 or SPSS software v22.0. A nominal value of p < 0.05 was determined as statistical significance.
Ethics approval and consent to participate. This study was approved by the National University of

Data availability
Data has been uploaded as Supplementary Dataset 1 and 2 of the manuscript.