Introduction

CC is the fourth most common malignant tumor in women, but the incidence and mortality rank second place in countries with a low development index1. The key etiological factor in the development of CIN and CC is well recognised to be persistent infection with high-risk HPV. Nevertheless, it is widely recognised that HPV alone is insufficient to cause cervical malignant transformation. Numerous factors have been linked to the occurrence of CIN, and it has been suggested that the composition of cervical microbiota has a significant role in the progression of HPV infection, ultimately leading to the formation of CIN or CC2,3,4. The potential impact of this phenomenon on the development of CC lies in its ability to modify various aspects of the host’s inflammatory, genomic, and metabolic processes5,6.

An optimal cervicovaginal status was commonly associated with low microbial diversity and prevalence of Lactobacillus. Lactobacillus spp., which produce lactic acid and bacteriocin, were the key microbes in healthy women, participating in creating the stability of the cervicovaginal microbial composition and maintaining a low pH environment7,8. With the development of next-generation sequencing technology, increasing evidence has shown that cervical microbiota dysbiosis, as an important environmental factor, may lead to bacterial vaginosis (BV)9, sexually transmitted infections (STIs)10, HPV infection and the subsequent development of cervical lesions11,12,13,14,15,16. There were five community state types (CSTs). CST I, II, III, and V were dominated by Lactobacillus crispatus (L. crispatus), Lactobacillus gasseri (L. gasseri), Lactobacillus iners (L. iners) and Lactobacillus jessenii (L. jessenii) respectively. CST IV was mainly composed by a high abundance of anaerobic bacteria17. Prior research has indicated that L. crispatus is more prevalent among women who do not have HPV infection or cancer lesions, but L. iners and non-Lactobacillus species are more frequently observed in women with HPV infection and patients who have cancer lesions18,19,20. Numerous research have been conducted to investigate the cervical microbiota and its association with cervical lesions in various populations. However, there have been conflicting findings regarding microbial distinctions between individuals with and without cervical lesions. For example, some taxa, including Aerococcus, Coriobacteria and Fannyhessea were reported to be enriched in CIN21. Whereas, Audirac-Chalifour et al.22 found that Sneathia and Fusobacterium were the predominant taxa in CIN and CC, respectively. The key issue related to the differences in cervical microbiota between cervical lesions and health controls is the lack of apparent reproducibility in different studies when identifying the microbiome characteristics of cervical lesions. Besides, the development of non-invasive and sensitive early diagnosis tests for CC and CIN based on cervical microbiota is meaningful.

Variation in study objectives and analytical processing can influence findings. Herein, we systematically reviewed the literature on cervical carcinogenesis microbiota, and used a consistent analytical approach on pooled 16S rRNA gene raw sequences from 6 studies to identify CC-associated microbiota. We sought to determine the differences in alpha-diversity, beta-diversity, microbial compositions and taxonomic alterations across the stages of CC development. Moreover, the bacterial biomarkers for classifications of different disease groups and bacterial taxa correlation were also examined.

Materials and methods

Dataset acquisition and study inclusion criteria

A systematic searches for public available data were performed on the NCBI Sequence Read Archive (SRA, http://www.ncbi.nlm.nih.gov/sra), European Nucleotide Archive (ENA, http://www.ebi.ac.uk/ena) and Genome Sequence Archive (GSA, https://ngdc.cncb.ac.cn/gsa/) databases using the search term “cervical cancer”, “HPV”, “cervical intraepithelial neoplasia” and “16S rRNA”, and limiting search results by “Bioproject”. Bioproject accession numbers containing high throughput sequencing reads and associated metadata were collected (Table 1). Then, we retrieved articles from NCBI PubMed or Google Scholar using submitted Bioproject information. The studies incorporated in this meta-analysis were mandated to meet the following criteria: (a) they had to involve samples obtained from the human cervix, (b) they needed to have have sequenced by NGS for 16S rRNA gene, (c) they had non-use of douches in previous days of sampling, (d) not having records of antibiotic or antifungal or antiviral usage within the previous three months of sampling, (e) to have associated metadata, sequencing data and barcodes from the public database or provided by the authors until November 2022 upon request by emails. In order to facilitate comparisons of data across studies, a preformatted metadata file including sample ID, study ID, sequencing type, country and diagnosis was collected from public database or request from authors. Finally, we obtained 16S rRNA sequence data and sufficient metadata (containing information differentiating samples by diagnosis) from 6 studies for subsequent analyses22,23,24,25,26,27. An additional 23 studies related to cervical microbiota of cervical lesions were excluded due to incomplete information on sequences, metadata or barcodes or the same sequences15,16,21,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47. The downloaded datasets were grouped according to the following stages: NC, HPV, CIN and CC.

Table 1 Summary of 16S-tag studies on cervicovaginal microbiota-related information.

Sequence processing

The raw data from all datasets were downloaded in the sequence read archive (SRA) and converted into fastq format using SRA Toolkit. Next, all downloaded 16S rRNA gene sequences were processed using the open-source DADA2 for quality control and denoising using parametric error model in R (version 1.16)48. Filtering, learning errors, dereplication, amplicon sequence variant (ASV) inference and chimera removal were conducted for each study. Subsequently, taxonomy was assigned using Silva 138 rRNA database using the Naïve Bayesian Classifier algorithm default in DADA249. The merging of paired-end reads was conducted when the overlapping region between the reads exceeded a length of 20 base pairs following the truncation process. Otherwise, only forward reads were used for the following analysis. The within-sample (alpha) diversity metrics including evenness and Shannon diversity were calculated based on the ASV table in each sample in R using vegan package. The between-sample (beta) diversity was assessed based on Bray–Curtis distance. Principal coordinate analysis (PCoA) was used to visualize the distance using PAST (version 3.0) software, and the differences between groups were determined using permutational multivariate analysis of variance (PERMANOVA) with 999 permutations.

Statistical analysis

Significantly altered taxa among different stage groups were determined by the linear discriminant analysis (LDA) effect size (LEfSe) method with a cutoff LDA score > 3 and p-value < 0.05. STAMP software (version 2.1.3) with extend error bar method was used to explore differentially abundant bacteria at genus level between two groups (i.e. NC vs CC), and the significance criteria were p-value < 0.05 using two-sided Fishers exact test. Afterward, logistic regression models were built using the selected bacterial genera biomarker with a backward stepwise selection algorithm using leaps package. The receiver operating characteristic (ROC) analysis was used to illustrate the performances of classification models. The statistical analyses were performed using GraphPad Prism software (version 9.3). Pairwise and multiple groups comparisons were performed using the two-sided Wilcoxon rank-sum test and the Kruskal–Wallis test, respectively. Spearman correlation test was used for correlation analyses for the representative (top 15) genera and displayed using R software (version 4.1.2) with corrplot, ggplot2 and ggpubr packages. Bubble plots were made using the ggplot2 and reshape 2 packages, and a heatmap plot was conducted with the pheatmap package in R.

Results

Alpha and beta diversity differences

Quality-filtered 16S rRNA gene sequences from 507 cervical epithelial scrapings samples were available for meta-analysis. Sequences were from six independent studies and five countries (Table 1). To compare the alpha diversity between different disease stages, we observed that the NC group had significantly lower Shannon diversity and evenness when compared CIN and CC groups (p < 0.0001) (Fig. 1A). There were no significant differences between NC and HPV groups, CIN and CC groups (p > 0.05). Moreover, the microbial diversities exhibited an increasing trend with disease progression (Fig. 1A). We next analyzed whether there were differences in the structures of microbial communities associated with different disease stages. Beta diversity was visualized by principal coordinate analysis (PCoA) based on the Bray–Curtis distance. Overall cervical microbial community structures of four disease stages were significantly different (PERMANOVA, F = 11.54, p < 0.001) (Fig. 1B).

Figure 1
figure 1

Bacterial diversity for patients in each stage group. (A) Alpha-diversity was estimated by Shannon and evenness indexes in each group. The solid black line indicated the corresponding median value in each group. Pairwise comparisons were performed using Wilcoxon rank-sum test. Asterisks mean differences between the two groups are statistically significant (p < 0.0001). ns, no significant difference. (B) Principal coordinate analysis (PCoA) for all included samples based on Bray–Curtis distance. p-value was estimated by permutational multivariate analysis of variance (PERMANOVA).

Characteristics of the core cervicovaginal microbiota

The present study investigated the comprehensive microbial compositions during the course of cervical cancer. The cervical microbiota was shown to be predominantly composed of five bacterial phyla at the phylum level, namely Firmicutes, Actinobacteria, Proteobacteria, Bacteroidetes, and Fusobacteria. There were 45 genera identified as core microbiota when combining all datasets. Among them, Lactobacillus, Gardnerella, Streptococcus, Sneathia, Prevotella, Pseudomonas, Fannyhessea, Megasphaera, Fusobacterium and Acinetobacter were the most prevalent and abundant genera across the datasets. Hierarchical clustering analysis showed that the HPV group and CIN group shared a similar distribution of the top 10 genera (Fig. 2). For Lactobacillus, the most common encountered species was L. iners (Fig. 3). Moreover, Lactobacillus was significantly lower in CIN and CC groups when compared to NC and HPV groups (p < 0.001). While, Streptococcus and Fusobacterium were significantly higher in CIN and CC groups when compared to NC and HPV groups (p < 0.05) (Fig. 4).

Figure 2
figure 2

Heatmap plot showing the top 10 most abundant bacterial communities in each group at the genus level. The abundance was log-transformed to reduce the skewness of the data.

Figure 3
figure 3

Bubble plot showing relative abundance of Lactobacillus species across stage groups.

Figure 4
figure 4

Comparison of top 10 abundant bacterial genera across disease progression stages. Multiple groups comparisons were peformed using Kruskal–Wallis test. *p < 0.05; **p < 0.01; ***p < 0.001; ****p < 0.0001.

Differential taxa across different disease stages

LEfSe was performed to identify the specific taxa with significantly higher abundance among the four groups. As shown in Fig. 5, a total of 59 clades were screened out with a LDA threshold score of 3.0. Class Bacilli (including Lactobacillus genus) was enriched in the NC group. Genus Acinetobacter and Bacteroides were also enriched in the NC than in other groups. The CIN group was more highly colonized by class Actinobacteria and Gammaproteobacteria. As for the CC group community, class Clostridia and Epsilonproteobacteria had significantly higher relative abundances.

Figure 5
figure 5

Bacterial taxa differences among the four groups using Wekemo Bioincloud (https://bioincloud.tech/). (A) Linear discriminant analysis (LDA) effect size (LEfSe) analysis on selected core taxa among the four groups. Only lineages with LDA values > 3 are displayed. (B) Cladogram showing the phylogenetic distribution of lineages associated with the four groups.

Bacterial biomarkers for distinguishing CC from NC

Among the ten bacterial genera that significantly differed between the two stages, four genera including Streptococcus, Fusobacterium, Pseudomonas and Anaerococcus, were found to be abundant in the CC stage compared to the NC stage. On the other hand, six were depleted including Lactobacillus, Acinetobacter and Bacteroides (Fig. 6A). We assessed the significantly altered bacterial genera for their potential as diagnostic biomarkers for discriminating CC from NC. Four CC-enriched and one CC-depleted were identified as potential biomarkers using the backward stepwise selection algorithm. A logistic regression model was built based on the five biomarkers. To evaluate the performance of the model, ROC analysis was conducted, yielding an area under the curve (AUC) of 0.8947 (95%CI: 0.8343, 0.9551) (Fig. 6D).

Figure 6
figure 6

Differentially taxa between CC, CIN, HPV and NC and the diagnostic genera markers. (AC) The significantly altered genera as revealed by the extended error bar method using the Wilcoxon rank-sum test. (DF) Receiver operating characteristic (ROC) analysis for the identified genera markers with logistic regression model discriminating CC, CIN, and HPV from NC.

Potential biomarkers to classify other disease stages

Microbial biomarkers were identified with the purpose of developing invasive diagnostic procedures for distinguishing between CIN and NC, HPV infection and NC, CIN and HPV infection, CC and HPV infection, and CC and CIN. The microbial taxa that exhibited significant differences between CIN and NC samples consisted of 14 genera that were enriched in CIN samples and 4 genera that were deficient in CIN samples (Fig. 6B). The logistic regression model, which utilised three genera, effectively differentiated between CIN and NC. It achieved an AUC of 0.7673 (95% Confidence Interval [CI]: 0.7007, 0.8339) (Fig. 6E). Nevertheless, the model’s ability to differentiate between HPV and NC was deemed inadequate, as indicated by the AUC of 0.6568 (95% confidence interval: 0.6035, 0.7101) (Fig. 6C, F). Similarly, the performance of the models for distinguishing the CIN from HPV was good (AUC: 0.7669, 95%CI: 0.6919, 0.8418) (Figure S1A). The taxa were significantly changed between CC and HPV including 9 CC-enriched and 5 CC-depleted genera (Figure S1B). Five genera were capable of differentiating samples between CC and HPV (AUC: 0.8836, 95%CI: 0.8217, 0.9455) (Figure S1E). For CC vs CIN, 2 CC-enriched genera and 1 CC-depleted genus were significantly altered (Figure S1C), these genera were capable of discriminating samples between CC and CIN by achieving an AUC of 0.7353 (95%CI: 0.6422, 0.8284) (Figure S1F).

Correlation network analysis

To further our comprehension of the potential interaction among core taxa, we performed co-occurrence network analysis in the datasets. Significant correlations were found in 35 genera pairs (r > 0.6 or r < − 0.6, p < 0.05) (Table S1). There were three genera pair showing negative correlations. Lactobacillus exhibited a negative correlation compared with all other dominant genera (Fig. 7A). The most strongest negative correlation was found between Lactobacillus and Prevotella. In addition, Lactobacillus also had negative correlations with Fannyhessea and Dialister. Positive co-occurrence correlations occurred in the genera of Prevotella, Sneathia, Megasphaera, Fannyhessea, Acinetobacter, Dialister, Anaerococcus and Bifidobacterium (Fig. 7B). These microorganisms may play a crucial role in the network.

Figure 7
figure 7

(A) Correlation analysis of the top 15 dominant genera with spearman,s rank correlation test method using R software (version 4.1.2). (B) Correlation network of the core genera in the combined dataset using Wekemo Bioincloud (https://bioincloud.tech/). The blue line indicates a negative correlation (r < − 0.6), and the red line indicates a positive correlation (r > 0.6). Each circle represents a core bacterial taxa, and the colour of the circle represents the phylum to which it belongs.

Discussion

Cervical cancer is a multifactorial disease involving the interactions among host, microbial and environmental factors. Despite HPV being demonstrated as the main well-established risk factor, cervicovaginal microbiome dysbiosis has emerged as a key risk factor in inflammation50, HPV acquisition, persistence and cervical carcinogenesis19. We integrated six datasets to present a cervical-related microbial landscape along the different stages, and explored the potential bacterial taxa as biomarkers to monitor cervical carcinogensis.

As an important component of the cervicovaginal self-purification function and biological barrier, the resident microbiota and opportunistic pathogens of the cervix maintain the microecological balance. The dominant bacteria Lactobacillus can decompose cervical epithelial glycogen to produce lactic acid, maintain a weakly acidic environment, which can inhibit the proliferation of pathogens51. In addition, they can enhance anti-infection ability by producing various metabolites or stimulating immune cells to produce various cytokines52. Following imbalance of this defense system, it may induce histological alterations of the vaginal mucosa and the cervical epithelium, thereby exerting a selective pressure on the microbiota53. Some cervicovaginal taxa, such as Gardnerella, Fusobacteria, Dialister and Prevotella, as well as a decrease in the proportion of Lactobacillus spp. have been linked to dysbiosis that would generate an unstable microenvironment, which in turn may affect key risk factor in cervical cancer54,55. Acid-producing Lactobacillus dysbiosis are responsible for increasing the levels of mucin-degrading enzymes, which may affect the mucous layer stability of the cervicovaginal epithelium56.

We observed no significant difference between HPV and NC group in bacterial richness and diversity, which is consistent with several previous studies15,57,58. Although a few studies results showed that HPV infection can increase bacterial diversity13,43,59. The prevailing bacterial taxa that exhibited enrichment in women with HPV infection were Prevotella, Megasphaera, Shuttleworthia, and Dialister. This finding aligns with prior investigations conducted in this field60,61. Prevotella has been considered to play an important role in HPV infection and persistence43. This bacterium was indicated as a major modulator of host inflammatory responses in the female genital tract by increasing the number of cytokines in cervicovaginal fluid62. Moreover, Prevotella can secrete proteases to degrade host antibodies, and transfer ammonia to Gardnerella, resulting reduction in host mucosal immunity63. Lebeau et al.64 found that HPV infection alters vaginal microbiome through down-regulating host mucosal innate peptides used by Lactobacilli as amino acid sources. However, several studies have shown that HPV does not necessarily induce significant changes in the cervicovaginal microbial communities15,65. Borgogna et al.31 found that the vaginal metabolome of HPV-positive women differed from normal individuals in terms of several metabolites, including biogenic amines, glutathione and lipid-related metabolites.

Higher richness and diversity were observed in CIN and CC groups when compared to NC group. Gardnerella, one of the most frequently reported in CIN studies, was enriched in CIN group compared with NC, as well as other undesirable genera,like Sneathia, Pseudomonas, Fannyhessea and Megasphaera. A longitudinal cohort study revealed a positive association between Gardnerella and CIN progression caused by elevated microbial diversity66. Previous studies have posited that there may be an increased risk of CIN associated with the enrichment of Gardnerella vaginalis (G. vaginalis) and Fannyhessea vaginae67. G. vaginalis is classified as a facultative anaerobe, capable of adhering to the vaginal epithelium. This adherence serves as a framework for the production of biofilms and promotes the proliferation of microorganisms68,69. It was associated with higher levels of inflammatory cytokines70. It has been reported that Fannyhessea vaginae can activate the proinflammatory transcription factor NF-kB in the cervicovaginal epithelial cells, triggering abundant inflammation and innate immune responses71,72.

In relation to CC, it has been observed that some bacterial taxa, namely Fusobacterium, Pseudomonas, and Anaerococcus, exhibit not only higher abundance in women with precancerous lesions but also in women with cervical cancer. Other identified taxon in this meta-analysis to be associated with cervical cancer included Streptococcus. Previously, species of Streptococcus have been reported that likely involve the activation of multiple inflammatory cytokines and may affect human vaginal and cervical epithelial cells. Soares et al.73 revealed that Streptococcus possesses metallopeptidases which could help them invade tissue or cause bacterial transmission. Fusobacterium species have attracted widespread attention due to their pro-inflammatory properties74,75. Fusobacterium nucleatum has been shown to potentiate intestinal tumorigenesis and modulate the tumor-immune microenvironment, indicating potential as a diagnostic biomarker for colorectal cancer76,77. Besides, Fusobacterium nucleatum also has been studied as a possible diagnostic biomarker of CC as it is positively correlated with tumor differentiation78. Pseudomonas has long been considered to be an opportunistic pathogen in vaginal inflammation, the human urogenital system. They can disrupt the mucosal defense against extracellular pathogens by secreting protease IV and inactivating interleukin 2279. It has been indicated that Pseudomonas aeruginosa has a potential role in the development of cervical cancer by promoting the expression of integrins in cervical cancer cells80.

In this meta-analysis, we observed distinct distributions of the CC group compared with other sample groups. The most relevant genera in each disease stage were revealed by our study, which allowed us to discover robust diagnostic biomarkers. Finally, five common genera including Streptococcus, Fusobacterium, Pseudomonas, Anaerococcus and Acinetobacter were identified as the most important features for distinguishing CC from the normal population (AUC > 0.8), indicating their possible role in cervical carcinogenesis as well as their clinical applications. Additionally, we also identified the depletion of potentially beneficial microorganisms, such as lactic acid-producing Lactobacillus. It has been reported that L. iners can create a protective micro-ecological environment by regulating the core fucosylation of the vaginal epithelial cell against CC44. Moreover, we identified additional biomarkers to discriminate CC from CIN (AUC > 0.7) and HPV (AUC > 0.8).

Profiling cervicovaginal microbial communities using 16S rRNA genes is a straightforward and cost-effective method, and is generally cheaper than shotgun metagenomic sequencing. Nevertheless, shotgun metagenomics and circularizing probes-based RNA (ciRNA) sequencing targeted approaches with species-level resolution can provide in-depth insights into cervicovaginal microbiome. A growing number studies have employed shotgun sequencing of the vaginal metagenome41,81,82,83,84, metatranscriptome85,86 and ciRNA87,88. Liu et al.84 have observed a total of 111 species in vaginal microbiome of healthy Chinese women, including all dominant vaginal Lactobacillus species, such as L. iners, L. crispatus, L. gasseri, and L. jensenii. It has been reported that L. crispatus, L. iners and G. vaginalis were the top three species in both HPV16-positive and control groups83. A previous study by Macklaim et al.85 focused on the transcriptional activity of L. iners in four reproductive age women and found variation in the species, transcriptional activity. Communities dominated by L. crispatus were found to exhibit higher expression of phosphate and phosphonate transporters86.

The meta-analysis conducted for CIN and CC in our study was constrained by limitations in the study number and sample size. Additionally, the study lacks comprehensive demographic and clinical data. Therefore, we appeal researchers to accurately and comprehensively disseminate their sequencing data and accompanying metadata. Furthermore, the division of CIN samples into low-grade squamous intraepithelial lesions (LSIL) and high-grade squamous intraepithelial lesions (HSIL) was not feasible due to the limited sample size. Moreover, the fact that LSIL are considered non-progressive lesions, which could affect the analyses in the study.

Conclusions

In summary, the present study conducted a comprehensive analysis of several cervical 16S rRNA gene sequencing datasets in a standardised manner, leading to the identification of distinct microbial characteristics in the cervical region throughout different stages of disease progression. The findings of our investigation suggest a potential correlation between the presence of certain microorganisms and the development of cervical cancer, which provides great clinical significance and feasibility for the development of noninvasive screening or diagnosis methods for cervical cancer.