Preterm birth (PTB) is the leading cause of mortality in infants under the age of 5 years1. The rate of spontaneous PTB, defined as the onset of labour and subsequent delivery before 37 weeks of gestation, is reported to be at one in ten pregnancies (~15 million PTB annually) worldwide2,3. In addition to increased mortality, prematurity is associated with significant morbidity in terms of chronic lung disease, increased rates of neurodevelopmental delay, and long-term health problems4,5,6. The economic burden through healthcare costs related to prematurity is estimated to be $26 billion in the United States alone7.

Several studies have investigated links between maternal health status including diet8, smoking9, obesity10, stress11, age12, and previous history of obstetric complications13 as indicators of PTB. Microbial infection, specifically urinary tract infections, vaginitis, bacterial vaginosis (BV), and even periodontal disease have also been associated with an increased risk of spontaneous preterm delivery14,15,16. The composition of the microbiota of the mother, and in particular the mother’s vaginal microbiome, has more recently been linked to spontaneous PTB3. In general, the vaginal microbiota is stable throughout pregnancy with the dominant species rarely changing17,18; however, biogeography and ethnicity can play a considerable role in determining the apparent ‘normal’ microbiota19. A study of 90 women, mainly of African American decent, found that there was no association between the vaginal microbiota and PTB20. In contrast, DiGiulio et al.21 found that changes in the vaginal populations of Caucasian women were correlated with PTB. Further studies have provided additional evidence that race is a key factor when identifying patterns linking PTB with vaginal microbiota composition22,23.

The common means of categorizing the vaginal microbiome has been to group them into community state types (CSTs). CSTs dominated by Lactobacillus species have generally been associated with positive health states for the woman24. In contrast, CST-4, which is reduced in lactobacilli and contains an increased abundance of mixed species, has been associated with poorer health outcomes. These mixed species, including Gardnerella vaginalis (assigned as Bifidobacterium vaginale in the Genome Taxonomy Database (GTDB)), Atopobium vaginae (assigned as Fannyhessea vaginae in the GTDB), Mobiluncus sp., Prevotella sp., and others, are most often associated with BV. Multiple studies have reported a twofold increased risk for PTB in women suffering from BV25,26. Despite this, large cohort studies have reported that CST-4 can also be present in a high percentage of the healthy population with no symptoms of BV27. Indeed, B. vaginale has also been frequently isolated from asymptomatic women28.

Determining a role for the vaginal microbiota in influencing pregnancy outcome has been limited by the ability to fully differentiate taxonomic groups. This resolution may prove important if discrete communities, species, or strains prove to represent a risk factor. In light of this, it is important to consider that the use of standard 16S rRNA targeted amplicon sequencing has been reported to under-report certain subgroups of B. vaginale29, while targeting the cpn60 gene, has shown an improvement in terms of identifying Bifidobacterium30. Shotgun metagenomics presents an opportunity to overcome many of the limitations of amplicon sequencing, with the additional benefit of understanding the functional potential of a community. An increasing number of studies have now used this approach to elucidate the vaginal microbial communities. In particular, functional distinctions among different metagenomic assembled B. vaginale isolated from the same sample31 and an ability to sub-speciate important taxa32 provide examples for the power of this approach.

In a cohort of women with a predisposed risk for PTB, we aimed to use shotgun metagenomics to distinguish any vaginal microbial signatures different to a low-risk control group.


Participant data

A total of 57 participants with singleton pregnancy were successfully recruited over the study period, 20 at low risk of PTB and 37 with risk factors for PTB. Multiple swabs were collected from some women depending on clinic visit frequency, accounting for 89 total samples. To control for multiple sampling of some woman and different trimester timepoints, we focused our analysis on single samples for each woman in the second trimester of pregnancy (n = 49). Of these, 8 pregnancies ultimately were preterm (PTB) and the remaining 41 were full-term birth (FTB). In addition, for the 35 participants initially considered at risk of PTB, 7 women delivered before 37 weeks’ gestation and were grouped into the PTB group (risk_PTB). The remainder were grouped into a risk but full-term group (risk_FTB). For the 14 women at low risk of PTB, 13 delivered at term (no-risk_FTB). The single sample from this control group that delivered prior to 37 weeks was subsequently provided a distinct group of no-risk_PTB. Within the PTB samples, six deliveries were late preterm (32–37 weeks), one was a moderate preterm (31.7 weeks), and one was a preterm at 26 weeks (Table 1). None of the women had preterm premature rupture of membranes. There were no significant differences in the age, race, body mass index, or smoking status of the participants by either the grouping of of risk_PTB, risk_FTB, no-risk_FTB, and no-risk_PTB or the FTB and PTB grouping. Patient demographics are outlined in Table 1. There was a lower birth weight for infants from the PTB compared to FTB (2230.62 g vs. 3646.12 g, respectively; p < 0.001 Student’s t-test).

Table 1 Descriptive statistics of study participants.

Taxonomic analysis

Following quality filtering, a mean of 828,950 high-quality microbial sequencing reads were obtained per sample. The median number of observed species for either the FTB or PTB groups was 168 and 185.5, respectively (Fig. 1a), with the highest number of species observed at 15 for a risk_PTB sample. There was no significant difference in α-diversity measure between either the FTB or PTB groups (Fig. 1a); however, there was an increased diversity of the risk_FTB group compared to the no-risk_FTB group using both Shannon and Simpson measures (Fig. 1b; p = 0.01 and p = 0.003, respectively). In terms of β-diversity, the risk_PTB communities were significantly dissimilar from both the no-risk_FTB and risk_FTB samples (Fig. 1b; p = 0.02).

Fig. 1: Species diversity between study groups.
figure 1

α-Diversity comparison as measured by Shannon and Simpson index, including the total observed species for either delivery outcome (a) or risk grouping (b). Significant differences as calculated by Wilcoxon’s test are noted with *p < 0.05 or **p < 0.02. The species level community dissimilarity as measured by Bray–Curtis and visualized using PCoA for either delivery outcome (c) or risk grouping (d). Ellipses are generated using stat_ellipse function in R. Significance of group dissimilarity as calculated by PERMANOVA is identified by the given p-value. The top five lactobacilli across all samples are labelled to highlight drivers of variation for clusters with black arrows to indicate the directionality.

Taxonomic classification revealed that lactobacilli were dominant across all groups at 59.13% mean relative abundance (Fig. 2a), with a greater proportion observed in full-term pregnancies (61.86%) compared to preterm (45.11%). This dominance was greatest in the no-risk_FTB group (71.32%) where both Lactobacillus crispatus (50.86%) and Lactobacillus iners (21.27%) were dominant. Within the risk_FTB group, L. crispatus (25.83%), L. iners (14.83%), Lactobacillus gasseri (11.70%), and L. gasseri A (7.30%) were the dominant species detected. L. crispatus was never >0.03% relative abundance in the risk_PTB group. In addition, L. gasseri was only detected in two samples in the risk_PTB group, accounting for just 0.01% mean relative abundance. Although L. iners (30.82%) accounted for a greater proportion of the mean population in the risk_PTB group compared to both FTB groups, this increase was not statistically significant. Due to the large variation in both the abundance and number of species observed between samples, L. crispatus (p < 0.001), L. gasseri (p = 0.028), and Bifidobacterium breve (p = 0.036) were the only species that significantly differed between full-term and preterm groups, with a higher respective mean relative abundance (Fig. 2b). A significant negative correlation was observed between PTB samples and both L. crispatus and L. gasseri (p-value 0.03 and 0.03, respectively), and after correcting for multiple comparisons, these q-values were below than 0.25 (Fig. 2c, Table 2, and Supplementary Table 2). Overall, the dominance of BV-associated bacteria was low across all samples. Using the GTDB database, B. vaginale (this database assigns G. vaginalis as B. vaginale) was assigned as multiple sub-species. The two most abundant BV-associated species across all samples were B. vaginale and B. vaginale_G; however, neither were significantly increased in either FTB or PTB samples (Fig. 2b). A significant correlation with PTB samples was observed for B. vaginale_D, _E, and _F (Fig. 2c, Table 2, and Supplementary Table 2; q-value 0.097, 0.006, and 0.002, respectively). In addition, F. vaginae and F. vaginae_A (this database assigns A. vaginae as F. vaginae) were positively correlated with PTB samples (q-value 0.137 and 0.008, respectively; Fig. 2c, Table 2, and Supplementary Table 2).

Fig. 2: Species composition across study groups.
figure 2

a The relative abundance for the top 30 species across all samples for each of the four risk groupings. b Comparitive analysis within the delivery outcomes for the ten most abundant species across all samples. Significant differences were calculated using Student’s t-test. c MaAsLiN analysis correlating species multiple metadata fixed effects. Heatmap displays the top 50 species with a significant assoaction to either fixed effect with q-value < 0.25.

Table 2 MaAsLiN2 multivariate correlation analysis of most abundant microbial species and sample groupings.

Resolution of the microbial species into CSTs revealed that six distinct CSTs were present across all swabs (Fig. 3). CSTs were determined by the most dominant species in a cluster as previously described33 with the definition of CST-8 according to Brooks et al.24. CST-1, dominated by L. crispatus, had a weak association overall with delivery outcome (Supplementary Table 1; p = 0.09). There was a weak association between samples from the risk_PTB groups and CST-5 (Supplementary Table 1; p = 0.09). There were five observations of the BV-associated CST-4, three of which were from the risk_PTB group. This CST had a positive association with samples from the risk_PTB group (p = 0.02). Neither CST-3 nor CST-8 were associated with any pregnancy outcome.

Fig. 3: Community state types of all samples.
figure 3

Relative heatmap intensity comparison for all species with an across-sample mean relative abundance > 0.2, with sample clustering of similar samples by Pearson’s clustering. Community state types are defined based on the most abundant species per sample and are indicated by the top bar of the figure.

Functional pathways analysis

A total of 2928 gene functions were identified across all samples. Stratification of the data into the three Gene Ontology classifications of cellular component (CC), biological process (BP), and molecular functions (MFs) revealed functional diversity between the risk groupings. Both Shannon and Simpson diversity indexes determined there was a significant increase in the diversity of gene functions for both BPs and metabolic functions for PTB samples compared to FTB (Fig. 4a). When samples were stratified into risk groupings, this difference in functional diversity was not observed between the risk_FTB and no-risk_FTB groups (Supplementary Fig. 1a). The FTB and PTB samples were also functionally dissimilar to each other by Bray–Curtis measure both in the BP and MF classes (Fig. 4b; p-value 0.038 and 0.045, respectively). From ADONIS analysis, only L. gasseri was significantly influencing the variaton in gene function across all three categories (p-value 0.017 (MF), 0.027 (CC), and 0.013 (BP)). Stratification of the samples into the risk groupings did not show any significant differences in diversity of function (Supplementary Fig. 1b). Using multivariate analysis, 22 CCs, 154 BPs, and 299 MFs were significantly correlated to either women who had a previous PTB, large loop excisions of the transformation zone (LLETZ), were categorized with a prior risk for PTB, subgrouped to the four risk categories according to this study, or ultimately delivered preterm (Fig. 5 and Supplementary Table 3; q-value < 0.25 after multiple correction). For gene functions with significant association to preterm delivery, the MF category contained the most differential features. Among these were genes involved in ‘2′ 3′-bisphosphoglycerate-independent phosphoglycerate mutase activity’, ‘receptor activity’, ‘aldose 1 epimerase activity’, ‘carboxyl or carbamoyltransferase activity’, and ‘copper exporting ATPase activity’ (Fig. 5 and Supplementary Table 3). The top five gene functions most associated with preterm delivery within the BP category included ‘respiratory electron transport chain’, ‘alginic acid biosynthetic process’, ‘ATP hydrolysis coupled proton transport’, ‘folic acid biosynthetic process’, and ‘glucose catabolic process’ (Fig. 5 and Supplementary Table 3).

Fig. 4: Functional diversity of the vaginal microbiome.
figure 4

Gene functions are divided into three higher-level functional categories based on Gene Ontology classification with both α- (a) and β-diversity (b) measures of identified content shown. Significant differences are highlighted by an asterisk where p-value was <0.05 as calculated by Kruskal–Wallis. Significant differences in β-diversity were determined by PERMANOVA analysis. Ellipses are generated using stat_ellipse function in R. The top seven species across all samples are labelled to highlight drivers of variation for clusters with black arrows to indicate the directionality.

Fig. 5: Multivariate association analysis.
figure 5

Each heatmap presents the independent significant associations of species to grouping as determined by MaAsLiN2 with q-value < 0.25. For biological process and molecular function on the top 50 associations are presented.


In summary, using a shotgun metagenomics approach, this study has confirmed the association of a vaginal microbiome dominated by L. crispatus with full-term pregnancies. In addition, evidence is provided that CST-4 is more associated with the vaginal microbiome from a spontaneous PTB. This study also identified functional differences in the vaginal microbiome of women who subsequently deliver preterm.

To date, numerous microbiome sequencing studies have been conducted with the aim of understanding the vaginal microbiota and its’ association with PTB. A limiting factor in several of these studies has been the use of targeted amplicon sequencing. In particular, it has been shown that certain key vaginal taxa can be underrepresented or not detected, depending on the gene region targeted. Of note, BV has been determined as a risk factor for PTB, yet the detection and differentiation of B. vaginale (G. vaginalis) subtypes can be difficult with the potential to omit subtle nuances when associating vaginal communities with PTB.

In our study, we have used shotgun metagenomics to determine the vaginal communities in women at high risk of PTB. This approach has enabled a higher resolution of the species of bacteria associated with high-risk pregnancies compared to previous approaches. The relatively high microbial read depth achieved, coupled with the recently updated GTDB has revealed a richness of species that are present at low abundance (<5%; Fig. 3). With use of the GTDB classification database, several species of B. vaginale were detected with each determined to correlate differently with pregnancy outcome (Table 2 and Fig. 2c). Importantly, these sub-species were not the dominant B. vaginale, which when present (18/49 occurences) had a mean relative abundance of 14%. In contrast, B. vaginale_D (9/49 occurences), B. vaginale_E (8/49 occurences), and B. vaginale_F (8/49 occurences), which significantly correlated with PTB outcome, were never >0.2% relative abundance. At such low abundance, it remains to be determined whether these species can exert a meaningful biological influence within the microbiota. Nonetheless, this difference within the B. vaginale group highlights the importance of stratifying this species when investigating associations with vaginal health and perinatal outcome. Another important member of BV-associated bacteria, F. vaginae (A. vaginae), was identified in three of the PTB samples at relatively high abundance (4.42–8.10%), yet was rare in full-term samples at a maximum relative abundance of 3% (in 3/41 samples). This species has been shown to develop strong biofilms with B. vaginale34,35. Previous studies identified high loads of this species as a risk for PTB36,37. Of interest in our study, there was an increased relative abundance of F. vaginae in the control group sample that had no perceived risk of PTB, yet delivered preterm (Fig. 3).

The species Sneathia amnii and Prevotella amnii have recently been identified as emerging candidates for poor pregnancy outcomes38,39. In our study, both of these were increased and had a significant association with PTB (q-value 0.02 and 0.05, respectively; Table 2). Although BVAB-1 (assigned UBA629 sp005465875 in GTDB) has been identified in microbiome studies as a potential problematic species in terms of vaginal health, this species was not identified in this study, perhaps reflective of the cohort studied as it has predominantly been reported in North American cohort studies29,40,41. Overall, however, the BV-associated CST, CST-4, was positively associated with PTB, a finding that is in agreement with previous studies in this area21,42.

Unlike previous reports, we did not see any significant differences in α-diversity between the samples from preterm and full-term pregnancies; however, there was a weak correlation between L. crispatus and full-term pregnancy, and an overall increased abundance of this species in samples from full-term pregnancies. This association is in agreement with several other studies and supports the concept that this species is a beneficial component of a healthy vaginal microbiome. Taken together, these studies may suggest a role of L. crispatus in providing protection from PTB, particularly in a Caucasian population. However, a mechanistic basis for such a role has yet to be revealed. Nonetheless, L. crispatus has previously been identified as a species that dominates a healthy vaginal tract, specifically in subjects free from BV43. It has also been linked to the stability of the vaginal microbiota during pregnancy44. Given that inflammation of the uterus has been linked with PTB45, there may be merit to assessing the ability of L. crispatus to provide a protective effect by suppressing inflammation through H2O2 signalling to control nuclear factor-κB activity46 or, indirectly, whereby the inhibition of BV-associated bacteria prevents the occurrence of a proinflammatory response to the vaginosis47. Importantly, a strong correlation with proinflammatory cytokines and BV-associated bacteria was observed by Fettweis et al.39 in a preterm cohort, further highlighting a potential microbial inflammatory mechanism for preterm onset. In general, lactobacilli have been regarded as a marker of a healthy vaginal microbiome and have been seen to improve pregnancy outcome even in the presence of risk associated taxa37. An exception to this is L. iners, which has previously been shown to associate with PTB in a similar cohort48. This observation was not repeated in the study.

Although these findings are important indications that merit repeated investigation, there are limitations to this dataset. In particular, due to the relatively low ratio of PTB samples, a much larger sample size will be necessary to investigate the link between the CSTs, species, and preterm outcome. Moreover, this study’s findings are specific to a Caucasian population, whilst ethnicity has been previously shown as major confounder in describing the vaginal microbiome. Notably, Romero et al.20 found that the composition of the vaginal microbiota did not differ between PTB and FTB across a cohort of predominately African American women. The difference in ethnicity in terms of vaginal microbiota and PTB has been further evidenced recently in a study that indicated that the frequency of Lactobacillus, Gardnerella, and Ureaplasma varies significantly between preterm and full-term deliveries within a Caucasian cohort but not within and African American cohort49. Unfortunately, socioeconomic data was not collected as part of this work, and as has been previously reported50,51 could be a confounder in the different microbiome profile observed. In addition, there were women in both the risk_FTB and risk_PTB groups that had received antibiotic treatment in the 6 months prior to sample collection, yet due to the low numbers is was not possible to determine the influence of this.

Although there is growing evidence for a microbiota-related role in spontaneous PTB, and indeed numerous descriptions of the importance for the microbiota with general vaginal health, there has been a limited understanding of the functional capacity of these microbial communities. As suggested by Heintz-Buschart and Wilmes52, a functional insight is required to develop and test hypothesis for mechanisms of action for host microbe interactions. Indeed, the sensitivity of community analysis using functional metagenomics has already been shown to distinguish healthy populations from cases of inflammatory bowel disease53,54, type 2 diabetes55, and obesity56. Within our preterm cohort, there was a clear distinction with respect to the functional profile of genes involved in both metabolic functions and BPs. The PTB samples had an increased abundance of functions relating to methionine/homocysteine and folate metabolism in addition to purine metabolism. Notably, maternal folate and homocysteine levels have been associated with poor pregnancy outcomes including PTB57. Moreover, there has been reductions in the concentrations of cysteine as measured in cervico-vaginal fluid from preterm samples in a previous study58. Taken together, it would suggest that further research into methionine/folate/cysteine homeostasis is merited in terms of a role for microbial metabolism and host interaction.

Overall, this study demonstrates the benefits of using shotgun metagenomics to understand the vaginal microbiome of women at risk of PTB. Moreover, we have shown that use of a microbial DNA enrichment kit is a feasible method to increase high-quality microbial sequencing reads to the upper limit of what has been achieved in previous studies (Supplementary Table 4) and overcome the notable problem of host contamination31,59. A distinct variation in both the taxonomic and functional potential of the preterm linked vaginal microbiome reinforces the concept of a microbiome role in PTB. However, the inability of both this study and others to determine a strict signature highlights the need for much larger controlled studies with the ability to examine confounding factors like ethnicity, biogeography, and predetermined risk factors using high-resolution, strain level analysis. This study provides evidence for the functional role of the microbiota in spontaneous PTB. Direct gene expression studies are now required if we are to fully elucidate any microbial–host interactions, which are involved in the spontaneous onset of PTB.


Study participants and sampling

This was a prospective cohort study with institutional ethics approval by the National Maternity Hospital Research Ethics Committee and maternal written consent. Women were determined to be at risk of spontaneous PTB if they had either a history of previous spontaneous PTB, and therefore at high risk of subsequent spontaneous PTB (n = 29), or had two previous LLETZ (n = 11). High-risk participants in the study were recruited from women attending the PTB clinic at The National Maternity Hospital Dublin, Ireland. Anonymized patient data were collected from patient charts by an independent researcher and pregnancy outcome data collected from patient charts and a computerized database in the National Maternity Hospital, Dublin. Low-risk controls (n = 14) were women who had previously delivered a full-term baby with no prior history of PTB or LLETZ, and were attending routine antenatal care. Inclusion criteria were women over 18 years of age and pregnant. Exclusion criteria were women currently on antibiotic treatment. Following informed consent, a high vaginal swab was taken using a speculum from the posterior vaginal fornix and external orifice of the cervix prior to transvaginal ultrasonography using a dry cotton swab and delivered to the laboratory within 24 h.

DNA extraction/purification

One millilitre of sterile phosphate-buffered saline was added to each swab followed by rigorous vortexing for 1 min. Total microbial DNA was extracted using the MoBio PowerFood Microbial DNA isolation kit and manufacturer’s instructions. Briefly, 700 μl of swab material was centrifuged and the resulting pellet was disrupted by mechanical and enzymatic treatment. DNA was eluted at 100 μl in elution buffer and samples were stored at −20°C. DNA concentrations were determined using the Qubit high sensitivity kit as per manufacturer’s instructions. To reduce sequencing reads mapping to eukaryotic DNA downstream, DNA from each swab was treated with the Microbiome Enrichment kit (NEB) and purified with AMPure magnetic beads (Beckman Coulter). A negative sterile water control was included during the enrichment and carried forward through subsequent sequencing steps. Finally, all samples were normalized to 0.2 ng µl−1.

Shotgun metagenomic sequencing

Libraries of DNA were prepared according to standard Illumina protocols. Briefly, DNA was sheared by heating to 55 °C for 7 min. Paired-end indexes were added and amplification occurred for 12 cycles before samples were purified with AMPure magnetic beads. DNA was quantified by Qubit dsDNA HS assay kit and the Agilent Bioanalyser 2100 with high sensitivity DNA chips before being pooled to 2 mM. Quality of the sample pool was confirmed by quantitative PCR. A sample of sterile water was processed in parallel with the DNA during library preparation to act as a negative control (Supplementary Table 5). Libraries were sequenced using 2 × 150 bp paired-end kit on the Illumina NextSeq platform.

Bioinformatic analysis

Raw sequencing data were base-called using Illumina’s bcl2fastq software (v 2.19) ( TrimGalore (v 0.6.0) (, a perl wrapper for Cutadapt (v. 1.18)60 and FastQC (v. 0.11.8) (, was used to remove adapter sequences and low-quality sequences using default parameters. Removal of human DNA contamination was performed by aligning all high-quality paired-end reads to the latest draft of the human genome (hg38) using Bowtie2 (v. 2.3.4)61. The resulting SAM files were converted to BAM format and filtered to keep only unmapped paired-end reads using SAMtools62. Bedtools63 was used to convert the remaining reads from BAM to FASTQ format. Taxonomic assignment of paired-end reads was performed using Kraken264 alignment against the GTDB_54k database created by Méric et al.65 ( Functional profiling was performed using the HUMAnN2 pipeline (v. 2.8.1)66. The gene families output were renormalized as copies per million reads and regrouped according to Gene Ontology terms.

Data were visualized using both Graphpad Prism 6 and RStudio (R v.3.6.0). Heatmaps were generated using the ‘ComplexHeatmap’ package67 with samples clustered using the Pearson’s distance metric and columns split by k-means clustering to visualize CSTs, assigned based on previous reported definitions24. Plots for diversity analysis were generated using the ggplot2 package ( Statistical analysis was carried out in R using the vegan package ( and RVAideMemoire ( Multivariate Association with Linear Models 2 (MaAsLiN2, R V.1.2.0) was used to determine independent associations of species and functions with metadata factors. A q-value of >0.25 was considered significant in our analysis.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.