The genus Streptococcus comprises several pathogenic species such as Streptococcus pyogenes and Streptococcus pneumoniae, together with a single 'Generally Recognized As Safe' species, S. thermophilus. Assessing the innocuous nature of S. thermophilus as a food microorganism is of major importance since this bacterium is widely used for the manufacture of dairy products1,2,3 (annual market value of $40 billion)3. In consequence, over 10^21 live cells are ingested annually by the human population. The dairy streptococcus must have followed a divergent evolutionary path from that of its pathogenic congeners, as it has adapted to a rather narrow, well-defined and constant ecological niche, milk. To obtain insight into this path and to assess the potential for virulence of this bacterium, we sequenced the genomes of two yogurt strains of S. thermophilus, and compared them to those of previously sequenced pathogenic streptococci4,5,6,7,8,9.


Divergence of S. thermophilus strains

S. thermophilus CNRZ1066 and LMG18311 were isolated from yogurt manufactured in France and in the United Kingdom, respectively. Both strains contain a single circular chromosome of 1.8 Mb, containing about 1,900 coding sequences (Supplementary Fig. 1 online and Table 1). Out of these, about 1,500 (80%) are orthologous (defined as BLASTP reciprocal best hits) to other streptococcal genes, which indicates that S. thermophilus and its pathogenic relatives still share a substantial part of their overall physiology and metabolism. The two S. thermophilus genomes reported here display about 3,000 single nucleotide differences (0.15% polymorphism). Taking into account the estimated natural mutation rate10, and assuming a growth rate between one and ten divisions per day, their common ancestor would have lived about 10^7 generations ago, that is, 3,000–30,000 years back, roughly fitting the duration of human dairy activity, believed to have begun about 7,000 years ago1. The two genomes differ by 170 single nucleotide shifts, mostly in mononucleotide (n > 3) stretches, and 42 regions of sequence differences >50 base pairs (indels) that represent about 4% of genome length (Supplementary Table 1 online). The two strains have >90% of coding sequences in common (Table 1), suggesting a similar lifestyle, as expected from their involvement in the same dairy process. The main differences concern genes for extracellular polysaccharide biosynthesis (eps, rps), bacteriocin synthesis and immunity, a remnant prophage and a locus known as 'clustered regularly interspaced short palindromic repeats' (denoted CRISPR2 here; CRISPR1 is present in both strains), closely linked to genes of unknown function (cas, Supplementary Fig. 2 online)11.
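The dating argument above is simple arithmetic and can be checked with a short sketch. The generation count (about 10^7) and the growth rates (one to ten divisions per day) are taken from the text. The function name and the 365-day year are illustrative assumptions, not part of the original analysis.

```python
DAYS_PER_YEAR = 365  # assumption: plain calendar year, no leap-day correction


def years_since_ancestor(generations: float, divisions_per_day: float) -> float:
    """Convert a number of generations into elapsed years at a fixed growth rate."""
    days = generations / divisions_per_day
    return days / DAYS_PER_YEAR


generations = 1e7  # about 10^7 generations, as estimated in the text

# Fast growth (ten divisions per day) gives the lower bound on elapsed time,
# slow growth (one division per day) gives the upper bound.
fast = years_since_ancestor(generations, divisions_per_day=10)
slow = years_since_ancestor(generations, divisions_per_day=1)

print(f"{fast:,.0f} to {slow:,.0f} years")
```

Under these assumptions the bounds come out at roughly 2,700 and 27,000 years, in line with the rounded range of 3,000–30,000 years quoted above.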

Table 1 General features of S. thermophilus CNRZ1066 and LMG18311 genomes

Inactive S. thermophilus genes

Unexpectedly, 10% of the S. thermophilus genes are not functional due to frameshift, nonsense mutation, deletion or truncation (globally named pseudogenes). This proportion is the highest among the sequenced streptococcal genomes (Supplementary Table 2 online). A nearly identical set of pseudogenes is shared between the two strains. Different functional categories are affected to various extents, ranging from 60% truncated coding sequences for 'Other Functions' (atypical conditions, phages, transposons), which mainly include insertion sequences known to be prone to inactivation5 (Supplementary Table 2 online), to only 3.5% or even none, for 'Translation' and 'Transcription', respectively (Table 2). Remarkably, two of the most highly decaying functional groups, 'Transport Proteins' and 'Energy Metabolism' (30% truncated coding sequences) relate to carbohydrate degradation, uptake and fermentation. Notably, half of the genes dedicated to sugar uptake, including four of the seven sugar phosphotransferase system (PTS) transporters, are pseudogenes in S. thermophilus (Supplementary Tables 3 and 4 online). To substantiate this finding, we sequenced ptsG (glucose), fruA (fructose), bglP (β-glucoside) and treP (trehalose) PTS transporter genes in eight different S. thermophilus strains and in four strains of the closely related oral commensal Streptococcus salivarius (trehalose PTS was not analyzed in the latter). We found them to be pseudogenes in all S. thermophilus strains (with a single exception of the fructose PTS in one strain), whereas they appeared intact in the four S. salivarius strains. Two other genes involved in carbon metabolism, butA (acetoin reductase) and adhE (alcohol-acetaldehyde dehydrogenase), were also inactivated in the S. thermophilus strains but not in the S. salivarius strains. Some genes dedicated to carbohydrate uptake may have also been lost, as S. thermophilus has only a minor fraction (19–36%) of the genes present in other streptococci (Supplementary Table 3 online). Conversely, a specific symporter for lactose (the main milk carbohydrate) is present in the S. thermophilus genome but absent in other streptococci (Supplementary Tables 3 and 4 online). Thus, probably because mammals have emerged relatively recently (60 million years ago) in comparison to the remote lactic acid bacteria group (1.5–2 billion years ago)12, numerous genes encoding proteins dispensable in the milk niche have become pseudogenes, paving the way towards gene loss.

Table 2 Truncated coding sequences in different functional categories

Absence of virulence-related genes in S. thermophilus

The availability of the S. thermophilus genome sequence allowed us to systematically search the chromosome for potential genetic virulence determinants. The ability to use an extended range of carbohydrates is reported to be important for the virulence of pathogenic streptococci, possibly by allowing maintenance of these bacteria in their ecological niche5,9. The observed impairment of this function in S. thermophilus is likely to reduce the virulence potential. Antibiotic resistance is another important facet of pathogen virulence. The S. thermophilus genome does not contain any obvious antibiotic modification genes such as those found in the Streptococcus agalactiae pathogen8 and it is reported to be sensitive to a wide range of antimicrobial compounds2. Many streptococcal virulence-related genes (VRGs) are absent from the S. thermophilus genome or are present only as pseudogenes, unless they code for proteins performing basic cellular functions (Supplementary Tables 5,6,7 online). Have some of the absent genes been lost from S. thermophilus, rather than being acquired by the pathogenic streptococci? Over a quarter of virulence-related genes absent in S. thermophilus are present in both S. pyogenes and S. pneumoniae (25/92, using BLASTP with a cut-off value of 10^-10) and almost 40% of these (9/25) are found in regions that are colinear in the two genomes. This suggests that they were present in the strain ancestral to both pathogenic species and presumably S. thermophilus and that they were lost from the latter.

Pathogenic streptococci exploit surface-exposed proteins to achieve adhesion to mucosal surfaces and escape host defenses4. Among the 28 S. pneumoniae virulence-related genes coding for surface-exposed proteins, only 4 have orthologs in S. thermophilus (Supplementary Table 7 online). A global analysis of surface proteins revealed a major decay in specialized surface proteins (excluding lipoproteins) with a high proportion of pseudogenes (8/13, Supplementary Tables 8,9,10 online). The lipoprotein class, which includes a large number of substrate-binding subunits of ABC transporters (16 out of 27–28 predicted lipoproteins) and contains a low number of virulence-related genes (2 out of 27–28), is not massively affected. Globally, the most important virulence determinants that are exposed on the cell surface of pathogenic streptococci are absent or inactivated, such as the pneumococcal surface proteins A and C (PspA, PspC), the pneumococcal manganese ABC transporter lipoprotein PsaA, IgA proteases, adhesins and a majority of pneumococcal choline-binding proteins. One homolog to a choline-binding protein (CbpD) was found in each S. thermophilus genome (Supplementary Table 9 online) but neither contains the domain necessary for binding to teichoic acids substituted with phosphorylcholine, in line with the lack of the lic gene cluster required for phosphorylcholine metabolism13 in the S. thermophilus genome. The two S. thermophilus genomes lack genes coding for sortase-anchored surface proteins14; moreover, the single sortase gene itself is a pseudogene. Some of the important virulence determinants in pathogenic streptococci (S. pyogenes, Streptococcus mutans)6,9 are sortase-anchored proteins. Furthermore, sortase mutants of pathogenic Gram-positive bacteria, including streptococci (S. mutans, Streptococcus gordonii) are attenuated in animal models15. In spite of the presence of homologs of the cps genes, which are involved in the synthesis of the capsule that is essential for virulence in pathogenic streptococci such as S. pneumoniae, the two S. thermophilus strains are not encapsulated. Their cps homologs, also known as eps, are involved in the synthesis of exopolysaccharides, important for the industrial use of S. thermophilus, as they confer the desired texture to yogurt16.

RecQ inhibits symmetrical genome inversions in bacteria

Genome plasticity is another important feature for evolutionary adaptation of pathogens to host defense mechanisms17, as opposed to genome stability, which is expected to better fit the sedate lifestyle of a dairy bacterium. To estimate genome instability we analyzed symmetrical inversions around the chromosome origin/termination axis, which result from recombination events between the replication forks18. X-alignment analysis of pathogenic streptococci versus S. thermophilus revealed a much higher score of chromosomal inversions within the Streptococcus genus than in pairwise comparisons of closely related Bacillus species (see Supplementary Fig. 3a and b online for two selected comparisons with similar G+C content). What might be the reason for this high inversion frequency? We examined replication and recombination-related genes likely to play a role in recombination between the replication forks, and found that streptococci lack the recQ gene whereas Bacillus subtilis has it. RecQ helicases are present in most living cells, from bacteria to man, and contribute in several ways to genome stability19. We found a negative correlation between the frequency of symmetrical chromosomal inversions and the presence of the recQ helicase gene in Gram-positive bacteria (Supplementary Fig. 3c online), suggesting that RecQ stabilizes the genome of these bacteria. However, as all streptococci lack RecQ, this protein does not increase the stability of the S. thermophilus genome relative to its pathogenic relatives and its X-alignment with other streptococci does not appear more conserved than that between pathogenic streptococci (not shown). It is interesting that a phylogenetically related bacterium used in dairy fermentations, Lactococcus lactis (previously S. lactis) possesses recQ20. We noted that pathogenic streptococci, but not S. thermophilus and L. lactis, lack yet another potential genome-stabilizing function, encoded by the sbcC and sbcD genes and thought to participate in the repair of recombinogenic double-stranded DNA breaks21. These genes are adjacent to a remnant transposase in S. thermophilus, suggesting they may have been introduced by lateral gene transfer at a later evolutionary stage to counteract the destabilizing consequences of RecQ deprivation. However, as is often the case with putative lateral gene transfer, we cannot rule out the possibility that the genes were originally present in all streptococci and were lost subsequently by deletion from the pathogenic species.
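The inversion count behind this comparison is described under 'Genome comparisons': regions of conserved gene order are counted when they lie at the same distance from the origin in both genomes (within 10% tolerance) but on different chromosome arms. The sketch below is one possible reading of that rule, not the authors' program. The class and function names are hypothetical, the terminus is assumed to lie opposite the origin, and the tolerance is assumed to be relative to the larger of the two distances.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyntenicRegion:
    """Midpoint of one conserved region in two genomes (bp, clockwise from origin)."""
    pos_a: int
    pos_b: int


def arm_and_distance(pos: int, genome_len: int) -> tuple[str, int]:
    """Return the replication arm and the distance to the origin.

    Assumption: the chromosome is circular and the terminus sits at genome_len / 2.
    """
    if pos <= genome_len / 2:
        return "right", pos
    return "left", genome_len - pos


def count_symmetric_inversions(regions, len_a, len_b, tolerance=0.10):
    """Count regions equidistant from the origin but carried on different arms."""
    count = 0
    for region in regions:
        arm_a, dist_a = arm_and_distance(region.pos_a, len_a)
        arm_b, dist_b = arm_and_distance(region.pos_b, len_b)
        # Assumption: tolerance is a fraction of the larger distance.
        equidistant = abs(dist_a - dist_b) <= tolerance * max(dist_a, dist_b)
        if equidistant and arm_a != arm_b:
            count += 1
    return count


# Toy data for two hypothetical 1.8-Mb genomes.
regions = [
    SyntenicRegion(200_000, 210_000),    # same arm: not an inversion
    SyntenicRegion(300_000, 1_510_000),  # 300 kb vs 290 kb, opposite arms: counted
    SyntenicRegion(500_000, 1_700_000),  # opposite arms but not equidistant
    SyntenicRegion(1_600_000, 190_000),  # 200 kb vs 190 kb, opposite arms: counted
]
n_inversions = count_symmetric_inversions(regions, 1_800_000, 1_800_000)
print(n_inversions)
```

The tolerance lets a region shifted by small insertions or deletions still count as symmetrical, which is the purpose the Methods give for the 10% margin.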

Lateral gene transfer in S. thermophilus

In addition to gene decay and loss, lateral gene transfer has contributed to the shaping of the S. thermophilus genome. There are >50 insertion sequences in the two genomes, some with anomalous G+C content and associated with genes of relevance to milk adaptation. About 75% of insertion sequences are associated with the change in S. thermophilus gene order relative to S. pyogenes, suggesting that these sequences play an important role in the shaping of the genome. A particularly interesting case of lateral gene transfer is a 17-kb region found within a truncated pepD gene, which is present in both S. thermophilus strains. It could be considered as a hot spot of lateral gene transfer, as it contains three of the six insertion sequence 1191 copies present in the LMG18311 strain and constitutes a mosaic of fragments with more than 90% identity to DNA of Lactobacillus bulgaricus and two subspecies of L. lactis (lactis and cremoris), three other bacteria also growing in milk (Fig. 1a). Interestingly, the leftward flanking region is conserved in two streptococcal species (Streptococcus equi and S. mutans). Similarly, the rightward flanking region is conserved in S. equi, starting about 3.5 kb from the end of the 17-kb region. This conservation supports the hypothesis that insertions took place in the S. thermophilus genome. The L. bulgaricus fragment (3.6 kb) brings a unique copy of metC, allowing biosynthesis of methionine, a rare amino acid in milk2. The high level of identity (95%) of the respective metC regions reveals a recent lateral gene transfer event between these two rather distant species used in association in yogurt manufacture2 and suggests that ecological proximity rather than a phylogenetic one is a prerequisite for lateral gene transfer. We observed that the two species adhere to each other (Fig. 1b), which could facilitate gene transfer between them.

Figure 1: Lateral gene transfer between S. thermophilus and dairy bacteria.

(a) Schematic representation of a 17-kb mosaic region of lateral gene transfer encompassing DNA fragments with more than 90% DNA/DNA identity with Lactococcus lactis subsp. lactis20 (L.L. lactis, blue), Lactococcus lactis subsp. cremoris (L.L. cremoris, red; Joint Genome Institute) and Lactobacillus bulgaricus (green; Joint Genome Institute). Rectangular boxes in color correspond to exchanged DNA regions; species, DNA fragment size and percentage of DNA identity are indicated below. IS1191 are shown as black boxes. Extension '-tr' indicates genes inactive because of a truncation or one or more frameshifts. pepD, endopeptidase; tnp, transposase; hsdR, restriction endonuclease; hsdM, methylase; dacA, carboxypeptidase; IS, insertion sequences. (b) Adhesion of S. thermophilus CNRZ1066 and L. bulgaricus. The two organisms were cultivated together in liquid broth to mid-exponential phase; a glass slide was deposited in the culture for 1 h, withdrawn and rinsed five times with water and observed under an optical microscope. Inset, higher magnification.


Comparative genomics leads us to the view that the dairy streptococcus genome may have been shaped mainly through loss-of-function events, even if lateral gene transfer played an important role. This is the first instance where regressive evolution is observed in a food niche rather than in pathogen- or symbiont-host situations22,23. The massive gene decay resulted in inactivation and loss of most of the virulence determinants. This provides a strong genomic argument in support of the 'Generally Recognized As Safe' status of the dairy streptococcus, indicating that massive consumption of this bacterium by humans likely entails no health risk.



S. thermophilus strains CNRZ1066 and LMG18311 are yogurt isolates, deposited in the Institut National de la Recherche Agronomique (INRA) and Laboratorium voor Microbiologie Gent (LMG) collections. Other Streptococcus strains used in this study are from the INRA collection: S. salivarius JIM 14, 15, 16 and 17; S. thermophilus CNRZ 302, 385, 388, 389, 703, 1100, 1202 and 1575.

Genome sequencing and assembly.

The complete sequences were determined by the random shotgun sequencing strategy followed by multiplex PCR as described earlier20. Two sets of random libraries containing 2- to 3-kb inserts were constructed from chromosomal DNA from S. thermophilus strains LMG18311 and CNRZ1066. Assembly of 20,000 and 28,000 sequences gave 350 and 300 contigs, respectively, for the two strains. We carried out 1,500 multiplex PCR reactions for the final assembly of CNRZ1066 in mixtures of 48 primers, according to the one-step protocol20, which led to a single circular contig. Subsequently, fragments representing the boundaries of repetitive regions were flagged with respect to partial mismatches at the ends of alignments and were independently assembled before the final sequence polishing. We used a similar finishing strategy for strain LMG18311. In summary, sequences of the two strains were determined by construction of two independent sequence data sets containing 28,000 random and 2,000 primer-directed reads for the CNRZ1066 strain and 21,000 random and 1,500 primer-directed reads for the LMG18311 strain.

Gene prediction and annotation.

A combination of CRITICA24, Glimmer25 and an open reading frame calling program developed at Integrated Genomics was used to identify coding sequences. The assembled genomes were analyzed using the ERGO bioinformatics suite. The complete DNA sequence and the predicted coding sequences were added into the integrated environment for genome annotation and metabolic reconstruction as described26. Protein identifiers (PIDs) sth0001 and stu0001 were assigned to dnaA in CNRZ1066 and LMG18311, respectively.

Strain polymorphism.

Nucleotide sequences of internal fragments of the genes from different Streptococcus strains were determined from PCR products amplified from chromosomal DNA using selected primers. For each gene fragment, the nucleotide sequences were compared and clustered using the CLUSTALW program.

Genome comparisons.

MUMmer27 was used for detailed comparative analysis of the two S. thermophilus genomes. Comparative genome alignments were based on results of BLASTP Reciprocal Best Hits (RBH)28, identification of conserved gene order and construction of chromosome gene clusters29. The number of ori-symmetrical genome rearrangements17 was computed using the synteny groups30 identified in the RBH genome comparison. We computed the number of inversions by first determining the syntenic regions composed of RBH and then counting the number of regions equidistant from the origin (within 10% tolerance, allowing us to eliminate effects of most insertions and deletions in the compared genomes) but carried on different chromosome arms. Only genome pairs that have a homology greater than 50% were selected. The homology was defined as the mean of BLASTP identity of all RBH in the pair of genomes that were compared.
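The reciprocal best hit criterion used here to define orthologs can be illustrated with a minimal sketch. It assumes that BLASTP results have already been parsed into (query, subject, bit score) tuples; the parsing step, the function names and the gene identifiers are hypothetical, and ranking by bit score is an assumption, since the text does not state which score was used.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    `hits` is an iterable of (query, subject, bit_score) tuples.
    On a tie, the hit seen first is kept (an arbitrary choice for this sketch).
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy hit tables with made-up identifiers and scores.
a_vs_b = [
    ("a1", "b1", 250.0),
    ("a1", "b2", 90.0),
    ("a2", "b2", 180.0),
    ("a3", "b2", 120.0),  # a3's best hit is b2, but b2 prefers a2
]
b_vs_a = [
    ("b1", "a1", 248.0),
    ("b2", "a2", 181.0),
    ("b3", "a3", 60.0),
]

rbh = reciprocal_best_hits(a_vs_b, b_vs_a)
print(rbh)
```

In this toy case a1/b1 and a2/b2 are retained, while a3 is dropped because its best hit is not reciprocated. Requiring reciprocity is what keeps paralogs and partial matches out of the ortholog set.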

Nucleotide sequence accession number.

The S. thermophilus genome sequences have been deposited in GenBank with accession nos. CP000024 (CNRZ1066) and CP000023 (LMG18311).

Note: Supplementary information is available on the Nature Biotechnology website.