Introduction

Universal single-copy genes (USCGs) are marker genes that occur once and only once in almost every genome1. Because of this property of ubiquitous existence and uniqueness in each genome, USCGs are widely used to study the evolution and classification of species2,3,4. In the past two decades, with the enormous amount of metagenomic data generated, USCGs are also routinely employed for taxonomic profiling of microbial species and completeness evaluation of metagenome-assembled genomes in shotgun metagenomic studies5,6,7,8,9.

A widely used marker gene is the 16S rRNA gene, which is thought to be single-copy while shown otherwise and is thus not a perfect USCG10. 16S rRNA gene has been used as the gold standard in amplicon sequencing and demonstrated its power for taxonomic profiling in various metagenomic studies4,11,12,13,14. It is the most sequenced gene, with its sequences often the only ones we know about a species. Despite its indisputable value and popularity, the 16S rRNA gene may have multiple copies in an unknown genome10. Moreover, its sequences might be too conserved to distinguish certain species and/or correctly measure the species divergence4. It is thus natural to consider other USCGs in evolution and metagenomics.

Previous studies have identified USCGs other than the 16S rRNA gene1,2,3,4,9,15,16,17,18,19. For instance, Ciccarelli et al. obtained 31 USCGs to reconstruct the tree of life across all three domains1. Later, Creevey et al. inferred 40 USCGs from the same set of 191 complete genomes16, which were then used for taxonomic profiling in metagenomics by the tools mOTUs and mOTUs25,6. Wu and Eisen developed the AMPHORA pipeline for automatic phylogenomic analysis of microbial species based on 31 USCGs4. These 31 USCGs were combined with 104 archaeal USCGs for phylogenomic studies19. Later, the same group developed the phyEco gene sets that comprised 40 USCGs for “all bacteria and archaea” and 114 USCGs for “all bacteria”3. Alneberg et al. analyzed all clusters of orthologous groups in 525 genomes and identified 36 USCGs that occurred in > 97% of the 525 genomes with a frequency < 1.03 per genome15. Lan et al. discovered 73 USCGs in > 90% of 1897 genomes to classify prokaryotic species2. Parks et al. extracted 120 USCGs present in > 90% of genomes and single-copy in > 95% of their present genomes9. These sets of USCGs are inferred from different groups of sequenced genomes with different purposes. They are thus different, although they do share a fraction of USCGs.

With multiple USCGs, in addition to studying evolution, classifying and taxonomic profiling of species, several studies attempted to investigate the strain diversity7,20,21,22,23. For instance, Quince et al. developed tools to de novo reconstruct bacterial strain genomes from shotgun metagenomic reads with the aforementioned 36 USCGs7,23. The StrainPhlAn tool infers bacterial strains directly from shotgun reads based on ~ 200 clade-specific marker genes22. Nayfach et al. proposed a pipeline for strain profiling in metagenomic datasets with 15 USCGs24. Such studies of bacterial strains are important for understanding drug resistance, microbial diversity, and the cure of various complex diseases25,26,27,28,29,30.

Despite many studies on USCGs, our understanding of USCGs is somewhat outdated. For instance, how much agreement is there between the USCG sets from different studies? For a set of USCGs, how universal are these USCGs in the ever-increasing number of sequenced genomes? When defining the above sets of USCGs, to our knowledge, only the first two sets of USCGs were selected by requiring their single-copy occurrence in all 191 genomes. None of the USCG sets is tested on the latest set of complete genomes to show their universalism. Moreover, it is also not clear whether certain USCGs can tell two species apart better than others. In addition, it is unknown whether a set of USCGs such as the 40 USCGs from Creevey et al.16 contain enough variations to distinguish different strains of a bacterial species.

To address these questions, especially the last one, in this study, we compared different sets of USCGs. We systematically studied how similar a USCG is in different species and strains with the latest set of complete bacterial genomes downloaded on October 27, 2021. We found that almost every USCG occurs multiple times in certain genomes, while more than 99.4% of USCGs are single-copy in a given genome. We also observed that USCGs together are good for separating species, while they cannot distinguish strains from the same bacterial species in general. Our study provides a more updated picture of USCGs and their potential applications in evolutionary and metagenomic studies.

Results

USCG sets are different in gene content while similar in gene function

We compared USCG sets from seven studies1,2,3,4,9,15,16 (Material and Methods, Supplementary Table S1). They were (1). 31 USCGs from Ciccarelli et al.; (2). 31 USCGs from Wu and Eisen; (3). 40 USCGs from Creevey et al.; (4). 40 USCGs from Wu et al. for “all” bacteria and archaea; (5). 36 USCGs from Alneberg et al.; (6). 73 USCGs from Lan et al.; and (7). 120 USCGs from Parks et al. These USCG sets require different universalism across all genomes and uniqueness in individual genomes. For instance, USCGs in the last two sets occurred in only > 90% of the genomes, while that was at least > 97% in the third to the fifth sets.

We found that even the USCG sets from the same research group could be very different (Fig. 1A). For instance, only 18 (77%) of the 31 USCGs from Wu and Eisen were shared with the 40 USCGs from Wu et al. Both sets were inferred by the same research group, with the latter inferred from a much larger number of genomes3,4. Their difference thus implied the significant effect of the genomes used to infer these USCGs. On the other hand, the USCG sets from the same research group could be highly consistent as well. For instance, the USCG set from Creevey et al. was a superset of the USCG set from Ciccarelli et al., as the same research group inferred them with the same genomes and a slightly different strategy. However, the additional nine USCGs from Creevey et al. also demonstrated the effect of such different strategies on defining USCGs.

Figure 1
figure 1

USCG sets and functional categories. (A) Overlap of the seven USCG sets. (B) The percentage of the functional categories of the USCGs in the seven sets. For each set, the percent of USCGs from the functional categories J, L, H, O, F, and others is shown in order.

The USCG sets from different research groups differed more (Supplementary Table S2). For instance, only 14 (45%) of the 31 USCGs from Ciccarelli et al. were shared with those from Wu and Eisen. Again, the difference in these two sets corroborated the effect of different strategies and different genomes. When we considered the two sets refined later by the corresponding research groups, the third and fourth sets, 28 (70%) of the 40 USCGs in these two sets were shared. Although the USCGs in these two sets were still drastically different, they shared much more USCGs than their earlier versions, likely because the number of genomes was large enough to choose a more representative set of genomes by Wu et al. Compared with the sets from Alneberg et al., Lan et al., and Parks et al., the USCGs from Creevey et al. was at least comparable with those from Wu et al., if not better. Overall, the USCG set from Creevey et al. is likely as reliable as any other set, if not more reliable, because of its stricter requirement of universalism and uniqueness, the similar evolutionary trajectory of these USCGs, and likely more representative genomes (the resulted USCG set has a good overlap with other sets that used much more genomes).

Despite the difference in the number and members of USCGs, the function of the USCGs in each of the above sets was quite consistent (Fig. 1B)31,32,33,34,35. By checking the functional annotations at https://www.ncbi.nlm.nih.gov/research/cog, we found that the vast majority of USCGs in every set are annotated with the functional category J (Translation, ribosomal structure, and biogenesis). The remaining USCGs were annotated with other translation and metabolism related categories, such as L (Replication, recombination, and repair), H (Coenzyme transport and metabolism), F (Nucleotide transport and metabolism), O (Posttranslational modification, protein turnover, chaperones), etc. Each category other than J was annotated with much fewer USCGs, from one to a handful of USCGs, compared with several dozen for J. Overall, the USCG sets are enriched with functions related to translation and metabolism, no matter which USCG set is concerned about. For instance, among the 40 USCGs from Wu et al., 35 are involved in the translation process, while the remaining five are related to the cellular metabolic process3.

USCGs are indeed universal across species and unique in individual genomes

To our knowledge, the number of genomes used to infer the above USCG sets was no more than 2000 except those from Parks et al. With more than 25,000 complete bacterial genomes at National Center for Biotechnology Information (NCBI), it was unclear whether the USCGs were still universal and unique. We thus studied the occurrence of the USCG sets in the latest set of complete genomes available on October 27, 2021 (Material and Methods). We focused on the sets from Creevey et al., Wu et al., and Alneberg et al. below because they are widely used in metagenomics and more updated than their previous versions (the first and third set). Moreover, they have more stringent criteria for universalism and uniqueness than the sets from Lan et al. and Parks et al.2,9. In the following, we presented our study on the above three sets.

We found that the 40 USCGs from Creevey et al. were universally distributed in almost each of the 25,271 genomes (Table 1, Supplementary Table S3, Material and Methods). For every USCG, it occurred in at least 97.7% of the 25,271 complete genomes, with the USCG Signal recognition particle GTPase occurring in the least genomes. For genomes without a copy of a USCG, we applied BLAST with an arbitrary cutoff of 1E-15 to search for this USCG. BLAST could identify a copy with this cutoff in most of these missed genomes. A tiny fraction of the genomes still did not have a USCG copy, likely due to the cutoff, the quality of the genomes, and the imperfectness of the gene retrieval methods used. Despite these limitations, each USCG occurred in at least 99.7% of the genomes. Of note, the number of USCGs occurring in each genome also showed the universal distribution of these USCGs (Supplementary Figure S1A). The mean number of USCGs occurring in each genome was 39.6. In other words, almost each USCG occurred in every genome, indicating the universalism of the USCGs from Creevey et al.

Table 1 Universalism and uniqueness of USCGs.

We also observed that each of the 40 USCGS from Creevey et al. indeed had only one copy in almost every genome (Table 1). The mean and median percentage of the 25,271 genomes with only one copy of a USCG were 98.5% and 99.0%, respectively. For individual USCGs, 24 (60%) of the 40 USCGs occurred more than once in < 0.2% of the genomes, while several other USCGs occurred more than once in > 2.0% of the genomes. Interestingly, these USCGs that occurred more than once in > 2% of the genomes were often not identified in other USCG sets. For instance, for the top ten USCGs occurring more than once, only one of them was also identified in the set from Wu et al., suggesting that these USCGs might have occurred multiple times in other studies and were thus filtered by those studies. Overall, the average number of USCGs with only one copy in a genome was 39.4. This average number suggests that almost all USCGs have only one copy in almost all complete bacterial genomes. Note that rarely is a genome with multiple copies of a USCG for multiple USCGs (Supplementary Figure S1B).

We further studied the universalism and uniqueness of the USCG sets from Wu et al. and Alneberg et al. (Supplementary Table S3). Compared with the set from Creevey et al., the other two sets had a similar median (Fig. 2). For instance, the median of the universalism was 99.3% for all three sets, and the corresponding median of the uniqueness was 99.0%, 99.1%, and 99.1%. However, there were more individual USCGs in the other two sets that were not so universal and unique as those from Creevey et al. It is thus evident that the above 40 USCGs from Creevey et al. might be a better USCG set.

Figure 2
figure 2

The universalism and uniqueness of the three USCG sets. Universalism means how many percent of genomes has at least a copy of a USCG. Uniqueness refers to the percentage of genomes with only one copy of a USCG.

USCGs separate different species well, but not strains of most species

With the USCGs indeed universal and unique in bacterial genomes, we investigated how well they together could distinguish two bacterial species in the same genus and whether they together are enough to set two bacterial strains of the same bacterial species apart. To address these questions, we studied all 440 genera with at least two sequenced species genomes and all 747 bacterial species with at least two sequenced strain genomes in the 25,271 complete genomes (Material and Methods). We found that at least the 40 USCGs from Creevey et al. together were enough to distinguish almost all pairs of species in the same genus. However, they could not separate pairs of strains from most species (Fig. 3).

Figure 3
figure 3

The cumulative distribution of the PID of species pairs and strain pairs. (A) Species pairs. (B) Strain pairs.

With the Creevey et al. USCG set, we studied the similarity of 40 pairs of corresponding USCGs in every two species from the same genus for all 440 genera with at least two complete genomes (Supplementary Table S4, Figure S2A). We measured the similarity by the percentage of identity (PID) in the alignment of the 40 pairs of USCGs from a pair of species because the PID determines how similar the shotgun metagenomics reads from these USCGs are. The PID had a mean and median of 90.2% and 90.6%. However, about 2.3% of species pairs had a PID larger than 99.5%, which was the limit that the current computational tools could distinguish two genomes7,36,37,38,39,40. Given that the total nucleotide length of the 40 USCGs is about 33,852 base pairs, the 99.5% PID means there are at least 170 varied positions in the 40 USCGs required to separate the two species. Therefore, the 40 USCGs could tell most pairs of species apart, which may leave about 2.3% of the pairs of species indistinguishable.

With the same USCG set, we also studied the similarity of 40 pairs of corresponding USCGs in every two strains from the same species for all 747 species with at least two complete genomes (Fig. 3, Supplementary Table S5, Figure S2B). The PID in a pair of strains had a mean of 99.8% and a median of 99.9%. About 91.2% of the PID scores between pairs of strains from the same species were already larger than 99.5% (Fig. 3). In other words, the 40 USCGs alone may have difficulty in separating more than 91.2% of pairs of bacterial strains.

We also studied how well the USCG set from Wu et al. and from Alneberg et al. separated species and strains (Fig. 3). Compared with the USCG set from Creevey et al., these two sets had a slightly lower performance, especially in terms of separating strains of the same species. Many more pairs of species and strains had close to 100% PID based on the Wu et al. set compared with the other two sets. Overall, the USCG set from Creevey et al. is likely to perform better in distinguishing species and strains.

Discussion

We systematically studied the occurrence of USCGs in the most updated list of complete bacterial genomes. We showed that USCGs were universally distributed and uniquely present in almost every bacterial species. The 40 USCGs from Creevey et al. together could distinguish different species from the same genus and different strains from the same species well. Barely was there one individual USCG that could do so alone.

There are inevitably mistakes in the annotation of the complete genomes. For instance, the MSHR1132 genome was annotated as the genome of a Staphylococcus aureus strain previously41 and is now classified as the genome of a Staphylococcus argenteus species in the current NCBI. Such an incorrect annotation may change the similarity of individual species and strain pairs. However, it does not affect our conclusions about universalism, uniqueness, and species and strain pair similarity, because universalism and uniqueness are based on individual genomes, and the species and strain pair similarity are based on the upper bound of the similarities. The incorrect annotation is likely to affect low similar pairs. Moreover, the incorrect number of annotations is too small to affect the conclusions made in this study.

There are many other sets of USCGs not compared here. Wrighton et al. used 30 USCGs to measure the completeness of the reconstructed 49 bacterial genomes18. Haroon et al. employed the Amphora2 marker genes, including the 31 USCGs for bacteria from Wu and Eisen and 104 archaeal genes, to measure the completeness of the assembled genomes17. Rinke et al. estimated the completeness of the assembled genomes based on 139 bacterial and 162 archaeal conserved marker genes42. Wu et al. reported 114 USCGs for “all” bacteria in addition to the 40 USCGs compared here3. We did not compare these sets of USCGs because they were more for other purposes and not as universal and unique as those compared in this study.

We found that three of the seven sets of USCGs were universal and unique. Other sets of USCGs we studied, except the one from Ciccarelli et al., were unlikely to be as universal and unique as the three sets because their selection was not as strict. For instance, Lan et al. and Parks et al. required USCGs to occur in only more than 90% of the genomes instead of all genomes2,9. The Ciccarelli et al. set was already included in the set from Creevey et al.

For the complete genomes, we demonstrated that USCGs could distinguish different species but not strains. This was based on the fact that the PID of 99.5% is the limit of current tools to separate similar sequences. When the strain similarity is higher, say more than 99.5%, we are left with only a few dozen variable loci to distinguish different strains. In this case, with the available USCG sets, it may be challenging to distinguish them, even if possible. Note that all aforementioned USCG sets are not created to distinguish bacterial species and strains in microbiomes. In the future, one may hope to generate new USCG sets for this purpose by developing novel methods and integrating different sources of information43,44,45,46,47,48,49. Alternatively, one may consider overlapping the assembled contigs or metagenome-assembled-genomes with the USCGs so that more polymorphic sites are available to distinguish strains7,23.

Material and methods

Seven sets of USCGS

We studied seven sets of USCGs: the 31 set from Ciccarelli et al.1, the 31 set from Wu and Eisen4, the 40 set from Creevey et al.16, the 40 USCGs from Wu et al.3, the 36 USCGs from Alneberg et al.15, the 73 USCGs from Lan et al.2, and the 120 USCGs from Parks et al.9. We focused on these sets for the separation of similar strains, not for the completeness of the assembled genomes in metagenomics. We converted the gene name of each USCG into its clusters of orthologous groups (COG) name with the information from https://ftp.ncbi.nih.gov/pub/COG/COG2020/data/cog-20.def.tab. In this way, we measured the overlap of two USCG sets by the number of their shared COG gene names.

The complete bacterial genomes

We studied the complete bacterial genomes at NCBI. We retrieved these bacterial genomes on October 27, 2021, from the NCBI website (https://www.ncbi.nlm.nih.gov/genome/browse#!/prokaryotes/) by selecting the filtering criteria as the ‘Bacteria’ Kingdom and the ‘Complete’ Assembly level. In total, we obtained 25,271 complete bacterial genomes. We also downloaded the protein sequences in each genome from the NCBI FTP site at https://ftp.ncbi.nlm.nih.gov/genomes/all/.

Identification of USCGs in a genome

To identify the corresponding sequences of a USCG in a given genome, we used the tool fetchMG for USCGs from Ciccarelli et al. and Creevey et al.6. This tool is based on hidden Markov models to identify the occurrence of a USCG in input sequences. It was used previously to identify and quantify the microbial species in assembled contigs from shotgun metagenomic reads. We used the following command to fetch the USCG sequence in a genome: ./fetchMG.pl -m extraction -p proteins.fasta -o output. Since we input the aforementioned protein sequences in a genome to fetchMG, the fetchMG output includes the full-length protein sequences that are likely the copies of USCGs in this genome. The fetchMG tool might be imperfect, miss certain copies of the USCGs from these two sets in a genome, and does not work for USCGs not in these two sets but in other sets. Therefore, we also applied BLAST to retrieve copies of USCGs that were not from the above two sets or USCGs from the two sets but without an identified copy in a genome. The command we used was: psiblast -in_msa marker_aligned.fasta -db protein.fasta -evalue 1E-15 -num_alignments num_keep. When we applied BLAST, the multiple sequence alignments used as input (the parameter -in_msa marker_aligned.fasta) was the sequences downloaded from https://ftp.ncbi.nih.gov/pub/COG/COG2020/data/fasta/ and aligned with the mafft tool50 for all other USCG set except the Wu et al. set. The input multiple sequence alignment for the Wu et al. set was downloaded from their paper3. The E-value cutoff used was 1E-15 as previously3,4,19. We considered the fetched protein sequences as the corresponding USCG sequences in the genomes. Note that neither fetchMG nor BLAST is perfect for identifying copies of USCGs in a genome, as they sacrifice accuracy for speed. However, the identified copies of USCGs are likely to be reliable. We obtained the functional categories of each USCG from https://www.ncbi.nlm.nih.gov/research/cog.

The USCG similarity measurement

We aligned the extracted USCG protein sequences for every USCG with the tool MAFFT50. In other words, we did multiple alignments of the USCG sequences extracted from each genome under consideration for each USCG. In order to assess the similarity of the USCG sequences, we calculated the score based on the blosum62 matrix, the matched number, the mismatch number, the indel number, and the PID. We chose the PID to evaluate the similarity of each pair of USCG sequences because it directly relates to the shotgun metagenomic reads mapped to different USCG sequences. To measure the PID, we obtained the two corresponding sequences in the alignment and removed the loci with an indel versus an indel. We then calculated the PID as the ratio of the number of matched positions to the number of all remaining aligned positions.