A revisit to universal single-copy genes in bacterial genomes

Universal single-copy genes (USCGs) are widely used for species classification and taxonomic profiling. Despite many studies on USCGs, our understanding of USCGs in bacterial genomes might be out of date, especially how different the USCGs are in different studies, how well a set of USCGs can distinguish two bacterial species, whether USCGs can separate different strains of a bacterial species, to name a few. To fill the void, we studied USCGs in the most updated complete bacterial genomes. We showed that different USCG sets are quite different while coming from highly similar functional categories. We also found that although USCGs occur once in almost all bacterial genomes, each USCG does occur multiple times in certain genomes. We demonstrated that USCGs are reliable markers to distinguish different species while they cannot distinguish different strains of most bacterial species. Our study sheds new light on the usage and limitations of USCGs, which will facilitate their applications in evolutionary, phylogenomic, and metagenomic studies.


Scientific Reports
| (2022) 12:14550 | https://doi.org/10.1038/s41598-022-18762-z www.nature.com/scientificreports/ Despite many studies on USCGs, our understanding of USCGs is somewhat outdated. For instance, how much agreement is there between the USCG sets from different studies? For a set of USCGs, how universal are these USCGs in the ever-increasing number of sequenced genomes? When defining the above sets of USCGs, to our knowledge, only the first two sets of USCGs were selected by requiring their single-copy occurrence in all 191 genomes. None of the USCG sets is tested on the latest set of complete genomes to show their universalism. Moreover, it is also not clear whether certain USCGs can tell two species apart better than others. In addition, it is unknown whether a set of USCGs such as the 40 USCGs from Creevey et al. 16 contain enough variations to distinguish different strains of a bacterial species.
To address these questions, especially the last one, in this study, we compared different sets of USCGs. We systematically studied how similar a USCG is in different species and strains with the latest set of complete bacterial genomes downloaded on October 27, 2021. We found that almost every USCG occurs multiple times in certain genomes, while more than 99.4% of USCGs are single-copy in a given genome. We also observed that USCGs together are good for separating species, while they cannot distinguish strains from the same bacterial species in general. Our study provides a more updated picture of USCGs and their potential applications in evolutionary and metagenomic studies.

Results
USCG sets are different in gene content while similar in gene function. We compared USCG sets from seven studies [1][2][3][4]9,15,16 ( (7). 120 USCGs from Parks et al. These USCG sets require different universalism across all genomes and uniqueness in individual genomes. For instance, USCGs in the last two sets occurred in only > 90% of the genomes, while that was at least > 97% in the third to the fifth sets.
We found that even the USCG sets from the same research group could be very different (Fig. 1A). For instance, only 18 (77%) of the 31 USCGs from Wu and Eisen were shared with the 40 USCGs from Wu et al. Both sets were inferred by the same research group, with the latter inferred from a much larger number of genomes 3,4 . Their difference thus implied the significant effect of the genomes used to infer these USCGs. On the other hand, the USCG sets from the same research group could be highly consistent as well. For instance, the USCG set from Creevey et al. was a superset of the USCG set from Ciccarelli et al., as the same research group inferred them with the same genomes and a slightly different strategy. However, the additional nine USCGs from Creevey et al. also demonstrated the effect of such different strategies on defining USCGs.
The USCG sets from different research groups differed more (Supplementary Table S2). For instance, only 14 (45%) of the 31 USCGs from Ciccarelli et al. were shared with those from Wu and Eisen. Again, the difference in these two sets corroborated the effect of different strategies and different genomes. When we considered the two sets refined later by the corresponding research groups, the third and fourth sets, 28 (70%) of the 40 USCGs in these two sets were shared. Although the USCGs in these two sets were still drastically different, they shared much more USCGs than their earlier versions, likely because the number of genomes was large enough to choose a more representative set of genomes by Wu   www.nature.com/scientificreports/ of its stricter requirement of universalism and uniqueness, the similar evolutionary trajectory of these USCGs, and likely more representative genomes (the resulted USCG set has a good overlap with other sets that used much more genomes).
Despite the difference in the number and members of USCGs, the function of the USCGs in each of the above sets was quite consistent (Fig. 1B) [31][32][33][34][35] . By checking the functional annotations at https:// www. ncbi. nlm. nih. gov/ resea rch/ cog, we found that the vast majority of USCGs in every set are annotated with the functional category J (Translation, ribosomal structure, and biogenesis). The remaining USCGs were annotated with other translation and metabolism related categories, such as L (Replication, recombination, and repair), H (Coenzyme transport and metabolism), F (Nucleotide transport and metabolism), O (Posttranslational modification, protein turnover, chaperones), etc. Each category other than J was annotated with much fewer USCGs, from one to a handful of USCGs, compared with several dozen for J. Overall, the USCG sets are enriched with functions related to translation and metabolism, no matter which USCG set is concerned about. For instance, among the 40 USCGs from Wu et al., 35 are involved in the translation process, while the remaining five are related to the cellular metabolic process 3 .

USCGs are indeed universal across species and unique in individual genomes.
To our knowledge, the number of genomes used to infer the above USCG sets was no more than 2000 except those from Parks et al. With more than 25,000 complete bacterial genomes at National Center for Biotechnology Information (NCBI), it was unclear whether the USCGs were still universal and unique. We thus studied the occurrence of the USCG sets in the latest set of complete genomes available on October 27, 2021 (Material and Methods). We focused on the sets from Creevey et al., Wu et al., and Alneberg et al. below because they are widely used in metagenomics and more updated than their previous versions (the first and third set). Moreover, they have more stringent criteria for universalism and uniqueness than the sets from Lan et al. and Parks et al. 2,9 . In the following, we presented our study on the above three sets.
We found that the 40 USCGs from Creevey et al. were universally distributed in almost each of the 25,271 genomes (Table 1, Supplementary Table S3, Material and Methods). For every USCG, it occurred in at least 97.7% of the 25,271 complete genomes, with the USCG Signal recognition particle GTPase occurring in the least genomes. For genomes without a copy of a USCG, we applied BLAST with an arbitrary cutoff of 1E-15 to search for this USCG. BLAST could identify a copy with this cutoff in most of these missed genomes. A tiny fraction of the genomes still did not have a USCG copy, likely due to the cutoff, the quality of the genomes, and the imperfectness of the gene retrieval methods used. Despite these limitations, each USCG occurred in at least 99.7% of the genomes. Of note, the number of USCGs occurring in each genome also showed the universal distribution of these USCGs (Supplementary Figure S1A). The mean number of USCGs occurring in each genome was 39.6. In other words, almost each USCG occurred in every genome, indicating the universalism of the USCGs from Creevey et al.
We also observed that each of the 40 USCGS from Creevey et al. indeed had only one copy in almost every genome ( Table 1). The mean and median percentage of the 25,271 genomes with only one copy of a USCG were 98.5% and 99.0%, respectively. For individual USCGs, 24 (60%) of the 40 USCGs occurred more than once in < 0.2% of the genomes, while several other USCGs occurred more than once in > 2.0% of the genomes. Interestingly, these USCGs that occurred more than once in > 2% of the genomes were often not identified in other USCG sets. For instance, for the top ten USCGs occurring more than once, only one of them was also identified in the set from Wu et al., suggesting that these USCGs might have occurred multiple times in other studies and were thus filtered by those studies. Overall, the average number of USCGs with only one copy in a genome was 39.4. This average number suggests that almost all USCGs have only one copy in almost all complete bacterial genomes. Note that rarely is a genome with multiple copies of a USCG for multiple USCGs (Supplementary Figure S1B).
We further studied the universalism and uniqueness of the USCG sets from Wu et al. and Alneberg et al. (Supplementary Table S3). Compared with the set from Creevey et al., the other two sets had a similar median (Fig. 2). For instance, the median of the universalism was 99.3% for all three sets, and the corresponding median of the uniqueness was 99.0%, 99.1%, and 99.1%. However, there were more individual USCGs in the other two sets that were not so universal and unique as those from Creevey et al. It is thus evident that the above 40 USCGs from Creevey et al. might be a better USCG set.
USCGs separate different species well, but not strains of most species. With the USCGs indeed universal and unique in bacterial genomes, we investigated how well they together could distinguish two bac- www.nature.com/scientificreports/ terial species in the same genus and whether they together are enough to set two bacterial strains of the same bacterial species apart. To address these questions, we studied all 440 genera with at least two sequenced species genomes and all 747 bacterial species with at least two sequenced strain genomes in the 25,271 complete genomes (Material and Methods). We found that at least the 40 USCGs from Creevey et al. together were enough to distinguish almost all pairs of species in the same genus. However, they could not separate pairs of strains from most species (Fig. 3).
With the Creevey et al. USCG set, we studied the similarity of 40 pairs of corresponding USCGs in every two species from the same genus for all 440 genera with at least two complete genomes (Supplementary Table S4, Figure S2A). We measured the similarity by the percentage of identity (PID) in the alignment of the 40 pairs of USCGs from a pair of species because the PID determines how similar the shotgun metagenomics reads from these USCGs are. The PID had a mean and median of 90.2% and 90.6%. However, about 2.3% of species pairs had a PID larger than 99.5%, which was the limit that the current computational tools could distinguish two genomes 7,36-40 . Given that the total nucleotide length of the 40 USCGs is about 33,852 base pairs, the 99.5% PID means there are at least 170 varied positions in the 40 USCGs required to separate the two species. Therefore, the 40 USCGs could tell most pairs of species apart, which may leave about 2.3% of the pairs of species indistinguishable.
With the same USCG set, we also studied the similarity of 40 pairs of corresponding USCGs in every two strains from the same species for all 747 species with at least two complete genomes (Fig. 3, Supplementary  Table S5, Figure S2B). The PID in a pair of strains had a mean of 99.8% and a median of 99.9%. About 91.2% of the PID scores between pairs of strains from the same species were already larger than 99.5% (Fig. 3). In other words, the 40 USCGs alone may have difficulty in separating more than 91.2% of pairs of bacterial strains.    (Fig. 3). Compared with the USCG set from Creevey et al., these two sets had a slightly lower performance, especially in terms of separating strains of the same species. Many more pairs of species and strains had close to 100% PID based on the Wu et al. set compared with the other two sets. Overall, the USCG set from Creevey et al. is likely to perform better in distinguishing species and strains.

Discussion
We systematically studied the occurrence of USCGs in the most updated list of complete bacterial genomes. We showed that USCGs were universally distributed and uniquely present in almost every bacterial species. The 40 USCGs from Creevey et al. together could distinguish different species from the same genus and different strains from the same species well. Barely was there one individual USCG that could do so alone.
There are inevitably mistakes in the annotation of the complete genomes. For instance, the MSHR1132 genome was annotated as the genome of a Staphylococcus aureus strain previously 41 and is now classified as the genome of a Staphylococcus argenteus species in the current NCBI. Such an incorrect annotation may change the similarity of individual species and strain pairs. However, it does not affect our conclusions about universalism, uniqueness, and species and strain pair similarity, because universalism and uniqueness are based on individual genomes, and the species and strain pair similarity are based on the upper bound of the similarities. The incorrect annotation is likely to affect low similar pairs. Moreover, the incorrect number of annotations is too small to affect the conclusions made in this study.
There are many other sets of USCGs not compared here. Wrighton 3 . We did not compare these sets of USCGs because they were more for other purposes and not as universal and unique as those compared in this study.
We found that three of the seven sets of USCGs were universal and unique. Other sets of USCGs we studied, except the one from Ciccarelli et al., were unlikely to be as universal and unique as the three sets because their selection was not as strict. For the complete genomes, we demonstrated that USCGs could distinguish different species but not strains. This was based on the fact that the PID of 99.5% is the limit of current tools to separate similar sequences. When the strain similarity is higher, say more than 99.5%, we are left with only a few dozen variable loci to distinguish different strains. In this case, with the available USCG sets, it may be challenging to distinguish them, even if possible. Note that all aforementioned USCG sets are not created to distinguish bacterial species and strains in microbiomes. In the future, one may hope to generate new USCG sets for this purpose by developing novel methods and integrating different sources of information [43][44][45][46][47][48][49] . Alternatively, one may consider overlapping the assembled contigs or metagenome-assembled-genomes with the USCGs so that more polymorphic sites are available to distinguish strains 7,23 .

Material and methods
Seven sets of USCGS. We studied seven sets of USCGs: the 31 set from Ciccarelli et al. 1 , the 31 set from Wu and Eisen 4 , the 40 set from Creevey et al. 16 , the 40 USCGs from Wu et al. 3 , the 36 USCGs from Alneberg et al. 15 , the 73 USCGs from Lan et al. 2 , and the 120 USCGs from Parks et al. 9 . We focused on these sets for the separation of similar strains, not for the completeness of the assembled genomes in metagenomics. We converted the gene name of each USCG into its clusters of orthologous groups (COG) name with the information from https:// ftp. ncbi. nih. gov/ pub/ COG/ COG20 20/ data/ cog-20. def. tab. In this way, we measured the overlap of two USCG sets by the number of their shared COG gene names.
The complete bacterial genomes. We studied the complete bacterial genomes at NCBI. We retrieved these bacterial genomes on October 27, 2021, from the NCBI website (https:// www. ncbi. nlm. nih. gov/ genome/ browse# !/ proka ryotes/) by selecting the filtering criteria as the 'Bacteria' Kingdom and the 'Complete' Assembly level. In total, we obtained 25,271 complete bacterial genomes. We also downloaded the protein sequences in each genome from the NCBI FTP site at https:// ftp. ncbi. nlm. nih. gov/ genom es/ all/.

Identification of USCGs in a genome.
To identify the corresponding sequences of a USCG in a given genome, we used the tool fetchMG for USCGs from Ciccarelli et al. and Creevey et al. 6 . This tool is based on hidden Markov models to identify the occurrence of a USCG in input sequences. It was used previously to identify and quantify the microbial species in assembled contigs from shotgun metagenomic reads. We used the following command to fetch the USCG sequence in a genome: ./fetchMG.pl -m extraction -p proteins.fasta -o output. Since we input the aforementioned protein sequences in a genome to fetchMG, the fetchMG output includes the full-length protein sequences that are likely the copies of USCGs in this genome. The fetchMG tool might be imperfect, miss certain copies of the USCGs from these two sets in a genome, and does not work for USCGs not in these two sets but in other sets. Therefore, we also applied BLAST to retrieve copies of USCGs that were not from the above two sets or USCGs from the two sets but without an identified copy in a genome. The command we used was: psiblast -in_msa marker_aligned.fasta -db protein.fasta -evalue 1E-15 -num_alignments num_keep. When we applied BLAST, the multiple sequence alignments used as input (the parameter -in_msa marker_aligned.fasta) was the sequences downloaded from https:// ftp. ncbi. nih. gov/ pub/ COG/ COG20  3 . The E-value cutoff used was 1E-15 as previously 3,4,19 . We considered the fetched protein sequences as the corresponding USCG sequences in the genomes. Note that neither fetchMG nor BLAST is perfect for identifying copies of USCGs in a genome, as they sacrifice accuracy for speed. However, the identified copies of USCGs are likely to be reliable. We obtained the functional categories of each USCG from https:// www. ncbi. nlm. nih. gov/ resea rch/ cog.
The USCG similarity measurement. We aligned the extracted USCG protein sequences for every USCG with the tool MAFFT 50 . In other words, we did multiple alignments of the USCG sequences extracted from each genome under consideration for each USCG. In order to assess the similarity of the USCG sequences, we calculated the score based on the blosum62 matrix, the matched number, the mismatch number, the indel number, and the PID. We chose the PID to evaluate the similarity of each pair of USCG sequences because it directly relates to the shotgun metagenomic reads mapped to different USCG sequences. To measure the PID, we obtained the two corresponding sequences in the alignment and removed the loci with an indel versus an indel. We then calculated the PID as the ratio of the number of matched positions to the number of all remaining aligned positions.