One-third of all protein-coding genes from bacterial genomes cannot be annotated with a function. Here, to investigate the functions of these genes, we present genome-wide mutant fitness data from 32 diverse bacteria across dozens of growth conditions. We identified mutant phenotypes for 11,779 protein-coding genes that had not been annotated with a specific function. Many genes could be associated with a specific condition because the gene affected fitness only in that condition, or with another gene in the same bacterium because they had similar mutant phenotypes. Of the poorly annotated genes, 2,316 had associations that have high confidence because they are conserved in other bacteria. By combining these conserved associations with comparative genomics, we identified putative DNA repair proteins; in addition, we propose specific functions for poorly annotated enzymes and transporters and for uncharacterized protein families. Our study demonstrates the scalability of microbial genetics and its utility for improving gene annotations.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
We thank V. Lo, W. Shao, and K. Keller for technical assistance with the Fitness Browser website. Sequencing was performed at: the Vincent J. Coates Genomics Sequencing Laboratory (University of California at Berkeley), supported by NIH S10 Instrumentation Grants S10RR029668, S10RR027303, and OD018174; the DOE Joint Genome Institute; the College of Biological Sciences UCDNA Sequencing Facility (UC Davis); and the Institute for Genomics Sciences (University of Maryland). Studies of novel isolates were conducted by ENIGMA and were supported by the Office of Science, Office of Biological and Environmental Research of the US Department of Energy, under contract DE-AC02-05CH11231. The other data collection was supported by Laboratory Directed Research and Development (LDRD) funding from Berkeley Laboratory, provided by the Director, Office of Science, of the US Department of Energy under contract DE-AC02-05CH11231 and a Community Science Project from the Joint Genome Institute to M.J.B., J.B., A.P.A., and A.M.D. The work conducted by the US Department of Energy Joint Genome Institute, a DOE Office of Science User Facility, is supported by the Office of Science of the US Department of Energy under contract no. DE-AC02-05CH11231.
Extended data figures and tables
a, The utilization of d-alanine or cytosine by Azospirillum brasilense Sp245. Each point shows the fitness of a gene in the two conditions. The data are the average of two biological replicates for each nitrogen source. Amino acid synthesis genes were identified using the top-level role in TIGRFAMs. The genes for d-alanine utilization were a d-amino acid dehydrogenase (AZOBR_RS08020), an ABC transporter operon (AZOBR_RS08235:RS08260), and a LysR family regulator (AZOBR_RS21915). The genes for cytosine utilization were cytosine deaminase (AZOBR_RS31895) and two ABC transporter operons (AZOBR_RS06950:RS06965 and AZOBR_RS31875:RS31885). b, Zinc stress in S. loihica PV-4. We compare fitness in rich medium with added zinc(II) sulfate to fitness in plain rich medium. The LB data are the average of two biological replicates. The highlighted genes include a putative heavy metal efflux pump (CzcCBA or Shew_3358:Shew_3356), a hypothetical protein at the beginning of the czc operon (CzcX), a zinc-responsive regulator (ZntR or Shew_3411), and another heavy metal efflux gene related to arsP or DUF318 (Shew_3410). CzcX lacks homology to any characterized protein, but homologues in other strains of Shewanella are also specifically important for resisting zinc stress. In both panels, the lines show x = 0, y = 0, and x = y.
We categorized proteins in our data set by their type of annotation or by whether they have homologues in the same genome (‘paralogues’). For each category, we show the fraction of genes that have statistically significant phenotypes, and more specifically the fractions that have strong phenotypes (|fitness| > 2 and statistically significant) or are significantly detrimental to fitness (fitness > 0). Genes with high or moderate similarity to another gene in the same genome (paralogues with alignment score above 30% of the self-alignment score) were less likely to have a phenotype (25% versus 32%, P < 10⁻¹⁵, Fisher’s exact test), which is likely to reflect genetic redundancy.
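The comparison of phenotype rates between paralogues and other genes can be illustrated with a two-sided Fisher's exact test on a 2×2 table. The following is a minimal sketch in pure Python, not the analysis code used for this figure. The function name and the gene counts in the example are hypothetical: the legend reports only the percentages (25% versus 32%), so the counts below are placeholders chosen to match those percentages.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are the two gene groups; columns are 'has phenotype' and
    'no phenotype'. The p-value sums the hypergeometric probabilities of
    all tables (with the same margins) that are no more likely than the
    observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(k):
        # Probability that the top-left cell equals k, given fixed margins.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    p_value = sum(
        prob(k) for k in range(k_min, k_max + 1) if prob(k) <= p_obs * (1 + 1e-9)
    )
    return min(1.0, p_value)


# Hypothetical counts (NOT from the paper): 1,000 genes per group,
# 25% of paralogues and 32% of other genes with a significant phenotype.
p = fisher_exact_two_sided(250, 750, 320, 680)
print(f"two-sided P = {p:.3g}")
```

With the much larger gene counts in the actual data set, the same difference in proportions would give a far smaller P value than these placeholder counts do.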
We compared the growth of a gene deletion strain and the wild-type bacterium under varying cisplatin concentrations. We show all replicate growth curves for each genotype. We believe the higher overall growth for some of the wild-type experiments (for example, top middle) is random. We observe this phenomenon consistently for some bacteria and we speculate that this is due to varying oxygen content across the microplate. a, E. coli radD (n = 6 independent experiments per strain). b, D. shibae Dshi_2244 (n = 3 independent experiments for wild-type and n = 6 independent experiments for the mutant). c, Phaeobacter inhibens PGA1_c08960 (n = 4 independent experiments for wild-type and n = 6 independent experiments for the mutant). Dshi_2244 and PGA1_c08960 are orthologues of MmcB (DUF1052) from C. crescentus (ref. 24).
Extended Data Fig. 4 EndA, DUF3584, and a FAN1-like VRR-NUC domain protein are important for cisplatin resistance.
As in Extended Data Fig. 3, we compare the cisplatin sensitivity of a gene deletion mutant to that of the wild-type bacterium. a, E. coli endA knockout. cycA encodes an amino acid transporter, is not expected to have a phenotype on cisplatin, and is used as a control. Each growth curve is the average of 12 replicate wells and the dashed lines show 95% confidence intervals from the t-test. b, A deletion of S. oneidensis MR-1 SO4008, a member of the DUF3584 protein family (n = 6 independent experiments per strain). c, A deletion of P. stutzeri RCH2 Psest_2235 (n = 4 independent experiments per strain). Psest_1636 is not expected to be involved in DNA repair and is used here as a control. Psest_2235 is a FAN1-like VRR-NUC domain protein (ref. 25).
We assayed the growth of an E. coli endA− Keio collection deletion mutant carrying one of three different vectors: an empty vector with no insert (endA− + empty), a complementation vector carrying a wild-type copy of endA (endA− + endA), and a complementation vector with a mutant version of endA with an alanine at position 84 instead of histidine (endA− + mutant endA). A mutation of this conserved histidine residue in a close homologue from Vibrio vulnificus has been reported to eliminate nearly all nuclease catalytic activity (ref. 64). As a control, we assayed the wild-type, parental E. coli strain carrying a vector with no insert (wt + empty). We performed these growth assays on three separate microplates (Plate #1, #2, #3). n = 3 independent experiments per strain in Plate #1; n = 4 independent experiments per strain in Plates #2 and #3. We added 20 µg ml⁻¹ gentamicin to each assay to maintain selection for the plasmids (pBBR1-MCS5 and derivatives). Although the catalytic activity of EndA (endonuclease I) appears to be important for resisting cisplatin, it is not clear how EndA would be involved in DNA repair if it is located in the periplasm, as previously believed (refs. 65–67). We speculate that EndA relocates to the cytoplasm upon DNA damage or that EndA degrades broken DNA that enters the periplasm and would otherwise damage the membrane.
Growth comparison of gene deletion mutants in UPF0126 versus wild-type bacteria in minimal defined medium. a, SO1319 from S. oneidensis MR-1, with either ammonium chloride (n = 6 independent experiments per strain) or glycine as the sole source of nitrogen (n = 12 independent experiments per strain). b, PGA1_c00920 from P. inhibens, with glycine as the sole source of carbon (n = 8 independent experiments for wild-type and n = 16 independent experiments for the mutant). c, Psest_1636 from P. stutzeri RCH2, with either ammonium chloride (n = 4 independent experiments per strain) or glycine (n = 8 independent experiments per strain) as the sole source of nitrogen. The Psest_2235 deletion strain is used as a control and is not expected to have a phenotype in these conditions.
Extended Data Fig. 7 PGA1_c00920 partially rescues the glycine growth defect of an E. coli cycA mutant.
CycA is a glycine transporter from E. coli and a mutant in this gene has reduced uptake of glycine (ref. 37). We investigated whether a member of the UPF0126 protein family could rescue the glycine growth defect of an E. coli cycA deletion strain. We introduced different plasmids into the E. coli cycA Keio collection deletion background: an empty plasmid with no insert (cycA− + empty), a plasmid with a wild-type allele of the E. coli cycA gene (cycA− + cycA), and a plasmid with PGA1_c00920 from P. inhibens (cycA− + PGA1_c00920). We compared the growth of these strains and a wild-type E. coli control (wt + empty) in defined media with either ammonium chloride (n = 2 independent experiments per strain) or glycine as the sole source of nitrogen (n = 4 independent experiments per strain). PGA1_c00920 partially rescues the glycine-specific growth defect of the cycA− deletion strain.
Extended Data Fig. 8 Overexpression of members of protein family UPF0060 confers resistance to thallium.
We introduced three plasmids into wild-type E. coli: a plasmid control with no insert (Empty vector), a plasmid carrying RR42_RS34240 from C. basilensis 4G11, and a plasmid carrying Pf6N2E2_2547 from P. fluorescens FW300-N2E2. We assayed the growth of these strains in LB at 30 °C with varying concentrations of thallium(I) acetate (n = 6 independent experiments per strain). We added 50 µg ml⁻¹ kanamycin to each assay to maintain selection for the plasmids (pFAB2286 and derivatives). RR42_RS34240 and Pf6N2E2_2547 are members of the UPF0060 protein family.
We selected 2,593 hypothetical or vaguely annotated proteins from diverse bacterial species, compared them to the protein-coding genes for which we have fitness data (using protein BLAST), and identified potential orthologues as best hits that were homologous over at least 75% of each protein’s length. We show the fraction of these proteins that have a potential orthologue with each type of phenotype and that is above a given level of amino acid sequence similarity. Similarity was defined as the ratio of the alignment’s bit score to the score from aligning the query to itself.
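The orthologue criteria described above (best hit, homology over at least 75% of each protein's length, and similarity as the ratio of the alignment's bit score to the query's self-alignment score) can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline used in the study: the `Hit` record, its field names, and the numbers in the example are hypothetical, and "best hit" is taken here to mean the hit with the highest bit score.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """One protein BLAST hit (hypothetical record layout)."""
    subject_id: str
    query_len: int          # length of the query protein
    subject_len: int        # length of the subject protein
    query_aligned: int      # query residues covered by the alignment
    subject_aligned: int    # subject residues covered by the alignment
    bit_score: float        # bit score of this alignment
    query_self_score: float  # bit score of the query aligned to itself


def similarity_ratio(hit: Hit) -> float:
    """Alignment bit score divided by the query's self-alignment score."""
    return hit.bit_score / hit.query_self_score


def covers_both(hit: Hit, min_coverage: float = 0.75) -> bool:
    """True if the alignment spans at least min_coverage of each protein."""
    return (
        hit.query_aligned / hit.query_len >= min_coverage
        and hit.subject_aligned / hit.subject_len >= min_coverage
    )


def potential_orthologue(hits: List[Hit]) -> Optional[Hit]:
    """Return the best hit (highest bit score) if it passes the coverage test."""
    if not hits:
        return None
    best = max(hits, key=lambda h: h.bit_score)
    return best if covers_both(best) else None


# Made-up example values, for illustration only.
hits = [
    Hit("geneA", 400, 420, 360, 350, 300.0, 800.0),
    Hit("geneB", 400, 1000, 360, 400, 250.0, 800.0),
]
best = potential_orthologue(hits)
if best is not None:
    print(best.subject_id, similarity_ratio(best))
```

Normalizing by the self-alignment score puts proteins of different lengths on the same 0 to 1 scale, so a single similarity threshold can be applied across the whole set of queries.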
a, The effect of rescaling the cofitness values by the number of generations in six bacteria. For each of the six bacteria, we identified all pairs of protein-coding genes that were assigned to the same TIGR subrole, were more than 20 kb apart, and had fitness data. This gave 1,711–9,406 pairs per bacterium. We also selected a random subset of pairs that were assigned to different TIGR subroles, were more than 20 kb apart, and had fitness data (1,559–8,881 pairs per bacterium). For each pair, we compared the original cofitness values to the rescaled cofitness (computed from fitness values that were divided by the number of generations). b, The effect of averaging fitness scores from replicate experiments on the cofitness values.
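The rescaling in panel a can be sketched in a few lines. The sketch assumes that cofitness is the Pearson correlation between two genes' fitness values across experiments, which this legend does not define; the function names and the example values are hypothetical.

```python
import math
from typing import Optional, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def cofitness(
    fit1: Sequence[float],
    fit2: Sequence[float],
    generations: Optional[Sequence[float]] = None,
) -> float:
    """Cofitness of two genes (assumed here to be a Pearson correlation).

    If generations is given, each experiment's fitness values are first
    divided by the number of generations in that experiment.
    """
    if generations is not None:
        fit1 = [f / g for f, g in zip(fit1, generations)]
        fit2 = [f / g for f, g in zip(fit2, generations)]
    return pearson(fit1, fit2)


# Made-up fitness values for two genes across four experiments.
gene1 = [-2.0, -4.0, 0.0, 1.0]
gene2 = [-1.0, -3.0, 0.5, 1.0]
gens = [4.0, 6.0, 5.0, 3.0]  # hypothetical generations per experiment

print(cofitness(gene1, gene2))        # original cofitness
print(cofitness(gene1, gene2, gens))  # rescaled cofitness
```

Because correlation is unchanged when both series are multiplied by the same constant, rescaling only alters cofitness when the number of generations differs between experiments.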
This file contains a Supplementary Table guide, Supplementary Figures 1–5, Supplementary Notes 1–6 and Supplementary References.
This file contains Supplementary Tables 1–22. The tables are provided in a single Excel file with separate tabs for each table.