One-third of all protein-coding genes from bacterial genomes cannot be annotated with a function. Here, to investigate the functions of these genes, we present genome-wide mutant fitness data from 32 diverse bacteria across dozens of growth conditions. We identified mutant phenotypes for 11,779 protein-coding genes that had not been annotated with a specific function. Many genes could be associated with a specific condition because the gene affected fitness only in that condition, or with another gene in the same bacterium because they had similar mutant phenotypes. Of the poorly annotated genes, 2,316 had associations that have high confidence because they are conserved in other bacteria. By combining these conserved associations with comparative genomics, we identified putative DNA repair proteins; in addition, we propose specific functions for poorly annotated enzymes and transporters and for uncharacterized protein families. Our study demonstrates the scalability of microbial genetics and its utility for improving gene annotations.
Chang, Y.-C. et al. COMBREX-DB: an experiment centered database of protein function: knowledge, predictions and knowledge gaps. Nucleic Acids Res. 44, D330–D335 (2016).
Schnoes, A. M., Brown, S. D., Dodevski, I. & Babbitt, P. C. Annotation error in public databases: misannotation of molecular function in enzyme superfamilies. PLOS Comput. Biol. 5, e1000605 (2009).
Deutschbauer, A. et al. Towards an informative mutant phenotype for every bacterial gene. J. Bacteriol. 196, 3643–3655 (2014).
Deutschbauer, A. et al. Evidence-based annotation of gene function in Shewanella oneidensis MR-1 using genome-wide fitness profiling across 121 conditions. PLoS Genet. 7, e1002385 (2011).
Nichols, R. J. et al. Phenotypic landscape of a bacterial cell. Cell 144, 143–156 (2011).
Price, M. N. et al. The genetic basis of energy conservation in the sulfate-reducing bacterium Desulfovibrio alaskensis G20. Front. Microbiol. 5, 577 (2014).
Langridge, G. C. et al. Simultaneous assay of every Salmonella typhi gene using one million transposon mutants. Genome Res. 19, 2308–2316 (2009).
van Opijnen, T., Bodi, K. L. & Camilli, A. Tn-seq: high-throughput parallel sequencing for fitness and genetic interaction studies in microorganisms. Nat. Methods 6, 767–772 (2009).
Wetmore, K. M. et al. Rapid quantification of mutant fitness in diverse bacteria by sequencing randomly bar-coded transposons. MBio 6, e00306–e00315 (2015).
Liu, H. et al. Magic pools: parallel assessment of transposon delivery vectors in bacteria. mSystems 3, e00143–17 (2018).
Rubin, B. E. et al. The essential gene set of a photosynthetic organism. Proc. Natl Acad. Sci. USA 112, E6634–E6643 (2015).
Melnyk, R. A. et al. Novel mechanism for scavenging of hypochlorite involving a periplasmic methionine-rich peptide and methionine sulfoxide reductase. MBio 6, e00233–15 (2015).
Smith, A. M. et al. Quantitative phenotyping via deep barcode sequencing. Genome Res. 19, 1836–1842 (2009).
Rensing, C., Pribyl, T. & Nies, D. H. New functions for the three subunits of the CzcCBA cation-proton antiporter. J. Bacteriol. 179, 6871–6879 (1997).
Hottes, A. K. et al. Bacterial adaptation through loss of function. PLoS Genet. 9, e1003617 (2013).
Haft, D. H. et al. TIGRFAMs and genome properties in 2013. Nucleic Acids Res. 41, D387–D395 (2013).
Finn, R. D. et al. Pfam: the protein families database. Nucleic Acids Res. 42, D222–D230 (2014).
Baker, J. L. et al. Widespread genetic switches and toxicity resistance proteins for fluoride. Science 335, 233–235 (2012).
Keseler, I. M. et al. EcoCyc: fusing model organism databases with systems biology. Nucleic Acids Res. 41, D605–D612 (2013).
Hillenmeyer, M. E. et al. Systematic analysis of genome-wide fitness data in yeast reveals novel gene function and drug action. Genome Biol. 11, R30 (2010).
Rabus, R., Reizer, J., Paulsen, I. & Saier, M. H. Jr Enzyme INtr from Escherichia coli. A novel enzyme of the phosphoenolpyruvate-dependent phosphotransferase system exhibiting strict specificity for its phosphoryl acceptor, NPr. J. Biol. Chem. 274, 26185–26191 (1999).
van Opijnen, T., Dedrick, S. & Bento, J. Strain dependent genetic networks for antibiotic-sensitivity in a bacterial pathogen with a large pan-genome. PLoS Pathog. 12, e1005869 (2016).
Chen, S. H., Byrne, R. T., Wood, E. A. & Cox, M. M. Escherichia coli radD (yejH) gene: a novel function involved in radiation resistance and double-strand break repair. Mol. Microbiol. 95, 754–768 (2015).
Lopes-Kulishev, C. O. et al. Functional characterization of two SOS-regulated genes involved in mitomycin C resistance in Caulobacter crescentus. DNA Repair (Amst.) 33, 78–89 (2015).
Gwon, G. H. et al. Crystal structure of a Fanconi anemia-associated nuclease homolog bound to 5′ flap DNA: basis of interstrand cross-link repair by FAN1. Genes Dev. 28, 2276–2290 (2014).
Justice, S. S., Hunstad, D. A., Cegelski, L. & Hultgren, S. J. Morphological plasticity as a bacterial survival strategy. Nat. Rev. Microbiol. 6, 162–168 (2008).
da Rocha, R. P., Paquola, A. C. de M., Marques Mdo, V., Menck, C. F. M. & Galhardo, R. S. Characterization of the SOS regulon of Caulobacter crescentus. J. Bacteriol. 190, 1209–1218 (2008).
Abella, M., Campoy, S., Erill, I., Rojo, F. & Barbé, J. Cohabitation of two different lexA regulons in Pseudomonas putida. J. Bacteriol. 189, 8855–8862 (2007).
Cirz, R. T., O’Neill, B. M., Hammond, J. A., Head, S. R. & Romesberg, F. E. Defining the Pseudomonas aeruginosa SOS response and its role in the global response to the antibiotic ciprofloxacin. J. Bacteriol. 188, 7101–7110 (2006).
Wiegmann, K. et al. Carbohydrate catabolism in Phaeobacter inhibens DSM 17395, a member of the marine roseobacter clade. Appl. Environ. Microbiol. 80, 4725–4737 (2014).
Brouns, S. J. J. et al. Identification of the missing links in prokaryotic pentose oxidation pathways: evidence for enzyme recruitment. J. Biol. Chem. 281, 27378–27388 (2006).
Johnsen, U. et al. D-xylose degradation pathway in the halophilic archaeon Haloferax volcanii. J. Biol. Chem. 284, 27290–27303 (2009).
Stephens, C. et al. Genetic analysis of a novel pathway for D-xylose metabolism in Caulobacter crescentus. J. Bacteriol. 189, 2181–2185 (2007).
Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M. & Tanabe, M. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 44, D457–D462 (2016).
Overbeek, R. et al. The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST). Nucleic Acids Res. 42, D206–D214 (2014).
Iwamoto, R. & Imanaga, Y. Direct evidence of the Entner–Doudoroff pathway operating in the metabolism of D-glucosamine in bacteria. J. Biochem. 109, 66–69 (1991).
Ghrist, A. C. & Stauffer, G. V. The Escherichia coli glycine transport system and its role in the regulation of the glycine cleavage enzyme system. Microbiology 141, 133–140 (1995).
Figueira, R. et al. Adaptation to sustained nitrogen starvation by Escherichia coli requires the eukaryote-like serine/threonine kinase YeaG. Sci. Rep. 5, 17524 (2015).
Tagourti, J., Landoulsi, A. & Richarme, G. Cloning, expression, purification and characterization of the stress kinase YeaG from Escherichia coli. Protein Expr. Purif. 59, 79–85 (2008).
Thorgersen, M. P. et al. Molybdenum availability is key to nitrate removal in contaminated groundwater environments. Appl. Environ. Microbiol. 81, 4976–4983 (2015).
Ray, J. et al. Complete genome sequence of Cupriavidus basilensis 4G11, isolated from the Oak Ridge Field Research Center site. Genome Announc. 3, e00322–15 (2015).
Vaccaro, B. J. et al. Novel metal cation resistance systems from mutant fitness analysis of denitrifying Pseudomonas stutzeri. Appl. Environ. Microbiol. 82, 6046–6056 (2016).
Kovach, M. E. et al. Four new derivatives of the broad-host-range cloning vector pBBR1MCS, carrying different antibiotic-resistance cassettes. Gene 166, 175–176 (1995).
Baba, T. et al. Construction of Escherichia coli K-12 in-frame, single-gene knockout mutants: the Keio collection. Mol. Syst. Biol. 2, 2006.0008 (2006).
Kuehl, J. V. et al. Functional genomics with a comprehensive library of transposon mutants for the sulfate-reducing bacterium Desulfovibrio alaskensis G20. MBio 5, e01041–14 (2014).
Zane, G. M., Yen, H. C. & Wall, J. D. Effect of the deletion of qmoABC and the promoter-distal gene encoding a hypothetical protein on sulfate reduction in Desulfovibrio vulgaris Hildenborough. Appl. Environ. Microbiol. 76, 5500–5509 (2010).
Kahm, M., Hasenbrink, G., Lichtenberg-Frate, H., Ludwig, J. & Kschischo, M. grofit: fitting biological growth curves with R. J. Stat. Softw. 33, 1–21 (2010).
Bankevich, A. et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19, 455–477 (2012).
Bashir, A. et al. A hybrid approach for the automated finishing of bacterial genomes. Nat. Biotechnol. 30, 701–707 (2012).
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
Walker, B. J. et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One 9, e112963 (2014).
Tritt, A., Eisen, J. A., Facciotti, M. T. & Darling, A. E. An integrated pipeline for de novo assembly of microbial genomes. PLoS One 7, e42304 (2012).
Hunt, M. et al. Circlator: automated circularization of genome assemblies using long sequencing reads. Genome Biol. 16, 294 (2015).
Eddy, S. R. Accelerated profile HMM searches. PLOS Comput. Biol. 7, e1002195 (2011).
Wu, M. & Scott, A. J. Phylogenomic analysis of bacterial and archaeal sequences with AMPHORA2. Bioinformatics 28, 1033–1034 (2012).
Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2—approximately maximum-likelihood trees for large alignments. PLoS One 5, e9490 (2010).
Sagawa, S., Price, M. N., Deutschbauer, A. M. & Arkin, A. P. Validating regulatory predictions from diverse bacteria with mutant fitness data. PLoS One 12, e0178258 (2017).
Overbeek, R., Fonstein, M., D’Souza, M., Pusch, G. D. & Maltsev, N. The use of gene clusters to infer functional coupling. Proc. Natl Acad. Sci. USA 96, 2896–2901 (1999).
Aziz, R. K. et al. The RAST Server: rapid annotations using subsystems technology. BMC Genomics 9, 75 (2008).
Caspi, R. et al. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res. 44, D471–D480 (2016).
Price, M. N. & Arkin, A. P. PaperBLAST: text mining papers for information about homologs. mSystems 2, e00039–17 (2017).
Marchler-Bauer, A. et al. CDD: NCBI’s conserved domain database. Nucleic Acids Res. 43, D222–D226 (2015).
Kent, W. J. BLAT—the BLAST-like alignment tool. Genome Res. 12, 656–664 (2002).
Li, C.-L. et al. DNA binding and cleavage by the periplasmic nuclease Vvn: a novel structure with a known active site. EMBO J. 22, 4014–4025 (2003).
Ananthaswamy, H. N. The release of endonuclease I from Escherichia coli by a new cold shock procedure. Biochem. Biophys. Res. Commun. 76, 289–298 (1977).
Lopes, J., Gottfried, S. & Rothfield, L. Leakage of periplasmic enzymes by mutants of Escherichia coli and Salmonella typhimurium: isolation of ‘periplasmic leaky’ mutants. J. Bacteriol. 109, 520–525 (1972).
Nossal, N. G. & Heppel, L. A. The release of enzymes by osmotic shock from Escherichia coli in exponential phase. J. Biol. Chem. 241, 3055–3062 (1966).
We thank V. Lo, W. Shao, and K. Keller for technical assistance with the Fitness Browser website. Sequencing was performed at: the Vincent J. Coates Genomics Sequencing Laboratory (University of California at Berkeley), supported by NIH S10 Instrumentation Grants S10RR029668, S10RR027303, and OD018174; the DOE Joint Genome Institute; the College of Biological Sciences UCDNA Sequencing Facility (UC Davis); and the Institute for Genomics Sciences (University of Maryland). Studies of novel isolates were conducted by ENIGMA and were supported by the Office of Science, Office of Biological and Environmental Research of the US Department of Energy, under contract DE-AC02-05CH11231. The other data collection was supported by Laboratory Directed Research and Development (LDRD) funding from Berkeley Laboratory, provided by the Director, Office of Science, of the US Department of Energy under contract DE-AC02-05CH11231 and a Community Science Project from the Joint Genome Institute to M.J.B., J.B., A.P.A., and A.M.D. The work conducted by the US Department of Energy Joint Genome Institute, a DOE Office of Science User Facility, is supported by the Office of Science of the US Department of Energy under contract no. DE-AC02-05CH11231.
Extended data figures and tables
a, The utilization of D-alanine or cytosine by Azospirillum brasilense Sp245. Each point shows the fitness of a gene in the two conditions. The data are the average of two biological replicates for each nitrogen source. Amino acid synthesis genes were identified using the top-level role in TIGRFAMs. The genes for D-alanine utilization were a D-amino acid dehydrogenase (AZOBR_RS08020), an ABC transporter operon (AZOBR_RS08235:RS08260), and a LysR family regulator (AZOBR_RS21915). The genes for cytosine utilization were cytosine deaminase (AZOBR_RS31895) and two ABC transporter operons (AZOBR_RS06950:RS06965 and AZOBR_RS31875:RS31885). b, Zinc stress in S. loihica PV-4. We compare fitness in rich medium with added zinc(II) sulfate to fitness in plain rich medium. The LB data are the average of two biological replicates. The highlighted genes include a putative heavy metal efflux pump (CzcCBA or Shew_3358:Shew_3356), a hypothetical protein at the beginning of the czc operon (CzcX), a zinc-responsive regulator (ZntR or Shew_3411), and another heavy metal efflux gene related to arsP or DUF318 (Shew_3410). CzcX lacks homology to any characterized protein, but homologues in other strains of Shewanella are also specifically important for resisting zinc stress. In both panels, the lines show x = 0, y = 0, and x = y.
We categorized proteins in our data set by their type of annotation or by whether they have homologues in the same genome (‘paralogues’). For each category, we show the fraction of genes that have statistically significant phenotypes, and more specifically the fractions that have strong phenotypes (|fitness| > 2 and statistically significant) or are significantly detrimental to fitness (fitness > 0). Genes with high or moderate similarity to another gene in the same genome (paralogues with alignment score above 30% of the self-alignment score) were less likely to have a phenotype (25% versus 32%, P < 10⁻¹⁵, Fisher’s exact test), which is likely to reflect genetic redundancy.
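The categories used in this caption can be written out as a minimal sketch. The function names, the input layout (one `(fitness, is_significant)` pair per experiment) and the idea that a gene receives a label if any single experiment qualifies are assumptions made for illustration; the significance call itself is taken as given from an upstream test and is not reproduced here.

```python
def phenotype_classes(measurements, strong_cutoff=2.0):
    """Label a gene from its per-experiment results.

    measurements: iterable of (fitness, is_significant) pairs, one per
    experiment (hypothetical layout). Returns a set of labels.
    """
    labels = set()
    for fitness, is_significant in measurements:
        if not is_significant:
            continue  # non-significant measurements contribute no label
        labels.add("significant")
        if abs(fitness) > strong_cutoff:
            labels.add("strong")  # |fitness| > 2 and significant
        if fitness > 0:
            labels.add("detrimental")  # mutant is fitter, so the gene is detrimental
    return labels


def has_paralogue(best_paralogue_score, self_score, min_ratio=0.30):
    """True if the best within-genome alignment score exceeds 30% of the self-alignment score."""
    return best_paralogue_score > min_ratio * self_score
```

For example, `phenotype_classes([(-2.5, True), (0.3, False)])` yields the labels `significant` and `strong`, whereas a large but non-significant value yields no label at all.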
We compared the growth of a gene deletion strain and the wild-type bacterium under varying cisplatin concentrations. We show all replicate growth curves for each genotype. We believe the higher overall growth for some of the wild-type experiments (for example, top middle) is random. We observe this phenomenon consistently for some bacteria and we speculate that this is due to varying oxygen content across the microplate. a, E. coli radD (n = 6 independent experiments per strain). b, D. shibae Dshi_2244 (n = 3 independent experiments for wild-type and n = 6 independent experiments for the mutant). c, Phaeobacter inhibens PGA1_c08960 (n = 4 independent experiments for wild-type and n = 6 independent experiments for the mutant). Dshi_2244 and PGA1_c08960 are orthologues of MmcB (DUF1052) from C. crescentus (ref. 24).
Extended Data Fig. 4 EndA, DUF3584, and a FAN1-like VRR-NUC domain protein are important for cisplatin resistance.
As in Extended Data Fig. 3, comparing cisplatin sensitivity of a gene deletion mutant to the wild-type bacterium. a, E. coli endA knockout. cycA encodes an amino acid transporter that is not expected to have a phenotype on cisplatin; it is used as a control. Each growth curve is the average of 12 replicate wells and the dashed lines show 95% confidence intervals from the t-test. b, A deletion of S. oneidensis MR-1 SO4008, a member of the DUF3584 protein family (n = 6 independent experiments per strain). c, A deletion of P. stutzeri RCH2 Psest_2235 (n = 4 independent experiments per strain). Psest_1636 is not expected to be involved in DNA repair and is used here as a control. Psest_2235 is a FAN1-like VRR-NUC domain protein (ref. 25).
We assayed the growth of an E. coli endA− Keio collection deletion mutant carrying one of three different vectors: an empty vector with no insert (endA− + empty), a complementation vector carrying a wild-type copy of endA (endA− + endA), and a complementation vector with a mutant version of endA with an alanine at position 84 instead of histidine (endA− + mutant endA). A mutation of this conserved histidine residue in a close homologue from Vibrio vulnificus has been reported to eliminate nearly all nuclease catalytic activity (ref. 64). As a control, we assayed the wild-type, parental E. coli strain carrying a vector with no insert (wt + empty). We performed these growth assays on three separate microplates (Plate #1, #2, #3). n = 3 independent experiments per strain in Plate #1; n = 4 independent experiments per strain in Plates #2 and #3. We added 20 µg ml⁻¹ gentamicin to each assay to maintain selection for the plasmids (pBBR1-MCS5 and derivatives). Although the catalytic activity of EndA (endonuclease I) appears to be important for resisting cisplatin, it is not clear how EndA would be involved in DNA repair if it is located in the periplasm, as previously believed (refs. 65–67). We speculate that EndA relocates to the cytoplasm upon DNA damage or that EndA degrades broken DNA that enters the periplasm and would otherwise damage the membrane.
Growth comparison of gene deletion mutants in UPF0126 versus wild-type bacteria in minimal defined medium. a, SO1319 from S. oneidensis MR-1, with either ammonium chloride (n = 6 independent experiments per strain) or glycine as the sole source of nitrogen (n = 12 independent experiments per strain). b, PGA1_c00920 from P. inhibens, with glycine as the sole source of carbon (n = 8 independent experiments for wild-type and n = 16 independent experiments for the mutant). c, Psest_1636 from P. stutzeri RCH2, with either ammonium chloride (n = 4 independent experiments per strain) or glycine (n = 8 independent experiments per strain) as the sole source of nitrogen. The Psest_2235 deletion strain is used as a control and is not expected to have a phenotype in these conditions.
Extended Data Fig. 7 PGA1_c00920 partially rescues the glycine growth defect of an E. coli cycA mutant.
CycA is a glycine transporter from E. coli and a mutant in this gene has reduced uptake of glycine (ref. 37). We investigated whether a member of the UPF0126 protein family could rescue the glycine growth defect of an E. coli cycA deletion strain. We introduced different plasmids into the E. coli cycA Keio collection deletion background: an empty plasmid with no insert (cycA− + empty), a plasmid with a wild-type allele of the E. coli cycA gene (cycA− + cycA), and a plasmid with PGA1_c00920 from P. inhibens (cycA− + PGA1_c00920). We compared the growth of these strains and a wild-type E. coli control (wt + empty) in defined media with either ammonium chloride (n = 2 independent experiments per strain) or glycine as the sole source of nitrogen (n = 4 independent experiments per strain). PGA1_c00920 partially rescues the glycine-specific growth defect of the cycA− deletion strain.
Extended Data Fig. 8 Overexpression of members of protein family UPF0060 confers resistance to thallium.
We introduced three plasmids into wild-type E. coli: a plasmid control with no insert (Empty vector), a plasmid carrying RR42_RS34240 from C. basilensis 4G11, and a plasmid carrying Pf6N2E2_2547 from P. fluorescens FW300-N2E2. We assayed the growth of these strains in LB at 30 °C with varying concentrations of thallium(I) acetate (n = 6 independent experiments per strain). We added 50 µg ml⁻¹ kanamycin to each assay to maintain selection for the plasmids (pFAB2286 and derivatives). RR42_RS34240 and Pf6N2E2_2547 are members of the UPF0060 protein family.
We selected 2,593 hypothetical or vaguely annotated proteins from diverse bacterial species, compared them to the protein-coding genes for which we have fitness data (using protein BLAST), and identified potential orthologues as best hits that were homologous over at least 75% of each protein’s length. We show the fraction of these proteins that have a potential orthologue with each type of phenotype and that is above a given level of amino acid sequence similarity. Similarity was defined as the ratio of the alignment’s bit score to the score from aligning the query to itself.
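The similarity ratio and coverage filter described in this caption can be sketched as follows. All function and parameter names are hypothetical, the inputs are assumed to have been parsed from protein BLAST output beforehand, and treating the similarity threshold as inclusive is a choice made here rather than something the caption states.

```python
def similarity(bit_score, self_bit_score):
    """Ratio of the alignment's bit score to the query's self-alignment bit score."""
    return bit_score / self_bit_score


def is_potential_orthologue(aligned_query, query_length,
                            aligned_subject, subject_length,
                            min_coverage=0.75):
    """True if the alignment spans at least 75% of each protein's length."""
    return (aligned_query >= min_coverage * query_length
            and aligned_subject >= min_coverage * subject_length)


def fraction_with_orthologue(best_hit_similarities, n_queries, min_similarity):
    """Fraction of query proteins whose best qualifying hit reaches min_similarity.

    best_hit_similarities holds one ratio per query that has a potential
    orthologue; queries without one are counted only in n_queries.
    """
    n_passing = sum(1 for s in best_hit_similarities if s >= min_similarity)
    return n_passing / n_queries
```

Evaluating `fraction_with_orthologue` over a range of `min_similarity` values would give one curve of the kind the caption describes; a separate list of ratios would be needed for each type of phenotype.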
a, The effect of rescaling the cofitness values by the number of generations in six bacteria. For each of the six bacteria, we identified all pairs of protein-coding genes that were assigned to the same TIGR subrole, were more than 20 kb apart, and had fitness data. This gave 1,711–9,406 pairs per bacterium. We also selected a random subset of pairs that were assigned to different TIGR subroles, were more than 20 kb apart, and had fitness data (1,559–8,881 pairs per bacterium). For each pair, we compared the original cofitness values to the rescaled cofitness (computed from fitness values that were divided by the number of generations). b, The effect of averaging fitness scores from replicate experiments on the cofitness values.
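The two comparisons in this caption can be illustrated with a small sketch. It assumes that cofitness is the Pearson correlation between two genes' fitness values across experiments, which this caption does not itself define; the function names and the representation of replicate groups as lists of experiment indices are likewise hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cofitness(profile_a, profile_b, generations=None):
    """Correlation of two fitness profiles, optionally rescaled per experiment.

    If generations is given, each fitness value is first divided by the
    number of generations of the corresponding experiment (panel a).
    """
    if generations is not None:
        profile_a = [f / g for f, g in zip(profile_a, generations)]
        profile_b = [f / g for f, g in zip(profile_b, generations)]
    return pearson(profile_a, profile_b)


def average_replicates(profile, replicate_groups):
    """Average fitness within each group of replicate experiment indices (panel b)."""
    return [sum(profile[i] for i in group) / len(group)
            for group in replicate_groups]
```

When every experiment runs for the same number of generations, rescaling divides both profiles by a constant and leaves the correlation unchanged, so any difference between original and rescaled cofitness comes from variation in generation number between experiments.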
This file contains a Supplementary Table guide, Supplementary Figures 1–5, Supplementary Notes 1–6 and Supplementary References.
This file contains Supplementary Tables 1–22. The tables are provided in a single Excel file with separate tabs for each table.