The journey to understand previously unknown microbial genes

Wirbel, Jakob; Bhatt, Ami S.; Probst, Alexander J.

doi:10.1038/d41586-024-00077-w

NEWS & VIEWS FORUM
30 January 2024

The analysis of DNA sequences sheds light on microbial biology, but it is difficult to assess the function of genes that have little or no similarity to characterized genes. Here, scientists discuss this challenge from genomic and microbial perspectives.

Jakob Wirbel is in the Department of Medicine, Division of Hematology, Stanford University School of Medicine, Stanford, California 94305, USA.

Ami S. Bhatt is in the Department of Medicine, Division of Hematology, Stanford University School of Medicine, Stanford, California 94305, USA, and in the Department of Genetics, Stanford University School of Medicine.

Alexander J. Probst is in the Research Center One Health Ruhr, University Alliance Ruhr, Department of Chemistry at University of Duisburg-Essen, Essen, 45141, Germany.

THE TOPIC IN BRIEF

• Some aspects of microbiology remain mysterious because of a lack of information about the identity and role of many microbial genes and proteins.

• The ability to obtain and analyse microbial sequences at scale and across species, including those that cannot be grown under laboratory conditions, is providing insights and data to explore.

• Writing in Nature, Rodríguez del Río et al.¹ report their analysis of 149,842 bacterial genomes sampled from a variety of habitats in the wild.

• The data were used to select sequences to generate a catalogue of 404,085 previously unknown gene families that could be prioritized for further study.

• The investigation of these previously unknown genes could lead to new clinical tools or offer fresh perspectives about how microorganisms evolved to survive in their natural environments.

JAKOB WIRBEL & AMI S. BHATT: Bringing structure and context to gene mysteries

The function of most microbial genes is unknown. Some of this microbial ‘dark matter’ might encode previously unknown types of enzyme or classes of antibiotic. As ever more genes of unknown function are discovered through sequencing of DNA from mixtures of multiple genomes, termed metagenomic sequencing, the difficulty of experimentally characterizing these enigmatic genes has led to a focus on computationally predicting their function². Two publications in Nature, one by Rodríguez del Río et al.¹, and one by Pavlopoulos et al.³ published last October, tackle this challenge by cleverly leveraging advances in clustering algorithms (computational tools that group genes on the basis of similarities in amino-acid sequence) and protein-structure prediction tools⁴ such as AlphaFold.

Despite distinct technical approaches, the core strategy used by Pavlopoulos et al. and Rodríguez del Río et al. was similar. Both clustered hundreds of millions of protein sequences from metagenomic data sets into previously unknown protein families. Rodríguez del Río and colleagues filtered their data to examine genes only from prokaryotes (organisms whose cells lack a nucleus), whereas Pavlopoulos et al. used data that also included sequences from eukaryotes (organisms whose cells have a nucleus) and viruses.
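
The clustering step described above can be illustrated with a toy example. This is a minimal sketch, not the pipeline used in either study: it assumes a greedy strategy that compares short amino-acid substrings (k-mers) by Jaccard similarity, and the sequence names, sequences and threshold are all hypothetical.

```python
def kmers(seq, k=3):
    """Return the set of overlapping k-mers in an amino-acid sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_cluster(seqs, threshold=0.5, k=3):
    """Assign each sequence to the first cluster whose representative
    (the cluster's first member) is similar enough; otherwise start a
    new cluster with this sequence as its representative."""
    clusters = []  # list of (representative k-mer set, member names)
    for name, seq in seqs.items():
        profile = kmers(seq, k)
        for rep_profile, members in clusters:
            if jaccard(profile, rep_profile) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((profile, [name]))
    return [members for _, members in clusters]


# Hypothetical protein sequences: p1 and p2 differ by one residue.
proteins = {
    "p1": "MKTAYIAKQR",
    "p2": "MKTAYIAKQK",
    "p3": "GSHMLEDPVV",
}
families = greedy_cluster(proteins)
print(families)
```

Production tools operate on hundreds of millions of sequences and use far more sensitive alignment-based scores; the sketch only shows why near-identical sequences end up in the same family while unrelated ones do not.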

With these catalogues of previously unknown families at hand, both teams set out to predict the function of their newly described families, capitalizing on genomic-context analysis, which involves examining adjacent genes for clues about function, as well as harnessing breakthroughs in methods to predict protein structures. In prokaryotic genomes, genes involved in the same pathway are often present close to one another. Genomic-context analysis, which proposes ‘guilt by association’, has been used effectively to predict previously unknown antiviral defence systems used by bacteria⁵. The second approach, comparing predicted protein structures to find similar (homologous) proteins, is more sensitive than comparing amino-acid sequences alone⁶. Both teams predicted structures for their protein families and compared them with databases of known structures, thereby generating informed predictions about the function of some of these enigmatic proteins.
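
The ‘guilt by association’ idea can be sketched in a few lines. The example below is an illustration under stated assumptions rather than either team's method: it represents a stretch of genome as an ordered list of genes, and tallies the annotated functions of the neighbours of an unannotated gene. The gene names, annotations and window size are hypothetical.

```python
from collections import Counter


def neighbourhood_functions(genes, index, window=2):
    """Count the annotated functions of genes lying within `window`
    positions of genes[index], skipping the gene itself and any
    neighbour that has no annotation."""
    start = max(0, index - window)
    end = min(len(genes), index + window + 1)
    counts = Counter()
    for i in range(start, end):
        if i == index:
            continue
        _name, function = genes[i]
        if function is not None:
            counts[function] += 1
    return counts


# Hypothetical contig: (gene name, annotated function or None if unknown).
contig = [
    ("geneA", "transport"),
    ("geneB", "antiviral defence"),
    ("unk1", None),
    ("geneC", "antiviral defence"),
    ("geneD", "ribosome"),
]
votes = neighbourhood_functions(contig, index=2, window=2)
best_guess = votes.most_common(1)[0][0]
print(votes, best_guess)
```

A real analysis would also require that the neighbourhood be conserved across many genomes before proposing a function, because a single genome offers only weak evidence.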

The sheer scale and computational investment involved in these efforts, which yielded hundreds of thousands of newly discovered protein families (Fig. 1), are impressive. Yet, the number of previously unknown genes that have a functional prediction remains relatively small. In both publications, only around 15% of the previously unknown protein families could be annotated on the basis of structural similarity; genomic-context analysis enabled functions to be proposed for 7.4% of families in Pavlopoulos et al. and 13% in Rodríguez del Río and co-workers. In addition, some assigned functional categories (such as ‘ribosome’) lack detailed specificity and this might obscure the precise role of these genes. Ultimately, the reliability of these predictions will have to be determined experimentally. Indeed, Rodríguez del Río et al. took the first step towards this objective by experimentally verifying the annotation for two of their predicted families.

**Figure 1 | Previously unknown microbial gene families.** The large-scale analysis of DNA sequences captured from microbial samples as reported by Rodríguez del Río *et al*.¹ and by Pavlopoulos *et al*.³ has revealed hundreds of thousands of previously unknown gene families. These data — which were gathered from microbes in the wild and across different habitats, and include species that have not been cultivated in the laboratory — provide a starting point for gaining insights into unexplored aspects of the biology of bacterial and archaeal microorganisms. Figure adapted from Fig. 3a of ref. 1.

By delving deeper into the microbial dark matter, these two studies unlock a wealth of previously hidden knowledge, paving the way for future discoveries in diverse fields from medicine to biotechnology. Follow-up experiments might include the study of protein families with completely new protein folds, possibly revealing unexplored biological functions. Similarly, synapomorphic genes — corresponding to protein families that are specific to a group of organisms sharing a common ancestor but absent in others — might hold clues to key evolutionary processes. With further refinement and validation, these computational approaches offer a powerful tool for unlocking the functional secrets of the unseen microbial world.

ALEXANDER J. PROBST: Microbial sequences reveal ecology and evolution

Genes are the ultimate source of all biological information on Earth, from human eye colour to the cell shape of microorganisms. The proteins they encode can be grouped using bioinformatics into families, usually with shared functionality. The ensemble of all known proteins in databases is continuously expanding as genomes are sequenced and the functions of the encoded proteins are predicted. The greatest fraction of biological functional diversity on our planet is attributed to microbial proteins. With the advent of sequencing of mixed microbial genomes from the environment (an approach that explores multiple genomes and is called metagenomics⁷), the increase in the rate at which data are being added to genome and protein databases is striking. However, the functional capacity of most protein families is unknown and part of the microbial dark matter.

Rodríguez del Río and colleagues’ work, as well as the study by Pavlopoulos et al., analysed large-scale metagenomic data and explored the potential function and distribution of unknown protein families, which might have evolutionary and ecological importance. Rodríguez del Río et al. analysed nearly 150,000 microbial genomes (Fig. 1), and Pavlopoulos and colleagues investigated nearly 27,000 metagenomic data sets retrieved from diverse ecosystems with various bioinformatics approaches, going well beyond the scale of public-database entries used in previous such studies⁸. Surprisingly, a method called rarefaction analysis used by Pavlopoulos and colleagues revealed no slowing down in the detection of previously unknown protein families as new metagenomes were added to their analysis. Instead, the detection of protein families increased exponentially, warranting an array of follow-on studies.
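
For readers unfamiliar with rarefaction analysis, the following minimal sketch shows the general idea: samples are added in random order, the cumulative number of distinct protein families is recorded after each addition, and the result is averaged over many random orderings. It is not the implementation used by Pavlopoulos and colleagues; the function name, parameters and toy data are assumptions made for illustration.

```python
import random


def rarefaction_curve(samples, n_permutations=100, seed=0):
    """Mean number of distinct families seen after adding 1, 2, ...
    samples, averaged over random orderings of the samples."""
    rng = random.Random(seed)
    totals = [0.0] * len(samples)
    for _ in range(n_permutations):
        order = samples[:]
        rng.shuffle(order)
        seen = set()
        for k, families in enumerate(order):
            seen |= families
            totals[k] += len(seen)
    return [t / n_permutations for t in totals]


# Hypothetical metagenomes, each reduced to the set of families it contains.
metagenomes = [
    {"F1", "F2"},
    {"F2", "F3"},
    {"F1", "F4", "F5"},
    {"F6"},
]
curve = rarefaction_curve(metagenomes)
print(curve)
```

A curve that flattens suggests that most families have already been sampled. A curve that keeps climbing as samples are added, as reported in the text above, suggests that many families remain to be found.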

The distribution of protein families across Earth’s categories of ecosystem (biomes) presented by Pavlopoulos and colleagues corroborates the findings of previous investigations regarding the distribution of microbial genes⁸. Some biological entities, however, were particularly rich sources of newly discovered protein families, including viruses, as Pavlopoulos et al. report, and microbes called Asgardarchaeota, as presented by Rodríguez del Río and colleagues. The latter are a group of microorganisms called archaea that are closely related to the first ancestor of eukaryotes. As such, studying their proteins might reveal new insights into the evolution of the eukaryotic cell⁹.

One major challenge in exploring the wealth of previously unknown protein families encoded in genomes of natural samples is the identification of eukaryotic genes in metagenomes. Although certain algorithms exist for the recovery of eukaryotic genomes from metagenomes, accurately predicting eukaryotic genes in mixed DNA sequences — equivalent to Pavlopoulos and colleagues’ method of identifying microbial genes — is still not possible bioinformatically. Once this shortcoming is overcome with the development of new algorithms, scientists will substantially expand the protein ‘sequence space’ and will identify protein families of unknown function that drive the ecology and evolution of eukaryotes.

The greatest advance in painstakingly organizing the protein families of nearly 27,000 metagenomes and across the tree of life lies in the identification of ecosystem-specific protein clusters that differ in terms of their presence or absence, or relative abundance, between varying conditions of a given ecosystem (for example, between the contexts of health or disease). Applying this strategy to examine microbial data for healthy people and those with colorectal cancer, Rodríguez del Río and colleagues found that specific unknown protein families were enriched in the gut bacteria of people with cancer. These protein families were associated with microbial motility, adhesion and, potentially, invasion of human tissue, as revealed through genomic-context analysis. Harnessing this approach in other fields of research should be extremely helpful for deciphering the different functions of sample sets, in the hope of identifying new targets for biochemical analyses to shed light on a tiny fraction of the microbial dark matter.
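
One simple way to ask whether a protein family is enriched in one condition is a one-sided Fisher's exact test on presence or absence across samples. The text above does not say which statistical test the authors used, so this is a swapped-in illustration, and the counts below are hypothetical toy values rather than data from the study.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table
        [[a, b],   cases:    family present / absent
         [c, d]]   controls: family present / absent
    Returns P(X >= a) under the hypergeometric null, that is, the
    probability of seeing at least `a` carriers among the cases if
    carriers were spread at random across all samples."""
    cases = a + b
    carriers = a + c
    n = a + b + c + d
    upper = min(cases, carriers)
    numerator = sum(
        comb(cases, k) * comb(n - cases, carriers - k)
        for k in range(a, upper + 1)
    )
    return numerator / comb(n, carriers)


# Hypothetical counts: family detected in 8 of 10 cancer samples
# and in 1 of 10 healthy samples.
p_value = fisher_one_sided(8, 2, 1, 9)
print(p_value)
```

In a screen across hundreds of thousands of families, such p-values would need correction for multiple testing, and abundance-based methods might be preferred over simple presence or absence.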

Identifying differences in microbial communities (microbiomes) that might explain, for example, the disease state of a person, relies heavily on comparing which species are present and how abundant they are (the taxonomic composition), and examining genes that are associated with certain functions. Finding specific but differentially abundant protein families of unknown function, as demonstrated by Rodríguez del Río and co-workers, has the potential not only to replace current marker-gene-based approaches for differentiating microbiomes but also to advance microbiome research to a new and causality-driven level.

Nature 626, 267-269 (2024)

doi: https://doi.org/10.1038/d41586-024-00077-w

References

[1] Rodríguez del Río, A. et al. Nature 626, 377–384 (2024).
[2] Vanni, C. et al. eLife 11, e67667 (2022).
[3] Pavlopoulos, G. A. et al. Nature 622, 594–602 (2023).
[4] Jumper, J. et al. Nature 596, 583–589 (2021).
[5] Doron, S. et al. Science 359, eaar4120 (2018).
[6] Illergård, K., Ardell, D. H. & Elofsson, A. Proteins 77, 499–508 (2009).
[7] Tyson, G. W. et al. Nature 428, 37–43 (2004).
[8] Coelho, L. P. et al. Nature 601, 252–256 (2022).
[9] Eme, L. et al. Nature 618, 992–999 (2023).

Competing Interests

The authors declare no competing interests.

