A wide variety of enzymatic pathways that produce specialized metabolites in bacteria, fungi and plants are known to be encoded in biosynthetic gene clusters. Information about these clusters, pathways and metabolites is currently dispersed throughout the literature, making it difficult to exploit. To facilitate consistent and systematic deposition and retrieval of data on biosynthetic gene clusters, we propose the Minimum Information about a Biosynthetic Gene cluster (MIBiG) data standard.
Living organisms produce a range of secondary metabolites with exotic chemical structures and diverse metabolic origins. Many of these secondary metabolites find use as natural products in medicine, agriculture and manufacturing. Research on natural product biosynthesis is undergoing an extensive transformation, driven by technological developments in genomics, bioinformatics, analytical chemistry and synthetic biology. It has now become possible to computationally identify thousands of biosynthetic gene clusters (BGCs) in genome sequences, and to systematically explore and prioritize them for experimental characterization1,2. A BGC can be defined as a physically clustered group of two or more genes in a particular genome that together encode a biosynthetic pathway for the production of a specialized metabolite (including its chemical variants). It is becoming possible to carry out initial experimental characterization of hundreds of such natural products, using high-throughput approaches powered by rapid developments in mass spectrometry3,4,5 and chemical structure elucidation6. At the same time, single-cell sequencing and metagenomics are opening up access to new and uncharted branches of the tree of life7,8,9, enabling scientists to tap into a previously undiscovered wealth of BGCs. Furthermore, synthetic biology allows the redesign of BGCs for effective heterologous expression in preengineered hosts, which will ultimately empower the construction of standardized high-throughput platforms for natural product discovery10,11.
In this changing research environment, there is an increasing need to access all the experimental and contextual data on characterized BGCs for comparative analysis, for function prediction and for collecting building blocks for the design of novel biosynthetic pathways. For this purpose, it is paramount that this information be available in a standardized and systematic format, accessible in the same intuitive way as, for example, genome annotations or protein structures. Currently, the situation is far from ideal, with information on natural product biosynthetic pathways scattered across hundreds of scientific articles in a wide variety of journals; it requires in-depth reading of papers to confidently discern which of the molecular functions associated with a gene cluster or pathway have been experimentally verified and which have been predicted solely on the basis of biosynthetic logic or bioinformatic algorithms. Although some valuable existing manually curated databases have data models in place to store some of this information12,13,14, all are specialized towards certain subcategories of BGCs and include just a limited number of parameters defined by the interests of a subset of the scientific community. To enable the future development of databases with universal value, a generally applicable community standard is required that specifies the exact annotation and metadata parameters agreed upon by a wide range of scientists, as well as the possible types of evidence that are associated with each variable in publications and/or patents. Such a standard will be of great value for the consistent storage of data and will thus alleviate the tedious process of manually gathering information on BGCs. Moreover, a comprehensive data standard will allow future data infrastructures to enable the integration of multiple types of data, which will generate new insights that would otherwise not be attainable.
The Genomic Standards Consortium (GSC)15 (Box 1) previously developed the Minimum Information about any Sequence (MIxS) framework16. This extensible 'minimum information' standardization framework includes the Minimum Information about a Genome Sequence (MIGS)17 and the Minimum Information about a MARKer gene Sequence (MIMARKS)16 standards. MIxS is a flexible framework that can be expanded upon to serve a wide variety of purposes. The GSC facilitates the community effort of maintaining and extending MIxS, and stimulates compliance among the community.
Here, we introduce the “Minimum Information about a Biosynthetic Gene cluster” (MIBiG) specification as a coherent extension of the GSC's MIxS standards framework. MIBiG provides a comprehensive and standardized specification of BGC annotations and gene cluster–associated metadata that will allow their systematic deposition in databases. Through a community annotation of BGCs that have been experimentally characterized and described in the literature during previous decades, we have constructed an MIBiG-compliant seed dataset. Moreover, a large part of the research community has committed to continue submitting data on newly characterized gene clusters in the MIBiG format in the future. Together, the MIBiG standard and the resulting MIBiG-compliant data sets will allow data infrastructures to be developed that will facilitate key future developments in natural product research.
Design of the MIBiG standard
The MIBiG standard covers general parameters that are applicable to each and every gene cluster as well as compound type–specific parameters that apply only to specific classes of pathways (Fig. 1). Notably, the standard has been designed to be suitable for biosynthetic pathways from any taxonomic origin, including those from bacteria, archaea, fungi and plants.
The general parameters cover important data items that are universally applicable. First, they include identifiers of the publications associated with the characterization of the gene cluster, so that the full description of the experimental results that support the entire entry can be accessed easily.
The second key group of general parameters describes the associated genomic locus (or loci) and its accession numbers and coordinates, as deposited in or submitted to one of the databases of the International Nucleotide Sequence Database Collaboration (INSDC): the DNA Data Bank of Japan (DDBJ), the European Nucleotide Archive (EBI-ENA) or GenBank, all of which share unified accession numbers. The INSDC accession numbers are also used to link each MIBiG entry (which is given a separate MIBiG accession number) and its annotations to the corresponding nucleotide sequence(s) computationally; hence, a GenBank/ENA/DDBJ submission of the underlying nucleotide sequence is always required to file a MIBiG submission.
The third group of general parameters describes the chemical compounds produced from the encoded pathway, including their structures, molecular masses, biological activities and molecular targets. Additionally, these parameters allow documentation of miscellaneous chemical moieties that are connected to the core scaffold of the molecule (but synthesized independently) and the genes associated with their biosynthesis; this will facilitate the design of tools for the straightforward comparison of such 'sub-clusters', which are frequently present in different variants across multiple parent BGCs.
Finally, there is a group of general parameters describing experimental data on genes and operons in a gene cluster, including gene knockout phenotypes, experimentally verified gene functions and operons verified by techniques such as RNA-seq.
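To make the four groups of general parameters concrete, the following Python sketch models one entry as a set of plain data classes. It is an illustration only: the class names, field names, coordinate convention and example values are hypothetical and are not taken from the MIBiG specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Locus:
    """A genomic locus, identified by an INSDC accession and coordinates."""
    insdc_accession: str   # GenBank/ENA/DDBJ accession (placeholder below)
    start: int             # 1-based start coordinate (assumed convention)
    end: int               # inclusive end coordinate (assumed convention)


@dataclass
class Compound:
    """A chemical compound produced by the encoded pathway."""
    name: str
    structure: Optional[str] = None         # e.g. a SMILES string
    molecular_mass: Optional[float] = None
    activities: List[str] = field(default_factory=list)
    targets: List[str] = field(default_factory=list)


@dataclass
class GeneAnnotation:
    """Experimental data on a single gene in the cluster."""
    gene_id: str
    verified_function: Optional[str] = None
    knockout_phenotype: Optional[str] = None


@dataclass
class BGCEntry:
    """One entry combining the four groups of general parameters."""
    mibig_accession: str
    publications: List[str]
    loci: List[Locus]
    compounds: List[Compound] = field(default_factory=list)
    genes: List[GeneAnnotation] = field(default_factory=list)


# Placeholder values only; none of these identifiers refer to real records.
entry = BGCEntry(
    mibig_accession="EXAMPLE0001",
    publications=["pubmed:00000000"],
    loci=[Locus(insdc_accession="XX000000", start=1, end=45000)],
    compounds=[Compound(name="examplemycin", activities=["antibacterial"])],
    genes=[GeneAnnotation(gene_id="exaA", knockout_phenotype="no production")],
)
```

Keeping the locus as an accession plus coordinates, rather than as a copy of the sequence, reflects the requirement that the nucleotide sequence itself be deposited with an INSDC database.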
Besides the general parameters, the MIBiG standard contains dedicated class-specific checklists for gene clusters encoding pathways to produce polyketides, nonribosomal peptides (NRPs), ribosomally synthesized and post-translationally modified peptides (RiPPs), terpenes, saccharides and alkaloids. These include items such as acyltransferase domain substrate specificities and starter units for polyketide BGCs, release/cyclization types and adenylation domain substrate specificities for NRP BGCs, precursor peptides and peptide modifications for RiPP BGCs, and glycosyltransferase specificities for saccharide BGCs. Where applicable, the standard was made compliant with earlier community agreements, such as the recently published classification of RiPPs18. Hybrid BGCs that cover multiple biochemical classes can be described by simply entering information on each of the constituent compound types: the checklists have been designed in such a way that this does not lead to conflicts. Importantly, the modularity of the checklist system allows for the straightforward addition of further class-specific checklists when new types of molecules are discovered in the future.
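One way to see why hybrid clusters need not cause conflicts is to keep each class-specific checklist in its own namespace. The Python sketch below is an assumption about how such modularity could be represented, not the layout used by the standard; the class keys, parameter names and example values are hypothetical.

```python
from typing import Any, Dict, Set

# Hypothetical class-specific checklists, keyed by biosynthetic class.
# The item names loosely follow the examples given in the text.
CHECKLISTS: Dict[str, Set[str]] = {
    "polyketide": {"starter_unit", "at_domain_specificities"},
    "nrp": {"release_type", "a_domain_specificities"},
    "ripp": {"precursor_peptides", "peptide_modifications"},
    "saccharide": {"glycosyltransferase_specificities"},
}


def add_class_section(entry: Dict[str, Any], bgc_class: str,
                      values: Dict[str, Any]) -> None:
    """Attach one class-specific section to an entry.

    Each section lives under its own class key, so the sections of a hybrid
    cluster cannot overwrite one another.
    """
    allowed = CHECKLISTS[bgc_class]
    unknown = set(values) - allowed
    if unknown:
        raise ValueError(f"unknown {bgc_class} parameters: {sorted(unknown)}")
    entry.setdefault("class_specific", {})[bgc_class] = dict(values)


# A hybrid NRPS-PKS cluster simply carries both sections side by side.
hybrid: Dict[str, Any] = {"classes": ["nrp", "polyketide"]}
add_class_section(hybrid, "nrp", {"release_type": "macrolactamization"})
add_class_section(hybrid, "polyketide", {"starter_unit": "acetyl-CoA"})
```

In this representation, supporting a newly discovered class of molecules amounts to adding one key to `CHECKLISTS`, leaving existing entries untouched.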
The combination of general and compound-specific MIBiG parameters, together with the MIxS checklist, provides a complete description of the chemical, genomic and environmental dimensions that characterize a biosynthetic pathway (Fig. 2). A minimal set of key parameters is mandatory, while other parameters are optional. For many parameters, a specific ontology has been designed in order to standardize the inputs and to make it easier to categorize and search the resulting data.
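The split between mandatory and optional parameters lends itself to a simple completeness check. In the Python sketch below, the set of mandatory keys is an assumption chosen for illustration; the actual mandatory set is defined by the standard itself, and the key names are hypothetical.

```python
from typing import Any, Dict, List

# Assumed mandatory keys, for illustration only.
MANDATORY_KEYS = ("loci", "publications", "compounds", "biosynthetic_class")


def missing_mandatory(entry: Dict[str, Any]) -> List[str]:
    """Return the mandatory keys that are absent or empty in an entry."""
    return [key for key in MANDATORY_KEYS if not entry.get(key)]


draft = {
    "loci": [{"accession": "XX000000", "start": 1, "end": 45000}],
    "publications": [],              # empty, so reported as missing
    "biosynthetic_class": "terpene",
    "notes": "optional free text",   # optional keys are ignored by the check
}

print(missing_mandatory(draft))  # ['publications', 'compounds']
```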
Whenever possible, parameters are linked to a system of evidence attribution that specifies the kinds of experiments performed to arrive at the conclusions indicated by the chosen parameter values. Hence, each annotation entered during submission is assigned a specific evidence code: for example, when annotating the substrate specificity of a nonribosomal peptide synthetase (NRPS) adenylation domain, the submitter can choose between 'activity assay', 'structure-based inference' and 'sequence-based prediction' as evidence categories to support a given specificity.
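The evidence attribution described here behaves like a controlled vocabulary attached to each annotation. A minimal Python sketch follows, using the three evidence categories named above for adenylation domain specificity; the class names, the `parse_annotation` helper and the example domain identifier are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Evidence(Enum):
    """Evidence categories named in the text for A-domain specificity."""
    ACTIVITY_ASSAY = "activity assay"
    STRUCTURE_BASED_INFERENCE = "structure-based inference"
    SEQUENCE_BASED_PREDICTION = "sequence-based prediction"


@dataclass(frozen=True)
class SpecificityAnnotation:
    """A substrate specificity together with the evidence supporting it."""
    domain_id: str   # hypothetical identifier for the adenylation domain
    substrate: str
    evidence: Evidence


def parse_annotation(domain_id: str, substrate: str,
                     evidence: str) -> SpecificityAnnotation:
    """Build an annotation, rejecting evidence labels outside the vocabulary."""
    # Evidence(...) raises ValueError for any label that is not a member.
    return SpecificityAnnotation(domain_id, substrate, Evidence(evidence))


annotation = parse_annotation("exaB_A1", "L-valine", "activity assay")
```

Restricting the evidence field to an enumeration means that a free-text label is rejected when the annotation is created rather than surfacing later as an unsearchable value.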
During the design of the standard, great care was taken to make it compatible with unusual biosynthetic pathways, such as branched or module-skipping polyketide synthase (PKS) and NRPS assembly lines. Also, to ensure that the standard is compliant with the current state of the art in the various subfields of natural product research, we conducted an online community survey at an early stage of standard development (see Supplementary Data Set 1). Feedback was provided by 61 principal investigators from 16 different countries (most of whom also coauthored this paper), including at least ten leading experts for each major class of biosynthetic pathways covered.
Addressing key research needs
Adoption of the MIBiG standard will allow the straightforward collation of all annotations and experimental data on each BGC, which would otherwise be dispersed across multiple scientific articles and resources. Moreover, there are at least three additional key ways in which MIBiG will facilitate new scientific and technological developments: it will enable researchers to systematically connect genes to chemistry (and vice versa), to better understand secondary metabolite biosynthesis and the compounds produced in their ecological and environmental context, and to effectively use synthetic biology to engineer newly designed BGC configurations underpinned by an evidence-based parts registry (Fig. 3).
First, the comprehensive dataset generated through MIBiG-compliant submissions will enable researchers to systematically connect genes and chemistry. Not only will it allow individual researchers to predict enzyme functions by comparing enzyme-coding genes in newly identified BGCs to a thoroughly documented dataset, it will also facilitate general advances in chemistry predictions. Substrate specificities of PKS acyltransferase domains and NRPS adenylation domains, as well as their evidence codes, will be registered automatically for all gene clusters. This will enable automated updating of the training sets for key chemistry prediction algorithms19,20,21, which can then be curated by the degree of evidence available, increasing the accuracy of predictions of core peptide and polyketide scaffolds. Also, because groups of genes associated with the biosynthesis of specific chemical moieties (such as sugars and nonproteinogenic amino acids) will be registered consistently, a continuously growing dataset of such sub-clusters will be available to use as a basis for chemical structure predictions.
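The idea of curating training sets by degree of evidence can be sketched as a filter over annotated specificities. The Python example below is a minimal illustration under two assumptions that are not stated in the text: that the evidence categories can be ranked in the order shown, and that records take the form of simple dictionaries with the hypothetical keys used here.

```python
from typing import Dict, List

# Assumed ranking: a higher number means stronger experimental support.
EVIDENCE_RANK: Dict[str, int] = {
    "sequence-based prediction": 1,
    "structure-based inference": 2,
    "activity assay": 3,
}


def select_training_set(records: List[dict], minimum: str) -> List[dict]:
    """Keep records whose evidence is at least as strong as `minimum`."""
    threshold = EVIDENCE_RANK[minimum]
    return [r for r in records if EVIDENCE_RANK[r["evidence"]] >= threshold]


# Placeholder records; domain names and substrates are invented examples.
records = [
    {"domain": "A1", "substrate": "L-valine",
     "evidence": "activity assay"},
    {"domain": "A2", "substrate": "L-serine",
     "evidence": "sequence-based prediction"},
    {"domain": "AT1", "substrate": "malonyl-CoA",
     "evidence": "structure-based inference"},
]

strict = select_training_set(records, "activity assay")
relaxed = select_training_set(records, "structure-based inference")
```

A prediction tool could then be retrained on the strict set when accuracy matters most, or on the relaxed set when coverage of rare substrates is the priority.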
In addition, MIBiG has the potential to greatly enhance the understanding of secondary metabolite biosynthesis in its ecological and environmental context: the connection of MIBiG to the MIxS standard should stimulate researchers to supply MIxS data on the genome and metagenome sequences that contain the BGCs. This will generate opportunities for a range of analyses, such as the biogeographical mapping of secondary metabolite biosynthesis22, thereby identifying locations and ecosystems harboring rich biosynthetic diversity. But even if the contextual data associated with the genome sequences cannot always be made MIxS compliant (perhaps because the origin of a strain can no longer be traced), the MIBiG standard itself provides a comprehensive reference dataset for annotating large-scale MIxS-compliant metagenomic data from projects such as the Earth Microbiome Project23, Tara Oceans24 and Ocean Sampling Day25. This will enable scientists to obtain a better understanding of the distribution of BGCs in the environment. Altogether, the standard will play a significant role in guiding sampling efforts for future natural product discovery.
Finally, the data resulting from MIBiG-compliant submissions will provide an evidence-based parts registry for the engineering of biosynthetic pathways. Synthetic biologists need a toolbox containing genetic parts that have been experimentally characterized. The MIBiG standard, through its systematic annotation of gene function by evidence coding, knockout mutant phenotypes and substrate specificities, will streamline the identification of all candidate genes and proteins available to perform a desired function, together with the pathway context in which they natively occur. In this manner, it will provide a comprehensive catalog of parts that can be used for the modification of existing biosynthetic pathways or the de novo design of new pathways.
Community annotation effort
To accelerate the usefulness of new MIBiG-compliant data submissions, we initiated this project by annotating a significant portion of the experimental data on the hundreds of BGCs that have been characterized in recent decades. The resulting data will allow immediate contextualization of new submissions (see below) and comparative analysis of any newly characterized BGCs with a rich source of MIBiG-compliant data. Moreover, this annotation effort offered an ideal opportunity to evaluate the MIBiG standard in practice on a diverse range of BGCs. Hence, we carefully mined the literature to obtain a set of 1,170 experimentally characterized gene clusters: 303 PKS, 189 NRPS, 147 hybrid NRPS-PKS, 169 RiPP, 78 terpene, 123 saccharide, 21 alkaloid and 140 other BGCs. Compared to the 288 BGCs currently deposited in ClusterMine36012 and the 103 BGCs deposited in DoBISCUIT14, this presents a significant advance in terms of comprehensiveness. We then annotated each of these 1,170 BGCs with a minimal number of parameters (genomic locus, publications, chemical structure and biosynthetic class and subclass). Subsequently, in a community initiative involving 81 academic research groups and several companies worldwide, we performed a fully MIBiG-compliant reannotation of 405 of these BGCs according to the information available in earlier publications and laboratory archives. (All participants of this annotation effort are either listed as coauthors of this article or mentioned in the Acknowledgments, depending on the size of their contribution.) An initial visualization of the full data set arising from this reannotation is publicly available online at http://mibig.secondarymetabolites.org. Altogether, these submitted entries will function as a very useful seed dataset for the development of databases on secondary metabolism. 
Future data curation efforts will strive to achieve a fully MIBiG-compliant annotation of the remaining 765 BGCs that are currently annotated with a more restricted set of parameters.
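The counts reported for the seed dataset can be checked with a few lines of arithmetic. The Python snippet below uses only the numbers given in the text; the variable names are illustrative.

```python
# Per-class counts of the literature-mined gene clusters, as given in the text.
class_counts = {
    "PKS": 303, "NRPS": 189, "hybrid NRPS-PKS": 147, "RiPP": 169,
    "terpene": 78, "saccharide": 123, "alkaloid": 21, "other": 140,
}

total = sum(class_counts.values())
fully_annotated = 405                  # BGCs with a full reannotation
remaining = total - fully_annotated    # BGCs with the restricted parameter set

print(total)      # 1170
print(remaining)  # 765
```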
To allow straightforward and user-friendly access, the MIBiG standard will be implemented by multiple databases and web services for genome data and secondary metabolite research. For example, the MIBiG-curated dataset has already been integrated into the antiSMASH tool in the form of a new module26 that compares any identified BGCs with the full MIBiG-compliant dataset of known BGCs. Moreover, a full-fledged database is currently under development that will be tightly integrated with antiSMASH and will build on the previously published ClusterMine360 framework12. Additionally, MIBiG-compliant data will be integrated into the recently released Integrated Microbial Genomes Atlas of Biosynthetic Clusters (IMG-ABC) database from the Joint Genome Institute (https://img.jgi.doe.gov/ABC/)27. Regular exchange of data will take place between the MIBiG repository and the IMG-ABC, antiSMASH and ClusterMine360 databases. Additional cross-links with the chemical databases ChemSpider28, ChEMBL29 and ChEBI30 are being developed so that researchers can easily find the full MIBiG annotation of the BGC responsible for the biosynthesis of given molecules. Finally, all community-curated data are freely available and downloadable in JSON format for integration into other software tools or databases, without any need to request permission, as long as the source is acknowledged.
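Because the community-curated data are distributed as JSON, they can be read with standard tooling. The Python sketch below parses a small inline document and indexes entries by compound name; the JSON layout, keys and values are hypothetical placeholders and do not reflect the actual MIBiG file format.

```python
import json
from collections import defaultdict
from typing import Dict, List

# Inline stand-in for a downloaded file; the schema here is an assumption.
RAW = """
[
  {"accession": "EXAMPLE0001", "classes": ["polyketide"],
   "compounds": ["examplemycin"]},
  {"accession": "EXAMPLE0002", "classes": ["nrp", "polyketide"],
   "compounds": ["examplemycin", "samplepeptin"]}
]
"""


def index_by_compound(entries: List[dict]) -> Dict[str, List[str]]:
    """Map each compound name to the accessions of the clusters producing it."""
    index: Dict[str, List[str]] = defaultdict(list)
    for entry in entries:
        for compound in entry.get("compounds", []):
            index[compound].append(entry["accession"])
    return dict(index)


entries = json.loads(RAW)
compound_index = index_by_compound(entries)
```

An index of this kind is the sort of lookup that cross-links from chemical databases would rely on: given a molecule, find the annotated gene cluster or clusters responsible for its biosynthesis.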
For submission of new MIBiG-compliant data by scientists in the field, we prepared an interactive online submission form (available from http://mibig.secondarymetabolites.org), which was extensively tested through the community annotation effort. Data can also be submitted through the BioSynML plug-in26 (http://www.biosynml.de) that was recently built for use in the Geneious software. In this way, MIBiG-compliant data can easily be integrated with the in-house BGC content management systems of individual laboratories or companies. Finally, it will be possible to submit updates to existing MIBiG entries based on peer-reviewed articles through dedicated web forms.
The MIBiG coordinating team within the GSC is committed to ensuring the continued support and curation of the MIBiG standard, in cooperation with its partners. Compliance with the standard and interoperability with other standards and databases will also be guaranteed within the GSC. In order to stay relevant and viable, MIBiG is projected to be a 'living' standard: updates will be made as needed to remain technologically and scientifically current.
Coordination with relevant journals will be sought to make MIBiG submission of BGCs (evidenced by MIBiG accession codes) a standard item to check during manuscript review. To stimulate submission of MIBiG data during the process of publishing new biosynthetic gene clusters, unique MIBiG accession numbers are provided for each BGC that can be used during article review (including for data embargoed until after publication). The research community represented by this paper commits itself to submitting MIBiG-compliant data sets as well as updates to existing entries when publishing new experimental results on BGCs. We encourage the larger community to join in this endeavor.
M.H.M. was supported by a Rubicon fellowship of the Netherlands Organization for Scientific Research (NWO; Rubicon 825.13.001). The work of R.K. was supported by the European Union's Seventh Framework Programme (Joint Call OCEAN.2011–2: Marine microbial diversity—new insights into marine ecosystems functioning and its biotechnological potential) under the grant agreement no. 287589 (Micro B3). M.C. was supported by a Biotechnology and Biological Sciences Research Council (BBSRC) studentship (BB/J014478/1). The GSC is supported by funding from the Natural Environment Research Council (UK), the National Institute for Energy Ethics and Society (NIEeS; UK), the Gordon and Betty Moore Foundation, the National Science Foundation (NSF; US) and the US Department of Energy. The Manchester Synthetic Biology Research Centre, SYNBIOCHEM, is supported by BBSRC/Engineering and Physical Sciences Research Council (EPSRC) grant BB/M017702/1. We thank P. d'Agostino, P.R. August, R. Chau, C.D. Deane, S. Diethelm, L. Fernandez-Martinez, A. El Gamal, C. Garcia De Gonzalo, T.H. Grossman, C.-J. Huang, S. Kodani, A.L. Leandrini, I.A. MacNeil, M. Metelev, E.M. Molly, C. Olano, M. Ortega, L. Ray, K. Reynolds, A. Ross, I.N. Silva, R. Teufel, G. Thibodeaux, J. Tietz and D. Widdick for their contributions in the community annotation. We thank R. Baltz, M. Bibb, C. Boddy, C. Corre, E. Dittmann, H. Gramajo, N. Ichikawa, H. Ikeda, P. Jensen, C. Khosla, R. Li, M. Marahiel, D. Mohanty, C. Moore, W. Nierman, D.-C. Oh, E. Schmidt, Y. Shen, D. Stevens, B. Tudzynski and S. Van Lanen for useful comments on an early draft version of the community standard. We are grateful to three anonymous referees for their constructive suggestions.