Minimum information requested in the annotation of biochemical models (MIRIAM)

Abstract

Most of the published quantitative models in biology are lost to the community because they are either not made available or they are insufficiently characterized to allow them to be reused. The lack of a standard description format, lack of stringent reviewing and authors' carelessness are the main causes for incomplete model descriptions. With today's increased interest in detailed biochemical models, it is necessary to define a minimum quality standard for the encoding of those models. We propose a set of rules for curating quantitative models of biological systems. These rules define procedures for encoding and annotating models represented in machine-readable form. We believe their application will enable users to (i) have confidence that curated models are an accurate reflection of their associated reference descriptions, (ii) search collections of curated models with precision, (iii) quickly identify the biological phenomena that a given curated model or model constituent represents and (iv) facilitate model reuse and composition into large subcellular models.

Main

During the genomic era we have witnessed a vast increase in the availability of quantitative data. This is motivating a shift in the focus of molecular and cellular research from qualitative descriptions of biochemical interactions towards the quantification of such interactions and their dynamics. One of the tenets of systems biology is the use of quantitative models (see Box 1 for definitions) as a mechanism for capturing precise hypotheses and making predictions1,2. Many specialized models exist that attempt to explain aspects of the cellular machinery. However, as has happened with other types of biological information, such as sequences, macromolecular structures or microarray data, quantitative models will be useful only if their access and reuse are made easy for all scientists. Moreover, the next step towards a more synergistic view of living systems is assembling models into larger entities, by module reuse and assembly or modeling across different spatial, temporal or physiological scales. Both model retrieval and model composition require formal descriptions of model structure and semantics. Our separate groups have been active in the development of standards for encoding biological models in machine-readable formats (e.g., CellML3 and SBML4,5) and of public repositories of computational models (such as BioModels Database6, SigPath7, EcoCyc8, the CellML repository (http://www.cellml.org/examples/repository/), JWS Online9, RegulonDB10, DOQCS11). We firmly believe in the value of expressing computational models using standardized, structured formats as a means of enabling direct interpretation and manipulation of those models by software tools.

Databases of quantitative models are valuable resources only if researchers can trust the quality of their content. Similarly, repositories are not useful unless users can search for specific models and then relate model constituents to other data sets such as bioinformatics databases and controlled vocabularies. To meet these needs, we believe four complementary aspects of the quality of an encoded model must be addressed: (i) the quality of the documentation (e.g., journal article) associated with the encoded model, (ii) the degree of correspondence between the encoded model and the documentation, (iii) the accuracy and extent of the annotations of the encoded model and (iv) whether the model is encoded in a machine-readable format, that is, a format that can be immediately and unambiguously parsed by software to perform simulations and analysis.

Most of the encoded models available in scientific publications or on the Internet are not in a standard format. Of those that are encoded in a standard format, it turns out that most actually fail compliance tests developed for these standards. Failures occur for a variety of reasons, ranging from minor syntactic errors to significant conceptual problems, including the incorrect specification of units. Even deeper semantic inaccuracies can lie in the structure of the model itself. Finally, there is no standard naming scheme for the model constituents, so the precise identification of constituents depends on the associated documentation/annotation. Most models available today are not annotated, and as a result, users are faced with such things as a reaction 'X' between the constituents 'A' and 'B,' producing 'C' and modulated by 'M.' As a consequence, models frequently have to be re-encoded in order to be reused, a process that in practice is often performed by a different person from the original author.

These quality issues must be addressed when curating model collections for public use, just as is done for other types of biological data. One crucial step is the development of interchange standards12, such as those developed for microarray data13, protein interactions14 or metabolic analyses15. By 'curation,' we mean the processes of collecting models, verifying them to some degree and annotating them with metadata. We propose to standardize an approach to the curation of model collections and the encoding of models using a framework of rules we call MIRIAM, the Minimum Information Requested In the Annotation of Models. MIRIAM aims to define processes and schemes that will instill confidence in model collections, enable the assembly of meta-collections of models at the same high level of quality and allow the curation process to be shared among teams at different sites and institutions. The standard we propose is designed to cover encoding processes that may be conducted either up front by the model author or post hoc by a curator. However, we do not believe that the post hoc approach is particularly efficient, and prefer modelers to make their models available in standard formats. Box 2 describes some uses of MIRIAM.

Scope of MIRIAM

MIRIAM applies only to models linked to a unique reference description. MIRIAM does not directly address the quality of the documentation (although sufficiently poor documentation can make a model impossible to curate). The assessment of the quality of documentation is well established in the scientific community. We expect that, by assessing the documentation describing quantitative models, peer reviewers (not the model curators) will assess the models' ability to represent and predict the quantitative behavior of biological systems and/or make an important theoretical contribution. Instead, MIRIAM focuses on the correspondence of an encoded model to its associated description and how the encoded model is annotated. In other words, even if it is MIRIAM compliant, a model may not necessarily make sense in biological terms. Conversely, many models that cannot be declared MIRIAM compliant may still be of high scientific interest.

We expect MIRIAM to apply mainly to quantitative models that can be simulated over a range of parameter values and provide numerical results. This encompasses not only models that can be integrated or iterated forwards in time, such as ordinary and partial differential equation models and differential algebraic equation models, but also other quantitative approaches such as steady-state models (e.g., Metabolic Control Analysis16, Flux Balance Analysis17). Discrete approaches, such as logical modeling18,19,20 or stochastic and hybrid Petri nets21, can also be considered when they can lead to specific numerical results. Although we are aware that this means we can cover only part of the modeling field, we make this our initial focus because only these models can lead to quantitative numerical results providing refutable predictions. The comparison of these predictions with the reference description of the model is a crucial test of MIRIAM compliance.

Overview of the proposal

MIRIAM is divided into two parts. The first is a proposed standard for reference correspondence dealing with the syntax and semantics of the model, whereas the second is a proposed annotation scheme that specifies the documentation of the model by external knowledge.

Standard for reference correspondence

The aim of this proposal is to ensure that the model is properly associated with a reference description and is consistent with that reference description. To be declared MIRIAM compliant, a quantitative model must fulfill a set of rules dealing with its encoding, its structure and the results it should provide when instantiated in simulations. These rules are detailed in Box 3.

To pass the various tests, and in particular the reproduction of described results, a modeler could be required to make minor changes to a model until it is truly consistent with the results given in the associated reference description. If the modeler is not one of the authors, ideally he/she should perform these modifications in collaboration with the authors. Examples include changing a few parameter and/or initial condition values.

When the model given in the text of the reference description is significantly different from the encoded model used to generate the results given in this text, the model cannot be curated and MIRIAM cannot be applied. For example, MIRIAM cannot be applied if a significant number of parameter values are different between the two models (the significance being judged by the curators). The original authors of the model should be encouraged to publish an erratum detailing the correct values.
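The comparison of parameter values between the reference description and the encoded model can be partly automated. The following Python sketch is illustrative only: the function name, the parameter names and the relative tolerance are assumptions rather than part of MIRIAM, and deciding whether the reported differences are significant remains with the curators.

```python
import math

def differing_parameters(described, encoded, rel_tol=1e-6):
    """List parameters whose values differ between two models.

    `described` and `encoded` map parameter names to numeric values.
    A parameter missing from either side is reported with None.
    The tolerance is an arbitrary illustrative choice.
    """
    diffs = {}
    for name in sorted(set(described) | set(encoded)):
        a = described.get(name)
        b = encoded.get(name)
        if a is None or b is None or not math.isclose(a, b, rel_tol=rel_tol):
            diffs[name] = (a, b)
    return diffs

# Hypothetical parameter sets, for illustration only.
described = {"k1": 0.1, "k2": 2.0, "Vmax": 5.0}
encoded = {"k1": 0.1, "k2": 2.5}
report = differing_parameters(described, encoded)
```

The function only reports the differences; it deliberately makes no decision about compliance.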

Annotation schemes

The scheme for annotation is composed of two complementary components: attribution, covering the absolute minimum information that is required to associate the model with both a reference description and an encoding process, and external data resources, covering information required to relate the constituents of quantitative models to established data resources or controlled vocabularies.

The annotations must always be transferred with the encoded model. The ideal case is incorporating these annotations in the same file as the model itself, in a structured form such as the CellML metadata22 or the SBML simple annotation scheme23. However, annotations could also be joined in another form, such as one or several accompanying files, in various formats, textual or graphical.

Attribution annotation

To be confident in being able to reuse an encoded model, one must be able to trace its origin and the people who were involved in its creation. In particular, the reference description has to be identified, as well as the authors and creators of the model. The information that must always be joined with an encoded model is listed in Box 4.

External data resources annotation

The aim of this scheme is to link model constituents to corresponding structures in existing and future open access bioinformatics resources. Such data resources can be, for instance, databases or controlled vocabularies. This will permit the identification of model constituents and the comparison of model constituents between different models, but also the execution of queries on models to recover specific constituents in models. Possible sources of annotation for various types of constituents are listed in Table 1.

Table 1 Possible sources of annotation for different model constituents

This annotation must permit a piece of knowledge to be unambiguously related to a model constituent. The structure of an atomic element of the annotation is similar to the relationshipXref element of BioPAX (http://www.biopax.org/). The referenced information should be described using a triplet {“data-type,” “identifier,” “qualifier”}. The “data-type” is a unique, controlled description of the type of data. The “identifier,” within the context of the “data-type,” points to a specific piece of knowledge. The “qualifier” is a string that serves to refine the relation between the referenced piece of knowledge and the described constituent. Examples of qualifiers are “has a,” “is version of,” “is homolog to.” The qualifier is optional, and its absence does not preclude MIRIAM compliance. When a qualifier is absent, one assumes the relation to be “is.”
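As an illustration, the triplet could be represented in software as a small immutable record. This is a minimal Python sketch under stated assumptions: the class name, field names and placeholder values are hypothetical and are not prescribed by MIRIAM. Only the default relation “is” for an absent qualifier comes from the text above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Annotation:
    """One atomic annotation: {data-type, identifier, qualifier}."""
    data_type: str                   # controlled description of the type of data
    identifier: str                  # points to a piece of knowledge within data_type
    qualifier: Optional[str] = None  # optional refinement of the relation

    def relation(self) -> str:
        # An absent qualifier is read as the relation "is".
        return self.qualifier if self.qualifier is not None else "is"

# Placeholder values, for illustration only.
plain = Annotation("http://www.myResource.org/", "myIdentifier")
refined = Annotation("http://www.myResource.org/", "myIdentifier", "is version of")
```

Here `plain.relation()` yields “is”, whereas `refined.relation()` yields the explicit qualifier.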

The “data-type” should be written as a Uniform Resource Identifier24. This URI can be a Uniform Resource Locator25 or a Uniform Resource Name26. The URL or URN does not have to describe an actual physical location. It is up to the software tool reading the model to decide what to do with this URI. This software can, for instance, use the “identifier” with a search engine built on a database mirroring the “data-type.” Alternatively, a reading tool translating the model can build a hyperlink using the “identifier” and another URL related to the “data-type.”
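The second option, building a hyperlink from the “identifier” and a URL related to the “data-type,” might look as follows in a reading tool. This is a hedged Python sketch: the lookup table, the mirror address and the function name are invented for illustration and do not correspond to any existing service.

```python
from urllib.parse import quote

# Hypothetical table relating a "data-type" URI to a URL template
# at some physical location (here an imaginary mirror).
LINK_TEMPLATES = {
    "http://www.myResource.org/": "http://mirror.example.org/entry?id={identifier}",
}

def build_hyperlink(data_type, identifier):
    """Return a browsable URL for the annotation, or None if the
    data-type is unknown to this tool."""
    template = LINK_TEMPLATES.get(data_type)
    if template is None:
        return None
    # Percent-encode the identifier so it is safe inside a query string.
    return template.format(identifier=quote(identifier, safe=""))

link = build_hyperlink("http://www.myResource.org/", "myIdentifier")
```

Returning None for an unknown “data-type” leaves the decision about what to do with the URI to the calling tool.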

The “data-type” and the “identifier” can be combined into a single URL, such as http://www.myResource.org/#myIdentifier or as a URN, for instance using the LSID scheme27 of urn:lsid:myResource.org:myIdentifier.
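The two combined forms can be produced and taken apart mechanically. The Python sketch below is illustrative: the function names are assumptions, and the URN follows the two-part form shown in the example above rather than a complete treatment of the LSID syntax.

```python
def as_url(data_type, identifier):
    """Combine data-type and identifier into a single URL, using '#'."""
    return f"{data_type.rstrip('#')}#{identifier}"

def as_lsid(authority, identifier):
    """Combine an authority name and identifier into an LSID-style URN."""
    return f"urn:lsid:{authority}:{identifier}"

def split_url(combined):
    """Recover (data-type, identifier) from a URL built by as_url."""
    data_type, sep, identifier = combined.rpartition("#")
    if not sep:
        raise ValueError("no '#' separator in " + combined)
    return data_type, identifier

url = as_url("http://www.myResource.org/", "myIdentifier")
urn = as_lsid("myResource.org", "myIdentifier")
```

With the placeholder values from the text, `url` and `urn` reproduce the two example strings given above.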

To enable interoperability, the community will have to agree on a set of standard, valid URIs. An online resource will be established to catalog the URIs and the corresponding physical URLs of the agreed-upon “data-types,” whether these are controlled vocabularies or databases. This catalog will simply list the URIs and for each one, provide a corresponding summary of the syntax for the “identifier.” An application programming interface (API) can be created so that software tools can retrieve valid URL(s) corresponding to a given URI. Table 2 shows a small subset of this forthcoming list. Note that although MIRIAM compliance does not require such a list to exist, it is considered crucial to actually enforce MIRIAM usage, and to make it truly useful. The list will also have to evolve with the data resources.
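A catalog of this kind, together with a lookup function, could be sketched as follows. Everything in this Python example is hypothetical: the entry, the identifier pattern, the physical locations and the function name are placeholders chosen for illustration, since the actual catalog and its API are not specified here.

```python
import re
from dataclasses import dataclass
from typing import List

@dataclass
class CatalogEntry:
    uri: str                 # agreed-upon "data-type" URI
    identifier_pattern: str  # summary of the identifier syntax, as a regex
    locations: List[str]     # URL templates of physical locations

# Hypothetical catalog content.
CATALOG = {
    "http://www.myResource.org/": CatalogEntry(
        uri="http://www.myResource.org/",
        identifier_pattern=r"[A-Za-z0-9]+",
        locations=[
            "http://www.myResource.org/entry/{id}",
            "http://mirror.example.org/myResource/{id}",
        ],
    ),
}

def resolve(uri, identifier):
    """Return every physical URL for an identifier of a given data-type."""
    entry = CATALOG.get(uri)
    if entry is None:
        raise KeyError("unknown data-type URI: " + uri)
    if re.fullmatch(entry.identifier_pattern, identifier) is None:
        raise ValueError("identifier does not match the declared syntax")
    return [location.format(id=identifier) for location in entry.locations]

urls = resolve("http://www.myResource.org/", "myIdentifier")
```

Keeping the identifier syntax next to the URI lets a tool reject malformed identifiers before attempting any network access.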

Table 2 Examples of different physical locations related to the same URIs expressed as a URL or an LSID

It is important that model constituents be annotated with perennial identifiers. For example, the “entry name” field of UniProt28 is not perennial but is modified on a regular basis to reflect the classification of the protein. However, the “accession” field of UniProt is perennial. Consider a model with an entity representing the protein calmodulin. An annotation of this entity referring to the UniProt record for calmodulin should therefore use a URI containing the “accession” field value for calmodulin “P62158” rather than the “entry name” field value “CALM_HUMAN.”
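A curation tool could flag annotations that appear to use an entry name instead of an accession. The Python sketch below is an illustration, not part of MIRIAM: the regular expression is the accession number format described in the UniProt documentation, and the function name is an assumption. A match shows only that a string has the shape of an accession, not that the record exists.

```python
import re

# Accession number format as documented by UniProt (assumed current).
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)

def looks_like_accession(value):
    """True if value has the shape of a UniProt accession number."""
    return UNIPROT_ACCESSION.fullmatch(value) is not None

accession_ok = looks_like_accession("P62158")
entry_name_ok = looks_like_accession("CALM_HUMAN")
```

For the calmodulin example, the accession “P62158” passes the check, whereas the entry name “CALM_HUMAN” does not.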

Quite often, several identified biological entities, physical components or reactions are lumped in a single constituent of the model. For instance, successive reactions of a pathway may be merged into one reaction, or a set of different molecules is represented by one pool. The annotation must reflect this situation, either by enumerating the biological entities, or with a carefully chosen term from a controlled vocabulary (an example of a curated and annotated model is presented in Table 3).

Table 3 Example of a small curated and annotated model

Conclusions

We believe that through the standardization of the model curation process, it will be possible to create resources that are as significant to systems biology as resources like Ensembl29 are to genomics. Pursuing this proposal will in the short term allow us to establish collections of models of sufficient quality to gain the confidence of the systems biology community. To pave the way, the resources handled by the authors of this manuscript (BioModels Database, CellML repository, DOQCS, SigPath) endorse the standard, and will undertake efforts to make them MIRIAM compliant. In the longer term, the application of MIRIAM will enable the peer review process to become more efficient and its products more accessible. We also hope the standard will be adopted by publishers of scientific literature, as was the case with other standards such as MIAME13.

References

1. Kitano, H. Computational systems biology. Nature 420, 206–210 (2002).
2. Crampin, E. et al. Computational physiology and the physiome project. Exp. Physiol. 89, 1–26 (2004).
3. Lloyd, C., Halstead, M. & Nielsen, P. CellML: its future, present and past. Prog. Biophys. Mol. Biol. 85, 433–450 (2004).
4. Hucka, M. et al. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 19, 524–531 (2003).
5. Finney, A. & Hucka, M. Systems biology markup language: level 2 and beyond. Biochem. Soc. Trans. 31, 1472–1473 (2003).
6. Le Novère, N. et al. BioModels Database: A free, centralized database of curated, published, quantitative kinetic models of biochemical and cellular systems. Nucleic Acids Res. 34, (2006).
7. Campagne, F. et al. Quantitative information management for the biochemical computation of cellular networks. Sci. STKE 248, PL11 (2004).
8. Keseler, I. et al. EcoCyc: a comprehensive database resource for Escherichia coli. Nucleic Acids Res. 33, D334–D337 (2005).
9. Olivier, B. & Snoep, J. Web-based kinetic modelling using JWS Online. Bioinformatics 20, 2143–2144 (2004).
10. Salgado, H. et al. RegulonDB (version 4.0): transcriptional regulation, operon organization and growth conditions in Escherichia coli K-12. Nucleic Acids Res. 32, D303–D306 (2004).
11. Sivakumaran, S., Hariharaputran, S., Mishra, J. & Bhalla, U. The database of quantitative cellular signaling: management and analysis of chemical kinetic models of signaling networks. Bioinformatics 19, 408–415 (2003).
12. Quackenbush, J. Data standards for 'omic' science. Nat. Biotechnol. 22, 613–614 (2004).
13. Brazma, A. et al. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat. Genet. 29, 365–371 (2001).
14. Hermjakob, H. et al. The HUPO PSI's molecular interaction format-a community standard for the representation of protein interaction data. Nat. Biotechnol. 22, 177–183 (2004).
15. Lindon, J. et al. Summary recommendations for standardization and reporting of metabolic analyses. Nat. Biotechnol. 23, 833–838 (2005).
16. Kacser, H. & Burns, J. The control of flux. Symp. Soc. Exp. Biol. 27, 65–104 (1973).
17. Savinell, J. & Palsson, B. Optimal selection of metabolic fluxes for in vivo measurement. I. Development of mathematical methods. J. Theor. Biol. 155, 201–214 (1992).
18. Thomas, R. Boolean formalisation of genetic control circuits. J. Theor. Biol. 42, 565–583 (1973).
19. Sánchez, L. & Thieffry, D. Segmenting the fly embryo: a logical analysis of the pair-rule cross-regulatory module. J. Theor. Biol. 224, 517–537 (2003).
20. Laubenbacher, R. & Stigler, B. A computational algebra approach to the reverse engineering of gene regulatory networks. J. Theor. Biol. 229, 523–537 (2004).
21. Doi, A., Fujita, S., Matsuno, H., Nagasaki, M. & Miyano, S. Constructing biological pathway models with hybrid functional Petri nets. In Silico Biol. 4, 271–291 (2003).
22. Cuellar, A., Nelson, M. & Hedley, W. The CellML metadata 1.0 specification. http://www.cellml.org/specifications/metadata/.
23. Le Novère, N. & Finney, A. A simple scheme for annotating SBML with references to controlled vocabularies and database entries. http://www.ebi.ac.uk/compneur-srv/sbml/proposals/AnnotationURI.pdf.
24. Berners-Lee, T., Fielding, R. & Masinter, L. Uniform resource identifier (URI): Generic syntax. http://www.gbiv.com/protocols/uri/rfc/rfc3986.html.
25. Berners-Lee, T. Uniform resource locators (URL): a syntax for the expression of access information of objects on the network. http://www.w3.org/Addressing/URL/url-spec.txt.
26. Moats, R. URN syntax. http://www.ietf.org/rfc/rfc2141.txt.
27. Martin, S., Niemi, M. & Senger, M. Life sciences identifiers RFP response. http://www.omg.org/technology/documents/formal/life_sciences.htm.
28. Apweiler, R. et al. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 32, D115–D119 (2004).
29. Hubbard, T. et al. The Ensembl genome database project. Nucleic Acids Res. 30, 38–41 (2002).
30. Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25–29 (2000).
31. Hamosh, A., Scott, A., Amberger, J., Bocchini, C. & McKusick, V. Online mendelian inheritance in man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 33, D514–D517 (2005).
32. Wheeler, D. et al. Database resources of the national center for biotechnology information. Nucleic Acids Res. 28, 10–14 (2000).
33. Phan, I., Pilbout, S., Fleischmann, W. & Bairoch, A. NEWT, a new taxonomy portal. Nucleic Acids Res. 31, 3822–3823 (2003).
34. Mulder, N.J. et al. InterPro, progress and status in 2005. Nucleic Acids Res. 33, 201–205 (2005).
35. Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y. & Hattori, M. The KEGG resource for deciphering the genome. Nucleic Acids Res. 32, D277–D280 (2004).
36. Wu, C. et al. PIRSF: family classification system at the protein information resource. Nucleic Acids Res. 32, D112–D114 (2004).
37. Joshi-Tope, G. et al. The genome knowledgebase: A resource for biologists and bioinformaticists. Cold Spring Harb. Symp. Quant. Biol. 68, 237–243 (2003).
38. Bader, G. & Hogue, C. BIND—a data specification for storing and describing biomolecular interactions, molecular complexes and pathways. Bioinformatics 16, 465–477 (2000).
39. Hermjakob, H. et al. IntAct—an open source molecular interaction database. Nucleic Acids Res. 32, D452–D455 (2004).
40. Boeckmann, B. et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 31, 365–370 (2003).
41. Wu, C. et al. Update on genome completion and annotations: protein information resource. Nucleic Acids Res. 31, 345–347 (2003).
42. Fleischmann, A. et al. IntEnz, the integrated relational enzyme database. Nucleic Acids Res. 32, D434–D437 (2004).
43. Bairoch, A. The ENZYME database in 2000. Nucleic Acids Res. 28, 304–305 (2000).
44. Bower, J. & Beeman, D. The Book of GENESIS (Springer-Verlag, New York, 1998).
45. Ermentrout, B. Simulating, Analyzing, and Animating Dynamical Systems: A Guide to XPPAUT for Researchers and Students (Society for Industrial & Applied Math, Philadelphia, PA, 2002).
46. Chabrier, N. & Fages, F. Symbolic model checking of biochemical networks. in International Workshop on Computational Methods in Systems Biology (Springer-Verlag, New York, 2003).

Author information

Correspondence to Nicolas Le Novère.

Ethics declarations

Competing interests

The authors declare no competing financial interests.
