Main

Biomedical and clinical research fields are increasingly applying a range of high-throughput experimental techniques, often called 'functional genomics', to study direct and indirect products of gene expression, molecular interactions and the cellular environment. Such approaches aim to determine the function of all genes, with 'function' broadly defined to include the relationship to phenotype, interaction partners (e.g., DNA, RNA, proteins, metabolites), localization and responses in expression to external stimuli. A functional genomics approach may use multiple techniques1,2 in a single study to analyze multiple kinds of data3,4. It is widely recognized that significant benefits can result from detailed annotation and archiving of data sets resulting from these types of studies. The benefits include the ability to exchange data with collaborators or submit them to public databases, the sharing of best practice, which provides capabilities for validation of the study or reinterpretation of results, and the development of new algorithms for data analysis. However, functional genomics often involves sophisticated sample processing, complex equipment, rich data sets and intricate data analyses. As a result, describing experiments in a systematic way requires similarly rich data models that enable data to be analyzed, validated and interpreted by people other than their immediate producers.

The challenges of building data standards for functional genomics have been addressed by scientific communities well-versed in microarray and proteomics technologies. Specifically, the Microarray Gene Expression Data Society (MGED, http://www.mged.org/) was formed in 1999 and devised the Minimal Information About a Microarray Experiment (MIAME)5 reporting requirements. MGED participants also provided a data model, the MicroArray Gene Expression object model (MAGE-OM version 1; refs. 6,7), to capture MIAME-compliant data. In 2002, the Proteomics Standards Initiative (PSI; http://psidev.sourceforge.net/) was formed by the Human Proteome Organization (HUPO) and has since developed reporting requirements and data formats for protein interactions (PSI-MI8) and mass-spectrometry data (mzData, http://www.psidev.info/index.php?q=node/80#mzdata). PSI has also begun work on formats for protein-separation technologies, using the Proteomics Experiment Data Repository (PEDRo9) as a starting point. These standardization efforts address some of the concerns for publicly accessible formats for data and experimental annotation, but the independent nature of the standards groups caused common aspects of experimental protocols to be modeled using different terminology and levels of detail. The result is that semantically equivalent information was represented in syntactically incompatible ways across the standards, potentially complicating the publication process, the analysis and verification of studies that use multiple high-throughput technologies, and the integration of such data.

In response to the need for integration of the various technology types, independent attempts were made to merge MAGE and PEDRo into a single data model10,11. The conclusion from these efforts is that a comprehensive data model for all experimental types would be large and complex, hindering adoption by the technology-specific developer communities and vendors. On the other hand, these efforts also demonstrated that convergence of models in the areas shared between technologies, such as the biological source material, sample processing and the experimental variables, would yield significant benefits, both for data producers and for data consumers, if common aspects of an experimental activity could be recorded once (and not separately for each kind of technique used to study a sample). Furthermore, both data producers and consumers stand to benefit from the use of consistent styles of representation for the types of annotation that differ across techniques. The Functional Genomics Experiment model (FuGE) seeks to make the representation of data resulting from diverse experimental techniques more systematic and consistent by providing: (i) a format for representing laboratory workflows, (ii) a mechanism for supplementing existing data formats with additional metadata to describe their context within a laboratory workflow and their relationships to other data and (iii) a framework for building new data formats with a common structure for techniques that have specific requirements.

FuGE accomplishes these goals by focusing on the representation of the common aspects of experimental annotation and generally applicable information about the design of investigations. As such, FuGE provides a solid foundation for other technology-specific, life-science standards and data formats and is currently being used to develop formats for microarrays, proteomics, metabolomics and various other technologies.

In the next sections, we describe the methodology used to develop FuGE, the translation of the data model to other formats, aspects of the data model itself, and current development efforts based on provisional releases of FuGE. In this document, we designate any concept represented directly in the data model with a fixed width font.

Results

In this section, we present some of the key concepts of the FuGE model, which consists of ten packages that have been placed in two categories: Common and Bio (Box 1). The FuGE specification is too large to cover in detail here; instead, we focus on how FuGE models the structure of an 'omics investigation, the experimental methods, the tracking of samples within an experimental workflow and the multidimensional data produced. Examples are presented from the Protocol, Material, Investigation and Data packages that illustrate the most important types of functionality. The Audit, Description and Reference packages are closely based on MAGE version 1, whereas the Investigation and ConceptualMolecule packages have evolved from MAGE to cover the wider context of functional genomics. The Protocol and Material packages reuse certain components from MAGE and from PEDRo but have been developed de novo. The Ontology and Data packages are newly created, using principles from related object-oriented proposals, as detailed in the complete specification (http://fuge.sourceforge.net/).

Basic functionality for all objects in FuGE is represented in the Common namespace. Every object can be annotated with audit information (tracking changes) and the desired security settings (users or groups that can access or modify objects). This level of control is important for larger organizations when, for example, regulatory requirements must be fulfilled. Furthermore, most objects in FuGE can be annotated with a unique identifier, a local name, a textual description and references to external database or bibliographic entries.

Representing biological workflows

All descriptions of an experimental workflow, from starting samples through to the final results, are encoded by the Protocol package. The package represents any method or procedure in an experiment, including standard operating procedures, the mechanism for running an instrument and the use of software for data processing.

A Protocol object can be associated with Software and Equipment, each of which can have a set of parameters with default values (Fig. 1). A Protocol consists of a set of Actions (or steps) that can be ordered. An Action can contain simple text describing an atomic step within a protocol, it can be associated with parameters or it can be a reference to a child Protocol. This means that a complex procedure can be represented by building a Protocol that references other Protocols in a nested structure. An example protocol is sample processing in proteomics. A single Protocol could be defined for the entire procedure, which has three Actions for the harvesting of material, protein extraction and protein solubilization. Each Action would contain a reference to a separate Protocol for each of the three steps.
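
The nesting of Protocols through Actions can be illustrated with a minimal sketch. It is not the FuGE object model or its generated software components: the Python classes below reuse only the names Protocol and Action from the text, and the attribute names (`text`, `child_protocol`), the `outline` helper and the step descriptions are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Action:
    """One ordered step: either free text or a reference to a child Protocol."""
    name: str
    text: Optional[str] = None                    # hypothetical attribute name
    child_protocol: Optional["Protocol"] = None   # hypothetical attribute name


@dataclass
class Protocol:
    """A procedure defined as an ordered list of Actions."""
    name: str
    actions: List[Action] = field(default_factory=list)

    def outline(self, depth: int = 0) -> List[str]:
        """Return an indented outline of this protocol and all nested protocols."""
        lines = ["  " * depth + self.name]
        for action in self.actions:
            if action.child_protocol is not None:
                lines.extend(action.child_protocol.outline(depth + 1))
            else:
                lines.append("  " * (depth + 1) + action.name)
        return lines


# Three child protocols, each with one illustrative text-only step.
harvest = Protocol("Harvesting of material",
                   [Action("harvest step", text="Collect the starting material.")])
extract = Protocol("Protein extraction",
                   [Action("extract step", text="Extract proteins from the material.")])
solubilize = Protocol("Protein solubilization",
                      [Action("solubilize step", text="Solubilize the extracted protein.")])

# The parent protocol has three Actions, each referencing a child Protocol.
sample_processing = Protocol("Sample processing", [
    Action("harvest", child_protocol=harvest),
    Action("extract", child_protocol=extract),
    Action("solubilize", child_protocol=solubilize),
])

print("\n".join(sample_processing.outline()))
```

Because an Action only holds a reference, the same child Protocol could be reused by several parent protocols without being redefined.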

Figure 1: A UML diagram displaying a subset of the Protocol package.
FuGE relies heavily on inheritance (such as the association between Protocol and Parameterizable), whereby classes inherit attributes and associations from the parent class. All classes have additional attributes inherited from more general parent classes (not shown) which allow a unique identifier, a name, descriptive text and various other properties to be provided.

A laboratory procedure is typically defined once (such as a method in a lab book or a standard operating procedure), but may be applied many times. FuGE represents this distinction by defining ProtocolApplication (Fig. 2). ProtocolApplication represents the running of a Protocol, allowing runtime parameter values to be supplied if they differ from the defaults. The separation of Protocol and ProtocolApplication is technically advantageous as it would be inefficient to redefine a complete protocol for every single deviation that occurs. For example, mass-spectrometry techniques use the same protocol definition for hundreds of runs, with only a small subset of the parameter values varied across them. ProtocolApplication also provides mechanisms for recording the operator and date of the procedure; both are variables that have been shown to be important when identifying and accounting for confounding factors in data analysis12,13.
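
The separation between a definition and its applications can be sketched as follows. This is an illustrative simplification, not the FuGE classes themselves: the dictionary-based parameters, the attribute names, the `effective_parameters` helper and all example values are assumptions.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict


@dataclass
class Protocol:
    """A procedure defined once, with default parameter values."""
    name: str
    default_parameters: Dict[str, float] = field(default_factory=dict)


@dataclass
class ProtocolApplication:
    """One run of a Protocol, recording operator, date and any overrides."""
    protocol: Protocol
    operator: str
    activity_date: date
    parameter_values: Dict[str, float] = field(default_factory=dict)

    def effective_parameters(self) -> Dict[str, float]:
        """Defaults from the Protocol, overridden by runtime values where given."""
        merged = dict(self.protocol.default_parameters)
        merged.update(self.parameter_values)
        return merged


# Hypothetical mass-spectrometry protocol and parameter names.
ms_protocol = Protocol("MS acquisition",
                       {"flow_rate_ul_min": 0.3, "collision_energy_ev": 35.0})

# The first run uses the defaults; the second deviates in one parameter only.
run_1 = ProtocolApplication(ms_protocol, "operator A", date(2007, 1, 15))
run_2 = ProtocolApplication(ms_protocol, "operator B", date(2007, 1, 16),
                            {"collision_energy_ev": 40.0})

print(run_1.effective_parameters())
print(run_2.effective_parameters())
```

Only the deviation is stored with the second run; the protocol definition itself is left unchanged.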

Figure 2: A UML diagram of ProtocolApplication in FuGE.
ProtocolApplication, EquipmentApplication and SoftwareApplication can be used to supply runtime values (ParameterValue) for Parameters that were defined by the Protocol, Software or Equipment.

In addition to recording run-time parameter settings, a ProtocolApplication references the input and output materials and/or data that were acted upon. As such, it can be used to construct experimental workflows by tracking the identity of all samples and data files. FuGE supplies a placeholder for the description of all physical materials (e.g., samples, organisms, chemicals, solutions) represented by the Material class. A Material can be annotated with ontology terms to describe its type or the role it plays within an experimental workflow (such as sample, buffer or reagent). It is also anticipated that the Material class will be extended within technology-specific formats; possible examples include gels, antibodies, arrays, reporters and so on.

ProtocolApplication can also be used to describe a data-processing pipeline by virtue of its references to input and output Data objects, such as a series of data transformations, where the output of each step serves as input for the next. As such, a single robust mechanism can be used to demonstrate the provenance of a highly processed outcome (e.g., gene expression profiles) from the starting samples, through sample processing, raw data acquisition and data analyses.
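
As a rough sketch of how provenance could be recovered from such input and output references, the following code walks backwards from a processed result to the starting sample. It assumes that materials and data are identified by plain strings and that each item is produced by at most one application; the `trace_provenance` function and the example workflow are hypothetical and not part of FuGE.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class ProtocolApplication:
    """One run of a protocol with the identifiers of its inputs and outputs."""
    protocol_name: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]


def trace_provenance(target: str,
                     applications: List[ProtocolApplication]) -> List[str]:
    """Return the protocols that led to `target`, in order of execution."""
    # Index every output by the application that produced it.
    produced_by: Dict[str, ProtocolApplication] = {
        out: app for app in applications for out in app.outputs
    }
    ordered: List[str] = []
    seen: Set[ProtocolApplication] = set()

    def visit(item: str) -> None:
        app = produced_by.get(item)
        if app is None or app in seen:   # starting material, or already handled
            return
        seen.add(app)
        for parent in app.inputs:        # resolve upstream steps first
            visit(parent)
        ordered.append(app.protocol_name)

    visit(target)
    return ordered


workflow = [
    ProtocolApplication("sample processing", ("tissue sample",), ("protein extract",)),
    ProtocolApplication("mass spectrometry", ("protein extract",), ("raw spectra",)),
    ProtocolApplication("database search", ("raw spectra",), ("peptide identifications",)),
    ProtocolApplication("unrelated assay", ("other sample",), ("other data",)),
]

print(trace_provenance("peptide identifications", workflow))
# → ['sample processing', 'mass spectrometry', 'database search']
```

The same traversal covers both material-handling and data-processing steps, because both are recorded through the same input and output references.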

Multidimensional data representations

A common aspect of high-throughput technologies is multidimensional data. Many technology types already have established data formats, some of which are open-source formats. These technology-specific formats tend to lack metadata structures to describe the context under which the data were produced. FuGE seeks to augment established formats with this type of metadata, thus providing a context for the data within a complete experiment. An example of this functionality is given by the Computational Proteomics Analysis System (CPAS)14 project, as described below, which uses FuGE to integrate mass-spectrometry formats into a complete workflow description. In the Data package, referencing external data files is accomplished by the ExternalData class, which contains an attribute for referencing a file and a mechanism for referencing validation schema, descriptors or documentation on the external format. These attributes use standard URI notation to specify locations, such as Web addresses or local files (Fig. 3).
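
A minimal sketch of a URI-based reference to an external file is given below. Only the class name ExternalData comes from the text; the attribute names, the `is_local` helper and the example locations are assumptions, and the URIs are placeholders rather than real resources.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass
class ExternalData:
    """A reference to a data file stored outside the workflow description."""
    name: str
    location: str                               # URI of the data file
    format_documentation: Optional[str] = None  # URI of a schema or format description

    def is_local(self) -> bool:
        """True if the location has no scheme or uses the file scheme."""
        return urlparse(self.location).scheme in ("", "file")


# Placeholder URIs for illustration only.
local_run = ExternalData("run 1 spectra", "file:///archive/run_01.mzXML")
remote_run = ExternalData("run 2 spectra",
                          "http://example.org/data/run_02.mzXML",
                          "http://example.org/formats/mzXML.xsd")

print(local_run.is_local(), remote_run.is_local())
# → True False
```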

Figure 3: The Data package enables data to be stored internally by a specification of dimensions, coordinates and matrices, or in an externally defined file format.

Alternatively, standards groups seeking to provide vendor-neutral data formats can encode data directly within FuGE using the data-matrix representation, specified by InternalData, Dimension and DimensionElement. The Dimension object describes an axis of the data matrix, which contains ordered instances of DimensionElement that describe the types of the coordinates in the axis. A simple example would be a tabular representation of gene expression, where one axis (Dimension) represents a gene list (a 9,500 feature array would have 9,500 DimensionElement instances for this Dimension), representing the dependent (responding) variable in the investigation. A second axis would represent the independent variable, such as time points within a time-course experiment, whereas a third axis represents the types of measurements derived from scanning the slide (e.g., signal, normalized value and P value). The InternalData object stores the data as a matrix of values, separated from the definition of the data dimensions. The set of coordinates of DimensionElements can be used to access individual values in the InternalData matrix. This structure for data storage and access is similar to the HDF5 specification for multidimensional scientific data (http://hdf.ncsa.uiuc.edu/HDF5/), which provides a representation that is highly efficient in terms of storage space and access speed.
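
The idea of addressing matrix values through coordinates of DimensionElements can be sketched as follows. The flat, row-major storage, the use of string labels for DimensionElements and the `get` method are assumptions chosen for brevity; they are not taken from the FuGE specification.

```python
from dataclasses import dataclass
from math import prod
from typing import List


@dataclass
class Dimension:
    """One axis of the data matrix with its ordered DimensionElement labels."""
    name: str
    elements: List[str]


@dataclass
class InternalData:
    """Values stored separately from the definition of the dimensions."""
    dimensions: List[Dimension]
    values: List[float]   # flat storage in row-major order (an assumption)

    def __post_init__(self) -> None:
        expected = prod(len(d.elements) for d in self.dimensions)
        if len(self.values) != expected:
            raise ValueError(f"expected {expected} values, got {len(self.values)}")

    def get(self, *coordinates: str) -> float:
        """Look up one value by giving one element label per dimension."""
        if len(coordinates) != len(self.dimensions):
            raise ValueError("one coordinate is required per dimension")
        offset = 0
        for dim, label in zip(self.dimensions, coordinates):
            offset = offset * len(dim.elements) + dim.elements.index(label)
        return self.values[offset]


# A toy 2 x 2 x 3 matrix: genes, time points and measurement types.
genes = Dimension("gene", ["gene_1", "gene_2"])
times = Dimension("time point", ["0h", "6h"])
measures = Dimension("measurement", ["signal", "normalized", "p_value"])

data = InternalData([genes, times, measures], [float(i) for i in range(12)])

print(data.get("gene_2", "0h", "normalized"))
# → 7.0
```

Keeping the dimension definitions apart from the values means that the labels are stored once per axis rather than once per value.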

Biological investigations

The Investigation package has been developed in consultation with cross-technology working groups15 to capture the overall goal and design of the investigation, such as high-level description of the motivation for the experiments, and the experimental variables (Fig. 4). Repositories are frequently queried using this kind of metadata to retrieve data sets of interest; thus, it is important that such information is captured in a consistent manner.

Figure 4: The Investigation package and a textual example instance.

The Investigation class captures the name and description of the entire investigation, and ontology terms can be used to annotate the type of design employed; suitable terms from the MGED Ontology16 include: “dose response design” or “genetic modification design.” The package also models the important sources of material (single organisms, populations, tissue, cell cultures and so on), as determined by the investigator, for the purpose of providing a summary that can be queried. The Investigation class can also define a hypothesis, the conclusions or other important classification information as free text or as rich ontological structures.

InvestigationComponent represents a single functional genomics technique, allowing the user to specify experimental replicate design, the normalization strategy and quality control procedures, which are properties given prominence in the MIAME guidelines. InvestigationComponent can also define experimental design in relation to the technology (e.g., 'dye swap') through the use of ontology terms.

The principal comparators in an investigation (the manipulated or independent variables), such as dosage, genetic difference or environmental factor, are modeled by Factor. A Factor can, but need not, be shared across different instances of InvestigationComponent; for instance, certain technologies might be used to measure certain variables but not others. The value for a Factor is stored in FactorValue in conjunction with the Measurement class. Although FuGE does not include a specific 'Unit' class, this essential information can be provided via OntologyTerm references. In addition to providing the units for FactorValue measurements, ontologies can provide terms for nonnumeric FactorValues, such as cell line or sex (Fig. 4, Factor 1). There is also a mechanism for relating particular experimental variables to data of interest, via the DataPartition class. This will allow queries of the type “retrieve all data relating to the 10 mg drug dose.” The Factor, FactorValue model is intended for capturing a summary description of the independent variables tested; the exact details of the study design and relationships between variables are represented in the Protocol and Data packages, allowing highly complex studies to be reported.
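
A query of this type can be sketched in a few lines. The classes below borrow the names FactorValue and DataPartition from the text, but their attributes, the plain-string unit standing in for an OntologyTerm reference, the `select_partitions` function and the example data are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass(frozen=True)
class FactorValue:
    """A value of one experimental factor, numeric or nonnumeric."""
    factor: str
    value: Union[float, str]
    unit: Optional[str] = None   # stands in for an OntologyTerm reference


@dataclass
class DataPartition:
    """Links a portion of the data to the factor values that apply to it."""
    data_id: str
    factor_values: List[FactorValue]


def select_partitions(partitions: List[DataPartition],
                      wanted: FactorValue) -> List[str]:
    """Return the identifiers of all partitions annotated with `wanted`."""
    return [p.data_id for p in partitions if wanted in p.factor_values]


dose_10 = FactorValue("drug dose", 10.0, "mg")
dose_50 = FactorValue("drug dose", 50.0, "mg")
female = FactorValue("sex", "female")
male = FactorValue("sex", "male")

partitions = [
    DataPartition("data_1", [dose_10, female]),
    DataPartition("data_2", [dose_50, female]),
    DataPartition("data_3", [dose_10, male]),
]

# "Retrieve all data relating to the 10 mg drug dose."
print(select_partitions(partitions, dose_10))
# → ['data_1', 'data_3']
```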

Building extensions on FuGE

FuGE can store general details about a protocol, samples and data but does not model specific properties of techniques or instruments, which is left to experts in those domains to define. There are two methods that can be used to define extensions:

1. Extending the object model with more specific attributes and associations that enforce the reporting of particular information. Modular formats based on FuGE can fill this role of enforcing constraints while remaining compatible with other FuGE-based formats.

2. Developing external ontologies that include specific controlled vocabulary terms and rules that govern their usage.

Several formats based on FuGE are being developed by PSI and MGED. To date, these formats have extended parts of the model (method 1) to capture specific details about the technology. However, ontologies are also being developed to capture parts of the model that do not have a fixed scope and may be extended incrementally over time. FuGE also relies on ontologies for enumerated lists of values, such as units. The Ontology of Biomedical Investigation (OBI), formerly called the Functional Genomics Investigation Ontology17 (FuGO), is being developed in parallel to FuGE and will provide terminology for annotating data in a consistent manner (http://obi.sourceforge.net/).

As an example extension of FuGE, a model is under development to describe electrophoresis (GelML, http://www.psidev.info/index.php?q=wiki/Gel_electrophoresis), which is used in proteomics to separate complex mixtures of proteins in a polyacrylamide gel matrix. GelML aims to support a proposal for the minimum reporting requirements for gel electrophoresis18 and to serve as a format for exchanging gel electrophoresis data. For the purposes of this example, a two-dimensional gel electrophoresis protocol consists of the following steps: loading a sample onto a gel strip, performing electrophoresis in the first dimension, loading the strip onto a second gel and then performing electrophoresis in the second dimension. The example in Figure 5 demonstrates how this complex procedure can be expressed by an extension of Protocol and Action.

Figure 5: A UML diagram of an extension to FuGE for capturing protocols for two-dimensional gel electrophoresis from GelML.
The complete UML diagrams for SampleLoadingProtocol, ElectrophoresisProtocol and GenericProtocol are not shown.

The Gel2DProtocol class has four distinct steps, expressed by extensions of Action. SampleLoadingAction references a child protocol for capturing how the samples are loaded (SampleLoadingProtocol). The Actions for the first- and second-dimension separations have a reference to ElectrophoresisProtocol (which consists of a collection of parameters for voltages and timings, not shown). Finally, InterDimensionAction represents the stages that occur between the first- and second-dimension separations, and references the FuGE GenericProtocol class that captures any procedure that has no explicit model.
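
The extension mechanism can be approximated with ordinary subclassing. The sketch below is not GelML: the class names Gel2DProtocol, SampleLoadingAction, InterDimensionAction, SampleLoadingProtocol, ElectrophoresisProtocol and GenericProtocol come from the text, whereas FirstDimensionAction, SecondDimensionAction, the attribute names, the type checks and the example values are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Protocol:
    name: str
    actions: List["Action"] = field(default_factory=list)


@dataclass
class Action:
    name: str
    child_protocol: Optional[Protocol] = None


class GenericProtocol(Protocol):
    """Captures any procedure that has no explicit model."""


class SampleLoadingProtocol(Protocol):
    """Extension describing how samples are loaded."""


@dataclass
class ElectrophoresisProtocol(Protocol):
    """Extension with voltage and timing parameters (hypothetical names)."""
    voltage_v: float = 0.0
    duration_min: float = 0.0


class SampleLoadingAction(Action): pass
class FirstDimensionAction(Action): pass     # hypothetical name
class InterDimensionAction(Action): pass
class SecondDimensionAction(Action): pass    # hypothetical name


class Gel2DProtocol(Protocol):
    """A protocol that requires exactly four typed steps."""

    def __init__(self, name: str, loading: Protocol, first: Protocol,
                 between: Protocol, second: Protocol) -> None:
        # Enforce that each step references the expected kind of child protocol.
        for child, expected in ((loading, SampleLoadingProtocol),
                                (first, ElectrophoresisProtocol),
                                (between, GenericProtocol),
                                (second, ElectrophoresisProtocol)):
            if not isinstance(child, expected):
                raise TypeError(f"{child.name!r} must be a {expected.__name__}")
        super().__init__(name, [
            SampleLoadingAction("load sample", loading),
            FirstDimensionAction("first dimension", first),
            InterDimensionAction("between dimensions", between),
            SecondDimensionAction("second dimension", second),
        ])


gel = Gel2DProtocol(
    "2D gel",
    SampleLoadingProtocol("strip loading"),
    ElectrophoresisProtocol("first separation", voltage_v=8000.0, duration_min=600.0),
    GenericProtocol("strip equilibration"),
    ElectrophoresisProtocol("second separation", voltage_v=200.0, duration_min=300.0),
)
print([type(action).__name__ for action in gel.actions])
```

Software written against the general Protocol and Action classes can still process such an extension, because every extended class remains an instance of its parent.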

The advantages of using FuGE in this way are as follows. FuGE provides structure for fitting extensions into the larger context of a complete workflow, such as relating protocols to samples and data files, thus facilitating format design by allowing developers to focus on what to capture rather than on how to structure the model. In addition, extended objects gain the rich functionality of FuGE for auditing, controlling security settings and having a consistent identification system. Furthermore, by extending from specific FuGE classes, models of different techniques will share significant structural similarities, facilitating future data-integration efforts and reducing the learning time for new models. Modular formats built on FuGE will also allow developers to focus on a single representation (namely, UML development) from which the XML Schema, relational database definition and software components can be generated automatically. This automation should simplify both development and the mapping work required to maintain parallel implementations.

Over the next year, standards that extend from FuGE will begin to emerge; in addition, data formats that are not based on FuGE will continue to exist. FuGE is intended to be used for capturing complete experimental workflows. In a typical usage scenario, software will be developed that facilitates capture of FuGE-compliant data and data conforming to extensions of FuGE (by modular additions to the software). The software should allow a complete 'omics investigation to be packaged within the FuGE file format. The file will have external references to other data formats, such as outputs from specific instruments, some of which will have been developed as extensions to FuGE. The FuGE file will allow the complete experiment description to be exchanged or sent to public databases. Where referenced files are not FuGE extensions, additional software is likely to be required for local data capture, processing and display.

Discussion

In the past, functional genomics standards development focused on single technologies or solutions within a single community. In contrast, FuGE has received input from a diverse set of standards bodies and organizations with an interest in data sharing and, as such, represents a major cross-community collaboration. Several groups are currently using or evaluating FuGE as the basis for their respective data models.

The Fred Hutchinson Cancer Research Center has developed CPAS, which uses a file format for archiving based on an early release of FuGE. The archive file stores information describing the experiment, including materials, protocols and types of data involved. Files produced from assays (e.g., raw mass-spectrometry data in mzXML format19) and the results of data-analysis procedures (e.g., a pepXML file from a search engine result20) are packaged together with supporting metadata, and the collection can be submitted to CPAS or other compatible systems, such as the ProteusLIMS commercial laboratory information management system (http://www.genologics.com/). The PRIDE21 public data repository also plans to support PSI-endorsed formats based on FuGE.

FuGE is currently being used by MGED to develop MAGE version 2, with the aim of reducing the complexity of the format as a result of feedback from MAGE version 1, and to include additional experimental approaches, including SNP arrays, protein arrays (developed in collaboration with PSI) and a number of data analyses. PSI is also developing formats based on FuGE for gel electrophoresis, sample processing and reporting of mass-spectral analyses22. These are expected to be released within the next year. To date, FuGE has not been significantly deployed to manage data resulting from clinical trials. However, FuGE can describe assays stemming from clinical samples and study-design information, and thus complements existing mechanisms for reporting clinical trials. The Metabolomics Standards Initiative (http://msi-workgroups.sourceforge.net/) has recently been formed, and the data-exchange working group is currently evaluating FuGE. The group is likely to recommend its adoption for capturing investigational design and sample processing and as a basis for future formats involving metabolite separation and analysis. FuGE is also being evaluated by groups developing formats for RNAi, flow cytometry, cellular assays and immunohistochemistry.

FuGE is not formally owned by any single standards organization. Instead, it constitutes a stable, independent artifact that will be formalized in a standardization process. The model should be extended according to a set of guidelines, a draft of which appears on the Web site, thus encouraging new formats to share a consistent structure. Formats that extend FuGE without following the guidelines may not be able to use the templates for producing XML Schema, relational database definition or software platforms. We believe this kind of open process will avoid the need for the formation of large cross-technology standards groups in which it is difficult to make rapid progress in response to technological developments.

Adoption of FuGE by the transcriptomics, proteomics and metabolomics communities will result in a common format for representation of experimental descriptors that are independent of a particular technique. Researchers can describe the overall investigation, the source of material and experimental techniques using the core FuGE model, potentially allowing for the ad hoc assembly of studies that cross technological boundaries. The ability to provide rich annotation using both general and domain-specific ontologies, as developed by OBI, will facilitate linkage of cross-platform and cross-organization investigations.

As research groups move towards systems-biology approaches that cross technologies, the type of convergence of data formats that FuGE promotes will be essential to ease the burden of the capture, dissemination and publication of annotated data sets. Widespread adoption of FuGE will also aid evaluation and comparison of those published results by improving the uniformity of the annotation of data deposited in public repositories. Finally, convergence of data formats will promote the development of software applications that span technology types, facilitating the peer-review process and fostering reanalysis of data and novel methods development, such as modeling of complex biological processes.

Methods

'Use cases' gathered from a broad spectrum of the functional genomics research communities were used to develop the FuGE data model. As such, it contains representations of the concepts common to most functional genomics experiments. The initial development stages involved analysis of MAGE, the removal of components specific to microarrays, and redesign of components to fit the wider set of use cases. Subsequently, feedback obtained during the development of external formats or projects based on the FuGE milestones was communicated to the FuGE developers and incorporated in later releases.

Several stable, provisional versions of FuGE, termed milestones, have been publicly released to allow developers to work on extensions of FuGE or implement software using FuGE. Each milestone consists of the UML model, the XML Schema produced from the model, and documentation. A formal standardization process has been followed, including a significant period for public comment on the specifications, which has resulted in an official stable release (FuGE version 1.0).

The FuGE project relies on several freely available tools for development. The UML model is developed using the MagicDraw CASE tool (http://www.magicdraw.com/) and is subsequently translated to other formats using AndroMDA (http://www.andromda.org/), an open source project that can produce various types of documents from the UML model using a set of document templates. We have specifically tailored AndroMDA templates for production of an XML Schema, a relational database definition and Java software components. Templates for supporting other software platforms, such as Perl or C++, can be developed in the future. The use of publicly available tools for model design and format generation provides a common platform for developing community-specific extensions and avoids excluding particular groups based on software costs.

The FuGE UML specification is restricted to class diagrams, where classes use simple inheritance (only one parent class) and define attributes and associations to other classes but no procedures (methods). The restriction on inheritance greatly simplifies the mapping to other platforms, such as the XML Schema and relational database schema, and there are relatively few instances where multiple inheritance would convey any advantage. The Supplementary Note contains a brief tutorial illustrating the subset of UML syntax used in FuGE.

Note: Supplementary information is available on the Nature Biotechnology website.

Box 1 Overview of the packages in FuGE