One purpose of the biomedical literature is to report results in sufficient detail that the methods of data collection and analysis can be independently replicated and verified. Here we present reporting guidelines for gene expression localization experiments: the minimum information specification for in situ hybridization and immunohistochemistry experiments (MISFISHIE). MISFISHIE is modeled after the Minimum Information About a Microarray Experiment (MIAME) specification for microarray experiments. Both guidelines define what information should be reported without dictating a format for encoding that information. MISFISHIE describes six types of information to be provided for each experiment: experimental design, biomaterials and treatments, reporters, staining, imaging data and image characterizations. This specification has benefited the consortium within which it was developed and is expected to benefit the wider research community. We welcome feedback from the scientific community to help improve our proposal.
High-throughput analyses of gene expression in biological samples (for example, using microarrays or proteomics) often do not provide information about the cell types or spatial domains within tissues in which genes are expressed and may not reveal dynamic or transient gene expression. Therefore, such analyses are often followed by experiments to determine the location and degree of gene expression in specific cell types within the tissue by probing with reporters for the genes of interest. In addition, high-throughput analyses of fresh samples can be supplemented with a wealth of clinical information associated with tissue samples in large collections worldwide. However, studies that use in situ hybridization (ISH) and immunohistochemistry (IHC) staining and/or their resulting images are often presented without the information needed to understand the images or the methods that produced them. Furthermore, neither the reagents and methods nor the results are easily searchable through current biomedical literature databases such as PubMed. As the interpretation of ISH and IHC stains may differ between observers, between different image analysis platforms and programs, and even between different sessions using the same image analysis platform and program1, communication of the methods used is critically important for evaluating published work.
Data annotation specifications for microarray experiments1,2,3,4 have begun to benefit the biomedical research community. Many researchers participated in the debate around MIAME and contributed to its development. The accessibility of data increased significantly, aided by common exchange formats; open-source software and ontologies were developed by many groups; and discussion forums promoted interaction between manufacturers and experimenters. Similar guidelines are under development for other high-throughput technologies5,6,7,8,9,10.
In the area of microscopy images, data formats that facilitate the exchange of data have been proposed. The XML (Extensible Markup Language) data format for tissue microarrays does not include minimum-information reporting guidelines11. Also available is the Open Microscopy Environment (OME), which provides a flexible XML data format for storing and transmitting metadata for microscopy image datasets (see Table 1 for URL). However, there is no comprehensive specification for facilitating the exchange of data from visual interpretation–based experiments that seek to determine the abundance and/or localization of proteins or mRNA in tissues (hereafter referred to as 'gene expression localization experiments'), such as ISH and IHC.
The goals of MISFISHIE
MISFISHIE describes the minimum information to be provided when publishing, making public or exchanging results from visual interpretation–based tissue gene expression localization experiments, such as ISH, IHC, lectin affinity histochemistry, and experiments that involve reporter gene constructs (for example, green fluorescent protein (GFP) and β-galactosidase). Compliance with this specification should enable researchers at different laboratories to fully evaluate data and reproduce experiments. Although MISFISHIE facilitates the identification of specific sources of variability, it cannot, and does not aim to, reduce this variability. However, if complete information, including raw image data, is always provided, the original interpretations may be reevaluated by other researchers. Like MIAME12, MISFISHIE prescribes that the most relevant details within each of the sets of broad categories of information be provided, relying on data producers and reviewers to ensure that each category contains the information deemed necessary to allow readers to fully assess and reproduce experiments.
MISFISHIE does not dictate a specific format for reporting information. We intend to develop a data model based on the concepts of MAGE-OM (MicroArray Gene Expression Object Model) and software based on the MAGEstk (MicroArray Gene Expression software tool kit)3. This model and the associated XML-based mark-up language will provide a data format for archiving or transferring data. Because a major revision of MAGE-OM, the FuGE-OM (Functional Genomics Experiment Object Model)13, is at present being developed to accommodate data from other functional genomics experiments, the MISFISHIE-derived object model will probably be an extension of FuGE-OM rather than a separate construct. A simpler, non-XML format following the concepts of MAGE-TAB14 may also facilitate data sharing in cases where simplicity is most important15.
MISFISHIE was designed to function together with other technology-related specifications such as MIAME and MIAPE (Minimum Information About a Proteomics Experiment)16 to support functional genomics investigations. We anticipate that MISFISHIE will be integrated with other MGED (Microarray and Gene Expression Data) Society standards17, in particular through the Reporting Structure for Biological Investigation (RSBI) working group18 and the Minimum Information for Biological and Biomedical Investigations (MIBBI) project19. Clearly, the goal of integrating different data types will be best served by a common reporting structure. Separation of the minimum information specification and the data format is important because in the data format there should be scope to provide unlimited further information beyond the minimum specification and there should be the ability to encode incomplete information for optimal flexibility. Furthermore, broad acceptance of a minimum information standard would greatly aid the design of a data model.
To facilitate data transfer between some existing expression databases, a MISFISHIE-compliant XML data format has been developed. A document type definition (DTD) was developed for three expression databases: ANISEED (Table 1), COMPARE20 and 4DXpress21. Its format follows MISFISHIE. This DTD and an associated example are available at ANISEED (http://crfb.univ-mrs.fr/aniseed/exchange_format.php) and at COMPARE (http://compare.ibdml.univ-mrs.fr/exchange_format.php).
It has long been appreciated that improved standards for IHC are needed. However, the focus has been on developing standardized technical protocols that would produce more uniform staining22 or on reducing subjectivity in interpreting histological sections23. MISFISHIE does not endorse standardized methodologies or data interpretation but rather seeks to promote complete disclosure of the methodologies used.
Guidelines for tumor-marker prognostic studies, known as REMARK24, were recently established. REMARK encompasses outcome studies based on tumor markers of any kind, not just those of IHC. MISFISHIE encompasses nearly any study employing IHC or ISH regardless of context, such as a tumor-marker study or a zebrafish embryo study. We expect that requirements pertaining to specialized subdomains (for example, clinical prognostic studies) will be added to MISFISHIE in the future.
Existing databases for gene expression localization data provide a useful framework from which to build a specification. Two databases for the mouse research community, the Mouse Gene Expression Database (GXD)25 and the Edinburgh Mouse Atlas Gene Expression (EMAGE) database26, influenced the design of MISFISHIE. We replaced mouse-specific fields with more organism-neutral ones and eliminated fields deemed unnecessary. In these databases, many experiments entered by curators using information in journal articles have empty fields because the papers lacked sufficient detail. MISFISHIE-compliant publications will result in more complete database descriptions. Although MISFISHIE is primarily designed for peer-reviewed journal articles, it will guide database development as well. For example, the release of ANISEED version 3.0 is based on MISFISHIE rules, and the new schema of this database is MISFISHIE compliant. The inclusion of specific experimental details, such as tissue type, reagents and methods, will allow investigators to more efficiently find precedents for their experiments. For example, an investigator might rapidly search all publications that reported immunoperoxidase localization of membrane metallo-endopeptidase (CD10, MME) in the human prostate using a database and retrieve information on how the gene localization experiments were conducted.
An abbreviated MISFISHIE checklist is provided in Box 1. A printable version is available at http://scgap.systemsbiology.net/standards/misfishie/MISFISHIE_Checklist_2007-10-28.xls. The complete checklist is available at http://scgap.systemsbiology.net/standards/misfishie. One example of real experimental data annotated according to MISFISHIE is given in Supplementary Note 1 online; more examples accompany the complete checklist at the preceding URL.
The checklist covers six types of information; for each, the guiding principle is to supply enough information to allow the experiment to be reproduced.
Experimental design
Biomaterials (specimens) and treatments (section or whole-mount preparation)
Reporters (probes or antibodies)
Staining
Imaging data
Image characterizations
Ontologies, such as the MGED Ontology (MO)4 or Ontology for Biomedical Investigations (OBI; formerly named FuGO)27 are extremely advantageous as a source of descriptors because they facilitate computational searches of data. For terms outside the scope of OBI, such as those in anatomy, another appropriate ontology may be used. A good list of ontologies is maintained at the Ontologies for Biology Organization (OBO) website (Table 1). Use of OBI and other ontologies will be especially important as MISFISHIE-supporting applications and databases are developed. Many of the terms used in this specification are already defined in OBI.
Experimental design. The experiment as a whole is described by the following:
Experiment description: the aims of the experiment.
Assay type(s): for example, IHC, ISH, lectin affinity histochemistry, cell lineage– or tissue-specific reporter expression.
Experiment design type: for example, comparisons of normal versus diseased tissue, of multiple tissue or embryo specimens of similar type, or of multiple probes or antibodies applied to the same tissue; a localization screen; etc. The MGED Ontology ExperimentDesignType has many entries categorizing design type.
Experimental factors: the parameters or conditions that are tested, such as probe or antibody, disease state, genetic variation, structural unit, age, etc. Again, the MGED Ontology is a rich source of terms for describing the factors being tested.
Total number of assays performed in the experiment: an assay is defined as one instance of a hybridization/stain of a single specimen with a single reporter. Thus, a tissue microarray consisting of a 10 × 10 array of tissues, stained with a single reporter, would be counted as 100 assays. If replicates or reruns are a component of the experimental design, provide details, including the number of replicates per tissue, per reporter, etc.
URL of any websites or database accession numbers (if available) pertinent to the experiment.
Contact information for communicating with the experimenters.
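As an illustration only, the experimental design items listed above could be captured in a small record. The following Python sketch is not part of MISFISHIE, which prescribes no data format. The class name `ExperimentDesign`, its field names and the helper `count_assays` are hypothetical, and the count assumes that every specimen is stained with every reporter.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExperimentDesign:
    """Hypothetical record for the experimental design items; not a MISFISHIE format."""
    description: str                 # aims of the experiment
    assay_types: List[str]           # e.g. ["IHC"] or ["ISH"]
    design_type: str                 # e.g. "normal versus diseased tissue"
    experimental_factors: List[str]  # e.g. ["antibody", "disease state"]
    total_assays: int                # one assay = one specimen stained with one reporter
    urls_or_accessions: List[str] = field(default_factory=list)
    contact: str = ""


def count_assays(n_specimens: int, n_reporters: int, replicates: int = 1) -> int:
    """Count assays, assuming every specimen is stained with every reporter."""
    if min(n_specimens, n_reporters, replicates) < 1:
        raise ValueError("all counts must be at least 1")
    return n_specimens * n_reporters * replicates


# A 10 x 10 tissue microarray stained with a single reporter counts as 100 assays.
tma_assays = count_assays(n_specimens=10 * 10, n_reporters=1)

design = ExperimentDesign(
    description="Localize one protein across a tissue microarray (illustrative).",
    assay_types=["IHC"],
    design_type="multiple tissue specimens of similar type",
    experimental_factors=["structural unit"],
    total_assays=tma_assays,
)
```

If the design is not a full cross of specimens and reporters, the total would instead be obtained by counting the individual specimen and reporter pairs that were actually stained.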
Biomaterials (specimens) and treatments (section or whole-mount preparation). Describing specimens comprehensively is challenging, as they may have dozens or hundreds of characteristics. This is especially true for material from human subjects when clinical information is available. Characteristics that are known to differ among specimens should be provided with each specimen, whereas common attributes of all the specimens may be provided only once. The biological sample is described by the following:
Origin of the specimens.
Attributes of the individual(s). The organism species must be named, preferably using the US National Center for Biotechnology Information (NCBI) taxonomy, and for non-human organisms the strain and mutant alleles should be named according to the accepted standards for that organism. Other attributes may include, but are not limited to, sex, age, developmental stage, genotype, and phenotype.
Physiologic state of the individual(s) (normal versus diseased).
Relevant exogenous factors (for example, treatment, special diet).
Anatomic source of the tissue or cell sample.
Provider of the specimens.
The information necessary to reproduce the biomaterials is not limited to the above examples. Use of an ontology or controlled vocabulary is highly encouraged, although a standardized set of terms and a single, widely accepted ontology are not yet available. The rationale for providing specific structural detail is that the location of an object, such as a cell type that is being studied, may correlate with expression of a specific gene by that cell type. Structural detail may be important not only for cases where gene expression depends on tissue handling (for example, there is stronger labeling at the specimen edges) but also in cases where there is heterogeneity even within a single microanatomical unit (for example, in lung tumors, expression of cell cycle regulatory genes is highest at the periphery)28.
Manner of preparation of the specimens.
Nature of the specimens (for example, whole tissue, whole mounts of tissue, tissue sections, thickness of sections, whole cells, or sections of cells).
Manner in which the specimens were prepared for the experiments (for example, fixed specimens, with type of fixative and duration of fixation; fresh, non-fixed, non-frozen specimens; or non-fixed, frozen specimens; sections mounted on slides versus sections floating in reagents).
Protocols used. Referencing previously published protocols is permissible if the protocols are appropriately detailed and were strictly followed.
Sensitivity of the immunoreaction of some gene products to fixation is exemplified by the observation that immunostaining for the cyclin-dependent kinase inhibitor p27Kip1 (CDKN1B) was least frequent and least intense in prostate cancer cells that were farthest from the cut surface of a fixed tissue, that is, in the cells that were least rapidly fixed29.
Reporters (probes or antibodies). Reporters (probes, lectins or antibodies) can differ in reactivity from lot to lot and from manufacturer to manufacturer. A manufacturer's literature usually provides most of the needed information but may not be permanent. For privately produced reporters, enough information should be provided to enable another lab to generate them. Validation of reporters in the current literature is often poor. MISFISHIE does not at present require that researchers validate reporters, but such validation is encouraged and should be reported when performed.
Unambiguous genomic identification of each reporter:
For in situ hybridizations, provide the corresponding GenBank/EMBL/DDBJ accession number and, if applicable, the start and end nucleotide positions of the probe within that sequence. Also, provide the accession number version or database release version.
For antibodies, provide the protein identifier, including specific version information for the accession number or database release.
Full sequence of each probe or clone number of each antibody. For fluorescent protein experiments, the promoter sequence should be specified. In each case, provide the method by which the reporter was characterized.
If the sequence or clone number is not known, the template or clone must be made publicly available. Provide specific details on how the template or clone may be obtained.
In tissue localization experiments based on expression of a fluorescent protein reporter gene fused to the promoter of a gene of interest, what is important is not the sequence of the reporter but the sequence of the promoter, which confers cell and tissue specificity on the reporter.
Protocol(s) for how the reporters were designed and produced or the source from which they were obtained.
For reporters purchased from a company, the company name, address, catalog number and lot number should be provided.
For a custom-made antibody, the putative antigen and references to studies that characterize the sensitivity and specificity of the antibody in tissue immunostains should be given.
Additional attributes of the reporter:
For antibodies, the type of primary antibody (monoclonal or polyclonal), the immunoglobulin isotype and the organism in which the antibody was generated.
For lectins, the full name (for example, Dolichos biflorus), the source of the lectin (for example, which company produced it), how it was detected (for example, whether it was fluorescently labeled or biotinylated, with follow-up histochemical analysis), and how it was labeled (if, for example, the investigators labeled the lectin themselves, they should give source of the reagents, the method and/or the labeling kit).
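A completeness check along the lines of the reporter items above might look like the following sketch. It is an assumption-laden illustration and not a MISFISHIE artifact: the field names in `ANTIBODY_FIELDS` paraphrase the antibody items of the checklist, the function `missing_fields` is hypothetical, and the example values are placeholders that do not describe a real reagent.

```python
from typing import Dict, List

# Hypothetical field names paraphrasing the antibody items listed above.
ANTIBODY_FIELDS: List[str] = [
    "protein_identifier",  # with accession or database release version
    "clone_number",
    "source_or_protocol",  # company, catalog and lot number, or production protocol
    "clonality",           # monoclonal or polyclonal
    "isotype",
    "host_organism",
]


def missing_fields(record: Dict[str, str], required: List[str]) -> List[str]:
    """Return the required fields that are absent or blank in the record."""
    return [name for name in required if not record.get(name, "").strip()]


# Placeholder values only; they do not describe a real reagent.
example = {
    "protein_identifier": "EXAMPLE_ACCESSION.1",
    "clone_number": "EXAMPLE-CLONE",
    "clonality": "monoclonal",
    "host_organism": "",
}

gaps = missing_fields(example, ANTIBODY_FIELDS)
```

Here `gaps` lists the source, isotype and host organism as still to be supplied. An analogous list of fields could be written for probes or lectins.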
Staining. Staining protocols vary considerably, and the merits of standardizing them have been discussed extensively in the literature.
Number of detectable reporters in the hybridization or stain (for example, more than one for multiple-dye fluorescence microscopy) and details about the detection method:
Detection reagent (for example, fluors, enzyme substrates, gold particles).
Source of the detection system and description of the reaction.
Protocol to produce the hybridization or immunostain, including a description of how the tissue (organism, organ or section) was mounted onto the slide or substrate and treatments of the section (for example, IHC protocol inclusive of parameters such as buffer, temperature, post-wash conditions, etc.). Referencing previously published protocols is permissible if the protocols are appropriately detailed and were strictly followed. Also:
What steps, if any, were taken to decrease nonspecific reaction product. For example, in immunoperoxidase experiments, the specimen preparation may be preincubated with albumin solution to block nonspecific binding or with peroxide solution to block signal due to endogenous peroxidase.
Use of an antigen or gene product retrieval method.
Information about assay controls: the nature of both positive and negative tissue and reporter controls (or state if controls were not performed). The same level of detail for the tissue controls should be reported as for the cells or tissues that are being studied. Optionally, provide specificity reporter controls, such as competitive inhibition with either purified protein or peptide in IHC.
Imaging data. Although the MIAME specification stops short of requiring microarray image data, we propose that representative IHC or ISH images be provided, as interpretation of these images varies among observers. Images are not needed to reproduce an experiment, but they aid in the analysis. Both positive and negative results should be reported.
Repositories for images from gene expression localization experiments exist for several model organisms, such as GXD25 and EMAGE26 for mouse and ZFIN30 for zebrafish, but not for human. A general, organism-independent database would be very valuable, as it could provide examples of tissue localization studies, serve as a reference site for verifying the tissue localizations of reporter reagents and provide accession numbers for publications. Two projects aim to provide these features. MorphBank is an existing general-purpose image repository for biological research, and BioImage is an image repository under construction (ref. 31).
The information on imaging data should include:
Digital images for each assay in the study, digitally available for download without charge. Images should be of sufficient resolution to allow independent characterization and provided in a standard file format (for example, JPEG, PNG, GIF, TIFF). Images should be named or tagged with the reporter and specimen that they represent.
Detection method by which hybridization or staining is observed (for example, for each channel, fluorescence excitation and emission wavelengths if more than one reporter is used). If the detection method is the same for all images, it need only be mentioned once.
Images for the controls are optional.
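One simple way to satisfy the requirement that images be named or tagged with their reporter and specimen is a file-naming convention. The sketch below is only one possibility and is not prescribed by MISFISHIE. The functions `image_name` and `uses_standard_format`, the double-underscore separator and the specimen label are all hypothetical; the accepted extensions correspond to the example formats named above.

```python
import re
from pathlib import PurePosixPath

# File formats named in the text as examples of standard formats.
STANDARD_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".tif", ".tiff"}


def _slug(text: str) -> str:
    """Lowercase the text and replace runs of other characters with '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def image_name(specimen: str, reporter: str, extension: str) -> str:
    """Build a file name that tags an image with its specimen and reporter."""
    ext = extension.lower()
    if not ext.startswith("."):
        ext = "." + ext
    if ext not in STANDARD_EXTENSIONS:
        raise ValueError(f"not one of the listed standard formats: {extension}")
    return f"{_slug(specimen)}__{_slug(reporter)}{ext}"


def uses_standard_format(filename: str) -> bool:
    """Check whether an existing file name has one of the listed extensions."""
    return PurePosixPath(filename).suffix.lower() in STANDARD_EXTENSIONS


# Hypothetical specimen label and reporter name.
name = image_name("Prostate section 12", "anti-CD10", "TIFF")
```

Embedded image metadata or a separate manifest file would serve the same purpose where file names cannot be controlled.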
Image characterizations. Interpretations of the results should be reported in categorized tabular format so that they can be easily stored in a database, queried and compared with other expression data. The following minimum requirements can be supplemented with further characterizations as needed.
Ontology entries, including reference to the ontology terms, accession numbers or terms and definitions if sufficient detail cannot be found in an existing ontology32,33,34,35 for individual structural units used for classification. (Note that some ontologies, such as the College of American Pathologists' Systematized Nomenclature of Medicine–Clinical Terms (SNOMED CT) and the Unified Medical Language System (UMLS) of the US National Institutes of Health's National Library of Medicine, may contain licensing restrictions that make them unavailable to some or limit the use of the terms; a MISFISHIE-compliant document that contains SNOMED CT entries or some UMLS entries may not be legally redistributable36). Structural units include organ, tissue, cell and subcellular component. List and characterize only the relevant structural units and not those that are visible in assays or slides but irrelevant.
Intensity scale, ideally one from the MGED Ontology. For example, a three-level scale of present, absent or equivocal might be appropriate for evaluating IHC stains. However, any appropriate scale may be used as long as each gradation of intensity in the scale is defined.
For each relevant structural unit in each assay or image:
Staining intensity or the fraction of the structural unit's population showing each intensity (see example below).
Other optional annotations or characterizations of the structural unit: for example, feature density, qualitative characteristics or spatial distribution of the structural unit or staining. The use of referenced ontology terms is encouraged.
Both positive and negative calls of staining relevant to the experiment should be reported. A negative result is an upper limit to the expression level, where the limit is usually not well known. If some structural units cannot be characterized for some reporters, corresponding calls may be null. For example:
Luminal epithelial cell: present
Basal epithelial cell: absent
is sufficient; or, when appropriate, more detail:
Luminal epithelial cell: 90% present, 10% equivocal, 0% absent
Basal epithelial cell: 10% present, 10% equivocal, 80% absent
Unless only a few expression calls are presented, it is clearest if the calls are presented in tabular form.
Optionally, the protocol for the characterization and information about the basic characterization technique. For example, how many observers performed the characterizations, whether the characterizations were performed from the images themselves or visually through the instrument and any exceptions or assumptions made in characterizing the data. One example of a well described characterization protocol may be found in ref. 37. We also note that the use of digital images may have advantages in terms of replication and decreased intra- and interobserver variability38.
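The percentage-based calls in the example above lend themselves to a simple tabular encoding. The following sketch assumes the three-level scale of present, equivocal and absent, and checks that the fractions reported for each structural unit total 100%. The function names `check_call` and `to_table` and the tab-separated layout are hypothetical choices, not part of the specification.

```python
from typing import Dict, List

SCALE = ("present", "equivocal", "absent")  # the three-level example scale


def check_call(fractions: Dict[str, float]) -> None:
    """Raise if a call uses undefined levels or its percentages do not total 100."""
    unknown = set(fractions) - set(SCALE)
    if unknown:
        raise ValueError(f"levels not in the scale: {sorted(unknown)}")
    total = sum(fractions.values())
    if abs(total - 100.0) > 1e-9:
        raise ValueError(f"percentages sum to {total}, not 100")


def to_table(calls: Dict[str, Dict[str, float]]) -> List[str]:
    """Render calls as tab-separated rows, one structural unit per row."""
    rows = ["structural unit\t" + "\t".join(SCALE)]
    for unit, fractions in calls.items():
        check_call(fractions)
        cells = [f"{fractions.get(level, 0):g}%" for level in SCALE]
        rows.append(unit + "\t" + "\t".join(cells))
    return rows


# Values taken from the example in the text.
calls = {
    "Luminal epithelial cell": {"present": 90, "equivocal": 10, "absent": 0},
    "Basal epithelial cell": {"present": 10, "equivocal": 10, "absent": 80},
}
table = to_table(calls)
```

A null call for a structural unit that could not be characterized would need separate handling, for example by omitting the unit or recording an explicit missing value.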
Survey of the recent literature
To compare MISFISHIE with current publication practice, we examined articles reporting on IHC or ISH in the last 7 years. Three articles39,40,41 were assessed and discussed by all ten ad hoc reviewers to minimize inter-reviewer variability. Another 29 articles42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69 were assigned to individual reviewers. Each reviewer assessed articles as if reviewing a manuscript submitted to a journal that required MISFISHIE compliance. Compliance for each MISFISHIE subsection was rated on a scale of 0 to 10, where a 10 indicates inclusion of all information needed to understand or reproduce the experiment without making any assumptions. Scores of 8 and 9 were considered a low pass; the reviewer could reproduce the experiment with a few assumptions. One example of a paper deemed MISFISHIE compliant is ref. 64.
Of the 32 papers assessed, four (13%) were deemed MISFISHIE compliant in all six sections (Table 2). Another 28% were out of compliance with only one section, and 31% did not comply in two sections. More than 90% complied with sections 1 and 2 (experimental design; biomaterials and treatments). Compliance for sections 3 and 4 (reporters; staining) was ∼75%. Section 5 (imaging data) proved to be the most troublesome, with only 16% of the articles compliant. Finally, ∼47% complied with section 6 (image characterizations). The reviewers felt that the majority of noncompliant papers would require only modest additions to become compliant, with the possible exception of section 5. This section requires that at least one representative image of each assay be made electronically available in a model organism database, a generic image database, a journal's supplemental data web site or even the author's web site (the least preferable option).
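The scoring arithmetic behind these figures can be sketched as follows. The sketch assumes that a section counts as compliant at a score of 8 or above, which is our reading of the "low pass" description rather than a rule stated in the survey. The per-section scores shown are made up for illustration; only the count of 4 compliant papers out of 32 comes from the survey.

```python
from typing import Sequence

# Assumption: scores of 8 and 9 ("low pass") and 10 all count as compliant.
PASS_THRESHOLD = 8


def section_passes(score: int) -> bool:
    """Decide whether one section score on the 0 to 10 scale is a pass."""
    if not 0 <= score <= 10:
        raise ValueError("scores run from 0 to 10")
    return score >= PASS_THRESHOLD


def noncompliant_sections(scores: Sequence[int]) -> int:
    """Count the sections of one article that fall below the pass threshold."""
    return sum(1 for score in scores if not section_passes(score))


def percent(count: int, total: int) -> float:
    """Express a count as a percentage of the total."""
    return 100.0 * count / total


# Made-up scores for one hypothetical article, in section order 1 to 6.
example_scores = [10, 9, 8, 8, 3, 7]
failed = noncompliant_sections(example_scores)  # sections 5 and 6 fall short

# Four of the 32 surveyed papers complied in all six sections.
fully_compliant_share = percent(4, 32)
```

The share evaluates to 12.5%, which is reported as 13% in the text.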
MISFISHIE was developed by the Stem Cell Genome Anatomy Projects consortium of the US National Institutes of Health National Institute of Diabetes & Digestive & Kidney Diseases to facilitate data sharing within the consortium and was discussed with members of the larger research community. The history of the creation of MISFISHIE and the lessons learned from it70 may be helpful to others aiming to create similar guidelines for other data types. We expect that MISFISHIE will be updated as other localization methods, such as DNA in situ hybridization to chromosomes, are implemented. There is still considerable room for researching the scientific best practice for performing and reporting these types of studies, and the eventual accepted specification will be achieved through discussion and consensus. Suggestions from the community are actively encouraged and will be collected and incorporated into an eventual second release, published at the MISFISHIE domain of the MGED web site: http://www.mged.org/Workgroups/MISFISHIE. Comments may be addressed to the email distribution list dedicated to discussion about MISFISHIE: firstname.lastname@example.org. Some frequently asked questions and answers are listed in Box 2. After a suitable period of dialog and revision by the community, and if there is widespread acceptance by the community, we would encourage reviewers, journal editors and funding agencies to promote compliance with MISFISHIE.
Our survey of recent articles indicated that only ∼15% of published works are fully compliant and that most fail by not making images of assays used in the study digitally accessible to the research community. Most of the surveyed papers could be brought into compliance by uploading the images into a repository and adding fewer than a dozen more sentences of description. If article length constraints hinder MISFISHIE compliance, the required information could be provided in supplementary information. Several of the model organism databases, including GXD25 and EMAGE26 for mouse and ZFIN30 for zebrafish, are already able to accept and archive the results from a publication that provides all information that MISFISHIE specifies. We encourage authors to submit their data to these databases upon submission of the manuscript for publication.
Note: Supplementary information is available on the Nature Biotechnology website.
We thank R. Drysdale, L. Eichner, M. Heiskanen and M. Westerfield for comments and discussions during the preparation of the MISFISHIE specification and C. Emswiler for assistance with the figures. This work was funded in part with support from the US National Institute of Diabetes & Digestive & Kidney Diseases to members of the Stem Cell Genome Anatomy Projects Consortium, including DK63483 to J. Gordon, DK63481 to I. Lemischka, DK63400 to M. Little, DK63630 to A. Liu and DK63328 to L. Zon (Children's Hospital Boston).