The Image Data Resource: A Scalable Platform for Biological Image Data Access, Integration, and Dissemination

Access to primary research data is vital for the advancement of science. To extend the data types supported by community repositories, we built a prototype Image Data Resource (IDR) that collects and integrates imaging data acquired across many different imaging modalities. IDR links high-content screening, super-resolution microscopy, time-lapse and digital pathology imaging experiments to public genetic or chemical databases, and to cell and tissue phenotypes expressed using controlled ontologies. Using this integration, IDR facilitates the analysis of gene networks and reveals functional interactions that are inaccessible to individual studies. To enable re-analysis, we also established a computational resource based on IPython notebooks that allows remote access to the entire IDR. IDR is also an open source platform that others can use to publish their own image data. Thus IDR provides both a novel on-line resource and a software infrastructure that promotes and extends publication and re-analysis of scientific image data.

Email: jrswedlow@dundee.ac.uk; phone: +44 1382 385819; fax: +44 1382 388072; URL: www.openmicroscopy.org

Several public image databases have appeared over the last few years. These provide online access to image data, enable browsing and visualisation, and in some cases include experimental metadata. The Allen Brain Atlas, the Human Protein Atlas, and the Edinburgh Mouse Atlas all synthesise measurements of gene expression, protein localization and/or other analytic metadata with coordinate systems that place biomolecular localisation and concentration into a spatial and biological context [1][2][3]. Similarly, many other examples of dedicated databases for specific imaging projects exist, each tailored to its aims and its target community [4][5][6][7][8]. There are also a number of public resources that serve as true scientific, structured repositories for image data, i.e., that collect, store and provide persistent identifiers for long-term access to submitted datasets, as well as provide rich functionalities for browsing, search and query. One archetype is the EMDataBank, the definitive community repository for molecular reconstructions recorded by electron microscopy [9]. The Journal of Cell Biology has built the JCB DataViewer, which publishes image datasets associated with its on-line publications [10]. The CELL Image Library publishes several thousand community-submitted images, some of which are linked to publications [11]. FigShare stores 2D pictures derived from image datasets, and can provide links for download of image datasets (http://figshare.com). The EMDataBank has recently released a prototype repository for 3D tomograms, the EMPIAR resource [12]. Finally, the BioStudies and Dryad archives include support for browsing and downloading image data files linked to studies or publications [13] (https://datadryad.org/). Some of these provide a resource for a specific imaging domain (e.g., EMDataBank) or experiment (e.g., Mitocheck), while others archive datasets and provide links to a related publication available at an external journal's website (e.g., BioStudies). However, no existing resource links several independent biological imaging datasets to provide an "added value" platform, as the Expression Atlas does for a broad set of gene expression datasets [14] and UniProt does for protein sequence and function datasets [15].
Inspired by these "added value" resources, we have built a next-generation Image Data Resource (IDR), an added value platform that combines data from multiple independent imaging experiments and from many different imaging modalities, integrates them into a single resource, and makes the data available for re-analysis in a convenient, scalable form. IDR provides, for the first time, a prototyped resource that supports browsing, search, visualisation and computational processing within and across datasets acquired from a wide variety of imaging domains. For each study, metadata related to the experimental design and execution, the acquisition of the image data, and downstream interpretation and analysis are stored in IDR alongside the image data and made available for search and query through a web interface and a single API. Wherever possible, we have mapped the phenotypes determined by dataset authors to a common ontology. For several studies, we have calculated comprehensive sets of image features that can be used by others for re-analysis and the development of phenotypic classifiers. By harmonising the data from multiple imaging studies into a single system, IDR users can query across studies and identify phenotypic links between different experiments and perturbations.

Genetic, Chemical and Functional Annotation in IDR
To enable querying across the different datasets stored in IDR, we have included annotations describing experimental perturbations (genetic mutants, siRNA targets and reagents, expressed proteins, cell lines, drugs, etc.) and phenotypes declared by the study authors either from quantitative analysis or visual inspection of the image data. Wherever possible, experimental metadata in IDR link to external resources that are the authoritative resource for those metadata (Ensembl, NCBI, PubChem, etc.).
As a result, IDR provides a sampling of phenotypes related to experimental perturbations across several independent studies. Many of the studies in IDR perturb gene function by mutation or siRNA depletion. To calculate the sampling of gene orthologues, we used Ensembl's BioMart resource [16] to access a normalised list of gene orthologues. Overall, 19,601 gene orthologues are sampled, and 84.1% of gene orthologues are sampled more than 20 times. In addition, 90.3% of gene orthologues are sampled in three or more studies, so even in this early incarnation the phenotypes of perturbations in the majority of known genes are sampled in several different assays and organisms.
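The sampling statistics above can be outlined as a simple tally over (study, orthologue) observations. The following is a minimal sketch and not the authors' pipeline: the function name, the record layout and the toy identifiers are assumptions made for illustration, and the default thresholds simply mirror those quoted in the text (more than 20 samples; three or more studies).

```python
from collections import defaultdict

def orthologue_sampling(records, min_samples=21, min_studies=3):
    """Summarise how often each gene orthologue is sampled.

    records: iterable of (study_id, orthologue_id) pairs, one per sample.
    Returns (number of orthologues,
             fraction with at least min_samples samples,
             fraction seen in at least min_studies distinct studies).
    """
    samples = defaultdict(int)
    studies = defaultdict(set)
    for study_id, orthologue_id in records:
        samples[orthologue_id] += 1
        studies[orthologue_id].add(study_id)
    total = len(samples)
    if total == 0:
        return 0, 0.0, 0.0
    well_sampled = sum(1 for n in samples.values() if n >= min_samples)
    multi_study = sum(1 for s in studies.values() if len(s) >= min_studies)
    return total, well_sampled / total, multi_study / total

# Hypothetical toy input: identifiers are placeholders, not IDR data.
toy = [("study-A", "g1"), ("study-B", "g1"), ("study-C", "g1"),
       ("study-A", "g2"), ("study-A", "g2")]
print(orthologue_sampling(toy, min_samples=2, min_studies=3))
```

In practice the records would come from the normalised orthologue list rather than a hand-written list, but the counting logic would be the same.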
We also sought to normalise the phenotypes defined by submitting study authors in IDR. Functional annotations (e.g., "increased peripheral actin") have been converted to defined terms in the Cellular Microscopy Phenotype Ontology (CMPO) or other ontologies [17], in collaboration with the data submitters (e.g., http://idr-demo.openmicroscopy.org/webclient/?show=image-109846). Overall, 88% of the functional annotations have links to defined, published controlled vocabularies. IDR includes 158 different ontology-normalised phenotypes (e.g., "increased number of actin filaments", "mitosis arrested"), of which 136 are reported by authors in only one study. Nonetheless, these phenotypes are well sampled: the mean number of samples per phenotype across HCS and other imaging datasets is 698, and the median is 144. This skew occurs because some phenotypes are very common or are over-represented in specific assays, e.g., "protein localized in cytosol phenotype" (CMPO_0000393; http://idr-demo.openmicroscopy.org/mapr/phenotype/?value=CMPO_0000393). Nonetheless, there are several cases where phenotypes are observed in multiple orthogonal assays. Two examples are the "round cell" phenotype (CMPO_0000118; http://idr-demo.openmicroscopy.org/mapr/phenotype/?value=CMPO_0000118) and the "increased nuclear size" phenotype (CMPO_0000140; http://idr-demo.openmicroscopy.org/mapr/phenotype/?value=CMPO_0000140). Figure 1 summarises the sampling of phenotypes across the current IDR datasets. Several classes of phenotypes are included, and many cases are sampled in thousands of individual experiments. In total, IDR includes more than one million individual experiments (Table 1), with ~9% annotated with experimentally observed phenotypes.
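A mean (698) far above the median (144) is the usual signature of a right-skewed distribution: a few very common phenotypes pull the mean up while the median stays near the typical count. A minimal sketch of this effect follows; the counts are made-up illustrative values, not IDR data.

```python
from statistics import mean, median

# Hypothetical samples-per-phenotype counts: most phenotypes are modestly
# sampled, while one very common phenotype dominates the total.
counts = [40, 90, 120, 150, 200, 5000]

print(f"mean = {mean(counts):.1f}, median = {median(counts):.1f}")

# A mean well above the median indicates right skew.
is_right_skewed = mean(counts) > median(counts)
```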

Standardised Interfaces for Imaging Metadata
IDR integrates imaging data from many different, independent studies. These data were acquired using several different imaging modalities, in the absence of any over-arching standards for experimental, imaging or analytic metadata. While efforts like MIACA (http://miaca.sourceforge.net/), NeuroVault [19], MULTIMOT [20] and several other projects have proposed data standards in specific imaging subdomains, there is not yet a metadata standard that crosses all of the imaging domains potentially served by IDR. We therefore sought to adopt lightweight methods from other communities that have had broad acceptance [21] and converted metadata submitted in custom formats (spreadsheets, PDFs, MySQL databases, and Microsoft Word documents) into a consistent tabular format inspired by the MAGE-TAB and ISA-TAB specifications [22,23] that could then be used for importing semi-structured metadata like gene and ontology identifiers into OMERO. We also used the Bio-Formats software library to identify and convert well-defined, semantically typed elements that describe the imaging metadata (e.g., image pixel size) as specified in the OME Data Model [24,25]. The resulting translation scripts were used to integrate datasets from multiple distinct studies and imaging modalities into a single resource. The scripts are publicly available (see Methods) and thus comprise a framework for recognising and reading a range of metadata types across several imaging domains into a common, open specification.
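Conceptually, this conversion amounts to reading a tab-delimited annotation table and turning each row into key-value pairs attached to a well or image. The sketch below illustrates the idea with Python's standard csv module. The column names and sample rows are hypothetical, and the code does not reproduce the IDR translation scripts or the OMERO import API.

```python
import csv
import io

# Hypothetical tab-delimited annotation table (placeholder values).
TABLE = (
    "Plate\tWell\tGene Identifier\tPhenotype Term Accession\n"
    "plate1\tA1\tGENE_0001\tCMPO_0000077\n"
    "plate1\tA2\tGENE_0002\t\n"
)

def rows_to_annotations(text, key_columns=("Plate", "Well")):
    """Map each (plate, well) key to a dict of its non-empty annotations."""
    annotations = {}
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    for row in reader:
        key = tuple(row[c] for c in key_columns)
        annotations[key] = {
            name: value
            for name, value in row.items()
            if name not in key_columns and value
        }
    return annotations

for key, values in rows_to_annotations(TABLE).items():
    print(key, values)
```

Dropping empty cells keeps wells without an annotated phenotype from acquiring blank key-value pairs, which is one plausible way to handle sparsely annotated screens.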
Phenotypes across distinct studies can also be used to build novel representations of gene networks. Figure 2A shows the gene network created when the gene knockouts or knockdowns that caused an elongated cell phenotype (CMPO_0000077) in studies in S. pombe and human cells are linked by queries to StringDB and visualised in Cytoscape [33]. The genes discovered in the three studies form interconnected, non-overlapping, complementary networks that connect specific macromolecular complexes to the elongated cell phenotype. For example, HELZ2, MED30, MED18 and MED20 are all part of the Mediator Complex, but were identified as "elongated cell" hits in separate studies using different biological models (idr0001-A, idr0008-B, idr0012-A, Figure 2B). In another example, POLR2G (from idr0012-A), PAF1 (from idr0001-A) and SUPT16H (from idr0008-B) were scored as elongated cell hits in these studies and are all part of the Elongation complex in the RNA Polymerase II transcription pathway. Finally, ASH2L ("elongated cell phenotype" in idr0012-A) associates with SETD1A and SETD1B ("elongated cell phenotype" in idr0001-A) to form the Set1 histone methyltransferase (HMT). These examples demonstrate that these individual hits are probably not due to off-target effects or characteristics of individual biological models but arise through conserved, specific functions of large macromolecular complexes. This shows the utility and importance of combining phenotypic data of studies from different organisms and scales, and of integrating metadata from independent studies, to generate added value that can enhance the understanding of biological mechanisms and lead to new mechanistic hypotheses and predictions.
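The cross-study analysis above can be pictured as a graph problem: genes are nodes labelled with the study that reported them, interactions are edges, and the groups of interest are connected components containing hits from more than one study. The sketch below uses only the gene-to-study assignments stated in the text. The edges are illustrative stand-ins for StringDB links (an assumption, not query results), and the function name is hypothetical.

```python
from collections import defaultdict

# Gene -> study, as stated in the text.
STUDY = {
    "POLR2G": "idr0012-A", "PAF1": "idr0001-A", "SUPT16H": "idr0008-B",
    "ASH2L": "idr0012-A", "SETD1A": "idr0001-A", "SETD1B": "idr0001-A",
}
# Illustrative edges standing in for StringDB interactions (assumed).
EDGES = [("POLR2G", "PAF1"), ("PAF1", "SUPT16H"),
         ("ASH2L", "SETD1A"), ("ASH2L", "SETD1B")]

def cross_study_components(study, edges):
    """Return connected components whose genes come from more than one study."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, result = set(), []
    for start in study:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first traversal from the start gene
            gene = stack.pop()
            if gene in component:
                continue
            component.add(gene)
            stack.extend(neighbours[gene] - component)
        seen |= component
        if len({study[g] for g in component}) > 1:
            result.append(sorted(component))
    return result

for genes in cross_study_components(STUDY, EDGES):
    print(genes, sorted({STUDY[g] for g in genes}))
```

With these assumed edges, the transcription elongation genes and the Set1 HMT genes each form one component spanning several studies, which is the pattern the text describes.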
The integration of experimental, image and analytic metadata also provides an opportunity to include new functionalities for more advanced visualization and analytics of imaging data and metadata, bringing further added value to the original studies and datasets. As an example, we have added the data analytics tool Mineotaur [34] to one of IDR's datasets (http://idr-demo.openmicroscopy.org/mineotaur/). This allows visual querying and analysis of quantitative feature data. For instance, having shown that components of the Set1 HMT function in controlling cell morphology in S. pombe and human cells, we noticed that genes like ASH2L were in the "elongated cell" network based on human cell data (idr0012-A) but not S. pombe data, where ash2, the S. pombe ASH2L orthologue, was not annotated as a cell elongation "hit". We first noted that ash2 has a microtubule cytoskeleton phenotype (http://idr-demo.openmicroscopy.org/webclient/?show=well-592371). We then queried the criteria previously used for cell shape hits in the Sysgro screen (idr0001-A) and found that ash2 fell just below the cutoff originally used in this study to define phenotypic hits for cell shape (Supplemental Information). When combined with results on ASH2L from HeLa cells (idr0012-A, idr0008-B) (Figure 2B), these results suggest that the Set1 HMT has a strongly conserved role in controlling cell shape and the cytoskeleton in unicellular and multicellular organisms.

Data Integration and Access
Like most modern on-line resources, IDR makes data available through a web user interface as well as a web-based JSON API. This encourages third parties to make use of IDR in their own sites. For example, image data in IDR has been linked to study data in BioStudies, thereby extending the linkage of study and image metadata (e.g., https://www.ebi.ac.uk/biostudies/studies/S-EPMC4704494), and to PhenoImageShare [35], an on-line phenotypic repository (e.g., http://www.phenoimageshare.org/search/?term=&hostName=Image+Data+Repository+(IDR)). These are examples of use of IDR as a service that delivers data for other applications to integrate and reuse.
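Because the web interface exposes regular URL patterns, third-party sites can construct links into IDR programmatically. The sketch below only builds a link of the form shown elsewhere in this article (the mapr/phenotype/?value=... pattern). It makes no network request, the helper name is hypothetical, and it does not describe the JSON API itself, whose endpoints are not specified here.

```python
from urllib.parse import urlencode, urlunsplit

IDR_HOST = "idr-demo.openmicroscopy.org"  # host used in this article's links

def phenotype_url(cmpo_id, host=IDR_HOST):
    """Build a link to the IDR phenotype view for a CMPO term."""
    query = urlencode({"value": cmpo_id})
    return urlunsplit(("http", host, "/mapr/phenotype/", query, ""))

print(phenotype_url("CMPO_0000077"))
```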
To add further value and extend the possibilities for reuse of IDR data, we have initiated the calculation of comprehensive sets of feature vectors of IDR image data. For this purpose, we have used WND-CHARM, an open source tool that calculates a broad set of image features [36]. To date, full WND-CHARM features have been calculated for images in idr0002-A, idr0005-A, idr0008-B, idr0009-A, idr0009-B, idr0012-A, and parts of idr0013-A and idr0013-B. Feature calculations for other IDR datasets are in progress. Features are stored in IDR using OMERO's HDF5-based tabular data store and are available through the OMERO API in IDR's computational resource (see Supplemental Information).
The integration of image-based study phenotypes and calculated features makes IDR an attractive candidate for computational re-analysis. However, given the size of IDR, downloading the full complement of data it contains is impractical. We have therefore built two methods of accessing IDR data. In the first, we have connected IDR to a computational resource that provides remote, API-based access to IDR datasets. This resource authenticates against GitHub and is based on IPython notebooks. This provides a flexible, web browser-based analysis capability for IDR. To demonstrate the utility of this resource, and exemplify its use, we have developed and deposited notebooks that provide visualisation of single-tile WND-CHARM features using PCA, access to images annotated with CMPO phenotypes, calculation of gene networks, and calculation of WND-CHARM features for individual images. In particular, we provide notebooks to build interactive versions of Figures 1 and 2 directly from the IDR. Users can access their own analysis notebooks stored in GitHub (https://github.com/IDR/idr-notebooks). IDR's computational resource is available at https://idr-demo.openmicroscopy.org/jupyter.
In addition, to enable local access to IDR metadata, we have built scripts in Ansible that automate the deployment of the IDR software stack and have made all the databases, metadata and thumbnails in IDR available for download (see Supplemental Information). The downloadable IDR omits the original image data, but contains all image thumbnails and all experimental, phenotypic and analytic metadata associated with IDR images. These can be deployed via the IDR software stack and re-used in a local context.

Discussion
Making data public and available is a critical part of the scientific enterprise [37] (https://wellcome.ac.uk/what-we-do/our-work/expert-advisory-group-data-access) (https://royalsociety.org/topics-policy/projects/science-public-enterprise/report/). To take the next step in facilitating the reuse and meta-analysis of image datasets, we have built IDR, a next-generation data technology that integrates and publishes image data and metadata from a wide range of imaging modalities and scales in a consistent format. IDR integrates experimental, imaging, phenotypic and analytic metadata from several independent studies into a single resource, allowing new modes of biological Big Data querying and analysis. As more datasets are added to and integrated with IDR, they will potentiate and catalyze the generation of new biological hypotheses and discoveries.
In IDR, we have linked image metadata from several independent studies. Experimental, imaging, phenotypic and analytic metadata are recorded in a consistent format. Rather than declaring and attempting to enforce a strict imaging data standard, IDR provides tools for supporting community formats and releases these as a framework that facilitates data reuse. We hope that the availability of this framework will provide incentives for others to structure their metadata in shareable formats that can be read into IDR or other applications, whether based on OMERO or not. In the future, we can imagine that these and other capabilities could be extended in IDR (or in similar repositories that link to IDR) to enable systematic integration, visualization and analytics across imaging studies, thereby helping to harness and capitalize on the exponentially increasing amounts of bio-imaging data that the community generates.
As of this writing, IDR has published 35 reference image datasets grouped into 24 studies (Table 1) and, utilising EMBL-EBI's Embassy Cloud, has capacity to receive and publish many more. Authors of scientific publications that are already published or under submission can submit accompanying image datasets for publication in IDR, using the metadata specifications and formats we have built. These datasets can be integrated as described above. Once published, the datasets can be browsed and viewed through IDR's web user interface, or queried and re-analysed using the IDR computational resource. Details about the submission process are available at http://idr-demo.openmicroscopy.org/about/submission.html. Moreover, IDR software and technology are open source, so they can be accessed and built into other image data publication systems. This supports the building of technology and installations that integrate and publish bio-image data for the scientific community, allowing discoveries and predictions similar to those we have shown in Figure 2. IDR therefore functions both as a resource for image data publication and as a technology platform that supports the creation of on-line scientific image databases and services. In the future, those databases and services may ultimately amalgamate to form resources analogous to the genomic resources that are the foundation of much of modern biology.

Table 1. List of Datasets in IDR
The Phenotype column contains the number of submitted phenotypes. The Targets column lists the number of genes, compounds or proteins identified as targets for analysis, and the Experiments column lists the number of individual wells in HCS studies or imaging experiments in non-screen datasets.

Figure 1. Sampling of Phenotypes in the IDR. The numbers of samples per phenotype are shown. Each sample represents a well from a micro-well plate in a screen or an image from a dataset. Wells annotated as controls were not included. User-submitted phenotype terms were mapped to the CMPO terms shown here. Colours represent higher-level groupings of phenotype terms. Point size shows the number of studies (groups of related screens) each phenotype is linked to, with small, medium and large points representing 1, 2 or 3 studies respectively.

Figure 2. Network Analysis of Genes Linked to the Elongated Cell Phenotype in the IDR. A. Protein-protein interaction network produced in StringDB [57] and visualized using Cytoscape (http://www.cytoscape.org/ [33]) based on the genes linked to the elongated cell phenotype (CMPO_0000077) in three distinct studies in IDR. Genes from S. pombe (green, idr0001-A [5]), HeLa cell morphology (blue, idr0012-A [46]) and HeLa Actinome (red, idr0008-B [44]) are displayed with linkages (gray) from StringDB. To enable comparisons in Cytoscape, the human orthologues of S. pombe genes are used for the genes identified in idr0001-A. It is likely that in the future IDR will receive more submissions that include related phenotypic annotations, so Figure 2 shows one view of a network, based on an evolving set of data. IDR study idr0028 (http://idr-demo.openmicroscopy.org/webclient/?show=screen-1502) also contains experiments annotated with CMPO_0000077/elongated cell phenotype. This study used targeted libraries, so when included in the network, it shows no novel functional interactions. Therefore, data from idr0028 have been omitted from this network, but can be viewed and queried interactively (http://idr-demo.openmicroscopy.org/mapr/phenotype/?value=CMPO_0000077) or accessed using the IDR computational resource (https://idr-demo.openmicroscopy.org/jupyter). B. Zoomed view of network in A, with gene names. See Supplemental Information for the list of gene names used in the figure.
[Figure 1 plot labels: CMPO phenotype terms, including cell cycle, cell and nuclear morphology, cytoskeleton (filopodia, microtubules, stress fibers), protein localization and protein secretion phenotypes.]