Main

Biological imaging is one of the most innovative fields in the modern biological sciences. New imaging modalities, probes and analysis tools appear every few months and often prove decisive for enabling new directions in scientific discovery. One feature of this dynamic field is the need to capture new types of data and data structures. While there is a strong drive to make scientific data findable, accessible, interoperable and reusable (FAIR1), the rapid rate of innovation in imaging has resulted in the creation of hundreds of proprietary file formats (PFFs) and has prevented the unification and adoption of standardized data formats. Despite this, opportunities for sharing and integrating bioimaging data and, in particular, linking these data to other ‘omics’ datasets have never been greater. Therefore, increasing the ‘FAIRness’ of bioimaging data wherever possible is critical for maximizing scientific value, as well as for promoting openness and integrity2.

When working with a large number of PFFs, interoperability and accessibility are achieved using translation and conversion provided by open-source, community-maintained libraries that produce an open, common data representation. On-the-fly translation produces a transient representation of bioimage metadata and binary data in an open format but must be repeated on each use. In contrast, conversion produces a permanent copy of the data, again in an open format, bypassing bottlenecks in repeated data access. As workflows and data resources emerge that handle terabytes (TB) to petabytes (PB) of data, the costs of on-the-fly translation have become bottlenecks to scientific analysis and the sharing of results. Open formats like OME-TIFF3 and HDF5 (ref. 4) are often used for permanent conversion, but both have limitations that make them ill-suited for use cases that depend on very high and frequent levels of access, such as training of artificial intelligence models and publication of reference bioimage datasets in cloud-based resources. For these situations, the community is missing a multidimensional, multiresolution binary container that provides parallel read-and-write capability, is natively accessible from the cloud (without server infrastructure) and has a flexible, comprehensive metadata structure (Supplementary Note).

To this end, we have begun building OME’s next-generation file format (OME-NGFF) as a complement to OME-TIFF and HDF5. Together these formats provide a flexible set of choices for bioimaging data storage and access at scale over the next decade and, potentially, a common, FAIR solution for all members of the biological imaging community (academic and industrial researchers, imaging scientists, and academic and commercial technology developers).

Next-generation file formats

We use the term next-generation file formats (NGFFs) to denote file formats that can be hosted natively in object (or cloud) storage for direct access by a large number of users. Our current work, which we refer to as OME-NGFF, is built upon the Zarr format5 but is heavily informed by, and connected to, both TIFF and HDF5. We have compared the characteristics of these three open formats in Supplementary Table 1.

To date, the development of OME-NGFF has focused on pixel data and metadata specifications for multidimensional, multiscale images, high-content screening datasets and derived labeled images. These specifications include support for ‘chunking’, that is, storage of parts of the binary pixel data in smaller files that enable rapid access to the data from orthogonal views or at different resolution levels (also known as pyramidal data). Labeled images, such as segmentation or classification masks, can now remain in a common data structure with the original pixel data and metadata, providing a single mechanism for tracking the provenance of original and derived data and allowing programmatic rather than manual management.
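
As a concrete illustration, the following is a minimal sketch of this layout using the zarr-python library: a two-level image pyramid stored as chunked arrays alongside a derived label image. The file names, chunk shape and metadata keys shown here are illustrative only; the authoritative field names and required attributes are defined by the published OME-NGFF specification.

```python
import numpy as np
import zarr

# Illustrative sketch only; consult the OME-NGFF specification for the
# authoritative metadata keys and required fields.
root = zarr.open_group("example_image.zarr", mode="w")

# Two resolution levels of a single-channel 2D image, stored as chunked arrays.
level0 = np.random.randint(0, 2**16, size=(4096, 4096), dtype="uint16")
level1 = level0[::2, ::2]  # naive 2x downsampling for the pyramid
root.create_dataset("0", data=level0, chunks=(256, 256))
root.create_dataset("1", data=level1, chunks=(256, 256))

# Minimal multiscales metadata pointing at the pyramid levels.
root.attrs["multiscales"] = [{"datasets": [{"path": "0"}, {"path": "1"}]}]

# A derived segmentation mask lives alongside the pixel data in a "labels"
# sub-group, keeping original and derived data in one container.
labels = root.create_group("labels")
labels.attrs["labels"] = ["mask"]
labels.create_dataset("mask", data=(level0 > 30000).astype("uint8"),
                      chunks=(256, 256))
```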

We have also built multiple implementations of these specifications, demonstrating the usability and performance of these formats. bioformats2raw can be used for writing OME-NGFF from standalone Java applications and omero-cli-zarr is available for exporting from OMERO6. Reading is implemented in ome-zarr-py, which has been integrated into the napari viewer7, in Fiji via the MoBIE plugin8 and in the browser via the Viv-based vizarr9. Permissively licensed example datasets from the Image Data Resource (IDR)10 have been converted into Zarr and stored in an S3 object-storage bucket for public consumption (Extended Data Fig. 1). Though OME-NGFF is still in development, each of these implementations is an example of how data access and application development are simplified by a universal data-storage pattern. Current and future specifications are published at https://ngff.openmicroscopy.org/latest/.
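
For example, reading a converted dataset directly from object storage requires only a few lines. The sketch below assumes the ome-zarr-py reader API at the time of writing and uses a placeholder URL rather than a real bucket address.

```python
# Hedged sketch: reads a multiscale OME-NGFF image over HTTP/S3 with ome-zarr-py.
# The URL is a placeholder; converted IDR images are published separately.
from ome_zarr.io import parse_url
from ome_zarr.reader import Reader

url = "https://s3.example.org/idr-bucket/example-image.zarr"
reader = Reader(parse_url(url))
image = next(iter(reader()))   # first node is the multiscale image
pyramid = image.data           # list of dask arrays, one per resolution level
print([level.shape for level in pyramid])
thumbnail = pyramid[-1].compute()   # fetch only the smallest level
```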

Bioimage latency benchmark

To demonstrate how NGFFs complement available, open formats, we have built and published a bioimage latency benchmark that compares the speed of serial access to randomly chosen chunks of uncompressed TIFF, HDF5 and Zarr files. These measurements provide an upper bound on the overhead that a user would experience accessing the formats with the common libraries tifffile, h5py and Zarr-Python, respectively. Though future extensions to the benchmark are intended, we have focused on a single, serverless Python environment because one library (fsspec) can be used to access all three data formats across multiple storage mechanisms without the need for any additional infrastructure.
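
As a rough sketch of what such an environment looks like, the snippet below opens one file of each format through fsspec; the paths are placeholders, the S3 backend additionally requires s3fs, and none of the benchmark tuning is shown.

```python
import fsspec
import h5py
import tifffile
import zarr

base = "s3://example-bucket/benchmark"   # placeholder; could also be a local or HTTP path

# TIFF: tifffile accepts a file-like object opened by fsspec.
with fsspec.open(f"{base}/data.ome.tiff", "rb") as f:
    with tifffile.TiffFile(f) as tif:
        print(tif.pages[0].shape, tif.pages[0].dtype)

# HDF5: h5py can also read from an fsspec file-like object.
with fsspec.open(f"{base}/data.h5", "rb") as f:
    with h5py.File(f, "r") as h5:
        print(list(h5.keys()))

# Zarr: the dataset is addressed as a key-value mapping of chunk files.
store = fsspec.get_mapper(f"{base}/data.zarr")
grp = zarr.open_group(store, mode="r")
print(grp["0"].shape, grp["0"].chunks)
```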

The benchmark includes instructions for running on Docker or AWS EC2 and contains all necessary code to regenerate representative samples for two established imaging modalities: large multi-channel two-dimensional (2D) images such as the ones produced by cyclic immunofluorescence (CycIF)11 and time-lapse isotropic volumes typically generated by light-sheet microscopy (LSM)12. Each synthetic HDF5, TIFF and Zarr dataset was generated by first invoking the ImarisWriter, then converting the HDF5-based Imaris files into Zarr with bioformats2raw and finally converting the Zarr to TIFF with raw2ometiff. All three datasets, along with a 1-byte dummy file for measuring overhead, were placed in three types of storage: local disk, a remote server and object storage. We measured the reading time of individual chunks for all four file types across the three storage systems. Figure 1 shows that as the latency of access grows, access times for monolithic formats such as TIFF and HDF5 increase because libraries must seek the appropriate data chunk, whereas NGFF formats such as Zarr provide direct access to individual chunks. In the three-dimensional (3D) case, the TIFF data were too large to fit into local memory and the benchmark failed.

Fig. 1: Chunk retrieval time is less sensitive to data location with next-generation file formats.

a,b, Random sampling of 100 chunks from synthetically generated, five-dimensional images measures access times for three different formats on the same file system (green), over HTTP using the nginx web server (orange) and using Amazon’s proprietary S3 object storage protocol (blue) under two scenarios: a whole-slide CycIF imaging dataset with many large planes of data (x = 64,000, y = 64,000, c = 8) and chunks of 256 × 256 pixels (128 KB) (a); and a time-lapse LSM dataset with isotropic dimensions (x = 1,024, y = 1,024, z = 1,024, t = 100) and chunks of 32 × 32 × 32 pixels (64 KB) (b) (Methods).

On local storage, access speeds for NGFF files were similar to those for HDF5, and both substantially outperformed TIFF. This matches previous results showing that a number of factors must be taken into account to determine the relative performance of HDF5 and Zarr13. Together these results partially explain HDF5’s popularity for desktop analysis and visualization of LSM datasets.

However, on cloud storage, access speeds for NGFF files are at least an order of magnitude faster than for HDF5. Parallel reads14, which support streaming of image data from remote HTTP-based or cloud-based servers, give performance similar to local disk access. Data streaming obviates the need for wholesale data download and is especially important for providing performant access to multi-TB datasets.
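
A minimal sketch of such streamed, parallel access, assuming a dask-based client and a hypothetical remote Zarr dataset, is shown below; only the chunks overlapping the requested region are fetched.

```python
import dask.array as da

# Hypothetical URL; assumes a 5D (t, c, z, y, x) image whose full-resolution
# level is stored under the "0" component.
arr = da.from_zarr("https://s3.example.org/bucket/image.zarr", component="0")

# Only the chunks overlapping this region of interest are read, in parallel,
# without downloading the rest of the dataset.
roi = arr[0, 0, 0, 4096:8192, 4096:8192].compute()
print(roi.shape)
```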

We note that our benchmark measures direct access to underlying storage. Additional applications, such as HSDS for HDF5 or OMERO for TIFF, may improve the performance of specific use cases, but add complexity to any deployment and make direct comparisons between the different data-access regimes in Fig. 1 difficult. Additionally, a key parameter in overall access times is the size of individual chunks. As chunk sizes decrease, the number of individual chunk files increases rapidly (Extended Data Fig. 2). In this benchmark, we have chosen a compromise between chunk size and number of individual files. This illustrates a primary downside of NGFF formats; as the number of files increases, the time required for copying data between locations increases. Users will need to understand and balance these trade-offs when choosing between open, bioimaging file formats.

Outlook: community adoption

We assert that, together, low-latency, cloud-capable NGFF, TIFF and HDF5 can provide a balanced set of options that the community can converge on, slowing the development of ever more file formats. To this end, OME is committed to building an interoperable metadata representation across all three file formats to ensure ease of adoption and data exchange (Supplementary Note).

When data are frequently accessed, for example, as a public resource or a training dataset, upfront conversion will lead to overall time savings. In situations where object storage is mandated, as in large-scale public repositories, we encourage the use of OME-NGFF today. Alternatively, users needing to transfer their images may choose to store their data in a single large file such as HDF5. OME-TIFF remains a safe option for those who rely on proprietary software for visualization and analysis, especially in digital pathology and other whole-slide imaging applications, as many such packages have been extended to both read and write this open standard. Each choice comes with benefits and costs, and individual scientists, institutions, global collaborations and public data resources need the flexibility to decide which approach is suitable. We encourage the community to choose the most appropriate of the formats described above, secure in the knowledge that conversion is possible if it becomes necessary.

We foresee this becoming a critical strategy, in which data generated in advanced bioimaging applications are converted into an optimized format for downstream processing, analysis, visualization and sharing. All subsequent data access occurs via open data formats without the need for repeated, on-the-fly translation. We have begun implementing this workflow in the IDR (Extended Data Fig. 1), alleviating the need for time-consuming downloads and metadata cross-referencing, and resulting in substantially more accessible and interoperable data. We look forward to working with other resources to further develop this policy. Further, as adoption of public image data resources increases, we hope commercial vendors will engage with these efforts to support their customers, who are increasingly required to publish datasets as supplementary material. Moreover, some commercial imaging companies are themselves building cloud-based data handling and analysis solutions (for example, https://www.apeer.com), thus broadening the community of users who need cloud-competent file formats.

Ultimately, we hope to see digital imaging systems producing open, transparent (in other words, FAIR) data without the need for further conversion. Until that time, we are committed to meeting the data-conversion needs of the community. Following the same pattern established by bioformats2raw and raw2ometiff, we propose to meet this challenge via a set of migration tools allowing efficient transformations between all data formats contained in this suite of interoperable formats. Additionally, as the specification evolves based on community feedback, the same migration tools will allow the scientific data generated by the bioimaging community to be upgraded, preventing the need for long-term maintenance of older data. Upcoming specifications include geometric descriptions of regions of interest, meshes and transformations for correlative microscopy.

To provide the best chance of wide adoption and engagement, we are developing the formats in the open, with frequent public announcements of progress and releases of reference software and examples (https://forum.image.sc/tag/ome-ngff) and regular community meetings where we present work, gather feedback and encourage community members, including vendors, to participate in the specification and implementation. The community process is still being developed and we welcome contributions from all interested parties at https://github.com/ome/ngff.

Methods

Bioimage latency benchmark: synthetic data generation

Imaging modality and dataset sizes

Synthetic datasets were generated for two established imaging modalities: a large multi-channel two-dimensional image typical of CycIF11, with xyzct dimensions 64,000 × 64,000 × 1 × 8 × 1, and a time-lapse isotropic volume typical of LSM12, with xyzct dimensions 1,024 × 1,024 × 1,024 × 1 × 100.

For each modality, the chunk size of the benchmark dataset was chosen as a compromise between the size of individual chunks and the total number of chunks in the Zarr dataset. To make this decision, the individual chunk size was computed against the total number of chunks for typical chunk dimensions ranging from 16 up to 1,024 (chunks.py15 and Extended Data Fig. 2). Based on this, we chose a 2D chunk size of 256 × 256 for the CycIF-like dataset and a 3D chunk size of 32 × 32 × 32 for the LSM-like dataset. Note that owing to the planar limitation of TIFF, the LSM dataset was stored as 2D TIFF tiles of size 32 × 32 but the benchmark loaded 32 tiles to measure the total access time. All data were stored uncompressed to keep chunk sizes consistent for the randomly generated data. Note that with the default aws s3 cp command, upload throughput decreased from over 100 MiB s−1 for the single HDF5 file to under 20 MiB s−1 for the Zarr dataset.
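
The calculation behind this trade-off is straightforward; the following back-of-the-envelope sketch (not the published chunks.py) counts the full-resolution chunk files needed for each candidate chunk edge length, using the dataset dimensions given above.

```python
import math

def chunk_count(shape, edge):
    """Number of chunk files needed to cover `shape` with chunks of edge length `edge`."""
    n = 1
    for dim in shape:
        n *= math.ceil(dim / edge)
    return n

cycif_plane = (64_000, 64_000)        # y, x; one 2D chunk grid per channel (8 channels)
lsm_volume = (1_024, 1_024, 1_024)    # z, y, x; one 3D chunk grid per timepoint (100 timepoints)

# Full-resolution chunk counts only; pyramid levels add further files.
for edge in (16, 32, 64, 128, 256, 512, 1024):
    print(f"edge={edge:>4}:"
          f" CycIF={chunk_count(cycif_plane, edge) * 8:>12,}"
          f" LSM={chunk_count(lsm_volume, edge) * 100:>14,}")
```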

Dataset generation

The HDF5 version of each synthetic dataset was first generated using the ImarisWriter library16 (v.2021-04-07) with a version of the ImarisWriterTest example16,17 modified to allow setting the desired chunk size and to generate gradient images rather than random data. This HDF5-based Imaris file was converted into Zarr using a modified version of bioformats2raw v.0.2.6 with support for chunks using a / dimension separator18. Finally, the Zarr was converted into TIFF with a modified version of raw2ometiff v.0.2.6, allowing it to consume Zarr filesets with a / dimension separator19. Both modifications have since been released in bioformats2raw v.0.3.0 and raw2ometiff v.0.3.0.
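
For reference, a bare-bones sketch of the latter two conversion steps, invoked from Python, might look as follows; the input file name is illustrative and the chunk-size and dimension-separator options used for the benchmark are omitted (see the tools’ documentation for current flags).

```python
import subprocess

# Imaris (HDF5) -> Zarr, then Zarr -> OME-TIFF; options omitted for brevity.
subprocess.run(["bioformats2raw", "synthetic.ims", "synthetic.zarr"], check=True)
subprocess.run(["raw2ometiff", "synthetic.zarr", "synthetic.ome.tiff"], check=True)
```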

For the CycIF-like dataset, this conversion generated a single 86 GB TIFF file, a single 86 GB HDF5 file and a Zarr dataset composed of 700,000 files totalling 86 GB. For the LSM-like dataset, the conversion generated a single 300 GB TIFF file, a single 229 GB HDF5 file and a Zarr dataset composed of 4.3 million files totalling 264 GB.

Bioimage latency benchmark: measurements and results

Measurements

All three datasets, along with a 1-byte dummy file for measuring overhead, were placed in three types of storage: local disk, a remote server and object storage. We measured the reading time of individual chunks for all four file types across the three storage systems.

A random sequence of 100 chunk locations was chosen for the benchmark. All 100 chunks were loaded from each file in the same order. The time taken to retrieve each chunk, independent of the time taken to open a file or prepare the remote connection, was recorded.
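
A stripped-down version of this measurement, shown here for the Zarr case only and with a hypothetical local path, illustrates the procedure; the published benchmark code15 handles all three formats and storage backends.

```python
import random
import time
import zarr

# Hypothetical path; a 2D array is assumed here for simplicity.
arr = zarr.open_array("benchmark.zarr/0", mode="r")
cy, cx = arr.chunks
ny, nx = arr.shape[0] // cy, arr.shape[1] // cx

random.seed(0)
locations = [(random.randrange(ny), random.randrange(nx)) for _ in range(100)]

timings = []
for j, i in locations:
    t0 = time.perf_counter()
    _ = arr[j * cy:(j + 1) * cy, i * cx:(i + 1) * cx]   # read exactly one chunk
    timings.append(time.perf_counter() - t0)

print(f"median chunk read time: {sorted(timings)[len(timings) // 2]:.4f} s")
```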

Raincloud plots

Raincloud plots20 combine three representations (split-half violin plots, box plots and raw data points) so that the true distribution and the statistical parameters can be compared. Split-half violin plots show a smoothed version of the histogram using a kernel density estimate; this makes it possible to determine, at a glance, whether the mean is lower or higher than the median, depending on the skewness of the curve. Box plots show the median and the quartile boundaries on either side of it, so that statistical differences can be assessed at a glance. Below each box plot, the raw data points are additionally plotted with slight vertical jittering to avoid overlaps.
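
The sketch below approximates this composition with plain matplotlib (a full violin offset above the box rather than a true split-half violin) on fabricated example timings; the published figures were produced with the raincloud plot tooling20.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Fabricated access-time samples (seconds), purely for illustration.
samples = {"local": rng.gamma(2.0, 0.002, 100),
           "http": rng.gamma(2.0, 0.02, 100),
           "s3": rng.gamma(2.0, 0.03, 100)}

fig, ax = plt.subplots()
for pos, values in enumerate(samples.values(), start=1):
    # "Cloud": a density estimate of the distribution, offset above the box.
    ax.violinplot(values, positions=[pos + 0.25], vert=False, showextrema=False)
    # Box plot: median and quartile boundaries.
    ax.boxplot(values, positions=[pos], widths=0.15, vert=False)
    # "Rain": the raw points, vertically jittered below the box.
    ax.scatter(values, pos - 0.25 + rng.uniform(-0.05, 0.05, len(values)),
               s=5, alpha=0.5)
ax.set_yticks(range(1, len(samples) + 1))
ax.set_yticklabels(list(samples.keys()))
ax.set_xlabel("chunk access time (s)")
plt.show()
```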

All code for reproducing the plots and for running the benchmark, either locally with Docker or on Amazon EC2 instances, is available under a BSD-2 license on Zenodo15.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.