Background & Summary

In biodiversity research, organisms are studied with the aim of recording their diversity at the genetic, metabolic, physiological, morphological, or ecosystem level. Although bioimaging techniques such as macro- and microscopy are widely used, FAIR bioimaging data in the context of biodiversity are still very scarce1,2,3. This is especially the case for taxonomically difficult and underrepresented groups such as bryophytes. Currently, approx. 24’000 species of bryophytes are known to science4. Unlike vascular plants, bryophytes lack well-differentiated organs that protect them from environmental exposure and pathogens, and they have instead developed unique specialised metabolisms and cell structures such as oil bodies5,6. As a result, their phenotypes are often cryptic and difficult to identify visually.

The highly diverse family Scapaniaceae comprises 48 taxa in Europe7 and is an ecologically important group with regard to environmental adaptations8, the biochemistry of terpenoid natural products and other chemical structures9,10,11,12, the metabolism of pollutants and heavy metals13,14, and phylogenetics15,16,17,18. Generally, there is a considerable lack of described traits in bryophytes19; especially in liverworts such as Scapaniaceae, the phenotypic traits needed to assess the diversity of form and function are understudied20.

Bioimaging data are of high relevance in the field of biodiversity because they allow phenotypic diversity to be assessed through the analysis of images2,21,22. In the form of measurable phenotypic traits, biological images are the groundwork of many ecological studies3,20,21,23,24. Phenotyping through recording images of anatomical and morphological characters allows qualitative and quantitative measurements of structures relating to genetics, molecular pathways and biotechnology25,26,27. Bioimages have also gained considerable interest in citizen science and in the digitization of natural history collections and digital herbaria28,29. Furthermore, meta-synthesis methods, which synthesise disparate data sources spanning published case studies, have great potential to reveal context-dependencies within bioimaging research data30.

For example, liverworts such as the species investigated herein produce cellular oil bodies that are visible under the microscope as small droplets of oil (Fig. 1). These oil bodies are often species-specific and are an important phenotypic character for species identification. Furthermore, as they contain many specialized metabolites such as fatty acids, terpenoids, or flavonoids, they can provide a mechanistic link between molecular function and the phenotype4,5,31. Images of oil bodies are of high interest because oil bodies often degrade within a few hours or days after sampling and are usually absent from dried herbarium material.

Fig. 1

Comparison of leaf cells of dried herbarium voucher specimens and fresh samples. Oil bodies are usually absent from dried specimens. (a) Cells in the apex of the antical lobe of Scapania gymnostomophila in a voucher specimen (left) and a fresh sample (right). Cells of this species produce one large brownish structured cellular oil body. (b) Cells in the centre of the postical lobe of Scapania cuspiduligera in a dried herbarium voucher specimen (left) and in a fresh specimen (right). Cells of this species usually produce 2–5 translucent oil bodies per cell.

Macroscopy and microscopy are characterized by physical constraints resulting in diffraction and shallow depth of field22,32,33. From a technical perspective, our data rely on two major methods to significantly extend the depth of field and to increase the resolution of the composite images: image stitching (combining several images along the x- and y-axes of the field of view to form an image with a larger frame)34,35 and multi-focus image fusion (merging multiple images taken at different focal planes along the z-axis in such a way that only regions in focus contribute to the resulting image)36,37 (Figs. 2, 3). In this regard, the raw data can also be reused for combining image fusion with computational super-resolution22,38,39,40,41.
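To illustrate the principle of multi-focus image fusion, the following minimal Python sketch selects, for each pixel, the source image with the highest local sharpness. It is a simplified depth-map-style approach for illustration only; it does not reproduce the Helicon Focus algorithms used in this study, and the file names are placeholders:

```python
import cv2
import numpy as np

def fuse_focus_stack(images):
    """Fuse a z-stack into one all-in-focus composite by choosing, per pixel,
    the source image with the highest local sharpness (absolute Laplacian)."""
    stack = np.stack(images)  # shape (n, height, width, 3), dtype uint8
    sharpness = []
    for img in images:
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY).astype(np.float64)
        lap = np.abs(cv2.Laplacian(gray, cv2.CV_64F))
        # Smooth the sharpness map so neighbouring pixels pick the same slice
        sharpness.append(cv2.GaussianBlur(lap, (0, 0), 3))
    depth_map = np.argmax(np.stack(sharpness), axis=0)  # (height, width)
    rows, cols = np.indices(depth_map.shape)
    return stack[depth_map, rows, cols]

# Hypothetical usage: fuse all images of one z-stack
paths = ["IMG_0001.tif", "IMG_0002.tif", "IMG_0003.tif"]  # placeholder names
composite = fuse_focus_stack([cv2.imread(p) for p in paths])
cv2.imwrite("composite.tif", composite)
```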

Fig. 2

Two exemplary processing workflows used in this study to create segmented images. (a) Example of multi-focus image fusion where (1) several images of one object (a leaf lobe) are fused into a (2) composite image. The leaf lobe as shown in the composite image is then (3) segmented and the background removed. Several leaf lobe objects are then placed onto the (4) final image and a microscopic scale is applied. (b) Example of image stitching where (1) several fused images of the same object, showing the ventral side of the stature (habitus) of a plant, are (2) arranged into segments and (3) stitched into a composite image with larger dimensions. Several of these stitched images are then placed onto the (4) final image and a microscopic scale is applied.

Fig. 3

Flow chart of the bioimage processing workflow for one species. The workflow starts with the microscopy experiment, in which raw bioimages are acquired for several biological objects. These raw images are pre-processed using image enhancement methods such as color balance or exposure correction, using experimental metadata and generating expressive metadata. The bioimaging data are then further processed using image fusion or image stitching methods, in which several images of the same object are fused or stitched together. The processed images are then manually segmented, for example by separating the object from the background or by placing segmented objects such as leaves onto one image. Finally, a microscopic scale is placed onto the processed image using the metadata information. During each processing step, experimental data are recorded and annotated in the final image metadata. The flow chart was created using the draw.io web software tool.

Cloud infrastructural resources are able to execute computational workflows that combine data with computational analysis tools at a large scale42,43. However, there is still a considerable lack of data containing machine-actionable metadata1,3,44,45,46,47. To document provenance, ensure reproducibility and support reuse, every raw and segmented image in this data set has been associated with a rich set of contextual and expressive metadata48, documenting the phenotypic characters and recording any digital image processing (i.e., increasing contrast, brightness, image fusion) (Fig. 3). The metadata has been annotated with community-accepted semantics that allow for machine-actionable data mining and for creating scientific workflow modules that automatically produce segmented composite images by reusing the instructions contained in our metadata20,43,46,47,49.
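As an illustration of how such machine-actionable metadata can drive downstream processing, the following sketch groups raw images into fusion stacks by a stack-name field. The field names and values here are hypothetical simplifications; the actual fields recorded in this data set are described in the Methods and in Supplementary Table 2:

```python
from collections import defaultdict

# Hypothetical, simplified metadata records; the real data set records many
# more fields (taxon identifiers, objective, magnification, characters, ...)
records = [
    {"file": "IMG_0001.tif", "species": "Scapania cuspiduligera",
     "character": "leaf lobe", "stack_name": "stack_042"},
    {"file": "IMG_0002.tif", "species": "Scapania cuspiduligera",
     "character": "leaf lobe", "stack_name": "stack_042"},
    {"file": "IMG_0003.tif", "species": "Scapania cuspiduligera",
     "character": "stature ventral", "stack_name": "stack_043"},
]

# Group files by stack so each group can be passed to an image fusion step
stacks = defaultdict(list)
for rec in records:
    stacks[rec["stack_name"]].append(rec["file"])

for name, files in stacks.items():
    print(name, "->", files)  # e.g. stack_042 -> ['IMG_0001.tif', ...]
```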

In this Data Descriptor, we present the principles used to generate reference images from raw microscopic bioimaging data and show how individual images are associated with technical and expressive metadata. Although we were able to associate our images with a rich set of metadata, we ascertain that there is still a lack of usable ontological terms and schemas in bioimaging with regard to documenting image processing and associating individual images with phenotypic characters3,46 (Table 1). Our high-resolution images allow for large prints and for zooming in to obtain critical details, which is particularly important for species identification and for computational image analyses, computer-assisted species recognition and identification50,51. Although we have deposited the data in the two specialised imaging repositories BioStudies (containing raw and pre-processed images, which enable direct use in, e.g., machine learning approaches in computational ecology) and the Image Data Resource (containing pre-processed and fully segmented images that can be rapidly reused by ecologists), we ascertain that there is still a need to connect macro- and microscopic bioimaging data to biodiversity platforms3,52 such as iDigBio53 and GBIF54, or even the citizen-science community effort iNaturalist55. Our reference data framework facilitates the further integration of bioimaging data into other research disciplines56 and we therefore hope to inspire future data reuse and meta-synthesis in the fields of biodiversity and computational ecology.

Table 1 List of phenotypic characters that were associated with the images.

Methods

Sample collection and biological material

Representative voucher specimens were received from different herbaria. Supplementary Table 1 lists all voucher specimens and freshly collected samples investigated in this study. Samples were associated with taxonomic species identifiers (NCBI, GBIF, or Open Tree of Life identifiers, if available), the text on the specimen sleeves (collector, date and text on the envelopes) and the voucher specimen identifiers (the first letters indicate either the Index Herbariorum institution code57, if available, or the name of the private collection where the specimens were stored). Fresh samples of Diplophyllum taxifolium, Scapania cuspiduligera, Scapania gymnostomophila and Scapania subalpina were additionally investigated to depict oil bodies, which are usually absent from herbarium specimens. Fresh samples were collected at various sites, put into envelopes on-site, and identified and photographed afterwards. Information regarding the date, site (including geographical coordinates), habitat, substrate and further details was also collected.

Microscopy and photographic equipment

For microscopy, a Zeiss Axio Scope.A1 HAL 100/HBO, 6x HD/DIC, M27, 10x/23 microscope with an achromatic-aplanatic 0.9 H D Ph DIC condenser was used with the objectives EC Plan Neofluar 2.5x/0.075 M27 (a = 8.8 mm), Plan-Apochromat 5x/0.16 M27 (a = 12.1 mm), Plan-Apochromat 10x/0.45 M27 (a = 2.1 mm), Plan-Apochromat 20x/0.8 M27 (a = 0.55 mm), and Plan-Apochromat 40x/0.95 Korr M27 (a = 0.25 mm), using the EC PN and the Fluar 40x/1.30 III and PA 40x/0.95 III filters for DIC. The conversion filter CB3 and the wideband green interference filter were used to improve the digital reproduction of colors. Color balance was adjusted in the camera and during post-processing of the images. For macroscopy and for preparing microscopy slides, a Zeiss Stemi 2000c binocular microscope was used (apochromatic Greenough system with a stereo angle of 11° and 100/100 switchover between camera and ocular viewing). The objectives Canon MP-E 65 mm 2.8 1-5x macro and Venus Optics Laowa 25 mm 2.5-5.0x ultra-macro for Canon EF, together with the Canon EF-RF adapter, were used for stand-alone macroscopic images.

A full-frame, high-resolution camera (Canon EOS RP, 26 megapixels) was used to acquire the digital images. It was adapted to the microscopes via binocular phototubes with a 30°/23 sliding prism (Axio Scope.A1) and a 100:0/0:100 reversed image (Stemi 2000c), using a 60-T2 camera adapter for Canon EOS and the Canon EF-RF adapter. The Canon MP-E and Laowa 25 mm objectives were attached directly through the Canon EF-RF adapter.

Image processing

Figures 2 and 3 provide overviews of the image processing tasks that were performed. Images were recorded at different focal planes to construct images with an extended depth of field using computational methods. This “focus stacking” approach was automated for macroscopy by attaching the camera to a Cognisys StackShot macro rail fixed on a Novoflex macro stand, and for microscopy by adapting a Cognisys StackShot motor to the fine adjustment of the microscope using two cogged wheels: one small wheel (1 cm diameter) mounted on the motor and one large wheel (8.5 cm diameter) on the fine adjustment of the microscope. The two cogged wheels were coupled with a toothed belt to obtain very fine step increments of the stepping motor for high magnifications. A Cognisys StackShot controller was used to control the number and distance of motor steps with the following settings: Dist/Rev: 3200 stp, Backlash: 0 steps, # pics: 1, Tsettle: 100.0 ms, Toff: 450.0 ms, Auto Return: yes, Speed: 3000 st/sec, Tlapse: off, Tpulse: 800.0 ms, Tramp: 100 ms, Units: steps, Torque: 6, Hi Precision: Off, LCD Backlight: 10, Mode: Auto-Step. Between 25 steps (magnification 1x), 50 steps (magnification 25x) and 100 steps (magnification 400x) were used, with the number of steps depending on the aperture settings and the effective magnification.
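As a back-of-the-envelope check of the resulting step resolution, the sketch below assumes that the controller's Dist/Rev setting of 3200 steps corresponds to one motor revolution; the travel of the fine adjustment per revolution is a hypothetical example value, not a measured one:

```python
# Steps per revolution of the microscope's fine adjustment after the
# cogged-wheel reduction (1 cm motor wheel driving an 8.5 cm wheel)
STEPS_PER_MOTOR_REV = 3200            # controller setting Dist/Rev
WHEEL_RATIO = 8.5 / 1.0               # large wheel diameter / small wheel diameter
steps_per_fine_rev = STEPS_PER_MOTOR_REV * WHEEL_RATIO  # 27200 steps

# If one fine-adjustment revolution corresponds to, e.g., 100 um of focus
# travel (hypothetical value; consult the microscope manual), one motor
# step moves the focal plane by only a few nanometres:
FINE_TRAVEL_UM_PER_REV = 100.0
print(f"{FINE_TRAVEL_UM_PER_REV / steps_per_fine_rev * 1000:.2f} nm per step")
# -> 3.68 nm per step
```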

Raw images were recorded in CR3 format and pre-processed with Adobe Camera RAW. Non-destructive image processing steps such as correction of the field curvature, removal of chromatic aberration, and increases in contrast and brightness were performed in Adobe Camera RAW. Images were then exported to TIFF format, and all image processing steps were recorded in individual Adobe XMP files.

Multi-focus image fusion was performed on the individual images in the z-stacks using the software Helicon Focus 7.7.5, choosing the depth map and pyramid algorithms with different settings for radius (4, 8, 16, 24) and smoothing (2, 4). The best composite images were chosen manually and retained. When composite images contained specimens larger than the frame, several images were stitched together using the panorama stitching function of the software Affinity Photo 1.10.1.
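For readers who want a scriptable alternative to the manual panorama function, OpenCV ships a high-level stitcher. The sketch below is not the Affinity Photo workflow used in this study, and the tile file names are placeholders:

```python
import cv2

# Load the fused tiles of one object (placeholder file names)
tiles = [cv2.imread(p) for p in ["tile_01.tif", "tile_02.tif", "tile_03.tif"]]

# SCANS mode assumes a camera/stage translating over a flat scene, which
# matches microscope mosaics better than the default PANORAMA (rotating
# camera) mode
stitcher = cv2.Stitcher_create(cv2.Stitcher_SCANS)
status, mosaic = stitcher.stitch(tiles)
if status == cv2.Stitcher_OK:
    cv2.imwrite("mosaic.tif", mosaic)
else:
    print("Stitching failed with status", status)
```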

Image segmentation

Images were manually segmented and interfering background was removed using the flood select, brush selection and freehand selection tools in the software Affinity Photo. A stage micrometre was photographed separately with each objective and microscope combination to determine the scale, which was then calculated per pixel for each combination (file scale_bar_distances.csv in58). Scale bars were placed post-hoc onto the segmented images using the Python script scale_bar.py58.
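A minimal sketch of this scale-bar step could look as follows; it is not the study's scale_bar.py, and the per-pixel calibration value is a placeholder rather than one of the calibrations in scale_bar_distances.csv:

```python
from PIL import Image, ImageDraw

def add_scale_bar(path_in, path_out, um_per_px, bar_um=100):
    """Draw a horizontal scale bar of bar_um micrometres in the lower-right
    corner, converting its physical length to pixels via um_per_px."""
    img = Image.open(path_in).convert("RGB")
    draw = ImageDraw.Draw(img)
    bar_px = round(bar_um / um_per_px)
    margin = 50
    x0, y = img.width - margin - bar_px, img.height - margin
    draw.line([(x0, y), (x0 + bar_px, y)], fill="black", width=8)
    draw.text((x0, y - 40), f"{bar_um} \u00b5m", fill="black")
    img.save(path_out)

# Placeholder calibration, e.g. 0.35 um per pixel for one objective/camera pair
add_scale_bar("segmented.tif", "segmented_scaled.tif", um_per_px=0.35)
```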

Handling of metadata

Metadata including the species name, taxonomic rank information (NCBI-Taxon, GBIF and OTT taxonomy identifiers), voucher specimen ID, image acquisition date, an object description including the name of the captured phenotypic character(s), and the objective, microscope and magnification used were associated with each raw image based on unique file names. Table 1 lists the ecologically relevant phenotypic characters that were associated with the images. Individual file names (variable file list), the name within an image focus stack (variable stack name) and the name within an image stitching stack (variable stitch name) were additionally recorded to facilitate subsequent automated image processing in computational workflows. A Python script was created to place individual images, as parts of image stacks, into directories (file create_image_stacks.py in59); this script parses the Label tag in the XMP files. All metadata regarding image enhancement and non-destructive image processing were extracted from the XMP files using a simple Python script (file xmp_stack_to_tsv.py in59). The metadata were saved in individual TSV files and merged using a helper Python script (file tsv_merge.py in59). Supplementary Table 2 lists all fields that were extracted from the XMP files.
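To make the XMP-to-TSV step concrete, the following sketch pulls selected fields out of Adobe XMP sidecar files into one TSV table. It is a simplified stand-in for xmp_stack_to_tsv.py, not the script itself, and the field list is illustrative; XMP is RDF/XML, and Adobe Camera RAW typically writes values such as xmp:Label as attributes of rdf:Description, which a simple regular expression can capture:

```python
import csv
import pathlib
import re

FIELDS = ["xmp:Label", "crs:Exposure2012", "crs:Contrast2012"]  # example fields

def xmp_to_row(path):
    """Extract the FIELDS attributes from one XMP sidecar file."""
    text = path.read_text(encoding="utf-8", errors="replace")
    row = {"file": path.name}
    for field in FIELDS:
        m = re.search(rf'{re.escape(field)}="([^"]*)"', text)
        row[field] = m.group(1) if m else ""
    return row

rows = [xmp_to_row(p) for p in sorted(pathlib.Path(".").glob("*.xmp"))]
with open("xmp_metadata.tsv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["file", *FIELDS], delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)
```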

Data deposition

Raw camera images in CR3 format and pre-processed imaging data in TIFF format were uploaded to BioStudies using the IBM Aspera command line software tool ascp version 3.8.1.161274 to ensure that the data were transmitted without errors. Sparse file checksumming was enabled to ensure the integrity of the files during transfer (parameter -k 2). The raw bioimaging data are available under the BioStudies identifier S-BIAD188.

Pre-processed images were converted to the Bio-Formats OME-TIFF format60 by creating intermediate ZARR pyramid tiles using the bioformats2raw converter version 0.4.0 and then creating the final pyramid images with the raw2ometiff software tool version 0.3.0. Individual fully segmented and processed images were associated with standardised geolocation information to improve data reuse and to enable the linking of bioimaging data to ecological data repositories. Swisstopo CH1903/LV03 coordinates were converted to WGS84 using Swisstopo-WGS84-LV0361. The processed images were further associated with the metadata listed in Table 2 to enable machine-readability in the IDR. A helper script was implemented in R to facilitate the generation of TSV tables for data upload to the Image Data Resource (IDR) repository (_tsv_res_2_idr.r in62). The processed images and the metadata aggregated in a TSV table were uploaded to the IDR using the software Globus Connect Personal 3.1.6. The dataset is available under the identifier idr0134.
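The coordinate conversion itself is well documented by Swisstopo. The sketch below implements Swisstopo's published approximation formulas (accurate to about one metre); it is an illustration, not the Swisstopo-WGS84-LV03 tool actually used in this study:

```python
def lv03_to_wgs84(east, north):
    """Convert Swiss CH1903/LV03 coordinates (metres) to approximate WGS84
    latitude/longitude (degrees) using Swisstopo's approximation formulas."""
    y = (east - 600_000) / 1_000_000   # auxiliary values relative to Bern
    x = (north - 200_000) / 1_000_000
    lon = (2.6779094 + 4.728982 * y + 0.791484 * y * x
           + 0.1306 * y * x ** 2 - 0.0436 * y ** 3)
    lat = (16.9023892 + 3.238272 * x - 0.270978 * y ** 2
           - 0.002528 * x ** 2 - 0.0447 * y ** 2 * x - 0.0140 * x ** 3)
    # The intermediate results are in units of 10000" -> convert to degrees
    return lat * 100 / 36, lon * 100 / 36

# LV03 projection origin (600000 / 200000, old observatory in Bern)
print(lv03_to_wgs84(600_000, 200_000))  # approx. (46.951, 7.439)
```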

Table 2 List of semantic terms used to annotate the segmented images.

Data Records

Two separate data records were created to enable rapid use of the data in machine learning and biodiversity approaches.

(1) The camera raw images (Canon CR3 format), the pre-processed images (16-bit TIFF format), and the contextual metadata were deposited at BioStudies under the identifier S-BIAD18863. The data record consists of a total of 223’989 individual raw image files partitioned into 48 species. The entire data record has a total size of approx. 12 TB.

(2) The pre-processed and the fully segmented and processed images, along with their metadata, were deposited in OME-TIFF format at the Image Data Resource (IDR) repository under the identifier idr013464. The data record consists of a total of 4233 pre-processed and 905 fully processed image files and has a total size of approx. 14 TB.

Technical Validation

Biological validation of the species identity and of the visible phenotypic characters in the pre-processed images was performed by consulting the external experts Edwin Urmi, Heike Hofmann, Vadim Bakalin and Kristian Hassel. Photos of the herbarium specimen CM-30377, originating from North America (Supplementary Table 1), show quite different characters compared with the voucher specimen B-108428 originating from Northern Europe. Hence, the taxonomic status of the species Scapania glaucocephala is not yet fully clear15. Photos of CM-30377 may thus relate to the species Scapania scapanioides (C.Massal.) Grolle, which is listed in7 as a separate species occurring in Europe. Furthermore, S. brevicaulis and S. degenii may be taxonomically identical species, and additional research is needed to resolve their taxonomic status. Images from this study can help to clarify the relationships of phenotypic characters and the phylogenetic and taxonomic status of cryptic species.

In order to validate the technical quality of the fused composite images, multi-focus image fusion methods were applied with different settings to the individual images in the stacks. Composite images were manually inspected and the best image was retained. Generally, classic Laplacian pyramid transform-based methods, such as the Pyramid Maximum Contrast method implemented in the software Helicon Focus, produce good results in complex cases with intersecting objects and along edges (boundary regions), but this algorithm increases contrast and glare, is prone to noise and artefacts, and is generally considered less accurate with regard to the reproduction of microscopic objects65,66,67,68. The deterministic depth map-based method implemented in the software Helicon Focus first calculates depth maps from intermediary images, based on the absolute difference in the brightness of corresponding pixels in the source images and smoothed intermediary images, and then generates the composite image from the source image pixels with indices differing from the indices in the smoothed depth map69. Whereas larger values of the parameter radius increase blur along edges, lower values can introduce artefacts; the amount of blur along the transition between fused areas of individual images can be controlled with the parameter smoothing. The depth map-based method generally produces accurate reproductions of microscopic objects. However, in some circumstances, and especially at high magnifications, it can generate large artefacts and blur around the edges (boundary regions) (Fig. 4). Recently, machine learning-based methods, which may be superior to deterministic approaches, have been applied to focus-based image fusion tasks37. Although some algorithms have been proposed specifically for microscopic imaging, there is a considerable lack of usable implementations and of microscopic training data for machine learning-based algorithms37. Our reference dataset can be used to train and improve such algorithms.

Fig. 4

Deficiencies of multi-focus image fusion methods. Red circles and bars were drawn post-hoc with Affinity Photo to indicate the deficient regions in the images, i.e., the regions where multi-focus image fusion methods can produce blur and artefacts in composite images of microscopic objects. (a) Crop of IMG_15321621, Scapania cuspiduligera, ventral side of the stature. Visible blur along the edges (boundary regions) of overlapping leaves (parameter settings: method: depth map, radius: 8, smoothing: 4). (b) IMG_01070226, Scapania ligulifolia, dorsal side of the stature (parameter settings: method: depth map, radius: 32, smoothing: 20).

Python scripts, available as open-source software on GitHub58,59, were written to facilitate the automated processing of the images. These scripts use the metadata information to arrange individual images into image stacks and to perform focus-based image fusion and image stitching tasks. However, most of the work was still carried out manually, and scientific workflows need to be developed that fully automate the entire process by combining images with software tools that utilise the machine-actionable information contained in the metadata43,49,70. The metadata used herein was validated using the procedures described in20,46,47. Standardised vocabularies were used following the FAIR guiding principles1. Once improved algorithms have been developed, the entire pipeline can be re-run, resulting in improved segmented images without further intervention. This data reuse and the rich documentation in the metadata will foster good scientific practice through source tracking and provenance.

Usage Notes

Raw camera and pre-processed imaging data, the fully segmented and processed images, and the metadata are available under the terms of the Creative Commons CC BY 4.0 license. The open-source software scripts and code58,59 are available under the terms of the BSD 3-Clause license.