## Main

Machine learning is proving vital for cell-scale optical microscopy image and video analysis for studies in the life sciences1,2,3,4,5,6,7, including major tasks such as cell segmentation, tracking, classification and population analytics. However, artificial intelligence and machine learning solutions are lacking for the analysis of small and dynamic subcellular structures such as mitochondria and vesicles. This affects how subcellular mechanisms are studied in the life sciences. First, only small-scale analyses can be conducted, via tedious manual annotations (Fig. 1a) and with limited conclusiveness owing to manual subjectivity (Fig. 1b) and the small statistical sample size. Second, because the difficulties of annotation make machine learning methods rare in live-cell analysis, other forms of quantitative statistical analysis such as fluorescence correlation spectroscopy are used instead. Third, electron microscopy image analysis is used for subcellular morphological investigations, but this does not provide the perspective of real-time unfolding of subcellular mechanisms. The information acquired using these techniques is valuable, but a computer-vision (CV) centric approach for observing live-cell subcellular processes holds an untapped potential for gaining unprecedented insights.

The segmentation of subcellular structures is a fundamental step towards realizing CV of subcellular mechanisms. Our interest lies in segmenting small and dynamic subcellular structures in cells from fluorescence microscope images. This is an immensely difficult task because of the small sizes of subcellular structures with respect to both the optical and digital resolutions of the microscopes. The structures often have dimensions on the order of 100–1,000 nm, while the pixel size in microscope images is generally 80–120 nm and the optical-resolution limit of advanced live-cell-compatible microscopes is typically 200–300 nm. This means that the details of the structures are often lost. Furthermore, the point spread function (PSF) of optical microscopes induces a three-dimensional (3D) blur, as a result of which out-of-focus structures appear with different intensities and blurring profiles to those that are in focus. Accordingly, the segmentation of out-of-focus structures is often inaccurate. There are also other problems. Small structures have few binding locations for fluorophores, resulting in a low fluorescence intensity per structure. Small structures in living cells are highly dynamic and demand high-speed imaging (10–100 ms per frame), thus requiring short exposure times and limited fluorescence intensity. The signal-to-noise ratio is thus quite poor (2 to 4 in our experiments), which compounds the difficulty of segmentation. The great variability in the structures and the possibility of multiple overlapping structures creating high-intensity spots further complicate matters. These challenges are discussed in more detail in Supplementary Note 1 and Supplementary Fig. 1.

Image-processing techniques such as the popular Otsu approach8 are consequently grossly inaccurate and contribute to large errors in conclusions about subcellular structures, as shown in the analytics results in Fig. 1a. Therefore, despite being fast (Fig. 1a), these techniques are not good candidates for performing subcellular analyses. Semi-supervised solutions also exist9, but these are prone to subjectivity and tediousness, similar to manual segmentation. Supplementary Note 2 discusses the existing approaches in more detail, including manual annotation (Supplementary Fig. 2). Meanwhile, deep learning solutions hold promise for both live-cell analysis and large-scale systematic studies.

Interestingly, deep learning solutions optimized for cell-scale optical microscopy data10 cannot simply be translated to subcellular scales because of the problems associated with digital and optical resolution and noise. In general, generating ground truth (GT) through manual segmentation over large datasets is considered the only way to create training datasets. However, generating correct GT manually for fluorescent images of small subcellular structures is not possible, as the inaccuracy of every pixel contributes a non-negligible amount of error. Consequently, challenging subcellular structures such as mitochondria and vesicles have received less attention.

In this Article, we present a new physics-rooted deep learning approach for solving the GT deficiency in subcellular segmentation (Fig. 1). There are two key parts to our approach: (1) physics-based simulation-supervised learning, in which a supervised training dataset is created by simulating noisy microscope images, and (2) physics-based GT for generating the target segmentation in supervised learning. The simulation-supervised approach is a form of synthetic data-oriented approach. We present a short study about the ineffectiveness of synthetic data generation for our application in Supplementary Note 3 and Supplementary Fig. 3. Supplementary Note 4 presents a discussion on other known simulators for optical microscopy with relevance to our problem and assesses the possibility of using super-resolution microscopy for generating synthetic microscopy datasets.

In our simulation-supervised approach, the training data are generated using a physics-based simulator that simulates everything from the binding location of an individual fluorescent molecule and its photokinetics to the 3D geometry of the subcellular structure on which fluorescent molecules are present, as well as the microscope instrument and noise characteristics. In addition, the physics-based simulation allows us to design a physics-based GT approach that is unbiased by the microscope instrument, free of manual subjectivity of segmentation, and assures that a particular geometry and fluorescent labelling result corresponds to a unique GT. This simulation engine and physics-based GT generates notably better and unbiased segmentation than expert-generated manual GT (Fig. 1b).

In this Article, we show that our simulation-supervision approach allows good-quality segmentation with a variety of deep learning approaches, indicating its suitability for resolution-limited, noise-afflicted and GT-deficient subcellular microscopy data analysis. It can be applied across a variety of experimental conditions and cell types (Fig. 2) to segment organelles for which training was performed using simulation-supervised datasets (for example, mitochondria). We demonstrate the generalizability of our approach across microscopes through transfer learning (Fig. 3) and the possibility of performing multi-class classification, tracking and morphology-associated analytics at the scale of individual mitochondria (Figs. 4 and 5). We are also able to identify and analyse the interaction of mitochondria and vesicles inside living cells (Fig. 5).

## Results on segmentation

We present the results of segmenting two types of subcellular entity, mitochondria and vesicles, in nine datasets of living and fixed cells (the datasets are described in Supplementary Fig. 4). Mitochondria and vesicles were chosen because they are interesting cases. Mitochondria are highly dynamic, tubular and shape-changing, with diameters close to the optical-resolution limit (200–300 nm) and lengths easily exceeding the depth of the focal field, rendering their segmentation challenging. Vesicles have simple geometries but vary significantly in size, with some being smaller than the optical-resolution limit and comparable to the digital resolution (pixel size of ~100 nm) and others being much larger. They thus present large variability in optical intensity and visibility. In fact, as our results indicate, vesicles are more challenging to segment than mitochondria, despite having a simpler geometry. Another point of interest is that subcellular mechanisms involving mitochondria and their interactions with vesicles such as endosomes and lysosomes are crucial for cell homeostasis and relevant for understanding disease development11.

### Physics-based simulation-supervised dataset and deep learning

Our simulation engine includes separate modules for the geometry of the subcellular organelles and labelling, the photokinetics of fluorescence, the microscope and image simulator, the noise simulator and the GT simulator. A detailed description of the six-step process (Fig. 1e) is presented in Supplementary Note 5 and Supplementary Fig. 5. The engine is extensible to further varieties of subcellular structures, microscopes and labelling protocols. At present it includes mitochondria and vesicle geometries (Supplementary Table 1), the ability to simulate epifluorescence and Airyscan microscopes (Supplementary Table 2), as well as surface labelling of vesicles and mitochondria. The simulation of photokinetics12 is important for high-speed microscope videos (on the scale of milliseconds per frame) to model frame-to-frame variability. The modules are customizable to include other photokinetic and noise models. We created six simulation datasets, considering the two subcellular structures and three microscopes (Supplementary Table 3). Of these, datasets SimEpi1Mito and SimEpi1Vesi contain 7,000 images each. The other datasets contain 3,000 images each, so as to explore the impact of the size of the simulation dataset on the accuracy of segmentation and the possibility of performing transfer learning across different microscopes.

We tested the efficacy of our simulation-supervised training as a suitable paradigm by using U-Net13 (Supplementary Fig. 6) in conjunction with five state-of-the-art backbone networks. We found that Inception-V3 and EfficientNet-B3 generally performed best, although the other networks also performed robustly. For all further results we used the EfficientNet-B3 backbone. Details of the results are provided in Supplementary Note 6 and Supplementary Table 4. We also found that the performance is stable when the training dataset contains 3,000 or more simulation-supervised data samples (Supplementary Table 5).

### Physics-based GT

The physics-based simulator provides us with a unique opportunity to test different strategies for GT segmentation (Supplementary Note 7). Because the raw microscope images are generated before inclusion of the noise model, we explored the use of noise-free images with conventional morphological processing for generating the GT. We evaluated Otsu’s thresholding8 as well as Otsu’s thresholding followed by morphological erosion by a kernel of the size of the microscope PSF to compensate for the blur of the microscope. A third technique was explored considering the role of noise in affecting how well an expert can segment images. For this, we thresholded using the noise level. Finally, we used the projection of the actual emitter distribution on the image plane directly as a mechanism of physics-based GT, as shown in Fig. 1f. This approach is not affected by the PSF or the noise level, and provides a unique unambiguous GT for a given sample. Supplementary Fig. 7 shows a comparison of these different methods for generating the GT.
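The idea of projecting the emitter distribution onto the image plane can be illustrated with a minimal sketch. It assumes that emitter positions are available as (x, y, z) coordinates in nanometres and that a pixel belongs to the GT if at least one emitter projects onto it along z. The function name `project_emitters`, the 100 nm default pixel size and the binary rule are illustrative assumptions, not the simulator's actual interface; the procedure used in this work is described in Supplementary Note 7.

```python
def project_emitters(emitters_nm, shape=(128, 128), pixel_size_nm=100.0):
    """Project 3D emitter positions onto the image plane as a binary mask.

    emitters_nm: iterable of (x, y, z) positions in nanometres (hypothetical format).
    A pixel is set to 1 if at least one emitter falls onto it, whatever its z.
    Neither the PSF nor the noise level enters this computation.
    """
    rows, cols = shape
    mask = [[0] * cols for _ in range(rows)]
    for x, y, _z in emitters_nm:
        col = int(x // pixel_size_nm)  # floor division keeps negative positions out of range
        row = int(y // pixel_size_nm)
        if 0 <= row < rows and 0 <= col < cols:
            mask[row][col] = 1
    return mask
```

Because the mask depends only on the emitter positions, the same labelled geometry always yields the same GT, regardless of focus, blur or noise.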

We compared the four GT methods to identify which was the best strategy (Supplementary Note 7 and Supplementary Table 6). We found that the physics-based GT allows the deep learning models to perform better than the other GT mechanisms. Even a visual comparison of physics-based GT with a manual expert’s segmentation (manual GT 2), as presented in Fig. 1b, presents a very good match. It also outperforms expert annotations in the challenging situations of out-of-focus structures and high noise levels (Supplementary Fig. 1c,d).

We also assessed the sensitivity of the performance of simulation-supervised training to important aspects of simulation (Supplementary Note 8). We note that our approach performs better if the simulation conditions closely match the experimental conditions (Supplementary Table 6 and Supplementary Fig. 8). The sensitivity of our approach contributes to good selectivity.

### Comparison with contemporary techniques

We compared the performance of current methods used for segmenting subcellular structures from optical microscope images, namely (1) automated image-processing techniques such as Otsu-based thresholding8, adaptive thresholding14 and backpropagation15, (2) semi-automatic segmentation techniques9 and (3) the proposed simulation-supervised deep learning approach. The details of applying these methods are presented in the Methods. Table 1 presents the mean intersection over union (mIOU) values and the F1 scores. For the proposed method, we trained one deep learning model each for mitochondria and vesicles and used them with their corresponding test datasets. For the simulated test data of SimEpi1Mito and SimEpi1Vesi with physics-based GT, the proposed approach gives an advantage of ~10% for mitochondria and ~18% for vesicles.

We assessed whether the simulation-supervised training approach presents advantages over training a fresh and a pre-trained network with manually annotated GT (Supplementary Note 9 and Supplementary Table 7). Our results show that, even when manual annotation is used as the GT, the proposed simulation-supervised training approach outperforms training using the manually annotated dataset. We also assessed whether it helps to use a larger dataset with manual annotation generated by a sophisticated consensus-based annotation approach. We evaluated manual annotations by 12 scientists with the relevant background and obtained consensus on the GT. The consensus was quite unreliable, even with multiple annotators (Supplementary Fig. 2 and Supplementary Note 2). Also, the performance evaluation showed a large standard deviation as well as a large difference from the performance obtained with the physics-based GT (Supplementary Note 9 and Supplementary Table 8). Furthermore, the time taken by different annotators ranged from 3 to 10 min for annotating eight mitochondria in five small images. This task could thus prove considerably demanding in terms of resources if a large pool of annotators were employed for generating annotations for a sufficiently large dataset.

### Live-cell segmentation results

We generated results on the live-cell datasets LiveEpi1RatMitoRed, LiveEpi1RatVesiFarRed and LiveEpi1HuMitoGreen. These datasets were acquired on epifluorescence microscope Epi 1 and manually annotated by an expert. Table 1 presents the mIOU values and F1 scores. The simulation-supervised deep learning method provides mIOU values of 74–76% and outperforms the closest method by 7–8% for mitochondria and ~13% for vesicles. A qualitative comparison of all the methods indicates that the proposed approach indeed provides the best results (Supplementary Fig. 9). Representative results for the proposed method (Fig. 2) illustrate that our approach can be applied to cells of different kinds and species (rat cardiomyoblasts are shown in Fig. 2a and human cancer cells in Fig. 2b). It can also be applied across cells subjected to different growth conditions, for example, under normal cell growth conditions (Fig. 2a, top row) and in cells subjected to hypoxia for 1 h before being imaged (Fig. 2a, bottom row). Further illustrative examples indicate that there are situations where the expert annotation appears deficient in comparison to the CV-based segmentation result due to factors such as out-of-focus light and noise (Supplementary Figs. 10 and 11). Interestingly, in the LiveEpi1RatVesiFarRed dataset, a membrane marker is used that labels a huge variety of membrane structures including large structures and not just vesicles. Therefore, the raw microscope images of vesicles show a lot of other details beside vesicles. Nevertheless, the deep learning method shows clear proficiency at selecting vesicles, similar to trained experts. We also present interesting results in Fig. 2b that pertain to segmentation of mitochondria in human cancer cells from the LiveEpi1HuMitoGreen dataset. The zoomed view of region 1 in Fig. 2b shows the ability of the proposed method to tackle low-intensity regions. 
Conventional methods are unable to deal with such a situation where there are also other higher-intensity regions. In addition, dense mitochondrial regions and mitochondrial networks are a bigger challenge for expert annotation than the proposed segmentation approach, which yields good segmentation results that retain a lot of structural detail, for example, as shown in zoomed-in region 2 of Fig. 2b (also Supplementary Fig. 12). The quantitative performance of only 74−76% may thus be attributed to imperfect manual annotations as well.

### Generalizability across microscopes and fluorophores

The generalizability across microscopes through computation-inexpensive re-training is crucial for quick adoption of this approach in various bioimaging laboratories across a variety of imaging set-ups. We thus assessed whether a simulation-supervised approach is amenable to transfer learning across microscopes (Fig. 3a). Details of the experiments are provided in Supplementary Note 10 and the results in Supplementary Table 9. We first considered two epifluorescence microscopes with different optical parameters (Supplementary Table 2). The results indicate a significant improvement in segmentation after transfer learning for challenging cases (Supplementary Fig. 13). We next considered transfer learning from an epifluorescence microscope to a different type of microscope, the Airyscan microscope. Here, as well, the results indicate a clear enhancement in the quality of segmentations and imply improved interpretability (Supplementary Fig. 14).

We also assessed the generalizability of our approach across fluorophores (Supplementary Note 8). We note that our approach performs robustly even if the emission wavelength of the fluorophores used in experiments differs from the emission wavelength used for the simulation-supervised dataset. At the same time, our approach shows high structural specificity, as illustrated by the challenging example presented in Supplementary Note 11 and Supplementary Fig. 15. Generalizability across fluorophores without re-training is a significant advantage because, when a deep learning model is trained for one subcellular structure and microscope, it can be used in a versatile manner for a wide range of biological experiments, irrespective of fluorophore, cell type and cellular conditions.

## Application to morphological analysis

We demonstrate two applications: deriving morphological data analytics and event detection by tracking. This is made possible due to the better quality of segmentation over a large population of individual subcellular structures across optical microscope images of several cells under different conditions.

### Morphology-based analytics

Morphological classification of mitochondria as dots, rods and network is highly informative16. The relevant statistics include the number and size of different mitochondrial phenotypes17. The primary challenge of such an analysis is accurate segmentation, which has been resolved. Figure 4a presents the steps of our analysis (also discussed in the Methods). Figure 4b presents the statistics of different morphologies and Fig. 4c depicts graph-based connectivity analysis in cells subjected to three different cell growth conditions in the dataset LiveEpi1RatMitoRed. A similar analysis of small and large vesicular structures is presented in Fig. 4d. These results indicate the potential for (1) large-scale automated analysis under different cell growth conditions and (2) automated analysis of the evolution of cell health under dynamically changing cell-culture conditions. We present an additional analysis of the effect of carbonyl cyanide m-chlorophenyl hydrazone, a drug known to alter mitochondrial membrane potential, on mitochondrial morphologies observed in a living cell over a period of 60 min after administering the drug (Supplementary Note 12 and Supplementary Fig. 16).

### Tracking and analysis of morphologically significant events

We assigned an identity to each segmented mitochondrion and tracked it in high-speed microscopy videos of living cells. Because the segmentation both encodes morphologies and enables temporal tracking, the morphological changes over time can be monitored. We note different motion patterns that may be biologically relevant. Four examples are presented in Fig. 5 and one in Supplementary Fig. 17 (Supplementary Videos 1–5). Figure 5a shows a migrating mitochondrion. The mitochondrion in Fig. 5a was segmented by the expert as two mitochondria until the expert observed that the mitochondrion moves as a single entity. However, simulation-supervision segmented it as a single mitochondrion, compensating automatically for the out-of-focus region in the middle of it. In Fig. 5b, a mitochondrion performs a flip-and-move manoeuvre over ~40 s. Figure 5c shows a typical morphological change of a mitochondrion, from curled to elongated. Figure 5d shows an interesting situation where a vesicle migrates towards a mitochondrion in a seemingly targeted manner and then interacts with the mitochondrion. The analysis of such an event using CV demonstrates the utility of simulation-supervised deep learning-based segmentation for advanced analytics. The ability to perform automated detection of such events could lead to correlated behaviour analysis.

## Discussion and conclusion

The proposed method brings physics and machine learning to a nexus where machine learning can create a significant impact with the help of physics-based modelling. Physics-based simulation-supervised training is thus proven to be the vital solution to the challenging GT-deficient problem of segmentation using deep learning for subcellular structures. The newly defined physics-based GT allows deep learning to tackle the optically hard problem of out-of-focus light and PSF-associated blurring. It also enables correct identification of the structures that the models have been trained to recognize, even in the challenging cases of fluorescence bleed-through (Supplementary Fig. 15). This approach is also generalizable across different types of cell and fluorophore. Transfer learning using smaller microscope-specific simulation-supervised datasets is a suitable mechanism for adopting the proposed paradigm across various fluorescence microscopy systems. Although the approach itself is generalizable, the models trained using this approach are sufficiently discriminative of the experimental conditions. Such discriminative ability is of significance in avoiding misleading inferences.

Valuable biological knowledge can be derived from automated, accurate segmentation of hundreds of subcellular organelles across one cell as well as in several cell images and long live-cell videos. Thus, several opportunities for performing advanced CV tasks are enabled by our proposed segmentation approach. Here, two proof-of-concept applications show the ability to perform morphology analysis and morphology-derived analytics. More application-specific morphological features of interest can be derived. Further morphological dynamics-based features may be analysed through tracking and following the morphological changes of segmented structures, for example, as shown in Fig. 5. The results strongly suggest that our approach is applicable to a wide range of different automated analysis pipelines. Accordingly, it may advance research in a variety of fields of biology and biomedicine in which the results and fundamental knowledge are often derived from bioimage analysis.

We highlight that the proposed method establishes the utility of a physics-based simulation-supervised training approach for deep learning applications in the microscopy data of living cells. This will open other research avenues in the future. More challenging and complicated structures of interest in the life sciences, such as the endoplasmic reticulum and Golgi bodies, can be simulated to extend the applicability of this approach in life-science studies. Furthermore, it will be interesting to explore whether 3D segmentation can be derived from raw microscopy image stacks with only a few z-planes, enabling long-term live-cell 3D analysis to be undertaken and more accurate observations to be derived (an example case is shown in Supplementary Fig. 17). Another important extension of this approach could be for label-free microscopy modalities such as bright-field microscopy. However, the realization of accurate physics-based simulation models for small structures will be a significant challenge18, because the inherent optical contrast of the structures contributes to multiple scattering in the near field, which requires mathematically nonlinear physics solvers. Still, such models might be realized in the future and optimized for the large-scale creation of datasets of complex structures. We expect the nexus between machine learning and biology to only grow stronger, in the near future revolutionizing both our insights about biological systems and the opportunities available to researchers in the life sciences.

## Methods

### Physics-based simulation and GT mechanisms

The simulation flowchart is shown in Fig. 1e (further extended in Supplementary Fig. 5) and the simulation approach is presented in detail in Supplementary Note 5. The GT mechanisms are presented in Fig. 1f and detailed in Supplementary Note 7. The simulation was implemented in Python 3.6 on a Windows-based computer. The simulator is shared for public use (see ‘Code availability’ section).

### Preparation of the simulation data for training and testing

We used three different microscopes for imaging and therefore used similar configurations in the simulations to create a separate training dataset for each microscope (Supplementary Table 3). Our simulation framework generates 128 × 128 image pairs (image and segmentation GT). We combine four such independent images to create a 2 × 2 tile with dimensions of 256 × 256. We use two types of training batch. The first is a large volume of data (7,000 tiles) for a specific microscope; it is used for the baseline experiments, in which the deep model is trained from scratch. The second is a comparatively small amount of data (3,000 tiles) generated for two different microscopy settings; it is used to assess the effect of transfer learning. We use standard data augmentation such as flip and rotation during training. Each simulation dataset is split into 60% for training, 20% for validation and 20% for testing. The procedure is repeated for two different subcellular structures: mitochondria and vesicular structures.
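The tiling and splitting steps can be sketched as follows. This is a minimal illustration in plain Python, with images represented as lists of rows; the helper names `make_tile` and `split_dataset`, the shuffling and the fixed seed are assumptions made for the example rather than details of our pipeline. The same tiling would be applied to the corresponding GT masks so that image and GT stay aligned.

```python
import random

def make_tile(images):
    """Combine four equally sized images (lists of rows) into one 2 x 2 tile."""
    if len(images) != 4:
        raise ValueError("expected exactly four images")
    a, b, c, d = images
    top = [row_a + row_b for row_a, row_b in zip(a, b)]        # a | b
    bottom = [row_c + row_d for row_c, row_d in zip(c, d)]     # c | d
    return top + bottom

def split_dataset(items, train=0.6, val=0.2, seed=0):
    """Shuffle and split into training, validation and test subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)   # seeded shuffle (assumption) for reproducibility
    n_train = int(round(train * len(items)))
    n_val = int(round(val * len(items)))
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

# Four 128 x 128 images give one 256 x 256 tile.
images = [[[k] * 128 for _ in range(128)] for k in range(4)]
tile = make_tile(images)
train_set, val_set, test_set = split_dataset(range(7000))
```

With 7,000 tiles, this split gives 4,200 training, 1,400 validation and 1,400 test tiles.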

### U-Net backbones and training details

The backbones used in the U-Net encoder for Supplementary Table 4 are ResNet50, ResNet 100 (ref. 19), VGG16 (ref. 20), Inception (ref. 21) and EfficientNet-B3 (ref. 22). In our research, we found EfficientNet-B3 to be the best-performing encoder in terms of validation accuracy on the simulation-supervised mitochondria dataset. This consists of three mobile inverted bottleneck convolution layers integrated in the encoder. For each network, the input and output are 256 × 256. We use standard data augmentation such as flip and rotation where applicable. Early stopping and learning rate reduction based on the mIOU are also used. All experiments were carried out using an Intel(R) Xeon(R) Gold 6154 CPU with 128 GB of RAM and an NVIDIA Quadro RTX 6000 GPU with a capacity of 24 GB.

### Contemporary methods for subcellular segmentation

Here, we present details of the implementation of the contemporary methods (the results are presented in Table 1 and Supplementary Fig. 9). Otsu-based thresholding8 and adaptive thresholding14 use a histogram of the intensities for segmentation. These are non-parametric methods and therefore do not require any user input. We use the OpenCV 3.4 library combined with Python 3.7 for the implementation. The ImageJ-based morphological plugin9 (MorphoLibJ v1.4.1) is used on the Windows platform with default parameter settings. Manual thresholding is implemented using the OpenCV 3.4 library combined with Python 3.7, and a suitable threshold for best-performing segmentation is extracted by varying the global threshold over the intensity histogram of the images. The iterative backpropagation-based segmentation15 is implemented in the OpenCV 3.4 library combined with Python 3.7, and we use 1,000 iterations for the segmentation benchmarking.
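For readers unfamiliar with histogram-based thresholding, the principle behind Otsu's method can be sketched in a few lines. The benchmarks reported here used the OpenCV implementation; the pure-Python version below is only an illustrative re-implementation that assumes 8-bit integer intensities, and the function names are hypothetical.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that maximises the between-class variance.

    Pixels with intensity <= t form one class, pixels > t the other.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    w0 = 0        # number of pixels in the lower class
    sum0 = 0      # intensity sum of the lower class
    best_t, best_var = 0, -1.0
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # lower class still empty
        w1 = total - w0
        if w1 == 0:
            break             # upper class empty from here on
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t

def segment(image, threshold):
    """Binary mask: 1 where the intensity exceeds the threshold."""
    return [[1 if p > threshold else 0 for p in row] for row in image]

image = [[10, 12, 200], [11, 210, 205]]
t = otsu_threshold([p for row in image for p in row])
mask = segment(image, t)
```

A single global threshold of this kind cannot separate a dim structure from the blurred halo of a brighter one, which is consistent with the behaviour of thresholding methods described in the main text.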

### Evaluation metric

The metric used for quantification of the performance of segmentation is mIOU. This is a state-of-the-art metric used in segmentation problems10. mIOU values are calculated by taking the ratio of the overlapped segmented area and the union of the segmented area between the GT and segmented image, that is TP/(TP + FP + FN), where true-positive (TP), false-positive (FP) and false-negative (FN) regions are used.
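The ratio TP/(TP + FP + FN) can be computed directly from a pair of binary masks, as in the sketch below. The F1 score reported in Table 1 is included using its standard definition, 2TP/(2TP + FP + FN). Averaging the per-image values to obtain the mean, and returning 1.0 when both masks are empty, are assumptions made for this illustration.

```python
def confusion_counts(gt, pred):
    """Count true-positive, false-positive and false-negative pixels."""
    tp = fp = fn = 0
    for gt_row, pred_row in zip(gt, pred):
        for g, p in zip(gt_row, pred_row):
            if g and p:
                tp += 1
            elif p:
                fp += 1
            elif g:
                fn += 1
    return tp, fp, fn

def iou(gt, pred):
    """Intersection over union: TP / (TP + FP + FN)."""
    tp, fp, fn = confusion_counts(gt, pred)
    denom = tp + fp + fn
    return tp / denom if denom else 1.0   # both masks empty (assumed convention)

def f1_score(gt, pred):
    """Standard F1 (Dice) score: 2TP / (2TP + FP + FN)."""
    tp, fp, fn = confusion_counts(gt, pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0

def mean_iou(pairs):
    """Average the IoU over (gt, pred) mask pairs (assumed averaging scheme)."""
    values = [iou(gt, pred) for gt, pred in pairs]
    return sum(values) / len(values)

gt = [[1, 1, 0], [0, 0, 0]]
pred = [[1, 0, 1], [0, 0, 0]]
```

In this toy example there is one TP, one FP and one FN pixel, so the IoU is 1/3 and the F1 score is 0.5.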

### Processing related to the morphological analysis presented in the main text

First, we apply our simulation-supervised deep learning model for segmentation. Next, the binary images are converted into skeletons and a graph is constructed according to ref. 23, where nodes are endpoints or mitochondrial junctions. The degree of a node (d) is the number of branches connected to the node. Finally, a rule-based classification based on the segmented area and the node degrees assigns each graph to one of three categories (dot, rod or network) using

$$\mathrm{Class}_{\mathrm{mitochondria}}=\begin{cases}\mathrm{Dot}, & \mathrm{if}\ \mathrm{area}\le 120\ \mathrm{pixels}\ \mathrm{and}\ \max(d)=1\\ \mathrm{Rod}, & \mathrm{if}\ \mathrm{area}>120\ \mathrm{pixels}\ \mathrm{and}\ \max(d)=1\\ \mathrm{Network}, & \mathrm{otherwise}\end{cases}$$
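The rule above translates into a short function. The sketch follows the equation literally, so any graph whose maximum node degree differs from 1 falls into the 'otherwise' branch and is labelled a network; the function and parameter names are illustrative.

```python
def classify_mitochondrion(area_px, max_degree, area_threshold_px=120):
    """Classify one segmented mitochondrion as 'dot', 'rod' or 'network'.

    area_px: segmented area in pixels.
    max_degree: largest node degree in the skeleton graph, max(d).
    """
    if max_degree == 1:
        return "dot" if area_px <= area_threshold_px else "rod"
    return "network"   # the 'otherwise' branch of the rule
```

For example, an unbranched structure of 90 pixels is labelled a dot, an unbranched structure of 400 pixels a rod, and any structure with a junction a network.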

The statistics of the frequency of occurrences of different types of mitochondria (dot, rod and network) and area are presented in Fig. 4b as violin plots. There are 30 cells in our live-cell dataset and the mean and standard deviation are calculated for each cell. In a similar manner, the vesicles are classified into two categories: large and small. The segmented vesicles are fitted inside circles. The vesicles are classified using a heuristic threshold of the radius (r) as

$$\mathrm{Class}_{\mathrm{vesicle}}=\begin{cases}\mathrm{Small}, & \mathrm{if}\ r\le 150\ \mathrm{nm}\\ \mathrm{Large}, & \mathrm{otherwise}\end{cases}$$
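A corresponding sketch for vesicles is given below. The conversion from a radius measured in pixels assumes a pixel size of 100 nm, in line with the approximate value quoted in the main text; the actual pixel size depends on the microscope, and the function names are illustrative.

```python
def classify_vesicle(radius_nm, threshold_nm=150.0):
    """Classify a vesicle as 'small' or 'large' from its fitted radius in nm."""
    return "small" if radius_nm <= threshold_nm else "large"

def classify_vesicle_px(radius_px, pixel_size_nm=100.0, threshold_nm=150.0):
    """Same rule for a radius in pixels; pixel_size_nm is an assumed value."""
    return classify_vesicle(radius_px * pixel_size_nm, threshold_nm)
```

With these defaults, a fitted radius of 1 pixel (100 nm) is classified as small and a radius of 2 pixels (200 nm) as large.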

A complex graph-based connectivity analysis is also explored by converting the segmented images into skeletons and graphs. After obtaining the graphs, the nodes are classified as shown in Fig. 4c. If a graph contains junction nodes, it is a network. The analytics show more nuanced information about networks through the endpoint–junction lengths and junction–junction lengths.
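The node classification can be sketched from a list of skeleton branches, each given as a pair of node identifiers. Treating every node of degree greater than 1 as a junction is an assumption chosen to agree with the max(d) = 1 condition in the classification rule; the function names are illustrative.

```python
from collections import Counter

def node_degrees(branches):
    """Degree of each node: the number of branch ends attached to it."""
    degree = Counter()
    for u, v in branches:
        degree[u] += 1
        degree[v] += 1
    return dict(degree)

def classify_nodes(branches):
    """Label each node as 'endpoint' (degree 1) or 'junction' (degree > 1)."""
    return {node: ("endpoint" if d == 1 else "junction")
            for node, d in node_degrees(branches).items()}

def is_network(branches):
    """A graph is a network if it contains at least one junction node."""
    return "junction" in classify_nodes(branches).values()

def branch_type(branch, labels):
    """Name a branch by the labels of its two end nodes, e.g. 'endpoint-junction'."""
    u, v = branch
    return "-".join(sorted((labels[u], labels[v])))

rod = [("a", "b")]
y_shape = [("a", "j"), ("b", "j"), ("c", "j")]
labels = classify_nodes(y_shape)
```

Grouping branch lengths by the output of `branch_type` gives the endpoint–junction and junction–junction length statistics mentioned above.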

Indeed, the classification can be performed using simple rules such as those used here; alternatively, fuzzy rules, more elaborate rules or even deep learning approaches may be employed for morphological classification, depending on the needs of the application.

### Tracking of mitochondria and vesicles

First, the proposed U-Net-based segmentation is used to segment the subcellular structures. Then, the Kalman filter and Hungarian algorithm24 are employed to track individual structures over time25.
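The two ingredients of such a tracker can be illustrated with a deliberately simplified sketch. It uses a scalar random-walk Kalman filter for one coordinate of a centroid, and it replaces the Hungarian algorithm with a brute-force search over permutations. The brute-force search finds the same minimum-cost assignment but is practical only for a handful of objects. The class and function names, the noise parameters and the Euclidean cost are assumptions for illustration and do not reproduce the implementation of refs. 24 and 25.

```python
import math
from itertools import permutations

class ScalarKalman:
    """Random-walk Kalman filter for one coordinate of a centroid."""

    def __init__(self, x0, p0=1.0, q=0.1, r=1.0):
        self.x = x0   # state estimate
        self.p = p0   # estimate variance
        self.q = q    # process-noise variance (assumed)
        self.r = r    # measurement-noise variance (assumed)

    def predict(self):
        self.p += self.q          # the state is unchanged under a random walk
        return self.x

    def update(self, z):
        k = self.p / (self.p + self.r)    # Kalman gain
        self.x += k * (z - self.x)
        self.p *= (1.0 - k)
        return self.x

def assign(predicted, detections):
    """Minimum-total-distance matching of tracks to detections.

    Brute-force stand-in for the Hungarian algorithm. Returns a tuple whose
    i-th entry is the index of the detection matched to track i.
    """
    if len(predicted) > len(detections):
        raise ValueError("this sketch assumes at least as many detections as tracks")
    best_cost, best = math.inf, ()
    for perm in permutations(range(len(detections)), len(predicted)):
        cost = sum(math.hypot(p[0] - detections[j][0], p[1] - detections[j][1])
                   for p, j in zip(predicted, perm))
        if cost < best_cost:
            best_cost, best = cost, perm
    return best

matches = assign([(0.0, 0.0), (10.0, 10.0)],
                 [(9.0, 11.0), (1.0, 0.0), (50.0, 50.0)])
```

In a full tracker, each track would hold one filter per coordinate (or a multivariate filter), predict its position in the next frame, be matched to the segmented centroids with `assign`, and then be updated with the matched measurement.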

### Microscopes and imaging parameters

Three different microscopes were used in this work (Supplementary Table 2). The first, microscope Epi 1, is a GE DeltaVision Elite microscope and was used for datasets LiveEpi1RatMitoRed and LiveEpi1RatVesiFarRed. The exposure time for imaging the vesicles and mitochondria was 10 ms. The acquisition rate was 50 frames per second. The acquisition was performed in sequential mode. The LiveEpi1HuMitoGreen dataset was also recorded with this microscope. The second, microscope Epi 2, is a Zeiss CellDiscoverer 7 with a Plan-Apochromat ×50 water objective and an NA of 1.2. The LiveEpi2RatMitoGreen, LiveEpi2RatMitoRed and LiveEpi2RatVesiFarRed datasets were recorded with this microscope. The third, microscope Airy 1, is a Zeiss LSM 880 ELYRA with a C Plan-Apochromat ×63 oil objective with an NA of 1.4. The FixedAiry1MitoGreen, FixedAiry1MitoRed and FixedAiry1RatVesiBlue datasets were recorded using this microscope.

### Cell culture and imaging conditions for the live-cell datasets LiveEpi1RatMitoRed and LiveEpi1RatVesiFarRed

Cells of the rat cardiomyoblast cell line H9c2 (derived from embryonic heart tissue; Sigma-Aldrich) were cultured in high-glucose (4.5 g l⁻¹) Dulbecco’s modified Eagle medium (DMEM) with 10% fetal bovine serum (FBS). The cells were transiently transfected using TransIT-LT1 (Mirus) to express the mitochondrial fluorescence marker mCherry-OMP25-TM (emission maximum at 610 nm). After 24 h of transfection, the cells were incubated in serum-free DMEM for 4 h and then the medium was changed to DMEM with 2% serum just before treatment for 1 h (see below). After treatment, the medium was changed back to DMEM with 10% FBS. The cells were divided into three pools: normal, hypoxia and hypoxia-ADM. For the normal (control) pool, the cells were kept under normal cell growth conditions at 37 °C with about 21% O2 and 5% CO2. For the hypoxia pool, the cells were subjected to hypoxia (deficiency of oxygen; 0.3% O2 level) by incubation in a hypoxic cell incubator for 60 min. For the hypoxia-ADM pool, the cells were subjected to hypoxia in the same way, but were simultaneously treated with the peptide hormone adrenomedullin (ADM) at a concentration of 10⁻⁶ M. This hormone has been found to exhibit protective functions under various pathological conditions, such as ischaemia in heart cells during myocardial infarction. The cells were labelled using the live-cell-friendly fluorescent membrane marker mCLING-ATTO647N immediately before imaging, at a dilution of 1:2,000 with a 12-min incubation time. After incubation, the medium was replaced with cell-culture medium (DMEM with 10% FBS) for time-lapse microscopy at 37 °C, atmospheric oxygen (that is, the cells in the hypoxia and hypoxia-ADM pools were no longer in an oxygen-deficient condition) and 5% CO2. The membrane marker was quickly internalized by the cells and labelled small membrane-bound vesicles in the cells. This membrane marker exhibits a fluorescence emission maximum at a wavelength of 662 nm. The mitochondrial marker mCherry-OMP25-TM and the membrane marker mCLING-ATTO647N were imaged sequentially in separate colour channels using epifluorescence microscope Epi 1.

### Cell culture and imaging conditions for the live-cell dataset LiveEpi1HuMitoGreen

MCC13 cells were maintained in an incubator at 37 °C with 20% O2 and 5% CO2, with a growth medium consisting of RPMI 1640 (Sigma-Aldrich) supplemented with 10% FBS (Sigma-Aldrich) and 1% penicillin/streptomycin (Sigma-Aldrich). The cultures used for experiments were thawed from stocks stored in liquid nitrogen a minimum of one week before labelling and imaging.

Labelling with CellLight Mitochondria-RFP BacMam 2.0 (Thermo Fisher Scientific) was carried out according to the manufacturer’s protocol with 15 to 45 particles per cell (PPC) ~20 h before imaging. Transduced cells were grown under the same cell growth conditions as described above but in antibiotic-free medium.

Immediately before imaging, the cells were incubated with MitoTracker Deep Red (Thermo Fisher Scientific) for 30 min, then washed in phosphate-buffered saline or live-cell imaging medium.

These cells were imaged without the use of a microscope incubation system at room temperature (~25 °C), but in pre-heated (37 °C) live-cell imaging solution (Thermo Fisher Scientific).

### Cell culture and imaging conditions for the live-cell datasets LiveEpi2RatMitoGreen, LiveEpi2RatMitoRed and LiveEpi2RatVesiFarRed

H9c2 rat cardiomyoblast cells (described above), genetically modified using a retrovirus to stably express tandem-tagged (mCherry-EGFP) mitochondrial outer membrane protein 25 (OMP25) transmembrane domain (TM), were used. Comparable fluorescence intensity across cells was achieved through flow cytometry sorting. The cells were cultured in high-glucose DMEM with 10% FBS or in medium for glucose deprivation and galactose adaptation. The glucose deprivation medium consisted of DMEM without glucose (11966-025, Gibco) supplemented with 2 mM L-glutamine, 1 mM sodium pyruvate, 10 mM galactose, 10% FBS, 1% streptomycin/penicillin and 1 μg ml⁻¹ of puromycin (InvivoGen, ant-pr-1). The cells were adapted to galactose for a minimum of seven days before experiments. The cells were seeded on MatTek dishes (P35G-1.5-14-C, MatTek Corporation) and imaged when they reached ~80% confluency. Labelling of lysosomes (acidic endosomal system) was carried out by treating cells for 30 min with 50 nM Lysotracker Deep Red (cat. no. L12492, Thermo Fisher) according to the manufacturer’s recommendation. After labelling, the medium was replaced with fresh medium (described above) for live-cell microscopy. The cells were imaged at 37 °C with atmospheric oxygen and 5% CO2. Imaging was performed using the Epi 2 microscope. For the live-cell imaging, selected positions were imaged for a duration of 10 min each, with one frame being taken every 5 s, giving 120 frames. Each frame consisted of a seven-slice z-stack with a 0.31-μm interval between slices.

### Cell culture and imaging conditions for the fixed-cell datasets FixedAiry1MitoGreen, FixedAiry1MitoRed and FixedAiry1RatVesiBlue

mCherry-EGFP-OMP25-TM H9c2 cells were seeded on glass coverslips (1.5). The cell growth conditions and labelling were the same as used for the live-cell datasets in the previous paragraph (datasets 4, 5 and 6 in Supplementary Fig. 4). After labelling, the coverslips were washed once in phosphate-buffered saline, then fixed using 4% paraformaldehyde and 0.2% glutaraldehyde for 20 min at 37 °C. The coverslips were then washed in phosphate-buffered saline and mounted using ProLong Glass antifade mountant (cat. no. P36980, Thermo Fisher). Imaging was performed using the Airy 1 microscope. Airyscan images were taken of regions of interest. All Airyscan images were processed using the ‘Airyscan Processing’ method of the LSM 880 Zen software package.

### Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.