Deep-Learning-Based Segmentation of Small Extracellular Vesicles in Transmission Electron Microscopy Images

Small extracellular vesicles (sEVs) are cell-derived vesicles of nanoscale size (~30–200 nm) that function as conveyors of information between cells, reflecting their cell of origin and its physiological condition in their content. Valuable information on the shape and even on the composition of individual sEVs can be recorded using transmission electron microscopy (TEM). Unfortunately, sample preparation for TEM image acquisition is a complex procedure, which often leads to noisy images and renders automatic quantification of sEVs an extremely difficult task. We present a completely deep-learning-based pipeline for the segmentation of sEVs in TEM images. Our method applies a residual convolutional neural network to obtain fine masks and uses the Radon transform to split clustered sEVs. Using three manually annotated datasets that cover the natural variability typical of sEV studies, we show that the proposed method outperforms two different state-of-the-art approaches in terms of detection and segmentation performance. Furthermore, the diameter and roundness of the segmented vesicles are estimated with an error of less than 10%, which supports the high potential of our method in biological applications.


Materials
Sup. Figure S1 shows an example of the images included in each dataset. Sup. Figure S2 illustrates the residual layers composing the contracting path (Sup. Figure S2(a)), the connection step (Sup. Figure S2(b)), and the expanding path (Sup. Figure S2(c)) of the network.

Sup. Figure S2. Illustration of the residual layers composing the Fully Residual U-Net architecture: (a) contracting residual layer; (b) last residual layer in the contracting path, where the number of feature maps is not doubled; (c) expanding residual layer, where the feature maps are reduced in both branches of the block: in the residual branch, feature maps are reduced by half in the first 3 × 3 convolutional layer, and in the other branch, channels are also reduced by half, but with a 1 × 1 convolutional layer. ELU: Exponential Linear Unit activation function. DROPOUT: Dropout layer. CONV: Set of convolutions.
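As a rough illustration of the expanding residual layer described in the caption, the following Keras sketch halves the feature maps in both branches; the exact layer ordering, ELU/dropout placement, and dropout rate are assumptions, not taken from the paper:

```python
import tensorflow as tf
from tensorflow.keras import layers

def expanding_residual_layer(x, channels_in):
    """Sketch of the expanding residual layer (Sup. Figure S2(c)).

    Both branches halve the feature maps: the residual branch in its first
    3x3 convolution, the shortcut branch with a 1x1 convolution.
    """
    channels_out = channels_in // 2
    r = layers.Conv2D(channels_out, 3, padding="same")(x)   # halves channels
    r = layers.Activation("elu")(r)
    r = layers.Dropout(0.2)(r)                              # assumed rate
    r = layers.Conv2D(channels_out, 3, padding="same")(r)
    s = layers.Conv2D(channels_out, 1, padding="same")(x)   # 1x1 shortcut
    return layers.Activation("elu")(layers.Add()([r, s]))

# Usage example:
# inputs = tf.keras.Input((400, 400, 64))
# y = expanding_residual_layer(inputs, 64)
```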

Data augmentation
The FRU-Net is iteratively trained with both real and augmented data 1 . Transformations are iteratively applied to the patches 200 times in batches of size 10. The FRU-Net input size is set to 400 × 400 pixels, so patches of 500 × 500 pixels are extracted first; after data augmentation, their borders are cropped to the desired size to avoid the inclusion of spurious objects in the training (see Sup. Figure S3). The procedure to obtain the patches is as follows: if the resized image is smaller than 500 × 500 pixels, the image borders are augmented by mirroring until the desired size is reached; otherwise, the image is split into patches of 500 × 500 pixels with an overlap of 125 pixels on each side. See Sup. Figure S3 for further details.
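The patch-extraction rule above can be sketched as follows. This is a minimal illustration assuming single-channel images, not the authors' implementation; the function name and edge-clamping behavior are assumptions:

```python
import numpy as np

def extract_patches(image, patch=500, overlap=125):
    """Sketch of the patch-extraction procedure (single-channel image).

    Images smaller than `patch` x `patch` are grown by mirroring their
    borders; larger images are split into overlapping patches sharing
    `overlap` pixels on each side (edge patches are clamped to the image).
    """
    h, w = image.shape
    if h < patch or w < patch:
        pad_h, pad_w = max(0, patch - h), max(0, patch - w)
        image = np.pad(image,
                       ((pad_h // 2, pad_h - pad_h // 2),
                        (pad_w // 2, pad_w - pad_w // 2)),
                       mode="reflect")  # mirror the borders
        return [image[:patch, :patch]]
    step = patch - overlap
    patches = []
    for y in range(0, h - overlap, step):
        for x in range(0, w - overlap, step):
            # Clamp the last row/column of patches to stay inside the image.
            y0, x0 = min(y, h - patch), min(x, w - patch)
            patches.append(image[y0:y0 + patch, x0:x0 + patch])
    return patches
```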

Reconstruction of probability maps
In the proposed method, every image is rescaled and split into patches of 400 × 400 pixels with an overlap of 125 pixels on each side (see Sup. Figure S3). Each of these patches is an input to the presented Fully Residual U-Net (FRU-Net), and the output is a probability map of the same size. The rescaled probability map is reconstructed by taking into account the overlap between neighboring patches. Each pixel in a patch corresponds to a specific location in the rescaled image. If that pixel does not appear in any other patch, its value is assigned directly to that location in the rescaled probability map; otherwise, the average over all patch values mapping to that location is taken. Sup. Figure S3(c) shows a numerical representation of a template for an image of 950 × 950 pixels and patches of 400 × 400 pixels; the numbers inside the template correspond to the weights assigned to pixel values lying within specific regions when the rescaled probability map is reconstructed.
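The averaging rule can be written compactly by accumulating values and visit counts per pixel, which reproduces the weights of the template in Sup. Figure S3(c). A minimal sketch (function and argument names are illustrative, not the authors' code):

```python
import numpy as np

def reconstruct(patch_maps, positions, out_shape, patch=400):
    """Sketch of the probability-map reconstruction described above.

    patch_maps: list of (patch x patch) network outputs.
    positions:  list of (y, x) top-left coordinates in the rescaled image.
    Overlapping pixels are averaged by accumulating values and counts.
    """
    acc = np.zeros(out_shape, dtype=np.float64)  # summed probabilities
    cnt = np.zeros(out_shape, dtype=np.float64)  # number of covering patches
    for pm, (y, x) in zip(patch_maps, positions):
        acc[y:y + patch, x:x + patch] += pm
        cnt[y:y + patch, x:x + patch] += 1.0
    return acc / np.maximum(cnt, 1.0)  # average where patches overlap
```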
Probability map post-processing: cluster splitting
The line that splits two touching rounded objects is represented by a hole in their sinogram (see Sup. Figure S4). Therefore, each connected component $CC_i$ in the masks is post-processed as follows (a code sketch of steps I-IV is given after this list):
I. Obtain the sinogram $S_i$ by applying the Radon transform to the bounding box of the connected component $CC_i$.
II. Enhance the contrast of $S_i$ by applying white and black top-hat transforms,
$$\hat{S}_i = S_i + WT(S_i) - BT(S_i),$$
where $WT$ and $BT$ are the white and black top-hat filters, respectively, configured with a sufficiently large disk-shaped kernel (a circle of 10 pixels in radius).
III. Obtain a binary mask $b_i$ by applying Otsu thresholding 2 .
IV. Obtain the sinogram's local minimum $m_{ij}$ within each hole of the binary mask $b_i$.
V. Reconstruct the line represented by the local minimum $m_{ij}$ and intersect it with the connected component $CC_i$.
VI. Undo the splits that produce overly small fragments, taking into account the size range of sEVs.
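A minimal scikit-image sketch of steps I-IV follows. The back-projection of each minimum into an image-space split line (step V) and the size check of step VI are omitted; all names here are illustrative rather than the authors' code, and only the disk radius is taken from the text:

```python
import numpy as np
from scipy.ndimage import label
from skimage.filters import threshold_otsu
from skimage.morphology import disk, white_tophat, black_tophat
from skimage.transform import radon

def split_candidates(cc_mask):
    """Find candidate split lines for one connected component (steps I-IV)."""
    theta = np.linspace(0.0, 180.0, 180, endpoint=False)
    sinogram = radon(cc_mask.astype(float), theta=theta, circle=False)  # I

    footprint = disk(10)  # disk of 10-pixel radius, as in step II
    enhanced = (sinogram + white_tophat(sinogram, footprint)
                - black_tophat(sinogram, footprint))                    # II

    binary = enhanced > threshold_otsu(enhanced)                        # III

    holes, n = label(~binary)
    minima = []
    for h in range(1, n + 1):
        region = holes == h
        # Skip background components touching the sinogram border;
        # only interior holes encode split lines.
        if (region[0].any() or region[-1].any()
                or region[:, 0].any() or region[:, -1].any()):
            continue
        rs, cs = np.nonzero(region)
        k = np.argmin(enhanced[rs, cs])                                 # IV
        minima.append((rs[k], theta[cs[k]]))  # (offset, angle in degrees)
    # Steps V-VI (line reconstruction and undoing too-small splits)
    # would consume these minima.
    return minima
```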

Reference methods
In TEM ExosomeAnalyzer 3 , TEM images are preprocessed with an edge-enhancing diffusion filter 4 , for which the contrast parameter λ needs to be specified (default: 0.005). Then, a gradient magnitude image is calculated and gradual edge growing is applied to obtain the enclosed objects that serve as seeds for morphological watershed processing. Candidate seeds are filtered by shape using a morphological opening with a disk structuring element, as sEVs are assumed to be circular. The size of the disk, α, is a fraction of the expected sEV size (default: 0.15). The U-Net 5, 6 is a fully convolutional network that consists of contracting and expanding paths. Its architecture is arranged with the same combination of parameters as the FRU-Net to facilitate the comparison between the outputs of both approaches. Therefore, the contracting path in the U-Net starts with a convolutional layer of 32 channels. The loss and optimizer functions are the same as for the FRU-Net. The validation dataset (i.e., 10% of the real patches before data augmentation is computed) is used to reduce the learning rate and optimize the U-Net's performance as follows: (1) while training the network, the algorithm stores the minimum value of the loss function on the validation dataset, min-loss; (2) if the loss value decreases during the next K epochs, min-loss is updated; otherwise, the learning rate is reduced.
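This schedule amounts to the usual reduce-on-plateau rule. A minimal sketch under stated assumptions (the patience K and the reduction factor are not fixed by the text, and the function name is illustrative):

```python
def update_learning_rate(val_losses, lr, patience=10, factor=0.5):
    """Reduce-on-plateau rule sketched from the description above.

    val_losses: validation-loss history, one entry per epoch so far.
    If min-loss has not been updated within the last `patience` epochs,
    the learning rate is reduced by `factor`.
    """
    if not val_losses:
        return lr
    best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
    if len(val_losses) - 1 - best_epoch >= patience:
        return lr * factor
    return lr
```

In Keras, the same behavior is provided by the ReduceLROnPlateau callback monitoring the validation loss.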
The output probability maps of the U-Net are post-processed following the same steps as the FRU-Net outputs: the reconstructed probability maps are thresholded, clusters are split using the Radon transform, the resulting labelled mask is rescaled to the original size and, finally, the detected vesicles' borders are smoothed.

Visual summary of the obtained results
In this section, we illustrate the performance of the FRU-Net and provide a comparison with two state-of-the-art methods.
• Sup. Table S1 provides the data distribution used for training and testing the supervised methods (U-Net and FRU-Net).
• In Sup. Figure S5, a comparison of the morphological parameters obtained after the FRU-Net processing with those measured on the ground truth is presented.

• In Sup. Figure S6, the detection performance of each method on each dataset is illustrated.
• Sup. Figure S7 and Sup. Table S2 report the distributions of the Jaccard coefficient for the correctly detected sEVs (SEG*).

Sup. Table S1. Distribution of the datasets used for training and testing of the compared methods.
Although the estimated mean values of the vesicle diameters and roundness indices are slightly biased with respect to the ground truth when using the FRU-Net, the estimated distributions are not significantly different, as supported by the p-values of the Wilcoxon rank-sum test at the 5% significance level. Moreover, when only correctly segmented vesicles are evaluated, both distributions are almost identical. See Sup. Figure S5.
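The reported comparison corresponds to a two-sample Wilcoxon rank-sum test at the 5% significance level; a minimal scipy sketch follows, where the diameter arrays are hypothetical placeholders rather than measured data:

```python
import numpy as np
from scipy.stats import ranksums

# Hypothetical per-vesicle diameters (nm) from the ground truth and from
# the FRU-Net segmentation; substitute the measured values.
diameters_gt = np.array([95.0, 110.0, 87.5, 102.0, 130.0])
diameters_fru = np.array([92.0, 108.0, 90.0, 99.5, 127.0])

stat, p_value = ranksums(diameters_gt, diameters_fru)
# If p_value >= 0.05, the distributions are not significantly
# different at the 5% significance level.
print(f"p = {p_value:.3f}")
```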
TEM ExosomeAnalyzer is the method with the lowest number of false positives, followed by the FRU-Net and, finally, the U-Net, which performs worst in this respect. Both deep-learning models yield worse results when trained with Dataset 2 (i.e., FRU2 and U2 are the models with the highest ratio of false positives), whereas TEM ExosomeAnalyzer performs similarly on Datasets 2 and 3. TEM ExosomeAnalyzer is more sensitive to a higher density of vesicles, whereas the deep-learning models strongly depend on the heterogeneity of the training data. See Sup. Figure S6.

Sup. Table S2. Statistics of the Jaccard coefficient distributions of the correctly detected sEVs (SEG*).
As shown in Table 1 in the main manuscript, the FRU-Net is the most accurate segmentation method (highest SEG in all cases). However, when SEG* is analyzed, the difference between the tested methods is much smaller. In particular, the main difference when comparing the FRU-Net with the other methods is that it accurately segments most of the correctly detected sEVs (true positives): its SEG* quartiles are substantially higher than those of the other methods. See Sup. Table S2.

Sup. Figure S6. Graphical representation of false-positive detections. In black, the number of objects detected by each method; in gray, the number of false positives.

Evaluation of execution time
This section provides a comparison between the time required to annotate TEM images manually and the time needed, first, to train a deep-learning model and, second, to analyze new images with the trained model. The input parameters needed for the estimation of the execution time can be found in Sup. Table S3.

Sup. Figure S7. Violin plots of the Jaccard coefficient for the correctly detected vesicles (SEG*). Dashed lines in the plots represent the three quartiles, as detailed in Sup. Table S2.
The averaged curation time per vesicle is $NTCF_{EA} = 17.63$ s (Equation 2) when the curation is performed semi-automatically using TEM ExosomeAnalyzer.

Sup. Table S3. Definition of the input parameters used to estimate the required execution and manual annotation times.
The averaged curation times $NTCF_M$, $NTCF_{EA}$ and $NTCEA_M$ lead us to the following conclusions:
• Without the need for training a new model, it is always (i.e., for any positive number of sEVs) faster to analyze images using FRU3 (running on either GPU or CPU) and curate the results, whether manually or semi-automatically using TEM ExosomeAnalyzer, than to annotate the images manually; that is, Equations 4 and 5 hold.
• Taking into account the 15 hours of training time for FRU3 (on GPU), the analysis of images using FRU3 and the subsequent curation of the results is faster than manually annotating the images:

- If more than 825/851 sEVs are analyzed (FRU3 running on GPU/CPU) and the curation is performed manually (solution for $x_M$ in Equation 6):
$$x_M \, NT_M \le x_M \,(NT_F + NTCF_M) + T \quad (6)$$
- If more than 758/780 sEVs are analyzed (FRU3 running on GPU/CPU) and the curation is performed semi-automatically using TEM ExosomeAnalyzer (solution for $x_{EA}$ in Equation 7):
$$x_{EA} \, NT_M \le x_{EA} \,(NT_F + NTCF_{EA}) + T \quad (7)$$
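For clarity, the break-even points follow from these inequalities by a one-line rearrangement, assuming that per-vesicle manual annotation is slower than FRU3 analysis plus curation, i.e., $NT_M > NT_F + NTCF_M$:

$$x_M \, NT_M \le x_M \,(NT_F + NTCF_M) + T \;\Longleftrightarrow\; x_M \le \frac{T}{NT_M - NT_F - NTCF_M},$$

so manual annotation remains faster only up to $T/(NT_M - NT_F - NTCF_M)$ vesicles. Beyond that threshold (825/851 sEVs for GPU/CPU in Equation 6, and analogously 758/780 in Equation 7 with $NTCF_{EA}$ in place of $NTCF_M$), the FRU3-based workflow is faster even when the training time $T$ is included.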