Main

Cryo-electron tomography (cryo-ET) produces three-dimensional (3D) snapshots of cellular landscapes at molecular resolution, making it possible to investigate structural and functional states of macromolecular complexes in their native environment, and to unveil how different macromolecular populations interact with cellular structures1. With improved instrumentation, sample preparation protocols, and automation, high-quality in-cell cryo-ET data are generated at increasingly high-throughput2,3,4. A prerequisite for subsequent structural analysis is the reliable identification of a relatively homogenous set of macromolecular complexes5,6. However, owing to the complex and crowded nature of the intracellular milieu, and limitations arising from cryo-ET image acquisition (low signal-to-noise ratio (SNR) and incomplete angular sampling), such data mining of 3D cryo-ET volumes remains a major bottleneck.

A range of available semi-automated methods for segmentation of cellular structures and localization of macromolecular complexes (from here on, particles) in cryo-ET datasets are broadly classified as template-based and template-free approaches. Template Matching7 is a commonly applied computational approach and is based on a point-wise numerical computation of a similarity coefficient (cross correlation) to a known template of the complex in question. It is accurate in the localization of large structures, but fails at identifying smaller or less dense particles8, and is computationally intensive. Current template-free methods using classical image processing are designed for, and thus limited to, specific molecular configurations in which particles are associated with large cellular structures such as membranes or microtubules9,10,11. Both these methods typically require manual inspection and are therefore laborious and time-consuming. The advent of deep-learning methods, particularly convolutional neural networks (CNNs)12, enabled more general and automated approaches for segmentation and particle localization in cryo-ET. The first of such approaches was a two-dimensional (2D) CNN that performs semantic segmentation of large structures, such as ribosomes or membranes13. However, its 2D nature makes it less suitable for particle localization, where probing the 3D structure becomes beneficial. More recently, DeepFinder, a fully supervised 3D CNN method that is based on the U-Net architecture14 for multi-class semantic segmentation, has positioned itself as the state-of-the-art in automated particle localization in both simulated and real cryo-ET datasets15,16. Remaining limitations include the localization of less prevalent particles, and the interpretation of the obtained predictions within their cellular context.

Here, we present DeePiCt (deep picker in context), an open-source software that synergizes supervised convolutional networks for segmentation of cellular compartments (organelles or cytosol) and structures (membranes or cytoskeletal filaments), and localization of particles. We generated a set of comprehensively expert-annotated tomograms acquired on cryo-focused ion beam (cryo-FIB) lamellae of wild-type S. pombe for training and benchmarking of our method, which we openly provide to overcome the critical limitations arising from the absence of publicly available annotated experimental datasets. From here on, we refer to the term ‘network’ as the deep-learning algorithm itself and ‘model’ as the algorithm once already trained. We provide DeePiCt models, trained on experimental cryo-ET data, which show high data-mining performance and can be readily applied across species and datasets.

Results

DeePiCt for automated segmentation and particle localization

DeePiCt is based on deep-learning technologies. It combines a 2D CNN for segmentation of cellular compartments that are easily recognized in 2D, and a 3D CNN for particle localization and annotation of continuous structures, such as membranes and cytoskeletal filaments, that benefit from 3D information (Fig. 1). This synergy enables more precise particle picking and interpretation of their cellular context. The 2D and 3D CNNs are adapted from the original U-Net architecture14 (Fig. 1a and Supplementary Note 1). U-Nets have become the standard for much of modern deep learning, with enormous success beyond segmentation, including in denoising17 and reconstruction18 methods for cryo-ET. Here, U-Nets are a natural choice for our dual purpose of addressing segmentation and detection (achieved through post-processing steps) since they involve less trainable parameters than architectures designed ad hoc for object detection19. The 2D CNN employs a fixed depth of 5 (4 max-pooling layers) and 16 initial filters (Fig. 1a). The 3D CNN allows multi-label learning (Supplementary Note 1) and adjustable architectural parameters for depth, number of initial filters, a batch normalization layer, and the dropout parameter in the encoder and decoder paths (Fig. 1a). These parameters can be set according to the quality and amount of training data, and size, abundance, and shape complexity of the particle of interest. In general, larger particles benefit from larger depth to increase the receptive field of the network, and particles with a low-density print (low SNR with respect to the surrounding context) require more initial filters (Supplementary Table 1). The remaining optional layers (batch normalization and dropout) are well-known techniques in computer vision to avoid overfitting20.

Fig. 1: DeePiCt 2D and 3D CNN architecture implemented in an automated workflow combining compartment segmentations and particle localizations in cryo-ET data.
figure 1

a, CNN U-Net architecture: the 2D network performs all tensor operations on the 2D spatial coordinates, with depth (D) = 5 and number of initial features (IF) = 16; the 3D network performs tensor operations on the 3D spatial dimensions; architectural hyperparameters in red (D, IF, batch normalization layers (BN)) can be set by the user. be, The DeePiCt pipeline is used to train and predict various structural features in cryo-ET. b, The DeePiCt pipeline consists of two independent CNNs: a 2D network for compartment segmentation and a 3D network for particle localization. c, Trained networks are applied to input tomograms, which can be pre-processed with a spectrum-matching filter to improve image contrast. The example 2D tomographic slice visualizes the cytoplasm with the endoplasmic reticulum (ER), vacuoles (V), nucleus and extracellular space (ECS). d, DeePiCt raw predictions for cytosol (from the 2D CNN, top), membranes, ribosomes and fatty acid synthase (FAS; from the 3D CNN, bottom) are post-processed by thresholding, cluster size, and centroid fitting. e, The outputs of the two networks can be combined to include the cellular context by intersecting particle predictions with cytosol masks (top), selecting particles (NPC, green) in contact with specific membranes (nuclear envelope, purple, middle), and to identify particles (for example, ribosomes, orange) associated with specific organelles (for example, ER).

For training, each network requires tomograms and corresponding 3D binary segmentation masks of the structures of interest, for example organelle segmentations for the 2D U-Net and spheres representing particles for the 3D U-Net (Fig. 1b). The raw input tomograms are optionally pre-processed using an amplitude spectrum equalization filter to enhance image contrast (Fig. 1c and Extended Data Fig. 1). Both CNNs use the Adam optimizer algorithm and offer a choice of two loss functions (Dice and Generalized Dice loss) for training (Supplementary Note 1 and Supplementary Fig. 1). During training of the 2D network, tiles are randomly flipped and rotated in 90-degree increments to improve generalization. For the 3D CNN, we implemented a number of optional random transformations to the input images for data augmentation (Supplementary Note 1). In our experience, the 2D CNNs require about 6 fully segmented tomograms for training, while the 3D CNNs require about 5 tomograms for membrane segmentation, and a minimum of 300 annotated instances for particle learning independent of particle sparsity in the cellular volumes (Supplementary Fig. 2).

For predicting, the trained networks receive unseen tomograms pre-processed as the training data and output 3D probability maps that are subsequently automatically post-processed (Fig. 1d). Specifically, the predicted slices outputted by the 2D network are combined into a 3D volume, smoothened along the z-axis by applying a one-dimensional Gaussian filter, and thresholded (user-definable, default = 0.75) to generate a binary 3D map (Extended Data Fig. 2a–c). The output of the 3D CNN is thresholded at a user-defined probability value, followed by clustering, to generate a binary segmentation map. The clustered output can be integrated with contextual information from a binary map representing a tomographic region (for example, the cytosol segmentation from the 2D CNN) to reduce false positives. The mode of integration can be chosen among three different options: intersection, contact, or colocalization, depending on the users’ specific application (Fig. 1e and Supplementary Fig. 3). For particle localization, a list of coordinates is generated from the clusters’ centroids. For segmentations of continuous structures, such as cellular filaments, coordinates can be sampled along the segmentation centerline at a chosen spacing and exported; the particle orientations and structural features can then be obtained by subtomogram analysis in external software (for example, Warp21, M22, RELION23, Dynamo24, or EMAN225). For more details on the method, including post-processing and performance evaluation (Supplementary Fig. 4), we refer to Supplementary Note 1 and our Github repository (https://github.com/ZauggGroup/DeePiCt).

Generation of ground truth annotations in S. pombe

Developing and benchmarking deep neural networks require ground truth non-synthetic dataset. To this end, we created a comprehensive annotation of cellular features in tomograms acquired from wild-type S. pombe (Methods), representing diverse structures, particle sizes, and abundances.

We devised an iterative workflow combining template matching, DeePiCt, and manual picking, to annotate ribosomes, fatty acid synthases (FAS), membranes, organelles, and the cytoplasm in ten tomograms acquired by combining defocus and a Volta potential phase plate (VPP) and ten defocus-only tomograms (defocus) (Fig. 2 and Supplementary Tables 24). For annotating the nuclear pore complex (NPC), an additional dataset of 127 tomograms (denoted by defocus*) featuring ~354 NPCs26 was used to ensure enough training data for this large, low abundance (on average three per tomogram), and structurally flexible complex. For a detailed description of the ground truth construction see Methods.

Fig. 2: Iterative comprehensive annotation of ground truth for non-synthetic data.
figure 2

ah, The three columns (left to right) show: the annotation process for ribosomes (a,d,g), FAS (b,e,h), and membranes (c,f). af Show the cumulative predictions of three rounds of DeePiCt for ribosomes and one round for FAS as cross-section overlaid on a representative tomographic slice (ac; n = 20 tomograms) and in 3D view (df). Particles are classified as true positives recovered from the initial annotation (TPs recovered; yellow from template matching (TM) for ribosomes; green from manual annotation for FAS), new true positives provided by DeePiCt predictions (TPs new; blue), and false positives (FPs; red). g,h, 3D views of the resulting ground truth, including false negatives (FNs) distinguishing unrecovered FNs from the initial annotations (pink) and final round of manual picking (salmon). i, Combined ground truth annotation of ribosomes, FAS, membranes, and NPC (green cylindrical mask, manual annotation). j,k, Relative contributions of the DeePiCt rounds for ribosome (j) and FAS (k) identification are plotted across ten VPP and ten defocus S. pombe tomograms: contribution of step 1 (yellow: TM and manual annotation for ribosomes; green: manual annotation for FAS; salmon: initial annotations not detected by DeePiCt), step 2 (blue); and step 3 (pink bars). l,m, Subtomogram averages of all ribosomes, exclusively DeePiCt-detected (TPs new; blue) and all FAS particles in VPP and defocus ground truth.

In the ten VPP tomograms, we annotated a total of 25,311 ribosomes and 731 FAS. For the ribosomes, 61.6% (1,559 particles per tomogram (p.p.t.) on average) were found by initial template matching, 21% (541 p.p.t.) were added in iterative rounds of DeePiCt on the incomplete ground truth, and 17% (431 p.p.t.) by final manual annotations (Fig. 2j, Extended Data Fig. 3a,e, and Supplementary Tables 2 and 5). For FAS, since template matching failed, an initial round of manual picking contributed 58.96% (43 p.p.t.), followed by one round of DeePiCt on the incomplete ground truth, adding another 22% (16 p.p.t.). Additional rounds of DeePiCt were discarded since they led to a high false positive rate. The final manual picking added a significant fraction of 19% (14 p.p.t.) (Fig. 2k, Extended Data Fig. 3c,f and Supplementary Tables 3 and 6).

In the ten defocus tomograms, we annotated 25,901 ribosomes and 366 FAS. Of the ribosomes, template matching found only 19% (498 p.p.t.), another 19% (502 p.p.t.) were added by DeePiCt trained on incomplete ground truth, and 61% (1,590 p.p.t.) were manually identified (Fig. 2j, Extended Data Fig. 3b,e, and Supplementary Tables 2 and 5). For FAS, 49% (18 p.p.t.) came from the initial manual annotations (template matching failed), 37% (14 p.p.t.) from DeePiCt trained on incomplete ground truth, and 14% (5 p.p.t.) from the final manual picking (Fig. 2k, Extended Data Fig. 3d,f, and Supplementary Tables 3 and 6). Overall, the final manual picking was crucial to generate a comprehensive ground truth annotation for FAS and, in defocus data, also for ribosomes, highlighting the value of our carefully curated annotations.

Total numbers of ribosomes in both acquisition types were comparable, while fewer FAS particles were detected in the defocus dataset likely owing to the lower SNR despite the application of an equalization filter (Extended Data Fig. 3, Supplementary Tables 2 and 3, and Supplementary Note 2). Nevertheless, our annotations of fully assembled FAS in both dataset are in agreement with levels of FAS-α and FAS-β quantified by previous proteomics studies27,28 (Extended Data Fig. 4; Methods).

To assess the quality of the obtained ground truth annotations for ribosome and FAS, we performed structural 3D classification and refinement (Supplementary Note 2, Supplementary Figs. 57, and Supplementary Tables 7 and 8; Methods). Averages of ribosomes detected only by DeePiCt recapitulate 80S ribosomes, similar to the averages from all annotated ribosomes (Fig. 2l,m). 3D classifications of FAS demonstrated that DeePiCt together with manual annotations can recover these challenging shell-like structures independent of the data acquisition type (Fig. 2l,m, Supplementary Note 2 and Supplementary Fig. 6).

In conclusion, we provide comprehensive, high-quality annotations for large macromolecular complexes and cellular structures in S. pombe tomograms acquired with VPP and defocus. These set the ground for benchmarking the performance of the method and for future developments in particle detection approaches.

Performance analysis and hyperparameter tuning of DeePiCt

Performance analysis of 2D CNN in VPP

For assessing the performance of the 2D CNN, we evaluated two binary segmentation tasks on the ten ground truth VPP tomograms using fivefold cross-validation: segmentation of all organelles (all membrane-enclosed organelles and the nucleoplasm), and segmentation of the cytosol. The 2D CNN achieves high areas under the precision–recall curve (AUPRC; Supplementary Note 1 and Supplementary Fig. 4), with a median AUPRC of 0.92 for organelles and 0.98 for cytosol (Fig. 3a and Extended Data Fig. 5). Notably, basic hyperparameter tuning had close to no effect and a fixed architecture produced sufficiently good segmentations.

Fig. 3: DeePiCt performance, cross-domain generalization, and comparison with other methods.
figure 3

a, Performance results of the 2D CNN for organelle and cytosol segmentation (n = 2 tomograms over 5 independent experiments) for training and testing in the same domain (VPP, cyan). The median AUPRC scores are indicated (red lines). b, Performance of DeePiCt for the same-domain setting (VPP, cyan) for ribosome localization, FAS localization, and membrane segmentation tasks (left to right), with n = 2, n = 2 and n = 4 tomograms, respectively, over 3 independent experiments. In each case, the corresponding architectures of the 3D CNN were optimized by hyperparameter tuning (Supplementary Table 1 and Extended Data Fig. 6). The median F1 score for ribosomes and FAS, and a median voxel-F1 for membranes are indicated (red lines). c, Same-domain NPC segmentation results (cyan). From left to right: performance in all (n = 42 tomograms over 3 independent experiments), high quality (n = 13), and lower quality defocus* tomograms (n = 29). d,e, DeepFinder (d) and template matching (e) particle localization results for ribosome and FAS. Median F1 values are indicated (red line). d, DeepFinder results were achieved by training a multi-class DeepFinder network that simultaneously segments FAS and ribosomes (n = 2 tomograms over 3 independent experiments). e, Template matching results are shown for the VPP tomograms (n = 10 independent tomograms). f,g, Results of the cross-domain generalization (yellow) for the 2D CNN (f) and DeePiCt (g) (training on 10 VPP tomograms, testing on n = 10 defocus tomograms after amplitude spectrum equalization). Red lines indicate median performance values. Different shades of grey for dots correspond to cross-validation folds (DeePiCt and DeepFinder) or individual tomograms (template matching). Details of performance measurements (F1, voxel-F1, AUPRC) are described in Supplementary Note 1 (Supplementary Fig. 4). Box plot middle lines mark the median and the edges indicate the 25th and 75th percentiles; whiskers encompass all data that are not considered outliers (calculated by the Seaborn51 box plot function). h, Median performance summary tables associated to results in plots ag; bold numbers denote highest score values per class.

In addition, the 2D CNN has the potential to segment individual organelle types when provided with sufficient training data (Supplementary Table 4). Since it operates in 2D rather than 3D, it requires little memory (<1GB at the default batch size) and training (~15 min to train on 10 tomograms using a Nvidia V100 GPU; Methods).

Hyperparameter tuning for the 3D CNN

For the 3D CNN, we characterized the effect of the adjustable hyperparameters (Methods) using a cross-validation approach for ribosome, FAS, and membranes in the VPP dataset, and for NPC in the defocus* dataset.

To measure performance, we integrated the predictions of each target structure with an appropriate ‘region mask’ to remove false positives outside of expected regions (Supplementary Fig. 3): for ribosome and FAS, we used a cytosol prediction from the 2D CNN in ‘intersection’ mode; for membrane segmentation, we used the predicted cytosol from the 2D CNN in ‘contact’ mode; for NPCs, we used a 3D CNN prediction of the nuclear envelope in ‘contact’ mode.

Using this performance measure, we determined the hyperparameter combination that optimizes the task of localization or segmentation (Extended Data Fig. 6 and Supplementary Table 1). We found that the best hyperparameter setting is structure-dependent and related to particle size, density, symmetry, and abundance, among others, as well as on the receptive field of the network. Therefore, the architecture flexibility of our 3D CNN implementation is essential for segmentation of a diverse set of biological structures. Specifically, batch normalization shortens the learning time and notably improves performance for NPCs and FAS. Dropout layers and a number of data augmentation strategies did not improve performance (Supplementary Note 1 and Supplementary Fig. 2).

The physical receptive field of the network, that is, the context it sees for each prediction voxel, depends on the depth hyperparameter, and on the physical inter-voxel spacing. In all experiments, we used 4× binned tomograms with inter-voxel spacing of ~13.5 Å (Methods), sufficient for the detection of ~30-nm diameter particles investigated here. The binning can be optimized to detect different structures of interest. Lower binning retains higher-resolution information, but at the cost of higher noise and decreased physical receptive field of the network. The latter could be compensated with a higher depth, thus requiring more training data and possibly amplifying the well-known ‘vanishing gradient effect’ associated with deeper architectures29. Higher binning may be sufficient for larger structures such as organelles.

Performance of DeePiCt in the same-domain setting

Using the optimized networks, we analyzed the performance of the DeePiCt workflow in the same-domain setting (that is, training and testing in the same dataset type) using threefold cross-validation (Fig. 3b and Extended Data Fig. 7a–c). In VPP, it achieved a performance F1 score between 0.68 and 0.80 (median 0.79) for ribosomes, between 0.21 and 0.64 (0.46) for FAS, and a voxel-F1 between 0.58 and 0.90 (0.71) for membranes. For the NPC segmentation in defocus* (Fig. 3c), the performance depended on the quality of the tomograms, 23% of which were assessed to be high quality on the basis of lamella thickness and tilt-series alignment error (Methods; median voxel-F1 of 0.47 versus 0.19 for high and lower quality, respectively), even if the network was trained on the full dataset. Comparative analysis between DeePiCt single-class versus multi-label networks showed that single-class networks provide better results (Extended Data Fig. 6), even when employing loss functions that account for class imbalance, such as Generalized Dice, in multi-label networks training (Supplementary Fig. 1).

Comparison of DeePiCt to state-of-the-art tools

We benchmarked DeePiCt against DeepFinder and template matching for localizing ribosomes and FAS in the VPP dataset (Fig. 3d,e). Following the suggestions by the authors15, DeepFinder was trained in a multi-class fashion and evaluated in the same threefold cross-validation setting as DeePiCt. In ribosome localization, DeepFinder performed comparably to DeePiCt (median F1 0.83 versus 0.79). For FAS localization, DeepFinder performed significantly worse than DeePiCt (0.11 versus 0.46; t-test P= 0.007; Fig. 3d). Single-class DeepFinder networks for FAS failed to localize any particles. DeePiCt required ~17 h for training versus ~3 h for DeepFinder, while predicting and post-processing is equally fast for both (~500 clusters per minute; Methods). In contrast to DeepFinder, the trained networks of DeePiCt (models) for all structures mentioned in this work are open source and publicly available (Code Availability).

Template matching performance was measured by comparing the top 2,000 cross correlation peaks per tomogram of the raw output for ribosome and the top 1,000 peaks for FAS. It required several hours for detection in each dataset and showed worse performance than both deep-learning methods for the localization of ribosome, and completely failing for FAS (Fig. 3e).

DeePiCt domain generalization across acquisition conditions

Defocus data has the advantage of yielding higher-resolution maps in subtomogram averaging compared to VPP30. Yet, it suffers from a lower image SNR and contrast, making particle localization and structure segmentation more challenging. Thus, we sought to assess the power of DeePiCt models trained on VPP for segmenting on defocus data.

The 2D CNN model trained on VPP tomograms shows slightly poorer performance on defocus data than in the same-domain setting (AUPRC 0.82 for cytosol and 0.42 for organelles; Fig. 3a,f). Similarly, DeePiCt 3D CNN performance dropped with respect to the same-domain setting to a median F1 of 0.59 for ribosome, 0.15 for FAS, and median voxel-F1 of 0.41 for membranes (Fig. 3g). In both networks, employing the spectrum equalization filter in the pre-processing step further improved performance in this domain generalization setting (Extended Data Figs. 5 and 7d–f). Furthermore, using the cytosol predictions as a ‘region mask’ improved the 3D CNN performance (Extended Data Fig. 7g,h). An example of DeePiCt segmentation results in the cross-domain setting is shown in Fig. 4a,b and visually resembles the ground truth annotation (Extended Data Fig. 8a).

DeePiCt predictions result in high-quality subtomogram averages

To assess the quality of DeePiCt predictions on defocus data, we performed 3D structural classification on the output. For FAS, while DeePiCt identified fewer particles than in the ground truth (Supplementary Tables 8, 9), subtomogram averages revealed the two half domes of the barrel-shaped type I FAS with applied D3 symmetry, consistent with the ground truth average and published structures from other yeast species (Fig. 4c,d and Extended Data Fig. 8b–e). We observed the phosphopantetheine transferase (PPT) domains required for activation of FAS along the equatorial plane and three additional equatorial densities that could not be assigned on the basis of published structures. Three densities inside each half dome connected to the central α wheel and in close proximity to the ketosynthase (KS) fit with the acyl carrier protein (ACP) of Saccharomyces cerevisiae31 (Fig. 4d; Protein Data Bank (PDB) accession 2UV8) and Pichia pastoris32 (Extended Data Fig. 8c; Electron Microscopy Data Bank (EMDB) accession EMD-12139). The ACP shuttles the growing acyl-chain between the different catalytic sites, and its localization has been suggested in connection with the activity of the whole multi-enzyme complex31,33,34,35,36. Here, maps of the S. pombe FAS complex in exponentially growing cells revealed a specific, native state, ACP site (Fig. 4d, Extended Data Fig. 8, and Supplementary Note 2).

Fig. 4: DeePiCt enables exploration of macromolecular complexes in their cellular context.
figure 4

a, 2D slice of a representative defocus S. pombe tomogram (n = 10 tomograms). Mitochondrion (M), vesicle (V), the ER and the cell wall (CW) are detectable in the raw (bottom left) and with improved contrast after amplitude spectrum equalization (top right). b, DeePiCt predictions generated with models trained in the VPP data. Organelles (gray), membranes (purple), FAS (pink, c,d), ribosomes (yellow, eh) and subsets classified in RELION (head density, dark blue, i; exit tunnel density, bright blue, j), and within 25 nm of the ER (ER-bound, orange, m) and mitochondria (mito-bound, green, n). c, FAS subtomogram average (pink) fits the S. cerevisiae structure (cyan) including PPT domains. An extra density cannot be assigned. d, Cross-section of c close to the α-wheel with three densities fitting ACPs (asterisks). e, Subtomogram average of all ribosomes from ten defocus tomograms. f, Well-aligned ribosome subset detected by hierarchical 3D classification in RELION and refined in M. g,h, Slices through f reveal the PTC with a P-site tRNA and the L1 stalk facing the E-site. i, Ribosome subclass with additional densities (white arrowheads) close to the head of the small ribosomal subunit, which fits eEF3 (red), and close to the ribosomal exit tunnel. j, Ribosomes classified for a density below the ribosomal exit tunnel (white arrowhead). The S. cerevisiae ribosome with ES27L in a particular configuration (left, purple) connects to the additional exit-tunnel density, which fits Arx1 bound to the 60S pre-ribosome (middle and right, purple). k,l, Different z-slices of the representative tomogram in a (white dashed boxes) show ribosomes bound to the ER (k) and a mitochondrion (l). m, An average of ER-bound ribosomes from seven tomograms shows a connection of the peptide exit tunnel of the large subunit to the membrane density. n, An average of mitochondria-bound ribosomes from three tomograms shows a linker connecting the large subunit, at a site close to the small subunit, to the membrane density. Overlay of the ribosomes in m and n shows different interfaces with the respective organelle membranes.

For ribosomes, the particle averages and numbers derived from DeePiCt predictions are comparable to the ground truth annotations (Supplementary Tables 8 and 9, Supplementary Note 2, and Supplementary Figs. 7 and 8). 3D refinement of the ribosomal particles resulted in subtomogram averages with nominal resolutions of 11 Å (ground truth) and 15 Å (DeePiCt predictions) after multi-particle refinement in M22 (Fig. 4e and Extended Data Fig. 8f). Hierarchical 3D classification revealed a well-aligned class that was further refined in M to subnanometer resolutions of 9.3 Å (ground truth) and 9.4 Å (DeePiCt) (Fig. 4f–h and Extended Data Fig. 8i–k). This allowed the identification of tRNA occupying the P-site of the peptidyl transferase center (PTC) and the L1 stalk facing the E-site (Fig. 4g,h and Extended Data Fig. 8j, k).

DeePiCt-predicted ribosomes reveal functional subpopulations

The large number of particles localized by DeePiCt in the defocus dataset in a high-throughput manner allows for examining subpopulations of functionally distinct complexes. Focused classification of all DeePiCt-predicted ribosomes on the head of the 40S small subunit revealed a subset with additional densities close to the head and at the exit tunnel (Fig. 4i and Supplementary Fig. 9). This class was also detected in the VPP and defocus ground truth datasets (Supplementary Tables 79), resolving densities for P- and E-site tRNAs in the latter (Extended Data Fig. 9a–e). The ribosome-bound ATPase eEF3 from S. cerevisiae37 fitted well into the additional head density (CC 0.8972, EMDB accession EMD-12062; Fig. 4i and Extended Data Fig. 9a–c). During translation, this eukaryotic elongation factor facilitates binding of a new tRNA to the A-site via the ternary aminoacyl-tRNA–eEF1A–GTP complex37. Focused classification of all DeePiCt-predicted ribosomes at the ribosomal exit tunnel provided an average fitting the S. cerevisiae ribosome38 (CC 0.9657, EMDB accession EMD-1667) with the rRNA expansion segment ES27L in a particular configuration39 connecting to an additional density close to the ribosomal exit tunnel (CC 0.7938, PDB accession 3IZD; Fig. 4j and Supplementary Fig. 10). ES27L plays a role in translation fidelity. Enzymes, such as the methionine aminopeptidase (MetAP) that co-translationally processes the nascent peptide chain40, nuclear export factor Arx1, which is released during ribosomal 60S maturation in S. cerevisiae41, and its human homologue Ebp1, a translation regulator42,43, bind at locations of the observed extra density. The binding factors recruit the flexible rRNA scaffold ES27L and cover the ribosomal exit tunnel with their MetAP-like folds. This structural class was also detected in the defocus and VPP ground truth datasets (Extended Data Fig. 9f–j and Supplementary Tables 79). Thus, the large number of particles obtained with DeePiCt predictions combined with structural analysis revealed functional subpopulations of ribosomes.

Fig. 5: DeePiCt generalization across species.
figure 5

A dataset depicting a HeLa cell nuclear periphery (n = 1 tomogram)45 is segmented by applying four independently trained DeePiCt networks. The results show the segmentation of actin filaments (red) trained on RPE-1 and MEF 3T3 tomograms, microtubules (MTs, cyan) trained on C. elegans tomograms, and cytosolic ribosomes (yellow) and membranes (purple) trained on S. pombe tomograms. The inset (top right) in the DeePiCt predictions panel shows the MT subtomogram average (cyan) obtained from the DeePiCt predictions.

DeePiCt reveals ribosome–mitochondria association

DeePiCt allows studying particle populations in specific contexts on the basis of their proximity to predicted organelles. We recovered cytosolic ribosomes within 25 nm distance to predicted ER and mitochondria from seven and three defocus tomograms, respectively. Subtomogram analyses for both subsets revealed one class with and one without a membrane density (Supplementary Figs. 11 and 12). Ribosomes showed a specific orientation relative to the membrane in the first case, while they were randomly orientated in the other. The latter likely arises owing to the highly crowded nature of the S. pombe cytoplasm (Fig. 4k,l), and, in the case of DeePiCt predictions, also from imperfect organelle segmentations (Fig. 4b). ER-bound ribosomes faced the membrane with the ribosomal exit tunnel in agreement with published structures from other species38,44,45(Extended Data Fig. 9k–n and Supplementary Tables 79). Mitochondria-bound ribosomes, which posed a particular challenge to structural analysis in previous studies44,46, were found to interact with the membrane at an angular offset of around 35° in comparison to the ER-bound ribosomes (Fig. 4m,n, Extended Data Fig. 9o–r, and Supplementary Tables 79). Interestingly, a density connecting the ribosome to the mitochondrial membrane was detected on the large subunit in close proximity to the small subunit, but not to the ribosomal exit tunnel. It possibly represents the ribosome nascent chain complex in contact with the mitochondrion receptor OM1447,48, which has yet to be structurally described. Thus, ribosomes close to mitochondria and ER exhibit different interfaces with the respective membranes, potentially facilitating specific protein nascent chain membrane insertion or transfer into the organelle38,47. These results highlight the power of DeePiCt for high-throughput particle localization and cryo-ET segmentation to rapidly gain new biological insights.

Trained networks can be readily applied to other species

As a demonstration of the domain generalization potential of our workflow across species and the ease of applying pre-trained models, we predicted ribosomes, membranes, and cytoskeletal filaments (actin and microtubules) in a published VPP HeLa cell dataset45 (Fig. 5), for which we manually generated a mask (using Amira49) of the cytoplasmic volume, to exclude the nucleus. For evaluation, we computed the voxel-F1 score using the publicly available annotations (EMDB accession EMD-11992 (ref. 45)). Using the ribosome and membrane networks trained on the ten VPP tomograms from S. pombe described above resulted in a voxel-F1 of 0.55 for ribosomes and a voxel-F1 of 0.18 for membranes (Extended Data Fig. 10a). However, the available expert membrane segmentation for the dataset covered only ER membranes. Visual inspection of the predicted membrane segmentation revealed a good fit (Fig. 5).

Additionally, we generated models for microtubules prediction, initially training a network for simultaneous actin and microtubule segmentations in four tomograms of Human retinal pigment epithelial-1 (RPE-1) cells50. Owing to the high preferential orientation of the cytoskeletal filaments in this data, the performance for microtubules segmentation was low (voxel-F1 0.26; Extended Data Fig. 10a,b). We therefore trained a second 3D CNN for microtubule segmentation on 11 tomograms from dissociated C. elegans cells, containing 890 microtubules in a wide range of orientations (Methods). This network achieved a voxel-F1 score of 0.83 in the HeLa cell tomogram (Extended Data Fig. 10a,c), exemplifying the importance of a training set with a high number of structures and orientations to mitigate the effect of the missing wedge on the prediction performance. These voxel-based DeePiCt predictions were then used to derive 3D coordinates for subsequent subtomogram averaging, revealing the 25 nm diameter hollow structure of HeLa cell microtubules, which is formed by 13 protofilaments (Methods; Fig. 5 and Extended Data Fig. 10c).

Finally, we trained a dedicated network for predicting actin filaments on five manually curated tomograms, two from RPE-1 cells and three from mouse embryonic fibroblasts (MEF) 3T3 cells, that contained approximately 3,740 actin filaments (Methods). This network shows a low voxel-F1 score of 0.10 in the HeLa cell tomogram (Fig. 5 and Extended Data Fig. 10a), which is likely caused by the finer structure of actin in comparison to microtubules, the minimal set of orientations sampled in the small number of training tomograms, and the fact that most training filaments are arranged in bundles, probably causing the CNN to learn this superstructure rather than patterns of individual filaments (Extended Data Fig. 10b).

Overall, these results show that the prediction of ribosomes and microtubules, representing large macromolecular complexes, is especially well-preserved across species and that the availability of diverse and well-annotated training datasets are crucial for good performance of DeePiCt. The results shown in Fig. 5 highlight the use of trained networks from DeePiCt to segment novel datasets spanning different species.

Discussion

Our DeePiCt workflow facilitates accurate and fast localization of diverse structures in cryo-ET data of intact cells. The demonstrated high performance and the flexibility of the 3D CNN architecture offer a reliable tool for pattern recognition. This enabled us to detect lowly abundant particle species with a less dense structural signature (FAS) compared to ribosomes. The integration of structure segmentation (predicted by the 3D CNN) with the contextual information (predicted by the 2D CNN) excludes false positives in the particle localization and structures segmentation tasks and harnesses the cellular context to carry out spatial studies focused on regions of biological interest. This enabled us to investigate ribosomes in proximity to specific organelles (for example, ER versus mitochondria) and to obtain structural insights with functional implications. As the code is open source and Python-based, our flexible 3D architecture could further be expanded by implementing variations, such as ResNet encoders, atrous convolutions, class-normalization, positioning DeePiCt to serve as a tester for deep-learning techniques.

A major bottleneck in the field of supervised machine learning is the availability of expert curated training data. Here, we provide an experimental cryo-ET dataset of 20 S. pombe tomograms under two microscopy acquisition settings (VPP and defocus), together with high-quality comprehensive annotations of ribosomes and FAS, membranes, organelle, and cytosol segmentations. This constitutes the first gold-standard dataset in the field that is large enough for model training, which will enable benchmarking of current methods and spur the development of future computational tools for unbiased data mining in cryo-ET data. Subtomogram averaging of the annotated particles from either ground truth or DeePiCt predictions resulted in the first density maps of the S. pombe ribosome and fatty acid synthase, and further point to differences in the analysis of subtomograms from VPP or defocus tomograms, despite both being derived from wild-type S. pombe cryo-FIB lamellae.

The analysis of DeePiCt performance confirmed that data quality is important for its predictive ability. This was demonstrated specifically for predictions of the NPC, which, with its high degree of structural flexibility on the subunit and pore diameter levels inside cells26, is a challenging target. High SNR and contrast are overall important for good performance during training and prediction with DeePiCt, exemplified by the higher performances for data acquired with a VPP. The introduction of a pre-processing equalizing filter improves the learning process during the training of 3D segmentation networks for particles with less dense print than the ribosome (for example, FAS), and especially for the generalization power across domains, including different microscopy acquisition conditions. For the 2D network, although pre-processing did not improve model performance on the same-domain inference, it did improve cross-domain performance when training on VPP data and inferring on defocus data, or when training on both data types combined to segment organelles and cytosol. More elaborate tasks, such as prediction of individual organelle types, will likely require more training data or training of a dedicated 3D CNN with a tailored network architecture.

Our workflow allows easy adaptation to the segmentations of other structures, as demonstrated by the application of cytoskeleton segmentation networks. The networks show high-quality performance for microtubules in the HeLa cell dataset after training on data with broad orientation sampling of the filaments, producing segmentations that can be used for subsequent structural analysis. Actin predictions revealed a low F1 score and therefore likely require more training data, and sampling different orientations of the structural features, to improve performance. Altogether, the application of multiple segmentation networks to the HeLa cell dataset revealed that DeePiCt models trained on datasets from different microscopes, species, and conditions lead to reasonably good results in high-quality data. Although more in-depth analysis is needed to study the limitations for the applicability of DeePiCt networks on other datasets, the results presented here constitute the first step towards conducting large-scale quantitative analyses for structural biology using cryo-ET on cells from different laboratories and publicly available datasets. In this sense, the generated ground truth annotations in this study provide the community with a resource to improve and further develop cryo-ET object segmentation and detection tools, to ultimately enable broad exploration of particles in their cellular context. Together with the trained networks and the flexibility of the DeePiCt workflow, the software harbors great potential for quantitative cryo-ET studies in the future.

Methods

Yeast cell culture

S. pombe K972 Sp h- wild-type haploid cells were recovered from frozen stock by streaking on YES agar plates (YES Broth, Formedium, 20 g agarose per liter) and incubated at 30 °C for 1–3 days. Colonies were restreaked on fresh YES agar plates and incubated 1–3 days at 30 °C. Single colonies were inoculated in 5 ml YES medium (YES Broth, Formedium, PCM0302, FM0618/8573) and grown at 30 °C, 170 r.p.m. overnight (NCU-Shaker mini, Benchmark). On the next day, cultures were grown to their log phase at an optical density at 600 nm of 0.5–0.6 and diluted beforehand in YES if necessary.

Vitrification

Yeast cells were either diluted to optical density at 600 nm of 0.2–0.4 in YES medium or, following a wash step, in phosphate-buffered saline (PBS) containing 5% or 10% bovine serum albumin as external cryoprotectant. Transmission electron microscopy (TEM) grids (Quantifoil R1/2, Cu 200 mesh, holey carbon or SiO2 film) were glow discharged on both sides for 45 s (Pelco Easy glow). Four microliters of the cell suspension were applied to the grids inside the chamber of a Leica EM GP (Leica Microsystems). Blotting from the back side of the support was performed for 1–2 s at 22 °C and 99% humidity. Grids were plunge frozen in liquid ethane cooled by liquid nitrogen and transferred into grid boxes until further usage.

Cryo-FIB

TEM grids with vitrified yeast cells were clipped into an autogrid with a cut-out52. Mounted on a 45° pre-tilt shuttle, grids were transferred into an Aquilos Dual beam microscope (Thermo Fisher Scientific), sputter-coated with platinum for 10–15 s (1 kV, 10 mA, 10 Pa), and subsequently coated with organometallic platinum using the gas injection system (8 s with the stage positioned 3 mm below the coincidence point). In three independent sessions, three grids with five lamellae each were processed at a milling angle of 15°. Agglomerations of several cells were thinned in three steps of rough milling to a thickness of 5 µm at 1 nA ion beam current, 3 µm at 0.5 nA and 1 µm at 0.1 nA. Milling progress was visually monitored between each milling step with the scanning electron microscope beam (10 kV, 50 pA). Fine milling was performed at 50 pA to a target thickness of 200 nm. To render the lamellae conductive for TEM imaging, grids were sputtered with platinum for 5 s (1 kV, 10 mA, 10 Pa) and transferred into cryo-boxes.

Cryo-ET

Autogrids with lamellae were loaded into a Titan Krios (Thermo Fisher Scientific) such that the axis of the pre-tilt introduced by FIB milling was aligned perpendicular to the tilt axis of the microscope53. Cryo-ET acquisition parameters are summarized in Supplementary Tables 79. Tomograms were acquired on a K2 Summit direct detection camera (Gatan) operating in dose fractionation mode utilizing a Quantum post-column energy filter operated at zero-loss (Gatan). A calibrated pixel size of 3.45 Å was used for the NPC defocus* data and 3.37 Å for the remaining datasets. Up to 14 tilt series were collected on a single lamella in low dose mode using SerialEM54. Starting from the lamella pre-tilt, images were acquired in 2° increments within a range of +58° to −40° using a dose-symmetric tilt scheme55 with a constant electron dose per tilt image. For ground truth data, a set of ten tilt series were either collected with a 70 μm objective aperture or a VPP56 (Thermo Fisher Scientific) with prior conditioning for 5 min. NPC data was collected as described in previous work26,57, but at 1.5–4.5 μm defocus, with 3° increments and an effective tilt range of +50° to −50°.

Tomogram reconstruction

Tilt movie frames were aligned using a SerialEM plugin. Tilt series were filtered according to the accumulated electron dose by Fourier cropping using the mtffilter function in etomo (IMOD/BETA4.10.1258), and sorted by tilt angle using a python script. Four-times-binned tilt images were aligned in etomo (IMOD/4.9.4)58 using patch tracking (typical residual error 0.291–0.569 pixels) and tomograms were reconstructed via weighted back projection. Tomogram thicknesses ranged between 80–310 nm.

Ground truth annotation for organelles, cytoplasm, and membranes

In VPP and defocus datasets, organelles (mitochondria, vesicle, tube, ER, nuclear envelope, nucleus, vacuole, lipid droplet, Golgi apparatus, and vesicular body; Supplementary Table 4) and cytosol were annotated. Each compartment was identified through a unique numerical label to allow for selection of specific subsets of compartments. Segmentations of ten VPP tomograms were performed manually in Amira49, and used to train a 2D CNN. Using this trained CNN, we predicted in ten defocus tomograms pre-processed with the spectrum equalization filter, and manually corrected the segmentations.

Membrane annotations were performed on ten VPP tomograms and five defocus tomograms. Initially, five VPP tomograms were annotated using Amira by manual segmentation on every two to three slices and subsequently interpolated. These were then used to train a membrane segmentation 3D CNN, whose predictions on the remaining five VPP and five defocus tomograms were manually corrected in Amira.

Ground truth particle annotation in VPP data

Ribosome and FAS were localized in 4×-binned tomograms (13.48 Å voxel size) in an iterative workflow. Manually curated template matching for ribosomes and non-exhaustive manual picking for FAS (step 1; Supplementary Tables 2 and 3) were used for training 3D CNNs (step 2). CNN predictions were masked with a segmentation of the cytosol. For ribosomes, step 2 was repeated three times (always trained on combined predictions of step 1 and the preceding round) and each round consisted of three simultaneously trained networks with default hyperparameters (Supplementary Tables 2 and 3), except for the number of IF = 4, 8, and 32, to provide a cumulative prediction that is less overfitted to the incomplete training data. Cumulative predictions were manually revised in tom_chooser (ribosomes) and in EMAN2 (FAS) and manual picking was performed in either spectrum-matched or Gaussian-filtered tomograms (σ = 3) for up to three rounds in EMAN2 spt2_boxer59 (step 3; Supplementary Tables 2 and 3). The particle lists were cleaned for duplicates by applying elliptic distance constraints to the coordinates (Supplementary Note 1 and Supplementary Tables 2 and 3).

Template matching for ribosomes was performed with pyTOM60 by a 3D cross-correlation search over 1,944 Euler angle combinations using the large subunit of S. cerevisiae 80S ribosome map (EMDB accession EMD-3228 (ref. 61)) scaled to the corresponding pixel size, and a spherical mask (diameter 337 Å). The 2,000–3,000 highest cross correlation scores were manually revised in Gaussian-filtered tomograms (σ = 3) with tom_chooser (pyTOM toolbox60). For FAS in the VPP dataset template matching with a published S. cerevisiae FAS33 map (EMDB accession EMD-1623) as reference failed.

Ground truth particle annotation in defocus data

A procedure similar to the above was applied, except that: (i) the initial annotations for ribosomes were manually cleaned after template matching and initial FAS manual picking was incomplete; and (ii) in step 2 of the ribosomes ground truth construction, two of the three rounds of DeePiCt predictions were obtained using models trained on VPP data (first and second round of VPP ground truth construction; Supplementary Table 2).

Comparison of cryo-ET-derived particle numbers with proteomics

Copy numbers of ribosomes and FAS per cell were calculated with the ground truth annotations for an S. pombe cell with the assumption of 30% cytosolic volume62 of a total of 150 μm3 average cell volume63,64, considering one fully assembled FAS complex is constituted by each six alpha and beta subunits.

NPC manual localization

NPCs were manually localized as described previously26. Coordinates and initial orientations of 354 NPCs were manually determined in 127 4×-binned, SIRT-like filtered58 defocus* tomograms57. The annotated NPC data was divided on the basis of quality criteria: 38 tomograms of quality 1 have a thickness below 300 nm, and a tilt-series alignment residual error below 0.7 pixels. Quality 0 was assigned to 89 tomograms with 300–395 nm thickness and a residual error of 0.7–5.0 pixels for tilt-series alignments.

Voxel-level representation of ground truth

For ribosome and FAS, the lists of coordinates from the ground truth were used to paste spherical masks (with radii of 10.78 nm for ribosome and 13.48 nm for FAS; Code Availability). For the NPCs, a subunit mask obtained by prior 3D averaging in novaSTA (https://doi.org/10.5281/zenodo.3973623) was pasted at each of the eight subunit locations. The ribosome, FAS, membrane, organelle and cytosol masks in defocus and VPP along with the list of coordinates and respective tomograms are used in the subsequent training and performance analysis, and are available from the Electron Microscopy Public Image Archive (EMPIAR; Data Availability).

Cytoskeletal filaments segmentation and subtomogram averaging

Microtubule and actin networks were trained on tomograms of Human retinal pigment epithelial-1 (RPE-1) cells generated in a previous study50. Segmentations of the individual cytoskeletal elements were performed using the filament tracing function in Amira65,66, followed by manual curation. Trained networks were applied to a tomogram depicting the nuclear periphery of a HeLa cell and compared against the corresponding segmentations (EMDB accession EMD-11992).

For microtubules, an additional 3D CNN was trained on 11 tomograms containing 890 microtubules in a wide range of orientations from lamellae of cryo-FIB-milled C. elegans cells dissociated from a GFP::SPD-5, mCherry::histone worm line. This model was applied to the HeLa cell tomogram (EMDB accession EMD-11992) and the resulting predictions used to extract coordinates in the following steps: using skeleton3d67 and a custom script in Matlab (version 2019a; Code Availability), particles were sampled along each filament at 6 pixel steps in a 4×-binned tomogram (corresponding to 101 Å with an unbinned pixel size of 4.21 Å), with each particle rotated by 360°/13 along the third Euler angle (second in-plane rotation). Particle cropping, alignment, and averaging were performed in Dynamo24 (v.1.1.520). Particles were cropped from a 2×-binned tomogram with a pixel size of 8.42 Å per pixel, aligned over three rounds with a hollow tube as starting reference. Helical symmetry was applied starting from the second round of alignments. Resolution was determined at 39 Å (0.5 cut-off) by Fourier Shell Correlation (FSC) using the odd and even particles of the masked, final average.

Subtomogram analysis for ribosomes and FAS

Contrast transfer function (CTF) estimations, generation of 3D CTF models and subtomograms were performed in Warp21. CTFs were first estimated in the sums of raw tilt movies and subsequently in the tilt series taking the tilt angles into account. Subtomograms containing ribosomes and their CTF models were reconstructed in volumes of 140³ pixels with a pixel size of 3.3702 Å and a particle diameter of 350 Å. Initial alignments were performed in RELION version 3.0.723 in 25 iterations of 3D classification, with the S. cerevisiae 80S ribosome (EMDB accession EMD-3228, low-pass filtered to 60 Å) as reference to generate an initial single-class average. 3D refinements were performed with the resulting average as a reference. In defocus data, this average was further refined in M22 to optimize particle poses, image and volume warping to model non-linear deformations. Particles were re-extracted and hierarchical 3D classifications (25 iterations each) were performed in RELION. For VPP data, 3D classifications were performed directly after 3D refinements in RELION. Focused classifications were performed with binary masks indicated in the respective figures.

FAS subtomograms and CTF models were reconstructed in cubic volumes of 160 pixels with a pixel size of 3.3702 Å and a particle diameter of 400 Å. Initial alignments were performed in RELION in 25 iterations of 3D classification into a single class, and refined with the 3D refinement option and applying D3 symmetry using the S. cerevisiae FAS map (EMDB accession EMD-1623 (ref. 33), low-pass filtered to 60 Å) as reference. Hierarchical 3D classifications (25 iterations each) were performed either after 3D refinements in RELION (VPP data) or after subtomogram re-extraction using the ribosome-optimized image and volume models in M (described above).

Final subtomogram averages of each particle class were obtained by 3D refinement and post-processing, filtered to their respective resolutions determined by FSC of two independently refined half maps at a cut-off of 0.143. Details of particle numbers and resolutions for each subtomogram average are summarized in Supplementary Tables 79. Visualization and calculation of cross-correlations (CC) between different maps and models, was performed with the UCSF ChimeraX package68.

CNN pre- and post-processing

Tomograms were first normalized to obtain uniform mean of 0 and variance of 1 in the frequency domain, before training. The spectrum equalization filter was then applied by matching the amplitude spectrum of each tomogram to the target spectrum of one manually selected high-contrast VPP tomogram (Tomogram TS_001; Extended Data Fig. 1b). Extraction of spectra amplitudes was done using fast Fourier transform followed by radial averaging of the amplitudes across the frequency domain. If the Nyquist frequency of the target tomogram is lower than that of the input tomogram, the target spectrum was padded with zeros to match the size of the input spectrum. Next, an equalization vector was created by dividing entry-wise the target spectrum by the respective input spectrum, converted into a rotational kernel and multiplied by the input tomograms in the frequency domain in combination with a sigmoidal-shaped low-pass filter to eliminate high-frequency noise. After back transformation, the tomogram exhibits a similar contrast to the target tomogram (Extended Data Fig. 1). For the 2D CNN, tomograms and training segmentations are processed slice-wise into 2D tiles with a fixed size of 288 × 288 pixels (256 × 256 pixels and 16 pixels padding on each side). For the 3D CNN, tomograms are by default split into cubic patches of 64 × 64 × 64 voxels, and 12 voxels overlap in each dimension.

Post-processing for the 2D network assembles the per-slice prediction into a 3D segmentation. Predicted tile segmentations are cropped on each side by 48 pixels to reduce artifacts around the edges, followed by reassembly into 3D stacks, with remaining overlapping areas averaged. A one-dimensional Gaussian filter is applied along the z-axis to reduce single-slice false positives (Extended Data Fig. 2).

For the 3D CNN, individual 64 × 64 × 64 voxel patches are reassembled into the probability map and the thresholded map (usually at threshold value of 0.5) subsequently clustered. Clusters can be filtered for size and context (within or close to a given organelle/cytosol segmentation output by the 2D network) for the final prediction map (Supplementary Note 1). A list of coordinates of cluster-centroids, representing the particle location predictions, is then exported.

Evaluation metrics

To evaluate the particle localization task (ribosome and FAS), we defined true positives as those predicted particles whose coordinates overlap with a ground truth particle within a tolerance radius (10 voxels, 135 Å), and reported the F1 score as the harmonic mean between recall (proportion of ground truth particles recovered) and precision (proportion of predicted particles that were true positives; Supplementary Note 1 and Supplementary Table 1). For the structure segmentation task (membrane or the NPC segmentation), we compared the ground truth masks with the predicted post-processed segmentation by calculating their voxel-wise precision and recall and reported the corresponding voxel-based F1 (voxel-F1) score, also known as Sørensen–Dice coefficient (Supplementary Note 1 and Supplementary Fig. 4).

Cross validation and performance evaluation

For ribosome and FAS localization, and for membrane segmentation, a non-standard threefold cross-validation scheme where three (as opposed to five) subsets of eight VPP tomograms were used for training, and two for testing. In the test for domain generalization, the same three networks (trained on the eight VPP tomograms) were applied to the ten defocus tomograms.

The NPC predictions, treated as segmentation tasks, were evaluated in the defocus* independent data selected for that purpose using a threefold cross-validation where the total 127 tomograms were split into three random subsets with roughly the same number of tomograms each. Each fold consists of two such sets as training data and the remaining one for testing (Supplementary Note 1 and Extended Data Fig. 6d–f). The nuclear envelope predictions, which were used as ‘region mask’ in the ‘contact’ mode during post-processing of the NPC prediction, were achieved with a 3D CNN trained on 18 manually annotated tomograms, and which were uniformly distributed across the three dataset splits.

For organelle and cytosol segmentation evaluation, model performance was evaluated on the voxel-level for the post-processed 2D CNN predictions produced by the individually trained model of each cross-validation fold, using 5,000 voxels picked randomly from each tomogram of the respective test set of cross-validation fold. Precision and recall (Supplementary Note 1 and Supplementary Fig. 4) were computed at threshold values varying from 0 to 1 on the picked voxels to compute the area under the precision–recall curves (AURPC) for each tomogram.

Hyperparameter tuning of DeePiCt

The effect of user-defined hyperparameters of the 3D CNN (IF, D, ED, DD, and BN) were evaluated using the cross-validation and performance metrics described above. Starting with the default hyperparameter combination (Supplementary Note 1), we tested first for different values of D, while keeping the rest fixed. Then, we fixed D for which best performance was achieved and repeated the same process for IF. We continued with each of the remaining parameters, BN, ED and DD.

In all cases, we combined the 3D CNN segmentation of the target structure with an appropriate ‘region mask’ to eliminate false positives (Supplementary Note 1 and Supplementary Table 1). For ribosome, FAS and membrane, both 3D CNN (for target structure prediction) and 2D CNN (for ‘region mask’ prediction) in DeePiCt employed a threefold cross-validation in the VPP dataset as described above (Extended Data Fig. 6a–c).

Computational setup

All experiments for the 3D CNN of the DeePiCt pipeline and DeepFinder were performed using NVIDIA 2080 Ti GPU, Cuda 10.0, Python 3 and Pytorch 1.3.1. For the 2D CNN, training was conducted using an NVIDIA 2080 Ti GPU and an NVIDIA V100S GPU used for performance evaluation, using CUDA 10.0, Python 3 and Keras 2.3.1 with a tensorflow 2.0.0 backend. Detailed lists of parameters used for the 2D and 3D CNN are available alongside the DeePiCt source code (Code Availability).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.