Convolutional networks for supervised mining of molecular patterns within cellular context

Cryo-electron tomograms capture a wealth of structural information on the molecular constituents of cells and tissues. We present DeePiCt (deep picker in context), an open-source deep-learning framework for supervised segmentation and macromolecular complex localization in cryo-electron tomography. To train and benchmark DeePiCt on experimental data, we comprehensively annotated 20 tomograms of Schizosaccharomyces pombe for ribosomes, fatty acid synthases, membranes, nuclear pore complexes, organelles, and cytosol. By comparing DeePiCt to state-of-the-art approaches on this dataset, we show its unique ability to identify low-abundance and low-density complexes. We use DeePiCt to study compositionally distinct subpopulations of cellular ribosomes, with emphasis on their contextual association with mitochondria and the endoplasmic reticulum. Finally, applying pre-trained networks to a HeLa cell tomogram demonstrates that DeePiCt achieves high-quality predictions in unseen datasets from different biological species in a matter of minutes. The comprehensively annotated experimental data and pre-trained networks are provided for immediate use by the community.


Table of Contents
The presence and the frequency of the structure in the training set is one of the parameters that in our experience largely determines whether the 3D CNN is able to learn. The user can set in the configuration file a minimum label presence (0 ≤ _ _ ≤ 1), which indicates the proportion of labeled voxels in the patch that is required for it to be considered among the training patches. In this way, patches with too few labeled voxels are eliminated.
All models were trained using Adam Optimiser 74 . Application of these networks to unseen tomograms requires scaling of their pixel size to match (at least approximately) that of the training data.

Supplementary Figure 1 | Comparison of Dice and Generalized Dice Loss functions for a multi-label network.
The results of two multi-label networks, trained with Dice and Generalized Dice Loss functions, are evaluated. Both networks were trained for simultaneous segmentation of ribosomes and FAS, two highly unbalanced classes. The results show the F1 score on n=2 VPP tomograms, after being trained with 8 VPP tomograms. For the Dice Loss trained network, the results have a median F1 of 0.32 for FAS and 0.78 for ribosomes, while for the Generalized Dice Loss trained network, it has a median F1 of 0.34 for FAS and 0.81 for Ribosome. Boxplots' middle line marks the median and the edges indicate the 25th and 75th percentiles; whiskers encompass all data that are not considered outliers (calculated by the Seaborn boxplot function).

Data Augmentation
For the 2D CNN, tiles are randomly flipped and rotated in 90-degree increments during training to improve generalization.
Data augmentation for training the 3D CNN is available and optional, although it did not improve results in our hands (described below in relation to evaluation and training set size; Supplementary Fig. 2). The number of times that each of the original training examples is augmented is given by a parameter !" that can be set by the user. Each augmentation includes 4 different types of random volumetric transformations (applied to the original image and the labels, except in the case of noise), that we describe in what follows. Let A in #×#×# be a single channel cubic image of side-length > 0. Then: 1. Additive Gaussian noise. Given a predefined modulation factor > 0 , the Additive Gaussian noise transformation % is defined by % ( ) = + , where &'( ~ (0,1) are independent and identically distributed random variables (i.i.d.r.v.), and ( , ) ) denotes a normally distributed Gaussian random variable with mean and variance ) . 2. Salt-and-pepper noise. Given user-defined modulation factor > 0 and a probability rate 1 > > 0 , the salt-and-pepper noise transformation is defined by *,, ( 4. Random rotation. Given a probability of rotation ∈ (0,1) and angle range ∈ (0,180), the image A is rotated with a probability with and angle in [− , ], with respect to the −axis.

Training set sizes for particle picking
The amount of data required for training a network is dependent on the selection of hyperparameters, which in turn define the number of trainable parameters in the network. Performance results of DeePiCt for ribosome and FAS (Supplementary Fig. 2) show that 300 particles are enough to achieve the maximum F1 score for both abundant and sparse particles. The hyperparameters considered in each case correspond to the optimal values stated in Supplementary Table 1

3D CNN default hyperparameters
A default, starting point set of hyperparameters values for the 3D CNN for this study: = 2, = 4, = 0, = 0, = 0, and !" = 0. We used them to later, through testing variations on them, study the optimal combinations of hyperparameters in the different segmentation and localization tasks (Supplementary Table 1).

2D CNN post-processing
Post-processing of the 2D CNN includes the reassembly of predicted tile segmentations, by cropping 48 px on each side to reduce artifacts around the edges. Areas of sets of tiles still overlapping are averaged after cropping. In order to remove inter-slide discontinuities, a Gaussian filter with parameter > 0 (default = 5) is then applied along the z axis. 3. Integration with other segmentation maps, e.g. with the output of the 2D CNN. Importantly, a subsequent selection of clusters is made with respect to a region_mask provided by the user, which consists of an auxiliary binary image defining a region of interest (Fig 1b). We then allow 3 types of options (intersection, contact, or colocalization) for employing the region_mask through the corresponding contact_mode parameter, as follows ( Supplementary Fig. 3):

a. contact_mode: intersection:
The region_mask is used to directly mask the output of point 2, by considering only the overlapping region, e.g. including particles that are within an organelle's mask.

b. contact_mode: contact
Only clusters that have voxels in common with the region of interest are kept, e.g. selecting NPC on the nuclear envelope.

c. contact_mode: colocalization
Only clusters whose centroid is located at a given distance (colocalization_radius) from the region of interest are kept, e.g. particles within a distance to an organelle.
At this step, only clusters within the size range [ , ] are kept, where 0 ≤ < ≤ ∞ are defined by the user in the configuration file.
4. If the calculate_motl parameter is set to True, a list of the clusters' centroids resulting from previous steps is saved as a 4-columns comma separated values (csv) file, where the first column is the score (cluster size), and the three remaining columns indicate the ( , , ) centroid voxel coordinates. The list is sorted according to decreasing-score.

Segmentation evaluation
A 3D cellular segmentation probabilistic prediction (resulting either from the 2D or the 3D CNN) is a map whose value at every voxel ( , , ) is the probability 0 ≤ &,',( ≤ 1of belonging to the structure of interest ( Supplementary Fig. 4a). Setting a threshold 0 ≤ ≤ 1 induces a binarized prediction (such that if ≤ &,',( the voxel value is set to 1 and the rest to 0; Supplementary Fig. 4b), whose quality can be evaluated by comparison to a ground truth binary image, according to different metrics: 1. Voxel-based precision (P) is defined by: indicate the volume (number of entries different from 0). Thus, P is the proportion of voxels in the predicted segmentation that truly belong to the wanted structure (Supplementary Fig.  4c).

Voxel-based recall (R) is defined by:
./0(2*)4./0(5#) , thus indicating the proportion of voxels of the ground truth segmentation that were correctly recovered by the prediction (Supplementary Fig. 4c). 3. Voxel-based F1-score (also known as Sørensen-Dice coefficient, Supplementary Fig. 4c): A common metric to calculate the performance of the method by combining the P and R values simply through the voxel-F1 score, defined by their harmonic mean: Notice that here no account of the dependence with respect to the (threshold) value is used. 4. Area under the precision-recall curve (AUPRC, Supplementary Fig. 4e): Since the prediction is dependent on the threshold value 0 ≤ ≤ 1 applied to the probability map, both precision and recall depend on it as well, = ( ) and = ( ). The area under the precision-recall curve (AUPRC) is the area under the parametrized curve ( ( ), ( )), which defines a metric useful to summarize the power of the method across the threshold (and probability) dimension.

Particle detection evaluation
In the case of particles as discrete objects that are small enough to be defined by their coordinates, the evaluation of the object detection task requires the following definitions (Supplementary Fig. 4d): 1. Given a sorted list of predicted particles' centroids, one by one (in that order) they are either classified as false positives (fp) or true positives (tp). 2. A predicted particle coordinate is considered a true positive if and only if given a pre-defined tolerance radius r ( = 10 vox for both ribosome and FAS), there exists a ground-truth coordinate not previously matched to any other predicted particle. 3. Then the coordinate-based precision (p), recall (r), F1 score and area under the precision-recall curve (AUPRC) are defined by: For any given threshold 0 ≤ ≤ 1, the prediction can be binarized (yellow) and compared to the ground truth (blue). c. The voxel-F1 score is the harmonic mean of voxel-based precision (P) and recall (R), computed according to the formulas in the text, considering true positive voxels as the voxels that are in the overlap (green), false positives are voxels that belong to the prediction but not to the ground truth (pink), and false negatives are voxels that are in the ground truth but not in the prediction (purple). d. Particle-based comparisons, on the contrary, treat each cluster of the prediction and ground truth as a separate particle, and thus centroids (rather than voxels) are compared. Given a radius preset by the user, a predicted centroid (c) is a true positive (tp) when it lays within of a true centroid (p). False positives (fp) are predicted centroids located beyond from any true centroid, and false negatives (fn) are true centroids that are located beyond from all predicted centroids. According to this convention, one can calculate the particle-based precision (p) and recall (r) as described in the text. e. The voxel-based Precision-Recall pairs ( ( ), ( )) parametrized by the threshold , define the Precision-Recall Curve, whose area under the curve (AUPRC, in gray), defines a performance metric that accounts for the quality of the method across the prediction score.

Merging several lists for ground truth construction
To account for possible duplicates when merging lists of particle coordinates from different methods (e.g. manually-, TM-and CNN-originated lists), we integrate them by restricting the distance between centroid coordinates, i.e. with the constraint ( , ) > 1, where the elliptic distance ( , ) is defined by: for ( , , ) = (9,9,15) , and where = ( 6 , ) , = ) and = ( 6 , ) , = ). The choice of the elliptic distance coefficients takes into account both the effect of elongation along the zaxis in the cryo-ET volume and the size of the particle in voxels of the tomogram (1 vox = 13.48 Å). * with elliptical constraint (a,b,c)= (9,9,15).
Step 1 Step 2 Step 3    Cryo-ET datasets of S. pombe utilized in this study were derived from cryo-FIB lamellae of wild-type, native, yeast cells. 20 tomograms used for comprehensive annotations were acquired with similar parameters in a dose-symmetric tilt scheme 58 (Supplementary Tables 7-9, Methods), with the only difference being imaging with conventional defocus-only (defocus) or the usage of a VPP in addition. All ribosome and FAS particles annotated in the two datasets (defocus and VPP) were subjected to subtomogram analysis, including CTF correction utilizing CTF fitting and 3D CTF model creation in Warp, and 3D refinements, hierarchical, and focused 3D classifications in RELION. During this structural analysis we made several observations, which are discussed in the following.

VPP
As described previously 30,59,76 , acquiring tomography data with a VPP emphasizes low frequencies.
Pronounced low frequencies lead to improved image contrast, which can facilitate data mining. With our method, more particles were detected in VPP than in defocus ground truth annotations of FAS complexes, which represent more challenging targets for data mining due to their low abundance and hollow structural signature. The increased SNR in VPP was also advantageous in hierarchical classifications of ribosomes ( Supplementary Fig. 5), while no difference was apparent for FAS likely due to the low particle numbers (Supplementary Fig. 6). For VPP, all ribosomes clustered in defined averages which confirmed the high performance of DeePiCt and enabled the classification of a subset of 60S large subunits ( Supplementary  Fig. 5). In comparison, most defocus ribosomes from ground truth annotations or DeePiCt predictions clustered in poorly defined classes and no 60S large subunits could be separated even when classification results were improved by optimized particle poses (coordinates and orientations) after multi-particle refinement in M 22 (Supplementary Fig. 7, 8). In addition, in classifications focused on an area close to the head of the ribosomal small subunit and at the peptide exit tunnel, the additional head density was already apparent in 2D slices of the respective classes in the VPP data (Supplementary Fig. 9 and 10, respectively). Similarly, ribosomes close to ER and mitochondria that clustered into classes with adjacent membrane densities are more pronounced in VPP (Supplementary Fig. 11 and 12, respectively).
However, fine structural details that provide insights into functional conformations of molecular complexes are lost or cannot be recovered in the final reconstructions of VPP tomography data 30 . This has been suggested to be caused by inaccurate weighing 30 and a signal loss at high spatial frequencies 77 . The preservation of high-resolution information in defocus in comparison to VPP becomes especially apparent when looking at fine structural details in the subtomogram averages. Densities fitting ACPs of FAS (Extended data Fig. 8 d-e), P-site tRNAs (Extended data Fig. 8 j-k) inside the 80S ribosome and densities connecting ribosomes to an adjacent mitochondrion membrane (Supplementary Fig. 12) could only be resolved in defocus data.
We conclude that the usage of a VPP in tomography acquisition facilitated data mining, including the demonstrated domain generalization of applying DeePiCt models trained in VPP to defocus, and improved 3D classifications. This makes it a valuable tool that ultimately provides benefits at the exploratory phase of structural analysis, helping the design of experiments and analysis pipelines that can eventually be performed on defocus data from which higher resolution reconstructions can be obtained.  Supplementary Fig. 10a) separated in all 3 datasets one class (orange) with membrane density. 3D refinements for defocus were performed in M.
Supplementary Figure 12 | 3D classifications and refinements of mitochondria-bound ribosomes in VPP and defocus ground truth, and DeePiCt predictions in defocus datasets. a-c. 2D slices through 3D-refined subtomogram average of ribosomes in 25 nm distance to the mitochondria recovered from VPP ground truth (gt), defocus gt or DeePiCt prediction on defocus datasets. Focused 3D classifications (cyan sphere mask displayed in Supplementary Fig. 10a) separated in all 3 datasets one class (green) with membrane density. 3D refinements for defocus were performed in M.