Self-supervised deep learning encodes high-resolution features of protein subcellular localization

Explaining the diversity and complexity of protein localization is essential to fully understand cellular architecture. Here we present cytoself, a deep-learning approach for fully self-supervised protein localization profiling and clustering. Cytoself leverages a self-supervised training scheme that does not require preexisting knowledge, categories or annotations. Training cytoself on images of 1,311 endogenously labeled proteins from the OpenCell database reveals a highly resolved protein localization atlas that recapitulates major scales of cellular organization, from coarse classes, such as nuclear and cytoplasmic, to the subtle localization signatures of individual protein complexes. We quantitatively validate cytoself’s ability to cluster proteins into organelles and protein complexes, showing that cytoself outperforms previous self-supervised approaches. Moreover, to better understand the inner workings of our model, we dissect the emergent features from which our clustering is derived, interpret them in the context of the fluorescence images, and analyze the performance contributions of each component of our approach.

Cropping based on fiducial channel centering versus content-based centering. Since the fiducial nuclear marker is used to center the input images around a nucleus, is the marker necessary? To answer this question, we trained cytoself on a dataset cropped on the basis of image content alone (local image entropy), forgoing the nuclear channel entirely. We compared the clustering scores obtained from this dataset with those obtained from the dataset cropped by centering on nuclei and found the difference to be negligible (see Supp. Fig. 8). This result shows that the texture of the protein localization patterns is more important than the position of the fiducial marker relative to the protein fluorescence, or its position within the cropped images. The main advantage of using the nuclear fiducial marker is to optimize the layout of the crops relative to the cells: ideally, we want one crop per cell and one cell per crop. In contrast, random cropping without a fiducial marker cannot ensure that every cell is used.
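For concreteness, below is a minimal sketch of such content-based cropping, assuming a single-channel protein image. The function name, crop size, and entropy window are illustrative assumptions, not the exact OpenCell/cytoself preprocessing.

```python
# A minimal sketch of content-based crop selection via local image entropy.
# `crop_by_entropy` and its parameters are illustrative assumptions.
import numpy as np
from skimage.filters.rank import entropy
from skimage.morphology import disk
from skimage.util import img_as_ubyte

def crop_by_entropy(image, crop_size=100, n_crops=4, radius=10):
    """Pick crop centers at local-entropy maxima of the protein channel."""
    # Local entropy highlights textured, information-rich regions.
    ent = entropy(img_as_ubyte(image / image.max()), disk(radius))
    half = crop_size // 2
    # Zero out the borders so crops stay fully inside the image.
    ent[:half, :] = 0
    ent[-half:, :] = 0
    ent[:, :half] = 0
    ent[:, -half:] = 0
    crops = []
    for _ in range(n_crops):
        y, x = np.unravel_index(np.argmax(ent), ent.shape)
        crops.append(image[y - half:y + half, x - half:x + half])
        # Suppress this neighborhood so the next crop lands elsewhere.
        ent[y - half:y + half, x - half:x + half] = 0
    return crops
```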
Dataset splitting into training, validation, and test sets. The training protocol described in the Methods section introduces data leakage between training, validation, and test data at the pixel level. Another approach would be to split crops per field of view, ensuring that each pixel occurs in only one subset (train, validation, or test). In the following, we show that splitting our data by field of view does not change our results. We also explain why splitting the data into train, validation, and test sets is not as critical for self-supervised learning as it is for supervised learning.
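A per-field-of-view split of this kind can be expressed with standard tooling. Below is a minimal sketch using scikit-learn's GroupShuffleSplit, where `crops` and `fov_ids` (one field-of-view identifier per crop) are assumed inputs; the split fractions are illustrative.

```python
# A minimal sketch of leakage-free splitting by field of view (FOV):
# whole FOVs are assigned to train/val/test so that no pixel appears in
# more than one subset. `crops` and `fov_ids` are assumed inputs.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def split_by_fov(crops, fov_ids, test_size=0.1, val_size=0.1, seed=0):
    fov_ids = np.asarray(fov_ids)
    # First carve out the test set along FOV boundaries.
    gss = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    trainval_idx, test_idx = next(gss.split(crops, groups=fov_ids))
    # Then split the remainder, again along FOV boundaries, for validation.
    gss_val = GroupShuffleSplit(n_splits=1,
                                test_size=val_size / (1.0 - test_size),
                                random_state=seed)
    train_sub, val_sub = next(gss_val.split(trainval_idx,
                                            groups=fov_ids[trainval_idx]))
    return trainval_idx[train_sub], trainval_idx[val_sub], test_idx
```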
First, we revisit our motivations for splitting the data into training, validation, and test sets. In a supervised setting, this split serves two important purposes: (i) the test set is used to estimate the performance of the model after supervised training, an estimate that is likely to generalize to further unseen data, provided that data is in-distribution; (ii) the validation set is used during training to adjust the learning rate and to enable early stopping, which avoids over-fitting that could degrade performance on the test set. These considerations (i, ii) are important for supervised learning. In our case, however, all training is self-supervised: the auto-encoder reconstruction and protein identification pretext tasks are not used after training, and performance metrics such as losses are not important for our end purpose. For the typical use case of generating feature vectors from input images, we never need to infer the identity of the tagged proteins, nor do we need to reconstruct the images. While we do not use the pretext tasks per se after training, we do use the resulting trained models and the latent representations that these models produce for given input images. We evaluate these models independently using our clustering score, which is based on manually curated localization annotations. It follows, then, that with our approach we could simply use the full dataset for training, without splitting the data. However, it is often advantageous for technical purposes to do a train-val-test split in order to measure model convergence and detect over-fitting; the only disadvantage, perhaps, is forgoing the chance to train cytoself on all of our data. Out of an abundance of caution, we use the test data for all analyses, although for the reasons explained above we could also have used the training data. Notwithstanding, it is in general preferable to avoid over-fitting, even in a self-supervised setting.
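Since the clustering score, rather than the pretext losses, is what we ultimately care about, a sketch of this style of evaluation may help. The adjusted Rand index below is purely an illustrative stand-in for the clustering score defined in the Methods; `latent_vectors` and `annotations` are assumed inputs.

```python
# An illustrative sketch of evaluating a trained model by clustering its
# latent representations against curated localization annotations. The
# adjusted Rand index is a stand-in for the paper's clustering score.
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def clustering_score(latent_vectors, annotations, seed=0):
    n_classes = len(set(annotations))
    labels = KMeans(n_clusters=n_classes, n_init=10,
                    random_state=seed).fit_predict(latent_vectors)
    return adjusted_rand_score(annotations, labels)
```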
To verify that our model does not over-fit, we split our dataset per field of view and retrained the cytoself model. As shown in Supp. Fig. 9a, the gap between training and validation loss does not increase after about 120 epochs and 5 days of training, indicating that over-fitting does not occur. Further evidence that our model did not over-fit to the training data is that cytoself generalizes to images from the Allen Cell Collection (see Extended Data Fig. 8). Finally, we verify that our results are not sensitive to the data-splitting method. First, we check whether the results of our ablation study still hold when splitting our dataset per field of view by recomputing the clustering scores. As shown in Supp. Fig. 9b, the relative ranking of the model variants stays roughly the same, showing that the different splitting scheme does not affect the relative performance between variants of our approach. Similarly, we recompute the UMAP in Fig. 2 of our manuscript and find no difference in how well clustered the data is (see Supp. Fig. 9c); a minimal sketch of this re-embedding step is given below. Lastly, we redo the analysis on FAM241A and reach the same conclusion (see Supp. Fig. 9d). Overall, these results show that the technical choice of data splitting does not affect our results or conclusions.

Supplementary Figure 5: Comparing the predicted localization categories of proteins present in both OpenCell and the Allen Institute dataset. We find 9 proteins in the intersection between the two datasets (9 out of 11 in the Allen dataset). We compute the feature spectra from images in the Allen dataset, predict the corresponding localization categories, and compare these to the predictions made on the basis of the OpenCell images. Localization categories are predicted in the same way as for FAM241A.
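As referenced above, recomputing the UMAP amounts to re-embedding the latent vectors obtained under the per-FOV split. Below is a minimal sketch using umap-learn; the parameter values and the `latent_vectors` input are illustrative, not necessarily those used to produce Fig. 2 or Supp. Fig. 9c.

```python
# A minimal sketch of re-embedding latent vectors with UMAP (umap-learn).
# Parameter values are illustrative, not necessarily those used in Fig. 2.
import umap

def embed_latents(latent_vectors, n_neighbors=15, min_dist=0.1, seed=0):
    """Project high-dimensional latent vectors to 2D for visualization."""
    reducer = umap.UMAP(n_neighbors=n_neighbors, min_dist=min_dist,
                        random_state=seed)
    return reducer.fit_transform(latent_vectors)
```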
Supplementary Figure 6: Highly correlated spectra imply shared protein complex membership. (a) Feature spectra of protein pairs that are in the same complex according to CORUM [5] show significantly higher correlations than those of pairs that are not in the same complex, confirming quantitatively that the feature spectra are sensitive enough to encode complex-specific patterns. However, the spread in correlations also indicates that not all interacting proteins have strongly correlated spectra, which is expected considering that proteins can participate in different protein complexes and thus exhibit mixed localizations. In contrast, the correlations of feature spectra for protein pairs that are not in CORUM are typically close to zero with less spread, suggesting that it is rare for non-interacting proteins to have highly correlated spectra. (b) We plot the proportion of proteins present in both OpenCell and CORUM that share protein-complex membership with their most correlated neighbor. When we consider only correlations above a threshold of 0.95, we find that in 83.3% of cases the protein with the strongest correlation is in a shared complex. For a threshold of 0.90 the value is 66.3%, and for a threshold of 0.5 the value is 47.9%. Each box indicates the extent from the first to the third quartile of the data, with a line representing the median. The whiskers indicate 1.5 times the inter-quartile range. The p-value is computed using a two-sided Mann-Whitney U test.
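For concreteness, below is a hedged sketch of the panel (b) computation and the panel (a) significance test. The array `spectra` (one feature spectrum per protein) and the membership lookup `same_complex` are assumed inputs, and Pearson correlation is assumed; this is not necessarily the exact implementation behind the figure.

```python
# Sketch of panel (b): for each protein, find its most correlated neighbor
# by feature-spectrum correlation and, above a threshold, check shared
# CORUM complex membership. `spectra` is an assumed (n_proteins, n_features)
# array; `same_complex(i, j)` is an assumed boolean membership lookup.
import numpy as np
from scipy.stats import mannwhitneyu

def shared_complex_fraction(spectra, same_complex, threshold=0.95):
    corr = np.corrcoef(spectra)       # pairwise Pearson correlations
    np.fill_diagonal(corr, -np.inf)   # exclude self-correlation
    hits, total = 0, 0
    for i in range(len(corr)):
        j = int(np.argmax(corr[i]))   # most correlated neighbor
        if corr[i, j] >= threshold:
            total += 1
            hits += bool(same_complex(i, j))
    return hits / total if total else float("nan")

# Panel (a)'s significance test: two-sided Mann-Whitney U between the
# correlations of in-complex and out-of-complex protein pairs.
def complex_correlation_pvalue(in_complex_corrs, out_complex_corrs):
    return mannwhitneyu(in_complex_corrs, out_complex_corrs,
                        alternative="two-sided").pvalue
```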