Multi-layered maps of neuropil with segmentation-guided contrastive learning

Maps of the nervous system that identify individual cells along with their type, subcellular components and connectivity have the potential to elucidate fundamental organizational principles of neural circuits. Nanometer-resolution imaging of brain tissue provides the necessary raw data, but inferring cellular and subcellular annotation layers is challenging. We present segmentation-guided contrastive learning of representations (SegCLR), a self-supervised machine learning technique that produces representations of cells directly from 3D imagery and segmentations. When applied to volumes of human and mouse cortex, SegCLR enables accurate classification of cellular subcompartments and achieves performance equivalent to a supervised approach while requiring roughly 4,000-fold fewer labeled examples. SegCLR also enables inference of cell types from fragments as small as 10 μm, which enhances the utility of volumes in which many neurites are truncated at boundaries. Finally, SegCLR enables exploration of layer 5 pyramidal cell subtypes and automated large-scale analysis of synaptic partners in mouse visual cortex.

We therefore invite you to revise your manuscript to address these concerns. Please make sure that the full code and data are made available before resubmission. We also ask that you address the concerns about generalization, as well as the other concerns voiced by the reviewers.
We are committed to providing a fair and constructive peer-review process. Do not hesitate to contact us if there are specific requests from the reviewers that you believe are technically impossible or unlikely to yield a meaningful outcome.

When revising your paper:
* include a point-by-point response to the reviewers and to any editorial suggestions.

DATA AVAILABILITY
Please include a "Data availability" subsection in the Online Methods. This section should inform readers about the availability of the data used to support the conclusions of your study, including accession codes to public repositories, references to source data that may be published alongside the paper, unique identifiers such as URLs to data repository entries, or dataset DOIs, and any other statement about data availability. At a minimum, you should include the following statement: "The data that support the findings of this study are available from the corresponding author upon request", describing which data is available upon request and mentioning any restrictions on availability. If DOIs are provided, please include these in the Reference list (authors, title, publisher (repository name), identifier, year). For more guidance on how to write this section please see: http://www.nature.com/authors/policies/data/data-availability-statements-data-citations.pdf

CODE AVAILABILITY
Please include a "Code Availability" subsection in the Online Methods which details how your custom code is made available. Only in rare cases (where code is not central to the main conclusions of the paper) is the statement "available upon request" allowed (and reasons should be specified).
We request that you deposit code in a DOI-minting repository such as Zenodo, Gigantum or Code Ocean and cite the DOI in the Reference list. We also request that you use code versioning and provide a license.
For more information on our code sharing policy and requirements, please see: https://www.nature.com/nature-research/editorial-policies/reporting-standards#availability-of-computer-code

MATERIALS AVAILABILITY
As a condition of publication in Nature Methods, authors are required to make unique materials promptly available to others without undue qualifications.
Authors reporting new chemical compounds must provide chemical structure, synthesis and characterization details. Authors reporting mutant strains and cell lines are strongly encouraged to use established public repositories.
More details about our materials availability policy can be found at https://www.nature.com/nature-portfolio/editorial-policies/reporting-standards#availability-of-materials

ORCID
Nature Methods is committed to improving transparency in authorship. As part of our efforts in this direction, we are now requesting that all authors identified as 'corresponding author' on published papers create and link their Open Researcher and Contributor Identifier (ORCID) with their account on the Manuscript Tracking System (MTS), prior to acceptance. This applies to primary research papers only. ORCID helps the scientific community achieve unambiguous attribution of all scholarly contributions. You can create and link your ORCID from the home page of the MTS by clicking on 'Modify my Springer Nature account'. For more information please visit www.springernature.com/orcid.
Please do not hesitate to contact me if you have any questions or would like to discuss these revisions further. We look forward to seeing the revised manuscript and thank you for the opportunity to consider your work. I would also like to renew my apologies about the delays.

Best regards, Nina
Nina Vogt, PhD Senior Editor Nature Methods

Reviewers' Comments:
Reviewer #1: Remarks to the Author: The manuscript by Dorkenwald et al presents a new method to learn representations of neuron fragments segmented from electron microscopy volumes. The algorithm starts from a segmentation and implements a procedure similar to the well-established SimCLR approach. The two most interesting innovations are the method to sample positive pairs and the addition of an uncertainty prediction for the detection of out-of-distribution fragments. Similar to contrastive learning as used for self-supervised pre-training on natural images, the proposed approach, termed SegCLR, can learn a rich representation which drastically reduces the complexity of the classifiers and the amount of training data in downstream analysis tasks. As demonstration, the method is applied to supervised cell type prediction tasks and unsupervised exploration of various subclusters in the embedding space. I find the manuscript both interesting and important. Efficient exploration of enormous EM datasets is an open problem and it's great to see one of the first attempts to solve it (there is an interesting concurrent development for non-neural tissue here: https://www.biorxiv.org/content/10.1101/2022.05.07.490949v1, which should probably be cited). The unbiased and unsupervised training is also very attractive; I really like the idea of complementing a segmented dataset with such a set of embeddings.
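For readers less familiar with the contrastive setup the reviewer refers to, a minimal NumPy sketch of the NT-Xent objective at the core of SimCLR follows; this is an illustrative reimplementation for orientation only, not the authors' code.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.1):
    """NT-Xent (SimCLR) loss for paired embedding batches z1, z2 of
    shape (N, D), where (z1[i], z2[i]) are the positive pairs."""
    z = np.concatenate([z1, z2])                       # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-normalize rows
    sim = z @ z.T / temperature                        # cosine similarities
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity
    n = len(z1)
    # Index of each sample's positive partner in the concatenated batch.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # Cross-entropy of each sample against its positive partner.
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return float(np.mean(logsumexp - sim[np.arange(2 * n), pos]))
```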
My main concern, however, is that the authors are not giving the community any means to reproduce their findings and build on top of them. As it stands, their contribution is limited to the embeddings for the two big public EM datasets, which will hopefully enable more analysis of these two particular datasets, but nothing else. As the code has not been made available (no, "being prepared for deposition" doesn't count), there is no way to retrain the network on a new dataset. I assume the pretrained one cannot be used for anything else, since the authors train two separate networks for their volumes and never mention cross-dataset application. Similarly, from a method developer's perspective, it is not clear to me how follow-up research could directly compare with the results presented in the paper. For example, Fig. 2E shows that for the mouse dataset the embedding space overlaps for axons and dendrites. This is perfectly fine to demonstrate the proposed approach, but how can one improve on it? Is the human-labeled data available? The authors mention proofreading the segmentations; were the proofread segmentations added back to the public resources? How easy is it to extract the precise training and test datasets used in Fig. 2 from the public resources? On the other hand, arguing from the perspective of a common, not computationally advanced user, from a method published in Nature Methods I would normally expect a friendly GUI and not just source code, while here we don't even have source code available. If the aim of the paper is to reach readers beyond the circle of neuroscientists interested in the two public datasets, these points must be addressed.

Other questions:
- is the manually annotated dataset biased towards cells of the clearest type, or is it a dense annotation of a subvolume?
- do you ever use the "classic" contrastive approach of presenting a fragment and its own augmentation as a positive pair?
- how did you decide on the dimensionality of the embedding space? Why not more?
- how dangerous are the segmentation errors? I understand that the learning was performed on a proofread dataset; how much would the embedding change if you applied it to the previous version? I'm asking to get an idea of how useful the method would be with the imperfect segmentations most of us have to live with.
- do you think the difference in performance for subcompartment classification is biological, or is it caused by segmentation errors? What do you think it implies for training on the next dataset, e.g. Drosophila?
Reviewer #2: Remarks to the Author: In this manuscript Dorkenwald et al. propose SegCLR, a method for self-supervised learning of morphological representations of cell fragments in segmented EM data of neural tissue. Annotating brain volume EM datasets is not only time-consuming but also highly challenging, partially because cell processes can extend across vast lengths and significant cell parts, including somas, are often missing from the imaged volume. To address this problem the authors adapt the latest developments in the field of representation learning for natural images to segmented brain EM data and design an unsupervised pipeline that extracts representations from small cell fragments. The authors further show how such representations can be used for multiple annotation tasks.
I find the proposed method to be technically sound and promising for analyzing electron microscopy imaging data. The fact that classifying cellular subcompartments requires 4,000 times less training data when using SegCLR representations shows the potential of the method to significantly speed up the analysis of brain EM volumes. Furthermore, reaching high accuracy in classifying cell types even on short cell fragments that are challenging for human annotators would allow for more unbiased, larger-scale analysis of brain circuits. However, I found the unsupervised data exploration part much less convincing, since most of it heavily relies on extensive manual annotations. Moreover, most of the analysis presented has been performed only on the manually proofread parts of the data, raising the question of how well the method would generalize to the whole dataset. Furthermore, the clarity of the manuscript could be improved. Considering this, I believe the following points should be addressed before the paper is considered for publishing:

Major:
- The flow of the introduction could be improved. For example, it would be easier to follow if self-supervised/contrastive learning were explained before introducing SegCLR. Moreover, it would be helpful if the contributions of the method were listed in one or sequential paragraphs. I guess that moving paragraph 3 after the paragraph introducing self-supervised learning would already greatly facilitate understanding what the method does. Comparison to the previous work should rather focus on the novelty of the presented method and the new types of analysis it enables, as opposed to listing the types of analysis not done in the previous work. For example, consider the sentence "Schubert et al. trained cell representations using a triplet loss, but it was not reported whether these representations are suitable for downstream analyses". Schubert et al. did show the usefulness of their representations on multiple tasks. However, using 3D EM data clearly results in richer representations than using 2D projections or shape only, and I feel like the authors should rather focus on this strength. Finally, the concurrent work on self-supervised/contrastive learning on EM data (Wilson et al., 2022; Zinchenko et al., 2023) should also be cited for completeness.
- I found the figures to be a bit misleading, because except for Figure 1b only the cell skeletons are shown, not the underlying EM data. Especially since the authors mostly showed a much bigger cutout than was used for extracting representations, this often gave a false impression of the task being rather trivial to solve. However, it is impressive that the cell types and cellular subcompartments can be predicted from small fragments, so showing the actual fragments would not only help convey a feeling of how different the underlying texture is, but also send a much stronger message.
-"For evaluation of local cell type classification (Fig. 4), it was important to have cells with minimal reconstruction merge errors in the ground truth labeled set."I would argue that such a requirement greatly diminishes the value of the presented results.Since manual segmentation proofreading and correction is mostly more time-consuming than cell annotation, I would expect the method to be often applied to datasets where segmentation errors are not rare.Thus, it would be extremely beneficial if the authors could adjust their cell classification pipeline to deal with merge errors as well.For example, given the promising results in the Figure 5g, the authors could use their out-of-distribution input detection to locate merge errors that would further increase the value of the method.
- Figure 2: why is "the performance of a fully supervised ResNet-18 classifier trained on the full available training data" only shown for the human data, but not for the mouse data? The notably worse performance of the method on the mouse data in comparison to the human data should also be explained.
-"We used unsupervised UMAP projection to visualize samples of embeddings in the human and mouse datasets, and readily observed separate regions in UMAP space for glia versus neurons, and for axons, dendrites, and somas".The fact that there were separate regions in the UMAP space should not be used to claim that SegCLR is useful for unsupervised exploration, because these UMAP plots in the first place were generated from a representative subsample of embeddings, where the definition of "representative" is based on manual annotations.If it is computationally infeasible to use UMAP on all of the generated embeddings, a random subsample of the embeddings should be selected to run UMAP on.
- Similarly, a specific set of cells that was proofread and labeled by human experts should not be used to showcase the use of SegCLR embeddings for unsupervised data exploration. If the goal is to focus on a smaller subset of cells, predictions of the cell type classifier could be used to extract a specific subset.
- Furthermore, the semi-automatic pipeline of cluster extraction for unsupervised exploration of mouse visual cortex layer-5 pyramidal cells requires too many manual choices to be considered unsupervised: 3 clusters were defined based on UMAP components 2 and 3 by running k-means with 25 groups on the first 5 UMAP components and assigning them to the visual UMAP clusters. I feel like this complexity could be reduced by using better clustering methods in the first place. While k-means is fast and efficient, it is not the best option for clustering sparse high-dimensional data. I would expect better results could be obtained using density-based HDBSCAN on several UMAP components, or graph-based clustering methods directly on the representations, for example the Louvain or Leiden algorithms commonly used in the single-cell transcriptomics field (a minimal sketch of this alternative follows these major points).
- Code availability: this section should include the link to the used code.
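To make the density-based alternative suggested above concrete, here is a minimal sketch using HDBSCAN on a few UMAP components; the embedding file name, dimensionalities, and min_cluster_size are illustrative placeholders, and the snippet assumes umap-learn and scikit-learn >= 1.3.

```python
import numpy as np
import umap                          # umap-learn
from sklearn.cluster import HDBSCAN  # scikit-learn >= 1.3

# embeddings: (N, D) array of SegCLR-style embeddings; the file
# name and dimensionality here are placeholders.
embeddings = np.load("embeddings.npy")

# Project to a few components, analogous to the manuscript's pipeline.
coords = umap.UMAP(n_components=5, random_state=0).fit_transform(embeddings)

# HDBSCAN adapts to clusters of varying density and marks outliers
# as -1, avoiding a hand-picked k and manual cluster merging.
labels = HDBSCAN(min_cluster_size=50).fit_predict(coords)
print(f"{labels.max() + 1} clusters, {np.sum(labels == -1)} outliers")
```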
Minor:
- The gray color scale for heatmaps is a bit difficult to read.
- Figure 3a: what is WM? It would also be great if the scale bars of the upper and the lower rows were vertically aligned.
- Figure 6: it would be helpful if the abbreviations were explained in the figure legends.
-Methods: "For comparison, we also trained a fully-supervised subcompartment classifier directly on voxel inputs using an identical 3d ResNet-18 architecture and input configuration (photometric augmentation was omitted and random 3d rotations were added)".What was the reason to omit photometric augmentation?It seems important given the dataset with varying intensities.
-Methods: "For each fold we found the uncertainty threshold that maximized the F1-Score of the indistribution vs out-of-distribution task."I feel like using one threshold for all folds would be more appropriate.Or at least these thresholds should be reported to give an estimate of their variance.
-Methods: "Next, we labeled nodes as uncertain where the predicted uncertainty was above 0.45."How was this threshold determined?In case of no evaluation data available, would it not be more intuitive to use 0.5?-Methods: "Here, we assigned a thalamocortical label when the predicted thalamocortical probability exceeded the summed probability across all pyramidal cell types and the pyramidal subtype with the highest predicted probability otherwise".This has to be justified.
- It would be helpful if the Methods section contained direct links to access the data when it is mentioned for the first time, including links to the used datasets, generated representations and cell type annotations, as well as the used code.
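For reference, the per-fold threshold selection quoted above amounts to a one-dimensional sweep like the following sketch (illustrative; the paper's exact protocol may differ). Reporting the threshold returned per fold would directly address the variance concern.

```python
import numpy as np
from sklearn.metrics import f1_score

def best_uncertainty_threshold(uncertainty, is_ood, n_steps=101):
    """Return the threshold on a predicted-uncertainty score that
    maximizes F1 for in- vs out-of-distribution detection.

    uncertainty: (N,) array of predicted uncertainty scores.
    is_ood: (N,) boolean array, True for out-of-distribution samples.
    """
    thresholds = np.linspace(0.0, 1.0, n_steps)
    # Treat "uncertainty >= t" as an OOD prediction and score each t.
    scores = [f1_score(is_ood, uncertainty >= t) for t in thresholds]
    best = int(np.argmax(scores))
    return thresholds[best], scores[best]
```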
Due to my limited proficiency in neurobiology, the biological conclusions of the presented analyses are outside the scope of my expertise.

Reviewer #3: Remarks to the Author: The proposed extension of SimCLR consists in the consideration of additional positive sample pairs from nearby locations within the same segment, and thus requires a prior segmentation of the EM volume. The embeddings learnt with this method (SegCLR) are shown to be very information-rich: they allow classification of cellular subcompartments and cell types when used as input to a shallow classification network, requiring fewer annotated samples than fully supervised methods on the same image data.
The extension to SimCLR is rather simple: in SimCLR, positive pairs are obtained through augmentations (e.g., rotations, elastic deformations, intensity changes). In SegCLR, additional positive pairs are sampled within a threshold distance within the same object, as given by a segmentation. The significance of this work lies mostly in the application of the extension to two connectomics datasets and the demonstration that the learnt embeddings aid substantially in solving downstream analysis tasks (e.g., cell typing of downstream partners from local post-synaptic image patches). The presented experiments convincingly support the claims that SegCLR embeddings allow cellular subcompartment classification and cell typing on two large EM datasets.
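A minimal sketch of the segmentation-guided positive-pair sampling described here, assuming segments are represented by skeleton node coordinates; the data layout and distance threshold are illustrative assumptions, not the authors' exact choices.

```python
import numpy as np

def sample_segclr_positive_pair(skeleton_nodes, max_dist_nm=150.0, rng=None):
    """Sample two nearby locations on the same segmented object to serve
    as an additional positive pair (SegCLR's extension to SimCLR).

    skeleton_nodes: dict mapping segment ID -> (N, 3) array of node
    coordinates in nm.
    """
    rng = rng or np.random.default_rng()
    seg_id = rng.choice(list(skeleton_nodes))
    nodes = skeleton_nodes[seg_id]
    # First view: a random location on the segment.
    a = nodes[rng.integers(len(nodes))]
    # Second view: another location on the same object within the
    # threshold; standard augmentations would then be applied to the
    # image crops at both locations, as in SimCLR.
    d = np.linalg.norm(nodes - a, axis=1)
    near = nodes[(d > 0) & (d <= max_dist_nm)]
    b = near[rng.integers(len(near))] if len(near) else a
    return seg_id, a, b
```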
1. The claim that SegCLR reduces the amount of labelled data required by a factor of 4,000 stems from the 4-class subcompartment task on the human dataset: "On the 4-class subcompartment task, the embedding-based classification matches the performance of direct supervised training while requiring roughly 4,000 times less labeled training data (10-run median F1-Score, ~400 examples total)". This claim is not sufficiently backed up.
The fully supervised baseline has only been trained on the full dataset. Judging from the progression of F1-scores on the embedding-based classifications, not all of the provided training data might be needed. This is not accounted for.
The fully supervised baseline should also be evaluated on randomly sampled subsets of different sizes (as done for SegCLR) to see at what point the baseline performance plateaus. Statements about the amount of training data needed for either method should be made with respect to the obtained F1-score (e.g., "SegCLR requires X-fold less training data than a fully supervised baseline to reach an F1-score of Y").
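The requested comparison reduces to running the supervised baseline through the same subsampling sweep used for SegCLR; a sketch follows, in which train_fn and the nested-subset scheme are illustrative assumptions rather than the authors' protocol.

```python
import numpy as np
from sklearn.metrics import f1_score

def data_efficiency_curve(X, y, X_test, y_test, train_fn, sizes, seed=0):
    """Macro-F1 as a function of labeled-set size, using nested random
    subsets so scores are directly comparable across sizes.

    train_fn is a placeholder that fits and returns any classifier with
    a .predict() method (e.g., refitting the supervised ResNet-18
    baseline, or a shallow classifier on precomputed embeddings).
    X, y, X_test, y_test are NumPy arrays.
    """
    order = np.random.default_rng(seed).permutation(len(X))
    curve = {}
    for n in sizes:
        model = train_fn(X[order[:n]], y[order[:n]])  # nested subsets
        curve[n] = f1_score(y_test, model.predict(X_test), average="macro")
    return curve
```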
2. The input to the SegCLR network is a downsampled volume (32-40 nm resolution). I would appreciate an elaboration on the rationale behind the downsampling, especially since the discussion itself mentions that the lower resolution impedes SegCLR's capability to analyze finer ultrastructure like vesicles (which is presumably relevant for other downstream analysis tasks beyond compartment and cell type classification). Did the authors apply SegCLR at the native resolution but find the results to be inferior for the presented analysis tasks? This would be worth reporting. Or are there technical considerations favoring a near-isotropic input (e.g., rotation augmentations in SO(3)) that are crucial for learning embeddings?
3. Figure 2f does not contain a fully supervised baseline but is otherwise similar to Figure 2d. My understanding is that a fully supervised method can be trained on the same locations for which SegCLR produced embeddings. Please provide the baseline comparison here as well, in particular since it would provide evidence for the claim that SegCLR requires less training data.
* Methods, "Automated analysis of synaptic partners": "we λ" -> "we set λ"

Author Rebuttal to Initial comments
Thank you again for your interest in Nature Methods. Please do not hesitate to contact me if you have any questions. We will be in touch again soon.

Best regards, Nina
Nina Vogt, PhD Senior Editor Nature Methods

Reviewer #2 (Remarks to the Author): The authors have addressed most of the comments and I feel like the new experiment on cross-dataset application effectively demonstrates the generalizability of the proposed method.
Although I still think that Figure 2F requires a fully supervised baseline, given the amount of time required to train another fully supervised network, I understand the authors' reluctance to do it.
I still feel like the authors cannot make the following claim: "We used unsupervised UMAP projection to visualize samples of embeddings in the human and mouse datasets, and readily observed separate regions in UMAP space for glia versus neurons, and for axons, dendrites, and somas". The authors should either clearly specify in the main text that they visualized a preselected, manually labeled set of embeddings, or plot the available labels on the UMAP of a random subsample of the embeddings (enriching for the labeled ones, if necessary).
Reviewer #3 (Remarks to the Author): The revision addresses most of the points that have been brought up. I commend especially the data and code release, which will make the method more easily available for refinement and downstream analysis. I recommend publication of the manuscript with a few minor revisions:
1. I suggest adding the response regarding the choice of the 32/40 nm resolution to the discussion.
2. Figure 2d's y-axis could be rescaled a bit to make it easier to see the fine differences in F1-score.
3. Supplemental Figure 2b could show a different raw image sample to highlight that the training data is from a different dataset than in panel a.
4. Page 7, paragraph at bottom: I would appreciate it if the F1-scores of all alternatives were mentioned to make it easier to assess the generalization performance (i.e., "trained on H01", "trained on MICrONS" (with and without finetuning), and "fully supervised").

Final Decision Letter:
Dear Viren, I am pleased to inform you that your Article, "Multi-Layered Maps of Neuropil with Segmentation-Guided Contrastive Learning", has now been accepted for publication in Nature Methods. Your paper is tentatively scheduled for publication in our December print issue, and will be published online prior to that. The received and accepted dates will be November 18th, 2022 and October 2nd, 2023. This note is intended to let you know what to expect from us over the next month or so, and to let you know where to address any further questions.
Acceptance is conditional on the data in the manuscript not being published elsewhere, or announced in the print or electronic media, until the embargo/publication date.These restrictions are not intended to deter you from presenting your data at academic meetings and conferences, but any enquiries from the media about papers not yet scheduled for publication should be referred to us.
Over the next few weeks, your paper will be copyedited to ensure that it conforms to Nature Methods style. Once your paper is typeset, you will receive an email with a link to choose the appropriate publishing options for your paper and our Author Services team will be in touch regarding any additional information that may be required.
Please note that Nature Methods is a Transformative Journal (TJ). Authors may publish their research with us through the traditional subscription access route or make their paper immediately open access through payment of an article-processing charge (APC). Authors will not be required to make a final decision about access to their article until it has been accepted. Find out more about Transformative Journals at https://www.springernature.com/gp/open-research/transformative-journals, including self-archiving policies (https://www.springernature.com/gp/open-research/policies/journal-policies). Those licensing terms will supersede any other terms that the author or any third party may assert apply to any version of the manuscript.
If you have any questions about our publishing options, costs, Open Access requirements, or our legal forms, please contact ASJournals@springernature.com. Your paper will now be copyedited to ensure that it conforms to Nature Methods style. Once proofs are generated, they will be sent to you electronically and you will be asked to send a corrected version within 24 hours. It is extremely important that you let us know now whether you will be difficult to contact over the next month. If this is the case, we ask that you send us the contact information (email, phone and fax) of someone who will be able to check the proofs and deal with any last-minute problems.
Once your manuscript is typeset and you have completed the appropriate grant of rights, you will receive a link to your electronic proof via email with a request to make any corrections within 48 hours. If, when you receive your proof, you cannot meet this deadline, please inform us at rjsproduction@springernature.com immediately.
Once your paper has been scheduled for online publication, the Nature press office will be in touch to confirm the details.
If you have posted a preprint on any preprint server, please ensure that the preprint details are updated with a publication reference, including the DOI and a URL to the published version of the article on the journal website.
Content is published online weekly on Mondays and Thursdays, and the embargo is set at 16:00 London time (GMT)/11:00 am US Eastern time (EST) on the day of publication. If you need to know the exact publication date or when the news embargo will be lifted, please contact our press office after you have submitted your proof corrections. Now is the time to inform your Public Relations or Press Office about your paper, as they might be interested in promoting its publication. This will allow them time to prepare an accurate and satisfactory press release. Include your manuscript tracking number NMETH-A50990B and the name of the journal, which they will need when they contact our office. About one week before your paper is published online, we shall be distributing a press release to news organizations worldwide, which may include details of your work. We are happy for your institution or funding agency to prepare its own press release, but it must mention the embargo date and Nature Methods. Our Press Office will contact you closer to the time of publication, but if you or your Press Office have any inquiries in the meantime, please contact press@nature.com.
To assist our authors in disseminating their research to the broader community, our SharedIt initiative provides you with a unique shareable link that will allow anyone (with or without a subscription) to read the published article. Recipients of the link with a subscription will also be able to download and print the PDF.
As soon as your article is published, you will receive an automated email with your shareable link.
You can now use a single sign-on for all your accounts, view the status of all your manuscript submissions and reviews, access usage statistics for your published articles and download a record of your refereeing activity for the Nature journals. Nature Portfolio journals encourage authors to share their step-by-step experimental protocols (https://www.nature.com/nature-research/editorial-policies/reporting-standards#protocols) on a protocol sharing platform of their choice. Nature Portfolio's Protocol Exchange is a free-to-use and open resource for protocols; protocols deposited in Protocol Exchange are citable and can be linked from the published article. More details can be found at www.nature.com/protocolexchange/about. Please note that you and any of your coauthors will be able to order reprints and single copies of the issue containing your article through Nature Portfolio's reprint website, which is located at http://www.nature.com/reprints/author-reprints.html. If there are any questions about reprints please send an email to author-reprints@nature.com and someone will assist you.
Please feel free to contact me if you have questions about any of these points.

Best regards, Nina
Nina Vogt, PhD Senior Editor