devCellPy is a machine learning-enabled pipeline for automated annotation of complex multilayered single-cell transcriptomic data

A major informatic challenge in single-cell RNA-sequencing analysis is the precise annotation of datasets in which cells exhibit complex multilayered identities or transitory states. Here, we present devCellPy, a highly accurate and precise machine learning-enabled tool for automated prediction of cell types across complex annotation hierarchies. To demonstrate the power of devCellPy, we construct a murine cardiac developmental atlas from published datasets encompassing 104,199 cells from E6.5 to E16.5 and train devCellPy to generate a cardiac prediction algorithm. Using this algorithm, we observe high prediction accuracy (>90%) across multiple layers of annotation and across de novo murine developmental data. Furthermore, we conduct a cross-species prediction of cardiomyocyte subtypes from in vitro-derived human induced pluripotent stem cells and unexpectedly uncover a predominance of left ventricular (LV) identity, which we confirm using an LV-specific TBX5 lineage tracing system. Together, our results show devCellPy to be a useful tool for automated cell prediction across complex cellular hierarchies, species, and experimental systems.


March 2021

Data availability
Raw scRNA-seq data for the datasets generated for this manuscript are publicly available under GEO accession number GSE184943. GEO accession numbers for the published datasets used in this study can be found in the Data Availability Statement of the manuscript.

Life sciences study design
Sample size
Sample sizes for testing the accuracy of the devCellPy algorithm in predicting cell identities were determined by analyzing the minimum number of cells necessary for accurate cell prediction. This analysis revealed that a minimum of 16 cells per cell category for training provided an average accuracy of 83%, and a minimum of 128 cells per category provided 95% prediction accuracy. Given this, we determined that greater than 32 cells per category was a sufficient sample size for training the algorithm and evaluating its overall accuracy. For testing devCellPy, 10-fold cross-validation was implemented whereby a test dataset was randomly divided into stratified partitions of 90% and 10%, such that the 90% partition was used for training and the 10% partition for evaluation of the algorithm. By conducting 10-fold random shuffling and stratified repartitioning, the entire dataset could be tested across individual folds and a statistical estimate of devCellPy's overall prediction error could be calculated.
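The stratified 90%/10% partitioning described above can be illustrated with a minimal sketch. This is not devCellPy's own code: the function name `stratified_kfold`, the class names, and the cell counts are hypothetical, and dealing cells round-robin into folds is one assumed way to keep class proportions equal across folds.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each fold holds out ~1/k of every class."""
    rng = random.Random(seed)

    # Group cell indices by their annotated class.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle within each class, then deal cells round-robin across the k folds
    # so that every fold receives a near-equal share of every class.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    # Each fold serves once as the ~10% evaluation set; the rest is training.
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(j for f, fold in enumerate(folds) if f != i for j in fold)
        yield train_idx, test_idx


# Hypothetical labels: 40 cells in each of three made-up classes.
labels = ["atrial"] * 40 + ["ventricular"] * 40 + ["endocardial"] * 40
splits = list(stratified_kfold(labels, k=10))
```

With these toy labels, every fold trains on 108 cells and evaluates on 12, and each cell is evaluated exactly once across the ten folds.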
All single-cell RNA-seq datasets used in this study contained on the order of thousands of cells, with multiple hundreds of cells per cell type assessed in all experiments. This provided sufficient cell numbers (>32 cells) per cell category for training devCellPy. devCellPy does not require a minimum number of cells for prediction and can predict as few as 1 cell, thus allowing evaluation of the algorithm on smaller datasets that it has never encountered.
For the flow cytometry experiment using the TBX5 reporter cell line (Figure 8D), experiments were conducted with 23 independent biological replicates, which vastly exceeds the minimum of 3 biological replicates required to calculate the standard error.
Data exclusions
For the public datasets used to construct the cardiac developmental cell atlas, we filtered out low-quality cells by excluding cells in which the number of genes, number of mRNA counts, percent mitochondrial gene expression, or percent ribosomal gene expression deviated from the median of that metric by more than three times the median absolute deviation in either direction. Because our study focused on mesodermal derivatives during cardiac development, we filtered out non-mesodermal cell types, which included endodermal and ectodermal germ layer derivatives previously identified in the original papers from which the data were obtained. For the datasets used for testing the devCellPy algorithm, including datasets from Lescroart et al. 2019, Li et al. 2019, and Miyamoto et al. 2021, we followed the same procedure and excluded non-mesodermal cell types in order to test the concordance of devCellPy's cell predictions with manual annotation of unsupervised clusters.
For the datasets freshly collected for this manuscript, including the E10.5 mouse embryonic heart data and the human iPSC-derived cardiomyocyte data, we again filtered out low-quality cells by excluding cells in which the number of genes, number of mRNA counts, percent mitochondrial gene expression, or percent ribosomal gene expression deviated from the median of that metric by more than three times the median absolute deviation in either direction. For the mouse E10.5 data, we filtered out non-mesodermal cell types for testing of the devCellPy-generated cardiac prediction algorithm. For the human iPSC-derived cardiomyocyte data, we focused our analysis on cardiomyocyte populations and therefore filtered out non-cardiomyocyte populations based on the expression of the established canonical cardiomyocyte markers TNNT2, ACTA2, and TNNI1.
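The quality-control rule used in both filtering steps above (keep a cell only if every metric lies within three median absolute deviations of the median) can be sketched as follows. This is an illustrative sketch, not the authors' pipeline: the metric keys (`n_genes`, `n_counts`, `pct_mito`, `pct_ribo`), the helper names, and the toy values are all assumptions.

```python
from statistics import median


def mad_bounds(values, n_mads=3.0):
    """Return (lower, upper) bounds at median +/- n_mads * median absolute deviation."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return med - n_mads * mad, med + n_mads * mad


def passes_qc(cells, metrics, n_mads=3.0):
    """Return one boolean per cell: True if every metric lies within its bounds."""
    bounds = {m: mad_bounds([c[m] for c in cells], n_mads) for m in metrics}
    return [
        all(bounds[m][0] <= c[m] <= bounds[m][1] for m in metrics)
        for c in cells
    ]


# Hypothetical per-cell QC metrics; the last cell has very high mitochondrial content.
METRICS = ("n_genes", "n_counts", "pct_mito", "pct_ribo")
cells = [
    {"n_genes": 2000, "n_counts": 8000, "pct_mito": 4.0, "pct_ribo": 20.0},
    {"n_genes": 2100, "n_counts": 8200, "pct_mito": 4.5, "pct_ribo": 21.0},
    {"n_genes": 1900, "n_counts": 7900, "pct_mito": 3.5, "pct_ribo": 19.0},
    {"n_genes": 2050, "n_counts": 8100, "pct_mito": 4.2, "pct_ribo": 20.5},
    {"n_genes": 1950, "n_counts": 7800, "pct_mito": 3.8, "pct_ribo": 19.5},
    {"n_genes": 2000, "n_counts": 8000, "pct_mito": 35.0, "pct_ribo": 20.0},
]
keep = passes_qc(cells, METRICS)
```

In this toy example only the last cell is excluded, because its mitochondrial fraction lies far outside the bounds while the other three metrics are unremarkable.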
Replication
We demonstrate the high performance of the devCellPy prediction pipeline using multiple datasets in our manuscript. The first large dataset used for testing devCellPy was a cardiac atlas consisting of more than 100,000 cells. To test the performance of the algorithm on each layer of annotation, each layer was trained independently on subsets of the scRNA-seq atlas. For each layer tested, the data were divided into a 90% partition for cross-validation and a 10% hold-out dataset. The 90% cross-validation partition was then further partitioned into a 90% segment used for training and a 10% segment used for calculating performance metrics. 10-fold cross-validation was conducted by randomly reshuffling the data used for training and testing to obtain a statistical estimate of the model's error in terms of overall accuracy, precision, recall, and F1-score. After 10-fold cross-validation, we fed the algorithm the 10% held-out partition from the initial subdivision and calculated confusion matrices to determine its prediction accuracy across all cell classes.
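For reference, the performance metrics named above follow their standard definitions and can be computed from paired true and predicted labels as in the sketch below. The function name `classification_metrics` and the two toy classes are hypothetical, and the source does not state how devCellPy averages these metrics across classes, so they are reported per class here.

```python
from collections import Counter


def classification_metrics(true_labels, predicted_labels):
    """Return overall accuracy and per-class precision, recall, and F1-score."""
    classes = sorted(set(true_labels) | set(predicted_labels))
    pairs = Counter(zip(true_labels, predicted_labels))

    per_class = {}
    for c in classes:
        tp = pairs[(c, c)]
        # False positives: predicted as c although the true label differs.
        fp = sum(n for (t, p), n in pairs.items() if p == c and t != c)
        # False negatives: truly c but predicted as something else.
        fn = sum(n for (t, p), n in pairs.items() if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        per_class[c] = {"precision": precision, "recall": recall, "f1": f1}

    accuracy = sum(pairs[(c, c)] for c in classes) / len(true_labels)
    return accuracy, per_class


# Hypothetical evaluation fold with two made-up classes.
true_labels = ["A", "A", "A", "A", "B", "B", "B", "B"]
predicted_labels = ["A", "A", "A", "B", "B", "B", "B", "A"]
accuracy, per_class = classification_metrics(true_labels, predicted_labels)
```

Applying such a function to each of the ten evaluation folds yields ten values per metric, from which a mean and a measure of spread can be reported.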
To further test the devCellPy-generated cardiac prediction models on data that the algorithm had not previously seen, independent murine cardiac cell datasets were [...] that used for the cardiac developmental cell atlas we conducted quality control, normalization, and a multilayered manual annotation of cell types. Moreover, we collected E10.5 mouse hearts and conducted scRNA-seq on these hearts to obtain additional data for testing devCellPy on never-before-seen data. We followed the same analysis procedure as for the cardiac developmental cell atlas for manual annotation of distinct cell types. Normalized gene expression matrices were exported for each of these datasets and input into devCellPy. To compare manual annotations with devCellPy predictions for each layer of annotation, we fed devCellPy each layer individually based on the manual annotations assigned during unsupervised clustering. We calculated confusion matrices to compare the manual labels with the devCellPy predictions for cells that were successfully classified.
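The comparison of manual labels with predictions, restricted to cells that were successfully classified, can be sketched as a confusion matrix that first drops unclassified cells. The placeholder label `"Unclassified"`, the function name, and the toy cell-type abbreviations are assumptions for illustration; the source does not specify how devCellPy marks cells it could not classify.

```python
from collections import Counter

UNCLASSIFIED = "Unclassified"  # hypothetical placeholder for cells without a prediction


def confusion_matrix(manual, predicted, unclassified=UNCLASSIFIED):
    """Return (labels, matrix); rows are manual labels, columns are predictions."""
    # Keep only cells that received a prediction.
    kept = [(m, p) for m, p in zip(manual, predicted) if p != unclassified]
    counts = Counter(kept)
    labels = sorted({m for m, _ in kept} | {p for _, p in kept})
    matrix = [[counts[(m, p)] for p in labels] for m in labels]
    return labels, matrix


# Hypothetical annotations for six cells; one cell was left unclassified.
manual = ["CM", "CM", "CM", "EC", "EC", "FB"]
predicted = ["CM", "CM", "Unclassified", "EC", "CM", "FB"]
cm_labels, cm = confusion_matrix(manual, predicted)
```

Diagonal entries count cells for which the prediction agrees with the manual annotation, and off-diagonal entries count disagreements.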
We conducted additional tests to show that devCellPy's prediction accuracy could be replicated in cross-species prediction. To do this, we applied the algorithm to human iPSC-derived cardiomyocytes and compared the predictions with the results of an LV-specific lineage tracing system. The algorithm agreed with the lineage tracing data, providing a further level of replication.
NA. This study did not require experimental grouping.
NA. This study did not require experimental grouping or blinding.
A Cardiac Troponin T monoclonal antibody (clone 13-11; ThermoFisher catalog number MA5-12960) was used for flow cytometry. The secondary antibody used in this study was a goat anti-mouse IgG Alexa Fluor 647 antibody (ThermoFisher catalog number A-21235).
The Cardiac Troponin T antibody used in this study is a well-referenced and validated antibody in the field of cardiac biology. The antibody has undergone advanced verification by the manufacturer by relative expression to ensure that it binds to the antigen as stated (refer to the manufacturer's statement: https://www.thermofisher.com/antibody/product/Cardiac-Troponin-T-Antibodyclone-13-11-Monoclonal/MA5-12960). The secondary antibody used in this study has also been well cited, as referenced on the manufacturer's website (https://www.thermofisher.com/antibody/product/Goat-anti-Mouse-IgG-H-L-Cross-Adsorbed-Secondary-Antibody-Polyclonal/A-21235).
Human induced pluripotent stem cell line SCVI-111 was obtained from the Stanford Cardiovascular Institute Biobank.
The line was generated and authenticated by the Stanford Cardiovascular Institute Biobank. SNP analysis was performed to match iPSC lines to the original patient cells, and karyotyping was performed for each line.
All cell lines tested negative for mycoplasma contamination.
No misidentified lines were used.
Animals and other organisms
Laboratory animals
All experiments used embryonic day 10.5 (E10.5) CD1 wild-type embryos, a mix of male and female. Timed pregnancies were obtained commercially from Jackson Laboratories, and embryos were collected upon arrival of the animals at the laboratory.
Wild animals
Study did not involve wild animals.
Field-collected samples
Study did not involve field-collected samples.