Gene-expression memory-based prediction of cell lineages from scRNA-seq datasets

Assigning single-cell transcriptomes to cellular lineage trees by lineage tracing has transformed our understanding of differentiation during development, regeneration, and disease. However, lineage tracing is technically demanding and often restricted in time resolution, and most scRNA-seq datasets are devoid of lineage information. Here we introduce Gene Expression Memory-based Lineage Inference (GEMLI), a computational tool that robustly identifies small to medium-sized cell lineages solely from scRNA-seq datasets. GEMLI makes it possible to study heritable gene expression, to discriminate symmetric and asymmetric cell fate decisions, and to reconstruct individual multicellular structures from pooled scRNA-seq datasets. In human breast cancer biopsies, GEMLI reveals previously unknown gene expression changes at the onset of cancer invasiveness. The universal applicability of GEMLI allows the role of small cell lineages to be studied in a wide range of physiological and pathological contexts, notably in vivo. GEMLI is available as an R package on GitHub (https://github.com/UPSUTER/GEMLI).

HSPC were used for comparisons across same experiment, same condition, and related conditions (datasets n=20 across 4 experiments and 3 conditions; LK condition datasets n=3, LSK condition datasets n=3 and 9 respectively, LK_LSK condition datasets n=5). For the comparison across other conditions (unrelated starting cell types), datasets of LK cells, CD8, L1210, MEF and mESC were compared (n=1 each). (f) Precision, sensitivity and FPR for GEMLI predictions using the GEMLI gene set (dark grey) or Seurat variable genes (light grey; var.) at confidence level 50 for one dataset in seven cell types (mESC, CD8, L1210, WM989, HSC, HSPC, MEF). Boxplots as in Fig. S2. Source data are provided as a Source Data file.
Vastly varying in size (from 10s to 1000s) depending on the cell types present in the data.
Clusters are not of a single size but more homogeneous. They commonly recapitulate the lineage sizes present in the data.
Fig. S2: Similarity of cell lineages in scRNA-seq data within and across cell types. (a) tSNE embedding of a PCA on exonic and intronic data of the lineage-annotated (colors) mESC dataset (n=1) including velocity vectors (arrows). (b) Comparison of correlation distance in exonic and intronic gene expression for cells of the same cell lineage and randomly sampled cells in mESC and MEF datasets (n=1 for each). (c) tSNE embedding of the mESC dataset as in (a), colored according to the completeness of each lineage in one cyclone-assigned cell-cycle phase. (d) Correlation distance in cell cycle-dependent gene expression in related and randomly sampled cells for the indicated cell types (n=1 dataset for each). (e) tSNE embedding of the mESC dataset as in (a) with size indicating the complexity (number of expressed genes) quantile of each cell. (f) Quantification of similarity in complexity range of related and randomly sampled cells for the indicated cell types (n=1 dataset for each). (g) Boxplot of the correlation distance in gene expression for related cells and randomly sampled cells (100 repetitions) in all datasets of the indicated cell types (n=datasets; n=8 for WM989, n=2 for HSC, n=14 for MEF, n=44 for HSPC). (h) Correlation distance as in (g) for lineages encompassing a single cell type (symmetric) or several cell types (asymmetric) in the HSPC datasets of day 4 (n=22 datasets). (i) Correlation distance as in (g) for different timepoints across a 30-day MEF reprogramming time course (n=4, 10, 2, 2, 4, 4, 2, 6, 2, 2, 2, 8, 2, 2, 2 datasets, respectively, for the indicated days within the time course). For all boxplots: the box spans the 25th to 75th percentile, with the median as a horizontal line. Error bars: 1.5-fold the interquartile range or the closest data point when no data point is outside this range. Source data are provided as a Source Data file.
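The related-versus-random comparison in panels (b), (d) and (g) can be illustrated with a minimal sketch. It assumes that correlation distance means one minus the Pearson correlation of two expression vectors, and that random pairs are drawn uniformly from all cells; the function names and the toy expression values are hypothetical and are not taken from the GEMLI package.

```python
import random
from itertools import combinations
from statistics import mean


def correlation_distance(x, y):
    """Return 1 - Pearson correlation of two equal-length expression vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return 1.0 - cov / (var_x * var_y) ** 0.5


def related_vs_random(expr, lineage, n_random=100, seed=0):
    """Mean correlation distance of same-lineage pairs and of random pairs."""
    cells = sorted(expr)
    related = [
        correlation_distance(expr[a], expr[b])
        for a, b in combinations(cells, 2)
        if lineage[a] == lineage[b]
    ]
    rng = random.Random(seed)
    rand = [
        correlation_distance(*(expr[c] for c in rng.sample(cells, 2)))
        for _ in range(n_random)
    ]
    return mean(related), mean(rand)


# Toy data (assumption): four cells, four genes, two lineages.
toy_expr = {
    "c1": [5, 1, 0, 3],
    "c2": [6, 1, 1, 3],
    "c3": [0, 4, 5, 1],
    "c4": [1, 5, 4, 0],
}
toy_lineage = {"c1": "A", "c2": "A", "c3": "B", "c4": "B"}

related_mean, random_mean = related_vs_random(toy_expr, toy_lineage)
print(related_mean, random_mean)
```

In this toy example the same-lineage pairs are closer to each other than the random pairs, which is the pattern the boxplots are designed to test on real data.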
Fig. S6: Comparison of different memory gene definitions for maximal predictive power. (a) Heatmap showing the sharing of memory genes called using different methods in the mESC dataset. Memory genes are selected based on a high correlation of gene expression within cell lineages (correlation), a small intra-cell lineage variability (intraCV²), a large variability across means of cell lineages (CV² of means), are marker genes for cell lineages found using Seurat's FindMarkers() function (Bimod, Standard, Negbinom, LR, Roc, Poisson, MAST, Ttest), or have been selected using a machine learning (ML) approach. (b) Overlap in the machine learning-generated memory gene sets (red) and the CV² of means-based memory gene sets (blue) in the indicated cell types (n=1 dataset for each cell type). (c) Precision-sensitivity curves for predictions in the mESC dataset using memory genes defined using different methods (colors) as in (a) as input gene set (precision=line; sensitivity=dotted line). (d) Precision-sensitivity curve for predictions on one MEF dataset as in (c). (e) FPR curve for predictions as in (c-d) for the mESC dataset (bottom) and MEF dataset (top). (f) Precision-sensitivity curves as in (e) for five cell types (n=datasets; n=1 for mESC, CD8, L1210, n=6 for MEF, n=20 for HSPC) for the four best performing methods to call memory genes (colors as in (c); line=mean precision, dotted line=mean sensitivity; shade=S.D.). (g) FPR curve for datasets and methods as in (f). (h) Boxplots of precision, sensitivity, and FPR for predictions as in (f) at confidence level 50 for the datasets as in (b; n=1 dataset for each cell type). Boxplots as in Fig. S2. For abbreviations of memory gene definitions/selection methods see Methods. Source data are provided as a Source Data file.
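As an illustration of the "CV² of means" definition named in panel (a), the sketch below scores a gene by the squared coefficient of variation of its per-lineage mean expression. This is a minimal reading of the caption, not the procedure from the Methods: the use of the population variance, the function name, and the toy values are all assumptions.

```python
from collections import defaultdict
from statistics import mean, pvariance


def cv2_of_lineage_means(values, lineages):
    """Squared coefficient of variation of a gene's per-lineage mean expression.

    values:   expression of one gene, one entry per cell
    lineages: lineage label of each cell, in the same order
    """
    by_lineage = defaultdict(list)
    for value, label in zip(values, lineages):
        by_lineage[label].append(value)
    lineage_means = [mean(v) for v in by_lineage.values()]
    grand_mean = mean(lineage_means)
    if grand_mean == 0:
        return 0.0  # gene not expressed; no variability to report
    # Population variance is an assumption; a sample variance would also be defensible.
    return pvariance(lineage_means) / grand_mean ** 2


# Toy data (assumption): six cells in three lineages.
toy_lineages = ["A", "A", "B", "B", "C", "C"]
memory_like = [10, 12, 1, 2, 5, 6]  # lineage means differ strongly
noise_like = [5, 1, 4, 2, 6, 0]     # lineage means are identical

print(cv2_of_lineage_means(memory_like, toy_lineages))
print(cv2_of_lineage_means(noise_like, toy_lineages))
```

A gene whose expression level is stable within lineages but differs between them scores high, whereas a gene that fluctuates independently of lineage scores near zero.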
Fig. S7: Enrichment of memory genes by GEMLI. (a) Abundance as a function of mean and variation deciles for memory genes (left) and the GEMLI gene set (right) on the mESC dataset. (b) Percentage of memory genes in the GEMLI gene set in the indicated cell types (n=datasets; n=2 for HSC, n=45 for HSPC, n=6 for MEF, n=8 for WM989). (c) Percentage of memory genes selected by GEMLI and a neural network (ML) based on variability and mean expression across 40 datasets of several cell types (n=1 for mESCs, CD8, L1210, n=2 for HSC, n=6 for MEF, n=8 for WM989, n=21 for HSPC). (d) Abundance as a function of mean and variation deciles as in (a) for Seurat's highly variable genes called on the mESC dataset. (e) Abundance as a function of mean and variation deciles as in (a) for a custom selection of highly variable genes called on the mESC dataset. (f) Percentage of genes in Seurat's highly variable genes and the GEMLI gene selection that is unique or overlapping. The mean over datasets is shown (n=1 for mESCs, CD8, L1210, n=6 for MEF, n=44 for HSPC, n=8 for WM989). (g) Fraction of memory genes recovered in Seurat's variable genes and by GEMLI. Datasets as in (f). Boxplots as in Fig. S2. Source data are provided as a Source Data file.
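The two quantities reported in panels (b) and (g) are simple set overlaps, which the following sketch makes explicit. The function name, the dictionary keys, and the gene symbols used as toy input are hypothetical and serve only as an illustration.

```python
def memory_gene_overlap(selected, memory):
    """Compare a gene selection against a reference set of memory genes."""
    selected, memory = set(selected), set(memory)
    shared = selected & memory
    return {
        # Share of the selection that consists of memory genes, in percent.
        "pct_memory_in_selection": 100.0 * len(shared) / len(selected) if selected else 0.0,
        # Share of all memory genes that the selection recovers.
        "fraction_memory_recovered": len(shared) / len(memory) if memory else 0.0,
    }


# Toy input (assumption): gene symbols chosen for illustration only.
toy_selection = ["Klf4", "Nanog", "Actb", "Gapdh"]
toy_memory = ["Klf4", "Nanog", "Zfp42"]

print(memory_gene_overlap(toy_selection, toy_memory))
```

The first value asks how enriched a selection is for memory genes, the second how many memory genes it captures; a selection can be high on one and low on the other.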
Fig. S8: GEMLI lineage prediction performance in different cell types. (a) Precision-sensitivity curves of lineage predictions using all genes, memory genes called using the ground truth lineages, or genes selected by GEMLI in 7 cell types (n=1 dataset for each). (b) Precision-sensitivity curves of lineage predictions as in (a) in all datasets of the indicated cell types (n=14 for MEF, n=8 for WM989, n=2 for HSC, n=44 for HSPC). Line: mean; shades: S.D. (c) Boxplot of precision (top) and sensitivity (bottom) for lineage predictions across datasets as in (b) using all genes, memory genes, or GEMLI's gene selection as input at confidence level 50. (d) False-positive rate (FPR) curve of lineage predictions using all genes, memory genes called using the ground truth lineages, or genes selected by GEMLI in 7 cell types (n=datasets; n=1 for mESC, CD8, L1210, n=14 for MEF, n=8 for WM989, n=44 for HSPC, n=2 for HSC). Line: mean; shades: S.D. (e) Boxplots for datasets as in (c) of the FPR (top; confidence level 50) and AUC values (bottom) for lineage predictions using all genes, memory genes called using the ground truth lineages, or genes selected by GEMLI. (f) ROC curves for GEMLI predictions across cell types and datasets as in (d). The parts of the ROC curve corresponding to GEMLI predictions at confidence values of 0-10 are indicated. Boxplots as in Fig. S2. Source data are provided as a Source Data file.
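Precision, sensitivity and FPR at a given confidence level can be sketched as a comparison of predicted and ground-truth cell pairs. The example assumes that evaluation is done per cell pair and that a pair counts as predicted when its confidence reaches the threshold; both are assumptions made for illustration, and the function name and toy values are hypothetical.

```python
from itertools import combinations


def pairwise_metrics(confidence, lineage, threshold):
    """Precision, sensitivity and FPR of pairwise lineage predictions.

    confidence: dict mapping a sorted cell pair (a, b) to a confidence value
    lineage:    dict mapping each cell to its ground-truth lineage label
    threshold:  minimal confidence for a pair to count as predicted related
    """
    tp = fp = fn = tn = 0
    for a, b in combinations(sorted(lineage), 2):
        predicted = confidence.get((a, b), 0) >= threshold
        related = lineage[a] == lineage[b]
        if predicted and related:
            tp += 1
        elif predicted:
            fp += 1
        elif related:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return precision, sensitivity, fpr


# Toy data (assumption): five cells in two lineages, confidences on a 0-100 scale.
toy_lineage = {"c1": "A", "c2": "A", "c3": "A", "c4": "B", "c5": "B"}
toy_confidence = {
    ("c1", "c2"): 90,
    ("c1", "c3"): 60,
    ("c2", "c3"): 40,
    ("c3", "c4"): 55,
    ("c4", "c5"): 80,
}

print(pairwise_metrics(toy_confidence, toy_lineage, threshold=50))
```

Sweeping the threshold over the confidence range and recording the three values at each step gives precision-sensitivity and FPR curves of the kind shown in the panels.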
Fig. S10: Influence of GEMLI parameters on lineage predictions. (a) Precision-sensitivity (left) and FPR (right) curves of lineage predictions in the mESC dataset (n=1) sampling different fractions of genes during each iterative clustering as indicated. (b) Precision (top left), sensitivity (bottom left), FPR (top right), and AUC (bottom right) for mESC, CD8, L1210, HSPC, HSC, WM989, and MEF datasets (n=1 each) at a confidence level of using different fractions of genes (color-coded) during each iterative clustering as in (a). (c) Precision-sensitivity (left) and FPR (right) curves of lineage predictions in the mESC dataset splitting clusters into the number indicated (colors) during each clustering iteration round. (d) Precision, sensitivity, FPR, and AUC as in (b) at a confidence level of 50 for datasets as in (b) splitting clusters as in (c; color-coded). (e) Precision-sensitivity (left) and FPR (right) curves of lineage predictions in the mESC dataset repeating the iterative clustering different numbers of times as indicated. (f) Precision, sensitivity, FPR and AUC as in (b) at a confidence level of 50 for datasets as in (b) using different numbers of repetitions (color-coded) as in (e). Boxplots as in Fig. S2. Source data are provided as a Source Data file.
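The three parameters varied in this figure (fraction of genes sampled, number of sub-clusters per split, number of repetitions) can be illustrated with a simplified sketch of repeated clustering on random gene subsets. It is not the GEMLI implementation: the two-way splitter is a deliberately simple stand-in, the reading of confidence as a co-clustering count is an assumption, and all names, defaults and toy values are hypothetical.

```python
import random
from itertools import combinations


def split_in_two(cells, expr, genes):
    """Stand-in splitter: seed with the two most distant cells, assign the rest to the nearer seed."""
    def dist(a, b):
        return sum((expr[a][g] - expr[b][g]) ** 2 for g in genes)

    seed_a, seed_b = max(combinations(cells, 2), key=lambda pair: dist(*pair))
    left = [c for c in cells if dist(c, seed_a) <= dist(c, seed_b)]
    right = [c for c in cells if c not in left]
    return left, right


def cluster_once(cells, expr, genes, max_size):
    """Split recursively until every cluster holds at most max_size cells."""
    if len(cells) <= max_size:
        return [cells]
    left, right = split_in_two(cells, expr, genes)
    if not right:  # cells are indistinguishable on the sampled genes
        return [cells]
    return cluster_once(left, expr, genes, max_size) + cluster_once(right, expr, genes, max_size)


def pair_confidence(expr, n_genes, gene_fraction=0.5, max_size=2, repeats=100, seed=0):
    """Count how often each cell pair ends up in the same final cluster."""
    rng = random.Random(seed)
    cells = sorted(expr)
    counts = {pair: 0 for pair in combinations(cells, 2)}
    k = max(1, round(gene_fraction * n_genes))
    for _ in range(repeats):
        genes = rng.sample(range(n_genes), k)  # fresh random gene subset per repetition
        for cluster in cluster_once(cells, expr, genes, max_size):
            for pair in combinations(sorted(cluster), 2):
                counts[pair] += 1
    return counts


# Toy data (assumption): two well-separated pairs of cells measured on four genes.
toy_expr = {
    "c1": [10, 10, 10, 10],
    "c2": [9, 11, 10, 9],
    "c3": [0, 1, 0, 1],
    "c4": [1, 0, 1, 0],
}

toy_conf = pair_confidence(toy_expr, n_genes=4)
print(toy_conf)
```

With 100 repetitions the counts fall on a 0 to 100 scale, so a cut-off such as 50 keeps pairs that co-cluster in at least half of the repetitions. Changing `gene_fraction`, `max_size` or `repeats` in this sketch mirrors the kind of parameter sweep the panels report.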
Fig. S13: Performance of GEMLI for different lineage sizes. (a) Precision of GEMLI predictions for different ground truth lineage sizes at confidence level 50 in the mESC, MEF, WM989, and HSPC datasets. (b-c) Same representation as in (a) for sensitivity (b) and FPR (c). (d) Precision-sensitivity curves for different ground truth lineage sizes as indicated in the mESC dataset. (e) FPR curve for different ground truth lineage sizes as indicated in the mESC dataset. (f) Precision-sensitivity curves for GEMLI predictions with a lineage size parameter of 5 on the crypts dataset for three ground truth lineage size bins as indicated. (g) FPR curve for GEMLI predictions with a lineage size parameter of 2-5 on the crypts dataset for three ground truth lineage size bins as indicated. (h-i) Same representation as in (f-g) for GEMLI predictions with a lineage size parameter of 2-40. For (a-c): the median over datasets is shown (n=datasets; ESC n=1, MEF n=14, WM n=8, HSPC n=44). Source data are provided as a Source Data file.
Fig. S16: GEMLI lineage predictions can identify memory genes and cell fate decisions. (a) Spearman rank correlation matrix of GO-term enrichment (top 500) in memory genes of ground truth or predicted (pred.) cell lineages at confidence level 30 in the indicated datasets (n=1; Spearman rank test: * p-value<0.05; ** p-value<0.01; *** p-value<0.001). (b) Overlap of memory genes of CD8, L1210 and WM989 cells from Kimmerling et al. 2016 and Shaffer et al. 2020 with the memory genes called on predicted lineages in these cell types (one dataset each as in (a)). (c) The overlap of memory genes called on ground truth cell lineages and memory genes called on predicted (pred.) lineages across cell types (n=1 dataset each). For the human WM989 cells, the overlap is shown in a Venn diagram (left). (d) Percentage of asymmetric (left), entirely drug-susceptible (middle), and entirely drug-resistant (right) cell pairs in barcode and GEMLI lineages across WM989 datasets (n=8). Density plot and individual paired datapoints are represented. (e) The number of cell pairs in barcode and predicted lineages in all possible cell type pair categories across WM989 datasets as in (d). Each dot represents one cell type combination in one dataset. Coloring represents density in the scatter plot. Spearman rank correlation is given. (f) Scatterplot as in (e) showing the number of cell pairs in barcode and predicted lineages in all cell type combination categories across all HSPC datasets (n=44). (g) Scatterplot as in (f) for random cell pairs. (h) Heatmap of the number (sum) of cell pairs in barcode (top right) and random cell lineages (bottom left) in all possible cell type pair categories in the HSPC datasets as in (f). (i) Representation as in (f) for barcode cell pairs