Delineating mouse β-cell identity during lifetime and in diabetes with a single cell atlas

Although multiple pancreatic islet single-cell RNA-sequencing (scRNA-seq) datasets have been generated, a consensus on pancreatic cell states in development, homeostasis and diabetes as well as the value of preclinical animal models is missing. Here, we present an scRNA-seq cross-condition mouse islet atlas (MIA), a curated resource for interactive exploration and computational querying. We integrate over 300,000 cells from nine scRNA-seq datasets consisting of 56 samples, varying in age, sex and diabetes models, including an autoimmune type 1 diabetes model (NOD), a glucotoxicity/lipotoxicity type 2 diabetes model (db/db) and a chemical streptozotocin β-cell ablation model. The β-cell landscape of MIA reveals new cell states during disease progression and cross-publication differences between previously suggested marker genes. We show that β-cells in the streptozotocin model transcriptionally correlate with those in human type 2 diabetes and mouse db/db models, but are less similar to human type 1 diabetes and mouse NOD β-cells. We also report pathways that are shared between β-cells in immature, aged and diabetes models. MIA enables a comprehensive analysis of β-cell responses to different stressors, providing a roadmap for the understanding of β-cell plasticity, compensation and demise.

On the non-integrated embedding we observed strong separation between endocrine cell types across datasets (Supplementary Figure 1a).For example, γ-cells were located in two distinct regions across datasets.They did not even co-localize among in-house datasets; i.e., P16 and 4m datasets located separately from aged, mSTZ, and db/db datasets.As these differences could not be fully explained by biological differences after visual analysis of available metadata, we assumed that they are also driven by technical factors, which would bias our interpretations if we used the embedding without batch effect removal.Thus, we decided to perform integration, as recommended when analyzing multiple datasets at once [1][2][3] .
For the integration of the atlas, we tested cVAE model implemented in the scArches package 4 and scVI method 5 from scvi-tools, as it was recently reported that neural network based integration methods are most suitable for large and complex dataset collections, such as ours 1 .We observed stronger batch correction by scVI and comparable biological conservation on the level of cell types between cVAE and scVI integrations (Supplementary Figure 1c).As we were particularly interested in β-cell analysis we also evaluated integration on the level of β-cell states defined with known markers (Supplementary Figure 1c).Since scVI-based integrations had much lower biological variation preservation within β-cells we decided to use cVAE to enable exploration of biological heterogeneity within β-cells.We observed strong sample-specific ambient RNA contamination (Supplementary Figure 1b), possibly due to different cell type proportions across samples, introducing additional batch effects.Thus, we aimed to improve integration performance by preprocessing with various ambient-contamination removal methods.Interestingly, we found that preprocessing with ambient removal tools (DecontX 6 , SoupX 7 ) did not improve integration beyond the exclusion of genes with the strongest ambient expression contribution.Thus, we decided to use ambient gene exclusion alone for the final integration (Supplementary Figure 1c).Furthermore, we also attempted to specifically improve β-cell integration by performing integration of βcells only.We assumed that by integrating only β-cells we could remove more of the ambient effects.That is, the HVG selection and thus the integration would no longer be affected by the variability of markers from other cell types.However, in the β-cell specific integration we were unable to achieve matched biological preservation at the same batch correction level as in the whole-islet integration (Supplementary Figure 1d).Thus, we did not use it further.Better integration of all cell types at once could be explained by different factors, such as a more diverse set of genes used for integration that improves biological variation preservation or by learning batch effects across cell types, thus being better able to separate technical effects from biological variation that may be more cell type specific.For example, in β-cells alone, some biological states are sample-specific, resulting in direct confounding of batch and biological variation, while when using multiple cell types, the batch effects can be more easily disentangled from biological variability at the cell type level.datasets (Figure 4a, Figure 5a, Extended Data Figure 5c), likely due to in vitro islet cultivation or ambient effect of the spike-in RNA used in the protocol 15 .
We used MIA to identify markers that are both specific for an individual β-cell state and conserved across all datasets mapping to that state by applying one-vs-rest DGE between cell states and retaining the intersection of markers across datasets (Figure 5c, Extended Data Figure 6a, Supplementary Table 5), with each state containing one to four datasets (Extended Data Figure 6b).We observed a low number of cell state markers for some states, such as adult and immature states (Extended Data Figure 6b).This may be due to the stringent requirement of markers being consistent in four datasets mapping into these states.However, this ensures high robustness.We also observed on a per-dataset level that certain cell states, including the adult state, had fewer markers, indicating that these states are less molecularly unique (Extended Data Figure 6b).Reliable detection of these states would likely benefit from a combination of positive and negative markers.However, we found a large number of db/db+mSTZ state markers that were conserved across both datasets mapping into this state (Extended Data Figure 6b), indicating the strength of molecular changes involved in the T2D model dysfunction.

Supplementary Note 5: Comparison of β-cell heterogeneity within MIA with previous literature
As β-cell heterogeneity has been extensively described before 16 , we decided to compare our results to key publications in the field.Gradual maturation of β-cells after birth was previously characterized based on the upregulation of Ucn3 and downregulation of Mafb by Zeng et al. ( 2017) and based on the downregulation of Cd81 by Salinno (2021).Accordingly, we observed higher expression of Mafb and Cd81 and lower expression of Ucn3 in our immature state (imm.)from younger mice compared to the healthy adult state (adult, Figure 5b).Furthermore, our GP23, which contained multiple immaturity marker genes (Rbp4, Mafb), was gradually downregulated from immature to adult states (imm.1 and imm.2, adultimm.1,and adul1 and adul2, Extended Data Figure 8b).However, multiple genes that were reported to be downregulated during maturation by Zeng et al. (2017) (Fos, Fosb, Egr1, Junb, Jun, Txnip) were present in our aging-associated healthy gene group 4 (Supplementary Table 8), showing some discrepancy between exact maturation-associated genes.It was also reported that proliferative β-cells decline after birth 17 and indeed, we observed the highest proportion of proliferative endocrine cells in our youngest postnatal dataset (P16, Extended Data Figure 2).Moreover, Aguayo-Mazzucato et al. (2017) reported an increase of senescence marker Cdkn2a during aging, as was also observed in our aged states (agedF and agedM, Figure 5b).Maturity and aging heterogeneity were previously also observed within individual samples 17,18 .This corresponds to our finding that all healthy adult samples contain subpopulations of cells with high expression of immaturity or aging-related genes (healthy variable gene groups 2 and 4, Supplementary Note 7).However, our integrated embedding did not show separation between cell populations sorted for maturity marker Flattop (Extended Data Figure 4, datasets: P16, 4m, aged), which was discovered by Bader et al. ( 2016) 19 and similarly we did not observe a strong difference in Cfap126 (gene encoding Flattop protein) mRNA expression between sorted populations (Extended Data Figure 5g).This is expected as Cfap126 is expressed only transiently and at low levels, while both of the reporters mark the cells even after Cfap126 is no longer expressed 20,21 .Xin et al. (2018) reported cycling of cells between states of active insulin production, resulting in metabolic stress, followed by a recovery phase.We observed similar metabolic heterogeneity in healthy adult samples, as described in the section "β-cell dysfunction patterns within healthy samples".Sachs et al. (2020), whose data is contained within MIA, described db/db and mSTZ model populations, respectively, distinct from healthy controls.Their diabetic model populations match our db/db+mSTZ population (Extended Data Figure 5c,d) and we detected molecular changes corresponding to previous publications (Supplementary Note 6).Similarly, markers Ucn3 and Slc2a2 were reported to be downregulated in the STZ model by Feng et al. (2020) and were also downregulated in our diabetic db/db+mSTZ population (Figure 5b).However, the orthologue of gene ST8SIA1, which was reported to be upregulated in human T2D by Dorrell et al. (2016), was significantly downregulated in our db/db+mSTZ DGE analysis (Supplementary Table 10).Conversely, orthologue of their marker CD9, for which they did not discuss association with diabetes 22 , was upregulated in our db/db+mSTZ DGE analysis.Our integration also preserved finer cellular heterogeneity.The trajectory between mSTZ model cells with and without anti-diabetic treatment and controls, which was reported in the original study 11 , was also observed in MIA (Supplementary Note 8).Similarly, Oppenländer et al.  (2021) reported differences between db/db model cells with and without VSG and we observed a separation of VSG-treated cells into a distinct cluster (db/db-VSG state, Extended Data Figure 5d).Thompson et al. (2019), whose data is also contained within MIA, described a distinct cell population during NOD model diabetes progression, matched to our NOD-D population (Figure 5a, Extended Data Figure 5c, Supplementary Note 6).As expected, based on their reports, our DGE analysis of the NOD model revealed upregulation of senescence markers Cdkn1a and Cxcl10 (Supplementary Table 10).

Oppenländer et al. (2021) and
While we largely observe correspondence with previous studies in terms of markers and subpopulations, we also report some results that we could not reproduce.Having a comprehensive atlas that enables quick lookup of genes and cells, such as MIA, thus helps to reveal findings that do not translate to other datasets.

Supplementary Note 6: Comparison of healthy and diabetic states with GPs
To showcase our GP-based interpretation approach we applied it to cell states that have been compared before 11,12,23 .Namely, we compared the main healthy state (adult2) to the main T2D model state (db/db+mSTZ) for the db/db and mSTZ datasets and to the T1D model state (NOD-D) for the 8-16wNOD dataset (Extended Data Figure 8c).We selected the adult2 and db/db+mSTZ states as the main healthy and T2D model states, respectively, based on them containing cells across datasets for specific conditions.According to our observations in the diabetes models comparison (section "Transcriptomic similarity of db/db and STZ diabetes model β-cells"), the GP differences were similar for db/db and mSTZ diabetes model cells and more distinct for the NOD model (Extended Data Figure 8c).Among the most differentially activated GPs in the diseased state of the db/db and mSTZ models were downregulated GP19, enriched in genes related to insulin or peptide hormone secretion, GP20, enriched in genes associated with stimuli (growth factors, hormones, cytokines) response, and GP17, enriched for p38 MAPK cascade.On the other hand, the diseased state of the db/db and mSTZ models exhibited upregulation of GP3, which is predominantly active in the T2D-model cells (Extended Data Figure 8b) and contains known T2Dassociated genes (Cck, Gc, Slc5a10, and Gast) 12 , and GP4, enriched in protein processing and golgi-ER endomembrane system.In the NOD diabetic state, we observed the strongest change in upregulation of GP27 (enriched for antigen presentation and interferon response) and GP25 (enriched for response to viruses).In all comparisons of diabetes model states there was downregulation of GP15, enriched for the circadian clock, insulin secretion, and stimuli response genes.Altogether, this shows that GPs provide relevant insights into differences between cell states, which can be broadly summarized by immunemediated β-cell dysfunction in the NOD T1D model and metabolic/stress-related dysfunction in the db/db and mSTZ T2D models 24 .
summarized in Supplementary Table 8.Groups 3 and 5 had similar expression patterns in healthy cells (Extended Data Figure 9a) and were associated with β-cell maturity and insulin production, with group 3 being more strongly associated with insulin-production-related stress and more strongly active in diabetes model states (Figure 5g, Extended Data Figure 9b, Supplementary Table 8).Group 1, which was highly active in cells with low activity of groups 3 and 5 (Extended Data Figure 9a), was negatively correlated with the expression of many insulin-production-related genes and contained genes previously implicated in β-cell recovery from metabolic stress, such as ATP-production-related genes 25 , and protective genes (Figure 5g, Supplementary Table 8).Group 2 contained immaturity genes (Figure 5g), and was more strongly expressed in immature states of MIA and in healthy adult samples with a higher proportion of immature cells (Extended Data Figure 9b,c, Extended Data Figure 5c).Its high activity in healthy adult cells was mutually exclusive with high activities of groups 1, 4, and 5, and to a lesser extent of group 3 (Extended Data Figure 9a), indicating a presence of a distinct immature population in healthy adults.Group 4 contained stress and senescence-associated genes (Figure 5g, Supplementary Table 8) and had increased activity in the aged and more mature cells from MIA (Extended Data Figure 9b,c).
We assessed the localization of cells with the highest expression of individual gene groups in healthy adult samples of the atlas and Feng dataset mapped onto the atlas (Extended Data Figure 9d).We observed that cells with high expression of individual gene groups localized to distinct regions of the UMAP embedding, with cells high in groups 3 and 5 showing somewhat greater overlap, as expected based on expression similarity of these two gene groups (Extended Data Figure 9a).In all samples cells from the aged-like gene group (4) co-localized near the aged cells (agedM β-cell coarse state), confirming the relevance of this gene group in aging.In general, group 1 (low insulin expression) high-scoring cells localized somewhat further from diabetes model cells than cells high in groups 2, 3, and 5.However, the exact regions of the group-high cells differed between 4m dataset and mSTZ and db/db datasets, as expected due to the different localization of these datasets in the integrated embedding.Especially in the mSTZ, db/db, and Feng datasets we observed a large number of group 3 and also group 5 high cells that localized near the diabetic intermediate cells (D-inter.fine state), confirming the association between stressed or diabetic-like phenotype and increased insulin production.Furthermore, all datasets contained gene group 2 high cells that co-localized with imm.3 and mSTZ fine states.In Feng et al. (2020) it was proposed that a distinct β-cell population within their dataset represents "virgin" β-cells that are located in the islet periphery, but whose function is not fully understood 26 .This "virgin" population within the Feng dataset corresponds to our gene group 2 high cells within imm.3 and mSTZ fine states (Extended Data Figure 9d,e).Similarly, gene group 2 activity is anticorrelated to genes downregulated in "virgin" βcells (Ucn3, Ero1b, G6pc2, Mafa) and correlated to genes upregulated in "virgin" β-cells (Mafb and other immaturity markers, Figure 5g) 27,28 .Interestingly, this indicates a similarity between "virgin" and a subset of mSTZ-model β-cells.Additionally, the db/db dataset also contained some gene group 2 high cells that mapped near the immature coarse state (imm.) and the gene group 2 high cells from the 4m dataset likewise mapped to both mSTZ-model and immature regions, indicating the presence of different immature populations in some of the healthy adult samples.

Supplementary Note 8: Trajectories used in diabetic DGE analysis
Dysfunctional NOD samples were relatively young (14 and 16 weeks) and thus the full dysfunctional phenotype might not yet have been developed 29 .Nevertheless, a recent study of T1D and autoantigenpositive prediabetic human donors revealed that pre-diabetic samples already show similar transcriptomic changes to those with progressed T1D 30 , indicating that despite the youth of our NOD samples they may nevertheless provide valuable insights into diabetes dysfunction.for a specific gene group the corresponding enrichment sheet is empty.Columns: signature/query_size -N genes in gene group, geneset -N genes in the gene set, overlap -N overlapping genes, background -N genes in the background, hits -overlapping genes, recall -ratio overlap/geneset.Explanation of gene group interpretation sheets: Sheet names containing the term "interpretation" contain interpretation for gene groups within the Excel file.Columns: group -group name, enrichmentmanually curated summary of gene set enrichment, example_genes -example genes contained within the group, cell_cluster_expression -summary of expression pattern across β-cell states.Empty table cells indicate a lack of interpretation for given gene group characteristics.