Introduction

The kidney is a vital and highly complex organ responsible for blood filtration, elimination of metabolites and waste products, and body homeostasis (oxygen sensing, pH balance, electrolyte levels, systemic blood pressure regulation). These critical functions are enabled through its functional subunit, the nephron, an intricate interplay between the nephron sub-structures and some mesenchymal cells, involving the coordinated action of more than 20 cell types (immune cells, stromal cells, nephron epithelial cells, and cells from rare populations such as the juxtaglomerular apparatus) from the entry of the blood in glomerular capillaries to the urine compartment1. Nephron sub-structures are organized following the urine flow: blood arrives to the glomerulus and filtered through fenestrated capillaries and the slit-diaphragm (n.b. this active fenestration process is controlled by podocytes), then primary urine passes first through the Bowman capsule lined by parietal epithelial cells (PEC) and then through the tubule per se, including the proximal tubule (PT; reabsorption of water, sodium, calcium, potassium, chloride, phosphate, proteins and glucose), the loop of Henle (LoH; reabsorption of water, sodium and chloride; urine concentration), the distal tubule consisting in the distal convoluted tubule (DCT) and the connecting tubule (CNT; fine tuning of sodium and chloride levels, regulation of H + and HCO3- assuring pH balance), and finally flows into the collecting ducts in the cortex (CCD) and medulla (MCD; water reabsorption; urine concentration) until its storage in the bladder2,3.

Although bulk transcriptomics have critically promoted the understanding of kidney development, physiology and diseases4,5,6, such approach is not suitable for investigating renal cell type-specific features at a single cell scale. More recently, advances in high-throughput single-cell (scRNA-seq) and single-nucleus (snRNA-seq) transcriptomics allowed to evaluate cell populations and biological processes of different tissues at the single-cell/nucleus level7,8,9. Except for liquid samples (e.g. blood, urines…), scRNA-seq involves tissue dissociation, single-cell emulsion and encapsulation, passage in the microfluidics one cell at a time, creation of a library through high-throughput sequencing, and finally data analysis. As it may be performed on frozen tissue, snRNA-seq may overcome some issues observed with scRNA-seq, such as dissociation-induced stress response leading to the expression of specific set of genes, poor viability and loss of rare and fragile cell types10,11. Although the heterogeneity between individuals and cell states have been demonstrated12,13, studies are often performed on a reduced number of samples due to the limited availability of human tissue and the cost of scRNA-seq and snRNA-seq experiments. Other critical confounding factors in both experimental and analytical settings may affect scRNA-seq and snRNA-seq data, including low sequencing depth, context-dependent cell states, clustering settings, or markers checked for cell type identification.

Still, scRNA-seq and snRNA-seq are very powerful techniques increasingly used within the biomedical field in general, and in the kidney field in particular14. To date, scRNA-seq or snRNA-seq studies of human kidneys have involved different technologies and different data preprocessing and analysis workflows. Cell type labelling in particular suffers from the lack of a universal definition of known nephron segments and cell types, as well as standard lists of RNA markers depicting each kidney cell types, which may therefore lower the comparison reliability between studies. Chen et al. already reported this issue and proposed a nomenclature for kidney epithelial cells to better compare studies15. However, no consensus list of human kidney cell type transcriptomic markers has been published so far. Thus, the establishment of consensus transcriptomic kidney cell type signatures might be of utmost importance considering significant batch effects within scRNA-seq and snRNA-seq datasets16,17,18. Here we present a meta-analysis of publicly available scRNA-seq and snRNA-seq datasets from 39 healthy adult kidneys, consisting in 68,028 single cells and 33,412 single nuclei. As data were taken from different sources, data were normalized following the SCTransform analysis pipeline in Seurat v4 and batch effects were mitigated by integration of scRNA-seq and snRNA-seq samples, respectively. Cell types were attributed to clusters using broad cell type markers, consensus cell type signatures were computed, and labelled scRNA-seq and snRNA-seq samples were integrated together to map cell types depending on the method used (sc/snRNA-seq). Finally, single-cell and single-nucleus consensus signatures were benchmarked by enrichment in previously published and annotated datasets.

Results

Analysis workflow

To determine consensus gene signature associated to each kidney cell type, we first aimed to collect kidney scRNA-seq and snRNA-seq data available on public databases according to the workflow presented in Fig. 1. Data collection ended up with 3 scRNA-seq and 4 snRNA-seq datasets publicly available19,20,21,22,23,24,25,26,27,28,29,30,31,32, encompassing a total of 101,431 cells and 35,764 nuclei, from 32 and 7 healthy adult kidneys, respectively (Table 1). Samples from the different datasets were pre-processed with Seurat v4, and cells expressing between 200 and 3500 genes were kept for analysis (discarding cell debris and cell doublets). As some kidney cell populations highly express mitochondrial genes, the percentage of mitochondrial gene expression threshold to use in kidney tissue is debating (varying between 20% and 50% across studies)33,34. We chose to keep cells with less than 30% mitochondrial genes expressed24 (Table 2). Despite nuclei should not express mitochondrial genes, nuclei with less than 5% mitochondrial genes expressed were kept to limit the waste due to possible little contamination. Since confounding variables may affect the different samples from the different studies and further analysis (Tables 1, 3), scRNA-seq and snRNA-seq samples were integrated separately using Seurat IntegrateData() function to mitigate the batch effects, following the newly implemented SCTransform framework for normalization and count data variance stabilization35. Finally, the integrated datasets consisted in 68,028 single cells and 33,412 single nuclei.

Fig. 1
figure 1

Meta-analysis workflow. All available scRNA-seq and snRNA-seq data were retrieved and downloaded from the Gene Expression Omnibus (GEO) or zenedo repository. Single cell RNA-seq and snRNA-seq samples were analysed separately, quality control metrics were measured and poor quality cells were filtered out in accordance. Then scRNA-seq samples and snRNA-seq samples were integrated independently. High resolution unsupervised clustering followed by visualisation of the expression of specific transcriptomic markers allowed to attribute each cluster a clear cell type (certain cell types were attributed to several clusters), or a cell type followed by « na » (i.e. not attributed) for the cells that did not show strong enough differenciation markers expression. Consensus signatures were computed using the FindAllMarkers() function in Seurat. Single cell RNA-seq and snRNA-seq were then integrated together to evaluate the matching between both datasets annotations. Finally, consensus signatures were used for cell type enrichments on previously published and annotated datasets.

Table 1 Clinical characteristics of the samples included in the meta-analysis.
Table 2 QC metrics, before and after filtering of low quality cells/nuclei.
Table 3 Technical characteristics of the datasets included in the meta-analysis.

Generation of a healthy human kidney consensus scRNA-seq dataset

To generate a healthy human kidney consensus scRNA-seq dataset, we first assessed the quality of the integration by comparing the distribution of cells on Principal Component Analysis (PCA) plot, before and after integration using both Harmony and Seurat v4 correction (Fig. 2a). The correction of PC1 and PC2 by Harmony did not look as good as the one obtained with Seurat for which PC1 and PC2 did not depend anymore from the origins of the samples after integration. Thus the Seurat v4 correction was used for further computations. Uniform Manifold Approximation and Projection (UMAP) of the integrated dataset showed a very good scattering of cells from the different samples and from the different batches (Fig. 2b,c). In addition, it has been suggested that kidney cells express subsets of genes that are regulated in a sex-dependent manner in mice36. However, as the sex was not known for 24 samples among 32, we could not evaluate whether a sex bias may occur in cell type attribution in humans (Fig. 2d).

Fig. 2
figure 2

Integration of scRNA-seq datasets. (a) PCA plots of scRNA-seq samples before any integration (left), after Harmony integration (middle), and after Seurat v4 integration (right). (b) UMAP plot showing the dispersion of cells after Seurat v4 integration, according to their sample of origin. (c) UMAP plot showing the dispersion of cells after Seurat v4 integration, according to their batch of origin (i.e. the publication). (d) UMAP plot showing the dispersion of cells after Seurat v4 integration, according to the gender; grey shade indicates that the gender is not known.

Unsupervised clustering (Louvain, resolution = 3.4) resulted in 54 distinct clusters (Fig. 3a). Despite a satisfying correction of the batch effects, certain clusters were driven by a few samples, which may reflect individual differences rather than cell types or cell states (Fig. 3a,b and Supp. Table 1). In particular, cluster 17 mostly belongs to sample GSM4145204 (50.13% of the cells), clusters 20, 48 and 53 to sample GSM4145206 (54.4%, 61.37 and 100%, respectively), and clusters 31, 38 and 44 to sample GSM4191943 (77.93%, 71.28% and 57.57%, respectively). Of note, these are the top 3 most abundant samples of the dataset (Table 2 and Supp. Table 1). These clusters were automatically labelled « not-attributed >> (na).

Fig. 3
figure 3

Unsupervised clustering of scRNA-seq dataset. (a) UMAP plot of 32 integrated scRNA-seq samples showing the scattering of the cells and the distribution of the 54 clusters. (b) Heatmap displaying the number of cells per sample, and the number of cells from each sample in each cluster (scaled by cluster).

Broad cell type markers expression was studied to allocate cell types to clusters6,9,22,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54 (Table 4, Figs. 4, 5a,b). A total of 29 cell types were retrieved, including nephron epithelial cells, kidney mesenchymal cells, and immune cells from both myeloid and lymphoid lineages. The nomenclature from Chen et al.15 was used for nephron epithelial cell labelling, with minor modifications to match our findings (e.g. we were not able to allocate several sub-populations of the descending thin limb nor of the thick ascending limb of the loop of Henle, and we did not find macula densa cells; please refer to Table 4 and Fig. 4 for a description of the adapted nomenclature). Considering the potential differences between sample origins (peritumoral, healthy donor, surveillance biopsy; Table 1), sampling methods, data pre-processing (Table 3) and distribution among clusters, some clusters were labeled « not-attributed » for ambiguous populations of proximal tubule cells (PTC.na), loop of Henle cells (LoH.na), principal cells (PC.na) and endothelial cells (EC.na) (Fig. 5b,c; populations in lightgrey). Thus, the computation of the signatures of PTC, LoH, PC, and EC did not depend on these ambiguous cells. Highly variable genes (HVG) were computed for every cell type: the corresponding gene lists define the consensus transcriptomic cell type signatures of kidney cells from healthy adult individuals (Fig. 5c and Supp. Table 2).

Table 4 List of broad cell type marker genes used to allocate cell types to clusters.
Fig. 4
figure 4

Schematic representation of a nephron and its associated cell types. Scheme of a nephron, locating anatomical structures the cell types described in the study belong to. See Table 4 for more details about the nomenclature. Keys: Macro.: Macrophages; DC: Dendritic cells; B.cells: B cells; CD4.T.cells: CD4+ T cells; CD8.T.cells: CD8+ T cells; NK.cells: Natural killer cells; EC.vei: Veinous endothelial cells; EC.glom: Glomerular endothelial cells; EC.art: Arterial endothelial cells; vSMC: Vascular smooth muscle cells; Mes.: Mesangial cells; Fibro.: Fibroblasts; PEC: Parietal epithelial cells; Podo.: Podocytes; PTC: Proximal tubule cells; LoH.DTL: Descending thin limb of the loop of Henle cells; LoH.ATL: Ascending thin limb of the loop of Henle cells; LoH.TAL: Thick ascending limb of the loop of Henle cells; DCT: Distal convoluted tubule cells; CNT: Connecting tubule cells; PC.CNT: Principal cells, connecting tubule; PC.CD: Principal cells, collecting duct; IC.A: Intercalated cells, A-type; IC.B: Intercalated cells, B-type.

Fig. 5
figure 5

Healthy human kidney landscape at the single cell level. (a) Cell type attribution to clusters based on the expression of specific transcriptomic markers. (b) ViolinPlot showing the expression of some of the kidney cell type-specific transcriptomic markers used to allocate cell types to clusters. (c) Dot plot presenting the expression of the 3 first genes of each computed signature, for all the signatures; this plot illustrates the list of consensus signatures for scRNA-seq samples.

Generation of a healthy human kidney consensus snRNA-seq dataset

The evaluation of the mitigation of the batch effects for snRNA-seq integrated dataset was not as good as the one obtained for scRNA-seq dataset, but PC1 and PC2 distribution was more satisfying after integration using Seurat v4 compared to Harmony (Fig. 6a). Hence Seurat correction was adopted to pursue the analysis. When nuclei are displayed according to the origin of the sample or the origin of the batch the sample comes from (i.e. the publication), it is clear that sample GSM3135714 from batch GSE114156 is not well integrated to the dataset (Fig. 6b,c). As there are only 7 samples, and some of the nuclei from this sample do not mix with the rest of the nuclei from the other samples, we chose to keep the nuclei from this sample in the analysis and exclude only the non-mixed ones after clustering. By contrast to scRNA-seq dataset, the gender was known for the 7 snRNA-seq samples and allowed to appreciate differences in sex representation within each identified population, in particular for the principal cells of the collecting duct (PC.CD) and the cells from both the ascending thin limb (LoH.ATL) and the thick ascending limb of the loop of Henle (LoH.TAL; Fig. 6d). Unfortunately, with only 5 men and 2 women, we could not assess whether these differences were due to a real gender bias rather than inter-individual differences or some remaining batch effects. Besides this potential sex bias was different from what was shown in mice, where the authors observed discrepancies in the PTC populations while comparing 2 males to 2 females36.

Fig. 6
figure 6

Integration of snRNA-seq datasets. (a) PCA plots of snRNA-seq samples before any integration (left), after Harmony integration (middle), and after Seurat v4 integration (right). (b) UMAP plot showing the dispersion of nuclei after Seurat v4 integration, according to their sample of origin. (c) UMAP plot showing the dispersion of nuclei after Seurat v4 integration, according to their batch of origin (i.e. the publication). (d) UMAP plot showing the dispersion of nuclei after Seurat v4 integration, according to the gender; grey shade indicates that the gender is not known.

Unsupervised clustering (Louvain, resolution = 3.0) resulted in 53 distinct clusters (Fig. 7a). As expected, several clusters (i.e. clusters 7, 23, 30, 33 and 34) consisted mainly in nuclei from sample GSM3135714 (accounting for 73.4%, 86.5%, 71.6%, 64% and 86.9%, respectively) (Fig. 6b,c, Fig. 7a,b and Supp. Table 1). We also observed that clusters 3 and 17 mainly belonged to sample GSM4572195 (58.6% and 63%, respectively), cluster 39 to sample GSM3320197-8 (52.5%) and cluster 48 to sample GSM4572192 (51.8%) (Fig. 7b and Supp. Table 1). Again, classical markers were studied to allocate cell types to clusters6,9,22,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54 (Fig. 8a,b, Fig. 4 and Table 4). Sticking as much as possible to the same nomenclature used for scRNA-seq dataset, a total of 22 cell types were retrieved among nuclei, including nephron epithelial cells, kidney mesenchymal cells, and 4 populations of PTC, LoH, PC and T cells labeled « not attributed » (Fig. 8a,b).

Fig. 7
figure 7

Unsupervised clustering of snRNA-seq dataset. (a) UMAP plot of 7 integrated snRNA-seq samples showing the scattering of the nuclei and the distribution of the 53 clusters. (b) Heatmap displaying the number of nuclei per sample, and the number of nuclei from each sample in each cluster (scaled by cluster).

Fig. 8
figure 8

Healthy human kidney landscape at the single nucleus level. (a) Cell type attribution to clusters based on the expression of specific transcriptomic markers. (b) ViolinPlot showing the expression of some of the kidney cell type-specific transcriptomic markers used to allocate cell types to clusters. (c) Dot plot presenting the expression of the 3 first genes of each computed signature, for all the signatures; this plot illustrates the list of consensus signatures for snRNA-seq samples.

The computation of HVG for every cell type has been performed, and these gene lists correspond to the consensus transcriptomic cell type signatures of kidney nuclei from healthy adult individuals (Fig. 8c and Supp. Table 2).

Joint analysis of scRNA-seq and snRNA-seq labeled datasets

To study the similarities and discrepancies between the results obtained with the two procedures, scRNA-seq and snRNA-seq samples were integrated together. The mitigation of the batch effects for the integration of 39 samples was overall acceptable, as attested by the correction of PC1 and PC2 (Fig. 9). Again, Harmony correction was not as satisfying as Seurat v4 one. Samples looked well merged, but nuclei and cells did not colocalize everywhere (Fig. 10a,b). The allocated cell types were highly consistent between cells and nuclei (Fig. 10c). Of note, we cannot rule out whether selecting viable cells on the basis of mitochondrial genes expression may influence this observation, since we cannot filter nuclei on the same basis. However the fact that overall, cell types were allocated at the same coordinates in cells and nuclei may give further confidence in the identified cell populations in both scRNA-seq and snRNA-seq datasets. In light of these results, we would not recommand to integrate scRNA-seq and snRNA-seq datasets before cell types have been allocated to cells and nuclei. Overall, these results demonstrated that snRNA-seq and scRNA-seq consensus signatures should be used to enrich for cell types within snRNA-seq and scRNA-seq datasets, respectively.

Fig. 9
figure 9

Integration of scRNA-seq and snRNA-seq datasets. PCA plots of scRNA-seq and snRNA-seq samples before any integration (left), after Harmony integration (middle), and after Seurat v4 integration (right).

Fig. 10
figure 10

Joint analysis of scRNA-seq and snRNA-seq datasets. (a) UMAP plot showing the dispersion of cells and nuclei after Seurat v4 integration, according to their sample of origin. (b) UMAP plot presenting the batch effects related to the procedure (scRNA-seq vs snRNA-seq). (c) UMAP plot showing the matching of allocated cell types between cells and nuclei.

Validation of the enrichment of consensus signatures for automatic cell type annotation

To test wether enrichment of consensus signatures are suitable for the identification of cell types within scRNA-seq and snRNA-seq datasets, we downloaded publicly available annotated datasets from Kuppe et al.55,56 (scRNA-seq) and Lake et al.9,57 (snRNA-seq). Sample expression matrices were processed and integrated as previously. Then CelliD v1.058 was used to perform enrichment analysis for scRNA-seq or snRNA-seq consensus signatures in every single cell or nucleus respectively, and the results were compared to the original labelling of the authors. To better evaluate automatic cell type annotation on test datasets, original labels were adapted to match consensus signatures nomenclature (Table 5).

Table 5 Nomenclature for test datasets.

After filtering out poor quality cells and cell doublets (less than 200 or more than 3500 expressed genes with more than 30% of mitochondrial genes), scRNA-seq dataset from Kuppe et al.55,56 consisted in 81,239 cells from 19 samples, representing a total of 13 chronic kidney disease patients (hypertensive nephrosclerosis)55 (Fig. 11 and Fig. 12a). Enrichment of consensus scRNA-seq signatures was performed following Multiple Correspondence Analysis (MCA), and UMAP was computed on the residues of the MCA using the RunMCUMAP() function implemented in CelliD58. However, to avoid annotation of cells with the « na » label that is not informative, signatures for « na » annotated cell types were not tested. Enrichment retrieved cell labels closely related to the original labels (Fig. 12b,c). Some differences were observed, in particular the non-attributed endothelial cells were recognized as B cells, a population of macrophages was recognized as dendritic cells, and the cells of the thick ascending limb of the Loop of Henle labeled as distal tubule cells. As only cell types belonging to the list of consensus signatures may be attributed, we did not find any schwann cell, urothelial cell, monocyte or mast cell (dendritic cells instead), myofibroblast or pericyte (vascular smooth muscle cells instead), macula densa cell (thin ascending limb of the loop of Henle instead) (Fig. 12c). Overall, automatic cell type annotation using scRNA-seq consensus signatures pretty matched the original labels from Kuppe et al.55, demonstrating its suitability and reliability to help in cell type allocation (Fig. 12b,c).

Fig. 11
figure 11

Scattering of cells depending on each sample in test datasets. UMAP plot showing cell spreading according to their sample of origin within Kuppe et al. dataset.

Fig. 12
figure 12

Enrichment of consensus signatures automatically reveals cell type identities within scRNA-seq datasets. (a) UMAP plot showing the authors’ original cell type annotations. These original annotations were adapted to match the nomenclatures used for consensus signatures. See also Table 5. (b) UMAP plot presenting the automatic cell type allocation performed by enrichment of scRNA-seq consensus signatures. (c) Cell count comparison for each cell type, from both original annotations by authors and consensus signature enrichment. Of note, certain labels are present in only one of the two labellings, raising counts of 0 for this label in the other labelling.

Concerning the validation of our identified snRNA-seq signature, Lake et al.9,57 snRNA-seq dataset was processed as previously described to remove poor quality nuclei, and finally consisted in 17,375 nuclei from 43 samples that belonged to 16 individuals, including 14 tumor-free regions of nephrectomies and 2 deceased donor kidneys9 (Fig. 13). The nomenclature of the original labels was modified as previously described for scRNA-seq, to match the nomenclature of the consensus cell type signatures (Fig. 14a, Table 5). Again, enrichment of snRNA-seq consensus signatures was done after computation of MCA and UMAP and signatures for « na » annotated cell types were not included for enrichment. The annotations were overall conserved between original labelling and consensus signature-based labelling (Fig. 14b,c). However, a subpopulation of proximal tubule cells was enriched for the descending thin limb of the loop of Henle (LoH.DTL) and some parietal epithelial cells (PEC) in the automatic annotation (Fig. 14b). In addition, cells originally labelled as LoH.DTL and some cells labelled as ascending thin limb of the loop of Henle (LoH.ATL) from samples NK37, NK38, NK45 and NK46, were still unassigned after consensus signatures enrichment (which means, there is no cell type enriched with a FDR < 0.01). This important unassigned population, which belonged to 4 samples among 43, may be considered « non-conventional » cells (although it may be due to remaining batch effects, as the samples were collected and conserved differently). In an original study, such nuclei would benefit from an in-depth analysis, since they could belong to non-tested cell types or non-steady cell states.

Fig. 13
figure 13

Scattering of nuclei depending on each sample in test datasets. UMAP plot showing nucleus spreading according to their sample of origin within Lake et al. dataset.

Fig. 14
figure 14

Enrichment of consensus signatures automatically reveals cell type identities within snRNA-seq datasets. (a) UMAP plot showing the authors’ original cell type annotations. These original annotations were adapted to match the nomenclatures used for consensus signatures. See also Table 5. (b) UMAP plot presenting the automatic cell type allocation performed by enrichment of snRNA-seq consensus signatures. (c) Nucleus count comparison for each cell type, from both original annotations by authors and consensus signature enrichment. Of note, certain labels are present in only one of the two labellings, raising counts of 0 for this label in the other labelling.

In conclusion, enrichment of consensus signatures allows the automatic and reliable annotation of kidney cell types in scRNA-seq and snRNA-seq datasets, and may be used to help in the decision of cell type allocation.

Discussion

Single cell RNA-seq and snRNA-seq are exponentially used within the kidney field. While major kidney cell types are generally retrieved, cell type identification is unconsistant between studies and therefore lacks reproducibility. It seems legit that every batch of samples, or every single sample, would not allow to identify the exact same transcriptomic signatures. It is mainly due to the small sample size of most human single-cell studies, subject to the cost of the technology and the limited availability of healthy human samples, that is in general not sufficient to generalize the conclusions to the overall population. In addition, scRNA-seq and snRNA-seq samples still suffer from a very low sequencing depth that may give rise to false positives or false negatives, within the differentially expressed genes computed between two populations (not to mention that these differentially expressed genes are calculated on the basis of the raw data, not the matrix normalized by the integration). Mapping reads to different versions of the reference genome, as well as the pre-processing of the data are also critical factors participating in batch effects between studies. This results in limited reproducibility and reliability between the different studies involving human kidney scRNA-seq and snRNA-seq. In this meta-analysis, we performed integration of 32 scRNA-seq and 7 snRNA-seq samples, from 3 and 4 different studies respectively26,27,28,29,30,31,32. After assessing the mitigation of batch effects, we ran high resolution unsupervised clustering and allocated cell types to clusters, based on the expression of known markers, before computing consensus cell type signatures. Despite scRNA-seq and snRNA-seq samples did not equally distribute everywhere on the UMAP, cell type allocation was highly consistent across the two datasets. Finally, we showed that enrichment of consensus signatures achieved cell type allocation consistent with previously annotated datasets56,57. These consensus signatures may thus help increasing reproducibility and reliability between future studies involving scRNA-seq or snRNA-seq in the kidney field.

Our present study tried to standardize cell type nomenclature by the way of meta-analysis. Even though proximal nephron is functionally and anatomically divided in three segments (S1 to S3), our study could not discriminate these segments and all proximal tubule data is lumped into one proximal category. Same is true for the three descending thin limbs segments and distal convoluted tubule that is not resolved into DCT1 and DCT2. One plausible explanation is that these subpopulations are part of the unattributed population, i.e. PTC.na and LoH.na. There is also a possibility that we did not find these subpopulations because of the sparsity of the data, especially with such heterogeneity in the data sources. The datasets used in this study are the first published in the field, they were pre-processed with different tools and suffer from strong batch effects that prevent to detect many markers. However, the signatures that we proposed still allow to identify their cell types in the validation step we did.

Single cell and single nucleus transcriptomics allow to study the expression of every detected gene (i.e. read count) in every single cell, or every single nucleus, within a suspension of cells. The advantages of sn- over scRNA-seq have been studied in healthy and fibrotic mouse kidney tissue, and include the following: less dissociation bias, less stressed or dead cells, and the possibility to use frozen tissue pieces which may allow to process more and better quality biological samples11. Furthermore, scRNA-seq and snRNA-seq samples may present too different transcriptomes in certain cell populations to allow a good detection of every cell type in both kind of experiment. This is not striking since the set of RNA expressed in the nucleus differs from the set of RNA expressed in an entire cell. In other words, scRNA-seq samples contain nuclear, mitochondrial and cytosolic RNA, while snRNA-seq samples only contain nuclear and cytosolic RNA. Therefore, cell type assignment depending on the expression of canonical marker genes, which is the standard in single cell/nucleus transcriptomics analyses, may differ between scRNA-seq and snRNA-seq. Our analysis confirmed these assumptions and as a consequence provides two sets of cell type signatures, obtained by scRNA-seq or snRNA-seq strategies. Besides, we were not able to find immune cells within snRNA-seq datasets except few T cells, which is consistent with previous reports attesting that snRNA-seq in the kidneys failed to detect immune cells in general9,11,59. Interestingly, we detected a population of parietal epithelial cells in both scRNA-seq and snRNA-seq datasets that express pluripotent cell, tubular epithelium and podocyte markers (e.g. SOD2, KRT8, KRT18, WT1, CD24, PAX2, SOX4, VIM, RACK1, NUPR1…; Supp. Table 2) and may actually correspond to previously described parietal epithelial stem cells60. These cells look very different from the other clusters, express self-renewal markers (e.g. CD24, PAX2) and match with the PAX2+ CD24+ population previoulsy observed in the glomerulus60. By contrast, we failed to detect any CD133+ mesenchymal stem cell-like population.

To introduce biological heterogeneity and mitigate technical variability, we encourage authors who would use previously published healthy kidney datasets as control datasets for their purpose, to integrate several samples from different studies instead of using the samples from a single study. For those who would add new healthy human kidney samples to their single-cell or single-nucleus studies, we would advice to compare the cell type signatures from control cells with the consensus signatures we provide, and to assign cell types in their dataset using enrichment of consensus signatures (e.g. CelliD58).

However this approach is biaised in the sense that cell type enrichment depends on the tested cell types, and if a cell type is not tested it could not be attributed to cells/nuclei, even if it should. Thus, one of the main limitation of this method is that every cell or nucleus will be attributed a cell type from the tested list: the enriched cell type with the lowest p-value will be attributed, which can be misleading (if there is no enriched cell type, then cell/nucleus is labelled « unassigned »). This further means the consensus signatures we provide only define the cell types identified in the current meta-analysis. Therefore, depending on the settings, it could make sense to use only certain consensus signatures, for instance if the studied cells or nuclei populations have been purified by FACS prior to the transcriptomics. For the same reason, such cell type enrichment may be used as a decision helper instead of a decision maker in cell type attribution to cells/nuclei. However, a more unbiased approach is possible for original studies, based on unsupervised clustering followed by extraction of the cell-specific signatures using CelliD, and finally enrichment of functional terms or pathways of these signatures. A more general limitation of such single cell studies is the statistical power for the computation of HVG (Wilcoxon Rank Sum test), that depends on the number of cells allocated for every single cell type. Indeed, the statistical power is higher for the computation of PTC signature (computed on 29,246 PTC cells vs 38,782 cells within the rest of the dataset) than for the one of DCT cells (computed on 248 DCT cells vs 67,780 cells) in the scRNA-seq dataset, for instance. In the future, these signatures may benefit from being updated by integrating newly published healthy human kidney single cell datasets that may increase the biological variability and the number of cells for every population while mitigating the batch effects even better. In addition, the very low sequencing depth of these experiments implies that the results should be interpreted with caution. To solve this issue in cell type identification while specifically working with kidney tissue and validate the identified cell type-specific signatures, a bulk transcriptomic analysis of micro-dissected healthy human nephron segments would be really helpful, as it has been performed in rodents6. Nevertheless, cell type allocation by enrichment of consensus signatures may depend on the size of the signatures – i.e. the size of the gene lists, spanning between 27 (scRNA-seq signatures, LoH.TAL) and 311 (snRNA-seq, EC.art) genes in the present meta-analysis. Thus, we recommand to perform such enrichment with both the complete signatures, and truncated signatures that are close in size.

Studies involving scRNA-seq and snRNA-seq technologies in the kidney are barely comparable, because of a lack of standardized workflow (technically and analytically) and a diversity in the references used for cell type recognition. In this meta-analysis, 32 scRNA-seq samples from 3 studies, and 7 snRNA-seq samples from 4 studies, were integrated and analysed. This resulted in the computation of 30 consensus cell type signatures for kidney cell types. Future studies in the field may benefit from the use of these signatures to automatically allocate cell types to cells/nuclei.

Methods

Data acquisition

Single-cell RNA-seq and snRNA-seq datasets generated from healthy adult kidney samples were downloaded from the Gene Expression Omnibus database (GEO; https://www.ncbi.nlm.nih.gov/geo/) as count matrices26,27,28,29,30,31,32. The collection consists of 7 snRNA-seq samples from 4 independent studies (GEO Accession ID: GSE114156, GSE118184, GSE131882, GSE151302) and 32 scRNA-seq samples from 3 independent studies (GEO Accession ID: GSE131685, GSE140989, GSE159115)26,27,28,29,30,31,32. The clinical and technical informations regarding the samples gathered from these studies are provided in Tables 1 and 3. Expression matrices of scRNA-seq samples GSM4819730 and GSM4819731 from batch GSE159115 were merged together prior to the analysis since they belong to the same individual, as well as snRNA-seq samples GSM3320197 and GSM3320198 from batch GSE11818421,23. Data downloaded from GEO were already pre-processed for each dataset, in different ways across the different studies involved (Table 3). This heterogeneous pre-processing of the samples may biase the analysis. However since our goal is to provide widely usable and consensus cell type signatures, this technical variation is important to retain.

To test whether the computed consensus signatures may be useful to automatically allocate cell types to clusters, we also downloaded available annotated datasets. Thus, Kuppe et al. (#4059315)55,56 scRNA-seq dataset was obtained from zenodo repository (https://zenodo.org/), and Lake et al. snRNA-seq dataset was downloaded from GEO under accession number GSE1218629,57. These datasets consisted in 19 chronic kidney disease samples and 43 healthy samples, respectively.

Quality control and filtering out of poor quality cells and nuclei

We used R software v4.1.0 (https://www.r-project.org/) and Seurat v4.0.5 package61 (https://satijalab.org/seurat/) to perform the analysis. As observed in previous studies, human kidney scRNA-seq datasets generally present with high mitochondrial gene counts, which may be attributed to the processing time of human kidney samples as well as the processing itself. Moreover kidney tissue notoriously contains a lot of mitochondria, consistent with the high levels of energy needed for a proper filtration process. Therefore the standard filtering out of cells with >5% mitochondrial gene expressed was not suitable for the processing of these scRNA-seq data. Cells with <200 or >3500 (cell debris and doublets) expressed genes, and >30% mitochondrial gene expressed, were filtered out, whereas nuclei with <200 or >3500 expressed genes, and >5% mitochondrial gene expressed, were filtered out. In total, 68,028 high quality cells and 33,412 high quality nuclei were obtained after applying these thresholds. Table 2 presents quality control metrics of every sample (i.e. number of cells/nuclei, mean number of reads per cell/nucleus, mean number of features expressed per cell/nucleus, % mitochondrial genes, % ribosomal genes), prior to and after filtering. Data were normalized and scaled (regressing out % mitochondrial genes), and highly variable genes computed using the SCTransform() function35 (Seurat v4) for every scRNA-seq and snRNA-seq sample. Identified HVG were then used to compute PCA for every sample. SCTransform is a newly implemented statistical method in Seurat v4, pooled from the sctransform R package (https://github.com/satijalab/sctransform), that aims to better resolve the technical variability and sequencing depth differencies between cells/nuclei across datasets35. It is particularly interesting when working with datasets obtained from different sources, which induce important variability.

Integration and dimensional reduction

Single-cell and single-nucleus samples always depend on confounding variables and may thus present differences that are called batch effects. To allow any comparison between samples, batch effects need to be mitigated as much as possible, which is done by the integration process (i.e. normalization step). Because further computations depend on this process, the quality of the integration deserves to be evaluated. Thus, two integration approaches were considered: the Seurat v4 method that outputs a corrected expression matrix for a list of genes to consider, and the Harmony v0.1.0 method that directly corrects the residues of the PCA for each sample.

Integration of 32 scRNA-seq samples on one hand, and 7 snRNA-seq samples on the other, was achieved by running consecutively PrepSCTIntegration(), FindIntegrationAnchors() and IntegrateData() functions from Seurat, with 2,500 integration features. Then PCA was computed and the first 30 PCs were inputed for uniform manifold approximation and projection (UMAP) of integrated scRNA-seq and snRNA-seq datasets. Harmony ran as well and UMAP was computed on the 30 first corrected PCs of both dataset. The distribution of the cells or nuclei from the different samples was compared between the two methods. Of note, Seurat and Harmony are among the best batch effect correction methods to date17,18.

Clustering and cell type annotation

High resolution clustering is important in such meta-analysis: since there are still notable batch effects, small batch-dependent clusters may be identified. In addition, more clusters may identify more cell types when closely related, thus more consensus cell type signatures if so. Unsupervised clustering was performed using FindClusters() function with Louvain algorithm in both dataset (resolution = 3.4 and 3.0 in scRNA-seq and snRNA-seq datasets, respectively). Distribution of samples across clusters was studied thanks to the pheatmap v1.0.12 R package. The cells were then labelled according to the expression of specific markers (Table 4). To match the nomenclature adopted for consensus cell type signatures, original labels from Kuppe C, et al. and Lake BB, et al. were changed (Table 5).