Single-cell transcriptomic catalog of mouse cortical development

We generated a single-cell transcriptomic catalog of the developing mouse cerebral cortex that includes numerous classes of neurons, progenitors, and glia, their proliferation, migration, and activation states, and their relatedness within and across timepoints. Cell expression profiles stratified neurological disease-associated genes into distinct subtypes. Complex neurodevelopmental processes can be reconstructed with single-cell transcriptomics data, permitting a deeper understanding of cortical development and the cellular origins of brain diseases.

However, we currently lack a comprehensive understanding of which cells are present in the normally developing mouse cerebral cortex. Here, we transcriptionally profiled a total of 18,545 mouse neocortical cells at two key times of corticogenesis using Dropseq: 10,931 cells at embryonic day 14.5 (E14.5), representing a progenitor-driven stage, and 7,614 cells at birth (P0), when all six cortical layers are formed and gliogenesis has begun (Supplementary Fig. 1-2, Supplementary Table 1) 1,2 . To increase the precision of unbiased clustering, we developed an iterative cell type refinement method (see Methods) and identified 22 principal cell types at each age (Fig. 1a,   Supplementary Fig. 3 Table 2). Each cell type was classified by expression of known and novel marker genes. We further validated cell type assignment and cortical distribution at each age using in situ hybridization data (Eurexpress, Allen Institute of Brain Science, GENSAT) ( Supplementary Fig. 5 Table 2) was used to assess inter-relatedness within and between time points (Fig. 1a).
We grouped the cortical cells into our broad classes (layer-specific, interneurons, progenitor-like, and non-neuronal) and assessed their relative proportions at each age ( Fig. 1b). Nearly 40% of cells at E14.5 were progenitor-like, including multiple classes of radial glia (RG) and intermediate progenitors localized to the ventricular zone (VZ) and sub-ventricular zone (SVZ), but represented only 28% of cells by P0, as expected 4 .
Primates have an additional outer RG layer, marked by Tnc and Ptprz1. These markers were expressed in mouse progenitor cells (E14.5 RG and P0 Astrocyte IPCs, Supplementary Fig. 15) but did not show cell-type specificity, further corroborating that mice lack this distinct layer 5 . Overall, the ratio of excitatory to inhibitory neurons was in line with previous estimates that range from 2:1 to 5:1 (Refs. 6,7). Additionally, the P0 cerebral cortex contained a greater proportion of non-neuronal cells relative to the E14.5 cortex, which is consistent with the known timing of glial proliferation 4 .
Lower-layer neurons were present at E14.5 and were most similar to their P0 counterparts, as expected given the timing of cortical layer formation 2 . One type of lower layer neuron at E14.5 (Layer V-VI; Cluster 3-E) was most similar to an upper layer neuron type at P0 (Layer II-IV; Cluster 1-P) (Fig. 1a). This class of lower layer neurons expressed upper (Satb2) and lower (Bcl11b) layer markers, a migratory marker Tiam2, and Pou3f1, a transcription factor that is expressed in Layer II-III neurons during their migration and differentiation 2,8 ( Supplementary Fig. 5, Supplementary Fig. 11).
Together, these data suggest that these cells are destined to become upper-layer neurons (Fig. 1c), some of which are born around E14.5 (Ref. 9).
Interneurons migrate tangentially from the ganglionic eminences and populate all layers of the cerebral cortex 10 . Correspondingly, Int1 and Int2 were present at E14. 5

and P0
and expressed high levels of Lhx6, a transcription factor associated with Parvalbumin (PV) and Somatostatin (SST) interneurons 11 (Fig. 1a). Sst, but not Pv, was detected in Int1 and Int2 at these ages, as expected 12 . The Int3 and Int4 interneuron classes were found only in the P0 cortex. Int3 (Cluster 11-P) expressed canonical markers of vasoactive intestinal peptide (VIP) interneurons, including Htr3a, Npas1, and Adarb2 (Ref. 11). Int4 (Cluster 6-P) expressed Cdca7, a marker of some SST + and PV + interneurons 11 , but did not express Sst at this stage. High expression of Sp9, Tiam2, Dlx5-general transcriptional and migratory markers-suggest that Int4 is an immature/migrating interneuron cluster (Supplementary Fig. 12). Together, these data uncover new molecular relationships between cell types and chart the changing cellular landscape during early cortical development.
Cortical dysfunction is implicated in neurological and neuropsychiatric diseases, including amyotrophic lateral sclerosis (ALS), Alzheimer's disease (ALZ), ASD, and schizophrenia (SCZ). Genes that increase risk for these diseases were recently identified [15][16][17][18] . To determine if these disease-associated genes are expressed broadly or specifically in developing cortical cell types, we hierarchically clustered the cellular expression profiles of these genes, and defined disease subtypes ( In conclusion, we found that the unfolding cellular complexity of brain developmentformerly pieced together over decades of research 2 -can be reconstructed with a single experimental approach. Additionally, our data reveal novel relationships between cortical cell types over time and cortical space, and provide new insights into the cellular origins and subtypes of neurological diseases. Diverse mouse models show alterations in cortical development, ranging from micro-to macrocephaly to early onset neurodegeneration. Our single-cell data provide an essential resource for future studies directed at understanding how genetic and environmental factors affect cell composition, cell states, and cell fates during early cortical development.          Tagmented samples were purified twice with 0.6x and 1x AMPure XP beads. All replicates were pooled and sequenced on one Illumina HiSeq 4000 flowcell (eight lanes) to avoid sequencing bias. Read 1 was 20 bp; Read 2 was 50 bp and Read 3 (index) was 8 bp. Data was de-multiplexed using bcl2fastq version 2.18.0.12.

Processing of Drop-seq data
FASTQ files were converted to BAM format, tagged with cell and molecular barcodes, quality-filtered, trimmed, polyA-trimmed, and converted back to FASTQ as previously described 2,3 . Reads were aligned to a mouse-human hybrid genome (mm10-hg19) using STAR 4 , then sorted, merged, and exon-tagged as described 2,3 . Bead synthesis errors were corrected as described 3 , and BAM files were separated into those containing mouse or human reads. UMIs were determined to be species-specific if >90% of the transcripts came from that species, or considered a doublet if neither species achieved 90% specificity. UMIs were not considered if the transcript count sum (mouse+human) was less than 500. Gene expression matrices were then created using only the mouse-specific UMIs, as described 2,3 . Gene expression matrices were combined from multiple biological replicates; data values for one or more replicates that did not detect a given gene were assigned to zero. Processing steps utilized the Dropseq Toolkit v1.12 where possible.

Basic analysis of Drop-seq data
Cells whose mitochondrial contribution exceeded 10% of transcripts were removed, then only genes present in at least 10 cells and having at least 60 transcripts summed across all cells were considered. We then performed batch correction using ComBat 5 where each independent replicate was considered a batch, thus minimizing any technical variation. To reduce the complexity of the data, we performed Principal Components Analysis (PCA) and eigenvalue permutation (500 shufflings) to determine how many principal components (PCs) to use, as described previously 3 . This yielded 89 PCs for E14.5 and 78 PCs for P0. These data were visualized using t-SNE 6 ; we iterated both perplexity and learning_rate parameters to optimize the visualization, ultimately setting these to 50 and 750, respectively. Code made available 3 was used where possible.

Cluster identification and refinement
We found that the Louvain-Jaccard clustering method utilized by Shekhar et al. 3 produced highly variable results for our data depending on the number of specified nearest neighbors. We therefore devised an iterative procedure that picks the optimal number of nearest neighbors to use for the given dataset. To do this, we iterate from 10 to 100 nearest neighbors, and for each iteration, repeat the Louvain-Jaccard clustering method. Then, we assess how many clusters formed and how robust they were using silhouette widths 7 based on Spearman correlation distances. After the iteration was complete, we utilized the number of nearest neighbors that produced the maximal average silhouette width across all clusters as a starting point for cell clustering. The silhouette width analysis also allowed us to assess overall performance of each cluster and demonstrated that the basic clustering method published previously 3 inappropriately assigned many cells to clusters. To refine these cluster assignments, we devised a second iterative approach that attempts to reassign extreme outliers (silhouette width < -0.1) to a better grouping. Over five iterations, these outlier cells of each cluster were given a chance to form their own novel cluster (if its own silhouette width was > 0 and there were at least 10 cells) or join the next-best cluster. If a cell was reassigned to the next-best and remained an outlier there, it would be sent back to its original assignment and flagged such that it would not be considered for reassignment in subsequent iterations. This process improved the overall cluster assignments and resulted in the creation of two novel clusters for E14.5, such that the final number of clusters for both E14.5 and P0 was 22. The final cell type assignments were visualized using t-SNE 6 .

Identification of cell type marker genes and enriched pathways
Marker genes for each refined cluster were identified as described previously 2,3 , and expression summaries were created using the code provided where possible. To identify biological pathways enriched in each cluster, markers whose expression foldchange relative to other clusters exceeding 0 were mapped to human gene symbols and assessed using a hypergeometric test in Piano 8 with MSigDB C2 classifications plus additional neurological gene sets as we described previously 9 . Pathways with an FDR < 0.1 and among the top 50 for a given cluster were considered for inclusion in the cell type annotation table.

ISH validation
Marker genes were validated with in situ hybridization data available on Eurexpress (www.eurexpress.org), Allen Brain Institute (www.developingmouse.brain-map.org) and GENSAT (www.gensat.org). Images were cropped to representative sections of the neocortex, ganglionic eminences, striatum, thalamus and choroid plexus. Marker annotations are provided in Supplementary Table 3.

Cell type sub-clustering method
To identify genes whose expression pattern was heterogeneous within a cluster, we required that a given gene was detected in at least 25% of cells but not more than 75% of cells within a cluster, then further refined the heterogeneous gene list using a feature selection tool 10 on the expression of all cells within the cluster. The expression of these genes in all cells for that cluster were then clustered hierarchically and inspected manually for coherent patterns representing sub-clusters. Genes within each sub-cluster were then assessed for functional enrichments using ToppGene 11 .

Cross-age comparisons and disease gene subtyping
Expression data for each cell cluster was merged by taking the mean of all values, and all clusters from both ages were compared to each other using Pearson correlations. To focus on genes most responsible for cell-type-specificity, we used genes that were identified either as cell type markers (log-fold-change > 1.5) or among those that were included in the sub-cluster analysis. Lists of commonly mutated genes for each disease of the cortex were downloaded as follows: ALS ( 12 , Tables 1-2), Alzheimer's disease ( 13 ,