The molecular cytoarchitecture of the adult mouse brain

The function of the mammalian brain relies upon the specification and spatial positioning of diversely specialized cell types. Yet, the molecular identities of the cell types and their positions within individual anatomical structures remain incompletely known. To construct a comprehensive atlas of cell types in each brain structure, we paired high-throughput single-nucleus RNA sequencing with Slide-seq1,2—a recently developed spatial transcriptomics method with near-cellular resolution—across the entire mouse brain. Integration of these datasets revealed the cell type composition of each neuroanatomical structure. Cell type diversity was found to be remarkably high in the midbrain, hindbrain and hypothalamus, with most clusters requiring a combination of at least three discrete gene expression markers to uniquely define them. Using these data, we developed a framework for genetically accessing each cell type, comprehensively characterized neuropeptide and neurotransmitter signalling, elucidated region-specific specializations in activity-regulated gene expression and ascertained the heritability enrichment of neurological and psychiatric phenotypes. These data, available as an online resource (www.BrainCellData.org), should find diverse applications across neuroscience, including the construction of new genetic tools and the prioritization of specific cell types and circuits in the study of brain diseases.

To find the minimally sized gene lists that allowed us to distinguish one cell type from the others in the dataset, we framed the question as a set cover problem. In the set cover problem, we find the smallest subfamily of a family of sets that can still cover all the elements in a universe set. For our use case, we examine a given cell type $A$ and define a family of sets, one set per gene. The set $S_g$, corresponding to the gene $g$, contains all the other cell types which are 'distinguishable' (defined below) from $A$ using the gene $g$. In this way, the minimal set cover of this family of sets gives the smallest subset of genes which distinguishes $A$ from all the other cell types in the dataset (i.e. the universe set).
To determine whether a gene $g$ distinguishes two cell types $A$ and $B$, we first define $Z_{A,g}$ and $Z_{B,g}$ to be the fraction of cells in cluster $A$ and cluster $B$, respectively, with nonzero values of gene $g$. Then we say $A$ is 'distinguishable' from $B$ using $g$, denoted $D(g, A, B) = 1$, when $Z_{A,g} \geq 0.25$ and $(Z_{A,g} - Z_{B,g}) \geq 0.5$ or $Z_{B,g} \leq 0.05$. Otherwise, we say $A$ is 'indistinguishable' from $B$ using gene $g$, denoted $D(g, A, B) = 0$.
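The distinguishability rule can be sketched in a few lines of Python. This is an illustrative sketch, not the code used in the study: the function names `distinguishable` and `build_matrix` are hypothetical, the expression values are made up, and the inputs are assumed to be fractions in [0, 1]. The written rule does not say how "and" and "or" group; the sketch assumes the reading $Z_{A,g} \geq 0.25$ and ($Z_{A,g} - Z_{B,g} \geq 0.5$ or $Z_{B,g} \leq 0.05$), which may differ from the intended one.

```python
def distinguishable(z_a: float, z_b: float) -> int:
    """Return D(g, A, B) from the fractions of cells expressing g in A and in B.

    ASSUMPTION: the rule is grouped as
        z_a >= 0.25 and (z_a - z_b >= 0.5 or z_b <= 0.05).
    The source text leaves the grouping implicit.
    """
    return int(z_a >= 0.25 and ((z_a - z_b) >= 0.5 or z_b <= 0.05))


def build_matrix(frac, target, genes):
    """Build the rows D(g_j, target, c_i) for every other cell type c_i.

    frac[cell_type][gene] is the fraction of cells with nonzero expression.
    Returns the list of other cell types and the 0/1 matrix (one row each).
    """
    others = [c for c in frac if c != target]
    matrix = [
        [distinguishable(frac[target][g], frac[c][g]) for g in genes]
        for c in others
    ]
    return others, matrix


# Toy, made-up fractions for three clusters and two genes.
toy_frac = {
    "A": {"g1": 0.9, "g2": 0.3},
    "B": {"g1": 0.1, "g2": 0.02},
    "C": {"g1": 0.8, "g2": 0.6},
}
toy_others, toy_matrix = build_matrix(toy_frac, "A", ["g1", "g2"])
```

In this toy example, cluster B is distinguishable from A by either gene, whereas cluster C is distinguishable by neither, so no gene list could separate A from C.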
Then to solve each cell type's minimum set cover to optimality (or to prove it is infeasible), we phrase this problem as a mixed integer linear programming (MILP) optimization problem.
We consider one cell type $A$ for which we want to find the set cover. Given the $C$ other cell types $c_1, c_2, \ldots, c_C$, which express $G$ genes $g_1, g_2, \ldots, g_G$, we create a matrix $M$ of dimension $(C \times G)$ where $M_{i,j} = D(g_j, A, c_i)$.
We formulate our set cover model as follows.
Decision variable. The main variable we want to optimize is the choice of which genes to select for the set cover. So, we define the binary decision variable
$$U_j = \begin{cases} 1, & \text{if gene } g_j \text{ is selected} \\ 0, & \text{otherwise} \end{cases} \qquad j = 1, 2, \ldots, G.$$
Constraints. We then define $X = MU$, a $C$-length vector where $X_i$ gives how many times cell type $c_i$ was covered by the genes selected in $U$. For $U$ to give a proper set cover (i.e. every cell type is covered at least once by the selected genes), we constrain the model such that
$$X_i \geq 1, \qquad i = 1, 2, \ldots, C.$$
Note that if we want to exclude some cell types $c_a$ and $c_b$ from the optimization (e.g. they are the two nearest neighbors to $A$), we can subset $X$ to exclude these two rows. Equivalently, we instead generate a one-hot encoding $E$ for the cell types to be excluded, with
$$E_i = \begin{cases} 1, & \text{if cell type } c_i \text{ is excluded} \\ 0, & \text{otherwise} \end{cases} \qquad i = 1, 2, \ldots, C,$$
and redefine $X = (MU) + E$.
Objective. Our goal is to minimize the number of genes chosen. Formally,
$$\min_{U} \sum_{j=1}^{G} U_j.$$
We can define this mixed integer linear model programmatically using the JuMP domain-specific modeling language in Julia [1,2]. We optimized using the HiGHS open source solver [3] or the IBM ILOG CPLEX commercial solver v22.1.0.0 [4]. We repeated this optimization for every cell type in the dataset. We also grouped the cell types by main region (detailed above) and repeated this optimization considering only cell types within each regional group.
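For small instances, the same optimum can be found without a MILP solver by trying gene subsets in order of increasing size. The Python sketch below swaps in this exhaustive search purely for illustration; it is not the JuMP model described above and scales exponentially with the number of genes. The function name `min_gene_cover` and the toy matrix are hypothetical.

```python
from itertools import combinations


def min_gene_cover(matrix, excluded=()):
    """Smallest set of gene indices covering every non-excluded row.

    matrix[i][j] is 1 when gene j distinguishes the target from cell type i.
    Rows listed in `excluded` are ignored, mirroring the effect of adding
    the encoding E to X. Returns a tuple of column indices, or None when
    some row cannot be covered by any gene (the infeasible case).
    """
    n_genes = len(matrix[0]) if matrix else 0
    skip = set(excluded)
    rows = [row for i, row in enumerate(matrix) if i not in skip]
    for size in range(n_genes + 1):  # smallest candidate sets first
        for genes in combinations(range(n_genes), size):
            # X_i >= 1 for every remaining row: some selected gene covers it.
            if all(any(row[j] for j in genes) for row in rows):
                return genes
    return None


# Toy matrix: 3 other cell types (rows) by 3 genes (columns).
toy_matrix = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
]
best = min_gene_cover(toy_matrix)
best_without_row1 = min_gene_cover(toy_matrix, excluded=(1,))
```

Because subsets are tried smallest first, the first cover found is guaranteed to be minimal; a MILP solver reaches the same optimum far more efficiently on realistic problem sizes.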
To enumerate all possible gene lists, for use in obtaining the minimum-sized gene list (below), we used the IBM CPLEX solver with a 5-hour time limit over two threads and the following parameters:
mip pool absgap 0
mip pool intensity 4
mip limits populate 10000000

Creation of minimum-sized collated gene list
By default, the CPLEX optimizers stop after finding one optimal solution (i.e. one minimally sized gene list). But, upon loosening this restriction, we found that most cell types had multiple distinct equally optimal solutions. As an example, for the cell type Ex Pitx2 Zbtb7c 3, its minimally sized gene lists are of length 3 but with many different variations, such as {Foxd3, Pitx2, Tmem258} and {Foxd3, Pitx2, Tmem126b}, which both satisfy the distinguishability criteria and only vary by the exact transmembrane protein chosen.
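The existence of several equally small solutions can be illustrated with a brute-force enumeration in Python. This sketch is a stand-in for the solution-pool enumeration, not a reproduction of it; the function name `all_min_covers`, the placeholder gene names and the toy matrix are all hypothetical.

```python
from itertools import combinations


def all_min_covers(matrix, genes):
    """Return every minimum-size gene list that covers all rows of `matrix`.

    matrix[i][j] is 1 when genes[j] distinguishes the target from cell type i.
    Returns a list of tuples of gene names, or an empty list if no cover exists.
    """
    for size in range(len(genes) + 1):
        found = [
            tuple(genes[j] for j in combo)
            for combo in combinations(range(len(genes)), size)
            if all(any(row[j] for j in combo) for row in matrix)
        ]
        if found:  # the first size with any cover is the minimum size
            return found
    return []


# Toy case: the last cell type is covered equally well by gC or gD,
# so two minimum-size lists differ only in that final gene.
toy_genes = ["gA", "gB", "gC", "gD"]
toy_matrix = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
]
solutions = all_min_covers(toy_matrix, toy_genes)
```

Here the two returned lists share their first two genes and differ only in the interchangeable third, analogous to the pair of lists quoted above.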
We wanted to investigate the extent to which genes repeatedly appear across these solution sets and to examine whether these repeatedly occurring genes are enriched within any Gene Ontology (GO) classification [5]. So we again used a set cover approach, and leveraged this solution redundancy to identify the most succinct gene set encompassing at least one complete gene list from each cell type. With this method we found a ∼25% reduction in the size of the encompassing gene list, compared with taking the union of the first-discovered gene lists returned by the default solvers.
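The idea of exploiting solution redundancy can be sketched on a toy example in Python: pick one gene list per cell type so that the union of the picks is as small as possible. This exhaustive version is only an illustration under made-up inputs (the function name `collate` and the cell type and gene labels are hypothetical); it enumerates every combination of choices and is not practical at the scale of the real dataset.

```python
from itertools import product


def collate(lists_per_type):
    """Smallest union of genes containing one full gene list per cell type.

    lists_per_type maps each cell type to its equally optimal gene lists
    (each a set of gene names). Every way of choosing one list per cell
    type is tried, and the choice with the smallest union is kept.
    """
    best = None
    for choice in product(*lists_per_type.values()):
        union = set().union(*choice)
        if best is None or len(union) < len(best):
            best = union
    return best


# Toy input: two cell types, each with two equally small gene lists.
toy_lists = {
    "c1": [{"a", "b"}, {"a", "c"}],
    "c2": [{"c", "d"}, {"b", "d"}],
}
collated = collate(toy_lists)

# Baseline: union of the first list found for each cell type.
first_found = set().union(*(lists[0] for lists in toy_lists.values()))
```

In this toy case the first-found union has four genes while the collated set has three, which shows the kind of saving that choosing among equivalent solutions can yield.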
To express this collated gene list as a set cover problem, we first define the $G$ expressed genes: $g_1, g_2, \ldots, g_G$. From the exhaustive enumeration with IBM CPLEX described above, we obtain a list of equally optimal (equally small) gene sets for each of the $C$ cell types: $c_1, c_2, \ldots, c_C$. We define $N_1, N_2, \ldots, N_C$ to be how many gene lists each cluster has. In total we have $L = \sum_{i=1}^{C} N_i$ gene lists, enumerated as $l_1, l_2, \ldots, l_L$. Note that lists $(l_1, l_2, \ldots, l_{N_1})$ all come from cell type $c_1$, lists $(l_{N_1+1}, l_{N_1+2}, \ldots, l_{N_1+N_2})$ all come from cell type $c_2$, and so on. For ease of further reference, we will use $i|k$ to denote the index of the $k$th gene list for cell type $c_i$ in the whole $L$-length enumeration of gene lists. For example, $1|1 = 1$ (the first list of the first cell type is in position 1), $C|N_C = L$ (the last list of the last cell type is in the final, or $L$th, position), and $2|1 = N_1 + 1$.
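The $i|k$ indexing amounts to adding $k$ to the number of lists belonging to all earlier cell types. A short Python sketch, with a hypothetical function name `flat_index` and made-up counts, reproduces the three examples given above:

```python
def flat_index(counts, i, k):
    """Return i|k: the 1-based position of the k-th list of cell type c_i.

    counts[i - 1] holds N_i, the number of gene lists for cell type c_i.
    """
    if not 1 <= i <= len(counts):
        raise ValueError("cell type index out of range")
    if not 1 <= k <= counts[i - 1]:
        raise ValueError("list index out of range for this cell type")
    # Lists of cell types c_1 .. c_{i-1} come first, then k within c_i.
    return sum(counts[: i - 1]) + k


# Made-up counts: N_1 = 3, N_2 = 2, N_3 = 4, so L = 9.
toy_counts = [3, 2, 4]
first = flat_index(toy_counts, 1, 1)   # 1|1
last = flat_index(toy_counts, 3, 4)    # C|N_C
second_type_first = flat_index(toy_counts, 2, 1)  # 2|1
```

With these counts, `first` is 1, `last` equals the total number of lists, and `second_type_first` is $N_1 + 1$.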
We can encode the individual gene membership for each gene list with an $(L \times G)$ matrix $M$ where $M_{i,j}$ gives whether gene $g_j$ is a member of list $l_i$.
We define our MILP model as follows.
Decision variables. The main binary decision variable we want to optimize is again
$$U_j = \begin{cases} 1, & \text{if gene } g_j \text{ is selected} \\ 0, & \text{otherwise} \end{cases} \qquad j = 1, 2, \ldots, G.$$
In this case, we need an additional decision variable for each of the $L$ gene lists encoding whether it was completely covered by the genes selected in $U$. As we will see, this allows us to require at least one gene list to be completely covered. We define, for each gene list $l_k$ with $k = 1, 2, \ldots, L$, a binary variable that equals 1 if all the genes in $l_k$ are covered and 0 otherwise.
Constraints. We define $X = MU$, which is an $L$-length vector where $X_i$ gives how many genes in $l_i$ are covered by the genes selected in $U$.