CLIMB: High-dimensional association detection in large scale genomic data

Joint analyses of genomic datasets obtained in multiple different conditions are essential for understanding the biological mechanisms that drive tissue specificity and cell differentiation, but they remain computationally challenging. To address this, we introduce CLIMB (Composite LIkelihood eMpirical Bayes), a statistical methodology that learns patterns of condition-specificity present in genomic data. CLIMB provides a generic framework facilitating a host of analyses, such as clustering genomic features sharing similar condition-specific patterns and identifying which of these features are involved in cell fate commitment. We apply CLIMB to three sets of hematopoietic data: CTCF ChIP-seq measured in 17 different cell populations, RNA-seq measured across constituent cell populations in three committed lineages, and DNase-seq in 38 cell populations. Our results show that CLIMB improves upon existing alternatives in statistical precision while capturing interpretable and biologically relevant clusters in the data.

where ỹ_im is the posterior probability that observation i belongs to class m given the observations. We follow Huang et al. (2017) 2 , applying a penalty (in our case, SCAD 3 ) to the term log π_m, to obtain a penalized log-likelihood function in which the mixing proportions of unlikely clusters are shrunk to zero during the estimation process. Here, λ is the tuning parameter, γ_m is the number of free parameters in cluster m, p_λ is the SCAD penalty function applied with parameter λ, and ϵ is a small positive number introduced for numerical stability. The SCAD penalty is well-suited to our application because in many genomic applications the classes are not of homogeneous size. Whereas L_p-style penalties over-penalize large values of π_m, the SCAD penalty yields an unbiased result and does not penalize values of π_m that are large enough.
A slight modification of the usual EM algorithm is used to iteratively maximize the penalized likelihood (Equation 2). In the E-step, given the current estimate θ̂⁰ = {π̂⁰_1, μ̂⁰_1, Σ̂⁰_1, . . . , π̂⁰_M, μ̂⁰_M, Σ̂⁰_M}, we compute the posterior probability that each observation belongs to each class given the data, by some abuse of notation for M, which is allowed to shrink during the algorithm's progression. In the M-step, we maximize Equation 2, which is separable into a component concerning the mixing proportions and a component concerning the normal parameters. For the former, straightforward calculus with a Lagrange multiplier imposing a sum-to-one constraint yields a closed-form solution to the maximization, where p′_λ is the first derivative of the SCAD penalty. For the latter, since we impose application-specific constraints on the model parameters, maximization must be done numerically.
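As a concrete illustration, the SCAD penalty and its first derivative can be sketched as below, following the standard Fan-and-Li parameterization. The shape parameter a = 3.7 is the conventional default and an assumption here, since the text does not state the value used.

```python
import numpy as np

def scad(t, lam, a=3.7):
    """SCAD penalty p_lambda(t): linear near zero, quadratic blend, then flat."""
    t = np.abs(t)
    small = t <= lam
    mid = (t > lam) & (t <= a * lam)
    return np.where(small, lam * t,
           np.where(mid, (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1)),
                    lam**2 * (a + 1) / 2))

def scad_deriv(t, lam, a=3.7):
    """First derivative p'_lambda(t): constant lam for small t, zero for large t,
    so large mixing proportions are left unpenalized (unlike L_p penalties)."""
    t = np.abs(t)
    return lam * ((t <= lam) + np.maximum(a * lam - t, 0) / ((a - 1) * lam) * (t > lam))
```

The flat region for t > aλ is what makes the penalty unbiased for large π_m: its derivative there is exactly zero, so large mixing proportions are not shrunk.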

Supplementary Note 2: MCMC details

Computing prior hyperparameters from pairwise fits
We recycle information from the D(D − 1)/2 pairwise models in order to obtain more informative priors in our model. Here, we detail how we compute the hyperparameters {µ_1, . . . , µ_M, Σ_1, . . . , Σ_M}. For simplicity, we describe only the computation of {µ_1, . . . , µ_M, σ²_1, . . . , σ²_M}, which denote the means and variances in just the first dimension of each of the M clusters. The computation for the other dimensions follows similarly.
To begin, we first obtain the pairwise association labels h_m^(1d), d ∈ {2, . . . , D}, for each observation. As discussed in Mixing weight-based class pruning, these labels are assigned via multinomial sampling with weights equal to the observation's posterior probability of belonging to each class. Then, for a given cluster m and the pairwise fit between dimensions 1 and d, we compute the mean of the subset of data belonging to class m in the first dimension. The hyperparameter µ_1 equals the average of this value across all D − 1 pairwise fits that contain the first dimension. Variances are handled analogously, except that variance estimates for a fixed class m are combined across pairwise fits not by averaging but by taking the 75th percentile of the collection of estimates, as averaging often results in covariance estimates that are not positive definite. The off-diagonals of the covariance matrices are also computed on the relevant data subsets, but they do not need to be aggregated, because we do not obtain redundant estimates of the off-diagonal covariance terms.
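The aggregation rule above can be sketched in a few lines of numpy. The per-fit estimates below are hypothetical values standing in for the output of the D − 1 pairwise fits that include dimension 1.

```python
import numpy as np

# Hypothetical per-fit estimates for one class in dimension 1, one value from
# each of the D - 1 = 4 pairwise fits that include dimension 1.
pairwise_means = np.array([0.9, 1.1, 1.0, 1.2])
pairwise_vars = np.array([0.8, 1.5, 1.0, 1.2])

mu_1 = pairwise_means.mean()                 # means are averaged across fits
sigma2_1 = np.quantile(pairwise_vars, 0.75)  # variances use the 75th percentile
```

Taking an upper percentile rather than the mean deliberately inflates the variance hyperparameter, which is what guards against non-positive-definite covariance estimates.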

Metropolis-Hastings algorithm
Note that, since all full conditionals (Equation 12, below) are available for the model in Equation 8, we could in principle do inference with a Gibbs sampler. However, this requires making draws from the truncated multivariate normal and constrained inverse-Wishart distributions. While efficient methods for sampling truncated multivariate normal random variables are available 4,5 , it is not clear how to efficiently sample constrained inverse-Wishart random variables. As an alternative, we introduce a fast algorithm for generating positive-definite matrices with constraints on the off-diagonal elements. This algorithm enables us to implement an efficient Metropolis-Hastings step in our MCMC, proposing a new covariance drawn with the correct support at each iteration.
Our constrained matrix sampling algorithm modifies a standard procedure for generating Wishart-distributed random matrices. Wijsman 6 proposed a method for sampling Wishart random matrices based on the Bartlett decomposition of a positive-definite matrix. Consider Σ ∼ W_D(ν, Ψ), a Wishart distribution parameterized by a D × D positive-definite scale matrix Ψ and degrees of freedom ν > D − 1. To draw a random sample from this distribution, one computes U, the upper-triangular Cholesky factor of Ψ, and simulates a D × D lower-triangular matrix A whose diagonal elements are A_ii = c_i with c²_i ∼ χ²_{ν−i+1}, and whose below-diagonal elements are z_ij ∼ ϕ_1(0, 1). Then Σ = U^T A A^T U is a Wishart-distributed random matrix.
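Wijsman's unconstrained construction can be sketched as follows; this is a minimal textbook implementation of the Bartlett decomposition, not the paper's code.

```python
import numpy as np

def rwishart_bartlett(Psi, nu, rng):
    """Draw Sigma ~ W_D(nu, Psi) via the Bartlett decomposition:
    Sigma = U^T A A^T U, with U the upper Cholesky factor of Psi,
    A_ii = sqrt(chi2_{nu-i+1}), and A_ij ~ N(0, 1) for i > j."""
    D = Psi.shape[0]
    U = np.linalg.cholesky(Psi).T              # Psi = U^T U, U upper-triangular
    A = np.zeros((D, D))
    for i in range(D):
        A[i, i] = np.sqrt(rng.chisquare(nu - i))   # df = nu - (i+1) + 1
        A[i, :i] = rng.standard_normal(i)          # below-diagonal N(0, 1) draws
    return U.T @ A @ A.T @ U
```

A quick sanity check is that the Monte Carlo mean of many draws approaches νΨ, the Wishart expectation.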
We claim that we can modify Wijsman's simulation procedure to produce positive-definite matrices with the desired support. We cannot make changes to U without destroying the desired covariance structure of Σ, nor can we change the diagonal elements of A without risking simulating a singular matrix. However, each off-diagonal element of Σ = U^T A A^T U can be expanded as in Equation 6. This representation shows that, due to the triangular structure of both U and A, each off-diagonal element Σ_pm computed using Equation 6 depends only on certain off-diagonal elements of A: A_21 depends on no other off-diagonal element, A_31 depends only on A_21, A_32 on A_21 and A_31, and so forth, proceeding in a top-down, row-wise order. Further, each Σ_pm is a monotonic function of A_pm.
The monotonicity of Σ_pm in A_pm guarantees that, when A_pm is small (large) enough, Σ_pm will be negative (positive). Therefore, we can find an upper (lower) bound on z_pm such that Σ_pm satisfies the given sign constraints. Moreover, given the dependence structure of the off-diagonal elements of A, we observe that, conditional on obtaining a satisfactory value of A_21, we can find a satisfactory value for A_31, then A_32, and so on.
For m < p, solving Equation 6 for A_pm gives Equation 7. This formula for A_pm is used to find upper and lower truncation points for the univariate standard normal random variables that comprise the off-diagonal elements of A, such that Σ satisfies the desired constraints. The constraint Σ_pm < 0 (Σ_pm > 0) is satisfied when A_pm is less (greater) than the result of Equation 7 with Σ_pm set equal to 0. The full simulation procedure is made explicit in Algorithm 1. Fig. S27 shows example data simulated using Algorithm 1 with ν = 25.

Algorithm 1 Sampling a D × D positive-definite matrix Σ with desired support
1: Define fixed parameters Ψ, ν, R, and S, where R and S are symmetric D × D matrices of lower and upper truncation points for each of the off-diagonal elements of Σ
2: Compute U, the upper-triangular Cholesky factor of Ψ
3: for i in 1 to D do
4:   A_ii ∼ χ²_{ν−i+1}
5: for p in 2 to D do
6:   for m in 1 to p − 1 do
7:     a_pm ← solution to Equation 7, setting Σ_pm = R_pm
8:     b_pm ← solution to Equation 7, setting Σ_pm = S_pm
9:     A_pm ∼ standard normal truncated to (a_pm, b_pm)
10: return Σ = U^T A A^T U

The MCMC proceeds as follows. Initialize all parameters θ = {µ_1, . . . , µ_M, Σ_1, . . . , Σ_M, π} in the model. At iteration t of the MCMC algorithm, we begin by proposing a constrained covariance matrix one cluster at a time. For each cluster m, we propose a new covariance drawn according to Σ*_m ∼ Algorithm 1(Σ^t_m, ν^t_m, h), since covariances that do not meet the given constraints would have zero likelihood. Here Σ^t_m is our current estimate of Σ_m at iteration t, and ν^t_m is our current proposal degrees of freedom, adaptively tuned on the log scale for each cluster 7 . The acceptance ratio for this proposal is given in Equation 11, where g is a multinomial density, q is an inverse-Wishart density, and η accounts for the asymmetric proposal, maintaining detailed balance.
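The constrained sampler can be sketched as below. This sketch handles sign constraints only (i.e., the R and S truncation points collapse to zero thresholds), and it finds each truncation point numerically by exploiting the fact that Σ_pm is affine and increasing in A_pm, standing in for the closed-form solution of Equation 7; it is an illustration of the idea, not the authors' implementation.

```python
import numpy as np
from scipy.stats import norm

def sample_constrained_cov(Psi, nu, sign, rng):
    """Sketch of the constrained sampler: sign[p, m] in {-1, 0, +1} requests
    Sigma[p, m] < 0, unconstrained, or > 0 (lower triangle, p > m)."""
    D = Psi.shape[0]
    U = np.linalg.cholesky(Psi).T                  # Psi = U^T U, U upper-triangular
    A = np.zeros((D, D))
    for i in range(D):
        A[i, i] = np.sqrt(rng.chisquare(nu - i))   # c_i^2 ~ chi2_{nu-i+1}
    sigma = lambda: U.T @ A @ A.T @ U
    for p in range(1, D):                          # fill off-diagonals row-wise
        for m in range(p):
            if sign[p, m] == 0:                    # unconstrained entry
                A[p, m] = rng.standard_normal()
                continue
            # Sigma[p, m] is affine and increasing in A[p, m]; locate its root.
            A[p, m] = 0.0; c0 = sigma()[p, m]
            A[p, m] = 1.0; c1 = sigma()[p, m]
            t = -c0 / (c1 - c0)                    # Sigma[p, m] = 0 at A[p, m] = t
            if sign[p, m] > 0:                     # truncate N(0, 1) below at t
                u = norm.cdf(t) + rng.uniform() * (1.0 - norm.cdf(t))
            else:                                  # truncate N(0, 1) above at t
                u = rng.uniform() * norm.cdf(t)
            A[p, m] = norm.ppf(u)                  # inverse-CDF truncated draw
    return sigma()
```

Because the diagonal of A stays strictly positive and U is nonsingular, the returned matrix is always positive definite, while the truncated draws enforce the requested off-diagonal signs.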
The first line of Equation 11 captures the likelihood of the data given the estimated cluster mean vector, covariance matrix, and cluster indicator, and the likelihood of the cluster indicators given the data and all estimated means, covariances, and mixing weights. The second line captures the priors on the mean and covariance. The final term is proportional to the univariate χ² and truncated normal densities that make up the samples from Algorithm 1. This procedure conveniently sidesteps any expensive computation of intractable normalizing constants.
Following this Metropolis step, cluster indicators z, means µ, and mixing weights π are sampled using standard Gibbs updates. The full conditionals are written

Simulation 1: ChIP-seq data
In simulation 1, based on ChIP-seq data, we assess the performance of each method by comparing the identified consistent signals with the truth and computing the precision and recall at a series of identification thresholds. As shown in Fig. S3, CLIMB outperforms mash and SCREEN across all tested u, where u is the consistency threshold of the partial conjunction hypothesis (Section Testing consistency of effects). We also examined the frequency of missigned signals (Table S1), and found that CLIMB and SCREEN do not missign signals, while mash does. In fact, since SCREEN assumes all signals have positive sign, an assumption that these data satisfy, SCREEN benefits in this comparison. Mash, on the other hand, sometimes missigns effects as negative, and is especially likely to do so for lower u.

Figure S1: Pairwise plots of data used in simulation studies. a, Simulated data between two different pairs of dimensions from simulation 1, resembling ChIP-seq data, and b, simulation 2, resembling differential analysis of RNA-seq data.

Figure S2: 27 simulations were conducted to investigate the computational demand of each step of CLIMB across different sample sizes, dimensions, and numbers of latent classes. Parameter settings and association vectors were randomly generated in each simulation. The MCMC sampler was run for 15,000 iterations in each simulation. The dimension of the dataset is the major source of increased computational cost. Technically, the MCMC sampler could have a worst-case time complexity of O(D⁵). However, the time complexity is also largely driven by how dense the latent association vectors h are with non-zero elements, as this drives the number of parameters needing estimation. In the contexts we study using this method, there are a large number of zeros in each class (e.g., Supplementary Figs. S10, S18, and S29). Analyses were done on a machine (2.8 GHz Intel Xeon processor with 256 GB RAM) with 20 cores for parallel processing.
This is likely due in part to the fact that mash expects data to be symmetric and unimodal, whereas ChIP-seq data do not in general display such structure. CLIMB, on the other hand, can adapt to the ChIP-seq data structure. Indeed, a closer examination of the results after pairwise fitting shows that all candidate latent classes with a −1 label are removed during the pairwise fitting step. These results collectively suggest that CLIMB's pairwise fitting step is well-suited to identifying pairwise associations in the data, and that the final joint modeling step boosts statistical power to identify consistent signals across u.

Simulation 2: differential RNA-seq data
Simulation 2, designed to resemble a differential RNA-seq dataset, reveals several contrasting results. Here again, CLIMB outperforms mash and SCREEN (Supplementary Fig. S4). For these data, we noticed that accurate classification appears to become more difficult as u increases. This is likely because few observations are consistent at high thresholds, and the most consistent observations in this dataset are neither well-separated from inconsistent observations nor strongly correlated across dimensions. For example, a mere 480 observations (3.2% of the total sample size) are consistent at u = 5. The average signal overall is 2.05, while the average signal for the observations consistent at u = 5 is only 1.61. These features of the data would challenge any method, explaining the precipitous drop in performance of mash and SCREEN at higher thresholds (u ≥ 4). Due to the stochastic nature of the class pruning and modeling employed by CLIMB, this challenging setting can affect CLIMB's performance as well, yielding somewhat differing results for u ≥ 4 across multiple runs. Still, CLIMB outperforms the other two methods, and all methods perform better than a random classifier.

Table S1: Proportion of effects identified as significant at level 0.05 that are true effects, yet incorrectly signed by CLIMB, mash, and SCREEN, from simulation 1.

Figure S4: Precision-recall curves for simulation 2, resembling a differential analysis of an RNA-seq dataset. The curves assess CLIMB, mash, and SCREEN's ability to identify effects that are consistent in at least u out of 11 tissues. The dashed black line is the baseline for a random classifier.

We next investigated the frequency with which each method missigns signals (Supplementary Table S2), and found that CLIMB performs best by this metric. Most notably, SCREEN struggles to make accurate inference in this setting. Because SCREEN does not differentiate signs, it identifies as consistent many inconsistent effects that appear significant in both the positive and negative directions. Since CLIMB and mash model signals in both the positive and negative directions, these methods missign signals far less frequently.
On the other hand, mash reports a very small estimated mixing weight of 2.02 × 10⁻³ for the null class, resulting in almost all observations being called significant, a trend that CLIMB and SCREEN do not exhibit (see Supplementary Fig. S9), rendering mash's calls inappropriate for testing consistency of signals. CLIMB, meanwhile, demonstrates the strongest performance, based on both the precision-recall curves (Fig. S4) and the proportion of missigned effects (Table S2). Taken together, these results affirm that CLIMB's flexible modeling framework is essential for adapting to complex datasets, and that the inclusion of latent association labels reduces oversensitivity when identifying signals.

Simulation 3: RNA-seq data across cell differentiation
We set up simulation 3 to mimic an analysis whose goal is to understand how gene expression levels change across cell developmental stages. The results support the claim that CLIMB achieves better precision than DESeq2 in the specific setting of multi-condition differential analysis. In particular, though DESeq2 is an effective tool for differential expression analysis between a pair of conditions, its modeling and testing approach, which aggregates pairwise tests across conditions, may weaken the analysis when there are additional conditions. This is a setting where CLIMB can excel by cohesively capturing expression patterns across differentiation.

Supplementary Note 4: Simulation details
In simulations 1 and 2, testing for consistency was carried out for each method using comparable, but not identical, procedures. For CLIMB, we tested for consistency using the procedure described in the Methods (see Testing significance and replicability of effects). For SCREEN, we used the test provided by Amar et al. 8 in the SCREEN software. The tests are similar in that they use binary (for SCREEN) and ternary (for CLIMB) vectors as class labels of association patterns to determine the probability of a given effect being significant in a given dimension. They differ in that SCREEN clusters the dimensions and thus does not produce D-dimensional cluster labels, and in that CLIMB requires the additional step of selecting the sign of each effect (Equation 12). Mash, on the other hand, does not define its classes by association patterns dictating the significance and direction of effects. Therefore, we designed a consistency test for mash to be as close to the test used for CLIMB as possible.
The consistency test for mash begins with the quantities p⁺_id and p⁻_id, i ∈ {1, . . . , n}, d ∈ {1, . . . , D}: the probability that effect i in dimension d is non-null positive or non-null negative, respectively. These quantities are readily output by mash. We used these terms to compute P̂_i^{u/D+} and P̂_i^{u/D−}, the probability that effect i is non-null positive or non-null negative, respectively, in at least u out of D dimensions. These terms are analogous to those used by CLIMB (see Equation 13), and are calculated with Equation 13 by minor abuse of notation, as the index j in fact refers to a whole set of dimension indices. Again, analogous to the replicability test for CLIMB, we define P̂ as the probability that effect i is replicable at level u. This test comes with the caveat that it assumes the p_ij's are independent.
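Under this independence assumption, the probability of being non-null in at least u of D dimensions is the tail of a Poisson-binomial distribution, which can be computed exactly by a small dynamic program. The sketch below illustrates the calculation; it is not the authors' implementation.

```python
import numpy as np

def at_least_u(probs, u):
    """P(at least u of D independent events occur): Poisson-binomial tail.
    probs holds the per-dimension non-null probabilities for one effect."""
    dp = np.zeros(len(probs) + 1)   # dp[k] = P(exactly k events so far)
    dp[0] = 1.0
    for p in probs:
        # Each dimension either fires (shift counts up) or not.
        dp[1:] = dp[1:] * (1 - p) + dp[:-1] * p
        dp[0] *= (1 - p)
    return dp[u:].sum()
```

For example, with two dimensions each non-null with probability 0.5, the probability of at least one non-null dimension is 0.75 and of both is 0.25.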
When testing with SCREEN, the sign of the association is always positive. For both CLIMB and mash, the sign of the association must match the direction of association that maximizes

all off-diagonal covariance terms are equal in magnitude for
Simulations indicated that, without these constraints, estimation becomes inconsistent: simultaneous model selection and flexible parameter estimation tended to result in an underestimate of the true number of classes. We maintain that, in the pairwise fitting step of CLIMB, accurate estimation of the number of latent classes is the most salient task. While imposing the above parameter constraints introduces some estimation bias, the method is still able to consistently estimate the number of latent classes. Thus, we imposed these constraints in all analyses with CLIMB. It is important to note that the D-dimensional Bayesian normal mixture model implemented in the last step of the CLIMB methodology does not impose these constraints.
The Overview of CLIMB section describes how CLIMB assumes the data follow a constrained normal distribution ϕ_c (see Equation 2) such that, for any class h, elements of µ_h corresponding to non-null dimensions are non-zero. This assumption on the non-null elements of µ_h can be modified to accommodate a stricter definition of the non-null dimensions of each class. That is, we can assume that a non-null dimension must have a mean whose magnitude is greater than or equal to some positive value, for example, the 95% quantile of a standard normal distribution. For all analyses presented here, however, we assumed only that the non-null elements of µ_h are non-zero.
Informativity of the prior. In An empirical Bayesian model, we state that the degrees-of-freedom parameters in the Bayesian mixture model, ν and κ, can be set approximately equal to nα, where α is computed from the pairwise fitting results with Equation 6. For each class h, these parameters control the concentration of the prior on the covariance around Ψ⁰_h and the concentration of the prior on the mean around µ⁰_h, respectively. These terms can therefore be adjusted to tune the informativity of the priors: for smaller values the priors become more diffuse, whereas for larger values they become more concentrated. In all simulations and in the VISION ChIP-seq analyses, we left ν and κ equal to nα. In each VISION RNA-seq analysis, we set these hyperparameters equal to 50 for all classes. We chose this approach because the clusters were less well-separated, and because n was large for these datasets, setting κ = nα could place undue confidence in the priors. Deliberately making priors more diffuse is a good strategy when one is concerned about sensitivity to the prior.
MCMC implementation details. Setting up the MCMC for CLIMB requires some user intervention. This is mainly through the selection of threshold δ (Equation 6), which determines the number M of classes included in the final model. As with any MCMC, the user must also choose the number of iterations for which to run the MCMC and discard initial iterations for a chain-specific burn-in period. These values are reported in Table S3. Parameters for the MCMC were initialized by drawing from the priors. Visual inspection of traceplots for a subset of model parameters was used to assess convergence.
Inferring similarity between conditions with CLIMB. Output from CLIMB can be used to determine relationships among the dimensions of a dataset. To do this, we first compute pairwise distances between dimensions based on the estimated mixture model by extending correlation-based distances 9 . Letting Ĉ_m be the correlation matrix derived from the estimated covariance matrix Σ̂_m for cluster m, and (Y)_ij be the (i, j)th element of matrix Y, the distance between dimensions d and d′ is given in Equation 14. Hierarchical clustering is then used on the matrix of pairwise distances found via Equation 14 to extract the associations between dimensions.
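The clustering step can be sketched with scipy's hierarchical clustering routines. The distance matrix below is a hypothetical stand-in for the output of Equation 14, which is not reproduced here.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical symmetric matrix of pairwise distances between 4 dimensions,
# standing in for the distances computed via Equation 14.
Dmat = np.array([[0.0, 0.1, 0.8, 0.9],
                 [0.1, 0.0, 0.7, 0.8],
                 [0.8, 0.7, 0.0, 0.2],
                 [0.9, 0.8, 0.2, 0.0]])

# squareform converts the full matrix to the condensed form linkage expects.
Z = linkage(squareform(Dmat), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 groups
```

Here dimensions 1-2 and 3-4 fall into separate groups, reflecting the block structure of the distance matrix; the resulting dendrogram is what is read as the relationship among conditions.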

Obtaining parsimonious characterization of condition-specificity by merging classes with CLIMB.
To merge classes with CLIMB's output, we compute all pairwise distances between them. Noting that each class is described parametrically by a multivariate normal distribution, we can compute the distance between any two classes as in Equation 15, where q and p are the estimated multivariate normal densities corresponding to classes m and m′, and KL is the Kullback-Leibler divergence. Equation 15 can be computed analytically. One can then perform hierarchical clustering based on this distance and cut the tree to produce the desired number of groups. A group of merged classes inherits a new class mean as a weighted average of the mean estimates of the member classes. That is, the mean estimate μ̂_G of group G, comprised of classes g ∈ G, is given in Equation 16.
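Equation 15 is not reproduced here, but the KL divergence between two multivariate normals has a well-known closed form, which is why the distance can be computed analytically. The sketch below uses a symmetrized KL divergence as the class-to-class distance; the exact symmetrization used in Equation 15 is an assumption.

```python
import numpy as np

def kl_mvn(mu0, S0, mu1, S1):
    """Closed-form KL( N(mu0, S0) || N(mu1, S1) ) for D-dimensional Gaussians."""
    D = len(mu0)
    S1_inv = np.linalg.inv(S1)
    diff = np.asarray(mu1) - np.asarray(mu0)
    _, logdet0 = np.linalg.slogdet(S0)   # slogdet for numerical stability
    _, logdet1 = np.linalg.slogdet(S1)
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - D
                  + logdet1 - logdet0)

def sym_kl(mu0, S0, mu1, S1):
    """Symmetrized KL, one candidate for the distance between two classes."""
    return kl_mvn(mu0, S0, mu1, S1) + kl_mvn(mu1, S1, mu0, S0)
```

Symmetrizing matters because KL itself is not a metric: KL(p‖q) and KL(q‖p) generally differ, while the sum is symmetric and still zero only for identical classes.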

Running SCREEN, mash, and NMF
SCREEN asks the user to select several input settings before running the model. These are:
1. nH: the maximum number of binary latent class configurations to store in memory
2. lfdr_method: the algorithm used to estimate the two-groups model
3. use_power: whether or not to weight each dimension by the estimated quality of data in that dimension
In all analyses, we ran SCREEN at the default settings, with nH = 10,000, lfdr_method = znormix, which is based on the EM algorithm for normal mixtures 10 , and use_power = True.
Mash asks the user to supply a matrix of standard errors for each observation, if available, and to pre-specify a list of candidate covariance matrices. In all analyses, we used built-in functions of the mashr R package to generate the list of candidate covariances. These include:
1. canonical covariance matrices (e.g., the identity matrix or matrices that correspond to signals specific to a single dimension), obtained using the cov_canonical function
2. data-driven covariances obtained from the subset of the data with the strongest signals
   - the strongest signals are selected by first running dimension-specific analyses separately for each dimension, then applying the get_significant_results function to the results
   - from these strongest signals, data-driven covariances based on principal components analysis with 5 components are found using the cov_pca function
   - also from the strongest signals, data-driven covariances based on the Extreme Deconvolution method 11 are computed using the cov_ed function
We did not provide mash with standard error estimates, as they were not available. All other parameter settings were left at their defaults.
In implementing NMF, we followed the procedure used by Meuleman et al. 12 . To avoid sensitivity to random initial conditions, we seeded the NMF algorithm with a singular value decomposition. We examined the performance of the NMF at ranks 5 to 30, and selected the optimal rank as the one that maximized the silhouette coefficient before that metric began a sharp decline. This NMF analysis completed in 36 hours.

Supplementary Note 6: Processing of empirical data
For the VISION ChIP-seq analysis, reads were aligned to reference genome mm10 and peaks were called using MACS 13 according to the ENCODE pipeline 14 . The peaks called across the various cell types were aligned using BEDTools 15 . CTCF peaks, reported from MACS using P-values, were normalized with S3norm across samples to adjust for variation in both signal and background regions 16 . Normalized values were converted to Z-scores by applying the standard normal quantile transformation, then taking the negative so that strong peaks correspond to positive Z-scores. For any locus where a peak was called in at least one cell type but not in all others, no P-values are available for the cell types without a called peak. We imputed the Z-scores for these absent peaks with null signals randomly simulated from a standard normal distribution. This imputation introduces null signals for each cell type at places where we do not have strong evidence for the presence of CTCF, and avoids producing ties in the data. The proportion of loci with imputed signals in a given cell type ranged from 0.25% to 33.16%. The final dataset comprises 10,141 binding sites.
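The quantile transformation and null imputation can be sketched as follows. The helper below is hypothetical (NaN marks a cell type without a called peak), illustrating the two steps described above rather than the actual processing pipeline.

```python
import numpy as np
from scipy.stats import norm

def peaks_to_zscores(pvals, rng):
    """Convert peak P-values to Z-scores via Z = -Phi^{-1}(P), so small
    P-values (strong peaks) map to large positive Z-scores; absent peaks
    (NaN) are imputed with standard-normal null draws to avoid ties."""
    p = np.asarray(pvals, dtype=float)
    z = -norm.ppf(p)                        # quantile transform, then negate
    missing = np.isnan(p)
    z[missing] = rng.standard_normal(missing.sum())
    return z
```

Drawing the imputed values at random, rather than filling in a constant, is what prevents ties among the many loci with no called peak in a given cell type.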
For the VISION RNA-seq data, RNA abundances were quantified with RSEM 17 . For a given lineage, genes whose estimated transcripts per million (TPMs) were zero across all experiments were removed from the analysis. Estimated TPMs were then averaged across two replicates for each cell population. The data were then shifted by an offset to prevent infinities, log2-transformed, and quantile normalized. Lastly, we applied a location shift to position the data's positive mode over the origin. The corresponding datasets for the three lineages contained 21,303, 20,995, and 22,940 genes, respectively.
For the DNase-seq data, we downloaded the peak significance data made available via Zenodo (https://doi.org/10.5281/zenodo.3838751, file name dat_FDR01_hg38.RData). We subsetted the samples to the 38 hematopoietic cell populations and retained loci where a peak was present in two or more samples. As with the ChIP-seq data, we converted the P-values to Z-scores with a standard normal quantile transformation and imputed Z-scores for absent peaks. A small number of accessibility signals were so large as to cause numerical issues for CLIMB. We therefore applied an upper threshold, capping any value that exceeded the 99.9th percentile of all signals at that percentile.

Supplementary Note 7: Enrichment analyses
Each gene ontology genomic regions enrichment analysis was conducted with an appropriate reference set. For the GREAT analysis described in the analysis of the VISION CTCF ChIP-seq data, the set of all detected CTCF binding sites on chromosome 11, across all cell populations, was used. For the VISION RNA-seq analyses, the set of input genes for each lineage was used as the reference set for each lineage-specific gene ontology enrichment analysis. For the GREAT analyses described in the analysis of the DNase-seq data, the set of all analyzed accessible sites across all autosomes was used as the reference set.
Gene sets obtained by CLIMB and DESeq2 for the differential expression analysis of each of the three lineages examined are presented in Supplementary Data 1. To select the gene ontology terms displayed in Fig. 4b, we searched for terms that were significantly enriched in at least one gene set (i.e., CLIMB only, DESeq2 only, or their intersection for some lineage), and reported the terms with distinct meanings.

Supplementary Note 8: Analysis of CTCF ChIP-seq on chromosome 7
To understand the stability of CLIMB, we complemented our original VISION CTCF ChIP-seq analysis with an examination of CTCF ChIP-seq data on chromosome 7. We prepared these data in the same manner as for chromosome 11. The proportion of loci with imputed signals in a given cell type ranged from 1.00% to 31.18%, and the final dataset comprises 8,983 loci. We identified 3 classes corresponding to constitutive binding behavior: the class of all ones, the class of all ones except in the CFUE cell population, and the class of all ones except in the CFUE and MONO cell populations. As in the previous analysis, these 3 classes make up ∼36% of all analyzed loci, and the average signal of the constitutive classes is larger than that of the non-constitutive loci (one-sided t-test, P = 4.52 × 10⁻⁴). Similar to the analysis of chromosome 11, which contained 15 non-empty classes, we estimated 14 non-empty classes in the data from chromosome 7, and many of the classes are the same as those previously identified (see Supplementary Fig. S29 for a full illustration of the non-empty classes). Further, CLIMB again returns a clustering of cell types that more closely resembles the expected relationships among the cell populations than do mash and Pearson correlations (Fig. S28). This clustering is quite similar to the one obtained from the analysis of chromosome 11, suggesting that, given that CTCF binding patterns are similar across chromosomes, CLIMB's inference is fairly robust.

Supplementary Note 9: Proofs

Proposition 1. Assume the estimated pairwise association vectors are equal to or a superset of the true pairwise association vectors. Then the graph-based enumeration and pruning algorithm is conservative, in that it results in a collection containing all the true latent labels, with the possibility of additional, unsupported classes. That is, letting H_true be the collection of true latent classes and H_alg be the collection of latent classes obtained from the enumeration and pruning algorithm, H_true ⊆ H_alg.
Proof. Assume to the contrary that ∃ h_m ∈ H_true but h_m ∉ H_alg. This implies ∃ (i, j) ∈ {1, . . . , D}, i < j, such that (a_i, a_j) ∉ h_ij. We obtain an immediate contradiction because H_alg contains all paths h′ such that (a_i, a_j) ∈ h_ij for all i, j with i < j. To show that H_alg is not necessarily equal to H_true, an explicit example, in which pairwise labels are obtained exactly, is worked out in Fig. 1.

Proposition 2. The graph-based enumeration and pruning algorithm provides a unique list of latent classes up to permutation of dimension labeling.
Proof. Consider a list of valid paths produced by executing the path enumeration and pruning algorithm, defined by edge sets of the form (S, (1, φ(a_1))), ((1, φ(a_1)), (2, φ(a_2))), . . . , ((D, φ(a_D)), T), corresponding to valid latent classes {φ(a_1), . . . , φ(a_D)}.

Proof. We will prove the consistency of α̂_m inductively. For simplicity of notation, we replace the previous notation x. The inductive hypothesis is stated for generic D, and we want to show that, for D + 1, it still holds.

Figure S14: ATAC-seq signals for CTCF binding sites on chromosome 11 in LSK, GMP, ERY, and T-CD4 cells. The same data are clustered by CLIMB (left) and mash (right).

Figure S15: H3K4me1 ChIP-seq signals for CTCF binding sites on chromosome 11 in LSK, GMP, ERY, and T-CD4 cells. The same data are clustered by CLIMB (left) and mash (right).
Figure S18: Class labels and sizes in each of the VISION RNA-seq data analyses. The gene expression classes obtained from CLIMB for the a, erythroid, b, megakaryocytic, and c, myeloid lineages are represented as a series of blocks in each column, with one block for the genes in that expression class for each cell type. The blocks are colored by expression category: −1 for genes that are lowly expressed or off, 0 for moderately expressed genes, and 1 for highly expressed genes. All genes were assigned to a class based on their maximum a posteriori class estimate. The log (base 10) of the number of genes in each class is plotted above the matrix of blocks.

Figure S21: Lineage-specific differential gene expression. Some myeloid-specific gene ontology (GO) terms were significantly enriched in the differentially expressed gene sets for the non-myeloid lineages (Fig. 4b). To examine this result more closely, we collected the genes whose expression patterns led to the myeloid-associated GO term enrichments in non-myeloid lineages. Three genes driving these GO term enrichments were Aif1, C1qc, and Fcgr2b. We visualized the RNA-seq patterns at these representative loci Aif1 (chr17: 35,169,740-35,177,254), C1qc (chr4: 136,888,989-136,893,881), and Fcgr2b (chr1: 170,954,239-170,980,505) for the a, erythroid series, b, megakaryocyte series, and c, myeloid series. The RNA-seq data are shown only for the minus strand (with respect to the reference genome) because the genes are oriented right to left in the mouse genome assembly (mm10). In the conventions for this display of stranded RNA-seq data, negative values indicate the density of reads mapping to the minus strand. Indeed, the raw data support that these genes are differentially expressed across the megakaryocyte and erythroid lineages, though they are generally expressed at much lower levels than in the myeloid lineage.
Table S11: Mean vectors used for each cluster in simulation 3.

Figure S33: H3K4me3 ChIP-seq signals for CTCF binding sites on chromosome 11 in LSK, GMP, ERY, and T-CD4 cells. Same as in Fig. S16, but hierarchically clustered using Pearson correlation and cutting the tree to yield 15 clusters.