Transcriptional dynamics of colorectal cancer risk associated variation at 11q23.1 correlate with tuft cell abundance and marker expression in silico

Colorectal cancer (CRC) is characterised by heritable risk that is not well understood. Heritable, genetic variation at 11q23.1 is associated with increased colorectal cancer (CRC) risk, demonstrating eQTL effects on 3 cis- and 23 trans-eQTL targets. We sought to determine the relationship between 11q23.1 cis- and trans-eQTL target expression and test for potential cell-specificity. scRNAseq from 32,361 healthy colonic epithelial cells was aggregated and subject to weighted gene co-expression network analysis (WGCNA). One module (blue) included 19 trans-eQTL targets and was correlated with POU2AF2 expression only. Following unsupervised clustering of single cells, the expression of 19 trans-eQTL targets was greatest and most variable in cluster number 11, which transcriptionally resembled tuft cells. 14 trans-eQTL targets were found to demarcate this cluster, 11 of which were corroborated in a second dataset. Intra-cluster WGCNA and module preservation analysis then identified twelve 11q23.1 trans-eQTL targets to comprise a network that was specific to cluster 11. Finally, linear modelling and differential abundance testing showed 11q23.1 trans-eQTL target expression was predictive of cluster 11 abundance. Our findings suggest 11q23.1 trans-eQTL targets comprise a POU2AF2-related network that is likely tuft cell-specific and reduced expression of these genes correlates with reduced tuft cell abundance in silico.

. Overview of study design and analysis. This study makes use of two single cell RNA sequencing (scRNAseq) datasets: Smillie et al. 22 -n = 32,261 and Elmentaite et al. 23 -n = 11,651. The analysis performed on each dataset is outlined by arrow colour: blue = all cells from Smillie et al. 22 , yellow = individual clusters from Smillie et al. 22 , green = individual clusters from Elmentaite et al. 23 . WGCNA weighted gene co-expression network analysis, GSEA gene set enrichment analysis.  Table S2.

Scientific Reports
To assess the contribution of 11q23.1 genes to the correlation of the blue module with POU2AF2, we performed gene set enrichment analysis of the 11q23.1 trans-eQTL target genes (FDR < 0.05) against the genes in the blue module, ranked by their module membership-a measure of their relatedness to all other genes in the module 20 . We found the genes in this module are highly enriched for 11q23.1 trans-eQTL targets, normalised enrichment score (NES) = 2.04, p = 7.78e−04, Fig. 2c. In addition, 11 trans-eQTL targets comprised the 12 genes with the greatest adjacency in the blue module (adjacency > 0.3), Fig. 2d. We also find that module membership is highly correlated with gene significance for POU2AF2 in the blue module, reinforcing the overall correlation of the network with POU2AF2 expression (Supplementary Fig. S1). Together, this replicates the trans-eQTL target expression relatedness previously described 11 and indicates that the expression of the majority of significant 11q23.1 trans-eQTL target genes is correlated with POU2AF2 specifically in this dataset.
Analysing cell-specificity of 11q23.1 eQTL target expression. To test the potential for cell-specific expression of 11q23.1 eQTL targets in this dataset, we performed dimensionality reduction and clustering of the single cells, identifying a total of 12 transcriptionally distinct cell-clusters, named '0'- '11' ,Fig. 3a. We then calculated the markers of each cluster and found the markers of cluster 11, comprising 318 cells, includes fourteen 11q23.1 trans-eQTL targets (FDR < 0.05), Table 1 (full list of cluster 11 marker genes available in Supplementary Data 2). In addition, cluster 11 markers are significantly enriched for 11q23.1 trans-eQTL targets (NES = 2.15, p = 7.52e−06), Fig. 3b. Cluster 11 was found to transcriptionally resemble the Smillie et al. 22 , tuft cell cluster,   11 , we wanted to assess the variability in the expression of these genes within each cluster. We pseudo-bulked the expression of all cells from each sample, within clusters, and analysed 11q23.1 trans-eQTL target expression (Fig. 3c). The relative expression level and variability of POU2AF2 and 18 of the 11q23.1 trans-eQTL targets is overwhelmingly greatest within cluster 11, indicating the eQTL effect on these genes is exacerbated within this cluster. Notably, the relative variation and expression of cis-eQTL targets COLCA1 and COLCA2 and trans-eQTL targets ANKHD1 and GIN1, is not greatest within this cluster, suggesting the eQTL effect on these genes may not be driven by transcriptional dynamics within this cell-type. In addition, we analysed 11q23.1 eQTL target variability at the single cell level, Supplementary Fig. S2. The variability of POU2AF2 and the same 18 trans-eQTL targets was found to be greatest within cluster 11, replicating the potential exacerbation of the eQTL effect within this cluster and supporting the validity of our pseudo-bulk method. POU2AF2 expression, however, was not found to demarcate cluster 11 by marker identification analysis.
To test the robustness of tuft cell-like mapping of 11q23.1 trans-eQTL target expression and variability, we replicated this analysis in an independent dataset of 11,651 healthy adult colonic epithelial cells from 3 individuals 23 , Supplementary Fig. S3. In this instance, we identified 19 cell-clusters by dimensionality reduction and unsupervised clustering, named '0'-'18' . The markers of cluster 18 were significantly enriched for the expression of 11q23.1 trans-eQTL targets (NES = 2.50, p = 5.52e−09), 11 of which were identified as markers of this cluster, Supplementary Table S3. Cluster 18 was also enriched for the tuft cell signature (NES = 2.41, p = 7e−08), replicating our previous finding. The relative variability and expression of the majority of 11q23.1 trans-eQTL targets was also greatest within cluster 18. The expression of 13 trans-eQTL targets was also most variable within cluster 18 when expression was analysed at the single cell level, Supplementary Fig. S4. Interestingly, POU2AF2 expression variability was not found to vary greatest within cluster 18 when using expression from single cells.
Understanding 11q23.1 cis-and trans-eQTL relatedness within clusters. The demarcation and variability of 11q23.1 trans-eQTL target expression within a tuft-like cell cluster strongly indicates the eQTL effect is derived from altered gene expression in this cell-type specifically. However, it is possible that the genegene relatedness of 11q23.1 eQTL targets, including those with POU2AF2, is not specific, but rather exacerbated in this cluster. To test this, we sought to divide samples by genotype at rs3087967, the variant associated with expression changes of trans-eQTL targets 11 , and analyse the congruence of trans-eQTL target expression relatedness within clusters. While genotype information was unavailable for the Smillie et al. 22 , dataset, raw sequencing reads were available for the Elmentaite et al. 23 dataset and as rs3087967 lies within the 3ʹ UTR of POU2AF2, we performed variant calling on these samples using orthogonal tools, Supplementary Table S4. Using freebayes 24 , we found that all samples were called as homozygous for the non-risk allele at rs3087967, except for one sample called as heterozygous. However, as all 3 other samples from this individual were called as homozygous nonrisk, it is likely this is a technical error. Using bcftools 25 , all samples were identified to be homozygous non-risk for rs3087967. The lack of genetic variation at rs3087967 in these samples is consistent with the absence of heightened POU2AF2 variability in cluster 18 and indicates this dataset would be unlikely to be of use to identify POU2AF2-related expression dynamics. The relatively high variability of trans-eQTL target expression within cluster 18 may be indicative of non-11q23.1-related dynamics, such as changes during differentiation or cell cycle progression.  49 . Avg_ log2FC average log2 fold change between cluster 11 and all other clusters, Pct1 proportion of gene expression in cluster 11, Pct2 proportion of expression in non-cluster 11, p_val_adj FDR-corrected p-value. www.nature.com/scientificreports/ To test the potential efficacy of the Smillie et al. 22 , dataset to further study the 11q23.1 eQTL target expression dynamics, we compared the standardized variability of 11q23.1 trans-eQTL targets within the respective demarcated clusters at the single-cell level, Supplementary Table S5. We found that the expression variability of 14 of the 15 11q23.1 trans-eQTL targets, expressed in both Elmentaite et al. 23 , cluster 18 and Smillie et al. 22 , cluster 11, was significantly increased in the latter (fold-change range 1.46-9.7, median = 1.97, mean = 2.83, 95% confidence interval = 1.50-4.18, 100,000 permutation p < 1e−5). The only 11q23.1 trans-eQTL target not to exhibit increased variability in Smillie et al. 22 , cluster 11 was OGDHL (fold change = 0.73). Notably, the expression of POU2AF2, was also greater in Smillie et al. 22 , cluster 11 (fold change = 1.94). Subsequent analysis was therefore focussed on the Smillie et al. 22 , dataset.
To divide samples into high and low expressors of POU2AF2-expression related genes, we hierarchically clustered samples based on the relative expression of pseudo-bulk WGCNA-identified blue module hub genes. Hub genes were defined by module membership > 0.7, intramodular connectivity > 0.7 and network adjacency > 0.3 and include 10 genes: TRPM5, PSTPIP2, SH2D6, ALOX5, BMX, GNG13, SH2D7, HCK, PLCG2, MATK, Fig. 4a. Five samples were separated at the first branch of clustering and exhibit a strong relative reduction in the expression of these genes. This grouping of samples is henceforth referred to as the 'blue module hub gene grouping' . To assess the significance of this separation at representing underlying transcriptional differences, we performed 10,000 permutations, using 10 randomly sampled genes, and generated a p-value of 0.055.
To first test gene-gene relatedness within cluster 11, we performed WGCNA on the relative, pseudo-bulked expression of the top 5000 most variable genes within this cluster, identifying a total of 20 modules, 7 of which exhibited correlations with sample covariates that approached significance (FDR < 0.1), Fig. 4b. We found one module, 'cluster 11 black' , which was highly correlated with both the 'blue hub gene grouping' (cor = 0.72, FDR = 0.032) and POU2AF2 expression (cor = 0.68, FDR = 0.048). 'Cluster 11 black' is comprised of 290 genes, including fifteen 11q23.1 trans-eQTL targets, Supplementary Data 3. As the 'blue module hub gene' grouping is derived from analysis of genes correlated with POU2AF2 across all cells, the correlation of 'cluster 11 black' with both this grouping and POU2AF2 expression indicates that this relationship is preserved within and potentially derived from this cell-cluster.
We then sought to test whether the gene-gene relatedness of 11q23.1 trans-eQTL targets, correlated with POU2AF2, was specific to cluster 11. To this end, we defined an auxiliary module, comprised of the twelve 11q23.1 trans-eQTL targets correlated with POU2AF2 (cor > 0.5, p < 0.05) in cluster 11 black, which exhibited high module membership (MM.black > 0.5), Fig. 4c. These genes include: HTR3E, LRMP, GNG13, ALOX5, SH2D7, PTGS1, MATK, BMX, AZGP1, IL17RB, SH2D6 and OGDHL. We performed pairwise module preservation of this auxiliary module, using identical parameters as used for intra-cluster 11 analysis, in all other clusters, see "Methods" section. For two modules, cluster 8 and cluster 10, the 5000 most variable genes included only a single member of the auxiliary module and so were excluded from this analysis. To analyse the preservation of the connectivity of these genes, we assessed the similarity of each gene correlation with the module eigengene, and the equivalent values in cluster 11 (cor.kME), Fig. 4d. No module exhibited a significant cor.kME with cluster 11 (p > 0.05), indicating there is low overall preservation of the connectivity between these genes in all other clusters. The average module membership of genes in this module (average.MM) was also reduced in all clusters compared with cluster 11, Fig. 4e. Finally, we analysed the pairwise gene-gene correlations between all members of the auxiliary module and POU2AF2 within each cluster, Fig. 4f. While there are rare correlations between these genes in other modules (cor > 0.5, p < 0.05), all comparisons reach this threshold in cluster 11. Together, this evidence indicates these 12 trans-eQTL targets comprise a transcriptional network that is correlated with POU2AF2 expression and likely specific to cluster 11. Identification of cluster 11 abundance associated genes. As many of the 11q23.1 eQTL targets, including those comprising the cluster 11-specific network, demarcate this cluster, we wanted to examine the relationship between their expression and cluster 11 abundance. First, we performed linear modelling of the relative abundance of cluster 11 and the pseudo-bulked expression of POU2AF2 only. We found that the expression of POU2AF2 is associated with the relative abundance of cluster 11 (coefficient = 0.389, p = 0.00431) indicating potential for the expression of POU2AF2 alone to be moderately predictive of the abundance of this cell-type, Fig. 5a.
To agnostically test the predictive power of 11q23.1 trans-eQTL target expression on cluster 11 abundance, we tested the association between the expression of all genes and the proportion of cluster 11 in samples, Fig. 5b. We found that all genes significantly associated with the abundance of this cluster (FDR < 0.05, log-fold change > 1) were indeed 11q23.1 trans-eQTL targets, including: ALOX5, BMX, GNG13, MATK, SH2D7, PSTPIP2, TRPM5 and PTGS1. In fact, the strength of gene association with cluster 11 abundance was also significantly enriched for 11q23.1 trans-eQTL targets (NES = 2.15, p = 8.03e−10, Fig. 5c).
While our linear modelling strongly supports a predictive role of 11q23.1 eQTL target expression in the abundance of cluster 11, we wanted to agnostically test whether the expression of the POU2AF2-related trans-eQTL targets was correlated with abundance changes to any cluster. To this end, we utilised miloR 26 to calculate cell-neighbourhoods, Supplementary Fig. S5, which were then used to perform differential abundance testing across the 'blue module hub gene grouping' , defined in Fig. 4a. To generalise neighbourhoods to the cell clusters we have identified, we subsequently filtered for neighbourhoods which represented a majority of a single cluster (majority proportion > 0.8), Fig. 5d. No neighbourhood from cluster 10 passed this threshold and so this cluster was excluded. In the low blue hub gene group, there are significant reductions (Spatial FDR < 0.01) from cluster 0 and increases in neighbourhoods in clusters 1, Fig. 5d. The change in abundance of these neighbourhoods represent a small proportion of the total number of neighbourhoods detected in these clusters (2.1% and 1.3% respectively) and are therefore unlikely to represent a dramatic phenotype. In contrast, all 7 of the www.nature.com/scientificreports/

Discussion
In this study, our pan-cluster WGCNA serves as validation of the expression relatedness between trans-eQTL targets previously identified to be correlated with CRC associated variation at 11q23.1 11 . 11q23.1 trans-eQTL target expression was also found to be more correlated with POU2AF2 over other cis-eQTL targets. Following clustering of single cells, many of the genes demarcating a single cluster, number 11, were found to be 11q23.1 trans-eQTL targets. Enrichment of these markers against putative gene sets showed cluster 11 transcriptionally resembled tuft cells, replicated in an independent dataset. WGCNA within cluster 11 identified several 11q23.1 trans-eQTL targets that exhibited a high level of relatedness, subsequent analysis of the preservation of this relatedness indicated this was likely to be specific to this cell cluster. Finally, samples which exhibited low overall expression of 11q23.1 trans-eQTL targets most correlated with one another, were found to exhibit a specific and dramatic reduction of the tuft cell-like cluster. Therefore, our results potentiate that transcriptional dynamics correlated with CRC risk associated variation at 11q23.1, are indicative of a reduced abundance of colonic tuft cells.
To our knowledge, this is the first study to map CRC risk associated eQTL targets to specific epithelial celltypes. Heritable inflammatory bowel disease risk loci have recently been associated with shifts in transcriptional Neighbourhoods were identified using miloR 26 and grouped by cluster. Only neighbourhoods with major cluster proportion > 0.8 are plotted and only significant neighbourhoods (Spatial FDR < 0.01) are coloured by their logFC (red = down, blue = up). No neighbourhoods from cluster 10 were comprised of major proportion > 0.8. www.nature.com/scientificreports/ dynamics in individual colonic epithelial cells 22 and it is possible that other CRC risk variants, with robust eQTL effects, are associated with cell-specific changes in transcriptional dynamics. Delineating the cell-specific expression of CRC risk-associated eQTLs may provide valuable insights into the mechanism of risk-associated pathophysiology and should be a focus of future work. The growing size and availability of scRNAseq datasets will likely make cell-specificity mapping of heritable disease associated eQTLs significantly easier, particularly if genotype data is available. Indeed, our study is also not the first use of WGCNA to detect gene-gene relatedness in scRNAseq data. WGCNA has been utilised to identify gene modules associated with the activation of neuronal stem cells 27 and human induced pluripotent stem cells 28 , however, like our own study, these studies did not utilise expression from individual cells as the input for WGCNA. Surprisingly, the expression of the vast majority of 11q23.1 eQTL targets mapped to the cell-type that transcriptionally resembles tuft cells, including; LRMP, IL17RB, SH2D6, PLCG2, PSTPIP2, TRPM5, SH2D7, AXGP1, PTGS1, ALOX5, BMX. Many of the most significant 11q23.1 trans-eQTL targets, such as LRMP, SH2D7 and ALOX5, have not previously been associated with specific expression in this cell-type, potentiating their status as a marker in the colonic epithelium. Additional markers of the tuft cell-like cluster include HCK and HPGDS, for which there is some orthogonal evidence of expression within tuft cells 29,30 . This improves our confidence that cluster 11 represents this cell-type and is not an artefact of the analysis.

Scientific Reports
Notably, our pan-and intra-cluster WGCNA indicates decoupling of 11q23.1 cis-eQTL targets, suggesting trans-eQTL target expression is attributed to the expression of POU2AF2 but not COLCA1 or COLCA2. While the Smillie et al. 22 , dataset is not genotyped, numerous studies identify POU2AF2, COLCA1 and COLCA2 as eQTL targets of 11q23.1 variation 10,11,13,14 , supporting our use of their expression as a proxy of 11q23.1 genetic variation. In addition, the 11q23.1 cis-eQTL targets are highly correlated with one another in the bulk expressionbased studies, our observation of divergence in their expression across transcriptionally distinct cell clusters is consequently novel. Delineation of the 11q23.1 transcriptional dynamics toward POU2AF2 specifically, implies the association between POU2AF2 expression and tuft cell abundance is a potentially causal feature of CRC risk. However, as these findings are based on correlation-based analyses in silico, causal relationships are only inferred. Experimental testing of such using gene knock out models is required to both confirm POU2AF2 potential causality and assess whether COLCA1 or COLCA2 confer causality.
Recent studies have identified a direct interaction between POU2AF2 and master transcriptional regulator of the tuft cell lineage, POU2F3 31,32 . These studies indicate that in cell line models of a tuft cell-like subtype of small cell lung cancer, POU2AF2 acts as a transcriptional coactivator of POU2F3 targets, including 11q23.1 trans-eQTL targets PTGS1 31 and AVIL 32 . While POU2F3 was not identified as an initial 11q23.1 trans-eQTL target, it was found to correlate with POU2AF2 expression in cluster 11, Supplementary Table S3. As the 11q23.1 trans-eQTL targets we find correlated with POU2AF2 putatively demarcate tuft cells in our analyses, direct interaction with POU2F3 is a potential mechanism by which POU2AF2 mediates their expression along with tuft cell differentiation and determination in the colon. Interestingly, POU2AF2 expression was also found to be positively correlated with small cell lung cancer cell survival in vitro and in vivo 32 . While we find reduced POU2AF2 expression to be associated with CRC risk, the functional interaction between POU2F3 and POU2AF2, in association with 11q23.1 eQTL target expression, is concordant with the cluster-specific transcriptional dynamics we observe.
It is noteworthy that many of the genes with expression found to be in association with POU2AF2 in the scRNAseq data were indeed identified as trans-eQTL targets in the bulk analysis 11 . While the mapping of the expression of these genes to a tuft-like cell cluster could only be achieved by using expression from individual cells, the prior identification of these genes by bulk analysis is a testament to the power of such approaches, and concordance of their findings with single cell-based methods.
Finally, the overall potentiation of tuft cell perturbation is of great interest regarding characterisation of the mechanism governing CRC risk at 11q23.1. Tuft cells are associated with stem-, neurotransmitting-and immunerelated functions [33][34][35][36] , however much of the evidence regarding their function is derived from other organs and cannot necessarily be extrapolated to the colon. Interestingly, genetic ablation of tuft cell abundance has been associated with exacerbated tumour progression in mouse models of pancreatic cancer 37,38 . Both studies show this is likely in association with perturbed immune cell function and signalling. In line with this, tuft cell abundance has recently been shown to be reduced in quiescent ulcerative colitis patients 39 , suggesting tuft cells are involved in immune regulation in the colon. Future work should aim to experimentally validate the relationship between 11q23.1 variation and tuft cell abundance, examine how this impacts tumourigenesis and identify the potential mechanism of CRC risk predisposition.

Methods
Pre-processing, dimensionality reduction and clustering of scRNAseq data. In the analysis of Smillie et al. 22 , scRNAseq data, samples from one patient, N51, were removed as they were found to be outliers on the basis of cell-level mitochondrial and ribosomal protein gene expression, in addition to principal component analysis of pseudo-bulked expression. Fastq files of the Elmentaite et al. 23 , scRNAseq data were aligned to the hg19 transcriptome using the 10× Genomics Cell Ranger v3.02 pipeline 40 to produce raw gene-level counts.
All subsequent expression analysis was completed in R version 4.0.2. Once raw counts had been obtained for both datasets, bad quality droplets were filtered by a series of quality control steps: (i) potential empty droplets were detected by thresholding at the inflection point of the barcode rank plot of cells, calculated using DropletUtils v1.8.0 41 , (ii) genes expressed in fewer than 20 cells were removed, (iii) cells with an expression sparsity > 0.99 were removed, (iv) cells with proportion of mitochondrial gene expression greater than 2.5× (median absolute deviation) from the median proportion were removed.
Following filtration, counts were loaded into a Seurat object using Seurat v4.0.1 42 . Initial clustering was performed using Seurat's reciprocal PCA method of batch correction with SCTransform as per authors guidelines www.nature.com/scientificreports/ (https:// satij alab. org/ seurat/ artic les/ integ ration_ rpca. html) 42 . The processed Seurat object was first split by sample and data integration was performed on the first 50 principal components. The principal components of the integrated data were then calculated, which was used to calculate the UMAP embedding for all cells, based on 50 principal components.
To identify clusters, a nearest neighbour graph was constructed on the integrated data, using 50 principal components. Clusters were then identified via the FindClusters function using a resolution of 0.6. For the Smillie et al. 22 , data analysis, we utilised a k value of 250, as this is concordant with the authors analysis and provided the greatest confidence in identification of clusters than other k values. Concordance was tested by enrichment against the authors cluster markers. For the Elmentaite et al. 23 , analysis, we utilised a k value of 20 as any value greater than this resulted in failure to detect the tuft-cell resembling cluster.
To identify potential doublets in the filtered dataset, we utilised DoubletFinder v2.0.3 43 as per the authors guidelines (https:// github. com/ chris-mcgin nis-ucsf/ Doubl etFin der). The fully filtered dataset was then re-used as input for integration, dimensionality reduction and clustering as described above.
To test the robustness of the clusters we identified, we performed pairwise differential gene expression analysis by a receptor operator curve test using Seurat's FindMarkers function. In order to only merge clusters which were extremely similar, without over-clustering, we defined similar clusters as those with less than 30 differentially expressed genes with area under the curve score of 0.6 differentiating them. We found no clusters that fall below this threshold and so did not modify our initial clustering in either dataset.
Pan-cluster WGCNA. To analyse correlation of gene expression across all cells from the filtered Smillie et al. 22 , data, we first subset for the nominally significant (p < 0.01) 11q23.1 trans-eQTLs previously reported 11 . The relative pseudo-bulked expression was then calculated by: (i) summing reads for each gene across all cells within samples, (ii) recombining the summed reads into a non-normalised bulk matrix, (iii) log normalising using TMM normalised size factors, calculated using edgeR v3.32.1 44 . Log-TMM normalised bulk expression was then z-scored across genes before analysis.
To perform the network analysis, we used WGCNA v1.69 20 . First, POU2AF2, COLCA1 and COLCA2 pseudobulk expression was extracted. A soft threshold of 14 was then selected after computation of mean connectivity and scale free topology, using the recommended 'powerEstimate' . A signed adjacency matrix was then calculated, which was subsequently used to calculate a topological overlap matrix (TOM). Modules were defined by using a dynamic tree cut of hierarchically clustered gene expression by average distances. We did not find any modules with a height of module separation below 0.25 and so did not merge any module. Module eigengenes were then calculated and their correlation with binarized sex, batch, site, and relative, pseudo-bulked expression of POU2AF2, COLCA1 and COLCA2 was then assessed. Correlation p-values were multiple testing corrected by the Benjamini-Hochberg method 45 . To visualise the blue module hub genes, a network object was generated from the TOM of blue module genes using network v1.17.1 46 . Non-connected genes, along with adjacencies < 0.3, were then removed and remaining genes were plotted using ggplot2 v3.3.5 47 .
Gene set enrichment analysis. All gene set enrichment analysis was carried out using R package fgsea v1.14.0 48 . Genes were ranked by their module membership, gene significance for POU2AF2 expression, or log fold change of differential expression as stated. In the case where multiple gene sets were being tested, i.e. cluster 11 markers against all Smillie et al. 22 , putative markers, p-values were multiple testing corrected by the false discovery rate method.

Calculation of cluster markers.
To calculate the markers of each of our own and the Smillie et al. 22 , clusters, we first generated a log-transcripts per 10,000 expression matrix for the expression of genes within each cell. This was done so that markers could be calculated using expression values that were not affected by the relative expression of genes within this dataset and so were more applicable to future use. Markers were identified by differential expression of genes within each cluster and all other clusters combined, using MAST v1.160 49 . Analysing the variability of trans-eQTL targets within clusters. To analyse the variability of the expression of 11q23.1 trans-eQTL targets within each cluster, we applied the pseudo-bulking approach described in Pseudo-bulk WGCNA to each cluster independently. To make expression variability comparable across samples and clusters, the expression was z-scored across samples.
To analyse expression variability at the single level, we utilised Seurat's FindVariableFeatures function and variance stabilising transformation. In concordance with the identification of many trans-eQTL targets identified as markers, their mean expression was very low in several clusters within each dataset. For within dataset comparison of variance, we therefore utilised the raw variance, not normalising for mean expression. For the between dataset comparison of variance, we utilised the normalised variance values. The probability of 11q23.1 eQTL target variation across datasets was calculated by 100,000 permutations of normalised variance values.
Genotyping of Elmentaite et al. 23 samples. To identify the rs3087967 genotype of Elmentaite et al. 23 , samples, we utilised two variant calling approaches. These were chosen based on findings from a recent review which identified these methods as the most sensitive for this purpose 50 . Freebayes 24 was used with default settings, genotyping over a 10 bp region including rs3087967. Bcftools 25 variant calling was performed on chromosome 11 using a minimum base quality of 30, disabling read-pair overlap detection and not discarding anomalous pairs. www.nature.com/scientificreports/ Sample group definition. In the absence of genotype data for Smillie et al. 22 , patients, we grouped samples into high and low expressors of a POU2AF2-associated signature by defining the hub genes of the pseudo-bulk WGCNA blue module. These were defined by a module membership (MM) > 0.7, intramodular connectivity (kIM) > 0.7 and network adjacency > 0.3. Samples were then hierarchically clustered by their relative pseudobulked expression of these genes using complete distance. We tested the robustness of our sample grouping by bootstrapping, selecting 10 random genes 10,000 times, and counting the number of times this exact separation was achieved-which was 550 times.
Intra-cluster WGCNA. To agnostically identify gene-gene relatedness within cluster 11, we performed WGCNA as described (see Pseudo-bulk WGCNA), selecting only the 5000 most variable genes using Seurat's FindVariableFeatures and variance stabilising transformation. The scale free topology threshold utilised was 6, as per the 'power estimate' . p-values were corrected for multiple testing as before.

Module preservation analysis.
To analyse the preservation of the relatedness of POU2AF2-correlated 11q23.1 trans-eQTL targets, we defined an auxiliary module, comprised of the 12 trans-eQTL targets correlated with POU2AF2 expression (cor > 0.5, p < 0.05) within the 'cluster 11 black' module. Pseudo-bulked expression of each cluster was calculated and subset for the 5000 most variable genes as before, see Pseudo-bulk WGCNA and Intra-cluster WGCNA. The expression from each cluster was then raised to the same scale free topology threshold as cluster 11. The preservation of all modules, including the auxiliary module, was calculated in a pairwise manner with cluster 11, following the author's tutorial (https:// horva th. genet ics. ucla. edu/ html/ Coexp ressi onNet work/ Modul ePres ervat ion/ Tutor ials/) 20 . To summarise module preservation results, we extracted cor.kME values-the correlation of auxiliary module genes with the auxiliary module eigengene, and equivalent results from cluster 11. We also extracted the p-value for this correlation (log.p.cor.kME), which was subsequently un-logged. The mean connectivity of auxiliary module genes with the module eigengene was also calculated within each cluster. Pairwise correlations (p < 0.05) between the expression of the auxiliary module genes with one another and POU2AF2 expression were plotted using corrplot 51 .
Linear modelling of cluster abundance and pseudo-bulked gene expression. Univariate linear modelling of cluster 11 abundance was performed on the TMM normalised, non-z-scored pseudo-bulked expression matrix, so that results are more relevant to further study. We fitted a linear model to all genes and performed empirical Bayes moderation using limma v3.46.0 52 . p-values were adjusted by Benjamini-Hochberg multiple testing correction 45 .
Differential abundance testing. Differential abundance testing was performed using miloR v0.99.8 26 . To mitigate any potential for package-specific artefacts of the analysis, we first re-generated the k nearest neighbour graph using 250 nearest neighbours from the integrated expression. The graph was then built using 4 PCA components. Differential abundance testing was carried out using a Quasi-Likelihood F-test on TMM normalised cell proportions. Differentially abundant neighbourhood results were then annotated for their majority cluster proportion and those comprised of fewer than 80% of the majority cluster were removed.
Human subject inclusion. All the data utilised in this study has been previously published. Informed consent and approval for human subjects from Smillie et al. 14

Data availability
Fastq files are not publicly available for the Smillie et al. 22 , data. Raw gene-level expression counts of healthy colonic epithelial cells, published by Smillie et al. 22 , were obtained from Single Cell Portal using accession code SCP259. Fastq files for the second dataset, published by Elmentaite et al. 23 , were obtained from https:// www. ebi. ac. uk/ array expre ss using accession code E-MTAB-9543. All code used for the analysis performed in this study is available at https:// github. com/ Bradl eyH017/ Harris_ et_ al_ WGCNA scRNA. www.nature.com/scientificreports/