Single-nuclei transcriptomics enable detection of somatic variants in patient brain tissue

Somatic variants are a major cause of human disease, including neurological disorders like focal epilepsies, but can be challenging to study due to their mosaicism in bulk tissue biopsies. Coupling single-cell genotype and transcriptomic data has potential to provide insight into the role somatic variants play in disease etiology, such as by determining what cell types are affected or how the mutations affect gene expression. Here, we asked whether commonly used single-nucleus 3’- or 5’-RNA-sequencing assays can be used to derive single-nucleus genotype data for a priori known variants that are located near to either end of a transcript. To that end, we compared performance of commercially available single-nuclei 3’- and 5’- gene expression kits using resected brain samples from three pediatric patients with focal epilepsy. We quantified the ability to detect genetic variants in single-nucleus datasets depending on distance from the transcript end. Finally, we demonstrated the ability to identify affected cell types in a patient with a RHEB somatic variant causing an epilepsy-associated cortical malformation. Our results demonstrate that single-nuclei 3’ or 5’-RNA-sequencing data can be used to identify known somatic variants in single-nuclei when they are expressed within proximity to a transcript end.

www.nature.com/scientificreports/ three pediatric patients with focal epilepsies. We tested our ability to identify variants on a single-nuclei basis from these datasets depending on the distance of a variant from the transcript end. Finally, we demonstrated a successful example of single-nuclei genotyping from 5'-RNA-sequencing data that enabled us to determine the affected cell types in a patient with a RHEB somatic variant.

Results
Study overview. We established a workflow to compare performance between the 10 × Genomics Chromium NextGEM Single-Cell 3' and 5' gene expression kits and evaluate detection of known germline and somatic variants in the resulting datasets (Fig. 1). For these analyses, we obtained frozen resected brain tissue from three pediatric patients treated for focal epilepsy and enrolled in an IRB-approved research protocol. In a previous study, germline and somatic variants were identified in bulk tissue samples via exome sequencing analysis 9 . In this study, we performed single-nuclei RNA-sequencing (snRNA-seq) on remaining material cut from the same tissue section (Fig. 1a). To evaluate performance of the 3' versus 5' gene expression kits, we compared QC metrics, gene expression, cell type clustering, and marker gene efficacy. For variant analysis we evaluated the detection rate of both germline and somatic variants depending on distance from the transcript end and expression level of each gene of interest (Fig. 1b). Finally, as a proof-of-concept, we identified cell types expressing a disease-causing RHEB variant using single-nuclei 5'-RNA-seq data (Fig. 1c).

Consistent QC and gene expression across kits.
We generated data from 41,953 nuclei in total from three patient samples (µ = 7,000 nuclei per sample), sequenced to equivalent depth with single-nuclei 3' or 5'-RNA-seq kits (Fig. 2a). Both kits had similar sensitivity in detecting RNA molecules, yielding a similar number of UMIs and genes detected per nucleus (Fig. 2b,c). The quality of captured nuclei was also similar, as we observed a similar low proportion of mitochondrial and ribosomal genes, which is a metric that has been used to detect low-quality or stressed cells (Fig. 2d,e) 10 . We repeated these comparisons on a patient specific level and found similar results (Fig. S1). Next, we compared average gene expression by kit, which we expected to be similar as both sets of nuclei were isolated from the same original tissue section. Indeed, the mean lognormalized gene expression values by kit were highly correlated on average (R 2 = 0.99) (Fig. 2f), as well as on a patient-specific basis (Fig. 2g,h).

Similar cell type distribution and proportion.
As cell type identification is a major goal of single-cell RNA-sequencing experiments, we determined whether the two kits detected a similar distribution and proportion of cell types. We integrated all six datasets and clustered nuclei by their gene expression profiles, performing annotation by examining expression of known marker genes (Fig. 3a). In general, we observed a similar distribution of nuclei across both kits (Fig. 3b,c). To determine the similarity of gene expression within the detected cell types by kit, we performed pairwise correlations between kits for each of the cell type clusters using average gene expression values. Cell types identified in each kit were highly similar, as the lowest correlation between two cell types was 0.95 (Fig. 3d). We then quantified the number of nuclei per cell type for each capture method and found similar numbers of nuclei per cell type for each kit. There were slight differences in the proportion of excitatory neuron populations and MGE-Derived interneurons (Fig. 3e). This could be due to random variation in cell type capture or a difference in the efficacy of marker genes between kits. To assess this possible difference in cell type markers for each kit we utilized a technique by Brekken et al. 11 to calculate a marker gene efficacy score (Fig. 3f). We found that most cell type markers had similar efficacy for both 3' and 5'-RNA-seq data sets. Marker genes that were high-scoring in one dataset but not in the other (ARHGAP15, SKAP1, and CCND3) can be explained by the low number of nuclei present in their respective cell type clusters (Fig. 3f, S2).
Detection of germline variants in snRNA-seq data. Having shown that samples processed via either kit have similar gene expression and cell type distribution, we next aimed to characterize our ability to detect genetic variants in snRNA-seq datasets. For this analysis, we used a set of exonic germline variant positions previously identified in these tissues via bulk tissue exome sequencing 9 . We gauged the detection sensitivity of snRNA-seq by looking at coverage of these positions in our datasets. First, we asked how distance from a transcript end (3' or 5') affects the rate of detection for the variant positions in snRNA-seq data. The 10 × Genomics Chromium Next GEM Single-Cell 3' and 5' gene expression kits sequence 91 and 90 bp from the captured transcript end respectively. 58.2% of positions within 100 bp of the 5' or 3' transcript end were identified in at least 5% of nuclei in their respective dataset. However, we found several examples in which we had coverage of positions hundreds or thousands of base pairs from the transcript end (Fig. 4a). Detection of positions far from the transcript end could be a result of mispriming, which has been previously shown to occur with these technologies 12 . During reverse transcription, a poly dT primer is used to target the Poly-A tail of mRNA molecules. If there are any repeated A sequences within the body of the transcript, those sequences could be misprimed and initiate reverse transcription instead. An example is RTN4 p.Ser440Asn. It is a variant position that should not be detected due to its location near the middle of the transcript, yet it was detected in ~ 25% and 12% of nuclei from our 5' and 3'-RNA-seq datasets. Surrounding this gene we found repeat A-sequences that suggest internal poly-A mispriming may have lead to the observed coverage of this position (Fig. S3). 45.1% of positions located within 1,000 bp of a transcript end had coverage in at least 5% of nuclei, suggesting mispriming occurs to a significant extent in 10 × Genomics single-nuclei gene expression data. On the other hand, there were positions that fell within the 91 bp constraint but had little to no representation in our snRNA-seq data. Therefore, we queried the average log normalized expression of each gene and correlated this to distance from the transcript end and proportion of nuclei genotyped (Fig. 4b).
Positions near the transcript end that had little representation in our snRNA-seq data were mostly not highly expressed in our dataset. Overall, 77.2% of posi- www.nature.com/scientificreports/ tions in genes moderately expressed (log normalized abundance > 5; range 0-31) were detected in at least 5% of cells, whereas only 36.7% of positions in genes with log normalized abundance > 0.1 were detected. Finally, to assess the detection rate of false positives, we looked for germline variants not expected to be found in any given patient. For this test, we used germline exonic variants called by GATK haplotype caller that were absent in gnomAD (total = 432). The rationale for using variants absent from gnomAD is to ensure they aren't common germline variants that could be seen in another patient. We manually confirmed no overlap of variants across the three patients. We observed only two false positive "alt" calls (false positive rate < 0.5%). In both cases, the false positive call was only detected in a single nucleus. www.nature.com/scientificreports/ Successful detection of RHEB somatic variant. From our whole exome sequencing data, we previously identified candidate disease-causing variants in our patients 9 . There was at least one candidate somatic  www.nature.com/scientificreports/ variant present within each patient. One patient diagnosed with Type II focal cortical dysplasia (FCD) had a somatic variant in RHEB, a gene that is known to play a major role in the mTOR signaling pathway and has been altered in patients with FCD 13 . This particular variant was located within the first 105 nucleotides from the transcription start site, making it an ideal candidate for genotyping in single-nuclei 5'-RNA-seq data (Fig. 4c). We used VarTrix (GitHub-10XGenomics/vartrix: Single-Cell Genotyping Tool) to identify reference-and variant-supporting reads in individual nuclei and then quantified the number of variant calls (Alt), reference calls (Ref), and the absence of coverage (No call) for each chemistry. The RHEB variant was expressed in about 5% of cells in total, which is close to the 11.40% VAF detected by bulk exome sequencing. As expected, we observed a higher number of genotyped nuclei within our 5' dataset (27%) versus the 3' dataset (0.7%) because of the variant's proximity to the 5' transcript end (Fig. 4d). Coverage and gene expression are also important determinants of variant identification. A large proportion of nuclei in the No Call group had zero expression of RHEB and a lower overall number of RNA molecules detected per nucleus (Fig. 4e). The small number of nuclei genotyped in the 3' dataset may be caused by an A repeat sequence observed just upstream of the variant position. We then wanted to determine how these variant calls were distributed across cell types within our datasets. Most Alt calls were enriched in Microglia, Excitatory Neurons, Astrocytes, Oligodendrocytes, and Oligodendrocyte Precursor Cells at VAFs of 5%, 3%, 34%, 43%, and 31%, respectively (Fig. 4f). As most cell types expressed the RHEB variant, including cell types from distinct developmental lineages (i.e., microglia from the mesoderm and neurons from the ectoderm), we reasoned that this somatic variant arose before gastrulation. We repeated the analysis on candidate variants for the other two patients, but in one case the gene was not expressed highly enough to obtain meaningful genotyping data and in the other case the variant was too far from the transcript end (Fig. S4).

Discussion
Our results show comparable performance of 10 × Genomics 3' and 5' gene expression kits in quality control metrics, gene expression, and cell type recovery when directly comparing the same patient samples. Our observations are consistent with a previous study that demonstrated high transcript and gene detection sensitivity of 10 × Genomics Single-cell 3'-RNA-seq v.3.0 and 5'-RNA-seq v.1 kits compared to other scRNA-seq methods 14 . We expanded on these findings by investigating the most recent 5' v.2 and 3' v.3.1 kits and found similar numbers of transcripts and genes detected per cell, as well as similar gene expression and cell type recovery. The similar kit performance suggests that it would be appropriate to decide which chemistry to use based on targeted variant location, in cases where single-nuclei genotyping is desired as we demonstrated here. We were able to show the successful detection of genetic variants from 10 × Chromium gene expression datasets, provided that the variant is (a) expressed and (b) located near to a transcript end or in proximity to an alternative polyA capture site. We demonstrated how the ability to detect somatic variants in single cell transcriptomic data can give insight into the affected cell types within an epilepsy-associated malformation of cortical development. We chose to use VarTrix from 10 × Genomics to identify reference and variant-supporting reads within our single-nuclei transcriptomic datasets because it provides a streamlined user experience when starting with CellRanger output files. Our results confirmed that VarTrix can identify a priori known variants in snRNA-seq data with a low false positive rate of < 0.5%. VarTrix is one of many tools developed for this purpose. Notably, SCReadCounts not only tabulates read counts at known variant positions, but also has a discovery mode for identifying novel somatic variants in single-cell RNA-seq data. We replicate the results of Prashant et al. using VarTrix and obtain very similar results (https:// github. com/ bedro sian-lab/ 5p3p_ genot yping) 15 . SCMut is a tool that can discover somatic variants in single-cell transcriptomes and control for false positives 16 . Selection of tool will depend on the needs of each specific study.
Our analysis of the RHEB somatic variant provides an example of the utility of genotyping from single-nuclei transcriptomic data. This analysis enabled us to determine that the RHEB variant is enriched, but not restricted to Microglia, Neurons, Astrocytes, Oligodendrocytes, and Oligodendrocyte Precursor Cells. As microglia and other brain cell types arise from different germ layers in development, this implies that this post-zygotic variant is acquired before gastrulation occurs. This finding is meaningful because it provides a model of how this diseasecausing variant arose in development and it suggests the variant could potentially be found in other tissues of the patient besides brain with yet to be determined consequences.
Genotyping from single-cell 5' or 3' gene expression data will always be limited in utility to specific kinds of variants. Other groups have attempted to extend on the 10 × Genomics platform to obtain improved genotype or isoform information from droplet based single-cell RNA-sequencing data by incorporating additional sequencing of leftover barcoded cDNA from the 10 × Genomics protocol 8,17 . One method that we recently published leveraged long-read sequencing of the leftover full-length barcoded cDNA fraction to genotype somatic variants deep in the PTEN gene 18    www.nature.com/scientificreports/ interest in the leftover cDNA fraction 8 . Both methods have been successful in correlating genetic information to scRNA-seq datasets; however, there are some aspects that make these methods less user friendly. The first is the utilization of other sequencing methods. This can add additional cost and effort to an already costly technology. Another caveat of using other sequencing methods is that long-read sequencing currently has lower throughput than short-read sequencing. This limitation may be addressed as sequencing technologies advance. Another limitation mentioned above is mispriming that occurs when repeated A sequences are present within the transcript. These A-repeat sequences can serve as alternative polyA capture sites and can initiate reverse transcription 12 . Depending on their location, they can lead to unexpected sequencing coverage far from a transcript end. While mispriming can result in unwanted off-target effects, it can also allow for variants to be sequenced far from the transcript end. This of course is not a factor that can be controlled. But the presence of an alternative polyA capture site could increase the chances for representation of a genetic variant regardless of its position in the transcript.
Here, we provided a direct comparison of single-nuclei 5' versus 3'-RNA-seq using parallel patient samples. We also benchmarked our ability to detect genetic variants in 10 × Chromium gene expression datasets. This approach could allow researchers to gain genotype information from datasets without the need for additional sequencing methods in some cases. This allows for more insights to be gathered with fewer resources. This increased information can allow for the production of new insights that can inform the field on the role somatic variants have in different human diseases.

Methods
Tissue acquisition and ethical considerations. Tissue samples were derived from surgically resected brain tissue obtained from three pediatric patients enrolled in a research protocol (IRB18-00786) approved by the Nationwide Children's Hospital Institutional Review Board. All methods were performed in accordance with the appropriate guidelines and regulations. As patients were under the legal age of consent, informed consent was obtained from parent/legal guardians prior to the start of this study. Details of patient enrollment, sample collection, exome sequencing, and variant calling have been previously published 9 .
Nuclei preparation and sequencing. Snap frozen samples were stored at -80 °C until use. Briefly, tissue was dissociated in a glass dounce homogenizer and then nuclei were washed and stained with Hoechst dye, as previously described 18 . Approximately 15,000 Hoechst-positive nuclei per sample were sorted on a BD Influx cell sorter directly into master mix from the 10 × Genomics Next GEM Single-Cell 3' or 5' reagent kits. Reverse transcriptase was added to the master mix and reactions were loaded into a 10 × Genomics Chromium Controller for single-nuclei capture. Libraries were constructed in accordance with the 10 × Genomics Chromium Next GEM Single-Cell 3' Reagent kit v.3.1 or the Chromium Single-Cell 5' Reagent kit v.2. Libraries were sequenced on an Illumina NovaSeq 6000 instrument to generate paired-end sequencing data.
Data pre-processing and quality metrics. Sequencing data were processed using the 10 × Genomics Cell Ranger v.6 analysis pipeline. Generation of fastq files and data preprocessing, including alignment, filtering, barcode counting and UMI counting, were performed using the Cell Ranger mkfastq and count commands following the default parameters. Number of nuclei, number of genes detected, and total reads per library were obtained from Cell Ranger count. Downstream analysis was performed using Seurat v.4 for R 19 . The VlnPlot function was used to visualize number of UMIs and number of genes detected per nucleus. Low-quality nuclei with greater than 5% mitochondrial reads were filtered out. Doublets were identified using DoubletFinder for R and then visualized on FeaturePlots in Seurat and excluded from the datasets before further analysis 20 .
Single cell RNA-seq data analysis. Normalization and variance-stabilization of feature-barcode matrices was performed using the SCTransform function of Seurat 21 . All six libraries were integrated using the SCTransform method. Dimensionality reduction was performed using principal component analysis and the distance matrix was organized into a K-nearest neighbor graph, partitioned into clusters using Louvain algorithm, and clusters were visualized on a tSNE plot 22,23 . Top differentially expressed genes representing each cluster were found using FindAllMarkers. Cell types were annotated by inspection of canonical marker genes. Nine major cell types were identified: Microglia, Lymphocytes, MGE-Derived Interneurons, CGE-Derived Interneurons, Astrocytes, Oligodendrocytes, Oligodendrocyte Precursor Cells (OPCs), Excitatory Neurons, and Mitochondrial (a cluster of expressing a high proportion of mitochondrial reads).

Marker gene scoring.
To assess the ability of each marker gene to differentiate cell types in 5' versus 3' data, we calculated a marker score as previously described 11 . We selected the top 10 marker genes (highest average log2 fold change) for each cell type cluster and then calculated separately for 5' and 3' datasets the proportion of cells in each cluster expressing each marker gene at greater than 1 count per million (CPM). Next, we calculated a marker score as the sum of the squared differences in proportions divided by the sum of the differences in proportions. The resulting score (ranging from 0 to 1) represents the specificity of the marker gene, with zero reflecting evenly distributed expression or no expression across all clusters, and one representing perfectly binary expression in the marked cluster.

Gene expression comparison.
To compare overall gene expression by capture method, we normalized raw counts for each gene to read depth (per nucleus), scaled by 10,000, log-transformed using the natural log, and calculated average expression in 5' versus 3' data. The analysis was repeated for patient-specific pairs of data. www.nature.com/scientificreports/ To compare gene expression by capture method for each cell type cluster, we repeated the analysis averaging expression on a per cluster basis and generated a distance matrix using the cor function in base R.
Variant detection in single-nuclei. For each patient, germline and somatic variation was previously detected from whole exome sequencing analysis of resected brain tissue as published [9(15]. To determine representation of each variant at a patient-specific level in the single-nuclei datasets, we used Vartrix software (GitHub-10XGenomics/vartrix: Single-Cell Genotyping Tool) to generate a call set identifying each nucleus as expressing the variant or reference allele. The calls were joined by cell barcode to each Seurat object's metadata slot for visualization and to calculate a proportion of cells expressing each variant. The distance of each variant to the transcript end was calculated using MANE transcript IDs and the function "transcriptLengths" from the R package "GenomicFeatures" 24 . Further filtering was done to remove intronic variants from our datasets as we observed some intronic regions were sequenced as a byproduct of WES.

Data availability
The datasets generated and/or analysed during the current study are available in the github repository, https:// github. com/ bedro sian-lab/ 5p3p_ genot yping, along with a tutorial notebook. The single cell RNA-seq data presented in this publication have been deposited in NCBI's Gene Expression Omnibus (GEO) and are accessible through GEO Series accession number GSE210670. All other data is available upon reasonable request.