CaSpER identifies and visualizes CNV events by integrative analysis of single-cell or bulk RNA-sequencing data

RNA sequencing experiments generate large amounts of information about expression levels of genes. Although they are mainly used for quantifying expression levels, they contain much more biologically important information such as copy number variants (CNVs). Here, we present CaSpER, a signal processing approach for identification, visualization, and integrative analysis of focal and large-scale CNV events in multiscale resolution using either bulk or single-cell RNA sequencing data. CaSpER integrates the multiscale smoothing of expression signal and allelic shift signals for CNV calling. The allelic shift signal measures the loss-of-heterozygosity (LOH) which is valuable for CNV identification. CaSpER employs an efficient methodology for the generation of a genome-wide B-allele frequency (BAF) signal profile from the reads and utilizes it for correction of CNVs calls. CaSpER increases the utility of RNA-sequencing datasets and complements other tools for complete characterization and visualization of the genomic and transcriptomic landscape of single cell and bulk RNA sequencing data.

from these statistics that most of the segments that are shorter than 1 megabases will be very 48 hard to detect (unless in gene dense regions) because they cover very few genes. 49 This has several implications: We cannot identify the CNV segments that are solely in the 50 intergenic space (no gene overlap) and solely in the introns of genes because there is very little 51 or no RNA-seq signal in them. Therefore, we must consider the regions in the genome without 52 any genes in them as inaccessible for RNA-seq based CNV calling methods because the methods 53 cannot identify CNVs there. It is worth noting that this is similar to detection of CNVs using the 54 whole exome sequencing (WES) data except that WES can quantify the coverage on the exons 55 so it may be able to detect breakpoints within genes and introns. But, similar to RNA-seq, WES 56 cannot be used to detect the CNVs primarily in the intergenic space. 57 We assessed the performance of CaSpER using accuracy metrics at the gene level and at the 58 segment level information (See Methods). The segment level sensitivity, however, must be 59 carefully interpreted with above considerations in mind. 60

Supplementary Note 3: Inference of CNV architecture in single-cell RNA-Seq 61
For 5q:14q event pair in patient MGH31, we discovered GFPT2 gene to be highly expressed in 62 5q amplified clone. The previous study has discovered that higher expression of GFPT2 is linked 63 with poor survival and identified GFPT2 gene to be a potential target for therapeutic inhibition 53 . transcript sequencing exhibits a much less bias when 5' and 3' ends are compared in terms of 119 read depth coverage. 120 We next used BAFExtract to generate candidate SNVs from which we compute BAF shift. We 121 computed the candidate SNV density over all the genes and aggregated the SNV density over 122 every position on the genes (from 5' to 3'). As with read depth aggregation, we normalized the  It can be seen clearly that the candidate SNVs are enriched around the 3' end of genes for 3' 126 sequencing data. Full transcript sequencing data, on the other hand, shows a much more uniform 127 coverage of the transcripts. This result indicates that the BAF shift can be reliably quantified 128 around the ends of genes for 3' sequencing. On the other hand, full transcript sequencing can 129 give a more uniform and comprehensive measurement of the BAF shift signal by covering the 130 SNVs more uniformly along with the genes with less bias on the location of SNVs. 131 It should also be noted that the above arguments are contingent on the fact that the suitability of 132 CaSpER for 3' technologies will be affected by the read depth of sequencing. Although 3' 133 sequencing covers the 5' ends of genes to a much lesser extent than 3' ends, there is still some 134 coverage on the 5' ends of sequences as seen in above plots. If both technologies are used with 135 decent coverage, the read coverage over the SNVs (even around 5' ends of genes) can provide 136 similar power to measure BAF shift. Consequently, both technologies can provide useful BAF 137 information with decent sequencing depths. 138

Supplementary Note 7: Applying CaSpER on 10x single cell RNA-Seq data 139
We assessed the suitability of CaSpER 3' transcript sequencing data using the multiple myeloma 140 10X RNA-Seq data (MM135) presented on HONEYBADGER study 21 . Even though the exact ground truth CNV calls are unknown for this study, we show the concordance of CaSpER CNV 142 calls with HONEYBADGER. 143 This MM135 data contains normal and MM cells, therefore, we first performed TSNE to separate 144 MM and normal cells. We also checked expression levels of MM markers (SDC1, CD38) in all 145 cells to better identify MM cells. We next normalized the expression values of MM cells using 146 normal cells. We plotted heatmap of median filtered expression data ( Supplementary Fig. 23). these events are also reported in HONEYBADGER paper. In summary, we recommend using 153 CaSpER on 10x data. However, data preprocessing steps should be performed carefully because 154 10x has different biases compared to SMART-Seq data. We added documentation specifically for 155 10x RNA-Seq data at our website. Several technical and biological factors are underlying the higher sensitivity of CNV detection for 159 bulk RNA-Seq data. We discuss these factors below. 160 The first factor is that DR-Seq technology or other simultaneous RNA-DNA sequencing 161 techniques are not standardized, unlike the genotyping arrays where we can directly identify 162 known CNV calls. This would mean that the technical artifacts in the DR-Seq technology are 163 potentially affecting the accuracy of the results. 164 The second factor is that, in DR-Seq study, absolute copy numbers are assigned to the segments 165 using a computational approach. However, both CaSpER and HoneyBADGER assigns relative 166 copy numbers to the detected segments where copy numbers are assigned with respect to the 167 average DNA content in each cell. Therefore, we processed the absolute copy numbers from DR-168 Seq study to build the relative copy numbers (i.e., amplification, deletion, neutral) in each cell. 169 The difference in calculating relative copy number calls among different datasets might lead to 170 poor performance. 171 The third factor is that, as we have stated earlier, the CNV calling methods work at the level of 172 genes, which means that the smallest resolution for segment coordinates will be at the gene 173 boundaries. In other words, the regions in the genome that do not contain any genes will not be 174 accessible for calling CNVs on them. Single cell RNA-Seq datasets cover on average 6000 genes 175 genomewide, whereas bulk RNA-Seq dataset measure all 20,000 protein-coding genes. Most of 176 the segments that are shorter than 1 megabases will be very hard to detect (unless in gene dense 177 regions) because they cover very few genes. We believe that these should be kept in mind while 178 interpreting the accuracy results. This has several implications: We cannot identify the CNV 179 segments that are solely in the intergenic space (no gene overlap) and solely in the introns of 180 genes because there is very little or no RNA-seq signal in them. This impacts the way that the 181 accuracy is estimated because inclusion of inaccessible regions in sensitivity estimation will 182 adversely (and unfairly) decrease the sensitivity of the methods. 183 The fourth factor is that DR-Seq data is applied to metastatic breast cancer cell line (SK-BR-3) 184 harboring very complicated diverse chromosomal alterations. Following our previous point, the 185 complex CNV profiles of SK-BR-3 cell line is affecting sensitivity. It should, however, be noted 186 that even for the complex ovarian cancer the accuracy of the CNV detection of bulk RNA-seq is 187 around 60% TPR. Therefore, other factors are affecting probably more. 188 The fifth factor is the limited number of cells. CaSpER will have better estimates with more cells. shift signal among potential consecutive SNVs. We are therefore claiming that one can detect the 206 B-allele shift without the existence of a high-quality set of SNVs. We believe that the concordance 207 between our B-allele shift plots (Generated without a high quality SNV call set) and the expression 208 signals is convincing evidence that supports this claim. 209 RNA-editing is a general mechanism where RNA is edited post-transcriptionally by cells. The 210 nucleotide sequence of the transcripts is changed as a result. As the RNA sequence is changed, 211 the sequenced RNA reads will reflect the nucleotide modifications. While these edited nucleotides 212 may not affect the gene expression quantifications (Assuming they do not interfere with read 213 mapping), they may potentially impact the BAF shift information because the modified nucleotides 214 may look like SNVs while reads are being processed. The RNA editing events are mainly 215 dominated by A-to-I modifications. While the exact number of RNA editing sites are not known, 216 the common RNA editing events are hypothesized to be rare, on the order of 10,000-100,000 217 sites 56,57 and many of the detected editing events overlap with repeat regions and do not overlap 218 with coding sequences. 219 Since the number of the events is low, we believe that these events will not have a major impact 220 on the BAF shift generation. Secondly, the RNA-seq data is predominantly enriched in the coding 221 and exonic sequences. Since there is a high association of RNA-editing events with repetitive 222 non-coding elements, we believe that they will have an almost unnoticeable impact on the BAF 223 shift signal measured on the exonic regions. However, the cancer cells may still have high RNA 224 editing just because the editing pathways may be malfunctioning. Therefore, we believe the 225 impact of RNA editing should be considered carefully any time RNA-Seq data is used to identify 226

CNVs. 227
Amplification errors stem from the PCR amplifications that are used to increase the cDNA content 228 in the sequencing libraries. PCR amplification may introduce errors that get amplified within any 229 cycle. Most alarmingly, these errors are correlated among different reads and it is hard to correct 230 for these errors. These amplified errors may seem like novel mutations. Especially in the lack of 231 a matching control data, these PCR errors may adversely impact BAF shift signal generation. We 232 hope that the other less biased PCR-free amplification assays will be more available to decrease 233 the amplification errors. In addition, molecular barcoding such as UMIs may be useful to detect 234 duplicates and remove them effectively. 235 Sequencing errors are introduced in the sequencing step when the sequencing machine wrongly 236 assigns a base. These errors are less of a concern then PCR amplifications since they are 237 generally independently introduced in different reads but this may not always be the case because 238 sequencing errors can be context specific. Nevertheless, the sequencing error rates of short read 239 sequencing technologies such as Illumina is very low; around 0.1% per base 58 . We therefore think 240 the sequencing errors will impact BAF shift generation slightly. 241

Supplementary Note 10: Accuracy of BAF shifts signal estimated by CaSpER 242
We expect that regions with amplification or deletion to be mostly accompanied with loss of 243 heterozygosity (LOH) and harbor BAF shift except for homozygous deletions and balanced 244 amplifications. In most of the cases, we do not have matched normal RNA-Seq samples, 245 therefore; we predict the BAF shift signal only from tumor RNA-Seq samples. To evaluate the 246 accuracy of the BAF shift signal from CaSpER, we compared our BAF shift signal with the signal 247 generated from GATK best practices workflow for RNA-Seq variant calling. For this, we first used 248 GATK to call SNVs using the tumor RNA-seq data. Next, we used the heterozygous calls from 249 the tumor and computed the GATK based BAF shift signal. We did not observe any particular 250 features in the GATK-based BAF shift signal that will provide extra benefit for CNV detection 251 compared to CaSpER's BAF shift signal (Supplementary Fig 26). It is important to note that 252 CaSpER performs several SNV filtering steps to remove variants that may bias BAF shift signal 253 adversely (Methods Section). This result supports our hypothesis that BAF shift signal detection 254 for CNV identification can be performed without reliance on a good variant call set. 255

Supplementary Note 11: Change of HMM model parameters with increasing number of 256 cells 257
The exact ground truth CNV calls are unknown for both MM135 and MGH31 study. However, 258 based on CaSpER and HONEYBADGER calls we know that most of the MM cells in MM135 259 harbor large scale chromosome 13 and 18 deletions. Therefore we first randomly selected n=7, 260 n=50 and n=100 cells among the MM135 cells that harbor 18p and 13q deletion. When we 261 sampled n=7 cells, the accuracy of calling chromosome 13 and 18 deletions were 71%, whereas 262 with n=50 accuracy was 92% and with n=100 cells accuracy was 94%. Supplementary Fig. 33 shows the large scale CNV event calls for chromosome 18 and 13 for sampled n=7, n=50 and 264 n=100 cells. As it is seen from Supplementary Fig. 33, concordance of CNV calls increases with 265 the number of cells. We calculated the mean values of the normal distributions corresponding to 266 5 copy number states (q1:homozygous deletion, q2:heterozygous deletion, q3:neutral, 267 q4:amplification, q5:high-level amplification) that are derived from randomly selected n=7, n=50, 268 n=100 cells and all cells in MM135 dataset. We observed that the initial HMM model parameters 269 change when different numbers of cells are analyzed (Supplementary Fig. 34). 270 Similarly in MGH31, based on CaSpER and HONEYBADGER calls we know that most of the 271 MGH31 cells harbor chromosome 10 deletion. Therefore we first randomly selected n=7 and n=50 272 cells among the MGH31 cells that harbor 10p and 10q deletion. When we sampled n=7 cells 273 accuracy of calling chromosome 10 deletions were 35%, whereas with n=50 accuracy was 99%. 274 Supplementary Fig. 35 shows the large scale CNV event calls for chromosome 18 and 13 for 275 randomly selected n=7 and n=50 cells. As it is seen from Supplementary Fig. 35, the concordance 276 of CNV calls increases with the number of cells. We calculated the mean values of the normal 277 distributions corresponding to 5 copy number states (q1:homozygous deletion, q2:heterozygous 278 deletion, q3:neutral, q4:amplification, q5:high-level amplification) that are derived from randomly 279 selected n=7, n=50, n=100 cells and all cells in MM135 dataset. We observed that the initial HMM 280 model parameters change when different numbers of cells are analyzed ( Supplementary Fig. 36). 281

Supplementary Note 11: Extension of CaSpER to other functional genomics datasets 282
In this paper, we have focused on the detection of CNV events from RNA-Seq datasets. This is 283 because RNA-sequencing is becoming an everyday tool in research and clinical settings. Also, 284 the number of RNA-sequencing datasets is increasing comparably with the whole genome and 285 whole exome sequencing datasets. Although we have focused only on RNA-sequencing data, 286 the analysis framework that CaSpER utilizes can be extended to other functional genomics 287 datasets such as ChIP-Sequencing, which are currently not performed as often as RNA-288 sequencing. We hypothesize that the CNV architecture can be reliably detected using these 289 functional genomics datasets jointly. As more cancer epigenomics datasets are generated, 290 CaSpER can be tuned to analyze data from these assays for detection of copy number and LOH 291 events. 292

Supplementary Note 12: Experimental design optimization using CaSpER 293
Even though RNA-seq and WGS are performed regularly on the same samples, we foresee that 294 experimental designs can first start by performing RNA-seq rather than starting with WCGS. This 295 will enable the researchers to study the samples simultaneously from the genetic and the 296 transcriptomic perspective. After this evaluation, the researchers can focus on the 297 uncharacterized samples, for example, samples that do not harbor any of the known large scale 298 CNVs associated with known subtypes of cancer. After selecting the uncharacterized samples, 299 the researchers can continue with performing WGS of these uncharacterized samples. This 300 "sequential" or "RNA-seq first" design is advantageous for several reasons. First, the cost can be 301 decreased significantly. WGS is currently much more expensive in terms of sequencing, analysis, 302 and storage compared to the costs associated with RNA-seq. If the researchers can decrease 303 the number of samples for which WGS is performed by focusing on the most interesting samples, 304 this can decrease the costs substantially. Additionally, RNA-seq can be used to identify much 305 more information such as the transcript fusion events that may have important clinical 306 implications. While we are optimistic with the RNA-seq first design, we are just presenting this 307 case as a potential example and we do not want to overstate this scenario because the sample 308 preselection may create adverse statistical biases that must be properly controlled. B. BAF signal generated after filtering reads with less than 50 mapping quality and only BAF 575 values more than 0.2 and less than 0.8 are considered. C. BAF signal generated after filtering 576 reads with less than 50 mapping quality and only BAF values more than 0.2 are considered. D. 577 BAF signal generated using only BAF values that are more than 0.2 and less than 0.8. For small smoothing window sizes, segments show relatively low concordance with the ground 616 truth. As the window size increases, the concordance increases and saturates around 80% at the 617 around window length of 50 for both event lengths. Lines depict the median values; boxes plot 618 25th to 75th percentiles, whereas separately plotted dots show the outliers.