Nanopore sequencing of single-cell transcriptomes with scCOLOR-seq

Here we describe single-cell corrected long-read sequencing (scCOLOR-seq), which enables error correction of barcode and unique molecular identifier oligonucleotide sequences and permits standalone cDNA nanopore sequencing of single cells. Barcodes and unique molecular identifiers are synthesized using dimeric nucleotide building blocks that allow error detection. We illustrate the use of the method for evaluating barcode assignment accuracy, differential isoform usage in myeloma cell lines, and fusion transcript detection in a sarcoma cell line.


Barcode and UMI synthesis strategy
Solid-phase phosphoramidite oligonucleotide synthesis is performed on Toyopearl HW-65S resin. Following the attachment of a hexaethylene glycol (HEG) linker (a), the PCR handle is synthesised using single reverse phosphoramidites (b). The barcode and UMI regions of the capture oligonucleotide are formed of blocks of homodimer nucleotides (c), added using reverse dimer phosphoramidites. Finally, the polyT oligonucleotide region is synthesised using single reverse phosphoramidites.
Barcodes are assigned to cells by grouping reads that differ by no more than a specified edit distance (we typically use a Levenshtein distance of 6). Indels are typically not considered.
The first step is a first-pass whitelisting approach in which barcodes without errors are identified based on full base-pair complementarity. Blacklisted barcodes containing errors are then compared to the whitelist of error-free barcodes. Each error-containing barcode is assigned to its closest match in the whitelist, provided that the match is within the specified edit distance. If no whitelisted barcode lies within the specified edit distance, the read is discarded.
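
The whitelist-then-rescue logic described above can be sketched as follows. This is a minimal Python illustration, not the authors' pipeline. The function names are hypothetical. "Full base pair complementarity" is assumed to mean that the two bases of every dimer block are identical, and ties between equally close whitelist barcodes are broken alphabetically, which the text does not specify. The sketch uses a plain Levenshtein distance; because the text notes that indels are typically not considered, a Hamming distance could be substituted.

```python
from typing import Dict, Iterable, Optional


def is_error_free(barcode: str) -> bool:
    """Return True if both bases of every dimer block match (e.g. 'AATTCC')."""
    if len(barcode) % 2 != 0:
        return False
    return all(barcode[i] == barcode[i + 1] for i in range(0, len(barcode), 2))


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (substitutions and indels)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def correct_barcode(barcode: str, whitelist: Iterable[str],
                    max_distance: int = 6) -> Optional[str]:
    """Return the closest whitelisted barcode, or None to discard the read."""
    best, best_distance = None, max_distance + 1
    for candidate in sorted(whitelist):  # assumption: alphabetical tie-break
        distance = levenshtein(barcode, candidate)
        if distance < best_distance:
            best, best_distance = candidate, distance
    return best


def assign_barcodes(observed: Iterable[str],
                    max_distance: int = 6) -> Dict[str, Optional[str]]:
    """Map each observed barcode to itself, a rescued barcode, or None."""
    observed = list(observed)
    # First pass: barcodes whose dimer blocks all agree form the whitelist.
    whitelist = {bc for bc in observed if is_error_free(bc)}
    assignment = {}
    for bc in set(observed):
        if bc in whitelist:
            assignment[bc] = bc
        else:
            # Second pass: compare error-containing barcodes to the whitelist.
            assignment[bc] = correct_barcode(bc, whitelist, max_distance)
    return assignment
```

Calling `assign_barcodes(reads, max_distance=6)` on a list of observed barcode strings returns a dictionary in which a value of `None` marks a barcode whose reads would be discarded.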

UMI counts correction strategy
We adapted the directional approach first proposed by UMI-tools for correcting PCR and sequencing errors within UMI sequences. a. Representation of theoretical sequencing and PCR errors within a UMI of two independent UMIs. During analysis, we first evaluate homodimer nucleotide complementarity. Molecules that show perfect homodimer complementarity within the 16mer molecule are collapsed to a single 8mer. Molecules with a mismatch between the two nucleotides of a dimer are inferred to contain a sequencing error and are split into two 8mer molecules. Each of these molecules is then added as an independent UMI to the UMI-tools directional algorithm (thus an error-containing molecule adds two counts to the directional algorithm).

Evaluating different edit distances for correcting Illumina scRNA sequencing data.
A dual oligonucleotide scRNA-seq library was generated and around 500 human HEK293T and mouse 3T3 cells were sequenced using the Illumina platform. Barcodes that contained a sequencing error, as determined by dual nucleotide block complementarity, were identified.
Barcodes were then error corrected using increasing edit distances. a The number of human cells identified using increasing Levenshtein distance for barcode error correction. b The corresponding numbers of mouse cells identified with increasing Levenshtein distance. c The corresponding numbers of mixed cells identified with increasing Levenshtein distance.
d, e, f, g, h, i Barnyard plots showing mouse and human UMIs detected per cell.
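
Returning to the UMI counts correction strategy described earlier, the collapse-or-split step and the directional grouping can be sketched as below. This is a simplified Python re-implementation for illustration only; it does not call the UMI-tools package, and all function names are hypothetical. It assumes that "splitting" a 16mer means taking the first base of every dimer as one 8mer and the second base of every dimer as the other, and it uses the directional rule commonly described for UMI-tools (two UMIs are linked when they differ by at most one base and the higher count is at least twice the lower count minus one).

```python
from collections import Counter
from typing import Dict, Iterable, List


def split_dimer_umi(umi: str) -> List[str]:
    """Collapse a dimer-encoded UMI to one 8mer, or split it into two on mismatch."""
    first, second = umi[0::2], umi[1::2]
    if first == second:
        return [first]        # perfect homodimer complementarity
    return [first, second]    # inferred sequencing error: two independent UMIs


def count_umis(umis: Iterable[str]) -> Counter:
    """Tally the 8mers; an error-containing molecule contributes two counts."""
    counts: Counter = Counter()
    for umi in umis:
        counts.update(split_dimer_umi(umi))
    return counts


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def directional_groups(counts: Dict[str, int], threshold: int = 1) -> List[List[str]]:
    """Group UMIs with the directional rule: count(a) >= 2 * count(b) - 1."""
    order = sorted(counts, key=lambda u: (-counts[u], u))
    seen = set()
    groups = []
    for umi in order:
        if umi in seen:
            continue
        seen.add(umi)
        group, stack = [umi], [umi]
        while stack:
            parent = stack.pop()
            for child in order:
                if child in seen:
                    continue
                if (hamming(parent, child) <= threshold
                        and counts[parent] >= 2 * counts[child] - 1):
                    seen.add(child)
                    group.append(child)
                    stack.append(child)
        groups.append(group)
    return groups
```

Under these assumptions, `len(directional_groups(count_umis(umis)))` gives the number of corrected UMIs for the reads of one gene in one cell.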

Figure 6
Error correction of Illumina droplet based scCOLOR-seq data.
Human HEK293T and mouse 3T3 cells were mixed at a 1:1 ratio and approximately 500 cells were taken for encapsulation and cDNA synthesis. Barcodes and UMIs identified as having at least one sequencing error are shown before (a) and after (b) barcode error correction using an edit distance of 4, with the proportion of mouse and human UMIs shown in the Barnyard plots. Inset bar plots show the number of cells identified for each species.

The frequency of errors within barcodes that contain at least one error.
HEK and 3T3 cells were encapsulated at a 50:50 ratio and a library was then prepared for both Illumina and Nanopore sequencing using the same cDNA. a A random sample of 8000 Illumina-sequenced barcodes was drawn from barcodes that contained at least one sequencing error. The frequency of sequencing errors is plotted as a bar graph. b As in a, but for nanopore-sequenced barcodes.

Evaluating different edit distances for correcting Nanopore scRNA sequencing data.
A dual oligonucleotide scRNA-seq library was generated and around 500 human HEK293T and mouse 3T3 cells were sequenced using the Oxford Nanopore platform. Barcodes that contained a sequencing error, as determined by dual nucleotide block complementarity, were identified. Barcodes were then error corrected using increasing edit distances. a The number of human cells identified using increasing Levenshtein distance for barcode error correction. b The corresponding numbers of mouse cells identified with increasing Levenshtein distance. c The corresponding numbers of mixed cells identified with increasing Levenshtein distance.
d, e, f, g, h, i Barnyard plots showing mouse and human UMIs detected per cell.
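
The tally behind the error-frequency panels (a random sample of 8000 barcodes that contain at least one error, for each platform) could be approximated as in the sketch below. The function names, the fixed random seed, and the choice to count one error per mismatched dimer block are assumptions made for illustration; the text does not state how errors were counted. Note that a dimer in which both bases were misread to the same wrong base would go undetected by this check.

```python
import random
from collections import Counter
from typing import Iterable


def count_dimer_errors(barcode: str) -> int:
    """Count dimer blocks whose two bases disagree."""
    return sum(barcode[i] != barcode[i + 1]
               for i in range(0, len(barcode) - 1, 2))


def error_frequency(barcodes: Iterable[str], sample_size: int = 8000,
                    seed: int = 0) -> Counter:
    """Histogram of errors per barcode among sampled error-containing barcodes."""
    with_errors = [bc for bc in barcodes if count_dimer_errors(bc) > 0]
    if len(with_errors) > sample_size:
        # Random subsample, seeded so that the tally is reproducible.
        with_errors = random.Random(seed).sample(with_errors, sample_size)
    return Counter(count_dimer_errors(bc) for bc in with_errors)
```

The returned counter maps the number of detected errors in a barcode to the number of sampled barcodes with that many errors, which is the quantity a bar graph of this kind would display.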

Figure 11
Evaluating the effect of increasing the edit distance to 7 for cell assignment.

The detection of fusion transcripts within the mixed species dataset
We measured the presence of fusion transcripts within our mouse and human mixed species experiment. a The frequency of UMI counts per unique fusion transcript. b The same data shown in a, but limited to UMI counts between 1 and 5. c The same data shown in a, but limited to UMI counts between 5 and 50. The colours indicate the presence of mixed species.

Genome browser tracks showing the read pileup across both EWSR1 and FLI1
The top panel shows the read pileup across the exons of EWSR1. The bottom panel shows the read pileup across the FLI1 gene. The peak observed within the intronic region between exons 5 and 6 appears to be an alignment artifact and does not represent a real peak.