a Schematic of the scRCAT-seq method. Full-length cDNA was synthesized by template-switching reverse transcription, amplified by PCR, and tagmented with Tn5 transposase. The TAG added to both ends contains the UMI (unique molecular identifier) and CI (cell identifier). Both the 5′ and 3′ ends of the cDNA were captured and amplified by PCR, producing indexed libraries for pooled sequencing. Sequencing data were processed, and transcription start sites (TSSs) and transcription end sites (TESs) were identified using machine learning models. CS1: common sequence 1; CS2: common sequence 2; TSO: template-switching oligo; T30: 30 consecutive T bases. b Schematic of the machine learning models. Features describing each peak, including the read distribution, motifs associated with real TSSs/TESs, and sequence features associated with internal false-positive signals, were collected and used to train random forest (RF), logistic regression (LR), support vector machine (SVM), and k-nearest neighbor (KNN) models. c Gene body coverage of scRCAT-seq reads derived from DRG (n = 18). The mean read coverage is shown, with shading indicating the 95% confidence interval. d Accuracy of the different machine learning models in identifying authentic TSSs and TESs. Error bars represent standard deviation of the mean (n = 3). e Distance of the identified TSSs/TESs to those annotated in hg38. TSSs/TESs were identified with the RF model from the scRCAT-seq peaks derived from hESC. f Pie chart illustrating the distribution of the identified TSSs in hESC relative to the TSSs in the FANTOM5 database. The total number of TSS peaks identified after optimization by the machine learning models is indicated under the pie chart. g Pie chart illustrating the distribution of the identified TESs in hESC relative to the TESs in PolyA_DB3. Source data are provided as a Source data file.
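The distance comparison in panel e can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names `nearest_annotated_distance` and `fraction_within`, the example coordinates, and the 50 bp window are all hypothetical, and the sketch assumes that every position lies on the same chromosome and strand.

```python
from bisect import bisect_left


def nearest_annotated_distance(site, annotated_sorted):
    """Signed distance (bp) from an identified site to the closest annotated site.

    Assumes `annotated_sorted` is a sorted list of positions on the same
    chromosome and strand as `site` (a simplification for illustration).
    """
    if not annotated_sorted:
        raise ValueError("no annotated sites supplied")
    # Index of the first annotated position >= site.
    i = bisect_left(annotated_sorted, site)
    # Only the neighbours on either side of that index can be the nearest.
    candidates = annotated_sorted[max(i - 1, 0): i + 1]
    nearest = min(candidates, key=lambda pos: abs(site - pos))
    return site - nearest


def fraction_within(sites, annotated, window):
    """Fraction of identified sites lying within `window` bp of an annotated site."""
    annotated_sorted = sorted(annotated)
    hits = sum(
        abs(nearest_annotated_distance(s, annotated_sorted)) <= window
        for s in sites
    )
    return hits / len(sites)


# Hypothetical coordinates, not taken from the study.
identified = [480, 1200, 50, 500]
annotated = [1000, 100, 500]
distances = [nearest_annotated_distance(s, sorted(annotated)) for s in identified]
```

Binary search keeps each lookup logarithmic in the number of annotated sites, which matters when comparing many peaks against a genome-wide annotation. A real comparison would group sites by chromosome and strand before calling these functions.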