Maximizing transcription of nucleic acids with efficient T7 promoters

In vitro transcription using T7 bacteriophage polymerase is widely used in molecular biology. Here, we use 5′ RACE-Seq to screen a randomized initially transcribed region of the T7 promoter for cross-talk with transcriptional activity. We reveal that sequences from position +4 to +8 downstream of the transcription start site affect T7 promoter activity over a 5-fold range, and identify promoter variants with significantly enhanced transcriptional output that increase the yield of in vitro transcription reactions across a wide range of template concentrations. We furthermore introduce CEL-Seq+, which uses an optimized T7 promoter to amplify cDNA for single-cell RNA-Sequencing. CEL-Seq+ facilitates scRNA-Seq library preparation, and substantially increases library complexity and the number of expressed genes detected per cell, highlighting a particular value of optimized T7 promoters in bioanalytical applications.

Efficient amplification of nucleic acids is critical for many procedures in molecular biology 1. Over the last decades, polymerase chain reaction (PCR) has been the most widely applied approach to efficiently produce DNA. However, exponential amplification can entail amplification biases, particularly in the case of low input material. Alternatively, linear amplification of RNA by in vitro transcription (IVT) using single-subunit viral RNA polymerases has become core to a host of genomics applications 2, and is used for large-scale production of RNA. For example, recent single-cell RNA-sequencing (scRNA-seq) methods such as CEL-Seq2 3, or microdroplet-based (inDrop) procedures 4, rely on IVT using a T7 bacteriophage promoter and optimized reaction conditions for recombinant T7 RNA polymerase. Furthermore, emerging single-cell DNA-sequencing procedures such as single-cell whole-genome analyses by linear amplification via transposon insertion (LIANTI 5), chromatin integration labeling sequencing (CHIL-Seq 6), scDam&Tseq 7, or the sciL3 method 8 fundamentally rely on T7 promoter-based IVT.
During transcription initiation, T7 polymerase binds the promoter DNA from nucleotide position −17 to −5 with high specificity, while the DNA double strand is melted from position −4 to +3 to prime RNA synthesis from a GTP nucleotide at position +1 9,10. The growing RNA:DNA hybrid then expands the initiation bubble from position −4 to +7 11. Beyond addition of the +8 nucleotide to the nascent RNA molecule, the probability of initiation bubble collapse increases, and substantial conformational rearrangements within T7 polymerase mark the transition of the complex into processive elongation 12-14. The sequence determinants that specify polymerase binding to the core promoter are well characterized, and base substitutions between position −17 and +3 can strongly affect transcriptional output 15-17. However, to what extent sequences in the extended initiation bubble affect transcription remains unclear.
Here, we used rapid amplification of cDNA 5′ ends coupled with deep sequencing (5′ RACE-Seq) to test whether sequences beyond the +3 nucleotide affect the activity of the T7 promoter. We find that sequence motifs between the +4 and +8 nucleotide have a strong impact on transcriptional output over an unexpected fivefold range, and present a comprehensive list of motifs that can serve as a guideline for the optimization of IVT reactions. We furthermore introduce CEL-Seq+ (single-cell RNA-Seq by multiplexed linear amplification+), which uses an optimized T7 promoter for amplification of cDNA from single cells. CEL-Seq+ facilitates the preparation of scRNA-Seq libraries and increases the number of UMI counts and detected genes per cell, demonstrating the utility of optimized T7 promoter sequences for bioanalytical applications.

Results
The initially transcribed region affects activity of the T7 promoter. We used 5′ RACE-Seq to profile the 5′ ends of individual 210-nucleotide RNAs that were transcribed from 10 ng (~10¹⁰ copies) of a +2 to +16 randomized T7 promoter template. An aliquot of the promoter library was directly sequenced in parallel to a depth of 2.3 × 10⁸ reads to account for potential sequence biases in the randomized template DNA. We then interrogated the initially transcribed region of >10⁸ sequenced RNA molecules for cross-talk with transcriptional activity (Fig. 1a). After normalization with the template library, base frequencies in transcribed RNA molecules showed high variance from position +2 to +7 and reached baseline levels at nucleotide position +9 (Fig. 1b), indicating substantial sequence preference in the region corresponding to the extended initiation bubble 11.
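As a rough plausibility check of the quoted copy number, the mass of template can be converted into a number of molecules. The sketch below assumes the 500 bp library length given in the figure legend and an average mass of about 650 g/mol per base pair, which is a common approximation and not a value stated in this work; the function name is illustrative only.

```python
AVOGADRO = 6.022e23   # molecules per mole
AVG_BP_MASS = 650.0   # g/mol per base pair (assumed average for dsDNA)

def dsdna_copies(mass_ng: float, length_bp: int) -> float:
    """Approximate number of dsDNA molecules contained in a given mass."""
    grams = mass_ng * 1e-9                   # ng -> g
    molar_mass = length_bp * AVG_BP_MASS     # g/mol for the whole fragment
    return grams / molar_mass * AVOGADRO

copies = dsdna_copies(10, 500)
print(f"{copies:.2e} copies")  # on the order of 10^10
```

Under these assumptions, 10 ng of a 500 bp template corresponds to roughly 2 × 10¹⁰ molecules, consistent with the order of magnitude given above.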
The obtained sequencing depth provided ~5000-fold oversampling of all possible motifs in the +2 to +8 region, which enabled us to determine the relative transcriptional activity of individual promoter variants. After normalization with the corresponding motif frequencies in the template library (Supplementary Data 1), transcripts with a G at positions +1 to +3 were transcribed more robustly compared to other +2/3 nucleotide combinations (Fig. 2a), which is in agreement with previous findings and with common guidelines for T7 usage in biomolecular applications. Here, a G triplet may prevent premature dissociation of short abortive transcripts that result from competition between slippage of the nascent RNA and active site translocation. Accordingly, 79% of transcripts that start with a G triplet displayed insertion of an additional G at the 5′ terminus (Fig. 2b). Addition of more than one G was rarely observed (<2% of transcripts with two extra G).
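The oversampling estimate can be reproduced with simple arithmetic: seven randomized positions (+2 to +8) give 4⁷ possible motifs, and the average coverage is the read count divided by that number. The sketch below uses the 84 × 10⁶ filtered reads reported for one of the libraries as input; which library the quoted figure refers to is an assumption here, and the function name is illustrative.

```python
def motif_oversampling(n_reads: float, first_pos: int, last_pos: int) -> float:
    """Average reads per motif if all motifs were equally represented."""
    n_positions = last_pos - first_pos + 1   # +2..+8 -> 7 randomized positions
    n_motifs = 4 ** n_positions              # four possible bases per position
    return n_reads / n_motifs

fold = motif_oversampling(84e6, 2, 8)
print(round(fold))  # 5127
```

With 16,384 possible motifs, 84 × 10⁶ reads correspond to about 5100 reads per motif on average; real coverage per motif will vary with template bias and promoter activity.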
Surprisingly, downstream sequences from +4 to +8 also affected promoter activity over a 5-fold range (Figs. 1c, d and 2c). Importantly, differential transcriptional outputs from individual +4 to +8 promoter variants in 5′ RACE-Seq were also recapitulated in individual IVT reactions across the full range of promoter activities (Figs. 1d and 2c, d; Supplementary Data 1). Highly active +4 to +8 sequences were generally AT-rich (e.g., AAATA, ATAAT), potentially indicating facilitated DNA double strand melting during initial transcription. However, the observed effects were also sequence-specific, with TTAAA ranking at position #247 in 5′ RACE-Seq, compared to rank #4 for ATAAT (Supplementary Data 1). The distribution of motif activities further suggests extensive combinatorial crosstalk between base positions +4 to +8, which has precluded the detection of transcriptional effects in previous screens based on single-base substitutions 15,16. As a consequence, the sequence determinant reported here has so far not been taken into account in IVT-based methods. Importantly, initially transcribed sequence motifs still determine transcriptional output in the presence of saturating concentrations of template DNA (Fig. 2e), highlighting their significance for large-scale in vitro synthesis of RNA.
We next tested if initially transcribed sequence determinants were shared by related RNA polymerases. We performed 5′ RACE-Seq with RNAs transcribed by SP6 polymerase from a +2 to +16 randomized promoter. In agreement with previous findings 18, transcriptional output by SP6 polymerase was mostly determined in the +1 to +3 region, with AA and AT being the most actively transcribed +2/3 sequence variants (Fig. 3a-c). In contrast to T7 polymerase, differences between +4 to +8 downstream motifs did not affect transcriptional output by SP6 (Fig. 3d).
The AT-rich upstream element increases transcription at low template concentrations. It was previously shown that introduction of a short AT-rich sequence element upstream of the T7 core promoter increases binding affinity of the polymerase for the DNA template 19. However, the impact on transcription remained unclear. In our hands, introduction of a short AT-rich sequence at position −21 to −18 only led to a modest 1.1-fold increase in transcriptional output (Fig. 4a). Similar results were obtained when the promoter was moved to the 5′ terminus of the template DNA, emphasizing the downstream promoter flanking region as the main modulator of transcriptional activity under normal conditions (Fig. 4a). However, enhanced promoter affinity did translate into a 1.5-fold increase in transcriptional output when the template concentration was reduced to 1 pg/µl, a template amount typically observed in single-cell-derived cDNA libraries (Fig. 4a).
Enhanced single cell RNA-Seq with an optimized T7 promoter. A host of single-cell genomics methods rely on amplification of scarce nucleic acids via T7-based amplification. To demonstrate the utility of an optimized T7 promoter for these applications, we set out to rationally design an optimized primer for single-cell RNA sequencing based on the popular CEL-Seq2 procedure 3. Here, mRNA from single cells is captured via hybridization of an oligo d(T) primer fused to an upstream T7 promoter sequence, then reverse transcribed and amplified by IVT with T7 polymerase (Supplementary Fig. 1). The transcribed antisense RNA (aRNA) is again converted into cDNA and processed into a library for deep sequencing. Successful preparation of a sequencing library strongly depends on the efficiency of the T7-based amplification step. Furthermore, a main characteristic of single-cell RNA-seq is a high "drop out" rate, i.e., features that are present in an individual cell but escape detection. Accordingly, scRNA-seq data are inherently shallow, meaning that most expressed genes are only represented by a small number of transcript counts. This is because many mRNA molecules are not initially captured by hybridization, or are subsequently lost in one of the downstream steps of library preparation, none of which is 100% efficient. We therefore hypothesized that additional copies generated by a more efficient T7 promoter may increase the probability that transcripts, especially those of low abundance, are represented in the final sequencing output, corresponding to enhanced sensitivity of scRNA-seq.
Fig. 1 a A 500 bp dsDNA library harboring a T7 promoter template with randomized nucleotide composition from +2 to +16 (highlighted in red) was transcribed in vitro using T7 RNA polymerase. The resulting 210-nucleotide RNAs were reverse transcribed, and the 5′ end of the respective cDNA was converted into a library for deep sequencing. In parallel, an aliquot of the promoter DNA library was directly sequenced to account for potential sequence bias in the template. b Normalized average nucleotide compositions of T7 promoter sequence variants from positions +2 to +16 in amplified RNAs, determined by 5′ RACE-Seq. The region of the extended initiation bubble, which extends from positions −4 to +7, is highlighted in light gray. c Differential promoter activity of +4 to +8 sequence motifs determined by 5′ RACE-Seq. Shown are the log2 relative abundances of individual sequence motifs. All promoters contain a G at positions +1 to +3. High correlation was observed between two independent experiments. d In vitro transcription reactions comparing +4 to +8 sequence motifs with high, low, and intermediate promoter activity. A 410-nucleotide RNA was in vitro transcribed for the indicated time points using the displayed promoter variant. Shown is the resulting fold amplification of template DNA. Error bars represent the standard deviation of triplicate experiments.
The T7 promoter in the standard CEL-Seq2 primer contains a CG-rich 5 bp upstream region, which was replaced by the AT-rich upstream motif (Fig. 4b). Surprisingly, insertion of the high-ranking +4 to +8 motif #4 into the CEL-Seq2 primer only led to a modest increase in aRNA yield (Supplementary Fig. 2). We speculated that this was due to less efficient mRNA capture by the extended primer. Therefore, to keep the primer as short as possible and to minimize the overall sequence change, we instead inserted only a GA dinucleotide at T7 position +3/+4 in the CEL-Seq2 primer, which elevated the 5′ RACE-Seq rank of the +4 to +8 region from #177 to #66 (Supplementary Data 1). Following mRNA capture with the optimized primer, we observed a robust ~2.5-fold increase in aRNA amplification from pooled cDNA of 10 single K562 cells, which facilitated library preparation from scarce material (Fig. 4b). Most importantly, leveraging the new T7 promoter sequence for CEL-Seq+ substantially increased the number of detected genes (average 9749 vs. 8281) and the number of unique molecular identifiers (UMIs; average 85066 vs. 53541) from single cells (Fig. 4c). Detection of almost 10,000 genes per cell likely approaches the entire set of expressed genes present in a single cell at any given time. In addition to higher detection rates, expression levels of individual genes were measured with higher accuracy by CEL-Seq+ across the entire gene expression spectrum, as reflected by higher UMI counts and a consistently lower coefficient of variation (CV) (Fig. 4e, f). As an additional benchmark, we next divided the genes detected in deep-sequenced bulk K562 RNA-Seq 20 into expression quartiles and tested their recovery in CEL-Seq2 vs. CEL-Seq+. Enhanced recovery of genes from lower expression quartiles in CEL-Seq+ suggested that improved linear amplification based on the T7 promoter sequences introduced here directly translates into substantially increased sensitivity of single-cell genomics applications (Fig. 4g). At the same time, the detection rate of transcription factors, chromatin binders, or genes involved in CML disease […]. Accordingly, single-cell studies aiming at a mechanistic understanding of tumor biology may broadly benefit from application of an enhanced T7 promoter in CEL-Seq+. Furthermore, important low-abundance regulatory genes are usually missed by microdroplet-based single-cell RNA-sequencing methods, which tend to produce particularly shallow transcriptomes. Microdroplet-based methods such as inDrop 4 may therefore also benefit from using the T7 promoters reported here to increase gene detection rates.
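The per-cell benchmarks used above (detected genes, UMI counts, and the coefficient of variation of a gene across cells) can be illustrated with a minimal sketch. The data layout (a nested dictionary of UMI counts), the gene and cell names, and the toy values are hypothetical, and the use of the population standard deviation for the CV is an assumption rather than the definition used in this work.

```python
from statistics import mean, pstdev

def cell_metrics(umi_counts):
    """Per cell: number of genes with at least one UMI and total UMI count.

    umi_counts maps cell -> {gene: UMI count}.
    """
    return {
        cell: {
            "genes": sum(1 for n in genes.values() if n > 0),
            "umis": sum(genes.values()),
        }
        for cell, genes in umi_counts.items()
    }

def gene_cv(umi_counts, gene):
    """Coefficient of variation (SD / mean) of one gene across all cells."""
    values = [genes.get(gene, 0) for genes in umi_counts.values()]
    mu = mean(values)
    # Population SD is an assumption; a sample SD would also be defensible.
    return pstdev(values) / mu if mu > 0 else float("nan")

# Toy counts for two hypothetical cells (not real data).
toy = {
    "cell_1": {"geneA": 4, "geneB": 0, "geneC": 10},
    "cell_2": {"geneA": 2, "geneC": 10},
}
print(cell_metrics(toy))
print(round(gene_cv(toy, "geneA"), 3))
```

A lower CV at comparable mean expression indicates less technical noise, which is the sense in which a more efficient amplification step would be expected to improve measurement accuracy.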

Discussion
In general, uniform and accurate amplification of nucleic acids is important when starting material is scarce and precious. The optimized T7 promoter sequences presented here can be readily applied in various IVT-based methods to boost linear amplification of nucleic acids. This is of particular interest for single-cell DNA- and RNA-sequencing approaches, and for nucleic acid-based diagnostics of scarce clinical material such as circulating tumor cells 22. In current biomolecular approaches, the initially transcribed region either contains random linker DNA or represents the 5′ end of the amplified sequence, resulting in highly variable outcomes. Accordingly, an optimized T7 promoter with reliably high transcriptional output is of particular relevance for T7-based biosensors that need to amplify different target molecules with consistent efficiency. Importantly, the optimized T7 promoters presented here enhance transcription across a wide range of template concentrations, so that IVT-based applications can be readily enhanced by up to fivefold by simply following our sequence recommendations. In addition to bioanalytical applications, the promoter sequences introduced here may thus help to improve large-scale production of (modified) RNA or protein in vitro.
[…] × 10⁶ and 84 × 10⁶ reads were used for motif analysis, respectively. For background libraries, reads were filtered for PhiX using Bowtie 2, resulting in 232 × 10⁶ total reads for motif analysis. To determine relative abundances of +2 to +8 promoter variants, reads with identical +2 to +8 sequence were pooled, and the resulting read counts were divided by the total number of filtered reads and by the corresponding counts in background libraries. Homopolymeric motifs were removed from the analysis because reads containing long stretches of G appeared to be disproportionately removed by the Real-Time Analysis software of the Illumina NextSeq 500 sequencer (in this two-channel sequencing system, base positions corresponding to dark sequencing cycles are called as G).
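The normalization described above can be sketched as follows: reads are pooled by their +2 to +8 sequence, and the frequency of each motif in the RNA library is divided by its frequency in the background (template) library. This is a minimal illustration, not the pipeline used for the analysis. It assumes that each read string starts at position +1, that background counts are also scaled by their library size (so that a ratio of 1 means no enrichment), and that "homopolymeric" means a motif consisting of a single repeated base; all function names are hypothetical.

```python
from collections import Counter

def motif(read: str, start: int = 2, end: int = 8) -> str:
    """Return positions +start..+end of a read assumed to begin at +1."""
    return read[start - 1:end]

def relative_abundance(rna_reads, background_reads):
    """Motif frequency in RNA reads divided by frequency in the background."""
    rna = Counter(motif(r) for r in rna_reads)
    bg = Counter(motif(r) for r in background_reads)
    rna_total = sum(rna.values())
    bg_total = sum(bg.values())
    result = {}
    for m, n in rna.items():
        if len(set(m)) == 1:   # skip homopolymeric motifs (assumed definition)
            continue
        if m not in bg:        # cannot normalize without background counts
            continue
        result[m] = (n / rna_total) / (bg[m] / bg_total)
    return result

# Toy reads (not real data): one motif over-represented in the RNA library.
rna_reads = ["GGGATAAT"] * 3 + ["GGGTTAAA"]
background_reads = ["GGGATAAT"] * 2 + ["GGGTTAAA"] * 2
print(relative_abundance(rna_reads, background_reads))
```

Ranking motifs by this ratio (or its log2) would yield an activity ordering of the kind reported for the +4 to +8 variants, provided the background library covers every motif with sufficient depth.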

Methods
IVT for validations. One ng of dsDNA template (gBlocks Gene Fragments, Integrated DNA Technologies) was in vitro transcribed in a 20 µl reaction using 1.5 µl T7 or SP6 RNA polymerase mix and 7.5 mM each of GTP, ATP, CTP, and UTP from the HiScribe T7/SP6 RNA Synthesis Kit (NEB E2040S, E2070S). RNA was purified with 1.6 volumes RNAClean XP beads, followed by elution in 20 µl water. RNA was quantified using the Qubit™ RNA Assay Kit. For testing of CEL-Seq adapter variants, 1 pg of double-stranded 330 bp CEL-Seq cDNA mimics was in vitro transcribed using the HiScribe T7 RNA Synthesis Kit. RNA was purified with 1.6 volumes RNAClean XP beads and eluted in 10 µl water. RNA was quantified using a High Sensitivity RNA ScreenTape (Agilent 5067-5579) on an Agilent TapeStation.
Single-cell RNA-sequencing. CEL-Seq2 was performed as described 3. In brief, single cells were FACS sorted using a BD Aria III device into 96-well plates with barcoded primers (0.4 µM) in lysis buffer (110 µM dNTPs, 0.007% Triton X-100, 1.4 U SUPERaseIn) and stored at −80°C. After thawing, plates were heated to 72°C for 3 min and incubated for 10 min at 10°C. Four microliters of SuperScript II reaction mix were added at room temperature, and the plate was incubated for 1 h at 42°C followed by heat inactivation for 10 min at 70°C. Second strand synthesis was performed at 16°C for 2 h with the NEBNext® Ultra II Non-Directional RNA Second Strand Synthesis Module (NEB E6111S). Single-cell reactions were pooled separately for CEL-Seq2 or CEL-Seq+ primers, purified with 1.2 volumes of AMPure XP, and in vitro transcribed for 15 h using the MEGAscript T7 kit. After ExoSAP-IT treatment (Thermo, 78201) and fragmentation for 1.5 min (NEB E6150S), RNA was cleaned up with 1.8 volumes of AMPure XP. The fragmented RNA was combined with HEX-primer and dNTPs and incubated at 65°C for 5 min, followed by addition of SuperScript II mix and incubation at 25°C for 10 min and 42°C for 1 h. The cDNA was amplified with 12 cycles of PCR using KAPA HiFi HotStart ReadyMix. Final reactions were cleaned up with 0.8 volumes of AMPure XP and sequenced on a MiniSeq (Illumina) instrument.
Single-cell RNA-sequencing data analysis. Fastq files from each dataset were processed with zUMIs (version 2.5) for filtering, single-cell demultiplexing, and assignment of reads to genes. Filtering was done using the default zUMIs parameters (minimum of 100 reads per cell). Reads were mapped to the hg38 genome assembly, and the GENCODE (v27) annotation was used for gene read counts. For each dataset, the output matrix files from zUMIs were loaded into Seurat (version 3). Reads from intronic and exonic regions of genes were included, and no further coverage filtering was applied.
Statistics and reproducibility. Standard deviation between triplicate IVT experiments was determined using Student's t test. Numbers of detected genes and UMIs in CEL-Seq2 and CEL-Seq+ were compared using the Mann-Whitney-Wilcoxon test.
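For readers unfamiliar with the rank-based comparison named above, the Mann-Whitney U statistic can be computed directly from pairwise comparisons between two groups. The sketch below only illustrates the statistic on made-up numbers; it does not compute a p-value (for which an established statistics library, for example `scipy.stats.mannwhitneyu`, would normally be used) and is not the code used in this study.

```python
def mann_whitney_u(x, y):
    """U statistic of sample x versus sample y; ties contribute 0.5."""
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u

# Made-up gene counts per cell (in thousands) for two hypothetical groups.
group_a = [9, 10, 11]
group_b = [7, 8, 9]
u_a = mann_whitney_u(group_a, group_b)
u_b = mann_whitney_u(group_b, group_a)
print(u_a, u_b)  # 8.5 0.5
```

The two U values always sum to the number of pairs (here 3 × 3 = 9); a value close to that maximum indicates that one group is almost consistently larger than the other.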