Global repositioning of transcription start sites in a plant-fermenting bacterium

Bacteria respond to their environment by regulating mRNA synthesis, often by altering the genomic sites at which RNA polymerase initiates transcription. Here, we investigate genome-wide changes in transcription start site (TSS) usage by Clostridium phytofermentans, a model bacterium for fermentation of lignocellulosic biomass. We quantify expression of nearly 10,000 TSS at single base resolution by Capp-Switch sequencing, which combines capture of synthetically capped 5′ mRNA fragments with template-switching reverse transcription. We find that the locations and expression levels of TSS for hundreds of genes change during metabolism of different plant substrates. We show that TSS reveal riboswitches, non-coding RNA and novel transcription units. We identify sequence motifs associated with carbon source-specific TSS and use them for regulon discovery, implicating a LacI/GalR protein in control of pectin metabolism. We discuss how the high resolution and specificity of Capp-Switch enable study of condition-specific changes in transcription initiation in bacteria.

Bacteria translate environmental signals into cellular responses using a network of regulatory RNA and proteins that control genome-wide transcription patterns. Many of these regulators affect where RNA polymerase initiates messenger RNA (mRNA) synthesis at transcription start sites (TSS). As such, locating and quantifying changes in TSS usage is an important step towards understanding bacterial gene regulation. Here, we investigate TSS architecture in Clostridium phytofermentans ISDg, a soil bacterium that ferments plant biomass into ethanol, H2 and acetate 1, and belongs to the Lachnospiraceae family that includes gut commensals with important roles in host nutrition 2,3. This anaerobic mesophile metabolizes diverse plant components including cellulose, hemicellulose and pectin by tailoring expression of many carbohydrate-active enzymes (CAZymes) and other metabolic enzymes to the available substrate 4,5. C. phytofermentans has a 4.8 Mb genome with 3,926 predicted protein-encoding genes 3, and its ability to alter gene expression in response to carbon sources and other environmental cues is mediated by over 300 transcription regulator proteins 6 and numerous non-coding RNA including metabolite-sensing riboswitches 7.
We investigate genome-wide patterns of C. phytofermentans transcription initiation on heterogeneous plant substrates by demonstrating an approach called Capp-Switch sequencing. The initiating nucleotide of nascent mRNA is distinguished by a 5′ triphosphate (5′-PPP), which has been exploited for genome-wide TSS identification with dRNA-seq 8 by depleting rRNA and other monophosphorylated transcripts using terminal exonuclease (TEX). dRNA-seq has been applied to diverse bacteria 9-13, but incomplete and non-specific degradation of processed RNA requires TSS identification to be based on statistical comparison of read coverage in +TEX and −TEX samples. Capp-Switch avoids these problems by capturing and purifying 5′ mRNA fragments, which are reverse transcribed with template-switching to tagged cDNA for high-throughput sequencing (Fig. 1). The 5′-PPP ends of mRNA are modified by vaccinia capping enzyme (VCE) to bear a biotinylated guanosine cap that facilitates their capture and purification using streptavidin magnetic beads. Recently, TSS were identified by Cappable-Seq 14 using VCE to add a desthiobiotin cap for bead-based capture of 5′ mRNA fragments, which were then eluted from the beads and de-capped to ligate adapters for reverse transcription to tagged cDNA. Capp-Switch streamlines this approach by reverse transcribing the 5′ mRNA fragments using template-switching by Moloney murine leukemia virus (MMLV) reverse transcriptase 15. Template-switching avoids adapter ligation and enables synthesis of 5′-tagged cDNA without releasing RNA from the beads, permitting use of an irreversible, biotinylated cap to increase RNA capture affinity. In all, we show that Capp-Switch is a robust method that yields a genome-wide, strand-specific, quantitative map of TSS at single nucleotide resolution.
We apply Capp-Switch sequencing to define a genome-wide map of 9,457 TSS during C. phytofermentans growth on raw biomass, heterogeneous polysaccharides (cellulose, hemicellulose and pectin) and their constituent sugars. We use this TSS map to investigate features controlling gene regulation, such as RNA polymerase binding sites, 5′ untranslated region (UTR) structure, alternative promoters, operons and non-standard (leaderless and antisense) transcription. We identify sequence motifs associated with groups of TSS that are differentially expressed on specific carbon sources and show these motifs can be used to reconstruct transcription factor regulons. By integrating Capp-Switch data with an updated genome annotation, RNA-seq and proteomics, we discover novel transcriptional units (TU) and protein-encoding genes. Finally, we discuss how Capp-Switch sequencing can be applied as a general approach to explore transcription regulation in prokaryotes.

Results
General transcriptome features. Capp-Switch sequencing quantified TSS with high reproducibility between duplicate model substrate (Fig. 2a) and raw biomass (Fig. 2b) cultures. We identified 9,457 TSS across treatments (Supplementary Data 1), one-third of which were expressed in both sugar and polysaccharide cultures (Fig. 2c). Most reads (74%) contribute to InterS TSS (intergenic TSS with a downstream gene in the same orientation) (Fig. 2d), which we observed upstream of 898 genes. Among these, 687 genes (77%) are predicted to start operons 16 (Supplementary Data 2), supporting these operon predictions and the existence of many sub-operons. The 5′ UTR, spanning from the primary TSS to the start codon, is less than 100 bp for most genes, but there is no correlation between 5′ UTR length and TSS strength (Fig. 2e). Studies in other bacteria report many leaderless mRNA without 5′ UTR and ribosome binding sites (RBS) 11. Four per cent of InterS TSS are potentially leaderless in C. phytofermentans, but these genes generally have another upstream TSS and retain a typical RBS similar to highly expressed C. phytofermentans genes (Supplementary Fig. 1). Most genes were expressed from a single, primary TSS on all substrates (Fig. 2f), but 191 (21%) genes altered their primary TSS in response to carbon source. Further, genes with substrate-specific InterS TSS are often differentially expressed on that carbon source (χ² test, P < 0.01 for all substrates relative to glucose) (Fig. 2g), supporting the view that changing TSS is a widespread means of transcription regulation. In total, more than a thousand TSS are specific to each polysaccharide (Supplementary Fig. 2A). Xylan-specific (Supplementary Fig. 2B) and pectin-specific (Supplementary Fig. 2C) TSS are primarily associated with carbohydrate metabolism genes, while the most abundant functional category of cellulose-specific TSS is prophage genes (Supplementary Fig. 2D). The C. phytofermentans genome includes a large prophage island that is not predicted to encode a viable phage 3, but whose transcription is up-regulated on cellulose and biomass (Supplementary Fig. 3). This burst of transcriptional initiation at viral genes could indicate that prophage excision was triggered on cellulosic substrates, that is, by low carbon stress, or that viral proteins contribute to bacterial fitness 17.
Sequences upstream of primary TSS generally contain the sigma-A-type consensus −35 and −10 hexamers recognized by RNA polymerase (RNAP) and associated elements that likely contribute to promoter function in this organism. An A-rich region upstream of the −35 hexamer (TTGACA) (Fig. 2h) resembles the 'UP element' that stimulates transcription initiation by interacting with the RNAP alpha subunit 18. Also, the Pribnow hexamer (TATAAT) has an upstream TG di-nucleotide (Fig. 2i), which enhances transcription in certain other bacteria 19-21 by interacting with the RNAP sigma-A subunit 22. In contrast, searching upstream of IntraS TSS (intragenic TSS in the same orientation as the gene) identified an AT-rich stretch ~10 bp upstream of the TSS lacking RNAP binding sites (Supplementary Fig. 4A), suggesting that IntraS TSS often result from promiscuous initiation at AT-rich sequences. We observed that IntraS TSS comprised more than 50% of TSS (Fig. 2d), albeit with fewer reads per site than InterS TSS. dRNA-seq studies have rationalized similarly abundant intragenic TSS as resulting from incomplete TEX degradation 12, but our data support that these TSS bear the 5′-PPP indicative of transcription initiation. IntraS TSS are preferentially found in the 5′ end of genes (Supplementary Fig. 4B), suggesting that they are under selective pressure and may have roles including expression of alternative protein isoforms or as mimicry molecules to sequester other RNA and ribonucleases from their mRNA targets 9.
Capp-Switch reads (Fig. 3a-d) start at specific positions with respect to known genes, showing TSS at single base resolution, whereas RNA-seq reads begin throughout genes (Fig. 3e-h). We observed four common TSS situations: genes with a single upstream TSS, genes with both upstream and intragenic TSS, genes with multiple TSS on a single substrate and genes with substrate-specific TSS. For example, the glyceraldehyde 3-phosphate dehydrogenase (gadph) gene is constitutively transcribed from a single TSS (Fig. 3a). The pyruvate ferredoxin oxidoreductase (pfor) gene is transcribed from a single, upstream TSS and another, weaker TSS in the coding sequence (Fig. 3b). The cel5A cellulase gene 23 is simultaneously transcribed from multiple TSS on cellulose (Fig. 3c), as are other cellulases (Supplementary Fig. 5). CAZyme expression in C. phytofermentans is controlled by carbon source 24,25 and our data support that their regulation involves multiple promoters. The cphy1510 gene encoding the most active xylanase 5 is transcribed from three TSS on xylan and a different, upstream TSS on pectin (Fig. 3d). Similarly, genes for other CAZymes including three cellulases, one other xylanase, four pectinases and two glycosyl transferases changed their primary TSS as a function of carbon source. We confirmed the positions of the primary TSS identified by Capp-Switch for gadph, pfor (IntraS and primary TSS), cphy2243 and cphy1510 (xylan and pectin) using 5′ RACE (Supplementary Fig. 6).
Motifs associated with TSS clusters. We clustered TSS based on expression across carbon sources and searched sequences surrounding TSS for overrepresented motifs (Supplementary Fig. 7; Supplementary Data 3), revealing TSS clusters that share motifs with potential regulatory functions (Fig. 4). For example, the TSS cluster up-regulated on galacturonic acid and homogalacturonan (HG) (Fig. 4c) has a palindromic motif resembling the cre operator (TGAAAGCGCTTTCA) bound by B. subtilis CcpA 26,27, a LacI/GalR regulator of numerous carbon metabolism genes. LacI/GalR genes often have upstream copies of their operators to auto-repress transcription 28, and we found three copies of the galacturonic acid cluster motif in the 5′ UTR of cphy2742, a LacI/GalR gene specifically up-regulated on galacturonic acid (Fig. 5a). Further, three of the six LacI/GalR genes with detected primary TSS have upstream variants of the cre operator that are conserved in their orthologs from related species (Fig. 5b-d), leading us to propose that C. phytofermentans LacI/GalR regulators recognize related, but distinct, operators to control separate regulons. In support of this, the putative Cphy2742 operator (Fig. 5b) is upstream of 22 genes in the C. phytofermentans genome (Supplementary Table 1) including 3 CAZymes (PL9 pectin lyases) that degrade HG to galacturonic acid 5 and transcription units containing all genes needed to assimilate galacturonic acid 29 (Supplementary Fig. 8).
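The palindromic nature of the cre operator quoted above, and the idea of scanning an upstream region for operator-like sites, can be illustrated with a minimal Python sketch. It is not the motif-discovery procedure used in this study (which relied on MEME and MAST); the function names, the mismatch tolerance and the toy upstream sequence are assumptions made purely for illustration.

```python
# Illustrative sketch only: helper names and the toy sequence are hypothetical.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_palindromic(seq):
    """A DNA palindrome reads the same as its own reverse complement."""
    return seq.upper() == reverse_complement(seq)


def find_sites(sequence, motif, max_mismatches=0):
    """Return (0-based offset, window) for every window that differs from
    the motif at no more than max_mismatches positions."""
    sequence, motif = sequence.upper(), motif.upper()
    hits = []
    for i in range(len(sequence) - len(motif) + 1):
        window = sequence[i:i + len(motif)]
        mismatches = sum(a != b for a, b in zip(window, motif))
        if mismatches <= max_mismatches:
            hits.append((i, window))
    return hits


CRE = "TGAAAGCGCTTTCA"  # cre operator sequence quoted in the text
# Toy upstream region (invented): one exact copy and one single-mismatch variant.
toy_upstream = "ATAT" + "TGAAAGCGCTTTCA" + "TTT" + "TGAAAGCGGTTTCA" + "AT"

exact_hits = find_sites(toy_upstream, CRE)
relaxed_hits = find_sites(toy_upstream, CRE, max_mismatches=1)
```

Allowing a small number of mismatches is one simple way to pick up operator variants; a position weight matrix, as produced by MEME, is the more principled choice when several related sites are known.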
The putative Cphy2742 operator sites are co-located with or downstream of TSS for HG degradation and galacturonic acid metabolism genes (Fig. 5e), suggesting that Cphy2742 binds these sites to block transcription. Transcription of the pl9 genes cphy2919 and cphy3869 switches to upstream primary TSS on galacturonic acid relative to HG, but all TSS are close enough to be potentially regulated by Cphy2742 operators. The pta-ackA (cphy1326-7) acetate synthesis operon also has a Cphy2742 operator and both pta-ackA expression and acetate formation are elevated on galacturonic acid (Supplementary Fig. 9). While B. subtilis CcpA represses most of its targets, it activates pta and ackA transcription 30,31 by binding upstream of their promoters 32. The Cphy2742 operator is also upstream of the pta gene TSS, suggesting Cphy2742 may similarly activate transcription of the pta-ackA operon as well as the glycolytic gene ppdK and the hydrolase gene cphy0367. Collectively, we propose that Cphy2742 represses a comprehensive set of pectin fermentation genes by binding a conserved palindrome at or downstream of their TSS to block transcription. In response to a galacturonic acid-based signal, Cphy2742 de-represses itself and its targets, and may activate transcription of acetate synthesis and other aspects of carbon metabolism by binding upstream of TSS.
Antisense and novel transcripts. Recent studies found that 30-40% of TSS are antisense in other bacteria 8,9,13. However, antisense transcription appears rare in C. phytofermentans: < 1% of TSS were antisense either between (InterA) or within genes (IntraA) (Fig. 2d). To further investigate whether diffuse antisense transcription was underestimated by our TSS thresholds, we classified all mapped read starts, including those not meeting TSS thresholds. Even then, InterA and IntraA classes together comprise < 4% of reads. This dearth of antisense transcription may relate to the early evolutionary divergence of the Clostridiales 33. Alternatively, we would not detect antisense transcripts that were processed to remove 5′-PPP or that are below the 200 bp size threshold of our cDNA libraries, but studies in other bacteria using larger size thresholds found antisense TSS in ~35% of genes 10. While comparatively rare, antisense transcription appears to have important cellular functions. For example, we observed an antisense TSS in the 5′ UTR of the sporulation regulator spoOA (cphy2497) that also opposes transcription of the spoIVB peptidase (cphy2498) (Fig. 6a). This TSS was expressed on all sugars, but not polysaccharides, suggesting that antisense transcription has a role in repressing sporulation during log growth in sugar-replete conditions. TSS reveal novel transcriptional features such as a TU downstream of the glycoside hydrolase cphy2658 that is up-regulated to have the strongest initiation site in the genome on cellulose and corn stover (Fig. 6b). This region contains a hypothetical open-reading frame (ORF) in the MaGe annotation (clops3132) that has no similar sequences in Genbank, but the ORF lacks an RBS, and we did not detect any expressed peptides from this region by mass spectrometry, suggesting it is a non-coding RNA. The most highly expressed ABC transporter on glucose is a putative operon (cphy2241-3) with a single TSS (Supplementary Fig. 5C,F). On all other carbon sources, we observed repression of cphy2241-3 along with appearance of an upstream, antisense TU (Fig. 6c) that has no mapped peptides or predicted ORF. Non-coding RNA are often associated with ABC transporters in clostridia 34, and they may also regulate ABC transport in this organism.
The C. phytofermentans genome may encode significantly more genes than in the NCBI Genbank annotation. Classifying TSS using the MaGe annotation showed 735 (7%) TSS map to MaGe-specific clops genes of unknown function (Supplementary Data 4), including 64 clops genes with InterS TSS. We examined which of these novel TU encode proteins by mapping C. phytofermentans MS/MS peptide spectra to the genome translated in all frames, identifying peptides outside the predicted proteome in 21 InterS, 13 IntraS, 5 InterA and 25 IntraA regions (Supplementary Data 5). The combination of TSS and expressed peptides supports ORFs with N-terminal extensions such as cphy0891 (Supplementary Fig. 10A) and the existence of novel ORFs. Examples include clops3461, which overlaps with cphy2929 on the opposite strand (Fig. 6d), and an antisense overlapping ORF in cphy1953 encoding the ComEA competence protein (Supplementary Fig. 10B).
TSS also show mechanisms of RNA-mediated gene regulation. Comparative genomics with other clostridia detected a putative T-box upstream of the C. phytofermentans trp operon 34. In low tryptophan conditions, the T-box promotes antitermination of the trp operon by base pairing with uncharged tRNATrp (ref. 35). We observed that transcription halted abruptly in the 5′ UTR of the trp operon in glucose cultures (Fig. 6e), consistent with T-box-mediated repression. In cellulose cultures, antitermination in the T-box enabled trp operon mRNA expression, potentially enabling translation of the tryptophan-rich carbohydrate binding modules in cellulases and other CAZymes. TSS also support riboswitches associated with genes for metabolism of flavin mononucleotide (FMN), cobalamin, thiamine pyrophosphate (TPP) and lysine (Supplementary Data 6). For example, C. phytofermentans is auxotrophic for thiamine, which it takes up via a thiamine transporter, Cphy0729 (ref. 36). The cphy0729 gene has a single, constitutive TSS with an extended 5′ UTR containing a putative TPP-sensing riboswitch (Fig. 6f) that could regulate transporter expression in response to intracellular TPP levels 37.

Discussion
The strategy presented here to quantify condition-specific changes in transcription initiation by Capp-Switch sequencing could be generally applied to dissect the regulation of complex bacterial phenotypes. In this study, we explored the transcriptional programme enabling C. phytofermentans to ferment the cellulosic, hemicellulosic and pectic components of plant biomass. We found that growth on these different carbon sources entailed widespread TSS changes, including use of substrate-specific TSS for genes encoding biomass-degrading enzymes such as cellulases, xylanases and pectinases. Substrate-specific TSS could enable tuning of expression by changing promoters or the regulatory properties (that is, binding sites or secondary structure) of the 5′ UTR. We observed that genes encoding cellulases and other enzymes are simultaneously expressed from more than one TSS. Multiple regulators may control transcription of these genes, reflecting the numerous transcription factors encoded by this organism (Supplementary Data 7). Genes for biomass-degrading enzymes in other Clostridiales are regulated by various transcription factors including a two-component system for hemicellulases 38, a LacI/GalR protein for β-1,3-glucanases 39 and alternative sigma factors for cellulases 40. We defined TSS clusters that were differentially expressed on specific carbon sources and used them to guide the discovery of sequence motifs with potential regulatory function, leading us to identify the LacI/GalR Cphy2742 as a putative regulator of pectin metabolism. Combining TSS mapping with motif searching could be broadly applied to LacI/GalR regulators and other types of transcription factors. For example, each of the 4 TetR regulators for which we detected TSS also has conserved, TSS-associated palindromes that resemble operator sites (Supplementary Fig. 11).
We also gained insight into regulatory mechanisms such as antisense transcription, leaderless transcription and non-coding RNA. We observed that antisense and leaderless transcription are much rarer than reported in other bacteria and it will be interesting to see if they are similarly uncommon in closely related bacteria. We also show that integration of Capp-Switch TSS mapping with RNA-seq and proteomics enables discovery of novel transcription units and protein-encoding genes. Transcription initiation is a complex and important component of gene regulation for which most of the underlying mechanisms in C. phytofermentans are still unknown. Further, these results illustrate how little we know about gene regulation in plant-fermenting clostridia, a group of bacteria with important roles in soil and gut microbiomes that have significant potential to serve as biocatalysts for industrial transformation of plant biomass.
Capp-Switch library preparation. Total RNA was extracted from duplicate cultures for each treatment using TRI reagent (Sigma 93289) and treated with Turbo DNase (Ambion AM2238) at 0.2 U µg⁻¹ RNA for 30 min at 37°C. RNA was purified by Zymo Concentrator-5 (Zymo Research R1015) (>200 bp capture) into 15 µl water. RNA was 5′ capped using VCE (NEB M2080) at 3 U µg⁻¹ RNA with 0.1 mM SAM and 0.5 mM 3′ biotin-GTP (NEB N0760) for 30 min at 37°C and purified by Zymo Concentrator-5 (>200 bp capture) with two additional washes into 45 µl water. RNA was fragmented for 30 s at 94°C using NEBNext Magnesium-based RNA fragmentation buffer (NEB E6101) and purified by Zymo Concentrator-5 (total RNA capture) into 100 µl water. Streptavidin magnetic beads (NEB S1421S) were pre-washed twice with low-salt buffer (10 mM Tris, 50 mM NaCl, 1 mM EDTA), twice with binding buffer (10 mM Tris, 500 mM NaCl, 1 mM EDTA) and resuspended at 4 mg ml⁻¹ beads in binding buffer. Capped RNA fragments were bound to streptavidin beads for 20 min at room temperature and magnetically separated from other RNA by washing twice with binding buffer and twice with low-salt buffer to elute non-bound RNA. Beads were washed once with 1 mM Tris-HCl pH 7.5 and resuspended in 1 mM Tris-HCl pH 7.5.
RNA was converted to single-strand cDNA by SMARTscribe MMLV reverse transcriptase (Clontech 634836) at 10 U µl⁻¹ with 2.5 mM DTT, 1 mM dNTP, 1.2 µM SMARTer stranded oligo and 0.6 µM SMART stranded N6 primer (Clontech 634836) by incubating 90 min at 42°C and 10 min at 70°C. Beads were collected and the supernatant was combined with the liquid fraction after the beads were washed with 30 µl 1 mM Tris pH 7.5. The cDNA was twice purified using 1 volume of solid phase reversible immobilization (SPRI) beads (Beckman Coulter A63880). cDNA was left on beads after the second purification and double-stranded cDNA was synthesized by 18 cycles of PCR using SeqAmp DNA polymerase (Clontech 638504) with 0.25 µM primers (Universal Forward PCR primer and indexed Reverse PCR primer) and then SPRI purified with 1 volume of beads. DNA was sequenced on Illumina MiSeq with 150 bp paired-end chemistry.
TSS identification and classification. Sequencing reads were quality filtered 44 and the 3 bp MMLV reverse transcriptase 3′ non-template extension was removed from the 5′ end of forward (R1) reads. Reads were mapped to the C. phytofermentans ISDg genome (NCBI NC_010001.1) using Bowtie 2 (version 2.2.4) 45. Alignments showed that 87-98% of reads mapped to unique positions in the C. phytofermentans genome, yielding between 0.4 million (corn stover) and 3.4 million (glucose) reads per culture (Supplementary Table 2). TSS were identified using R1 reads by calculating the number of reads starting at each genomic position, clustering read counts within a 5 bp sliding window, and retaining the position with the greatest number of reads. TSS were defined as genome positions with greater than 10 read starts per million reads in both duplicate cultures. Capp-Switch TSS were confirmed by 5′ RACE (Sigma 03353621001) using primers in Supplementary Table 3 to amplify PCR products, which were resolved by electrophoresis, excised and sequenced.
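The TSS-calling steps described above (count read starts per position, collapse counts within a 5 bp window onto the strongest position, then require more than 10 read starts per million in both duplicates) can be sketched in Python as follows. This is one possible reading of the procedure, not the authors' pipeline: the greedy windowing, the choice to sum counts within a window, and all function names are assumptions. The toy data contain only 100 reads per replicate, so the example passes a deliberately inflated threshold rather than the 10 reads per million used in the study.

```python
from collections import Counter


def collapse_window(counts, window=5):
    """Greedy collapse (an assumed interpretation): on each strand, group
    start positions lying within `window` bp of the first position of the
    group, keep the position with the most reads, and assign it the summed
    count of the group."""
    peaks = {}
    for strand in sorted({s for s, _ in counts}):
        positions = sorted(pos for s, pos in counts if s == strand)
        cluster = []
        for pos in positions + [None]:  # None flushes the final cluster
            if cluster and (pos is None or pos - cluster[0] >= window):
                best = max(cluster, key=lambda p: counts[(strand, p)])
                peaks[(strand, best)] = sum(counts[(strand, p)] for p in cluster)
                cluster = []
            if pos is not None:
                cluster.append(pos)
    return peaks


def call_tss(rep1_starts, rep2_starts, min_rpm=10.0, window=5):
    """Return (strand, position) sites exceeding min_rpm read starts per
    million mapped reads in both replicates."""
    per_replicate = []
    for starts in (rep1_starts, rep2_starts):
        starts = list(starts)
        peaks = collapse_window(Counter(starts), window)
        per_replicate.append({site: 1e6 * n / len(starts) for site, n in peaks.items()})
    shared = per_replicate[0].keys() & per_replicate[1].keys()
    return sorted(site for site in shared
                  if all(rep[site] > min_rpm for rep in per_replicate))


# Toy 5' ends of R1 reads as (strand, position); values are invented.
rep1 = [("+", 100)] * 60 + [("+", 102)] * 5 + [("+", 500)] + [("-", 900)] * 34
rep2 = [("+", 100)] * 50 + [("+", 101)] * 10 + [("-", 900)] * 40

# Inflated threshold (5% of reads) because each toy replicate has only 100 reads.
tss = call_tss(rep1, rep2, min_rpm=50_000)
```

Requiring the same peak position in both replicates is a strict criterion; a tolerance of a base or two between replicates would be an alternative design, at the cost of single-base resolution.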
Genes in the NCBI and MicroScope (MaGe) annotations 46 were used to divide TSS into four categories: InterS (intergenic TSS with downstream gene in same orientation), InterA (intergenic TSS with downstream gene in opposite orientation), IntraS (intragenic TSS in gene with same orientation) or IntraA (intragenic TSS in gene with opposite orientation). The InterS TSS with the most reads for each gene was defined as the primary TSS. Capp-Switch results were compared with strand-specific (dUTP) RNA-seq of C. phytofermentans grown in the same culture conditions 5. RNA-seq gene expression was calculated as RPKM using the Bioconductor 47 package 'easyRNASeq' and differential expression was defined as a DESeq 48 (version 1.22.1) P-value < 0.05 adjusted for multiple testing of the 3,902 genes in the C. phytofermentans genome by Bonferroni correction. Peptides corresponding to novel ORFs were identified by mapping peptide MS/MS spectra from glucose, xylan and cellulose cultures 4 to the genome translated in all six frames. Peptides were identified from spectra using SEQUEST and filtered to a 5% false discovery rate using a target-decoy approach 49,50 including a target database and a decoy of the reversed sequences.
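The four TSS categories and the primary-TSS rule defined above can be expressed as a short Python sketch. The `Gene` record, the function names and the toy coordinates are hypothetical, and two details are assumptions not stated in the text: "downstream" is taken relative to the strand of the TSS, and a TSS inside overlapping genes is called IntraS if any of them is on the same strand.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int   # 1-based, inclusive, start <= end
    end: int
    strand: str  # "+" or "-"


def classify_tss(pos, strand, genes):
    """Return (category, gene name) for a TSS at `pos` on `strand`."""
    inside = [g for g in genes if g.start <= pos <= g.end]
    if inside:
        sense = [g for g in inside if g.strand == strand]
        if sense:
            return "IntraS", sense[0].name
        return "IntraA", inside[0].name
    # Intergenic: find the nearest gene in the direction of transcription.
    if strand == "+":
        ahead = [g for g in genes if g.start > pos]
        nearest = min(ahead, key=lambda g: g.start, default=None)
    else:
        ahead = [g for g in genes if g.end < pos]
        nearest = max(ahead, key=lambda g: g.end, default=None)
    if nearest is None:
        return "unclassified", None  # no downstream gene in this toy linear model
    return ("InterS" if nearest.strand == strand else "InterA"), nearest.name


def primary_tss(tss_records, genes):
    """For each gene, keep the InterS TSS with the most reads.
    tss_records: iterable of (position, strand, read count)."""
    best = {}
    for pos, strand, reads in tss_records:
        category, gene = classify_tss(pos, strand, genes)
        if category == "InterS" and (gene not in best or reads > best[gene][1]):
            best[gene] = (pos, reads)
    return best


# Toy annotation and TSS calls (invented coordinates).
genes = [Gene("geneA", 100, 200, "+"), Gene("geneB", 300, 400, "-")]
records = [(50, "+", 120), (80, "+", 300), (150, "+", 900), (450, "-", 40)]
primaries = primary_tss(records, genes)
```

A real implementation would also need to handle the circular chromosome and index genes for fast lookup (for example with an interval tree), which this linear scan omits for brevity.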
Motif analysis. Sequence motifs were identified using MEME 51 with a background model of di-nucleotide frequencies in the C. phytofermentans genome. Searches for RNA polymerase binding site motifs included positions 25-50 bp (−35 motif) and 5-20 bp (−10 motif) upstream of all primary TSS expressed on the three sugars and polysaccharides. The top palindromic motifs associated with LacI/GalR and TetR regulators were found by searching sequences from −250 (upstream) to +50 bp (downstream) relative to the start codon of C. phytofermentans genes and their putative orthologs from related genomes identified by top reciprocal BLAST searches (Supplementary Table 4). These motifs were used for genome-wide scans from −250 to +50 bp within all C. phytofermentans genes using MAST 52. To cluster TSS by expression, read counts for the 1,188 TSS with at least a 30-fold change between two conditions were log2-transformed and each TSS was normalized to have a median value of 0 across conditions and scaled so that the sum of the squared expression levels is 1. TSS were separated into 24 clusters by K-means using the city-block similarity metric. Significant motifs (E < 0.001) associated with individual K-means clusters were identified by searching −100 to +10 bp with respect to each TSS.
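The expression normalization applied before clustering (fold-change filter, log2 transform, median centring, scaling to unit sum of squares) and the city-block distance can be written out as a small Python sketch. The pseudocount of 1, used here to avoid taking the logarithm of zero counts, is an assumption that is not specified in the text, as are the function names and the toy profile; the K-means step itself is omitted.

```python
import math
from statistics import median


def passes_fold_change(counts, min_fold=30.0, pseudocount=1.0):
    """True if the largest and smallest counts across conditions differ by
    at least min_fold (pseudocount is an assumption, see lead-in)."""
    return (max(counts) + pseudocount) / (min(counts) + pseudocount) >= min_fold


def normalize_profile(counts, pseudocount=1.0):
    """log2-transform, centre on the median, then scale so that the sum of
    squared values equals 1."""
    logged = [math.log2(c + pseudocount) for c in counts]
    centre = median(logged)
    centred = [x - centre for x in logged]
    norm = math.sqrt(sum(x * x for x in centred))
    if norm == 0:
        return centred  # flat profile: nothing to scale
    return [x / norm for x in centred]


def city_block(a, b):
    """City-block (Manhattan) distance between two expression profiles."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Toy read counts for one TSS across five conditions (invented values).
toy_counts = [0, 1, 3, 7, 63]
profile = normalize_profile(toy_counts)
```

Because scaling by a positive constant leaves a zero median unchanged, the normalized profile satisfies both constraints at once. Profiles prepared this way can be passed to any clustering routine that accepts a city-block metric.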
Data availability. The authors confirm that all data underlying the findings are fully available without restriction. RNA sequencing files in FASTQ format are available in the European Nucleotide Archive under study accession PRJEB13063.