The primary transcriptome of the Escherichia coli O104:H4 pAA plasmid and novel insights into its virulence gene expression and regulation

Escherichia coli O104:H4 (E. coli O104:H4), which caused a massive outbreak of acute gastroenteritis and hemolytic uremic syndrome in 2011, carries an aggregative adherence fimbriae I (AAF/I) encoding virulence plasmid, pAA. The importance of pAA in host-pathogen interaction and disease severity has been demonstrated, however, not much is known about its transcriptional organization and gene regulation. Here, we analyzed the pAA primary transcriptome using differential RNA sequencing, which allows for the high-throughput mapping of transcription start site (TSS) and non-coding RNA candidates. We identified 248 TSS candidates in the 74-kb pAA and only 21% of them could be assigned as TSS of annotated genes. We detected TSS for the majority of pAA-encoded virulence factors. Interestingly, we mapped TSS, which could allow for the transcriptional uncoupling of the AAF/I operon, and potentially regulatory antisense RNA candidates against the genes encoding dispersin and the serine protease SepA. Moreover, a computational search for transcription factor binding sites suggested for AggR-mediated activation of SepA expression, which was additionally experimentally validated. This work advances our understanding of the molecular basis of E. coli O104:H4 pathogenicity and provides a valuable resource for further characterization of pAA virulence gene regulation.

non-unique reads (reads mapped to multiple regions) were corrected based on (divided by) the number of locations they could be mapped to.

TSS and PS annotation
Annotation of transcription start sites (TSS) and processing sites (PS) candidates was performed with the TSSPredator program available at http://it.inf.unituebingen.de/TSSpredator 5 . Briefly, the coverage graphs for the TEX+ and TEX-libraries generated by READemtion were converted into GR files and served as input for TSSpredator. revealed that two known/hypothetical virulence-associated genes are missing and one is wrongly annotated in the current pAA annotation of E. coli O104:H4. Therefore, we introduced the following changes in NC_018666 annotation: (i) aar coding for the AggRactivated regulator Aar 10 was annotated in the region encompassing nucleotide position 74181-164 on the minus strand. The E. coli O104:H4 homolgue is 3 amino acids longer in comparison to the one in EAEC 042; (ii) aap coding for dispersin 11 was annotated with the coordinates 2094-2444 on the minus strand; and (iii) O3K_26127, a homologue of the ECO42_pAA004 gene coding for a hypothetical protein in the EAEC strain 042 was extended from 11391-11678 to 11391-11930 on the minus strand and thus replaced O3K_26132 (11687-11911, minus strand), which in the DOOR operon annotation are part of a three gene operon (O3K_26137-26132-26127). In addition, the following changes in the operon annotation were introduced: (i) The genes aap and aar are not mapped in the vicinity of any previously annotated gene in NC_018666 and were therefore assumed to be transcribed monocistronically; (ii) the AAF/I operon-encoding genes O3K_26212, 26207, 26202, 26197, were clustered in a single operon 12 ; (iii) O3K_26192, considered to be co-transcribed with O3K_26197 (the last gene of the AAF/I cluster ) in DOOR was changed to an alone standing gene based on the 3' end of the AAF/I operon mapped in this study. (iv) operons O3K_26482-26477 and O3K_26472-26467-26462 which encode the genes of the aatPABCD cluster 13 were combined into one transcription unit. In summary, we assume in our analysis that there are 81 coding sequences in E. coli O104:H4 pAA plasmid, 44 of them clustered in 16 operons and 37 transcribed as alone standing genes (Supplementary Table S2).

Correlation of AggR binding site occurrence and gene type
The correlation of predicted AggR binding sites with gene function was evaluated using the following information: the operon information described in the "Reannotation of NC_018666 and operon organization of pAA" section, gTSS and gPS (mapped within the 300 nt upstream regions of ORFs) annotated by TSSpredators (Supplementary Dataset S1 and S2) and the FIMO-predicted AggR binding sites (Supplementary Table S4). The FIMO hits were considered to be associated with genes when they were located in the region from -250 nt in respect to detected mRNA end/ATG to + 50 nt in respect to ATG of monocistronic genes or first genes in an operon. The mRNA end coordinates were considered for genes with annotated gTSS or gPS (in case no gTSS could be detected) or ATG for genes with no detected gTSS or gPS. The coordinates used for the AggR analysis are based on the locations of the experimentally determined binding sites used for the generation of the AggR consensus binding motif [14][15][16] . Once a predicted transcription factor binding site was mapped within the above described regions of a first gene of an operon, then all genes of this operon were considered to be associated with the corresponding binding site. The list of virulence genes was created based on published data 10,11,13,[17][18][19][20][21] . The number of virulence gene with and without associated AggR/H-NS predicted binding sites were compared to the number of nonvirulence genes with and without those biding site hits by means of a Fisher's exact test.

5' RACE & 3' RACE
TSS and PS candidates represented with various stepHeight (cDNA reads starting at the respective position) and enrichmentFactor (enrichment between differential libraries) values, and belonging to different categories were selected for verification in a biological replicate by 5' RACE analysis. TSS and PS assigned to known or hypothetical virulence-associated genes were preferentially chosen. 5' RACE and 3' RACE analyses were performed as previously described 22 . For 5' RACE 1μg of total RNA from LB226692 was treated with tobacco acid pyrophosphatase (TAP; Epicentre) or incubated in buffer alone for 60 min at 37 °C. Next, 5' RNA linker (5R_L; Supplementary Table S5) was ligated to the RNA for 60 min at 37°C using 40 U of T4 RNA ligase (Epicentre). Reverse transcription was carried out with up to 6 target specific reverse primers and SuperScript III reverse transcriptase (LifeTechnologies).
For verification of gTSS and aTSS, the target specific primers were placed within the ORFs to which the TSS were assigned. All above described reactions were purified by an organic extraction (25:24:1 v/v phenol/chloroform/isoamyalcohol) and RNA was recovered by overnight precipitation with 3 volumes of ethanol/3 M sodium acetate at -20 o C. Nested PCRs were set with target and linker specific primers and the cDNA from the reverse transcription step as template. The mature 16S rRNA, which has a 5'-P end, was used as a 3' linker in 3' RACE. 1μg of total RNA from LB226692 was self-ligated for 60 min at 37°C using 40 U of T4 RNA ligase (Epicentre). Control reactions without ligase were included. Next, RNA was reverse-transcribed using rrn16 specific primer (3R_1). Target and rrn16 specific primers were used in nested PCRs for the amplification of the reverse transcription products. 5'-and 3' RACE PCR products were analyzed on 1.5% agarose gels and products of interest were excised, purified and Sanger sequenced using BigDye® Terminator Cycle Sequencing Kit (Life Technologies). PCR products containing multiple 5'or 3' pAA termini were ligated into pGEM®-T (Promega) and transformed into NEB 5-alpha competent E.coli cells (New England Biolabs). 5 bacterial clones per transformation were selected, their DNA was PCR amplified with primers M13_for and rev (vector specific primer) and sequenced as described above. To ensure that the detected 5' ends in 5' RACE are not amplification artifacts, only different nucleotide combination at the NNN region of the RNA linker were considered for the calculation of the number of clones supporting dRNA-seq data (Supplementary Table S1).

Heterologous expression of SepA and AggR in E. coli K-12
The pAA regions from position 55353 to 59821, encompassing sepA, its promoter region and both predicted AggR binding sites, and from position 6823 to 7773, containing aggR without the promoter region were amplified from total DNA using the Phusion Polymerase (Thermo Scientific) and the primers PB74 + PB75 and PB76 + PB77, respectively. The fragment containing sepA was digested with XbaI (restriction site was introduced in the 5' end of the reverse primer PB75) and ligated into pBAD24 23 linearized with the restriction enzymes XbaI and EcoRV. This cloning strategy allowed for the expression of SepA under the control of its native promoter. KpnI and XbaI sites were added to the 5' ends of the forward (PB76) and reverse (PB77) primer, respectively, used for aggR amplification. After digestion with the respective enzymes, the aggR-containing fragment was ligated into the corresponding sites in pBAD33 23 and it was thus cloned under control of the arabinose-inducible promoter. Sanger sequencing was used to verify the correct sequence of the cloned constructs. The plasmids pSepA carrying sepA with its native promoter was later on used as a template for a PCR with primers PB76 and PB77. The obtained product was digested with XbaI and ligated into pBAD24 23 linearized with the restriction enzymes XbaI and EcoRV. This cloning strategy allowed for the construction of the plasmid pSepA* carrying the pAA region from position 55507 to 59821, encompassing sepA and its native promoter region but excluding the upstream predicted AggR binding sites. The correct sequence of the cloned construct was verified by Sanger sequencing. The plasmids carrying sepA with its native promoter and the predicted AggR binding sites (pSepA; Amp R ), sepA with its native promoter and excluding the upstream predicted AggR binding sites (pSepA*; Amp R ), aggR under the transcriptional control of an inducible promoter (pBAD33-AggR, Cm R ), pBAD24 (Amp R ) and pBAD33 (Cm R ) were transformed in electrocompetent cells of the E. coli K-12 strain CSH50 24 .

Monocistronic gene
Supplementary Table S2. Operon map of the pAA plasmid. The pAA operon organization and 5' mRNA ends detected by dRNA-seq are presented. The operon organization is based on the information available in the Database of prokaryotic operons (DOOR) and our reannotation (see Supplementary Materials and Methods). Genes clustered in an operon are color-grouped and marked with 1 in the "Operon" column. The position of the candidate 5'-PPP and 5'-P mRNA ends, i.e. gTSS and gPS, is given. gTSS found internal to operons which could potentially be involved in its transcriptional uncoupling from the main upstream promoter are marked in red.

Mio
Supplementary Figure S2. dRNA-seq experimental set up and an overiview of sequenced and mapped reads. Equal amounts of gDNA-free total RNA isolated from exponentially growing cells of the E. coli O104:H4 strain LB226692 were incubated with TEX+ (5'-P depleted RNA) or left untreated (containing both 5'-PPP and 5'-P RNAs). Next, RNA was treated with tobacco acid pyrophosphatase (TAP), which converts 5'-PPP to 5'-P, poly(A)-tailed using poly(A) polymerase and a 5' end RNA adapter was ligatated. First-strand cDNA was synthesized using an oligo(dT)-adapter primer and the M-MLV reverse transcriptase. Sequencing was performed on a HiSeq 2500 machine (Illumina) in a 100 bp single-end read mode. Approximately 7.2 million reads were obtained from each library. After linker and poly A-tail removal rRNA and tRNA reads were filtered out. The resulting 3.5 and 6.5 million reads for the TEX+ and TEX-library, respectively, were mapped to the E. coli O104:H4 genome consisting of the chromosome and the plasmids pESBL, pG and pAA. In total, 54,408 and 109,883 reads were mapped onto pAA (NC_018666.1) in TEX+ and TEX-library, which correspond to coverage of 59x and 118x, respectively.  Figure S4. The transcriptome profile of aggR. cDNA reads from TEX+ (red) and TEX-(black) libraries mapped against aggR are shown. Due to differential abundance of associated cDNA reads the dRNA-seq graphs are split in two parts (dashed vertical line) and the y-axis (abundance relative score) for each part is given. Annotated TSS (red) and PS (black) are shown above the dRNA-seq graphs. TSS candidates discussed in the main text are indicated in the dRNA-seq graphs with black arrows. The approx. position of 3' ends determined by 3' RACE are indicated by red diamond arrows. Predicted AggR binding sites are represented on the DNA strand as orange boxes.
Supplementary Figure S5. The transcriptome profile of aap. cDNA reads from TEX+ (red) and TEX-(black) libraries mapped against the aap gene coding for dispersin are shown. Annotated TSS (red) and PS (black) are shown above the dRNA-seq graphs and the ones subjected to 5' RACE verification in a biological replicate are marked with an asterisk. TSS candidates discussed in the main text are indicated in the dRNA-seq graphs with black arrows. The approx. position of 3' ends determined by 3' RACE are indicated by red diamond arrows. The predicted AggR binding site is represented on the DNA strand as an orange box.  Figure S6. The transcriptome profile of aatPABCD. cDNA reads from TEX+ (red) and TEX-(black) libraries mapped against the aat gene cluster coding for the dispersin secretion system are shown. Annotated TSS (red) and PS (black) are shown above the dRNA-seq graphs and the ones subjected to 5' RACE verification in a biological replicate are marked with an asterisk. TSS candidates discussed in the main text are indicated in the dRNAseq graphs with black arrows. The approx. position of 3' ends determined by 3' RACE are indicated by red diamond arrows. Predicted AggR binding sites are represented on the DNA strand as orange boxes.  Figure S7. The transcriptome profile of sepA. cDNA reads from TEX+ (red) and TEX-(black) libraries mapped against the sepA gene coding for the serine protease SepA are shown. Due to differential abundance of associated cDNA reads the dRNA-seq graphs in (A) and (D) are split in two parts (dashed vertical line) and the y-axis (abundance relative score) for each part is given. Annotated TSS (red) and PS (black) are shown above the dRNA-seq graphs and the ones subjected to 5' RACE verification in a biological replicate are marked with an asterisk. TSS candidates discussed in the main text are indicated in the dRNA-seq graphs with black arrows. The approx. position of 3' ends determined by 3' RACE are indicated by red diamond arrows. Predicted AggR binding sites are represented on the DNA strand as orange boxes.  R3 R1 R2 R3