Genome-scale determination of 5´ and 3´ boundaries of RNA transcripts in Streptomyces genomes

Streptomyces species are gram-positive bacteria with GC-rich linear genomes, and they serve as dominant reservoirs for producing clinically and industrially important secondary metabolites. Genome mining of Streptomyces revealed that each Streptomyces species typically encodes 20–50 secondary metabolite biosynthetic gene clusters (smBGCs), emphasizing their potential for novel compound discovery. Unfortunately, most smBGCs are uncharacterized in terms of their products and regulation since they are silent under laboratory culture conditions. To translate the genomic potential of Streptomyces to practical applications, it is essential to understand the complex regulation of smBGC expression and to identify the underlying regulatory elements. To progress towards these goals, we applied two Next-Generation Sequencing methods, dRNA-Seq and Term-Seq, to industrially relevant Streptomyces species to reveal the 5´ and 3´ boundaries of RNA transcripts on a genome scale. These data provide a fundamental resource to aid our understanding of the regulation of smBGC expression in Streptomyces and to enhance their potential for secondary metabolite synthesis.
Measurement(s): 5´-ends of transcripts • 3´-ends of transcripts • RNA • TSS • transcription_termination_signal
Technology Type(s): dRNA-Seq • Term-Seq • RNA sequencing
Factor Type(s): Streptomyces growth phase
Sample Characteristic - Organism: Streptomyces avermitilis • Streptomyces clavuligerus • Streptomyces coelicolor • Streptomyces griseus • Streptomyces lividans • Streptomyces tsukubensis • Streptomyces venezuelae
Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.13259393


Background & Summary
Streptomyces species are gram-positive filamentous bacteria of great importance for their ability to produce a wide range of clinically or industrially important secondary metabolites 1,2. During the mid-20th century, the number of available antibiotics increased rapidly; notably, more than 70% of the antibiotics from bacteria were discovered in Streptomyces species, emphasizing their importance as the dominant source of antimicrobial compounds 3. However, the discovery of novel antibiotics declined rapidly during the latter part of the 20th century as research progress with Streptomyces species slowed, as reflected by the decreasing number of novel secondary metabolites discovered 4. Fortunately, with the emergence of Next-Generation Sequencing (NGS) techniques, the genome sequences of many Streptomyces species have been collected, which has increased the potential for producing novel secondary metabolites 5. Computational prediction revealed that a single Streptomyces species typically possesses about 20–50 secondary metabolite biosynthetic gene clusters (smBGCs), and the large number of smBGCs in Streptomyces genomes encourages researchers to revisit these organisms to cope with the threat of emerging multi-drug resistant bacteria 6,7.
Despite their potential for the production of diverse secondary metabolites, most smBGCs have not been characterized in terms of their products and corresponding molecular functions, mainly because the smBGCs are silent under laboratory culture conditions 8. Since most secondary metabolites are not essential for growth and are produced in response to environmental stimuli, such as osmotic pressure, nutrient limitation or inter-species competition, the smBGCs are expected to be under tight and complex regulation 9–11. To utilize the genomic potential of Streptomyces, an understanding of the genetic regulatory mechanisms for activating smBGCs is crucial. In particular, understanding transcriptional regulatory mechanisms is important [...] different growth phases, including early-exponential, transition, late-exponential and stationary phases, to cover genes expressed under starvation conditions as well as genes involved in primary metabolism during active growth (Fig. 1a) 24. dRNA-Seq reveals the transcription start sites (TSSs) of transcripts by differentiating the TSSs from the 5′-ends of processed transcripts. For dRNA-Seq, two libraries are constructed, one from the 5′-ends of unprocessed bacterial primary transcripts and the other from the 5′-ends of processed transcripts. By comparing the two libraries, TSSs can be differentiated from the processed 5′-ends. In contrast, Term-Seq captures the 3′-ends of transcripts, which leads to the identification of the genuine transcription termination sites (TTSs) and processed 3′-ends 25.
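The comparison of the two dRNA-Seq libraries can be illustrated with a minimal Python sketch. It assumes that per-position 5′-end read counts from the library that captures primary transcripts (here called `tap_plus`) and from the library that captures only processed transcripts (`tap_minus`) are already available and comparably normalized. The function name, the minimum read count and the enrichment ratio are hypothetical choices for illustration, not the criteria used in this study.

```python
from typing import Dict


def classify_five_prime_ends(
    tap_plus: Dict[int, int],
    tap_minus: Dict[int, int],
    min_reads: int = 10,      # hypothetical coverage threshold
    min_ratio: float = 2.0,   # hypothetical enrichment threshold
) -> Dict[int, str]:
    """Label each covered position as a putative 'TSS' or a 'processed' 5'-end.

    A position is called a TSS when its count in the primary-transcript
    library is at least `min_ratio` times its count in the processed-only
    library; otherwise it is treated as a processed 5'-end.
    """
    labels: Dict[int, str] = {}
    for pos in sorted(set(tap_plus) | set(tap_minus)):
        plus = tap_plus.get(pos, 0)
        minus = tap_minus.get(pos, 0)
        if max(plus, minus) < min_reads:
            continue  # too few reads to decide either way
        # max(minus, 1) avoids division-like problems when minus is zero
        if plus >= min_ratio * max(minus, 1):
            labels[pos] = "TSS"
        else:
            labels[pos] = "processed"
    return labels


example = classify_five_prime_ends(
    {100: 50, 200: 20, 300: 3},
    {100: 5, 200: 18, 300: 1},
)
print(example)
```

In this toy input, position 100 is strongly enriched in the primary-transcript library, position 200 is similar in both libraries, and position 300 has too few reads to be labeled.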
From the TSSs determined from dRNA-Seq, the promoter sequences can be identified with the aid of computational motif discovery tools 26. In addition, TSS information makes it possible to determine the 5′-untranslated region (5′-UTR) of each gene at nucleotide resolution, which contains transcriptional or translational regulatory elements, such as the ribosome binding site (RBS), riboswitches and upstream open reading frames 15,27–29. Likewise, transcriptional terminator sequences and 3′-UTRs can be determined from the 3′-end information of transcripts obtained from Term-Seq. With the aid of genome-wide transcriptome and translatome information, which can be obtained from RNA-Seq and Ribo-Seq, respectively, the transcriptional and translational effect of each regulatory element, including the promoter sequence, RBS or transcription terminator sequence, can be evaluated. Furthermore, the determined regulatory elements can be utilized for improving the production of secondary metabolites in Streptomyces through synthetic biology approaches. The transcript boundary information obtained from dRNA-Seq and Term-Seq will serve as a fundamental resource for understanding the complex regulatory mechanisms in bacteria and for improving industrial applications.
[...] (Fig. 1a). For NGS library preparation, cultures for each strain were inoculated in eight flasks as biological octuplicates, and cells were harvested from two flasks for each growth phase as biological duplicates.

RNA extraction.
After harvesting, the cells were washed with polysome buffer (20 mM Tris-HCl pH 7.5, 140 mM NaCl, 5 mM MgCl2) and resuspended in lysis buffer (0.3 M sodium acetate pH 5.2, 10 mM EDTA, 1% Triton X-100). The cell suspension was frozen in liquid nitrogen and then physically lysed by grinding with a mortar and pestle. The cell lysate was centrifuged at 4 °C for 10 min at 16000 × g, and the supernatant was saved and stored at −80 °C until used for RNA extraction. For RNA extraction, the supernatant was mixed with an equal volume of phenol:chloroform:isoamyl alcohol (25:24:1) solution. The mixture was then centrifuged and RNA was recovered from the upper aqueous phase by ethanol precipitation.
For Term-Seq of S. coelicolor and S. griseus, RNA was extracted by lysing cells with hot phenol. The harvested cells were resuspended in Sol 1 (25 mM Tris-HCl pH 8.0, 10 mM EDTA, 50 mM glucose, 2 mg/mL lysozyme) and incubated at 30 °C for 10 min. After incubation, the cells were pelleted by centrifugation and the supernatant was discarded. The cell pellet was resuspended in AE-SDS (50 mM sodium acetate pH 5.2, 10 mM EDTA, 1% sodium dodecyl sulfate) and the suspension was mixed with an equal volume of phenol:chloroform (5:1) solution. Cells were lysed by incubating at 65 °C for 5 min and then centrifuged. RNA was recovered from the upper aqueous phase by isopropanol precipitation, and the genomic DNA aggregate that formed upon addition of isopropanol was removed before precipitation.
To remove any DNA contamination, the RNA samples were treated with DNase I (New England Biolabs, Ipswich, MA, USA).

dRNA-Seq library preparation.
The four DNase I-treated RNA samples from the four growth phases were mixed in equal amounts to obtain one 10 μg RNA mixture, and a total of two RNA mixtures were prepared from the eight RNA samples as the biological duplicates for each strain. The rRNA in the RNA mixture was depleted using Ribo-Zero rRNA Removal Kit for Bacteria (Epicentre, Madison, WI, USA). The rRNA-depleted RNA was incubated in 1 × RNA 5′ polyphosphatase (TAP; Epicentre) reaction buffer and 1 U of SUPERase-In (Invitrogen, Carlsbad, CA, USA) at 37 °C for 1 h, with or without TAP for TAP(+) or TAP(−) libraries, respectively. The reaction was cleaned up by ethanol precipitation, and 5 pmol of 5′ RNA adaptor (5′-ACACUCUUUCCCUACACGACGCUCUUCCGAUCU-3′) was ligated to the purified RNA using T4 RNA ligase (Thermo Fisher Scientific, Waltham, MA, USA) by incubating at 37 °C for 90 min in 1 × RNA ligase buffer and 0.1 mg/mL BSA. The ligation product was then purified using Agencourt AMPure XP beads (Beckman Coulter, Brea, CA, USA) according to the manufacturer's instructions. The purified product was reverse-transcribed with SuperScript III Reverse Transcriptase (Invitrogen) according to the manufacturer's instructions and purified using Agencourt AMPure XP beads. The purified cDNA was amplified and indexed using Phusion High-Fidelity DNA Polymerase (Thermo Fisher Scientific) for Illumina sequencing. The amplification step was monitored using a CFX96 Real-Time PCR Detection System (Bio-Rad Laboratories, Hercules, CA, USA) and stopped before the PCR reaction was fully saturated. Finally, the amplified library was purified using Agencourt AMPure XP beads.
Term-Seq library preparation.
Term-Seq libraries for the six species other than S. coelicolor were prepared as previously described 15,17. Equal amounts of DNase I-treated RNA from the sampling time points were mixed and used as the input for Term-Seq library construction. The RNA was treated with Ribo-Zero rRNA Removal Kit for Bacteria (Epicentre) to deplete rRNA. The resulting 500–900 ng of rRNA-depleted RNA was mixed with 1 μL of 150 μM amino-blocked DNA adaptor (5′-p-NNAGATCGGAAGAGCGTCGTGT-3′), 2.5 μL of 10 × T4 RNA ligase 1 buffer, 2.5 μL of 10 mM ATP, 2 μL of DMSO, 9.5 μL of 50% PEG8000, and 2.5 μL of T4 RNA ligase 1 (New England BioLabs). The mixture was incubated at 23 °C for 2.5 h and the reaction was cleaned up using Agencourt AMPure XP beads. The adaptor-ligated RNA was then fragmented by incubating at 72 °C for 90 s in fragmentation buffer (Ambion, Inc, Austin, TX, USA). The fragmentation reaction was cleaned up using Agencourt AMPure XP beads. The fragmented RNA (8 μL in total) was reverse transcribed with SuperScript III Reverse Transcriptase using 1 μL of 10 μM reverse transcription primer (5′-TCTACACTCTTTCCCTACACGACGCTCTTC-3′) according to the manufacturer's instructions. The cDNA was then purified with Agencourt AMPure XP beads. Another amino-blocked adaptor with a different sequence (5′-p-NNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC-3′) was ligated to the cDNA with an increased incubation time (8 h). The ligation product was purified using Agencourt AMPure XP beads and indexed by PCR with Phusion High-Fidelity DNA Polymerase using forward (5′-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCT-3′) and reverse (5′-CAAGCAGAAGACGGCATACGAGATNNNNNN (6 nt index) GTGACTGGAGTTCAGAC-3′) primers. The PCR reaction was monitored using a CFX96 Real-Time PCR Detection System and stopped before the PCR reaction was fully saturated. The PCR product was purified with Agencourt AMPure XP beads.
For S. coelicolor, 1 μg of total RNA, instead of rRNA-depleted RNA, was ligated with 1 μL of 150 μM amino-blocked DNA adaptor (5′-p-NNAGATCGGAAGAGCGTCGTGT-3′) as described above. After ligation, rRNA was removed using Hybridase™ Thermostable RNase H (Lucigen Corporation, Middleton, WI, USA).

High-throughput sequencing and data processing.
All libraries were sequenced on either the Illumina MiSeq or the Illumina HiSeq 2500 platform with a read length of either 1 × 100 bp (dRNA-Seq) or 1 × 50 bp (Term-Seq), except for the dRNA-Seq of S. tsukubaensis. For the dRNA-Seq of S. tsukubaensis, both the TAP(+) and the TAP(−) libraries were sequenced on the Illumina MiSeq platform with a 1 × 150 bp read length. The reads were processed using CLC Genomics Workbench. The raw reads were first mapped to the phiX sequence, which is used on the Illumina sequencing platform for quality control. The detailed mapping parameters are as follows. Mismatch cost: 2; Insertion cost: 3; Deletion cost: 3; Length fraction: 0.9; Similarity fraction: 0.9; Map randomly for non-specific matches. After mapping to the phiX sequence, unmapped reads were collected and trimmed to remove adaptor sequences, short reads and low-quality reads. The detailed trimming parameters are as follows. Quality score limit: 0.05; Maximum number of ambiguities: 2; Remove adaptors; Discard read lengths below 15. For Term-Seq, two nucleotides were removed from both ends of each read because the adaptors include two random nucleotides. The trimmed reads were mapped to the available reference genomes (Accession numbers: BA000030 for S. avermitilis, CP027858 and CP027859 for S. clavuligerus, NC_003888 for S. coelicolor, NC_010572 for S. griseus, CP009124 for S. lividans, CP020700 for S. tsukubaensis, CP059991 for S. venezuelae) with the same parameters as for the phiX mapping, except for the handling of non-specific matches (non-specific matches were discarded).
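The Term-Seq-specific part of the trimming (removing two nucleotides from both ends and discarding reads shorter than 15 nt) can be sketched in a few lines of Python. This is only an illustration and not the CLC Genomics Workbench implementation: the function name is hypothetical, and applying the length filter after the clipping is an assumption, since the order of the two steps is not stated.

```python
from typing import Optional, Tuple


def clip_term_seq_read(
    seq: str,
    qual: str,
    clip: int = 2,         # random nucleotides contributed by each adaptor
    min_length: int = 15,  # "Discard read lengths below 15"
) -> Optional[Tuple[str, str]]:
    """Remove `clip` bases from both ends of a read and its quality string.

    Returns the clipped (sequence, quality) pair, or None when the clipped
    read is shorter than `min_length`. The length check is applied after
    clipping, which is an assumption of this sketch.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    end = len(seq) - clip
    clipped_seq = seq[clip:end]
    clipped_qual = qual[clip:end]
    if len(clipped_seq) < min_length:
        return None
    return clipped_seq, clipped_qual


read = "NN" + "ACGTACGTACGTACG" + "NN"   # 19 nt: 2 + 15 + 2
kept = clip_term_seq_read(read, "I" * len(read))
print(kept)
```

A 19-nt read is the shortest input that survives under these settings, because 15 nt remain after two bases are removed from each end.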
After mapping to the reference genomes, the directions of the mapped Term-Seq reads were inverted because the sequencing output is in the reverse direction relative to the transcripts.
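The strand inversion of Term-Seq reads can be sketched as follows. The `Alignment` record and the function name are hypothetical. The sketch also assumes 1-based inclusive coordinates and that the first sequenced base of each read corresponds to the transcript 3′-end, which follows from the library design described above but is not stated explicitly in the text.

```python
from typing import NamedTuple, Tuple


class Alignment(NamedTuple):
    """A mapped read; coordinates are 1-based and inclusive (assumed)."""
    start: int   # leftmost reference position
    end: int     # rightmost reference position
    strand: str  # strand the read mapped to: "+" or "-"


def transcript_three_prime_end(aln: Alignment) -> Tuple[int, str]:
    """Return (position, strand) of the transcript 3'-end for one read.

    The transcript strand is the opposite of the mapped read strand. The
    3'-end is taken as the read's first sequenced base: the leftmost
    coordinate for a "+" read, the rightmost coordinate for a "-" read.
    """
    if aln.strand == "+":
        return aln.start, "-"
    if aln.strand == "-":
        return aln.end, "+"
    raise ValueError(f"unknown strand: {aln.strand!r}")


print(transcript_three_prime_end(Alignment(100, 145, "+")))
print(transcript_three_prime_end(Alignment(100, 145, "-")))
```

Counting these (position, strand) pairs over all reads gives the per-position 3′-end read counts from which enriched positions can then be identified.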

Identification of read count enriched positions.
To determine the read count enriched peak positions that represent possible TSSs for dRNA-Seq or TTSs for Term-Seq, the read count enrichment at a specific position was represented by the z-score of the read count at that position, as previously described 31. The detailed calculation is as follows. [...]
[...] been generated covering four different growth phases with biological replicates (dRNA-Seq data of S. coelicolor covering more diverse culture conditions are available in the previous study performed by our group) (Fig. 1) 24. The sequencing resulted in 4.97–26.60 and 3.47–16.1 million reads per library for dRNA-Seq and Term-Seq, respectively, after removing the phiX-mapped reads (Tables 1 and 2). The retained reads were trimmed to remove adaptor sequences and to discard short and low-quality reads. After trimming, the retained reads were subjected to sequencing quality control in terms of the Phred quality score 42. Most reads showed an average Phred quality score of around 30–40, indicating that the base-calling error probabilities in the NGS runs are 10⁻³ or lower (Fig. 2a, b).
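Two calculations mentioned in this section can be illustrated with a short Python sketch: a position-wise z-score of read counts, and the conversion of a Phred quality score Q into a base-calling error probability, P = 10^(−Q/10). The z-score here is computed against the mean and population standard deviation of all counts passed in. That choice is an assumption made for illustration; the calculation used in this study follows reference 31 and may use a different window or definition. Both function names are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def position_z_scores(counts: Sequence[float]) -> List[float]:
    """Z-score of each position's read count within the given region.

    Assumes a single mean and population standard deviation over all
    supplied positions; returns zeros when every count is identical.
    """
    mu = mean(counts)
    sigma = pstdev(counts)
    if sigma == 0:
        return [0.0] * len(counts)
    return [(c - mu) / sigma for c in counts]


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


region = [0, 0, 0, 0, 100]             # one strongly enriched position
z = position_z_scores(region)
print([round(v, 2) for v in z])
print(math.isclose(phred_to_error_probability(30), 1e-3))
```

In the toy region, only the last position has a positive z-score, and a Phred score of 30 corresponds to an error probability of one in a thousand.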