Background & Summary

Streptomyces species are Gram-positive filamentous bacteria of great importance for their ability to produce a wide range of clinically or industrially important secondary metabolites1,2. During the mid-20th century, the number of available antibiotics increased rapidly; notably, more than 70% of the antibiotics of bacterial origin were discovered in Streptomyces species, underscoring their importance as the dominant source of antimicrobial compounds3. However, the discovery of novel antibiotics slowed sharply during the latter part of the 20th century as research progress with Streptomyces species declined, as reflected by the decreasing number of novel secondary metabolites discovered4. Fortunately, with the emergence of next-generation sequencing (NGS) techniques, the genome sequences of many Streptomyces species have been determined, revealing their potential to produce novel secondary metabolites5. Computational prediction revealed that a single Streptomyces species typically possesses about 20–50 secondary metabolite biosynthetic gene clusters (smBGCs), and this large number of smBGCs in Streptomyces genomes encourages researchers to revisit these organisms to cope with the threat of emerging multi-drug resistant bacteria6,7.

Despite their potential for the production of diverse secondary metabolites, most smBGCs have not been characterized in terms of their products and corresponding molecular functions, mainly because the smBGCs are silent under laboratory culture conditions8. Since most secondary metabolites are not essential for growth and are produced in response to environmental stimuli, such as osmotic pressure, nutrient limitation or inter-species competition, the smBGCs are expected to be under tight and complex regulation9,10,11. To utilize the genomic potential of Streptomyces, an understanding of the genetic regulatory mechanisms that activate smBGCs is crucial. In particular, understanding transcriptional regulatory mechanisms is important, since transcription is the first step of gene expression and diverse regulation takes place at this step12,13. Here we report NGS datasets from differential RNA-Seq (dRNA-Seq) and Term-Seq of seven important Streptomyces species, which broaden our understanding of transcriptional regulation in Streptomyces at the genome scale by providing transcript boundary information (Fig. 1)14,15,16,17.

Fig. 1

Experimental design and workflow. (a) Growth profiles and sampling time points of the seven Streptomyces species in R5− media. (b) Workflow of dRNA-Seq and Term-Seq. dRNA-Seq and Term-Seq differ in the direction of sequencing adaptor ligation. For dRNA-Seq, two libraries are prepared, with or without RNA 5′ polyphosphatase (TAP), to differentiate transcription start sites from the 5′-ends of processed transcripts.

The seven species studied are Streptomyces avermitilis, Streptomyces clavuligerus, Streptomyces coelicolor, Streptomyces griseus, Streptomyces lividans, Streptomyces tsukubaensis and Streptomyces venezuelae, all widely known for their scientific, clinical and industrial importance. S. coelicolor is the best-known model species, and S. avermitilis, S. clavuligerus, S. griseus and S. tsukubaensis are known for their ability to produce the anthelmintic agent avermectin, the β-lactamase inhibitor clavulanic acid, the antibiotic streptomycin and the immunosuppressive agent FK506, respectively5,18,19,20,21. S. lividans and S. venezuelae are mainly used in industrial applications for heterologous expression of proteins and of smBGCs from other Streptomyces, since the genetic backgrounds of these two organisms are well suited to heterologous expression22,23. Total RNA from each species was pooled across four growth phases (early-exponential, transition, late-exponential and stationary) to cover genes expressed under starvation conditions as well as genes involved in primary metabolism during active growth (Fig. 1a)24. dRNA-Seq reveals the transcription start sites (TSSs) of transcripts by differentiating the TSSs from the 5′-ends of processed transcripts. For dRNA-Seq, two libraries are constructed, one from the 5′-ends of unprocessed bacterial primary transcripts and the other from the 5′-ends of processed transcripts. By comparing the two libraries, TSSs can be differentiated from the processed 5′-ends. In contrast, Term-Seq captures the 3′-ends of transcripts, which leads to identification of the genuine transcription termination sites (TTSs) and processed 3′-ends25.

From the TSSs determined by dRNA-Seq, promoter sequences can be identified with the aid of computational motif discovery tools26. In addition, TSS information enables the 5′-untranslated region (5′-UTR) of each gene to be determined at nucleotide resolution; the 5′-UTR contains transcriptional or translational regulatory elements, such as the ribosome binding site (RBS), riboswitches and upstream open reading frames15,27,28,29. Likewise, transcriptional terminator sequences and 3′-UTRs can be determined from the 3′-end information of transcripts obtained from Term-Seq. With the aid of genome-wide transcriptome and translatome information, which can be obtained from RNA-Seq and Ribo-Seq, respectively, the transcriptional and translational effect of each regulatory element, including the promoter sequence, RBS or transcription terminator sequence, can be evaluated. Furthermore, the regulatory elements thus determined can be used to improve the production of secondary metabolites in Streptomyces through synthetic biology approaches. The transcript boundary information obtained from dRNA-Seq and Term-Seq will serve as a fundamental resource for understanding the complex regulatory mechanisms in these bacteria and for improving their industrial applications.

Methods

Strain and culture condition

S. avermitilis MA-4680, S. clavuligerus ATCC27064, S. coelicolor M145, S. griseus NBRC13350, S. lividans TK24, S. tsukubaensis NBRC108819 and S. venezuelae ATCC15439 were used in this study. The mycelium of each Streptomyces was maintained in 25% glycerol at −80 °C. Cells were cultured in 50 mL R5− media with 8 g glass beads (3 ± 0.3 mm diameter) at 30 °C. The R5− medium consists of 5.73 g TES (pH 7.2), 103 g sucrose, 10 g glucose, 5 g yeast extract, 10.12 g MgCl2∙6H2O, 0.25 g K2SO4, 0.1 g casamino acids, 0.08 mg ZnCl2, 0.4 mg FeCl3∙6H2O, 0.02 mg CuCl2∙2H2O, 0.02 mg MnCl2∙4H2O, 0.02 mg Na2B4O7∙10H2O, and 0.02 mg (NH4)6Mo7O24∙4H2O in 1 L distilled water. Cell growth was determined by measuring optical density at 600 nm with biological triplicates. The cells were sampled at four different time points according to the growth profile of each strain, which were early-exponential (E), transition (T), late-exponential (L), and stationary (S) phases. The E, T, L, and S time points were 13, 17, 19.5, 33.5 h for S. avermitilis, 26, 80, 105.5, 125 h for S. clavuligerus, 12, 16, 20, 34 h for S. coelicolor, 15.5, 28, 32.5, 48.5 h for S. griseus, 9.5, 14, 16, 20 h for S. lividans, 13, 19, 23, 31 h for S. venezuelae, and 15, 18.5, 28, 48 h for S. tsukubaensis after inoculation, respectively (Fig. 1a). For NGS library preparation, cultures for each strain were inoculated in eight flasks as biological octuplicates and cells were harvested from two flasks for each growth phase as biological duplicates.

RNA extraction

After harvesting, the cells were washed with polysome buffer (20 mM Tris-HCl pH 7.5, 140 mM NaCl, 5 mM MgCl2) and resuspended in lysis buffer (0.3 M sodium acetate pH 5.2, 10 mM EDTA, 1% Triton X-100). The cell suspension was frozen in liquid nitrogen and then physically lysed by grinding with a mortar and pestle. The cell lysate was centrifuged at 4 °C for 10 min at 16000 × g, and the supernatant was saved and stored at −80 °C until used for RNA extraction. For RNA extraction, the supernatant was mixed with an equal volume of phenol:chloroform:isoamyl alcohol (25:24:1) solution. The mixture was then centrifuged and RNA was recovered from the upper aqueous phase by ethanol precipitation.

For Term-Seq of S. coelicolor and S. griseus, RNA was extracted by lysing cells with hot phenol. The harvested cells were resuspended in Sol 1 (25 mM Tris-HCl pH 8.0, 10 mM EDTA, 50 mM glucose, 2 mg/mL lysozyme) and incubated at 30 °C for 10 min. After incubation, the cells were pelleted by centrifugation and the supernatant was discarded. The cell pellet was resuspended in AE-SDS (50 mM sodium acetate pH 5.2, 10 mM EDTA, 1% sodium dodecyl sulfate) and the suspension was mixed with an equal volume of phenol:chloroform (5:1) solution. Cells were lysed by incubating at 65 °C for 5 min and centrifuged. RNA was recovered from the upper aqueous phase by isopropanol precipitation; the genomic DNA aggregate that formed upon addition of isopropanol was removed before precipitation.

To remove any DNA contamination, the RNA samples were treated with DNase I (New England Biolabs, Ipswich, MA, USA).

dRNA-Seq library preparation

The four DNase I-treated RNA samples from the four growth phases were mixed equally to obtain one 10 μg RNA mixture, and a total of two RNA mixtures were prepared from the eight RNA samples as the biological duplicates for each strain. The rRNA in the RNA mixture was depleted using Ribo-Zero rRNA Removal Kit for Bacteria (Epicentre, Madison, WI, USA). The rRNA-depleted RNA was incubated in 1 × RNA 5′ polyphosphatase (TAP; Epicentre) reaction buffer and 1 U of SUPERase-In (Invitrogen, Carlsbad, CA, USA) at 37 °C for 1 h, with or without TAP for TAP(+) or TAP(−) libraries, respectively. The reaction was cleaned up with ethanol precipitation and 5 pmol of 5′ RNA adaptor (5′-ACACUCUUUCCCUACACGACGCUCUUCCGAUCU-3′) was ligated to the purified RNA using T4 RNA ligase (Thermo Fisher Scientific, Waltham, MA, USA) by incubating at 37 °C for 90 min in 1 × RNA ligase buffer and 0.1 mg/mL BSA. The ligation product was then purified using Agencourt AMPure XP beads (Beckman Coulter, Brea, CA, USA) according to the manufacturer’s instructions. The purified product was reverse-transcribed with SuperScript III Reverse Transcriptase (Invitrogen) according to the manufacturer’s instructions and purified using Agencourt AMPure XP beads. The purified cDNA was amplified and indexed using Phusion High-Fidelity DNA Polymerase (Thermo Fisher Scientific) for Illumina sequencing. The amplification step was monitored using a CFX96 Real-Time PCR Detection System (Bio-Rad Laboratories, Hercules, CA, USA) and stopped before the PCR reaction was fully saturated. Finally, the amplified library was purified using Agencourt AMPure XP beads.

Term-seq library preparation

Term-Seq libraries for the six species other than S. coelicolor were prepared as previously described15,17. Equal amounts of DNase I-treated RNA from the sampling time points were mixed and used as the input for Term-Seq library construction. The RNA was treated with Ribo-Zero rRNA Removal Kit for Bacteria (Epicentre) to deplete rRNA. The resulting 500–900 ng of rRNA-depleted RNA was mixed with 1 μL of 150 μM amino-blocked DNA adaptor (5′-p-NNAGATCGGAAGAGCGTCGTGT-3′), 2.5 μL of 10 × T4 RNA ligase 1 buffer, 2.5 μL of 10 mM ATP, 2 μL of DMSO, 9.5 μL of 50% PEG8000, and 2.5 μL of T4 RNA ligase 1 (New England BioLabs). The mixture was incubated at 23 °C for 2.5 h and the reaction was cleaned up using Agencourt AMPure XP beads. The adaptor-ligated RNA was then fragmented by incubating at 72 °C for 90 seconds in fragmentation buffer (Ambion, Inc, Austin, TX, USA). The fragmentation reaction was cleaned up using Agencourt AMPure XP beads. The fragmented RNA (8 μL in total) was reverse transcribed with SuperScript III Reverse Transcriptase using 1 μL of 10 μM reverse transcription primer (5′-TCTACACTCTTTCCCTACACGACGCTCTTC-3′) according to the manufacturer’s instructions. The cDNA was then purified with Agencourt AMPure XP beads. Another amino-blocked adaptor with a different sequence (5′-p-NNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC-3′) was ligated to the cDNA with an increased incubation time (8 h). The ligation product was purified using Agencourt AMPure XP beads and indexed by PCR with Phusion High-Fidelity DNA Polymerase using forward (5′-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCT-3′) and reverse (5′-CAAGCAGAAGACGGCATACGAGATNNNNNN (6 nt index) GTGACTGGAGTTCAGAC-3′) primers. The PCR reaction was monitored using a CFX96 Real-Time PCR Detection System and stopped before the PCR reaction was fully saturated. The PCR product was purified with Agencourt AMPure XP beads.

For S. coelicolor, 1 μg of total RNA, instead of rRNA-depleted RNA, was ligated with 1 μL of 150 μM amino-blocked DNA adaptor (5′-p-NNAGATCGGAAGAGCGTCGTGT-3′) as described above. After ligation, rRNA was removed using Hybridase™ Thermostable RNase H (Lucigen Corporation, Middleton, WI, USA). 13.5 μL of the purified ligation product was mixed with 1.5 μL of 10 × DNase I Reaction Buffer (New England BioLabs), 15 μL of Hybridase complement buffer (90 mM Tris-HCl pH 7.5, 200 mM KCl), 1 μL of anti-rRNA oligo mix (detailed composition is available in Figshare) and 2 μL of 50 mM MgCl2 and incubated in a thermal cycler (heat to 95 °C, cool down to 65 °C)30. 2 μL of Hybridase™ Thermostable RNase H was added to the mixture and the mixture was incubated at 65 °C for 20 minutes, 90 °C for 1 second and 65 °C for 10 minutes. The rRNA depletion reaction was cleaned up using RNA Clean & Concentrator Kits (Zymo Research, Irvine, CA, USA) and the product was processed according to the remaining procedures described above.

High-throughput sequencing and data processing

All libraries were sequenced using either the Illumina MiSeq or the Illumina HiSeq 2500 platform with either 1 × 100 bp (dRNA-Seq) or 1 × 50 bp (Term-Seq) read length, except the dRNA-Seq of S. tsukubaensis. For the dRNA-Seq of S. tsukubaensis, both TAP(+) libraries and TAP(−) libraries were sequenced using the Illumina MiSeq platform with 1 × 150 bp read length. The reads were processed using CLC Genomics Workbench. The raw reads were first mapped to the phiX sequence, which is used in the Illumina sequencing platform for quality control. The detailed mapping parameters are as follows. Mismatch cost: 2; Insertion cost: 3; Deletion cost: 3; Length fraction: 0.9; Similarity fraction: 0.9; Map randomly for non-specific matches. After mapping to the phiX sequence, unmapped reads were collected and trimmed to remove adaptor sequences, short reads and low-quality reads. The detailed parameters are as follows. Quality score limit: 0.05; Maximum number of ambiguities: 2; Remove adaptors; Discard read lengths below 15. For Term-Seq, two nucleotides at both ends were removed since the adaptors include two random nucleotides. The trimmed reads were mapped to the available reference genomes (Accession numbers: BA000030 for S. avermitilis, CP027858 and CP027859 for S. clavuligerus, NC_003888 for S. coelicolor, NC_010572 for S. griseus, CP009124 for S. lividans, CP020700 for S. tsukubaensis, CP059991 for S. venezuelae) with the same parameters as for phiX mapping, except for the non-specific match handling (non-specific matches were discarded). After mapping to the reference genomes, the directions of the mapped Term-Seq reads were inverted since the sequencing output comes in the reverse direction.
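The two Term-Seq-specific steps above (removing two nucleotides from each read end and inverting the mapped strand) can be illustrated with a minimal Python sketch. This is not the CLC Genomics Workbench procedure used in the study; the function names and the 1-based inclusive coordinate convention are assumptions made only for illustration.

```python
def trim_random_ends(seq, qual, n=2):
    """Drop n bases (and their quality characters) from both ends of a read.

    n=2 follows the text (adaptors carry two random nucleotides);
    reads shorter than 2*n collapse to empty strings.
    """
    stop = max(len(seq) - n, n)
    return seq[n:stop], qual[n:stop]


def invert_strand(strand):
    """Flip '+' to '-' and vice versa."""
    return {"+": "-", "-": "+"}[strand]


def transcript_3prime_end(start, end, mapped_strand):
    """Return (position, strand) of the RNA 3'-end implied by one mapped read.

    start/end: 1-based inclusive mapping coordinates with start <= end
    (an assumed convention). mapped_strand is the strand reported by the
    mapper; the transcript strand is its inverse, as described in the text.
    """
    rna_strand = invert_strand(mapped_strand)
    position = end if rna_strand == "+" else start
    return position, rna_strand


if __name__ == "__main__":
    print(trim_random_ends("GGACGTACCC", "IIIIIIIIII"))
    print(transcript_3prime_end(100, 140, "+"))
```

Counting these 3′-end positions per strand gives the per-position read counts that feed the enrichment analysis in the next section.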

Identification of read count enriched positions

To determine the read count enriched peak positions, which represent possible TSSs for dRNA-Seq or TTSs for Term-Seq, the read count enrichment at a specific position was represented by the z-score of the read count at that position, as previously described31. The detailed calculation is as follows.

$$Z\left(x\right)=\frac{r\left(x\right)-m\left(C\left(x\right)\right)}{\sigma (C(x))}$$

Z(x) is the modified z-score at position x and r(x) is the read count at position x. The read counts were determined for the 5′-ends of mapped reads for dRNA-Seq and the 3′-ends of mapped reads for Term-Seq. m(C(x)) and σ(C(x)) are the mean and standard deviation of the read counts at the other positions, C(x), near position x. C(x) is the set of positions y satisfying 2 < |x − y| ≤ 50. The z-score of each position was first calculated separately for each biological replicate and then averaged. Note that any position x where m(C(x)) ≤ 0.25 in any of the biological replicates was ignored. For dRNA-Seq, the z-scores were calculated only for the positions in TAP(+) libraries where the normalized read counts were more than two-fold higher than the normalized read counts at the same positions in TAP(−) libraries. Finally, positions with z-scores higher than 12 were retained, and if multiple positions lay within 3 nt of each other, the positions with lower z-scores were discarded.
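The calculation above can be sketched in Python as follows. This is an illustrative reimplementation, not the deposited scripts: the function names, the use of the population standard deviation, the truncation of C(x) at sequence ends, the skipping of positions with σ = 0, and the greedy resolution of positions within 3 nt are all assumptions not specified in the text.

```python
from statistics import mean, pstdev


def local_z(counts, x, inner=2, outer=50, min_mean=0.25):
    """Z(x) = (r(x) - m(C(x))) / sigma(C(x)), C(x) = {y : inner < |x - y| <= outer}.

    Returns None when m(C(x)) <= min_mean (position ignored, as in the text)
    or when sigma is zero (an added guard, not from the text).
    """
    lo, hi = max(0, x - outer), min(len(counts), x + outer + 1)
    background = [counts[y] for y in range(lo, hi) if abs(x - y) > inner]
    if not background:
        return None
    m, s = mean(background), pstdev(background)
    if m <= min_mean or s == 0:
        return None
    return (counts[x] - m) / s


def tap_enriched(plus, minus, fold=2.0):
    """Positions whose normalized TAP(+) count exceeds fold x the TAP(-) count."""
    return {x for x, (p, q) in enumerate(zip(plus, minus)) if p > fold * q}


def call_peaks(replicates, threshold=12.0, merge_dist=3, allowed=None):
    """Average per-replicate z-scores, keep z > threshold, and among
    positions within merge_dist nt keep only the highest-scoring one."""
    candidates = []
    for x in range(len(replicates[0])):
        if allowed is not None and x not in allowed:
            continue
        zs = [local_z(rep, x) for rep in replicates]
        if any(z is None for z in zs):
            continue  # ignored in at least one replicate
        z = mean(zs)
        if z > threshold:
            candidates.append((x, z))
    kept = []
    for x, z in sorted(candidates, key=lambda c: -c[1]):
        if all(abs(x - k) > merge_dist for k, _ in kept):
            kept.append((x, z))
    return sorted(kept)


if __name__ == "__main__":
    # Toy single-strand track: alternating background with two nearby peaks.
    track = [1 + (i % 2) for i in range(201)]
    track[100], track[102] = 500, 100
    print(call_peaks([track, list(track)]))
```

For dRNA-Seq, the set returned by `tap_enriched` would be passed as `allowed` so that only TAP(+)-enriched positions are scored; for Term-Seq, `allowed` is left as `None`. Each strand is assumed to be processed as a separate count track.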

Data Records

For dRNA-Seq, raw read FASTQ files of three species (S. avermitilis, S. clavuligerus, S. tsukubaensis) were deposited in the National Center for Biotechnology Information (NCBI) Sequence Read Archive under the accession numbers SRP158023 (ref. 32), SRP188290 (ref. 33) and SRP103795 (ref. 34), respectively. The dRNA-Seq raw read FASTQ files of S. griseus, S. lividans and S. venezuelae were deposited in the European Nucleotide Archive (ENA) under the study accession numbers PRJEB40918 (ref. 35), PRJEB31507 (ref. 36) and PRJEB36379 (ref. 37), respectively. For Term-Seq of S. avermitilis and S. lividans, raw read FASTQ files were deposited under the same accession as the dRNA-Seq. For S. clavuligerus, the Term-Seq raw read FASTQ files were also deposited in the NCBI Sequence Read Archive under the accession numbers SRX6937123 (ref. 38) and SRX6937124 (ref. 39). For S. coelicolor and S. griseus, the Term-Seq raw read FASTQ files were deposited under the same accession as the dRNA-Seq of S. griseus. The Term-Seq raw read FASTQ files of S. tsukubaensis and S. venezuelae were deposited in the European Nucleotide Archive (ENA) under the study accession number PRJEB36379 (ref. 37).

The RNA-Seq data for the six Streptomyces species, S. avermitilis, S. clavuligerus, S. coelicolor, S. lividans, S. tsukubaensis and S. venezuelae, were retrieved from the accession numbers SRP158023 (ref. 32), SRP188290 (ref. 33), SRP058830 (ref. 40), PRJEB31507 (ref. 36), SRP103795 (ref. 34) and PRJEB34219 (ref. 41), respectively. The RNA-Seq data for S. griseus were deposited under the same accession as the dRNA-Seq of S. griseus and the Term-Seq of S. coelicolor and S. griseus. The predicted TSSs and TTSs, along with the Python scripts used, were deposited in Figshare30. A brief summary of the smBGC information and the detailed composition of the anti-rRNA oligo mix used for depletion of rRNA in S. coelicolor were also uploaded to Figshare30.

Technical Validation

Evaluation of sequencing results

A total of 38 NGS libraries, comprising 24 dRNA-Seq libraries of S. avermitilis, S. clavuligerus, S. griseus, S. lividans, S. tsukubaensis and S. venezuelae, and 14 Term-Seq libraries of S. avermitilis, S. clavuligerus, S. coelicolor, S. griseus, S. lividans, S. tsukubaensis and S. venezuelae, were generated covering four different growth phases with biological replicates (dRNA-Seq data of S. coelicolor covering more diverse culture conditions are available from a previous study performed by our group) (Fig. 1)24. The sequencing resulted in 4.97–26.60 and 3.47–16.1 million reads per library for dRNA-Seq and Term-Seq, respectively, after removing the phiX-mapped reads (Tables 1 and 2). The retained reads were trimmed to remove adaptor sequences and to discard short and low-quality reads. After trimming, the retained reads were subjected to sequencing quality control in terms of the Phred quality score42. Most reads showed an average Phred quality score of around 30–40, indicating that the base-calling error probabilities in the NGS runs are lower than 10⁻³ (Fig. 2a, b). After evaluating the quality of the trimmed reads, the reads were mapped to the reference, resulting in 59.39% mapped reads for dRNA-Seq and 75.00% mapped reads for Term-Seq on average.
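The link between a Phred score Q and the base-calling error probability P is the standard relation P = 10^(−Q/10), so Q = 30 corresponds to P = 10⁻³. A minimal Python sketch of this conversion and of a per-read average score is given below; the function names are illustrative, and the Phred+33 ASCII offset is an assumption about the FASTQ encoding rather than something stated in the text.

```python
def phred_to_error_probability(q):
    """Standard Phred relation: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def mean_phred(quality_string, offset=33):
    """Average Phred score of one read from its FASTQ quality line.

    offset=33 assumes Phred+33 encoding (an assumption, not from the text).
    """
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    q = mean_phred("IIII????")
    print(q, phred_to_error_probability(q))
```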

Table 1 Sequencing statistics of dRNA-Seq.
Table 2 Sequencing statistics of Term-Seq.
Fig. 2

Technical Validation of dRNA-Seq and Term-Seq. (a) Average Phred score of dRNA-Seq reads after trimming. (b) Average Phred score of Term-Seq reads after trimming.

Assessment of sequencing datasets

For technical validation of the dRNA-Seq and Term-Seq results, we evaluated how well the read count enriched positions in dRNA-Seq and Term-Seq correspond to increases and decreases in the RNA-Seq profiles, respectively. The dRNA-Seq and Term-Seq read count enriched positions, which are highly likely to represent the in vivo TSSs and TTSs, respectively, were determined for each Streptomyces species (refer to Methods for detailed information about the determination of read count enriched positions)31. On average, about 525 potential TSSs and 1285 potential TTSs were determined for each species, and about 7% of the predicted TSSs and 8% of the predicted TTSs were found in the smBGC regions (the smBGCs for each Streptomyces species were predicted using antiSMASH)43. The determined TSS and TTS information and the smBGC information are available at Figshare30. Then, the RNA-Seq read density near the potential TSSs and TTSs was calculated (the RNA-Seq data were obtained from the same culture conditions)32,33,34,36,41. Across the TSSs predicted from dRNA-Seq, the RNA-Seq read density increased sharply for all six Streptomyces species in the four growth phases, indicating that the TSSs were successfully captured by dRNA-Seq (Fig. 3a). Likewise, the RNA-Seq read density decreased sharply across the TTSs predicted from Term-Seq, indicating that the TTSs were successfully captured by Term-Seq for all seven Streptomyces species (Fig. 3b).
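The read density comparison described above amounts to averaging strand-specific RNA-Seq coverage in a window centred on each predicted site, oriented in the direction of transcription. A minimal Python sketch under stated assumptions follows: the function name, the 0-based per-base coverage lists keyed by strand, and the ±50 nt default window are illustrative choices, not the parameters used for Fig. 3.

```python
def meta_profile(coverage, sites, flank=50):
    """Mean coverage at offsets -flank..+flank around each site.

    coverage: dict mapping '+'/'-' to a list of per-base read depth
              (0-based positions; an assumed input layout).
    sites:    iterable of (position, strand) for predicted TSSs or TTSs.
    Offsets are measured in the direction of transcription, so negative
    offsets are upstream of the site on either strand.
    """
    sums = [0.0] * (2 * flank + 1)
    used = 0
    for pos, strand in sites:
        track = coverage[strand]
        step = 1 if strand == "+" else -1
        first, last = pos - step * flank, pos + step * flank
        if min(first, last) < 0 or max(first, last) >= len(track):
            continue  # skip sites whose window runs off the sequence
        for i, offset in enumerate(range(-flank, flank + 1)):
            sums[i] += track[pos + step * offset]
        used += 1
    if used == 0:
        raise ValueError("no site has a complete window")
    return [total / used for total in sums]


if __name__ == "__main__":
    cov = {"+": [0] * 50 + [10] * 150, "-": [10] * 151 + [0] * 49}
    print(meta_profile(cov, [(50, "+"), (150, "-")], flank=5))
```

With this orientation, a genuine TSS shows low density at negative offsets and high density at positive offsets, and a genuine TTS shows the reverse.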

Fig. 3

RNA-Seq read density across the predicted TSSs and TTSs. (a) RNA-Seq read density across the TSSs predicted from dRNA-Seq. (b) RNA-Seq read density across the TTSs predicted from Term-Seq. E, T, L and S represent the RNA-Seq read density of early-exponential, transition, late-exponential and stationary growth phases, respectively.