The Transcriptional Landscape of the Photosynthetic Model Cyanobacterium Synechocystis sp. PCC6803

Cyanobacteria exhibit a great capacity to adapt to different environmental conditions through changes in gene expression. Although this plasticity has been extensively studied in the model cyanobacterium Synechocystis sp. PCC 6803, a detailed analysis of the coordinated transcriptional adaption across varying conditions is lacking. Here, we report a meta-analysis of 756 individual microarray measurements conducted in 37 independent studies-the most comprehensive study of the Synechocystis transcriptome to date. Using stringent statistical evaluation, we characterized the coordinated adaptation of Synechocystis’ gene expression on systems level. Evaluation of the data revealed that the photosynthetic apparatus is subjected to greater changes in expression than other cellular components. Nevertheless, network analyses indicated a significant degree of transcriptional coordination of photosynthesis and various metabolic processes, and revealed the tight co-regulation of components of photosystems I, II and phycobilisomes. Detailed inspection of the integrated data led to the discovery a variety of regulatory patterns and novel putative photosynthetic genes. Intriguingly, global clustering analyses suggested contrasting transcriptional response of metabolic and regulatory genes stress to conditions. The integrated Synechocystis transcriptome can be accessed and interactively analyzed via the CyanoEXpress website (http://cyanoexpress.sysbiolab.eu).


Description of microarray platforms
Publicly available microarray expression data for Synechocystis that have been integrated in our study, were generated using seven different platforms, of which three platforms (CyanoCHIP, Postier, and Tu) contain complementary sequences to open reading frames (ORFs) produced by polymerase chain reaction (PCR), while the rest of the platforms use synthetically generated oligomers as probes. Here, we provide short descriptions of these platforms based on their original descriptions by the authors.

Dickson:
The most recent microarray platform developed for Synechocystis is the 60-oligomer array described by Dickson et al. 9 and synthesised by NimbleGen containing 13 probes per target (385,000 probes) directed against 3567 protein-coding genes.

Quality control of individual datasets
To calculate differential expression for the various datasets in a standardized manner, all datasets were loaded individually as marrayRaw objects in R/Bioconductor. Data upload and conversion were carried out using R scripts customized for each platform and repository format. Subsequently, datasets were inspected for artefacts using various tools.
Firstly, we assessed the intensity distribution of microarrays within each dataset. An example of this process is shown in Figure S2, which was produced using the data generated by Rowland et al. 10 studying thermal acclimation (the raw microarray data can be accessed through GEO accession: GSE21133.) In this study, the researchers used the two-color CyanoCHIP platform containing probes against 3079 Synechocystis 6803 genes. Their time series included four replicates for each of the five time points, with the control sample (t = 0) labelled with Cy3 and the test samples (t = 30 / 60/ 120 / 240 / 480 mins) with Cy5.
To compare the distribution of the log 2transformed signal ratios (M) for each microarray, box-plots were generated for raw and normalized data. Box-plots facilitate the visualization of data heterogeneity and serve to corroborate that the applied normalization method (i.e., OIN) corrected the observed intensity biases. As seen in Figure S2A, the distribution of probe intensities differs noticeably between replicates before normalization. Such differences are indicative of inter-sample variation, which can be caused by various experimental factors. In this case, normalization using the OIN algorithm corrected these observed scaling discrepancies ( Figure S2B). Alternatively, the spread of the signal intensities was visualized using density plots. For two-color arrays, intensities for each channel were plotted independently. In this case, the x-axis represents the intensities of the probes, and the y-axis shows density level ( Figure S3). These plots can be indicative of artefacts and interferences linked to high background signals. As observed previously with the boxplots, the normalization efficiently reduced the observed differences in such distributions ( Figure  S3B). Figure S2. Raw (A) and normalized (B) probe intensity values plotted against their density on the arrays. Green curves represent the control probes (Cy3-labeled), while red curves represent the test samples labelled with Cy5. RG = red green.

A B
For datasets generated by two-color arrays, we also used M-A plots to investigate dye bias and to determine whether normalization corrected it ( Figure S4).

Figure S3
: Plots of logged fold changes (LFC) with respect to average logged spot intensity of both channels (Cy3, Cy5) for raw data (left panel) and normalized data (right panel). Raw data show a clear bias towards Cy5 intensities (indicated by M values larger than zero) at higher intensities, whereas this dye-bias is corrected in the normalized data.
Finally, scatter-plots of the logged intensities for each channel extracted from the different replicates served to visualize the experimental reproducibility ( Figure S5; top panel). To examine the result of

A B
normalization on replicate bases, a plot was generated using the normalized M values ( Figure S5; lower panel).

Detection of outlier experiments
In microarray analyses, we generally assume that only a small fraction of genes is differentially expressed. Thus, we expect a similar overall distribution of different expression in the collected datasets, despite being generated by various laboratories using different platforms and protocols. To check whether this is indeed the case, and whether some experiments or samples are distinct from other experiments or samples, we applied hierarchical clustering and principal component analysis (PCA) to the expression changes of the integrated data set. Hierarchical clustering of microarray data indicated that the samples from an unpublished data set (GSE5391) had quite distinct expression, forming a clearly separated cluster ( Figure S6). This was also reflected in the PCA results, where these samples showed considerable separation from the remaining samples ( Figure S7). Although such analyses do not per se demonstrate that samples from GSE5391 are less reliable, they nevertheless show that these samples are outliers. To avoid effects of a small number of outliers, we decided to remove this dataset from our analyses. In addition, it has not been described in any publication. Two other non-published datasets GSE4604, and GSE4613, for which sample descriptions did not match with the clustering of their replicates also were removed.

Other supplemental figures
. Figure S7: Heat map showing clustering of the contrast obtained using wild type or mutated strains analyzed in this study. Experiments performed using mutants lacking genes encoding Ser/Thr protein kinases 11 , or genes encoding histidine kinases, or its cognate response regulator 12,13 are labeled by boxes numbered from 1 to 3, respectively.    Table S1: Summary of relevant information on microarray data used in this study. Excel file listing the transcriptome datasets and measurements integrated and examined in the meta-analysis.

Supplemental tables
[Supplementary Table S1 online]  Table S3: Excel file including the results of the functional enrichment analysis for the clusters identified in Figure   2. The colors of the column headings correspond with the different clusters shown in Figure 2.
[Supplementary Table S3 online]  terminal oxidases (H10), but are not of the same functional sub-category. The thresholds for inclusion were set based on the median correlation of genes within these categories. For H2 and H3, whose members showed relatively low internal correlation, the threshold was set equal the 0.9 quantile, while a more relaxed threshold of 0.8 quantile was set for the other categories that showed higher internal correlation.