Multi-omics integration accurately predicts cellular state in unexplored conditions for Escherichia coli

A significant obstacle in training predictive cell models is the lack of integrated data sources. We develop semi-supervised normalization pipelines and perform experimental characterization (growth, transcriptional, proteome) to create Ecomics, a consistent, quality-controlled multi-omics compendium for Escherichia coli with cohesive meta-data information. We then use this resource to train a multi-scale model that integrates four omics layers to predict genome-wide concentrations and growth dynamics. The genetic and environmental ontology reconstructed from the omics data is substantially different and complementary to the genetic and chemical ontologies. The integration of different layers confers an incremental increase in the prediction performance, as does the information about the known gene regulatory and protein-protein interactions. The predictive performance of the model ranges from 0.54 to 0.87 for the various omics layers, which far exceeds various baselines. This work provides an integrative framework of omics-driven predictive modelling that is broadly applicable to guide biological discovery.

Supplementary Figure 3: The genome-scale model is composed of multiple layers with interdependencies. The input layer (612 features) is grouped into four categories: strain genotype (154), medium composition (120), stress (52), and genetic perturbation (GP, 286). To predict expression and growth dynamics in a novel condition, a recurrent neural network first predicts the genome-scale transcriptional expression (4096 transcripts) that it is then used together with information from gene ontology, co-expressed protein network (CPN), protein-protein interaction network (PPI), and other pathways to predict protein expression (1001 proteins). Concentrations of 356 metabolites are predicted from the transcriptome and proteome layers. Fluxes of 2382 reactions are predicted using Flux Balance Analysis (FBA) with constraints from the input, transcriptome and proteome layer. The growth rate is then inferred by consensus of predictions made from all five layers (including the input layer). KO; knock-out, OE; over-expression, SF; sigma factor, TF; transcription factor    MC4100  AG102  GM5555  GLBRCE1  MG1655  LJ110  DH1  D31  PC05  BW25113  P4X  DH5alpha  JM109  T75  AB1157  T177  GM3819  T175  NCM3722  GM5556  W3110  CG2   0 LB+Gly (

A B
AB1157 and its derivatives JO2057 and its derivatives S3974 derivatives N3433 derivatives MG1655 and its derivatives         We have performed comparative analysis of prediction performance between two different training datasets; One is based on fold-change and another one is based on absolute normalization. Fold-change data was collected from the compendium COLOMBOS v3 whereas absolute-normalized data was from the compendium Ecomics (this work). We extracted the overlapping conditions between two sources. Absolute-normalized data utilizes expression levels of reference conditions (conditions used for base of perturbation) and perturbation conditions separately unlike the fold-change data that collects only differences in expression levels between perturbation condition and corresponding reference condition. Expression data for reference conditions used for each of the perturbations are added in absolute-scale dataset unlike fold-change dataset. Then the cross-validation experiments of Recurrent Neural Network (RNN) we designed (Supplementary Text; Section 3.

3)
were performed based on each of two different sources separately. As for fold-change dataset, by definition of fold-change, it does not have any wild-type expression levels. Thus, unlike the absolute level-based RNN that uses average of expression levels in wild-type conditions (MG1655 with no stresses and no genetic perturbations) as background values of internal nodes, background values of RNN trained on fold-change data were computed by taking average of fold-changes across all perturbations.     The model first uses input features only to predict growth rate and gradually adds the transcriptome, proteome, metabolome, and fluxome layers. PCC was measured between predicted growth rates and measured growth rates from leave-one-condition-out cross validation. 120 conditions were tested for validating prediction of the integrated model. Among them, 101 were cases with wild-type conditions in training set (denoted as WT presence), 60 were novel wild-type conditions (in green).    Prediction performance of growth rate by three methods of i) using condition characteristics, ii) using expression data, and iii) using both. PCC and SRC were measured between predicted growth rates and measured growth rates from leave-one-condition-out cross validation.  Table 1: Genes selected for knock-out experiments. Phase 0 includes 9 genes that were identified as likely candidates in a previous work [1], which we now transcriptionally profiled and added in the compendium. Phase 1 represents the 16 genes that were adaptively selected in this work to maximize GO coverage and their profiles were subsequently added to the compendium.

Overview
The Ecomics compendium is divided into two major parts, the multi-omics genome-scale profiles and the experimental meta-data (Section 3.1.2). In addition to the Ecomics compendium, we have curated genome-scale interaction data for signal transduction, transcriptional, protein and metabolic layers (Section 3.1.3). The genome-scale profiles are classified in five groups, based on their omics layer: •  [7,23,[26][27][28] • phenomics: we experimentally measured the growth characteristics (lag, slope, final OD) of 2187 profiles in our lab. In cases where the growth rate is also reported in the published work, that value is also added and annotated accordingly (767 profiles).

Meta-data collection
Each profile in Ecomics is described by several attributes that characterize the environmental and genetic background (Supplementary Data 6). These attributes can be grouped in the following categories.
• Genotype: We matched strain information from Yale's Coli Genetic Stock Center (CGSC) [78] and Ecoliwiki [79], as well as information that was present in each publication to assemble the genetic background of each strain used. Mutations at different loci are treated as different alterations for any given gene. • Medium: Similarly, we used information from the respective papers to reconstruct the chemical composition of all media used. Whenever a standard medium was used but no further information is given, we referred to EcoCyc [80] to estimate its formulation. In rare occasions, we had to generalize on media compositions (for example for media with casamino acids we assume that all 20 amino acids are present). • Stress: Stress is defined as any experimental condition that is known to (or is expected to) induce stress-response mechanisms. There is not a clear ontology of stresses, with a notable effort by the Plant Stress Ontology (PSO) project that is currently under development [90]. Here, we introduce a organism-centric approach to draft an ontology in Section 3.2.4. • Time: Elapsed time after the strain was exposed to a perturbation and the collection of samples for profiling. • Genetic perturbation: The type of genetic perturbation, if one is present. Values include Wild-type (WT), knock-out (KO), over-expression (OE), mutation (defined as changing one or more nucleotides in the CDS), insertion (an insertion of one or more genes), large-scale deletion (that affects more than one gene). • Number of perturbed genes: Genes involved in perturbations.
• Temperature: Temperature in which the experiment was conducted. In cases where temperature is not reported, 37 • C is assumed. • pH level: The pH level in which the experiment was conducted. In cases where pH level is not reported, a pH 7.0 is assumed. • Compounds: Any compounds that were added beyond what is expected to be in the respective media. Usually includes chemicals that induce stress (e.g. mutagens). • Compound concentrations: Defined in g/L, Moles or percentages.

Molecular interaction data
Transcription Factor network Transcription Factor (TF) binding information was provided by RegulonDB (v8.6) [88]. All interactions with strong or weak experimental evidence (defined as having at least one independent sources of validation) were used, which results in 3,489 interactions of 179 TFs and 1,478 genes. In addition, we added recent datasets to this set [91], resulting to 3,809 TF-gene bindings of 181 TFs and 1,538 genes (Supplementary Data 5).
Sigma factor network Sigma factors are transcription initiation factors that enable a specific binding of RNA polymerase to gene promoters. Cell utilizes different sigma factors that bind to a gene promoter, which in turn modulates transcriptional activity of the target gene in order to react to different environmental signals [101][102][103]. We built a global map between 7 sigma factors and target genes by compiling known information from multiple sources [80,88,89]. The consolidated network is composed of 8,381 interactions between seven sigma factors and 3948 genes (Supplementary Table 1

, Supplementary Data 5).
Signal transduction network For the signal transduction network, we used our previous work reported in [1], which consists of 191 interactions for 105 TFs and 129 metabolites. [84]. We compiled the interaction data from four distinctive sources. We collected data from (a) affinity purification approaches followed by mass spectrometry (AP/MS), a total 11,496 interactions among 1,631 proteins [85,86], (b) binary-Y2H experiments, a total of 3,936 interactions among 2,045 proteins [87], (c) the STRING public database, a total of 817,650 interactions among 4,147 proteins [83], (d) the EcoCyc database, a total of 701 interactions [80]. Ultimately, this led to a reconstructed E. coli PPI network of 823,098 interactions for 4,250 proteins, which is the most comprehensive PPI network for the organism until now [104] (Supplementary Data 5).

Strain and Medium representation
Strains are characterized with 154 genotypic features that correspond to DNA variations (i.e. deletions, insertions, polymorphisms) with respect to the ancestral K-12 strain (Supplementary Fig. 1). Similarly, the media are characterized by 120 chemical features (Supplementary Fig. 2). There are 65 strains and 112 media in Ecomics.

Overview
Ecomics includes information for 65 strains, 112 media, 52 stressors and 286 genetic perturbations. Its 4,346 genome-wide profiles span the transcriptional (3,579 profiles), protein (71 profiles) and metabolic (696 profiles) layers. A "condition" is defined a combination of the genetic background (strain, genetic perturbations) and the environmental setting (medium, stress). The compendium retains 649 different conditions (596, 36 and 53 conditions with transcriptome, proteome and metabolome profiles, respectively, with 68.2% of them with 3 or more profiles (replicates from one or more studies). 65.6% of all profiles in the compendium include either a stressor or a genetic perturbation and among them 517 profiles have them both (Supplementary Fig. 4A). As in Supplementary Fig. 4B, the condition with most profiles (187 profiles; 4.2% of Ecomics) is the MG1655 strain in MOPS medium with carbon-shift from glucose to lactose.

Strain
The strains with the most profiles in Ecomics is MG1655 (64.4%), BW25113 (9.6%) and W3110 (7.8%), with profile distribution depicted in Supplementary Fig. 5). Only the MG1655 and BW25113 strains had profiles in all four layers. Genotypic map of the strains used (Supplementary Fig. 1) are characterized with 154 genotypic features show sparsity in shared features, where the number of genotypes for each strain ranges from 1 to 11. Although there are no genotypic features that are prevalent in all strains, many of them are locally shared between 3 or 5 strains (e.g. knockouts in the crl genes are shared in 5 out of the 65 strains).
Expression-based strain ontology. We used average linkage hierarchical clustering (UPGMA) [105] to construct both genotype-based and expression-based ontologies for Ecomics strains. Profiles of MG1655 strains in LB medium with no stress or genetic perturbations were used. Expression profiles in a given condition were averaged for each gene to have a representative expression profile for the condition. Then the expression level of each gene was scaled between 0 and 1 by min-max normalization [106], which is also discussed in Section 3.3.3. The distance between a pair of vectors, both in the case of expression levels and phenotype features was measured by PCC. Supplementary Fig.  9A depicts the ontology that is generated based on the phenotypic characteristics. Supplementary  Fig. 9B illustrates the ontology that results from the integration of these two sources of information by using linear superposition with equal weights. Interestingly, the cophentic correlation of the genetic and expression-based ontology is 0.21. The same distance between the integrated ontology and either of the two sources (genetic, expression-based) is 0.52. Thus, relying solely on genetic characteristics to infer genome-scale expression and phenotypic traits can clearly be misleading.

Correlation between expression profiles and genetic features.
To investigate the correlation of shared genotypic characteristics to the resulting expression profiles, we identified the conditions where two or more strains had expression profiles and calculated the strain pairwise correlation, thus creating a expression-based covariance matrix, where row/columns are the 65 strains. We then constructed a genotype-based covariance matrix and compare the two based on Pearson's correlation between two vectorized lists of matrices (0.58).

Medium
In the 112 media in Ecomics, glucose and glycerol are the most popular carbon source. M9 [107], LB [108], and MOPS [109] are the three most abundant media (Supplementary Fig. 10. the number of chemicals used for each medium ranges from 4 to 45 (15.3±9.0) (Supplementary Fig. 2. Similarly to the strain ontology, we created media ontologies based on the genome-scale expression levels (Supplementary Fig. 10B). We applied the same method as described in strain ontology analysis (Section 3.2.2). All profiles of the MG1655 strain (largest number of profiles) with no stress and genetic perturbations were used for this test. Fig. 7A) and some of their combinations, divided in 16 groups (Supplementary Fig. 7B). The most abundant stressors were nutrient-shift (359 profiles), followed by oxygen starvation (280 profiles). Among the 5 different types of nutrient-shift, glucose to lactose shift was the most dominant diauxie, followed by glucose to acetate shift. There were 15 different antibiotics in the compendium, with Norfloxacin having the most profiles (200 profiles). We created an expression-based stress ontology which provides an insight on what conditions result to similar genome-scale expression in the bacteria that are exposed to them (Supplementary Fig. 11).

Genetic Perturbations
The 286 genetic perturbations (e.g. 183 knockouts, 63.9%; 91 over-expressions 31.8%) cover 27% of KEGG pathways, 44% of GO Biological Processes (BP), 7% of GO Molecular Functions (MF). A pathway, process or molecular function is covered if the compendium has one or more profiles for at least one of its genes.

Targeted experimentation to increase GO coverage.
To increase the GO coverage of Ecomics, we investigated which new conditions would lead to the most informative transcriptional profiles. We first transcriptionally profiled and included in Ecomics 9 KO experiments that we had identified before [1] as being highly informative for both GO coverage and model performance (Supplementary Table 1). We assumed that a GO term is represented in the compendium if the compendium has profiles where one or more genes that include that GO term have been perturbed. With that assumption, Supplementary Fig. 8A depicts the increase in coverage after the inclusion of transcriptional profiles that correspond to novel gene perturbations in the compendium. We identified the gene rank iteratively under the assumption that all genes of higher rank have been perturbed. Based on the resulting ranked list, we performed transcriptional profiling for the top 16 genetic perturbations (in triplicate; 48 profiles total, Section 3.4.1), which led to an increase in coverage by 19.8% for KEGG, 24.3% for BP and 63.9% for MF (Supplementary Fig. 8B, Supplementary Data  1). The list of 16 genes for knock-out experiments are in Supplementary Table 1 (phase 1). We also performed differential expression analysis between pre-processed gene expression profiles for each KO gene of BW25113 strain using DESeq2 [110] to find novel genes that are affected by genes we perturbed.

Variability analysis.
We identified secD, galE, mcrB, phoR, ∆(ara-leu)7697, ∆(codB-lacI)3 mutations as the key factors in strain-dependent variability (Supplementary Fig. 5). For galE, it has been known that when a galE mutant is grown in the presence of D-galactose, growth is arrested due to low availability of CTP and UTP, which results in reduced RNA synthesis [111]. Moreover, PhoR has been known to indirectly sense and to respond to variations in the level of extracellular inorganic phosphate, which is one of essential macronutrients for biological growth [112]. Hence, dysfunction of the phosphate regulatory mechanism by mutating phoR is expected to impact global changes in the cell, which is on a par with the medium variability analysis results, where phosphate-containing compounds are a key correlate to expression variability.
Focusing on media (Supplementary Fig. 6B), our analysis revealed that Na2H(PO4), H3BO4, KH(PO4), Fe(III)citrate, yeast extract, MgSO4, NH4Cl, sodium citrate, H3BO3, biotin and FeSO4 are key factors of expression variability. Na2H(PO4) and KH(PO4) are the sole source of phosphate in most defined media and any perturbation in the phosphate availability will evoke a global response [113,114]. MgSO4 and NH4Cl are sources of magnesium and nitrogen that are critical for key enzymes [113] and synthesis of amino acids [114]. Yeast extract is an undefined media so the observed high variance in expression is expected.
Similarly, transcriptional responses within replicates exhibited stress-dependent variability ( Supplementary  Fig. 12). Expression variability is measured as the coefficient of variation for genome-scale gene expression, normalized for the same strain and medium (MG1655 and M9 medium, respectively). We found that environmental conditions that trigger targeted specific biological processes have lower variance in expression within replicates than environments that target multiple/global biological processes. We confirmed that the observed expression variability is not an artefact of laboratory biases or sample size: the Pearson Correlation Coefficient (PCC) between the expression variability and the number experimental laboratories or publications is -0.09; similarly, the PCC between the expression variability and the number of replicates per strain is -0.11. Antibiotics such as norfloxacin (Nx), ampicillin (Am), and Polymyxin B that act on specific biological processes including inhibition of cell division and altering cell wall, lead to profiles with lower variability among replicates, compared to heat and acidic stress that are associated with a global transcriptional response (0.18±0.07 vs. 0.34±0.15 coefficient of variation, respectively; Supplementary Fig. 10). In addition, acidic stress led to higher variance than heat stress (0.46±0.11 vs. 0.22±0.08 coefficient of variation, respectively). This result argues that E. coli's heat response program is more robust than acid homeostasis. Indeed, E. coli has a well established system to up-regulate heat responsive genes through σ32, and to protect protein aggregation and misfolding by expressing several chaperons [115]. In contrast, to cope with acidic stress, E. coli has three response systems: a σ38-dependent, a glutamate-dependent and an arginine-dependent response [116]. However, none of these systems induces the expression of chaperones as it is evident from our transcriptional profiling analysis and reported in literature [117], which is expected to lead in higher protein instability. Similarly, the high expression of chaperon proteins in hypoxic conditions [118] may also play a role on the low expression variance within hypoxic replicates. The observed variance in our profiles also holds for combinations of stresses. For instance, in the case of simultaneous treatment with acidic and hypoxic stress, the transcriptional variability is the mean of that of the respective stresses. High variability in recombinant protein stress is expected, as different recombinant proteins and expression systems impose different metabolic burden to the cells [119].

Overview
We built a genome-scale model that aims to predict an output vector y that contains the expression levels of mRNAs, proteins, metabolites, metabolic fluxes and growth dynamics, for a given experimental condition. The experimental condition is represented as an input vector x with features that contain information about the genetic (strain, genetic perturbation) and environmental (medium, stress) background. The model was designed to be modular, in order to allow for a better representation of biological organization, an easier manipulation of individual modules and to avoid dimensionality issues. We evaluated various statistical techniques in their ability to capture biological structure and make accurate predictions. The final model includes Recurrent Neural Networks (RNN; [120]) with regularization of sigmoid activation functions [121] for the transcriptome layer and LASSO constraint regression [122] for the other layers (using R package glmnet [123]). The model is trained on the Ecomics compendium and all the network resources that are summarized in Section 3.1.1. Our analysis is focused on profiles for samples in the exponential phase, as the amount of data is sufficient for a rigorous assessment of the model.
The input features (x) are directly used into transcriptome layer to predict genome-wide expression levels of genes. Then the rest of layers are predicted from the transcriptome layer. The motivation of this approach is that the profiles from a wide spectrum of environmental conditions are mostly enriched in transcriptome layer in the compendium. The responses in the proteome are predicted from the transcriptome layer. The metabolome layers are predicted from the transcriptome layer as well as the proteome layer. Once complete, the genome-scale fluxes are predicted based on the constraintbased model by imposing the constraints of fluxes from the three layers of input, transcriptome, and proteome. Finally, growth rate is predicted by consensus of predictions made from all five layers (input, transcriptome, proteome, metabolome, fluxome).

Sample representation
A sample i is defined as the pair of multi-dimensional vectors (x (i) , y (i) ), where x can be thought as the input or the independent variables of the model and y can be thought as the output, or the dependent (response) variables of the model. The superscript denotes the index of the sample and here a sample is a profile from the Ecomics database. The 612-by-1 vector x contains features that encode information about the environmental condition and genetic background, hence we will refer to them as the conditions of the sample. The y vector consists of two parts: a 7,835-by-1 vector of variables that encode for the organism's biomolecular abundances and metabolic fluxes and a vector of variables that correspond to specific phenotypic traits. In our formulation, we assume that we have only one phenotype that we want to predict, namely the growth rate. As such, the size of the output vector y is 7,836-by-1.
More specifically, the input vector x consists of four types of features: • Strain information: the strain information is represented as 154 genotypic features {x i } 154 i=1 where each feature x i is a discrete random variable and it denotes a DNA-level status of a gene (e.g. x 2 = {0, 1} where x 2 can represent the rfb gene, the values 0 and 1 denote wild-type and rfb knockout, respectively).
• Medium: the medium composition is expressed as 120 chemical features {x i } 274 i=155 where each feature x i is a ordinary random variable and it denotes the availability of either a defined chemical feature (e.g. x 122 = {0, 1} represents K 2 HPO 4 , 0 and 1 denotes absence and presence of the chemical, respectively) or an undefined chemical feature (e.g. yeast extract).
• Environmental abiotic stress: the environmental cues are expressed as a 52-by-1 vector where each is a binary random variable and represents a stress (e.g. x 276 = {0, 1} represents heatshock stress, 0 and 1 denotes absence and presence of the stress, respectively). • Genetic perturbations: Genetic perturbations are a 286-by-1 vector {x i } 612 i=327 where each x i is a binary variable denoting the presence or absence of a genetic perturbation (a perturbation includes knock-out, mutation, over-expression, and insertion). For example x 342 = {0, 1} corresponds to the rpoS gene having an original copy (0) of the gene or being knocked-out (1).
The output vector y consists of five types of features: • Transcriptome: represented with a 4,096 element vector {y i } 4096 i=1 where each feature y i is a continuous variable and it denotes the absolute mRNA expression level of a gene. i=5454 where each feature y i is a continuous variable and it denotes a relative flux of a reaction.
• Cellular phenotype: The growth rate is represented by a scalar in the last position of the y-vector, (y 7836 ), as a continuous variable with (h −1 ), which is a continuous random variable.

Transcriptome module
For predicting transcriptomic response from input x, we employ a Recurrent Neural Network (RNN) framework. A RNN is a special type of an Artificial Neural Network that allows feedback and hence it is capable to encode for memory or past internal states [124][125][126][127]. This enables it to model biologically relevant dynamic behaviors such as feedback-loop of molecular interactions in a cell.
To simulate interactions of extracellular factors and biomolecules in a cell (e.g signal transduction pathways) as well as interactions between intracellular biomolecules (e.g. transcriptional regulation), the architecture of RNN directly connects input nodes into output nodes and allows connections between output nodes. In other words, we avoid to use hidden nodes in the RNN architecture. The representation allows better interpretation of the constructed model as well as integration of known biological evidences, which potentially helps to avoid over-fitting. The schematic diagram of RNN representation that predicts transcriptomic response from input features is depicted in Supplementary  Fig. 13.

Data preparation
We focused our analysis for profiles in the exponential phase, where adequate data exist for model evaluation. We first extracted the transcriptome profiles that correspond to samples in the exponential phase from the compendium (2610 profiles). We then used min-max standardization on the absolute scale values for each gene, to be able to compare and make the training process easier to initialize and converge [106]. The standardization process was applied for each entry i of the first 4,096 entries in the output variable y, which correspond to the transcriptome layer. The variable values were then updated after standardization: Parameters and activation function Connection weights. Each connection c i,j between a node i and a node j has a weight w i,j associated with it (Supplementary Fig. 13). We represent the weights for connections between input nodes x (612 nodes) and output nodes y (4096 nodes) as w x and the weights for connection among output nodes y as w y , so that the set W = {w x , w y } includes all weights of the RNN. Weights are initialized based on a normal distribution N (0, 0.05 2 ), which provides a balanced weight range and lead to fast convergence for the Ecomics dataset.
Activation function. The sigmoid function (h(x) = 1 1+e −x ) was used as the activation function for each node of the RNN, as it sufficiently approximates the Hill function dynamics of regulatory interactions and restricts the output between 0 and 1. In addition, we performed n iterations (updates) on calculating the output, where n is the memory depth and capture cycles and feedback loops that might be present in the RNN. The initial state of the output nodes y (0) (i.e. gene expression value) was set to the average gene expression levels in the training data under WT conditions (MG1655 with LB or M9 medium with no stresses and no genetic perturbations). As such, given input x and weights {w x , w y }, the output vector y is iteratively computed by: for 1 ≤ i ≤ n, where y (i) is the state of the output vector y at iteration time i.

Training weights
Objective. During the training phase, the goal is to find the RNN connection weights so that the RNN has the best representation of the training data, i.e. its output y (n) approximates well the target outputŷ (gene expression for all genes in the training dataset D). This objective can be represented as minimization of the residual sum of squares, or cost function C o (w) between y (n) andŷ for D: where |D| is the number of conditions in the training dataset (262 conditions) and M is the number of genes in a single profile/condition (4,096 genes).

Back-propagation through time (BPTT).
We use BPTT [128] to train the weights W in our RNN model. In BPTT, the RNN is unfolded to interconnected feed-forward neural networks (FFNN), with the node values at each FFNN representing the state of the output at a given time (iteration), as shown in Supplementary Fig. 13. Once the RNN is unfolded to a chain of FFNNs, standard back-propagation is used to train the weights of the network [129]. Stochastic gradient descent [130] is used to update the weights through this procedure, with the following update rule: With α being the update or learning rate (set at 0.01). Note that the update is performed for several iterations, or epochs, which are determined empirically (see next section).

Parameter optimization
We evaluated the effect of various hyper-parameter in the performance of RNN and selected the optimal values for the number of epochs, memory depth, type of relations between features, dimensionality reduction methods and the regularization techniques. More specifically:

Number of epochs.
An epoch is defined as one training iteration over all samples in the training dataset. We performed a Leave-one-condition-out (LOCO) cross-validation (Section 3.3.3) on Ecomics for the transcriptional profiles of 178 transcription factors in the exponential phase. We opted on using only this dataset instead of all genes in Ecomics due to computational constraints. Since TFs are the most representative genes for the enacted regulatory programs in any given conditions, this set encapsulates well the genome-wide changes for each condition. From the 493 conditions with transcriptome profiles available (exponential phase) in Ecomics, 262 of them can be used as part of a LOCO crossvalidation, as when excluded from the training dataset, there is at least one condition that has the same value for one of more features of the genetic background or environmental setting. In this set of 262 conditions, we evaluated the effect of the number of epochs, as shown in Supplementary Fig.  16A. We found that in 63% of the conditions, 50 epochs were sufficient to achieve convergence (i.e. ∆P CC ≤ 0.01), while in all cases, convergence is achieved within 100 epochs. Therefore, we set the number of epochs to 100.

Regularization.
To avoid over-fitting and create a sparse representation of the transcriptional organization, we used L 1 regularization (weight decay) [131] to the original cost function: where C 0 is the original cost (the equation 3), n is the size of the training set and λ is the parameter for adjusting the relative importance of the regularization term compared to C 0 . Then taking the partial derivatives of the cost function with respect to w yields The biases in w don't change from the regularization term and thus only non bias weights are updated from the equation (6) as λ is chosen based on RNN performance observed from LOCO cross-validation of the data set of 178 transcription factors (Supplementary Fig. 16B). Our results show that the optimal performance (PCC= 0.76±0.12) is achieved when λ approximates the value of 0.005, which was selected for all subsequent experiments.
Fixed or time-variant weights. We have considered the case where weights are changing through time unfolding, to capture different correlations that might be present in feedback loops. We performed LOCO cross-validation (described in Section 3.3.3) from the transcriptome data of 2,610 profiles of 178 transcription factors for two RNN models: one where weights are variable through time/iteration and another with fixed weights (Supplementary Fig. 16D). Weights are fixed by averaging weights through time for any given connection. Our analysis shows that fixed weights are faster to train and lead to equal or better ANN predictors (PCC=0.68±0.14) than variable weights (PCC=0.64±0.17).
Integration of biological knowledge to network topology. We have evaluated whether the network interaction data that has been experimentally validated confer extra information to the predictive model. We compared the performance of RNNs with and without prior knowledge of the transcriptional regulatory network (Sections 3.1.3, 3.1.3, 3.1.3). Performance comparison was conducted as described in Section 3.3.3. Our results show that the RNN with a priori connections (PCC 0.68±0.14) is better to the one without it (PCC=0.63±0.20).

Memory depth.
To get a sense on what memory size (n) will capture the majority of feedback loops in E. coli, we first investigated the transcriptional regulatory network (TRN) compiled from public repositories (e.g. RegulonDB) and literature for E. coli (Section 3.1.3). In total, 173 cycles were detected, ranging from one cycle (self-loop) to length of 8 (Supplementary Fig. 26). 65% of cycles were selfloops and 85% of all cycles are below a cycle length of 4. Next, we trained RNN with different memory sizes and assessed the effect of memory on the RNN performance. As shown in Supplementary Fig.  16C, a memory depth of 2 is sufficient for our purposes, which captures 75% of known cycles.

Model evaluation
Leave-one-condition-out (LOCO) cross-validation. For model validation, we leave out profiles of a condition from the dataset and build a model from the rest and prediction performance of the model is tested from the condition left-out. This procedure repeats until all conditions are tested. To evaluate model performance, we measure Pearson's correlation coefficient (PCC) between predicted expression levels and average of known expression levels for profiles belonging to the test condition. Not all conditions are "testable" as unknown as any of four types (i.e. strain information, medium, environmental abiotic stresses, and genetic perturbations) representing test condition might not be present in the training set. From the compendium, 262 are "testable" as unknown among 493 conditions. Theses conditions are classified into 6 groups based on presence of genetic perturbations and stresses (Supplementary Fig. 28).
Baseline measurement. We defined three baselines for model performance evaluation. The first baseline ("random baseline") provides the PCC between the average expression level for each gene calculated from 10 random profiles in the training data, to the expression profiles of the test data. For each comparison we report the "random baseline" calculated from 1000 times of repetitive sampling without replacement of the 10-profile random set. The second baseline ("mean baseline") provides the PCC between the average expression level of each gene calculated from all profiles in the training data, to the expression profiles of the test data. The third baseline ("WT baseline") provides the PCC between the average expression level for each gene calculated from all WT profiles in the training data, to the expression profiles of the test data. Again, we define as WT (wild-type) profiles the ones of the MG1655 strain in LB or M9 medium, any carbon source, without any stresses or genetic perturbation.

Proteome module
The expression level for each of the 1,001 proteins is predicted through an ensemble method that integrates information from four sources: the transcriptional regulatory network (TRN), the protein-protein interaction (PPI) network, the co-expressed protein network (CPN) and other pathway information. Supplementary Fig. 20 depicts the integration among these layers to derive the protein expression level. More specifically: • Transcriptional Regulatory Network (TRN). The transcriptional regulatory network was built based on the RegulonDB database as described in Section 3.1.3. The protein level of a target gene in a novel condition is predicted by LASSO regression [122] of the expression levels of genes that are connected through a regulatory link to the target gene. • Protein-Protein Interaction (PPI) Network. The PPI network was constructed from the five distinctive sources that were described in Section 3.1.3. Similarly, the protein level of a target gene in a novel condition is predicted by LASSO regression of the expression levels of genes whose respective proteins are connected through a Protein-Protein interaction to the target protein. proteins was built from the core proteome dataset (Section 3.1.1), which represents 20 expression profiles of 1,001 proteins, which are not used for proteome prediction. For two proteins to be considered co-expressed, their pairwise correlation should be larger than 0.7. Any given protein level in a novel condition is predicted by LASSO regression of the protein expression levels of co-expressed proteins with respect to the target protein.
• Pathway clustering. We cluster genes that are implicated in the same pathways, as represented in the KEGG database. The protein level of a target gene in a novel condition is predicted by LASSO regression of the expression levels of genes that are implicated in the same pathways as the target gene.
Prior to building the prediction models, we apply min-max scaling of the proteome data for preprocessing as in Section 3.3.3. The λ L 1 -regularization parameter for each linear function is selected based on cross-validation. For evaluating prediction of proteome layer, LOCO cross-validation is used where a profile is left out for testing, the rest is used for training and this procedure repeats until all profiles are tested. PCC between predicted expression levels and known expression levels for all tested profiles is used for a measure of evaluating the prediction performance. We evaluated each of these methods individually, as well as their integration through an Ensemble method where the protein expression level is the mean of the predicted expression level from each of the four prediction modules. The evaluation was performed in an Ecomics-derived dataset of 18 profiles (5 conditions) with expression levels in both the transcriptional (4,096 transcripts) and proteome layers (1,001 proteins). The integration of four methods has the highest protein coverage and can predict all 1001 proteins, while three of the four individual methods can predict a substantial lower number of proteins (250 for TRN, 547 for KEGG, 1,000 for PPI, 847 for CPN). The prediction performance of the integrated method (0.55±0.26) outperforms all individual methods (0.41±0.23 for TRN; 0.47±0.23 for KEGG; 0.48±0.26 for PPI; 0.52±0.24 for CPN). Although we cannot rigorously compare the different PCC among the different methods (they correspond to different sets and numbers of proteins), we can evaluate the protein expression prediction for the top 50 most variable proteins that are common among the five sets. In that case, the Ensemble method that integrates the four different prediction sources clearly outperforms all other combinations with PCC of 0.77±0.27 while the second best method is CPN-based prediction, with PCC of 0.69±0.27. Additionally, predicting first the target gene expression from the mRNA expression levels of the corresponding genes does not perform well, achieving a PCC of 0.34±0.18 in the general case and PCC of 0.18±0.51 for the 50 most variable proteins ( Supplementary  Fig. 20).

Metabolome module
First, we investigated what information and from which layer (transcriptome, proteome) can lead to a better predictor for the metabolome layer. For this, we used 33 profiles of 126 metabolites, 53 proteins and 75 genes that constitute the core metabolism, in order to predict the concentration of each metabolite. We used i) the expression levels of all 53 proteins and ii) the expression levels of all 75 genes, using constraint LASSO regression in both cases (Supplementary Fig. 15). As shown in Supplementary Fig. 21A, the prediction performance is better in using (measured) protein expression levels (PCC 0.65±0.21) than by using gene expression levels (PCC 0.47±0.26). For predicting concentrations of metabolites in non-core metabolism, we resort to inferring their levels from mRNA expression levels due to the paucity of profiles with both metabolome (including metabolites in non-core metabolism) and proteome information (only 6 profiles). Variance analysis indicates that the prediction of metabolite concentrations in non-core pathways are robust, with a variance of 0.02±0.01, comparable to that of core metabolism (0.06±0.01), suggesting that such metabolites are highly predictable even without the need of protein expression data (Supplementary Fig. 21B). For testing this, we use 115 profiles of 230 metabolites (non-core) and 4,096 genes. For each of metabolites, we train a linear function using LASSO that predicts the concentration of a metabolite from transcriptional expression levels. For metabolites having known enzyme-substrate relations, we predict its concentrations from the mRNA expression levels of the related enzymes. For those with no such information, we fit from all the genes by using LASSO, which allows variable selection. We perform leave-one-out cross-validation from the data set to validate the prediction performance of the approach. The results ( Supplementary  Fig. 21C) show that the prediction performance is PCC 0.87±0.16, although the baseline is also high (PCC 0.77±0.17 for mean baseline and 0.70±0.13 for random baseline) because of the invariance of metabolite concentrations in non-core. Finally, we predict the metabolome layer from two components that i) predicts 126 metabolites from 75 protein expression levels in core metabolism and ii) predicts 230 metabolites from 4096 gene expression levels in non-core metabolism (Supplementary Fig. 15).

Fluxome module
Fluxes are predicted using Flux Balance Analysis (FBA), which is a mathematical approach for analyzing the flow of metabolites through a metabolic network [132]. Basically, FBA can be formulated as the following optimization problem: where c is a vector of coefficients and v is a reaction vector. S is stoichiometric matrix. We make reaction constraints based on expression levels of related enzymes. Specifically, we change the lower bounds of reactions by the following rule: where g i is expression levels of enzymes in reaction i (r i ). Threshold t is determined within the range of {0.02, 0.04, 0.06, 0.08} that maximizes the prediction performance for growth rate in training data and it was determined to be 0.04. The range was determined to be below 0.1 as the mean expression level of genes was 0.10. For this, GPR (gene-protein-reaction) associations from BiGG database [81] are retrieved (iJO1366). For exchange reactions for the metabolites present in LB medium, we followed the reaction bounds from [133]. For anaerobic condition, exchange reaction of oxygen is lower-bounded to zero. Bounds for all exchange reactions are listed in Supplementary Data 8. Expression levels of enzymes are interrogated from proteome layer if measured and from transcriptome layer otherwise. If any of enzymes in a reaction are not measured, then lower bound of the reaction is unconstrained. Typically, a FBA solution is not one as multiple genome-wide fluxes might achieves the same objective. From the multiple solutions, we advocate the flux distribution that minimizes total the total absolute flux (MTF) [134] with the same objective value. To have the predicted growth rate from FBA comparable to ones predicted from other layers, we built a linear transformation function between predicted growth rate from FBA and measured growth rate for training conditions.

Model integration and phenome prediction
Predicting growth rate from a single layer We built 3 linear functions using LASSO to predict growth rate from (a) condition features, (b) known expression levels of molecular species, and (c) integration of both. We evaluate the predictive capacity of this method for the three layers of transcriptome, proteome, and metabolome. In our evaluation, we keep only the profiles with exponential phase for which growth rates are present in the compendium. This results to the following datasets: The results show that for all layers, the integration of condition information (strain, medium, stress, genetic perturbation) and expression profiles for a single layer predicts more accurately the growth rate than using the condition input and the expression profiles independently (Supplementary Fig. 27). The performance of models that solely use condition features is variable across different layers.
Prediction performance of the integrated model. We integrate all layers for a consensus prediction of growth rate by using a weighted sum model, which was found to outperform other candidates in our testing conditions. The final predicted growth rate is the weighted sum of the growth rate predicted in each layer. The weight for each term is the normalized performance on predicting the growth rate for each layer, measured by LOCO cross-validation during the training phase. As such: where g i is the consensus growth rate from all five layers, L is a set of all five layers, w l is the weight factor for layer l, g l is the growth rate predicted from layer l. The normalized training performance for layer l is calculated by PCC between measured growth rate and predicted growth rate from layer l for all conditions in the training set normalized by sum of PCC for all l. Our results show that the performance of the model integrating all five layers (PCC=0.6) performs better than any other alternative model (Fig.  22). The performance increases in cases where wild-type conditions are present in the training set. Additionally, when the model is predicting unknown WT conditions, the PCC increases up to 0.76. We also investigated individual cases to document the effect of each layer to prediction performance. In 53 cases, the distance between predicted and measured growth rate gradually decreases as each layer is supplemented (R < 0). As shown in (Fig. 23), the largest PCC gains are with the introduction of the transcriptome (increase from PCC=0.46 to 0.64) and fluxome layer (increase from PCC=0.65 to 0.70).

Targeted experimentation for RNA-Seq
For the compendium enrichment, three replicates of selected E. coli mutants Supplementary Table  2 from the Keio collection [135] were grown in minimal M9 salt medium with 0.4% (w/v) glucose as carbon source. Cells were grown in 3 mL of media till mid stationary phase (approx. 8 hours) and then mixed with 1.5 ml of chilled 5% phenol/ethanol (v/v). After centrifugation, cells were stored at -80C until use, never more than one week. RNA was extracted using the RNeasy kit (Qiagen) and enriched with MICROBExpress (Ambion). RNA fragmentation, cDNA production, A-tailing, linker ligation and PCR enrichment were made using the KAPA Stranded RNA-Seq Library Preparation Kit (Kapa Biosystems). Size selection of the final library was performed with Agencourt AMPure XP (Beckman Coulter). After quality control, libraries were pooled and sequenced using an Illumina HiSeq 2500 sequencer.

Proteome profiling
E. coli MG1655 was grown in M9 with 0.4% glucose with and without 0.6% n-butanol for 12 hours (early stationary). Bacteria were pelleted by centrifugation and the samples were analyzed in the Proteomics Core at Genome Center (UC Davis). The processing steps of protein quantification are described in the main text.

Growth experiments
Strains and media. Growth measurements were performed using different strains of the E. coli. E. coli strains used were either from our lab collection or were gifted by different research laboratories. E. coli strains were preserved in Luria Bertani broth supplemented with 15% glycerol. Specific media were made as per requirement.
Lag phase, maximum growth rate and maximum cell density measurements. For the phenomics layer, the growth curves were experimentally measured (Supplementary Data 7) at least in triplicates. 5 µl of the preserved E. coli cells were grown overnight in either 1 ml of 0.2% glycerol M9 medium or 1 ml of the required medium (in the case of carbon shift) in an incubator shaker (Innova 44, New Brunswick Scientific) operating at 175 rpm at 37C. Next day, 3 µl of the growing cultures were taken and inoculated into 197 µl of specific media in a 96 well flat bottom plate (Costar). Cells were grown in an incubator shaker (BioTek Synergy HT) operating at the required temperature. Cell densities were measured at every 15 minutes. For anaerobic growth curves, a custom chamber was made, which was saturated with the nitrogen. 96 well plate was placed inside the chamber and 3 µl of the growing cultures were taken and inoculated into 197 µl of required media saturated previously with nitrogen. Plate was sealed using adhesive tape sheets (OmniGenX Microplate Seal Pad) and cell densities were measured at every 15 minutes. An automated script was written in the MATLAB to calculate the lag phase, maximum growth rate and maximum cell density. The lag phase was defined as the duration to achieve 5% of the maximum growth rate. Growth rates were determined by calculating the differential between 10 to 90% of cell density using a virtual sliding window with the width of 1 hour and sliding every 15 minutes.