The space of enzyme regulation in HeLa cells can be inferred from its intracellular metabolome

During the transition from a healthy state to a cancerous one, cells alter their metabolism to increase proliferation. The underlying metabolic alterations may be caused by a variety of different regulatory events on the transcriptional or post-transcriptional level whose identification contributes to the rational design of therapeutic targets. We present a mechanistic strategy capable of inferring enzymatic regulation from intracellular metabolome measurements that is independent of the actual mechanism of regulation. Here, enzyme activities are expressed by the space of all feasible kinetic constants (k-cone) such that the alteration between two phenotypes is given by their corresponding kinetic spaces. Deriving an expression for the transformation of the healthy to the cancer k-cone we identified putative regulated enzymes between the HeLa and HaCaT cell lines. We show that only a few enzymatic activities change between those two cell lines and that this regulation does not depend on gene transcription but is instead post-transcriptional. Here, we identify phosphofructokinase as the major driver of proliferation in HeLa cells and suggest an optional regulatory program, associated with oxidative stress, that affects the activity of the pentose phosphate pathway.

ac.il/ assuming a pH of 7.34 and an ionic strength of 0.15 M for all samples. (C) Bars denote the number of basis vectors in the respective non-reduced k-cone that lead to stable or unstable steady states for the two used cell lines.

Figure S2
Supplementary Figure S2 -Metabolite concentrations are log-normal distributed he distribution of the log-transformed metabolite concentrations was compared to a normal distribution by quantile-quantile plots and empirical distribution functions. Blue symbols and lines denote the HaCaT samples and red symbols and lines the HeLa samples. uantile-quantile plots are shown on the let and a perfect fit is denoted by the straight lines. Decent agreement is observed, with some deviations in the tails. he empirical distribution functions are shown on the right, and cumulative distribution functions for a normal distribution is indicated by the straight lines.

Figure S3
cell line condition Supplementary Figure S3 -Mapping of microarray samples to the first two principal components. Colors denote condition, and shape denotes cell lines. he samples separate well into noncancerous normal samples and cancerous HeLa samples. he total standard deviation contained in the first two principal components was 17.6%.

Mathematical description of the k-cone formalism 1 The flux cone
Gifien the stoichiometric matriffi S 1 and the fiector of reaction fielocities v = v(S, t) the steady state condition is gifien by he fact that some reactions might be irrefiersible enforces positifiity constraints on some elements of v (the irrefiersible reactions). In order to make the constraints more consistent one oten splits ffp all refiersible reactions into their respectifie forflard and backflard reactions. his can be achiefied by a link matriffi L flhich has a block diagonal shape flith ones on the diagonal for irrefiersible reactions and (1, −1) for refiersible reactions. For instance for the simple system fle can formfflate the follofling transformation to get the irrefiersible stoichiometric matriffi S irr : Using the link matriffi L fle can nofl formfflate offr steady state condition for the irrefiersible stoichiometric matriffi S irr and split ffffies v irr S irr v irr = S rev Lv irr = 0. (4) his system nofl has a semi-trifiial solfftion Lv irr flhich reqffires irrefiersible reactions to carry zero ffffi and refiersible reactions to be in chemical eqffilibriffm, meaning that the respectifie forflard and backflard reactions carry the same ffffi. Hoflefier, as long as S rev is infiertible there may 1 In this teffit all lofler-case bold leters denote fiectors, all ffpper-case bold leters matrices and all cffrsifie leters denote scalars. At the same time the mffltiplictation operator, · , denotes scalar mffltiplication for scalars, the dot prodffct for fiectors and matriffi mffltiplication for matrices. For conciseness the mffltiplication operator flill sometimes be omited.
1 also be other non-eqffilibriffm solfftions for the steady state.
From here on S flill alflays denote the irrefiersible stoichiometric matriffi S irr and v the irrefiersible ffffies v irr , yielding the steady state condition his de nes a solfftion space flhich is a pointed cone 2 . he basis for this space, V, is a m × n matriffi called the flux cone flhere ∀v = Vc if c ∈ R m,+ : Sv = 0.

The k-cone
In the k-cone formalism fle assffme that, gifien the metabolite concentrations x, the reaction fielocities can be flriten as Here k j denotes the reaction rate and m j are metabolic terms, in particfflar, for mass-action kinetics it holds that m j = i x s ij i , flhere s ij is eqffal to the stoichiometry of metabolite x i in reation j. Hoflefier, m j terms can be chosen arbitrary in the k-cone formalism to acomodate other kinetics. m j denotes the corresponding mass-action terms. De ning the diagonal matriffi M := diag(m j ), one can reflrite the steady state condition as For a gifien fiector of sffbstrate concentrations x, SM de nes a confieffi polytope of all possible reaction rates, and thffs enzyme actifiities, for the gifien steady state X. Solfiing the fierteffi enffmeration problem for the H-representation gifien by SM and the constraints on k de nes a nite basis K, called the k-cone. One coffld solfie the fierteffi enffmeration for a k-cone directly, hoflefier it is more practical to derifie the k-cone directly from the ffffi cone V (as de ned in eq. 6). Using the de nition of the reaction fielocities (eqffation 7) one can see that the k-cone K can be obtained by his reqffires the diagonal matriffi M to be infiertible, flhich is the case if all metabolic terms are none-zero. In practice zero metabolic terms denote reactions flhich are inactifie and can be remofied from the stoichiometric matriffi. his de nes one ffniqffe k-cone bffilt from a skeleton ffffi cone, flhich gifies a good basis for comparatifie analysis. One shoffld note that the k-cone can be constrained efien fffrther if one has measffrements for in vivo eqffilibriffm constants K i eq for a refiersible reaction flith indeffi i, since it de nes an eqffality 2 meaning that the origin is inclffded in the space and that all other points in the space are strictly positifie constraint for the forflard and backflard rate constants k + i and k − i flhich may redffce the k-cone space:

Conditional transitions
If fle obserfie the same biochemical systems in tflo di erent conditions, normal (n) and disease (d), this de nes their respectifie k-cones K n and K d . Using eqffation 9 fle can easily derifie a transformation matriffi T that describes the transition from the healthy to the disease space by the follofling theorem.
heorem 1. here exists a diagonal transformation T = M n M −1 d such that K d = TK n . Proof. Directly follofls by applying eqffation 9 to K d .
hffs, the transformation betfleen the k-cone spaces is gifien by metabolome data alone. Hoflefier, this is a property of the entire spaces. Within the respectifie spaces the kinetic constants can assffme any fialffe flithin the respectifie k-cone. Hoflefier, this reqffires the solfftions to come from the same ffffi cone V flhich is only the case if the stoichiometric matriffi S and constraints are the same in the normal and disease condition. hffs, fle flill nofl aim at derifiing a more general effipression.
We flill rst formfflate eqffation 7 in a fiector form by considering the (irrefiersible) ffffies v gifien by Hence, if fle obserfie the same biochemical systems in tflo di erent conditions, normal (n) and disease (d), the respectifie enzyme actifiities are gifien by his raises the qffestion if there effiists an a ne transformation betfleen the normal and disease state.
heorem 2. Given enzyme activities k n and k d it holds that Proof. Straight forflard ffsing eqffations 17 and the commfftatifiity of diagonal matrices.
his gifies an a ne transformation betfleen the enzyme actifiities gifien by the normal and disease state. Fffrthermore, the transition is composed of the transformation matriffi T and the fleight matriffi W. Efien thoffgh the theorem did not assffme the same ffffi cone for both conditions this time, it recofiers the transformation matriffi T again. Since T and W are diagonal one can intepret their prodffct as fleighting each diagonal entry in T by the diagonal entries in W. hffs, from a biochemical point of fiiefl, the change in enzyme actifiities from the normal to the disease state can be decomposed into changes in the mass-action terms (the concentration-dependent part of the kinetics) and the ffffi changes betfleen the tflo conditions. W is ffsffally not knofln, bfft T can easily be calcfflated from steady state measffrements of metabolites. A healthy system ffsing k n mffst trafierse throffgh TW toflards k d . hffs, TW contains all possible changes to trafierse into the disease state and acts as a representation for the regfflatory efients occffrring betfleen k n and k d . Reactions flith large entries in T are either regfflated or mffst shofl a strong alteration in their steady state ffffi balance, gifien by v d /v n , that coffnteracts the entries in T. he more similar the steady state ffffi distribfftions of the normal and disease state are, the more meaningfffl is T. In particfflar, if the ffffi cones V n and V d are similar, randomly sampling feasible ffffies v n , v d floffld in afierage resfflt in a fleight matriffi W close to the identity matriffi and TW ≈ T.

Stability analysis
he dynamics of the system are gifien by the ODE system Efialffating this ODE system at a pertffrbed steady state x * + ∆x and performing a linear ap-proffiimation yields Here J (x * ) denotes the Jacobian for the metabolic terms flith . he solfftion to this system is gifien by the matriffi effiponential fiia Gifien a k-cone K = {k i } fle can ffse its basis property to effipress all possible k by he matriffi effiponentials are gifien by the eigendecomposition, resfflting in flhere Q i contains the eigenfiectors pertaining 3 to w i k i and Λ i its eigenfialffes on the diagonal. Each elements of the solfftion is, thffs, denoted by a refleighted prodffct of effiponentials for some fleights α i and indices i, j flith associated fleights φ i,j .
hffs, some conclffsions abofft the stability of a solfftion can be drafln based on the eigenfialffes pertaining to the basis fiectors in K. Particfflarly, if w i > 0 only for stable basis fiectors (ℜ(Λ i ) < 0) all resfflting k flill lead to a steady state as flell. In case some of the basis fiectors are ffnstable no general conclffsion can be drafln since, dffe to the prodffct in the effipression, the resfflting system can either be stable or ffnstable based on the strffctffre of the matrices Q i and the fleights w i . Hoflefier, dffe to the sffm form in eqffation 33 one can see that the magnitffde of the eigenfialffes dictate the resfflting stability, meaning that large absolffte eigenfialffes from a stable state can dominate small ffnstable ones.

Protocol: k-cone analysis of HeLa cells Installation
All of the analysis is performed in R. As such the first thing you will need is to install R. For installation instructions see http://r-project.org. In Ubuntu and Debian R can be installed via the Terminal using sudo apt-get install r-base r-base-dev Additionally some of the dependencies of dycone require development versions of some libraries for web security and scraping. In Ubuntu and Debian those can be installed via Most of the actual analysis is implemented in the dycone R package. It can be installed using devtools in the following manner. We will also install all optional dependencies so we can build this document. In a Terminal type R to start R, than use the following commands: install.packages("devtools") source("http://bioconductor.org/biocLite.R") biocLite(c("Biobase", "IRanges", "AnnotationDbi", "affy", "frma", "genefilter", "GEOquery", "hgu133plus2.db", "hgu133plus2frmavecs", "limma")) devtools::install_github("cdiener/dycone", dependencies = TRUE) This will install dycone and all additional dependencies. You will see at lot of messages from the compiler and the whole process might take a few minutes. After that the dycone library can be loaded with library(dycone)
metab <-read.csv("metabolome.csv") # First 4 columns are annotations metab[, 5:10] <-1e-6 * metab[, 5:10]/(1e6 * 1.54e-12) As one can see, the names used by the provider during metabolite measurements are not the same we used in the model. Thus, we will also need an ID map to identify the metabolites. This map is given in id_map.csv id_map <-read.csv("id_map.csv", stringsAsFactors=FALSE) head(id_map) Some of the metabolites have several IDs assigned to them. We will come back to that later.

Imputation of missing data
At first we will try to see which metabolites in the model are missing in the measurements. For this we will look for matches of the respective metabolites from the model in the metabolome measurements. Matching will be done based on KEGG IDs.
matches <-sapply(id_map$kegg, grep_id, x = metab$kegg_id) miss <-is.na(matches) So we see we have measurements for about 2/3 of the required metabolites To impute the missing values we will first construct a data frame for all metbaolites with missing entries and see how to fill it in the following steps.
full now already includes the metabolome measurements but still has missing entries which we will now scrape from the Human Metabolome Database (HMDB, http://hmdb.ca). HMDB assigns a single KEGG is to each metabolite in the database. However, there are cases where a single metabolite can be identified by several IDs in the HMDB. This may happen if we have Glucose which may for instance map to alpha-D-Glucose or beta-D-Glucose . This is the reason why our id_map includes several HMDB and KEGG IDs for some metabolites.
We will now scrape the concentrations for all metabolites in the model from HMDB. This might take some time so we will use the caching operator %c% from dycone which will take an R expression and will cache all assigned variables in that expression to a cache file, so that rerunning the script will read the results from the cache and not rerun the analysis. To run the analysis again simply delete the cache file without changing any of the code.
{ concs <-hmdb_concentration(id_map$hmdb, add = id_map[, 1:2]) } %c% "scraped_concs.Rd" We use this data to get the mean values for measured concentrations (HMDB quantifies almost all metabolites by micro-moles per liter as well) by taking them from cytoplasm measurements where available, or blood as a fallback. (In any way those imputations will only be used for stability analysis and not for differential measurements. If one only wants to use the differential analysis of dycone you can simply substitute all NA values in full by any constant.) m_concs <-as.vector(by(concs, concs$name, priority_mean)) names(m_concs) <-levels(factor(concs$name)) In order to fill the gaps in the full data set we will use the patch function from dycone. patch will first to attempt to fill any hole with measurements from the own data set, either by using the mean values of the same cell line or, if not available, by the mean value of the other cell line. Thus, giving priority to the local data before the scraped ones. Only measurements with missing entries in both cell lines are filled with the scraped mean concentrations from HMDB.

The k-cone of HaCaT and HeLa cells
As described in more detail in the Supplementary Text of the publication the k-cone of several metabolome measurement can be generated from a skeleton flux cone. Thus, we will first calculate the flux cone for our model. This will take about half an hour, so we will use the caching operator again to avoid recalculating the flux cone every time we run the analysis. For this we will need an irreversible stoichiometric matrix which can be generated with the stoichiometry function of dycone.
For the metabolic terms we use the mass-action terms generated from the imputed metabolome measurements. This is sufficient to create the k-cones for the six measurements. ma_terms expects a single named vector of concentrations, or a data frame with a name column and several columns containing concentrations. To visualize the k-cone we will use the plot_red function which first projects the high-dimensional k-cone into the two dimension capturing the most variance, followed by clustering of the extreme rays of the cone to avoid repeatedly using rays that are very similar. The shaded area corresponds to the interior of the cone, so all feasible sets of kinetic constants fall into the shaded area of the respective k-cone. We will use blue for HaCaT cells and red for HeLa. Since the used k-means clustering is not entirely deterministic the images here might look a bit different here than in the publication (particularly arrow outliers) or be rotated along one of the two PC axes. However, the general appearance of the cones should be the same. As we can see there are only minor differences between the k-cones. To see the proportion of the cone with large entries in the transformation matrices we will first calculate the log2-fold changes between all combinations of HeLa and HaCaT cells and only use those with an absolute log2-fold change larger than 1 (thus, a fold change larger than 2). We can also reduce the k-cone spaces even further by using measurements for in vivo equilibrium constants Keq. Because there are no measured Keq for our cell lines we will use approximations obtained from http://equilibrator.weizmann.ac.il/. For that we will assume an pH of 7.34 and an ionic strength of 0.15 M for all samples. The estimated Keq values are already contained in the model. And we can thus use the constrain_by_Keq function from dycone. In order to accelerate this, we again use the caching operator and will also employ the doParallel package to run the analysis for each k-cone in parallel. If you have less than 6 CPU cores adjust the option accordingly or simply do not setup the cluster which will cause the code to run on a single core.

Stability analysis
To get the stability for an entire k-cone we need to run a stability analysis for each basis vector of the k-cone. Since we have 6 k-cones with over 80.000 basis vectors each this will take a while so we will perform it in parallel again. Now we will assign the cell line to each output and plot the counts for the individual stability types.

Differential activities
The statistical tests used in dycone assume a log-normal distribution of the metabolic terms. It is sufficient to show an approximate normal distribution for the log-transformed metabolite measurements since the log-transform of the metabolic terms is a weighted sum of the logarithmic metabolite measurements for most of the common kinetics (such as mass-action, Michalis-Menten and Hill). We will compare distribution to a normal one via quantile-quantile plots and the empirical distribution of the data.
We already calculated the log-fold changes before, but in order to also assign some statistics to that we can also use the function hyp from dycone which does that for us. It will generate all log-fold changes between disease and normal measurements and perform an empirical Bayes version of t-test. The output will be sorted by increasing p-values and mean log-fold changes. We will also use the full option to obtain the raw log-fold changes and append some additional annotations to the result using the rp function.

Worst-case analysis via linear programming
The analysis we will perform here is very similar to that in the previous section, however, this time we will obtain the differential enzyme activities by correcting for effects that can be caused by flux variation. For this we will analyze the smallest and largest flux for each reaction that still allows a given biomass/growth flux to operate at its optimum. Using dycone the only explicit action required of the user is definition of the optimization criterion. For this we will first define all metabolites which are precursors for compounds required for proliferation. Each of those will be assigned a weight which is obtained from the stoichiometry of Recon 2 biomass reaction (http://www.ebi.ac.uk/biomodels-main/MODEL1109130000). For the cases where one metabolite is the precursor for several proliferation compounds we use the maximum stoichiometry from Recon 2. Negative stoichiometries denote compounds that are consumed and positive stoichiometries denote produced compounds. Here we only produce ADP and Pi to balance the ATP usage. The small difference in the stoichiometry (20.7045 vs 20.6508) in Recon 2 accounts for the ATP used during DNA replication. As we can see there is a large requirement for ATP and amino acid precursors. Also the form here is not the only way to formulate an objective reaction. Please refer to the documentation of fba for alternative input forms.
We can now use the hyp function again to generate hypotheses for differentially regulated enzymes correcting for fold changes that can be explained by flux variation. This time setting the type to fva which will require defining the additional v_opt which defines the objective reaction. The output will be same as before only with an additional column fva_log_fold denoting the largest absolute fold-change that can be explained by flux variability alone. We will also append additional annotations again and save the output to a CSV file. This will solve a series of linear programming problems (in our case more than 200). To accelaerate this a bit, hyp will automatically execute those in parallel if you registered any of the backends compatible with foreach. Since we already did that during the stability analysis the following code will automatically run in parallel. Finally, we will save the complete results in EDAs.csv. Let us use visualize those results to also compare the obtained regulations on a pathway level and mark reactions whose expected differential activity log-fold changes can not be explained by flux variability.

Heterogeneity and co-regulation
Cancer is known to be a very heterogeneous disease. thus, we might ask what the variation in enzyme activities is within HaCaT and HeLa cells. The hyp output already calculates the standard deviations for the log-fold changes within HaCaT cells and between HeLa and HaCaT cells. So we can easily visualize those.

Gene expression analysis
We will start by reading a list containing IDs and cell lines for 58 untreated samples from the GEO database and creating an output directory for the downloaded data.
sample_info <-read.csv("ge_samples.csv") dir.create("ma") Since we might run this analysis several times we will first check whether the data is already present and than download all missing data.
library(frma, quietly=T, warn.conflicts=F) eset <-frma(raw_data) We will follow this by a short quality control. For this we will check whether the normal and disease samples form condition-specific groups in their expression patterns. As a first try we will visualize the the gene expression patterns in the first two principal components. We can also quantify this by first clustering the 58 samples by their expression patterns into two clusters and check how well those two clusters correspond to the condition.

## Identified genes: 20546
Differential expression will be judged by a t-test where the sample variances are estimated using the empirical Bayes method from the limma package. Finally, we will save the gene-wise log2-fold changes along with FDR-corrected p-values to an intermediate data file.

Comparison to expected differential activity
We will start by mapping all the EC numbers of the reactions used in the model to its respective ENTREZ gene ids and names and saving that information into the info data frame. Note that a single reaction might be associated to several EC numbers (isoenzymes for example) and every EC number might be associated with several genes.

info <-info[good, ]
Now we will start adding the correponding log fold changes and p-values from the EDAs. And, finally, we will add the log fold changes obtained from the microarrays along with their p-values. The fully assembled info data frame will be saved as a csv file.