Date: 2018-02-19

Metabolic reconstruction.

Recon3D was assembled from multiple data sources: HMR 2.00 (ref. 6) (2,478 reactions), metabolomics data sets (1,865 reactions), a drug module22 (721 reactions), a transport module (51 reactions), host–microbe reactions (24 reactions), absorption and metabolism of dietary compounds (20 reactions), and others (1,004 reactions). The 'others' category included reactions that capture metabolism in specific human organs (e.g., kidney), as well as novel metabolic pathways of lipoproteins, bile acids, and sphingolipids. Recon 2 was expanded iteratively (Supplementary Fig. 1). Each addition was followed by extensive model debugging and manual curation for flux consistency and refinement.

Recon 2 was expanded in two stages: (i) addition of new reactions and (ii) network refinement to build a high-quality, flux-consistent model (Supplementary Fig. 1). In total, 6,163 reactions, 1,589 metabolites, and 1,654 genes were added to complete Recon3D. The new reactions were mostly from transport (32%), lipid metabolism (24%), exchange (19%), xenobiotic (11%), and amino acid (7%) metabolism (Supplementary Fig. 2b,c). Other major additions include reactions required for debugging the network for flux consistency (10% of newly added reactions), reactions representing organ-specific metabolism (7%), the transport module (2%), lipoprotein metabolism (2%), novel dietary compounds and their associated reactions (1%), and reactions capturing interactions between gut microbes and the host (1%). For details on the precise metabolic pathways, see Supplementary Note 1.
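As a quick consistency check, the per-source reaction counts listed above can be summed and compared with the stated total of novel reactions. The snippet is only an illustrative tally; the dictionary keys are shorthand labels chosen here, not identifiers from the reconstruction.

```python
# Reaction counts per data source, as listed in the text.
source_counts = {
    "HMR 2.00": 2478,
    "metabolomics data sets": 1865,
    "drug module": 721,
    "transport module": 51,
    "host-microbe reactions": 24,
    "dietary compounds": 20,
    "others": 1004,
}

# Sum over all sources and compare with the total stated in the text.
total_added = sum(source_counts.values())
print(total_added)  # 6163, the stated number of novel reactions
```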

The largest contributions of new metabolic genes came from (i) lipid metabolism (10%), (ii) carbohydrate metabolism (5%), (iii) transport processes (5%), (iv) amino acid metabolism (3%), and (v) nucleotide metabolism and vitamin metabolism (1%) (Supplementary Fig. 2). The miscellaneous category mostly contained genes from HMR 2.0 (99%) (Supplementary Fig. 2). The largest contributions of new metabolites came from the lipid (42%) and amino acid (19%) classes. Novel metabolites added in other subsystems include miscellaneous and xenobiotics (18%), carbohydrates (2%), vitamins (1.4%), and nucleotide (0.3%) metabolism (Supplementary Fig. 2).

Once reactions and genes were added to Recon3D, the reconstruction was subjected to a series of quality control and quality assurance (QC/QA) tests (Supplementary Fig. 1). These included (i) checking for duplicate reactions and metabolites, (ii) modification of gene-protein-reaction associations, (iii) adjustment of metabolite formulae to pH 7.2 along with mass-charge balancing of reactions, (iv) a leak test, checks for stoichiometric and flux consistency, and a check for thermodynamic feasibility54, (v) debugging and curation to remove dead-end metabolites, and (vi) checking that the network accomplishes defined functions/tests (Supplementary Fig. 1).

We took several approaches to check for duplicate reactions and metabolites. First, Quek et al.55 reported 95 duplicate metabolites, 71 of which were replaced (Supplementary Data 9 and Supplementary Note 1). Second, HMR reactions and metabolites were checked for duplicates before inclusion in Recon3D. The metabolite formulae, particularly those received from HMR 2.0, were adjusted to an internal pH of 7.2 using mol files28, the COBRA toolbox56, and ChemAxon software (https://chemicalize.com/). This led to correct assignment of reaction stoichiometry and mass-charge balancing of reactions. Third, gene-protein-reaction associations were curated and corrected for 2,180 reactions (Supplementary Data 6 and 7 and Supplementary Note 1). Finally, we performed additional QC/QA tests (e.g., for functional leaks and for production of matter from water and oxygen).
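To illustrate why protonation-state-adjusted formulae matter for mass-charge balancing, the following minimal sketch checks one reaction for element and charge balance. It is not the COBRA toolbox routine; the function names and the dictionary layout are assumptions made for this example, and the formula parser handles only simple formulae without brackets or R groups. The hexokinase reaction with commonly used charged formulae serves as the test case.

```python
import re
from collections import Counter

def parse_formula(formula):
    """Return element counts for a simple formula such as 'C6H12O6'."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts

def imbalance(stoichiometry, metabolites):
    """Return (unbalanced elements, net charge) for one reaction.

    stoichiometry: metabolite id -> coefficient (negative = consumed).
    metabolites:   metabolite id -> (formula, charge).
    """
    elements = Counter()
    charge = 0
    for met_id, coeff in stoichiometry.items():
        formula, met_charge = metabolites[met_id]
        for element, n in parse_formula(formula).items():
            elements[element] += coeff * n
        charge += coeff * met_charge
    unbalanced = {e: n for e, n in elements.items() if n != 0}
    return unbalanced, charge

# Hexokinase: glucose + ATP -> glucose 6-phosphate + ADP + H+
metabolites = {
    "glc": ("C6H12O6", 0),
    "atp": ("C10H12N5O13P3", -4),
    "g6p": ("C6H11O9P", -2),
    "adp": ("C10H12N5O10P2", -3),
    "h": ("H", 1),
}
hexokinase = {"glc": -1, "atp": -1, "g6p": 1, "adp": 1, "h": 1}
print(imbalance(hexokinase, metabolites))  # ({}, 0): balanced

# Using the neutral ATP formula instead breaks both balances.
neutral = dict(metabolites, atp=("C10H16N5O13P3", 0))
print(imbalance(hexokinase, neutral))  # ({'H': -4}, -4)
```

With the neutral ATP formula the reaction appears to lose four hydrogens and four units of charge, which is the kind of inconsistency that a uniform reference pH removes.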

The COBRA toolbox56 was used to identify a subset of 10,600 reactions involving 5,835 metabolites, representing the stoichiometrically consistent flux balance model. The final model was tested for 431 model objectives representing essential biochemical functions of the human body. Model debugging mostly involved adding extracellular and intracellular transport reactions. Examples include the addition of novel transport proteins for bile acids and folate intermediates. Novel intracellular transport proteins, namely the mitochondrial pyruvate carriers (MPC1, GeneID: 51660 and MPC2, GeneID: 25874), were added for phenylpyruvate transport, which operates by a proton symport mechanism57. These transport reactions connected the intracellular and extracellular compartments of the model, enabling flux consistency. The relevant scientific literature was manually curated to obtain complete information on the respective biochemical pathway. A typical example is the addition of 4-methyl-thio-oxo-butyrate (an intermediate of methionine metabolism) to the network. Upon literature curation, an alternative route comprising methionine transamination and decarboxylation reactions was identified and added (Supplementary Data 1).
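The role of added transport reactions in debugging can be illustrated with a purely topological dead-end check on a toy network. This sketch is an assumption-laden simplification: it only asks whether each metabolite has at least one producing and one consuming reaction, which is weaker than the flux-consistency analysis performed with the COBRA toolbox. All reaction and metabolite identifiers are hypothetical.

```python
def dead_end_metabolites(reactions):
    """List metabolites that cannot be both produced and consumed.

    reactions: reaction id -> (stoichiometry dict, reversible flag).
    A reversible reaction counts as producer and consumer of all its
    participants.
    """
    producers, consumers, mets = set(), set(), set()
    for stoich, reversible in reactions.values():
        for met, coeff in stoich.items():
            mets.add(met)
            if coeff > 0 or reversible:
                producers.add(met)
            if coeff < 0 or reversible:
                consumers.add(met)
    return sorted(m for m in mets if m not in producers or m not in consumers)

# Toy network: a is taken up, converted to b, then to c, which has no outlet.
network = {
    "EX_a": ({"a_e": -1}, True),              # exchange with the environment
    "T_a": ({"a_e": -1, "a_c": 1}, True),     # transport into the cytosol
    "R1": ({"a_c": -1, "b_c": 1}, False),
    "R2": ({"b_c": -1, "c_c": 1}, False),
}
print(dead_end_metabolites(network))  # ['c_c']

# Adding a transporter and an exchange reaction for c resolves the dead end.
repaired = dict(network)
repaired["T_c"] = ({"c_c": -1, "c_e": 1}, False)
repaired["EX_c"] = ({"c_e": -1}, False)
print(dead_end_metabolites(repaired))  # []
```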

Please refer to Supplementary Note 1 and Supplementary Data 1–10 for detailed information on the network building and refinements.

GEM-PRO reconstruction.

We followed the previously described procedure26 to map, assess, and refine PDB structures or homology models for integration into genome-scale models. For Recon3D, the gene identifier mapping workflow was extended to address inconsistencies in gene isoforms across database entries and to link isoforms to available homology models. In addition, QC/QA steps were taken to ensure that the correct sequence was retrieved (Supplementary Fig. 5 and Supplementary Note 3). For PDB structures with missing residues, we filled the gaps by querying previously generated databases of I-TASSER homology models58,59 and by manually generating homology models, using previously defined protocols26,60, for genes that were not part of these databases. In the final master GEM-PRO data frame (Supplementary Data 11), we note where available homology models have been mapped to their respective genes. For most homology modeling procedures, the amino acid sequence of a protein is all that is required to generate a homology model. For PDB structures with unresolved residues or gaps, a homology model can also be generated to improve structural coverage of the amino acid sequence. Homology models were not generated for sequences longer than 600 amino acids. We assessed the overall quality of the information coming from homologous templates in terms of (i) the organism from which the protein was crystallized, (ii) the resolution of the PDB template, and (iii) the deposition date. We used these properties to compare the templates used to construct homology models in the previous GEM-PRO models with those of the recently updated versions (Supplementary Tables 2–4 and Supplementary Fig. 6).

To identify structures for the metabolites in Recon3D, we evaluated several databases in which metabolite structures are publicly available, such as PDB (ligand-expo: http://ligand-expo.rcsb.org/, http://ligand-expo.rcsb.org/ld-search.html), PubChem61 (https://pubchem.ncbi.nlm.nih.gov/), and ChEBI (http://www.ebi.ac.uk/chebi/). We downloaded structures in various formats: 2D structures in .mol format (ChEBI), 3D structures in .sdf format (PubChem61), and structures in .pdb/.xyz format (RCSB). Supplementary Data 14 provides all the information processed for metabolites in Recon3D, including SMILES and InChI descriptors, Kyoto Encyclopedia of Genes and Genomes (KEGG)62 IDs, CID IDs, CID file names, ChEBI file names, ChEBI IDs, experimental coordinate file URL locations, and the ideal coordinate file name. The ChEBI mapping procedure comprised the following steps: (i) identifying the metabolite in ChEBI using the source link (the search started from the metabolite name as given in Supplementary Data 14); (ii) checking the molecular formula and charge (neutral or charged) of the metabolite in the ChEBI database; (iii) recording the ChEBI link, ChEBI ID, SMILES, and InChI in the respective fields of the data set spreadsheet; and (iv) downloading the 2D structure in .mol format. The same overall search was conducted in PubChem and PDB (Ligand Expo), with slight variations in the initial search inputs and output file types.

The data set of human single nucleotide polymorphisms (SNPs) and single nucleotide variants (SNVs) was collected from UniProt, from a subset of protein-altering variants from the 1000 Genomes Project. Furthermore, all SNPs and SNVs for model genes were downloaded directly from dbSNP (http://www.ncbi.nlm.nih.gov/SNP/) via the Ensembl BioMart interface63. We then selected all variants whose predicted functional impact was characterized as 'damaging' or 'possibly damaging' by the PolyPhen2 bioinformatics tool41. The missense mutations were also functionally annotated using SIFT (http://sift.jcvi.org/). In addition, we linked the missense variants to their gene–drug associations (clinically relevant pharmacogenomics interactions) using the PharmGKB pharmacogenomics database (https://www.pharmgkb.org/). All annotated gene–drug pairs contain information such as dosing guidelines and drug label annotations, and each pair is generally specified in more than one type of annotation (dosing guideline, drug label, clinical annotation, variant annotation, VIP, or pathway). These selected pharmacogenomic associations allow us to understand whether certain missense variants have functional effects on drug therapies. All selected missense variants and their drug associations are provided as Supplementary Data 15 and 16.
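The selection and linking steps described above amount to a filter followed by a join on the gene. A minimal sketch is given below; the record layout, field names, gene symbols, variants, and drug names are all hypothetical placeholders, and the prediction labels are simply those named in the text rather than the exact output vocabulary of PolyPhen2.

```python
# Labels as named in the text; treated here as plain strings.
DAMAGING_LABELS = {"damaging", "possibly damaging"}

def select_damaging(variants):
    """Keep variants whose predicted impact is in DAMAGING_LABELS."""
    return [v for v in variants if v["polyphen2"] in DAMAGING_LABELS]

def link_to_drugs(variants, gene_drug_pairs):
    """Attach drug associations by gene; drop variants without any."""
    linked = []
    for v in variants:
        drugs = sorted(gene_drug_pairs.get(v["gene"], []))
        if drugs:
            linked.append({**v, "drugs": drugs})
    return linked

# Placeholder records, not real annotations.
variants = [
    {"gene": "GENE_A", "variant": "p.Ala10Val", "polyphen2": "damaging"},
    {"gene": "GENE_B", "variant": "p.Gly20Ser", "polyphen2": "benign"},
    {"gene": "GENE_C", "variant": "p.Arg30Cys", "polyphen2": "possibly damaging"},
]
gene_drug_pairs = {"GENE_A": {"drug_x", "drug_y"}}

selected = select_damaging(variants)
linked = link_to_drugs(selected, gene_drug_pairs)
print([v["gene"] for v in selected])  # ['GENE_A', 'GENE_C']
print(linked[0]["drugs"])             # ['drug_x', 'drug_y']
```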

More details on the process and procedure for network reconstruction, protein and metabolite structure integration, identification of representative protein domains, linking to pharmacogenomics databases, linking to cancer genome atlases, mutation hotspot analyses, and comparison of tissue-specific cancer and pharmacogenomic/gene variation networks are all provided in Supplementary Notes 3 and 4.

Atom–atom mapping.

Generating atom mapping data requires chemical structures, reaction stoichiometry, and an atom mapping algorithm. Atom mappings were predicted using the Reaction Decoder Tool64 and the DREAM algorithm65 for 7,535 (86%) mass-balanced reactions with implicit and explicit hydrogens, respectively, while the Reaction Decoder Tool and the CLCA algorithm66 were used to predict atom mappings for a further 269 reactions with incompletely specified metabolites (e.g., R group) with implicit and explicit hydrogens, respectively. We compared these predictions for internal reactions to a set of 512 reactions with atom mappings that we and others manually curated (Supplementary Note 3). This reaction set is representative of all six top-level Enzyme Commission (EC) numbers. Based on this comparison, we observed that the predicted atom mappings are highly accurate for most of the reaction types28 (Supplementary Fig. 7).
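A comparison of predicted and curated atom mappings can be sketched as follows. This is a deliberately naive illustration, not the evaluation used in the study: each mapping is assumed to be a dictionary from reactant atom index to product atom index, and two mappings count as equal only if they are identical, which ignores chemically equivalent atoms that can make differing mappings equally valid. Reaction identifiers are hypothetical.

```python
def mapping_accuracy(predicted, curated):
    """Fraction of shared reactions whose predicted mapping equals the
    curated one, plus the list of reactions that disagree."""
    shared = sorted(set(predicted) & set(curated))
    if not shared:
        raise ValueError("no reactions in common")
    mismatched = [r for r in shared if predicted[r] != curated[r]]
    accuracy = (len(shared) - len(mismatched)) / len(shared)
    return accuracy, mismatched

# Toy mappings: reactant atom index -> product atom index.
predicted = {"R1": {1: 1, 2: 2}, "R2": {1: 2, 2: 1}, "R3": {1: 1}}
curated = {"R1": {1: 1, 2: 2}, "R2": {1: 1, 2: 2}, "R3": {1: 1}, "R4": {1: 1}}

accuracy, mismatched = mapping_accuracy(predicted, curated)
print(round(accuracy, 3), mismatched)  # 0.667 ['R2']
```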

3D mutation hotspot analysis.

We filtered a set of mutations (whose genes are associated with experimental protein structures) based on whether the location of the mutated residue itself was resolved (e.g., certain protein domains are unresolved because flexible or unstructured regions of the protein are challenging to crystallize). Once the subset of mutations was established to (i) be linked to genes with experimental protein structures and (ii) be located within regions of the protein that were experimentally determined, we carried out 3D structure alignments between all proteins and their representative domains (mapping to representative protein domains is described previously in the section entitled “mapping and alignment of PDBs to their representative domains”). In contrast to sequence alignments, 3D structure alignments find a best fit in terms of the 3D shape or geometry of two proteins. Therefore, any two proteins that have different sequences but share a common domain architecture can be successfully aligned in 3D space. Like a sequence alignment, a 3D structural alignment provides a direct residue-to-residue mapping for residues that occupy structurally equivalent positions in a shared domain motif. Once this residue-to-residue mapping was established for all proteins in our data set, we located 3D “hotspot” mutations by tallying all residues in the representative domains that map to mutated residues in a given protein of interest. Consequently, a residue in a representative domain may have multiple hits if more than one gene is linked to that representative domain and the same structurally equivalent residue is mutated across various genes. Supplementary Data 17 provides the mapping from the residue number of the UniProt missense variant, to the PDB residue number, to the PDB chain where the residue is located, to the representative domain ID linked to that PDB chain, and finally to the structurally equivalent residue within that representative domain.
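The tallying step can be sketched with a counter keyed by representative-domain residue. The data layout below is an assumption for illustration: the residue-to-residue mapping from the structural alignment is taken as given, and all chain, domain, and gene identifiers are placeholders.

```python
from collections import Counter

def tally_hotspots(mutations, residue_map):
    """Count mutated residues per structurally equivalent domain residue.

    mutations:   iterable of (gene, pdb_chain, residue_number).
    residue_map: (pdb_chain, residue_number) -> (domain_id, domain_residue),
                 as obtained from a 3D structural alignment.
    """
    hits = Counter()
    genes = {}
    for gene, chain, resnum in mutations:
        target = residue_map.get((chain, resnum))
        if target is None:
            continue  # residue unresolved or outside the aligned domain
        hits[target] += 1
        genes.setdefault(target, set()).add(gene)
    return hits, genes

residue_map = {
    ("chain_1", 45): ("DOM1", 12),
    ("chain_2", 101): ("DOM1", 12),  # structurally equivalent to chain_1:45
    ("chain_2", 150): ("DOM1", 60),
}
mutations = [
    ("GENE_A", "chain_1", 45),
    ("GENE_B", "chain_2", 101),
    ("GENE_B", "chain_2", 150),
    ("GENE_B", "chain_2", 999),  # not covered by the alignment
]

hits, genes = tally_hotspots(mutations, residue_map)
print(hits.most_common(1))           # [(('DOM1', 12), 2)]
print(sorted(genes[("DOM1", 12)]))   # ['GENE_A', 'GENE_B']
```

Here residue 12 of the placeholder domain receives two hits because mutations in two different genes map to the same structurally equivalent position.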

Mapping cancer mutations in 3D.

We used the TCGA level 3 variant data in the cBioPortal (http://www.cbioportal.org/). For this study, we used high-level (processed) data from a subset of pre-analyzed mutations from 178 tumor–normal pairs of lung squamous cell carcinoma36. When the MutSig1.0 approach was applied to this data set35, it identified 450 genes as significantly mutated. Starting from this set of genes, we identified a subset of 86 genes that have UniProt accession numbers and protein structural information. Within this set of genes, we found that 889 somatic cancer mutations map to residues that have been successfully resolved in the crystallographic structures of proteins. We used the list of 86 genes to query the cBioPortal web-based data set and downloaded information including somatic cancer mutations, cancer study sample IDs, amino acid mutations, annotations (from various sources, such as http://oncokb.org/ and https://www.mycancergenome.org/), type of mutation, copy number changes, overlapping mutations in COSMIC, the predicted functional impact score (from Mutation Assessor), variant allele frequency in the tumor sample, and total number of nonsynonymous mutations in the sample. A summary of the cancer data sets used in this study is given in Supplementary Data 21, and a detailed summary of all somatic mutations for this set of genes is provided in Supplementary Data 22 and 23. The 3D hotspot analysis was carried out as detailed above, and mutations were rank-ordered by the number of mutations that fell within a 5 Å sphere (i.e., the number of nearest neighbors). We performed a sensitivity analysis to understand whether the selection of data points had an effect on the significance of these results.
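The rank-ordering by nearest neighbors can be sketched as a simple distance count. This is a minimal illustration under stated assumptions: each mutated residue is represented by a single 3D coordinate in ångström (which atom or centroid is used is not specified here), the sphere is centered on each residue in turn, and the residue labels and coordinates are invented.

```python
import math

def neighbor_counts(coords, radius=5.0):
    """Count, for each residue, the other residues within `radius`."""
    counts = {}
    for name, xyz in coords.items():
        counts[name] = sum(
            1
            for other, other_xyz in coords.items()
            if other != name and math.dist(xyz, other_xyz) <= radius
        )
    return counts

def rank_by_neighbors(coords, radius=5.0):
    """Residue names sorted by descending neighbor count, then by name."""
    counts = neighbor_counts(coords, radius)
    return sorted(counts, key=lambda name: (-counts[name], name))

# Invented coordinates (in ångström) of four mutated residues.
coords = {
    "A": (0.0, 0.0, 0.0),
    "B": (3.0, 0.0, 0.0),
    "C": (0.0, 4.5, 0.0),
    "D": (20.0, 0.0, 0.0),
}
print(neighbor_counts(coords))    # {'A': 2, 'B': 1, 'C': 1, 'D': 0}
print(rank_by_neighbors(coords))  # ['A', 'B', 'C', 'D']
```

The quadratic loop is adequate for a few hundred mutations; a spatial index would be preferable for much larger sets.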

The above 3D hotspot analysis approach was also applied to 22 genes from which cancer mutations have already been analyzed53 (exome samples of 291 glioblastomas) and 92 genes involved in cholesterol metabolism, owing to the fact that cholesterol biosynthesis plays an important role in GBM39.

Statistical tests.

We performed a sensitivity analysis to understand whether the selection of data points had an effect on the significance of these results. We find that the 3D hotspot analysis is more likely to select somatic mutations than a random selection. Subsets of 50–700 data points were selected, covering a fraction of 0.065–0.91 of the total data set. We performed the 3D hotspot analysis across the different selections and found P values in the range 0.017–0.049, compared to 0.182–0.241 using a random residue selection.

For annotations of mutations in known oncogenes and at known hotspots, selection by the 3D hotspot analysis matters regardless of the number of mutations (or percentage of data) selected (P < 0.05). For a random selection, the P value computed using a two-tailed t-test is > 0.1. We also performed a sensitivity analysis using the slices of the total data set mentioned above (50–500 data points) and computed the total number of known oncogenes and known hotspots (from previously published analyses) recovered by the 3D hotspot analysis compared to a random selection. We find that the percentage of data selected is significantly higher using the 3D hotspot analysis. For known oncogenes, 37–83% of the data is selected using the 3D hotspot analysis compared to 0.046–0.43 at random. Similarly, for known hotspots, 72.5–88.3% of the data is selected using the 3D hotspot analysis compared to 9.8–64% at random. See Supplementary Note 5 for more information.
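The comparison against a random selection can be illustrated with a small empirical baseline. This sketch does not reproduce the t-test reported above; it only shows, on invented data, how the fraction of known annotated mutations recovered by a ranked top-k selection can be compared with the average over random selections of the same size. The function names, the seed, and the number of draws are assumptions.

```python
import random

def recovered_fraction(selected, known):
    """Fraction of the known annotated items present in the selection."""
    return len(set(selected) & known) / len(known)

def random_baseline(all_items, known, k, draws=2000, seed=0):
    """Mean recovered fraction over random selections of size k."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        total += recovered_fraction(rng.sample(all_items, k), known)
    return total / draws

# Invented example: ten mutations ranked by hotspot score, three annotated.
ranked = [f"m{i}" for i in range(10)]
known = {"m0", "m1", "m3"}
k = 4

top_k = recovered_fraction(ranked[:k], known)
baseline = random_baseline(ranked, known, k)
print(top_k)  # 1.0: all three annotated mutations are in the top four
print(baseline > 0.3 and baseline < 0.5)  # True; the expectation is k/n = 0.4
```

Repeating this for several values of k gives a sensitivity curve analogous in spirit to the slices of the data set described above.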

Gene deletion simulations in GBM.

In silico single gene deletion (SGD) simulations were performed as previously described67. Given a certain GEM, the simulation of an SGD was performed by formulating the linear programming problem (1) for each gene g in the GEM:

where v_obj is the flux through the biomass equation, γ is an arbitrary number set to 1, S is the stoichiometric matrix of the GEM (that is, an m × n matrix where m is the number of metabolites and n is the number of reactions and each (i,j) entry is the stoichiometric coefficient of the metabolite corresponding to row i in the reaction corresponding to column j), v is the vector containing the values of the fluxes through each reaction in the GEM, and j indexes each exchange reaction known to be present in a rich mammalian medium (Ham's medium, HAM; see Supplementary Note 5 for more details). The simulation was carried out for the following GEMs: Recon3D, HMR2.00, and 22 personalized GEMs for glioblastoma multiforme (GBM) previously reconstructed using HMR2.00 as a template from as many GBM expression profiles retrieved from The Cancer Genome Atlas8.
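Based only on the definitions above, a linear program of this kind can be sketched as follows. This is a hedged illustration and not necessarily the exact form of problem (1): the set R_g of reactions disabled by deleting gene g, the sign convention in which uptake corresponds to negative exchange flux, and the zero lower bound on exchange reactions outside the medium are all assumptions introduced here.

```latex
\begin{aligned}
\max_{v}\quad & v_{\mathrm{obj}} \\
\text{s.t.}\quad & S\,v = 0, \\
& v_j \ge -\gamma \quad \text{for each exchange reaction } j \text{ in the medium (HAM)}, \\
& v_k \ge 0 \quad \text{for each exchange reaction } k \text{ not in the medium (assumed)}, \\
& v_i = 0 \quad \text{for each reaction } i \in R_g \text{ (assumed)}.
\end{aligned}
```

Under this reading, deleting gene g is simulated by forcing the fluxes of its dependent reactions to zero and re-optimizing the biomass flux under the same medium constraints.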

Drug perturbation analysis.

To compute metabolic pathways with gene expression perturbed by drugs, the human metabolic network model was first converted into an irreversible network. Then, the MetChange algorithm42 was run using gene expression presence/absence P values from the Connectivity Map (Cmap) database44 build 02. Drug indications were taken from the Side Effect Resource (SIDER) database45 for all available drugs overlapping with the Cmap database. Synonyms were aggregated when present, as with side effects. A minimum of ten drugs per indication was required for inclusion in the analysis, corresponding to a much greater number of expression sets for each indication. A total of 48 drug indications were analyzed for 1,459 expression sets corresponding to 334 drugs. A genetic algorithm (Supplementary Fig. 15) was then implemented as described in Supplementary Note 6. Details of the gene indication signatures can be found in Supplementary Note 6.
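The conversion to an irreversible network mentioned at the start of this section can be sketched as splitting every reaction that may carry negative flux into a forward and a backward copy with non-negative bounds. This is a generic illustration, not the implementation used in the study; the `_f` and `_b` suffixes and the tuple layout are hypothetical.

```python
def to_irreversible(reactions):
    """Split reactions so that every flux is non-negative.

    reactions: reaction id -> (stoichiometry dict, lower bound, upper bound).
    Returns a dict in the same layout with all lower bounds >= 0.
    """
    irreversible = {}
    for rid, (stoich, lb, ub) in reactions.items():
        if ub > 0:
            # Forward part keeps the original stoichiometry.
            irreversible[rid + "_f"] = (dict(stoich), max(lb, 0.0), ub)
        if lb < 0:
            # Backward part flips the stoichiometry and mirrors the bounds.
            reverse = {met: -coeff for met, coeff in stoich.items()}
            irreversible[rid + "_b"] = (reverse, max(-ub, 0.0), -lb)
    return irreversible

reactions = {
    "R_rev": ({"a": -1, "b": 1}, -10.0, 10.0),  # reversible
    "R_fwd": ({"b": -1, "c": 1}, 0.0, 5.0),     # already irreversible
}
irr = to_irreversible(reactions)
print(sorted(irr))     # ['R_fwd_f', 'R_rev_b', 'R_rev_f']
print(irr["R_rev_b"])  # ({'a': 1, 'b': -1}, 0.0, 10.0)
```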

Code availability.

“IndiFinder.m” contains the MATLAB implementation of the genetic algorithm for finding metabolic signatures underlying drug indications, given the presence or absence of the indication for a given sample. It is submitted as Supplementary Software and requires prior installation of the COBRA Toolbox. The code is commented for guidance.

Life Sciences Reporting Summary.

Further information on experimental design is available in the Life Sciences Reporting Summary.

Data availability.

Recon3D is available as a metabolic reconstruction at http://vmh.life. A Git repository that contains the GEM-PRO, GBM-specific model files and simulations, and gene deletion simulations can be accessed via https://github.com/SBRG/Recon3D. The Recon3D GEM-PRO has been consolidated into a shareable JSON file and submitted as Supplementary Data 27, which can be used to start structural analyses. This model assigns a single representative structure per gene in the reconstructed metabolic model. The accompanying software package required for reading and working with the GEM-PRO JSON is available at https://github.com/SBRG/ssbio. This entire repository can be cloned to a user's computer and contains Jupyter notebooks in the root directory that guide a user through the content available in the Recon3D GEM-PRO model (Recon3D_GP - Loading and Exploring the GEM-PRO.ipynb) and show how to update the model with revised sequence information or newly deposited structures in the PDB (Recon3D_GP - Updating the GEM-PRO.ipynb). This repository also includes all sequence and structure files mapped per gene, metadata downloaded through UniProt and the PDB, and the ability to rerun the QC/QA pipeline with different parameters such as sequence identity and resolution cutoffs. These notebooks also include basic visualization features enabled with the NGL viewer package68.