Abstract
Genome-scale network reconstructions have helped uncover the molecular basis of metabolism. Here we present Recon3D, a computational resource that includes three-dimensional (3D) metabolite and protein structure data and enables integrated analyses of metabolic functions in humans. We use Recon3D to functionally characterize mutations associated with disease, and identify metabolic response signatures that are caused by exposure to certain drugs. Recon3D represents the most comprehensive human metabolic network model to date, accounting for 3,288 open reading frames (representing 17% of functionally annotated human genes), 13,543 metabolic reactions involving 4,140 unique metabolites, and 12,890 protein structures. These data provide a unique resource for investigating molecular mechanisms of human metabolism. Recon3D is available at http://vmh.life.
Main
It is widely recognized that progress in the biomedical sciences is hampered by the difficulty of integrating multiple disparate data types to obtain a coherent understanding of physiology and disease1. A genome-scale network reconstruction represents a curated knowledge-base containing many different data types and sources, including high-quality genome annotation, assessment of biochemical properties of gene products, and a wide array of physiological functional information. Computational genome-scale models integrate large-scale omics data from these knowledge-bases to aid in the interpretation and prediction of biological functions2. In recent years, human metabolic network reconstructions3,4,5,6 have generated insights into inborn errors of metabolism7, cancer8, and human microbiome co-metabolism9,10.
Using metabolic reconstructions, researchers store and continually update information about chemical reactions in a standardized biochemical and genetic representation11. Over the past ten years, updating the human metabolic network reconstruction has focused on expanding metabolic reaction coverage. From the first human reconstruction, Recon1 (ref. 4), to the most recent version, Recon2 (ref. 3), the content has been expanded from 1,496 genes (corresponding to 3,311 reactions) to 1,675 genes (7,785 reactions). Various other reconstructions have been released and community-driven efforts have been made to ensure interoperability of these resources3,5.
Historically, systems biology has focused on characterizing the catalytic or regulatory roles of proteins in metabolism without placing emphasis on the 3D structure of the proteins themselves. For example, studies on genetic variation have mainly focused on frequency of occurrence12 or sequence-based attributes13. Only recently have mutations been explored in the context of their 3D location or spatial relationship to each other14,15,16,17. Exploring mutations in 3D extends beyond nucleotide sequence identity18, as mutations that may be far away from each other in linear sequence may be proximal in the folded state. The availability of protein and metabolite structure data has enabled the progression of systems biology to a 3D perspective. In one study, protein structures were mapped to the metabolic network of Escherichia coli, to reveal the role of ribosome pausing in co-translational protein folding19. In another, human population variation was studied by integrating protein structures with the human erythrocyte metabolic network to understand the adverse effects of drugs on genetic variants20 and identify new pathways related to drug perturbation. These studies highlight the value of integrating different types of data to address complex biological questions.
We present Recon3D, an updated and expanded human metabolic network reconstruction that integrates pharmacogenomic associations, large-scale phenotypic data, and structural information for both proteins and metabolites. Recon3D contains over 6,000 more reactions than Recon2, all of which were manually curated to remove redundant or blocked reactions. We use Recon3D to prioritize putative disease-causing genetic variants by mapping single-nucleotide variants (SNVs) to protein structures. We show that deleterious mutations are more likely to cluster together into functional hotspots than non-deleterious mutations. In contrast to previous models, these mutation hotspots identify ACAT1 as a cancer-related gene. Furthermore, we demonstrate how structural information can be used to investigate the potential mechanisms by which drugs exert an effect on metabolism. Recon3D provides new avenues for investigating the molecular basis of disease and may aid the development of treatment strategies, biomarkers, and drug repurposing.
Results
Increasing the scope of the human metabolic network
We expanded Recon 2 (ref. 3) by using ten metabolomic data sets to identify new metabolites and transport and catalyzing reactions (1,865 reactions). We added reactions from HMR 2.0 (ref. 21) (2,478), a drug module22 (721), a transport module23 (51), host–microbe reactions10 (24), and absorption and metabolism of dietary compounds (20). Overall, 66 metabolic subsystems, including lipoprotein (44 reactions) and bile acid (216 reactions), were expanded and 10 new subsystems were added (Supplementary Figs. 1, 2, 3, Supplementary Data 1 and 2, and Supplementary Note 1). We further refined numerous aspects of the reconstruction, including 2,181 gene–protein–reaction (GPR) associations, reaction/metabolite duplication, reaction directionality, and thermodynamic feasibility (Online Methods). The metabolic scope was extended by 82% for reactions (total 13,543) and 58% for unique metabolites (total 4,140) (Fig. 1a,b). Recon3D is the most comprehensive metabolic resource currently available (Supplementary Table 1a–c). Out of the 20,266 human proteins documented in UniProt24 (queried July 2016), 19,213 are functionally annotated (i.e., not hypothetical) and 17% of this subset is metabolic, well-characterized, and included in Recon3D.
Genome-scale network reconstructions can be converted into computational models that enable predictive biology2. We derived a computational model from Recon3D by removing reactions that were stoichiometrically inconsistent or flux inconsistent (i.e., reactions that could not carry flux under the applied reaction bounds; Supplementary Note 2 and Supplementary Data 1). After performing standard quality-control tests, the resulting generic Recon3D model contained 10,600 reactions (78% of the reconstruction reactions) and was able to reproduce literature-consistent energy (ATP) yields from different carbon sources (Supplementary Data 3–10), to fulfill metabolic functions describing cellular and whole body metabolism, and to replicate the predictions of infant growth from a previous study25 (Supplementary Fig. 4).
Enabling a 3D view of metabolism
Using a recently established approach26, the metabolic network content of Recon3D was expanded to include 3D protein structures from the Protein Data Bank (PDB)27 as well as homology models (Fig. 2a, Supplementary Fig. 5 and Supplementary Data 11–13). In addition, we mapped content from a variety of external database resources, to include metabolite structures (Supplementary Data 14). We obtained high-quality structural coverage for over 80% of the human metabolic proteome (Fig. 2a, Supplementary Fig. 6 and Supplementary Tables 2–4) and 85% of the metabolome (Fig. 2b). Furthermore, we used 2,369 unique metabolite structures to algorithmically trace atom transitions28 (from each substrate to product atom) for 7,804 (87%) internal, mass-balanced reactions of the Recon3D-derived model (Supplementary Note 3). The prediction accuracy of the algorithms was validated by comparison with 512 manually curated atom-mapped reactions (Fig. 2c and Supplementary Fig. 7). The atom mappings enable identification of conserved moieties, which are the fundamental structural units of any chemical reaction network. Hence, we provide an invaluable bridge between metabolic modeling and chemoinformatics. For the first time, relationships between human metabolic genes, their encoded proteins, and the reactions they catalyze can be described in the context of specific 3D configurations, interactions, and properties (Fig. 1c,d).
Web visualization of protein structures in metabolic networks
Using Recon3D, we have implemented the first web-based visualization of 3D macromolecular structures in the context of their neighboring chemical reactions, metabolites, and their metabolic subsystems (e.g., glycolysis, citric acid cycle, amino acid metabolism, and carbohydrate metabolism, among others). This tool utilizes a recently developed global human network map29, together with network visualization software and conversion tools (Supplementary Note 4 and Supplementary Figs. 8, 9, 10, 11) and is available through the RCSB PDB website (http://www.rcsb.org/). The systems biology interface provides users with the ability to visualize networks that have been annotated to highlight which reactions are associated with experimental crystallographic structures, homology models, or metabolite structures (Fig. 2d). Dataframes for Recon3D are available in the GitHub repository (https://github.com/SBRG/Recon3D).
Gene variation in 3D
We probed mutations in the context of representative protein domains (i.e., common structural regions redundant across the proteome). Such domains (e.g., tim-barrel motif) are often linked directly to their encoding gene's function, and thus provide a new way to directly assess the functional impact of a mutation.
We used Recon3D to map missense mutations from the Single Nucleotide Polymorphism database (dbSNP)30, UniProt24, and PharmGKB31, among others, to the metabolic network, using a previously established pipeline20 (Fig. 3a). We chose to focus on SNPs that were known to be deleterious or potentially harmful. In total, we mapped 3,536 SNPs to 655 genes within Recon3D. We identified representative protein domains for this set of genes using a structure-based clustering algorithm32. We tallied the number of SNPs (or SNVs) occurring in each protein domain and found the gene-to-domain ratio to be <1 (i.e., domain redundancy; Supplementary Fig. 12a). This analysis resulted in the identification of specific regions within protein domains that are commonly mutated (mutation hotspots), share common disease associations, and are prone to malfunction. Six genes share the Bruton's tyrosine kinase representative domain (PDP:4RFZa, PF007714) and, when mutated, are affiliated with diseases such as cancer (Fig. 3b). This kinase domain is known for its role in non-small-cell lung cancer33, and the SNPs associated with lung cancer cluster in one specific region of the protein (see the red-colored mutation hotspot in Fig. 3b).
The power of exploring gene variation in the context of both protein and network structure is further illustrated by Aryl sulfatase A (ARSA). Within the subset of SNPs that map to the representative domain of ARSA (SCOP: d1e2sp_), the mutation P428L (P426L in PDB 1e2s; dbSNP rs28940893) is associated with metachromatic leukodystrophy disease (MLD)34. This mutation influences the biological assembly of ARSA, in which the native homo-octamer state (Fig. 4a) is disfavored relative to the dimeric state (Fig. 4b). Other SNPs associated with the most severe form of MLD are located in the vicinity of the metal binding site, a mutation hotspot (Fig. 4c). ARSA is also located within a 'network hotspot', with other deleterious SNPs dispersed throughout the neighborhood of surrounding reactions (Fig. 4d). All mappings between SNPs, PDB, their representative domain, hotspots, and disease relevance are provided in Supplementary Data 15–20.
Oncogenic mutations cluster in structurally equivalent positions in the human proteome
The first application of Recon3D demonstrates its capability to discriminate pathogenic mutations from passenger mutations. We studied 889 somatic cancer mutations in 86 genes (which were previously analyzed35) from whole-exome sequence data of 178 tumor–normal pairs of lung squamous cell carcinoma36. Furthermore, we obtained detailed annotations about each of the mutations from cBioportal37, including whether a gene is a known oncogene37,38 or the mutation is recurrent12, has a gain-of-function (GOF) mutation, and has a drug association. Using Recon3D, we mapped each of the mutations to its corresponding protein, and the protein's representative domain(s) and network reaction(s) (Fig. 5a).
Analysis of all cancer mutations in the context of their representative protein domains suggests that oncogenic mutations cluster in structurally equivalent positions within representative domains. For the 86 genes, we counted the number of mutations that occur within 5 Å of another mutation within the representative domain (referred to as the 3D hotspot analysis; Online Methods and Supplementary Note 5). In some cases, mutations from different genes co-occurred in the same region of a shared domain, suggesting that the shared domain plays an important role in oncogenesis. Mutations co-occurring in the same location as other mutations are significantly more likely to be associated with somatic mutations, when compared to a random selection (P < 0.02, using a two-tailed t-test; Fig. 5b). All data mapping related to the somatic cancer mutations can be found in Supplementary Data 21–23.
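The neighbor count at the core of the 3D hotspot analysis can be illustrated with a minimal sketch. It assumes that each mutation is represented by a single 3D coordinate within the representative domain and that "within 5 Å" is inclusive; the function name `count_neighbors` and the four coordinates are hypothetical and are not taken from the Recon3D data or pipeline.

```python
from math import dist

def count_neighbors(coords, cutoff=5.0):
    """For each mutation, count the other mutations within `cutoff` angstroms."""
    counts = []
    for i, a in enumerate(coords):
        n = sum(1 for j, b in enumerate(coords) if i != j and dist(a, b) <= cutoff)
        counts.append(n)
    return counts

# Hypothetical coordinates (in angstroms) of four mutated residues in one domain.
mutations = {
    "A": (0.0, 0.0, 0.0),
    "B": (3.0, 0.0, 0.0),
    "C": (0.0, 3.0, 0.0),
    "D": (20.0, 20.0, 20.0),
}

neighbor_counts = dict(zip(mutations, count_neighbors(list(mutations.values()))))
print(neighbor_counts)
```

In this toy example, A, B, and C form a cluster in which each has two neighbors, whereas D is isolated. The pairwise loop is quadratic in the number of mutations, which is acceptable for a single domain; a spatial index would be a natural choice for larger sets.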
Filtering mutations based on their spatial relationships brings about several significant biomedical implications. When mutations are rank-ordered by the number of neighboring mutations, we can filter the mutations with known roles in oncogenesis (based on known annotations37; Fig. 5c). For example, we find that selecting the top 25% of data by this ranking recovers 82% and 88% of known oncogenic mutations and GOF mutations (based on analysis via co-occurrence aids), respectively (compared with only 1.6% of oncogenic mutations and 2.9% of GOF mutations when selected at random; Fig. 5c; for a sensitivity analysis, see Supplementary Note 5). Furthermore, striking similarities in protein structure, based on 3D structure alignments, indicate that not only do mutations co-occur in shared domains, they also occur in structurally similar proteins within the same data set (Supplementary Fig. 12b). These findings suggest that cancer mutations cluster in functionally relevant parts of protein domains and that this property could guide the discovery of novel biomarkers and drug targets.
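The rank-and-filter step can be sketched as follows. This is an illustration only, assuming that a neighbor count is already available for each mutation; the function name `recall_in_top_fraction`, the mutation labels, and all values are hypothetical, and ties are broken by input order rather than by any rule stated in the study.

```python
def recall_in_top_fraction(neighbor_counts, known, fraction=0.25):
    """Fraction of `known` mutations recovered in the top-ranked `fraction`.

    neighbor_counts: dict mapping mutation label -> number of neighboring mutations
    known: set of labels annotated as, e.g., oncogenic
    """
    if not known:
        raise ValueError("no annotated mutations to recover")
    # Rank by descending neighbor count; sorted() is stable, so ties keep input order.
    ranked = sorted(neighbor_counts, key=neighbor_counts.get, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    top = set(ranked[:k])
    return len(top & known) / len(known)

# Hypothetical neighbor counts for eight mutations.
toy_counts = {"m1": 5, "m2": 4, "m3": 1, "m4": 0, "m5": 0, "m6": 2, "m7": 0, "m8": 1}
toy_known = {"m1", "m6"}  # hypothetical annotations

recall = recall_in_top_fraction(toy_counts, toy_known)
print(recall)
```

Here the top 25% comprises two mutations (m1 and m2), so one of the two annotated mutations is recovered.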
We combined our approach with metabolic modeling to understand whether structural information could improve the predictive power of the model. We focused on glioblastoma multiforme (GBM), a malignant brain tumor, and studied the mutational landscape of metabolic genes (Fig. 5a). Genes were selected based on the rate of mutation found in exome samples of 291 glioblastomas as well as involvement in cholesterol metabolism39 (Supplementary Note 5). Gene knockdowns were performed and the essential genes were compared across different generic and cell-type-specific human metabolic models (Recon3D, HMR2, and HMR-derived and TCGA-derived models8). Notably, the majority of models predicted the gene ACAT1 (GeneID 38) to be non-essential (Fig. 5d and Supplementary Fig. 13). Yet, a 3D hotspot analysis of the mutations in this gene suggested that it may be important in cancer (Fig. 5e). This finding was recently validated, confirming that inhibition of ACAT1 suppresses GBM growth by blocking SREBP-1-mediated lipogenesis40. This result highlights the potential for structure-based analysis in genome-scale models to identify important genes for cell growth.
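The way a simulated gene knockdown propagates to reactions through GPR associations can be sketched as below. This is a simplified illustration, not the procedure used in the study: each GPR rule is assumed to be written as a list of alternative enzyme complexes (isozymes), each complex being a set of required genes, and the gene and reaction names are hypothetical. Determining essentiality would additionally require a growth simulation with the disabled reactions constrained to zero flux, which is not shown.

```python
def reaction_active(gpr, knocked_out):
    """Return True if at least one enzyme complex has all of its genes intact.

    gpr: list of sets; each set is one complex (AND), the list is alternatives (OR)
    knocked_out: set of gene identifiers removed in the simulation
    """
    return any(all(g not in knocked_out for g in complex_) for complex_ in gpr)

def disabled_reactions(gprs, knocked_out):
    """List the reactions that lose all catalyzing complexes after a knockout."""
    return sorted(r for r, gpr in gprs.items() if not reaction_active(gpr, knocked_out))

# Hypothetical GPR associations.
gprs = {
    "R1": [{"G1", "G2"}],    # one complex that requires both G1 and G2
    "R2": [{"G3"}, {"G4"}],  # two isozymes, either of which suffices
}

print(disabled_reactions(gprs, {"G1"}))
print(disabled_reactions(gprs, {"G3"}))
```

Knocking out G1 disables R1 because its only complex is incomplete, whereas knocking out G3 leaves R2 active through the isozyme encoded by G4.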
Co-occurring mutations across shared protein domains are significantly more deleterious
We used Recon3D to identify potentially deleterious mutations in a large-scale population study. We analyzed SNP data from multiple gene variation databases (dbSNP30, UniProt24, and PharmGKB31) and assessed whether the 3D location of variants in a gene could, in general, discern whether mutations were deleterious or tolerated41. We mapped over 10,000 SNPs to their 3D structural coordinates using our 3D hotspot analysis workflow and computed the number of mutations co-occurring in 5- and 10-Å spheres in common protein domains. 1,385 unique genes had 3,649 SNPs that mapped to regions of a protein where structural data exist. We computed the number of mutational co-occurrences across this set of SNPs and found that deleterious mutations are much more likely to neighbor other deleterious mutations (P < 0.05 using a two-tailed t-test) than those predicted to be tolerated (P > 0.1, using a two-tailed t-test; Supplementary Fig. 14 and Supplementary Tables 5 and 6). These added features enable predictive power over any existing model, in that mutational data can be assessed in the context of protein structure and compared with network-level, genome-wide model knockdowns (e.g., Fig. 5e). Prior reconstructions are unable to identify structural changes that affect complex assembly or other intrinsic protein properties. Such details can now be explicitly studied using Recon3D. To this end, Recon3D provides new inroads for metabolic models to explore disease-relevant mutations.
Elucidating relationships between drug indications and their metabolic responses
Drug interventions influence the behavior of metabolic networks42, but the impact of drug treatment on metabolic responses and the mechanisms underlying these responses are poorly understood. We used Recon3D to combine large-scale data on drugs, their indications, and their effects on gene expression. These data were used to guide and inform genome-scale constraint-based modeling analyses42,43 to identify the metabolic pathways most perturbed in a given condition (Supplementary Fig. 15). More specifically, we used a machine-learning approach to assess similarities in metabolic responses to a given drug (Fig. 6a). Using a genetic algorithm, the area under the curve (AUC) of the receiver operating characteristic (ROC) curve was maximized to predict the indication of the drug based on the type and degree of perturbation (Fig. 6b, Supplementary Data 24 and Supplementary Note 6). Finally, we use the structural information in Recon3D to provide insights into the possible mechanisms by which the drugs exert their effects on metabolic pathways.
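The quantity being maximized, the area under the ROC curve, can be computed directly from predicted scores and binary labels. The sketch below uses the rank-based (Mann–Whitney) formulation, in which the AUC is the probability that a randomly chosen positive scores higher than a randomly chosen negative. The scores and labels are hypothetical, and the genetic algorithm that performs the optimization is not reproduced here.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Each positive-negative pair contributes 1 if ranked correctly, 0.5 if tied.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical perturbation scores for five drugs; label 1 = drug has the indication.
scores = [0.9, 0.8, 0.3, 0.6, 0.2]
labels = [1, 1, 0, 0, 1]

auc = roc_auc(scores, labels)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives context for the median AUC values reported per indication.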
We first grouped 6,040 transcriptomic profiles (exposed to over 1,200 drug compounds in breast, leukemia, and prostate cancer cell lines from the Connectivity Map, or CMap44) by drug indication, using information from the Side Effect Resource (SIDER) database45 (Supplementary Table 7). A total of 47 drug indications were analyzed in the context of the metabolic network, using a previously described machine-learning approach42 (Supplementary Fig. 15). The analysis revealed that indication-specific drugs induced similar patterns of gene expression changes or 'gene indication signatures'. Our findings suggest that metabolic responses are significantly conserved for a wide range of drugs (Fig. 6b), with the most conserved pathway perturbations occurring for antipsychotic drugs (median AUC of 0.80; Supplementary Data 25). For this specific case, the gene indication signature is composed of nine genes that have been previously associated with schizophrenia (Supplementary Table 8). We also find associations between changes in lipid and cholesterol pathways and common antipsychotic drug side effects (weight gain, cardiovascular risk, and anti-inflammatory effects). Notably, some drugs with entirely different indications shared similar pathway-level changes with antipsychotic drugs and had previously been tested as adjunctive schizophrenia treatments46 (Supplementary Table 9). A list of the drugs with the most predictable metabolic responses is provided in Supplementary Data 26.
We then used protein and metabolite structural data in Recon3D to probe for mechanistic insights into drug response. In general, understanding mechanistic details entails identifying single or multiple targets of drug binding (or off-target binding) and the respective downstream effects. Information in Recon3D can be visualized as a topological network to indicate shared features across nodes (genes) in a gene indication signature. Displayed in Figure 6c is one connected hub of genes (antipsychotic gene indication signature) and several features for comparison: protein structural domains, metabolites, biochemical reactions, and disease relevance. For this signature, we found several overlapping features, such as the metabolic subsystems targeted by known drugs (e.g., lovastatin and fatty acid metabolism) and the function of certain protein domains (e.g., an influence in membrane binding/trafficking). Despite these shared domain functions, minimal structural alignment of the protein domains and metabolites indicates that the majority of genes in this signature are not direct drug targets, but may play a role in compensatory signaling pathways that mediate drug effects synergistically. Finally, structural alignment of the drug compounds themselves yielded unexpected results; drugs that induced the same pattern of perturbation (both drugs with known antipsychotic action and unrelated drug indications) were found to be structurally diverse (Fig. 6d). This finding is surprising given that drug discovery efforts tend to emphasize small changes in molecular structure to tune a desired biochemical effect. Here, we find that structurally diverse molecules exert similar effects on metabolic pathways, highlighting the potential of Recon3D for drug repurposing and the design of multitargeted therapies that support a new polypharmacological paradigm in drug research47,48.
Discussion
Recon3D is the first network reconstruction to include protein and metabolite structures as well as atom–atom mappings. Recon3D provides functional insights into genetic variation and the mechanisms underlying the effects of drugs on metabolic response in humans. It also serves as a computable knowledge-base with clear functional connectivity between genes and biochemical pathways. Pairing Recon3D with biomedical data provides a compelling avenue for studying disease at scale.
Recon3D integrates multiple layers of biological data and provides a tool to study variation and its impact on individual proteins and complex pathways. The inclusion of different data types offers new opportunities for network reconstruction in that it introduces atomic-scale properties, such as ligand-binding interactions; it provides new avenues for precision medicine by exploring human variation14,15; and it enables the probing of genetic variation via changes in the molecular properties of proteins20. In this way, individual sequence variations can be explicitly represented and the functional connections among disease, genetic perturbation, and drug action can be probed systematically.
Recon3D enables straightforward data integration, as its content has been linked to external databases (KEGG, PDB, CHEBI, PharmGKB, UniProt). This knowledge-base can be converted into a genome-scale model, which can be computationally interrogated and characterized. Constraint-based methods43 can be used to assess network properties, and bioinformatics tools32 can be used to assess protein or metabolite properties. Our findings present preliminary, yet compelling, support for the potential of Recon3D to complement traditional structure-based approaches for empowering applications in drug discovery and target validation. We have shown that a systematic exploration of mutations in the context of their 3D spatial relationship provides a unique means for filtering out functionally relevant mutations and determining potential genes of interest. Furthermore, analysis of in vitro drug-treated gene expression profiling in the context of the human metabolic network provides insight into the broad metabolic response to different drug therapies.
Recon3D provides a framework for integrating structure–function relationships and assessing specific and proteome-wide effects of sequence variation. Integrated frameworks like Recon3D enable understanding of how mutations or binding events lead to downstream responses and could aid in the identification of novel targets when coupled to structural bioinformatics16, molecular dynamics simulations20,49, and kinetic modeling50. In contrast, current metabolic models are not able to contextualize the effect of a sequence variant (beyond gene deletions) and therefore cannot be used to study disease-relevant mutations. Recon3D will potentially aid in translating biomedical knowledge, from large-scale omics data to drug discovery, target identification, and clinical biomarker development. Future efforts are likely to extend to precision medicine applications where drug responses can be assessed in the context of individual patient-specific genomes. Recon3D is available via two databases3,51 (http://bigg.ucsd.edu/ and http://vmh.life).
Methods
Metabolic reconstruction.
Recon3D has been assembled using multiple data sources, that is, HMR 2.00 (ref. 6) (2,478 reactions), metabolomics data sets (1,865 reactions), a drug module22 (721 reactions), a transport module (51 reactions), host–microbe reactions (24 reactions), absorption and metabolism of dietary compounds (20 reactions), and others (1,004 reactions). The 'others' category included reactions that captured metabolism in specific human organs (e.g., kidney), as well as novel metabolic pathways of lipoproteins, bile acids, and sphingolipids. The expansion of Recon 2 was performed in an iterative manner (Supplementary Fig. 1). Each addition was followed by extensive model debugging and manual curation for flux consistency and refinement.
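As a consistency check, the per-source reaction counts listed above can be summed; the total agrees with the 6,163 novel reactions reported for the expansion of Recon 2. The dictionary keys are shorthand labels chosen for this illustration.

```python
# Reaction counts per data source, as listed in the text.
added_reactions = {
    "HMR 2.0": 2478,
    "metabolomics data sets": 1865,
    "drug module": 721,
    "transport module": 51,
    "host-microbe reactions": 24,
    "dietary compounds": 20,
    "others": 1004,
}

total_added = sum(added_reactions.values())
print(total_added)  # 6163
```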
Recon 2 was expanded in two stages: (i) additions of new reactions and (ii) network refinements for building a high-quality flux-consistent model (Supplementary Fig. 1). The total number of novel additions included 6,163 reactions, 1,589 metabolites, and 1,654 genes completing Recon3D. These new reactions were mostly from transport (32%), lipid metabolism (24%), exchange (19%), xenobiotic (11%), and amino acid (7%) metabolism (Supplementary Fig. 2b,c). Other major additions include those required for debugging the network for flux consistency (10% of newly added reactions), reactions representing organ-specific metabolism (7%), transport module (2% of newly added reactions), and those representing lipoprotein metabolism (2% of newly added reactions), novel dietary compounds and their associated reactions (1% of newly added reactions), and reactions capturing interaction between gut microbes and host (1% of newly added reactions). For details on the precise metabolic pathways, see Supplementary Note 1.
The largest contributions of new metabolic genes came from (i) lipid metabolism (10%), (ii) carbohydrate metabolism (5%), (iii) transport processes (5%), (iv) amino acid metabolism (3%), and (v) nucleotide metabolism and vitamin metabolism (1%) (Supplementary Fig. 2). The miscellaneous category mostly contained genes from HMR 2.0 (99%) (Supplementary Fig. 2). The largest contributions of new metabolites came from the lipid (42%) and amino acid (19%) classes. Novel metabolites added in other subsystems include miscellaneous and xenobiotics (18%), carbohydrates (2%), vitamins (1.4%), and nucleotide (0.3%) metabolism (Supplementary Fig. 2).
Once reactions and genes were added to Recon3D, the reconstruction was subjected to various quality control or quality assurance tests (Supplementary Fig. 1). These included (i) checking for reaction and metabolite duplication, (ii) modification of gene-protein-reaction associations, (iii) modification of metabolite formulae to pH 7.2 along with mass-charge balancing of reactions, (iv) a leak test, checking for stoichiometric and flux consistency and checking for thermodynamic feasibility54, (v) debugging and curation for removal of dead-end metabolites, and (vi) checking for network accomplishment of defined functions/tests (Supplementary Fig. 1).
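One of these checks, the detection of dead-end metabolites, can be illustrated with a purely topological sketch: a metabolite is flagged if the network can only produce it or only consume it. This is a simplified stand-in and not the COBRA toolbox routine used for Recon3D; the function name `dead_end_metabolites`, the data layout, and the toy network are assumptions made for illustration.

```python
def dead_end_metabolites(reactions):
    """Return metabolites that are only produced or only consumed.

    reactions: dict mapping reaction name -> (stoichiometry, reversible), where
    stoichiometry maps metabolite -> coefficient (negative = consumed).
    """
    produced, consumed = set(), set()
    for stoich, reversible in reactions.values():
        for met, coeff in stoich.items():
            # A reversible reaction can both produce and consume each participant.
            if coeff > 0 or reversible:
                produced.add(met)
            if coeff < 0 or reversible:
                consumed.add(met)
    return sorted(produced ^ consumed)

# Hypothetical four-reaction network: c is produced but never consumed.
toy_network = {
    "EX_a": ({"a": 1}, False),            # uptake of a
    "R1": ({"a": -1, "b": 1}, False),     # a -> b
    "R2": ({"b": -1, "c": 1}, False),     # b -> c
    "EX_b": ({"b": -1}, False),           # secretion of b
}

dead_ends = dead_end_metabolites(toy_network)
print(dead_ends)
```

In the toy network, adding a transport or exchange reaction for c would resolve the dead end, in the same spirit as the transport reactions added during model debugging. A full flux-consistency check additionally requires solving linear programs over the stoichiometric matrix.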
To check reaction and metabolite duplication, we took several approaches. First, Quek et al.55 reported 95 duplicate metabolites, 71 of which were replaced (Supplementary Data 9 and Supplementary Note 1). Second, reaction and metabolite duplication was checked for HMR reactions and metabolites (before inclusion in Recon3D). The metabolite formulae, particularly those received from HMR 2.0, were adjusted to an internal pH of 7.2, using mol files28, the COBRA toolbox56, and ChemAxon software (https://chemicalize.com/). This led to correct assignment of reaction stoichiometry and mass-charge-balancing of reactions. Third, gene-protein-reaction associations were curated and corrected for 2,180 reactions (Supplementary Data 6 and 7 and Supplementary Note 1). Finally, we performed additional QC/QA tests (e.g., functional leaks, production of matter from water and oxygen, etc.).
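Mass and charge balancing can be illustrated with a short sketch that sums elemental composition and charge over a reaction. It is not the COBRA toolbox implementation; the helper names are hypothetical, the formula parser handles only flat formulae without brackets, and the formulae and charges for ATP hydrolysis are illustrative values for the protonation states commonly assumed near pH 7.2.

```python
import re

def parse_formula(formula):
    """Parse a flat chemical formula such as 'C10H12N5O13P3' into element counts."""
    counts = {}
    for element, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(n) if n else 1)
    return counts

def net_balance(stoich, formulas, charges):
    """Return (element imbalance, charge imbalance); ({}, 0) means balanced."""
    elements, charge = {}, 0
    for met, coeff in stoich.items():
        for el, n in parse_formula(formulas[met]).items():
            elements[el] = elements.get(el, 0) + coeff * n
        charge += coeff * charges[met]
    return {el: n for el, n in elements.items() if n != 0}, charge

# Illustrative formulae and charges (assumed values, not taken from Recon3D files).
formulas = {"atp": "C10H12N5O13P3", "h2o": "H2O", "adp": "C10H12N5O10P2",
            "pi": "HO4P", "h": "H"}
charges = {"atp": -4, "h2o": 0, "adp": -3, "pi": -2, "h": 1}

# atp + h2o -> adp + pi + h
hydrolysis = {"atp": -1, "h2o": -1, "adp": 1, "pi": 1, "h": 1}
# The same reaction with the proton omitted, as can happen before pH adjustment.
missing_proton = {"atp": -1, "h2o": -1, "adp": 1, "pi": 1}

print(net_balance(hydrolysis, formulas, charges))
print(net_balance(missing_proton, formulas, charges))
```

The second call reports a deficit of one hydrogen and one unit of charge on the product side, which is the kind of discrepancy that adjusting formulae to a common pH is intended to remove.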
The COBRA toolbox56 was used to identify a subset of 10,600 reactions involving 5,835 metabolites, representing the stoichiometrically consistent flux balance model. The final model was tested for 431 model objectives, representing essential biochemical functions of the human body. The model debugging was mostly done by the addition of extracellular and intracellular transport reactions. Examples include the addition of novel transport proteins for bile acids and folate intermediates. Novel intracellular transport proteins, namely the mitochondrial pyruvate carriers (MPC1, GeneID: 51660 and MPC2, GeneID: 25874), were added for phenylpyruvate transport, which operates by a proton symport mechanism57. These transport reactions connected the intracellular and extracellular compartments of the model, enabling flux consistency. The relevant scientific literature was manually curated to obtain complete information on the respective biochemical pathways. A typical example is the addition of 4-methyl-thio-oxo-butyrate (an intermediate of methionine metabolism) to the network. Upon literature curation, the alternative route of methionine transamination and decarboxylation reactions was identified and added (Supplementary Data 1).
Please refer to Supplementary Note 1 and Supplementary Data 1–10 for detailed information on the network building and refinements.
GEM-PRO reconstruction.
We followed the previously described procedure26 to map, assess, and refine PDB or homology models for integration into genome-scale models. For Recon3D, additions to the gene identifier mapping workflow were made to address inconsistencies in gene isoforms across database entries and to enable linking isoforms to available homology models. In addition, QC/QA steps were taken to ensure that the correct sequence was being retrieved (Supplementary Fig. 5 and Supplementary Note 3). For PDB structures with missing residues, we filled in the gaps by querying previously generated databases of I-TASSER homology models58,59, and by manually generating homology models for genes that were not part of these databases using previously defined protocols26,60. In the final master GEM-PRO data frame (Supplementary Data 11), we note where available homology models have been mapped to their respective genes. For most homology modeling procedures, the amino acid sequence of a protein is all that is required to generate a homology model of a protein. It is important to note that for certain PDB structures with unresolved residues or gaps in the structure, a homology model can also be generated to enhance the structural coverage of the amino acid sequence. Homology models were not generated for any sequences longer than 600 amino acids. We assessed the overall quality of the information coming from homologous templates in terms of (i) which organism the protein was crystallized from, (ii) the resolution of the PDB template, and (iii) the deposition date. We used these properties to compare the templates that were used to construct homology models in the previous GEM-PRO models with those of the recently updated versions (Supplementary Tables 2–4 and Supplementary Fig. 6).
To identify structures for the given set of metabolites in Recon3D, we evaluated a number of databases where metabolite structures are publicly available, such as PDB (Ligand Expo: http://ligand-expo.rcsb.org/, http://ligand-expo.rcsb.org/ld-search.html), PubChem61 (https://pubchem.ncbi.nlm.nih.gov/), and ChEBI (http://www.ebi.ac.uk/chebi/). We downloaded structures in various formats: 2D structures in .mol format (ChEBI), 3D structures in .sdf format (PubChem61), and structures in .pdb/.xyz format (RCSB). Supplementary Data 14 provides all the information processed for metabolites in Recon3D, which includes SMILES and InChI descriptors, Kyoto Encyclopedia of Genes and Genomes (KEGG)62 IDs, CID IDs, CID file names, ChEBI file names, ChEBI IDs, experimental coordinate file URL locations, and the ideal coordinate file name. The ChEBI mapping procedure comprised the following steps: (i) identifying the particular metabolite in ChEBI using the source link (the starting point of the search was the metabolite name, taken from the metabolite names in Supplementary Data 14); (ii) checking the molecular formula and charge (neutral or charged) of the metabolite in the ChEBI database; (iii) recording the ChEBI link, ChEBI ID, SMILES, and InChI in the respective fields of the data set spreadsheet; and (iv) downloading the 2D structure in .mol format. The same overall search was conducted in PubChem and PDB (Ligand Expo), with slight variations in the initial search inputs and file type outputs.
The data set of human single nucleotide polymorphisms (SNPs) and single nucleotide variants (SNVs) was collected from UniProt from a subset of protein altering variants from the 1000 Genomes Project. Furthermore, all SNPs and SNVs for model genes were downloaded directly from dbSNP (http://www.ncbi.nlm.nih.gov/SNP/) via the Ensembl BioMart interface63. We then selected all variants whose functional impact was predicted to be 'damaging' or 'possibly damaging' by the PolyPhen2 bioinformatics tool41. The missense mutations were also functionally annotated using SIFT (http://sift.jcvi.org/). In addition, we linked the missense variants to their gene–drug associations (clinically relevant pharmacogenomics interactions) using the PharmGKB pharmacogenomics database (https://www.pharmgkb.org/). All annotated gene–drug pairs contain information such as dosing guidelines and drug label annotations, and each pair is generally specified in more than one type of annotation (dosing guideline, drug label, clinical annotation, variant annotation, VIP, or pathway). These selected pharmacogenomic associations allow us to understand whether certain missense variants have functional effects on drug therapies. All selected missense variants and their drug associations are provided as Supplementary Data 15 and 16.
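The selection and linking steps described above can be illustrated with a minimal sketch. The record layout, gene names, drug names, and helper functions below are hypothetical and serve only to show the logic; they do not reproduce the columns of Supplementary Data 15 and 16 or the PharmGKB schema.

```python
from dataclasses import dataclass


# Hypothetical record layout, for illustration only.
@dataclass(frozen=True)
class Variant:
    gene: str
    protein_change: str
    polyphen2_call: str  # predicted impact label


# Impact labels retained, as named in the text above.
DAMAGING_CALLS = {"damaging", "possibly damaging"}


def select_damaging(variants):
    """Keep variants whose predicted impact is one of the retained labels."""
    return [v for v in variants
            if v.polyphen2_call.strip().lower() in DAMAGING_CALLS]


def attach_drugs(variants, gene_to_drugs):
    """Pair each variant with the drugs annotated for its gene (may be none)."""
    return {v: sorted(gene_to_drugs.get(v.gene, [])) for v in variants}


# Made-up example data.
variants = [
    Variant("GENE_A", "R10W", "damaging"),
    Variant("GENE_A", "A25T", "benign"),
    Variant("GENE_B", "G7S", "Possibly damaging"),
]
gene_to_drugs = {"GENE_A": {"drug_x", "drug_y"}}

selected = select_damaging(variants)
annotated = attach_drugs(selected, gene_to_drugs)
```

In this sketch a variant without any gene–drug annotation is kept with an empty drug list, so the filtering and the linking steps remain separable.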
More details on the process and procedure for network reconstruction, protein and metabolite structure integration, identification of representative protein domains, linking to pharmacogenomics databases, linking to cancer genome atlases, mutation hotspot analyses, and comparison of tissue-specific cancer and pharmacogenomic/gene variation networks are all provided in Supplementary Notes 3 and 4.
Atom–atom mapping.
Generation of atom mapping data requires chemical structures, reaction stoichiometry and an atom mapping algorithm. Atom mappings were predicted using the Reaction Decoder Tool64, and the DREAM algorithm65 for 7,535 (86%) mass balanced reactions with implicit and explicit hydrogens, respectively, while Reaction Decoder Tool and the CLCA algorithm66 were used to predict atom mappings for a further 269 reactions with incompletely specified metabolites (e.g., R group) with implicit and explicit hydrogens, respectively. We compared these predictions for internal reactions to a set of 512 reactions with atom mappings that we and others manually curated (Supplementary Note 3). This reaction set is representative of all six top-level (Enzyme Commission) EC numbers. Based on this comparison, we observed that the predicted atom mappings are highly accurate for most of the reaction types28 (Supplementary Fig. 7).
3D mutation hotspot analysis.
We filtered a set of mutations (whose genes are associated with experimental protein structures) based on whether the location of the mutated residue itself was resolved (e.g., certain protein domains are unresolved because flexible or unstructured regions of the protein are challenging to crystallize). Once the subset of mutations was established to (i) be linked to genes with experimental protein structures and (ii) be located within regions of the protein that were experimentally determined, we carried out 3D structure alignments between all proteins and their representative domains (mapping to representative protein domains was described previously in the section entitled “mapping and alignment of PDBs to their representative domains”). In contrast to sequence alignments, 3D structure alignments find a best fit in terms of the 3D shape or geometry of two proteins. Therefore, any two proteins that have different sequences but share a common domain architecture can be successfully aligned in 3D space. As with sequence alignments, the 3D structural alignment provides a direct residue-to-residue mapping for residues that occupy structurally equivalent positions in a common/shared domain motif. Once this residue-to-residue mapping was established for all proteins in our data set, we located 3D “hotspot” mutations by tallying all residues in the representative domains that map to mutated residues in a given protein of interest. Consequently, certain residues in a representative domain may have multiple hits if more than one gene is linked to that representative domain and the same structurally equivalent residue is mutated across various genes. Supplementary Data 17 provides the mapping from the residue number of the UniProt missense variant, to the PDB residue number, to the PDB chain where the residue is located, to the representative domain ID linked to a given PDB chain, to the structurally equivalent residue within that representative domain.
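The tallying step can be sketched as follows, assuming the residue-to-residue mapping from the structural alignment is already available as a lookup table. All gene names, domain IDs, and residue numbers are invented, and the function name is hypothetical; the structural alignment itself is not shown.

```python
from collections import Counter

# Hypothetical lookup produced by a 3D structural alignment:
#   (gene, residue number in that gene's structure)
#       -> (representative domain ID, structurally equivalent domain residue)
residue_map = {
    ("GENE_A", 45): ("DOM_1", 12),
    ("GENE_A", 80): ("DOM_1", 47),
    ("GENE_B", 101): ("DOM_1", 12),
    ("GENE_C", 33): ("DOM_2", 5),
}

# Made-up mutated positions. ("GENE_C", 999) has no structural equivalent,
# mimicking a residue that falls outside the aligned domain.
mutations = [
    ("GENE_A", 45),
    ("GENE_B", 101),
    ("GENE_A", 80),
    ("GENE_C", 33),
    ("GENE_C", 999),
]


def tally_hotspots(mutations, residue_map):
    """Count how many mutated residues map onto each domain residue."""
    hits = Counter()
    for key in mutations:
        target = residue_map.get(key)
        if target is not None:
            hits[target] += 1
    return hits


hits = tally_hotspots(mutations, residue_map)
ranked = hits.most_common()  # domain residues ordered by number of hits
```

Here residue 12 of the made-up domain DOM_1 receives two hits because structurally equivalent residues are mutated in two different genes, which is the situation the text describes as a residue with multiple hits.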
Mapping cancer mutations in 3D.
We used the TCGA level 3 variant data in the cBioPortal (http://www.cbioportal.org/). For this study, we used high-level (processed) data from a subset of pre-analyzed mutations from 178 tumor–normal pairs of lung squamous cell carcinoma36. When the MutSig1.0 approach was applied to this data set35, it identified 450 genes as significantly mutated. Starting from this set of genes, we identified a subset of 86 genes that have UniProt accession numbers and protein structural information. Within this set of genes, we found that 889 somatic cancer mutations map to residues that have been successfully resolved in the crystallographic structures of proteins. We used the list of 86 genes to query the cBioPortal web-based data set and downloaded various information, including somatic cancer mutations, cancer study sample IDs, amino acid mutations, annotations (coming from various sources, such as http://oncokb.org/ and https://www.mycancergenome.org/), type of mutation, copy number changes, overlapping mutations in COSMIC, the predicted functional impact score (from Mutation Assessor), variant allele frequency in the tumor sample, and total number of nonsynonymous mutations in the sample. A summary of cancer data sets used in this study is given in Supplementary Data 21, and a detailed summary of all somatic mutations for this set of genes is provided in Supplementary Data 22 and 23. The 3D hotspot analysis was carried out as detailed above, and mutations were rank-ordered on the basis of how many mutations fell within a 5 Å sphere (i.e., number of nearest neighbors).
The above 3D hotspot analysis approach was also applied to 22 genes from which cancer mutations have already been analyzed53 (exome samples of 291 glioblastomas) and 92 genes involved in cholesterol metabolism, owing to the fact that cholesterol biosynthesis plays an important role in GBM39.
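The rank-ordering by number of neighbors within a 5 Å sphere can be sketched as below. The sketch assumes one representative coordinate per mutated residue (for example a Cα position), which is an assumption and not a detail stated in the text; the coordinates and residue labels are invented.

```python
import math

RADIUS = 5.0  # sphere radius in Å, as quoted in the text


def neighbor_counts(coords, radius=RADIUS):
    """For each mutated residue, count the other mutated residues within radius."""
    names = list(coords)
    counts = {name: 0 for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if math.dist(coords[a], coords[b]) <= radius:
                counts[a] += 1
                counts[b] += 1
    return counts


# Made-up coordinates (Å), one point per mutated residue.
coords = {
    "res_12": (0.0, 0.0, 0.0),
    "res_47": (3.0, 0.0, 0.0),
    "res_90": (0.0, 4.5, 0.0),
    "res_150": (20.0, 20.0, 20.0),
}

counts = neighbor_counts(coords)
# Rank by descending neighbor count, breaking ties by label.
ranked = sorted(counts, key=lambda r: (-counts[r], r))
```

The pairwise loop is quadratic in the number of mutated residues, which is adequate for a few hundred points; a spatial index would be the usual choice for much larger sets.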
Statistical tests.
We performed a sensitivity analysis to understand whether the selection of data points had an effect on the significance of these results. We find that the 3D hotspot analysis is more likely to select somatic mutations than a random selection. Data points (50–700) were selected so that 0.065–0.91 of the total data set was covered. We performed the 3D hotspot analysis across the different selections and found P values in the range 0.017–0.049, compared to 0.182–0.241 using a random residue selection.
For annotations of mutations that are known oncogenes and known hotspots, selection of the databases on 3D hotspot analysis is important, regardless of the number of mutations (or % of data) selected (P < 0.05). For a random selection, the P value computed using a two-tailed t-test is > 0.1. We also performed a sensitivity analysis using the slices of the total data set mentioned above (50–500 data points) and computed the total number of known oncogenes and known hotspots (from previously published analyses) selected by the 3D hotspot analysis compared to a random selection. We find that the percentage of data selected is significantly higher using the 3D hotspot analysis. For known oncogenes, 37–83% of the data is selected using the 3D hotspot analysis compared to 0.046–0.43 at random. Similarly, for known hotspots, 72.5–88.3% of the data is selected using the 3D hotspot analysis compared to 9.8–64% at random. See Supplementary Note 5 for more information.
Gene deletion simulations in GBM.
In silico single gene deletion (SGD) simulations were performed as previously described67. Given a certain GEM, an SGD was simulated by formulating linear programming problem (1) for each gene g in the GEM:
where v_obj is the flux through the biomass equation, γ is an arbitrary number set to 1, S is the stoichiometric matrix of the GEM (that is, an m × n matrix where m is the number of metabolites and n is the number of reactions and each (i,j) entry is the stoichiometric coefficient of the metabolite corresponding to row i in the reaction corresponding to column j), v is the vector containing the values of the fluxes through each reaction in the GEM, and j indexes each exchange reaction known to be present in a rich mammalian medium (Ham's medium, HAM; see Supplementary Note 5 for more details). The simulation was carried out for the following GEMs: Recon3D, HMR2.00, and 22 personalized GEMs for glioblastoma multiforme (GBM) previously reconstructed using HMR2.00 as a template from as many GBM expression profiles retrieved from The Cancer Genome Atlas8.
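The display of problem (1) does not appear in this text. One form consistent with the symbol definitions above is sketched below. It is a reconstruction under stated assumptions, not the authors' equation: uptake is taken to be a negative exchange flux bounded by γ for medium components, all other exchange reactions are assumed closed to uptake, and the set R_g (the reactions that cannot carry flux once gene g is deleted) is a hypothetical symbol introduced here.

```latex
\begin{aligned}
\max_{v}\quad & v_{\mathrm{obj}} \\
\text{subject to}\quad & S\,v = 0, \\
& v_j \ge -\gamma \quad \text{for each exchange reaction } j \text{ of a HAM component}, \\
& v_k \ge 0 \quad \text{for every other exchange reaction } k \text{ (assumed: no uptake)}, \\
& v_r = 0 \quad \text{for each reaction } r \in R_g \text{ (assumed: reactions disabled by deleting } g\text{)}.
\end{aligned}
```

Under this reading, gene g would be considered essential when the optimal v_obj of the deleted model falls to zero, or below a chosen fraction of the value obtained without the deletion.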
Drug perturbation analysis.
To compute metabolic pathways with gene expression perturbed by drugs, the human metabolic network model was first converted into an irreversible network. Then, the MetChange algorithm42 was run using gene expression presence/absence p-values from the Connectivity Map (Cmap) database44 build 02. Drug indications were taken from the Side Effect Resource (SIDER) database45 for all available drugs overlapping with the Cmap database. Synonyms were aggregated when present as with side effects. A minimum of ten drugs for each indication was required for the inclusion in the analysis, corresponding to a much greater number of expression sets for each indication. A total of 48 drug indications were analyzed for 1,459 expression sets corresponding to 334 drugs. A genetic algorithm (Supplementary Fig. 15) was then implemented as described in Supplementary Note 6. Details of the gene indication signatures can be found in Supplementary Note 6.
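The conversion to an irreversible network mentioned above can be illustrated with a generic sketch: every reaction that can carry flux in both directions is split into a forward and a backward reaction, each with non-negative flux. The data layout and the "_f"/"_b" suffixes are hypothetical choices for this example and are not taken from the MetChange implementation or the COBRA Toolbox.

```python
def to_irreversible(reactions):
    """Rewrite a network so that every reaction has non-negative flux bounds.

    `reactions` maps a reaction ID to (stoichiometry, lower bound, upper bound),
    where stoichiometry maps metabolite IDs to coefficients.
    """
    irreversible = {}
    for rid, (stoich, lb, ub) in reactions.items():
        reversed_stoich = {met: -coef for met, coef in stoich.items()}
        if lb >= 0:
            # Already forward-only: keep unchanged.
            irreversible[rid] = (dict(stoich), lb, ub)
        elif ub <= 0:
            # Backward-only: flip the direction and the bounds.
            irreversible[rid + "_b"] = (reversed_stoich, -ub, -lb)
        else:
            # Reversible: split into forward and backward halves.
            irreversible[rid + "_f"] = (dict(stoich), 0.0, ub)
            irreversible[rid + "_b"] = (reversed_stoich, 0.0, -lb)
    return irreversible


# Toy network with one reversible, one forward-only, one backward-only reaction.
reactions = {
    "R1": ({"A": -1, "B": 1}, -10.0, 10.0),
    "R2": ({"B": -1, "C": 1}, 0.0, 5.0),
    "R3": ({"C": -1, "D": 1}, -4.0, 0.0),
}

irr = to_irreversible(reactions)
```

After the split, the net flux of an originally reversible reaction is the forward flux minus the backward flux, so flux magnitudes can be handled as non-negative quantities.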
Code availability.
“IndiFinder.m” contains the MATLAB implementation of the genetic algorithm for finding metabolic signatures underlying drug indications, given the presence or absence of the indication for each sample. It is submitted as Supplementary Software and requires prior installation of the COBRA Toolbox. The code is commented for guidance.
Life Sciences Reporting Summary.
Further information on experimental design is available in the Life Sciences Reporting Summary.
Data availability.
Recon3D is available as a metabolic reconstruction at http://vmh.life. A Git repository that contains the GEM-PRO, GBM-specific model files and simulations, and gene deletion simulations can be accessed via https://github.com/SBRG/Recon3D. The Recon3D GEM-PRO has been consolidated into a shareable JSON file and submitted as Supplementary Data 27, which can be used to start structural analyses. This model assigns a single representative structure per gene in the reconstructed metabolic model. The accompanying software package required for reading and working with the GEM-PRO JSON is available at https://github.com/SBRG/ssbio. The entire repository can be cloned to a user's computer and contains Jupyter notebooks in the root directory that guide a user through the content available in the Recon3D GEM-PRO model (Recon3D_GP - Loading and Exploring the GEM-PRO.ipynb) and show how to update the model with revised sequence information or newly deposited structures in the PDB (Recon3D_GP - Updating the GEM-PRO.ipynb). This repository also includes all sequence and structure files mapped per gene and metadata downloaded through UniProt and the PDB, and it allows the QC/QA pipeline to be rerun with different parameters such as sequence identity and resolution cutoffs. The notebooks also include basic visualization features enabled with the NGL viewer package68.
Acknowledgements
The results here are in whole or part based upon data generated by the TCGA Research Network: http://cancergenome.nih.gov/. This work was funded by the Novo Nordisk Foundation Center for Biosustainability and the Technical University of Denmark (grant number NNF10CC1016517), the National Institutes of Health (grant GM057089 to B.O.P.) and by the Luxembourg National Research Fund (FNR) through the National Centre of Excellence in Research (NCER) on Parkinson's disease and the ATTRACT programme (FNR/A12/01), by the European Union's Horizon 2020 research and innovation programme under grant agreement No 668738, by the Institutional Strategy of the University of Tübingen (German Research Foundation DFG, ZUK 63), and by Google Inc. (Summer of Code 2016). RCSB PDB is funded by the National Science Foundation (NSF DBI-1338415 to S.K.B.), the Department of Energy, and the National Institutes of Health (NIGMS and NCI). This research used resources of the National Energy Research Scientific Computing Center. The authors gratefully acknowledge P. Mischel and W. Zheng for experimental help and discussions on GBM, N. Lewis, A. McCammon, J. Mesirov, J.M. Thornton, J. Monk, and J. Lerman for scientific discussions and Z. King for help with Escher integration in RCSB PDB, M. Abrams for manuscript editing, V. Kohler and A.E. Kärcher-Dräger for drawing the platelet and RBC map in Escher, and F. Monteiro and M.A.P. Oliveira for help in reconstructing the dopamine subsystem.
Integrated supplementary information
Supplementary figures
- 1. Iterative model building of Recon 3.
- 2. Statistics of Recon 3.
- 3. Subsystem comparison between Recon 2 and Recon 3.
- 4. Simulations of infant growth on human breast milk.
- 5. GEM-PRO workflow for mapping gene identifiers to the UniProt, RefSeq, and Ensembl databases when considering isoforms.
- 6. Distribution of total energy-related (PSQS) scores for all 3D protein structures.
- 7. Predictive accuracy of algorithmically derived atom mappings versus manually curated atom mappings.
- 8. SBGN-PD map.
- 9. Escher map view of Recon 2.01.
- 10. Escher map of the human red blood cell.
- 11. Escher map of the human platelet cell.
- 12. Disease Networks in Recon3D.
- 13. Single gene deletion simulations for GBM-specific cell line models.
- 14. 3D hotspot analysis statistics.
- 15. Basic and detailed workflows for identifying drug-induced perturbed pathways and linking them to their indication.
Supplementary information
PDF files
- 1. Supplementary Text and Figures
  Supplementary Figures 1–15
- 2. Life Sciences Reporting Summary
- 3. Supplementary Tables and Supplementary Notes
  Supplementary Tables 1–9 and Supplementary Notes 1–6
Excel files
- 1. Supplementary Datafiles 1–10
  Reconstruction; Recon3D
- 2. Supplementary Datafiles 11–14
  File contains all GEM-PRO related content for Recon3D. Contains Supplementary Data Files 11–14.
- 3. Supplementary Datafiles 15–26
  File contains all mappings to variant disease SNPs/somatic mutations, FATCAT representative domain annotations and drug indication analyses. Contains Supplementary Data Files 15–26.
Zip files
- 1. Supplementary Datafile 27
  Recon3D GEM-PRO has been consolidated into a shareable JSON file, which can be used to start structural analyses.
- 2. Supplementary Software
  IndiFinder.m