A biochemical network modeling of a whole-cell

All cellular processes can ultimately be understood in terms of the fundamental biochemical interactions between molecules, which can be modeled as networks. Very often, these molecules are shared by more than one process, which interconnects those processes. Despite this, cellular processes are usually described by separate networks with heterogeneous levels of detail, such as metabolic, protein–protein interaction, and transcription regulation networks. Aiming to obtain a unified representation of cellular processes, we describe in this work an integrative framework that draws concepts from rule-based modeling. To probe the capabilities of the framework, we used an organism-specific database and genomic information to model the whole-cell biochemical network of Mycoplasma genitalium. The model accounted for 15 cellular processes and resulted in a single-component network, indicating that all processes are interconnected. The topological analysis of the network showed structural consistency with biological networks in the literature. To validate the network, we estimated gene essentiality by simulating gene deletions and compared the results with experimental data available in the literature. We classified 212 genes as essential, 95% of which were consistent with experimental results. Although we adopted a relatively simple organism as a case study, we suggest that the presented framework has the potential to pave the way for more integrated studies of whole organisms, leading to a systemic analysis of cells on a broader scale. Modeling other organisms with this framework could provide useful large-scale models for different fields of research such as bioengineering, network biology, and synthetic biology, and also provide novel tools for medical and industrial applications.

1 The Modeling of Particular Interactions

Cell Division Reaction
Cell division is a biochemical and mechanical event involving several molecules and structures. In the M. genitalium whole-cell network, it was modeled as a single reaction with all necessary molecules linked as modifiers. Figure S1 illustrates the reaction and Table S1 lists its components.

Replication Reactions
The template for the replication reactions is described in the main document. Table S2 displays the information about the illustrated nodes.

Transcription Reactions
The template for the transcription reactions is described in the main document. Table S3 displays the information about the illustrated nodes. Since these reactions are templates, the exact names of molecules and reactions depend on the gene in question. Thus, we use the placeholder GENE, which stands for the gene's name in the case of single genes or for the transcription unit's name in the case of polycistronic genes. The placeholder CHRM REG stands for the chromosome region.

Transcription Stall Reactions
A transcription reaction can be interrupted for several reasons. One of them is the collision with other molecules in the same region of a DNA strand. Here we modeled the stall reaction for transcribing complexes when a replication complex is in the next chromosome region. Since the transcription reaction can be interrupted at many chromosome regions, one incomplete RNA molecule is created for each stall reaction. The name of the molecule carries its sequence.

RNA Degradation Reactions
The RNA degradation reaction template is depicted in Figure S3. The Peptidyl-tRNA Hydrolase is needed only in the case of aminoacylated tRNAs. Modifications in RNAs were not taken into account due to inconsistencies in WholeCellKB. Table S5 shows the component names in WholeCellKB and in the network model.

Translation Reactions
The template for the translation reactions is described in the main document.

Translation Stall Reactions
Like transcription, the translation process can be interrupted for several reasons. However, when a translation complex stalls, the incomplete protein needs to be tagged with a specific amino acid sequence in order to be rapidly degraded. Thus, this process is represented by two template reactions: the stall of the translation complex and the translation of the signal peptide. Since we do not represent intermediate molecules during the translation process, all stalled translation reactions produce the same incomplete peptide, which contains only the degradation signal sequence. The reaction templates are described in Figure S4 and Table S9.

Protein Degradation Reactions
The proteins produced by the cell can be degraded in order to recycle amino acids, control protein concentrations, and remove defective proteins from the cytosol, among other reasons. Figure S5 and Table S10 show the template for protein degradation reactions. Depending on the protein's location (cytosol or membrane), different proteases can be recruited for its degradation. Proteins tagged with the Proteolysis Peptide are degraded by the membrane protease.
Figure S5: Protein Degradation Template.

2 Software Structure and Implementation
The software, called PiCell, was developed to build the Whole-Cell Extended Biochemical Network of Mycoplasma genitalium while remaining adaptable to other organisms. It is composed of three parts that can be accessed from Python 3 scripts:
• Database Handler
• PiCell Core
• Network Constructor
The database handler is the interface between databases and the PiCell Core. One handler should be implemented for each database to be used as a source for the model. The PiCell Core is responsible for organizing the data obtained from databases and creating intermediate molecules and reactions in order to fulfill the central dogma of biology in the model. When all necessary information is gathered in the PiCell Core, it can be exported as a single network model, with linked molecule and reaction nodes, following the framework proposed in this work. This model is then submitted to validation and analyses. Figure S6 shows a schematic of the software implemented to build the M. genitalium network.

Database Handler
The necessary information for the model was acquired from WholeCellKB through the WholeCellKB Handler, a piece of Python 3 code implemented specifically for this database. The data in the WholeCellKB database was available in several formats. The JSON format was chosen because of its ease of access from Python. In addition to the JSON database file, the Handler can read two other files: one containing the database entries to be ignored, and another containing a name mapping to be applied to the database entries.
Figure S6: The schematic implementation of PiCell, a software to build Whole-Cell Extended Biochemical Networks.
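
As an illustration only, the filtering and renaming steps performed by a database handler can be sketched as follows. The function name `load_entries`, the `id` and `type` field names, and the sample records are hypothetical; they are not taken from PiCell or from the actual WholeCellKB JSON schema.

```python
import json


def load_entries(json_text, ignore_ids=(), name_map=None):
    """Parse a JSON list of records, drop ignored ones, and rename the rest.

    Assumes (hypothetically) that every record carries an "id" field.
    """
    name_map = name_map or {}
    ignored = set(ignore_ids)
    entries = {}
    for record in json.loads(json_text):
        entry_id = record["id"]
        if entry_id in ignored:
            continue  # entry listed in the "ignore" file
        new_id = name_map.get(entry_id, entry_id)  # apply the name mapping
        entries[new_id] = {**record, "id": new_id}
    return entries


# Made-up records standing in for a database export.
sample = json.dumps([
    {"id": "MG_001", "type": "gene"},
    {"id": "ATP", "type": "metabolite"},
    {"id": "OBSOLETE_1", "type": "metabolite"},
])
entries = load_entries(sample, ignore_ids=["OBSOLETE_1"], name_map={"ATP": "atp"})
```

Keeping the ignore list and the name mapping as inputs separate from the database file means that corrections can be versioned without editing the source data.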

Model Builder
The modeling is controlled through an IPython Notebook using the Jupyter interface. Before the database information is acquired, the model must be configured. Information about the canonical cellular processes must be provided so that the PiCell Core can construct them from the templates. This information is described in Tables S2 to S10.
The genetic information about the organism must also be provided. In the case of M. genitalium, it was also available in the WholeCellKB. The chromosome sequence, chromosome features, genes, and transcription units are necessary to construct the canonical processes.
Molecules and reactions to be added to the model can be retrieved from the database or inserted manually. An example of the latter is the cell division reaction, whose structure and components can be found in Figure S1 and Table S1. Reactions such as metabolic and aminoacylation reactions were retrieved from the database, as were the participating molecules.

PiCell Core
The PiCell Core is responsible for structuring the information acquired from databases and inserted manually in such a way that it can be more easily manipulated, checked for inconsistencies, and further translated into an extended biochemical network.

Chromosome Representation
The first function of the PiCell Core is to create a representation of the cell's chromosomes based on the genetic information provided. Each chromosome is divided into regions according to annotated regions and respecting a maximum region length. In the case of M. genitalium, the maximum region length was set to a very high value so that the region sizes are constrained only by the annotations in the genome. The starts and ends of transcription units were not considered in this process.
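
A minimal sketch of such a partition is given below, assuming 1-based inclusive coordinates and features that lie within the chromosome. The function `split_chromosome` and the example coordinates are hypothetical and do not reproduce PiCell's implementation.

```python
def split_chromosome(length, features, max_region_length):
    """Return (start, end) regions, 1-based inclusive, covering the chromosome.

    Region boundaries are placed at every annotated feature start and end;
    any resulting region longer than max_region_length is split further.
    """
    cuts = {1, length + 1}
    for start, end in features:
        cuts.add(start)    # a region begins where a feature begins
        cuts.add(end + 1)  # and another begins right after a feature ends
    bounds = sorted(cuts)

    regions = []
    for lo, hi in zip(bounds, bounds[1:]):
        pos = lo
        while pos < hi:
            stop = min(pos + max_region_length, hi)
            regions.append((pos, stop - 1))
            pos = stop
    return regions


# Two made-up features on a 100 bp chromosome.
features = [(11, 40), (61, 90)]
capped = split_chromosome(100, features, max_region_length=25)
uncapped = split_chromosome(100, features, max_region_length=10**9)
```

With a very large maximum length, as in the second call, only the annotation boundaries determine the regions, which mirrors the setting described for M. genitalium.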

Recursive Creation of Canonical Reactions
The second function of the PiCell Core is to generate missing canonical reactions for macromolecules in the model. This functionality is based on the premise that all macromolecules in the model must have at least one biosynthesis and one degradation reaction. Thus, this process can iterate from protein complexes needing their complexation reaction up to the expression of their respective genes. For example, consider that a given metabolic reaction inserted in the model is catalyzed by a protein complex. The complex must be synthesized by a protein complexation reaction. The monomers required in this reaction must be synthesized by a translation reaction from their respective mRNA. The mRNA then needs to be synthesized by a transcription reaction from its respective DNA regions. Finally, DNA regions must be synthesized by their replication reactions. This chain of reactions must be created for every macromolecule in the model. Similarly, the degradation reactions for each macromolecule are created. All reactions created by the PiCell Core are based on the templates described before. Particularities of each reaction created, such as specific chaperones in protein translation, are added to the reactions according to data availability in the database.
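
The recursion from a protein complex down to DNA replication can be sketched as below. This is an illustrative simplification under stated assumptions: the names `SYNTHESIS_BY_KIND` and `ensure_synthesis`, the molecule identifiers, and the `requires` mapping are hypothetical, and only biosynthesis reactions are shown (degradation reactions would be added analogously).

```python
# Hypothetical mapping from macromolecule kind to its synthesis template.
SYNTHESIS_BY_KIND = {
    "complex": "complexation",
    "protein": "translation",
    "mrna": "transcription",
    "dna": "replication",
}


def ensure_synthesis(molecule, kinds, requires, reactions):
    """Recursively guarantee one synthesis reaction per macromolecule."""
    if molecule in reactions:
        return  # already covered, which also stops repeated visits
    reactions[molecule] = SYNTHESIS_BY_KIND[kinds[molecule]]
    for upstream in requires.get(molecule, []):
        ensure_synthesis(upstream, kinds, requires, reactions)


# Made-up example: a dimeric complex whose two genes share one DNA region.
kinds = {
    "ComplexAB": "complex",
    "ProtA": "protein", "ProtB": "protein",
    "mRNA_A": "mrna", "mRNA_B": "mrna",
    "DNA_1": "dna",
}
requires = {
    "ComplexAB": ["ProtA", "ProtB"],
    "ProtA": ["mRNA_A"], "ProtB": ["mRNA_B"],
    "mRNA_A": ["DNA_1"], "mRNA_B": ["DNA_1"],
}
reactions = {}
ensure_synthesis("ComplexAB", kinds, requires, reactions)
```

Starting from the single complex, the call fills `reactions` with one entry for each of the six macromolecules, and the shared DNA region is visited only once.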

Consistency Checks
In addition to the premise presented in the last paragraph, the PiCell Core performs a mass-balance check in order to probe for inconsistencies in the reactions. All metabolites must have their composition formula described in the model. From their atomic composition, their mass is estimated. Given that all macromolecules are combinations of basic metabolites, the mass of every molecule can be estimated from the bottom up. Then, to check the mass-balance consistency of any reaction, we simply calculate the mass of reactants minus the mass of products. The absolute value obtained must be less than one mass unit, approximately the mass of a hydrogen atom. It is important to note that although this methodology adds an extra layer of confidence in the model, the correctness of all reactions still relies on the data sources.
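
The check can be illustrated with a short sketch. The function names and the ATP hydrolysis example (written with neutral formulas) are our own assumptions for illustration; the atomic masses are standard values in daltons.

```python
# Standard atomic masses (Da) for the elements used in the example.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "P": 30.974}


def formula_mass(composition):
    """Mass of a molecule given as {element: atom count}."""
    return sum(ATOMIC_MASS[element] * count for element, count in composition.items())


def is_mass_balanced(reactants, products, masses, tolerance=1.0):
    """True if |mass(reactants) - mass(products)| is below the tolerance.

    reactants and products map molecule names to stoichiometric coefficients.
    """
    lhs = sum(masses[name] * coeff for name, coeff in reactants.items())
    rhs = sum(masses[name] * coeff for name, coeff in products.items())
    return abs(lhs - rhs) < tolerance


# Neutral formulas, chosen here only for illustration.
formulas = {
    "ATP": {"C": 10, "H": 16, "N": 5, "O": 13, "P": 3},
    "ADP": {"C": 10, "H": 15, "N": 5, "O": 10, "P": 2},
    "Pi": {"H": 3, "O": 4, "P": 1},
    "H2O": {"H": 2, "O": 1},
}
masses = {name: formula_mass(comp) for name, comp in formulas.items()}

balanced = is_mass_balanced({"ATP": 1, "H2O": 1}, {"ADP": 1, "Pi": 1}, masses)
missing_water = is_mass_balanced({"ATP": 1}, {"ADP": 1, "Pi": 1}, masses)
```

In this example the complete hydrolysis reaction passes the check, while the version missing water is off by about 18 Da and is flagged. Because the tolerance is close to the mass of a hydrogen atom, a check of this kind will not reliably detect a single missing proton.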

Extended Biochemical Network Construction
Once the model is complete, it can be used to generate a working model following the extended biochemical network framework. For each reaction described in the PiCell Core, a respective reaction node is created in the network. The molecules are created respecting their location. If a given molecule can occur in more than one location, one molecule node is created for each location and linked to its respective reactions accordingly. Reversible reactions are represented by two reaction nodes, one for each direction. The final model can be exported in SBML, some network formats, and also as a networkx graph. The data formats are described in Section 3.
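
The construction rules above can be sketched in plain Python as follows. The node naming scheme (`name[location]`), the `_fwd`/`_rev` suffixes, the edge directions, and the example reactions are assumptions made for this illustration and are not necessarily those used by PiCell.

```python
def build_network(reactions):
    """Build a directed bipartite network of molecule and reaction nodes.

    Each reaction is a dict with "id", "substrates", "products", and optional
    "modifiers" and "reversible" keys; molecules are (name, location) pairs.
    Returns (nodes, edges), where nodes maps node id to node type and edges
    is a list of (source, target, role) tuples.
    """
    nodes, edges = {}, []

    def add_reaction(rid, substrates, products, modifiers):
        nodes[rid] = "reaction"
        groups = (("substrate", substrates), ("product", products), ("modifier", modifiers))
        for role, members in groups:
            for name, location in members:
                mol = f"{name}[{location}]"  # one node per molecule and location
                nodes[mol] = "molecule"
                # Products point away from the reaction; the rest point into it.
                source, target = (rid, mol) if role == "product" else (mol, rid)
                edges.append((source, target, role))

    for rxn in reactions:
        mods = rxn.get("modifiers", [])
        if rxn.get("reversible"):
            # A reversible reaction becomes two nodes, one per direction.
            add_reaction(rxn["id"] + "_fwd", rxn["substrates"], rxn["products"], mods)
            add_reaction(rxn["id"] + "_rev", rxn["products"], rxn["substrates"], mods)
        else:
            add_reaction(rxn["id"], rxn["substrates"], rxn["products"], mods)
    return nodes, edges


# Made-up reactions: a reversible transport and an enzyme-catalyzed hydrolysis.
example = [
    {"id": "R_transport", "substrates": [("H2O", "e")], "products": [("H2O", "c")],
     "reversible": True},
    {"id": "R_hydrolysis", "substrates": [("ATP", "c"), ("H2O", "c")],
     "products": [("ADP", "c"), ("Pi", "c")], "modifiers": [("EnzymeX", "c")]},
]
nodes, edges = build_network(example)
reaction_nodes = sorted(n for n, kind in nodes.items() if kind == "reaction")
```

Here water appears as two molecule nodes, `H2O[e]` and `H2O[c]`, and the transport reaction as two reaction nodes. A structure of this kind could then be loaded into a directed networkx graph for export and analysis.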

Software Dependencies
PiCell is developed in Python 3 and depends on several Python packages. The packages are all open source and are listed in the following: For the scripts used in the analysis of the model, you will also need the following packages:
• numpy

3 Network's Data Formats
The M. genitalium whole-cell biochemical network is available in three formats: SBML, GML, and GraphML (Additional file 2).