Galaxy-SynBioCAD: Automated Pipeline for Synthetic Biology Design and Engineering

We introduce the Galaxy-SynBioCAD portal, the first toolshed for synthetic biology, metabolic engineering, and industrial biotechnology. The tools and workflows currently shared on the portal enables one to build libraries of strains producing desired chemical targets covering an end-to-end metabolic pathway design and engineering process from the selection of strains and targets, the design of DNA parts to be assembled, to the generation of scripts driving liquid handlers for plasmid assembly and strain transformations. Standard formats like SBML and SBOL are used throughout to enforce the compatibility of the tools. In a study carried out at four different sites, we illustrate the link between pathway design and engineering with the building of a library of E. coli lycopene-producing strains. We also benchmarked our workflows on literature and expert validated pathways. Overall, we find an 83% success rate in retrieving the validated pathways among the top 10 pathways generated by the workflows.

Computation has become an essential tool in life science research.Synthetic biology, metabolic engineering and industrial biotechnology make no exception to that trend.As part of this endeavor, significant attention is being paid to the development of workflows adhering to design principles from engineering such as standardization and abstraction of modular parts, as well as the decoupling of design from fabrication.
Following the electronic design automation (EDA) concept, there are many design automation tools for genetic circuits, these are extensively reviewed in Appleton et al. 1 .As an example, Cello 2 applies the EDA approach to genetic circuits.Cello comprises several steps, which are connected and therefore need to use standardized input/output formats.Among those formats are Verilog to represent a logic function and Eugene 3 to encode a set of parts and constraints between the parts.While Cello achieved the compilation and standardization of several pieces of software for genetic design, in general, this is not true for most freely available synthetic biology and metabolic engineering design tools, where the fragmentation remains a significant barrier to adoption.Nonetheless, two main standards have emerged in the past two decades.The first, SBML 4 is a biological modeling standard that has been developed by the systems biology community to encode strains and pathways.The second, SBOL 5 , is a data exchange standard specific to synthetic biology.SBOL has been developed to document genetic components (DNA, RNA, protein, etc.) and their interactions for the purpose of biodesign engineering.SBOL can now encode complex genetic circuits, metabolic pathways, vectors, and plasmids.Complying with SBML and SBOL standards a suite of in silico genetic circuit design tools was recently proposed 6 .
As for genetic circuits, there are plenty of software tools to assist the biosynthetic pathway design process 7 .Briefly, from a given target compound and a given chassis strain, the first step consists of finding metabolic reactions that link the target compound to the native metabolites of the host strain.This step is carried out by retrosynthesis software tools [8][9][10][11][12][13] and, should one wish to search for novel pathways or find pathways that produce unnatural target compounds, requires the use of reaction rules 14 .The result of retrosynthesis is a metabolic map and there is a need in a second step to enumerate the pathways linking the chassis metabolites to the target.There are many solutions for pathway enumeration and search 15 , which are sometimes integrated into the retrosynthesis software itself.The third step is to find the most promising enzyme sequences catalyzing the metabolic reactions.This can be achieved either through similarity search to enzyme annotated metabolic reactions [16][17][18] , or machine learning trained on metabolic databases 19,20 .Once the pathways have been annotated with enzyme sequences, they can be ranked in a fourth step.The ranking criteria are diverse, they can be among others based on thermodynamics 21 , predicted yield of the target 22 , target rate of production through flux balance analysis 9,11,21 , chassis cytotoxicity of the target and intermediates 21 , along with simpler criteria like pathway length.
In addition to the enzyme identities, there are multiple layout solutions and settings to engineer the top-ranked pathways.Indeed, the individual genes coding for the enzyme can be placed under different promoters, in a different order, with different RBS strength (if the chassis is a bacteria), and on different plasmids with different origins of replication if the engineering is performed on a plasmid.The fifth step deals with this issue by making use of tools such as the RBS calculator 23 to compute RBS sequences for different strengths, and design of experiments (DoE) 24,25 to sample the space of possible constructs, which can be very large.The result of that step is a library of layouts representing either the same or different pathways.At this stage, one can either synthesize the whole layout DNA or, as it is most commonly done, synthesize individual DNA parts and use combinatorial DNA assembly methods to include variations of the control elements, such as promoters and RBS sequences.Several computational tools can be used to perform this sixth and last step of assembly design before constructing the pathways, these tools compute parts to be synthesized depending on the chosen assembly protocol.Computation tools to help the build tasks are sparser than for design.One can cite here Aquarium 26 , which provides instructions to a person or a robot to perform the assembly tasks along with Antha 27 , BioBlocks 28 , and DNA-BOT 29 .Engineered pathways are generally evaluated using HPLC or mass spectrometry analyses.Here too, computational tools can help in particular the workflows produced by OpenMS 30 or Worlflow4Metabolomics 31 .
Considering the above, we are clearly at a stage where the pathway engineering process is not that far from being fully driven by computer software products.However, there are several hurdles that prevent this from happening even for tools covering pathway design only.First, the tools are not easily findable, they are stored in different places and the keywords to search online are not obvious.Secondly, some of the tools are difficult to access, some requiring registration, purchase, or access fees.Thirdly, almost none of the tools are interoperable and cannot be chained one after another to ensure that computational experiments are communicated well, and hence reproducible.Lastly, and perhaps most problematic for wider acceptance, the tools can be difficult to comprehend, requiring a level of expertise that limits their use by a large community.
Scientific workflows help to address these issues by providing an open, web-based platform for performing findable and accessible data analyses linked to experimental protocols for all scientists irrespectively of their informatics expertise, along with interoperable and reproducible computations regardless of the particular platform that is being used 32 .Indeed, without programming skills, scientists that need to use computational approaches are impeded by difficulties ranging from tool installation to determining which parameter values to use, to efficiently combining and interfacing multiple tools together in an analysis chain.Scientific workflows provide solutions where data is combined and processed into a configurable, structured set of steps.Existing systems often provide graphical user interfaces to combine different technologies along with efficient methods for using them, and thus increase the efficiency of the scientists using them.Additionally, workflow systems generally provide a platform for developers seeking a wider audience and broad integration of their tools, and can thus drive forward further developments in a specific field of research.Among existing workflow platforms, Galaxy is a system originally developed for genome analysis 33 which now includes more than 8500 tools that can be found in the public ToolShed 34 .
Here, we introduce the Galaxy-SynBioCAD portal 36 , the first Galaxy set of tools for synthetic biology, metabolic engineering and industrial biotechnology.It allows one to easily create workflows from the incorporated toolset or use already developed shared workflows.The portal is a growing community effort where developers can add new tools and users can evaluate the tools performing design for their specific projects.The tools and workflows currently shared on the Galaxy-SynBioCAD portal 36 cover an end-to-end metabolic pathway design and engineering process from the selection of strain and target to automated DNA parts assembly and strain transformation.

Results
To develop an integrated ecosystem, we selected software applications from among the computational tools mentioned above.Several criteria were used for this selection: (i) the tools needed to be relevant for pathway design and engineering, (ii) be published, (iii) open-source under MIT, GNU GPL, or related licenses, (iv) well documented and deposited in GitHub, (v) make use of standard input/output, and (vi) exist as a standalone command-line tool.

Pathway design and engineering tools and workflows
The selected tools are further described in the 'Supplement_Text' file (cf.SynBioCAD Tools).These tools can be divided in three categories: (i) those aimed at finding pathways to synthesize heterologous compounds in chassis organisms (RetroRules, RetroPath2.0,RP2Paths, rpCompletion), (ii) thoses aimed at evaluating and ranking pathways (rpThermo, rpFBA, rpReport, rpViz, rpScore) and those related to genetic design and engineering (Selenzyme, SbmlToSbol, PartsGenie, OptDOE, DNA Weaver, LCR Genie, rpBASICDesign, and DNA-Bot).Following FAIR principles 34 , all selected tools are open source with code available on GitHub and installable through the Conda package manager 35 (cf.Tools design and integration process in the 'Supplement_Text' file).Therefore, any user can install the tools needed on their own computer and run these as standalone programs or chain them together to process more complex calculations.
To go further in chaining tools, three types of Galaxy workflows are available on the Galaxy-SynBioCAD portal 36 40 .This workflow provides a direct link between machine-enabled design and automated assembly.It takes as input a pathway and generates a script to operate an Opentrons liquid handler robot performing assembly and chassis transformation,.
The Retrosynthesis and Pathways analysis workflow generate annotated SBML files (cf.Pathway annotation in Methods), which are taken as input to the Genetic design and engineering workflows to produce plasmids encoded in SBOL format along with assembly plans (in CSV files) and liquid handler instructions (Python scripts directly executable by Opentrons).

Benchmarking workflows for lycopene production
The Retrosynthesis workflow was run at the Genoscope laboratory (Paris region, France) for the production of lycopene in E. coli.We used iML1515 41  The genetic design and engineering workflow for BASIC assembly was run at two different locations: Paris (Micalis Institute) and London (Imperial College).In both cases, as design input we used the top lycopene ranked pathway predicted by the Pathway Analysis workflow.
Constraining the enzyme search within the organism Pantoea ananas, enzymes CrtE (UniProt ID: P21684), CrtB (P21683) and CrtI (P21685) were predicted by the Selenzyme tool for the 3reaction pathway (Figure S6 in the 'Supplementary_Text' file and 'Lycopene_Benchmark' Supplementary file).A total of 88 construct designs were generated and fed to DNA-Bot, which was executed in Paris and London with different labware identifiers and associated parameters (cf.Genetic design and engineering workflow execution in Methods).In both laboratories, DNA-Bot generated four scripts (clip reactions, purification, assembly and strain transformation) which were run on Opentrons robots.Additional information can be found in the Supplementary file 'Lycopene_Benchmark'.
Following the scripts produced by the genetic design and engineering workflow the three genes of the pathway (crtE, crtB and crtI) were assembled in distinct order, together with six different RBS-linkers (Figure 1.A).Each of these linkers held a ribosome-binding site with specific strength (cf.Supplementary file 'Lycopene_Benchmark').The pathway operon was expressed using one of two different promoters (medium strength PJ23105 and low strength PJ23116).
In both laboratories (Paris and London) the scripts were used for the 88 constructs and to spot 10 µL of the transformed cells (E. coli DH5-alpha) on a rectangular LB-agar plate (cf.Lycopene production in Methods).Of these, only 30 (   An analysis of the number of successful transformants obtained in the two laboratories for the different combinations of promoter, RBS, and gene order indicates a preference for the weaker J23116 promoter (Figure 1.C).Overexpression of the three pathway genes from a strong promoter may be too toxic for the cell, resulting in overall reduced fitness and consequently fewer successful transformants.Four transformant colonies with visibly different levels of red color (Figure 1.E) were used for acetone extraction of lycopene at Micalis (cf.Genetic design and engineering workflow execution in Methods), and similarly eight colonies were used for lycopene extraction at Imperial.In both the weak (J23116) and the high (J23105) promoter groups, low lycopene production was observed from constructs with more than one highstrength RBS (Figure 1.D).When comparing the constructs with the same gene order, for example CrtE-CrtI-CrtB (H10, B9), CrtB-CrtI-CrtE (B11, D2), or CrtI-CrtE-CrtB (C9, D8), constructs with more low-strength RBSes exhibited higher lycopene production.There was also an apparent preference for the CrtI-CrtE-CrtB (G6, C9, D8) among the highest producing constructs.Taken together, these data indicate that maximizing the expression of pathway genes can increase cellular burden, resulting in lower pathway productivity.

Benchmarking workflows with literature data and expert validation trial
Criteria computed by the Pathway analysis workflow like target product flux, thermodynamic feasibility, pathway length, and enzyme availability score inform the user as to the best potential candidate pathway to produce a compound of interest.These criteria can be combined in a global score value.To achieve this, we developed a machine learning scoring tool (cf.Machine Learning Global Scoring in Methods) taking training data from literature (cf.Literature data benchmarking in Methods) and a validation trial conducted by metabolic engineering experts (cf.Expert validation trial benchmarking in Methods and acknowledgement section for the list of experts enrolled).The process is summarized in Figure 2, overall, our training set comprised 7919 pathways, 754 of which were labeled positive either because of their matching with a literature pathway or because they were selected as feasible to engineer during the validation trial.The scoring process depicted in Figure 2, was used to rank the top 50 SynBioCAD pathways generated for 60 target molecules taken from our literature pathway training set.Results are shown in Figure 3 where each row is a ranked list of collections of predicted pathways for a given target molecule in a given chassis.Pathways are ranked according to the machine learning predicted global score aforementioned.Overall, we find that our scoring schema has an 83% success rate in retrieving the literature or expert selected pathways among the top 10 predicted pathways, the number rises to 94% in retrieving the literature pathway among the top 50 predicted pathways.

Discussion
We have presented several Galaxy workflows to design and engineer pathways in host organisms.These workflows have been built using 25 different computational tools.Chaining the tools together to form workflows was made possible only because the input and output of each tool were standardized.As far as standardization is concerned, we chose community adopted standards like InChI and SMARTS for compounds and reactions, SBML for pathways and strains, and SBOL for genetic constructs.
We illustrated our workflows by designing and engineering a library of 88 pathway variants designed to produce lycopene in E. coli DH5-alpha on Opentrons liquid handlers.The workflows were executed at four different locations demonstrating the ability of the Galaxy-SynBioCAD portal to run workflows (including robot drivers with different labwares) at different sites, and consequently the possibility of completing multi-partners design and engineering projects.
There are many standard protocols for biological engineering but as argued elsewhere 42 , written protocols without practical guidance can lead to problems, as protocols often contain ambiguities or rely on tacit knowledge.Here the possibility of running several times the same workflow that incorporates automated experiments provides a systematic way to quantify reproducibility (cf. Figure 1).
To assess the validity of our generated pathways, we used the double-blind testing strategy performed by a pool of participants 43 .In that strategy, neither the participants nor the conductors are aware of the origin of the pathways, and the participants are asked to flag pathways they deemed valid without having explicit information on pathways found in the literature.We applied this approach to develop a machine learning based scoring function reaching high predictability when ranking pathways.
The Galaxy-SynBioCAD portal presented in this paper, proposes the first set of synthetic biology and metabolic engineering computational tools in a Galaxy framework 33 .We chose Galaxy as our workflow system because the tools found in the ToolShed 34 , have reached way beyond genome analysis for which Galaxy was originally developed.Just by focusing on the tool categories relevant to our study, one can cite proteomics, transcriptomics, metabolomics, flow cytometry analysis, and computational chemistry.Several communities are using Galaxy and many papers can be found online for omics (752 publications are found as of 16/02/2022) microbiome (380 publications), diseases like cancer (386 publications), and drug design and discovery (96 publications).We created a new Galaxy category named 'Synthetic Biology' currently comprising 25 tools stored in the ToolShed 34 .
The offering in Galaxy-SynBioCAD focuses on providing tools for pathway design and engineering.However, as Galaxy-SynBioCAD is a community effort, we anticipate our toolset will grow.
Regarding pathway design tools, many of the software products listed in the introduction could be considered to be added to the portal.In particular, strain design including knockout genes to maximize targeted product fluxes could easily be implemented via the flux balance analysis tools.
Additionally, there are already Galaxy workflows to take up and analyze metabolomics flow cytometry data in the ToolShed 34 , and these workflows could directly be incorporated into the portal to deal with data generated in the 'Test' step of the synthetic biology Design-Build-Test-Learn (DBTL) cycle.As mentioned in the introduction several open-source software products deposited in GitHub [26][27][28]44 could address the 'Build' step and eventually provide drivers to automated constructions using different robotic workstations beyond those provided by Opentrons. Regrding the 'Learn' step in DBTL, the OptDoE tool could easily be adapted to propose new designs as it was done in Carbonell et al. 25 .More complex approaches to be considered are methods that make use of machine learning as in Borkowski et al. 45 .While all design examples provided in the current paper are for engineering pathways in host organisms, because of the recent development of models (similar to genome scale models) for cell-free systems 46 , one can also consider adapting the portal for design and engineering in cell-free.
All of the above-suggested additions could be implemented in our portal with relatively small efforts.There are other applications that could be envisioned beyond pathway design and engineering.For instance, as shown in Delepine et al. 10 retrosynthesis software can easily be adapted to design biosensors, and tools for genetic logic circuits engineering could also be considered.

Retrosynthesis from target to chassis
Typically, the target compound, also named "source compound" is the compound of interest one wishes to produce, while the precursors are usually compounds that are natively present in a chassis strain.Starting from the source compound at the first iteration, reaction rules matching the chemical structure of the source are applied and newly predicted chemicals are generated.Reaction rules are generic descriptions of (bio)chemical reactions encoded into the community standard SMARTS 47 .The use of reaction rules allows estimating the outcomes of chemical transformation based on the generalization of reactions available in knowledge DBs such as BRENDA 48 , MetaCyC 48 , Rhea 49 , or MetaNetX 50 .The degree of generalization is controlled by describing the surrounding environment of the reaction center up to a given diameter as described in Duigou et al. 14 .To ensure the accuracy of the predicted transformations that will outcome from the reaction rules, the RetroRules dataset provided by the Galaxy RetroRules tool has been validated by (i) checking that rules allow to reproduce the template reactions, and by (ii) checking that results obtained by decreasing diameters are supersets of results obtained with higher diameters.Only the reaction rules that successfully passed the 2 checks are retained.The validation of this dataset has a success rate of 99.3%.
For each reaction rule, a score is calculated based on the ability to retrieve enzyme sequences catalyzing substrate to product transformations, the method is detailed in Delepine et al. 10 .Newly produced chemicals are scanned and kept for the next iteration if they are not within the set of available precursors.In that way, a new iteration is started using the previously collected chemicals as the new source set.The iterative process stops when either no new chemicals are discovered or the predefined number of steps is reached.RetroPath2.0 carries out this task.
The retrosynthesis tools RetroPath2.0 and RP2Paths outputs a set of pathways composed by chemical transformations based on reaction rules.To obtain reactions, we have to re-build them from template reactions which have been used to generate the rules.In addition, within a pathway one single chemical transformation can reference multiple rules.Such pathways will be called master pathways.For each master pathway, the algorithm takes each transformation and creates one fork per reaction rule referenced.Then for each reaction rule, again the algorithm creates a new fork per template reaction used to build the current rule.The enumeration of all forks create a set of slightly different pathways (made of chemical reactions) for one master pathway.To perform the enumeration, datasets from RetroRules and MetaNetX are used.The reaction completion tool (rpCompletion) takes as input the CSV outputs of RP2paths and RetroPath2.0 and produces a collection of annotated SBML files.Those SBML files are "enriched" with additional information that are not stored as part of the normal SBML schema (see Pathway annotation section).

Pathway annotation
Some results generated by the workflow produced in this study cannot be readily stored in the SBML files natively (information about chemical reactions and species, thermodynamic and fluxes properties, as well as pathway information).
Using the Minimal Information Required In the Annotation of Models (MIRIAM) conventions, one can store within the SBML file information that instructs the user on the provenance of the reactions and chemical species within the model by cross-references to a wide range of databases.However, this needs a third-party database lookup to match the database ID with its structural information.In this work, intermediate products are generated ad-hoc references and may not necessarily have database entries whereas some other information cannot be contained under the MIRIAM annotations as it is : SMILES, InChI, InChIKey for chemical species, reaction rule ID (RetroRules ID), associated template reaction ID, and rule score based on the expected enzyme availability.
As such, we elected to enrich the SBML format in such a way that our information can be stored directly within the SBML file without breaking any standard of the original file.Because SBML files are based on XML, new XML annotations are created outside the standard scope of an SBML file and thus are ignored by any standard SBML readers 51 .As a result, this enriched file format (denoted rpSBML) is fully compliant with SBML version 3 specifications.
Standard SBML extensions are also used in this project.The "groups" package is used to link the heterologous reactions and chemical species to identify them easily, as well as classifying the chemical species that are main actors in a heterologous pathway 51 .While the FBA package is used to define the FBA simulation conditions 52 .The tools also adhere to the MIRIAM annotation standard for the cross-references of chemical species to public databases 53 .

Flux Balance Analysis with Fraction of Reaction
We need metrics to rank pathways, this is why we developed an in-house Flux Balance Analysis (FBA) objective to simulate the flux of a target while considering the burden that the production of the target would cause on the cell.Under such simulation conditions, the analysis that returns a low flux may be caused by the starting native compound itself not having a high flux, or the cofactors required having a low flux, while the pathways with high flux would be caused by both the starting compound and the cofactors being in abundance.In either case, bottlenecks that limit the flux of the pathway may be identified and pathways that do not theoretically generate high yields can be filtered out.Furthermore, the production of heterologous molecules in an organism often causes a burden on the growth of the cell.To emulate such a condition, we use the method named 'fraction of reaction'.We first perform FBA (with COBRApy 54 ) for the biomass reaction and record its flux.The upper and lower bounds of the biomass reaction are then set to the same amount, defined as a fraction of its previously recorded optimum.This ensures that any further FBA solution would have a fixed biomass production regardless of the conditions set for further analysis.
The tool optimizes the target molecule and records the flux directly to the SBML file and all changed bounds are reset to their original values before saving the file.It is important to note that orphan chemical species (those which are only consumed or produced by the metabolic pathway) are ignored.Such species are documented in the SBML file within the group named rp_fba_ignored_species.

Thermodynamics
Thermodynamics is critical in synthetic pathway design by providing quantitative indicators to determine best metabolic pathways among a set of predicted ones.Thus, one can perform thermodynamic analysis to know whether a reaction direction of a pathway is feasible in physiological conditions.
In this work, we performed thermodynamic analysis for species, reactions and pathways.We use eQuilibrator 55 to compute the formation energy of chemical species and the Gibbs free energy for each chemical reaction and the heterologous pathway.
For each species involved in a heterologous pathway, the first challenge is to find the corresponding compound in the eQuilibrator database.To find the right compound, we try to exactly match species ID, InChIKey, InChI or SMILES and stop with the first hit.Then, if no compound is found, in the last resort, the first part of species InChIKey is looked for within the eQuilibrator cache.When the result (a list) is not empty, the first compound is taken.If species have no known structure neither in eQuilibrator database nor in any public one, the user has the possibility to specify substitution for identifier, InChI and InChIKey for these species.This substitution is documented in the SBML file with the group named rp_thermo_substituted_species.If a species has no known structure and is not substituted, then the reaction which involves this species will not have thermodynamics values.Conversely thermodynamics can be computed by eQuilibrator for all reactions for which all species have been identified.
At the level of the pathway, we build a global pseudo-reaction linking chassis substrates to the target molecule and we compute thermodynamics with the eQuilibrator engine for the global pseudo-reaction.
Building the global pseudo-reaction requires finding the appropriate stoichiometric coefficients such that the intermediate compounds of the pathway cancel out.A linear optimization program (eq. 1) can be set to find the stoichiometric coefficients.The program can be solved using SciPy 56 with a simplex algorithm.max    such that  = 0 (1) and 1 ≤  where c is the objective function, A the stoichiometric matrix, and  the unknown stoichiometric coefficient multipliers.
As an example, let's consider the following 3-reaction set: where MNXM are species ID from MetaNetX, CMPD are intermediate species within the heterologous pathway and not present in the chassis organism, and TARGET is the product of interest.We note that Rxn2 is the reaction to retain and CMPD3,4 are species to remove as intermediate compounds.Thus, the parameters of linear solver are: The solver outputs the following coefficients of reactions: The global pseudo-reaction for the reaction sets becomes:

Genetic design and engineering workflow execution
From amongst the pathway predicted by the Pathway Analysis workflow, the top ranked one was selected with a score of 0.989.The search scope of the Selenzyme tool was restricted to the taxon ID of Pantoea ananas, i.e., 553 taxon ID.The combination of polycistronic constructs was built using 2 constitutive promoters (PJ23105 and PJ23116), 2 RBS linkers (A03, having a TIR 46%, and A04 with a TIR of 3%), 1 backbone (BASIC_SEVA_36_CmR-p15A.1) and enabling CDS permutation, resulting in a maximal number of constructs of 96 for 3 CDS.The labware IDs and parameters used with DNA-Bot parameters are listed in Tables 1 and 2. Additional changes in the purification step were needed because the 2 labs own different versions of the magnetic module (generation 1 vs generation 2).An updated version of the original DNA-Bot tool 44 was developed to be fully compatible with the Opentrons APIv2 and compatible with a command line interface, whilst retaining the option of using an enhanced GUI for direct user control.

Lycopene production materials and methods
Lycopene genes were synthesized by Twist Bioscience, flanked by the LMP prefix and the LMS suffix sequences 40 , and cloned into pTwist high copy vector (AmpR, ColE1 replication origin) using Golden Gate.The resulting storage plasmids (pTwist_High_BASIC_CrtE, pTwist_High_BASIC_CrtB, pTwist_High_BASIC_CrtI) were confirmed by sequencing.
Storage plasmids for lycopene genes and assembly vector BASIC_SEVA_36_CmRp15A.1 were prepared using Monarch® Plasmid Miniprep Kit (Micalis) and E.Z.N.A.® Plasmid DNA Mini Kit (Imperial).The samples were diluted to 200 ng/µl ready to use in the clip reactions.Plasmids coding for the lycopene pathway variants were constructed using Biopart Assembly Standard Idempotent Cloning (BASIC) method.Five-part BASIC reactions were performed, replacing the dropout mScarlet cassette in the assembly vector (BASIC_SEVA_36_CmRp15A.1) by a promoter

Purification step
and three genes with appropriate linkers.A collection of neutral and functional linkers (encoding RBS sequences) is available in a ready to use 96-well plate format (www.biolegio.com).For this work the standard BASIC linker set (Biolegio: BBP-18500) was used.
DNA-BOT was executed as described in detail in the Genetic design and engineering workflow execution section.A clip reaction master mix was prepared by combining 3 µL of 10X NEB T4 DNA ligase buffer, 1 µL NEB BsaI-HF v2 (NEB #R3733), 1 µL T4 DNA Ligase (NEB M0202 (Micalis) or Promega M1804 (Imperial)) per 20 µL required for each reaction.The Opentrons OT-2 pipetted the 20 µL master mix for each reaction, plus 1 µL of Biolegio BASIC linkers and 1 µL of each DNA parts, together with sufficient H2O to give a total volume of 30 µL.Clip reactions were incubated in a thermocycler (Applied Biosystems) for 30 cycles (37°C for 5 min, 16°C for 5 min), followed by a 5 min incubation at 60 °C at Micalis.Clip reactions were incubated in Opentrons Thermocycler Module for 20 cycles (37°C for 2 min, 20°C for 1 min), followed by a 10 min incubation at 60 °C at Imperial.For clip reaction purification, 54 µL of Mag-Bind® TotalPure NGS magnetic beads (OMEGA BIO-TEK (Micalis) or AMPure XP (Imperial)) were added; 150 µl 70% ethanol was used during wash steps; following resuspension in H2O, 40 µL of the eluent was transferred to a fresh well.Constructs were assembled in volumes of 15 µL using 1.5 µL of each purified clip reaction in a solution of 1X assembly buffer (CutSmart Buffer,NEB #B6004) .Assembly reactions were incubated at 50 °C for 45 min in a thermocycler (Applied Biosystems (Micalis) or Opentrons Thermocycler Module (Imperial)).20 µL of DH5-alpha Competent E. coli (NEB #C2987H, Micalis) or home-made DH5-alpha competent E. coli (Imperial) were distributed per well into 96-well plates; then, they were used for transformation reactions.5 µL of the assembly reactions were mixed with cells.Heat shock was conducted according to the manufacturer's instructions.SOC media (125 µL) was transferred to each assembly and the reaction incubated at 37 °C for 1 h with lids off.Transformation reactions were spotted on plates (Thermo Scientific™ OmniTray™ Single well) each containing 40 mL LB-agar supplemented with 17.5 µg/mL chloramphenicol.The spotting protocol was run twice in order to spot 2 times 5 µL for each transformation reaction.The spotting step at Imperial was repeated using 40 µL of transformation reaction on a 12-well plate (Costar® 12-well 3737), each well containing 10mL of LB_agar supplemented with 17.5 µg/mL chloramphenicol.100 µL of each transformation reaction was plated manually on LB-agar plates containing 17.5 µg/mL chloramphenicol as well.
To quantify lycopene production, 2 mL of overnight cultures grown in LB (Cm 17.5 μg/mL) were pelleted at 5000 x g (10 min), washed by re-suspension in 1 mL water, re-pelleted at 5000 x g (10 min), and the pellet re-suspended for extraction in 1 mL acetone.The cells in acetone were incubated at 55 °C for 20 min with continuous shaking (1300 rpm, Eppendorf Thermomixer comfort), centrifuged at 19000 x g (10 min), and the supernatant transferred to a fresh tube.Lycopene absorbance of the supernatant was measured at 474 nm using a quartz cuvette (Hellma 104.002B-QS) in a spectrophotometer (UVisco V-1100D (Micalis) or NanoDrop™ One UV-Vis (Imperial)), and the pellet was dried at 50 °C for 48 h to determine the gDCW.Absorbance (OD474) was converted to molar concentration value by dividing by 150479, the molar extinction coefficient (ε) of lycopene 57 .The yield per gram dry cell weight (mg/gDCW) was calculated by dividing the absolute yield (mg) by the weight of the dried cell pellet.

Literature data benchmarking
A list of 77 pathways corresponding to 60 expressed compounds in engineered organisms (E.coli, S. cerevisiae, and P. putida) was collected from the literature (cf.Supplementary file 'Literature_Pathways').For each of the 77 collected pathways and each heterologous pathway reaction, we compiled the EC number of the reaction along with the substrates and products of the reaction.Each target compound within that list was used to run the Retrosynthesis and Pathways Analysis workflows to generate a collection of 5874 predicted pathways that produce the same target molecule in the same host organism as those reported in the literature.Following that, the predicted collection of pathways were compared with their corresponding literature pathways using a matching algorithm described in the 'Supplementary_Text' file.Figure 2.D shows for each literature pathway, the predicted pathway with the highest matching score (raw data is in Supplementary file 'Literature_Pathways').Any pathway generated by SynBioCAD is labeled 'literature pathway' if its score is above 0.5 and that pathway is added to the training set of a machine learning model predicting global score (Figure 2.F-G).

Expert validation trial benchmarking
Pathways generated by SynBioCAD should not be discarded even when they do not appear in the literature, for the obvious reason that not all pathways have been engineered for the 60 targets of our literature benchmarking.To palliate this shortcoming, we generated a set of 7919 predicted pathways for 80 (target, chassis) pairs using the Retrosynthesis and Pathway Analysis workflows.The set included the 5874 pathways generated for our literature benchmarking along with 2045 and 20 additional pathways and (target, chassis) pairs taken from the Laser database 58 .We next spliced the set in batches of 5 pathways synthesizing the same target in the same chassis.The predicted pathways best matching the literature pathways (when known) was included and the four remaining pathways were drawn randomly.We next recruited 40 experts in the metabolic engineering community (see acknowledgement section) and asked them to select valid pathways in the list they received.To help the selection process, the experts received a clickable map of the 5 pathways (Figure 2.E) where they could collect information on compounds and reactions, reaction and pathway thermodynamics, a Selenzyme ranked list of enzymes catalyzing each reaction, and reaction and pathway production fluxes.An example of such a map can be found on the SynBioCAD portal 59 .The results were recorded and merged with the literature benchmarking results using an OR function for identical pathways.At the end of this process, among the 7919 pathways, 754 were labeled positive either because their matching score with a literature pathway was above 0.5 or because they were selected as feasible to engineer by the experts (cf.Supplementary file 'ML_Scores').

Machine Learning Global Scoring
The purpose of the machine learning model is to predict if a given predicted pathway is a valid one or not.To that end, we developed a classifier based on the XGBoost library 61 .The classifier was trained on 7919 SynBioCAD generated pathways (comprising 43392 reactions) used during the expert validation trial where 754 pathways were labeled positive.The training set can be department, U. Paris-Saclay, Ile-de-France (IdF) region's DIM-RFSI, and ANR DREAMY (ANR-21-CE48-003).GSB acknowledges the support of UKRI (EP/R034915/1).

Figure 1 .
Figure 1.Automated construction of 88 distinct plasmids coding for lycopene pathway operons with genes in different orders and varying promoters and RBS sequences.(A) Plasmids coding for lycopene genes were assembled using the BASIC method.In this method DNA parts are assembled using DNA linkers.Three genes in the lycopene pathway crtE, crtB, crtI (parts 3, 4 and 5) were assembled in an operon with UTR-RBS linkers containing different strengths of RBSs.The 3 gene operon plus promoter (part 2) were assembled into a standard backbone with an origin of replication p15A (ORI) and Chloramphenicol resistance gene (Cmp-R).The assembled parts were flanked by methylated linkers that recapitulate the BASIC Prefix and Suffix (LMP and LMS).(B) Number of constructs for which

Figure 2 .
Figure 2. Scoring SynBioCAD predicted pathways with literature pathways and expert validation data.(A) Pathways for different targets and different hosts are extracted from literature (cf.Literature data benchmarking in Methods), this is illustrated here for production of phenol in E. coli.(B) Galaxy SynBioCAD workflows are run on the literature targets and hosts.(C) A collection of SynBioCAD generated pathways is compiled.Pathway 'A' producing phenol in E. coli from tyrosine is highlighted.(D) The SynBioCAD generated pathways are compared with the literature pathways using a matching algorithm (cf.Supplementary_Text file).The plot shows for each literature pathway the best scoring SynBioCAD generated pathways.Pathways scoring above 0.5 are identical (similarity of 1) to literature pathways as far as main substrate and products are concerned.The plot raw data can be found in Supplementary file 'Lirerature_Pathways'.(E)SynBioCAD generated pathways are evaluated by metabolic engineer experts whose task is to select in batches of 5 generated pathways which ones are valid (cf.Expert validation trial benchmarking in Methods).(F) Valid pathways according to experts and pathways matching literature are added to a training set of labeled pathways.(G) The set of labeled pathways is used to train a machine learning tool printing out the probability of any given pathway to be a valid one (cf.Machine Learning Global Scoring in Methods).The figure plots resultsobtained for all pathways generated by SynBioCAD.The raw data including the training set can be found in the Supplementary file 'ML_Scores'.Using a probability threshold of 0.5, the accuracy retrieving literature of expert labeled pathways is 0.91 with a false positive rate of 0.10 in 4-fold cross validation (cf.Supplementary file 'ML_scores').

Figure 3 .
Figure 3. Ranking predicted pathways with machine learning global score.The color code on the right side shows the global score (from 1 top to 0).The black boxes show the location of the literature or expert selected pathways for a set 60 literature target engineered in E. coli (*), S. cerevisiae (**) or P. putida (***).If a row does not contain a black box then the literature or expert selected pathway is not found within the first 50 scored pathways.The numbers listed on the right side are the total numbers of pathways generated by the SynBioCAD workflows.The data used to generate the figure can be found in Supplementary file 'ML_Scores'.

workflows with literature data and expert validation trial). 3
. Two Genetic design and engineering workflows that produce assembly plans for plasmids encoding the pathways generated by the Retrosynthesis or Pathway analysis workflows (cf.

Genetic design and engineering workflows in 'Supplemetary_Text' file
22red + 8 white) and 33 (21 red + 12 white) constructs gave us transformant colonies at Paris and London respectively (Figure1.B).However, only 12 (11 red + 1 white) of these transformants were common across the two laboratories, suggesting that 10 µL may be too low a volume to spot for these transformations.To test if more transformants can be obtained by increasing this volume, we manually plated 100 µL of the transformed cells in Paris and repeated the spotting step in London using 40 µL on a 12-well plate (cf.Lycopene production in Methods).Of the 88 constructs, this time we obtained transformants for 51 (41 red + 10 white) and 63 (49 red + 14 white) constructs at Paris and Imperial, respectively, including 36 (33 red + 3 white) constructs in common.
successful transformant colonies were obtained are reported from each laboratory (Micalis Institute, Paris, and Imperial College, London), and the number of constructs common to the two labs are indicated in the intersection.Color of colonies (red and/or white) are indicated.At the left, data are from spotting 10 µL of transformation reactions by Opentrons on LB plate and at right are from spotting 100 µL manually or 40 µL by Opentrons on LB plates.(C) Count-plots indicate the number of constructs for which we got successful colonies, grouped by position and gene (details in Supplementary file 'Lycopene Benchmark').At the left are results from Micalis Institute Paris, and at the right are results from Imperial College London and in the middle are the results which were common between both the laboratories.At the top are the constructs with promoter PJ23116 (weaker) and at the bottom with promoter PJ23105 (stronger).The RBS types are shown by different colors.The position of each of the 3 genes (crtE, crtB, crtI) are indicated at the bottom of each plot.Mean of the number of constructs for each promoter type is shown by a dashed line in each plot.(D) Lycopene extraction measurement (mg of lycopene per gram of the dried cell weight) from different constructs from both laboratories.Types of RBS and promoters and gene orders are indicated.E: crtE, B: crtB, I: crtI.Promoter and the terminator are shown at the extremity of each construct.(E) Examples of red and white colonies acquired by Opentrons at the top, the pellet preparation step in the middle and acetone extraction of lycopene at the bottom.

Table 1 .
Labware IDs used at Imperial College (London) and Micalis Institute (Paris) laboratories

Table 2 :
DNA-Bot parameters that differ between Imperial College (London) and Micalis Institute (Paris) laboratories