Main

All our analyses are based on Allchemy’s collection of approximately 10,000 generalized reaction transforms expert-coded based on the underlying reaction mechanism and including—but not limited to—robust reaction types common in chemical industries, especially pharmaceutical21,22 but also agrochemical and flavour/fragrance. These reaction rules are much broader and also more accurate23 than machine-extracted transforms24, and the expert-coding approach has been validated by successful experimental execution of numerous computer-planned syntheses: such as in Allchemy for understanding the origins of life16, and in Chematica/Synthia25,26,27,28,29, which uses a different set of retrosynthetic rules, for the syntheses of drugs26 and complex natural products28.

Reaction rules

Each transform in Allchemy specifies the reaction type/class, scope of admissible substituents, structural motifs incompatible with a given reaction (approximately 400 groups are considered), typical conditions and reagents, suggested solvent (categorized as protic/aprotic and polar/nonpolar), temperature range (categorized as very low, less than −20 °C; low, −20 °C to +20 °C; room temperature (RT); high, +40 °C to +150 °C, and very high, greater than 150 °C), propensity of a given reaction to be performed in tandem with some other reaction(s), and more. Importantly, the programme also calculates a range of molecular properties (logP, polar surface areas and other structural descriptors, energies, pKa values, and so on), flags problematic reagents (here, those in the US Environmental Protection Agency (EPA) List of Extremely Hazardous Substances30, the European Union REACH regulation List of Substances of Very High Concern31, and reagent guides from GlaxoSmithKline (GSK)17,18) and solvents, and uses environmental and health criteria19 to suggest ‘greener’ alternatives, including the possibility of enzymatic reactions. For instance, Oxone instead of meta-chloroperoxybenzoic acid (mCPBA) is suggested in alkene epoxidation, thionyl chloride instead of triphenylphosphine and tetrachloromethane in deoxychlorination; and Pseudomonas cell culture to carry out oxidation of lactic acid instead of Dess–Martin periodinane (see the synthesis of mirabegron in Fig. 3). In terms of solvents, dimethyl sulfoxide (DMSO) is given as an alternative to dimethylformamide (DMF) typically used in Williamson ether synthesis from phenols, and t-butyl ethyl ether rather than tetrahydrofuran (THF) is suggested for the reduction of ketones to alcohols, and so on. Also, for each reaction, the software calculates quantities such as atom economy32 or reaction heats (using Benson’s approach)33. 
Additionally, reaction sequences involving the same solvent in consecutive steps are promoted while reactions requiring very low or very high temperatures are penalized because of high energetic cost.

Construction of reaction networks

The reaction transforms are iteratively applied to the substrates of interest. Although the notion of ‘chemical waste’ may have different meanings, we consider here as substrates 189 small molecules that we identified to be waste by-products of large-scale industrial processes (see Methods, Fig. 1 and full list in Supplementary Information section 1). In the most basic version of the algorithm, the molecules produced in each synthetic generation, G, are combined with the products of preceding generations and with the original substrates, and the cycle is repeated until a user-defined limit of synthetic generations is reached (Fig. 2a). However, because the reaction networks16,25,34,35 created by this method (Fig. 2b) expand very rapidly with the number of substrates and the number of generations (Fig. 2c, d), we bias network generation towards the syntheses of high-value molecules of interest (here, 2,466 approved drugs from DrugBank36 and 1,647 agrochemicals subcategorized as pheromones, herbicides, insecticides and fungicides). In this approach, products made in each synthetic generation are retained for further calculations only if they are (i) small (molecular weight (MW) < 150) and can thus serve as useful building blocks for subsequent syntheses, or (ii) have 150 ≤ MW < 500 but above a certain fingerprint-based Tanimoto similarity37 threshold to at least one of the ‘target’ drugs or agrochemicals. The similarity threshold is adjusted such that the total number of molecules retained in the network after each generation does not exceed a user-defined limit (the ‘width’ of the search, typically W = 10,000–100,000). In this way, we are able to propagate networks starting from large substrate collections (hundreds to greater than 1,000 molecules) up to generation 7 or even 8, and evaluate synthetic spaces spanning hundreds of millions of molecules. Such calculations take several days on a multicore workstation.
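The biased expansion described above can be sketched in a few lines of Python. This is only a toy illustration, not Allchemy's implementation: the function name `expand_network` and the callables `apply_rules`, `mol_weight` and `max_similarity` are hypothetical placeholders for the rule engine, a molecular-weight calculator and the best Tanimoto similarity to any target. We also assume that the width W caps the number of new molecules kept in each generation; the text does not specify exactly how the limit is counted.

```python
from typing import Callable, Iterable, Set


def expand_network(
    substrates: Iterable[str],
    apply_rules: Callable[[Set[str]], Set[str]],  # hypothetical rule engine
    mol_weight: Callable[[str], float],           # hypothetical MW calculator
    max_similarity: Callable[[str], float],       # best similarity to any target
    generations: int,
    width: int,
) -> Set[str]:
    """Toy forward-synthesis loop with MW- and similarity-based pruning."""
    pool = set(substrates)  # generation G0
    for _ in range(generations):
        # Apply all rules to everything made so far; keep only new molecules.
        products = apply_rules(pool) - pool
        # (i) Small molecules are retained as building blocks.
        small = {m for m in products if mol_weight(m) < 150}
        # (ii) Mid-sized molecules are ranked by similarity to the targets.
        mid = sorted(
            (m for m in products if 150 <= mol_weight(m) < 500),
            key=max_similarity,
            reverse=True,
        )
        # Taking the top-ranked molecules is equivalent to raising the
        # similarity threshold until the width limit is met.
        # Assumption: small molecules count towards the per-generation limit.
        room = max(width - len(small), 0)
        pool |= small | set(mid[:room])
    return pool
```

In practice, `max_similarity` would be computed from molecular fingerprints with a cheminformatics toolkit; any such choice is outside the scope of this sketch.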

Fig. 1: Examples of small molecules recycled from various types of industrial chemical waste.

Coloured stars correspond to the map on top and indicate the geographical locations at which companies producing these substrates on large scales are located (for the list of companies, see Supplementary Table 1). We note that in addition to the recycling processes and industries indicated in the figure, the molecules shown here can also be desired targets of other processes (that is, not waste)—for instance, phenol can be produced in large quantities in the so-called cumene process, and terephthalic acid can be made via the Amoco process.

Fig. 2: Generation and properties of ‘forward’ synthetic networks.

a, Scheme of the iterative, forward-synthesis algorithm. Substrates in the zeroth generation, G0, are subject to reaction transforms to produce the first generation of products, G1, which can then be pruned (generation G1′) by various structure-based or property-based filters. The G0 and G1′ molecules are then combined and the rules are again applied, this time creating molecules in generation G2 and, after pruning, G2′. The process continues until a user-defined limit of generations is reached. b, A screenshot from Allchemy illustrating how rapidly the network can expand from just a few substrates (here, isopropanol, glycine, formaldehyde and 3,4-dihydroxyphenylglycol) in the absence of any pruning. Up to G4, this network contains 4,283 molecules (red nodes = drugs, blue = agrochemicals; pink = hazardous compounds). The colourful arcs trace 14 possible syntheses to edetic acid. c, Exponential growth of the average number of products obtained from a given number of ‘waste’ substrates after just three synthetic generations. Values for each histogram bar are averaged over n = 189 independent runs with different substrate sets drawn at random from the collection of 189 ‘waste’ substrates. Inset, cumulative distributions of the number of products obtained in individual runs with different substrates (here, two and eight, normalized by the corresponding averages from the histogram). d, For a given number of substrates, the size of the network grows faster than exponentially with the number of synthetic generations. Each value in the plot is, as in c, an average over n = 189 independent runs with different substrate sets drawn at random from 189 ‘waste’ substrates. Inset, changes of cumulative distributions for runs starting from two substrates analogous to those in c but for a different number of synthetic generations. As the number of generations increases, the fraction of runs with a below-average number of products also increases.

Pathway retrieval and scoring

Once the networks are generated, a breadth-first search algorithm is used to retrieve all syntheses connecting the wastes and valuable products. Because the shortest pathway(s) may not necessarily be optimal (for example, involving problematic conditions), we also consider routes longer by up to two steps. Owing to the high interconnectedness of the network, there are generally multiple syntheses of a given valuable product—from relatively few routes, illustrated by colourful arcs in Fig. 2b, to hundreds or even thousands in Extended Data Figs. 1, 2—and it is desirable to rank them with respect to various ‘process’ variables.

Here these variables are intended to ‘add cost to’ (penalize) the use of some undesired reaction conditions or properties and evaluate the overall pathway structure. With full definitions and details of rescaling to appropriate value ranges provided in Methods, these variables define: X1 = a penalty on the use of harmful reagents based on GSK criteria17,18; X2 = a penalty for problematic solvents as defined by GSK19; X3 = a penalty for extreme temperatures; X4 = a penalty proportional to the exothermicity or endothermicity of the reaction; X5 = a set ‘cost’ for executing each reaction step (to promote shorter routes); X6 = a penalty for low atom economy, defined in ref. 32; X7 = a penalty for pathways being linear rather than convergent and accounting for the position of the convergence point(s) and average yields38; X8 = a ‘geolocation’ variable assigning penalty to pathways for which the waste substrates come from different continents (see Fig. 1, stars); X9 = a penalty for pathways with high estimated cumulative process mass intensity (PMI), calculated based on a previous methodology39 and using tables40 of PMI values for individual reactions.

With the understanding that this selection may not be complete or accurate (notably, PMI values are only approximate, may entail substantial spread39,40, and are not calibrated for flow conditions; information about large-scale pricing and batch-to-batch purity variations is currently unavailable to us, and so on), we use variables X1–X9 to define a simple ‘cost’ function, \(\mathrm{Cost}=[(w_{7}X_{7}+\sum \mathrm{StepCost})/X_{8}^{w_{8}}]\times X_{9}^{w_{9}}\), where \(\mathrm{StepCost}=\sum_{i=1}^{6}w_{i}X_{i}\). Within the Allchemy web application (https://waste.allchemy.net), the user can dynamically adjust the weights \(w_{i}\) of all variables and thus guide the selection of pathways meeting their process criteria. This is illustrated in Extended Data Fig. 3a, b, whereby without any penalties, the top-scoring synthesis of acetaminophen from phenol and acetic acid ‘wastes’ is four steps long but involves the use of thionyl chloride in toluene or dichloromethane (DCM) and AlCl3 in DCM (of which DCM does not have any ‘greener’ replacement in the Friedel–Crafts acylation). When, however, penalties are assigned for harmful reagents and problematic solvents (Extended Data Fig. 3c), the programme prioritizes a one-step-longer pathway that avoids the acylation step and DCM. The new top-scored synthesis (Extended Data Fig. 3d) starts from p-hydroxybenzaldehyde and acetonitrile wastes. In the first step, an aldol reaction, the programme suggests lithium tetramethylpiperidide (LiTMP) as replacement for the harmful lithium diisopropylamide (LDA) typically used in this reaction. In subsequent alcohol oxidation, MnO2, which is well rated for EHS (overall environment, health and safety score)15, is suggested as an alternative to the explosive Dess–Martin periodinane.

In all the examples discussed below, we ranked the pathways according to the cost function in which non-zero weights were assigned to all variables, although with the highest importance given to reagents, solvents and geolocation (w1 = w2 = w8 = 10, w3 = w4 = w5 = w6 = w7 = w9 = 1).
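For concreteness, the cost function and the weights quoted above can be written as a short Python sketch. The function names `step_cost` and `pathway_cost` and the data layout (one list of six values X1 to X6 per reaction step) are our assumptions for illustration; the rescaling of the individual variables is described in Methods and is not reproduced here.

```python
from typing import Dict, List, Sequence

# Weights used for the examples in the text:
# w1 = w2 = w8 = 10, all other weights equal to 1.
DEFAULT_WEIGHTS: Dict[int, float] = {1: 10, 2: 10, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 10, 9: 1}


def step_cost(x: Sequence[float], w: Dict[int, float]) -> float:
    """StepCost = sum_{i=1..6} w_i * X_i for one reaction step.

    `x` holds the step-level penalties X1..X6 in order (assumed layout).
    """
    return sum(w[i] * x[i - 1] for i in range(1, 7))


def pathway_cost(
    steps: List[Sequence[float]],  # one X1..X6 sequence per reaction step
    x7: float,                     # linearity/convergence penalty
    x8: float,                     # geolocation coefficient (>= 1)
    x9: float,                     # rescaled cumulative PMI
    w: Dict[int, float] = DEFAULT_WEIGHTS,
) -> float:
    """Cost = [(w7*X7 + sum of StepCost) / X8**w8] * X9**w9."""
    total = w[7] * x7 + sum(step_cost(s, w) for s in steps)
    return total / (x8 ** w[8]) * (x9 ** w[9])
```

Lower values indicate more desirable pathways; raising a weight makes the corresponding penalty dominate the ranking.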

Examples of synthetic networks

The first large-scale network was propagated from the ‘basic’ set of 189 waste substrates, with W = 10,000–30,000 and up to generation 7. Within some 300 million molecules comprising this network, the algorithm identified 69 drugs and 98 agrochemicals, suggesting 1–2,081 syntheses per target (on average, 216; see Extended Data Fig. 1a and all results stored at https://wasteresults.allchemy.net). Extended Data Fig. 4 highlights only some of the top-ranking pathways longer than three steps, with coloured arrows corresponding to steps previously reported in the literature. We observe that several targets can be made from waste available on the same continent (note the correspondence between large and small stars), and the vast majority of steps rely on benign conditions (an exception is the synthesis of eugenol in which the aromatic electrophilic alkylation step requires the use of ‘non-replaceable’ DCM). Given the simplicity of the targets, it is not surprising that most of the chemistries involved are straightforward, although not all approaches are necessarily obvious—for instance, synthesis of the dapsone antibiotic via a double Smiles rearrangement (see Fig. 4 and Methods for experimental validation).

Nevertheless, the ‘wastes’ alone clearly lack synthetic flexibility to build more complicated scaffolds—with this in mind, our second calculation augmented the set of waste substrates with the aforementioned 1,000 basic and popular reagents (https://github.com/rmrmg/wasteRepo/blob/main/popular_reagents.smi). Propagating the network with a more ‘focused’ width parameter, W = 10,000, and up to G8 generated a space of more than 160 million synthesizable compounds including 71 additional drugs and 20 agrochemicals.

These targets are more structurally complex than in the first network (for example, valsartan, mirabegron, dofetilide) and include some of the world’s most prescribed medicines41 (for example, salbutamol is ranked 7th, carvedilol is ranked 33rd, and chlorhexidine is ranked 286th). Their syntheses (on average, 92 per target, see https://wasteresults.allchemy.net) are longer than those in Extended Data Fig. 4, and involve a higher proportion of steps that are, to our knowledge, previously unreported (black reaction arrows). Figure 3 and Extended Data Fig. 5 show some of the routes top-ranked according to the cost function: only in a few cases do they involve regulated intermediates (for example, aryl hydrazine in the synthesis of carvedilol, oxirane in the synthesis of bisoprolol) or solvents and/or reagents for which the programme suggests no greener alternatives (for example, azide in the synthesis of the tetrazole ring in valsartan, diazomethane in the mirabegron synthesis). The interplay between various variables of the scoring scheme is further illustrated in Extended Data Fig. 6 for the synthesis of Cysview. Prioritizing only the pathway length yields a convergent route (blue reaction arrows) that starts from EPA-regulated30 allyl alcohol and relies on the use of toxic and potentially carcinogenic diisopropyl azodicarboxylate (DIAD) and triphenylphosphine in the Mitsunobu reaction, and ozone in ozonolysis. A cost function penalizing the use of harmful substances top-ranks a more linear pathway (violet arrows) in which, however, allyl alcohol is not used and the problematic steps are replaced by milder bromination (NH4Br, Oxone conditions) and an SN2 reaction.

Fig. 3: Examples of highly ranked syntheses of more advanced drugs starting from waste substrates and a few simple, auxiliary molecules used frequently in organic synthesis.

We show here only some intermediates along the routes; for more complete plans, also to some other drugs, see Extended Data Fig. 5. a, The auxiliary substrates are shown in the innermost circle. ‘Waste’ substrates are shown in red in the outer circle. Hazardous substances30,31 are marked by yellow ellipses (for example, allylic alcohol). Small stars indicate geographical locations (Europe, Asia, North America; see Fig. 1 and Supplementary Information section 1) at which companies producing these substrates are located. Larger stars next to some of the drugs and agrochemicals indicate that they can be synthesized from ‘same-star-colour’ substrates available at the same geographical location (for example, ibuprofen can be made solely from waste substrates produced in North America). The synthetic pathways to different targets are differentiated by colours. Within each pathway, the reaction arrows for steps already reported in the literature are coloured, whereas those without literature precedent are in black. Details of all syntheses shown in this figure as well as other routes of each target are available at https://wasteresults.allchemy.net. HNPhth, phthalimide.

Finally, we considered a network to support a specific commercial operation, namely decentralized and fully automated production of pharmaceuticals and active pharmaceutical ingredients (APIs) by On Demand Pharmaceuticals (ODP)20. Propagating the network with a broad exploration width (W = 40,000–107,000) up to G5 generated a space of approximately 350 million molecules, including an additional 27 drugs and 11 agrochemicals. Of particular and immediate interest, ODP identified drugs and/or their intermediates urgently sought42 for ventilated COVID-19 patients: cisatracurium (a muscle relaxant), midazolam (a sedative), and propofol (an anaesthetic).

Experimental validations

Several routes traced by our algorithms within the abovementioned networks were committed to experimental validation. Initially, we performed laboratory-scale syntheses shown in Fig. 4 and intended merely to confirm the general correctness of computer-designed plans. These examples were chosen because the software either suggested some interesting transformations (for example, double Smiles rearrangement in the synthesis of dapsone) or because the proposed pathways lacked prior literature precedent for several steps (marked with yellow stars). The syntheses were generally straightforward and proceeded in good yields under benign conditions suggested by Allchemy (for details, see Methods and Supplementary Information section 5).

Fig. 4: Experimental validation of selected, computer-designed pathways in laboratory-scale syntheses.

a–d, Allchemy-designed, waste-to-drug syntheses of dapsone (a), carvedilol (b), bisoprolol (c) and proxymetacaine (d). Steps lacking literature precedents and executed by us experimentally are marked by black arrows and with yellow stars above them (see Methods section ‘Laboratory-scale validations’). Steps with existing literature precedent are indicated by coloured arrows. Above these arrows, text in the corresponding colour indicates the reaction ID from Reaxys, the literature-reported yield, and the conditions used (non-‘green’ conditions are given in red). For comparison, conditions suggested by Allchemy are provided below the arrows: ‘typical conditions’ suggested by the programme are in black unless they involve harmful reagents (in red font); in the latter case, greener alternatives suggested by the programme are in green. In all pathways, ‘waste’ substrates are coloured red and commonly used chemicals are coloured in pink. ᵃ Information about the yield was not available in the source publication. ᵇ Dimer rather than very unstable monomer of glycolaldehyde was used in the reaction. ᶜ Reaction ID not available, literature precedent from SciFinder. CPME, cyclopentyl methyl ether; DDQ, 2,3-dichloro-5,6-dicyano-1,4-benzoquinone; DIPEA, N,N-diisopropylethylamine; DMSO, dimethyl sulfoxide; LAH, lithium aluminium hydride; RT, room temperature; THF, tetrahydrofuran; TFA, trifluoroacetic acid.

Next, we tested the applicability of computer-planned routes at larger scales and in realistic industrial settings, using ODP’s flow chemistry platform43 fed with adulterated waste streams (to mimic varying qualities of starting materials from various vendors at different locations). Specifically, in the continuous processes leading to strategic intermediates of cisatracurium, midazolam, and to propofol, ODP built strategic isolation points to ensure high product quality (Fig. 5a–c). For the cisatracurium intermediate (22), homoveratric acid (20) served as the starting material and a potential entry point for the second substrate, homoveratrylamine (21). Industrial-scale production of 20 from vanillin (recovered from biomass) and glycine (produced from lignin waste) has previously been described44,45. In our process, the acid chloride derivative of 20 was generated in the presence of 5 total mol% vanillic acid and guaiacol (both represent potential waste stream adulterants)46 and subsequently reacted and isolated with no impact on product quality. Of note, 22 had a substantially different, Allchemy-calculated logP value (2.62) compared to either 20 (1.33) or 21 (1.21). This difference in partition coefficient led us to evaluate a binary solvent system (pentane/isopropyl alcohol), ultimately allowing for selective extraction of the more polar impurities from the process stream. With this purification in hand, a 12-h production run yielded greater than 1 kg of 22 with a liquid chromatogram area percent (LCAP) of more than 98% by high-performance liquid chromatography (HPLC) (Supplementary Table 3). The cumulative PMI for the process was 9, compared with a theoretically predicted range of 24–84 and an average of 52 (see Supplementary Tables 7, 10).

Fig. 5: Syntheses of COVID-19 intensive care unit medications or their intermediates performed on an automated, modular ODP platform.

a–c, Allchemy-designed, waste-to-drug syntheses of key intermediates of cisatracurium (a), and key intermediates of midazolam (b) and propofol (c; last violet arrow is offline decarboxylation). Manufacturing steps of stable hold points and potential purification points in the respective syntheses are depicted. Grey arrows indicate previously known, patented steps starting from waste molecules (red structures). Isolation points are highlighted in green. d, The process skid configured to manufacture propofol. DCM, dichloromethane; DMF, dimethylformamide; DMAP, 4-dimethylaminopyridine; IPA, isopropyl alcohol; NMP, N-methyl-2-pyrrolidone.

For midazolam, a benzodiazepine family member, ODP’s entry points were bromoacetic acid (23) and 2-amino-5-chloro-2′-fluorobenzophenone (24, logP = 3.29), which is commercially available from approximately 97 suppliers and synthesizable from 4-chloroaniline, which is, in turn, manufactured on industrial scale from recycled chlorobenzene47,48. Batch experiments demonstrated minimal impact on conversion to the lactam when the benzophenone was contaminated with 10 mol% of both nitrobenzene and chlorobenzene. For the production run, the benzophenone was reacted with bromoacetyl chloride, which was also generated in-line from the corresponding acid and oxalyl chloride, to give 48 g of the acetamide over a 10-h run with an LCAP of 91.6% by HPLC (experimental cumulative PMI = 27 versus calculated range 24–84, with an average of 52; Supplementary Tables 8, 10). Whereas the acetamide (logP = 4.04) serves as a possible isolation and purification point, potential issues of stability and toxicity associated with the reactive acetamide functionality made further processing of this material to the corresponding lactam (25, logP = 3.27; see Supplementary Table 4) a more attractive hold point. This lactam serves as the main commercial entry point and currently has limited availability owing to the surge in demand as a result of COVID-19. Our process has the benefit of two distinct purification points that can purge potential impurities present in the waste stream.

Finally, propofol (28), a lipid-soluble anaesthetic, was manufactured from 4-hydroxybenzoic acid (26; produced from lignin waste) in isopropyl alcohol. To demonstrate the viability of this reaction as part of a circular economy, the feedstock (26, logP = 1.09) was adulterated with 10 total mol% vanillin and vanillic acid, two compounds that can also be obtained via lignin degradation49. Leveraging a distinct difference in logP values between this starting material and the 3,5-diisopropyl-4-hydroxybenzoic acid intermediate (27, DIHA; logP = 3.37) enabled a continuous-stirred tank reactor coupled with concurrent gravity separations and precipitation, in the end providing multiple avenues for the successful purging of impurities present in the feed material. A 12-h production run through the ODP system yielded 150 g of DIHA with an LCAP of 99% by HPLC (Supplementary Table 5) and with cumulative PMI = 214, compared to Allchemy’s calculated value in the 112–390 range with an average of 217 (see Supplementary Tables 9, 10). A portion of this material was forward-processed to deliver propofol API with an LCAP of 99.9% by HPLC (Supplementary Table 6). An additional purification unit operation such as a recrystallization step would enable an enhancement in purity specifications. All three processes, meant to be part of a decentralized manufacturing network, are designed to ultimately meet the supply needs of local hospital systems.

Conclusions

To summarize, we showed that computers equipped with comprehensive rules on chemical reactivity can rapidly trace and rank, to our knowledge, unprecedented numbers of circular syntheses establishing new, productive uses of industrial chemical waste. Naturally, we envision extensions and improvements to the schemes we described. In particular, if more accurate data with which to estimate PMI or E factors40,50 and process scaling metrics became available, they should be updated in the cost function that ranks the candidate syntheses; when adequately broad substrate scopes for additional enzymes are delineated, these biocatalytic transformations51,52 should be added to Allchemy’s reaction knowledge base. In the fullness of time, applications such as Allchemy will be most impactful if adopted and shared across the chemical industry, for instance, with some companies inputting waste substrates they wish to dispose of, some indicating the products they would like to have synthesized, and some bidding to perform the waste-to-drug syntheses planned (or inspired) by the machine. In performing these tasks, we envision synergies between software such as Allchemy (to guide chemists to potential valorization opportunities) and distributed manufacturing networks such as ODP (to rapidly and cost-effectively deploy multiple production units utilizing locally available waste streams). Such an industry-wide system would help synchronize the circular-chemistry efforts, but its implementation will probably require incentivization by administrative bodies overseeing the chemical industry.

Methods

Retrosynthesis versus forward synthesis

Traditionally, chemists are accustomed to analysing how to make desired target molecules (retrosynthesis) rather than what molecules can be made from a given set of substrates (forward synthesis). However, a computerized retrosynthesis approach25,26,27,28,29 is ill suited for our purpose because it is not a priori known which valuable products are synthesizable from the waste substrates: if retrosynthetic searches to these targets do not terminate after a long time, it is impossible to distinguish whether they simply need more iterations28 or whether a given drug molecule cannot be traced back to waste precursors (and in this case, the searches will never terminate). By contrast, forward searches can exhaustively delineate the networks of molecules synthesizable from a given set of substrates, including those (and only those) valuable products that are makeable from waste. Moreover, such networks are highly interconnected16, ensuring that large numbers of possible synthetic solutions can be identified.

Choice of substrates

As ‘chemical waste’, we considered 189 small molecules which we identified to be waste by-products of large-scale industrial processes. Within this ‘basic set’, we further identified a ‘commercial’ subset of 56 molecules that are recycled from chemical waste or biomass, and are available commercially from companies located mostly in Asia, North America and Europe (see coloured star markers in Fig. 1 and full list in Supplementary Information section 1). For example, the Chinese company Jiangsu Kesheng Chemical Machinery makes resorcinol as part of an aramid fibre production process; USA-based BioCellection produces succinic, glutaric and adipic acids from plastic wastes; and the European conglomerate Global Industrial Dynamics offers ethylene derived from waste biomass. All of these molecules are pre-loaded into the Allchemy software (https://waste.allchemy.net) and additional entities can be proposed via the https://wastedb.allchemy.net portal (for details, see Supplementary Fig. 12). We note that although some of the ‘wastes’ are widely used as solvents, we are not interested in their uses as such; instead, they should be used as reaction substrates. In some searches, we also consider auxiliary sets, notably 1,000 basic reagents most often used (as quantified in ref. 34) in literature-reported syntheses and including molecules such as nitromethane, phthalimide and di-tert-butyl dicarbonate (for full list, see https://github.com/rmrmg/wasteRepo/blob/main/popular_reagents.smi).

Definitions of process variables X1–X9

Detailed definitions of the process variables discussed in the text are as follows.

X1 is a penalty assigned to reactions using harmful reagents as defined by GSK criteria17,18. GSK’s original scores are rescaled to the range 0–10 (10 = most harmful). In most cases, alternative reagents are also suggested, and the final value is calculated as a weighted average of the ‘primary’ and alternative conditions (0.3:0.7 weights).

X2 penalizes problematic solvents as defined by GSK19. The specific value is assigned on the 0–10 scale as for X1.

X3 assigns a +10 penalty for extreme reaction temperatures below −20 °C or above 150 °C.

X4 expresses a penalty that is linearly proportional to the exothermicity, ΔH/2, or endothermicity, ΔH/5, of reactions. The penalty is bounded to +10; ΔH is calculated using Benson’s group contributions method and is expressed in kcal mol−1.

X5 assigns a +10 ‘cost’ for executing each reaction step (this variable simply promotes shorter pathways). If consecutive steps can be performed in the same solvent (one pot), the penalty is reduced to 3.
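The three step-level penalties X3, X4 and X5 defined above are simple enough to state as code. The sketch below is illustrative only; the function names are hypothetical, and two details are our assumptions rather than statements from the text: ΔH is taken as negative for exothermic reactions (so its magnitude is used), and the reduced one-pot cost is applied to the later of two consecutive steps sharing a solvent.

```python
from typing import Optional


def x3_temperature(temp_c: float) -> float:
    """+10 penalty for extreme temperatures (below -20 C or above 150 C)."""
    return 10.0 if temp_c < -20.0 or temp_c > 150.0 else 0.0


def x4_heat(delta_h: float) -> float:
    """Penalty proportional to reaction heat (kcal/mol), capped at 10.

    Assumption: delta_h < 0 means exothermic. Exothermic reactions are
    penalized as |dH|/2 and endothermic ones as dH/5.
    """
    raw = -delta_h / 2.0 if delta_h < 0 else delta_h / 5.0
    return min(raw, 10.0)


def x5_step(solvent: str, previous_solvent: Optional[str] = None) -> float:
    """Fixed cost of 10 per step, reduced to 3 when the step can share
    a solvent with the preceding step (one pot)."""
    same_pot = previous_solvent is not None and solvent == previous_solvent
    return 3.0 if same_pot else 10.0
```

For example, a reaction run at −78 °C incurs the full temperature penalty, whereas one run at room temperature incurs none.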

X6 penalizes reactions that are characterized by low atom economy, defined as in ref. 32, and takes into account both substrates and reagents. Its role is to promote reactions that produce the least amount of by-products and/or waste. Each reaction gets a score ranging from 0 to 10.

X7 promotes convergent rather than linear pathways. This variable is defined to account for the position of the convergence point, and is expressed as an average of two terms, (linearity penalty + convergence location)/2. In this expression, the ‘linearity penalty’ is defined by the ratio of the longest linear sequence to the total number of reactions. The ‘convergence location’ term promotes routes in which convergence point(s) are closer to the final product, and is expressed as \(1-\exp \left(-0.1\times \sum_{i}\mathrm{avgYield}^{-N_{i}}\right)\), where avgYield is the average yield of a typical organic reaction (taken here as 75%)38, \(N_{i}\) is a distance measured in synthetic steps from substrate i to the target, and the sum is over all substrates. The average of the two terms is multiplied by 10 to give a final score of a pathway between 0 and 10 (for examples of this scoring scheme for different pathway structures, see Supplementary Information section 4.3).
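A direct transcription of the X7 definition into Python is given below as a sketch. The function name `x7_convergence` and its argument layout are hypothetical; the caller is assumed to supply the length of the longest linear sequence, the total number of reactions, and the step distance from each substrate to the target.

```python
import math
from typing import Sequence


def x7_convergence(
    longest_linear: int,
    total_reactions: int,
    substrate_distances: Sequence[int],  # N_i: steps from substrate i to target
    avg_yield: float = 0.75,             # typical yield assumed in the text
) -> float:
    """X7 = 10 * (linearity penalty + convergence location) / 2."""
    # Linearity penalty: longest linear sequence / total number of reactions.
    linearity = longest_linear / total_reactions
    # Convergence location: 1 - exp(-0.1 * sum_i avgYield**(-N_i)).
    location = 1.0 - math.exp(
        -0.1 * sum(avg_yield ** (-n) for n in substrate_distances)
    )
    return 10.0 * (linearity + location) / 2.0
```

As a usage example, consider two hypothetical three-reaction routes built from four substrates. A fully convergent route (two fragments made in parallel and then joined) has a longest linear sequence of 2 and substrate distances (2, 2, 2, 2), whereas a fully linear route has a longest linear sequence of 3 and distances (3, 3, 2, 1). Under this sketch the convergent route receives the lower penalty.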

X8 is a ‘geolocation’ variable that assigns a penalty to pathways in which the waste substrates come from different continents (see the stars in Fig. 1), implying increased transportation costs and/or longer delivery times. The overall pathway score is divided by a coefficient >1 if all ‘waste’ substrates are on the same continent. Here we promote such pathways by up to 20% (coefficient 1.25). If, for the substrates we considered, the location of production could not be determined, the geolocation was assigned to the company’s country of origin (although, in the Allchemy web application, the variable can also be calculated for user-defined locations, see Supplementary Fig. 6).

X9 penalizes pathways with high estimated cumulative PMI, calculated based on a previous methodology39 and using tables40 of PMI values for individual reactions. The raw value of cumulative PMI is rescaled to a range 1–1.5 based on the user-selected purification method. The overall pathway score is then multiplied by \({X}_{9}^{{w}_{9}}\), promoting pathways with the lowest cumulative PMI (for calculation details see Supplementary Information section 4.1).

Software details

Allchemy is a software platform for forward synthesis, that is, for iterative generation of synthetically plausible products and synthetic routes starting from arbitrary, user-defined substrates. The software can be run in either batch or web application mode; the web app can be used to visualize pathways obtained in either mode. Allchemy’s web app is based on the Django (https://www.djangoproject.com) framework and uses the d3.js library (https://d3js.org) for graph representation. Substrates can be input as SMILES or drawn in Chemwriter (https://chemwriter.com). Results of synthetic calculations are stored using PostgreSQL (https://postgresql.org). Communication between the web app and Allchemy’s backend is supported by Redis (https://redis.io) and RQ queue systems (https://python-rq.org).

The software has different modules focused on various aspects of forward synthesis: from the generation and exploration of networks created by prebiotic chemistries16, to in silico combinatorial chemistry and scaffold optimization, to targeted searches towards specific molecules (here, drugs and agrochemicals). The prebiotic-chemistry module is based on ~600 reaction rules generally accepted as plausible under conditions of primitive Earth; other modules are based on ~10,000 rules covering reactions commonly used in pharmaceutical chemistry (including stereoselective ones) as well as those most capable of generating molecular diversity in as few synthetic generations as possible (multicomponent reactions, rearrangements). All rules are coded in the SMARTS notation and each has a much broader scope than any particular literature precedent underlying it (see section ‘Reaction rules’ and references16,23,25,26).

In the ‘targeted’ searches implemented in this work, at each synthetic generation (Fig. 2a, b), the rules are applied to the original substrates and to the subset of intermediates retained (that is, those that can still serve as useful building blocks and those above a certain similarity threshold to the ‘target’ molecules). A molecule is deemed suitable for a given reaction if it contains the core of at least one substrate as defined by the reaction rule but, at the same time, does not contain any groups incompatible with the reaction. These matching conditions are evaluated using the ‘GetSubstructMatches’ function from the RDKit library (www.rdkit.org). Reactions are executed using the ‘RunReactants’ function from the ChemicalReaction class of the RDKit library with in-house enhancements to enforce proper stereochemistry and/or tautomeric forms. If a reaction template matches more than one locus on the substrate, RunReactants is executed at each of them. The products generated by RunReactants are filtered by algorithms developed in-house to recognize and eliminate chemically invalid molecules (for example, those violating Bredt’s rule) as well as molecules that do not satisfy user-specified constraints (for example, those exceeding a certain allowed molecular mass). As the network of reactions is being generated, the reaction path leading to each molecule is stored as an ordered list of reaction steps, each of which is a tuple of reaction SMILES and reaction name.
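
The path bookkeeping described at the end of the preceding paragraph can be illustrated with a small, library-free Python sketch. It is not Allchemy's code: the function name `record_reaction`, the dictionary keyed by product SMILES and the choice to keep only the first path found for each molecule are assumptions made for illustration, and the two reactions are generic textbook examples.

```python
from typing import Dict, List, Sequence, Tuple

Step = Tuple[str, str]  # (reaction SMILES, reaction name)


def record_reaction(paths: Dict[str, List[Step]], substrates: Sequence[str],
                    product: str, reaction_name: str) -> List[Step]:
    """Store the path to `product` as the substrates' paths plus one new step."""
    if product in paths:  # assumption: keep the first path found
        return paths[product]
    path: List[Step] = []
    for substrate in substrates:
        # Starting materials have no stored path and contribute no steps
        for step in paths.get(substrate, []):
            if step not in path:
                path.append(step)
    reaction_smiles = ".".join(substrates) + ">>" + product
    path.append((reaction_smiles, reaction_name))
    paths[product] = path
    return path


paths: Dict[str, List[Step]] = {}
record_reaction(paths, ["CC(=O)O", "OCC"], "CC(=O)OCC", "Fischer esterification")
amide_path = record_reaction(paths, ["CC(=O)OCC", "N"], "CC(N)=O", "Ester aminolysis")
for smiles, name in amide_path:
    print(name, smiles)
```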

Laboratory-scale validations

With reference to Fig. 4, we first considered synthesis of the antibiotic dapsone (Extended Data Fig. 4, bottom) from lactic acid and phenol. Unlike in a traditional route based on double aromatic nucleophilic substitution of 4-chloronitrobenzene with sodium sulfide, this synthesis relies on the Smiles rearrangement involving bisphenol S 1 and 2-bromopropionamide 2, the latter prepared from lactic acid as described previously53. We validated this transformation, which is to our knowledge previously unreported, under benign conditions (K2CO3, KI, 50 °C in DMSO followed by NaOH, 130 °C in DMSO), achieving 82% yield (Fig. 4a, starred step I).

The second example was the synthesis of carvedilol, which is used to treat high blood pressure, congestive heart failure, and left ventricular dysfunction. Its proposed waste-to-drug synthesis (starting from aniline from biomass, guaiacol from lignin waste, and resorcinol from the textile industry) features only one previously undescribed reaction, the reductive amination of 2-(2-methoxyphenoxy)acetaldehyde 4. We carried out this transformation, denoted by star II in Fig. 4b, in 86% yield using a previously proposed environmentally friendly approach54 (Rh/Al2O3 catalyst and 25% aqueous solution of ammonia).

In the synthesis of the heart medication bisoprolol, four steps, denoted by stars III–VI in Fig. 4c, lacked direct literature precedent. Straightforward esterification of 4-(allyloxy)benzoic acid 6 (from 4-hydroxybenzoic acid recyclable from lignin processing) proceeded in 72% yield (star III), followed by quantitative reduction of ethyl 4-allyloxybenzoate 7 to alcohol 8 (star IV). Subsequent conversion of 8 to the corresponding 4-allyloxybenzyl chloride 9 was based on a published procedure and also proceeded in quantitative yield. This chloride was then used to alkylate 2-isopropoxyethanol 10 (under phase-transfer catalysis conditions with 50% aqueous NaOH) to give the allyl ether of 4-(2-isopropoxyethoxymethyl)phenol 11 in 85% yield (star V). Finally, the unsaturated product was treated with Oxone in aqueous phosphate buffer to give 4-(2-isopropoxyethoxymethyl)phenyl glycidyl ether 12 in 81% yield (star VI).

In the synthesis of the topical anaesthetic proxymetacaine (starting from p-hydroxybenzoic acid from lignin waste and four other waste substrates: propanol, formaldehyde, acetaldehyde and acetonitrile; see Supplementary Table 1), three steps required experimental validation. With reference to Fig. 4d, 2-(diethylamino)ethanol 15 was obtained from 1,4-dioxane-2,5-diol (dimer of 14) and diethylamine 13 in 48% yield (star VII) via reductive amination in ethyl acetate using NaBH(OAc)3 as the reducing agent. Esterification of 4-hydroxy-3-nitrobenzoic acid 16 with 2-(diethylamino)ethanol 15 in dry toluene in the presence of a catalytic amount of HCl followed, giving 2-(diethylamino)ethyl 4-hydroxy-3-nitrobenzoate 17 in 67% yield (star VIII). Subsequently, this product was alkylated with n-propyl chloride 18 to provide 2-(diethylamino)ethyl 3-nitro-4-propoxybenzoate 19 in 89% yield in acetonitrile, or in 54% yield in the greener solvent acetone (star IX). Further synthetic details of this and other routes discussed in this section are provided in Supplementary Information section 5.

Regarding larger-scale validations, the processes for cisatracurium, midazolam, and propofol precursors were all conducted on ODP’s reconfigurable platforms. Sub-kits utilized plug flow reactors with perfluoroalkoxy tubing flow paths, commercial continuous stirred tank reactors, and in-house designed filter–washer–dryers that have been described previously20. Reagents were purchased from their respective vendors and used as received, without additional purification. Simulated waste streams were created as described in Supplementary Information section 6, and analysis was carried out by HPLC against a commercial standard.