Introduction

Enzymes have been employed for a wide variety of chemical processes for decades (Fig. 1). For example, nitrile hydratases are used to make acrylamide on the thousands of tons scale, and enzymes have been added to detergents for more than 30 years1,2. More recently, the use of proteins as catalysts for chemical synthesis of more complex molecules, such as pharmaceuticals, has become increasingly widespread. Enzymes are particularly powerful because they merge the advantages of a directing group controlling selectivity and a catalyst in a single reagent3, which can also be used with other enzymes in a one-pot reaction. Over the past 20 years, combined synthetic–enzymatic systems have enabled multiple total synthesis endeavours, and the use of enzymes is becoming routine in some process chemistry groups in industry4. Until recently, only a subset of enzymes, such as lipases or ketoreductases (KREDs), were available for chemical synthesis applications5. However, the growth of potential sources of enzymes for process chemistry applications has accelerated, resulting in a diverse toolkit of enzymes now available to researchers. In 2014, the development of a total enzymatic synthesis of the nucleoside didanosine highlighted the possibility of ‘bio-retrosynthesis’6. Based on the principles of retrosynthesis, where the target molecule is transformed into simple precursors by ‘breaking’ bonds that can be formed from synthetic transformations, ‘bio-retrosynthesis’ involves the design of an artificial enzyme cascade (a synthetic biochemical pathway) that offers a possible route towards the desired target molecule by choosing enzymes as catalysts for the required chemistry. The fully biocatalyst-driven synthesis of the HIV inhibitor islatravir (Fig. 1), which will be discussed in more detail in the Applications section, demonstrates the power of combining modern approaches towards designing new enzyme cascades, including repurposing of known biosynthetic pathways, screening of saturation mutagenesis libraries of enzyme variants and directed evolution against selected residues towards increased enzyme stability and turnover7.

Fig. 1: Examples of different products synthesized using biocatalysis.

Biocatalysis is enabled by various factors including reaction design, biocatalyst choice and optimization as well as bioprocess development. It has been used for the synthesis of a wide range of chemicals and pharmaceuticals, including those shown.

In this Primer, we discuss the different development stages (reaction design, biocatalyst choice and optimization, and bioprocess development) that can lead to a range of industrial products as shown in Fig. 1. These stages are interdependent and need to be closely integrated. Starting with a target molecule, a single or multistep biocatalytic process needs to be designed, often manually, using expertise and precedent literature from organic synthesis and biocatalysis. More recently, programmes such as RetroBioCat8, together with biocatalyst databases, are being developed to speed up this step and enable the automatic design of de novo biosynthetic pathways. Once a process has been designed, suitable enzymes need to be selected for each step and tested (Experimentation). The increasing adoption of biocatalysis by the pharmaceutical industry has been driven by innovative tools in protein engineering, including laboratory evolution and computational design, which allow fast optimization of catalyst activity (Experimentation). As a result, strict reaction parameters (Results) can now be met at reasonable timescales for successful bioprocess development (Applications). These parameters include high activity on non-natural substrates and tolerance of non-physiological reaction conditions such as high temperature, high substrate concentrations, organic solvents and wide pH ranges. Alongside protein engineering tools, databases of available biocatalysts with their reaction profiles are starting to be established (Reproducibility and data deposition). We also detail the current limitations of biocatalysis and areas of importance for further advancing this method to expand the breadth of applications (Limitations and optimizations). Finally, we highlight what the future holds for biocatalysis and the impact it will likely have in the next decade (Outlook).

Experimentation

In this section, we highlight several sources available to scientists looking for an enzyme as a starting point to develop a new biocatalyst. We discuss how one can optimize biocatalysts using directed evolution and computational design as well as how to incorporate non-canonical amino acids to enable novel chemistries.

Sources of biocatalysts

Enzymes can be sourced from a few outlets that include commercial sources, adaptation of enzymes from biosynthesis, screening of metagenomic libraries and in silico mining of databases.

Commercial sources of biocatalysts

Purified enzymes or lyophilized crude cell lysates are often available for direct purchase through chemical vendors (Fig. 2a). For example, one can purchase specific dehydrogenases, reductases or carbohydrate-active enzymes, and directly employ them in chemical synthesis. Libraries of commercially produced enzymes can be screened against specific substrates to identify candidate biocatalysts. Available libraries include oxidoreductases (KREDs, imine reductases (IREDs), ene reductases, Baeyer–Villiger monooxygenases, monoamine oxidases), transferases (transaminases), lyases (nitrilases, halohydrin dehalogenases, acylases), carbon–carbon bond-forming enzymes and carbon–nitrogen bond-forming enzymes9 (Box 1).

Fig. 2: Sources of biocatalysts.

a | Types of biocatalyst useful in chemoenzymatic synthesis and biochemical applications. b | Natural product biosynthetic enzymes. In the biosynthetic pathway of griselimycin, the Fe(II) and 2-oxoglutarate-dependent enzyme, GriE, reacts with l-leucine to give (2S,4R)-5-hydroxy-leucine. This regioselective and stereoselective hydroxylation enables a chemoenzymatic synthesis of manzacidin C. c | Metagenomic screening approaches. Top: first step of metagenomics screening is extraction of total DNA from environmental samples such as marine sponges or soil microorganism communities. This DNA is constructed into a metagenomic library and is commonly screened using sequence-based or function-based approaches. Bottom: enzymatic conversion of A/B antigens into H antigen. All blood group antigens depicted are type I.

In addition to these types of commercial sources and kits, the wide availability of commercial gene synthesis means that, in principle, any enzyme with a known amino acid sequence can be obtained. Researchers can order the synthetic gene corresponding to the protein, express it recombinantly in a suitable host organism or by cell-free protein synthesis, and then purify the protein for testing, much as one would test a catalyst in a chemical synthesis. Databases ranging from SciFinder (with a licence) to UniProt10 and BioCatNet11 allow researchers to identify enzymes that catalyse desired chemical transformations. Thus, through the combination of publicly available sequence data and commercial gene synthesis, any enzyme reported is available to the researcher, for a cost.

Natural product biosynthetic enzymes

Enzymes involved in the synthesis of specialized metabolites, or natural products, are particularly useful as starting points for biocatalysis. Natural products tend to have diverse chemical structures, and studies on the biosynthesis of such natural products have unveiled a correspondingly diverse set of biosynthetic enzymes. Therefore, natural product biosynthetic enzymes are a potential source for diverse catalysts. A recent review discusses the wide-ranging chemical and enzymatic diversity found in natural product biosynthesis12. From a biocatalytic point of view, the most important criteria in selecting a potential biosynthetic enzyme include its substrate specificity, cofactor dependence, turnover, stability, functional recombinant expression and ability to perform a stand-alone function outside its natural pathway within a cell.

One group of biosynthetic enzymes widely found in natural product biosynthesis are oxidative Fe(II) and 2-oxoglutarate-dependent enzymes, which can catalyse challenging reactions such as hydroxylation, halogenation and oxidative cyclizations, typically for C(sp3)–H functionalization, due to the reactive intermediates generated in the catalytic cycle13. These powerful enzymes have emerged as useful biocatalysts for chemical synthesis14. One example is GriE — an enzyme involved in the synthesis of the peptide griselimycin in Streptomyces15 that hydroxylates the δ-carbon of l-leucine and was employed for a key step in the total synthesis of manzacidin C16 (Fig. 2b). Other examples include halogenases such as BesD, which chlorinate free amino acids to generate non-proteinogenic amino acids17. A final example is KabC, which catalyses an oxidative cyclization and was employed in a biocatalytic synthesis of the neurochemical kainic acid18. These types of Fe(II) and 2-oxoglutarate-dependent enzymes represent a promising group of biocatalysts for further development as they can be reconstituted either using Escherichia coli cell lysates or in vitro19.

Other known natural product biosynthetic enzymes with demonstrated use in chemical synthesis include methyltransferases20, Diels–Alderases that catalyse cycloadditions21, halogenases22, uridine diphosphate-dependent glycosyltransferases23, laccases24 and pyridoxal phosphate-dependent enzymes25. However, biosynthetic enzymes are typically sluggish catalysts, with turnover number (kcat) values about 30-fold lower than those of enzymes of primary metabolism26, probably owing to the constraints at work in their evolutionary histories26,27. For this reason, biosynthetic enzymes are viewed as starting points for directed evolution efforts to improve rates of catalysis, substrate tolerance, and stability and solubility in desired solvents, which include ionic liquids and water-soluble organic solvents23.

Metagenomic and in silico screening

Metagenomic libraries are an additional source for new biocatalysts28; these are genomic libraries of DNA obtained from environmental samples such as marine sponges, soil or faeces29. Functional-based screening or sequence-based screening is often used to search through a genomic library to find enzymes of interest (Fig. 2c). Functional-based screening can incorporate the use of colorimetric assays30,31,32, mechanism-based probes33 and/or droplet microfluidics34. Researchers are looking for enzymes in these libraries that demonstrate catalysis of the desired target reaction within these screening platforms. For example, mammalian microbiomes are a rich source of carbohydrate-active enzymes, which are often encoded in polysaccharide utilization loci in specific bacterial genomes. Screening of a human metagenomic library using a fluorogenic substrate identified a pair of enzymes that convert the A antigen into the H antigen of O-type blood, enabling a biocatalytic approach to produce universal donor O-type blood from A-type blood35.

By contrast, sequence-based screening is based on finding new enzymes with sequence similarity to known enzymes. One approach involves the use of PCR amplification of genomic sequences with degenerate primers. These primers are a mixture of oligonucleotide sequences that code for highly conserved regions of a desired type of catalyst, and therefore allow amplification of genes encoding for this catalyst from naturally derived DNA libraries. For example, sponge microbiomes are known for their biosynthetic capabilities, and in a case of sequence-based screening with degenerate primers, a new halogenase Krml was isolated from a sponge microbiome and employed for the regioselective halogenation of tryptophan and a range of indole-derived peptide substrates36.
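The idea that a degenerate primer is a mixture of oligonucleotides can be made concrete with a short sketch. The code below expands a primer written with the standard IUPAC nucleotide ambiguity codes into the individual sequences it represents and reports the size of the mixture. The example primer is hypothetical and is not taken from any of the studies cited here.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def degeneracy(primer):
    """Number of distinct oligonucleotides in the degenerate primer mixture."""
    count = 1
    for base in primer.upper():
        count *= len(IUPAC[base])
    return count


def expand(primer):
    """Enumerate every concrete sequence encoded by a degenerate primer."""
    pools = [IUPAC[base] for base in primer.upper()]
    return ["".join(combo) for combo in product(*pools)]


# Hypothetical primer used only for illustration.
EXAMPLE_PRIMER = "GGNTGGGTNCAYGA"

if __name__ == "__main__":
    print(degeneracy(EXAMPLE_PRIMER), "sequences in the mixture")
    for sequence in expand(EXAMPLE_PRIMER)[:4]:
        print(sequence)
```

In practice, primer degeneracy is usually kept low, because each individual sequence is diluted as the mixture grows.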

Another approach is the direct analysis of sequencing data, where specific types of desired enzymes can be found through in silico analysis37. For example, the sequencing of a domestic drain metagenome was analysed in silico to find transaminase candidates, which were then tested for their ability to carry out transaminations on diverse substrates. One enzyme from this set of candidates retained activity in 50% dimethyl sulfoxide (DMSO), a solvent tolerance that had not been reported in a transaminase previously. As these examples demonstrate, once a gene encoding a desired enzyme is identified, the gene can be cloned and the enzyme can be expressed recombinantly to establish its function38.

Directed evolution

Wild-type enzymes are often not suitable for direct use in industrial applications and must first undergo optimization to improve properties such as substrate specificity and selectivity as well as catalytic efficiency and stability. Directed evolution is a powerful and versatile technology for adapting these enzymes to perform new functions (as highlighted by the award of the 2018 Nobel Prize for Chemistry)39,40,41. The directed evolution cycle involves iterative rounds of DNA library design and generation, gene expression and screening of enzyme library members (Fig. 3). Multiple properties can be optimized in parallel and improved variants can be isolated, characterized and used as templates for further rounds of evolution.

Fig. 3: Directed evolution cycle.

First step in the evolution cycle involves creating a library of DNA variants using standard molecular biology techniques (step 1). Host cells are transformed with the DNA library and target proteins are produced (step 2). Proteins are evaluated using high-throughput screening techniques such as plate-based assays coupled with high performance liquid chromatography analysis, colony-based assays coupled with colorimetric detection or microfluidics-based assays coupled with fluorescence detection, where X represents the starting material and Y represents the fluorescent product or by-product (step 3). Final step of the evolution cycle involves isolation and characterization of the most active clones. The most active variant will serve as a template for subsequent rounds of evolution (step 4).
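The four steps of the cycle can be sketched as a toy simulation. This is an illustration of the iterative logic only, not a model of any real enzyme: the screen readout `toy_activity`, the arbitrary `optimum` sequence, the single random substitution per variant and the library size are all assumptions made for the example.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate(parent, rng, n_mutations=1):
    """Step 1: create one library member by random substitution."""
    residues = list(parent)
    for position in rng.sample(range(len(residues)), n_mutations):
        residues[position] = rng.choice(AMINO_ACIDS)
    return "".join(residues)


def toy_activity(variant, optimum):
    """Hypothetical screen readout: positions matching an arbitrary optimum."""
    return sum(a == b for a, b in zip(variant, optimum))


def directed_evolution(template, optimum, rounds=5, library_size=200, seed=0):
    """Run iterative rounds of library generation, screening and selection."""
    rng = random.Random(seed)
    history = [toy_activity(template, optimum)]
    for _ in range(rounds):
        # Steps 1 and 2: generate the library (expression is implicit here).
        library = [mutate(template, rng) for _ in range(library_size)]
        # Step 3: screen every member and keep the most active one.
        best = max(library, key=lambda v: toy_activity(v, optimum))
        # Step 4: the best variant becomes the template only if it improves.
        if toy_activity(best, optimum) > toy_activity(template, optimum):
            template = best
        history.append(toy_activity(template, optimum))
    return template, history


if __name__ == "__main__":
    final, history = directed_evolution("AAAAAAAAAA", "ACDEFGHIKL")
    print(final, history)
```

Because the parent is retained whenever no library member improves on it, the recorded activity never decreases from one round to the next.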

Following identification of a suitable starting template, DNA libraries are generated using numerous standard molecular biology techniques, such as random mutagenesis or site saturation mutagenesis. The chosen method of library generation depends on factors such as the availability of structural information and screening capacity. The design of smaller, more focused libraries (10²–10⁴ variants) often employs computational modelling and bioinformatics to guide the selection of amino acid residues for randomization. These libraries are generated using techniques such as saturation mutagenesis or iterative combinatorial active site testing42 and often employ reduced codons43,44. Larger library diversity (10⁵–10⁸ variants) can be generated using techniques such as error-prone PCR45 and gene shuffling46. It is common to use multiple mutagenesis techniques during enzyme evolution to target different regions of the protein structure. For example, focused active site mutations can be beneficial for reshaping substrate binding pockets and improving activity towards non-native substrates, and mutations to the protein surface and flexible loop regions can often result in improved solvent tolerance and thermostability. Beneficial mutations are typically combined during evolutionary optimization using DNA shuffling and can be guided by computational algorithms.
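A short calculation illustrates why the number of randomized positions and the choice of codon set largely determine whether a focused library stays within a screenable size. The sketch below assumes the widely used NNK degenerate codon (32 codons covering all 20 amino acids and one stop codon) as an example of a codon set, and assumes that every codon combination is equally represented in the library. Both are simplifying assumptions for illustration and are not taken from the studies cited above.

```python
import math


def codon_library_size(codons_per_site, n_sites):
    """Number of distinct DNA variants when n_sites are randomized together."""
    return codons_per_site ** n_sites


def clones_for_coverage(library_size, coverage=0.95):
    """Clones to screen so that any given variant is sampled at least once
    with probability `coverage`, assuming equal representation."""
    if not 0.0 < coverage < 1.0:
        raise ValueError("coverage must lie strictly between 0 and 1")
    if library_size < 2:
        return 1
    # P(variant missed in T picks) = (1 - 1/V)**T, solved for T.
    picks = math.log(1.0 - coverage) / math.log(1.0 - 1.0 / library_size)
    return math.ceil(picks)


NNK_CODONS = 32  # N = A/C/G/T, K = G/T, so 4 * 4 * 2 codons

if __name__ == "__main__":
    for sites in (1, 2, 3):
        size = codon_library_size(NNK_CODONS, sites)
        print(sites, "site(s):", size, "variants,",
              clones_for_coverage(size), "clones for 95% coverage")
```

Under these assumptions, roughly three times as many clones as variants need to be screened for 95% coverage, which is one reason reduced codon sets are attractive when several positions are randomized together.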

Transforming cells with DNA libraries leads to spatial separation of library members and establishes a link between genotype and phenotype that must be maintained during protein production and screening to allow characterization of individual library members. Arraying colonies into multiwell plates for protein production and screening offers the greatest versatility, as variants can be evaluated using a wide range of chromatographic, spectrophotometric and spectroscopic techniques. Although chromatographic methods are of relatively low throughput, they are commonly employed for applications in industrial biocatalysis as the assays are compatible with screening under process conditions, which often employ high substrate loadings (>100 g l⁻¹), co-solvents (for example, up to 50% DMSO in aqueous media) and high temperatures (40–50 °C). This workflow can also be automated to improve speed, accuracy and throughput. For example, colony pickers allow users to array 10³ colonies per hour into 96-well plates, liquid handling robots accelerate aliquoting and transfer steps required for library generation and protein production, and reaction analysis using state-of-the-art ultra-high performance liquid chromatography systems allows evaluation of 10³ clones per instrument per day. GSK used this approach to engineer an enantioselective IRED with a 38,000-fold improvement over the wild-type enzyme47. This IRED variant was employed in a reductive amination and kinetic resolution step to manufacture the lysine-specific demethylase 1 (LSD1) inhibitor GSK2879552, a treatment for small cell lung cancer and acute leukaemia. Following a similar approach, Codexis and Merck have engineered five different enzymes, which form part of an impressive nine-enzyme cascade process to manufacture the HIV treatment islatravir7 (see Applications section for a more detailed discussion and reaction scheme).

To evaluate larger libraries of 10⁵–10⁸ variants, more specialized screening approaches can be employed, including colony-based assays48,49,50, fluorescence-activated cell sorting51, phage display52, microfluidic-based screening53,54, selection-based approaches55 and continuous evolution56,57. For example, monoamine oxidase from Aspergillus niger (MAO-N) has been extensively engineered for the selective oxidation of a wide range of amine substrates using a colorimetric colony-based assay that relies on the detection of the hydrogen peroxide co-product by horseradish peroxidase (HRP) and a reactive dye58. The throughput of oxidase evolution based on hydrogen peroxide detection can be further increased by screening variants in picolitre droplets. Indeed, in a recent study, an ultra-high-throughput microfluidic assay was used to screen a library of 10⁷ cyclohexylamine oxidase variants and, after only a single round of evolution, the most improved variant isolated had a 960-fold improvement in catalytic efficiency59. Although the screening capacity in this example is greatly increased, picolitre droplet sorting is currently restricted to fluorescence as a detection method, which limits the versatility of this approach. Ongoing research is focused on coupling microfluidics with alternative methods of detection, such as mass spectrometry, to provide approaches for a wide host of chemistries60.

Selection-based approaches including continuous evolution platforms53,54,55, where improved catalyst performance is linked to cell viability, offer ultra-high-throughput screening capabilities (10⁶–10¹⁰ variants). These methods are highly specialized as improvements in the enzyme activity of interest must be linked to cell survival. Optimization of these multicomponent systems requires considerable effort and can take up to several years; it is important to have control over the stringency of the selection pressure and to ensure that the host organism is not able to evolve alternative mechanisms of survival. However, for enzymes whose native activities can be associated with organism fitness, this type of assay is particularly powerful as it allows rapid evaluation of broad sequence space. A key example is the development of a selection-based method for engineering pyrrolysyl-tRNA synthetases (PylRS) for the genetic incorporation of non-canonical amino acids into proteins55. This versatile approach has been applied by numerous research groups to the evolution of a panel of PylRS variants, which are now able to accept more than 150 different non-canonical amino acids61.

Computational enzyme design and engineering

For the most part, the high efficiency of enzymes in accelerating chemical reactions has been attributed to their highly pre-organized active site pockets that precisely position the catalytic residues for transition state stabilization62. This precise arrangement in the active site pocket to optimize the chemical steps is complemented by the inherent flexibility of the enzyme structure. Enzymes can adopt multiple conformations that often play critical roles in equally important processes, such as substrate binding and/or product release for restarting the catalytic cycle63. To this end, computational enzyme design protocols should propose specific amino acid changes (located in the active site but also in remote positions) to achieve highly pre-organized active site pockets for transition state stabilization, and optimize the enzyme conformational ensemble to favour substrate binding and product release64.

In practice, available computational protocols focus only on a selected set of the complex features of enzyme catalysis; that is, they design enzymes based on either the chemical steps of the desired chemical transformations (see Fig. 4A), the substrate binding/product release process or the enzyme conformational dynamics (Fig. 4B). Different computational techniques are needed for each of the above features (see Fig. 4).

Fig. 4: Computational approaches to enzyme design.

A | Methods that consider the chemical steps of the desired chemical transformations: quantum mechanics (QM), quantum mechanics/molecular mechanics (QM/MM) and empirical valence bond (EVB). Quantum mechanics (QM) is used to model the active site and associated transition states of the desired transformation in the theozyme to evaluate potential changes for rate acceleration. This optimal arrangement is then grafted onto an existing protein scaffold and further optimized by protein design software, such as Rosetta (part Aa). Protein conformational dynamics can be incorporated using short molecular dynamics (MD) simulations to refine this estimation, ultimately enhancing enzyme activity and selectivity (catalytic selectivity by computational design (CASCO)) (part Ab). Additional refinements for multistate design allow for considering a set of related structures often coming from short-timescale MD simulations in the design process (part Ac). B | Methods that consider the conformational changes key for substrate binding or product release: MD simulations and Monte Carlo (MC) simulations. CAVER software is used to model the different tunnels that are available for substrate binding to the active site and product release (part Ba). Tunnel analysis allows identification of the narrowest regions, that is the bottleneck, where mutations are usually introduced (part Ba). Recent approaches take advantage of the information gained from extensive MD simulations for identifying key conformationally relevant residues (not strictly located at the active site), which by mutation can induce a shift in the relative populations of the conformational states stabilizing the catalytically relevant ones (shortest path map, SPM) (part Bb). Selection of an ancestral enzyme scaffold displaying high levels of flexibility can be also used for the design of new enzymatic activities (part Bc). 
Images of proteins throughout the figure were generated using PyMOL (The PyMOL Molecular Graphics System, Schrödinger, LLC.)

Initial attempts to rationally design enzymes were focused on the chemical steps of the process (Fig. 4A, and selected examples in Fig. 4Aa–c). The transition state of the desired transformation in the theoretical enzyme or ‘theozyme’ active site pocket is modelled with quantum mechanics to assess the potential rate acceleration and the ideal geometric constraints for optimal transition state stabilization (Fig. 4Aa). This optimal arrangement, which contains only a few active site residues owing to the high computational cost of quantum mechanics calculations, is then grafted onto an existing protein scaffold and further optimized by means of Rosetta or other related protein design software65,66. Further refinements to this original formulation (named inside-out) can be made by incorporating data on protein conformational dynamics by means of short molecular dynamics simulations. In these simulations, the enzyme variant is immersed in a water solvent box, and the simulation is used to assess whether the optimal arrangement of the catalytic residues (also known as the near attack conformation) is maintained throughout the simulation time. A higher number of near attack conformations explored during the molecular dynamics is attributed to a higher catalytic activity and/or selectivity, as the catalytic residues are properly arranged for catalysis for most of the simulation time. These observations resulted in the development of computational methodologies based on Rosetta and molecular dynamics simulations for enhancing enzyme activity and selectivity (catalytic selectivity by computational design (CASCO), as shown in Fig. 4Ab) or thermostability (framework for rapid enzyme stabilization by computational libraries (FRESCO))67. Additional refinements such as the use of an ensemble of closely related enzyme conformations from either normal mode analysis, short molecular dynamics simulations or small perturbations in the enzyme backbone angles for multi-state design (Fig. 4Ac) were proposed to include some limited protein flexibility in the design process68. Although these strategies include some protein flexibility during the design, the conformations in the ensemble are rather similar because they usually come from short (picosecond to nanosecond) molecular dynamics simulations. Other strategies are based on computing the direct effect of the included mutations on the activation barrier of the enzyme-catalysed process (with the computationally more demanding quantum mechanics/molecular mechanics or empirical valence bond (EVB) strategies, as shown in Fig. 4A), rather than estimating the effect by means of some key geometrical constraints (as in the near attack conformation analysis)69.
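The near attack conformation analysis described above amounts to counting the simulation frames in which the catalytic geometry is satisfied. The sketch below assumes that one distance and one angle per frame have already been extracted from a trajectory with a separate tool. The cutoff values and the choice of a single distance and angle are hypothetical placeholders and would need to be derived from the theozyme geometry of the reaction of interest.

```python
def nac_fraction(distances, angles, max_distance=3.5, min_angle=150.0):
    """Fraction of frames that satisfy both geometric criteria.

    distances: per-frame distance between the reacting atoms (angstroms)
    angles:    per-frame attack angle (degrees)
    The default cutoffs are illustrative assumptions only.
    """
    if len(distances) != len(angles):
        raise ValueError("distances and angles must cover the same frames")
    if not distances:
        raise ValueError("at least one frame is required")
    hits = sum(
        1
        for distance, angle in zip(distances, angles)
        if distance <= max_distance and angle >= min_angle
    )
    return hits / len(distances)


if __name__ == "__main__":
    # Made-up per-frame values for two variants, for illustration only.
    wild_type = nac_fraction([3.2, 3.8, 3.4, 4.1], [160.0, 170.0, 140.0, 155.0])
    variant = nac_fraction([3.1, 3.3, 3.4, 3.9], [165.0, 158.0, 152.0, 160.0])
    print("wild type:", wild_type, "variant:", variant)
```

A variant with a higher fraction would be ranked as more likely to be active under this criterion, with the caveat noted in the text that short simulations sample only closely related conformations.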

The importance of enzyme conformational dynamics for enzyme design has gained recognition in recent years70,71,72 (Fig. 4B). Conformationally flexible loops adjacent to the active site pocket can regulate substrate binding and/or product release, and some studies have shown these loops to be crucial for enhanced enzymatic activity in many enzyme families73. Bioinformatic tools such as CAVER have been developed to identify tunnels and channels, and to suggest potential mutational hotspots for novel catalytic activity74 (Fig. 4Ba). The analysis of some natural and laboratory evolution pathways demonstrated that increased enzymatic activity is often achieved by introducing mutations that alter the enzyme’s conformational ensemble75. These mutations can be located at the active site or may be located at distal positions and induce a long-range effect that impacts the enzyme active site pocket and, thus, catalysis. This impact on enzymatic activity is often achieved by favouring the enzyme conformational states that are key for the novel functionality (catalytically productive conformations), while disfavouring non-productive conformational states, thus converting computational enzyme design into a population shift problem64. In this direction, conformationally driven computational approaches that identify such long-range allosteric networks of interactions, including the shortest path map (SPM) tool, have been used recently for predicting distal and active site mutations76 (Fig. 4Bb). Multistate computational design based on ensembles of enzyme conformations taken from room-temperature X-ray crystallography is another successful strategy for efficient computational enzyme design77. The reconstruction of ancestral enzymes that display a higher degree of flexibility than their modern counterparts and their use as initial scaffolds for enzyme design has additionally yielded interesting new insights78. The higher flexibility observed in many ancestral variants was key for achieving high catalytic activity with only a few mutations located at the active site. Ancestrally reconstructed enzymes are usually less specialized than their modern counterparts, thus often presenting higher levels of substrate and catalytic promiscuity79, which makes them excellent starting protein scaffolds for enzyme design (Fig. 4Bc). These examples indicate that both the selection of a conformationally rich scaffold and the consideration of multiple enzyme conformations are crucial for successful computational enzyme design.
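The population shift view of enzyme design can be illustrated with a two-state Boltzmann calculation. The sketch below assumes that each conformational state has a known relative free energy and that the states are in thermal equilibrium. The energies used are invented for illustration and do not correspond to any enzyme discussed here.

```python
import math

GAS_CONSTANT = 0.0019872  # kcal mol^-1 K^-1


def populations(free_energies, temperature=298.15):
    """Equilibrium populations of conformational states.

    free_energies: mapping of state name to relative free energy (kcal/mol)
    """
    rt = GAS_CONSTANT * temperature
    lowest = min(free_energies.values())
    weights = {
        state: math.exp(-(energy - lowest) / rt)
        for state, energy in free_energies.items()
    }
    total = sum(weights.values())
    return {state: weight / total for state, weight in weights.items()}


if __name__ == "__main__":
    # Hypothetical energies: the mutation stabilizes the productive state.
    wild_type = populations({"productive": 1.0, "non_productive": 0.0})
    variant = populations({"productive": 0.0, "non_productive": 1.0})
    print("wild type productive fraction:", round(wild_type["productive"], 3))
    print("variant productive fraction:", round(variant["productive"], 3))
```

Even a modest change in relative stability, here 2 kcal/mol in total, is enough to turn a minor conformational state into the dominant one, which is the effect that distal mutations are thought to exploit.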

Biocatalysts with non-canonical amino acids

The design and engineering of enzymes with an expanded amino acid alphabet is a nascent and rapidly developing area of biocatalysis. Enzymes are exceptionally powerful catalysts capable of promoting chemical transformations with efficiencies and selectivities that are difficult to achieve with small-molecule systems. However, enzymes are typically biosynthesized from the 20 canonical amino acids that contain a limited number of functional groups, restricting the range of catalytic mechanisms that can be installed into designed active sites. The emergence of powerful genetic code expansion methodologies has enabled the site-specific installation of hundreds of structurally and functionally diverse non-canonical amino acids into proteins80,81. Careful selection of a suitable non-natural amino acid and its positioning within the target protein scaffold is required to address the application of interest (Fig. 5). For instance, a key active site residue is often replaced with a non-canonical amino acid that is a close structural analogue to modulate catalytic function for mechanistic investigations of natural enzymes82,83,84,85,86. Alternatively, to design enzymes with new functions, the selection of amino acid takes inspiration from structural motifs present in small-molecule catalysts with positioning within the protein guided by computation87,88.

Fig. 5: Design and engineering of biocatalysts using non-canonical amino acids.

Representative workflow for the use of genetic code expansion to enhance biocatalytic function, using the design and engineering of a de novo enzyme with a non-natural catalytic nucleophile as an example82. (Step 1) Define the biocatalytic challenge to be addressed. (Step 2) Identify a protein scaffold and a suitable target residue for modification. (Step 3) If the challenge cannot be addressed through standard mutagenesis, identify a functional non-canonical amino acid (ncAA) to achieve the desired properties. For example, replacement of the His23 nucleophile of the de novo protein BH32 with Nδ-methylhistidine (Me-His) should increase the reactivity of acyl-enzyme intermediates. (Step 4) Determine whether translation components are already available to incorporate the ncAA of interest. If not, an aminoacyl-tRNA synthetase (aaRS) can be evolved to accept the target ncAA using directed evolution75,84. (Step 5) Use engineered translation components to produce the catalytically modified protein, which is then evaluated for the activity of interest. (Step 6) Optimize catalytic function using directed evolution workflows adapted for an expanded genetic code. Ribbon representation of the evolved OE1.3 enzyme structure (light blue) has a Me-His residue shown as a ball and stick representation (green) along with the sites of mutation revealed through directed evolution (orange) (protein image generated using Chimera). (Step 7) Characterize the optimized enzyme. In the case of OE1.3, ester hydrolysis was accelerated by >9,000-fold compared with the free ncAA (Me-His) in solution. Steps 2–5 and 7 adapted from the original figure created with BioRender.com.

The favoured method for encoding a non-canonical amino acid exploits an engineered aminoacyl-tRNA synthetase/tRNA pair that is orthogonal to the host’s translation machinery to direct the incorporation of the non-canonical amino acid by suppression of a nonsense codon, which is most commonly the UAG stop codon. The aminoacyl-tRNA synthetase of the orthogonal translation component pair is typically engineered towards a desired non-canonical amino acid through iterative rounds of positive and negative selections, which link cell viability to aminoacyl-tRNA synthetase activity and selectivity80,89. Introducing the non-canonical functionality directly through the cellular translation machinery offers significant advantages over alternative methods of chemically modifying protein structures. For instance, this approach facilitates the homogeneous production of precisely edited proteins, enables the introduction of non-canonical amino acids at diverse sites in any protein scaffold and, perhaps most significantly, allows for rapid optimization of enzyme properties using directed evolution workflows adapted to an expanded genetic code87,90.
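At the sequence level, nonsense suppression starts with placing an in-frame amber codon (TAG in the DNA, UAG in the mRNA) at the residue to be replaced. The following is a minimal sketch of that single step, not a published protocol: the function names and the toy coding sequence are hypothetical, and the choice of residue 3 is purely illustrative.

```python
def introduce_amber_codon(cds: str, residue_number: int) -> str:
    """Replace the codon of a 1-based residue position with the amber codon TAG."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    if not 1 <= residue_number <= len(cds) // 3:
        raise ValueError("residue number lies outside the coding sequence")
    start = 3 * (residue_number - 1)
    return cds[:start] + "TAG" + cds[start + 3:]


def amber_sites(cds: str) -> list:
    """Return the 1-based residue positions that carry an in-frame TAG codon."""
    cds = cds.upper()
    return [i // 3 + 1 for i in range(0, len(cds) - 2, 3) if cds[i:i + 3] == "TAG"]


# Hypothetical toy gene: Met-Ala-His-Lys-Gly followed by a TAA stop codon.
toy_cds = "ATGGCTCACAAAGGTTAA"

# Mark the histidine at position 3 for non-canonical amino acid incorporation.
edited_cds = introduce_amber_codon(toy_cds, 3)
print(edited_cds)
print(amber_sites(edited_cds))
```

The toy gene ends in TAA rather than TAG, so the only amber codon in the edited sequence is the engineered one. A real construct would also need the orthogonal synthetase/tRNA pair described above, which this sketch does not model.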

The availability of an expanded set of amino acid building blocks offers exciting new opportunities for biocatalysis. Genetically encoded non-canonical amino acids have been used to improve both biocatalyst activity and stability91 as well as provide new tools to understand how enzymes function at the molecular level82,83,84. Key recent examples include the replacement of serine and cysteine catalytic nucleophiles with 2,3-diaminopropionic acid as a means of trapping acyl-enzyme intermediates for structural characterization85, and the use of non-canonical axial haem ligands to unravel the active site features that control the reactivities of high-energy metal-oxo intermediates86. The availability of an increased repertoire of covalently embedded functional groups also provides exciting opportunities to design de novo enzymes with catalytic mechanisms inspired by small-molecule catalysis. This approach was recently showcased through the design of an artificial hydrolase OE1 (ref.87) that employs Nδ-methylhistidine (Me-His) as a catalytic nucleophile, which operates with a similar mode of action to the widely employed small-molecule catalyst dimethylaminopyridine (DMAP)92. Histidine methylation was integral to catalytic function and led to the generation of reactive acyl-imidazolium intermediates, which are readily hydrolysed to regenerate the catalytic nucleophile (Fig. 5). By contrast, the catalytic function of de novo hydrolases employing canonical histidine, serine or cysteine nucleophiles was compromised by the formation of unreactive acyl-enzyme intermediates93,94,95,96. The modest initial hydrolysis activity of OE1 was subsequently enhanced via iterative rounds of directed evolution, giving rise to variant OE1.3 containing six active site mutations87. OE1.3 accelerates ester hydrolysis by more than 9,000-fold and 2,800-fold compared with free Me-His and DMAP, respectively. Further rounds of evolution led to the enantioselective hydrolase OE1.4. This study showcases how the interplay of genetic code expansion, computational design and directed evolution can provide a truly versatile platform for building de novo biocatalysts with new and improved catalytic functions.

Cascade development

Combining multiple enzyme-catalysed steps in the same pot is an active research area. Biocatalysis is particularly well suited to these cascade processes as enzymes possess inherent chemoselectivity, regioselectivity and stereoselectivity and operate in a common aqueous medium. Akin to natural biosynthetic pathways, fully de novo non-natural biocatalytic cascades can be designed and developed for the synthesis of complex targets. Biocatalytic cascades have become more commonly used only because of the advances in biocatalyst design and build discussed elsewhere in this Primer.

Biocatalytic cascades97,98 typically feature two or more steps (functional group interconversions or bond forming) with at least one enzymatic transformation and without intermediate isolations (Fig. 6a). The definition of ‘cascade’ is generally broadly applied within the biocatalysis community to describe not only concurrent, multienzyme processes in one pot but also reactions in which components are added sequentially or process steps are telescoped despite attempts to impart order on the nomenclature99,100,101,102.

Fig. 6: Biocatalytic cascades and their development.

a | Biocatalytic cascades are reaction sequences in which each chemical step is catalysed by an enzyme. Reaction conditions for enzymatic reactions are often compatible with multistep one-pot protocols. b | Design–build–optimize cycle is a useful concept for generating optimized biocatalytic cascades — for each step, several complementary methodologies are now available. CASP, computer-aided synthesis planning.

The development of novel biocatalytic cascades can be broadly described by a design–build–optimize cycle (Fig. 6b) until a final process is achieved3,103. Initially, retrosynthetic analysis is performed using the principles of biocatalytic retrosynthesis104,105,106,107 and/or retrobiosynthesis108 to make key bond disconnections and plan the forward route. This can be performed manually or complemented by more recent computer-aided synthesis planning tools that are becoming the focus of increased interest8,109,110. Additionally, selecting a cascade design that will enable the planned synthesis is required, which can range from simple linear to orthogonal or cyclic processes99. Any cofactor requirements or potential compound incompatibilities should be considered at this stage.

Once a process design is in place, enzymes need to be identified to fulfil each cascade step. Enzymes can be identified from the literature, from screening of enzyme libraries or from enzyme discovery efforts. When it comes to building the cascade, there is a choice of operating the enzymatic steps with purified or crude cell-free extracts (in vitro), with viable whole cells (in vivo) or with a combination of the two (hybrid)103. A multitude of factors will help determine which system is best to use, such as enzyme availability, cofactor recycling requirements and reactor/facility infrastructure. Often, each step in the cascade is validated individually before any single-pot combinations are tested.

Finally, optimization of the process can help maximize throughput and product titre. Several rounds of protein engineering are typically required, especially for industrial application, to improve enzyme activity and stability to overcome any bottlenecks in the pathway and maximize pathway flux. General process engineering optimizations also complement cascade development; for example, enzyme immobilization strategies to simplify the workup and/or improve a biocatalyst’s lifetime, recoverability and reuse111,112.

Further understanding of the full process can then influence subsequent design–build–optimize cycles in an iterative fashion to streamline the entire synthetic route to the desired compound.

Results

What makes a good industrial biocatalyst?

Before scientists embark on the challenge of discovering a good (or ideally excellent) industrial biocatalyst, they need to define which properties the biocatalyst must have for efficiently performing a commercially interesting target reaction under select industrial conditions. Here, we describe the beneficial characteristics that are usually found in industrial biocatalysts, metrics to assess their performance in industrial processes and a few exciting examples. Other illustrative examples can be found in excellent recent articles4,113,114,115. New biocatalytic processes aim to generate new molecules of considerable commercial interest. They may also be designed to replace or complement existing non-optimal chemical or biocatalytic syntheses in industry. In either case, the viability and possible bottlenecks of a biocatalytic process can be assessed using both economic and green chemistry process performance metrics116,117,118,119 (Table 1). Both high substrate concentration and conversion are desired in industrial reactions to achieve high product concentrations and so reduce the product recovery cost. Reactions resulting in low product concentrations may require additional concentration steps or large volumes of extraction solvent, which will increase costs associated with a rise in energy consumption and/or waste production. Thus, an ideal biocatalyst’s activity should not be inhibited by high substrate concentrations (>50 g l–1) or the amount of co-solvent required for substrate solubilization. It is worth mentioning that substrate loadings as high as >1 kg l–1 for aldoxime dehydratase, which catalyses the synthesis of linear aliphatic nitriles, have been reported120. Nevertheless, frequently observed detrimental effects at high substrate concentration or by organic solvents on biocatalysts can be alleviated using fed-batch strategies121. 
In examples where inhibition of the enzyme by the product, unfavourable thermodynamic reaction equilibria or product side reactions are problematic, in situ product removal can be applied122. To remove a product resulting from an ongoing enzyme-catalysed reaction, various techniques can be used such as in situ product crystallization, adsorption, distillation and extraction122,123. For example, in situ product crystallization can be achieved by forming a product salt via inclusion of an appropriate counter-ion in the reaction medium. Similarly, another option is to perform the bioconversion in the presence of a resin that selectively adsorbs the product from the solution.
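The process performance metrics discussed above reduce to simple ratios. The sketch below uses the commonly cited definitions of product titre, space-time yield and E-factor; these are assumed to correspond to, but are not copied from, the entries in Table 1, and the batch figures are hypothetical.

```python
def product_titre(product_g: float, volume_l: float) -> float:
    """Final product concentration in g per litre."""
    return product_g / volume_l


def space_time_yield(product_g: float, volume_l: float, time_h: float) -> float:
    """Volumetric productivity in g per litre per hour."""
    return product_g / (volume_l * time_h)


def e_factor(waste_kg: float, product_kg: float) -> float:
    """Mass of waste generated per unit mass of product (dimensionless)."""
    return waste_kg / product_kg


# Hypothetical batch: 125 g of product from a 1 l reaction run for 10 h,
# with 2.0 kg of total waste (solvent, spent buffer, biomass).
titre = product_titre(125.0, 1.0)
sty = space_time_yield(125.0, 1.0, 10.0)
waste_ratio = e_factor(2.0, 0.125)

print(f"titre = {titre} g/l, space-time yield = {sty} g/l/h, E-factor = {waste_ratio}")
```

A low titre inflates the E-factor because more solvent must be handled per gram of product, which is one reason high substrate loadings are sought.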

Table 1 Metrics used to evaluate efficacy of biocatalytic processes

High stability under industrial process conditions is an essential property of a good biocatalyst. Numerous robust enzymes of industrial interest have been discovered or redesigned over the past decade by enzyme engineering, computational methods, genome mining, ancestral sequence reconstruction or combinations thereof. A recent example using FRESCO generated an alcohol dehydrogenase mutant with a melting temperature — the temperature at which half of the protein is unfolded at equilibrium — of 94 °C (close to water’s boiling point) and this has previously been applied successfully to other enzyme classes124. Sequence reconstruction of a robust ancestor has been achieved for an increasing number of biocatalysts including cytochrome P450 monooxygenases125, carboxylic acid reductases126, flavin-containing monooxygenases127 and laccases128, made available in a recently created database of resurrected proteins with 211 members (Revenant)129. Enzyme immobilization, which facilitates repeated enzyme reuse, has also been used to enhance enzyme operational stability in industrial processes130,131,132. As there is great interest in the utility of enzyme immobilization, especially in continuous flow systems133, tolerance to immobilization without significant loss of activity or selectivity is an appealing property for a biocatalyst134.

Biocatalytic processes outcompete their chemical counterparts regarding sustainability, as illustrated when comparing the chemical and biocatalytic synthetic routes for pregabalin, atorvastatin intermediate, sitagliptin and ambrox135. In contrast to chemical catalysts, biocatalysts are derived from renewable resources, are biodegradable, act in aqueous solvent under mild reaction conditions and generate low amounts of waste by-products. Furthermore, biocatalytic synthetic routes obviate the need for hazardous chemicals, high energy usage and additional reagents for functional group activation, protection or deprotection steps.

Biocatalytic processes requiring whole-cell fermentations (for either enzyme production or substrate conversion) generate waste biomass, which can be reused as a source of energy or animal feed. To reduce water usage and carbon feedstocks required for cell growth, biotransformations with isolated enzymes or cell lysates can be performed instead of whole-cell fermentations at increased concentrations. A reduction in biocatalyst loading, without reducing productivity as measured by yield and speed, can be accomplished by using engineered biocatalysts that offer improved properties such as higher turnover rates and/or stability for reuse. Energy consumption due to biocatalyst recovery from reaction solutions can be minimized by enzyme immobilization. Importantly, inexpensive renewable carriers for enzyme immobilization, such as rice husk, are being developed to replace organic fossil-based carriers136. However, a significant expansion of enzyme-based technologies in the production of bulk chemicals (high volume, low priced) must be achieved to increase the impact of biocatalysis on sustainability137. So far, biocatalysts are more frequently used to synthesize high-price low-volume products such as pharmaceuticals.

Various companies (for example, Merck, Pfizer, GlaxoSmithKline and AstraZeneca) have become active in the development of new biocatalytic processes and often collaborate with academic groups to accelerate progress in this research area. Examples of some of the enzymatic processes developed by industry with biocatalysts including KREDs, transaminases, hydroxylases and IREDs are described in recent review articles138,139. When selecting a biocatalyst for process development, it is often desirable to select enzymes that will enable freedom to operate to avoid infringing intellectual property rights or to access desired patented biocatalysts during the early stages of process design. To this end, industries and universities often provide experts in the complex and rapidly evolving field of intellectual property to guide research scientists.

A good industrial biocatalyst should combine numerous beneficial properties to deliver higher-value molecules under demanding industrial conditions while achieving satisfactory economic and green metrics for various applications (Fig. 7). A few of the most desired characteristics of efficient industrial biocatalysts have been highlighted above, which include high activity, stability, ease of immobilization, environmental sustainability and accessibility. The importance of other relevant properties of a good biocatalyst, such as substrate selectivity, evolvability and affordability, will be illustrated through various examples in the following sections.

Fig. 7: Beneficial properties of an excellent biocatalyst under industrial process conditions.

Ideal industrial biocatalysts are able to efficiently convert renewable raw materials into higher-value molecules. Their most desired characteristics are indicated in the plot. Under highly demanding industrial conditions, biocatalyst properties (top right, orange background) such as robustness or tolerance to mutations enhancing stability often become essential. Other beneficial biocatalyst properties are desired under either mild (blue background) or more demanding operational conditions.

Applications

An ideal catalyst converts renewable, cheap and readily available raw materials such as plant-derived feedstocks, generates few to no undesired by-products, is safe and exhibits a reduced environmental footprint (low energy consumption and waste). These characteristics are not often observed for industrial chemical catalysts. Also, biocatalysts usually act under mild reaction conditions and can be engineered towards the desired substrate scope. Thus, biocatalysis paves the way for a bio-based economy, less reliant on fossil fuels117. Here, we highlight the utility of biocatalysis in various applications, first according to different reaction metrics or enzyme properties that are of importance in biocatalysis followed by an overview of enzyme cascade development.

Activity and productivity

Biocatalysts should exhibit high activity under the desired industrial conditions to achieve a high reaction productivity. Heterogeneous chemical catalysts are challenging rivals for biocatalysts in terms of productivity: they often reach production rates of 1–10 kg l–1 h–1, whereas biocatalysts typically reach 0.001–0.3 kg l–1 h–1 (ref.140). High productivities of 50–100 g l–1 h–1 have been achieved using free-resting Rhodococcus cells containing nitrile hydratase for the synthesis of acrylamide from acrylonitrile, considered to be one of the most successful industrial biocatalytic processes140,141. Acrylamide is used to produce polyacrylamide, which is applied in water treatment, oil exploitation and the textile industry sector, as well as many others. The potential of nitrile hydratase as an industrial biocatalyst for the hydration of nitriles to form higher-value amides was demonstrated in the 1980s1. The vast market for acrylamide and the lack of an efficient chemical process for its production have propelled the improvement of the biocatalytic process over the past few years. The selection and optimization of a robust microbial host for nitrile hydratase was instrumental in preventing enzyme inactivation, owing to the high acrylamide concentrations required in the industrial process (300–500 g l–1) and the underlying exothermic nature of the reaction141. A selective robust transaminase was obtained, by combining rational mutagenesis, directed evolution and a substrate walking approach, for the large-scale manufacture of the antidiabetic drug sitagliptin under demanding industrial conditions (200 g l–1 substrate loading, 50% DMSO and 45 °C)142. This is an impressive example of an excellent industrial biocatalytic approach that outcompeted the previously used rhodium-catalysed sitagliptin synthesis in terms of selectivity, productivity, sustainability and cost.

A relatively high productivity (13 g l–1 h–1) was recently achieved for IREDs by testing a commercially available IRED collection and various reaction conditions at the pilot plant scale, which was facilitated by a design of experiments strategy121. IREDs are of great interest for the industrial synthesis of cyclic and acyclic amines via the reduction of C=N bonds. This study identified reaction bottlenecks (for example, enzyme stability) and exposed possible strategies to overcome them (for example, using a fed-batch process) for a model reaction. Importantly, the first industrial synthesis catalysed by an IRED (on a 20-l scale) was recently reported47, highlighting an excellent industrial biocatalyst after three rounds of directed evolution, which outcompeted the corresponding chemical process with respect to green metrics such as lower catalyst requirement. This engineered IRED is used for the industrial synthesis of the LSD1 inhibitor GSK2879552. In contrast to the IRED used as the starting point in this study, the engineered IRED is an excellent biocatalyst due to its increased stability under the required reaction conditions (moderately acidic pH and 20 g l–1 substrate concentration), showing a 38,000-fold improvement in turnover. In this case, the selectivity (another requirement for a good biocatalyst) needed no further improvement. The preparation of the fragrance ingredient (−)-ambrox using an engineered squalene hopene cyclase is another example of a successful industrial biocatalytic process, which achieved relatively high productivity (12 g l–1 h–1) for catalysing the cyclization of (E,E)-homofarnesol to yield (−)-ambrox143. The enzyme variant used in this study, which exhibited a 10-fold increase in productivity over the wild type, was discovered by random mutagenesis. This cyclase whole-cell biotransformation in E. coli was carried out under conditions that were optimized using a design of experiments strategy, in which the optimized parameters included the cell, sodium dodecyl sulfate (SDS) and (E,E)-homofarnesol concentrations, temperature and pH. SDS was required in this process to ensure substrate solubilization and access to the enzyme through the cell membrane.
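The productivities quoted in this section are given in both g l–1 h–1 and kg l–1 h–1, so a unit conversion helps when comparing them. The short sketch below takes the figures stated in the text and checks each against the 0.001–0.3 kg l–1 h–1 range quoted for biocatalysts; the dictionary labels and function names are illustrative only.

```python
# Range quoted in the text for biocatalysts, in kg per litre per hour.
BIOCATALYST_RANGE_KG = (0.001, 0.3)

# Productivities quoted in the text, in g per litre per hour (low, high).
reported_g = {
    "nitrile hydratase, acrylamide": (50.0, 100.0),
    "IRED, pilot plant screen": (13.0, 13.0),
    "squalene hopene cyclase, (-)-ambrox": (12.0, 12.0),
}


def g_to_kg(value_g: float) -> float:
    """Convert g per litre per hour to kg per litre per hour."""
    return value_g / 1000.0


def within_range(low_g: float, high_g: float, limits=BIOCATALYST_RANGE_KG) -> bool:
    """True if the whole reported interval lies inside the quoted range."""
    return limits[0] <= g_to_kg(low_g) and g_to_kg(high_g) <= limits[1]


for process, (low, high) in reported_g.items():
    print(process, g_to_kg(low), g_to_kg(high), within_range(low, high))
```

All three processes fall inside the quoted biocatalyst range, and even the nitrile hydratase process at 0.05–0.1 kg l–1 h–1 sits below the 1–10 kg l–1 h–1 quoted for heterogeneous chemical catalysts.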

Selectivity and substrate scope

Enzymes with excellent regioselectivity, chemoselectivity and/or stereoselectivity and the desired substrate scope for industrial applications can be obtained by either mining the enormous diversity evolved by nature or performing protein engineering campaigns in the laboratory. Studies that have uncovered the extraordinary diversity of enzymes involved in natural product biosynthetic pathways have provided promising industrial biocatalysts with complementary selectivity as well as substrate scope. For example, the recent comparison of three similar FAD-dependent monooxygenases, which catalyse the oxidative dearomatization of phenol and resorcinol in different biosynthetic pathways, has revealed their complementary site selectivities and stereoselectivities by testing a diverse panel of unnatural substrates144. This approach enabled the identification of an optimal biocatalyst for specific asymmetric transformations of phenols into ortho-quinols, a chemical reaction of great value in the synthesis of various bioactive natural products144. In another example that highlights the importance of enzyme discovery and characterization, the substrate scope of 87 putative flavin-dependent halogenases was determined using a high-throughput mass spectrometry-based screen22. Various halogenases discovered in this study exhibited complementary regioselectivity on relatively complex substrates. Thus, this enzyme library is attractive for late-stage C−H functionalization of drug leads, leading to diverse drug candidates from common intermediates. Furthermore, this study enabled the discovery of new halogenases for biotechnology applications, which exhibited beneficial properties such as regioselectivity, substrate scope and stability that were engineered in other previously discovered halogenases22.

An increasing number of studies demonstrate that required selectivities can be readily engineered into different enzyme classes145. A recent example is the synthesis of a Janus kinase (JAK) inhibitor, which involved engineering IRED variants with markedly improved selectivity and activity compared with the wild type146. Synthesis of enantiomerically pure compounds is a key driver for the implementation of enzymes in the pharmaceutical industry3. Enantioselective enzymes are also used industrially for the production of target molecules required in food supplements, flavourings, fragrances and agrochemicals147. To this end, a wild-type cytochrome P450 monooxygenase catalyses the enantioselective and regioselective C5 hydroxylation of decanoic acid to form (S)-5-hydroxydecanoic acid, which is subsequently converted by chemical lactonization into the high-value fragrance compound (S)-δ-decalactone148. In the food industry, small-scale reactions using an engineered ethylenediamine-N,N′-disuccinic acid lyase have demonstrated its utility for the enantioselective synthesis of chiral synthons for artificial dipeptide sweeteners149. The lyase used as a starting scaffold exhibited excellent enantioselectivity for the target substrate but had low activity, which was increased 1,140-fold by rational protein engineering.

Enzyme cascades

From an industrial perspective, biocatalytic cascade processes are especially attractive as they eliminate the need for intermediate isolation steps, reducing waste, saving time and costs as well as streamlining the overall synthesis150. Some intermediates can be unstable to isolation or have inhibitory effects on the enzymes present in the system, and therefore the use of a cascade process can be beneficial to overcome these challenges and avoid the build up of problematic intermediates.

Several recent reviews have been published on enzymatic cascades that reveal the potential and scope of these processes99,101,102,151,152. Some examples97 of industrially applied systems are highlighted here (Fig. 8). Evonik Degussa GmbH described a whole-cell cascade to produce diamines — which are valuable building blocks in the polymer industry — from renewably sourced dicarboxylic acids (Fig. 8a). The patented process153 details the co-expression of a carboxylic acid reductase and a transaminase to enable the desired cascade. An alanine dehydrogenase was also incorporated to provide a source of l-alanine, required for the transaminase step, from ammonia as an input nitrogen source. Additional process considerations for the in vivo implementation of the cascade included co-expression of fatty acid transporters to improve substrate uptake or the incorporation of an initial esterase step enabling the use of esters as starting materials.

Fig. 8: Industrial examples of biocatalytic cascades.

a | Evonik Degussa GmbH described a whole-cell cascade to produce diamines from renewably sourced dicarboxylic acids153. b | Hydrogen-borrowing, redox-neutral cascade developed by GSK for production of GSK2879552 (ref.47). c | Merck & Co.7 developed a total enzymatic synthesis of the HIV drug islatravir built on five key enzymatic steps. AcK, acetate kinase; AlaDH, alanine dehydrogenase; CAR, carboxylic acid reductase; DERA, deoxyribose phosphate aldolase; ee, enantiomeric excess; GOase, galactose oxidase; HRP, horseradish peroxidase; IRED, imine reductase; KRED, ketoreductase; PanK, pantothenate kinase; PPM, phosphopentamutase; PNP, purine nucleoside phosphorylase; SP, sucrose phosphorylase; TA, transaminase.

A hydrogen-borrowing, redox-neutral cascade was developed by GSK for the production of GSK2879552 (ref.47) (Fig. 8b). A KRED–IRED system was evaluated to take the desired alcohol to the chiral amine via an aldehyde, with internal cofactor recycling between the two enzymes. The main synthetic focus of the work was the engineering of the IRED step, involving reductive amination and concurrent resolution of the racemic amine substrate. The cascade synthesis enabled generation of the desired product in 48% yield with high enantiopurity (99.5% enantiomeric excess). Although the IRED step can operate as a stand-alone process and achieve higher yields, the proof of concept for the cascade was established. Process development and a more active KRED were highlighted as areas of potential focus to further improve the cascade and realize its potential for manufacturing.
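To make the enantiopurity figure concrete, enantiomeric excess can be converted into the underlying ratio of enantiomers using its standard definition, ee = 100 × |major − minor| / (major + minor). The sketch below is illustrative only; the function names are not taken from the cited study.

```python
def enantiomeric_excess(major: float, minor: float) -> float:
    """Enantiomeric excess in percent from the amounts of the two enantiomers."""
    total = major + minor
    if total <= 0:
        raise ValueError("total amount must be positive")
    return 100.0 * abs(major - minor) / total


def major_fraction_from_ee(ee_percent: float) -> float:
    """Mole fraction of the major enantiomer for a given ee in percent."""
    return (1.0 + ee_percent / 100.0) / 2.0


# The 99.5% ee quoted above corresponds to roughly a 99.75:0.25 mixture.
fraction_major = major_fraction_from_ee(99.5)
print(round(fraction_major, 4))
print(round(enantiomeric_excess(99.75, 0.25), 2))
```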

Recently, Merck & Co.7 developed a total enzymatic synthesis of the HIV drug islatravir built on five key enzymatic steps (Fig. 8c). The selected enzymes were subjected to multiple rounds of protein engineering to achieve either the desired activity, stability or selectivity for operation of the cascade. A single aqueous reaction stream was employed throughout the entire process, in which the galactose oxidase (GOase) and pantothenate kinase (PanK) steps operated sequentially to avoid cross-reactivity between substrates. The final deoxyribose phosphate aldolase (DERA), phosphopentamutase (PPM) and purine nucleoside phosphorylase (PNP) steps were then run concurrently, and the equilibrium of these steps was pulled through to product formation by an orthogonal sucrose phosphorylase (SP) step that removed phosphate from the reaction mixture. The cascade synthesis of islatravir (and, more recently, of molnupiravir)154 replaced alternative chemical routes to this drug that required more than double the step count with protecting group manipulations, thereby vastly improving the efficiency of synthesis.

Reproducibility and data deposition

Databases for biocatalysis

Over the past decade, the cost of DNA sequencing and synthesis has fallen rapidly, a trend commonly referred to as the Carlson curve155.

This associated abundance of protein sequence data provides a rich seam for mining for new biocatalysts. The National Center for Biotechnology Information (NCBI) maintains databases of both DNA and protein sequences, regularly updated with new sequencing data, and with the option to search for sequences of interest using tools such as BLAST (Basic Local Alignment Search Tool)156. Other databases, such as UniProt, InterPro or Pfam, offer further analysis of protein sequences, structures or families.

As the amount of data collected for an increasing number of enzymes and enzymatic transformations rises, it becomes prohibitive for interested researchers to efficiently scour the literature in search of ideal/appropriate candidates to analyse. Catalyst and enzyme selection, for use in organic chemistry syntheses or synthetic biology pathways, respectively, already benefit from numerous well-developed databases. Reaxys157 and SciFinder158 contain a plethora of searchable information related to reaction conditions, choice of catalyst, substrate scope, percentage conversions and analytical information, among others, for use when designing a synthetic chemistry route towards a target molecule, whereas BRENDA159 and KEGG160 hold data on the natural substrate specificity, and sequence information, of biosynthetic enzymes to be used in a synthetic biology pipeline. A comparable repository, comprising information collected for synthetic enzyme reactions in biocatalysis, would be of great use for the biocatalysis community.

Despite the fact that several databases for the biocatalysis community have been developed, none of them contains information related to the whole biocatalytic toolbox, and the majority do not provide such critical information as the substrate scope of specific enzymes, successful reaction conditions or reaction yields (Table 2). In general, the majority of the resources listed in Table 2 rely on data extracted from pre-existing databases such as BRENDA and PDB (Protein Data Bank) and, as such, are restricted to solely utilizing the sequence and/or structural information contained within them. Additionally, the curation and maintenance of substantial databases is often laborious and challenging, and so most biocatalyst databases focus on a specific reaction type or enzyme type of interest, rather than compiling data on the field as a whole. One of the few examples of a database recording information related to substrates, products and reaction outcomes in a biocatalysis context has been developed for the prenyltransferase enzyme class (PrenDB)161. PrenDB aims to collect data in the literature concerned with prenyltransferase enzymes and use them in various algorithms to achieve wider application of this family of synthetically useful enzymes. The compilation of a biocatalysis database, similar in scope to PrenDB but covering a broad spectrum of the different enzyme classes available in the biocatalytic toolbox, would unquestionably enhance the development of new enzymatic (cascade) reactions.

Table 2 Databases containing information specifically on enzymes for biocatalysis

An ideal database dedicated to biocatalytic transformations would capture both successful and unsuccessful transformations on an enzyme by enzyme basis and would broadly collect both enzyme activity data and enzyme sequence data. For example, data regarding the substrate scope, reaction temperature and length, buffer choice and pH, cofactor use, co-solvent use, substrate concentrations, reaction outcomes including percentage conversions and selectivities would all need to be collected to maximize the applicability of such a database. Enzyme homologue information, such as the amino acid sequence, structural information, mutant information and accession codes, would also need to be obtained. Additionally, integration with existing databases, such as those outlined above for chemistry and synthetic biology applications, would allow for extremely powerful synthesis planning towards target molecules. A fully functioning and searchable biocatalyst database could be used to augment tools designed to automate synthesis planning and would, ultimately, benefit researchers from both the chemistry and biocatalysis fields. In related fields such as natural product discovery, crowdsourcing has been successfully utilized for the construction of similar databases162. Indeed, a platform for curation of biocatalysis data has recently been made available to the community with this in mind, as part of the computer-aided synthesis planning tool RetroBioCat8.
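The fields listed above suggest what a single database entry might look like. The following is a minimal sketch of such a record, assuming one entry per enzyme and substrate pair; the class name, field names and example values are hypothetical and are not taken from RetroBioCat or any other existing resource.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BiotransformationRecord:
    # Enzyme information
    enzyme_name: str
    sequence: str
    # Reaction conditions
    substrate: str
    temperature_c: float
    reaction_time_h: float
    buffer: str
    ph: float
    substrate_conc_mm: float
    # Reaction outcome; 0.0 records an unsuccessful transformation
    conversion_percent: float
    # Optional details
    accession_code: Optional[str] = None
    mutations: List[str] = field(default_factory=list)
    cofactor: Optional[str] = None
    cosolvent: Optional[str] = None
    ee_percent: Optional[float] = None

    @property
    def successful(self) -> bool:
        """A transformation counts as successful if any conversion was observed."""
        return self.conversion_percent > 0.0


# Two placeholder entries: one productive reaction and one negative result.
hit = BiotransformationRecord(
    "example enzyme A", "MSTKA", "substrate 1", 30.0, 24.0,
    "phosphate", 7.5, 10.0, 85.0, cofactor="NADPH", ee_percent=99.0,
)
miss = BiotransformationRecord(
    "example enzyme A", "MSTKA", "substrate 2", 30.0, 24.0,
    "phosphate", 7.5, 10.0, 0.0, cofactor="NADPH",
)

negative_results = [r.substrate for r in (hit, miss) if not r.successful]
print(negative_results)
```

Keeping the negative result as a full record, rather than omitting it, is what allows substrate scope limits to be queried later.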

Reproducibility issues in the field

A successful biocatalyst database requires a system that captures all useful information on biocatalyst performance reported in the literature. However, the diverse scientific communities that work with and characterize biocatalysts have varying standards when it comes to recording reaction parameters and outcomes, with some favouring kinetic data and others preferring to record percentage conversions and overall yields, for example. These different approaches have resulted in a wealth of information for many different enzyme classes and homologues that may not be directly comparable with one another, and so it becomes necessary to standardize the data collected in order to obtain a better overall picture of where select developments stand. One such way of standardizing data reported in the literature would be to categorize reactions qualitatively for enzyme activities with respect to a given substrate (for example, high, medium, low, none). Different data sources, such as percentage conversions and specific activities, could be categorized in this way and then compared against each other.
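The qualitative categorization described above might be sketched as follows. The cut-off values are arbitrary assumptions chosen for illustration (they are not taken from the literature or from any database), and the function names are hypothetical; in practice, thresholds would need to be agreed per data type and possibly per enzyme class.

```python
def categorize(value: float, low_cutoff: float, high_cutoff: float) -> str:
    """Map a non-negative measurement onto 'none', 'low', 'medium' or 'high'."""
    if value <= 0:
        return "none"
    if value < low_cutoff:
        return "low"
    if value < high_cutoff:
        return "medium"
    return "high"


def categorize_conversion(percent: float) -> str:
    # Assumed cut-offs: below 20% is low, 20% up to 80% is medium, 80% and above is high.
    return categorize(percent, low_cutoff=20.0, high_cutoff=80.0)


def categorize_specific_activity(units_per_mg: float) -> str:
    # Assumed cut-offs in U/mg: below 0.1 is low, 0.1 up to 1.0 is medium, 1.0 and above is high.
    return categorize(units_per_mg, low_cutoff=0.1, high_cutoff=1.0)


# Two reports on the same substrate, recorded in different ways,
# become comparable once both are reduced to a category.
report_a = categorize_conversion(95.0)
report_b = categorize_specific_activity(2.0)
same_category = report_a == report_b
```

The trade-off is a loss of resolution: two values on either side of a cut-off land in different categories even when they are experimentally indistinguishable, so the original quantitative data should be stored alongside the category.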

Alternatively, biocatalytic experiments could be standardized in the laboratory prior to data deposition. For many years, the STRENDA (Standards for Reporting Enzyme Data) commission has sought to provide guidelines on the experimental detail required when reporting enzyme activities and kinetics163. Recently, these guidelines have been incorporated into an online storage and validation tool, where enzyme data can be deposited and checked for compliance with the STRENDA guidelines164. This serves as a useful blueprint for reporting biotransformations in biocatalysis, but would probably need to be extended to include the additional data types often reported in biocatalysis papers, for example percentage conversion.

Recently, numerous start-up companies have emerged across biology and chemistry to develop smart-laboratory infrastructures, aiming to make research more reproducible by capturing data on all of the possible variables in an experiment165. Others offer platforms to structure the collected data in a process, allowing machine learning to pull insight out of the vast data sets that smart-laboratories might produce166. Experimentally, this can allow trends to be observed that might otherwise be missed — for example, a new batch of a reagent causing a drop in yields, or a shift in pH causing improved enantioselectivity. The digitization of experimental procedures and data collection should greatly improve the reproducibility of experiments across biology and chemistry. In particular, this may allow methods sections in journals to offer links to a more atomized record of the experimental procedures carried out. However, uptake by academia may be slow in comparison with industry laboratories, where electronic laboratory notebooks are more commonly employed.

Limitations and optimizations

Cost and accessibility

Biocatalyst cost usually influences the viability of an industrial process, especially in the synthesis of low-priced products. Currently, a wide variety of affordable enzyme collections (kits) are accessible from various vendors (for example, Prozomix, Almac, Codexis and Gecco). Enzyme discovery and production in-house is the alternative approach. Advantages and limitations of these options, ‘the buy or build operating models’, have been previously discussed167. The choice of biocatalyst format (for example, purified, whole-cell or crude preparations) varies, depending on the particular application and enzyme class. Well-expressed enzymes are highly desired to reduce costs and effort. Access to an increasingly diverse platform of molecular biology tools allows the tailoring of enzyme performance to meet demanding industrial requirements and to efficiently convert non-natural substrates. Generation of improved biocatalysts by enzyme engineering is possible simply because enzymes are able to tolerate in vitro mutation. Thus, evolvability is another highly desirable property of a good biocatalyst. Engineering one enzyme may take only a few months, but building complex cellular metabolic networks may take years and demand considerable economic investment168. These timescales are not fast enough to meet ‘the need for speed’ in industry169. A recent example of a three-step route including two enzymatic steps, which was developed in just 6 months, is the synthesis of the COVID-19 direct-acting antiviral molnupiravir154. Development of highly efficient biocatalysts by either rational or evolutionary techniques will be accelerated in the near future by expanding the use of both machine learning170 and ultra-high-throughput screening171 technologies for protein engineering.

Machine-learning algorithms use the sequence-function data resulting from experimental work to predict which new enzyme mutants may exhibit the desired property. Thus, DNA sequences of both improved and unimproved variants are valuable in generating initial data sets. Importantly, machine-learning methods allow a reduction in the number of mutants that have to be produced and tested in the laboratory to discover a significant fraction of improved enzymes, and are particularly interesting in cases that require expensive or labour-intensive screening methods170. The additional costs of implementing machine-learning algorithms in a traditional protein engineering laboratory include computation and DNA sequencing; these costs are decreasing and are thus affordable for numerous research groups in both academia and industry. To explore a vast protein sequence space (library sizes >10⁶ variants), ultra-high-throughput screening technologies have been developed. Many academic or industrial researchers have access to flow cytometers to perform fluorescence-based screenings of up to 10⁸ enzyme variants per day172. Complementary or improved technologies are rapidly emerging in this field. Miniaturization of the reaction volume is generally pursued because it increases the speed of screening and reduces associated costs and waste. Label-free detection methods are also highly desired, allowing for screening without a reporter molecule. A recent example meeting both objectives allowed the analysis of around 15,000 samples in 6 h using droplet microfluidics (nanolitre scale) coupled to electrospray ionization mass spectrometry for detection60. The development of, and wide access to, novel technologies for biocatalysis have been propelled by recent investments from, for example, the European Commission and the UK Biotechnology and Biological Sciences Research Council (BBSRC) to facilitate collaborations between industry and academia. The recent establishment of a Global Biofoundry Alliance represents another example173. Biofoundries are facilities to automate the design–build–test iteration cycle for engineering biology, which allows the fast delivery of genetically reprogrammed organisms for biotechnology77. Access to biocatalysts is also facilitated by other strategies such as Science Exchange (an online marketplace of research services) and collaborations established between the Centre of Excellence for Biocatalysis, Biotransformations and Biocatalytic Manufacture (CoEBio3) and various companies.
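To put the screening figures quoted above side by side, the following back-of-the-envelope sketch converts them into hourly rates and estimates how long a library of 10^6 variants would take to screen. It assumes a sustained, uninterrupted rate and a single measurement per variant (no oversampling), both of which are simplifications; the function and variable names are hypothetical.

```python
def samples_per_hour(samples: float, hours: float) -> float:
    """Average screening rate over the stated period."""
    return samples / hours


def hours_to_screen(library_size: float, rate_per_hour: float) -> float:
    """Time to measure each library member once at a constant rate."""
    return library_size / rate_per_hour


# Figures quoted in the text.
ms_rate = samples_per_hour(15_000, 6)       # droplet microfluidics coupled to ESI-MS
facs_rate = samples_per_hour(10**8, 24)     # flow cytometry, upper bound per day

library_size = 10**6                        # lower bound of a 'vast' library in the text
ms_hours = hours_to_screen(library_size, ms_rate)
facs_hours = hours_to_screen(library_size, facs_rate)

print(f"MS-based screening:   {ms_rate:.0f} samples/h, {ms_hours:.0f} h for the library")
print(f"Flow cytometry (max): {facs_rate:.0f} samples/h, {facs_hours:.2f} h for the library")
```

Under these assumptions the label-free method runs at 2,500 samples per hour and would need about 400 h for the library, whereas fluorescence-based flow cytometry at its quoted maximum would need roughly a quarter of an hour. The comparison illustrates why label-free detection is currently better suited to smaller, focused libraries, such as those proposed by machine-learning models.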

Expanding the range of biocatalysis

Biocatalytic transformations, particularly those routinely applied in industry, often effect functional group interconversions with high conversion and selectivity. However, one of the biggest gaps is the lack of broad enzyme platforms that perform C–C bond formation. Although a plethora of enzymes are used in nature for C–C bond formation in primary and secondary metabolism, they are often challenging to repurpose for non-natural substrates. Only a handful of C–C bond-forming enzymes have been utilized for industrial applications, mainly limited to aldol reactions, acyloin condensations or cyanohydrin formation catalysed by lyases174,175. A recent review highlights progress made in this space to diversify the toolbox of enzymes and the C–C bond-forming reactions they catalyse176. Another industrial gap is the lack of scalable and robust oxidative enzymes. Despite their potential to catalyse remote and unactivated C–H oxidations, which are chemically challenging, enzymes such as cytochrome P450s and other oxygenases are problematic to scale up owing to low activity, instability and promiscuity, the last of which results in a mixture of products177. However, these features are well-suited for small-scale, late-stage diversification of biologically active compounds, in which enzyme promiscuity is advantageous for generating new libraries of compounds for evaluation178,179. The synthetic utility of the transformations afforded by these enzymes encourages continued efforts to find solutions and realize the potential of these biocatalysts for large-scale manufacture180.

Speeding up synthesis

In the pharmaceutical industry, accelerating the drug development process is crucial both to deliver new medicines to patients as quickly as possible and to maximize patent lifetimes for approved drugs. As such, time pressures for synthetic development are increasingly tight, which is driving advancements in the speed of rounds of protein engineering and the establishment of biocatalysis earlier in synthetic route planning3,169. These advances include improvements in DNA library syntheses, smart library design and high-throughput screening. On the horizon are technologies such as cell-free expression, which removes the need to grow and harvest cells that contain enzyme mutants, to further reduce cycle times181,182. Although the acceleration of development timelines is often associated with the pharmaceutical industry, these improvements are also beneficial to the wider chemical industry, making development more efficient and cost-effective140.

Outlook

Advances in protein engineering, genomic database mining and computational methods have enabled a step change in biocatalysis over the past 20 years, and have led to its increasing application in the chemical and pharmaceutical industries as highlighted in this Primer. Adoption of biocatalysis is also driven by reduced cost, the need to develop environmentally friendly processes and use of renewable resources183.

The number of chemical reactions realized as amenable to biocatalysis has dramatically increased, as new enzyme classes become accessible and non-natural biocatalytic reactions are being developed184. However, the range of reactions compared with those used in organic synthesis is still small and there are some obvious gaps in the repertoire of biocatalytic reactions that are currently being identified, including halogenation, amide-bond formation, C–C bond formation and cleavage, ether formation, carbonylation, C=C bond functionalization, isomerization and reduction of isolated C=C bonds140. Some of these issues are being addressed by combining chemical and enzymatic reactions185. Biocatalysis also offers opportunities to develop reactions that are chemically difficult, such as remote and selective C–H activation, which is often observed in nature148,178, but the scale and substrate scope remain limitations for biocatalytic C–H activation186.

The use of enzyme cascades is a particularly attractive aspect of biocatalysis, because of general reaction compatibility and the ability to telescope several reactions either in cell-free systems or whole cells103, akin to biosynthetic pathways. The design of such cascades is already starting to become automated using dedicated computational tools and databases that provide rich resources to the scientific community8. The availability of biocatalysts through commercial sources or from synthetic genes continues to lower the barrier to entry for biosynthesis, and the bottleneck for reaction screening is now often at the assay stage, where more label-free high-throughput analytical methods are needed50. Current successes for compounds such as islatravir7 and molnupiravir154 have demonstrated the application of biocatalysis to multistep syntheses of small molecules. The next challenges will be to extend the application scope to targets of increasing molecular complexity and size, as well as to decrease the time required to develop efficient biocatalytic industrial processes. Examples of production of bulk chemicals and polymers by biocatalysis are still rare and offer a rich opportunity in terms of green chemistries. Biocatalysis also has a role to play in generating new modalities more efficiently and selectively for the biopharmaceutical industries, such as producing biomacromolecules and antibody–drug conjugates.

Looking to the future, numerous key trends and scientific breakthroughs promise to have a significant impact on accelerating the discovery, development and application of biocatalysts. First, the range of new-to-nature biocatalytic reactions is rapidly expanding through de novo design187 and/or directed evolution188. Increasingly powerful computational tools will allow for better de novo design but will also provide better selection tools for identifying suitable biocatalysts from the rich protein primary sequence information already accessible in databanks. Advances in computational methods to predict protein structure from sequence through artificial intelligence189, and the subsequent prediction of function and physicochemical properties, will provide access to biocatalysts that are finely tuned to the requirements of a desired target reaction and/or product190,191,192. To maximize synthetic utility, these tools will need to be integrated with the design of new biocatalytic cascade processes. Many individual steps of biocatalyst development can already be automated at the implementation stage, including desktop DNA printing, cell-free protein expression, enzyme immobilization and analysis, which hints at the potential for ‘fully automated biocatalytic synthesizers’ being available to individual laboratories3 within the next decade.

In conclusion, biocatalysis has enabled essential contributions to the safe, cheap and sustainable production of high-value chemicals and pharmaceuticals, but still provides many exciting challenges for potential advancements.