Efficient rational modification of non-ribosomal peptides by adenylation domain substitution

Non-ribosomal peptide synthetase (NRPS) enzymes form modular assembly-lines, wherein each module governs the incorporation of a specific monomer into a short peptide product. Modules are comprised of one or more key domains, including adenylation (A) domains, which recognise and activate the monomer substrate; condensation (C) domains, which catalyse amide bond formation; and thiolation (T) domains, which shuttle reaction intermediates between catalytic domains. This arrangement offers prospects for rational peptide modification via substitution of substrate-specifying domains. For over 20 years, it has been considered that C domains play key roles in proof-reading the substrate; a presumption that has greatly complicated rational NRPS redesign. Here we present evidence from both directed and natural evolution studies that any substrate-specifying role for C domains is likely to be the exception rather than the rule, and that novel non-ribosomal peptides can be generated by substitution of A domains alone. We identify permissive A domain recombination boundaries and show that these allow us to efficiently generate modified pyoverdine peptides at high yields. We further demonstrate the transferability of our approach in the PheATE-ProCAT model system originally used to infer C domain substrate specificity, generating modified dipeptide products at yields that are inconsistent with the prevailing dogma.

The earliest reported attempts to create artificial NRPS enzymes were substitutions of A-T domains into the second and seventh modules of the NRPSs involved in the biosynthesis of the lipopeptide surfactin. 6,7 Although modified lipopeptides were detected using mass spectrometry, in each case the yield was substantially diminished, to only trace levels. 7 Evidence that C domains exhibit stringent specificity towards the acceptor substrate activated by their cognate (downstream) A domains offered an explanation for the reduced yields. 4,8 This evidence was based on NRPS enzymes artificially loaded with aminoacyl-CoA or aminoacyl-N-acetylcysteamine thioesters, which mimic an amino acid attached to a T domain. Soon after, a substrate binding pocket of the C domain was suggested to play a role in controlling the direction of biosynthesis. 9 Subsequent efforts at substituting cognate C-A domains together enjoyed modest success, reinforcing the belief that this was necessary to bypass C domain substrate specificity. 10,11 Since then the most successful engineering attempts have focused on substituting C-A domains together 12,13 or A-T-C domains with the condition of not disrupting C domain acceptor site specificity. 14-16 It is now widely accepted in the field that successful domain substitution requires working within the constraints imposed by C domain specificity. 5,12,14
Until now, our own work using pyoverdine as a model system has been consistent with this dogma. Pyoverdine from Pseudomonas aeruginosa PAO1 is a UV-fluorescent siderophore formed from an 11-membered peptide, encoded by NRPS modules that will here be referred to as Pa1-Pa11 (Supplementary Figure 1). We observed that five out of five synonymous A domain substitutions into PvdD (an NRPS comprised of modules Pa10 and Pa11, both of which specify L-Thr as the native substrate 17 ) yielded detectable pyoverdine products, versus zero of nine substitutions of A domains specifying alternative residues. 13,18 In contrast, non-synonymous C-A domain substitutions generated novel pyoverdines in three out of ten cases, suggesting that a cognate C domain was required for functionality. 13,18,19 However, we now show that our previous failure to generate novel pyoverdines by A domain substitution is surmountable by use of more effective recombination boundaries, and that the C domains in the pyoverdine NRPS system do not impose stringent proof-reading constraints. Tellingly, we show that this is also true in the artificial PheATE/ProCAT dimodular NRPS model of Belshaw et al (1999), the system from which stringent C domain selectivity for the acceptor substrate was originally inferred. 4

Results and Discussion
Semi-rational DNA shuffling to identify regions involved in C domain substrate specificity
Previous structural biology and bioinformatics approaches to identify residues involved in the presumed substrate specificity of C domains have been unsuccessful. 1,20,21 We instead adopted a semi-rational strategy, seeking to shuffle C domains derived from modules that incorporate different amino acid substrates, then retrospectively identify important substrate-defining residues. For this we selected the Lys-specific module Pa8 and the Thr-specific module Pa11, as we had previously found a pyoverdine with Lys at position 11 could be generated by substituting the C-A domains (but not the A domain) from Pa8 into Pa11 of PvdD. 13 At the time, we interpreted this result as showing that the Pa8 A domain can function within the PvdD environment, but only when paired with a Lys-specifying C domain. An added attraction of the model is that the Pa8 and Pa11 C domains appear to be paralogs, as they are nearly identical apart from three stretches of low sequence identity (Supplementary Figure 2A). We reasoned that these low-identity regions were likely to contain substrate-specifying residues, and a homology model based on the C domain from TycC (pdb: 2JGP 22 ) suggested the three regions could be shuffled effectively, with only a minimal number of amino acid perturbations introduced (Supplementary Figures 2B and 2C).
DNA blocks spanning each of the three variable regions of the Pa8 (presumed K-specific) and Pa11 (presumed T-specific) C domains were shuffled to yield all eight possible combinations (TTT, KTT, ..., KKK; Figure 1A). Each shuffled C domain was then introduced into a pvdD gene construct, immediately upstream of either the native Pa11 or a substituted Pa8 A domain (Figure 1A; Supplementary Figure 3). We observed that region 3 played a dominant role in defining the substrate compatibility of the recombinant C domains. With the exception of the recombinant C domain KKT, which was not functional in association with either A domain, C domains that contained region 3 from Pa11 were substantially more active in partnership with the Pa11 A domain, and only C domains that contained region 3 from Pa8 were active in partnership with the Pa8 A domain (Figure 1B; Supplementary Figure 4). These data suggested that region 3 contains key specificity-determining residues. However, it is important to note that the recombination point between the C and A domains was near the A1 motif. This meant region 3 of the C domain was always substituted in association with the corresponding loop that delineates C and A domains and is often referred to as a linker region. 14,23 When designing this experiment we considered this was unlikely to be a significant factor, as there is no structural basis for considering that the linker region could be involved in acceptor site specificity, and the linker region has appeared unimportant in previously successful synonymous substitutions. 13,14,18
Figure 1. A) Overview of the semi-rational shuffling approach used to narrow down substrate specifying regions. The three variable regions of the C domains from modules Pa8 (green) and Pa11 (purple) were shuffled in every combination to create eight variant C domains. Each of these was inserted into: 1) a plasmid containing a pvdD gene lacking the Pa11 C domain, and 2) a plasmid containing a pvdD gene lacking the Pa11 C domain and in which the Pa11 A domain had additionally been replaced by the Lys-specific A domain from Pa8. B) Pyoverdine production from strains transformed with the variant pvdD genes from panel A was assessed by measuring absorbance at 400 nm relative to a wild-type P. aeruginosa strain. Error bars represent the standard deviation from six independent replicates. C) Homology models highlighting (in light green) clusters of residues that were substituted as groups within the third region of Pa11 together with the linker-only substituted control. D) Pyoverdine production from strains transformed with the variant pvdD genes from panel C was assessed by measuring absorbance at 400 nm relative to a wild-type P. aeruginosa strain. Error bars represent the standard deviation from six independent replicates.
Region 3 contains 38 non-identical residues between the Pa8 and Pa11 C domains (Supplementary Figure 2A). With the goal of narrowing down the key substrate defining elements, proximal clusters of residues within the C domain of the Thr-specific module Pa11 were substituted by the corresponding residues within Pa8 (Figure 1C; Supplementary Figure 5). These substitutions focused on clusters of 6 or 12 residues closest to the catalytic histidine and/or the loop extending across the solvent channel. We also generated a control in which the Pa11 C domain was fused to the linker region from module Pa8. Surprisingly, modifications to the C domain had little effect on pyoverdine production; however, changing the Pa11 linker region to that from Pa8 was sufficient to allow the native PvdD C domain to function efficiently with the Pa8 lysine-specifying A domain (Figure 1D; Supplementary Figure 6). The resulting pyoverdine species was produced in high yield and contained lysine at position 11. Conceptually, substituting the A domain together with its cognate linker is just an A domain substitution that uses a different recombination boundary. Thus, this result was inconsistent with our hypothesis that C domain acceptor site specificity had caused our previous A domain substitutions to be non-functional. 13,18
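The combinatorial design of the shuffling experiment described above can be sketched in a few lines of Python. This is an illustrative enumeration only: the labels "T" (region from the Pa11 C domain) and "K" (region from the Pa8 C domain) follow the naming used in the text, while the function name and the A domain labels are hypothetical.

```python
from itertools import product

# "T" = variable region taken from the Pa11 (Thr) C domain,
# "K" = variable region taken from the Pa8 (Lys) C domain.
PARENTS = ("T", "K")
N_REGIONS = 3  # three variable regions were shuffled

def shuffled_variants(parents=PARENTS, n_regions=N_REGIONS):
    """Return every assignment of a parent to each variable region, e.g. 'TTK'."""
    return ["".join(combo) for combo in product(parents, repeat=n_regions)]

variants = shuffled_variants()

# Each shuffled C domain was paired with either the native Pa11 A domain
# or the substituted Pa8 A domain (labels here are illustrative).
constructs = [(c_domain, a_domain)
              for c_domain in variants
              for a_domain in ("Pa11-Thr", "Pa8-Lys")]

print(len(variants), "shuffled C domains;", len(constructs), "constructs")
```

Two parents across three regions give 2^3 = 8 C domain variants, and pairing each with the two A domains gives 16 constructs to assay.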

Evaluating the efficiency of A domain substitutions
Our previous C-A substitutions had a combined success rate of only 3/10 constructs yielding a detectable pyoverdine product, with two of these being in very low yield. 13,18,19 To test whether A domain substitution using our new upstream recombination boundary was a more efficient strategy, we generated the equivalent "linker + A domain" substitution constructs for each of our three previously successful C-A domain substitutions. In each case, the pyoverdine yield was increased by substituting the linker and A domain together (Figure 2A; Supplementary Figures 7 and 8). We then tested whether we could efficiently produce other modified pyoverdines, using A domains that activate substrates other than Thr (Supplementary Table 1; Supplementary Figure 9). Six of the nine A domain substitution variants gave modified pyoverdines at high yields (Figure 2B; Supplementary Figure 10). Collectively, these results confirmed not only that substitution of an A domain without the corresponding C domain is possible, but that it can result in improved success rates and yields compared to C-A domain substitutions.
Figure 2. A) Pyoverdine production for C-A domain substitution strains compared with the corresponding linker and A domain substitution strains. Pyoverdine levels were measured by optical density at 400 nm, and error bars represent the standard deviation from six independent replicates. B) Pyoverdine production for nine additional A + linker domain substitution variants as measured by optical density at 400 nm. Error bars represent the standard deviation from six independent replicates.

Evolution of NRPS diversity within the Pseudomonas, Streptomyces and Bacillus genera
The tight acceptor site specificity originally proposed by Belshaw et al 4 has fuelled speculation that C and A domains are likely to co-evolve. 24,25 This supposition is at odds with observations that complete or partial A domain substitution has driven diversification of the microcystin, 26 aeruginosin, 27 hormaomycin, 28 and lipo-octapeptide 29 biosynthetic pathways. While it is possible that these examples represent isolated instances of relaxed-specificity C domains, our experimental success in generating novel pyoverdines led us to consider that A domain substitution might be a more global phenomenon driving NRPS diversification.
To test this on a small scale, we first constructed distinct maximum likelihood phylogenetic trees based on the C and A domains from NRPS enzymes involved in the biosynthesis of four different pyoverdines (Figure 3). We reasoned that modules within a single pathway that specify the same substrate, e.g. the two Thr modules in pvdD, are likely to have evolved via domain duplication and recombination. Consistent with this, we observed that pyoverdine NRPS A domains cluster tightly by substrate specificity (Figure 3A). In contrast, the C domains from these four systems appear to have evolved independently of substrate specificity (Figure 3B). To perform a more global analysis of whether C and A domains have evolved independently, we downloaded the sequences of all NRPS gene clusters from the AntiSMASH database 30 for the genera Pseudomonas, Streptomyces and Bacillus. A total of 437, 370 and 213 LCL-A-T tri-domain sequences for Pseudomonas, Bacillus and Streptomyces species, respectively, were extracted, clustered at 95% identity and aligned. Analysis with TreeOrderScan, 31,32 which assesses 400 bp subalignments at 50 bp intervals, revealed that phylogenetic incompatibility was increased between A domains and the surrounding domains (Figure 4A). Segregation analysis of the 400 bp subalignments identified an A domain region associated with increased clustering by substrate specificity (Figure 4B).
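As a rough illustration of the windowing scheme described above (400 bp subalignments taken at 50 bp intervals), the following sketch enumerates window coordinates along an alignment. The function name, the half-open coordinate convention and the 3,000 bp example length are assumptions made for illustration; how TreeOrderScan itself treats the ends of an alignment is not described here.

```python
def sliding_windows(alignment_length, window=400, step=50):
    """Yield (start, end) half-open coordinates of full-length windows only."""
    for start in range(0, alignment_length - window + 1, step):
        yield start, start + window

# Hypothetical alignment length, chosen only to show the output shape.
example_length = 3000
windows = list(sliding_windows(example_length))

print(len(windows), "windows; first:", windows[0], "last:", windows[-1])
```

Because adjacent windows overlap by 350 bp, a phylogenetic signal that changes sharply at a recombination breakpoint appears as a gradual shift across several consecutive windows rather than as a single step.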
The TreeOrderScan analysis is consistent with A domain substitution driving NRPS evolution but does not identify natural recombination points. To identify hotspots of recombination, analysis of the sequences was performed using RDP4, an ensemble of tools that collectively identify regions at which DNA sequences are likely to have recombined. 33 The breakpoint distribution identified recombination hotspots located between C and A domains (Figure 4C, red shading), upstream of the A domain substrate binding pocket between the A2 and A4 motifs (Figure 4C, green shading), and downstream of the binding pocket starting from close to the A5 motif (Figure 4C, blue shading). The largest hotspots were located immediately on either side of the binding pocket, flanking the region that segregates most strongly by substrate specificity (Figure 4A-C). These data were consistent for each of the Pseudomonas, Bacillus and Streptomyces genera, and inconsistent with the hypothesis that C and A domains co-evolve. We conclude that complete or partial A domain substitution appears to play a primary role in diversification of NRPS pathways in nature, rather than being an exception.
Figure 4. A) Analysis was performed using TreeOrderScan, 31,32 considering alignments of 400 bp at 50 bp intervals and using a bootstrap value of 70% to calculate phylogenetic violations. B) Segregation of alignments by consensus substrate specificity predictions from AntiSMASH. 30 Segregation scores were calculated using a 0% (red line), 50% (blue line) and 70% (green line) bootstrap cut off. The locations of key conserved motifs are indicated along the top of the graph. For this analysis, 400 bp alignments were compared for segregation into groups based on consensus substrate specificity predictions from AntiSMASH. 30 A segregation score of 0 means perfect segregation by substrate specificity and a score of 1.0 means no segregation on the basis of substrate specificity. Shaded blocks have been added to aid comparison between regions of interest in panels C and D. C) Recombination hotspot analysis of C-A-T domains from Pseudomonas, Bacillus and Streptomyces species. 'X' marks the location of the recombination point used for the successful (linker control) Lys A domain substitution in Figure 1D. Dark and light grey areas indicate local breakpoint hotspots at the 95% and 99% confidence level respectively, and the two horizontal lines indicate cut-offs for global breakpoint hotspots at the 95% and 99% confidence level. D) The light blue shaded plot indicates the average number of clashes calculated using SCHEMA that would be introduced by recombination of nine alternative pyoverdine NRPS modules from Figure 2B with the domains from Pa11. The dark blue region of the graph indicates 1 standard deviation.
Partial A domain substitution has previously been attempted in two key laboratory studies. 34,35 However, only one of these (working in an initiation module that lacks a C domain) described the formation of a novel peptide, the rate of formation of which was greatly reduced in vitro and not tested further in vivo. 35 We tested partial A domain substitution in PvdD using several different recombination boundaries, but did not achieve visible production of pyoverdine in any case (Supplementary Figures 11 and 12). Reasoning that this might be due to structural clashes restricting efficient recombination of NRPS enzymes within certain domain regions, we used SCHEMA 36 to predict the number of perturbations generated by recombination of the C-A-T domains from PvdD with the C-A-T domains from alternative pyoverdine NRPS modules (Figure 4D). Whereas the recombination sites employed in our successful linker + A domain substitutions were within regions with low potential to cause structural perturbations during substitution, the recombination hotspot between the A2 and A4 motifs (Figure 4C, green shading) was in a region with high potential for perturbations. We believe that this explains why our (and previous 34,35 ) experimental attempts at partial A domain substitution were generally unsuccessful. In contrast, because natural recombination processes favour short sequences, 37,38 this appears to be a preferred region of recombination during natural NRPS evolution, with the low success rate presumably offset by the high frequency of recombination events. Irrespective, our diverse analyses provide unanimous support for partial and complete A domain substitution being a primary mechanism for NRPS diversification.
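The idea behind a SCHEMA-style clash count can be illustrated with a simplified sketch. Here a contact between two residue positions is counted as disrupted when the residue pair found in the chimera does not occur at those positions in any parent. The toy sequences, the contact list and the function names are all hypothetical, and this is not the SCHEMA implementation used for Figure 4D; in practice contacts would be derived from a structure or homology model.

```python
def schema_disruption(parents, chimera, contacts):
    """Count contacts (i, j) whose residue pair in the chimera occurs in no parent."""
    broken = 0
    for i, j in contacts:
        pair = (chimera[i], chimera[j])
        if not any((p[i], p[j]) == pair for p in parents):
            broken += 1
    return broken

def single_crossover(parent_a, parent_b, point):
    """Chimera taking positions [0, point) from parent_a and the rest from parent_b."""
    return parent_a[:point] + parent_b[point:]

# Toy aligned parents and toy contact map (0-based positions), for illustration only.
parent_a = "ACDEFG"
parent_b = "AKDQFW"
contacts = [(0, 1), (1, 3), (2, 4), (3, 5)]

# Disruption profile across every possible single crossover point.
profile = {
    point: schema_disruption([parent_a, parent_b],
                             single_crossover(parent_a, parent_b, point),
                             contacts)
    for point in range(1, len(parent_a))
}
print(profile)
```

Scanning the crossover point along the alignment in this way gives a per-position profile, which is the kind of curve against which candidate recombination boundaries can be compared.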

A domain substitution in the PheATE-ProCAT model system
Our high success rates in modifying pyoverdine via A domain substitution and our bioinformatics analyses suggested that C domain acceptor substrate specificity is less of a barrier to NRPS engineering than previously proposed by several groups, including us. 2,4,5,8,14-16 However, the possibility remained that the Pa8 and Pa11 C domains are more relaxed than C domains such as that in module 2 of tyrocidine biosynthesis, the system that Belshaw et al used to develop the original hypothesis of C domains exhibiting strong acceptor substrate specificity. 4 We therefore considered it important to test whether we could create novel dipeptides by performing A domain substitution within the same system.
Belshaw et al showed that tyrocidine module one (PheATE, incorporating D-Phe) and module two (ProCAT, incorporating L-Pro) could be artificially loaded with different amino acids in vitro, and the resulting dipeptide purified and analysed. 4 Artificial loading with L-Pro resulted in the expected product, but no product resulted when ProCAT was loaded with L-Leu, suggesting stringent substrate specificity during condensation. We therefore considered that efficient production of D-Phe-L-Leu dipeptides via A domain substitution in the ProCAT module would disprove the C domain substrate specificity hypothesis. To facilitate construct generation and analysis we used a similar two-plasmid system to previous researchers, 39,40 with the replacement of the T domain from ProCAT with the T-Te domains from SrfAC, to enable release of linear D-Phe-L-Leu dipeptides. 41,42 The genetic regions encoding four different Leu-specifying A domains were selected for substitution into the ProCATTe plasmid and compared to an unsubstituted control (Supplementary Table 2; Supplementary Figure 13). All A domains shared relatively low amino acid identity to the A domain from ProCAT (40.4% to 47.6%; Supplementary Table 2). The Leu-specifying A domain from SrfAC was of particular interest because the crystal structure of this module fuelled speculation that C-A domains form a tight interface, which may further restrict A domain substitution. 23 As such this experiment combined all the main factors that have been suggested to prohibit effective A domain substitution, i.e. a C domain believed to exhibit tight acceptor site specificity, the substitution of distantly related A domains, and substitution of an A domain believed to depend on a tight C-A domain interface with its cognate C domain partner.
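For readers unfamiliar with how a pairwise amino acid identity such as those quoted above might be obtained, the sketch below computes percent identity from two pre-aligned sequences. The convention used (matches divided by the number of columns in which neither sequence has a gap) is an assumption; other denominators are in common use, and the method used for Supplementary Table 2 is not specified here. The sequences are toy examples.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns containing a gap in either sequence
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

# Toy aligned fragments, not real A domain sequences.
identity = percent_identity("ACDEF-G", "ACE-FHG")
print(f"{identity:.1f}% identity")
```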
The recombinant ProCATTe plasmids were co-transformed with a second plasmid containing PheATE into a BAP1 strain of E. coli. 43 The strains were grown for 24 hours, after which the supernatant was extracted and dipeptides quantified using HPLC and absorbance at 214 nm (Figure 5). We detected production of the native D-Phe-L-Pro diketopiperazine at 7.8 mg/L by the control strain (Figure 5A), a yield that compares favourably to previous reports. 39,40 We also detected D-Phe-L-Leu dipeptides for three of the four strains containing Leu-specific A domain substitutions, at yields ranging from ca. 25% to 40% of the diketopiperazine control (Figure 5B). Despite sharing the lowest amino acid identity with the ProCAT A domain, and its previously hypothesised requirement to maintain a native C domain interface, the strain containing the A domain from SrfAC was found to produce D-Phe-L-Leu at 1.8 mg/L. We therefore conclude that neither C domain substrate specificity nor a failure to maintain native C-A domain interfaces poses a prohibitive barrier to generating functional A domain substitution constructs. Rather, it appears that successful A domain substitution relies greatly on the recombination boundaries used.
Figure 5. A) Schematic showing the domain arrangement of the PheATE/ProCATTe constructs used in this study and HPLC traces comparing the product made by an E. coli strain expressing these constructs (1) relative to an analytical standard of D-Phe-L-Pro DKP (dark blue trace). B) Schematic and HPLC traces for the products generated by four strains (2-5) bearing variants of ProCATTe in which the Pro-specifying A domain had been substituted by a Leu-specifying A domain. Strains 2, 3 and 4 show a peak corresponding to an analytical standard of D-Phe-L-Leu (dark blue trace). Samples were analysed for 3 independent replicates and used to calculate average yield and standard deviation. Masses of the expected products were verified using HR-ESI-MS (Supplementary Figure S14).

Conclusions
Although natural evolution has given rise to a large diversity of non-ribosomal peptides, effective reengineering of NRPS templates in the laboratory has proven difficult. A primary focus in reengineering NRPS enzymes has been to accommodate the presumed acceptor site specificity of the C domain. We have shown this is not necessary, performing A domain substitution with high success rates in two distinct NRPS pathways, one of these being the model system that was used to develop the original hypothesis of C domain specificity. Our successful A domain substitution approach is consistent with phylogenetic evidence that A domain substitution has played a primary role in the evolution of NRPS diversity across three genera of bacteria. We anticipate that these findings will pave the way to efficient rational and combinatorial modification of medically relevant bioactive peptides, to improve desirable physicochemical properties or evade emerging resistance mechanisms.

Competing Interests
Drs Calcott and Ackerley have submitted provisional patent filing AU2019903420, teaching methods for re-engineering NRPS assembly lines based on the optimal A domain recombination sites identified in this research.

Data Availability Statement
All data generated or analysed during this study are included in this published article (and its supplementary information files).

Code Availability Statement
All code generated during this study is available on GitHub via the links provided in the Methods.

DNA manipulation
All plasmids, primers and sequences used in this study are provided in the Supplementary file "Plasmids_primer_and_sequences.xlsx".

For modifications of pvdD
To create vectors for substituting domains into pvdD, the PBAD promoter was excised from pSW196 using the restriction sites NsiI and SacI and ligated into pUCP22 using the restriction sites PstI and SacI. The resulting plasmid was named pUCBAD. Next, the pvdD gene lacking the C-A domains from the second module (module Pa11) was excised from the plasmid pSMC 1 using NheI and SacI restriction sites and ligated into the pUCBAD vector using the restriction sites NheI and SacI to create the vector pUCBAD-SMC. The Thr-specific A domain from the second module of PvdD and the Lys-specific A domain from the first module of PvdJ (module Pa8) were PCR amplified and separately ligated into the pUCBAD-SMC vector. This resulted in the creation of the vectors pDEC-Thr and pDEC-Lys, which contained a Thr-specific A domain and a Lys-specific A domain variant of the pvdD gene, respectively. For modifying the third variable region of the C domain from pvdD, the first two variable regions of the pvdD C domain from module Pa11 were PCR amplified. The resulting fragment was ligated into the pDEC-Thr vector using a 5′ SpeI/XbaI and a 3′ SalI/SalI ligation to create the plasmid pTRN. The destruction of the SpeI restriction site within the vector by SpeI/XbaI ligation meant the introduced SpeI site downstream of region 2 of the C domain was unique within the plasmid, allowing insertion of region 3 using SpeI and SalI restriction sites. C domains were created by overlap PCR or synthesis (Twist Bioscience; San Francisco, CA). C domain sequences were amplified using the appropriate forward and reverse primers specific to the C domain from modules Pa8-Lys, Pa11-Thr, Ps5-Ser or Pf6-Orn (Pa indicates Pseudomonas aeruginosa PAO1; Ps5-Ser indicates the fifth serine-incorporating pyoverdine NRPS module from Pseudomonas syringae 1448a; Pf6-Orn indicates the sixth ornithine-incorporating pyoverdine NRPS module from Pseudomonas fluorescens SBW25; Supplementary Figure S1). 
PCR products were digested using XbaI and XhoI, and ligated into the plasmids pDEC-Thr and pDEC-Lys using compatible SpeI and SalI restriction sites. The partial C domain fragments containing region 3 of a C domain were amplified using the appropriate forward and reverse primers. PCR products were digested using SpeI and XhoI, and ligated into the plasmid pTRN at compatible SpeI and SalI restriction sites.
A domains were selected to activate a range of substrates, from modules exhibiting a range of amino acid sequence identities with Pa11-Thr (Supplementary Table S1). To enable cloning of A domains into the substitution vector pTRN, inserts were designed to have an upstream region identical to the C domain from Pa11-Thr, fused to the linker and A domain from the selected modules. Recombination points for each substitution are shown in Supplementary Figure S9. To reduce the GC content to acceptable levels for synthesis, 5_AP013068.1.cluster003_CA1 was codon optimised for P. aeruginosa PAO1 using the guided random method in OPTIMIZER. 2 Partial A domain substitutions were created for the Lys-specific A domain from module Pa8 and the Ser-specific A domain from number 2 in Figure 2B. Substitutions were created by ligating synthetic DNA constructs into the vector pTRN. Recombination points for partial A domain substitutions are shown in Supplementary Figure S11, and are labelled in the Supplementary file "Plasmids_primer_and_sequences.xlsx" using the GrsA nomenclature from Kries et al. 3 The recombination points tested were T221 and I352, corresponding to those used previously by Kries et al, 3 and K205 and A322, corresponding to those used by Crüsemann et al. 4 Based on our SCHEMA analysis we also tested the promising recombination points S233 and A332, as well as upstream recombination points at A185, I167 and S233 in combination with the downstream site used in the full A domain substitutions.
C-A domains from modules Ps5-Ser, Pf6-Orn and Pa8-Lys were amplified by PCR and ligated into the vector pUCBAD-SMC.

For modifications of PheATE and ProCATTe
DNA encoding PheATE and ProC-TTe was artificially synthesised (Twist Bioscience; San Francisco, CA) following codon optimisation for E. coli using the guided random method in OPTIMIZER. 2 PheATE was cloned into pACYCDuet-1 using NcoI and XhoI restriction sites, and ProC-TTe was cloned into pET28a+ using NcoI and XhoI restriction sites. The Pro-specific A domain from ProCAT and four Leu-specific A domains were codon optimised and ligated into the SpeI and NotI restriction sites of ProC-TTe using compatible NheI and NotI sites. Alignments of A domains and sequence origins and identities are provided in Supplementary Figure S14 and Supplementary Table S2.

Homology model creation
Structural models of the C domain from the second module of PvdD were created by submitting the C domain sequence to multiple automated servers, 5-20 and using the Swiss-Model server (http://swissmodel.expasy.org/) 21 and Modeller 9.11. 22 Models created from each method were submitted to the QMEAN server to obtain QMEAN6 and QMEANclust scores. 23,24 The RaptorXmsa model was selected for further work as it scored best overall across both measures. This model aligned well to the C domain structure from TycC, 25 with a root-mean-square deviation for the backbone α carbons of 0.381 Å.
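The root-mean-square deviation quoted above is the square root of the mean squared distance between paired α carbons. A minimal sketch of that calculation is given below; it assumes the two structures have already been superimposed and that the coordinate lists are paired residue by residue. The superposition step itself is not shown, and the coordinates are toy values rather than data from the model or from TycC.

```python
import math

def ca_rmsd(coords_a, coords_b):
    """RMSD (same units as the input) between paired, pre-superimposed coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))

# Toy C-alpha coordinates in angstroms (hypothetical values).
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
template = [(0.0, 0.0, 0.3), (3.8, 0.0, 0.3), (7.6, 0.0, 0.3)]
print(f"RMSD = {ca_rmsd(model, template):.3f} A")
```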

Sequence acquisition
The AntiSMASH database 26 was queried for all NRPS biosynthetic gene clusters from the genera Pseudomonas, Bacillus and Streptomyces. The GenBank files for each cluster were downloaded, and the Python script "extractCATdomains_consensus.py" (available at https://github.com/MarkCalcott/Extract_CAT_domains/) was used to find, extract and save DNA encoding LCL-A-T tridomains into a separate FASTA file for each genus. The criteria for extracting tridomains were that the A domain had an antiSMASH consensus substrate specificity prediction, and that C-A-T domains were located on the same protein in the correct order, with fewer than 250 amino acid residues between domains. Each extracted DNA sequence was annotated with the consensus substrate specificity. Sequences were dereplicated using USEARCH 10.0.240, 27 and clustered at 95% identity (Table 1). A codon-alignment of the centroid nucleotide sequence from each cluster was generated using MUSCLE. 28 Sequences were trimmed to the C1 and T motifs inclusive, and any sequences not containing these motifs were removed. Regions of ambiguous alignment were removed using Gblocks version 0.91b. 29 The default parameters were used for Gblocks except that the minimum number of sequences for a flank position was set equal to 50% of the total sequences, the minimum length of a block was 5, and gap positions were allowed in half of the sequences.
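The extraction criteria listed above can be expressed as a short filter. The sketch below is not the "extractCATdomains_consensus.py" script; it is an illustrative re-statement of the stated rules, with a hypothetical Domain record and function name, and it does not distinguish LCL domains from other C domain subtypes. Domain coordinates are amino acid positions on a single protein.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Domain:
    kind: str                        # "C", "A" or "T"
    start: int                       # first residue of the domain
    end: int                         # last residue of the domain
    substrate: Optional[str] = None  # consensus prediction (A domains only)

MAX_GAP = 250  # "fewer than 250 amino acid residues between domains"

def find_tridomains(domains: List[Domain],
                    max_gap: int = MAX_GAP) -> List[Tuple[Domain, Domain, Domain]]:
    """Return consecutive C-A-T triples on one protein that meet the criteria."""
    ordered = sorted(domains, key=lambda d: d.start)
    hits = []
    for c, a, t in zip(ordered, ordered[1:], ordered[2:]):
        if (c.kind, a.kind, t.kind) != ("C", "A", "T"):
            continue  # domains must be adjacent and in the correct order
        if a.substrate is None:
            continue  # A domain needs a consensus specificity prediction
        gap_ca = a.start - c.end - 1  # residues lying between C and A
        gap_at = t.start - a.end - 1  # residues lying between A and T
        if gap_ca >= max_gap or gap_at >= max_gap:
            continue
        hits.append((c, a, t))
    return hits

# Hypothetical protein with one qualifying module.
protein = [Domain("C", 1, 430), Domain("A", 450, 950, "thr"), Domain("T", 970, 1040)]
print(len(find_tridomains(protein)), "tridomain(s) found")
```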