Expansion of a core regulon by transposable elements promotes Arabidopsis chemical diversity and pathogen defense

Plants synthesize numerous ecologically specialized, lineage-specific metabolites through biosynthetic gene duplication and functional specialization. However, it remains unclear how duplicated genes are wired into existing regulatory networks. We show that the duplicated gene CYP82C2 has been recruited into the WRKY33 regulon and indole-3-carbonylnitrile (ICN) biosynthetic pathway through exaptation of a retroduplicated LINE retrotransposon (EPCOT3) into an enhancer. The stepwise development of a chromatin-accessible WRKY33-binding site on EPCOT3 has potentiated the regulatory neofunctionalization of CYP82C2 and the evolution of inducible defense metabolite 4-hydroxy-ICN in Arabidopsis thaliana. Although transposable elements (TEs) have long been recognized to have the potential to rewire regulatory networks, these results establish a more complete understanding of how duplicated genes and TEs contribute in concert to chemical diversity and pathogen defense.

P lant secondary or specialized metabolites are essential for plant survival in co-evolving biotic and fluctuating abiotic environments. The evolutionary process of chemical innovation resulted in the collective synthesis of hundreds of thousands of ecologically specialized, mostly lineage-specific metabolites [1][2][3] . Plant-specialized metabolic enzymes are ultimately produced from primary metabolic enzymes through gene duplication and subsequent functional divergence of one or both paralogs to produce enzymes with altered expression patterns and/or protein functions [3][4][5] . They are also often organized into transcription factor (TF) regulons of co-regulated genes for optimal timing, amplitude, and tissue-specific pathway gene expression and subsequent metabolite accumulation 6,7 .
Changes in cis-regulatory modules such as enhancers and promoters can accelerate the capture of duplicated genes into regulons, thus driving phenotypic diversity [8][9][10] . Enhancers consist of TF binding sites (TFBSs) and are derived either through mutation or co-option of a TFBS-carrying transposable element (TE) 10,11 . TE exaptations are hypothesized to be responsible for the rapid transcriptional rewiring of gene regulatory networks in ancient lineages of vertebrates [12][13][14] and plants 15 , but general understandings of the physiological significance of this rewiring are greatly limited.
Bacteria elicit two primary immune defense modes in plants, pattern-and effector-triggered immunity (PTI and ETI) 16 . Pathogenic bacteria additionally compromise PTI via specific virulence effector proteins (effector-triggered susceptibility, ETS) 16 . PTI involves the extracellular perception of conserved molecules known as microbe-associated molecular patterns (MAMPs), whereas ETI involves the cytosolic perception of effectors. Although ETI results in the formation of more rapid and robust pathogen-specific responses including a form of programmed cell death known as the hypersensitive response (HR) 16 , both result in the ability of naive host cells to generate, through non-self perception and subsequent transcriptional reprogramming, pathogeninducible specialized metabolites necessary for defense [17][18][19] .
The TF WRKY33 regulates the pathogen-inducible biosynthesis of camalexin in A. thaliana 31,32 and its orthologs regulate numerous unrelated specialized metabolites in other flowering plant lineages 33 . The group I class of WRKYs to which WRKY33 belongs is an ancient clade of regulators; orthologs in the green alga Chlamydomonas reinhardtii may be ancestral to all higher plant WRKYs 33,34 . Although all WRKY TFs bind to the W-box core sequence [TTGAC(T/C)], WRKY33 preferentially binds Wboxes that are within 500 nt of the WRKY33-specific motif [(T/G) TTGAAT]) 35  Data represent mean ± SE of four replicates of 15 ± 2 seedlings each. Different letters denote statistically significant differences (P < 0.05, one-factor ANOVA coupled to Tukey's test). ICN totals consist of the sum of ICN and methanolic degradation product ICA-ME. 4OH-ICN totals consist of the sum of aqueous and methanolic degradation products 4OH-ICA-ME and 4OH-ICA, respectively Here we show that a recent, lineage-specific TE exaptation has resulted in the expansion of a core regulon within the framework of Arabidopsis Trp-derived defense metabolism. Specifically, the LINE retrotransposon EPCOT3 has retroduplicated from a WRKY33-TFBS-carrying progenitor and inserted upstream of the newly duplicated gene CYP82C2. Subsequent chromatin remodeling in A. thaliana has led EPCOT3 to become a bona fide enhancer with demonstrated biochemical, regulatory, physiological, and fitness-promoting characteristics by way of WRKY33binding and pathogen-responsive CYP82C2 transcription, 4OH-ICN biosynthesis, and antibacterial defense.

Results
4OH-ICN requires ETI-like responses. To identify the major Trp-derived specialized metabolites synthesized in ETI in A. thaliana, we compared host transcriptional and metabolic responses to the PTI-eliciting bacterial MAMPs flg22, elf26, and fungal MAMP chitosan; the PTI/ETS-eliciting pathogens Pseudomonas syringae pv. tomato DC3000 (Pto DC3000 or Pst); P. syringae pv. maculicola ES4326 (Pma); and the ETI-eliciting pathogens Pst avrRpm1 (Psta), Pst avrRpt2, Pst avrRps4, Pma M2, and Pma avrRpt2 under similar conditions as those of previous studies 19,36 . Psm M2 is an ETI-eliciting strain from which the avrRpm1 gene was originally isolated 37 . Both flg22 and Psta induced genes involved in camalexin, 4OH-ICN, and 4M-I3M biosynthesis, with camalexin and 4OH-ICN biosynthetic genes having a higher level of induction than those of 4M-I3M in Psta-inoculated plants 36 (Supplementary Table 1). On the other hand, metabolite responses between PTI and ETI differed qualitatively. 4M-I3M and its immediate precursor 4-hydroxy-I3M (4OH-I3M) were present in uninfected plants and accumulated to modest levels at the expense of parent metabolite I3M in flg22and Psta-inoculated plants 19 ( Supplementary Fig. 1a). By comparison, camalexin, ICN, and 4OH-ICN were absent in uninfected plants and accumulated to high levels upon inoculation with ETI-inducing pathogens ( Fig. 1b and Supplementary  Fig. 1b). Furthermore, camalexin, ICN, and 4OH-ICN metabolism was greatly diminished, and 4M-I3M, 4OH-I3M, and I3M levels were mostly unchanged in the rpm1 mutant (Supplementary Fig. 1), which is impaired in ETI recognition of Psta 40 . By contrast, camalexin and ICN were largely at low-to-undetectable levels in plants treated with saturating concentrations of the bacterial MAMPs flg22 and elf26 38,39 and PTI/ETS-eliciting pathogens, with 4OH-ICN not detected in most cases (Fig. 1b). One exception was the fungal MAMP chitosan. Chitosan (150 μg/ mL) induced high levels of camalexin and detectable levels of ICN (Fig. 1b), consistent with previous observations of camalexin biosynthetic gene upregulation 41 . Higher chitosan concentrations (≥ 200 μg/mL) have been shown to induce HR-like cell death in Arabidopsis 42 , a phenomenon commonly observed for ETI 16 . To our surprise, 300 μg/mL chitosan additionally induced detectable levels of 4OH-ICN (Fig. 1b). These results suggest that 4OH-I3M, 4M-I3M, camalexin, and ICN are synthesized in response to multiple PTI elicitors, whereas 4OH-ICN biosynthesis is specific to ETI-like responses.
WRKY33 is required to activate 4OH-ICN in response to Psta. 4OH-ICN biosynthetic genes are highly co-expressed with each other 23 and with camalexin biosynthetic genes (Supplementary  Table 2), which are in the WRKY33 regulon 31,43 . To determine whether 4OH-ICN biosynthetic genes are also in the WRKY33 regulon, we compared camalexin, ICN, and 4OH-ICN levels between wild-type and a wrky33 loss-of-function mutant that encodes two differently truncated proteins 44 (Fig. 2a). Consistent with a previous report 31 , wrky33 was impaired in camalexin biosynthesis in response to Psta and Pst avrRps4 ( Fig. 2b and Supplementary Fig. 2a). The wrky33 mutant was similarly impaired in 4OH-ICN biosynthesis ( Fig. 2b and Supplementary  Fig. 2a). These results indicate that WRKY33 is required for camalexin and 4OH-ICN biosynthesis in response to multiple ETI elicitors.
To confirm that WRKY33 is required to activate the 4OH-ICN pathway, we used a two-component glucocorticoid-inducible system to generate wrky33 plants that in the presence of the glucocorticoid hormone dexamethasone (dex) express a wild-type copy of WRKY33 with a C-terminal fusion to 1× flag epitope (wrky33/DEX:WRKY33-flag; Supplementary Fig. 2b-c). Induced expression of WRKY33-flag restored camalexin and 4OH-ICN biosynthesis in Psta-challenged wrky33 plants to greater than wild-type levels ( Supplementary Fig. 2d). These results indicate that WRKY33 is required to activate camalexin and 4OH-ICN biosynthesis in response to Psta.
Natural variation in WRKY33 affects metabolism and defense. Intraspecific variation in TFs can contribute to gain or loss of phenotypes, such as branching in maize 45 or pelvic loss in threespined stickleback fish 46 . In addition, the wide variation in camalexin biosynthesis reported among natural accessions of A. thaliana 47 suggests that a similar variation in 4OH-ICN biosynthesis may exist. To identify additional transcriptional activators of 4OH-ICN biosynthesis that otherwise might be refractory to traditional genetic approaches, we compared intraspecific variation in Psta-induced camalexin, ICN, and 4OH-ICN among 35 re-sequenced accessions and wrky33 (Col-0 accession). We found camalexin and 4OH-ICN levels to be positively correlated among accessions (R 2 = 0.37; Supplementary  Fig. 3a), lending further support to their co-regulation by WRKY33. Accession Dijon-G (Di-G) was identified to produce less camalexin and 4OH-ICN and more ICN than its nearisogenic relatives, the Landsberg accessions Ler-0 and Ler-1 ( Fig. 2b and Supplementary Fig. 3a-b). In addition, differences observed in the metabolite response between Landsberg accessions and Di-G most closely resembled those between Col-0 and wrky33 mutant ( Fig. 2b and Supplementary Fig. 3a). These results led us to hypothesize that genetic variation in a regulatory gene, as opposed to an immune signaling gene, is responsible for the metabolite phenotypes observed in Di-G. To test this hypothesis, genetic variation between Di-G and three sequenced Landsberg accessions (La-0, Ler-0, and Ler-1) were used to identify 354 genes that were differentially mutated to high effect in Di-G ( Supplementary Fig. 3c). Twenty-eight of these mutated Di-G genes were annotated by Gene Ontology to have roles in defense, including WRKY33 (Supplementary Table 3). We confirmed by Sanger sequencing that Di-G WRKY33 harbors a nonsense mutation early in the N-terminal DNA-binding motif (Fig. 2a), likely abolishing protein function. Our findings indicate that camalexin and 4OH-ICN are sensitive to intraspecific variation in WRKY33.
Camalexin and 4OH-ICN promote plant fitness by contributing non-redundantly to pathogen defense against the fitnessreducing Pst 23 . To confirm that disease resistance to Pst is also sensitive to intraspecific variation in WRKY33, we measured bacterial growth in adult leaves of wrky33, Di-G, and their respective (near-)isogenic accessions Col-0 and Ler-1. wrky33 and Di-G were more susceptible to Pst than their (near-)isogenic relatives and comparable to the 4OH-ICN biosynthetic mutant cyp82C2 23 (Fig. 2c) We additionally generated wrky33 plants that in the presence of dex express a wild-type copy of WRKY33 with a C-terminal fusion to a larger 6× myc epitope (wrky33/DEX:WRKY33-myc;  We then tested for WRKY33-binding to W-box-containing regions upstream of camalexin and 4OH-ICN biosynthetic genes in dex-treated and Psta-infected wrky33/DEX:WRKY33-flag seedlings by chromatin immunoprecipitation (ChIP)-PCR. WRKY33 reportedly binds upstream of 4OH-ICN biosynthetic gene CYP71A12 to a W-box region that also contains three WRKY33-specific motifs 49 . We additionally observed that Pstainduced WRKY33 bound strongly (greater than fivefold enrichment) upstream of 4OH-ICN biosynthetic genes FOX1 and CYP82C2 to W-box regions that also contain one to three WRKY33-specific motifs (W2 and W4, respectively; Fig. 3b, c and Supplementary Fig. 5b). Together with our expression analysis, our findings indicate that WRKY33 uses preferred WRKY33binding sites to directly activate 4OH-ICN biosynthetic genes in response to pathogen effectors.
Interestingly, Psta-induced WRKY33 did not bind to the W5 region upstream of CYP82C2 (Fig. 3b, c), a W-box region that does not contain any WRKY33-specific motifs and is just upstream of neighboring gene of unknown function At4g31960. WRKY33 reportedly binds to W5 in response to flg22 49 and Botrytis cinerea 35    These findings suggest that WRKY33 may use W-box extended motifs or alternative specificity motifs to target camalexin biosynthetic genes in response to pathogen effectors, or 4OH-ICN biosynthetic genes in response to MAMPs or fungal pathogens.
CYP82C2 underwent regulatory neofunctionalization. CYP82C2 catalyzes the last step in 4OH-ICN biosynthesis, hydroxylating ICN to form 4OH-ICN 23 , and likely was the last 4OH-ICN pathway gene to be recruited to the WRKY33 regulon in A. thaliana. To explore the phylogenetic distribution pattern of 4OH-ICN biosynthesis, we profiled ICN and 4OH-ICN metabolites in close and distant relatives of A. thaliana in response to Psta. Although ICN biosynthesis was observed across multiple close relatives, 4OH-ICN was only detected in A. thaliana ( Fig. 4a and Supplementary Fig. 6a). This result suggests that 4OH-ICN manifests a species-specific diversification of pathogen-inducible Trp-derived metabolism in the mustard family.
In A. thaliana, CYP82C2 resides in a near-tandem cluster with paralogs CYP82C3 and CYP82C4 (Fig. 4b). We performed phylogenetic and syntenic analyses to identify putative CYP82C2 orthologs in a clade inclusive of ICN-synthesizing species. All identified homologs are syntenic to CYP82C2 or CYP82C4, and encode proteins with > 88% identity to one another ( Fig. 4b and Supplementary Fig. 6b-c). CYP82C3 is present only in A. thaliana and, although more similar to CYP82C2 than CYP82C4 in sequence ( Fig. 4b and Supplementary Fig. 6b), it is not functionally redundant with CYP82C2 23 . CYP82C4 is required for the biosynthesis of sideretin, a widely conserved, phenylalanine-derived metabolite required for iron acquisition 50 . CYP82C4 has syntenic orthologs in the mustard family ( Fig. 4b and Supplementary Fig. 6b), correlating with the distribution of sideretin biosynthesis 50 . By contrast, CYP82C2 has syntenic orthologs only within the Arabidopsis genus ( Fig. 4b and Supplementary Fig. 6b). These results suggest that CYP82C2 duplicated from CYP82C4 prior to the formation of the Arabidopsis genus and then acquired a new expression pattern and/or catalytic function prior to A. thaliana speciation~2 million years later 51,52 .
CYP82C2 and CYP82C4 were previously characterized to 5hydroxylate with equal efficiency the specialized metabolite 8methoxypsoralen, a molecule structurally reminiscent of ICN and sideretin 53 . The apparent similarities in substrate specificity and catalytic function suggest that CYP82C2 may have diverged from CYP82C4 in expression but not protein function. To test this, we first compared the expression of CYP82C2 and CYP82C4 in A. lyrata and A. thaliana in response to Psta. 4OH-ICN biosynthetic genes CYP79B2, CYP71A12, and FOX1 were upregulated in both species, consistent with the common presence of ICN (Fig. 4a, c). By contrast, CYP82C2 levels were respectively upregulated and unchanged in A. thaliana and A. lyrata, correlating with the distribution of 4OH-ICN in these species (Fig. 4a, c). CYP82C4 expression was unchanged in both species (Fig. 4c). These results indicate that 4OH-ICN biosynthesis is linked with pathogeninduced expression of CYP82C2.
We then compared the aligned upstream sequences of CYP82C2 and CYP82C4 in A. lyrata and A. thaliana, and observed good sequence conservation among orthologs but poor conservation among paralogs ( Supplementary Fig. 6d), indicating that sequences upstream of CYP82C4 and CYP82C2 were independently derived. We performed expression analysis in A. thaliana to confirm that CYP82C2 and CYP82C4 have different expression patterns. Consistent with previous reports 23,50 , CYP82C2 expression is upregulated in response to Psta and unchanged under iron deficiency, whereas CYP82C4 is upregulated under iron deficiency and unchanged in response to Psta (Fig. 4c, d and Supplementary Table 1). Finally, CYP82C4 expression is unchanged in Psta-challenged, dex-induced wrky33 and wrky33/DEX:WRKY33-flag ( Supplementary Fig. 6e). Our findings suggest that CYP82C2 diverged from CYP82C4 by acquiring WRKY33 regulation for its pathogen-induced expression.
We next assessed dN/dS ratios along branches of the CYP82C phylogenetic tree (Supplementary Fig. 6b) and found good support for purifying selection acting on CYP82C enzymes (ω = 0.21) and no support for positive selection acting on CYP82C2/3 enzymes (Supplementary Data 1). Lastly, we identified nonconserved amino acid residues among CYP82C homologs and mapped this information onto a homology model of CYP82C2. The protein inner core, which encompasses the active site and substrate channel, is highly conserved among CYP82C homologs ( Supplementary Fig. 6f), and is consistent with CYP82C2 and CYP82C4's reportedly redundant catalytic functions 53 . Altogether, our findings suggest that CYP82C2 underwent regulatory neofunctionalization, diverging from CYP82C4 in expression but not protein function.  Fe-  Supplementary Fig. 5c). Preferential WRKY33binding at this region should also be influenced by chromatin features associated with cis-regulatory elements such as enhancers and basal promoters 54 . To investigate how CYP82C2 acquired WRKY33-binding for its pathogen-induced expression, we compared the aligned upstream sequences of CYP82C homologs within a clade inclusive of ICN-synthesizing species. We observed three large upstream sequences specific to A. thaliana CYP82C2, hereafter named Eighty-two-C2 Promoter Contained Only in A. Thaliana1-3 (EPCOT1-3; Fig. 5a). EPCOT3 in particular is a 240 nt region that completely encompasses W4 (Fig. 5a), indicating that WRKY33's regulation of CYP82C2 in response to Psta may be species-specific. Further bioinformatics analysis revealed that EPCOT3 is enriched with the activating histone mark H3K4me2 and lacks the repressive histone mark H3K27me3 (Fig. 5b) [57][58][59] . Our findings suggest that EPCOT3 functions as an enhancer that mediates WRKY33-binding and activation of CYP82C2 in response to pathogen effectors.
EPCOT3 contains a 3′-poly-A tail and is flanked by variablelength target site duplications ( Fig. 5c and Supplementary Fig. 7a), which are hallmarks of eukaryotic LINE retrotransposons 60 . LINE retrotransposition (reverse transcription and integration) results in frequent 5′-truncation of retrocopies 61 . We identified 11 variably truncated retrocopies similar to EPCOT3 throughout the genome (Fig. 5c, Supplementary Fig. 7a-b, and Supplementary Table 4), including Ta22, one of the first LINEs characterized in A. thaliana 62 . EPCOT3-related LINEs were sorted into two groups roughly correspondent to their phylogenetic placement: EPCOT3-LIKE (EPL) for those with high identity (> 65%) to EPCOT3, and Ta22 or Ta22-LIKE (Ta22L) for the remainder (Supplementary Fig. 7a and Supplementary Table 4). Only Ta22 and Ta22L1 are full-length LINEs (Fig. 5c), presumably encoding the proteins necessary for their own transposition and for the transposition of non-autonomous family members such as EPCOT3. Through synteny analysis, we also identified two species-specific Ta22Ls, but no EPLs, in A. lyrata (Supplementary Table 4). To confirm the involvement of EPCOT3 in speciesspecific expression of CYP82C2, we introduced WRKY33 into Nicotiana benthamiana (tobacco) leaves and A. thaliana cyp82C2 protoplasts transfected with either the A. thaliana or A. lyrata CYP82C2 locus (coding and 3000 nt upstream sequences, Fig. 5d). We observed transactivation by WRKY33 of the A. thaliana gene, but not that of A. lyrata (Fig. 5d and Supplementary Fig. 7d). Altogether, these data indicate that EPCOT3 and EPLs arose from retrotransposition following the speciation of A. thaliana, and that the EPCOT3-containing A. thaliana CYP82C2 promoter is sufficient to confer WRKY33-mediated transcription of CYP82C2.
Of the EPL retrocopies, EPL1 is most similar to EPCOT3 (85.4% identity), sharing the W-box and WRKY33-specific motif, whereas EPL2 is less similar (67%) and lacks the WRKY33specific motif (Fig. 5c, Supplementary Fig. 7a, and Supplementary Table 4). EPL1 and EPL2 are much less truncated than EPCOT3 (Fig. 5c) and lack epigenetic signatures typical of cis-regulatory sequences 55,56 (Supplementary Fig. 7c). To investigate whether the sequences and chromatin features associated with EPLs are sufficient for WRKY33 binding, we tested for WRKY33 binding to EPL sequences homologous to the W4 region of EPCOT3 in dex-treated, Psta-infected wrky33/DEX:WRKY33-flag plants by ChIP-(q)PCR. Compared with EPCOT3 (Fig. 3c), WRKY33 bound weakly or not at all to EPL1 and EPL2, respectively (Fig. 5e, and Supplementary Fig. 7e). Our findings suggest the following history: (1) EPL1 likely retroduplicated from EPL2 or its progenitor, which already contained a W-box; (2) EPL1 then acquired a WRKY33-specific motif by mutation; and (3) EPCOT3 retroduplicated from EPL1 and then acquired epigenetic signatures of an enhancer, thereby allowing selection to act on standing variation rather than de novo mutation for CYP82C2 recruitment into the 4OH-ICN biosynthetic pathway.

Discussion
TEs were originally conceived to act as controlling elements of several loci in the genome 63 , and exaptation of TEs into cisregulatory modules has been hypothesized to be responsible for the rapid transcriptional rewiring in more ancient lineages of vertebrates [12][13][14] . However, few (if any) evolutionarily recent TE exaptation events in vertebrates and higher plants have been demonstrated to have biochemical, regulatory, physiological, and fitness-promoting functions 14 . With well over a dozen genomes available including the genetic model A. thaliana, the mustard family presents an excellent system for examining such events. In this study, we show that EPCOT3 is a TE-derived enhancer that mediates WRKY33 binding, pathogen-responsive transcription of CYP82C2, synthesis of the species-specific metabolite 4OH-ICN, and pathogen defense (Fig. 6). These results demonstrate how a recent TE exaptation can wire a new gene into an ancient regulon, ultimately leading to a positive effect on fitness.
Although the EPL1/EPCOT3 progenitor retrotransposed a preferred WRKY33-TFBS in the form of EPCOT3 upstream of CYP82C2, a further series of epigenetic modifications were needed to facilitate optimal access of EPCOT3 by WRKY33 (Fig. 6). EPL1 exists in a silenced heterochromatin state 55,56 (Supplementary Fig. 7c), typical for TEs 64 , and is bound weakly by WRKY33 (Fig. 5e), whereas EPCOT3 is in an open chromatin state 55,56 (Fig. 5b) and bound relatively strongly by WRKY33 (Fig. 3c). The more severe 5′-truncation of EPCOT3 could account for its release from TE-silencing mechanisms and the initially weak WRKY33 binding could provide a seed for chromatin remodelers to drive the exaptation of newly retrotransposed EPCOT3 into a bona fide enhancer. Further epigenomic sampling within Arabidopsis is needed to better clarify the epigenetic transformations underlying the EPCOT3 exaptation event.
Compared with closely related Landsberg accessions (Supplementary Fig. 3), Di-G synthesizes less camalexin and 4OH-ICN 47 (Fig. 2b), and is more susceptible to a range of bacterial and  (Fig. 2c). WRKY33 has been implicated in camalexin biosynthesis 31 and antifungal defense 44 . We identified WRKY33 as causal for some if not all of these phenotypes in Di-G. In addition, WRKY33's involvement in antibacterial defense is consistent with the contribution of camalexin and 4OH-ICN toward antibacterial defense 23 .
WRKY33 is an ancient TF responsible for many fitnesspromoting traits in plants; thus, it is unexpected that an A. thaliana accession would have a naturally occurring wrky33 mutation (C536T transversion). Di-G is the sole member of 1,135 sequenced accessions to have a high-effect single-nucleotide polymorphism (SNP) in WRKY33 66 , and may have originated from a Ler-0 ethyl methanesulfonate (EMS) mutagenesis screen, whose mutagenesis rate 67 is well within the range of~25,000 SNPs that are not concordant between Di-G and Ler-0 66 (Supplementary Fig. 2f). However, features of EMS mutations (i.e., transversion mutations) or X-ray mutations (i.e., indels) are not enriched in the Di-G pseudogenome relative to related pseudogenomes (Supplementary Table 5). These findings suggest that the wrky33 Di-G mutation is naturally derived.
Bacterial infection and MAMP elicitation. A single colony of P. syringae pv. maculicola (Pma) M2 (containing avrRpm1, but not avrRps4 or avrRpt2), Pma ES4326 (containing no aforementioned effectors), Pma ES4326 avrRpt2, P. syringae pv. tomato DC3000 (Pst, containing no aforementioned effectors), Pst avrRpm1, Pst avrRps4, and Pst avrRpt2 from a freshly streaked 3-day-old agar plate was used to inoculate 2 mL of LB medium containing appropriate antibiotics. Strains were grown to log phase, washed in sterile water twice, resuspended in sterile water to OD 600 of 0.2, and incubated at room temperature with no agitation for 3 to 6 h, prior to infection. Chitosan (90% deacetylated chitin; Spectrum Chemical Mfg Corp, New Brunswich, NJ) was prepared in 0.1 N acetic acid and neutralized with 0.1 N NaOH to pH 5.8, to a stock concentration of 1.2 mg/mL.
Four-to-five-week-old adult leaves were treated with 0.0125% Silwet (Phytotechnology Laboratories) or 0.0125% Silwet and 20 μM dex for 20 s, and incubated on 0.8% (w/v) tissue-culture agar plates on a light cart at 21°C for 6-8 h. Leaves were then surface-inoculated with Pst (OD 600 = 0.002 or 10 6 colonyforming units (cfu)/cm 2 leaf area) in the presence of 0.01% (v/v) Silwet L-77 for 15 s and incubated on 0.8% (w/v) tissue-culture agar plates at 21°C under a 16 h light cycle (70-80 μE m −2 s −1 light intensity) for 3 days. Leaves were then surfacesterilized in 70% ethanol for 10 s, rinsed in sterile water, surface-dried on paper towels, and the bacteria were extracted into water, using an 8 mm stainless steel bead and a ball mill (20 Hz for 3 min). Serial dilutions of the extracted bacteria were plated on LB agar plates for cfu counting.
RNA isolation and RT-PCR. Total RNA was extracted from snap-frozen seedlings using TRIzol reagent (Invitrogen) and 2.5 µg was reverse-transcribed with 3.75 µM random hexamers (Qiagen), 20 units of M-MuLV (New England Biolabs), and 20 units of ProtoScript II (New England Biolabs) for 1 h at 42°C and then for 15 min at 70°C. Resulting cDNA:RNA hybrids were treated with 10 units of DNase I (Roche) for 30 min at 37°C and purified on PCR clean-up columns (Macherey-Nagel). qPCR was performed with Kapa SYBR Fast qPCR master mix (Kapa Biosystems) and CFX96 real-time PCR machine (Bio-Rad). Biological and technical replicates were performed on the same 384-well PCR plate, and EIF4A1 (AT3G13920) and AlEIF4A1 (AL3G26100) housekeeping genes were used to normalize mRNA levels between different samples. Primer sequences and efficiencies are listed in Supplementary Data 2. Total RNA from tobacco leaf discs and A. thaliana protoplasts was extracted from snap-frozen samples using 300 µL of Cell Lysis Solution (2% (w/v) SDS, 63 mM sodium citrate, 132 mM citric acid, 1 mM EDTA), 100 µL of Protein-DNA Precipitation Solution (4 M NaCl, 16 mM sodium citrate, 32 mM citric acid), and 300 µL of isopropanol. 2 μg RNA was then reversetranscribed and complementary DNA was diluted 7.5-fold. Four microliters of cDNA was used in 20 μL PCR reactions and resulting PCR products were separated on 2% agarose gels. PCR was performed on C1000 thermal cycler (Bio-Rad) with the following thermal cycling program: 95°C for 3 min; 40 cycles of 95°C for 10 s, 53°C for 15 s, and 72°C for 7 s (WRKY33, CYP82C2, AlCYP82C2), 15 s (NbAC-TIN1), or 21 s (CYP82C2*). Primer sequences are listed in Supplementary Data 2.
Samples were separated on an Ultimate 3000 HPLC (Dionex, Sunnyvale, CA) system, using a 3.5 μm, 3 × 150 mm Zorbax SB-Aq column (Agilent, Santa Clara, CA) with the gradient shown in Supplementary Table 7 Glucosinolate extraction and LC-DAD-FLD-MS. For glucosinolate extraction, a 96-well 0.45 μm polyvinylidene fluoride (PVDF) filter plate (EMD Millipore, Billerica, MA) was charged with 45 mg DEAE Sephadex A25 (GE Heathcare) and 300 μL of water per well, and equilibrated at room temp for 2 h. Prior to sample homogenization, the plate was centrifuged at 400 × g for 1 min to remove the water. The homogenate was extracted with 500 μL 70% (v/v) aqueous methanol at 67.5°C for 10 min and centrifuged at 16,000 × g for 2 min. Added to the supernatant was 3 μL of IS (1.25 mM sinigrin (Sigma-Aldrich) in 80% (v/v) ethanol) per mg sample dry weight. Extract was applied to and incubated on the ion exchanger for 10 min. The sephadex resin was washed three times with 70% (v/v) methanol, three times with distilled deionized water (ddH 2 O), and two times with 20 mM sodium acetate (pH 5). Twenty microliters of 25 mg/mL aryl sulfatase (Type H1 from Helix pomatia, Sigma-Aldrich) was applied to and incubated on the sephadex resin at RT overnight. The plate was centrifuged at 400 × g for 1 min and desulfoglucosinolates were eluted from the sephadex resin by two 100 μL washes with 60% (v/v) methanol and two 100 μL washes with ddH 2 O. Eluate volume was reduced to 250-350 μL using an evaporator.
Samples were separated on an Ultimate 3000 HPLC system, using a 3.5 μm, 3 × 150 mm Zorbax SB-Aq column with the gradient shown in Supplementary Table 7. A coupled DAD-3000RS diode array detector, FLD-311 fluorescence detector, and MSQPlus mass spectrometer collected UV absorption spectra at 229 nm, fluorescence spectra at 275/350 nm (ex/em), and ESI mass spectra in positive/ negative ion modes at 100-1000 m/z, respectively. Glucosinolates were quantified using integrated areas of desulfoglucosinolates in the UV chromatographs at 229 nm and published response factors 73 . mM NaCl, Complete EDTA-free protease inhibitor cocktail), sonicated in a Covaris S2 sonicator (Covaris, Woburn, MA) using 10% duty, 7% intensity, 200 cycles per burst for a total time of 11 min, and centrifuged at 16,000 × g for 10 min at 4°C to precipitate SDS. ChIP was performed using Anti-FLAG M2 Affinity Gel (Sigma-Aldrich). Beads were pre-treated with 0.1% (w/v) non-fat milk in 1× phosphatebuffered saline (PBS) and 0.5 mg/mL sheared salmon sperm DNA (Invitrogen). Following de-crosslinking, DNA samples were phenol-chloroform-extracted, diluted to the same OD 260 concentration, and 1.5 μL was used in a 15 μL PCR reaction.
PCR analysis was performed on nuclear extracts prior to antibody incubation (input) and after ChIP. PCR conditions were as follows: 95°C for 3 min; 40 cycles of 95°C for 15 s, 53°C for 15 s, and 72°C for 1 min; 72°C for 5 min. Densitometric determination of signal intensity in each ChIP and input sample was calculated using ImageJ. Fold enrichment was determined by calculating the ratio of PCR product intensities in ChIP D/M to Input D/M. In cases where amplicons were absent, an arbitrary value of 10 was assigned. For EPL2, qPCR analysis was additionally performed to confirm absence of amplicons in ChIP samples. RLU counts at the 25th cycle were used for quantification. Primer sequences are listed in Supplementary Data 2.
Comparative genomics. All phylogenetic species trees were adapted from published data 74,75 . To generate phylogenetic maximum likelihood (ML) trees, sequences were aligned using MUSCLE in MEGA7 76 and the JTT model (for CYP82C and LINE alignments) or Tamura-Nei model (for the EPCOT3 alignment). Sequences for all genes with the description "non-LTR retrotransposon family (LINE)" (n = 263) were batch-downloaded from TAIR (https://arabidopsis. org). Of these, sequences containing intact reverse-transcriptase domains (PGPDG, LIPK, FRPISL, or FADD sequences; n = 126) were used for subsequent phylogenetic analysis ( Supplementary Notes 1 and 2). Gaps were removed from the CYP82C alignment, leaving a total of 480 codons. Information on genomes used for synteny analysis is shown in Supplementary Table 8.
Selection estimates based on nonsynonymous-to-synonymous substitution ratios were calculated from the CYP82C ML tree. A Newick tree file was generated from this ML tree ( Supplementary Fig. 4b and Supplementary Data 1) and for Branch site models, branches were pre-defined. CodeML analysis in PAML 77 was then conducted with the following modified parameters: ncatG = 8, CodonFreq = 3. The M0 test was performed with model = 0 and NSsites = 0. The M1a-null test was performed with model = 0 and NSsites = 1. A more stringent null test (fixed omega) was performed for each Branch site model to be tested (model = 2 and NSsites = 2), where omega was fixed to 1. Branch site models were then tested with unfixed omega. Likelihood ratio tests were performed by comparing critical values and degrees of freedom between each unfixed Branch site test and either the M1a test or the corresponding fixed-omega test. Pre-defined branches with P-values < 0.05 for both tests were regarded as under positive selection (Supplementary Data 1).
Reporting Summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.