De novo transcriptome characterization of Iris atropurpurea (the Royal Iris) and phylogenetic analysis of MADS-box and R2R3-MYB gene families

The Royal Irises (section Oncocyclus) are a Middle-Eastern group of irises, characterized by extremely large flowers with a huge range of flower colors and a unique pollination system. The Royal Irises are considered to be in the course of speciation and serve as a model for evolutionary processes of speciation and pollination ecology. However, no transcriptomic and genomic data are available for these plants. Transcriptome sequencing is a valuable resource for determining the genetic basis of ecological-meaningful traits, especially in non-model organisms. Here we describe the de novo transcriptome assembly of Iris atropurpurea, an endangered species endemic to Israel’s coastal plain. We sequenced and analyzed the transcriptomes of roots, leaves, and three stages of developing flower buds. To identify genes involved in developmental processes we generated phylogenetic gene trees for two major gene families, the MADS-box and MYB transcription factors, which play an important role in plant development. In addition, we identified 1503 short sequence repeats that can be developed for molecular markers for population genetics in irises. This first reported transcriptome for the Royal Irises, and the data generated, provide a valuable resource for this non-model plant that will facilitate gene discovery, functional genomic studies, and development of molecular markers in irises, to complete the intensive eco-evolutionary studies of this group.

To find the putative genes and function, transcripts were aligned against the UniProt non-redundant protein database (2016-09-26) and against PFAM protein family database 50 , using BLASTX alignment with an e-value cutoff to < 0.0001 51 . To classify functions of the transcripts, they were also aligned against the Gene Ontology (GO, http:// geneo ntolo gy. org/) and the Clusters of Orthologous Groups (COGs, https:// www. ncbi. nlm. nih. gov/ resea rch/ cog-proje ct) protein databases. Annotations were computed using eggnog-mapper 52 , based on eggNOG 4.5 orthology data 53 . For transcription factors prediction, we submitted the sequences to search against PlantTFDB 54 .

Phylogeny analysis.
We retrieved all Iris transcripts that were annotated as either MADS or MYB proteins from the transcriptome and translated the longest open reading frame (ORF) using Virtual Ribosome 55 . We took the transcriptome transcripts that also contain the MADS or R2R3-MYB domain by PFAM. We downloaded I. fulva protein sequences for MIKCc MADS-box and R2R3-MYB transcription factors from NCBI 19 . Arabidopsis and rice (oryza sativa) MADS and R2R3-MYB sequences were taken from their genome databases [The Arabidopsis Information Resource (TAIR): www. arabi dopsis. org and the Rice Genome Annotation Project (RGAP): rice.plantbiology.msu.edu, respectively]. The gene identifiers were denoted to AtMYB genes in Arabidopsis and the locus id in rice to avoid confusion when multiple names are used for same gene. The sequences of each gene family were trimmed using trimAl (v1.3, http:// trimal. cgeno mics. org/) 56 and aligned using ClustalW alignment 57 , in MEGA X Molecular Evolutionary Genetics Analysis Software (https:// www. megas oftwa re. net/) 58 . We tested for the best substitution model and found that the best model for MADS is the JTT (Jones, Taylor, Thornton) model 59 + Gamma-distributed rates (G), and for MYB, JTT + G + amino acid frequency (F). For comparative phylogenetic analysis, we used maximum likelihood in MEGA X 58  SSRs mining. In order to utilize the transcriptome sequenced also for population genetic markers, we searched for short sequence repeats (SSRs; microsatellites) in the assembled contigs. We used a Perl script (find_ ssrs.pl 61 ) to identify microsatellites in the unigenes. In this study, SSRs were considered to contain motifs with two to six nucleotides in size and a minimum of four contiguous repeat units.

Results and discussion
Sequencing of Iris transcriptome. To generate the Iris transcriptome, eight cDNA libraries were sequenced: root, leaf and three bud stages from one genotype of I. atropurpurea (DR14), and buds in stages 1 and 2 from a different genotype of the same population (DR8). We generated a total of 195,412,179 sequence reads. www.nature.com/scientificreports/ The average GC content of Iris contigs was 47% (Tables 1 and 3). Reads were of very high quality throughout their length, without evidence of adapter content (Phred score > 30). Using Trinity, we assembled 258,466 contigs (isoforms) longer than 200 bp, which clustered into 184,341 transcripts, with a total length of 168,049,166 bp. A larger N50 length and average length are considered indicative of better assembly. The longest contig was 27,971 bp and half of the contigs (N50) with more than 500 bp were above 1,312 bp long ( Table 1).
The length distribution of the assembled contigs revealed that 126,194 (68.46%) contigs ranged from 201 to 500 bp in length; 37,335 (20.25%) contigs ranged from 501 to 1000 bp in length; 16,282 (8.83%) contigs ranged from 1001 to 2000 bp in length; and 4530 (2.46%) contigs were more than 2000 bp in length (Fig. 2). Subjecting our transcriptome to BUSCO analysis 48 confirmed that our transcriptome assembly contains 82.4% of gene representation of the available orthologue groups at Liliopsida, and more than 90% at Embryophyta. Only 3 to 6.1% of the single-copy orthologs were classified as missing from our assembly, indicating high quality of the assembly ( Table 2).
To quantify the abundance of contigs assembled, the reads of the separated Iris organs were mapped to the assembled contigs, with 125,074,925 (65%) mapped reads overall, and an average of 45% reads per tissue that mapped to a unique sequence in the assembled transcriptome (Table 3). Low mapping rates could be due to reads   www.nature.com/scientificreports/ belonging to sequences below the 200 bp cut-off and also, presumably, due to the complexity of the Oncocyclus irises genome that is very large and highly repetitive 62,63 .
Annotation of Iris transcriptome. Using BLASTX search against the UniProt database, we identified 28,708 transcripts with at least one significant hit. Transcripts mostly annotated to Arabidopsis thaliana (56.9%), Oryza sativa Japonica Group (10.5%) and Nicotiana tabacum (3.5%) (Fig. 3). Surprisingly, a significant proportion of the annotated transcripts were annotated as Arabidopsis thaliana, while only 10% were annotated as Oryza sativa, which is a monocot and therefore more closely related to irises. This is probably due to the higher representation of genomic resources for Arabidopsis thaliana. A considerable number of transcripts annotated to "non-plant" organisms, most of them to human (Homo sapiens, 2.4%) (Fig. 3). This may be attributed to housekeeping genes, which are preserved across all species in eukaryotes, and may also be due to the highly annotated human genome.
In the gene ontology analysis, 12,623 transcripts were assigned to GO terms under the three categories (supplementary table 1). Within the biological process category, 'cellular process' and 'metabolic process' were the two GO terms with the highest numbers of transcripts. In the cellular component category, 'obsolete cell' and 'obsolete cell part' were the most abundant. For the molecular function category, 'catalytic activity' and 'binding' had the highest number of transcripts (Fig. 4, supplementary info file).
Transcription factors (TFs) are key regulators in biological processes. For prediction of transcription factors, we assigned the protein sequences of all the transcripts to PlantTFDB 54 . We found 1021 transcripts that are predicted to be involved in transcription regulation and were classified into 54 transcription factor families (Fig. 6b, supplementary table 1, supplementary info file). The basic helix-loop-helix (bHLH) transcription factors family was the most abundant in I. atropurpurea consisting of 99 gene family members. In plants, the bHLH proteins are associated with a variety of developmental processes, such as trichomes development 70,71 , phytochrome signaling 72 , and cell proliferation and differentiation 70,73 . bHLH proteins have also been shown to interact with other transcription factors such as MYB 71,74 . Furthermore, a protein complex of bHLH and MYB transcription factors, associated with a WD40 repeat protein, regulates various cell differentiation pathways and the anthocyanin biosynthesis pathway 75,76 . The rest of the top 10 TFs are: NAC, MYB-related, C2H2, MYB, bZIP, WRKY, GRAS, C3H and ERF.
In total, we identified 33,033 transcripts in at least one database (supplementary table 1). We were unable to annotate or give a functional prediction to a large fraction of the transcripts. These transcripts could be Iris specific genes, genes that have diverged considerably, or genes that are not yet identified in plants. www.nature.com/scientificreports/

Phylogenetic analysis of MADS-box and R2R3-MYB gene families. In the search for orthologous
genes involved in flower development in irises, we phylogenetically analyzed two major transcription factor groups, the MADS-box and MYB protein families, to validate the subfamily identities of these genes from I. atropurpurea. We performed the phylogenetic analyses using MADS-box and R2R3-MYB protein sequences from Arabidopsis thaliana and rice (Oryza sativa), the top two annotated species in the transcriptome, and from Iris fulva.
MADS-box genes. MADS-box proteins, and their complex function, regulate floral organ characteristics and are essential for flower development 28,77,78 . In the Iris atropurpurea transcriptome, 43 transcripts were annotated as belonging to the MADS-box family and/or contain the MADS domain. Phylogenetic analysis using Arabidopsis, rice, and I. fulva, shows orthologous of I. atropurpurea in almost all clades of MADS-box proteins (Fig. 7, supplementary table 2). The general organization for most clades was similar to previous comparative phylogenies 19,37 .
Of the I. atropurpurea MADS-box genes identified, 19 clustered with MIKCc, 2 with Mα, 0 with Mβ, 1 with Mγ, and 2 grouped with MIKC*/ Mδ-type genes. Among the genes clustered with type II MIKCc MADS, we identified all 14 documented clades 19,37,38 ), comprising representative genes of Arabidopsis, rice, and I. fulva. I. atropurpurea had representative transcripts in 10 of the MIKC C clades, except for FLC-like, AGL15-like, DEFlike and StMADS11-like. Similar to previous reports, FLC-like and AGL15-like clades consist only Arabidopsis genes, suggesting eudicot specific lineages 19,37,38,79 . Three groups consist I. atropurpurea sequences but lack I. fulva representatives, AGL12-like, AGL17-like, and GMM13-like. AGL17-like and GMM13-like are not supported by the bootstrap analysis. AGL12-like has three I. atropurpurea transcripts, and this clade was well supported. AGL12-like and AGL17-like genes are involved in root development 80,81 , and while the I. atropurpurea sequences were derived also from root tissue, the I. fulva transcriptome was based on floral and leaves tissues 19 . Four I. atropurpurea sequences were clustered alone in a well supported group (81%, designated "Unknown"). These sequences might be of genes unique to I. atropurpurea.
Within most of the clades I. atropurpurea, I. fulva and rice grouped together and Arabidopsis sequences grouped together, suggesting a strong species and monocot/eudicot homology. In Arora et.al. Arabidopsis and rice also cluster together within the type I MADS clades 37 . Furthermore, a phylogeny of representative type I and II MADS-box genes from several distantly related plant species also showed similar monocot/eudicot separation within clades 33 .
R2R3-MYB genes. We found 256 transcripts in the I. atropurpurea transcriptome that were annotated as belonging to the MYB family by either trinity or PFAM. Sixty-seven of them were found to have the R2R3-MYB To create the phylogenetic tree, we aligned the transcripts against R2R3-MYB sequences from Arabidopsis, rice, and I. Fulva (Fig. 8, supplementary table 3). R3-MYB (R1R2R3) is another major MYB type, that was either the origin of R2R3-MYBs in plants 82 or evolved from R2R3-MYB 83 , and was also included in the phylogenetic analysis. To analyze the tree, we mainly followed the classification made by Ballerini et al., which consist Iris sequences 19 . The organization of the clades in the dendrogram corresponds with that in Ballerini et al., with 26 of the groups supported by bootstrap (> 50%). Several groups showed differences from the phylogeny in Ballerini et al., mostly in the form of a sequence clustered to a different clade, and in most cases not supported by bootstrap. Some major differences were observed for example in group 10, which was separated into 2 clades in our analysis, one with the Arabidopsis sequences and one with rice. Similar separation was found in Du et.al., in which the rice sequences are in a separated clade with Zea Maize, designated as S42 84 . In our tree, group 16 was also separated into 2 clades, in accordance with other published MYB trees 42,84 .
Fourteen groups of R2R3-MYB genes in the phylogenetic tree lack I. atropurpurea representatives, whereas in nine of them I. fulva representatives were also lacking, suggesting gene lineages that might not exist in Irises (11,12,15,24,30,31,33,34,and Os1). Consistent with previous phylogenetic studies, groups 12 and 15 also lack rice representatives, suggesting eudicot specific lineages 19,42 . A comparative analysis of R2R3-MYBs from 50 major eukaryotic lineages showed that group 12 consists only of Arabidopsis sequences and that group 15 consists only of eudicot species 84 . Genes in these groups have been shown to control trichome initiation in shoots, root hair patterning, and Cruciferae-specific glucosinolate biosynthesis 41,43,85 . Two of the groups lacking representatives from Iris, 33 and Os1, consist only rice genes. Genes from group 33 were previously designated in a monocot-specific clade together with corn (Zea maize) sequences 84 . Several groups had only I. atropurpurea representatives, lacking I. fulva, and vice versa. In addition, in contrast with our expectations, only in a few of groups I. atropurpurea and I. fulva clustered together within the clade. These observations further support the phylogenetic distance between the two species.
We found two new (bootstrap supported) subgroups consisting only rice and I. atropurpurea sequences. Previous phylogenetic studies in other plant species also identified new R2R3-MYB subgroups with no A. thaliana representatives. These subgroups might represent genes with specialized functions which were either lost in Arabidopsis or obtained after the divergence from the last common ancestor 19,86 . Several I. atropurpurea www.nature.com/scientificreports/ sequences did not cluster together with R2R3-MYBs from any other species, including I. fulva. This suggests that these MYB genes might have been acquired in I. atropurpurea after divergence within the Iris group.
Other MYB and MADS gene groups, which were not identified in our transcriptome, could be genes that were not conserved in irises. Alternatively, these genes might be expressed in earlier stages of flowering initiation, before the appearance of buds 45 , and thus undetected in the transcriptome. In Iris lortetii, it was shown that flower organs genes are mostly expressed in an early stage, about two months before stem elongation, when the flower meristem is hidden in the rhizome 45 . Possibly this is the stage when more flower development genes can be found; however, this stage was not sampled in this study and will be explored in further research.
Development and characterization of cDNA-derived SSR markers. For the development of new molecular markers, we used all of the 258,466 contigs, generated in this study, to mine potential microsatellites. We defined microsatellites as di-to hexanucleotide SSR with a minimum of four repetitions for all motifs. We identified 1,503 potential SSRs in 1,241 contigs, of which 263 sequences contained more than one SSR. Only 164 of the contigs containing SSRs had annotation and were annotated to 115 genes. We assessed the frequency, type, and distribution of the potential SSRs (Fig. 9). The SSRs included 924 (61.5%) di-nucleotide motifs, 396 (26.4%) tri-nucleotide motifs, 173 (11.5%) tetra-nucleotide motifs, 10 (0.7%) penta-nucleotide motifs, and zero (0%) hexa-nucleotide motifs. The di-, tri-, tetra-and penta-nucleotide repeats had 8, 30, 37 and 9 types of motifs,  Di-nucleotide SSRs are usually more common in genomic sequences, whereas tri-nucleotide SSRs are more common in RNA sequences [87][88][89][90] . Also, tri-nucleotide repeats are more abundant than dinucleotide repeats in plants 87 . However, in our SSRs, the di-nucleotide repeat type was the most abundant motif detected of all repeat lengths. A higher number of di-nucleotide repeats in RNA sequences has been reported in Louisiana irises 91 , and in other plants such as rubber trees 92 and Cajanus cajan (pigeonpea) 93 . The most abundant di-and tri-nucleotide motifs in I. atropurpurea were GA/TC and TTC/GAA, respectively. These results were also coincident with SSRs developed for Louisiana irises, in which the most abundant di-and tri-nucleotide motifs are AG/CT and AAG/ CTT 91 .
Until now, SSRs in irises were reported only for Louisiana and Japanese irises 91,94 ; however, these SSRs were not transferable to Oncocyclus irises (Y. Sapir, un-published). The relatively large set of SSRs obtained from the I. atropurpurea transcriptome may enable development of markers for population genetic studies in the Royal Irises. www.nature.com/scientificreports/

Conclusions
In this study, we reported a comprehensive characterization of the transcriptome of Iris atropurpurea, an important emerging model for understanding evolutionary processes 3,10-17 . Although transcriptome based on a single replication cannot enable gene expression analysis and extensive biological conclusions, the Iris transcriptome established in this study provides a useful database that will increase the molecular resources for the Royal Irises. These resources are currently available only for other iris species 91,94 , which despite belonging to the same genus, they are quite distant from the Royal Irises, hence not easily transferable. In the past decade, many studies have been using transcriptome de novo sequencing and assembly to generate a fundamental source of data for biological research 19,20,[95][96][97] . We generated a substantial number of transcript sequences that can be used for the discovery of novel genes, and specifically genes involved in flower development in irises. While we did not perform a complete analysis of MADS and R2R3 MYB evolution, we mainly aimed to identify flower development genes and classify their function, and thus provide a framework for the Iris genes sequenced in this study. The numerous SSR markers identified will enable the construction of genetic maps and answering important questions in population genetics and conservation. Although genetic studies are still in their early stages in the Royal Irises, we believe that our transcriptome will significantly support and encourage future evolutionary-genetic research in this ecologically important group.

Data availability
Sequences generated in this study were deposited in NCBI's Gene Expression Omnibus (GEO, http:// www. ncbi. nlm. nih. gov/ geo/) under the GEO accession number GSE121786.