Introduction

The Passifloraceae family belongs to the Malpighiales order and is a member of the Rosids clade, according to classical and molecular phylogenetic analysis. The family consists of 700 species, classified in 16 genera. The majority of species belong to the genus Passiflora (~530 species), popularly known as passion fruits1. This genus is widely distributed in tropical and subtropical regions of the Neotropics. Approximately 150 species are native to Brazil, which is acknowledged to be an important centre of diversity2.

Among the American tropical species of Passiflora, 60 fruit-bearing species are marketed for human consumption. Moreover, several species and hybrids have been produced for ornamental purposes (see www.passiflora.it)3, and pharmacologists have found that passion fruit vines contain bioactive compounds that are used in traditional folk medicines as anxiolytics and antispasmodics4. Passiflora edulis is the main passionflower species grown for fresh fruit consumption and juice production in climates ranging from cool subtropical (purple variety) to warm tropical (yellow variety). Species grown particularly in Brazil include P. edulis (sour passion fruit) and P. alata (sweet passion fruit). Because of the quality of its fruit and yield for processing into commercial juices, P. edulis is grown in 90% of the commercial orchards. The most recent agricultural production survey showed that 58,089 hectares were planted with passion fruits, yielding 838,444 tons per year5.

P. edulis is a diploid (2n = 18)6, self-incompatible species7,8, with perfect, insect-pollinated flowers. Over the last two decades, our research group has carried out studies for estimating the genetic parameters of experimental populations9, as well as constructing genetic maps10,11 and mapping quantitative loci associated with the response to Xanthomonas axonopodis infection12. Munhoz and co-workers were able to determine which gene expression patterns were significantly modulated during the P. edulis-X. axonopodis interaction13.

Despite its commercial success, little is known about the genome structure of P. edulis. The genome size has been estimated at ~1,230 Mb (1 C DNA content = 1.258 pg by flow cytometric analysis)14. To fill in this gap in our knowledge, a large-insert genomic BAC (Bacterial Artificial Chromosomes) library was built and denoted Ped-B-Flav (https://cnrgv.toulouse.inra.fr/library/genomic_resource/Ped-B-Flav). It contains 83,000 clones, which are kept at the National Centre for Plant Genomic Resources (CNRGV: cnrgv.toulouse.inra.fr) at INRA in Toulouse, France. In addition, previous studies provided initial insights into the P. edulis genome using BAC-end sequence (BES) data as a major resource15, and described the structural organization of the plant’s chloroplast genome, which differs from that of various Malpighiales species due to rearrangement events16.

Although BAC-end sequences are short, they can be mapped to intervals of sequenced related genomes17 in order to identify collinear microsyntenic regions as a preliminary step towards selecting clones for full sequencing, which can be done with high accuracy using single-molecule real-time (SMRT) sequencing (Pacific Biosciences). This method produces long, unbiased sequences that, in turn, facilitate subsequent assembly18, a critical step in plants due to the high proportion of repetitive sequences throughout their genomes19.

Most of the projects aimed at obtaining a draft or a complete plant genome were performed using large-insert based sequencing methods20,21 to allow estimation of the number of genes, and abundance of transposable elements and microsatellites. In the functional part of the genome in particular, the annotation of large-inserts can provide an arsenal of biological information to facilitate comparison against databases and, in addition, to determine the distribution of BAC inserts relative to related genomes in order to examine the degree of synteny between them and gain insights into evolutionary relationships22,23.

In this context, we continued our study of the P. edulis genome, using the large-insert BAC library and the SMRT sequencing platform to completely sequence over 100 BAC inserts. These clones were pre-selected based on BES microsynteny results and on probes homologous to transcripts from a subtractive library of P. edulis in response to Xanthomonas axonopodis infection, which allowed us to obtain a gene-rich fraction of this genome. The repetitive content, predicted genes, and coding sequences were annotated. In addition, microsyntenic regions shared between P. edulis and Populus trichocarpa (Salicaceae, 485 Mb24) and Manihot esculenta (Euphorbiaceae, 742 Mb25), two related Malpighiales species with fully sequenced and well-annotated genomes, were identified.

Material and Methods

BAC Selection and DNA Preparation

BAC clones were selected from the findings of Santos et al.15, which provides an initial overview of the P. edulis genome using BAC-end sequence (BES) data as a major resource. The results of comparative mapping between P. edulis’ BES and the reference genomes of Arabidopsis thaliana, Populus trichocarpa and Vitis vinifera were also used to choose BAC clones for sequencing. In addition, based on BES functional annotation results, the BAC-inserts with coding sequences (CDS) in one or both BESs were also selected.

A second selection procedure was performed after screening the genomic library using the probes homologous to P. edulis transcripts described by Munhoz and co-workers13. Briefly, the authors used suppression subtractive hybridization to construct two cDNA libraries enriched for transcripts induced and repressed by Xanthomonas axonopodis, respectively, 24 h after inoculation with a highly virulent bacterial strain.

The homologous probes were prepared via PCR, using as a template the genomic DNA from ‘IAPAR-123’, the accession used to construct the Ped-B-Flav BAC library. Specific primers were used to generate a single amplicon (200 to 600 bp in size) for each probe gene sequence. The ‘DecaLabel DNA Labeling Kit’ (Fermentas) was used for radiolabeling the probes. The amplification products were then purified with ‘Illustra ProbeQuantTM G-50 Micro Columns’ (GE Healthcare). The library was previously gridded onto macroarrays in which 41,472 clones were double-spotted on each 22 × 22 cm nylon membrane. These membranes were submerged in a bath of SSC (Saline-Sodium Citrate) solution (6×, 17 min., 50 °C); incubated overnight (68 °C) in hybridization buffer [6× SSC, 5× Denhardt’s Solution, 0.5% (w/v) SDS (Sodium Dodecyl Sulfate)]; hybridized with denatured probes (10 min, 95 °C; 1 min., cooled on ice); and washed twice in buffer 1 [2× SSC, 0.1% (w/v) SDS] (15 min., 50 °C) and buffer 2 [0.5× SSC, 0.1% (w/v) SDS] (30 min., 50 °C). Next, the hybridized membranes were placed in a film cassette for 24 h.; radioactive signals were detected using a PhosphorImagerTM and Storm 820 scanner (Amersham Biosciences) and analyzed using HDFR3 software, to identify the positive clones. Each positive clone was individually validated by PCR.

In order to estimate insert sizes, the preserved cultures were scraped and a positive single colony of each BAC was grown in a 96-well plate (overnight, 37 °C) containing 1200 µL of LB medium with chloramphenicol (12.5 µg/mL) and glycerol (6%). DNAs were then isolated using a NucleoSpin® 96 Flash (Macherey-Nagel) BAC DNA purification kit, digested with 5 U of FastDigest™ NotI enzyme (Fermentas) and size-fractioned by pulsed-field gel electrophoresis (PFGE; 6 V cm⁻¹, 5 to 15 s switch time, 16 h run time, 12.5 °C) in a Chef Mapper XA Chiller System 220 V (BioRad), followed by ethidium bromide staining and visualization. The insert sizes were determined by comparison with PFGE standard size markers.

To prepare the DNA for sequencing, 1 μl of the above cultures was allowed to regrow in 20 mL of LB medium (plus 12.5 µg/mL chloramphenicol at 37 °C overnight) under shaking (250 rpm). The cultures were then mixed in pools, at a maximum of 20 clones per pool. DNA extraction was performed using the Nucleobond Xtra Midi Plus kit (Macherey-Nagel) according to the manufacturer’s instructions.

DNA Sequencing and Assembly From Long Sequence Reads

Approximately 5 µg of each pool was used for the construction of a SMRT library based on the standard Pacific Biosciences (San Francisco, CA, USA) preparation protocol for 10-kb libraries. Each pool was sequenced in one SMRT Cell using P6 polymerase in combination with C4 chemistry, following the manufacturer’s standard operating procedures and using the PacBio RS II long-read sequencer.

Reads were assembled using the hierarchical genome assembly process (HGAP workflow)26, as implemented in the SMRT® analysis software suite v2.2.0. Reads were first aligned with the PacBio long-read aligner BLASR27 against the complete genome of Escherichia coli, strain K12, substrain DH10B (GenBank: CP000948.1). The E. coli reads, as well as low-quality reads (those shorter than the minimum read length of 500 bp or below the minimum read quality of 0.80), were removed from the data set. Filtered reads were then preassembled to yield long, highly accurate sequences. In this step, the longest reads, which represent the upper portion of the read length distribution, were set aside as seed reads, and errors were corrected by mapping the shorter single-pass reads to them. Next, sequences were filtered against vector (BAC) sequences, and the Celera assembler was used to assemble the data and obtain draft assemblies. In the last step, the Quiver algorithm was used to significantly reduce the remaining indels and base substitution errors in the draft assembly. This quality-aware consensus algorithm uses rich quality scores (Quality Value/QV scores), where QV is a per-base, Phred-scaled estimate of base accuracy. A QV of 20 corresponds to an error probability of 1%, so QV scores over 20 indicate very good data. Finally, Quiver polishes the assembly to produce the final consensus26.
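
The relationship between QV and error probability, and the read filter described above, can be illustrated with a minimal Python sketch. It assumes the usual Phred convention (QV = −10 log10 P) and the thresholds stated in the text (500 bp, read quality 0.80); the function names are hypothetical and are not part of the SMRT analysis suite.

```python
import math

def qv_to_error_probability(qv: float) -> float:
    """Convert a Phred-scaled quality value into a per-base error probability."""
    return 10 ** (-qv / 10)

def error_probability_to_qv(p: float) -> float:
    """Inverse conversion: per-base error probability to Phred-scaled QV."""
    return -10 * math.log10(p)

def keep_read(length_bp: int, read_quality: float,
              min_length: int = 500, min_quality: float = 0.80) -> bool:
    """Keep a read only if it meets both thresholds given in the text.

    This is an illustrative filter, not the HGAP implementation.
    """
    return length_bp >= min_length and read_quality >= min_quality

if __name__ == "__main__":
    for qv in (20, 30, 47):
        print(f"QV {qv}: error probability {qv_to_error_probability(qv):.2e}")
    print(keep_read(499, 0.90), keep_read(500, 0.80))
```

Under this convention, each additional 10 QV units lowers the expected error rate tenfold.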

Once the refined assembly was obtained, each BAC-insert sequence was individualized by matching the end sequences to the pool of assembled sequences using BLAST. Read coverage was assessed by aligning the raw reads on the assembled sequences with BLASR.

Identification and Annotation of Repetitive Sequences

Eukaryotic genomes contain a substantial portion of repetitive elements which are organized into three main classes: dispersed repeats (mostly transposable elements and retrotransposed genes), local repeats (tandem repeats and simple sequence repeats or microsatellites) and segmental duplications (duplicated genomic fragments)28. It is highly recommended to identify and mask repetitive regions before gene prediction. Otherwise, unmasked repeats can produce spurious BLAST alignments, resulting in false evidence for gene annotations29.

The v2.2 REPET package was used for de novo detection and annotation of transposable elements (TEs). The annotation process starts with self-alignment of the sequences by all-by-all comparison. The matches are then clustered so that each cluster groups the sequences of a given family. A consensus for each family is created, and each consensus is classified according to the structures and domains present. The last step entails annotating TE copies30,31.

The resulting elements were then compared with sequences deposited in the Viridiplantae section of the Repbase repeat database32. They were classified by PASTEC, a tool for classifying TEs by searching for structural features and similarities33 and implementing the hierarchical classification system proposed by34. Repeat masking was subsequently performed with RepeatMasker Open-3.035 using the library generated by the REPET and Repbase Viridiplantae dataset32.

MISA36 was used to search for microsatellites, requiring at least 10 repeats for mono-, 5 for di-, and 3 for tri-, tetra-, penta- or hexanucleotide motifs. Composite microsatellites, which are formed by multiple adjacent repetitive motifs, were also identified. A microsatellite was considered composite if the interruption between motifs was at most 10 bp37,38.
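
To make these search criteria concrete, the following Python sketch finds perfect microsatellites with the same minimum repeat numbers and groups neighbours separated by at most 10 bp. It is a simplified illustration, not MISA itself: the functions `find_ssrs` and `group_compound` are hypothetical names, only the four unambiguous bases are handled, and the toy sequence is invented.

```python
import re

# Minimum number of repeat units per motif length, as stated in the text.
MIN_REPEATS = {1: 10, 2: 5, 3: 3, 4: 3, 5: 3, 6: 3}

def is_redundant(motif: str) -> bool:
    """True if the motif is itself a repetition of a shorter unit (e.g. 'AA')."""
    n = len(motif)
    return any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))

def find_ssrs(seq: str):
    """Return sorted (start, end, motif, repeats) tuples; 0-based, end exclusive."""
    seq = seq.upper()
    hits = []
    for unit, min_rep in MIN_REPEATS.items():
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit, min_rep - 1))
        pos = 0
        while True:
            m = pattern.search(seq, pos)
            if m is None:
                break
            motif = m.group(1)
            if is_redundant(motif):
                # Already reported under a shorter motif length; move on one base.
                pos = m.start() + 1
                continue
            hits.append((m.start(), m.end(), motif, (m.end() - m.start()) // unit))
            pos = m.end()
    return sorted(hits)

def group_compound(ssrs, max_gap: int = 10):
    """Group SSRs separated by at most max_gap bp; groups of two or more are compound."""
    groups = []
    for ssr in ssrs:
        if groups and ssr[0] - groups[-1][-1][1] <= max_gap:
            groups[-1].append(ssr)
        else:
            groups.append([ssr])
    return groups

if __name__ == "__main__":
    toy = "GG" + "A" * 10 + "C" + "AG" * 5 + "T" + "AAG" * 3 + "C"
    ssrs = find_ssrs(toy)
    print(ssrs)
    print([len(g) for g in group_compound(ssrs)])
```

In the toy sequence the three repeats lie one base apart, so they are reported as a single compound microsatellite.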

Gene Prediction and Functional Annotation

Evidence-driven gene prediction was performed based on gene models of Arabidopsis thaliana and Theobroma cacao and using the following software: Augustus39, GlimmerHMM40, GeneMark.hmm41, and SNAP42. Ab initio gene finding was performed with the BRAKER pipeline43. Protein homology detection and resolution of potential introns were carried out with Exonerate software44 against the annotated genomes of Populus trichocarpa, Salix purpurea, Ricinus communis and Manihot esculenta, downloaded from the Phytozome website45. These species are among the plant genomes with the highest number of top hits for P. edulis15.

Additionally, a P. edulis RNA-seq library (see details below) was used to support gene model predictions. PASA46 was used to produce alignment assemblies based on overlapping transcript alignments from P. edulis RNA-seq data. The results were combined by EVidence Modeler software47, and PASA was used to update the EVidence Modeler consensus predictions, adding UTR annotations and models for alternatively spliced isoforms. Exon-intron boundaries were manually examined using GenomeView48 and adjusted where necessary.

RNA-seq reads (2 × 100 bp; Illumina HiSeq2000) were trimmed based on quality (Phred quality score >20). Contaminants, remaining adapters, and short sequences (<50 bp) were removed using SeqyClean v1.9.949. Total RNA-seq assembly was implemented by Trinity50. In brief, RNA-seq reads were derived from three libraries (each replicated three times) of shoot apexes of juvenile, vegetative and reproductive adult plants of P. edulis, constructed with the aim of comparing these three developmental stages (Dornelas M.C. et al., unpublished data).

Functional annotation of the predicted gene sequences was performed using Blast2GO v3.2 tools51 for assigning ontological terms in accordance with BLASTX results (e-value cut-off of 1 × 10⁻⁶). In addition, protein signature recognition was performed using the InterProScan tool52.

Microsynteny Analysis

The 20 P. edulis BAC-inserts with the highest number of annotated genes were used for the identification of potential microsyntenic regions between P. edulis and Populus trichocarpa (Salicaceae), and P. edulis and Manihot esculenta (Euphorbiaceae), two related Malpighiales species with entirely sequenced and well-annotated genomes. P. edulis coding sequences were compared with these two genome sequences, available in the Phytozome database45 using BLASTN.

Based on the phylogenetic relationships among the Malpighiales species, we chose P. trichocarpa because it is the closest species to P. edulis. Taxonomically speaking, Passifloraceae appears as a sister group to Salicaceae. On the other hand, M. esculenta is the most distant species from P. edulis among those Malpighiales with fully sequenced and well-annotated genomes.

To consider two genes as orthologs, the alignment had to show an e-value < 10⁻¹⁰ and coverage >50%. After identifying the orthologs, microsyntenic regions were defined as regions with more than four pairs of orthologous genes. The analysis was carried out in JBrowse (Phytozome v12.1 platform)45, searching for the genes of each P. edulis microsyntenic region in the P. trichocarpa and M. esculenta genomes. The initial and final positions of the orthologous genes, together with the chromosome identification, were recorded and used as a basis for constructing comparative graphs. Using the GenomeView browser48, each of the microsyntenic regions was visualized and confirmed. Finally, comparative graphs were constructed using a graphics application.
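
The ortholog and region criteria can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the `Hit` record, its field names and the gene identifiers are hypothetical, coverage is taken as a percentage of the query, and the sketch only counts ortholog pairs per target chromosome; it does not check gene order or physical proximity, which in the study were inspected in the genome browsers.

```python
from collections import defaultdict
from typing import NamedTuple

class Hit(NamedTuple):
    """One BLASTN alignment between a P. edulis CDS and a reference gene."""
    query_gene: str
    target_gene: str
    target_chrom: str
    evalue: float
    coverage: float  # percent of the query covered by the alignment

def is_ortholog(hit: Hit, max_evalue: float = 1e-10, min_coverage: float = 50.0) -> bool:
    """Apply the thresholds from the text: e-value < 1e-10 and coverage > 50%."""
    return hit.evalue < max_evalue and hit.coverage > min_coverage

def microsyntenic_chromosomes(hits, min_pairs: int = 5):
    """Return target chromosomes carrying more than four ortholog pairs.

    'More than four pairs' is expressed as min_pairs = 5.
    """
    pairs = defaultdict(set)
    for hit in hits:
        if is_ortholog(hit):
            pairs[hit.target_chrom].add((hit.query_gene, hit.target_gene))
    return {chrom: sorted(p) for chrom, p in pairs.items() if len(p) >= min_pairs}

if __name__ == "__main__":
    # Invented example hits for a single BAC insert.
    hits = [Hit(f"Pe_g{i}", f"Pt_g{i}", "Chr09", 1e-40, 85.0) for i in range(6)]
    hits.append(Hit("Pe_g7", "Pt_g70", "Chr04", 1e-5, 90.0))  # fails the e-value cut-off
    print(microsyntenic_chromosomes(hits))
```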

Results

BAC Selection, Sequencing and Assembly

A total of 66 BAC inserts were selected for complete sequencing based on our previous BAC-end sequencing results15, and 46 were selected using probes homologous to transcripts of P. edulis53 (Supplementary Table S1). Thus, in total, 112 BAC inserts from the P. edulis genomic library were sequenced. The sequencing process resulted in 571,565 high quality reads, ranging from 500 to 46,831 bp in length. Sequences were between 24,316 and 142,456 bp in length, corresponding to their respective band sizes resolved by PFGE. The high quality of the long reads (QV > 47) and high coverage of the contigs (on average 278×) are indications of the reliability of our data (Supplementary Table S2), leading to the conclusion that all inserts were completely sequenced and assembled. The assembly, gene models, and genome browser are available at https://genomevolution.org/coge/GenomeInfo.pl?gid=52053.

The sequencing method was of sufficient quality to provide a single contig per insert, with only two exceptions; in the assembly process, insert sequences Pe101K14 and Pe141H13 had overlapping regions that resulted in a single contig of 172,337 bp; similarly, Pe20N3 and Pe64C12 resulted in a single contig of 114,997 bp. In addition, of the 112 BAC insert sequences, three corresponded to organelle DNA, and therefore these sequences were not included. Thus, 107 sequences were subjected to annotation, totaling 10,401,671 bp (10.4 Mb) corresponding to approximately 1.0% of the P. edulis genome. GC content across this genome fraction was 41.09%, and in the CDS 46.49%.

Gene Representativeness, Structure and Functional Annotation

Structural sequence annotation resulted in the prediction of 1,883 genes ranging from 153 to 24,687 bp in length, with an average of 2,448 bp. These gene sequences represented 44% of the total sequenced nucleotides, corresponding to 4,608,830 bp. Intergenic regions ranged from 0 (overlapping genes) to 92,497 bp, with a mean length of 3,184 bp. Between 3 and 36 predicted genes were identified per sequenced insert, with an average of 17.6 predicted genes per insert (Table 1, Supplementary Table S3). Taking into account the estimated size of the P. edulis genome (~1,230 Mb), the high number of genes identified herein (1,883) endorses the efficiency of the strategy used for selecting BAC-inserts that were supposedly gene-rich.

Table 1 Gene content in a gene-rich fraction of the Passiflora edulis genome and structural annotation.

One third of the genes (631) had no introns. The remaining genes (1,252) had up to 50 introns. A total of 6,122 introns (ranging from 26 to 7,869 bp in length) and 8,005 exons (ranging from 3 to 6,249 bp) were recognized. CDS ranged from 153 to 14,583 bp in length, totaling 1,985,892 bp, with a mean of 1,054 bp. A total of 61 genes were located at insert ends and were therefore incomplete. According to the RNA-seq read alignment results, 252 genes exhibited more than one transcript (Supplementary Table S3), including glutamine synthetase leaf enzyme, chloroplastic (6 transcripts), ultraviolet-B receptor UVR8, a protein responsive to UV-B (5), the auxin response factor (2), an abscisic acid insensitive protein (2) and an ethylene receptor protein (2).
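
These counts can be cross-checked with simple arithmetic, since a gene with n introns has n + 1 exons and the exon total should therefore equal the intron total plus the number of genes. The short Python sketch below uses only the figures reported above; the variable names are illustrative.

```python
# Figures taken from the structural annotation reported in the text.
genes = 1883
intronless_genes = 631
introns = 6122
exons = 8005
cds_total_bp = 1_985_892

# Genes that contain at least one intron.
genes_with_introns = genes - intronless_genes

# Each gene contributes one more exon than it has introns.
expected_exons = introns + genes

# Summary ratios.
intronless_fraction = intronless_genes / genes
mean_cds_bp = cds_total_bp / genes
mean_introns_per_spliced_gene = introns / genes_with_introns

if __name__ == "__main__":
    print(f"genes with introns: {genes_with_introns}")
    print(f"expected exons: {expected_exons} (reported: {exons})")
    print(f"intronless fraction: {intronless_fraction:.1%}")
    print(f"mean CDS length: {mean_cds_bp:.1f} bp")
    print(f"mean introns per intron-containing gene: {mean_introns_per_spliced_gene:.2f}")
```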

Of the 1,883 predicted genes, 1,502 showed significant levels of similarity (e-values < 1 × 10⁻⁶) to plant proteins (Supplementary Table S3) according to the Blast2GO results. The top hits for this large fraction of genes (~80%) were from Jatropha curcas (298), Populus trichocarpa (275), Populus euphratica (232) and Ricinus communis (212). These results were expected, since among all available plant genomes, these species are phylogenetically close to P. edulis, and all belong to the Malpighiales order. Functional annotation resulted in 3,178 ontological terms assigned to 1,191 genes. These GO terms were related to several processes, and are usually classified into three broad categories (known as level 1): biological process, molecular function and cellular component. The distribution of level 2 terms within each of these major categories is shown in Fig. 1 and matches the results of BES annotation15.

Figure 1

Distribution of GO annotations assigned to gene products in ontological categories: (A) Biological process, (B) Molecular function and (C) Cellular component. GO annotations were extracted from all sequences (10.4 Mb) of Passiflora edulis.

Regarding the 46 regions selected using probes homologous to transcripts induced and repressed by X. axonopodis infection, none of the functional categories related to plant defense were found to be overrepresented. However, protein signatures related to plant immunity and defense functions were identified. The serine/threonine-protein kinase active site (32 genes), and the leucine-rich repeat domain, L domain-like (27 genes) were among the most represented signatures (Table 2). In total, automated searches for protein signatures recognized 1,383 signatures in 1,488 genes of P. edulis: 783 domains, 453 protein families, 125 sites and 22 repeats (Table 2). Most of these signatures (769) were taken from the Pfam database54, and the remainder from SuperFamily (239)55 and Smart (223)56.

Table 2 Most frequent protein signatures (≥10) recognized in genes of Passiflora edulis according to InterProScan results.

Richness of Transposable Elements and Microsatellites

The search for transposable elements resulted in the identification of 250 TEs that, in turn, were automatically classified as Class I (retrotransposons) and Class II (DNA transposons), and in terms of order33. These TEs represented 17.6% of total data, corresponding to 1,830,620 bp. Class I was prevalent with 96.4% (241/250) retrotransposons (Table 3). These TEs were preferentially hosted in intergenic regions (70.4%, 176/250); 74 TEs were found within genes, including 70 exonic TEs, and only four were located in introns.

Table 3 Classification of transposable elements identified in a gene-rich fraction of the Passiflora edulis genome.

The LTR (Long Terminal Repeat) retrotransposon was the most frequent order, and accounted for 75.1% (181/241) of retrotransposons, corresponding to 1,418,389 bp or 13.6% (1,418,389 bp/10,401,671 bp) of all sequence data. The other orders of Class I were poorly represented, but note that LARDs (Large Retrotransposon Derivatives) accounted for 36 elements (Table 3). Only 3.6% (9/250) of TEs were of Class II, the majority (6) classified as TIR (Terminal Inverted Repeats) (Table 3).

The search for microsatellites resulted in the identification of 11,020 simple sequence repeats (SSR), representing 1.05% of all sequence data (109,695 bp/10,401,671 bp). In CDS (1,985,806 bp) there were 1,762 SSRs (~16% of the total). Taking into account all sequence data, 106 SSRs were found every 100 kb (one SSR every 0.94 kb). In the CDS region, 89 SSRs were found every 100 kb (one SSR every 1.12 kb); hence, the frequency of SSRs was slightly lower in the CDS region (~1.2×, 1.12 kb/0.94 kb). The SSR density estimated here was about 10× higher than that reported previously15 using P. edulis BES data as a major resource (10.8 SSRs every 100 kb or one SSR every 9.25 kb).
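
The density figures follow directly from the SSR counts and sequence lengths. The Python sketch below reproduces the arithmetic from the numbers reported in the text; the helper `ssr_density` is a hypothetical name introduced for illustration.

```python
def ssr_density(n_ssrs: int, length_bp: int):
    """Return (SSRs per 100 kb, mean spacing between SSRs in kb)."""
    per_100kb = n_ssrs / length_bp * 100_000
    kb_per_ssr = length_bp / n_ssrs / 1_000
    return per_100kb, kb_per_ssr

# Counts and lengths as reported in the text.
all_per_100kb, all_spacing_kb = ssr_density(11_020, 10_401_671)
cds_per_100kb, cds_spacing_kb = ssr_density(1_762, 1_985_806)

# Density reported earlier from BAC-end sequences (SSRs per 100 kb).
bes_per_100kb = 10.8
ratio_to_bes = all_per_100kb / bes_per_100kb

if __name__ == "__main__":
    print(f"all sequences: {all_per_100kb:.1f} SSRs/100 kb, one per {all_spacing_kb:.2f} kb")
    print(f"CDS:           {cds_per_100kb:.1f} SSRs/100 kb, one per {cds_spacing_kb:.2f} kb")
    print(f"ratio of the two densities (this study / BES): {ratio_to_bes:.1f}")
```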

Microsatellite sequences were grouped according to motif, and all possible classes of repeats were found, with trinucleotides the most prevalent in both data sources (all sequences and CDS). Compound SSRs accounted for 17.4% (1,919/11,020) of all SSRs and 15.8% (278/1,762) of the SSRs found in CDS (Fig. 2A). Among the mononucleotides, A/T motifs far outnumbered G/C motifs. The most frequent dinucleotides were AT/AT (49.3%), followed by AG/CT (35.4%), which were prevalent in CDS (74%). Among the trinucleotides, AAG/CTT were the most frequent in both data sources (~23%). Other occurrences (tetra-, penta- and hexanucleotides) are shown in Fig. 2B.

Figure 2

(A) Percentage of mono-, di-, tri-, tetra-, penta- and hexanucleotides in microsatellites (SSRs) found in all sequences (10.4 Mb) of Passiflora edulis (blue bars) and in coding DNA sequences (CDS, orange bars). (B) Percentage of the most frequent motifs in each class of microsatellites (SSRs) found in all sequences (blue bars) and in coding DNA sequences (CDS, orange bars) of Passiflora edulis.

Microsynteny Analysis Results

The following 20 P. edulis BAC-inserts were used for microsynteny analysis: Pe101K14 + 141H13 (36), Pe185D11 (36), Pe164B18 (29), Pe214H11 (29), Pe164D9 (28), Pe186E19 (28), Pe43L2 (27), Pe164K17 (26), Pe215I8 (26), Pe84I14 (25), Pe84M23 (25), Pe93M2 (25), Pe171P13 (25), Pe207D11 (25), Pe93N7 (24), Pe108C16 (24), Pe173B16 (24), Pe185J16 (24), Pe198H23 (24) and Pe212I1 (24). These regions were found to contain the highest number of annotated genes (given in parentheses) and account for 2,243,840 bp, encompassing 534 genes (Table 1).

Microsynteny analysis showed that 18 of the 20 P. edulis regions contained syntenic P. trichocarpa chromosomal regions, and 15 P. edulis regions had syntenic M. esculenta chromosomal regions (Figs 3–7, S1–S13). In some comparisons, the microsyntenic region of P. edulis had the opposite orientation with respect to the chromosomes of both (see Fig. 3) or one of the species compared.

Figure 3

Collinear microsyntenic regions identified in Passiflora edulis (yellow bars) and Populus trichocarpa chromosome 2 (green bar) and Manihot esculenta chromosomes 12 and 13 (brown bars). Note the opposite orientation of the P. edulis microsyntenic region relative to the chromosomes of both species. The orthologous genes of P. edulis are duplicated in M. esculenta chromosomes.

Figure 4

Collinear microsyntenic regions identified in Passiflora edulis (yellow bars) and Populus trichocarpa chromosomes 4 and 9 (green bars). Note the opposite orientation of P. edulis microsyntenic region. The orthologous genes of P. edulis are duplicated in P. trichocarpa chromosomes.

Figure 5

Collinear microsyntenic regions identified in Passiflora edulis (yellow bars) and Populus trichocarpa chromosome 14 (green bar) and Manihot esculenta chromosomes 1 and 5 (brown bars). Note the opposite orientation of M. esculenta chromosome 1, and rearranged segments at the end of the P. edulis microsyntenic region. The orthologous genes of P. edulis are duplicated in M. esculenta chromosomes.

Figure 6

Collinear microsyntenic regions identified in Passiflora edulis (yellow bars) and Populus trichocarpa chromosome 1 (green bar) and Manihot esculenta chromosomes 6 and 14 (brown bars). Note the opposite orientation of M. esculenta chromosome 6. There are translocated segments in the P. edulis microsyntenic region relative to chromosome 1 of P. trichocarpa. The orthologous genes of P. edulis are duplicated in M. esculenta chromosomes.

Figure 7

Collinear microsyntenic regions identified in Passiflora edulis (yellow bars) and Populus trichocarpa chromosomes 4 and 9 (green bars) and Manihot esculenta chromosome 4 (brown bar). Note the opposite orientation of the P. edulis microsyntenic region relative to P. trichocarpa chromosomes, and the large segment of P. trichocarpa chromosome 4 that is missing in P. edulis. The orthologous genes of P. edulis are duplicated in P. trichocarpa chromosomes.

The 18 P. edulis regions span 1,702,975 bp and contain 406 genes. They matched syntenic segments of P. trichocarpa chromosomes that span 7,137,451 bp and contain 966 genes, including 501 orthologs (Table 4). Ten of the syntenic regions of P. edulis have orthologous genes that are duplicated in P. trichocarpa chromosomes. Interestingly, a continuous region in P. edulis (Pe214H11) is syntenic to segments of P. trichocarpa chromosome 4, and these segments are separated by 1.4 Mb. The same is true for segments of chromosome 9, separated by 1.2 Mb (Fig. 4). Other large segments of the P. trichocarpa chromosome 4 are also missing in the corresponding P. edulis syntenic region (Fig. 7). These presumably relate to deletion events that occurred in P. edulis.

Table 4 Characterization of 18 Passiflora edulis regions found to have syntenic Populus trichocarpa chromosomal regions.

Average gene length in P. edulis (2,785 bp) is slightly lower than that of P. trichocarpa (3,290 bp). However, the average intergenic spacer length in P. trichocarpa (8,694 bp) is more than four times that of P. edulis (1,871 bp) (Supplementary Table S4). The gene order is conserved in most of the syntenic regions, but rearrangements were observed. On comparing P. edulis with P. trichocarpa, two typical inversion events in the gene order were recognized (Supplementary Figs S3 and S6). Moreover, two adjacent genes in P. trichocarpa chromosome 1 were found to be inverted, and also interrupted in the P. edulis syntenic region (Fig. 6). Finally, it is worth noting the occurrence of particular gene duplications within the syntenic regions involving two to seven copies. Figure 4 shows two P. edulis genes (8th and 22nd) that have four copies in P. trichocarpa chromosome 9.

In the comparison with M. esculenta, the 15 regions of P. edulis span 1,392,795 bp and contain 348 genes, matching syntenic segments of M. esculenta chromosomes that span 5,053,254 bp and contain 633 genes, including 365 orthologs (Table 5). Eleven of the syntenic regions of P. edulis contain orthologous genes that are duplicated in M. esculenta chromosomes.

Table 5 Characterization of 15 Passiflora edulis regions found to have syntenic Manihot esculenta chromosomal regions.

The average P. edulis gene length (2,641 bp) is lower than that of M. esculenta (3,886 bp). However, the average intergenic spacer length in M. esculenta (6,777 bp) is more than three times that of P. edulis (1,850 bp) (Supplementary Table S4). Gene order is also conserved in most of the syntenic regions, but rearrangements were recognized in genes of both P. edulis and M. esculenta (Figs S1, S2, S6, S7). The occurrence of particular gene duplications within syntenic regions involving two to five copies was also detected. Figure 3 shows three copies of a P. edulis gene (18th) arranged in tandem on chromosome 13 of M. esculenta and two copies in tandem on chromosome 12, totaling five copies. The 2nd gene within the P. edulis microsyntenic region is also duplicated in M. esculenta chromosome 12.

In terms of specific genes, note that a single copy of the gene encoding a KIN1-related stress-induced protein was found in P. edulis but there are seven orthologous copies in P. trichocarpa chromosome 4 and three in chromosome 17 (Supplementary Fig. S2). Moreover, five copies in tandem of the gene encoding an endo-1,3 1,4-beta-D-glucanase were found in P. edulis, but no orthologs were found in P. trichocarpa and M. esculenta. Finally, four copies in tandem of the salicylic acid-binding protein 2-like gene were found in P. edulis: an orthologous copy was found in chromosome 4 and three in chromosome 9 of P. trichocarpa, but only one copy was found in chromosome 17 of M. esculenta (Supplementary Fig. S1).

There is a higher degree of comparative microsynteny between P. edulis and P. trichocarpa than between P. edulis and M. esculenta. The number of genes in most P. trichocarpa and M. esculenta chromosomal segments is considerably higher than that found in the corresponding P. edulis microsyntenic regions (Tables 4 and 5). The highest level of synteny conservation was found between Pe173B16 and P. trichocarpa chromosome 9, with 29 orthologous, collinear gene pairs (Table 4; Fig. 7), and between Pe185D11 and M. esculenta chromosome 12, with 27 orthologous, collinear gene pairs (Table 5; Fig. 3).

Discussion

Despite great advances in genome sequencing, the process of sequencing a plant genome is still laborious, due primarily to the size and complexity of genome regions which pose a challenge when it comes to sequencing and assembly. For instance, Passiflora species are extensively diversified in morphological terms, with genome sizes ranging from 207 Mb to 2.15 Gb14 and there are no draft genomes for any passion fruits, even the most cultivated species, P. edulis. In this study, a gene-rich fraction of the P. edulis genome was sequenced and assembled from long sequence reads, allowing us to obtain 10.4 Mb of highly curated data.

Nearly half of all sequences (44%) matched P. edulis gene sequences, and annotation revealed several functional categories and protein domains. Interestingly, the most frequent domain was retrotransposon gag, associated with transcripts of the LTR retrotransposon, followed by the kinase domains. The abundance of kinase domains was to be expected, since kinases belong to a superfamily of proteins with copies in the hundreds or thousands and are components of all cellular functions. These proteins use the γ-phosphate of ATP to phosphorylate serine, threonine or tyrosine residues of other proteins57. Note that, to date, information on Passiflora nuclear genes in databases is extremely scarce. This means that obtaining gene-based probes for selecting new regions for whole sequencing is practically impossible. The structural and functional annotation of 1,883 genes provides a significant set of high quality gene sequences that can be used in many other studies on Passiflora (see Supplementary Table S3).

Transposable elements (TEs) are highly widespread in plant genomes, accounting for 14% of the Arabidopsis thaliana genome58, up to 80% of the maize genome59 and 17.6% of all P. edulis sequences. The vast majority are retroelements that belong to Class I (96.4%), and especially to the LTR order. This abundance is very similar to that previously reported15 in an analysis of ~10,000 BES (18.5% TEs, 94.1% Class I TEs, the majority belonging to the LTR order), and this pattern is expected to hold across the whole P. edulis genome. On examining high quality genomes, several authors have stated that the spread of TEs (mostly retrotransposons) is the main driver of genome size variation in plants. This is particularly true of LTR retrotransposons, owing to their replication mechanism. LTR retrotransposons are found mainly in centromeric regions, where they play an important role in chromatin structure maintenance, centromere performance and the regulation of host gene expression60,61,62.

The content of LTR elements in P. edulis is comparable to that identified in related Malpighiales species with completely sequenced genomes, although the abundance of TEs is highly variable. This variation is to be expected and is indicative of particular TE-driven evolutionary processes60. For instance, ~42% of the P. trichocarpa genome consists of transposable elements (although only 12.9% of the sequences could be classified as known TEs), the majority belonging to the LTR order (~60%). These figures relate to the draft genome of P. trichocarpa24, and the authors state that this genome could contain even more non-classified LTRs. In R. communis, approximately 50% of the genome consists of transposable elements, and LTRs were the most abundant, making up ~16% of the genome63, close to the value observed in P. edulis (13.6%), although the genome of P. edulis is ~3.8× larger than that of R. communis. Finally, in Manihot esculenta, ~25.7% of the genome consists of transposable elements, and LTR is also the most represented order among classified TEs, forming ~11% of the genomic sequences25. In this case, the genome report was based on 65% of an assembled genome of the domesticated variety.

In terms of microsatellite abundance, ~1.0% of all P. edulis sequences consisted of SSRs, with trinucleotide repeats prevalent (55.6%), especially in CDS (93.8%). Microsatellite abundance generally varies from one genome region to another, but trinucleotides are usually overrepresented in coding sequences, due to selection pressure against mutations that may alter the reading frame64. The P. edulis results corroborate the findings of a pioneering study65, which showed that trinucleotide repeats are significantly more abundant in the expressed regions of plant genomes. Recently, a total of 1,300 perfect microsatellite sites were described in P. edulis genomic regions (with minimum 15× coverage as a cut off; Illumina paired-end reads) that were selected for marker development and Passiflora diversity analysis66. In this significant sample, the prevalence of tri-, tetra- and dinucleotides was found to be 41.0%, 36.4% and 22.6%, respectively.
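The reading-frame argument can be illustrated with a short sketch. Assuming that a microsatellite mutates by gaining or losing whole repeat units (the function name below is hypothetical), the resulting frame offset in a coding sequence is simply the inserted or deleted length modulo 3:

```python
def frame_shift(motif_len, units_changed=1):
    """Reading-frame offset (0, 1 or 2) caused by gaining or losing
    `units_changed` repeat units of `motif_len` bases each."""
    return (motif_len * units_changed) % 3


for name, k in [("mono", 1), ("di", 2), ("tri", 3),
                ("tetra", 4), ("penta", 5), ("hexa", 6)]:
    effect = "in frame" if frame_shift(k) == 0 else "frameshift"
    print(f"{name:>5}: {effect}")
```

Only tri- and hexanucleotide units leave the frame intact after a single-unit change, which is consistent with their overrepresentation in coding sequences.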

In the P. trichocarpa genome, SSR frequency decreased stepwise as the motif length increased (mono- to hexanucleotide repeats), with mono- (69.8%), di- (19.5%) and trinucleotides (9.0%) predominating; 98% of P. trichocarpa mononucleotides consist of A/T motifs and only 2% of C/G motifs. The same applies to P. edulis (Fig. 2B). For di- and trinucleotides, the most frequent motifs were AT/AT (60.5%) and AAT/ATT (48.2%). In terms of coding sequences, 90.3% and 76.6% of the mono- and dinucleotides consist respectively of A/T and AG/CT motifs. Trinucleotides consist mainly of AAG/CTT, ACC/GGT and AGG/CCT motifs (~20% of each), and the frequencies of tetra-, penta- and hexanucleotides were very low67.
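Motif labels such as AT/AT, AG/CT and AAG/CTT pair a repeat unit with its reverse complement, since the same microsatellite reads differently on each strand and from each starting position. The sketch below shows one common way of grouping motifs into such classes; it is an illustration under that assumption, not the tool used in the cited studies, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(motif):
    """Reverse complement of a DNA motif (upper-case A, C, G, T)."""
    return motif.translate(COMPLEMENT)[::-1]


def rotations(motif):
    """All cyclic shifts of a motif, e.g. AAG -> AAG, AGA, GAA."""
    return [motif[i:] + motif[:i] for i in range(len(motif))]


def motif_class(motif):
    """Label a motif by its class, written as 'motif/reverse complement'.

    The representative is the alphabetically first sequence among all
    rotations of the motif and of its reverse complement.
    """
    motif = motif.upper()
    rc = reverse_complement(motif)
    canonical = min(rotations(motif) + rotations(rc))
    return f"{canonical}/{reverse_complement(canonical)}"


for m in ["TA", "GA", "TTC", "TGG", "TAT"]:
    print(m, "->", motif_class(m))
```

With this grouping, TTC, CTT and GAA all fall into the AAG/CTT class, so their counts are pooled when motif frequencies are reported.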

In M. esculenta, 37.4% of all SSRs corresponded to dinucleotides, and tri- and pentanucleotides were found in the same proportion (~24%); within the coding sequences, tri- and hexanucleotides accounted for 95.6%. AT/AT and AAT/ATT were the most common di- and trinucleotide motifs (~23% and ~12%, respectively) and, as in P. edulis, AG/CT and AAG/CTT were the most prevalent in coding sequences (~4% and ~23%, respectively)68. In the R. communis genome, most of the SSRs found were also dinucleotides (70.4%), followed by trinucleotides (24.9%). AT/TA was the most frequent motif among dinucleotides (75.3%) and AAT/TTA among trinucleotides (71%)69.

Clearly, the particular occurrence of certain motifs in plant genomes and in different genome regions is due to selection pressure during evolution70,71, and structural and functional genome attributes, like GC content and codon usage bias, may be responsible for the unique content and distribution patterns of microsatellites72,73.

Notably, several benefits can be derived from the knowledge we have generated. First, a draft sequencing of the Passiflora edulis nuclear genome, especially of a gene-rich fraction, provides a platform for functional analysis and development of genomic tools in applied passion fruit improvement. Our work also represents a first step towards full sequencing of the P. edulis genome. Moreover, wild Passiflora species harbor a variety of characteristics that determine their ecological importance and adaptability. The availability of gene sequences could help researchers test for the presence of gene variants or polymorphisms in different environments. This is also possible for cultivated species. Gene prediction has yielded around 1,900 genes, and functional annotation has associated genes with plant immunity and defense functions (Supplementary Table S3).

Taxonomically speaking, the genus is subdivided into four subgenera: three clades were recognized as monophyletic (Astrophea, Decaloba, and Passiflora), but the position of Deidamioides remained unresolved, as this particular clade was found to be paraphyletic. Therefore, gene sequences could be used in phylogenetic analysis to obtain accurate evolutionary information.

By providing information on the levels of synteny conservation and rearrangements within the microcollinear regions (inverted and translocated segments, deletion and gene duplication events), this study will help confirm the relationships between a Passiflora species and related Malpighiales, with important taxonomical implications. Our previous phylogenetic analyses based on the available chloroplast genomes of members of the four families that compose the Malpighiales order indicated that the Passifloraceae are more closely related to the Salicaceae than to the Euphorbiaceae16. This proximity is confirmed herein by microsynteny analysis, underscoring the importance of comparative genomic approaches as an additional resource for elucidating the phylogenetic relationships among the families that compose the Malpighiales, one of the largest orders of flowering plants.

Although P. edulis microsyntenic regions were compared with whole genomes of P. trichocarpa (Salicaceae) and M. esculenta (Euphorbiaceae), i.e. species that belong to different taxonomic families, the analysis showed that overall gene order was well conserved. The level of microsynteny observed between the majority of P. edulis BAC inserts and these genomes is surprising, given the long divergence time that separates them from the common ancestor of the Malpighiales, some 100 million years ago74. The whole genome duplication (WGD) event in P. trichocarpa occurred about 60−65 million years ago and affected around 92% of its genome24. On the other hand, M. esculenta has undergone a paleo-genome duplication event, and a number of its genes were found to have only two copies25,75. This may be related to the loss of one of the homologous copies in M. esculenta owing to selection pressure that restored the single-copy state of genes that impair fitness when present in multiple copies76.

The genome size of P. edulis is estimated at ~1.23 Gb, considerably larger than the estimated genome sizes of P. trichocarpa (~485 Mb)24 and M. esculenta (~742 Mb)25. These differences raise the question: did an ancestor of the passionflowers undergo genome duplication? Possibly. According to cytogenetic studies, the basic chromosome number in the genus Passiflora is x = 6, with several species containing secondary numbers, as in the case of P. edulis (x = 9). These species with secondary chromosome numbers are possibly of polyploid origin77,78. Nevertheless, there is evolutionary evidence indicating x = 12 as the basic chromosome number, since x = 6 was reported to occur only in the subgenus Decaloba. In primitive Passiflora species, such as those of the Astrophea subgenus, x = 12, and the same applies to other species of the Passifloraceae family78,79. This suggests that descending dysploidy events may have occurred in the Passiflora (x = 9) and Decaloba (x = 6) subgenera, lending weight to the hypothesis that genome duplication occurred in an ancestor of the Passifloraceae. In fact, the diploid numbers 2n = 12, 18, 24, and 72 have been reported for Passiflora species80.

An examination of the microsyntenic regions shows that the P. edulis gene-rich segments are more compact than those of the species compared, even though its genome is roughly three times larger than that of P. trichocarpa and almost twice the size of the M. esculenta genome. The limited sample of the P. edulis genome analyzed herein cannot explain this apparent contradiction between the compactness of gene regions and genome size. Further studies are required to elucidate the abundance of repetitive DNA (including TEs) associated with gene-poor regions and/or the occurrence of large heterochromatin blocks in P. edulis81,82.

Finally, wide variations in genome size occur within the genus Passiflora14, indicating that genome duplication and DNA sequence acquisition and loss (favoring species disruption) have occurred throughout the evolution of the genus, since its diversification from the common ancestor about 38 million years ago83.

Conclusion

The outcome of this research was a unique set of high quality sequence data on a gene-rich fraction of the Passiflora edulis genome, describing gene content and abundance of repetitive elements. The structural and functional annotations of 1,883 genes of P. edulis are detailed. Our microsynteny analysis indicates a relatively high degree of conservation in gene regions of P. edulis, Populus trichocarpa and Manihot esculenta. Collinear orthologous genes are shown to be prevalent, although some disruptions of collinearity have occurred due to rearrangements (inversion, translocation events) within microsyntenic regions. Interestingly, even though the P. edulis genome is much larger than those of P. trichocarpa (3×) and M. esculenta (2×), which evolved by polyploidy, the P. edulis gene-rich segments are much more compact. This study takes the first steps, but further work is required to elucidate the abundance of repetitive DNA associated with gene-poor regions and/or the occurrence of large heterochromatin blocks in P. edulis, in order to contribute to our understanding of the evolutionary issues that these genomes raise.