Genomic analyses of primitive, wild and cultivated citrus provide insights into asexual reproduction

The emergence of apomixis—the transition from sexual to asexual reproduction—is a prominent feature of modern citrus. Here we de novo sequenced and comprehensively studied the genomes of four representative citrus species. Additionally, we sequenced 100 accessions of primitive, wild and cultivated citrus. Comparative population analysis suggested that genomic regions harboring energy- and reproduction-associated genes are probably under selection in cultivated citrus. We also narrowed the genetic locus responsible for citrus polyembryony, a form of apomixis, to an 80-kb region containing 11 candidate genes. One of these, CitRWP, is expressed at higher levels in ovules of polyembryonic cultivars. We found a miniature inverted-repeat transposable element insertion in the promoter region of CitRWP that cosegregated with polyembryony. This study provides new insights into citrus apomixis and constitutes a promising resource for the mining of agriculturally important genes.

Asexual reproduction is a remarkable feature of perennial fruit crops that facilitates the faithful propagation of commercially valuable individuals by avoiding the uncertainty associated with the sexual reproduction of hybrid plants. Many fruit crops reproduce or are propagated by asexual mechanisms. These mechanisms include natural means of propagation, such as layering, cutting and grafting 1 , and modern technologies, such as tissue culture. Apomixis is a naturally occurring mode of asexual reproduction that yields offspring that are genetically identical to the mother plant 2 . It is uncommon in agriculturally important crops apart from citrus and apple 3 . The first example of apomixis in citrus was reported in 1719, when Leeuwenhoek observed the development of two plantlets from the same seed 4 . A better understanding of apomixis and its introgression into agronomically important crops could potentially help revolutionize modern agriculture 5 . As demonstrated by fruit crops, apomixis enables breeders to fix valuable traits and heterozygosity. Most previous studies have focused on gametophytic apomixis 2,6 . Sporophytic apomixis, including nucellar embryony, is not as well studied.
Apomixis in citrus is sporophytic-the embryos develop from somatic nucellar cells-and very stable among commercial varieties, including sweet oranges, mandarins, grapefruits and lemons 7 . In general, 2-10 embryos develop within a seed. However, in particular genotypes, 30 or more embryos can develop in one seed 3,8 . This phenomenon is typically called 'polyembryony' and is considered an important trait for breeding purposes. On the one hand, polyembryony is widely employed in citrus nurseries and propagation programs to generate large numbers of uniform rootstocks from seeds. On the other hand, polyembryony has sometimes caused problems for breeding that required sexual crosses. Data from crosses between monoembryonic and polyembryonic cultivars indicate that one or two dominant loci control polyembryony 9,10 . Molecular markers linked to a polyembryony locus have been reported 7,11 . This locus was later mapped to a genomic region that spans approximately 380 kb 12 . Genes differentially expressed during polyembryogenesis have also been identified [13][14][15] .
Citrus fruit trees, designated as Citrinae, are a large group belonging to the subfamily Aurantioideae and the family Rutaceae 16 . Citrinae are classified into primitive citrus, near citrus and true citrus on the basis of botanical characteristics 16 . Atalantia buxifolia, also known as Chinese box orange and formerly named Severinia buxifolia, is an example of a primitive citrus species. The true citrus group comprises almost all of the commercially cultivated citrus, with major varieties belonging to the Citrus genus (Supplementary Note 1). The best known citrus varieties are the sweet orange (Citrus sinensis), mandarin (Citrus reticulata), lemon (Citrus limon), grapefruit (Citrus paradisi) and pummelo (Citrus grandis or Citrus maxima). There are a range of reproductive systems among the different varieties of primitive and modern citrus species that include sexual reproduction, clonal propagation and mixtures of these systems.

A r t i c l e s
We aimed to understand the genomic characteristics of primitive and cultivated citrus and the genetic basis of asexual reproduction in citrus. We de novo sequenced pummelo and three representative species of Citrinae. We also sequenced a population of 100 citrus accessions that ranged from primitive to cultivated citrus. Our genetic analyses of segregating and natural populations, and our transcriptome profiling, provide new insights into the genetic basis of apomixis in citrus.

Citrinae phylogeny and reproductive systems
Conserved single-copy genes from eight species of Citrinae were used for the construction of a phylogenetic tree (Fig. 1a). As expected, the primitive citrus Atalantia was located at a basal position in the tree. The spiny and wild species Citrus ichangensis (Ichang papeda, hereafter referred to as papeda) is phylogenetically located between primitive and cultivated citrus. The three basic species, C. reticulata (mandarin), C. grandis (pummelo) and C. medica (citron) showed close relationships to each other. Primitive, wild and cultivated citrus each have nine haploid chromosomes (Supplementary Fig. 1). Citrus species exhibit broad sexual compatibility. Interspecific and intergeneric hybridizations are feasible even between primitive and cultivated citrus 17 . Our experiments also support the grafting compatibility and partial sexual compatibility of Atalantia and Citrus (Supplementary Fig. 2). Atalantia and papeda are mainly seed propagated, with large or many seeds that mostly fill the locules of the fruit (Fig. 1b), and are not suitable for propagation by grafting owing to strong and numerous prickles and stiff twigs ( Supplementary  Fig. 3). In contrast, the vegetative propagule that facilitates propagation by grafting has an ideal morphology in mandarin, pummelo, sweet orange and other cultivated citrus (Supplementary Fig. 3).

A high-quality and three draft genomes of Citrinae
A high-quality genome of citrus was generated from haploid pummelo (Supplementary Fig. 4) by single-molecule sequencing. A total of 56.8× coverage of single-molecule sequences on the PacBio RS II platform via a shotgun approach was used for genome assembly (Supplementary Table 1). Additionally, 370.3× Illumina sequencing data were used to correct sequencing errors and to fill in the gaps (Supplementary Fig. 5 and Supplementary Table 2). The contig N50 and N90 of the final assembly were 2.2 Mb and 70 kb, respectively. Moreover, the scaffold N50 and N90 were 4.2 Mb and 565 kb, respectively ( Table 1). The sequence contiguity represents 18-fold and a 44-fold improvements of the C. clementina and C. sinensis genomes, respectively, as evaluated by contig N50 (refs. 18,19). Ninety percent of the assembly is in 101 scaffolds larger than 565 kb. The largest scaffold is 14.3 Mb. A total of 117 scaffolds, accounting for 87% of the assembled genome, were anchored by 1,563 molecular markers (Supplementary Figs. 6 and 7 and Supplementary Table 3).
Three draft genomes of heterozygous Citrinae species-atalantia (A. buxifolia, primitive citrus), papeda (C. ichangensis, wild citrus) and citron (C. medica) (Fig. 1a)-were sequenced and assembled with Illumina shotgun sequencing reads. High-coverage sequencing data that ranged from 164.4× to 401.9× were produced from multiple libraries, with small insert sizes that ranged from 140 bp to 500 bp and long insert sizes of 2 kb, 5 kb and 20 kb (Supplementary  Tables 4-6). The sequenced reads were assembled using the de Bruijn graph algorithm from the SOAPdenovo package 20 , resulting in scaffold N50 values of 1,074 kb, 504 kb and 367 kb (with contig N50 values of 23.9 kb, 76.6 kb and 46.5 kb) for atalantia, papeda and citron, respectively (Supplementary Table 7).
Ab initio gene prediction programs, homology searches and RNAseq analysis were integrated to annotate the pummelo genome. We identified 30,123 protein-coding genes with 42,886 transcripts. The gene transcripts had an average length of 1,572 bp, a mean coding sequence size of 1,141 bp and an average exon length of 277 bp. RNAseq data revealed that 12,763 gene loci (42.4%) encoded two or more transcriptional isoforms. Regarding the core and dispensable portions of these genomes, all of the genes in the six genomes (29,655 in sweet orange; 24,533 in clementine mandarin; 30,123 in pummelo; 32,579 in citron; 32,067 in papeda; 28,420 in atalantia) (Supplementary Table 7) were classified into 23,829 gene families on the basis of the homology of their encoded proteins. A total of 12,457 (52%) gene families were shared by all six of these genomes (Supplementary

Genomic characteristics of citrus
The four sequenced species from this study (atalantia, papeda, citron and pummelo), and the two previously published sequenced species (sweet orange and clementine mandarin) 18,19 range from primitive to cultivated citrus (Fig. 1). An overview of the genome synteny and sequence variation is presented in Figure 2 and Supplementary Figures 9-13. The genome sizes are similar among the Citrus species. We observed the most variation in genome size between Atalantia and the Citrus species (Supplementary Table 11 and Supplementary Note 2). On the basis of our presence-absence variation (PAV) analysis, 33.28% of the unique regions in pummelo (relative to atalantia) contained transposable elements (TEs), and 2.68% of the unique regions contained protein-coding genes (Supplementary Table 12). Approximately one-fifth of the genes in the unique regions (153 of 780 annotated genes) of pummelo were predicted to have known functions. These genes were significantly  enriched (P value < 0.05 and FDR < 0.05) in three GO terms: defense response and proteolysis (biological process category) and pectate lyase activity (molecular function category) relative to the entire genome of pummelo (Supplementary Fig. 14). Among the 153 genes with functional annotations, 93 are homologous to one or more genes (60.78%) (Supplementary Table 13). Conversely, the unique regions in atalantia relative to pummelo had a higher TE content (54.20%) than the entire atalantia genome (43.55%).
SNVs were confidently identified by sequence alignment (Supplementary Fig. 15) and filtered using resequencing data. A pairwise comparison of each species versus pummelo showed that the transition/ transversion ratios varied from 1.38 to 1.66, with an average of 1.57 (Supplementary Table 14). The distributions of SNVs in gene structures displayed the highest density in the 5′ and 3′ UTRs (an average of 18.8 and 17.2 SNVs per kb for 5′ and 3′ UTRs, respectively). The second highest density of SNVs was in introns (an average of 14.2 SNVs per kb). The lowest density of SNVs was in coding sequence (CDS) regions (an average of 9.7 SNVs per kb) (Supplementary Table 15).
The ratio of nonsynonymous to synonymous substitutions (dN/dS) was calculated for each of the 8,551 single-copy orthologs shared by these six citrus genomes. We found that 207, 100, 104, 66, 106 and 77 genes showed higher dN/dS values in atalantia, papeda, citron, pummelo, clementine mandarin and sweet orange (Supplementary  Table 16), respectively, relative to other species. For the dN/dS of single-copy orthologous genes, 12.79% of the genes showed a value greater than one in atalantia (Supplementary Table 17). This proportion was higher than the average value of 6.82% from other citrus (Supplementary Tables 18-22), indicating greater sequence divergence in atalantia relative to other citrus. These genes may have functional associations with the biological features of atalantia. For instance, Cg4g018790 encodes a cytokinin oxidase/dehydrogenase homologous to the rice OsCKX2 gene, which has been reported to control seed number 21 . In this gene, higher sequence polymorphisms were observed in atalantia relative to other citrus species (Supplementary Fig. 16).

Genetic diversity and comparative population analysis
Citrinae species exhibit high levels of morphological diversity ( Fig. 3a), which is a reflection of their diverse genetic backgrounds. We selected and sequenced 100 citrus accessions with an average depth of 30-fold genome coverage (Supplementary Table 23). Sequencing data from three mandarins, one orange and four pummelos from previous studies 18,19 were also analyzed. This set of 108 citrus accessions ( Supplementary Fig. 17) was used for population diversity analysis, genetic differentiation analysis and association analysis of the apomixis trait. For comparative population analysis, 15 atalantia accessions, 11 papeda accessions and 19 pummelo accessions, representing primitive, wild and cultivated citrus, were used ( Fig. 3b and Supplementary Fig. 18). The fixation index (F ST ) values among these citrus populations fluctuated around 0.28 ( Supplementary  Fig. 19), which indicates moderate genetic differentiations among citrus populations. For a comparison, the F ST was 0.34 between two wild populations of common bean and 0.44 between wild and cultivated cucumber 22,23 .
Nucleotide diversity (π) analysis indicated that atalantia has the highest genetic diversity and that papeda has a higher level of genetic diversity than pummelo (Fig. 3c). We observed a dramatic reduction in genetic diversity between atalantia and papeda. Additional  A r t i c l e s sequence analysis indicated that atalantia has more highly diverged regions (threefold more highly divergent regions, on average) than the other two populations (Supplementary Table 24). For example, a long region at the end of chromosome 5 (39.64-49.38 Mb, about one-fifth of the chromosome) displayed the highest diversity in atalantia and a dramatic reduction in diversity in papeda and pummelo (Fig. 3c).
Genomic regions that showed both high differentiation and reduced diversity were identified among the primitive, wild and cultivated citrus (Supplementary Fig. 20 and Supplementary Tables  25-27). The results of pairwise comparisons indicated that the distribution of such regions was uneven along the chromosomes ( Fig. 3d and Supplementary Fig. 21). On the basis of the pairwise comparison between atalantia and pummelo, we concluded that a total of 18.1 Mb of genomic regions, containing 2,430 genes, were probably under selection in pummelo ( Fig. 3d and Supplementary Table 28). These regions include genes associated with vegetative growth, such as Cg7g021010, Cg4g017400 and Cg9g010410, which are homologous to TCP15 (ref. 24), LAS 25 and YUCCA3 (ref. 26), respectively (Supplementary Fig. 22). Notably, we observed a strong signal on chromosome 9, spanning 220 kb, from 13.8 Mb to 14.0 Mb ( Fig. 3d and Supplementary  Fig. 23) in the pummelo genome. Thirteen of the fifteen annotated genes in this region have functions related to cytochrome c biogenesis, mitochondrial proteins, ATP synthase, NADH dehydrogenase and mitochondrial ribosomal proteins (Supplementary Table 25). Comparative analysis also indicated that flowering-related genes, such as the gene encoding the WUSCHEL-related homeobox (WOX) 27 transcription factor (Cg4g011660; Supplementary Fig. 24) and the gene encoding the flowering time control protein FPA (Cg2g027110; Supplementary Fig. 25), showed strongly reduced diversity and a high degree of differentiation. Therefore, we suggest that these genes were probably under selection. These regions were shown to harbor signals of selection as defined by recent studies 22,28 . The effect caused by other possibilities, such as genetic drift and bottlenecks 29 , could not be excluded but might be limited, as indicated by the similar pairwise sequentially Markovian coalescent (PSMC) curve of these three populations (Supplementary Fig. 26). FPA has a critical role in the regulation of flowering time in Arabidopsis 30 . This finding is consistent with the observation that atalantia has a prolonged flowering season (from May to December) and that in contrast, the flowering season is synchronized among all of the other cultivated citrus.

The genomic basis of citrus apomixis (nucellar polyembryony)
The emergence of a particular type of apomixis, nucellar polyembryony, in mandarin and most cultivated varieties of citrus is one of the prominent features of the recent evolution of citrus (Fig. 4a). We scored the polyembryony phenotypes of 124 fruit-yielding progeny from a segregating population derived from a cross between HB pummelo (monoembryony) and Fairchild mandarin (polyembryony) that exhibited contrasting phenotypes (Supplementary Table 29). For the bulk segregant analysis (BSA), the transformed ∆(SNP index) was calculated and plotted against the genomic sequence in 250-kb sliding windows with step sizes of 10 kb (Fig. 4b). The peak value for the transformed ∆(SNP index) was localized to a region spanning from 23.695 to 25.655 Mb on chromosome 4. These data indicated that this 1.96-Mb region harbors a major locus contributing to nucellar polyembryony.
To fine map the polyembryony locus, a local association analysis with the variant information from the protein-encoding genes in the 1.96 Mb region was performed using the genome sequencing data from the 108 aforementioned accessions, including 45 polyembryonic and 63 monoembryonic accessions (Supplementary Table 23). We employed P values from the two-tailed Fisher's exact test to measure the correlations between nucleotide variation and phenotype   (d) Distribution of the reduction of diversity (log 10 π ratios) and differentiation (F ST values) from the pairwise comparison of the atalantia and pummelo populations. π ratios were calculated as π atalantia /π pummelo in 50-kb sliding windows with 10-kb steps. The region above the dashed line in the distribution of log 10 π ratios corresponds to the 5% right tail of the empirical distribution (log 10 π ratio = 0.36). The regions above the dashed line in the F ST values distribution are in the 5% right tail of the empirical distribution (F ST is 0.40). Fig. 27), which were adjusted with a false discovery rate (FDR) correction for multiple testing. This analysis localized the polyembryony locus to an ~80-kb region on chromosome 4, between 24.54 and 24.62 Mb, that contains 11 genes ( Fig. 4c and Supplementary  Fig. 28). Consistent with this result, the population differentiation between the polyembryonic and monoembryonic accessions was substantially higher for this 80-kb region than the surrounding regions (Supplementary Fig. 29). The linkage disequilibrium (LD) among polyembryonic accessions was larger than the LD among monoembryonic accessions (Supplementary Fig. 29), which indicates a selection signature between monoembryonic and polyembryonic citrus.

(Supplementary
Among the 11 candidate genes, Cg4g018910 and Cg4g018970 showed the highest association with the polyembryony phenotype (Fig. 5a). Expression analysis of all candidate genes indicated that Cg4g018970 was highly expressed in ovules and specifically expressed in polyembryonic citrus cultivars but not in monoembryonic citrus cultivars (Fig. 5b,c and Supplementary Fig. 30). Cg4g018970 encodes a RWP-RK domain-containing protein similar to the Arabidopsis RKD family of proteins, which serve as regulators of egg cell-related genes 31 . Thus, we hereafter refer to Cg4g018970 as CitRWP. A sequence alignment of CitRWP from polyembryonic and monoembryonic citrus identified four SNPs in the gene sequence and an insertion of a miniature inverted-repeat TE (MITE) in the promoter region (g.24610316_24610317ins(203) in chromosome 4 in the reference of pummelo genome (MKYQ00000000)) ( Fig. 5d and Supplementary  Fig. 31). To test whether this transposon is linked to the polyembryony locus, we screened the 124 fruit-yielding individuals from the segregating population of 217 progeny derived from an HB pummelo × Fairchild mandarin cross. We did not find this MITE insertion in any of the monoembryonic progeny (n = 66). In contrast, we found the MITE insertion in all of the polyembryonic progeny (n = 58) (Supplementary Fig. 32). Thus, we conclude that this MITE insertion is associated with the polyembryony locus. Next, we used a germplasm collection containing 786 accessions of citrus to independently test this association. Consistent with our results, we found this MITE insertion in all of the polyembryonic accessions (n = 213). In contrast, all the monoembryonic accessions (n = 573) lack this particular MITE insertion (Fig. 5e,f and Supplementary Table 30).

Transcriptome of sexual and nucellar embryos from citrus
To investigate the molecular processes and genes involved in sexual and nucellar embryogenesis in citrus, we compared the RNA-seq data derived from leaves, ovules, fruits and seeds of pummelo (sexual type) and sweet orange (asexual type). We found that 1,637 and 1,840 genes were highly expressed in ovules (≥2-fold higher than in other tissues) of pummelo and sweet orange, respectively ( Fig. 6a and Supplementary Tables 31 and 32). We found that 853 of these genes were highly expressed in both pummelo and sweet orange. Gene Ontology (GO) analysis indicated that the genes associated with microtubule-based movement, DNA packaging, cell cycle, DNA replication, mitosis and transcription regulator activity were over-represented among the genes that were highly expressed in ovules (Supplementary Table 33). The genes that were highly expressed in the ovules of sweet orange were significantly (P < 0.05; FDR < 0.05) enriched in almost the same gene ontologies as the group of 853 genes that were highly expressed in the ovules of both pummelo and sweet orange, such as regulation of transcription, mitosis, nuclear division and mitotic cell cycle, which contribute to cell division (Supplementary Table 34). These data highlight similarities in the transcriptomes of nucellar and sexual embryos.
To identify genes that are differentially expressed between polyembryony and monoembryony, a comparative transcriptome analysis was performed on the ovules from three pairs of monoembryonic and polyembryonic citrus cultivars (pummelo-grapefruit, citron-lemon, clementine-ponkan) at the key stage of nucellar embryo initiation (i.e., 3 d before anthesis) 14,15 . Considering the genetic differences between each pair of citrus cultivars, we performed a pairwise comparison of the three pairs of RNA-seq data. Among the 2,624 genes that were preferentially expressed in the ovules of pummelo and sweet orange, 29 showed markedly different (≥2-fold) expression levels  C g 4 g 0 1 9 0 0 0 C g 4 g 0 1 8 9 8 0 C g 4 g 0 1 8 9 7 0 C g 4 g 0 1 8 9 6 0 C g 4 g 0 1 8 9 4 0 C g 4 g 0 1 8 9 3 5 C g 4 g 0 1 8 9 3 0 C g 4 g 0 1 8 9 1 0 C g 4 g 0 1 8 9 0 0 C g 4 g 0 1 8 8  A r t i c l e s between each pair of monoembryonic and polyembryonic cultivars ( Fig. 6b and Supplementary Table 35). The annotation of these genes and analysis of homologous Arabidopsis sequences indicated that 11 of the 29 genes encode proteins with unknown function. Almost half of the remaining genes were associated with the biological processes of stress response (RNS1, ATSOD1, Cg4g011120 and Cg9g004560) and oxidation reduction (ATBBE18 and Cg3g004030). Three additional genes (EDA24, AGL2 and ACR4) contribute to flower and embryo development in Arabidopsis and other plants [32][33][34] . These genes are expressed in the ovules of both monoembryonic and polyembryonic cultivars and are expressed at dramatically higher levels in polyembryonic cultivars. CitRWP shows a similar expression trend in polyembryonic cultivars, which is consistent with our conclusion, from genetic analysis, that CitRWP is the key candidate gene controlling polyembryony ( Fig. 6c and Supplementary Tables 36 and 37).

DISCUSSION
We sequenced and assembled one high-quality genome and three draft genomes of Citrinae species. The haploid pummelo genome represents the most contiguous citrus genome assembly to date. The sequence contiguity (contig N50) of the long reads assembly using PacBio data is at least 18-fold higher than that of recently published citrus genomes 18,19 . The shotgun approach using PacBio singlemolecule sequencing in combination with Illumina data is effective and efficient for a de novo genome assembly. This high-quality pummelo genome together with the three draft genomes of citron, papeda and atalantia constitute a valuable platform for citrus genetic and genomic research. Change in the reproductive system is one of the striking features in the evolution and domestication of fruit crops 35 . Consistent with this observation, the sequence analysis of the six genomes that included primitive, wild and cultivated citrus indicated frequent variations in flowering-and seed-related genes, such as CKX 21 . The comparative population analysis of these species indicated that mitochondria-and reproduction-associated genes, such as WOX 27 and FPA 30 , are probably under selection in cultivated citrus.
Apomixis has received considerable attention because of its capability to permanently fix valuable traits and hybrid genotypes (hybrid vigor). It is possible that apomixis was used inadvertently during the history of citrus breeding and selection. When desirable variations and traits were observed, the apomictic lines tended to have a greater chance of being selected, maintained and dispersed. This may also explain why most of the commercial and elite citrus are apomictic. In this study, we provided a comprehensive analysis of citrus apomixis (polyembryony) using genetic, genomic and transcriptomic approaches. The segregation ratios of both the phenotype and genotype in a segregating F 1 population are consistent with the single dominant gene model for nucellar embryony 9 . The cosegregation of the MITE insertion in the promoter region of CitRWP with the polyembryonic phenotype lends more support to the single dominant mutation model for the emergence of apomixis in mandarin. Our genomic analysis localized the candidate region to an 80-kb interval containing 11 genes, which was embedded in a 380-kb region reported by a previous study 12 . In this interval, we identified a promising candidate gene for the single dominant allele responsible for polyembryony in citrus. Notably, this candidate gene, CitRWP, is specifically expressed in ovules and at higher levels in polyembryonic cultivars. A gene harboring the same domain, AtRKD4 (GROUNDED), was reported to promote embryogenesis in somatic cells when transiently overexpressed 36,37 . Fully elucidating the contribution of CitRWP to apomixis will require knocking out CitRWP in citrus. A r t i c l e s Our transcriptome analyses provide global insights into the molecular processes involved in the development of sexual and asexual (nucellar) embryos in citrus. The results indicate a close developmental relationship between sexual reproduction and apomixis in citrus, which is consistent with a report on apomictic Boechera gunnisoniana 38 . The unusually high expression of three sexual embryogenesis-associated genes during apomixis (AGL2, EDA24 and ACR4) provides more support for the notion that nucellar embryogenesis and sexual reproduction may utilize similar genes or pathways. Previous studies reported that extreme stress induces the initiation of autonomous embryo development in somatic cells in response to exogenous or endogenous signals 39 . Here we identified six nucellar embryo-associated genes that are regulated by stress, which is in agreement with a previous report on apomixis in Hieracium praealtum 40 .
This genomic information on primitive, wild and cultivated citrus may also be valuable for the future analysis of metabolism and to further investigate the basis of historical uses of citrus species. For example, C. medica was historically used for medicinal purposes, and atalantia was also used as a traditional medicinal plant for folk medicines aiming to treat malaria, chronic rheumatism, paralysis and snakebites 41 . Additionally, wild and primitive citrus, which have been largely neglected during the history of citrus breeding, exhibit higher genetic diversity that is useful for managing disease in cultivated citrus with narrow genetic backgrounds resulting from apomixis, clonal propagation and human selection. Future characterization and utilization of genes controlling apomixis, metabolism, disease resistance and other agriculturally important traits is a promising path for the improvement of crop breeding.

METHODS
Methods, including statements of data availability and any associated accession codes and references, are available in the online version of the paper.  the sequencing reads from the 108 accessions to the corresponding target sequences. The reads of each accession that mapped to the 1.96 Mb region were identified using the bam2fastq software (see URLs). These reads were remapped to gene sequences using the pummelo candidate regions as templates. After using iCORN2 (ref. 57) to substitute SNPs and small indels (1-3 bp), we obtained 292 gene sequences in the candidate region of each accession. For each of the 292 genes in the candidate region, a multiple sequence alignment of homologous sequences from 108 accessions was performed using ClustalW 58 . For each polymorphic site, the P value from the two-tailed Fisher's exact test was calculated to quantify the associations between the nucleotide polymorphisms and the polyembryonic trait. The P values were adjusted using the FDR correction for multiple testing. The sites were retained if they fit the criteria of P < 0.01 and FDR < 0.01 (Supplementary Note 9). For the 11 candidate genes in the 80-kb region, all of the SNPs in the gene sequence from the 108 accessions were extracted and analyzed. For each SNP, the proportion of the accessions in which the genotype of the site was associated with the phenotype of the embryo was regarded as an association score. The SNPs with association scores greater than 80% were used for further SNP association analysis (Fig. 5a).

MITE marker development.
The 203-bp MITE insertion is located on chromosome 4 of the pummelo genome between nucleotides 24,610,316 and 24,610,317, which is in the promoter region of CitRWP. Two pairs of primers ('mite_p1' and 'mite_p2' , Supplementary Table 41) were designed. Primer pair 'mite_p1' contains degenerate nucleotide to be used for the broad germplasm of 786 accessions. Primer pair 'mite_p2' was specifically designed to be used for the progeny from the HB pummelo × Fairchild mandarin cross. Both the primer pairs anneal approximately 100 bp upstream (forward primer) and 200 bp downstream (reversed primer) of the MITE insertion site (Supplementary Note 10). We performed a PCR-based screen to detect the MITE insertion. PCR was carried out in 20-µl reaction volumes containing 10 µl of 2× Taq Plus Mix (Vazyme Biotech), 100 ng genome template, 0.2 mM of each forward and reverse primer. The amplification was done at 95 °C for 5 min, followed by 30 cycles of 30 s at 95 °C, 30 s at 52 °C, and 60 s at 72 °C, and finally 5 min at 72 °C.
Transcriptome sequencing and analysis. A total of 26 samples from leaves, ovules, fruits, and seeds were harvested for RNA-seq analysis. Two replicates of plant tissue were analyzed. RNA-seq libraries were constructed and sequenced on Illumina Genome Analyzer platform. An average of 4 Gb data were generated for each sample (Supplementary Table 42). The RNA-seq reads were aligned to the pummelo genome sequences using TopHat 59 . The expression level of each predicted transcript in each RNA-seq library was calculated as the FPKM with Cufflinks 60 . Gene Ontology (GO) enrichment analysis was performed using the web-based agriGO program with the default parameters (P < 0.05 and FDR < 0.05, Supplementary Tables 33, 34 and 43 and Supplementary Note 11).
Quantitative PCR analysis. Total RNA from all the tissues was extracted using the TRIzol reagent (Takara). cDNA was synthesized using 1 µg total RNA and the HiScript II QRT SuperMix for qPCR (Vazyme, R223-01). qRT-PCR was performed on an ABI 7900 instrument (Applied Biosystems) using SYBR Green PCR Mastermix according to the manufacturer's instructions (Kapa, RR420). The cycling conditions included incubation for 3 min at 95 °C followed by 40 cycles of amplification (95 °C for 5 s and 60 °C for 35 s). Using the citrus β-actin gene as the internal reference gene, relative gene expression values were calculated using the 2 −∆∆Ct method 61 . The primers used in qRT-PCR are listed in Supplementary Table 41. Statistical analyses. We used the Fisher's exact test to conduct the GO enrichment analysis of the target genes relative to the background of the corresponding entire genome using agriGO program. The likelihood ratio test was used to determine the branch model during the dN/dS analysis using PAML program. The two-tailed Fisher's exact tests were used to quantify the associations between the nucleotide polymorphisms and the polyembryonic trait. Moreover, all the P values from the tests mentioned above were adjusted using the FDR correction for multiple testing. The filtering criteria of P < 0.05 and FDR < 0.05 were given for the analysis of GO enrichment analysis and likelihood ratio test, and the filtering criteria were P < 0.01 and FDR < 0.01 for the associations analysis of citrus polyembryony.