Discovery of SNPs and InDels in papaya genotypes and its potential for marker assisted selection of fruit quality traits

Papaya is a tropical and climacteric fruit that is recognized for its nutritional benefits and medicinal applications. Its fruits ripen quickly and show a drastic fruit softening, leading to great post-harvest losses. To overcome this scenario, breeding programs of papaya must invest in exploring the available genetic variation to continue developing superior cultivars with improved fruit quality traits. The objective of this study was to perform a whole-genome genotyping (WGG) of papaya, predict the effects of the identified variants, and develop a list of ripening-related genes (RRGs) with linked variants. The Formosa elite lines of papaya Sekati and JS-12 were submitted to WGG with an Illumina Miseq platform. The effects of variants were predicted using the snpEff program. A total of 28,451 SNPs having Ts/Tv (Transition/Transversion) ratio of 2.45 and 1,982 small insertions/deletions (InDels) were identified. Most variant effects were predicted in non-coding regions, with only 2,104 and 138 effects placed in exons and splice site regions, respectively. A total of 106 RRGs were found to be associated with 460 variants, which may be converted into PCR markers to facilitate genetic mapping and diversity studies and to apply marker-assisted selection (MAS) for specific traits in papaya breeding programs.

Papaya (Carica papaya L.) is a fruit crop cultivated in tropical and subtropical regions of the globe that is listed among the four major fresh tropical fruits. In Brazil, papaya is an important crop with a production of around 1.06 million tonnes in 2018, placing the country as the second major producer and the third major exporter, although with most of the production destined for the domestic market 1 . Papaya fruits are appreciated and highly indicated for their excellent nutritional and medicinal qualities, possessing high vitamin A and C content, antioxidants such as β-carotene and lycopene, minerals, and fibers 2, 3 .
In papaya, several genetic and genomic resources are available due to the great advances of sequencing technologies, which have contributed to understand the intriguing sex-determination system of the species [4][5][6][7][8] . Besides the sex determination of papaya, other relevant traits have been investigated through gene expression analysis, such as the fruit quality-related traits [9][10][11] , embryogenesis 12 , resistance to drought 13 , etc. However, the utilization of sequencing technologies to identify DNA polymorphisms for the genetic mapping of important traits for papaya breeding is scarce. The available linkage maps for papaya have varied in coverage, resolution, and type of DNA polymorphisms. The first high-density linkage map was based on 1498 Amplified Fragment Length Polymorphisms (AFLP) 14 . The following high-density map was developed with 706 Simple Sequence Repeat (SSR) markers 15 . The same mapping population was used to improve the map resolution with 277 AFLP and 712 SSR markers and allowed the identification of 14 quantitative trait loci (QTL) related to fruit quality traits 16 . More recently, a linkage map based on 219 single nucleotide polymorphisms (SNP) was developed 17 . Although this map was based on SNP markers, the great distortion of the expected marker segregation observed in F 2 (1:2:1) significantly decreased the map resolution. Still, a total of 21 QTLs for fruit quality traits were detected using this map and will enable candidate gene isolation and development of marker-assisted selection strategies.
DNA variants such as SNPs and InDels are very abundant in all genomes and are thought to bring out the phenotypic differences among individuals of a species, including differences related to yield and fruit quality  [18][19][20][21] . SNPs and InDels are quickly identified through Next Generation Sequencing (NGS) technologies and numerous studies in climacteric fruit crops revealed the potential of NGS-based markers for the genetic mapping of fruit quality traits 17,[22][23][24] .
Understanding the genetic and genomic aspects related to fruit quality traits in papaya is essential to continue developing superior cultivars with unique features to meet both the national and international markets. The conventional breeding of papaya for complex traits, such as fruit firmness and total soluble solids (TSS) content, is time-consuming and only gives small genetic gain per selection cycle. The ethylene is the main phytohormone regulating the ripening of climacteric fruits and its action influences the development of the sensorial and nutritional attributes of climacteric fruits 25 . One major change in texture during the ripening of such fruits is the rapid fruit softening, turning it more susceptible to physical injuries and post-harvest diseases. Fruit softening is a complex process with substantial activity of cell-wall degrading enzymes, such as polygalacturonase and beta-galactosidase 9,10 . Another problem of papaya breeding in Brazil is the occurrence of viral diseases that due to federal legislation the papaya plants even in breeding fields must be cut down when showing the first symptoms of viral diseases mainly the Papaya ringspot virus (PRSV), not allowing complete measurements in breeding populations. Thus, the use of molecular markers could speed up the time for selection in papaya breeding programs by allowing the analysis of a higher number of progenies at an early stage of development and increase the genetic gain 26 .
In Brazil, the papaya breeding program at UENF has had great success in the development of 21 new papaya cultivars 27 , which reduced the need to import hybrid seeds, expanded the options for farmers and consumers, and placed the Country as a potential papaya seed exporter. One of these cultivars is the UC10 hybrid, with fruits of around 1.9 kg and a high yield 28 . The parental of this hybrid are the Formosa elite lines Sekati and JS-12, which are contrasting for agronomic and fruit quality attributes. The Sekati parent produces large fruits, excellent pulp firmness, and median soluble solid contents. On the other hand, the JS12 parent diverges from Sekati in the last two traits, since it presents moderate pulp firmness and high soluble solid contents 29 . The availability of genomic information related to fruit quality traits will enable the development of tools to aid the selection process in papaya. Thus, in this study, we carried out a genome-wide identification of DNA variants among the Formosa elite lines Sekati and JS-12, using an Illumina MiSeq platform. The identified variants were used to predict its effects according to genomic location and to develop a list of ripening-related genes with linked variants to facilitate further genotype/phenotype association studies and to apply marker-assisted selection for the papaya breeding.

SNP and InDel discovery and chromosomal distribution.
A total of 12,709,090 sequence reads (with length ranging from 31 to 251 bp) were obtained from the Sekati and JS-12 lines. The Sekati sample generated 1.16 Gb of sequencing data (4,237,292 reads), while the JS-12 sample generated 2.4 Gb (8,471,798 reads). Mapping of the clean reads, after removing low quality reads, against the papaya reference genome resulted in the identification of 28,451 SNPs and 1,982 InDels (1,061 insertions and 921 deletions). The average coverage of variants was ~ 3.12× and ~ 5.02× for the Sekati and JS-12 lines, respectively.
The SNPs were identified in all nine papaya chromosomes (Fig. 1). The highest number of SNPs was observed on chromosome 4 (3,375 SNPs) and the lowest on chromosome 5 (1,751 SNPs). A total of 8,079 SNPs (28.4%) were identified in contigs and scaffolds that are not mapped to any papaya linkage group 15 and they were attributed to unmapped contigs and scaffolds. The comparison of SNPs identified in the lines Sekati and the JS-12 revealed that they share about 78% (22,629) and 22% (5,822), respectively, of the genome-wide SNP alleles with the reference genome, which is the SunUp, a transgenic variety of the Solo heterotic group. The lines showed different levels of SNP similarities with the SunUp in all chromosomes. The Ch4 and Ch7 of Sekati shares about 94.3% and 82.1% of similarity with the reference genome, respectively. On Ch6, Ch9, and Ch8 the similarity of Sekati with the reference is the less, showing about 68.3%, 70.66%, and 71.5% of similarity, respectively. The remaining chromosomes of Sekati presented the similarity of SNPs close to the genome-wide average. The highest similarities of JS-12 alleles with the reference were observed on Ch6 (31.7%), Ch9 (29.34%), and Ch8 (28.5%). On Ch4 and Ch7 the similarity was 5.7% and 17.9%, respectively. The allele similarities for the remaining chromosomes of JS-12 were close to the genome-wide average. www.nature.com/scientificreports/ The InDels were found in all nine papaya chromosomes (Fig. 1). The highest InDel number was found to be 260 on Ch4, while the lowest was 112 on Ch9. A total of 529 InDels were observed on unmapped contigs and scaffolds.
Based on nucleotide substitutions, the SNPs were classified as transitions (purine-purine and pyrimidinepyrimidine) or transversions (purine-pyrimidine and pyrimidine-purine). We found 20,199 transitions and 8,252 transversions, with a genome-wide transition to transversion ratio (Ts/Tv) of 2.45. Observation of SNPs in coding regions revealed that the nucleotide substitution frequency and the Ts/Tv ratio were higher at the third codon position (2.40), compared to the second (1.96) and first (1.83) codon positions (Table 1).

Functional classification of DNA variants.
A total of 58,498 effects based on genomic position were predicted from 30,433 DNA variants. The higher number of effects compared with the number of variants is because one specific variant can affect multiple genes (e.g. a variant can be downstream from one gene and upstream from another gene). The SNPs and InDels caused a total of 54,100 and 4,398 ( Fig. 2) effects, respectively. The effects of the variants were classified into four categories: modifier (56,380), low (1,117), moderate (1,062), and high (63) impact. Only 4% and 1.4% of the SNP and InDel effects, respectively, were placed in coding regions.
High impact variants had a direct impact on gene functionality. A total of 32 and 31 high impact variants were observed for SNPs and InDels, respectively. The most common effects caused by high impact SNPs are stop codon lost and stop codon gain (Fig. 3a), which may lead to a high level of functional consequences. Meanwhile, high impact InDels mainly caused disruption of the translational reading frame and may result in abnormal protein products with an incorrect amino acid sequence. Moderate impact SNPs caused a change in one amino acid due to a non-synonymous substitution (Fig. 3b). The InDels caused four types of effects in coding regions that were classified as moderate impact (Fig. 3b). Low impact SNPs mainly consisted of synonymous substitutions in which no change of amino acid is observed (Fig. 3c). The remaining effects were predicted in non-coding regions and they were classified as modifier impact (Fig. 3d).
Identification of fruit ripening-related genes with linked variants. We selected 48 differentially expressed genes (DEGs) during the fruit ripening process of papaya determined by RNAseq, including 20 cell wall-related genes (CW), 13 chlorophyll and carotenoid metabolism-related genes (CCM), four proteinases and their inhibitors (PROT), six plant hormone signal transduction pathway genes (PH), four transcription factors (TF), and one senescence-associated gene (SEN) 10 . These genes were used as Blastp queries to identify other ripening-related genes within the papaya genome. This search resulted in the identification of other 143 genes that are potentially involved in the fruit ripening process due to sequence similarity.   (Table 2).

Discussion
The frequency of SNPs and the Ts/Tv ratio was higher at the third codon position, compared with the second and first codon position (Table 1), revealing a trend of genomic conservation at codon sites during evolution. This trend was also observed in SNPs identified in Expressed Sequence Tags (ESTs) from Solanum lycopersicum and S. habrochiates 27 .
SNPs are known to be associated with many quantitative trait loci in plants 20,[30][31][32][33] and an individual SNP can have a large impact on the phenotype 34,35 . We found 2180 SNPs located in coding regions and 26,271 in noncoding regions of the papaya genome. Although most SNPs are not located inside genes, their abundance and robustness make them an important source of DNA variation to help papaya breeding programs in the development of superior cultivars. InDels also play important roles in the phenotypic variation observed between individuals of a species. In papaya, a dinucleotide insertion mutation in the gene encoding the enzyme lycopene β-cyclase (CpCYC-b) causes the phenotypic variation of red and yellow flesh 36 . When found in coding regions the InDels generally disrupt the translational reading frame (frameshift variant), except when the mutation is a multiple of three nucleotides 37 . In this study, we identified 62 InDels located in coding regions and 28 of these causing disruptions of the translational reading frame.
Fruit quality is one of the most important features pursued by papaya breeding programs, especially the selection of genotypes that keep fruit firmness for a longer period, resulting in longer shelf-life and decrease www.nature.com/scientificreports/ post-harvest losses. Studies at the gene expression level were developed to isolate the key genes underlying the fruit ripening process and fruit softening of papaya 9,10,38 . However, these studies analyzed only one genotype at a time and not considered the variation within DNA sequences among different papaya genotypes. During the ripening process of climacteric fruits such as papaya and peach, a positive feedback loop regulated by NAC transcription factor is thought to control the ethylene synthesis. This mechanism is observed in species that lack recent whole-genome duplication (WGD). On the other hand, climacteric fruit species with recent WGD, such as tomato, pear, and apple, appear to have evolved a MADS-type transcription factor positive feedback loop controlling ripening 25 . Fruit softening in papaya is mainly caused by the degradation of primary cell wall polymers. Several cell wall-degrading enzymes act cooperatively in a coordinated process to degrade the cellulose-hemicellulose matrix which is embedded in a structurally heterogeneous mixture of pectin 39 . While ethylene promotes fast fruit softening, on another hand it is also thought to improve the rate of sugar synthesis, transport, and degradation during the ripening of papaya. Several genes related to sugar metabolism are upregulated in response to ethylene during the ripening process 10 . Plant hormones also play important roles in controlling several processes of growth and development in plants. Besides the importance of ethylene for the fruit ripening process, other types of plant hormones can take place synergically or antagonistically with the ethylene action during the ripening of climacteric fruits. Besides, one of the major physiological changes observed during the ripening of papaya is a fast color change 10,38 . This is because of the fast degradation of chlorophyll and the appearance of carotenoids such as lutein and β-carotene 11 . Other genes that are involved in fruit softening include the class of protease enzymes. Studies have shown that some proteases have increased expression during the ripening process of papaya 10 and tomatoes 40 .
The availability of SNPs and InDels strongly associated with ripening-related genes of papaya is essential to develop studies of diversity, genetic mapping, and application of marker-assisted selection. Thus, we searched for www.nature.com/scientificreports/ DNA variants that are linked with ripening-related genes that are up or down-regulated in response to exogenous ethylene 10 and genes identified using BLASTp. A total of 106 genes with at least one variant associated, either inside or in the flanking region of the gene, were identified (Supplementary S1). The association between an SNP and InDel with a trait of interest can be accessed through the linkage disequilibrium analysis 41 , using the quantitative trait (QTL) analysis for example. Further analysis will examine the genotype-phenotype association related to fruit ripening traits in a segregant population derived from the cross between the Sekati and JS-12 lines. It is expected that the presence of alleles for these fruit ripening-related genes in papaya germplasm and breeding populations can contribute to observed differences for the fruit firmness and TSS content among papaya genotypes. The association of genotypic alleles with a trait of interest points to a genomic region where one or more genes may be affecting the phenotype. To effectively apply MAS in breeding programs the candidate genes have to be identified and validated through functional analysis. After all these identification and validation steps, DNA markers based on PCR, such as the low-cost technique called single nucleotide amplified polymorphism (SNAP) 42 or the real-time fluorescence-tagged probes technologies such as TaqMan, Kompetitive allele specific PCR (KASP), or rhAmp 43 , will be developed to apply marker-assisted selection and to direct gene editing studies in papaya breeding programs.  48 . To facilitate visualizing the overall distribution of variants across the papaya chromosomes, the contigs and scaffolds of the reference genome, which is still a draft version, were associated with 10 papaya linkage groups (LGs) 15 and the LGs with a pachytene chromosome-based karyotype of papaya 49 .

Annotation of single nucleotide polymorphisms and insertion/deletion polymorphism.
To predict the putative effects of DNA variants according to genomic location, the snpEff v4.3 program was used 37 .
To perform the analysis a C. papaya binary database file (.bin) was built in snpEff using the papaya reference genome in Fasta format 46 and an annotation file in gff3 format, both downloaded from the PLAZA: Comparative Genomics In Plants. A variant call format (VCF) file containing the SNPs and InDels was then annotated with the snpEff program using default parameters. The variants were classified as genic and intergenic according to their genomic location. The variants in intergenic regions are classified as Modifier impact and do not affect the coding regions of genes. Variants located in introns are classified as Modifier impact as well. The variants placed in coding genic regions can generate three types of impacts, such as low, moderate, and high impact. Low impact variants (e.g. synonymous variant) are assumed to be mostly harmless or unlikely to change protein behavior, while a non-disruptive variant that might change protein effectiveness is considered of moderate impact (e.g. missense variant and inframe deletion). The variants with high impact (e.g. stop gained and frameshift variant) probably cause protein truncation or loss of function 37 .
Identification of fruit ripening-related genes with linked variants. To identify ripening-related genes, we selected 48 genes isolated from a differential gene expression experiment during the fruit ripening process of papaya fruits 10 . The protein sequences of the 48 differentially expressed genes (DEGs) were used as queries to identify genes with related function based on sequence similarity within the papaya genome. The Blastp tool available at Phytozome was used and the ripening-related genes were selected with a minimum of 50% identity and E-value ≤ 1e−20. We removed from the list of the ripening-related genes those identified by Blastp with no expression during fruit development and ripening of papaya 25 and the genes without variants. We also removed the variants farther than 40 kb from the gene start/end.