Development and Application of InDel Markers for Capsicum spp. Based on Whole-Genome Re-Sequencing

Genome-wide identification of Insertion/Deletion polymorphisms (InDels) in Capsicum spp. was performed through comparing whole-genome re-sequencing data from two Capsicum accessions, C. annuum cv. G29 and C. frutescens cv. PBC688, with the reference genome sequence of C. annuum cv. CM334. In total, we identified 1,664,770 InDels between CM334 and PBC688, 533,523 between CM334 and G29, and 1,651,856 between PBC688 and G29. From these InDels, 1605 markers of 3–49 bp in length difference between PBC688 and G29 were selected for experimental validation: 1262 (78.6%) showed polymorphisms, 90 (5.6%) failed to amplify, and 298 (18.6%) were monomorphic. For further validation of these InDels, 288 markers were screened across five accessions representing five domesticated species. Of these assayed markers, 194 (67.4%) were polymorphic, 87 (30.2%) monomorphic and 7 (2.4%) failed. We developed three interspecific InDels, which associated with three genes and showed specific amplification in five domesticated species and clearly differentiated the interspecific hybrids. Thus, our novel PCR-based InDel markers provide high application value in germplasm classification, genetic research and marker-assisted breeding in Capsicum species.

technology, InDels have been identified and developed extensively through re-sequencing and have become a valuable resource for the study of various organism, especially plants and animals [30][31][32][33] . The publication of pepper genomic date has provided an important platform for the detection and development of genome-wide InDels 2, 34 . In Capsicum, multiple genetic maps were constructed with InDels based on intraspecific or interspecific populations 9,27,33 . In addition, InDels markers were used for QTL analysis in pepper, such as CMV resistance and initiation of flower primordia 25,28 . However, discovery efforts for InDels have lagged significantly behind those for SNPs, and relatively few InDels have been developed and applied in pepper 28,35,36 , nor have they been used with any frequency for pepper variety characterization or germplasm diversity assessment.
The purpose of the present study was to discover and develop stable and practical InDels based on re-sequencing data from C. annuum cv. G29 and C. frutescens cv. PBC688, as compared to a reference genome sequence, which could be detected with simple procedures based on size separation. Furthermore, identified polymorphic InDels among five domesticated species including two re-sequencing accessions and five additional ones. These reliable polymorphic InDels will become a useful resource for the Capsicum species identification, genetic relationship analysis and hybridization studies.

Materials and Methods
plant materials. Two pepper lines C. annuum cv. G29 and C. frutescens cv. PBC688 were selected for re-sequencing in this study. The former is a sweet line ssceptible to CMV, but with excellent horticultural traits, while the latter represents a wild small-fruited hot accession highly resistant to CMV. Among the 176 accessions introduced by Dr. W.P Diao from the National Plant Germplasm System (NPGS) of United States Department of Agriculture (USDA) in 2015, we selected 63 accessions representing five domesticated species of Capsicum (Table 1). Five accessions each representing one domesticated species: PI 224408 (2), PI 439512 (15), PI 441620 (24), PI 441539 (46), and PI 585277 (59) were carefully chosen for InDel polymorphism validation of inter-species together with G29 and PBC688. Two C. annuum accessions, G29 and G43, together with two C. frutescens PBC688 and PI 439512 (15) were tested for InDel intra-species polymorphism. All 63 accessions were used for validation of inter-species InDel polymorphism.
Library construction and sequencing. The CTAB extraction method was used to isolate genomic DNA from fresh leaves. High quality genomic DNA was confirmed through 1.0% agarose gel electrophoresis for library construction 37 . We constructed two paired-end libraries with 10-fold depth for each pepper line. Briefly, genomic DNA was sheared using ultrasonic to yield an average size of 500 bp DNA fragments. Then Illumina paired-end adaptors were ligated to the fragmented DNA. The ligated DNA products were selected based on the fragment size on a 2% agarose gel. Amplification of the products was performed by PCR using specific primers to form the libraries. After inspection, the resulting libraries were sequenced on an Illumina Hiseq TM 2500 sequencer (Illumina Inc., San Diego, CA, USA) in the company of Biomarker Technologies. Raw reads of 2 × 100 bp were generated for the downstream analyses.
Data filtering, alignment, variants calling. The genome sequence of C. annuum cv. CM334 (2.96 Mb) was obtained from the Pepper Genome Platform (PGP) (http://peppergenome.snu.ac.kr/download.php) to use as the reference. Low quality reads were filtered out using a custom C program based on the default parameters. The cleaned data were aligned to the reference pepper genome using the Burrows-Wheeler Aligner (BWA0.7.10-r789) program 38 with the default values. The alignment results in SAM format were transformed to Binary Alignment Map (BAM) format files through SAMTools 39 . Mark Duplicates in Picard tool (v1.102) (http://broadinstitute. github.io/picard/) was applied to remove replicate reads, and the two BAM files were used for the next analyses. To reduce the inaccurate alignments, GATK Tool Kits version 3.1 was used to conduct the local realignment around the insertions and deletions, reads base quality recalibration and variant calling 40 . InDels flanking sequences extraction and primer design. For the identification of InDel polymorphisms between the re-sequenced PBC688 and G29, we explored the reference genome of CM334 as a 'bridge' to detect sequence polymorphisms between them. The single-end reads of G29 were aligned to the reference sequence of CM334 via SOAP with no gaps allowed. The aligned reads dataset was compared against the InDel polymorphism dataset identified between PBC688 and CM334. Only those InDels with identical sequences between G29 and CM334 were considered as real InDels between G29 and PBC688. Once the location of InDel polymorphisms between one re-sequenced accession and the reference was established, those between the two re-sequenced accessions are readily distinguished at corresponding positions where the second accession is identical to the reference 31 . In order to develop the InDels markers, we extracted 150-bp flanking nucleotides on two sides of an InDel to query the reference genome sequence using a simple Visual C++ script for primers design. Primer 5 (http://www.PromerBiosoft.com) was used to design PCR primers with length of 19-22 bp, Tm of 52-60 °C, and PCR products of 80-250 bp.
Chromosomal location and genomic synteny in pepper. The chromosomal localization of InDel markers was acquired from the CM334 genome database PGP (http://peppergenome.snu.ac.kr), and the InDel markers were located on chromosomes using MapDraw 41 . The genomic information of C. annuum, C. chinense and C. baccatum were also downloaded from PGP. The C. annuum genome was compared to C. chinense and C. baccatum genomes using the MCScan toolkit (V1.1) 42 . To determine synteny blocks, we used all-against-all LAST 43 and fettered the LAST hits with a distance cutoff of 20 genes, also requiring at least 4 gene pairs per synteny block. Python version of MCScan was performed to construct chromosome-scale synteny blocks plots (https://github.com/tanghaibao/jcvi/wiki/ MCscan-(Python-version).

Results
Identification of InDel polymorphisms between C. annuum cv. G29 and C. frutecens cv. PBC688. A total of 319,522,376 and 309,682,186 clean reads were generated for PBC688 and G29, respectively.
Using the Burrows-Wheeler Alignment (BWA), 2.54 × 10 8 and 2.79 × 10 8 of the PBC688 and G29, respectively, obtained reads were mapped to the reference genome CM334. The mapping read depth was 11x for PBC688 and 12x for G29. The overall genome coverage was 94.0% for PBC688 and 97.5% for G29, with an average of 95.8%. For PBC688 and G29, 76.2% and 87.9% pair-end (PE) reads, and 3.2% and 2.2% single-end (SE) reads were mapped to the reference chromosomes corresponding to 2.96 Gb of CM334 (Table 2).
Genome-wide insertion/deletion polymorphisms were examined via GATK software. In total, 1,664,770 InDels were identified between PBC688 and CM334. These InDels were distributed across all the twelve chromosomes, varying from 168,460 on chromosome 09 to 88, 291 on chromosome 08. At the same time, we identified 533,523 InDels between G29 and CM334 that ranged from 82,799 on chromosome 11 to 13,647 on chromosome 08. The InDels between PBC688 and G29 included different InDels than those described above, and the number of InDels ranged from 173,195 on chromosome 11 to 86,696 on chromosome 8 ( Table 3).
The average densities of the detected InDels between CM334 with PBC688 and G29 were 604.6 and 193.8 InDels/Mb, respectively. The InDels frequencies ranged from 655.5 InDels/Mb on chromosome 02 to 559.1 InDels/Mb on chromosome 01 between PBC688 and CM334, from 318.8 InDels/Mb on chromosome 11 to 94.1 InDels/Mb on chromosome 08 between G29 and CM334, and from 669.1 InDels/Mb on chromosome 11 to 563.2 InDels/Mb on chromosome 04 between PBC688 and G29 (Table 3).
In the present study, we detected that the largest InDel was 49 bp and the single base-pair InDels were dominant and accounted for about 65% of those analyzed. The ratios of InDels less than 10 bp were 94.4%, 92.6% and 94.3%, and those of less 6 bp was 89.1%, 86.2% and 89.1%, respectively, among the three different genomes (Table 4).
Genomic annotation and synteny of InDels in pepper. The use of the annotated genome of CM334 enabled the annotation of InDels, and to assign them with corresponding genes. We examined the distribution www.nature.com/scientificreports www.nature.com/scientificreports/ of the InDels related to genes of Capsicum and found that most of them were located within intergenic regions. Among the 1,664,770 and 533,523 InDel polymorphisms detected in CM334 compared with PBC688 and G29, 63,992 (3.8%) and 23,897 (4.5%) InDels were in gene regions, and only 2,519 and 1,019 were found in coding sequences. Among the 1,651,856 InDels identified between PBC688 and G29, 58,944 (3.6%) InDels were in genetic regions, with only 2,252 in coding sequences ( Table 5).
The functional characterization of genes with the polymorphic InDels were distributed across all 12 chromosomes of pepper. Overall, most of the genes widely involved in cellular process, cell, cell part, metabolic process, response to stimulus, developmental process, biological regulation, organelle, multicellular organismal process, binding, catalytic activity, location and others (Fig. 1). Specifically, cellular process related genes consisted of most polymorphic InDels in all of chromosomes. Moreover, response to stimulus genes with high polymorphic InDels consisted of numerous polymorphic InDels in chromosome 1, 2, 4, 5, 8, 9 and 12. In chromosome 6, 7 and 11, the genes associated with cell (cellular component) consisted of more polymorphic InDels followed cellular process. However, in chromosome 3, genes referred to metabolic process involved in abundant InDels. In addition, most of genes have multiple functions and involve in regulation of multiple process (Supplementary Dataset 4).

Experimental validation of short InDel polymorphisms. To validate the InDels identified between
PBC688 and G29, we selected 1605 out of 1,651,856 InDels following the rule of uniform distribution and converted them to PCR-based markers. According to the chromosomal location of InDels in C. annuum cv. CM334, the 1605 markers were distributed across all 12 chromosomes of pepper ( Fig. 3 and Supplementary Dataset 3). Among the 1605 InDels, 69 (4.3%) InDels located to genetic regions (Supplementary Dataset 3). This rate was consistent with that of the whole genome. Then, we analyzed the genetic synteny of the blocks including 1605 InDels among the three published genomes of Capsicum. The C. annuum InDels shared highly conserved syntenic blocks with those of C. chinense and C. baccatum (Supplementary Fig. 1) improving the stability of these InDels among the different Capsicum species. Based on this selection, we designed primer pairs to amplify fragments of 150 bp surrounding the InDels. In the PCR analysis, most markers had clear amplification in PBC688 and G29 genomes with some others generating multiple amplicons.
For 1605 primer pairs of InDels, 1560 (97.2%) gave reliable amplification in PBC688 and G29. Using PAGE,1262 (78.6%) showed identifiable polymorphisms between PBC688 and G29; 90 of these produced an amplicon in only one genotype and therefore were not suitable for genetic analysis; 298 (18.6%) were monomorphic and 45 (2.8%) failed. The polymorphism rate increased slightly with increase of InDel length, and the polymorphism rate varied from 65.3% on InDels of 3 bp to 79.1% on those of more than 10 bp (Table 6).
To test the ability to identify the interspecific hybrids with three species-specific InDel markers, we selected six parents and their interspecific hybrids. We found that the fifth hybrid was incorrectly identified because its www.nature.com/scientificreports www.nature.com/scientificreports/ amplification pattern was not consistent with its parents with all three InDels (Fig. 7A-C). Either InDel-02-3b-22 or InDel-02-3b-25 could distinguish four of the remaining five hybrids, and InDel-03-3b-5 worked in all the cases (Fig. 7A-C). For the that hybrid that failed with InDel-02-3b-22 or InDel-02-3b-25, we found it was because these two markers could not differentiate its male parent C. chinense cv. PI 640902 and female parent C. annuum cv.       www.nature.com/scientificreports www.nature.com/scientificreports/ G83. Our results imply that these three species-specific InDel markers could discriminate most hybrids formed from interspecific hybridization, and molecular markers are more accurate and convincing than phenotyping for identification.

Discussion
Despite the development of SNP genotyping technologies, InDel markers also have important practical value for those researchers and breeders without the instruments to test SNP markers. We identified 1,651,856 InDels between PBC688 and G29 that represent an average of 599.9 InDels/Mb across the entire Capsicum genome. A previous study showed that the number of InDels from C. annuum cv. Perennial and cv. Dempsey was 654,158 and 694,494 respectively when compared with the CM334 genome sequence. However, the wild species C. chinense has a significantly higher level of InDels (2,450,533) compared to these two cultivars 34 . This is consistent with our study in that the number of InDels among C. annuum intra-species is quite low; in contrast, there exists a higher level of InDels among Capsicum inter-species. However, the number of InDels from the previous study was obviously less than that in our study. Approximately 555,400 short InDels (1-5 bp) were detected in Zunla-1 relative to Chiltepin, and, 373,785 and 231,056 short InDels (1-5 bp) were detected in Zunla-1 relative to C. chinense and CM334 2 . There may be two main reasons for the difference. Firstly, in our study, we used CM334 genome as the reference genome, so our results are consistent with the study. Secondly, the previous study only detected short InDels (1-5 bp), so the number of InDels was significantly less than that in our study.
Chromosomal rearrangement often produces unbalanced gametes that reduce hybrid fertility and plays an important role in promoting speciation 44 . In our study, collinearity comparison among Capsicum species revealed that chromosomes 1, 3, 5, 8, 9 and 12 46 . Based on the genomes of Capsicum species and two Solanum species, collinearity comparisons showed that chromosome 6 and 4 of Solanum were discovered in the terminal regions of the long and short arms of chromosomes 3 and 5 in C.annuum and C.chinense, respectively 45 .
In this study, the localization of InDels within the pepper genome showed more than 95% InDels were in intergenic regions. Similarly, more InDels were detected in the intron than in CDS. Previous studies about genome-wide SNP and InDel discovery revealed the similar results in multiple crops, such as tomato and Brassica rapa 31,47 . In pepper, 93.06% and 93.39% of intergenic SNPs were detected for varieties PRH1 and Saengryeg, respectively 48 . www.nature.com/scientificreports www.nature.com/scientificreports/ In order to obtain in-depth knowledge in the InDels in our study associated with genes, these polymorphic InDels within genetic regions were functionally annotated in each chromosome. The current results revealed that genes involved in cellular process consisted of most polymorphic InDels in all chromosomes. Then, high polymorphic InDels with "response to stimulus" related genes InDelwere mapped in chromosomes 1, 2, 4, 5, 8, 9 and 12. Because of different focus, our results had some differences with a previous study by Ahn et al., who reported that most genes with high polymorphic SNPs were related with carbohydrate metabolism, followed by transcription regulation, ion binding and others. In addition, they found numerous genes with high polymorphic SNPs related to disease resistance mapped to chromosome 4, which could play a vital role in future pepper breeding 47 .
In this study, we confirmed InDels can be developed as potentially valuable genetic markers with a reliable high rate of polymorphism. Among 1605 InDels of 3-49 bp in length, 1262 (78.6%) showed polymorphisms. Only 45 (2.8%) of the primers yielded no amplification from either of the two sequenced accessions. This can be explained by sequence variations in the primer binding sites among Capsicum species as we designed primers based on the reference genome sequence 31 . In contrast to the high polymorphism rate of InDels among five accessions representing five domesticated species, two C. annuum and C. frutescens accessions showed much lower polymorphism rates. As expected, our results suggest that polymorphism rate of InDel markers within species was much lower than that among species. In a previous study on genome-wide re-sequencing inbred lines C. annuum cv. BA3 and B702, more than 90% of the InDel markers were amplified. However, only 27.2% and 12.9% markers were polymorphic between BA3 with B702 or C. frutescens cv. YNXML, respectively 9,27 .
Most importantly, we found three inter-species specific InDels (InDel-02-3b-22, InDel-02-3b-25 and InDel-3b-3-5) each of which could highly discriminate among most of the accessions under study and which efficiently identified interspecific hybrids, implying their potential application for new germplasm classification and interspecific hybrid identification in the future. Our results showed that InDel-02-3b-22, InDel-02-3b-25 and InDel-03-3b-5 could individually discriminate almost all the accessions, which agrees with a previous study. Di Dato et al. (2015) showed that most accessions (among 59 accessions) were clearly differentiated with ten SSR markers except two accessions of C. chinense, which were grouped into C. frutescens cluster. He concluded that the two abnormal accessions were genetically distant from others analyzed C. chinense 12 . In our study, the accessions of C.annuum, C. baccatum and C. pubescens had clearly specific amplification products, although 4 accessions of C. chinense and 1 accessions of C. frutescens showed some confusing patterns. Our results confirmed previous findings based on both phenotypes and molecular markers that C. annuum was closely related to C. chinense and C. frutescens, and distant to C. baccatum and C. pubescens 12,49 .
The location of markers is a vital factor for the application value of markers. These markers are located in intragenic regions to implicate the phenotypic traits and have more potential applications in marker assisted selection as functional markers 4 . In our study, the three InDel markers InDel-02-3b-22, InDel-02-3b-25 and InDel-03-3b-5 were in intragenic regions and associated with three genes, CA02g13520, CA02g20590 and CA03g07770, respectively. CA02g20590 encoded serine/threonine-protein kinase STY17-like. In Arabidopsis thaliana, the protein kinases STY8, STY17, and STY46 played a vital role in phosphorylating of transit peptides for chloroplast-destined preproteins 50 . CA03g07770 encoded the chloride channel protein CLC-d. In Arabidopsis www.nature.com/scientificreports www.nature.com/scientificreports/ thaliana, CLCd was targeted to Golgi apparatus and could suppress the cation-sensitive phenotype of Δ gef1 51 . Although CA02g13520 encodes a protein with unknown function, but it can be applied to marker assisted selection as a functional marker without any effect.
Together, these novel InDel markers are very valuable reference tools for classification of germplasm resource, identification of interspecific hybrids, genetic research, and marker-assisted breeding in pepper.