Genome-wide association study of outcrossing in cytoplasmic male sterile lines of rice

Stigma exsertion and panicle enclosure of male sterile lines are two key determinants of outcrossing in hybrid rice seed production. Based on 43,394 single nucleotide polymorphism markers, 217 cytoplasmic male sterile lines were assigned into two subpopulations and a mixed-group where the linkage disequilibrium decay distances varied from 975 to 2,690 kb. Genome-wide association studies (GWAS) were performed for stigma exsertion rate (SE), panicle enclosure rate (PE) and seed-setting rate (SSR). A total of 154 significant association signals (P < 0.001) were identified. They were situated in 27 quantitative trait loci (QTLs), including 11 for SE, 6 for PE, and 10 for SSR. It was shown that six of the ten QTLs for SSR were tightly linked to QTLs for SE or/and PE with the expected allelic direction. These QTL clusters could be targeted to improve the outcrossing of female parents in hybrid rice breeding. Our study also indicates that GWAS-base QTL mapping can complement and enhance previous QTL information for understanding the genetic relationship between outcrossing and its related traits.

GWAS-base QTL mapping has been successfully employed for a wide range of agronomically important traits in rice, including flowering time, plant height, yield traits and quality traits [31][32][33] . A more recent study extended the trait to stigma exsertion, showing that GS3, GW5 and GW2 play an important role in the genetic basis of stigma exsertion in rice 34 . Nevertheless, none of these studies targeted at CMS line which is an essential category of rice cultivars for hybrid rice production. Moreover, it is inconvenient to use traditional population segregation for mining QTLs controlling natural outcrossing in rice for being a self-pollinating species. Association mapping could be a potential tool for connecting genomics and phenomics in CMS rice germplasm, and fill the gap on the genetic basis of natural outcrossing. Using a collection of diverse wild abortive CMS (CMS-WA) lines, the objectives of the present GWAS study were (i) to investigate the genetic architecture of CMS lines developed at the International Rice Research Institute (IRRI) based on a 44 K single nucleotide polymorphism (SNP) genotyping array; (ii) to identify candidate genomic regions controlling outcrossing; and (iii) to characterize the genetic relationship between outcrossing and its related traits.

Results
Phenotype performance. The phenotypic performance of 217 rice CMS lines is presented in Table 1. A broad diversity of phenotype was observed with respect to all the traits tested. Seed-setting rate (SSR) had the highest coefficient of variation (CV), estimated as 47.9% and 38.5% in the wet season of 2013 (13WS) and dry season of 2014 (14DS), respectively. Panicle enclosure rate (PE), which was the other trait tested in two seasons, had a greater CV (19.3%) in 13WS than in 14DS (12.6%). Variances between the two seasons were highly significant (P < 0.0001) for the two traits, contributing 26.2% and 50.7% to the total phenotypic variance for SSR and PE, respectively ( Table 2). Among the three traits for stigma exsertion rate which was measured in 14DS only, the maximum CV of 37.0% was detected for double stigma exsertion rate (DSE), followed by total stigma exsertion rate (TSE) and then single stigma exsertion rate (SSE).
Pearson's product-moment correlation coefficients between the traits tested in 14DS is presented in Supplementary Table S1. The three traits evaluating stigma exsertion were positively correlated with each other, but the coefficients were high between TSE and DSE (r = 0.9058, P < 0.001), moderate between TSE and SSE (r = 0.5386, P < 0.001), and low and insignificant between DSE and SSE. This indicates that phenotypic variation of TSE among the 217 rice lines was mainly contributed by DSE, which is expected since DSE was much more variable than SSE. Accordingly, while TSE and DSE were significantly positively correlated with SSR (P < 0.001; r = 0.4171 for TSE and r = 0.4260 for DSE), SSE was not significantly correlated with SSR. In the meantime, SSR was significantly negatively correlated with PE (r = −0.2400, P < 0.001). The correlation of seed-setting rate with stigma exsertion rate and panicle enclosure rate was in accord with the common understanding that enhancing stigma exsertion could facilitate outcrossing whereas panicle enclosure may interfere with outcrossing. SNP performance and genetic diversity. The Table S2). The gene diversity and PIC of the rice lines are slightly higher than those estimated using 215 diverse indica rice cultivars 35 . This indicates that our panel has well-captured the abundant genetic variation of the indica rice germplasm.
Population structure, kinship, and LD. Population structure was calculated by STRUCTURE using 25,335 SNPs. The LnP(K) appeared to be an increasing function of K as the putative K increased from 1 to 8, and no turning point was evident ( Fig. 1). At the same time, population structure indicated by the ΔK function showed that K = 2 could be determined as the optimum K (Fig. 1). Using a cut-off Q-value of 0.8, 59 and 57 lines were assigned into groups 1 and 2, respectively, while the remaining 101 entries went into the mixed group (Supplementary Table S3). The numbers of entries assigned into groups 1, group 2 and the mixed group were 12, 8 and 24 for the 44 germplasm accessions, and 47, 49 and 77 for the 173 breeding lines, respectively. A large number of mixed entries no only showed a broad gene exchange between the two subgroups, but also indicated that our rice panel did not have clear population stratification. Thus, it is suitable to use a single-set data including all the 217 entries for GWAS. The separation of the two subgroups were presented as well in a neighbor-joining tree based on the Cavalli-Sforza' Chord genetic distance, but the mixed entries were scattered over the tree (Fig. 2a). Separation between groups 1 and 2 was clearly observed in the PCA plots using the top two principal components which accounted for 18.8% and 10.1% of the genetic variation, respectively, with a large number of mixed-group entries locating between the two subgroups ( Fig. 2b). The two subgroups showed significant difference (P < 0.001) on heading date, with the entries in group 2 delaying for an average of 5 days in 13WS and 14DS ( Supplementary Fig. S1).
The kinship between all pairwise samples varied from 0 to 1.6810 and averaged as 0.4556, among which 82% of the kinship coefficients had a value less than 0.6 ( Supplementary Fig. S2). LD scores between markers (r 2 ) varied within and among the chromosomes (Fig. 3). The initial r 2 value for the 0-20 kb interval ranged from 0.37 to 0.88 and averaged as 0.64. The r 2 decreased with increasing genetic distance on all the 12 chromosomes. The average LD decay distance was 957 kb with a range from 577 kb on chromosome 6 to 2,690 kb on chromosome 4. Genome-wide association scanning. GWAS was performed with the compressed mixed linear model controlling population structure and familial relatedness. The quantile-quantile plots (Fig. 4) showed that the distribution of observed −log10 (P) values matched well with the expected distribution, indicating a low rate of false positive in the detection of significant trait-marker association. A total of 154 significant associations (P < 0.001) were detected, including 83 for SSR, 42 for TSE, 9 for DSE, and 20 for PE (Table 3 and Fig. 4). These significant associations were distributed in all the 12 rice chromosomes except chromosome 9. To reduce the redundancy of significant associations, a shared QTL region for multiple SNPs was declared for those located within the average LD decay distance of approximately 1,000 kb, which was estimated from the 217 rice lines. As a consequence, the number of QTLs detected was reduced to 29 (Table 3).
For stigma exsertion rate which was tested in 14DS only, four and nine QTLs were detected for DSE and TSE, respectively, and none was found for SSE. The strongest signal was found at the same 28,331,096 bp locus on chromosome 8 for the two traits. Major alleles of the two QTLs, qDSE8 and qTSE8.1, increased DSE and TSE by 4.2% and 4.7%, respectively. The QTL with the second strongest signal for DSE, qDSE11, also matched a QTL for TSE, qTSE11, with the minor allele increasing DSE and TSE by 3.9% and 4.0%, respectively. Such consensus was in accord with the high correlation between DSE and TSE. Since DSE and TSE are two related estimates for stigma exsertion, qDSE8/qTSE8.1 and qDSE11/qTSE11 could each be regarded as one QTL. Thus, the total number of QTLs detected in this study was reduced to 27, and the QTL number for stigma exsertion rate reduced to 11. Two other QTLs for DSE were located on chromosomes 5 and 10, and seven other QTLs for TSE were distributed on chromosomes 3, 6, 8, and 12. The major alleles promoted stigma exsertion at qDSE5, qTSE3.1, qTSE3.2, qTSE6.2, qTSE8.3, and qTSE12, with the allelic effects ranging from 3.8% to 5.6%. The minor allele promoted stigma exsertion at qDSE10, qTSE6.1, and qTSE8.2, with the allelic effects ranging from 3.4% to 4.5%. For panicle enclosure rate, two and four QTLs were identified in 13WS and 14DS, respectively. None of them were commonly detected in the two trials. The two QTLs detected in 13WS, qPE5.1 and qPE5.2, were located in the regions of 26.7−26.9 and 30.6−30.8 Mb on chromosome 5, with the major alleles reducing PE by 3.9% and 6.2%, respectively. The four QTLs detected in 14DS were located on chromosomes 1, 4, 10, and 12, respectively, with the major allele at qPE4 and minor alleles at qPE1, qPE10, and qPE12 reducing PE by 1.2%~2.4%.
For SSR, three and seven QTLs were identified in 13WS and 14DS, with the allelic effects ranging as 2.0%~2.9% and 2.8%~4.0%, respectively. Three of the QTLs, qSSR2, qSSR7, and qSSR11, were independently segregated from QTLs for other traits found in this study. While no QTLs for PE and stigma exsertion rate were identified on chromosomes 2 and 7, the two QTLs for stigma exsertion rate located on chromosome 11 were 14.4-Mb apart from qSSR11. The major alleles of the three QTLs for SSR all had enhancing effects.
Four other QTLs for SSR, qSSR10.1, qSSR10.2, qSSR10.3, and qSSR10.4, were linked in the 14.6−21.4 Mb region on chromosome 10. The minor alleles of these QTLs all increased SSR. It was found that two QTLs for other traits, qPE10 and qDSE10, were also located in this region, with the minor alleles reducing panicle enclosure rate and increasing stigma exsertion rate, respectively. Again, this was in accord with the unfavorable and favorable influence of panicle enclosure and stigma exsertion on outcrossing, respectively.
The remaining three QTLs for SSR were distributed on chromosomes 4, 8, and 12. Each of them was linked to a QTL for panicle enclosure rate or stigma exsertion rate, and the expected association was observed for two pairs of the QTLs. While the major alleles of qSSR4 and qSSR8 increased SSR, those of qPE4 and qTSE8.3 were favorable for reducing PE and increasing TSE, respectively. On the other hand, the minor alleles of qSSR12 and   qTSE12 were favorable for increasing SSR and unfavorable for increasing TSE, respectively, which is an exemption for the consensus between SSR and related traits.

Discussion
In this paper, the first result of GWAS for key outcrossing-related traits in rice was reported. Using the rice SNP chip consisting of 43,394 SNPs, the genetic diversity, LD pattern, population structure, and kinship of a diverse panel of 217 rice CMS-WA lines were characterized, and QTLs responsible for three key traits determining outcrossing in rice were identified.
The LD decay distance estimated in this study was found to vary greatly among different genomic regions, ranging from 577 kb to 2690 kb over the 12 chromosomes of rice. The large variation was in conformity with the results of previous studies [36][37][38][39] . It was also found that the average LD decay distance of approximately 1000 kb in our panel was much higher than the value of 100-300 kb uncovered by previous studies [31][32][33] . This was in agreement with the understanding that selection pressure could result in extending LD 40 . While a collection representing the entire diversity of Chinese landrace was used by Huang et al. 31 and a global collection of diverse rice varieties was used by Huang et al. 32 and Zhao et al. 33 , all the entries used in our study are in the category of CMS-WA and maintainer lines. Moreover, 173 of the total 217 lines are breeding lines selected from an on-going breeding program at IRRI.
The three key traits for outcrossing in rice were subjected to GWAS-base QTL mapping in this study, among which seed-setting rate is the direct measurement of outcrossing efficiency while stigma exsertion and  panicle enclosure are critical factors determining outcrossing. A total of 27 QTLs were detected, including 10 for seed-setting rate, 11 for stigma exsertion rate, and 6 for panicle enclosure rate. Eleven of the QTLs were located in genomic regions where QTLs for the same trait was previously detected, including qTSE3.1, qTSE3.2, qDSE5, qTSE6.1, qDSE8/qTSE8.1, qTSE8.2, qTSE8.3, and qDSE10 for stigma exsertion rate and qPE1, qPE5.1, and qPE5.2 for panicle enclosure rate (Table 3). It is noted that the strongest signal for stigma exsertion rate, qDSE8/qTSE8.1, has been reported in multiple studies 9,13,20 . The remaining six QTLs for stigma exsertion rate or panicle enclosure rate, qTSE6.2, qDSE11/qTSE11, qTSE12, qPE4, qPE10, and qPE12, are newly detected in this study. Regarding seed-setting rate as a measurement for outcrossing, no QTL has been reported before, thus new information is provided by the detection of QTL for this trait in the present study.
As it is well known, stigma exsertion plays a vital role in promoting outcrossing, while panicle enclosure has the opposite influence. In this study, seed-setting rate was found to be positively and negatively significantly correlated with stigma exsertion rate and panicle enclosure rate, respectively. Allelic directions of individual QTLs of the QTL clusters detected in this study has provided the genetic basis for this phenomenon. Seven of the QTLs detected for seed-setting rate were closely linked to QTLs for stigma exsertion rate or/and panicle enclosure rate. In all cases except the qSSR12/qTSE12 cluster, linked QTLs had the same allelic direction for seed-setting rate and stigma exsertion rate but opposite direction for seed-setting rate and panicle enclosure rate. QTL cluster of this kind is in high demand for marker-assisted breeding since several correlated traits could be selected at the same time to raise the breeding efficiency.

Methods
Plant materials. A total of 217 rice CMS-WA lines and their corresponding maintainer lines were used, among which 44 pairs were breeding-true lines which have been designated the accession numbers of IRRI germplasm bank, while the remaining 173 pairs were from IRRI CMS-WA breeding project which were documented by the pedigree (Supplementary Table S3). The 173 unnamed maintainers were breeding lines in the F 12~F18 generations of 69 crosses and their male sterile lines are completely sterile. The maintainer lines were used for genotyping and as the pollinator in evaluating the seed-setting rate of the CMS plants, and the CMS lines were used for trait measurement. SNP genotyping. The 217 maintainer lines were subjected to genotyping using genomic DNA from the fresh leaves of 28-day-old plants. SNP genotyping using the rice SNP chip consisting of 43,394 SNPs was provided by Syngenta India. SNPs without physical position information and those having call frequency < 0.8 or allele frequency < 0.05 were excluded.  of 2014 (14DS). The maintainer lines were sown three days later than the sterile lines. They were grown in a randomized complete block design with three replications. Twenty-one day old seedlings were transplanted using a planting density of 20 cm × 20 cm. Each pair was grown in five rows with 10 plants per row, of which the middle three rows were for sterile lines and the rest for maintainers. Field management followed local recommendations for the two different cropping seasons. Heading date was scored for individual plants and averaged for each sterile line and maintainer line in each replication. Stigma exsertion rate, panicle enclosure rate, and seed-setting rate were measured for each sterile line, with stigma exsertion rate being evaluated in 14DS only while the two other traits were determined in both seasons. For measuring the stigma exsertion rate of a sterile line, five panicles from five plants in each replication were selected on the third day after flowering. After the removal of the spikelets that were not yet flowering, spikelets with no stigma exsertion, single stigma exsertion, and double stigma exsertion (Fig. 5a) were counted, respectively. The single stigma exsertion rate (SSE), double stigma exsertion rate (DSE), and total stigma exsertion rate (TSE) were calculated as the percentage of the numbers of spikelets with single stigma exsertion, double stigma exsertion, and either single or double stigma exsertion in the total number of spikelets, respectively. For panicle enclosure rate (PE), the scoring samples were eight panicles from five plants per sterile line at maturity. PE, referring to panicle enclosed inside the sheath, was defined as the rate of the distance between the neck-panicle node and the flag leaf-cushion to the distance between the neck-panicle node to the panicle peak (Fig. 5b). Meanwhile, five healthy sterile lines in the central rows were bulk-harvested and measured for seed-setting rate (SSR). Statistical analysis. By using the PowerMarker version 3.25 41 , major allele frequency 42 , gene diversity 42 and polymorphism information content (PIC) 43 were calculated, and neighbor-joining trees was constructed based on the Cavalli-Sforza's Chord genetic distance 44 and viewed in MEGA version 5 45 . Population stratification was identified using STRUCTURE version 2.3.4 46 with the parameter settings as described by Xie et al. 35 . The uppermost level of the assumed number of genetic groups (K) was determined by ΔK 47 . An individual was assigned to a specific group if its genome composition derived from the group (Q-value) is above 80%; otherwise, it was assigned to the mixed group. LD was estimated using TASSEL version 5 48 with permutations of 1,000. By using the R package GAPIT 49 , kinship was calculated, and association scan was carried out with the compressed mixed linear model 50 where Q-value and kinship were included as the fixed and random effects, respectively. The value of P < 0.001 was used to declare a SNP-trait association.