Variability Assessment of Aromatic Rice Germplasm by Pheno-Genomic traits and Population Structure Analysis

While the pleasant scent of aromatic rice is making it more popular, with demand for aromatic rice expected to rise in future, varieties of this have low yield potential. Genetic diversity and population structure of aromatic germplasm provide valuable information for yield improvement which has potential market value and farm profit. Here, we show diversity and population structure of 113 rice germplasm based on phenotypic and genotypic traits. Phenotypic traits showed that considerable variation existed across the germplasm. Based on Shannon–Weaver index, the most variable phenotypic trait was lemma-palea color. Detecting 140 alleles, 11 were unique and suitable as a germplasm diagnostic tool. Phylogenetic cluster analysis using genotypic traits classified germplasm into three major groups. Moreover, model-based population structure analysis divided all germplasm into three groups, confirmed by principal component and neighbors joining tree analyses. An analysis of molecular variance (AMOVA) and pairwise FST test showed significant differentiation among all population pairs, ranging from 0.023 to 0.068, suggesting that all three groups differed. Significant correlation coefficient was detected between phenotypic and genotypic traits which could be valuable to select further improvement of germplasm. Findings from this study have the potential for future use in aromatic rice molecular breeding programs.


Results
Phenotypic traits. We present, in Fig. 1, the frequency distribution for 113 aromatic germplasm for 12 qualitative phenotypic traits. Most of the germplasm (>96%) showed green leaf blade color. More than 95% germplasm showed anthocyanin leaf sheath color. Most germplasm (92%) had horizontal flag leaf, while 1.8%, 2.7% and 3.5% germplasm had erect, semi-erect, and descending flag leaf, respectively. Three types of panicle were observed, with 8%, 86% and 6.2% as compact, intermediate, and open-type, respectively. Panicles were found well exerted in 95% germplasm and 5% moderately well exerted. More than 69% germplasm were awnless, with 69.9% white seed coated. The most variable phenotypic qualitative traits of the tested germplasm were aroma content, apiculus color, and lemma-palea color (Fig. 1). More than 58% of the germplasm were well scented, while 31% germplasm were moderately scented. Several tested germplasm (11%), however, also seemed to be non-scented based on KOH extraction methods.
UPGMA-based dendrogram displayed major four groups for the 113 tested germplasm ( Fig. 2A) based on 12 phenotypic qualitative traits. Group I, II, III, and IV consisted of 33, 45, 21, and 14 rice germplasm, respectively. PCA revealed that the first two components contributed 37.38% and 27.26% of the total variation, respectively. These two principal components were used to generate the group of the tested aromatic rice germplasm (Fig. 2B); Fig. 3 displays the resultant diversity profile. Increase in the H-based evenness value across the traits indicates the effective representativeness of the diversity available in the germplasm. Evenness varies from 0 to 1. Table 1 presents the Shannon-Weaver diversity index and evenness for the 12 phenotypic traits.
Generally, the Shannon-Weaver diversity index value represents the degree of diversity prevailing among the tested samples. Higher value indicates higher diversity and vice versa. In this study, the Shannon-Weaver index values ranged from 0.14 for leaf blade color to 1.71 for lemma-palea color, the latter showing considerable variation.
SSR polymorphism among aromatic rice germplasm. Overall, 52 well-spread microsatellite (SSR) markers covering all 12 chromosomes were used to characterize and assess genetic diversity among 113 aromatic rice germplasm. Only 45 markers, however, showed clear and consistent polymorphic banding patterns and amplification of each genotype, indicating that these microsatellites were suitable for genetic diversity analysis. The remaining seven markers produced monomorphic bands revealing one allele at each locus in all the germplasm (not discussed here) and hence were not useful for this study.
The gel picture for banding of the 113 aromatic rice germplasm with the RM447 marker is shown in Fig. 4. These loci were applied to discriminate morphologically uniform and non-uniform germplasm. The number of alleles, allele size, highest frequency allele, rare alleles, unique alleles, and polymorphism information content (PIC) detected among 113 germplasm are presented in Table 2.
Overall, 140 alleles were identified from the 113 tested germplasm with an average 3.11 alleles. The highest PIC value (0.67) was found for the marker RM114, with the lowest (0.051) observed for marker RM178, with a mean of 0.29. The most frequent major allele frequency (0.97) was found for the marker RM455 and the lowest (0.37) for marker RM283, with a mean of 0.77.
Rare alleles are described as alleles with a frequency less than 5%. In general, markers detecting a greater number of alleles per locus detected more rare alleles. Fifty-two (52) rare alleles were identified at 45 SSR markers with an average of 1.15 alleles per locus ( Table 2). The highest number of rare alleles was observed at the RM25 and RM284 loci (3 alleles), followed several markers (Table 2). Eleven unique alleles were also detected using 11 SSR markers specific to a given germplasm (Table 3); for instance, the popular rice variety "BRRI dhan50" was uniquely identified using RM6, "Sakkorkhora" with RM44, and "Bashful" with RM536.      Genetic relationship among the germplasm. An unrooted neighbor-joining tree, which showed the genetic relationships among the 113 tested germplasms, was constructed based on the alleles detected by 45 SSR markers. The genetic distance-based results stated in the unrooted neighbor-joining tree revealed three groups in the 113 tested germplasm (Fig. 5). We also constructed an UPGMA-based dendrogram to generate the genetic distance among the groups. Analysis of the pooled SSR marker data had an average similarity co-efficient of 0.44. The analysis also showed genetic variation among the aromatic rice germplasm tested, with similarity coefficient values ranging from 0.39 to 0.93, demonstrating a moderate degree of genetic diversity among the germplasm used in this study. Moreover, the three dimensional (3D) graphical view of the principal component analysis (PCA) showed the spatial distribution of the 113 tested germplasm along the three principal axes, with most closely associated near the centroid in the 3D graph (data not shown).

Spearman rank correlation test between D 2 and SSR rankings. Spearman rank correlation was
conducted to compare the morphological and molecular data for distinguishing germplasm. The 6328 genetic distances between germplasm measured through D 2 and SSR analysis were ranked separately (Supplementary information, Table S1) and assessed by Spearman's rank correlation formula, with r s = 0.276 and significant at t = 3.41 (p = 0.03) with two degrees of freedom, indicating that a statistical association between phenotypic and genotypic traits existed. The significant (p = 0.03) correlation coefficient between the D 2 analysis and the ranking of SSR markers, based on Nei distance, reflected that both of these two techniques were very effective for grouping the germplasm. Supplementary information (Table S1) presents the cumulative Nei distance ranking and D 2 genetic distance ranking position for all tested germplasm. By Nei distance ranking, StrawTAPL-500 and Desi Katari stood first and last rank, respectively; by D 2 distance ranking, however, Kalobakri and Tilkapur ranked first and last, respectively. These results indicated that a statistical association between groups, phenotypic and genotypic traits existed.
Population structure model based approach. A model without admixture was carried out varying K from 1 to 15 with five iterations using all 113 germplasm and 45 polymorphic markers for maximum likelihood and delta K (nK) values (data not shown). At K = 3, all germplasm stratified into three populations (P1, P2, P3), representing 56%, 14% and 30% of germplasm used in structure analysis respectively; Fig. 6 presents the inferred population structure. Moreover, based on the membership fractions, germplasm under these three population were classified as pure or an admixture: P1 showed 58 pure (90.6%) and 6 admixed (9.4%) individuals, P2 had 14 pure (87.5%) and 2 (12.5%) admixed individuals, and P3 had 29 pure (87.9%) and 4 (12.1%) admixed individuals. FST statistics tested genetic variation among the three population with values of 0.39 0.11 and 0.28 for P1, P2, and P3, respectively, with an average 0.26, indicating a moderate population structure. Thus, the most structured population was P1, followed by P3, and then the P2 population. The specific FST values (not the pair-wise FST values between populations) for the three populations were calculated using STRUCTURE. The expected heterozygosity or averaged distances between individuals, in same cluster, were 0.23, 0.47, and 0.30, respectively. The largest genetic (net nucleotide) distance (0.12) was observed between P1 and P2, while, the genetic distances between P1 vs P3 and P2 vs P3 were 0.023 and 0.09, respectively. The neighbor-joining (NJ) tree and principal component analysis (PCA) based on population derived from the structure analysis were conducted. Both the NJ tree and PCA analyses further confirmed the STRUCTURE results. The tree model-based groups (P1-P3) were clearly separated in the NJ tree (Fig. 7). In the PCA (Fig. 8), the first two eigenvectors clearly separated the overall germplasm into three major groups, which was similar to what was observed in STRUCTURE analysis and NJ tree.
Analysis of molecular variance. The three populations generated from the above structural analysis were also subjected to analysis of molecular variance (AMOVA) to estimate the percentage of variation between populations and within populations. In the total genetic variance between populations, based on structure, we observed that 6% was attributed to the populations and the remaining 93% was explained by individual differences within populations (Table 4). Approximately 1% of the total observed genetic variance could be explained by differences at the level of the individual. Pairwise FST values showed significant differentiation among all the pairs of populations, ranging from 0.02 to 0.07, suggesting that all of the three groups were different from each other. The P1 and P3 populations had the greatest level of differentiation from each other, as determined by the FST estimate (Supplementary information, Table S2).  Table 3. The unique alleles found in 11 germplasm out of 113 germplasm were tested using 52 SSR markers.
In general, the results from AMOVA and FST analysis were in agreement with the observations obtained through (i) the phylogenetic tree-based, (ii) similarity coefficient distribution, and (iii) structure analysis, confirming the presence of both a statistically moderate genetic diversity and a high level of population structure.

Discussion
Exploitation of genetic diversity, present in a crop species, can improve the target traits of that crop for the crop breeder, farmers, and ultimately consumers 17 . While cultivated varieties of aromatic rice in Bangladesh developed as a result of farmers' and scientists' selection from within the existing and available genetic diversity in a diversity of environments, one can argue that modern breeding over the past two centuries, in some cases, has resulted in the release or maintenance of varieties that are considered uniform, less stable, and better adapted to controlled and limited environmental conditions 32 . This makes narrow genetic background varieties popular among farmers, even as these improved varieties may, in some cases, be more susceptible to biotic and abiotic stresses.
In Bangladesh (as elsewhere), aromatic rice improvement requires (i) the identification of highly diverse germplasm, (ii) highly polymorphic molecular markers that, (iii) in turn, can be effectively utilized for the mapping of genes/QTLs for economically important traits and (iv) where a subset of these markers can be used in molecular breeding programs towards the development of improved varieties. Thus, steps towards understanding varietal characteristics of aromatic rice has the potential to play a vital role in future national and international breeding programs, especially where aromatic traits may be desirable for consumers. This research is a first step in a broader initiative to characterize the genetic base and improve the aromatic rice germplasm commonly grown in Bangladesh and conserved in the BRRI Genebank.
We used 12 qualitative phenotypic traits to classify the 113 germplasm into four groups. These phenotypic traits can be used for rapid identification of germplasm bank materials. Phenotypic traits of crop germplasm have been widely used for diversity analysis 33,34 . Because qualitative and quantitative phenotypic traits can impact the genetic diversity of rice germplasm 35 , they can be used for evaluating that genetic diversity 36,37 . Specifically, the degree of diversity can be evaluated using Shannon-Weaver diversity index value (H). We observed the highest H in lemma palea color, but the phenotypic traits used for diversity analysis had considerable variation ( Table 1).
The advent of molecular markers, and their use for the identification of diverse germplasm is highly advantageous over previous approaches 38 . Among different types of molecular markers, SSRs have been widely utilized in the study of genetic diversity analysis, genotypic grouping, and population structure analysis for numerous crop species, including aromatic rice 15,17,[39][40][41] . In the present study, selected SSR markers were determined to be suitable for a genetic diversity analysis of aromatic rice germplasm available in Bangladesh. The markers produced unique, rare and major allelic profiles for the 113 aromatic rice germplasm, with PIC values ranging from 0.05 to 0.67, with an average of 0.29; the genetic diversity observed in this study falls within the ranges found in several earlier studies 15,17,42 .
Previous studies detected different numbers of allele per locus: e.g., three alleles per locus and an average PIC of 0.41 17 43 . In another study, conducted by Thomson and co-workers, they observed an average PIC value of 0.45 among the 183 Indonesian rice landraces collected from across the islands of Borneo 44 . Shah and co-workers reported a slightly lower level of genetic diversity, averaging 2.75 alleles per locus as well as an average PIC value of 0.38 amongst the 40 rice accessions they tested from Pakistan 22 . Additionally, Singh and co-workers observed a lower SSR diversity in a study with 36 polymorphic HvSSRs, whereby they detected 2.22 alleles per locus with an average PIC value of 0.25 from a total of 375 Indian rice varieties which were collected from a diversity of regions across India 21 .
The PIC value of a marker is the probability of the marker allele that can be deduced in the progeny and is a good measure of a marker's usefulness for linkage analysis 45 . In general, higher PIC values were observed for SSRs having higher number of alleles while lower PIC value indicated that the genotypes under study are closely related; higher PIC values, conversely, indicate higher diversity in the germplasm being tested and such germplasm is better suitability for the development of new varieties. In our study, the primer RM 144 had the highest PIC value (0.67) and the highest number of alleles (5), indicating that it detected the highest level of polymorphism and the best marker for characterizing the aromatic rice germplasm. Markers RM342, RM 283, RM237, and RM 314 were also useful for molecular characterization of germplasm, though to a lesser extent.
Some of the SSR markers also generated germplasm-specific bands that can be used as molecular identity data for specific germplasm. For example, marker RM144 amplified all germplasm and showed specific fragments (Fig. 4). Overall, 52 rare alleles, with a frequency less than 5%, were identified across the tested rice germplasm, or an average 1.15 rare alleles per SSR marker, which is lower than in other similar reports [46][47][48] .
Identification of unique alleles has a great importance both for identifying specific genotypes but also for breeding 49,50 . In this study, nine unique alleles were identified by SSR markers (Table 3), and each germplasm showed unique alleles for at least one microsatellite locus. However, the number of unique alleles per locus varied from one to two 49 . Seven aromatic rice germplasm-"Sugandhi dhan, " "Chini Kanai, " "BRRI dhan50, " "Khazar, " "Bashful, " "Sakkorkhora, " and "Elai"-each amplified one unique allele (Table 2). Additionally, both "Begunbichi" and "Straw TAPL-554" showed two unique alleles. Generally, the higher number of unique alleles in a germplasm indicates its potential as a reservoir of novel alleles for crop improvement. The findings here are similar to other Diversity analysis relies on genotypic, phenotypic and geographic information of a crop species. Correlation coefficient analysis provides useful knowledge in terms of quantitative, qualitative characteristics issuing a credibility and important attributes about a species 53 . In this study, we determined correlation between phenotypic and genotypic traits of the tested germplasm which showed significant statistical relationship between two groups of data. This relationship is highly desirable and has significant value and used for selection because phenotypic traits are dependent on genotypic traits 54 .
Aromatic rice germplasm is composed of small-, medium-, and long-grain types with mild to strong aroma 8,9 . Based on conventional taxonomy, Bangladeshi aromatic rice has been classified as indica. Subsequent studies, which have been based on SSR and isozyme markers, have demonstrated that most of the aromatic rice cultivars from the Indian sub-continent, which includes Basmati (scented) types, have been characterized as a genetically distinct cluster 17,[55][56][57][58] . Based on SSR marker analysis, Roy et al. 15 reported three major groups from a set of 107 Indian aromatic rice accessions. In the present study, population structure analysis also revealed three populations (P1-P3) with a majority of germplasm in P1. This grouping agrees with genetic distance based clustering and PCA (Figs 7 and 8). While, to the authors' knowledge, this is the first genetic structure study for aromatic Bangladeshi germplasm, analysis of 141 aromatic Indian rice genotypes demonstrated five sub-populations 17 . Choudhury et al. 25 found two clusters within 24 indigenous and improved rice varieties in northeast India, while Das et al. 26 found four groups among a set of 26 rice cultivars.
In this study, genetic diversity among the tested germplasm was also evaluated by a model-based structure using the SSR genotypic data. The genetic architecture of diverse germplasm can be estimated by determining the STRUCTURE of the population using molecular markers such as, but not limited to, SSRs or SNPs 16,21,30,40,59,60 . In STRUCTURE, LnP(D) denotes the highest optimal number of subsets (K) [61][62][63] for an optimal number of divisions 64 . In this study, at K = 3, all 113 germplasm divided into three population, with 63 in P1, 17 in P2, and 33 in P3 (Fig. 6), indicating genetic differentiation in the overall germplasm.
Out of the total tested, ten were non-scented but were distributed in all three population: four each in P1 and P2 and two in P3. Similarly, the majority of scented germplasm were observed in P1. Four of five high-yield varieties developed by BRRI-"BR5, " "BRRI dhan34, " "BRRI dhan37, " and "BRRI dhan38"-were distributed in P1; "BRRI dhan50" was in P3. In Bangladesh, germplasm are typically classed by grain shape and size 65 . Most of the long-and slender-type germplasm were found in P2 population. Other grain types, e.g., short bold and short-medium germplasm, were found in P1 and P3.

Conclusion
Phenotypic traits and SSR marker based molecular characterization of 113 aromatic Bangladeshi germplasm confirmed genetic variation among the germplasm. The phenotypic traits classified the tested germplasm into major four groups. However, the population structure analysis allowed us to identify three major groups within these germplasm and this that generally agrees with the farmers' classification of the germplasm. Future population genetics-based studies that include extensive collections of rice genetic resources from all of the districts of Bangladesh would help in exploiting this rice gene pool more effectively for rice improvement program. Bangladesh has 64 districts where rice grows every year and usually farmers cultivate some local germplasm with high yielding varieties. In this study we only collected germplasm from 22 districts, remaining all other districts may have many other germplasm that will provide wide diversity and can be used in future rice breeding programme for improvement of aromatic rice in Bangladesh as well global if they will be collected and preserved in geebank.  In both domestic and international markets, aromatic rice commands a premium price, often two to three times higher than traditional rice cultivars due to consumer quality preferences. Some of the traditional aromatic rice varieties of Bangladesh-including "Kataribhog, " "Chini Sagar, " "Kalobhog, " "Chini Atob, " "Noyonmoni, " "Chinnigura," "Gopalbhog," "Tulsimoni," "Jirabuti," "Khirshaboti," "Rajbut," and "Kalijira"-are cultivated throughout the country for traditional and consumer-preference reasons. Low yield of these high-value rice, however, limit their market potential. Better understanding the genetic diversity preserved in this aromatic rice gene pool will facilitate proper maintenance, conservation, and utilization of this valuable resource.

Materials and Methods
Materials. In this study, we used 45 polymorphic SSR markers to analyze the genetic diversity of 113 aromatic rice germplasm representing landraces, fine rice genotypes, elite cultivars, and exotic genotypes preserved in BRRI genebank in Gazipur, Bangladesh. Names for the 113 aromatic rice germplasm along with quantitative phenotypic traits have been previously described 7 . Seeds from BRRI are publicly available for research purposes upon request with a materials transfer agreement.
For our analysis, 52 SSR markers were used from the 'Gramene' marker database 66 . Out of these 52 SSR markers, 45 were polymorphic, in terms of their bands, among the rice varieties, while seven were monomorphic; the marker names and their respective sequences are presented in Supplementary information (Table S3). The 45 polymorphic markers selected for analysis were distributed across the 12 chromosomes, whereby three were linked to aromatic traits, four were related to cooking and eating quality traits, 31 were listed in the panel of 50 standard SSR markers used for diversity analysis 66 and the remainder of the SSRs were selected randomly.
Analysis of phenotypic traits. The seed of each tested germplasm was taken from BRRI gene bank and grown in the BRRI research field following a rice production technology previously described by Islam et al. 7 . Data on 28 qualitative phenotypic traits were recorded. Ten randomly selected plants from each germplasm were used for recording the respective phenotypic traits data. Each phenotypic trait with their descriptors, according to BRRI 67 , are shown in Table 1. All other quantitative phenotypic traits have been previously reported 7 . The most important 12 qualitative phenotypic traits selected as follows: leaf blade color, leaf sheath color, flag leaf angle, lemma-palea color, seed coat color, stigma color, awn in the spikelet, apiculus color, leaf senescence, aroma content, panicle type, and panicle exertion. We used all 12 traits in the following statistical analyses. Shannon-Weaver diversity index (H), evenness, and diversity profile, as well as clustering and PCA analyses of 12 phenotypic traits, were determined using PAST (PAleontological STatistics) software 68 . Phenotypic frequency data of the 12 traits were analyzed using the Shannon-Weaver diversity index (H) given as: where k is the number of phenotypic classes for a trait and p i is the proportion of the total number of entries (n) in the i th (i) class.
DNA Extraction and PCR analysis. DNA was extracted from young leaves of 20-day-old plants following the mini-scale method 69 . Each PCR was carried out in a 20 μl reaction volume containing 1 μl of MgCl 2 free 10 × PCR buffer with (NH 4 ) 2 SO 4 , 1.2 μl of 25 mM of MgCl 2 , 0.2 μl of 10 mM of dNTPs, 0.2 μl of 5 U/μl Taq DNA polymerase, 0.5 μl of 10 μM forward and reverse primers, and 3 μl (10 ng) of DNA using a 96-well thermal cycler. An additional 10 μl of mineral oil was added in each well to prevent evaporation. Amplification was carried out using a G-storm PCR machine (Gene Technologies Ltd., England). Amplification conditions were one cycle at 94 °C for 5 minutes (initial denaturation) followed by 35 cycles at 94 °C for 1 minute (denaturation), 55 °C for 1 minute (annealing), 72 °C for 2 minutes (extension) with a final extension for 7 minutes at 72 °C at the end of 35 cycles. After mixing with the loading dye, PCR products were run through polyacrylamide gels. A 50 bp DNA ladder was used to determine the amplicon size. Three 4 μl PCR products were resolved by running gel in 1 × TBE buffer for 1.5 to 2.5 h depending upon the allele size at approximately 90 volts and 500 mA electricity. Gels were then stained with 1 μg/mL of ethidium bromide and documented using a Molecular Imager gel documentation unit (XR System, BIO-RAD, Korea).

Spearman's Coefficient of Rank Correlation.
Quantitative phenotypic data for tested germplasm have been reported previously 7 . In this study, however, only the quantitative traits' values were used for the Spearman ' s coefficient calculation. Morphological and genetic distances among the genotypes were estimated, and ranking was done using Spearman's coefficient with SSR analysis, 6328 = − n n ( 1) 2 . Rank coefficients (r s ) were further calculated by Spearman rank correlation test, in which data were collected as ranks or were ranked after observation on some other scale 70 . To measure and compare the association between two criteria of rankings, Spearman devised the following formula for estimating rank correlation (r s ): where n = number of observations, and d i = differences between the two ranks of each observation. In this study, ranking involved D 2 distances of phenotypic data and SSR analysis. Given that the number of pairs was large, estimated r s values were further tested for significance using the criterion Analysis of genotypic data model based approach. The DNA fragment of each germplasm was amplified using SSR markers and analyzed using different software. Molecular weight for each amplified DNA fragment was measured using Alpha Ease FC 4.0 software. We determined, were using Power Marker version 3.25, the summary statistics, which included (i) the number of alleles per locus, (ii) the frequency of major alleles, and the polymorphism information content (PIC) values 71 . An unrooted neighbor-joining tree from molecular data was constructed using MEGA 72,73 . Also, the allele frequency data from PowerMarker were used for analysis with NTSYS-pc version 2.1 74 , with the pair-wise genetic dissimilarity coefficients calculated using "Nei1983/ CSChord1967" distance 64,75 . We calculated a similarity matrix in the Simqual subprogram using the Dice coefficient, which was followed by cluster analysis using the sequential, agglomerative, hierarchical and nested (SAHN) subprogram using the unweighted pair-groups with the arithmetic averages (UPGMA) clustering method as implemented in NTSYS-pc. Population structure for 113 germplasm was constructed using STRUCTURE version 2.3.4 76,77 . The number of clusters (K) we investigated ranged 1 to 15. The analysis used five replications for each K value. The model was used following admixture and correlated allele frequency with a 5000 burn period and a run length of 50000. The output of the analysis was collected using the STRUCTURE harvester and to identify three as the best K value based on the LnP(D) and Evano's ΔK 78 . The principal components analysis (PCA) in a visualization technique that is commonly used in multivariate statistics, whereby the user can identify eigenvectors and amounts of variance and cumulatively explained variances per component 79,80 was conducted using the SPSS (Version 16). In order to summarize the major variance patterns in the multi-locus dataset, an analysis of molecular variance (AMOVA) was performed using GenAlEx V6.5 81 .