Introduction

The world population is expected to reach 9 billion by 2050, and providing food, feed, and fiber for a population of this size is a major challenge that requires plant breeders to develop high-yielding crop varieties1. Modelling of yield-growth trends has shown that increases in crop yield cannot keep pace with the growing population, and the situation is worsening because of climate change, particularly rising temperatures and extreme weather patterns2. In recent years, the availability of diverse datasets, ranging from high-throughput phenotyping to genotyping by single nucleotide polymorphisms (SNPs) and the resulting biochemical “omics” datasets, has been increasing. However, the major challenge for “big data” in modern plant science and technology is to predict phenotypes from underlying genotypes efficiently and precisely under changing climatic conditions. Variability in DNA sequences is translated into the biochemical make-up of cells and tissues, and this biochemical make-up, together with environmental cues, shapes organ formation, plant growth, crop yield, and resistance to abiotic and biotic stresses. In modern molecular plant breeding, exploring the impacts of environmental and genotypic variation opens new horizons for understanding the regulation of essential processes in the crop life cycle. Such variation also helps in understanding quality traits and in predicting crop yield under particular environmental conditions3.

Analyzing phenotypes at different levels of organization and devising links between phenotypes and genotypes require the integration and processing of large, noisy, and heterogeneous datasets. Machine learning (ML), a set of statistical and computational methods, can be used to discover predictive patterns in such datasets4,5,6. Machine learning-based product classification applications are powerful tools for developing accurate and robust classifiers. These applications include algorithms such as decision trees, artificial neural networks, genetic algorithms, regression, and fuzzy logic. In addition, such algorithms are useful for developing diverse machine learning models, fitting complicated input–output mappings, and selecting or discarding features. These models are frequently applied to select suitable and descriptive traits when assessing the quality of agricultural commodities7.

Sunflower is native to North America and has attracted the attention of agricultural scientists and farming communities as one of the most important industrial crops8. The genus Helianthus is also studied in evolutionary biology and adaptive introgression because of its hybrid traits9,10. Sunflower is important as a model for studying solar tracking, and it is also helpful for understanding flower development in plant science11. Furthermore, sunflower is cultivated throughout the world for its premium-quality edible oil and is at present the fourth most important oilseed crop in the world. Its widespread cultivation is due to its ability to grow in a wide range of environments, its high seed-yielding capacity, and the potential to raise two crops in a calendar year12,13. For cultivars with greater yield potential, it is an established fact that F1s produced from distantly related, high-yielding inbred lines are transgressively superior to their parents and to crosses between closely related genotypes14.

ML applications have played a significant role in engineering and medical domains, but their potential in applied plant research is yet to be fully explored. In the present study, three machine learning approaches, i.e., hierarchical clustering, K-means clustering, and hybrid clustering (hierarchical + K-means), were used to identify heterotic patterns. The practical application and efficacy of the heterotic groups identified in this way were then studied by developing F1 sunflower hybrids among the heterotic groups.

Materials and methods

Experiment 1

Plant material

Experiment 1 was conducted at the National Agricultural Research Center (NARC), Islamabad, Pakistan, located at latitude 33.6641° N and longitude 73.1276° E. The plant material comprised 109 genetically diverse sunflower lines, which were used to mine the grouping pattern and then to test the efficacy of the identified grouping pattern in sunflower hybrid seed development. The plant material (sunflower genotypes) was obtained from NARC (Table S1). Collection of plant materials complied with institutional, national, and international guidelines and legislation. All experiments and analyses were performed following the relevant regulations and guidelines.

Phenotyping

To characterize the sunflower genotypes phenotypically, all 109 sunflower lines were planted under open field conditions for two consecutive years in the spring season following a randomized augmented block design. Data for nine plant morphometric characteristics, i.e., plant height, stem curvature, days to flower initiation, days to flower completion, number of leaves per plant, head diameter, leaf area, 100-seed weight, and seed yield per plant, were recorded from 10 randomly selected plants of each genotype, and their average was computed (Supplementary Table 1).

Genotyping

Molecular genotyping of the 109 sunflower lines was conducted using 40 SSR markers (Supplementary Table 2). Genomic DNA was extracted from 1.5- to 2-week-old sunflower seedlings following the cetyl trimethyl ammonium bromide (CTAB) method described by Saghai-Maroof et al.15. The DNA was then diluted with 50 µl TE buffer and run on a 1% agarose gel to determine its quality and concentration before PCR analysis. Each DNA fragment amplified by a microsatellite marker was treated as a unit trait and scored 1 for the presence and 0 for the absence of a band, thus generating a binary data matrix.

Protein characterization (proteomic analysis)

Protein characterization was performed by SDS-PAGE of total seed proteins in vertical slabs as described by Jan et al.16. Separating and stacking gels were prepared at different concentrations. The gels were run at 100 V until the bromophenol blue (BPB) marker reached the bottom of the gels. A pre-stained protein ladder with a range of 10–180 kDa (Lot: 00345 035) was used to determine the molecular weights of the sunflower seed proteins. To visualize the proteins, the gels were stained with 0.25% (w/v) Coomassie brilliant blue (CBB) solution prepared in 10% (v/v) acetic acid and 40% (v/v) methanol diluted in water. After staining, the gels were de-stained to wash away excess CBB dye. The de-staining solution contained 5% (v/v) acetic acid, 20% (v/v) methanol, and distilled water in a 5:20:75 ratio.

Each polypeptide band in a gel was treated as a unit character and scored 1 when present in a given sunflower genotype and 0 when absent. This generated a binary matrix of the protein bands observed in the 109 sunflower genotypes under study.
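The presence/absence scoring used for both the SSR fragments and the protein bands can be illustrated with a minimal Python sketch. The genotype names, band labels, and the helper `binary_matrix` are hypothetical and are not taken from the study, which does not describe its scoring software.

```python
# Hypothetical band calls: for each genotype, the set of band labels seen on the gel.
observed = {
    "G1": {"b1", "b3"},
    "G2": {"b2", "b3"},
    "G3": {"b1"},
}

def binary_matrix(observed):
    """Return (band labels, {genotype: [0/1, ...]}) with one column per band."""
    # Every band seen in at least one genotype becomes a column.
    bands = sorted(set().union(*observed.values()))
    # Score 1 if the genotype shows the band, 0 otherwise.
    rows = {g: [1 if b in seen else 0 for b in bands] for g, seen in observed.items()}
    return bands, rows

bands, rows = binary_matrix(observed)
for genotype, row in rows.items():
    print(genotype, row)
```

Each row of the resulting matrix is one genotype and each column one band, which is the shape needed for the clustering steps described later.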

Machine learning baseline

The application of artificial intelligence to accelerate the breeding of new cultivars is a burgeoning area of development, given the major impact it could have on plant breeding. The machine learning baseline17 is a generic, modular, and reusable workflow that combines agronomic principles of crop modeling with machine learning. The input data consist of a diverse aggregated dataset collected at three levels of the plant’s organizational structure, i.e., morphological, molecular, and biochemical. The corresponding output is a set of clustering patterns of the 109 sunflower genotypes obtained by applying three machine learning classifiers, i.e., hierarchical clustering, K-means clustering, and hybrid (hierarchical + K-means) clustering. A schematic workflow describing the relationship between the input variables (integrated dataset) and the corresponding output (heterotic groups) using the three machine learning classifiers is presented in Fig. 1. The strategy followed for the collection, processing, and integration of the three datasets was as follows:

  (a) Data for nine morphological parameters were collected from the 109 sunflower genotypes.

  (b) Genotyping scores (binary format) were collected from the amplification data of 40 SSR primers.

  (c) Protein characterization data (binary format) were extracted from the banding data of 14 different protein bands obtained after SDS-PAGE analysis of the 109 sunflower genotypes.

  (d) The data collected from all three sources (morphological, molecular, and biochemical) were scaled to a range of 0–1.

  (e) Three machine learning classifiers (hierarchical, K-means, and hybrid (hierarchical + K-means)) were employed to select suitable genotypes for F1 hybrid development and to evaluate the heterotic groups identified in the studied genotypic pool of sunflower.

  (f) The classification pattern obtained from each machine learning classifier was evaluated for the selection of suitable genotypes.

  (g) Finally, the performance of the F1 hybrids was evaluated to characterize the efficiency of the machine learning algorithm in identifying suitable parental genotypes in hybrid breeding programs.

Figure 1

A schematic workflow of the procedures adopted to identify the underlying heterotic grouping pattern in the studied sunflower germplasm pool. The input dataset combined phenotypic data (9 morphological traits), genotypic data (binary scores from 40 SSR markers), and proteomic data (binary scores from SDS-PAGE-based characterization). The dataset was then scaled to values between 0 and 1 to remove bias arising from differences in measurement scale. Three machine learning-based classifiers were then tested to produce the heterotic grouping of the studied sunflower materials. Of the three, hierarchical clustering showed the maximum resolution power in terms of correctly classifying the A, B, R, and SFP lines. Hierarchical clustering classified the 109 sunflower genotypes into 12 heterotic groups, and this clustering pattern was then used to develop F1 hybrids by selecting one sunflower line from each heterotic group. Both hierarchical and K-means-based clustering have been widely used on plant datasets (including phenotypic and genotypic data) for the classification of plant genotypes.

Data preparation and preprocessing

Data normalization and scaling are important preprocessing procedures that remove nuisance variation and improve the efficiency of learning from data18. Moreover, when the recorded variables differ in nature, i.e., continuous and binary, more care is needed to avoid over- or underfitting of the machine learning algorithms19. In this experiment, the morphological variables were recorded on different scales, whereas the genotyping and protein profiling data were recorded on a binary scale. To address overfitting and inaccuracy of the machine learning algorithms, data scaling was performed according to the Yeo and Johnson normalization method20. During data scaling, all traits were scaled to a [0,1] range using the following equation:

$${X}_{scaled}=\frac{X-{X}_{min}}{{X}_{max}-{X}_{min}}$$

where \({X}_{scaled}\) is the scaled value of the input variable X, and \({X}_{min}\) and \({X}_{max}\) are the minimum and maximum values of X, respectively. This scaling also ensured that all values in the final dataset fall within the same bounded range.
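Scaling a trait to the [0,1] range, as described above, can be sketched in a few lines of Python. The function name `min_max_scale`, the example values, and the handling of a constant trait are assumptions made for illustration; the study performed this step in R.

```python
def min_max_scale(values):
    """Scale a list of numbers to [0, 1] using the minimum and maximum of the list."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant trait carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical continuous trait (e.g. plant height in cm) and a binary marker column.
height_scaled = min_max_scale([100.0, 150.0, 200.0])
marker_scaled = min_max_scale([0, 1, 1, 0])
print(height_scaled)
print(marker_scaled)
```

A binary column that contains both 0 and 1 is left unchanged by this scaling, so the morphological, SSR, and protein variables end up on a common range.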

Data integration and heterotic grouping identification

For careful and accurate identification of the heterotic grouping patterns present in the sunflower genotype pool, all three datasets, i.e., morphological, genotypic, and protein, were aggregated and evaluated collectively after data scaling. Three classification algorithms, i.e., (a) hierarchical clustering, (b) K-means clustering, and (c) hybrid clustering (hierarchical + K-means), were applied to the aggregated dataset. Data preprocessing and application of the machine learning algorithms were performed using R-Studio version 1.2.1335. The packages used for the three machine learning algorithms were factoextra, FactoMineR, dendextend, cluster, and tidyverse.
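As an illustration of the aggregation step, the sketch below concatenates the three scaled blocks into one feature vector per genotype and computes a Euclidean distance between two genotypes. It is written in Python rather than with the R packages listed above, and the data, the helper names `aggregate` and `euclidean`, and the choice of Euclidean distance are assumptions.

```python
import math

# Hypothetical scaled blocks for two genotypes.
morph = {"G1": [0.0, 0.5], "G2": [1.0, 0.5]}   # scaled morphological traits
ssr = {"G1": [1, 0, 1], "G2": [0, 0, 1]}       # binary SSR scores
protein = {"G1": [1, 1], "G2": [1, 0]}         # binary protein-band scores

def aggregate(*blocks):
    """Concatenate the per-genotype vectors of every block, in the order given."""
    genotypes = sorted(blocks[0])
    return {g: [float(x) for block in blocks for x in block[g]] for g in genotypes}

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

features = aggregate(morph, ssr, protein)
distance = euclidean(features["G1"], features["G2"])
print(features["G1"], features["G2"], distance)
```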

The algorithms were compared, and the one with the highest resolution was selected for the subsequent selection of genotypes from the genotypic pool. The algorithm that classified the sunflower genetic pool under study (A-lines, B-lines, R-lines, and self-pollinated (SFP) lines) most accurately was regarded as the one with the highest resolution power. The clustering pattern obtained from this algorithm on the aggregated dataset was then carefully evaluated to select highly diverse and high-yielding sunflower genotypes, i.e., one genotype from each heterotic group (on the assumption that the selected genotype has the same breeding potential as the other genotypes in the same heterotic group). The selected genotypes were then crossed with each other to obtain F1 sunflower hybrids.
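The text does not state how resolution was quantified. One possible measure, assumed here purely for illustration, is cluster purity: the share of genotypes that belong to the majority line type of their cluster. The Python sketch below uses hypothetical labels.

```python
from collections import Counter

def purity(cluster_ids, line_types):
    """Fraction of genotypes whose line type is the majority type of their cluster."""
    by_cluster = {}
    for cluster, line_type in zip(cluster_ids, line_types):
        by_cluster.setdefault(cluster, []).append(line_type)
    # Count the most frequent line type in every cluster and add the counts up.
    majority = sum(Counter(types).most_common(1)[0][1] for types in by_cluster.values())
    return majority / len(line_types)

# Hypothetical known line types and two competing cluster assignments.
line_types = ["R", "R", "R", "A", "B", "SFP"]
purity_a = purity([1, 1, 1, 2, 2, 2], line_types)
purity_b = purity([1, 1, 2, 2, 2, 2], line_types)
print(purity_a, purity_b)
```

Purity tends to rise as the number of clusters grows, so it is best compared between solutions with the same number of clusters.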

Experiment 2

Evaluation of identified heterotic groups

F1 crosses development and evaluation

Based on the results of Experiment 1, one genotype from each identified heterotic group was selected as a representative of the whole group and used in a hybridization scheme following a Line × Tester mating design. Each male line was crossed with each female line: the six CMS lines identified were crossed with the six restorer lines to generate 36 sunflower F1 hybrids.
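The full Line × Tester crossing scheme is the Cartesian product of the female and male parents. The Python sketch below enumerates it using the parental names that appear in the Results; the variable names are illustrative.

```python
from itertools import product

# Parental line names as they appear in the Results section.
cms_lines = ["CMS-HAP-12", "CMS-HAP-54", "CMS-HAP-56",
             "CMS-HAP-99", "CMS-HAP-111", "CMS-HAP-112"]
restorers = ["RHP-38", "RHP-41", "RHP-53", "RHP-68", "RHP-69", "RHP-71"]

# Every female (CMS) line is paired with every male (restorer) line.
crosses = list(product(cms_lines, restorers))

print(len(crosses))
for female, male in crosses[:3]:
    print(f"{female} × {male}")
```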

F1 phenotyping

The sunflower F1 hybrids obtained were tested under open field conditions at the National Agricultural Research Center, and data on growth and yield attributes (days to flower initiation, days to flower completion, plant height, stem curvature, head diameter, number of leaves per plant, leaf area, 100-seed weight, and seed yield per plant) were recorded.

Statistical analysis

The data collected on the sunflower hybrids were subjected to statistical analyses, namely ANOVA, heterosis, heterobeltiosis, and combining ability analysis, to understand the yield potential of each sunflower hybrid and to assess the efficacy of the heterotic grouping pattern identified. R-Studio version 1.3.1335 was used for the statistical analysis of the F1 cross combinations.
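The text does not give the formulas used for heterosis and heterobeltiosis. The Python sketch below uses the conventional definitions, i.e., the percentage deviation of the F1 mean from the mid-parent value and from the better parent, and should be read as an assumption about the calculation. The `higher_is_better` flag and the example values are hypothetical; for traits such as days to flowering, the lower parent may be regarded as the better one.

```python
def mid_parent_heterosis(f1, p1, p2):
    """Percentage deviation of the F1 mean from the mean of its two parents."""
    mid_parent = (p1 + p2) / 2
    return (f1 - mid_parent) / mid_parent * 100

def heterobeltiosis(f1, p1, p2, higher_is_better=True):
    """Percentage deviation of the F1 mean from the better parent."""
    better_parent = max(p1, p2) if higher_is_better else min(p1, p2)
    return (f1 - better_parent) / better_parent * 100

# Hypothetical seed yields per plant (g): F1 = 90, parents = 60 and 40.
mph = mid_parent_heterosis(90.0, 60.0, 40.0)
bph = heterobeltiosis(90.0, 60.0, 40.0)
print(round(mph, 2), round(bph, 2))
```

Significance marks such as * and ** reported alongside these percentages come from a separate statistical test, which this sketch does not cover.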

Results

Experiment 1

For accurate identification of the heterotic grouping pattern, a multi-pronged strategy was adopted, wherein the morphological, biochemical, and molecular datasets of the sunflower genotypes were analyzed using three clustering algorithms, i.e., hierarchical, K-means, and hybrid (hierarchical + K-means). The efficacy of these three machine learning algorithms was tested on the sunflower genotypes, and the algorithm that classified the genotypes most accurately was used for the final parental selection for hybrid development.

Hierarchical clustering

Figure 2 presents the dendrogram obtained with the hierarchical clustering algorithm. For hierarchical clustering, the Ward.D2 method was applied to the combined dataset of morphological, biochemical, and molecular characterization. The cluster diagram (Fig. 2) showed two distinct classes of genotypes: cluster 1 contains all the restorer lines, while cluster 2 contains the CMS lines, B-lines, and self-pollinated lines. Cluster 1 comprises 31 sunflower genotypes, and the remaining 78 genotypes are grouped in cluster 2. At a genetic distance of 18, cluster 1 can be sub-divided into six smaller groups: sub-group 1-A has 6 genotypes, while sub-groups 1-B, 1-C, 1-D, 1-E, and 1-F have 3, 8, 6, 2, and 6 genotypes, respectively. Likewise, cluster 2 can be divided into six sub-groups at a genetic distance of 18: sub-group 2-A has 8 genotypes, sub-group 2-B has 11, and sub-groups 2-C, 2-D, 2-E, and 2-F have 7, 20, 20, and 12 genotypes, respectively.
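Cutting a Ward dendrogram at a fixed height can be sketched in plain Python as follows. This is a naive illustration and not the R implementation used in the study. The merge height is intended to follow the Ward.D2 convention, in which two single points merge at their Euclidean distance; that convention, the helper names, and the toy data are assumptions.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def ward_height(a, b):
    """Ward merge height between two clusters given as lists of points."""
    ca, cb = centroid(a), centroid(b)
    d2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return math.sqrt(2 * len(a) * len(b) / (len(a) + len(b)) * d2)

def ward_cut(points, height):
    """Merge clusters until the cheapest merge exceeds `height`; return index lists."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                h = ward_height([points[k] for k in clusters[i]],
                                [points[k] for k in clusters[j]])
                if best is None or h < best[0]:
                    best = (h, i, j)
        h, i, j = best
        if h > height:
            break  # the next merge would lie above the cut line
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# Hypothetical 2-D data: two tight pairs that lie far apart.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
groups = ward_cut(points, height=2.0)
print(groups)
```

Raising the cut height merges more clusters, which mirrors reading the dendrogram at a greater distance than 18.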

Figure 2

Hierarchical clustering of 109 sunflower genotypes through Ward.D2 method.

K-means clustering

The K-means algorithm is an unsupervised machine learning approach that groups similar data points into one cluster, away from dissimilar data points. More precisely, the algorithm aims to minimize the within-cluster sum of squares and consequently maximize the between-cluster sum of squares. In the present study, K-means clustering of the 109 sunflower genotypes grouped them into two major clusters (Fig. 3). Cluster 1 contains 31 genotypes, while cluster 2 contains 78. Cluster 1 predominantly contains restorer lines, while cluster 2 contains the self-pollinated (SFP) lines, A-lines, and B-lines of the sunflower genetic pool under study. Although K-means grouped the sunflower genotypes precisely into two major clusters, it was not possible to select genotypes more precisely from smaller groups using this algorithm, because many SFP lines lie close to the A-lines or B-lines, making it hard to distinguish between them.
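The relationship between the within-cluster and between-cluster sums of squares can be illustrated with a small Python sketch. It uses Lloyd's algorithm with fixed starting centers, which may differ from the K-means variant run in R for the study; the data and function names are hypothetical.

```python
def mean(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, start_centers, max_iter=100):
    """Lloyd's algorithm from given starting centers; returns (labels, centers)."""
    centers = [list(c) for c in start_centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = mean(members)
    return labels, centers

def sums_of_squares(points, labels, centers):
    """Return (total, within-cluster, between-cluster) sums of squares."""
    grand = mean(points)
    total = sum(sq_dist(p, grand) for p in points)
    within = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return total, within, total - within

points = [[0.0], [1.0], [9.0], [10.0]]
labels, centers = kmeans(points, start_centers=[[0.0], [10.0]])
total, within, between = sums_of_squares(points, labels, centers)
print(labels, centers, total, within, between)
```

Because the total sum of squares is fixed by the data, any reduction in the within-cluster part appears as an equal gain in the between-cluster part.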

Figure 3

K-means clustering of 109 sunflower genotypes.

K-means and hierarchical hybrid clustering approach

Finally, a hybrid algorithm combining hierarchical and K-means clustering was applied to the sunflower genotypes to examine whether more precise heterotic groups could be obtained. With the number of clusters (k) set to 12, two major clusters were observed, which were further divided into 12 smaller clusters (Fig. 4). Cluster 1 contained 12 genotypes (2 B-lines and 10 restorer lines), and cluster 2 contained 8 genotypes (4 CMS lines and 4 B-lines). Cluster 3 had 4 genotypes (1 B-line and 3 SFP lines), and 12 genotypes (6 CMS lines, 5 B-lines, and 1 SFP line) were grouped into cluster 4. Cluster 5 gathered 15 genotypes, all of which were restorer lines, and 11 genotypes were grouped in cluster 6 (5 CMS lines, 4 SFP lines, 1 restorer line, and 1 B-line). Likewise, cluster 7 had 6 genotypes (5 SFP lines and 1 CMS line), and cluster 8 had 11 genotypes (6 SFP lines, 4 restorer lines, and 1 CMS line). Six genotypes (3 CMS lines, 2 SFP lines, and 1 restorer line) were grouped in cluster 9, while cluster 10 contained 8 genotypes (3 CMS lines, 3 restorer lines, and 2 B-lines). Cluster 11 had 8 genotypes (3 SFP lines, 2 CMS lines, 2 B-lines, and 1 restorer line), and 8 genotypes were grouped in cluster 12 (3 restorer lines, 2 CMS lines, 2 B-lines, and 1 SFP line).

Figure 4

Clustering of 109 sunflower genotypes through hybrid (hierarchical + K-means) machine learning.

The grouping of sunflower genotypes obtained with the hybrid algorithm (hierarchical + K-means) was useful to some extent, as it can be used to group closely related genotypes. However, it placed genotypes with distinct characteristics, such as restorer lines and CMS lines, close together, which makes the grouping ambiguous; hence this algorithm was also not a good fit for the current study. As the grouping obtained with the hierarchical clustering algorithm was clearer and more definitive, the selection of potential parents for the development of sunflower hybrids was based on the grouping observed with hierarchical clustering.

Selection of parents

As 12 clusters were observed with the hierarchical clustering method, one genotype from each of the 12 clusters was selected for further use in the sunflower hybrid breeding program. From each of the 12 clusters (defined at a height of 18), the genotype exhibiting the highest seed yield potential was selected. Moreover, because all the restorer lines clustered separately from the CMS lines, a Line × Tester mating design was followed for the development of sunflower F1 hybrids.
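Picking the highest-yielding genotype from each cluster can be expressed as a short Python sketch. The genotype names, cluster labels, yields, and the helper `top_yielder_per_cluster` are hypothetical; the study's actual selections are those reported in Table 1.

```python
def top_yielder_per_cluster(cluster_of, seed_yield):
    """Return {cluster: genotype with the highest seed yield in that cluster}."""
    best = {}
    for genotype, cluster in cluster_of.items():
        if cluster not in best or seed_yield[genotype] > seed_yield[best[cluster]]:
            best[cluster] = genotype
    return best

# Hypothetical cluster membership and seed yield per plant (g).
cluster_of = {"G1": "1-A", "G2": "1-A", "G3": "2-B"}
seed_yield = {"G1": 40.0, "G2": 55.5, "G3": 30.2}

selected = top_yielder_per_cluster(cluster_of, seed_yield)
print(selected)
```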

Experiment 2

Evaluation of identified heterotic groups

To assess the practical efficiency of the identified heterotic groups, the selected parental lines were crossed in a Line × Tester mating design, and 36 sunflower F1 hybrids were generated. Heterosis (mid-parent and better-parent heterosis) and combining ability analyses (general and specific combining ability) were conducted to evaluate the potential of the methodology used for identifying heterotic grouping patterns and for selecting potential parental lines for commercial hybrid development.

Mean performance of parents

Table 1 presents the mean performance of 12 sunflower lines that were planted at NARC, Islamabad. The study focused on nine agro-morphological traits. Among the lines, CMS-HAP-112 exhibited the shortest duration to initiate flowering, taking only 46.5 days, while RHP-41 had the longest duration of 56.5 days. CMS-HAP-111 completed 100% flowering the earliest, within 55 days, followed by CMS-HAP-112 at 55.5 days. On the other hand, RHP-41 took the maximum number of days to complete flowering, with a duration of 67.5 days. Regarding plant height, the 12 parental sunflower lines displayed a range from 200.14 cm (CMS-HAP-54) to 134.6 cm (CMS-HAP-111). In terms of leaf area, CMS-HAP-56 had the highest recorded value of 257.48 cm², while RHP-38 had the lowest average leaf area of 141.5 cm². The largest head diameter of 19.3 cm was observed in CMS-HAP-99, whereas the smallest head diameter of 10.45 cm was found in RHP-38. In the context of stem curvature, the lowest value recorded was 6.95 cm for RHP-71, while CMS-HAP-111 and CMS-HAP-12 exhibited the highest stem curvatures of 48 cm and 45.7 cm, respectively. The number of leaves varied among the parental lines, with CMS-HAP-111 having the fewest leaves (23.35), and CMS-HAP-112 having the highest number of leaves (33.1), followed by CMS-HAP-99 (33). The 100-seed weight of the parental lines ranged from 3.48 g (RHP-69) to 6.61 g (CMS-HAP-99). CMS-HAP-112 displayed the highest mean seed yield per plant at 68.19 g, while the lowest seed yield per plant was observed in RHP-68 (27.28 g) and RHP-41 (27.9 g) (Table 1).

Table 1 Mean performance of parents regarding morphometric characteristics.

Mean performance of hybrids

Table 2 shows the mean performance of the 36 sunflower hybrids grown at NARC, Islamabad, for nine agro-morphological traits. Hybrids RHP-68 × CMS-HAP-112 and RHP-38 × CMS-HAP-112 had the shortest time to flower initiation, only 44 days, whereas RHP-71 × CMS-HAP-56 had the longest, at 56.5 days. RHP-68 × CMS-HAP-112 and RHP-38 × CMS-HAP-54 required the minimum number of days (50) to complete 100% flowering, whereas RHP-71 × CMS-HAP-111 required the maximum (66.5 days). Regarding mean leaf area approaching physiological maturity, RHP-71 × CMS-HAP-56 showed the highest value of 176.53 cm², while RHP-69 × CMS-HAP had the lowest mean leaf area. The largest head diameter, 23.95 cm, was recorded for RHP-71 × CMS-HAP-99, followed by RHP-53 × CMS-HAP-111 with 22.77 cm. Conversely, RHP-68 × CMS-HAP-112 had the smallest head diameter of 17.11 cm, followed by RHP-68 × CMS-HAP-54 with 17.53 cm. The tallest hybrid was RHP-71 × CMS-HAP-112, with an average height of 175.17 cm, while the shortest hybrids were RHP-53 × CMS-HAP-111 (131 cm) and RHP-41 × CMS-HAP-56 (132 cm).

Table 2 Mean performance of hybrids regarding morphometric characteristics.

Regarding stem curvature, the lowest recorded value was 42.77 cm for RHP-68 × CMS-HAP-54, followed by RHP-53 × CMS-HAP-54 with 48.83 cm. HAP-99 and RHP-38 × CMS-HAP-112 exhibited the maximum stem curvatures of 77.5 cm and 74.83 cm, respectively. RHP-53 × CMS-HAP-111 had the lowest number of leaves (26) and RHP-71 × CMS-HAP-56 the highest (36.67), followed by RHP-71 × CMS-HAP-99 (36.17). The 100-seed weight of the hybrids ranged from 4.41 g (RHP-71 × CMS-HAP-111) to 7.34 g (RHP-38 × CMS-HAP-12). The minimum seed yield per plant, 49.3 g, was recorded for hybrid RHP-53 × CMS-HAP-111, whereas RHP-71 × CMS-HAP-54 showed the highest average seed yield of 103.36 g per plant, followed by RHP-41 × CMS-HAP-111 with 99.45 g.

Heterosis and heterobeltiosis

The results of heterosis and heterobeltiosis for the nine morphological characteristics of the sunflower plants are presented in Tables 3 and 4. Heterosis for days to flower initiation ranged from 10.14**% (CMS-HAP-111 × RHP-71) to −13.04% (CMS-HAP-56 × RHP-68). The heterotic effects of six hybrids were positive, and six cross combinations showed non-significant heterosis. All remaining cross combinations showed highly significant heterosis for days to flower initiation. The heterobeltiotic effects recorded for the 36 sunflower hybrids ranged from −20.35% (CMS-HAP-112 × RHP-41) to 3.65*% (CMS-HAP-111 × RHP-71), and most were negative.

Table 3 Heterosis (mid parent) and heterobeltiosis (better parent) of 36 sunflower hybrids.
Table 4 Heterosis (mid parent) and heterobeltiosis (better parent) of 36 sunflower hybrids.

CMS-HAP-54 × RHP-38 showed the largest negative heterotic effect for days to 100% flowering (−18.37**%), followed by CMS-HAP-56 × RHP-41 (−17.0**%) and CMS-HAP-56 × RHP-38 (−16.73**%), whereas CMS-HAP-111 × RHP-71 showed the highest positive heterotic effect for this trait (13.68**%), followed by CMS-HAP-12 × RHP-71 (8.94**%). The heterotic effect was significant for all hybrids except CMS-HAP-111 × RHP-53. Heterobeltiosis ranged from −23.7**% (CMS-HAP-112 × RHP-41) to 7.26**% (CMS-HAP-111 × RHP-71). The heterobeltiotic effect for days to complete flowering was statistically highly significant in all hybrid combinations except four, viz., CMS-HAP-112 × RHP-71, CMS-HAP-12 × RHP-71, CMS-HAP-54 × RHP-71, and CMS-HAP-99 × RHP-71.

For leaf area, heterosis over the mid parent ranged from 3.63ns% to −44.26**%. The highest positive heterosis was noted for CMS-HAP-12 × RHP-38 (3.63ns%), while the largest negative heterotic effect was recorded for the F1 hybrid CMS-HAP-56 × RHP-41 (−44.26**%). The largest negative heterobeltiosis was observed for CMS-HAP-56 × RHP-41 (−48.28**%), followed by CMS-HAP-56 × RHP-68 (−46.11**%). The heterobeltiotic effects of 29 hybrids were statistically significant.

Maximum heterosis for head diameter was observed for CMS-HAP-12 × RHP-38 (59.49**%), whereas the lowest mid-parent heterosis was shown by CMS-HAP-112 × RHP-68 (4.65ns%) (Table 3). All hybrids exhibited positive mid-parent heterosis. Maximum heterobeltiosis was observed for CMS-HAP-12 × RHP-71 (31.71**%), while minimum heterobeltiosis was recorded for CMS-HAP-99 × RHP-69 (−6.68ns%). Only six sunflower hybrids showed a negative heterobeltiotic effect for head diameter. For plant height, mid-parent heterosis ranged from −31.4**% (CMS-HAP-54 × RHP-53) to 13.92*% (CMS-HAP-111 × RHP-38). As many as thirty hybrids exhibited negative mid-parent heterosis for plant height in the present study. Heterobeltiosis ranged from −35.34% (CMS-HAP-54 × RHP-68) to 5.17*% (CMS-HAP-111 × RHP-71), and 34 hybrids showed negative better-parent heterosis.

For stem curvature, the heterotic effects of the 36 sunflower hybrids ranged from 65.87**% (CMS-HAP-111 × RHP-69) to 317.24**% (CMS-HAP-54 × RHP-71), and all F1 hybrid combinations expressed highly significant positive heterotic effects. Heterobeltiosis was statistically significant for 24 hybrids, and all 36 F1 hybrids showed positive heterotic effects over the better parent. Maximum heterobeltiosis was observed for CMS-HAP-99 × RHP-68 (194.68**%), while minimum heterobeltiosis was recorded for CMS-HAP-111 × RHP-69 (10.06ns%). For number of leaves per plant, maximum positive heterosis was recorded for CMS-HAP-111 × RHP-71 (45.58**%), followed by CMS-HAP-56 × RHP-71 (31.89**%). The largest negative heterotic effect was noted for CMS-HAP-112 × RHP-53 (−9.25ns%), followed by CMS-HAP-99 × RHP-69 (−8.66ns%). Of the 36 hybrid combinations under study, 22 expressed positive heterosis for the average number of leaves per plant. The largest negative heterobeltiotic effect was recorded for CMS-HAP-111 × RHP-53 (−20.37**%), while the maximum positive better-parent heterosis was noted for CMS-HAP-111 × RHP-71 (36.02**%), followed by CMS-HAP-56 × RHP-71 (24.29**%).

Among all the hybrids tested, heterosis for 100-seed weight was statistically significant in 25 hybrids (Table 4). The maximum heterotic effect noted for this character was 57.72**% (CMS-HAP-56 × RHP-69), while the minimum mid-parent heterosis observed was −3.45ns% (CMS-HAP-111 × RHP-71). Only two hybrid combinations expressed negative heterosis for 100-seed weight. Heterosis over the better parent for 100-seed weight ranged from −15.49*% (CMS-HAP-111 × RHP-38) to 37.18**% (CMS-HAP-56 × RHP-53). The results of 10 hybrid combinations were statistically significant, and the heterobeltiotic effects of 24 hybrids were positive (Table 4). Of the 36 hybrids tested, 35 expressed positive mid-parent heterosis for seed yield per plant. The maximum heterotic effect noted for this character was 134.69**% (CMS-HAP-111 × RHP-41), followed by 125.18**% (CMS-HAP-12 × RHP-71), and the minimum mid-parent heterosis observed was −1.79ns% (CMS-HAP-112 × RHP-53). Maximum heterobeltiosis recorded was 74.93**% (CMS-HAP-11 × RHP-41), while minimum heterobeltiosis noted was −27.58ns% (CMS-HAP-112 × RHP-53). The heterobeltiotic effects of only nine hybrids were negative, while the remaining 27 hybrids expressed a positive gain over their better parent for seed yield per plant (Table 4).

Combining ability analysis

The Line × Tester mating design can evaluate a greater number of hybrids than the diallel and partial diallel mating designs. This technique of hybrid evaluation is well suited to cases where hybrids must be developed from restorer and male-sterile (CMS) lines. The results for the general combining ability of the 12 parental lines are presented in Table 5.
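Combining ability effects in a Line × Tester layout are conventionally estimated as deviations of marginal and cell means from the grand mean. The Python sketch below applies those textbook definitions to a hypothetical 2 × 2 table of hybrid means; it omits replication and significance testing and is not the authors' R workflow.

```python
def combining_ability(means):
    """means[line][tester] holds the hybrid mean for one trait.

    Returns (gca_line, gca_tester, sca), all as deviations from the grand mean.
    """
    lines = list(means)
    testers = list(next(iter(means.values())))
    grand = sum(means[l][t] for l in lines for t in testers) / (len(lines) * len(testers))
    # GCA: mean of a parent across all of its crosses, minus the grand mean.
    gca_line = {l: sum(means[l][t] for t in testers) / len(testers) - grand for l in lines}
    gca_tester = {t: sum(means[l][t] for l in lines) / len(lines) - grand for t in testers}
    # SCA: what remains of a cross mean after removing the grand mean and both GCA effects.
    sca = {(l, t): means[l][t] - grand - gca_line[l] - gca_tester[t]
           for l in lines for t in testers}
    return gca_line, gca_tester, sca

# Hypothetical hybrid means (seed yield per plant, g) for two lines and two testers.
means = {"L1": {"T1": 60.0, "T2": 80.0},
         "L2": {"T1": 50.0, "T2": 50.0}}
gca_line, gca_tester, sca = combining_ability(means)
print(gca_line, gca_tester, sca)
```

A positive SCA value means the cross performed better than predicted from the GCA effects of its two parents, and a negative value means it performed worse.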

Table 5 Estimates of general combining ability (GCA) of parents regarding morphometric characteristics.

General combining ability (GCA)

Perusal of the GCA estimates of all 12 parents for days to flower initiation (DFI) showed that only two parents, one CMS line, i.e., CMS-HAP-12 (7.65**), and one R-line, i.e., RHP-68 (1.07**), had positive and significant GCA effects. The same two parents also had the highest positive and significant GCA effects for days to flower completion (DFC), indicating that hybrids from these parents tend to be late maturing. For leaf area, the GCA estimate of CMS-HAP-12 (14.73**) was highly significant and positive among the 12 parental lines examined, while CMS-HAP-99 showed the lowest GCA of −13.99**. The GCA effects for leaf area of all six male lines were non-significant. GCA estimates for head diameter ranged from 2.57** (CMS-HAP-12) to −1.17** (CMS-HAP-54), while among male lines RHP-68 was a good general combiner for head diameter, with a GCA effect of 1.02*. The best general combining ability for plant height was recorded for CMS-HAP-12 (13.22**), while the lowest GCA estimate of −10.3** was shown by CMS-HAP-111. The stem curvature GCA estimates of all 12 parents were statistically non-significant. GCA for number of leaves per plant was highly significant for two CMS lines, viz., CMS-HAP-111 (−1.94**) and CMS-HAP-12 (4.53**), and RHP-71 (0.64ns) showed the maximum GCA among tester lines. For 100-seed weight, only two parental lines, i.e., CMS-HAP-112 (0.45*) and RHP-69 (0.41*), showed good general combining ability for this important yield-related characteristic. CMS-HAP-12 exhibited the highest GCA effect for seed yield per plant among female lines (20.43**), while no male line exhibited a significant positive GCA effect for seed yield.

Specific combining ability (SCA)

Specific combining ability estimates of the thirty-six sunflower hybrids developed from 12 parental lines following the L × T mating design for nine agro-morphological traits are presented in Table 6. The SCA effect of CMS-HAP-12 × RHP-68 (3.18**) was the highest for DFI, while the SCA estimate of −2.9** shown by CMS-HAP-112 × RHP-41 was the lowest. The SCA estimate for days taken to flower completion was highest for CMS-HAP-12 × RHP-68 (3.60**), while the CMS-HAP-112 × RHP-68 cross combination recorded the maximum negative SCA effect for DFC, showing that this cross combination flowers earlier than the rest of the hybrids under study. Significant SCA estimates were recorded for all 36 hybrids for leaf area, with the maximum SCA effect of 20.87** observed for CMS-HAP-54 × RHP-38. Only three hybrids showed a positive and significant SCA effect for head diameter, with a maximum value of 2.46* (CMS-HAP-12 × RHP-38). For head diameter, 21 hybrid combinations showed negative SCA estimates, indicating that head diameter in these hybrids was smaller than expected from the general combining ability of their parents. The highest SCA effect for plant height was shown by CMS-HAP-112 × RHP-71 (15.6*). SCA estimates for stem curvature were positive for 34 cross combinations. SCA effects for number of leaves per plant ranged from 3.47* (CMS-HAP-99 × RHP-41) to −3.53* (CMS-HAP-11 × RHP-53). Only one cross combination, CMS-HAP-111 × RHP-38 (−1.30**), showed a significant SCA effect for head diameter in the negative direction. Positive SCA effects for 100 seed weight were observed in 17 hybrids. For seed yield per plant, the SCA effect was positive for 19 cross combinations, with the maximum positive SCA effect shown by CMS-HAP-111 × RHP-53 (3.60**), followed by CMS-HAP-112 × RHP-53 (2.93**).
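The SCA effect of a cross is conventionally estimated as what remains of the cross mean after removing the contributions of both parents: cross mean, minus line mean, minus tester mean, plus grand mean. A minimal sketch under that assumption follows; the parent names and values are hypothetical and are not the Table 6 data.

```python
# Hypothetical cross means; NOT data from this study.
cross_means = {
    "L1": {"T1": 50.0, "T2": 54.0},
    "L2": {"T1": 40.0, "T2": 44.0},
    "L3": {"T1": 42.0, "T2": 40.0},
}

def sca_effects(table):
    """SCA(i, j) = x_ij - line_mean_i - tester_mean_j + grand_mean."""
    lines = list(table)
    testers = list(next(iter(table.values())))
    grand = sum(table[l][t] for l in lines for t in testers) / (len(lines) * len(testers))
    line_mean = {l: sum(table[l].values()) / len(testers) for l in lines}
    tester_mean = {t: sum(table[l][t] for l in lines) / len(lines) for t in testers}
    return {
        (l, t): table[l][t] - line_mean[l] - tester_mean[t] + grand
        for l in lines
        for t in testers
    }

sca = sca_effects(cross_means)
print(sca[("L3", "T1")])  # 2.0
print(sca[("L3", "T2")])  # -2.0
```

A negative SCA value therefore means the cross performed below what the average performance of its two parents in other crosses would predict, not necessarily below the parents themselves.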

Table 6 Estimates of specific combining ability (SCA) of 36 sunflower hybrids regarding morphometric characteristics.

Discussion

Modern-day agriculture is more concerned with enhancing the production capacity of crops in combination with efficient utilization of renewable and non-renewable resources21. Information on the extent of genetic diversity available in a crop is the basic requirement for designing a hybrid or cultivar improvement program in any crop, including sunflower. In the present study, a novel approach for identifying diversity, followed by a methodology for utilizing that diversity in sunflower hybrid development, has been proposed. Clustering is an unsupervised machine learning approach that places data points with commonalities in the same group, while data points in different groups share fewer similarities22. Among the various types of clustering algorithms, the hierarchical clustering algorithm (HCA) is very common. This technique builds a hierarchy of clusters step by step23,24. The hierarchical clustering approach is frequently used in plant sciences for classification and diversity analysis. This machine learning model has been successfully applied to identify Cysteine-rich Receptor-like Kinase (CRK) genes in Arabidopsis thaliana25. Likewise, diversity paneling of wheat genotypes has been successfully carried out using HCA26. In the current study, HCA was applied to the sunflower dataset, which combines morphological, biochemical, and molecular attributes, to find the optimum number of clusters and the most suitable genotype to represent each cluster in the crossing scheme.
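The way an agglomerative hierarchy is built can be illustrated with a short sketch. It assumes average linkage with Euclidean distance, which is only one of several possible linkage choices, and it uses six made-up two-trait points rather than the sunflower data. The function name `average_linkage` is hypothetical.

```python
import math

def average_linkage(points, n_clusters):
    """Agglomerative clustering: start with one cluster per point and repeatedly
    merge the two clusters with the smallest average pairwise Euclidean distance."""
    clusters = [[i] for i in range(len(points))]

    def avg_dist(a, b):
        return sum(math.dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Closest pair of clusters (ties broken by position in the list).
        _, i, j = min(
            (avg_dist(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)

# Hypothetical standardized trait scores for six genotypes (illustration only).
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.3)]
print(average_linkage(points, 2))  # [[0, 1, 2], [3, 4, 5]]
```

Stopping the merging at different values of `n_clusters` corresponds to cutting the dendrogram at different heights, which is how major clusters can be subdivided into smaller groups.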

In the current study, two major clusters were identified by applying HCA, each of which could be divided into six smaller clusters (Fig. 1). One major cluster contained only restorer lines, while the other contained A-lines, B-lines and the SFP line combined. This clear separation of restorer lines from A, B and SFP lines has previously been observed in sunflower12,14,27. The efficiency of HCA in diversity paneling has been well documented. Clustering of barley genotypes using the HCA approach was found to be quite successful in delineating genetic diversity28. In the current study, the HCA approach was the most successful, not only in separating R-lines from the rest of the genetic material but also in dividing the genotypes of each major cluster into six smaller groups, giving 12 heterotic groups overall. These high-resolution heterotic groups were in fact the product of combining different levels of diversity organization in sunflower plants, from the molecular and protein levels to the organ and individual levels.

K-means clustering (KMC) is another clustering/classification approach applied in machine learning, wherein a dataset is partitioned into k clusters, where k is an integer29. Use of KMC is well documented for datasets where the sole objective is to classify the data into different groups. The number of clusters k was identified by trial and error. In the present study, the optimum number of clusters was k = 2, at which the genotypes were grouped into the same two major clusters observed with the HCA approach. In KMC, restorer lines were grouped separately from the rest of the sunflower genotypes under study; however, with the KMC approach it was almost impossible to further divide the sunflower lines into smaller clusters for more accurate identification of potential parents for a sunflower hybridization program.
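The k-means procedure and the trial-and-error choice of k can be sketched as follows. This is a minimal illustration of Lloyd's algorithm on made-up two-trait points, not the analysis performed in this study; initial centroids are passed in explicitly to keep the sketch deterministic (random initialization is more usual), and the within-cluster sum of squares (WCSS) is used here as an assumed criterion for comparing values of k.

```python
import math

def kmeans(points, init_centroids, max_iter=100):
    """Lloyd's algorithm: alternate nearest-centroid assignment and centroid update."""
    centroids = [tuple(c) for c in init_centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

def wcss(points, labels, centroids):
    """Within-cluster sum of squared distances; lower means tighter clusters."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

# Hypothetical standardized trait scores for six genotypes (illustration only).
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.3)]
labels1, cents1 = kmeans(points, [points[0]])
labels2, cents2 = kmeans(points, [points[0], points[3]])
print(labels2)  # [0, 0, 0, 1, 1, 1]
print(wcss(points, labels1, cents1) > wcss(points, labels2, cents2))  # True
```

Because WCSS always falls as k grows, a trial-and-error search usually looks for the k beyond which the drop becomes small rather than for the minimum itself.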

Previously, KMC has been applied to compare gene expression patterns in plants under normal and stressful conditions30. Likewise, the KMC-based machine learning approach has been found very informative for the functional association of biotic and abiotic genes31,32. Use of KMC on an agro-morphological dataset of mung bean revealed that the genotypes grouped into seven different clusters irrespective of their geographical origin33. The gene pool of the Iranian species Rhabdosciadium aucheri was successfully characterized and differentiated into three populations after application of KMC. Hence, the KMC-based approach is an effective technique for population identification and grouping; however, accurate identification of heterotic groups and superior potential parents for breeding programs is not possible through KMC-based clustering.

In unsupervised machine learning, the use of both hierarchical and K-means clustering for analyzing unstructured datasets is well documented. However, each has its own advantages and disadvantages. The hierarchical clustering algorithm cannot represent distinct clusters with similar expression patterns. Moreover, as the size of a cluster increases, the actual expression patterns become less relevant. K-means, in turn, requires the number of clusters k (an integer) to be specified in advance, and the algorithm is also very sensitive to outliers34. In contrast, hybrid algorithms combine the strengths of other algorithms and tend to produce much more refined results. A hybrid algorithm of k-means and hierarchical clustering produces better results than standard hierarchical clustering based on average Euclidean distance. Similarly, much more refined results were obtained for microarray datasets using hybrid hierarchical and k-means clustering35.

In the present study, a hybrid approach bagging both hierarchical and k-means clustering was applied to the combined dataset of morphological, biochemical, and molecular characterization of sunflower. No previous use of this strategy in sunflower was found, making it a unique methodology for characterizing an unstructured dataset through multivariate and unsupervised machine learning techniques. Setting k to 12 in the hybrid clustering, the number obtained from hierarchical clustering, grouped the 109 sunflower entries into 12 clusters (Fig. 3). However, the hybrid approach was not as successful as hierarchical clustering alone, since in some cases A-lines were grouped with restorer lines, which was not observed with either the hierarchical or the k-means approach. Therefore, more work on hybrid algorithms, applying other bagging and optimization techniques or combining more than two clustering methods in the hybrid, is needed to obtain better resolution of genotypes into heterotic groups.
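This section does not describe how the two algorithms were combined, so the following should not be read as the procedure used in this study. It is a minimal sketch of one widely used hybrid scheme, assumed here purely for illustration: an average-linkage hierarchy is cut at k clusters, and the means of those clusters seed a k-means refinement. All function names and data points are hypothetical.

```python
import math

def hierarchical(points, k):
    """Average-linkage agglomerative clustering down to k clusters (index lists)."""
    clusters = [[i] for i in range(len(points))]

    def avg_dist(a, b):
        return sum(math.dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        _, i, j = min(
            (avg_dist(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def centroid(points, members):
    """Coordinate-wise mean of the points whose indices are in members."""
    return tuple(sum(points[i][d] for i in members) / len(members)
                 for d in range(len(points[0])))

def hybrid_clustering(points, k, max_iter=100):
    """Seed k-means with hierarchical cluster means, then iterate to convergence."""
    seeds = [centroid(points, c) for c in hierarchical(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: math.dist(p, seeds[c])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if members:
                seeds[c] = centroid(points, members)
    return labels

# Hypothetical trait scores forming three well-separated groups (illustration only).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
          (0.0, 10.0), (1.0, 10.0), (0.0, 11.0)]
print(hybrid_clustering(points, 3))  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

In this scheme the k-means step can reassign entries that the hierarchy placed together, which is one plausible route by which a hybrid result may differ from the hierarchical grouping.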

Heterosis is defined as the deviation of the progeny mean from that of the parents. To exploit heterosis successfully in crop plants, genetic variability among the participating parents is a pre-requisite. In most cases, positive heterosis or hybrid vigor of the F1 over the mid-parent or better parent is required, as for seed yield and 100 seed weight. However, for some important traits, i.e., flowering time, time taken to maturity and plant height in sunflower, negative heterosis is required36. In this study, mid-parent and better-parent heterosis estimates of 36 sunflower hybrids developed from 12 selected parental lines (each line representing a specific heterotic group) showed both negative and positive heterotic values for the nine important plant characteristics.
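Mid-parent heterosis and heterobeltiosis are conventionally expressed as percentage deviations of the F1 mean from the mid-parent and better-parent values, respectively. The sketch below assumes those standard formulas; the function names and the numbers are hypothetical. Treating the lower-valued parent as the "better" one for traits such as days to flowering is a common convention assumed here, exposed through the hypothetical `higher_is_better` flag.

```python
def mid_parent_heterosis(f1, p1, p2):
    """Percent deviation of the F1 mean from the mean of its two parents."""
    mid_parent = (p1 + p2) / 2
    return (f1 - mid_parent) / mid_parent * 100

def heterobeltiosis(f1, p1, p2, higher_is_better=True):
    """Percent deviation of the F1 mean from the better parent.

    For traits where smaller values are desirable (e.g. days to flowering),
    pass higher_is_better=False so the lower-valued parent is the reference.
    """
    better = max(p1, p2) if higher_is_better else min(p1, p2)
    return (f1 - better) / better * 100

# Hypothetical seed yield per plant (g): F1 = 60, parents = 50 and 40.
print(round(mid_parent_heterosis(60.0, 50.0, 40.0), 2))  # 33.33
print(round(heterobeltiosis(60.0, 50.0, 40.0), 2))       # 20.0

# Hypothetical days to flower initiation: F1 = 54, parents = 60 and 64.
print(round(heterobeltiosis(54.0, 60.0, 64.0, higher_is_better=False), 2))  # -10.0
```

A negative value in the last example is the desirable outcome, since the hybrid flowers earlier than its earlier parent.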

Testing of the 36 F1 crosses for heterosis and heterobeltiosis showed that the majority of F1 hybrids expressed heterosis in the desirable direction for all the traits under consideration. For days to flower initiation, days to complete flowering and plant height, most of the hybrids expressed negative heterosis and heterobeltiosis relative to their parents. For leaf area and head diameter, both positive and negative heterotic effects were observed, while for stem curvature, number of leaves per plant, 100 seed weight and seed yield per plant, the majority of the F1 hybrids under examination expressed positive heterosis over mid-parent and better-parent means.

The desirability of the heterotic direction depends upon the overall contribution of the component traits towards seed yield or oil yield in sunflower. Negative heterosis is desired for flowering traits because plants that start flowering early have more time left in the field for the grain filling stages; negative heterosis for flowering traits therefore ultimately leads to high seed yield in sunflower37,38. Likewise, leaf area corresponds to the availability of photosynthetic surface, so heterosis in the positive direction is required37. Similarly, head diameter is directly proportional to the surface available for seed filling, hence an increase in head diameter over the parental lines is desirable in sunflower breeding programs39,40.

Since leaves are the site of photosynthesis, previous results indicate that positive heterotic values are required for sunflower leaf area. Both negative and positive heterosis and heterobeltiosis values were found for leaf area in the present experiment, and these findings are supported by Khan et al.35. Furthermore, Habib et al.41 and Khan42 confirmed that higher positive hybrid vigor for 100 seed weight is desirable, as it is directly linked with the economic yield of the sunflower crop. In our findings, 34 of 36 hybrids showed positive heterosis for this trait, and 23 hybrids also showed positive heterobeltiosis, indicating that these hybrids had a higher test weight than both parents. Previous studies on sunflower heterosis also confirmed that 100 seed weight increases when distantly related genotypes are crossed to produce F1 hybrids37,39,40,43. Many researchers, including Kaur40, Radhika et al.39, Phad et al.43, Alone et al.44, Manivannan et al.45, Sawant et al.46 and Channamma47, reported significant heterotic effects for seed yield in hybrids developed experimentally from diverse male and female lines.

It is generally believed that parents with high per se performance are not always able to transmit their high yield potential to their progeny, hence estimation of combining ability is a pre-requisite for developing high yielding and sustainable hybrids or cultivars. To increase sunflower yield vertically, hybrids with better yield potential and stability must be developed. Parents with diverse genetic makeup generally produce superior transgressive hybrids48. To make a breeding program fruitful, the first step is to select the parental lines to be used for hybridization. In crop plants including sunflower, genetic variability, type of gene action and combining ability are the most important parameters49.

The occurrence of both significant GCA and SCA effects in the present study indicates the presence of both additive and non-additive gene effects in the expression of the measured plant traits. GCA effects generally guide the selection of suitable parents for population improvement or for the development of synthetic or composite cultivars, which some growers may prefer because their seed can be used for more than one year48. In the present study, high and significant GCA effects in desirable directions, i.e., negative for flowering, maturity, and plant height traits and positive for the rest of the traits measured, were observed among both CMS and restorer lines. Short duration varieties are preferred by sunflower growers as these can reduce the risk of exposure to adverse climatic and biotic factors such as diseases and insect attack48. Development of a high yielding hybrid or cultivar with a shorter growing period is among the prime objectives of sunflower breeders49.

In the present study, a higher magnitude of SCA than GCA effects was observed for days to flower initiation, days taken to flower completion, head diameter, plant height, leaf area and 100 seed weight, suggesting a predominance of SCA/non-additive factors in controlling these flowering and yield-affecting traits in sunflower. Control of these sunflower traits through dominant or epistatic gene action has been previously reported in many studies40,50,51. In contrast, a higher magnitude of GCA effects was recorded for stem curvature, number of leaves per plant and seed yield per plant. These higher values of GCA than SCA showed that the genes controlling these traits have an additive type of inheritance; therefore, the parental lines to be used in crossing programs must be improved first, or should already have high potential for these traits, before they are used in a sunflower hybrid breeding program.

Conclusion

Application of machine learning in plant improvement programs could become a vital tool for breeders, as it can speed up the steps involved in releasing a final cultivar for general cultivation. However, more effort is needed on optimization and accurate application of machine learning algorithms in plant breeding. In this study, two unsupervised machine learning clustering algorithms, i.e., hierarchical and k-means, were applied to a combined morphological, biochemical, and molecular dataset of sunflower genotypes. In addition, a hybrid clustering algorithm combining hierarchical and k-means clustering was designed and implemented on the same dataset for heterotic group identification. Results showed that the hierarchical clustering approach is more suitable in the given circumstances. Hence, 12 heterotic groups were identified (6 for CMS lines and 6 for restorer lines), and one genotype from each group was selected as a representative of the whole group. The 12 selected lines (one from each heterotic group) were crossed in an L × T design, and the resulting F1 hybrids were evaluated in open field conditions for combining ability and heterosis. Results showed that most of the hybrids developed exhibited significant heterosis for all the studied traits and, more importantly, in the desirable directions. In particular, three hybrids, (1) RHP-41 × CMS-HAP-56, (2) RHP-71 × CMS-HAP-111 and (3) RHP-71 × CMS-HAP-12, are the most suitable for further evaluation and release as new sunflower hybrid cultivars.