Application of machine learning for identification of heterotic groups in sunflower through combined approach of phenotyping, genotyping and protein profiling

Application of machine learning in plant breeding is a recent concept that has yet to be optimized for precise use in breeding programs of high-yielding crop plants. Identification and efficient utilization of heterotic grouping patterns, aided by machine learning approaches, is of utmost importance in hybrid cultivar breeding, as it can save the time and resources required to breed a new plant hybrid/variety. In the present study, 109 genotypes of sunflower were investigated at morphological, biochemical (SDS-PAGE) and molecular levels (through microsatellite (SSR) markers) for heterotic grouping. All three datasets were combined, scaled, and subjected to unsupervised machine learning algorithms, i.e., hierarchical clustering, K-means clustering and a hybrid clustering algorithm (hierarchical + K-means), to assess the efficiency and resolution power of these algorithms in practical plant breeding for heterotic group identification. Following the application of the unsupervised clustering approach, two major groups were identified in the studied sunflower germplasm, and further classification revealed six smaller classes in each major group through the hierarchical and hybrid clustering approaches. Owing to the high resolution obtained in hierarchical clustering, the classification achieved through this algorithm was used for selection of potential parents. One genotype from each smaller group was selected based on maximum seed yield potential and hybridized in a line × tester mating design, producing 36 F1 cross combinations. These F1s, along with their parents, were studied in open field conditions to validate the efficacy of the identified heterotic groups in the sunflower genetic material under study. Data for 11 agronomic and qualitative traits were recorded. These 36 F1 combinations were tested for their combining ability (general/specific), heterosis, genotypic and phenotypic correlation and path analysis.
Results suggested that F1 hybrids performed better for all the traits under investigation than their respective parents. Findings of the study validated the use of machine learning approaches in practical plant breeding; however, more accurate and robust clustering algorithms need to be developed to handle the data noisiness of open field experiments.


Experiment 1
Plant material
Experiment 1 was conducted at the National Agricultural Research Center (NARC), Islamabad, which is situated at latitude 33.6641° N and longitude 73.1276° E. Plant material comprising 109 genetically diverse sunflower lines was used to mine the grouping pattern and then to test the efficacy of the identified grouping pattern in hybrid seed development in sunflower. Plant material (sunflower genotypes) was obtained from the National Agricultural Research Centre, Islamabad, Pakistan (Table S1). Collection of plant materials complied with institutional, national, and international guidelines and legislation. All experiments and analysis methods were performed following the relevant regulations and guidelines.

Phenotyping
To characterize the sunflower genotypes phenotypically, all 109 sunflower lines were planted in open field conditions for two consecutive years in the spring season following a randomized augmented block design. Data for nine plant morphometric characteristics, i.e., plant height, stem curvature, days to flower initiation, days to flower completion, number of leaves per plant, head diameter, leaf area, 100 seed weight and seed yield per plant, were recorded from 10 randomly selected plants of each genotype, and their average was computed (Supplementary Table 1).

Genotyping
Molecular genotyping of the 109 sunflower lines was conducted by employing 40 SSR markers (Supplementary Table 2). Genomic DNA was extracted from 1.5 to 2 week old sunflower seedlings following the cetyl trimethyl ammonium bromide (CTAB) method described by Saghai-Maroof et al. 15. DNA was then diluted with 50 µl TE buffer and run on a 1% agarose gel to determine its quality and concentration before PCR analysis. Each DNA fragment amplified by the respective microsatellite marker was treated as a unit trait and scored as 1 for the presence and 0 for the absence of a DNA band, thus generating a binary matrix dataset.

Protein characterization (proteomic analysis)
Protein characterization was performed through SDS-PAGE of total seed proteins in vertical slabs as described by Jan et al. 16. Separating and stacking gels were prepared with different concentrations. The gels were run at 100 V until the bromophenol blue (BPB) marker reached the bottom of the gels. A pre-standard protein ladder with a range of 10-180 kDa (Lot: 00345 035) was used to determine the molecular weights of sunflower seed proteins. To visualize the proteins, gels were stained with 0.25% (w/v) Coomassie brilliant blue (CBB) solution homogenized in 10% (v/v) acetic acid and 40% (v/v) methanol diluted in water. After staining, gels were de-stained to wash away excess CBB dye. The de-staining solution contained 5% acetic acid (v/v), 20% methanol (v/v) and distilled water in a 5:20:75 ratio.
Each polypeptide band in a gel was treated as a unit character and scored as 1 for the presence of the band in a particular sunflower genotype and 0 for its absence. This generated a binary matrix of the protein bands observed in the 109 sunflower genotypes under study.
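
The presence/absence scoring used for both SSR fragments and protein bands can be illustrated with a minimal sketch. The genotype names and band sizes below are hypothetical and are not taken from the study; the function name `score_bands` is likewise only illustrative.

```python
def score_bands(observed, all_bands):
    """Return one 0/1 row per genotype: 1 if the band was observed, else 0."""
    return {
        genotype: [1 if band in bands else 0 for band in all_bands]
        for genotype, bands in observed.items()
    }

# Hypothetical band sizes (kDa) seen on the gel for three genotypes.
observed = {
    "G1": {10, 25, 55},
    "G2": {25, 70},
    "G3": {10, 25, 55, 70},
}

# Every distinct band position across all genotypes becomes one column.
all_bands = sorted(set().union(*observed.values()))
matrix = score_bands(observed, all_bands)

for genotype, row in matrix.items():
    print(genotype, row)
```

Each column of the resulting matrix is one unit character, so the SSR and protein matrices can later be placed side by side with the morphological traits.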

Machine learning baseline
The application of artificial intelligence to speed up the breeding of new cultivars is a burgeoning area of development, considering the major impact it could have in the plant breeding field. The machine learning baseline 17 is a generic, modular, and reusable workflow that combines agronomic principles of crop modeling with machine learning. The input data consist of a diverse aggregated dataset collected at three different levels of the plant's organizational structure, i.e., morphological, molecular, and biochemical. The corresponding output is a set of clustering patterns of the 109 sunflower genotypes obtained by applying machine learning classifiers, i.e., hierarchical clustering, K-means clustering and hybrid (hierarchical + K-means) clustering algorithms. A schematic workflow describing the interrelationship between the input variables (integrated dataset) and the corresponding output (heterotic groups) using the three machine learning classifiers is presented in Fig. 1. The strategy followed for collection, processing and integration of the three datasets was as follows:
(g) Finally, the performance of the F1 hybrids was evaluated to characterize the efficiency of the machine learning algorithm in identifying suitable parental genotypes for hybrid breeding programs.

Data preparation and preprocessing
Data normalization and scaling is an important preprocessing procedure to remove data nuisance and improve the efficiency of learning from data 18. Moreover, when the recorded variables differ in nature, i.e., continuous and binary, more care is needed to avoid over- or underfitting of the machine learning algorithms 19. In this experiment, morphological variables were recorded on different scales, whereas the genotyping and protein profiling data were recorded on a binary scale. To address overfitting/inaccuracy of the machine learning algorithms, data scaling was performed according to the Yeo and Johnson normalization method 20.
During data scaling, all traits were scaled to a [0, 1] range using the following equation:

$$X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}}$$

where X_scaled is the scaled value of the input variable, and X_min and X_max are the minimum and maximum values of variable X, respectively. This scaling of the data also ensured that there were no outliers in the final dataset.
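
As a minimal sketch of the [0, 1] scaling described above, the function below applies the min-max formula implied by the definitions of X min and X max. The function name and the handling of constant columns are assumptions made for illustration; the first and last plant height values are the parental range reported later in the text, while the middle value is made up.

```python
def min_max_scale(values):
    """Scale a list of numbers to the [0, 1] range (min-max scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Continuous trait (plant height, cm); the middle value is hypothetical.
plant_height = [134.6, 167.37, 200.14]
scaled_height = min_max_scale(plant_height)

# A binary marker column is already in [0, 1] and is left unchanged.
marker_column = [0, 1, 1, 0]
scaled_marker = min_max_scale(marker_column)

print(scaled_height)
print(scaled_marker)
```

After this step, continuous traits and binary band scores share the same range, so no single variable dominates the distance calculations used by the clustering algorithms.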

Data integration and heterotic grouping identification
For careful and accurate identification of the heterotic grouping patterns present in the sunflower genotype pool, all three datasets, i.e., morphological, genotypic and protein, were aggregated and evaluated collectively after data scaling. Three classification algorithms, i.e., (a) hierarchical clustering, (b) K-means clustering and (c) hybrid clustering (hierarchical + K-means), were applied to the aggregated dataset. Data preprocessing and application of the machine learning algorithms were performed using R-Studio version 1.2.1335. The packages used to apply the three machine learning algorithms were factoextra, FactoMineR, dendextend, cluster and tidyverse. The algorithms were compared, and the one with the highest resolution was used for selection of genotypes from the genotypic pool. The algorithm that classified the sunflower genetic pool under study (A-lines, B-lines, R-lines and self-pollinated (SFP) lines) most accurately was regarded as the one with the highest resolution power. The clustering pattern obtained from the best-performing algorithm on the aggregated dataset was then carefully evaluated to select highly diverse and high-yielding sunflower genotypes, i.e., one genotype from each heterotic group (with the assumption that the selected genotype has the same breeding potential as the rest of the genotypes in the same heterotic group). The selected genotypes were then crossed with each other to obtain F1 sunflower hybrids.

Experiment 2
Evaluation of identified heterotic groups
F1 crosses development and evaluation. Based on the results of Experiment 1, one genotype from each identified heterotic group was selected as a representative of the whole group and used in a hybridization scheme following a Line × Tester mating design. Each male line was crossed with each female line: the 6 identified CMS lines were crossed with 6 restorer lines to generate 36 sunflower F1 hybrids.
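
The crossing scheme can be enumerated with a short sketch. The parent names are those that appear in the Results text; pairing every restorer with every CMS line is what yields the 36 combinations. The label format `restorer × CMS` follows the cross names used in the Results and is otherwise an assumption.

```python
from itertools import product

# Parent names as they appear in the Results text.
cms_lines = ["CMS-HAP-12", "CMS-HAP-54", "CMS-HAP-56",
             "CMS-HAP-99", "CMS-HAP-111", "CMS-HAP-112"]
restorer_lines = ["RHP-38", "RHP-41", "RHP-53",
                  "RHP-68", "RHP-69", "RHP-71"]

# Every restorer is paired with every CMS line exactly once.
crosses = [f"{restorer} × {cms}"
           for restorer, cms in product(restorer_lines, cms_lines)]

print(len(crosses))
print(crosses[:3])
```

Unlike a full diallel, no cross is made within the CMS group or within the restorer group, which keeps the number of hybrids at 6 × 6.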

F1 phenotyping
The sunflower F1 hybrids obtained were tested in open field conditions at the National Agricultural Research Center, and data regarding growth and yield attributes (days to flower initiation, days to flower completion, plant height, stem curvature, head diameter, number of leaves per plant, leaf area, 100 seed weight and seed yield per plant) were recorded.

Statistical analysis
The data collected on the sunflower hybrids were subjected to statistical analyses such as ANOVA, heterosis, heterobeltiosis, and combining ability analysis to understand the yield potential of each sunflower hybrid and to assess the efficacy of the identified heterotic grouping pattern. R-Studio version 1.3.1335 was used for statistical analysis of the F1 cross combinations.

Experiment 1
For accurate identification of the heterotic grouping pattern, a multi-pronged strategy was adopted, wherein morphological, biochemical, and molecular datasets of sunflower genotypes were analyzed using three clustering algorithms, i.e., hierarchical, K-means and the hierarchical + K-means hybrid classification algorithm. The efficacy of these three machine learning algorithms was tested on the sunflower genotypes, and the algorithm that classified the genotypes most accurately was used for final parental selection for further hybrid development.

Hierarchical clustering
Figure 2 represents the dendrogram obtained using the hierarchical classification algorithm. For hierarchical clustering, the Ward.D2 method was applied to the combined dataset of morphological + biochemical + molecular characterization. The cluster diagram (Fig. 2) showed two distinct classes of genotypes, wherein cluster 1 contains all the restorer lines, while cluster 2 has the CMS + B-lines and self-pollinated lines. Number of genotypes grouped in cluster

K-means clustering
The K-means clustering algorithm is an unsupervised machine learning approach that groups similar data points in one cluster, away from dissimilar data points. More precisely, this algorithm aims to minimize the sum of squares within a cluster and consequently maximize the sum of squares between clusters. In the present study, K-means clustering applied to the 109 sunflower genotypes grouped them into 2 major clusters (Fig. 3). Cluster 1 contains 31 genotypes, while cluster 2 contains 78. Cluster 1 predominantly contains restorer lines, while cluster 2 contains the self-pollinated (SFP) lines, A-lines and B-lines of the sunflower genetic pool under study. Although K-means grouped the sunflower genotypes into two major clusters, assigning genotypes to smaller groups with more precision was not possible using this algorithm, as many SFP lines lie close to the A-lines or B-lines, making it harder to distinguish between them.
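
The within/between sum-of-squares trade-off that K-means optimizes can be made concrete with a minimal pure-Python sketch of Lloyd's algorithm. This is an illustration only, not the R implementation used in the study; the toy points, the seed and the function names are assumptions.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return [sum(col) / len(points) for col in zip(*points)]

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    centers = random.Random(seed).sample(points, k)
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = centroid(members)
    return labels, centers

def sums_of_squares(points, labels, centers):
    """Within-cluster, between-cluster and total sums of squares."""
    grand = centroid(points)
    within = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    between = sum(labels.count(j) * sq_dist(c, grand)
                  for j, c in enumerate(centers))
    total = sum(sq_dist(p, grand) for p in points)
    return within, between, total

# Toy scaled data: two visibly separated groups of three points each.
points = [[0.1, 0.2], [0.15, 0.1], [0.2, 0.25],
          [0.8, 0.9], [0.85, 0.8], [0.9, 0.95]]
labels, centers = kmeans(points, k=2)
within, between, total = sums_of_squares(points, labels, centers)
print(labels, round(within, 4), round(between, 4), round(total, 4))
```

Because the total sum of squares is fixed by the data and equals within plus between, lowering the within-cluster term necessarily raises the between-cluster term.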

K-means, hierarchical hybrid clustering approach
Finally, a hybrid algorithm combining the hierarchical and K-means clustering algorithms was applied to the sunflower genotypes to examine whether the precision of the heterotic groups could be improved further. With the number of clusters (k) set to 12, two major clusters were observed, which were further grouped into 12 smaller clusters (Fig. 4). The grouping obtained with the hybrid algorithm (hierarchical + K-means) was useful to some extent, as it can be used to group closely related genotypes; however, it also placed genotypes with distinct characteristics, such as restorer lines and CMS lines, close together, which is somewhat confusing. Hence this algorithm was also not a good fit for the current study. As the grouping of genotypes obtained with the hierarchical clustering algorithm was clearer and more definitive, selection of potential parents for the development of sunflower hybrids was based on the grouping observed through the hierarchical clustering approach.
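
One common way to combine the two algorithms is to cut a Ward hierarchy at k clusters and use the resulting centroids as starting centers for K-means. The sketch below follows that idea in pure Python; whether the study's hybrid was built exactly this way is an assumption, and the toy data and function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return [sum(col) / len(points) for col in zip(*points)]

def ward_clusters(points, k):
    """Agglomerate until k clusters remain, always merging the pair whose
    union increases the within-cluster sum of squares the least (Ward)."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca = centroid([points[i] for i in clusters[a]])
                cb = centroid([points[i] for i in clusters[b]])
                na, nb = len(clusters[a]), len(clusters[b])
                cost = na * nb / (na + nb) * sq_dist(ca, cb)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

def hybrid_cluster(points, k, iters=100):
    """Hierarchical step supplies the initial centers; K-means refines them."""
    centers = [centroid([points[i] for i in c])
               for c in ward_clusters(points, k)]
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = centroid(members)
    return labels

# Toy scaled data with three visibly separated groups.
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
          [1.0, 1.0], [0.9, 1.0], [1.0, 0.9],
          [0.0, 1.0], [0.1, 1.0], [0.0, 0.9]]
labels = hybrid_cluster(points, k=3)
print(labels)
```

Seeding K-means from the hierarchy removes its dependence on random starting centers, but the refinement step can still move genotypes across the boundaries drawn by the dendrogram, which is one way distinct line types can end up in the same cluster.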

Selection of parents
As 12 clusters were observed through the hierarchical clustering method, 1 genotype from each of the 12 clusters was selected for further use in the sunflower hybrid breeding program. Genotypes exhibiting the highest seed yield potential from each of the 12 clusters (recorded at the height of 18) were selected. Moreover, all the restorer lines tend to cluster separately from the CMS lines; hence a Line × Tester mating design was followed for sunflower F1 hybrid development.
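
The selection rule, one highest-yielding genotype per cluster, can be sketched as follows. The genotype names, cluster numbers and yields are hypothetical; only the rule itself comes from the text.

```python
def pick_parents(records):
    """Return {cluster: genotype} keeping the highest seed yield per cluster."""
    best = {}
    for name, cluster, seed_yield in records:
        if cluster not in best or seed_yield > best[cluster][1]:
            best[cluster] = (name, seed_yield)
    return {cluster: name for cluster, (name, _) in sorted(best.items())}

# Hypothetical (genotype, cluster, seed yield per plant in g) records.
records = [
    ("G01", 1, 41.2), ("G02", 1, 55.0), ("G03", 1, 38.7),
    ("G04", 2, 27.9), ("G05", 2, 30.4),
    ("G06", 3, 62.5),
]
parents = pick_parents(records)
print(parents)
```

With 12 clusters this yields 12 parents, each standing in for its whole heterotic group under the assumption stated in the Methods that group members share similar breeding potential.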

Evaluation of identified heterotic groups
To assess the practical efficiency of the identified heterotic groups, the selected parental lines were crossed in a Line × Tester mating design and 36 sunflower F1 hybrids were generated. Heterosis (mid-parent heterosis, better-parent heterosis) and combining ability analyses (general combining ability and specific combining ability) were conducted to evaluate the potential of the methodology used for identification/mining of the heterotic grouping pattern and the subsequent selection of potential parental lines for commercial hybrid development.

Mean performance of parents
Table 1 presents the mean performance of the 12 sunflower lines that were planted at NARC, Islamabad. The study focused on nine agro-morphological traits. Among the lines, CMS-HAP-112 exhibited the shortest duration to flower initiation, taking only 46.5 days, while RHP-41 had the longest duration of 56.5 days. CMS-HAP-111 completed 100% flowering the earliest, within 55 days, followed by CMS-HAP-112 at 55.5 days. On the other hand, RHP-41 took the maximum number of days to complete flowering, with a duration of 67.5 days. Regarding plant height, the 12 parental sunflower lines ranged from 200.14 cm (CMS-HAP-54) to 134.6 cm (CMS-HAP-111). In terms of leaf area, CMS-HAP-56 had the highest recorded value of 257.48 cm², while RHP-38 had the lowest average leaf area of 141.5 cm². The largest head diameter of 19.3 cm was observed in CMS-HAP-99, whereas the smallest head diameter of 10.45 cm was found in RHP-38. For stem curvature, the lowest value recorded was 6.95 cm for RHP-71, while CMS-HAP-111 and CMS-HAP-12 exhibited the highest stem curvatures of 48 cm and 45.7 cm, respectively. The number of leaves varied among the parental lines, with CMS-HAP-111 having the fewest leaves (23.35) and CMS-HAP-112 having the most (33.1), followed by CMS-HAP-99 (33). The 100 seed weight of the parental lines ranged from 3.48 g (RHP-69) to 6.61 g (CMS-HAP-99). CMS-HAP-112 displayed the highest mean seed yield per plant at 68.19 g, while the lowest seed yield per plant was observed in RHP-68 (27.28 g) and RHP-41 (27.9 g) (Table 1).

Regarding stem curvature, the lowest recorded value was 42.77 cm for RHP-68 × CMS-HAP-54, followed by RHP-53 × CMS-HAP-54 with a stem curvature of 48.83 cm. HAP-99 and RHP-38 × CMS-HAP-112 exhibited maximum stem curvatures of 77.5 cm and 74.83 cm, respectively. RHP-53 × CMS-HAP-111 had the lowest number of leaves (26) and RHP-71 × CMS-HAP-56 had the highest (36.67), followed by RHP-71 × CMS-HAP-99 (36.17). Test weights of hybrids ranged from 4.41 g (RHP-71 × CMS-HAP-111) to 7.34 g (RHP-38 × CMS-HAP-12). The minimum seed yield per plant, 49.3 g, was recorded for hybrid RHP-53 × CMS-HAP-111, whereas RHP-71 × CMS-HAP-54 showed the highest average seed yield of 103.36 g per plant, followed by RHP-41 × CMS-HAP-111 with 99.45 g.

Combining ability analysis
The Line × Tester mating design can evaluate a greater number of hybrids than the diallel and partial diallel mating designs. This technique of hybrid evaluation is quite successful in cases where hybrids must be developed from restorer and completely male sterile lines. Results pertaining to the general combining ability of the 12 parental lines are presented in Table 5.
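
As a minimal sketch of how combining ability effects are usually estimated from a table of cross means in a line × tester layout: the GCA of a parent is the mean of its crosses minus the grand mean, and the SCA of a cross is what remains after removing both parental effects. The 2 × 3 table below is hypothetical, and significance testing (the ** marks reported in the study) is deliberately left out.

```python
def combining_ability(table):
    """table[i][j] = mean of the cross between line i and tester j.
    Returns (GCA of lines, GCA of testers, SCA matrix)."""
    n_lines, n_testers = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n_lines * n_testers)
    line_means = [sum(row) / n_testers for row in table]
    tester_means = [sum(col) / n_lines for col in zip(*table)]
    gca_lines = [m - grand for m in line_means]
    gca_testers = [m - grand for m in tester_means]
    sca = [[table[i][j] - line_means[i] - tester_means[j] + grand
            for j in range(n_testers)]
           for i in range(n_lines)]
    return gca_lines, gca_testers, sca

# Hypothetical seed yield means (g per plant): 2 lines x 3 testers.
table = [[50.0, 60.0, 70.0],
         [80.0, 70.0, 90.0]]
gca_lines, gca_testers, sca = combining_ability(table)
print(gca_lines)
print(gca_testers)
print(sca)
```

By construction the GCA effects within each parental group sum to zero, as do the SCA effects along every row and column, so a positive GCA marks a parent whose crosses perform above the experiment-wide average.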

General combining ability (GCA)
Perusal of the GCA estimates of all 12 parental lines for DFI showed that only two parents, one CMS line, i.e., CMS-HAP-12 (7.65**), and one R-line, i.e., RHP-68 (1.07**), had positive and significant GCA effects. Similarly, the same two parents had the highest positive and significant GCA effects for DFC, indicating that hybrids derived from these parents tend to be late maturing. For leaf area, the GCA estimate of CMS-HAP-12 (14.73**) was highly significant and positive among all 12 parental lines under examination, while CMS-HAP-99 showed the lowest GCA magnitude of −13.99**. GCA effects for average leaf area for all six male lines were non-significant. GCA estimates for head diameter ranged from 2.57** (CMS-HAP-12) to −1.17** (CMS-HAP-54), while among male lines RHP-68 was found to be a good general combiner for head diameter with GCA effect of

Discussion
Modern-day agriculture is more concerned with enhanced production capacity of crops in combination with efficient utilization of renewable and non-renewable resources 21. Information on the extent of genetic diversity available in a crop is the basic requirement for designing a hybrid or cultivar improvement program for any crop, including sunflower. In the present study, a novel approach for identifying diversity, followed by a methodology for using that diversity in sunflower hybrid development, has been proposed. Clustering is a type of unsupervised machine learning approach that groups data points having commonalities into a particular group, while data points in different groups have fewer similarities 22. There are various types of clustering algorithms; among them, the hierarchical clustering algorithm (HCA) is very common. This clustering technique builds a hierarchy of clusters one after the other 23,24. A hierarchical clustering approach is frequently used in plant sciences for classification and diversity analysis. This machine learning model has been successfully applied for identification of Cysteine-rich Receptor-like Kinase (CRK) genes in Arabidopsis thaliana 25. Likewise, diversity paneling of wheat genotypes has been successfully carried out using HCA 26. In the current study, HCA was applied to the sunflower dataset, which is a combination of morphological, biochemical, and molecular attributes, to find the optimum number of clusters and the most suitable genotype to represent each whole cluster in the crossing scheme.
In the current study, 2 major clusters were identified by applying HCA, each of which could be divided into 6 smaller clusters (Fig. 1). It was noted that one major cluster contained only restorer lines, while the other major cluster contained A-lines, B-lines and SFP lines combined. This trend of clear separation of restorer lines from A, B and SFP lines has previously been observed in sunflower 12,14,27. The efficiency of HCA in diversity paneling has been well documented. Clustering of barley genotypes using the HCA approach was found to be quite successful in delineating genetic diversity 28. In the current study, the HCA approach was the most successful not only in separating R-lines from the rest of the genetic material but also in dividing the genotypes of each major cluster into six smaller groups, giving 12 heterotic groups overall in the sunflower genotypes. These high-resolution heterotic groups were in fact the product of combining different levels of diversity organization in sunflower plants, from the molecular and protein levels to the organ and individual levels.
K-means clustering (KMC) is another type of clustering/classification approach applied in machine learning, wherein a dataset is classified into a certain number k of clusters, where k is an integer 29. Use of KMC is well documented for datasets where the sole objective is to classify the data into different groups. The number of clusters k was identified through a trial-and-error method. In the present study, the optimum number of clusters was identified at k = 2, at which genotypes can be grouped into two major clusters, as also observed through the HCA approach. In KMC, restorer lines were grouped separately from the rest of the sunflower genotypes under study; however, using the KMC approach it is almost impossible to further classify the sunflower lines into smaller clusters for more accurate identification of potential parents for a sunflower hybridization program.
Previously, KMC has been applied to compare gene expression patterns in plants under normal and stressful conditions 30. Likewise, application of a KMC-based machine learning approach has been found very informative in the functional association of biotic and abiotic genes 31,32. Use of KMC on an agro-morphological dataset of mung bean revealed that genotypes grouped into seven different clusters irrespective of their geographical origin 33. The gene pool of the Iranian species Rhabdosciadium aucheri was successfully characterized and differentiated into three populations after application of KMC. Hence, a KMC-based approach is an effective technique for population identification/grouping; however, accurate identification of heterotic groups and superior potential parents for breeding programs is not possible through KMC-based clustering.
In unsupervised machine learning, the use of both hierarchical and K-means clustering approaches is well documented for analyzing unstructured datasets. However, both have their own advantages and disadvantages. The hierarchical clustering algorithm cannot represent distinct clusters with similar expression patterns. Moreover, as the size of a cluster increases, the actual expression patterns become less relevant. K-means, in turn, requires a specific number of clusters k (k is any integer) in advance to classify a dataset into groups, and this algorithm is also very sensitive to outliers 34. In contrast, hybrid algorithms combine the strengths of other algorithms and tend to produce much more refined results. A hybrid algorithm of k-means and hierarchical clustering produces better results than standard hierarchical clustering with average Euclidean distance. Similarly, much more refined results for microarray datasets were obtained using hybrid hierarchical and k-means clustering 35.
In the present study, a hybrid approach of bagging both hierarchical and k-means clustering was applied to the combined dataset of morphological, biochemical, and molecular characterization of sunflower. No previous use of this strategy in sunflower was found, making it a unique methodology for characterizing an unstructured dataset through multivariate and unsupervised machine learning techniques. Running the hybrid algorithm with k set to 12, as obtained in hierarchical clustering, produced 12 clusters of the 109 sunflower genotypes (Fig. 3). However, the hybrid approach was not as successful as hierarchical clustering alone, as in some cases A-lines were also classified with restorer lines, which was not observed with the hierarchical or k-means clustering approaches. Therefore, it was deduced that more work on hybrid algorithms, by applying other bagging and optimization techniques or by using more than two clusters in the hybrid, is needed to obtain much better resolution of genotypes in heterotic groups.
Heterosis is defined as the deviation of the progeny mean from that of their parents. To exploit heterosis successfully in crop plants, the presence of genetic variability among the participating parents is a prerequisite. In many cases, positive heterosis or hybrid vigor of the F1 over the mid-parent or better parent is required, as in the case of seed yield, 100 seed weight, etc. However, in a few cases negative heterosis is required for some important traits, i.e., flowering time, time taken to maturity and plant height in sunflower 36. In this study, mid- and better-parent heterosis estimates of the 36 sunflower hybrids developed from the 12 selected parental lines (each line representing a specific heterotic group) showed that both negative and positive heterotic values were obtained for the nine highly important plant characteristics.
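
The two estimates can be written out as a short sketch using the conventional formulas: mid-parent heterosis compares the F1 mean with the average of both parents, and heterobeltiosis compares it with the better parent. All numbers are hypothetical, and treating the earlier (lower) parent as the better one for flowering traits is an assumption made for illustration.

```python
def heterosis(f1, parent_a, parent_b, higher_is_better=True):
    """Return (mid-parent heterosis %, better-parent heterosis %)."""
    mid_parent = (parent_a + parent_b) / 2
    # Assumption: for traits where low values are desired (e.g. days to
    # flowering), the parent with the lower value counts as the better one.
    if higher_is_better:
        better_parent = max(parent_a, parent_b)
    else:
        better_parent = min(parent_a, parent_b)
    mph = (f1 - mid_parent) / mid_parent * 100
    bph = (f1 - better_parent) / better_parent * 100
    return mph, bph

# Hypothetical seed yield per plant (g): F1 and its two parents.
yield_mph, yield_bph = heterosis(90.0, 60.0, 40.0)

# Hypothetical days to flower initiation, where earlier is desirable.
dfi_mph, dfi_bph = heterosis(50.0, 52.0, 56.0, higher_is_better=False)

print(round(yield_mph, 2), round(yield_bph, 2))
print(round(dfi_mph, 2), round(dfi_bph, 2))
```

In the first call both values are positive, the desirable direction for yield; in the second both are negative, which is the desirable direction for flowering time.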
Of the 36 F1 crosses tested for heterosis and heterobeltiosis, the majority showed heterosis in the desirable direction for all the traits under consideration. For days to flower initiation, days to flower completion and plant height, most of the hybrids expressed negative heterosis and heterobeltiosis relative to their parents. For leaf area and head diameter, both positive and negative heterotic effects were observed, while for stem curvature, number of leaves per plant, 100 seed weight and seed yield per plant, the majority of the F1 hybrids expressed positive heterosis over the mid-parent and better-parent means.
The desirability of the heterotic direction depends upon the overall contribution of the component traits towards seed yield or oil yield in sunflower. Negative heterosis is desired in sunflower for flowering traits because plants that begin flowering early have more time left in the field for the grain filling stages; thus negative heterosis for flowering traits will ultimately lead to high seed yield in sunflower 37,38. Likewise, leaf area corresponds to the availability of photosynthetic surface; therefore, heterosis in the positive direction is required 37. Similarly, head diameter is directly proportional to the surface available for seed filling; hence an increase in head diameter over the parental lines is desirable in sunflower breeding programs 39,40,35. Furthermore, Habib et al. 41 and Khan 42 confirmed that higher positive hybrid vigor for 100 seed weight is required, as it is directly linked with the economic yield of the sunflower crop. In our findings, 34 out of 36 hybrids showed positive heterosis; likewise, 23 hybrids also showed positive heterobeltiosis, indicating that these hybrids had a higher test weight than both parents. Previous studies on sunflower heterosis estimation also confirmed that 100 seed weight increases when distantly related genotypes are crossed to produce F1 hybrids in sunflower 37,39,40,43. Many researchers, such as Kaur 40, Radhika et al. 39, Phad et al. 43, Alone et al. 44, Manivannan et al. 45, Sawant et al. 46 and Channamma 47, reported a significant heterotic effect for seed yield in hybrids developed experimentally using diverse male and female lines.
It is generally believed that high-yielding parents are not always able to transmit their high yield potential to their progeny; hence estimation of combining ability is a prerequisite for developing high-yielding and sustainable hybrids or cultivars. To increase sunflower yield vertically, development of hybrids with better yield potential and stability is required. Parents with diverse genetic makeup would generally produce superior transgressive hybrids 48. To make a breeding program fruitful, the first step is to select the parental lines to be used for hybridization. In crop plants including sunflower, genetic variability, type of gene action and combining ability are the most important parameters 49.
The occurrence of both significant GCA and SCA effects in the present study indicates the presence of both additive and non-additive gene effects in the expression of the measured plant traits. GCA effects generally lead to the selection of suitable parents for population improvement or development of synthetic or composite cultivars, which some growers may prefer because their seed can be used for more than one year 48. In the present study, high and significant GCA effects in desirable directions, i.e., negative for flowering, maturity, and plant height traits and positive for the rest of the traits measured, were observed among both CMS and restorer lines. Short-duration varieties are preferred by sunflower growers, as these can reduce the risk of exposure to adverse climatic and biotic factors such as diseases and insect attack 48. Development of high-yielding hybrids/cultivars with a shorter growing period is among the prime objectives of sunflower breeders 49.
In the present study, a higher magnitude of SCA effects than of GCA effects was observed for days to flower initiation, days to flower completion, head diameter, plant height, leaf area and 100 seed weight, suggesting a predominance of SCA/non-additive factors in controlling these flowering and yield-affecting traits in sunflower. Control of these sunflower plant traits through dominant or epistatic gene action has been previously reported in many studies 40,50,51. In contrast, a higher magnitude of GCA effects was recorded for stem curvature, number of leaves per plant and seed yield per plant. These higher values of GCA than SCA indicate that the genes controlling these traits show additive inheritance; therefore, the parental lines to be used in crossing programs must first be improved, or should already have high potential for these traits, before being used in a sunflower hybrid breeding program.

Conclusion
Application of machine learning in plant improvement programs could become a vital tool for breeders, as it can speed up the steps involved in the release of a final cultivar for general cultivation. However, more effort is needed for optimization and accurate application of machine learning algorithms in plant breeding. In this study, two unsupervised machine learning clustering algorithms, i.e., hierarchical and k-means, were applied to a combined morphological, biochemical, and molecular dataset of sunflower genotypes. In addition, a hybrid cluster algorithm of hierarchical + k-means was also designed and implemented on the same dataset for heterotic group identification. Results showed that the hierarchical clustering approach is more suitable in the given circumstances. Hence, 12 heterotic groups were identified (6 for CMS lines and 6 for restorer lines), and one genotype from each group was selected as a representative of the whole identified group. The 12 selected lines (one from each heterotic group) were crossed in an L × T design, and the resulting F1 hybrids were evaluated in open field conditions for combining ability and heterosis studies. Results showed that most of the hybrids developed exhibited a significant amount of heterosis for all the studied traits and, more importantly, in the desirable directions. However, three hybrids

(a) Data from nine morphological parameters were collected from 109 sunflower genotypes.
(b) Genotyping data scores (binary format) were collected from amplification data of 40 SSR primers.
(c) Protein characterization data (binary format) were extracted from the 14 different protein bands obtained after SDS-PAGE analysis of the respective 109 sunflower genotypes.
(d) Collected data from all three sources (morphological, molecular, and biochemical) were scaled to a range of 0-1.
(e) Three different machine learning classifiers (hierarchical, K-means and hybrid (hierarchical + K-means)) were employed for selection of suitable genotypes for F1 hybrid development and evaluation of identified heterotic groups in the studied genotypic pool of sunflower.
(f) The resultant classification pattern obtained from each machine learning classifier algorithm was evaluated for selection of suitable genotypes.

Figure 1. A schematic workflow of the procedures adopted to identify the underlying heterotic grouping pattern in the studied germplasm pool of sunflower. The input dataset was obtained by combining phenotypic data (9 morphological traits), genotypic data (binary scores from 40 SSR markers), and proteomic data (binary scores from SDS-PAGE based characterization). The dataset was then scaled to values ranging between 0 and 1 to remove bias arising from differing measurement scales. Three machine learning based clustering algorithms were then tested to produce the heterotic grouping in the studied sunflower materials. Of the three, hierarchical clustering showed the maximum resolution power in terms of correctly classifying the A, B, R and SFP lines. Hierarchical clustering classified the 109 sunflower genotypes into 12 heterotic groups, and this clustering pattern was then used to develop F1 hybrids by selecting one sunflower line from each heterotic group. Both hierarchical and K-means based machine learning clustering have been widely used on plant datasets (including phenotypic and genotypic data) for classification of plant genotypes.
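The workflow in Figure 1 can be sketched in a few lines of standard-library Python. This is only an illustrative sketch, not the procedure used in the study: the toy data, the function names, the centroid linkage, and the way the hybrid step seeds K-means with the centroids of the hierarchical clusters are all assumptions made for the example.

```python
import math

def min_max_scale(rows):
    """Rescale every column to the 0-1 range (constant columns become 0)."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in rows]

def centroid(points):
    """Column-wise mean of a non-empty list of points."""
    return [sum(c) / len(points) for c in zip(*points)]

def hierarchical(points, k):
    """Naive agglomerative clustering (centroid linkage is an assumption).
    Returns k clusters, each a list of row indices."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        cents = [centroid([points[i] for i in cl]) for cl in clusters]
        n = len(clusters)
        # Merge the two clusters whose centroids are closest.
        _, a, b = min((math.dist(cents[a], cents[b]), a, b)
                      for a in range(n) for b in range(a + 1, n))
        clusters[a] += clusters.pop(b)
    return clusters

def kmeans(points, centers, max_iter=100):
    """Lloyd's algorithm started from the supplied centers; returns labels."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        new = [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
               for p in points]
        if new == labels:  # assignments stopped changing
            break
        labels = new
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = centroid(members)
    return labels

def hybrid_cluster(rows, k):
    """Scale, cut a hierarchical tree into k groups, then refine with K-means."""
    scaled = min_max_scale(rows)
    seeds = [centroid([scaled[i] for i in cl]) for cl in hierarchical(scaled, k)]
    return kmeans(scaled, seeds)

# Hypothetical toy rows: one morphological value plus two binary marker scores.
toy = [
    [150.0, 1, 0], [155.0, 1, 0], [152.0, 1, 1],
    [90.0, 0, 1], [95.0, 0, 1], [92.0, 0, 0],
]
labels = hybrid_cluster(toy, 2)
print(labels)
```

On this toy input the first three rows fall into one group and the last three into the other. Scaling before clustering matters here: without it, the morphological column (values near 100) would dominate the binary marker columns in every distance calculation.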

Range of heterosis for days to flower initiation reported in the present study was from 10.14**%

Table 1. Mean performance of parents regarding morphometric characteristics. DFI days to flower initiation, DFC days to flower completion, PH plant height, LA leaf area, HD head diameter, S.C stem curvature, LP leaves per plant, HSW 100 seed weight, SYP seed yield per plant.

Table 3. Heterosis (mid parent) and heterobeltiosis (better parent) of 36 sunflower hybrids. MPH mid-parent heterosis, BPH better parent heterosis, *significant, **highly significant, ns non-significant, DFI days to flower initiation, DFC days to flower completion, LA leaf area, HD head diameter, PH plant height.
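The two quantities reported in Table 3 follow the conventional definitions: mid-parent heterosis compares the F1 with the mean of its two parents, and heterobeltiosis compares it with the better parent, both expressed as percentages. A minimal Python sketch is given below; the function names, the `higher_is_better` flag, and the numbers are hypothetical, and the significance marks (*, **) in the table come from statistical tests that are not reproduced here.

```python
def mid_parent_heterosis(f1, p1, p2):
    """MPH (%) = (F1 - MP) / MP * 100, where MP is the mean of both parents."""
    mp = (p1 + p2) / 2
    return (f1 - mp) / mp * 100

def better_parent_heterosis(f1, p1, p2, higher_is_better=True):
    """BPH (%) = (F1 - BP) / BP * 100, where BP is the better parent.

    For traits where a smaller value is desirable (e.g. days to flowering),
    pass higher_is_better=False so the lower parent is taken as the better one.
    """
    bp = max(p1, p2) if higher_is_better else min(p1, p2)
    return (f1 - bp) / bp * 100

# Hypothetical seed yield per plant (g): parents 40 and 50, hybrid 60.
mph_yield = mid_parent_heterosis(60, 40, 50)
bph_yield = better_parent_heterosis(60, 40, 50)

# Hypothetical days to flower initiation: parents 60 and 65, hybrid 58.
bph_dfi = better_parent_heterosis(58, 60, 65, higher_is_better=False)

print(round(mph_yield, 2), round(bph_yield, 2), round(bph_dfi, 2))
```

A negative value for a flowering trait, as in the last call, is the desirable direction because it indicates a hybrid that flowers earlier than its earlier parent.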

1.02*. The best general combining ability recorded for plant height was from CMS-HAP-12 (13.22**), while the lowest GCA estimate of −10.3** was shown by CMS-HAP-111. Stem curvature GCA estimates of all 12 parents under study were found to be statistically non-significant. GCA for number of leaves per plant was highly significant for two CMS lines, viz., CMS-HAP-111 (−1.94**) and CMS-HAP-12 (4.53**). RHP-71 (0.64 ns) showed the maximum GCA among tester lines. For 100 seed weight, only two parental lines, i.e., CMS-HAP-112 (0.45*) and RHP-69 (0.41*), showed good general combining ability for this important yield-related trait. CMS-HAP-12 exhibited the highest GCA effect of 20.43** for seed yield per plant among female lines, while no male tester line exhibited a significant positive GCA effect for seed yield.

Specific combining ability (SCA)
Results for the specific combining ability of the thirty-six sunflower hybrids developed from 12 parental lines following the L × T mating design for nine agro-morphological traits are presented in Table 6. The SCA effect of CMS-HAP-12 × RHP-68 (3.18**) was the highest for DFI, while the SCA estimate of −2.9** shown by CMS-HAP-112 × RHP-41 was the lowest. The SCA estimate for days taken to flower completion was highest for CMS-HAP-12 × RHP-68 (3.60**), while the CMS-HAP-112 × RHP-68 cross combination recorded the maximum negative SCA effect for DFC, indicating that this cross combination flowers earlier than the rest of the hybrids studied. Significant SCA estimates were recorded for all 36 hybrids for leaf area, with the maximum SCA effect of 20.87** observed for CMS-HAP-54 × RHP-38. Only three hybrids showed a positive and significant SCA magnitude for head diameter, with a maximum value of 2.46* (CMS-HAP-12 × RHP-38). For head diameter, 21 hybrid combinations depicted negative SCA estimates, showing that head diameter of hybrids was less than that of their respective parents. The highest magnitude of SCA for plant height was shown by CMS-HAP-112 × RHP-71 (15.6*). SCA estimates for stem curvature were positive for 34 cross combinations. The range of SCA effects for number of leaves per plant was from 3.47* (CMS-HAP-99 × RHP-41) to −3.53* (CMS-HAP-11 × RHP-53). Only one cross combination was found to be significant for head diameter SCA effect and in negative direction, i.e., CMS-HAP-111 × RHP-38 (−1.30**). Positive SCA effects for 100 seed weight were observed in 17 hybrids. For seed yield per plant, the magnitude of SCA recorded was positive for 19 cross combinations, while the maximum positive SCA magnitude was depicted by CMS-HAP-111 × RHP-53 (3.60**) followed by CMS-HAP-112 × RHP-53 (2.93**).

Table 5. Estimates of general combining ability (GCA) of parents regarding morphometric characteristics. DFI days to flower initiation, DFC days to flower completion, PH plant height, LA leaf area, HD head diameter, S.C stem curvature, LP leaves per plant, HSW 100 seed weight, SYP seed yield per plant.

Scientific Reports | (2024) 14:7333 | https://doi.org/10.1038/s41598-024-58049-z | www.nature.com/scientificreports/

Table 6. Estimates of specific combining ability (SCA) of 36 sunflower hybrids regarding morphometric characteristics. DFI days to flower initiation, DFC days to flower completion, PH plant height, LA leaf area, HD head diameter, S.C stem curvature, LP leaves per plant, HSW 100 seed weight, SYP seed yield per plant.
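For readers unfamiliar with how GCA and SCA effects of the kind reported in Tables 5 and 6 are commonly derived in a line × tester design, the sketch below computes them from a table of hybrid means. It assumes the usual decomposition (GCA as the deviation of a parent's hybrid mean from the grand mean, SCA as the remaining interaction deviation). The function name and the 2 × 2 numbers are hypothetical, and the significance tests behind the * and ** marks are not included.

```python
def combining_ability(means):
    """Estimate GCA and SCA effects from hybrid means.

    means[i][j] is the mean performance of the hybrid between line i and
    tester j (already averaged over replications).
    """
    n_lines, n_testers = len(means), len(means[0])
    grand = sum(map(sum, means)) / (n_lines * n_testers)
    line_means = [sum(row) / n_testers for row in means]
    tester_means = [sum(col) / n_lines for col in zip(*means)]

    # GCA: how far a parent's average hybrid deviates from the grand mean.
    gca_lines = [m - grand for m in line_means]
    gca_testers = [m - grand for m in tester_means]

    # SCA: what is left of each cross after removing both parental GCA effects.
    sca = [[means[i][j] - line_means[i] - tester_means[j] + grand
            for j in range(n_testers)] for i in range(n_lines)]
    return gca_lines, gca_testers, sca

# Hypothetical 2 lines x 2 testers table of hybrid means for one trait.
toy_means = [
    [10.0, 14.0],
    [12.0, 20.0],
]
gca_lines, gca_testers, sca = combining_ability(toy_means)
print(gca_lines, gca_testers, sca)
```

By construction the GCA effects within the lines (and within the testers) sum to zero, as do the SCA effects along every row and column, which is a quick sanity check on any table of estimates. The study itself used 6 lines and 6 testers, giving the 36 crosses listed in Table 6.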
Since the leaf is the site of photosynthesis, previous results indicate that positive heterotic values are required for sunflower leaf area. Both negative and positive heterosis and heterobeltiosis values were found for leaf area in the present experiment, and these findings are also supported by Khan et al.