Introduction

Bread wheat (Triticum aestivum L.) is vital to the world's agricultural economy. Understanding the basis of its genetic and phenotypic diversity is essential to increase grain yield. Such a goal requires statistical methods with the correct application1. In this regard, cPCA has been one of the most important, simplest, and most widely used statistical tools for breeders. PCA aims to reduce data complexity in several principal components (PC). This multivariate method, in addition to studies of phenotypic and molecular diversity2,3, has been used in confirming population structure4,5, genotype selection6, understanding the pattern of genotype-by-environment interactions7,8, and selection of traits for yield modeling9. Despite its widespread use, the results of this technique can be highly biased in population genetic research10 and GWAS11, which have been cited as potential pitfalls12.

Most of the problems of PCA are due to the high sensitivity of this method to the presence of outliers in the data. Phenotypic and genotypic data are susceptible to outlier samples, regardless of the cause. Outlier observations have been observed in single-site and multi-environment trials13,14, DNA-seq15,16, and RNA-seq17 data. Therefore, the presence of such observations in the data is inevitable and violates the underlying assumptions of many statistical analyses18. The problem of outlier observations does not end only with their adverse effects, but identifying these observations and managing them is a very challenging task19,20. So far, several statistical methods have been developed to identify outliers, being the multivariate detection of outliers using PCA one of these approaches12. The sensitivity of cPCA to outliers can be solved by using a robust PCA (rPCA) algorithm17,21.

Over the past years, GWAS has proven its power in quantitative trait locus (QTL) mapping. Many researchers have tried deciphering and dissect the genetic basis of various wheat traits using this approach22,23,24,25,26. Meanwhile, various QTLs distributed in almost all chromosomes have been identified for important wheat traits such as spike weight, grain number, thousand kernel weight, and grain yield24,25,27,28,29,30,31. It was also concluded that many candidate genes for wheat grain yield under rainfed environments are in the form of gene clusters32. In some GWAS studies based on gene annotation, it has been determined that SNPs identified for yield and yield components under drought stress are linked to genes that play an important role in plant growth and survival33,34,35. Statistical methods are the core of GWAS36. Although several of these methods have been developed for GWAS, most of them are sensitive to phenotypic outliers, making their results unreliable11. Population structure is used as a covariate in GWAS to avoid false positive rates. Due to the high dimensionality, determining complex population substructures using SNPs is challenging37. PCA is used as a simple method in setting population classification in these types of studies, and usually, its first few PCs are included as covariates of the population structure in the model. This method may not be suitable for properly diagnosing the population structure in the presence of outlier data21. Such PCA can wrongly consider linkage disequilibrium (LD) structure instead of population structure, and principal component scores may capture outliers12. Therefore, the linear combinations of the first three PCs, included as covariates in the model, can affect the GWAS results. On the other hand, sometimes, instead of phenotype variables, the PC obtained from them is used for GWAS38,39,40,41. Therefore, the contamination of phenotypic data with outlier observations may also distort the GWAS results. In general, cleaning outliers and selecting robust statistical models in GWAS reduces the false positive and negative rates of QTL detection42. In addition to GWAS, it has been reported that different methods of genome-wide QTL mapping studies are also susceptible to outliers, and in their presence, they bring misleading results43.

This study has three specific aims: (i) to identify phenotypic and genotypic outlier observations using PCA and rPCA; (ii) to identify novel putative alleles associated with important spike traits under well-watered and rain-fed environments using GWAS; and (iii) to determine the power of PCA and rPCA linear combinations obtained from phenotypic and genotypic data to reduce the effect of outlier samples in GWAS analysis.

Materials and methods

Plant materials and phenotyping

A total of 294 Iranian bread wheat genotypes, including 90 cultivars released during the last century and 204 landraces collected from different locations (Supplementary Table 1), were evaluated in a randomized complete block design under well-watered (normal irrigation) and rain-fed environments. The experiments were carried out in a research farm with coordinates 50.58 E and 35.56 N and an altitude of 1112.5 m above sea level in 2019–2020. Climate information is presented in Supplementary Table 2 according to the months of the experiment. Five spikes from each genotype were randomly selected, and different traits were measured after being transferred to Urmia University. Spike weight (SW), grain number per spike (GN), grain yield (GY), thousand kernel weight (TKW), and spike internode length (SIL) were recorded. Spike internode length was obtained from the ratio of spike axis length to the number of nodes per spike44. The broad-sense heritability of these traits was calculated through the following equation:

$${H}_{BS}^{2}=\frac{{\delta }_{g}^{2}}{{\delta }_{p}^{2}}=\frac{{\delta }_{g}^{2}}{{\delta }_{g}^{2}+\frac{{\delta }_{ge}^{2}}{{n}_{e}}+\frac{{\delta }_{\varepsilon }^{2}}{{n}_{e}{n}_{r}}}$$

where \({\delta }_{g}^{2}\), \({\delta }_{p}^{2}\), and \({\delta }_{ge}^{2}\) are the genotypic, phenotypic, and genotype-by-environment interaction variances, respectively. \({\delta }_{\varepsilon }^{2}\) is the error variance, Also, \({n}_{e}\) and \({n}_{r}\) are the number of environments and replicates, respectively. Finally, \({H}_{BS}^{2}\) was expressed as a percentage.

Genotyping

Iranian wheat landraces and cultivars were genotyped using the genotyping-by-sequencing (GBS) technique, the details of which have been described previously45. In this method, SNPs were called using the UNEAK GBS pipeline46, and imputation was performed in BEAGLE v3.3.27347 using the w7984 reference genome. For more details, refer to Alipour et al.48. Finally, SNPs with heterozygotes greater than 10% and minor allele frequency less than 5% were removed, and 36,000 SNPs remained for GWAS. The distribution of SNP markers in 21 chromosomes, kinship relationships, and genetic diversity indices have been reviewed in more detail in our previous reports25,49.

Linkage disequilibrium (LD)

Genome-wide linkage disequilibrium (LD) analysis was checked by calculating pairwise marker allele squared correlation coefficient (r2) for all pairwise comparisons. The LD decay pattern for the whole genome and separately for A, B, and D genomes was displayed by plotting pairwise r2 values against genetic distance (bp).

Classical and robust principal component analysis

We used classical PCA analysis and four robust PCA (Proj, Grid, Hubert, and Locantore) methods to evaluate the population structure and identify outliers. Multivariate detection of outlier observations includes examining each observation based on a combination of variables. Usually, researchers deal with more than two variables in their data, so it is necessary to determine outlier observations in terms of a combination of variables. PCA uses the distance of the sample scores as a criterion for investigating outlier observations. The greater the distance between the observed scores, the higher the possibility of outliers. Most rPCAs use the Projection-Pursuit (PP) principle. PP methods are suitable for analyzing data sets with many variables, and they aim to find structures in multivariate data by projecting them on a low-dimensional subspace. These methods maximize a robust measure of spread to obtain consecutive directions on which the data points are projected50. Here, the above four PCA methods are briefly described. The Grid method uses the PP technique to calculate PCA estimators, computed via a grid search algorithm in the plane rather than p-dimensional space51. This algorithm leads to higher amounts of explained variability. Hubert's method, called the ROBPCA algorithm, combines PP and robust covariance estimation, i.e., minimum covariance determinant (MCD) techniques, to compute the robust loadings52. This rPCA method is resistant to outliers in the data and uses the MCD method for low-dimensional data sets. It is also very suitable for high-dimensional data53. ROBPCA yields more accurate estimates of noncontaminated datasets and more robust estimates of contaminated data. In addition, the Hubert model produces a diagnostic plot that displays and classifies outliers52. The spherical principal components procedure, called the Locantore method, is a functional data analysis method. This method performs cPCA on the data but is projected onto a unit sphere. The Locantore approach is very fast, and the estimates of the eigenvectors are consistent. This method is intended to explore the structure of populations of complex objects54. Finally, the Proj approach, which is as fast as the previous method, was used. This approach includes a simple algorithm for approximating the PP-estimators for PCA whose PCs can be sequentially computed. In other words, the Proj approach uses PP to calculate the robust eigenvalues and eigenvectors without going through robust covariance estimation53,55. The cPCA and rPCA analyses were performed using the rrcov package in the R program56. We used the outlier plots provided by these methods to identify outlier observations. Also, The first and second PCs were plotted to obtain a biplot.

Traditional single trait-GWAS

First, GWAS analysis was performed between SNP markers and average phenotypic data for each individual trait in each environment. Since the mixed linear model (MLM) is a standard method in GWAS for various wheat traits57 and has better control over confounding effects25, we used this model. In the MLM model, two covariates are used to prevent false positives: (i) matrix k, which represents kinship relations or family relatedness, and (ii) the first three PCs of cPCA, which are considered as the population structure (stratification) in the model. Simply put, the equation of this model is as follows:

$$Y=SNP+PCs+Kinship+e$$

where Y represents the studied trait, SNP provides genotypic information, PCs are the population structure, and the kinship represents the relationship between individuals in the population using genotypic information, and e is the residual error57. In the MLM model, individuals are considered random effects, and the relatedness among individuals is conveyed through a kinship matrix.

Single trait-GWAS with robust principal component covariates

As we mentioned earlier, classical PCA may not correctly represent the population structure. Therefore, in the second GWAS scenario, we used the four described rPCA methods as population structure. This work was done with the aim of obtaining different linear combinations of genotypic data to moderate the effect of genotypic outliers on GWAS results. Because rPCAs perform better in the presence of outlier observations.

Classical and robust principal component-based GWAS

PCs have been used instead of phenotype variables in GWAS in recent years. Since GWAS for PCs obtained from phenotypic data can be more valuable than traditional single trait-GWAS, we used two PCs instead of the traits themselves in the third scenario. The first two PCs were used because they explained more than 90% of the phenotypic changes, and the remaining PCs did not provide specific information. We also used the PCs obtained from cPCA and four rPCA methods. This work was done to reduce the effect of outlier observations from phenotypic data.

In all three scenarios above, the threshold of − log10 (p) > 3 was used to state statistically significant marker-trait associations (MTAs). Confidence intervals (CIs) for MTAs were calculated using the linkage disequilibrium (LD) decay. GWAS analyses were performed using genome association and prediction integrated tool (GAPIT) R-package58, but the necessary PCs were extracted through the rrcov package in the R program56. The quantile–quantile (Q-Q) plots of the observed and expected P values were plotted at − log10 (p) values to assess the adequacy of a fitted normal straight line to the markers. Finally, Venn diagrams were drawn using an online tool (https://bioinformatics.psb.ugent.be/webtools/Venn/) and the t-test was performed between the alleles of each SNP to compare the results.

Gene ontology (GO)

The sequences surrounding all associated SNPs in the EnsemblPlants (http://plants.ensembl.org/index.html) database were annotated using basic local alignment search tools (BLAST) and aligned with IWGSC v2.1 (International Wheat Genome Sequencing Consortium database) reference genome59. After aligning the SNP sequences with the reference genome, overlapping genes with the highest percentage identity (%ID) and the lowest expected value (E-val) were selected for further processing and interpretation. Then, the ontology of genes, i.e., molecular function, biological process, and their cellular component, was extracted from the ensemble-gramene database. It should be noted that markers with almost identical sequences were filtered. Finally, the gene network was analyzed using the genes identified in the database for annotation, visualization, and integrated discovery (DAVID) bioinformatics resources (https://david.ncifcrf.gov/). This tool examines gene pathways based on the Kyoto Encyclopedia of Gene and Genomes (KEGG) enrichment analysis60,61,62 (https://www.genome.jp/kegg/; www.kegg.jp/kegg/kegg1.html).

Complying with relevant institutional, national, and international guidelines and legislation

The authors declare that all relevant institutional, national, and international guidelines and legislation were respected.

Permission for land study

The authors declare that all land experiments and studies were carried out according to authorized rules.

Results

Identification of outlier observations

Outlier samples were identified using cPCA and different rPCA methods based on phenotypic data in both environments and genotypic data (Table 1). These samples are determined based on the outlier plots provided by the rrcov R-package (outlier plots are not reported). Eight genotypes were identified as outlier samples in well-watered and rain-fed environments, and only genotype 182 was common among them. Based on genotypic data, 16 samples were outliers. Genotypes 4, 200, and 252 were outliers under well-watered using cPCA, and only genotype 252 was confirmed by other rPCA methods. However, in the rain-fed environment, all three genotypes that were identified as outliers by cPCA were also confirmed by rPCA methods. On the other hand, some outlier samples were identified only by robust methods.

Table 1 Multivariate detection of outlier samples based on cPCA and various rPCA methods in different datasets.

According to the results, Hubert's approach differs from other methods, especially in genotypic data. The Hubert method divides the outlier plot into four parts (Supplementary Fig. 1). The normal samples are on the bottom left and are different from the other three outlier types because this category's score and orthogonal distances were low. The bottom right in the well-watered environment included genotypes No. 182, the rain-fed environment included genotypes No. 182 and 219, and the genotypic data included genotypes No. 19, 23, and 24, which had high score distance and low orthogonal distance. In the upper left space of the outlier plot, the genotypes in which the orthogonal distance is high but the score distance is low, are located. It is interesting to note that in the phenotypic data, we did not see an example in this area, but in the genotypic data, three genotypes (219, 262, and 283) were located there. The upper right will contain samples strongly deviated from normal samples and have high scores and orthogonal distances. We found such outlier examples only for phenotypic data. These results can help to understand the nature of outliers and determine their type.

There was a significant difference between genotypes in both environments for all traits. Also, except for SIL, the mean of traits in the rain-fed environment decreased (Supplementary Table 3). Descriptive statistics showed little changes by removing two genotypes (105 and 252) that were outliers by most methods. These changes were different from one trait to another. In SW, the mean decreased, but the genotypic variance increased. This mode for GN was increasing and decreasing, respectively. As expected, SD and δ2GE decreased for all traits (except SIL). The amount of error variance also remained almost constant. The highest and lowest heritability were related to SIL and GY traits, respectively. Removing two outlier observations caused the heritability to increase by about 2% in some traits (Table 2).

Table 2 Mean, standard deviation (SD), variance components, and general heritability (H2BS) for the studied traits in two environments. δ2G, δ2GE, and δ2E represent the genotypic, genotype-by-environment, and error variances, respectively$

Linkage disequilibrium and population structure

The pattern of LD decay differed between sub-genomes, ranging from 1707 to 5752 bp, so its level was high in the D genome and low in the B genome (Supplementary Fig. 2). The distribution of genotypes based on the first two PCs in all methods showed a clear distinction between Iranian wheat cultivars and landraces (Fig. 1). Some cultivars were among landraces. By looking closely at the pedigree of these cultivars, we found that some of them, such as Ohadi, Roshan, and Homa, are originally landraces that have experienced various selection processes. Although, in general, the distribution pattern of genotypes was similar in all five PCA methods, it seems that the Grid approach is different from others (Fig. 1).

Figure 1
figure 1

Classical and robust principal component analysis (PCA) based on 36,000 SNP markers generated from 294 Iranian wheat landraces (blue) and cultivars (orange).

Traditional single trait-GWAS

In GWAS analysis for five traits, 184 and 139 MTAs were identified in the well-watered and rain-fed environments, respectively. The details of these results are provided in Supplementary Tables 4 and 5. Briefly, In the well-watered, 38 SNPs were associated with GN, about 32% of which were located on chromosome 7A, and 90% of the markers had a significant difference between their two alleles in terms of GN (Table 3). For the same trait in rain-fed, 23 MTAs were discovered, mainly distributed in the A genome as in well-watered, and 60% of them were different between their two alleles. For SW under well-watered and rain-fed environments, respectively, we observed 19 MTAs (mostly in genome A) and 18 MTAs (mostly in genome B), of which 68% and 55% differed between their alleles in terms of average SW. Among the traits studied, most MTAs were assigned to SIL. 65 and 53 MTAs were associated with SIL in well-watered and rain-fed environments, respectively, and most of these SNPs were located on chromosomes 6B and 2B. After SIL, TKW, with 44 MTAs in a well-watered environment and 32 MTAs in a rain-fed environment, had the highest number of MTAs located in chromosome 2B. Mostly, we found a statistically significant difference between the two alleles of these markers in terms of average TKW. Finally, GY was significantly associated with 18 and 13 SNPs in well-watered and rain-fed, respectively, which were distributed in almost half of the 21 wheat chromosomes. Like the previous traits, the markers related to GY differed between their two alleles in terms of the average trait.

Table 3 The test between two alleles of all SNPs associated with traits in two environments.

Single trait-GWAS with robust principal component covariates

The details of single trait-GWAS results with different PCs as population structure are presented in Supplementary Table 4. In addition, Venn plots are drawn in Fig. 2 to show the number of common SNPs between different methods. The effect of population structure on GWAS results varied from one trait to another. In both environments, the number of common SNPs between cPCA and rPCA methods was high for GN, SIL, and TKW traits but low for traits such as GY and SW. Using the PCs provided by Hubert and Grid methods as population structure identified some new markers, especially in SW, GY, and SIL traits. Also, several new markers were identified using a covariate of the population structure by the Proj method in most traits. In some cases, the new SNPs identified by the rPCA covariate were in the same chromosomal regions as those identified by the classical method. However, others were located in new chromosomal regions that could be interesting. In Hubert and Proj's methods, new regions were found on chromosomes 7A and 6B for both SW and GY traits. Using the PCs obtained by the Locantore method, the lowest number of new MTAs was detected compared to other methods.

Figure 2
figure 2

Venn plots represent the number of SNPs in classical PCA and robust PCA for evaluated traits. (A): grain number under well-watered, (B): grain number under rain-fed, (C): grain yield under well-watered, (D): grain yield under rain-fed, (E): spike internode length under well-watered, (F): spike internode length under rain-fed, (G): spike weight under well-watered, (H): spike weight under rain-fed, (I): thousand kernel weight under well-watered, (J): thousand kernel weight under rain-fed.

We evaluated the fitness and efficiency of different models with Q-Q plots. The Q-Q plots of the expected − log10 (p) versus the observed − log10 (p) for PCA covariates were almost the same in both well-watered (Fig. 3) and rain-fed (Fig. 4) environments. In general, in these methods, the observed values were close to the expected values. Only at the tail of the point distribution did we see a deviation from the expected value, indicating significant marker effects. However, minor differences are also visible; for example, the values observed in Hubert's method for GY in well-watered were closer to the expected values. Also, the p-values for GY under rain-fed seems slightly inflated, which is solved in Hubert's method.

Figure 3
figure 3

Quantile–quantile (Q–Q) plots for genome wide association study (GWAS) based on different PCA covariates in studied traits under well-watered environment. (A): Grain number, (B): Grain yield, (C): Spike internode length, (D): Spike weight, and (E): Thousand kernel weight.

Figure 4
figure 4

Quantile–quantile (Q–Q) plots for genome wide association study (GWAS) based on different PCA covariates in studied traits under rain-fed environment. (A): Grain number, (B): Grain yield, (C): Spike internode length, (D): Spike weight, and (E): Thousand kernel weight.

We identified several markers associated with different traits, including two markers located in chromosomes 2A and 5A that were associated with GN, GY, and SW in a well-watered environment by all methods. Markers in chromosomes 4A, 2B, and 5B were associated with the above three traits in rain-fed. The highest number of identical markers in both environments was related to GY and SW. On the other hand, except for TKW, at least one marker was the same for the other four traits in both well-watered and rain-fed environments, which can be considered stable QTLs. As we expected, there was a significant difference between the two alleles of most of the identified SNPs in terms of the studied traits. Nevertheless, significant SNPs were high for traits such as SIL, TKW, and GN in well-watered and TKW in rain-fed. SNPs identified by Hubert and Proj methods had the highest percentage of significant markers for some traits (Table 3).

Classical and robust principal component-based GWAS

Under well-watered, 56 MTAs were discovered for the first two PCs using the classical method, while this number was 63, 66, 55, and 61 MTAs in the Grid, Hubert, Locantore, and Proj methods, respectively (Supplementary Table 5). Almost 30% of SNPs associated with PC1 in all methods were on chromosome 7A. Chromosome 2B was one of the other important chromosomes on which 20% of SNPs related to PC1 were located in Grid and Proj methods and 10% in Hubert's method. Hubert's method obtained the highest number of MTAs identified for PC2. This indicates the different linear combinations of this method compared to others. Also, Hubert's method had the highest number of common markers with traits. Almost 55% of the markers identified for traits were unrelated to any components. In contrast, 15% were observed by the components of all methods. PCs of other methods, especially Grid and Proj, discovered several MTAs that were not observed in other methods and single traits (Fig. 5A).

Figure 5
figure 5

Venn plots showing the number of common markers of GWAS analysis for traits and the first two components of PCA methods in well-watered (A) and rain-fed (B) environments.

Such results were also obtained under drought rain-fed. Out of 139 MTAs identified for traits, 65 were absent in GWAS for the two principal components resulting from traits. The highest and lowest MTAs were assigned to PCs obtained from Locantore and Grid methods, respectively (Supplementary Table 5). In the classical method, the markers associated with PC1 and PC2 were mainly located in the A and B genomes, respectively. Unlike other methods with the most MTAs in chromosome 2B for PC2, in Grid and Proj methods, this chromosome contained more SNPs related to PC1. In Hubert's method, chromosomes 1A, 2A, 6A, and 6D had a high association with PC1, and chromosomes 2B and 6B with PC1. Almost the same results as Hubert's were obtained for two PCs from the Locantore method. Interestingly, 18 MTAs were identified for components of the Locantore method that were not present in the others. On the other hand, the SNPs associated with the Hubert method components wholly overlapped with those associated with a single trait (Fig. 5B).

Gene annotation and Kyoto encyclopedia of genes and genomes (KEGG) pathways

Among all identified markers, 77 overlapped with a range of genes mainly located on chromosomes 2B, 3B, 6B, 1B, 7A, 4A, and 6A (Supplementary Table 6). Some of these genes had biological processes and molecular functions, forming part of cellular components such as membranes and nuclei. Defense response, protein phosphorylation, regulation of transcription, and DNA template are the most visible in biological processes. On the other hand, protein binding, ADP binding, DNA binding, iron ion binding, etc., were molecular functions of genes. The results of GO enrichment analysis and testing of statistically enriched pathways were performed in the KEGG, and some pathways, including biosynthesis of flavonoid, carotenoid, and secondary metabolites, were identified. Other pathways were metabolic, ubiquitin-mediated proteolysis, protein processing, plant-pathogen interaction, and fatty acid elongation (Table 4). Interestingly, some KEGG pathways were detected only by Hubert’s method.

Table 4 Kyoto encyclopedia of genes and genomes (KEGG) pathways (www.kegg.jp/kegg/kegg1.html) for overlapping genes identified by different methods.

Discussion

PCA is a data dimensionality reduction method. In the analysis above, it is possible to convert many primary variables into a few new principal components, i.e., linear combinations of the original variables, that express the most variance observed in the original data. In recent years, PCA has also been used to help detect outliers. Multivariate outlier detection includes multivariate analysis of each observation based on a combination of variables19,21,63. Outliers samples are observations that fall outside the general distribution pattern, and contamination of data with these samples is more of a rule than an exception53. Outliers can be caused by various factors such as low seed quality, outlying response, misidentification of genotype, wrong data imputation, etc.20,64. Looking more closely at the outliers, we found that genotype No. 252, which was an outlier under well-watered by all methods, had poor seed quality and shriveled because it had high GN and low TKW. Probably, genotype No. 105 had a wrong classification in well-watered because its parameters were similar and even higher than the average of well-watered. Such deviations can have a high impact on the genotypic average65.

In the present study, phenotypic data compared to genotypic data, different methods identify the same samples as outliers. This is probably due to the high dimensionality in the genotypic data. Determining outliers in high-dimensional genomic data is difficult17,20. However, using different methods to discover them can be useful. In the outlier plot obtained from the ROBPCA algorithm (Hubert), it was found that no outlier sample had high scores and orthogonal distances based on genotypic data. If the number of accessions is high, the outlier status of some samples will change. Genotype No. 161 was present in a core set designed from among 2403 Iranian wheat landraces, and 12% of other genotypes were in outlier groups16. In analyzing the diversity of 80,000 wheat accessions worldwide, it was found that some are outliers. A closer examination indicated that some samples were misclassified in their passport information66. This is an example of the application of identifying outlier samples. Also, tools have been developed based on PCA analysis to detect outliers67 to, e.g., determine the SNPs under selection, and involved in biological adaptation, and have been used in various crops15,68,69.

A good separation of cultivars from landraces based on PCA was expected because there are traces of foreign origin in the pedigree of many Iranian wheat cultivars. Some of the cultivars mixed with the landraces are old and the detailed information about their pedigree is not available, but they are most likely of landraces origin. In this regard, Khadka et al.4 have discussed the possibility of mistakenly collecting exotic germplasm instead of landraces in Nepalese wheat during the 1970s to 1990s. According to our results, it has been reported that PCA based on SNPs can divide wheat accessions into separate groups according to breeding origin32,70. In our study, the distribution of landraces was not strictly based on their geographical origin where they were collected. The subgroups identified as population structure can represent different wheat breeding programs33. Therefore, using PCs as a population structure in GWAS can avoid false positive associations.

Efforts to increase wheat yields continue to meet human needs. In this regard, evaluating spike traits is critical due to their direct relationship with the grain. Several studies have emphasized the importance of spike traits71,72,73. As it was apparent, the traits studied under well-watered showed a decrease in average. SIL was an exception to this rule because the number of nodes decreased in parallel with the reduction in spike length. Heritability range varied from 50% in GY to 75% in SIL. The heritability estimated in this study for SIL is almost equal to that reported in a similar study44. The high heritability of this trait might be due to being mainly controlled by genetic effects. While for GY, which has complex control, heritability was low25,74. Our results demonstrate the distorting effect of outlier samples in estimating heritability, as previously mentioned65. Due to the incorrect estimation of heritability in the presence of outliers, it is suggested to use the robust estimation of heritability and to pay attention to these methods along with the classical approach18. In another study, the robust DF-REML framework for estimating variance components was proposed, which could provide a robust estimate of heritability75. However, many reports still do not pay attention to outlier data, which seems to be due to a lack of deep understanding of the distorting effects of this data.

We performed GWAS based on three approaches. Although these methods had a lot in common in highly significant MTAs, there was a difference between the methods in terms of significant SNPs at the level of − log10 (p) > 3. Such changes in GWAS results are because genomic data are rarely of good quality and are contaminated with outliers, missing values, and noise20. Q-Q plot is a convenient tool that shows control of type I error in MTA detection and can be used for model selection in GWAS76,77,78. Population stratification or cryptic relatedness leads to systematic deviation from the diagonal at the upper-right end of the Q-Q plots79. Based on this plot, there was not much difference in the results. This result is probably why we did not have outlier observation in SNP data of the third type (both score and orthogonal distances high). However, it seems that using PCs obtained from Hubert's approach can be useful in data with outlier observations of the first and second type. Also, other rPCA methods need to be investigated in diverse populations because we saw slight differences in the deviation from the diagonal at the Q-Q plots of different methods. Finally, using cPCA in genotyping data that are not contaminated by outlier observations is better.

The distribution of GY-related markers in different chromosomes has been observed before29,80. 4A was the chromosome that contained MTAs for GY in both environments; even SNP rs12851 on this chromosome showed its significant association in both environmental conditions. The same result was reported in similar studies that observed a stable association between chromosome 4A and yield under drought stress81,82. In addition, SNP rs5823, associated with GN and PC1 in all methods, was located on chromosome 4A. Interestingly, it was present in the carotenoid biosynthesis pathway and supported the cellular anatomical entity. When we used Hubert's PC for population structure, chromosome 7A was most associated with GY. According to the annotation, SNP rs30826 on this chromosome played a role in the membrane structure and was present in the plant-pathogen interaction pathway. It has been proven that there are crosstalks between responses to biotic and abiotic stresses, and some stress-resistant genes can respond to biotic and abiotic stresses83. Also, the plant-pathogen interaction pathway was involved in the drought stress resistance of transgenic wheat84. In a robust GWAS analysis, to overcome the problems of outlier observations, the linear mixed model approach was strengthened using the β-divergence method. This method performed better than the linear regression and mixed model approaches in the presence of outlier data and identified new SNPs that can be used in breeding programs11. The combination of GWAS and t-tests help identify significant SNPs85, confirming CAPS markers86, and identifying favorable SNP alleles87. Hence, we tracked the t-test for a more accurate assessment of the changes in phenotypic data due to allelic variation. The Hubert and Proj method had the highest percentage of significant SNPs in some traits compared to the classical method. The non-significant SNPs between the two alleles in terms of the t-test were distributed in most traits in a range of chromosomes; however, 2B and 7A had the highest number of these SNPs for SIL and GN traits, respectively.

Principal component-based GWAS is statistically more powerful than single-trait GWAS and requires less computational time than multi-trait GWAS. Therefore, it is an efficient method to identify pleiotropic markers39. If the first two PCs explain a high percentage of trait variation, they can be suitable enough for GWAS38,40,41. In the present study, these two PCs captured more than 90% of the variation. Chromosome 7A had a high percentage of MTAs related to PC1 in all methods. The mentioned chromosome is important due to its association with multiple agronomic traits88 and contains pleiotropic QTL89,90. Chromosome 2B was among the other regions associated with PCs in both environments. Pleiotropic loci have been reported in chromosome 2B91. Interestingly, under well-watered, SNP rs7805 located on chromosome 3B was associated with PC1 in all methods, while it was not significantly associated with any trait. 3B is a chromosome that simultaneously controls several yield components25,92. Such QTLs controlling two or more traits lead to high genetic correlation in many traits. The presence of such QTLs in the present study is not far from expected because the correlation between spike traits is certain27,44.

Although PC-GWAS increases statistical power and has led to the identification of new SNPs in some GWAS studies40,41,93,94, the presence of even a single outlier has a negative effect on PCA results95. The impact of outlier samples on GWAS results has already been discussed. It has been stated that the number and identity of QTLs identified through GWAS are influenced by the curation and preparation of phenotypic data. If the phenotypic data is contaminated with outliers, the number of suspicious QTLs will increase, especially in loci with unbalanced allelic frequency64. Therefore, it seems necessary to identify the outliers and then enter the next steps for the accuracy and precision of GWAS. For this reason, we used rPCA to obtain other linear combinations of traits that moderate the effect of outliers as input for GWAS. As far as we searched, such an approach has not been investigated in any GWAS. Under well-watered, SNP rs19833 on chromosome 5B, associated with PC1 in Grid and Proj methods, was involved in protein processing in the endoplasmic reticulum. There is evidence of the consequences of protein processing in the endoplasmic reticulum on the final grain size of wheat96, and we know that spike traits strongly affect the final grain size. In addition, based on the above two methods, other markers were identified that play a role in essential candidate genes, such as protein binding, protein phosphorylation, and lipid metabolic process. We saw a similar situation in the well-watered environment where SNP rs48316 located in chromosome 2B, was not associated with any trait but was significantly associated with PC1 in different methods. According to KEGG results, this SNP overlapped with the TraesCS2B02G098500 gene and was present in pathways such as the phosphatidylinositol signaling system, inositol phosphate metabolism, and metabolic pathways. All three of these pathways play a role in tolerance to drought stress97,98,99.

Conclusion

When phenotypic and genotypic data are contaminated with outlier observations, PCA and different rPCA algorithms provided promising results in identifying them. We have shown that in such a situation, cPCA should be used cautiously in GWAS due to its sensitivity. Using robust strategies in GWAS, putative alleles for important agronomic traits in wheat were identified that were not found in conventional GWAS. Linear combinations of Hubert and Proj methods moderated the effect of phenotypic outliers to a great extent and effectively identified pleiotropic markers. Also, in GWAS results, the population structure provided by rPCA methods discovered several new QTLs associated with traits. Robust strategies in GWAS reduce the risk of missing interesting rare alleles. Finally, it is necessary to pay more attention to the effect of outlier samples and the above methods in future GWAS studies.