Introduction

One of the major challenges in recent genetics research is to explore the biomarker genes that are linked to complex traits of interests in living organisms. Trait variations in living organisms are related to genetic variations in genes. These variations are observed largely at physiological, developmental, and morphological stages. Identification of important genetic basis such as causal genetic variants for such distinction in phenotypic traits is identifiable at single nucleotide polymorphism (SNP) levels. The techniques to explore the SNP contribution in phenotypic variation are termed as Genome-Wide Association Studies (GWAS). SNPs, however, are usually tested for relationship study through the whole genome with the characters of important trait of interest. The SNPs identified by GWAS can be used for the treatment and prevention of certain complex traits in living organisms. A very large set of SNPs along with a very large number of accessions are simultaneously studied using different GWAS methods to uncover the significant relationship between genomic latent factors and phenotypic variations of interest1.

Linear regression model (LRM) is a popular approach in GWAS. It is implemented through PLINK software for detecting important SNPs associated with quantitative trait2. The PLINK tool focused on parametric or nonparametric-based linear regressions which do not control population stratification. Population stratification is one of the main concerning issues when extensive genome-wide association analysis with numerous subjects is in consideration3,4,5. Some unidentified new population structures are probable to exist due to a large number of subjects that may perhaps liable for regular dissimilarities being selected in SNPs amongst cases and controls4. Due to the higher false discovery rate (FDR), it is imperative to correct the observed population stratification in GWAS4,6. There are, however, several statistical methodologies proposed earlier for GWAS to address the effects of population stratification. The most commonly used statistical methods to avoid the bias of population stratification (PS) or genetic relatedness are genomic control7, structured association8, and principal component analysis9,10. The genomic control (GC) method adjusts the statistical relationship by a common element concerning the whole set of SNPs for precise correction of PS4. If the consequence of population structure increases, the power of GC approach decreases9,11,12,13,14. The structured association (SA) analysis technique suggests locating the samples to separate subpopulation groups for collecting signs of a relationship in each group8. SA method is only useful for small datasets4. Principal Component Analysis (PCA) approach is also used to overcome the influence of population stratification in GWAS by using several topmost principal components (PCs) as covariates4,9. But none of the methods mentioned above can handle the influence of the polygenic effect. To overcome these issues, the linear mixed model (LMM) was proposed which is one of the most popular approaches in GWAS. It is widely using through several computer software such as TASSEL13, EMMA 15, EMMAX16, rrBLUP17, GAPIT18, and GAPIT Version 219.

However, all the methods as early discussed are very much sensitive to phenotypic outliers. So, they can produce misleading results in presence of outlying observations. To overcome these issues, an attempt is made to robustify the LMM based GWAS by using a new type of outlier modification rule based on the minimum β-divergence method20,21. The performance of the proposed approach has been investigated using both simulated and real rice genome datasets related to flowering time.

Results and discussion

We investigated the performance of the proposed method compare to two popular approaches (LMM and LRM) using both synthetic and real data analysis as discussed below:

Results and discussion based on a complete simulation

To investigate the performance of SNPs detections with the synthetic datasets, we considered two original clean simulated datasets that were generated with heritabilities 0.2 and 0.3 respectively, as described in the materials and method section. We contaminated 1%, 2%, 3%, 4%, and 5% phenotypic observations by outliers to generate five contaminated datasets with each clean dataset to investigate the performance against the phenotypic outliers. Before going to the performance comparison, first, we would like to discuss the preprocessing steps for the proposed method as follows.

Outlier detection and modification of phenotypic observations by the proposed method

To analyze these datasets by the proposed method, at first, we identified phenotypic outliers by using the β-weight function for each genotypic group and then replace the outlying phenotypic observations with the corresponding group mean computed by the minimum β-divergence method. To show how β-weight function detects outliers, we plotted β-weight corresponding to each phenotypic observation in the Supplementary Fig. S1 (a-b). Supplementary Fig. S1a consists of two panels, where the left panel plotted the original phenotypic observations and the right panel plotted their β-weights. Similarly, Supplementary Fig. S1b consists of two panels, where the left panel plotted the phenotypic observations including the 5% contaminated observations (red color), and the right panel plotted their β-weights. To select the outlying observations, we used the threshold value τj = pth quantile value of the empirical distribution of \(W_{\beta } (y_{li} |\hat{\theta }_{l,\beta } )\) as introduced in Eq. (7). We observed that the β-weight function correctly identified the outlying observations. Now we would like to discuss the consequence of outliers in the classical and proposed approaches by decomposing phenotypic variations as follows.

The consequence of outliers on the partition of total phenotypic variations

To discuss the consequence outliers on the partition of total phenotypic variations for both classical and proposed approaches, we considered the original clean dataset including two contaminated datasets based on 2% and 5% outlying observations. Table 1 shows the consequence of outliers on the partition of total phenotypic variations for both classical and proposed approaches.

Table 1 Consequence phenotypic outliers on the partition of total variations with the classical and proposed approaches. The bold text indicates the partition of total variation for the clean dataset.

We observed that variance proportions with respect to the genetic effect gradually decrease as increases the rate of outlying observations in the dataset by the classical approach, while the variance proportions with respect to the genetic effects are almost stable for each level of outlying rates in the dataset by the proposed approach. Therefore, the heritability (h2*) in presence of outliers becomes smaller than the heritability (h2) in absence of outliers by the classical approach, while the heritability (\( \, \mathop h\nolimits_{\beta }^{2*} \, \)) in presence of outliers is almost similar to the heritability (h2) in absence of outliers by the proposed approach.

Performance comparison for SNPs detection

At first, we identified important SNPs by applying each of LRM, LMM, and the proposed methods on each of six datasets with each of two distinct genetic heritabilities 0.2 and 0.3 corresponding to the respective rates 0%, 1%, 2%, 3%, 4% and 5% of outliers. Then we computed statistical power and false discovery rate (FDR) for each of the methods to investigate the performance of the proposed method in a comparison of the classical LRM and LMM approaches. We computed average power and FDR based on 1000 replication of each dataset. Figure 1 showed the effect of outliers on statistical power and FDR with each of two distinct genetic heritabilities 0.2 and 0.3. It is observed from Figs. 1(a, c) that the power of the proposed method slowly decreased compare to the LMR and LMM approaches with the increasing rates of outliers. In absence of outliers, both the proposed and LMM approaches produced almost identical powers but much larger than the power of LRM. For example, in absence of outliers, the power of LRM, LMM and proposed methods were 67.55, 74.55 and 73.00 respectively for the dataset with heritability 0.2 (Fig. 1a), while 72.35, 89.95 and 88.65 respectively for the heritability 0.3 (Fig. 1c). However, with the increasing rates of outliers, the power of LMM approach decreased dramatically faster than the LRM approach. For example, in presence of 5% outliers with the phenotypic observations, the power of LRM, LMM and proposed methods were 58.35, 38.95 and 62.65 respectively for the dataset with heritability 0.2 (Fig. 1a), while 60.15, 49.10 and 83.90 respectively for the heritability 0.3 (Fig. 1c). Thus the, the proposed method produced much higher powers than both LRM and LMM for both scenarios of heritabilities in presence of outliers. The above two examples also indicated that the power of all three methods increase as the increasing of heritabilities in the datasets.

Figure 1
figure 1

Results computed by LRM, LMM and the proposed methods based on complete simulation (a) plot of statistical power against the rate of phenotypic outliers at heritability h2 = 0.2 and at the cutoff p-value 10–5. (b) Plot of FDR against the rate of phenotypic outliers at h2 = 0.2 and at cutoff = 10–5. (c) plot of statistical power against the rate of phenotypic outliers at h2 = 0.3 and cutoff 10–5 (d) plot of FDR against the rate of phenotypic outliers at h2 = 0.3 and cutoff 10–5.

Figures 1(b,d) showed that FDRs of the proposed and LMM methods were almost same and close to zero in each rate of phenotypic outliers. However FDR for LRM was too high due to the influence of outliers along with the population stratification and polygenic effects22. Thus we may conclude that both LRM and LMM approaches are very much sensitive to phenotypic outliers compare to the proposed method. This results also supported by the decomposition of phenotypic variations described in Table 1, since power of any method decreases as the increasing of computational heritabilities.

Performance comparison with some other robustification techniques

The proposed robustification technique was also compared with other two robustification techniques based on7-sigma rule (\(\overline{y }\pm 7\sigma \)) and inverse-normal transformation (INT)23,24 by the same datasets that were used in Fig. 1. We detected phenotypic outliers by the 7-sigma (7σ) rule and remove them before going to the analysis by LMM model. In the case of INT approach, we performed inverse-normal transformation on the phenotypic observations before going to the analysis by LMM model. In GWAS of quantitative traits/phenotypes, INT is commonly applied when the traits are distributed non-normally24. Figure 2 showed that the proposed method produces slightly higher power compare to 7σ and INT techniques for all cases of heritabilities and outliers. We also observed that the power of INT method is slightly higher compare to 7σ method in presence of outliers, but smaller in absence of outliers. There is one drawback with INT approach for the weak performance compare to the proposed method. The INT approach normalized the traits globally assuming the unimodal trait/phenotypic observations, while proposed method modified outliers corresponding to the genotypic groups assuming the multi-modal trait/phenotypic observations. Actually, phenotypic observations follow 3 modal distribution due to its 3 genotypes. There are two drawbacks with 7σ approach for the weak performance compare to both INT and the proposed methods. This method detected the outlying observations by assuming the unimodal distribution of phenotypic observations like INT approach and reduced the sample size by removing the outlying observation before going to the SNP detection by LMM model. The small sample size decreased the power significantly, which is satisfied by the previous study result25. Thus, the proposed robustification works well compare to 7σ and INT approaches.

Figure 2
figure 2

Plot of statistical power against the rate of phenotypic outliers based on the same dataset as used in Fig. 1 computed by three methods 7-sigmarule, inverse normal transformation (INT) and the proposed methods.

Performance comparison based on real SNP genotype and simulated phenotype data on rice flowering time

We also investigated the performance of the proposed method in a comparison of LRM and LMM based on partial simulation with the real SNP genotype and simulated phenotype data on rice flowering by considering the same condition like Fig. 1. The top four significant SNPs (id2005919, id2005983, ud2000772 and ud7002027) identified in real data analysis (See Table 2) were considered as main effect with the true effect size (ak). From the rest SNPs, we randomly selected 1996 SNPs as polygenic variants (effects). Finally the phenotypic observation were generated using Eq. (9).

Table 2 Identified 11 SNPs and the candidate genes of rice flowering time.

The Fig. 3 shows that the proposed method produces much larger power compare to both LRM and LMM which is supported by the results displayed in Fig. 1. However, with the increasing rate of phenotypic outliers, a slow decreasing rate of power was observed for the LMM method compared to LRM in Fig. 1. Again, with the increasing rate of phenotypic outliers, a decreasing trend of FDR for the LRM method was observed like Fig. 1d. This result could be due to the increase of outlier; the effect of population structure on the phenotype become weak. This result advocating that LMM method can control the confounding due to population stratification but not the outlier, whereas LRM method fail to control both the confounding and outlier. Thus, it may be concluded that the proposed method shows much better performance than the LRM and LMM in presence of phenotypic outliers; otherwise, it keeps the almost equal performance of LMM.

Figure 3
figure 3

Results computed by LRM, LMM and the proposed methods based on partial simulation with the real dataset (a) plot of statistical power against the rate of phenotypic outliers at heritability h 2= 0.2 and at the cutoff p-value 10–5. (b) Plot of FDR against the rate of phenotypic outliers at h2 = 0.2 and at cutoff = 10–5. (c) plot of statistical power against the rate of phenotypic outliers at h2 = 0.3 and cutoff 10–5 (d) plot of FDR against the rate of phenotypic outliers at h2 = 0.3 and cutoff 10–5.

Genome-wide association analysis of rice flowering time by the proposed method

We have analyzed the rice flowering time trait to identify the loci influencing this complex trait. Previous study by Zhao et al.1 on the same trait identified only 2 loci which motivated us to reanalyze the trait using our proposed robust GWAS (rGWAS) method. We have identified 11 significant SNPs using the proposed method (rGWAS) for rice flowering time with the threshold \(p<9.99\times {10}^{-5}\) (Fig. 4). From the Manhattan plot it is observed that among the identified 11 SNPs, six SNPs lie in chromosome 2; one SNP marker belongs to each of the chromosome 6 and 7 and the rest three SNP markers belong to chromosome 8. However, these 11 SNPs did not overlap with the previously identified 2 SNPs. Zhow et al. used the genotype data with missing values as input in EMMA that omit the individuals with missing corresponding to the testing SNP. As a result the sample size was reduced that is responsible to decrease the detection power25,26,27. Moreover, the analyzed trait contained seven phenotypic outliers (Supplementary Fig. S2) that also responsible to reduce the statistical power (Fig. 12). A QQ-plot was also constructed using whole genome p-values of the proposed method. QQ-plot showed that the observed values correspond to the expected values are on or near the middle line between the x-axis and the y-axis (Supplementary Fig. S3) indicating that no genomic inflation has occurred in this analysis as well as population stratification is sufficiently controlled.

Figure 4
figure 4

Manhattan plot using robust GWAS (rGWAS) on the trait of rice flowering time. The x-axis is the genomic position of the SNPs in the genome, and the y- axis is -log10 of the p-values. Each chromosome is colored differently. The grey horizontal line represents the minimal significant level at the cutoff.

Genomic information and characteristics of the candidate genes

To investigate the biological significance of the identified 11 SNPs, we validated these SNPs by using literature review and gold benchmark data. The corresponding chromosome and position of the identified SNPs were used to annotate and find the candidate genes comparing with the reference genome from the Rice Genome Annotation Project (RGAP) (http://rice.plantbiology.msu.edu/) database and collecting necessary genomic information given in Table 2 and Supplementary Table S1.

The SNP id2005644 was identified that located near the gene LOC_Os02g21070 which encodes pentatricopeptide repeats (PPR) domain-containing protein, and is assumed to take part in the biological molecule variation28. Plastid-localized pentatricopeptide repeat protein was reported in a study that is required for both pollen development and plant growth in rice29. A recent study in Arabidopsis showed that PPR containing protein affects flowering time30. Another variant id2005743 located in chromosome 2 and in the gene LOC_Os02g21880 was identified. The gene encodes coiled-coil domain-containing protein. This protein acts as the regulator of protein positioning in the cell during cell division by splitting and organizing signaling paths sequentially and spatially31. A coiled-coil domain containing protein in rice, PAIR1, was reported to express in the early stages of flower development32. Other four candidate genes (LOC_Os02g24770, LOC_Os02g24780 and LOC_Os02g27750, LOC_Os08g01070) were identified those encodes the retrotransposon and transposon transposable elements (TEs). Several studies have reported that these TEs can be induced by heat and cold stress in plants33. For example, Ty1-copia like retrotransposon ONSEN was found to be activated by heat stress in Arabidopsis34. Temperature and photoperiod are also found as two key regulatory factors associated with the flowering time in plants including rice35.

The detected SNPs were also mapped in the region of 100 kb of the genes those involved in the rice flowering time and seed development pathway36 and found five SNPs comprised of four genes involved in the two pathways (Supplementary Table S2). These results are suggesting the potential role of robust GWAS in detecting novel genes.

Functional enrichment analysis of the candidate genes

GO analysis is one of the major bioinformatics techniques for better understanding the underlying biological processes (BP) of the candidate genes along with their molecular functions (MF) and the cellular component (CC) of the genes37,38. Therefore, to more characterize the candidate genes, we have performed GO enrichment analysis and the results are shown in Fig. 5a and Supplementary Fig. S4a-c. The most important pathways or GO terms involving the candidate genes are highlighted with light yellow color in the rectangular boxes (Supplementary Fig. S4a-c). Flower development (GO:0,009,908), abscission (GO:0,009,838), signal transduction (GO:0,007,165), cell death (GO:0,008,219), cellular process (GO:0,009,987), response to stress (GO:0,006,950), and cellular protein modification (GO:0,006,464) were found as the most important pathways activated by the identified candidate genes in rice. Among the identified BPs, flower development (GO:0,009,908) is one of the crucial pathway that play the role for early and healthy grain development in rice39,40 and it is functionally linked to both the reproductive process (GO:0,022,414) and developmental process (GO:0,032,502) of rice plant (Fig. 5a and Supplementary Fig. S4a). Regulation of rice flowering time is delayed by several independent pathways and significantly influenced by prompt vegetative growth and reproductive process41. GO analysis showed that the shoot development (GO:0,048,367) and reproduction stages such as post-embryonic development (GO:0,009,791) are functionally related to rice flower development (Supplementary Fig. S4a).

Figure 5
figure 5

Expression map of GO and SCL for the 11 candidate genes. (a) Represents the expression map of the functional pathways viz., biological process (BP), molecular function (MF), and cellular component (CC) of the six candidate genes. (b) Represents the predicted subcellular location (SCL) of candidate genes in 10 molecular organs viz., cytosol (cytos), endoplasmic reticulum (ER), extracellular (extra), golgi apparatus (golgi), membrane (membr), mitochondria (mito), nuclear (nucl), peroxisome (pero), plastid (plast) and vacuole (vacu).

Abscission, a part of a multicellular organism (GO:0,007,275), in GO enrichment analysis showed that it is directly related to the developmental process (GO:0,032,502) in rice. Earlier studies showed that seed shattering is controlled by the development of the abscission layer pathways in rice that allows offspring dispersal in the natural environment42. Signaling components via signal transduction (GO: 0,007,165) chain activate the different plant steroid hormones namely Brassinosteroids to regulate various growth and developmental programs, including cell differentiation and elongation, reproductive development, senescence, skotomorphogenesis (seedling development in the dark), and vascular differentiation43,44,45. The hormonal signal transduction pathways are also responsible to control several yield-related traits, including leaf angle, plant height, tiller number, and grain size in rice43,44,46,47. The predicted metabolic routes in this study may execute metabolism activities to convert food to energy to run cellular processes, to build blocks for proteins, lipids, nucleic acids, and some carbohydrates48,49. Some enzyme-catalyzed reactions may allow rice plants to grow and reproduce, maintain their structures, and respond to various stresses. The GO pathway, response to stress (GO:0,006,950) is predicted to play the role for controlling cellular activity in terms of movement, secretion, enzyme production, gene expression for the result of exogenous disturbance, temperature, humidity in rice plants.

Subcellular location of the candidate genes

The cytosol is the place where the occurrence of the maximum different metabolisms in plants and most of the proteins in the cell are localized50,51. The predicted result of subcellular localization (SCL) of the candidate genes implied that nine genes out of eleven are localized in cytosol (Fig. 5b and Supplementary Data S1). Plastid is an important molecular organ found in plant cells mostly involve in photosynthesis and other gene expressions52. Photosynthesis is the key physiological parameter in rice that relates ultimately in many aspects to increase rice productivity53. Increase photosynthesis rate can utilize the solar radiation properly which leads to creating early flowering time because flowering signals are produced in leaves41,54. This gene expression in plastid likely to enhance the photosynthesis process, which regulates the leaf anatomy for earlier flowering in rice. Four candidate genes LOC_Os02g27750, LOC_Os06g18000, LOC_Os07g45950, and LOC_Os08g25040 were found in the plastid (plast). Out of the four genes the latter two genes were predicted to be located in extracellular (extra) and membrane (membr) of which LOC_Os07g45950 was predicted for vacuole (vacu) (Fig. 5b). It is also observed that three genes named LOC_Os02g21070, LOC_Os02g21880, and LOC_Os08g01070 are predicted to located in nuclear activity. However, no candidate genes were predicted to belong from the cellular locations viz., endoplasmic reticulum (ER), peroxisome (pero), and mitochondria (mito) (Fig. 5b).

Expression profile of the candidate genes

The expression level of the candidate genes in different organs or tissues such as seedling, vascular cell, root, leaves, post-emergence, pre-emergence, seed, endosperm, embryo, shoots, anther, pistil, and panicle were extracted from the database Rice Genome Annotation Project (RGAP)55 (Supplementary Data S2). Heatmap presented in Fig. 6 exhibits the expression levels of the candidate genes through the organs or tissues. From the figure it was observed that the genes LOC_Os02g21880 and LOC_Os02g24134 showed high-level expression in seedling, root, shoot, and panicle in rice while these two genes exhibited only high-level expression in the vascular cell at 14DAP (Fig. 6). Moreover, seedling, vascular cell at 14DAP and shoots specific expressions were maximum for the gene LOC_Os06g18000. The genes LOC_Os02g21880, LOC_Os02g24134, LOC_Os08g01070 and LOC_Os08g25060 were found with high expression in panicle. The earlier study also suggested that leaves, shoot, and panicles have significant roles in regulating flowering time41,56. Our results obtained from real data analysis also consistent with earlier outcomes1,22,31,57,58,59.

Figure 6
figure 6

Heatmap showing the expression pattern of the identified 11 candidate genes. The heatmap represents the expression in various organs (seedling, vascular cell stage, root, leaves, post, and pre-emergence inflor, seed, endosperm, embryo, shoots, anther, pistil, and panicle) of rice. The color scale bar of the figure represents log2 transformed FPKM values.

Conclusion

The GWAS is a powerful tool to explore the novel biomarker genes at the SNP level. The LMM approach has been considered as the leading statistical procedure to address the two main challenges of embedded population structures and genetic relatedness among individuals when GWAS is performed. It was however observed that this LMM approach is very much sensitive to phenotypic outlier that leads to misleading results. Therefore, in this article, we discussed a way to robustify the LMM procedure for controlling the outlying effect and reducing the puzzling effects of the population stratification and genetic relatedness in GWAS by using the minimum β-divergence method. This method introduced the β-weight function, which played a key role in the robustification procedure. For the convenience of the presentation, we call this method robust GWAS (rGWAS). Simulation results showed that the average power of correct identification of SNPs by LMM and the proposed method is almost the same and greater than 80%, which is much larger than the power of the linear regression model (LRM), in absence of outliers at heritabilities 0.2 and 0.3. The average power of LRM and the proposed method were slowly decreasing as gradually increasing the rate of phenotypic outliers with heritabilities 0.2 and 0.3, while the power of LMM was decreasing sharply and significantly. The false discovery rate (FDR) of LMM and proposed methods are almost the same and much smaller than the FDR of LRM in all cases of our simulation study. Thus, the proposed method outperformed the LRM and LMM in presence of outliers; otherwise, it shows the almost equal performance of LMM which is much better than LRM in absence of outliers. The demonstration of the proposed method with the real genome dataset against rice flowering time identified 11 important SNP makers. To investigate the biological significance of the identified 11 SNPs, we validated these SNPs by using literature review and gold benchmark data. We performed GSEA for the identified 11 SNP makers and SCL analysis to detect more valid SNPs out of 11 that have a significant association with the flowering time and other trait variations in rice. We also studied their expressions in various organs in rice to find the link with the flowering time. From GO analysis, it is observed that the gene LOC_Os06g18000 might play functional roles in flower development and response to stress in rice. Amongst the 11 genes, LOC_Os02g21880, LOC_Os06g18000, LOC_Os02g24134 exhibited larger expression in seedling, vascular cell, root, shoot, and panicle. Also, the gene LOC_Os08g25060 is predicted to provide maximum expression in vascular cell, root, and panicle. SCL results support that the cytosol contains the maximum number of genes. Plastid is an important molecular organ found in plant cells mostly involve in photosynthesis and other gene expressions. In our study, SCL analysis shows that the expression of the gene LOC_Os06g18000 in plastid may act as a flowering promoter. This gene expression in plastid likely to enhance the photosynthesis process which regulates the leaf anatomy for earlier flowering in rice. In GO analysis, it is also observed that this gene expression is associated with flowering in rice. Finally, it can be concluded that phenotypic outliers may significantly affect the analysis results in GWAS. Our proposed robust method outperforms the existing LRM and LMM methods in presence of outliers and the genomic information presented may however provide a basic platform for further biological investigations. To implement the proposed rGWAS method, the R-code and necessary instructions are available at the website.

Materials and methods

Robustification of LMM based GWAS by using the outlier modification rule (proposed)

The linear mixed model (LMM) approaches are extensively applied for genome-wide association studies (GWAS) for observable phenotypic variations in eukaryotic groups. If we consider that, there are m genotypes with n measurements of a phenotype. Efficient mixed-model association (EMMA)15 is such a model generally expressed by the following :

$$ {\varvec{y}} = {\varvec{Xa}} + {\varvec{Zb}} + {\varvec{\varepsilon}} $$
(1)

where \({\varvec{y}}\, = \,(y_{1} ,\,y_{2} ,...,y_{n} )^{\prime}\) is the n × 1 vector of phenotypic observations, and X = (xij) is an n × q matrix of fixed effects including mean, SNPs, and other confounding variables.\({\varvec{a}}\) is a q × 1 vector representing coefficients of the fixed effects. Z = (zij) is an n × m incidence (design) matrix mapping each phenotype to one of the m genotypes. b is the vector of random polygenic effects which follows N(0,σg2K), where σg2 is the polygenic variance component, and K = (kjt) is the m × m genomic relationship matrix. The genomic pairwise relationship coefficient between two individuals, j and t, is defined as follows

$$ k_{{jt}} = \frac{1}{{T_{\varphi } }}\sum\limits_{{i = 1}}^{{T_{\varphi } }} {\frac{{(x_{{ij}} - 2f_{i} )(x_{{it}} - 2f_{i} )}}{{2f_{i} (1 - f_{i} )}}} $$
(2)

where \(T_{\varphi }\) is the total number of SNPs, xijand xit measure the numbers (0, 1, 2) of the minor allele(s) for the ith SNP of the jth and tth individuals respectively, and fi is the frequency of the minor allele. ε is the vector of random error which follows N(0,σε2I), where σε2 is the error variance component and where I is the n × n identity matrix. The overall phenotypic variance–covariance matrix can be represented as \( {\mathbf{V}} = \sigma_{g}^{2} {\mathbf{ZKZ^{\prime}}} + \sigma_{\varepsilon }^{2} {\mathbf{I}}\).

where I is the n × n identity matrix. The variance components for polygenic effects and errors were estimated by restricted maximum likelihood (REML) using spectraldecomposition instead of the iterative expectation–maximization algorithm (EM algorithm)15. The full-likelihood function is maximized when \({\hat{a} = (X^{\prime}H}^{{{ - 1}}} {\mathbf{X}})^{ - 1} {\mathbf{X^{\prime}H}}^{ - 1} {\mathbf{y}}\) and the optimal variance component is \( \hat{\sigma }_{F}^{2} = R/n\) for full-likelihood and \(\hat{\sigma }_{R}^{2} = R/(n - q)\) for restricted likelihood, where \(R = ({\varvec{y}} - {\varvec{Xa}})^{\prime}H^{ - 1} ({\varvec{y}} - {\varvec{Xa}})\) is a function of δ as well and \(H = \sigma^{ - 1} V = {\varvec{Z}}K{\varvec{Z}} + \delta I\) is a function of δ, defined by \(\delta = \sigma_{\varepsilon }^{2} /\sigma_{g}^{2} ,\,\,\,\sigma = \sigma_{g}\). When the maximum likelihood (ML) or restricted maximum likelihood (REML) variance component \({\hat{\mathbf{V}}} = \hat{\sigma }_{g}^{2} {\mathbf{ZKZ^{\prime}}} + \hat{\sigma }_{\varepsilon }^{2} {\mathbf{I}}\) is estimated, the classical F-statistic for testing the null hypothesis Ma = 0 for an arbitrary full-rank p × q matrix M13,60.

$$ F = \frac{{(M\hat{\user2{a}})^{\prime}(M({\varvec{X}}^{\prime}\hat{V}^{ - 1} {\varvec{X}})^{ - 1} M^{\prime})^{ - 1} (M\hat{\user2{a}})}}{p}\,\,\, $$
(3)

with p numerator degrees of freedom and n-q denominator degrees of freedom. The Satterthwaite degrees of freedom are calculated to avoid computationally intensive matrix operations. Both ML and REML estimators are very much sensitive if one or more phenotypic observations in y are outliers. Thus, LMM based F-statistic in Eq. (3) produces misleading results to test the null hypothesis in presence of outliers. In this study, we, therefore, consider the robustification of LMM based F-statistic in Eq. (3) by using the outlier identification and modification rule. There are some procedures for the identification of outliers in the literature; those are suitable when usual/clean data follows the unimodal distribution. However, in the current problem, phenotypic observations follow the multimodal distribution. So, conventional procedures cannot be used to identify the outlying observations properly. Therefore, in this paper, an attempt is made to propose a new outlier identification and modification rule by using the minimum β-divergence methods20,61 as follows:

(i) Select the top-ranking significant SNP associated with the phenotypic variations by using the minimum β-divergence based robust ANOVA62.

(ii) Divide the phenotypic data into m groups corresponding to the m genotypic labels of the selected most significant SNP. For example, let

$$ {\varvec{y}} = (y_{1} ,y_{2} ,...,y_{n} )^{\prime} = (y_{11} ,..,y_{{_{1} n_{1} }} ,...,\;y_{m1} ,..,y_{{mn_{m} }} )^{\prime} $$

be the partition of phenotypic observations corresponding to the selected SNP, where, n = n1 + n2 + …….. + nm.

(iii) Detect the outlying observations from the lth (l = 1,2,….,k) group using the β-weight function defined by

$$ W_{\beta } (y_{li} |\hat{\theta }_{l} ) = exp\left\{ { - \frac{\beta }{{2\sigma_{li}^{2} }}(y_{li} - \hat{\mu }_{l} )^{2} } \right\} $$
(4)

where i = 1, 2, ... , nl

The minimum β-divergence estimators \(\hat{\theta }_{l,\beta } = (\hat{\mu }_{l,\beta } ,\,\hat{\sigma }_{l,\beta }^{2} )\) of the parameters \(\theta_{l,\beta } = (\mu_{l,\beta } , \,\sigma_{l,\beta }^{2} )\) are calculated iteratively by using the following formulas:

$$ \mu_{l,t + 1} = \frac{{\sum\limits_{i = 1}^{{n_{l} }} {W_{\beta } (y_{li} |\theta_{l,t} )y_{li} } }}{{\sum\limits_{i = 1}^{{n_{l} }} {W_{\beta } (y_{li} |\theta_{l,t} )} }} $$
(5)

and

$$ \sigma_{l,t + 1}^{2} = \frac{{\sum\limits_{i = 1}^{{n_{l} }} {W_{\beta } (y_{li} |\theta_{l,t} )(y_{li} - \mu_{l,t} )^{2} } }}{{(\beta + 1)^{ - 1} \sum\limits_{i = 1}^{{n_{l} }} {W_{\beta } (y_{li} |\theta_{l,t} )} }} $$
(6)

The notation θt+1 denotes the update of \(\theta_{t}\) in the (t + 1)th iteration. The robustness of these estimators has been discussed in the background of influence function20 and their reliability61. It is noteworthy that the minimum β-divergence estimators \(\hat{\theta }_{l,\beta } = (\hat{\mu }_{l,\beta } ,\hat{\sigma }_{l,\beta }^{2} )\) reduce to the classical maximum likelihood estimators (MLEs) \(\hat{\theta }_{l} = (\hat{\mu }_{l} ,\hat{\sigma }_{l}^{2} )\) when β = 0.

It is considered that the MLEs of a Gaussian distribution are consistent and asymptotically efficient in absence of outlying objects63. Therefore, in this article, an effort has been provided to develop a robust linear mixed model (LMM) method in which the classical MLEs \(\hat{\theta }_{l}\) are used in absence of outlying objects and minimum β-divergence estimators \(\hat{\theta }_{l,\beta }\) stated in Eq. (5) and (6) are used in presence of outliers for estimation of θl in the mixed model. The minimum β-divergence method suggests two approaches for combining the robustness and efficiency of estimation in LMM. The tuning parameter β is selected through the cross-validation (CV) technique20. CV process produces β = 0 for the minimum β-divergence method estimators and is then equivalent to the classical estimators. When there are outlying subjects in the phenotypic traits, the technique generates β > 0 for the minimum β-divergence estimators. To overcome the challenges of outlying observations in GWAS, an alternative approach that is the β-weight function mentioned in (4) has been proposed with β = 0.2 for outlier detection. This weight function imposes smaller weights (≥ 0) to outlying observations and larger weights (≤ 1) to uncontaminated/usual objects.

An outlying phenotypic observation yli in the lth group is defined based on the β-weight function mentioned below:

$$ W_{\beta } (y_{li} |\hat{\theta }_{l,\beta } ) = \left\{ \begin{gathered} >\tau_{l} ,{\text{if}}\;y_{li} \;{\text{is}}\;{\text{not}}\;{\text{an}}\;{\text{outlier}} \hfill \\ \le \tau_{l} ,\;{\text{if}}\;y_{li} \;{\text{is}}\;{\text{an}}\;{\text{outlier}} \hfill \\ \end{gathered} \right. $$
(7)

where the threshold value τl is the pth quantile value of the empirical distribution of \(W_{\beta } (y_{li} |\hat{\theta }_{l,\beta } )\).

(iv) Then replace the outlying phenotypic observations of lth group by its robust mean \(\mu_{l,\beta }\)(l = 1, 2,…,k), where m is the number of genotype in the selected SNP.

(v) After that apply an efficient mixed-model association (EMMA) to the modified dataset discussed in the previous step.

Simulated data generation

To investigate the performance of the proposed algorithm in a comparison of the conventional algorithms, we generated both the genotypic and phenotypic data as follows:

Genotype simulation

To explore how the proposed method performs, a set of synthetic genotype and phenotype data were generated. A synthetic genotype dataset was simulated that reflects population structure. For this purpose, m* = 2000 SNPs were generated for n = 1000 individuals, and these individuals were taken from k = 3 distinct population by considering different minor allele frequencies (MAFs). To do this, first, a set of latent vectors {v1, v2, …..vm*} was generated from a multivariate normal distribution with mean zero and variance–covariance matrix Cov(vj,vk) = ρ|j-k|64,65. In our simulation, we considered ρ = 0.5 to avoid the linkage disequilibrium (LD) between the SNPs. Finally, two cutoff values s1 and s2 were used to convert the design matrix V = [v1, v2, …, vm* ] = [vij] of latent vectors to the genotypic score matrix \( x_{ij} (i = 1,2,...,n,j = 1,2,...,m_{1} )\,{\text{and }}\,z_{ij} (i = 1,2,...,n;j = 1,2,...,m_{2} )\) as follows:

$$ x_{ij} ,z_{ij} = \left\{ \begin{gathered} 0,\;\;\;\;\;v_{ij} < s_{1} \hfill \\ \,\,1,\;\;\;\;\;s_{1} \le v_{ij} \le s_{2} \hfill \\ \,\,2,\;\;\;\;\;v_{ij} > s_{2} \hfill \\ \end{gathered} \right. $$

where s1 and s2 determine the minor allele frequency.

Phenotype simulation

Phenotypic datasets were produced by considering several factors including genetic variation, error variation, and population stratification. To generate phenotype data, two distinct situations were considered in terms of two heritability rates 0.2 and 0.3. In every situation, m1 = 4 SNPs were considered as causal variants and the remaining m2 = m*-m1 = 1996 SNPs were allocated as polygenic variants (effects). The quantitative trait/phenotypic values were simulated using Eq. (1) which can be re-written as

$$ \mathop y\nolimits_{j} = \mu + \sum\limits_{k = 1}^{{m_{1} }} {a_{k} x_{kj} } + \sum\limits_{{k = m_{1} + 1}}^{{m_{2} }} {b_{k} z_{kj} } + \mathop \varepsilon \nolimits_{j} \,\,\, $$
(8)

The total phenotypic variation was decomposed by assuming all three sources of variations in Eq. (8) are independent each other as

$$ {\text{var}} (y) = {\text{var}} \left( {\sum\limits_{k = 1}^{{m_{1} }} {a_{k} x_{k} } } \right) + {\text{var}} \left( {\sum\limits_{{k = m_{1} + 1}}^{{m_{2} }} {b_{k} z_{k} } } \right) + {\text{var}} (\varepsilon)$$
(9)

Then the contribution of main genetic effect in the total phenotypic variation (known as heritability) was defined by the ratio of main genetic variance over the phenotypic variance and it is written as

$$ {h}^{2} = \frac{{{\text{var}} \left( {\sum\limits_{k = 1}^{{m_{1} }} {a_{k} x_{k} } } \right)}}{{{\text{var}} (y)}}\,\,\,\,\,\, $$
(10)

The genomic outcomes of the SNPs were simulated from a normal distribution such that it satisfies a certain proportion of genetic variance for different genetic effects as given in Table 3.

Table 3 Distribution of different genetic variation in the phenotype.

To check the performance of the proposed method in a comparison of LRM and LMM approaches in presence of different rate of outliers, we contaminated 1%, 2%, 3%, 4%, and 5% of phenotypic data randomly by using theoutlying observations to generate five different contaminated datasets. We replicated these five contaminated datasets including original clean dataset 1000 times. Tthe outlying phenotypic observations was generated \(\left( {\mathop y\nolimits_{j}^{*} } \right)\) satisfying \( {\text{2}} \times {\mathbf{max}}({\mathbf{y}}){\mkern 1mu} < {\text{y}}_{j}^{*} < {\text{5}} \times \max ({\mathbf{y}}) \).

Consequence of phenotypic outliers on the partition of total variations

Let y* be the modified phenotypic response variable which is generated by replacing 5% observation of y by the outlying observations. Then var(y*) > var(y) obviously, which implies

$$ {\text{var}} \left( {\mathop y\nolimits^{*} } \right) = {\text{var}} \left( y \right) + c,\,\,\, $$
(11)

where c > 0. Now we can write the following equation similar to the Eq. (8) as

$$ \mathop y\nolimits_{j}^{*} = \mu + \sum\limits_{k = 1}^{{m_{1} }} {a_{k} x_{kj} } + \sum\limits_{k = 1}^{{m_{2} }} {b_{k} z_{kj} } + \mathop \varepsilon \nolimits_{j}^{*} \,\,\,\, $$
(12)

by assuming the same genetic effects but changed to the error variations. Then

$$ {\text{var}} \left( {\mathop y\nolimits^{*} } \right) = {\text{var}} \left( {\sum\limits_{k = 1}^{{m_{1} }} {a_{k} x_{k} } } \right) + {\text{var}} \left( {\sum\limits_{{k = m_{1} + 1}}^{{m_{2} }} {b_{k} z_{k} } } \right) + {\text{var}} \left( {\mathop \varepsilon \nolimits^{*} } \right)\,\,\,\, $$
(13)

Combining Eq. (9), (11) and (13), we get

$$ {\text{var}} \left( {\mathop \varepsilon \nolimits^{*} } \right) = {\text{var}} (\varepsilon ) + c > {\text{var}} (\varepsilon ) $$

Thus, error variance in presence of outliers must be larger than the error variance in absence of outliers. Conversely, the heritability (h2*) in presence of outliers must be smaller than the heritability (h2) in absence of outliers as follows.

$$ h^2 = \frac{{{\text{var}} \left( {\sum\limits_{k = 1}^{{m_{1} }} {a_{k} x_{k} } } \right)}}{{{\text{var}} (\mathop y\nolimits_{{}} )}} > \frac{{{\text{var}} \left( {\sum\limits_{k = 1}^{{m_{1} }} {a_{k} x_{k} } } \right)}}{{{\text{var}} (\mathop y\nolimits_{{}}^{*} )}} = \mathop h\nolimits^{2*} $$

Therefore, both LMM and LRM approach losses the SNPs identification power in presence of phenotypic outliers. However, in the case of the proposed method, the error variance in presence of outliers must be almost same to the error variance in absence of outliers. Similarly, the heritability \( \, \mathop h\nolimits_{\beta }^{*} \, \) in presence of outliers must be almost same to the heritability (h2) in absence of outliers, since

$$ {\text{var}} \left( {\mathop y\nolimits_{\beta }^{*} } \right) \approx {\text{var}} \left( {\mathop y\nolimits_{{}}^{{}} } \right), $$

where \({\text{var}} \left( {\mathop y\nolimits_{\beta }^{*} } \right)\) is the variance of total phenotypic variations after the preprocessing of phenotypic observations by the proposed method. Therefore, the power of the proposed method must be larger than the power of both LMM and LRM methods in presence of phenotypic outliers.

Performance measures

To investigate the performance, the statistical power and FDR of 3 methods were calculated by using the formula, Power = (PT/PC) × 100 and FDR = [PF/(PT + PF)] × 100, respectively, where PT measures the truly detected SNPs and PC measures total causal variants and PF is the number of falsely detected SNPs. For each situation, 1000 replications were performed to account for the average value of the power and FDR for comparison.

Real genotype and phenotype data on rice flowering time

We applied the proposed rGWAS method to explore the potential SNPs influencing rice flowering time. The genotypic and phenotypic data used to carry out the analysis for investigation were collected from the rice diversity database (www.ricediversity.org). The data set contain 413 accessions along with 36,901 SNPs of Oryza Sativa1. All selected SNPs were taken into consideration in the analysis with call rate > 70% and minor allele frequency (MAF) > 0.051. Missing genotypes were imputed with weighted k-Nearest-Neighbors method66 based on the five weighted nearest varieties. To compute the kinship matrix (using Eq. (2)), LD-pruned set of variants was used with an R2 greater than 0.9 in a 200 variant sliding window of size 1000. The individual with missing observation in the phenotypic dataset was not considered in this study. Experimental data on flowering time were obtained as the number of days until the inflorescence was 50% emerged from the flag leaf calculated from the day of planting. The phenotypic data used in this analysis for the flowering time were recorded at Faridpur district in Bangladesh.

SNP annotation and candidate gene identification

11 SNPs were identified through whole genome association analysis by using rGWAS. Rice Genome Annotation Project (RGAP) Release 7 (http://rice.plantbiology.msu.edu/) database was used to annotate the identified SNPs. Among the identified genes, six were annotated with protein-coding genes and the rest five were non-coding (Supplementary Table S1). For the non-coding SNPs, the nearest genes were considered as candidate gene and used for further functional characterization. We also aligned the significant SNP tags against the genes those involved in rice flowering time and seed development pathway36. Region within 100 kb were searched for the pathway genes (Supplementary Table S2). The QQ-plot were generated using qqmath function in the R package lattice67.

Gene-set enrichment analysis

To characterized the candidate genes/SNPs that may have a significant association with the phenotypic variations, we performed gene ontology (GO) enrichment analysis of the candidate genes in terms of biological process (BP), molecular function (MF), and cellular component (CC) were performed using the on-line tool QuickGO (https://www.ebi.ac.uk/QuickGO). A gene set was considered as significantly enriched for GO terms if \(p<0.05\). The heatmap of the GO corresponding to the candidate genes was plotted using R package ComplexHeatmap68.

Prediction of the subcellular location

An online-based tool called Plant Subcellular Localization Integrative (PSI) predictor69 was used for predicting the subcellular locations of the candidate genes in the plant cell. The predicted subcellular location corresponding to the candidate genes were visualize by using the R package corrplot70.

Tissue-specific expression

We explored the expression profile of the candidate genes in different tissues. Tissue-specific expression of the genes were obtained from the database Rice Genome Annotation Project (RGAP)55. The heatmap of the expression level of the genes was created via R package ComplexHeatmap68.

Data availability

To implement the proposed method, the necessary codes in R can be downloaded from the repository:http://www.ru.ac.bd/biorgru/software/r-code-robustgwas-zip/