Statistical Approach for Improving Genomic Prediction Accuracy through Efficient Diagnostic Measure of Influential Observation

Abstract

The predictive performance of genomic prediction methods can be adversely affected by outliers. In agricultural science, an outlier may arise from wrong data imputation, an outlying response, or a series of trials conducted over time or across locations. Although several statistical procedures for identifying outliers are available in the literature, identifying true outliers remains a challenge, especially in high-dimensional genomic data. Here we propose an efficient approach for detecting outliers in high-dimensional genomic data. The approach uses p-value combination methods to produce a single p-value for detecting outliers. Its robustness has been tested on simulated data using evaluation measures such as precision and recall. On real data, detecting outliers with the proposed approach and handling them accordingly yielded a significant improvement in genomic prediction performance.

Introduction

Genomic selection (GS) has become a popular choice for selecting candidates for breeding in plant and animal science, and various studies have been carried out in the recent past. GS is an advanced breeding method in which genome-wide dense marker information is used to predict the genetic merit of individuals in a breeding programme. Today, GS is a promising tool for improving the genetic gain of the individuals under study. Genomic selection was first introduced by Meuwissen et al.1. In this approach, the effect of each marker is estimated, and the sum of all marker effects is used to calculate the genotypic value, i.e. the genomic estimated breeding value (GEBV), of each individual.

The GS process starts with building a statistical model from individuals that have both genotypic and phenotypic data (the training set). This model is then used to estimate GEBVs for individuals that have only genotypic information. Individuals are ranked on the basis of their GEBVs, and superior individuals are selected. Genomic selection methods have been applied successfully in plants2,3 and animals4,5,6,7. However, the success of genomic selection depends on the quality of the data to which the statistical models are applied. In practice, genomic data seldom meet the ideal conditions and often suffer from problems such as influential observations, missing data points and noise.

Influential observations can have devastating effects on genomic estimated breeding values8. They can result from wrong data imputation, an outlying response, or a series of trials conducted over time or across locations. Detection of influential observations has been an extensive research area within the linear regression framework9,10,11,12. Some of the most widely used measures are Cook’s D, DFBETA, DFFITS, Atkinson’s Ci and COVRATIOi. Among them, Cook’s D is one of the most commonly used measures for outlier detection in linear regression10. Various statistical models with t-distributed errors have been proposed as robust methods for handling outliers (Bayesian t-linear model13, Gaussian process with t-likelihood14, regression with t-error15). Lange et al.15 applied regression with t-error to various datasets and concluded that it can handle outliers and address robustness concerns practically and routinely in a wide range of settings. However, discriminating true outliers from non-outliers is still a challenge, especially in high-dimensional genomic data. The key complications are distinguishing mild outliers from regular observations and the masking of true outliers16. In high-dimensional genomic data, the number of markers (p) is greater than the number of individuals (n), which is termed the large p, small n problem (p > n). This is very common in genomics and molecular biology research nowadays. In such cases, a penalized regression approach such as the Least Absolute Shrinkage and Selection Operator (LASSO) can be a preferable choice, as it handles the n ≪ p problem by shrinking the estimates of some less significant markers and dropping others from the model. The increasing use of LASSO has been motivated by the abundance of high-dimensional biological data. However, influential observations become especially critical in high-dimensional genomic data, as each observation can have a tremendous effect on model selection and interpretation. It is therefore imperative to examine the effect of influential observations before implementing LASSO regression. Hence, a new measure for detecting influential observations in high-dimensional genomic data is needed to improve GEBVs.

Rajaratnam et al.17 recently developed an approach for outlier detection in high-dimensional data based on LASSO regression. They proposed four measures, df-model, df-lambda, df-regpath and df-cvpath, which detect influential observations that affect different aspects of the LASSO regression directly or indirectly. However, the results from these measures are not consistent, i.e. the measures detect different influential points. To produce more concrete and consistent results, a meta-analysis-based approach can be applied, in which an improved outlier detection measure is developed by integrating these measures through their p-values18,19,20.

In this study, an improved measure for detecting influential observations has been developed using the above-mentioned approach. Its performance has been evaluated empirically, and the outliers it detects were observed to be more accurate. The method has been applied to genomic selection data (real and simulated), and the results show a remarkable improvement in the prediction accuracy of GEBVs.

Material and Methods

LASSO was first introduced by Tibshirani21. It minimizes the residual sum of squares subject to a constraint on the sum of the absolute values of the regression coefficients. It differs from ordinary regression in that it adds a penalty to the usual least-squares estimator. As a result, it diminishes the effects of less important βs (i.e. marker effects) and sets the least important βs to zero.

The LASSO estimate is defined as:

$${\hat{\beta }}^{lasso}=\begin{array}{c}argmin\\ \beta \end{array}\mathop{\sum }\limits_{i=1}^{n}{({y}_{i}-\mathop{\sum }\limits_{j=1}^{p}{x}_{ij}{\beta }_{j})}^{2}$$
(1)

subject to \({\sum }_{j=1}^{p}|{\beta }_{j}|\le t\), where \(i=1,\ldots ,n\) indexes individuals, \(j=1,\ldots ,p\) indexes markers, \({y}_{i}\) is the phenotypic value of individual \(i\), \({x}_{ij}\) is the element of the incidence matrix corresponding to individual \(i\) and marker \(j\), and \({\beta }_{j}\) is the effect of marker \(j\). The response variable is assumed to have zero mean. The constraint \({\sum }_{j=1}^{p}|{\beta }_{j}|\le t\) shrinks the marker effects and sets some of them to zero.

We can also write the LASSO problem in the equivalent Lagrangian form:

$${\hat{\beta }}^{lasso}=\begin{array}{c}argmin\\ \beta \end{array}\left\{\frac{1}{2}\mathop{\sum }\limits_{i=1}^{n}{({y}_{i}-\mathop{\sum }\limits_{j=1}^{p}{x}_{ij}{\beta }_{j})}^{2}+\lambda \mathop{\sum }\limits_{j=1}^{p}|{\beta }_{j}|\right\}$$
(2)

Here \({\sum }_{j=1}^{p}|{\beta }_{j}|\) is the \({l}_{1}\)-norm penalty on \(\beta \), which results in a sparse solution, and λ is a regularization parameter. Computing the LASSO solution is a quadratic programming problem, which can be solved with efficient algorithms such as Least Angle Regression (LARS)22. Another important question is the choice of the upper limit \(t\) on the sum of the absolute values of the regression coefficients; cross-validation can be used for this23.
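
As a rough illustration of how the Lagrangian form in Eq. 2 can be minimised, the following Python sketch applies cyclic coordinate descent with soft-thresholding. It is not the LARS algorithm cited above and not the implementation used in this study; the function names, the fixed number of sweeps and the toy data are assumptions made purely for illustration.

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator: sign(z) * max(|z| - gamma, 0)."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise 0.5 * sum_i (y_i - sum_j x_ij * b_j)^2 + lam * sum_j |b_j|.

    X is a list of rows (individuals x markers); y is the centred phenotype.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_ss[j] == 0.0:
                continue  # a constant-zero marker carries no information
            # rho = x_j' * (partial residual that leaves marker j out)
            rho = 0.0
            for i in range(n):
                fit_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fit_without_j)
            beta[j] = soft_threshold(rho, lam) / col_ss[j]
    return beta


# Toy example (hypothetical data): the second marker has a weak effect
# and is set exactly to zero once lam exceeds |rho| for that marker.
X_toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y_toy = [2.0, 0.1, 2.0, 0.1]
beta_toy = lasso_coordinate_descent(X_toy, y_toy, lam=1.0)
```

With `lam=0` the same routine returns the ordinary least-squares solution for this toy design, which makes the shrinkage caused by the penalty easy to see.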

Here we have used a recently proposed approach for detecting influential observations based on LASSO17. It comprises four measures: df-model, which measures the change in the selected model; df-lambda, which measures the change in λ, the regularization parameter of the LASSO; df-regpath, which measures changes in the LASSO regularization path; and df-cvpath, which measures changes in the LASSO cross-validation path. These measures detect outliers in high-dimensional genomic data on the basis of LASSO regression. Although all four measures detect influential observations that affect the model directly or indirectly, they differ in which observations they flag, i.e. there is a lack of concordance among them. To overcome this limitation, we propose a more robust measure for detecting influential observations that integrates the four measures through a p-value-based meta-analysis approach.

Approach of proposed measure

To develop a robust statistic for detecting influential observations, we have used a p-value-based meta-analysis approach. In this approach, the four measures described above are combined on the basis of their p-values. We used several methods for combining these p-values and explored the performance of each. A brief description follows. Suppose there are K independent tests with corresponding p-values p1, p2,…, pK. Under H0, the p-values from the different methods (for an individual observation) are assumed to be uniformly distributed between 0 and 1 (i.e. pk ~ U [0, 1]). To obtain the overall statistical significance for the hypothesis under test (null hypothesis H0 vs. alternative hypothesis H1), the individual p-values for each observation/genotype from the different methods (i.e. df-model, df-lambda, df-regpath and df-cvpath) are combined. The methods used for this purpose are summarized in Table 1.

Table 1 List of methods used in study for combining p-value to calculate overall significance.

Using this approach (Table 1), the final statistical significance value, i.e. the combined p-value, is calculated for each observation/genotype, and influential observations are identified using a suitable p-value cut-off. Source code for the proposed approach is available from the GitHub repository at https://github.com/BudhlakotiN/OGS.
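
The combination step can be sketched in Python for two classical rules, Fisher's inverse chi-square method and Stouffer's sum-of-z method. This is not the code from the repository above. It assumes that the "Inverse Chi" and "sumz" methods named in the Results correspond to these two rules, that every p-value lies strictly between 0 and 1, and that a cut-off of 0.05 is used; all function names are hypothetical.

```python
import math
from statistics import NormalDist


def fisher_combined_p(pvalues):
    """Fisher's rule: -2 * sum(ln p_k) follows chi-square with 2K df under H0."""
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # test statistic divided by 2
    # For even df = 2K the chi-square survival function has a closed form:
    # exp(-x/2) * sum_{i=0}^{K-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


def stouffer_combined_p(pvalues):
    """Stouffer's rule: sum the normal quantiles of (1 - p_k) and scale by sqrt(K)."""
    nd = NormalDist()
    z = sum(nd.inv_cdf(1.0 - p) for p in pvalues) / math.sqrt(len(pvalues))
    return 1.0 - nd.cdf(z)


def flag_influential(pvalue_rows, combine=fisher_combined_p, cutoff=0.05):
    """Return indices of observations whose combined p-value is below the cut-off.

    Each row holds the p-values of one observation for the four measures
    (df-model, df-lambda, df-regpath, df-cvpath).
    """
    return [i for i, row in enumerate(pvalue_rows) if combine(row) < cutoff]


# Hypothetical p-values for two observations.
rows = [[0.01, 0.02, 0.03, 0.04], [0.60, 0.70, 0.50, 0.90]]
flagged = flag_influential(rows)
```

Both rules rely on the independence assumption stated above; when the individual measures are correlated, the combined p-values should be read with corresponding caution.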

Experimental dataset

To check the robustness of our approach, it has been validated using real data. A total of six datasets are used in the current study. Each dataset is described below.

Dataset 1: Wheat

Wheat lines were genotyped using 1447 Diversity Array Technology (DArT) markers generated by Triticarte Pty. Ltd. (Canberra, Australia; http://www.triticarte.com.au). These markers take two values, indicating presence (1) or absence (0). The dataset includes 599 lines evaluated for the trait grain yield (GY) in four mega-environments. For convenience, we have considered GY in the first mega-environment only. After editing, 1279 DArT markers remained, and these have been used in this study. The same dataset has also been used in earlier genomic prediction studies24,25.

Dataset 2: Maize

The maize dataset was generated by CIMMYT’s Global Maize Program24. It originally included 300 maize lines with 1148 SNP markers. For each marker, the allele with the highest frequency is coded as 0 and the allele with the lowest frequency as 1. The trait under study is again GY, evaluated under drought and well-watered conditions. The average minor allele frequency in these datasets was 0.20. After editing, 264 maize lines with 1135 SNP markers were available for the final study24.

Dataset 3–6: Wheat

This wheat dataset was generated by the CIMMYT semi-arid wheat breeding program and comprises 254 advanced wheat breeding lines genotyped for 1726 DArT markers26. It was recorded for four phenotypic traits: days to heading (DTH), thousand kernel weight (TKW), yield under irrigated conditions (YI) and yield under drought conditions (YD). For convenience, trait DTH is treated as Dataset 3, TKW as Dataset 4, YI as Dataset 5 and YD as Dataset 6.

Simulation

For illustration, simulated data were generated using QTL Bayesian interval mapping (“qtlbim”)27, an R-based package (R Development Core Team 2019). R is available at http://www.r-project.org, and the qtlbim package can be loaded from the R library. This package has been used in various studies to simulate data for genomic selection28,29,30. The qtlbim package uses Cockerham’s model as the underlying genetic model. We simulated three datasets containing genotypic and phenotypic information, covering a range of genetic architectures: low heritability (0.1), medium heritability (0.5) and high heritability (0.7). At each heritability level we simulated 1000 SNPs for 200 individuals. The simulated genome has 10 chromosomes of specified length with 100 SNPs each, and the markers are equally spaced along each chromosome. The phenotype was simulated as normally distributed, with no missing genotypic or phenotypic information. To check the sensitivity of all methods in detecting true outliers, we replaced 5% of the observations with outlying values (i.e. beyond mean ± 3*SD). An overview of the whole workflow of the current study is presented in Fig. 1.
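
The contamination step can be sketched as follows. The text only states that 5% of the observations were replaced by values beyond mean ± 3*SD; the random choice of positions, the random sign and the fixed offset of 3.5 SD used here are assumptions, and `inject_outliers` is a hypothetical helper rather than part of qtlbim.

```python
import random
from statistics import mean, stdev


def inject_outliers(y, fraction=0.05, shift_sd=3.5, seed=1):
    """Return a contaminated copy of y and the sorted indices that were replaced.

    Each selected phenotype is replaced by mean +/- shift_sd * SD, which lies
    beyond mean +/- 3 * SD of the clean data whenever shift_sd > 3.
    """
    rng = random.Random(seed)
    m, s = mean(y), stdev(y)
    n_out = max(1, round(fraction * len(y)))
    idx = sorted(rng.sample(range(len(y)), n_out))
    y_new = list(y)
    for i in idx:
        sign = rng.choice((-1.0, 1.0))  # shift upwards or downwards at random
        y_new[i] = m + sign * shift_sd * s
    return y_new, idx


# Hypothetical phenotype vector for 200 individuals.
y_clean = [float(i % 7) for i in range(200)]
y_contaminated, true_outlier_idx = inject_outliers(y_clean)
```

Keeping the list of replaced indices makes it possible to score each detection method later against the known outliers.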

Figure 1

Operational workflow of the whole procedure used in the study.

Evaluation measure

Prediction accuracy and prediction error were used as evaluation measures. Prediction accuracy is defined as the Pearson correlation coefficient (r) between the observed and predicted phenotypic values.

If \(\hat{Y}=X\hat{\beta }\), where \(\hat{Y}\) is the estimated response and \(\hat{\beta }\) is the estimate of \(\beta \), then the correlation coefficient (r) can be expressed as:

$$r=\frac{{S}_{Y,\hat{Y}}}{{S}_{Y}{S}_{\hat{Y}}}$$
(3)

where \({S}_{Y,\hat{Y}}\) denotes the covariance between the observed and predicted phenotypic values, \({S}_{Y}\) is the standard deviation of the observed phenotype and \({S}_{\hat{Y}}\) is the standard deviation of the predicted phenotype. Prediction error (PE) is defined as the mean squared error (MSE) between the observed and predicted phenotypic values, and can be expressed as follows (Eq. 4).

$$PE/MSE=\frac{1}{{n}_{test}}\mathop{\sum }\limits_{i=1}^{{n}_{test}}{({Y}_{i}-{\hat{Y}}_{i})}^{2}$$
(4)

where \({Y}_{i}\) is the observed response and \({\hat{Y}}_{i}\) is the predicted phenotypic value. Here \(n\) is the total number of individuals, i.e. \(n={n}_{train}+{n}_{test}\), where \({n}_{train}\) is the number of individuals in the training set and \({n}_{test}\) is the number of individuals in the test set.
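
The two evaluation measures in Eqs. 3 and 4 can be computed on a test set as in the short Python sketch below. The function names and the example values are hypothetical; the common factor 1/(n − 1) in the covariance and the standard deviations cancels in the ratio, so plain sums of squares are used.

```python
import math


def prediction_accuracy(y_obs, y_pred):
    """Pearson correlation between observed and predicted phenotypes (Eq. 3)."""
    n = len(y_obs)
    mean_obs = sum(y_obs) / n
    mean_pred = sum(y_pred) / n
    cov = sum((a - mean_obs) * (b - mean_pred) for a, b in zip(y_obs, y_pred))
    ss_obs = sum((a - mean_obs) ** 2 for a in y_obs)
    ss_pred = sum((b - mean_pred) ** 2 for b in y_pred)
    return cov / math.sqrt(ss_obs * ss_pred)


def prediction_error(y_obs, y_pred):
    """Mean squared error over the test individuals (Eq. 4)."""
    return sum((a - b) ** 2 for a, b in zip(y_obs, y_pred)) / len(y_obs)


# Hypothetical test-set values: perfectly correlated but badly scaled predictions.
y_obs_example = [1.0, 2.0, 3.0, 4.0]
y_pred_example = [2.0, 4.0, 6.0, 8.0]
r_example = prediction_accuracy(y_obs_example, y_pred_example)
mse_example = prediction_error(y_obs_example, y_pred_example)
```

The example shows why both measures are reported: the correlation is insensitive to the scale of the predictions, whereas the MSE penalises it.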

To assess how well the methods identify true outliers (observations with added noise) and non-outliers (observations without added noise), we have used precision (the proportion of true positives (TP) among all predicted positives, i.e. the sum of TP and false positives (FP); Eq. 5), recall (the proportion of TP among the sum of TP and false negatives (FN); Eq. 6) and the F1 score (the harmonic mean of precision and recall; Eq. 7). These are computed using the following expressions:

$$\text{Precision}\,=\,\frac{\text{TP}}{\text{TP}+\text{FP}}$$
(5)
$$\text{Recall}\,=\,\frac{\text{TP}}{\text{TP}+\text{FN}}$$
(6)
$$\text{F}1=\frac{2\times \text{Precision}\times \text{Recall}}{\text{Precision}+\text{Recall}}$$
(7)
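
A minimal Python sketch of these detection scores is given below, with the F1 score computed as the harmonic mean of precision and recall, as described in the text. The function name and the example index sets are hypothetical.

```python
def detection_scores(true_outliers, flagged):
    """Precision, recall and F1 for a set of flagged observations.

    true_outliers: indices of observations that really are outliers.
    flagged: indices that a detection method declared to be outliers.
    """
    true_set, flag_set = set(true_outliers), set(flagged)
    tp = len(true_set & flag_set)   # outliers correctly flagged
    fp = len(flag_set - true_set)   # regular observations wrongly flagged
    fn = len(true_set - flag_set)   # outliers that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical example: four true outliers, three flagged, two of them correct.
precision_ex, recall_ex, f1_ex = detection_scores([1, 2, 3, 4], [3, 4, 5])
```

Returning 0 when a denominator is empty is a convention chosen for this sketch; other conventions (for example, leaving the score undefined) are equally defensible.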

To assess the overall performance of the different methods, the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) has been used. It is a multi-criteria decision-making method proposed by Hwang and Yoon31. It is based on the idea that the selected alternative should have the shortest geometric distance from the positive ideal solution (PIS) and the longest geometric distance from the negative ideal solution (NIS)32. TOPSIS compares a set of alternatives by normalizing each criterion, assigning a weight to each criterion, and calculating the geometric distance between each alternative and the ideal alternatives. It assumes that the criteria are monotonically increasing or decreasing. Here the final ranks have been calculated using the R package ‘topsis’, which is based on the TOPSIS method.
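
The TOPSIS ranking described above can be sketched in Python as follows. This is not the R package ‘topsis’; it is a minimal version that assumes vector normalisation of each criterion and Euclidean distances, and the function name, weights and example scores are hypothetical.

```python
import math


def topsis_rank(matrix, weights, benefit):
    """Rank alternatives (rows) on several criteria (columns).

    benefit[j] is True when larger values of criterion j are better
    (e.g. prediction accuracy) and False when smaller is better (e.g. MSE).
    Returns the closeness scores and the ranks (1 = best).
    """
    n_alt, n_crit = len(matrix), len(weights)
    # Normalise each criterion by its Euclidean norm, then apply the weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_crit)]
    v = [[weights[j] * row[j] / norms[j] for j in range(n_crit)] for row in matrix]
    # Positive and negative ideal solutions, criterion by criterion.
    col = lambda j: [r[j] for r in v]
    pis = [max(col(j)) if benefit[j] else min(col(j)) for j in range(n_crit)]
    nis = [min(col(j)) if benefit[j] else max(col(j)) for j in range(n_crit)]
    # Relative closeness to the positive ideal solution.
    scores = []
    for r in v:
        d_pos, d_neg = math.dist(r, pis), math.dist(r, nis)
        scores.append(d_neg / (d_pos + d_neg))
    order = sorted(range(n_alt), key=lambda i: -scores[i])
    ranks = [0] * n_alt
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return scores, ranks


# Hypothetical methods scored on accuracy (benefit) and MSE (cost).
scores_ex, ranks_ex = topsis_rank(
    [[0.8, 0.2], [0.6, 0.4], [0.5, 0.5]], weights=[0.5, 0.5], benefit=[True, False]
)
```

The sketch assumes that at least two alternatives differ and that no criterion is identically zero; otherwise the divisions above are undefined.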

Results and Discussion

To assess how well the proposed method distinguishes true outliers from non-outliers, precision, recall and F1 score have been calculated for the simulated datasets (generated at heritability h2 = 0.1, 0.5 and 0.7) and are presented in Table 2. The results suggest that the proposed approach, based on combining p-values, outperformed the other methods in almost every scenario.

Table 2 Performance of different methods (in terms of Precision, Recall and F1 score) for different simulated datasets.

Computational efficiency

The time required to compute the proposed measure was recorded on an Intel(R) Core(TM) i7-5500U CPU @ 2.40 GHz processor for datasets of varying dimension, i.e. all possible combinations of the number of individuals (500, 750 and 1000) and the number of markers (2000, 5000 and 10000). The results are presented in Table 3.

Table 3 Time required for running the datasets of varying combination of dimension using our proposed approach.

To understand the effect of outliers on genomic prediction accuracy, we studied their effect on the real datasets. First, we fitted LASSO regression to the original experimental data; this fit is denoted LASSO*. Then, using the approach of Rajaratnam et al.17, we calculated p-values for all four measures (df-model, df-lambda, df-regpath and df-cvpath) and combined them into a single p-value for each observation/genotype. These combined p-values were used to identify outliers in the response. The outliers and their corresponding marker genotypes were dropped, and LASSO was refitted to the modified data. To check the robustness of the proposed approach, we also fitted some of the most commonly used genomic selection methods, i.e. ridge regression, Best Linear Unbiased Prediction (BLUP), Genomic BLUP (GBLUP) and Bayesian methods. BLUP, introduced by Henderson33, is used in linear mixed models for the prediction of random effects. GBLUP is an improved version of BLUP in which the additive genomic relationship matrix (G) is used as the variance-covariance matrix of the random effect in the model34. Cross-validation was used to evaluate the performance of the methods under study. The data were divided into a training set comprising 70% of the data and a testing set comprising the remaining 30%. The former was used for model building and the latter for model evaluation. Performance was evaluated by calculating prediction accuracy and prediction error. The whole procedure was repeated 100 times, prediction accuracy and prediction error were averaged, and their standard errors were calculated. The results are discussed below. Tables 4–9 report the average prediction accuracy and prediction error (i.e. MSE), with their sampling variability (standard error, SE), for the methods under study on datasets 1–6. To calculate the gain in prediction accuracy, all fitted models were compared with the baseline model, i.e. LASSO*, and the percentage change in prediction accuracy was calculated. The percentage reduction in MSE was calculated in the same way.
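
The repeated 70/30 evaluation and the percentage change relative to a baseline can be sketched in Python as below. The predictor `ols_through_origin` is only a stand-in for the genomic prediction models used in the study, the standard error is taken here as the standard deviation over replicates divided by the square root of the number of replicates (an assumption), and all names and data are hypothetical.

```python
import math
import random
from statistics import mean, stdev


def pearson_r(a, b):
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))


def repeated_holdout(X, y, fit_predict, n_rep=100, train_frac=0.7, seed=1):
    """Repeat a random train/test split and summarise accuracy and MSE."""
    rng = random.Random(seed)
    n = len(y)
    n_train = int(round(train_frac * n))
    accs, mses = [], []
    for _ in range(n_rep):
        idx = list(range(n))
        rng.shuffle(idx)
        train, test = idx[:n_train], idx[n_train:]
        y_hat = fit_predict([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in test])
        y_test = [y[i] for i in test]
        accs.append(pearson_r(y_test, y_hat))
        mses.append(mean((a - b) ** 2 for a, b in zip(y_test, y_hat)))
    se = lambda v: stdev(v) / math.sqrt(len(v))
    return {"acc_mean": mean(accs), "acc_se": se(accs),
            "mse_mean": mean(mses), "mse_se": se(mses)}


def percent_change(value, baseline):
    """Percentage change of a measure relative to the baseline model."""
    return 100.0 * (value - baseline) / baseline


def ols_through_origin(X_train, y_train, X_test):
    """Stand-in predictor: single-marker least squares without intercept."""
    b = sum(x[0] * t for x, t in zip(X_train, y_train)) / sum(x[0] ** 2 for x in X_train)
    return [b * x[0] for x in X_test]


# Hypothetical noise-free data, so accuracy is 1 and MSE is 0 in every replicate.
X_demo = [[float(i)] for i in range(1, 21)]
y_demo = [2.0 * i for i in range(1, 21)]
summary = repeated_holdout(X_demo, y_demo, ols_through_origin)
```

For a reduction in MSE, `percent_change` returns a negative value whose magnitude is the percentage reduction.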

Table 4 Mean and standard error of prediction accuracy and prediction error for various methods using dataset 1.
Table 5 Mean and standard error of prediction accuracy and prediction error for various methods using dataset 2.
Table 6 Mean and standard error of prediction accuracy and prediction error for various methods using dataset 3.
Table 7 Mean and standard error of prediction accuracy and prediction error for various methods using dataset 4.
Table 8 Mean and standard error of prediction accuracy and prediction error for various methods using dataset 5.
Table 9 Mean and standard error of prediction accuracy and prediction error for various methods using dataset 6.

All analyses were carried out using R (R Development Core Team 2019). The LASSO model was fitted using the R package glmnet35; BLUP and GBLUP were fitted using the rrBLUP package36 with the mixed.solve and kin.blup functions, respectively. Ridge regression was fitted using R code by Gustavo de los Campos, which requires the heritability of the underlying trait. The heritability of each trait under study is provided in the supplementary material (Table S1). Regression with t-error was fitted using the tlm function of the R package “hett”37. The degrees of freedom were first estimated for each dataset using the tlm function with the option estDof = TRUE, and the t-regression was then fitted.

In Table 4 and the others (Tables 5–9), LASSO* represents LASSO regression fitted to the original data (i.e. without any treatment of possible outliers). The next four methods in each table represent the performance of LASSO in the absence of outliers detected using the LASSO diagnostics (i.e. possible outliers and their corresponding marker genotypes were dropped), whereas the following four methods represent the performance of LASSO in the absence of outliers detected by our p-value-based meta-analysis approaches (i.e. possible outliers and their corresponding marker genotypes were dropped from the original data). The last four methods show the performance of other methods under our proposed approach.

A significant gain in prediction accuracy was observed for the datasets under study (Tables 4–9) compared with their counterparts (a 41% increase for dataset 1, 69% for dataset 2, 31% for dataset 3, 57% for dataset 4, 36% for dataset 5 and 27% for dataset 6). For prediction error, the results (Tables 4–9) show that the MSE of the proposed approach was significantly reduced (by 37% for dataset 1, 28% for dataset 2, 46% for dataset 3, 57% for dataset 4, 40% for dataset 5 and 43% for dataset 6). This shows a clear advantage of our integrated approach (i.e. the p-value-based meta-analysis method) over the existing approach. To verify that the gain in prediction performance is not restricted to LASSO, we also investigated the performance of the integrated approach with the most commonly used GS models (RR, GBLUP etc.). The gain in prediction performance was maintained for these methods as well (Tables 4–9).

In Fig. 2, each panel (Fig. 2a–f) contains ten box plots of prediction accuracy for datasets 1–6, respectively. In each panel, the first box plot shows the prediction accuracy obtained by fitting simple LASSO regression, the next four box plots show the prediction accuracy obtained following the approach of Rajaratnam et al.17, and the next one (Inverse Chi) shows the performance of LASSO in the absence of outliers (i.e. possible outliers and their corresponding marker genotypes, detected by the p-value-based meta-analysis approach using Inverse Chi, were dropped from the original data). The last four box plots show the performance of other GS methods under our proposed approach. The box plots show the distribution of prediction accuracy, with SE, estimated over 100 replications.

Figure 2

Box plot of prediction accuracy for different methods under study using various datasets (a) dataset 1 (b) dataset 2 (c) dataset 3 (d) dataset 4 (e) dataset 5 (f) dataset 6.

In Fig. 3, each panel (Fig. 3a–f) contains ten box plots of prediction error for datasets 1–6, arranged in the same order as in Fig. 2. These box plots represent the distribution of the MSE values over 100 runs. The plots (Figs. 2 and 3) show a clear advantage of our proposed approach over the LASSO diagnostics of Rajaratnam et al.17 and other existing approaches in improving genomic prediction accuracy. In almost every scenario, i.e. for the wheat and maize datasets (datasets 1–6), prediction accuracy improved and prediction error decreased. The clear differences in estimated accuracy and prediction error show the importance of outlier detection for estimating more accurate GEBVs, which leads to enhanced prediction accuracy. Tables 4–9 show that, among the p-value combination methods, Inverse Chi, logit and sumz performed similarly, although Inverse Chi and sumz had an advantage over the logit and meanp methods.

Figure 3

Box plot of prediction error (MSE) for different methods under study using various datasets (a) dataset 1 (b) dataset 2 (c) dataset 3 (d) dataset 4 (e) dataset 5 (f) dataset 6.

The methods used for performance evaluation were ranked using the multi-criteria decision method TOPSIS. The results are given in Tables S2 and S3 (Supplementary Information). Tables S2 and S3 show that our integrated approach (based on p-value meta-analysis) using the Inverse Chi method ranked first among the p-value-based meta-analysis methods (i.e. logit, meanp and sumz) for both dataset 1 and dataset 2; the same pattern was observed for the other datasets.

Mean shift as substitute for deletion

Instead of deleting the observations flagged as outliers, here we have modelled them with a mean shift using the mean shift outlier model (MSOM)38. In this model, one or more observations are assumed to come from a location shifted relative to the remaining observations. This method can be important for robust modelling, as observations flagged as outliers are given a separate mean shift effect instead of being dropped from the model. Earlier, we fitted the model to real and simulated data and identified outliers (p-value < 0.05) using the p-value combination approach. Here, instead of deleting the flagged observations, we have assigned them a separate mean shift effect (using MSOM).
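
For orientation, the mean shift outlier model is commonly written in the following generic regression form. The notation is an assumption introduced here for illustration and is not taken from the fitted models: \(D\) is an \(n\times m\) indicator matrix with one column per flagged observation (a single 1 in the row of that observation and 0 elsewhere), \(\gamma \) is the vector of mean shift effects and \(\varepsilon \) is the residual.

$$y=X\beta +D\gamma +\varepsilon $$

Under this form each flagged observation keeps its marker genotype in \(X\), while its departure from the rest of the data is absorbed by its own element of \(\gamma \).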

BLUP33 and GBLUP34 models were fitted to the original data and to data in which outliers were treated with MSOM. A significant improvement in accuracy over the baseline model (using the original data as such) was observed. Details are presented in Table 10.

Table 10 Effect of Mean Shift Model over baseline model on accuracy of the genomic prediction using BLUP.

Conclusion

The impact of outliers on genomic prediction accuracy has been explored. In this study, a new, efficient method for outlier detection in genomic data based on meta-analysis has been proposed. It has been shown that implementing an efficient diagnostic measure for outlier detection can improve the accuracy of GS models. Various existing methods of outlier detection in high-dimensional genomic data have been compared for their impact on the accuracy of genomic estimated breeding values. The proposed method was observed to outperform the existing methods.

Data availability

All secondary datasets used in this study are publicly available.

References

  1. 1.

    Hayes, B. & Goddard, M. Prediction of total genetic value using genome-wide dense marker maps. Genetics 157, 1819–1829 (2001).

    PubMed  PubMed Central  Google Scholar 

  2. 2.

    Jannink, J.-L., Lorenz, A. J. & Iwata, H. Genomic selection in plant breeding: from theory to practice. Briefings in functional genomics 9, 166–177 (2010).

    CAS  Article  Google Scholar 

  3. 3.

    Zhao, Y., Mette, M. F. & Reif, J. C. Genomic selection in hybrid breeding. Plant Breeding 134, 1–10 (2015).

    Article  Google Scholar 

  4. 4.

    Hayes, B. J., Bowman, P. J., Chamberlain, A. & Goddard, M. Invited review: Genomic selection in dairy cattle: Progress and challenges. Journal of dairy science 92, 433–443 (2009).

    CAS  Article  Google Scholar 

  5. 5.

    Daetwyler, H. D., Swan, A. A., van der Werf, J. H. & Hayes, B. J. Accuracy of pedigree and genomic predictions of carcass and novel meat quality traits in multi-breed sheep data assessed by cross-validation. Genetics Selection Evolution 44, 33 (2012).

    Article  Google Scholar 

  6. 6.

    Daetwyler, H., Kemper, K., Van der Werf, J. & Hayes, B. Components of the accuracy of genomic prediction in a multi-breed sheep population. Journal of animal science 90, 3375–3384 (2012).

    CAS  Article  Google Scholar 

  7. 7.

    Wang, C. et al. Accuracy of genomic prediction using an evenly spaced, low-density single nucleotide polymorphism panel in broiler chickens. Poultry science 92, 1712–1723 (2013).

    CAS  Article  Google Scholar 

  8. 8.

    Atkinson, A. & PLOTS, T. Regression: An Introduction to Graphical Methods of Diagnostic Regression Analysis. Oxford Statistical Science Series, Oxford University Press: Oxford (1985).

  9. 9.

    Belsley, D. A., Kuh, E. & Welsch, R. Identifying influential data and sources of collinearity. Regression Diagnostics (1980).

  10. 10.

    Cook, R. D. Detection of influential observation in linear regression. Technometrics 19, 15–18 (1977).

    MathSciNet  MATH  Google Scholar 

  11. 11.

    Cook, R. D. Influential observations in linear regression. Journal of the American Statistical Association 74, 169–174 (1979).

    MathSciNet  Article  Google Scholar 

  12. 12.

    Peña, D. A new statistic for influence in linear regression. Technometrics 47, 1–12 (2005).

    MathSciNet  Article  Google Scholar 

  13. 13.

    Geweke, J. Bayesian treatment of the independent Student‐t linear model. Journal of applied econometrics 8, S19–S40 (1993).

    Article  Google Scholar 

  14. 14.

    Jylänki, P., Vanhatalo, J. & Vehtari, A. Robust Gaussian process regression with a Student-t likelihood. Journal of Machine Learning Research 12, 3227–3257 (2011).

    MathSciNet  MATH  Google Scholar 

  15. 15.

    Lange, K. L., Little, R. J. & Taylor, J. M. Robust statistical modeling using the t distribution. Journal of the American Statistical Association 84, 881–896 (1989).

    MathSciNet  Google Scholar 

  16. 16.

    Lourenço, V. M. & Pires, A. M. M-regression, false discovery rates and outlier detection with application to genetic association studies. Computational Statistics & Data Analysis 78, 33–42 (2014).

    MathSciNet  Article  Google Scholar 

  17. 17.

    Rajaratnam, B., Roberts, S., Sparks, D. & Yu, H. Influence Diagnostics for High-Dimensional Lasso Regression. Journal of Computational and Graphical Statistics, 1–14 (2019).

  18. 18.

    Edgington, E. S. An additive method for combining probability values from independent experiments. The Journal of Psychology 80, 351–363 (1972).

    Article  Google Scholar 

  19. 19.

    Sutton, A. J., Abrams, K. R., Jones, D. R., Sheldon, T. A. & Song, F. Methods for meta-analysis in medical research. Vol. 348 (Wiley Chichester, 2000).

  20. 20.

    Won, S., Morris, N., Lu, Q. & Elston, R. C. Choosing an optimal method to combine P‐values. Statistics in medicine 28, 1537–1553 (2009).

  21.

    Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58, 267–288 (1996).

  22.

    Efron, B., Hastie, T., Johnstone, I. & Tibshirani, R. Least angle regression. The Annals of statistics 32, 407–499 (2004).

  23.

    Usai, M. G., Goddard, M. E. & Hayes, B. J. LASSO with cross-validation for genomic selection. Genetics research 91, 427–436 (2009).

  24.

    Crossa, J. et al. Prediction of genetic values of quantitative traits in plant breeding using pedigree and molecular markers. Genetics 186, 713–724 (2010).

  25.

    Cuevas, J. et al. Genomic prediction of genotype × environment interaction kernel regression models. The Plant Genome 9 (2016).

  26.

    Poland, J. et al. Genomic selection in wheat breeding using genotyping-by-sequencing. The Plant Genome 5, 103–113 (2012).

  27.

    Yandell, B. S. et al. R/qtlbim: QTL with Bayesian interval mapping in experimental crosses. Bioinformatics 23, 641–643 (2007).

  28.

    Yi, N. et al. An efficient Bayesian model selection approach for interacting quantitative trait loci models with many effects. Genetics 176, 1865–1877 (2007).

  29.

    Yi, N. & Banerjee, S. Hierarchical generalized linear models for multiple quantitative trait locus mapping. Genetics 181, 1101–1113 (2009).

  30.

    Piao, Z. et al. Bayesian dissection for genetic architecture of traits associated with nitrogen utilization efficiency in rice. African Journal of Biotechnology 8 (2009).

  31.

    Hwang, C.-L. & Yoon, K. In Multiple attribute decision making 58–191 (Springer, 1981).

  32.

    Assari, A. & Assari, E. Role of public participation in sustainability of historical city: usage of TOPSIS method. Indian Journal of Science and Technology 5, 2289–2294 (2012).

  33.

    Henderson, C. R. Estimation of changes in herd environment. Journal of Dairy Science 32, 706–715 (1949).

  34.

    Endelman, J. B. & Jannink, J.-L. Shrinkage estimation of the realized relationship matrix. G3: Genes, Genomes, Genetics 2, 1405–1413 (2012).

  35.

    Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. Journal of statistical software 33, 1 (2010).

  36.

    Endelman, J. B. Ridge regression and other kernels for genomic selection with R package rrBLUP. The Plant Genome 4, 250–255 (2011).

  37.

    Taylor, J. & Taylor, M. J. hett: Heteroscedastic t-Regression. R package version 0.3-2. https://CRAN.R-project.org/package=hett. (2018).

  38.

    Tanaka, E. Simple robust genomic prediction and outlier detection for a multi-environmental field trial. arXiv preprint arXiv:1807.07268 (2018).

  39.

    Fisher, R. A. Statistical Methods for Research Workers 4th edn (Oliver and Boyd, Edinburgh, 1932).

  40.

    Mudholkar, G. & George, E. The logit method for combining probabilities. In Symposium on Optimizing Methods in Statistics (ed. Rustagi, J.) 345–366 (Academic Press, New York, 1979).

  41.

    Stouffer, S., Suchman, E., Devinney, L., Star, S. & Williams, R. The American Soldier: Adjustment during Army Life Vol. 1 (Princeton University Press, Princeton, 1949).

Author information

Contributions

Conceived the idea: A.R., N.B.; designed the study: A.R., D.C.M., N.B.; collected and analyzed the data: N.B., D.C.M.; developed the methodology and approach: N.B., D.C.M.; drafted the manuscript: N.B.; corrected the manuscript: N.B., D.C.M., A.R. All authors read and approved the final manuscript.

Corresponding author

Correspondence to D. C. Mishra.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Budhlakoti, N., Rai, A. & Mishra, D.C. Statistical Approach for Improving Genomic Prediction Accuracy through Efficient Diagnostic Measure of Influential Observation. Sci Rep 10, 8408 (2020). https://doi.org/10.1038/s41598-020-65323-3
