Statistical Approach for Improving Genomic Prediction Accuracy through Efficient Diagnostic Measure of Influential Observation

It is expected the predictive performance of genomic prediction methods may be adversely affected in the presence of outliers. In agriculture science an outlier may arise due to wrong data imputation, outlying response, and in a series of trials over the time or location. Although several statistical procedures are already there in literature for identification of outlier but identification of true outlier is still a challenge especially in case of high dimensional genomic data. Here we have proposed an efficient approach for detecting outlier in high dimensional genomic data, our approach is p-value based combination methods to produce single p-value for detecting the outliers. Robustness of our approach has been tested using simulated data through the evaluation measures like precision, recall etc. It has been observed that significant improvement in the performance of genomic prediction has been obtained by detecting the outliers and handling them accordingly through our proposed approach using real data.

are greater than no of individuals (n) creates a problem termed as large p small n problem (p > n). This is a very common phenomena in genomics and molecular biology research now a days. In such cases, penalized regression based approach such as Least Absolute Shrinkage and Selection Operator (LASSO) could be a preferable choice as it takes care of n ≪ p problem by shrinking the estimates of some less significant markers and dropping others from the model. Increased use of LASSO has been motivated by plenty of high dimensional biological data. But it becomes very crucial when some influential observations are present in high dimensional genomic data as each observation has tremendous effect on model selection and interpretation. So it is quite imperative to examine effect of influential observation before implementing the LASSO regression. Hence new measure for detection of influential observation in high dimensional genomic data is a need of hour for improving GEBV's.
Rajaratnam et al. 17 recently developed approach for outlier detection for high dimensional data by considering the LASSO regression technique. In their approach they have proposed four measures i.e. df-model, df-lambda, df-regpath and df-cvpath for detection of influential observations influenced by different aspect of LASSO regression directly or indirectly. However, the results coming from these measures are not consistent i.e. different influential points are detected from these measures. In order to produce more concrete and consistent results, a meta-analysis based approach can be applied where an improved measure of outlier detection can be developed based on integration of these measures using p-values [18][19][20] .
In this study, an improved measure for detection of influential observation has been developed using above mentioned approach. Performance of the developed measure has been empirically evaluated and it was observed that the outliers detected from this measure are more accurate. This developed method has been implemented in the case of genomic selection data (real and simulated) and results shows that there is remarkable improvement in the prediction accuracy of GEBVs.

Material and Methods
LASSO was first time introduced by Tibshirani 21 . LASSO minimizes the sum of squares of residuals subject to a constraint on sum of absolute values of regression coefficients. It is different from usual regression as it adds some additional penalty to usual regression estimator. So it diminishes the effect of less important βs (i.e. marker effect) and reduces least important βs as zero.
Statistical formulation of LASSO estimates can be defined as: is an element of the incidence matrix corresponding to individual i and marker j, β j is the marker effect for marker j. It has been assumed that response variable has zero mean. The constraints β ∑ ≤ = t j p j 1 shrinks effects of variables and sets some of them to zero.
We can also write the LASSO problem in the equivalent Lagrangian form: is l 1 norm penalty on β which results in sparsity of solution and λ is a regularization parameter. Computing the LASSO solution is quadratic programming problem which can be obtained through efficient algorithm like Least Angle Regression (LARS) 22 . Other important question to be addressed in this case is calculation of upper limit of sum of absolute value of predictor variable, for this cross validation approach can be used 23 .
Here we have used a recently proposed approach for detection of influential observation based on LASSO technique 17 . They proposed four different measure i.e. df-model-it measure the change in model selected; df-lambda: it measure the change in λ, where λ is a regularization parameter in LASSO regression path, df-regpath: it measure the changes observed in LASSO regularization path and df-cvpath which observe changes in LASSO cross-validation path. These measures detects outlier from high dimensional genomic data based on LASSO regression. It can be observed that all these measures i.e. df-model, df-lambda, df-regpath and df-cvpath detects influential observations which affects model directly or indirectly, has difference in their results regarding detection of influential observation, it means that there is lack of concordance among them. In order to overcome this limitation, we have proposed a more robust measure for detection of influential observation by integrating above discussed measure using p-values based meta-analysis approach.
Approach of proposed measure. In order to develop a robust statistics for detection of influential measure, we have used p-value based meta-analysis approach. In this approach, we have combined the above mentioned four measures on the basis of their p-values. We used various methods for combining these p-values and explored the performance of each method. The brief description of this approach has been as follows. Let (1) or absence (0). This data set includes 599 lines observed for trait grain yield (GY) for four mega environments. However for our convenience we have just considered GY for first mega environment. The final number of DArT markers after edition was 1279 hence same has been used in this study. Same has been also used in genomic prediction study 24,25 . Dataset 2: Maize. Maize dataset is generated by CIMMYT's Global Maize Program 24 . It originally include 300 maize line with 1148 SNP markers. For marker with highest frequency is coded as 0 and lowest frequency as 1. Here trait under study is also GY, evaluated under draught and watered conditions. The average minor allele frequency in these data sets was 0.20. After some editing 264 maize lines with 1135 SNPs markers were available for final study 24 . Simulation. For illustration simulated data were generated using QTL Bayesian interval mapping ("qtlbim") 27 , a R based (R Development Core Team 2019) package. R is available at http://www.r-project.org and qtlbim package can be loaded from R library. This package has been used in various studies for simulation of data related to genomic selection [28][29][30] . The qtlbim package uses Cockerham's model as the underlying genetic model. We have simulated a total of three data sets for genotypic and phenotype information. Here we have created range of diversified genetic architecture i.e. with very low heritability 0.10 to medium 0.5 and high heritability 0.7. Accordingly, we have simulated data at these particular heritability levels. For each stage we have simulated data for 1000 SNPs for 200 individuals. Simulated data have 10 chromosomes with 100 SNPs in each with specified length. Total 1000 markers are distributed over the all 10 chromosomes in such a way that each marker is equi-spaced over the chromosome. We have simulated normally distributed phenotype, with further no genotype or phenotype information missing. In order to check the sensitivity of all methods to detect true outlier, we have replaced 5% of observation and made them outlier (i.e. beyond mean ± 3*SD). Overview of whole workflow of the current study presented in Fig. 1.
Evaluation measure. As an evaluation measure, prediction accuracy and prediction error were used. Prediction accuracy can be defined as Pearson correlation coefficient (r) between observed phenotypic value and predicted phenotypic value.
If β =Ŷ X , where Ŷ is estimated response and βˆ is estimated value of β, then correlation coefficient (r) can be expressed in following form:

Transformed Variable
Dist. under H 0 Reference , denotes the covariance between observed and predicted phenotypic value, S Y is standard deviation of observed phenotype and Ŝ Y denotes standard deviation of predicted phenotype. Prediction Error (PE) can be simply defined as mean sum of square error (MSE) between observed phenotypic value and predicted phenotypic value. Same can be expressed using following formula (Eq. 3).∑ Where Y i is observed response; Ŷ i is predicted phenotype value. It can be understood that n is the total no. of individual's i.e. = + n n n train t est , here n train denotes no of individuals in the training set and n test is no. of individuals in test set.
In order to assess performance of methods to identify true outlier (observation with added noise) and non-outlier (observation without any noise), we have used precision (i.e. proportion of True Positive (TP) to total positives (i.e. sum total of true positive (TP) and False Negative (FN), Eq. 5), recall (i.e. proportion of TP to TP  www.nature.com/scientificreports www.nature.com/scientificreports/ and False Negative (FN), Eq. 6) and F1 score (i.e. harmonic mean of precision and recall, Eq. 7). All these can be computed using the following expressions: Precision Recall (7) To calculate the overall performance of different methods, Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) has been used. It is a multi-criteria based decision making method given by Hwang and Yoon 31 . It is based on the impression that the selected alternative should have the shortest geometric distance from the positive ideal solution (PIS) and the longest geometric distance from the negative ideal solution (NIS) 32 . TOPSIS compares set of alternatives by giving some weights to each criteria followed by normalization of each single criteria and calculates the geometric distance between each alternative and ideal alternative. TOPSIS is based on the assumption of that criteria are monotonically increasing or decreasing. Here final rank has been calculated using R package 'topsis' motivated from TOPSIS method.

Results and Discussion
Performance of our proposed method that how well it distinguishes true outlier from non-outlier, various measures like precision, recall and F1 score has been calculated for different datasets (generated at different heritability h 2 = 0.1, 0.5 and 0.7) and presented in Table 2. It suggest that our proposed approach i.e. based on combining p-value outperformed in almost every scenario.
Computational efficiency. The time required to compute our proposed measures is calculated using an Intel(R) Core(TM) i7-5500U CPU@2.40 GHz processor on a dataset with varying dimension (i.e. no. of individuals (500, 750 & 1000) and markers (2000, 5000 & 10000) with all possible combination). Results of same is presented below in Table 3.
In order to understand the effect of outlier on the genomic prediction accuracy, we have studied their effects on real dataset. First of all we have fitted LASSO regression with original experimental data say it as LASSO*. Then using the approach given by Rajaratnam et al. 17 , we have calculated p-values for all the four measures i.e. df-model, df-lambda, df-regpath, df-cvpath followed by combining these p-values into single value for each observation/genotype. Using the same we have identified the outlier in the response. The outlier and their corresponding marker genotype were dropped from the model and again LASSO is refitted using the modified data. In order to check robustness of our proposed approach, we have also fitted some of most commonly used methods for genomic selection i.e. Ridge Regression, Best Linear Unbiased Prediction (BLUP), Genomic-BLUP (GBLUP) and Bayesian methods. BLUP i.e. Best Linear Unbiased Prediction introduced by Henderson 33 is used in a linear mixed model for prediction of random effects. GBLUP is an improved version of BLUP where additive genomic relationship matrix (G) is used as a variance-covariance matrix of random effect in the model 34 . For performance evaluations of methods under study, cross validation techniques is used. Data is divided into two parts i.e. training and testing sets such that training set comprises of 70% data and testing set of 30%. Former one is used for model building and later one for model evaluation. The performance of methods was evaluated by calculating prediction accuracy and prediction error. Whole procedures is repeated 100 times and prediction accuracy and prediction error is averaged and their respective standard error is calculated. Results of the same has been discussed below. Here Tables 4-9 reports the average prediction accuracy and prediction error (i.e. MSE) with their sampling variability (SE i.e. standard error) of the methods under study for dataset 1-6. In order to calculate gain in prediction accuracy all the fitted model were compared to baseline model i.e. LASSO and percentage change in prediction accuracy is calculated. In same way percentage reduction in MSE is also calculated.  www.nature.com/scientificreports www.nature.com/scientificreports/ Here all the analysis has been carried out using R (R Development Core Team 2019). LASSO model is fitted using R package glmnet 35 , other methods like BLUP, GBLUP are fitted using rrBLUP package 36 with mixed.solve and kin.blup function respectively. Ridge regression is fitted using Gustavo de los Campos R code, fitting this require heritability of underlying trait. For better description, heritability for each traits under study is provided in the supplementary material (Table S1). Regression with t-error fitted using R package "hett" (using tlm function) 37 . Degree of freedom is estimated for different dataset used in study by using the tlm function with option (estDof = TRUE), available in R package "hett" and then t-regression is fitted.
In this Table 4 and others (Tables 5-9) LASSO* represents LASSO regression fitted in original data (i.e. without any treatment to possible outlier), next four methods in the table represent performance of LASSO in the absence of outlier (i.e. possible outlier and corresponding genotype marker genotype dropped from the model detected using LASSO diagnostic) whereas next four methods in the table represent performance of LASSO in the absence of outlier (i.e. possible outlier and their corresponding marker genotype to be dropped from original data detected by our various p-value based meta-analysis approach). Last four methods shows the performance of other methods on our proposed approach.
In order to assess gain in the prediction accuracy for different datasets under study, It could be observed that there is significant amount of gain in prediction accuracy (Tables 4-9) as compare to their counterparts (41% increase in case of dataset 1, 69% for dataset2, 31% for dataset 3, 57% for dataset 4, 36% for dataset 5 and 27% for dataset 6). In case of Prediction error it can observed from results (Tables 4-9) that MSE for our proposed approach has been significantly reduced (i.e. 37% for dataset 1, 28% for dataset 2, 46% for dataset 3, 57% for dataset 4, 40% for dataset 5 and 43% for dataset 6). It shows clear advantage of our integrated approach (i.e. p-value based meta-analysis method) over the existing approach. In order to see that gain in terms of predictions performance is not only restricted to LASSO, we have also investigated the performance of integrated approach by   www.nature.com/scientificreports www.nature.com/scientificreports/ using most commonly used GS models (RR, GBLUP etc.). It can be marked with confidence that gain in terms of prediction performance has been maintained to other methods also (Tables 4-9).
In Fig. 2, each graph (Fig. 2a-f) contains the ten box plot for prediction accuracy for dataset 1-6 respectively. In each figure first box plot shows the prediction accuracy by fitting simple LASSO regression, next four box plot shows the prediction accuracy calculated following the approach of Rajaratnam et al. 17 and next method (Inverse Chi) represent performance of LASSO in the absence of outlier (i.e. possible outlier and their corresponding marker genotype to be dropped from original data detected by p-value based meta-analysis approach i.e. Inverse Chi). Last four methods shows the performance of other GS methods on our proposed approach. These Box plots shows the distribution of prediction accuracy with their SE, estimated over 100 replications.  Table 6. Mean and standard error of prediction accuracy and prediction error for various methods using dataset 3. www.nature.com/scientificreports www.nature.com/scientificreports/ In Fig. 3, each graph (Fig. 3a-f) represents ten box plot for prediction error for dataset 1-6 on the same pattern of boxplot to Fig. 2. These boxplots represents the distribution of the MSE values over 100 runs. These plots (Figs. 2 and 3) show a clear cut advantage of our proposed approach over the LASSO diagnostic given by Rajaratnam et al. 17 , in improving genomic prediction accuracy and other existing approach. In almost every scenario i.e. wheat and maize dataset (dataset 1-6), prediction accuracy has been improved and prediction error get minimized. Clear distinctions of estimated accuracy and prediction error shows the importance of outlier detection for estimating more accurate GEBVs leads to enhanced prediction accuracy. It can be summarized from the Tables 4-9 that among p-value combination methods Inverse Chi, logit and sumz performed equally although advantage goes to Inverse Chi and sumz over logit and meanp method.   Table 7. Mean and standard error of prediction accuracy and prediction error for various methods using dataset 4.
Ranking of the various methods used for performance evaluation has been done using multi criteria based decision method called TOPSIS. Result of same has been given in Tables S2 and S3 (Supplementary Information). It can be concluded from Tables S2 and S3 that our integrated approach (based on p-value meta-analysis) using Inverse Chi method ranked first among other p-value based meta-analysis methods (i.e. logit, meanp and sumz) for both in case of dataset 1 and dataset 2 and same pattern has been observed for other datasets also using TOPSIS methods based on multi criteria.
Mean shift as substitute for deletion. Instead for deleting the observation flagged as outlier here we have substituted the outlier with the mean shift of data using mean shift outlier model (MSOM) 38 . Here one or more observation is assumed to be introduced from a shifted location as compare to remaining observation. This method can be important for robust modelling where we identify the observation flagged as outlier with separate mean shift effect instead of dropping them from model. Earlier we have fitted the model to real and simulated data and for each observation outliers are identified (p-value < 0.05) based on p-value combination approach.    Table 8. Mean and standard error of prediction accuracy and prediction error for various methods using dataset 5.