Regularized selection indices for breeding value prediction using hyper-spectral image data

High-throughput phenotyping (HTP) technologies can produce data on thousands of phenotypes per unit being monitored. These data can be used to breed for economically and environmentally relevant traits (e.g., drought tolerance); however, incorporating high-dimensional phenotypes in genetic analyses and in breeding schemes poses important statistical and computational challenges. To address this problem, we developed regularized selection indices; the methodology integrates techniques commonly used in high-dimensional phenotypic regressions (including penalization and rank-reduction approaches) into the selection index (SI) framework. Using extensive data from CIMMYT’s (International Maize and Wheat Improvement Center) wheat breeding program we show that regularized SIs derived from hyper-spectral data offer consistently higher accuracy for grain yield than those achieved by standard SIs, and by vegetation indices commonly used to predict agronomic traits. Regularized SIs offer an effective approach to leverage HTP data that is routinely generated in agriculture; the methodology can also be used to conduct genetic studies using high-dimensional phenotypes that are often collected in humans and model organisms including body images and whole-genome gene expression profiles.

phenotype is high-dimensional, the naïve application of the SI can lead to overfitting and sub-optimal accuracy of indirect selection.
To address this problem, we developed regularized selection indices (including penalized and reduced-rank methods) that are tailored to achieve accurate prediction of genetic values using high-dimensional phenotypes. The proposed methodology integrates into the SI framework methods often used to prevent overfitting in high-dimensional phenotypic regressions 14 . Using extensive multi-environment crop imaging data from CIMMYT's wheat breeding program we show that regularized SIs offer improved accuracy of indirect selection in both optimal and stress environments.

Results
A selection index is a linear combination of p measured phenotypes, x i = (x i1 , …, x ip )′, of the form β = ′ x I i i , where β = (β 1 , …, β p )′ is a vector of regression coefficients whose entries define the weights of each of the measured phenotypes in the SI. In a standard SI the weights are derived by minimizing the expected squared deviation between the genetic merit for the selection goal (g y i , e.g., the genetic merit for grain yield of the i th genotype) and the index, that is: The solution to this optimization problem is (see Methods section): x x y 1 , ) ( , , ) x y i y x y x y , , , i p 1 is a vector containing the genetic covariances between the selection objective (y i ) and each of the measured traits (x i ), and P x is the (population) phenotypic variance-covariance matrix of the measured phenotypes, that is, = = . The theory underlying the derivation of SIs and response to indirect selection is well established 15,16 .
The SI is by construction the best linear predictor (BLP) of the genetic merit for the selection goal; this property holds when G x y , and P x are known. However, when the number of measured phenotypes is large, errors in the estimation of P x and G x y , may lead to overfitting and sub-optimal accuracy of indirect selection.

Regularized selection indices.
Reduced-rank (e.g., principal components methods) and penalized regression 14 are two approaches commonly used to confront overfitting in high-dimensional regression problems. These methodologies were developed for regression problems involving an observable phenotype (y i ). In the SI, the response (g y i ) is unobservable; however, the same principles that are applied in phenotypic reduced-rank and penalized regressions can be integrated into the SI framework.

Reduced-rank selection indices.
In principal components (PC) regression, the response is regressed on a reduced number ( < q p) of PCs extracted from a set of predictors (x i ); the same concept can be used to derive a reduced-rank SI. For instance, one can extract a reduced number of PCs from a crop image and the resulting PCs can be used as 'measured traits' in Eq. (1). The solution of Eq. (1) will render estimates of the regression coefficients for the PCs, which can be transformed back to coefficients applicable to the measured traits (see Methods). Thus, a reduced-rank SI (referred to as PC-SI) can be derived following these steps: (i) extract, using the singular value decomposition, q PCs from the matrix containing the measured phenotypes, (ii) estimate the genetic covariances between the first q PCs and the selection objective, (iii) use these estimated (co)variances to derive coefficients associated with the top q PCs; finally, (iv) transform these coefficients into coefficients for the measured phenotypes. This process can be done using = … q p 1, 2, , PCs ( = q p renders the standard SI). For the sequence of estimates (β β β .ˆ, , , ), one can evaluate the accuracy of indirect selection of the resulting SI and an optimal rank for the PC-SI can be chosen to maximize the accuracy of indirect selection. penalized selection indices. In a penalized regression, regularization is achieved by including in the objective function a penalty on model complexity. In the context of a SI, we have where λ is a penalty parameter (λ = 0 yields the coefficients for the standard SI) and β J( ) is a penalty function. Commonly used penalties include the L2 β β = ∑ = where I is a × p p identity matrix. The RR-PSI (referred to as the L2-PSI) yields shrunken estimates of the regression coefficients.
In many applications, variable selection (i.e., a SI that is a function of a subset of the measured phenotypes) may be desirable. This property can be obtained using penalties involving the L1-norm, either alone, (elastic-net 18 ). Unlike the L2-PSI, the LASSO and elastic-net SIs (hereinafter referred to as L1-PSI and EN-PSI, respectively) do not have a closed-form solution 14 . However, solutions for PSIs involving an L1-penalty can be obtained using iterative procedures such as the coordinate descent 20 and the least angle regression 21 (LARS) algorithms (see Methods). As with the PC-SI, an optimal PSI can be obtained by choosing the values of the regularizing parameters λ α ( , ) that maximize the accuracy of indirect selection.

Accuracy of indirect selection.
Indirect selection accuracy is defined as the correlation between the index used to rank genotypes and the genetic merit of the selection objective, that is, . This parameter is equal to the product of the square root of the heritability of the SI (h I ) times the genetic correlation between the SI and the selection target, cor g g ( , ) I y i i 16 . To avoid estimation bias Acc I ( ) must be estimated using data that was not used to derive the coefficients of the index (Fig. 1); therefore, in the application presented below we: (i) partitioned the data into training and testing sets, (ii) derived the coefficients of the SI in the training set, (iii) applied these coefficients to image data of the testing set ( Regularized selection indices for wheat grain yield using hyper-spectral image data. We applied the methodology described in the previous section to data (n = 3,276) from the CIMMYT's Global Wheat Program consisting of grain yield (ton ha −1 ) and hyper-spectral image data. The data were collected at CIMMYT's experimental station in Ciudad Obregon, Sonora, Mexico (27°20′N, 109°54′W, 38 masl) from 39 yield trials in which a total of 1,092 genotypes were tested. Rainfall in Obregon is very limited; therefore, four different environments were generated representing a combination of planting methods (Flat or Bed), controlled irrigation (minimal, 2 or 5 irrigations), and planting dates (optimum or early-heat). As expected, average yield decreased as drought stress intensity increased (see Table 1 and Supplementary Fig. S1 for boxplots of yield by environment). Image data was collected using an infrared and an hyper-spectral camera and consisted of reflectance of electromagnetic power at 250 wavebands within the visible and near-infrared spectrums (392-850 nm). Separate images were collected at 9 time-points covering vegetative (VEG), grain filling (GF), and maturity (MAT) stages in Eq. (3), using λ = 0 renders a standard SI). For each of the solutions (β λ ( ) (1) , β λ ( ) (2) , …) we estimated the heritability of the resulting index and the genetic correlation between the index and the selection target, and from those estimates we derived the accuracy of indirect selection. The same approach was used to evaluate the accuracy of indirect selection of PC-SIs with a varying number (1, 2, …) of PCs.
We first fitted PSIs and PC-SIs using data from a single time-point; the results from the latest time-point (corresponding to MAT or late GF stages depending on the environment) are presented in Fig ) exhibited similar patterns with some differences between environments. The accuracy of indirect selection of the optimal L1-PSI was always close to that of the optimal PC-SI and that of the optimal L2-PSI (Supplementary Table S1). Importantly, in all cases the accuracy of indirect selection of the optimal regularized SIs was considerably higher than that of the standard SI, which is the one corresponding to 250 active bands or 250 PCs (i.e., the right-most results in the plots in Fig. 2). Figure 3 displays the accuracy of indirect selection across time-points for the optimal (i.e., the one with the highest accuracy of indirect selection) L1-PSI and PC-SI. For comparison we also display in the plot the accuracy of indirect selection achieved by a standard SI (in green). The estimated 95% confidence intervals of the accuracy of the regularized SIs (either PC-SI or L1-PSI) are all above (and do not overlap) with the confidence intervals for the accuracy of the standard SI, except for a single time-point (57 DAS in environment Bed-2IR). Results from Tukey's Honest Significance Difference confirmed that the accuracy of the regularized SIs is statistically different (higher) than the standard SI at a 5% of significance (Supplementary Table S1) for all but one (57 DAS in environment Bed-2IR) time-point/environment. Regularization increased the selection accuracy across time-points and environments. Regularized SIs (either PC-SI or L1-PSI) had an accuracy of indirect selection that was in average 10-40% higher than the accuracy achieved by a standard SI. These gains in accuracy were stronger in the optimal environment (Bed-5IR with a median of 36%) and smaller in the stressed environments (Flat-Drought and Bed-EHeat with a median of 16%). Interestingly, there were no sizable differences between the accuracy of indirect selection achieved with the optimal L1-PSI and that of the optimal PC-SI. Compared with a standard SI, regularized SIs had higher heritability ( Supplementary Fig. S6); this was achieved without compromising the genetic correlation ( Supplementary Fig. S7), thus leading to a higher accuracy of indirect selection achieved by either penalization or rank-reduction strategies.
Using data from multiple time-points further improves selection accuracy. The results presented above were based on data from a single time-point. We also generated selection indices using data from multiple time-points (in this case, x i was a vector containing 2,250 traits, corresponding to 250 wavebands measured at each of 9 time-points). Integrating data from multiple time-points further increased the accuracy of L1-PSI by a margin that ranged from 1 to 8 points on the correlation scale ( Table 2). The gains in selection accuracy obtained using data from multiple time-points were more evident in environments with lower accuracy; similar results were obtained for the PC-SI and L2-PSI (Supplementary Table S1). Figure 4 shows a heatmap for the solutions of the optimal L1-PSI that integrated data from the 9 time-points. Each panel represents an environment, horizontal bands represent time-points. Within each time-point wavebands not entering in the solution are in grey and non-zero coefficients are represented in a yellow-red scale (red indicates large absolute-value coefficients). The well-irrigated environments (Bed-5IR and Bed-EHeat) had considerably sparser indices with only a reduced number of wavebands in the solutions; these were mostly located in the violet, blue and red regions of the spectrum. In stressed environments (Flat-Drought and Bed-2IR) there were also a few wavebands in the green and infrared regions that were active. In all the indices, there were wavebands from several time-points that were active in the optimal solution, suggesting that data from both early and late phenological stages are informative about the genetic merit for grain yield. www.nature.com/scientificreports www.nature.com/scientificreports/ comparison with phenotypic prediction. We compared the accuracy of indirect selection of the PSI and PC-SI with vegetation indices and penalized phenotypic prediction. Vegetation indices are often used to predict yield 22 , biomass, and chlorophyll content 23,24 . We considered two vegetation indices: the Red and Green Normalized Difference Vegetation Indices (RNDVI 25 and GNDVI 26 respectively). For each of these indices we estimated the genetic correlation with grain yield, as well as their heritability and accuracy of indirect selection (Supplementary Table S1). Overall, the accuracy of indirect selection of the GNDVI and RNDVI was lower than the one achieved with a PSI (the average difference in accuracy between RNDVI and the L1-PSI varied by environment from 0.02 to 0.14 points in correlation, Supplementary Table S1, in favor of the L1-PSI). The heritability of the GNDVI and RNDVI was similar and superior in some cases to that of the L1-PSI ( Supplementary Fig. S6); however, the genetic correlation between the vegetation indices and grain yield was (in most time-points and environments) lower than the genetic correlation between the L1-PSI and grain yield ( Supplementary Fig. S7). Thus, the main driver of the difference in accuracy between the L1-PSI and the vegetation indices was the difference in genetic correlation.

L1-penalization leads to sparse selection indices.
We also fitted L1-penalized phenotypic prediction (L1-Phen) and compared the accuracy of indirect selection of these phenotypic prediction methods with that of penalized SIs. Overall, the L1-Phen achieved an accuracy of indirect selection very close to that of the L1-PSI (Supplementary Table S1); however, in a few environments at some time-points, the L1-PSI achieved a higher accuracy of indirect selection than the phenotypic prediction.

Discussion
High-throughput phenotyping has been extensively adopted in agricultural research and commercial production. Extracting interpretable information from HTP data poses important statistical challenges. The clear majority of research in this area has focused on calibrating equations to predict phenotypes (e.g., total biomass, grain yield) using HTP data as inputs. This approach is well-suited for phenotypic prediction; however, the same approach can be sub-optimal for selection because the best predictor of a phenotype is not always the best predictor of the genetic merit of the same trait.
The best (linear) phenotypic predictor is the sum of the best linear predictor of the genetic merit (g y ) plus the best linear predictor of the environmental term (ε y ), that is, , is the SI and it is, by construction, maximally correlated with the genetic merit. The second term, ε |  x ( ) y , is relevant for phenotypic prediction but represents noise when the problem is that of selecting the best genotypes.
Selection indices exploit genetic covariances, while phenotypic prediction relies on phenotypic covariances between the selection target and the measured phenotype (e.g., crop imaging). Thus, the two methods yield different results whenever the patterns of phenotypic correlations are sufficiently different from the patterns of genetic correlations. In our data set, environmental conditions were highly controlled, with relatively low un-controlled within-trial variability in environmental conditions. Consequently, the patterns of phenotypic and genetic correlations were very similar (see Supplementary Fig. S8). This was true for many time-points and environments but not in others (e.g., 80, 85 and 93 DAS in Flat-Drought, and 90 and 98 DAS in Bed-2IR); it was exactly in those www.nature.com/scientificreports www.nature.com/scientificreports/ time-points and environments that the L1-PSI achieved higher accuracy of indirect selection than the L1-Phen method (Supplementary Table S1).
A standard SI (Eq. (1)) is, by construction, maximally correlated with the genetic merit of the selection objective. This optimality property holds when the genetic and phenotypic (co)variance matrices that are needed to derive the coefficients of the SI (see Eq. (2)) are known without error. However, when the measured phenotype is high-dimensional, estimation errors in the phenotypic (co)variance matrix (P x ), as well as in the genetic covariances (G x y , ), can make the standard SI sub-optimal. Our empirical results confirm this: standard SIs over-fitted the data; this leads to a SI with low heritability and low accuracy of indirect selection.
To prevent overfitting, we considered integrating ideas commonly used in high-dimensional regression into the SI methodology. Our empirical results show that regularization consistently improves the accuracy of indirect selection relative to standard SIs. We verified this for various environmental conditions and for crop imaging data collected at 9 different time-points. The optimal PSI and the optimal PC-SI achieved almost the same accuracy www.nature.com/scientificreports www.nature.com/scientificreports/ of indirect selection for all the environments and time-points, suggesting that either type of regularization can be effective.
Reduced-rank selection indices are appealing because after dimension reduction the problem of deriving a SI is trivial and can be dealt with methods commonly used to derive standard SIs. Moreover, after HTP has been reduced to a few derived-traits (say the top 10 PCs), these traits can be integrated into genetic evaluations (either pedigree-based 27 or genomic-enabled 28 ) using standard multi-trait models.
Principal components-based methods have been considered before in the analysis of Fourier-transformed infrared (FTIR) spectra derived from milk samples. For instance, Soyeurt, Misztal & Gengler 29 used a reduced number of FTIR-derived PCs to estimate variance components for selection objectives (e.g., fat or protein content in milk). Building upon this idea, Dagnachew, Meuwissen & Ådnøy 30 suggested predicting the genetic merit for milk fatty acids using FTIR-derived PCs as 'traits' in a genetic evaluation. However, when mapping from genetic predictions of PC-lodgings onto genetic predictions for the selection objective the authors used coefficients derived from a phenotypic (partial least squares) regression. This does not guarantee that the resulting index is maximally correlated with the genetic merit of the selection target. The penalized and PC-SI presented in this study address that problem by using coefficients that are derived using genetic (and not phenotypic) covariances.
A disadvantage of the PC-SI is that the methodology does not naturally provide variable selection, a feature that may be desirable when the measured phenotype is high-dimensional.
Penalized selection indices can perform variable selection based on genetic covariances. While the derivation of a PSI is a bit more challenging than that of the PC-SI, the computational burden involved in the derivation of a PSI is not extremely high.
integration of pSi and pc-Si into genetic evaluations. The SIs considered here predict genetic merit for a selection target from a set of traits measured on an individual ( β = ′ x I i i ); such indices exploit borrowing of information between traits within an individual. Borrowing of information between individuals increases selection accuracy; we envision two ways in which regularized SIs can be integrated into pedigree or genomic-based genetic evaluations.
One possibility is to use a two-steps approach whereas in the first step a PSI or a PC-SI is used to predict the genetic merit using within-individual information. This step can be considered as a task where patterns attributable to genetic covariances are extracted and those attributable to environmental covariances are smoothed-out. Then, in a second step, the resulting index-data … I I { , , } n 1 could be used as a trait in a genetic evaluation. Our study shows that the use of a regularized SI leads to a derived-phenotype that has higher genetic accuracy than standard SIs, and that of best phenotypic prediction. In principle, using a more accurate phenotype should lead to a higher accuracy of the predicted breeding values in the second step. However, further studies are needed to determine whether the gains in accuracy at the level of the HTP-derived phenotype will fully translate into a higher accuracy of the predicted breeding values in a two-steps procedure. A one-step approach is conceptually feasible and statistically more efficient as it offers the possibility of considering correlations between traits, relationships between genotypes, and the effects of non-genetic factors jointly; however, the implementation of the one-step approach using high-dimensional phenotypes can be computationally challenging. To implement a one-step approach, the optimization problem of Eq. (3) can be modified by replacing x i , the vector with the measured phenotypes on the i th individual, with a vector x x x x ( , , , ) that contains all the available HTP data (measured on all n individuals); after expanding the squared error loss and taking expectations we get where G gx is a × pn 1 vector of genetic covariances including between-traits-within-individual (co)variances and between-subjects covariances. In standard genetic models, G gx takes a Kronecker form xy , , where A i are genetic (either DNA-or pedigree-derived) relationships between the candidate for selection and each of the individuals in the training set, and G x y , is, as before, a vector of genetic covariances between the selection objective and the measured traits (X). Likewise, P x is a × pn pn phenotypic (co)variance matrix. Estimating P x would require estimating all the genetic and environmental covariances among the measured traits.
impact of the use of high-throughput phenotypes in breeding programs. According to breeders' equation 16,31 , the rate of genetic gain from selection is directly proportional to selection accuracy and selection intensity. Thus, relative to the use of standard SIs, the use of regularized SIs is expected to increase selection gains by a factor equal to the gains observed in accuracy, that is between 10% and 40%. Relative to mass phenotypic selection, the PSIs had efficiencies, RE, ranging from 60% to 90%; therefore, relative to direct phenotypic selection, selection decisions based on PSI derived from images are expected to yield lower genetic gains than the ones that could be achieved via direct mass selection. However, the use of HTP technologies (e.g., crop monitoring using hyper-spectral cameras mounted in drones) may enable the expansion of the number of genotypes tested/ measured as well as the number of locations where those genotypes are tested. This could lead to an increase in selection intensity which will in turn increase selection gains. For instance, if the use of HTP enables doubling the number of genotypes tested, the increase in selection gains that could be achieved with HTP may range from 20% (in the case where the PSI has RE of 60%) to 80% (for the traits/environments with RE of 90%).
The discussion in the preceding paragraph is entirely based on breeders' equation, which does not contemplate the long-term impacts of selection in genetic diversity. A more accurate and more intensive selection may affect diversity and long-term response to selection. To address this problem, attention to diversity will be needed with regularized SIs as with any other selection criteria.

Regularized selection indices can also be a valuable tool in genetic research. High-dimensional
phenotypes are also becoming increasingly available in genetic studies involving human subjects and model organisms. Performing genetic studies (e.g., genome-wide association analyses) on high-dimensional phenotypes is challenging and the burden of multiple testing across hundreds or thousands of phenotypes (e.g., RNA-abundance across thousands of genes) may critically compromise power. The PSI and PC-SI presented in this study could be used to extract genetic patterns from high-dimensional phenotype data such as brain imaging or whole-genome gene expression profiles and these patterns can then be used as traits in genetic studies.

conclusion
We proposed two novel methods for predicting the genetic merit for selection objectives from high-dimensional phenotypes. These phenotypes are becoming increasingly available as the adoption of HTP in crop and animal production increases. The proposed methods integrate regularization procedures commonly used in high-dimensional regressions into the SI methodology. Regularization prevents overfitting and increases the accuracy of the index. The methods proposed here can be used to extract genetic patterns from almost any kind of high-dimensional phenotype, including not only HTP data emerging in agriculture but also high-dimensional phenotypes that emerge in genetic studies involving human subjects and model organisms.

Methods
Standard selection index. The weights on a SI are derived as the solution to the optimization problem of Eq. (1): The right-hand side can be expressed as The first term, , does not involve β; therefore, it can be dropped from the objective function. Furthermore, if x i has null mean, and assuming that the environmental effects on x i are orthogonal to g y i , then www.nature.com/scientificreports www.nature.com/scientificreports/ Differentiating the right-hand side with respect to vector β and setting the derivatives equal to zero leads to the first order conditions: β = P G x xy , ; therefore,

Reduced-rank selection index.
Recall that the singular value decomposition of a real-valued matrix, , ≤ q p) as 'measured phenotypes' in the SI: The second and third right-hand side terms can be combined to obtain: where I is a × p p identity matrix. Differentiating with respect to β and setting the derivatives equal to zero, we obtain the first-order conditions: ; therefore:

EN-PSI.
The coefficients for the elastic-net family are obtained by considering an objective function as in Eq.
The L1-PSI and L2-PSI are particular cases corresponding to α = 1 and α = 0, respectively. When α = 0 the solution has a closed-form (see L2-PSI above). If α > 0, no closed-form solution exists; however, a solution can be obtained using the same iterative algorithms that are used to solve elastic-net regressions (e.g., LARS and coordinate descent 14 ). These algorithms can be implemented either by 'partial residuals' or using 'covariance updates' 32 . In our case, the objective function is entirely based on (co)variance terms. The objects P x and G x y , enter in the objective function of the PSI in the same way that ′ X X and ′ X y enter in a standard elastic-net regression. Therefore, to obtain solutions, we implemented the standard LARS algorithm (e.g., Hastie et al. 14 ) entirely based on covariance updates.
Data. The data set consists of 1,092 inbred wheat lines grouped into 39 trials and grown during the 2013-2014 season at the Norman Borlaug experimental research station in Ciudad Obregon, Sonora, Mexico. Each trial consisted of 28 breeding lines that were arranged in an alpha-lattice design with three replicates and six sub-blocks. The trials were grown in four different environments: Flat-Drought (sowing in flat with irrigation of 180 mm through drip system), Bed-2IR (sowing in bed with 2 irrigations approximately 250 mm), Bed-EHeat (bed sowing 30 days before optimal planting date with 5 normal irrigations approximately 500 mm), and Bed-5IR (bed sowing with 5 normal irrigations). In 2013, all the trials were planted by mid-November (optimal planting date), on the 21 st (Bed-2IR and Bed-5IR) and on the 26 th for Flat-Drought. Trials for Bed-EHeat were planted on October 30 th . Grain yield (ton ha −1 , total plot yield after maturity) was recorded.
Reflectance data were collected from the fields using both infrared (A600 series Infrared camera, FLIR, Wilsonville, OR) and hyper-spectral (A-series, Micro-Hyperspec, VNIR Headwall Photonics, Fitchburg, MA) cameras mounted on a Piper PA-16 Clipper aircraft on 9 different dates (time-points) between January 10 th and March 27 th , 2014. During each flight, data from = p 250 wavebands ranging from 392 to 850 nm were collected for each pixel in the pictures. Using ArcMap software (ESRI, CA), the average reflectance of all the pixels within each geo-referenced trial plot was calculated and reported as a single data-point for each genotype for each band. Days to heading were recorded as the number of days from the date of sowing/first irrigation until 50% of spike emergence in each plot. Heading of about 50-80% of the total number of plots was used as criterion to distinguish between vegetative (VEG) and grain filling (GF) stages. The crop was considered to be at maturity (MAT) stage when the average RNDVI decreased to ~0.4. phenotype pre-processing. Within each environment, grain yield phenotypic records were pre-adjusted by fitting the following mixed model, where y jklm is the grain yield phenotype value for the j th genotype, k th trial, l th replicate (within trial), m th sub-block (within trial and replicate), µ is the overall mean and g j , t k , r l k ( ) , and b m kl ( ) are the genotype, trial, replicate, and sub-block effects, respectively (all assumed to be random) and ε jklm is an error term. Random effects were assumed to be independently and identically distributed (iid) normal with null mean and effect-specific variances. Likewise, the error terms were assumed to be iid with null mean and common error variance.
Grain yield data were pre-adjusted by subtracting from the phenotypic record (y jklm ) the mean (μ) plus BLUPs of trial, replicate, and sub-block effects; this is Reflectance data was pre-adjusted by fitting the above model, using reflectance at individual bands as phenotype expanded with the inclusion of a time-point effect. Separate models were fitted to each of the wavebands. As with grain yield, reflectance data were pre-adjusted by subtracting from the measured reflectance the estimated mean and predicted time-point, trial, replicate, and sub-block effects.
For quality control, pre-adjusted grain yield and reflectance phenotypes were removed for those grain yield scores lying beyond 3 times the inter-quantile region from the 0.25 and 0.75 quantiles.
After pre-adjusting, all phenotypes were standardized (to have unit variance); for ease of exposition, hereinafter we refer to the adjusted-scaled phenotypes (including grain yield and the image data) simply as phenotypes.
Heritability estimation. After pre-adjusting standardization, we analyzed phenotypes using a mixed model of the form where y ij is the phenotype for the i th observation (i here is a single index for indices k, l, and m in Eq. (4)) of the j th genotype; the genetic values are training-testing partitions. The data set contains information from 39 trials with 84 observations each. To assess the accuracy of indirect selection, we randomly assigned complete trials to testing sets. The training set comprised all the data from the trials not assigned to the testing set. This approach guarantees that no data from a single trial is present in both training and testing sets. This approach aims at representing a situation where one www.nature.com/scientificreports www.nature.com/scientificreports/ has calibrated the coefficients of the index using historical trials and apply these coefficients to image data of future trials. A similar validation scheme has been used (using herd-year-season groups instead of trials) in validation problems in previous studies involving milk spectra data 33 . In each training-testing partition, out of the 29 trials available, 26 trials ( ≈ n 2,184 trn observations) were randomly assigned to the training set, and the data from the remaining 13 trials ( ≈ n 1,092 tst ) was used for testing set. The regression coefficients of the indices (the β's for the standard SI, PSI, and PC-SI) were calculated using grain yield and reflectance data of the training set. Estimates of the coefficients and reflectance data were then used to calculate the SI, β = ′ x I ij ij^, for each observation i in the testing set ( = … i n 1, , tst ). The heritability of the index and the genetic correlation between the index and the selection goal were estimated in the testing set.
The training-testing procedure described above was repeated 100 times by randomly assigning trials to training and testing sets. From these analyses, we reported the mean of heritability, genetic correlation, and selection accuracy; and their standard deviation across training-testing partitions. estimation of phenotypic and genetic parameters. The population phenotypic (co)variance matrix P x was estimated within the training set using the unbiased sample (co)variance matrix given by P is the matrix containing all measured traits in the training set.
The genetic covariance G ( ) x y , j between grain yield and the j th measured trait ( = … j p 1, , ) was estimated using a sequence of univariate genetic models as in Eq. (5). We fitted that model with grain yield phenotypes as response, then for each of the reflectance bands and then for the sum of grain yield and each of the bands. The genetic covariances between the bands and grain yield were then estimated using estimation of the accuracy of indirect selection. To assess the accuracy of indirect selection we applied the regression coefficients derived in the training set to image data from the testing set to derive x I ij ij β = ′ . Then, using a mixed model analysis like that described in the previous section we estimated the heritability of the SI (h I 2 ), the heritability of grain yield (h y 2 ), and the genetic correlation between the SI and grain yield (cor g g ( , ) I y i i Software. All the aforementioned analyses were implemented in the R software environment 34 , version 3.5.1.
Linear mixed models were implemented using the 'lmer' function from the LME4 35 R-package. The software that implements the LARS and coordinate descent algorithms are available through the SFSI R-package (https:// github.com/MarcooLopez/SFSI).

Data availability
The data used in this study are publicly available by CIMMYT (https://www.cimmyt.org/) who owns all rights in the data. The data is also included in the SFSI R-package. The R-scripts needed to perform the analyses presented in this study can be found in the documentation of the SFSI R-package.